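Deploy the Ray service in the EMR on VKE product.
To run RayJob jobs with kubectl commands, install the kubectl tool (see Install and Set Up kubectl) and configure the connection information for the Volcengine VKE cluster (see Connect to a VKE Cluster).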
After the Ray service is created in Volcengine EMR on VKE, you can submit a RayJob with the following steps:
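Write the RayJob configuration file: first, create a configuration file in YAML format that defines the RayJob's resource requirements, the Ray cluster configuration, and the job entrypoint. You can start from the RayJob YAML template provided below.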
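Apply the RayJob configuration: use the kubectl command-line tool to apply the RayJob configuration file to the Volcengine VKE cluster. Open a terminal or command-line interface and run the following command: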
kubectl apply -f <rayjob-config-file>
Check the RayJob status:
kubectl get rayjob
# Or get more detailed status information:
kubectl describe rayjob <rayjob-name>
View the logs of a Pod with kubectl:
kubectl logs <rayjob-pod-name>
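If Ingress is enabled in the RayJob configuration, find the corresponding cluster in the EMR console and open the Ray Dashboard UI through the quick link to view job execution information.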
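If the RayJob is not configured to clean up automatically (shutdownAfterJobFinishes: true), you need to delete the RayJob and the related Kubernetes resources manually. Use either of the following commands to delete the RayJob and all resources associated with it:
kubectl delete -f <rayjob-config-file>
# Or use the following command
kubectl delete rayjob <rayjob-name>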
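The kubectl steps above (submit, check status, view logs, delete) can also be scripted. The following is a minimal sketch, not part of the product tooling: `rayjob_commands` and `run_step` are hypothetical helper names, and the file name `rayjob.yaml` and Pod name `rayjob-sample-pod` are placeholders chosen for illustration. By default it only prints the commands (dry run); it assumes kubectl is installed and connected to the VKE cluster when `dry_run=False`.

```python
import shlex
import subprocess
from typing import Dict, List


def rayjob_commands(manifest: str, job_name: str, pod_name: str) -> Dict[str, List[str]]:
    """Build the kubectl argument lists for each step described above."""
    return {
        "submit": ["kubectl", "apply", "-f", manifest],
        "status": ["kubectl", "get", "rayjob"],
        "describe": ["kubectl", "describe", "rayjob", job_name],
        "logs": ["kubectl", "logs", pod_name],
        "delete": ["kubectl", "delete", "rayjob", job_name],
    }


def run_step(commands: Dict[str, List[str]], step: str, dry_run: bool = True) -> str:
    """Return the command line (dry run) or execute it and return its stdout."""
    cmd = commands[step]
    if dry_run:
        return shlex.join(cmd)
    result = subprocess.run(cmd, check=True, capture_output=True, text=True)
    return result.stdout


# Placeholder names for illustration only; replace them with your own.
commands = rayjob_commands("rayjob.yaml", "rayjob-sample", "rayjob-sample-pod")
for step in ("submit", "status", "describe", "logs", "delete"):
    print(run_step(commands, step))
```

Passing the arguments as a list rather than a single shell string avoids quoting problems when file or resource names contain special characters.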
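In addition, if your RayJob needs to interact with external storage or services, make sure the corresponding storage classes, PVCs, Secrets, and so on are configured in the Volcengine VKE cluster in advance.
The following RayJob YAML template can be adapted to different scenarios by adding, removing, or modifying fields.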
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: <pvc-name>
spec:
  storageClassName: ebs-essd # Uses a cloud disk; Volcengine also supports NAS, CloudFS, and TOS
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 40Gi # Configure according to actual needs
---
apiVersion: ray.io/v1
kind: RayJob
metadata:
  name: rayjob-sample
  annotations:
    # This annotation is required; otherwise accessing the dashboard returns 404
    nginx.ingress.kubernetes.io/rewrite-target: /$1
spec:
  entrypoint: python ***.py
  # shutdownAfterJobFinishes specifies whether the RayCluster should be deleted after the RayJob finishes. Default is false.
  shutdownAfterJobFinishes: true
  # ttlSecondsAfterFinished specifies the number of seconds after which the RayCluster will be deleted after the RayJob finishes.
  ttlSecondsAfterFinished: 300
  # rayClusterSpec specifies the RayCluster instance to be created by the RayJob controller.
  rayClusterSpec:
    # Enable ingress
    enableIngress: true
    rayVersion: '2.9.3' # should match the Ray version in the image of the containers
    # If enableInTreeAutoscaling is true, the autoscaler sidecar will be added to the Ray head pod.
    enableInTreeAutoscaling: true
    autoscalerOptions:
      # upscalingMode is "Default" or "Aggressive" or "Conservative"
      # Conservative: Upscaling is rate-limited; the number of pending worker pods is at most the size of the Ray cluster.
      # Default: Upscaling is not rate-limited.
      # Aggressive: An alias for Default; upscaling is not rate-limited.
      upscalingMode: Conservative
      # idleTimeoutSeconds is the number of seconds to wait before scaling down a worker pod which is not using Ray resources.
      idleTimeoutSeconds: 60
      # imagePullPolicy optionally overrides the autoscaler container's default image pull policy (IfNotPresent).
      imagePullPolicy: IfNotPresent
      # Optionally specify the autoscaler container's securityContext.
      securityContext: {}
      env: []
      envFrom: []
    headGroupSpec:
      # Enable ingress
      enableIngress: true
      rayStartParams:
        dashboard-host: '0.0.0.0'
      template:
        spec:
          initContainers:
            - name: git-clone
              image: alpine/git
              command:
                - /bin/sh
                - -c
                - |
                  git clone <code-repository-url> /app
              volumeMounts:
                - name: esb
                  mountPath: /app
          containers:
            - name: ray-head
              image: <image-address>
              ports:
                - containerPort: 6379
                  name: gcs-server
                - containerPort: 8265 # Ray dashboard
                  name: dashboard
                - containerPort: 10001
                  name: client
              resources:
                limits:
                  cpu: "1"
                requests:
                  cpu: "1"
              volumeMounts:
                - name: esb
                  mountPath: /app
          volumes:
            - name: esb
              # The code can be stored on a cloud disk, or this volume can be configured as emptyDir
              persistentVolumeClaim:
                claimName: <pvc-name>
    workerGroupSpecs:
      # the pod replicas in this group typed worker
      - replicas: 1
        minReplicas: 1
        maxReplicas: 5
        # logical group name, for this called small-group, also can be functional
        groupName: autoscaling-group
        # The `rayStartParams` are used to configure the `ray start` command.
        # See https://github.com/ray-project/kuberay/blob/master/docs/guidance/rayStartParams.md for the default settings of `rayStartParams` in KubeRay.
        # See https://docs.ray.io/en/latest/cluster/cli.html#ray-start for all available options in `rayStartParams`.
        rayStartParams: {}
        # pod template
        template:
          spec:
            imagePullSecrets:
              - name: <image-pull-secret>
            containers:
              - name: ray-worker
                image: <image-address>
                lifecycle:
                  preStop:
                    exec:
                      command: [ "/bin/sh","-c","ray stop" ]
                resources:
                  limits:
                    cpu: "1"
                    nvidia.com/gpu: 1
                  requests:
                    cpu: "1"
                    nvidia.com/gpu: 1
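With enableInTreeAutoscaling turned on, each worker group in the template carries replicas, minReplicas, and maxReplicas. Before applying a modified template, it can help to sanity-check that these values are consistent. The sketch below is an illustration only: `check_worker_groups` is a hypothetical helper, it works on a plain dictionary that mirrors the rayClusterSpec section rather than on the YAML file itself, and it assumes all three replica fields are set explicitly, as they are in the template. The rule that replicas should fall between minReplicas and maxReplicas is an assumption about how the autoscaler bounds are meant to be used, not something this document states.

```python
from typing import Any, Dict, List


def check_worker_groups(ray_cluster_spec: Dict[str, Any]) -> List[str]:
    """Return a list of human-readable problems found in workerGroupSpecs."""
    problems = []
    for group in ray_cluster_spec.get("workerGroupSpecs", []):
        name = group.get("groupName", "<unnamed>")
        # Assumes the three replica fields are present, as in the template.
        low = group["minReplicas"]
        high = group["maxReplicas"]
        replicas = group["replicas"]
        if low > high:
            problems.append(f"{name}: minReplicas {low} > maxReplicas {high}")
        elif not low <= replicas <= high:
            problems.append(f"{name}: replicas {replicas} outside [{low}, {high}]")
    return problems


# Values copied from the worker group in the template.
template_spec = {
    "enableInTreeAutoscaling": True,
    "workerGroupSpecs": [
        {"groupName": "autoscaling-group", "replicas": 1, "minReplicas": 1, "maxReplicas": 5},
    ],
}
print(check_worker_groups(template_spec))  # prints []
```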
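Key parameters:
Parameter | Description |
---|---|
enableIngress | Whether to enable Ingress. |
enableInTreeAutoscaling | Whether to enable autoscaling. |
shutdownAfterJobFinishes | Whether to release the RayCluster after the job finishes. |
ttlSecondsAfterFinished | How long after the job finishes the RayCluster is released, in seconds. |
idleTimeoutSeconds | How long to wait before scaling down an idle worker, in seconds. |
storageClassName | The StorageClass of the PVC. Currently supported values include ebs-essd and ebs-ssd. |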
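
How shutdownAfterJobFinishes and ttlSecondsAfterFinished combine can be shown with a small calculation. This is a sketch of the behavior described in the template comments, not the controller's actual code: `cluster_release_time` is a hypothetical function, and the finish timestamp is made up for illustration.

```python
from datetime import datetime, timedelta
from typing import Optional


def cluster_release_time(
    finished_at: datetime,
    shutdown_after_job_finishes: bool,
    ttl_seconds_after_finished: int = 0,
) -> Optional[datetime]:
    """Return when the RayCluster would be released, or None if it is kept."""
    if not shutdown_after_job_finishes:
        # Without automatic shutdown the cluster stays until it is deleted manually.
        return None
    return finished_at + timedelta(seconds=ttl_seconds_after_finished)


# Illustrative finish time; 300 is the ttlSecondsAfterFinished value in the template.
finished = datetime(2024, 1, 1, 12, 0, 0)
print(cluster_release_time(finished, True, 300))   # 2024-01-01 12:05:00
print(cluster_release_time(finished, False, 300))  # None
```

With the template's settings, the RayCluster is kept for five minutes after the job finishes, which leaves a window to inspect the dashboard or collect logs before the resources are released.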
For the Ray image address and image pull secret, see Image Reference; you can also use a custom image.