Guide to Submitting a RayJob
Last updated: 2024.05.20 11:11:25 | First published: 2024.05.20 11:11:25

1 Environment Preparation

  1. Deploy the Ray service in the EMR on VKE product.

  2. To run RayJob jobs with kubectl, install the kubectl tool (see Install and Set Up kubectl) and configure the connection information for your Volcano Engine VKE cluster (see Connect to a VKE Cluster). A quick connectivity check is sketched below.
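A minimal sketch for verifying the setup, assuming kubeconfig has already been configured per the steps above:

# Confirm kubectl is installed
kubectl version --client

# Confirm connectivity to the VKE cluster
kubectl cluster-info
kubectl get nodes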

2 Submitting a RayJob

After the Ray service has been deployed in the Volcano Engine EMR on VKE product, you can submit a RayJob with the following steps:

  1. Write the RayJob configuration file: first, create a YAML configuration file that defines the RayJob's resource requirements, the Ray cluster configuration, the job's entrypoint, and so on. You can refer to the RayJob YAML template provided below.

  2. Apply the RayJob configuration: use the kubectl command-line tool to apply the RayJob configuration file to your Volcano Engine VKE cluster. Open a terminal or command-line interface and run the following command:

kubectl apply -f <rayjob-config-file>
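For example, assuming the template below has been saved as rayjob-sample.yaml (a hypothetical file name), you can first validate it with a server-side dry run and then apply it:

# Validate the manifest without creating resources
kubectl apply -f rayjob-sample.yaml --dry-run=server

# Create the RayJob
kubectl apply -f rayjob-sample.yaml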
  3. Verify that the RayJob is running: check the RayJob's status with the following commands:
kubectl get rayjob

# Or get more detailed status information:
kubectl describe rayjob <rayjob-name>
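As a quick sketch, assuming the RayJob is named rayjob-sample as in the template below, individual status fields can be read with jsonpath (the exact fields depend on your KubeRay version):

# Job status reported by KubeRay (e.g. RUNNING, SUCCEEDED, FAILED)
kubectl get rayjob rayjob-sample -o jsonpath='{.status.jobStatus}'

# Watch status changes until the job finishes
kubectl get rayjob rayjob-sample -w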
  4. View logs: you can use kubectl to view a Pod's logs:
kubectl logs <rayjob-pod-name>
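To locate the pods, you can filter by labels; a sketch assuming the RayJob is named rayjob-sample (KubeRay labels cluster pods with ray.io/node-type, and the submitter pod is created by a Kubernetes Job named after the RayJob):

# Pods of the Ray cluster created for this job
kubectl get pods -l ray.io/node-type

# Logs of the job submitter pod
kubectl logs -l job-name=rayjob-sample

# Logs of the Ray head pod
kubectl logs -l ray.io/node-type=head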

If Ingress is enabled in the RayJob configuration, find the corresponding cluster in the EMR console and open the Ray Dashboard UI via the quick link to view job execution information.
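You can also inspect the generated Ingress resource directly; a minimal check, where the Ingress name is derived from the cluster name:

kubectl get ingress
kubectl describe ingress <ingress-name>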

  5. Clean up: after the job finishes, if automatic cleanup is not enabled (that is, shutdownAfterJobFinishes: true is not configured), you need to delete the RayJob and its related Kubernetes resources manually. Use the following commands to delete the RayJob and everything associated with it:
kubectl delete -f <rayjob-config-file>

# Or use the following command
kubectl delete rayjob <rayjob-name>
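Afterwards you can confirm that the RayJob, its RayCluster, and the pods are gone, for example:

kubectl get rayjob
kubectl get raycluster
kubectl get pods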

In addition, if your RayJob needs to interact with external storage or services, make sure the corresponding StorageClass, PVCs, Secrets, and so on have been pre-configured in the Volcano Engine VKE cluster, as sketched below.
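For instance, the image pull secret and PVC referenced in the template below must exist before the RayJob is applied. A sketch, with the registry details as hypothetical placeholders:

# Create the image pull secret referenced by imagePullSecrets
kubectl create secret docker-registry <image-pull-secret> \
  --docker-server=<registry-address> \
  --docker-username=<username> \
  --docker-password=<password>

# Confirm the PVC exists and is bound
kubectl get pvc <pvc-name>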

The next section describes the YAML template used for RayJob; you can add to, trim, or modify it for different scenarios.

3 RayJob YAML Template

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: <pvc-name>
spec:
  storageClassName: ebs-essd # EBS cloud disk; Volcano Engine also supports NAS, CloudFS, and TOS
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 40Gi # Size this according to your actual needs
---
apiVersion: ray.io/v1
kind: RayJob
metadata:
  name: rayjob-sample
  annotations:
    # This annotation is required; without it, accessing the dashboard returns a 404
    nginx.ingress.kubernetes.io/rewrite-target: /$1
spec:
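  # Job entrypoint; replace ***.py with your script path (the git-clone init container below places the code under /app)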
  entrypoint: python ***.py
  # shutdownAfterJobFinishes specifies whether the RayCluster should be deleted after the RayJob finishes. Default is false.
  shutdownAfterJobFinishes: true

  # ttlSecondsAfterFinished specifies the number of seconds after which the RayCluster will be deleted after the RayJob finishes.
  ttlSecondsAfterFinished: 300
 
  # rayClusterSpec specifies the RayCluster instance to be created by the RayJob controller.
  rayClusterSpec:
    # Enable Ingress
    enableIngress: true
    rayVersion: '2.9.3' # should match the Ray version in the image of the containers
    # If enableInTreeAutoscaling is true, the autoscaler sidecar will be added to the Ray head pod.
    enableInTreeAutoscaling: true
    autoscalerOptions:
      # upscalingMode is "Default" or "Aggressive" or "Conservative"
      # Conservative: Upscaling is rate-limited; the number of pending worker pods is at most the size of the Ray cluster.
      # Default: Upscaling is not rate-limited.
      # Aggressive: An alias for Default; upscaling is not rate-limited.
      upscalingMode: Conservative
      # idleTimeoutSeconds is the number of seconds to wait before scaling down a worker pod which is not using Ray resources.
      idleTimeoutSeconds: 60
      # imagePullPolicy optionally overrides the autoscaler container's default image pull policy (IfNotPresent).
      imagePullPolicy: IfNotPresent
      # Optionally specify the autoscaler container's securityContext.
      securityContext: {}
      env: []
      envFrom: []
    headGroupSpec:
      # Enable Ingress
      enableIngress: true
      rayStartParams:
        dashboard-host: '0.0.0.0'
      template:
        spec:
          initContainers:
            - name: git-clone
              image: alpine/git
              command:
                - /bin/sh
                - -c
                - |
                  git clone <code-repo-url> /app
              volumeMounts:
                - name: esb
                  mountPath: /app
          containers:
            - name: ray-head
              image: <image-url>
              ports:
                - containerPort: 6379
                  name: gcs-server
                - containerPort: 8265 # Ray dashboard
                  name: dashboard
                - containerPort: 10001
                  name: client
              resources:
                limits:
                  cpu: "1"
                requests:
                  cpu: "1"
              volumeMounts:
                - name: esb
                  mountPath: /app                
          volumes:
            - name: esb
              # The code can live on the cloud disk, or this can simply be an emptyDir volume
              persistentVolumeClaim:
                claimName: <pvc-name>
    workerGroupSpecs:
        # Number of pod replicas in this worker group
      - replicas: 1
        minReplicas: 1
        maxReplicas: 5
        # Logical group name; here the group is called autoscaling-group
        groupName: autoscaling-group
        # The `rayStartParams` are used to configure the `ray start` command.
        # See https://github.com/ray-project/kuberay/blob/master/docs/guidance/rayStartParams.md for the default settings of `rayStartParams` in KubeRay.
        # See https://docs.ray.io/en/latest/cluster/cli.html#ray-start for all available options in `rayStartParams`.
        rayStartParams: {}
        # Pod template
        template:
          spec:
            imagePullSecrets:
              - name: <image-pull-secret>
            containers:
              - name: ray-worker
                image: <image-repo-url>
                lifecycle:
                  preStop:
                    exec:
                      command: [ "/bin/sh", "-c", "ray stop" ]
                resources:
                  limits:
                    cpu: "1"
                    nvidia.com/gpu: 1 
                  requests:
                    cpu: "1"
                    nvidia.com/gpu: 1
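
If you prefer not to go through Ingress, the Ray dashboard (port 8265 on the head pod) can also be reached with a port-forward; a sketch, where the head service name is generated by KubeRay and must be looked up first:

# Find the head service of the Ray cluster
kubectl get svc

# Forward the dashboard port to localhost, then open http://localhost:8265
kubectl port-forward svc/<head-service-name> 8265:8265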

Key parameter descriptions:

enableIngress
  Whether to enable Ingress. When this parameter is set, the following annotation must also be added under metadata:
  nginx.ingress.kubernetes.io/rewrite-target: /$1

enableInTreeAutoscaling
  Whether to enable autoscaling. When enabled, the autoscaler runs as a sidecar container in the head pod and handles scaling the cluster up and down. With autoscaling enabled, minReplicas and maxReplicas in workerGroupSpecs set the lower and upper bounds for scaling.

shutdownAfterJobFinishes
  Whether to release the RayCluster after the job finishes.

ttlSecondsAfterFinished
  How long after the job finishes the RayCluster is released, in seconds.

idleTimeoutSeconds
  How long an idle worker waits before being scaled down, in seconds.

storageClassName
  The StorageClass used by the PVC. Currently ebs-essd, ebs-ssd, and others are supported.

The Ray image URL and image pull secret can be filled in by following the image reference documentation, or you can use a custom image.