RayJob Quick Start
Last updated: 2024.06.03 15:24:08; first published: 2024.05.20 11:11:19

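In the Ray on VKE scenario, you can submit Ray jobs as a RayJob. With RayJob, KubeRay automatically creates a RayCluster and submits the Ray job once the cluster is ready. It can also delete the RayCluster automatically after the Ray job finishes.
This section walks through a RayJob usage example that runs a simple Python snippet, so that you can get started quickly.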

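1 Prerequisites

  1. Deploy the Ray service in the EMR on VKE product.
  2. RayJob jobs are run with kubectl commands, so install the kubectl tool (see "Install and Set Up kubectl") and configure the connection information for your Volcengine VKE cluster (see "Connect to a VKE Cluster").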
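The two prerequisites can be checked from a terminal before going further. The following is a minimal sketch, not an official procedure: the function name `check_prereqs` is hypothetical, and the CRD name `rayjobs.ray.io` assumes that the Ray service installs the standard KubeRay custom resource definitions.

```shell
# Sketch: verify the prerequisites before submitting a RayJob.
# check_prereqs is a hypothetical helper name, not part of any product CLI.
check_prereqs() {
  # kubectl must be installed and on PATH.
  kubectl version --client || { echo "kubectl is not usable" >&2; return 1; }
  # The kubeconfig must point at a reachable VKE cluster.
  kubectl cluster-info || { echo "cannot reach the cluster" >&2; return 1; }
  # Assumption: the Ray service registers the standard KubeRay RayJob CRD.
  kubectl get crd rayjobs.ray.io || { echo "RayJob CRD not found" >&2; return 1; }
  echo "prerequisites look OK"
}
```

Run `check_prereqs` after sourcing the snippet. A non-zero return code points at the step that needs attention.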

2 Usage

  1. Prepare the YAML file used by the RayJob: rayjob.sample.yaml
    apiVersion: ray.io/v1
    kind: RayJob
    metadata:
      name: rayjob-sample
    spec:
      entrypoint: python /home/ray/samples/sample_code.py
          
      # rayClusterSpec specifies the RayCluster instance to be created by the RayJob controller.
      rayClusterSpec:
        rayVersion: '2.9.3' # should match the Ray version in the image of the containers
        # Ray head pod template
        headGroupSpec:
          rayStartParams:
            dashboard-host: '0.0.0.0'
          template:
            spec:
              containers:
                - name: ray-head
                  image: emr-vke-public-cn-beijing.cr.volces.com/emr/ray:2.9.3-py3.9-ubuntu20.04-1.2.0
                  ports:
                    - containerPort: 6379
                      name: gcs-server
                    - containerPort: 8265 # Ray dashboard
                      name: dashboard
                    - containerPort: 10001
                      name: client
                  resources:
                    limits:
                      cpu: "1"
                    requests:
                      cpu: "200m"
                  volumeMounts:
                    - mountPath: /home/ray/samples
                      name: code-sample
              volumes:
                - name: code-sample
                  configMap:
                    # Provide the name of the ConfigMap you want to mount.
                    name: ray-job-code-sample
                    # An array of keys from the ConfigMap to create as files
                    items:
                      - key: sample_code.py
                        path: sample_code.py
        workerGroupSpecs:
          # number of worker Pod replicas in this group
          - replicas: 1
            minReplicas: 1
            maxReplicas: 5
            # logical group name (here "small-group"); it can also describe the group's function
            groupName: small-group
            # The `rayStartParams` are used to configure the `ray start` command.
            # See https://github.com/ray-project/kuberay/blob/master/docs/guidance/rayStartParams.md for the default settings of `rayStartParams` in KubeRay.
            # See https://docs.ray.io/en/latest/cluster/cli.html#ray-start for all available options in `rayStartParams`.
            rayStartParams: {}
            # Pod template
            template:
              spec:
                containers:
                  - name: ray-worker
                    image: emr-vke-public-cn-beijing.cr.volces.com/emr/ray:2.9.3-py3.9-ubuntu20.04-1.2.0
                    lifecycle: 
                      preStop:
                        exec:
                          command: [ "/bin/sh","-c","ray stop" ]
                    resources:
                      limits:
                        cpu: "1"
                      requests:
                        cpu: "200m"
    ---
    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: ray-job-code-sample
    data:
      sample_code.py: |
        import ray
    
        ray.init()
    
        @ray.remote
        class Counter:
            def __init__(self):
                self.name = "counter_name"
                self.counter = 0
    
            def inc(self):
                self.counter += 1
    
            def get_counter(self):
                return "{} got {}".format(self.name, self.counter)
    
        counter = Counter.remote()
    
        for _ in range(5):
            ray.get(counter.inc.remote())
            print(ray.get(counter.get_counter.remote()))
    

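In this YAML file:

  • The RayJob is named "rayjob-sample".
  • The job entrypoint is "python /home/ray/samples/sample_code.py".
  • headGroupSpec defines the head node of the RayCluster: the image, the ports, and the resource limits, as well as the mounted volume.
  • workerGroupSpecs defines the worker nodes of the RayCluster.
  • The Python script at the end:
    • Defines a Counter class, a Ray remote class (actor) with methods to increment the counter and to get the current count.
    • Creates a remote Counter instance and increments it remotely 5 times, printing the current count after each increment.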
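Because the manifest is long and indentation-sensitive, it can help to validate it before creating anything. The following is a small sketch that relies on the server-side dry run of kubectl. The function name `validate_rayjob` is hypothetical, and the namespace argument is whatever namespace you plan to use.

```shell
# Sketch: ask the API server to validate the manifest without persisting it.
# validate_rayjob is a hypothetical helper; $1 is the manifest file, $2 the namespace.
validate_rayjob() {
  kubectl apply --dry-run=server -f "$1" -n "$2"
}
```

For example, `validate_rayjob rayjob.sample.yaml <namespace>` should report both the RayJob and the ConfigMap as created in dry-run mode if the file is well-formed. Nothing is actually created in the cluster.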
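  2. Apply the YAML file above

    kubectl apply -f rayjob.sample.yaml -n <namespace>
    
  3. Check the RayJob execution
    In the VKE cluster, you can see the RayCluster that was created and the Pods used by the RayJob.

  4. Delete the RayJob

    kubectl delete -f rayjob.sample.yaml -n <namespace>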
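The step that checks RayJob execution can also be done from the command line, before the RayJob is deleted. The following is a sketch under stated assumptions: the function names are hypothetical, `RAYJOB_NAME` matches `metadata.name` in rayjob.sample.yaml, and the `job-name` label selector assumes a KubeRay version that submits the job through a Kubernetes Job named after the RayJob.

```shell
# Sketch: inspect a RayJob and its resources. Helper names are hypothetical.
RAYJOB_NAME="rayjob-sample"   # metadata.name from rayjob.sample.yaml

show_rayjob_status() {
  # $1: namespace the RayJob was applied to
  # The RayJob resource and the RayCluster that KubeRay created for it.
  kubectl get rayjob "$RAYJOB_NAME" -n "$1"
  kubectl get raycluster -n "$1"
  # Head and worker Pods, plus the submitter Pod if the KubeRay version uses one.
  kubectl get pods -n "$1"
  # Job status recorded by the KubeRay operator, for example RUNNING or SUCCEEDED.
  kubectl get rayjob "$RAYJOB_NAME" -n "$1" -o jsonpath='{.status.jobStatus}'
  echo
}

show_rayjob_logs() {
  # Assumption: the submitter Pod carries the label job-name=<RayJob name>.
  kubectl logs -l "job-name=$RAYJOB_NAME" -n "$1"
}
```

After sourcing the snippet, call `show_rayjob_status <namespace>` and `show_rayjob_logs <namespace>`. Given the sample script, the job output should include the lines "counter_name got 1" through "counter_name got 5", mixed in with Ray's own log messages.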