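With Ray on VKE, you can submit Ray jobs by using RayJob. When you use RayJob, KubeRay automatically creates a RayCluster and submits the Ray job once the cluster is ready. RayJob can also delete the RayCluster automatically after the Ray job finishes.
This section walks through a RayJob example that runs a short Python snippet, so that you can get started quickly.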
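The sample manifest in this section does not enable automatic deletion. As a minimal sketch, assuming the KubeRay CRDs are installed in the cluster and publish their schema, the fields that control cleanup can be looked up with kubectl explain. The field names shutdownAfterJobFinishes and ttlSecondsAfterFinished come from the upstream KubeRay RayJob spec, not from this page, and the helper name rayjob_cleanup_fields is hypothetical.

```shell
# Hypothetical helper: print the documentation of the RayJob cleanup fields.
# Assumes kubectl is configured for a cluster where the KubeRay CRDs exist.
rayjob_cleanup_fields() {
  # Whether the RayCluster is deleted after the Ray job finishes
  kubectl explain rayjob.spec.shutdownAfterJobFinishes
  # How long to wait after the job finishes before cleaning up
  kubectl explain rayjob.spec.ttlSecondsAfterFinished
}
```

Run rayjob_cleanup_fields after defining it to see the field descriptions for the KubeRay version installed in your cluster.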
apiVersion: ray.io/v1
kind: RayJob
metadata:
  name: rayjob-sample
spec:
  entrypoint: python /home/ray/samples/sample_code.py
  # rayClusterSpec specifies the RayCluster instance to be created by the RayJob controller.
  rayClusterSpec:
    rayVersion: '2.9.3' # should match the Ray version in the image of the containers
    # Ray head pod template
    headGroupSpec:
      rayStartParams:
        dashboard-host: '0.0.0.0'
      template:
        spec:
          containers:
            - name: ray-head
              image: emr-vke-public-cn-beijing.cr.volces.com/emr/ray:2.9.3-py3.9-ubuntu20.04-1.2.0
              ports:
                - containerPort: 6379
                  name: gcs-server
                - containerPort: 8265 # Ray dashboard
                  name: dashboard
                - containerPort: 10001
                  name: client
              resources:
                limits:
                  cpu: "1"
                requests:
                  cpu: "200m"
              volumeMounts:
                - mountPath: /home/ray/samples
                  name: code-sample
          volumes:
            - name: code-sample
              configMap:
                # Provide the name of the ConfigMap you want to mount.
                name: ray-job-code-sample
                # An array of keys from the ConfigMap to create as files
                items:
                  - key: sample_code.py
                    path: sample_code.py
    workerGroupSpecs:
      # the pod replicas in this group typed worker
      - replicas: 1
        minReplicas: 1
        maxReplicas: 5
        # logical group name, for this called small-group, also can be functional
        groupName: small-group
        # The `rayStartParams` are used to configure the `ray start` command.
        # See https://github.com/ray-project/kuberay/blob/master/docs/guidance/rayStartParams.md for the default settings of `rayStartParams` in KubeRay.
        # See https://docs.ray.io/en/latest/cluster/cli.html#ray-start for all available options in `rayStartParams`.
        rayStartParams: {}
        # pod template
        template:
          spec:
            containers:
              - name: ray-worker
                image: emr-vke-public-cn-beijing.cr.volces.com/emr/ray:2.9.3-py3.9-ubuntu20.04-1.2.0
                lifecycle:
                  preStop:
                    exec:
                      command: [ "/bin/sh","-c","ray stop" ]
                resources:
                  limits:
                    cpu: "1"
                  requests:
                    cpu: "200m"
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: ray-job-code-sample
data:
  sample_code.py: |
    import ray
    import os
    import requests

    ray.init()

    @ray.remote
    class Counter:
        def __init__(self):
            self.name = "counter_name"
            self.counter = 0

        def inc(self):
            self.counter += 1

        def get_counter(self):
            return "{} got {}".format(self.name, self.counter)

    counter = Counter.remote()

    for _ in range(5):
        ray.get(counter.inc.remote())
        print(ray.get(counter.get_counter.remote()))
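In this YAML file:
The RayJob is named "rayjob-sample".
The entrypoint is "python /home/ray/samples/sample_code.py", which runs the script mounted from the ray-job-code-sample ConfigMap.
The script defines a Counter class, a Ray remote object with methods to increment the counter and to get the current count.
The script creates a remote instance of Counter and performs 5 remote increments, printing the current count after each one.
Apply the YAML file above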
kubectl apply -f rayjob.sample.yaml -n <namespace>
Check the RayJob status
In the VKE cluster, you can see the RayCluster that was created and the Pods used by the RayJob.
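The same information can be inspected from the command line. The following is a minimal sketch, not an official procedure: the helper name rayjob_status and the NAMESPACE default are hypothetical, and the job-name label on the submitter Pod is an assumption based on upstream KubeRay behavior that may vary by KubeRay version.

```shell
# Assumptions: NAMESPACE is a placeholder for the namespace used with kubectl apply;
# RAYJOB_NAME matches metadata.name in the sample manifest.
NAMESPACE="${NAMESPACE:-default}"
RAYJOB_NAME="rayjob-sample"

rayjob_status() {
  # The RayJob resource itself
  kubectl get rayjob "$RAYJOB_NAME" -n "$NAMESPACE"
  # The RayCluster that KubeRay created for the job
  kubectl get raycluster -n "$NAMESPACE"
  # Pods in the namespace, including the Ray head and worker Pods
  kubectl get pods -n "$NAMESPACE"
  # Ray job status as reported in the RayJob status field
  kubectl get rayjob "$RAYJOB_NAME" -n "$NAMESPACE" -o jsonpath='{.status.jobStatus}'
  echo
  # Logs of the job submitter Pod; assumes it carries the label job-name=<RayJob name>
  kubectl logs -l "job-name=$RAYJOB_NAME" -n "$NAMESPACE"
}
```

If the job has run, the logs should contain the lines printed by sample_code.py, one per increment of the counter.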
Delete the RayJob
kubectl delete -f rayjob.sample.yaml -n <namespace>
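To confirm that the deletion took effect, a small check such as the following can be used. This is a sketch under stated assumptions: the helper name rayjob_is_gone and the NAMESPACE default are hypothetical, and any failure of kubectl get (including a connection problem) is treated as "not found".

```shell
# Assumptions: NAMESPACE is a placeholder for the namespace used with kubectl delete;
# RAYJOB_NAME matches metadata.name in the sample manifest.
NAMESPACE="${NAMESPACE:-default}"
RAYJOB_NAME="rayjob-sample"

rayjob_is_gone() {
  # kubectl get exits with a non-zero status when the resource does not exist
  if kubectl get rayjob "$RAYJOB_NAME" -n "$NAMESPACE" >/dev/null 2>&1; then
    echo "RayJob $RAYJOB_NAME still exists in namespace $NAMESPACE"
    return 1
  fi
  echo "RayJob $RAYJOB_NAME not found in namespace $NAMESPACE"
}
```

Because the ConfigMap is defined in the same file, kubectl delete -f removes it together with the RayJob.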