View job status through the Ray Dashboard
Last updated: 2024.05.20 11:11:34. First published: 2024.05.20 11:11:34.

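In Kubernetes, an Ingress controller manages the routing of external access to services in the cluster. This section describes how to expose the RayCluster Dashboard UI through an Ingress, so that you can check job status in the Ray UI.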

1 Prerequisites

  1. The prerequisites listed in the Quick Start

2 Usage guide

  1. Prepare the YAML file: rayjob.ingress.yaml
apiVersion: ray.io/v1
kind: RayJob
metadata:
  name: rayjob-sample
  annotations:
    # This annotation is required; without it, requests to the dashboard return 404.
    nginx.ingress.kubernetes.io/rewrite-target: /$1
spec:
  entrypoint: python /home/ray/samples/sample_code.py
  # shutdownAfterJobFinishes specifies whether the RayCluster should be deleted after the RayJob finishes. Default is false.
  shutdownAfterJobFinishes: true

  # ttlSecondsAfterFinished specifies the number of seconds after which the RayCluster will be deleted after the RayJob finishes.
  ttlSecondsAfterFinished: 3000

  runtimeEnvYAML: |
    env_vars:
      counter_name: "test_counter"
      
  # rayClusterSpec specifies the RayCluster instance to be created by the RayJob controller.
  rayClusterSpec:
    rayVersion: '2.9.3' # should match the Ray version in the image of the containers
    # Ray head pod template
    headGroupSpec:
      # Enable ingress
      enableIngress: true
      rayStartParams:
        dashboard-host: '0.0.0.0'
      template:
        spec:
          containers:
            - name: ray-head
              image: emr-vke-public-cn-beijing.cr.volces.com/emr/ray:2.9.3-py3.9-ubuntu20.04-1.2.0
              ports:  
                - containerPort: 6379
                  name: gcs-server
                - containerPort: 8265 # Ray dashboard
                  name: dashboard
                - containerPort: 10001
                  name: client
              resources:
                limits:
                  cpu: "1"
                requests:
                  cpu: "200m"
              volumeMounts:
                - mountPath: /home/ray/samples
                  name: code-sample
          volumes:
            - name: code-sample
              configMap:
                # Provide the name of the ConfigMap you want to mount.
                name: ray-job-code-sample
                # An array of keys from the ConfigMap to create as files
                items:
                  - key: sample_code.py
                    path: sample_code.py
    workerGroupSpecs:
      # Number of worker pod replicas in this group
      - replicas: 1
        minReplicas: 1
        maxReplicas: 5
        # Logical group name; here it is small-group, but it can also describe the group's function
        groupName: small-group
        # The `rayStartParams` are used to configure the `ray start` command.
        # See https://github.com/ray-project/kuberay/blob/master/docs/guidance/rayStartParams.md for the default settings of `rayStartParams` in KubeRay.
        # See https://docs.ray.io/en/latest/cluster/cli.html#ray-start for all available options in `rayStartParams`.
        rayStartParams: {}
        #pod template
        template:
          spec:
            containers:
              - name: ray-worker
                image: emr-vke-public-cn-beijing.cr.volces.com/emr/ray:2.9.3-py3.9-ubuntu20.04-20240402
                lifecycle:
                  preStop:
                    exec:
                      command: [ "/bin/sh","-c","ray stop" ]
                resources:
                  limits:
                    cpu: "1"
                  requests:
                    cpu: "200m"
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: ray-job-code-sample
data:
  sample_code.py: |
    import ray
    import os
    import requests

    ray.init()

    @ray.remote
    class Counter:
        def __init__(self):
            self.name = os.getenv("counter_name")
            assert self.name == "test_counter"
            self.counter = 0

        def inc(self):
            self.counter += 1

        def get_counter(self):
            return "{} got {}".format(self.name, self.counter)

    counter = Counter.remote()

    for _ in range(5):
        ray.get(counter.inc.remote())
        print(ray.get(counter.get_counter.remote()))

    assert requests.__version__ == "2.31.0"

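In the YAML file, nginx.ingress.kubernetes.io/rewrite-target: /$1 is an annotation specific to the Nginx Ingress controller. It sets the target path to which requests routed through the Ingress are rewritten before they reach the backend service. $1 refers to the first capture group of the regular expression in the Ingress path, so /$1 forwards only the part of the request path matched by that group, with the leading path prefix removed.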
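The rewrite described above can be illustrated with a small Python sketch. It is only a simulation of the capture-group substitution, not the controller's implementation. The path pattern `/<cluster-name>/(.*)` is an assumption (check the actual rule with `kubectl describe ingress`), the cluster name is taken from the example URL in this guide, and the `rewrite` helper is hypothetical.

```python
import re
from typing import Optional

# Assumption: the Ingress path is a regex of the form "/<cluster-name>/(.*)".
# The cluster name is the one used in this guide's example URL.
INGRESS_PATH = r"^/rayjob-sample-raycluster-c59cq/(.*)$"
REWRITE_TARGET = "/$1"  # value of the rewrite-target annotation


def rewrite(request_path: str) -> Optional[str]:
    """Return the path forwarded to the backend, or None if the rule does not match."""
    match = re.match(INGRESS_PATH, request_path)
    if match is None:
        return None
    # Replace $1 with the first captured group, mimicking rewrite-target.
    return REWRITE_TARGET.replace("$1", match.group(1))


if __name__ == "__main__":
    for path in ("/rayjob-sample-raycluster-c59cq/",
                 "/rayjob-sample-raycluster-c59cq/api/jobs/",
                 "/other/"):
        print(path, "->", rewrite(path))
```

Under these assumptions, the dashboard sees requests as if they were sent to its root path. Without the annotation, the cluster-name prefix would be forwarded unchanged, which is consistent with the 404 mentioned in the YAML comment.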

  1. Apply the YAML file

    kubectl apply -f rayjob.ingress.yaml -n <namespace>
    
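  2. Access the Ray Dashboard UI

    Obtain the Ingress entry endpoint with the following commands:

    kubectl get ingress -n <namespace>
    kubectl describe ingress <ingress-name> -n <namespace>

    Combine the address and path from this output to form the final URL. For the example in the figure above, the Dashboard is available at http://<ingress-address>/rayjob-sample-raycluster-c59cq/.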

  3. Delete the RayJob

    kubectl delete -f rayjob.ingress.yaml -n <namespace>
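The URL composition from the "Access the Ray Dashboard UI" step can be sketched in Python as follows. This is a minimal illustration, not part of the product: the `dashboard_url` helper is hypothetical, and the address `203.0.113.10` is a documentation placeholder that stands in for the address reported by `kubectl get ingress`.

```python
def dashboard_url(ingress_address: str, cluster_name: str, scheme: str = "http") -> str:
    """Join the Ingress address and the RayCluster name into the Dashboard URL."""
    host = ingress_address.strip().rstrip("/")
    name = cluster_name.strip().strip("/")
    if not host or not name:
        raise ValueError("both the ingress address and the cluster name are required")
    # Keep the trailing slash, as in the example URL in this guide.
    return f"{scheme}://{host}/{name}/"


if __name__ == "__main__":
    # Placeholder address; replace it with the ADDRESS column of `kubectl get ingress`.
    print(dashboard_url("203.0.113.10", "rayjob-sample-raycluster-c59cq"))
```

The RayCluster name carries a generated suffix (c59cq in this guide's example), so read it from the Ingress output rather than hard-coding it.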