Monitoring RayJob with EMR
Last updated: 2024.05.20 11:11:45 | First published: 2024.05.20 11:11:45

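In Kubernetes, Prometheus can be used to collect metrics. This section describes how to configure Prometheus-related parameters in a RayJob YAML file so that the RayJob's metrics are collected into the Ray service of EMR.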

1 Prerequisites

  1. The prerequisites listed in the Quick Start

  2. EMR product version 1.4.0 or later

2 Usage guide

  1. Prepare the YAML file: rayjob.metrics.yaml

    apiVersion: ray.io/v1
    kind: RayJob
    metadata:
      name: rayjob-lxy-online
    spec:
      entrypoint: python /home/ray/samples/sample_code.py
      shutdownAfterJobFinishes: false
      runtimeEnvYAML: |
        pip:
          - requests==2.26.0
          - pendulum==2.1.2
        env_vars:
          counter_name: "test_counter"
      rayClusterSpec:
        rayVersion: "2.9.3"
        # Enable the autoscaler
        enableInTreeAutoscaling: true
        headGroupSpec:
          rayStartParams:
            dashboard-host: "0.0.0.0"
          #pod template
          template:
            metadata:
              annotations:
                prometheus.io/path: /metrics
                prometheus.io/port: "8080"
                prometheus.io/scrape: "true"
            spec:
              imagePullSecrets:
                - name: lxy-docker-secret
              containers:
                - name: ray-head
                  image: emr-vke-public-cn-beijing.cr.volces.com/emr/ray-ml:2.9.3-py3.9-ubuntu20.04-1.2.0
                  ports:
                    - containerPort: 6379
                      name: gcs-server
                    - containerPort: 8265 # Ray dashboard
                      name: dashboard
                    - containerPort: 10001
                      name: client
                  resources:
                    limits:
                      cpu: "1"
                    requests:
                      cpu: "200m"
                  volumeMounts:
                    - mountPath: /home/ray/samples
                      name: code-sample
              volumes:
                # You set volumes at the Pod level, then mount them into containers inside that Pod
                - name: code-sample
                  configMap:
                    name: ray-job-code-sample
                    items:
                      - key: sample_code.py
                        path: sample_code.py
        workerGroupSpecs:
          # Number of worker pod replicas in this group
          - replicas: 1
            minReplicas: 1
            maxReplicas: 5
            # Logical group name (small-group here); it can also describe the group's function
            groupName: small-group
            rayStartParams: {}
            #pod template
            template:
              metadata:
                annotations:
                  prometheus.io/path: /metrics
                  prometheus.io/port: "8080"
                  prometheus.io/scrape: "true"
              spec:
                imagePullSecrets:
                  - name: lxy-docker-secret
                containers:
                  - name: ray-worker # must consist of lowercase alphanumeric characters or '-', and must start and end with an alphanumeric character (e.g. 'my-name' or '123-abc')
                    image: emr-vke-public-cn-beijing.cr.volces.com/emr/ray-ml:2.9.3-py3.9-ubuntu20.04-1.2.0
                    lifecycle:
                      preStop:
                        exec:
                          command: ["/bin/sh", "-c", "ray stop"]
                    resources:
                      limits:
                        cpu: "1"
                      requests:
                        cpu: "200m"
      submitterPodTemplate:
        spec:
          imagePullSecrets:
            - name: lxy-docker-secret
          restartPolicy: Never
          containers:
            - name: my-custom-rayjob-submitter-pod
              image: emr-vke-public-cn-beijing.cr.volces.com/emr/ray-ml:2.9.3-py3.9-ubuntu20.04-1.2.0
    ###################### Ray code sample ######################
    # This sample is from https://docs.ray.io/en/latest/cluster/job-submission.html#quick-start-example
    # It is mounted into the container and executed to show the Ray job at work
    ---
    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: ray-job-code-sample
    data:
      sample_code.py: |
        import ray
        import os
        import requests
        ray.init()
        @ray.remote
        class Counter:
            def __init__(self):
                # Used to verify runtimeEnv
                self.name = os.getenv("counter_name")
                assert self.name == "test_counter"
                self.counter = 0
            def inc(self):
                self.counter += 1
            def get_counter(self):
                return "{} got {}".format(self.name, self.counter)
        counter = Counter.remote()
        for _ in range(5):
            ray.get(counter.inc.remote())
            print(ray.get(counter.get_counter.remote()))
    

The following three annotations in this YAML file relate to the Prometheus monitoring system:

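  • prometheus.io/path: /metrics: specifies the path Prometheus should request when scraping metrics. In this example, /metrics is the HTTP endpoint on which the container exposes its metrics. The Prometheus server requests this path periodically to fetch metric data.

  • prometheus.io/port: "8080": specifies the port Prometheus should use when scraping metrics. In this example the port is 8080, which means Prometheus will try to connect to port 8080 on the Ray cluster nodes to fetch metrics.

  • prometheus.io/scrape: "true": specifies whether Prometheus should scrape the given endpoint. Setting it to "true" means Prometheus should scrape metrics from this endpoint.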
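
To make the role of the three annotations concrete, the following Python sketch shows how an annotation-driven collector could turn them into a scrape URL, and how to list the annotations carried by the head and worker pod templates of a RayJob manifest. It is only an illustration: the function names `scrape_url` and `pod_annotations`, the pod IP, and the trimmed-down `RAYJOB` dictionary are assumptions made for this example, and the actual collector behind EMR may resolve targets differently.

```python
from typing import Iterator, Optional, Tuple


def scrape_url(pod_ip: str, annotations: dict) -> Optional[str]:
    """Return the URL an annotation-driven scraper would poll, or None."""
    # Kubernetes annotation values are strings, so compare against "true".
    if annotations.get("prometheus.io/scrape") != "true":
        return None
    port = annotations.get("prometheus.io/port")
    if port is None:
        return None  # assumption: this sketch skips pods without an explicit port
    path = annotations.get("prometheus.io/path", "/metrics")
    return f"http://{pod_ip}:{port}{path}"


def pod_annotations(rayjob: dict) -> Iterator[Tuple[str, dict]]:
    """Yield (group name, annotations) for the head group and each worker group."""
    cluster = rayjob["spec"]["rayClusterSpec"]
    groups = [("head", cluster["headGroupSpec"])]
    groups += [(g["groupName"], g) for g in cluster.get("workerGroupSpecs", [])]
    for name, group in groups:
        metadata = group["template"].get("metadata", {})
        yield name, metadata.get("annotations", {})


# Hypothetical, trimmed-down stand-in for the parsed rayjob.metrics.yaml.
ANNOTATIONS = {
    "prometheus.io/path": "/metrics",
    "prometheus.io/port": "8080",
    "prometheus.io/scrape": "true",
}
RAYJOB = {
    "spec": {
        "rayClusterSpec": {
            "headGroupSpec": {"template": {"metadata": {"annotations": ANNOTATIONS}}},
            "workerGroupSpecs": [
                {
                    "groupName": "small-group",
                    "template": {"metadata": {"annotations": ANNOTATIONS}},
                }
            ],
        }
    }
}

for group_name, group_annotations in pod_annotations(RAYJOB):
    # 10.0.0.5 is a made-up pod IP used only for illustration.
    print(group_name, scrape_url("10.0.0.5", group_annotations))
```

A group whose template lacks `prometheus.io/scrape: "true"` yields `None` here, which is a quick way to spot a pod template that would be skipped during collection.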

  1. Apply the YAML file

    kubectl apply -f rayjob.metrics.yaml -n <namespace>
    
  2. In the EMR cluster, the Ray monitoring metrics are displayed under cluster monitoring

  3. Delete the RayJob

    kubectl delete -f rayjob.metrics.yaml -n <namespace>
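
While the RayJob is still running (that is, before the delete step), the metrics endpoint can also be checked independently of the EMR console. The sketch below parses text in the Prometheus exposition format and lists the metric names that start with `ray_`. The helper names `metric_names` and `fetch_metrics` are hypothetical, and the `SAMPLE` text is made up for illustration rather than captured from a real cluster.

```python
import re
import urllib.request

# Metric name at the start of the line, optional {labels}, then the sample value.
SAMPLE_LINE = re.compile(r"^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{.*\})?\s+(\S+)")


def metric_names(exposition_text: str) -> set:
    """Collect the metric names found in Prometheus text-format output."""
    names = set()
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and # HELP / # TYPE comments
        match = SAMPLE_LINE.match(line)
        if match:
            names.add(match.group(1))
    return names


def fetch_metrics(url: str, timeout: float = 5.0) -> str:
    """Download the raw metrics text from an endpoint such as http://localhost:8080/metrics."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")


# Illustrative sample only; real output from a Ray pod will differ.
SAMPLE = """\
# HELP ray_node_cpu_utilization Total CPU usage on a ray node
# TYPE ray_node_cpu_utilization gauge
ray_node_cpu_utilization{NodeAddress="10.0.0.5"} 12.5
ray_node_mem_used{NodeAddress="10.0.0.5"} 1048576.0
python_gc_collections_total{generation="0"} 42.0
"""

ray_metrics = sorted(n for n in metric_names(SAMPLE) if n.startswith("ray_"))
print(ray_metrics)
```

To run this against a live pod, one option is to forward the annotated port first, for example `kubectl port-forward pod/<head-pod-name> 8080:8080 -n <namespace>`, and then call `metric_names(fetch_metrics("http://localhost:8080/metrics"))`. An empty result suggests that the port or path in the annotations does not match what the container actually exposes.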