Monitoring RayJob with EMR

Last updated: 2024.05.20 11:11:45
First published: 2024.05.20 11:11:45

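In Kubernetes, metrics can be collected with Prometheus. This section describes how to configure the Prometheus-related parameters in a RayJob YAML file so that the RayJob's metrics are collected into the Ray service of EMR.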

1 Prerequisites

The same prerequisites as in the Quick Start.
EMR product version 1.4.0 or later.

2 Usage

Prepare the YAML file: rayjob.metrics.yaml

apiVersion: ray.io/v1
kind: RayJob
metadata:
  name: rayjob-lxy-online
spec:
  entrypoint: python /home/ray/samples/sample_code.py
  shutdownAfterJobFinishes: false
  runtimeEnvYAML: |
    pip:
      - requests==2.26.0
      - pendulum==2.1.2
    env_vars:
      counter_name: "test_counter"
  rayClusterSpec:
    rayVersion: "2.9.3"
    # Enable the autoscaler
    enableInTreeAutoscaling: true
    headGroupSpec:
      rayStartParams:
        dashboard-host: "0.0.0.0"
      #pod template
      template:
        metadata:
          annotations:
            prometheus.io/path: /metrics
            prometheus.io/port: "8080"
            prometheus.io/scrape: "true"
        spec:
          imagePullSecrets:
            - name: lxy-docker-secret
          containers:
            - name: ray-head
              image: emr-vke-public-cn-beijing.cr.volces.com/emr/ray-ml:2.9.3-py3.9-ubuntu20.04-1.2.0
              ports:
                - containerPort: 6379
                  name: gcs-server
                - containerPort: 8265 # Ray dashboard
                  name: dashboard
                - containerPort: 10001
                  name: client
              resources:
                limits:
                  cpu: "1"
                requests:
                  cpu: "200m"
              volumeMounts:
                - mountPath: /home/ray/samples
                  name: code-sample
          volumes:
            # You set volumes at the Pod level, then mount them into containers inside that Pod
            - name: code-sample
              configMap:
                name: ray-job-code-sample
                items:
                  - key: sample_code.py
                    path: sample_code.py
    workerGroupSpecs:
      # number of worker pod replicas in this group
      - replicas: 1
        minReplicas: 1
        maxReplicas: 5
        # logical group name ("small-group" here); it can also describe the group's function
        groupName: small-group
        rayStartParams: {}
        #pod template
        template:
          metadata:
            annotations:
              prometheus.io/path: /metrics
              prometheus.io/port: "8080"
              prometheus.io/scrape: "true"
          spec:
            imagePullSecrets:
              - name: lxy-docker-secret
            containers:
              - name: ray-worker # must consist of lower-case alphanumeric characters or '-', and must start and end with an alphanumeric character (e.g. 'my-name' or '123-abc')
                image: emr-vke-public-cn-beijing.cr.volces.com/emr/ray-ml:2.9.3-py3.9-ubuntu20.04-1.2.0
                lifecycle:
                  preStop:
                    exec:
                      command: ["/bin/sh", "-c", "ray stop"]
                resources:
                  limits:
                    cpu: "1"
                  requests:
                    cpu: "200m"
  submitterPodTemplate:
    spec:
      imagePullSecrets:
        - name: lxy-docker-secret
      restartPolicy: Never
      containers:
        - name: my-custom-rayjob-submitter-pod
          image: emr-vke-public-cn-beijing.cr.volces.com/emr/ray-ml:2.9.3-py3.9-ubuntu20.04-1.2.0
  ######################Ray code sample#################################
# this sample is from https://docs.ray.io/en/latest/cluster/job-submission.html#quick-start-example
# it is mounted into the container and executed to show the Ray job at work
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: ray-job-code-sample
data:
  sample_code.py: |
    import ray
    import os
    import requests
    ray.init()
    @ray.remote
    class Counter:
        def __init__(self):
            # Used to verify runtimeEnv
            self.name = os.getenv("counter_name")
            assert self.name == "test_counter"
            self.counter = 0
        def inc(self):
            self.counter += 1
        def get_counter(self):
            return "{} got {}".format(self.name, self.counter)
    counter = Counter.remote()
    for _ in range(5):
        ray.get(counter.inc.remote())
        print(ray.get(counter.get_counter.remote()))

The YAML file contains the following three annotations, which relate to the Prometheus monitoring system:

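prometheus.io/path: /metrics: specifies the path Prometheus requests when scraping metrics. In this example, /metrics is the HTTP endpoint on which the container exposes its metrics. The Prometheus server requests this path periodically to fetch the metric data.
prometheus.io/port: "8080": specifies the port Prometheus uses when scraping metrics. In this example the port is 8080, so Prometheus tries to connect to port 8080 on the Ray cluster nodes to fetch the metrics.
prometheus.io/scrape: "true": specifies whether Prometheus should scrape the endpoint. Setting it to "true" means Prometheus needs to scrape the metrics from this endpoint.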
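
To make the interplay of the three annotations concrete, the following is a minimal sketch of how an annotation-based scraper could derive its target URL from a pod. It is illustrative only: the function name scrape_target and the pod IP 10.0.0.12 are hypothetical, and the real behavior depends on how the Prometheus scrape configuration in your cluster interprets these annotations.

```python
from typing import Mapping, Optional


def scrape_target(pod_ip: str, annotations: Mapping[str, str]) -> Optional[str]:
    """Return the URL a scraper would request, or None if scraping is disabled.

    Hypothetical helper: it only mirrors the meaning of the three annotations
    described above, not the actual Prometheus implementation.
    """
    # prometheus.io/scrape must be "true" for the pod to be scraped at all.
    if annotations.get("prometheus.io/scrape", "false").lower() != "true":
        return None
    # prometheus.io/port selects the port; this sketch requires it to be set.
    port = annotations["prometheus.io/port"]
    # prometheus.io/path selects the HTTP path; /metrics is the usual default.
    path = annotations.get("prometheus.io/path", "/metrics")
    if not path.startswith("/"):
        path = "/" + path
    return f"http://{pod_ip}:{port}{path}"


# The annotations used in rayjob.metrics.yaml for both head and worker pods.
ANNOTATIONS = {
    "prometheus.io/path": "/metrics",
    "prometheus.io/port": "8080",
    "prometheus.io/scrape": "true",
}

if __name__ == "__main__":
    # 10.0.0.12 is a made-up pod IP used only for illustration.
    print(scrape_target("10.0.0.12", ANNOTATIONS))
```

With prometheus.io/scrape set to anything other than "true", the sketch returns None, which corresponds to the pod being skipped.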

Apply the YAML file

kubectl apply -f rayjob.metrics.yaml -n <namespace>

In the EMR cluster, the Ray monitoring metrics are then shown under cluster monitoring.
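
If the metrics do not show up, it can help to look at what a Ray pod actually exposes on its /metrics endpoint. The sketch below assumes you have already saved that response as text (how you fetch it is up to you) and that Ray metric names start with the prefix ray_; the helper name ray_metric_names and the sample text are hypothetical and are not real output from a cluster.

```python
from typing import List


def ray_metric_names(exposition_text: str, prefix: str = "ray_") -> List[str]:
    """List the distinct metric names with the given prefix.

    Parses the Prometheus text exposition format line by line:
    comment lines start with '#', and a sample line is
    'name{labels} value' or 'name value'.
    """
    names = set()
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        # The metric name ends at '{' (labels) or at the first whitespace.
        parts = line.split("{", 1)[0].split()
        if parts and parts[0].startswith(prefix):
            names.add(parts[0])
    return sorted(names)


# Illustrative text only; the metric names and values are made up.
SAMPLE_TEXT = """\
# HELP ray_example_gauge Illustrative metric, not real output.
# TYPE ray_example_gauge gauge
ray_example_gauge{NodeId="abc"} 0.5
ray_example_counter_total 3
process_open_fds 12
"""

if __name__ == "__main__":
    for name in ray_metric_names(SAMPLE_TEXT):
        print(name)
```

An empty result for a real response would suggest that the endpoint, port, or path in the annotations does not match what the pod serves.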

Delete the RayJob

kubectl delete -f rayjob.metrics.yaml -n <namespace>