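In Kubernetes, Prometheus can be used to collect metrics. This section describes how to configure the Prometheus-related parameters in a RayJob YAML file so that the RayJob's metrics are collected into the Ray service of EMR.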
The prerequisites are the same as those in the Quick Start.
Applies to EMR version 1.4.0 and later.
Prepare the YAML file: rayjob.metrics.yaml
apiVersion: ray.io/v1
kind: RayJob
metadata:
  name: rayjob-lxy-online
spec:
  entrypoint: python /home/ray/samples/sample_code.py
  shutdownAfterJobFinishes: false
  runtimeEnvYAML: |
    pip:
      - requests==2.26.0
      - pendulum==2.1.2
    env_vars:
      counter_name: "test_counter"
  rayClusterSpec:
    rayVersion: "2.9.3"
    # Enable the autoscaler
    enableInTreeAutoscaling: true
    headGroupSpec:
      rayStartParams:
        dashboard-host: "0.0.0.0"
      # pod template
      template:
        metadata:
          annotations:
            prometheus.io/path: /metrics
            prometheus.io/port: "8080"
            prometheus.io/scrape: "true"
        spec:
          imagePullSecrets:
            - name: lxy-docker-secret
          containers:
            - name: ray-head
              image: emr-vke-public-cn-beijing.cr.volces.com/emr/ray-ml:2.9.3-py3.9-ubuntu20.04-1.2.0
              ports:
                - containerPort: 6379
                  name: gcs-server
                - containerPort: 8265 # Ray dashboard
                  name: dashboard
                - containerPort: 10001
                  name: client
              resources:
                limits:
                  cpu: "1"
                requests:
                  cpu: "200m"
              volumeMounts:
                - mountPath: /home/ray/samples
                  name: code-sample
          volumes:
            # You set volumes at the Pod level, then mount them into containers inside that Pod
            - name: code-sample
              configMap:
                name: ray-job-code-sample
                items:
                  - key: sample_code.py
                    path: sample_code.py
    workerGroupSpecs:
      # the pod replicas in this group typed worker
      - replicas: 1
        minReplicas: 1
        maxReplicas: 5
        # logical group name, for this called small-group, also can be functional
        groupName: small-group
        rayStartParams: {}
        # pod template
        template:
          metadata:
            annotations:
              prometheus.io/path: /metrics
              prometheus.io/port: "8080"
              prometheus.io/scrape: "true"
          spec:
            imagePullSecrets:
              - name: lxy-docker-secret
            containers:
              # must consist of lower case alphanumeric characters or '-', and must start and end with an alphanumeric character (e.g. 'my-name', or '123-abc')
              - name: ray-worker
                image: emr-vke-public-cn-beijing.cr.volces.com/emr/ray-ml:2.9.3-py3.9-ubuntu20.04-1.2.0
                lifecycle:
                  preStop:
                    exec:
                      command: ["/bin/sh", "-c", "ray stop"]
                resources:
                  limits:
                    cpu: "1"
                  requests:
                    cpu: "200m"
  submitterPodTemplate:
    spec:
      imagePullSecrets:
        - name: lxy-docker-secret
      restartPolicy: Never
      containers:
        - name: my-custom-rayjob-submitter-pod
          image: emr-vke-public-cn-beijing.cr.volces.com/emr/ray-ml:2.9.3-py3.9-ubuntu20.04-1.2.0
###################### Ray code sample #################################
# this sample is from https://docs.ray.io/en/latest/cluster/job-submission.html#quick-start-example
# it is mounted into the container and executed to show the Ray job at work
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: ray-job-code-sample
data:
  sample_code.py: |
    import ray
    import os
    import requests

    ray.init()

    @ray.remote
    class Counter:
        def __init__(self):
            # Used to verify runtimeEnv
            self.name = os.getenv("counter_name")
            assert self.name == "test_counter"
            self.counter = 0

        def inc(self):
            self.counter += 1

        def get_counter(self):
            return "{} got {}".format(self.name, self.counter)

    counter = Counter.remote()

    for _ in range(5):
        ray.get(counter.inc.remote())
        print(ray.get(counter.get_counter.remote()))
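The YAML file contains the following three annotations related to the Prometheus monitoring system. They are set on the pod templates of both the head group and the worker group.
prometheus.io/path: /metrics
Specifies the path that Prometheus requests when scraping metrics. In this example, /metrics is the HTTP endpoint where the container exposes its metrics. The Prometheus server requests this path periodically to collect metric data.
prometheus.io/port: "8080"
Specifies the port that Prometheus uses when scraping metrics. In this example the port is 8080, so Prometheus tries to connect to port 8080 on each Ray cluster node (pod) to collect metrics.
prometheus.io/scrape: "true"
Specifies whether Prometheus should scrape the endpoint. Setting it to "true" means that Prometheus scrapes the metrics from this endpoint.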
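To make the interplay of the three annotations concrete, here is a minimal Python sketch of how an annotation-driven scrape job could turn them into a scrape URL. It is an illustration only: the function name scrape_url and the pod IP are hypothetical, and the source does not describe how the EMR Ray service actually evaluates the annotations. The sketch also simplifies one case by skipping pods that have no port annotation.

```python
from typing import Mapping, Optional


def scrape_url(pod_ip: str, annotations: Mapping[str, str]) -> Optional[str]:
    """Return the URL a scraper would request for a pod, or None if the pod is skipped.

    Hypothetical helper that mirrors the meaning of the three annotations;
    it is not the actual Prometheus or EMR implementation.
    """
    # prometheus.io/scrape: only pods that opt in with "true" are scraped.
    if annotations.get("prometheus.io/scrape", "").lower() != "true":
        return None

    # prometheus.io/port: the port to connect to (simplification: required here).
    port = annotations.get("prometheus.io/port")
    if port is None:
        return None

    # prometheus.io/path: the HTTP path to request, assumed to default to /metrics.
    path = annotations.get("prometheus.io/path", "/metrics")
    if not path.startswith("/"):
        path = "/" + path

    return f"http://{pod_ip}:{port}{path}"


# The annotations used in rayjob.metrics.yaml; 10.0.0.5 is a made-up pod IP.
ray_annotations = {
    "prometheus.io/path": "/metrics",
    "prometheus.io/port": "8080",
    "prometheus.io/scrape": "true",
}
print(scrape_url("10.0.0.5", ray_annotations))
```

With the annotations from the example, each head and worker pod would be scraped at port 8080 under /metrics. Changing prometheus.io/scrape to "false" makes the helper return None, which corresponds to the pod being left out of collection.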
Apply the YAML file
kubectl apply -f rayjob.metrics.yaml -n <namespace>
In the EMR cluster, the Ray metrics can be viewed under cluster monitoring.
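If no metrics appear, it can help to check the pod's metrics endpoint directly. The following is a minimal sketch, not part of the EMR tooling: the helper names fetch_metrics and parse_metric_names are hypothetical, and it assumes the endpoint returns the Prometheus text exposition format and that port 8080 of a Ray pod has been made reachable locally, for example with kubectl port-forward.

```python
import urllib.request
from typing import Dict


def parse_metric_names(text: str) -> Dict[str, int]:
    """Count the samples per metric name in Prometheus text-format output."""
    counts: Dict[str, int] = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and the "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        # The metric name ends at the first "{" (labels) or whitespace (value).
        name = line.split("{", 1)[0].split(None, 1)[0]
        counts[name] = counts.get(name, 0) + 1
    return counts


def fetch_metrics(url: str, timeout: float = 5.0) -> str:
    """Download the raw metrics text from a metrics endpoint."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8")


# Illustrative input only; these metric names are made up for the example.
sample = (
    "# HELP demo_requests_total Demo counter.\n"
    "# TYPE demo_requests_total counter\n"
    'demo_requests_total{node="a"} 3\n'
    'demo_requests_total{node="b"} 4\n'
    "demo_up 1\n"
)
print(parse_metric_names(sample))
```

Against a running job, one possible use is to run kubectl port-forward <head-pod-name> 8080:8080 -n <namespace> in another terminal and then call parse_metric_names(fetch_metrics("http://localhost:8080/metrics")). A non-empty result suggests that the pod is exposing metrics and that any remaining problem lies on the collection side.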
Delete the RayJob
kubectl delete -f rayjob.metrics.yaml -n <namespace>