启动一个ray job submit任务,该任务的日志有:
文件名 | 日志内容 | 示例 |
---|---|---|
job-driver-raysubmit_*.log | client提交日志 | |
runtime_env_setup-xxx.log | runtime env存储信息 | |
python-core-driver-*.log | driver端的日志 | |
worker-xxx-$(id)-xx.err | worker日志 | |
python-core-worker-*.log | worker端的日志 | |
gcs_server.* | GCS日志 |
在 VKE 集群 - 日志中心 - 日志采集规则 页面新建采集规则
由于采集日志的格式较多且存在不同的采集规则, 这里需要配置多条
新建规则时选择刚创建的TLS 项目与日志
其中每一套规则都需要配置K8s 采集规则
app.kubernetes.io/instance ray-cluster
采集路径:
/tmp/ray/session_*/logs/dashboard*
日志样例
2024-03-16 16:22:03,644 INFO metrics_agent.py:286 -- Metrics from a worker (8b330b47e5e538cb97c8f55092cec27bfceee0cf089da84bc7a5d1f3) is cleaned up due to timeout. Time since last report 120.52756787417457s 2024-03-16 16:23:04,414 ERROR optional_utils.py:293 -- Unexpected error in handler: Cannot write to closing transport Traceback (most recent call last): File "/home/ray/.conda/envs/ray/lib/python3.10/site-packages/ray/dashboard/optional_utils.py", line 291, in decorator return await f(self, *args, **kwargs) File "/home/ray/.conda/envs/ray/lib/python3.10/site-packages/ray/dashboard/modules/job/job_agent.py", line 193, in tail_job_logs await ws.send_str(lines) File "/home/ray/.conda/envs/ray/lib/python3.10/site-packages/aiohttp/web_ws.py", line 335, in send_str await self._writer.send(data, binary=False, compress=compress) File "/home/ray/.conda/envs/ray/lib/python3.10/site-packages/aiohttp/http_websocket.py", line 729, in send await self._send_frame(message, WSMsgType.TEXT, compress) File "/home/ray/.conda/envs/ray/lib/python3.10/site-packages/aiohttp/http_websocket.py", line 682, in _send_frame self._write(header + message) File "/home/ray/.conda/envs/ray/lib/python3.10/site-packages/aiohttp/http_websocket.py", line 702, in _write raise ConnectionResetError("Cannot write to closing transport") ConnectionResetError: Cannot write to closing transport 2024-03-16 16:23:04,415 INFO web_log.py:211 -- 192.168.0.26 [16/Mar/2024:16:20:25 +0800] "GET /api/job_agent/jobs/raysubmit_bAzBtNgnRMgkSuEG/logs/tail HTTP/1.1" 500 0 "-" "Python/3.10 aiohttp/3.9.3"
首行正则表达式
\d+-\d+-\d+\s\d+:\d+:\d+,\d+\s.*
采集路径
/tmp/ray/session_*/logs/gcs_server*
日志样例
[2024-03-14 19:21:59,582 I 24 24] (gcs_server) gcs_server.cc:846: Event stats: Global stats: 32 total (18 active) Queueing time: mean = 6.000 ms, max = 64.021 ms, min = 1.671 us, total = 191.992 ms Execution time: mean = 2.004 ms, total = 64.122 ms Event stats: GcsInMemoryStore.Put - 6 total (3 active), Execution time: mean = 10.672 ms, total = 64.033 ms, Queueing time: mean = 10.617 ms, max = 63.697 ms, min = 1.671 us, total = 63.702 ms InternalKVGcsService.grpc_client.InternalKVPut - 6 total (6 active), Execution time: mean = 0.000 s, total = 0.000 s, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s GcsInMemoryStore.GetAll - 5 total (0 active), Execution time: mean = 5.677 us, total = 28.383 us, Queueing time: mean = 24.712 us, max = 28.782 us, min = 21.689 us, total = 123.559 us InternalKVGcsService.grpc_server.InternalKVPut - 4 total (4 active), Execution time: mean = 0.000 s, total = 0.000 s, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s InternalKVGcsService.grpc_server.InternalKVPut.HandleRequestImpl - 4 total (1 active), Execution time: mean = 8.098 us, total = 32.392 us, Queueing time: mean = 44.057 us, max = 60.544 us, min = 55.316 us, total = 176.230 us PeriodicalRunner.RunFnPeriodically - 4 total (2 active, 1 running), Execution time: mean = 4.667 us, total = 18.669 us, Queueing time: mean = 31.997 ms, max = 64.021 ms, min = 63.966 ms, total = 127.988 ms GcsInMemoryStore.Get - 1 total (0 active), Execution time: mean = 8.677 us, total = 8.677 us, Queueing time: mean = 2.324 us, max = 2.324 us, min = 2.324 us, total = 2.324 us ClusterResourceManager.ResetRemoteNodeView - 1 total (1 active), Execution time: mean = 0.000 s, total = 0.000 s, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s RayletLoadPulled - 1 total (1 active), Execution time: mean = 0.000 s, total = 0.000 s, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s [2024-03-14 19:21:59,583 I 24 24] (gcs_server) gcs_server.cc:847: GcsTaskManager Event stats:
首行正则表达式:
\[\d+-\d+-\d+\s\d+:\d+:\d+,\d+\s.*
采集路径
/tmp/ray/session_*/logs/job-driver-raysubmit*
文件样例
2024-03-16 16:17:48,360 INFO worker.py:1540 -- Connecting to existing Ray cluster at address: 192.168.0.26:6379... 2024-03-16 16:17:48,366 INFO worker.py:1715 -- Connected to Ray cluster. View the dashboard at ^[[1m^[[32m192.168.0.26:8265 ^[[39m^[[22m DeprecationWarning: `route_prefix` in `@serve.deployment` has been deprecated. To specify a route prefix for an application, pass it into `serve.run` instead. ^[[36m(ProxyActor pid=22282)^[[0m INFO 2024-03-16 16:17:52,005 proxy 192.168.0.26 proxy.py:1143 - Proxy actor 3ea43c4b93176981b6c12a1e17000000 starting on node ce22d6b42fd3df5ce0ecd2799287a87da1dfbda85ca591bc30428d00. ^[[36m(ProxyActor pid=22282)^[[0m INFO 2024-03-16 16:17:52,009 proxy 192.168.0.26 proxy.py:1357 - Starting HTTP server on node: ce22d6b42fd3df5ce0ecd2799287a87da1dfbda85ca591bc30428d00 listening on port 8000 ^[[36m(ProxyActor pid=22282)^[[0m INFO: Started server process [22282] ^[[36m(ServeController pid=22226)^[[0m INFO 2024-03-16 16:17:52,200 controller 22226 deployment_state.py:1547 - Deploying new version of deployment BoostingModel in application 'default'. Setting initial target number of replicas to 1. ^[[36m(ServeController pid=22226)^[[0m INFO 2024-03-16 16:17:52,302 controller 22226 deployment_state.py:1831 - Adding 1 replica to deployment BoostingModel in application 'default'. {"result":"versicolor"} ^[[36m(ServeReplica:default:BoostingModel pid=22335)^[[0m Received http request with data [1.2, 1.0, 1.1, 0.9] ^[[36m(ServeReplica:default:BoostingModel pid=22335)^[[0m INFO 2024-03-16 16:17:55,188 default_BoostingModel xQuXbM 9809fe8a-5c91-448f-be90-35fe3dab8a15 /iris replica.py:772 - __CALL__ OK 5.2ms
首行正则表达式
\d+-\d+-\d+\s\d+:\d+:\d+,\d+\s.*
解析采集路径:
采集路径样例
/tmp/ray/session_2024-03-14_19-21-59_498571_8/logs/job-driver-raysubmit_1388WhQ575piY84f.log
提取正则表达式
/tmp/ray/session_.*/logs/job-driver-(raysubmit_[A-Za-z0-9]*)\.log
提取结果: submitId
采集路径
/tmp/ray/session_*/logs/monitor*
文件样例
2024-03-14 19:22:06,072 INFO autoscaler.py:426 -- ======== Autoscaler status: 2024-03-14 19:22:06.072033 ======== Node status --------------------------------------------------------------- Active: 1 node_ce22d6b42fd3df5ce0ecd2799287a87da1dfbda85ca591bc30428d00 Pending: (no pending nodes) Recent failures: (no failures) Resources --------------------------------------------------------------- Usage: 0.0/2.0 CPU 0.0/1.0 GPU 0B/9.31GiB memory 0B/2.70GiB object_store_memory Demands: (no resource demands) 2024-03-14 19:22:06,072 INFO autoscaler.py:469 -- The autoscaler took 0.001 seconds to complete the update iteration.
首行正则表达式
\d+-\d+-\d+\s\d+:\d+:\d+,\d+\s.*
采集路径
/tmp/ray/session_*/logs/python-core-driver-*
文件格式
[2024-03-14 19:23:38,356 I 788 788] core_worker.cc:2737: Creating actor: 548cb9bd733042def4f87ccf03000000 [2024-03-14 19:23:41,264 I 788 788] direct_actor_transport.cc:188: Actor creation task finished, task_id: ffffffffffffffff548cb9bd733042def4f87ccf03000000, actor_id: 548cb9bd733042def4f87ccf03000000, actor_repr_name: [2024-03-14 19:24:38,355 I 788 817] core_worker.cc:588: Event stats: Global stats: 946 total (8 active) Queueing time: mean = 62.903 us, max = 38.679 ms, min = 8.188 us, total = 59.506 ms Execution time: mean = 32.356 ms, total = 30.609 s Event stats: CoreWorker.RecoverObjects - 600 total (1 active), Execution time: mean = 3.418 us, total = 2.051 ms, Queueing time: mean = 88.084 us, max = 38.679 ms, min = 9.317 us, total = 52.850 ms CoreWorker.ExitIfParentRayletDies - 61 total (1 active), Execution time: mean = 7.183 us, total = 438.167 us, Queueing time: mean = 24.236 us, max = 41.793 us, min = 11.091 us, total = 1.478 ms NodeManagerService.grpc_client.ReportWorkerBacklog - 60 total (0 active), Execution time: mean = 394.125 us, total = 23.648 ms, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s CoreWorker.InternalHeartbeat - 60 total (1 active), Execution time: mean = 86.542 us, total = 5.193 ms, Queueing time: mean = 27.382 us, max = 398.484 us, min = 9.719 us, total = 1.643 ms NodeManagerService.grpc_client.ReportWorkerBacklog.OnReplyReceived - 60 total (0 active), Execution time: mean = 13.998 us, total = 839.895 us, Queueing time: mean = 20.086 us, max = 29.476 us, min = 10.978 us, total = 1.205 ms CoreWorkerService.grpc_server.PushTask - 18 total (1 active), Execution time: mean = 1.698 s, total = 30.566 s, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s CoreWorkerService.grpc_server.PushTask.HandleRequestImpl - 18 total (0 active), Execution time: mean = 42.514 us, total = 765.250 us, Queueing time: mean = 23.470 us, max = 34.842 us, min = 12.368 us, total = 422.467 us CoreWorkerService.grpc_server.GetCoreWorkerStats - 12 total (0 active), Execution time: mean = 195.349 us, total = 2.344 ms, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s CoreWorkerService.grpc_server.GetCoreWorkerStats.HandleRequestImpl - 12 total (0 active), Execution time: mean = 42.440 us, total = 509.284 us, Queueing time: mean = 29.899 us, max = 74.633 us, min = 19.090 us, total = 358.783 us CoreWorker.RecordMetrics - 12 total (1 active), Execution time: mean = 103.716 us, total = 1.245 ms, Queueing time: mean = 13.754 us, max = 21.320 us, min = 11.162 us, total = 165.050 us [2024-03-14 19:23:41,264 I 788 788] direct_actor_transport.cc:188: Actor creation task finished, task_id: ffffffffffffffff548cb9bd733042def4f87ccf03000000, actor_id: 548cb9bd733042def4f87ccf03000000, actor_repr_name:
首行正则表达式
\[\d+-\d+-\d+\s\d+:\d+:\d+,\d+\s.*
采集路径样例
/tmp/ray/session_2024-03-14_19-21-59_498571_8/logs/python-core-driver-02000000ffffffffffffffffffffffffffffffffffffffffffffffff_261.log
提取正则表达式
/tmp/ray/session_.*/logs/python-core-driver-([0-9]*)fffffffffffffffffffffff.*.log
提取结果: jobId
文件路径
/tmp/ray/session_*/logs/python-core-worker-*
日志样例
[2024-03-14 19:23:38,356 I 788 788] core_worker.cc:2737: Creating actor: 548cb9bd733042def4f87ccf03000000 [2024-03-14 19:23:41,264 I 788 788] direct_actor_transport.cc:188: Actor creation task finished, task_id: ffffffffffffffff548cb9bd733042def4f87ccf03000000, actor_id: 548cb9bd733042def4f87ccf03000000, actor_repr_name: [2024-03-14 19:24:38,355 I 788 817] core_worker.cc:588: Event stats: Global stats: 946 total (8 active) Queueing time: mean = 62.903 us, max = 38.679 ms, min = 8.188 us, total = 59.506 ms Execution time: mean = 32.356 ms, total = 30.609 s Event stats: CoreWorker.RecoverObjects - 600 total (1 active), Execution time: mean = 3.418 us, total = 2.051 ms, Queueing time: mean = 88.084 us, max = 38.679 ms, min = 9.317 us, total = 52.850 ms CoreWorker.ExitIfParentRayletDies - 61 total (1 active), Execution time: mean = 7.183 us, total = 438.167 us, Queueing time: mean = 24.236 us, max = 41.793 us, min = 11.091 us, total = 1.478 ms NodeManagerService.grpc_client.ReportWorkerBacklog - 60 total (0 active), Execution time: mean = 394.125 us, total = 23.648 ms, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s CoreWorker.InternalHeartbeat - 60 total (1 active), Execution time: mean = 86.542 us, total = 5.193 ms, Queueing time: mean = 27.382 us, max = 398.484 us, min = 9.719 us, total = 1.643 ms NodeManagerService.grpc_client.ReportWorkerBacklog.OnReplyReceived - 60 total (0 active), Execution time: mean = 13.998 us, total = 839.895 us, Queueing time: mean = 20.086 us, max = 29.476 us, min = 10.978 us, total = 1.205 ms CoreWorkerService.grpc_server.PushTask - 18 total (1 active), Execution time: mean = 1.698 s, total = 30.566 s, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s CoreWorkerService.grpc_server.PushTask.HandleRequestImpl - 18 total (0 active), Execution time: mean = 42.514 us, total = 765.250 us, Queueing time: mean = 23.470 us, max = 34.842 us, min = 12.368 us, total = 422.467 us CoreWorkerService.grpc_server.GetCoreWorkerStats - 12 total (0 active), Execution time: mean = 195.349 us, total = 2.344 ms, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s CoreWorkerService.grpc_server.GetCoreWorkerStats.HandleRequestImpl - 12 total (0 active), Execution time: mean = 42.440 us, total = 509.284 us, Queueing time: mean = 29.899 us, max = 74.633 us, min = 19.090 us, total = 358.783 us CoreWorker.RecordMetrics - 12 total (1 active), Execution time: mean = 103.716 us, total = 1.245 ms, Queueing time: mean = 13.754 us, max = 21.320 us, min = 11.162 us, total = 165.050 us [2024-03-14 19:23:41,264 I 788 788] direct_actor_transport.cc:188: Actor creation task finished, task_id: ffffffffffffffff548cb9bd733042def4f87ccf03000000, actor_id: 548cb9bd733042def4f87ccf03000000, actor_repr_name:
首行正则表达式
\[\d+-\d+-\d+\s\d+:\d+:\d+,\d+\s.*
采集路径样例
/tmp/ray/session_2024-03-14_19-21-59_498571_8/logs/python-core-worker-009a17774013feebeec40ff17429f237265ef3ca550a9cfe1c280074_7754.log
提取正则表达式
/tmp/ray/session_.*/logs/python-core-worker-([A-Za-z0-9]*)_.*.log
提取结果: workerId
采集路径
/tmp/ray/session_*/logs/ray_client_server*
日志格式
2024-03-14 19:22:01,378 INFO server.py:885 -- Starting Ray Client server on 0.0.0.0:10001, args Namespace(host='0.0.0.0', port=10001, mode='proxy', address='192.168.0.26:6379', redis_password=None, runtime_env_agent_address='http://192.168.0.26:50711')
首行正则表达式:
\d+-\d+-\d+\s\d+:\d+:\d+,\d+\s.*
采集路径
/tmp/ray/session_*/logs/raylet*
日志格式
[2024-03-14 19:22:01,572 I 211 211] (raylet) main.cc:174: Setting cluster ID to: ccbe1b0b8b55be54fbb0a0487aa1741d4ed4bad8e4c63f014f60063b [2024-03-14 19:22:01,575 I 211 211] (raylet) io_service_pool.cc:35: IOServicePool is running with 1 io_service. [2024-03-14 19:22:01,576 I 211 211] (raylet) store_runner.cc:32: Allowing the Plasma store to use up to 2.9011GB of memory. [2024-03-14 19:22:01,576 I 211 211] (raylet) store_runner.cc:48: Starting object store with directory /dev/shm, fallback /tmp/ray, and huge page support disabled [2024-03-14 19:22:01,576 I 211 242] (raylet) dlmalloc.cc:154: create_and_mmap_buffer(2901147656, /dev/shm/plasmaXXXXXX) [2024-03-14 19:22:01,576 I 211 242] (raylet) store.cc:561: ========== Plasma store: ================= Current usage: 0 / 2.9011 GB - num bytes created total: 0 0 pending objects of total size 0MB - objects spillable: 0 - bytes spillable: 0 - objects unsealed: 0 - bytes unsealed: 0 - objects in use: 0 - bytes in use: 0 - objects evictable: 0 - bytes evictable: 0 - objects created by worker: 0 - bytes created by worker: 0 - objects restored: 0 - bytes restored: 0 - objects received: 0 - bytes received: 0 - objects errored: 0 - bytes errored: 0 [2024-03-14 19:22:01,578 I 211 211] (raylet) grpc_server.cc:129: ObjectManager server started, listening on port 34401.
首行正则表达式:
\[\d+-\d+-\d+\s\d+:\d+:\d+,\d+\s.*
采集路径
/tmp/ray/session_*/logs/runtime_env*
采集日志
2024-03-16 16:26:36,640 INFO runtime_env_agent.py:509 -- Got request from raylet to decrease reference for runtime env: {"working_dir": "gcs://_ray_pkg_7e51b6d3d9a2e7cd.zip"}. 2024-03-16 16:26:36,641 INFO runtime_env_agent.py:129 -- Unused runtime env {"working_dir": "gcs://_ray_pkg_7e51b6d3d9a2e7cd.zip"}. 2024-03-16 16:26:36,641 INFO runtime_env_agent.py:263 -- Runtime env {"working_dir": "gcs://_ray_pkg_7e51b6d3d9a2e7cd.zip"} removed from env-level cache.
首行正则表达式
\d+-\d+-\d+\s\d+:\d+:\d+,\d+\s.*
采集路径样例 /tmp/ray/session_2024-03-14_19-21-59_498571_8/logs/runtime_env_setup-01000000.log
提取正则表达式
/tmp/ray/session_.*/logs/runtime_env_setup-([A-Za-z0-9]*).log
提取结果 jobId
采集路径
/tmp/ray/session_*/logs/worker-*
日志样例
[2024-03-16 10:39:15,071 E 17806 17806] logging.cc:361: *** SIGABRT received at time=1710556755 on cpu 13 *** ray::JobSupervisor(PyEval_EvalCode+0x87) [0x591cd7] PyEval_EvalCode ray::JobSupervisor() [0x5c2967] run_eval_code_obj ray::JobSupervisor() [0x5bdad0] run_mod ray::JobSupervisor() [0x45956b] pyrun_file.cold ray::JobSupervisor(_PyRun_SimpleFileObject+0x19f) [0x5b805f] _PyRun_SimpleFileObject ray::JobSupervisor(_PyRun_AnyFileObject+0x43) [0x5b7dc3] _PyRun_AnyFileObject ray::JobSupervisor(Py_RunMain+0x38d) [0x5b4b7d] Py_RunMain ray::JobSupervisor(Py_BytesMain+0x39) [0x584e49] Py_BytesMain /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf3) [0x7faf7df45083] __libc_start_main ray::JobSupervisor() [0x584cfe] *** SIGABRT received at time=1710556755 on cpu 13 *** PC: @ 0x7faf7df6400b (unknown) raise @ 0x7faf7e281420 1491767840 (unknown) @ 0x7faf7c13135a 80 __cxxabiv1::__terminate() @ 0x7faf7c8d3a4a 32 std::_Sp_counted_base<>::_M_release() @ 0x7faf7ca58bd2 96 std::_Sp_counted_ptr_inplace<>::_M_dispose() @ 0x7faf7c8d3a4a 32 std::_Sp_counted_base<>::_M_release() @ 0x7faf7c990032 144 std::default_delete<>::operator()() @ 0x7faf7ca01fdb 128 ray::core::CoreWorker::~CoreWorker() @ 0x7faf7c8d3a4a 32 std::_Sp_counted_base<>::_M_release() @ 0x7faf7ca40154 112 ray::core::CoreWorkerProcessImpl::RunWorkerTaskExecutionLoop() @ 0x7faf7ca4025d 32 ray::core::CoreWorkerProcess::RunTaskExecutionLoop() @ 0x7faf7c852667 32 __pyx_pw_3ray_7_raylet_10CoreWorker_7run_task_loop() @ 0x4ff2f4 (unknown) method_vectorcall_NOARGS @ ... and at least 1 more frames [2024-03-16 10:39:15,071 E 17806 17806] logging.cc:361: *** SIGABRT received at time=1710556755 on cpu 13 ***
首行正则表达式
\[?\d+-\d+-\d+\s\d+:\d+:\d+,\d+\s.*
采集路径样例 /tmp/ray/session_2024-03-14_19-21-59_498571_8/logs/worker-02db26fc7946af0a8de4aa4cfe923e2141a6078677d17180dda261e2-10000000-14731.out
提取正则表达式
/tmp/ray/session_.*/logs/worker-([A-Za-z0-9]*)-([A-Za-z0-9]*)-.*
提取结果 workerId jobId
查询条件 * | SELECT DISTINCT(submitId)
查询条件 submitId: "raysubmit_kE8xm63Wi36UBHaK"
查询条件: jobId: "1b000000" | select DISTINCT(workerId)
查询条件: jobId: "1b000000" AND workerId: "0ae97b84873e69ea83836a5f49b40ba55330d2875e8300820a8d2e16"
查询条件: jobId: "1b000000" AND content: "GPU"