在 快速开始 中成功送出了第一个 Primus 训练任务,现在您可以试着使用 Primus 进行分布式的 TensorFlow 训练任务吧!在这里会示范三种不同的 TensorFlow 分布式策略依序为 Single Node,MultiWorkerMirrored 以及 ParameterServer。
由于 TensorFlow 训练需要训练资料以及 Python 环境,在这里您需要进行更多的准备工作!
# Change to yarn user $ su --shell=/bin/bash - yarn # Create the workspace $ mkdir ~/primus-playground $ cd ~/primus-playground $ cp -r /usr/lib/emr/current/tensorflow_on_yarn/examples . # Build the Python virtual environment $ cd examples/shared/venv $ ./build.sh # Prepare the workspace on HDFS and the datasets $ cd ~/primus-playground/ $ hdfs dfs -mkdir mnist $ hdfs dfs -mkdir mnist/models $ hdfs dfs -put examples/shared/mnist/data mnist
注意
在一切准备工作就绪之后,您就可以开始分布式的 TensorFlow 训练了!
首先您可以先来观察一下 Primus 训练配置,从配置中可以发现在设定上相较于 Hello Primus,多指定了训练资源,其中包含了 Primus virtual environent 跟训练脚本,同时有了更复杂的训练指令!
{ "name": "primus_tensorflow_single", "files": [ "examples/shared/venv/venv.tar.gz", // Python virtual environent "examples/tensorflow-single" // 训练脚本资料夹路径 ], "role": [ { "roleName": "main", "num": 1, // 单点训练 "vcores": 1, "memoryMb": 512, "jvmMemoryMb": 512, "command": "./tensorflow-single/main.sh venv.tar.gz", // 训练指令 "successPercent": 100, "failover": { "commonFailoverPolicy": { "commonFailover": { "maxFailureTimes": 10, "maxFailurePolicy": "FAIL_ATTEMPT" } } } } ] }
使用 primus-submit 提交训练!
# Submit Primus application $ cd ~/primus-playground $ primus-submit --primus_conf examples/tensorflow-single/primus_config.json ... 22/03/03 18:36:47 INFO impl.YarnClientImpl: Submitted application <YARN-APPLICATION-ID> 22/03/03 18:36:47 INFO client.YarnSubmitCmdRunner: Tracking URL: http://emr-master-1-1:8088/proxy/<YARN-APPLICATION-ID>/ 22/03/03 18:36:57 INFO client.YarnSubmitCmdRunner: Training successfully started. Scheduling took 10010 ms. 22/03/03 18:38:18 INFO client.YarnSubmitCmdRunner: State: FINISHED Progress: 100.0% 22/03/03 18:38:18 INFO client.YarnSubmitCmdRunner: Application <YARN-APPLICATION-ID> finished with state FINISHED at 2022-03-03 18:38 22/03/03 18:38:18 INFO client.YarnSubmitCmdRunner: Final Application Status: SUCCEEDED ... # Observe YARN logs $ yarn logs --applicationId <YARN-APPLICATION-ID> | grep -E "Epoch|FIN" ... + echo FIN Epoch 1/5 Epoch 2/5 Epoch 3/5 Epoch 4/5 Epoch 5/5 FIN ...
最后因为这个范例有将模型输出到 HDFS 上,所以您可以透过 Python 脚本测试模型的表现!
$ cd ~/primus-playground/examples/tensorflow-single $ export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64 $ export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$JAVA_HOME/jre/lib/amd64/server/ $ export HADOOP_HDFS_HOME=/usr/lib/emr/current/hadoop $ export CLASSPATH=$(hadoop classpath --glob) $ python3.9 evaluate.py \ --mnist hdfs://emr-master-1-1:8020/user/yarn/mnist/data \ --model hdfs://emr-master-1-1:8020/user/yarn/mnist/models/model-single ... Model accuracy: [0.29252758622169495, 0.9218999743461609] ...
在提交 Primus 训练任务之前,您可以观察一下 Primus 训练配置。可以快速发现 Multi Worker Mirrored 训练的 Primus 训练配置和 Single Node 训练的 Primus 训练配置非常相似,主要的差异为Multi Worker Mirrored 的训练需要多个节点来完成 ,因此在 Primus 训练配置中,将 “num” 的节点数设定成 2。
{ "name": "primus_tensorflow_multiworkermirrored", "maxAppAttempts": 1, "files": [ "examples/shared/venv/venv.tar.gz", "examples/tensorflow-multiworkermirrored" ], "role": [ { "roleName": "worker", "num": 2, // 双点训练 "vcores": 1, "memoryMb": 512, "jvmMemoryMb": 512, "command": "./tensorflow-multiworkermirrored/main.sh venv.tar.gz", "successPercent": 100, "failover": { "hybridDeploymentFailoverPolicy": { "commonFailover": { "maxFailureTimes": 10, "maxFailurePolicy": "FAIL_ATTEMPT" } } } } ] }
使用 primus-submit 提交训练!
# Submit Primus application $ cd ~/primus-playground $ primus-submit --primus_conf examples/tensorflow-multiworkermirrored/primus_config.json ... 22/03/03 18:42:53 INFO impl.YarnClientImpl: Submitted application <YARN-APPLICATION-ID> 22/03/03 18:42:53 INFO client.YarnSubmitCmdRunner: Tracking URL: http://emr-master-1-1:8088/proxy/<YARN-APPLICATION-ID>/ 22/03/03 18:43:03 INFO client.YarnSubmitCmdRunner: Training successfully started. Scheduling took 10014 ms. 22/03/03 18:44:34 INFO client.YarnSubmitCmdRunner: State: FINISHED Progress: 100.0% 22/03/03 18:44:34 INFO client.YarnSubmitCmdRunner: Application <YARN-APPLICATION-ID> finished with state FINISHED at 2022-03-03 18:44 22/03/03 18:44:34 INFO client.YarnSubmitCmdRunner: Final Application Status: SUCCEEDED ... # Observe YARN logs $ yarn logs --applicationId <YARN-APPLICATION-ID> | grep -E "Epoch|FIN" ... + echo FIN Epoch 1/3 Epoch 2/3 Epoch 3/3 FIN + echo FIN Epoch 1/3 Epoch 2/3 Epoch 3/3 FIN ...
同样的因为这个范例最后有将模型输出到 HDFS 上,所以您也可以透过 Python 脚本测试模型的表现!
$ cd ~/primus-playground/examples/tensorflow-multiworkermirrored $ export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64 $ export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$JAVA_HOME/jre/lib/amd64/server/ $ export HADOOP_HDFS_HOME=/usr/lib/emr/current/hadoop $ export CLASSPATH=$(hadoop classpath --glob) $ python3.9 evaluate.py \ --mnist hdfs://emr-master-1-1:8020/user/yarn/mnist/data \ --model hdfs://emr-master-1-1:8020/user/yarn/mnist/models/model-multiworkermirrored ... Model accuracy: [2.0982000827789307, 0.5758000016212463] ...
一样的从观察 Primus 训练配置开始,相较于 Singe Node 的 Primus 训练配置,Parameter Server 所需要的 Primus 训练配置巨大了许多。
但是其中最主要的差别只有两个部分,分别是更多的角色以及 PS 和 Worker 这两个角色的退出条件为 0%,因为在 TensorFlow Parameter Server 的分布式策略中,这两种角色属于常驻型角色因此训练进程不会自行退出。
{ "name": "primus_tensorflow_parameterserver", "maxAppAttempts": 1, "files": [ "examples/basics/shared/venv/venv.tar.gz", "examples/basics/tensorflow-parameterserver/main.sh", "examples/basics/tensorflow-parameterserver/chief.py", "examples/basics/tensorflow-parameterserver/ps.py", "examples/basics/tensorflow-parameterserver/worker.py" ], "role": [ // 更多的角色! { "roleName": "chief", "num": 1, "vcores": 1, "memoryMb": 512, "jvmMemoryMb": 512, "command": "./main.sh venv.tar.gz chief.py", "successPercent": 100, "failover": { "commonFailoverPolicy": { "maxFailureTimes": 1, "maxFailurePolicy": "FAIL_ATTEMPT" } } }, { "roleName": "ps", "num": 1, "vcores": 1, "memoryMb": 512, "jvmMemoryMb": 512, "command": "./main.sh venv.tar.gz ps.py", "successPercent": 0, // TensorFlow strategy 里的 Parameter Server 在是常驻的,因此我们不需要等待这个 progress 完成 "failover": { "commonFailoverPolicy": { "maxFailureTimes": 1, "maxFailurePolicy": "FAIL_ATTEMPT" } } }, { "roleName": "worker", "num": 1, "vcores": 1, "memoryMb": 512, "jvmMemoryMb": 512, "command": "./main.sh venv.tar.gz worker.py", "successPercent": 0, // 同 Parameter Server "failover": { "commonFailoverPolicy": { "maxFailureTimes": 1, "maxFailurePolicy": "FAIL_ATTEMPT" } } } ] }
使用 primus-submit 提交训练!
# Submit Primus application $ cd ~/primus-playground $ primus-submit --primus_conf examples/tensorflow-parameterserver/primus_config.json ... 22/03/03 18:58:39 INFO impl.YarnClientImpl: Submitted application <YARN-APPLICATION-ID> 22/03/03 18:58:39 INFO client.YarnSubmitCmdRunner: Tracking URL: http://emr-master-1-1:8088/proxy/<YARN-APPLICATION-ID>/ 22/03/03 18:58:49 INFO client.YarnSubmitCmdRunner: Training successfully started. Scheduling took 10010 ms. 22/03/03 19:00:23 INFO client.YarnSubmitCmdRunner: State: FINISHED Progress: 100.0% 22/03/03 19:00:23 INFO client.YarnSubmitCmdRunner: Application <YARN-APPLICATION-ID> finished with state FINISHED at 2022-03-03 19:00 22/03/03 19:00:23 INFO client.YarnSubmitCmdRunner: Final Application Status: SUCCEEDED ... # Observe YARN logs $ yarn logs --applicationId <YARN-APPLICATION-ID> | grep -E "Epoch|FIN" ... + echo FIN Epoch 1/5 Epoch 2/5 Epoch 3/5 Epoch 4/5 Epoch 5/5 FIN ...