
Deploying a Ray Cluster Manually


What can you do with a handful of underpowered VPSes? How about getting them to team up?

1. Install Ray on every VPS

apt-get install python3-pip
pip3 install -U ray
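Ray generally expects every node in a cluster to run the same Ray version and a matching Python version, so it may be worth checking each VPS after installing. Here is a minimal sketch using only the standard library (Python 3.8+); the helper name `installed_version` is made up for illustration and is not part of Ray:

```python
import platform
from importlib import metadata


def installed_version(package):
    """Return the installed version string of `package`, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    # Run this on every node and compare the printed lines.
    print("python:", platform.python_version())
    print("ray:", installed_version("ray"))
```

If `ray:` prints `None` on a node, the pip install did not land in the Python environment that `python3` uses there.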

2. Deploy the cluster

Pick one VPS to act as the head node

ray start --head --port=6379
Ray starts successfully on the head node and prints:
Local node IP: IP-of-head-node
2020-12-23 16:12:46,527 INFO services.py:1092 -- View the Ray dashboard at http://localhost:8265

--------------------
Ray runtime started.
--------------------

Next steps
  To connect to this Ray runtime from another node, run
    ray start --address='IP-of-head-node:6379' --redis-password='5241590000000000'
  
  Alternatively, use the following Python code:
    import ray
    ray.init(address='auto', _redis_password='5241590000000000')
  
  If connection fails, check your firewall settings and network configuration.
  
  To terminate the Ray runtime, run
    ray stop

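Use the remaining VPSes as worker nodes

Following the hint printed by the head node, run this command on every worker node (replace IP-of-head-node with the head node's actual IP address)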

ray start --address='IP-of-head-node:6379' --redis-password='5241590000000000'
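If a worker fails to join, the head node's output above suggests checking the firewall and network configuration. A rough way to do that from a worker is to test whether the head node's port accepts TCP connections at all. The sketch below uses only the standard library; `can_connect` is a hypothetical helper name, and 6379 is the port passed to `ray start --head` above:

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unresolvable host names.
        return False


if __name__ == "__main__":
    # "IP-of-head-node" is the same placeholder as above; replace it with the real address.
    print(can_connect("IP-of-head-node", 6379))
```

Ray nodes also talk to each other on other ports, so a `True` here is necessary but may not be sufficient for the worker to join.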

3. Test whether the worker nodes have joined the head node to form a cluster

Create a Python script on the head node

vim ray_test.py
import ray
import time

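# ray.init(address="auto")
# Connect to the running cluster; the password is the one printed by `ray start --head`.
ray.init(address='auto', _redis_password='5241590000000000')

@ray.remote
def f():
    # Sleep briefly so the tasks have a chance to be scheduled on several nodes.
    time.sleep(0.01)
    # Return the IP address of the node this task ran on.
    return ray._private.services.get_node_ip_address()

if __name__ == "__main__":
    # Launch 1000 tasks and collect the distinct node IPs that executed them.
    ips = set(ray.get([f.remote() for _ in range(1000)]))
    print(ips)
    print(len(ips))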

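The f function here has been modified from the one in the official example; without the change, the script fails with this error:

AttributeError: module 'ray' has no attribute 'services'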

Run the script on the head node

python3 ray_test.py

Output

2020-12-23 16:34:03,700 INFO worker.py:651 -- Connecting to existing Ray cluster at address: 192.3.231.zz:6379
{'67.198.228.xx', '104.168.89.yy', '192.3.231.zz'}
3

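This shows that the head node connected to the worker nodes, that f() ran on them, and that their IP addresses came back. The output lists three VPSes in the cluster: one head node and two worker nodes.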
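The set above only shows which nodes took part. To see how evenly the 1000 tasks were spread, one option is to count the returned IPs instead of deduplicating them. This is a small sketch; `tasks_per_node` is a made-up helper name, and the sample addresses are placeholders standing in for the list that `ray.get(...)` returns in the script:

```python
from collections import Counter


def tasks_per_node(ips):
    """Count how many task results came back from each node IP."""
    return Counter(ips)


if __name__ == "__main__":
    # Placeholder addresses standing in for the list returned by ray.get(...).
    sample = ["10.0.0.1", "10.0.0.2", "10.0.0.1", "10.0.0.3", "10.0.0.1"]
    for ip, count in tasks_per_node(sample).most_common():
        print(ip, count)
```

In ray_test.py this would mean passing `ray.get([f.remote() for _ in range(1000)])` to `tasks_per_node` rather than wrapping it in `set(...)`.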

4. Afterword

The cluster is now up and running. The next step is to try some small programs that are heavy on network requests.
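As a starting point, a request-heavy job could look something like the sketch below. It is only an illustration under assumptions: `fetch_status` and `fetch_all_on_cluster` are made-up names, the URL list is supplied by the caller, and `ray.init(address="auto")` may need the `_redis_password` argument shown earlier on the Ray version used in this post.

```python
import urllib.error
import urllib.request


def fetch_status(url, timeout=5.0):
    """Return (url, HTTP status code), or (url, None) if the request could not be made."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return url, resp.status
    except urllib.error.HTTPError as exc:
        # The server answered, but with an error status such as 404.
        return url, exc.code
    except (OSError, ValueError):
        # Network failure, timeout, or a malformed URL.
        return url, None


def fetch_all_on_cluster(urls):
    """Fan the requests out as Ray tasks; needs a running cluster (assumption)."""
    import ray  # imported lazily so fetch_status can be tried without Ray installed

    ray.init(address="auto")  # older Ray versions may also need _redis_password=...
    remote_fetch = ray.remote(fetch_status)
    return ray.get([remote_fetch.remote(u) for u in urls])
```

Calling `fetch_all_on_cluster([...])` on the head node with a list of URLs should return one `(url, status)` pair per URL, with the individual requests scheduled across the cluster's nodes.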

References:
1. Official Ray documentation
2. Ray Distributed AI Framework Curriculum Offered on the Intel® AI DevCloud
3. Modern Parallel and Distributed Python: A Quick Tutorial on Ray
