Description
I’ve got 3 standard Supermicro towers with 256GB RAM, an SSD, 5 HDDs, and 4 GPUs each. Ethernet connects them to the “controller” Dell server with access to the internet and is supposed to gate SSH…
Summary
- Hardware: I’ve got 3 standard Supermicro towers with 256GB RAM, an SSD, 5 HDDs, and 4 GPUs each.
- As I mentioned in one of my old blog posts, it is critical to disable IOMMU if you plan to use peer-to-peer GPU communication, e.g., multi-GPU model training in TensorFlow or PyTorch.
- If the users do not care about high availability and failovers, it is enough to spawn only one controller.
- mergerFS is a nice FUSE-based tool (it runs in user space and needs no custom kernel module) to reach that goal.
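
For the IOMMU point above, here is a minimal sketch of how one might check a node's IOMMU state before starting multi-GPU training. It is not taken from the original setup: the helper names `iommu_params` and `active_iommu_devices` are hypothetical, the list of kernel parameters is an assumption and not exhaustive, and the sketch assumes a Linux host that exposes `/proc/cmdline` and `/sys/class/iommu`.

```python
from pathlib import Path

# Kernel parameters that commonly control the IOMMU.
# Assumed list for illustration; it is not exhaustive.
IOMMU_KEYS = ("iommu", "intel_iommu", "amd_iommu")


def iommu_params(cmdline: str) -> dict[str, str]:
    """Return the IOMMU-related key=value pairs found in a kernel command line."""
    found = {}
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        # Keep only tokens of the form key=value whose key is IOMMU-related.
        if sep and key in IOMMU_KEYS:
            found[key] = value
    return found


def active_iommu_devices(sysfs_dir: str = "/sys/class/iommu") -> list[str]:
    """List IOMMU units registered by the kernel; empty if the directory is absent."""
    path = Path(sysfs_dir)
    if not path.is_dir():
        return []
    return sorted(p.name for p in path.iterdir())


if __name__ == "__main__":
    cmdline = Path("/proc/cmdline").read_text()
    print("kernel parameters:", iommu_params(cmdline))
    print("active IOMMU units:", active_iommu_devices())
```

An empty list of active units suggests the IOMMU is off on that node. The kernel parameters alone are not conclusive, because the firmware setup can also enable or disable it.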