Deploying Kubeflow to a Bare-Metal GPU Cluster from Scratch

By Medium - 2021-03-18

Description

I’ve got 3 standard Supermicro towers with 256GB RAM, an SSD, 5 HDDs, and 4 GPUs each. Ethernet connects them to the “controller” Dell server with access to the internet and is supposed to gate SSH…

Summary

  • Hardware I’ve got 3 standard Supermicro towers with 256GB RAM, an SSD, 5 HDDs, and 4 GPUs each.
  • As I mentioned in one of my old blog posts, it is critical to disable IOMMU if you plan peer-to-peer GPU communication, e.g., multi-GPU model training in Tensorflow or PyTorch.
  • If the users do not care about high availability and failovers, it is enough to spawn only one controller.
  • mergerFS is a nice FUSE (does not require a kernel module) tool to reach that goal.

 

Topics

  1. Backend (0.3)
  2. Machine_Learning (0.14)
  3. UX (0.12)

Similar Articles

A Custom

By Kubernetes - 2020-12-21

Author: Chris Seto (Cockroach Labs) As long as you're willing to follow the rules, deploying on Kubernetes and air travel can be quite pleasant. More often than not, things will "just work". However, ...

Don't Panic

By Kubernetes - 2020-12-02

Authors: Jorge Castro, Duffie Cooley, Kat Cosgrove, Justin Garrison, Noah Kantrowitz, Bob Killen, Rey Lejano, Dan “POP” Papandrea, Jeffrey Sica, Davanum “Dims” Srinivas Kubernetes is deprecating Docke ...