mirror of
https://github.com/imbue-ai/cluster-health.git
synced 2024-06-28 12:52:40 +03:00
master
Training Infrastructure Scripts
This repository contains various scripts developed at Imbue for managing a large cluster of H100s, detecting and fixing hardware issues, and generally ensuring smooth model training. You can read more about our process here
The code is organized as follows:
gpu_stress_testcontains a check that the GPUs on each machine are able to allocate large tensors and perform standard operations.health_checkscontains various checks we use to determine which hosts are healthy, as well as automated solutions to common issues.host_validationcontains tests to check that the GPUs on a given machine are able to communicate with each other (via NVLink) and with GPUs on other machines (via InfiniBand).ufm_eventscontains a script which parses the UFM event log and other logs, checks for relevant events, and determines which network ports should be disabled.ib_burncontains a script for generating a comprehensive burn-in workload for IB fabrics, aiming to exercise every available link.
Description
Languages
Python
95.4%
Shell
4.6%