smolcluster

Distributed Deep Learning Library

Training neural networks across heterogeneous hardware with PyTorch

Overview

smolcluster is a distributed deep learning library designed for training neural networks across heterogeneous hardware using PyTorch and socket-based communication. It enables researchers and developers to leverage multiple machines with different capabilities for distributed training and inference.

The library supports various distributed training algorithms including Elastic Distributed Parallelism (EDP), Synchronous Parameter Server (SyncPS), and Model Parallelism. It can run on diverse hardware including Mac minis, Raspberry Pis, MacBooks, and Windows machines.

Cluster Architecture

Smolcluster Architecture Diagram

Key Features

🔄 Distributed Training Algorithms

  • Elastic Distributed Parallelism (EDP) - Asynchronous data parallelism with stale gradient tolerance, ideal for heterogeneous clusters
  • Synchronous Parameter Server (SyncPS) - Synchronous data parallelism with barrier coordination for homogeneous clusters
  • Model Parallelism (MP) - Layer-wise model distribution perfect for large models and inference serving

🖥️ Hardware Support

Train across heterogeneous hardware including Mac minis, Raspberry Pis, MacBooks, and Windows machines. The framework intelligently handles different hardware capabilities and network latencies.

🤖 Model Support

Built-in support for MNIST, GPT-2, and custom neural networks. Includes distributed inference with model parallelism and streaming token generation for language models.

📊 Monitoring & Logging

  • Grafana + Loki - Centralized log aggregation with real-time queries across all nodes
  • Weights & Biases Integration - Automatic tracking of training metrics, gradient norms, and hardware utilization
  • Web Interface - React-based chat UI for GPT inference

Demo

Distributed GPT-2 Inference with Model Parallelism:

  • Model: GPT-2 (117M parameters)
  • Hardware: iPad client + 2× Mac mini M4 (2025)
  • Algorithm: Model Parallelism with layer distribution
  • Demo: Real-time streaming token generation across distributed layers
  • Workflow: User prompts from iPad → activations forwarded between Mac minis → tokens streamed back
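
The demo's pipeline can be sketched in a few lines. This is an illustrative toy, not Smolcluster's API: the two stage functions stand in for the layer shards on the two Mac minis, and the loop stands in for the socket hops between them.

```python
# Toy sketch of the demo's layer-wise pipeline: each "node" holds one
# stage of the model and forwards its activations to the next stage.
# Stage functions and names are illustrative stand-ins.

def stage0(x):            # e.g. embedding + first half of the layers
    return [v * 2 for v in x]

def stage1(h):            # e.g. second half of the layers + output head
    return sum(h)

def forward_through_cluster(x, stages):
    h = x
    for stage in stages:  # in the real system each hop is a socket send
        h = stage(h)
    return h

print(forward_through_cluster([1, 2, 3], [stage0, stage1]))  # → 12
```

In the real system each arrow in the workflow above is one of these hops, carried over TCP instead of a function call.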

Getting Started

Note: Smolcluster requires a distributed hardware setup and network configuration before you can begin training; installing the package alone is not enough to get started.

Prerequisites

Before using Smolcluster, you need to set up your distributed cluster:

  • Hardware Setup: Configure your machines (Mac minis, Raspberry Pis, GPUs, etc.)
  • Network Configuration:
    • Mac minis: Thunderbolt connections and network bridges
    • Raspberry Pi/GPUs: Ethernet connections
    • SSH setup with proper gateways and key authentication
  • Cluster Configuration: YAML configuration files for your specific topology
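
A topology file might look something like the following. The field names here are purely illustrative, not Smolcluster's actual schema; consult the Cluster Setup Guide for the real format.

```yaml
# Hypothetical cluster topology -- illustrative field names only
cluster:
  algorithm: edp            # edp | syncps | mp
  server:
    host: 192.168.1.10
    port: 29500
  workers:
    - host: 192.168.1.11    # Mac mini over a Thunderbolt bridge
      device: cpu
    - host: 192.168.1.20    # Raspberry Pi over Ethernet
      device: cpu
```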

Installation

Once your cluster is properly configured:

# Install uv package manager
curl -LsSf https://astral.sh/uv/install.sh | sh

# Clone and install
git clone https://github.com/YuvrajSingh-mist/smolcluster.git
cd smolcluster
uv sync

Important: Please refer to the Cluster Setup Guide for detailed hardware setup, networking configuration, and troubleshooting before attempting to run training scripts.

Technical Details

Smolcluster implements a distributed training system designed from the ground up for heterogeneous hardware. The library supports multiple distributed training paradigms, each optimized for different cluster configurations and network topologies.

Communication Infrastructure

  • Socket-based Communication - Raw TCP sockets for reliable, low-level control over gradient and activation transfers between nodes. No dependency on MPI or specialized networking libraries.
  • Pickle Serialization - PyTorch tensors serialized with pickle for efficient network transmission, with optional gradient quantization for bandwidth reduction.
  • Hybrid Network Support - Handles complex topologies mixing Thunderbolt fabric (10Gbps+) and Ethernet edge connections (1Gbps), with proper routing and gateway configuration.
  • Asynchronous I/O - Non-blocking socket operations enable workers to compute while waiting for network transfers in EDP mode.
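
The socket-plus-pickle transport described above usually amounts to length-prefixed framing: serialize the object, send a fixed-size length header, then the payload. Here is a minimal sketch of that pattern; the helper names are illustrative, and a plain dict stands in for a PyTorch tensor (torch tensors pickle the same way).

```python
# Length-prefixed pickle framing over a raw socket (illustrative sketch).
import pickle
import socket
import struct

def _recv_exact(sock, n):
    """Read exactly n bytes, or raise if the peer closes early."""
    buf = b""
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("peer closed the connection")
        buf += chunk
    return buf

def send_obj(sock, obj):
    """Serialize obj and send it with a 4-byte big-endian length prefix."""
    payload = pickle.dumps(obj)
    sock.sendall(struct.pack("!I", len(payload)) + payload)

def recv_obj(sock):
    """Read the length prefix, then exactly that many payload bytes."""
    (length,) = struct.unpack("!I", _recv_exact(sock, 4))
    return pickle.loads(_recv_exact(sock, length))

if __name__ == "__main__":
    a, b = socket.socketpair()          # stand-in for a real TCP pair
    send_obj(a, {"grad": [0.1, -0.2, 0.3]})
    print(recv_obj(b))
```

The length prefix matters because TCP is a byte stream: a single `recv` may return a partial pickle, so the receiver must know how many bytes to accumulate before deserializing.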

Distributed Training Modes

  • Elastic Distributed Parallelism (EDP) - Workers train independently with stale gradient tolerance. The parameter server accepts gradients from any model version, making it resilient to stragglers and network latency variance. Workers periodically pull the latest weights without synchronization barriers.
  • Synchronous Parameter Server (SyncPS) - Barrier-based coordination where the server waits for all workers to submit gradients before updating. Uses Polyak averaging and synchronous weight broadcasts for faster convergence on homogeneous clusters.
  • Model Parallelism - Sequential layer distribution across nodes with activation forwarding. Enables training and inference of models exceeding single-device memory. Each worker holds a subset of layers and forwards activations to the next rank.
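
The EDP update rule described above can be sketched as a server that accepts gradients computed against any past weight version. The staleness discount below is one common illustrative policy, not necessarily Smolcluster's exact scheme; class and method names are assumptions.

```python
# Illustrative EDP-style parameter server with stale-gradient tolerance.
# Weights are a flat list of floats for clarity.

class ElasticParamServer:
    def __init__(self, params, lr=0.1, staleness_decay=0.5):
        self.params = list(params)
        self.version = 0                 # bumps on every update
        self.lr = lr
        self.staleness_decay = staleness_decay

    def pull(self):
        """Worker fetches the latest weights and their version (no barrier)."""
        return list(self.params), self.version

    def push(self, grads, worker_version):
        """Accept a gradient from any model version, down-weighting stale ones."""
        staleness = self.version - worker_version
        scale = self.lr * (self.staleness_decay ** staleness)
        self.params = [p - scale * g for p, g in zip(self.params, grads)]
        self.version += 1

server = ElasticParamServer([1.0, 2.0])
weights, version = server.pull()
server.push([0.5, -0.5], worker_version=version)   # a fresh gradient
print(server.params, server.version)
```

SyncPS differs only in coordination: the server would collect one gradient per worker behind a barrier, average them (e.g. Polyak averaging), and broadcast the updated weights before any worker proceeds.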

Data Management

  • Automatic Data Partitioning - Dataset automatically sharded across workers based on global rank and world size, ensuring no data overlap.
  • Deterministic Shuffling - Seeded random number generators ensure reproducible data ordering across runs.
  • Streaming Support - Memory-efficient data loading for large datasets with PyTorch DataLoader integration.
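
Rank-based sharding with deterministic shuffling typically looks like the sketch below: every node shuffles the full index list with the same seed, then takes a disjoint strided slice. The function name is illustrative, not Smolcluster's API.

```python
# Illustrative rank-based data sharding with a seeded shuffle.
import random

def shard_indices(dataset_len, rank, world_size, seed=42):
    """Return this rank's sample indices: same shuffle everywhere, disjoint slices."""
    indices = list(range(dataset_len))
    random.Random(seed).shuffle(indices)   # identical order on every node
    return indices[rank::world_size]       # strided split: no overlap

shards = [shard_indices(10, r, world_size=3) for r in range(3)]
print(shards)
# Every sample appears exactly once across the three workers:
assert sorted(i for s in shards for i in s) == list(range(10))
```

Because the shuffle is driven only by the seed, re-running with the same seed reproduces each worker's data order exactly, which is what makes runs comparable across experiments.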

Fault Tolerance & Monitoring

  • Checkpointing - Periodic model snapshots with configurable intervals. Supports resuming training from the latest checkpoint after failures.
  • Distributed Logging - Grafana + Loki stack aggregates logs from all nodes in real-time. Promtail agents on each machine forward structured logs to a central Loki instance.
  • Weights & Biases Integration - Automatic logging of training metrics, gradient norms, per-layer statistics, and system metrics (GPU utilization, memory usage, network throughput).
  • Timeout Handling - Configurable timeouts prevent deadlocks when workers fail or network partitions occur.
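
The checkpoint-and-resume pattern can be sketched as follows. This uses pickle and a write-then-rename so a crash never leaves a half-written snapshot; a real PyTorch setup would use `torch.save` on a `state_dict`, and the helper names here are assumptions.

```python
# Illustrative interval checkpointing with crash-safe writes and resume.
import os
import pickle
import tempfile

def save_checkpoint(path, step, state):
    """Write atomically: temp file first, then rename over the target."""
    tmp = path + ".tmp"
    with open(tmp, "wb") as f:
        pickle.dump({"step": step, "state": state}, f)
    os.replace(tmp, path)                  # atomic on POSIX

def load_checkpoint(path):
    """Resume from the latest snapshot, or start fresh if none exists."""
    if not os.path.exists(path):
        return 0, None
    with open(path, "rb") as f:
        ckpt = pickle.load(f)
    return ckpt["step"], ckpt["state"]

ckpt_path = os.path.join(tempfile.mkdtemp(), "latest.ckpt")
start, state = load_checkpoint(ckpt_path)  # (0, None) on a fresh run
for step in range(start, 5):
    state = {"weights": [step]}            # stand-in for a training step
    if step % 2 == 0:                      # configurable interval
        save_checkpoint(ckpt_path, step + 1, state)
print(load_checkpoint(ckpt_path))
```

After a failure, rerunning the same script picks up from the recorded step instead of step 0, which is the behavior the checkpointing bullet above describes.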

Performance Optimizations

  • Gradient Quantization - Optional 8-bit quantization reduces gradient transfer size by 4x with minimal accuracy impact.
  • CPU-based Computation - Designed to utilize CPU cores on commodity hardware (Mac minis, Raspberry Pis) rather than requiring GPUs.
  • Mixed Precision Training - FP16 automatic mixed precision support for compatible hardware to accelerate training.
  • Gradient Accumulation - Simulates larger batch sizes by accumulating gradients over multiple micro-batches before updating.
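
The 4x saving from 8-bit quantization comes from sending one byte per float32 value plus a shared scale factor. The sketch below shows one simple symmetric, per-tensor scheme; the exact scaling policy Smolcluster uses may differ, and the function names are illustrative.

```python
# Illustrative symmetric 8-bit quantization: one byte per gradient value
# plus a single per-tensor scale.

def quantize8(grads):
    """Map floats into the signed int8 range using a per-tensor scale."""
    scale = max(abs(g) for g in grads) / 127 or 1.0   # avoid 0 for all-zero grads
    q = bytes(int(round(g / scale)) & 0xFF for g in grads)
    return q, scale

def dequantize8(q, scale):
    """Reinterpret each byte as a signed int8, then rescale."""
    return [(b - 256 if b > 127 else b) * scale for b in q]

grads = [0.5, -1.0, 0.25, 0.0]
q, scale = quantize8(grads)
print(dequantize8(q, scale))   # close to the originals, at 1 byte per value
```

Each value carries at most half a quantization step of error (scale/2, here ~0.004), which is why the accuracy impact stays small; gradient accumulation, by contrast, needs no wire format at all, since it just sums local gradients over several micro-batches before a single update.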

See the architecture diagram above for a visual representation of the cluster topology.

Documentation

Comprehensive guides, including the Cluster Setup Guide referenced above, are available in the repository to help you get the most out of Smolcluster.

License

Smolcluster is released under the MIT License.

Contributions are welcome! Visit the GitHub repository to get involved.