Training an LLM in Swift, Part 1: Taking matrix mult from Gflop/s to Tflop/s

TL;DR

A developer has rewritten matrix multiplication code in Swift and begun optimizing it to speed up training of a GPT-2-like large language model (LLM) on Apple Silicon. The aim is to raise throughput from the gigaflop-per-second range to the teraflop-per-second range, and the work is ongoing.

The developer, inspired by Andrej Karpathy’s llm.c, rewrote the core matrix multiplication routines in Swift. The first version was very slow. Iterative optimization, including SIMD, Apple Silicon’s AMX units, and multi-threading, has since raised computational throughput. The target is training a GPT-2 compatible model, and early benchmarks show progress from the initial gigaflop-level performance toward teraflop levels.
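For reference, the kind of unoptimized starting point described above can be sketched as a plain triple loop. This is a minimal illustration, not the developer's code: the function name `naiveMatmul`, the flat row-major `[Float]` layout, and the parameter names are all assumptions made for the example.

```swift
// Hypothetical baseline: C (m x n) = A (m x k) * B (k x n).
// All matrices are flat, row-major [Float] arrays (an assumption for this sketch).
func naiveMatmul(_ a: [Float], _ b: [Float], m: Int, k: Int, n: Int) -> [Float] {
    precondition(a.count == m * k && b.count == k * n, "shape mismatch")
    var c = [Float](repeating: 0, count: m * n)
    for i in 0..<m {            // row of A and C
        for j in 0..<n {        // column of B and C
            var sum: Float = 0
            for p in 0..<k {    // dot product along the shared dimension
                sum += a[i * k + p] * b[p * n + j]
            }
            c[i * n + j] = sum
        }
    }
    return c
}
```

The innermost loop reads `b` with a stride of `n` elements, which is cache-unfriendly and hard for the compiler to vectorize. That is one common reason a loop of this shape runs far below what the hardware can deliver.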

The code is currently a plain Swift implementation that does not rely on high-level frameworks. The developer plans to add Metal GPU acceleration and further low-level optimizations. The project aims to demonstrate that training neural networks directly in Swift on Apple Silicon is feasible and can be made efficient with targeted tuning.

Why It Matters

This development matters because it challenges the assumption that high-performance neural network training on Mac hardware requires specialized frameworks or external libraries. Achieving teraflop performance in Swift could open new possibilities for native ML development on Apple Silicon, enabling researchers and developers to experiment without relying solely on Python-based tools or cloud services. It also provides insight into the raw computational capabilities of Apple Silicon’s CPU, SIMD, and AMX units for deep learning workloads.

Background

Two years ago, the author revisited a neural network project from the early 2000s, sparking interest in training neural networks directly on Mac hardware. Inspired by Karpathy’s llm.c, the author rewrote the code in Swift, initially facing performance challenges. The project aims to optimize core matrix operations critical to neural network training, with the broader goal of enabling efficient, native ML development on Apple Silicon. Previous benchmarks indicated that the initial implementation ran at less than 1 Gflop/s, but recent efforts are aimed at scaling this up significantly.

“Optimizing matrix multiplication in Swift is a process of iterative improvements—leveraging SIMD, AMX, and multi-threading—to push performance into the teraflop range.”

— Developer

“Apple Silicon’s AMX units are a game-changer for high-performance matrix operations, but exploiting them requires low-level programming and careful tuning.”

— Hardware expert

What Remains Unclear

It is still unclear how close the current implementation is to reaching true teraflop performance, as benchmarks are still in progress. Additionally, the impact of integrating Metal GPU acceleration and further low-level optimizations remains to be seen, and whether the approach scales to larger models or more complex training routines is uncertain.

What’s Next

The next steps involve benchmarking the optimized Swift code on Apple Silicon, integrating Metal GPU acceleration, and testing with larger models. The developer plans to publish detailed performance metrics and possibly share the code for community testing and further optimization.

Key Questions

What is the main goal of this optimization effort?

The main goal is to enable training large language models in Swift on Apple Silicon at performance levels approaching teraflops, demonstrating that native development is feasible without relying on external ML frameworks.

Why focus on matrix multiplication?

Matrix multiplication is the core computational kernel in neural network training, accounting for most floating-point operations. Improving its performance directly impacts overall training speed.
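To make the Gflop/s and Tflop/s figures concrete, the usual convention counts one multiply and one add per term, so multiplying an m x k matrix by a k x n matrix costs 2·m·k·n floating-point operations. The sketch below applies that convention. The helper names are hypothetical, and the 256-row batch is an illustrative assumption rather than a figure from the project. The 768 and 3072 dimensions are the hidden size and MLP width of the smallest GPT-2 model.

```swift
/// Floating-point operations for C = A * B with A: m x k and B: k x n.
/// Each of the m*n outputs takes k multiplies and (by convention) k adds.
func matmulFlops(m: Int, k: Int, n: Int) -> Double {
    2.0 * Double(m) * Double(k) * Double(n)
}

/// Converts an operation count and a wall-clock time into Gflop/s.
func gflopsPerSecond(flops: Double, seconds: Double) -> Double {
    flops / seconds / 1e9
}

// Illustrative shape: 256 token rows through a 768 -> 3072 projection.
let exampleFlops = matmulFlops(m: 256, k: 768, n: 3072)   // 1_207_959_552
```

At 1 Gflop/s this single product takes roughly 1.2 seconds. At 1 Tflop/s it takes roughly 1.2 milliseconds, which is why the kernel's throughput dominates training time.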

What hardware features are being exploited?

The developer is leveraging SIMD instructions, Apple Silicon’s AMX units, and multi-threading, and plans to incorporate Metal GPU acceleration to boost performance further.
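Two of these techniques, SIMD and multi-threading, can be sketched with only the Swift standard library and Dispatch. The example below is a rough illustration under stated assumptions, not the developer's kernel: the name `simdParallelMatmul` is hypothetical, matrices are flat row-major `[Float]` arrays, and `n` must be a multiple of 8 to keep the code short. It does not touch AMX or Metal.

```swift
import Dispatch

// Sketch: rows of C are distributed across threads, and within a row
// eight output columns are accumulated at once in a SIMD8<Float>.
func simdParallelMatmul(_ a: [Float], _ b: [Float], m: Int, k: Int, n: Int) -> [Float] {
    precondition(a.count == m * k && b.count == k * n, "shape mismatch")
    precondition(n % 8 == 0, "this sketch assumes n is a multiple of 8")
    var c = [Float](repeating: 0, count: m * n)
    c.withUnsafeMutableBufferPointer { cBuf in
        let out = cBuf  // local copy of the buffer pointer for use in the closure
        a.withUnsafeBufferPointer { aBuf in
            b.withUnsafeBufferPointer { bBuf in
                // Each iteration owns one row of C, so writes never overlap.
                DispatchQueue.concurrentPerform(iterations: m) { i in
                    for j in stride(from: 0, to: n, by: 8) {
                        var acc = SIMD8<Float>(repeating: 0)
                        for p in 0..<k {
                            // Eight contiguous elements of row p of B.
                            // A tuned kernel would use direct pointer loads here.
                            let start = p * n + j
                            let bVec = SIMD8<Float>(bBuf[start..<(start + 8)])
                            acc += bVec * aBuf[i * k + p]  // broadcast multiply-add
                        }
                        for lane in 0..<8 {
                            out[i * n + j + lane] = acc[lane]
                        }
                    }
                }
            }
        }
    }
    return c
}
```

Reordering the loops so that `b` is read along contiguous rows is what makes the eight-wide accumulation possible. Splitting by rows of C keeps threads from writing to the same memory, so no locking is needed.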

Will this approach replace existing ML frameworks?

Not necessarily; the project aims to demonstrate the potential of native Swift implementations and hardware-aware optimizations. Established frameworks like TensorFlow or PyTorch will still be preferable for most users due to their maturity and extensive optimization.

When can we expect to see benchmark results?

The developer is currently benchmarking and plans to publish detailed results once the code is further optimized and integrated with GPU acceleration, likely within the next few months.
