Training an LLM in Swift, Part 1: Taking matrix mult from Gflop/s to Tflop/s

TL;DR

A developer has rewritten and begun optimizing matrix multiplication code in Swift to speed up training of a GPT-2-like large language model (LLM) on Apple Silicon. The goal is to raise throughput from the gigaflop range to the teraflop range, and the work is ongoing.

The developer, inspired by Andrej Karpathy’s llm.c, rewrote the core matrix multiplication routines in Swift, and the first version was very slow. Iterative optimization using SIMD, Apple Silicon’s AMX units, and multi-threading has since raised throughput. The target is training a GPT-2 compatible model, and early benchmarks show progress from the initial gigaflop-level performance toward teraflop levels.

While the code is currently a plain Swift implementation without relying on high-level frameworks, the developer plans to incorporate Metal GPU acceleration and further low-level optimizations. The project aims to demonstrate that training neural networks directly in Swift on Apple Silicon is feasible and can be made efficient with targeted tuning.

Why It Matters

This development matters because it challenges the assumption that high-performance neural network training on Mac hardware requires specialized frameworks or external libraries. Achieving teraflop performance in Swift could open new possibilities for native ML development on Apple Silicon, enabling researchers and developers to experiment without relying solely on Python-based tools or cloud services. It also provides insight into the raw computational capabilities of Apple Silicon’s CPU, SIMD, and AMX units for deep learning workloads.

Background

Two years ago, the author revisited a neural network project from the early 2000s, sparking interest in training neural networks directly on Mac hardware. Inspired by Karpathy’s llm.c, the author rewrote the code in Swift, initially facing performance challenges. The project aims to optimize core matrix operations critical to neural network training, with the broader goal of enabling efficient, native ML development on Apple Silicon. Previous benchmarks indicated that the initial implementation ran at less than 1 Gflop/s, but recent efforts are aimed at scaling this up significantly.

“Optimizing matrix multiplication in Swift is a process of iterative improvements—leveraging SIMD, AMX, and multi-threading—to push performance into the teraflop range.”

— Developer

“Apple Silicon’s AMX units are a game-changer for high-performance matrix operations, but exploiting them requires low-level programming and careful tuning.”

— Hardware expert

What Remains Unclear

It is still unclear how close the current implementation is to reaching true teraflop performance, as benchmarks are still in progress. Additionally, the impact of integrating Metal GPU acceleration and further low-level optimizations remains to be seen, and whether the approach scales to larger models or more complex training routines is uncertain.

What’s Next

The next steps involve benchmarking the optimized Swift code on Apple Silicon, integrating Metal GPU acceleration, and testing with larger models. The developer plans to publish detailed performance metrics and possibly share the code for community testing and further optimization.

Key Questions

What is the main goal of this optimization effort?

The main goal is to enable training large language models in Swift on Apple Silicon at performance levels approaching teraflops, demonstrating that native development is feasible without relying on external ML frameworks.

Why focus on matrix multiplication?

Matrix multiplication is the core computational kernel in neural network training, accounting for most floating-point operations. Improving its performance directly impacts overall training speed.
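To make the flop accounting concrete, here is a minimal sketch of an unoptimized matrix multiply in plain Swift, together with the usual way of converting a timing into Gflop/s. The function names (`matmulNaive`, `matmulFlops`, `gflops`), the row-major layout, and the 64×64 test size are assumptions chosen for illustration; this is not the developer's code.

```swift
import Foundation

/// Naive row-major matrix multiply: C (m×n) = A (m×k) · B (k×n).
/// Hypothetical helper for illustration only.
func matmulNaive(_ a: [Float], _ b: [Float], m: Int, k: Int, n: Int) -> [Float] {
    precondition(a.count == m * k && b.count == k * n, "shape mismatch")
    var c = [Float](repeating: 0, count: m * n)
    for i in 0..<m {
        for j in 0..<n {
            var sum: Float = 0
            for p in 0..<k {
                // One multiply and one add per step.
                sum += a[i * k + p] * b[p * n + j]
            }
            c[i * n + j] = sum
        }
    }
    return c
}

/// Flop count of one multiplication: 2·m·k·n (a multiply plus an add per inner step).
func matmulFlops(m: Int, k: Int, n: Int) -> Double {
    2.0 * Double(m) * Double(k) * Double(n)
}

/// Converts a flop count and an elapsed time in seconds to Gflop/s.
func gflops(flops: Double, seconds: Double) -> Double {
    flops / seconds / 1e9
}

// Usage: time a single 64×64 by 64×64 multiplication.
let m = 64, k = 64, n = 64
let a = [Float](repeating: 1, count: m * k)
let b = [Float](repeating: 2, count: k * n)
let start = Date()
let c = matmulNaive(a, b, m: m, k: k, n: n)
let elapsed = Date().timeIntervalSince(start)
// Every entry of c is a sum of 64 products 1·2.
print("c[0] =", c[0])
print("Gflop/s:", gflops(flops: matmulFlops(m: m, k: k, n: n), seconds: elapsed))
```

A single small multiplication is too short to time reliably, so a real benchmark would use larger matrices, repeat the call several times, and build with optimizations enabled.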

What hardware features are being exploited?

The developer is leveraging SIMD instructions, Apple Silicon’s AMX units, multi-threading, and plans to incorporate Metal GPU acceleration to boost performance.
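As a rough illustration of two of these techniques, the sketch below combines Swift's standard-library `SIMD8<Float>` type with `DispatchQueue.concurrentPerform` to compute eight output columns per step and to spread output rows across cores. It leaves AMX and Metal aside. The function name `matmulSIMDParallel`, the vector width of 8, and the one-row-per-task split are assumptions for illustration, not the developer's actual kernel.

```swift
import Foundation
import Dispatch

/// C (m×n) = A (m×k) · B (k×n), row-major.
/// Hypothetical sketch: rows of C are distributed across threads, and each
/// row is computed 8 columns at a time with SIMD8<Float>.
func matmulSIMDParallel(_ a: [Float], _ b: [Float], m: Int, k: Int, n: Int) -> [Float] {
    precondition(a.count == m * k && b.count == k * n, "shape mismatch")
    var c = [Float](repeating: 0, count: m * n)
    guard m > 0, k > 0, n > 0 else { return c }
    let width = 8
    let vecEnd = n - n % width  // columns covered by full 8-wide vectors

    a.withUnsafeBufferPointer { aBuf in
        b.withUnsafeBufferPointer { bBuf in
            c.withUnsafeMutableBufferPointer { cBuf in
                let aPtr = aBuf.baseAddress!
                let bPtr = bBuf.baseAddress!
                let cPtr = cBuf.baseAddress!
                // Each iteration writes a distinct row of C, so no locking is needed.
                DispatchQueue.concurrentPerform(iterations: m) { i in
                    var j = 0
                    while j < vecEnd {
                        var acc = SIMD8<Float>(repeating: 0)
                        for p in 0..<k {
                            // Load 8 consecutive floats from row p of B.
                            let bVec = UnsafeRawPointer(bPtr + p * n + j)
                                .loadUnaligned(as: SIMD8<Float>.self)
                            // Broadcast A[i, p] across the lanes and accumulate.
                            acc += aPtr[i * k + p] * bVec
                        }
                        for lane in 0..<width {
                            cPtr[i * n + j + lane] = acc[lane]
                        }
                        j += width
                    }
                    // Scalar tail for columns that do not fill a whole vector.
                    for jTail in vecEnd..<n {
                        var sum: Float = 0
                        for p in 0..<k {
                            sum += aPtr[i * k + p] * bPtr[p * n + jTail]
                        }
                        cPtr[i * n + jTail] = sum
                    }
                }
            }
        }
    }
    return c
}

// Usage: a 2×2 product.
print(matmulSIMDParallel([1, 2, 3, 4], [5, 6, 7, 8], m: 2, k: 2, n: 2))
```

One row per task keeps the example short. A tuned kernel would more likely hand each thread a block of rows and tile the loops for cache reuse.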

Will this approach replace existing ML frameworks?

Not necessarily; the project aims to demonstrate the potential of native Swift implementations and hardware-aware optimizations. Established frameworks like TensorFlow or PyTorch will still be preferable for most users due to their maturity and extensive optimization.

When can we expect to see benchmark results?

The developer is currently benchmarking and plans to publish detailed results once the code is further optimized and integrated with GPU acceleration, likely within the next few months.

Author

Geek Salad Team
