GGML Deep Dive VII: Tensor Representation and Memory Layout

In previous posts, we’ve encountered the concept of tensors in GGML many times. However, we’ve only explored their simplest usage: cases without quantization, without permutation (where the tensor has a contiguous in-memory layout), and without tensor views. In more complex scenarios, tensors exhibit far more intricate behaviors, sometimes even counterintuitive ones. In this post, I’ll take a deeper dive into how tensors work in GGML. First, let’s take a look at the ggml_tensor data structure defined in include/ggml.h. Here are some fields that you should pay attention to: ...
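For reference, here is a paraphrased, abridged view of the layout-related fields. The authoritative definition lives in include/ggml.h, and the exact member set and order vary by commit; this sketch keeps only the fields relevant to layout:

```c
// Abridged, paraphrased sketch of ggml_tensor (see include/ggml.h for
// the real definition; many members are omitted here).
struct ggml_tensor {
    enum ggml_type type;       // element type: GGML_TYPE_F32, quantized types, ...

    int64_t ne[GGML_MAX_DIMS]; // number of elements in each dimension
    size_t  nb[GGML_MAX_DIMS]; // stride in bytes for each dimension; for a
                               // contiguous f32 tensor:
                               //   nb[0] = sizeof(float)
                               //   nb[i] = nb[i-1] * ne[i-1]
                               // permutation changes nb without moving data

    enum ggml_op op;           // the operation that produced this tensor

    struct ggml_tensor * view_src;  // non-NULL if this tensor is a view of another
    size_t               view_offs; // byte offset into view_src's data

    void * data;               // pointer to the elements themselves
    // ...
};
```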

March 7, 2025 · 8 min · Yifei Wang

GGML Deep Dive VI: GGUF File Parsing

Initially, I planned to go over the mnist example in this post, but after looking into it, I realized that most of its key points had already been covered in previous posts : ) So instead, I’ll focus on the only part that hasn’t been discussed yet: the GGUF file format and how GGML parses it. Note: This blog series is based on commit 475e012. GGML uses a binary file format called GGUF to store models, including both metadata and tensor weights. Below is an overview of its structure from GGML’s official documentation: ...
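For orientation, every GGUF file begins with a small fixed-size header (magic, version, tensor count, metadata key-value count) before the variable-length sections. Below is a minimal sketch of reading just that header in C. This is my own illustration, not GGML's gguf_init_from_file parser, and it assumes a little-endian host:

```c
// Minimal sketch: read the fixed GGUF header (my illustration, not
// GGML's own parser; assumes a little-endian host and no struct padding,
// which holds here since the members are two 4-byte then two 8-byte ints).
#include <stdint.h>
#include <stdio.h>

struct gguf_header {
    uint32_t magic;     // the bytes "GGUF", i.e. 0x46554747 little-endian
    uint32_t version;   // format version
    uint64_t n_tensors; // number of tensor-info entries that follow the KVs
    uint64_t n_kv;      // number of metadata key-value pairs
};

int main(int argc, char ** argv) {
    if (argc < 2) { fprintf(stderr, "usage: %s model.gguf\n", argv[0]); return 1; }
    FILE * f = fopen(argv[1], "rb");
    if (!f) { perror("fopen"); return 1; }

    struct gguf_header h;
    if (fread(&h, sizeof h, 1, f) != 1 || h.magic != 0x46554747) {
        fprintf(stderr, "not a GGUF file\n");
        fclose(f);
        return 1;
    }
    printf("GGUF v%u: %llu tensors, %llu metadata KV pairs\n",
           (unsigned) h.version,
           (unsigned long long) h.n_tensors,
           (unsigned long long) h.n_kv);
    fclose(f);
    return 0;
}
```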

March 6, 2025 · 3 min · Yifei Wang

CMake Notes I: Different Scopes in CMake

I hate CMake. Yet CMake is the most popular tool for managing C++ projects of any scale, and for good reason: it’s powerful, flexible, and capable of handling almost any build requirement. If you think it’s missing a feature, your best guess is that it already exists and you just haven’t found it yet. But to be honest, I’ve never liked CMake. Writing CMakeLists feels like mental torture, thanks to its many quirks and weird behaviors. ...

March 1, 2025 · 4 min · Yifei Wang

GGML Deep Dive V: Backend Mode

In the previous blog post, we wrapped up our exploration of the simple-ctx example, which runs graph computations in context mode. As mentioned earlier, this mode is not the best practice for using GGML, as it doesn’t support device backends like CUDA and Metal. In this blog post, we will shift our focus to the simple-backend example, which demonstrates GGML’s complete workflow under backend mode. Note: This blog series is based on commit 475e012. ...
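As a preview, here is a condensed sketch of the backend-mode workflow. This is my simplification rather than the simple-backend source itself: the ggml_backend_* and ggml_gallocr_* calls are the upstream API, but buffer sizing, data upload, and error handling are stripped down, and header locations may differ across commits:

```c
// Condensed sketch of GGML's backend mode (not the simple-backend source;
// data upload via ggml_backend_tensor_set and error checks are omitted).
#include "ggml.h"
#include "ggml-alloc.h"
#include "ggml-backend.h"
#include "ggml-cpu.h"

int main(void) {
    ggml_backend_t backend = ggml_backend_cpu_init(); // swap in CUDA/Metal here

    // no_alloc = true: the context stores only tensor metadata;
    // the actual data lives in backend buffers.
    struct ggml_init_params params = {
        /*.mem_size   =*/ ggml_tensor_overhead() * 8 + ggml_graph_overhead(),
        /*.mem_buffer =*/ NULL,
        /*.no_alloc   =*/ true,
    };
    struct ggml_context * ctx = ggml_init(params);

    struct ggml_tensor * a = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, 2, 4);
    struct ggml_tensor * b = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, 2, 3);

    // Allocate device memory for a and b (real data would be uploaded
    // with ggml_backend_tensor_set, omitted here).
    ggml_backend_buffer_t buf = ggml_backend_alloc_ctx_tensors(ctx, backend);

    struct ggml_cgraph * gf = ggml_new_graph(ctx);
    struct ggml_tensor * c  = ggml_mul_mat(ctx, a, b);
    ggml_build_forward_expand(gf, c);

    // The graph allocator places intermediate/output tensors in backend memory.
    ggml_gallocr_t galloc =
        ggml_gallocr_new(ggml_backend_get_default_buffer_type(backend));
    ggml_gallocr_alloc_graph(galloc, gf);

    ggml_backend_graph_compute(backend, gf);

    ggml_gallocr_free(galloc);
    ggml_backend_buffer_free(buf);
    ggml_free(ctx);
    ggml_backend_free(backend);
    return 0;
}
```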

February 27, 2025 · 7 min · Yifei Wang

GGML Deep Dive IV: Computation in Context-only Mode, Part 2

In the previous two blog posts, we explored GGML’s minimal example simple-ctx, discussing memory management in context mode and how GGML constructs a computational graph. Now, it’s time to dive into its final part: how GGML actually performs the computation. Let’s get started! Note: This blog series is based on commit 475e012. Continuing from the previous post, after analyzing the function build_graph (line 63 in simple-ctx.cpp), we now step into ggml_graph_compute_with_ctx, which creates the compute plan (ggml_cplan) and plays a crucial role in GGML’s execution framework. ...
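For context before stepping through the code, the compute plan is a small struct whose main job is sizing a scratch buffer. The following is a paraphrased, abridged sketch from memory of ggml's headers, not the verbatim definition; members vary by commit:

```c
// Abridged sketch of ggml_cplan and of what ggml_graph_compute_with_ctx
// does with it (paraphrased; see ggml's source for the real definitions).
struct ggml_cplan {
    size_t    work_size; // scratch bytes the graph's ops collectively need
    uint8_t * work_data; // caller-provided scratch buffer of that size
    int       n_threads; // worker threads used for the compute
    // ... abort-callback and threadpool members omitted
};

// Roughly, ggml_graph_compute_with_ctx:
//   1. builds a plan:                 cplan = ggml_graph_plan(graph, n_threads, ...);
//   2. carves cplan.work_data out of the ggml_context's arena;
//   3. runs the graph:                ggml_graph_compute(graph, &cplan);
```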

February 21, 2025 · 8 min · Yifei Wang

GGML Deep Dive III: Computation in Context-only Mode, Part 1

In the previous post, we explored how GGML, in context mode, allocates and manages memory using ggml_context and ggml_object. We walked through this mechanism in the load_model function. Now, it’s time to delve into how GGML executes actual tensor computations on top of ggml_context. In this blog post, we’ll explore how GGML constructs and manages the data structures required to represent a computational graph. Let’s get started! ...
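To make the graph-building step concrete, here is a minimal sketch assuming a ggml_context named ctx has already been created with ggml_init (as in simple-ctx):

```c
// Operators only record metadata; no arithmetic happens at this point.
struct ggml_tensor * a = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, 2, 4);
struct ggml_tensor * b = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, 2, 3);
struct ggml_tensor * result = ggml_mul_mat(ctx, a, b); // result->op = GGML_OP_MUL_MAT

// The graph object itself is allocated inside the same context.
struct ggml_cgraph * gf = ggml_new_graph(ctx);

// Walks result's dependency chain and appends each node to gf in
// topological order, so the graph can later be executed front to back.
ggml_build_forward_expand(gf, result);
```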

February 18, 2025 · 5 min · Yifei Wang

GGML Deep Dive II: Memory Management in Context-only Mode

Continuing from the previous post, if you’ve followed all the steps outlined earlier, you should now be able to debug any example provided by GGML. To make the thought process as clear as possible, the first example we will analyze is the simple example, specifically ./examples/simple/simple-ctx. Essentially, this example performs matrix multiplication between two hard-coded matrices purely on the CPU. It is minimal compared to real-world GGML applications: it involves no file loading or parsing, is hardware-agnostic, and, most importantly, all computations occur exclusively on the CPU, with all memory allocations happening in RAM. These characteristics make it an excellent candidate for demonstrating the core GGML workflow. ...
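To preview the mechanism, here is a minimal sketch of context-mode allocation using the public ggml.h API. The arena size below is arbitrary for illustration, and simple-ctx itself does more than this:

```c
// Minimal sketch of context-mode memory management (my illustration,
// based on the public ggml.h API; simple-ctx itself goes further).
#include "ggml.h"
#include <stdio.h>

int main(void) {
    // One fixed-size arena is allocated up front; every tensor, graph,
    // and intermediate result is carved out of it as a ggml_object.
    struct ggml_init_params params = {
        /*.mem_size   =*/ 16 * 1024 * 1024, // 16 MiB arena (arbitrary here)
        /*.mem_buffer =*/ NULL,             // NULL => ggml mallocs the arena
        /*.no_alloc   =*/ false,            // tensor data lives in the arena too
    };
    struct ggml_context * ctx = ggml_init(params);

    struct ggml_tensor * a = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, 8);
    printf("tensor of %lld elements; %zu bytes of the arena used\n",
           (long long) ggml_nelements(a), ggml_used_mem(ctx));

    ggml_free(ctx); // releases the whole arena at once
    return 0;
}
```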

February 8, 2025 · 7 min · Yifei Wang

GGML Deep Dive I: Environment Setup

Large language models (LLMs) have revolutionized AI applications, and llama.cpp stands out as a powerful framework for efficient, cross-platform LLM inference. At the core of llama.cpp is ggml, a highly optimized tensor computation library written in pure C/C++ with no external dependencies. Gaining a deep understanding of GGML is essential for comprehending how llama.cpp operates at a low level. This blog series aims to demystify GGML by examining its source code, breaking down its design, and exploring the principles behind its performance optimizations. ...

February 8, 2025 · 4 min · Yifei Wang

Zion National Park & Bryce Canyon National Park

Shot With: NIKKOR Z 24-120mm f/4 S

December 23, 2024 · 1 min · Yifei Wang

Grand Canyon

Shot With: NIKKOR Z 24-120mm f/4 S

December 13, 2024 · 1 min · Yifei Wang