Zig profiling on Apple Silicon

If you’re a developer rocking an Apple Silicon Mac and writing in Zig, congratulations - you’ve chosen the scenic route through the desert of profiling tools. It’s just you, your code, and a tumbleweed named Apple Instruments. But don’t worry - we’ll try to find some oases.

Okay, it’s not that bad, but we’re far away from the rich ecosystem of profiling tools available on Linux.

Note:

I have limited experience using low-level languages, so this article doesn’t provide a deep dive into profiling, but rather serves as an entry point to the world of profiling.

Classification

We focus only on these types of profilers:

- CPU time profilers:
  - Statistical (timer-based) sampling: periodically samples stacks to estimate where time is spent.
  - Hardware event-based sampling (PMU): samples on counter overflows (cycles, cache misses, branches) to attribute microarchitectural stalls.
- Instrumentation profilers: insert probes at function entry/exit or around code regions to mark scopes.

There are many other types of profilers, like memory profilers, network profilers, etc., but we won’t cover them here.

Many profilers combine methods from both categories.
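To make the instrumentation category concrete, here is a minimal hand-rolled sketch in Zig. It marks a scope by starting a timer on entry and reading it on exit. `Scope` and `sumTo` are names I made up for illustration, and the code assumes `std.time.Timer` as it exists in Zig 0.14/0.15. A real instrumentation profiler such as Tracy sends these events to a UI instead of printing them.

```zig
const std = @import("std");

/// Hypothetical helper: measures the wall-clock time of one scope.
const Scope = struct {
    name: []const u8,
    timer: std.time.Timer,

    /// Probe at scope entry: start a monotonic timer.
    fn begin(name: []const u8) !Scope {
        return Scope{ .name = name, .timer = try std.time.Timer.start() };
    }

    /// Probe at scope exit: read the elapsed time and report it.
    fn end(self: *Scope) void {
        const elapsed_ns = self.timer.read();
        std.debug.print("{s}: {d} ns\n", .{ self.name, elapsed_ns });
    }
};

/// Toy workload: sums the integers 0..n-1 (wrapping on overflow).
fn sumTo(n: u64) u64 {
    var total: u64 = 0;
    var i: u64 = 0;
    while (i < n) : (i += 1) total +%= i;
    return total;
}

pub fn main() !void {
    var scope = try Scope.begin("sumTo");
    defer scope.end();

    const result = sumTo(1_000_000);
    // Keep the optimizer from discarding the workload in release builds.
    std.mem.doNotOptimizeAway(result);
}
```

The trade-off is visible even in this toy: you get exact timings for the scopes you marked, and nothing at all for the ones you didn't. A sampling profiler works the other way around.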

For Linux, we have perf, valgrind, and tracy. These tools cover almost all possible cases. Sadly, that’s not true for Apple Silicon Macs:

- perf: supports only Linux, as it relies on the Linux kernel.
- valgrind: doesn’t support macOS on arm64.
- tracy: mostly works, but callstack sampling is not supported.

Interfaces

Apple provides several interfaces for profiling:

- Mach Interface: provides access to threads, address spaces, memory objects, and IPC primitives.
- DTrace Framework: a dynamic tracing framework. Unlike the Mach Interface, it requires root privileges, and profiling system apps also requires disabling System Integrity Protection (SIP).
- kperf: a private framework, Apple’s alternative to Linux perf.

Available Tools

1. Samply

Samply is a sampling profiler that collects per-thread stack traces at a specified sampling interval (default: 1 ms, i.e. 1000 Hz). Both on- and off-CPU samples are collected.

It relies on the Mach Interface to collect samples and uses the Firefox Profiler as its UI.
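The arithmetic behind statistical sampling is simple: every sample stands for roughly one interval of time. Here is a small sketch in Zig of how sample counts turn into time estimates. The function names and sample counts are invented for illustration and are not samply output. Only the 1 ms interval comes from the default mentioned above.

```zig
const std = @import("std");

/// Hypothetical record: how many samples landed in a function.
const Hit = struct { name: []const u8, samples: u64 };

/// Each sample represents roughly one sampling interval of time.
fn estimatedMs(samples: u64, interval_ms: u64) u64 {
    return samples * interval_ms;
}

pub fn main() void {
    const interval_ms = 1; // the default sampling interval quoted above

    // Made-up sample counts, for illustration only.
    const hits = [_]Hit{
        .{ .name = "render", .samples = 730 },
        .{ .name = "intersect", .samples = 240 },
        .{ .name = "main", .samples = 30 },
    };

    var total: u64 = 0;
    for (hits) |h| total += h.samples;

    for (hits) |h| {
        const ms = estimatedMs(h.samples, interval_ms);
        const percent = h.samples * 100 / total;
        std.debug.print("{s}: ~{d} ms ({d}%)\n", .{ h.name, ms, percent });
    }
}
```

These are estimates, not measurements: a function that runs for less than one interval may never be sampled, and because off-CPU samples are collected too, the totals are closer to wall-clock time than to pure CPU time.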

Features:

Sampling executables or already running processes
Feature-rich UI: call tree, flamegraph, source code, CPU usage

Install:

# using cargo
cargo install --locked samply

# homebrew
brew install samply

Usage:

# fresh start
samply record <command>

# attach to a running process by pid
# (on macOS this requires a code-signed samply binary; run `samply setup` once)
samply record -p <pid>


2. poop

It’s not a joke - it’s the Performance Optimizer Observation Platform, shortened to poop. This tool was created by Andrew Kelley to compare performance based on hardware counters. The upstream version doesn’t support macOS as it relies fully on perf. There is a PR that adds support for macOS, including Apple Silicon. Since the PR hasn’t been merged yet, and its build fails, you can use this fork. Under the hood, it uses Apple’s private framework kperf.

Big shout-out to:

ibireme for researching Apple’s open sourced code and reverse engineering Apple Instruments.
tensorush for implementing the PR.

Keep in mind that the macOS support comes from an unmerged PR, so it may contain bugs. I tested it on macOS 15.5 (M2 Pro), where it works fine, but I can’t guarantee it will work on newer versions: it relies on a private API that Apple can change at any time.

Features:

Getting hardware counters, e.g. branch misses, instructions
Comparing performance across different commands

Install:

git clone https://github.com/verte-zerg/poop.git -b kperf-macos
cd poop
# use zig version 0.14.1
zig build --release=fast
cp ./zig-out/bin/poop ~/.local/bin/poop

Usage:

# root privileges are required on macOS
sudo poop <cmd1> <cmd2> ...

# pass --duration (in ms) to increase the number of runs
# runs = <duration> // <duration of the first run>
sudo poop --duration 60000 <cmd1> <cmd2> ...
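To get a feel for the run count, here is the rule from the comment above (duration divided by the time of the first run, rounded down) as a tiny Zig sketch. `estimatedRuns` is a hypothetical helper that only mirrors that comment. It is not taken from poop’s source, which may apply extra limits.

```zig
const std = @import("std");

/// Hypothetical helper mirroring the comment above:
/// runs = duration // duration of the first run (integer division).
fn estimatedRuns(duration_ms: u64, first_run_ms: u64) u64 {
    // A zero-length first run would make the rule meaningless.
    std.debug.assert(first_run_ms > 0);
    return duration_ms / first_run_ms;
}

pub fn main() void {
    // --duration 60000 with a command whose first run takes about 4 s.
    std.debug.print("{d}\n", .{estimatedRuns(60_000, 4_000)}); // prints 15
}
```

So for a command that takes seconds rather than milliseconds, you have to raise the duration a lot before you get a meaningful number of runs.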


3. Tracy

Tracy is a real-time instrumentation and sampling profiler. It supports time-based zone sampling but does not support callstack sampling (e.g. full backtraces) on Apple Silicon. You can still use its instrumentation features, which are useful if you want to profile a long-running process.

I can’t describe the profiler better than Marcos Slomp at CppCon 2023. That said, keep in mind that callstack sampling doesn’t work on macOS with Apple Silicon. He doesn’t mention this in the talk and skips a live demo in favor of pre-recorded results. Maybe it’s a skill issue on my side and I just don’t know how to cook it, but Tracy’s manual (see page 27) clearly says callstack sampling isn’t supported on Apple Silicon.

Features:

Rich UI: source code, call tree, CPU/GPU usage, and much more
Remote profiling
Many instrumentation features, like messages, scopes, values

Install:

Since Tracy is an instrumentation profiler, you should embed the client library and some additional code in your app.

Clone the tracy repository

git clone https://github.com/wolfpld/tracy.git

Install tracy-profiler using Homebrew or by manually compiling the tracy repository.

brew install tracy

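Note that Homebrew ships only version 0.11.1, while version 0.12.2 is already released. I recommend sticking with the Homebrew version to avoid the hassle of building 0.12.2 from scratch. Tracy only connects when the client library and the profiler speak the same protocol version, so check out the matching release tag in the cloned repository.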

Copy the tracy.zig implementation to your codebase. You can take it from the Zig repo, making sure you use the file from the corresponding version of your Zig.
Update your build.zig file to include the tracy library. As a reference, you can use the build.zig file from the Zig repo, just search for the tracy keyword. Here is a diff over a base build.zig:

diff --git a/build.zig b/build.zig
index 6bc8766..0011695 100644
--- a/build.zig
+++ b/build.zig
@@ -9,6 +9,11 @@ pub fn build(b: *std.Build) void {
         .target = target,
     });

+    const tracy = b.option([]const u8, "tracy", "Enable Tracy integration. Supply path to Tracy source");
+    const tracy_callstack = b.option(bool, "tracy-callstack", "Include callstack information with Tracy data. Does nothing if -Dtracy is not provided") orelse (tracy != null);
+    const tracy_allocation = b.option(bool, "tracy-allocation", "Include allocation information with Tracy data. Does nothing if -Dtracy is not provided") orelse (tracy != null);
+    const tracy_callstack_depth: u32 = b.option(u32, "tracy-callstack-depth", "Declare callstack depth for Tracy data. Does nothing if -Dtracy-callstack is not provided") orelse 10;
+
     const exe = b.addExecutable(.{
         .name = "tracy_demo",
         .root_module = b.createModule(.{
@@ -23,6 +28,27 @@ pub fn build(b: *std.Build) void {

     b.installArtifact(exe);

+    const exe_options = b.addOptions();
+    exe.root_module.addOptions("build_options", exe_options);
+
+    exe_options.addOption(bool, "enable_tracy", tracy != null);
+    exe_options.addOption(bool, "enable_tracy_callstack", tracy_callstack);
+    exe_options.addOption(bool, "enable_tracy_allocation", tracy_allocation);
+    exe_options.addOption(u32, "tracy_callstack_depth", tracy_callstack_depth);
+
+    if (tracy) |tracy_path| {
+        const client_cpp = b.pathJoin(
+            &[_][]const u8{ tracy_path, "public", "TracyClient.cpp" },
+        );
+
+        const tracy_c_flags: []const []const u8 = &.{ "-DTRACY_ENABLE=1", "-fno-sanitize=undefined" };
+
+        exe.root_module.addIncludePath(.{ .cwd_relative = tracy_path });
+        exe.root_module.addCSourceFile(.{ .file = .{ .cwd_relative = client_cpp }, .flags = tracy_c_flags });
+        exe.root_module.linkSystemLibrary("c++", .{ .use_pkg_config = .no });
+        exe.root_module.link_libc = true;
+    }
+
     const run_step = b.step("run", "Run the app");

     const run_cmd = b.addRunArtifact(exe);

Instrument your code with tracy. Here is a simple usage example:

diff --git a/src/main.zig b/src/main.zig
index e21e514..0c06f22 100644
--- a/src/main.zig
+++ b/src/main.zig
@@ -1,7 +1,11 @@
 const std = @import("std");
 const tracy_demo = @import("tracy_demo");
+const tracy = @import("tracy.zig");

 pub fn main() !void {
+    const tr = tracy.trace(@src());
+    defer tr.end();
+
     // Prints to stderr, ignoring potential errors.
     std.debug.print("All your {s} are belong to us.\n", .{"codebase"});
     try tracy_demo.bufferedPrint();

Build your app with -Dtracy=<path to tracy repo>.

zig build -Dtracy=<path to tracy repo>

Usage:

Run tracy-profiler, click Connect
Run your program.

The connection should be established and you’ll see the tracy profiler UI. On Apple Silicon it will show only CPU usage, but you can instrument your code with additional frames and messages. Just take a look at the copied tracy.zig.


4. Apple Instruments

Apple Instruments is a powerful tool that allows you to perform CPU profiling, fetch hardware counters, and much more.

Instruments offers most of the same capabilities as samply or poop - and adds more, like GPU usage and counters, HTTP traffic, Neural Engine events, etc.

I’d only reach for it if the other tools fall short. The main downside is that the app in general, and its UI in particular, is too slow, which sounds like a joke, since we’re talking about performance here!

It includes a command-line tool, xctrace, for scripting recordings and exports - but it’s painfully slow as well. For example, I have a ray-tracing binary that runs in about 4 seconds, but recording 3 runs and exporting them to XML (yes, it’s the only supported output format) took about 40 seconds (about 30 seconds overhead).

You can read about it in the Apple Documentation or in various other articles.


Conclusion

The profiling landscape on Apple Silicon is not as rich as on Linux, but there are still some good options available. I found that samply is the best to take a quick look at the performance of my app, while poop is an excellent tool for iterative performance optimization. tracy is a powerful tool for instrumentation profiling, although I haven’t yet had a real use case for it.
