If you’re a developer rocking an Apple Silicon Mac and writing in Zig, congratulations - you’ve chosen the scenic route through the desert of profiling tools. It’s just you, your code, and a tumbleweed named Apple Instruments. But don’t worry - we’ll try to find some oases.
Okay, it’s not that bad, but we’re far away from the rich ecosystem of profiling tools available on Linux.
Note:
I have limited experience using low-level languages, so this article doesn’t provide a deep dive into profiling, but rather serves as an entry point to the world of profiling.
Classification
We focus only on these types of profilers:
- CPU time profilers:
  - Statistical (timer-based) sampling: periodically samples stacks to estimate where time is spent.
  - Hardware event-based sampling (PMU): samples on counter overflows (cycles, cache misses, branches) to attribute microarchitectural stalls.
- Instrumentation profilers: insert probes at function entry/exit or around code regions to mark scopes.
There are many other types of profilers, like memory profilers, network profilers, etc., but we won’t cover them here.
Many profilers combine methods from both categories.
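To make the instrumentation idea concrete, here is a minimal hand-rolled sketch in Zig: one probe at scope entry, one at scope exit, using `std.time.Timer` from the standard library. The `sumOfSquares` workload is a hypothetical placeholder, and a real instrumentation profiler records many such scopes and streams them to a UI instead of printing them.

```zig
const std = @import("std");

/// Hypothetical workload, used only for illustration.
fn sumOfSquares(n: u64) u64 {
    var total: u64 = 0;
    var i: u64 = 0;
    while (i < n) : (i += 1) {
        total += i * i;
    }
    return total;
}

pub fn main() !void {
    // Probe at scope entry: start a monotonic timer.
    var timer = try std.time.Timer.start();

    const result = sumOfSquares(1_000_000);

    // Probe at scope exit: read the elapsed time in nanoseconds.
    const elapsed_ns = timer.read();

    std.debug.print("sumOfSquares = {d}, took {d} ns\n", .{ result, elapsed_ns });
}
```

A sampling profiler needs no such changes to the code: it interrupts the running program periodically and records whichever stack it finds.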
For Linux, we have perf, valgrind, and tracy. These tools cover almost all possible cases. Sadly, that’s not true for Apple Silicon Macs:
- perf - supports only Linux as it relies on the Linux kernel.
- valgrind - doesn’t support macOS on arm64.
- tracy - mostly works, but callstack sampling is not supported.
Interfaces
Apple provides several interfaces for profiling:
- Mach Interface - provides access to threads, address spaces, memory objects, and IPC primitives.
- DTrace Framework - a dynamic tracing framework; it requires root privileges, and System Integrity Protection (SIP) must be disabled to profile system apps.
- kperf - private framework, Apple’s alternative to Linux perf.
Available Tools
1. Samply
Samply is a sampling profiler that collects stack traces, per thread, at a specified sampling interval (default: 1 ms, i.e. 1000 Hz). Both on- and off-CPU samples are collected.
It relies on the Mach Interface to collect samples and uses the Firefox Profiler as its UI.
Features:
- Sampling executables or already running processes
- Feature-rich UI: call tree, flamegraph, source code, CPU usage
Install:
# using cargo
cargo install --locked samply
# homebrew
brew install samply
Usage:
# fresh start
samply record <command>
# attach to a running process by pid (samply has to be code-signed first, which `samply setup` does)
samply record -p <pid>
2. poop
It’s not a joke - it’s the Performance Optimizer Observation Platform, shortened to poop. This tool was created by Andrew Kelley to compare performance based on hardware counters. The upstream version doesn’t support macOS as it relies fully on perf. There is a PR that adds support for macOS, including Apple Silicon. Since the PR hasn’t been merged yet, and its build fails, you can use this fork. Under the hood, it uses Apple’s private framework kperf.
Big shout-out to:
- ibireme for researching Apple’s open sourced code and reverse engineering Apple Instruments.
- tensorush for implementing the PR.
Keep in mind that this PR hasn’t been merged yet, so it may contain bugs. I tested it on macOS 15.5 (M2 Pro), and it works fine, but I can’t guarantee that it will work on newer versions as it relies on a private API and Apple can change it at any time.
Features:
- Getting hardware counters, e.g. branch misses, instructions
- Comparing performance across different commands
Install:
git clone https://github.com/verte-zerg/poop.git -b kperf-macos
cd poop
# use zig version 0.14.1
zig build --release=fast
cp ./zig-out/bin/poop ~/.local/bin/poop
Usage:
# root privileges are required on macOS
sudo poop <cmd1> <cmd2> ...
# you can specify the duration param (in ms) to increase the number of runs
# number of runs = <duration> / <duration of the first run>, rounded down
sudo poop --duration 60000 <cmd1> <cmd2> ...
3. Tracy
Tracy is a real-time instrumentation and sampling profiler. It supports time-based zone sampling but does not support callstack sampling (i.e. full backtraces) on Apple Silicon. You can still use its instrumentation features, which are useful if you want to profile a long-running process.
I can’t describe the profiler better than Marcos Slomp at CppCon 2023. That said, keep in mind that callstack sampling doesn’t work on macOS with Apple Silicon. He doesn’t mention this in the talk and skips a live demo in favor of pre-recorded results. Maybe it’s a skill issue on my side and I just don’t know how to cook it, but Tracy’s manual (see page 27) clearly says callstack sampling isn’t supported on Apple Silicon.
Features:
- Rich UI: source code, call tree, CPU/GPU usage, and much more
- Remote profiling
- Many instrumentation features, like messages, scopes, values
Install: Since Tracy is an instrumentation profiler, you should embed the client library and some additional code in your app.
- Clone the tracy repository
git clone https://github.com/wolfpld/tracy.git
- Install tracy-profiler using Homebrew or by manually compiling the tracy repository.
brew install tracy
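Note that Homebrew contains only version 0.11.1, while version 0.12.2 is already released. I recommend sticking with the Homebrew version to avoid the hassle of building 0.12.2 from scratch. If you do, make sure the cloned client sources match the profiler version (for example, by checking out the corresponding release tag), since Tracy expects the client and the profiler to use the same protocol version.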
- Copy the tracy.zig implementation to your codebase. You can take it from the Zig repo, making sure you use the file from the corresponding version of your Zig.
- Update your build.zig file to include the tracy library. As a reference, you can use the build.zig file from the Zig repo, just search for the tracy keyword. Here is a diff over a base build.zig:
diff --git a/build.zig b/build.zig
index 6bc8766..0011695 100644
--- a/build.zig
+++ b/build.zig
@@ -9,6 +9,11 @@ pub fn build(b: *std.Build) void {
.target = target,
});
+ const tracy = b.option([]const u8, "tracy", "Enable Tracy integration. Supply path to Tracy source");
+ const tracy_callstack = b.option(bool, "tracy-callstack", "Include callstack information with Tracy data. Does nothing if -Dtracy is not provided") orelse (tracy != null);
+ const tracy_allocation = b.option(bool, "tracy-allocation", "Include allocation information with Tracy data. Does nothing if -Dtracy is not provided") orelse (tracy != null);
+ const tracy_callstack_depth: u32 = b.option(u32, "tracy-callstack-depth", "Declare callstack depth for Tracy data. Does nothing if -Dtracy_callstack is not provided") orelse 10;
+
const exe = b.addExecutable(.{
.name = "tracy_demo",
.root_module = b.createModule(.{
@@ -23,6 +28,27 @@ pub fn build(b: *std.Build) void {
b.installArtifact(exe);
+ const exe_options = b.addOptions();
+ exe.root_module.addOptions("build_options", exe_options);
+
+ exe_options.addOption(bool, "enable_tracy", tracy != null);
+ exe_options.addOption(bool, "enable_tracy_callstack", tracy_callstack);
+ exe_options.addOption(bool, "enable_tracy_allocation", tracy_allocation);
+ exe_options.addOption(u32, "tracy_callstack_depth", tracy_callstack_depth);
+
+ if (tracy) |tracy_path| {
+ const client_cpp = b.pathJoin(
+ &[_][]const u8{ tracy_path, "public", "TracyClient.cpp" },
+ );
+
+ const tracy_c_flags: []const []const u8 = &.{ "-DTRACY_ENABLE=1", "-fno-sanitize=undefined" };
+
+ exe.root_module.addIncludePath(.{ .cwd_relative = tracy_path });
+ exe.root_module.addCSourceFile(.{ .file = .{ .cwd_relative = client_cpp }, .flags = tracy_c_flags });
+ exe.root_module.linkSystemLibrary("c++", .{ .use_pkg_config = .no });
+ exe.root_module.link_libc = true;
+ }
+
const run_step = b.step("run", "Run the app");
const run_cmd = b.addRunArtifact(exe);
- Instrument your code with tracy. Here is a simple usage example:
diff --git a/src/main.zig b/src/main.zig
index e21e514..0c06f22 100644
--- a/src/main.zig
+++ b/src/main.zig
@@ -1,7 +1,11 @@
const std = @import("std");
const tracy_demo = @import("tracy_demo");
+const tracy = @import("tracy.zig");
pub fn main() !void {
+ const tr = tracy.trace(@src());
+ defer tr.end();
+
// Prints to stderr, ignoring potential errors.
std.debug.print("All your {s} are belong to us.\n", .{"codebase"});
try tracy_demo.bufferedPrint();
- Build your app with -Dtracy=<path to tracy repo>.
zig build -Dtracy=<path to tracy repo>
Usage:
- Run tracy-profiler and click Connect.
- Run your program.
The connection should be established and you’ll see the tracy profiler UI. On Apple Silicon it will show only CPU usage, but you can instrument your code with additional frames and messages. Just take a look at the copied tracy.zig.
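As a rough sketch of what frames and messages can look like, the snippet below uses `traceNamed`, `message`, and `frameMark`. It assumes your copied tracy.zig exposes these helpers the way the Zig repo's version does, so check the names against your copy. The `simulateStep` function and its loop are hypothetical placeholders. It only builds inside a project set up as described above, since tracy.zig imports `build_options`.

```zig
const std = @import("std");
// Assumption: tracy.zig was copied from the Zig repo next to this file.
const tracy = @import("tracy.zig");

/// Hypothetical unit of work, standing in for one iteration of a real loop.
fn simulateStep(step: u64) u64 {
    // Named zone: shows up as "simulateStep" on the Tracy timeline.
    const zone = tracy.traceNamed(@src(), "simulateStep");
    defer zone.end();

    var acc: u64 = 0;
    var i: u64 = 0;
    while (i < 100_000) : (i += 1) {
        acc +%= i ^ step;
    }
    return acc;
}

pub fn main() void {
    // One-off message that appears in Tracy's message log.
    tracy.message("simulation started");

    var step: u64 = 0;
    while (step < 1_000) : (step += 1) {
        _ = simulateStep(step);
        // Mark the end of a frame so Tracy can show per-iteration timings.
        tracy.frameMark();
    }

    tracy.message("simulation finished");
}
```

When the app is built without -Dtracy, the helpers in the Zig repo's tracy.zig compile to no-ops, so the instrumentation can stay in the code.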
4. Apple Instruments
Apple Instruments is a powerful tool that allows you to perform CPU profiling, fetch hardware counters, and much more.
Instruments offers most of the same capabilities as samply or poop - and adds more, like GPU usage and counters, HTTP traffic, Neural Engine events, etc.
I’d only reach for it if the other tools fall short. The main downside of this tool is that its UI, and the app in general, is slow, which sounds like a joke, since we’re talking about performance here!
It includes a command-line tool, xctrace, for scripting recordings and exports - but it’s painfully slow as well. For example, I have a ray-tracing binary that runs in about 4 seconds, but recording 3 runs and exporting them to XML (yes, it’s the only supported output format) took about 40 seconds. That is roughly 12 seconds of actual run time plus about 28 seconds of overhead.
You can read about it in the Apple Documentation or in various other articles.
Conclusion
The profiling landscape on Apple Silicon is not as rich as on Linux, but there are still some good options available. I found that samply is the best to take a quick look at the performance of my app, while poop is an excellent tool for iterative performance optimization. tracy is a powerful tool for instrumentation profiling, although I haven’t yet had a real use case for it.
Sources
- Tracy - User Manual
- Article - Counting cycles and instructions on the Apple M1 processor
- Article - Using dtrace on MacOS with SIP enabled
- Article - macOS Profiling - visualize program bottleneck with Flamegraph
- Talk - CppCon 2023: Marcos Slomp - An Introduction to Tracy Profiler in C++
- X - Post about reading Apple M1 CPU performance counters in macOS
- X - Post about profiling on Apple Silicon