Last time I wrote about profiling in Zig on Apple Silicon, I touched on PMU counter profiling. This time I decided to go further and create my own tool to fetch all available counters for Apple Silicon processors (M1, M2, and later).
Brief explanation of PMU counters
PMU (Performance Monitoring Unit) counters are hardware counters that track microarchitectural events inside the CPU, e.g. executed instructions, retired operations, branches, cache misses, and more.
CPUs usually expose a mix of fixed and programmable counters. Fixed counters represent predefined events (often things like cycles and instructions), while programmable counters can be configured to track a selected set of events.
Using PMU counters, developers can better understand the performance characteristics of their applications, e.g. the number of cache misses, branch mispredictions, instruction mix, and other low-level metrics.
Motivation
One existing solution for fetching these counters is the poop tool written by Andrew Kelley, together with a PR by tensorush that adds the ability to fetch CPU counters on Apple Silicon.
The main problem with this PR is that Andrew politely declined it, and I fully get that decision, since it’s hard to support an additional implementation, especially if you don’t use it yourself.
I’ve created a fork that you can use. It’s actually a good solution if you need to fetch several predefined PMU counters.
But I’ve decided to go a bit further and implement another tool for Apple Silicon Macs that can fetch all counters supported by Apple Silicon. Since that required understanding how it works, my tool implementation quickly became a research project about Apple’s private kperf API.
Basically, this article is a journey through how the research went.
To perform experiments, I used a MacBook with an M2 Pro chip running macOS 15.6.1.
Apple Instruments and the first weird limitation
I started by researching how Apple Instruments works.
So, I opened Instruments, went to the “CPU Counters” template, and tried to add/delete counters.
I quickly found out that Apple Instruments doesn’t support fetching more than 10 counters at once; sometimes the limit was 8, and sometimes even fewer. I was constantly getting errors like '<SOME_COUNTER>' conflicts with a previously added event. So, the first takeaway was that there is a limit on how many counters I can fetch, and the second was that some counters are, in some way, incompatible with each other. Why and how they’re incompatible is a good question.
The difference between 10 and 8 counters can be explained by the fact that two of the counters are possibly fixed (always monitored by the CPU). In Instruments, they have special aliases. Take a look at this subset of available counters (5 out of 60):
- INST_ALL
- INST_BARRIER
- Cycles (FIXED_CYCLES)
- L1D_CACHE_MISS_LD
- Instructions (FIXED_INSTRUCTIONS)
As you can see, the Cycles and Instructions counters have special aliases and the FIXED_ prefix. And there are only two such counters in Instruments. So, we can assume that Apple Silicon has 2 fixed counters.
I tried to search for any information about these limitations. These attempts led me to Apple’s CPU optimization guide, which is a great overview of how to optimize performance based on CPU counters.
Note: to access this guide you need to sign in to the Apple Developer portal.
But I didn’t find any information there about limitations and incompatibilities. At least I got an interesting guide, a list of counters for each processor (M1-M4), and their descriptions.
Reverse-engineered kperf and the first experiments
The next information source was reverse-engineered code that uses the macOS kperf framework to fetch CPU counters. It’s great work by ibireme. You can find the original code here.
It’s worth mentioning that kperf is a private Apple framework that provides an interface for performance monitoring. It’s not documented publicly, so the only way to understand how it works is to study reverse-engineered code that uses it or do your own reverse-engineering. The tool also requires sudo privileges to run.
So, I took the Zig port of the reverse-engineered kperf code and tried to find any incompatible pairs. Using this code I ran the first experiment, which basically goes through all counter pairs, adds them to the monitor list, and fetches counters for a simple function. The first good news is that I found 6 counters (from here on: group M) that are pairwise incompatible.
The code works with the following algorithm: we make some preparation by initializing several configs and then add counters one by one. These incompatibilities happen when I add a new counter from group M to an already added counter from the same group. In this case, I get an error that the new counter cannot be added.
Group M (6 counters):
- INST_ALL
- INST_INT_ALU
- INST_INT_ST
- INST_LDST
- INST_SIMD_ALU
- RETIRE_UOP
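The shape of this pair experiment can be sketched in a few lines of Python. This is only an illustration, not the Zig code that was actually run: `try_configure` is a hypothetical stand-in that models the observed behaviour (two group M counters cannot coexist) instead of calling kperf, and the four non-group-M names are just a small sample of the remaining counters.

```python
from itertools import combinations

# Group M as listed above; the extra names are a small sample of other counters.
GROUP_M = {"INST_ALL", "INST_INT_ALU", "INST_INT_ST", "INST_LDST", "INST_SIMD_ALU", "RETIRE_UOP"}
COUNTERS = sorted(GROUP_M | {"L1D_TLB_ACCESS", "L1D_TLB_MISS", "LD_UNIT_UOP", "ST_UNIT_UOP"})


def try_configure(counters):
    """Hypothetical stand-in for the real kperf-based check.

    The real experiment adds the counters one by one through kperf and reports
    whether any addition fails. Here the observed result is modelled directly:
    a set is accepted only if it holds at most one group M counter.
    """
    return sum(name in GROUP_M for name in counters) <= 1


def find_conflicting_pairs(counters, check):
    """Return every pair of counters that `check` rejects."""
    return [pair for pair in combinations(counters, 2) if not check(pair)]


failing_pairs = find_conflicting_pairs(COUNTERS, try_configure)
involved = {name for pair in failing_pairs for name in pair}
print(len(failing_pairs), sorted(involved))
```

Collecting the names that appear in any failing pair is how a group such as group M falls out of the raw results.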
That’s it, the goal is reached, at least I thought so. But when I tried to add 8, in my opinion, compatible counters (not from group M), my attempt failed. I just got another error that these counters cannot be added together. It means that besides counters that are incompatible in pairs, there should be something else. I continued my experiments with finding incompatible counters by creating all possible combinations of triples, quadruples, and up to sets of 8.
When combinatorics stopped making sense
These are the results of running the same experiment for different set sizes.
- pairs - only 6 counters (group M)
- triples - no new incompatible counters (I excluded cases with 2 counters from group M)
- quadruples - adds a new group of 18 counters (from here on: group G) that are incompatible in quadruples, or in triples when combined with group M counters
- sets of 5-6 - no new incompatible counters (I excluded cases discovered before)
- sets of 7 - a lot of new incompatibilities
Group G (18 counters):
- BRANCH_CALL_INDIR_MISPRED_NONSPEC
- BRANCH_COND_MISPRED_NONSPEC
- BRANCH_INDIR_MISPRED_NONSPEC
- BRANCH_MISPRED_NONSPEC
- BRANCH_RET_INDIR_MISPRED_NONSPEC
- INST_BARRIER
- INST_BRANCH
- INST_BRANCH_CALL
- INST_BRANCH_COND
- INST_BRANCH_INDIR
- INST_BRANCH_RET
- INST_BRANCH_TAKEN
- INST_INT_LD
- INST_SIMD_LD
- INST_SIMD_ST
- L1D_CACHE_MISS_LD_NONSPEC
- L1D_CACHE_MISS_ST_NONSPEC
- L1D_TLB_MISS_NONSPEC
So, the final list blew my mind: I’ve found 18_673_166 new incompatible cases. This is the number of failing combinations for sets of size 7. By the way, the total number of combinations for sets of size 7 is C(55, 7) = 202_927_725. Due to inconsistencies between Apple’s guide and the real available counters, I checked only 55 out of 60 counters.
I’ve tried really hard to understand what’s going on, because the list didn’t make any sense. With a little help from combinatorics, Python, and several hours of trying to get the desired number, I ended up with this final formula. You don’t need to check it, but please take a look at it - I don’t want that time to be completely wasted.
C(30, 6) + C(30, 5) * C(18, 1) + C(30, 4) * C(18, 2) + C(30, 5) * C(6, 1) + C(30, 4) * C(18, 1) * C(6, 1)
+ C(30, 6) + C(30, 5) * C(18, 1) + C(30, 4) * C(18, 2)
+ C(12, 6) + C(12, 5) * C(17, 1) + C(12, 4) * C(17, 2) + C(12, 5) * C(5, 1) + C(12, 4) * C(17, 1) * C(5, 1)
+ C(8, 6) + C(8, 5) * C(16, 1) + C(8, 4) * C(16, 2) + C(8, 5) * C(5, 1) + C(8, 4) * C(16, 1) * C(5, 1)
+ C(7, 6) + C(7, 5) * C(15, 1) + C(7, 4) * C(15, 2) + C(7, 5) * C(5, 1) + C(7, 4) * C(15, 1) * C(5, 1)
+ C(5, 4) * (C(10, 3) + C(10, 2) * C(5, 1) + C(10, 1) * C(5, 2))
+ C(5, 5) * (C(10, 2) + C(10, 1) * C(5, 1))
+ C(5, 4) * C(5, 1) * (C(10, 2) + C(10, 1) * C(5, 1) + C(5, 2))
+ C(5, 5) * C(5, 1) * (C(10, 1) + C(5, 1))
where C(n, k) is the number of unique subsets of size k from a total of n elements. It’s also called the binomial coefficient, or “n choose k”.
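For anyone who does want to check it, the formula can be evaluated with Python’s `math.comb`. The terms below are copied line by line from the formula above; nothing else is assumed.

```python
from math import comb as C

# Total number of 7-element sets out of the 55 counters that were checked.
total_sets = C(55, 7)

# The formula from the article, one source line per Python line.
failing_sets = (
    C(30, 6) + C(30, 5) * C(18, 1) + C(30, 4) * C(18, 2) + C(30, 5) * C(6, 1) + C(30, 4) * C(18, 1) * C(6, 1)
    + C(30, 6) + C(30, 5) * C(18, 1) + C(30, 4) * C(18, 2)
    + C(12, 6) + C(12, 5) * C(17, 1) + C(12, 4) * C(17, 2) + C(12, 5) * C(5, 1) + C(12, 4) * C(17, 1) * C(5, 1)
    + C(8, 6) + C(8, 5) * C(16, 1) + C(8, 4) * C(16, 2) + C(8, 5) * C(5, 1) + C(8, 4) * C(16, 1) * C(5, 1)
    + C(7, 6) + C(7, 5) * C(15, 1) + C(7, 4) * C(15, 2) + C(7, 5) * C(5, 1) + C(7, 4) * C(15, 1) * C(5, 1)
    + C(5, 4) * (C(10, 3) + C(10, 2) * C(5, 1) + C(10, 1) * C(5, 2))
    + C(5, 5) * (C(10, 2) + C(10, 1) * C(5, 1))
    + C(5, 4) * C(5, 1) * (C(10, 2) + C(10, 1) * C(5, 1) + C(5, 2))
    + C(5, 5) * C(5, 1) * (C(10, 1) + C(5, 1))
)

print(f"{failing_sets:_} failing sets out of {total_sets:_}")
```

The sum comes out to 18_673_166, matching the number of failing sets of size 7 reported above.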
I didn’t even try to run the experiment for sets of 8 because it looked like a dead end. My experiments kinda failed, but at least I refreshed my knowledge of combinatorics.
Order matters
I couldn’t stop thinking that it just couldn’t be implemented this way.
My final attempt was to change the order of counters and voila, the number of failed cases changed (I sadly don’t remember the exact numbers). This time I decided to skip the part with figuring out how the final number was made.
So, the new takeaway is: the order in which you add counters DOES matter. And the scariest thing is that the order matters even in the Apple Instruments app.
You can verify this yourself: try adding these counters in this specific order:
- L1D_TLB_ACCESS
- L1D_TLB_MISS
- L1D_CACHE_MISS_ST
- L1D_CACHE_MISS_LD
- LD_UNIT_UOP
- ST_UNIT_UOP
- INST_LDST
The last one will show a red circle (error), but if you change the order to:
- L1D_TLB_ACCESS
- L1D_TLB_MISS
- L1D_CACHE_MISS_ST
- L1D_CACHE_MISS_LD
- LD_UNIT_UOP
- INST_LDST
- ST_UNIT_UOP
Basically, swap the last two counters, and everything will work fine.
So, if you ever have an issue with an incompatible counter in Apple Instruments, just try to reorder counters in the app. We’ll get back to the right order shortly.
The missing piece
Okay, the order matters, but which order is correct, and why does it matter? That question made me go through the reverse-engineered kperf client one more time to see what else I could get from it. kpep_db and kpep_event caught my eye, especially their structures.
/// KPEP event (size: 48/28 bytes on 64/32 bit OS)
typedef struct kpep_event {
const char *name; ///< Unique name of an event, such as "INST_RETIRED.ANY".
const char *description; ///< Description for this event.
const char *errata; ///< Errata, currently NULL.
const char *alias; ///< Alias name, such as "Instructions", "Cycles".
const char *fallback; ///< Fallback event name for fixed counter.
u32 mask;
u8 number;
u8 umask;
u8 reserved;
u8 is_fixed;
} kpep_event;
/// KPEP database (size: 144/80 bytes on 64/32 bit OS)
typedef struct kpep_db {
const char *name; ///< Database name, such as "haswell".
const char *cpu_id; ///< Plist name, such as "cpu_7_8_10b282dc".
const char *marketing_name; ///< Marketing name, such as "Intel Haswell".
void *plist_data; ///< Plist data (CFDataRef), currently NULL.
void *event_map; ///< All events (CFDict<CFSTR(event_name), kpep_event *>).
kpep_event *event_arr; ///< Event struct buffer (sizeof(kpep_event) * events_count).
kpep_event **fixed_event_arr; ///< Fixed counter events (sizeof(kpep_event *) * fixed_counter_count)
void *alias_map; ///< All aliases (CFDict<CFSTR(event_name), kpep_event *>).
usize reserved_1;
usize reserved_2;
usize reserved_3;
usize event_count; ///< All events count.
usize alias_count;
usize fixed_counter_count;
usize config_counter_count;
usize power_counter_count;
u32 architecture;
u32 fixed_counter_bits;
u32 config_counter_bits;
u32 power_counter_bits;
} kpep_db;
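To make the layout concrete, here is a minimal sketch that mirrors the kpep_event struct with Python’s `ctypes`. It assumes the reverse-engineered field order shown above is accurate. The `KpepEvent` class name is my own, and the event is filled in by hand (with the INST_LDST mask from the event table later in this article) rather than read from kperf.

```python
import ctypes


class KpepEvent(ctypes.Structure):
    """Mirror of the reverse-engineered kpep_event layout (an assumption, not an official API)."""

    _fields_ = [
        ("name", ctypes.c_char_p),
        ("description", ctypes.c_char_p),
        ("errata", ctypes.c_char_p),
        ("alias", ctypes.c_char_p),
        ("fallback", ctypes.c_char_p),
        ("mask", ctypes.c_uint32),
        ("number", ctypes.c_uint8),
        ("umask", ctypes.c_uint8),
        ("reserved", ctypes.c_uint8),
        ("is_fixed", ctypes.c_uint8),
    ]


# Hand-built example event; a real one would come from the kpep database.
event = KpepEvent(name=b"INST_LDST", mask=0b0010000000, is_fixed=0)

# Five pointers plus 8 bytes of integer fields: 48 bytes on a 64-bit OS.
print(ctypes.sizeof(KpepEvent), event.name.decode(), f"{event.mask:#012b}")
```

The size check is a cheap way to confirm that a binding agrees with the "48/28 bytes" note in the struct comment before trusting any field it reads.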
The first thing I did was fetch the list of counters using the kpep_db_events function and print all fields. Below I show only name, alias, and mask, because the other fields are mostly empty or not relevant.
| # | name | alias | mask |
|---|---|---|---|
| 1 | FIXED_CYCLES | Cycles | 0b0000000001 |
| 2 | FIXED_INSTRUCTIONS | Instructions | 0b0000000010 |
| 3 | RETIRE_UOP | N/A | 0b0010000000 |
| 4 | CORE_ACTIVE_CYCLE | N/A | 0b1111111100 |
| 5 | L1I_TLB_FILL | N/A | 0b1111111100 |
| 6 | L1D_TLB_FILL | N/A | 0b1111111100 |
| 7 | MMU_TABLE_WALK_INSTRUCTION | N/A | 0b1111111100 |
| 8 | MMU_TABLE_WALK_DATA | N/A | 0b1111111100 |
| 9 | L2_TLB_MISS_INSTRUCTION | N/A | 0b1111111100 |
| 10 | L2_TLB_MISS_DATA | N/A | 0b1111111100 |
| 11 | MMU_VIRTUAL_MEMORY_FAULT_NONSPEC | N/A | 0b1111111100 |
| 12 | INTERRUPT_PENDING | N/A | 0b1111111100 |
| 13 | MAP_STALL_DISPATCH | N/A | 0b1111111100 |
| 14 | MAP_REWIND | N/A | 0b1111111100 |
| 15 | MAP_STALL | N/A | 0b1111111100 |
| 16 | MAP_INT_UOP | N/A | 0b1111111100 |
| 17 | MAP_LDST_UOP | N/A | 0b1111111100 |
| 18 | MAP_SIMD_UOP | N/A | 0b1111111100 |
| 19 | FLUSH_RESTART_OTHER_NONSPEC | N/A | 0b1111111100 |
| 20 | INST_ALL | N/A | 0b0010000000 |
| 21 | INST_BRANCH | N/A | 0b0011100000 |
| 22 | INST_BRANCH_CALL | N/A | 0b0011100000 |
| 23 | INST_BRANCH_RET | N/A | 0b0011100000 |
| 24 | INST_BRANCH_TAKEN | N/A | 0b0011100000 |
| 25 | INST_BRANCH_INDIR | N/A | 0b0011100000 |
| 26 | INST_BRANCH_COND | N/A | 0b0011100000 |
| 27 | INST_INT_LD | N/A | 0b0011100000 |
| 28 | INST_INT_ST | N/A | 0b0010000000 |
| 29 | INST_INT_ALU | N/A | 0b0010000000 |
| 30 | INST_SIMD_LD | N/A | 0b0011100000 |
| 31 | INST_SIMD_ST | N/A | 0b0011100000 |
| 32 | INST_SIMD_ALU | N/A | 0b0010000000 |
| 33 | INST_LDST | N/A | 0b0010000000 |
| 34 | INST_BARRIER | N/A | 0b0011100000 |
| 35 | L1D_TLB_ACCESS | N/A | 0b1111111100 |
| 36 | L1D_TLB_MISS | N/A | 0b1111111100 |
| 37 | L1D_CACHE_MISS_ST | N/A | 0b1111111100 |
| 38 | L1D_CACHE_MISS_LD | N/A | 0b1111111100 |
| 39 | LD_UNIT_UOP | N/A | 0b1111111100 |
| 40 | ST_UNIT_UOP | N/A | 0b1111111100 |
| 41 | L1D_CACHE_WRITEBACK | N/A | 0b1111111100 |
| 42 | LDST_X64_UOP | N/A | 0b1111111100 |
| 43 | LDST_XPG_UOP | N/A | 0b1111111100 |
| 44 | ATOMIC_OR_EXCLUSIVE_SUCC | N/A | 0b1111111100 |
| 45 | ATOMIC_OR_EXCLUSIVE_FAIL | N/A | 0b1111111100 |
| 46 | L1D_CACHE_MISS_LD_NONSPEC | N/A | 0b0011100000 |
| 47 | L1D_CACHE_MISS_ST_NONSPEC | N/A | 0b0011100000 |
| 48 | L1D_TLB_MISS_NONSPEC | N/A | 0b0011100000 |
| 49 | ST_MEMORY_ORDER_VIOLATION_NONSPEC | N/A | 0b0011100000 |
| 50 | BRANCH_COND_MISPRED_NONSPEC | N/A | 0b0011100000 |
| 51 | BRANCH_INDIR_MISPRED_NONSPEC | N/A | 0b0011100000 |
| 52 | BRANCH_RET_INDIR_MISPRED_NONSPEC | N/A | 0b0011100000 |
| 53 | BRANCH_CALL_INDIR_MISPRED_NONSPEC | N/A | 0b0011100000 |
| 54 | BRANCH_MISPRED_NONSPEC | N/A | 0b0011100000 |
| 55 | L1I_TLB_MISS_DEMAND | N/A | 0b1111111100 |
| 56 | MAP_DISPATCH_BUBBLE | N/A | 0b1111111100 |
| 57 | L1I_CACHE_MISS_DEMAND | N/A | 0b1111111100 |
| 58 | FETCH_RESTART | N/A | 0b1111111100 |
| 59 | ST_NT_UOP | N/A | 0b1111111100 |
| 60 | LD_NT_UOP | N/A | 0b1111111100 |
If you take a look at the mask field, all incompatibilities suddenly become clear.
The algorithm
The mask field gives us answers to all the mysteries above:
- The maximum number of counters is 10, based on the mask width (10 bits).
- Two counters are unique: they’re compatible with any other counter because they have unique masks (fixed counters): Cycles (0b0000000001) and Instructions (0b0000000010).
- Besides these two counters, there are the 6 others that we found before (group M) that are pairwise incompatible. Their masks are the same (0b0010000000), so they all book the same slot when added, causing the incompatibility.
- There are also 19 counters that share another specific mask (0b0011100000). We found them in the earlier experiments too (group G, although due to inconsistencies between Apple’s guide and the actual list, that group contained only 18 counters). These counters are incompatible in quadruples because when you add 3 of them, they book all 3 available slots, and when you try to add the 4th one, there is no slot left. The group G mask overlaps with the group M mask, which means that a counter from group M also books one of the slots that group G counters use.
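The slot sets behind these observations can be made explicit by decoding each mask into bit positions. This is a small sketch that relies on my interpretation that every set bit is one usable counter slot; the `slots` helper and the labels are illustrative names.

```python
def slots(mask):
    """Return the indices of the set bits in `mask`, lowest bit first."""
    return [bit for bit in range(mask.bit_length()) if (mask >> bit) & 1]


# The five distinct masks that appear in the event table.
MASKS = {
    "fixed: Cycles": 0b0000000001,
    "fixed: Instructions": 0b0000000010,
    "group M": 0b0010000000,
    "group G": 0b0011100000,
    "all other counters": 0b1111111100,
}

for label, mask in MASKS.items():
    print(f"{label:20} {mask:#012b} -> slots {slots(mask)}")

# The slot that group M and group G compete for.
shared = set(slots(MASKS["group M"])) & set(slots(MASKS["group G"]))
print("shared by M and G:", sorted(shared))
```

Read this way, group M has a single slot (7), group G has three (5, 6, 7), and every other configurable counter can use any of the eight slots from 2 to 9.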
Regarding the order, I think the following statement explains how the algorithm for adding new counters works:
When we add a counter to the list, it picks the first available slot allowed by its mask, starting from the lowest bit.
The reason why order matters is that counters with a wide mask book slots starting from the right side (the lowest bits), and with enough of them they can take the only slot that a more “special” counter (with a narrow mask) is able to use. So, to get predictable behavior, it’s better to add events in a specific order. A simple ascending order (by mask) is suitable in this case, but it can be much more complicated if the masks have a more complex format.
Here is an explanation of why the Instruments app shows an error for the previous example:
| # | Counter | Mask | Resulting mask |
|---|---|---|---|
| 0 | Initial | N/A | 🟢🟢🟢🟢🟢🟢🟢🟢🟢🟢 |
| 1 | L1D_TLB_ACCESS | 1111111100 | 🟢🟢🟢🟢🟢🟢🟢🟡🟢🟢 |
| 2 | L1D_TLB_MISS | 1111111100 | 🟢🟢🟢🟢🟢🟢🟡🟡🟢🟢 |
| 3 | L1D_CACHE_MISS_ST | 1111111100 | 🟢🟢🟢🟢🟢🟡🟡🟡🟢🟢 |
| 4 | L1D_CACHE_MISS_LD | 1111111100 | 🟢🟢🟢🟢🟡🟡🟡🟡🟢🟢 |
| 5 | LD_UNIT_UOP | 1111111100 | 🟢🟢🟢🟡🟡🟡🟡🟡🟢🟢 |
| 6 | ST_UNIT_UOP | 1111111100 | 🟢🟢🟡🟡🟡🟡🟡🟡🟢🟢 |
| 7 | INST_LDST | 0010000000 | 🟢🟢🔴🟡🟡🟡🟡🟡🟢🟢 (conflict) |
But if we swap the last two counters:
| # | Counter | Mask | Resulting mask |
|---|---|---|---|
| 5 | LD_UNIT_UOP | 1111111100 | 🟢🟢🟢🟡🟡🟡🟡🟡🟢🟢 |
| 6 | INST_LDST | 0010000000 | 🟢🟢🟡🟡🟡🟡🟡🟡🟢🟢 |
| 7 | ST_UNIT_UOP | 1111111100 | 🟢🟡🟡🟡🟡🟡🟡🟡🟢🟢 |
In this case, everything works fine.
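Both walkthroughs can be reproduced with a tiny simulation of the first-fit rule. This is a model of the behaviour I observed, not the actual kperf implementation; `assign_slots` is a hypothetical helper, and the masks are taken from the event table.

```python
WIDE = 0b1111111100

# Masks of the seven counters used in the example.
MASKS = {
    "L1D_TLB_ACCESS": WIDE,
    "L1D_TLB_MISS": WIDE,
    "L1D_CACHE_MISS_ST": WIDE,
    "L1D_CACHE_MISS_LD": WIDE,
    "LD_UNIT_UOP": WIDE,
    "ST_UNIT_UOP": WIDE,
    "INST_LDST": 0b0010000000,
}


def assign_slots(names, masks=MASKS):
    """First-fit model: each counter takes the lowest free bit allowed by its mask.

    Returns (assignment, failed), where `assignment` maps counter name to slot
    index and `failed` is the first counter that found no free slot, or None.
    """
    used = 0
    assignment = {}
    for name in names:
        free = masks[name] & ~used
        if free == 0:
            return assignment, name
        lowest = free & -free  # isolate the lowest set bit
        used |= lowest
        assignment[name] = lowest.bit_length() - 1
    return assignment, None


failing_order = ["L1D_TLB_ACCESS", "L1D_TLB_MISS", "L1D_CACHE_MISS_ST",
                 "L1D_CACHE_MISS_LD", "LD_UNIT_UOP", "ST_UNIT_UOP", "INST_LDST"]
working_order = failing_order[:5] + ["INST_LDST", "ST_UNIT_UOP"]
sorted_order = sorted(failing_order, key=MASKS.get)  # ascending by mask

_, failed = assign_slots(failing_order)
ok_assignment, ok_failed = assign_slots(working_order)
sorted_assignment, sorted_failed = assign_slots(sorted_order)
print(failed, ok_failed, sorted_failed)
```

In this model the first order fails on INST_LDST because ST_UNIT_UOP has already taken slot 7. The swapped order and the mask-sorted order both succeed, with ST_UNIT_UOP landing in slot 8.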
Building a tool
Wrapping everything up, I’ve created a tool called “Lauka”. “Created” is quite a bold word: I took the original poop repository and the scoop library as a base, merged them, rewrote the CLI, and extended it. There is also new functionality, e.g. selecting events to monitor, displaying all available events, and warming up. I also dropped support for Linux and Intel-based Macs, so this tool only works on Apple Silicon Macs.
I don’t want to repeat the README.md content; you can check it out on GitHub.
Here is a small example of how to run the tool and what output it produces:
$ lauka -- "./build-old" './build-new -O2'
Benchmark 1 (9 runs): ./build-old
measurement mean ± σ min … max outliers
wall_time 591ms ± 7.6ms 583ms … 605ms 0 (0%)
peak_rss 137MB ± 0.3MB 136.6MB … 137.4MB 0 (0%)
core_active_cycle 2.51G ± 22.1M 2.48G … 2.54G 0 (0%)
inst_all 3.62G ± 23.9M 3.53G … 3.69G 0 (0%)
l1d_cache_miss_ld_nonspec 3.58M ± 31.7K 3.54M … 3.63M 0 (0%)
branch_mispred_nonspec 21.4M ± 58.2K 21.3M … 21.5M 0 (0%)
Benchmark 2 (9 runs): ./build-new -O2
measurement mean ± σ min … max outliers delta
wall_time 130ms ± 8.3ms 125ms … 141ms 0 (0%) ⚡ −78.0% ± 0.5%
peak_rss 91.9MB ± 0.09MB 91.8MB … 92.1MB 0 (0%) −32.9% ± 0.1%
core_active_cycle 507M ± 2.35M 503M … 511M 0 (0%) −79.8% ± 0.1%
inst_all 796M ± 10.7M 781M … 809M 0 (0%) −78.0% ± 0.1%
l1d_cache_miss_ld_nonspec 352K ± 7.7K 318K … 355K 0 (0%) −90.2% ± 0.1%
branch_mispred_nonspec 4.52M ± 11.5K 4.51M … 4.57M 2 (5%) −78.9% ± 0.0%
Summary
It was quite a nice journey: an almost complete lack of documentation, not much discussion, and one great piece of reverse-engineered code as the only reliable source.
Looking back, I think that I’ve made several mistakes:
- Trying to search for information only related to Apple Silicon. I think I could have found more by researching how it’s implemented on Linux first.
- Reviewing the reverse-engineered code only briefly. Doing a deeper dive earlier could have helped me avoid the combinatorial explosion.
- Spending too much time trying to figure out the final number of incompatible sets, instead of focusing on finding the root cause.
But I don’t regret anything, because the experience was unforgettable. And I like that I’ve made these mistakes.
I hope you found the research quite interesting and useful. Thanks for reading!