experimental new pass manager

The LLVM Project Blog

LLVM Project News and Details from the Trenches

The New Pass Manager

Llvm’s new pass manager, what is a pass manager.

A pass manager schedules transformation passes and analyses to be run on IR in a specific order. Passes can run on an entire module, a single function, or something more abstract such as a strongly connected component (SCC) in a call graph or a loop inside of a function. Scheduling can be simple, such as running a list of module passes, or running function passes on every function inside a module. Scheduling can also be more involved, such as making sure we visit SCCs in the call graph in the correct order.

A pass manager is also responsible for managing analysis results. Analyses (e.g. dominator tree) should be shared between passes whenever possible for efficiency reasons, since recomputing analyses can be expensive. To do so, the pass manager must cache results and recompute them when they are invalidated by transforms.

For testing purposes, we can add specific passes to a pass manager to test those passes. However, the typical use case is to run a predetermined pass pipeline. For example, clang -O2 runs a predetermined set of passes on the input IR.

What is LLVM’s new pass manager?

LLVM currently has two separate pass managers: the legacy pass manager (legacy PM) and the new pass manager (new PM). When referring to “legacy PM” and “new PM”, this includes all of the surrounding infrastructure, not just the entity that manages passes.

The legacy PM has been in use for a very long time and did its job fairly well. However, there were some missing features required for better optimization opportunities, most notably the ability to use function analysis results for arbitrary functions from the inliner. The specific motivating use case was that the inliner wanted to look at the profile data of callees recursively, especially in regards to deferred inlining where the inliner wants to look through simple “wrapper” functions. The legacy PM did not support retrieval of analyses for arbitrary functions in a CGSCC pass. A CGSCC pass runs on a specific strongly connected component (SCC) of the call graph. The pass manager makes sure we visit SCCs bottom-up so that callees are as optimized as possible when we get to their callers and callers have as precise information as possible. LLVM’s inliner is a CGSCC pass due to being a bottom-up inliner. This major limitation of the legacy PM, along with other warts, prompted the desire for a new pass manager.

Currently the new PM applies only to the middle-end optimization pipeline working with LLVM IR. The backend codegen pipeline still works only with the legacy PM, mostly because most codegen passes don’t work on LLVM IR, but rather machine IR (MIR), and nobody has yet put in the time to create the new PM infrastructure for MIR passes and to migrate all of the backends to use the new PM. Migrating to the new PM for the codegen pipeline likely won’t unlock performance gains since there are almost no interprocedural codegen passes. However, it would clean up a lot of technical debt.

With the legacy PM, each pass declares which analyses it requires and preserves, and the pass manager schedules those analyses as passes to be run if they aren’t currently cached or have been invalidated. Declaring ahead of time which analyses a pass may need is unnecessary boilerplate, and a pass might not end up using all analyses in all cases.

The new PM takes a different approach of completely separating analyses and normal passes. Rather than having the pass manager take care of analyses, a separate analysis manager is in charge of computing, caching, and invalidating analyses. Passes can simply request an analysis from the analysis manager, allowing for lazily computing analyses. In order for a pass to communicate that analyses have been invalidated, it returns which analyses it has preserved. The pass manager tells the analysis manager to handle invalidated cached analyses. This results in less boilerplate and better separation of concerns between passes and analyses.

Since the legacy PM modelled analyses as passes to be scheduled and run, we can’t efficiently access analyses to arbitrary functions. For a function analysis, the corresponding analysis pass will only contain the info for the current function, which is created during the latest run of the analysis pass. We can manually create analyses for other functions, but they won’t be cached anywhere, leading to lots of redundant work and unacceptable compile time regressions. Since analyses are handled by an analysis manager in the new PM, the analysis manager can cache arbitrary analyses for arbitrary functions.

To support CGSCC analyses, we need a key to cache analyses. For things like functions and loops, we have persistent data structures for those to use as keys. However, the legacy CGSCC pass manager only stored the functions in the current SCC in memory and did not have a persistent call graph data structure to use as keys to cache analyses. So we need to keep the whole graph in memory to have something to use as a key. And if we have a persistent call graph, we need to make sure it is up to date if passes change its structure. To avoid too much redundant work regenerating a potentially large but sparse graph, we need to incrementally update the graph. This is the reason behind the complexity of the CGSCC pass manager in the new PM.

Within an SCC, a transform might break a call graph cycle and split the SCC. One issue with the legacy CGSCC infrastructure is that it simply stores all the functions in the current SCC in an array, then iterates through the functions in that order without ever revisiting functions. Consider the following SCC containing two functions.

Say we first visit foo, then visit bar and remove the dead call.

We now want to revisit foo since we have better information, most notably that foo is in its own SCC. The legacy CGSCC pass manager would simply move on to the next part of the call graph. So as part of the new PM’s incremental call graph update, if an SCC is split, we make sure to visit the newly split SCCs bottom-up. This may involve revisiting a function we have already visited, but that is intentional as to give passes a chance to observe more precise information.

When adding passes to the legacy pass manager, the nesting of different pass types is implicit. For example, adding function passes after a module pass implicitly creates a function pass manager over a contiguous list of function passes. This is fine in theory, although it can be a little confusing. And some pipelines want to run a CGSCC pass independently of a function pass that comes right after, rather than nesting the function pass into the CGSCC pass via a CGSCC pass manager. The new PM makes the nesting more explicit by only allowing pass managers to contain passes of the equivalent type. For example, a function pass manager can only contain function passes. To add a loop pass to a function pass manager, the loop pass must be wrapped in a loop-to-function adaptor to turn it into a function pass. The IR nesting in the new PM is module (-> CGSCC) -> function -> loop, where the CGSCC nesting is optional. Requiring the CGSCC nesting was considered to simplify things, but the extra runtime overhead of building the call graph and the extra code for proper nesting to run function passes was enough to make the CGSCC nesting optional.

The legacy pass manager relies on many global flags and registries. This is supported by macros generating functions and variables to initialize passes, and any users of the legacy pass manager must make sure to call a function to initialize these passes. But we need some way for a pass manager builder to be aware of all passes for testing purposes. The way the new PM does this is by having the pass manager builder include the definitions of all passes, then use a large mapping of pass IDs to pass constructors to create a function that parses a textual description of a pipeline and adds passes. Users of a pass manager builder can add plugins that register parsing callbacks to handle custom out-of-tree passes. Although there is a global list of functions, there is no mutable global state since each pass manager builder can parse pass pipelines without going through a global registry. Other options, like debugging the execution of a pass manager, are also specified via the constructor, and not through a global flag.

There has been a desire to parallelize LLVM passes for a long time. Although the pass manager infrastructure is not the only blocker, the legacy PM did have a couple of issues blocking parallelization. At the call graph level, only sibling SCCs can be parallelized. Creating SCCs on demand makes it hard to find sibling SCCs. The new PM’s computation of the entire call graph makes it easy to find sibling SCCs to parallelize SCC passes on. Module analyses can be computed from function passes in the legacy PM. Some passes only use analyses if they are cached, so parallelization can cause non-determinism since a module analysis may or may not exist based on other parallel pipelines. The new PM only allows function passes to access cached module analyses and does not allow running them. This has the downside of needing to make sure that certain higher-level analyses are present before running a lower-level pipeline, e.g. making sure GlobalsAA has been computed before running a function pipeline.

Making the new pass manager the default pass manager

Some major users of LLVM switched to using the new PM by default many years ago. There were some efforts upstream to make the new PM work for all use cases. For example, all Clang tests had been passing with the new PM for a while. However, a vast majority of LLVM tests were still only testing the legacy PM. opt , the program typically used to test passes, had syntax to run passes using the legacy PM, opt -instcombine , and syntax to run passes using the new PM, opt -passes=instcombine . The vast majority of tests used the legacy PM syntax, so if the new PM were to be switched on by default, most LLVM tests wouldn’t be testing the new PM passes. (a good number of tests already manually ran against both)

To make tests using opt run against the new PM, we can either manually make them run twice, once against the legacy PM and once against the new PM, or we can automatically translate opt -instcombine to opt -passes=instcombine when the new PM is on by default. Rather than update every test, an -enable-new-pm option was added to opt , which translates the legacy syntax to the new syntax.

With this new option, we started discovering what features the legacy PM had that existing users of the new PM weren’t concerned with. Turning this on locally of course initially caused many tests to fail. Many passes hadn’t yet been ported to the new PM and some opt features didn’t work with the new PM. We ported passes and features that made sense to port to the new PM, and pinned tests using legacy PM features that didn’t make sense to port to the new PM.

Some of the more interesting issues with the new PM uncovered with -enable-new-pm :

The optnone function attribute didn’t cause optional passes to be skipped. Using the existing pass instrumentation framework, which calls callbacks before and after running a pass, and also allows passes to be skipped, this was a very simple pass instrumentation. However, some passes must be run to preserve correctness, so we ended up marking some passes as required.
Opt-bisect wasn’t supported in the new PM. It is used for bisecting which pass in a pipeline causes a miscompile by skipping passes after a certain point. This similarly was fairly easily implemented via pass instrumentation. This similarly always runs required passes.
Various target-specific tests were failing. Upon inspection, some passes that were expected to be run in something like the -O2 pipeline weren’t being run. Some backend targets add custom passes into the default pipelines. Some of these passes are required for correctness, such as passes to lower target-specific intrinsics. The legacy PM had a way for a TargetMachine to inject passes into default pipelines via TargetMachine::adjustPassManager() . A new PM equivalent was introduced and the target-specific passes in the optimization pipeline were ported to the new PM. This wasn’t previously an issue because existing users of the new PM were mostly concerned with x86, which didn’t use this feature in the legacy PM.
Some coroutine tests were asserting in the CGSCC infrastructure. It turns out that the new PM CGSCC infrastructure didn’t support extracting parts of a function into another (aka outlining) in a CGSCC pass. There were some initial failed attempts at hacks to work around this issue, which didn’t properly update the call graph and didn’t handle recursion from newly outlined functions. Finally we came up with a solution that fit into the existing CGSCC infrastructure and properly kept the call graph valid, although coroutine-specific call graph transformations had to be accomodated.

Improvements

Various projects/companies have already been using the new PM for performance reasons for many years. Separately, Chrome recently started using PGO and ThinLTO to make Chrome faster, each with noticeable performance wins. After the new PM was turned on by default in LLVM, Chrome followed suit and turned on the new PM, seeing 3-4% improvements in Speedometer 2.0 for Linux and Windows, on top of a 8-9MB size decrease. It’s likely that better usage of profile information as well as better handling of larger ThinLTO call graphs lead to these improvements.

However, smaller applications with tiny hotspots likely won’t see much benefit from the new PM since the improvements brought on by the new PM tend to be more relevant to large codebases.

Aside from user-facing improvements, this also helps LLVM’s code health by standardizing on one of the two pass managers for the optimization. While we can’t yet remove the legacy pass manager, we can start the deprecation of it, at least for the optimization pipeline. Then hopefully at some point we can start to remove parts of the optimization pipeline that are legacy PM-specific.

What’s next?

To begin the process of removing the use of the legacy PM in the optimization pipeline, we need to make sure that anything using the legacy PM has an alternative using the new PM. Just to list a couple: bugpoint, the LLVM C API, GPU divergence analysis.

As mentioned before, the codegen pipeline still only works with the legacy PM. Although there has been work to start making the codegen pipeline work with the new PM, it is still very far from being usable. This is a great entry point into LLVM, please ask on llvm-dev for more information if you’re interested.

Cling -- Beyond Just Interpreting C++
Women in Compilers and Tools Meetup Series

This is an archive of the discontinued LLVM Phabricator instance.

As per the discussion on D58375 , we disable test that have optimizations under the new PM. This patch adds -fno-experimental-new-pass-manager to RUNS that:

Already run with optimizations ( -O1 or higher) that were missed in D58375 .
Explicitly test new PM behavior along side some new PM RUNS, but are missing this flag if new PM is enabled by default.
Specify -O without the number. Based on getOptimizationLevel() , it seems the default is 2, and the IR appears to be the same when changed to -O2 , so update the test to explicitly say -O2 and provide -fno-experimental-new-pass-manager`.

Diff Detail

Event timeline.

I understand the change to explicitly say -O2 . I also understand the change to add an explicit -fno-experimental-new-pass-manager to a RUN line when we have another RUN line that explicitly uses -fexperiemntal-new-pass-manager .

But for many of these cases, there is no new-PM RUN line in the test. If we're going to fix one RUN line to a particular pass manager, we should fix both IMO unless there is something quite odd about the IR being tested that makes the result of optimizinng be weirdly different between the two pass managers. That would be somewhat surprising, so I kinda want to know if we actually see that so often to understand what is causing this divergence.

In general, Clang tests shouldn't be checking such complex optimizations that the differences between the pass managers really manifests.....

I did some more digging and found the following differences between PMs for each test, and they seem to all differ and can be fixed for different reasons.

CodeGen/aggregate-assign-call.c : The new PM on -O1 codegen produces the do/while loop differently but still emits the lifetime start/end intrinsics around temporary variables, which is what the test checks for. Here we can just check for the instructions while ignoring the branches generated.

CodeGen/arm_acle.c : A bunch of instructions seem to have been combined into a call to llvm.fshl.i32 , so we just check for those under the new PM.

CodeGen/available-externally-suppress.c : available_externally was not emitted during -O2 -flto runs when it should still be retained for link time inlining purposes. This can be fixed by checking that we aren't LTOPrelinking when adding the EliminateAvailableExternallyPass .

CodeGen/pgo-sample.c [No new fix] : The test checks that PruneEH runs, but there doesn’t seem to be a corresponding new PM pass for it. Should there be? If there is one then we can just check for that in the debug PM dump.

CodeGen/x86_64-instrument-functions.c [No new fix] : This one's a bit complicated. The initial problem seems to be that EntryExitInstrumenterPass was not added into the pipeline to emit __cyg_profile_func_enter/exit() calls. Adding that to the pipeline seems to instrument correctly now and add these calls, but we run into a problem with inlining. root() expects 2 calls to __cyg_profile_func_enter and 2 calls to __cyg_profile_func_exit in its body,

but we get only 1 call

I suspect this is due to the leaf() being inlined into root() even though inlining should happen after it. The legacy EntryExitInstrumenter pass seems to be added to the legacy FunctionPassManager first before all others passes. Under the legacy PM, when this pass runs, root() should look like:

but root() looks like this in the new PM pass:

CodeGenCXX/auto-var-init.cpp : Auto var initialization is performed differently under new PM than legacy. With initialization for structs with default values, the legacy PM will store the stream of 0xAAs, then initialize the members with default values in a constructor, whereas the new PM will just store the correct value for the whole struct right away without initialization. In one case, it seems that ctor function was also inlined. Fixed by check accounting for these checks separately.

CodeGenCXX/conditional-temporaries.cpp : The new pm seems to be unable to statically determine the return value of some functions. For now I just add separate checks for the new PM IR.

CodeGenCXX/member-function-pointer-calls.cpp : Same as CodeGenCXX/conditional-temporaries.cpp . In this case, the new PM codegen also contains calls to lifetime start/end intrinsics. Perhaps this time it could be due to the intrinsics not being optimized out?

CodeGenObjC/os_log.m : Legacy test expects argument of ptrtoint and some functions to be a specific type and argument, though the new PM codegen uses a different type and argument which are also valid since @llvm.objc.retainAutoreleasedReturnValue(val) always returns val .

CodeGenObjCXX/os_log.mm : A function seems to have been inlined under the new PM. Here we just prevent inlining for the new PM run since the test expects this specific function call.

Misc/pr32207.c [No new fix] : The banner is something that appears only in the legacy PM. I think the test only checks that the banner is printed and not so much that the particular pass is run.

In D63156#1543937 , @leonardchan wrote: I did some more digging and found the following differences between PMs for each test, and they seem to all differ and can be fixed for different reasons.

Awesome, and sorry this is such a big effort, but I think these are all going to end up being pretty interesting root causes. I guess yay, Clang's testing is doing its job?

Weird, but makes sense.

Yikes, this is a good bug to fix!

The analog would be function-attrs and function(simplify-cfg) in some combination. Maybe this should just check that some specific thing is optimized away at -O2 rather than checking the specific details of pass pipeline construction?

CodeGen/x86_64-instrument-functions.c [No new fix] : This one's a bit complicated. The initial problem seems to be that EntryExitInstrumenterPass was not added into the pipeline to emit __cyg_profile_func_enter/exit() calls. Adding that to the pipeline seems to instrument correctly now and add these calls, but we run into a problem with inlining. root() expects 2 calls to __cyg_profile_func_enter and 2 calls to __cyg_profile_func_exit in its body, ; Function Attrs: nounwind define i32 @root(i32 returned %x) #0 { entry: %0 = tail call i8* @llvm.returnaddress(i32 0) tail call void @__cyg_profile_func_enter(i8* bitcast (i32 (i32)* @root to i8*), i8* %0) #2 tail call void @__cyg_profile_func_enter(i8* bitcast (i32 (i32)* @leaf to i8*), i8* %0) #2 tail call void @__cyg_profile_func_exit(i8* bitcast (i32 (i32)* @leaf to i8*), i8* %0) #2 tail call void @__cyg_profile_func_exit(i8* bitcast (i32 (i32)* @root to i8*), i8* %0) #2 ret i32 %x } but we get only 1 call ; Function Attrs: norecurse nounwind readnone define i32 @root(i32 returned %x) local_unnamed_addr #0 { entry: %0 = call i8* @llvm.returnaddress(i32 0) call void @__cyg_profile_func_enter(i8* bitcast (i32 (i32)* @root to i8*), i8* %0) %1 = call i8* @llvm.returnaddress(i32 0) call void @__cyg_profile_func_exit(i8* bitcast (i32 (i32)* @root to i8*), i8* %1) ret i32 %x } I suspect this is due to the leaf() being inlined into root() even though inlining should happen after it. The legacy EntryExitInstrumenter pass seems to be added to the legacy FunctionPassManager first before all others passes. Under the legacy PM, when this pass runs, root() should look like: ; Function Attrs: nounwind define i32 @root(i32 %x) #1 { entry: %x.addr = alloca i32, align 4 store i32 %x, i32* %x.addr, align 4, !tbaa !2 %0 = load i32, i32* %x.addr, align 4, !tbaa !2 %call = call i32 @leaf(i32 %0) ; this call is not yet inlined ret i32 %call } but root() looks like this in the new PM pass: ; Function Attrs: norecurse nounwind readnone define i32 @root(i32 returned %x) local_unnamed_addr #1 { entry: ret i32 %x }

Yeah, I think the entry/exit pass really wants to come *before* any other pass. That should be possible even in the new PM by using an extension point?

Cool, makes sense.

Interesting. Definitely the right thing for this patch. Maybe file a PR to track understanding why the new PM struggles here.

Weird, but that's also my best guess. Again, maybe we need a bug to track why this is different.

CodeGenObjC/os_log.m : Legacy test expects argument of ptrtoint and some functions to be a specific type and argument, though the new PM codegen uses a different type and argument which are also valid since @llvm.objc.retainAutoreleasedReturnValue(val) always returns val . CodeGenObjCXX/os_log.mm : A function seems to have been inlined under the new PM. Here we just prevent inlining for the new PM run since the test expects this specific function call.

You could also use a noinline attribute in the code to express the *specific* call that needs to be retained?

Yeah, this seems far enough down in the weeds that just making it PM-specific is about the only option.

For now I added a check that those 2 extra passes are run in the debug pass dump. It seems that the original test just wanted to see that a particular pass was run, so I left it at that, but I can still add a separate test/run for checking that function-attrs and function(simplify-cfg) replace invoke instructions at -O2 (which is what I think PruneEH does).

Moved this into D63577 .

Created https://bugs.llvm.org/show_bug.cgi?id=42333 to track these 2.

Hmm noinline doesn't seem to be working in this case. The call is made to a code generated builtin function ( __os_log_helper ).

just a minor comment on one of these...

Revision Contents

Diff 206013, cfe/trunk/test/codegen/aggregate-assign-call.c, cfe/trunk/test/codegen/arm_acle.c, cfe/trunk/test/codegen/cspgo-instrumentation.c, cfe/trunk/test/codegen/cspgo-instrumentation_lto.c, cfe/trunk/test/codegen/cspgo-instrumentation_thinlto.c, cfe/trunk/test/codegen/pgo-instrumentation.c, cfe/trunk/test/codegen/thinlto-debug-pm.c, cfe/trunk/test/codegencxx/auto-var-init.cpp, cfe/trunk/test/codegencxx/conditional-temporaries.cpp, cfe/trunk/test/codegencxx/member-function-pointer-calls.cpp, cfe/trunk/test/codegenobjc/os_log.m, cfe/trunk/test/codegenobjcxx/os_log.mm, cfe/trunk/test/misc/pr32207.c.