Incremental builds in large projects can incur significant overhead. This is typically mitigated by splitting the project into several components. Other common approaches reduce the number of task objects to process, for instance by creating task batches or by running unity builds. Such optimizations are usually implemented as Waf extensions because they tend to add assumptions that may exclude useful scenarios.

Creating such optimizations is typically an iterative process: start from a profile, make a few assumptions, and then implement a solution. The purpose of this post is to illustrate this process by improving the incremental build times of a very large simulated project. As a disclaimer, it must be mentioned that the simulated project does not correspond to a real-world case and is only meant for illustration purposes.

A large test project

A project generator in the Waf source tree can be used to generate a C++ project. In this case there will be 2,000 static libraries of 500 classes each (one source file and one header per class). Each source file randomly includes 15 headers from the same library and 5 headers from other libraries of the project.

$ ./utils/genbench.py /temp/build 2000 500 15 5

This project thus contains 1 million mostly empty C++ files (plus 1 million headers) for a total of 2,000 static libraries. A full build takes 55 minutes (this can be reduced to 20 minutes by using nobuild), and an incremental build takes about 7 minutes after changing one file. Let us view the profile for the incremental build by using the --profile switch:

$ waf --profile
'build' finished successfully (7m21.910s)
Sat Oct 14 16:25:35 2017    profi.txt

   328869198 function calls (324860660 primitive calls) in 441.939 seconds

   Ordered by: internal time
   List reduced from 538 to 75 due to restriction <75>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
  6000011  215.459    0.000  215.459    0.000 {method 'read' of 'file' objects}
        1   15.830   15.830   16.458   16.458 {cPickle.loads}
  3000015   13.877    0.000   13.877    0.000 {open}
 41558744   12.361    0.000   16.030    0.000 waf/waflib/Task.py:233(priority)
 22967867   12.203    0.000  264.451    0.000 waf/waflib/Node.py:891(get_bld_sig)
 32971867   11.432    0.000   11.432    0.000 {method 'update' of '_hashlib.HASH' objects}
  3000000   10.460    0.000  247.576    0.000 waf/waflib/Utils.py:250(h_file)
 20779372    9.328    0.000   25.358    0.000 waf/waflib/Task.py:192(__lt__)
  4006076    8.443    0.000    8.443    0.000 {posix.stat}
  1000000    8.331    0.000   77.843    0.000 waf/waflib/Task.py:790(compute_sig_implicit_deps)
  3010003    7.597    0.000    8.411    0.000 waf/waflib/Node.py:378(make_node)

Nothing really stands out at this point, aside from the massive number of tasks to process (one million).

Assumptions

We will now list a few assumptions that apply to this project and try to derive some optimizations:

  • There are lots of targets with link tasks and inter-dependent libraries:
    • Object-only targets can be ignored for the time being.
    • Each task generator holds a fraction of the tasks from the total.
  • The project phase is simple:
    • Full project builds without --targets or pruning from sub-folders
    • The installation phase is unnecessary
  • Task generators are statically defined:
    • task generator source files are not obtained from globs
    • use= dependencies are fully described in user scripts
  • The project is large, so it is unlikely to be located on a network file system; file modification times (timestamps) are therefore reliable.

One simple idea is to avoid creating all task objects when performing incremental builds: persist a mapping between task generators and their dependency files, and use file timestamps to detect changes very quickly. Subsequent builds can then detect which files have changed and process only the relevant task generators, getting through the whole build much faster. The sketch below illustrates the idea.
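As a minimal sketch of the idea (the mapping layout and the stale_tgens helper below are illustrative only, not part of Waf):

import os

def stale_tgens(mapping):
  # mapping: {task generator name: {file path: recorded timestamp}}
  stale = []
  for name, files in mapping.items():
    for path, old_ts in files.items():
      try:
        changed = os.stat(path).st_mtime != old_ts
      except OSError:
        # a dependency disappeared, so a rebuild is needed
        changed = True
      if changed:
        stale.append(name)
        break
  return stale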

Adding a Waf extension

We do not really want to modify the Waf implementation at this point, so we will instead add all the code for our experiments to a Waf extension. The file fast_partial.py is created in the project directory, and then referenced in the wscript file by loading the extension from the options function:

#! /usr/bin/env python
# encoding: utf-8

top  = '.'
out  = 'out'

def options(opt):
  opt.load('compiler_cxx')
  opt.load('fast_partial', tooldir='.')

def configure(conf):
  ...

No error is raised if the extension does not provide any command-line options. Since the options function runs on every Waf command, the classes defined in the extension are registered when the build starts, so that no project re-configuration is necessary. The sections below will now focus on the implementation in fast_partial.py.

Mapping files to task generators

The first step is to create a build context subclass to create or update this mapping. Since the store method is called after a build is finished, we override it and have it call a new store_tstamps method to create or update our new mapping (a sketch follows the outline below).

class bld(Build.BuildContext):
  def store(self):
    # called once the build completes: update the timestamp mapping
    # before the regular build data is saved
    self.store_tstamps()
    Build.BuildContext.store(self)
  def store_tstamps(self):
    # create or update the mapping of task generators to file timestamps
    pass

  def get_build_iterator(self):
    # filter out up-to-date task generators before creating task objects
    pass
  def compute_needed_tgs(self):
    # mark the task generators that must create their tasks
    pass
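As an illustration, store_tstamps could record, for each task generator that produced tasks, the timestamps of its dependency files. The f_deps attribute below is a made-up name for this sketch (a plain dict, assumed to be initialized elsewhere); the actual extension in the Waf source tree uses its own layout:

import os
from waflib import Build

class bld(Build.BuildContext):
  # continuing the build context subclass above (sketch)
  def store_tstamps(self):
    # map each task generator name to its dependency timestamps
    for group in self.groups:
      for tg in group:
        if not getattr(tg, 'tasks', None):
          continue
        deps = set()
        for tsk in tg.tasks:
          deps.update(tsk.inputs)
          deps.update(self.node_deps.get(tsk.uid(), []))
        self.f_deps[tg.name] = [
          (x.abspath(), os.stat(x.abspath()).st_mtime) for x in deps]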

To apply the mapping, the method get_build_iterator is then overridden to call a new compute_needed_tgs method which will apply the filter. Finally, a new is_stale method is bound to task generators to detect when tasks really need to be created:

@taskgen_method
def is_stale(self):
  pass
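A rough sketch of what is_stale might check, reading the mapping saved by store_tstamps (the real extension is more thorough, also accounting for missing build records and configuration changes):

import os
from waflib.TaskGen import taskgen_method

@taskgen_method
def is_stale(self):
  # compare the recorded timestamps with the current ones
  try:
    deps = self.bld.f_deps[self.name]
  except (AttributeError, KeyError):
    return True  # no record of a previous build: tasks must be created
  for path, old_ts in deps:
    try:
      if os.stat(path).st_mtime != old_ts:
        return True
    except OSError:
      return True  # a dependency disappeared
  return False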

With these changes, the runtime already shows a nice improvement. There remains a noticeable gap between an incremental build processing one file and a no-op build with all files up-to-date:

$ echo " " >> lib_512/class_3.cpp ; waf; waf
Waf: Entering directory `/temp/build/out'
[189/501] Compiling lib_512/class_3.cpp
Waf: Leaving directory `/temp/build/out'
'build' finished successfully (1m16.846s)

Waf: Entering directory `/temp/build/out'
Waf: Leaving directory `/temp/build/out'
'build' finished successfully (32.900s)

We will now revisit the profile data for the incremental build:

'build' finished successfully (1m20.945s)
Sun Oct 15 14:59:15 2017    profi.txt

   23311106 function calls (23305549 primitive calls) in 80.967 seconds

   Ordered by: internal time
   List reduced from 591 to 75 due to restriction <75>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        2   43.460   21.730   44.590   22.295 {cPickle.dumps}
        2   22.246   11.123   22.876   11.438 {cPickle.loads}
  7956254    4.277    0.000    7.390    0.000 fast_partial.py:270(tstamp)
  3011352    3.128    0.000    3.128    0.000 {posix.stat}
     2000    2.269    0.001   10.683    0.005 fast_partial.py:228(is_stale)
     2002    1.002    0.001    1.002    0.001 {zip}

The next issue is that the main data file still needs to be loaded into memory when a build starts (322MB):

$ du -sm /temp/build/out/.wafpickle
322     /temp/build/out/.wafpickle-linux2-34016752-20
358     /temp/build/out/.wafpickle_tstamp_db_file

Splitting the main data file

One obvious approach is to have one such data file per task generator instead of a global one for the whole build. To do this, we remember that each task generator is bound to a build context, and we then provide a build context object for each task generator. This is achieved by adding a new build context proxy class:

class bld_proxy(object):
  def __init__(self, bld):
    object.__setattr__(self, 'bld', bld)
    ...

  def __setattr__(self, name, value):
    bld = object.__getattribute__(self, 'bld')
    setattr(bld, name, value)

  def __delattr__(self, name):
    bld = object.__getattribute__(self, 'bld')
    delattr(bld, name)

  def __getattribute__(self, name):
    try:
      return object.__getattribute__(self, name)
    except AttributeError:
      bld = object.__getattribute__(self, 'bld')
      return getattr(bld, name)

  ...

Proxy objects reference a global build context: attributes set through __setattr__ are forwarded to the global context, while attribute lookups in __getattribute__ first try the proxy itself and then fall back to the global context.

Next, the proxy re-implements the store and restore methods so that build data is actually read and written per target; a sketch follows. Since the build cache serializes Node objects, it is also necessary for the proxy to provide its own Node class to avoid Python serialization errors.
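For illustration only, per-target persistence could look like the following. Here store_key is an assumed attribute holding a per-target file suffix (set in __init__), and SAVED_ATTRS is the list of attributes that Waf persists between builds:

import os
try:
  import cPickle
except ImportError:
  import pickle as cPickle
from waflib import Build, Context

class bld_proxy(object):
  # continuing the proxy class shown above (sketch)
  def store(self):
    # write the build data to one file per task generator
    data = {}
    for x in Build.SAVED_ATTRS:
      data[x] = getattr(self, x)
    db = os.path.join(self.variant_dir, Context.DBFILE + self.store_key)
    with open(db, 'wb') as f:
      cPickle.dump(data, f, -1)

  def restore(self):
    # load the per-target file if present, else start from scratch
    db = os.path.join(self.variant_dir, Context.DBFILE + self.store_key)
    try:
      with open(db, 'rb') as f:
        data = cPickle.load(f)
    except EnvironmentError:
      return
    for x in Build.SAVED_ATTRS:
      # set on the proxy itself so lookups do not fall back to the global context
      object.__setattr__(self, x, data[x])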

Finally, a new instance of the proxy is created whenever a new task generator is instantiated. The __call__ method of our main build context is overridden to this effect:

class bld(Build.BuildContext):
  def __call__(self, *k, **kw):
    # bind a proxy to each new task generator through the 'bld' keyword
    bld = kw['bld'] = bld_proxy(self)
    ret = TaskGen.task_gen(*k, **kw)
    self.task_gen_cache_names = {}
    self.add_to_group(ret, group=kw.get('group'))
    ret.bld = bld
    return ret

Since timestamps are being used, it becomes tempting to leverage the md5_tstamp extension, which hashes files only when a change in timestamp is detected. This will not work out of the box because of the new proxy, so a custom implementation is necessary; it can be provided by replacing the default method through monkey-patching:

def h_file(self):
  ...
waflib.Node.Node.h_file = h_file
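The replacement can follow the same pattern as md5_tstamp: cache the hash keyed by file path and reuse it while the timestamp is unchanged. The module-level cache below is illustrative; the actual extension stores its cache differently:

import os
from waflib import Node, Utils

# illustrative cache: {file path: (timestamp, hash)}
TSTAMP_CACHE = {}

def h_file(self):
  # reuse the cached hash while the file timestamp is unchanged,
  # otherwise hash the file contents and refresh the cache entry
  filename = self.abspath()
  st = os.stat(filename)
  entry = TSTAMP_CACHE.get(filename)
  if entry and entry[0] == st.st_mtime:
    return entry[1]
  ret = Utils.h_file(filename)
  TSTAMP_CACHE[filename] = (st.st_mtime, ret)
  return ret
Node.Node.h_file = h_file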

With this scheme, incremental build times are substantially reduced:

$ echo " " >> lib_512/class_3.cpp; waf; waf
Waf: Entering directory `/temp/build/out'
[189/501] Compiling lib_512/class_3.cpp
Waf: Leaving directory `/temp/build/out'
'build' finished successfully (9.515s)

Waf: Entering directory `/temp/build/out'
Waf: Leaving directory `/temp/build/out'
'build' finished successfully (6.693s)

The full build time remains essentially the same, though memory consumption increases substantially (from 4GB to 9GB). Full builds also tend to start processing the first tasks a little faster.

Conclusion

This post has shown the application of a few optimizations to improve the incremental build time of a large simulated project, after reviewing profile data and making assumptions about the build structure. The incremental build time went down from over 7 minutes to about 10 seconds.

The current implementation can be found in the extras folder of the Waf 2.0 source tree. A few of the assumptions above, such as the restriction on globs, can be worked around, but this is deemed unnecessary for the purposes of this article.