How to Read the Source Code of a Large Program

Mitmproxy as an example

You should read the source code starting from a particular version rather than the latest one.

Recently, I read an article, 如何以“正确的姿势”阅读开源软件代码 (roughly, “How to Read Open Source Code the ‘Right Way’”), about how to read and benefit from open source software. It suggests the following steps:

Clone the project to your machine.
Check the release list of the project (on GitHub).
Find a release version that you can fully understand, for example, 1.0 or earlier.
Make sure you understand the code of that version.
Then pick the important releases and read their code.
Finally, you can move on to the latest code.
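
The first steps amount to listing a project's release tags and walking them from oldest to newest. Here is a minimal sketch of the ordering part, assuming the tags look like `v0.9` or `v0.18.1`. The helper names `version_key` and `oldest_first` are made up for this post, and the tag list is hypothetical (in practice it would come from something like `git tag`):

```python
import re


def version_key(tag):
    """Turn a tag such as 'v0.18.1' into a tuple of ints: (0, 18, 1)."""
    return tuple(int(part) for part in re.findall(r"\d+", tag))


def oldest_first(tags):
    """Sort release tags numerically, so that v0.9 comes before v0.18."""
    return sorted(tags, key=version_key)


if __name__ == "__main__":
    # Hypothetical tag list, not the real release list of any project.
    print(oldest_first(["v0.18", "v0.9", "v0.10", "v0.18.1"]))
```

A plain string sort would put `v0.10` before `v0.9`, which is why the tags are compared as tuples of numbers.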

For a small project, such as a handy pentesting script, it is easy to figure out its structure and implementation details. But for a “Big Buddy” like Mitmproxy, I have to say that I had tried many times and usually got lost somewhere.

At first I didn’t think this was a good method. It sounded kind of stupid: starting from a “particular version” means you probably need to start from the very first release, whose features may since have been removed or refactored in a completely different way. But from another perspective, tracking these releases helps you understand how the contributors deal with problems. Why did they split a single file into many? Why did they rewrite a class this way? Why did they rename a variable? Or why do they keep working on this project and love it so much? :P

I started reading from the first release, v0.22, released on October 31, 2012. This version implemented nothing but netlib. The code is straightforward and only a few hundred lines long, and I could not imagine how much time and effort had been devoted to this project to bring it to more than 6,000 commits! It took me more than two weeks to go through all the major releases up to v0.18. I took some notes to help myself understand its structure.

Notes-of-Mitmproxy

Only reading the code is not enough; the reading becomes more valuable if you find some highlights in it. Here I pick out some awesome implementations to share with you.

Script

The script design in Mitmproxy makes the program extensible. But the implementation can be tricky, especially with multiple scripts: the API should be clear to users, and a script must not interfere with the main functions of the program. Since v0.18, the script module has been defined as an addon, and the contributors made every script act like an addon. There are three key points I’d like to mention:

scriptenv(path, args)

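@contextlib.contextmanager
def scriptenv(path, args):
    # Save the old argv and let the script see its own path and arguments.
    oldargs = sys.argv
    sys.argv = [path] + args
    # Let the script import modules that live next to it.
    script_dir = os.path.dirname(os.path.abspath(path))
    sys.path.append(script_dir)
    try:
        yield  # the body of the "with scriptenv(...)" block runs here
    except SystemExit as v:
        ctx.log.error("Script exited with code %s" % v.code)
    except Exception:
        etype, value, tb = sys.exc_info()
        # Trim the traceback so the report starts inside the script.
        tb = cut_traceback(tb, "scriptenv").tb_next
        ctx.log.error(
            "Script error: %s" % "".join(
                traceback.format_exception(etype, value, tb)
            )
        )
    finally:
        # Restore the interpreter state whether or not the script failed.
        sys.argv = oldargs
        sys.path.pop()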
Using contextmanager here solves the problem of locating the script. The yield keyword splits the function into two parts; you can imagine that the body of the with block is filled in at the yield’s position. So the code first saves the old arguments and appends the script’s directory to sys.path, so that the script and the modules next to it can be found. It then defines how to handle exceptions, and finally restores the arguments and the path once the block has finished. This lets the program import and run a script without disturbing much of the running context.
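
To see the setup/yield/teardown split in isolation, here is a minimal sketch of the same pattern. It is not Mitmproxy code: `temporary_argv` is a name I made up, and unlike scriptenv it lets exceptions propagate instead of logging them.

```python
import contextlib
import sys


@contextlib.contextmanager
def temporary_argv(new_argv):
    """Swap sys.argv for the duration of a with block, then restore it."""
    old_argv = sys.argv
    sys.argv = list(new_argv)   # setup: runs before the with body
    try:
        yield                   # the with body runs at this point
    finally:
        sys.argv = old_argv     # teardown: runs even if the body raises


if __name__ == "__main__":
    with temporary_argv(["script.py", "--flag"]):
        print(sys.argv)         # the temporary arguments
    print(sys.argv)             # the original arguments again
```

Everything before the yield is the "enter" half and everything in the finally clause is the "exit" half, so the caller cannot forget to restore the state.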

load_script(path, args)

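def load_script(path, args):
    with open(path, "rb") as f:
        try:
            # Compile first so that syntax errors are reported cleanly.
            code = compile(f.read(), path, 'exec')
        except SyntaxError as e:
            ctx.log.error(
                "Script error: %s line %s: %s" % (
                    e.filename, e.lineno, e.msg
                )
            )
            return
    # A fresh namespace, so the script cannot clobber names in this module.
    ns = {'__file__': os.path.abspath(path)}
    with scriptenv(path, args):
        exec(code, ns)
    # Expose the script's globals as attributes: ns.request, ns.response, ...
    return types.SimpleNamespace(**ns)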

The most common way to load and run a .py file is to call exec(code) directly, but the code here first creates a new dict, ns, as the namespace and runs the code in it. This protects the existing objects in the main namespace: even if the script defines something that collides with a name in the main program, it only takes effect in the new namespace. The exec() call runs the script’s top-level code right away, but most of the time that code just defines functions, so the effect is to load them all into the new namespace. The namespace is then converted into a SimpleNamespace object and returned, and it becomes the member variable ns of a Script object.

When we need to run a function from the script, we can use getattr(self.ns, name) to obtain it from the namespace.
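
The namespace trick can be tried without Mitmproxy. The following is a minimal sketch under my own assumptions: the script is an inline string instead of a file, and `load_source` and `SCRIPT_SOURCE` are hypothetical names.

```python
import types

# A stand-in for the contents of a user script.
SCRIPT_SOURCE = '''
greeting = "hello"

def request(flow):
    return "%s, %s" % (greeting, flow)
'''


def load_source(source, filename="<script>"):
    """Run source in its own namespace and return it as an object."""
    code = compile(source, filename, "exec")
    ns = {"__file__": filename}       # the script's private globals
    exec(code, ns)                    # top-level code runs here
    return types.SimpleNamespace(**ns)


if __name__ == "__main__":
    greeting = "untouched"            # same name as in the script
    script = load_source(SCRIPT_SOURCE)
    handler = getattr(script, "request", None)
    if handler is not None:
        print(handler("flow-1"))
    print(greeting)                   # the script did not overwrite it
```

Passing a default to getattr makes it easy to skip events the script does not handle.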

Script class

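for i in eventsequence.Events:
    # Only add a proxy for events the class does not already handle.
    if not hasattr(self, i):
        def mkprox():
            evt = i  # bind the current event name for this proxy

            def prox(*args, **kwargs):
                self.run(evt, *args, **kwargs)
            return prox
        setattr(self, i, mkprox())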
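This piece of code comes from the __init__() of the Script class. It guarantees that every script has the same mechanism as other addons. It iterates over the events and sets a hook for each event that the class does not already define. The mkprox() wrapper is there so that each prox captures its own event name in evt instead of sharing the loop variable i.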
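
Why the extra mkprox() layer matters is easier to see side by side. This is a minimal sketch, not Mitmproxy code: `EVENTS`, `make_naive`, and `make_with_factory` are names I made up, and the handlers just return the event name instead of calling self.run().

```python
EVENTS = ["request", "response"]   # stand-ins for eventsequence.Events


def make_naive():
    """Define the proxy directly in the loop: every proxy shares `name`."""
    handlers = {}
    for name in EVENTS:
        def prox():
            return name            # looked up when prox is called
        handlers[name] = prox
    return handlers


def make_with_factory():
    """Use a factory, as mkprox() does, so each proxy keeps its own name."""
    handlers = {}
    for name in EVENTS:
        def mkprox():
            evt = name             # a fresh local for each mkprox() call

            def prox():
                return evt
            return prox
        handlers[name] = mkprox()
    return handlers


if __name__ == "__main__":
    print(make_naive()["request"]())         # response
    print(make_with_factory()["request"]())  # request
```

Python closures capture variables, not values, so without the factory every hook would dispatch the last event in the list.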

Throughout the program there are many inspiring implementations to talk about. I’ll try my best to contribute some PRs and maybe share more interesting tricks here.