Part 4: Python

Hello & welcome to Part 4! This is where we get into the code and start making changes.

What is all this stuff?

By now, you’ve probably had a look through the files in the ismir2018-oss-tutorial repository, and have some sense of what they all are. Here, we’ll go through it in a bit more detail.

Looking closer

How is code organized in Python?

The first thing to understand is that Python code is usually organized in one of two ways, as scripts or as packages:

A script is a file containing Python code (e.g., myscript.py) that’s meant to be executed directly by a user.
A package is a way of bundling up pieces of code for use by other packages and scripts in the future.

When you say pip install <some-package-name> or conda install <some-package-name>, you’re installing a package. numpy is a package, for example.

Both scripts and packages are important for reproducibility, but it’s important to know when to use each. Some rules of thumb:

If you want a bit of code to be usable across multiple projects, put it in a package.
If you just want someone else to be able to run your exact code, a script might be better.

Of course, these are just general suggestions, and there will always be exceptions. Use your best judgement!

Checkpoint: install toymir

In the ismir2018-oss-tutorial folder, type

ls -lR

to see the contents of the repository.

Next, in the same folder, install the package for development by saying

python setup.py develop

This command uses the setup.py script to install the current package in the Python environment. The develop action means that files will not be copied over, so any changes you make to the code will be immediately reflected in the environment. This way, you can work on the code as a developer without having to reinstall it after every change.

So how do packages work?

In its simplest form, a package can be defined from a single source file (e.g., <package>.py), but it is more common to split even simple packages into modules. Python packages mirror the file and directory structure of the source code to keep things organized. For example, our toy package looks like:

toymir/
    __init__.py
    freq.py
    version.py
    ...

The package is called toymir, and you would use it in a script by saying

import toymir

When Python encounters the import statement, it will search for toymir along its search path, looking for either toymir.py or toymir/__init__.py.
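If you’d like to watch this lookup happen, here is a small sketch that asks Python’s import machinery where it would find two standard-library names: json (a folder with an __init__.py) and colorsys (a single .py file). These stand in for toymir so the example runs anywhere. The helper name describe is made up for illustration.

```python
import importlib.util
import sys


def describe(name):
    """Report whether `name` resolves to a package folder or a single file."""
    spec = importlib.util.find_spec(name)
    if spec is None:
        return f"{name}: not found on the search path"
    # Packages carry a list of folders to search for submodules;
    # single-file modules do not.
    is_package = spec.submodule_search_locations is not None
    kind = "package" if is_package else "module"
    return f"{name}: {kind} loaded from {spec.origin}"


# sys.path is the ordered list of locations Python searches on import.
print("First few search-path entries:", sys.path[:3])
print(describe("json"))      # a folder containing __init__.py
print(describe("colorsys"))  # a single .py file
```

The exact paths printed will depend on your Python installation.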

The __init__.py file is special: it is always the first thing loaded when the package is imported, and it can contain arbitrary Python code. Typically, __init__.py files are minimal, and contain only comments and the import statements necessary to initialize the package.


In this case, __init__.py might look like:

from .version import __version__
from .freq import *

These lines run when import toymir is executed and, in turn, import the rest of the modules within the package. The first line brings in only one variable (__version__) from the version.py module. Importing the variable directly in __init__.py makes it accessible to the user as toymir.__version__, which is the convention for specifying version numbers in Python packages.

The second line imports all of the public variables, classes, and functions defined in the freq.py module (by default, every name that does not start with an underscore). After saying import toymir, a user can access functions as toymir.hz_to_period() (for example).
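To see both import lines at work without touching the real toymir, the sketch below builds a throwaway package in a temporary folder and imports it. Everything here is an assumption made for illustration: the package name demo_pkg, the version string, the helper _check_positive, and the body of hz_to_period (a stand-in that returns 1 / frequency, not necessarily what toymir does).

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Build a throwaway package on disk. "demo_pkg" is a made-up name.
root = Path(tempfile.mkdtemp())
pkg_dir = root / "demo_pkg"
pkg_dir.mkdir()

(pkg_dir / "version.py").write_text('__version__ = "0.1.0"\n')
(pkg_dir / "freq.py").write_text(
    "def _check_positive(hz):\n"
    "    if hz <= 0:\n"
    "        raise ValueError('frequency must be positive')\n"
    "\n"
    "def hz_to_period(hz):\n"
    "    # Stand-in implementation: period in seconds is 1 / frequency.\n"
    "    _check_positive(hz)\n"
    "    return 1.0 / hz\n"
)
(pkg_dir / "__init__.py").write_text(
    "from .version import __version__\n"
    "from .freq import *\n"
)

# Make the temporary folder importable, then import the package.
sys.path.insert(0, str(root))
importlib.invalidate_caches()
import demo_pkg

print(demo_pkg.__version__)                  # exposed by the first import line
print(demo_pkg.hz_to_period(100))            # exposed by the star import
print(hasattr(demo_pkg, "_check_positive"))  # False: underscore names are skipped
```

Note that the underscore-prefixed helper stays out of the package namespace, which is a handy way to keep internal functions out of the user’s way.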

Packages can also have sub-structure, with nested folders that each include their own __init__.py file. In general, it’s a good idea to keep the nesting shallow, if only because users don’t like typing long strings to access functions!

Checkpoint

Now that you have toymir installed, let’s see what it does. Start an ipython console, and type

import toymir

at the prompt.

Next, type toymir.<TAB> and you should see a table of the contents of the toymir package, which you can navigate by arrow keys. Select hz_to_period and press enter: this should leave toymir.hz_to_period on your input prompt.

Now, add a ? to the end of the line, so it reads toymir.hz_to_period? and press <ENTER>. This will bring up the documentation for the hz_to_period function, which you can quit by hitting q. If you add ?? instead of ?, it will bring up both the documentation and the source code (if available). This is handy for quickly getting a feel for how a function works!

Relative and absolute imports


You may have noticed that the import lines in __init__.py above look funny. You might be used to seeing so-called absolute import statements like import numpy, or from scipy.signal import convolve, but what’s all this period business in from .freq import *?

This is a special convention to make sure that Python does not get confused about where imports come from. Imagine there was already a package installed on the system called freq. Then saying from freq import * within toymir/__init__.py could be ambiguous! Saying from .freq import * tells Python: the freq.py you’re looking for is in this folder. This is critical when you have common submodule names like utils, and it also avoids unintentional recursive imports when a submodule shares the package’s name.
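Here is a small sketch of that situation. It creates a top-level module and a package that contains a submodule with the very same name, then imports both ways from inside the package. All names (demo_helpers, demo_outer, WHERE, relative_choice, absolute_choice) are made up for illustration, and the behavior shown assumes Python 3.

```python
import importlib
import sys
import tempfile
from pathlib import Path

root = Path(tempfile.mkdtemp())

# A top-level module called "demo_helpers" ...
(root / "demo_helpers.py").write_text('WHERE = "top-level module"\n')

# ... and a package that has its own submodule with the same name.
pkg_dir = root / "demo_outer"
pkg_dir.mkdir()
(pkg_dir / "demo_helpers.py").write_text('WHERE = "inside demo_outer"\n')
(pkg_dir / "__init__.py").write_text(
    "from . import demo_helpers as relative_choice\n"
    "import demo_helpers as absolute_choice\n"
)

sys.path.insert(0, str(root))
importlib.invalidate_caches()
import demo_outer

print(demo_outer.relative_choice.WHERE)  # inside demo_outer
print(demo_outer.absolute_choice.WHERE)  # top-level module
```

The leading period is what pins the import to the package’s own folder; without it, Python 3 goes looking on the search path instead.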

Checkpoint

Go into toymir/__init__.py and change the second import from

from .freq import *

to

from . import freq

Once you’ve done that and saved the result, start ipython, and import toymir. Instead of toymir.hz_to_period, the function should now be located at toymir.freq.hz_to_period. Verify this by running:

toymir.hz_to_period(40)

which should fail with AttributeError. Next, run

toymir.freq.hz_to_period(100)

which should return 0.01.

Once you’re done, reset the repository to a clean state (undoing your change) by saying:

git reset --hard HEAD

Including data

Sometimes, packages need to include data as well as code. One tool for this is pkg_resources, which ships with setuptools; on Python 3.7 and later, the standard library’s importlib.resources serves the same purpose. Typically, a package will provide functions to make it easy for a user to access any bundled data, or load it directly if the data is necessary for internal use by the package (e.g., model parameters).
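As a rough sketch of the “provide a function to access bundled data” idea, the example below builds a throwaway package containing a small JSON file and loads it through a helper. It uses importlib.resources.files from the standard library (which needs Python 3.9 or later) rather than pkg_resources, so that it runs without setuptools. The package name demo_datapkg, the file data/params.json, its contents, and the function load_params are all made up for illustration.

```python
import importlib
import importlib.resources
import json
import sys
import tempfile
from pathlib import Path

# Build a throwaway package with a bundled data file.
root = Path(tempfile.mkdtemp())
data_dir = root / "demo_datapkg" / "data"
data_dir.mkdir(parents=True)
(root / "demo_datapkg" / "__init__.py").write_text("")
(data_dir / "params.json").write_text(json.dumps({"a4_hz": 440.0}))

sys.path.insert(0, str(root))
importlib.invalidate_caches()


def load_params():
    """Load the bundled parameters, wherever the package is installed."""
    # files() locates the package itself, so no hard-coded paths are needed.
    resource = importlib.resources.files("demo_datapkg") / "data" / "params.json"
    return json.loads(resource.read_text(encoding="utf-8"))


print(load_params())
```

The point of the helper is that users never need to know where the package lives on disk. In a real package, the data files also have to be declared in setup.py so that they get installed along with the code.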

The installer

Finally, outside of the package directory (but still in the repository), you’ll see setup.py. This is the script that is executed when the package is installed and/or packaged, and should contain all the necessary metadata (including dependencies and included data files) for successful installation.

Success!

Ready to keep going? Onward to Part 5.

Eager for more?

Bonus round!

Best practices for awesome packages

Raise exceptions, not asserts. Asserts give the user no chance to diagnose what went wrong!
Use numpydoc format for your documentation strings. We’ll see more of that in Part 6.
Allow the user to seed your random number generators. This is critical for reproducibility.
Keep a change-log in your documentation, and include the dates!
Specify your dependency versions!
Provide functions, not classes.
- Internal classes are okay, but don’t make users learn object hierarchies!
- If you do need classes, don’t extend from other packages. Wrap them instead.
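A few of the suggestions above fit nicely into one small function: it is a plain function (no class), it has a numpydoc-style docstring, it raises a descriptive exception instead of asserting, and it lets the caller seed the random number generator. This is only a sketch; the function random_frequencies and its parameters are made up for illustration and are not part of toymir.

```python
import random


def random_frequencies(n, low=20.0, high=20000.0, seed=None):
    """Draw random frequencies, uniformly distributed between two bounds.

    Parameters
    ----------
    n : int
        Number of frequencies to draw. Must be non-negative.
    low, high : float
        Lower and upper bounds in Hz, with ``0 < low < high``.
    seed : int or None
        Seed for the random number generator. Pass an integer to get
        the same values on every call.

    Returns
    -------
    freqs : list of float
        The sampled frequencies in Hz.

    Raises
    ------
    ValueError
        If ``n`` is negative or the bounds are not ``0 < low < high``.
    """
    # Exceptions carry a message the user can act on.
    if n < 0:
        raise ValueError(f"n must be non-negative, got n={n}")
    if not 0 < low < high:
        raise ValueError(f"need 0 < low < high, got low={low}, high={high}")

    # A private generator avoids touching the global random state.
    rng = random.Random(seed)
    return [rng.uniform(low, high) for _ in range(n)]


# Same seed, same numbers.
print(random_frequencies(3, seed=0) == random_frequencies(3, seed=0))

try:
    random_frequencies(-1)
except ValueError as exc:
    print("Caught:", exc)
```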

Tips and tricks for successful scripts

Always seed your random number generator! You don’t want different results if you re-run it next week, right?
If you have an experiment that consists of multiple sequential stages, make separate scripts and prefix them with numbers: 01-preprocess.py, 02-train.py, 03-evaluate.py, etc. Be sure to document how to use your scripts in a README file!
Use the argparse package to handle command-line arguments nicely!
Use tqdm for cool progress bars!
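The seeding and argparse tips above can be sketched together in a few lines. This is a toy stand-in for one stage of an experiment, using only the standard library (so no tqdm here). The option names --seed and --n-trials, their defaults, and the “experiment” itself are made up for illustration.

```python
import argparse
import random


def parse_args(argv=None):
    """Define and parse the command-line options for this stage."""
    parser = argparse.ArgumentParser(
        description="Toy experiment stage (illustrative only)."
    )
    parser.add_argument("--seed", type=int, default=0,
                        help="random seed, for repeatable runs")
    parser.add_argument("--n-trials", type=int, default=5,
                        help="number of trials to run")
    return parser.parse_args(argv)


def main(argv=None):
    args = parse_args(argv)
    # Seed first, before any random numbers are drawn.
    random.seed(args.seed)
    results = [random.random() for _ in range(args.n_trials)]
    print(f"seed={args.seed} results={[round(r, 3) for r in results]}")
    return results


if __name__ == "__main__":
    main()
```

Saved under a name like 02-train.py, this could be run as python 02-train.py --seed 1 --n-trials 3, and running it twice with the same seed prints the same results. Passing argv into main also makes the script easy to call from a test.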