Part 4: Python
Hello & welcome to Part 4! This is where we get into the code and start making changes.
What is all this stuff?
By now, you’ve probably had a look through the files in the ismir2018-oss-tutorial repository, and have some sense of what they all are. Here, we’ll go through them in a bit more detail.
How is code organized in Python?
The first thing to understand is that Python supports two different use cases: scripts and packages.
- A script is a file containing Python code, e.g., myscript.py, that’s meant to be executed directly by a user.
- A package is a way of bundling up pieces of code for use by other packages and scripts in the future. When you say pip install <some-package-name> or conda install <some-package-name>, you’re installing a package. numpy is a package, for example.
Both scripts and packages are important for reproducibility, but it’s important to know when to use each. Some rules of thumb:
- If you want a bit of code to be usable across multiple projects, put it in a package.
- If you just want someone else to be able to run your exact code, a script might be better.
Of course, these are just general suggestions, and there will always be exceptions. Use your best judgement!
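To make the distinction concrete, here is a minimal sketch of what a script might look like. The file name myscript.py comes from the text above; the duration values and the choice of the standard-library statistics package are made up for illustration. The script holds the one-off logic, while the reusable function comes from a package.

```python
# myscript.py: a script, meant to be run directly with `python myscript.py`.
import statistics  # reusable code lives in a package (here, from the standard library)


def main():
    # Made-up note durations in seconds, purely for illustration.
    durations = [0.5, 0.25, 0.25, 1.0]
    mean_duration = statistics.mean(durations)
    print("mean duration:", mean_duration)
    return mean_duration


# This guard runs main() when the file is executed as a script,
# but not when the file is imported by other code.
if __name__ == "__main__":
    main()
```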
Checkpoint: install toymir
In the ismir2018-oss-tutorial
folder, type
ls -lR
to see the contents of the repository.
Next, in the same folder, install the package for development by saying
python setup.py develop
This command uses the setup.py script to install the current package in the Python environment.
The develop action means that files will not be copied over, so any changes you make to the code will be immediately reflected in the environment.
This way, you can work on the code as a developer without having to reinstall it after every change.
(With recent versions of pip and setuptools, running pip install -e . from the same folder gives an equivalent editable install.)
So how do packages work?
In its simplest form, a package can be defined from a single source file (e.g., <package>.py), but it is more common to split even simple packages into modules.
Python packages mirror the file and directory structure of the source code to keep things organized. For example, our toy package looks like:
toymir/
    __init__.py
    freq.py
    version.py
    ...
The package is called toymir, and you would use it in a script by saying
import toymir
When Python encounters the import statement, it will locate toymir in its search path, and look for either toymir.py or toymir/__init__.py.
The __init__.py convention is magical: it is always the first thing loaded when the package is imported, and can contain arbitrary Python code.
Typically, __init__.py files are minimal, and only contain comments and the import statements necessary to initialize the package.
In this case, __init__.py might look like:
from .version import __version__
from .freq import *
These lines run when import toymir is executed, and in turn import the rest of the modules within the package.
The first line brings in only one variable (__version__) from the version.py module.
Importing the variable directly in __init__.py makes it accessible to the user as toymir.__version__, which is the convention for specifying version numbers in Python packages.
The second line imports the public variables, classes, and functions defined in the freq.py module (by default, every name that does not start with an underscore).
After saying import toymir, a user can access functions as toymir.hz_to_period() (for example).
Modules can also have sub-structure, with nested folders, each including their own __init__.py files. In general, it’s a good idea to keep submodules from nesting too deeply, if only because users don’t like typing long strings to access functions!
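If you want to watch this mechanism work without touching toymir, the sketch below builds a throwaway package with the same layout in a temporary folder and imports it. The name demo_pkg, the version string, and the one-line body of hz_to_period are all stand-ins invented for this example; they are not the contents of the real toymir files.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Build a throwaway package on disk. "demo_pkg" is a made-up name, and the
# file contents below are stand-ins, not the real toymir sources.
root = Path(tempfile.mkdtemp())
pkg = root / "demo_pkg"
pkg.mkdir()
(pkg / "version.py").write_text('__version__ = "0.1.0"\n')
(pkg / "freq.py").write_text(
    "def hz_to_period(hz):\n"
    "    return 1.0 / hz\n"
)
(pkg / "__init__.py").write_text(
    "from .version import __version__\n"
    "from .freq import *\n"
)

# Put the temporary folder on the search path so `import demo_pkg` can find it.
sys.path.insert(0, str(root))
importlib.invalidate_caches()

import demo_pkg  # runs demo_pkg/__init__.py, which imports the two modules

print(demo_pkg.__version__)        # 0.1.0
print(demo_pkg.hz_to_period(100))  # 0.01
```

Because __init__.py pulled the names in, both the version string and the function are reachable directly from the package name.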
Checkpoint
Now that you have toymir installed, let’s see what it does. Start an ipython console, and type
import toymir
at the prompt.
Next, type toymir.<TAB> and you should see a table of the contents of the toymir package, which you can navigate with the arrow keys.
Select hz_to_period and press <ENTER>: this should leave toymir.hz_to_period on your input prompt.
Now, add a ? to the end of the line, so it reads toymir.hz_to_period? and press <ENTER>.
This will bring up the documentation for the hz_to_period function, which you can quit by hitting q.
If you add two question marks (??) instead of one, it will bring up both the documentation and the source code (if available).
This is handy for quickly getting a feel for how a function works!
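The ? and ?? shortcuts only work inside IPython. In a plain Python session or a script, the standard library’s inspect module gives roughly the same information. The sketch below uses json.dumps as the target only because it is available everywhere; with toymir installed you could pass toymir.hz_to_period instead.

```python
import inspect
import json

# Roughly what `json.dumps?` shows in IPython: the documentation string.
doc = inspect.getdoc(json.dumps)
doc_first_line = doc.splitlines()[0]
print(doc_first_line)

# Roughly what `json.dumps??` adds: the source code, when it is available
# (this works for functions written in Python, not for compiled ones).
source = inspect.getsource(json.dumps)
source_first_line = source.splitlines()[0]
print(source_first_line)
```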
Relative and absolute imports
You may have noticed that the import lines in __init__.py above look funny. You might be used to seeing so-called absolute import statements like import numpy, or from scipy.signal import convolve, but what’s all this period business in from .freq import *?
This is a special convention to make sure that Python does not get confused about where imports come from. Imagine there was already a package installed on the system called freq. Then from freq import * from within toymir/__init__.py would not do what we want: in Python 3 it is treated as an absolute import, so it would load that other freq rather than our own module. Saying from .freq import * tells Python: the freq.py you’re looking for is in this folder. This is critical when you have common submodule names like utils, and it also avoids unintentional recursive imports when the submodule shares the package name (like in our case here).
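The name clash described above can be reproduced in a few lines. This sketch creates a temporary folder containing both a top-level freq.py and a package with its own freq.py, then shows which file each import style picks up. The package name shadow_demo and the WHERE variable are hypothetical and exist only for this demonstration.

```python
import importlib
import sys
import tempfile
from pathlib import Path

root = Path(tempfile.mkdtemp())

# A top-level module called freq, standing in for "some other installed package".
(root / "freq.py").write_text('WHERE = "top-level freq"\n')

# A made-up package with its own freq.py sitting next to __init__.py.
pkg = root / "shadow_demo"
pkg.mkdir()
(pkg / "freq.py").write_text('WHERE = "shadow_demo.freq"\n')
(pkg / "__init__.py").write_text(
    "from .freq import WHERE as relative_where\n"  # leading dot: this folder
    "from freq import WHERE as absolute_where\n"   # no dot: the search path
)

sys.path.insert(0, str(root))
importlib.invalidate_caches()

import shadow_demo

print(shadow_demo.relative_where)  # shadow_demo.freq
print(shadow_demo.absolute_where)  # top-level freq
```

The leading dot is the only difference between the two import lines, and it decides which file gets loaded.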
Checkpoint
Go into toymir/__init__.py and change the second import from
from .freq import *
to
from . import freq
Once you’ve done that and saved the result, start ipython, and import toymir.
Instead of toymir.hz_to_period, the function should now be located at toymir.freq.hz_to_period.
Verify this by running:
toymir.hz_to_period(40)
which should fail with AttributeError. Next, run
toymir.freq.hz_to_period(100)
which should return 0.01.
Once you’re done, reset the repository to a clean state (undoing your change) by saying:
git reset --hard HEAD
Including data
Sometimes, packages need to include data as well as code.
The traditional tool for this is pkg_resources, which ships with setuptools; newer code can use the standard library’s importlib.resources for the same purpose. Typically, a package will provide functions to make it easy for a user to access any bundled data, or load it directly if the data is necessary for internal use by the package (e.g., model parameters).
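As a rough illustration of a package exposing its bundled data through a function, here is a sketch that uses importlib.resources from the standard library (Python 3.9 or later) rather than pkg_resources. The package name data_demo, the file data/tuning.txt, and the load_tuning function are all invented for the example; toymir may organize its data differently.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Build a made-up package that ships one small data file.
root = Path(tempfile.mkdtemp())
pkg = root / "data_demo"
(pkg / "data").mkdir(parents=True)
(pkg / "data" / "tuning.txt").write_text("A4 = 440.0\n")

# The package offers a helper so users never need to know where the file lives.
(pkg / "__init__.py").write_text(
    "from importlib.resources import files\n"
    "\n"
    "\n"
    "def load_tuning():\n"
    "    # Locate the file relative to this package, wherever it is installed.\n"
    "    return (files(__name__) / 'data' / 'tuning.txt').read_text()\n"
)

sys.path.insert(0, str(root))
importlib.invalidate_caches()

import data_demo

print(data_demo.load_tuning())
```

For an installed package, the data file also has to be declared in the installer metadata so that it is copied along with the code.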
The installer
Finally, outside of the package directory (but still in the repository), you’ll see setup.py.
This is the script that is executed when the package is installed and/or packaged, and should contain all the necessary metadata (including dependencies and included data files) for successful installation.
Success!
Ready to keep going? Onward to Part 5.
Eager for more?
Best practices for awesome packages
- Raise exceptions, not asserts. Asserts give the user no chance to diagnose what went wrong!
- Use numpydoc format for your documentation strings. We’ll see more of that in Part 6.
- Allow the user to seed your random number generators. This is critical for reproducibility.
- Keep a change-log in your documentation, and include the dates!
- Specify your dependency versions!
- Provide functions, not classes.
  - Internal classes are okay, but don’t make users learn object hierarchies!
  - If you do need classes, don’t extend from other packages. Wrap them instead.
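A few of these practices fit in one short sketch: a plain function with a numpydoc-style docstring that raises an informative exception, and a second function that lets the caller supply a seed. Both function bodies are assumptions written for this example; in particular, this hz_to_period is not copied from toymir, and random_frequencies is a hypothetical helper.

```python
import random


def hz_to_period(hz):
    """Convert a frequency in Hz to a period in seconds.

    Parameters
    ----------
    hz : float
        Frequency in Hz; must be strictly positive.

    Returns
    -------
    period : float
        Duration of one cycle, in seconds.

    Raises
    ------
    ValueError
        If ``hz`` is not strictly positive.
    """
    # An exception with a message tells the user what to fix,
    # which a bare `assert hz > 0` would not.
    if hz <= 0:
        raise ValueError(f"hz must be strictly positive, got hz={hz}")
    return 1.0 / hz


def random_frequencies(n, seed=None):
    """Draw ``n`` random frequencies in Hz (hypothetical helper)."""
    # A private generator seeded by the caller keeps results repeatable
    # without touching the global random state.
    rng = random.Random(seed)
    return [rng.uniform(20.0, 20000.0) for _ in range(n)]


print(hz_to_period(100))  # 0.01
print(random_frequencies(3, seed=1) == random_frequencies(3, seed=1))  # True
```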
Tips and tricks for successful scripts
- Always seed your random number generator! You don’t want different results if you re-run it next week, right?
- If you have an experiment that consists of multiple sequential stages, make separate scripts and prefix them with numbers: 01-preprocess.py, 02-train.py, 03-evaluate.py, etc. Be sure to document how to use your scripts in a README file!
- Use the argparse package to handle command-line arguments nicely!
- Use tqdm for cool progress bars!
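Here is one way the seeding and argparse tips might look together in a single experiment stage. The option names --seed and --n-trials and the toy run function are assumptions made up for this sketch. The argument list is passed in explicitly so the example runs anywhere; a real script would call parse_args() with no arguments to read the command line.

```python
import argparse
import random


def parse_args(argv=None):
    # argv=None makes argparse read the real command line (sys.argv).
    parser = argparse.ArgumentParser(description="Toy experiment stage (illustrative).")
    parser.add_argument("--seed", type=int, default=0,
                        help="seed for the random number generator")
    parser.add_argument("--n-trials", type=int, default=3,
                        help="number of random trials to run")
    return parser.parse_args(argv)


def run(args):
    # Seed a dedicated generator so re-running with the same seed
    # gives the same numbers next week.
    rng = random.Random(args.seed)
    return [rng.random() for _ in range(args.n_trials)]


# Stand-in for running `python 02-train.py --seed 42 --n-trials 3`.
results = run(parse_args(["--seed", "42", "--n-trials", "3"]))
print(results)
```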