
Content from Software Packaging Overview

Last updated on 2026-06-08

Overview

Questions

What is software packaging?
How is packaging related to reproducibility and the FAIR4RS principles?
What content comprises a software package?

Objectives

Recognise the importance of software packaging to ensure reproducibility.
Understand what the basic building blocks of a package are in general.

Introduction

One of the most challenging aspects of research is reproducibility. Achieving it requires both research data and research software to adhere to a set of guidelines that better enable open research practices across all disciplines. The recent adaptation of the original FAIR principles (Findable, Accessible, Interoperable, Reusable) means that research software can now benefit from the same general framework as research data, whilst accounting for the inherent differences between the two, including software versioning, dependency management, writing documentation, and choosing an appropriate license. This adaptation is commonly referred to as the FAIR4RS (FAIR for Research Software) principles.

Discussion

Can you recall a time when you have used someone else’s software but encountered difficulties in reproducing their results? What challenges did you face and how did you overcome them?

Software packaging is one of the core elements of reproducible research software. In general, it encompasses the process of collecting and configuring software components into a format that can be easily deployed on different computing environments.

*Figure 1: A software package is like a box containing all the items you need for a particular activity, neatly packed together to transport to someone else.*

Callout

Think about what a package is in general; you typically have a box of items that you want to post to someone else in the world. But before you post it for others to use, you need to make sure the package has things like: an address label, an instruction manual, and protective material.

Challenge

Challenge 1: Packaging Analogy

Using the analogy in the callout above, provide an example for each package attribute in terms of the software attribute.

Give me a hint

The above callout should help you think about a few different possible analogies.

Show me the solution

Box of items: The software itself (source code and additional resources e.g. data, images).
Address label: Computer-readable instructions to reproduce an environment with the dependencies set up.
Instruction manual: User documentation explaining how to use the software effectively.
Protective materials: Error handling routines to safeguard the software from misuse or unexpected situations, and tests to verify the code functions under a variety of conditions

Why Should You Package Your Software?

As we’ve touched on above, there are several benefits to packaging your software:

Ease of installation: Instead of manually copying individual files and setting configurations, users can automatically install the package using a package manager.
Dependency management: Automates the process of installing the dependencies to prevent version conflicts and ensures the software environment is consistently correct
Accessibility: Storing completed packages on central repositories ensures they are easier to access for other researchers and contributes to the software ecosystem
Standardisation: Packaging enforces a standard structure and format for software, making it easier for users and developers to understand.
Reproducibility: A packaged piece of software is a snapshot of the code and its dependencies at a specific point in time, making it easier to reproduce results
Research impact and collaboration: From a research impact perspective, software packaging ensures reproducibility, accessibility, and ease of dissemination to the wider research community.

If you decide that packaging your software is right for your project, there are some important questions that you, as the developer, should consider before getting started:

Target Users: Who are you building this package for? Beginners, experienced users, or a specific domain? This will influence the licence you choose as well as the level of detail needed in the documentation and the complexity of dependencies you may need to include. (Software lifecycle course materials)
Generalization: Is your code generalized to the extent it can be used by others, or does it rely upon your specific machine’s file paths or local input data? Will it handle data in different formats? (Software Design Course coming soon!)
Dependencies: What other libraries does your package rely on to function? What about hardware dependencies? Have these been documented in a standard format? (Reproducible environments course materials)
Testability: How will users test your package? You should consider including unit tests and examples to demonstrate usage and ensure your code functions as expected. (Testing and CI course materials)
Documentation: Is there a README detailing installation instructions and basic usage as well as internal documentation in the form of docstrings? (Documentation course materials)
Scalability: As your project and software grows in size and complexity, how can you effectively handle the increased modules, dependencies and distribution requirements?

Of course, this is not an exhaustive list. However, once you have thought about candidate solutions for these questions, you’re in a good position to start packaging your software.

Anatomy of a Software Package

The most basic directory structure of a software package looks something like:

📦 my_project/
├── 📂 <source>/
│   └── 📄 <code>
├── 📄 README.md
└── 📄 <metadata>

where

📄 README.md provides the essential information about your package including how to install and use it
📄 <metadata> is a configuration file detailing the most important metadata: package name, version, authors, and dependencies amongst others. Python’s metadata file is pyproject.toml while R’s is DESCRIPTION. The following sections will show examples of both of these.
📂 <source> contains the source code. The structure and naming conventions of this folder will differ between programming languages

Challenge

Challenge 2: Improving your project’s packaging

The directory structure of a basic package shown above is a good starting point, but it can be improved. From what you have learned so far, what other files and folders could you include in your package to provide better organisation, readability, and compatibility?

Give me a hint

Use the emoji folder structure above to get you started.

Show me the solution

A possible improvement could be to include the following to your package:

📦 my_project/
├── 📂 <source>/
│   └── 📄 <code>
├── 📂 tests/
│   └── 📄 <tests>
├── 📂 docs/
│   └── 📄 example_usage.md
├── 📂 resources/
│   └── 📄 example_data.csv
├── 📄 LICENSE.md
├── 📄 CODE_OF_CONDUCT.md
├── 📄 README.md
└── 📄 <metadata>

Where:

📄 LICENSE.md explicitly details licensing terms and conditions under which the package’s code and associated assets are made available to others for use, modification, and distribution.
📂 docs contains additional long-form documentation, either hand-written and/or generated automatically from the docstrings. Often hosted as a separate website.
📂 resources contains any additional resources, such as example data or images
📄 CODE_OF_CONDUCT.md outlines community guidelines for behaviour and standards for contributors

Although we have touched on the core concepts of packaging, we still need to learn about how to write the metadata and logic for building a package in both Python and R. The next episode of this course will show example minimal package structures for both of these languages.

Key Points

Reproducibility is an integral concept in the FAIR4RS principles. Appropriate software packaging is one way to account for reproducible research software, which involves collecting and configuring software components into a format deployable across different computer systems.
Software packaging is akin to packing a box for shipment. Attributes such as the software source code, installation instructions, user documentation, and test scripts all help to ensure reproducibility.
The purpose of a software package is to install source code for execution on various systems, with considerations including target users, dependencies, testability and scalability.

Content from Accessing Packages

Last updated on 2026-06-08

Overview

Questions

How can I access my own package?
What are the different ways of downloading Python packages?
What are the different ways of installing R packages?

Objectives

Install packages from source in both R and Python
Install packages using pip
Install packages in R

Introduction

Due to the proliferation of software into many areas of academic research, it is quite likely that you won’t be the first person to set off on solving any particular task. Many others have worked on common problems and then shared their solution in the form of a package, which you can conveniently integrate into your own code and use!

Callout

Popular Packages

Some of the most popular packages you may have heard of are:

NumPy (Python)
pandas (Python)
tidyverse (R)
ggplot2 (R)

Python

To use a package that is installed, you use the keyword import in Python.

PYTHON

# This imports the pandas package and gives it a new name 'pd'.
import pandas as pd 

# Use the package to read a file
pd.read_csv("my_data.csv")

Python Package Index (PyPI)

The Python Package Index or PyPI is an online repository of Python packages hosting over 500,000 packages! While there are alternatives such as conda-forge, PyPI is by far the most commonly used and likely to have all you need.

Discussion

Exercise 1: Explore PyPI

Explore PyPI to get familiar with it. Try searching for packages that are relevant to your research domain / role!

Callout

pip

pip (package installer for Python) is the standard tool for installing packages from PyPI. You can think of PyPI being the supermarket full of packages and pip being the delivery van bringing it to you.

Using pip

pip is itself a Python package that can be found on PyPI. However, it comes pre-installed with most Python installations, for example those from python.org, and is available inside virtual environments.

The most common way to use pip is from the command line. At the top of a package page on PyPI is the example line you need to install the package:

python -m pip install numpy

The above will install numpy from PyPI, a popular scientific computing package enabling a wide range of mathematical and scientific functions.

Callout

You may notice a wheel file download during the pip install, for example Downloading numpy-2.0.0-cp312-cp312-win_amd64.whl. A wheel in Python is a pre-built package format that allows for quicker and more efficient installation, so when it is downloaded your local computer doesn’t need to do any building. The alternative is source files which often take the form .zip or .tar.gz, which when downloaded will then need to be built then installed, which is often far slower.
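The wheel filename in the callout above encodes which interpreter and platform the file was built for. The following is a minimal sketch of reading those tags, assuming the standard dash-separated wheel naming scheme (name, version, optional build tag, Python tag, ABI tag, platform tag). The function name parse_wheel_filename is hypothetical and is not part of pip.

```python
def parse_wheel_filename(filename):
    """Split a wheel filename into its dash-separated components."""
    if not filename.endswith(".whl"):
        raise ValueError(f"not a wheel file: {filename}")

    parts = filename[: -len(".whl")].split("-")

    # A wheel name has 5 parts, or 6 when the optional build tag is present
    if len(parts) == 5:
        name, version, python_tag, abi_tag, platform_tag = parts
        build_tag = None
    elif len(parts) == 6:
        name, version, build_tag, python_tag, abi_tag, platform_tag = parts
    else:
        raise ValueError(f"unexpected wheel filename: {filename}")

    return {
        "name": name,
        "version": version,
        "build": build_tag,
        "python": python_tag,
        "abi": abi_tag,
        "platform": platform_tag,
    }


info = parse_wheel_filename("numpy-2.0.0-cp312-cp312-win_amd64.whl")
print(info["python"], info["platform"])  # cp312 win_amd64
```

Here cp312 indicates a build for CPython 3.12 and win_amd64 a build for 64-bit Windows, which is how pip can pick a wheel that matches your machine.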

Discussion

Exercise 2: Create venv and install Numpy

Step 1: Create a venv in the .venv directory using the command python -m venv .venv and activate it with

.\.venv\Scripts\activate

on Windows, or

source .venv/bin/activate

on macOS or Linux.

When activated you should see the name of your environment in brackets at the start of your terminal line

Step 2: Install Numpy into your new environment

Step 3: Check your results with python -m pip list

Step 4: Deactivate your environment with deactivate
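As well as looking at the terminal prompt, you can ask Python itself whether it is running inside a virtual environment. This is a minimal sketch using only the standard library; the helper name in_virtual_environment is an assumption made for illustration.

```python
import sys


def in_virtual_environment():
    """Return True when the running interpreter belongs to a virtual environment."""
    # Inside a venv, sys.prefix points at the environment, while
    # sys.base_prefix still points at the Python it was created from
    return sys.prefix != sys.base_prefix


print("Interpreter:", sys.executable)
print("Inside a virtual environment:", in_virtual_environment())
```

Running this before and after activating the environment should show the interpreter path change to one inside the .venv directory.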

Virtual Environments

Check out this documentation or the FAIR4RS course on virtual environments to learn more!

Installing from source

pip can also be used to install packages from source. This means that the package file structure (source) is on your local computer and pip installs it using the instructions from the setup.py or pyproject.toml file. This is especially handy for packages either not on PyPI, like ones downloaded from GitHub, or for your own packages you’re developing.

python -m pip install .

or, on systems where the interpreter is invoked as python3:

python3 -m pip install .

Here the . means to install your current directory as a Python package. For this to work the directory your command line interface is currently in needs to have a packaging file, i.e. setup.py or pyproject.toml.
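If the install command complains, a quick check is whether the directory actually contains one of the packaging files mentioned above. The sketch below is one way to do that; find_packaging_files is a hypothetical helper written for this lesson and is not part of pip.

```python
from pathlib import Path

# The two packaging files named in the text above
PACKAGING_FILES = ("pyproject.toml", "setup.py")


def find_packaging_files(directory="."):
    """Return the packaging files present in the given directory."""
    directory = Path(directory)
    return [name for name in PACKAGING_FILES if (directory / name).is_file()]


found = find_packaging_files(".")
if found:
    print("This directory looks installable, found:", ", ".join(found))
else:
    print("No pyproject.toml or setup.py here, so 'pip install .' is expected to fail")
```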

R

To load a package in R the library function is used, which will automatically make all the exported functions from that package available in the R environment. Contrast this behaviour with Python where you still need to write pd.read_csv rather than read_csv.

R

# Loads the readr package and attaches it, making all its exported functions available by name
library(readr)

# Use the package to read a file
read_csv("my_data.csv")

Less commonly, you can access a single function from a package using the :: syntax. This is useful in situations where you might not want to populate your environment with all the exported functions from that library, e.g. if you know there will be a conflict.

R

readr::read_csv("my_data.csv")

The Comprehensive R Archive Network (CRAN)

R’s primary package repository is called CRAN. It currently (at the time of writing) hosts 23,667 packages, significantly fewer than PyPI, but there are several substantial differences between these two repositories. PyPI functions as more of a warehouse with a very low barrier to entry, hence the vast number of packages. CRAN by contrast operates as a curated library with very strict criteria for incoming packages (as well as for any updates!). This stark difference in philosophy results in a very different user experience.

Discussion

Exercise 3: Explore CRAN

Explore CRAN to get familiar with it. Try searching for packages that are relevant to your research domain / role! The task views are particularly useful for scoping out the range of existing packages in your field. Look at the page for a package (e.g. deSolve) and identify the different types of documentation available.

`install.packages`

R does not have a separate package manager like pip that can be run as an external tool. The primary way of installing packages is through the install.packages command, which is run from within R. This function by default installs from CRAN, although it can be used to install from other repositories set up in the same fashion as CRAN. The example below shows how to use it to install packages (NB: it will ask for a mirror if one isn’t set up; the 0-Cloud option is recommended).

R

install.packages(c("dplyr", "ggplot2"))

Callout

Binary vs source packages in R

As with Python, packages are downloaded as pre-built binaries, but only for Windows (.zip) and Mac (.tgz) users. On Linux machines the source package (.tar.gz) is downloaded and built directly on the user’s machine, because CRAN does not offer pre-built Linux packages owing to the vast heterogeneity in Linux toolchains. Python is able to offer pre-built Linux wheels via the manylinux standard.

Other package sources

Because the barrier to publishing a package on CRAN is so high, there exist several alternatives that have built up popularity in different fields.

GitHub

A far more streamlined alternative to hosting your package on CRAN is to maintain it on GitHub where users can install it from. This fits in neatly into a version-controlled workflow (see our course on version control) and for many use cases will be sufficient.

One way to install a package from GitHub is with the pak package, using its pkg_install function. This example will install the current development version of dplyr straight from GitHub, which will be newer than the stable release on CRAN.

R

pak::pkg_install("tidyverse/dplyr")

Callout

`devtools` vs `remotes` vs `pak`

You might come across documentation recommending devtools or remotes to install GitHub-hosted packages instead of pak. devtools::install_github was once the recommended method; this functionality was then spun out into the more minimal remotes package to keep devtools focused on package development. At the time of writing, pak is the one-stop shop for package management in R: it is faster and more robust than devtools/remotes and can handle more backends than install.packages.

Bioconductor

Before GitHub became so widespread as a package repository, the bioinformatics community created their own alternative to CRAN: Bioconductor. This is accessed via the R package BiocManager, or via pak by using the syntax pak::pkg_install("bioc::<package>").

Local

pak can also be used to install local packages from source by passing the pkg_install function a path to a package stored on the local filesystem.

R

pak::pkg_install("path/to/source")

Key Points

pip is the most common tool used to download and access Python packages from PyPI.
PyPI is an online package repository which users can choose to upload their packages to for others to use.
CRAN is R’s repository which has a far higher barrier to entry than PyPI.
install.packages is the gold standard way of installing packages from CRAN.
pak is the modern fast package manager in R and can install from a variety of sources.
Both pip and pak can also be used to install packages on your local system (installing from source).

Content from Creating Python Packages

Last updated on 2026-06-08

Overview

Questions

Where do I start if I want to make a Python package?
What will I need / want in my package?
What’s considered good practice with packaging?

Objectives

Create and build a basic example Python package
Understand all the parts and decisions in making the package

Introduction

This episode will see us creating our own Python project from scratch and installing it ready for use. If you’re feeling adventurous, feel free to create your own package content; otherwise, follow along with this example of a Fibonacci counter.

Python Package Structure

The barebones directory structure of a Python package is as follows:

📦 my-package/
├── 📂 src/
│   └── 📂 my_package/
│       ├── 📄 my_code.py
│       └── 📄 __init__.py
└── 📄 pyproject.toml

where

📦 my-package/ is the root directory of the project.
📂 my_package/ is the package directory containing the source code.
📄 pyproject.toml is a configuration file for setting up the package, containing basic metadata.

Tools such as uv and pip use the pyproject.toml file to configure how the package is built, distributed, and installed.

Callout

NB: the difference in punctuation between my-package and my_package

The top-level folder name is the project name as it would be listed on PyPI, which by convention uses hyphens (PyPI treats hyphens, underscores, and periods in project names as equivalent), while src/my_package is the name of an importable package in Python and cannot contain hyphens (but underscores are valid). This oddity is sometimes observed in the wild; for example, the scikit-learn package is installed with pip install scikit-learn but imported with import sklearn.

A simple solution to the awkwardness of having two subtly different names is to use only alphanumeric characters, which is a common approach in Python.
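The relationship between the two spellings can be sketched in a few lines. This example assumes the name normalisation rule from PEP 503, under which runs of hyphens, underscores, and periods in a project name are treated as equivalent. The helper to_import_name is hypothetical: it shows one common convention, not a rule that packages must follow (scikit-learn and sklearn being a counter-example).

```python
import re


def normalize_project_name(name):
    """Normalise a project name following the PEP 503 rule."""
    return re.sub(r"[-_.]+", "-", name).lower()


def to_import_name(project_name):
    """Hypothetical helper: derive an importable name by using underscores."""
    return normalize_project_name(project_name).replace("-", "_")


print(normalize_project_name("My_Package"))  # my-package
print(to_import_name("my-package"))          # my_package
```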

Optional: What is __init__.py?

At this point, it’s worth discussing the use of the __init__.py file. The __init__.py script is used to mark a directory as a Python package, allowing the contained modules to be imported. (Note: names wrapped in double underscores, often abbreviated to “dunder”, signal that the name has a special meaning to Python, helping distinguish this script from others.) It also contains any initialisation code for the package.

For instance, consider the times you have imported a package, such as numpy. The ability to write import numpy is enabled by the modular structuring of the numpy package, including the __init__.py file. The complete import numpy statement then means Python searches for the numpy package in its search path (sys.path) and loads its contents into the namespace under the name numpy. Packages that follow the folder structure above are often referred to as regular packages.

However, in Python versions >= 3.3, the concept of implicit namespace packages (see PEP 420) was introduced. Namespace packages are commonly used to split a regular Python package (as described above) across multiple directories, which ultimately means the __init__.py file is technically not required to create any Python package. For the purposes of this course, we will use an __init__.py to keep with convention and avoid complications with namespace packages.
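To see a regular package in action, the sketch below builds a throwaway one in a temporary directory and imports it. The names demo_pkg, tools, GREETING, and double are all made up for illustration, and adding the temporary directory to sys.path stands in for what installing the package would normally do.

```python
import importlib
import sys
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    # A regular package is a directory that contains an __init__.py file
    package_dir = Path(tmp) / "demo_pkg"
    package_dir.mkdir()
    (package_dir / "__init__.py").write_text("GREETING = 'hello from __init__'\n")
    (package_dir / "tools.py").write_text("def double(x):\n    return 2 * x\n")

    # Put the parent directory on Python's search path so the package is found
    sys.path.insert(0, tmp)
    try:
        demo_pkg = importlib.import_module("demo_pkg")      # runs __init__.py
        tools = importlib.import_module("demo_pkg.tools")   # a module inside it
        greeting = demo_pkg.GREETING
        doubled = tools.double(21)
    finally:
        sys.path.remove(tmp)

print(greeting)  # hello from __init__
print(doubled)   # 42
```

The value defined in __init__.py becomes available as soon as the package is imported, which is why that file is a natural home for initialisation code.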

Challenge

What other files and content go into a package?

Think back to the earlier episodes and try to recall all the things that can go into a package.

Show me the solution

Other metadata files - e.g. LICENCE, README.md, citation.cff
tests - A directory full of tests (unit, integration, etc…)
Extended documentation
Example data or other resources

In this episode we will only be creating a minimal example so many of the files you have thought of won’t be included. Next we will be creating our directory structure.

In either your documents folder if you are on Windows or your home directory if you are on macOS or Linux, create a folder called fibonacci-uoy-<name> where <name> is either your University username or a random string if you don’t want your username to be displayed publicly on the web (when we publish our packages to TestPyPI at the end of the session). In the remainder of this episode the placeholder abc123 will be used to represent your username/random string. Populate the newly created directory with the following sub-folders and empty files.

📦 fibonacci-uoy-abc123/
├── 📂 src/
│   └── 📂 fibonacci_uoy_abc123/
│       ├── 📄 sequence.py
│       └── 📄 __init__.py
├── 📄 pyproject.toml
└── 📄 README.md
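If you would rather script this step than click through a file browser, the layout above can be created with the standard library. This is only a sketch: create_skeleton is a hypothetical helper, and the example builds the project inside a temporary directory so that it leaves nothing behind. Pass your documents or home folder instead to create it for real.

```python
import tempfile
from pathlib import Path


def create_skeleton(root, project_name="fibonacci-uoy-abc123"):
    """Create the empty files and folders for a minimal src-layout package."""
    project = Path(root) / project_name
    # The importable package uses underscores where the project name has hyphens
    package = project / "src" / project_name.replace("-", "_")
    package.mkdir(parents=True, exist_ok=True)

    for path in (
        package / "sequence.py",
        package / "__init__.py",
        project / "pyproject.toml",
        project / "README.md",
    ):
        path.touch(exist_ok=True)
    return project


with tempfile.TemporaryDirectory() as tmp:
    project = create_skeleton(tmp)
    created = sorted(
        p.relative_to(project).as_posix() for p in project.rglob("*") if p.is_file()
    )

for path in created:
    print(path)
```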

Optional: Using uv to create package skeleton

The Reproducible Computational Environments course introduced the uv package and project manager. One of its many features is the ability to create a package skeleton with the command below, which will create a directory called fibonacci-uoy-abc123 in the current working directory. In addition to creating the required file structure, it will also populate the pyproject.toml with basic metadata.

uv init --package fibonacci-uoy-abc123

Configuration File

The first thing we will do in this project is look at the metadata, stored in pyproject.toml. .toml files have sections (termed ‘tables’) denoted by [<title>] lines. In a pyproject.toml file there are 2 tables required at minimum: [build-system] and [project]. Take a look at the minimum example pyproject.toml below (this is what is populated by uv, along with a project.scripts table not shown here).

TOML

[project]
name = "fibonacci-uoy-abc123"
version = "0.1.0"
description = "Add your description here"
readme = "README.md"
authors = [
    { name = "YOUR NAME", email = "YOUR_EMAIL@DOMAIN" }
]
dependencies = []

[build-system]
requires = ["TODO"]
build-backend = "TODO"

[project]

The [project] table is where your package’s core metadata is declared. If you used uv to create your skeleton then the author information may have been automatically populated from your git settings. Any dependencies used by your package must be declared in the dependencies list, for example numpy or pandas.

[build-system]

The [build-system] table specifies the information required to build your project directory into a package: the packages that must be installed to perform the build (requires) and the Python object within them that carries out the build (build-backend). There are multiple popular build tools that can be used to build your project. In this tutorial we will use uv’s build backend, uv_build, as it is simple, very popular, and fits neatly into a uv-managed project.

Callout

pyproject.toml documentation

The full list of accepted keys can be found here in the documentation

Challenge

Create your configuration file

Populate your pyproject.toml file with the two required tables

Show me the solution

TOML

[project]
name = "fibonacci-uoy-abc123"
version = "0.1.0"
description = "A package which can produce the Fibonacci sequence"
readme = "README.md"
authors = [
    { name = "Your Name", email = "youremail@email.com" }
]
dependencies = []

[build-system]
requires = ["uv_build"]
build-backend = "uv_build"

Optional: What does ‘building’ a Python package mean?

Building a Python package means converting the raw source code into a wheel (.whl) that is ready to be installed. If any extensions are present (such as C++) they will be compiled as well and bundled into the wheel. However, even for pure Python packages there are still several differences between the raw source code and the wheel: namely that the wheel is stripped of all metadata beyond that needed to install the package and that the code is reorganized into a standardized folder structure that can be placed directly into a user’s library.

Optional: What is project.scripts?

If you used uv to create the package skeleton you might have noticed the [project.scripts] table in pyproject.toml as follows. This section is used to create new entry-points into your package so that certain functions can be run from the command line easily. The example below means that running fibonacci-uoy-abc123 on the command line will run the main function from this package, rather than having to type python -c "import fibonacci_uoy_abc123; fibonacci_uoy_abc123.main()". This is particularly useful when your package provides a tool for others to use, instead of (or in addition to) library functions to be imported.

TOML

[project.scripts]
fibonacci-uoy-abc123 = "fibonacci_uoy_abc123:main"
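The value on the right-hand side follows a "module:function" pattern. As a rough illustration of how such a string maps to a callable, the sketch below resolves one by hand. resolve_entry_point is a hypothetical helper, and real installers generate a small wrapper script rather than calling anything like it. Since your package may not be installed yet, the example resolves functions from the standard library instead.

```python
import importlib


def resolve_entry_point(spec):
    """Resolve a 'module:attribute' string to the object it names."""
    module_name, _, attribute = spec.partition(":")
    obj = importlib.import_module(module_name)
    # The attribute part may itself be dotted, e.g. 'Class.method'
    for part in attribute.split("."):
        obj = getattr(obj, part)
    return obj


sqrt = resolve_entry_point("math:sqrt")
print(sqrt(9))  # 3.0
```

Once the package is installed, resolve_entry_point("fibonacci_uoy_abc123:main") would return the main function in the same way.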

Creating Python modules

The next step in this episode is to finally write some Python code! .py files are termed ‘modules’ in the context of a package and they are stored in src/<package>.

This example package will allow a user to generate values from the Fibonacci sequence. The Fibonacci sequence is a series of whole numbers where each number is the sum of the two previous numbers. The first 8 numbers of the sequence are 0, 1, 1, 2, 3, 5, 8, 13.

A Python implementation of an algorithm to return the Fibonacci sequence for a specified number of terms is shown below. Add this into the sequence.py module that you created earlier.

PYTHON

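def compute(n_terms):
    """Print the first n_terms numbers of the Fibonacci sequence."""
    current_num = 0
    next_num = 1

    for _ in range(n_terms):
        print(current_num)
        # Step forward: the next number is the sum of the previous two
        prev_num = current_num
        current_num = next_num
        next_num = prev_num + current_num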
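Because compute prints its results, other code cannot easily reuse or test them. As a sketch of an alternative design, the variant below returns the terms as a list instead. The name fibonacci_terms is hypothetical and is not used elsewhere in this episode.

```python
def fibonacci_terms(n_terms):
    """Return the first n_terms numbers of the Fibonacci sequence as a list."""
    terms = []
    current_num, next_num = 0, 1
    for _ in range(n_terms):
        terms.append(current_num)
        # Step forward: the next number is the sum of the previous two
        current_num, next_num = next_num, current_num + next_num
    return terms


print(fibonacci_terms(8))  # [0, 1, 1, 2, 3, 5, 8, 13]
```

Returning a value keeps the calculation separate from how it is displayed, so the caller can choose whether to print, plot, or test the result.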

Callout

Reinventing the wheel

It is good to ask yourself if the package or features you are designing have been done before. Obviously we have chosen a simple function as the focus of this episode is on packaging code rather than developing novel code.

Challenge

Using your Python module

Create a script in your project directory that imports and uses your sequence module. This will serve as a good quick test that it works.

Show me the solution

Create the file in the project folder fibonacci-uoy-abc123, for example use_fibonacci.py.
Import and run the compute function:

PYTHON

from fibonacci_uoy_abc123.sequence import compute

compute(5)

If you try running python use_fibonacci.py at first it will fail with ModuleNotFoundError: No module named 'fibonacci_uoy_abc123' - this is because it isn’t installed! Running python -m pip install . from the same working directory as the pyproject.toml will install your package, thereby fixing this error. Alternatively, if you’re using uv you can simply run uv run python use_fibonacci.py, which will automatically install the current working version of the package into a virtual environment before running the script.

Callout

Editable Install

When installing your own package locally with pip, there is an option called editable or -e for short. python -m pip install -e .

With a default installation (without -e), any changes to your source package will only appear in your Python environment when your package is rebuilt and reinstalled. The editable option allows for quick development of a package by removing the need to reinstall it after every change; for this reason it is sometimes called development mode!
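One way to see which kind of installation you have is to ask Python where it would load a package from. The sketch below uses the standard library's importlib.util.find_spec, and module_location is a hypothetical helper. With a default installation the path typically sits inside your environment's site-packages folder, whereas with an editable install it typically points back into your project's src directory. The standard library module json is used here only so the example runs without your package installed.

```python
import importlib.util


def module_location(name):
    """Return the file a top-level module would be loaded from, or None if not found."""
    spec = importlib.util.find_spec(name)
    return None if spec is None else spec.origin


print(module_location("json"))
print(module_location("fibonacci_uoy_abc123"))  # None until the package is installed
```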

Adding dependencies

So far our package is entirely self-contained and doesn’t require any other libraries. However, this isn’t a very realistic scenario as very few packages are written entirely from scratch without building on other libraries.

We’ll now look at an example of how dependencies are added during the package development process. To do so, we’ll extend the iterative Fibonacci generation function by using Binet’s Formula to vectorize the process. Vectorization means mathematical operations are performed on an entire list of numbers at once in a single step (by delegating to fast numerical libraries written in C/C++) rather than looping through each iteration. As well as being faster than iterative approaches, vectorization methods also scale well with the number of iterations. Python doesn’t support vectorized functions on its in-built list data structure, so instead we’ll use numpy and its array datatype, which does support vectorized operations.

The Python function below will calculate Fibonacci’s sequence using Binet’s formula.

PYTHON

import numpy as np

def compute_numpy(n_terms):
    # Create an array of indices from 0 to n_terms-1
    n = np.arange(n_terms)
    
    # Define the Golden Ratio components
    phi = (1 + np.sqrt(5)) / 2
    psi = (1 - np.sqrt(5)) / 2
    
    # Apply Binet's Formula across the entire array at once
    # F(n) = (phi^n - psi^n) / sqrt(5)
    fib_sequence = (phi**n - psi**n) / np.sqrt(5)
    
    # Round to the nearest integer; int64 avoids the overflow that int32
    # would hit from the 48th term (F(47) = 2971215073) onwards
    fib_sequence = np.rint(fib_sequence).astype(np.int64)
    
    # Print the results
    for x in fib_sequence:
        print(x)
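Binet's formula relies on floating-point arithmetic, so it is worth checking how far it can be trusted. The sketch below is a pure Python comparison (no numpy required) between the formula and the exact iterative calculation. The function names are hypothetical, and the exact index at which the two first disagree is left for you to discover, since it depends on floating-point rounding. Separately, note that a 32-bit integer can only hold values up to 2,147,483,647, which the sequence exceeds at F(47).

```python
import math


def fib_iterative(n):
    """Exact n-th Fibonacci number (F(0) = 0) using integer arithmetic."""
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a


def fib_binet(n):
    """n-th Fibonacci number from Binet's formula using floating-point numbers."""
    sqrt5 = math.sqrt(5)
    phi = (1 + sqrt5) / 2
    psi = (1 - sqrt5) / 2
    return round((phi**n - psi**n) / sqrt5)


def first_mismatch(limit=200):
    """Return the first index below limit where the two methods disagree, else None."""
    for n in range(limit):
        if fib_binet(n) != fib_iterative(n):
            return n
    return None


print("F(47) =", fib_iterative(47), "compared with the int32 maximum", 2**31 - 1)
print("First index where Binet's formula disagrees:", first_mismatch())
```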

Challenge

Add the vectorized function to the package

Try adding the compute_numpy function to the package and ensure that you can run it. Think about what additions to the modules you’ll need to make, as well as the package metadata.

Show me the solution

Add the compute_numpy function to sequence.py
Import it and run it in use_fibonacci.py
Add numpy to the dependencies list in pyproject.toml
Reinstall the package in editable mode

TOML

dependencies = [
    "numpy>=2.4.6",
]

Optional: adding dependencies with uv

If you’re using uv to manage your package you can simply run uv add numpy which will add numpy to dependencies in pyproject.toml and then install it into the project’s virtual environment. Subsequent uv run python use_fibonacci.py calls will correctly import numpy.
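After adding a dependency, you can confirm from Python that it is installed in the current environment. This minimal sketch uses importlib.metadata from the standard library; installed_version is a hypothetical helper, and the output depends on what is installed on your machine.

```python
from importlib import metadata


def installed_version(distribution_name):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(distribution_name)
    except metadata.PackageNotFoundError:
        return None


for name in ("numpy", "fibonacci-uoy-abc123"):
    print(name, "->", installed_version(name) or "not installed")
```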

What Python packaging file formats and tools exist?

While reading about Python packaging, you will likely stumble across a massive alphabet soup of tools, file formats, and historical terms:

Ancient History: distutils, setup.py, eggs
The Transition Era: setuptools, requirements.txt, setup.cfg
Modern Standards: wheels, pyproject.toml, Poetry, uv

Fortunately, the Python community has largely settled on pyproject.toml as the modern, unified standard. You don’t need to master all of these historical tools to build a package today. However, understanding how we got here will help the current ecosystem make a lot more sense!

distutils, setup.py, and setuptools

The first standard tool for installing packages was distutils (short for distribution utilities), which debuted in the late 1990s. It relied on a setup.py script to configure and install packages.

However, storing a package’s configuration inside an executable Python file presented some serious problems:

Security Risks: Because setup.py is actual Python code, running pip install meant executing arbitrary code on your machine. A malicious package could easily hide malware inside its setup script.
Boilerplate & Clutter: Every package required writing repetitive, messy code just to define basic things like the package name and version.
The “Chicken-and-Egg” Problem: To read a setup.py file, you need to run Python. But if that setup.py file required a specific helper library to run, you couldn’t install the helper library without running the file first.

Because distutils was very basic and slow to evolve, a third-party project called setuptools was created to supersede it. setuptools added massive improvements, such as the ability to automatically find packages inside your code and to declare dependencies, and it introduced the first true Python package format: Eggs.

eggs and wheels

Before you can upload your code to the Python Package Index (PyPI) for others to use, it needs to be bundled into a single file. Eggs (.egg) were introduced by setuptools as Python’s first standard package format. While a massive leap forward at the time, they had severe limitations:

No Standardized Metadata: Eggs didn’t have a universally agreed-upon internal structure, making it hard for other tools to interact with them.
Installation Quirks: They were often treated as zipped files added directly to your Python path, which caused bizarre import bugs and made uninstalling them incredibly messy.
Platform Issues: They didn’t handle compiled code (like C extensions) cleanly across different operating systems.

To fix this, the community created the Wheel (.whl) format in 2012. Wheels completely superseded Eggs. A Wheel is essentially a highly standardized, pre-compiled ZIP file. Because all the heavy lifting is done before you download it, pip can install a Wheel almost instantly by simply unzipping it directly into your environment.
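To make the "a Wheel is a ZIP file" idea concrete, here is a minimal sketch that builds a small archive in memory and reads it back using only the standard library. The file layout and names are hypothetical and only loosely modelled on a pure-Python wheel; a real wheel contains additional metadata files and is produced by a build tool rather than by hand.

```python
import io
import zipfile

# Hypothetical file layout, loosely modelled on a pure-Python wheel.
files = {
    "fibonacci/__init__.py": "",
    "fibonacci/sequence.py": "def compute(n_terms):\n    ...\n",
    "fibonacci-0.1.0.dist-info/METADATA": "Name: fibonacci\nVersion: 0.1.0\n",
}

# Write the files into an in-memory ZIP archive, as a build tool would do on disk.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    for path, content in files.items():
        archive.writestr(path, content)

# Reading the archive back is all an installer needs to do to get at the files.
with zipfile.ZipFile(buffer) as archive:
    names = archive.namelist()
    metadata = archive.read("fibonacci-0.1.0.dist-info/METADATA").decode()

print(names)
print(metadata)
```

Because nothing in the archive needs to be executed to unpack it, installation reduces to copying files into place.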

requirements.txt

A requirements.txt is a text file where each line represents a package or library that your project depends on. A package management tool like pip can use this file to install all the necessary dependencies.

requests==2.26.0
numpy>=1.21.0
matplotlib<4.0

While requirements.txt is incredibly common, it is not a packaging tool. requirements.txt is meant for deployments (e.g., telling other researchers exactly which versions of packages to download so the application runs identically on their machine). Packaging metadata (like the dependencies declared in pyproject.toml) is meant for distribution (e.g., telling the world what abstract dependencies your library needs so it can be safely installed alongside other software).
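The difference between an exact pin (==) and a looser range (>=, <) is easier to see once a requirement line is split into its parts. The sketch below uses a deliberately simplified regular expression; the function name is hypothetical, and real requirement lines support much more (extras, environment markers, URLs), so tools such as pip rely on a full parser instead.

```python
import re

# Simplified pattern: a name, a comparison operator, and a version.
# Real requirement lines allow much more (extras, markers, URLs).
REQUIREMENT = re.compile(
    r"^\s*([A-Za-z0-9._-]+)\s*(==|>=|<=|~=|!=|<|>)\s*([A-Za-z0-9.*+!]+)\s*$"
)


def parse_requirement(line):
    """Split one requirement line into (name, operator, version)."""
    match = REQUIREMENT.match(line)
    if match is None:
        raise ValueError(f"Unsupported requirement line: {line!r}")
    return match.groups()


lines = ["requests==2.26.0", "numpy>=1.21.0", "matplotlib<4.0"]
for line in lines:
    name, operator, version = parse_requirement(line)
    kind = "pinned" if operator == "==" else "range"
    print(f"{name}: {kind} ({operator} {version})")
```

Pinned entries suit a deployment that must be reproduced exactly, whereas ranges suit a library that has to coexist with other packages.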

Third-party tools

Over the years, a plethora of third-party tools emerged to plug the gaps left by Python’s built-in utilities. Managing a project required juggling separate tools for dependency resolution, virtual environments, and publishing.

pyvenv & virtualenv: Early tools dedicated entirely to creating isolated environments so different projects wouldn’t break each other’s dependencies. (pyvenv was later deprecated in favor of Python’s built-in venv module).
Poetry: One of the most successful all-in-one modern tools. Poetry revolutionized Python by combining dependency management, environment isolation, and package building into a single tool using a pyproject.toml file.

Today, while Poetry remains highly popular, the ecosystem is shifting toward ultra-fast, next-generation tools like uv, which handles environments, syncing, and building at lightning speeds while strictly respecting modern packaging standards.

pyproject.toml

Introduced in PEP 518, pyproject.toml is the latest file for packaging a Python project. Like a .cfg file, a TOML file is designed to be declarative and easy to read. It is the current recommended way to package your Python project.

Callout

TOML stands for Tom’s Obvious Minimal Language!

When originally introduced, pyproject.toml was only designed to solve the “chicken-and-egg” problem by declaring exactly which build system pip should download to build your package. A bare minimum pyproject.toml looked like this.

TOML

[build-system]
# Minimum requirements for the build system to execute.
requires = ["setuptools", "wheel"]

At first, your project’s actual metadata (like its name, version, and author) still had to live in a secondary file like setup.cfg or setup.py.

With the introduction of PEP 621 in 2020, project metadata could also be stored in pyproject.toml, meaning you now only need a single file to specify all the build requirements and metadata for your package! This is still the preferred approach in the community.

By moving to pyproject.toml, Python packaging has finally aligned with other modern languages (like Rust’s Cargo.toml or Node’s package.json), giving beginners a safe, clean, and unified way to manage their code.

Key Points

A package can be built with as few as 3 files: a metadata file, a Python script, and an __init__.py file
pyproject.toml files have 2 key tables, [build-system] and [project]
Editable installs allow for quick and easy package development
There are multiple standards out there for Python packaging, but pyproject.toml is the current recommended way.
uv streamlines the package development process over using inbuilt Python tooling

Content from Creating R Packages

Last updated on 2026-06-08

Overview

Questions

Where do I start if I want to make an R package?
What will I need / want in my package?
What’s considered good practice with packaging?

Objectives

Create and build a basic example R package
Understand all the parts and decisions in making the package

Introduction

In this episode we will create our own R package from scratch. If you’re feeling adventurous, feel free to create your own package content; otherwise, follow along with this example of a Fibonacci counter.

Callout

We will be using a couple of R packages to assist with boiler-plate code generation so ensure that these are installed:

devtools
usethis

R Package Structure

The most basic directory structure of an R package is as follows:

📦 myproject/
├── 📂 R/
│   └── 📄 my_code.R
├── 📄 DESCRIPTION
└── 📄 NAMESPACE

where

📦 myproject/ is the root directory of the package.
📂 R/ contains the source code (.R files).
📄 DESCRIPTION contains basic metadata describing the package
📄 NAMESPACE is an automatically generated file that details imports (functions from other packages used in your package) and exports (functions your package makes available to other users). It should never be manually edited.

Challenge

What other files and content go into a package?

Think back to the earlier episodes and try to recall all the things that can go into a package.

Show me the solution

Other metadata files - e.g. LICENCE, README.md, citation.cff
tests - A directory full of tests (unit, integration, etc…)
Documentation, both docstrings and long-form (termed ‘vignettes’ in R)
Example data or other resources

Package names must contain just letters, numbers and ‘.’ - hyphens and underscores are not permitted. Choosing a name that succinctly describes your package in a catchy way is quite a challenge! In this episode we’ll be making a package to generate Fibonacci sequences, so we’ll use the package name fibonacci.

To create this package skeleton we’ll use the usethis package, which in addition to creating the folder structure will also populate the DESCRIPTION with some basic metadata. To do this, run the command below from within RStudio:

usethis::create_package('/path/to/fibonacci')

This will do several things:

Creates directory fibonacci at location /path/to, so be sure to set the path to somewhere sensible such as your Documents or Home folder
Creates folder for R scripts
Creates DESCRIPTION and populates it
Creates an R project
Opens the project in a new RStudio window

Once you are working in the new project take a look at the Files tab to see what has been created:

📦 fibonacci/
├── 📂 R/
├── 📄 .gitignore
├── 📄 .Rbuildignore
├── 📄 fibonacci.Rproj
├── 📄 DESCRIPTION
└── 📄 NAMESPACE

The additional files beyond the bare minimum detailed before are:

📄 fibonacci.Rproj marks the folder as an RStudio project; RStudio uses it to set the working directory and to make switching between projects easier
📄 .gitignore excludes the .Rproj.user directory from git (NB: a git repository hasn’t been initialised)
📄 .Rbuildignore excludes the .Rproj file and the .Rproj.user directory from the R package build process (more details later)

Optional: Using usethis

usethis does nothing that you cannot do by hand - it simply creates folders and files in the right place. It’s particularly useful for generating skeleton package directories which can be tedious otherwise, and ensuring that everything is laid out correctly.

While it’s a very useful tool it’s still important to understand what it is doing - fortunately usethis is very verbose and explains every step it has taken.

`DESCRIPTION`

We’ll next turn our attention to the DESCRIPTION metadata file. This should look like the output below. It contains fields for basic information such as the name, a one-line title, a version number (more on versioning in a later episode), the authors, a longer description, and a licence. You don’t need to worry about the last three fields - they are related to how the documentation is generated and the defaults will suffice.

Package: fibonacci
Title: What the Package Does (One Line, Title Case)
Version: 0.0.0.9000
Authors@R:
    person("First", "Last", , "first.last@example.com", role = c("aut", "cre"))
Description: What the package does (one paragraph).
License: `use_mit_license()`, `use_gpl3_license()` or friends to pick a
    license
Encoding: UTF-8
Roxygen: list(markdown = TRUE)
RoxygenNote: 8.0.0

Optional: Choosing a licence

There are a large number of different software licences, differentiated by how you want others to be able to use and extend your code.

One of the most permissive licences is the MIT licence, which is a sensible default if you don’t want to place any restrictions on the use of your software. choosealicense.com can help identify a more suitable licence if you don’t want it fully permissive.

Once you have chosen a licence, you can add it to the package with one of the usethis helpers; e.g. usethis::use_mit_license()

NB: in British English the more common spelling is ‘licence’ with a second ‘c’, but in US English it is spelled ‘license’ and is the more common spelling found in documentation.

Discussion

Populate `DESCRIPTION`

Enter appropriate values for the Title, Description, and Authors@R fields.

Adding code

The example application for this episode is the same as in the Python episode: generating values from a Fibonacci sequence of a specified length. An R implementation of the iterative function is shown below.

R

compute <- function(n_terms) {
    current_num <- 0
    next_num <- 1

    for (i in seq_len(n_terms)) {
        print(current_num)
        prev_num <- current_num
        current_num <- next_num
        next_num <- prev_num + current_num
    }
}

To add this to the project we can either manually save it into a file in the R/ directory or run usethis::use_r("sequence"), which will create the file R/sequence.R and open it in RStudio, saving you a few clicks.

To load the latest development version of the package into the environment we run devtools::load_all(). This doesn’t install the package into your main R installation - instead it just makes the functions available in your current session. It needs to be run each time a file in R/ is modified.

You can then run the function from the R console with compute(5) and verify that it works as expected.

Challenge

Confirm that the code runs

Ensure that the compute function can be run and gives the expected output.

Show me the solution

R

> devtools::load_all()
> compute(5)
[1] 0
[1] 1
[1] 1
[1] 2
[1] 3

Documenting functions

As discussed in a previous lesson, docstrings are a way of providing a summary of a function to users. They detail the purpose of the function, describe every parameter and the return value, show example usage, and provide any other information that will help users. In R the docstrings can be viewed using the ? operator, e.g. ?lm will show the documentation page for the lm standard library function.

In this section we will annotate our Fibonacci function to generate a similar help page. To do so, in RStudio ensure that sequence.R is open in the editor and place the cursor in the compute function. Then select Code -> Insert Roxygen Skeleton from the top bar. This will insert formatted comments above the function’s code.

Challenge

Validating the package

Running devtools::check() provides a way of ensuring that our package is correctly formatted and adheres to the standards to be uploaded onto CRAN. Try running it now and see if your package passes the tests. You should have at least 1 warning resulting from the unpopulated docstrings. Try to get to 0 errors, 0 warnings, and 0 notes.

NB: you can safely ignore warnings about being unable to verify the current time; these result from the check fetching the current time from a website that occasionally isn’t reachable.

Show me the solution

You might need to add a licence if you haven’t already with usethis::use_mit_license() or similar.

Here is an example of the populated docstrings, which can be viewed in the Help pane with ?compute.

R

#' Generates values from the Fibonacci sequence.
#'
#' @param n_terms The number of terms to generate.
#'
#' @returns Doesn't return anything but prints the values instead.
#' @export
#'
#' @examples
#' compute(5)

Adding dependencies

Most packages depend on other R packages rather than being entirely self-contained. This section will demonstrate how to bring in existing R packages and use their functions.

In the Python example we used numpy to turn the iterative sequence generation into a vectorized version. However, most mathematical operators and functions in base R support vectorized functionality so there is no need to use an external package. Instead, we’ll bring in the dplyr library to incorporate a tidyverse-style approach.

The function below shows the same implementation of Binet’s Formula but this time within the framework of a tidyverse workflow using a tibble rather than a base R data.frame and the mutate function to add new columns.

The way in which we’d normally access the tibble and mutate functions in a data analysis script is to run library(dplyr), but library calls are not used in package code as they modify the user’s environment. Instead, we must explicitly specify the package source using the :: notation, i.e. dplyr::mutate.

R

compute_vectorized <- function(n_terms) {
    # Create an array of indices from 0 to n_terms-1
    df <- dplyr::tibble(
        n = seq(0, n_terms-1)
    )

    # Define the Golden Ratio components
    phi <- (1 + sqrt(5)) / 2
    psi <- (1 - sqrt(5)) / 2

    # Apply Binet's Formula across the entire array at once
    # F(n) = (phi^n - psi^n) / sqrt(5)
    df <- df |>
        dplyr::mutate(
            sequence_raw = (phi^n - psi^n) / sqrt(5),
            sequence_int = as.integer(sequence_raw)
        )
    df
}

However, devtools::check() should now show a warning: despite the source code stating that these functions come from dplyr, the package itself still needs to declare dplyr as a dependency.

R

❯ checking dependencies in R code ... WARNING
  '::' or ':::' import not declared from: ‘dplyr’

❯ checking R code for possible problems ... NOTE
  compute_vectorized: no visible binding for global variable ‘n’
  compute_vectorized: no visible binding for global variable
    ‘sequence_raw’
  Undefined global functions or variables:
    n sequence_raw

Challenge

Add `dplyr` as a dependency

The DESCRIPTION file needs to be updated to declare dplyr as a dependency. Unlike Python’s pyproject.toml, which uv initialises with an empty dependency list, there is no obvious placeholder for where to do this.

Give me a hint

Either refer to the R packaging documentation, or have a look at the available usethis functions as one of them will come in handy here (type usethis:: then press TAB in the R console to view all the functions in a package.)

Show me the solution

An Imports section needs to be added to DESCRIPTION as follows. This can either be modified by hand or automatically using usethis::use_package("dplyr").

Imports:
    dplyr

Optional: Fixing no visible binding notes

The tidyverse functions heavily rely on so-called Non-Standard Evaluation. This means referring to columns directly without quotation marks or explicitly referencing it as a column. Compare these 3 ways to make a column called ‘foo’. The first 2 use standard evaluation - the R interpreter knows that foo is either a column because it is directly referenced as such using the $ syntax, or it is written as a string. However, in the third example the R interpreter just sees foo = 1:10 and thinks that foo is a global variable that hasn’t been declared before, hence the note.

R

df$foo <- 1:10
df['foo'] <- 1:10
df |> mutate(foo = 1:10)

In practice, these notes can be safely ignored if you do not intend to upload your package to CRAN. However, if you do intend to publish it on CRAN then the check must be completely clean. The simplest solution is to avoid non-standard evaluation altogether. Non-standard evaluation makes interactive data analysis code cleaner, but it offers little benefit in library code, which is written far less frequently.

The compute_vectorized function should now be able to be loaded and run from the R console! Verify that the results are the same as compute and make any suitable modifications - including adding docstrings.

Optional: Types of dependencies

There are 3 types of dependencies in R packages.

Imports: Must be installed for your package to work. This is what usethis::use_package() defaults to and is generally the right choice
Suggests: Not required for the core package functionality but might be needed for development, i.e. running tests or building vignettes
Depends: Used to require a specific R version and packages. However, it also attaches those packages to the user’s search path, so do not use it unless absolutely necessary

Key Points

A package can be built with as few as 3 files: DESCRIPTION, NAMESPACE, and a source file.
usethis helps generate package skeletons, add dependencies, and add source code files
devtools::load_all() loads the current package allowing for quick testing without needing to install it
devtools::check() validates the package structure and contents

Content from Versioning

Last updated on 2026-06-08

Overview

Questions

Why is versioning essential in software development? What problems can arise if versioning is not properly managed?
How can automation tools, such as those for version bumping, improve the software development process?
Why is it important to maintain consistency and transparency in software releases?

Objectives

Explain why versioning is crucial for software development, particularly in maintaining reproducibility and ensuring consistent behaviour of the code after changes.

Introduction

In previous episodes, we developed a basic Python/R package to demonstrate the importance of software reproducibility. However, a crucial question that we haven’t addressed yet is: how can we, as the developers, ensure that a change in our package’s source code does not result in the code failing or behaving incorrectly? This is also an important consideration for when you are releasing your package.

Discussion

One of the pitfalls of packaging is to fall into poor naming conventions, even for scripts. For instance, how many times have you worked on scripts that were named my_script_v1.py or my_script_final_version.py? What were your main challenges with this approach, and what alternative solutions can you think of to circumvent this naive approach?

Semantic Versioning

The answer to the question above is based on a concept called versioning. Versioning is the practice of assigning unique version numbers to different states or releases of a given package to track its development, improvements, and bug fixes over time. The most popular approach for software packaging is the Semantic Versioning framework, which can be summarised as follows:

Given a version number X.Y.Z, where X is the major version, Y is the minor version and Z is the patch version, you increment:

X when you make incompatible API changes,

Y when you add functionality in a backwards compatible manner,

Z when you make backwards compatible bug fixes.
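The three rules above can be sketched as a small function. This is an illustrative helper with a hypothetical name, not part of any packaging tool; note that Semantic Versioning also resets the lower-order numbers to zero whenever a higher-order number is incremented.

```python
def bump(version, part):
    """Return the next version string after a 'major', 'minor' or 'patch' change."""
    major, minor, patch = (int(piece) for piece in version.split("."))
    if part == "major":
        # Incompatible API change: reset minor and patch.
        return f"{major + 1}.0.0"
    if part == "minor":
        # Backwards compatible feature: reset patch.
        return f"{major}.{minor + 1}.0"
    if part == "patch":
        # Backwards compatible bug fix.
        return f"{major}.{minor}.{patch + 1}"
    raise ValueError(f"Unknown part: {part!r}")


print(bump("1.2.3", "patch"))
print(bump("1.2.3", "minor"))
print(bump("1.2.3", "major"))
```

Starting from 1.2.3, the three calls give 1.2.4, 1.3.0 and 2.0.0 respectively.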

Callout

Recall: API

An Application Programming Interface (API) is the name given to the way different programs, or parts of a program, communicate with each other. It provides a set of functions and methods that can be used to interact with a piece of software or a data service. Commonly, APIs are used within web-based applications to enable users to receive information from a given service, such as logging into social media accounts, creating weather widgets, or finding geographical locations.

The first version of any package typically starts at 0.1.0, and any change following the semantic versioning rules above results in an increment to the appropriate version number. For example, updating software from version 0.1.0 to 1.0.0 is called a major release. Version 1.0.0 is commonly referred to as the first stable release of the package.

An important point to highlight is that the semantic versioning guidance above is a general rule of thumb. Exactly when to bump the version of your package is a decision for you, as the developer, to make. Developers typically take the size of the project into account; for example, small packages may make a patch release for every individual bug that is fixed. Larger packages, on the other hand, often group multiple bug fixes into a single patch release to keep the release history manageable, because making a release for every fix would produce a myriad of releases, which can be confusing for users and other developers. The table below shows examples of major, minor and patch releases that developers made for the Python language itself.

Release Type	Version Change	Description
Major Release	2.0.0 to 3.0.0	Introduced significant and incompatible changes, such as the print function and new syntax.
Minor Release	3.7.0 to 3.8.0	Added new features like the walrus operator and positional-only parameters, backward-compatible.
Patch Release	3.8.0 to 3.8.1	Fixed bugs and made performance improvements without adding new features or breaking changes.

Table 1: Examples of major, minor and patch releases of the Python language.

Callout

Pre-release Versions

Pre-release versions in semantic versioning are versions of the software that are still in development or testing before a stable release. They are denoted by appending a hyphen and a series of dot-separated identifiers to the version number, such as 1.0.0-alpha or 1.0.0-beta.1. These versions allow developers to release early versions for testing and feedback while clearly indicating their status.
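One consequence of this notation is that a pre-release sorts before the release it leads up to, so 1.0.0-alpha comes before 1.0.0. The sketch below builds a sort key that follows the Semantic Versioning precedence rules for the simple cases shown here; the function name is hypothetical and build metadata (anything after a +) is not handled.

```python
def precedence_key(version):
    """Build a sort key for a version such as '1.0.0' or '1.0.0-beta.1'."""
    core, _, prerelease = version.partition("-")
    numbers = tuple(int(piece) for piece in core.split("."))
    if not prerelease:
        # A normal release sorts after every pre-release of the same core version.
        return (numbers, 1, ())
    identifiers = tuple(
        # Numeric identifiers compare as numbers and rank below textual ones.
        (0, int(item), "") if item.isdigit() else (1, 0, item)
        for item in prerelease.split(".")
    )
    return (numbers, 0, identifiers)


versions = ["1.0.0", "1.0.0-beta.1", "1.0.0-alpha", "0.9.0", "1.0.0-beta"]
print(sorted(versions, key=precedence_key))
```

The sorted result is 0.9.0, 1.0.0-alpha, 1.0.0-beta, 1.0.0-beta.1 and finally 1.0.0.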

Callout

Once we publicly release a version of our software, it is crucial to maintain consistency and avoid altering it retroactively. Any necessary fixes need to be addressed through subsequent releases, typically indicated by an increment in the patch number. For instance, Python 2 reached its final version, 2.7.18, in 2020, more than a decade after the release of Python 3.0. If the developers decide to discontinue support for an older version, leaving vulnerabilities unresolved, they should communicate this transparently to their users and encourage them to upgrade.

Challenge

Challenge 1: Semantic Versioning Decision Making

Imagine you are a developer working on a library called DataTools, which provides various utilities for data manipulation. The library uses semantic versioning and is currently at version 1.2.3. You have implemented a new feature that adds support for reading and writing CSV files with custom delimiters.

According to semantic versioning, should you bump the version to 1.3.0, 1.2.4, or 2.0.0? Explain your reasoning.

Give me a hint

Think about whether the new feature introduces any breaking changes for existing users.

Show me the solution

According to semantic versioning, since the new feature adds functionality in a backward-compatible manner, the version should be bumped to 1.3.0. This signifies a minor version increase.

Callout

Versioning vs Version Control

Note: although they share similarities, you should not confuse software versioning with version controlling your software. The table below outlines some similarities and differences to help you differentiate them:

Aspect	Version Control	Versioning
Purpose	Tracking changes, enhancing collaboration, and maintaining a history of revisions	Differentiating between various stages of software development or releases, ensuring clear identification of updates and changes
Features	Branching, conflict resolution, merging	Version numbering, compatibility guidelines, and release notes
Example	Git	Semantic Versioning
Benefits	Collaboration, code integrity, and project management	Communication of changes (major, minor, patch), transparency, and compatibility
Challenges	Managing conflicts and merges with multiple contributors, ensuring training for teams, and integrating within existing processes	Ensuring backward compatibility and choosing version numbers that accurately reflect the changes, so that users are not confused

Key Points

Versioning is crucial for tracking the development, improvements, and bug fixes of a software package over time. It ensures that changes are documented and managed systematically, aiding in reproducibility and reliability of the software.
Versioning enables users to track code changes and dependencies, allowing reliable recreation of specific software versions, and further aiding the reproducibility of your software.

Content from Publishing Packages

Last updated on 2026-06-08

Overview

Questions

How can I make my software easily accessible to a general audience?

Objectives

Become familiar with using GitHub to host software packages
Learn how to publish Python packages on PyPI
Learn how to publish R packages on CRAN

Finishing touches

README

You’ve created a package that conforms to your programming language’s standards and provides functionality that might be useful to others. This is the point at which you can start to think about releasing and publishing your software, once some housekeeping has been taken care of.

Firstly, all packages should contain a README.md file that explains what the project is, how users can install it and how they can use it. A good README.md file may look something like:


# My Project

My Project is a simple utility tool designed to perform basic operations on text files.
Whether you need to count words, find specific phrases, or extract data, this tool has you covered.

## Installation

You can install My Project via pip:

$ pip install my-project

## Usage

from my_project import text_utils

text = "Lorem ipsum dolor sit amet, consectetur adipiscing elit."
word_count = text_utils.count_words(text)
print("Word count:", word_count)

This will output:

Word count: 8

Notice that the README.md should be included at the top level of our project directory.

Python - referencing README in metadata

Your README should also be referenced in pyproject.toml by adding the following line:

TOML


[project]
readme = "README.md"

R - Rmarkdown READMEs

In R, README.md can be automatically generated from an Rmarkdown README.Rmd file, allowing your example outputs to be generated directly from the code.

Refer to the usethis documentation for further details.

Callout

In the README.md file, developers also usually include a “contributing” section aimed at newcomers, who are typically from outside the project. The purpose of this section is to encourage new developers to work on the project, while ensuring they follow the etiquette set by the project developers. This may look something like:

### Contributing

Contributions to My Project are welcome! If you'd like to contribute, please follow these steps:

1. Fork the repository.
2. Create a new branch for your feature (git checkout -b feature/new-feature).
3. Make your changes and ensure tests pass.
4. Commit your changes (git commit -am 'Add new feature').
5. Push to the branch (git push origin feature/new-feature).
6. Create a new Pull Request.

Licensing

Following this, it is essential for your software to have a license to make clear to users what their rights are with regard to usage and redistribution. A license also provides the developer with some legal protections, if needed. There are many different open source licenses available, and it is up to the developer(s) to choose the appropriate license. You can explore alternative open source licenses at www.choosealicense.com. It is important to note that your selection of license may be constrained by the licenses of your dependencies.

The most common license used in open source projects is the MIT license. The MIT license is permissive, which allows users to freely use, modify, and distribute software while providing a disclaimer of liability.

Callout

The MIT License has the following terms:

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the “Software”), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

Both R and Python have a section in their corresponding metadata files for linking the LICENCE file:

R’s DESCRIPTION file can be updated with the name of the licence and a reference to the file automatically by using the usethis::use_mit_license() or similar function (as described in the Creating R Package episode)

License: MIT + file LICENSE

Python’s pyproject.toml has a license field for the same purpose:

TOML

[project]
license = {file = "LICENSE"}

Releasing software on GitHub

Once you have prepared all of the material above, you will be in a good position to release your software to a wider audience. Given that you are hopefully already using version control (see the course for further details), you will already have access to a platform for hosting your software, as both R and Python can install packages from Git hosting providers, including GitHub.

In R, the pak package can be used to install packages from GitHub, using the format username/repository. This is very common practice, as the large overhead of hosting packages on CRAN means many developers either host their package solely on GitHub or host development versions there.

R

pak::pak("tidyverse/dplyr")

In Python this practice is less common for large packages but it is a perfectly valid route if you do not wish to host your package on PyPI.

BASH

pip install git+https://github.com/pandas-dev/pandas.git

If you are using GitHub to host your code, any new releases of your software should be explicitly ‘tagged’. Tags are a way of permanently marking a specific point in your repository’s history, which can be used to denote a version that is suitable for others to use. A tag is a fixed reference to a single commit, making it simple to identify specific versions of the software, and tags are commonly named following the Semantic Versioning framework (e.g. v1.0.0). For more information about how GitHub uses tags for software releases, see releases.

Tagging a release on GitHub involves going to your repository’s page, clicking on the ‘Releases’ link on the right hand navigation bar, and following the steps to Draft a New Release by creating a new tag with the appropriate version number. Others can then install your package referencing specific releases.

Callout

Automating releases

GitHub Actions, introduced in the Testing and Continuous Integration lesson, can also be used to automatically generate releases based on certain conditions, for example when a tag is pushed that fits a certain format.

Callout

Remember to never publish any sensitive information, such as passwords, directly on GitHub. Storing sensitive data in your repository makes it publicly accessible (if your repository is public) or easily accessible to anyone with repository access (if private). This can lead to unauthorised access, security breaches, and potential misuse of your code. Instead, you should use GitHub Secrets or environment variables to securely manage the sensitive information, ensuring it is kept safe and only accessible by authorised collaborators or workflows.
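One way to keep a secret out of the repository is to read it from an environment variable at run time. The sketch below assumes a variable called PYPI_API_TOKEN, which is a hypothetical name chosen for illustration; the function fails loudly instead of falling back to a hard-coded value.

```python
import os


def read_token(name="PYPI_API_TOKEN"):
    """Return the secret stored in the environment variable `name`.

    Raises RuntimeError if the variable is missing or empty, so that a
    hard-coded fallback is never needed in the source code.
    """
    token = os.environ.get(name)
    if not token:
        raise RuntimeError(f"Environment variable {name} is not set")
    return token
```

In a GitHub Actions workflow, the value would typically be supplied from a repository secret and exposed to the step as an environment variable. Avoid printing the returned value, as workflow logs may be visible to others.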

Publishing to PyPI

*Figure 2: Screenshot of the main landing page of PyPI.*

PyPI (the Python Package Index) is the official package repository for the Python community. It serves as the central location where developers can publish and share their packages, making them easily accessible to the wider community. When we use pip to install packages from the command line, it fetches them from PyPI by default. Uploading your packages to PyPI is recommended if you want to distribute your projects widely, as it allows other developers to easily find, install, and use your software.

Callout

Developers often use TestPyPI for testing and validating packages before they are officially published on PyPI.

To publish packages to PyPI, two tools are needed:

build, a command-line tool that builds source distributions and wheel distributions of Python projects from the metadata specified in pyproject.toml.
twine, the tool we use to securely upload the built distributions to PyPI; it handles tasks like authentication and transfer of the package files.

These are both Python packages and can be installed by pip.

BASH

pip install build twine

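Running build will create a source distribution and a wheel distribution in the dist directory, for example dist/your_project_name-1.0.0.tar.gz and dist/your_project_name-1.0.0-py3-none-any.whl for a pure-Python project. Note that the wheel specification replaces hyphens in the project name with underscores in the file name, and current build backends generally do the same for the source distribution.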

BASH

python -m build
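After building, it can be useful to confirm that both kinds of distribution were produced before uploading anything. The following is a minimal sketch; the function name classify_dist is hypothetical, and it simply groups files by their extension.

```python
from pathlib import Path


def classify_dist(dist_dir="dist"):
    """Group the files in a dist directory into source distributions and wheels."""
    found = {"sdist": [], "wheel": []}
    directory = Path(dist_dir)
    if not directory.is_dir():
        # Nothing has been built yet.
        return found
    for path in sorted(directory.iterdir()):
        if path.name.endswith(".tar.gz"):
            found["sdist"].append(path.name)
        elif path.suffix == ".whl":
            found["wheel"].append(path.name)
    return found


print(classify_dist("dist"))
```

If either list is empty after running the build, the upload step is likely to be missing a distribution that users expect.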

Next, we can use twine to a) validate the build and b) upload the package to TestPyPI to test everything is in order.

BASH

twine check dist/*

twine upload --repository testpypi dist/*

Once we have confirmed that everything works as expected on TestPyPI, we may proceed with uploading our package to PyPI:

BASH

twine upload dist/*

Finally, once our package is available on PyPI, other users can install it using the regular pip command:

BASH

pip install your-project-name
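To confirm which version ended up installed, the standard-library importlib.metadata module can look up the metadata of an installed distribution. In this sketch the wrapper name installed_version is hypothetical, and your-project-name is the placeholder used above.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(distribution_name):
    """Return the installed version of a distribution, or None if it is not installed."""
    try:
        return version(distribution_name)
    except PackageNotFoundError:
        return None


# Prints a version string if the package is installed, otherwise None.
print(installed_version("your-project-name"))
```

Note that the lookup uses the distribution name given to pip, which may differ from the name used in import statements.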

Publishing packages with uv

As with package development, uv also offers tooling for publishing without requiring the installation of any further dependencies. In particular, it has build and publish commands that handle these two steps, using whichever build backend is declared in the [build-system] table of pyproject.toml.

BASH

# Build the wheels and source package
uv build

# Publish to TestPyPI
uv publish --publish-url https://test.pypi.org/legacy/

# Publish to PyPI
uv publish

Publishing to CRAN

Publishing a package to CRAN is a very different matter to publishing on PyPI. Whereas PyPI acts as an unregulated market where users can upload packages without any restrictions on their content, CRAN functions more like a boutique with a gatekeeper. Any candidate package must pass a stringent list of checks, including but not limited to:

The metadata is valid (even down to meeting the punctuation requirements)
The documentation builds
All tests pass

These checks will be run on both the latest release of R and the upcoming development version, across all major operating systems. If your package is accepted on CRAN, the checks will continue to be run periodically, and your package is at risk of being removed if you fail to fix any issues that appear. Finally, all submissions will be assessed by human reviewers. The submission process can be started via devtools::submit_cran(), which will interactively step you through all the criteria and automate the process of building and uploading your package.

Given that R packages can be installed from GitHub very straightforwardly, why might you consider jumping through all the additional hoops for CRAN? Essentially because it offers users of your software confidence that the package will run exactly as described and it meets a minimum threshold of quality. This isn’t to say that CRAN is always the right choice - for many small pieces of work GitHub will be a perfectly valid option.

Key Points

R and Python packages can both be installed directly from GitHub
GitHub allows you to create named releases using tags
You can easily publish your package on PyPI for the wider Python community, allowing your users to simply install your software using pip install.
Publishing a package on CRAN is a thorough process with a manual review
