3  Development Workflow

It is important to create a development environment and workflow that not only allows effective collaboration but also sets a foundation for the growth and evolution of your project. In this guide, we discuss organizing your project in a repository and setting up a workflow for personal and collaborative projects.

3.1 Project organization

In software development, early choices shape the final outcome of a project. One of the most important is how you structure the project: organizing it systematically from the start is a crucial step toward making your work reproducible.

3.1.1 Essential principles

  • Directory Structure: Employ a consistent and meaningful directory naming convention.
  • Naming Files and Directories: Use underscores or hyphens instead of spaces.
  • Handling Access Levels: Use different Git repositories for the public and private parts of your project. Use .gitignore or a dedicated untracked folder for sensitive content and for files that are too large to track.
  • Clear Documentation: Include a README at the root to provide a project summary and add an appropriate LICENSE to your project. This establishes the terms under which others can engage, reuse, and modify it. Also, this ensures your work is legally safeguarded and the usage rights are clearly defined.
  • Adhere to Coding Standards: Follow a consistent coding style to enhance code readability.

3.1.2 Other recommendations

  • Code Reusability: Store reusable software elements in a separate repository for efficiency across projects and consider packaging them.
  • Code Modularity: Aim for modular code design to improve maintainability and reusability, especially in larger projects.
  • Dependency Management: Use virtual environments (Python) or similar tools to manage project dependencies, ensuring consistent environments.
  • CI/CD Integration: Consider setting up Continuous Integration/Continuous Deployment pipelines to streamline testing and deployment processes.

A common repository structure that works well for MATLAB and Python projects:

your_project/

├── build/                    # Compiled application for distribution (if applicable)
├── docs/                     # documentation directory
├── lib/                      # third-party libraries
├── notebooks/                # Jupyter notebooks or MATLAB Live Editor scripts
├── src/                      # your project's source code, including the main script
│   └── mypkg/                # package
│       ├── module            # nested module
│       └── subpkg1/          # sub-package
├── tests/                    # your test directory  

├── data/                     # data files used in the project (if applicable)
├── processed_data/           # files from your analysis (if applicable)
├── results/                  # results (if applicable)

├── .gitignore                # untracked files 
├── requirements.txt          # software dependencies (Python)
├── README.md                 # overview
└── LICENSE                   # license information
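
As a rough sketch, the layout above could be scaffolded with a few lines of Python. The function name `scaffold` is an assumption made for illustration; the directory and file names are copied from the tree, and the created files are empty placeholders.

```python
from pathlib import Path

# Directory and file names are taken from the layout shown above.
DIRECTORIES = [
    "build", "docs", "lib", "notebooks",
    "src/mypkg/subpkg1", "tests",
    "data", "processed_data", "results",
]
FILES = [".gitignore", "requirements.txt", "README.md", "LICENSE"]


def scaffold(root: Path) -> Path:
    """Create the skeleton under `root` without overwriting existing files."""
    for directory in DIRECTORIES:
        # parents=True also creates `root` and intermediate folders such as src/
        (root / directory).mkdir(parents=True, exist_ok=True)
    for name in FILES:
        # touch() creates an empty file and leaves existing content untouched
        (root / name).touch(exist_ok=True)
    return root
```

Calling `scaffold(Path("your_project"))` creates the folders in the current working directory. Folders you do not need can simply be dropped from `DIRECTORIES`.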

This structure is a guideline and can be adapted based on the specific needs and practices of your project. Some additional observations:

  • Naming convention: use lowercase for folders. Particular metadata files are often capitalized, such as README, LICENSE, CONTRIBUTING, CODE_OF_CONDUCT, CHANGELOG, CITATION.cff, NOTICE, and MANIFEST.
  • Carefully consider how users will access your software. They may not have access to your repository structure when installing it as a library.
  • Generally, all content that is generated at build time or runtime should be added to .gitignore. This likely includes the contents of the processed_data and results folders.
  • Git cannot track empty folders. If you want to add empty folders to enforce a folder structure, e.g., processed_data or results, add a placeholder file such as .gitkeep to each folder.
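
The last two points can be combined: ignore everything inside a generated folder except a placeholder file. The sketch below is one possible way to do this in Python. The helper names `add_gitkeep` and `ignore_rules` are assumptions, and .gitkeep is only a naming convention, not a Git feature.

```python
from pathlib import Path

GENERATED_DIRS = ("processed_data", "results")  # folders named in the text above


def add_gitkeep(root: Path, names=GENERATED_DIRS) -> list:
    """Create each folder if needed and put a .gitkeep into the empty ones."""
    created = []
    for name in names:
        folder = root / name
        folder.mkdir(parents=True, exist_ok=True)
        if not any(folder.iterdir()):  # only touch folders that are empty
            placeholder = folder / ".gitkeep"
            placeholder.touch()
            created.append(placeholder)
    return created


def ignore_rules(names=GENERATED_DIRS) -> list:
    """Return .gitignore lines that ignore folder contents but keep .gitkeep."""
    rules = []
    for name in names:
        rules.append(f"{name}/*")          # ignore everything inside the folder
        rules.append(f"!{name}/.gitkeep")  # but re-include the placeholder file
    return rules
```

The lines returned by `ignore_rules()` can be appended to .gitignore, so that Git tracks the folders themselves but none of the generated files.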

3.2 Project templates

Templates are versatile tools that aim to standardize the software development process across various domains.

3.2.1 GitHub repository templates

You can make an existing repository a template, so you and others can generate new repositories with the same directory structure, branches, and files. Note that a template repository cannot include files stored using Git LFS. For more information, see Creating a template repository.

3.2.2 Cookiecutter for Python

Cookiecutter creates Python projects from project templates. The advantage of using Cookiecutter is that new projects are set up quickly from a standardized template structure and can include everything needed to get started on a project, such as directory layouts, sample code, and even integrations with tools and services.

  • Cookiecutter PyPackage: A comprehensive template for Python projects, facilitating the creation of Python packages with best practices in testing, documentation, and package structure. Ideal for developers looking to distribute their Python libraries.
  • Cookiecutter Data Science: Tailored for data science projects, this template organizes data, models, analyses, and notebooks, ensuring that data science projects are reproducible and well-documented from the start.
  • Cookiecutter Machine Learning: Designed specifically for machine learning projects, this template includes directories for datasets, models, notebooks, and scripts, supporting ML project best practices and facilitating experimentation and collaboration.
    • https://dagshub.com/DagsHub/Cookiecutter-MLOps
    • https://github.com/Chim-SO/cookiecutter-mlops

For installation instructions, check out Cookiecutter installation instructions.
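
To make the idea of a template concrete, here is a minimal sketch of what a Cookiecutter template consists of and how it could be rendered from Python. The template name `demo-template` and the variable `project_slug` are assumptions chosen for illustration, and `render` requires the third-party cookiecutter package to be installed.

```python
import json
from pathlib import Path


def write_minimal_template(root: Path) -> Path:
    """Write a tiny, hypothetical Cookiecutter template below `root`."""
    template = root / "demo-template"
    # Folder and file names may contain template variables.
    project_dir = template / "{{cookiecutter.project_slug}}"
    project_dir.mkdir(parents=True)
    # cookiecutter.json lists the variables and their default values.
    (template / "cookiecutter.json").write_text(
        json.dumps({"project_slug": "my_project"}, indent=2)
    )
    # File contents are rendered with the same variables.
    (project_dir / "README.md").write_text("# {{cookiecutter.project_slug}}\n")
    return template


def render(template: Path, output_dir: Path) -> Path:
    """Generate a project from the template using the default values."""
    from cookiecutter.main import cookiecutter  # third-party dependency

    # no_input=True skips the interactive prompts and uses the defaults.
    project = cookiecutter(str(template), no_input=True, output_dir=str(output_dir))
    return Path(project)
```

With the defaults above, rendering should produce a folder named my_project containing a README.md. The published templates listed earlier follow the same principle with many more files and variables.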

3.3 Reusing projects and repositories

Packaging

Create an installable package or library that can be installed as a dependency in the environment.

Git submodules

Git submodules allow you to keep a Git repository as a subdirectory of another Git repository. A submodule is a record that points to a specific commit in an external repository. Submodules are useful for incorporating external code or libraries into your project while keeping them separate and easily updatable.

Adding submodules

This will add a new submodule to your repository: git submodule add <repo-url>

Cloning a repository with submodules

When you clone a repository that has submodules, you will have to initialize and fetch the submodules: git submodule init and then git submodule update.

To update the submodules to the latest commit use: git submodule update --remote.

You can also point to a specific commit within a submodule by navigating to the submodule’s directory and using: git checkout <specific-commit>, and then committing the change to the main repository.

Tip

You can use the shorthand command that automatically clones, initializes, and updates all the submodules:

git clone --recurse-submodules <repo-url>
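
If you want to script these steps, for example in a setup script, the commands above can be wrapped with Python's subprocess module. This is only a sketch: the helper names and the `dry_run` flag are assumptions, and the Git commands are the ones listed in this section.

```python
import subprocess
from typing import List, Optional


def git(*args: str, cwd: Optional[str] = None, dry_run: bool = False) -> List[str]:
    """Run `git <args>` in `cwd` and return the argument list that was used."""
    command = ["git", *args]
    if not dry_run:
        # check=True raises CalledProcessError if Git exits with an error
        subprocess.run(command, cwd=cwd, check=True)
    return command


def add_submodule(repo_url: str, cwd: Optional[str] = None, dry_run: bool = False):
    """Add an external repository as a submodule of the repository in `cwd`."""
    return git("submodule", "add", repo_url, cwd=cwd, dry_run=dry_run)


def fetch_submodules(cwd: Optional[str] = None, dry_run: bool = False):
    """After a plain `git clone`: initialize the submodules, then fetch them."""
    return [
        git("submodule", "init", cwd=cwd, dry_run=dry_run),
        git("submodule", "update", cwd=cwd, dry_run=dry_run),
    ]


def clone_with_submodules(repo_url: str, dry_run: bool = False):
    """Clone a repository and all of its submodules in one step."""
    return git("clone", "--recurse-submodules", repo_url, dry_run=dry_run)
```

Passing `dry_run=True` only returns the command that would be executed, which is convenient for checking a script before it modifies a repository.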

Check the status of your submodules

To check the status of your submodules, run: git submodule status

Adding a submodule also creates a file called .gitmodules at the root of the repository. Keep it under version control, just like .gitignore. Then commit and push your changes as you normally would.
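
The .gitmodules file is a small text file that stores the path and URL of each submodule. The sketch below reads such a file with Python's configparser, which handles this simple case but is not a complete parser for Git's configuration format. The submodule name and URL in the sample are hypothetical.

```python
import configparser

# Hypothetical content of a .gitmodules file with a single submodule.
SAMPLE_GITMODULES = '''[submodule "lib/shared-tools"]
\tpath = lib/shared-tools
\turl = https://example.com/your-org/shared-tools.git
'''


def list_submodules(gitmodules_text: str) -> dict:
    """Map each submodule name to its path and URL."""
    # interpolation=None keeps '%' characters in URLs from being interpreted
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(gitmodules_text)
    submodules = {}
    for section in parser.sections():
        # Section names look like: submodule "lib/shared-tools"
        name = section.partition(" ")[2].strip('"')
        submodules[name] = {
            "path": parser[section]["path"],
            "url": parser[section]["url"],
        }
    return submodules
```

In a real repository you would pass the contents of the file, for example `list_submodules(Path(".gitmodules").read_text())`. Note that the commit each submodule points to is not stored in .gitmodules but in the history of the main repository.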

If you are using GitHub Desktop, be aware that there might be some limitations when working with submodules. While GitHub Desktop supports basic submodule functionality, some operations may require using the command line. Known issues include difficulties in initializing submodules, switching branches with submodules, and visualizing submodule changes. These limitations are acknowledged and tracked by the GitHub Desktop team. Although some issues have been addressed over time, there might still be case-by-case issues.

See this discussion as an example. For more details, refer to the official GitHub Desktop documentation or issue tracker.

Git subtree

Git subtree allows you to merge the history of one repository into another as a subdirectory. It essentially brings the contents of a repository into another as if it were part of the directory structure.

In summary, submodules are more suitable when you need to maintain separate histories and explicit references to specific commits of nested repositories, while subtrees are useful when you want to merge the history of nested repositories into a single repository without maintaining separate references.

To be avoided

  • Storing commonly used code in a separate folder on your system and adding that folder to the Python path (for example via PYTHONPATH or sys.path). Other users and developers will not have access to this folder.
  • Copying and pasting code directly, because you lose any upstream changes made in the external repository.

3.4 Dependency management

Managing dependencies is a critical aspect of any software project. Efficient dependency management ensures that your project is reproducible, easy to set up, and less prone to conflicts between the different libraries that your code depends on.

3.4.1 Python

Ensuring that every contributor uses the same dependency versions is essential for project consistency and stability.

  • Virtual Environments: Use venv or virtualenv to create isolated Python environments for your projects. This prevents package versions from interfering with each other across different projects.
  • Requirements File: Use a requirements file to list all dependencies with their specific versions. You can generate this file with the command pip freeze > requirements.txt in an activated virtual environment.
  • Dependency Management Tools: Tools like poetry and pipenv provide more sophisticated dependency management by handling virtual environment creation and dependency resolution in an integrated manner.

Tip

Consider using Conda, a preferred choice within the research software community. Conda is a language-agnostic package and environment manager. It is ideal for projects that require specific Python versions, packages not available via pip, or other dependencies such as R, C, and C++ libraries.
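
For the venv and requirements-file approach listed above, the following sketch uses only the standard library. The function names are assumptions. `pinned_requirements` approximates the output of pip freeze for the interpreter that runs it, but it may differ for editable or version-control installs.

```python
import venv
from importlib import metadata
from pathlib import Path


def create_environment(env_dir: Path, with_pip: bool = True) -> Path:
    """Create a virtual environment, similar to `python -m venv <env_dir>`."""
    venv.create(env_dir, with_pip=with_pip)
    # Every virtual environment contains a pyvenv.cfg file at its root.
    return env_dir / "pyvenv.cfg"


def pinned_requirements() -> list:
    """List installed packages as `name==version` lines, sorted by name."""
    pins = set()
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        if name:  # skip entries with incomplete metadata
            pins.add(f"{name}=={dist.version}")
    return sorted(pins, key=str.lower)
```

For example, `Path("requirements.txt").write_text("\n".join(pinned_requirements()) + "\n")` writes the pinned versions of the current environment to a requirements file.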

3.4.2 MATLAB

MATLAB does not use virtual environments in the same sense as Python, but it allows for setting up paths and toolboxes that act similarly by organizing and encapsulating project-specific functions and scripts. Dependency management in MATLAB often involves ensuring the correct toolboxes are licensed and available, and using MATLAB’s Project feature to manage and share paths and environments with others.

MATLAB toolbox requirements can be found with the function matlab.codetools.requiredFilesAndProducts or with the Dependency Analyzer.