Merged
43 changes: 43 additions & 0 deletions .github/workflows/deploy.yml
Original file line number Diff line number Diff line change
@@ -0,0 +1,43 @@
name: Build and deploy slides

on:
  pull_request:
    branches: [ "main" ]
  push:
    branches: [ "main" ]

  # Allows manual run
  workflow_dispatch:

jobs:
  # Builds slides with quarto and deploys them to a branch
  build:
    runs-on: ubuntu-latest

    steps:
      - name: Checkout code
        uses: actions/checkout@v4

      - name: Set up Quarto
        uses: quarto-dev/quarto-actions/setup@v2

      - name: Render Quarto Project
        run: |
          cd src
          quarto render slides.qmd
          cd ../

      - name: Test pages build
        if: github.ref != 'refs/heads/main'
        uses: JamesIves/github-pages-deploy-action@v4
        with:
          branch: test-pages
          folder: src
          dry-run: true

      - name: Deploy pages for main
        if: github.ref == 'refs/heads/main'
        uses: JamesIves/github-pages-deploy-action@v4
        with:
          branch: gh-pages
          folder: src
2 changes: 2 additions & 0 deletions .gitignore
@@ -0,0 +1,2 @@
*.html
src/slides_files
43 changes: 43 additions & 0 deletions src/dependencies.qmd
@@ -0,0 +1,43 @@
# Dependencies

## Dependencies

- All software has dependencies
- Some are more obvious than others:
- Data/input
- Packages/libraries e.g. numpy, Eigen
- System libraries
- Compiler/Interpreter
- If your code can't run without it, it's a dependency!

## How to discover dependencies

- Some dependencies may be "implicit"
- For example, you may have a library installed on your system
- Since the code "just works", you may not be aware of the dependency
- To find these, try running on a different system (or multiple) and see what breaks
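
A minimal sketch of this check for a Python project, assuming the dependencies are importable modules. The helper name `missing_modules` and the example module list are illustrative, not part of the course material. Run on a fresh system, it gives a first list of what still needs declaring.

```python
import importlib.util


def missing_modules(module_names):
    """Return the top-level module names that cannot be found in this environment."""
    missing = []
    for name in module_names:
        # find_spec returns None when a top-level module is not installed.
        if importlib.util.find_spec(name) is None:
            missing.append(name)
    return missing


if __name__ == "__main__":
    # Hypothetical list: replace with the modules your own scripts import.
    print(missing_modules(["json", "numpy", "scipy"]))
```

This only covers Python modules; data files, system libraries and compilers still need the "run it elsewhere and see what breaks" approach.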

## How to declare dependencies

- List them in a tracked file in the repository
- e.g. add a "Dependencies" section to your README.md
- Specify:
- Versions of each dependency e.g. numpy 2.3.9
- Where/how to acquire the dependency

## Dependency metadata

- There are automated ways of resolving dependencies
- Usually language/tool specific
- Some tools automatically update dependency metadata
- e.g. Rust's cargo, Julia's Pkg, uv for Python
- Project file: Dependencies and compatible versions
- Lock file: Write exact version (plus other metadata e.g. source) of *every*
dependency you are using
- Important to track both - lock files record the exact environment you use
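
To illustrate what "exact version" means, here is a rough sketch that records the versions installed in the current Python environment. It is not how cargo, Pkg or uv write their lock files (those also record sources and every transitive dependency); the function names are assumptions made for this example.

```python
from importlib.metadata import PackageNotFoundError, version


def snapshot_versions(package_names):
    """Map each package name to its installed version, or None if it is absent."""
    snapshot = {}
    for name in package_names:
        try:
            snapshot[name] = version(name)
        except PackageNotFoundError:
            snapshot[name] = None
    return snapshot


def as_pinned_lines(snapshot):
    """Format the installed packages as 'name==version' lines, sorted by name."""
    return [
        f"{name}=={ver}"
        for name, ver in sorted(snapshot.items())
        if ver is not None
    ]


if __name__ == "__main__":
    # Hypothetical package list: use the names from your own project file.
    print("\n".join(as_pinned_lines(snapshot_versions(["numpy", "scipy"]))))
```

In practice, prefer the lock file your package manager generates; this sketch only shows the kind of information it holds.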

## System dependencies

- Conda
- Docker
- Nix/Guix
30 changes: 30 additions & 0 deletions src/documentation.qmd
@@ -0,0 +1,30 @@
# Documentation

## Documentation

- Not all information can be conveyed in code
- We need to tell other people how to use our projects
- And sometimes ourselves!
- Documentation covers anything outside of the code/metadata

## README

- Markdown file at the project root
- Should contain:
- Description of project
- Dependencies
- Instructions on building/running

## Comments

- Comments in code are another form of documentation
- Comments should:
- Explain *why* the code is doing something
- Give context that is external to the scope

## Generating Docs

- Use tools that generate docs from source code
- Single source of truth
- Comments/Docstrings embedded in code
- Reduce separation between code and docs
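
As a toy sketch of the idea (real projects would use an established generator such as Sphinx or Doxygen), the snippet below builds a Markdown entry from a function's signature and docstring. Both `rescale` and `render_markdown_doc` are names invented for this example.

```python
import inspect


def rescale(values, factor=2.0):
    """Multiply every value by ``factor`` and return a new list."""
    return [v * factor for v in values]


def render_markdown_doc(func):
    """Build a small Markdown entry from a function's signature and docstring."""
    signature = inspect.signature(func)
    docstring = inspect.getdoc(func) or "(undocumented)"
    return f"### `{func.__name__}{signature}`\n\n{docstring}\n"


if __name__ == "__main__":
    print(render_markdown_doc(rescale))
```

Because the text comes from the docstring, updating the code and its documentation happens in one place.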
24 changes: 24 additions & 0 deletions src/fair_principles.qmd
@@ -0,0 +1,24 @@
# FAIR Principles

---

- Findable: Software, and its metadata, are easy for humans and machines to
find.

---

- Accessible: Software, and its metadata, are retrievable via standardised
protocols.

---

- Interoperable: Software interoperates with other software by exchanging
data and/or metadata, and/or through interaction via application
programming interfaces (APIs), described through standards.

---

- Reusable: Software is both usable (can be executed) and reusable (can be
understood, modified, built upon, or incorporated into other software).

See: https://www.nature.com/articles/s41597-022-01710-x
29 changes: 29 additions & 0 deletions src/introduction.qmd
@@ -0,0 +1,29 @@
## What is reproducibility?

For this course we will take the following definition:

- *Reproducible*:
Performing the same analysis on the same data produces the same results

## Why is reproducibility important?

In the context of scientific computing/analysis, we want to be able to:

- Verify our own results
- Verify the results of others

By making our work reproducible, we ensure that both of these are not just
possible, but straightforward.

## Additional benefits

- Safely implement changes
- Can perform workflow on different inputs more easily
- Simpler for new team members to get started
- Better collaboration

## Where do we go from here...

Throughout the rest of this session, we will walk through the steps that we can
take to go from an ad hoc collection of scripts into a reproducible scientific
workflow!
7 changes: 7 additions & 0 deletions src/introduction_walkthrough.qmd
@@ -0,0 +1,7 @@
## A likely scenario

- You have just joined a new research group as a Student/Researcher/PI.
- The group use a custom pipeline/setup to perform their data analysis/simulations.
- You try to get the setup working on your local system/a new HPC system and...
*It doesn't work!*

57 changes: 57 additions & 0 deletions src/slides.qmd
@@ -0,0 +1,57 @@
---
title: Reproducibility in Scientific Computing

format:
  revealjs:
    theme: night
    logo: https://iccs.cam.ac.uk/sites/default/files/iccs_ucam_combined_reverse_colour.png

authors:
  - name: Jack Franklin
  - name: Marion Weinzierl
---

{{< include introduction.qmd >}}

{{< include version_control.qmd >}}

{{< include dependencies.qmd >}}

{{< include testing.qmd >}}

{{< include documentation.qmd >}}

{{< include fair_principles.qmd >}}

# Conclusion/Outlook

## Reproducibility is important

Primary benefits:

- Confidence in scientific results
- Peer review/cross analysis

Additional benefits:

- Allows for code reuse
- Better collaboration

## Ingredients for reproducibility:

- Version Control
- Dependency Metadata
- Public Accessibility

## Even better if

- Testing for:
- Verification
- Regression checks

## Make it easy!

- When starting from scratch, much easier to implement these as you go
- For a large project:
- Add to VC
- Document dependencies
- Follow best practice for new code
- Implement small improvements whenever modifying
45 changes: 45 additions & 0 deletions src/testing.qmd
@@ -0,0 +1,45 @@
# Testing

## Testing

- Important to test code
- Check that code does what it should
- Test on inputs outside of the "normal" range
- Verify that results of code do not change unexpectedly
- Can also be used to check dependency changes

## Unit tests

- Test the smallest logical unit of the code
- Ensure each component works as intended
- Test functions for known results
- Compare to previously produced results
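
A minimal sketch of unit tests for one small function, written as plain `test_` functions so that a runner such as pytest can discover them. The function `celsius_to_kelvin` is a made-up example, not code from the course.

```python
import math


def celsius_to_kelvin(celsius):
    """Convert a temperature from degrees Celsius to kelvin."""
    if celsius < -273.15:
        raise ValueError("temperature below absolute zero")
    return celsius + 273.15


def test_known_result():
    # Known result: water freezes at 0 degrees Celsius, which is 273.15 K.
    assert math.isclose(celsius_to_kelvin(0.0), 273.15)


def test_input_outside_normal_range():
    # Unphysical inputs should be rejected, not silently converted.
    try:
        celsius_to_kelvin(-300.0)
    except ValueError:
        return
    raise AssertionError("expected ValueError below absolute zero")
```

Each test checks one behaviour of one unit, so a failure points directly at the component that broke.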

## Integration tests

- Test that components work together
- Try to have a range of complexity of tests
- Can use previous results to validate model
- Ensure no regression of results

## Adding tests to a project

- Often we inherit large projects with no unit tests
- How do we improve test coverage in this case?

## Adding tests to a project

1. Create integration tests - use previous results or create "golden outputs"
2. Identify and extract parts of the code which can be split apart
3. Create unit tests for the new functions
4. Run the integration tests to ensure results have not changed
5. Repeat 2-4 until all code has unit tests

- Whenever you change a part of the code, try to use this method
- Code coverage will slowly improve, with less extra work
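
Step 1 can be sketched as follows, assuming the pipeline output can be reduced to a dictionary of numbers. `run_pipeline` is a stand-in for the real workflow, and the JSON layout and tolerance are choices made for this example only.

```python
import json
import math
from pathlib import Path


def run_pipeline(samples):
    """Stand-in for the full analysis: returns summary statistics."""
    return {"count": len(samples), "mean": sum(samples) / len(samples)}


def check_against_golden(result, golden_path, update=False):
    """Compare a result with the stored golden output.

    The golden file is (re)written when it is missing or update=True.
    """
    golden_path = Path(golden_path)
    if update or not golden_path.exists():
        golden_path.write_text(json.dumps(result, indent=2, sort_keys=True))
        return True
    golden = json.loads(golden_path.read_text())
    if golden.keys() != result.keys():
        return False
    # Use a tolerance so harmless floating-point noise does not fail the test.
    return all(
        math.isclose(golden[key], result[key], rel_tol=1e-9) for key in golden
    )
```

The golden file belongs in version control, so any change to it is visible and deliberate.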

## Automating tests (CI etc)

- Automate testing to ensure tests pass for every commit
- Also useful for tests that can take a long time/need lots of resources
- If hosting code on e.g. GitHub, GitLab etc, can use Continuous Integration (CI)
51 changes: 51 additions & 0 deletions src/version_control.qmd
@@ -0,0 +1,51 @@
# Version Control

## Version Control

- The first thing we should do is move our project into version control (VC)
- This way we never lose the original state of the project
- We can then try things without worrying about breaking anything!
- This will also benefit any later development, so the sooner the better

## What to add to VC

- DON'T do this:
``` bash
git add .
```

- Our repository should only contain:
- Code/scripts
- Documentation
- Metadata
- i.e. just text files

There will be some exceptions to this rule, but for the vast majority of cases
it will be true.

## What to add to VC

- Large datafiles should be hosted separately (e.g. on Zenodo)
- External dependencies should be declared
- e.g. link to Zenodo dataset in docs and code
- Use .gitignore to automatically ignore any unwanted files
- e.g. build outputs

## Aside - testing with worktrees

- git worktrees are like "local clones" of a repository
- Create a worktree:
``` bash
git worktree add -b <new-branch-name> <path>
```
- Will make a new directory, with only files that are tracked
- Can use as a cleanroom to ensure all dependencies are there
- For more info: `git worktree add --help`

## What to do next?

- The repository can then also be hosted on a remote service (e.g. GitHub, GitLab, Codeberg, Bitbucket)
- This will make collaboration with other people a lot easier!
- It will also mean that any work done can be accessed by collaborators

