Coding 101: Best Practices for Writing Software
Build Your Development Suite
To code effectively, you need an appropriate development setup. We recommend at least the following:
- An appropriate development environment
- An environment manager
- Git, installed system-wide
- An appropriate linter
- An appropriate code formatter
You can find abbreviated recommendations for some of these tools in our
Computer Setup page.
Using an environment manager allows you to effectively control the versions of different software you use. This is critical to avoid issues like dependency conflicts on your computer and to fix software versions for experimental consistency.
Miniconda,
Mamba, and
Micromamba are all good options for this.
Git is a standard tool for version control and uploading your code to another location, such as
GitHub, so that you can collaborate with others. When setting up a Git repository, also set up Git hooks to run a code formatter. These formatters are language-specific and automatically adjust the code to fit a common, easy-to-read standard. They may additionally catch smaller style issues like dead code or unused statements. We recommend
Ruff for Python and
Styler for R. Set this up through Git so that every collaborator runs the same formatter.
Depending on what programming language you’re working with, the other programs you install will vary. For multi-language environments,
VSCode is among the most widely used IDEs. You might also consider language-specific environments like
RStudio or
PyCharm, though there may be subscription costs depending on which environment you choose.
If you select a general-purpose development environment like
VSCode or
Zed, you will probably need to set up a linter.
Ruff is a good Python option for this as well, while R has
Lintr.
Use Unix
The further you get in developing, the more pain points you will experience on Windows. Some of this pain can be alleviated by using
Windows Subsystem for Linux (WSL), though it will require some additional setup to manage files between the native Windows operating system and the Linux system running on top of it. WSL will also require additional configuration with your IDE of choice. A more streamlined approach would be to switch to a Unix-like operating system entirely, such as macOS or a Linux distribution.
Choose an appropriate language for the task
Python is a general-purpose interpreted language, while R is better suited to statistical analysis. If you are relatively new to programming, either is an acceptable place to start learning, so long as it suits the task. Certain tools are only available for certain scripting languages, so it’s best to check the dependencies you plan on using before starting the project. While statistical analysis in Python is less robust, especially without resorting to third-party packages, its versatility allows you to more easily create standalone applications, including command line interface (CLI) tools and even web interfaces.
There are limits to the strength of interpreted languages like R and Python. As computational workloads become more demanding (matrix multiplication, genome-scale analyses, multiprocessing), it may be more appropriate to use a compiled language like C++ or Rust. Start small, as these tend to be more difficult for beginners to learn than R or Python. In many applications, these compiled languages offer far better performance. It is also usually possible to wrap code written in these lower-level languages so that it runs as a package in a higher-level interpreted language.
Write Readable, Well-Documented Code
One function, One Job
With the notable exception of ‘runs a pipeline’, functions should have one job. Blocking your code out into functions with defined purposes creates code with improved readability and portability. As an example, let’s say you were writing a program that calculates the area of shapes and records that information in a database.
import sqlite3

# Function that does both calculation and database insertion
def calculate_and_add_area_to_database(width, height, db_connection):
    area = width * height
    cursor = db_connection.cursor()
    cursor.execute("INSERT INTO areas (area) VALUES (?)", (area,))
    db_connection.commit()

# Usage
width = 5
height = 10
# Assuming `db_connection` is a valid SQLite connection
db_connection = sqlite3.connect('areas.db')
calculate_and_add_area_to_database(width, height, db_connection)
db_connection.close()
While this might be okay for a short script, it is problematic in a larger tool. For each shape you calculate, you need to separately define how to upload data to a database. Instead, it is better to define the database upload in a single place that gets called for each shape.
import sqlite3

# Function to perform calculations
def calculate_area_of_rectangle(width, height):
    return width * height

# Function to add result to a database
def add_area_to_database(area, db_connection):
    cursor = db_connection.cursor()
    cursor.execute("INSERT INTO areas (area) VALUES (?)", (area,))
    db_connection.commit()

# Usage
width = 5
height = 10
area = calculate_area_of_rectangle(width, height)
# Assuming `db_connection` is a valid SQLite connection
db_connection = sqlite3.connect('areas.db')
add_area_to_database(area, db_connection)
db_connection.close()
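To illustrate why a single upload function pays off, here is a minimal sketch that adds a second shape. The circle function, the in-memory database, and the table schema are assumptions made for this example. Each shape gets its own calculation, but both results pass through the same database function.

```python
import math
import sqlite3


def calculate_area_of_rectangle(width, height):
    """Return the area of a rectangle."""
    return width * height


def calculate_area_of_circle(radius):
    """Return the area of a circle (a hypothetical second shape)."""
    return math.pi * radius**2


def add_area_to_database(area, db_connection):
    """Insert one area into the `areas` table; the only code that touches the database."""
    db_connection.execute("INSERT INTO areas (area) VALUES (?)", (area,))
    db_connection.commit()


# An in-memory database keeps the sketch self-contained; the schema is assumed.
db_connection = sqlite3.connect(":memory:")
db_connection.execute("CREATE TABLE areas (area REAL)")

# Every shape reuses the same upload function.
for area in (calculate_area_of_rectangle(5, 10), calculate_area_of_circle(2)):
    add_area_to_database(area, db_connection)

stored_areas = [row[0] for row in db_connection.execute("SELECT area FROM areas ORDER BY rowid")]
print(stored_areas)
db_connection.close()
```

If the database layout later changes, only `add_area_to_database` needs to be edited, no matter how many shapes the tool supports.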
Use Descriptive Variables
Variables, methods, and attributes should have descriptive names. While some shorthands are common (such as i or j for an integer that increments across a loop in Python), others such as “x”, “y”, “thing”, or “placeholder” can be cryptic. Remember to also follow the formatting conventions for your language.
Take the following two blocks of code as an example:
# Original vector
x <- c(5, 10, 15, 20, 25)
# Calculate mean
m <- mean(x)
# Calculate standard deviation
s <- sd(x)
# Normalize the vector
y <- (x - m) / s
# Print normalized vector
print(y)
Versus the descriptive version:
# Original vector of sample values
sample_values <- c(5, 10, 15, 20, 25)
# Calculate the mean of the sample values
mean_sample_values <- mean(sample_values)
# Calculate the standard deviation of the sample values
std_dev_sample_values <- sd(sample_values)
# Normalize the sample values
normalized_sample_values <- (sample_values - mean_sample_values) / std_dev_sample_values
# Print normalized sample values
print(normalized_sample_values)
Now, imagine you are a developer working on this software and have been tasked with adding some downstream functionality. It is much simpler to quickly identify the normalized values in the second block because they are explicitly named. In fact, a programmer adding on to the first block of code might miss 'y' completely, as it is not at all obvious that 'y' contains a vector.
Good Code Doesn't Need Comments
Readable code does not need to have comments, but it should still have them. If you feel the need to comment your code, first consider rewriting it so that it is easily interpretable. Then add comments in plain language to reinforce what the code should be doing and to assist third parties, who are sometimes novices.
Similar to the prior example, code that is readable is easier to work with. Try to decipher what this JavaScript code is trying to accomplish:
// Performs some magic
function f(a) {
  let s = 0;
  for (let i = 0; i < a.length; i++) {
    s += a[i] * a[i];
  }
  return s;
}

let arr = [1, 2, 3, 4];
let result = f(arr);
console.log(result);
Although somewhat excessive, the next version makes the functionality plain by pairing well-written code with comments in plain English.
// Function to calculate the sum of squares of an array of numbers
function calculateSumOfSquares(numbers) {
  // Initialize a variable to hold the sum of squares
  let sumOfSquares = 0;
  // Iterate over each number in the array
  for (let i = 0; i < numbers.length; i++) {
    // Add the square of the current number to the sum of squares
    sumOfSquares += numbers[i] * numbers[i];
  }
  // Return the final sum of squares
  return sumOfSquares;
}

// Define an array of numbers
let arrayOfNumbers = [1, 2, 3, 4];
// Call the function and store the result
let sumOfSquaresResult = calculateSumOfSquares(arrayOfNumbers);
// Print the result to the console
console.log(sumOfSquaresResult);
While this may seem similar to following good naming conventions (and it is), it becomes more important as individual lines get longer or more complex, or as more advanced language features are used.
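As a hedged illustration of a denser line, the Python sketch below combines a regular expression with a nested comprehension. The function name, the pattern name, and the date format are assumptions made for this example. Even with descriptive names, the one-line return statement is much easier to follow with a plain-language comment above it.

```python
import re

# Matches dates written as YYYY-MM-DD and captures year, month, and day separately.
ISO_DATE_PATTERN = re.compile(r"(\d{4})-(\d{2})-(\d{2})")


def extract_iso_dates(text: str) -> list[tuple[int, ...]]:
    """Return every YYYY-MM-DD date found in `text` as a (year, month, day) tuple of integers."""
    # `findall` yields one tuple of captured strings per match;
    # the inner expression converts each captured string to an integer.
    return [tuple(int(part) for part in match) for match in ISO_DATE_PATTERN.findall(text)]


print(extract_iso_dates("Sampled on 2024-03-15 and again on 2024-04-01."))
```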
Functions, Especially Externally Facing, Should be Self Documenting
Many languages share the concept of ‘docstrings’: comment-like content at the beginning of a function that describes how the function operates. Not only does this improve readability, it can also be leveraged to automatically generate online documentation and save you work. If you’re programming in Python, you should also add type hints here.
A Python example is provided below. The exact format varies between languages, but each language usually has a standardized structure.
from typing import Optional

def example_function(pos_arg1: int, pos_arg2: str, kwarg1: Optional[float] = None, kwarg2: bool = True, *args, **kwargs) -> dict:
    """
    Example function to demonstrate a comprehensive docstring.

    Parameters:
        pos_arg1 (int): The first positional argument, an integer.
        pos_arg2 (str): The second positional argument, a string.
        kwarg1 (float, optional): An optional keyword argument, a float. Default is None.
        kwarg2 (bool, optional): Another optional keyword argument, a boolean. Default is True.
        *args: Additional positional arguments.
        **kwargs: Additional keyword arguments.

    Returns:
        dict: A dictionary containing all the input arguments.
    """
    result = {
        'pos_arg1': pos_arg1,
        'pos_arg2': pos_arg2,
        'kwarg1': kwarg1,
        'kwarg2': kwarg2,
        'args': args,
        'kwargs': kwargs
    }
    return result

# Example usage
result = example_function(10, 'hello', 3.14, extra_arg1='value1', extra_arg2='value2')
print(result)
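Docstrings and type hints can feed automatic documentation because Python stores both on the function object itself. The sketch below, which uses a small hypothetical function chosen for this example, reads them back with the standard library. This is roughly the information that `help()` and documentation generators work from.

```python
import inspect
from typing import get_type_hints


def calculate_area_of_rectangle(width: float, height: float) -> float:
    """
    Calculate the area of a rectangle.

    Parameters:
        width (float): The width of the rectangle.
        height (float): The height of the rectangle.

    Returns:
        float: The area of the rectangle.
    """
    return width * height


# Read the documentation back from the function object.
docstring = inspect.getdoc(calculate_area_of_rectangle)
signature = inspect.signature(calculate_area_of_rectangle)
type_hints = get_type_hints(calculate_area_of_rectangle)

print(signature)                   # the parameters and return annotation
print(docstring.splitlines()[0])   # the one-line summary
print(type_hints)                  # a mapping of argument names to types
```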
Keep Code Shallow
Code often demands complex if-then statements and complicated loops that require a substantial amount of conditional logic to perform. The deeper into conditions you get, the harder it is for readers to track when such code would be running. This can lead to cases where code either runs where it shouldn't, or fails to run where it should.
#include <cstddef>
#include <iostream>
#include <vector>

int main() {
    // Define a 2D vector (matrix) with some positive and negative integers
    std::vector<std::vector<int>> matrix = {
        {1, -2, 3},
        {-4, 5, -6},
        {7, -8, 9}
    };
    // Variable to store the sum of positive elements
    int sum = 0;
    // Loop through each row of the matrix
    for (std::size_t i = 0; i < matrix.size(); ++i) {
        // Loop through each element of the row
        for (std::size_t j = 0; j < matrix[i].size(); ++j) {
            // Check if the element is positive
            if (matrix[i][j] > 0) {
                // Add the positive element to the sum
                sum += matrix[i][j];
            }
        }
    }
    // Print the sum of positive elements
    std::cout << "Sum of positive elements: " << sum << std::endl;
    return 0;
}
Where you can, pull as much logic as possible out to the beginning of a function so that the code complexity remains shallow. If that's not possible, remember the principle of one function, one job, and break the code out into separate functions.
#include <iostream>
#include <vector>

// Helper function to calculate the sum of positive elements in a row
int sumPositiveElements(const std::vector<int>& row) {
    // Variable to store the sum of positive elements in the row
    int sum = 0;
    // Loop through each element in the row
    for (int value : row) {
        // Check if the element is positive
        if (value > 0) {
            // Add the positive element to the sum
            sum += value;
        }
    }
    // Return the sum of positive elements in the row
    return sum;
}

int main() {
    // Define a 2D vector (matrix) with some positive and negative integers
    std::vector<std::vector<int>> matrix = {
        {1, -2, 3},
        {-4, 5, -6},
        {7, -8, 9}
    };
    // Variable to store the total sum of positive elements in the matrix
    int totalSum = 0;
    // Loop through each row of the matrix
    for (const auto& row : matrix) {
        // Use the helper function to add the sum of positive elements in the row to the total sum
        totalSum += sumPositiveElements(row);
    }
    // Print the sum of positive elements
    std::cout << "Sum of positive elements: " << totalSum << std::endl;
    return 0;
}
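The advice to pull logic to the beginning of a function is often applied as "guard clauses": check the exceptional cases first and return early, so the main logic is not wrapped in nested conditions. The following is a minimal Python sketch of that idea; the function and its behavior for empty input are assumptions made for this example.

```python
def mean_of_positive_values(values):
    """Return the mean of the positive numbers in `values`, or None if there are none."""
    # Guard clauses: deal with the exceptional cases first and leave early.
    if values is None:
        return None

    positive_values = [value for value in values if value > 0]
    if not positive_values:
        return None

    # The main logic now sits at a single level of indentation.
    return sum(positive_values) / len(positive_values)


print(mean_of_positive_values([1, -2, 3]))
print(mean_of_positive_values([-4, -6]))
```

Written with nested conditions instead, the final calculation would sit two levels deep, and a reader would have to track both conditions to know when it runs.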
Make your code distributable
If you are making a tool, make sure it is distributable. It is an all-too-common trap to develop a script as a tool for publication but stop short of turning that script into software. Unless you’re writing a script for yourself to handle a highly specific job, you probably want others to be able to take advantage of what you developed. An easy way to do that is to use your programming language’s distribution tools. Design your tool as a package from the start and avoid the work of redesigning it for use by others.
Take the following two software installation instructions:
- Go to the website for the dependency "Foo" and install their software by following their instructions
- Make sure your environment or computer has a Python version between 3.x.y and 3.x.z installed.
- Download the source code and unpackage the file in a folder
- Add the folder to your path
- Run the software by using the command "bar"
or
- Install with the command "conda install bar"
- Run the software by using the command "bar"
Which software tool would you try first?
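One way to design for packaging from the start, sketched here under assumptions, is to give the tool a single `main` function that parses its own command line. Python packaging tools can register a function like this as an installed command (a console-script entry point). The tool name `bar` comes from the example above, while the arguments and the calculation are hypothetical.

```python
import argparse


def build_parser():
    """Create the command line parser for the hypothetical `bar` tool."""
    parser = argparse.ArgumentParser(prog="bar", description="Calculate the area of a rectangle.")
    parser.add_argument("--width", type=float, required=True, help="Width of the rectangle.")
    parser.add_argument("--height", type=float, required=True, help="Height of the rectangle.")
    return parser


def main(argv=None):
    """Entry point for the tool; when `argv` is None, argparse reads the real command line."""
    arguments = build_parser().parse_args(argv)
    print(arguments.width * arguments.height)
    return 0


# Passing the arguments explicitly makes the entry point easy to try out and to test.
exit_code = main(["--width", "5", "--height", "10"])
```

Because all behavior hangs off `main`, the same code works whether it is imported, tested, or installed as a command.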
Take advantage of tools that give you performance
Where you can, employ multiprocessing to speed up your code. Modern laptops tend to have 4-8 cores that can work on tasks simultaneously, while desktops tend to have 6-16, and dedicated high-performance compute nodes can have over 100. More cores do not necessarily mean faster code, but they usually do when programs are designed to take advantage of them. Parallelism is more difficult to implement in Python than in R or lower-level languages.
If you encounter a situation where you must run a loop more than a handful of times, chances are you have better options. When working with large datasets, try to work directly with DataFrame structures, whether native to the language like R’s data.frame or provided by a package like Polars. These are often written in faster compiled languages. These tools let you instruct your computer to run operations across the whole dataset at once, rather than piecemeal, which gives substantial performance improvements. In terms of speed, DataFrame methods are generally much faster than mapping or applying a function, which in turn is faster than looping. Further, some of these tools are built to use multiprocessing.
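As a minimal sketch of process-based parallelism in Python, the example below uses the standard library's `concurrent.futures.ProcessPoolExecutor` to spread independent, CPU-heavy jobs across cores. The prime-counting workload and both function names are assumptions chosen only to give the workers something to do.

```python
from concurrent.futures import ProcessPoolExecutor


def count_primes_below(limit):
    """Count the primes smaller than `limit` by trial division (deliberately CPU-heavy)."""
    prime_count = 0
    for candidate in range(2, limit):
        # A candidate is prime if no divisor up to its square root divides it evenly.
        if all(candidate % divisor for divisor in range(2, int(candidate**0.5) + 1)):
            prime_count += 1
    return prime_count


def count_primes_in_parallel(limits, max_workers=None):
    """Run `count_primes_below` once per limit, each job in a separate worker process."""
    with ProcessPoolExecutor(max_workers=max_workers) as executor:
        # `map` returns the results in the same order as the inputs.
        return list(executor.map(count_primes_below, limits))
```

When running this as a script, call `count_primes_in_parallel` from inside an `if __name__ == "__main__":` block, since worker processes may re-import the module on some platforms. For small jobs, the cost of starting processes can outweigh the gain, which is one reason more cores do not always mean faster code.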
Avoid AI Code Completion (for now)
While AI assistants make writing code faster, it’s important as a novice developer to build a solid foundation of coding skills. Using AI assistants while you are new to programming creates two problems. First, the assistant takes the wheel, robbing you of the experience needed to code effectively. Second, you may not understand the code it writes, which turns your time from writing code into debugging somebody else’s. Instead of relying on AI code completion, leverage documentation and forums like Stack Exchange to solve problems. When you do come across code you don’t understand (assuming the code is already publicly available), language models might help explain what is happening. Proceed with caution, though, as they are prone to hallucination.