A minimal reverse proxy using PHP & cURL

A reverse proxy acts as an intermediary between a client and one or more servers. Requests sent by the client are received by the proxy and passed on to one of the servers in the background. There are many scenarios in which such a setup might be useful. For example, reverse proxies can be used to transparently distribute the load from incoming requests to several servers.

There are many different ways in which a reverse proxy could be implemented. Specialized web servers like Nginx are obviously a good choice, but circumstances might put constraints on the choice of tools. Luckily, implementing a reverse proxy is possible in just about any programming language. After all, all that is required is the ability to receive HTTP requests and pass them on in a slightly modified form. It turns out that even PHP can do it! 😉

<?php

// Define getallheaders() in case that it doesn't already exist (e.g. Nginx, PHP-FPM, FastCGI)
// Taken from https://www.php.net/manual/en/function.getallheaders.php#84262
if (!function_exists('getallheaders')) { 
    function getallheaders() { 
       $headers = array (); 
       foreach ($_SERVER as $name => $value) { 
           if (substr($name, 0, 5) == 'HTTP_') { 
               $headers[str_replace(' ', '-', ucwords(strtolower(str_replace('_', ' ', substr($name, 5)))))] = $value; 
           } 
       } 
       return $headers; 
    } 
} 

function reformat($headers) {
    foreach ($headers as $name => $value) {
        yield "$name: $value";
    }
}

// Configuration parameters
$proxied_url = 'https://www.example.com';
$proxied_host = parse_url($proxied_url)['host'];

$ch = curl_init();
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);

// HTTP messages consist of a request line such as 'GET https://example.com/asdf HTTP/1.1'…
curl_setopt($ch, CURLOPT_CUSTOMREQUEST, $_SERVER['REQUEST_METHOD']);
curl_setopt($ch, CURLOPT_URL, $proxied_url . $_SERVER['REQUEST_URI']);

// … a set of header fields…
$request_headers = getallheaders();
$request_headers['Host'] = $proxied_host;
$request_headers['X-Forwarded-Host'] = $_SERVER['SERVER_NAME'];
$request_headers = iterator_to_array(reformat($request_headers));
curl_setopt($ch, CURLOPT_HTTPHEADER, $request_headers);

// … and a message body.
$request_body = file_get_contents('php://input');
curl_setopt($ch, CURLOPT_POSTFIELDS, $request_body);

// Retrieve response headers in the same request as the body
// Taken from https://stackoverflow.com/a/41135574/3144403
$response_headers = [];
curl_setopt($ch, CURLOPT_HEADERFUNCTION,
    function($curl, $header) use (&$response_headers) {
        $len = strlen($header);
        $header = explode(':', $header, 2);
        if (count($header) < 2) // ignore invalid headers
          return $len;

        $response_headers[strtolower(trim($header[0]))][] = trim($header[1]);

        return $len;
    }
);

$response_body = curl_exec($ch);
$response_code = curl_getinfo($ch, CURLINFO_RESPONSE_CODE);
curl_close($ch);

// Set the appropriate response status code & headers
http_response_code($response_code);
foreach($response_headers as $name => $values)
    foreach($values as $value)
        header("$name: $value", false);

echo $response_body;

Whitelisting files in Serverless

By default, the Serverless framework will deploy every file in the project directory. In most cases, this is not what you want. serverless.yml offers the option to control which files will be deployed to your cloud provider by using the exclude / include parameters.

Due to the way these parameters are implemented, it is not possible to create a whitelist of files by only specifying include. In addition to the whitelist, you first have to exclude everything for the include parameter to actually have an effect:

package:
  exclude:
    - ./**
  include:
    - main.py
    - (…)

Busting some caches with Django

In contrast to server-side code, client-side assets such as JavaScript files and static images are not directly deployed to where they are ultimately executed or displayed (i.e. the user’s browser). Rather, they are downloaded on demand whenever a browser retrieves a page for the first time.

For the sake of efficiency, the browser saves the retrieved assets locally for future reference. This results in a more fluid user experience as certain actions such as reloading a page won't require downloading the very same files again.

As a consequence, changes to static assets are not guaranteed to be reflected on the client side without delay. After all, the browser might still be accessing previously cached versions of those assets. This can lead to an array of problems, ranging from mild display errors caused by an outdated CSS file to significant problems in functionality. For example, imagine some cached JavaScript code attempting to retrieve data from an API endpoint that doesn’t exist anymore on the server side.

To avoid these kinds of issues, the developer needs to ensure that a cache busting strategy is in place. Cache busting forces the browser to download a fresh copy of some static asset if it has changed since the last page visit. But how can we force the browser to do that?

Well, browsers decide whether or not to (re-)fetch a resource based on its file name. If the browser already knows a file name and if it hasn’t been too long since the last download, the browser will re-use the cached version. Therefore, we can trigger the download of a new asset version by giving it a name the browser has not yet encountered.

Django’s approach to naming different versions of an asset is to insert part of its MD5 hash into its name. For instance, if we have a CSS file called home.css, Django’s ManifestStaticFilesStorage will rename it to something like home.789f58f23e78.css when running Django’s built-in ./manage.py collectstatic command.

In addition to renaming the files, the ManifestStaticFilesStorage will generate a file called staticfiles.json. This file contains a mapping from the original file names to the hash-based names. Its purpose is to make accessing the hash-based asset versions more efficient: the MD5 hash of a static file doesn't have to be re-computed every time it is referenced with the {% static %} template tag.

While working with the ManifestStaticFilesStorage, two things turned out not to fit my workflow. Firstly, running ./manage.py collectstatic frequently resulted in an error stating that an asset declared deep inside some vendor CSS (managed with npm) couldn’t be found. I am sure there are cases where this error would be useful information, but in my case it was more annoying than valuable.

Secondly, after running ./manage.py collectstatic the STATIC_ROOT directory would not only contain the renamed static files but also the original ones. To fix these two issues, I made some modifications to the ManifestStaticFilesStorage. Feel free to use the code for your setup too!

import os

from django.conf import settings
from django.contrib.staticfiles.storage import ManifestStaticFilesStorage


class CustomManifestStaticFilesStorage(ManifestStaticFilesStorage):
    def hashed_name(self, name, content=None, filename=None):
        try:
            return super().hashed_name(name, content, filename)
        except ValueError:
            return name

    def save_manifest(self):
        super().save_manifest()
        for path in self.hashed_files:
            os.remove(os.path.join(settings.STATIC_ROOT, path))
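
To actually use the customized storage class, point Django's static files storage setting at it. A minimal sketch, assuming the class lives in a (hypothetical) myproject/storage.py module; on Django 4.2 and later the equivalent STORAGES setting is used instead of STATICFILES_STORAGE:

# settings.py
STATICFILES_STORAGE = 'myproject.storage.CustomManifestStaticFilesStorage'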

Prevent Gutenberg from breaking words in table blocks

For some reason, the Gutenberg editor applies the following CSS rule to table block cells:

.wp-block-table td, .wp-block-table th {
    word-break: break-all;
}

With this rule in place, a cell's text content is wrapped mid-word, which is probably not the behavior you'd expect from your table cells (at least it wasn't what I was expecting).

To fix the issue, add the following rule either to your theme’s CSS file or use the “Add custom CSS” functionality offered by the WordPress site customizer:

.wp-block-table td, .wp-block-table th {
    word-break: normal;
}

You could also define a utility CSS class to apply the rule on a per-block basis:

.word-break-normal td, .word-break-normal th {
    word-break: normal;
}

He or She? Or: The basics of (binary) classifier evaluation

Of all the amazing scientific discoveries of the 20th century, the most astonishing has to be that “men are from Mars, [and] women are from Venus”. (Not that the differences weren’t obvious pre-20th century, but it’s always good to have something in writing.)

If indeed the genders do originate from different planets, then surely the ways in which they use language must be very different as well. In fact, the differences should be so glaringly obvious that even a computer should be able to tell them apart, right?

So we’re building an author gender classifier…

In natural language processing, there is a task called author profiling. One of its subtasks, author gender identification, deals with detecting the gender of a text’s author. Please note that for the sake of didactic simplicity (and not an old-fashioned view of gender identity), I’ll confine myself to the two traditional genders.

In supervised machine learning, a classifier is a function that takes some object or element and assigns it to one of a set of pre-defined classes. As it turns out, the task of author gender identification is a nice example of a classification problem. More specifically, we are dealing with binary classification since we assume only two possible classes.

By default, these classes are labelled as positive (aka “yes”) and negative (aka “no”). Needless to say, it is perfectly fine to adapt the naming of the two possible outcomes. In our case, female and male (aka “not female”) seem like plausible choices.

It all starts with the data

We are about to train supervised classifiers and so we first need to obtain a good amount of training data. Understandably, I wasn’t too excited about manually collecting thousands of training examples. Therefore, I went ahead and wrote a Scrapy spider to automatically collect articles from nytimes.com on a per-author basis.

If you are interested in the spider code, you’re welcome to check it out. Our industrious spider managed to collect the titles and summaries of more than 210000 articles as well as their authors’ genders. All in all, there were about 2.5 times more male articles than female ones. This is a great real-world example of a problem known as class imbalance or data imbalance.

Meet the stars of the show

With the data kindly collected by the NewYorkTimesSpider, we'll train two supervised classifiers and compare their performance. For this purpose, we'll make use of scikit-learn, one of the most popular Python frameworks for machine learning. We'll be training two different classification models: Naive Bayes (NB) and Gradient Boosting (GB).

NB is a classic and historically quite successful model in all kinds of real-world domains including text analysis & classification. The GB model is a more recent development that has achieved considerable success on problems posed on kaggle.com.

This article will not delve into the algorithmic details of these two models. Rather, we’ll assume a black box view and focus on their evaluation. The same goes for the topic of feature extraction. For instructional purposes, we’ll go with a very basic feature set based on the tried-and-tested bag-of-words representation. scikit-learn comes with an efficient implementation which spares us having to reinvent the wheel.
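Since we'll treat the models as black boxes anyway, a minimal sketch of how they could be wired up in scikit-learn is all we need here. This is not the notebook's exact code, and the tiny stand-in corpus is just for illustration:

from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Stand-in corpus; in the real setup X holds the scraped titles/summaries
# and y the corresponding author genders
X = ["Senate passes sweeping tax bill", "A weekend guide to Brooklyn galleries"]
y = ["male", "female"]

# Bag-of-words features feeding a Naive Bayes and a Gradient Boosting model
nb_model = make_pipeline(CountVectorizer(), MultinomialNB())
gb_model = make_pipeline(CountVectorizer(), GradientBoostingClassifier())

nb_model.fit(X, y)
print(nb_model.predict(["A new exhibition opens downtown"]))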

Evaluation metrics 101

Unfortunately, no classifier is perfect and so each decision (positive vs. negative or female vs. male) can either be true (correct) or false (incorrect). This leaves us with a total of 2*2 = 4 boxes we can put each classifier decision (aka prediction) into:

Predicted \ Actual | Positive             | Negative
Positive           | True positive (TP)   | False positive (FP)
Negative           | False negative (FN)  | True negative (TN)

As presented in the table, true positives are positive examples correctly classified as positive. On the other hand, false negatives are positive examples misclassified as negative. The same relationship holds between true negatives and false positives. In the area of machine learning, a 2-by-2 table such as the above is commonly referred to as a confusion matrix.

A confusion matrix can serve as the basis for calculating a number of metrics. A metric is a method of reducing the confusion matrix to a single (scalar) value. This reduction is very important because it gives us one value to focus on when improving our classifiers. If we didn’t have this one value, we could endlessly argue back and forth about whether this or that confusion matrix represents a better result.

The below table summarizes some of the most fundamental & widely used metrics for classifier evaluation. Note that although all of them result in values between 0 and 1, I will describe them in terms of percentages for the sake of intuition. Also, some metrics have different names in different fields and contexts; the most common alternative names are given in parentheses.

Metric | Formula | Description / Intuition
Accuracy | \frac{TP + TN}{TP + TN + FP + FN} | What percentage of all elements were predicted correctly? How good is the classifier at finding both positive & negative elements?
True positive rate (aka recall, sensitivity) | \frac{TP}{TP + FN} | What percentage of positive elements were predicted correctly? How good is the classifier at finding positive elements?
False positive rate | \frac{FP}{FP + TN} | What percentage of negative elements were incorrectly predicted as positive? How bad is the classifier at recognizing negative elements?
True negative rate (aka specificity) | \frac{TN}{TN + FP} | What percentage of negative elements were predicted correctly? How good is the classifier at finding negative elements?
False negative rate | \frac{FN}{FN + TP} | What percentage of positive elements were incorrectly predicted as negative? How bad is the classifier at missing positive elements?
Precision (aka positive predictive value) | \frac{TP}{TP + FP} | What percentage of elements predicted as positive were actually positive?
F1 score | \frac{2*Precision*Recall}{Precision + Recall} | Harmonic mean of precision and recall (i.e. both are weighted equally). How good is the classifier in terms of both precision & recall?
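To make the formulas concrete, here is the same arithmetic as plain Python, starting from the four confusion matrix counts (the numbers are made up for illustration):

tp, fp, fn, tn = 40, 10, 5, 45

accuracy  = (tp + tn) / (tp + tn + fp + fn)   # 0.85
recall    = tp / (tp + fn)                    # true positive rate, ~0.89
precision = tp / (tp + fp)                    # 0.80
f1        = 2 * precision * recall / (precision + recall)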

Now that we have a basic understanding of the fundamental metrics for evaluating classifiers, it's time to put the theory into practice (i.e. write some code). Luckily for us, scikit-learn comes with many pre-implemented metrics. In addition to the metrics, scikit-learn also provides us with a number of pre-implemented cross-validation schemes.

One of the primary motivations for cross-validating your classifiers is to reduce the variance between multiple runs of the same evaluation setup. This holds especially true for situations where only a limited amount of data is available in the first place. In such cases, splitting your data into multiple datasets (a training and a test dataset) will reduce the number of training samples even further.

Oftentimes, this reduction will lead to significant performance differences between two or more evaluation runs, caused by the particular random choices of training and test sets. After partitioning the dataset and running the evaluation multiple times, we can average the results and thereby arrive at a more reliable overall evaluation result.
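A sketch of what such a cross-validated evaluation looks like in scikit-learn, assuming nb_model is the pipeline from above and that X and y now hold the full set of article texts and author genders:

from sklearn.model_selection import cross_validate

# Five-fold cross-validation, computing two of the metrics discussed above per fold
scores = cross_validate(nb_model, X, y, cv=5, scoring=['accuracy', 'f1_macro'])
print(scores['test_accuracy'].mean(), scores['test_f1_macro'].mean())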

The importance of a baseline

The evaluation code is available as a Jupyter notebook. Besides a data loading function and the two classifiers to be tested, the notebook also contains the definition of a baseline for our evaluation (HeOrSheBaselineClassifier). A baseline is a simple classifier that gives us a point of reference against which to compare our actual models.

In many cases, choosing a baseline is a quite straightforward process. For example, in our domain of newspaper articles, about 71.5% of articles were written by men. Therefore, it makes sense to define a baseline classifier that unconditionally predicts an article to have a male author. If a classifier can’t deliver a better performance than this super simple baseline classifier, then obviously it can’t be any good.

To summarize, a baseline provides us with a performance minimum that we should be able to exceed in any case. scikit-learn accelerates the development of baseline classifiers by providing the DummyClassifier class that the HeOrSheBaselineClassifier inherits from.
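For illustration, a majority-class baseline in the spirit of the HeOrSheBaselineClassifier could look like the sketch below (the notebook's actual implementation may differ slightly); X and y are again assumed to hold the article texts and genders:

from sklearn.dummy import DummyClassifier

# Always predicts the majority class ("male" in our dataset)
baseline = DummyClassifier(strategy='most_frequent')
baseline.fit(X, y)
print(baseline.score(X, y))  # accuracy equals the majority share, i.e. about 0.715 on our data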

Finally, results

If we take a look at the Jupyter evaluation notebook, we can see that both classifiers significantly outperform our baseline in every metric. Though overall the GB classifier offers better performance, the NB model features a better precision score.

Obviously, the classifiers presented in the course of this post are only the tip of the iceberg. But even though we haven’t performed any optimization, the results are already significantly better than the expected minimum performance (i.e. the baseline). What this means is that there is a statistical difference in how often each gender uses specific words since word counts were the only features employed by the presented models.

The results of the above evaluation might serve as the basis for another post on where to go from here. Further resources on how to improve upon the existing performance can be found in the academic literature (e.g. Author gender identification from text).

Avoiding query code duplication in Django with custom model managers

For a second, please imagine we are building the next big social network. We decide that our “revolutionary” new app shall allow its users to create profiles. Besides a mandatory user name and avatar, we consider a profile complete only if the user also supplies an email or physical address (or both). In other words, a profile is not complete as long as we can’t contact the user in some way (via their email or physical address).

Based on the above requirements, our lead developer comes up with the following models to represent our use case:

from django.db import models
 
 
class User(models.Model):
    name = models.CharField(max_length=50)
    avatar = models.ImageField()
    email_address = models.EmailField(blank=True, null=True)
    address = models.ForeignKey('Address', blank=True, null=True, on_delete=models.SET_NULL)
 
 
class Address(models.Model):
    street = models.CharField(max_length=50)
    city = models.CharField(max_length=50)
    country = models.CharField(max_length=50)

Next, suppose that for some reason we would like to distinguish users with complete profiles from those with incomplete ones. Our developer comes up with the following query to make things happen:

from django.db.models import Q
 
User.objects.filter(Q(email_address__isnull=False) | Q(address__isnull=False))

It’s not hard to imagine that this query might be relevant in several situations. For example, we might need it for displaying a list of complete user profiles but we might also need it for filtering the users that can be contacted. Of course we could just go ahead and copy the query to multiple places, but in the spirit of DRY it makes a lot of sense not to do that.

Luckily, Django offers a built-in alternative to copying query code. Thanks to the concept of custom model managers, we can define a query once and use it over and over again in different places.

class CustomUserManager(models.Manager):
    def with_complete_profiles(self):
        return self.get_queryset().filter(Q(email_address__isnull=False) | Q(address__isnull=False))
 
 
class User(models.Model):
    name = models.CharField(max_length=50)
    avatar = models.ImageField()
    email_address = models.EmailField(blank=True, null=True)
    address = models.ForeignKey('Address', blank=True, null=True, on_delete=models.SET_NULL)
 
    objects = CustomUserManager()

In the previous code example, we are effectively overriding Django’s default manager for the User model by redefining the objects attribute. From now on, we can readably and cleanly retrieve all users with complete profiles by calling the with_complete_profiles() method on the manager:

User.objects.with_complete_profiles()
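
And because the method returns a regular QuerySet, it composes with further filtering just like any other query (the name filter below is only an example):

User.objects.with_complete_profiles().filter(name__startswith='A')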

Neat!

Hosting multiple sites within a single Django project

Imagine we are running a successful online store for cat food (let’s call it catfood247.com). Since things are going so well, we would like to expand our business with a second store for dog food, dogfood247.com. Does this mean we’ll have to set up a separate server even though the two stores will be very similar and share a lot of code? Having more servers means higher maintenance & running costs which are obviously things that, if possible, we would like to avoid.

Luckily, Django’s built-in “sites” framework enables us to run two or more websites within a single Django installation. Consider the following project layout:

.
└── petfood
    ├── petfood
    │   ├── settings.py
    ├── catfood
    │   ├── urls.py
    │   ├── views.py
    │   ├── …
    ├── dogfood
    │   ├── urls.py
    │   ├── views.py
    │   ├── …
    ├── manage.py
    └── …

We have three Django apps: petfood, the main app holding the global settings.py file every Django project needs to have; and catfood as well as dogfood, the two apps representing the actual sites to be served.

In addition to the app directories, we need to create two instances of Django’s Site model. The Site model is a simple Django model meant to logically represent a website by its domain & user-defined name. In our case, the following two Site objects are needed:

Site(name='catfood', domain='catfood247.com')
Site(name='dogfood', domain='dogfood247.com')
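
One way of creating these two records is a data migration or a quick session in ./manage.py shell, sketched below:

from django.contrib.sites.models import Site

Site.objects.get_or_create(name='catfood', domain='catfood247.com')
Site.objects.get_or_create(name='dogfood', domain='dogfood247.com')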

In other words, each site is represented by a Django app directory (including its own URLconf) and a corresponding Site object in Django's database.

The only thing left to do is create a mechanism for determining the right URLconf on a per-request basis. Django’s documentation gives us a valuable hint at how to achieve what we want:

When a user requests a page from your Django-powered site, this is the algorithm the system follows to determine which Python code to execute:

1. Django determines the root URLconf module to use. Ordinarily, this is the value of the ROOT_URLCONF setting, but if the incoming HttpRequest object has a urlconf attribute (set by middleware), its value will be used in place of the ROOT_URLCONF setting.
2. (…)

How Django processes a request

In other words, Django makes it possible to set a URLconf for each individual request, thereby allowing us to differentiate between two or more Sites and their respective URLconfs. Let's go ahead and define a new middleware that sets the request.urlconf attribute based on the requested site's name (e.g. catfood):

class SetURLConfMiddleware:
    def __init__(self, get_response):
        self.get_response = get_response
 
    def __call__(self, request):
        request.urlconf = f"{request.site.name}.urls"
        response = self.get_response(request)
        return response

Don’t forget to add this middleware to the MIDDLEWARE list in your settings.py in order to activate it. Also, make sure to insert it after Django’s CurrentSiteMiddleware so that it has access to the request.site attribute.
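The relevant excerpt of settings.py might then look as follows; the dotted path to SetURLConfMiddleware is an assumption, so adjust it to wherever you place the class:

MIDDLEWARE = [
    # … Django's default middleware …
    'django.contrib.sites.middleware.CurrentSiteMiddleware',
    'petfood.middleware.SetURLConfMiddleware',
]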

And that’s all there is to it! From now on, whenever Django receives a request, it first determines the site the request should be forwarded to based on the request’s domain. This simple method makes it possible to support an arbitrary number of sites within a single Django setup.

Note for Heroku users: Don’t forget to register each of your domains with your Heroku app!

Spring-cleaning your (Arch) Linux system

Disclaimer: Some operations mentioned in this post are potentially destructive and irreversible. Be sure to back up all your important data before proceeding.

Note: This post is written from the point of view of an Arch Linux user. Most steps presented below should nevertheless translate well to other distributions.

Through the usual course of their operation, operating systems (even Arch Linux) tend to slowly accumulate obsolete data. In most cases, this is not a problem. However, if you are like me, it gives you a nice and warm feeling to have a clean system. Apart from that, keeping your file system clean will also help you save some disk space and reduce the duration of system upgrades. More importantly, it will soon make you an expert on your operating system.

pacreport is a utility that lists (possibly) obsolete packages and files on your system. You can get it by installing the pacutils package. The following magic command will run pacreport, reformat its output and pipe the result into several files for easier post-editing:

sudo pacreport --unowned-files | head -n -2 | awk '$1 ~ /^[A-Z]/ {print $0} $1 !~ /^[A-Z]/ {print $1}' | csplit -szf pacreport - /:$/ {*}

The command should leave you with five files named pacreport0*. These files will help us in removing the following categories of obsolete data:

  1. Obsolete packages
  2. Obsolete system-level files
  3. Obsolete user files

Uninstall obsolete packages

Packages become obsolete for at least two reasons. For one, you might simply not need them anymore (unneeded packages). And secondly, they might have been installed as dependencies for other packages that are long gone (orphaned packages).

The pacreport command has generated three package lists for us: pacreport01, pacreport02 and pacreport03. Each of these lists potentially contains unneeded or orphaned packages. Now it’s your turn to go through these lists and leave only those packages you would like to remove. If you are unsure about some package, use the pacman -Qi some-package command to get more information. In case you would like to keep a package listed in pacreport02, remove it from the file and mark it as explicitly installed:

sudo pacman -D --asexplicit some-package

Once you are done, remove the files’ header lines and run the following command:

sudo pacman -Rscn $(cat pacreport01 pacreport02 pacreport03)

After double-checking the output, confirm the removal operation to finally remove the listed packages. In my case, more than 400 packages were removed in this way.

Remove obsolete system-level files

Many administrative processes store data all across the filesystem. For example, pacman stores downloaded packages in /var/cache/pacman/pkg/ but does not remove them automatically. In case of problems after an upgrade, this practice allows downgrading a package without the need to re-download an older version. On the other hand, this directory can grow very large in size if not pruned periodically.

The paccache script that comes with the pacman-contrib package deletes all cached package versions except for the three most recent:

sudo paccache -r

To additionally remove all cached versions of packages that are no longer installed, execute the pacman -Sc command. Taken together, these two steps leave you with only the three most recent cached versions of each currently installed package and no cached versions of uninstalled packages.

But it is not only pacman spreading files across the filesystem. Any process with sufficient permissions can create files wherever it likes. As these files are not part of the original distribution package, they are not automatically removed when uninstalling a package.

Luckily, the above pacreport command has also generated a list of unowned files. Open pacreport00 and go through the list, deleting from it any paths you would like to keep (everything still listed will be removed in the next step). /etc/pacreport.conf allows you to track unowned-but-needed files and their associations (run man pacreport for an example). When pacreport --unowned-files is used, the files referenced in /etc/pacreport.conf will be omitted. Finally, remove the files left in pacreport00 with the below command:

sudo rm -r $(cat pacreport00)

Remove obsolete user files

Removing personal files in /home is often the most labour-intensive step as it can’t be automated easily. This is because any process a user executes can create any file within that user’s home directory. Fortunately, the process of manually removing obsolete files from your /home directory isn’t as tedious as it might sound at first.

My method of choice is performing a manual depth-first traversal of my /home directory tree, evaluating the files & directories I encounter and, if appropriate, removing them. Pay special attention to the following directories:

  • ~/.config/ – default directory for application configuration files
  • ~/.cache/ – default user cache directory
  • ~/.local/share/ – default directory for application data files

Validating constraints across multiple form fields in Django

After all fields of a Django form have been validated in isolation, the form’s clean() method is called to conclude the validation process. This method is meant to house validation logic that is not associated with one field in particular.

For example, let’s suppose we have an application where our users can order gourmet-level cat & dog food. For some awkward legal reason, though, the amount of cat & dog food items taken together cannot exceed 50 items per order.

Clearly, this requirement cannot be expressed in relation to only one field. Rather, two values have to be taken into account together during form validation. The below code sample illustrates a solution to our example use case:

from django import forms
 
 
class PetFoodForm(forms.Form):
    cat_cans = forms.IntegerField(initial=0, min_value=0)
    dog_cans = forms.IntegerField(initial=0, min_value=0)
 
    def clean(self):
        cleaned_data = super().clean()
        cat_cans = cleaned_data.get("cat_cans")
        dog_cans = cleaned_data.get("dog_cans")
 
        if cat_cans is not None and dog_cans is not None and cat_cans + dog_cans > 50:
            raise forms.ValidationError("The number of selected items exceeds 50.")
 
 
form1 = PetFoodForm({
    'dog_cans': '15',
    'cat_cans': '15'
})
assert form1.is_valid()
 
form2 = PetFoodForm({
    'dog_cans': '30',
    'cat_cans': '30'
})
assert not form2.is_valid()
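
As a side note, if you would rather attach the error to a specific field instead of raising a non-field error, the same check can call Django's add_error() method inside clean(). A minimal variant of the clean() method above:

    def clean(self):
        cleaned_data = super().clean()
        cat_cans = cleaned_data.get("cat_cans")
        dog_cans = cleaned_data.get("dog_cans")

        if cat_cans is not None and dog_cans is not None and cat_cans + dog_cans > 50:
            # Attach the error to the cat_cans field rather than the form as a whole
            self.add_error("cat_cans", "The number of selected items exceeds 50.")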

Where to define and instantiate associated models in Django

For the sake of example, let’s consider two model classes, User and UserProfile, with a one-to-one association between them. That is, each instance of class User is associated with one and only one instance of UserProfile and vice versa. This type of association is frequently encountered when modelling all sorts of real-world domains.

In Django, this kind of relationship between two entities is expressed as a OneToOneField defined within some Model class and pointing to another (or the same, in the case of a reflexive relationship). This raises the question of where to put the OneToOneField: Should it be an attribute of User or UserProfile?

You might be doubting the relevance of this question as, thanks to the related_name mechanism (aka reverse references), we can later traverse the association in either direction. This is certainly true, but there are other aspects of the association’s semantics that should be taken into account.

Existential dependence

One of these considerations is whether one object’s existence depends on the existence of the other. In the context of our example, does a UserProfile depend on the existence of a User? In most cases that would arguably hold true. The opposite is less clear-cut: depending on the application, you could argue that a User can exist without having a UserProfile.

In this case, it would make sense to define the reference as an attribute of the UserProfile class. This way, you can express that whenever we delete a User instance, the associated UserProfile will be deleted as well. This would result in the following two Django model definitions:

class User(models.Model):
    pass


class UserProfile(models.Model):
    user = models.OneToOneField('User', on_delete=models.CASCADE)

If we defined the association within the User model class, the result would be semantically different:

class User(models.Model):
    profile = models.OneToOneField('UserProfile', on_delete=models.CASCADE)


class UserProfile(models.Model):
    pass

In the latter code example, the existence of a User instance depends on the existence of its associated UserProfile. Per se, neither piece of code is in any aspect “wronger” than the other. To know right from wrong, we would simply need to know more about the modeled domain.

Order of instantiation

Another factor to consider when deciding on how to associate models is the order of instantiation. In neither of the above model definitions is it possible to create both a User and a UserProfile instance at the same time. It is therefore necessary to decide which object to instantiate first.

As an example, let’s consider two possible scenarios of a user registration system. In the first scenario, a user completes the registration process and can later on fill out an optional UserProfile. In the second system, potential users are first asked to complete a profile as part of an application process. Only after they have been approved will an actual User instance be created.

Where to instantiate the associated objects

Once one has decided where to put the associating attribute, it’s time to think about where to actually create the model instances. In the spirit of our example, we would like to create one and only one UserProfile whenever a new User has successfully registered. At first glance, multiple places look like promising candidates for this functionality.

The __init__(…) magic method

Arguably the most obvious candidate is Python’s __init__(…) constructor method. After all, we would like to create a UserProfile whenever a new User is added or, in other words, initialized. However, this logic disregards the distinction between what happens in the database and what happens on the Python level.

__init__(…) is a Python construct and will be executed whenever a model instance is created in Python. This happens when we create a new User object for the first time, but it also happens whenever we retrieve an already existing instance from the database. In other words, if we were to instantiate a user’s UserProfile within the User model’s __init__(…) method, before long we would end up creating more than one profile instance per user!

The __new__(…) magic method

Another of Python’s magic methods, __new__(…), poses the same problem as __init__(…), as does Django’s save(…) method. The latter is called not only when creating a User object, but also when updating it. This behavior is not what we are looking for.

Signals to the rescue

Luckily, there is another mechanism that we can leverage to achieve what we want. Django features a variety of built-in signals which are a way for a piece of code to get notified when actions occur elsewhere in the framework. Beyond the built-in signals, a developer can easily define custom ones for application-specific events.

post_save is one of Django’s built-in signals. As its name strongly suggests, it is fired whenever an object has been saved to the application database. Though it’s not exactly what we need, we can still go ahead and base a new custom signal on the existing one:

# In myapp/signals.py

from django.db.models.signals import post_save
from django.dispatch import receiver, Signal


post_create = Signal()


@receiver(post_save)
def send_post_create(sender, instance, created, **kwargs):
    if created:
        post_create.send(sender=sender, instance=instance)

In Django, all signals are instances of the django.dispatch.Signal class. All you have to do to define a new signal is to name and instantiate a Signal instance. To actually fire the signal, we hook into post_save. Upon receiving an instance of this built-in signal and checking the created flag, we simply send a post_create signal. From now on we can react to the instantiation of a new object simply by defining another @receiver function with post_create as its signal.
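
For completeness, here is a sketch of such a receiver that reacts to the new signal by creating the associated profile; the myapp module paths are assumptions, so adjust them to your project layout:

from django.dispatch import receiver

from myapp.models import User, UserProfile
from myapp.signals import post_create


@receiver(post_create, sender=User)
def create_user_profile(sender, instance, **kwargs):
    # Runs exactly once per newly created User
    UserProfile.objects.create(user=instance)

Remember that the signals module has to be imported somewhere at startup (commonly in the app's AppConfig.ready() method) so that the receivers are actually registered.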