The Data Science Behind IP Address Matching

IP addresses provide a universal identifier across all internet-connected devices. From a web browser to a mobile phone or a smart TV, every connected device has an IP address, and the web server typically records this address automatically every time it receives a request from the device.

By matching IP addresses seen across different services at different times, we can build an identity graph that links behavior on mobile devices to behavior on connected TVs and web browsers.

The detailed implementation of any such device graph is complicated because IP addresses frequently change (even in the US, over 20% of homes shift IP address at least once a month), people use many devices in multiple different locations, and in many places – such as homes, offices, and Starbucks – numerous people share the same IP address at the same time.

To build a sound match between individuals and IP addresses, we use a combination of deterministic and probabilistic matching to create a household device graph to which we can then match individual events.

Deterministic matches start with IP addresses that are fixed or are associated with only a single device. If, for example, a connected smart TV sits on a static IP address, and we only ever see a single mobile device on that address, then, provided we have a large enough dataset, we can be confident that this mobile device belongs to the home of that smart TV. Any viewing on the TV can then be associated with the owner of that mobile device, particularly if we see the mobile phone on the same IP address at the same time that the TV is switched on.
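
As a minimal sketch of this deterministic step, the snippet below keeps only IP addresses that carry exactly one TV and exactly one mobile device, and requires the mobile device to have been seen a minimum number of times. The observation format, the device IDs, the function name, and the `min_sightings` threshold are all assumptions made for illustration; a production system would also need timestamps and a check that the IP address is actually static.

```python
from collections import defaultdict

# Hypothetical observation log: (ip_address, device_id, device_type).
observations = [
    ("203.0.113.7", "tv-1", "tv"),
    ("203.0.113.7", "phone-a", "mobile"),
    ("203.0.113.7", "phone-a", "mobile"),
    ("203.0.113.7", "tv-1", "tv"),
    ("198.51.100.9", "tv-2", "tv"),
    ("198.51.100.9", "phone-b", "mobile"),
    ("198.51.100.9", "phone-c", "mobile"),
]


def deterministic_matches(observations, min_sightings=2):
    """Return {mobile_id: tv_id} for IPs with exactly one TV and one mobile device."""
    # ip -> (device_id, device_type) -> number of sightings
    sightings = defaultdict(lambda: defaultdict(int))
    for ip, device_id, device_type in observations:
        sightings[ip][(device_id, device_type)] += 1

    matches = {}
    for ip, counts in sightings.items():
        tvs = [d for (d, t) in counts if t == "tv"]
        mobiles = [d for (d, t) in counts if t == "mobile"]
        # Only an unambiguous one-TV, one-mobile address counts as a match,
        # and only if the mobile device was seen often enough to trust it.
        if len(tvs) == 1 and len(mobiles) == 1:
            if counts[(mobiles[0], "mobile")] >= min_sightings:
                matches[mobiles[0]] = tvs[0]
    return matches


matches = deterministic_matches(observations)
```

With the sample data, `phone-a` is matched to `tv-1`, while the second address is skipped because two mobile devices share it.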

Such examples are unfortunately few and far between, and it is much more common for mobile phones and web browsers to appear on multiple different IP addresses at different times. In the absence of an anchor device like a connected TV, which we know is likely to be physically stationary, we can only build up the device graph of the home by looking for sets of devices commonly observed on the same IP addresses at the same time. In data science terms, this is a “market basket” problem: just as Amazon shows “people who bought this also bought this,” we are looking for “devices that were on this IP address were also seen with these other devices.” We approach this problem of mining association rules using the Apriori algorithm, which gives us a probabilistically combined set of devices that are frequently seen together on the same IP address.
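
To make the market-basket framing concrete, here is a small, self-contained sketch of Apriori in plain Python. It treats each basket as the set of devices seen on one IP address in one time window, which is an assumption about how the data would be bucketed; the device IDs, the `min_support` value, and the function names are hypothetical, and a real pipeline would more likely use an optimized library implementation.

```python
from itertools import combinations


def apriori(baskets, min_support):
    """Return {frozenset(devices): support_count} for all frequent device sets."""
    baskets = [frozenset(b) for b in baskets]

    # Level 1: count how many baskets each single device appears in.
    counts = {}
    for basket in baskets:
        for device in basket:
            key = frozenset([device])
            counts[key] = counts.get(key, 0) + 1
    current = {s: c for s, c in counts.items() if c >= min_support}
    frequent = dict(current)

    k = 2
    while current:
        candidates = set()
        # Join step: merge frequent (k-1)-sets into size-k candidates.
        for a, b in combinations(list(current), 2):
            union = a | b
            if len(union) != k:
                continue
            # Prune step: every (k-1)-subset of a candidate must be frequent.
            if all(frozenset(sub) in current for sub in combinations(union, k - 1)):
                candidates.add(union)
        # Count support for the surviving candidates.
        counts = {c: sum(1 for b in baskets if c <= b) for c in candidates}
        current = {s: c for s, c in counts.items() if c >= min_support}
        frequent.update(current)
        k += 1
    return frequent


def confidence(frequent, antecedent, consequent):
    """Confidence of the rule antecedent -> consequent."""
    a = frozenset(antecedent)
    return frequent[a | frozenset(consequent)] / frequent[a]


# Hypothetical baskets: devices seen together on one IP in one time window.
baskets = [
    {"phone-a", "laptop-a", "tv-1"},
    {"phone-a", "laptop-a"},
    {"phone-a", "tv-1"},
    {"phone-b", "laptop-a"},
    {"phone-a", "laptop-a", "tv-1"},
]
frequent = apriori(baskets, min_support=3)
tv_implies_phone = confidence(frequent, {"tv-1"}, {"phone-a"})
```

In this toy data, `phone-a` co-occurs with both `laptop-a` and `tv-1` in three baskets each, so both pairs are frequent, while `laptop-a` and `tv-1` appear together only twice and are dropped. The confidence value then expresses how often seeing one device implies the other is on the same address.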

This set of rules subsequently needs to be cleaned to remove IP addresses with too many different devices on them (most likely offices, VPN access points, military bases, or a local Starbucks), and we need to assign each device to an individual household. People who frequently spend time on their partner’s or parents’ WiFi will show up in multiple locations, and these duplicates need to be removed together with other noise that would otherwise clog up the dataset.
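
The cleaning step might look something like the sketch below, which drops crowded IP addresses and then assigns each device to the single address where it is seen most often. Using one IP address as a stand-in for a household is a simplification (addresses change over time), and the `max_devices_per_ip` cutoff, the sighting format, and the device IDs are assumptions chosen for illustration.

```python
from collections import Counter, defaultdict


def assign_households(sightings, max_devices_per_ip=3):
    """Map each device to the one non-crowded IP where it is seen most often."""
    # Collect the distinct devices observed on each IP address.
    devices_by_ip = defaultdict(set)
    for ip, device in sightings:
        devices_by_ip[ip].add(device)

    # IPs with too many devices are likely offices, VPNs, or cafes.
    crowded = {
        ip for ip, devices in devices_by_ip.items()
        if len(devices) > max_devices_per_ip
    }

    # Count sightings per device on the remaining IPs.
    counts = defaultdict(Counter)
    for ip, device in sightings:
        if ip not in crowded:
            counts[device][ip] += 1

    # Keep only the most frequent IP per device; ties go to the IP seen first.
    return {device: c.most_common(1)[0][0] for device, c in counts.items()}


# Hypothetical sightings: (ip_address, device_id).
sightings = (
    [("203.0.113.7", "phone-a")] * 3      # phone-a at home
    + [("203.0.113.7", "tv-1")] * 2
    + [("198.51.100.9", "phone-b")] * 2   # phone-b at home
    + [("198.51.100.9", "phone-a")]       # phone-a visiting a partner
    + [("192.0.2.50", d) for d in ("phone-a", "phone-b", "d1", "d2", "d3")]
)
households = assign_households(sightings)
```

Here the address with five devices is discarded as a shared location, and `phone-a` is assigned to its home address rather than the one it only visits. Devices seen only on crowded addresses are left unassigned.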

At the end of this, we have a clean device graph that maps individual devices to only a single home, and we can then tie individual events to households across devices. If we have enough information, we can go further and infer the activities of specific individuals within the home as well.

The quality of the model will depend not only on the quality of the data science put into it but also on the breadth and depth of the datasets. If we have data for many millions of devices and multiple different device types, then we’ll have a very rich dataset from which to create a device graph. If we’re limited only to mobile devices, browser cookies, or smart TVs, then we’ll have fewer options. As is frequently the case, diversity of data sources is the key to creating an enriched dataset, and the quality of the IP graph improves substantially with each dataset we add.