Skip to content
Exploration
Authors
On this page
Published on

Address lookup - Part 1 - Analysis

Exploration
Authors

Executive summary: In Belgium, a collaborative effort between regional and federal authorities has led to the creation of a unified service that aggregates all Belgian addresses into a comprehensive dataset. This dataset is called BeStAddress.


Address lookup - Analysis (1)

Introduction

This article outlines an experiment conducted by the Data Office of the Ministère de la Fédération Wallonie-Bruxelles aimed at leveraging existing address datasets more effectively.

The primary objective is to optimize the process of looking up addresses by postal codes, street names, and house numbers within these datasets, and eventually matching them with the BeStAddress dataset.

In Belgium, a collaborative initiative led by BOSA, in partnership with regional and federal authorities, resulted in the creation of a unified service that consolidates all Belgian addresses into a single, comprehensive dataset known as BeStAddress.

Quick overview of BeStAddress

At first glance, the dataset appears to be quite comprehensive, providing a wealth of information on addresses in Belgium. Key data points include:

  • Municipality names in French, Dutch, and German
  • Postnames in French, Dutch, and German: In Belgium, postnames are used to identify specific zones within a municipality or across parts of one or several municipalities (further details on this will be provided later).
  • Street names in French, Dutch, and German (often available in only one language depending on the region)
  • Postcodes
  • House numbers
  • Box numbers

Example of a single record (as JSON):

json
{
  "status": "current",
  "streetname_fr": "Rue de la Loi",
  "streetname_de": null,
  "streetname_nl": "Wetstraat",
  "house_number": "16",
  "box_number": null,
  "postcode": "1000",
  "municipality_name_fr": "Bruxelles",
  "municipality_name_de": null,
  "municipality_name_nl": "Brussel",
  "postname_fr": "Bruxelles (Centre)",
  "postname_nl": "Brussel (Centrum)",
  "region_code": "BE-BRU",
  "municipality_id": 21004,
  "street_id": 1679,
  "address_id": 210225,
  "EPSG:4326_lat": 50.84616,
  "EPSG:4326_lon": 4.36653,
  "EPSG:31370_x": 148303.0,
  "EPSG:31370_y": 169017.0
}

Summarize the datasets

Based on the export from BOSA website of the dataset in flat format (gzipped CSV) of the 2024-09-23, the dataset contains 6 608 518 records of status = current.

Here are summaries of each column in each dataset:

As we can see, the dataset contains a lot of fields. It is also multilingual, which seems obivious for an official dataset in Belgium (where French, Dutch and German are the official languages). It also contains some specific identifiers such as:

  • municipality_id (unique identifier for each municipality is equal to the NIS codes)
  • street_id
  • address_id

Checking uniqueness of the identifiers

The first step is to check the uniqueness of the identifiers in the dataset. We will check if the street_id, municipality_id, and address_id are unique.

Let's write a query that will count per region the number of unique street_id, municipality_id, and address_id and the total number of records. If the sum of the unique counts is equal to the total number of records, it means that the identifiers are unique in the dataset.

sql
SELECT 'BE-BRU' AS region_code,
        COUNT(DISTINCT municipality_id) AS municipality_id_count,
        COUNT(DISTINCT street_id) AS street_id_count,
        COUNT(DISTINCT address_id) AS address_id_count
FROM bestAddress_BRU
WHERE status = 'current'
UNION ALL
SELECT 'BE-WAL' AS region_code,
        COUNT(DISTINCT municipality_id) AS municipality_id_count,
        COUNT(DISTINCT street_id) AS street_id_count,
        COUNT(DISTINCT address_id) AS address_id_count
FROM bestAddress_WAL
WHERE status = 'current'
UNION ALL
SELECT 'BE-VLG' AS region_code,
        COUNT(DISTINCT municipality_id) AS municipality_id_count,
        COUNT(DISTINCT street_id) AS street_id_count,
        COUNT(DISTINCT address_id) AS address_id_count
FROM bestAddress_VLG
WHERE status = 'current'
UNION ALL
SELECT  'TOTAL' AS region_code,
        COUNT(DISTINCT municipality_id) AS municipality_id_count,
        COUNT(DISTINCT street_id) AS street_id_count,
        COUNT(DISTINCT address_id) AS address_id_count
FROM bestAddress;

Results

region_codemunicipality_id_countstreet_id_countaddress_id_count
BE-BRU195095790713
BE-WAL262577071983442
BE-VLG300803543834363
TOTAL5811384574203561
SUM = TOTALok!nopenope
  • The municipality_id is unique in the dataset, which is mean that a municipality is unique in each region and is not shared between regions. This is expected as the municipality_id is equal to the NIS codes. The municipalities have a political and administrative meaning and are unique in each region.
  • The street_id shows that a street can be shared between multiple municipalities. This is not surprising as some streets can cross multiple municipalities.
  • The address_id is not unique throughout the three datasets, but it is probably unique in each dataset.

Let's check the uniqueness of the address_id in each dataset.

sql
SELECT 'BE-BRU' AS region_code,
        COUNT(DISTINCT address_id) AS address_id_count,
        COUNT(*) AS total_row_count
FROM bestAddress_BRU
WHERE status = 'current'
UNION ALL
SELECT 'BE-WAL' AS region_code,
        COUNT(DISTINCT address_id) AS address_id_count,
        COUNT(*) AS total_row_count
FROM bestAddress_WAL
WHERE status = 'current'
UNION ALL
SELECT 'BE-VLG' AS region_code,
        COUNT(DISTINCT address_id) AS address_id_count,
        COUNT(*) AS total_row_count
FROM bestAddress_VLG
WHERE status = 'current'
UNION ALL
SELECT 'TOTAL' AS region_code,
        COUNT(DISTINCT address_id) AS address_id_count,
        COUNT(*) AS total_row_count
FROM bestAddress;
region_codeaddress_id_counttotal_row_count
BE-BRU790713790713
BE-WAL19834421983442
BE-VLG38343633834363
TOTAL42035616608518

Indeed the address_id is unique in each dataset but is not unique throughout the three datasets.

Understanding the data structure - Deep dive

At first glance, we could think from the variables that a postcode include 1 or more municipalities and each municipality include 1 or more streets.

The model could simply be something like this:

best address wrong model
best address wrong model

Let's check the data to see if this is true.

In some cases, a street can be shared between 2 municipalities. This is the case for example of the street Rue de la Loi which is in postcodes ['1000', '1012', '1040','1045', '1049'] for status_code = 'BE-BRU'.

sql
SELECT
DISTINCT(postcode)
FROM bestAddress_bxl
WHERE 1=1
AND street_id = 1679
AND status = 'current'

On the other hand, a municipality_id can belong to multiple postcodes.

sql
SELECT
DISTINCT(postcode)
FROM bestAddress_bxl
WHERE 1 = 1
AND municipality_id = 21004 -- Bruxelles
AND status = 'current'

This will return 14 distincts postcodes: ['1047', '1010', '1045', '1030', '1070', '1000', '1020', '1012', '1130', '1040', '1050', '1120', '1007', '1049'].

In reverse, a postcode can have multiple municipality_id values.

sql
SELECT
DISTINCT(municipality_id)
FROM bestAddress_bxl
WHERE 1=1
AND postcode = '1050' -- Ixelles
AND status = 'current'

This will return 3 distincts municipality_id: ['21009', '21013', '21004'].

Finally, a street with the same name, could have multiple street_id and municipality_id even within the same postcode. This is the case for example of the street Avenue Louise.

sql
SELECT
municipality_id,
street_id,
postcode
FROM bestAddress_BRU
WHERE 1 = 1
AND streetname_fr = 'Avenue Louise'
AND status = 'current'
GROUP BY all;
municipality_idstreet_idpostcode
2100418581050
2100939191050
210137511060

Global overview

To better understand the relationships between the different entities in the dataset, we can perform some checks to see if there are any many-to-many relationships between the entities.

Let's check the data
Let's check the data

1. Checking if a postcode belongs to multiple municipality_id values

To check if a postcode is associated with multiple municipality_id values, you can group by the postcode and count distinct municipality_id values. If a postcode has more than 1 unique municipality_id, it means that the postcode is associated with multiple municipalities.

sql
SELECT postcode,
COUNT(DISTINCT municipality_id) AS municipality_count
FROM bestAddress
GROUP BY postcode
HAVING COUNT(DISTINCT municipality_id) > 1;
postcodemunicipality_count
49102
10503
10302
10702
10402

This query will return all postcodes that are associated with more than one municipality_id. We only get 5 postcodes that are associated with multiple municipalities.

2. Checking if a municipality_id belongs to multiple postcode values

Similarly, to check if a municipality_id is associated with multiple postcode values, group by municipality_id and count distinct postcode values. If a municipality_id has more than 1 unique postcode, it means that the municipality_id is associated with multiple postcodes.

sql
SELECT municipality_id,
COUNT(DISTINCT postcode) AS postcode_count
FROM bestAddress
GROUP BY municipality_id
HAVING COUNT(DISTINCT postcode) > 1;
municipality_idpostcode_count
730422
1100214
......

This query will return all municipality_id that are associated with more than one postcode. We have a 221 municipality_id that are associated with multiple postcodes. Only the first ones are shown here.

The relationship between postcodes and municipalities is then a many-to-many relationship.

3. Checking if a street_id belongs to multiple municipality_id values

To check if a street is associated with multiple municipalities, you can group by the street_id and count distinct municipality_id values. If a street has more than 1 unique municipality_id, it means that the street is associated with multiple municipalities.

sql
SELECT concat_ws('-',region_code, street_id) AS unique_street,
COUNT(DISTINCT municipality_id) AS municipality_count
FROM bestAddress
GROUP BY concat_ws('-',region_code, street_id) AS unique_street
HAVING COUNT(DISTINCT municipality_id) > 1;

This query will return all street_ids in the same region that are associated with more than one municipality_id. We have 15 street_id that are associated with more than one municipality_id.

4. Checking if a street_id can have multiple postname_* values

sql
SELECT
concat_ws('-',municipality_id, street_id) AS unique_street,
COUNT(DISTINCT concat_ws(' / ', postname_fr, postname_nl)) AS postname_count
FROM bestAddress
GROUP BY all
HAVING COUNT(DISTINCT default_postname) > 1;

This query will return all unique streets that are associated with more than one postname_* values. We have 3104 unique streets that are associated with more than one postname_* values.

Example of a street with multiple postnames:

sql
SELECT
DISTINCT(default_postname),
COUNT(address_id) AS address_count
FROM bestAddress
WHERE municipality_id = 25117 AND street_id = 7705267
GROUP BY all;
default_postnameaddress_count
Chastre32
Villeroux1

We conclude that a street can be associated with multiple postnames.

Takeaways from the analysis

Let's check the data
Let's check the data
  • The postcode can be associated with multiple municipality_id values and a municipality_id can be associated with multiple postcode values. The relationship between postcodes and municipalities is then a many-to-many relationship.
  • The municipality_id is unique in the dataset, which is mean that a municipality is unique in each region and is not shared between regions.
  • The address_id is unique in each dataset but is not unique throughout the three datasets. A same address_id can be found in multiple regions.
  • The street_id shows that a street can be shared between multiple municipalities and that a municipality can have multiple streets. The relationship between streets and municipalities is then a many-to-many relationship.
  • Postname are not unique for a street. A single street can have multiple postnames. We will consider the postname_* as an attribute of the address and not as a key.

Next steps

In the next article, we will explore build a Big Table model of the dataset to use it in a more efficient way for address lookup.

Next steps: Address lookup - Part 2 - Sourcing

References