Skip to content
Published on

Address lookup - Part 2b: Messing around

Exploration
Authors

Executive summary: In Belgium, a collaborative effort between regional and federal authorities has led to the creation of a unified service that aggregates all Belgian addresses into a comprehensive dataset. This dataset is called BeStAddress.


Address lookup - Part 2b: Messing around

This post is a continuation of the previous post where we have analyzed the data model of the BeStAddress dataset.

Introduction

In Belgium, each municipality is responsible for naming its streets. Although it is recommended to avoid similar street names, this can sometimes result in two streets within the same municipality having very similar names. In this post, we will explore the BeStAddress dataset to identify pairs of streets with similar names that are geographically close to each other.

To analyse this, we will use the unified table that aggregates all the data from the different regions of Belgium to find pairs of streets with similar names that are close to each other.

We will proceed in three steps:

  1. Find the average latitude and longitude of each street in each municipality.
  2. Calculate the distance between each pair of streets in the same municipality.
  3. Calculate the Levenshtein distance between the street names of each pair of streets.

Step 1: Find the average latitude and longitude of street in each municipality

This step will allow us to identify the geographical center of each street in the dataset and then use this information to calculate the distance between pairs of streets.

sql
WITH averages_by_streets AS (
      SELECT municipality_id,
             street_id,
             round(AVG(latitude), 4) AS latitude,
             round(AVG(longitude), 4) AS longitude,
      FROM bestAddress
      GROUP BY municipality_id, street_id
  )
SELECT * FROM averages_by_streets

Step 2. Distance Calculation

We will calculate the distance between each pair of streets in the same municipality. We will use the Haversine formula to calculate the distance between two points on the Earth's surface given their latitude and longitude.

After calculating the distance between each pair of streets, we will filter the results to only include pairs of streets that are less than 5 kilometers apart.

sql
WITH averages_by_streets AS (
      SELECT municipality_id,
             street_id,
             round(AVG(latitude), 4) AS latitude,
             round(AVG(longitude), 4) AS longitude,
      FROM bestAddress
      WHERE postcode = '4140' -- Filter by municipality_id == Sprimont
      GROUP BY municipality_id, street_id
  ),
combination_streets AS (
    SELECT a.municipality_id AS municipality_id_1,
           a.street_id AS street_id_1,
           a.latitude AS lat1,
           a.longitude AS lon1,
           b.municipality_id AS municipality_id_2,
           b.street_id AS street_id_2,
           b.latitude AS lat2,
           b.longitude AS lon2
    FROM averages_by_streets a
    CROSS JOIN averages_by_streets b
    WHERE 1 = 1
      AND (a.municipality_id, a.street_id) <> (b.municipality_id, b.street_id) -- Exclude self-joins
      AND (a.municipality_id < b.municipality_id
           OR (a.municipality_id = b.municipality_id AND a.street_id < b.street_id)
        )  -- Ensure unique pairs
),
distance_calculation AS (
    SELECT municipality_id_1,
           street_id_1,
           municipality_id_2,
           street_id_2,
           -- Haversine Formula
           -- 6371 is the radius of the Earth in kilometers
           -- RADIANS converts degrees to radians
           6371 * 2 * ASIN(SQRT(POWER(SIN(RADIANS(lat2 - lat1) / 2), 2) +
                                COS(RADIANS(lat1)) * COS(RADIANS(lat2)) *
                                POWER(SIN(RADIANS(lon2 - lon1) / 2), 2))) AS distance_km
    FROM combination_streets
)
SELECT municipality_id_1,
       street_id_1,
       municipality_id_2,
       street_id_2,
       round(distance_km, 2) AS distance_km
FROM distance_calculation
WHERE distance_km < 5;

Step 3. Levenshtein Distance Calculation

Finally, we will calculate the Levenshtein distance between the street names of each pair of streets. The Levenshtein distance is a measure of the similarity between two strings. It is defined as the minimum number of single-character edits (insertions, deletions, or substitutions) required to change one string into the other.

sql
WITH averages_by_streets AS (
      SELECT municipality_id,
             street_id,
             round(AVG(latitude), 4) AS latitude,
             round(AVG(longitude), 4) AS longitude,
      FROM bestAddress
      WHERE postcode = '4140' -- Filter by municipality_id == Sprimont
      GROUP BY municipality_id, street_id
  ),
unique_streetname AS (
    SELECT municipality_id,
           street_id,
           default_streetname
    FROM bestAddress
    WHERE 1=1
    AND street_id IN (SELECT street_id FROM averages_by_streets)
    AND municipality_id IN (SELECT municipality_id FROM averages_by_streets)
    GROUP BY ALL
),
combination_streets AS (
    SELECT a.municipality_id AS municipality_id_1,
           a.street_id AS street_id_1,
           a.latitude AS lat1,
           a.longitude AS lon1,
           b.municipality_id AS municipality_id_2,
           b.street_id AS street_id_2,
           b.latitude AS lat2,
           b.longitude AS lon2
    FROM averages_by_streets a
    CROSS JOIN averages_by_streets b
    WHERE 1 = 1
      AND (a.municipality_id, a.street_id) <> (b.municipality_id, b.street_id) -- Exclude self-joins
      AND (a.municipality_id < b.municipality_id
           OR (a.municipality_id = b.municipality_id AND a.street_id < b.street_id)
        )  -- Ensure unique pairs
),
distance_calculation AS (
    SELECT municipality_id_1,
           street_id_1,
           municipality_id_2,
           street_id_2,
           -- Haversine Formula
           -- 6371 is the radius of the Earth in kilometers
           -- RADIANS converts degrees to radians
           6371 * 2 * ASIN(SQRT(POWER(SIN(RADIANS(lat2 - lat1) / 2), 2) +
                                COS(RADIANS(lat1)) * COS(RADIANS(lat2)) *
                                POWER(SIN(RADIANS(lon2 - lon1) / 2), 2))) AS distance_km
    FROM combination_streets
), combination_with_distances_filtered AS (
SELECT municipality_id_1,
       street_id_1,
       municipality_id_2,
       street_id_2,
       round(distance_km, 2) AS distance_km
FROM distance_calculation
WHERE distance_km < 5
),
levenstein_distance AS (
SELECT
    municipality_id_1,
    street_id_1,
    municipality_id_2,
    street_id_2,
    distance_km,
    a.default_streetname AS streetname_1,
    b.default_streetname AS streetname_2,
    levenshtein(a.default_streetname, b.default_streetname) AS levenshtein_distance
FROM combination_with_distances_filtered
LEFT JOIN unique_streetname AS a
   ON a.municipality_id = municipality_id_1
      AND a.street_id = street_id_1
LEFT JOIN unique_streetname AS b
   ON b.municipality_id = municipality_id_2
      AND b.street_id = street_id_2
)
SELECT * FROM levenstein_distance
WHERE 1=1
   AND levenshtein_distance < 10
   AND levenshtein_distance > 0
ORDER BY levenshtein_distance, distance_km;

That's it! We have successfully identified pairs of streets with similar names that are geographically close to each other. It is a long query, but it is very powerful and can be used to identify potential issues with street names in the BeStAddress dataset.

Results

municipality_id_1street_id_1municipality_id_2street_id_2distance_kmstreetname_1streetname_2levenshtein_distance
6210077329836210077329914.16Rue de DamréRue de Gomzé3
6210077329966210077330182.53Rue de la CarrièreRue de la Sablière3
6210077330626210077330670.63Rue des FawesRue des Marets3
6210077331106210077331173.48Rue du RouaRue du Voué3
6210077329816210077330474.02Rue de CoreuxRue de Theux3
6210077329816210077330421.13Rue de CoreuxRue de Rouvreux3
6210077329616210077331290.27Rue Basse LilléRue Haute Lillé3
6210077330926210077331084.55Rue du CoqRue du Pont3
6210077331056210077628620.33Rue du PahyRue du Parc2
6210077331246210077331324.37Rue FerreuseRue Houreuse3
6210077331236210077331243.9Rue FerrerRue Ferreuse3
6210077330816210077628622.6Rue du BaronRue du Parc3
6210077330836210077628621.66Rue du BatyRue du Parc3
6210077330836210077331051.99Rue du BatyRue du Pahy2
6210077330816210077330833.85Rue du BaronRue du Baty3

Further exploration

Further exploration
Further exploration

This dataset is rich in information and can be used to answer a wide range of questions. Here are a few examples of other queries that can be run on the BeStAddress dataset:

Finding all unique combinations of two different street_id within the same municipality_id

To retrieve all unique combinations of two different street IDs within the same municipality_id from a SQL database, you can use a self-join. A self-join allows you to pair rows from the same table where certain conditions are met (in this case, two differents street IDs but the same municipality ID).

sql
WITH DistinctStreets AS (
    SELECT
        municipality_id,
        street_id,
        round(avg(latitude), 3) AS latitude, -- Round to 3 decimal places (= ~110 meters)
        round(avg(longitude), 3) AS longitude
    FROM
        bestAddress
    GROUP BY
        municipality_id, street_id
)
SELECT
    a.municipality_id,
    a.street_id AS first_street_id,
    b.street_id AS second_street_id,
    a.latitude AS first_street_latitude,
    a.longitude AS first_street_longitude,
    b.latitude AS second_street_latitude,
    b.longitude AS second_street_longitude
FROM
    DistinctStreets AS a
JOIN
    DistinctStreets AS b
    ON a.municipality_id = b.municipality_id -- Same municipality
        AND a.street_id < b.street_id -- Ensure different street ids and avoid duplicate combinations (like (1, 2) and (2, 1))
ORDER BY
    a.municipality_id,
    a.street_id,
    b.street_id
LIMIT 100;

Percentage of non-NULL streetname_fr, streetname_de, and streetname_nl per municipality

sql
WITH DistinctStreets AS (
    SELECT
        DISTINCT municipality_id, street_id, streetname_fr, streetname_de, streetname_nl
    FROM
        BestAddress
)
SELECT
    municipality_id,
    COUNT(street_id) AS total_streets,  -- Total number of distinct streets per municipality

    -- Percentage of non-NULL streetname_fr
    (COUNT(
    CASE WHEN streetname_fr IS NOT NULL THEN 1 END)
    /
    CAST(COUNT(street_id) AS FLOAT) * 100)
    AS pct_streetname_fr,

    -- Percentage of non-NULL streetname_de
    (COUNT(
    CASE WHEN streetname_de IS NOT NULL THEN 1 END)
    /
    CAST(COUNT(street_id) AS FLOAT) * 100)
    AS pct_streetname_de,

    -- Percentage of non-NULL streetname_nl
    (COUNT(
    CASE WHEN streetname_nl IS NOT NULL THEN 1 END)
    /
    CAST(COUNT(street_id) AS FLOAT) * 100)
    AS pct_streetname_nl

FROM
    DistinctStreets
GROUP BY
    municipality_id;

Percentage of non-NULL streetname_fr, streetname_de, and streetname_nl per municipality, excluding municipalities where any of the streetname percentages are 0 or 100

sql
WITH DistinctStreets AS (
      SELECT
          DISTINCT municipality_id, street_id, streetname_fr, streetname_de, streetname_nl, region_code
      FROM
          BestAddress
  )
  SELECT
      municipality_id,
      region_code,
      COUNT(street_id) AS total_streets,  -- Total number of distinct streets per municipality

      -- Percentage of non-NULL streetname_fr
      (COUNT(CASE WHEN streetname_fr IS NOT NULL THEN 1 END) / CAST(COUNT(street_id) AS FLOAT) * 100) AS pct_streetname_fr,

      -- Percentage of non-NULL streetname_de
      (COUNT(CASE WHEN streetname_de IS NOT NULL THEN 1 END) / CAST(COUNT(street_id) AS FLOAT) * 100) AS pct_streetname_de,

      -- Percentage of non-NULL streetname_nl
      (COUNT(CASE WHEN streetname_nl IS NOT NULL THEN 1 END) / CAST(COUNT(street_id) AS FLOAT) * 100) AS pct_streetname_nl

  FROM
      DistinctStreets
  GROUP BY
      municipality_id, region_code
  HAVING
      -- Filter municipalities where any of the streetname percentages are not 0 or 100
      (COUNT(CASE WHEN streetname_fr IS NOT NULL THEN 1 END) / CAST(COUNT(street_id) AS FLOAT) * 100) NOT IN (0, 100)
      OR (COUNT(CASE WHEN streetname_de IS NOT NULL THEN 1 END) / CAST(COUNT(street_id) AS FLOAT) * 100) NOT IN (0, 100)
      OR (COUNT(CASE WHEN streetname_nl IS NOT NULL THEN 1 END) / CAST(COUNT(street_id) AS FLOAT) * 100) NOT IN (0, 100)
  ORDER BY
      region_code;