2  Data Sources and Ingestion

2.1 Overview

This chapter fetches raw data from two public sources, loads it into DuckDB, and verifies schema and coverage before analysis.

2.2 Fetching Ookla Data via AWS CLI

The Ookla Open Dataset is hosted in a public S3 bucket — no AWS account needed. Run once from the project root:

bash data-raw/fetch_ookla.sh

This downloads one Parquet file per quarter (~200 MB each) for 2022–2024:

data/ookla/
  year=2022/quarter=1/2022-01-01_performance_mobile_tiles.parquet
  year=2022/quarter=2/...
  ...
  year=2024/quarter=4/...

Why AWS CLI here instead of R? aws s3 cp --recursive handles retries, partial downloads, and progress reporting better than a bare download.file(). It also makes the data provenance explicit — you can see exactly what bucket and path the files came from.

2.3 Loading into DuckDB with arrow

2.4 Schema and Coverage

8 records
column_name column_type null key default extra
quadkey VARCHAR YES NA NA NA
avg_d_kbps INTEGER YES NA NA NA
avg_u_kbps INTEGER YES NA NA NA
avg_lat_ms INTEGER YES NA NA NA
tests INTEGER YES NA NA NA
devices INTEGER YES NA NA NA
year INTEGER YES NA NA NA
quarter INTEGER YES NA NA NA
Displaying records 1 - 10
year quarter tiles avg_down_mbps avg_up_mbps avg_latency_ms
2022 1 3820724 62.56 13.67 39.3
2022 2 4027744 63.40 13.56 41.4
2022 3 4046154 65.67 13.74 42.0
2022 4 3838065 71.50 14.45 41.7
2023 1 3728229 77.57 15.10 41.7
2023 2 3864546 84.39 15.54 41.2
2023 3 4005796 90.22 15.77 41.3
2023 4 3771204 100.43 16.84 39.8
2024 1 3674000 107.84 17.45 39.1
2024 2 3703161 108.30 17.06 39.9

2.5 Data Dictionary

Column Type Description
quadkey VARCHAR Bing Maps tile ID at zoom level 16 (~600m tiles)
avg_d_kbps INTEGER Mean download speed across all tests in tile (kbps)
avg_u_kbps INTEGER Mean upload speed (kbps)
avg_lat_ms INTEGER Mean latency (ms)
tests INTEGER Number of speed tests in tile
devices INTEGER Unique devices tested
year / quarter INTEGER Partition columns added at ingest

2.6 About Quadkeys

Ookla tiles use the Bing Maps quadkey system at zoom level 16. Each tile covers roughly 600m × 600m at mid-latitudes. The sf package can convert these to geometries for spatial joins.

2.7 Problems

  1. How many unique tiles appear in every quarter of the dataset (consistent coverage)?
  2. Which quarter has the highest average test count per tile?
  3. Run aws s3 ls s3://ookla-open-data/parquet/performance/type=fixed/ --no-sign-request and compare the fixed broadband schema to the mobile schema.