| column_name | column_type | null | key | default | extra |
|---|---|---|---|---|---|
| quadkey | VARCHAR | YES | NA | NA | NA |
| avg_d_kbps | INTEGER | YES | NA | NA | NA |
| avg_u_kbps | INTEGER | YES | NA | NA | NA |
| avg_lat_ms | INTEGER | YES | NA | NA | NA |
| tests | INTEGER | YES | NA | NA | NA |
| devices | INTEGER | YES | NA | NA | NA |
| year | INTEGER | YES | NA | NA | NA |
| quarter | INTEGER | YES | NA | NA | NA |
2 Data Sources and Ingestion
2.1 Overview
This chapter fetches raw data from two public sources, loads it into DuckDB, and verifies schema and coverage before analysis.
2.2 Fetching Ookla Data via AWS CLI
The Ookla Open Dataset is hosted in a public S3 bucket — no AWS account needed. Run once from the project root:
bash data-raw/fetch_ookla.shThis downloads one Parquet file per quarter (~200 MB each) for 2022–2024:
data/ookla/
year=2022/quarter=1/2022-01-01_performance_mobile_tiles.parquet
year=2022/quarter=2/...
...
year=2024/quarter=4/...
Why AWS CLI here instead of R? aws s3 cp --recursive handles retries, partial downloads, and progress reporting better than a bare download.file(). It also makes the data provenance explicit — you can see exactly what bucket and path the files came from.
2.3 Loading into DuckDB with arrow
2.4 Schema and Coverage
| year | quarter | tiles | avg_down_mbps | avg_up_mbps | avg_latency_ms |
|---|---|---|---|---|---|
| 2022 | 1 | 3820724 | 62.56 | 13.67 | 39.3 |
| 2022 | 2 | 4027744 | 63.40 | 13.56 | 41.4 |
| 2022 | 3 | 4046154 | 65.67 | 13.74 | 42.0 |
| 2022 | 4 | 3838065 | 71.50 | 14.45 | 41.7 |
| 2023 | 1 | 3728229 | 77.57 | 15.10 | 41.7 |
| 2023 | 2 | 3864546 | 84.39 | 15.54 | 41.2 |
| 2023 | 3 | 4005796 | 90.22 | 15.77 | 41.3 |
| 2023 | 4 | 3771204 | 100.43 | 16.84 | 39.8 |
| 2024 | 1 | 3674000 | 107.84 | 17.45 | 39.1 |
| 2024 | 2 | 3703161 | 108.30 | 17.06 | 39.9 |
2.5 Data Dictionary
| Column | Type | Description |
|---|---|---|
quadkey |
VARCHAR | Bing Maps tile ID at zoom level 16 (~600m tiles) |
avg_d_kbps |
INTEGER | Mean download speed across all tests in tile (kbps) |
avg_u_kbps |
INTEGER | Mean upload speed (kbps) |
avg_lat_ms |
INTEGER | Mean latency (ms) |
tests |
INTEGER | Number of speed tests in tile |
devices |
INTEGER | Unique devices tested |
year / quarter |
INTEGER | Partition columns added at ingest |
2.6 About Quadkeys
Ookla tiles use the Bing Maps quadkey system at zoom level 16. Each tile covers roughly 600m × 600m at mid-latitudes. The sf package can convert these to geometries for spatial joins.
2.7 Problems
- How many unique tiles appear in every quarter of the dataset (consistent coverage)?
- Which quarter has the highest average test count per tile?
- Run
aws s3 ls s3://ookla-open-data/parquet/performance/type=fixed/ --no-sign-requestand compare the fixed broadband schema to the mobile schema.