About
heritage.site is an experimental data project that I started in 2022 to discover and learn tools and practices in the data engineering space.
Essentially, it starts from a full database dump of en.wikipedia.org, keeps every article that relates to a heritage site (that is, one containing a specific infobox
template), and joins the result with the related pageview dump to measure each page's popularity.
This involves a lot of data cleansing.
Then the sites are sorted geographically and displayed on this site for convenient access.
There is a lot more I would like to add to the site to explore more data about this domain, so stay tuned.
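To give a rough idea of the filter-and-join step described above, here is a minimal Python sketch. It is not the actual pipeline code: the infobox names in `HERITAGE_INFOBOXES`, the function names, and the in-memory inputs (a list of `(title, wikitext)` pairs and a `title -> views` dictionary) are all assumptions made for illustration. A real run over a full dump would stream the data rather than hold it in memory.

```python
import re

# Hypothetical list of infobox templates that mark a heritage site.
# The real list used by heritage.site is not given here.
HERITAGE_INFOBOXES = (
    "infobox historic site",
    "infobox military installation",
    "infobox hotel",
)

# Matches the opening of an infobox template, e.g. "{{Infobox historic site".
INFOBOX_RE = re.compile(r"\{\{\s*(infobox[^|}\n]*)", re.IGNORECASE)


def infobox_name(wikitext):
    """Return the normalised name of the first infobox in the text, or None."""
    match = INFOBOX_RE.search(wikitext)
    if match is None:
        return None
    # Lowercase and collapse repeated whitespace so names compare reliably.
    return " ".join(match.group(1).lower().split())


def select_heritage_pages(pages, pageviews):
    """Keep pages that carry a heritage infobox and attach their view counts.

    pages: iterable of (title, wikitext) pairs (assumed input shape).
    pageviews: dict mapping title to a view count (assumed input shape).
    """
    selected = []
    for title, wikitext in pages:
        if infobox_name(wikitext) in HERITAGE_INFOBOXES:
            # Pages missing from the pageview data get a count of 0.
            selected.append({"title": title, "views": pageviews.get(title, 0)})
    return selected


pages = [
    ("Alcatraz Island", "{{Infobox historic site\n| name = Alcatraz}}"),
    ("Python", "{{Infobox programming language|name=Python}}"),
    ("Fort X", "{{ Infobox  military installation | name = Fort X}}"),
]
pageviews = {"Alcatraz Island": 500}
sites = select_heritage_pages(pages, pageviews)
```

Here `sites` keeps "Alcatraz Island" and the made-up "Fort X" entry and drops the programming language page. Matching on the infobox name alone is deliberately crude; it shows why so much cleansing is needed afterwards.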
December 2022 - Stage 1
- Extract ~100k sites from the 1st-Oct-2022 database dump of en.wikipedia.org
- Process and render 24,970 sites on this website
June 2023 - Stage 2
- Process all 102,287 sites from the previous database dump of en.wikipedia.org
- Download, organise, and compress 88,897 site images; add the images to the website
February 2024 - Stage 3
- Improve data cleansing of the date field to parse dates more accurately: remove ~4000 post-1945 sites
- Better coordinates parsing: add ~8000 sites
- Add ~5 more infobox types: add ~9000 sites (e.g. military installation, hotel)
- First re-run of the entire data pipeline on a fresh en.wikipedia.org dump from 1st-Feb-2024: add ~7000 sites
- Fix the California page bug (only 3 sites displayed)
- More data cleansing on site names (removing duplication caused by an extra native-language site name)
- Add 22,945 new site images
New grand total of 122,500 sites
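Two of the Stage 3 cleansing steps, coordinates parsing and the post-1945 date filter, can be sketched roughly as follows. This is only an illustration under assumptions, not the code that runs behind the site: it assumes coordinates come from a `{{coord|...}}` template in either decimal or degrees/minutes/seconds form, that a date field can be reduced to the first 3- or 4-digit year it contains, and that sites without a parseable year are kept. The function names and `CUTOFF_YEAR` constant are made up for the example.

```python
import re

COORD_RE = re.compile(r"\{\{\s*coord\s*\|([^}]*)\}\}", re.IGNORECASE)
YEAR_RE = re.compile(r"\b(\d{3,4})\b")
CUTOFF_YEAR = 1945  # sites dated after this year are removed


def parse_coord(wikitext):
    """Return (latitude, longitude) in decimal degrees, or None if unparseable."""
    match = COORD_RE.search(wikitext)
    if match is None:
        return None
    values, numbers = [], []
    for field in match.group(1).split("|"):
        field = field.strip().upper()
        if field in ("N", "S", "E", "W"):
            # A hemisphere letter closes a degrees[/minutes[/seconds]] group.
            if not 1 <= len(numbers) <= 3:
                return None
            degrees = sum(n / 60 ** i for i, n in enumerate(numbers))
            values.append(-degrees if field in ("S", "W") else degrees)
            numbers = []
        else:
            try:
                numbers.append(float(field))
            except ValueError:
                break  # reached a non-numeric parameter such as display=title
    if not values and len(numbers) == 2:
        values = numbers  # plain signed decimal form: {{coord|lat|lon}}
    if len(values) != 2:
        return None
    lat, lon = values
    if abs(lat) <= 90 and abs(lon) <= 180:
        return (lat, lon)
    return None


def extract_year(date_text):
    """Return the first 3- or 4-digit number in the text as a year, or None."""
    match = YEAR_RE.search(date_text)
    return int(match.group(1)) if match else None


def is_post_cutoff(date_text):
    """True when a year is found and it is later than CUTOFF_YEAR."""
    year = extract_year(date_text)
    return year is not None and year > CUTOFF_YEAR


dms = parse_coord("{{coord|37|49|36|N|122|25|22|W|display=inline,title}}")
decimal = parse_coord("{{coord|48.8584|2.2945}}")
```

The year heuristic is intentionally naive: it ignores BC dates and phrases such as "12th century", which is exactly the kind of case that makes date cleansing an ongoing task.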
Contact
For comments, feedback, or to report an issue, you can fill out this contact form.
Gabriel