Subject: Overpass API development
- From: mmd <>
- To:
- Subject: [overpass] [PerfProject2016] Full attic database creation in 4 days
- Date: Mon, 23 May 2016 19:47:49 +0200
Hello,
as part of the Overpass Performance Project 2016, improving the overall
time to create a full attic db was one of the primary focus topics. If
you set up your own instance before, you probably used the existing
clone files. That is still the recommended approach for most users.
However, when switching to a different compression algorithm or in case
of bugs in the template db implementation, being able to quickly set up
a full attic database from scratch is of paramount importance.
Unfortunately, there's very little documentation available on previous
run times. Back in 2014, Roland set up a database covering roughly 700 days,
which reportedly took less than a week. That setup didn't include compression
at the time. For the current v0.7.52 zlib-compressed database, I couldn't
find any figures at all. Some GitHub tickets suggest that the current rate of
catching up using minutely diffs is about 30 times real time.
It's about time to dig a bit deeper.
Initial tests on the dev instance quickly turned out to be quite time
consuming with an estimated total runtime of at least 6 weeks. After
switching to a more powerful 8 core server with 32 GB memory and SSD,
initial tests on lz4 cut the time down to 13 days. Processing updates
was done using daily diffs rather than minutely diffs. That still seemed
quite a lot for 1340 days (=all changes since the license change in
September 2012). Large parts of the processing were CPU bound due to the
fast SSDs. Nevertheless, only one core was in use the whole time.
I decided to move dedicated parts of the database update logic to
multi-threaded processing (based on C++11 standard mechanisms, no external
libs). That affects only those parts where 8 different files are read from
disk, decompressed, have the changes applied, and are then compressed and
written back to disk. Also, I reorganized the database a few times via db
cloning, mainly to cut down disk space. That brought the full attic db setup
down to 8-8.5 days.
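For illustration, here is a minimal sketch (not the actual Overpass code) of
how such independent per-file work can be parallelized with C++11 standard
mechanisms only. process_one_file() is a hypothetical placeholder for the
real read / decompress / apply changes / compress / write sequence:

#include <future>
#include <string>
#include <vector>

void process_one_file(const std::string& path)
{
  // hypothetical placeholder: read the file from disk, decompress it,
  // apply the pending changes, re-compress it and write it back
  (void)path;
}

void process_files_in_parallel(const std::vector<std::string>& paths)
{
  std::vector<std::future<void>> tasks;
  tasks.reserve(paths.size());

  // launch one asynchronous task per file; the files are independent,
  // so the workers don't need any locking between them
  for (const auto& path : paths)
    tasks.push_back(std::async(std::launch::async, process_one_file, path));

  // wait for all workers and propagate any exception they threw
  for (auto& task : tasks)
    task.get();
}

In the scenario above, paths would simply hold the 8 files touched in one
update step, keeping up to 8 cores busy instead of one.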
The next step was to increase the number of days handled in one update run.
So far I had used update_database, but then switched to update_from_dir and
apply_osc_to_db.sh. Usually, that script is used to apply several minutely
diffs in one go. Well, why not use that mechanism to apply several days at
once, permitting up to 4 GB of uncompressed change data per run? Depending on
the data, this corresponds to 6-12 days' worth of OSM data. Running the
update this way seemed to work quite well with 32 GB of main memory, although
update_from_dir sometimes needed more than 20 GB. If you're short on main
memory, that may not be an option.
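To give an idea of the batching step, here is a small sketch under
assumptions (the DailyDiff struct and the group_days() helper are
illustrative, not part of the Overpass tools): it groups consecutive daily
change files into batches whose combined uncompressed size stays below
roughly 4 GB, so each batch can be fed to a single update run:

#include <cstdint>
#include <string>
#include <vector>

// illustrative only: one daily .osc change file plus its uncompressed size
struct DailyDiff
{
  std::string path;
  std::uint64_t uncompressed_bytes;
};

// group consecutive days into batches of at most ~4 GB uncompressed data
std::vector<std::vector<DailyDiff>> group_days(
    const std::vector<DailyDiff>& days,
    std::uint64_t limit = 4ull * 1024 * 1024 * 1024)
{
  std::vector<std::vector<DailyDiff>> batches;
  std::vector<DailyDiff> current;
  std::uint64_t current_size = 0;

  for (const auto& day : days)
  {
    // start a new batch once adding this day would exceed the limit
    if (!current.empty() && current_size + day.uncompressed_bytes > limit)
    {
      batches.push_back(current);
      current.clear();
      current_size = 0;
    }
    current.push_back(day);
    current_size += day.uncompressed_bytes;
  }
  if (!current.empty())
    batches.push_back(current);
  return batches;
}

Each resulting batch then corresponds to the 6-12 days of OSM data that go
into one update run.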
Well, luckily, the total processing time dropped to just 4 days,
corresponding to about 330 OSM days processed per real day. This should be
good enough for the time being.
Two additional points worth noting:
- The overall processing slows down quite a bit over time, likely caused
by the increased amount of data to be processed. I didn't investigate
this part any further, but 2-3 years down the road, that might need some
revisiting.
- Database size grows quite a lot during the update process. A subsequent
clone-db run sometimes reduced that size by 50-60 GB. Again, I didn't
investigate where those large differences come from.
I put lots of stats on the wiki page [1]. If I find some more time, I'll
probably add further comments to that page. Also, you can find the full
attic db for lz4 on the dev instance [2]. The respective branch is
mentioned on the wiki page as well.
Best,
mmd
[1]
https://wiki.openstreetmap.org/wiki/User:Mmd/Overpass_API/Performance_Project_2016/Full_Attic_DB_Setup
[2] http://dev.overpass-api.de/clone_lz4/