I was finally able to start reconciling the Vancouver Storefronts Inventory (VSI from now on) with the OSM nodes. VSI has 578 coffee/café matches, OSM has 574. The numbers are close enough to give me hope.
Searching for OSM nodes that have a VSI node nearby (<10 m) returns 54 results. Of those, 51 are perfect matches (the business name in OSM is the same as in VSI, except for things like “Starbucks” in OSM vs “Starbucks Coffee” in VSI). This isn’t too thrilling, but honestly a perfect-match rate of roughly 9% (51 of 574 OSM nodes) from the get-go is pretty sweet.
Using 10 meters is pretty bold, so I’ll experiment a bit to find a healthy threshold that gives me more matches without yielding too many false ones. A 25 m radius already jumps to 391 matches, and a 50 m radius gives 705, which is obviously too many: that is more than the number of cafés in either dataset, so some nodes must be matching more than once.
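To make the radius experiment concrete, here is a minimal sketch in Python (not the SQL I'm actually running). It assumes each dataset is a list of dicts with hypothetical `name`, `lat` and `lon` keys, and the coordinates and names below are made up for illustration. It counts every OSM/VSI pair within a given radius, which shows how a single node can start matching several neighbours as the radius grows.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius, good enough at city scale


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two WGS84 points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def pairs_within(osm_nodes, vsi_nodes, radius_m):
    """Return every (OSM name, VSI name) pair whose points lie within radius_m."""
    return [
        (o["name"], v["name"])
        for o in osm_nodes
        for v in vsi_nodes
        if haversine_m(o["lat"], o["lon"], v["lat"], v["lon"]) <= radius_m
    ]


# Made-up sample data: one OSM node and three VSI nodes roughly
# 6 m, 22 m and 44 m to the north of it.
OSM_SAMPLE = [{"name": "Cafe A", "lat": 49.28000, "lon": -123.12000}]
VSI_SAMPLE = [
    {"name": "Cafe A Coffee", "lat": 49.28005, "lon": -123.12000},
    {"name": "Cafe B", "lat": 49.28020, "lon": -123.12000},
    {"name": "Cafe C", "lat": 49.28040, "lon": -123.12000},
]

for radius in (10, 25, 50):
    print(radius, len(pairs_within(OSM_SAMPLE, VSI_SAMPLE, radius)))
```

This brute-force loop compares every pair, which is fine for a few hundred cafés; a spatial index would be the next step for anything bigger.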
If I have the time, I should probably also get fancy with fuzzy matching on business names, to get the obvious non-identical matches out of the way so I can investigate the proper mismatches.
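One possible shape for that fuzzy matching, sketched in Python with the standard library's `difflib`. The `GENERIC_WORDS` list and the helper names are my own assumptions, not something taken from either dataset: the idea is to drop filler words such as "coffee" before comparing, so that "Starbucks" and "Starbucks Coffee" count as the same name.

```python
import difflib
import re
import unicodedata

# Assumption: filler words that carry no identity for a café name.
GENERIC_WORDS = {"coffee", "cafe", "the", "co", "company"}


def normalize(name):
    """Lowercase, strip accents and punctuation, drop generic words."""
    ascii_name = (
        unicodedata.normalize("NFKD", name).encode("ascii", "ignore").decode("ascii")
    )
    words = re.findall(r"[a-z0-9]+", ascii_name.lower())
    kept = [w for w in words if w not in GENERIC_WORDS]
    # If every word was generic (e.g. just "Café"), keep the original words.
    return " ".join(kept or words)


def name_similarity(a, b):
    """Similarity in [0, 1] between two normalized business names."""
    return difflib.SequenceMatcher(None, normalize(a), normalize(b)).ratio()


print(name_similarity("Starbucks", "Starbucks Coffee"))
print(name_similarity("Starbucks", "Tim Hortons"))
```

Pairs that are close in space and score high here could be accepted automatically, leaving only the low-scoring ones for manual review. Where to put the cutoff is something I'd have to tune against the data.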
Discussion
Comment from SK53 on 9 July 2022 at 19:05
The UK community has done a fair bit of similar work: mainly because we have a great open dataset on Food Hygiene Ratings.
There’s a matching tool that is integrated with some editors (code on GitHub).
I also tried to categorise the wide range of matching approaches one might want to use with such data. Although I experimented with these, I never moved on to integrating them.
I’d be interested to learn what you discover.
Comment from villasv on 9 July 2022 at 20:16
Very interesting. Thank you for the pointers @SK53! Some great insights on methods for reconciling two sources. It looks like I’ll be heavily limited in my options if I try to stick with SQL only, but where there’s a will there’s a way.