A while back I was working on cleaning up company names. Wikipedia has a useful page, but I couldn’t find a simple way to use this information in a Python script. So, after downloading and wrangling the data from that Wikipedia page, here is a link to the code I ended up with. I'm posting it here in case it is of use to someone else.
Some useful factoids I picked up along the way:
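- Take great care with Unicode. Characters that look the same (depending on the font) can be entirely different code points. For example, the Latin KT and the Cyrillic КТ look alike in uppercase and only show their difference once lowercased: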
'KT vs КТ'.lower() == 'kt vs кт'
- Some abbreviations can appear at the beginning of a name, not just at the end. For example, ENEL RUSSIA PJSC vs PJSC “AEROFLOT”.
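
Both points can be illustrated with a small sketch. This is not the code linked above, just a minimal illustration: `LEGAL_FORMS` is a made-up, deliberately tiny stand-in for the Wikipedia data, and `scripts_used` and `strip_legal_form` are hypothetical helper names. The first helper reports which scripts the letters in a string belong to, and the second drops a known abbreviation from either end of a name.

```python
import unicodedata

# Hypothetical, tiny stand-in for the full list of legal-form abbreviations.
LEGAL_FORMS = {"PJSC", "LLC", "LTD", "INC"}


def scripts_used(text):
    """Return the script prefixes (e.g. 'LATIN', 'CYRILLIC') of the letters in text."""
    return {
        unicodedata.name(ch, "UNKNOWN").split()[0]
        for ch in text
        if ch.isalpha()
    }


def strip_legal_form(name):
    """Drop a known legal-form abbreviation from the start and/or end of a name."""
    # Remove straight and curly double quotes before splitting into words.
    cleaned = name.translate(str.maketrans("", "", '"“”'))
    words = cleaned.split()
    if words and words[0].upper().rstrip(".") in LEGAL_FORMS:
        words = words[1:]
    if words and words[-1].upper().rstrip(".") in LEGAL_FORMS:
        words = words[:-1]
    return " ".join(words)


latin_kt = "KT"                # Latin K, Latin T
cyrillic_kt = "\u041a\u0422"   # Cyrillic Ka, Cyrillic Te

print(scripts_used(latin_kt), scripts_used(cyrillic_kt))
print(latin_kt == cyrillic_kt)
print(strip_legal_form("ENEL RUSSIA PJSC"))
print(strip_legal_form("PJSC “AEROFLOT”"))
```

Checking the script of each letter is only one way to spot lookalike characters; a name that legitimately mixes scripts would need a more careful rule.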