Python script to handle company abbreviations

A while back I was cleaning up company names. Wikipedia has a useful page on the subject, but I couldn’t find a simple way to use that information in a Python script. So after downloading and wrangling the data from this Wikipedia page, here is a link to the code I ended up with. Posting it here in case it is of use to someone else out there.

Some useful factoids I picked up along the way:

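  1. Take a lot of care with Unicode. Characters that look the same (depending on font) might be very different. For example: 'KT vs КТ'.lower() == 'kt vs кт'. The first pair is Latin and the second is Cyrillic, so the two strings do not compare equal even though they can look identical in upper case.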
  2. Some abbreviations might appear at the beginning of a name, not just at the end. For example: ENEL RUSSIA PJSC vs PJSC “AEROFLOT”.
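
To make both points concrete, here is a minimal sketch. It is not the code linked above. The short ABBREVIATIONS tuple and the helper names scripts_used and strip_abbreviation are made up for illustration; a real list would come from the Wikipedia data. Guessing the script from the first word of the Unicode character name is a rough heuristic, not a proper script lookup.

```python
import unicodedata

# Hypothetical, deliberately tiny list; a real one would be built from the Wikipedia data.
ABBREVIATIONS = ("PJSC", "LLC", "LTD", "INC", "GMBH")

# Straight and curly quote characters that often wrap company names.
QUOTES = "\"'\u201c\u201d\u00ab\u00bb"


def scripts_used(text):
    """Rough guess at the scripts in text: first word of each letter's Unicode name."""
    return {
        unicodedata.name(ch, "UNKNOWN").split()[0]
        for ch in text
        if ch.isalpha()
    }


def strip_abbreviation(name, abbreviations=ABBREVIATIONS):
    """Drop a known abbreviation from the start and/or end of a company name."""
    known = {a.upper() for a in abbreviations}
    tokens = name.split()
    # Check the first token, then the last, ignoring trailing dots and commas.
    if tokens and tokens[0].upper().strip(".,") in known:
        tokens = tokens[1:]
    if tokens and tokens[-1].upper().strip(".,") in known:
        tokens = tokens[:-1]
    return " ".join(tokens).strip(QUOTES + " ")


latin = "KT"
cyrillic = "\u041a\u0422"  # Cyrillic KA and TE, which look like Latin K and T

print(latin == cyrillic)       # the lookalikes are different strings
print(scripts_used(latin))
print(scripts_used(cyrillic))
print(strip_abbreviation("ENEL RUSSIA PJSC"))
print(strip_abbreviation("PJSC \u201cAEROFLOT\u201d"))
```

Checking which scripts appear in a name before matching it against a list is a cheap way to catch lookalike characters early, and testing both ends of the name covers the prefix case from the second point.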

 
