Machine Learning

ML Topic Extraction Update

This is an update to https://alanbuxton.wordpress.com/2022/01/19/first-steps-in-natural-language-topic-understanding. It’s scratching an itch I have about using machine learning to pick out useful information from text articles on topics like: who is being appointed to a new senior role in a company; which companies are launching new products in new regions, etc. My first attempt, along with a review of the various existing approaches out there, was summarised here: https://alanbuxton.wordpress.com/2021/09/21/transformers-for-use-oriented-entity-extraction/.

After this recent nonsense about whether language models are sentient or not, I’ve decided to use language that doesn’t imply any level of consciousness or intelligence. So I’m not going to be using the word “understanding” any more. The algorithm clearly doesn’t understand the text it is being given in the same way that a human understands text.

Since the previous version of the topic extraction system, I have implemented logic that uses constituency parsing and graphs in networkx to better model the relationships amongst the different entities. This went a long way towards improving the quality of the results, but the Appointment topic extraction, for example, still struggles in two particular cases:

  • When lots of people are being appointed to one role (e.g. a lot of people being announced as partners)
  • When one person is taking on a new role that someone else is leaving (e.g. “Jane Smith is taking on the CEO role that Peter Franklin has stepped down from”)
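
To make the second case concrete, here is a minimal sketch of how the example sentence could be represented as a graph in networkx. It is an illustration only, not the code behind the app: the node attribute `kind`, the edge attribute `relation`, the labels `taking_on` and `stepped_down_from`, and the helper `people_by_relation` are all hypothetical names chosen for this example.

```python
import networkx as nx

# Hypothetical graph for the sentence:
# "Jane Smith is taking on the CEO role that Peter Franklin has stepped down from"
# Node kinds and edge labels are assumptions for illustration, not the app's schema.
g = nx.DiGraph()
g.add_node("Jane Smith", kind="person")
g.add_node("Peter Franklin", kind="person")
g.add_node("CEO", kind="role")
g.add_edge("Jane Smith", "CEO", relation="taking_on")
g.add_edge("Peter Franklin", "CEO", relation="stepped_down_from")


def people_by_relation(graph, role):
    """Group the people linked to a role by the relation on the connecting edge."""
    grouped = {}
    for person, _, data in graph.in_edges(role, data=True):
        if graph.nodes[person].get("kind") == "person":
            grouped.setdefault(data["relation"], []).append(person)
    return grouped


appointments = people_by_relation(g, "CEO")
print(appointments)
```

Both people end up attached to the same role node, so the extraction has to rely on the edge label to tell the incoming person from the outgoing one. Deciding that label correctly from the parse is where rule-based post-processing tends to get complicated.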

At this point the post-processing is pretty complex. Instead of going further with this approach I’m going back to square one. I once saw a maxim along the lines of “once your rules get complex, it’s best to replace them with machine learning”. This will mean throwing away a lot of code, so emotionally it’s hard to do. And there is an open question about how much more labelled data the algorithm would need to learn these relationships accurately. But it will be fun to find out.

A simplified version of the app, covering Appointments (senior hires and fires) and Locations (setting up a new HQ, launching in a new location) is available on Heroku at https://syracuse-1145.herokuapp.com/. Feedback more than welcome.