Machine Learning

Clicking on Garbage Captcha

Feels like these days it would be easier for an AI to deal with a captcha than for a human being to complete it.

Today I had possibly my worst captcha-related experience.

This was not deemed acceptable.

I had to also click the middle one on the right, even though it takes quite a lot of imagination to see it as a set of traffic lights.

Then I was allowed to proceed.

Feeling like a proper grumpy old man having to deal with this nonsense.

I can imagine an LLM hallucinating that the middle-right image is a set of traffic lights. But a human being would have to be pretty high to come to the same conclusion.

Machine Learning

Hallucinations are how GenAI works

There’s been a lot of talk about how ChatGPT (and similar) hallucinate when they generate text. It’s an unfortunate choice of words to use because it sounds like some kind of side-effect. But in reality it’s central to the way that Generative AI works.

Do you know why they’re called hallucinations?

Before we had ChatGPT making text generation popular, creating images was the big thing in Generative AI. When researchers looked into how these images were being generated they saw something that looked, well, like hallucinations.

Here’s an example from DeepDream.

And of course we had many, many cases where image generation tools would “hallucinate” extra fingers or lopsided ears etc.

When you’re talking about image generation, a word like “hallucinate” works quite well. If you see a generated image that looks like an acid trip or has wonky ears in or whatever then you tend to consider the whole image a hallucinated image. You treat the whole image as not real.

But when you see a ChatGPT output with 3 incorrect facts in it, you don’t think of the whole text as a hallucination. You tend to consider those 3 incorrect facts as hallucinations, distinct from the rest of the text. Weird how people tend to treat pictures as “one thing” but text as “a bunch of different things stitched together that we can consider separately”.

There have been attempts to use a different word when talking about text generation: confabulation. According to the National Institutes of Health, confabulation is:

a neuropsychiatric disorder wherein a patient generates a false memory without the intention of deceit

https://www.ncbi.nlm.nih.gov/books/NBK536961/

But too late! “Hallucination” had already taken its grip on the public’s imagination.

When you look at a ChatGPT output you should expect that the whole thing is one confabulation. This confabulation is often useful but it’s still just made up text, however cleverly it is made up.

There are all kinds of guardrails that people have to build around the LLMs to help guard against these confabulations/hallucinations. But ultimately, when you’re generating images or text with GenAI, you are asking a machine to hallucinate / confabulate for you. Please remember that.
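
To make the guardrail idea a bit more concrete, here is one very simple kind of guardrail sketched in Python: only accept an answer that can be found in a trusted source text. The generate() function is a placeholder for whatever LLM call you use, and the prompt wording and containment check are mine – an illustration, not a recipe.

from typing import Callable, Optional

def grounded_answer(question: str, context: str,
                    generate: Callable[[str], str]) -> Optional[str]:
    # Ask the model to answer only from the supplied context, then check
    # that its answer actually appears in that context before trusting it.
    prompt = (
        "Answer the question using only the context below. "
        "If the answer is not in the context, reply with exactly UNKNOWN.\n\n"
        f"Context: {context}\n\nQuestion: {question}"
    )
    answer = generate(prompt).strip()
    # Guardrail: discard refusals and anything we cannot find in the source
    # text, i.e. anything that looks confabulated.
    if answer == "UNKNOWN" or answer.lower() not in context.lower():
        return None  # caller can fall back to a human, a search, or a refusal
    return answer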

Machine Learning, Technology Adoption

Productizing AI is Hard: Part 94

This is a story about how a crude internet joke got from Reddit to ChatGPT to the top of Google.

Here is what you get currently if you ask Google “Are there any African countries starting with K”

The featured snippet is obviously nonsense. Once you get past the featured snippet the content is sensible, but the whole point of the featured snippet, in Google’s own words, is: “We display featured snippets when our systems determine this format will help people more easily discover what they’re seeking, both from the description about the page and when they click on the link to read the page itself. They’re especially helpful for those on mobile or searching by voice.”

So you’d be forgiven for thinking that Google has quite a high bar for what it puts into a featured snippet.

With my debugging hat on, my first hypothesis is that Google is interpreting this search query as the first line in a joke rather than as a genuine question. If that’s the case then this featured snippet is a great one to show because it builds on the joke.

But. Even if that explains the logic behind showing the snippet, it doesn’t mean that this is the best snippet to show. I’d still consider this a bug if it were in one of my systems. At the very least there should be some context to say: “if this is the first line in a joke, then here is the expected response”.

How did this joke get into the featured snippet?

Here’s the page that the Google featured snippet links to. It’s a web page showing a purported chat with ChatGPT in which ChatGPT agrees that there are no African countries starting with K. It’s from emergentmind.com, a website that includes lots of content about ChatGPT.

I don’t know whether this is a genuine example of ChatGPT producing text that looks grammatically correct but is actually nonsense, or whether it’s a spoof that was added to emergentmind.com as a joke. But there is definitely a lot of this “African countries starting with K” content on Reddit, and we know that Reddit was used to train ChatGPT. So it’s very plausible that ChatGPT picked up this “knowledge” but, being a language model, can’t tell whether it’s real, fake or just a joke.

Either way, the fact that this is presented as ChatGPT text on emergentmind.com helps give it enough weight to get into a featured snippet.

One obvious lesson is don’t trust featured snippets on Google. Only last month I wrote about another featured snippet that got things wrong, this time about terms of use for LinkedIn. Use DuckDuckGo if you just want a solid search engine that finds relevant pages, no more, no less.

But this example raises some interesting food for thought ….

Food for thought for people working with LLMs:

  1. If you are training your model on “the entire internet”[1] then you will get lots of garbage in there
  2. As more and more content gets created by large language models, the garbage problem will only get worse

And food for thought for people trying to build products with LLMs:

  1. Creating a demo of something that looks good using LLMs is super easy, but turning it into a user-facing product that can handle all these garbage cases remains hard. Not impossible, but still hard work.
  2. So how do you design your product to maximize the benefits from LLMs while minimizing the downside risk when your LLM gets things wrong?[2]

I’ve written in the past about the hype cycle related to NLP. That was 4 months ago in April. Back then I was uncomfortable that people were hyping LLMs out of all proportion to their capabilities. Now it seems that we are heading towards the trough of disillusionment – with people blowing the negative aspects out of all proportion. The good news is that, if it’s taken less than 6 months to get from the peak of “Large Language Models are showing signs of sentience and they’re going to take your job” to the trough of “ChatGPT keeps getting things wrong and OpenAI is about to go under”[3], then this must mean that the plateau of productivity beckons. I think it’s pretty close (months vs years).

Hat tip to https://mastodon.online/@rodhilton@mastodon.social/110894818521176741 for the context. (Warning before you click – the joke is pretty crude and it’s arguable how funny it is).

Notes

[1] For whatever definition you have for “entire” and “internet”

[2] I saw a Yann LeCun quote (can’t find it now, sadly, so perhaps it’s apocryphal) about one company using 30 LLMs to cross-check results and decrease the risk of any one of them hallucinating. I’m sure this brute force approach can work, but there will also be other smarter ways, depending on the use case
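
I don’t know how that company wired its 30 models together, but the brute-force idea itself is easy to sketch: put the same question to several independently built models and only trust an answer that enough of them agree on. A minimal sketch in Python, where models is a placeholder list of LLM-calling functions:

from collections import Counter
from typing import Callable, List, Optional

def cross_checked_answer(question: str,
                         models: List[Callable[[str], str]],
                         min_agreement: float = 0.6) -> Optional[str]:
    # Ask every model the same question, normalising the answers so that
    # trivially different phrasings can still be counted as a match.
    answers = [" ".join(model(question).lower().split()) for model in models]
    answer, votes = Counter(answers).most_common(1)[0]
    if votes / len(answers) >= min_agreement:
        return answer
    return None  # the models disagree too much, so treat the result as unreliable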

[3] Whether OpenAI succeeds or fails as a company has very little to do with the long-term productivity gains from LLMs, in much the same way that Friendster’s demise didn’t spell the end of social networking platforms

Machine Learning

From AI to Assistive Computation?

This post on Mastodon has been playing on my mind. It was written on 27th November, after the debacle with Galactica but before ChatGPT burst into the public’s consciousness.

Link to the full thread on Mastodon

I love the challenge it poses.

I am sure there are some areas where the term “AI” is meaningful, for example in academic research. But in the wider world, Ilyaz has a very strong argument.

Usually when people think of AI they’ll imagine something along the lines of 2001: A Space Odyssey or Aliens or I, Robot or Bladerunner or Ex Machina: something that seems uncannily human but isn’t. I had this image in mind when I first wanted to understand AI and so read Artificial Intelligence: A Modern Approach. What an anti-climax that book was. Did you know that, strictly speaking, the ghosts in Pac-Man are AIs? A piece of code that has its own objectives to carry out, like a Pac-Man ghost, counts as AI. It doesn’t have to ‘think’.

Alan Turing invented The Turing Test in 1950 as a test for AI. For a long time this seemed like a decent proxy for AI: if you’re talking to two things and can’t tell which is the human and which is the machine then we may as well say that the machine is artificially intelligent.

But these days large language models can pass the Turing Test with ease. It’s got to the point that ChatGPT has been explicitly coded/taught to fail it: the AIs can fake being human so well that they’re being programmed not to sound like humans!

A good description of these language models is ‘Stochastic Parrots’: ‘Parrots’ because they repeat the patterns they have seen without necessarily understanding any meaning, and ‘Stochastic’ because there is randomness in the way they have learnt to generate text.
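
The ‘stochastic’ part is easy to see for yourself: sample from even a small language model a few times with the same prompt and you get different continuations each time. A minimal illustration using the Hugging Face transformers library (the model and prompt are arbitrary choices of mine, just small enough to run locally):

from transformers import pipeline, set_seed

generator = pipeline("text-generation", model="gpt2")
set_seed(42)  # change the seed and you get different continuations

# Sampling (rather than always picking the most likely next token) is the
# "stochastic" part; the learned patterns being regurgitated are the "parrot".
for output in generator(
    "The board has approved the acquisition of",
    max_new_tokens=20,
    do_sample=True,
    temperature=0.9,
    num_return_sequences=3,
):
    print(output["generated_text"])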

Services like ChatGPT are bringing this sort of tech into the mainstream and transforming what we understand is possible with computers. This is a pattern we’ve seen before. The best analogy I can think of for where we are today in the world of AI tech is how Spreadsheets and then Search Engines and then Smartphones changed the world we live in.

They don’t herald the advent of Skynet (any more than any other tech from one of the tech titans), nor do they herald a solution for the world’s ills.

So maybe we should reserve the term ‘AI’ for the realms of academic study and instead use a term like ‘Assistive Computation’ as Ilyaz suggests when it comes to real-world applications.

Pretty provocative but at the same time pretty compelling.

To end this post, I’ll leave you with an old AI/ML joke that is somewhat relevant to the discussion here (though these days you’d have to replace ‘linear regression’ with ‘text-davinci-003’ to get the same vibe):

Edited 2023-01-30: Added link to the full thread on Mastodon

Machine Learning

Evaluating Syracuse – part 2

I recently wrote about the results of trying out my M&A entity extraction project that is smart enough to create simple graphs of which company has done what with which other company.

For a side project very much in alpha it stood up pretty well against the best of the other offerings out there – at least in the first case I looked at. Here are two more complex examples, chosen at random.

Test 1 – M&A activity with multiple participants

Article: Searchlight Capital Partners Completes the Acquisition of the Operations and Assets of Frontier Communications in the Northwest of the U.S. to form Ziply Fiber

Syracuse

It shows which organizations have been involved in the purchase, which organization sold the assets (Frontier) and the fact that the target entity is an organization called Ziply Fiber.

To improve, it could make it clearer that Ziply is a new entity being created, rather than an entity already called Ziply being bought from Frontier. It could also identify that this relates to assets in the North West of the US. But otherwise pretty good.

Expert.ai

As before, it’s really good at identifying all the organizations in the text, even the ones that aren’t relevant to the story, e.g. Royal Canadian Mounted Police.

The relations piece is patchy. From the headline it determines that Searchlight Capital Partners is completing an acquisition of some operations, and also there is a relationship between the verb ‘complete’ and the assets of Frontier Communications. Pretty good result from this sentence, but not completely clear that there is an acquisition of assets.

The next sentence has a really good catch: that Searchlight is forming Ziply.

It only identifies one of the other parties involved in the transaction. It doesn’t tie the ‘it’ to Searchlight – you’d have to infer that from another relationship. And it doesn’t flag any of the other participants.

Test 2 – Digest Article

Article: Deals of the day-Mergers and acquisitions

Syracuse

It’s identifying 7 distinct stories. There are 8 bullet points in the Reuters story – one of which is about something that isn’t happening. Syracuse picks up all of the real stories. It messes up Takeaway.com’s takeover of Just Eat by separating out Takeaway and com as two different organizations, but apart from that it looks pretty good.

I’m particularly gratified how it flags Exor as the spender and Agnelli as another kind of participant in the story about Exor raising its stake in GEDI. Agnelli is the family behind Exor, so they are involved, but strictly speaking the company doing the buying is Exor.

Expert.ai

Most of the entities are extracted correctly. A couple of notable errors:

  1. It finds a company called ‘Buyout’ (really this is the description of a type of firm, not the name of the firm)
  2. It also gets Takeaway.com wrong – but where Syracuse split this into two entities, Expert.ai flags it as a URL rather than a company (in yellow in the second image below)

The relationship piece is also pretty impressive from an academic point of view, but from a practical point of view it’s hard to piece together what is really going on. Take the first story, about Mediaset, as an example and look at the relationships that Expert.ai identifies in the 4 graphs below. The first one identifies that Mediaset belongs to Italy and is saying something. The other 3 talk about an “it” doing various things, but don’t tie this “it” back to Mediaset.

Conclusion

Looking pretty good for Syracuse, if I say so myself :D.

Machine Learning

Revisiting Entity Extraction

In September 2021 I wrote about the difficulties of getting anything beyond basic named entity recognition. You could easily get the names of companies mentioned in a news article, but not whether one company was acquiring another or whether two companies were forming a joint venture, etc. Not to mention the perennial “Bloomberg problem”: Bloomberg is named in loads of different stories. Usually it is referenced as the company reporting the story, sometimes as the owner of the Bloomberg Terminal. Only a tiny proportion of mentions of Bloomberg are about actions that the Bloomberg company itself has taken.

These were very real problems that a team I was involved in was facing around 2017, and they were still not fixed in 2021. I figured I’d see if more recent ML technologies, specifically Transformers, could help solve them. I’ve made a simple Heroku app, called Syracuse, to showcase the results. It’s very alpha, but the quality is not too bad right now.

Meanwhile, the state of the art has moved on in leaps and bounds over the past year. So I’m going to compare Syracuse with the winner from my 2021 comparison, Expert.ai’s Document Analysis Tool, and with ChatGPT – the new kid on the NLP block.

A Simple Test

Article: Avalara Acquires Artificial Intelligence Technology and Expertise from Indix to Aggregate, Structure and Deliver Global Product and Tax Information

The headline says it all: Avalara has acquired some Tech and Expertise from Indix.

Expert.AI

It is very comprehensive. For my purposes, too comprehensive. It identifies 3 companies: Avalara, ICR and Indix. The story is about Avalara acquiring IP from Indix. ICR is the communications company that is making the press release; ICR appearing in this list is an example of the “Bloomberg Problem” in action. Also, it’s incorrect to call Indix IP a company – the company is Indix. The relevant sentence in the article mentions Indix’s IP, not a company called Indix IP: “Avalara believes its ability to collect, organize, and structure this content is accelerated with the acquisition of the Indix IP.”

It also identifies many geographic locations, but many of them are irrelevant to the story as they are just lists of where Avalara has offices. If you wanted to search a database of UK-based M&A activity you would not want this story to come up.

Expert.AI’s relationship extraction is really impressive, but again, overly comprehensive. This first graph shows that Avalara gets expertise, technology and structure from Indix IP to aggregate things.

But there are also many, many other graphs which are less useful, e.g.:

Conclusion: Very powerful. Arguably too powerful. It reminds me of the age-old Google problem – I don’t want 1,487,585 results in 0.2 seconds. I’m already drowning in information, I want something that surfaces the answer quickly.

ChatGPT

I tried a few different prompts. First I included the background text then added a simple prompt:

I’m blown away by the quality of the summary here (no mention of ICR, LLC, so it’s not suffering from the Bloomberg Problem). But it’s not structured. Let’s try another prompt.

Again, it’s an impressive summary, but it’s not structured data.

Expert.ai + ChatGPT

I wondered what the results would be from combining a ChatGPT summary with Expert.AI document analysis. Turns out: not much use.

Syracuse

Link to data: https://syracuse-1145.herokuapp.com/m_and_as/1

Anyone looking at the URLs will recognise that this is the first entry in the database. This is the first example that I tried as an unseen test case (no cherry-picking here).

It shows the key information in a more concise graph as below. Avalara is a spender, Indix is receiving some kind of payment and the relevant target is some Indix Technology (the downward triangle represents something that is not an organization)

I’m pretty happy with this result. It shows that, however impressive the likes of Expert.AI and ChatGPT are, they have limitations when applied to more specific problems like this one. Fortunately there are other open source ML technologies out there that can help, though it’s a job of work to stitch them together appropriately to get a decent result.

In future posts I’ll share more comparisons of more complex articles and share some insights into what I’ve learned about large language models through this process (spoiler – there are no silver bullets).

Machine Learning

Large Language Models, Hype and Prompt Chaining

Meta released Galactica recently to great fanfare and then rapidly removed it.

Janelle Shane poked some fun at Galactica in a post that showed how you can get it to give you nonsense answers, while making the serious point that you should be very aware of the hype. From a research point of view, Galactica is obviously super exciting. From a real-life point of view, you’re not about to replace your chatbot with Galactica, not while it suffers from hallucinations.

But there are serious use-cases for large language models like Galactica and Google’s Flan-T5. Just not writing fully-fledged research articles.

You have to ask the model a number of smaller questions one after the other. In the jargon: ‘prompt chaining’. For example – referring to Janelle’s example question that fooled Galactica:

Prompt: how many giraffes are in a mitochondria?
Galactica: 1

Don’t treat the language model as the holder of all knowledge. Treat the language model as an assistant who is super keen to help and is desperate not to offend you. You have to be careful what you ask, and perhaps ask several questions to get to the real answer. Here is an example I did with Flan T5 using a playground space on HuggingFace.

Prompt: Does mitochondria contain giraffes?
Flan T5: no

Prompt: How many giraffes are in a mitochondria?
Flan T5: ten

Using the same question that Galactica was given, we get a nonsense answer. Flan T5 is even more keen than Galactica to give us an impressive-sounding answer. But if you take both questions together then you can draw a more meaningful conclusion. Chain the prompts: first ask the yes/no question, and only ask the second question if the answer to the first warrants it.
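
Here is roughly what that chain looks like in code, using the transformers library and flan-t5-base. The playground I used was interactive, so treat this as a reconstruction of the idea rather than the exact setup, and the prompt wording is mine:

from transformers import pipeline

flan = pipeline("text2text-generation", model="google/flan-t5-base")

def ask(prompt: str) -> str:
    return flan(prompt, max_new_tokens=20)[0]["generated_text"].strip()

def chained_count(thing: str, container: str) -> str:
    # Step 1: a constrained yes/no gate. The model is much less likely to
    # invent an impressive-sounding number for a question this narrow.
    gate = ask(f"Answer yes or no. Does a {container} contain {thing}?")
    if gate.lower().startswith("no"):
        return f"There are no {thing} in a {container}."
    # Step 2: only ask for a number if the premise survived step 1.
    return ask(f"How many {thing} are in a {container}?")

print(chained_count("giraffes", "mitochondria"))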

Having written all of this, today I learnt about OpenAI’s ChatGPT, which seems like a massive step towards solving the hallucination problem. I love how fast this space is moving these days.

Machine Learning

Entity extraction powered by Flan

A Thanksgiving update on my side project. See here for an outline of the problem. In short, existing natural language processing techniques are good at generic entity extraction, but not at really getting to the core of the story.

I call it the ‘Bloomberg problem’. Imagine this text: “Bloomberg reported that Foo Inc has bought Bar Corp”. Bloomberg is not relevant in this story. But it is relevant in this one: “Bloomberg has just announced a new version of their Terminal”.

I wrote about my first attempt to address this problem, and then followed it up in July. I’ve been doing some more finessing since then and am pretty happy with the results. There is still some tidying up to do but I’m pretty confident that the building blocks are all there.

The big changes since July are:

  1. Replacing a lot of the post-processing logic with a model trained on more data. This was heartbreaking (throwing away work, sad face emoji) but at the same time exhilarating (it works a lot better with less code, big smile emoji).
  2. Implementing Flan T5 to help with some of the more generic areas.

At a high level this is how it works:

  1. The model
    • Approx 400 tagged docs (in total, across train, val and test sets)
    • Some judicious data synthesis
    • Trained a Named Entity Recognition model based on roberta-base
  2. Post-processing is a combination of
    • Benepar for constituency parsing to identify the relationships between entities for most cases
    • Flan T5 to help with the less obvious relationships (see the sketch after this list).
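
Very roughly, the pieces slot together as in the sketch below. The NER model path is a placeholder for my fine-tuned roberta-base checkpoint and the post-processing is hugely simplified, so this shows the shape of the pipeline rather than the real thing:

import benepar  # importing registers the "benepar" spaCy pipeline component
import spacy
from transformers import pipeline

# 1. Named Entity Recognition with a token-classification model fine-tuned
#    from roberta-base ("path/to/finetuned-roberta-ner" is a placeholder).
ner = pipeline("token-classification",
               model="path/to/finetuned-roberta-ner",
               aggregation_strategy="simple")

# 2. Constituency parsing with Benepar via its spaCy plugin (requires
#    benepar.download("benepar_en3") and the spaCy model to be installed).
nlp = spacy.load("en_core_web_md")
nlp.add_pipe("benepar", config={"model": "benepar_en3"})

# 3. Flan T5 as a fallback for relationships the parse doesn't make obvious.
flan = pipeline("text2text-generation", model="google/flan-t5-base")

def analyse(text):
    entities = ner(text)
    parse_trees = [sent._.parse_string for sent in nlp(text).sents]
    # Post-processing (not shown) walks the parse trees to relate the
    # entities to each other, and only falls back to Flan T5 prompts for
    # the cases it can't resolve.
    return entities, parse_trees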

Next steps are going to be to start representing this as a knowledge graph, which is a more natural way of exploring the data.
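
For the knowledge graph representation I have something along these lines in mind, using networkx. The node and edge attributes are only what I’m currently playing with, so consider it a sketch (the example edge is the Wolters Kluwer appointment from the list below):

import networkx as nx

G = nx.MultiDiGraph()

# Nodes are the entities (organizations, people); edges carry the activity
# type, the role, and a pointer back to the source article.
G.add_node("Wolters Kluwer", node_type="organization")
G.add_node("Kevin Hay", node_type="person")
G.add_edge("Wolters Kluwer", "Kevin Hay",
           activity="appointment",
           role="Vice President of Sales for FRR",
           source="https://syracuse-1145.herokuapp.com/appointments")

# Exploring the data is then plain networkx traversal.
for org, person, data in G.edges(data=True):
    print(f"{org} -> {person}: {data['activity']} ({data['role']})")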

See below for a screenshot of the appointment topics extracted recently. These are available online at https://syracuse-1145.herokuapp.com/appointments

And below are the URLs for these appointment topics:

The Native Antigen Company Strengthens Senior Team to Support Growing Product Portfolio

In this example, we have a number of companies listed – some are the company that is appointing these two new individuals and some are companies where the individuals used to work. Not all the company records are equally relevant.

Wolters Kluwer appoints Kevin Hay as Vice President of Sales for FRR

The Native Antigen Company Strengthens Senior Team to Support Growing Product Portfolio (Business Wire version)

Kering boosted by report Gucci’s creative director Michele to step down

Broadcat Announces Appointment of Director of Operations

Former MediaTek General Counsel Dr. Hsu Wei-Fu Joins ProLogium Technology to Reinforce Solid-State Battery IP Protection and Patent Portfolio Strategy

HpVac Appoints Joana Vitte, MD, PhD, as Chief Scientific Officer

Highview Power Appoints Sandra Redding to its Leadership Team

Highview Power Appoints Sandra Redding to its Leadership Team (Business Wire version)

ASML Supervisory Board changes announced

Recommendation from Equinor’s nomination committee

I’m pretty impressed with this one. There are a lot of organizations mentioned in this document with one person joining and one person leaving. The system has correctly identified the relevant individuals and organization. There is some redundancy: Board Member and Board Of Directors are identified as the same role, but that’s something that can easily be cleaned up in some more post-processing.

SG Analytics appoints Rob Mitchell as the new Advisory Board Member

Similarly, this article includes the organization that Rob has been appointed to and the names of organizations where he has worked before.

SG Analytics appoints Rob Mitchell as the new Advisory Board Member (Business Wire version)

Machine Learning

ML Topic Extraction Update

This is an update to https://alanbuxton.wordpress.com/2022/01/19/first-steps-in-natural-language-topic-understanding. It’s scratching an itch I have about using machine learning to pick out useful information from text articles on topics like: who is being appointed to a new senior role in a company; what companies are launching new products in new regions etc. My first try, and a review of the various existing approaches out there, was first summarised here: https://alanbuxton.wordpress.com/2021/09/21/transformers-for-use-oriented-entity-extraction/.

After this recent nonsense about whether language models are sentient or not, I’ve decided to use language that doesn’t imply any level of consciousness or intelligence. So I’m not going to be using the word “understanding” any more. The algorithm clearly doesn’t understand the text it is being given in the same way that a human understands text.

Since the previous version of the topic extraction system I have implemented logic that uses constituency parsing and graphs in networkx to better model the relationships amongst the different entities. It went a long way towards improving the quality of the results, but the Appointment topic extraction, for example, still struggles in two particular use cases:

  • When lots of people are being appointed to one role (e.g. a lot of people being announced as partners)
  • When one person is taking on a new role that someone else is leaving (e.g. “Jane Smith is taking on the CEO role that Peter Franklin has stepped down from”)

At this point the post-processing is pretty complex. Instead of going further with this approach I’m going back to square one. I once saw a maxim along the lines of “once your rules get complex, it’s best to replace them with machine learning”. This will mean throwing away a lot of code, so emotionally it’s hard to do. And an open question is how much more labelled data the algorithm will need to learn these relationships accurately. But it will be fun to find out.

A simplified version of the app, covering Appointments (senior hires and fires) and Locations (setting up a new HQ, launching in a new location) is available on Heroku at https://syracuse-1145.herokuapp.com/. Feedback more than welcome.

Anthropology, Machine Learning

Notes on Sentience and Large Language Models

A collection of some relevant commentary about the recent story that a Google employee [was] reportedly put on leave after claiming chatbot became sentient, similar to a ‘kid that happened to know physics’.

First: the posting of the interview with the AI (it’s called LaMDA). It’s worth a read if you want to see what is sitting under all the fuss.

Second: The unequivocal counter-argument that, while this is a tremendous advance in language modelling, it is not sentience, it is just a massive statistical model that generates text based on text it has seen before:

Nonsense. Neither LaMDA nor any of its cousins (GPT-3) are remotely intelligent. All they do is match patterns, draw from massive statistical databases of human language. The patterns might be cool, but language these systems utter doesn’t actually mean anything at all. And it sure as hell doesn’t mean that these systems are sentient.

Third: The text that the AI generates depends on the text it has seen. It is very responsive to leading questions:

Is this language model sentient? Is sentience a binary matter of “yes, you are sentient” (e.g. a dog) versus “no, you aren’t” (e.g. a rock)? Or is it a matter of degree? Is a bee as sentient as a dolphin? If a machine became sentient, would humans even be able to recognise it?

We used to think that the Turing test would be a good benchmark of whether a machine could (appear to) think. We are now well past that point.

It takes me back to my anthropology studies. It turns out it’s quite tricky to define what distinguishes humans from animals. We used to think it was ‘tool use’ until we discovered primates using tools. When we saw something that passed as ‘human’ according to the definition, but was obviously not human, we changed the definition. Seems we are in a similar place with AI.

A more pressing issue than the ‘sentient or not’ debate is ‘what are we going to do about it’. It’s the ethical side of these advances which are both terrific and terrifying. With deepfake imagery and these language models, the possibilities both good and bad are hard to get your head around.

So I leave you with…

Fourth: