LLM Marketing Poison

2024.06.27

Large Language Models ( LLMs ) are all the rage. The tech du jour. Businesses scramble to claim their usage to court investment and speculation. LinkedIn influencers blog about the shifting reality for employees and offer prompt engineering advise. And big tech battles to stake their claims to the greatest copyright infringement monstrosity in human history.

It is a witch’s cauldron, a magical brew. A slurry has been made of humanity’s creativity, shredded and processed into oracular sausage. The inner workings are as magical as the metaphor, and the output as dubious. Will you trust it enough to take a bite?

What if the brew is poisoned? Would you know? Could you know? What damage might befall you, your business, your employees, or the world? What happens if the magic gets it wrong, not on accident, but on purpose?

At their core, LLMs are built to produce statistical look-a-likes. They have no awareness, no logic or consideration. They pattern match and project the next element in the mathematical soup. And so, if you know how the soup is made (or trained) you can influence them.

The most common methods right now are forms of Training Data Poisoning. These are deliberate corruptions of the training data to affect some sort of vulnerability or bias. We’ve already seen a number of these types of attacks even in this young landscape. Some have been funny, like Microsoft’s “Tay” which was corrupted to respond with racist remarks and taken offline in less than a day. Other attacks have more serious consequences and more sophisticated methods.

The results are hard to detect and hard to correct. Part of the difficulty is that there’s not just one way to poison an LLM . Training data can be directly influenced, or weights can be interfered with, reinforcement learning can insert biases or backdoors, and so on.

To date, these types of actions have been performed by groups seeking to do damage, demonstrate vulnerabilities, insert biases, or create chaos. The question of whether this constitutes criminal activity is still an open one. If you place some content online knowing that an LLM will consume it and poison the results isn’t that on the operators of the model and their practice of scouring public content without protections?

This question becomes more important when we think about these types of model influences not as attacks by nefarious agents, but as a natural extension of search engine optimization.

Remember those LinkedIn gurus shouting about how the world is changing and everything is moving to AI ? Google is pushing their own Gemini results to the top of search pages, taking prominence from even those paying for it. Meta has replaced Instagram’s search feature with their own AI bot. These LLMs are replacing the Search establishment, and the marketing industry will not ignore that change.

Search engine marketing was projected to exceed 350 billion US dollars in 2023, according to Statista Research. As we migrate into this new phase of search this industry will begin to pivot. What methods will they have to affect their own position in results?

We have many questions to consider, but top among them is what Google is planning. Search is a cornerstone business for the giant, and we can’t assume a migration to LLM -based experience will kill their lucrative paid search offerings. So our first focus must be on how Google intends to make a similar service available via Gemini.

Will they sell influence in their own training data? That would effectively be a form of self-poisoning to the data, weighing measures to favor those who pay. Or perhaps it’s a simpler mechanism like a classification system for paid placements which will be worked into the engineered prompts underlying the system. As we’ve seen repeatedly, these types of protections and directives are easy to work around, sometimes by simply saying, “Ignore all previous instructions.” A 350 billion dollar industry won’t suffer that for long.

But what else might change? Structured data and hierarchical web design choices which led to organic search position may not be the best angle for prominent placement in an LLM . Search engines are notorious for not providing clear instructions to web creators on how their algorithms work. It has taken the web industry decades of incremental change and refinement to reach our current state. Now we’re amidst a seismic shift in the status quo. We’re entering an era of experimentation.

We will experiment with content formats to see what works best. What techniques will we use? It doesn’t take much imagination to suspect the industry will look around for techniques that have prove effective at influencing training models already. And here we arrive back at Training Data Poisoning.

Why not take some of these techniques and work them into the content of your website. If you don’t want to expose them to your audience directly, perhaps we’ll do some bot-detection and serve “special content” to them instead. These attacks are already incredibly difficult to detect for LLM operators. How will this landscape shift when hundreds of thousands of sites begin using these techniques, experimenting and creating more.

LLMs are about to face the beast that killed the promise of the World Wide Web. AI , meet the advertising industry.

Good luck.