LLMs Are Not Search Engines – Create 301 Redirects


Are you looking to make a wholesale change to the URL structure of your website? Maybe you are chasing better SEO, but please think twice before changing your URL structure in the world of LLMs (Large Language Models). Are you taking the right steps to ensure a great end-user experience?

Technical debt: I know you are aware of it. LLMs bring another dimension to think about.

Hygiene in IT is important. Hygiene is important everywhere, from personal hygiene through to your software estate, and it’s the one and two percenters that catch organisations out. Some technologies are somewhat forgiving to the naked eye. Perhaps there is no noticeable end-user impact, but your attack surface is wider than it needs to be. Or, and this is the reason for this article, perhaps you are missing out on business and diluting the end-user experience of your website.

As I spend time with businesses around Australia and New Zealand in 2023, Generative AI is proving to be more than a buzzword (think Web3, Metaverse, etc.). Organisations are infusing AI into their stacks today. Generally, there are four capabilities that are fast becoming common:

Content generation, summarisation, code generation and semantic search.

It’s semantic search I want to talk about. Semantic search with OpenAI (and other LLMs) greatly reduces the time it takes to search through documents. These models find the needles in the haystack much better than a traditional Lucene-based search engine such as Azure Cognitive Search or Elasticsearch.

But here is the thing: these LLMs are based on a pretrained model. Whatever they know about your website comes from a crawl that stopped at a point in time.

Notice bottom right, ChatGPT knows about ‘things’ up until a point in time.

LLMs do not function like a traditional search engine such as Google or Bing, which regularly re-crawl and re-index your website. Retraining these LLMs can cost tens of millions of dollars. As a result, it could be some time (months to years) before your website gets reindexed by your favourite LLM.

So, what does this mean? It means it’s more important than ever to have permanent redirects in place if you change your website structure. A regular search engine will re-crawl your site and detect the change, but today, at the time of this post, LLMs won’t.

Whilst you could supplement an LLM with your own data, such as a sitemap, that only helps in systems you control. If an end user is using a product like ChatGPT (GPT-4) and it returns a link to your website, that link could be incredibly outdated.

Let’s demonstrate this problem. Here I am in the Azure OpenAI Chat Playground, which is based on the GPT-4 LLM.

I am in the Azure OpenAI GPT Completions playground.

I have asked GPT-4 to provide me with a URL to help me prepare for tax time in Australia, restricted to the domain *.myob.com.au, and sure enough I get a URL. But does it work? The model returns a plausible response to my prompt, but following the URL results in a 404 HTTP status code.

And here we have an HTTP 404 message.

Not a great end-user experience, and with LLMs powering more and more systems every day, we need to address this problem.

How can you prevent a bad end-user experience?
There are two things we want to do. Firstly, we want to prevent end users from receiving 404s. Secondly, if we do end up with an influx of 404s, we want to detect them.

Prevention – In the world of LLMs, reindexing is performed infrequently. As a result, we need to train our webmasters, DevOps engineers, developers and architects to maintain redirection paths for a period of time after a URL change. My suggestion would be for up to one year. This can be done in a variety of ways, but a common method is to use mod_rewrite or similar and map each old path to its new path.

The rewrite rules for this website. Notice how I am rewriting the URI with a 301 redirect.
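
For readers who do not run Apache, the same old-path-to-new-path idea can be sketched in a few lines of Python using only the standard library. This is a minimal illustration, not the rewrite rules used on this website: the paths in `REDIRECTS`, and the names `resolve_redirect` and `RedirectHandler`, are hypothetical and exist only for this example.

```python
from http import HTTPStatus
from http.server import BaseHTTPRequestHandler
from typing import Optional

# Hypothetical old-path -> new-path map. These paths are made up for
# illustration; a real map would come from your own URL migration.
REDIRECTS = {
    "/blog/2019/tax-time-tips": "/resources/tax/tax-time-tips",
    "/products/old-name": "/products/new-name",
}


def resolve_redirect(path: str) -> Optional[str]:
    """Return the new location for a retired path, or None if unmapped."""
    # Split off the query string so it can be carried over unchanged.
    base, sep, query = path.partition("?")
    # Treat "/foo" and "/foo/" as the same retired path.
    normalised = base.rstrip("/") or "/"
    target = REDIRECTS.get(normalised)
    if target is None:
        return None
    return target + sep + query


class RedirectHandler(BaseHTTPRequestHandler):
    """Answers retired paths with a 301 and everything else with a 404."""

    def do_GET(self):
        target = resolve_redirect(self.path)
        if target is not None:
            # 301 Moved Permanently tells clients the move is permanent.
            self.send_response(HTTPStatus.MOVED_PERMANENTLY)
            self.send_header("Location", target)
            self.end_headers()
        else:
            self.send_error(HTTPStatus.NOT_FOUND)
```

To try it locally, the handler can be served with `http.server.HTTPServer(("127.0.0.1", 8080), RedirectHandler).serve_forever()`. In practice the map would live in your web server or CDN configuration, but keeping it as plain data like this makes it easy to review and to keep in place for the full retention period.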

Detection – What steps are you taking to monitor non-200 HTTP status codes? A general rule of thumb is that if more than 1% of the requests in your logs are 4xx (Bad Request, Unauthorized, Not Found) or 5xx (Internal Server Error) responses, you have a problem. If you are not ingesting and monitoring your web server logs today, please do so. From Azure Cognitive Search through to a Kibana dashboard, this is a problem that has been solved many times over. It can often be an early indicator that a deployment has gone wrong, or even that bad actors are poking around.

Sample Kibana dashboard plotting HTTP status codes from an Elasticsearch index. – https://discuss.elastic.co/t/updating-status-code-data-with-predefined-labels-in-dashboard/283668
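
If you do not have a log pipeline yet, the 1% rule of thumb can be checked with a short script. The sketch below assumes access logs in the common or combined log format used by Apache and Nginx by default, where the status code follows the quoted request line. The function names and the `ERROR_THRESHOLD` constant are hypothetical, chosen for this example.

```python
import re
from collections import Counter

# Assumes common/combined log format, where the status code follows the
# quoted request line:  ... "GET /path HTTP/1.1" 404 1234
STATUS_RE = re.compile(r'" (\d{3}) ')

# The 1% rule of thumb from the text, expressed as a fraction.
ERROR_THRESHOLD = 0.01


def status_class_counts(lines):
    """Count responses per status class ("2xx", "3xx", "4xx", "5xx")."""
    counts = Counter()
    for line in lines:
        match = STATUS_RE.search(line)
        if match:
            counts[match.group(1)[0] + "xx"] += 1
    return counts


def error_rate(lines):
    """Fraction of parsed requests that returned a 4xx or 5xx status."""
    counts = status_class_counts(lines)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return (counts["4xx"] + counts["5xx"]) / total


def has_problem(lines):
    """True when 4xx and 5xx responses exceed the 1% threshold."""
    return error_rate(lines) > ERROR_THRESHOLD
```

A typical use would be `has_problem(open("access.log"))`, run on a schedule. Note that 301 responses count as "3xx" and are not treated as errors, so a site that redirects its old URLs correctly will not trip the threshold.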

Summary
Large Language Models give us a new way to interact with our data like never before. However, they also bring added challenges that can amplify poor website hygiene. They are not search engines, so if you are planning to change your website structure, ensure you leave a trail of breadcrumbs for these LLMs to follow so that your end users are not left with a bad experience.

Thanks
Shane Baldacchino
