AI development is shifting to an important new phase, with AI agents that can pull fresh information from the web, plan, and execute tasks autonomously. These agents, instead of relying solely on their training data, can interact with the internet to find and make use of the latest data signals.
This kind of dynamic exploration is leading to tremendous advances in AI-powered task automation. With access to dynamic information, AI agents can perform all manner of actions that were previously impossible, like booking tickets for a concert or football match the moment they go on sale, monitoring a website for new updates, buying and selling stocks based on up-to-the-minute market data, or making supply chain sourcing adjustments based on changing weather conditions.
Evolving fast amid dynamic conditions
Whereas the earliest LLMs could only draw on their training data, models that use in-context learning can make decisions based on a sequence of prompts provided during a given session, without any changes to their underlying parameters. The AI learns from the context of its conversation with the user, making it more adaptable and able to respond more accurately, even to vague queries.
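To illustrate, here is a minimal sketch of in-context learning, assuming an OpenAI-style chat completions API; the model name and the classification examples are placeholders. The examples live entirely in the prompt, so nothing about the model's weights changes.

```python
# Minimal in-context learning sketch, assuming an OpenAI-style chat API.
# The model name and examples are placeholders; no model weights are updated.
from openai import OpenAI

client = OpenAI()  # assumes an API key is configured in the environment

messages = [
    {"role": "system", "content": "Classify support tickets as 'billing' or 'technical'."},
    # Few-shot examples supplied purely in context:
    {"role": "user", "content": "I was charged twice this month."},
    {"role": "assistant", "content": "billing"},
    {"role": "user", "content": "The app crashes when I upload a photo."},
    {"role": "assistant", "content": "technical"},
    # The new, possibly vague query the model must handle:
    {"role": "user", "content": "My payment didn't go through again"},
]

response = client.chat.completions.create(model="gpt-4o-mini", messages=messages)
print(response.choices[0].message.content)  # expected: "billing"
```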
The advance is complemented by Retrieval-Augmented Generation (RAG), which allows LLMs to pull in more recent information from external databases, enhancing the quality of their output.
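The retrieval step behind RAG can be sketched in a few lines of Python. The embedding function below is a toy stand-in for a real embedding model, included only to keep the example self-contained; retrieved passages are simply prepended to the prompt.

```python
import numpy as np

def embed(text: str, dim: int = 256) -> np.ndarray:
    """Toy stand-in for a real embedding model (hashes words into a fixed vector).
    In production this would call an embedding model or API."""
    vec = np.zeros(dim)
    for word in text.lower().split():
        vec[hash(word) % dim] += 1.0
    return vec

def retrieve(query: str, documents: list[str], k: int = 3) -> list[str]:
    """Return the k documents most similar to the query by cosine similarity."""
    q = embed(query)
    return sorted(
        documents,
        key=lambda doc: -float(
            np.dot(q, embed(doc)) / (np.linalg.norm(q) * np.linalg.norm(embed(doc)) + 1e-9)
        ),
    )[:k]

def build_prompt(query: str, documents: list[str]) -> str:
    """Prepend the retrieved passages so the LLM answers from fresher information."""
    context = "\n\n".join(retrieve(query, documents))
    return f"Answer using the context below.\n\nContext:\n{context}\n\nQuestion: {query}"
```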
Exploration is the latest technique for improving LLMs with external knowledge, and it involves letting AI agents explore and engage with their environment so they can complete tasks autonomously on behalf of users.
Instead of static information retrieval, AI agents can navigate the world’s biggest source of unstructured data – the internet – to collect information in real-time and then work out how to perform actions in that environment, such as clicking on buttons and filling in forms. By combining these explorative abilities with reasoning and decision-making skills, agents can become much more human-like in their ability to autonomously perform tasks.
Using the internet
For AI agents to search and interact with the web in this way, they need two skills – planning and execution. Planning is the AI agent’s ability to explore the web, find the right website, understand it and then correctly determine the sequence of actions it must perform to complete an assigned task.
Execution refers to the AI agent’s ability to identify and interact with the essential web elements necessary to carry out actions.
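Conceptually, the two skills form a loop: a planner decides the next step, and an executor carries it out on the page. The sketch below assumes a hypothetical llm_plan_next_action helper that asks an LLM to return its chosen action as structured JSON; the action vocabulary is illustrative.

```python
# Simplified planning/execution loop. llm_plan_next_action and the action
# vocabulary are hypothetical placeholders for an LLM-backed planner.
from typing import Callable

def llm_plan_next_action(task: str, page_state: dict) -> dict:
    """Hypothetical planner: ask an LLM for the next action as structured JSON,
    e.g. {"type": "click", "selector": "#buy-now"} or {"type": "done"}."""
    raise NotImplementedError("wire this up to an LLM of your choice")

def run_agent(task: str,
              executor: dict[str, Callable[[dict], None]],
              get_page_state: Callable[[], dict],
              max_steps: int = 20) -> None:
    """Alternate between planning (deciding the next step) and execution
    (performing it on the page) until the task is done or a step limit is hit."""
    for _ in range(max_steps):
        action = llm_plan_next_action(task, get_page_state())  # planning
        if action.get("type") == "done":
            break
        executor[action["type"]](action)                       # execution
```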
The earliest AI agents relied on application programming interfaces to interact with websites, but they were severely hampered by the limited range of API calls on offer. If the API for a web service didn’t support an essential function, there was no way for the AI agent to utilise that service properly.
By planning and executing, AI agents can get around those limitations. When we talk about “planning,” it’s as if we are giving our AI agents a mouse and a keyboard, so they can interact with the web as a human would. Browser testing platforms like Playwright and Selenium provide a great framework for doing this. They’re the tools most web developers use to check that their websites’ functions (like user logins, search bars, shopping carts and so on) work as expected, and they do this by running Python scripts that simulate user actions.
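As a rough illustration, here is what “giving the agent a mouse and keyboard” looks like with Playwright’s Python API; the URL and selectors are placeholders that would differ from site to site.

```python
# Minimal Playwright sketch simulating user actions; the URL and selectors
# are placeholders and would differ for a real site.
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://example.com")                  # navigate like a user
    page.fill("input[name='q']", "concert tickets")   # type into the search box
    page.click("button[type='submit']")               # click the search button
    page.wait_for_selector("#results")                # wait for results to render
    print(page.inner_text("#results"))                # read back what the page shows
    browser.close()
```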
Alongside this framework, AI agents also need access to a specialised browser. While older solutions like Anthropic’s Computer Use are able to control one user’s device at a time, accessing information at scale requires cloud-based scalability.
Bright Data’s Scraping Browser supports unlimited concurrent sessions for continuous and mass sourcing of internet data, using integrations with script management platforms and APIs for granular control. It provides a way to ethically sidestep the bot blocking tools used by social media and e-commerce platforms to prevent autonomous traffic from parsing their web pages. It uses techniques including advanced Captcha solvers, browser fingerprinting, automated retries, and Bright Data’s fleet of 150 million residential proxy IP addresses to get around roadblocks and search every corner of the internet, uninhibited.
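Remote browsers of this kind are typically driven through the same automation frameworks. One common way to attach to a cloud-hosted Chromium instance is Playwright’s connect_over_cdp; the endpoint address and credentials below are placeholders that would come from the provider.

```python
# Attaching Playwright to a remote, cloud-hosted browser over CDP.
# The endpoint is a placeholder; the real address and credentials come
# from the cloud browser provider.
from playwright.sync_api import sync_playwright

CDP_ENDPOINT = "wss://USER:PASS@remote-browser.example.com:9222"  # placeholder

with sync_playwright() as p:
    browser = p.chromium.connect_over_cdp(CDP_ENDPOINT)
    page = browser.new_page()
    page.goto("https://example.com")
    print(page.title())
    browser.close()
```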
To support “execution,” scripts can convert the text, images, and other information from web pages into a structured data format that LLMs can process more easily and more deterministically. This allows AI agents to process the various options and functions on each web page so they can work out how to perform the required actions.
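One way to produce that structured view is to walk the page and emit its interactive elements as JSON for the LLM to reason over. The sketch below uses Playwright element queries; the chosen tags and fields are illustrative rather than exhaustive.

```python
# Sketch: summarise a page's interactive elements as JSON an LLM can reason over.
# The selected tags and fields are illustrative, not exhaustive.
import json
from playwright.sync_api import Page

def page_to_structured_data(page: Page) -> str:
    elements = []
    for el in page.query_selector_all("a, button, input, select"):
        elements.append({
            "tag": el.evaluate("e => e.tagName.toLowerCase()"),
            "text": (el.inner_text() or "").strip()[:80],
            "name": el.get_attribute("name"),
            "href": el.get_attribute("href"),
        })
    return json.dumps({"url": page.url, "elements": elements}, indent=2)
```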
While computer vision-based tools make it possible to identify buttons and read menus by taking screenshots of a web page, they don’t always pick up on more dynamic, embedded elements, which leaves AI agents ineffective at many tasks. By parsing each website’s underlying code instead, developer teams can reuse the same Python scripts they already rely on to test website functions, so agents aren’t limited to what is visible in a screenshot.
Next-level automation
When creating AI agents for searching and interacting with the web, it’s important to remember the distinction between planning and execution. The former refers to the higher-level strategy, essentially governing how the agent “thinks,” while the latter is all about “doing,” governing the steps required to complete tasks. Failing to maintain that distinction can significantly impact an AI agent’s effectiveness.
The next generation of AI agents will bring about a dramatic leap in performance as they move beyond a reliance on static information and set about actively exploring and interacting with the web. By combining the ability to actively search and understand internet pages with the skills required to manipulate what they’re seeing, AI agents can become an order of magnitude more capable than anything we’ve seen to date.