May 22, 2024

Introducing Fuji-Web

Authors:
Mengdi Chen, Lingjie Feng and Ary Swaminathan

AI agents are transforming how we interact with computers and the digital world. We believe that in the near future, AI agents powered by the latest advances in Large Language Models (LLMs) will become highly capable personal assistants that streamline our daily tasks, transform the way we work, and make the world more accessible to everyone.

We’ve been enabling complex workflow augmentation for some of the most sensitive industrial and advanced manufacturing applications around the globe. We have discussed the challenges of attaining the level of sophisticated reasoning and calibrated uncertainty-awareness required for decision-making in these high-stakes environments.

But in the end, our AI can act after it reasons – often across clunky internal enterprise tools and diverse web environments (Jira, internal wikis, operator notes, SOPs, and specifications). This creates an opportunity to rethink workflow interfaces and provide transformative experiences.

To this end, we are releasing and open-sourcing Fuji-Web, a tool that redefines web interaction. It is based on what we’ve learned from our research on building agents enabled by world models. Fuji-Web also serves as an initial preview of experimental approaches taken in our enterprise product roadmap.

Fuji-Web is a research preview, not an official Normal Computing product. Expect bugs and sharp edges. Please help by trying it out, reporting bugs, and letting us know what you think!

What is Fuji-Web?

Fuji-Web is an intelligent AI partner that understands the user’s intent, navigates websites autonomously, and executes tasks on the user’s behalf while explaining each action step. With Fuji-Web, we are able to demonstrate how our multi-modal parsing and annotation scheme achieves state-of-the-art performance (see Benchmarks). It harnesses vision, DOM awareness, and semantic HTML understanding to focus on essential webpage elements while filtering out noise. This, combined with a “Prior Knowledge Augmentation” knowledge graph that encodes a high-level understanding of websites based on past experiences, allows Fuji-Web to function like an expert assistant with a guidebook for each site.

As an important component of Normal Computing’s overall efforts to deliver AI partners for high-stakes applications, we aim to develop a reliable and effective agent layer. Fuji-Web is part of this journey to explore AI capabilities by building and working with the open-source community.

Technical Overview

Beneath Fuji-Web’s simple UI, which lives in the side panel of your browser, is an agent powered by an LLM and supported by sensors and actuators. While the agent is, at its core, a very simple ReAct agent, we invested a lot of effort to ensure that the sensors provide relevant information and the actuators reliably complete tasks.
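To make the sensor, agent, and actuator split concrete, here is a minimal sketch of a ReAct-style loop. Every name in it (AgentAction, AgentDeps, runAgent) is hypothetical and not taken from the Fuji-Web codebase, and the model call is shown as a synchronous function only to keep the sketch short; a real LLM call would be asynchronous.

```typescript
// Hypothetical shapes for illustration; not Fuji-Web's actual types.
type AgentAction =
  | { kind: "click"; elementId: number }
  | { kind: "type"; elementId: number; text: string }
  | { kind: "scroll"; direction: "up" | "down" }
  | { kind: "finish"; summary: string };

interface AgentStep {
  thought: string; // the model's reasoning for this step
  action: AgentAction; // the action it chose
}

interface AgentDeps<Obs> {
  sense: () => Obs; // sensor: capture the current page state
  decide: (task: string, obs: Obs, past: AgentStep[]) => AgentStep; // stand-in for the LLM call
  act: (action: AgentAction) => void; // actuator: perform the action on the page
}

// Reason-then-act loop: observe, let the model think and pick an action,
// execute it, and repeat until the model finishes or the step budget runs out.
function runAgent<Obs>(task: string, deps: AgentDeps<Obs>, maxSteps = 10): AgentStep[] {
  const past: AgentStep[] = [];
  for (let i = 0; i < maxSteps; i++) {
    const obs = deps.sense();
    const step = deps.decide(task, obs, past);
    past.push(step);
    if (step.action.kind === "finish") break;
    deps.act(step.action);
  }
  return past;
}
```

Keeping the loop this small pushes the complexity into sense and act, which matches where the effort described above was spent. The step budget is an assumption added so the sketch cannot loop forever.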

Rather than sending an entire HTML string or a raw screenshot to a language model, as traditional methods do, Fuji-Web takes a different approach. At a high level, it sends a clean screenshot alongside an annotated screenshot, together with textual descriptions of the important interactive elements, such as input fields and buttons. Emphasizing these elements, which are crucial for interaction through clicks and typing, aims to simplify web navigation for the model.
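As a rough sketch of how such a prompt might be assembled, the snippet below pairs the two screenshots with one text line per labeled element. The LabeledElement and PromptPart types and the composePrompt function are assumptions made for illustration; they do not reflect Fuji-Web's real data model or any particular provider's message format.

```typescript
// Hypothetical types for illustration; not Fuji-Web's actual data model.
interface LabeledElement {
  label: number; // number drawn next to the element on the annotated screenshot
  role: string; // e.g. "button" or "textbox"
  text: string; // visible text or accessible name
}

type PromptPart =
  | { type: "text"; text: string }
  | { type: "image"; dataUrl: string };

// One line per element, so the model can tie each label in the image to its meaning.
function describeElements(elements: LabeledElement[]): string {
  return elements
    .map((e) => `[${e.label}] ${e.role}: "${e.text}"`)
    .join("\n");
}

// Order: task, clean screenshot, annotated screenshot, then the element list.
function composePrompt(
  task: string,
  cleanScreenshot: string,
  annotatedScreenshot: string,
  elements: LabeledElement[],
): PromptPart[] {
  return [
    { type: "text", text: `Task: ${task}` },
    { type: "image", dataUrl: cleanScreenshot },
    { type: "image", dataUrl: annotatedScreenshot },
    { type: "text", text: `Interactive elements:\n${describeElements(elements)}` },
  ];
}
```

The numeric labels are what let the model refer to an element by a short identifier instead of a selector or pixel coordinates.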

Note: all prompts (text and image) are sent directly to the API of your selection. Fuji-Web does not attempt to collect any information from you.

Highlights

Full browser automation right in your side panel

Harness the power of multi-modal Large Language Models to navigate and manage web tasks with simple, intuitive user commands. Unlike many other web agents, Fuji-Web can be installed as an extension in the browser you use every day and called on any time you need it. Even in the middle of a task, you can hand it off to the agent and watch it take care of the rest.

Prior Knowledge Augmentation system

Fuji-Web navigates websites using knowledge from past experience, which improves its understanding of web dynamics. Users can customize and inject domain-specific insights in real time by adding instructions in the settings menu. In the future, we will support more mechanisms to help the agent understand and use websites, including a custom text/CSS selector and JS execution.
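One way to picture this is a lookup from hostname to a list of notes, merged with whatever the user typed into the settings menu. The sketch below is only an illustration under that assumption: the SiteKnowledge shape, the "*." wildcard convention, and the function names are all hypothetical, and the source describes the real system as a knowledge graph rather than a flat list.

```typescript
// Hypothetical structure; the real Prior Knowledge Augmentation format may differ.
interface SiteKnowledge {
  hostPattern: string; // exact hostname, or "*.example.com" to match any subdomain
  notes: string[];
}

function matchesHost(pattern: string, hostname: string): boolean {
  if (pattern.slice(0, 2) === "*.") {
    const suffix = pattern.slice(1); // "*.example.com" -> ".example.com"
    return hostname.slice(-suffix.length) === suffix;
  }
  return hostname === pattern;
}

// Built-in notes for the current site come first, then the user's own instructions.
function collectNotes(
  hostname: string,
  knowledge: SiteKnowledge[],
  userInstructions: string[],
): string[] {
  const builtIn = knowledge
    .filter((k) => matchesHost(k.hostPattern, hostname))
    .reduce<string[]>((acc, k) => acc.concat(k.notes), []);
  return builtIn.concat(userInstructions);
}

// Render the notes as a text section that can be appended to the prompt.
function renderNotes(notes: string[]): string {
  if (notes.length === 0) return "";
  return "Site notes:\n" + notes.map((n) => `- ${n}`).join("\n");
}
```

In this sketch a wildcard pattern matches subdomains only, not the bare domain, which is a design choice made here for simplicity.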

Improving website understanding

When annotating a website and composing the prompt, Fuji-Web carefully filters for the relevant elements. This is critical in ensuring a high success rate.

Figure: side-by-side annotation of mcmaster.com used in the prompt

The project leverages HTML semantics and WAI-ARIA roles to identify these interactive components accurately.


| Website | Elements found with interactive HTML tags | Elements found with interactive HTML tags + WAI-ARIA roles |
| --- | --- | --- |
| amazon.com | 534 | 547 |
| twitter.com | 56 | 121 |
| github.com | 1364 | 1446 |
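The two columns above correspond to two detection rules: tag name alone, and tag name plus ARIA role. A minimal sketch of that distinction follows. The tag and role lists are illustrative picks from standard HTML and WAI-ARIA, not Fuji-Web's actual lists, and ElementInfo is a hypothetical plain-data stand-in for a DOM element.

```typescript
// Illustrative lists only; Fuji-Web's actual tag and role sets may differ.
const INTERACTIVE_TAGS = ["a", "button", "input", "select", "textarea", "summary"];
const INTERACTIVE_ROLES = [
  "button", "link", "checkbox", "radio", "switch", "tab",
  "menuitem", "option", "textbox", "combobox", "searchbox", "slider",
];

// Hypothetical plain-data view of an element.
interface ElementInfo {
  tag: string; // tag name, e.g. "button"
  role?: string; // value of the role attribute, if present
}

// Rule 1: interactive by HTML tag alone.
function isInteractiveTag(el: ElementInfo): boolean {
  return INTERACTIVE_TAGS.indexOf(el.tag.toLowerCase()) !== -1;
}

// Rule 2: interactive by tag, or by an ARIA role that signals interactivity.
function isInteractive(el: ElementInfo): boolean {
  if (isInteractiveTag(el)) return true;
  return el.role !== undefined && INTERACTIVE_ROLES.indexOf(el.role.toLowerCase()) !== -1;
}

// Counts under both rules, mirroring the two columns of the table.
function countInteractive(els: ElementInfo[]): { tagsOnly: number; tagsAndRoles: number } {
  return {
    tagsOnly: els.filter(isInteractiveTag).length,
    tagsAndRoles: els.filter(isInteractive).length,
  };
}
```

A div with role="button" is the typical case the second rule recovers, which is consistent with the large gap on twitter.com in the table.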

Fuji-Web not only captures standard interactive elements but also keeps irrelevant elements off the list. It assesses the current state and computed styles of DOM elements to separate essential elements from redundant, invisible, or obscured ones, which improves the relevance of the information fed into the AI.

These nuances, including how Fuji-Web manages to discern and ignore overlaid or hidden elements, exemplify the overall approach to ensuring the agent’s interactions are as human-like as possible.

Figure: comparison of annotated elements on twitter.com when post drafting popover is shown. Left: without “top-layer element only” filter. Right: with “top-layer element only” filter
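The visibility and top-layer filtering described above can be sketched on plain data as follows. This is a simplified model under stated assumptions: PageElement is a hypothetical flat record with no nesting, and topmostAt is a stand-in for a browser hit test (in a real page, document.elementFromPoint would be the likely counterpart). It is not Fuji-Web's implementation.

```typescript
interface Box { x: number; y: number; width: number; height: number }

// Hypothetical flat view of an element with a few computed-style values.
interface PageElement {
  id: number;
  box: Box;
  zIndex: number;
  display: string;
  visibility: string;
  opacity: number;
}

// An element counts as rendered if its styles do not hide it and it has a non-zero size.
function isRendered(el: PageElement): boolean {
  return el.display !== "none" && el.visibility !== "hidden" && el.opacity > 0 &&
    el.box.width > 0 && el.box.height > 0;
}

function containsPoint(box: Box, px: number, py: number): boolean {
  return px >= box.x && px < box.x + box.width && py >= box.y && py < box.y + box.height;
}

// Stand-in for a hit test: highest z-index wins, later elements win ties.
function topmostAt(all: PageElement[], px: number, py: number): PageElement | undefined {
  let best: PageElement | undefined;
  for (const el of all) {
    if (!isRendered(el) || !containsPoint(el.box, px, py)) continue;
    if (best === undefined || el.zIndex >= best.zIndex) best = el;
  }
  return best;
}

// Keep only rendered elements whose center is not covered by something else.
function keepTopLayer(all: PageElement[]): PageElement[] {
  return all.filter((el) => {
    if (!isRendered(el)) return false;
    const cx = el.box.x + el.box.width / 2;
    const cy = el.box.y + el.box.height / 2;
    const hit = topmostAt(all, cx, cy);
    return hit !== undefined && hit.id === el.id;
  });
}
```

Because the model is flat, a real implementation would also need to treat a hit on a descendant as a hit on the element itself; that case is left out here.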

Benchmarks

We compared Fuji-Web’s ability to complete real-world tasks against WebVoyager’s, using the benchmark proposed by the WebVoyager authors. As of today, we have finished running and evaluating the tasks on 7 websites. Fuji-Web achieves a higher success rate than WebVoyager on 6 of the 7 (all except Apple). Results are shown in the following table:

| Agent | Allrecipes | ArXiv | Apple | Google Search | BBC News | Github | Cambridge Dictionary |
| --- | --- | --- | --- | --- | --- | --- | --- |
| GPT-4 (All Tools) | 11.1% | 17.1% | 44.2% | 60.5% | 9.5% | 48.8% | 25.6% |
| WebVoyager | 53.3% | 51.2% | 65.1% | 76.7% | 61.9% | 63.4% | 65.1% |
| Fuji-Web | 64.4% | 65.1% | 60.4% | 81.4% | 76.2% | 73.2% | 86.0% |

Table: The main results for Fuji-Web. GPT-4 (All Tools) and WebVoyager success rates are as reported in the WebVoyager paper (last revised Feb 29, 2024, using the GPT-4V model). Fuji-Web was benchmarked using the GPT-4o model.
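For readers who want a single summary number, the per-site rates in the table can be averaged. The sketch below computes an unweighted mean across the 7 sites and lists the sites where one agent is ahead of another. Note the assumption: an unweighted mean treats every site equally, so it only equals an overall task success rate if each site has the same number of tasks, and it is not a figure reported in the source.

```typescript
// Success rates (%) copied from the table above, in site order.
const BENCHMARK_SITES = [
  "Allrecipes", "ArXiv", "Apple", "Google Search", "BBC News", "Github", "Cambridge Dictionary",
];
const GPT4_ALL_TOOLS = [11.1, 17.1, 44.2, 60.5, 9.5, 48.8, 25.6];
const WEBVOYAGER = [53.3, 51.2, 65.1, 76.7, 61.9, 63.4, 65.1];
const FUJI_WEB = [64.4, 65.1, 60.4, 81.4, 76.2, 73.2, 86.0];

// Unweighted mean over sites (every site counts equally).
function unweightedMean(rates: number[]): number {
  return rates.reduce((sum, r) => sum + r, 0) / rates.length;
}

// Names of the sites where agent `a` has a strictly higher rate than agent `b`.
function sitesWhereAhead(a: number[], b: number[]): string[] {
  return BENCHMARK_SITES.filter((_, i) => a[i] > b[i]);
}
```

With these figures, the unweighted means come out to roughly 72.4% for Fuji-Web, 62.4% for WebVoyager, and 31.0% for GPT-4 (All Tools), and Fuji-Web is ahead of WebVoyager on every site except Apple.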

We plan to release a more detailed report later.

Limitations

Fuji-Web relies on technology similar to that used by screen readers. Pages whose markup is poorly structured, with incorrect tags, roles, or other HTML elements, may confuse Fuji-Web. This is not only an issue for Fuji-Web but a broader issue for accessibility.

Fuji-Web will not work well on sites that deviate from standard HTML tag usage or on complex graphics-based websites.

Missing Semantics

The annotation approach Fuji-Web adopts does have its limitations. Websites with poor semantic practices might lead to oversight of some interactable elements that deviate from standard HTML tag usage and do not provide additional semantics via WAI-ARIA.

In other words, the more a website is accessible to users with physical or mental impairments, the better Fuji-Web can understand and use it.

For more information about accessibility, please refer to MDN’s Accessibility guide.

Non-semantic Web Technologies

As explained previously, Fuji-Web heavily relies on semantic information from HTML. Some complex applications (e.g., Google Sheets) use Canvas and WebGL to draw graphics, text and buttons, so Fuji-Web is not able to interact with a large part of the UI.

Limited Interaction Types

To reduce complexity, the interactions Fuji-Web can perform on the webpage are limited. Currently one such limitation is that Fuji-Web can only scroll the entire webpage – it can get stuck on dropdowns with many options, or when it needs to scroll only a portion of the page. Drag-and-drop is another example. We are actively researching solutions for these cases that preserve the simplicity of the agent layer.

Fix: If you encounter these issues with Fuji-Web, try using the “instructions” feature to work around them. For example, if an “Add to Cart” button on a product search page doesn’t work due to missing tags, you can use the instructions to tell Fuji-Web that there is another “Add to Cart” button on the product details page, which it can reach by clicking through. More detailed instructions will allow Fuji-Web to get around obstacles and complete the task more easily.

Future Work

We believe that Fuji-Web is a useful tool for automating online tasks as well as a state-of-the-art web agent component for more complex agentic systems. We plan to continue improving Fuji-Web in our quest to build reliable AI partners for critical workflows.

Supporting Programmatic Usage

We plan to enhance Fuji-Web’s programmability by exposing a JS API. This would facilitate integration with browser automation frameworks like Puppeteer, Playwright, and Selenium. If this sounds exciting, please shoot us a message in our Discord to help us prioritize!

Providing such an API enables:

  • Automatically benchmarking the agent’s performance across various tasks and scenarios.
  • Leveraging Fuji-Web as a sub-agent of a more complex agentic system, such as a fully autonomous agent that has long-term memory and can solve complex tasks (as in our internal usage and productization)
  • Linking Fuji-Web tasks to signals (e.g., every day at 8 a.m., whenever you receive an email, etc.)
  • Using Fuji-Web as a service on the cloud

Supporting cross-tab workflows

Most complex tasks exist across multiple applications and require deep context of the world. By supporting cross-tab workflows, the agent will be able to retrieve information across different tabs and utilize it for tasks such as responding to questions via messages, summarizing from multiple sources, etc. Additionally, this would allow Fuji-Web to continue to work correctly when the user navigates to other tabs, making the “background task” seamless.

Copilot Mode

The web can be a complex space – Fuji-Web can sometimes get stuck. For example, some pages require the user to log in or type in a verification code. In other scenarios, we may want to instruct Fuji-Web to pause before proceeding (e.g., review the cart before placing an order). The next step would be for Fuji-Web to proactively seek input from the user.
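A simple way to picture this is a gate that runs before each planned click and decides whether to proceed or hand control back to the user. The sketch below is purely hypothetical: the phrase list, the PlannedClick shape, and the gateStep function are assumptions for illustration and do not describe a shipped Fuji-Web feature.

```typescript
type GateDecision = "proceed" | "ask-user";

// Hypothetical description of the click the agent is about to perform.
interface PlannedClick {
  targetText: string; // visible text of the element to be clicked
}

// Illustrative phrases that suggest an irreversible or sensitive action.
const SENSITIVE_PHRASES = ["place order", "pay now", "delete", "confirm purchase"];

// Ask the user when the page needs credentials or a code, or when the click looks sensitive.
function gateStep(step: PlannedClick, pageNeedsUserInput: boolean): GateDecision {
  if (pageNeedsUserInput) return "ask-user";
  const text = step.targetText.toLowerCase();
  const sensitive = SENSITIVE_PHRASES.some((p) => text.indexOf(p) !== -1);
  return sensitive ? "ask-user" : "proceed";
}
```

For instance, a click on a "Place Order" button would return "ask-user" here, which corresponds to reviewing the cart before an order is placed.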

Long-Term & Decentralized Memory

As you might imagine, Fuji-Web can get confused sometimes (like we all do). But it is clear from many experiments that if you include more details, inform it where things can be found, or remind it to double-check your task prompt, it actually can get much more done! We built a “Prior Knowledge Augmentation” system in Fuji-Web, as an initial attempt to solidify these kinds of in-prompt instructions and hints into long-term memory, so that you don’t have to repeatedly make the same correction.

As of now, this feature is a bit hidden and it’s rather manual to collect this kind of knowledge. Here are some ideas in our backlog to accelerate building the knowledge augmentation system:

  • Add support for saving tasks
  • Add support for sharing tasks & instructions with others
  • Build a tool to extract general instructions from prompts
  • Create a Wikipedia-like knowledge base where users can work together to create instructions and other knowledge that can improve Fuji-Web’s performance
  • Build an agent that explores websites autonomously to create useful knowledge and instructions
  • Explore methods to share knowledge across tasks and contexts

How You Can Get Involved With Fuji-Web

We’ve set up a dedicated channel for Fuji-Web feedback on Discord.

Fuji-Web is also open-source on GitHub! Come try it out, raise an issue, or contribute some code!

Normal Computing builds AI partners for high-stakes enterprise applications. We solve these interdisciplinary problems across the full stack, from software infrastructure and algorithms to hardware and physics. If you are as interested as we are in advancing the frontier of AI reasoning and reliability, reach out to us at info@normalcomputing.ai!