June 24, 2026 • from UW News

GovScape lets you easily search millions of government documents

Our take

GovScape, a search system developed by a University of Washington-led research team, makes it much easier to navigate government information. It tackles a hard problem: efficiently searching millions of PDF documents preserved in the End of Term Web Archive. Think of it as a search engine built specifically for government records.

GovScape handles both exact keyword searches, such as “FAFSA,” and semantic searches. Semantic search works from the *meaning* of a query, so it surfaces relevant documents even if they don't contain the exact search terms, which widens what researchers can find.

The technology is also inexpensive to run. The researchers used highly efficient artificial intelligence models to process approximately 10 million PDFs from Donald Trump’s first term for less than $1,500, or roughly $1 per 47,000 pages.

The potential applications range from academic research to citizen oversight. For example, understanding shifting environmental trends is increasingly important, as highlighted in a recent article, "Decades-long dataset shows which orcas are most at home in Puget Sound." GovScape lets users quickly find and analyze the government documents that bear on such issues. We anticipate it will be a valuable resource for anyone seeking to understand and engage with government information.

GovScape’s arrival feels like a quiet revolution in how we access and understand government information, and it's particularly relevant given the increasing demand for transparency and accountability. The ability to efficiently sift through millions of PDFs, a task previously requiring immense manual effort or prohibitively expensive resources, unlocks entirely new avenues for research and public scrutiny. We've seen similar data-driven initiatives elsewhere, like the work cataloging orca populations in Puget Sound [Decades-long dataset shows which orcas are most at home in Puget Sound], highlighting the value of long-term datasets and sophisticated analysis. Furthermore, the University of Washington’s commitment to fostering a vibrant research environment, demonstrated by initiatives like a bus tour for new faculty [President and Provost join new faculty on bus tour of Washington], creates fertile ground for breakthroughs like GovScape. This isn’t just about making data *available*; it’s about making it *usable*.

The low cost, less than $1,500 to process 10 million documents, is striking. It shows how modern AI and efficient computing can broaden access to information. Traditionally, only well-funded institutions or government agencies could afford this kind of deep dive into archives. Now independent researchers, journalists and concerned citizens have a powerful tool at their disposal. The semantic search is especially noteworthy: it moves beyond keyword matching to the *meaning* of documents, which expands the scope of potential discoveries. Consider how it could be applied to understanding policy implementation, tracing the evolution of regulations, or identifying patterns of influence within government agencies. The goal is not just finding documents but extracting insights. It also contrasts with how resources are spent elsewhere, such as college athletics recruiting [WSU scores biggest recruiting win of 2027 class with commitment from 4-star OL Rashaun Lavata'i]: here, a modest budget went toward intellectual discovery.

The implications extend beyond academic research. Think about the potential for investigative journalism, holding government accountable for its actions. Think about the ability for small businesses to navigate complex regulatory landscapes. Think about citizen groups advocating for policy changes, armed with data-driven evidence. The ability to quickly and affordably analyze government documents can level the playing field and empower individuals and organizations to participate more effectively in civic life. This also addresses a persistent challenge in government transparency – the sheer volume of information often overwhelms those trying to access it. GovScape effectively cuts through that noise, enabling users to focus on the specific data they need. The system's design prioritizes efficiency and usability, making it accessible to a broader range of users, not just highly technical experts.

Looking ahead, it will be fascinating to see how GovScape evolves and how other institutions adopt similar approaches. The initial focus on Donald Trump’s first term provides a valuable baseline, but the real potential lies in expanding the system to cover other administrations and government agencies. We also need to consider the ethical implications of such powerful search tools; ensuring responsible use and guarding against potential misuse will be crucial. Will we see similar initiatives emerge to analyze data from other sectors, such as healthcare or education? The success of GovScape suggests a growing recognition of the importance of data accessibility and the transformative power of AI in unlocking its potential – and it raises a vital question: how can we ensure these tools are used to strengthen, not undermine, democratic processes?

A search for “redacted documents” on a search engine. A University of Washington-led research team created GovScape, an efficient search system for PDFs from the End of Term Web Archive. Users can look up exact keywords, like “FAFSA,” or use a visual search option to query for qualities like “redacted documents.” Photo: University of Washington

At the end of every presidential term, the End of Term Web Archive preserves that administration’s web presence as a vast trove of documents and webpages. The archive began in 2008, with George W. Bush’s second term, and runs up to 2024, collecting images, text, graphs, redacted pages and other media. So while it contains important public information, finding that information in the glut can prove difficult.

A University of Washington-led research team created GovScape, an efficient search system for PDFs from the End of Term Web Archive. Users can look up exact keywords, like “FAFSA,” or use a semantic search, which finds documents on a topic even if the exact search terms don’t appear on the page. A visual search option lets them query for qualities like “redacted documents,” “aerial photographs” or “pie charts.” The system can currently search the 10 million PDFs hosted online during Donald Trump’s first term; the team plans to expand it to the whole archive.

Because the researchers used highly efficient artificial intelligence models to read the documents, processing all the PDFs cost less than $1,500, or about $1 per 47,000 pages. By comparison, Google might charge consumers $1 to parse around 100 pages with AI.
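The cost comparison can be checked with a few lines of arithmetic. This is a back-of-envelope sketch using only the figures quoted above. Because the total is given as "less than $1,500," the derived numbers are rough upper bounds, and the variable names are illustrative.

```python
# Figures quoted in the article; everything derived below is approximate.
TOTAL_COST_USD = 1_500             # stated upper bound for processing all PDFs
PAGES_PER_DOLLAR = 47_000          # GovScape's stated rate
COMPARISON_PAGES_PER_DOLLAR = 100  # the article's consumer-pricing comparison

# Upper bound on the page count implied by the two GovScape figures.
implied_pages = TOTAL_COST_USD * PAGES_PER_DOLLAR

# How many times more pages GovScape processes per dollar.
cost_ratio = PAGES_PER_DOLLAR / COMPARISON_PAGES_PER_DOLLAR

# What the same page count would cost at the comparison rate.
comparison_cost_usd = implied_pages / COMPARISON_PAGES_PER_DOLLAR

print(f"Implied pages: about {implied_pages:,}")
print(f"Pages per dollar advantage: {cost_ratio:.0f}x")
print(f"Cost at comparison rate: about ${comparison_cost_usd:,.0f}")
```

On these figures, the quoted rate is about 470 times cheaper per page than the comparison price.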

The team will present its research July 5 at the Annual Meeting of the Association for Computational Linguistics in San Diego.

“The End of Term Web Archive is immensely important to historians, journalists and the American public,” said senior author Benjamin Charles Germain Lee, a UW assistant professor in the Information School. “But many of these digital archives are getting so big — The Internet Archive just announced its trillionth page archived — that finding information is the real challenge.”

The team worked with PDFs because they are a ubiquitous file format and can contain text, charts and images — a mix that is challenging for existing search systems but makes the documents ideal candidates for GovScape’s multimodal search.

To process the documents, the researchers built a pipeline that splits each PDF into individual pages, saves the pages as images, then pulls out the text. They then used highly efficient AI models to generate “embeddings” for both the text and images from each page. An embedding is essentially a list of numbers that systematically captures the content of the text or image.
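The shape of such a pipeline can be sketched in a few lines. This is a minimal illustration, not the team's code: it assumes each document has already been split into page texts, skips the image step entirely, and replaces the AI model with a toy word-hashing function. `PageRecord`, `toy_text_embedding` and `process_document` are hypothetical names.

```python
import hashlib
import math
from dataclasses import dataclass

DIM = 16  # toy embedding size; real models use far more dimensions


@dataclass
class PageRecord:
    doc_id: str
    page_number: int
    text: str
    embedding: list[float]


def toy_text_embedding(text: str, dim: int = DIM) -> list[float]:
    """Stand-in for an AI model: hash each word into a fixed-length vector."""
    vec = [0.0] * dim
    for word in text.lower().split():
        bucket = int(hashlib.md5(word.encode("utf-8")).hexdigest(), 16) % dim
        vec[bucket] += 1.0
    # Scale to unit length so pages of different lengths are comparable.
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec


def process_document(doc_id: str, pages: list[str]) -> list[PageRecord]:
    """Treat each page as its own unit and embed it separately."""
    return [
        PageRecord(doc_id, number, text, toy_text_embedding(text))
        for number, text in enumerate(pages, start=1)
    ]


records = process_document(
    "example.pdf",
    ["Federal student aid application guide", "Budget table for fiscal year"],
)
for record in records:
    print(record.doc_id, record.page_number, len(record.embedding))
```

The point of the sketch is the unit of work: every page becomes one record with its own fixed-length vector, which is what makes page-level search possible later.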

“Just as library classification systems group books on similar topics on the same shelf, these embeddings group similar pages with one another based on their visual and textual content,” Lee said.

Researchers then built different indexing systems for the three kinds of search. The keyword search uses a basic index — similar to a book index — for all the text. If a user types in “FAFSA,” the system finds all the pages the word appears on.
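The book-index analogy corresponds to what is usually called an inverted index. The sketch below is a minimal version under stated assumptions: the page texts and file names are made up, and the article does not say how GovScape's keyword index is actually implemented.

```python
import re
from collections import defaultdict

# Hypothetical page texts keyed by (document id, page number).
pages = {
    ("aid-guide.pdf", 1): "Complete the FAFSA before the state deadline.",
    ("aid-guide.pdf", 2): "Loan repayment options are listed below.",
    ("budget.pdf", 1): "FAFSA processing funds appear in this table.",
}


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(page_texts):
    """Map each word to the set of pages it appears on, like a book index."""
    index = defaultdict(set)
    for page_id, text in page_texts.items():
        for word in tokenize(text):
            index[word].add(page_id)
    return index


def keyword_search(index, term: str):
    """Return the pages containing the term, in a stable sorted order."""
    return sorted(index.get(term.lower(), set()))


index = build_index(pages)
print(keyword_search(index, "FAFSA"))
```

Building the index takes one pass over every page, but each lookup afterward is a single dictionary access rather than a scan of the whole collection.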

For semantic and image searches, the system takes the user’s search term and creates an embedding. It then compares this embedding with the indices created from the embeddings of PDF pages and identifies the closest matches, which are returned as search results.
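The comparison step can be illustrated with cosine similarity, a common way to measure how close two embeddings are. This is a sketch under assumptions: the three-number vectors and page labels are invented, the query vector stands in for what a model would produce, and every page is compared by brute force. The article does not specify GovScape's similarity measure or index structure.

```python
import math

# Hypothetical 3-dimensional page embeddings; real ones come from a model.
page_embeddings = {
    "aid-guide.pdf p.1": [0.9, 0.1, 0.0],
    "budget.pdf p.4": [0.1, 0.9, 0.1],
    "survey.pdf p.2": [0.0, 0.2, 0.9],
}


def cosine_similarity(a, b):
    """Return the cosine of the angle between two vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def nearest_pages(query_embedding, embeddings, k=2):
    """Score every page against the query and return the k closest page ids."""
    scored = [
        (cosine_similarity(query_embedding, vec), page_id)
        for page_id, vec in embeddings.items()
    ]
    scored.sort(reverse=True)
    return [page_id for _, page_id in scored[:k]]


# Assume a model mapped a query such as "student financial aid" to this vector.
query_embedding = [0.8, 0.2, 0.1]
print(nearest_pages(query_embedding, page_embeddings))
```

Comparing the query against every page works for a handful of vectors. At the scale of millions of pages, systems generally rely on specialized nearest-neighbor indexes instead.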

“Our next goal is to cover all of the 70 million PDFs in the entire End of Term Web Archive — everything from 2008 to 2024,” Lee said. “One of the challenges moving forward is how to efficiently search at that scale.”

Because government archives contain “every file type under the sun,” Lee said, future work might expand to documents such as spreadsheets, images and HTML pages.

“I’m really excited about the prospects for better access to government information with projects like GovScape,” Lee said. “Being able to actually find relevant information is vital to the health of democracy and to the functioning of society.”

Co-authors include Kyle Deeds of Boston University, who completed this research as a doctoral student in the Paul G. Allen School of Computer Science & Engineering; Ying‑Hsiang Huang and Leslie Harka, who completed this research as UW master’s students in the Information School; Claire Gong, Shreya Shaji, Alison Yan, Albert Du, and Anjali Gopal, all students in the Allen School; Samuel J. Klein of Harvard University; Shannon Zejiang Shen of the Massachusetts Institute of Technology; Mark Phillips of the University of North Texas; and Trevor Owens of the American Institute of Physics.

For more information, contact Lee at bcgl@uw.edu.
