GovScape lets you easily search millions of government documents
Our take

GovScape’s arrival feels like a quiet revolution in how we access and understand government information, and it's particularly relevant given the increasing demand for transparency and accountability. The ability to efficiently sift through millions of PDFs, a task previously requiring immense manual effort or prohibitively expensive resources, unlocks entirely new avenues for research and public scrutiny. We've seen similar data-driven initiatives elsewhere, like the work cataloging orca populations in Puget Sound [Decades-long dataset shows which orcas are most at home in Puget Sound], highlighting the value of long-term datasets and sophisticated analysis. Furthermore, the University of Washington’s commitment to fostering a vibrant research environment, demonstrated by initiatives like a bus tour for new faculty [President and Provost join new faculty on bus tour of Washington], creates fertile ground for breakthroughs like GovScape. This isn’t just about making data *available*; it’s about making it *usable*.
The low cost, less than $1,500 to process 10 million PDFs, is frankly astonishing. It demonstrates the power of modern AI and efficient computing to democratize access to information. Traditionally, only well-funded institutions or government agencies could afford to conduct this kind of deep dive into archives. Now, independent researchers, journalists, and even concerned citizens have a powerful tool at their disposal. The semantic search functionality is especially noteworthy; it moves beyond simple keyword matching to capture the *meaning* of documents, finding pages on a topic even when the exact search terms never appear. Consider how this could be applied to understanding policy implementation, tracing the evolution of regulations, or even identifying patterns of influence within government agencies. This is a step beyond simply finding documents; it's about extracting insights. Set against the resources that go into pursuits like recruiting for WSU’s athletics department [WSU scores biggest recruiting win of 2027 class with commitment from 4-star OL Rashaun Lavata'i], it shows how far a modest budget can go when directed toward intellectual discovery.
The implications extend beyond academic research. Think about the potential for investigative journalism, holding government accountable for its actions. Think about the ability for small businesses to navigate complex regulatory landscapes. Think about citizen groups advocating for policy changes, armed with data-driven evidence. The ability to quickly and affordably analyze government documents can level the playing field and empower individuals and organizations to participate more effectively in civic life. This also addresses a persistent challenge in government transparency – the sheer volume of information often overwhelms those trying to access it. GovScape effectively cuts through that noise, enabling users to focus on the specific data they need. The system's design prioritizes efficiency and usability, making it accessible to a broader range of users, not just highly technical experts.
Looking ahead, it will be fascinating to see how GovScape evolves and how other institutions adopt similar approaches. The initial focus on Donald Trump’s first term provides a valuable baseline, but the real potential lies in expanding the system to cover other administrations and government agencies. We also need to consider the ethical implications of such powerful search tools; ensuring responsible use and guarding against potential misuse will be crucial. Will we see similar initiatives emerge to analyze data from other sectors, such as healthcare or education? The success of GovScape suggests a growing recognition of the importance of data accessibility and the transformative power of AI in unlocking its potential – and it raises a vital question: how can we ensure these tools are used to strengthen, not undermine, democratic processes?

At the end of every presidential term, the End of Term Web Archive preserves that administration’s web presence as a vast trove of documents and webpages. The archive began in 2008, with George W. Bush’s second term, and runs up to 2024, collecting images, text, graphs, redacted pages and other media. So while it contains important public information, finding that information in the glut can prove difficult.
A University of Washington-led research team created GovScape, an efficient search system for PDFs from the End of Term Web Archive. Users can look up exact keywords, like “FAFSA,” or use a semantic search, which finds documents on a topic even if the exact search terms don’t appear on the page. A visual search option lets them query for qualities like “redacted documents,” “aerial photographs” or “pie charts.” The system can currently search the 10 million PDFs hosted online during Donald Trump’s first term; the team plans to expand it to the whole archive.
Because researchers used highly efficient artificial intelligence models to read the documents, processing all the PDFs costs less than $1,500, or about $1 per 47,000 pages. By comparison, Google might charge consumers $1 to parse around 100 pages with AI.
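As a rough check on those figures, the short Python sketch below works through the arithmetic. It uses only the numbers quoted above; the variable and function names are illustrative, and the results are approximations because the article gives the cost as an upper bound and the rates as round figures.

```python
# Figures quoted in the article; everything derived from them is approximate.
TOTAL_COST_USD = 1_500              # upper bound: "less than $1,500"
PAGES_PER_DOLLAR = 47_000           # "about $1 per 47,000 pages"
COMPARISON_PAGES_PER_DOLLAR = 100   # the article's rough consumer comparison


def implied_page_count(total_cost_usd, pages_per_dollar):
    """Number of pages the quoted budget would cover at the quoted rate."""
    return total_cost_usd * pages_per_dollar


def cost_ratio(pages_per_dollar, comparison_pages_per_dollar):
    """How many times more pages per dollar the first rate covers."""
    return pages_per_dollar / comparison_pages_per_dollar


implied_pages = implied_page_count(TOTAL_COST_USD, PAGES_PER_DOLLAR)
ratio = cost_ratio(PAGES_PER_DOLLAR, COMPARISON_PAGES_PER_DOLLAR)

print(f"Implied page count: about {implied_pages:,}")
print(f"Pages per dollar relative to the comparison: about {ratio:.0f}x")
```

In other words, the quoted rate implies a collection of roughly 70 million pages, processed at several hundred times more pages per dollar than the consumer comparison.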
The team will present its research July 5 at the Annual Meeting of the Association for Computational Linguistics in San Diego.
“The End of Term Web Archive is immensely important to historians, journalists and the American public,” said senior author Benjamin Charles Germain Lee, a UW assistant professor in the Information School. “But many of these digital archives are getting so big — The Internet Archive just announced its trillionth page archived — that finding information is the real challenge.”
The team worked with PDFs because they are a ubiquitous file format and can contain text, charts and images — a mix that is challenging for existing search systems but makes the documents ideal candidates for GovScape’s multimodal search.
They built a processing pipeline that splits each PDF into individual pages, saves the pages as images, then pulls out the text. The researchers used highly efficient AI models to generate “embeddings” for both the text and images from each page. An embedding is essentially a list of numbers that systematically captures the content of a page’s text or image.
“Just as library classification systems group books on similar topics on the same shelf, these embeddings group similar pages with one another based on their visual and textual content,” Lee said.
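The shape of that pipeline can be illustrated with a minimal Python sketch. This is not the team's code: the `PageRecord` structure, the function names and the eight-number embedding are all hypothetical, and the `toy_embed` function is a simple word-hashing stand-in for the learned AI models the researchers used. The sketch also skips rendering pages as images and extracting text, and assumes each PDF has already been reduced to a list of per-page text.

```python
import hashlib
import math
from dataclasses import dataclass

EMBEDDING_DIM = 8  # tiny, for illustration only; learned models use far more


@dataclass
class PageRecord:
    pdf_id: str
    page_number: int
    text: str
    embedding: list


def toy_embed(text, dim=EMBEDDING_DIM):
    """Stand-in for a learned model: hash each word into one of `dim`
    buckets, count the hits, then scale the vector to unit length."""
    vector = [0.0] * dim
    for word in text.lower().split():
        digest = hashlib.sha256(word.encode("utf-8")).digest()
        vector[digest[0] % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vector))
    return [v / norm for v in vector] if norm else vector


def process_pdf(pdf_id, page_texts):
    """Turn one PDF, given as a list of per-page text, into one record per page."""
    return [
        PageRecord(pdf_id, number, text, toy_embed(text))
        for number, text in enumerate(page_texts, start=1)
    ]


records = process_pdf(
    "example.pdf",
    ["Student aid application form", "Budget pie chart for 2019"],
)
for record in records:
    print(record.pdf_id, record.page_number, len(record.embedding))
```

The point of the sketch is the unit of work: every page becomes its own record with its own list of numbers, which is what lets the search return individual pages and not just whole files.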
Researchers then built different indexing systems for the three kinds of search. The keyword search uses a basic index — similar to a book index — for all the text. If a user types in “FAFSA,” the system finds all the pages the word appears on.
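A book-style index of this kind is often called an inverted index: a table that maps each word to the pages it appears on. The Python sketch below is a minimal illustration under that assumption, not GovScape's implementation; the sample pages, the tokenizing rule and the function names are invented for the example.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into runs of letters and digits."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_keyword_index(page_texts):
    """Map each word to the set of page keys whose text contains it.

    `page_texts` maps a (pdf_id, page_number) key to that page's text.
    """
    index = defaultdict(set)
    for page_key, text in page_texts.items():
        for word in tokenize(text):
            index[word].add(page_key)
    return index


def keyword_search(index, query):
    """Return the pages that contain every word in the query, sorted."""
    words = tokenize(query)
    if not words:
        return []
    matches = set.intersection(*(index.get(word, set()) for word in words))
    return sorted(matches)


sample_pages = {
    ("aid.pdf", 1): "Complete the FAFSA form before the deadline.",
    ("aid.pdf", 2): "Contact your school's financial aid office.",
    ("budget.pdf", 1): "FAFSA processing costs by fiscal year.",
}
keyword_index = build_keyword_index(sample_pages)
print(keyword_search(keyword_index, "FAFSA"))
```

Because the index is built once, a lookup only has to consult the entry for each query word instead of rereading every page.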
For semantic and image searches, the system takes the user’s search term and creates an embedding. It then compares this embedding with the indices created from the embeddings of PDF pages and identifies the closest matches, which are returned as search results.
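The comparison step can be sketched in a few lines of Python. The article does not say how GovScape measures closeness, so this example assumes cosine similarity, a common choice for embeddings, and scans every page in turn. The three-number embeddings, page names and function names are made up for illustration; a collection of millions of pages would more likely rely on a specialized nearest-neighbor index than on a full scan.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def nearest_pages(query_embedding, page_embeddings, top_k=2):
    """Rank pages by similarity to the query and return the top_k page keys."""
    scored = [
        (cosine_similarity(query_embedding, embedding), page_key)
        for page_key, embedding in page_embeddings.items()
    ]
    scored.sort(reverse=True)
    return [page_key for _, page_key in scored[:top_k]]


# Made-up embeddings; a real system would get these from an AI model.
example_page_embeddings = {
    "student-aid.pdf#1": [0.9, 0.1, 0.0],
    "aerial-survey.pdf#4": [0.0, 0.2, 0.9],
    "loan-guide.pdf#2": [0.8, 0.3, 0.1],
}
example_query_embedding = [1.0, 0.2, 0.0]  # pretend embedding of the user's query

results = nearest_pages(example_query_embedding, example_page_embeddings)
print(results)
```

Here the two pages whose numbers point in nearly the same direction as the query rank first, even though no words are compared at all.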
“Our next goal is to cover all of the 70 million PDFs in the entire End of Term Web Archive — everything from 2008 to 2024,” Lee said. “One of the challenges moving forward is how to efficiently search at that scale.”
Because government archives contain “every file type under the sun,” Lee said, future work might expand to documents such as spreadsheets, images and HTML pages.
“I’m really excited about the prospects for better access to government information with projects like GovScape,” Lee said. “Being able to actually find relevant information is vital to the health of democracy and to the functioning of society.”
Co-authors include Kyle Deeds of Boston University, who completed this research as a doctoral student in the Paul G. Allen School of Computer Science & Engineering; Ying‑Hsiang Huang and Leslie Harka, who completed this research as UW master’s students in the Information School; Claire Gong, Shreya Shaji, Alison Yan, Albert Du, and Anjali Gopal, all students in the Allen School; Samuel J. Klein of Harvard University; Shannon Zejiang Shen of the Massachusetts Institute of Technology; Mark Phillips of the University of North Texas; and Trevor Owens of the American Institute of Physics.
For more information, contact Lee at bcgl@uw.edu.