Old-Guard Archivists Keep Federal Data Safer Than You Think

Long before Trump, open government and open data evangelists had been preserving all kinds of data collected and stored by the government.
Jamie Lyons

At least twice a day, the data recorder on board NASA’s Lunar Reconnaissance Orbiter beams images to a station in White Sands, New Mexico. That data gets copied to the Goddard Space Flight Center in Maryland, and then copied again to computers at Arizona State University in Tempe. Two other copies go to an off-campus building where they live on different access-controlled computer systems.

Mark Robinson, the researcher running the team that operates the LRO’s cameras, analyzes that data. Every three months his team uploads raw and calibrated images to NASA’s public website for anyone to access.

That’s five layers of redundancy. “Nobody would ever be able to delete these data,” Robinson says. “I’ve dedicated my life to preserving data. Buildings can collapse. Computers fail. But the LROC data will still be around.”

For one thing, deleting federal records is illegal. The National Archives Office of the Inspector General investigates claims of record fraud and can refer cases for prosecution. And for another, NASA shares and backs up all of its datasets across multiple government research facilities and academic institutions around the country. So there’s no easy way to erase every copy, even if a webpage or two were to go missing.

And NASA says none have. “The availability of NASA Earth science data has not changed in recent months, nor have any Earth science datasets been taken offline,” the agency said in a statement.

So why, then, did web archivers at a data rescue event in Berkeley last week flag an atmospheric carbon dioxide dataset as missing? NASA had migrated it to a new location when the entire Earth Observing System site underwent a redesign in January of 2013. As for a Global Change Data Center reports repository that one web archiver worried was empty, the reality was that GCDC scientists never put any files there in the first place.

Since election night, hundreds of people have come to events across the country to log data they feared would get disappeared by government agencies in the Trump administration---because of connections to climate change, or gun violence, or any number of other subjects. The events are coordinated by groups like DataRefuge and the Environmental Data and Governance Initiative, and every 404 error on a government website has made these would-be archivists suspicious.

But not every 404 error is evidence that some critical dataset got tossed. The webpages might be the same; it’s the world around them that changed.

Into the Archives

So NASA is on the record saying the agency hasn't put anything down the memory hole. The National Oceanic and Atmospheric Administration also confirmed that none of its datasets have been removed since January 20, 2017, and the agency has no plans to take any down in the immediate future. The Environmental Protection Agency did not respond to a request for comment.

EDGI, though, is recruiting “domain analysts” to help distinguish relevant gaps from harmless artifacts in the data they scrape. For now they’re treating every lead as relevant. “All these errors are worth investigating,” says Lindsey Dillon, EDGI’s steering committee chair. “They might turn out to be nothing, but the interesting thing is that in this moment, as never before, people are finding a deleted page or an absent report politically meaningful.”

Politically meaningful or not, the government has taken down only one federal database since January 20. The US Department of Agriculture scrubbed animal welfare records from its site, according to the Sunlight Foundation, a nonprofit that advocates for government transparency and data access. The erasure caused a public outcry, and as of Friday morning, the USDA began returning some documents.

Alex Howard, Sunlight’s deputy director, warns that it’s easy to read malice into every broken link or changed text on a webpage, but that it could just as easily be incompetence, or ignorance, something totally unrelated, or nothing at all. “We don’t want to play into the dynamic that is rushing toward us here,” he says, “where a vacuum of confirmed, trustworthy information from the top levels of government is filled up instead with our fear.”

New Kids On The Web

Groups like DataRefuge and EDGI organized quickly---getting a national movement off the ground in a matter of months. They operate from a “triage and prioritize” posture, based on tips they get from government scientists and with an eye toward the moves inside the White House and on Capitol Hill.

Worthy, sure, but long before Trump entered the political picture, open government and open data evangelists had been preserving all kinds of data collected and stored by the government, from crime statistics to unemployment rates to trade deficits.

Some changes, like a redo of the White House website, are normal parts of a presidential transition. The Department of Labor removing its blog posts on how it calculates the unemployment rate, or the Department of Energy changing its language around climate change, are worth keeping an eye on. “The more the merrier,” says James Jacobs, who runs Free Government Information, which tracks and stores government web data. “It’s been really hard for librarians to convince people that preserving the web is important. Google has done a very good job of making people think that once it’s online it’s there forever.”

Together with the Internet Archive, the Library of Congress, and the Government Publishing Office, Jacobs coordinates the End of Term project, a once-every-four-years web harvest of all .gov and .mil sites. He’s also considering completing annual harvests under the Trump administration.

All the extra hands will make lighter work for people like Jacobs, and having lots of copies of things is always better than having none at all. But even as groups of galvanized guerrilla archivists join the fray, breathing life into a cause to which Jacobs has committed his career, he is clear-eyed about the limitations of his line of work.

Archiving is inherently static. It’s a snapshot you take of a moment in time---whether that’s text on a webpage or surface water temperature measurements from the Chukchi Sea in February.

Datasets, on the other hand, are dynamic. And keeping open data pipelines, and the funding that makes them possible, is what scientists and concerned citizens should really be worried about. So while seeding web crawlers and downloading satellite images might make people feel a little less helpless in a time of digital uncertainty, a dataset is only as useful as its last upload. It’s not whether the data disappears---it’s whether people will still be collecting it tomorrow, and next month, and next year that matters.