A tool that will automatically trawl the internet for escort ads and sort them into a searchable database could be a transformative new weapon in the fight against human trafficking.
The idea is just one use for a cloud-based analytics tool known as Domain-specific Insight Graphs (DIG), built by Pedro Szekely and Craig Knoblock of the USC Viterbi School of Engineering's Information Sciences Institute (ISI).
Their system uses open source software to identify and extract information from the web based on any number of parameters. The front-end result is a vast list of searchable information -- DIG already has a base of 50 million webpages from which it has extracted two billion records, and it continues to collect another 5,000 pages every hour -- that can be turned into maps, timelines and tables after a user inputs a query.
It can be used for many purposes, but its creators are currently promoting it as a tool to help law enforcement track down those missing persons lost to the sex trade. "The internet contains seemingly limitless information, but we're constrained by our ability to search that information and come up with meaningful results. DIG solves that problem," commented Szekely.
The UK's Human Trafficking Centre identified 2,255 potential victims of human trafficking in 2012, and the Missing Persons Advocacy Network estimated 200,000 US children are at high risk for trafficking into the sex industry. Better tools to address the unwieldy problem of police scouring the entire web for clues are an obvious priority. Using DIG, investigators can search by phone number, location, alias and image. In a paper published on the technology, the team says that DIG -- with the help of ISI's Karma data integration toolkit -- can also pull in databases, spreadsheets, XML and JavaScript Object Notation documents. "The ability to integrate web services allows Karma to pull in live data from the various social media sites, such as Twitter, Instagram, and OpenStreetMaps. DIG then indexes the integrated data and provides an easy to use interface for query, visualisation, and analysis."
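To make the idea of searching extracted records by phone number, location or alias concrete, here is a minimal sketch in Python. It is not DIG's code: the `RecordIndex` class, the `normalise_phone` helper, the field names and the sample records are all hypothetical, invented for illustration, and image search is left out entirely. It only shows how a simple inverted index lets differently formatted ads be matched on a shared field.

```python
from collections import defaultdict
import re


def normalise_phone(raw):
    """Keep digits only so differently formatted numbers compare equal."""
    return re.sub(r"\D", "", raw)


class RecordIndex:
    """Toy inverted index (hypothetical): (field, value) -> set of record ids."""

    def __init__(self):
        self.records = {}
        self.index = defaultdict(set)

    @staticmethod
    def _key(field, value):
        # Phone numbers are reduced to digits; other fields are case-folded.
        if field == "phone":
            return normalise_phone(value)
        return value.strip().lower()

    def add(self, record_id, record):
        """Store a record and index each of its fields."""
        self.records[record_id] = record
        for field, value in record.items():
            self.index[(field, self._key(field, value))].add(record_id)

    def search(self, **criteria):
        """Return the sorted ids of records matching every given field."""
        matches = None
        for field, value in criteria.items():
            ids = self.index.get((field, self._key(field, value)), set())
            matches = ids if matches is None else matches & ids
        return sorted(matches or [])


# Invented sample records, not real data.
index = RecordIndex()
index.add("ad-1", {"phone": "(555) 010-0000", "location": "Los Angeles", "alias": "Example A"})
index.add("ad-2", {"phone": "555.010.0000", "location": "San Diego", "alias": "Example B"})
index.add("ad-3", {"phone": "555-010-9999", "location": "Los Angeles", "alias": "Example A"})

print(index.search(phone="555 010 0000"))
print(index.search(location="los angeles", alias="example a"))
```

In this sketch the first query returns the two ads that share a phone number despite different formatting, and the second returns the two ads that share both a location and an alias. A production system would need far more robust extraction and matching than this.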
The approach is similar to that behind Microsoft's PhotoDNA, which helps investigators through the emotionally arduous task of searching for child abuse content online and on servers. PhotoDNA can be used to automatically cross-reference huge global databases so international law enforcement can work with each other to track down abusers and identify missing people. DIG can also be used to cross-reference material police already have in their system, to identify victims and criminals running trafficking rings.
Like PhotoDNA, DIG is free for law enforcement to use (its development was funded by Darpa programme Memex). The software will be upgraded quarterly over the next three years, after which more funding will need to be secured. "As the database continues to grow, DIG will be able to uncover new connections and patterns in the data, making it even more useful," said Knoblock in a statement.
There are plenty more uses for a technology such as DIG, and in an ISI series of slides on the software both arms trafficking and the drugs trade are flagged up as potential areas of application. Szekely and Knoblock are also already using it to analyse material science research.
Their work is far from over though. As rapidly as DIG can categorise information, that information continues to grow. In the two and a half years of the project that are left, the ISI team plans on prepping DIG to cope. "We are developing improved tools for rapidly training information extractors for new applications domains," the pair write in a paper on the tech. "We are working on highly scalable entity-linking and entity resolution techniques, which will allow us to both find entities within the set of extracted data as well as to resolve the entities with known entities in another source. And we are working on improving the machine learning techniques for automatically modelling new sources, which will improve the accuracy and speed of integrating new sources of data."
This article was originally published by WIRED UK