TextThresher’s Minimum Viable Product Complete
The TextThresher team and I are excited to announce that – with support from the Hypothesis Open Annotation Fund and the Sloan Foundation – we have completed our work building software that allows researchers to enlist citizen scientists in the complex annotation of large text corpora.
Content analysis – the application of deep and broad tag sets to large corpora of text – has been a painstaking process for decades, usually requiring the close training of wave after wave of research assistants. But with the Annotator Content Analysis modules we’ve created (which are components of TextThresher), large annotation jobs that took several years can now be completed in several months by internet contributors. As we describe below, TextThresher works by organizing content analysis into an assembly line of tasks presented through volunteer science platforms like Crowdcrafting.
The Crowd Content Analysis Assembly Line with Pybossa
Our team has reorganized traditional, slow-going content analysis into two steps, each with its own Pybossa-served task presenter (described in a previous post as ‘Annotator Content Analysis modules’). In the first step, contributors read longer documents like news articles and highlight the text units that correspond to just one high-level branch of a researcher’s larger semantic scheme. For example, these first-round contributors would highlight, in separate colors, all the words that describe ‘goings on at an Occupy encampment’, ‘government actions’, ‘protester-initiated events’, or ‘police-initiated events’. Next, a second Pybossa-served task presenter (AKA, ACA module) displays those highlighted text units one at a time and guides contributors through a series of leading questions about the text. Those questions, pre-specified by the researcher, are uniquely relevant to the type of text unit identified in Step 1. By answering questions and highlighting the words justifying their answers, contributors label and extract the detailed variable/attribute information important to the researcher. Thus, the crowd completes work equivalent to content analysis – and much faster than a small research team could.

This content analysis work is achievable without close training because TextThresher’s schemas reorganize the work into tasks of limited cognitive complexity. Instead of attempting to label long documents with any of a hundred or more tags, contributors search the text for only a few tags at a time. And in the second interface/module, contributors examine only small text units while they hunt for particular variable/attribute information.
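To make the two-step assembly line concrete, here is a minimal sketch in Python. All names and data structures are illustrative, not TextThresher’s actual API: Step 1 turns contributor highlights into topic-labeled text units, and Step 2 pairs each unit with only the questions relevant to its topic.

```python
# Hypothetical sketch of the two-step content analysis pipeline.
# Topic labels and questions stand in for a researcher's real scheme.

# A semantic scheme: high-level topics, each with its detailed questions.
SCHEME = {
    "police-initiated event": [
        "What tactic did police use?",
        "Who was targeted?",
    ],
    "protester-initiated event": [
        "Who initiated the action?",
        "Where did the action take place?",
    ],
}

def step_one_highlight(article_text, spans):
    """Step 1: contributors highlight text units, one topic at a time.

    `spans` maps a topic label to the (start, end) character offsets
    a contributor highlighted in the full article."""
    return [
        {"topic": topic, "text": article_text[start:end]}
        for topic, (start, end) in spans.items()
    ]

def step_two_tasks(text_units, scheme):
    """Step 2: each highlighted unit becomes a small question-answering
    task, pairing the unit with only its topic's questions."""
    return [
        {"unit": unit["text"], "question": q}
        for unit in text_units
        for q in scheme[unit["topic"]]
    ]

article = "Police cleared the camp at dawn. Protesters marched downtown at noon."
units = step_one_highlight(article, {
    "police-initiated event": (0, 32),
    "protester-initiated event": (33, 69),
})
tasks = step_two_tasks(units, SCHEME)
```

Note how the cognitive load stays low at every stage: a contributor either hunts for a handful of topics in one article, or answers a few questions about one short text unit.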
TextThresher can ingest and export annotations, making it interoperable with automated text processing algorithms. For instance, its ‘NLP hints’ feature lets contributors see the computer’s guess at the right answer: if a question begins with ‘Who’, the feature italicizes the proper names in a document; if it begins with ‘Where’, contributors see all of the location-relevant words italicized.
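A toy version of the hinting idea might look like the sketch below. The heuristics here (capitalization as a proxy for proper names, a tiny gazetteer for places) are purely illustrative stand-ins for whatever NLP TextThresher actually applies; asterisks stand in for italics.

```python
import re

# Illustrative 'NLP hints' pass: mark candidate answer words based on
# the question's leading interrogative. *word* stands in for italics.

LOCATIONS = {"downtown", "plaza", "park"}  # hypothetical gazetteer

def hint(question, text):
    if question.startswith("Who"):
        # Crude proper-name proxy: capitalized words that do not
        # begin the string or follow sentence-ending punctuation.
        return re.sub(r"(?<!^)(?<![.!?] )\b([A-Z][a-z]+)\b", r"*\1*", text)
    if question.startswith("Where"):
        def mark(w):
            core = w.strip(".,")
            return w.replace(core, f"*{core}*") if core in LOCATIONS else w
        return " ".join(mark(w) for w in text.split())
    return text
```

For example, `hint("Who ordered the raid?", "Mayor Quan ordered the raid.")` marks `Quan`, while a ‘Where’ question marks gazetteer words like `downtown`. A production system would use real named-entity recognition instead of these heuristics.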
TextThresher has a web-based interface that allows the researcher to import a corpus of documents and the conceptual schemas that organize structured tag sets into high-level topics and detailed questions. This interface – built using Django and PostgreSQL, and containerized using Docker – also allows the researcher to generate and upload batches of tasks to a Pybossa server. TextThresher’s Pybossa task presenters – written using the React and Redux frameworks, and bundled with webpack – are automatically deployed to Pybossa by TextThresher when it creates a project and uploads tasks. In addition to the TextThresher web app, a local version of Pybossa is provided for testing and experiments; once projects are ready for remote access, they can be uploaded to a publicly available Pybossa server, such as Crowdcrafting. A deployment repository on GitHub makes it easy to install and run TextThresher on any machine (Mac, Windows, or Linux) running Docker.
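The task-batching step can be pictured with a short sketch. The field names below are hypothetical, not TextThresher’s actual payload format; in practice TextThresher generates these batches itself and pushes them to Pybossa over its API (for example, via the `create_task` call of the pybossa-client library).

```python
# Hypothetical sketch of generating a batch of Step-1 highlighting
# tasks from a corpus. Field names are illustrative only.

def build_task_batch(corpus, topics):
    """Turn each document into one task payload carrying the article
    text plus the few topic labels contributors should highlight."""
    return [
        {"info": {"doc_id": doc["id"],
                  "text": doc["text"],
                  "topics": topics}}
        for doc in corpus
    ]

corpus = [
    {"id": 1, "text": "Police cleared the camp at dawn."},
    {"id": 2, "text": "The city council passed an ordinance."},
]
batch = build_task_batch(corpus, ["police-initiated events",
                                  "government actions"])
```

Because each task is a self-contained JSON-style payload, the same batch can be sent to a local Pybossa instance during testing and to a public server like Crowdcrafting once the project is ready.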
TextThresher is just getting started. Future versions of the software will also include supervised machine learning features, reducing the amount of work humans must complete and providing additional ways to give contributors hints. Initially, TextThresher is being used to parse more than 8,000 news articles describing the events of the Occupy campaign. With complex, multi-level data, researchers will be able to tease out the dynamics of police and protester interaction that lead to violence, negotiation, and everything in between. TextThresher is also being used by the PublicEditor project, which is organizing citizen science efforts to evaluate the news and establish the credibility of articles, journalists, and news sources. To learn more about how you can use TextThresher, email email@example.com.
The possibilities for TextThresher extend as far as the availability of text data and the imaginations of researchers. Some will be interested in legal documents, others policy documents and speeches. Some may have less interest in a particular class of documents and more interest in units of text ranging across them—perhaps related to the construction and reproduction of gender, class, or ethnic categories. Some may wish to study students’ written work en masse to better understand educational outcomes or the email correspondence of non-governmental organizations to optimize communication flows.
Galleries, libraries, archives, museums, and classrooms may also deploy TextThresher’s task presenters, advancing scientific literacy and engaging more people in social scientists’ efforts to better understand our world. Whatever the corpus and topic, TextThresher can help researchers generate rich, large databases from text – fast!
Crowdcrafting is a web-based service that invites volunteers to contribute to scientific projects developed by citizens, professionals, or institutions that need help to solve problems, analyze data, or complete challenging tasks that can’t be done by machines alone but require human intelligence. The platform is 100% open source – that is, its software is developed and distributed freely – and 100% open science, making scientific research accessible to everyone. Crowdcrafting runs on its own Pybossa software: an open source framework for crowdsourcing projects. Institutions like the British Museum, CERN, and the United Nations (UNITAR) are also Pybossa users.