Year: 2015

Text Thresher is open-source, crowd-based text analysis software that lets researchers extract richer data from large text corpora faster. It enables researchers to break down daunting text-annotation (i.e., content analysis) projects into smaller, more manageable multiple-choice reading-comprehension tasks that can be performed by crowd workers, citizen scientists, and annotation hobbyists through crowd work platforms like Mechanical Turk and citizen science platforms like CrowdCrafting.

Text Thresher re-organizes traditional, slow-going content analysis into two steps. First, experts (trained research assistants) identify text units in larger documents that correspond with just one branch of a researcher’s larger semantic scheme—a branch specifying variables and attributes that describe just one unit of analysis.

Next, Text Thresher displays those smaller text units (rich with information about the variables and attributes of just one unit of analysis) to crowd workers and walks them through a short series of leading questions about the text. By answering these questions and highlighting the words that justify their answers, crowd workers extract detailed variable/attribute information relevant to the researcher’s semantic scheme while labeling the text that corresponds to those variables/attributes. Thus, the crowd completes work equivalent to content analysis much faster than a traditional research team could. This content analysis work is achievable as crowd work for three reasons: researchers reduce text units from document length to a few sentences, those few sentences are relevant to only a small branch of the larger semantic scheme, and so many people are familiar with reading-comprehension tasks.
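To make the second step concrete, here is a minimal Python sketch of how a text unit, its questions, and a crowd worker's answer with highlighted evidence might be represented. This is an illustration only, not Text Thresher's actual data model: the class names (`Question`, `TextUnit`, `Answer`), the `record_answer` helper, and the sample sentence and scheme branch are all hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Question:
    """One multiple-choice question about a single variable (hypothetical)."""
    variable: str
    prompt: str
    choices: list[str]


@dataclass
class TextUnit:
    """A short passage that experts matched to one branch of the scheme."""
    text: str
    branch: str
    questions: list[Question] = field(default_factory=list)


@dataclass
class Answer:
    """The attribute a worker chose, plus the words that justify it."""
    variable: str
    attribute: str
    highlights: list[str]


def record_answer(unit: TextUnit, question: Question, choice: str,
                  spans: list[tuple[int, int]]) -> Answer:
    """Validate a worker's choice and turn highlight offsets into text."""
    if choice not in question.choices:
        raise ValueError(f"{choice!r} is not an option for {question.variable}")
    highlights = []
    for start, end in spans:
        # Offsets are character positions within the text unit.
        if not 0 <= start < end <= len(unit.text):
            raise ValueError(f"highlight {start}-{end} is outside the text unit")
        highlights.append(unit.text[start:end])
    return Answer(question.variable, choice, highlights)


# Invented example data, for illustration only.
question = Question(
    variable="police_action",
    prompt="What did the police do?",
    choices=["arrested people", "issued a warning", "no action"],
)
unit = TextUnit(
    text="Police arrested twelve protesters near the plaza.",
    branch="police_behavior",
    questions=[question],
)
start = unit.text.find("arrested")
answer = record_answer(unit, question, "arrested people",
                       [(start, start + len("arrested"))])
print(answer)
```

Storing the highlighted words alongside each chosen attribute mirrors the description above: every answer carries both the variable/attribute value and the text that supports it.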

The possibilities for Text Thresher extend as far as the availability of text data and the imaginations of researchers. Some researchers will be interested in legal documents, others in policy documents and speeches. Some may have less interest in a particular class of documents and more interest in units of text ranging across them, perhaps related to the construction and reproduction of gender, class, or ethnic categories. Some may wish to study students’ written work en masse to better understand educational outcomes, or to study the email correspondence of non-governmental organizations to optimize communication flows.

Whatever the corpus and topic, Text Thresher can help researchers generate rich, large databases from text fast. And galleries, libraries, archives, museums, and classrooms may also deploy Text Thresher to advance scientific literacy and engage more people in efforts to better understand our world.

We're grateful to the following organizations for supporting our projects:

SSRC, AMPlab, BIDS, hypothes.is, NSF, Sloan Foundation, Sage Publishing