PhD Candidate: Evgeny Krivosheev
Abstract of Dissertation:
Classification is a pervasive problem in research that aims at grouping items into categories
according to established criteria. There are two prevalent ways to classify items of interest: i) to
train and exploit machine learning (ML) algorithms or ii) to resort to human classification (via
experts or crowdsourcing).
Machine learning algorithms have been improving rapidly, showing impressive performance on
complex problems such as object recognition and natural language understanding.
However, in many cases they cannot yet deliver the required levels of precision and recall, typically
due to the difficulty of the problem and the lack of sufficiently large and clean datasets.
Research in crowdsourcing has also made impressive progress in the last few years, and the
crowd has been shown to perform well even in difficult tasks [Callaghan et al., 2018; Ranard et al.,
2014]. However, crowdsourcing remains expensive, especially when aiming at high levels of
accuracy, which often implies collecting more votes per item to make classification more robust to
workers' errors.
Recently, a third direction has been emerging rapidly: hybrid crowd-machine classification, which
can achieve superior performance by combining the cost-effectiveness of automatic machine
classifiers with the accuracy of human judgment.
In this thesis, we focus on designing crowdsourcing strategies and hybrid crowd-machine
approaches that optimize the item classification problem in terms of results and budget. We start
by investigating crowd-based classification under a budget constraint and with different loss
implications, i.e., when false positive and false negative errors cause different levels of harm to the task.
Further, we propose and validate a probabilistic crowd classification algorithm that iteratively
estimates the statistical parameters of the task and data to efficiently manage the accuracy vs. cost
trade-off. We then investigate how the crowd and machines can support each other in tackling
classification problems. We present and evaluate a set of hybrid strategies balancing between
investing money in building machines and exploiting them jointly with crowd-based classifiers.
While analyzing our results on crowd and hybrid classification, we found it relevant to study two
further problems: the quality of crowd observations and the confusions they exhibit, and the
promising direction of linking entities from structured and unstructured data sources. We propose
crowd-based and neural-network-based algorithms to address these challenges and evaluate them
extensively on synthetic and real-world datasets.