Is MTurk suitable for problems requiring years of training or special expertise?

We are trying to use Mechanical Turk to solve a problem that is different from most MTurk projects out there. First, to accomplish the task the Turker needs some ‘expertise’ in our subject domain. Second, we are trying to build a Machine Learning based model with a very high precision requirement from the data we collect from the Turkers. Instead of looking for the model with the best precision/recall trade-off, we require our model to reach very high precision (99.6%) and then take the best recall we can get at that precision level.
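To make the precision target concrete, here is a minimal sketch (not our actual pipeline) of how a decision threshold can be chosen on held-out data to satisfy a hard precision floor while keeping the best recall. The array names are hypothetical, and scikit-learn is just one convenient way to compute the curve.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

def threshold_for_precision(y_true, y_score, min_precision=0.996):
    """Return (threshold, recall) for the lowest decision threshold whose
    precision meets the floor; a lower threshold admits more positives, so
    this choice maximizes recall subject to the precision constraint."""
    precision, recall, thresholds = precision_recall_curve(y_true, y_score)
    # precision/recall have one more entry than thresholds; drop the last point.
    ok = np.flatnonzero(precision[:-1] >= min_precision)
    if ok.size == 0:
        raise ValueError("no threshold reaches the required precision")
    i = ok[0]  # first qualifying index = lowest qualifying threshold
    return thresholds[i], recall[i]

# Usage (with hypothetical held-out arrays):
# thr, rec = threshold_for_precision(val_labels, val_scores, min_precision=0.996)
```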

We are not sure yet whether MTurk is the right way to get quality training/evaluation data, but we have been using it for almost two years. We have spent a lot of money and effort on collecting, cleaning and maintaining the data, and so far it is somewhat meeting our expectations.

Here are some of the main points:

1. Training and maintaining a population of Turkers with high accuracy:

We maintain continuous communication with the Turkers on TurkerNation. Whenever we set up a batch of HITs on MTurk, our researchers/analysts monitor our TurkerNation board/thread to see whether the Turkers have any questions about the HITs; if they do, we try to answer them right away.

We salt our HITs with examples from our golden set, and use them to evaluate the Turkers.
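As an illustration, here is a minimal sketch of how each Turker could be scored against the salted gold HITs. The record fields (worker_id, hit_id, answer) are hypothetical names, not an actual MTurk API.

```python
from collections import defaultdict

def score_workers(assignments, gold):
    """assignments: iterable of {"worker_id", "hit_id", "answer"} dicts;
    gold: {hit_id: correct answer} for the salted HITs only.
    Returns {worker_id: (accuracy_on_gold, number_of_gold_hits_seen)}."""
    correct, seen = defaultdict(int), defaultdict(int)
    for a in assignments:
        if a["hit_id"] not in gold:      # only the salted HITs are scored
            continue
        seen[a["worker_id"]] += 1
        if a["answer"] == gold[a["hit_id"]]:
            correct[a["worker_id"]] += 1
    return {w: (correct[w] / seen[w], seen[w]) for w in seen}
```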

We offer a 100% bonus to Turkers whose accuracy is 100%. Most of the time no Turker gets all of the salted HITs correct; in that case, we always give the Turker with the highest accuracy a 100% bonus. The rest of the Turkers receive bonuses based on their accuracy level. Usually we only bonus Turkers with accuracy higher than 80%, but if a batch is too hard, we may just bonus the top 20% or 30% of Turkers regardless of their accuracy.
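Roughly, that bonus policy can be sketched as below. The proportional scale for the non-top workers is an assumption; only the 100% bonus for the best performer, the 80% floor, and the top-20/30% fallback are fixed above.

```python
def bonus_fractions(accuracies, floor=0.80, hard_batch=False, top_share=0.2):
    """accuracies: {worker_id: accuracy on the salted HITs}.
    Returns {worker_id: bonus as a fraction of base pay}."""
    if not accuracies:
        return {}
    best = max(accuracies.values())
    if hard_batch:
        # Too-hard batch: bonus the top slice of workers regardless of accuracy.
        ranked = sorted(accuracies, key=accuracies.get, reverse=True)
        cutoff = max(1, int(len(ranked) * top_share))
        eligible = set(ranked[:cutoff])
    else:
        eligible = {w for w, acc in accuracies.items() if acc >= floor}
    bonuses = {}
    for w, acc in accuracies.items():
        if acc == best:
            bonuses[w] = 1.0   # best performer(s) always get the 100% bonus
        elif w in eligible:
            bonuses[w] = acc   # assumed proportional scale for the rest
        else:
            bonuses[w] = 0.0
    return bonuses
```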

After we evaluate each batch, we usually post each Turker's accuracy and bonus amount on TurkerNation. Turkers who read those posts can estimate how well they performed compared to the others.

To train the Turkers, we build and maintain a very detailed guideline page, which takes some time to read, and we edit it whenever our task changes or feedback from the Turkers suggests we need to clarify something. Over time the guideline has become so long that it is difficult for the Turkers to reference. To make things easier, in addition to the guideline pages we created a set of example HITs with the correct answers and explanations, and we link to them from the instruction section of our HITs so the Turkers can consult them conveniently.

To further improve the Turkers' accuracy, we expose a very powerful internal feature that we use in the model to help them decide on the ambiguous cases. This internal feature/score is so powerful that some Turkers treat it as a golden oracle and are afraid to make decisions that contradict the score. The pro: we did see a big increase in the accuracy of our Turkers. The con: sometimes a Turker just gets lazy and decides by looking at that feature alone. In most cases, though, our Turkers take the HITs seriously and work through them carefully.
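A simple way to spot the lazy-labeling pattern, sketched here purely as an illustration with hypothetical data structures, is to measure how often a Turker's answer mirrors the exposed score on the salted HITs where that score is actually wrong.

```python
def feature_copy_rate(answers, feature_label, gold):
    """answers: {(worker_id, hit_id): label}; feature_label and gold:
    {hit_id: label}. Returns, per worker, the fraction of feature-wrong
    gold HITs on which the worker simply copied the feature."""
    copied, total = {}, {}
    for (worker, hit), label in answers.items():
        if hit not in gold or feature_label.get(hit) == gold[hit]:
            continue                 # only look where the feature is wrong
        total[worker] = total.get(worker, 0) + 1
        if label == feature_label[hit]:
            copied[worker] = copied.get(worker, 0) + 1
    return {w: copied.get(w, 0) / total[w] for w in total}
```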

2. A multi-tier labeling system

The first tier is the Turkers. Within this tier we use a qualification score to separate them into Turkers and Super Turkers; some of our jobs are only available to the Super Turkers. We usually send out three types of jobs:

a. Qualification jobs: offered to all Turkers who have an approval rate over 98% and have completed more than 500 HITs. These batches usually run in parallel with our HITs for Super Turkers, and we use them to recruit more Super Turkers from the workers who do well on them. These HITs are usually priced at 2 cents each.

b. Super Turker jobs: for Turkers with a qualification (autoqual) score over 8. The majority of our HITs go to these Turkers, and at 3 cents they pay 50% better than the ones for the general public.

c. Super Super Turker jobs: for Super Turkers who have done really well with us and have an autoqual score over 25. We send some of our arbitration batches (batches of HITs that the Super Turkers disagree on) to these Turkers and pay them 5 cents per HIT.

These are only the base rates; in addition, there are always accuracy-based bonuses.
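In code, the tier structure can be summarized roughly as follows; the worker record fields are hypothetical, while the thresholds and prices are the ones quoted above.

```python
# Base pay per HIT in US cents, as quoted above.
TIERS = {"qualification": 2, "super": 3, "arbitration": 5}

def eligible_tiers(worker):
    """worker: dict with hypothetical fields approval_rate (0-1),
    approved_hits, and autoqual (our qualification score)."""
    tiers = []
    if worker["approval_rate"] > 0.98 and worker["approved_hits"] > 500:
        tiers.append("qualification")
    if worker.get("autoqual", 0) > 8:
        tiers.append("super")        # gets the bulk of the HITs
    if worker.get("autoqual", 0) > 25:
        tiers.append("arbitration")  # HITs the Super Turkers disagree on
    return tiers
```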

The Super Turkers' autoqual scores are adjusted based on their accuracy: the good ones get an increase of 1, 2 or even 3, and the bad ones get a decrease of 1 or 2. If their performance is really bad, we simply disqualify them.
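A sketch of that adjustment rule; the accuracy bands are assumptions, since only the step sizes and the disqualification rule are fixed above.

```python
def adjust_autoqual(score, accuracy):
    """Return (new_score, disqualified) after one batch."""
    if accuracy >= 0.95:
        return score + 3, False
    if accuracy >= 0.90:
        return score + 2, False
    if accuracy >= 0.80:
        return score + 1, False
    if accuracy >= 0.70:
        return score - 1, False
    if accuracy >= 0.60:
        return score - 2, False
    return score, True   # really bad performance: revoke the qualification
```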

The second tier is our internal Data Raters. We hire people, usually with at least a bachelor's degree, to do the job on an hourly basis. They can work from home at whatever hours are convenient to them and for as many hours as they want, and we bring them in for training when we see big problems with their labeling results. The Data Raters work on two types of HITs: the HITs the Super Turkers disagree on, and the false positives and false negatives of our models, which we also put on MTurk for them to work on.
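The Data Rater queue is therefore assembled from two sources, which could be sketched like this (the data structures are illustrative, not our actual pipeline):

```python
def data_rater_queue(super_turker_labels, model_preds, current_labels):
    """super_turker_labels: {hit_id: [label, label, ...]} from Super Turkers.
    model_preds / current_labels: {item_id: label} for items the model scored
    and items that already have a trusted label."""
    queue = set()
    # 1. Items the Super Turkers disagree on go to the Data Raters.
    for hit_id, labels in super_turker_labels.items():
        if len(set(labels)) > 1:
            queue.add(hit_id)
    # 2. The model's false positives and false negatives get re-labeled too.
    for item_id, pred in model_preds.items():
        if item_id in current_labels and pred != current_labels[item_id]:
            queue.add(item_id)
    return queue
```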

The third tier is our researchers, data analysts and QA team. They go through the final false positives and false negatives together (usually many of these are very ambiguous cases) and discuss, from their own perspectives, what the labels should be. What we learn from these sessions becomes our new label guidelines and training examples.
