tl;dr - A single bad review has instantly made me drop 0.3. Multiple +9 reviews since Nov 2018 (new system) only made my overall score grow by 0.2.

Hi everyone.

I'm an EN-SPA translator and have worked for Gengo since 2014, having completed +53k jobs and translated +584k words (plus a bunch more for Gengo Projects) during that time.

I pride myself on taking my translations for Gengo very seriously and always delivering the best quality, so I had consistently kept an excellent average score during all these years. When the new review system was implemented, I had a very high one, probably something around 9.6 or 9.8.

It went down to an 8.2, which was disappointing, to say the least. But I understand that things change and the review system is a very tricky, complex matter.

I just checked my history and these are my last few scores, before the one I got a few days ago: 9.5, 9.6, 10, 8.7, 10, 9.8, 9.2, 9.4, 10, 10, 10... However, even if I've been consistently receiving very good reviews, my score has only progressed from 8.2 to 8.4 since the implementation of the new system.

A few days ago, I received a not so stellar review: 8.1 for one of my recent translations. Keep in mind this has been my worst one in months.

Well, my score has now dropped 0.3, going from that seemingly perennial 8.4 to 8.1.

So I had been consistently getting very highly scored reviews, and it almost NEVER affected my overall Gengo Scorecard. I've been trying to heal that number to bring it back to where it was before the implementation of the new system, and it's been impossible.

Now, the moment I got a "bad" review, it *instantly* affected my score, dropping it 0.3 with ONE single review, when it has never grown more than 0.2 after multiple +9 reviews in a row since Nov 2018.

How is this fair, I wonder? I keep reading about the importance of consistency, which I completely understand, but it seems to me this is a highly punitive, stress-inducing and unsatisfactory system that never rewards consistent good performance, but punishes you for occasional mistakes. And we all make mistakes sometimes.

How does Gengo reward that consistency, I ask? What about the proven consistency of translators that have been working for Gengo for years, and have hundreds of thousands of words under their belts? What do I have to do to heal my score? I'm seriously asking, I would love some advice on this.

The pressure is enormous.

## 10 comments

draper

Hi Beatriz,

That's also my impression. Even with many (near) perfect reviews (9.5-10) my average score barely grew. It seems like "bad reviews" are more likely to lower your average score than "good reviews" are to increase it. I know that Gengo has tried to explain how the new review system works but it largely remains a mystery to me! That said, all translators are reviewed using the same system: as long as you are confident in the quality of your work, I wouldn't worry too much about it. :)

Erik

There's a statistical concept called confidence interval. It basically tells you the frequency with which the observed value of a sample (in this case, the score based on the averages of several samples) reflects the actual value (i.e. the hypothetical score if all jobs were reviewed), since there's always an error when working with statistics.

I'm an engineer, and in my field (and any other applications I know about), where I must be pretty pretty sure that my estimations are right, the most common value for the confidence interval is 95% (i.e., there's a chance of only 5% of being wrong when conducting statistical tests). It's possible to use a value of 99%, but I've never seen it in practice. Maybe it's used in mission critical applications such as rocket science.

Well, Gengo is rejecting only our worst rating. This one rating means 2% if you have 50 ratings, and stops making any reasonable sense as you are rated more and more times.

Why don't we stick to the statistical well-accepted-and-proven standard and calculate the score based on the 95th percentile (rejecting the 5% not-so-good ratings)? This way, Gengo can be pretty sure that its quality is very good and we translators can focus on translating instead of trying to figure out what kind of witchcraft is used to generate our score (I can't get the same number when rejecting my worst score, taking the average and subtracting the standard deviation from it, so yes it's not very transparent).
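Erik's proposal can be made concrete with a small sketch. This is only an illustration of the idea of rejecting the lowest 5% of ratings instead of a single one; the function name `drop_worst`, the rounding rule (round down), and the sample scores are assumptions, not anything Gengo has described.

```python
import math


def drop_worst(scores, fraction=0.05):
    """Return the scores with the lowest `fraction` of them removed.

    Hypothetical helper: rounds the number of rejected ratings down and
    always keeps at least one score.
    """
    k = min(math.floor(len(scores) * fraction), len(scores) - 1)
    return sorted(scores)[k:]


# Made-up history: 48 strong ratings and two weaker ones.
ratings = [9.5] * 48 + [6.0, 7.0]
kept = drop_worst(ratings)
print(len(ratings) - len(kept), "ratings rejected")
```

With 50 ratings this rule rejects two of them instead of one, and the number rejected keeps growing with the history. With fewer than 20 ratings it rejects none, so a real rule would probably need a minimum of one.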

gunnarbu

Even if we do not all agree totally on the new consistency score system, I guess it is here to stay and we just have to work hard to keep our quality up. Having said that, I have also noticed that the system has a tendency to lower your consistency score very soon after you get a low score on a review, whereas it seems to take ages before high score reviews result in a corresponding increase in consistency score. Can someone from Gengo please explain why this is? Lara?

Gunnar Buvik

Beatriz

@Erik - Very interesting contribution!

I do agree, I would love to hear from Gengo regarding all these concerns.

This is by no means the first time the new system has been questioned by the community of translators, especially very experienced ones. I'm on the verge of losing access to exclusive collections, even though I've been receiving mostly +9 scores and a handful of 10s since Nov 2018, when the new system was implemented and my score dropped from a 9.6 or 9.8 to an 8.4 (and I've been basically very consistent during my 4 and a half years working for Gengo). How long do I need to be consistent so that, as Gunnarbu points out, I get a corresponding increase in my overall score?

Lara Fernandez

Hi Beatriz,

I've looked into your case and your reviews, and spoken with our Quality Team, and I believe I have a clear idea of the reasons why you've experienced this drop, and also how to eventually recover from it :) This should also answer @gunnarbu's question.

Let me start by reminding you that the current translator score is the result of a two step calculation:

1) Weighted average of your last 10 GoCheck scores: this is important, because it creates the foundation of your score. Please note that it is weighted, meaning more recent and longer jobs will weigh more in the calculation, i.e. this is not a simple average! Please also note that this weighted average changes more easily than the standard deviation calculation, because it only covers 10 reviews.

2) The subtraction of your standard deviation (calculated based on your entire GoCheck history minus your lowest score) from step 1.
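The two steps above can be sketched roughly as follows. This is a minimal illustration, not Gengo's actual formula: the exact weights are not given in this thread, so the weighting used here (recency rank times word count), the unweighted population standard deviation, and the names `weighted_average` and `consistency_score` are all assumptions. Lara notes further down that the real standard deviation is weighted as well.

```python
from statistics import pstdev


def weighted_average(scores, weights):
    """Plain weighted mean of `scores`."""
    return sum(s * w for s, w in zip(scores, weights)) / sum(weights)


def consistency_score(history, word_counts, window=10):
    """Sketch of the two-step score.

    history: GoCheck scores, oldest first.
    word_counts: size of each reviewed job, in the same order.
    """
    recent = history[-window:]
    sizes = word_counts[-window:]
    # Assumed weighting: newer and longer jobs count more.
    weights = [(rank + 1) * size for rank, size in enumerate(sizes)]
    step1 = weighted_average(recent, weights)

    # Step 2: spread of the entire history with the lowest score removed once.
    trimmed = sorted(history)[1:]
    step2 = pstdev(trimmed) if trimmed else 0.0
    return step1 - step2
```

For example, `consistency_score([9.0, 10.0, 8.0], [1, 1, 1])` takes a weighted average that leans towards the most recent score (8.0) and then subtracts the spread of the two scores left after the lowest is dropped.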

Because of these two steps, when you receive a lower score than usual, your weighted average will naturally be the first to drop (more significantly so if the job is larger than the other 9 most recent reviews). To that, we add the deduction of your standard deviation which, taking into account your long history translating for Gengo, your high volume of reviewed jobs, and the fact that the score itself wasn't significantly low, may not have been considerably affected.

As you share, your current consistency score is 8.14. Upon taking a look (and this information is also available in your translator profile) I see that your weighted average of 10 most recent GoChecks (step 1) is 9.26. This means that your standard deviation is very low: 1.12.
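Written out, the figures quoted above fit together like this, using $C$ for the consistency score, $W$ for the weighted average of the last 10 GoChecks and $\sigma$ for the standard deviation (the symbols are just labels for this illustration):

$$
C = W - \sigma \quad\Longrightarrow\quad \sigma = W - C = 9.26 - 8.14 = 1.12
$$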

To summarize, the drop you've seen is due to the change in your weighted average and not your standard deviation itself.

Now - how to recover from this? It's certainly possible to recover by improving the weighted average of your 10 most recent GoChecks :) Of course, this doesn't happen overnight, but as you get more high scoring reviews, and the lower outlier is pushed far back into the calculation, its weight within it will diminish (remember, recent and larger jobs weigh more), allowing your weighted average to increase (provided your scores continue to be high, of course, of which I have no doubt seeing your track record!)
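A small simulation may help picture this recovery. It is only a sketch under assumed weights: each of the last 10 reviews gets a weight equal to its recency rank (1 for the oldest, 10 for the newest), job size is ignored, and the scores are made up. Gengo's real weights are not stated in this thread.

```python
def weighted_average(scores):
    """Recency-weighted mean; the i-th oldest score gets weight i (an assumption)."""
    weights = range(1, len(scores) + 1)
    return sum(s * w for s, w in zip(scores, weights)) / sum(weights)


history = [9.6] * 10 + [8.1]      # ten strong reviews, then one lower score
trajectory = []
for _ in range(10):
    trajectory.append(weighted_average(history[-10:]))
    history.append(9.6)           # every later review is strong again
trajectory.append(weighted_average(history[-10:]))

for step, value in enumerate(trajectory):
    print(step, round(value, 2))
```

Under these assumptions the weighted average is lowest right after the 8.1, climbs a little with each new strong review as the 8.1 slides towards the old end of the window, and is back at 9.6 once ten newer reviews have pushed it out.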

Based on the above, there are two more things that I would like to mention:

1) Under the new consistency score, anything above 7 is considered to be excellent consistency. You have an awesome score, Beatriz, and you're certainly not on the verge of losing access to exclusive translations. (The minimum threshold is a weighted average of 8, and you have a 9.26) I understand it's very hard to look at the numbers and internalize the idea that this is not a simple average measurement of your quality, and that it can be confusing. We're still working to figure out better communications in this regard :)

2) While we understand the translators' perspective when the score can't be recovered immediately, it is precisely this feature of the new system that allows us to better monitor consistency and fluctuations, versus the previous average where highly inconsistent translators could easily bounce back into higher scores that did not accurately represent their reliability (or lack thereof).

Hope this makes better sense -- please keep the feedback coming, as it helps us understand your experience and identify issues that we need to take a closer look at :)

Lara

Alexander

First off, I agree with @draper that one shouldn't worry too much since all translators are reviewed using the same system. As long as better translators get better scores and thus better chances to do the most interesting jobs, I think the system is fair.

That said, there is room for improvement, if only in how the whole thing is communicated. Inspired by what @Erik writes, I would like my score to be presented not by just a single number, but by a statement involving three numbers, e.g.

"Based on your results so far, we are P% confident that your next translation will be worth a GoCheck score of at least C, and our best guess is it will be close to W"

(with P = 95 as Erik suggests, C = consistency score and W = weighted average). A single exceptionally low (or high, for that matter) GoCheck score makes a big difference to the standard deviation (and thus to the consistency score), while the weighted average will change less dramatically. The sore point in the current presentation is that by highlighting the first, it obscures the latter.

Apart from that, how meaningful are these numbers? Imagine a translator who just started and gradually improves. The successive GoCheck scores are 7.0 - 7.1 - 7.2 - 7.3 - 7.4 - 7.5 - 7.6 - 7.7 - 7.8 - 7.9 - 8.0. By intuition, one would feel confident the next score will be 8+. But according to Gengo's math, these GoCheck scores are very inconsistent and the resulting score will only be 7+. A calculation that assumes a linear change of the GoCheck score over time would make more sense.
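Alexander's example can be checked with a few lines. This is a simplified sketch: it uses a plain (unweighted) mean of the last 10 scores and an unweighted population standard deviation, both of which are assumptions, and it adds a least-squares line to show what a trend-based estimate would give instead.

```python
from statistics import mean, pstdev

scores = [round(7.0 + 0.1 * i, 1) for i in range(11)]   # 7.0, 7.1, ..., 8.0

recent_average = mean(scores[-10:])      # simple mean of the last 10 scores
spread = pstdev(sorted(scores)[1:])      # entire history minus the lowest score
score = recent_average - spread

# Least-squares line through (review index, score), extended by one review.
n = len(scores)
xs = range(n)
x_bar, y_bar = mean(xs), mean(scores)
slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, scores))
         / sum((x - x_bar) ** 2 for x in xs))
next_score = y_bar + slope * (n - x_bar)

print(round(score, 2), round(next_score, 2))
```

With these simplifications the consistency-style score comes out at about 7.26, while the fitted line points to about 8.1 for the next review, which is the gap Alexander describes.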

Also, why is the standard deviation based on the entire GoCheck history (minus the lowest score), whereas the weighted average is only based on the last 10 GoCheck scores? I suggest the standard deviation be replaced by a "weighted standard deviation" over the same last 10 GoCheck scores. Caveat: experts seem to disagree what exactly a "weighted standard deviation" should mean. There are several definitions out there. Still, I believe it would be an improvement if Gengo would use any of them.

Lara Fernandez

Hi Alexander,

Thanks for your feedback!

Please allow me to correct a detail based on my explanation above:

"A single exceptionally low (or high, for that matter) GoCheck score makes a big difference to the standard deviation (and thus to the consistency score), while the weighted average will change less dramatically."

This is incorrect. In reality, it works the opposite way: a single exceptionally low (or high) score makes a big difference to the weighted average of the last 10 GoChecks, not to the standard deviation, which changes less dramatically and is weighted as well (albeit taking into account the entire GoCheck history of the translator minus their lowest score).

I agree, however, that the current presentation of the score could definitely be better and more transparent, so that it'd be easier to understand.

I'm passing all feedback shared here (and in other threads, as always) directly to the team in charge, so please keep it coming :)

Thanks,

Lara

Alexander

Hi Lara - Thanks for correcting my error. Good to hear the standard deviation is actually weighted as well, and that you’re passing the remaining feedback to the team in charge.

Erik

I agree with @Alexander: it would be great if the weighted average of our last 10 jobs and the standard deviation were presented in a statement like the one he suggests. It would be a lot more helpful than a lonely number (at least for math geeks like me :D). A graph showing the score trend would be great too, as well as a more detailed entry in the support section explaining all we're discussing here.

Erik

I love how data is presented and explained in the Open Data pages; translators should have something like that as well.