A colleague and I had an interesting conversation about a quiz question of his. He asked students to put the four following quantities in order from least to greatest:

I don’t think I’ve ever written a problem like this for a test or quiz myself, which may explain why thinking about how to assess potential answers perplexed me so. How, exactly, would you grade this problem?

Suppose the correct order is A > B > C > D. A first attempt to grade this problem might be “One point for every item in the correct absolute position”. So

A > B > D > C

would earn two out of four points, as only A and B are in the correct absolute position, while

B > D > C > A

would earn one point, as only the C is in correct absolute position.

But there are some issues with this approach. This response:

B > C > D > A

earns zero points, since no item is in the correct absolute position. But in some sense this response is mostly correct: three of the items (B, C, and D) are correctly ordered relative to each other. Notice also that, in this scoring system, it is impossible to get 3 our of 4 points: the only possible scores are 0, 1, 2, or 4.

My suggestion was to consider all possible pairs of items and award one point for each of the relationships that the response implies is true. For example the arrangement

B > C > D > A

implies the following correct relationships: B > C, B > D, and C > D. The other three implied relationships are incorrect, thus this response would earn 3 out of 6 points.

The problem with this scoring system is that it is nearly impossible to get a zero! The only response that earns zero points under this scheme is

D > C > B > A

but this answer is compelling in its own way. The correct order is perfectly reversed, which strongly suggests the student knew something about the order of the items.

I like this problem, but I don’t think I”d put it on a test. It’s too hard to assess! But it was fun and mathematically interesting to think about, so maybe I’ll put the question “How would you grade this question?” on a test.

I seems to me I’d grade this by looking at any additional thoughts they put down. Things like “log_3(10) has to be bigger than 2” and “log_2(10)” has to be bigger than 3″ would be really useful to see. Of course, if a student clearly just used a calculator to get the decimal equivalents, that wouldn’t show me much. What else would you look for?

I agree that the reasoning is what’s interesting here. I suppose if I wanted to see that, I’d ask the student to order just two quantities and explain their reasoning.

And no, the students didn’t have calculators here. Although knowing how to use a calculator to evaluate log_2(10) does require some skill!

I love ranking problems, and that was exactly the way I decided to grade them. I wrote an Excel macro to evaluate it for me, including possibilities of equal relationships. My disappointment was along the lines of yours, that it is hard to get really low scores. A random guess on a list without equality will get 50% For that reason, I now tend to use them in class more than on graded assessments.

Right, the problem isn’t that low scores are hard to come by, it’s that the expected score is essentially “half correct”. It’s not clear we could learn a lot from the results.

I agree that this kind of problem would probably make a better in-class activity, maybe a fun group task.

Au contraire! I would strongly recommend the inclusion of such ranking tasks in a test but I would grade the item wholistically and based on the process/explanation the students give not just the final ranking. Thank you for sharing the question itself and the whole post.

T. Nadji

You might also consider grading using the “Levenshtein edit distance” between the two answers, or, better the Levenshtein-Damerau edit distance. The latter counts the number of insertions, deletions, substitutions, and transpositions that it would take to convert one sequence to another. (Pure Levenshtein does not allow transpositions, and this problem is largely about transpositions.) These algorithms are well-understood and may even already be implemented in your programming language of choice.

This algorithm is commonly used in spell-checkers to suggest ‘likely’ substitutions. You could even use this algorithm to grade how ‘wrong’ a word was spelled, by quantifying the number of edits required to make it correct.

Very interesting. It never occurred to me that there would be a metric on words and their potential misspellings, but it makes perfect sense. Although it seems like statistics and conditional probability is really at the heart of those problems.

Very cool. Thanks for sharing, Alan!

By the way, a quick test of using the pure Levenshtein edit distance to grade this type of problem doesn’t really give very good results. Here are the edit distances, assuming “ABCD” is correct:

A better metric might be to use Levenshtein-Damerau edit distance, as it allows transpositions to be taken into account (it’s a bit trickier to implement in a rigorous way that preserves its usability as a metric.)

ABCD 0

ABDC 2

ACBD 2

ACDB 2

ADBC 2

ADCB 2

BACD 2

BADC 3

BCAD 2

BCDA 2

BDAC 4

BDCA 3

CABD 2

CADB 4

CBAD 2

CBDA 3

CDAB 4

CDBA 4

DABC 2

DACB 3

DBAC 3

DBCA 2

DCAB 4

DCBA 4

Obviously, this isn’t a great metric because even values that are totally reversed (e.g. DCBA) score the same as CDAB which at least has two pairwise elements in the correct order. I should implement Levenshtein-Damerau and see if its results make more sense.

I implemented Levenshtein-Damerau edit distance calculations in my programming language Frink ( http://futureboy.us/frinkdocs/ , but the version with this function has not been released yet,) to see if it would give more “reasonable” grades than the pure Levenshtein algorithm. To give reasonable results, I had to eliminate the “replacement” operation so I only allow insertion, deletion, and swaps of 2 adjacent letters. Here are the grades given by that procedure, again assuming that ABCD is the correct answer:

ABCD 0

ABDC 1

ACBD 1

ACDB 2

ADBC 2

ADCB 3

BACD 1

BADC 2

BCAD 2

BCDA 2

BDAC 3

BDCA 3

CABD 2

CADB 3

CBAD 3

CBDA 4

CDAB 4

CDBA 4

DABC 2

DACB 3

DBAC 4

DBCA 4

DCAB 4

DCBA 5

These are interesting and have good properties: they allow all of the scores from 0 (all correct) to 5, with the only answer scoring 5 to be the total reversal DCBA. If only one pair are swapped (like BACD), the score is 1.

I think that this scoring method would be acceptable for use as a grading system.

Since there is renewed interest in this problem (see http://fawnnguyen.com/scoring-an-ordered-list/ ), I’ll note here that my programming language Frink ( https://frinklang.org/ does now have a Levenshtein-Damerau edit distance function, and even one which can allow you to disallow simple replacements, as outlined above.)

There’s an example program that produces the above output, (which disallows replacements) and can be extended to any number of parameters:

https://frinklang.org/fsp/colorize.fsp?f=gradingProblem.frink