How to assign partial credit on an exam of true-false questions? (terrytao.wordpress.com)
169 points by one-more-minute on June 2, 2016 | hide | past | favorite | 83 comments


As one of the commenters points out, Tao has rediscovered the notion of a proper scoring rule -- https://en.wikipedia.org/wiki/Scoring_rule#Proper_scoring_ru... -- and the specific (rather nice, if you don't mind all the scores being negative and the infinite penalty when something happens that you said definitely wouldn't) logarithmic scoring rule -- https://en.wikipedia.org/wiki/Scoring_rule#Logarithmic_scori....

The Brier score -- you score minus the average squared error between your prediction and [1 for the right outcome, 0 for the others] -- is also a proper scoring rule (i.e., incentivizes you to report your probabilities accurately) and doesn't penalize maximally-wrong answers infinitely. For some purposes it's a better choice than the logarithmic score.
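For a single true/false question the two rules are easy to compare in code. Here is a minimal Python sketch of the binary case, with the Brier score written as minus the squared error on the true outcome:

```python
import math

def log_score(p_true):
    """Logarithmic score: log of the probability you assigned to the
    outcome that actually happened. Unbounded below: log(0) = -inf."""
    return math.log(p_true) if p_true > 0 else float("-inf")

def brier_score(p_true):
    """Brier-style score for the two-outcome case: minus the squared
    error on the true outcome. Bounded below by -1."""
    return -((1 - p_true) ** 2)

# A maximally wrong forecast (you gave the true outcome probability 0):
print(log_score(0.0))    # -inf: the infinite penalty
print(brier_score(0.0))  # -1.0: the worst the Brier score can do
```

Both are proper, but only the logarithmic score can hand out an unbounded penalty.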


The Brier score isn't very principled mathematically though. Squared error was developed for fitting models over real numbers, and doesn't fit very naturally with probabilities.

Log likelihood does allow infinitely negative scores, but maybe it should. If you assign a probability of 0, you are saying it's literally impossible. Not just unlikely, or even really really unlikely.

I prefer geometric mean of probability. It represents the "average probability" you gave to the correct answer, and the number is interpretable. But it's isomorphic to log likelihood, so has all the same mathematical properties.


There's one problem with considering the geometric mean to be the "average probability" which is that the geometric mean of p and q isn't 1 minus the mean of 1 - p and 1 - q. Ideally you'd like the average probability of something happening and the average probability of something not happening to add up to one. Depending on what you do with it this may or may not be a problem.


That's true, but it's not supposed to add up to 1. Likelihood represents the probability you assigned to the actual outcome. So the probabilities of every possible outcome need to sum to 1. But because each question increases the number of possible outcomes exponentially, the probability mass you assign to any specific outcome must become very very small. It also becomes uninterpretable, because it is such a small number.

You can recover the likelihood from the geometric mean of likelihood, by exponentiating it to the power of the number of questions. So if you were predicting 10 coin flips, and the geometric mean was 0.5, you can find the original likelihood by taking 0.5^10. Then if you do that for every possible outcome, the probability will sum to 1.
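A quick numeric check of that coin-flip example (ten 50% predictions that all came true):

```python
probs = [0.5] * 10   # probability assigned to the realized flip, 10 flips
n = len(probs)

joint = 1.0
for p in probs:
    joint *= p       # joint likelihood of the full 10-flip outcome

geo_mean = joint ** (1 / n)   # the "average probability" per flip

print(round(geo_mean, 9))  # 0.5
print(joint)               # 0.0009765625, i.e. 0.5 ** 10
```

Raising the geometric mean back to the power of the number of questions recovers the (tiny) joint likelihood.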


In the traditional formulation, the Brier score is 0 for correct guesses with p=1, 1 for incorrect guesses with p=1 and 0.25 for guesses with p=0.5: https://en.wikipedia.org/wiki/Brier_score#Example

When translated into a grading rule, this gives us a score range from -3 to +1, with 0 for uncertain guesses. So it's not as harsh as logarithmic scoring, but still penalizes confident wrong guesses more harshly than it rewards confident correct guesses.
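One affine rescaling of the Brier score that matches that range (the exact constants here are my own illustrative choice, not from the thread):

```python
def brier_grade(q, correct):
    """Brier score rescaled so a sure correct answer scores +1, a sure
    wrong answer -3, and a 50/50 guess scores 0. (One rescaling that
    matches the range described above; a sketch, not a standard.)"""
    err = (1 - q) if correct else q
    return 1 - 4 * err ** 2

print(brier_grade(1.0, True))    # 1.0
print(brier_grade(1.0, False))   # -3.0
print(brier_grade(0.5, True))    # 0.0
```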


Sounds interesting. The problem with logarithmic scoring is you're penalized for reporting your actual percentages vs gaming the system.

EX: If you're 99% sure of each question and there are 100 questions, then getting 1 wrong gives the best score at 99%. But going from 99 to 100 correct only adds +1 point, while random chance leaves a long tail of missing several. Further, scores are not linearly valuable: trading a lower max score for a higher chance of getting an A is a positive result.

Ideally you should set things up so accurate estimates give the best results.


> Ideally you should set things up so accurate estimates give the best results.

This is true for both the logarithmic score and the Brier score. These are both "strictly proper scoring rules", which is a formal way of stating your requirement that it should not be possible to game the system. In both cases, you get the highest score (on average) if your guess is the true distribution.
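This is easy to check numerically. A sketch using Tao's log2(2q) rule, with true probability p = 0.7 and a scan over reported values q:

```python
import math

def expected_log_score(p, q):
    # Expected score under Tao's rule (log2(2q) if right, log2(2(1-q))
    # if wrong) when the true probability is p and you report q.
    return p * math.log2(2 * q) + (1 - p) * math.log2(2 * (1 - q))

p = 0.7
grid = [i / 100 for i in range(1, 100)]
best_q = max(grid, key=lambda q: expected_log_score(p, q))
print(best_q)  # 0.7: honest reporting maximizes the expected score
```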


The problem is if you're X% sure on each of a set of questions and you want to get an A, you want to minimize your chances of getting a bad grade more than you want to maximize your score. Alternatively, there is no point in setting your odds for any question below the level implied by the minimum number of correct answers needed to pass. EX: If you need to get X% correct to pass, then don't set your odds below X%.

For a similar example, consider that what you bet on the final question in Jeopardy depends not just on your estimate of your odds, but on other factors.

Of course even thinking about this stuff is distracting, you may be better off picking a very small number of odds. Say 99%, 90%, 70%, and 50%.


Retric was distinguishing between "the highest score" and "the best result". A higher score on an exam doesn't always turn into a higher grade in the class. If your outcomes are bucketed, you may be able to increase the odds you fall in the highest bucket by decreasing your probable final score.


I don't see how this follows. From Tao's derivation, if a student believes the probability that an answer is right is p and they put down q, then their expected score is pf(q) + (1-p)f(1-q). If the scoring function is a logarithm, the maximum of this expectation occurs at q=p.


Suppose you rolled a 100-sided die 100 times. What are the odds that a specific face comes up 0, 1, 2, 3, 4... times? Now average the scores over those outcomes to find your expected score.

Second, suppose you only care if your final score is an A. If you say you're 98% positive and get 2 wrong, that's better than saying 99% and getting 2 wrong. Sure, you can get a higher max score by saying 99%, but if you only care about an A then you might as well hedge.


Some students will want to maximize their score. Some will want to maximize their probability of getting an A. Some will want to maximize their probability of not failing. There's obviously no scoring rule you can use that will make all of these be optimized by giving accurate probabilities.

The logarithmic scoring rule (or any other proper scoring rule) means that a student wanting to maximize their expected score is incentivized to give accurate probabilities. I don't think it's reasonable to ask for more.


I agree that it's probably (let's say 60% confidence) impossible to satisfy all of these at once.

That said, if we have to pick just one, I think it's worth asking what information is conveyed by the test. For instance, if all that is reported is "pass/fail", we should pick a scoring rule that maximizes a test taker's chance of passing when they are well calibrated.

Most interesting to me is how different these scoring rules wind up being...


Simply state "The test will be (bell)curved". Now all students have the same goal: perform better than as many other students as possible.


There used to be a Decision Science class at Stanford, where this exact scheme was used. Students were warned all the time to never indicate 100% certainty on any question, because if you ever did that and turned out to be wrong, you would fail the entire course because of that one question: even if it happened to be a minor homework assignment. I always thought this was a great way to teach people the lesson that you should (almost) never claim 100% certainty in anything, and that you should view knowledge through a probabilistic perspective.


Unless you're arguing with someone that will take any missing confidence as sign that you're completely wrong.


That's equivocation; the sort of confidence that people are looking for and critical of lacking is not the same thing as mathematical confidence. Mathematical confidence is basically entirely inhuman; we do not think that way naturally and generally can't think that way even if trained.

There is no contradiction in being 90% confident of something and acting entirely humanly confident in it. Whether it is wise depends too much on context to make a snap decision.


>> "Unless you're arguing with someone that will take any missing confidence as sign that you're completely wrong."

This is 90% of people.


No. You have to stick to your guns that uncertainty is a valid disposition (maybe).


Just say that you're 101% sure, so the score will underflow into a positive score--or at least NaN--if you're actually wrong, and you will also get one bonus point for additional confidence from any person who does not understand math.


Yes but that guy over there says he's 110% certain you're wrong, so I'm going with his answer.


whyyy can't you delete your own posts? bah.


I'm disappointed that you don't score positive infinity for getting an answer correct with 100% certainty. Pass the entire course if you find one question you're really sure of. It would stop the teachers from putting in softball questions.


Probability doesn't work like that. Likelihood has an upper bound at 1, and log likelihood has an upper bound at 0. Likelihood just represents the probability you assigned to the actual outcome. You can't have a probability higher than 1.


You can't have a probability lower than zero, but you're still able to obtain negative infinity points. Why is there an upper bound but not a lower bound?


That's just the way log probabilities work. Log(0) is negative infinity, but log(1) is just 0.

You can have infinitely high scores if you use log odds though: http://lesswrong.com/lw/mp/0_and_1_are_not_probabilities/

But that can only happen if you get every question 100% right. Whereas getting a single question 100% wrong is easy to do.


Why is log probability the correct way to evaluate this?


It's convenient, because multiplying probabilities together quickly gets very very small. Typically smaller than can be represented in floating point even.

Whereas logarithms make multiplication just addition, and they get small very very slowly.

Personally I prefer geometric mean. Which is equivalent to the average log likelihood. It has an upper bound at 1 and lower bound at 0, and represents the "average" probability you assigned the correct answer.
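A small sketch of that equivalence, with hypothetical probabilities:

```python
import math

# Hypothetical probabilities assigned to the correct answers:
probs = [0.9, 0.6, 0.8, 0.7]

geo = math.prod(probs) ** (1 / len(probs))              # geometric mean
avg_log = sum(math.log(p) for p in probs) / len(probs)  # mean log likelihood

print(round(geo, 3))                # 0.742
print(round(math.exp(avg_log), 3))  # 0.742: same number
```

So "on average" this forecaster gave the right answer about 74% probability, which is a number you can actually interpret.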


I think I read a blog post about that somewhere.


Because it takes a lot of skill to be "always wrong" (you would have to know the right answer in order to answer incorrectly). No skill is represented by getting 50% right/wrong, and indeed it scores you 0 points.


I don't think that answers mikeash's question. He is not talking about always being wrong. He is pointing out that the incentives are not symmetric. If you say that you are 100% certain, and you are wrong, then you fail the class. If the incentives were symmetric, if you said you were 100% certain and you were correct, then you would automatically pass the class.

A potential answer is: this asymmetry is similar to incentives outside the classroom. Claiming 100% certainty and being wrong can be disastrous to your reputation, and likely much more negative than the positive benefits of claiming 100% certainty and being correct.


Getting 50% right/wrong while expressing 100% confidence in your answers scores you negative infinity points. Indeed, getting 99.9% right/wrong while expressing 100% confidence scores you negative infinity points. But getting 100% right with 100% confidence only scores you a finite positive number of points.


mikeash was not advocating for a probability higher than 1. Rather, he was advocating for an infinite reward for indicating a probability of 1. We could reframe his suggestion to be, why are the incentives not symmetric? (I think there are good reasons for this, but it's a valid question.)


I was just saying that's the way the math works out. It's not an arbitrary decision, it just happens that the natural way to score probabilities allows negative infinity from a single question.

You could make an arbitrary scoring metric that does whatever you want. But it wouldn't be principled, or have nice mathematical properties like this.


The math was chosen arbitrarily so every outcome is also arbitrary.


That would make the class too easy to game. Just answer one question you feel most confident about, with 100% certainty, and if you happen to be right, you can blow off the entire rest of the course and get an instant A+.


The thing is that every time you drive down the highway, you're often betting your life that the direction you are pointing the steering wheel is correct (X degrees too far and you've hit the concrete barrier). So effectively people choose 100 on a regular basis in real life.


I've had an unfortunate encounter with a concrete barrier when my vehicle stopped traveling in the direction I was pointing the steering wheel. Neither I nor my passenger lost a life. My point is we are also placing side bets (and trust) in crumple zones, seatbelts, airbags, and angled dividers that redirect forward momentum (and car roofs that hold up when upside-down on the freeway). Thankfully those side bets paid off for me and my passenger.


The overconfidence effect [0] is real. I see it in both made up games (90% of the class thinks they'll score in the top 50% on a test, or 90% of people think they're better than average drivers) as well as real situations where people overestimate their ability to meet deadlines. In complex software development projects, this overconfidence has a lot of second and third order effects.

[0] https://en.wikipedia.org/wiki/Overconfidence_effect


> you should (almost) never claim 100% certainty in anything

Sort of off-topic but this is especially good advice when giving testimony. I gave deposition in a patent case once and, as I remember it, I was specifically instructed to qualify every answer with a statement of how certain I was. You can get in all kinds of trouble if you say "X is Y" and it turns out that X is actually Z. But if you say "I believe X is Y," well, no one can argue with that!


Good for a career in politics.


I'm working on something like this right now. We asked 8 trivia questions on a survey of users of our website, and had people assign probability estimates to the probability they got the answer right.

First of all it seems like everyone had exactly the same probability of getting any random question right. Some people got every question right, and some got none right. But in exactly the proportion you would expect by random chance - that some were just particularly lucky or unlucky.

The second finding is that probability estimates did not vary much with the number of questions actually gotten right. Everyone expected to get about 44% of questions right, regardless of how many they actually got right. People who only got 1 right assigned the same probabilities as people who got 5 right.

Likewise, people who estimated higher probabilities of getting questions right got the same number of questions right. And a decent percentage of people were underconfident too, and assigned probabilities too low (but got the same number of questions right).

Lastly, people are really uncalibrated. Some people are just bad at estimating probability. When they say "80% chance of something" they mean that thing will only happen 58% of the time. You can be trained, in a relatively short time, to become calibrated. By estimating probabilities and seeing how many you actually got right. But most people aren't trained, so it would be a bit unfair to put this on a real test.
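Measuring calibration is mechanical once you log stated confidences alongside outcomes. A sketch with made-up data:

```python
from collections import defaultdict

# Made-up data: (stated confidence, whether the answer was actually right).
predictions = [
    (0.8, True), (0.8, False), (0.8, True), (0.8, True), (0.8, False),
    (0.5, True), (0.5, False), (0.5, False), (0.5, True), (0.5, True),
]

by_confidence = defaultdict(list)
for conf, right in predictions:
    by_confidence[conf].append(right)

# A calibrated forecaster's hit rate matches their stated confidence.
for conf in sorted(by_confidence):
    hits = by_confidence[conf]
    print(f"said {conf:.0%}, actually right {sum(hits) / len(hits):.0%}")
# said 50%, actually right 60%
# said 80%, actually right 60%  (overconfident at the 80% level)
```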


Could this be simplified by asking students what they think their score on the test will be and adjusting their final score based on how accurate that estimate was?

Overall it seems like the biggest flaws in this system are that

1: Scores still get mapped to discrete letter grades.

2: A student's goal is not to get the highest score possible, rather it is to ensure that he or she is most likely to get an "A".

For example:

Take a 10-question quiz and a student who knows (in truth) that she is 80% likely to answer each question correctly, and who needs to achieve a score of 0.2 to pass. The student is led to believe that by accurately estimating her confidence for each question at 80% she is giving herself the biggest advantage. If she does this and she gets 4 of the questions wrong, her final score will be -1.219 and she will fail. However, if she had instead underestimated her confidence and given each question a confidence of 0.6, her final score would have been 0.290 and she would have passed. She could of course go even further and determine the likelihood of her getting N questions wrong, and use that to determine the optimal confidence level which maximizes her expected score while ensuring that she is most likely to pass the class.
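The arithmetic in that example checks out under Tao's log2(2q) rule; a short sketch:

```python
import math

def question_score(q, correct):
    # Tao's rule: log2(2q) if the answer is right, log2(2(1-q)) if wrong.
    return math.log2(2 * q) if correct else math.log2(2 * (1 - q))

def quiz_total(q, n_right, n_wrong):
    return n_right * question_score(q, True) + n_wrong * question_score(q, False)

print(round(quiz_total(0.8, 6, 4), 3))  # -1.219: honest 80% answers, and she fails
print(round(quiz_total(0.6, 6, 4), 3))  # 0.29: sandbagging at 60%, and she passes
```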


http://pastebin.com/zahgmVnk

http://octave-online.net/

I put sample code in a pastebin and a link to octave-online if you want to verify how I'm thinking about this.


I suppose if you combine this approach with the theory that nothing is 100% certain you end up with a test that is impossible to get 100% on (which I guess makes perfect sense as nothing is 100% certain). But back to reality, if you set students tests where a 100% confident answer being wrong instantly meant they fail, it feels like you are no longer testing the student on the subject matter in question and instead are testing their ability to play the probability meta-test game.

The task itself of working out how 'confident' you feel about an answer also feels like an impossible task (almost like a non-technical project manager asking a developer how long a task will take). I guess it would be fun to see the results of tests taken using this approach and then compare the scores with the new system, the old system and ultimately what they get as their final grade in the subject.


Fantastic and frankly hilarious scheme he has developed here. My favorite part is that if you state your absolute certainty in true or false and turn out to be wrong, your score is negative infinity. Only in a math class would I expect a score of negative infinity I suppose.


That weirdness is kinda just a consequence of wanting the scores for different questions to be added together. A simpler (to me) but equivalent way to frame it is to use scores that you multiply together. These scores are "the probability of the true answer occurring, according to the student's odds."

That is, if the student gave a probability of 1 and were correct, their score for that question is 1. If they were wrong, they get a score of zero.

Their final score is interpreted similarly, as a joint probability, assuming independent questions.


The SAT used to have a wrong answer penalty to discourage random guessing. This seems to me a simple, effective way to reflect a student's certainty in an answer.


Funny thing is the weights they assigned didn't discourage random guessing at all; they merely made your expected gain from guessing equal with the expected gain from not answering (i.e. zero).

So even if you had the very slightest idea, you should absolutely guess. If you took the test multiple times, then probabilistically you'd be better off if you always guessed even when you had no clue: some of your scores would be higher and some would be lower than if you hadn't guessed, and usually only your highest score is considered. Most guides overemphasized the penalty.

That said, it can screw you of course too if you happen to be unlucky in your guessing. This is what Terry's scheme overcomes.


I remember in high school having no shortage of teachers who recommended not guessing if you didn't know, to avoid the wrong answer penalty. I tried to argue with them initially, then settled for simply trying to re-educate my friends. It's pretty astounding how few people understand probability.


Indeed, if a wrong guess has value -1, and a correct guess has value 1, then the expected value is 2p - 1, which is positive if p is greater than 0.5. Random guessing has p = 0.5, and so gives you a base-line expectation of zero. Thus guessing is not actually punished, as such, and guessing can still be used if you have p > 0.5 for any reason.

What this simple scheme eliminates is credit awarded for random guessing. If correct answers are given weight 1, and all else is zero, then if the student knows 0% of the material, he or she nevertheless falsely obtains about 50% credit.

With guessing punished by -1 weights, someone who knows nothing can expect a result clustered around zero.

This seems fair, because:

* zero knowledge -> score around zero;

* perfect knowledge -> perfect score;

* defiantly wrong answers -> significantly negative score

Knowledge is rewarded, lack of knowledge is neither rewarded nor punished, defiant behavior is punished. This seems academically ideal. :)


As an aside: I’ve seen “defiantly” as an autocorrect error for “definitely” (via *definately) so much now that I almost always assume “defiantly” is an error.


Terry's scheme punishes well educated people who are conservative in their claims of confidence


Why should being conservative in your confidence be rewarded above having an accurate assessment of your confidence?


In the case of an exam, students are under stress and pressure that will often lead them to underestimate themselves, especially knowing that if they're wrong that 100% probability will net them a -infinity score for the exam, even if they get everything else right.

This also means questions have to be very carefully worded. I recall a number of exam questions (over many years of school, not one class in particular) where the "wrong" answers were arguably right depending on how the sentence was parsed just based on poor use of punctuation or other ambiguous wording.


If one were to actually use this scheme for a class, I think it'd make sense to compute both the classical grade and the probabilistic grade and isolate students whose classical and probabilistic grades differed greatly (as in your example of someone who answers everything correctly, but with low confidence). The professor could then exercise manual judgment to assign the appropriate grade and also probably discuss the discrepancy with the student (i.e., they might find it interesting that they consistently underestimated their knowledge of the subject; on the other hand, a student who was overly confident but did poorly might find it interesting that they were consistently overestimating their knowledge).


Or even more to the point: it punishes those who are geniuses at the subject matter, but not so great at probability. Especially under stress, where the system could be seen as akin to gambling.


As it should. If you don't know that you know, do you really know?


That works too, but it's much less granular. You only have two levels of certainty there, with only three possible results: certain and correct, certain and incorrect, and uncertain with unknown correctness. That means you can distinguish between, say, 50% certainty and 10% certainty, but you can't distinguish between 50% certainty and 99% certainty.


True, but how useful is that granularity, really? If you combine the fact that the test has multiple items assessing the same skill, and the fact that they employ (reasonably) sophisticated statistics (Item-response theory) to come up with a probabilistic assessment of the student's true ability given the observed answer pattern, would something like this really buy you a whole lot more?


It's probably not too useful. Ideally it would reduce variance a bit, but probably not enormously. People don't really understand their own level of certainty, anyway, so that would be a confounding factor.

I wonder if it would be a better compromise to allow multiple answers. If I've narrowed it down to two, I guess one and then I have a 50% chance of getting it right. Instead, let me just put down both answers, and if I got it right, give me half a point. I think you'd get the same ultimate outcome as penalizing wrong answers, but with a bit less randomness, while maintaining comprehension. Maybe!


Here's a totally different idea: lets assume that the true-false questions are on a related topic. We can give partial credit if wrong answers are consistent with each other: that is, if there is some (simple) model of the topic we're asking about that is close to the correct model and produces those particular wrong results.

Intuitively, if there are a bunch of questions about the same "feature" and you get them all wrong, all those mistakes stem from the same misunderstanding. I guess in a well-written test where questions are conceptually spaced out this is not as much of an issue...

So to give partial credit, we try to find a simple model consistent with the T/F answers and award credit based on how wrong the model is.

In particular, this would catch the problem where you have a single misunderstanding that cascades into a whole bunch of wrong answers even if you actually understood the rest of the system fine. That's one of the main uses for partial credit in longhand answers, isn't it?

How would we do this systematically? Well, we can't, really. But there are places where we could. I worked with a professor who did research in program synthesis, and I remember he had an interesting idea for education: when a student submits an incorrect program in, say, Scheme, we could try to synthesize an interpreter that would make the program correct and then extract what the student's misunderstanding was. (For example: they used dynamic scoping instead of static scoping.) If you seeded your synthesis system with the various kinds of things students actually get wrong, this could be both useful and practical.

You can apply the same idea to grading a test about Scheme. Award partial credit if somebody has a consistent mental model that just happens to be wrong. If they got a whole bunch of questions wrong just because they mixed up scoping rules—but understood everything else—partial credit seems fair.

I guess this is pretty explicitly rewarding consistency with partial credit, but that also seems fair in a lot of classes like CS.

To be clear, I don't actually think this approach is really practical: it's more of a thought experiment on what could be interesting. Even if it was possible, doing it on T/F questions would likely require a lot of questions since each one only provides one bit of input. If you had questions along the lines of "what does this Scheme program produce", you could get away with significantly fewer if you were clever about choosing them—but you'd still want some redundancy to be at least a bit robust to the student making typos as well as conceptual mistakes.


Interesting derivation, but why not receiver operating characteristic (ROC) curves? In recognition memory research we often collect confidence ratings on a 6-point scale for yes/no decisions, for example, plot ROC curves, and calculate d' discrimination scores.



Nitpick: His name is Terence not Terrance.


Both spellings are equivalent modulo permutation.


I understand the attempt at humor but can not help but point out that the two names are not permutations of each other.


There is nothing in that method that accounts for the possibility that the test may be flawed in some way. I suppose that the assumption here is that the test is 'perfect' in that each question is worded in such a way that there is one and only one 'right' answer. But, as a taker of this kind of true/false test, if I am only allowed to provide my assessment of my own confidence in my answer without an explanation of "why" I assigned that value, then there's no way to answer in a way that I will not be penalized for a question that I don't feel that I can answer as asked.

Consider assigning an additional value of '1' to each question initially as a representation of the author's confidence that the question is not flawed in any way. Then, if a significant segment of the test taking population indicates that they believe that the question is flawed in some way, then, the "author's confidence" for that question would be reduced by an amount that wouldn't penalize me for identifying that flaw while still allowing for less 'partial credit' to those who answer incorrectly without being aware of the flaw, or more specifically, not giving as much credit to those who answered 'correctly' in spite of the flaw.

It's also possible that the question I think is a 'flaw' is really a 'trick' question and that in order to answer 'correctly' you must discover the trick -- this would still allow for a better result in that it more accurately assesses whether I really understand the question or not.

It also occurs to me that if the test administrator wants to give a test where they can assign some manner of partial credit to an answer, then they shouldn't give a true/false test in the first place.


I don't think his scheme buys anything. Any point system in which a wrong answer carries more weight than a right answer will simply incentivize students to aim for the safe confidence level, the one that gives -1 for a bad answer. It takes so few "bad confidence" results to completely wreck your overall score that it is not worth it.

The main flaw of the scheme is that it is a purely mathematical analysis. Answers are not based solely on confidence but often rely on reading comprehension skills. (Even at the pure mathematical level. For example missing a square factor and a minus sign.) So you can have a 100% confident incorrect answer just because you mis-read or mis-interpreted the question. Then you get punished hard. Given one's incapacity for self-introspection and detection of such mis-readings, and the harsh punishment for such undetectable failures on the part of the student, his scheme is wrong-headed.


Why wouldn't your knowledge of how often you misread problems factor into your estimate of the probability that your answer is correct? The scheme is rating how well you answer test questions, not how well you know the answers to the problems. As you point out, there are subtle differences. But no grading scheme on earth could address those subtle differences, so it's not exactly a failing if his scheme doesn't.


I imagine the first test or two would be a period of the students learning to understand the system, but after that it's perfectly fair. It's a good lesson in "You aren't 100% sure of anything, so don't claim you are".


Why fail a student for making one specific mistake once?


Because that will make him more focused on correctly approaching problems instead of rushing for solution without checking immediate work.


The author calls out: [Important note: here we are not using the term “confidence” in the technical sense used in statistics, but rather as an informal term for “subjective probability”.]

But then goes on to use a very technical, statistical version of confidence where 0% confidence is somehow equal to 100% confidence you picked the wrong answer.

All of this leads me to assert that despite the math being internally consistent, it does not apply well to the situation. A rational outside observer without reading such an article would assume that 0% confidence in your answer is equal to a 50% chance of being right.

The phrase "confidence that the answer is 'true'" can easily be interpreted as "confidence that the answer I marked is correct".


It is confidence that the answer is true. 100% means fully sure the answer is true; 0% means fully sure the answer is false. 50% should be used when true and false have the same probability.


This moves the problem to a quiz of introspection, not a quiz of the material


There's a pretty solid case to be made that introspection is missing from a lot of pedagogy anyway, so this isn't necessarily bad. I can see this being useful in many places - when testing on bits of interrelated knowledge, having a confidence score on the test may help people learn to puzzle out information and synthesize conclusions. Both of those skills are more valuable to the student than the ability to dump random facts - we have google for that these days.


That was beautiful. It's been a long time since I've read anything derived as elegantly and clearly described as that. Got any more?


But a possibility of negative infinity raises the question: how certain is the professor of the answer?


A comment on the reddit discussion of this said that a similar scheme was used at CMU for midterm and final exams in the "Decision Systems and Decision Support Analysis" class offered by the decision theory department. The exams were multiple choice with four choices per question. The commenter linked to this handout describing the grading for the midterm: http://www.contrib.andrew.cmu.edu/%7Esbaugh/midterm_grading_...

If you assigned probability p to the correct choice, your score for that question was 1 + ln(p)/ln(4). You were not allowed to assign a probability of 0 or 1 to a choice.
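For concreteness, that rule is easy to sketch in code. This is a small illustration, not the official grading script; the clamping values 0.001 and 0.997 come from the handout quoted below:

```python
import math

def cmu_score(p, choices=4):
    """Score for placing probability p on the correct choice,
    per the 1 + ln(p)/ln(choices) rule described above.
    Probabilities of 0 and 1 are clamped to 0.001 and 0.997,
    as the handout specifies."""
    p = min(max(p, 0.001), 0.997)
    return 1 + math.log(p) / math.log(choices)

# A uniform guess over 4 choices scores exactly 0:
print(round(cmu_score(0.25), 3))   # 0.0
# Near-certainty on the correct answer earns just under 1 mark:
print(round(cmu_score(0.997), 3))  # 0.998
# Putting 0.001 on the correct answer costs about -4:
print(round(cmu_score(0.001), 3))  # -3.983
```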

The handout points out the importance of thinking about how to approach these tests ahead of time, and also explains the benefits of using this scoring system:

---- begin quote ----

I cannot stress strongly enough the need for each of you to sit down and think about different strategies for answering the questions. This grading technique completely removes any benefit of random guessing. Such a guess could be disastrous. You're much better off admitting that you don't know the answer to a question. (Placing a 0.25 probability by every option indicates that you have no idea which answer is correct, and your score will be 0 for that question). Assessments of probability 0 (0%) or 1 (100%) are not allowed. These answers will be interpreted as probability 0.001 (0.1%) and 0.997 (99.7%) respectively. Your probability assessments must sum to 1 (100%). A probability of 0.001 by the correct answer will result in a score of -4. In contrast, a probability of 0.997 on the correct answer only earns a score of 1. Think about the implications of this before the day of the test.

I strongly recommend that you analyze the grading problem from a decision analysis perspective. Calculate expected values (or expected utilities) for various levels of personal uncertainty. Notice what happens if you are overconfident or underconfident.

This grading scheme makes the midterm harder than a standard multiple choice test, but this is the point. It has many benefits from a teaching/learning perspective.

1) It teaches you to apply the techniques that have been discussed in class. You have to assess your own personal probabilities and apply them to problems that have very real (and potentially) important payoffs. It is impossible to get these points across with the few simple lotteries demonstrated during class.

2) It helps to remove the element of chance from the test. Because of the severe penalty for guessing, the test will more accurately measure your knowledge.

3) The test will also measure what you know about your knowledge.

4) By analyzing how you answer the questions, I will be able to determine which questions are hard and which questions you believe are hard. (A "flatter" distribution for a question indicates a lower confidence in the question, and therefore, a belief that it is hard.)

5) It will allow you to appreciate how hard it is to assess probabilities.

---- end quote ----
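The "calculate expected values" exercise the handout recommends can be sketched numerically. Assuming the same 1 + ln(p)/ln(4) rule, if your honest belief puts probability q on each choice, reporting exactly those beliefs maximizes your expected score; reporting with more or less confidence lowers it (the belief vector below is a made-up example):

```python
import math

def score(p, choices=4):
    # Score for placing probability p on the realized correct choice.
    return 1 + math.log(p) / math.log(choices)

def expected_score(belief, report):
    # belief: your true probabilities over the choices (sum to 1)
    # report: the probabilities you actually write on the exam
    return sum(q * score(r) for q, r in zip(belief, report))

belief = [0.7, 0.1, 0.1, 0.1]                              # what you actually think
honest = expected_score(belief, belief)
over   = expected_score(belief, [0.97, 0.01, 0.01, 0.01])  # overconfident
under  = expected_score(belief, [0.4, 0.2, 0.2, 0.2])      # underconfident

# Honest reporting beats both distortions in expectation:
print(round(honest, 3), round(over, 3), round(under, 3))  # 0.322 -0.012 0.189
```

This is exactly what makes the rule a "proper" scoring rule: the expected score is maximized only by reporting your true beliefs, so guessing strategies gain nothing.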


Interestingly, another way of looking at this result is that the score is a measure of the total information content of the respondent's answers.
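This connection can be made concrete with a logarithmic rule normalized so that p = 1 earns 1 mark and p = 1/2 earns 0, i.e. log2(2p) per question (a sketch; the normalization is one common choice, not necessarily the exact constants any given course used): over n questions, the total score equals n minus the total Shannon surprisal, in bits, of the correct answers under the respondent's stated probabilities.

```python
import math

def total_score(probs):
    # log2(2p) marks per question, where probs lists the probability
    # the respondent placed on each question's correct answer.
    return sum(math.log2(2 * p) for p in probs)

def total_surprisal_bits(probs):
    # Shannon surprisal of the correct answers, in bits.
    return sum(-math.log2(p) for p in probs)

probs = [0.9, 0.6, 0.5, 0.99]  # hypothetical exam of 4 questions
n = len(probs)
# Identity: score = n - surprisal. The fewer bits of "surprise"
# the correct answers carry for you, the higher your score.
assert abs(total_score(probs) - (n - total_surprisal_bits(probs))) < 1e-12
```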


Sorry to be negative, but this sounds like a solution to make something which is horribly broken a teeny bit less horribly broken.


I think it's a little extreme to say that the current popular way of quickly assessing large numbers of students is "horribly broken", but I think the most interesting part of this has less to do with student assessment and more to do with forcing the student to think about how well they know things.

A quiz can be a teaching tool as well as an assessment tool. It's often poorly designed for the first function, but that doesn't mean it has to be.



