Notes on AI Bias

twa927 · on April 24, 2019

> Until about 2013, If you wanted to make a software system that could, say, recognise a cat in a photo, you would write logical steps. You’d make something that looked for edges in an image, and an eye detector, and a texture analyser for fur, and try to count legs, and so on, and you’d bolt them all together...

I'm doing a lot of such algorithms (well, not for images). Does someone know if such algorithms have a name? I'm calling it "heuristics" and I think it falls under "AI".

voidifremoved · on April 24, 2019

A while ago, Google photos autogenerated a video for me from my photo library. It was about a minute long, stitched together dozens of photos, called "dog video", and with a horrifying yapping dog soundtrack.

Every single photo was of a cat.

I have to say I was humbled by the amount of human and computing power that had gone into developing this system over the years, that could achieve such a complicated, impressive technical feat, without requiring any effort or money on my part, and yet also be 100% wrong.

Bartweiss · on April 24, 2019

> be 100% wrong

This really is quite impressive. It's rare for humans to do worse than random guessing on tasks, and they almost never do much worse. There's something almost charming about the ability of AI to put real effort into actively avoiding correct answers.

pixl97 · on April 24, 2019

Really it sounds like an error somewhere else, rather than the AI system. More like the ID for cat and dog were switched.

Bartweiss · on April 24, 2019

If this is the Google Photos folder system, I suspect the problem was that the IDs and the bucketing were de-linked.

Photos creates folders for you based on identified themes, and then adds new photos to them as they're taken. I haven't checked, but I'm guessing it doesn't relabel existing buckets to avoid causing confusion. And I'm not sure whether bucketing is done by assessing theme or similarity to other photos in a folder. If it's the latter, the system could have hit the confidence threshold to make a Dog folder out of a few images, then ceaselessly dumped similar-looking photos (i.e. cats) into that bucket.

jofer · on April 24, 2019

For the specific example given there, I'd say it's most often called feature engineering. I'd also argue that it's a lot more necessary than most people think, but I'm probably just being stodgy and am biased by working in relatively narrow domains.

Calling it "feature engineering" implies it's still being fed into some sort of trained classifier to make the final decision, though.

What you're describing of your own work might better fall under the broad category of an "expert system".

piker · on April 24, 2019

Expert systems? https://en.wikipedia.org/wiki/Expert_system

twa927 · on April 24, 2019

I think expert systems consist of a "rule engine" where rules can be added dynamically?

microtherion · on April 24, 2019

I kind of like the tongue in cheek moniker "GOFAI" (Good, Old-Fashioned AI), though that is applied more to symbolic AI https://www.cs.swarthmore.edu/~eroberts/cs91/projects/ethics...

_mlxl · on April 24, 2019

Maybe image segmentation? In my AI class it was referred to as image segmentation and edge detection (interchangeably)

https://en.wikipedia.org/wiki/Image_segmentation

chobeat · on April 24, 2019

I call these approaches: "there must be OpenCV in there somewhere"

bhl · on April 24, 2019

Heuristic algorithms using hand-crafted features.

frankbreetz · on April 24, 2019

First order logic rule-based system

mv4 · on April 24, 2019

this is similar to bag-of-words models

https://en.wikipedia.org/wiki/Bag-of-words_model_in_computer...

layoutIfNeeded · on April 24, 2019

I would call it “classical” machine learning.

twa927 · on April 24, 2019

Hmm, I think there's no "machine learning" here. There's a human hard-coding some thought process, using mostly some simple statistics/thresholds to e.g. define what a "fur texture" looks like.

pedrosorio · on April 24, 2019

Machine learning was extensively used in image processing before 2013 / deep learning.

The main difference is that you’d write code to extract features from the image and then learn a model using those features (as opposed to using the pixel data directly and learning a model from that as in CNNs).

As an example, you wouldn’t necessarily write code for “fur texture” but instead would extract histograms of pixel brightness gradients and feed those (along with other things) to a machine learning algorithm. In this example, fur texture would generate a different histogram (to be used as a feature) than skin texture.

https://en.m.wikipedia.org/wiki/Histogram_of_oriented_gradie...

twa927 · on April 24, 2019

Ok, so this depends on what algorithms are used for the feature detection ("edges in an image, and an eye detector, and a texture analyser for fur"). I'm guessing hand-coding an algorithm for detecting edges in an image can be done successfully, but it looks much harder for "an eye detector", so it needs "machine learning"

What I meant when asking for a name of an algorithm class are algorithms where the feature extraction is done using hand-coded algorithms.

layoutIfNeeded · on April 24, 2019

You can call them “handcrafted decision trees” then.

fvdessen · on April 24, 2019

> Since Amazon’s current employee base skews male, the examples of ‘successful hires’ also, mechanistically, skewed male and so, therefore, did this system’s selection of resumés. Amazon spotted this and the system was never put into production.

Couldn't they have retrained the system with a 50/50 mix of males / females resumes ? Or restrict the use of the algorithm to sort male resumes ? Or maybe resumes don't actually correlate at all with success in Amazon ...

DuskStar · on April 24, 2019

One situation I could see leading to this result (Amazon cancelling their resume filtering software with the excuse that it 'skewed male') is that

1. The AI system accurately predicted employee success across both genders

AND

2. The AI system predicted that women would do worse than men

That's politically embarrassing and something that you can't necessarily 'fix' by improving the system. (see: all the 'will this person commit a crime if let out on parole' systems that end up accurately discriminating based on race)

This isn't to say that women are worse engineers than men, or anything of that sort - only that the applicant pool to Amazon was skewed, or women were treated worse in the workplace and thus performed worse, or a dozen other possible causes. (And only in this hypothetical scenario! I have no inside info from Amazon!)

gizmo686 · on April 24, 2019

Your example is quite possible, particularly at an organazation that would be embarrased by such a result.

Assume that the ability curve of male applicants and female applicants are identical; that the majority of applicants are male; and that Amazon wants to hire more females then would be expected given the portion of applicants that are female.

A natural way of accomplishing this goal is to give extra points to female applicants [0].

Due to selection bias, the ability curve of women within the population of Amazon engineers would skew lower then men within the population of Amazon engineers.

This is a special case of a more general phenomona. If you have signal S that is positivly correlated with a desired trait in the general population, and over select for S, you will find that S is negativly correlated within your population.

[0]. All proposals I have seen amount to either a good approximation of this or changing the applicant pool. And, by assumption, the latter is excluded.

Bartweiss · on April 24, 2019

In this case, it appears to instead be a matter of journalists focusing on totally the wrong aspect of a story for more drama. Buried deep in the original Reuters piece is this offhand mention:

> Gender bias was not the only issue. Problems with the data that underpinned the models’ judgments meant that unqualified candidates were often recommended for all manner of jobs, the people said. With the technology returning results almost at random, Amazon shut down the project, they said.

Apparently the recommendation system really did create gender bias, neither inherited from real differences nor from replicated human biases. (It looks like an issue with mismatched training data and task.) But that initial bias was found and corrected (2015) more than a year before the project was cancelled (2017) for providing "random" results. I think this is the most extreme case of algorithmic bias I've ever seen, but also the least commonly relevant; Amazon appears to have built a model which contained almost no rules except sexism, and scrapped it for not knowing anything worthwhile.

https://www.reuters.com/article/us-amazon-com-jobs-automatio...

DuskStar · on April 24, 2019

That is certainly another plausible explanation - and a less culture-war infused one, too. Thanks!

lotu · on April 24, 2019

This is feels like an elephant in the room when it comes to AI bias. We develop an AI that accurately predicts outcomes and discover it is biased, then instead of asking if maybe this means our current system is deeply biased and needs to be changed, we say, "don't use the AI; keep using the people who might or might not be biased but we don't know because we can't measure it in the way an AI can be measured."

If it isn't acceptable to use an AI to create biased outcomes how is it acceptable to use people to create the the same outcomes. AI decision making can be examined and tuned in ways that people cannot.

meheleventyone · on April 24, 2019

The problem is that AI and more generally 'algorithms' are or were presented as neutral and unbiased. As such their biased results prop up a biased system.

I don't think people are against using ML and for biased human systems. Just pointing out the ignorant, naive and lazy deference to computers that often occurs in human systems that share the same bias.

In short I'd think most people who are against biased AI are also against biased human systems for very similar reasons.

DuskStar · on April 24, 2019

Of course, sometimes reality is also biased, and the AI systems are just accurately reflecting reality. And that's an even bigger elephant.

meheleventyone · on April 24, 2019

I’m not sure what that even means if we know we can bias outcomes. Pretending there is some kind of natural state that is for the sake of being natural preferred seems odd given humans propensity to change the world to suit. I also suspect for many that ‘reality’ is really just a dog whistle for their preferred biases. Not to mention the entire issue with deriving and ought from an is.

wongarsu · on April 24, 2019

Suppose you train an AI to predict how good people are at weight lifting, trained from a bunch of seemingly unrelated data (maybe you want to hire bouncers or construction workers). You will find that the model predicts better performance for males. You notice this, identify that men are more likely to go to the gym than wimen, and modify your data to compensate for this. But when you rerun the model men still show better results. You find some other biases in your data. You find societal biases, like role models for girls not being physically strong. You even take some women and show that with training they outperform average men.

You can modify reality, but our understanding of biology - especially hormones - clearly tells us that the AI was right: men are generally better than women at weight lifting.

I'm not saying that every issue is like that, but it would be foolish to ignore that sometimes reality is biased, sometimes in obvious ways and sometimes more subtly.

meheleventyone · on April 25, 2019

What I was getting at is that our important choices are about outcomes and those have nothing really to do with assumptions about reality. For example all should be equal before the law. A statement that is supposed to be true but very obviously isn’t.

Your post is great for the assumptions it encodes. Like what does it mean to be good at weight lifting? And that for some reason being good at weight lifting is a good proxy for being a good bouncer or construction worker?

For an off the cuff example it’s a great way to demonstrate the sort of bias we can naively introduce then defend because it’s just ‘reality’. When really it’s much more complex than identifying a relevant trait and assuming everything else falls out of it.

Chris2048 · on April 25, 2019

> Like what does it mean to be good at weight lifting? And that for some reason being good at weight lifting is a good proxy for being a good bouncer or construction worker?

being a good weight lifter, means you can lift heavier weights then a less-good weight lifter. Whether this is a proxy for anything isn't relevant, because it's a purely contrived example. There are clearly jobs where physical strength (among other things) is important, and given the context of this example, there is no guarantee that a more complicated model evens out the differences.

The point of the example is, basically "there are some things which might discriminate strongly on the basis on physical traits, which might end up correlating with race/sex etc" - ask for a better model by all means, but there is no guarantee the perfect model will never correlate strongly with some political demographic, and hence be controversial.

meheleventyone · on April 25, 2019

Those were rhetorical questions I fully understand what the point of the example was.

Chris2048 · on April 25, 2019

but then it doesn't encode assumptions. Metaphors are supposed to be imperfect.

"bias" and "reality" can equally cover for model simplicity.

meheleventyone · on April 26, 2019

Of course it encodes assumptions whether imperfect or not.

Chris2048 · on April 26, 2019

Does a water-pipe analogy for electrical current encode the assumption that electricity is made of water?

btilly · on April 24, 2019

One major problem.

The parole software was NOT being fed data for "will this person commit another crime". It was being fed data for, "will this person be a suspect for another crime".

The significant difference is that selective enforcement biases the data that it was trained on. Said selective enforcement has multiple causes, including the fact that heavier patrolling in black neighborhoods makes catching crimes more likely.

The size of the selective enforcement bias shows in a number of ways. For example consider drugs. In surveys, the usage of illegal drugs is the same in blacks and whites. And yet 6 times as many blacks are arrested for using illegal drugs as whites.

Chris2048 · on April 25, 2019

Which represents ground truth better? arrest records, or survey results?

btilly · on April 25, 2019

For this? Probably survey results. Particularly https://nsduhweb.rti.org/respweb/homepage.cfm.

munchbunny · on April 24, 2019

I think this retelling of the story is over-simplified. It's a compelling story, but I don't know any competent engineers who give up on a whole project because of one setback. If this system never saw production use, it was because it's still not ready, or there were many other issues that aren't mentioned that led the team to give up, or because political winds shifted. Amazon is famous for killing projects quickly.

duxup · on April 24, 2019

It does make you wonder how much AI will be .. AI and how much guidance for desired outcomes humans will give it.

Humans are pretty happy to create nonsensical results if it fits their goals... especially if it befits them. I wonder if with AI we do that to the point that it is somewhat irrelevant.

manfredo · on April 24, 2019

The whole problem with allegations of AI bias is that people often point to disparities of outcome as proof of bias. The reality is that there are plenty of disparities on outcome regardless of bias, and the allegations of bias and attempt to rectify the alleged bias is another vector for the introduction of bias.

deogeo · on April 24, 2019

Sounds like an extraordinarily poor AI system if it depends on absolute numbers, and not per capita. And wouldn't the number of unsuccessful hires also skew male?

jerf · on April 24, 2019

"Sounds like an extraordinarily poor AI system if it depends on absolute numbers, and not per capita."

To some extent, you're bringing in your human bias to prefer human biases when you make that statement. We humans have a hierarchy of important attributes, and for various reasons believe race and gender are more important than eye color or height. But the machine learning algorithm just gets a multidimensional point in hyperspace. It doesn't, a priori, "know" that it needs to do a "per capita" adjustment based on FIELD_1 any more than it knows it needs to do a per capita adjustment on FIELD_2. And you can't "adjust" on all the fields because that'll just cancel out.

We are also in the weird position of wanting the machine to do adjustments based on FIELD_1, but without us having to actually admit to ourselves that we're doing it. From a technical perspective, probably the best answer is to do a straight-up training based on the data, then have an cleanly-separated after-the-fact cleanup process to perform whatever social adjustments it is we want on the outcome. But nobody is willing to admit that's what we want, and to put those adjustments down on paper in the form of code, because the instant they're concrete, pretty much everybody is going to decide they're wrong, and no two people are going to agree on the manner in which they are wrong, and an epic, national-front-page-news shitstorm will ensue. So here we are, trying to make adjustments without making adjustments, or, alternatively, trying to make adjustments in a place where we can blame the AI rather than humans.

(The ironic thing is that because we can't admit what we're trying to do, we're going to end up doing a really poor job of it. Tools will be applied haphazardly, the results can't be measured except very grossly at the very end of the process, and the goals won't be obtained and the system is always going to be quirky and weird. If we could clearly declare what it is we actually wanted, it would be fairly easy to get it from the AIs.)

Bartweiss · on April 24, 2019

The basic "resumes skewed male so the algorithm did too" explanation appears to be incorrect. But it's found in the original Reuters story and most derived stories, and finding it here implies it's reached the level of urban legend.

Going by the details of the Reuters story and several others, it appears that what actually happened was a training/task mismatch. Amazon wanted an algorithm to do resume discovery, which recruiters would run and get quality predictions as they viewed resumes. But they trained it on resume results, giving it past resumes which had been submitted to Amazon and telling it to seek similar resumes. None of the stories make it clear if there even was negative training data; it looks like the tool was simply told to compute degree-of-similarity to past inputs, and possibly told to prioritize resumes which were ultimately hired.

As a result, the tool was trying to convert a relatively gender-neutral pool (resumes found online) to a skewed one (Amazon applicant resumes), and did so by weighting gendered terms. It also seems to have underweighted technical terms, failing to appreciate them as mandatory or strictly position-specific.

The developers were sufficiently aware of that to catch and correct the known gender biases (e.g. devaluing women's colleges or the literal word "women's"), but were scared there were other uncaught biases. And the results were apparently terrible all around, so the tool was scrapped. Which is pretty much what you'd expect from something trained on exclusively positive, sample-biased examples. The story has been seriously distorted, but the real plan also seems terrible...

theoh · on April 24, 2019

Consider the possibility that the (pre-AI system) probability of success for a female applicant is the same as the probability of success of a male applicant. You could make a "per capita" quota as a kind of goal. That's not a problem, but how would you make sure the quota was met?

The typical AI system doesn't work on the basis of selecting candidates entirely at random, pro rata, in order to meet a quota. It works on the basis of criteria for success. One thing it might learn (unfortunately) is that most posts at the company are filled by men.

cygaril · on April 24, 2019

From a machine learning point of view, one can just add the constraint that the probability of being in the "yes" bucket is that same for both male and female candidates. Doing this will give a worse fit than an unconstrained optimization, but it is fairer.

More sophisticated approaches are possible.

theoh · on April 24, 2019

There's no "just" to any aspect of this topic. I think what you are talking about is what is sometimes called "classification parity", and there are problems with it, and with everything else we've come up with to combat bias.

https://arxiv.org/abs/1808.00023

bumby · on April 24, 2019

Or couldn't they provide data augmentation on the same samples to give the effect of a more diverse (and more populous) training set?

Using the blog's skin cancer example, couldn't the labelled images be augmented by altering the skin tones and adding these new examples to the training set?

It seems to me that some of the anomalous results discussed in the article are actually the result of poor model design or poor pre-processing data choices. We can't just throw anything to any ol' machine learning model and expect it to be magic

TheRealPomax · on April 24, 2019

maybe, but this might also just be someone unwilling to commit to the sunk cost fallacy. You can spend time and money fixing it, or you can cut your losses and just stop trying to automate something that probably didn't need full automation to begin with.

Bartweiss · on April 24, 2019

This story has been constantly misrepresented, because Reuters absolutely botched their initial report. Amazon was never building a tool to decide which interviewed candidates to hire, they were building a tool for discovering candidates. It was biased, but that gender bias wasn't the proximate reason for scrapping the tool.

As far as I can tell from later stories (e.g. 1, 2), what Amazon actually did was build a tool to show recruiters 'quality' predictions for all resumes, for instance as they scrolled LinkedIn. But they trained it on resumes submitted to Amazon for various positions, possibly also adding weight to resumes which produced hires.

In which case the problem is painfully obvious; the system effectively had no negative training data, and its positive examples (submitted resumes) didn't actually match the desired output (qualified resumes). It was computing degree of similarity between a gender-neutral-ish pool (resumes posted online) and a gender-skewed pool (resumes submitted to Amazon), and tried to make that conversion with whatever data was available - like devaluing resumes that mentioned women's colleges. (This wasn't just a proxy-variable thing, the model essentially learned to weight on gender.) Amazon's team apparently caught this issue and did the usual things like blinding on those words. But they were scared of uncaught factors; reading between the lines, they were unable to "detrain" biases like neural nets do because their dataset and task didn't match.

Ultimately, the tool was apparently scrapped because it made selections "almost at random". Which, again, isn't exactly surprising in light of the absolutely bonkers choice of training examples.

[1] https://www.aclu.org/blog/womens-rights/womens-rights-workpl...

[2] https://www.ml.cmu.edu/news/news-archive/2018/october/amazon...

eanzenberg · on April 24, 2019

That wouldn't matter if the KPI (worker performance) being predicted, which is inherently biased as well, was distributed differently among the balanced pool of applicants.

FakeComments · on April 24, 2019

[flagged]

TheCoelacanth · on April 24, 2019

I highly doubt that the preference for male candidates was the only problem with the AI. The preference for candidates with ice hockey on their resume almost certainly also would have resulted in a preference for white candidates.

FakeComments · on April 24, 2019

Sure, my point is why does that matter?

The correlation with ice hockey could be a career relevant detail because of correlation with discipline and pain tolerance. If it tracks all such signals, I don’t see the problem.

That it also has a circumstantial correlation with races isn’t inherently problematic — it could be that different groups of people are differently qualified.

There seems to be a position that culture is uncorrelated to job success, and that’s just nonsense. When people can show me these biases are correlated to race or sex once controlling for culture, then we have a problem.

michaelmior · on April 24, 2019

Race and culture are two very different things. There certainly can be some correlation, but drawing conclusions of suitability for employment based on race is a problem.

FakeComments · on April 24, 2019

I agree — race and culture are very different things.

That’s why it’s only meaningful to draw these conclusions once you control for cultural variation.

No one has shown that they drew conclusions based on race: they drew conclusions based on written words, which correlated with race. The claim is that this demonstrates racial discrimination.

However, an equally valid explanation is culture as a confounding variable correlated with both race and hiring, and which is the actual causative factor.

My point is that no one has shown this is hiring based on race, because the only things being judged are cultural artifacts. It’s merely presumed that anything correlated with race is inherently racist, when that’s not true.

No one has even remotely tried to address that, including yourself, preferring to merely assume it’s racism or special pleading like “well, there is some correlation — but without doing the math, I know that can’t be it!”

(All these same point apply to sexism.)

jakelazaroff · on April 24, 2019

"X correlates with race" and "X correlates with culture, which correlates with race" are the same thing. If neither X nor culture are related to what you're trying to measure, it's still just racial bias.

We've been using "ice hockey" as an example, right? Given that it's not obviously correlated with fitness for a job at Amazon, it's incumbent on you to show that it actually is. Until then, we should assume that it's at best a false positive from unrelated data (like the ruler in the skin cancer example) and at worst a proxy for race and gender.

michaelmior · on April 24, 2019

You're right, I didn't try to address it. I also didn't make any assumptions of racism in my comment. However, generally speaking, I don't think one culture is better than another either and discrimination based on culture is also a problem. (I understand that this is an opinion and many would disagree with me on that point.)

FakeComments · on April 24, 2019

[flagged]

dang · on April 25, 2019

If you keep doing ideological flamewars on HN we will ban you. Also, if you keep doing personal attacks on HN we will ban you.

Please review https://news.ycombinator.com/newsguidelines.html and follow the rules when posting here from now on.

FakeComments · on April 27, 2019

Again, I wish you would admit to ideological censorship:

You’re simply using etiquette as a guise for people who don’t want to have an honest discussion about the relevant philosophy (or in this case, the actual logic involved in the inference) to shut it down in favor of group think.

This problem is so widespread among the tech community, Congress is talking about it.

Just look at how fast this was downvoted and flagged — the point is to silence wrongthink.

Censorship for etiquette has always been used as a tool of political coercion — and I simply don’t take your vague claims I broke decorum, which always target one side of a sensitive issue, seriously.

You can absolutely use your power in your fiefdom; but you can’t use it to make me not talk honestly about issues — only remove me.

You’re just a political thug, who doesn’t like honest, blunt language.

Show me.

Ed:

There’s hundreds of comments like this without a peep, because they’re rightthink:

https://news.ycombinator.com/item?id=19763553

dang · on April 27, 2019

"Censorshop for etiquette" is not even close to being the active ingredient here. I hear that it looks that way to you, but what's actually going on when we moderate threads is so different, that I wonder how to convey the difference. Any ideas?

Let me put it this way: I wonder how to convince you how little we give a shit about decorum. What a miserable life it would be to spend one's days enforcing something that...limited. Such a concept appears nowhere in the site guidelines or in its moderation history.

It's pretty striking to hear "You’re just a political thug", and I can certainly understand why your comments have so much force in them if you feel that way. When you say we "always target one side of a sensitive issue", though, that's so factually inaccurate that I'm not sure what else I can do that would help.

TheCoelacanth · on April 24, 2019

Good luck with that argument in court.

turtlecloud · on April 24, 2019

Just remove the gender/sex as variables for the AI and maybe name too. Preprocess the resumes to remove them. Now you remove the majority of gender bias for the AI.

gizmo686 · on April 24, 2019

AI is really good at infering information. If gender is a real signal, it would be very difficult to filter the input such that it is not making a determination by what could be reffered to inferred gender.

gambler · on April 24, 2019

>The most obvious and immediately concerning place that this issue can be manifested is in human diversity.

I swear, when someone starts building autonomous killer robots, the first set of concerned articles will probably be asking whether robots were properly trained to target all genders and races with equal accuracy. This is not a sensible way to approach AI ethics.

>It was recently reported that Amazon had tried building a machine learning system to screen resumés for recruitment. Since Amazon’s current employee base skews male, the examples of ‘successful hires’ also, mechanistically, skewed male and so, therefore, did this system’s selection of resumés.

There is nothing "mechanistic" about this. It depends on how you select sample resumes and how you split them between "good" and "bad" labels.

I worked on a similar thing as an "encouraged" side-project at a certain company. Except I realized from day 1 that using AI on resumes is a bad idea and aimed to show this with data. My model was aiming to detect people who will quit or get fired within first 6 month (with the intent of lowering them in priority for interviews, supposedly). It miraculously achieved 85% accuracy... by figuring out how to detect summer interns.

Framing this problem as "bias" and especially hyper-focusing everyone's attention on diversity aspect of it is extremely irresponsible. (I'm not saying that's what the author is doing, but that's definitely what's being done at large.) Fundamentally, there are significant higher-level problems with using statistical ML models for things like hiring or crime prediction.

Bartweiss · on April 24, 2019

That intern story is excellent; I'm adding it to my bank of "weird AI tricks" like pausing Tetris to avoid losing.

More topically, you're quite right to object to that Amazon reference. As far as I can tell, the real story is even worse than mislabeling. Amazon devs wanted a system to spot candidates in resume banks, so they trained it to recognize resumes similar to the ones submitted to Amazon in the past. The entire dataset was 'positive', and output degrees of similarity instead of classifications. Amazon applicants are mostly male while the pool was presumably 50/50, so that was learned as an element of "Amazon-candidate-ness".

That's also an interesting story, but from the first publication (in Reuters) it's been framed as an uneven base rate 'inevitably/predictably/mechanistically' producing a biased result. Which is not only untrue but downright backwards, since it implies that the rate in the general data is what matters, rather than the relative rate between samples and positive classifications. It's yet another variant of the mammogram base rates question, and I wish people would stop trying to reinforce the incorrect answer to that.

john-radio · on April 24, 2019

> That intern story is excellent; I'm adding it to my bank of "weird AI tricks" like pausing Tetris to avoid losing.

Post your bank! Let's be like Magnus Carlson and occasionally ask ourselves, "What would DeepMind do?"

Bartweiss · on April 25, 2019

Oh man, good question. I'm always up for swapping these stories. A lot of these came from a paper on weird AI tricks, and resulting best-of list on a blog collecting these stories.[1][2] Suffice to say, the people who think the orthogonality thesis is a weird hypothetical aren't keeping up with the state of things.

- The aforementioned Tetris story: an undirected learner was set to maximize score at Tetris learned normal play techniques, but also learned to pause the game immediately before losing so that the score wouldn't "decline" at game over.

- In the same vein as interns quitting, proxy detection of all sorts. Identify "field with sheep" by finding green fields with grey skies, or letting heuristics like "humans pick up dogs and cats" override correct identifications. (It's a goat until you pick it up, then it's a dog!)

- An agent playing Q*bert found a known bug for infinite lives, then escalated to an unknown bug which disabled the game while overflowing the score counter.

- Agents in a physics sim tasked with jumping as high as possible instead learned to 'fly' by abusing collision detection bugs, hitting themselves in ways that created upward momentum.

- Another "maximize jump height" task demonstrated that "highest" is an extremely fuzzy term. Initially measured by highest point, they became incredible tall. Measured by lowest point, they stayed tall and grew topheavy to 'kick' their base upwards.

- Number-handling bugs of all kinds. In one case, small twitches led to floating-point errors that created energy. In another, a "minimize force" task got solved by maximizing force and triggering integer wraparound.

My personal favorite is an adversarial bug. An agent playing tic-tac-toe on an infinite grid with a time limit submitted extremely remote moves which caused timeouts/crashes in any agent that tried to model the full board.

[1] https://arxiv.org/pdf/1803.03453.pdf

[2] https://aiweirdness.com/post/172894792687/when-algorithms-su...

moomin · on April 24, 2019

You’ve just read a long article that covers many aspects and zeroed in on your own hobby horse. You say there’s significantly bigger issues, but you don’t actually talk about that. Instead you talk about the thing you just said you didn’t think people should be talking about. There’s some serious projection going on here.

joshuamorton · on April 24, 2019

>Framing this problem as "bias"

Except that's exactly what it is. Much as your model was biased against interns.

> and especially hyper-focusing everyone's attention on diversity aspect of it is extremely irresponsible.

Why? Pointing out a specific and concrete harm badly designed ML models cause is irresponsible? Just because the same kind of methodological flaw can cause other harms its irresponsible to use a motivating example?

gambler · on April 24, 2019

>Why? Pointing out a specific and concrete harm badly designed ML models cause is irresponsible?

In my opinion, yes, if it leads most readers to misjudge some fundamental properties of the problem as a whole. Again, I'm not saying this article is guilty, but most are.

joshuamorton · on April 24, 2019

> In my opinion, yes, if it leads most readers to misjudge some fundamental properties of the problem as a whole.

Which problem? The general statement of this problem is "models, trained on [somehow] misrepresentative data [or even technically representative data] can draw unintended conclusions that lead to harm". Specifically in this case, the harm was "the model was basically just trained to ignore all women applicants due to bad inference of conditional probabilities".

This is a common thing. Because our society draws lines and has bias, its fairly common for modelling failures to exist along those lines. Indeed, sometimes the failures are mostly harmless and immediately obvious, but often they aren't. And people building models should be made aware of those failure scenarios, and be especially aware of failure scenarios that affect underrepresented groups, because those groups are the most likely for the model to fail on if you aren't actively looking for them.

And this stuff is pervasive. Facial recognition tech is much worse at noticing the faces of darker skinned people [1]. Some of this is because the people building the common models (eigenfaces etc.) didn't use diverse skin tones, but some of it goes back further, white balance in film was tuned for lighter skin tones until the 90s[2]. Some of that has likely persisted into modern film and camera technology, unfortunately. People working with data need to understand their data. And that means understanding how bias infests their data.

> fundamental properties of the problem as a whole

You've yet to state the "whole problem" or the fundamental properties that people might misjudge. So I'm unclear what they are.

[1]: Arguably an advantage now.

[2]: https://petapixel.com/2015/09/19/heres-a-look-at-how-color-f...

gambler · on April 25, 2019

>Which problem? The general statement of this problem is "models, trained on [somehow] misrepresentative data [or even technically representative data] can draw unintended conclusions that lead to harm".

Throwing AI at answering an ill-formed question or optimizing a process that shouldn't happen in the first place is not something that can be corrected by getting better training data.

Moreover, automation can have consequences that aren't detectable by analyzing some test set.

westoncb · on April 25, 2019

> Except that's exactly what it is.

Using the term 'bias' has certain political motivations behind it. It's not about the term being technically untrue as it is about the term being non-neutral. For instance, here are some definitions of 'bias' I just grabbed from American Heritage:

"A preference or an inclination, especially one that inhibits impartial judgment."

"An unfair act or policy stemming from prejudice."

"A statistical sampling or testing error caused by systematically favoring some outcomes over others."

The ML model does not have a preference, inclination, or prejudice relating to interns, except insofar as we anthropomorphize it to have them. What does using a word suggesting that add?

A more neutral account of what's going on is along the lines: It's easy to accidentally train ML models so that they will make systematic errors. (Among those errors is the possibility for it to exhibit behavior resembling prejudice.)

joshuamorton · on April 25, 2019

Fine: it's easy to accidentally train ML models so that they will make systematic errors. Often these errors stem from systematic biases in our society, model creators should therefore be aware of the potential biases[1] that their models could reflect, and how to prevent them.

[1]: With the political motivation.

westoncb · on April 25, 2019

> Often these errors stem from systematic biases in our society ...

Depending on the what the appropriate quantification of 'often' is, that might make sense. Do we have enough reason to believe it would take on a high enough value to merit the usage of a term that refers only to it?

The other problem with what you're describing is that all we actually know is that the model is reflecting the current state of things. Your statement attributes particular causes to the current state of things, and implies a certain valuation of the current state of things (which I don't personally disagree with, necessarily—but I don't think my personal views should be reflected in scientific/engineering jargon).

So given the uncertain value of 'often,' and the unsettled nature of the causes behind various aspects of the 'current state of things,' it seems to be solidly jumping the gun to frame the entire general problem with a term that refers to this partial and fraught aspect of it.

joshuamorton · on April 25, 2019

>Your statement attributes particular causes to the current state of things

I didn't, nor should it matter how we got to where we are for a builder of a thing.

> and implies a certain valuation of the current state of things

This may have happened, but I'd disagree: recognizing that there exists inequality doesn't cast value judgement on that inequality. I simply stated that they're there. Perhaps saying "how to prevent them" is casting value judgement, so I might walk that back, model creators should be aware of the biases and aware of tools and strategies to account for them, if so desired.

Personally I think you're a bad person if, armed with the tools to detect and correct, you decide its okay to build something that has a systemic error that wrongly disfavors some group. But perhaps that's just me.

westoncb · on April 25, 2019

> ... recognizing that there exists inequality doesn't cast value judgement on that inequality.

You just asserted your attribution of cause right there: inequality. There are multiple possible causes for differing demographic representations in various roles. This is not a settled issue, even though people on both sides promote competing ideologies to the effect that it is.

(And again, I have intentionally left my own views on the subject out of this, even though I suspect they align with yours (insofar as cause attribution goes): I'm just pointing out the fact that this isn't something society agrees on, nor is it something the scientific data resolves unambiguously.)

> Personally I think you're a bad person if, armed with the tools to detect and correct, you decide its okay to build something that has a systemic error that wrongly disfavors some group.

Agreed, hinging on that point about cause attribution.

belorn · on April 25, 2019

> Often these errors stem from systematic biases in our society

No, this does also not match.

One of the easiest way to get a ML model that creates systematic errors is spam filters. If I take my spam folder with no consideration, what the filter will learn is that any language which isn't my own are spam, and that servers located outside my nation are spammers. This resembles prejudice.

The cause of this systematic error is that individual email addresses do not get ham emails uniformly from every nation and every language. Proximity warps the data. I would need to normalize the data based on language and nation if I wanted to remove those errors in the filter. Looking at it from a political perspective does not make the filter perform better, and fixing it from that side has a high risk of causing even more errors in the model.

jakelazaroff · on April 24, 2019

> There is nothing "mechanistic" about this. It depends on how you select sample resumes and how you split them between "good" and "bad" labels.

Isn't that what the article is trying to say, though? That your model can only be as accurate as your data set… and that even then, you have to be very careful to make sure it's not inferring patterns from entirely unrelated information?

bduerst · on April 24, 2019

If you train an AI using data from a system that already has certain biases, then the AI is going to replicate those same systemic biases in it's own predictions. It follows the "garbage in, garbage out" idiom.

Curiously though, did you compare the non-hire (full time) rates of interns vs fire rates of non-interns?

6cd6beb · on April 24, 2019

>If you train an AI using data from a system that already has certain biases, then the AI is going to replicate those same systemic biases

That's not what happened in the example at all. The example company isn't biased against summer interns, "who stops working after x time" was just a bad question.

The comment you're replying to can boil down to "do you want a monkey's paw solving your problem? If so then AI may be for you"

Or perhaps "stop pretending you're ever going to get ethics or empathy out of a computer"

bduerst · on April 24, 2019

I was referring to the Amazon resume model. The intern hires model was labeling, as GP said.

gambler · on April 24, 2019

>did you compare the non-hire (full time) rates of interns vs fire rates of non-interns?

Not sure I understand the question. IIRC, the way data was setup there was no way to tell why an intern stopped working for the company, because for all interns "reason code" for separation was the same.

bumby · on April 24, 2019

Isn't this the one of the major concerns of ML, the bias-variance trade-off? By creating a low-variance model, we create a highly biased model that misses some of the important feature relationships necessary to create a truly generalized model?

Meaning, isn't it prudent to spending time on this issue?

BickNowstrom · on April 24, 2019

You are conflating bias (error) with bias (fairness).

bumby · on April 24, 2019

Haha my comment was originally drafted to accuse the parent comment of the same. As I read the article, its concerned with error (e.g. misclassification of cancer) but the parent comment translated this to mean the social bias.

lonnyk · on April 24, 2019

I get the point, but why didn't you just exclude intern resumes from the training data? Do you still suspect a skewed result?

gambler · on April 24, 2019

>I get the point, but why didn't you just exclude intern resumes from the training data?

That was the logical next step and we started on that, but it required exporting more historic data out of the HR system and filtering out anyone who started as an intern as well. Sounds simple, but in practice it's anything but. Just for the reference, data extraction, cleaning and filtering in that project took at least an order of magnitude more time than anything related to machine learning.

The project eventually lost steam and got abandoned.

>Do you still suspect a skewed result?

Absolutely. My personal intuition is that there is very little correlation between resumes and candidate quality. If that is true, any seemingly accurate predictions would be the result of a similar problem. Testing this hypothesis was a large portion of why I agreed to work on the project in the first place.

sfreporter · on April 24, 2019

@gambler: thank you for reading my reporting. I would love to chat confidentially to understand your perspective better. Please see my HN profile.

johndavis925254 · on April 25, 2019

@sfreporter your reporting had the main lede quite buried to create sensationalism. Gender bias is a distant problem if the model's results are completely random.

Gender bias was not the only issue. Problems with the data that underpinned the models’ judgments meant that unqualified candidates were often recommended for all manner of jobs, the people said. With the technology returning results almost at random, Amazon shut down the project, they said.

lonnyk · on April 25, 2019

Got it. Thanks for the response.

chobeat · on April 24, 2019

I've just added this post to my reading list. I share it if anybody is interested in this and similar topics: https://github.com/chobeat/awesome-critical-tech-reading-lis...

Zolomon · on April 24, 2019

There is a course on this at New York University: https://dataresponsibly.github.io/courses/spring19/

killjoywashere · on April 25, 2019

I actually think this is where ML really shines. You can pick things apart. Sure, you might need carefully designed experiments, but you can subtract "female" from the resume and look for other data that cause some trained machine to activate, like patterns of word choice, etc. This is akin to the Go players learning from Alpha Go. It's actually a richly rewarding investigation for those of us who have done it. To discover a whole class of failure modes, that's success! And, unlike courts of law, the the process is much more efficient, because you don't have to contend with a defendant appealing to matters of intent or the emotions of a jury.

Someone · on April 24, 2019

Short way to describe the problem: we want to build systems that detect causation, but statistical models can only detect correlation.

wongarsu · on April 24, 2019

That's not entirely true: it's hard to show causation, but with enough data you can. If A correlates with B you know that either A causes B, B causes A, some C causes both A and B, or the correlation is a coincidence. If you have the data to rule out 3 of those the remaining possibility is the causation.

Someone · on April 24, 2019

So, how do you, for example, rule out “some C causes both A and B“, if you may not even know of the existence of C?

More importantly, the only way to really show causation is by positing a mechanism.

eanzenberg · on April 24, 2019

>>Now, suppose that 75% of the bad turbines use a Siemens sensor and only 12% of the good turbines use one (and suppose this has no connection to the failure). The system will build a model to spot turbines with Siemens sensors. Oops.

Given a statistically large enough sample, 2 outcomes: 1) The Siemens sensor actually is at fault. 2) The Siemens sensor is a part of a larger system, which is different in non-Siemens turbines, and that system is failing.

Either way, the model prediction on turbine failures is enhanced with that Siemens feature. But to even get to this granularity, you are diving into model explainability, or what features were important for each prediction. Here, you try to understand the black-box to find reasons for particular input->output.

munificent · on April 24, 2019

I think you assume here that the historical effects that led to Siemens sensors correlating with failure will continue to be true in the future. And I think that is the key fallacy that makes AI bias a problem.

We aren't just looking for patterns. We are looking for patterns so that we can take action and affect the future. If the patterns, which are real enough in the historical data, don't correctly predict the impact of a choice, then they are anti-helpful bias.

For example, it may be that the company bought Siemens sensors years ago and then switched to another brand later. Unsurprisingly, older turbines fail more than newer ones. So, really, it's age that is the causative factor and the concrete action you want to take is to pay closer attention to older turbines. Even though the correlation to Siemens is real, if the action you take is "replace all the Seimens sensors with another brand", that won't make those old turbines work any better.

In other words, understanding data doesn't just mean "see which bits are correlated with which other bots". In order to be useful, we need to understand which changes to those bits in the future will be correlated with which desired outcomes. Anything less than that and you don't yet have information, just data.

_pd19 · on April 24, 2019

> I think you assume here that the historical effects that led to Siemens sensors correlating with failure will continue to be true in the future.

Yes, AI systems presume induction to be true. But so does... uh, science and most other things we do?

gizmo686 · on April 24, 2019

Science has trained experts thinking about the data.

If you set a team of scientists to find a way of predicting failure of turbines, they might notice a correlation between Siemens sensors and failure. They would then look for and attempt to prove theories to explain this descrepency. In doing so, they would likly discover that, not only can they not find a causative theory, but the correlation goes away when they control for age.

AI systems stop after the first step, yet somehow are perceived as better than expert humans.

bumby · on April 24, 2019

That's an interesting way to frame it. AI may stop at proximate causes rather than finding root causes

munificent · on April 25, 2019

Or: AI shows correlation which we then implicitly treat as causation.

throwawaymath · on April 24, 2019

No, that's incorrect. Note the part of your quote which says, "and suppose this has no connection to the failure."

The point is the Siemens sensor is a superfluous correlation with turbine failure, because the underlying dataset is biased towards Siemens sensors. The scenario suggested by the author is one in which your turbine failure dataset does not match reality.

No amount of sample enlargement will correct sample bias. You have a variable which is disproportionately represented in your underlying dataset despite being independent from a collection of variables correlated to failure, and the algorithm is learning that one instead.

Real world ways this is plausible and cannot be corrected by increased sampling:

1. Your telemetry data is accurate, but your logging service providing that data is faulty and only consumes data from a subset of meaningful publishers.

2. Whoever provided this dataset fat fingered a SQL query which joined too few tables including the sensor vendors, but correctly returned only the failing turbines.

3. Your data has (unnormalized) duplicates, because more than one system is providing telemetry data for Siemens sensors without the older systems being retired.

4. You use mostly Siemens sensors, and simply didn't correct for this in your sample.

DuskStar · on April 24, 2019

Just to point out:

1. Not a spurious correlation - Siemens sensors are in fact associated with increased failure rates in the dataset and if you continue to sample data with the same methodology this correlation will continue. You need to fix your data collection methodology, but it's not a spurious correlation.

2. See #1.

3. See #1.

4. The original problem statement said that a low percentage of unfailed turbines used Siemens sensors, and a high percentage of failed turbines used Siemens sensors. So 'you use mostly Siemens sensors' would imply that most of your turbines have failed, which seems a little unlikely to me.

TheCoelacanth · on April 24, 2019

Only if your test data is free of sample bias.

Given how incredibly hard it is to avoid sample bias, you can't take it for granted that your training data doesn't have any sample bias.

DuskStar · on April 24, 2019

If the sample is "all the gas turbines I own", I don't particularly CARE about the bias...

TheCoelacanth · on April 24, 2019

If the training data is all gas turbines that you own, why do you care about having the ML model at all? You already have complete knowledge of the state of all your gas turbines.

There's no point to having an ML model unless you are applying it to something outside of the training data.

If you plan on applying the model to different turbines, then there is potential for sample bias in which turbines you selected. If you apply it to the same turbines at some point in the future, then you sampled points in time so there is a potential for sample bias based on which points in time you selected.

There is no way of completely avoiding the potential for sample bias unless you completely abandon ML as a useful concept.

DuskStar · on April 24, 2019

Well, I might care about predicting the next turbine to fail. If Siemens sensors are truly unrelated to the issues, that'll average out eventually - but I'd be highly skeptical of someone asserting that it's completely unrelated to the failures and not just covarying with something we're not using as a model input.

Why would I care about the fact that only 10% of turbines globally have Siemens sensors? I don't know the failure data outside of the turbines I own and operate, and those are the only ones I need to predict failures for.

TheCoelacanth · on April 24, 2019

Next turbine to fail means you sample based on time points, so you still could have sample bias.

Say that turbines have an average lifespan of X years, and from year 0 to 10 you bought 90% Siemens and then from year 10 to 20 you bought 10% Siemens and then you measure failure rates from year X to year X+10.

Based on that data you would predict that Siemens turbines will be the most likely to fail next, but they are probably actually less likely to fail because most of the ones that are likely to fail soon are already gone.

throwawaymath · on April 24, 2019

The superfluous correlation between Siemens sensors and turbine failures will not average out eventually if you have a sampling bias in your dataset.

DuskStar · on April 24, 2019

You keep saying that I have a sampling bias, but there really isn't any evidence for that. I'm sampling 100% of the population. You can't have a sampling bias when sampling 100% of the population.

It could be a spurious correlation, sure - but that'll go away as the amount of data increases.

throwawaymath · on April 24, 2019

You really should. If the sample is "all the gas turbines you own" and you disproportionately use Siemens sensors, your turbine failure forecast will (with high likelihood) reduce to a Siemens sensor forecast. This is easily plausible even if your sample's correlation between Siemens sensors and gas turbines is completely superfluous.

DuskStar · on April 24, 2019

You can't have a sampling bias when 'sampling' the entire population, because the definition of 'sampling bias' includes 'some members are not included in the sample'.

throwawaymath · on April 24, 2019

Precisely, yes. I'm talking about a sample including all representative gas turbine failures, across all sensor vendors.

TheCoelacanth · on April 25, 2019

You can't make predictions when sampling the entire population.

chobeat · on April 24, 2019

you should, because you might make worse decisions for the business, for the system or for the people that are impacted by the system. If you don't have the right data to decide, don't decide using the data.

jgon · on April 24, 2019

This quote stands out to me:

"just as a dog is much better at finding drugs than people, but you wouldn’t convict someone on a dog’s evidence. And dogs are much more intelligent than any machine learning."

Because in my head I followed it with the sentence "but we're all confident that we will have dogs driving our cars in about 5 years." Food for thought for sure.

dmix · on April 25, 2019

So dogs are better than humans at detecting drugs because they have a better sense of smell than can penetrate packaging. What does that have to do with technology being better/worse than humans at driving, exactly?

They didn't say dogs were better than technology at solving problems, in any sort of general sense.