
> I’m not sure that that is a reasonable test of how the program typically behaves.

That's not what people care about; people care about their copyright being blatantly violated by a massive corporation _without any consequences_.



Honestly, I feel most people don't care about that. What they do care about is the risk of Copilot making the user liable for copyright infringement. Even a possibility of it spewing out non-public-domain code should be considered a showstopper for any use of Copilot-generated code in a commercial project.

Can Copilot produce licensed code verbatim, in enough quantities to matter, with a license your business would be infringing? Yes. Can you easily tell by looking at the output? No. Could someone end up suing you over it? Maybe, if they cared enough to find out. Can you honestly tell your investors, or a company you seek to be acquired by, that nobody else can have a valid copyright claim against your code? No.


> Can Copilot produce licensed code verbatim, in enough quantities to matter, with a license your business would be infringing? Yes. Can you easily tell by looking at the output? No. Could someone end up suing you over it? Maybe, if they cared enough to find out. Can you honestly tell your investors, or a company you seek to be acquired by, that nobody else can have a valid copyright claim against your code? No.

Well aren't all your assertions exactly the point of contention?


Well, the "enough quantities to matter" part hasn't been tested in court yet, but I fail to see a way to rule for "No" here in a way that wouldn't gift us a universal way to turn any code into public domain, destroying source code licensing as a concept. Other than this part, the first two claims have already been demonstrated, and the rest follow from them.


But that is in fact the most fundamental question here. And I’m not fully sold on the idea either that this is going to happen in real-world usage or that a single function in a massive program constitutes a large enough portion to be infringing.


Quake's square root function wasn't the only, or the largest, example of code Copilot reproduces verbatim. Among the others I've seen to date is someone generating a real "About" page with the PII of some random software developer.

How much code is enough to infringe is a tricky question, though. It's not only a function of size, but also of importance/uniqueness - and we know that Copilot doesn't understand these concepts.


> ... or that a single function in a massive program constitutes a large enough portion to be infringing.

As part of the sequence of rulings in Oracle v. Google, the 9-line rangeCheck function, out of the entirety of the Android codebase, was found to be infringing.


Ok, but is “I can go out of my way to make it misbehave” adequate proof that the copyright is being violated?


Not GP.

Yes, it is, because that means that the algorithm will produce that copyrighted code regardless of the intent of the person who makes it misbehave. People could both accidentally and "accidentally" make it reproduce copyrighted code. In the first case, it's unintentional. In the second, how could you prove it's intentional?

Because of this whole mess, I am actually adding clauses to FOSS licenses that I am writing, just to ensure that my copyright on my code is not infringed by code laundering.


I'm not at all in favor of the "code laundering" (which is a brilliant term, thank you). But I don't understand how you expect a new license to help.

1. A license applied to source code is effective because of your copyright

2. The claim of Copilot's maintainers is that it bypasses copyright

Therefore, they will assert that they can ignore the new license saying "you may not launder my code" just as surely as they can ignore the previous license.


First, I did not come up with the term "code laundering." I cannot claim credit for that; I saw it first on HN on https://news.ycombinator.com/item?id=27729209 somewhere.

Second, you are correct that Copilot's maintainers claim that it bypasses copyright, but if it does while producing exact copies of code, then copyright is dead, and there are a lot of big companies out there with deep pockets that will ensure that doesn't happen.

They may claim that, because their algorithm is a black box, whatever it produces has no copyright, but my licenses will push back directly on that claim by saying that if source code under the license is used as all or part of the inputs to an algorithm, whether all of the source code or partially, then the license terms must be attached to the output. After all, that's what we do with GPL and binary code. The binary code is the output of an algorithm (the compiler) whose input was the source code.

I hope by tying it together like that, the terms can close the loophole they are claiming. But of course, I am going to get a lawyer to help me with those licenses.


> ... if source code under the license is used as all or part of the inputs to an algorithm, whether all of the source code or partially, then the license terms must be attached to the output.

You're not getting it. If Copilot isn't currently infringing copyright then adding such a clause won't matter. Such a clause would only hold weight when copyright applies. On the other hand, if copyright does apply, then you don't need such a clause because the activity is already a violation of the vast majority of licenses. (It even violates extremely permissive ones because it effectively strips out the license notice.)

The GPL works specifically because copyright applies to the use case in question. It simply specifies various requirements that you must meet in order to license the code given that copyright applies.

In short, you can't just put a clause into a license saying, effectively, "and also, this license confers superpowers which make it so that my copyright applies in additional situations where it otherwise wouldn't!".


I think the GP's "license" would still be effective, although it would not be "open source" per the OSI definition.

Imagine this simplified scenario first: if I published a source file publicly without any licensing or explanation except a standard copyright notice - "Copyright (C) 2021 MY NAME, all rights reserved", do you think a random person/company can take that code and integrate it into a commercial product?

I would argue not (in general). Copyright law as it is does not permit a user who has access to a copy to do whatever they want with that copy (esp. if it involves more copying). OSS licenses do give you much freedom as long as you don't modify it, and that's why we have the impression that we can do whatever we want with published source code. However, if we think about other types of copyrighted work, say movies for example, streaming services can "rent" you a movie multiple times even though you've paid to download the content previously. What are you paying for the second time you rent? Another example - some photographers may allow you to freely browse their works, but they can still make you pay money if you want to use their photo in your commercial product.

So why wouldn't copyright restrict usage of source code in similar situations? The GP only needs to add a condition to the license to restrict how users can use it. It will no longer be OSS, but as long as it's his work, I don't see why in principle it shouldn't work.

(In practice, I don't think it will make much difference -- I think your argument is still somewhat compelling, and some people will probably take your position. Conservative corporate lawyers aimed at reducing legal risk would disagree, so it's basically a matter of how much legal risk one is ready to take. Also, for an author trying to do this, note that suing Microsoft in these cases would be expensive, since they will likely fight back given that they spent so much money trying to do this, and the outcome will be uncertain. If really tested in court, given the result of the Oracle v Google case, if the US Supreme Court is impressed by the social/economic benefits that Android brings, I'm pretty sure the justices will be even more impressed by this intelligent code generation thingy, and might just grant this thing a fair use.)


Your summary is generally correct, and I certainly agree with the other commenter's position on their work. But I think you're still missing the point. Copyright is the mechanism that allows you to prevent copying, but GitHub's claim is that copyright is irrelevant to Copilot's input.

I have a nice strong lock on my door. GitHub (asserts that it) can enter my home through the window.

Adding another deadbolt to the door does not help.


I don't think I missed that point. I'm trying to argue that copyright is relevant to Copilot's input if not allowed by an OSS license.

Maybe I'm missing something (just not the thing you said), but has Github made any legal claims so far? The original article is written by a politician in the EU...

Even if you're a lawyer defending Github in this case, there are still a couple of things that need to be clarified before you can make the case: (maybe the info is out there but I'm too lazy to research)

- Is Github only using code/repos that are explicitly under OSS licenses? (because if that's the case, then the discussion might be justified in presuming OSS terms, and it may be the case that more restrictive non-OSS licenses would require a different analysis)

- As somebody pointed out in another thread, the Github terms of service agreement seems to grant Github additional rights when dealing with user uploaded content. Is that a legal basis for the use?


> I'm trying to argue that copyright is relevant to Copilot's input if not allowed by an OSS license.

And I tend to agree with you (and the other commenter) here. But GitHub doesn't.

> has Github made any legal claims so far?

I'm not sure how actively, but the CEO was here in the announcement thread the other day saying that they think the ingestion of the inputs is a "fair use". They also have some material defending the output side: https://docs.github.com/en/github/copilot/research-recitatio...

> Is Github only using code/repos that are explicitly under OSS licenses?

I don't think we know exactly what code they used as inputs, no.


Their argument defending the output side doesn't hold water, IMO. If Copilot produces exact copies verbatim, even some of the time, then as long as customers don't have access to the code used to generate the model, how can they be sure?

It's a matter of scale. With a big enough codebase, there will be copyright violations.


> I don't think I missed that point. I'm trying to argue that copyright is relevant to Copilot's input if not allowed by an OSS license.

The point (that they claim that) you are missing is that if "copyright is relevant to Copilot's input" then almost all existing OSS licenses already don't allow that.


The licenses that I am making implicitly acknowledge the argument that training an ML model is fair use.

However, GitHub said nothing about the output of the model being fair use. My license will say that the output of their model is under the same license as the input, which means they have restrictions if they want to distribute it (i.e., actually have people use Copilot).

I think this will work because it doesn't say that GitHub is wrong. Instead, it says that, even if GitHub is right, it doesn't matter.

It would also be very bad for GitHub to claim that the output of an algorithm can't be under the same license as the input because we feed licensed code to algorithms all the time and claim that their output is still under the same license. We call those algorithms "compilers" and the binary code they produce is still copyrighted and licensed.


> I think your argument is still somewhat compelling, and some people will probably take your position.

I didn't mean to take a side or argue a position here. I was just pointing out that licenses hold no legal power in the event that copyright itself doesn't apply.

> ... So why wouldn't copyright restrict usage of source code in similar situations?

I'm certainly not an expert here but I believe you are mistaken about the extent to which current copyright law (in the US) restricts such usage. I also don't think that the examples you bring up are as simple as you seem to be making out.

You are legally permitted to record broadcast shows for later viewing; you are not permitted to redistribute the recordings though. I assume (but am not certain) that rentals and streaming are the same. (That being said, bypassing DRM has been made its own crime. This effectively amounts to an end run around the rights otherwise granted to you by US copyright law. But then there are specific exceptions where bypassing DRM is permitted. I digress.)

You aren't legally permitted to mirror the contents of a website (such as the New York Times) without permission but you are allowed to access it since they make it publicly available. You are even permitted to save a copy for your own purposes when you access it; you are not permitted to redistribute that copy.

For an extreme example, consider the recent LinkedIn case. Unless I misunderstood it, the court deemed it acceptable to scrape any publicly available content. Certainly most such scraped content was never explicitly licensed for that though!

Even if the license for a piece of code was entirely proprietary, GitHub presumably acquired it through legal means (ie intentional upload). Once they have it in their possession, it's not at all clear to me that current copyright law in the US has anything to say about how they use it (short of redistribution). Of course, if their ToS promises that they won't use it for other purposes then they can't do that. But assuming they never promised you that in the first place ...

There's a traditional argument here about needing a license to legally incorporate the copyrighted work of another into your own.

One possible counter argument is that training a model on publicly available work is analogous to a person viewing that work. So long as the model never outputs any of the original inputs (or only exceedingly small fragments of them that would fall under fair use regardless) it's not clear that those outputs constitute derivatives at all (in the legal sense). Or they might. The courts haven't weighed in yet as far as I know. (Consider GPT-3 or This Waifu Does Not Exist for additional examples of the sort of ambiguity that's possible here.)

Of course, one possible counter to that is that the model itself is (in many cases) effectively a lossily compressed copy of the original input works. So perhaps redistribution of the model itself would be a violation of copyright. But even if that turns out to be the case, it's still not clear that the output of such a model would run afoul of copyright.


You have good points.

I argue that the output of an algorithm has the same copyright as the inputs to the algorithm, and that's because we use compilers (algorithms) to transform source code all the time already, and no one says that the binary code (outputs) is not copyrighted.


The trouble is there seems to be an entire continuum when it comes to degree of transformation.

The compiler produces more or less a direct (logical) translation so it's clearly some sort of derivative. We go from C to machine code but the output still "means" the same thing as the input. (More precisely, it's approximately a mathematically transformed subset of the original input. Lots of information is removed, things are reorganized, and a bit of extraneous information gets added in the process.)

For something notably more muddy than a compiler, consider This Waifu Does Not Exist. Any given output is (typically) nowhere near any particular input but you can often spot various strong resemblances.

Alternatively, the implementation of sketch-rnn (https://magenta.tensorflow.org/sketch-rnn-demo) is quite different - it outputs pen strokes instead of pixels. Still, the legal questions remain the same.

For a significantly muddier example, consider GPT-3. The outputs are (typically) not even remotely similar to anything that was input except in very broad strokes.

Where does Copilot fall along this continuum?

For even more confusion, consider running a New York Times article through Google Translate. Are you in the clear to publish that? I seriously doubt it.

But what about running it through an ML algorithm that (attempts to) produce a very brief summary of it? Many such implementations exist in the real world today. Their output is nothing like the input - should it still fall under the copyright of the original?

Finally, it's worth pointing out that for many of the above computerized tasks there are direct human equivalents. Art can be traced on a light table. A drawing can be produced that fuses the styles of two references. News articles can be manually translated or summarized.

Again, my intention here isn't to argue a particular side. I'm just trying to make it clear how complicated this stuff is and the fact that we don't have clear legal answers for most of it yet.


Ah, I see.

I argue that, even if training a dataset is fair use, distributing the result is copyright infringement. I would want my license to make that part clearer.


> even if training a dataset is fair use, distributing the result is copyright infringement

I would be inclined to agree that the current situation (ie reproducing training examples verbatim) violates copyright. On the other hand, I'm not so sure that a trained model does (or even should) be subject to the copyright of the inputs.

Of course I acknowledge that the latter view is controversial and also that such issues are so new that they haven't had a chance to be meaningfully addressed by either the courts or the legislature yet.

As an example of a similar situation, see (https://www.thiswaifudoesnotexist.net/) which was trained entirely on copyrighted artwork. Note that there are at least three distinct issues here - training the model, distributing the model itself, and distributing the output of the model.

> I would want my license to make that part clearer.

But again, GitHub's argument here is that the license is completely irrelevant because it doesn't apply in the first place. Thus they won't care one bit about any clarifications you make one way or the other.


You said that you're "not so sure that a trained model does (or even should) be subject to the copyright of the inputs."

You missed my point. I'm not saying that the model is subject to the copyright of the inputs; I'm saying that the model's outputs are, which is entirely different. We say that the output of a compiler is still subject to the copyright of the inputs, so why not this?


I misspoke. (Err mistyped?) I suspect there will often be a stronger case to be made for the model itself falling under copyright than what it outputs. It's up to the courts and the legislature in the end though, so who knows.

Anyway, by providing public access to this thing I infer GitHub to be taking the position that copyright doesn't apply to the output. (And I suspect they are wrong, in particular because of the verbatim code samples people have managed to coax out of it.)


> even if training a dataset is fair use, distributing the result is copyright infringement

That seems an unlikely legal argument. It would defeat the point of fair use if you couldn’t distribute the result.

And no copyright license can override copyright law. Licenses can only grant rights, they can’t take them away.


Can you add fines?


I wish. I just want users to know what rights they have. Ultimately, I want my software to serve end users, not companies. If companies add value for users with my software, that's exactly what I want.

But stripping licenses away so that users can't know what rights they have with my code is not that.


>I am actually adding clauses to FOSS licenses that I am writing

Doesn't this make your new licenses incompatible with a lot of existing licenses?


Not necessarily. If you do it right, you've got a perfectly GPL-compatible license (because such laundering is, technically, a violation of the GPL… probably) – it's just a license that's more explicit about what's a license violation.

Law isn't code.


GPL explicitly forbids re-licensing under more restrictive terms.

So either the added terms are not more restrictive, which basically means they are unnecessary and have no real effect; or they are more restrictive, which is incompatible with the GPL.

You can't have things go both ways. It seems that your argument is "we're not adding restrictions, we're just saying what we think Copyright law / the GPL should actually be like." But unfortunately you can't "clarify" Copyright Law or "clarify" the GPL by adding terms. Ultimately courts decide that.

(Of course, if somehow your "clarification" happens to align with a court decision, then maybe it will work after all. But in theory your "clarification" is still not necessary and has no additional effect....)


> But in theory your "clarification" is still not necessary and has no additional effect....

Except your clarification will be interpreted by a court of law. “This license is compatible with the GPL and I can interpret the GPL in a way that lets me do something this license says I can't” is much less likely to stand than “well maybe the author thought the GPL said this, but it actually says my interpretation”.

This, of course, presumes that such a license is actually compatible with the GPL, something I'm getting less and less certain of over time. (What constitutes a compiled form? If a predictive model doesn't count – which it might not, since it outputs source code, very much unlike how compiled programs normally work – then my argument falls down. And many other things would also knock the argument down; I'm not confident enough that all my assumptions are right, or that they should be right.)


GPL code and its derivatives can't be distributed with additional restrictions.


wizzwizz4 is correct. Also, I have explicit clauses saying that GPL/AGPL dominate.

But yes, my licenses may be incompatible (one-way) with permissive licenses. I say "one-way" because code with permissive licenses can still be used in code under my licenses, but maybe not necessarily the other way around.

I'm okay with that.


That does not really ring true to me. AGPL broadens the scope of violations as well, and you cannot use AGPL code in GPL-only code bases without turning the end product AGPL (but you can use GPL-only code in AGPL code bases).

If you're just adding something along the lines of "copying passages extensive enough to reach originality is a violation of this license" then that's indeed already covered by the GPL, and there is really no need to add such a passage other than to be more explicit - and confuse people at least at first about why your license is not actually the GPL. So there isn't much of a point to do it in the first place, in my humble opinion.

If you add text that says something along the lines of "you may not use this code as training data", then you created an incompatible license, and your code cannot be used in GPL code bases, and even worse, since it restricts what you can do with the code more than the GPL, it might even mean you stop being reverse-compatible and may not use GPL'ed code yourself in your own custom-license code base.

The AGPL does not further restrict code uses, just broadens the scope of when you have to make available the code, so it's fine there. However, the original BSD license with the advertising clause is considered incompatible with the GPL.

I am not a lawyer, and these are just my quick layman concerns. I fully recognize you're entitled to use whatever license you find suitable for your code and I am absolutely not entitled to your code and work whatsoever.

But that said, I wouldn't touch your code if I saw a "potentially problematic" custom license, and I wouldn't consider contributing to your projects either.


I understand your concerns.

Honestly, with this whole debacle, I am not going to be accepting outside contributions anyway.

I also understand the concern with a problematic license. However, I don't plan to make a specific exemption about machine learning, but rather tie up an ambiguity.

What I think I'll do is that the license will require that when the licensed source code is used, partially or fully, as an input to an algorithm, the license terms must be distributed with the output of that algorithm.

I don't think this is a violation of the GPL at all because the GPL requires you to distribute the license with the binary code of GPL'ed code, and such binary code is the output of an algorithm (the compiler) whose input was the source code.

But what it would do is put the onus on GitHub that, if they used my code in training that data, if they distributed the results (as they are doing), they must distribute my license terms as well and tell users that some of the results are under those terms.


> binary code is the output of an algorithm (the compiler) whose input was the source code.

Just because binary code is produced by the operation of an algorithm on source code doesn’t make all output produced by any algorithm on that source code binary code. Otherwise checksums and hashes and prime numbers would be copyrighted.

Bats are not birds.


You have a point, which is why the legal system would still require that a copy be substantial before they count it as infringing. I would argue that Copilot has already been shown to copy substantial portions, though.


> something along the lines of "you may not use this code as training data"

Would such a term be legally binding under present copyright law? Other than disallowing inclusion in a redistributed dataset specifically intended for training ML models, it's not clear to me that it would actually prevent such use if you already had a copy on hand for some other purpose. (Specifically, note that GitHub indeed already has a copy on hand for their authorized primary purpose of publicly distributing it.)

More generally, the manner in which copyright law applies to machine learning algorithms in general hasn't been worked out by either the courts or legislature yet. Hence the current article ...


To be clear, my suspicion is that this is so unlikely to happen unintentionally that it does not represent a real risk. If the issue is that I can force it to generate infringing output if I really want to, it is an argument against the Web browser too, since I could just as easily use the copyright-unsafe "copy" feature.


I don't entirely agree.

Whereas using the browser's copy feature requires the user to have intent to use it, getting Copilot to produce exact code does not. And proving that intent is not easy.

I think companies will see that such code can be exactly reproduced and decide to stay away from Copilot. I hope they do. In fact, I am less willing to take outside contributions for my own code, even for bug fixes, just because of the risk that that code came from Copilot.


That makes sense if you ignore the idea that such a thing would seem unlikely to happen without intent, which was the key thing in the post you’re replying to.


Unlikely stuff will always happen with enough use. There are billions of lines of code in the world. There will be enough copyright violations. Even on single multi-million line codebases, there will be violations.
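
To put rough shape on the scale argument: if each accepted suggestion independently has some small chance p of being a verbatim copy, the chance of at least one copy among n suggestions is 1 - (1 - p)^n. Here is a minimal sketch in Python. The value of p and the suggestion counts are made-up illustrative numbers, not measurements of Copilot, and independence between suggestions is an assumption.

```python
def prob_at_least_one(p: float, n: int) -> float:
    """Chance of at least one hit in n independent trials,
    each with per-trial probability p."""
    return 1.0 - (1.0 - p) ** n


if __name__ == "__main__":
    # Hypothetical per-suggestion rate of verbatim reproduction (an assumption).
    p = 1e-6
    # Hypothetical numbers of accepted suggestions across a codebase.
    for n in (1_000, 100_000, 10_000_000):
        print(f"n={n:>10,}  P(at least one copy)={prob_at_least_one(p, n):.4f}")
```

Whatever the real rate is, the probability only goes up as n grows, which is the point: a per-suggestion risk that looks negligible stops being negligible across enough code.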


How long does it have to be for you to consider it copyrighted code?

For example, a book could be copyrighted, but they certainly cannot sue me because a book I wrote contains a sentence that is the same.


The answer to your first question is for the courts to decide, unfortunately.

However, for my purposes, using a new license with particular terms would only be to make companies like GitHub pause and think before using my code as "training" to an "algorithm" like Copilot.


Double standards ensue.

Tool that could be used to violate copyright := Gets prosecuted by MPAA and friends, legislation is passed to make use / development / distribution of such tools illegal

Bigcorp ships the ML equivalent of ALLCODE.tgz, but you actually gotta look in the no/dont/open/this/folder/gplviolations/quake.c folder := Is this adequate proof that copyright is being violated?


Since I do not work for the MPAA, I don't see why you expect me to answer for them. Half of the article's argument is that any argument you could use to shut down Copilot would also give a lot of power to such entities if it were accepted.



