JPlag – Detecting Software Plagiarism
JPlag, like similar plagiarism detectors, is vulnerable to attack. We outline the attack in this paper and show its effectiveness against JPlag and another widely used plagiarism detector, Moss. Note that this was written in 2020, in the pre-"CheatGPT" era!
https://arxiv.org/abs/2010.01700
Mossad: Defeating Software Plagiarism Detection
Breanna Devore-McDonald, Emery D. Berger
Automatic software plagiarism detection tools are widely used in educational settings to ensure that submitted work was not copied. These tools have grown in use together with the rise in enrollments in computer science programs and the widespread availability of code on-line. Educators rely on the robustness of plagiarism detection tools; the working assumption is that the effort required to evade detection is as high as that required to actually do the assigned work.
This paper shows this is not the case. It presents an entirely automatic program transformation approach, Mossad, that defeats popular software plagiarism detection tools. Mossad comprises a framework that couples techniques inspired by genetic programming with domain-specific knowledge to effectively undermine plagiarism detectors. Mossad is effective at defeating four plagiarism detectors, including Moss and JPlag. Mossad is both fast and effective: it can, in minutes, generate modified versions of programs that are likely to escape detection. More insidiously, because of its non-deterministic approach, Mossad can, from a single program, generate dozens of variants, which are classified as no more suspicious than legitimate assignments. A detailed study of Mossad across a corpus of real student assignments demonstrates its efficacy at evading detection. A user study shows that graduate student assistants consistently rate Mossad-generated code as just as readable as authentic student code. This work motivates the need for both research on more robust plagiarism detection tools and greater integration of naturally plagiarism-resistant methodologies like code review into computer science education.
As someone who actually used JPlag as a university TA, I think if the students are smart enough to implement this, they're probably smart enough to do whatever assignment we've asked of them (unless there's an easy-peasy program to do the transformation, but I don't think that's the case here).
The tool is basically a deterrent against the lowest-hanging cheating fruit (some students still tried and thought changing the variable names would help them...)
If you read the paper, you'll see that the attack is entirely feasible to implement by hand (we did this ourselves but do not report on it in the paper). It's a pretty mechanical process. A bit of trial and error will get the job done; it's a hell of a lot easier than most assignments.
Here's the relevant quote that speaks to your statement:
Mossad thus defies the conventional wisdom that defeating plagiarism detection is difficult or requires significant programming ability. The techniques that underlie Mossad could be implemented manually, relying on only the most basic understanding of programming language principles, letting them evade detection by both plagiarism detectors and some degree of manual inspection...
The usual defense against these is to ask students to explain their submitted work. Randomly generated dead code would likely be even more difficult for the students to explain.
Though a counterargument to this would be that teachers don't have time to interview every student. If Mossad is so good that teachers can't pick out the objectively suspicious subset, they might need to subjectively pick a random sample, with varying amounts of personal bias involved.
Yup. I sort of independently discovered this mechanism, and not just for "cheating," but for group work. Didn't even have to do individual interviews.
It was simple: I let students work in groups to do coding stuff (it's an intro type of class with students of varying skill levels). I had them work on a project together all they wanted, letting them know that it would be turned in about a month or so before the end of the semester. I would review the projects and then, in class, each student would INDIVIDUALLY be quizzed on their own team's project; down to e.g.
"You have a function blahblah, explain what it does. What would happen if I passed it X?"
Forces them to work together and sort of study together. Kind of puts a bit more pressure on the less knowledgeable, but probably worth it.
> A user study shows that graduate student assistants consistently rate Mossad-generated code as just as readable as authentic student code.
Do you have any small examples of a program that was transformed/generated with Mossad that we could compare against the original? As far as I can tell, the paper just has a really tiny example function.
A Mossad-generated file is essentially the original source file plus a bunch of dead code. The dead code consists of lines from the original file repeated in random locations (plus, if you are using an "entropy file", random lines of code that were successful mutations from previous generations of Mossad).
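For concreteness, here is a hand-made sketch of the idea, not actual Mossad output (the class and method names are made up): the padded variant behaves exactly like the original, but statements modelled on the original's own lines are repeated in spots where they cannot change the result, diluting the fingerprint a token-based detector sees.

    // Hand-made illustration, not actual Mossad output.
    class DeadCodeSketch {
        // Original submission
        static int sum(int[] xs) {
            int total = 0;
            for (int i = 0; i < xs.length; i++) {
                total += xs[i];
            }
            return total;
        }

        // Dead-code-padded variant: identical behaviour, extra no-op statements
        static int sumPadded(int[] xs) {
            int total = 0;
            total = 0;               // repeated initialisation, no effect
            int scratch = 0;         // copy of the initialisation, never read
            for (int i = 0; i < xs.length; i++) {
                scratch = xs[i];     // assigned every iteration, never used
                total += xs[i];
            }
            total = total + 0;       // no-op arithmetic
            return total;
        }
    }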
As it turns out, a lot of student code can look this way anyway. Something crazy like 70% of authentic student code can have dead code in assignment submissions.
> As it turns out, a lot of student code can look this way anyway. Something crazy like 70% of authentic student code can have dead code in assignment submissions.
Having assessed student code, this does not surprise me. Source code control late at night for students, especially non-CS majors, tends to be variations of "append a number to the end of the function name", e.g. sum1(x, y), sum2(x, y), ... sumTHISREALLYWORKS(x, y).
That said, if dead code was being used to hide plagiarism, which is something I had not considered before, then telling students they would be marked down for dead code would probably be enough to stop it.
I mean. Should be doing that anyway. Code doesn’t just exist for the computer, but also for humans who have to maintain it.
> I mean. Should be doing that anyway. Code doesn’t just exist for the computer, but also for humans who have to maintain it.
Harsh! I like to think I am a good lecturer.
Depends on the specification of the assignment. In my case I teach data science, not software development, so the specification is not "bulletproof code that won't break when pytorch releases a new version tomorrow" but rather statistical and data rigour. This is where I spent my time when marking, not on how maintainable the code is.
CS students turn in MUCH better code, but frequently data is leaking into test or validation sets etc., making the results either meaningless or compromised.
At the end of the day code quality is strongly correlated to grades.
That makes it sound like readability is even more important! I've taught programming to friends and family my entire life (to anyone who wants to learn), and one thing I always focus on is 'telling a story with comments', explaining how, where, and why data flows through the code. At the end, reread your comments and your code and figure out which one is wrong; then refactor.
I'm surprised that large amounts of dead code are neither an obvious-to-machines nor an obvious-to-humans problem or demerit with submitted assignments, regardless of plagiarism status. I'd especially have thought such a clunky approach would be caught by any decent plagiarism detection software. It makes me wonder if simply feeding a student's assignment into Claude would be more reliable these days, just asking it, "If you remove all the dead code, is the remaining code likely plagiarized?"
How would that pass in the user study? Did the people reviewing the code fail to see dead code scattered across random locations? Feels like it would be obvious as soon as you opened the file.
It would certainly depend to some degree on the complexity of the assignment. But it's also not that unusual for legitimate, non-plagiarized submissions to have dead code.
Sure, but is it not unusual to have "dead code consisting of lines from the original file repeated in random locations"? That would certainly stick out in any other environment (like a professional one).
I didn't study anything related to computers/software/programming in school, so I don't know what level is expected. But if I was tutoring someone and they handed me something with dead code in random locations in it, it would certainly catch my attention.
I think two things are at play here.
1. Students will frequently just try things until it works, move code around, etc., leading to very messy code.
2. Graders often do not look at individual assignments unless there is a reason to do so, often relying on automated test suites. And when they do look, I'd bet their first reaction is something like "I don't know why they're repeating themselves like this, but my rubric only penalizes them for 5 points here..."
In an educational setting, plagiarism tools are probably most wanted by lecturers but least useful. Do they teach every individual differently? If not, there is not much surprise if elementary ideas are expressed in very similar ways. Some cases of very similar solutions are bound to happen; hopefully no one is put under suspicion without proof of plagiarism.
I recently had to check code from some of my students at the university, as I suspected plagiarism. I discovered JPlag, which works like a charm and generates nice reports.
Next time just ask them a few questions about the programming choices they made. Far easier.
How do you deal with disputes? Say a student's code is flagged even though they didn't actually cheat. What then? Do you trust tools over the students' word?
In addition, do things like using Stack Overflow or LLM-generated code count as cheating? Because that is horrible in and of itself, though a separate concern.
The output of plagiarism tools should only serve as a hint to look at a pair of solutions more closely. All judgement should be derived entirely from similarities between solutions and not some artificial similarity score computed by some program.
Unfortunately, this is not really what happens in my experience. The output of plagiarism tools is taken as fact (especially at high school levels). Without extraordinary evidence of the tool being incorrect, students have no recourse, even if they could sit and explain the thought process behind every word/line of code/whatever.
Lousy high school.
Indeed, this is exactly what I did.
If you talk about the written code to the student in question it should become clear whether it was copied or not.
Well, in this case I noticed the same code copied while grading a project. I then used JPlag to run an automatic check on all the submissions for all the projects. It found many instances where a couple of students did a copy-paste with the same variable names, comments, etc. It was quite obvious if you looked at the details, and JPlag helped us spot it in multiple files easily.
*edited mobile typos
An archival video of all coding sessions (locally, hosted by the student), starting with a visible outline of pseudo-code and ending with debugging should be sufficient.
In case of a false positive from a faulty detector this is extraordinary evidence.
We had a professor require us to use git as a timestamped log of our progress. Of course you could fake it but stealing work and basically redoing it piece by piece with fake timestamps is a lot of work for cheaters.
Kinda rare these days with ChatGPT
You might be surprised. Many students who use ChatGPT for assignments end up turning in code identical (or nearly identical) to other students who use ChatGPT.
Surprising because you get different answers each time you ask ChatGPT.
Different in an exact string match, but code that is copied and pasted from ChatGPT has a lot of similarities in the way that it is (over)commented. I've seen a lot of Python where the student who "authored" it cannot tell me how a method works or why it was implemented, despite having comments prefixed to every line in the file.
> (over) commented
From my experience using ChatGPT, it usually removes most of my already-written comments when I ask questions about code I wrote myself. It usually gives you outline comments. So unless you are a supporter of the self-documenting code idea, I don't think ChatGPT over-comments.
It's obviously down to taste, but what I've seen over and over is a comment per line, which to me is excessive unless it's requested of absolute beginners.
That happens, and the model also can't decide whether it wants the comment on the line before the code or appended to the line itself, so when I see both styles within a single project, it's another signal. People generally have a style that they stick with.
Ah yes, good old "Did you even read the essay before handing it in? Next time, please do."
ChatGPT answers don't differ that much without being prompted to do so
yeah but the prompt itself generally adds sufficient randomness to avoid the same verbatim answer each time.
as an example just go ask it to write any sufficiently average function. use different names and phrases for what the function should do; you'll generally get a different flavor of answer each time, even if the functions all output the same thing.
sometimes the prompt even forces the thing to output the most naive implementation possible due to the ordering or perceived priority of things within the requesting prompt.
it's fun to use as a tool to nudge it into what you want once you get the hang of the preconceptions it falls into.
MOSS seems to be pretty good at finding multiple people using LLM-generated code and flagging them as copies of each other. I imagine it would also be a good idea to throw the assignment text into the few most popular LLMs and feed that in as well, but I don't know of anyone who has tried this.
FWIW the attack we describe in the paper works against MOSS, too (that was the original inspiration for the name, “Mossad”).
I was actually looking for something like this a few days ago!
There’s an open source tool which I love the idea of (basically a tool for declarative integration tests), but I really don’t like its implementation. I tried to contribute to improve it, but it’s too much work and it will never fit my ideal.
So I basically decided to "redo it but better", and I’m also tempted to make it a paid, proprietary tool because my implementation diverges enough that I consider it a different codebase altogether (and it would bring legitimate value to companies). I wrote my code from scratch but still had some knowledge of the original code base so I’d be interested in running something like JPlag to make sure I didn’t accidentally plagiarize open source code.
I hope I find a way to make it compare 2 codebases :)
> to make sure I didn’t accidentally plagiarize open source code.
If you didn't plagiarize, you don't need to run the tool. If you did plagiarize and want to hide it, tho...
There are big sections of code I wrote in the original open source lib. I didn’t copy-paste the code, but the implementation in this component is obviously pretty close. I’m the copyright holder of this code anyway, so it should not be an issue, but I’d rather not take the risk.
Plagiarism is not always clear cut because life is messy. That’s why Wine doesn’t allow contributions from people who have seen Windows source code[1], for instance, even though those could be good-faith contributions drawing on experience rather than plagiarism.
[1]: https://wiki.winehq.org/Developer_FAQ#Who_can't_contribute_t...?
One of the key outcomes of my master's thesis was the development of an extendable solution for Code Clone Detection (CCD), primarily focused on code and tested with undergraduates at my university [1]. Although I didn't have time to complete the adapter for JPlag, I believe it would be highly beneficial.
Interestingly, whenever I discussed my thesis, the first reaction from others often revolved around moral concerns.
This looks cool, but for me one of the big wins with JPlag is that I just download and run a single JAR file.
Presumably, this needs a corpus of software to check against. Does it include one, or do you have to bring your own?
> Just to make it clear: JPlag does not compare to the internet! It is designed to find similarities among the student solutions, which is usually sufficient for computer programs.
It seems like the latter based on their wiki, but also that that corpus can be relatively small.
Should a plagiarism score be considered when generating code with an infinite monkeys algorithm with selection or better?
Would that eventually result in an inability to write code even in a clean room, because all possible code strings and mutations thereof would already be patented?
For example, are three notes or chords copyrightable?
pro tip: change variable/function names, switch from if/else to switch, invert if/else statements, swap for and while loops, group code differently, create helper functions or collapse helper functions, rewrite loops as streams/ranges/list comprehensions, etc. The IDE can do most of this automatically.
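A rough sketch of a few of those edits on a small, hypothetical Java function (whether any particular combination actually fools a token-based detector like JPlag or Moss is a separate question):

    // Hypothetical before/after for some of the manual rewrites listed above.
    class RefactorSketch {
        // "Before": the copied version.
        static int countPositives(int[] values) {
            int count = 0;
            for (int i = 0; i < values.length; i++) {
                if (values[i] > 0) {
                    count++;
                }
            }
            return count;
        }

        // "After": identifiers renamed, for loop turned into a while loop,
        // the condition inverted and pulled into a helper. Same behaviour,
        // different-looking source.
        static int tally(int[] nums) {
            int total = 0;
            int idx = 0;
            while (idx < nums.length) {
                if (!isNonPositive(nums[idx])) {
                    total++;
                }
                idx++;
            }
            return total;
        }

        private static boolean isNonPositive(int n) {
            return n <= 0;
        }
    }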
It is pretty much impossible to detect software plagiarism, especially on leetcode-style questions, where a single style or pattern is the most efficient answer.
Though if a student changes it sufficiently, they might begin to actually see the invariants and ideas and learn the material.
It's funny how we drill the idea that everything must be reimplemented from first principles into students, only to flip that when they join the workforce.
Building things from first principles is a great way to instil understanding about how things work and why, which helps the understanding of problems later.
When doing things for real in the workforce, reinventing core parts yourself is often not the best way, if only because you'll reinvent already-fixed bugs or waste time, and that should be explained too. But understanding how things work below the outer layer of what would otherwise be black boxes lets you better understand when things go wrong, or puts you in a better position to assess a pre-made library/service/other and be confident it is the most suitable option¹. Also, building things from scratch helps teach complexity analysis and, at a slightly higher level, security analysis, both of which are very useful, often vital, at much higher and/or more abstract levels.
If building from first principles, or close to, is being drilled into students as the way to do things full stop, then those students are being taught poorly. It isn't how I remember learning way back when I was last called a student.
For example, my understanding of how b-trees and their relatives work, in part from having built routines to manage them in the dim and distant past, along with a number of other similar bits of knowledge, helps my understanding of how many DBMSs work in general and how certain optimisations at higher levels² do or don't work. I doubt I'll ever need to build any structure like that from anything close to first principles, but having done so in the past was not wasted time. The same goes for knowledge of filesystem construction, network protocols, etc. – I'll probably not use those things directly, but the understanding helps me make choices, create solutions³, and solve problems, less directly.
--------
[1] or at least a suitable option
[2] things I do in the query syntax, what the query planner/runner can/can't do with that, etc.
[3] I am sometimes the local master of temporary hacky solutions that get us over the line and allow time to do things more right slightly later instead of things failing right now.
Yea, you should learn how things work. I'm just worried that overly clever plagiarism detection would ding you for things like converging on common design patterns.
> It's funny how we drill the idea that everything must be reimplemented from first principles into students, only to flip that when they join the workforce.
Kind of makes sense, doesn't it? While in school, you want to learn as much as possible (ideally?), while in the workforce, you want (or the company wants you) to be as efficient as possible. Different goals lead to different workflows.
This argument has come up a lot since ChatGPT was released. I can agree that new tools (like LLMs) can have a place in education, potentially. That said, learning the foundation of anything you do is critical to understanding the higher levels of it.
I think the same connection can be made to StackOverflow. If you are/were a computer science student and you did a lot of copy-pasting and not a lot of thinking and trialing, there's a really good chance you didn't get to suffer the mistakes during development that you know to avoid now as a graduate. We have all taken advantage of code from StackOverflow, and it's a tool that aids development, but when you treat it as a crutch and it doesn't have the answers you need, you're screwed.
One case I saw literally yesterday at work: we had a dev who had written a lot of copy-pasted code, saying that it couldn't be generalized to take advantage of a mapping we already have to generate a UI. This dev had not yet had the fortune of learning how to properly abstract this kind of problem in this specific circumstance. I sat down with him for a moment and, instead of spitting out the nonsense that the LLM was trying to get him to write, we paired on the issue until we had a more general solution.
He learned some abstraction concepts, we all got a better code base, and he learned a way to help tease an LLM into a better solution going forward. That foundation was required to get that better solution though, in this situation.
Generally speaking, I think you should know all your underlying concepts so you can audit any new development assisting tools.
> only to flip that when they join the workforce.
A surprisingly large number of people do not realise that code on StackOverflow is under a relatively restrictive license.
https://meta.stackexchange.com/questions/12527/do-i-have-to-...
Some companies take this very seriously and others do not care at all. And, of course, there are companies that outright ban any library without the right licence.
Most companies realize they can do what they want as long as there are no actual consequences. These companies prosper under the rules of natural selection, compared to the companies which are afraid of doing things because some paper says they shouldn't do it even though nothing will happen if they do it.
exam: implement add.
"return a + b": plagiarism, disqualified.
"return a + 1 + b - 1": A+
The exam questions themselves would be plagiarism also
> It's funny how we drill the idea that everything must be reimplemented from first principles into students, only to flip that when they join the workforce.
It is funny, but that doesn’t mean it’s wrong.
PS If the domain in your profile is supposed to be valid … it isn’t.
Yea, it's a great way to learn.
Thanks, I need to redo my website/blog/whatever. I'm still getting email at it!
Exactly this. If my coworker implemented sort themselves we'd have questions about their qualifications.
> It's funny how we drill the idea that everything must be reimplemented from first principles into students
Where on earth did you go to school that you couldn't just call list.sort() after your first algorithms class? I've never seen a class that teaches students that everything must be implemented from first principles. They only do that for the concepts taught in that class, or 1-2 classes prior, to make sure you are actually learning and have a clue what you're doing. Which... should make sense? You should want engineers to cut costs based on understanding rather than ignorance.
Can I use it to detect Copy & Paste within my company's own codebase?
Jetbrains IDEs will flag duplicate code of N lines (configurable) or more. Just run the inspection manually and there you go. You can even do it in batch in CI nowadays.
I've had some success with PMD's CPD [1]. It's Java-based, though, which could be a pain, depending on your setup. If your codebase uses Python, you can use Pylint for this, too.
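For the Pylint route, the duplicate-code checker (message R0801) is driven by the SIMILARITIES section of the config file; the values below are only illustrative, so check the Pylint docs for the options your version actually supports.

    # .pylintrc (illustrative values for the duplicate-code checker, R0801)
    [SIMILARITIES]
    min-similarity-lines=6
    ignore-comments=yes
    ignore-docstrings=yes
    ignore-imports=yes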
What do you need that for? It’s not really cheating when it’s going into a product, if it works it works. In a hypothetical ideal corporate environment wouldn’t it be preferred if one could save company time by copy and pasting?
Is this also effective at detecting code duplication within a codebase?