Programming students need to be informed about plagiarism and collusion. Hence, we developed an assessment submission system that reminds students about the matter. Each submission is compared to the others, and any similarities that do not appear to be coincidental are reported along with their possible causes. The system also employs gamification to encourage early and unique submissions. Nevertheless, the system might put unnecessary pressure on students, as coincidental similarities can still be reported. Further, it does not specifically cover self-plagiarism. We revisit the system and shift its focus to reporting simulated similarities derived from each student's own submission instead of reporting actual similarities across submissions. According to our evaluation with 390 students and five quasi-experiments, students shown simulated similarities are slightly more aware of plagiarism and collusion, self-plagiarism in particular. Their awareness of the matter is reasonably acceptable (around 75%), and they see the benefits of our assessment submission system.
Source code plagiarism is an emerging issue in computer science education, and a number of techniques have been proposed to address it. However, comparing these techniques is challenging, since each is evaluated on its own private dataset(s). This paper contributes a public dataset for comparing such techniques. Specifically, the dataset is designed for evaluation from an Information Retrieval (IR) perspective. It consists of 467 source code files covering seven introductory programming assessment tasks. Unique to this dataset, both the intention to plagiarise and advanced plagiarism attacks are considered in its construction. The dataset's characteristics were observed by comparing three IR-based detection techniques, and the comparison shows that most IR-based techniques are less effective than a baseline relying on Running-Karp-Rabin Greedy-String-Tiling (RKR-GST), even though some of them are far more time-efficient.
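To illustrate the trade-off the abstract refers to, the sketch below (not taken from the paper) contrasts a plain Greedy String Tiling similarity, written without the Running-Karp-Rabin hashing speed-up the baseline uses, against an IR-style cosine similarity over token frequencies. Function names, the min_match threshold, and the whitespace tokenisation are illustrative assumptions only.

```python
from collections import Counter
from math import sqrt


def greedy_string_tiling(a, b, min_match=3):
    """Simplified Greedy String Tiling: repeatedly find the longest
    unmarked common substring of token sequences a and b, mark it as a
    tile, and stop once the longest remaining match is shorter than
    min_match. (The RKR hashing speed-up is omitted for clarity.)"""
    marked_a = [False] * len(a)
    marked_b = [False] * len(b)
    tiles = []
    while True:
        max_len, matches = 0, []
        for i in range(len(a)):
            for j in range(len(b)):
                k = 0
                while (i + k < len(a) and j + k < len(b)
                       and a[i + k] == b[j + k]
                       and not marked_a[i + k] and not marked_b[j + k]):
                    k += 1
                if k > max_len:
                    max_len, matches = k, [(i, j)]
                elif k == max_len and k > 0:
                    matches.append((i, j))
        if max_len < min_match:
            break
        for i, j in matches:
            # Skip matches that now overlap a tile marked in this pass.
            if any(marked_a[i + k] or marked_b[j + k] for k in range(max_len)):
                continue
            for k in range(max_len):
                marked_a[i + k] = True
                marked_b[j + k] = True
            tiles.append((i, j, max_len))
    return tiles


def gst_similarity(a, b, min_match=3):
    """Normalised similarity in [0, 1]: twice the total tiled length
    divided by the combined sequence length."""
    covered = sum(length for _, _, length in greedy_string_tiling(a, b, min_match))
    return 2 * covered / (len(a) + len(b)) if (a or b) else 0.0


def cosine_similarity(a, b):
    """IR-style comparison: cosine similarity of token-frequency vectors.
    Much cheaper than tiling, but it ignores token order, which is one
    reason such techniques can miss structural rearrangements."""
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[t] * cb[t] for t in ca)
    norm = sqrt(sum(v * v for v in ca.values())) * sqrt(sum(v * v for v in cb.values()))
    return dot / norm if norm else 0.0


if __name__ == "__main__":
    left = "int total = 0 ; for ( int i = 0 ; i < n ; i ++ ) total += i ;".split()
    right = "int sum = 0 ; for ( int j = 0 ; j < n ; j ++ ) sum += j ;".split()
    print("GST:", round(gst_similarity(left, right), 3))
    print("Cosine:", round(cosine_similarity(left, right), 3))
```

The tiling approach rewards long contiguous runs of shared tokens (renaming a variable breaks a tile), whereas the frequency-based measure runs far faster but treats both snippets as bags of tokens, which is consistent with the effectiveness-versus-efficiency contrast reported above.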