T-norms or t-conorms? How to aggregate similarity degrees for plagiarism detection

A new paper by Maciek Bartoszuk and me is to appear in Knowledge-Based Systems (doi:10.1016/j.knosys.2021.107427).

Abstract. Making correct decisions as to whether code chunks should be considered similar becomes increasingly important in software design and education and not only can improve the quality of computer programs, but also help assure the integrity of student assessments. In this paper we test numerous source code similarity detection tools on pairs of code fragments written in the data science-oriented functional programming language R. Contrary to mainstream approaches, instead of considering symmetric measures of “how much code chunks A and B are similar to each other”, we propose and study the nonsymmetric degrees of inclusion “to what extent A is a subset of B” and “to what degree B is included in A”. Overall, t-norms yield better precision (how many suspicious pairs are actually similar), t-conorms maximise recall (how many similar pairs are successfully retrieved), and custom aggregation functions fitted to training data provide a good balance between the two. Also, we find that program dependence graph-based methods tend to outperform those relying on normalised source code text, tokens, and names of functions invoked.
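To make the precision/recall trade-off from the abstract a bit more concrete, here is a minimal sketch in Python (not code from the paper). It takes the two inclusion degrees for a pair of code chunks, "A in B" and "B in A", and aggregates them with two standard t-norms (minimum, product) and two standard t-conorms (maximum, probabilistic sum). The pair names, the degree values, and the 0.5 decision threshold are made up purely for illustration, and the function names are hypothetical.

```python
# Illustrative sketch only: the degrees, pair names, and threshold are
# assumed values, not data or settings from the paper.

def t_norm_min(x, y):
    """Minimum t-norm: high only if BOTH inclusion degrees are high."""
    return min(x, y)

def t_norm_product(x, y):
    """Product t-norm: never exceeds the minimum on [0, 1]."""
    return x * y

def t_conorm_max(x, y):
    """Maximum t-conorm: high if AT LEAST ONE inclusion degree is high."""
    return max(x, y)

def t_conorm_prob_sum(x, y):
    """Probabilistic sum t-conorm: never falls below the maximum on [0, 1]."""
    return x + y - x * y

# Hypothetical (degree of "A in B", degree of "B in A") for three pairs.
pairs = {
    "pair_1": (0.90, 0.80),  # each chunk largely contained in the other
    "pair_2": (0.95, 0.30),  # A is almost a subset of a much larger B
    "pair_3": (0.20, 0.10),  # little overlap either way
}

THRESHOLD = 0.5  # assumed cut-off for calling a pair "suspicious"

def flagged(aggregate, pairs, threshold=THRESHOLD):
    """Return the sorted names of pairs whose aggregated degree reaches the threshold."""
    return sorted(
        name for name, (a_in_b, b_in_a) in pairs.items()
        if aggregate(a_in_b, b_in_a) >= threshold
    )

for fn in (t_norm_min, t_norm_product, t_conorm_max, t_conorm_prob_sum):
    print(f"{fn.__name__:>18}: {flagged(fn, pairs)}")
```

With these toy numbers, both t-norms flag only pair_1, while both t-conorms also flag the one-sided pair_2. Since every t-norm is bounded above by the minimum and every t-conorm is bounded below by the maximum, a t-norm at a fixed threshold can only flag a subset of what a t-conorm flags. That is consistent with the stricter rule favouring precision and the more permissive one favouring recall.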