Let’s Code a Plagiarism Detector
One that can catch you red-handed copying text from Medium or code from Stack Overflow :)
Applications like a plagiarism detector don’t follow a single, clear algorithm; they are built on heuristics. Any rule you implement may fail in some instance, and sometimes the crudest algorithm gets the job done, so there is always room for ambiguity.
Let’s cut through that ambiguity. What follows is what I have learned from extensive research in this domain, from delivering a large-scale project, and from consulting work that had to satisfy both sides of the logic: technology and business impact.
Key Players
Key players in this market are undoubtedly:
1- Python
2- C / C++
3- Java
4- DSA (Guru Jee) 🥳
5- NLP (Natural Language Processing)
Ways to code a plagiarism checker
Let’s dive into the methods you can use to code a plagiarism checker yourself. These include algorithms, hacks, modules, frameworks, libraries, and other tools & techniques.
1- Python using Difflib (SequenceMatcher) Built-In Library
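A minimal sketch using the standard-library difflib; the sample strings here are my own:

```python
# Compare two text snippets with difflib's SequenceMatcher.
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Return a similarity ratio in [0, 1]; 1.0 means identical."""
    return SequenceMatcher(None, a, b).ratio()

original = "I love reading books on Medium."
suspect = "I love reading articles on Medium."
print(round(similarity(original, suspect), 2))
```

SequenceMatcher is quick to set up, but it compares raw character sequences, so it works best after the pre-processing steps described later (lowercasing, stop-word removal).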
2- Python Third Party Libraries & Modules like:
copydetect
copyleaks
pycode-similar
plag-checker etc.
3- Winnowing Algorithm
The winnowing algorithm creates fingerprints of a document: it converts the text into a set of hash values, called fingerprints, which can then be compared across documents.
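A rough winnowing sketch. The parameters k = 5 (k-gram size) and w = 4 (window size) and the use of Python’s built-in hash are my own simplifying choices; production implementations use a proper rolling hash:

```python
# Winnowing: hash all k-grams, then keep the smallest hash in each window.
def kgram_hashes(text: str, k: int) -> list:
    text = "".join(text.lower().split())  # drop whitespace, lowercase
    return [hash(text[i:i + k]) for i in range(len(text) - k + 1)]

def winnow(text: str, k: int = 5, w: int = 4) -> set:
    hashes = kgram_hashes(text, k)
    fingerprints = set()
    for i in range(len(hashes) - w + 1):
        fingerprints.add(min(hashes[i:i + w]))  # smallest hash per window
    return fingerprints

doc_a = "the quick brown fox jumps over the lazy dog"
doc_b = "a quick brown fox jumps over a sleepy dog"
fp_a, fp_b = winnow(doc_a), winnow(doc_b)
overlap = len(fp_a & fp_b) / len(fp_a | fp_b)
print(f"fingerprint overlap: {overlap:.2f}")
```

The shared fingerprints survive even when the surrounding words change, which is why winnowing is a favourite for source-code plagiarism tools.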
4- Tokenization or Token Frequency Matching
Word tokenization splits sentences into words. For example, ‘I love books.’ tokenizes to ‘I’, ‘love’, ‘books’ and ‘.’. We also lowercase all letters to generalize tokens across the texts.
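A simple tokenizer along those lines, using a regex of my own choosing:

```python
# Split text into lowercase word and punctuation tokens.
import re

def tokenize(sentence: str) -> list:
    # \w+ grabs word characters, [^\w\s] grabs individual punctuation marks
    return re.findall(r"\w+|[^\w\s]", sentence.lower())

print(tokenize("I love books."))  # ['i', 'love', 'books', '.']
```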
5- Stop Words Phenomenon (Pre-Processing)
Stop words like ‘or’, ‘the’ and ‘in’, as well as punctuation marks, are removed.
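A sketch with a small, illustrative stop list (the list itself is my own and far from exhaustive; NLTK and spaCy ship proper ones):

```python
# Filter stop words and punctuation out of a token list.
import string

STOP_WORDS = {"or", "the", "in", "a", "an", "and", "is", "of", "to"}

def remove_stop_words(tokens: list) -> list:
    return [t for t in tokens
            if t not in STOP_WORDS and t not in string.punctuation]

print(remove_stop_words(["the", "cat", "sat", "in", "the", "hat", "."]))
# ['cat', 'sat', 'hat']
```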
6- N-Grams
Tools such as Ferret use n-grams: sequences of n consecutive tokens, i.e. bigrams, trigrams or four-grams depending on the value of n.
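Generating n-grams from a token list is nearly a one-liner:

```python
# Slide a window of size n over the token list.
def ngrams(tokens: list, n: int) -> list:
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

print(ngrams(["i", "love", "books"], 2))  # [('i', 'love'), ('love', 'books')]
```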
7- Longest Common Subsequence (LCS)
The longest common subsequence (LCS) is defined as the longest subsequence that is common to all the given sequences, provided that the elements of the subsequence are not required to occupy consecutive positions within the original sequences.
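The classic dynamic-programming solution for the LCS length, for reference:

```python
# DP table: dp[i][j] = LCS length of a[:i] and b[:j].
def lcs_length(a: str, b: str) -> int:
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1  # extend the common subsequence
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

print(lcs_length("ABCBDAB", "BDCABA"))  # 4 (e.g. "BCAB")
```

A long LCS relative to document length is a strong plagiarism signal even when filler words have been inserted.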
8- Rabin Karp Algorithm
The Rabin-Karp algorithm searches for patterns in text using a hash function. Unlike the naive string-matching algorithm, it does not compare every character in the initial phase; it filters out the positions whose hashes do not match and only then performs the character-by-character comparison.
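A self-contained Rabin-Karp sketch (base 256 and modulus 101 are conventional illustrative choices, not requirements):

```python
# Rabin-Karp: compare rolling hashes first, verify characters only on a hash hit.
def rabin_karp(text: str, pattern: str, base: int = 256, mod: int = 101) -> list:
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    high = pow(base, m - 1, mod)  # weight of the leading character
    p_hash = t_hash = 0
    for i in range(m):
        p_hash = (p_hash * base + ord(pattern[i])) % mod
        t_hash = (t_hash * base + ord(text[i])) % mod
    matches = []
    for i in range(n - m + 1):
        if p_hash == t_hash and text[i:i + m] == pattern:  # verify on hash hit
            matches.append(i)
        if i < n - m:  # roll the hash one character forward
            t_hash = ((t_hash - ord(text[i]) * high) * base + ord(text[i + m])) % mod
    return matches

print(rabin_karp("abracadabra", "abra"))  # [0, 7]
```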
9- Naive String-Matching Algorithm
It is the simplest method of all for finding patterns in an input text.
It slides the pattern along the string and checks the characters one by one.
It reports every exact occurrence of the pattern, however many there are.
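The naive search in a few lines:

```python
# Check every possible starting position, character by character.
def naive_search(text: str, pattern: str) -> list:
    matches = []
    for i in range(len(text) - len(pattern) + 1):
        if text[i:i + len(pattern)] == pattern:
            matches.append(i)
    return matches

print(naive_search("aabaacaadaabaaba", "aaba"))  # [0, 9, 12]
```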
10- Levenshtein’s Distance Algorithm / Edit Distance
The Levenshtein distance (a.k.a. edit distance) is a measure of similarity between two strings. It is defined as the minimum number of edits required to convert string a into string b, where each edit inserts, deletes or replaces a single character.
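The standard DP-table implementation of that definition:

```python
# dp[i][j] = edit distance between a[:i] and b[:j].
def levenshtein(a: str, b: str) -> int:
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        dp[i][0] = i  # delete all of a's prefix
    for j in range(len(b) + 1):
        dp[0][j] = j  # insert all of b's prefix
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution
    return dp[len(a)][len(b)]

print(levenshtein("kitten", "sitting"))  # 3
```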
11- Bag of words & Removing White Spaces (Pre-Processing)
Pre-process the input text using the bag-of-words representation and by removing extra white space.
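A tiny bag-of-words sketch using collections.Counter; note that str.split with no arguments also collapses the extra white space:

```python
# Normalize whitespace, lowercase, and count word frequencies.
from collections import Counter

def bag_of_words(text: str) -> Counter:
    words = text.lower().split()  # split() swallows runs of white space
    return Counter(words)

print(bag_of_words("The  quick   brown fox the fox"))
# Counter({'the': 2, 'fox': 2, 'quick': 1, 'brown': 1})
```

These frequency vectors are exactly what the cosine similarity in the next section consumes.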
12- Cosine Matching Index (Guru among all)
Represents each text as a vector and takes the cosine of the angle between the two vectors; that cosine is then used as the comparison index between the two texts.
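A pure-Python cosine similarity over bag-of-words vectors (word-level frequencies are one possible choice of representation; character or n-gram frequencies work the same way):

```python
# Cosine of the angle between two word-frequency vectors.
import math
from collections import Counter

def cosine_similarity(text_a: str, text_b: str) -> float:
    va, vb = Counter(text_a.lower().split()), Counter(text_b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)  # only shared words contribute
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

print(round(cosine_similarity("I love books", "I love articles"), 2))  # 0.67
```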
13- The Jaccard Similarity
Jaccard similarity measures the similarity between two sets of data by looking at which members are shared and which are distinct. It is calculated by dividing the number of elements in the intersection of the two sets by the number of elements in their union.
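Jaccard over token sets is essentially one line:

```python
# |intersection| / |union| of two token sets.
def jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b) if a | b else 0.0

tokens_a = {"i", "love", "books"}
tokens_b = {"i", "love", "articles"}
print(round(jaccard(tokens_a, tokens_b), 2))  # 0.5
```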
14- Jaro-Winkler Similarity
In statistics, the Jaro-Winkler similarity is a way to measure the similarity between two strings; it builds on the Jaro similarity and gives extra weight to a shared prefix.
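A from-scratch implementation following the textbook definition (the prefix weight p = 0.1 and the 4-character prefix cap are the standard Winkler parameters):

```python
# Jaro similarity: matching characters within a window, minus transpositions.
def jaro(s1: str, s2: str) -> float:
    if s1 == s2:
        return 1.0
    if not s1 or not s2:
        return 0.0
    window = max(len(s1), len(s2)) // 2 - 1
    match1, match2 = [False] * len(s1), [False] * len(s2)
    matches = 0
    for i, c in enumerate(s1):  # count matching characters
        lo, hi = max(0, i - window), min(len(s2), i + window + 1)
        for j in range(lo, hi):
            if not match2[j] and s2[j] == c:
                match1[i] = match2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    half_transpositions, k = 0, 0
    for i in range(len(s1)):  # compare matched characters in order
        if match1[i]:
            while not match2[k]:
                k += 1
            if s1[i] != s2[k]:
                half_transpositions += 1
            k += 1
    t = half_transpositions / 2
    m = matches
    return (m / len(s1) + m / len(s2) + (m - t) / m) / 3

def jaro_winkler(s1: str, s2: str, p: float = 0.1) -> float:
    j = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):  # common prefix, capped at 4 characters
        if a != b or prefix == 4:
            break
        prefix += 1
    return j + prefix * p * (1 - j)

print(round(jaro_winkler("MARTHA", "MARHTA"), 3))  # 0.961
```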
Before wrapping up, let’s cease fire between the cos’s & levs!
Cosine Similarity Index vs Levenshtein’s Distance Algorithm
The vectors compared with cosine can, for instance, contain frequencies of characters or character n-grams, making cosine a string similarity measure. Conversely, one can replace Levenshtein’s sequence of characters with a sequence of words or n-grams, making Levenshtein a more general distance measure.
The main conceptual difference between Cosine and Levenshtein is that the former assumes a “bag-of-words” vector representation, i.e. compares unordered sets, whereas the latter takes into account the order of the elements in the sequences.
I would say that one could use cosine similarity when dealing with vectors (for instance, the distance between (1, 2, 3) and (4, 5, 6)) and the Levenshtein distance when dealing with strings (the “distance” between “aaaaa” and “aaaba”). Cosine similarity computes the similarity between objects represented as vectors; Levenshtein distance computes the similarity between sequences of characters.
Cosine is for Code & Levenshtein is for text :)
That’s it :)