CN113011154A

CN113011154A - Homework duplicate checking method based on deep learning

Info

Publication number: CN113011154A
Application number: CN202110279211.3A
Authority: CN
Inventors: 张凌; 胡布焕; 张晶
Original assignee: South China University of Technology SCUT; CERNET Corp
Current assignee: South China University of Technology SCUT; CERNET Corp
Priority date: 2021-03-16
Filing date: 2021-03-16
Publication date: 2021-06-22

Abstract

The invention discloses a homework duplicate checking method based on deep learning, comprising the following steps: obtaining student course homework data and homework template files; determining the homework template format; splitting the obtained homework into individual questions; determining whether each question is a subjective question or an objective question; preprocessing the text of the answers to the subjective questions after splitting; calculating the similarity between student homework using deep learning (namely a convolutional neural network model); analyzing the similarity results; clustering student homework with high similarity; and generating a similarity report. To make it easier for teachers to review similar content, the invention marks the similar content between similar homework. The method can find semantically similar text in homework and addresses the poor resistance to interference of many plagiarism detection methods.

Description

Homework duplicate checking method based on deep learning

Technical Field

The invention relates to the technical field of student homework duplicate checking, in particular to a homework duplicate checking method based on deep learning.

Background

In online teaching support at colleges and universities, electronic documents have become one of the main forms in which students submit homework. As attention to academic integrity grows, how to help teachers find copied content in the homework submitted by students has become a research hotspot.

At present there are many plagiarism detection systems, such as the China National Knowledge Infrastructure (CNKI) academic misconduct literature detection system in China, and Turnitin, PlagScan and Dupli Checker abroad. These systems can help teachers find plagiarized parts of submitted homework, but because they use the Internet as the source for comparison, they have difficulty discovering plagiarism between students' local homework. Many plagiarism detection methods have been studied and put into use, the most popular being lexical methods. Lexical plagiarism detection mainly considers the lexical features of the text; one example, widely used in early work, is fingerprint feature extraction, which represents each document as a fingerprint sequence and calculates the similarity between documents from those sequences. Lexical methods are suitable for simple copy and paste, but they perform poorly when the plagiarist evades detection by, for example, paraphrasing the text. Researchers also use grammar-based plagiarism detection methods (e.g., part-of-speech tagging), semantic-based methods (e.g., explicit semantic analysis, latent semantic analysis), and machine-learning-based methods (e.g., support vector machines, linear regression models), among others.

With the wide application of deep learning in computing, many researchers have applied it to plagiarism detection and achieved good results. A key component of plagiarism detection is text similarity calculation. Deep learning can detect paraphrasing, synonym replacement and similar changes during text similarity calculation, so plagiarism detection based on deep learning can find semantic plagiarism as well as verbatim plagiarism.

Disclosure of Invention

The invention aims to overcome the defects of the prior art by providing a homework duplicate checking method based on deep learning that can accurately find semantically similar text in homework and addresses the poor resistance to interference of many plagiarism detection methods.

In order to achieve the purpose, the technical scheme provided by the invention is as follows: a homework duplicate checking method based on deep learning comprises the following steps:

1) acquiring student course homework data and homework template files;

2) determining the format of the homework template, splitting the obtained homework into questions, and determining whether each question in the homework is a subjective question or an objective question;

3) performing text preprocessing on the answers to the subjective questions in the split homework;

4) calculating the similarity between student assignments;

5) analyzing the similarity calculation results, and clustering student homework with high similarity to generate a similarity report;

6) marking the similar content between similar homework to complete the duplicate check.

In step 1), the student course homework data refers to student homework obtained from courses on an online learning platform; the homework template file is a file specifying the homework answer format, submitted in the course by a teacher or teaching assistant on the online learning platform.

In step 2), the format of the homework template is determined, the obtained homework is split into questions, and each question is classified as subjective or objective, specifically as follows:

determining the format of the homework template: the system provides teachers with several homework template formats, and uses regular expressions to determine which template format the obtained homework template belongs to;

splitting the obtained homework into questions: once the template format is determined, the regular expression corresponding to that format is used to split each student's homework, and the split result is returned;

classifying questions as subjective or objective: each question is classified from the content of the student's answer according to the following rules: a. if the answer is preceded by "answer: ", the question is a subjective question; b. a question whose answer content has a length of less than 20 is an objective question; c. any question that cannot be classified by the above conditions is a subjective question.
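The template detection, question splitting and rules a to c above can be sketched in Python as follows. This is only an illustration: the two template patterns, the template names and the function names are hypothetical, since the text does not list the actual template formats or regular expressions, and the answer length is assumed to be measured in characters.

```python
import re

# Hypothetical template formats; the real system's patterns are not given in the text.
TEMPLATES = {
    "numbered": re.compile(r"^\s*\d+\s*[.、)]", re.MULTILINE),         # "1." / "2、" / "3)"
    "labelled": re.compile(r"^\s*(Question|题目)\s*\d+", re.MULTILINE),  # "Question 1"
}
# Assumed answer marker, following the "answer: " prefix named in rule a.
ANSWER_PREFIX = re.compile(r"^\s*answer\s*[:：]", re.IGNORECASE)


def detect_template(template_text):
    """Return the name of the first template pattern found in the template file."""
    for name, pattern in TEMPLATES.items():
        if pattern.search(template_text):
            return name
    return None


def split_questions(homework_text, template_name):
    """Cut a homework text into one chunk per question using the template's pattern."""
    pattern = TEMPLATES[template_name]
    starts = [m.start() for m in pattern.finditer(homework_text)]
    bounds = starts + [len(homework_text)]
    return [homework_text[a:b].strip() for a, b in zip(bounds, bounds[1:])]


def is_subjective(answer, max_objective_len=20):
    """Apply rules a, b and c in order to one answer string."""
    if ANSWER_PREFIX.match(answer):
        return True                      # rule a: explicit "answer:" prefix
    if len(answer.strip()) < max_objective_len:
        return False                     # rule b: short answers are objective
    return True                          # rule c: everything else is subjective
```

Only the chunks whose answers are classified as subjective would be passed on to the preprocessing step.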

In step 3), text preprocessing is performed on the answers to the subjective questions in the split homework, specifically as follows:

a. Chinese/English determination: a part-of-speech analyzer is used to segment the text into words and tag the part of speech of each word; the numbers of Chinese words and English words are counted, the proportions of Chinese and English in the answer content are calculated, and the language with the larger proportion is taken as the language of the homework;

b. a different preprocessing flow is applied for each language: the Chinese flow is sentence segmentation, word segmentation and stop word removal; the English flow is sentence segmentation, lemmatization, case normalization and symbol removal.
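A minimal sketch of the two preprocessing flows is given below. It makes several assumptions that are not in the text: language detection compares the count of CJK characters with the count of Latin-letter words instead of using a part-of-speech analyzer, the Chinese word segmenter and the English lemmatizer are passed in as hypothetical callables (`segment`, `lemmatize`) with trivial defaults, and the sentence delimiters are illustrative.

```python
import re
import string

CJK_CHAR = re.compile(r"[\u4e00-\u9fff]")
LATIN_WORD = re.compile(r"[A-Za-z]+")


def detect_language(text):
    """Return 'zh' or 'en' depending on which script dominates the answer."""
    zh_count = len(CJK_CHAR.findall(text))
    en_count = len(LATIN_WORD.findall(text))
    return "zh" if zh_count >= en_count else "en"


def preprocess_english(text, lemmatize=lambda w: w):
    """Sentence splitting, lemmatization, lower-casing, symbol removal."""
    strip_symbols = str.maketrans("", "", string.punctuation)
    result = []
    for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
        words = []
        for token in sentence.split():
            token = lemmatize(token).lower().translate(strip_symbols)
            if token:
                words.append(token)
        if words:
            result.append(words)
    return result


def preprocess_chinese(text, segment=list, stop_words=frozenset()):
    """Sentence splitting, word segmentation, stop word removal.

    The default `segment=list` splits per character and is only a placeholder
    for a real word segmenter.
    """
    result = []
    for sentence in re.split(r"[。！？；\n]+", text):
        words = [w for w in segment(sentence.strip())
                 if w.strip() and w not in stop_words]
        if words:
            result.append(words)
    return result
```

Both functions return a list of sentences, each sentence being a list of words, which is the shape the later sentence-level steps operate on.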

In step 4), the similarity between each assignment and every other assignment is calculated. The assignment set is $A = (A_1, A_2, \ldots, A_t)$, where $t$ is the total number of assignments in the set and $A_t$ is the $t$-th assignment. For a given assignment $A_i$, its similarity to each other assignment $A_j$ is calculated, where $1 \le j \le t$ and $j \ne i$. The specific process is as follows:

4.1) sentence screening: sentences with fewer than 3 words or more than 20 words are discarded, as are repeated sentences, and each remaining sentence is tagged to indicate which homework text it comes from;
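The screening in 4.1) can be sketched as below. The function name and data layout are hypothetical, and the sketch assumes that "repeated sentences" means exact repeats within the same assignment, which the text does not state explicitly.

```python
def screen_sentences(assignments, min_words=3, max_words=20):
    """Filter sentences by length, drop repeats, and tag each with its source.

    assignments: dict mapping an assignment id to a list of tokenized sentences.
    Returns a list of (assignment_id, words) tuples.
    """
    kept = []
    for assignment_id, sentences in assignments.items():
        seen = set()
        for words in sentences:
            key = tuple(words)
            if len(key) < min_words or len(key) > max_words:
                continue          # too short or too long
            if key in seen:
                continue          # repeated within this assignment (assumption)
            seen.add(key)
            kept.append((assignment_id, key))
    return kept
```

Keeping the assignment id next to each sentence is what later allows similar sentences to be traced back to the assignment they came from.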

4.2) sentence pre-matching: a fast string-matching method is used to screen out sentence pairs in two assignments that may be similar, with the following steps:

4.2.1) taking each single word in a sentence as a key, sentences containing the same key are grouped into one cluster; for sentence $S_1(w_{11}, w_{12}, \ldots, w_{1n})$ and sentence $S_2(w_{21}, w_{22}, \ldots, w_{2m})$, where $(w_{11}, w_{12}, \ldots, w_{1n})$ and $(w_{21}, w_{22}, \ldots, w_{2m})$ are the words of $S_1$ and $S_2$ respectively, if $w_{11} = w_{21}$, then $S_1$ and $S_2$ are put into the same array;

4.2.2) given a threshold, the sentence pairs in each cluster whose similarity is greater than the threshold are output; let the set $U(w_{u1}, w_{u2}, \ldots, w_{ua})$ be the set of words in sentences $S_1(w_{11}, w_{12}, \ldots, w_{1n})$ and $S_2(w_{21}, w_{22}, \ldots, w_{2m})$, where $(w_{u1}, w_{u2}, \ldots, w_{ua})$ are the words in $U$ and $a$ is the number of words in $U$, $1 < a < m + n$; the similarity of $S_1$ and $S_2$ is the number of words the two sentences have in common as a proportion of their combined word set $U$, and sentence pairs with similarity greater than the threshold are output;
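The two pre-matching steps amount to an inverted index over words followed by a set-overlap score (shared words divided by the combined word set, i.e. a Jaccard ratio). The sketch below is one possible reading; the function names, the default threshold of 0.5 and the decision to skip pairs from the same assignment are assumptions rather than details taken from the text.

```python
from collections import defaultdict
from itertools import combinations


def word_overlap(words_a, words_b):
    """Shared words divided by the size of the combined word set U."""
    set_a, set_b = set(words_a), set(words_b)
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


def prematch(tagged_sentences, threshold=0.5):
    """Return (i, j, score) for cross-assignment sentence pairs above the threshold.

    tagged_sentences: list of (assignment_id, words) tuples.
    """
    # 4.2.1: every word is a key; sentences sharing a key land in one cluster.
    clusters = defaultdict(set)
    for index, (_, words) in enumerate(tagged_sentences):
        for word in set(words):
            clusters[word].add(index)

    # Candidate pairs are sentences that share at least one cluster.
    candidates = set()
    for members in clusters.values():
        for i, j in combinations(sorted(members), 2):
            if tagged_sentences[i][0] != tagged_sentences[j][0]:
                candidates.add((i, j))

    # 4.2.2: keep only the pairs whose overlap exceeds the threshold.
    matches = []
    for i, j in sorted(candidates):
        score = word_overlap(tagged_sentences[i][1], tagged_sentences[j][1])
        if score > threshold:
            matches.append((i, j, score))
    return matches
```

Because only sentences that share a word are ever compared, most unrelated pairs are never scored, which is where the speed-up over an all-pairs comparison comes from.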

4.3) a convolutional neural network model is called to perform semantic similarity matching on the sentence pairs above the threshold; for a sentence pair $P_1(S_1, S_2)$ whose similarity is greater than the threshold, where $S_1$ and $S_2$ come from assignment $A_i$ and assignment $A_j$ respectively, the sentence pair is used as the input of the neural network and the semantic similarity of the two sentences is calculated, as follows:

4.3.1) using a trained word2vec model, word embedding is applied to sentence $S_1(w_{11}, w_{12}, \ldots, w_{1n})$ and sentence $S_2(w_{21}, w_{22}, \ldots, w_{2m})$ of the sentence pair $P_1(S_1, S_2)$ to generate the word embedding matrices $M_1(v_{11}, v_{12}, \ldots, v_{1n})$ and $M_2(v_{21}, v_{22}, \ldots, v_{2m})$, where $(v_{11}, v_{12}, \ldots, v_{1n})$ and $(v_{21}, v_{22}, \ldots, v_{2m})$ are the word vectors in $M_1$ and $M_2$ respectively; the word embedding matrices serve as the input of the convolutional neural network model;

4.3.2) the word embedding matrices are fed to the convolutional neural network model, which consists of two convolutional layers, one max pooling layer and three fully connected layers and is trained on a semantic similarity dataset; the network yields the semantic similarity of each sentence pair, and the sentence pairs whose similarity is greater than a threshold are returned;

4.3.3) for the similar sentence pairs, all the sentences that come from assignment $A_i$ are collected into an array; the sum $m$ of the lengths of these sentences and the total text length $s$ of assignment $A_i$ are calculated, and $\frac{m}{s}$ is taken as the similarity $\mathrm{Similarity}(A_i, A_j)$ of assignment $A_i$ and assignment $A_j$;
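The ratio in 4.3.3) can be illustrated with a few lines of Python. The function name is hypothetical, and lengths are assumed to be counted in words, since the text does not say whether length means words or characters.

```python
def assignment_similarity(sentences_i, similar_indices):
    """Compute m / s for assignment A_i against one other assignment A_j.

    sentences_i: all tokenized sentences of A_i.
    similar_indices: indices of the sentences of A_i that the network judged
        similar to some sentence of A_j (repeated indices are counted once).
    """
    s = sum(len(sentence) for sentence in sentences_i)                   # total length of A_i
    m = sum(len(sentences_i[k]) for k in set(similar_indices))           # length of similar part
    return m / s if s else 0.0
```

Since the denominator is the length of $A_i$ alone, this measure is not symmetric: $\mathrm{Similarity}(A_i, A_j)$ and $\mathrm{Similarity}(A_j, A_i)$ can differ when the two assignments have different lengths.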

In step 5), for an assignment pair $(A_i, A_j)$ and an assignment pair $(A_j, A_k)$, if $\mathrm{Similarity}(A_i, A_j) \ge$ threshold and $\mathrm{Similarity}(A_j, A_k) \ge$ threshold, then $A_i$, $A_j$ and $A_k$ are grouped into one cluster, where $\mathrm{Similarity}(A_j, A_k)$ is the similarity of assignment $A_j$ and assignment $A_k$.
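The grouping rule in step 5) is transitive, so it can be read as finding connected components in a graph whose edges are the assignment pairs at or above the threshold. The following sketch uses a small union-find structure for this; the function name, the input layout and the default threshold of 0.6 are assumptions made for illustration.

```python
def cluster_assignments(similarity, threshold=0.6):
    """Group assignments that are linked by pairs with similarity >= threshold.

    similarity: dict mapping (assignment_a, assignment_b) to a score.
    Returns a list of sets, one per cluster of suspected assignments.
    """
    parent = {}

    def find(x):
        # Find the representative of x, halving the path as we go.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (a, b), score in similarity.items():
        if score >= threshold:
            parent[find(a)] = find(b)   # merge the two groups

    groups = {}
    for x in list(parent):
        groups.setdefault(find(x), set()).add(x)
    return [members for members in groups.values() if len(members) > 1]
```

For example, with scores of 0.8 for (A1, A2) and 0.7 for (A2, A3), the three assignments end up in one cluster even if the direct score for (A1, A3) is low.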

In step 6), after the set of sentences in one assignment that are similar to sentences in another assignment is found, those sentences are located in the assignment and highlighted using a PDF highlighting tool.

Compared with the prior art, the invention has the following advantages and beneficial effects:

1. The method is based on deep learning (namely a convolutional neural network model) and solves the problem that similar content cannot be found in the presence of interference such as paraphrasing; applying the method provided by the invention to an existing plagiarism detection system can effectively improve the efficiency with which teachers check student homework for duplication.

2. By using sentence screening and fast matching, sentences that are too long or too short are filtered out and potentially similar sentence pairs are matched quickly, which solves the problem of excessively long plagiarism detection time.

Drawings

FIG. 1 is a logic flow diagram of the present invention.

Detailed Description

The present invention will be described in further detail with reference to examples and drawings, but the present invention is not limited thereto.

As shown in FIG. 1, the homework duplicate checking method based on deep learning provided by this embodiment includes the following steps:

1) Acquiring student course homework data and homework template files, and reading the file contents; the student course homework data refers to student homework obtained from courses on the online learning platform; the homework template file is a file specifying the homework answer format, submitted in the course by a teacher or teaching assistant on the online learning platform.

2) Determining the format of the homework template, splitting the obtained homework into questions, and determining whether each question in the homework is a subjective question or an objective question, specifically as follows:

3) Performing text preprocessing on the answers to the subjective questions in the split homework, specifically as follows:

b. a different preprocessing flow is applied for each language: the Chinese flow includes sentence segmentation, word segmentation, stop word removal and the like; the English flow includes sentence segmentation, lemmatization, case normalization, symbol removal and the like.

4) Calculating the similarity between student assignments: the similarity between each assignment and every other assignment is calculated. The assignment set is $A = (A_1, A_2, \ldots, A_t)$, where $t$ is the total number of assignments in the set and $A_t$ is the $t$-th assignment. For a given assignment $A_i$, its similarity to each other assignment $A_j$ is calculated, where $1 \le j \le t$ and $j \ne i$. The specific process is as follows:

4.1) sentence screening: sentences that are too short (fewer than 3 words) or too long (more than 20 words) are discarded, as are repeated sentences, and each remaining sentence is tagged to indicate which homework text it comes from;

4.2.1) taking each single word in a sentence as a key, sentences containing the same key are grouped into one cluster; for sentence $S_1(w_{11}, w_{12}, \ldots, w_{1n})$ and sentence $S_2(w_{21}, w_{22}, \ldots, w_{2m})$, where $(w_{11}, w_{12}, \ldots, w_{1n})$ and $(w_{21}, w_{22}, \ldots, w_{2m})$ are the words of $S_1$ and $S_2$ respectively, if $w_{11} = w_{21}$, then $S_1$ and $S_2$ are put into the same array;

4.3.1) using a trained word2vec model, word embedding is applied to sentence $S_1(w_{11}, w_{12}, \ldots, w_{1n})$ and sentence $S_2(w_{21}, w_{22}, \ldots, w_{2m})$ of the sentence pair $P_1(S_1, S_2)$ to generate the word embedding matrices $M_1(v_{11}, v_{12}, \ldots, v_{1n})$ and $M_2(v_{21}, v_{22}, \ldots, v_{2m})$, where $(v_{11}, v_{12}, \ldots, v_{1n})$ and $(v_{21}, v_{22}, \ldots, v_{2m})$ are the word vectors in $M_1$ and $M_2$ respectively; the word embedding matrices serve as the input of the convolutional neural network model;

4.3.3) for the similar sentence pairs, all the sentences that come from assignment $A_i$ are collected into an array; the sum $m$ of the lengths of these sentences and the total text length $s$ of assignment $A_i$ are calculated, and $\frac{m}{s}$ is taken as the similarity $\mathrm{Similarity}(A_i, A_j)$ of assignment $A_i$ and assignment $A_j$.

5) Analyzing the similarity calculation results and clustering student homework with high similarity to generate a similarity report, specifically as follows: for an assignment pair $(A_i, A_j)$ and an assignment pair $(A_j, A_k)$, if $\mathrm{Similarity}(A_i, A_j) \ge$ threshold and $\mathrm{Similarity}(A_j, A_k) \ge$ threshold, then $A_i$, $A_j$ and $A_k$ are grouped into one cluster, where $\mathrm{Similarity}(A_j, A_k)$ is the similarity of assignment $A_j$ and assignment $A_k$.

6) Marking the similar content between similar homework to complete the duplicate check, specifically as follows: after the set of sentences in one assignment that are similar to sentences in another assignment is found, those sentences are located in the assignment and highlighted using a PDF highlighting tool, so that the teacher can conveniently review the similar content.

From the similarity results between student homework, the teacher can see which assignments are suspected of plagiarism, and from the marked similar text, can see which passages are similar.

The above embodiments are preferred embodiments of the present invention, but the present invention is not limited to the above embodiments, and any other changes, modifications, substitutions, combinations, and simplifications which do not depart from the spirit and principle of the present invention should be construed as equivalents thereof, and all such changes, modifications, substitutions, combinations, and simplifications are intended to be included in the scope of the present invention.

Claims

1. A homework duplicate checking method based on deep learning is characterized by comprising the following steps:

1) acquiring student course homework data and homework template files;

4) calculating the similarity between student assignments;

2. The homework duplicate checking method based on deep learning according to claim 1, wherein: in step 1), the student course homework data refers to student homework obtained from courses on an online learning platform; the homework template file is a file specifying the homework answer format, submitted in the course by a teacher or teaching assistant on the online learning platform.

3. The homework duplicate checking method based on deep learning according to claim 1, wherein: in step 2), the format of the homework template is determined, the obtained homework is split into questions, and each question in the homework is classified as a subjective question or an objective question, specifically as follows:

4. The homework duplicate checking method based on deep learning according to claim 1, wherein: in step 3), text preprocessing is performed on the answers to the subjective questions in the split homework, specifically as follows:

5. The homework duplicate checking method based on deep learning according to claim 1, wherein: in step 4), the similarity between each assignment and every other assignment is calculated; the assignment set is $A = (A_1, A_2, \ldots, A_t)$, where $t$ is the total number of assignments in the set and $A_t$ is the $t$-th assignment; for a given assignment $A_i$, its similarity to each other assignment $A_j$ is calculated, where $1 \le j \le t$ and $j \ne i$; the specific process is as follows:

4.3.3) for the similar sentence pairs, all the sentences that come from assignment $A_i$ are collected into an array; the sum $m$ of the lengths of these sentences and the total text length $s$ of assignment $A_i$ are calculated, and $\frac{m}{s}$ is taken as the similarity $\mathrm{Similarity}(A_i, A_j)$ of assignment $A_i$ and assignment $A_j$;

6. The homework duplicate checking method based on deep learning according to claim 1, wherein: in step 6), after the set of sentences in one assignment that are similar to sentences in another assignment is found, those sentences are located in the assignment and highlighted using a PDF highlighting tool.