CN113011154A - Job duplicate checking method based on deep learning - Google Patents
Job duplicate checking method based on deep learning
- Publication number
- CN113011154A (application number CN202110279211.3A)
- Authority
- CN
- China
- Prior art keywords
- sentence
- similarity
- homework
- sentences
- word
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/10—Text processing
- G06F40/194—Calculation of difference between files
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/3332—Query translation
- G06F16/3334—Selection or weighting of terms from queries, including natural language queries
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/3332—Query translation
- G06F16/3335—Syntactic pre-processing, e.g. stopword elimination, stemming
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/334—Query execution
- G06F16/3344—Query execution using natural language analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/338—Presentation of query results
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
- G06F16/353—Clustering; Classification into predefined classes
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
- G06F40/216—Parsing using statistical methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/263—Language identification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/268—Morphological analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/284—Lexical analysis, e.g. tokenisation or collocates
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q50/00—Systems or methods specially adapted for specific business sectors, e.g. utilities or tourism
- G06Q50/10—Services
- G06Q50/20—Education
- G06Q50/205—Education administration or guidance
Abstract
The invention discloses a homework duplicate checking method based on deep learning, comprising the following steps: obtaining student course homework data and a homework template file; determining the homework template format; splitting the obtained homework into individual questions; determining whether each question is a subjective question or an objective question; preprocessing the text of the answers to the subjective questions; calculating the similarity between student homework using deep learning (namely a convolutional neural network model); analyzing the similarity results; clustering student homework with high similarity; and generating a similarity report. So that teachers can conveniently review the similar content, the invention also marks the similar content shared by similar homework. The method can find homework text that is semantically similar, and it addresses the poor robustness to interference of many plagiarism detection methods.
Description
Technical Field
The invention relates to the technical field of student homework duplicate checking, in particular to a homework duplicate checking method based on deep learning.
Background
In online-assisted teaching at colleges and universities, electronic documents have become one of the main forms in which students submit homework. As attention to academic integrity grows, how to help teachers find copied content in the homework that students submit has become a research hotspot.
At present there are many plagiarism detection systems, such as the China National Knowledge Infrastructure (CNKI) academic misconduct literature detection system in China, and Turnitin, PlagScan, and Dupli Checker abroad. These systems can help teachers find the plagiarized parts of submitted homework, but because they use the Internet as the source against which plagiarism is checked, they have difficulty discovering plagiarism among the homework that students submit locally. Many plagiarism detection methods have been studied and put into use, the most popular being lexical methods. Lexical plagiarism detection mainly considers the lexical features of the text, for example the fingerprint feature extraction that was widely used in early work. Fingerprint-based methods represent each document as a fingerprint sequence and calculate the similarity between documents from those sequences. Lexical methods are suitable for simple copy and paste, but they are much less effective when the plagiarist evades detection by, for example, paraphrasing the text. Researchers also use grammar-based plagiarism detection methods (e.g., part-of-speech tagging), semantic-based methods (e.g., explicit semantic analysis, latent semantic analysis), and machine-learning-based methods (e.g., support vector machines, linear regression models), among others.
With the wide application of deep learning in computing, many researchers have used deep learning for plagiarism detection and achieved good results. Text similarity calculation is one of the key components of plagiarism detection, and deep learning techniques can detect paraphrasing, synonym replacement, and the like when calculating text similarity. Using deep learning in a plagiarism detection task can therefore find not only word-level plagiarism but also semantic plagiarism.
Disclosure of Invention
The invention aims to overcome the shortcomings of the prior art by providing a homework duplicate checking method based on deep learning that can accurately find homework text with similar semantics and that addresses the poor robustness to interference of many plagiarism detection methods.
In order to achieve the purpose, the technical scheme provided by the invention is as follows: a homework duplicate checking method based on deep learning comprises the following steps:
1) acquiring student course homework data and a homework template file;
2) determining the format of the homework template, splitting the obtained homework into questions, and determining whether each question in the homework is a subjective question or an objective question;
3) performing text preprocessing on the answers to the subjective questions after the homework has been split into questions;
4) calculating the similarity between student homework;
5) analyzing the similarity results and clustering student homework with high similarity to generate a similarity report;
6) marking the similar content shared by similar homework to complete the duplicate check.
In step 1), the student course homework data is student homework obtained from the courses of an online learning platform; the homework template file is a file, submitted to the course by the teacher or a teaching assistant on the online learning platform, that defines the answer format of the homework.
In step 2), the homework template format is determined, the obtained homework is split into questions, and each question is classified as subjective or objective, as follows:
determining the homework template format: the system offers teachers several homework template formats and uses regular expressions to determine which format the obtained homework template belongs to;
splitting the obtained homework into questions: once the template format has been determined, the regular expression corresponding to that format is used to split each student's homework, and the split result is returned;
classifying questions as subjective or objective: each question is classified from the content of the student's answer by the following rules: a. if the answer is preceded by "answer: ", the question is a subjective question; b. a question whose answer content has a length of less than 20 is an objective question; c. any question that the rules above cannot classify is treated as a subjective question.
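The splitting and classification rules of step 2 can be sketched as follows. This is a minimal illustration, not the patent's implementation: the patent does not disclose its template formats or regular expressions, so the `QUESTION_START` pattern (a line beginning with a number followed by `.` or `、`), the English marker `answer:`, the character-based length measure, and the function names are all assumptions made for this example.

```python
import re

# Hypothetical template format: every question starts on a line such as "1." or "2、".
QUESTION_START = re.compile(r"^[ \t]*\d+[.、]", re.MULTILINE)

ANSWER_PREFIX = "answer:"   # assumed rendering of the marker named in rule a
OBJECTIVE_MAX_LEN = 20      # rule b: answers shorter than 20 are objective


def split_questions(homework_text):
    """Cut a homework text into one chunk per question."""
    starts = [m.start() for m in QUESTION_START.finditer(homework_text)]
    if not starts:
        stripped = homework_text.strip()
        return [stripped] if stripped else []
    ends = starts[1:] + [len(homework_text)]
    return [homework_text[s:e].strip() for s, e in zip(starts, ends)]


def classify_answer(answer):
    """Apply rules a, b, c in order; length is measured in characters here."""
    text = answer.strip()
    if text.lower().startswith(ANSWER_PREFIX):
        return "subjective"   # rule a
    if len(text) < OBJECTIVE_MAX_LEN:
        return "objective"    # rule b
    return "subjective"       # rule c: everything else


if __name__ == "__main__":
    sample = "1. What is TCP?\nanswer: a protocol\n2. Pick one\nB"
    for chunk in split_questions(sample):
        print(repr(chunk))
    print(classify_answer("B"))
```

Because rule a is checked first, a short answer that carries the marker is still treated as subjective, which matches the order in which the rules are listed.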
In step 3), text preprocessing is performed on the answers to the subjective questions after the homework has been split into questions, as follows:
a. Chinese/English determination: a part-of-speech analyzer segments the text into words and tags the part of speech of each word; the numbers of Chinese words and English words are counted, the proportions of Chinese and English in the answer content are calculated, and the language with the larger proportion is taken as the language of the homework;
b. a different preprocessing flow is applied for each language: the Chinese flow is sentence segmentation, word segmentation, and stop-word removal; the English flow is sentence segmentation, lemmatization, case normalization, and symbol removal.
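A rough sketch of the language decision and the English preprocessing flow is shown below. It deliberately swaps in a simpler technique: instead of the part-of-speech analyzer named in the text, it counts CJK characters and Latin-letter words with regular expressions, which is only an approximation of counting Chinese and English words. Lemmatization and Chinese word segmentation need an external NLP library and are omitted. All names are hypothetical.

```python
import re

CJK_CHAR = re.compile(r"[\u4e00-\u9fff]")        # common CJK ideographs
ENGLISH_WORD = re.compile(r"[A-Za-z]+")          # runs of Latin letters
ENGLISH_SENT_END = re.compile(r"(?<=[.!?])\s+")  # whitespace after ., ! or ?


def detect_language(answer):
    """Return "zh" or "en", whichever script dominates the answer.

    Assumption: CJK characters stand in for Chinese words; ties go to "en".
    """
    chinese = len(CJK_CHAR.findall(answer))
    english = len(ENGLISH_WORD.findall(answer))
    return "zh" if chinese > english else "en"


def preprocess_english(answer):
    """Sentence segmentation, case normalization, and symbol removal.

    Returns one list of lowercase words per sentence. Lemmatization is
    intentionally left out of this sketch.
    """
    sentences = [s for s in ENGLISH_SENT_END.split(answer.strip()) if s]
    return [ENGLISH_WORD.findall(s.lower()) for s in sentences]


if __name__ == "__main__":
    print(detect_language("深度学习用于作业查重 CNN"))
    print(preprocess_english("TCP is Reliable! It uses ACKs."))
```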
In step 4), the similarity between each homework and every other homework is calculated. Let the homework set be $A = (A_1, A_2, \ldots, A_t)$, where $t$ is the total number of homework in the set and $A_t$ is the $t$-th homework. The similarity between a homework $A_i$ and each other homework $A_j$, with $1 \le j \le t$ and $j \ne i$, is calculated as follows:
4.1) sentence screening: sentences with fewer than 3 words or more than 20 words, as well as repeated sentences, are discarded, and each remaining sentence is labeled with the homework text it comes from;
4.2) sentence pre-matching: a fast string-matching method screens out the sentence pairs in two homework that may be similar, in the following steps:
4.2.1) using each single word in a sentence as a key, sentences that contain the same key are placed in the same cluster; for sentence $S_1(w_{11}, w_{12}, \ldots, w_{1n})$ and sentence $S_2(w_{21}, w_{22}, \ldots, w_{2m})$, where $(w_{11}, w_{12}, \ldots, w_{1n})$ and $(w_{21}, w_{22}, \ldots, w_{2m})$ are the words of $S_1$ and $S_2$ respectively, if $w_{11} = w_{21}$ then $S_1$ and $S_2$ are put into the same array;
4.2.2) given a threshold, the sentence pairs in each cluster whose similarity exceeds the threshold are output; let $U(w_{u1}, w_{u2}, \ldots, w_{ua})$ be the set of words in sentence $S_1(w_{11}, w_{12}, \ldots, w_{1n})$ and sentence $S_2(w_{21}, w_{22}, \ldots, w_{2m})$, where $(w_{u1}, w_{u2}, \ldots, w_{ua})$ are the words in $U$ and $a$ is the number of words in $U$, $1 < a < m + n$; the similarity of $S_1$ and $S_2$ is the number of words the two sentences have in common as a proportion of the word set $U$, and sentence pairs whose similarity exceeds the threshold are output;
4.3) a convolutional neural network model is called to perform semantic similarity matching on the sentence pairs that exceed the threshold; consider a sentence pair $P_1(S_1, S_2)$ whose similarity exceeds the threshold, where $S_1$ and $S_2$ come from homework $A_i$ and homework $A_j$ respectively; the sentence pair is used as the input of the neural network and the semantic similarity of the two sentences is calculated, as follows:
4.3.1) using a trained word2vec model, word embedding is applied to sentence $S_1(w_{11}, w_{12}, \ldots, w_{1n})$ and sentence $S_2(w_{21}, w_{22}, \ldots, w_{2m})$ of the pair $P_1(S_1, S_2)$ to generate word embedding matrices $M_1(v_{11}, v_{12}, \ldots, v_{1n})$ and $M_2(v_{21}, v_{22}, \ldots, v_{2m})$, where $(v_{11}, v_{12}, \ldots, v_{1n})$ and $(v_{21}, v_{22}, \ldots, v_{2m})$ are the word vectors of $M_1$ and $M_2$ respectively; the word embedding matrices serve as the input of the convolutional neural network model;
4.3.2) the word embedding matrices are fed to the convolutional neural network model, which consists of two convolutional layers, one max pooling layer, and three fully connected layers and is trained on a semantic similarity dataset; the network yields the semantic similarity of each sentence pair, and the sentence pairs whose similarity exceeds a threshold are returned;
4.3.3) the similar sentence pairs are collected: all the sentences that come from homework $A_i$ are put in an array, the sum $m$ of the lengths of these sentences and the total text length $s$ of homework $A_i$ are calculated, and $\frac{m}{s}$ is taken as the similarity $\mathrm{Similarity}(A_i, A_j)$ of homework $A_i$ and homework $A_j$;
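Steps 4.1 and 4.2 can be illustrated with a small sketch. It assumes that sentences arrive as `(homework_id, words)` tuples, that "repeated sentences" means exact repeats within the same homework, that the word set U is the union of the two sentences' words (so the similarity is a Jaccard-style ratio), and that every word of a sentence is used as a key. None of these details are confirmed by the text, and the function names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

MIN_WORDS, MAX_WORDS = 3, 20   # length bounds from step 4.1


def screen(sentences):
    """Step 4.1: keep sentences of 3 to 20 words and drop exact repeats.

    `sentences` is a list of (homework_id, [word, ...]) tuples; the
    homework_id is the label saying which homework a sentence came from.
    """
    seen, kept = set(), []
    for homework_id, words in sentences:
        key = (homework_id, tuple(words))
        if MIN_WORDS <= len(words) <= MAX_WORDS and key not in seen:
            seen.add(key)
            kept.append((homework_id, words))
    return kept


def word_overlap(words_a, words_b):
    """Shared distinct words divided by the size of the combined word set U."""
    set_a, set_b = set(words_a), set(words_b)
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


def prematch(sentences, threshold):
    """Step 4.2: cluster sentences by shared word, then filter by overlap.

    Returns sorted index pairs (i, j) of sentences from different homework
    whose word overlap is strictly greater than `threshold`.
    """
    clusters = defaultdict(set)            # word -> indices of sentences containing it
    for index, (_, words) in enumerate(sentences):
        for word in set(words):
            clusters[word].add(index)

    pairs = set()
    for members in clusters.values():
        for i, j in combinations(sorted(members), 2):
            if sentences[i][0] == sentences[j][0]:
                continue                   # same homework, nothing to compare
            if word_overlap(sentences[i][1], sentences[j][1]) > threshold:
                pairs.add((i, j))
    return sorted(pairs)
```

Only sentences that share at least one word ever land in the same cluster, so pairs with no word in common are never compared. That is the source of the speed-up before the more expensive neural network stage in step 4.3.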
In step 5), for a homework pair $(A_i, A_j)$ and a homework pair $(A_j, A_k)$, if $\mathrm{Similarity}(A_i, A_j) \ge \text{threshold}$ and $\mathrm{Similarity}(A_j, A_k) \ge \text{threshold}$, then $A_i$, $A_j$, and $A_k$ are grouped into one cluster, where $\mathrm{Similarity}(A_j, A_k)$ is the similarity of homework $A_j$ and homework $A_k$.
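The grouping rule in step 5 is transitive: two homework that are each similar to a third end up in the same cluster. One common way to implement such a rule is a union-find (disjoint-set) structure, sketched below. The patent does not name a data structure, so this choice, the dictionary input format, and the function name are assumptions for illustration only.

```python
def cluster_homework(similarities, threshold):
    """Group homework linked by pairs whose similarity is >= threshold.

    `similarities` maps (id_a, id_b) to Similarity(id_a, id_b).
    Returns a sorted list of clusters, each a sorted list of ids;
    homework that is not similar to any other is left out.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving keeps trees shallow
            x = parent[x]
        return x

    for (a, b), score in similarities.items():
        root_a, root_b = find(a), find(b)
        if score >= threshold and root_a != root_b:
            parent[root_a] = root_b         # merge the two clusters

    groups = {}
    for item in list(parent):
        groups.setdefault(find(item), set()).add(item)
    return sorted(sorted(group) for group in groups.values() if len(group) > 1)


if __name__ == "__main__":
    scores = {("A1", "A2"): 0.8, ("A2", "A3"): 0.7, ("A3", "A4"): 0.1}
    print(cluster_homework(scores, threshold=0.6))
```

Since the similarity defined in step 4.3.3 is computed relative to one homework's text length, it need not be symmetric; this sketch merges a pair as soon as either direction supplied in the input reaches the threshold.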
In step 6), after the set of sentences in a homework that are similar to sentences in another homework has been found, those sentences are located in the homework and highlighted with a PDF highlighting tool.
Compared with the prior art, the invention has the following advantages and beneficial effects:
1. The method is based on deep learning (namely a convolutional neural network model). It solves the problem that similar content cannot be found when interference is present, and applying the method to an existing plagiarism detection system can effectively improve the efficiency with which teachers check student homework for duplication.
2. By screening sentences and matching them quickly, sentences that are too long or too short are filtered out and potentially similar sentence pairs are matched rapidly, which addresses the problem of excessively long plagiarism detection time.
Drawings
FIG. 1 is a logic flow diagram of the present invention.
Detailed Description
The present invention will be described in further detail with reference to examples and drawings, but the present invention is not limited thereto.
As shown in FIG. 1, the homework duplicate checking method based on deep learning provided by this embodiment comprises the following steps:
1) Student course homework data and a homework template file are acquired and the file contents are read; the student course homework data is student homework obtained from the courses of an online learning platform; the homework template file is a file, submitted to the course by the teacher or a teaching assistant on the online learning platform, that defines the answer format of the homework.
2) The homework template format is determined, the obtained homework is split into questions, and each question is classified as subjective or objective, as follows:
determining the homework template format: the system offers teachers several homework template formats and uses regular expressions to determine which format the obtained homework template belongs to;
splitting the obtained homework into questions: once the template format has been determined, the regular expression corresponding to that format is used to split each student's homework, and the split result is returned;
classifying questions as subjective or objective: each question is classified from the content of the student's answer by the following rules: a. if the answer is preceded by "answer: ", the question is a subjective question; b. a question whose answer content has a length of less than 20 is an objective question; c. any question that the rules above cannot classify is treated as a subjective question.
3) Text preprocessing is performed on the answers to the subjective questions after the homework has been split into questions, as follows:
a. Chinese/English determination: a part-of-speech analyzer segments the text into words and tags the part of speech of each word; the numbers of Chinese words and English words are counted, the proportions of Chinese and English in the answer content are calculated, and the language with the larger proportion is taken as the language of the homework;
b. a different preprocessing flow is applied for each language: the Chinese flow comprises sentence segmentation, word segmentation, stop-word removal, and the like; the English flow comprises sentence segmentation, lemmatization, case normalization, symbol removal, and the like.
4) The similarity between student homework is calculated: the similarity between each homework and every other homework is calculated. Let the homework set be $A = (A_1, A_2, \ldots, A_t)$, where $t$ is the total number of homework in the set and $A_t$ is the $t$-th homework. The similarity between a homework $A_i$ and each other homework $A_j$, with $1 \le j \le t$ and $j \ne i$, is calculated as follows:
4.1) sentence screening: sentences that are too short (fewer than 3 words) or too long (more than 20 words), as well as repeated sentences, are discarded, and each remaining sentence is labeled with the homework text it comes from;
4.2) sentence pre-matching: a fast string-matching method screens out the sentence pairs in two homework that may be similar, in the following steps:
4.2.1) using each single word in a sentence as a key, sentences that contain the same key are placed in the same cluster; for sentence $S_1(w_{11}, w_{12}, \ldots, w_{1n})$ and sentence $S_2(w_{21}, w_{22}, \ldots, w_{2m})$, where $(w_{11}, w_{12}, \ldots, w_{1n})$ and $(w_{21}, w_{22}, \ldots, w_{2m})$ are the words of $S_1$ and $S_2$ respectively, if $w_{11} = w_{21}$ then $S_1$ and $S_2$ are put into the same array;
4.2.2) given a threshold, the sentence pairs in each cluster whose similarity exceeds the threshold are output; let $U(w_{u1}, w_{u2}, \ldots, w_{ua})$ be the set of words in sentence $S_1(w_{11}, w_{12}, \ldots, w_{1n})$ and sentence $S_2(w_{21}, w_{22}, \ldots, w_{2m})$, where $(w_{u1}, w_{u2}, \ldots, w_{ua})$ are the words in $U$ and $a$ is the number of words in $U$, $1 < a < m + n$; the similarity of $S_1$ and $S_2$ is the number of words the two sentences have in common as a proportion of the word set $U$, and sentence pairs whose similarity exceeds the threshold are output;
4.3) a convolutional neural network model is called to perform semantic similarity matching on the sentence pairs that exceed the threshold; consider a sentence pair $P_1(S_1, S_2)$ whose similarity exceeds the threshold, where $S_1$ and $S_2$ come from homework $A_i$ and homework $A_j$ respectively; the sentence pair is used as the input of the neural network and the semantic similarity of the two sentences is calculated, as follows:
4.3.1) using a trained word2vec model, word embedding is applied to sentence $S_1(w_{11}, w_{12}, \ldots, w_{1n})$ and sentence $S_2(w_{21}, w_{22}, \ldots, w_{2m})$ of the pair $P_1(S_1, S_2)$ to generate word embedding matrices $M_1(v_{11}, v_{12}, \ldots, v_{1n})$ and $M_2(v_{21}, v_{22}, \ldots, v_{2m})$, where $(v_{11}, v_{12}, \ldots, v_{1n})$ and $(v_{21}, v_{22}, \ldots, v_{2m})$ are the word vectors of $M_1$ and $M_2$ respectively; the word embedding matrices serve as the input of the convolutional neural network model;
4.3.2) the word embedding matrices are fed to the convolutional neural network model, which consists of two convolutional layers, one max pooling layer, and three fully connected layers and is trained on a semantic similarity dataset; the network yields the semantic similarity of each sentence pair, and the sentence pairs whose similarity exceeds a threshold are returned;
4.3.3) the similar sentence pairs are collected: all the sentences that come from homework $A_i$ are put in an array, the sum $m$ of the lengths of these sentences and the total text length $s$ of homework $A_i$ are calculated, and $\frac{m}{s}$ is taken as the similarity $\mathrm{Similarity}(A_i, A_j)$ of homework $A_i$ and homework $A_j$.
5) The similarity results are analyzed and student homework with high similarity is clustered to generate a similarity report, as follows: for a homework pair $(A_i, A_j)$ and a homework pair $(A_j, A_k)$, if $\mathrm{Similarity}(A_i, A_j) \ge \text{threshold}$ and $\mathrm{Similarity}(A_j, A_k) \ge \text{threshold}$, then $A_i$, $A_j$, and $A_k$ are grouped into one cluster, where $\mathrm{Similarity}(A_j, A_k)$ is the similarity of homework $A_j$ and homework $A_k$.
6) The similar content shared by similar homework is marked to complete the duplicate check, as follows: after the set of sentences in a homework that are similar to sentences in another homework has been found, those sentences are located in the homework and highlighted with a PDF highlighting tool, so that the teacher can conveniently review the similar content.
From the similarity results among the student homework, the teacher can see which homework is suspected of plagiarism, and from the marked similar text the teacher can see which passages are similar.
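The homework-level score of step 4.3.3, the total length m of the matched sentences from a homework divided by the total text length s of that homework, can be sketched in a few lines. Measuring length in characters and counting each matched sentence once are assumptions of this sketch, since the text does not state the unit of length; the function name is hypothetical.

```python
def homework_similarity(matched_sentences, homework_text):
    """Return m / s for one homework.

    m is the summed length of the distinct matched sentences taken from the
    homework, and s is the length of the homework's full text, both counted
    in characters (an assumption of this sketch).
    """
    total_length = len(homework_text)
    if total_length == 0:
        return 0.0
    matched_length = sum(len(sentence) for sentence in set(matched_sentences))
    return matched_length / total_length


if __name__ == "__main__":
    text = "TCP is reliable. UDP is fast. Both run over IP."
    print(homework_similarity(["TCP is reliable."], text))
```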
The above embodiments are preferred embodiments of the present invention, but the present invention is not limited to the above embodiments, and any other changes, modifications, substitutions, combinations, and simplifications which do not depart from the spirit and principle of the present invention should be construed as equivalents thereof, and all such changes, modifications, substitutions, combinations, and simplifications are intended to be included in the scope of the present invention.
Claims (6)
1. A homework duplicate checking method based on deep learning, characterized by comprising the following steps:
1) acquiring student course homework data and a homework template file;
2) determining the format of the homework template, splitting the obtained homework into questions, and determining whether each question in the homework is a subjective question or an objective question;
3) performing text preprocessing on the answers to the subjective questions after the homework has been split into questions;
4) calculating the similarity between student homework;
5) analyzing the similarity results and clustering student homework with high similarity to generate a similarity report;
6) marking the similar content shared by similar homework to complete the duplicate check.
2. The homework duplicate checking method based on deep learning according to claim 1, wherein: in step 1), the student course homework data is student homework obtained from the courses of an online learning platform; the homework template file is a file, submitted to the course by the teacher or a teaching assistant on the online learning platform, that defines the answer format of the homework.
3. The homework duplicate checking method based on deep learning according to claim 1, wherein: in step 2), the homework template format is determined, the obtained homework is split into questions, and each question is classified as subjective or objective, as follows:
determining the homework template format: the system offers teachers several homework template formats and uses regular expressions to determine which format the obtained homework template belongs to;
splitting the obtained homework into questions: once the template format has been determined, the regular expression corresponding to that format is used to split each student's homework, and the split result is returned;
classifying questions as subjective or objective: each question is classified from the content of the student's answer by the following rules: a. if the answer is preceded by "answer: ", the question is a subjective question; b. a question whose answer content has a length of less than 20 is an objective question; c. any question that the rules above cannot classify is treated as a subjective question.
4. The homework duplicate checking method based on deep learning according to claim 1, wherein: in step 3), text preprocessing is performed on the answers to the subjective questions after the homework has been split into questions, as follows:
a. Chinese/English determination: a part-of-speech analyzer segments the text into words and tags the part of speech of each word; the numbers of Chinese words and English words are counted, the proportions of Chinese and English in the answer content are calculated, and the language with the larger proportion is taken as the language of the homework;
b. a different preprocessing flow is applied for each language: the Chinese flow is sentence segmentation, word segmentation, and stop-word removal; the English flow is sentence segmentation, lemmatization, case normalization, and symbol removal.
5. The homework duplicate checking method based on deep learning according to claim 1, wherein: in step 4), the similarity between each homework and every other homework is calculated; the homework set is $A = (A_1, A_2, \ldots, A_t)$, where $t$ is the total number of homework in the set and $A_t$ is the $t$-th homework; the similarity between a homework $A_i$ and each other homework $A_j$, with $1 \le j \le t$ and $j \ne i$, is calculated as follows:
4.1) sentence screening: sentences with fewer than 3 words or more than 20 words, as well as repeated sentences, are discarded, and each remaining sentence is labeled with the homework text it comes from;
4.2) sentence pre-matching: a fast string-matching method screens out the sentence pairs in two homework that may be similar, in the following steps:
4.2.1) using each single word in a sentence as a key, sentences that contain the same key are placed in the same cluster; for sentence $S_1(w_{11}, w_{12}, \ldots, w_{1n})$ and sentence $S_2(w_{21}, w_{22}, \ldots, w_{2m})$, where $(w_{11}, w_{12}, \ldots, w_{1n})$ and $(w_{21}, w_{22}, \ldots, w_{2m})$ are the words of $S_1$ and $S_2$ respectively, if $w_{11} = w_{21}$ then $S_1$ and $S_2$ are put into the same array;
4.2.2) given a threshold, the sentence pairs in each cluster whose similarity exceeds the threshold are output; let $U(w_{u1}, w_{u2}, \ldots, w_{ua})$ be the set of words in sentence $S_1(w_{11}, w_{12}, \ldots, w_{1n})$ and sentence $S_2(w_{21}, w_{22}, \ldots, w_{2m})$, where $(w_{u1}, w_{u2}, \ldots, w_{ua})$ are the words in $U$ and $a$ is the number of words in $U$, $1 < a < m + n$; the similarity of $S_1$ and $S_2$ is the number of words the two sentences have in common as a proportion of the word set $U$, and sentence pairs whose similarity exceeds the threshold are output;
4.3) a convolutional neural network model is called to perform semantic similarity matching on the sentence pairs that exceed the threshold; consider a sentence pair $P_1(S_1, S_2)$ whose similarity exceeds the threshold, where $S_1$ and $S_2$ come from homework $A_i$ and homework $A_j$ respectively; the sentence pair is used as the input of the neural network and the semantic similarity of the two sentences is calculated, as follows:
4.3.1) using a trained word2vec model, word embedding is applied to sentence $S_1(w_{11}, w_{12}, \ldots, w_{1n})$ and sentence $S_2(w_{21}, w_{22}, \ldots, w_{2m})$ of the pair $P_1(S_1, S_2)$ to generate word embedding matrices $M_1(v_{11}, v_{12}, \ldots, v_{1n})$ and $M_2(v_{21}, v_{22}, \ldots, v_{2m})$, where $(v_{11}, v_{12}, \ldots, v_{1n})$ and $(v_{21}, v_{22}, \ldots, v_{2m})$ are the word vectors of $M_1$ and $M_2$ respectively; the word embedding matrices serve as the input of the convolutional neural network model;
4.3.2) the word embedding matrices are fed to the convolutional neural network model, which consists of two convolutional layers, one max pooling layer, and three fully connected layers and is trained on a semantic similarity dataset; the network yields the semantic similarity of each sentence pair, and the sentence pairs whose similarity exceeds a threshold are returned;
4.3.3) among the similar sentence pairs, collect all sentences that come from job $A_i$ into an array, calculate the sum $m$ of the lengths of all these sentences and the total text length $s$ of job $A_i$, and take $m/s$ as the similarity $\mathrm{Similarity}(A_i, A_j)$ between job $A_i$ and job $A_j$;
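The ratio in step 4.3.3) can be illustrated with a short Python sketch. It assumes that length is measured in characters and that a sentence matched several times is counted only once; the claim does not state either point, and the function name `job_similarity` is hypothetical.

```python
def job_similarity(matched_sentences, job_text):
    """Return m / s: matched sentence length over total text length of the job.

    matched_sentences: sentences of job A_i that were judged similar to
    sentences of job A_j (duplicates are counted once, by assumption).
    job_text: the full text of job A_i.
    """
    total_length = len(job_text)  # s, measured in characters (assumption)
    if total_length == 0:
        return 0.0  # an empty job cannot overlap with anything
    matched_length = sum(len(s) for s in set(matched_sentences))  # m
    return matched_length / total_length


# Hypothetical job text made of three sentences, two of which were matched.
sentences = ["Sorting is stable.", "Hashing is fast.", "Trees are balanced."]
print(job_similarity(sentences[:2], "".join(sentences)))
```

Because the denominator is the length of job $A_i$ only, the value computed this way is not necessarily symmetric: $\mathrm{Similarity}(A_i, A_j)$ and $\mathrm{Similarity}(A_j, A_i)$ can differ when the two jobs have different lengths.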
In step 5), for the job pair $(A_i, A_j)$ and the job pair $(A_j, A_k)$, if $\mathrm{Similarity}(A_i, A_j) \geq$ threshold and $\mathrm{Similarity}(A_j, A_k) \geq$ threshold, then $A_i$, $A_j$ and $A_k$ are grouped into one cluster; wherein $\mathrm{Similarity}(A_j, A_k)$ is the similarity between job $A_j$ and job $A_k$.
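The grouping rule of step 5) is transitive: jobs linked through a chain of pairs at or above the threshold end up in the same group. One way to sketch this is with a union-find structure, which is a technique chosen here for illustration and is not named in the claim. The function `group_jobs` and the sample similarities are hypothetical, and keeping unlinked jobs as singleton groups is a design choice of this sketch.

```python
def group_jobs(pair_similarity, threshold):
    """Group jobs so that any chain of pairs with similarity >= threshold shares a group.

    pair_similarity: {(job_a, job_b): similarity}
    Returns a sorted list of sorted job-id lists; unlinked jobs stay alone.
    """
    parent = {}

    def find(job):
        parent.setdefault(job, job)  # register unseen jobs as their own root
        while parent[job] != job:
            parent[job] = parent[parent[job]]  # path halving keeps trees shallow
            job = parent[job]
        return job

    for (job_a, job_b), similarity in pair_similarity.items():
        root_a, root_b = find(job_a), find(job_b)
        if similarity >= threshold and root_a != root_b:
            parent[root_b] = root_a  # merge the two groups

    groups = {}
    for job in list(parent):
        groups.setdefault(find(job), []).append(job)
    return sorted(sorted(members) for members in groups.values())


# Hypothetical pairwise similarities between five jobs.
example_pairs = {
    ("A1", "A2"): 0.8,
    ("A2", "A3"): 0.7,
    ("A3", "A4"): 0.2,
    ("A4", "A5"): 0.1,
}
print(group_jobs(example_pairs, threshold=0.6))
```

Here A1 and A3 are never compared directly, yet they land in the same group because both are linked to A2, which mirrors the $A_i$, $A_j$, $A_k$ rule above.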
6. The job duplicate checking method based on deep learning according to claim 1, wherein: in step 6), after the set of sentences in a job that are similar to sentences in another job is found, the sentences are located in the job and highlighted using a PDF highlighting tool.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110279211.3A CN113011154A (en) | 2021-03-16 | 2021-03-16 | Job duplicate checking method based on deep learning |
Publications (1)
Publication Number | Publication Date |
---|---|
CN113011154A (en) | 2021-06-22 |
Family
ID=76407828
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110279211.3A Pending CN113011154A (en) | 2021-03-16 | 2021-03-16 | Job duplicate checking method based on deep learning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113011154A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114463567A (en) * | 2022-04-12 | 2022-05-10 | 北京吉道尔科技有限公司 | Block chain-based intelligent education operation big data plagiarism prevention method and system |
CN117251445A (en) * | 2023-10-11 | 2023-12-19 | 杭州今元标矩科技有限公司 | Deep learning-based CRM data screening method, system and medium |
Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102411564A (en) * | 2011-08-17 | 2012-04-11 | 北方工业大学 | Electronic homework copying detection method |
- 2021-03-16: CN application CN202110279211.3A filed (publication CN113011154A, status: Pending)
Non-Patent Citations (1)
Title |
---|
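Hu Buhuan (胡布焕) et al.: "A Chinese document plagiarism detection method based on semantic similarity" (一种基于语义相似的中文文档抄袭检测方法), Journal of Shenzhen University Science and Engineering (深圳大学学报理工版), vol. 37, no. 1, pages 107 - 111 * |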
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108363743B (en) | Intelligent problem generation method and device and computer readable storage medium | |
Janda et al. | Syntactic, semantic and sentiment analysis: The joint effect on automated essay evaluation | |
CN110609983B (en) | Structured decomposition method for policy file | |
CN113011154A (en) | Job duplicate checking method based on deep learning | |
CN111241397A (en) | Content recommendation method and device and computing equipment | |
CN113282701A (en) | Composition material generation method and device, electronic equipment and readable storage medium | |
CN112287197A (en) | Method for detecting sarcasm of case-related microblog comments described by dynamic memory cases | |
Flor et al. | Text mining and automated scoring | |
CN112966518A (en) | High-quality answer identification method for large-scale online learning platform | |
Mandge et al. | Revolutionize cosine answer matching technique for question answering system | |
Dündar et al. | A Hybrid Approach to Question-answering for a Banking Chatbot on Turkish: Extending Keywords with Embedding Vectors. | |
CN105955954A (en) | New enterprise name finding method based on bidirectional recurrent neural network | |
Riza et al. | Automatic generation of short-answer questions in reading comprehension using NLP and KNN | |
CN115017271A (en) | Method and system for intelligently generating RPA flow component block | |
JP6586055B2 (en) | Deep case analysis device, deep case learning device, deep case estimation device, method, and program | |
Shah et al. | Automatic evaluation of free text answers: A review | |
CN112347786A (en) | Artificial intelligence scoring training method and device | |
ALMUAYQIL et al. | Towards an Ontology-Based Fully Integrated System for Student E-Assessment | |
CN106776533A (en) | Method and system for analyzing a piece of text | |
CN111930947A (en) | System and method for identifying authors of modern Chinese written works | |
Ghosh | Predicting question deletion and assessing question quality in social Q&A sites using weakly supervised deep neural networks | |
Amur et al. | State-of-the-Art: Assessing Semantic Similarity in Automated Short-Answer Grading Systems | |
US20230281388A1 (en) | A Method and System for Analyzing a Piece of Text Comprising Chinese Characters | |
CN116167344B (en) | Automatic text generation method for deep learning creative science and technology | |
EP4163815A1 (en) | Textual content evaluation using machine learned models |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||