CN113011154A - Job duplicate checking method based on deep learning - Google Patents

Publication number
CN113011154A
Authority
CN
China
Prior art keywords
sentence
similarity
homework
sentences
word
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110279211.3A
Other languages
Chinese (zh)
Inventor
张凌
胡布焕
张晶
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
South China University of Technology SCUT
CERNET Corp
Original Assignee
South China University of Technology SCUT
CERNET Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by South China University of Technology SCUT, CERNET Corp filed Critical South China University of Technology SCUT
Priority to CN202110279211.3A priority Critical patent/CN113011154A/en
Publication of CN113011154A publication Critical patent/CN113011154A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/10Text processing
    • G06F40/194Calculation of difference between files
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • G06F16/3332Query translation
    • G06F16/3334Selection or weighting of terms from queries, including natural language queries
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • G06F16/3332Query translation
    • G06F16/3335Syntactic pre-processing, e.g. stopword elimination, stemming
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • G06F16/334Query execution
    • G06F16/3344Query execution using natural language analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/338Presentation of query results
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • G06F16/353Clustering; Classification into predefined classes
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/205Parsing
    • G06F40/216Parsing using statistical methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/263Language identification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/268Morphological analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/284Lexical analysis, e.g. tokenisation or collocates
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00Systems or methods specially adapted for specific business sectors, e.g. utilities or tourism
    • G06Q50/10Services
    • G06Q50/20Education
    • G06Q50/205Education administration or guidance

Abstract

The invention discloses a homework duplicate checking method based on deep learning, which comprises the following steps: obtaining student course homework data and homework template files; determining the homework template format; splitting the obtained homework into questions; determining whether each question is a subjective or an objective question; performing text preprocessing on the answers to the subjective questions; calculating the similarity between student homework using a deep learning technique (namely a convolutional neural network model); analyzing the similarity results; clustering homework with high similarity; and generating a similarity report. To make it easier for the teacher to review similar content, the invention marks the similar content between similar homework. The method can find semantically similar text in homework and addresses the poor resistance of many plagiarism detection methods to interference such as paraphrasing.

Description

Job duplicate checking method based on deep learning
Technical Field
The invention relates to the technical field of student homework duplicate checking, in particular to a homework duplicate checking method based on deep learning.
Background
In online-assisted teaching at colleges and universities, electronic documents have become one of the main forms in which students submit homework. As attention to academic integrity grows, how to help teachers find copied content in the homework submitted by students has become a research hotspot.
At present there are many plagiarism detection systems, such as the China National Knowledge Infrastructure (CNKI) academic misconduct literature detection system in China, and Turnitin, PlagScan, Dupli Checker and other systems abroad. These systems can help teachers find the plagiarized parts of the homework that students submit, but because they use the internet as the source for comparison, they have difficulty discovering plagiarism among the students' own local homework. Many plagiarism detection methods have been studied and put into use, the most popular being lexical methods. Lexical plagiarism detection mainly considers the lexical features of the text; an example is fingerprint feature extraction, which was widely used early on. Fingerprint-based methods represent each document as a fingerprint sequence and calculate the similarity between documents from those sequences. Lexical methods work for simple copy and paste, but they perform poorly when the plagiarist evades detection by, for example, paraphrasing the text. Researchers also use grammar-based plagiarism detection methods (e.g., part-of-speech tagging), semantic-based methods (e.g., explicit semantic analysis, latent semantic analysis), and machine-learning-based methods (e.g., support vector machines, linear regression models), among others.
With the wide application of deep learning in computing, many researchers have used deep learning for plagiarism detection and achieved good results. A key part of plagiarism detection is text similarity calculation, where deep learning can detect paraphrasing, synonym replacement and the like. Using deep learning in a plagiarism detection task can therefore uncover not only literal plagiarism but also semantic plagiarism.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provide a homework duplicate checking method based on deep learning, which can accurately find semantically similar text in homework and addresses the poor resistance of many plagiarism detection methods to interference.
In order to achieve the purpose, the technical scheme provided by the invention is as follows: a homework duplicate checking method based on deep learning comprises the following steps:
1) acquiring student course homework data and homework template files;
2) determining the format of the homework template, splitting the obtained homework into questions, and determining whether each question in the homework is a subjective question or an objective question;
3) performing text preprocessing on the answers to the subjective questions in the homework after question splitting;
4) calculating the similarity between student homework;
5) analyzing the similarity calculation results, and clustering student homework with high similarity to generate a similarity report;
6) marking the similar content between similar homework to complete the homework duplicate checking.
In step 1), the student course homework data refers to student homework obtained from courses on an online learning platform; the homework template file is a file that specifies the homework answer format and is submitted to the course by a teacher or teaching assistant on the online learning platform.
In step 2), the format of the homework template is determined, the obtained homework is split into questions, and each question is determined to be subjective or objective, specifically as follows:
determining the format of the homework template: the system provides teachers with several homework template formats, and uses regular expressions to determine which template format the obtained homework template belongs to;
splitting the obtained homework into questions: once the template format has been determined, the regular expression corresponding to that format is used to split the student's homework, and the splitting result is returned;
determining whether each question is subjective or objective: the determination is made from the content of the student's answer according to the following rules: a. if the answer is preceded by "answer: ", the question is a subjective question; b. a question whose answer content has a length of less than 20 is an objective question; c. all questions that cannot be decided by the above conditions are subjective questions.
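The question splitting and rules a to c of step 2) can be sketched in Python as follows. This is only an illustration under stated assumptions: the template pattern (each question starts with a number followed by "." or ")"), the names QUESTION_START, ANSWER_MARK, cut_questions and classify, and the use of character count as the answer length are all hypothetical and are not specified by the patent.

```python
import re

# Hypothetical template format: a question starts on its own line with "1." or "1)".
QUESTION_START = re.compile(r"^\s*\d+[.)]\s*", re.MULTILINE)
# Rule a marker: the answer begins with "answer:" (ASCII or full-width colon).
ANSWER_MARK = re.compile(r"^\s*answer\s*[:：]", re.IGNORECASE)


def cut_questions(homework_text):
    """Split homework text into one answer string per question."""
    parts = QUESTION_START.split(homework_text)
    return [p.strip() for p in parts if p.strip()]


def classify(answer):
    """Apply rules a, b and c in order and return the question type."""
    if ANSWER_MARK.match(answer):   # rule a: explicit answer marker
        return "subjective"
    if len(answer) < 20:            # rule b: short answers (length assumed in characters)
        return "objective"
    return "subjective"             # rule c: everything else


homework = (
    "1. B\n"
    "2. Answer: TCP opens a connection with a three-way handshake.\n"
    "3) C"
)
for answer in cut_questions(homework):
    print(classify(answer), "|", answer)
```

Because rule a is checked first, a short answer that carries the marker is still treated as subjective, which matches the order in which the rules are listed.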
In step 3), text preprocessing is performed on the answers to the subjective questions after question splitting, specifically as follows:
a. Chinese/English determination: a part-of-speech analyzer is used to segment the text into words and determine the part of speech of each word; the numbers of Chinese words and English words are counted, the proportions of Chinese and English in the answer content are calculated, and the language with the larger proportion is taken as the language of the homework;
b. different preprocessing flows are applied according to the language: the Chinese flow is sentence segmentation, word segmentation and stop-word removal; the English flow is sentence segmentation, lemmatization, case normalization and symbol removal.
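A simplified Python sketch of step 3) is given below. It is not the patent's implementation: instead of a part-of-speech analyzer it counts CJK characters against Latin-letter words with regular expressions, it covers only the English flow, and it omits lemmatization, which would need an external NLP library. The names detect_language and preprocess_english are hypothetical.

```python
import re

CJK_CHAR = re.compile(r"[\u4e00-\u9fff]")   # common Chinese characters
LATIN_WORD = re.compile(r"[A-Za-z]+")       # runs of Latin letters


def detect_language(text):
    """Return "zh" or "en", whichever has the larger share of the text.

    Assumption: Chinese is counted per character because no word
    segmenter is used in this sketch.
    """
    zh = len(CJK_CHAR.findall(text))
    en = len(LATIN_WORD.findall(text))
    return "zh" if zh > en else "en"


def preprocess_english(text):
    """Sentence segmentation, case normalization and symbol removal."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    result = []
    for sentence in sentences:
        cleaned = re.sub(r"[^\w\s]", " ", sentence.lower())
        words = cleaned.split()
        if words:
            result.append(words)
    return result


print(detect_language("TCP 是一种面向连接的协议"))
print(preprocess_english("TCP is reliable. UDP is not!"))
```

The Chinese flow would follow the same shape, with a word segmenter and a stop-word list in place of the whitespace split.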
In step 4), the similarity between each homework and the other homework is calculated, wherein the homework set is A = (A_1, A_2, ..., A_t), t is the total number of homework in the set, and A_t is the t-th homework; the similarity between a given homework A_i and each other homework A_j is calculated, where 1 ≤ j ≤ t and j ≠ i; the specific process is as follows:
4.1) sentence screening: discarding sentences with fewer than 3 words or more than 20 words, as well as repeated sentences, and marking each remaining sentence to indicate which homework text it comes from;
4.2) sentence pre-matching: a fast string-matching method is used to screen out the sentence pairs from two homework that may be similar; the screening steps are as follows:
4.2.1) taking a single word in a sentence as the key, sentences that contain the same key are matched into one cluster; for sentence S_1(w_{11}, w_{12}, ..., w_{1n}) and sentence S_2(w_{21}, w_{22}, ..., w_{2m}), where (w_{11}, w_{12}, ..., w_{1n}) and (w_{21}, w_{22}, ..., w_{2m}) are the words of S_1 and S_2 respectively, if w_{11} = w_{21}, then S_1 and S_2 are put into the same array;
4.2.2) given a threshold, the sentence pairs in each cluster whose similarity is greater than the threshold are output; the set U(w_{u1}, w_{u2}, ..., w_{ua}) is the set of words of sentence S_1(w_{11}, w_{12}, ..., w_{1n}) and sentence S_2(w_{21}, w_{22}, ..., w_{2m}), where (w_{u1}, w_{u2}, ..., w_{ua}) are the words in U and a is the number of words in U, 1 < a < m + n; the similarity of S_1 and S_2 is the ratio of the number of words the two sentences have in common to the size of the word set U, and sentence pairs with similarity greater than the threshold are output;
4.3) calling a convolutional neural network model to perform semantic similarity matching on the sentence pairs above the threshold; for a sentence pair P_1(S_1, S_2) whose similarity is greater than the threshold, where S_1 and S_2 come from homework A_i and homework A_j respectively, the sentence pair is used as the input of the neural network and the semantic similarity of the two sentences is calculated; the specific flow is as follows:
4.3.1) using a trained word2vec model, word embedding is performed on sentence S_1(w_{11}, w_{12}, ..., w_{1n}) and sentence S_2(w_{21}, w_{22}, ..., w_{2m}) of the sentence pair P_1(S_1, S_2) to generate the word embedding matrices M_1(v_{11}, v_{12}, ..., v_{1n}) and M_2(v_{21}, v_{22}, ..., v_{2m}), where (v_{11}, v_{12}, ..., v_{1n}) and (v_{21}, v_{22}, ..., v_{2m}) are the word vectors in M_1 and M_2 respectively; the word embedding matrices serve as the input of the convolutional neural network model;
4.3.2) the word embedding matrices are used as the input of the convolutional neural network model, which consists of two convolutional layers, one max pooling layer and three fully connected layers and is trained on a semantic similarity dataset; the neural network is used to obtain the semantic similarity of each sentence pair, and the sentence pairs whose similarity is greater than a threshold are returned;
4.3.3) collecting the similar sentence pairs: all of their sentences that come from homework A_i are put into an array, the total length m of these sentences and the total text length s of homework A_i are calculated, and m/s is taken as the similarity Similarity(A_i, A_j) between homework A_i and homework A_j;
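Steps 4.1), 4.2) and 4.3.3) can be sketched in Python as follows; the convolutional neural network of step 4.3) is left out. Several choices here are assumptions rather than details from the patent: any shared word is used as a key (the text's example compares the first words w_11 and w_21), "repeated sentences" is read as exact repeats within the same homework, lengths are measured in words, and the 0.5 threshold is arbitrary. All function names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def screen(sentences, min_words=3, max_words=20):
    """Step 4.1: keep (homework_id, words) items of suitable length, dropping repeats."""
    seen, kept = set(), []
    for homework_id, words in sentences:
        key = (homework_id, tuple(words))
        if min_words <= len(words) <= max_words and key not in seen:
            seen.add(key)
            kept.append((homework_id, words))
    return kept


def prematch(kept, threshold=0.5):
    """Step 4.2: cluster sentences by shared word, then filter pairs by word overlap."""
    index = defaultdict(set)                      # word -> indices of sentences containing it
    for i, (_, words) in enumerate(kept):
        for word in set(words):
            index[word].add(i)
    candidates = set()
    for ids in index.values():
        for i, j in combinations(sorted(ids), 2):
            if kept[i][0] != kept[j][0]:          # only compare different homework
                candidates.add((i, j))
    pairs = []
    for i, j in sorted(candidates):
        a, b = set(kept[i][1]), set(kept[j][1])
        if len(a & b) / len(a | b) > threshold:   # shared words / size of word set U
            pairs.append((i, j))
    return pairs


def homework_similarity(similar_sentences, total_length):
    """Step 4.3.3: m / s, with m the total length of the similar sentences."""
    m = sum(len(words) for words in similar_sentences)
    return m / total_length if total_length else 0.0


sentences = [
    ("A1", ["tcp", "uses", "a", "three", "way", "handshake"]),
    ("A2", ["tcp", "uses", "the", "three", "way", "handshake"]),
    ("A2", ["ok"]),
    ("A1", ["udp", "is", "connectionless", "and", "fast"]),
]
kept = screen(sentences)
pairs = prematch(kept)
print(pairs)
print(homework_similarity([kept[i][1] for i, _ in pairs], total_length=11))
```

In the full method, the pairs returned by prematch would be passed to the neural network, and only the pairs it confirms would count toward m.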
In step 5), for a homework pair (A_i, A_j) and a homework pair (A_j, A_k), if Similarity(A_i, A_j) ≥ threshold and Similarity(A_j, A_k) ≥ threshold, then A_i, A_j and A_k are grouped into one cluster; here Similarity(A_j, A_k) is the similarity between homework A_j and homework A_k.
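The grouping rule of step 5) is transitive, so one way to realize it is to treat each above-threshold pair as an edge and collect connected components. The sketch below uses a small union-find structure for this; the patent does not name a data structure, and the function name cluster_homework, the input format and the example scores are hypothetical.

```python
def cluster_homework(similarity, threshold):
    """Group homework linked by pairs whose similarity is >= threshold.

    similarity: dict mapping (homework_i, homework_j) -> score.
    Returns a sorted list of sorted groups with at least two members.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving keeps lookups short
            x = parent[x]
        return x

    for (a, b), score in similarity.items():
        if score >= threshold:
            parent[find(a)] = find(b)       # merge the two groups

    groups = {}
    for homework in list(parent):
        groups.setdefault(find(homework), set()).add(homework)
    return sorted(sorted(g) for g in groups.values() if len(g) > 1)


scores = {
    ("A1", "A2"): 0.8,
    ("A2", "A3"): 0.7,
    ("A4", "A5"): 0.2,
    ("A5", "A6"): 0.9,
}
print(cluster_homework(scores, threshold=0.6))
```

Homework that never reaches the threshold with any other homework does not appear in the result, so the output lists only the groups a similarity report would need to show.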
In step 6), after the set of sentences in a homework that are similar to sentences in another homework has been found, those sentences are located in the homework and highlighted using a PDF highlighting tool.
Compared with the prior art, the invention has the following advantages and beneficial effects:
1. The method is based on deep learning (namely a convolutional neural network model) and solves the problem that similar content cannot be found when duplicate checking is subject to interference; applying the method to an existing plagiarism detection system can effectively improve the efficiency with which teachers check student homework for duplication.
2. By using sentence screening and fast matching, sentences that are too long or too short are filtered out and possibly similar sentence pairs are matched quickly, which solves the problem of excessively long plagiarism detection time.
Drawings
FIG. 1 is a logic flow diagram of the present invention.
Detailed Description
The present invention will be described in further detail with reference to examples and drawings, but the present invention is not limited thereto.
As shown in FIG. 1, the homework duplicate checking method based on deep learning provided by this embodiment includes the following steps:
1) acquiring student course homework data and homework template files, and reading the file contents; the student course homework data refers to student homework obtained from courses on the online learning platform; the homework template file is a file that specifies the homework answer format and is submitted to the course by a teacher or teaching assistant on the online learning platform.
2) Determining the format of the homework template, splitting the obtained homework into questions, and determining whether each question in the homework is a subjective question or an objective question, specifically as follows:
determining the format of the homework template: the system provides teachers with several homework template formats, and uses regular expressions to determine which template format the obtained homework template belongs to;
splitting the obtained homework into questions: once the template format has been determined, the regular expression corresponding to that format is used to split the student's homework, and the splitting result is returned;
determining whether each question is subjective or objective: the determination is made from the content of the student's answer according to the following rules: a. if the answer is preceded by "answer: ", the question is a subjective question; b. a question whose answer content has a length of less than 20 is an objective question; c. all questions that cannot be decided by the above conditions are subjective questions.
3) Performing text preprocessing on the answers to the subjective questions in the homework after question splitting, specifically as follows:
a. Chinese/English determination: a part-of-speech analyzer is used to segment the text into words and determine the part of speech of each word; the numbers of Chinese words and English words are counted, the proportions of Chinese and English in the answer content are calculated, and the language with the larger proportion is taken as the language of the homework;
b. different preprocessing flows are applied according to the language: the Chinese flow includes sentence segmentation, word segmentation, stop-word removal and the like; the English flow includes sentence segmentation, lemmatization, case normalization, symbol removal and the like.
4) Calculating the similarity between student homework: the similarity between each homework and the other homework is calculated, wherein the homework set is A = (A_1, A_2, ..., A_t), t is the total number of homework in the set, and A_t is the t-th homework; the similarity between a given homework A_i and each other homework A_j is calculated, where 1 ≤ j ≤ t and j ≠ i; the specific process is as follows:
4.1) sentence screening: discarding sentences that are too short (fewer than 3 words) or too long (more than 20 words), as well as repeated sentences, and marking each remaining sentence to indicate which homework text it comes from;
4.2) sentence pre-matching: a fast string-matching method is used to screen out the sentence pairs from two homework that may be similar; the screening steps are as follows:
4.2.1) taking a single word in a sentence as the key, sentences that contain the same key are matched into one cluster; for sentence S_1(w_{11}, w_{12}, ..., w_{1n}) and sentence S_2(w_{21}, w_{22}, ..., w_{2m}), where (w_{11}, w_{12}, ..., w_{1n}) and (w_{21}, w_{22}, ..., w_{2m}) are the words of S_1 and S_2 respectively, if w_{11} = w_{21}, then S_1 and S_2 are put into the same array;
4.2.2) given a threshold, the sentence pairs in each cluster whose similarity is greater than the threshold are output; the set U(w_{u1}, w_{u2}, ..., w_{ua}) is the set of words of sentence S_1(w_{11}, w_{12}, ..., w_{1n}) and sentence S_2(w_{21}, w_{22}, ..., w_{2m}), where (w_{u1}, w_{u2}, ..., w_{ua}) are the words in U and a is the number of words in U, 1 < a < m + n; the similarity of S_1 and S_2 is the ratio of the number of words the two sentences have in common to the size of the word set U, and sentence pairs with similarity greater than the threshold are output;
4.3) calling a convolutional neural network model to perform semantic similarity matching on the sentence pairs above the threshold; for a sentence pair P_1(S_1, S_2) whose similarity is greater than the threshold, where S_1 and S_2 come from homework A_i and homework A_j respectively, the sentence pair is used as the input of the neural network and the semantic similarity of the two sentences is calculated; the specific flow is as follows:
4.3.1) using a trained word2vec model, word embedding is performed on sentence S_1(w_{11}, w_{12}, ..., w_{1n}) and sentence S_2(w_{21}, w_{22}, ..., w_{2m}) of the sentence pair P_1(S_1, S_2) to generate the word embedding matrices M_1(v_{11}, v_{12}, ..., v_{1n}) and M_2(v_{21}, v_{22}, ..., v_{2m}), where (v_{11}, v_{12}, ..., v_{1n}) and (v_{21}, v_{22}, ..., v_{2m}) are the word vectors in M_1 and M_2 respectively; the word embedding matrices serve as the input of the convolutional neural network model;
4.3.2) the word embedding matrices are used as the input of the convolutional neural network model, which consists of two convolutional layers, one max pooling layer and three fully connected layers and is trained on a semantic similarity dataset; the neural network is used to obtain the semantic similarity of each sentence pair, and the sentence pairs whose similarity is greater than a threshold are returned;
4.3.3) collecting the similar sentence pairs: all of their sentences that come from homework A_i are put into an array, the total length m of these sentences and the total text length s of homework A_i are calculated, and m/s is taken as the similarity Similarity(A_i, A_j) between homework A_i and homework A_j.
5) Analyzing the similarity calculation results and clustering student homework with high similarity to generate a similarity report, specifically as follows: for a homework pair (A_i, A_j) and a homework pair (A_j, A_k), if Similarity(A_i, A_j) ≥ threshold and Similarity(A_j, A_k) ≥ threshold, then A_i, A_j and A_k are grouped into one cluster; here Similarity(A_j, A_k) is the similarity between homework A_j and homework A_k.
6) Marking the similar content between similar homework to complete the homework duplicate checking, specifically as follows: after the set of sentences in a homework that are similar to sentences in another homework has been found, those sentences are located in the homework and highlighted using a PDF highlighting tool, so that the teacher can conveniently review the similar content.
From the similarity results among the student homework, the teacher can see which homework is suspected of plagiarism, and from the marked similar text, can see which passages are similar.
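To make step 4.3.1) concrete, the sketch below builds a word embedding matrix for one sentence from a lookup table. It is a toy illustration only: the table stands in for a trained word2vec model, and the zero vector for unknown words and the padding to a fixed number of rows (20, matching the maximum sentence length kept in step 4.1) are assumptions that the patent does not state. The name embedding_matrix is hypothetical.

```python
def embedding_matrix(words, table, dim, max_len=20):
    """Return a max_len x dim matrix (list of rows) for one sentence.

    table: dict mapping word -> vector of length dim (stand-in for word2vec).
    Unknown words and padding rows are zero vectors (assumption).
    """
    zero = [0.0] * dim
    rows = [list(table.get(word, zero)) for word in words[:max_len]]
    rows += [list(zero) for _ in range(max_len - len(rows))]
    return rows


# Toy two-dimensional vectors; a real model would supply these.
toy_table = {"tcp": [0.1, 0.2], "handshake": [0.3, 0.4]}
M1 = embedding_matrix(["tcp", "unknown", "handshake"], toy_table, dim=2, max_len=4)
print(M1)
```

Giving every sentence the same matrix shape is one common way to feed sentence pairs of different lengths to a convolutional network with fixed-size fully connected layers.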
The above embodiments are preferred embodiments of the present invention, but the present invention is not limited to the above embodiments, and any other changes, modifications, substitutions, combinations, and simplifications which do not depart from the spirit and principle of the present invention should be construed as equivalents thereof, and all such changes, modifications, substitutions, combinations, and simplifications are intended to be included in the scope of the present invention.

Claims (6)

1. A homework duplicate checking method based on deep learning, characterized by comprising the following steps:
1) acquiring student course homework data and homework template files;
2) determining the format of the homework template, splitting the obtained homework into questions, and determining whether each question in the homework is a subjective question or an objective question;
3) performing text preprocessing on the answers to the subjective questions in the homework after question splitting;
4) calculating the similarity between student homework;
5) analyzing the similarity calculation results, and clustering student homework with high similarity to generate a similarity report;
6) marking the similar content between similar homework to complete the homework duplicate checking.
2. The homework duplicate checking method based on deep learning according to claim 1, wherein: in step 1), the student course homework data refers to student homework obtained from courses on an online learning platform; the homework template file is a file that specifies the homework answer format and is submitted to the course by a teacher or teaching assistant on the online learning platform.
3. The homework duplicate checking method based on deep learning according to claim 1, wherein: in step 2), the format of the homework template is determined, the obtained homework is split into questions, and each question is determined to be subjective or objective, specifically as follows:
determining the format of the homework template: the system provides teachers with several homework template formats, and uses regular expressions to determine which template format the obtained homework template belongs to;
splitting the obtained homework into questions: once the template format has been determined, the regular expression corresponding to that format is used to split the student's homework, and the splitting result is returned;
determining whether each question is subjective or objective: the determination is made from the content of the student's answer according to the following rules: a. if the answer is preceded by "answer: ", the question is a subjective question; b. a question whose answer content has a length of less than 20 is an objective question; c. all questions that cannot be decided by the above conditions are subjective questions.
4. The homework duplicate checking method based on deep learning according to claim 1, wherein: in step 3), text preprocessing is performed on the answers to the subjective questions after question splitting, specifically as follows:
a. Chinese/English determination: a part-of-speech analyzer is used to segment the text into words and determine the part of speech of each word; the numbers of Chinese words and English words are counted, the proportions of Chinese and English in the answer content are calculated, and the language with the larger proportion is taken as the language of the homework;
b. different preprocessing flows are applied according to the language: the Chinese flow is sentence segmentation, word segmentation and stop-word removal; the English flow is sentence segmentation, lemmatization, case normalization and symbol removal.
5. The task duplication checking method based on deep learning according to claim 1, wherein: in step 4), calculating the similarity between each job and other jobs, wherein the job set is A (A)1,A2,...,At) Where t is the total number of jobs in the job set, AtFor the t-th job, a certain job A is calculatediWith other operations AjThe similarity of j is more than or equal to 1 and less than or equal to t, j is not equal to i, and the specific process is as follows:
4.1) sentence screening: discarding sentences with the number of words less than 3 and the number of words greater than 20 in the sentences and other repeated sentences, and marking each sentence to indicate from which operation text the sentence comes;
4.2) sentence pre-matching: a method based on quick matching of character strings is utilized to screen out sentence pairs which are possibly similar in two operations, and the screening steps are as follows:
4.2.1) matching sentences containing the same key value into a cluster by taking a single word in the sentences as the key value; for sentence S1(w11,w12,...,w1n) And sentence S2(w21,w22,...,w2m) Wherein (w)11,w12,...,w1n) And (w)21,w22,...,w2m) Are respectively sentences S1And sentence S2If w is a word of11=w21Then will S1And S2Putting the obtained product into an array;
4.2.2) giving a threshold value, and outputting sentence pairs with similarity larger than the threshold value in each threshold value; set U (w)u1,wu2,...,wua) As a sentence S1(w11,w12,...,w1n) With sentence S2(w21,w22,...,w2m) Set of Chinese words, wherein (w)u1,wu2,...,wua) Is the word in the set U, a is the number of the words in the set U, 1 < a < m + n, S1(w11,w12,...,w1n) With sentence S2(w21,w22,...,w2m) Is the sentence S1And sentence S2The same number of words in the sentence sets accounts for the proportion of the two sentence word sets U, and sentence pairs with similarity larger than a threshold value are output;
4.3) a convolutional neural network model is called to perform semantic similarity matching on the sentence pairs whose similarity is greater than the threshold; for such a sentence pair P_1(S_1, S_2), where S_1 and S_2 come from job A_i and job A_j respectively, the sentence pair is used as the input of the neural network and the semantic similarity of the two sentences is calculated; the specific flow is as follows:
4.3.1) using a trained word2vec model, word embedding is performed on sentence S_1 = (w_11, w_12, ..., w_1n) and sentence S_2 = (w_21, w_22, ..., w_2m) of the sentence pair P_1(S_1, S_2) to generate word embedding matrices M_1 = (v_11, v_12, ..., v_1n) and M_2 = (v_21, v_22, ..., v_2m), where (v_11, v_12, ..., v_1n) and (v_21, v_22, ..., v_2m) are the word vectors in M_1 and M_2 respectively; the word embedding matrices serve as the input of the convolutional neural network model;
4.3.2) the word embedding matrices are used as the input of the convolutional neural network model, which consists of two convolutional layers, one max-pooling layer and three fully connected layers and is trained on a semantic similarity dataset; the neural network is used to obtain the semantic similarity of each sentence pair, and the sentence pairs with similarity greater than the threshold are returned;
4.3.3) the similar sentence pairs are collected: all sentences from job A_i that appear in the similar sentence pairs are put into an array; the sum m of the lengths of these sentences and the total text length s of job A_i are calculated, and m/s is taken as the similarity Similarity(A_i, A_j) between job A_i and job A_j;
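Steps 4.1) to 4.3.3) can be sketched as follows. This is an illustration under stated assumptions, not the claimed implementation: all function names are hypothetical, sentence length is counted in words, and the trained word2vec and convolutional neural network model of step 4.3) is replaced by a caller-supplied `semantic_score(s1, s2)` function that returns a value between 0 and 1.

```python
from collections import defaultdict


def screen(sentences, min_words=3, max_words=20):
    """Step 4.1: drop too-short, too-long and repeated sentences."""
    seen, kept = set(), []
    for words in sentences:
        key = tuple(words)
        if min_words <= len(words) <= max_words and key not in seen:
            seen.add(key)
            kept.append(key)
    return kept


def word_overlap(s1, s2):
    """Step 4.2.2: number of shared words divided by the size of the union U."""
    a, b = set(s1), set(s2)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def prematch(job_i, job_j, threshold):
    """Step 4.2: cluster by shared word, then keep pairs above the threshold."""
    index = defaultdict(set)          # word -> sentences of job_j containing it
    for s2 in job_j:
        for w in s2:
            index[w].add(s2)
    pairs = set()
    for s1 in job_i:
        candidates = set()
        for w in s1:
            candidates |= index.get(w, set())
        for s2 in candidates:
            if word_overlap(s1, s2) > threshold:
                pairs.add((s1, s2))
    return pairs


def job_similarity(job_i, job_j, pre_threshold, sem_threshold, semantic_score):
    """Steps 4.1 to 4.3.3: return m / s for job_i measured against job_j."""
    sents_i, sents_j = screen(job_i), screen(job_j)
    matched = {
        s1
        for s1, s2 in prematch(sents_i, sents_j, pre_threshold)
        if semantic_score(s1, s2) > sem_threshold   # stand-in for the CNN
    }
    m = sum(len(sent) for sent in matched)  # length of similar sentences of job_i
    s = sum(len(sent) for sent in job_i)    # total text length of job_i
    return m / s if s else 0.0
```

Because s is the text length of A_i alone, Similarity(A_i, A_j) computed this way is generally not equal to Similarity(A_j, A_i).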
In step 5), for the job pair (A_i, A_j) and the job pair (A_j, A_k), if Similarity(A_i, A_j) ≥ threshold and Similarity(A_j, A_k) ≥ threshold, then A_i, A_j and A_k are grouped into one cluster; wherein Similarity(A_j, A_k) is the similarity between job A_j and job A_k.
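The transitive grouping of step 5) can be sketched with a union-find structure. This is an assumed realization, since the claim does not name a data structure: `group_jobs` is a hypothetical function, `similarity` is assumed to be a dictionary keyed by ordered job pairs, and any ordered pair at or above the threshold is treated as linking its two jobs.

```python
def group_jobs(similarity, threshold):
    """Cluster job ids so that every pair with score >= threshold shares a group."""
    parent = {}

    def find(x):
        # Return the representative of x's group, registering x if unseen.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for (i, j), score in similarity.items():
        find(i)
        find(j)
        if score >= threshold:
            parent[find(i)] = find(j)      # merge the two groups

    groups = {}
    for job in parent:
        groups.setdefault(find(job), set()).add(job)
    return list(groups.values())
```

With the pairs (A1, A2) and (A2, A3) above the threshold and (A3, A4) below it, A1, A2 and A3 land in one group and A4 in a group of its own.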
6. The job duplicate checking method based on deep learning according to claim 1, wherein: in step 6), after the set of sentences in a job that are similar to sentences in another job has been found, those sentences are located in the job and highlighted using a PDF highlighting tool.
CN202110279211.3A 2021-03-16 2021-03-16 Job duplicate checking method based on deep learning Pending CN113011154A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110279211.3A CN113011154A (en) 2021-03-16 2021-03-16 Job duplicate checking method based on deep learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110279211.3A CN113011154A (en) 2021-03-16 2021-03-16 Job duplicate checking method based on deep learning

Publications (1)

Publication Number Publication Date
CN113011154A (en) 2021-06-22

Family

ID=76407828

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110279211.3A Pending CN113011154A (en) 2021-03-16 2021-03-16 Job duplicate checking method based on deep learning

Country Status (1)

Country Link
CN (1) CN113011154A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114463567A (en) * 2022-04-12 2022-05-10 北京吉道尔科技有限公司 Block chain-based intelligent education operation big data plagiarism prevention method and system
CN117251445A (en) * 2023-10-11 2023-12-19 杭州今元标矩科技有限公司 Deep learning-based CRM data screening method, system and medium

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102411564A (en) * 2011-08-17 2012-04-11 北方工业大学 Electronic homework copying detection method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
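胡布焕 等 (Hu Buhuan et al.): "一种基于语义相似的中文文档抄袭检测方法" [A Chinese document plagiarism detection method based on semantic similarity], 深圳大学学报理工版 (Journal of Shenzhen University Science and Engineering), vol. 37, no. 1, pages 107 - 111 *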

Similar Documents

Publication Publication Date Title
CN108363743B (en) Intelligent problem generation method and device and computer readable storage medium
Janda et al. Syntactic, semantic and sentiment analysis: The joint effect on automated essay evaluation
CN110609983B (en) Structured decomposition method for policy file
CN113011154A (en) Job duplicate checking method based on deep learning
CN111241397A (en) Content recommendation method and device and computing equipment
CN113282701A (en) Composition material generation method and device, electronic equipment and readable storage medium
CN112287197A (en) Method for detecting sarcasm of case-related microblog comments described by dynamic memory cases
Flor et al. Text mining and automated scoring
CN112966518A (en) High-quality answer identification method for large-scale online learning platform
Mandge et al. Revolutionize cosine answer matching technique for question answering system
Dündar et al. A Hybrid Approach to Question-answering for a Banking Chatbot on Turkish: Extending Keywords with Embedding Vectors.
CN105955954A (en) New enterprise name finding method based on bidirectional recurrent neural network
Riza et al. Automatic generation of short-answer questions in reading comprehension using NLP and KNN
CN115017271A (en) Method and system for intelligently generating RPA flow component block
JP6586055B2 (en) Deep case analysis device, deep case learning device, deep case estimation device, method, and program
Shah et al. Automatic evaluation of free text answers: A review
CN112347786A (en) Artificial intelligence scoring training method and device
ALMUAYQIL et al. Towards an Ontology-Based Fully Integrated System for Student E-Assessment
CN106776533A (en) Method and system for analyzing a piece of text
CN111930947A (en) System and method for identifying authors of modern Chinese written works
Ghosh Predicting question deletion and assessing question quality in social Q&A sites using weakly supervised deep neural networks
Amur et al. State-of-the-Art: Assessing Semantic Similarity in Automated Short-Answer Grading Systems
US20230281388A1 (en) A Method and System for Analyzing a Piece of Text Comprising Chinese Characters
CN116167344B (en) Automatic text generation method for deep learning creative science and technology
EP4163815A1 (en) Textual content evaluation using machine learned models

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination