CN114239539A - English composition off-topic detection method and device - Google Patents


Info

Publication number
CN114239539A
CN114239539A
Authority
CN
China
Prior art keywords
composition
detected
model
encoder model
question
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111571897.XA
Other languages
Chinese (zh)
Inventor
杨航
邓嘉
张新访
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuhan Tianyu Information Industry Co Ltd
Original Assignee
Wuhan Tianyu Information Industry Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wuhan Tianyu Information Industry Co Ltd filed Critical Wuhan Tianyu Information Industry Co Ltd
Priority to CN202111571897.XA priority Critical patent/CN114239539A/en
Publication of CN114239539A publication Critical patent/CN114239539A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/10Text processing
    • G06F40/194Calculation of difference between files
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/22Matching criteria, e.g. proximity measures

Abstract

The invention discloses a method and a device for detecting off-topic English compositions, relating to the field of computer technology. The method comprises: constructing an encoder model based on self-supervised contrastive learning and performing fine-tuning training on it to obtain a target encoder model; inputting on-topic compositions and the composition to be detected into the target encoder model to obtain their embeddings; and judging whether the composition to be detected is off-topic based on the similarity between the embeddings of the on-topic compositions and the embedding of the composition to be detected. The invention effectively reduces the cost of off-topic composition detection.

Description

English composition off-topic detection method and device
Technical Field
The invention relates to the field of computer technology, and in particular to a method and a device for detecting off-topic English compositions.
Background
Currently, when an automatic scoring system grades English compositions, it must detect whether each composition is off-topic. A common off-topic detection method combines a topic model with word-vector representations, for example an LDA (Latent Dirichlet Allocation) model combined with word2vec (a model for generating word vectors), or variants of the two. The relevance of the composition to be detected to the given composition prompt is then calculated from the combined representation, and whether the composition is off-topic is determined according to that relevance.
The existing off-topic detection methods mainly suffer from the following problems: 1. word2vec vocabulary representations tend to ignore the associations between words in the current text, as well as the influence of sentence order and position on sentence semantics; 2. when detecting off-topic compositions for different prompts, no prompt-specific adaptation of the detection capability is performed; 3. the LDA model requires relatively many samples to obtain accurate topic information during detection, which makes detection costly.
Disclosure of Invention
Aiming at the defects in the prior art, the invention provides a method and a device for detecting off-topic English compositions that can effectively reduce the cost of off-topic composition detection.
To achieve the above purpose, the invention provides a method for detecting off-topic English compositions, comprising the following steps:
constructing an encoder model based on self-supervised contrastive learning, and performing fine-tuning training on the encoder model to obtain a target encoder model;
inputting on-topic compositions and the composition to be detected into the target encoder model to obtain embeddings of the on-topic compositions and of the composition to be detected;
and judging whether the composition to be detected is off-topic based on the similarity between the embeddings of the on-topic compositions and the embedding of the composition to be detected.
On the basis of the above technical solution, constructing the encoder model based on self-supervised contrastive learning comprises the following specific steps:
obtaining a BERT model or a RoBERTa model as the base model;
constructing positive and negative example pairs by a dropout-mask method based on an unlabeled text data set;
and inputting the constructed positive and negative example pairs into the base model to train it, thereby obtaining the encoder model.
On the basis of the above technical solution, the constructed positive and negative example pairs are input into the base model to train it and obtain the encoder model, where the loss function used for training the base model is:
$$\mathrm{loss}_i = -\log \frac{e^{\,\mathrm{sim}(z_i,\,z_i^{+})/\tau}}{\sum_{j=1}^{N} e^{\,\mathrm{sim}(z_i,\,z_j^{+})/\tau}}$$
where loss_i denotes the loss for the i-th sample, N the number of samples in the batch, log the logarithm function, e the natural constant, τ the temperature parameter, and sim the cosine similarity; z_i denotes the hidden-layer representation of the i-th sample (without the dropout mask), z_i^+ the hidden-layer representation of the i-th sample after the dropout mask, and z_j^+ the hidden-layer representation of the j-th sample after the dropout mask, where j ∈ [1, N] is the sample index.
On the basis of the above technical solution, performing fine-tuning training on the encoder model to obtain the target encoder model comprises the following specific steps:
obtaining labeled on-topic compositions for different composition prompts as the training set for fine-tuning;
enhancing the training set with data augmentation methods;
and inputting the enhanced training set into the encoder model and adjusting the parameters of its fully connected layer to obtain the target encoder model.
On the basis of the above technical solution, in the training set, on-topic compositions of the same prompt form positive example pairs, and on-topic compositions of different prompts form negative example pairs.
On the basis of the above technical solution, the on-topic compositions and the composition to be detected are input into the target encoder model to obtain their embeddings, where obtaining the embeddings of the on-topic compositions comprises the following specific steps:
acquiring several on-topic compositions written for the same prompt as the composition to be detected;
and inputting the acquired on-topic compositions into the target encoder model in turn to obtain their embeddings, which together form the embedding set of the on-topic compositions.
On the basis of the above technical solution, judging whether the composition to be detected is off-topic based on the similarity between the embeddings comprises the following specific steps:
computing the cosine similarity between the embedding of the composition to be detected and each embedding in the embedding set in turn, obtaining several similarity values;
selecting the maximum or the average of the obtained similarity values as the on-topic score of the composition to be detected;
and comparing the on-topic score with preset standard thresholds to judge whether the composition to be detected is off-topic.
On the basis of the above technical solution, the preset standard thresholds comprise a plagiarism-suspicion value and an off-topic-suspicion value.
On the basis of the above technical solution,
when the on-topic score is greater than the plagiarism-suspicion value, the composition to be detected is judged to be plagiarized;
when the on-topic score is not greater than the plagiarism-suspicion value and not less than the off-topic-suspicion value, the composition to be detected is judged to be on-topic;
and when the on-topic score is less than the off-topic-suspicion value, the composition to be detected is judged to be off-topic.
The invention further provides a device for detecting off-topic English compositions, comprising:
a training module for constructing an encoder model based on self-supervised contrastive learning and performing fine-tuning training on it to obtain a target encoder model;
an input module for inputting on-topic compositions and the composition to be detected into the target encoder model to obtain embeddings of the on-topic compositions and of the composition to be detected;
and a judging module for judging whether the composition to be detected is off-topic based on the similarity between the embeddings of the on-topic compositions and the embedding of the composition to be detected.
Compared with the prior art, the invention has the following advantages: an encoder model is constructed based on self-supervised contrastive learning and fine-tuned to obtain a target encoder model; the on-topic compositions and the composition to be detected are then input into the target encoder model to obtain their embeddings; and whether the composition to be detected is off-topic is judged from the similarity between these embeddings. This improves the performance of off-topic detection, reduces the dependence on samples for a specific prompt, allows the off-topic detection model to be built from only a small number of on-topic composition samples, and thus effectively reduces the cost of off-topic composition detection.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
FIG. 1 is a flowchart of a method for detecting off-topic English compositions in an embodiment of the present invention;
FIG. 2 is a schematic structural diagram of a device for detecting off-topic English compositions according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some embodiments of the present application, but not all embodiments.
Referring to FIG. 1, the method for detecting off-topic English compositions provided in an embodiment of the present invention is used to automatically detect whether an English composition is off-topic, and specifically comprises the following steps:
S1: constructing an encoder model (a common model framework) based on self-supervised contrastive learning, and performing fine-tuning training on the encoder model to obtain a target encoder model;
that is, self-supervised contrastive learning is applied to the base model to obtain the encoder model. A fully connected layer suited to off-topic detection training is then added, and fine-tuning training yields a target encoder model adapted to the off-topic detection task.
S2: inputting the on-topic compositions and the composition to be detected into the target encoder model to obtain their embeddings; for the same composition prompt, the composition to be detected is input into the target encoder model to obtain its embedding.
S3: judging whether the composition to be detected is off-topic based on the similarity between the embeddings of the on-topic compositions and the embedding of the composition to be detected, i.e., by comparing the similarity of the two embeddings.
In the embodiment of the invention, the encoder model is constructed based on self-supervised contrastive learning, specifically comprising the following steps:
S101: obtaining a BERT model (a self-encoding language model) or a RoBERTa model as the base model, on which further training is performed.
S102: constructing positive and negative example pairs by the dropout-mask method based on an unlabeled text data set; the training samples for the base model come from common public unlabeled text data sets, such as NLI (Natural Language Inference) data sets.
A positive example pair is constructed by the dropout-mask method: the same sentence is input into the model twice; owing to the randomness of dropout (a method for preventing overfitting), the two outputs differ, and when the dropout rate is small, the two output embeddings can be considered semantically similar. Likewise, different texts are chosen as negative example pairs.
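As an editorial illustration (not part of the patent text), the dropout-mask positive-pair construction described above can be sketched as follows. The encoder forward pass is replaced by a toy rescaled random mask; `encode_with_dropout` and the dropout rate are assumptions for demonstration only.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode_with_dropout(x, rate=0.1):
    """Toy stand-in for one encoder forward pass with a dropout mask:
    zeroes a random fraction `rate` of the features, then rescales
    by 1/(1 - rate) so the expected magnitude is preserved."""
    if rate == 0.0:
        return x.copy()
    mask = rng.random(x.shape) >= rate
    return x * mask / (1.0 - rate)

# The same sentence representation passed through the "model" twice
# yields two (almost surely) different vectors, because the two dropout
# masks differ -- these two views form a positive example pair.
sentence_repr = rng.normal(size=8)
view_a = encode_with_dropout(sentence_repr)
view_b = encode_with_dropout(sentence_repr)
```

With a small rate, the two views remain close to the original representation, which is why they can be treated as semantically similar.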
S103: inputting the constructed positive and negative example pairs into the base model to train it, thereby obtaining the encoder model. The goal of training is to pull positive pairs closer together and push negative pairs farther apart.
In the embodiment of the invention, the constructed positive and negative example pairs are input into the base model to train it and obtain the encoder model, where the loss function used for training the base model is:
$$\mathrm{loss}_i = -\log \frac{e^{\,\mathrm{sim}(z_i,\,z_i^{+})/\tau}}{\sum_{j=1}^{N} e^{\,\mathrm{sim}(z_i,\,z_j^{+})/\tau}}$$
where loss_i denotes the loss for the i-th sample, N the number of samples in the batch, log the logarithm function, e the natural constant, τ the temperature parameter, and sim the cosine similarity; z_i denotes the hidden-layer representation of the i-th sample (without the dropout mask), z_i^+ the hidden-layer representation of the i-th sample after the dropout mask, and z_j^+ the hidden-layer representation of the j-th sample after the dropout mask, where j ∈ [1, N] is the sample index.
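As an editorial illustration (not part of the patent text), the batch contrastive loss described above can be implemented directly from its definition; the temperature value 0.05 and the toy batch are assumptions for demonstration.

```python
import numpy as np

def cosine_sim(a, b):
    """sim(·,·) from the loss: cosine similarity of two vectors."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def contrastive_loss(z, z_plus, tau=0.05):
    """Mean over i of
    loss_i = -log( e^{sim(z_i, z_i+)/tau} / sum_j e^{sim(z_i, z_j+)/tau} ),
    where z[i] is the representation without the dropout mask and
    z_plus[i] the representation of the same sentence after the mask."""
    n = len(z)
    sims = np.array([[cosine_sim(z[i], z_plus[j]) for j in range(n)]
                     for i in range(n)]) / tau
    # log-softmax evaluated at the positive-pair (diagonal) entries
    losses = -(np.diag(sims) - np.log(np.exp(sims).sum(axis=1)))
    return losses.mean()

# Matched pairs with orthogonal negatives give a near-zero loss;
# deliberately mismatched pairs give a large loss.
z = np.array([[1.0, 0.0], [0.0, 1.0]])
low = contrastive_loss(z, z.copy())
high = contrastive_loss(z, z[::-1].copy())
```

Minimizing this loss pulls each positive pair together relative to all other samples in the batch, which is the training goal stated above.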
In the embodiment of the invention, fine-tuning training is performed on the encoder model to obtain the target encoder model, specifically comprising the following steps:
S111: obtaining labeled on-topic compositions for different composition prompts as the training set for fine-tuning; in the training set, on-topic compositions of the same prompt form positive example pairs, and on-topic compositions of different prompts form negative example pairs. On-topic composition data from different English composition examinations are obtained from public data and used as the fine-tuning data.
S112: enhancing the training set with data augmentation; in the present invention, the data augmentation methods include, but are not limited to, adversarial attack, word-order shuffling, cutting, dropout, and the like.
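As an editorial illustration (not part of the patent text), two of the augmentation methods named above, word-order shuffling and cutting, can be sketched on token lists; the function names and ratios are assumptions for demonstration.

```python
import random

def shuffle_word_order(text, rng=random.Random(42)):
    """Word-order shuffling: randomly permute the tokens of the text."""
    tokens = text.split()
    rng.shuffle(tokens)
    return " ".join(tokens)

def cutoff(text, drop_ratio=0.2, rng=random.Random(42)):
    """Cutting: drop a random fraction of the tokens, keeping order."""
    tokens = text.split()
    keep = max(1, int(len(tokens) * (1 - drop_ratio)))
    idx = sorted(rng.sample(range(len(tokens)), keep))
    return " ".join(tokens[i] for i in idx)

# Each augmented variant of a training composition is added to the set.
augmented = [shuffle_word_order("the quick brown fox jumps"),
             cutoff("the quick brown fox jumps")]
```

Both transforms preserve most of the text's meaning while changing its surface form, giving the fine-tuning step extra positive-pair variety.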
S113: and inputting the enhanced training set into an encoder model, and adjusting parameters of a full connection layer of the encoder model to obtain a target encoder model.
The training goal is to make the positive example pair closer and the negative example pair farther, and the corresponding objective loss function can be designed. And during fine tuning training, fixing partial parameters of the encoder model, fine tuning parameters of the full connection layer, and finally training to obtain the target encoder model.
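As an editorial illustration (not part of the patent text), the "fix part of the parameters, fine-tune the fully connected layer" scheme amounts to skipping frozen parameters in the update step; the parameter shapes and learning rate here are assumptions for demonstration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy parameter set: the encoder weights are frozen during fine-tuning
# and only the appended fully connected (head) layer is updated.
params = {
    "encoder": rng.normal(size=(16, 8)),   # pretrained, frozen
    "head":    np.zeros((8, 8)),           # fully connected layer, trainable
}
frozen = {"encoder"}

def sgd_step(params, grads, lr=0.01):
    """One SGD update that skips every parameter marked as frozen."""
    for name in params:
        if name not in frozen:
            params[name] = params[name] - lr * grads[name]
    return params

encoder_before = params["encoder"].copy()
grads = {"encoder": np.ones((16, 8)), "head": np.ones((8, 8))}
params = sgd_step(params, grads)
# params["encoder"] is unchanged; params["head"] moved by -lr * grad
```

Freezing the encoder keeps the semantic-similarity capability learned in pretraining while the small head adapts to off-topic detection.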
In the embodiment of the invention, the on-topic compositions and the composition to be detected are input into the target encoder model to obtain their embeddings, where obtaining the embeddings of the on-topic compositions comprises the following specific steps:
S201: acquiring several on-topic compositions written for the same prompt as the composition to be detected;
S202: inputting the acquired on-topic compositions into the target encoder model in turn to obtain their embeddings, which together form the embedding set of the on-topic compositions.
That is, for the on-topic compositions sharing the prompt of the composition to be detected, each is input into the target encoder model to obtain its embedding, and the embeddings of all on-topic compositions are combined into an embedding set.
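As an editorial illustration (not part of the patent text), building the embedding set reduces to encoding each on-topic composition in turn. The `encode` stand-in (a hash-seeded random projection, deterministic within one run) and the example sentences are assumptions replacing the real target encoder model.

```python
import numpy as np

def encode(text, dim=8):
    """Hypothetical stand-in for the target encoder model: maps a
    composition to a fixed-size embedding (hash-seeded random vector,
    deterministic within a single run)."""
    seed = abs(hash(text)) % (2**32)
    return np.random.default_rng(seed).normal(size=dim)

# Several on-topic compositions for the same prompt (invented examples).
on_topic_compositions = [
    "My favourite season is spring because the weather turns warm.",
    "Spring brings new life, so it is the season I love most.",
    "Of all four seasons I prefer spring for its flowers and rain.",
]

# Encode each composition in turn; the vectors form the embedding set.
embedding_set = [encode(c) for c in on_topic_compositions]
```

The same `encode` call is later applied to the composition to be detected, so both sides of the similarity comparison live in one embedding space.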
In the embodiment of the invention, whether the composition to be detected is off-topic is judged based on the similarity between the embeddings, specifically comprising the following steps:
S301: computing the cosine similarity between the embedding of the composition to be detected and each embedding in the embedding set in turn; each computation yields one similarity value, giving several similarity values in total.
S302: selecting the maximum or the average of the obtained similarity values as the on-topic score of the composition to be detected, the on-topic score ranging over [0, 1];
S303: comparing the on-topic score with the preset standard thresholds to judge whether the composition to be detected is off-topic.
Specifically, the preset standard thresholds comprise a plagiarism-suspicion value and an off-topic-suspicion value; the plagiarism-suspicion value may be 0.95 and the off-topic-suspicion value 0.6, and both can be set flexibly according to the specific situation. When the on-topic score is greater than the plagiarism-suspicion value, the composition to be detected is judged to be plagiarized; when the on-topic score is not greater than the plagiarism-suspicion value and not less than the off-topic-suspicion value, the composition is judged to be on-topic; and when the on-topic score is less than the off-topic-suspicion value, the composition is judged to be off-topic.
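As an editorial illustration (not part of the patent text), the three-way threshold decision described above can be sketched as follows, using the example threshold values 0.95 and 0.6; the function names and toy embeddings are assumptions for demonstration.

```python
import numpy as np

# Example threshold values from the description above; both adjustable.
PLAGIARISM_THRESHOLD = 0.95   # plagiarism-suspicion value
OFF_TOPIC_THRESHOLD = 0.60    # off-topic-suspicion value

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def judge(candidate_embedding, embedding_set, use_max=True):
    """Score the candidate against every embedding in the on-topic set,
    take the maximum (or mean) similarity as the on-topic score, and
    apply the two thresholds."""
    sims = [cosine(candidate_embedding, e) for e in embedding_set]
    score = max(sims) if use_max else sum(sims) / len(sims)
    if score > PLAGIARISM_THRESHOLD:
        return "plagiarism"
    if score >= OFF_TOPIC_THRESHOLD:
        return "on-topic"
    return "off-topic"

on_topic_set = [np.array([1.0, 0.0]), np.array([0.8, 0.6])]
```

A candidate nearly identical to a reference composition lands above the plagiarism threshold, while one dissimilar to every reference falls below the off-topic threshold.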
The method improves the quality of text-encoding representations through self-supervised contrastive learning, fine-tuning training, data augmentation and related techniques, so that off-topic detection performs better and the dependence on samples of a particular prompt is reduced. First, the base encoder model is constructed; second, the reinforced encoder model, i.e., the target encoder model, is constructed and adapted to the off-topic detection task; finally, for off-topic detection on a specific prompt, the existing on-topic compositions for that prompt serve as a support set, which is encoded by the reinforced encoder model to obtain the embedding set of on-topic compositions. The composition to be detected is input into the reinforced encoder model to obtain its embedding, and the similarity between this embedding and the embedding set of on-topic compositions indicates how closely the composition matches the prompt, from which it is judged whether the composition is off-topic.
The retrained encoder model is well suited to the off-topic detection task: the base encoder model is trained with self-supervised contrastive learning, so its encodings measure text similarity well, and the retrained model inherits this semantic-similarity capability; the target encoder model is then fine-tuned on existing English composition data sets for the off-topic detection task and adapts to it even better. Moreover, off-topic detection for a composition requires only a small number of on-topic compositions. The method therefore also suits practical examination scenarios: since detection needs only a few on-topic compositions, only a small, suitable set needs to be selected to build the embedding set, so off-topic detection can be completed with little manual effort.
In summary, the English composition off-topic detection method constructs an encoder model based on self-supervised contrastive learning, fine-tunes it to obtain a target encoder model, inputs the on-topic compositions and the composition to be detected into the target encoder model to obtain their embeddings, and finally judges whether the composition to be detected is off-topic from the similarity between the embeddings. This improves the performance of off-topic detection, reduces the dependence on samples for a specific prompt, allows the detection model to be built from only a small number of on-topic composition samples, and effectively reduces the cost of off-topic composition detection.
Referring to FIG. 2, the device for detecting off-topic English compositions according to an embodiment of the present invention comprises a training module, an input module, and a judging module. The training module constructs an encoder model based on self-supervised contrastive learning and fine-tunes it to obtain a target encoder model; the input module inputs on-topic compositions and the composition to be detected into the target encoder model to obtain their embeddings; and the judging module judges whether the composition to be detected is off-topic based on the similarity between the embeddings of the on-topic compositions and the embedding of the composition to be detected.
The above description is merely exemplary of the present application and is presented to enable those skilled in the art to understand and practice the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

Claims (10)

1. A method for detecting off-topic English compositions, characterized by comprising the following steps:
constructing an encoder model based on self-supervised contrastive learning, and performing fine-tuning training on the encoder model to obtain a target encoder model;
inputting on-topic compositions and the composition to be detected into the target encoder model to obtain embeddings of the on-topic compositions and of the composition to be detected;
and judging whether the composition to be detected is off-topic based on the similarity between the embeddings of the on-topic compositions and the embedding of the composition to be detected.
2. The method for detecting off-topic English compositions of claim 1, characterized in that constructing the encoder model based on self-supervised contrastive learning comprises the following specific steps:
obtaining a BERT model or a RoBERTa model as the base model;
constructing positive and negative example pairs by a dropout-mask method based on an unlabeled text data set;
and inputting the constructed positive and negative example pairs into the base model to train it, thereby obtaining the encoder model.
3. The method of claim 2, characterized in that the constructed positive and negative example pairs are input into the base model to train it and obtain the encoder model, where the loss function used for training the base model is:
$$\mathrm{loss}_i = -\log \frac{e^{\,\mathrm{sim}(z_i,\,z_i^{+})/\tau}}{\sum_{j=1}^{N} e^{\,\mathrm{sim}(z_i,\,z_j^{+})/\tau}}$$
where loss_i denotes the loss for the i-th sample, N the number of samples in the batch, log the logarithm function, e the natural constant, τ the temperature parameter, and sim the cosine similarity; z_i denotes the hidden-layer representation of the i-th sample (without the dropout mask), z_i^+ the hidden-layer representation of the i-th sample after the dropout mask, and z_j^+ the hidden-layer representation of the j-th sample after the dropout mask, where j ∈ [1, N] is the sample index.
4. The method for detecting off-topic English compositions of claim 2, characterized in that performing fine-tuning training on the encoder model to obtain the target encoder model comprises the following specific steps:
obtaining labeled on-topic compositions for different composition prompts as the training set for fine-tuning;
enhancing the training set with data augmentation methods;
and inputting the enhanced training set into the encoder model and adjusting the parameters of its fully connected layer to obtain the target encoder model.
5. The method for detecting off-topic English compositions of claim 4, characterized in that: in the training set, on-topic compositions of the same prompt form positive example pairs, and on-topic compositions of different prompts form negative example pairs.
6. The method for detecting off-topic English compositions of claim 1, characterized in that the on-topic compositions and the composition to be detected are input into the target encoder model to obtain their embeddings, where obtaining the embeddings of the on-topic compositions comprises the following specific steps:
acquiring several on-topic compositions written for the same prompt as the composition to be detected;
and inputting the acquired on-topic compositions into the target encoder model in turn to obtain their embeddings, which together form the embedding set of the on-topic compositions.
7. The English composition off-topic detection method according to claim 6, wherein determining whether the composition to be detected is off-topic based on the similarity between the embeddings of the on-topic compositions and the embedding of the composition to be detected comprises the following steps:
computing the cosine similarity between the embedding of the composition to be detected and each embedding in the embedding set in turn, obtaining a plurality of similarity values;
taking the maximum or the average of the obtained similarity values as the topic-adherence score of the composition to be detected;
and comparing the topic-adherence score with preset standard thresholds to determine whether the composition to be detected is off-topic.
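The similarity and scoring steps of claim 7 can be sketched as follows (an illustrative implementation; the function names are assumptions):

```python
import math

def cosine(u, v):
    # Standard cosine similarity between two equal-length vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def adherence_score(essay_emb, on_topic_embs, reduce="max"):
    # Compare the essay's embedding against each on-topic embedding,
    # then take the maximum or the average as the topic-adherence score.
    sims = [cosine(essay_emb, e) for e in on_topic_embs]
    return max(sims) if reduce == "max" else sum(sims) / len(sims)
```

The choice between maximum and average trades off sensitivity: the maximum rewards closeness to any single on-topic composition, while the average measures closeness to the prompt's compositions as a whole.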
8. The English composition off-topic detection method according to claim 7, wherein the preset standard thresholds comprise a plagiarism-suspicion threshold and an off-topic-suspicion threshold.
9. The English composition off-topic detection method according to claim 8, wherein:
when the topic-adherence score is greater than the plagiarism-suspicion threshold, the composition to be detected is judged to be plagiarized;
when the topic-adherence score is not greater than the plagiarism-suspicion threshold and not less than the off-topic-suspicion threshold, the composition to be detected is judged to be on-topic;
and when the topic-adherence score is less than the off-topic-suspicion threshold, the composition to be detected is judged to be off-topic.
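The three-way decision of claim 9 reduces to a pair of threshold comparisons. A sketch, with illustrative threshold values (the patent does not specify them):

```python
def classify(score, plagiarism_thr=0.95, off_topic_thr=0.4):
    """Map a topic-adherence score to a verdict per claim 9.

    The threshold values here are assumptions for illustration only.
    """
    if score > plagiarism_thr:
        return "plagiarism"   # suspiciously close to an on-topic composition
    if score < off_topic_thr:
        return "off-topic"    # too dissimilar from every on-topic composition
    return "on-topic"         # between the two thresholds
```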
10. An English composition off-topic detection device, characterized by comprising:
a training module for constructing an encoder model based on self-supervised contrastive learning and fine-tuning the encoder model to obtain a target encoder model;
an input module for inputting the on-topic compositions and the composition to be detected separately into the target encoder model to obtain their embeddings;
and a judging module for determining whether the composition to be detected is off-topic based on the similarity between the embeddings of the on-topic compositions and the embedding of the composition to be detected.
CN202111571897.XA 2021-12-21 2021-12-21 English composition off-topic detection method and device Pending CN114239539A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111571897.XA CN114239539A (en) 2021-12-21 2021-12-21 English composition off-topic detection method and device

Publications (1)

Publication Number Publication Date
CN114239539A true CN114239539A (en) 2022-03-25

Family

ID=80760480

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111571897.XA Pending CN114239539A (en) 2021-12-21 2021-12-21 English composition off-topic detection method and device

Country Status (1)

Country Link
CN (1) CN114239539A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114881043A (en) * 2022-07-11 2022-08-09 Sichuan University Deep learning model-based legal document semantic similarity evaluation method and system
CN114881043B (en) * 2022-07-11 2022-11-18 Sichuan University Deep learning model-based legal document semantic similarity evaluation method and system

Similar Documents

Publication Publication Date Title
CN110489555B (en) Language model pre-training method combined with similar word information
CN107092596B (en) Text emotion analysis method based on attention CNNs and CCR
US11531874B2 (en) Regularizing machine learning models
CN110263854B (en) Live broadcast label determining method, device and storage medium
CN115357719B (en) Power audit text classification method and device based on improved BERT model
CN116992005B (en) Intelligent dialogue method, system and equipment based on large model and local knowledge base
CN110321434A (en) A kind of file classification method based on word sense disambiguation convolutional neural networks
CN115146629A (en) News text and comment correlation analysis method based on comparative learning
CN112818110A (en) Text filtering method, text filtering equipment and computer storage medium
CN107451116B (en) Statistical analysis method for mobile application endogenous big data
CN112417132A (en) New intention recognition method for screening negative samples by utilizing predicate guest information
CN113836894B (en) Multi-dimensional English composition scoring method and device and readable storage medium
CN114239539A (en) English composition off-topic detection method and device
CN114398900A (en) Long text semantic similarity calculation method based on RoBERTA model
CN111737475B (en) Unsupervised network public opinion spam long text recognition method
CN112528628A (en) Text processing method and device and electronic equipment
CN116681056B (en) Text value calculation method and device based on value scale
CN112116181B (en) Classroom quality model training method, classroom quality evaluation method and classroom quality evaluation device
CN117216214A (en) Question and answer extraction generation method, device, equipment and medium
CN112989816B (en) Text content quality evaluation method and system
CN115796141A (en) Text data enhancement method and device, electronic equipment and storage medium
CN115587163A (en) Text classification method and device, electronic equipment and storage medium
CN114595684A (en) Abstract generation method and device, electronic equipment and storage medium
CN114722832A (en) Abstract extraction method, device, equipment and storage medium
CN110347824B (en) Method for determining optimal number of topics of LDA topic model based on vocabulary similarity

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination