CN113191119A - Method, apparatus and storage medium for training text error correction model - Google Patents

Method, apparatus and storage medium for training text error correction model

Info

Publication number
CN113191119A
CN113191119A (application CN202110616159.6A)
Authority
CN
China
Prior art keywords
text
word
data
pseudo
mark
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110616159.6A
Other languages
Chinese (zh)
Inventor
王亦宁
刘升平
梁家恩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Unisound Intelligent Technology Co Ltd
Xiamen Yunzhixin Intelligent Technology Co Ltd
Original Assignee
Unisound Intelligent Technology Co Ltd
Xiamen Yunzhixin Intelligent Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Unisound Intelligent Technology Co Ltd and Xiamen Yunzhixin Intelligent Technology Co Ltd
Priority to CN202110616159.6A
Publication of CN113191119A
Current legal status: Pending


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/103 Formatting, i.e. changing of presentation of documents
    • G06F40/117 Tagging; Marking up; Designating a block; Setting of attributes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/12 Use of codes for handling textual entities
    • G06F40/126 Character encoding

Abstract

The invention relates to a training method, device and storage medium for a text error correction model. The method includes: performing pseudo-mark construction on each text in acquired unmarked data based on a pseudo-mark rule randomly selected from a plurality of preset pseudo-mark rules, to obtain a pseudo-mark text corresponding to each text; detecting whether the number of pseudo-mark constructions has reached a preset number of iterations; and if so, taking all pseudo-mark texts as pseudo-mark data, which reduces manual labeling work and increases the amount of data. The pseudo-mark data and the obtained marked data can then be used as source-end data, the unmarked data as target-end data, and a text error correction model is trained with a pointer-network-based sequence-to-sequence method, which improves the training efficiency of the text error correction model.

Description

Method, apparatus and storage medium for training text error correction model
Technical Field
The invention relates to the technical field of information processing, and in particular to a training method, device and storage medium for a text error correction model.
Background
Text error correction is an important research direction in computer natural language processing: errors introduced into a text by human factors, such as wrongly written characters and word-order errors, can be corrected through a computer algorithm.
Existing generative text error correction models usually need large-scale labeled data to be collected before an end-to-end generative model can be trained to realize the correction process from erroneous text to correct text.
However, for domains with little data, such as medical data, collecting and labeling the data takes considerable time and labor, which reduces the training efficiency of the text error correction model.
Disclosure of Invention
The invention provides a training method, device and storage medium for a text error correction model, which aim to solve the technical problems of low accuracy and low quality of generated text in the prior art.
The technical scheme for solving the technical problems is as follows:
a training method of a text correction model comprises the following steps:
performing pseudo-mark construction on each text in the obtained unmarked data based on a pseudo-mark rule randomly selected from a plurality of preset pseudo-mark rules to obtain a pseudo-mark text corresponding to each text;
detecting whether the number of pseudo-mark constructions has reached a preset number of iterations;
if the number of pseudo-mark constructions has reached the preset number of iterations, taking all pseudo-mark texts as pseudo-mark data;
and taking the pseudo-mark data and the obtained marked data as source-end data, taking the unmarked data as target-end data, and training a text error correction model based on a pointer-network sequence-to-sequence method.
Further, in the above method for training a text error correction model, the preset multiple pseudo-mark rules include:
randomly deleting, with a preset first probability, at least one word at the position of each word in each text; and/or
randomly inserting, with a preset first probability, at least one character at the position of each word in each text; and/or
adding noise to the position of each word in each text according to a normal distribution, and reordering the words after the noise is added; and/or
collecting and constructing a dictionary of similar-pronunciation words, and replacing each word in each text with a similar-pronunciation word according to a preset third probability; and/or
collecting and constructing a dictionary of similar words, and replacing each word in each text with a similar word according to a preset fourth probability; and/or
keeping each word in each piece of text unchanged.
Further, in the above method for training a text error correction model, the predetermined plurality of pseudo-mark rules includes randomly inserting at least one word at a predetermined first probability at a position of each word in each text, and the method further includes:
constructing a word table according to the character frequency in the unmarked data;
selecting at least one inserted word from the word table.
Further, in the above method for training a text error correction model, constructing a word table according to the frequency of characters in the unlabeled data includes:
taking characters with the character frequency larger than or equal to a preset threshold value in the unmarked data as target characters;
and constructing the word table according to the target character.
Further, in the above method for training a text error correction model, the training of the text error correction model based on a sequence-to-sequence method of a pointer network includes:
carrying out word sequence division on the source end data, and carrying out word vector processing on the obtained word sequence to obtain a word matrix corresponding to the word sequence;
encoding the word matrix by using an encoder to obtain an input word sequence encoding representation;
decoding the input word sequence code by using a decoder under an attention mechanism to obtain error correction data corresponding to the source end data;
determining a loss value based on the error correction data and the target end data;
and performing iterative training on the current model based on the loss value until a training stopping condition is reached to obtain a text error correction model.
Further, in the above method for training a text error correction model, decoding the input word sequence code to obtain error correction data corresponding to the source data, includes:
decoding the input word sequence code to obtain an error-corrected output word sequence code representation;
inputting the output word sequence encoded representation into a logistic regression layer of the decoder for linear transformation, and outputting an initial probability distribution at each time t in the target-end data;
determining the fusion probability distribution at each time t in the target-end data according to the probability distribution at each time t in the target-end data and an obtained replication mechanism score;
selecting the word corresponding to the maximum fusion probability as the generation result at time t;
and generating error correction data corresponding to the source-end data based on the words at all times.
Further, in the above method for training a text correction model, the process of obtaining the score of the replication mechanism includes:
performing matrix transformation on the coded representation of the output word sequence to obtain an output vector;
performing matrix transformation on the hidden state of the encoder to obtain a key vector and a value vector;
determining the replication mechanism score based on the output vector, the key vector, and the value vector.
The invention also provides a training device of the text error correction model, which comprises:
the pseudo mark construction module is used for carrying out pseudo mark construction on each text in the unmarked data based on a pseudo mark rule randomly selected from a plurality of preset pseudo mark rules to obtain a pseudo mark text corresponding to each text;
the detection module is used for detecting whether the construction times of the pseudo mark reach the preset iteration times; if the pseudo mark construction times reach the preset iteration times, taking all pseudo mark texts as pseudo mark data;
and the training module is used for taking the pseudo-mark data and the obtained marked data as source end data, taking the unmarked data as target end data and training a text error correction model based on a sequence-to-sequence method of a pointer network.
The invention also provides a training device of the text error correction model, which comprises: a processor and a memory;
the processor is used for executing the text generation program stored in the memory to realize the training method of the text correction model as described in any one of the above items.
The present invention also provides a storage medium storing one or more programs which, when executed, implement the method of training a text correction model according to any one of claims 1 to 7.
The invention has the beneficial effects that:
the method comprises the steps of carrying out pseudo-mark construction on each text in the obtained unmarked data through a pseudo-mark rule randomly selected from a plurality of preset pseudo-mark rules to obtain a pseudo-mark text corresponding to each text until the pseudo-mark construction times reach the preset iteration times, taking all the pseudo-mark texts as the pseudo-mark data, reducing the work of manual marking, increasing the data volume, further taking the pseudo-mark data and the obtained marked data as source end data, taking the unmarked data as target end data, training a text error correction model based on a pointer network sequence-to-sequence method, improving the training efficiency of the text error correction model, accurately judging which texts can be reserved and which texts need to be modified, and greatly reducing the phenomenon of correct text modification errors.
Drawings
FIG. 1 is a flowchart of an embodiment of a method for training a text correction model according to the present invention;
FIG. 2 is a schematic structural diagram of an embodiment of a training apparatus for text error correction models according to the present invention;
FIG. 3 is a schematic structural diagram of an embodiment of a training apparatus for a text error correction model according to the present invention.
Detailed Description
The principles and features of this invention are described below in conjunction with the following drawings, which are set forth by way of illustration only and are not intended to limit the scope of the invention.
Fig. 1 is a flowchart of an embodiment of a training method of a text correction model according to the present invention, and as shown in fig. 1, the training method of the text correction model of the present embodiment may include the following steps:
100. Performing pseudo-mark construction on each text in the obtained unmarked data based on a pseudo-mark rule randomly selected from a plurality of preset pseudo-mark rules, to obtain a pseudo-mark text corresponding to each text.
In a specific implementation process, for domains with a small amount of text, such as medical data, data expansion can be performed on the data so as to obtain a larger amount of text.
Specifically, the required data can be acquired as unmarked data through data capture, manual writing and other means, and each text in the acquired unmarked data is subjected to pseudo-mark construction according to a pseudo-mark rule randomly selected from the preset pseudo-mark rules, so that a pseudo-mark text corresponding to each text is obtained.
In a specific implementation process, the preset multiple pseudo-marking rules include: randomly deleting at least one word at a preset first probability at the position of each word in each text; and/or, randomly inserting at least one word at a preset first probability in the position of each word in each text; and/or adding noise to the position of each word in each text according to normal distribution, and reordering each word added with noise; and/or, collecting and constructing a phonetic near word dictionary, and replacing each word in each text with a phonetic near word according to a preset third probability; and/or, collecting and constructing a similar word dictionary, and replacing each word in each text with a similar word according to a preset fourth probability; and/or, maintaining each word in each piece of text.
For example, if the unmarked text is "cerebral blood supply insufficiency", pseudo-mark texts such as "cerebral blood supply insufficiency" (kept unchanged) and "cerebral write insufficiency" (with a character replaced) can be obtained after expansion.
In a specific implementation process, if the plurality of preset pseudo-mark rules include randomly inserting, with a preset first probability, at least one word at the position of each word in each text, a word table may be constructed according to the character frequencies in the unmarked data. Specifically, the characters in the unmarked data whose frequency is greater than or equal to a preset threshold may be taken as target characters, and the word table is constructed from the target characters. After the word table has been constructed, the at least one inserted word may be selected from the word table, as sketched below.
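By way of illustration, the following is a minimal Python sketch of the pseudo-mark construction described above. The probability values, the iteration count, the frequency threshold and the similar-pronunciation dictionary are illustrative placeholders, not values fixed by this embodiment.

```python
import random

def build_word_table(unmarked_texts, freq_threshold=5):
    """Characters whose frequency >= the preset threshold form the word table."""
    freq = {}
    for text in unmarked_texts:
        for ch in text:
            freq[ch] = freq.get(ch, 0) + 1
    return [ch for ch, n in freq.items() if n >= freq_threshold]

def pseudo_mark(text, word_table, similar_sound=None, p=0.1):
    """Apply one randomly selected pseudo-mark rule to a single text."""
    rule = random.choice(["delete", "insert", "reorder", "similar_sound", "keep"])
    chars = list(text)
    if rule == "delete":
        chars = [c for c in chars if random.random() > p] or chars
    elif rule == "insert":
        chars = [x for c in chars
                 for x in ([random.choice(word_table), c]
                           if (word_table and random.random() < p) else [c])]
    elif rule == "reorder":
        # Add normally distributed noise to each position, then re-sort.
        keys = [i + random.gauss(0, 1) for i in range(len(chars))]
        chars = [c for _, c in sorted(zip(keys, chars))]
    elif rule == "similar_sound" and similar_sound:
        chars = [random.choice(similar_sound.get(c, [c])) if random.random() < p else c
                 for c in chars]
    return "".join(chars)

def build_pseudo_mark_data(unmarked_texts, iterations=3):
    """Repeat pseudo-mark construction until the preset iteration count is reached."""
    word_table = build_word_table(unmarked_texts)
    pseudo_data = []
    for _ in range(iterations):
        pseudo_data += [pseudo_mark(t, word_table) for t in unmarked_texts]
    return pseudo_data
```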
101. Detecting whether the number of pseudo-mark constructions has reached a preset number of iterations.
102. If the number of pseudo-mark constructions has reached the preset number of iterations, taking all pseudo-mark texts as pseudo-mark data.
In a specific implementation process, pseudo-mark construction can be performed multiple times on each text in the obtained unmarked data. After each construction, the current number of pseudo-mark constructions is recorded so that it can be checked against the preset number of iterations; once that number is reached, all pseudo-mark texts are taken as pseudo-mark data.
103. Taking the pseudo-mark data and the obtained marked data as source-end data, taking the unmarked data as target-end data, and training a text error correction model based on a pointer-network sequence-to-sequence method.
In a specific implementation, step 103 may be implemented according to the following steps:
(1) carrying out word sequence division on the source end data, and carrying out word vector processing on the obtained word sequence to obtain a word matrix corresponding to the word sequence;
In one specific implementation, x = [x_1, x_2, ..., x_n] may be defined as the word sequence of the source-end data and X = [v_1, v_2, ..., v_i, ..., v_n] as the word matrix corresponding to the word sequence, where v_i is the vector of the i-th word.
(2) Encoding the word matrix by using an encoder to obtain an input word sequence encoding representation;
In one implementation, Self_enc(·) may be defined as the encoder computation unit based on the self-attention mechanism. The encoded representation of the word sequence after each encoder layer can then be calculated by calculation formula (1):

h^m = Self_enc(h^(m-1)),  m = 1, ..., N,  with h^0 = X    (1)

where h^m denotes the encoded representation of the word sequence at the m-th layer. After encoding by the encoder, the encoded representation h^N of the topmost layer of the encoder is obtained.
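As a rough sketch of steps (1) and (2), the snippet below embeds the source word sequence and passes it through a stack of self-attention layers; torch.nn.TransformerEncoder is used here only as a stand-in for the Self_enc(·) computation unit, and all dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, vocab_size, d_model=256, n_layers=6, n_heads=8):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)   # word matrix X = [v_1 ... v_n]
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.layers = nn.TransformerEncoder(layer, n_layers)

    def forward(self, x_ids):
        X = self.embed(x_ids)      # (batch, n, d_model): word-vector processing
        hN = self.layers(X)        # topmost-layer encoded representation h^N
        return hN
```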
(3) Decoding the input word sequence code by using a decoder under an attention mechanism to obtain error correction data corresponding to the source end data;
Define y = [y_1, y_2, ..., y_n] as the word sequence of the error correction result, and Y = [u_1, u_2, ..., u_i, ..., u_n] as the matrix obtained by word-vector preprocessing of the word sequence of the target-end input data, where u_i is the vector of the i-th word.
In a specific implementation, the implementation of this step is as follows:
(31) decoding the input word sequence code to obtain an error-corrected output word sequence code representation;
Specifically, Self_dec(·) may be defined as the decoder computation unit based on the self-attention mechanism. The encoded representation of the output word sequence at time t produced by the decoder is obtained by calculation formula (2):

s_t^n = Self_dec(s_t^(n-1), h^N),  with s_t^0 = u_t    (2)

where s_t^n is the encoded representation of the output word sequence at time t in the n-th layer of the target side, u_t is the input of the decoder at time t, and h^N is the hidden state of the encoder; the topmost layer yields the output word sequence encoded representation s_t^N.
(32) Inputting the output word sequence encoded representation into the logistic regression layer of the decoder for linear transformation, and outputting the initial probability distribution at each time t in the target-end data.
In a specific implementation process, the output word sequence encoded representation is input into the logistic regression layer of the decoder for linear transformation, and the initial probability distribution in the target-end data at each time t is output.
The output word sequence encoded representation is first linearly transformed to obtain O_t, the input representation of the softmax layer of the decoder. The O_t obtained by the linear transformation is then passed through softmax to output the initial probability distribution in the target-end data at each time t:

Prob_gen = softmax(W · O_t + b)

where W and b are model parameters, and the dimension of W is the same as the size of the word list.
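A small sketch of this step: the decoder output representation O_t is projected to the word-list dimension and normalised with softmax. The model dimension and vocabulary size are illustrative assumptions.

```python
import torch
import torch.nn as nn

d_model, vocab_size = 256, 30000
proj = nn.Linear(d_model, vocab_size)        # W and b; W sized like the word list

O_t = torch.randn(1, d_model)                # input representation of the softmax layer
Prob_gen = torch.softmax(proj(O_t), dim=-1)  # initial probability distribution at time t
```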
(33) Determining the fusion probability distribution at each time t in the target-end data according to the probability distribution at each time t in the target-end data and the obtained replication mechanism score.
in a specific implementation process, the obtaining process of the replication mechanism score includes:
a. Performing matrix transformation on the output word sequence encoded representation to obtain an output vector.
Specifically, the output word sequence encoded representation may be matrix-transformed according to calculation formula (3) to obtain the output vector q_t:

q_t = W_1 · s_t^N    (3)

where W_1 denotes the first matrix.
b. Performing matrix transformation on the hidden state of the encoder to obtain a key vector and a value vector.
Specifically, the hidden state of the encoder may be matrix-transformed according to calculation formula (4) to obtain the key vector K and the value vector V:

K = W_2 · h^N,  V = W_3 · h^N    (4)

where W_2 denotes the second matrix and W_3 denotes the third matrix.
c. Determining the replication mechanism score based on the output vector, the key vector, and the value vector.
In one specific implementation, the replication mechanism score may be determined according to calculation formula (5), which computes the score from the output vector q_t, the key vector K and the value vector V.
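Since formula (5) is given only in outline above, the following is one plausible attention-style realisation of steps a to c, not the exact formula of this embodiment: the output vector q_t attends over the encoder states through the key vector K, and the resulting weights over source positions serve as the replication score; the scaling and the use of the value vector V are assumptions.

```python
import torch

d, n_src = 256, 12                         # model dimension, number of source tokens (illustrative)
hN = torch.randn(n_src, d)                 # encoder hidden states h^N

W1, W2, W3 = (torch.randn(d, d) for _ in range(3))   # first, second and third matrices
s_t = torch.randn(d)                       # output word sequence encoded representation at time t

q_t = s_t @ W1                             # output vector, formula (3)
K, V = hN @ W2, hN @ W3                    # key and value vectors, formula (4)

copy_score = torch.softmax(q_t @ K.T / d ** 0.5, dim=-1)  # weights over source positions
context = copy_score @ V                   # value-weighted summary (one possible use of V)
```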
In this embodiment, after the replication mechanism score is obtained, the probability distribution at each time t in the target-end data and the obtained replication mechanism score may be fused according to calculation formula (6) to obtain the fusion probability distribution at each time t in the target-end data.
(34) Selecting the word corresponding to the maximum fusion probability as the generation result at time t.
(35) Generating error correction data corresponding to the source-end data based on the words at all times.
After the fusion probability distribution at each time t in the target-end data has been obtained, the word corresponding to the maximum fusion probability can be selected as the generation result at time t, and the error correction data corresponding to the source-end data is generated based on the words at all times.
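The sketch below illustrates steps (33) to (35) under stated assumptions: the copy weights are scattered onto the word list via the source token ids, fused with the generation distribution using a gating weight alpha, and the word with the largest fused probability is taken as the result at time t. The gating weight and the src_token_ids mapping are illustrative; formula (6) itself is not reproduced in the text.

```python
import torch

vocab_size, n_src = 30000, 12
Prob_gen = torch.softmax(torch.randn(vocab_size), dim=-1)        # generation distribution
copy_score = torch.softmax(torch.randn(n_src), dim=-1)           # weights over source positions
src_token_ids = torch.randint(0, vocab_size, (n_src,))           # word-list ids of source tokens

# Map the copy score onto the word list, then fuse with a gating weight.
Prob_copy = torch.zeros(vocab_size).scatter_add_(0, src_token_ids, copy_score)
alpha = 0.5
Prob_fused = alpha * Prob_copy + (1 - alpha) * Prob_gen

best_word_id = int(torch.argmax(Prob_fused))                     # generation result at time t
```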
(4) Determining a loss value based on the error correction data and the target end data;
specifically, the generated error correction data is compared with the target end data, and the loss value is calculated by a loss function.
(5) And performing iterative training on the current model based on the loss value until a training stopping condition is reached to obtain a text error correction model.
Specifically, the threshold value of the loss value may be set in advance as a condition for stopping training. For example, the threshold value is set to 0.2. This is not limited by the present application.
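As an illustration of steps (4) and (5), the following training loop computes a cross-entropy loss between the error correction output and the target-end data and stops once the average loss falls below a preset threshold (0.2 in the example above). The model interface, batches and optimizer settings are placeholders, not part of this embodiment.

```python
import torch
import torch.nn as nn

def train(model, batches, lr=1e-4, loss_threshold=0.2, max_epochs=50):
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    criterion = nn.CrossEntropyLoss()
    for epoch in range(max_epochs):
        total = 0.0
        for src_ids, tgt_ids in batches:
            logits = model(src_ids, tgt_ids)                 # (batch, len, vocab)
            loss = criterion(logits.flatten(0, 1), tgt_ids.flatten())
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            total += loss.item()
        if total / max(len(batches), 1) < loss_threshold:    # training stop condition
            break
    return model
```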
In the training method of the text error correction model of this embodiment, a pseudo-mark rule is randomly selected from the preset pseudo-mark rules and pseudo-mark construction is performed on each text in the obtained unmarked data to obtain a pseudo-mark text corresponding to each text, until the number of pseudo-mark constructions reaches the preset number of iterations, at which point all pseudo-mark texts are used as pseudo-mark data. This reduces manual labeling work and increases the amount of data. The pseudo-mark data and the obtained marked data can then be used as source-end data, the unmarked data as target-end data, and the text error correction model is trained based on the pointer-network sequence-to-sequence method. This improves the training efficiency of the text error correction model, allows the model to accurately judge which parts of a text can be kept and which need to be modified, and greatly reduces the phenomenon of correct text being modified into errors.
It should be noted that the method of the embodiment of the present invention may be executed by a single device, such as a computer or a server. The method of the embodiment can also be applied to a distributed scene and completed by the mutual cooperation of a plurality of devices. In the case of such a distributed scenario, one device of the multiple devices may only perform one or more steps of the method according to the embodiment of the present invention, and the multiple devices interact with each other to complete the method.
Fig. 2 is a schematic structural diagram of an embodiment of a training apparatus for a text correction model according to the present invention, and as shown in fig. 2, the training apparatus for a text correction model according to this embodiment may include a pseudo label construction module 20, a detection module 21, and a training module 22.
A pseudo mark constructing module 20, configured to perform pseudo mark construction on each text in the unmarked data based on a pseudo mark rule randomly selected from a plurality of preset pseudo mark rules, so as to obtain a pseudo mark text corresponding to each text;
In a specific implementation process, the preset plurality of pseudo-mark rules include: randomly deleting, with a preset first probability, at least one word at the position of each word in each text; and/or randomly inserting, with a preset first probability, at least one word at the position of each word in each text; and/or adding noise to the position of each word in each text according to a normal distribution and reordering the words after the noise is added; and/or collecting and constructing a dictionary of similar-pronunciation words and replacing each word in each text with a similar-pronunciation word according to a preset third probability; and/or collecting and constructing a dictionary of similar words and replacing each word in each text with a similar word according to a preset fourth probability; and/or keeping each word in each piece of text unchanged.
In a specific implementation process, if the plurality of preset pseudo-mark rules include randomly inserting, with a preset first probability, at least one word at the position of each word in each text, a word table can be constructed according to the character frequencies in the unmarked data, and the at least one inserted word is selected from the word table.
Specifically, the process of constructing the word table is as follows:
taking characters with the character frequency larger than or equal to a preset threshold value in the unmarked data as target characters; and constructing the word table according to the target character.
The detection module 21 is configured to detect whether the number of pseudo-mark constructions has reached a preset number of iterations and, if so, to take all pseudo-mark texts as pseudo-mark data.
and the training module 22 is configured to use the pseudo tag data and the obtained tagged data as source end data, use the non-tag data as target end data, and train a text error correction model based on a sequence-to-sequence method of a pointer network.
In one embodiment, the training module 22 may implement the training of the text error correction model according to the following steps:
(1) carrying out word sequence division on the source end data, and carrying out word vector processing on the obtained word sequence to obtain a word matrix corresponding to the word sequence;
In one specific implementation, x = [x_1, x_2, ..., x_n] may be defined as the word sequence of the source-end data and X = [v_1, v_2, ..., v_i, ..., v_n] as the word matrix corresponding to the word sequence, where v_i is the vector of the i-th word.
(2) Encoding the word matrix by using an encoder to obtain an input word sequence encoding representation;
In one implementation, Self_enc(·) may be defined as the encoder computation unit based on the self-attention mechanism. The encoded representation of the word sequence after each encoder layer can then be calculated by calculation formula (1):

h^m = Self_enc(h^(m-1)),  m = 1, ..., N,  with h^0 = X    (1)

where h^m denotes the encoded representation of the word sequence at the m-th layer. After encoding by the encoder, the encoded representation h^N of the topmost layer of the encoder is obtained.
(3) Decoding the input word sequence code by using a decoder under an attention mechanism to obtain error correction data corresponding to the source end data;
Define y = [y_1, y_2, ..., y_n] as the word sequence of the error correction result, and Y = [u_1, u_2, ..., u_i, ..., u_n] as the matrix obtained by word-vector preprocessing of the word sequence of the target-end input data, where u_i is the vector of the i-th word.
In a specific implementation, the implementation of this step is as follows:
(31) decoding the input word sequence code to obtain an error-corrected output word sequence code representation;
Specifically, Self_dec(·) may be defined as the decoder computation unit based on the self-attention mechanism. The encoded representation of the output word sequence at time t produced by the decoder is obtained by calculation formula (2):

s_t^n = Self_dec(s_t^(n-1), h^N),  with s_t^0 = u_t    (2)

where s_t^n is the encoded representation of the output word sequence at time t in the n-th layer of the target side, u_t is the input of the decoder at time t, and h^N is the hidden state of the encoder; the topmost layer yields the output word sequence encoded representation s_t^N.
(32) Inputting the output word sequence encoded representation into the logistic regression layer of the decoder for linear transformation, and outputting the initial probability distribution at each time t in the target-end data.
In a specific implementation process, the output word sequence encoded representation is input into the logistic regression layer of the decoder for linear transformation, and the initial probability distribution in the target-end data at each time t is output.
The output word sequence encoded representation is first linearly transformed to obtain O_t, the input representation of the softmax layer of the decoder. The O_t obtained by the linear transformation is then passed through softmax to output the initial probability distribution in the target-end data at each time t:

Prob_gen = softmax(W · O_t + b)

where W and b are model parameters, and the dimension of W is the same as the size of the word list.
(33) Determining the fusion probability distribution at each time t in the target-end data according to the probability distribution at each time t in the target-end data and the obtained replication mechanism score.
in a specific implementation process, the obtaining process of the replication mechanism score includes:
a. Performing matrix transformation on the output word sequence encoded representation to obtain an output vector.
Specifically, the output word sequence encoded representation may be matrix-transformed according to calculation formula (3) to obtain the output vector q_t:

q_t = W_1 · s_t^N    (3)

where W_1 denotes the first matrix.
b. Performing matrix transformation on the hidden state of the encoder to obtain a key vector and a value vector.
Specifically, the hidden state of the encoder may be matrix-transformed according to calculation formula (4) to obtain the key vector K and the value vector V:

K = W_2 · h^N,  V = W_3 · h^N    (4)

where W_2 denotes the second matrix and W_3 denotes the third matrix.
c. Determining the replication mechanism score based on the output vector, the key vector, and the value vector.
In one specific implementation, the replication mechanism score may be determined according to calculation formula (5), which computes the score from the output vector q_t, the key vector K and the value vector V.
In this embodiment, after the replication mechanism score is obtained, the probability distribution at each time t in the target-end data and the obtained replication mechanism score may be fused according to calculation formula (6) to obtain the fusion probability distribution at each time t in the target-end data.
(34) Selecting the word corresponding to the maximum fusion probability as the generation result at time t.
(35) Generating error correction data corresponding to the source-end data based on the words at all times.
After the fusion probability distribution at each time t in the target-end data has been obtained, the word corresponding to the maximum fusion probability can be selected as the generation result at time t, and the error correction data corresponding to the source-end data is generated based on the words at all times.
(4) Determining a loss value based on the error correction data and the target end data;
specifically, the generated error correction data is compared with the target end data, and the loss value is calculated by a loss function.
(5) And performing iterative training on the current model based on the loss value until a training stopping condition is reached to obtain a text error correction model.
Specifically, the threshold value of the loss value may be set in advance as a condition for stopping training. For example, the threshold value is set to 0.2. This is not limited by the present application.
The apparatus of the foregoing embodiment is used to implement the corresponding method in the foregoing embodiment, and specific implementation schemes thereof may refer to the method described in the foregoing embodiment and relevant descriptions in the method embodiment, and have beneficial effects of the corresponding method embodiment, which are not described herein again.
Fig. 3 is a schematic structural diagram of an embodiment of a training device for a text error correction model according to the present invention. As shown in Fig. 3, the device of this embodiment may include a processor 1010 and a memory 1020. Those skilled in the art will appreciate that the device may also include an input/output interface 1030, a communication interface 1040, and a bus 1050, with the processor 1010, memory 1020, input/output interface 1030 and communication interface 1040 communicatively coupled to each other within the device via the bus 1050.
The processor 1010 may be implemented by a general-purpose CPU (Central Processing Unit), a microprocessor, an Application Specific Integrated Circuit (ASIC), or one or more Integrated circuits, and is configured to execute related programs to implement the technical solutions provided in the embodiments of the present disclosure.
The Memory 1020 may be implemented in the form of a ROM (Read Only Memory), a RAM (Random Access Memory), a static storage device, a dynamic storage device, or the like. The memory 1020 may store an operating system and other application programs, and when the technical solution provided by the embodiments of the present specification is implemented by software or firmware, the relevant program codes are stored in the memory 1020 and called to be executed by the processor 1010.
The input/output interface 1030 is used for connecting an input/output module to input and output information. The i/o module may be configured as a component in a device (not shown) or may be external to the device to provide a corresponding function. The input devices may include a keyboard, a mouse, a touch screen, a microphone, various sensors, etc., and the output devices may include a display, a speaker, a vibrator, an indicator light, etc.
The communication interface 1040 is used for connecting a communication module (not shown in the drawings) to implement communication interaction between the present apparatus and other apparatuses. The communication module can realize communication in a wired mode (such as USB, network cable and the like) and also can realize communication in a wireless mode (such as mobile network, WIFI, Bluetooth and the like).
Bus 1050 includes a path that transfers information between various components of the device, such as processor 1010, memory 1020, input/output interface 1030, and communication interface 1040.
It should be noted that although the above-mentioned device only shows the processor 1010, the memory 1020, the input/output interface 1030, the communication interface 1040 and the bus 1050, in a specific implementation, the device may also include other components necessary for normal operation. In addition, those skilled in the art will appreciate that the above-described apparatus may also include only those components necessary to implement the embodiments of the present description, and not necessarily all of the components shown in the figures.
The present invention also provides a storage medium storing one or more programs that, when executed, implement the method for training a text correction model of the above-described embodiments.
The invention also provides a text error correction method, which comprises the following steps:
and inputting the text to be corrected into the text correction model obtained in the embodiment, and outputting the standard text corresponding to the text to be corrected.
Computer-readable media of the present embodiments, including both permanent and non-permanent, removable and non-removable media, may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read only memory (ROM), electrically erasable programmable read only memory (EEPROM), flash memory or other memory technology, compact disc read only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information that can be accessed by a computing device.
Those of ordinary skill in the art will understand that the discussion of any embodiment above is meant to be exemplary only and is not intended to imply that the scope of the disclosure, including the claims, is limited to these examples; within the idea of the invention, features in the above embodiments or in different embodiments may also be combined, steps may be implemented in any order, and there are many other variations of the different aspects of the invention as described above which are not provided in detail for the sake of brevity.
In addition, well known power/ground connections to Integrated Circuit (IC) chips and other components may or may not be shown within the provided figures for simplicity of illustration and discussion, and so as not to obscure the invention. Furthermore, devices may be shown in block diagram form in order to avoid obscuring the invention, and also in view of the fact that specifics with respect to implementation of such block diagram devices are highly dependent upon the platform within which the present invention is to be implemented (i.e., specifics should be well within purview of one skilled in the art). Where specific details (e.g., circuits) are set forth in order to describe example embodiments of the invention, it should be apparent to one skilled in the art that the invention can be practiced without, or with variation of, these specific details. Accordingly, the description is to be regarded as illustrative instead of restrictive.
While the present invention has been described in conjunction with specific embodiments thereof, many alternatives, modifications, and variations of these embodiments will be apparent to those of ordinary skill in the art in light of the foregoing description. For example, other memory architectures (e.g., dynamic RAM (DRAM)) may use the discussed embodiments.
While the invention has been described with reference to specific embodiments, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (10)

1. A training method of a text correction model is characterized by comprising the following steps:
performing pseudo-mark construction on each text in the obtained unmarked data based on a pseudo-mark rule randomly selected from a plurality of preset pseudo-mark rules to obtain a pseudo-mark text corresponding to each text;
detecting whether the number of pseudo-mark constructions has reached a preset number of iterations;
if the number of pseudo-mark constructions has reached the preset number of iterations, taking all pseudo-mark texts as pseudo-mark data;
and taking the pseudo-mark data and the obtained marked data as source-end data, taking the unmarked data as target-end data, and training a text error correction model based on a pointer-network sequence-to-sequence method.
2. The method for training the text correction model according to claim 1, wherein the preset plurality of pseudo labeling rules comprise:
randomly deleting, with a preset first probability, at least one word at the position of each word in each text; and/or
randomly inserting, with a preset first probability, at least one character at the position of each word in each text; and/or
adding noise to the position of each word in each text according to a normal distribution, and reordering the words after the noise is added; and/or
collecting and constructing a dictionary of similar-pronunciation words, and replacing each word in each text with a similar-pronunciation word according to a preset third probability; and/or
collecting and constructing a dictionary of similar words, and replacing each word in each text with a similar word according to a preset fourth probability; and/or
keeping each word in each piece of text unchanged.
3. The method of claim 2, wherein the predetermined plurality of pseudo-tagging rules includes randomly inserting at least one word at a predetermined first probability at a position of each word in each text, the method further comprising:
constructing a word table according to the character frequency in the unmarked data;
selecting at least one inserted word from the word table.
4. The method for training the text correction model according to claim 3, wherein constructing the word table according to the frequency of the characters in the label-free data comprises:
taking characters with the character frequency larger than or equal to a preset threshold value in the unmarked data as target characters;
and constructing the word table according to the target character.
5. The method for training the text correction model according to claim 1, wherein the training of the text correction model based on a sequence-to-sequence method of a pointer network comprises:
carrying out word sequence division on the source end data, and carrying out word vector processing on the obtained word sequence to obtain a word matrix corresponding to the word sequence;
encoding the word matrix by using an encoder to obtain an input word sequence encoding representation;
decoding the input word sequence code by using a decoder under an attention mechanism to obtain error correction data corresponding to the source end data;
determining a loss value based on the error correction data and the target end data;
and performing iterative training on the current model based on the loss value until a training stopping condition is reached to obtain a text error correction model.
6. The method for training a text error correction model according to claim 5, wherein decoding the input word sequence code to obtain error correction data corresponding to the source data comprises:
decoding the input word sequence code to obtain an error-corrected output word sequence code representation;
inputting the output word sequence encoded representation into a logistic regression layer of the decoder for linear transformation, and outputting an initial probability distribution at each time t in the target-end data;
determining the fusion probability distribution at each time t in the target-end data according to the probability distribution at each time t in the target-end data and an obtained replication mechanism score;
selecting the word corresponding to the maximum fusion probability as the generation result at time t;
and generating error correction data corresponding to the source-end data based on the words at all times.
7. The method for training the text correction model according to claim 6, wherein the obtaining of the replication mechanism score comprises:
performing matrix transformation on the coded representation of the output word sequence to obtain an output vector;
performing matrix transformation on the hidden state of the encoder to obtain a key vector and a value vector;
determining the replication mechanism score based on the output vector, the key vector, and the value vector.
8. An apparatus for training a text correction model, comprising:
the pseudo mark construction module is used for carrying out pseudo mark construction on each text in the unmarked data based on a pseudo mark rule randomly selected from a plurality of preset pseudo mark rules to obtain a pseudo mark text corresponding to each text;
the detection module is used for detecting whether the construction times of the pseudo mark reach the preset iteration times; if the pseudo mark construction times reach the preset iteration times, taking all pseudo mark texts as pseudo mark data;
and the training module is used for taking the pseudo-mark data and the obtained marked data as source end data, taking the unmarked data as target end data and training a text error correction model based on a sequence-to-sequence method of a pointer network.
9. An apparatus for training a text correction model, comprising: a processor and a memory;
the processor is configured to execute the text generation program stored in the memory to implement the training method of the text correction model according to any one of claims 1 to 7.
10. A storage medium storing one or more programs which, when executed, implement the method of training a text correction model according to any one of claims 1-7.
CN202110616159.6A 2021-06-02 2021-06-02 Method, apparatus and storage medium for training text error correction model Pending CN113191119A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110616159.6A CN113191119A (en) 2021-06-02 2021-06-02 Method, apparatus and storage medium for training text error correction model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110616159.6A CN113191119A (en) 2021-06-02 2021-06-02 Method, apparatus and storage medium for training text error correction model

Publications (1)

Publication Number Publication Date
CN113191119A true CN113191119A (en) 2021-07-30

Family

ID=76975819

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110616159.6A Pending CN113191119A (en) 2021-06-02 2021-06-02 Method, apparatus and storage medium for training text error correction model

Country Status (1)

Country Link
CN (1) CN113191119A (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150254327A1 (en) * 2014-03-07 2015-09-10 Tata Consultancy Services Limited System and method for rectifying a typographical error in a text file
CN111401012A (en) * 2020-03-09 2020-07-10 北京声智科技有限公司 Text error correction method, electronic device and computer readable storage medium
CN111563390A (en) * 2020-04-28 2020-08-21 北京字节跳动网络技术有限公司 Text generation method and device and electronic equipment
CN111859919A (en) * 2019-12-02 2020-10-30 北京嘀嘀无限科技发展有限公司 Text error correction model training method and device, electronic equipment and storage medium
CN112417823A (en) * 2020-09-16 2021-02-26 中国科学院计算技术研究所 Chinese text word order adjusting and quantitative word completion method and system
CN112861519A (en) * 2021-03-12 2021-05-28 云知声智能科技股份有限公司 Medical text error correction method, device and storage medium

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150254327A1 (en) * 2014-03-07 2015-09-10 Tata Consultancy Services Limited System and method for rectifying a typographical error in a text file
CN111859919A (en) * 2019-12-02 2020-10-30 北京嘀嘀无限科技发展有限公司 Text error correction model training method and device, electronic equipment and storage medium
CN111401012A (en) * 2020-03-09 2020-07-10 北京声智科技有限公司 Text error correction method, electronic device and computer readable storage medium
CN111563390A (en) * 2020-04-28 2020-08-21 北京字节跳动网络技术有限公司 Text generation method and device and electronic equipment
CN112417823A (en) * 2020-09-16 2021-02-26 中国科学院计算技术研究所 Chinese text word order adjusting and quantitative word completion method and system
CN112861519A (en) * 2021-03-12 2021-05-28 云知声智能科技股份有限公司 Medical text error correction method, device and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
WEI ZHAO et al.: "Improving Grammatical Error Correction via Pre-Training a Copy-Augmented Architecture with Unlabeled Data", https://arxiv.org/abs/1903.00138, pages 1-10 *

Similar Documents

Publication Publication Date Title
CN107767870B (en) Punctuation mark adding method and device and computer equipment
CN105068998A (en) Translation method and translation device based on neural network model
GB2556978A (en) Testing applications with a defined input format
CN113590761B (en) Training method of text processing model, text processing method and related equipment
US20120036133A1 (en) Computing device and method for searching for parameters in a data model
CN111611811B (en) Translation method, translation device, electronic equipment and computer readable storage medium
CN111666427A (en) Entity relationship joint extraction method, device, equipment and medium
CN110866402B (en) Named entity identification method and device, storage medium and electronic equipment
CN110019865B (en) Mass image processing method and device, electronic equipment and storage medium
CN110046637B (en) Training method, device and equipment for contract paragraph annotation model
CN111079944B (en) Transfer learning model interpretation realization method and device, electronic equipment and storage medium
CN107451106A (en) Text method and device for correcting, electronic equipment
CN110709855A (en) Techniques for dense video description
WO2019092868A1 (en) Information processing device, information processing method, and computer-readable recording medium
US20150339291A1 (en) Method and apparatus for performing bilingual word alignment
CN113743101A (en) Text error correction method and device, electronic equipment and computer storage medium
CN117290694A (en) Question-answering system evaluation method, device, computing equipment and storage medium
CN113191119A (en) Method, apparatus and storage medium for training text error correction model
CN115374766A (en) Text punctuation recovery method and related equipment
CN115454423A (en) Static webpage generation method and device, electronic equipment and storage medium
CN112836527B (en) Training method, system, equipment and storage medium of machine translation model
CN112183088B (en) Word level determining method, model building method, device and equipment
CN113204944A (en) Text generation method, device, equipment and storage medium
CN109190091B (en) Encoding and decoding method and device
CN112069549B (en) Method and system for downloading picture when Bootstrap-table plug-in exports table

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination