CN113191119A - Method, apparatus and storage medium for training text error correction model - Google Patents

Method, apparatus and storage medium for training text error correction model

Info

Publication number
CN113191119A
CN113191119A (application CN202110616159.6A)
Authority
CN
China
Prior art keywords
text
word
data
pseudo
mark
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110616159.6A
Other languages
Chinese (zh)
Inventor
王亦宁
刘升平
梁家恩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Unisound Intelligent Technology Co Ltd
Xiamen Yunzhixin Intelligent Technology Co Ltd
Original Assignee
Unisound Intelligent Technology Co Ltd
Xiamen Yunzhixin Intelligent Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Unisound Intelligent Technology Co Ltd and Xiamen Yunzhixin Intelligent Technology Co Ltd
Priority to CN202110616159.6A
Publication of CN113191119A
Current legal status: Pending


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/103 Formatting, i.e. changing of presentation of documents
    • G06F40/117 Tagging; Marking up; Designating a block; Setting of attributes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/12 Use of codes for handling textual entities
    • G06F40/126 Character encoding

Abstract

The invention relates to a training method, device and storage medium for a text error correction model. The method includes: performing pseudo-mark construction on each text in acquired unmarked data based on a pseudo-mark rule randomly selected from a plurality of preset pseudo-mark rules, to obtain a pseudo-mark text corresponding to each text; detecting whether the number of pseudo-mark constructions has reached a preset number of iterations; and if so, taking all pseudo-mark texts as pseudo-mark data, which reduces manual labeling work and increases the amount of data. The pseudo-mark data and the obtained marked data can then be used as source-end data, the unmarked data as target-end data, and a text error correction model is trained with a pointer-network-based sequence-to-sequence method, which improves the training efficiency of the text error correction model.

Description

Method, apparatus and storage medium for training text error correction model
Technical Field
The invention relates to the technical field of information processing, and in particular to a training method, device and storage medium for a text error correction model.
Background
Text error correction is an important research direction in computer natural language processing: errors introduced into a text by human factors, such as wrongly written characters and word-order errors, can be corrected through a computer algorithm.
Existing generative text error correction models usually need large-scale labeled data to be collected before an end-to-end generative model can be trained to realize the correction process from erroneous text to correct text.
However, for domains with little data, such as medical data, collecting and labeling the data takes considerable time and labor, which reduces the training efficiency of the text error correction model.
Disclosure of Invention
The invention provides a training method, device and storage medium for a text error correction model, which aim to solve the technical problems of low accuracy and low quality of generated text in the prior art.
The technical scheme for solving the technical problems is as follows:
a training method of a text correction model comprises the following steps:
performing pseudo-mark construction on each text in the obtained unmarked data based on a pseudo-mark rule randomly selected from a plurality of preset pseudo-mark rules to obtain a pseudo-mark text corresponding to each text;
detecting whether the number of pseudo-mark constructions has reached a preset number of iterations;
if the number of pseudo-mark constructions has reached the preset number of iterations, taking all pseudo-mark texts as pseudo-mark data;
and taking the pseudo-mark data and the obtained marked data as source-end data, taking the unmarked data as target-end data, and training a text error correction model based on a pointer-network sequence-to-sequence method.
Further, in the above method for training a text error correction model, the preset multiple pseudo-mark rules include:
randomly deleting, with a preset first probability, at least one word at the position of each word in each text; and/or
randomly inserting, with a preset first probability, at least one character at the position of each word in each text; and/or
adding noise to the position of each word in each text according to a normal distribution, and reordering the words after the noise is added; and/or
collecting and constructing a dictionary of similar-pronunciation words, and replacing each word in each text with a similar-pronunciation word according to a preset third probability; and/or
collecting and constructing a dictionary of similar words, and replacing each word in each text with a similar word according to a preset fourth probability; and/or
keeping each word in each piece of text unchanged.
Further, in the above method for training a text error correction model, the predetermined plurality of pseudo-mark rules includes randomly inserting at least one word at a predetermined first probability at a position of each word in each text, and the method further includes:
constructing a word table according to the character frequency in the unmarked data;
selecting at least one inserted word from the word table.
Further, in the above method for training a text error correction model, constructing a word table according to the frequency of characters in the unlabeled data includes:
taking characters with the character frequency larger than or equal to a preset threshold value in the unmarked data as target characters;
and constructing the word table according to the target character.
Further, in the above method for training a text error correction model, the training of the text error correction model based on a sequence-to-sequence method of a pointer network includes:
carrying out word sequence division on the source end data, and carrying out word vector processing on the obtained word sequence to obtain a word matrix corresponding to the word sequence;
encoding the word matrix by using an encoder to obtain an input word sequence encoding representation;
decoding the input word sequence code by using a decoder under an attention mechanism to obtain error correction data corresponding to the source end data;
determining a loss value based on the error correction data and the target end data;
and performing iterative training on the current model based on the loss value until a training stopping condition is reached to obtain a text error correction model.
Further, in the above method for training a text error correction model, decoding the input word sequence code to obtain error correction data corresponding to the source data, includes:
decoding the input word sequence code to obtain an error-corrected output word sequence code representation;
inputting the output word sequence encoded representation into a logistic regression layer of the decoder for linear transformation, and outputting an initial probability distribution at each time t in the target-end data;
determining the fusion probability distribution at each time t in the target-end data according to the probability distribution at each time t in the target-end data and an obtained replication mechanism score;
selecting the word corresponding to the maximum fusion probability as the generation result at time t;
and generating error correction data corresponding to the source-end data based on the words at all times.
Further, in the above method for training a text correction model, the process of obtaining the score of the replication mechanism includes:
performing matrix transformation on the coded representation of the output word sequence to obtain an output vector;
performing matrix transformation on the hidden state of the encoder to obtain a key vector and a value vector;
determining the replication mechanism score based on the output vector, the key vector, and the value vector.
The invention also provides a training device of the text error correction model, which comprises:
the pseudo mark construction module is used for carrying out pseudo mark construction on each text in the unmarked data based on a pseudo mark rule randomly selected from a plurality of preset pseudo mark rules to obtain a pseudo mark text corresponding to each text;
the detection module is used for detecting whether the construction times of the pseudo mark reach the preset iteration times; if the pseudo mark construction times reach the preset iteration times, taking all pseudo mark texts as pseudo mark data;
and the training module is used for taking the pseudo-mark data and the obtained marked data as source end data, taking the unmarked data as target end data and training a text error correction model based on a sequence-to-sequence method of a pointer network.
The invention also provides a training device of the text error correction model, which comprises: a processor and a memory;
the processor is used for executing the text generation program stored in the memory to realize the training method of the text correction model as described in any one of the above items.
The present invention also provides a storage medium storing one or more programs which, when executed, implement the method of training a text correction model according to any one of claims 1 to 7.
The invention has the beneficial effects that:
the method comprises the steps of carrying out pseudo-mark construction on each text in the obtained unmarked data through a pseudo-mark rule randomly selected from a plurality of preset pseudo-mark rules to obtain a pseudo-mark text corresponding to each text until the pseudo-mark construction times reach the preset iteration times, taking all the pseudo-mark texts as the pseudo-mark data, reducing the work of manual marking, increasing the data volume, further taking the pseudo-mark data and the obtained marked data as source end data, taking the unmarked data as target end data, training a text error correction model based on a pointer network sequence-to-sequence method, improving the training efficiency of the text error correction model, accurately judging which texts can be reserved and which texts need to be modified, and greatly reducing the phenomenon of correct text modification errors.
Drawings
FIG. 1 is a flowchart of an embodiment of a method for training a text correction model according to the present invention;
FIG. 2 is a schematic structural diagram of an embodiment of a training apparatus for text error correction models according to the present invention;
FIG. 3 is a schematic structural diagram of an embodiment of a training apparatus for a text error correction model according to the present invention.
Detailed Description
The principles and features of this invention are described below in conjunction with the following drawings, which are set forth by way of illustration only and are not intended to limit the scope of the invention.
Fig. 1 is a flowchart of an embodiment of a training method of a text correction model according to the present invention, and as shown in fig. 1, the training method of the text correction model of the present embodiment may include the following steps:
100. Performing pseudo-mark construction on each text in the obtained unmarked data based on a pseudo-mark rule randomly selected from a plurality of preset pseudo-mark rules, to obtain a pseudo-mark text corresponding to each text.
In a specific implementation process, for domains with a small amount of text, such as medical data, data expansion can be performed on the data so as to obtain a larger amount of text.
Specifically, the required data can be acquired as unmarked data through data capture, manual writing and other means, and each text in the acquired unmarked data is subjected to pseudo-mark construction according to a pseudo-mark rule randomly selected from the preset pseudo-mark rules, so that a pseudo-mark text corresponding to each text is obtained.
In a specific implementation process, the preset multiple pseudo-marking rules include: randomly deleting at least one word at a preset first probability at the position of each word in each text; and/or, randomly inserting at least one word at a preset first probability in the position of each word in each text; and/or adding noise to the position of each word in each text according to normal distribution, and reordering each word added with noise; and/or, collecting and constructing a phonetic near word dictionary, and replacing each word in each text with a phonetic near word according to a preset third probability; and/or, collecting and constructing a similar word dictionary, and replacing each word in each text with a similar word according to a preset fourth probability; and/or, maintaining each word in each piece of text.
For example, if the unmarked text is "cerebral blood supply insufficiency", pseudo-mark texts such as "cerebral blood supply insufficiency" (kept unchanged) and "cerebral write insufficiency" (with a character replaced) can be obtained after expansion.
In a specific implementation process, if the plurality of preset pseudo-mark rules include randomly inserting, with a preset first probability, at least one word at the position of each word in each text, a word table may be constructed according to the character frequencies in the unmarked data. Specifically, the characters in the unmarked data whose frequency is greater than or equal to a preset threshold may be taken as target characters, and the word table is constructed from the target characters. After the word table has been constructed, the at least one inserted word may be selected from the word table, as sketched below.
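By way of illustration, the following is a minimal Python sketch of the pseudo-mark construction described above. The probability values, the iteration count, the frequency threshold and the similar-pronunciation dictionary are illustrative placeholders, not values fixed by this embodiment.

```python
import random

def build_word_table(unmarked_texts, freq_threshold=5):
    """Characters whose frequency >= the preset threshold form the word table."""
    freq = {}
    for text in unmarked_texts:
        for ch in text:
            freq[ch] = freq.get(ch, 0) + 1
    return [ch for ch, n in freq.items() if n >= freq_threshold]

def pseudo_mark(text, word_table, similar_sound=None, p=0.1):
    """Apply one randomly selected pseudo-mark rule to a single text."""
    rule = random.choice(["delete", "insert", "reorder", "similar_sound", "keep"])
    chars = list(text)
    if rule == "delete":
        chars = [c for c in chars if random.random() > p] or chars
    elif rule == "insert":
        chars = [x for c in chars
                 for x in ([random.choice(word_table), c]
                           if (word_table and random.random() < p) else [c])]
    elif rule == "reorder":
        # Add normally distributed noise to each position, then re-sort.
        keys = [i + random.gauss(0, 1) for i in range(len(chars))]
        chars = [c for _, c in sorted(zip(keys, chars))]
    elif rule == "similar_sound" and similar_sound:
        chars = [random.choice(similar_sound.get(c, [c])) if random.random() < p else c
                 for c in chars]
    return "".join(chars)

def build_pseudo_mark_data(unmarked_texts, iterations=3):
    """Repeat pseudo-mark construction until the preset iteration count is reached."""
    word_table = build_word_table(unmarked_texts)
    pseudo_data = []
    for _ in range(iterations):
        pseudo_data += [pseudo_mark(t, word_table) for t in unmarked_texts]
    return pseudo_data
```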
101. Detecting whether the number of pseudo-mark constructions has reached a preset number of iterations.
102. If the number of pseudo-mark constructions has reached the preset number of iterations, taking all pseudo-mark texts as pseudo-mark data.
In a specific implementation process, pseudo-mark construction can be performed multiple times on each text in the obtained unmarked data. After each construction, the current number of pseudo-mark constructions is recorded so that it can be checked against the preset number of iterations; once that number is reached, all pseudo-mark texts are taken as pseudo-mark data.
103. Taking the pseudo-mark data and the obtained marked data as source-end data, taking the unmarked data as target-end data, and training a text error correction model based on a pointer-network sequence-to-sequence method.
In a specific implementation, step 103 may be implemented according to the following steps:
(1) carrying out word sequence division on the source end data, and carrying out word vector processing on the obtained word sequence to obtain a word matrix corresponding to the word sequence;
In one specific implementation, x = [x_1, x_2, ..., x_n] may be defined as the word sequence of the source-end data and X = [v_1, v_2, ..., v_i, ..., v_n] as the word matrix corresponding to the word sequence, where v_i is the vector of the i-th word.
(2) Encoding the word matrix by using an encoder to obtain an input word sequence encoding representation;
In one implementation, Self_enc(·) may be defined as the encoder computation unit based on the self-attention mechanism. The encoded representation of the word sequence after each encoder layer can then be calculated by calculation formula (1):

h^m = Self_enc(h^(m-1)),  m = 1, ..., N,  with h^0 = X    (1)

where h^m denotes the encoded representation of the word sequence at the m-th layer. After encoding by the encoder, the encoded representation h^N of the topmost layer of the encoder is obtained.
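As a rough sketch of steps (1) and (2), the snippet below embeds the source word sequence and passes it through a stack of self-attention layers; torch.nn.TransformerEncoder is used here only as a stand-in for the Self_enc(·) computation unit, and all dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, vocab_size, d_model=256, n_layers=6, n_heads=8):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)   # word matrix X = [v_1 ... v_n]
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.layers = nn.TransformerEncoder(layer, n_layers)

    def forward(self, x_ids):
        X = self.embed(x_ids)      # (batch, n, d_model): word-vector processing
        hN = self.layers(X)        # topmost-layer encoded representation h^N
        return hN
```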
(3) Decoding the input word sequence code by using a decoder under an attention mechanism to obtain error correction data corresponding to the source end data;
Define y = [y_1, y_2, ..., y_n] as the word sequence of the error correction result, and Y = [u_1, u_2, ..., u_i, ..., u_n] as the matrix obtained by word-vector preprocessing of the word sequence of the target-end input data, where u_i is the vector of the i-th word.
In a specific implementation, the implementation of this step is as follows:
(31) decoding the input word sequence code to obtain an error-corrected output word sequence code representation;
Specifically, Self_dec(·) may be defined as the decoder computation unit based on the self-attention mechanism. The encoded representation of the output word sequence at time t produced by the decoder is obtained by calculation formula (2):

s_t^n = Self_dec(s_t^(n-1), h^N),  with s_t^0 = u_t    (2)

where s_t^n is the encoded representation of the output word sequence at time t in the n-th layer of the target side, u_t is the input of the decoder at time t, and h^N is the hidden state of the encoder; the topmost layer yields the output word sequence encoded representation s_t^N.
(32) Inputting the output word sequence encoded representation into the logistic regression layer of the decoder for linear transformation, and outputting the initial probability distribution at each time t in the target-end data.
In a specific implementation process, the output word sequence encoded representation is input into the logistic regression layer of the decoder for linear transformation, and the initial probability distribution in the target-end data at each time t is output.
The output word sequence encoded representation is first linearly transformed to obtain O_t, the input representation of the softmax layer of the decoder. The O_t obtained by the linear transformation is then passed through softmax to output the initial probability distribution in the target-end data at each time t:

Prob_gen = softmax(W · O_t + b)

where W and b are model parameters, and the dimension of W is the same as the size of the word list.
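A small sketch of this step: the decoder output representation O_t is projected to the word-list dimension and normalised with softmax. The model dimension and vocabulary size are illustrative assumptions.

```python
import torch
import torch.nn as nn

d_model, vocab_size = 256, 30000
proj = nn.Linear(d_model, vocab_size)        # W and b; W sized like the word list

O_t = torch.randn(1, d_model)                # input representation of the softmax layer
Prob_gen = torch.softmax(proj(O_t), dim=-1)  # initial probability distribution at time t
```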
(33) Determining the fusion probability distribution at each time t in the target-end data according to the probability distribution at each time t in the target-end data and the obtained replication mechanism score.
in a specific implementation process, the obtaining process of the replication mechanism score includes:
a. Performing matrix transformation on the output word sequence encoded representation to obtain an output vector.
Specifically, the output word sequence encoded representation may be matrix-transformed according to calculation formula (3) to obtain the output vector q_t:

q_t = W_1 · s_t^N    (3)

where W_1 denotes the first matrix.
b. Performing matrix transformation on the hidden state of the encoder to obtain a key vector and a value vector.
Specifically, the hidden state of the encoder may be matrix-transformed according to calculation formula (4) to obtain the key vector K and the value vector V:

K = W_2 · h^N,  V = W_3 · h^N    (4)

where W_2 denotes the second matrix and W_3 denotes the third matrix.
c. Determining the replication mechanism score based on the output vector, the key vector, and the value vector.
In one specific implementation, the replication mechanism score may be determined according to calculation formula (5), which computes the score from the output vector q_t, the key vector K and the value vector V.
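Since formula (5) is given only in outline above, the following is one plausible attention-style realisation of steps a to c, not the exact formula of this embodiment: the output vector q_t attends over the encoder states through the key vector K, and the resulting weights over source positions serve as the replication score; the scaling and the use of the value vector V are assumptions.

```python
import torch

d, n_src = 256, 12                         # model dimension, number of source tokens (illustrative)
hN = torch.randn(n_src, d)                 # encoder hidden states h^N

W1, W2, W3 = (torch.randn(d, d) for _ in range(3))   # first, second and third matrices
s_t = torch.randn(d)                       # output word sequence encoded representation at time t

q_t = s_t @ W1                             # output vector, formula (3)
K, V = hN @ W2, hN @ W3                    # key and value vectors, formula (4)

copy_score = torch.softmax(q_t @ K.T / d ** 0.5, dim=-1)  # weights over source positions
context = copy_score @ V                   # value-weighted summary (one possible use of V)
```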
In this embodiment, after the replication mechanism score is obtained, the probability distribution at each time t in the target-end data and the obtained replication mechanism score may be fused according to calculation formula (6) to obtain the fusion probability distribution at each time t in the target-end data.
(34) Selecting the word corresponding to the maximum fusion probability as the generation result at time t.
(35) Generating error correction data corresponding to the source-end data based on the words at all times.
After the fusion probability distribution at each time t in the target-end data has been obtained, the word corresponding to the maximum fusion probability can be selected as the generation result at time t, and the error correction data corresponding to the source-end data is generated based on the words at all times.
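The sketch below illustrates steps (33) to (35) under stated assumptions: the copy weights are scattered onto the word list via the source token ids, fused with the generation distribution using a gating weight alpha, and the word with the largest fused probability is taken as the result at time t. The gating weight and the src_token_ids mapping are illustrative; formula (6) itself is not reproduced in the text.

```python
import torch

vocab_size, n_src = 30000, 12
Prob_gen = torch.softmax(torch.randn(vocab_size), dim=-1)        # generation distribution
copy_score = torch.softmax(torch.randn(n_src), dim=-1)           # weights over source positions
src_token_ids = torch.randint(0, vocab_size, (n_src,))           # word-list ids of source tokens

# Map the copy score onto the word list, then fuse with a gating weight.
Prob_copy = torch.zeros(vocab_size).scatter_add_(0, src_token_ids, copy_score)
alpha = 0.5
Prob_fused = alpha * Prob_copy + (1 - alpha) * Prob_gen

best_word_id = int(torch.argmax(Prob_fused))                     # generation result at time t
```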
(4) Determining a loss value based on the error correction data and the target end data;
specifically, the generated error correction data is compared with the target end data, and the loss value is calculated by a loss function.
(5) And performing iterative training on the current model based on the loss value until a training stopping condition is reached to obtain a text error correction model.
Specifically, the threshold value of the loss value may be set in advance as a condition for stopping training. For example, the threshold value is set to 0.2. This is not limited by the present application.
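As an illustration of steps (4) and (5), the following training loop computes a cross-entropy loss between the error correction output and the target-end data and stops once the average loss falls below a preset threshold (0.2 in the example above). The model interface, batches and optimizer settings are placeholders, not part of this embodiment.

```python
import torch
import torch.nn as nn

def train(model, batches, lr=1e-4, loss_threshold=0.2, max_epochs=50):
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    criterion = nn.CrossEntropyLoss()
    for epoch in range(max_epochs):
        total = 0.0
        for src_ids, tgt_ids in batches:
            logits = model(src_ids, tgt_ids)                 # (batch, len, vocab)
            loss = criterion(logits.flatten(0, 1), tgt_ids.flatten())
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            total += loss.item()
        if total / max(len(batches), 1) < loss_threshold:    # training stop condition
            break
    return model
```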
In the training method of the text error correction model of this embodiment, a pseudo-mark rule is randomly selected from the preset pseudo-mark rules and pseudo-mark construction is performed on each text in the obtained unmarked data to obtain a pseudo-mark text corresponding to each text, until the number of pseudo-mark constructions reaches the preset number of iterations, at which point all pseudo-mark texts are used as pseudo-mark data. This reduces manual labeling work and increases the amount of data. The pseudo-mark data and the obtained marked data can then be used as source-end data, the unmarked data as target-end data, and the text error correction model is trained based on the pointer-network sequence-to-sequence method. This improves the training efficiency of the text error correction model, allows the model to accurately judge which parts of a text can be kept and which need to be modified, and greatly reduces the phenomenon of correct text being modified into errors.
It should be noted that the method of the embodiment of the present invention may be executed by a single device, such as a computer or a server. The method of the embodiment can also be applied to a distributed scene and completed by the mutual cooperation of a plurality of devices. In the case of such a distributed scenario, one device of the multiple devices may only perform one or more steps of the method according to the embodiment of the present invention, and the multiple devices interact with each other to complete the method.
Fig. 2 is a schematic structural diagram of an embodiment of a training apparatus for a text correction model according to the present invention, and as shown in fig. 2, the training apparatus for a text correction model according to this embodiment may include a pseudo label construction module 20, a detection module 21, and a training module 22.
A pseudo mark constructing module 20, configured to perform pseudo mark construction on each text in the unmarked data based on a pseudo mark rule randomly selected from a plurality of preset pseudo mark rules, so as to obtain a pseudo mark text corresponding to each text;
In a specific implementation process, the preset plurality of pseudo-mark rules include: randomly deleting, with a preset first probability, at least one word at the position of each word in each text; and/or randomly inserting, with a preset first probability, at least one word at the position of each word in each text; and/or adding noise to the position of each word in each text according to a normal distribution and reordering the words after the noise is added; and/or collecting and constructing a dictionary of similar-pronunciation words and replacing each word in each text with a similar-pronunciation word according to a preset third probability; and/or collecting and constructing a dictionary of similar words and replacing each word in each text with a similar word according to a preset fourth probability; and/or keeping each word in each piece of text unchanged.
In a specific implementation process, if the plurality of preset pseudo-mark rules include randomly inserting, with a preset first probability, at least one word at the position of each word in each text, a word table can be constructed according to the character frequencies in the unmarked data, and the at least one inserted word is selected from the word table.
Specifically, the process of constructing the word table is as follows:
taking characters with the character frequency larger than or equal to a preset threshold value in the unmarked data as target characters; and constructing the word table according to the target character.
The detection module 21 is configured to detect whether the number of pseudo-mark constructions has reached a preset number of iterations and, if so, to take all pseudo-mark texts as pseudo-mark data.
and the training module 22 is configured to use the pseudo tag data and the obtained tagged data as source end data, use the non-tag data as target end data, and train a text error correction model based on a sequence-to-sequence method of a pointer network.
In one embodiment, the training module 22 may implement the training of the text error correction model according to the following steps:
(1) carrying out word sequence division on the source end data, and carrying out word vector processing on the obtained word sequence to obtain a word matrix corresponding to the word sequence;
In one specific implementation, x = [x_1, x_2, ..., x_n] may be defined as the word sequence of the source-end data and X = [v_1, v_2, ..., v_i, ..., v_n] as the word matrix corresponding to the word sequence, where v_i is the vector of the i-th word.
(2) Encoding the word matrix by using an encoder to obtain an input word sequence encoding representation;
In one implementation, Self_enc(·) may be defined as the encoder computation unit based on the self-attention mechanism. The encoded representation of the word sequence after each encoder layer can then be calculated by calculation formula (1):

h^m = Self_enc(h^(m-1)),  m = 1, ..., N,  with h^0 = X    (1)

where h^m denotes the encoded representation of the word sequence at the m-th layer. After encoding by the encoder, the encoded representation h^N of the topmost layer of the encoder is obtained.
(3) Decoding the input word sequence code by using a decoder under an attention mechanism to obtain error correction data corresponding to the source end data;
Define y = [y_1, y_2, ..., y_n] as the word sequence of the error correction result, and Y = [u_1, u_2, ..., u_i, ..., u_n] as the matrix obtained by word-vector preprocessing of the word sequence of the target-end input data, where u_i is the vector of the i-th word.
In a specific implementation, the implementation of this step is as follows:
(31) decoding the input word sequence code to obtain an error-corrected output word sequence code representation;
Specifically, Self_dec(·) may be defined as the decoder computation unit based on the self-attention mechanism. The encoded representation of the output word sequence at time t produced by the decoder is obtained by calculation formula (2):

s_t^n = Self_dec(s_t^(n-1), h^N),  with s_t^0 = u_t    (2)

where s_t^n is the encoded representation of the output word sequence at time t in the n-th layer of the target side, u_t is the input of the decoder at time t, and h^N is the hidden state of the encoder; the topmost layer yields the output word sequence encoded representation s_t^N.
(32) Inputting the output word sequence encoded representation into the logistic regression layer of the decoder for linear transformation, and outputting the initial probability distribution at each time t in the target-end data.
In a specific implementation process, the output word sequence encoded representation is input into the logistic regression layer of the decoder for linear transformation, and the initial probability distribution in the target-end data at each time t is output.
The output word sequence encoded representation is first linearly transformed to obtain O_t, the input representation of the softmax layer of the decoder. The O_t obtained by the linear transformation is then passed through softmax to output the initial probability distribution in the target-end data at each time t:

Prob_gen = softmax(W · O_t + b)

where W and b are model parameters, and the dimension of W is the same as the size of the word list.
(33) Determining the fusion probability distribution at each time t in the target-end data according to the probability distribution at each time t in the target-end data and the obtained replication mechanism score.
in a specific implementation process, the obtaining process of the replication mechanism score includes:
a. Performing matrix transformation on the output word sequence encoded representation to obtain an output vector.
Specifically, the output word sequence encoded representation may be matrix-transformed according to calculation formula (3) to obtain the output vector q_t:

q_t = W_1 · s_t^N    (3)

where W_1 denotes the first matrix.
b. Performing matrix transformation on the hidden state of the encoder to obtain a key vector and a value vector.
Specifically, the hidden state of the encoder may be matrix-transformed according to calculation formula (4) to obtain the key vector K and the value vector V:

K = W_2 · h^N,  V = W_3 · h^N    (4)

where W_2 denotes the second matrix and W_3 denotes the third matrix.
c. Determining the replication mechanism score based on the output vector, the key vector, and the value vector.
In one specific implementation, the replication mechanism score may be determined according to calculation formula (5), which computes the score from the output vector q_t, the key vector K and the value vector V.
In this embodiment, after the replication mechanism score is obtained, the probability distribution at each time t in the target-end data and the obtained replication mechanism score may be fused according to calculation formula (6) to obtain the fusion probability distribution at each time t in the target-end data.
(34) Selecting the word corresponding to the maximum fusion probability as the generation result at time t.
(35) Generating error correction data corresponding to the source-end data based on the words at all times.
After the fusion probability distribution at each time t in the target-end data has been obtained, the word corresponding to the maximum fusion probability can be selected as the generation result at time t, and the error correction data corresponding to the source-end data is generated based on the words at all times.
(4) Determining a loss value based on the error correction data and the target end data;
specifically, the generated error correction data is compared with the target end data, and the loss value is calculated by a loss function.
(5) And performing iterative training on the current model based on the loss value until a training stopping condition is reached to obtain a text error correction model.
Specifically, the threshold value of the loss value may be set in advance as a condition for stopping training. For example, the threshold value is set to 0.2. This is not limited by the present application.
The apparatus of the foregoing embodiment is used to implement the corresponding method in the foregoing embodiment, and specific implementation schemes thereof may refer to the method described in the foregoing embodiment and relevant descriptions in the method embodiment, and have beneficial effects of the corresponding method embodiment, which are not described herein again.
Fig. 3 is a schematic structural diagram of an embodiment of a training device for a text error correction model according to the present invention. As shown in Fig. 3, the device of this embodiment may include a processor 1010 and a memory 1020. Those skilled in the art will appreciate that the device may also include an input/output interface 1030, a communication interface 1040, and a bus 1050, with the processor 1010, memory 1020, input/output interface 1030 and communication interface 1040 communicatively coupled to each other within the device via the bus 1050.
The processor 1010 may be implemented by a general-purpose CPU (Central Processing Unit), a microprocessor, an Application Specific Integrated Circuit (ASIC), or one or more Integrated circuits, and is configured to execute related programs to implement the technical solutions provided in the embodiments of the present disclosure.
The Memory 1020 may be implemented in the form of a ROM (Read Only Memory), a RAM (Random Access Memory), a static storage device, a dynamic storage device, or the like. The memory 1020 may store an operating system and other application programs, and when the technical solution provided by the embodiments of the present specification is implemented by software or firmware, the relevant program codes are stored in the memory 1020 and called to be executed by the processor 1010.
The input/output interface 1030 is used for connecting an input/output module to input and output information. The i/o module may be configured as a component in a device (not shown) or may be external to the device to provide a corresponding function. The input devices may include a keyboard, a mouse, a touch screen, a microphone, various sensors, etc., and the output devices may include a display, a speaker, a vibrator, an indicator light, etc.
The communication interface 1040 is used for connecting a communication module (not shown in the drawings) to implement communication interaction between the present apparatus and other apparatuses. The communication module can realize communication in a wired mode (such as USB, network cable and the like) and also can realize communication in a wireless mode (such as mobile network, WIFI, Bluetooth and the like).
Bus 1050 includes a path that transfers information between various components of the device, such as processor 1010, memory 1020, input/output interface 1030, and communication interface 1040.
It should be noted that although the above-mentioned device only shows the processor 1010, the memory 1020, the input/output interface 1030, the communication interface 1040 and the bus 1050, in a specific implementation, the device may also include other components necessary for normal operation. In addition, those skilled in the art will appreciate that the above-described apparatus may also include only those components necessary to implement the embodiments of the present description, and not necessarily all of the components shown in the figures.
The present invention also provides a storage medium storing one or more programs that, when executed, implement the method for training a text correction model of the above-described embodiments.
The invention also provides a text error correction method, which comprises the following steps:
and inputting the text to be corrected into the text correction model obtained in the embodiment, and outputting the standard text corresponding to the text to be corrected.
Computer-readable media of the present embodiments, including both permanent and non-permanent, removable and non-removable media, may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read only memory (ROM), electrically erasable programmable read only memory (EEPROM), flash memory or other memory technology, compact disc read only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information that can be accessed by a computing device.
Those of ordinary skill in the art will understand that the discussion of any embodiment above is meant to be exemplary only and is not intended to imply that the scope of the disclosure, including the claims, is limited to these examples; within the idea of the invention, features in the above embodiments or in different embodiments may also be combined, steps may be implemented in any order, and there are many other variations of the different aspects of the invention as described above which are not provided in detail for the sake of brevity.
In addition, well known power/ground connections to Integrated Circuit (IC) chips and other components may or may not be shown within the provided figures for simplicity of illustration and discussion, and so as not to obscure the invention. Furthermore, devices may be shown in block diagram form in order to avoid obscuring the invention, and also in view of the fact that specifics with respect to implementation of such block diagram devices are highly dependent upon the platform within which the present invention is to be implemented (i.e., specifics should be well within purview of one skilled in the art). Where specific details (e.g., circuits) are set forth in order to describe example embodiments of the invention, it should be apparent to one skilled in the art that the invention can be practiced without, or with variation of, these specific details. Accordingly, the description is to be regarded as illustrative instead of restrictive.
While the present invention has been described in conjunction with specific embodiments thereof, many alternatives, modifications, and variations of these embodiments will be apparent to those of ordinary skill in the art in light of the foregoing description. For example, other memory architectures (e.g., dynamic RAM (DRAM)) may use the discussed embodiments.
While the invention has been described with reference to specific embodiments, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (10)

1. A training method of a text correction model is characterized by comprising the following steps:
performing pseudo-mark construction on each text in the obtained unmarked data based on a pseudo-mark rule randomly selected from a plurality of preset pseudo-mark rules to obtain a pseudo-mark text corresponding to each text;
detecting whether the number of pseudo-mark constructions has reached a preset number of iterations;
if the number of pseudo-mark constructions has reached the preset number of iterations, taking all pseudo-mark texts as pseudo-mark data;
and taking the pseudo-mark data and the obtained marked data as source-end data, taking the unmarked data as target-end data, and training a text error correction model based on a pointer-network sequence-to-sequence method.
2. The method for training the text correction model according to claim 1, wherein the preset plurality of pseudo labeling rules comprise:
randomly deleting, with a preset first probability, at least one word at the position of each word in each text; and/or
randomly inserting, with a preset first probability, at least one character at the position of each word in each text; and/or
adding noise to the position of each word in each text according to a normal distribution, and reordering the words after the noise is added; and/or
collecting and constructing a dictionary of similar-pronunciation words, and replacing each word in each text with a similar-pronunciation word according to a preset third probability; and/or
collecting and constructing a dictionary of similar words, and replacing each word in each text with a similar word according to a preset fourth probability; and/or
keeping each word in each piece of text unchanged.
3. The method of claim 2, wherein the predetermined plurality of pseudo-tagging rules includes randomly inserting at least one word at a predetermined first probability at a position of each word in each text, the method further comprising:
constructing a word table according to the character frequency in the unmarked data;
selecting at least one inserted word from the word table.
4. The method for training the text correction model according to claim 3, wherein constructing the word table according to the frequency of the characters in the label-free data comprises:
taking characters with the character frequency larger than or equal to a preset threshold value in the unmarked data as target characters;
and constructing the word table according to the target character.
5. The method for training the text correction model according to claim 1, wherein the training of the text correction model based on a sequence-to-sequence method of a pointer network comprises:
carrying out word sequence division on the source end data, and carrying out word vector processing on the obtained word sequence to obtain a word matrix corresponding to the word sequence;
encoding the word matrix by using an encoder to obtain an input word sequence encoding representation;
decoding the input word sequence code by using a decoder under an attention mechanism to obtain error correction data corresponding to the source end data;
determining a loss value based on the error correction data and the target end data;
and performing iterative training on the current model based on the loss value until a training stopping condition is reached to obtain a text error correction model.
6. The method for training a text error correction model according to claim 5, wherein decoding the input word sequence code to obtain error correction data corresponding to the source data comprises:
decoding the input word sequence code to obtain an error-corrected output word sequence code representation;
inputting the output word sequence encoded representation into a logistic regression layer of the decoder for linear transformation, and outputting an initial probability distribution at each time t in the target-end data;
determining the fusion probability distribution at each time t in the target-end data according to the probability distribution at each time t in the target-end data and an obtained replication mechanism score;
selecting the word corresponding to the maximum fusion probability as the generation result at time t;
and generating error correction data corresponding to the source-end data based on the words at all times.
7. The method for training the text correction model according to claim 6, wherein the obtaining of the replication mechanism score comprises:
performing matrix transformation on the coded representation of the output word sequence to obtain an output vector;
performing matrix transformation on the hidden state of the encoder to obtain a key vector and a value vector;
determining the replication mechanism score based on the output vector, the key vector, and the value vector.
8. An apparatus for training a text correction model, comprising:
the pseudo mark construction module is used for carrying out pseudo mark construction on each text in the unmarked data based on a pseudo mark rule randomly selected from a plurality of preset pseudo mark rules to obtain a pseudo mark text corresponding to each text;
the detection module is used for detecting whether the construction times of the pseudo mark reach the preset iteration times; if the pseudo mark construction times reach the preset iteration times, taking all pseudo mark texts as pseudo mark data;
and the training module is used for taking the pseudo-mark data and the obtained marked data as source end data, taking the unmarked data as target end data and training a text error correction model based on a sequence-to-sequence method of a pointer network.
9. An apparatus for training a text correction model, comprising: a processor and a memory;
the processor is configured to execute the text generation program stored in the memory to implement the training method of the text correction model according to any one of claims 1 to 7.
10. A storage medium storing one or more programs which, when executed, implement the method of training a text correction model according to any one of claims 1-7.
CN202110616159.6A 2021-06-02 2021-06-02 Method, apparatus and storage medium for training text error correction model Pending CN113191119A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110616159.6A CN113191119A (en) 2021-06-02 2021-06-02 Method, apparatus and storage medium for training text error correction model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110616159.6A CN113191119A (en) 2021-06-02 2021-06-02 Method, apparatus and storage medium for training text error correction model

Publications (1)

Publication Number Publication Date
CN113191119A true CN113191119A (en) 2021-07-30

Family

ID=76975819

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110616159.6A Pending CN113191119A (en) 2021-06-02 2021-06-02 Method, apparatus and storage medium for training text error correction model

Country Status (1)

Country Link
CN (1) CN113191119A (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150254327A1 (en) * 2014-03-07 2015-09-10 Tata Consultancy Services Limited System and method for rectifying a typographical error in a text file
CN111401012A (en) * 2020-03-09 2020-07-10 北京声智科技有限公司 Text error correction method, electronic device and computer readable storage medium
CN111563390A (en) * 2020-04-28 2020-08-21 北京字节跳动网络技术有限公司 Text generation method and device and electronic equipment
CN111859919A (en) * 2019-12-02 2020-10-30 北京嘀嘀无限科技发展有限公司 Text error correction model training method and device, electronic equipment and storage medium
CN112417823A (en) * 2020-09-16 2021-02-26 中国科学院计算技术研究所 Chinese text word order adjusting and quantitative word completion method and system
CN112861519A (en) * 2021-03-12 2021-05-28 云知声智能科技股份有限公司 Medical text error correction method, device and storage medium

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150254327A1 (en) * 2014-03-07 2015-09-10 Tata Consultancy Services Limited System and method for rectifying a typographical error in a text file
CN111859919A (en) * 2019-12-02 2020-10-30 北京嘀嘀无限科技发展有限公司 Text error correction model training method and device, electronic equipment and storage medium
CN111401012A (en) * 2020-03-09 2020-07-10 北京声智科技有限公司 Text error correction method, electronic device and computer readable storage medium
CN111563390A (en) * 2020-04-28 2020-08-21 北京字节跳动网络技术有限公司 Text generation method and device and electronic equipment
CN112417823A (en) * 2020-09-16 2021-02-26 中国科学院计算技术研究所 Chinese text word order adjusting and quantitative word completion method and system
CN112861519A (en) * 2021-03-12 2021-05-28 云知声智能科技股份有限公司 Medical text error correction method, device and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
WEI ZHAO et al.: "Improving Grammatical Error Correction via Pre-Training a Copy-Augmented Architecture with Unlabeled Data", https://arxiv.org/abs/1903.00138, pages 1-10 *

Similar Documents

Publication Publication Date Title
CN107767870B (en) Punctuation mark adding method and device and computer equipment
CN105068998A (en) Translation method and translation device based on neural network model
GB2556978A (en) Testing applications with a defined input format
CN113590761B (en) Training method of text processing model, text processing method and related equipment
US20120036133A1 (en) Computing device and method for searching for parameters in a data model
CN111611811B (en) Translation method, translation device, electronic equipment and computer readable storage medium
CN111666427A (en) Entity relationship joint extraction method, device, equipment and medium
CN110866402B (en) Named entity identification method and device, storage medium and electronic equipment
CN110019865B (en) Mass image processing method and device, electronic equipment and storage medium
CN110046637B (en) Training method, device and equipment for contract paragraph annotation model
CN111079944B (en) Transfer learning model interpretation realization method and device, electronic equipment and storage medium
CN107451106A (en) Text method and device for correcting, electronic equipment
CN110709855A (en) Techniques for dense video description
WO2019092868A1 (en) Information processing device, information processing method, and computer-readable recording medium
US20150339291A1 (en) Method and apparatus for performing bilingual word alignment
CN113743101A (en) Text error correction method and device, electronic equipment and computer storage medium
CN117290694A (en) Question-answering system evaluation method, device, computing equipment and storage medium
CN113191119A (en) Method, apparatus and storage medium for training text error correction model
CN115374766A (en) Text punctuation recovery method and related equipment
CN115454423A (en) Static webpage generation method and device, electronic equipment and storage medium
CN112836527B (en) Training method, system, equipment and storage medium of machine translation model
CN112183088B (en) Word level determining method, model building method, device and equipment
CN113204944A (en) Text generation method, device, equipment and storage medium
CN109190091B (en) Encoding and decoding method and device
CN112069549B (en) Method and system for downloading picture when Bootstrap-table plug-in exports table

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination