WO2022160447A1 - Text error correction method, apparatus, device and storage medium - Google Patents


Info

Publication number
WO2022160447A1
Authority
WO
WIPO (PCT)
Prior art keywords
corrected
text
error correction
word vector
text corpus
Prior art date
Application number
PCT/CN2021/083296
Other languages
English (en)
French (fr)
Inventor
邓悦
郑立颖
徐亮
Original Assignee
平安科技(深圳)有限公司
Priority date
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司
Publication of WO2022160447A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval; Database structures therefor; File system structures therefor, of unstructured textual data
    • G06F16/33: Querying
    • G06F16/3331: Query processing
    • G06F16/334: Query execution
    • G06F16/3344: Query execution using natural language analysis
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00: Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10: Complex mathematical operations
    • G06F17/16: Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/21: Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214: Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/20: Natural language analysis
    • G06F40/279: Recognition of textual entities
    • G06F40/284: Lexical analysis, e.g. tokenisation or collocates
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Definitions

  • the present application relates to the field of machine learning technologies, and in particular, to a text error correction method, apparatus, device, and storage medium.
  • the present application provides a text error correction method, apparatus, device, and storage medium, which are used to solve the problem of unaligned text in the corpus to be corrected and to improve the accuracy of text error correction.
  • a first aspect of the present application provides a text error correction method, comprising: acquiring a text corpus to be corrected and inputting the text corpus to be corrected into a pre-trained embedding layer to generate a word vector group to be corrected; inputting the word vector group to be corrected into a pre-trained detection discriminator to generate position information of the word vectors; performing mask coverage on the word vector group to be corrected according to the position information of the word vectors to generate a covered word vector group; and inputting the covered word vector group into a pre-trained error correction network to generate an error-corrected text corpus, and restoring the error-corrected text corpus based on the pre-trained error correction network to generate a target text corpus, where the error-corrected text corpus includes placeholders.
  • a second aspect of the present application provides a text error correction apparatus, comprising: an acquisition module for acquiring a text corpus to be corrected and inputting the text corpus to be corrected into a pre-trained embedding layer to generate a word vector group to be corrected; a location information generation module for inputting the word vector group to be corrected into a pre-trained detection discriminator to generate position information of the word vectors; a covering module for performing mask coverage on the word vector group to be corrected according to the position information of the word vectors to generate a covered word vector group; and a text corpus generation module for inputting the covered word vector group into a pre-trained error correction network to generate an error-corrected text corpus, and for restoring the error-corrected text corpus based on the pre-trained error correction network to generate a target text corpus, where the error-corrected text corpus includes placeholders.
  • a third aspect of the present application provides a text error correction device, comprising: a memory and at least one processor, where instructions are stored in the memory; the at least one processor invokes the instructions in the memory, so that the text error correction device performs the text error correction method described below:
  • obtaining a text corpus to be corrected and inputting the text corpus to be corrected into a pre-trained embedding layer to generate a word vector group to be corrected; inputting the word vector group to be corrected into a pre-trained detection discriminator to generate position information of the word vectors; performing mask coverage on the word vector group to be corrected according to the position information of the word vectors to generate a covered word vector group; inputting the covered word vector group into a pre-trained error correction network to generate an error-corrected text corpus; and restoring the error-corrected text corpus based on the pre-trained error correction network to generate a target text corpus, where the error-corrected text corpus includes placeholders.
  • a fourth aspect of the present application provides a computer-readable storage medium having instructions stored therein that, when run on a computer, cause the computer to execute the text error correction method described below:
  • obtaining a text corpus to be corrected and inputting the text corpus to be corrected into a pre-trained embedding layer to generate a word vector group to be corrected; inputting the word vector group to be corrected into a pre-trained detection discriminator to generate position information of the word vectors; performing mask coverage on the word vector group to be corrected according to the position information of the word vectors to generate a covered word vector group; inputting the covered word vector group into a pre-trained error correction network to generate an error-corrected text corpus; and restoring the error-corrected text corpus based on the pre-trained error correction network to generate a target text corpus, where the error-corrected text corpus includes placeholders.
  • the text corpus to be corrected is obtained and input into a pre-trained embedding layer to generate a word vector group to be corrected; the word vector group to be corrected is input into the pre-trained detection discriminator to generate position information of the word vectors; the word vector group to be corrected is mask-covered according to the position information of the word vectors to generate a covered word vector group; and the covered word vector group is input into a pre-trained error correction network to generate an error-corrected text corpus, which is then restored based on the pre-trained error correction network to generate a target text corpus. The error-corrected text corpus includes placeholders.
  • In this way, the word vectors of typo words are covered by masks to generate the covered word vector group; error correction is then performed based on the pre-trained error correction network, with placeholders added, to generate the error-corrected text corpus; finally, the error-corrected text corpus is restored to generate the target text corpus. This solves the problem of unaligned text in the corpus to be corrected and thereby improves the accuracy of text error correction.
  • FIG. 1 is a schematic diagram of an embodiment of a text error correction method in an embodiment of the present application.
  • FIG. 2 is a schematic diagram of another embodiment of a text error correction method in an embodiment of the present application.
  • FIG. 3 is a schematic diagram of an embodiment of a text error correction device in an embodiment of the present application.
  • FIG. 4 is a schematic diagram of another embodiment of a text error correction apparatus in an embodiment of the present application.
  • FIG. 5 is a schematic diagram of an embodiment of a text error correction device in an embodiment of the present application.
  • Embodiments of the present application provide a text error correction method, apparatus, device, and storage medium, which are used to solve the problem of unaligned text in a corpus to be corrected, and improve the accuracy of text error correction.
  • an embodiment of the text error correction method in the embodiment of the present application includes:
  • the server obtains the text corpus to be corrected, and inputs the text corpus to be corrected into a pre-trained embedding layer for vectorization to generate a word vector group to be corrected. It should be emphasized that, in order to further ensure the privacy and security of the above text corpus to be corrected, the above text corpus to be corrected can also be stored in a node of a blockchain.
  • the text corpus to be corrected may be obtained by converting input text, or by converting input speech.
  • the text corpus to be corrected can be either a Chinese text corpus, such as "I'm from Shanghai", or an English text corpus, such as "I come from Shanghai".
  • the server inputs the Chinese or English text corpus to be corrected into the pre-trained embedding layer, that is, the Embedding layer, where it is vectorized to generate the word vector group to be corrected.
  • the execution body of the present application may be a text error correction device, or a terminal or a server, which is not specifically limited here.
  • the embodiments of the present application take the server as an execution subject as an example for description.
  • the server inputs the word vector group to be corrected into the pre-trained detection discriminator to discriminate the position of the word vector, and generates the position information of the word vector.
  • the server inputs (h_1, h_2, ..., h_n) into the pre-trained detection discriminator, which identifies the position information of each word vector in the word vector group to be corrected and generates position information that includes the typo word vector.
  • (h_1, h_2, ..., h_n) is the vector group of the word vector group to be corrected, "I am good from Shanghai". The server inputs the vector group to be corrected into the pre-trained detection discriminator, which discriminates the vector group and generates the position information of the word vectors (0, 1, 0, 0, 0), where "0" indicates that the word vector at that position is a correct word vector and "1" indicates that the word vector at that position is a typo word vector.
  • the server uses a mask to cover the word vector group to be corrected according to the position information of the word vector, and generates a covered word vector group.
  • specifically, mask coverage is performed on the word vector group to be corrected according to the position information of the word vectors: the typo word vector is covered so that only the word vectors of correct words and the mask vectors covering the typo word vectors are retained, yielding the covered word vector group.
  • for example, for the word vector group (h_1, h_2, ..., h_n) of "I am good from Shanghai" with corresponding position information (0, 1, 0, 0, 0), the server masks the word vector group to be corrected according to (0, 1, 0, 0, 0) to generate the covered word vector group.
  • the error-corrected text corpus includes placeholders.
  • the server inputs the covered word vector group into the pre-trained error correction network, first generates the error-corrected text corpus, and then, in the pre-trained error correction network, restores the error-corrected text corpus that includes placeholders to generate the target text corpus.
  • for example, the server first inputs the covered word vector group into the pre-trained error correction network for the first text restoration and adds placeholders to generate the error-corrected text corpus "I am from [NONE] Shanghai"; the server then restores "I am from [NONE] Shanghai" to generate the target text corpus "I am from Shanghai".
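  • The restoration step in this example reduces to deleting placeholder tokens from the error-corrected corpus. A minimal sketch (not the patent's actual implementation; the [NONE] token name follows the example above):

```python
def restore_corpus(corrected: str, placeholder: str = "[NONE]") -> str:
    """Drop placeholder tokens and collapse the doubled spaces they leave."""
    restored = corrected.replace(placeholder, "")
    return " ".join(restored.split())

# The example from the text: "I am from [NONE] Shanghai" -> "I am from Shanghai"
print(restore_corpus("I am from [NONE] Shanghai"))
```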
  • the word vector of the typo word is covered by the mask, the covered word vector group is generated, then the error correction is performed based on the pre-trained error correction network and the placeholder is added to generate the error-corrected text corpus, and finally The error-corrected text corpus is restored to generate the target text corpus, which solves the problem of unaligned text in the corpus to be corrected, thereby improving the accuracy of text error correction.
  • another embodiment of the text error correction method in the embodiment of the present application includes:
  • the server first obtains the text corpus training data set and the text corpus verification data set, and then uses the text corpus training data set to train the detection generator and the detection discriminator, and generates an initial detection generator and an initial detection discriminator.
  • where e is a trainable (i.e., adjustable) parameter, h is a word vector in the text corpus training data, and t is the position of a word; p_D(h', t) is the position probability, h' is the text corpus training data, w^T h_G is the computed vector, and T is an operation symbol denoting the matrix transpose.
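  • The source's formula image is not reproduced here; assuming the standard sigmoid form for a discriminator head, the position probability p_D(h', t) = sigmoid(w^T h_t) can be sketched as:

```python
import numpy as np

def position_probabilities(H: np.ndarray, w: np.ndarray) -> np.ndarray:
    """H: (seq_len, dim) word vectors; w: (dim,) trainable weights.
    Returns one position probability per word. The sigmoid head is an
    assumption consistent with the w^T h_G term described above."""
    logits = H @ w                        # w^T h_t for every position t
    return 1.0 / (1.0 + np.exp(-logits))  # sigmoid squashes to (0, 1)

# With all-zero inputs every probability is exactly sigmoid(0) = 0.5.
probs = position_probabilities(np.zeros((5, 8)), np.zeros(8))
print(probs)
```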
  • the text corpus verification data set is used to coordinately adjust the initial detection generator and the initial detection discriminator to generate a pre-trained detection discriminator.
  • the pre-trained detection discriminator is mainly used for subsequent text error correction, so only the pre-trained detection discriminator is retained at the end; during training and adjustment, however, the output results of both the detection generator and the detection discriminator must be referenced, so the detection generator and the detection discriminator are jointly trained and adjusted.
  • the server uses the corresponding loss function to adjust the initial detection generator and the initial detection discriminator respectively.
  • where I is the set of mask positions and p_G(· | h') is the output result of the initial detection generator; this loss function is used to adjust the initial detection generator to generate a transition detection generator.
  • p_D(h', t) is the output result of the initial detection discriminator; the server uses this loss function to adjust the initial detection discriminator to generate a transition detection discriminator.
  • the server uses a fusion formula to fuse the above two loss functions based on a preset ratio, so as to minimize the fused loss function and generate the pre-trained detection discriminator.
  • the fusion formula combines the two losses according to a preset ratio; in this embodiment, the ratio is 50%.
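  • The fusion formula itself is not reproduced in the text; a common form for joint generator/discriminator training, used here as an assumption, weights the discriminator loss by the preset ratio and adds it to the generator loss:

```python
def fused_loss(loss_generator: float, loss_discriminator: float,
               ratio: float = 0.5) -> float:
    """Combine the two losses with a preset ratio (50% in this embodiment).
    The additive form L = L_G + ratio * L_D is an assumption, not quoted
    from the source; minimizing L trains both networks jointly."""
    return loss_generator + ratio * loss_discriminator

print(fused_loss(1.0, 0.5))  # 1.0 + 0.5 * 0.5 = 1.25
```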
  • the server obtains the text corpus to be corrected, and inputs the text corpus to be corrected into a pre-trained embedding layer for vectorization to generate a word vector group to be corrected. It should be emphasized that, in order to further ensure the privacy and security of the above text corpus to be corrected, the above text corpus to be corrected can also be stored in a node of a blockchain.
  • the text corpus to be corrected may be obtained by converting input text, or by converting input speech.
  • the text corpus to be corrected can be either a Chinese text corpus, such as "I'm from Shanghai", or an English text corpus, such as "I come from Shanghai".
  • the server inputs the Chinese or English text corpus to be corrected into the pre-trained embedding layer, that is, the Embedding layer, where it is vectorized to generate the word vector group to be corrected.
  • the server first obtains the text corpus to be corrected and performs one-hot encoding on it to generate the text code to be corrected; the server then reads the mapping matrix from the pre-trained embedding layer, inputs the text code to be corrected into the pre-trained embedding layer, and multiplies the text code to be corrected by the mapping matrix in the pre-trained embedding layer to generate the word vector group to be corrected.
  • in the embedding layer, that is, the Embedding layer, the server can map the text corpus to be corrected from one space to another to obtain the word vector group to be corrected.
  • specifically, the server can read a mapping matrix from the fully connected layer and multiply the text corpus to be corrected by the mapping matrix to obtain the word vector group to be corrected.
  • the server first performs one-hot encoding on the text corpus to be corrected to generate the text encoding to be corrected (x_1, x_2, ..., x_n); the server then reads the mapping matrix from the pre-trained embedding layer.
  • the server multiplies (x_1, x_2, ..., x_n) by the mapping matrix to generate the word vector group to be corrected (h_1, h_2, ..., h_n).
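  • The one-hot-times-mapping-matrix step can be illustrated as follows. The vocabulary size, embedding dimension, and token ids are invented for the example; the point is that multiplying a one-hot row by the matrix simply selects the corresponding row:

```python
import numpy as np

vocab_size, embed_dim = 6, 4
rng = np.random.default_rng(0)
# Stand-in for the mapping matrix read from the pre-trained embedding layer.
mapping_matrix = rng.standard_normal((vocab_size, embed_dim))

token_ids = [0, 3, 2, 4, 5]                # the encoded corpus (x_1, ..., x_n)
one_hot = np.eye(vocab_size)[token_ids]    # (seq_len, vocab_size) one-hot rows
word_vectors = one_hot @ mapping_matrix    # (seq_len, embed_dim) = (h_1, ..., h_n)

# A one-hot row times the matrix is exactly a row lookup:
assert np.allclose(word_vectors, mapping_matrix[token_ids])
print(word_vectors.shape)  # (5, 4)
```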
  • the server inputs the word vector group to be corrected into the pre-trained detection discriminator to discriminate the position of the word vector, and generates the position information of the word vector.
  • the server inputs (h_1, h_2, ..., h_n) into the pre-trained detection discriminator, which identifies the position information of each word vector in the word vector group to be corrected and generates position information that includes the typo word vector.
  • (h_1, h_2, ..., h_n) is the vector group of the word vector group to be corrected, "I am good from Shanghai". The server inputs the vector group to be corrected into the pre-trained detection discriminator, which discriminates the vector group and generates the position information of the word vectors (0, 1, 0, 0, 0), where "0" indicates that the word vector at that position is a correct word vector and "1" indicates that the word vector at that position is a typo word vector.
  • the server first inputs the word vector group to be corrected and the text corpus to be corrected into the detection linear layer, which is located in the pre-trained detection discriminator, for calculation to generate the vector group to be calculated; the server then uses the preset discriminator probability formula to perform probability calculation on the vector group to be calculated to generate the position probabilities; finally, the server determines the position information of the word vectors based on the position probabilities.
  • the server inputs the word vector group to be corrected and the text corpus to be corrected into the detection linear layer in the pre-trained detection discriminator in parallel; in the detection linear layer, the word vectors to be corrected are calculated with reference to the text corpus to be corrected.
  • in the discriminator probability formula, p_D(h', t) is the position probability, h' is the text corpus to be corrected, t is the position of the word, w^T h_G is the vector group to be calculated, and T is an operation symbol denoting the matrix transpose.
  • the function in this embodiment takes the inner product of the matrix W and the matrix h, where the matrix W and the matrix h are calculated in the linear layer based on the word vector group to be corrected.
  • the server determines the largest position probability as the typo position probability. For example, the text corpus to be corrected is "I am good from Shanghai", where "I" is the first position, "good" is the second position, "self" is the third position, "上" is the fourth position, and "sea" is the fifth position.
  • after the above calculation, the server generates the position probabilities 0.5, 0.9, 0.65, 0.6, and 0.55.
  • the server generates the position information of the word vectors (0, 1, 0, 0, 0) according to the position probabilities, where "0" means the word vector at that position is a correct word vector and "1" means the word vector at that position is a typo word vector.
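  • The mapping from position probabilities to position information in this example (the largest probability is marked 1 as the typo position, all others 0) can be sketched as:

```python
def position_info(probs):
    """Mark the position with the largest probability as the typo (1)
    and all other positions as correct (0), matching the example above."""
    typo = max(range(len(probs)), key=probs.__getitem__)
    return [1 if i == typo else 0 for i in range(len(probs))]

# The probabilities from the text: 0.9 at position 2 is the typo.
print(position_info([0.5, 0.9, 0.65, 0.6, 0.55]))  # [0, 1, 0, 0, 0]
```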
  • the server uses a mask to cover the word vector group to be corrected according to the position information of the word vector, and generates a covered word vector group.
  • specifically, mask coverage is performed on the word vector group to be corrected according to the position information of the word vectors: the typo word vector is covered so that only the word vectors of correct words and the mask vectors covering the typo word vectors are retained, yielding the covered word vector group.
  • for example, for the word vector group (h_1, h_2, ..., h_n) of "I am good from Shanghai" with corresponding position information (0, 1, 0, 0, 0), the server masks the word vector group to be corrected according to (0, 1, 0, 0, 0) to generate the covered word vector group.
  • the server first obtains the modification amplitude parameter, which is a natural number; it then determines, based on the position information of the word vectors, the target word vector to be corrected, that is, the word vector in the word vector group to be corrected corresponding to the position information of the typo word vector; finally, it performs mask coverage on the target word vector to be corrected based on the modification amplitude parameter to generate the covered word vector group.
  • the modification amplitude parameter is a natural number such as 0, 1, 2, or 3.
  • for example, the modification amplitude parameter is 2, the position information of the word vectors is (0, 1, 0, 0, 0), and the word vector group to be corrected is that of "I am good from Shanghai". The server determines the target word vector to be corrected as "good" based on the position information of the word vectors, covers the target word vector to be corrected with [MASK] vectors, and generates the word vector group of "I [MASK][MASK] Shanghai", that is, the covered word vector group.
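  • A sketch of mask coverage with the modification amplitude parameter, under the assumption (consistent with the example above, where amplitude 2 yields two [MASK] tokens) that `amplitude` consecutive tokens starting at each flagged position are covered:

```python
def mask_cover(tokens, position_info, amplitude=2, mask="[MASK]"):
    """Replace `amplitude` consecutive tokens from each flagged position.
    The start-at-flag interpretation of the amplitude is an assumption."""
    covered = list(tokens)
    for i, flag in enumerate(position_info):
        if flag == 1:
            for j in range(i, min(i + amplitude, len(covered))):
                covered[j] = mask
    return covered

# The example above: typo flagged at position 2, amplitude 2.
print(mask_cover(["I", "good", "self", "Shang", "hai"], [0, 1, 0, 0, 0]))
# ['I', '[MASK]', '[MASK]', 'Shang', 'hai']
```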
  • the server inputs the covered word vector group into the pre-trained error correction network, first generates the error-corrected text corpus, and then, in the pre-trained error correction network, restores the error-corrected text corpus that includes placeholders to generate the target text corpus.
  • for example, the server first inputs the covered word vector group into the pre-trained error correction network for the first text restoration and adds placeholders to generate the error-corrected text corpus "I am from [NONE] Shanghai"; the server then restores "I am from [NONE] Shanghai" to generate the target text corpus "I am from Shanghai".
  • the server first inputs the covered word vector group into the error correction hidden layer, which is located in the pre-trained error correction network, for calculation to generate the error-corrected text corpus including placeholders; the server then predicts the placeholders based on the pre-trained error correction network to generate the predicted placeholder corpus; finally, the server generates the target text corpus based on the predicted placeholder corpus and the error-corrected text corpus.
  • the placeholder is [NONE]; in other embodiments, the placeholder may also be another token, and the number of placeholders is less than or equal to the number of masks.
  • for example, the server inputs the covered word vector group into the error correction hidden layer and generates the error-corrected text corpus "I am from [NONE] Shanghai", where [NONE] is a placeholder. The server then predicts the placeholder; when the server predicts that the placeholder is a null value, it generates the predicted placeholder corpus "null", replaces the placeholder in the error-corrected text corpus with the predicted placeholder corpus, and generates the target text corpus "I am from Shanghai".
  • the server inputs the covered word vector group into the error correction hidden layer, which is located in the pre-trained error correction network, for calculation and generates the error-corrected text corpus including placeholders; this specifically includes:
  • the server first inputs the covered word vector group into the pre-trained error correction network to generate hidden-layer text information; the server initializes the hidden-layer text information to generate an initialization vector sequence; the server then calculates the score of the initialization vector sequence based on the attention mechanism to generate the attention weight scores; finally, the server inputs the attention weight scores and the hidden-layer text information into the error correction hidden layer for calculation and generates the error-corrected text corpus including placeholders.
  • Q, K, and V are the initialization vector sequences, obtained by multiplying the input by the weight matrices W_Q, W_K, and W_V respectively.
  • the server performs the calculation based on the attention mechanism, in which the attention weight scores take the scaled dot-product form softmax(Q K^T / sqrt(d_k)) V, where T denotes the matrix transpose and d_k is the vector dimension.
  • the server inputs the multiple attention weight scores into the error correction hidden layer, where the computation is performed to generate the error-corrected text corpus including placeholders.
  • In this way, the word vectors of typo words are covered by masks to generate the covered word vector group; error correction is then performed based on the pre-trained error correction network, with placeholders added, to generate the error-corrected text corpus; finally, the error-corrected text corpus is restored to generate the target text corpus. This solves the problem of unaligned text in the corpus to be corrected and thereby improves the accuracy of text error correction.
  • an embodiment of the text error correction device in the embodiment of the present application includes:
  • the obtaining module 301 is used for obtaining the text corpus to be corrected, and inputting the text corpus to be corrected into a pre-trained embedding layer to generate a word vector group to be corrected;
  • the location information generation module 302 is used to input the word vector group to be corrected into the pre-trained detection discriminator, and generate the location information of the word vector;
  • Covering module 303 configured to mask and cover the to-be-corrected word vector group according to the position information of the word vector, and generate a covered word vector group;
  • the text corpus generation module 304 is used to input the covered word vector group into a pre-trained error correction network to generate an error-corrected text corpus, and to restore the error-corrected text corpus based on the pre-trained error correction network to generate a target text corpus, where the error-corrected text corpus includes placeholders.
  • In this way, the word vectors of typo words are covered by masks to generate the covered word vector group; error correction is then performed based on the pre-trained error correction network, with placeholders added, to generate the error-corrected text corpus; finally, the error-corrected text corpus is restored to generate the target text corpus. This solves the problem of unaligned text in the corpus to be corrected and thereby improves the accuracy of text error correction.
  • FIG. 4 another embodiment of the text error correction apparatus in the embodiment of the present application includes:
  • the obtaining module 301 is used for obtaining the text corpus to be corrected, and inputting the text corpus to be corrected into a pre-trained embedding layer to generate a word vector group to be corrected;
  • the location information generation module 302 is used to input the word vector group to be corrected into the pre-trained detection discriminator, and generate the location information of the word vector;
  • Covering module 303 configured to mask and cover the to-be-corrected word vector group according to the position information of the word vector, and generate a covered word vector group;
  • the text corpus generation module 304 is used to input the covered word vector group into a pre-trained error correction network to generate an error-corrected text corpus, and to restore the error-corrected text corpus based on the pre-trained error correction network to generate a target text corpus, where the error-corrected text corpus includes placeholders.
  • the obtaining module 301 can also be specifically used for:
  • performing one-hot encoding on the text corpus to be corrected to generate the text code to be corrected; reading the mapping matrix from the pre-trained embedding layer, inputting the text code to be corrected into the pre-trained embedding layer, and multiplying the text code to be corrected by the mapping matrix to generate the word vector group to be corrected.
  • the location information generation module 302 can also be specifically used for:
  • the position information of the word vector is determined based on the position probability.
  • the covering module 303 can also be specifically used for:
  • a target word vector to be corrected is determined in the group of word vectors to be corrected, and the position information of the target word vector to be corrected is the position information of the misspelled word vector;
  • Mask coverage is performed on the target word vector to be corrected based on the modification magnitude parameter, and a covered word vector group is generated.
  • the text corpus generation module 304 includes:
  • the computing unit 3041 is used to input the covered word vector group into the error correction hidden layer for calculation and generate the error-corrected text corpus, where the error correction hidden layer is located in the pre-trained error correction network and the error-corrected text corpus includes placeholders;
  • a prediction unit 3042 configured to predict the placeholder based on the pre-trained error correction network, and generate a predicted placeholder corpus
  • the text corpus generating unit 3043 is configured to generate a target text corpus based on the predicted placeholder corpus and the error-corrected text corpus.
  • the computing unit 3041 can also be specifically used for:
  • inputting the covered word vector group into the pre-trained error correction network to generate hidden layer text information;
  • initializing the hidden layer text information to generate initialized vector sequences;
  • calculating the initialized vector sequences based on an attention mechanism to generate multiple attention weight scores;
  • inputting the multiple attention weight scores and the hidden layer text information into the error correction hidden layer for calculation to generate the error-corrected text corpus, the error correction hidden layer being located in the pre-trained error correction network, and the error-corrected text corpus including placeholders.
  • the text error correction device further includes:
  • the training module 305 is used to obtain a text corpus training data set and a text corpus verification data set, and to use the text corpus training data to train the generator and the discriminator, generating an initial detection generator and an initial detection discriminator;
  • the adjustment module 306 is configured to use the text corpus verification data set to perform adjustment based on the initial detection generator and the initial detection discriminator, generating a pre-trained detection discriminator.
  • in the embodiments of the present application, the word vectors of misspelled words are covered by masks to generate a covered word vector group; error correction is then performed and placeholders are added based on the pre-trained error correction network to generate an error-corrected text corpus; finally, the error-corrected text corpus is restored to generate the target text corpus. This solves the problem of unaligned text in the corpus to be corrected, thereby improving the accuracy of text error correction.
  • FIG. 5 is a schematic structural diagram of a text error correction device provided by an embodiment of the present application.
  • the text error correction device 500 may vary greatly due to different configurations or performance, and may include one or more processors (central processing units, CPU) 510 (e.g., one or more processors), a memory 520, and one or more storage media 530 (e.g., one or more mass storage devices) storing application programs 533 or data 532.
  • the memory 520 and the storage medium 530 may be short-term storage or persistent storage.
  • the program stored in the storage medium 530 may include one or more modules (not shown in the figure), and each module may include a series of instruction operations on the text error correction apparatus 500 .
  • the processor 510 may be configured to communicate with the storage medium 530 to execute a series of instruction operations in the storage medium 530 on the text error correction device 500 .
  • the text error correction device 500 may also include one or more power supplies 540, one or more wired or wireless network interfaces 550, one or more input/output interfaces 560, and/or one or more operating systems 531, such as Windows Server, Mac OS X, Unix, Linux, FreeBSD, and more.
  • the present application further provides a text error correction device; the computer device includes a memory and a processor, and computer-readable instructions are stored in the memory;
  • when the computer-readable instructions are executed by the processor, the processor is made to execute the steps of the text error correction method.
  • the present application also provides a computer-readable storage medium.
  • the computer-readable storage medium may be a non-volatile computer-readable storage medium.
  • the computer-readable storage medium may also be a volatile computer-readable storage medium.
  • the computer-readable storage medium stores instructions that, when executed on a computer, cause the computer to perform the steps of the text error correction method.
  • the blockchain referred to in this application is a new application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanism, and encryption algorithm.
  • a blockchain is essentially a decentralized database: a series of data blocks generated in association with cryptographic methods, where each data block contains a batch of network transaction information used to verify the validity of its information (anti-counterfeiting) and to generate the next block.
  • the blockchain can include the underlying platform of the blockchain, the platform product service layer, and the application service layer.
  • the integrated unit if implemented in the form of a software functional unit and sold or used as an independent product, may be stored in a computer-readable storage medium.
  • the technical solutions of the present application, in essence, or the parts contributing to the prior art, or all or part of the technical solutions, may be embodied in the form of a software product; the computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute all or part of the steps of the methods described in the various embodiments of the present application.
  • the aforementioned storage medium includes: a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disc, and other media that can store program code.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Optimization (AREA)
  • Databases & Information Systems (AREA)
  • Pure & Applied Mathematics (AREA)
  • Mathematical Analysis (AREA)
  • Computational Mathematics (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computing Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Algebra (AREA)
  • Software Systems (AREA)
  • Machine Translation (AREA)
  • Document Processing Apparatus (AREA)

Abstract

A text error correction method, apparatus, device, and storage medium, relating to the field of artificial intelligence, used to solve the problem that the to-be-corrected corpus text is unaligned and to improve the accuracy of text error correction. The text error correction method includes: obtaining a to-be-corrected text corpus, and inputting the to-be-corrected text corpus into a pre-trained embedding layer to generate a to-be-corrected word vector group; inputting the to-be-corrected word vector group into a pre-trained detection discriminator to generate position information of the word vectors; performing mask coverage on the to-be-corrected word vector group according to the position information of the word vectors to generate a covered word vector group; inputting the covered word vector group into a pre-trained error correction network to generate an error-corrected text corpus, and restoring the error-corrected text corpus based on the pre-trained error correction network to generate a target text corpus, the error-corrected text corpus including placeholders. The text error correction method further relates to blockchain technology, and the to-be-corrected text corpus can be stored in a blockchain.

Description

Text error correction method, apparatus, device, and storage medium
This application claims priority to the Chinese patent application filed with the China National Intellectual Property Administration on January 28, 2021, with application number 202110117570.9 and invention title "文本纠错方法、装置、设备及存储介质" ("Text error correction method, apparatus, device, and storage medium"), the entire content of which is incorporated herein by reference.
Technical Field
The present application relates to the field of machine learning, and in particular to a text error correction method, apparatus, device, and storage medium.
Background
In processing scenarios involving natural-language text, such as official document writing, copy editing, input-method error correction, and the output of text results after speech recognition, extra characters, wrong characters, and missing characters occur from time to time. Setting up dedicated manual proofreading and verification for such cases requires high labor and time costs, and in many cases a high correction accuracy still cannot be guaranteed. Consequently, artificial intelligence models for text error correction have emerged and are widely applied in real-world scenarios.
In the prior art, the inventors realized that traditional machine learning or statistics-based models perform unsatisfactorily in the field of text error correction. The main approach is to correct text using a bidirectional pre-trained language model, but this approach has low correction accuracy on unaligned corpora such as English.
Summary
The present application provides a text error correction method, apparatus, device, and storage medium, which are used to solve the problem that the to-be-corrected corpus text is unaligned and to improve the accuracy of text error correction.
A first aspect of the present application provides a text error correction method, including: obtaining a to-be-corrected text corpus, and inputting the to-be-corrected text corpus into a pre-trained embedding layer to generate a to-be-corrected word vector group; inputting the to-be-corrected word vector group into a pre-trained detection discriminator to generate position information of the word vectors; performing mask coverage on the to-be-corrected word vector group according to the position information of the word vectors to generate a covered word vector group; inputting the covered word vector group into a pre-trained error correction network to generate an error-corrected text corpus, and restoring the error-corrected text corpus based on the pre-trained error correction network to generate a target text corpus, the error-corrected text corpus including placeholders.
A second aspect of the present application provides a text error correction apparatus, including: an obtaining module configured to obtain a to-be-corrected text corpus and input the to-be-corrected text corpus into a pre-trained embedding layer to generate a to-be-corrected word vector group; a position information generation module configured to input the to-be-corrected word vector group into a pre-trained detection discriminator to generate position information of the word vectors; a covering module configured to perform mask coverage on the to-be-corrected word vector group according to the position information of the word vectors to generate a covered word vector group; and a text corpus generation module configured to input the covered word vector group into a pre-trained error correction network to generate an error-corrected text corpus, and to restore the error-corrected text corpus based on the pre-trained error correction network to generate a target text corpus, the error-corrected text corpus including placeholders.
A third aspect of the present application provides a text error correction device, including a memory and at least one processor, the memory storing instructions; the at least one processor invokes the instructions in the memory so that the text error correction device executes the following text error correction method:
obtaining a to-be-corrected text corpus, and inputting the to-be-corrected text corpus into a pre-trained embedding layer to generate a to-be-corrected word vector group; inputting the to-be-corrected word vector group into a pre-trained detection discriminator to generate position information of the word vectors; performing mask coverage on the to-be-corrected word vector group according to the position information of the word vectors to generate a covered word vector group; inputting the covered word vector group into a pre-trained error correction network to generate an error-corrected text corpus, and restoring the error-corrected text corpus based on the pre-trained error correction network to generate a target text corpus, the error-corrected text corpus including placeholders.
A fourth aspect of the present application provides a computer-readable storage medium storing instructions which, when run on a computer, cause the computer to execute the following text error correction method:
obtaining a to-be-corrected text corpus, and inputting the to-be-corrected text corpus into a pre-trained embedding layer to generate a to-be-corrected word vector group; inputting the to-be-corrected word vector group into a pre-trained detection discriminator to generate position information of the word vectors; performing mask coverage on the to-be-corrected word vector group according to the position information of the word vectors to generate a covered word vector group; inputting the covered word vector group into a pre-trained error correction network to generate an error-corrected text corpus, and restoring the error-corrected text corpus based on the pre-trained error correction network to generate a target text corpus, the error-corrected text corpus including placeholders.
In the technical solution provided by the present application, a to-be-corrected text corpus is obtained and input into a pre-trained embedding layer to generate a to-be-corrected word vector group; the to-be-corrected word vector group is input into a pre-trained detection discriminator to generate position information of the word vectors; mask coverage is performed on the to-be-corrected word vector group according to the position information of the word vectors to generate a covered word vector group; the covered word vector group is input into a pre-trained error correction network to generate an error-corrected text corpus, and the error-corrected text corpus is restored based on the pre-trained error correction network to generate a target text corpus, the error-corrected text corpus including placeholders. In the embodiments of the present application, masks are used to cover the word vectors of misspelled words to generate a covered word vector group; error correction is then performed and placeholders are added based on the pre-trained error correction network to generate an error-corrected text corpus; finally, the error-corrected text corpus is restored to generate the target text corpus. This solves the problem that the to-be-corrected corpus text is unaligned, thereby improving the accuracy of text error correction.
Brief Description of the Drawings
FIG. 1 is a schematic diagram of an embodiment of the text error correction method in an embodiment of the present application;
FIG. 2 is a schematic diagram of another embodiment of the text error correction method in an embodiment of the present application;
FIG. 3 is a schematic diagram of an embodiment of the text error correction apparatus in an embodiment of the present application;
FIG. 4 is a schematic diagram of another embodiment of the text error correction apparatus in an embodiment of the present application;
FIG. 5 is a schematic diagram of an embodiment of the text error correction device in an embodiment of the present application.
Detailed Description
The embodiments of the present application provide a text error correction method, apparatus, device, and storage medium, which are used to solve the problem that the to-be-corrected corpus text is unaligned and to improve the accuracy of text error correction.
The terms "first", "second", "third", "fourth", etc. (if any) in the specification, claims, and drawings of the present application are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence. It should be understood that data so used are interchangeable under appropriate circumstances, so that the embodiments described here can be implemented in an order other than that illustrated or described here. In addition, the terms "include" or "have" and any variants thereof are intended to cover non-exclusive inclusion; for example, a process, method, system, product, or device that includes a series of steps or units is not necessarily limited to the steps or units clearly listed, but may include other steps or units not clearly listed or inherent to the process, method, product, or device.
For ease of understanding, the specific flow of the embodiments of the present application is described below. Referring to FIG. 1, one embodiment of the text error correction method in the embodiments of the present application includes:
101. Obtain a to-be-corrected text corpus, and input the to-be-corrected text corpus into a pre-trained embedding layer to generate a to-be-corrected word vector group;
The server obtains the to-be-corrected text corpus and inputs it into the pre-trained embedding layer for vectorization, generating the to-be-corrected word vector group. It should be emphasized that, to further ensure the privacy and security of the to-be-corrected text corpus, the to-be-corrected text corpus may also be stored in a node of a blockchain.
The to-be-corrected text corpus may be converted from input text or from input speech, and may be a Chinese text corpus, for example "我来自上海", or an English text corpus, for example "I come from Shanghai". When the to-be-corrected text corpus is obtained, the server inputs the Chinese or English to-be-corrected text corpus into the pre-trained embedding layer, i.e., the Embedding layer, for vectorization, generating the to-be-corrected vector group.
It can be understood that the execution subject of the present application may be a text error correction apparatus, or may be a terminal or a server, which is not specifically limited here. The embodiments of the present application are described with the server as the execution subject.
102. Input the to-be-corrected word vector group into the pre-trained detection discriminator to generate position information of the word vectors;
The server inputs the to-be-corrected word vector group into the pre-trained detection discriminator to discriminate the positions of the word vectors, generating the position information of the word vectors.
For example, suppose the to-be-corrected word vector group is (h_1, h_2, ..., h_n). The server inputs (h_1, h_2, ..., h_n) into the pre-trained detection discriminator, which identifies the position information of each word vector in the group and generates position information including that of the misspelled word vector. For example, if (h_1, h_2, ..., h_n) is the vector group of the to-be-corrected corpus "我好自上海", the server inputs this vector group into the pre-trained detection discriminator, which discriminates it and generates the position information of the word vectors (0, 1, 0, 0, 0), where "0" indicates that the word vector at that position is correct and "1" indicates that the word vector at that position is wrong.
103. Perform mask coverage on the to-be-corrected word vector group according to the position information of the word vectors to generate a covered word vector group;
The server uses masks to cover the to-be-corrected word vector group according to the position information of the word vectors, generating the covered word vector group.
In this embodiment, performing mask coverage on the to-be-corrected word vector group according to the position information of the word vectors can be understood as covering the to-be-corrected word vectors corresponding to misspelled words according to the position information, so that only the word vectors of correct words and the mask vectors covering the misspelled word vectors are retained, thereby obtaining the covered word vector group. For example, for the to-be-corrected word vector group (h_1, h_2, ..., h_n) of "我好自上海" with the position information (0, 1, 0, 0, 0), the server performs mask coverage on (h_1, h_2, ..., h_n) according to (0, 1, 0, 0, 0), generating the word vector group

(h_我, h_[MASK], h_[MASK], h_上, h_海)

i.e., the covered word vector group.
104. Input the covered word vector group into the pre-trained error correction network to generate an error-corrected text corpus, and restore the error-corrected text corpus based on the pre-trained error correction network to generate a target text corpus, the error-corrected text corpus including placeholders.
The server inputs the covered word vector group into the pre-trained error correction network, first generating the error-corrected text corpus, and then restores the error-corrected text corpus including placeholders in the pre-trained error correction network, generating the target text corpus.
For some unaligned to-be-corrected text corpora, adding placeholders can solve the problem of low text correction accuracy caused by corpus misalignment. In this embodiment, the server first inputs the covered word vector group (h_我, h_[MASK], h_[MASK], h_上, h_海) into the pre-trained error correction network for a first text restoration and placeholder insertion, generating the error-corrected text corpus "我来自[NONE]上海"; the server then restores "我来自[NONE]上海" in the pre-trained error correction network, generating the target text corpus "我来自上海".
In the embodiments of the present application, masks are used to cover the word vectors of misspelled words to generate a covered word vector group; error correction is then performed and placeholders are added based on the pre-trained error correction network to generate an error-corrected text corpus; finally, the error-corrected text corpus is restored to generate the target text corpus. This solves the problem that the to-be-corrected corpus text is unaligned, thereby improving the accuracy of text error correction.
Referring to FIG. 2, another embodiment of the text error correction method in the embodiments of the present application includes:
201. Obtain a text corpus training data set and a text corpus verification data set, and use the text corpus training data to train a generator and a discriminator, generating an initial detection generator and an initial detection discriminator;
The server first obtains the text corpus training data set and the text corpus verification data set, and then uses the text corpus training data set to train the detection generator and the detection discriminator, generating the initial detection generator and the initial detection discriminator.
For the initial detection generator, the calculation formula involved is:

p_G(h_t | h') = exp(e(h_t)^T h_{G,t}) / Σ_{i ∈ Possible Result} exp(e(h_i)^T h_{G,t})

where e is a trainable parameter, i.e., an adjustable parameter, h is a word vector, h' is the text corpus training data, and t is the position of the word.
For the initial detection discriminator, the calculation formula involved is:

p_D(h', t) = sigmoid(w^T h_{G,t})

where p_D(h', t) is the position probability, h' is the text corpus training data, t is the position of the word, and w^T h_{G,t} is a vector, T being an operator denoting the transpose of a matrix.
202. Use the text corpus verification data set to perform adjustment based on the initial detection generator and the initial detection discriminator, generating a pre-trained detection discriminator;
After the initial detection generator and the initial detection discriminator are generated, the text corpus verification data set is used to adjust the initial detection generator and the initial detection discriminator jointly, thereby generating the pre-trained detection discriminator.
In this embodiment, the pre-trained detection discriminator is mainly used for the subsequent text error correction, so in the end only the pre-trained detection discriminator is retained for use. During training and adjustment, however, the output of the detection generator needs to be referenced to train or adjust the detection discriminator; the detection generator and the detection discriminator are therefore trained and adjusted jointly. After obtaining the initial detection generator and the initial detection discriminator, the server adjusts each of them with the corresponding loss function.
For the initial detection generator, the loss function involved is:

L_G = E( Σ_{t ∈ I} -log p_G(h_t | h') )

where I is the set of mask positions and p_G(h_t | h') is the output of the initial detection generator. This loss function is used to adjust the initial detection network generator, generating a transition detection generator.
For the initial detection discriminator, the loss function involved is:

L_D = E( Σ_t -1(h'_t = h_t) log p_D(h', t) - 1(h'_t ≠ h_t) log(1 - p_D(h', t)) )

where p_D(h', t) is the output of the initial detection discriminator. The server uses this loss function to adjust the initial detection network discriminator, generating a transition detection discriminator.
Finally, the server uses a fusion formula to fuse the above two loss functions at a preset ratio, thereby minimizing the loss function and generating the pre-trained detection network discriminator. The fusion formula is:

min_{θ_G, θ_D} Σ_h ( L_G(h, θ_G) + λ L_D(h, θ_D) )

where λ is the preset ratio; in this embodiment, the ratio is 50%.
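As a concrete illustration of how the two losses could be fused at a preset ratio, the sketch below implements an ELECTRA-style combination in plain Python. The loss forms and all numeric values are illustrative assumptions for this sketch, not the patent's exact formulation or real model outputs.

```python
import math

def generator_loss(log_probs_at_masked):
    """L_G: the generator's masked-language-model loss, i.e. the negative
    log-likelihood of the original words at the mask positions I."""
    return -sum(log_probs_at_masked)

def discriminator_loss(labels, probs):
    """L_D: binary cross-entropy between the discriminator's per-position
    probabilities p_D(h', t) and the per-position labels (1 = wrong word)."""
    return -sum(y * math.log(p) + (1 - y) * math.log(1.0 - p)
                for y, p in zip(labels, probs))

def fused_loss(l_g, l_d, lam=0.5):
    """Fuse the two losses at the preset ratio lambda (50% in the text)."""
    return l_g + lam * l_d

# Illustrative numbers only: generator probabilities at two masked
# positions, and discriminator probabilities for "我好自上海".
l_g = generator_loss([math.log(0.7), math.log(0.6)])
l_d = discriminator_loss([0, 1, 0, 0, 0], [0.1, 0.9, 0.2, 0.15, 0.1])
total = fused_loss(l_g, l_d)
```

Minimizing `total` over the parameters of both networks corresponds to the fusion formula above with λ = 0.5.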
203. Obtain a to-be-corrected text corpus, and input the to-be-corrected text corpus into the pre-trained embedding layer to generate a to-be-corrected word vector group;
The server obtains the to-be-corrected text corpus and inputs it into the pre-trained embedding layer for vectorization, generating the to-be-corrected word vector group. It should be emphasized that, to further ensure the privacy and security of the to-be-corrected text corpus, the to-be-corrected text corpus may also be stored in a node of a blockchain.
The to-be-corrected text corpus may be converted from input text or from input speech, and may be a Chinese text corpus, for example "我来自上海", or an English text corpus, for example "I come from Shanghai". When the to-be-corrected text corpus is obtained, the server inputs the Chinese or English to-be-corrected text corpus into the pre-trained embedding layer, i.e., the Embedding layer, for vectorization, generating the to-be-corrected vector group.
Specifically, the server first obtains the to-be-corrected text corpus and performs one-hot encoding on it, generating a to-be-corrected text encoding; the server then reads a mapping matrix from the pre-trained embedding layer, inputs the to-be-corrected text encoding into the pre-trained embedding layer, and multiplies the to-be-corrected text encoding by the mapping matrix in the pre-trained embedding layer, generating the to-be-corrected word vector group.
It should be noted that the embedding layer, i.e., the Embedding layer, is a fully connected layer. Through this embedding layer, the server can map the to-be-corrected text corpus from one space into another, thereby obtaining the to-be-corrected word vector group. Specifically, the server can read a mapping matrix from the fully connected layer and multiply the to-be-corrected text corpus by this mapping matrix to obtain the to-be-corrected vector group.
The server first performs one-hot encoding (one-hot) on the to-be-corrected text corpus, generating the to-be-corrected text encoding (x_1, x_2, ..., x_n); the server then reads the mapping matrix W from the pre-trained embedding layer, W being a matrix with one row per vocabulary entry and one column per embedding dimension.
Finally, the server multiplies (x_1, x_2, ..., x_n) by the mapping matrix, generating the to-be-corrected word vector group (h_1, h_2, ..., h_n).
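The one-hot-encoding-times-mapping-matrix step can be sketched in plain Python as follows; the vocabulary, the embedding dimension, and the matrix values are illustrative assumptions, not the patent's actual parameters:

```python
# Hypothetical vocabulary covering the example corpus "我好自上海".
vocab = ["我", "好", "自", "上", "海"]

# Hypothetical 3-dimensional mapping matrix W: one row per vocabulary entry.
W = [
    [0.1, 0.2, 0.3],
    [0.4, 0.5, 0.6],
    [0.7, 0.8, 0.9],
    [1.0, 1.1, 1.2],
    [1.3, 1.4, 1.5],
]

def one_hot(index, size):
    """One-hot encode a vocabulary index (the x_1, ..., x_n in the text)."""
    return [1.0 if i == index else 0.0 for i in range(size)]

def matvec(vec, matrix):
    """Multiply a row vector by a matrix: the embedding-layer multiplication."""
    return [sum(v * row[j] for v, row in zip(vec, matrix))
            for j in range(len(matrix[0]))]

# Encode the corpus and multiply each code by W to obtain the
# to-be-corrected word vector group (h_1, ..., h_n).
codes = [one_hot(vocab.index(ch), len(vocab)) for ch in "我好自上海"]
word_vectors = [matvec(c, W) for c in codes]

# Multiplying a one-hot code by W simply selects the matching row of W.
assert word_vectors[0] == W[0]
```

This also shows why the embedding layer can be read as a fully connected layer: the one-hot multiplication is just a row lookup in the mapping matrix.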
204. Input the to-be-corrected word vector group into the pre-trained detection discriminator to generate position information of the word vectors;
The server inputs the to-be-corrected word vector group into the pre-trained detection discriminator to discriminate the positions of the word vectors, generating the position information of the word vectors.
For example, suppose the to-be-corrected word vector group is (h_1, h_2, ..., h_n). The server inputs (h_1, h_2, ..., h_n) into the pre-trained detection discriminator, which identifies the position information of each word vector in the group and generates position information including that of the misspelled word vector. For example, if (h_1, h_2, ..., h_n) is the vector group of the to-be-corrected corpus "我好自上海", the server inputs this vector group into the pre-trained detection discriminator, which discriminates it and generates the position information of the word vectors (0, 1, 0, 0, 0), where "0" indicates that the word vector at that position is correct and "1" indicates that the word vector at that position is wrong.
Specifically, the server first inputs the to-be-corrected word vector group and the to-be-corrected text corpus into a detection linear layer for calculation, generating a to-be-calculated vector group, the detection linear layer being located in the pre-trained detection discriminator; the server then performs probability calculation on the to-be-calculated vector group according to a preset identifier probability formula, generating position probabilities; finally, the server determines the position information of the word vectors based on the position probabilities.
The server inputs the to-be-corrected word vector group and the to-be-corrected text corpus in parallel into the detection linear layer in the pre-trained detection discriminator, where the to-be-corrected word vectors are calculated with reference to the to-be-corrected text corpus, generating the to-be-calculated vector group; a preset identifier probability formula, i.e., an activation function, is then applied to the to-be-calculated vector group, generating position probabilities; finally, the position information of the word vectors is determined with reference to the position probabilities, the to-be-corrected text corpus, and the to-be-corrected word vectors. The identifier probability formula is:

p_D(h', t) = sigmoid(w^T h_{G,t})

where p_D(h', t) is the position probability, h' is the to-be-corrected text corpus, t is the position of the word, and w^T h_{G,t} is the to-be-calculated vector group, T being an operator denoting the transpose of a matrix; in this embodiment, however, its role is the inner product of the matrix W and the matrix h, where W and h are computed in the linear layer based on the to-be-corrected word vector group.
The server then determines the largest position probability as the misspelled-word position probability. For example, for the to-be-corrected text corpus "我好自上海", where "我" is the first position, "好" the second, "自" the third, "上" the fourth, and "海" the fifth, the server generates through the above calculation multiple position probabilities 0.5, 0.9, 0.65, 0.6, and 0.55, and based on these position probabilities generates the position information of the word vectors (0, 1, 0, 0, 0), where "0" indicates that the word vector at that position is correct and "1" indicates that it is wrong.
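A minimal sketch of turning per-position probabilities p_D(h', t) = sigmoid(w^T h) into position information, assuming (as in the example above) that the highest-probability position is flagged as the typo. The weight vector and the word vectors below are made-up stand-ins for a trained discriminator's parameters and hidden states:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def position_probabilities(w, vectors):
    """Compute p_D(h', t) = sigmoid(w^T h_t) for every position t."""
    return [sigmoid(dot(w, h)) for h in vectors]

def position_info(probs):
    """Flag the highest-probability position as the typo (1), the rest as
    correct (0), matching the "我好自上海" -> (0, 1, 0, 0, 0) example."""
    top = max(probs)
    return [1 if p == top else 0 for p in probs]

# Illustrative discriminator weights and per-position word vectors.
w = [0.2, -0.4, 0.6]
vectors = [[0.1, 0.3, 0.2], [2.0, -1.5, 1.8], [0.0, 0.2, 0.1],
           [0.3, 0.1, 0.0], [0.2, 0.0, 0.1]]
probs = position_probabilities(w, vectors)
info = position_info(probs)   # [0, 1, 0, 0, 0]: position 2 is flagged
```

With these stand-in values, the second position gets by far the largest score, so the resulting position information is (0, 1, 0, 0, 0).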
205. Perform mask coverage on the to-be-corrected word vector group according to the position information of the word vectors to generate a covered word vector group;
The server uses masks to cover the to-be-corrected word vector group according to the position information of the word vectors, generating the covered word vector group.
In this embodiment, performing mask coverage on the to-be-corrected word vector group according to the position information of the word vectors can be understood as covering the to-be-corrected word vectors corresponding to misspelled words according to the position information, so that only the word vectors of correct words and the mask vectors covering the misspelled word vectors are retained, thereby obtaining the covered word vector group. For example, for the to-be-corrected word vector group (h_1, h_2, ..., h_n) of "我好自上海" with the position information (0, 1, 0, 0, 0), the server performs mask coverage on (h_1, h_2, ..., h_n) according to (0, 1, 0, 0, 0), generating the word vector group

(h_我, h_[MASK], h_[MASK], h_上, h_海)

i.e., the covered word vector group.
Specifically, the server first obtains a modification magnitude parameter, which is a natural number; the server then determines, based on the position information of the word vectors, the target to-be-corrected word vector in the to-be-corrected word vector group corresponding to the position information of the misspelled word vector; finally, the server performs mask coverage on the target to-be-corrected word vector based on the modification magnitude parameter, generating the covered word vector group.
It should be noted that the modification magnitude parameter is a natural number such as 0, 1, 2, or 3. In this embodiment, for example, the modification magnitude parameter is 2, the position information of the word vectors is (0, 1, 0, 0, 0), and the to-be-corrected word vector group is that of "我好自上海". Based on the position information of the word vectors, the server determines that the target to-be-corrected word vector is that of "好", and covers the target to-be-corrected word vector with [MASK] vectors, generating the word vector group of "我[MASK][MASK]上海", i.e., the covered word vector group (h_我, h_[MASK], h_[MASK], h_上, h_海).
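The covering step with a modification magnitude parameter can be sketched as follows, reproducing the example where a magnitude of 2 turns "我好自上海" into "我[MASK][MASK]上海". String tokens stand in for word vectors here, and the magnitude is assumed to be at least 1; the real apparatus operates on vectors:

```python
MASK = "[MASK]"  # stand-in token for the mask vector

def mask_cover(tokens, position_info, magnitude):
    """Cover `magnitude` consecutive tokens starting at each flagged
    position (a 1 in the position information) with mask tokens; tokens
    at unflagged positions are kept as-is."""
    out = []
    i = 0
    while i < len(tokens):
        if position_info[i] == 1:
            out.extend([MASK] * magnitude)
            i += magnitude          # the covered tokens are consumed
        else:
            out.append(tokens[i])
            i += 1
    return out

covered = mask_cover(["我", "好", "自", "上", "海"], [0, 1, 0, 0, 0], 2)
# ['我', '[MASK]', '[MASK]', '上', '海'] — i.e., "我[MASK][MASK]上海"
```

With position information (0, 1, 0, 0, 0) and magnitude 2, the flagged "好" and the following "自" are both covered, matching the example in the text.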
206. Input the covered word vector group into the pre-trained error correction network to generate an error-corrected text corpus, and restore the error-corrected text corpus based on the pre-trained error correction network to generate a target text corpus, the error-corrected text corpus including placeholders.
The server inputs the covered word vector group into the pre-trained error correction network, first generating the error-corrected text corpus, and then restores the error-corrected text corpus including placeholders in the pre-trained error correction network, generating the target text corpus.
For some unaligned to-be-corrected text corpora, adding placeholders can solve the problem of low text correction accuracy caused by corpus misalignment. In this embodiment, the server first inputs the covered word vector group (h_我, h_[MASK], h_[MASK], h_上, h_海) into the pre-trained error correction network for a first text restoration and placeholder insertion, generating the error-corrected text corpus "我来自[NONE]上海"; the server then restores "我来自[NONE]上海" in the pre-trained error correction network, generating the target text corpus "我来自上海".
Specifically, the server first inputs the covered word vector group into an error correction hidden layer for calculation, generating the error-corrected text corpus including placeholders, the error correction hidden layer being located in the pre-trained error correction network; the server then predicts the placeholders based on the pre-trained error correction network, generating a predicted placeholder corpus; finally, the server generates the target text corpus based on the predicted placeholder corpus and the error-corrected text corpus.
It should be noted that in this embodiment the placeholder is [NONE]; in other embodiments the placeholder may be something else, and the number of placeholders is less than or equal to the number of masks.
For example, the covered word vector group is (h_我, h_[MASK], h_[MASK], h_上, h_海). The server inputs the covered word vectors into the error correction hidden layer, generating the error-corrected text corpus "我来自[NONE]上海", where [NONE] is a placeholder. The server predicts this placeholder; here the server predicts that the placeholder is a null value, generating the predicted placeholder corpus "null value", and replaces the placeholder in the error-corrected text corpus with the predicted placeholder corpus, generating the target text corpus "我来自上海".
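A minimal sketch of the restoration step, assuming the network's placeholder predictions are given as inputs rather than produced by a trained model. An empty prediction deletes the [NONE] placeholder, which is how "我来自[NONE]上海" becomes "我来自上海":

```python
PLACEHOLDER = "[NONE]"

def restore(corrected_tokens, predictions):
    """Replace each [NONE] placeholder with its predicted corpus.
    A prediction of "" (a null value) deletes the placeholder entirely;
    a non-empty prediction fills the gap with predicted text."""
    out = []
    preds = iter(predictions)
    for tok in corrected_tokens:
        if tok == PLACEHOLDER:
            p = next(preds)
            if p:                # non-empty prediction fills the gap
                out.append(p)
            # empty prediction: drop the placeholder
        else:
            out.append(tok)
    return "".join(out)

target = restore(["我", "来", "自", "[NONE]", "上", "海"], [""])
# "我来自上海": the null-value prediction removes the placeholder
```

Because placeholders can be deleted or filled, the corrected output need not have the same length as the masked input, which is how this step handles unaligned corpora.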
The step in which the server inputs the covered word vector group into the error correction hidden layer for calculation, generating the error-corrected text corpus including placeholders, the error correction hidden layer being located in the pre-trained error correction network, specifically includes:
The server first inputs the covered word vector group into the pre-trained error correction network, generating hidden-layer text information; the server initializes the hidden-layer text information, generating initialized vector sequences; the server then performs score calculation on the initialized vector sequences based on the attention mechanism, generating attention weight scores; finally, the server inputs the attention weight scores and the hidden-layer text into the error correction hidden layer for calculation, generating the error-corrected text corpus including placeholders.
The formulas involved in the initialization are:

Q = W_Q x_input
K = W_K x_input
V = W_V x_input

where W_Q, W_K, and W_V are randomly obtained weight parameters with W_Q = W_K = W_V, and x_input is the hidden-layer text information. Q, K, and V are the initialized vector sequences. The server performs the calculation based on the attention mechanism; the calculation formula is:

Attention(Q, K, V) = softmax(QK^T / √d_k) V

In this formula, T is the position of the word; T = 1 means calculating the attention weight score of the first word. Through this calculation, multiple attention weight scores are generated, and the server inputs the multiple attention weight scores into the error correction hidden layer for calculation, generating the error-corrected text corpus including placeholders.
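The initialization and attention computation above can be sketched as standard scaled dot-product attention in plain Python; the input positions and the shared weight matrix (W_Q = W_K = W_V, as in the text) are illustrative assumptions:

```python
import math

def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    s = sum(exps)
    return [e / s for e in exps]

def attention(x, w_q, w_k, w_v):
    """Scaled dot-product attention over the hidden-layer text information x:
    Q = W_Q x, K = W_K x, V = W_V x, then softmax(QK^T / sqrt(d_k)) V.
    Returns the attended output and the attention weight scores."""
    Q, K, V = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    d_k = len(K[0])
    scores = matmul(Q, transpose(K))
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, V), weights

x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # three positions, 2-dim each
w = [[0.5, 0.1], [0.2, 0.7]]               # shared W_Q = W_K = W_V
out, weights = attention(x, w, w, w)

# Each row of attention weights is a probability distribution over positions.
assert all(abs(sum(row) - 1.0) < 1e-9 for row in weights)
```

Each row of `weights` holds one position's attention weight scores over all positions; these are the scores the error correction hidden layer consumes.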
For example, in the earlier example "我[MASK]自上海", the attention score of "上海" is currently the highest, and the attention weight score of "自" is also relatively high. The pre-trained error correction network can then learn that what needs to be restored is roughly a verb preceding a place name, and a verb ending with "自"; it therefore generates the error-corrected text corpus "我来自[NONE]上海".
In the embodiments of the present application, masks are used to cover the word vectors of misspelled words to generate a covered word vector group; error correction is then performed and placeholders are added based on the pre-trained error correction network to generate an error-corrected text corpus; finally, the error-corrected text corpus is restored to generate the target text corpus. This solves the problem that the to-be-corrected corpus text is unaligned, thereby improving the accuracy of text error correction.
The text error correction method in the embodiments of the present application has been described above; the text error correction apparatus in the embodiments of the present application is described below. Referring to FIG. 3, one embodiment of the text error correction apparatus in the embodiments of the present application includes:
an obtaining module 301 configured to obtain a to-be-corrected text corpus and input the to-be-corrected text corpus into a pre-trained embedding layer to generate a to-be-corrected word vector group;
a position information generation module 302 configured to input the to-be-corrected word vector group into a pre-trained detection discriminator to generate position information of the word vectors;
a covering module 303 configured to perform mask coverage on the to-be-corrected word vector group according to the position information of the word vectors to generate a covered word vector group;
a text corpus generation module 304 configured to input the covered word vector group into a pre-trained error correction network to generate an error-corrected text corpus, and to restore the error-corrected text corpus based on the pre-trained error correction network to generate a target text corpus, the error-corrected text corpus including placeholders.
In the embodiments of the present application, masks are used to cover the word vectors of misspelled words to generate a covered word vector group; error correction is then performed and placeholders are added based on the pre-trained error correction network to generate an error-corrected text corpus; finally, the error-corrected text corpus is restored to generate the target text corpus. This solves the problem that the to-be-corrected corpus text is unaligned, thereby improving the accuracy of text error correction.
Referring to FIG. 4, another embodiment of the text error correction apparatus in the embodiments of the present application includes:
an obtaining module 301 configured to obtain a to-be-corrected text corpus and input the to-be-corrected text corpus into a pre-trained embedding layer to generate a to-be-corrected word vector group;
a position information generation module 302 configured to input the to-be-corrected word vector group into a pre-trained detection discriminator to generate position information of the word vectors;
a covering module 303 configured to perform mask coverage on the to-be-corrected word vector group according to the position information of the word vectors to generate a covered word vector group;
a text corpus generation module 304 configured to input the covered word vector group into a pre-trained error correction network to generate an error-corrected text corpus, and to restore the error-corrected text corpus based on the pre-trained error correction network to generate a target text corpus, the error-corrected text corpus including placeholders.
Optionally, the obtaining module 301 may be specifically configured to:
obtain a to-be-corrected text corpus, and perform one-hot encoding on the to-be-corrected text corpus to generate a to-be-corrected text encoding;
read a mapping matrix from the pre-trained embedding layer, input the to-be-corrected text encoding into the pre-trained embedding layer, and multiply the to-be-corrected text encoding by the mapping matrix to generate the to-be-corrected word vector group.
Optionally, the position information generation module 302 may be specifically configured to:
input the to-be-corrected word vector group and the to-be-corrected text corpus into a detection linear layer for calculation to generate a to-be-calculated vector group, the detection linear layer being located in the pre-trained detection discriminator;
perform probability calculation on the to-be-calculated vector group according to a preset identifier probability formula to generate position probabilities;
determine the position information of the word vectors based on the position probabilities.
Optionally, the covering module 303 may be specifically configured to:
obtain a preset modification magnitude parameter, the modification magnitude parameter being a natural number;
determine a target to-be-corrected word vector in the to-be-corrected word vector group based on the position information of the word vectors, the position information of the target to-be-corrected word vector being the position information of the misspelled word vector;
perform mask coverage on the target to-be-corrected word vector based on the modification magnitude parameter to generate a covered word vector group.
Optionally, the text corpus generation module 304 includes:
a computing unit 3041 configured to input the covered word vector group into an error correction hidden layer for calculation to generate the error-corrected text corpus, the error correction hidden layer being located in the pre-trained error correction network, the error-corrected text corpus including placeholders;
a prediction unit 3042 configured to predict the placeholders based on the pre-trained error correction network to generate a predicted placeholder corpus;
a text corpus generating unit 3043 configured to generate the target text corpus based on the predicted placeholder corpus and the error-corrected text corpus.
Optionally, the computing unit 3041 may be specifically configured to:
input the covered word vector group into the pre-trained error correction network to generate hidden-layer text information;
initialize the hidden-layer text information to generate initialized vector sequences;
calculate the initialized vector sequences based on an attention mechanism to generate multiple attention weight scores;
input the multiple attention weight scores and the hidden-layer text information into the error correction hidden layer for calculation to generate the error-corrected text corpus, the error correction hidden layer being located in the pre-trained error correction network, the error-corrected text corpus including placeholders.
Optionally, the text error correction apparatus further includes:
a training module 305 configured to obtain a text corpus training data set and a text corpus verification data set, and to use the text corpus training data to train a generator and a discriminator, generating an initial detection generator and an initial detection discriminator;
an adjustment module 306 configured to use the text corpus verification data set to perform adjustment based on the initial detection generator and the initial detection discriminator, generating a pre-trained detection discriminator.
In the embodiments of the present application, masks are used to cover the word vectors of misspelled words to generate a covered word vector group; error correction is then performed and placeholders are added based on the pre-trained error correction network to generate an error-corrected text corpus; finally, the error-corrected text corpus is restored to generate the target text corpus. This solves the problem that the to-be-corrected corpus text is unaligned, thereby improving the accuracy of text error correction.
FIGS. 3 and 4 above describe the text error correction apparatus in the embodiments of the present application in detail from the perspective of modular functional entities; the text error correction device in the embodiments of the present application is described in detail below from the perspective of hardware processing.
FIG. 5 is a schematic structural diagram of a text error correction device provided by an embodiment of the present application. The text error correction device 500 may vary greatly due to different configurations or performance, and may include one or more processors (central processing units, CPU) 510 (for example, one or more processors), a memory 520, and one or more storage media 530 (for example, one or more mass storage devices) storing application programs 533 or data 532. The memory 520 and the storage medium 530 may be short-term storage or persistent storage. The program stored in the storage medium 530 may include one or more modules (not shown in the figure), and each module may include a series of instruction operations on the text error correction device 500. Furthermore, the processor 510 may be configured to communicate with the storage medium 530 and execute, on the text error correction device 500, the series of instruction operations in the storage medium 530.
The text error correction device 500 may also include one or more power supplies 540, one or more wired or wireless network interfaces 550, one or more input/output interfaces 560, and/or one or more operating systems 531, such as Windows Server, Mac OS X, Unix, Linux, FreeBSD, and so on. Those skilled in the art can understand that the structure of the text error correction device shown in FIG. 5 does not constitute a limitation on the text error correction device, which may include more or fewer components than shown, combine certain components, or arrange components differently.
The present application further provides a text error correction device. The computer device includes a memory and a processor, the memory storing computer-readable instructions which, when executed by the processor, cause the processor to execute the steps of the text error correction method in the above embodiments.
The present application further provides a computer-readable storage medium, which may be a non-volatile computer-readable storage medium or a volatile computer-readable storage medium. The computer-readable storage medium stores instructions which, when run on a computer, cause the computer to execute the steps of the text error correction method.
Those skilled in the art can clearly understand that, for convenience and brevity of description, the specific working processes of the systems, apparatuses, and units described above may refer to the corresponding processes in the foregoing method embodiments, and are not repeated here.
The blockchain referred to in the present application is a new application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanisms, and encryption algorithms. A blockchain is essentially a decentralized database: a series of data blocks generated in association with cryptographic methods, where each data block contains a batch of network transaction information used to verify the validity of its information (anti-counterfeiting) and to generate the next block. A blockchain can include the underlying blockchain platform, the platform product service layer, the application service layer, and so on.
If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on this understanding, the technical solutions of the present application, in essence, or the parts contributing to the prior art, or all or part of the technical solutions, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute all or part of the steps of the methods described in the various embodiments of the present application. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a read-only memory (read-only memory, ROM), a random access memory (random access memory, RAM), a magnetic disk, an optical disc, and other media that can store program code.
The above embodiments are only used to illustrate the technical solutions of the present application, not to limit them. Although the present application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that they can still modify the technical solutions recorded in the foregoing embodiments, or make equivalent replacements of some of the technical features therein; and such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present application.

Claims (20)

  1. A text error correction method, wherein the text error correction method comprises:
    obtaining a to-be-corrected text corpus, and inputting the to-be-corrected text corpus into a pre-trained embedding layer to generate a to-be-corrected word vector group;
    inputting the to-be-corrected word vector group into a pre-trained detection discriminator to generate position information of the word vectors;
    performing mask coverage on the to-be-corrected word vector group according to the position information of the word vectors to generate a covered word vector group;
    inputting the covered word vector group into a pre-trained error correction network to generate an error-corrected text corpus, and restoring the error-corrected text corpus based on the pre-trained error correction network to generate a target text corpus, the error-corrected text corpus comprising placeholders.
  2. The text error correction method according to claim 1, wherein the obtaining a to-be-corrected text corpus and inputting the to-be-corrected text corpus into a pre-trained embedding layer to generate a to-be-corrected word vector group comprises:
    obtaining a to-be-corrected text corpus, and performing one-hot encoding on the to-be-corrected text corpus to generate a to-be-corrected text encoding;
    reading a mapping matrix from the pre-trained embedding layer, inputting the to-be-corrected text encoding into the pre-trained embedding layer, and multiplying the to-be-corrected text encoding by the mapping matrix to generate the to-be-corrected word vector group.
  3. The text error correction method according to claim 1, wherein the inputting the to-be-corrected word vector group into a pre-trained detection discriminator to generate position information of the word vectors comprises:
    inputting the to-be-corrected word vector group and the to-be-corrected text corpus into a detection linear layer for calculation to generate a to-be-calculated vector group, the detection linear layer being located in the pre-trained detection discriminator;
    performing probability calculation on the to-be-calculated vector group according to a preset identifier probability formula to generate position probabilities;
    determining the position information of the word vectors based on the position probabilities.
  4. The text error correction method according to claim 1, wherein the performing mask coverage on the to-be-corrected word vector group according to the position information of the word vectors to generate a covered word vector group comprises:
    obtaining a preset modification magnitude parameter, the modification magnitude parameter being a natural number;
    determining a target to-be-corrected word vector in the to-be-corrected word vector group based on the position information of the word vectors, the position information of the target to-be-corrected word vector being the position information of the misspelled word vector;
    performing mask coverage on the target to-be-corrected word vector based on the modification magnitude parameter to generate a covered word vector group.
  5. The text error correction method according to claim 1, wherein the inputting the covered word vector group into a pre-trained error correction network to generate an error-corrected text corpus, and restoring the error-corrected text corpus based on the pre-trained error correction network to generate a target text corpus, the error-corrected text corpus comprising placeholders, comprises:
    inputting the covered word vector group into an error correction hidden layer for calculation to generate the error-corrected text corpus, the error correction hidden layer being located in the pre-trained error correction network, the error-corrected text corpus comprising placeholders;
    predicting the placeholders based on the pre-trained error correction network to generate a predicted placeholder corpus;
    generating the target text corpus based on the predicted placeholder corpus and the error-corrected text corpus.
  6. The text error correction method according to claim 5, wherein the inputting the covered word vector group into an error correction hidden layer for calculation to generate the error-corrected text corpus, the error correction hidden layer being located in the pre-trained error correction network, the error-corrected text corpus comprising placeholders, comprises:
    inputting the covered word vector group into the pre-trained error correction network to generate hidden-layer text information;
    initializing the hidden-layer text information to generate initialized vector sequences;
    calculating the initialized vector sequences based on an attention mechanism to generate multiple attention weight scores;
    inputting the multiple attention weight scores and the hidden-layer text information into the error correction hidden layer for calculation to generate the error-corrected text corpus, the error correction hidden layer being located in the pre-trained error correction network, the error-corrected text corpus comprising placeholders.
  7. The text error correction method according to any one of claims 1-6, wherein before the obtaining a to-be-corrected text corpus and inputting the to-be-corrected text corpus into a pre-trained embedding layer to generate a to-be-corrected word vector group, the text error correction method further comprises:
    obtaining a text corpus training data set and a text corpus verification data set, and using the text corpus training data to train a generator and a discriminator, generating an initial detection generator and an initial detection discriminator;
    using the text corpus verification data set to perform adjustment based on the initial detection generator and the initial detection discriminator, generating a pre-trained detection discriminator.
  8. A text error correction apparatus, wherein the text error correction apparatus comprises:
    an obtaining module configured to obtain a to-be-corrected text corpus and input the to-be-corrected text corpus into a pre-trained embedding layer to generate a to-be-corrected word vector group;
    a position information generation module configured to input the to-be-corrected word vector group into a pre-trained detection discriminator to generate position information of the word vectors;
    a covering module configured to perform mask coverage on the to-be-corrected word vector group according to the position information of the word vectors to generate a covered word vector group;
    a text corpus generation module configured to input the covered word vector group into a pre-trained error correction network to generate an error-corrected text corpus, and to restore the error-corrected text corpus based on the pre-trained error correction network to generate a target text corpus, the error-corrected text corpus comprising placeholders.
  9. A text error correction device, wherein the text error correction device comprises: a memory and at least one processor, the memory storing instructions;
    the at least one processor invokes the instructions in the memory, so that the text error correction device executes the following text error correction method:
    obtaining a to-be-corrected text corpus, and inputting the to-be-corrected text corpus into a pre-trained embedding layer to generate a to-be-corrected word vector group;
    inputting the to-be-corrected word vector group into a pre-trained detection discriminator to generate position information of the word vectors;
    performing mask coverage on the to-be-corrected word vector group according to the position information of the word vectors to generate a covered word vector group;
    inputting the covered word vector group into a pre-trained error correction network to generate an error-corrected text corpus, and restoring the error-corrected text corpus based on the pre-trained error correction network to generate a target text corpus, the error-corrected text corpus comprising placeholders.
  10. The text error correction device according to claim 9, wherein, when the processor executes the step of obtaining a to-be-corrected text corpus and inputting the to-be-corrected text corpus into a pre-trained embedding layer to generate a to-be-corrected word vector group, the step comprises:
    obtaining a to-be-corrected text corpus, and performing one-hot encoding on the to-be-corrected text corpus to generate a to-be-corrected text encoding;
    reading a mapping matrix from the pre-trained embedding layer, inputting the to-be-corrected text encoding into the pre-trained embedding layer, and multiplying the to-be-corrected text encoding by the mapping matrix to generate the to-be-corrected word vector group.
  11. The text error correction device according to claim 9, wherein, when the processor executes the step of inputting the to-be-corrected word vector group into a pre-trained detection discriminator to generate position information of the word vectors, the step comprises:
    inputting the to-be-corrected word vector group and the to-be-corrected text corpus into a detection linear layer for calculation to generate a to-be-calculated vector group, the detection linear layer being located in the pre-trained detection discriminator;
    performing probability calculation on the to-be-calculated vector group according to a preset identifier probability formula to generate position probabilities;
    determining the position information of the word vectors based on the position probabilities.
  12. The text error correction device according to claim 9, wherein, when the processor executes the step of performing mask coverage on the to-be-corrected word vector group according to the position information of the word vectors to generate a covered word vector group, the step comprises:
    obtaining a preset modification magnitude parameter, the modification magnitude parameter being a natural number;
    determining a target to-be-corrected word vector in the to-be-corrected word vector group based on the position information of the word vectors, the position information of the target to-be-corrected word vector being the position information of the misspelled word vector;
    performing mask coverage on the target to-be-corrected word vector based on the modification magnitude parameter to generate a covered word vector group.
  13. The text error correction device according to claim 9, wherein, when the processor executes the step of inputting the covered word vector group into a pre-trained error correction network to generate an error-corrected text corpus, and restoring the error-corrected text corpus based on the pre-trained error correction network to generate a target text corpus, the error-corrected text corpus comprising placeholders, the step comprises:
    inputting the covered word vector group into an error correction hidden layer for calculation to generate the error-corrected text corpus, the error correction hidden layer being located in the pre-trained error correction network, the error-corrected text corpus comprising placeholders;
    predicting the placeholders based on the pre-trained error correction network to generate a predicted placeholder corpus;
    generating the target text corpus based on the predicted placeholder corpus and the error-corrected text corpus.
  14. The text error correction device according to claim 13, wherein, when the processor executes the step of inputting the covered word vector group into an error correction hidden layer for calculation to generate the error-corrected text corpus, the error correction hidden layer being located in the pre-trained error correction network, the error-corrected text corpus comprising placeholders, the step comprises:
    inputting the covered word vector group into the pre-trained error correction network to generate hidden-layer text information;
    initializing the hidden-layer text information to generate initialized vector sequences;
    calculating the initialized vector sequences based on an attention mechanism to generate multiple attention weight scores;
    inputting the multiple attention weight scores and the hidden-layer text information into the error correction hidden layer for calculation to generate the error-corrected text corpus, the error correction hidden layer being located in the pre-trained error correction network, the error-corrected text corpus comprising placeholders.
  15. The text error correction device according to any one of claims 9-14, wherein, before the processor executes the step of obtaining a to-be-corrected text corpus and inputting the to-be-corrected text corpus into a pre-trained embedding layer to generate a to-be-corrected word vector group, the method further comprises:
    obtaining a text corpus training data set and a text corpus verification data set, and using the text corpus training data to train a generator and a discriminator, generating an initial detection generator and an initial detection discriminator;
    using the text corpus verification data set to perform adjustment based on the initial detection generator and the initial detection discriminator, generating a pre-trained detection discriminator.
  16. A computer-readable storage medium storing instructions, wherein, when executed by a processor, the instructions implement the following text error correction method:
    obtaining a to-be-corrected text corpus, and inputting the to-be-corrected text corpus into a pre-trained embedding layer to generate a to-be-corrected word vector group;
    inputting the to-be-corrected word vector group into a pre-trained detection discriminator to generate position information of the word vectors;
    performing mask coverage on the to-be-corrected word vector group according to the position information of the word vectors to generate a covered word vector group;
    inputting the covered word vector group into a pre-trained error correction network to generate an error-corrected text corpus, and restoring the error-corrected text corpus based on the pre-trained error correction network to generate a target text corpus, the error-corrected text corpus comprising placeholders.
  17. The computer-readable storage medium according to claim 16, wherein, when the processor executes the text error correction instructions to perform the step of obtaining a to-be-corrected text corpus and inputting the to-be-corrected text corpus into a pre-trained embedding layer to generate a to-be-corrected word vector group, the step comprises:
    obtaining a to-be-corrected text corpus, and performing one-hot encoding on the to-be-corrected text corpus to generate a to-be-corrected text encoding;
    reading a mapping matrix from the pre-trained embedding layer, inputting the to-be-corrected text encoding into the pre-trained embedding layer, and multiplying the to-be-corrected text encoding by the mapping matrix to generate the to-be-corrected word vector group.
  18. The computer-readable storage medium according to claim 16, wherein, when the processor executes the text error correction instructions to perform the step of inputting the to-be-corrected word vector group into a pre-trained detection discriminator to generate position information of the word vectors, the step comprises:
    inputting the to-be-corrected word vector group and the to-be-corrected text corpus into a detection linear layer for calculation to generate a to-be-calculated vector group, the detection linear layer being located in the pre-trained detection discriminator;
    performing probability calculation on the to-be-calculated vector group according to a preset identifier probability formula to generate position probabilities;
    determining the position information of the word vectors based on the position probabilities.
  19. The computer-readable storage medium according to claim 16, wherein, when the processor executes the text error correction instructions to perform the step of performing mask coverage on the to-be-corrected word vector group according to the position information of the word vectors to generate a covered word vector group, the step comprises:
    obtaining a preset modification magnitude parameter, the modification magnitude parameter being a natural number;
    determining a target to-be-corrected word vector in the to-be-corrected word vector group based on the position information of the word vectors, the position information of the target to-be-corrected word vector being the position information of the misspelled word vector;
    performing mask coverage on the target to-be-corrected word vector based on the modification magnitude parameter to generate a covered word vector group.
  20. The computer-readable storage medium according to claim 16, wherein, when the processor executes the text error correction instructions to perform the step of inputting the covered word vector group into a pre-trained error correction network to generate an error-corrected text corpus, and restoring the error-corrected text corpus based on the pre-trained error correction network to generate a target text corpus, the error-corrected text corpus comprising placeholders, the step comprises:
    inputting the covered word vector group into an error correction hidden layer for calculation to generate the error-corrected text corpus, the error correction hidden layer being located in the pre-trained error correction network, the error-corrected text corpus comprising placeholders;
    predicting the placeholders based on the pre-trained error correction network to generate a predicted placeholder corpus;
    generating the target text corpus based on the predicted placeholder corpus and the error-corrected text corpus.
PCT/CN2021/083296 2021-01-28 2021-03-26 文本纠错方法、装置、设备及存储介质 WO2022160447A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110117570.9 2021-01-28
CN202110117570.9A CN112905737B (zh) 2021-01-28 2021-01-28 文本纠错方法、装置、设备及存储介质

Publications (1)

Publication Number Publication Date
WO2022160447A1 true WO2022160447A1 (zh) 2022-08-04

Family

ID=76119549

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/083296 WO2022160447A1 (zh) 2021-01-28 2021-03-26 文本纠错方法、装置、设备及存储介质

Country Status (2)

Country Link
CN (1) CN112905737B (zh)
WO (1) WO2022160447A1 (zh)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116991874A (zh) * 2023-09-26 2023-11-03 海信集团控股股份有限公司 一种文本纠错、基于大模型的sql语句生成方法及设备
CN117332038A (zh) * 2023-09-19 2024-01-02 鹏城实验室 文本信息检测方法、装置、设备和存储介质

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113515931B (zh) * 2021-07-27 2023-07-21 中国平安人寿保险股份有限公司 文本纠错方法、装置、计算机设备及存储介质
CN113593574B (zh) * 2021-08-25 2024-04-19 广州虎牙科技有限公司 一种语音识别方法、计算机程序产品及电子设备
CN113743110B (zh) * 2021-11-08 2022-02-11 京华信息科技股份有限公司 一种基于微调生成式对抗网络模型的漏词检测方法及系统

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111626047A (zh) * 2020-04-23 2020-09-04 平安科技(深圳)有限公司 智能化文本纠错方法、装置、电子设备及可读存储介质
CN111985213A (zh) * 2020-09-07 2020-11-24 科大讯飞华南人工智能研究院(广州)有限公司 一种语音客服文本纠错的方法和装置
CN112200664A (zh) * 2020-10-29 2021-01-08 上海畅圣计算机科技有限公司 基于ernie模型和dcnn模型的还款预测方法

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8306819B2 (en) * 2009-03-09 2012-11-06 Microsoft Corporation Enhanced automatic speech recognition using mapping between unsupervised and supervised speech model parameters trained on same acoustic training data
CN111191649A (zh) * 2019-12-31 2020-05-22 上海眼控科技股份有限公司 一种识别弯曲多行文本图像的方法与设备
CN111613214A (zh) * 2020-05-21 2020-09-01 重庆农村商业银行股份有限公司 一种用于提升语音识别能力的语言模型纠错方法
CN111950292B (zh) * 2020-06-22 2023-06-27 北京百度网讯科技有限公司 文本纠错模型的训练方法、文本纠错处理方法和装置
CN111753532B (zh) * 2020-06-29 2024-04-16 北京百度网讯科技有限公司 西文文本的纠错方法和装置、电子设备及存储介质

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111626047A (zh) * 2020-04-23 2020-09-04 平安科技(深圳)有限公司 智能化文本纠错方法、装置、电子设备及可读存储介质
CN111985213A (zh) * 2020-09-07 2020-11-24 科大讯飞华南人工智能研究院(广州)有限公司 一种语音客服文本纠错的方法和装置
CN112200664A (zh) * 2020-10-29 2021-01-08 上海畅圣计算机科技有限公司 基于ernie模型和dcnn模型的还款预测方法

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117332038A (zh) * 2023-09-19 2024-01-02 鹏城实验室 文本信息检测方法、装置、设备和存储介质
CN116991874A (zh) * 2023-09-26 2023-11-03 海信集团控股股份有限公司 一种文本纠错、基于大模型的sql语句生成方法及设备
CN116991874B (zh) * 2023-09-26 2024-03-01 海信集团控股股份有限公司 一种文本纠错、基于大模型的sql语句生成方法及设备

Also Published As

Publication number Publication date
CN112905737B (zh) 2023-07-28
CN112905737A (zh) 2021-06-04

Similar Documents

Publication Publication Date Title
WO2022160447A1 (zh) 文本纠错方法、装置、设备及存储介质
CN108052512B (zh) 一种基于深度注意力机制的图像描述生成方法
US11860969B2 (en) Universal transformers
KR102608469B1 (ko) 자연어 생성 방법 및 장치
JP5413622B2 (ja) 言語モデル作成装置、言語モデル作成方法、およびプログラム
WO2023173533A1 (zh) 文本纠错方法、装置、设备及存储介质
CN111695343A (zh) 错词纠正方法、装置、设备及存储介质
KR20200000216A (ko) 단어자질을 강화한 음성 대화 방법 및 시스템
CN109522550B (zh) 文本信息纠错方法、装置、计算机设备和存储介质
JP7111464B2 (ja) 翻訳方法、翻訳装置及び翻訳システム
Yu et al. On-device neural language model based word prediction
JP7070653B2 (ja) 学習装置、音声認識順位推定装置、それらの方法、およびプログラム
Dai et al. Learning low-resource end-to-end goal-oriented dialog for fast and reliable system deployment
KR102592585B1 (ko) 번역 모델 구축 방법 및 장치
CN112084301B (zh) 文本修正模型的训练方法及装置、文本修正方法及装置
CN115147849A (zh) 字符编码模型的训练方法、字符匹配方法和装置
US20220343163A1 (en) Learning system, learning device, and learning method
US11694041B2 (en) Chapter-level text translation method and device
WO2022110730A1 (zh) 基于标签的优化模型训练方法、装置、设备及存储介质
CN112613008A (zh) 一种学生身份在线认证方法及系统
WO2022267674A1 (zh) 基于深度学习的文本翻译方法、装置、设备及存储介质
JP2021033994A (ja) テキスト処理方法、装置、デバイス及びコンピュータ読み取り可能な記憶媒体
CN116882403A (zh) 一种地理命名实体多目标匹配方法
CN112016281B (zh) 错误医疗文本的生成方法、装置及存储介质
CN112509559B (zh) 音频识别方法、模型训练方法、装置、设备及存储介质

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21922046

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21922046

Country of ref document: EP

Kind code of ref document: A1