CN114154488A - Statement processing method and device

Statement processing method and device

Info

Publication number
CN114154488A
Authority
CN
China
Prior art keywords
candidate
statement
sentence
backward
words
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111510600.9A
Other languages
Chinese (zh)
Inventor
姬子明
李长亮
李小龙
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Kingsoft Digital Entertainment Co Ltd
Original Assignee
Beijing Kingsoft Digital Entertainment Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Kingsoft Digital Entertainment Co Ltd filed Critical Beijing Kingsoft Digital Entertainment Co Ltd
Priority to CN202111510600.9A
Publication of CN114154488A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00: Handling natural language data
    • G06F 40/20: Natural language analysis
    • G06F 40/232: Orthographic correction, e.g. spell checking or vowelisation
    • G06F 40/205: Parsing
    • G06F 40/216: Parsing using statistical methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Document Processing Apparatus (AREA)
  • Machine Translation (AREA)

Abstract

The application provides a sentence processing method and apparatus, wherein the sentence processing method includes: acquiring a sentence to be corrected that contains a wrongly written character; determining candidate characters corresponding to the wrongly written character, and generating candidate sentences based on the candidate characters and the sentence to be corrected; constructing forward sentence features and backward sentence features corresponding to the candidate sentences; and inputting the forward sentence features and the backward sentence features into a ranking module for processing, and determining the replacement sentence corresponding to the sentence to be corrected according to the processing result.

Description

Statement processing method and device
Technical Field
The present application relates to the field of text processing technologies, and in particular, to a sentence processing method. The application also relates to a sentence processing apparatus, a computing device, and a computer-readable storage medium.
Background
With the development of Internet technology, documents of all kinds are being digitized, and document error correction has become an essential step in many scenarios: before an article is published, before a file is archived, or before a mail is sent, an error-correction function is used to identify wrongly written characters in the document and correct them, so that the correctness and coherence of the document content are preserved. In the prior art, however, wrongly written characters in a document are usually handled by computing the perplexity of candidate sentences and directly replacing the sentence that contains the error, so as to prevent the error from compromising the integrity of the document. This approach ignores the information carried by the original sentence; from the user's point of view the accuracy of the error-correction task cannot be guaranteed, because a higher probability for a candidate sentence is not in itself sufficient to show that the original sentence is wrong. An effective scheme for solving this problem is therefore urgently needed.
Disclosure of Invention
In view of this, embodiments of the present application provide a sentence processing method to address the technical defects in the prior art. Embodiments of the present application also provide a sentence processing apparatus, a computing device, and a computer-readable storage medium.
According to a first aspect of the embodiments of the present application, there is provided a sentence processing method, including:
acquiring a sentence to be corrected containing a wrongly written character;
determining candidate characters corresponding to the wrongly written character, and generating candidate sentences based on the candidate characters and the sentence to be corrected;
constructing forward sentence features and backward sentence features corresponding to the candidate sentences;
and inputting the forward sentence features and the backward sentence features into a ranking module for processing, and determining the replacement sentence corresponding to the sentence to be corrected according to the processing result.
According to a second aspect of the embodiments of the present application, there is provided a sentence processing apparatus, including:
an acquisition module configured to acquire a sentence to be corrected containing a wrongly written character;
a determining module configured to determine candidate characters corresponding to the wrongly written character and to generate candidate sentences based on the candidate characters and the sentence to be corrected;
a building module configured to build forward sentence features and backward sentence features corresponding to the candidate sentences;
and a processing module configured to input the forward sentence features and the backward sentence features into the ranking module for processing, and to determine the replacement sentence corresponding to the sentence to be corrected according to the processing result.
According to a third aspect of the embodiments of the present application, there is provided a computing device, including:
a memory and a processor;
the memory is configured to store computer-executable instructions, and the processor executes the computer-executable instructions to implement the steps of the sentence processing method.
According to a fourth aspect of the embodiments of the present application, there is provided a computer-readable storage medium storing computer-executable instructions which, when executed by a processor, implement the steps of the sentence processing method.
According to a fifth aspect of embodiments of the present application, there is provided a chip storing a computer program which, when executed by the chip, implements the steps of the sentence processing method.
According to the sentence processing method provided by the present application, after the sentence to be corrected containing a wrongly written character is acquired, the candidate characters corresponding to the wrongly written character are determined first, and candidate sentences are generated by combining them with the sentence to be corrected. Forward sentence features and backward sentence features are then constructed from the overall features of each candidate sentence and input into the ranking model for processing, and an accurately corrected replacement sentence can be obtained according to the processing result. When the sentence to be corrected is corrected in this way, the information of the preceding and following context and of the original sentence is fully fused, so that the ranking model can output a more reliable prediction, thereby ensuring the accuracy of error correction.
Drawings
Fig. 1 is a flowchart of a sentence processing method according to an embodiment of the present application;
FIG. 2 is a diagram illustrating a method for processing a sentence according to an embodiment of the present application;
FIG. 3 is a processing flow diagram of a sentence processing method applied in a document error correction scenario according to an embodiment of the present application;
fig. 4 is a schematic structural diagram of a sentence processing apparatus according to an embodiment of the present application;
fig. 5 is a block diagram of a computing device according to an embodiment of the present application.
Detailed Description
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present application. However, the application can be implemented in many ways other than those described herein, and those skilled in the art can make similar modifications without departing from the spirit of the application, which is therefore not limited to the specific implementations disclosed below.
The terminology used in the one or more embodiments of the present application is for the purpose of describing particular embodiments only and is not intended to be limiting of the one or more embodiments of the present application. As used in one or more embodiments of the present application and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used in one or more embodiments of the present application refers to and encompasses any and all possible combinations of one or more of the associated listed items.
It will be understood that, although the terms first, second, etc. may be used herein in one or more embodiments of the present application to describe various information, these information should not be limited by these terms. These terms are only used to distinguish one type of information from another. For example, a first aspect may be termed a second aspect, and, similarly, a second aspect may be termed a first aspect, without departing from the scope of one or more embodiments of the present application.
First, the terms used in one or more embodiments of the present application are explained.
Statistical language model: a basic model in the NLP field, which can be used to judge the probability of a sentence.
NLP: Natural Language Processing, an important direction in the fields of computer science and artificial intelligence that studies theories and methods for effective communication between humans and computers in natural language. It integrates linguistics, computer science, mathematics and other disciplines, and aims to extract information from text data and to enable computers to process or "understand" natural language, so as to perform automatic translation, text classification, sentiment analysis and the like.
ELECTRA: a pre-trained model whose structure is consistent with BERT but whose training process differs from BERT's; a masked language model (MLM) serves as the generator and provides an effective way of automatically selecting and replacing masked tokens, which allows ELECTRA to learn the corresponding prediction capability quickly.
GPT2: a pre-trained language model (LM) that, like a statistical language model, is suited to judging the probability that a sentence occurs.
Confusion set: the candidate set used to find wrong characters in an error-correction task, containing the homophones, shape-similar characters and the like of each character.
Perplexity: the metric by which a language model evaluates the probability of a sentence; the smaller the value, the greater the probability that the sentence occurs.
BERT model: Bidirectional Encoder Representations from Transformers, i.e. a Transformer-based bidirectional encoder; the BERT model is rooted in the Transformer architecture introduced in "Attention Is All You Need". Here "bidirectional" means that when processing a word the model can take into account the information of the words both before and after it, thereby obtaining the semantics of the context.
N-gram model: an algorithm based on a statistical language model. Its basic idea is to slide a window of size N over the content of the text, forming a sequence of fragments of length N. Each fragment is called a gram; the occurrence frequency of all grams is counted and filtered against a preset threshold to form a list of key grams, i.e. the vector feature space of the text, in which each gram in the list is one feature-vector dimension. The model is based on the assumption that the occurrence of the Nth word is related only to the preceding N-1 words and not to any other words, and that the probability of a complete sentence is the product of the occurrence probabilities of its words. These probabilities can be obtained by directly counting in the corpus the number of times the N words occur together.
In the present application, a sentence processing method is provided. The present application relates to a sentence processing apparatus, a computing device, and a computer-readable storage medium, which are described in detail one by one in the following embodiments.
In practical applications, when a sentence is corrected, a common scheme is to use a language model to compute the perplexity of each whole candidate sentence and thereby estimate the probability of that candidate sentence. However, the information of the original sentence is then lost; from the user's point of view the error-correction task must guarantee accuracy, and a higher occurrence probability of a candidate sentence does not by itself show that the original sentence is wrong, so the information of the original sentence should be taken into account when computing the probability of a candidate sentence. In addition, a general language model such as GPT2 only predicts the next word from front to back and can therefore only compute the perplexity of the forward sentence, whereas the erroneous part in an error-correction task is usually small and the context on both sides helps candidate ranking; considering only the forward sentence therefore discards useful information, and the following context should also be considered in practical applications.
In view of this, according to the sentence processing method provided by the application, after the sentence to be corrected containing a wrongly written character is obtained, the candidate characters corresponding to the wrongly written character are determined, and candidate sentences are generated from the candidate characters and the sentence to be corrected; forward sentence features and backward sentence features are then constructed from the overall features of each candidate sentence and input into the ranking model for processing, so that an accurately corrected replacement sentence can be obtained from the processing result. When the sentence to be corrected is corrected in this way, the information of the preceding and following structure and of the original sentence is fully fused, so that the ranking model can output a more reliable prediction, thereby ensuring the accuracy of error correction.
Fig. 1 shows a flowchart of a sentence processing method according to an embodiment of the present application, which specifically includes the following steps:
step S102, obtaining the statement to be corrected containing the wrongly written words.
Specifically, the sentence to be corrected containing the wrongly written word means a sentence with writing errors or inputted wrong characters in any field, such as a to-be-issued article containing the wrongly written word, a file containing the wrongly written word, a mail containing the wrongly written word, and the like; and the statement to be corrected is the statement which is uploaded to the server and needs to be corrected. Correspondingly, the wrongly written characters are specifically finger-shaped near wrongly written characters or shape-sound wrongly written characters, etc., i.e. characters which need to be corrected in the sentence to be corrected.
Further, since the to-be-corrected statements processed in different scenes may be acquired in different forms, and the articles or documents to which the to-be-corrected statements belong may contain more contents, and if the articles or documents are used as units for error correction, more time will be consumed, so that the to-be-corrected statements containing wrongly-written or mispronounced words can be located in a mode of model error detection, in this embodiment, the specific implementation mode is as follows, in steps S102-2 to S102-6:
step S102-2, acquiring a text to be processed, and constructing a character sequence based on the text to be processed;
step S102-4, inputting the character sequence into a wrongly written character detection model for processing, and obtaining wrongly written characters contained in the text to be processed;
Step S102-6, selecting the sentence containing the wrongly written character in the text to be processed as the sentence to be corrected.
Specifically, the text to be processed refers to a text that needs to be corrected, and it may be an article, a book, a file, a composition, or the like. Correspondingly, the character sequence refers to a sequence constructed from all the characters contained in the text to be processed, and it allows the wrongly-written-character detection model to locate wrongly written characters character by character.
The wrongly-written-character detection model can be built on an ELECTRA model, following its generator-discriminator design. In the training process, the generator part is an MLM whose structure is similar to that of the BERT model, and selected words in the sentence are predicted and replaced by it: if a replaced word is not the original word it is labelled as replaced, while the other words of the sentence are labelled as not replaced. The discriminator part then trains a discrimination model to recognize, for every position, whether the word has been replaced. At this point the ELECTRA model predicts the word at every position in the sentence: if the character at a position is classified as replaced according to the prediction result, the character at that position is determined to be a wrongly written character; if the character at a position is classified as not replaced, it is determined not to be a wrongly written character. By processing every word of the sentence in this way, the wrongly written characters it contains are obtained. That is, in the application stage, the wrongly-written-character detection model classifies the character at each position of the character sequence and determines from the classification result whether each character is a wrongly written character.
Alternatively, the wrongly-written-character detection model may use an N-gram model, which in the detection scenario judges the semantic plausibility of the sentence so as to reflect how reasonable the sentence is. The N-gram model analyses the sentence locally: each word unit is combined with its neighbouring word units, and the semantic plausibility of the combined unit with respect to the sentence is computed, so as to determine whether the plausibility corresponding to each word unit satisfies a threshold; if not, the combined unit is unreasonable with respect to the sentence, and the word unit below the threshold is selected as the wrongly written character. In practical applications the wrongly-written-character detection model may be chosen according to actual requirements, and this embodiment is not limited in this respect.
Based on this, after the text to be processed is obtained, a character sequence can be constructed from it in order to improve error-correction efficiency; the character sequence is then input into the wrongly-written-character detection model for processing, and the wrongly written characters contained in the text to be processed are obtained from the model output. Since the text to be processed may contain many sentences, the sentence containing a wrongly written character is selected as the sentence to be corrected for the subsequent error-correction operations.
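A minimal sketch of the detection step, assuming the Hugging Face transformers library and its ELECTRA discriminator head (the public checkpoint name and the 0.5 threshold are illustrative assumptions; the patent's detector would be trained on sentences whose characters are labelled as replaced or not replaced, as described below):
```python
import torch
from transformers import ElectraForPreTraining, ElectraTokenizerFast

# Public checkpoint used only for illustration; a detector for this task would be
# fine-tuned so that "replaced" corresponds to "wrongly written character".
MODEL_NAME = "google/electra-small-discriminator"
tokenizer = ElectraTokenizerFast.from_pretrained(MODEL_NAME)
model = ElectraForPreTraining.from_pretrained(MODEL_NAME)
model.eval()

def detect_suspect_tokens(text, threshold=0.5):
    """Return the tokens whose 'replaced' probability exceeds the threshold."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits          # one score per token position
    probs = torch.sigmoid(logits)[0]
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
    return [tok for tok, p in zip(tokens, probs) if p > threshold]

print(detect_suspect_tokens("the quick brwon fox jumps over the lazy dog"))
```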
It should be noted that, in order to locate wrongly written characters in a sentence accurately in the application stage, the wrongly-written-character detection model needs to be trained sufficiently. That is, in the model training stage, sample sentences in a preset sample set and the sample wrongly written characters corresponding to them are obtained; the sample sentences are input into the initial wrongly-written-character detection model for processing to obtain the predicted wrongly written characters output by the model; and the parameters of the initial model are then adjusted by comparing the predicted wrongly written characters with the sample wrongly written characters, until a wrongly-written-character detection model satisfying the training-stop condition is obtained. The training-stop condition may be a number of iterations or the result of a loss-value comparison, and this embodiment is not limited in this respect.
For example, the text to be processed uploaded by the user is an article to be published, and error detection is performed on it so that wrongly written characters do not spoil the reader's experience after publication. A character sequence composed of the characters of the uploaded text is constructed and input into the ELECTRA model for processing; according to the processing result, the wrongly written character contained in the text to be processed is determined to be "甲", the sentence containing "甲" is located in the text to be processed and determined to be "在绿油油的甲野上" (intended meaning: on the lush green fields), and this sentence is taken as the sentence to be corrected for the subsequent correction.
In conclusion, identifying the wrongly written characters with the wrongly-written-character detection model reduces the amount of text that has to be corrected and effectively improves the efficiency of the subsequent error correction.
Step S104, determining the candidate characters corresponding to the wrongly written character, and generating candidate sentences based on the candidate characters and the sentence to be corrected.
Specifically, after the sentence to be corrected containing the wrongly written character is obtained, in order to complete the error-correction task accurately and obtain a replacement for the sentence to be corrected, a plurality of candidate characters corresponding to the wrongly written character are determined, so that the correct character can be screened out from them and used to replace the wrongly written character. In this process, how to screen the correct character out of the candidate characters accurately is the problem that needs to be solved. To improve the efficiency of this screening, the sentence processing method provided in this embodiment combines each candidate character with the sentence to be corrected to construct a candidate sentence, and later uses the context information to estimate the probability that each candidate character is the correct character, so as to screen out the correct character that fits the semantics of the sentence to be corrected.
The candidate characters refer to the characters that may serve as the correct character and replace the wrongly written character; correspondingly, the candidate sentences refer to the sentences obtained by replacing the wrongly written character in the sentence to be corrected with each of the candidate characters, which are used later, together with the context of the wrongly written character, to analyse the probability that each candidate character is the correct character.
Further, when determining the candidate characters corresponding to the wrongly written character, many characters could serve as replacements in different scenarios; if all of them were taken as candidate characters and the correct character were then screened from all of them, the screening would consume considerable time. To determine the replacement sentence quickly, the candidate characters can therefore be determined by a method that combines a screening strategy with a model. In this embodiment the specific implementation is as in steps S1042 to S1044:
step S1042, inputting the wrongly-written characters into a recall model for processing, and obtaining a plurality of initial candidate characters corresponding to the wrongly-written characters;
step S1044 is to screen out the candidate words corresponding to the wrongly written words from the plurality of initial candidate words based on a preset screening strategy.
Specifically, the recall model is a candidate-character recall model formed by an N-gram language model and a BERT masked language model, and it recalls, from a large number of characters, the initial candidate characters that might replace the wrongly written character. Correspondingly, an initial candidate character is a character that is relatively close to the wrongly written character; that is, the recall model recalls characters that are close to it, where closeness may refer to similarity in shape and/or in sound. The preset screening strategy refers to the strategy used to screen the plurality of initial candidate characters, so that a set number of characters that can serve as candidate characters are screened out of them.
The preset screening strategy may compute the similarity between the wrongly written character and each initial candidate character and select a set number of the most similar initial candidate characters as the candidate characters corresponding to the wrongly written character, where the set number may be chosen according to actual requirements, for example 5, 8 or 10, and this embodiment is not limited in this respect. Alternatively, part of the initial candidate characters may first be selected as intermediate candidate characters, where the preliminary selection may be random, may take a specific number of characters at intervals, or may take a specific number of adjacent characters together; the intermediate candidate characters are then filtered against a preset confusion set included in the strategy, i.e. the confusion set is used to filter out the intermediate candidates that have a low probability of being the correct character, so as to obtain the set number of candidate characters. The preset confusion set refers to a set containing the shape-similar characters and/or homophones of each candidate character. In practical applications the preset screening strategy may be set according to the actual application scenario, and this embodiment is not limited in this respect.
Based on this, after the wrongly written character is determined, it can be input into the recall model for processing, so that the plurality of initial candidate characters output by the recall model are obtained; the initial candidate characters are then screened based on the preset screening strategy, and the candidate characters corresponding to the wrongly written character are obtained from the screening result, to be used later for determining the replacement sentence.
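A minimal sketch of the recall step, assuming the transformers library and a Chinese BERT checkpoint (the checkpoint name, the top_k value and the example sentence are illustrative; the patent combines an N-gram language model with the BERT masked language model, and the N-gram part is omitted here):
```python
import torch
from transformers import BertForMaskedLM, BertTokenizerFast

MODEL_NAME = "bert-base-chinese"   # illustrative checkpoint
tokenizer = BertTokenizerFast.from_pretrained(MODEL_NAME)
model = BertForMaskedLM.from_pretrained(MODEL_NAME)
model.eval()

def recall_initial_candidates(sentence, error_pos, top_k=10):
    """Mask the suspected character and return the top_k characters predicted for it."""
    chars = list(sentence)
    chars[error_pos] = tokenizer.mask_token
    inputs = tokenizer("".join(chars), return_tensors="pt")
    ids = inputs["input_ids"][0]
    mask_pos = (ids == tokenizer.mask_token_id).nonzero(as_tuple=True)[0].item()
    with torch.no_grad():
        logits = model(**inputs).logits[0, mask_pos]
    top_ids = logits.topk(top_k).indices.tolist()
    return tokenizer.convert_ids_to_tokens(top_ids)

# "在绿油油的甲野上" follows the example used in this description (error at index 5).
print(recall_initial_candidates("在绿油油的甲野上", error_pos=5))
```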
In addition, in order to ensure that the recall model can accurately determine the candidate characters corresponding to a wrongly written character, the recall model needs to be trained sufficiently in the training phase. That is, the relation between sample wrongly written characters and the sample candidate characters associated with them is constructed in advance, and the recall model fully learns this relation so that it can predict accurately in the application stage.
In conclusion, determining the candidate characters of the wrongly written character by combining the screening strategy with the recall model both controls the number of candidate characters and effectively reduces the processing time of the error-correction task, so that the replacement sentence can be determined quickly.
Furthermore, when the candidate characters are screened based on the preset screening strategy, the candidate-character confusion set can be used for filtering in order to improve the interpretability of the screening and reduce the processing time of the error-correction task; that is, in order to remove redundant candidates retained by the preliminary screening and reduce the subsequent computation cost, part of the candidates can be filtered out through the preset screening strategy. In this embodiment the specific implementation is as in step S1044-2:
Step S1044-2, determining a screening proportion and a candidate-character confusion set according to the screening strategy; screening a set number of intermediate candidate characters from the plurality of initial candidate characters based on the screening proportion; and filtering the set number of intermediate candidate characters with the candidate-character confusion set, and obtaining the candidate characters according to the filtering result.
Specifically, the candidate-character confusion set refers to a set containing the shape-similar characters and homophones of each initial candidate character, and it is used to filter out, from the intermediate candidate characters, those with a low probability of being the correct character. Correspondingly, the intermediate candidate characters refer to the set number of initial candidate characters screened from the plurality of initial candidate characters; the number may be set according to the actual application scenario, and this embodiment is not limited in this respect.
Filtering the intermediate candidate characters based on the candidate-character confusion set specifically means first determining, in the confusion set, the homophones and/or shape-similar characters corresponding to the wrongly written character, then comparing each intermediate candidate character with the characters in that set and rejecting a set number of intermediate candidates: according to the comparison result, the characters whose similarity to the set is smaller than a preset threshold are selected and deleted as the set number of intermediate candidates. In this process, if the number of intermediate candidates below the preset threshold equals the set number, they are all taken as the intermediate candidates to be deleted; if it is smaller than the set number, all the intermediate candidates below the threshold are taken as the ones to be deleted; and if it is larger than the set number, the set number of intermediate candidates to be deleted may be chosen at random from those below the threshold, or chosen in order of increasing similarity. In a specific implementation the choice may be made according to actual requirements, and this embodiment is not limited in this respect. Alternatively, the set number of intermediate candidates to be deleted may be selected directly in order of increasing similarity. Finally, the remaining intermediate candidate characters are taken as the candidate characters corresponding to the wrongly written character.
Based on this, the screening proportion and the candidate-character confusion set corresponding to the wrongly written character can be determined according to the preset screening strategy, and the set number of intermediate candidate characters are then screened out of the large number of initial candidate characters according to the screening proportion, which preliminarily filters out candidates that deviate from the correct character. That is, according to the proportion of the set number to the total number of initial candidate characters, the corresponding initial candidate characters are selected as intermediate candidates, and the other initial candidate characters are deleted directly, completing the preliminary filtering. Finally, the remaining intermediate candidate characters are taken as the candidate characters of the wrongly written character, to be used later for constructing the candidate sentences.
In conclusion, completing the filtering of the candidate characters by combining the recall model with the candidate-character confusion set effectively reduces the number of candidate characters, improves the efficiency of constructing the candidate sentences, and thus improves the processing efficiency of the error-correction task.
Furthermore, when the set number of intermediate candidate characters are filtered through the candidate-character confusion set, the quality of the remaining candidates, i.e. their closeness to the correct character, can be improved by calculating the number of editing operations. In this embodiment this is implemented as in step S1044-4:
Step S1044-4, calculating the number of editing operations between the confusion characters contained in the candidate-character confusion set and each intermediate candidate character; filtering the set number of intermediate candidate characters according to the number of editing operations, and obtaining the candidate characters according to the filtering result.
Specifically, the number of editing operations is the minimum number of edits between characters and represents the degree of similarity between two characters: the smaller the number of editing operations, the more similar the two characters are, and conversely, the larger the number of editing operations, the less similar they are. The minimum number of edits refers to the number of editing operations, such as modification, deletion and position movement, needed to turn one character into another.
Based on this, after the candidate-character confusion set corresponding to the wrongly written character is determined, the number of editing operations between each intermediate candidate character and each confusion character in the set can be calculated, and the set number of intermediate candidate characters are then filtered according to these counts: the candidates with a large number of editing operations are removed, and the remaining intermediate candidates are taken as the candidate characters corresponding to the wrongly written character, to be used later for constructing the candidate sentences. Concretely, after the number of editing operations is calculated it is compared with a preset count threshold; an intermediate candidate whose count exceeds the threshold is a candidate with too many editing operations and is removed, while the intermediate candidates whose count is less than or equal to the threshold are selected as the candidate characters corresponding to the wrongly written character.
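A minimal sketch of the edit-operation count and of the threshold filter described above (the function names, the threshold value and the decompose() mapping are assumptions of the sketch; in practice each character would be decomposed into a comparable sequence, such as its pinyin or stroke sequence, before the distance is computed):
```python
def edit_operations(a, b) -> int:
    """Minimum number of insert/delete/substitute operations turning sequence a into b."""
    dp = list(range(len(b) + 1))
    for i, xa in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, xb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(
                dp[j] + 1,            # delete xa
                dp[j - 1] + 1,        # insert xb
                prev + (xa != xb),    # substitute xa -> xb (free if equal)
            )
    return dp[-1]

def filter_intermediate_candidates(intermediate, confusion_set, decompose, threshold=2):
    """Keep intermediate candidates whose edit-operation count to at least one confusion
    character stays within the threshold; decompose() maps a character to a comparable
    sequence (e.g. its pinyin or stroke sequence) and is supplied by the caller."""
    kept = []
    for cand in intermediate:
        if any(edit_operations(decompose(cand), decompose(conf)) <= threshold
               for conf in confusion_set):
            kept.append(cand)
    return kept
```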
Following the above example, after the wrongly written character "甲" contained in the sentence to be corrected "在绿油油的甲野上" is determined, it is input into the recall model built from the N-gram language model and the BERT masked language model for processing. According to the processing result, the top-5 characters ranked by similarity to the wrongly written character are selected as intermediate candidate characters; that is, the wrongly written character "甲" is processed by the N-gram language model and the BERT masked language model to obtain N initial candidate characters, 5 of which are then selected as intermediate candidate characters, determined to be {甲, 由, 申, 田, 日}.
Further, after the initial candidate characters are determined, a candidate-character confusion set is introduced to filter them, in order to reduce the degree to which the recalled candidates deviate from the original character. The confusion set contains the homophones and shape-similar characters corresponding to the wrongly written character, for example {甲, 由, 申, 田, 甜, ...}. The number of editing operations between each confusion character in the set and each intermediate candidate character is calculated; according to the calculation result, the counts for {由, 申, 田} are smaller than the count threshold, so {由, 申, 田} are taken as the candidate characters and the intermediate candidates 甲 and 日 are removed, leaving the candidate characters for the subsequent construction of the candidate sentences used for error correction.
In summary, when the candidate characters are filtered through the candidate-character confusion set, filtering by the number of editing operations improves the filtering accuracy and ensures that the remaining candidates stay close to the correct character, which makes it easier to obtain the correct character later and correct the sentence to be corrected.
Further, after the candidate characters corresponding to the wrongly written character are determined, in order to complete the error correction of the sentence to be corrected with these candidates, each candidate character is merged into the sentence to be corrected to construct a candidate sentence, and the error-correction operation is then completed with the ranking model. The specific implementation is as in step S1046:
Step S1046, determining the character position of the wrongly written character in the sentence to be corrected; and updating the sentence to be corrected with the candidate characters according to the character position to obtain the candidate sentences.
Specifically, the character position is the position of the wrongly written character in the sentence to be corrected, and it is determined based on the character sequence. On this basis, after the character position of the wrongly written character in the sentence to be corrected is determined, the wrongly written character can be replaced by each candidate character, so that the candidate sentences are obtained from the replacement result; whether each candidate character is the correct character can then be predicted from its context in the sentence to be corrected.
Following the above example, after the candidate characters {由, 申, 田} are obtained, the position of the wrongly written character in the sentence to be corrected "在绿油油的甲野上" is determined to be the sixth character. Each candidate character is then used to update the sentence to be corrected, giving the first candidate sentence "在绿油油的由野上", the second candidate sentence "在绿油油的申野上" and the third candidate sentence "在绿油油的田野上", for use in the subsequent error-correction operations.
In conclusion, merging the candidate characters into the sentence to be corrected by constructing candidate sentences makes it possible to determine the correct character among the candidates from their context in the sentence to be corrected, thereby ensuring the precision of error correction.
In addition, it should be noted that, when the candidate sentences are constructed, the wrongly written character determined by the above processing may itself be wrong, and performing the subsequent computation and replacement only on that basis could introduce an error into the corrected sentence. To avoid replacing a character that was actually correct, an additional candidate sentence is formed from the wrongly written character and the sentence to be corrected; that is, when there are k candidate characters, k + 1 candidate sentences are generated from the wrongly written character and the candidate characters, for use in the subsequent determination of the replacement sentence.
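A small sketch of the candidate-sentence generation just described, including the extra copy of the original sentence (the function name is an assumption; the example sentence and candidate characters follow the example used in this description):
```python
def build_candidate_sentences(sentence, error_pos, candidate_chars):
    """Replace the character at error_pos with each candidate character; the unmodified
    original sentence is kept as well, so k candidates give k + 1 candidate sentences."""
    candidates = [sentence]
    chars = list(sentence)
    for c in candidate_chars:
        chars[error_pos] = c
        candidates.append("".join(chars))
    return candidates

# Example following the description: "在绿油油的甲野上" with candidates 由 / 申 / 田.
print(build_candidate_sentences("在绿油油的甲野上", 5, ["由", "申", "田"]))
```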
Step S106, constructing the forward sentence features and the backward sentence features corresponding to the candidate sentences.
Specifically, after the candidate sentences are constructed from the candidate characters and the sentence to be corrected, and in order not to lose the preceding and following information of the original sentence and the candidate sentences while the error-correction task is processed, the correctness of the candidate characters can be reflected from several dimensions. To this end, corresponding forward sentence features and backward sentence features are constructed for each candidate sentence, and the correctness of each candidate character in the sentence to be corrected is later analysed from the forward and backward features of its candidate sentence, so that the sentence to be corrected is corrected with the right character. The forward sentence feature is a vector expression constructed from the characters of the candidate sentence in forward order, and the backward sentence feature is a vector expression constructed from the characters of the candidate sentence in backward order.
Based on this, in order to determine the correct character from several dimensions, the forward and backward sentence features corresponding to each candidate sentence are built by splicing the sentence to be corrected, which contains the wrongly written character, with the candidate sentence, which contains the candidate character; in this way each candidate character can be analysed from its context in the candidate sentence without losing the information of the original sentence.
Further, when the forward sentence feature corresponding to a candidate sentence is constructed, the actual procedure is to splice the sentence to be corrected containing the wrongly written character with the candidate forward sentence containing the candidate character, and to map the spliced sentence into the vector space to obtain the forward sentence feature. In this embodiment the specific implementation is as in steps S1062-2 to S1062-6:
step S1062-2, dividing the candidate sentences according to the positions of the candidate words in the candidate sentences to obtain candidate forward sentences;
s1062-4, splicing the statement to be corrected with the candidate forward statement to obtain an initial forward statement;
step S1062-6, adding a statement mark in the initial forward statement, and constructing the forward statement feature based on the initial forward statement with the statement mark added.
Specifically, the candidate forward sentence is the sentence obtained by removing from the candidate sentence, which contains the candidate character, the characters that come after the candidate character. Correspondingly, the initial forward sentence refers to the sentence obtained by splicing the sentence to be corrected with the candidate forward sentence, where splicing means connecting the two one after the other to form a new sentence. Correspondingly, the sentence marks are the marks added to the initial forward sentence: a mark for the position of the wrongly written character, a mark for the start position of the candidate forward sentence within the initial forward sentence, and a mark for the end position of the candidate forward sentence within the initial forward sentence.
Based on the above, after the candidate sentence corresponding to a candidate character is determined, the candidate sentence can be divided at the position of the candidate character so as to delete the characters that follow the candidate character, and the candidate forward sentence is formed from the remaining characters; the sentence to be corrected is then spliced with the candidate forward sentence to obtain the initial forward sentence corresponding to the candidate character. Finally, sentence marks are added to the initial forward sentence, and the forward sentence feature is constructed from the initial forward sentence to which the marks have been added, to be used later for determining the replacement sentence.
Adding sentence marks to the initial forward sentence specifically means adding a set mark, such as "<s>", at the splicing position of the initial forward sentence, a set mark, such as "_", before and after the candidate character, and a set mark, such as "</s>", at the end of the sentence, for use in the subsequent feature construction.
In conclusion, by splicing the sentence to be corrected with the candidate forward sentence, the influence of the preceding text on the candidate character is taken into account while the original text of the sentence to be corrected is fully retained, which effectively improves the accuracy of the subsequent prediction, so that the correct character can be determined accurately from the candidate characters and the replacement sentence generated.
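A minimal sketch of the forward splice, using the marker symbols from the examples below ("_", "<s>", "</s>"); treating each Chinese character as one token, and the exact placement of the markers, are assumptions of the sketch:
```python
def forward_sentence_feature(error_sentence, candidate_sentence, pos):
    """Sentence to be corrected (wrongly written character wrapped in "_"), followed by
    "<s>", the candidate sentence truncated just after the candidate character (also
    wrapped in "_"), and "</s>"."""
    marked_original = (list(error_sentence[:pos]) + ["_", error_sentence[pos], "_"]
                       + list(error_sentence[pos + 1:]))
    candidate_forward = list(candidate_sentence[:pos]) + ["_", candidate_sentence[pos], "_"]
    return marked_original + ["<s>"] + candidate_forward + ["</s>"]

# Example following the description: error character at index 5, candidate "田".
print(forward_sentence_feature("在绿油油的甲野上", "在绿油油的田野上", 5))
```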
Further, when the backward sentence feature corresponding to a candidate sentence is constructed, the actual procedure is to splice the sentence to be corrected containing the wrongly written character with the backward sentence containing the candidate character, and to map the spliced sentence into the vector space to obtain the backward sentence feature. In this embodiment the specific implementation is as in steps S1064-2 to S1064-8:
Step S1064-2, performing reverse-order processing on the candidate sentence to obtain a candidate backward sentence;
Step S1064-4, dividing the candidate backward sentence according to the position of the candidate character in the candidate backward sentence to obtain a target candidate backward sentence;
Step S1064-6, splicing the sentence to be corrected with the target candidate backward sentence to obtain an initial backward sentence;
Step S1064-8, adding sentence marks to the initial backward sentence, and constructing the backward sentence feature based on the initial backward sentence to which the sentence marks have been added.
Specifically, the candidate backward sentence is the sentence obtained by reversing the order of the candidate sentence. Correspondingly, the target candidate backward sentence refers to the sentence obtained by removing from the candidate backward sentence, which contains the candidate character, the characters that come after the candidate character; and the initial backward sentence refers to the sentence obtained by splicing the reversed sentence to be corrected with the target candidate backward sentence, where splicing again means connecting the two one after the other to form a new sentence. For example, if the sentence to be corrected is abc and the candidate sentence is adc, the candidate backward sentence is cda and the target candidate backward sentence is cd; the reverse-order form of the sentence to be corrected, cba, is then spliced with cd to obtain the initial backward sentence cbacd.
After the candidate sentences are obtained, in order to take into account the influence of the content that follows the candidate character on the candidate character, each candidate sentence can be reversed to obtain a candidate backward sentence; the candidate backward sentence is then divided at the position of the candidate character so that the characters after it are deleted, and the target candidate backward sentence is formed from the remaining characters. The reverse-order error-correction sentence corresponding to the sentence to be corrected is then spliced with the target candidate backward sentence to obtain the initial backward sentence of the candidate character; finally, sentence marks are added to the initial backward sentence, and the backward sentence feature is constructed from the initial backward sentence to which the marks have been added, to be used later for determining the replacement sentence.
Adding sentence marks to the initial backward sentence specifically means adding a set mark, such as "<s>", at the splicing position of the initial backward sentence, a set mark, such as "_", before and after the candidate character, and a set mark, such as "</s>", at the end of the sentence, for use in the subsequent feature construction.
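The backward counterpart, a sketch under the same assumptions as above: both sentences are reversed before the same truncate-and-splice step is applied.
```python
def backward_sentence_feature(error_sentence, candidate_sentence, pos):
    """Reverse both sentences, truncate the reversed candidate sentence just after the
    candidate character, and splice it onto the reversed sentence to be corrected."""
    rev_error = error_sentence[::-1]
    rev_candidate = candidate_sentence[::-1]
    rev_pos = len(candidate_sentence) - 1 - pos      # candidate position after reversal
    marked_original = (list(rev_error[:rev_pos]) + ["_", rev_error[rev_pos], "_"]
                       + list(rev_error[rev_pos + 1:]))
    candidate_backward = list(rev_candidate[:rev_pos]) + ["_", rev_candidate[rev_pos], "_"]
    return marked_original + ["<s>"] + candidate_backward + ["</s>"]

print(backward_sentence_feature("在绿油油的甲野上", "在绿油油的田野上", 5))
```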
Following the above example, the candidate characters are determined to be {由, 申, 田}, the first candidate sentence is "在绿油油的由野上", the second candidate sentence is "在绿油油的申野上", and the third candidate sentence is "在绿油油的田野上".
On this basis, the first candidate sentence is divided at the position of "由" to obtain the first candidate forward sentence "在绿油油的由"; the sentence to be corrected is then spliced with the first candidate forward sentence, and the mark "_" identifying the position of the wrongly written character, the mark "<s>" identifying the start position of the first candidate forward sentence within the first initial forward sentence, and the mark "</s>" identifying the end position of the first candidate forward sentence within the first initial forward sentence are added to the spliced first initial forward sentence. According to the result of this addition, the first forward sentence feature S11 of the first candidate sentence is the marked character sequence consisting of the characters of "在绿油油的_甲_野上", followed by "<s>", the characters of "在绿油油的_由", and "</s>".
Further, the first candidate sentence is reversed to obtain the first candidate backward sentence "上野由的油油绿在", which is divided at the position of "由" to obtain the first target candidate backward sentence "上野由"; the reverse-order error-correction sentence "上野甲的油油绿在" corresponding to the sentence to be corrected is then spliced with the first target candidate backward sentence, and the marks "_", "<s>" and "</s>" are added to the spliced first initial backward sentence, so that the first backward sentence feature S12 of the first candidate sentence is the marked character sequence consisting of the characters of "上野_甲_的油油绿在", followed by "<s>", the characters of "上野_由", and "</s>".
By analogy, the second forward sentence feature S21 and the second backward sentence feature S22 corresponding to the second candidate sentence, and the third forward sentence feature S31 and the third backward sentence feature S32 corresponding to the third candidate sentence, are constructed respectively for the subsequent determination of the replacement sentence.
It should be noted that the sentence marks are added so that the ranking module can determine the position of the wrongly written character, the start and end positions of the candidate forward sentence in the forward sentence feature, and the start and end positions of the target candidate backward sentence in the backward sentence feature, which supports the ranking module in judging the correctness of each candidate character accurately. Meanwhile, considering that the characters after the candidate character contribute far less to the overall perplexity than the characters before it, i.e. that they have little effect on the perplexity of the candidate sentence when the model processes it, this embodiment chooses to truncate at the position of the wrongly written character when constructing the forward and backward sentence features of each candidate sentence, which improves processing efficiency while preserving accuracy.
Step S108, inputting the forward sentence features and the backward sentence features into the ranking module for processing, and determining the replacement sentence corresponding to the sentence to be corrected according to the processing result.
Specifically, after the forward sentence feature and the backward sentence feature corresponding to each candidate sentence are constructed, they can be input together into the ranking module for processing, so that the ranking module predicts the correct character from among the candidate characters and the replacement sentence corresponding to the sentence to be corrected is generated with that correct character.
Based on this, the ranking module is a module that integrates a forward GPT2 model and a backward GPT2 model: the forward perplexity and the backward perplexity of each candidate sentence are computed by the forward GPT2 model and the backward GPT2 model respectively, the perplexity of the candidate sentence is determined by combining the two, and with this perplexity, together with the original-text features of the sentence to be corrected, it is analysed whether each candidate sentence can serve as the replacement sentence, which ensures the accuracy of the prediction. The forward GPT2 model captures the influence of the preceding text on the wrongly written character in forward order, and the backward GPT2 model captures the influence of the following text in reverse order, which improves the accuracy of the ranking.
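A minimal sketch of the perplexity computation in the ranking module, assuming the transformers library; the public "gpt2" checkpoint is an English model used only for illustration, the backward direction is simulated here by scoring the reversed word sequence with the same checkpoint (whereas the patent uses a separately trained backward GPT2 model), and for brevity the sketch scores a candidate sentence directly rather than the spliced features described above:
```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")   # illustrative checkpoint
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def perplexity(text: str) -> float:
    """exp(mean negative log-likelihood) of the text under the language model."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss   # labels are shifted internally
    return torch.exp(loss).item()

def combined_perplexity(text: str) -> float:
    """Forward perplexity times a simulated backward perplexity (reversed word order);
    a lower combined value indicates a more plausible candidate sentence."""
    backward_text = " ".join(reversed(text.split()))
    return perplexity(text) * perplexity(backward_text)

# The candidate sentence with the lowest combined perplexity (equivalently, the highest
# candidate score) would be taken as the replacement sentence.
print(combined_perplexity("on the lush green fields"))
```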
Further, after the forward sentence feature and the backward sentence feature corresponding to each candidate sentence are processed by the sorting module, an optional implementation is to calculate a candidate score for each candidate sentence and screen out the replacement sentence according to the candidate score. In this embodiment the specific implementation is as in steps S1082 to S1084:
step S1082, inputting the forward sentence characteristics and the backward sentence characteristics to the sorting module for processing, and obtaining candidate scores corresponding to the candidate sentences.
Specifically, the candidate score refers to the probability score of a candidate sentence being the replacement sentence: a higher candidate score indicates a higher confusion degree and therefore a higher probability of being the replacement sentence, while a lower candidate score indicates a lower confusion degree and therefore a lower probability of being the replacement sentence.
Based on this, after the forward sentence characteristics and the backward sentence characteristics corresponding to the candidate sentences are input to the sorting module, the confusion degree of the candidate sentences can be calculated through the sorting module from the forward dimension and the backward dimension so as to map out the candidate scores of the candidate sentences, and the subsequent determination of the replacement sentences is facilitated.
Further, the process of calculating the candidate score by the ranking module is as follows from step S1082-2 to step S1082-6:
step S1082-2, inputting the forward sentence characteristic and the backward sentence characteristic to the sorting module;
step S1082-4, processing the forward sentence characteristics through a forward sorting model in the sorting module to obtain a forward confusion degree, and processing the backward sentence characteristics through a backward sorting model in the sorting module to obtain a backward confusion degree;
and step S1082-6, processing the forward confusion degree and the backward confusion degree through a computing unit in the sorting module, and obtaining and outputting candidate scores corresponding to the candidate sentences.
Specifically, after the forward statement feature and the backward statement feature corresponding to any candidate statement are obtained, they can be input directly to the sorting module; the forward statement feature is processed by the forward sorting model (such as the forward GPT2 model) in the sorting module to obtain the forward confusion degree of the candidate statement, and the backward statement feature is processed by the backward sorting model (such as the backward GPT2 model) to obtain the backward confusion degree. Finally, the target confusion degree is calculated inside the sorting module from the forward confusion degree and the backward confusion degree, the candidate score of the candidate statement is determined from the target confusion degree and output by the sorting module, and the replacement statement can subsequently be determined by comparing candidate scores.
When calculating the target confusion degree from the forward and backward confusion degrees, the computing unit may determine it by multiplying the two values. Meanwhile, when the forward sorting model or the backward sorting model in the sorting module calculates its confusion degree, the value is obtained by multiplying together the predicted probability values from the start identifier position to the end identifier position of the candidate forward (or backward) sentence.
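A minimal sketch of this computation, assuming the per-token probabilities have already been produced by the forward and backward GPT2 models (the function names and numbers below are illustrative, not the patent's exact formulation):

def direction_score(token_probs):
    # Product of the predicted probabilities between the start and end
    # identifiers, as described for the forward/backward sorting models.
    score = 1.0
    for p in token_probs:
        score *= p
    return score

def candidate_score(forward_probs, backward_probs):
    # Target confusion degree of one candidate sentence: the forward and
    # backward values multiplied together.
    return direction_score(forward_probs) * direction_score(backward_probs)

# Illustrative probabilities only.
score = candidate_score([0.9, 0.8], [0.85, 0.95])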
In addition, in the training stage of the ranking module, the samples used for training are likewise forward/backward sentence features. It should be noted that during training the correct word in the error-corrected text is replaced with the candidate words, so that the ranking module learns the ability to predict candidate scores while not producing erroneous predicted values for wrong words.
Step S1084, determining a target candidate sentence in the candidate sentences based on the candidate score, and taking the target candidate sentence as a replacement sentence corresponding to the sentence to be corrected.
Specifically, after the candidate score corresponding to each candidate sentence is obtained from the sorting module, the target candidate sentence can be extracted from the candidate sentences according to the candidate scores and used as the replacement sentence, thereby completing the error correction task for the statement to be corrected.
Along the above example, after the forward sentence feature and the backward sentence feature corresponding to each candidate sentence are obtained, the forward sentence feature and the backward sentence feature corresponding to each candidate sentence can be input to the sorting module, the forward confusion degree is calculated through the forward GPT2 model in the sorting module, the backward confusion degree is calculated through the backward GPT2 model, and the forward confusion degree and the backward confusion degree corresponding to the first candidate sentence are determined to be 0.4 and 0.3 according to the calculation result; the forward confusion degree corresponding to the second candidate sentence is 0.6, and the backward confusion degree is 0.5; the third candidate sentence corresponds to a forward confusion of 0.8 and a backward confusion of 0.9.
Further, the computing unit in the sorting module determines that the candidate score corresponding to the first candidate sentence is 0.12, that of the second candidate sentence is 0.30, and that of the third candidate sentence is 0.72; according to the comparison result the third candidate sentence has the highest candidate score and therefore the highest probability of being the replacement sentence, so the replacement sentence is determined to be 'in the field of the green oil' and fed back to the user.
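Following this example, the final selection reduces to taking the candidate sentence with the highest candidate score; a minimal sketch with the scores above (the dictionary keys are assumed names):

# Forward x backward confusion degrees for the three candidate sentences above.
candidate_scores = {
    "first_candidate": 0.4 * 0.3,    # 0.12
    "second_candidate": 0.6 * 0.5,   # 0.30
    "third_candidate": 0.8 * 0.9,    # 0.72
}
replacement = max(candidate_scores, key=candidate_scores.get)  # "third_candidate"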
In conclusion, by combining the context information for error correction, the influence of the context information on the original sentence is considered, and the error correction precision is effectively improved, so that the correctness and the integrity of the sentence are ensured in an error correction scene.
Referring to the schematic diagram of a statement processing method shown in fig. 2, after the statement to be corrected is obtained, it can be detected by the wrongly-written-character detection model in the error detection module deployed at the server, and the wrongly written character in the statement to be corrected is obtained according to the detection result. The wrongly written character is then input to the recall model; the initial candidate characters corresponding to the wrongly written character are screened by the language model and the mask language model integrated in the recall model and filtered in combination with the candidate-character confusion set, so as to obtain the candidate characters corresponding to the wrongly written character. Finally, the candidate characters are input to the sorting module, the forward statement features and backward statement features of the statement to be corrected combined with the candidate characters are constructed, and the confusion degree of each candidate statement is obtained through the forward sorting model and the backward sorting model, so that the replacement statement is obtained based on the confusion degree.
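The flow of fig. 2 can be summarised by the pipeline sketch below; every interface here (detector, recaller, confusion_set, ranker and their methods) is a placeholder for the modules described above rather than a concrete implementation:

def correct_sentence(sentence, detector, recaller, confusion_set, ranker):
    # 1. Error detection: locate wrongly written characters in the sentence.
    for pos in detector.detect(sentence):
        # 2. Recall: propose initial candidates, then filter with the confusion set.
        initial = recaller.recall(sentence, pos)
        allowed = confusion_set.get(sentence[pos], set(initial))
        candidates = [c for c in initial if c in allowed]
        # 3. Ranking: score each candidate sentence with the forward/backward models.
        scores = {c: ranker.score(sentence, pos, c) for c in candidates}
        best = max(scores, key=scores.get)
        sentence = sentence[:pos] + best + sentence[pos + 1:]
    return sentence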
According to the statement processing method, after the statement to be corrected containing a wrongly written word is obtained, the candidate words corresponding to the wrongly written word are first determined, and candidate statements are generated from them together with the statement to be corrected; forward statement features and backward statement features are then constructed from the overall features of each candidate statement and input to the ranking model for processing, so that an accurately corrected replacement statement can be obtained from the processing result. When the statement to be corrected is corrected, the information of the preceding and following context and of the original statement is fully fused, so that the ranking model can output a more reliable prediction result and the accuracy of error correction is guaranteed.
The following further describes the statement processing method, with reference to fig. 3, by taking its application in a document error correction scenario as an example. Fig. 3 shows a processing flow chart of a statement processing method applied in a document error correction scenario according to an embodiment of the present application, which specifically includes the following steps:
step S302, acquiring the document to be corrected.
In this embodiment, a document to be corrected containing the text '… a dry year waits for one time, I have no regret …' is taken as an example. Because the document to be corrected contains too many characters, for convenience of description this embodiment is explained based on only part of the sentences extracted from the document; error correction of the whole document can refer to the corresponding description of this embodiment and is not repeated here.
Step S304, determining the statement to be corrected containing the wrongly written words in the document to be corrected.
Specifically, after the document to be corrected is obtained, it can be input to the ELECTRA model to classify the character at each position of each sentence, so as to determine whether each character in the sentence is a wrongly written character. That is, the characters at each position of '… a dry year waits for one time, I have no regret …' are input to the ELECTRA model for classification, and it is determined according to the processing result that 'a dry year waits for one time' contains a wrongly written character, namely 'dry'.
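A hedged sketch of such per-character detection using a token-classification ELECTRA model; the checkpoint path and the two-label scheme (0 = correct, 1 = wrong) are assumptions for illustration and are not specified by the patent:

import torch
from transformers import AutoTokenizer, ElectraForTokenClassification

tokenizer = AutoTokenizer.from_pretrained("path/to/chinese-electra-error-detector")
model = ElectraForTokenClassification.from_pretrained("path/to/chinese-electra-error-detector")

def detect_wrong_characters(sentence):
    # Feed each character as a separate word so predictions map back to positions.
    inputs = tokenizer(list(sentence), is_split_into_words=True, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits            # shape: (1, seq_len, 2)
    labels = logits.argmax(dim=-1)[0].tolist()
    word_ids = inputs.word_ids()
    # Collect character positions whose sub-tokens are classified as wrong (label 1).
    return sorted({word_ids[i] for i, lab in enumerate(labels) if lab == 1 and word_ids[i] is not None})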
Step S306, inputting the wrongly written characters in the sentence to be corrected into the recall model for processing, and obtaining candidate characters according to the processing result.
Specifically, once the wrongly written character is determined to be 'dry', in order to correct it and improve the accuracy after error correction, a correct character needs to be selected to replace it, so that a correct sentence is obtained and the correctness of the stored document is ensured. The wrongly written character can be input to the recall model for processing; meanwhile, in order to improve the precision of the candidate ranking after recall, the recalled candidate characters can be filtered in combination with the confusion set, so that candidate characters that better fit the original character are obtained for subsequent ranking.
Based on this, in this embodiment the wrongly written character 'dry' is recalled based on an N-gram language model and a BERT masked language model, and the top-6 characters are selected as initial candidate characters according to the processing result; that is, the wrongly written character 'dry' is processed by the N-gram language model and the BERT masked language model to obtain N candidate characters, the top-6 characters are then selected as initial candidate characters according to a preset selection rule, and the initial candidate characters are determined to be {dry, thousand, migration, lead, thousand, fiber} (two of the candidates are distinct characters that both translate as 'thousand').
Further, after the initial candidate characters are determined, a confusion set is introduced to filter them so that the recalled candidate characters do not deviate too much from the original character. The confusion set contains the homophones and similar-shaped characters of each character. The initial candidate characters are filtered through the confusion set, and the candidate characters with smaller deviation from the original character are determined to be {dry, thousand, migration, thousand, fiber}.
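A sketch of this recall-and-filter step, assuming a BERT masked language model (the bert-base-chinese checkpoint is only an example choice) and a confusion set mapping each character to its homophones and similar-shaped characters; the N-gram scoring part is omitted:

from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-chinese")

def recall_candidates(sentence, pos, top_k=6):
    # Mask the wrongly written character and let the masked LM propose replacements.
    masked = sentence[:pos] + fill_mask.tokenizer.mask_token + sentence[pos + 1:]
    predictions = fill_mask(masked, top_k=top_k)
    return [p["token_str"] for p in predictions]

def filter_with_confusion_set(original_char, candidates, confusion_set):
    # Keep only candidates that appear in the confusion set of the original
    # character, so that the recalled word does not deviate too far from it.
    allowed = confusion_set.get(original_char, set())
    return [c for c in candidates if c in allowed]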
Step S308, forward sentence input and backward sentence input are constructed based on the candidate words and the sentence to be corrected.
Specifically, a forward GPT2 model and a backward GPT2 model are set in the sorting module; the confusion degree of each candidate sentence is calculated by the two models at the same time, and a single value representing the confusion degree of the sentence is finally obtained by combining the two results (in this example, by multiplying them, see step S312). Before ranking, therefore, the forward GPT2 model and the backward GPT2 model need to be trained separately until models satisfying the training stop condition are obtained. It should be noted that the forward GPT2 model incorporates the influence of the preceding text on the wrongly written word in normal order, and the backward GPT2 model incorporates the influence of the following text in reverse order, thereby improving the ranking accuracy.
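Since the backward GPT2 model must read sentences from right to left, one way to obtain it (an assumption about the training setup, which the patent does not spell out) is simply to train an ordinary autoregressive model on character-reversed text; the sketch below only shows the data preparation:

def make_training_corpora(sentences):
    # Forward model trains on the original character order; backward model
    # trains on every sentence reversed character by character, so that at
    # inference time it models the context to the right of the wrong character.
    forward_corpus = ["".join(s) for s in sentences]
    backward_corpus = ["".join(reversed(s)) for s in sentences]
    return forward_corpus, backward_corpus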
Therefore, before the two models can be used, a forward sentence input and a backward sentence input need to be constructed. Further, the sentence to be corrected is 'a dry year waits for one time', in which 'dry' is the wrongly written character, and the candidate characters {dry, thousand, migration, thousand, fiber} corresponding to 'dry' have been obtained by the recall processing; at this point 5 candidate sentences can be formed from the candidate characters and the original characters of the sentence to be corrected, namely S_c1 = {dry, year, wait, one, time}; S_c2 = {thousand, year, wait, one, time}; S_c3 = {migration, year, wait, one, time}; S_c4 = {thousand, year, wait, one, time}; S_c5 = {fiber, year, wait, one, time}.
Further, since the confusion degree of each candidate sentence needs to be calculated by the forward GPT2 model and the backward GPT2 model at the same time, a forward sentence input and a backward sentence input need to be generated for each candidate sentence. Considering that the characters after the wrongly written character contribute far less to the overall confusion degree than the characters before it, i.e. they have essentially no effect on the confusion degree of each candidate sentence, this embodiment truncates at the wrong-character position to construct the forward sentence input and backward sentence input corresponding to each candidate sentence, that is:
For the candidate sentence S_c1 = {dry, year, wait, one, time}, the forward sentence input S_fc1 splices the sentence to be corrected (with '_' marking the wrong-character position) with the part of the candidate sentence truncated at the candidate character 'dry', wrapped by the identifiers <s> and <\s>; the backward sentence input S_bc1 splices the reverse-order sentence to be corrected with the reverse-order candidate sentence truncated at 'dry', wrapped by the same identifiers.
For S_c2 = {thousand, year, wait, one, time}, the forward input S_fc2 and the backward input S_bc2 are built in the same way around the candidate character 'thousand';
for S_c3 = {migration, year, wait, one, time}, S_fc3 and S_bc3 are built around 'migration';
for S_c4 = {thousand, year, wait, one, time}, S_fc4 and S_bc4 are built around that candidate character;
for S_c5 = {fiber, year, wait, one, time}, S_fc5 and S_bc5 are built around 'fiber'.
wherein _ "is a mark to indicate the current wrong word position of the model; < s > represents the start of the candidate sentence (forward or backward); < \ s > represents the end of the candidate sentence.
Step S310, inputting the forward sentence input and the backward sentence input into a forward GPT2 model and a backward GPT2 model respectively for processing, and obtaining a forward confusion degree and a backward confusion degree.
Specifically, after the forward sentence input and the backward sentence input corresponding to each candidate sentence are obtained, the forward sentence input can be fed to the forward GPT2 model to calculate the forward confusion degree and the backward sentence input to the backward GPT2 model to calculate the backward confusion degree. According to the calculation results, S_c1 has a forward confusion degree of 0.4 and a backward confusion degree of 0.3; S_c2 has 0.8 and 0.9; S_c3 has 0.5 and 0.5; S_c4 has 0.6 and 0.7; S_c5 has 0.5 and 0.6.
In step S312, each target confusion is calculated according to the forward confusion and the backward confusion, and a target replacement sentence is determined according to the target confusion.
Step S314, updating the document to be corrected according to the target replacement statement, and storing the updated document to be corrected.
Specifically, after obtaining the forward and backward perplexities of each candidate sentence, the forward perplexity and the backward perplexity of each candidate sentence may be multiplied at this time, and the multiplication result may be used as the target perplexity of each candidate sentence.
Based on this, the target perplexity of candidate sentence S_c1 is determined as L_c1 = 0.4 × 0.3 = 0.12; that of S_c2 as L_c2 = 0.8 × 0.9 = 0.72; that of S_c3 as L_c3 = 0.5 × 0.5 = 0.25; that of S_c4 as L_c4 = 0.6 × 0.7 = 0.42; and that of S_c5 as L_c5 = 0.5 × 0.6 = 0.30.
Further, after the target perplexity of each candidate sentence is determined, the candidate sentences can be ranked by target perplexity, i.e. L_c2 > L_c4 > L_c5 > L_c3 > L_c1; and since a greater perplexity value indicates a higher probability of being the replacement sentence, when candidate sentence S_c2 is determined to have the greatest value it is taken as the replacement sentence. The document to be corrected is updated based on this replacement sentence, the updated document being {… a thousand years waits for one time, I have no regret …}, and the updated document to be corrected is then stored.
In conclusion, by combining the context information for error correction, the influence of the context information on the original sentence is considered, and the error correction precision is effectively improved, so that the correctness of the content in the stored document is ensured in the document storage scene.
Corresponding to the above method embodiment, the present application further provides an embodiment of a sentence processing apparatus, and fig. 4 shows a schematic structural diagram of a sentence processing apparatus provided in an embodiment of the present application. As shown in fig. 4, the apparatus includes:
an obtaining module 402, configured to obtain a statement to be corrected, which contains a wrongly written word;
a determining module 404 configured to determine a candidate word corresponding to the wrongly written word, and generate a candidate statement based on the candidate word and the statement to be corrected;
a constructing module 406 configured to construct forward sentence features and backward sentence features corresponding to the candidate sentences;
the processing module 408 is configured to input the forward statement feature and the backward statement feature to the sorting module for processing, and determine a replacement statement corresponding to the statement to be corrected according to a processing result.
In an optional embodiment, the obtaining module 402 is further configured to:
acquiring a text to be processed, and constructing a character sequence based on the text to be processed; inputting the character sequence into a wrongly written character detection model for processing to obtain wrongly written characters contained in the text to be processed; and selecting the sentence containing the wrongly written words in the text to be processed as the sentence to be corrected.
In an optional embodiment, the determining module 404 is further configured to:
inputting the wrongly-written characters into a recall model for processing to obtain a plurality of initial candidate characters corresponding to the wrongly-written characters; and screening the candidate words corresponding to the wrongly-written words from the plurality of initial candidate words based on a preset screening strategy.
In an optional embodiment, the determining module 404 is further configured to:
determining a screening proportion and a candidate character confusion set according to the screening strategy; screening a set number of intermediate candidate words from the plurality of initial candidate words based on the screening proportion; and filtering a set number of the intermediate candidate words by using the candidate word confusion set, and obtaining the candidate words according to a filtering result.
In an optional embodiment, the determining module 404 is further configured to:
determining the character position of the wrongly written character in the statement to be corrected; and updating the statement to be corrected according to the character position by using the candidate words to obtain the candidate statement.
In an optional embodiment, the building module 406 is further configured to:
dividing the candidate sentences according to the positions of the candidate words in the candidate sentences to obtain candidate forward sentences; splicing the statement to be corrected with the candidate forward statement to obtain an initial forward statement; and adding statement marks in the initial forward statement, and constructing the forward statement features based on the initial forward statement with the added statement marks.
In an optional embodiment, the building module 406 is further configured to:
carrying out reverse order processing on the candidate sentences to obtain candidate backward sentences; dividing the candidate backward sentences according to the positions of the candidate words in the candidate backward sentences to obtain target candidate backward sentences; splicing the statement to be corrected and the target candidate backward statement to obtain an initial backward statement; and adding statement identifications in the initial backward statement, and constructing the backward statement features based on the initial backward statement with the added statement identifications.
In an optional embodiment, the processing module 408 is further configured to:
inputting the forward sentence characteristics and the backward sentence characteristics into the sorting module for processing to obtain candidate scores corresponding to the candidate sentences; and determining a target candidate sentence in the candidate sentences based on the candidate scores, and taking the target candidate sentence as a replacement sentence corresponding to the sentence to be corrected.
In an optional embodiment, the processing module 408 is further configured to:
inputting the forward sentence features and the backward sentence features to the sorting module; processing the forward sentence characteristics through a forward sorting model in the sorting module to obtain a forward confusion degree, and processing the backward sentence characteristics through a backward sorting model in the sorting module to obtain a backward confusion degree; and processing the forward confusion degree and the backward confusion degree through a computing unit in the sorting module to obtain and output candidate scores corresponding to the candidate sentences.
In an optional embodiment, the determining module 404 is further configured to:
calculating the number of editing operations between the confusion words contained in the candidate word confusion set and each intermediate candidate word; and filtering the set number of intermediate candidate words according to the number of editing operations, and obtaining the candidate words according to the filtering result.
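One plausible reading of this filtering step (an assumption, since the patent does not define the edit operation) compares each intermediate candidate with the words in the confusion set by Levenshtein edit distance, for example over their pinyin or stroke representations, and keeps those within a small number of operations; the threshold max_ops below is an assumed parameter:

def edit_distance(a, b):
    # Classic Levenshtein distance by dynamic programming.
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        dp[i][0] = i
    for j in range(len(b) + 1):
        dp[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1, dp[i][j - 1] + 1, dp[i - 1][j - 1] + cost)
    return dp[len(a)][len(b)]

def filter_by_edit_operations(confusion_words, intermediate_candidates, max_ops=1):
    # Keep candidates whose edit distance to some confusion word is small.
    return [c for c in intermediate_candidates
            if any(edit_distance(c, w) <= max_ops for w in confusion_words)]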
The statement processing device provided by this embodiment can, after obtaining a statement to be corrected containing a wrongly written word, determine the candidate words corresponding to the wrongly written word and generate candidate statements from them together with the statement to be corrected; it then constructs forward statement features and backward statement features from the overall features of the candidate statements and inputs them to the sorting model for processing, so that an accurately corrected replacement statement can be obtained according to the processing result. When the statement to be corrected is corrected, the information of the preceding and following context and of the original statement is fully fused, so that the sorting model can output a more reliable prediction result, and the accuracy of error correction is ensured.
The above is a schematic description of the sentence processing apparatus of this embodiment. It should be noted that the technical solution of the sentence processing apparatus and the technical solution of the sentence processing method described above belong to the same concept, and details not described in the technical solution of the apparatus can be found in the description of the technical solution of the sentence processing method. Further, the components in the apparatus embodiment should be understood as functional modules created to implement the steps of the program flow or the method; each functional module is not necessarily actually divided or separately defined. An apparatus claim defined by such a set of functional modules should be understood as a framework of functional modules implementing the solution mainly by means of the computer program described in the specification, not as a physical apparatus implementing the solution mainly by means of hardware.
Fig. 5 illustrates a block diagram of a computing device 500 provided according to an embodiment of the present application. The components of the computing device 500 include, but are not limited to, a memory 510 and a processor 520. Processor 520 is coupled to memory 510 via bus 530, and database 550 is used to store data.
Computing device 500 also includes access device 540, access device 540 enabling computing device 500 to communicate via one or more networks 560. Examples of such networks include the Public Switched Telephone Network (PSTN), a Local Area Network (LAN), a Wide Area Network (WAN), a Personal Area Network (PAN), or a combination of communication networks such as the internet. The access device 540 may include one or more of any type of network interface, e.g., a Network Interface Card (NIC), wired or wireless, such as an IEEE802.11 Wireless Local Area Network (WLAN) wireless interface, a worldwide interoperability for microwave access (Wi-MAX) interface, an ethernet interface, a Universal Serial Bus (USB) interface, a cellular network interface, a bluetooth interface, a Near Field Communication (NFC) interface, and so forth.
In one embodiment of the application, the above-described components of computing device 500 and other components not shown in FIG. 5 may also be connected to each other, such as by a bus. It should be understood that the block diagram of the computing device architecture shown in FIG. 5 is for purposes of example only and is not limiting as to the scope of the present application. Those skilled in the art may add or replace other components as desired.
Computing device 500 may be any type of stationary or mobile computing device, including a mobile computer or mobile computing device (e.g., tablet, personal digital assistant, laptop, notebook, netbook, etc.), mobile phone (e.g., smartphone), wearable computing device (e.g., smartwatch, smartglasses, etc.), or other type of mobile device, or a stationary computing device such as a desktop computer or PC. Computing device 500 may also be a mobile or stationary server.
Wherein processor 520 is configured to execute the following computer-executable instructions:
acquiring a statement to be corrected containing wrongly written words;
determining a candidate word corresponding to the wrongly written word, and generating a candidate statement based on the candidate word and the statement to be corrected;
constructing a forward statement feature and a backward statement feature corresponding to the candidate statement;
and inputting the forward statement features and the backward statement features into a sequencing module for processing, and determining the replacement statement corresponding to the statement to be corrected according to a processing result.
The above is an illustrative scheme of a computing device of the present embodiment. It should be noted that the technical solution of the computing device and the technical solution of the above statement processing method belong to the same concept, and details that are not described in detail in the technical solution of the computing device can be referred to the description of the technical solution of the above statement processing method.
An embodiment of the present application further provides a computer-readable storage medium storing computer instructions that, when executed by a processor, are configured to:
acquiring a statement to be corrected containing wrongly written words;
determining a candidate word corresponding to the wrongly written word, and generating a candidate statement based on the candidate word and the statement to be corrected;
constructing a forward statement feature and a backward statement feature corresponding to the candidate statement;
and inputting the forward statement features and the backward statement features into a sequencing module for processing, and determining the replacement statement corresponding to the statement to be corrected according to a processing result.
The above is an illustrative scheme of a computer-readable storage medium of the present embodiment. It should be noted that the technical solution of the storage medium belongs to the same concept as the technical solution of the above statement processing method, and details that are not described in detail in the technical solution of the storage medium can be referred to the description of the technical solution of the above statement processing method.
The present embodiment also discloses a chip storing a computer program which, when executed by the chip, implements the steps of the statement processing method described above.
The foregoing description of specific embodiments of the present application has been presented. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims may be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing may also be possible or may be advantageous.
The computer instructions comprise computer program code which may be in the form of source code, object code, an executable file or some intermediate form, or the like. The computer-readable medium may include: any entity or device capable of carrying the computer program code, recording medium, usb disk, removable hard disk, magnetic disk, optical disk, computer Memory, Read-Only Memory (ROM), Random Access Memory (RAM), electrical carrier wave signals, telecommunications signals, software distribution medium, and the like. It should be noted that the computer readable medium may contain content that is subject to appropriate increase or decrease as required by legislation and patent practice in jurisdictions, for example, in some jurisdictions, computer readable media does not include electrical carrier signals and telecommunications signals as is required by legislation and patent practice.
It should be noted that, for the sake of simplicity, the above-mentioned method embodiments are described as a series of acts or combinations, but those skilled in the art should understand that the present application is not limited by the described order of acts, as some steps may be performed in other orders or simultaneously according to the present application. Further, those skilled in the art should also appreciate that the embodiments described in the specification are preferred embodiments and that the acts and modules referred to are not necessarily required in this application.
In the above embodiments, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
The preferred embodiments of the present application disclosed above are intended only to aid in the explanation of the application. Alternative embodiments are not exhaustive and do not limit the invention to the precise embodiments described. Obviously, many modifications and variations are possible in light of the above teaching. The embodiments were chosen and described in order to best explain the principles of the application and its practical applications, to thereby enable others skilled in the art to best understand and utilize the application. The application is limited only by the claims and their full scope and equivalents.

Claims (13)

1. A sentence processing method, comprising:
acquiring a statement to be corrected containing wrongly written words;
determining a candidate word corresponding to the wrongly written word, and generating a candidate statement based on the candidate word and the statement to be corrected;
constructing a forward statement feature and a backward statement feature corresponding to the candidate statement;
and inputting the forward statement features and the backward statement features into a sequencing module for processing, and determining the replacement statement corresponding to the statement to be corrected according to a processing result.
2. The sentence processing method of claim 1, wherein the obtaining of the sentence to be corrected, which contains the erroneous word, comprises:
acquiring a text to be processed, and constructing a character sequence based on the text to be processed;
inputting the character sequence into a wrongly written character detection model for processing to obtain wrongly written characters contained in the text to be processed;
and selecting the sentence containing the wrongly written words in the text to be processed as the sentence to be corrected.
3. The sentence processing method of claim 1, wherein the determining the candidate word corresponding to the wrongly written word comprises:
inputting the wrongly-written characters into a recall model for processing to obtain a plurality of initial candidate characters corresponding to the wrongly-written characters;
and screening the candidate words corresponding to the wrongly-written words from the plurality of initial candidate words based on a preset screening strategy.
4. The sentence processing method of claim 3, wherein the screening the candidate words corresponding to the wrongly written word from the plurality of initial candidate words based on a preset screening policy comprises:
determining a screening proportion and a candidate character confusion set according to the screening strategy;
screening a set number of intermediate candidate words from the plurality of initial candidate words based on the screening proportion;
and filtering a set number of the intermediate candidate words by using the candidate word confusion set, and obtaining the candidate words according to a filtering result.
5. The sentence processing method of claim 1, wherein the generating a candidate sentence based on the candidate word and the sentence to be corrected comprises:
determining the character position of the wrongly written character in the statement to be corrected;
and updating the statement to be corrected according to the character position by using the candidate words to obtain the candidate statement.
6. The sentence processing method of any one of claims 1 to 5, wherein the constructing the forward sentence features corresponding to the candidate sentences comprises:
dividing the candidate sentences according to the positions of the candidate words in the candidate sentences to obtain candidate forward sentences;
splicing the statement to be corrected with the candidate forward statement to obtain an initial forward statement;
and adding statement marks in the initial forward statement, and constructing the forward statement features based on the initial forward statement with the added statement marks.
7. The sentence processing method according to any one of claims 1 to 5, wherein the constructing of the backward sentence features corresponding to the candidate sentences comprises:
carrying out reverse order processing on the candidate sentences to obtain candidate backward sentences;
dividing the candidate backward sentences according to the positions of the candidate words in the candidate backward sentences to obtain target candidate backward sentences;
splicing the statement to be corrected and the target candidate backward statement to obtain an initial backward statement;
and adding statement identifications in the initial backward statement, and constructing the backward statement features based on the initial backward statement with the added statement identifications.
8. The sentence processing method according to any one of claims 1 to 5, wherein the inputting the forward sentence characteristic and the backward sentence characteristic into a sorting module for processing and determining the alternative sentence corresponding to the sentence to be corrected according to the processing result comprises:
inputting the forward sentence characteristics and the backward sentence characteristics into the sorting module for processing to obtain candidate scores corresponding to the candidate sentences;
and determining a target candidate sentence in the candidate sentences based on the candidate scores, and taking the target candidate sentence as a replacement sentence corresponding to the sentence to be corrected.
9. The sentence processing method of claim 8, wherein the inputting the forward sentence characteristic and the backward sentence characteristic into the sorting module for processing to obtain the candidate score corresponding to the candidate sentence comprises:
inputting the forward sentence features and the backward sentence features to the sorting module;
processing the forward sentence characteristics through a forward sorting model in the sorting module to obtain a forward confusion degree, and processing the backward sentence characteristics through a backward sorting model in the sorting module to obtain a backward confusion degree;
and processing the forward confusion degree and the backward confusion degree through a computing unit in the sorting module to obtain and output candidate scores corresponding to the candidate sentences.
10. The sentence processing method of claim 4, wherein the filtering a set number of the intermediate candidate words by using the candidate word confusion set to obtain the candidate words according to a filtering result comprises:
calculating the times of editing operation of the confusion word contained in the candidate word confusion set and each intermediate candidate word;
and filtering the intermediate candidate words in a set number according to the editing operation times, and obtaining the candidate words according to a filtering result.
11. A sentence processing apparatus, comprising:
the acquisition module is configured to acquire a statement to be corrected containing wrongly written words;
the determining module is configured to determine a candidate word corresponding to the wrongly written word and generate a candidate statement based on the candidate word and the statement to be corrected;
the building module is configured to build a forward statement feature and a backward statement feature corresponding to the candidate statement;
and the processing module is configured to input the forward statement features and the backward statement features into the sorting module for processing, and determine the replacement statement corresponding to the statement to be corrected according to a processing result.
12. A computing device, comprising:
a memory and a processor;
the memory is configured to store computer-executable instructions, and the processor is configured to execute the computer-executable instructions to implement the steps of the method of any one of claims 1 to 10.
13. A computer-readable storage medium storing computer instructions, which when executed by a processor, perform the steps of the method of any one of claims 1 to 10.
CN202111510600.9A 2021-12-10 2021-12-10 Statement processing method and device Pending CN114154488A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111510600.9A CN114154488A (en) 2021-12-10 2021-12-10 Statement processing method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111510600.9A CN114154488A (en) 2021-12-10 2021-12-10 Statement processing method and device

Publications (1)

Publication Number Publication Date
CN114154488A true CN114154488A (en) 2022-03-08

Family

ID=80450687

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111510600.9A Pending CN114154488A (en) 2021-12-10 2021-12-10 Statement processing method and device

Country Status (1)

Country Link
CN (1) CN114154488A (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110852087A (en) * 2019-09-23 2020-02-28 腾讯科技(深圳)有限公司 Chinese error correction method and device, storage medium and electronic device
CN111563161A (en) * 2020-04-26 2020-08-21 深圳市优必选科技股份有限公司 Sentence recognition method, sentence recognition device and intelligent equipment
CN112989806A (en) * 2021-04-07 2021-06-18 广州伟宏智能科技有限公司 Intelligent text error correction model training method
CN113343671A (en) * 2021-06-07 2021-09-03 佳都科技集团股份有限公司 Statement error correction method, device and equipment after voice recognition and storage medium
CN113435187A (en) * 2021-06-24 2021-09-24 湖北大学 Text error correction method and system for industrial alarm information



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination