CN115713075A - Text processing method and device, electronic equipment and readable storage medium - Google Patents


Info

Publication number
CN115713075A
Authority
CN
China
Prior art keywords
text
error correction
target text
detection result
rule
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211446734.3A
Other languages
Chinese (zh)
Inventor
王亭
李志飞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Mobvoi Information Technology Co ltd
Original Assignee
Shanghai Mobvoi Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Mobvoi Information Technology Co ltd filed Critical Shanghai Mobvoi Information Technology Co ltd
Priority to CN202211446734.3A priority Critical patent/CN115713075A/en
Publication of CN115713075A publication Critical patent/CN115713075A/en
Pending legal-status Critical Current

Abstract

The embodiments of the application provide a text processing method and device, electronic equipment, and a readable storage medium, and relate to the technical field of computers. In the embodiments of the application, a target text is received, and text rule detection and model detection are performed on the target text. Error correction processing is then performed on the target text according to the result of the rule detection and the result of the model detection, so as to determine an error correction text corresponding to the target text. Because both text rule detection and model detection are performed on the target text during error correction, the target text can be corrected from multiple dimensions, achieving comprehensive error correction of the target text.

Description

Text processing method and device, electronic equipment and readable storage medium
Technical Field
The present application relates to the field of computer technologies, and in particular, to a text processing method and apparatus, an electronic device, and a readable storage medium.
Background
At present, when a user inputs text through an electronic device such as a computer, the electronic device can often perform error correction on the text entered by the user, i.e., check, mark, or correct errors in the text.
In the related art, error correction may be performed on a text through a model. However, the dimension in which the model processes the text is relatively limited, that is, the model cannot comprehensively detect the various types of errors occurring in the text.
Disclosure of Invention
In view of this, embodiments of the present application provide a text processing method, an apparatus, an electronic device, and a readable storage medium, so as to correct a target text from multiple dimensions, and implement comprehensive correction of the target text.
In a first aspect, a text processing method is provided, and the method includes:
Acquiring a target text.
Performing text rule detection on the target text, and determining a rule detection result.
Inputting the target text into a text detection model, and determining a model detection result.
Performing error correction processing on the target text according to the rule detection result and the model detection result, and determining an error correction text corresponding to the target text.
In some embodiments, the rule detection result comprises a duplicate field detection result.
The text rule detection is performed on the target text, and determining a rule detection result comprises:
and carrying out repeated field detection on the target text according to a preset repeated field detection rule so as to determine a repeated field detection result.
In some embodiments, the rule detection result comprises a common word detection result.
The text rule detection is performed on the target text, and determining a rule detection result comprises:
and performing common word detection on the target text according to a preset common word list, and marking the common words in the target text to determine a common word detection result.
In some embodiments, the performing error correction processing on the target text according to the rule detection result and the model detection result, and determining an error correction text corresponding to the target text includes:
and determining at least one error correction type label according to the rule detection result and the model detection result.
And determining an error correction candidate set corresponding to each error correction type label according to an error correction rule corresponding to each error correction type label, wherein the error correction candidate set comprises candidate characters or candidate words for correcting the target text.
And performing error correction processing on the target text according to each error correction candidate set, and determining the error correction text corresponding to the target text.
In some embodiments, the performing error correction processing on the target text according to each error correction candidate set, and determining an error correction text corresponding to the target text includes:
and performing simulated error correction on the target text according to a preset language model and each error correction candidate set to determine scores corresponding to the candidate characters or candidate words in each error correction candidate set.
And determining the target character or target word corresponding to each error correction candidate set according to the scores.
And correcting the target text according to the target character or target word corresponding to each error correction candidate set to determine the error correction text corresponding to the target text.
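The simulated error correction described above can be sketched as follows. This is a minimal illustration, not the applicant's implementation: the `score` callable stands in for the preset language model, `toy_score` is a placeholder that merely counts known words, and the candidate-set layout (`start`, `end`, `candidates`) is a hypothetical structure. Keeping the original field as one of the options is also an assumption.

```python
from typing import Callable

# Hypothetical stand-in for a language model: counts occurrences of known words.
KNOWN_WORDS = {"跑步"}


def toy_score(text: str) -> float:
    return float(sum(text.count(word) for word in KNOWN_WORDS))


def correct(text: str, candidate_sets: list[dict],
            score: Callable[[str], float]) -> str:
    """Try each candidate in place, score the trial text, keep the best one."""
    corrected = text
    # Work right to left so that earlier offsets stay valid after a replacement.
    for cs in sorted(candidate_sets, key=lambda c: c["start"], reverse=True):
        start, end = cs["start"], cs["end"]
        current = corrected
        # Assumption: the original field competes with the candidates.
        options = [current[start:end]] + list(cs["candidates"])
        best = max(options, key=lambda c: score(current[:start] + c + current[end:]))
        corrected = current[:start] + best + current[end:]
    return corrected
```

With a real language model, `score` would typically return a fluency measure such as a negated perplexity, so that a higher value indicates a more plausible sentence.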
In some embodiments, the error correction type tags include one or more of pronunciation-like tags, font-like tags, position-reversed tags, multi-word tags, few-word tags, and auxiliary-word error tags.
In some embodiments, the method further comprises:
and verifying the rule detection result and the model detection result according to a preset result correction dictionary.
Updating the rule detection result and/or the model detection result in response to the rule detection result and/or the model detection result hitting a character or word in the result correction dictionary.
In a second aspect, a text processing apparatus is provided, and the apparatus comprises:
and the target text acquisition module is configured to execute acquisition of the target text.
And the rule detection module is configured to execute text rule detection on the target text and determine a rule detection result.
And the model detection module is configured to execute inputting the target text into a text detection model and determining a model detection result.
And the error correction module is configured to execute error correction processing on the target text according to the rule detection result and the model detection result, and determine an error correction text corresponding to the target text.
In some embodiments, the rule detection result comprises a duplicate field detection result.
The rule detection module is specifically configured to perform:
and carrying out repeated field detection on the target text according to a preset repeated field detection rule so as to determine a repeated field detection result.
In some embodiments, the rule detection result comprises a common word detection result.
The rule detection module is specifically configured to perform:
and performing common word detection on the target text according to a preset common word list, and marking the common words in the target text to determine a common word detection result.
In some embodiments, the error correction module is specifically configured to perform:
and determining at least one error correction type label according to the rule detection result and the model detection result.
And determining an error correction candidate set corresponding to each error correction type label according to an error correction rule corresponding to each error correction type label, wherein the error correction candidate set comprises candidate characters or candidate words for correcting the target text.
And performing error correction processing on the target text according to each error correction candidate set, and determining an error correction text corresponding to the target text.
In some embodiments, the error correction module is specifically configured to perform:
and performing simulated error correction on the target text according to a preset language model and each error correction candidate set so as to determine scores corresponding to the candidate characters or candidate words in each error correction candidate set.
And determining the target character or target word corresponding to each error correction candidate set according to the scores.
And correcting the target text according to the target character or target word corresponding to each error correction candidate set to determine the error correction text corresponding to the target text.
In some embodiments, the error correction type tags include one or more of pronunciation-like tags, font-like tags, position-reversed tags, multi-word tags, few-word tags, and auxiliary-word error tags.
In some embodiments, the apparatus further comprises:
and the verification module is configured to verify the rule detection result and the model detection result according to a preset result correction dictionary.
An update module configured to perform updating the rule detection result and/or the model detection result in response to the rule detection result and/or the model detection result hitting a character or word in the result correction dictionary.
In a third aspect, an embodiment of the present application provides an electronic device, which includes a memory and a processor, where the memory is configured to store one or more computer program instructions, and the one or more computer program instructions are executed by the processor to implement the method according to the first aspect.
In a fourth aspect, embodiments of the present application provide a computer-readable storage medium on which computer program instructions are stored, which when executed by a processor implement the method according to the first aspect.
In the embodiment of the application, the embodiment of the application can receive the target text and perform text rule detection and model detection on the target text. Further, the embodiment of the application can perform error correction processing on the target text according to the result of the rule detection and the result of the model detection, so as to determine an error correction text corresponding to the target text. In the method and the device, in the process of correcting the target text, the text rule detection and the model detection are performed on the target text, so that the target text can be corrected from multiple dimensions, and the comprehensive correction of the target text is realized.
Drawings
The foregoing and other objects, features and advantages of the embodiments of the present application will be apparent from the following description of the embodiments of the present application with reference to the accompanying drawings in which:
FIG. 1 is a schematic flow chart illustrating a text processing method according to an embodiment of the present application;
FIG. 2 is a flowchart of a text processing method according to an embodiment of the present application;
FIG. 3 is a flowchart of another text processing method according to an embodiment of the present application;
FIG. 4 is a flowchart of another text processing method according to an embodiment of the present application;
FIG. 5 is a flowchart of another text processing method according to an embodiment of the present application;
FIG. 6 is a flowchart of another text processing method according to an embodiment of the present application;
FIG. 7 is a flowchart of another text processing method according to an embodiment of the present application;
FIG. 8 is a schematic structural diagram of a text processing apparatus according to an embodiment of the present application;
FIG. 9 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
The present application is described below based on examples, but the present application is not limited to only these examples. In the following detailed description of the present application, certain specific details are set forth in detail. It will be apparent to one skilled in the art that the present application may be practiced without these specific details. Well-known methods, procedures, components and circuits have not been described in detail so as not to obscure the present application.
Further, those of ordinary skill in the art will appreciate that the drawings provided herein are for illustrative purposes and are not necessarily drawn to scale.
Unless the context clearly requires otherwise, throughout the description, the words "comprise", "comprising", and the like are to be construed in an inclusive sense as opposed to an exclusive or exhaustive sense; that is, what is meant is "including but not limited to".
In the description of the present application, it is to be understood that the terms "first," "second," and the like are used for descriptive purposes only and are not to be construed as indicating or implying relative importance. In addition, in the description of the present application, "a plurality" means two or more unless otherwise specified.
With the development of computer technology, most users choose to input text using electronic devices such as computers. When a user inputs text through an electronic device, the electronic device can often perform error correction on the text entered by the user, i.e., check, mark, or correct errors in the text.
In the related art, error correction may be performed on the text through a model, for example, through a Bidirectional Encoder Representations from Transformers (BERT) model. However, the dimension in which the model processes the text is relatively limited, that is, the model cannot comprehensively detect the various types of errors occurring in the text. Taking the BERT model as an example, when a text is corrected based on the BERT model in the related art, only pronunciation or font problems can be corrected, and other types of text errors cannot, which may to a certain extent cause problems such as accumulated errors, missed errors, and false detections. Therefore, how to implement comprehensive error correction of texts is a problem that urgently needs to be solved.
In order to solve the above problem, an embodiment of the present application provides a text processing method, where the method may be applied to an electronic device, where the electronic device may be a terminal or a server, the terminal may be a smartphone, a tablet Computer, a Personal Computer (PC), or the like, and the server may be a single server, may also be a server cluster configured in a distributed manner, and may also be a cloud server.
As shown in fig. 1, the user 11 may input the target text 12 through an external input device such as a mouse or a keyboard, or through an input unit (an input unit such as a keyboard or a touch display screen) of the electronic device 13 itself. Accordingly, the electronic device 13 may receive the target text 12, and perform text rule detection and model detection on the target text 12 based on the text processing method. Further, the electronic device 13 may perform error correction processing on the target text 12 according to the result of the rule detection and the result of the model detection, so as to determine the error corrected text 14 corresponding to the target text 12. In the embodiment of the present application, in the process of correcting the target text 12, both the text rule detection and the model detection are performed on the target text 12, so that the embodiment of the present application can correct the target text 12 from multiple dimensions, thereby implementing the comprehensive correction of the target text 12.
Specifically, as shown in fig. 2, the text processing method may include the following steps:
in step S100, a target text is acquired.
The target text may include a plurality of words, phrases and sentences.
In an optional implementation manner, after the target text is obtained, the target text may be preprocessed to improve efficiency of text error correction.
The preprocessing may include noise filtering, text segmentation, space detection, and symbol detection. Specifically, meaningless text noise such as garbled characters may occur in the target text. In this case, the embodiment of the present application may recognize and delete the text noise, so that the effective information in the target text is retained.
The embodiment of the application can also perform sentence level division on the target text according to punctuation marks in the target text so as to label, record and return a sentence set corresponding to the target text. The embodiment of the application can also perform word recognition on the target text, so that word level division is performed on the target text to label, record and return a word set corresponding to the target text.
The embodiment of the application can also perform space detection or symbol detection on the target text, so as to record or delete the space or the symbol in the target text.
Therefore, by preprocessing the target text, the embodiment of the application can remove noise in the target text, divide the target text, and detect spaces or symbols in the target text, thereby improving the subsequent error correction efficiency for the target text.
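The preprocessing steps above can be sketched roughly as follows. This is an illustrative sketch under stated assumptions: the set of sentence-ending punctuation marks, the definition of noise (control characters and the Unicode replacement character), and all function names are hypothetical and are not specified by the application.

```python
import re
import unicodedata

# Assumption: these marks end a sentence; the source only says "punctuation marks".
SENTENCE_END = "。！？!?；;"


def filter_noise(text: str) -> str:
    """Drop control/format characters and the Unicode replacement character."""
    return "".join(
        ch for ch in text
        if ch != "\ufffd" and (ch in "\n\t" or unicodedata.category(ch)[0] != "C")
    )


def split_sentences(text: str) -> list[str]:
    """Split after each sentence-ending mark, keeping the mark with its sentence."""
    parts = re.split(f"(?<=[{re.escape(SENTENCE_END)}])", text)
    return [p.strip() for p in parts if p.strip()]


def find_spaces(text: str) -> list[int]:
    """Record the positions of whitespace characters."""
    return [i for i, ch in enumerate(text) if ch.isspace()]


def preprocess(text: str) -> dict:
    cleaned = filter_noise(text)
    return {
        "text": cleaned,
        "sentences": split_sentences(cleaned),
        "space_positions": find_spaces(cleaned),
    }
```

Word-level division is omitted here because it would normally rely on a word segmentation tool, which the application does not name.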
In step S200, text rule detection is performed on the target text, and a rule detection result is determined.
The text rule may be used to characterize a writing rule corresponding to the text. For example, errors corresponding to the text rule may include word-order reversal, multi-word and few-word problems, and auxiliary-word usage errors. The embodiment of the application can perform text rule detection to determine the text rule problems appearing in the target text, so as to determine the rule detection result. The rule detection result may include a field in which a text rule error occurs, a corresponding position of the field in the target text, an error type flag corresponding to the field, and the like.
It should be noted that there is no fixed execution sequence between step S200 and step S300 in the present embodiment, that is, after step S100 is executed in the present embodiment, step S200 may be executed first and then step S300 is executed, step S300 may be executed first and then step S200 may be executed, and step S200 and step S300 may be executed simultaneously.
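Because the two detection steps are independent of each other, they could, for instance, be run concurrently. The sketch below assumes that the rule detector and the model detector are plain callables taking the text and returning a list of results; both callables and the function name `detect` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def detect(text: str,
           rule_detect: Callable[[str], list],
           model_detect: Callable[[str], list]) -> tuple[list, list]:
    """Run rule detection (S200) and model detection (S300) side by side."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        rule_future = pool.submit(rule_detect, text)
        model_future = pool.submit(model_detect, text)
        # Either order of completion is fine; both results are needed for S400.
        return rule_future.result(), model_future.result()
```

Running the two detectors sequentially in either order would produce the same pair of results.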
In an optional embodiment, the rule detection result may include a repeated field detection result. The repeated field detection result may include a repeated field in the target text, a position of the repeated field in the target text, an error type flag corresponding to the repeated field, and the like.
Specifically, the step S200 may include the following steps:
in step S210, repeated field detection is performed on the target text according to a preset repeated field detection rule to determine a repeated field detection result.
In the embodiment of the present application, if a user has a situation such as an incorrect operation when inputting a text, consecutive identical fields may appear in a target text, and in this case, the embodiment of the present application may perform an operation such as marking on the repeated fields according to a preset repeated field detection rule, so as to determine a repeated field detection result.
Specifically, the repeated field detection can be performed on the target text according to the number of words in the repeated field. For example, as shown in fig. 3, the process of performing repeated field detection on the target text may include the following steps:
in step S31, a repeated field portion in the target text is determined.
The repeated field portion is used to characterize the repeated portion in consecutive identical fields, e.g., in a reduplicated field meaning "slowly", the repeated field portion is the single character meaning "slow", and in the field "hello hello", the repeated field portion is "hello".
In step S32, it is determined whether the number of characters in the repeated field portion is greater than 1. If the number of characters is greater than 1, step S33 is executed; if it is not greater than 1, step S31 is executed again.
In practical applications, some words are formed by repeating a single character (for example, reduplicated words meaning "slowly", "gradually", or "sometimes"), and if such words are determined to be repeated fields, a large number of false detections may occur. Therefore, in the repeated field detection rule, the embodiment of the application can exclude words formed by repeating a single character, that is, filter out repeated field portions whose number of characters is less than or equal to 1, thereby improving the accuracy of the repeated field detection.
In step S33, a duplicate field detection result is determined.
The repeated field detection result may include a repeated field in the target text, a position of the repeated field in the target text, an error type flag corresponding to the repeated field, and the like.
By detecting repeated fields of the target text, the detection range of the target text can be increased, and therefore comprehensive error correction of the target text is achieved.
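Steps S31 to S33 can be approximated with a backreference regular expression, as in the sketch below. The regular expression, the result layout, and the `multi_word` tag string are assumptions made for illustration; the application does not specify how the preset repeated field detection rule is encoded.

```python
import re

# Matches the shortest unit that is immediately followed by one or more copies of itself.
_REPEAT = re.compile(r"(.+?)\1+")


def detect_repeated_fields(text: str) -> list[dict]:
    results = []
    for match in _REPEAT.finditer(text):
        unit = match.group(1)          # step S31: the repeated field portion
        if len(unit) <= 1:             # step S32: skip single-character repeats
            continue
        results.append({               # step S33: record the detection result
            "field": match.group(0),
            "repeated_portion": unit,
            "start": match.start(),
            "end": match.end(),
            "tag": "multi_word",
        })
    return results
```

For the input "你好你好，慢慢走", this sketch reports "你好你好" and skips the reduplicated "慢慢", because its repeated portion is a single character.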
In an alternative embodiment, the rule detection result may include a common word detection result. The common word detection result may include an uncommon word appearing in the target text, a corresponding position of the uncommon word in the target text, an error type flag corresponding to the uncommon word, and the like.
Specifically, the step S200 may include the following steps:
in step S220, common word detection is performed on the target text according to a preset common word list, and the unusual words in the target text are marked to determine a common word detection result.
The target text can be traversed, all words in the target text are screened according to the preset common word list, and words not included in the common word list are determined as non-common words.
For example, a common word list of non-entity words may be preset for common word detection in the embodiments of the present application. The entity words can be used to characterize words with entity information (e.g., nouns, etc.), and correspondingly, the non-entity words can be used to characterize words without entity information (e.g., verbs, etc.).
It should be noted that, since entity words such as nouns are numerous and updated quickly, common word detection for entity words often causes false detections and missed detections. For example, for a new noun generated from a new object, if the common word list is not updated in time, a false detection may occur. The non-entity words do not carry entity information, so their number is relatively stable and their update frequency is low. Therefore, a common word list of non-entity words can be constructed in the embodiment of the application for common word detection.
Further, in the embodiment of the present application, when detecting common words based on the common word list of non-entity words, word segmentation processing may first be performed on the target text to determine each non-entity word in the target text. Then, according to the common word list of non-entity words, the embodiment of the application can screen the non-entity words in the target text to pick out the uncommon words among them and determine the common word detection result. The common word detection result may include an uncommon word appearing in the target text, a corresponding position of the uncommon word in the target text, an error type flag corresponding to the uncommon word, and the like.
By detecting common words in the target text, the detection range of the target text can be increased, and therefore comprehensive error correction of the target text is achieved.
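The common word detection described above can be sketched as a simple lookup over segmented words. The sketch assumes that word segmentation and entity recognition have already been done by some upstream tool, so the input is a list of hypothetical `(word, is_entity)` pairs; the `common_word_error` tag string is likewise illustrative.

```python
def detect_uncommon_words(tokens: list[tuple[str, bool]],
                          common_words: set[str]) -> list[dict]:
    """Flag non-entity words that are missing from the common word list."""
    results = []
    offset = 0  # character position of the current word in the target text
    for word, is_entity in tokens:
        # Entity words are skipped: the list only covers non-entity words.
        if not is_entity and word not in common_words:
            results.append({
                "word": word,
                "start": offset,
                "end": offset + len(word),
                "tag": "common_word_error",
            })
        offset += len(word)
    return results
```

Skipping entity words reflects the reasoning in the text: a newly coined noun that is absent from the list should not be reported as an error.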
In step S300, the target text is input into the text detection model, and a model detection result is determined.
If model detection is performed on a target text in Chinese, the embodiment of the present application may perform the model detection based on MacBERT (MLM as correction BERT), a pre-trained Chinese natural language model.
The pre-training of the MacBERT model adopts whole word masking, so compared with the BERT model, the MacBERT model performs model detection on a target text at word granularity in practical applications. Moreover, in Chinese text, most words are composed of a plurality of single characters, so by performing model detection at word granularity, the MacBERT model can detect multi-word, few-word, word-order reversal, and auxiliary-word error problems at word granularity, in addition to detecting wrongly written or mispronounced characters.
In step S400, an error correction process is performed on the target text according to the rule detection result and the model detection result, and an error correction text corresponding to the target text is determined.
In the embodiment of the application, the positions of the errors in the target text can be located according to the rule detection result and the model detection result. Furthermore, the target text can be directly corrected according to the language model, so that the error correction text corresponding to the target text can be determined. Alternatively, one or more optional fields may first be determined for each position where an error occurs in the target text, so that an applicable field is selected from the optional fields to correct the target text, and an error correction text corresponding to the target text is determined.
Therefore, the embodiment of the application can receive the target text and perform text rule detection and model detection on the target text. Further, the embodiment of the application can perform error correction processing on the target text according to the result of the rule detection and the result of the model detection, so as to determine an error correction text corresponding to the target text. Because both text rule detection and model detection are performed on the target text during error correction, the target text can be corrected from multiple dimensions, and comprehensive error correction of the target text is realized.
That is to say, because both text rule detection and model detection have certain limitations, in the embodiment of the present application, by performing text rule detection and model detection on a target text at the same time, complementation between text rule detection and model detection can be achieved, so that error correction can be performed on the target text from multiple dimensions, and comprehensive error correction on the target text is achieved.
In an alternative embodiment, as shown in fig. 4, the step S400 may include the following steps:
at step S410, at least one error correction type tag is determined according to the rule detection result and the model detection result.
The rule detection result and the model detection result can be used for marking and positioning fields with errors in the target text, so that the error correction type labels corresponding to the results can be determined after the rule detection result and the model detection result are determined, and the results can be classified.
Specifically, for the rule detection result, since the rule detection is to detect the target text based on the relatively fixed writing rule, the embodiment of the present application may preset different error correction type tags corresponding to the rule detection. For example, the repeated field detection result may correspond to an error correction type tag of "multiple words", and the common word detection result may correspond to an error correction type tag of "common word error".
For the model detection result, the embodiment of the application can add the error correction type label in the training set in the training process of the model for model detection, and train the model based on the training set. After the model for model detection is trained, the corresponding error correction type label can be output together while the text error is output according to the target text, namely, the error correction type label is included in the model detection result.
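The way the two kinds of results receive their error correction type labels can be sketched as follows. The detector names, tag strings, and dictionary layout are hypothetical; the only behavior taken from the text is that rule detections get a preset tag per detector, while model detections arrive with a tag already attached.

```python
# Assumption: one preset tag per rule detector, following the examples in the text.
RULE_TAGS = {
    "repeated_field": "multi_word",
    "common_word": "common_word_error",
}


def collect_tags(rule_results: list[dict], model_results: list[dict]) -> list[dict]:
    """Merge both result lists into one list of tagged detections."""
    tagged = []
    for item in rule_results:
        # Rule detections: look up the preset tag for the detector that fired.
        tagged.append({**item, "tag": RULE_TAGS[item["detector"]], "source": "rule"})
    for item in model_results:
        # Model detections: the model is assumed to output the tag itself.
        tagged.append({**item, "source": "model"})
    return sorted(tagged, key=lambda d: d["start"])
```

Sorting by start position is a convenience for later correction steps rather than something the application requires.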
In step S420, an error correction candidate set corresponding to each error correction type tag is determined according to the error correction rule corresponding to each error correction type tag.
The error correction candidate set may include a plurality of candidate characters or a plurality of candidate words, and a single error correction candidate set may include both candidate characters and candidate words.
Moreover, in the embodiment of the present application, different error correction type tags generally correspond to different error types, so that different error correction rules may be set for different error correction type tags (i.e., different error types) in the embodiment of the present application, so as to implement accurate error correction on a target text.
In an alternative embodiment, the error correction type tags may include one or more of pronunciation-like tags, font-like tags, position-reversed tags, multi-word tags, few-word tags, and auxiliary-word error tags.
The pronunciation similar label and the font similar label can be error correction type labels in the model detection result, and the position reversal label, the multi-word label, the few-word label and the auxiliary word error label can be error correction type labels corresponding to the rule detection result.
Specifically, in the training process of the model for model detection, pronunciation-similar tags and font-similar tags are added to the training set, and the model is trained based on the training set. After training, the model can output the corresponding error correction type tag together with each text error it detects in the target text; that is, the error correction type tag is included in the model detection result.
Meanwhile, different text rule detection processes can be set for the position-reversal tag, the multi-word tag, the few-word tag, and the auxiliary-word error tag, so as to determine a rule detection result that includes one or more of these tags. For example, the embodiment of the present application may determine a repeated field in the target text through repeated field detection and thereby determine the multi-word tag. The embodiment of the present application may also determine the auxiliary words used in the target text through common word detection and thereby determine the auxiliary-word error tag. The embodiment of the present application may further determine, through grammar detection, whether characters or words in the target text are reversed in order or missing, and thereby determine the position-reversal tag or the few-word tag.
Further, in the embodiment of the present application, different error correction rules may be set for each error correction type tag.
For the pronunciation-similar tag, the embodiment of the present application may pre-construct and maintain a confusion data set of similar pronunciations according to similarities in Chinese pinyin, such as similar initials, finals, and tones. After the pronunciation-similar tag is determined, candidate characters or candidate words whose pronunciation is similar to that of the field corresponding to the tag may be looked up in the confusion data set, so as to determine the error correction candidate set corresponding to the pronunciation-similar tag. In addition, the confusion data set of similar pronunciations may also be constructed and maintained in advance according to the pronunciation of other languages (for example, the phonetic symbols of English).
For the font-similar tag, the embodiment of the present application may pre-construct and maintain a confusion data set of similar fonts according to the character shapes of Chinese. After the font-similar tag is determined, candidate characters or candidate words whose shape is similar to that of the field corresponding to the tag may be looked up in the confusion data set, so as to determine the error correction candidate set corresponding to the font-similar tag. In addition, the confusion data set of similar fonts may also be constructed and maintained in advance according to the written forms of other languages (for example, the letter composition of English words).
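As a non-limiting illustration, the confusion data set lookup described above might be sketched as follows. The two dictionaries and the function name `confusion_candidates` are hypothetical and hand-written for this example; the application does not specify how the confusion data sets are stored, and a real data set would be far larger.

```python
# Hypothetical, hand-written confusion data sets (assumptions for illustration).
PRONUNCIATION_CONFUSION = {"在": ["再"], "再": ["在"], "他": ["她", "它"]}
FONT_CONFUSION = {"己": ["已", "巳"], "已": ["己", "巳"], "末": ["未"]}


def confusion_candidates(field, confusion_set):
    """Build candidates by swapping one character at a time for a confusable one."""
    candidates = []
    for i, ch in enumerate(field):
        for alternative in confusion_set.get(ch, []):
            candidate = field[:i] + alternative + field[i + 1:]
            if candidate not in candidates:  # keep the set free of duplicates
                candidates.append(candidate)
    return candidates
```

For example, `confusion_candidates("自已", FONT_CONFUSION)` yields the candidates "自己" and "自巳", from which a later ranking step would be expected to pick the correct one.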
For the position-reversal tag, after the tag is determined, the embodiment of the present application may generate all permutations of the characters in the field corresponding to the tag, and use the resulting permutations as the error correction candidate set corresponding to the position-reversal tag.
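A minimal sketch of this permutation step, assuming the field is a short string of characters, is given below. The function name is hypothetical. Because the number of permutations grows factorially with the field length, such a rule is presumably only practical for short fields.

```python
from itertools import permutations


def position_reversal_candidates(field):
    """Return every distinct ordering of the characters in `field`."""
    seen = set()
    candidates = []
    for perm in permutations(field):
        candidate = "".join(perm)
        if candidate not in seen:  # repeated characters yield duplicate orderings
            seen.add(candidate)
            candidates.append(candidate)
    return candidates
```

The original ordering is itself one of the permutations, so the candidate set always contains the unmodified field.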
For the multi-word tag, after the tag is determined, the embodiment of the present application determines the field corresponding to the multi-word tag and randomly deletes one or more characters from the field, thereby obtaining a candidate character or candidate word. After a plurality of candidate characters or candidate words are determined in this way, each of them, together with the field corresponding to the multi-word tag (i.e., the original field with nothing deleted), may be used as the error correction candidate set corresponding to the multi-word tag.
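The deletion rule might be sketched as follows. Note one deliberate substitution: the text describes random deletion, whereas this sketch enumerates every deletion of up to `max_deletions` characters so that the result is reproducible. The function name and the `max_deletions` parameter are assumptions made for this example.

```python
from itertools import combinations


def multi_word_candidates(field, max_deletions=2):
    """Candidates formed by deleting characters; the original field is kept."""
    candidates = [field]  # the undeleted field stays in the candidate set
    # Always leave at least one character in the field.
    limit = min(max_deletions, len(field) - 1)
    for k in range(1, limit + 1):
        for drop in combinations(range(len(field)), k):
            kept = "".join(ch for i, ch in enumerate(field) if i not in drop)
            if kept not in candidates:
                candidates.append(kept)
    return candidates
```

For a repeated field such as "的的", the candidate set is simply the original field and the single character "的".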
For the few-word tag, after the tag is determined, the embodiment of the present application determines the field corresponding to the few-word tag and fills in the missing characters or words by prediction with a preset language model (for example, a BERT model), thereby determining a plurality of candidate characters or candidate words. After the plurality of candidate characters or candidate words are determined, each of them, together with the field corresponding to the few-word tag (i.e., the original field with no character or word added), may be used as the error correction candidate set corresponding to the few-word tag.
For the auxiliary-word error tag, the embodiment of the present application may use a preset auxiliary word set (for example, a set including auxiliary words such as "的", "地", and "得") as the error correction candidate set corresponding to the auxiliary-word error tag.
According to the embodiment of the application, different error correction type tags generally correspond to different error types, so that different error correction rules can be set for different error correction type tags (namely different error types), and therefore error correction candidate sets corresponding to the error correction type tags are determined, and accurate error correction of the target text is achieved.
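One way to organize the per-tag error correction rules is a lookup table from tag to rule, sketched below under stated assumptions. The tag names, the dictionary layout of a detection (`field`, `tag`), and the two sample rules are all hypothetical; only the idea that each tag has its own rule comes from the text.

```python
from itertools import permutations

# Assumed preset auxiliary word set (illustrative only).
AUXILIARY_WORDS = ["的", "地", "得"]


def permutation_rule(field):
    """Position-reversal rule: every distinct ordering of the field."""
    return sorted({"".join(p) for p in permutations(field)})


def auxiliary_rule(field):
    """Auxiliary-word rule: the preset auxiliary word set, whatever the field."""
    return list(AUXILIARY_WORDS)


# Hypothetical tag names mapped to their error correction rules.
ERROR_CORRECTION_RULES = {
    "position_reversal": permutation_rule,
    "auxiliary_word_error": auxiliary_rule,
}


def recall_candidate_sets(detections):
    """Attach an error correction candidate set to each detected error."""
    recalled = []
    for detection in detections:
        rule = ERROR_CORRECTION_RULES.get(detection["tag"])
        # Unknown tags fall back to leaving the field unchanged.
        candidates = rule(detection["field"]) if rule else [detection["field"]]
        recalled.append({**detection, "candidates": candidates})
    return recalled
```

Adding a new error type then only requires registering one more rule in the table, which keeps the detection and candidate generation steps decoupled.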
In step S430, the error correction processing is performed on the target text according to each error correction candidate set, and an error correction text corresponding to the target text is determined.
The embodiment of the present application may select one target character or target word from the candidate characters or candidate words of each error correction candidate set, and replace the corresponding field in the target text with that target character or target word, so as to perform error correction processing on the target text and determine the error correction text corresponding to the target text.
In an alternative embodiment, as shown in fig. 5, step S430 may include the following steps:
In step S431, simulated error correction is performed on the target text according to a preset language model and each error correction candidate set, so as to determine a score corresponding to each candidate character or candidate word in each error correction candidate set.
If the target text is a Chinese text, the embodiment of the present application may replace the corresponding field in the target text with each candidate character or candidate word in the error correction candidate set, and then score each replaced text with a preset Chinese language model (for example, an N-Gram model). Because the N-Gram model computes its score from statistics such as word frequency and cannot capture semantic information, the score output by the N-Gram model may be used as the basis for a coarse ranking of the sentences corresponding to the candidate characters or candidate words in each error correction candidate set.
Further, the embodiment of the present application may calculate the perplexity (PPL) of each sentence in the coarse ranking, so as to determine, according to the PPL, the final score corresponding to each candidate character or candidate word in each error correction candidate set.
In step S432, the target character or target word corresponding to each error correction candidate set is determined according to the scores.
Specifically, the embodiment of the present application may use the candidate character or candidate word with the highest score as the target character or target word.
In step S433, the target text is corrected according to the target character or target word corresponding to each error correction candidate set, so as to determine the error correction text corresponding to the target text.
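Steps S431 to S433 can be illustrated with a toy character-level bigram model. This is only a sketch under several assumptions: the class `BigramModel`, the function `pick_target`, the add-one smoothing, and the tiny training corpus are all hypothetical, and the same model is used here for both the coarse ranking and the perplexity, which the text does not require. The final score is taken as the negative perplexity so that the highest score still wins.

```python
import math
from collections import Counter


class BigramModel:
    """Character-level bigram model with add-one smoothing (illustrative)."""

    def __init__(self, corpus):
        self.unigrams = Counter()
        self.bigrams = Counter()
        for sentence in corpus:
            tokens = ["<s>"] + list(sentence) + ["</s>"]
            self.unigrams.update(tokens)
            self.bigrams.update(zip(tokens, tokens[1:]))
        self.vocab_size = len(self.unigrams)

    def log_prob(self, sentence):
        tokens = ["<s>"] + list(sentence) + ["</s>"]
        total = 0.0
        for prev, cur in zip(tokens, tokens[1:]):
            numerator = self.bigrams[(prev, cur)] + 1
            denominator = self.unigrams[prev] + self.vocab_size
            total += math.log(numerator / denominator)
        return total

    def perplexity(self, sentence):
        predicted = len(sentence) + 1  # every character plus the end marker
        return math.exp(-self.log_prob(sentence) / predicted)


def pick_target(model, text, start, end, candidates, keep=3):
    """Simulate each replacement, rank coarsely, then choose by perplexity."""
    trials = [(c, text[:start] + c + text[end:]) for c in candidates]
    # Coarse ranking: keep the `keep` sentences with the best n-gram score.
    coarse = sorted(trials, key=lambda t: model.log_prob(t[1]), reverse=True)[:keep]
    # Final score is the negative perplexity, so the highest score wins.
    best_candidate, _ = max(coarse, key=lambda t: -model.perplexity(t[1]))
    return best_candidate
```

With a model trained on the three sentences "我们再见", "明天再见", and "我在家", calling `pick_target(model, "我们在见", 2, 3, ["在", "再"])` selects "再", because the bigrams around "再" were seen in training while those around "在" in this context were not.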
Therefore, the embodiment of the present application can receive the target text and perform text rule detection and model detection on the target text. Further, each error correction type tag may be determined according to the result of the rule detection and the result of the model detection, and error correction processing may be performed on the target text according to the error correction rule corresponding to each error correction type tag, thereby determining the error correction text corresponding to the target text. In the process of correcting the target text, the embodiment of the present application performs not only text rule detection but also model detection on the target text, so that the target text can be corrected from multiple dimensions and comprehensive correction of the target text is achieved.
In an optional implementation manner, the embodiment of the present application may further correct the rule detection result and the model detection result. Specifically, as shown in fig. 6, the process may include the following steps:
In step S61, the rule detection result and the model detection result are verified against a preset result correction dictionary.
The result correction dictionary may include fixed expressions, new words, hot words, easily confused words and sentences, date formats, predetermined Chinese rules, and the like.
In step S62, the rule detection result and/or the model detection result are updated in response to the rule detection result and/or the model detection result hitting a character or word in the result correction dictionary.
Specifically, the embodiment of the present application may remove the character or word that hits the result correction dictionary from the rule detection result and/or the model detection result, so as to update the rule detection result and/or the model detection result.
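Steps S61 and S62 might be sketched as follows, assuming each detection records the character span it covers in the target text. The field names `start` and `end`, the function names, and the choice to treat a detection as a hit when its span lies inside an occurrence of a dictionary entry are assumptions made for this example.

```python
def protected_spans(text, correction_dictionary):
    """Find every occurrence of a dictionary entry in the text."""
    spans = []
    for entry in correction_dictionary:
        if not entry:
            continue  # ignore empty entries
        start = text.find(entry)
        while start != -1:
            spans.append((start, start + len(entry)))
            start = text.find(entry, start + 1)
    return spans


def update_detection_result(text, detections, correction_dictionary):
    """Drop detections whose span falls inside a protected dictionary entry."""
    spans = protected_spans(text, correction_dictionary)
    return [
        d for d in detections
        if not any(s <= d["start"] and d["end"] <= e for s, e in spans)
    ]
```

In this sketch a hot word such as "打工人" in the dictionary removes any detection reported inside that word, while detections elsewhere in the text are left untouched.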
In the process of correcting the target text, if the target text contains new words, hot words, easily confused words, or other words that are easily mistaken for errors, the embodiment of the present application may wrongly recognize these words as erroneous. In this case, the embodiment of the present application may verify the rule detection result and the model detection result against the preset result correction dictionary, so as to prevent such words from being treated as errors.
In addition, because the result correction dictionary may also include date formats, predetermined Chinese rules, and other contents, when the target text contains an incorrectly written date or characters written in a specific style (such as popular internet vocabulary), the embodiment of the present application may also handle these situations through the result correction dictionary.
Therefore, after the target text is corrected comprehensively from multiple dimensions, the detection result can be corrected through the result correction dictionary, and the accuracy of text correction is further improved.
In an optional implementation manner, the embodiment of the present application may further perform post-processing on the error correction text. Specifically, the embodiment of the present application may build a text correction database from false detections that have occurred frequently in the past, and check and correct the error correction text according to the text correction database, for example, by reverting a corrected field in the error correction text or by marking a corrected field in the error correction text, so as to further improve the accuracy of text error correction.
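The reverting variant of this post-processing might look like the sketch below. It assumes the text correction database is a set of (original, corrected) pairs known to be false detections, and that each correction records its start position in the original text; both assumptions, and the function name `post_process`, are hypothetical.

```python
def post_process(original_text, corrections, false_detection_pairs):
    """Rebuild the corrected text, skipping corrections known to be false."""
    accepted = [
        c for c in corrections
        if (c["original"], c["corrected"]) not in false_detection_pairs
    ]
    result = original_text
    # Apply from right to left so earlier positions stay valid.
    for c in sorted(accepted, key=lambda c: c["start"], reverse=True):
        end = c["start"] + len(c["original"])
        result = result[:c["start"]] + c["corrected"] + result[end:]
    return result
```

Rebuilding from the original text, rather than undoing edits in the corrected text, avoids position bookkeeping when a correction changes the length of a field.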
With reference to the foregoing embodiments, the embodiments of the present application may receive a target text and perform preprocessing on the target text. Further, the embodiment of the application can perform text rule detection and model detection on the target text, correct the rule detection result and/or the model detection result according to the result correction dictionary, and further perform error correction processing on the target text according to the corrected rule detection result and the corrected model detection result, so that the corrected text corresponding to the target text is determined. Furthermore, the embodiment of the application can also perform post-processing on the error correction text so as to further improve the accuracy of text error correction.
For example, as shown in fig. 7, the above process may include the following steps:
In step S71, a target text is acquired.
In step S72, the target text is preprocessed.
The preprocessing process may include noise filtering, text segmentation, space detection, and symbol detection.
In step S73, repeated field detection is performed on the target text, and a repeated field detection result is determined.
The repeated field detection result may include a repeated field in the target text, a position of the repeated field in the target text, an error type flag corresponding to the repeated field, and the like.
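As one possible form of the repeated field detection rule, the sketch below uses a regular expression backreference to find a substring that is immediately followed by a copy of itself. The function name, the result layout, and the tag name `multi_word` are assumptions; the application does not specify the rule itself.

```python
import re

# A shortest-possible substring immediately followed by a copy of itself.
REPEAT_PATTERN = re.compile(r"(.+?)\1")


def detect_repeated_fields(text):
    """Report each adjacent repetition with its position and an error type."""
    results = []
    for match in REPEAT_PATTERN.finditer(text):
        results.append({
            "field": match.group(0),   # the repeated field as it appears
            "unit": match.group(1),    # the substring that was repeated
            "start": match.start(),
            "end": match.end(),
            "tag": "multi_word",       # hypothetical error type name
        })
    return results
```

A rule this simple also flags legitimate reduplications such as "谢谢", which is one reason a later verification step against a result correction dictionary is useful.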
In step S74, common word detection is performed on the target text, and a common word detection result is determined.
The common word detection result may include an uncommon word appearing in the target text, a corresponding position of the uncommon word in the target text, an error type flag corresponding to the uncommon word, and the like.
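A minimal sketch of common word detection at the character level is shown below, assuming the preset common word list is available as a set of characters. The function name and the tag name `uncommon_character` are hypothetical.

```python
def detect_uncommon_characters(text, common_characters):
    """Report every character that is absent from the preset common list."""
    return [
        {"field": ch, "position": i, "tag": "uncommon_character"}
        for i, ch in enumerate(text)
        if not ch.isspace() and ch not in common_characters
    ]
```

In practice the common list would presumably also cover punctuation and digits so that they are not reported as uncommon.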
In step S75, model detection is performed on the target text, and a model detection result is determined.
It should be noted that step S73, step S74 and step S75 do not have a fixed execution sequence, that is, in the embodiment of the present application, step S73, step S74 and step S75 may be executed in a certain sequence, or step S73, step S74 and step S75 may be executed synchronously.
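Because the three detection steps are independent of one another, they could, for example, be submitted to a thread pool and collected afterwards. The sketch below assumes each detector is a callable that takes the target text and returns its detection result; the function name `run_detectors` and the dictionary of detectors are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def run_detectors(text, detectors):
    """Run independent detectors on the same text and gather their results."""
    workers = max(1, len(detectors))  # the pool needs at least one worker
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {name: pool.submit(fn, text) for name, fn in detectors.items()}
        return {name: future.result() for name, future in futures.items()}
```

Running the detectors one after another in a plain loop would produce the same results, which matches the statement that no fixed execution order is required.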
In addition, fig. 7 is only an example of the embodiment of the present application, and in practical applications, the embodiment of the present application may further include other types of detection (e.g., syntax detection, etc.).
In step S76, the detection result is corrected.
The embodiment of the present application may correct the repeated field detection result, the common word detection result, and the model detection result according to the result correction dictionary.
In step S77, the error correction candidate set is recalled.
Recalling the error correction candidate set means determining the error correction candidate set corresponding to each error correction type tag according to the error correction rule corresponding to that error correction type tag.
In step S78, the error correction text is determined by sorting based on the error correction candidate set.
The embodiment of the present application may perform simulated error correction on the target text according to the language model and the error correction candidate sets, so as to determine the score corresponding to each candidate character or candidate word in each error correction candidate set. The target character or target word corresponding to each error correction candidate set is then determined according to the scores, and the target text is corrected according to the target character or target word corresponding to each error correction candidate set, so as to determine the error correction text corresponding to the target text.
In step S79, post-processing is performed on the corrected text.
The text correction database can be set according to the frequently-occurring historical false detection conditions, and the error correction text is checked and corrected according to the text correction database.
By the embodiment of the application, in the process of correcting the target text, the text rule detection is carried out on the target text, and the model detection is also carried out on the target text, so that the embodiment of the application can correct the target text from multiple dimensions, and the comprehensive correction of the target text is realized.
Based on the same technical concept, an embodiment of the present application further provides a text processing apparatus, as shown in fig. 8, the apparatus includes: a target text acquisition module 81, a rule detection module 82, a model detection module 83, and an error correction module 84.
The target text acquisition module 81 is configured to acquire the target text.
The rule detection module 82 is configured to perform text rule detection on the target text and determine a rule detection result.
The model detection module 83 is configured to input the target text into a text detection model and determine a model detection result.
The error correction module 84 is configured to perform error correction processing on the target text according to the rule detection result and the model detection result, and determine an error correction text corresponding to the target text.
In some embodiments, the rule detection result comprises a duplicate field detection result.
The rule detection module 82 is specifically configured to perform:
and carrying out repeated field detection on the target text according to a preset repeated field detection rule so as to determine a repeated field detection result.
In some embodiments, the rule detection result comprises a common word detection result.
The rule detection module 82 is specifically configured to perform:
and performing common word detection on the target text according to a preset common word list, and marking the non-common words in the target text to determine a common word detection result.
In some embodiments, the error correction module 84 is specifically configured to perform:
determining at least one error correction type tag according to the rule detection result and the model detection result;
determining an error correction candidate set corresponding to each error correction type tag according to an error correction rule corresponding to each error correction type tag, wherein the error correction candidate set comprises candidate characters or candidate words for correcting the target text; and
performing error correction processing on the target text according to each error correction candidate set, and determining an error correction text corresponding to the target text.
In some embodiments, the error correction module 84 is specifically configured to perform:
performing simulated error correction on the target text according to a preset language model and each error correction candidate set, so as to determine scores corresponding to the candidate characters or candidate words in each error correction candidate set;
determining the target character or target word corresponding to each error correction candidate set according to the scores; and
correcting the target text according to the target character or target word corresponding to each error correction candidate set, so as to determine the error correction text corresponding to the target text.
In some embodiments, the error correction type tags include one or more of pronunciation-similar tags, font-similar tags, position-reversal tags, multi-word tags, few-word tags, and auxiliary-word error tags.
In some embodiments, the apparatus further comprises:
a verification module configured to verify the rule detection result and the model detection result according to a preset result correction dictionary; and
an update module configured to update the rule detection result and/or the model detection result in response to the rule detection result and/or the model detection result hitting a character or word in the result correction dictionary.
In the embodiment of the application, the target text can be received, and text rule detection and model detection can be performed on the target text. Further, the embodiment of the application can perform error correction processing on the target text according to the result of the rule detection and the result of the model detection, so as to determine an error correction text corresponding to the target text. In the method and the device, in the process of correcting the target text, the text rule detection and the model detection are performed on the target text, so that the target text can be corrected from multiple dimensions, and the comprehensive correction of the target text is realized.
Fig. 9 is a schematic diagram of an electronic device according to an embodiment of the present application. As shown in fig. 9, the electronic device is a general-purpose text processing device with a general computer hardware structure that includes at least a processor 91 and a memory 92. The processor 91 and the memory 92 are connected by a bus 93. The memory 92 is adapted to store instructions or programs executable by the processor 91. The processor 91 may be a stand-alone microprocessor or a collection of one or more microprocessors. The processor 91 thus processes data and controls other devices by executing the instructions stored in the memory 92, so as to perform the method flows of the embodiments of the present application described above. The bus 93 connects the above-described components together, and also connects them to a display controller 94, a display device, and input/output (I/O) devices 95. The input/output (I/O) devices 95 may be a mouse, a keyboard, a modem, a network interface, a touch input device, a motion-sensing input device, a printer, and other devices known in the art. Typically, the input/output devices 95 are coupled to the system through an input/output (I/O) controller 96.
As will be appreciated by one of skill in the art, embodiments of the present application may be provided as a method, apparatus (device) or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may employ a computer program product embodied on one or more computer-readable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations of methods, apparatus (devices) and computer program products according to embodiments of the application. It will be understood that each flow in the flow diagrams can be implemented by computer program instructions.
These computer program instructions may be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows.
These computer program instructions may also be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows.
Another embodiment of the present application is directed to a non-transitory storage medium storing a computer-readable program for causing a computer to perform some or all of the above-described method embodiments.
That is, as can be understood by those skilled in the art, all or part of the steps of the methods in the embodiments described above may be accomplished by a program instructing the relevant hardware, where the program is stored in a storage medium and includes several instructions that enable a device (which may be a single-chip microcomputer, a chip, or the like) or a processor to execute all or part of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, and other media capable of storing program code.
The above description is only a preferred embodiment of the present application and is not intended to limit the present application, and various modifications and changes may be made to the present application by those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application shall be included in the protection scope of the present application.

Claims (10)

1. A method of text processing, the method comprising:
acquiring a target text;
performing text rule detection on the target text to determine a rule detection result;
inputting the target text into a text detection model, and determining a model detection result; and
performing error correction processing on the target text according to the rule detection result and the model detection result, and determining an error correction text corresponding to the target text.
2. The method of claim 1, wherein the rule detection result comprises a repeated field detection result;
the performing text rule detection on the target text to determine a rule detection result comprises:
carrying out repeated field detection on the target text according to a preset repeated field detection rule, so as to determine a repeated field detection result.
3. The method of claim 1, wherein the rule detection result comprises a common word detection result;
the performing text rule detection on the target text to determine a rule detection result comprises:
performing common word detection on the target text according to a preset common word list, and marking the non-common words in the target text to determine a common word detection result.
4. The method according to claim 1, wherein the performing error correction processing on the target text according to the rule detection result and the model detection result, and determining an error corrected text corresponding to the target text comprises:
determining at least one error correction type label according to the rule detection result and the model detection result;
determining an error correction candidate set corresponding to each error correction type tag according to an error correction rule corresponding to each error correction type tag, wherein the error correction candidate set comprises candidate characters or candidate words for correcting the target text; and
performing error correction processing on the target text according to each error correction candidate set, and determining an error correction text corresponding to the target text.
5. The method of claim 4, wherein performing error correction processing on the target text according to each of the error correction candidate sets, and determining an error correction text corresponding to the target text comprises:
performing simulated error correction on the target text according to a preset language model and each error correction candidate set to determine scores corresponding to candidate characters or candidate words in each error correction candidate set;
determining a target character or a target word corresponding to each error correction candidate set according to the scores; and
correcting the target text according to the target character or target word corresponding to each error correction candidate set to determine the error correction text corresponding to the target text.
6. The method of claim 4 or 5, wherein the error correction type tags comprise one or more of pronunciation-similar tags, font-similar tags, position-reversal tags, multi-word tags, few-word tags, and auxiliary-word error tags.
7. The method of claim 1, further comprising:
verifying the rule detection result and the model detection result according to a preset result correction dictionary; and
updating the rule detection result and/or the model detection result in response to the rule detection result and/or the model detection result hitting a character or word in the result correction dictionary.
8. A text processing apparatus, characterized in that the apparatus comprises:
a target text acquisition module configured to perform acquisition of a target text;
the rule detection module is configured to perform text rule detection on the target text and determine a rule detection result;
the model detection module is configured to input the target text into a text detection model and determine a model detection result; and
the error correction module is configured to perform error correction processing on the target text according to the rule detection result and the model detection result, and determine an error correction text corresponding to the target text.
9. An electronic device comprising a memory and a processor, wherein the memory is configured to store one or more computer program instructions, wherein the one or more computer program instructions are executed by the processor to implement the method of any of claims 1-7.
10. A computer-readable storage medium, characterized in that a computer program is stored in the computer-readable storage medium, which computer program, when being executed by a processor, carries out the method of any one of claims 1-7.
CN202211446734.3A 2022-11-18 2022-11-18 Text processing method and device, electronic equipment and readable storage medium Pending CN115713075A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211446734.3A CN115713075A (en) 2022-11-18 2022-11-18 Text processing method and device, electronic equipment and readable storage medium


Publications (1)

Publication Number Publication Date
CN115713075A true CN115713075A (en) 2023-02-24

Family

ID=85233678



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination