CN114036930A - Text error correction method, device, equipment and computer readable medium - Google Patents


Info

Publication number
CN114036930A
CN114036930A (application CN202111265955.6A)
Authority
CN
China
Prior art keywords
text
corrected
target
character
candidate
Prior art date
Legal status
Pending
Application number
CN202111265955.6A
Other languages
Chinese (zh)
Inventor
王斌
尤旸
Current Assignee
Beijing Minglue Zhaohui Technology Co Ltd
Original Assignee
Beijing Minglue Zhaohui Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Minglue Zhaohui Technology Co Ltd
Priority to CN202111265955.6A
Publication of CN114036930A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/232 Orthographic correction, e.g. spell checking or vowelisation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367 Ontology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/216 Parsing using statistical methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/237 Lexical tools
    • G06F40/242 Dictionaries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities

Abstract

The application relates to a text error correction method, apparatus, device and computer readable medium. The method comprises the following steps: acquiring a text to be corrected; identifying the text to be corrected based on a first knowledge graph to obtain target wrong words in the text to be corrected, wherein the target wrong words are words in the text to be corrected that are irrelevant to its content, and the first knowledge graph is used for recording domain knowledge relevant to the content of the text to be corrected; determining candidate replacement words for the target wrong words based on a second knowledge graph, wherein the second knowledge graph is used for recording domain knowledge about confusable text; performing feature ranking on the candidate texts obtained by replacing the target wrong words with the candidate replacement words; and determining the top-ranked candidate text as the corrected text. The method and the device solve the technical problem of low text error correction accuracy.

Description

Text error correction method, device, equipment and computer readable medium
Technical Field
The present application relates to the field of computer technology, and in particular, to a text error correction method, apparatus, device, and computer readable medium.
Background
Chinese error correction is a key technology for automatically checking and correcting Chinese sentences; its goal is to improve language correctness and reduce the cost of manual proofreading. The error correction module is one of the most fundamental modules in natural language processing, and its importance is self-evident.
In daily life, wrongly written characters frequently appear in social tools and public articles. Research suggests that the text error rate in new-media content such as microblogs is about 2%, while in the field of speech recognition it can reach as high as 8-10%.
At present, the related art mainly corrects text based on pinyin: the pinyin of all entries in local dictionaries is first placed in a data structure, the text to be corrected is then annotated with pinyin, the edit distance between its pinyin and the pinyin of each dictionary entry is calculated, entries within a set distance threshold are taken as candidate texts, and the candidates are scored by a fixed rule to select the highest-scoring one. For longer texts, a character-level edit distance between the input text and the existing dictionary entries is computed instead, and the highest-scoring text is again selected by a ranking rule. Such pinyin- or character-edit-distance methods depend heavily on the dictionary: once the user inputs text that is not in the dictionary, the error correction accuracy drops sharply and the user experience suffers.
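The pinyin baseline described above (annotate with pinyin, filter dictionary entries by edit distance, score the survivors) can be sketched as follows. The toy lexicon and its pre-computed pinyin strings are illustrative assumptions; a real system would derive pinyin with a converter library.

```python
def edit_distance(a: str, b: str) -> int:
    """Classic Levenshtein distance via single-row dynamic programming."""
    m, n = len(a), len(b)
    dp = list(range(n + 1))
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,                          # deletion
                        dp[j - 1] + 1,                      # insertion
                        prev + (a[i - 1] != b[j - 1]))      # substitution
            prev = cur
    return dp[n]

def recall_candidates(query_pinyin: str, dictionary_pinyin: dict, threshold: int = 1):
    """Return dictionary words whose pinyin is within `threshold` edits
    of the query pinyin -- the candidate texts of the baseline method."""
    return [word for word, py in dictionary_pinyin.items()
            if edit_distance(query_pinyin, py) <= threshold]

# Toy dictionary mapping words to pre-computed pinyin (an assumption).
lexicon = {"明察": "mingcha", "检查": "jiancha"}
print(recall_candidates("mingca", lexicon))  # near-miss pinyin input
```

This also illustrates the stated weakness: a query whose pinyin is far from every dictionary entry recalls nothing.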
For the technical problem of low text error correction accuracy, no effective solution has yet been proposed.
Disclosure of Invention
The application provides a text error correction method, apparatus, device and computer readable medium, so as to solve the technical problem of low text error correction accuracy.
According to an aspect of an embodiment of the present application, there is provided a text error correction method, including:
acquiring a text to be corrected;
identifying the text to be corrected based on a first knowledge graph to obtain target wrong words in the text to be corrected, wherein the target wrong words are words in the text to be corrected that are irrelevant to its content, and the first knowledge graph is used for recording domain knowledge relevant to the content of the text to be corrected;
determining candidate replacement words for the target wrong words based on a second knowledge graph, wherein the second knowledge graph is used for recording domain knowledge about confusable text;
performing feature ranking on candidate texts obtained by replacing the target wrong words with the candidate replacement words;
and determining the top-ranked candidate text as the corrected text.
Optionally, identifying the text to be corrected based on the first knowledge graph to obtain the target wrong words in the text to be corrected includes:
extracting a target dictionary from the first knowledge graph, and preprocessing a text to be corrected, wherein the content of the target dictionary comprises target attribute values of corresponding entities in the first knowledge graph, and the preprocessing comprises at least one of format conversion, format standardization and character filtering;
dividing the preprocessed text to be corrected into a plurality of character combinations by using the target dictionary, and matching each character combination against the data in the target dictionary, wherein each character is assigned to at least one character combination and each character combination contains at least one character;
and determining character combinations which do not match with the data in the target dictionary as target wrong words.
Optionally, identifying the text to be corrected based on the first knowledge graph to obtain the target wrong words in the text to be corrected further includes:
creating a dictionary tree by using the first knowledge graph, wherein the dictionary tree stores the mapping from the pinyin of domain knowledge related to the content of the text to be corrected to entities, each node in the tree corresponding to one pinyin character of a stored pinyin;
converting the character combination into a pinyin combination, and determining the target pinyin chain in the dictionary tree with the minimum edit distance to the pinyin combination;
determining the preset character combination of the entity corresponding to the target pinyin chain according to the pinyin-to-entity mapping;
and determining the character combination as a target wrong word when the character combination is inconsistent with the preset character combination.
Optionally, determining the target wrong word in the text to be corrected further includes:
masking each character combination in the text to be corrected in turn, following the character order of the text to be corrected, to obtain a plurality of masked text sequences;
inputting each masked text sequence into a mask language model, and obtaining the predicted character output by the model together with its prediction probability, wherein the predicted character is the character the model fills into the masked position based on the context of the masked character, and the prediction probability is the probability of that character filling the masked position;
and determining the masked character as a target wrong word when the masked character combination is inconsistent with the corresponding predicted character.
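The mask-and-predict loop above can be sketched as follows. The `predict_masked` function is a hypothetical stub standing in for a real masked language model (e.g., a BERT-style model); only the control flow mirrors the claimed steps.

```python
MASK = "[MASK]"

def mask_each_token(tokens):
    """Yield (index, masked_sequence) for every token position in turn."""
    for i in range(len(tokens)):
        yield i, tokens[:i] + [MASK] + tokens[i + 1:]

def predict_masked(masked_tokens):
    """Hypothetical stand-in for a masked language model: returns a
    (predicted_token, probability) pair for the [MASK] position."""
    table = {("credit_card", MASK, "fee"): ("handling", 0.92)}
    return table.get(tuple(masked_tokens), ("<unk>", 0.0))

def detect_errors(tokens, prob_threshold=0.5):
    """Flag positions where a confident model prediction disagrees
    with the character actually present in the text."""
    errors = []
    for i, masked in mask_each_token(tokens):
        pred, prob = predict_masked(masked)
        if prob >= prob_threshold and pred != tokens[i]:
            errors.append((i, tokens[i], pred))
    return errors

print(detect_errors(["credit_card", "handing", "fee"]))
```

The same prediction probability is later reused as a recall signal for candidate replacement words.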
Optionally, determining candidate replacement words for the target wrong word based on the second knowledge graph comprises:
extracting a confusion word data set from the second knowledge graph;
and determining words in the confusion word data set whose similarity to the target wrong word is greater than or equal to a similarity threshold as candidate replacement words.
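A sketch of this similarity-threshold recall. The patent does not fix a concrete similarity metric, so a simple Jaccard similarity over character sets is assumed here, and the confusion set is illustrative.

```python
def similarity(a: str, b: str) -> float:
    """Illustrative Jaccard similarity over character sets; the concrete
    metric is left unspecified in the claims."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb)

def candidate_replacements(wrong_word: str, confusion_set, threshold: float = 0.6):
    """Words in the confusion data set whose similarity to the target
    wrong word meets the threshold become candidate replacement words."""
    return [w for w in confusion_set
            if w != wrong_word and similarity(wrong_word, w) >= threshold]

confusion_set = ["handling", "handing", "binding"]
print(candidate_replacements("handing", confusion_set))
```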
Optionally, determining the candidate replacement word of the target wrong word further comprises at least one of the following ways:
determining a preset character combination as a candidate replacement word;
and determining the predicted character as a candidate replacement word when the prediction probability of the predicted character is greater than or equal to the probability threshold.
Optionally, performing feature ranking on the candidate texts obtained by replacing the target wrong words with the candidate replacement words includes:
inputting the candidate texts into a logistic regression model, and obtaining the text ranking result output by the logistic regression model, wherein the logistic regression model is used for extracting text features and performing feature ranking, and the text ranking result is a ranking of the similarity between the text features of each candidate text and the content features of the text to be corrected;
the text features include at least one of:
the selection frequency of the candidate text;
the edit distance between the candidate text and the text to be corrected;
the Jaccard distance between the pinyin of the candidate text and the pinyin of the text to be corrected;
semantic accuracy of the candidate text determined by the multivariate language model.
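The four features listed above can be assembled into a vector for the logistic regression ranker roughly as follows. The selection-frequency log and the language-model scorer are hypothetical stand-ins, and `difflib.SequenceMatcher` is used as an inexpensive proxy for edit-distance similarity.

```python
from difflib import SequenceMatcher

def jaccard_distance(a, b) -> float:
    """Jaccard distance between two pinyin syllable sequences."""
    sa, sb = set(a), set(b)
    return 1.0 - len(sa & sb) / len(sa | sb)

def extract_features(candidate, original, cand_pinyin, orig_pinyin,
                     selection_log, lm_score):
    """Feature vector for one candidate text. `selection_log` (query-log
    counts) and `lm_score` (an n-gram language model) are assumed inputs."""
    return [
        selection_log.get(candidate, 0),                     # selection frequency
        SequenceMatcher(None, candidate, original).ratio(),  # char similarity (edit-distance proxy)
        jaccard_distance(cand_pinyin, orig_pinyin),          # pinyin Jaccard distance
        lm_score(candidate),                                 # semantic accuracy from the LM
    ]

feats = extract_features(
    "handling fee", "handing fee",
    ["shou", "xu", "fei"], ["shou", "xu", "fei"],  # identical pinyin syllables
    selection_log={"handling fee": 42},
    lm_score=lambda text: 0.9,                     # stub LM score
)
print(feats)
```

In the described scheme these vectors are fed to a trained logistic regression model, and the candidate with the top-ranked score is returned as the corrected text.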
According to another aspect of the embodiments of the present application, there is provided a text correction apparatus including:
the text acquisition module is used for acquiring a text to be corrected;
the error detection module is used for identifying the text to be corrected based on the first knowledge graph to obtain target wrong words in the text to be corrected, wherein the target wrong words are words in the text to be corrected that are irrelevant to its content, and the first knowledge graph is used for recording domain knowledge relevant to the content of the text to be corrected;
the candidate recall module is used for determining candidate replacement words for the target wrong words based on a second knowledge graph, wherein the second knowledge graph is used for recording domain knowledge about confusable text;
the candidate ranking module is used for performing feature ranking on candidate texts obtained by replacing the target wrong words with the candidate replacement words;
and the error correction module is used for determining the top-ranked candidate text as the corrected text.
According to another aspect of the embodiments of the present application, there is provided an electronic device, including a memory, a processor, a communication interface, and a communication bus, where the memory stores a computer program executable on the processor, and the memory and the processor communicate with each other through the communication bus and the communication interface, and the processor implements the steps of the method when executing the computer program.
According to another aspect of embodiments of the present application, there is also provided a computer readable medium having non-volatile program code executable by a processor, the program code causing the processor to perform the above-mentioned method.
The scheme can be applied to natural language processing with deep learning. Compared with the related art, the technical scheme provided by the embodiments of the application has the following advantages:
The technical scheme of the application is to acquire a text to be corrected; identify the text to be corrected based on a first knowledge graph to obtain target wrong words, wherein the target wrong words are words in the text to be corrected that are irrelevant to its content, and the first knowledge graph is used for recording domain knowledge relevant to the content of the text to be corrected; determine candidate replacement words for the target wrong words based on a second knowledge graph, wherein the second knowledge graph is used for recording domain knowledge about confusable text; perform feature ranking on the candidate texts obtained by replacing the target wrong words with the candidate replacement words; and determine the top-ranked candidate text as the corrected text. The application performs text error correction by combining knowledge graphs in the stages of error detection, candidate recall and candidate ranking, and increases the recall of candidate words by mining the knowledge graphs to obtain a domain-knowledge dictionary and a confusion data set. The text error correction result is therefore more accurate, solving the technical problem of low text error correction accuracy.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present application and together with the description, serve to explain the principles of the application.
In order to more clearly illustrate the technical solutions in the embodiments of the present application or in the related art, the drawings needed in the description of the embodiments are briefly introduced below; other drawings can be obtained from them by those skilled in the art without creative effort.
FIG. 1 is a schematic diagram of the hardware environment of an alternative text error correction method according to an embodiment of the present application;
FIG. 2 is a flow chart of an alternative text correction method provided in accordance with an embodiment of the present application;
FIG. 3 is a schematic diagram of an alternative error detection scheme provided in accordance with an embodiment of the present application;
FIG. 4 is a schematic diagram of an alternative candidate recall provided in accordance with an embodiment of the present application;
FIG. 5 is a schematic diagram of an alternative candidate ranking provided in accordance with an embodiment of the present application;
FIG. 6 is a block diagram of an alternative text correction apparatus according to an embodiment of the present application;
fig. 7 is a schematic structural diagram of an alternative electronic device according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some embodiments of the present application, but not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
In the following description, suffixes such as "module", "component" or "unit" used to denote elements are adopted only for convenience of description and have no specific meaning in themselves. Thus, "module" and "component" may be used interchangeably.
In the related art, text is corrected mainly based on pinyin: the pinyin of all entries in local dictionaries is placed in a data structure, the text to be corrected is annotated with pinyin, the edit distance between its pinyin and the pinyin of each dictionary entry is calculated, entries within a set distance threshold are taken as candidate texts, and the candidates are scored by a fixed rule to select the highest-scoring one. For longer texts, a character-level edit distance between the input text and the existing dictionary entries is computed, and the highest-scoring text is selected by a ranking rule. Such pinyin- or character-edit-distance methods depend heavily on the dictionary: once the user inputs text that is not in the dictionary, the error correction accuracy drops sharply and the user experience suffers.
In order to solve the problems mentioned in the background, according to an aspect of embodiments of the present application, an embodiment of a text error correction method is provided.
Optionally, in the embodiment of the present application, the text error correction method described above may be applied to a hardware environment formed by the terminal 101 and the server 103 shown in fig. 1. As shown in fig. 1, the server 103 is connected to the terminal 101 through a network and may provide services for the terminal or for a client installed on it; a database 105 may be provided on the server or separately from it to provide data storage services for the server 103. The network includes, but is not limited to, a wide area network, a metropolitan area network or a local area network, and the terminal 101 includes, but is not limited to, a PC, a mobile phone, a tablet computer and the like.
A text error correction method in the embodiment of the present application may be executed by the server 103, or may be executed by both the server 103 and the terminal 101, as shown in fig. 2, the method may include the following steps:
step S202, acquiring a text to be corrected.
In the embodiment of the application, the text error correction method can be deployed in technical scenarios such as question answering systems and search engines. For example, a user types a query into a search box and clicks search; the search result is based on the content the user entered. If, because of the input method, manual typos, omitted characters or extra characters, the text finally used for searching is inconsistent with what the user really wants to find, the search result deviates from the user's intent, forcing the user to re-enter and correct the query. With the text error correction method deployed, after the user inputs a piece of text the system can determine whether typos, extra characters or omissions are present and, if correction is needed, apply the method provided by the embodiments of the application, so that a semantically more accurate text that better matches the user's intent is used for the search. The retrieved content then better meets the user's needs, and the user experience is improved.
In the embodiment of the present application, the text error correction method may also be deployed in an input method: while a user types a passage through the input method, the system corrects the input text in real time and can offer several corrected versions for the user to choose the one that best matches the intent.
In the embodiment of the application, the text entered by the user in a scenario such as a search box can be taken as the text to be corrected, which can be obtained simply by reading the content of the search box.
Step S204, identifying the text to be corrected based on the first knowledge graph to obtain target wrong words in the text to be corrected, wherein the target wrong words are words in the text to be corrected that are irrelevant to its content, and the first knowledge graph is used for recording domain knowledge relevant to the content of the text to be corrected.
In the embodiment of the present application, the first knowledge graph may contain professional knowledge of a specific domain or general-domain knowledge. If the text to be corrected mentions neural networks, a knowledge graph about neural networks is invoked; if it mentions table tennis, a knowledge graph about table tennis is invoked.
The knowledge graph records vast and comprehensive domain knowledge of the corresponding field. Based on the first knowledge graph, content that does not match and is irrelevant to this domain knowledge can be found in the text to be corrected, thereby determining the target wrong words.
Step S206, determining candidate replacement words for the target wrong words based on a second knowledge graph, wherein the second knowledge graph is used for recording domain knowledge about confusable text.
In the embodiment of the present application, the second knowledge graph records confusable words, such as homophones and visually similar characters. The second knowledge graph is used to find replacement combinations for the target wrong words, namely the candidate replacement words.
Step S208, performing feature ranking on the candidate texts obtained by replacing the target wrong words with the candidate replacement words.
In the embodiment of the application, one candidate text is obtained each time a candidate word replaces the target wrong word. After all candidate words have been substituted, all candidate texts are feature-ranked, so that the semantic reasonableness and accuracy of each candidate text, as well as its conformity with the domain knowledge, can be judged.
Step S210, determining the top-ranked candidate text as the corrected text.
In the embodiment of the application, the best text obtained after ranking is determined to be the corrected text; it is the text that the model judges most reasonable, semantically most accurate, and most consistent with the relevant domain knowledge.
Through the above steps S202 to S210, text error correction is performed by combining knowledge graphs in the stages of error detection, candidate recall and candidate ranking, and the recall of candidate words is increased by mining the knowledge graphs to obtain a domain-knowledge dictionary and a confusion data set. The text error correction result is therefore more accurate, solving the technical problem of low text error correction accuracy.
Optionally, in step S204, identifying the text to be corrected based on the first knowledge graph to obtain the target wrong words in the text to be corrected includes:
step 1, extracting a target dictionary from the first knowledge graph, and preprocessing a text to be corrected, wherein the content of the target dictionary comprises target attribute values of corresponding entities in the first knowledge graph, and the preprocessing comprises at least one of format conversion, format standardization and character filtering.
In the embodiment of the application, the target dictionary is extracted from the knowledge graph database, and the attributes to be acquired for each entity in the knowledge graph are determined first. As an alternative embodiment, the application may obtain the primary and secondary title attributes of each entity (for example, for a company-employee entity, the primary title attribute is the employee's name and the secondary title attribute is the employee's alias, so both are stored in the error correction dictionary), and then obtain all values of these attributes for each entity. After the attributes are determined, the attribute values can be obtained by reading the knowledge graph from a distributed file system, or by calling the knowledge graph's interface with Gremlin. Gremlin is the most popular query language for graph databases and is the graph language specified under the Apache TinkerPop framework. In Gremlin, vertices are queried with the step V(), edges with the step E(), vertex (entity) and edge ids are obtained with the step id(), labels with the step label(), and attributes with the step properties().
In the embodiment of the application, the target dictionary can also be built from historical search logs: the user's historical search logs are analyzed, the query text used in each search is obtained, and the historical queries are stored in the dictionary.
In the implementation of the application, format conversion, format standardization and character filtering can remove meaningless characters, symbols, spaces and the like from the text to be corrected. For example, full-width characters in the input sentence are converted to half-width, and useless spaces, punctuation, emoticons and the like are removed; each format of the text to be corrected can also be standardized to remove redundant information.
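The preprocessing described here can be sketched with the standard library alone: NFKC normalization handles the full-width-to-half-width conversion, and a character filter drops spaces and punctuation (the exact filter set is an assumption).

```python
import re
import unicodedata

def preprocess(text: str) -> str:
    """Normalize a text before error detection: convert full-width
    characters to half-width (NFKC compatibility folding), then drop
    spaces, punctuation and other non-word symbols."""
    text = unicodedata.normalize("NFKC", text)  # full-width -> half-width
    # \w is Unicode-aware in Python 3, so CJK characters are kept; the
    # explicit CJK range is redundant insurance.
    return re.sub(r"[^\w\u4e00-\u9fff]", "", text)

print(preprocess("Ｈｅｌｌｏ， 世界！"))
```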
Step 2, splitting the preprocessed text to be corrected into a plurality of character combinations by using the target dictionary, and matching each character combination against the data in the target dictionary, wherein each character is assigned to at least one character combination and each character combination contains at least one character.
In the embodiment of the application, the target dictionary can be used to segment the text to be corrected into words. Because segmentation follows the dictionary, a single character left over after segmentation is considered unable to form a word with its context and is treated as a possible wrong character. Each such possible wrong character can be combined with surrounding characters; if the combined word appears in the dictionary, that position is not counted as an error. In this way, possible wrong characters and words in the text to be corrected can be found.
In the embodiment of the application, the HanLP tool can be used for word segmentation. HanLP is a Java toolkit consisting of a series of models and algorithms, aimed at promoting the application of natural language processing in production environments. HanLP is characterized by complete functions, high performance, a clear architecture, up-to-date corpora and customizability. While providing rich functionality, HanLP keeps its internal modules loosely coupled, loads models lazily, provides services statically and publishes dictionaries in plain text, which makes it very convenient to use; it also ships corpus-processing tools to help users train on their own corpora.
In the embodiment of the application, it should be noted that a user-defined domain dictionary needs to be introduced for the corresponding domain, with the dictionary data being entity attribute values provided by the knowledge graph, so that the tokenizer can recognize domain proper nouns and segmentation accuracy is improved. For example, the erroneous phrase 信用卡提现手须费 ("credit card withdrawal handling fee", with a typo in 手续费) is segmented into 信用卡/提现/手/须/费, and the leftover single characters 手 ("hand"), 须 ("beard", the typo) and 费 ("fee") become potential words to be corrected.
Step 3, determining character combinations that do not match the data in the target dictionary as target wrong words.
In the embodiment of the application, the single characters 手/须/费 ("hand"/"beard"/"fee") match nothing in the domain knowledge of the credit-card knowledge graph and cannot form reasonable words with the characters before and after them, so they can be determined as target wrong words.
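The segmentation-and-matching detection in this example can be sketched as forward maximum matching over the domain dictionary; the three-word lexicon mirrors the credit-card example above and is an assumption about the underlying Chinese phrase.

```python
def segment(text: str, dictionary: set):
    """Greedy forward maximum matching against the domain dictionary.
    Characters that cannot join any dictionary word fall out as
    single-character segments."""
    words, i = [], 0
    max_len = max(map(len, dictionary))
    while i < len(text):
        for l in range(min(max_len, len(text) - i), 0, -1):
            if l == 1 or text[i:i + l] in dictionary:
                words.append(text[i:i + l])
                i += l
                break
    return words

def suspect_words(segments, dictionary: set):
    """Segments absent from the dictionary are potential wrong words."""
    return [s for s in segments if s not in dictionary]

lexicon = {"信用卡", "提现", "手续费"}
segs = segment("信用卡提现手须费", lexicon)
print(segs, suspect_words(segs, lexicon))
```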
Some professional-domain terms are homophones or near-homophones written with different characters (for example, pairs of terms that share the pinyin mingcha but differ in characters). Potential wrong words of this kind cannot be found by word segmentation alone, so such professional vocabulary needs to be error-checked based on the knowledge graph.
Optionally, identifying the text to be corrected based on the first knowledge graph to obtain the target wrong words in the text to be corrected further includes:
step 1, creating a dictionary tree by using a first knowledge graph, wherein the dictionary tree is used for storing a mapping relation between pinyin of domain knowledge related to the content of a text to be corrected and an entity, and each node in the dictionary tree corresponds to one pinyin character in the stored pinyin;
step 2, converting the character combination into a pinyin combination, and determining a target pinyin chain with the minimum editing distance with the pinyin combination in a dictionary tree;
step 3, determining a preset character combination of an entity corresponding to the target pinyin chain according to the mapping relation between the pinyin and the entity;
and 4, determining the character combination as a target wrong word under the condition that the character combination is inconsistent with the preset character combination.
In the implementation of the application, wrong words in professional-domain vocabulary such as the homophone pairs above are mostly near-phonetic errors. Therefore, in the embodiment of the application, a dictionary tree is created from the first knowledge graph; the dictionary tree stores the mapping from the pinyin of domain knowledge related to the content of the text to be corrected to the corresponding entity, and each node in the tree stores one pinyin character. This establishes a mapping dictionary from professional-domain pinyin to entities and supports the correction path from wrong word, to pinyin, to entity. When creating the dictionary tree for the pinyin "mingcha", for example, "m" is used as the root node and child nodes "i", "n", "g", "c", "h" and "a" are created in turn, giving the pinyin chain m-i-n-g-c-h-a; the entity word corresponding to this chain is preset to the correct word rather than its homophone, thereby establishing the mapping from m-i-n-g-c-h-a to that entity.
In the embodiment of the application, take the case where the character combination in the text to be corrected is the wrong homophone and the preset character combination in the dictionary tree is the correct word. The combination is converted into the pinyin combination "mingcha", the target pinyin chain m-i-n-g-c-h-a is found in the dictionary tree, and the preset character combination is determined from the preset pinyin-to-entity mapping. Since the combination in the text is inconsistent with the preset combination, it is a target wrong word.
In the embodiment of the application, when a pinyin combination has no exact match in the dictionary tree, the stored pinyin chain with the smallest pinyin edit distance can be taken as the target pinyin chain, which solves the problem that a mistyped pinyin segment cannot be matched. For example, a wrong word whose pinyin is "mincha" has no exactly matching chain in the dictionary tree, but "mingcha" matches at a pinyin edit distance of 1, so "mingcha" can be determined as the target pinyin chain for text error correction. In computing this edit distance, one pinyin character "g" must be inserted into "mincha" to obtain "mingcha", so the distance is 1.
In the embodiment of the application, the dictionary tree can be realized by adopting a Trie tree.
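Steps 1-3 above can be sketched as a small pinyin Trie with an edit-distance fallback. This is a minimal illustration: the class layout, the placeholder entity string, and the fallback that scans all stored chains are assumptions, not the patent's implementation:

```python
def edit_distance(a, b):
    """Classic Levenshtein distance (single-row dynamic programme)."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1,
                                     prev + (ca != cb))
    return dp[len(b)]

class PinyinTrie:
    """Trie mapping pinyin chains to preset entity words (Step 1)."""
    def __init__(self):
        self.root = {}
        self.words = {}          # pinyin chain -> preset entity word

    def insert(self, pinyin, entity):
        node = self.root
        for ch in pinyin:        # one node per pinyin character
            node = node.setdefault(ch, {})
        self.words[pinyin] = entity

    def lookup(self, pinyin):
        """Exact match, else the chain at minimum edit distance
        (Steps 2-3): mistyped pinyin still recovers an entity."""
        if pinyin in self.words:
            return self.words[pinyin]
        best = min(self.words, key=lambda w: edit_distance(w, pinyin))
        return self.words[best]

trie = PinyinTrie()
trie.insert("mingcha", "mingcha-entity")   # hypothetical entity word
# "mincha" has no exact chain, but m-i-n-g-c-h-a is at distance 1
# (insert the missing "g"), so the preset entity is still found.
entity = trie.lookup("mincha")
```

The recovered preset character combination is then compared with the combination in the text; a mismatch marks a target wrong word, as in Step 4.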
For words that are legal in isolation but have little or no relation to the text content, it is difficult to judge them as wrong words through word segmentation, so the technical scheme of the application also provides error detection based on the BERT model.
Optionally, determining the target wrong word in the text to be corrected further includes:
step 1, sequentially covering each character combination in a text to be corrected according to the character sequence of the text to be corrected to obtain a plurality of covered text sequences;
step 2, inputting the covering text sequence into a mask language model, and acquiring a predicted character output by the mask language model and a corresponding predicted probability, wherein the predicted character is a character which is obtained by the mask language model according to the context semantic recognition of the covered character and fills a covering area, and the predicted probability is the probability of the predicted character filling the covering area;
and 3, determining the covered character as a target wrong word under the condition that the covered character combination is inconsistent with the corresponding predicted character.
In the embodiment of the application, the mask language model is a BERT model, which converts the error detection process into a character-level cloze (fill-in-the-blank) problem: during detection, the character to fill each blank is predicted one by one, and whether the character at the current position is wrong is judged from the predicted probability distribution.
According to the method, each character in the text sentence is masked in turn following the character order of the text to be corrected. If the text to be corrected is glossed {I, is, China, kernel} (a character-level rendering of a Chinese sentence whose last character is a homophone typo), masking yields a plurality of covered text sequences such as {mask, is, China, kernel}, {I, mask, China, kernel}, and so on, so that the most suitable character or word at the current position is predicted from the context of the current character. If the vocabulary size is 10000, each position in the sentence amounts to a 10000-way classification, i.e. the probability of filling each of the 10000 words is calculated. As an alternative, a fault-tolerance threshold k of 10 may be set: the 10 words with the highest probability are selected from the 10000-way classification result, and if the original word appears in the top 10 predictions, the position is considered not to be a wrongly written character (word); otherwise, it is. Further, if the position is determined to be a target wrong word, the word with the highest probability among the top 10 is taken as the candidate replacement word.
It should be noted that a deep learning model can learn context semantics, and the BERT model learns the context semantics of a sentence especially well thanks to its pre-training. BERT can detect knowledge-relevance errors in a sentence: it predicts the probability of each word at each position of the sentence to be corrected, and if the probability of the word actually at a position is too small, the word is judged to be a potential error, which facilitates subsequent correction. The approach depends little on dictionaries and can handle cases where the corrected result is not in the dictionary. For example, in the above "credit card/embody/hand/beard/fee", word segmentation cannot determine "embody" to be a wrong word, but the mask language model can recognize that the word is unrelated to its context and therefore wrong.
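The top-k decision described above can be sketched independently of any particular BERT checkpoint. The toy probability values below stand in for a real masked-LM softmax over the vocabulary and are purely hypothetical:

```python
def check_position(original_char, predictions, k=10):
    """predictions: list of (word, probability) pairs for one masked
    position, e.g. the masked LM's softmax over the vocabulary.
    The position is a wrong word iff the original word is absent from
    the top-k predictions; when wrong, the top-1 prediction becomes
    the candidate replacement word."""
    top_k = [w for w, _ in sorted(predictions,
                                  key=lambda p: p[1], reverse=True)[:k]]
    if original_char in top_k:
        return False, None          # not a wrong word
    return True, top_k[0]           # wrong word + candidate replacement

# Toy distribution standing in for a BERT softmax (hypothetical values,
# English glosses in place of Chinese characters).
preds = [("withdraw", 0.62), ("repay", 0.21), ("embody", 0.001),
         ("apply", 0.09), ("cancel", 0.05)]
wrong, cand = check_position("embody", preds, k=2)
```

With k = 2 the top predictions are "withdraw" and "repay"; the original "embody" is absent, so the position is flagged and "withdraw" is recalled as the candidate, mirroring the example in the text.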
Optionally, as shown in fig. 3, after the input sentence is preprocessed, the error detection results based on word segmentation, the professional-domain vocabulary error detection results based on the dictionary tree (i.e., product proper-name detection), and the error detection results based on the mask language model (the BERT model) may be superimposed to obtain all target wrong words in the text to be corrected.
Optionally, the step S206 of determining candidate replacement words of the target wrong word based on the second knowledge graph includes:
step 1, extracting a confusion word data set from a second knowledge graph;
and 2, determining the words in the confusion word data set, wherein the similarity between the words and the target wrong words is greater than or equal to a similarity threshold value, as candidate replacement words.
In an embodiment of the present application, the second knowledge graph is a domain knowledge graph of confusable words. Taking a Chinese character pronounced qing2 (glossed "emotion") as an example, its confusing words may include:
Same pinyin, same tone: other characters pronounced qing2;
Same pinyin, different tone: characters pronounced qing1, qing3 or qing4;
Near-sound characters: e.g., characters pronounced qin2;
Front/back nasal confusion: characters pronounced jin1, jin3 or jin4 (easily confused with jing1, jing3 or jing4), and characters pronounced qin1, qin3 or qin4 (easily confused with qing);
They may also include characters similar in shape: characters whose glyphs resemble the target character (glossed in the example as "please", "debt", "sad", "tights", etc.).
In addition, legal words can also be confusing words for one another, such as two words both pronounced ren2-yuan2 (one glossed "personnel", the other "personal relations"); for the character glossed "language" (yu), confusing words include those glossed "linguistics", "preview", "oral", "fish eye", etc.
In the embodiment of the present application, the confusion word data may also use a public confusion word set.
In the embodiment of the application, all confusion words of the target wrong word can be looked up directly in the confusion word data set as candidate replacement words; alternatively, the similarity between the target wrong word and each word in the confusion word data set can first be calculated to find the closest word, and that word together with its confusion words is selected as the candidate replacement words. The similarity can be calculated as the Euclidean distance, Pearson correlation coefficient, cosine distance, generalized Jaccard coefficient, etc. between words.
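The threshold-based recall of Step 2 can be sketched with one of the measures listed above, a Jaccard similarity over character sets. The threshold value and the romanized stand-in words are assumptions; an embedding-based cosine or Euclidean similarity would slot in the same way:

```python
def jaccard_sim(a, b):
    """Jaccard similarity on the character sets of two words."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb)

def recall_candidates(wrong_word, confusion_set, threshold=0.6):
    """Keep confusion-set words whose similarity to the target wrong
    word reaches the threshold (Step 2 of the candidate recall)."""
    return [w for w in confusion_set
            if jaccard_sim(wrong_word, w) >= threshold]

# Romanized stand-ins for a confusion word data set (hypothetical).
confusion = ["mingcha", "mingzha", "pingcha", "kaocha"]
cands = recall_candidates("mincha", confusion)
```

Here "mingcha" scores 6/7 against "mincha" and is recalled, while the dissimilar "kaocha" falls below the threshold and is dropped.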
In the embodiment of the present application, as shown in fig. 4, take "credit card/embody/hand/beard/fee" as an example: "credit card" is already an accurate word; "embody" is a legal word, but its semantics and context do not fit "credit card", so it is determined to be a wrong word, and its candidate replacement word obtained from the confusion word data set is "withdraw". Word segmentation has determined that "hand/beard/fee" is not a legal word group, so from the confusion word data set the candidate replacement words of "hand" are "head" and the like, the candidates of "beard" are "need", "continuation" and the like, and the candidates of "fee" are "fly" and the like.
Optionally, determining the candidate replacement word of the target wrong word further comprises at least one of the following ways:
determining a preset character combination as a candidate replacement word;
and determining the predicted character as a candidate replacement word when the prediction probability of the predicted character is greater than or equal to the probability threshold.
In the embodiment of the application, when the dictionary tree is used for error detection, the preset character combination corresponding to the target pinyin chain can be used as a candidate replacement word. When the error detection is performed using the mask language model, the predicted character may be used as a candidate replacement word in a case where the prediction probability of the predicted character is greater than or equal to the probability threshold.
Optionally, the performing feature ranking on the candidate text obtained by replacing the target wrong word with the candidate replacement word includes:
inputting the candidate text into a logistic regression model, and obtaining a text sorting result output by the logistic regression model, wherein the logistic regression model is used for extracting text characteristics and performing characteristic sorting, and the text sorting result is a similarity sorting result of the text characteristics of the candidate text and the content characteristics of the text to be corrected;
the text features include at least one of: selecting frequency of candidate texts; editing distance between the candidate text and the text to be corrected; the Jacard distance between the pinyin of the candidate text and the pinyin of the text to be corrected; semantic accuracy of the candidate text determined by the multivariate language model.
In the embodiment of the present application, as shown in fig. 4, the candidate texts include "credit card represents charge for…" and other replacement results. The logistic regression model performs regression analysis between independent and dependent variables, i.e., it describes the relationship between the independent variable X and the dependent variable Y, or the degree to which X influences Y, and predicts Y. The dependent variable is the result we want to obtain; the independent variables are the potential factors that affect the result, and there may be one or more of them. Here, the independent variables are the text features of the candidate texts, and the dependent variable is the ranking result of the candidate texts.
In the embodiment of the present application, as shown in fig. 5, each feature may be scored and the feature scores combined for ranking, so that the single candidate text with the highest score can be selected as the final, error-corrected text. Among the text features: the higher the selection frequency of a candidate text, the higher the score; the smaller the edit distance between a candidate text and the text to be corrected, the higher the score; the smaller the Jaccard distance between the pinyin of a candidate text and the pinyin of the text to be corrected, the higher the score; and the higher the semantic accuracy of a candidate text as determined by the multivariate language model, the higher the score. A multivariate language model (n-gram) can evaluate whether a sentence is reasonable.
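The per-feature scoring above can be sketched with the two distance features; the selection-frequency and n-gram features are omitted for brevity, and the weights and English stand-in texts are assumptions rather than the patent's configuration:

```python
def edit_distance(a, b):
    """Levenshtein distance (single-row dynamic programme)."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1,
                                     prev + (ca != cb))
    return dp[len(b)]

def jaccard_distance(a, b):
    """1 - Jaccard similarity over character sets (pinyin stand-in)."""
    sa, sb = set(a), set(b)
    return 1 - len(sa & sb) / len(sa | sb)

def rank_candidates(candidates, original, w_ed=1.0, w_jc=1.0):
    """Score each candidate: smaller edit/Jaccard distance to the text
    to be corrected -> higher score. Returns candidates best-first;
    the top one is taken as the corrected text."""
    def score(c):
        return -(w_ed * edit_distance(c, original)
                 + w_jc * jaccard_distance(c, original))
    return sorted(candidates, key=score, reverse=True)

# English stand-ins: the text to be corrected contains the typo
# "witdraw"; the candidates come from the recall stage (hypothetical).
to_correct = "credit card witdraw service fee"
cands = ["credit card withdraw service fee",
         "credit card window service fee",
         "credit card withheld service fee"]
best = rank_candidates(cands, to_correct)[0]
```

The candidate one edit away from the input wins, which matches the intuition that the corrected text should stay close to what was typed; in the full scheme the frequency and language-model features would break ties that surface distances cannot.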
According to still another aspect of an embodiment of the present application, as shown in fig. 6, there is provided a text correction apparatus including:
a text obtaining module 601, configured to obtain a text to be corrected;
the error detection module 603 is configured to identify the text to be corrected based on a first knowledge graph, to obtain a target error word in the text to be corrected, where the target error word is a word in the text to be corrected that is irrelevant to the content of the text to be corrected, and the first knowledge graph is used to record domain knowledge relevant to the content of the text to be corrected;
a candidate recall module 605, configured to determine candidate replacement words of the target wrong word based on a second knowledge graph, where the second knowledge graph is used to record domain knowledge of the confusable text;
a candidate ranking module 607, configured to perform feature ranking on candidate texts obtained by replacing target wrong words with candidate replacement words;
and the error correction module 609 is used to determine the text ranked best in the feature ranking as the corrected text.
It should be noted that the text obtaining module 601 in this embodiment may be configured to execute the step S202 in this embodiment, the error detecting module 603 in this embodiment may be configured to execute the step S204 in this embodiment, the candidate recall module 605 in this embodiment may be configured to execute the step S206 in this embodiment, the candidate ranking module 607 in this embodiment may be configured to execute the step S208 in this embodiment, and the error correcting module 609 in this embodiment may be configured to execute the step S210 in this embodiment.
It should be noted here that the modules described above are the same as the examples and application scenarios implemented by the corresponding steps, but are not limited to the disclosure of the above embodiments. It should be noted that the modules described above as a part of the apparatus may operate in a hardware environment as shown in fig. 1, and may be implemented by software or hardware.
Optionally, the error detection module is specifically configured to:
extracting a target dictionary from the first knowledge graph, and preprocessing a text to be corrected, wherein the content of the target dictionary comprises target attribute values of corresponding entities in the first knowledge graph, and the preprocessing comprises at least one of format conversion, format standardization and character filtering;
dividing the preprocessed text to be corrected into a plurality of character combinations by using a target dictionary, and matching each character combination with data in the target dictionary, wherein each character is at least distributed to one character combination, and each character combination at least comprises one character;
and determining character combinations which do not match with the data in the target dictionary as target wrong words.
Optionally, the error detection module is further configured to:
creating a dictionary tree by using the first knowledge graph, wherein the dictionary tree is used for storing the mapping relation of the pinyin of the domain knowledge related to the content of the text to be corrected to an entity, and each node in the dictionary tree corresponds to one pinyin character in the stored pinyin;
converting the character combination into a pinyin combination, and determining a target pinyin chain with the minimum editing distance to the pinyin combination in a dictionary tree;
determining a preset character combination of an entity corresponding to the target pinyin chain according to the mapping relation from pinyin to the entity;
and under the condition that the character combination is inconsistent with the preset character combination, determining the character combination as a target wrong word.
Optionally, the error detection module is further configured to:
sequentially covering each character combination in the text to be corrected according to the character sequence of the text to be corrected to obtain a plurality of covered text sequences;
inputting the covering text sequence into a mask language model, and acquiring a predicted character output by the mask language model and a corresponding predicted probability, wherein the predicted character is a character which is obtained by the mask language model according to the context semantic recognition of the covered character and fills in a covering area, and the predicted probability is the probability of the predicted character filling in the covering area;
and determining the covered character as a target wrong word under the condition that the covered character combination is inconsistent with the corresponding predicted character.
Optionally, the candidate recall module is specifically configured to:
extracting a confusion word data set from the second knowledge graph;
and determining words in the confusion word data set whose similarity to the target wrong word is greater than or equal to the similarity threshold as candidate replacement words.
Optionally, the candidate recall module is further configured to:
determining a preset character combination as a candidate replacement word;
and determining the predicted character as a candidate replacement word when the prediction probability of the predicted character is greater than or equal to the probability threshold.
Optionally, the candidate ranking module is specifically configured to:
inputting the candidate text into a logistic regression model, and obtaining a text sorting result output by the logistic regression model, wherein the logistic regression model is used for extracting text characteristics and performing characteristic sorting, and the text sorting result is a similarity sorting result of the text characteristics of the candidate text and the content characteristics of the text to be corrected;
the text features include at least one of:
selecting frequency of candidate texts;
editing distance between the candidate text and the text to be corrected;
the Jaccard distance between the pinyin of the candidate text and the pinyin of the text to be corrected;
semantic accuracy of the candidate text determined by the multivariate language model.
According to another aspect of the embodiments of the present application, an electronic device is provided, as shown in fig. 7, and includes a memory 701, a processor 703, a communication interface 705, and a communication bus 707, where the memory 701 stores a computer program that is executable on the processor 703, the memory 701 and the processor 703 communicate with each other through the communication interface 705 and the communication bus 707, and the processor 703 implements the steps of the method when executing the computer program.
The memory and the processor in the electronic equipment are communicated with the communication interface through a communication bus. The communication bus may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The communication bus may be divided into an address bus, a data bus, a control bus, etc.
The memory may include a Random Access Memory (RAM) or a non-volatile memory, such as at least one disk memory. Optionally, the memory may also be at least one storage device located remotely from the processor.
The processor may be a general-purpose processor, including a Central Processing Unit (CPU), a Network Processor (NP), and the like; it may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
There is also provided, in accordance with yet another aspect of an embodiment of the present application, a computer-readable medium having non-volatile program code executable by a processor.
Optionally, in an embodiment of the present application, a computer readable medium is configured to store program code for the processor to perform the following steps:
acquiring a text to be corrected;
identifying a text to be corrected based on a first knowledge graph to obtain target wrong words in the text to be corrected, wherein the target wrong words are words in the text to be corrected which are irrelevant to the content of the text to be corrected, and the first knowledge graph is used for recording domain knowledge relevant to the content of the text to be corrected;
determining candidate replacement words of the target wrong words based on a second knowledge graph, wherein the second knowledge graph is used for recording the domain knowledge of the confusable text;
performing feature sorting on candidate texts obtained by replacing target wrong words with the candidate replacement words;
and determining the text ranked best in the feature ranking as the corrected text.
Optionally, the specific examples in this embodiment may refer to the examples described in the above embodiments, and this embodiment is not described herein again.
When the embodiments of the present application are specifically implemented, reference may be made to the above embodiments, and corresponding technical effects are achieved.
It is to be understood that the embodiments described herein may be implemented in hardware, software, firmware, middleware, microcode, or any combination thereof. For a hardware implementation, the Processing units may be implemented within one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), general purpose processors, controllers, micro-controllers, microprocessors, other electronic units configured to perform the functions described herein, or a combination thereof.
For a software implementation, the techniques described herein may be implemented by means of units performing the functions described herein. The software codes may be stored in a memory and executed by a processor. The memory may be implemented within the processor or external to the processor.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the modules is merely a logical division, and in actual implementation, there may be other divisions, for example, multiple modules or components may be combined or integrated into another system, or some features may be omitted, or not implemented. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solutions of the embodiments of the present application may be essentially implemented, or the part contributing to the prior art may be implemented, in the form of a software product stored in a storage medium and including several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a ROM, a RAM, a magnetic disk, or an optical disk.

It is noted that, in this document, relational terms such as "first" and "second" may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a … " does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
The above description is merely exemplary of the present application and is presented to enable those skilled in the art to understand and practice the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (10)

1. A text error correction method, comprising:
acquiring a text to be corrected;
identifying the text to be corrected based on a first knowledge graph to obtain target wrong words in the text to be corrected, wherein the target wrong words are words in the text to be corrected, which are irrelevant to the content of the text to be corrected, and the first knowledge graph is used for recording domain knowledge relevant to the content of the text to be corrected;
determining candidate replacement words of the target wrong words based on a second knowledge graph, wherein the second knowledge graph is used for recording domain knowledge of confusable texts;
performing feature sorting on candidate texts obtained by replacing the target wrong words with the candidate replacement words;
and determining the text ranked best in the feature ranking as the corrected text.
2. The method according to claim 1, wherein the identifying the text to be corrected based on the first knowledge graph to obtain the target wrong words in the text to be corrected comprises:
extracting a target dictionary from the first knowledge graph, and preprocessing the text to be corrected, wherein the content of the target dictionary comprises target attribute values of corresponding entities in the first knowledge graph, and the preprocessing comprises at least one of format conversion, format standardization and character filtering;
dividing the preprocessed text to be corrected into a plurality of character combinations by using the target dictionary, and matching each character combination with data in the target dictionary, wherein each character is at least distributed to one character combination, and each character combination at least comprises one character;
determining the character combination which does not match with the data in the target dictionary as the target wrong word.
3. The method according to claim 2, wherein the identifying the text to be corrected based on the first knowledge graph to obtain the target wrong words in the text to be corrected further comprises:
creating a dictionary tree by using the first knowledge graph, wherein the dictionary tree is used for storing a mapping relation of pinyin of domain knowledge related to the content of the text to be corrected to an entity, and each node in the dictionary tree corresponds to one pinyin character in the stored pinyin;
converting the character combination into a pinyin combination, and determining a target pinyin chain with the minimum editing distance to the pinyin combination in the dictionary tree;
determining a preset character combination of an entity corresponding to the target pinyin chain according to a mapping relation from pinyin to the entity;
and under the condition that the character combination is inconsistent with the preset character combination, determining the character combination as the target wrong word.
4. The method of claim 3, wherein determining the target wrong word in the text to be corrected further comprises:
masking each character combination in the text to be corrected in turn, in the character order of the text to be corrected, to obtain a plurality of masked text sequences;
inputting each masked text sequence into a mask language model, and obtaining a predicted character and a corresponding prediction probability output by the mask language model, wherein the predicted character is the character the mask language model infers, from the context semantics around the masked character, to fill the masked region, and the prediction probability is the probability that the predicted character fills the masked region;
and determining the masked character as the target wrong word in the case that the masked character combination is inconsistent with the corresponding predicted character.
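The mask-and-predict loop of claim 4 can be sketched independently of any particular model by injecting the predictor as a function; in practice `predict_fill` would wrap a BERT-style fill-mask model, but here it is an assumed interface, and the `[MASK]` token and probability threshold are illustrative.

```python
def detect_with_mlm(text, predict_fill, prob_threshold=0.9):
    """Mask each character in turn and compare with the model's prediction.

    `predict_fill(masked_text)` must return (predicted_char, probability)
    for the masked slot. A confident disagreement between the prediction
    and the original character flags a target wrong word.
    """
    wrong = []
    for i, ch in enumerate(text):
        masked = text[:i] + "[MASK]" + text[i + 1:]
        pred, prob = predict_fill(masked)
        if pred != ch and prob >= prob_threshold:
            wrong.append((i, ch, pred))  # (position, original, predicted)
    return wrong
```

Masking character combinations rather than single characters, as the claim allows, only changes the width of the masked span.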
5. The method of claim 4, wherein determining candidate replacement words for the target wrong word based on a second knowledge-graph comprises:
extracting a confusion word data set from the second knowledge-graph;
determining, as the candidate replacement words, words in the confusion word data set whose similarity to the target wrong word is greater than or equal to a similarity threshold.
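The threshold filter of claim 5 is a straightforward scan of the confusion set. The patent does not specify the similarity measure; the sketch below uses the standard library's `difflib.SequenceMatcher` ratio purely as a stand-in, and the threshold value is illustrative.

```python
from difflib import SequenceMatcher

def candidate_replacements(wrong_word, confusion_set, threshold=0.5):
    """Keep confusion-set entries at least `threshold`-similar to the
    target wrong word (SequenceMatcher stands in for the patent's
    unspecified similarity measure)."""
    return [w for w in confusion_set
            if SequenceMatcher(None, wrong_word, w).ratio() >= threshold]
```

A pinyin- or glyph-based similarity would slot in the same way, since only the comparison inside the filter changes.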
6. The method of claim 4 or 5, wherein determining the candidate replacement word for the target wrong word further comprises at least one of:
determining the preset character combination as the candidate replacement word;
determining the predicted character as the candidate replacement word if the prediction probability of the predicted character is greater than or equal to a probability threshold.
7. The method of claim 6, wherein feature ranking candidate texts obtained by replacing the target wrong word with the candidate replacement word comprises:
inputting the candidate text into a logistic regression model, and obtaining a text sorting result output by the logistic regression model, wherein the logistic regression model is used for extracting text features and performing feature sorting, and the text sorting result is a similarity sorting result of the text features of the candidate text and the content features of the text to be corrected;
the textual features include at least one of:
a selection frequency of the candidate text;
the edit distance between the candidate text and the text to be corrected;
the Jaccard distance between the pinyin of the candidate text and the pinyin of the text to be corrected;
a semantic accuracy of the candidate text determined by an n-gram language model.
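The feature ranking of claim 7 amounts to scoring each candidate text with a logistic function over such features and sorting by the score. A minimal sketch with illustrative assumptions: hand-set weights replace the trained logistic regression model, and a character-overlap distance stands in for the full edit-distance feature.

```python
import math

def jaccard_distance(a, b):
    """1 minus the Jaccard similarity of the two character sets."""
    sa, sb = set(a), set(b)
    if not (sa | sb):
        return 0.0
    return 1.0 - len(sa & sb) / len(sa | sb)

def logistic_score(features, weights, bias=0.0):
    """Logistic regression scoring; weights would come from training."""
    z = bias + sum(weights[k] * v for k, v in features.items())
    return 1.0 / (1.0 + math.exp(-z))

def rank_candidates(candidates, original):
    """Sort candidate texts by descending logistic score against the
    original text; distances are negated so that higher score = closer."""
    def score(cand):
        # positional mismatch count, a cheap stand-in for edit distance
        mismatch = sum(a != b for a, b in zip(cand, original))
        feats = {
            "edit_dist": -float(mismatch + abs(len(cand) - len(original))),
            "jaccard": -jaccard_distance(cand, original),
        }
        return logistic_score(feats, {"edit_dist": 1.0, "jaccard": 1.0})
    return sorted(candidates, key=score, reverse=True)
```

In the patented method the Jaccard distance is computed over pinyin rather than raw characters, and the frequency and n-gram features would be added to the same feature dict.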
8. A text correction apparatus, comprising:
the text acquisition module is used for acquiring a text to be corrected;
the error detection module is used for identifying the text to be corrected based on a first knowledge graph to obtain a target wrong word in the text to be corrected, wherein the target wrong word is a word in the text to be corrected that is unrelated to the content of the text to be corrected, and the first knowledge graph is used for recording domain knowledge related to the content of the text to be corrected;
the candidate recall module is used for determining candidate replacement words of the target wrong words based on a second knowledge graph, wherein the second knowledge graph is used for recording the domain knowledge of the confusable text;
the candidate sorting module is used for performing feature sorting on the candidate texts obtained by replacing the target wrong words with the candidate replacement words;
and the error correction module is used for determining the candidate text ranked best in the feature ranking as the corrected text.
9. An electronic device comprising a memory, a processor, a communication interface and a communication bus, wherein the memory stores a computer program operable on the processor, and the memory and the processor communicate via the communication bus and the communication interface, wherein the processor implements the steps of the method according to any of the claims 1 to 7 when executing the computer program.
10. A computer-readable medium having non-volatile program code executable by a processor, wherein the program code causes the processor to perform the method of any of claims 1 to 7.
CN202111265955.6A 2021-10-28 2021-10-28 Text error correction method, device, equipment and computer readable medium Pending CN114036930A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111265955.6A CN114036930A (en) 2021-10-28 2021-10-28 Text error correction method, device, equipment and computer readable medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111265955.6A CN114036930A (en) 2021-10-28 2021-10-28 Text error correction method, device, equipment and computer readable medium

Publications (1)

Publication Number Publication Date
CN114036930A true CN114036930A (en) 2022-02-11

Family

ID=80142299

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111265955.6A Pending CN114036930A (en) 2021-10-28 2021-10-28 Text error correction method, device, equipment and computer readable medium

Country Status (1)

Country Link
CN (1) CN114036930A (en)


Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114661688A (en) * 2022-03-25 2022-06-24 马上消费金融股份有限公司 Address error correction method and device
CN114661688B (en) * 2022-03-25 2023-09-19 马上消费金融股份有限公司 Address error correction method and device
WO2023184633A1 (en) * 2022-03-31 2023-10-05 上海蜜度信息技术有限公司 Chinese spelling error correction method and system, storage medium, and terminal
CN114818668A (en) * 2022-04-26 2022-07-29 北京中科智加科技有限公司 Method and device for correcting personal name of voice transcribed text and computer equipment
CN114818668B (en) * 2022-04-26 2023-09-15 北京中科智加科技有限公司 Name correction method and device for voice transcription text and computer equipment
CN115545006A (en) * 2022-10-10 2022-12-30 清华大学 Rule script generation method and device, computer equipment and medium
CN115545006B (en) * 2022-10-10 2024-02-13 清华大学 Rule script generation method, device, computer equipment and medium
CN115719059A (en) * 2022-11-29 2023-02-28 北京中科智加科技有限公司 Morse packet error correction method
CN115719059B (en) * 2022-11-29 2023-08-08 北京中科智加科技有限公司 Morse grouping error correction method
CN116306599A (en) * 2023-05-23 2023-06-23 上海蜜度信息技术有限公司 Faithfulness optimization method, system, equipment and storage medium based on generated text
CN116306599B (en) * 2023-05-23 2023-09-08 上海蜜度信息技术有限公司 Faithfulness optimization method, system, equipment and storage medium based on generated text


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination