CN112733529A - Text error correction method and device

Info

Publication number
CN112733529A
Authority
CN
China
Prior art keywords
text
feature
elements
features
error
Prior art date
Legal status
Granted
Application number
CN201911029376.4A
Other languages
Chinese (zh)
Other versions
CN112733529B (en)
Inventor
刘恒友
李辰
包祖贻
徐光伟
李林琳
Current Assignee
Alibaba Group Holding Ltd
Original Assignee
Alibaba Group Holding Ltd
Priority date
Filing date
Publication date
Application filed by Alibaba Group Holding Ltd
Priority to CN201911029376.4A
Publication of CN112733529A
Application granted
Publication of CN112733529B
Status: Active

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30: Information retrieval of unstructured textual data
    • G06F 16/33: Querying
    • G06F 16/3331: Query processing
    • G06F 16/3332: Query translation


Abstract

The invention discloses a text error correction method and a text error correction device. The method comprises: acquiring a plurality of elements contained in a text; determining feature data of at least one of the plurality of elements; generating a feature set of the text by taking the plurality of elements and the feature data as features; predicting an error condition of the text based on the feature set; and correcting the text based on the prediction result. This provides support for reducing the mis-correction rate and improving text error correction quality.

Description

Text error correction method and device
Technical Field
The present invention relates to the field of information retrieval technologies, and in particular, to a text error correction method and apparatus.
Background
With the popularization of the internet, information on the internet has become increasingly abundant, and in various search application scenarios people can conveniently obtain the information they need by entering query sentences (queries) in a search bar. In practice, however, users often enter incorrect queries when searching.
Currently, most search engines introduce an error correction mechanism: the query input by the user is checked and, if incorrect, rewritten into the correct query, so that search results matching the user's requirements can be returned. However, mis-correction often occurs, for example a correct query input by the user is "corrected" into an erroneous one. In that case the search results returned to the user may not meet the user's requirements, and the search experience is greatly affected.
Therefore, there is still a need for an improved error correction scheme that reduces the mis-correction rate and improves the user experience.
Disclosure of Invention
The purpose of the present disclosure is to provide a text error correction method and apparatus, so as to provide support for reducing the mis-correction rate and improving the user experience.
According to a first aspect of the present disclosure, there is provided a text error correction method including: acquiring a plurality of elements contained in the text; determining feature data for at least one of the plurality of elements; generating a feature set of the text by taking the plurality of elements and the feature data as features; predicting error conditions of the text based on the feature set; and correcting the text based on the prediction result.
Optionally, the elements comprise characters and/or words and/or bigrams (binary word segments).
Optionally, the feature data comprises: part-of-speech features of the characters and/or words; and/or inter-element association features.
Optionally, the inter-element association feature comprises at least one of: the position characteristics of the characters in the words; inter-element dependency characteristics; inter-element correlation characteristics.
Optionally, the feature data further comprises: a combination of two or more features of an element.
Optionally, the combined features comprise at least one of: a combination of an element's position feature within a word and the word's part-of-speech feature; a combination of an element's position feature within a word and/or bigram and that word and/or bigram; a combination of the elements' part-of-speech features and the inter-element correlation features.
Optionally, the step of obtaining a plurality of elements included in the text includes performing word segmentation on the text to obtain the elements; and/or the step of determining feature data of at least one of the plurality of elements includes at least one of: performing part-of-speech tagging on the text to obtain the part-of-speech features of the characters and/or words; performing dependency syntax analysis on the text to obtain the inter-element dependency relationship features; and acquiring the inter-element correlation features from a feature database.
Optionally, the step of generating the feature set of the text comprises: acquiring feature identifiers corresponding to the features from a feature database, wherein the feature database is obtained by processing a text data set, a plurality of features obtained based on the text data set and feature identifiers corresponding to the features are stored in the feature database in an associated manner, and the features comprise a plurality of elements extracted from the text data set and feature data of the elements; and generating the feature set based on the feature identification.
Optionally, the text data set comprises at least one of: a general domain data set; a vertical domain data set; a networked encyclopedia data set.
Optionally, the feature database further stores feature vectors corresponding to the plurality of features respectively in an associated manner, and the step of generating the feature set based on the feature identifier further includes: acquiring feature vectors corresponding to the plurality of features respectively based on the feature identifiers; and combining the obtained feature vectors to obtain the feature set.
Optionally, the feature vector is obtained by feature training the plurality of elements extracted from the text data set and feature data of the elements.
Optionally, based on the feature set, an error prediction model is used to predict the error condition of the text.
Optionally, the prediction result is a binary sequence, and each bit of the binary sequence represents the correctness of the corresponding character in the text.
Optionally, the step of predicting the error condition of the text includes: determining feature identifications corresponding to the plurality of features respectively; and taking the feature identification as an input of the error prediction model to predict the text.
Optionally, feature identifiers respectively corresponding to the plurality of features are obtained from a feature database.
Optionally, the text error correction method may further include: training the error prediction model.
Optionally, the step of training the error prediction model comprises: generating a feature set of a corpus, wherein the corpus is a text; and training the error prediction model based on the feature set of the corpus.
Optionally, the method further comprises: acquiring a labeling sequence corresponding to the corpus, wherein the labeling sequence represents the error condition of the corpus, and the error prediction model is trained based on the feature set of the corpus and the labeling sequence.
Optionally, a labeling sequence corresponding to the corpus is obtained based on the error correction parallel corpus data set.
Optionally, the tagging sequence is a binary sequence, and each bit of the binary sequence represents the correctness of the corresponding character in the corpus; and/or the annotation sequence comprises error labels and/or correct labels.
Optionally, the error prediction model is a BiLSTM-CRF model.
Optionally, the text comprises a query statement input by a user. Or, optionally, the text comprises articles of a predetermined author or a predetermined owner or a predetermined source.
Optionally, the plurality of elements include brand names and/or trade names, and the step of predicting the error condition of the text includes: predicting the error condition of the brand names and/or trade names in the text; and/or the step of correcting the text comprises: correcting the brand names and/or trade names in the text.
Optionally, the method may further include: maintaining a lexicon of new internet words, wherein the lexicon is consulted in the step of predicting the error condition of the text, so as to avoid identifying new internet words as errors.
Optionally, the method may further include: maintaining a knowledge base in which incorrect word segments and their corresponding correct word segments are recorded, the pairs being obtained based on the prediction result, wherein the knowledge base is consulted in the step of predicting the error condition of the text.
According to a second aspect of the present disclosure, there is provided a text error prediction method, including: acquiring a plurality of elements contained in the text; determining feature data for at least one of the plurality of elements; generating a feature set of the text by taking the plurality of elements and the feature data as features; and predicting the error condition of the text by using an error prediction model based on the feature set.
Optionally, the prediction result of the error condition of the text is a binary sequence, and each bit of the binary sequence represents the correctness of the character corresponding to the bit in the text.
Optionally, the step of predicting the error condition of the text includes: determining feature identifications corresponding to the plurality of features respectively; and taking the feature identification as an input of the error prediction model to predict the text.
Optionally, feature identifiers respectively corresponding to the plurality of features are obtained from a feature database.
Optionally, the error prediction model is a BiLSTM-CRF model.
Optionally, the text comprises a query statement input by a user. Or, optionally, the text comprises articles of a predetermined author or a predetermined owner or a predetermined source.
Optionally, the plurality of elements include brand names and/or trade names, and the step of predicting the error condition of the text includes: predicting the error condition of the brand names and/or trade names in the text.
Optionally, the method may further include: maintaining a lexicon of new internet words, wherein the lexicon is consulted in the step of predicting the error condition of the text, so as to avoid identifying new internet words as errors.
Optionally, the method may further include: maintaining a knowledge base in which incorrect word segments and their corresponding correct word segments are recorded, the pairs being obtained based on the prediction result, wherein the knowledge base is consulted in the step of predicting the error condition of the text.
According to a third aspect of the present disclosure, there is also provided a text correction apparatus including: element acquiring means for acquiring a plurality of elements included in the text; feature extraction means for determining feature data of at least one of the plurality of elements; the feature set device is used for generating a feature set of the text by taking the plurality of elements and the feature data as features; the error prediction device is used for predicting the error condition of the text based on the feature set; and error correction means for correcting the error of the text based on the prediction result.
In this way, the text is processed to obtain the plurality of elements it contains, and the feature set is generated from those elements and their corresponding feature data, so that richer inter-element relationships are captured, providing support for improving the quality of related services. The method for generating the feature set of a text is applicable both to training a text error prediction model and to performing error prediction on text. The mis-correction rate can thereby be reduced, and text error correction quality improved.
Drawings
The above and other objects, features and advantages of the present disclosure will become more apparent by describing in greater detail exemplary embodiments thereof with reference to the attached drawings, in which like reference numerals generally represent like parts throughout.
FIG. 1 shows a feature database preparation flow diagram according to one embodiment of the present disclosure.
FIG. 2 shows a flow diagram of a method of generating a feature set of text according to one embodiment of the present disclosure.
FIG. 3 shows a flow diagram of a method of training an error prediction model, according to one embodiment of the present disclosure.
FIG. 4 shows a model training flow diagram according to one embodiment of the present disclosure.
FIG. 5 shows a flow diagram of a text error prediction method according to one embodiment of the present disclosure.
FIG. 6 illustrates a flow diagram of an error prediction model application, according to one embodiment of the present disclosure.
FIG. 7 shows a flow diagram of a text correction method according to one embodiment of the present disclosure.
FIG. 8 shows a flow diagram of text correction according to one embodiment of the present disclosure.
FIG. 9 shows a schematic block diagram of an apparatus to generate a feature set of a text according to one embodiment of the present disclosure.
FIG. 10 shows a schematic block diagram of an apparatus to train an error prediction model, according to one embodiment of the present disclosure.
FIG. 11 shows a schematic block diagram of a text error prediction apparatus, according to one embodiment of the present disclosure.
FIG. 12 shows a schematic block diagram of a text correction apparatus according to one embodiment of the present disclosure.
FIG. 13 shows a flow diagram of a text correction method according to one embodiment of the present disclosure.
FIG. 14 shows a flow diagram of a text error prediction method according to one embodiment of the present disclosure.
FIG. 15 shows a schematic block diagram of a computing device that may be used to implement a method according to an embodiment of the present disclosure.
Detailed Description
Preferred embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While the preferred embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
As described above, in a search scenario for example, in order to avoid mis-correction as much as possible and reduce the mis-correction rate, the present disclosure provides a text error correction scheme, within which a scheme for generating a feature set of a text is provided. That scheme may be applied both to training a text error prediction model and to predicting the error condition of a text, so as to avoid mis-correction, reduce the mis-correction rate, and improve the user experience.
In order to better understand the text error correction scheme of the present disclosure, the following description proceeds stage by stage: a feature database preparation stage, a feature set generation stage, an error prediction model training stage, a text error prediction stage, and an error correction stage.
It should be understood that the different stages are described for better understanding of the technical solutions of the present disclosure, and are not intended to limit the embodiments of the present disclosure in any way. For example, the feature set generation method described in the feature set generation stage applies equally to the error prediction model training stage, the text error prediction stage, and the text error correction stage, and will not be described repeatedly below.
The text error prediction scheme of the present disclosure and the details involved therein will be described in detail below with reference to the accompanying drawings and examples, respectively.
Preparation phase of the feature database (including feature vectors)
In the embodiment of the present invention, first, a feature database may be maintained in advance, so as to provide support for the later generation of feature sets, training of models, application of models, and the like.
In one embodiment, the feature database may be derived by processing a text data set. Wherein the text data set may include, but is not limited to, one type of text data set.
The feature database may be maintained, for example, based on a vast text data set. In different application scenarios, the text data sets used or the feature databases maintained are not identical. For example, the text data set may be a general domain data set and the resulting feature database may be a general domain feature database. Or the text data set may be a vertical domain data set and the resulting feature database may be a vertical domain feature database. Alternatively, the text dataset may also be a web encyclopedia dataset, as the present disclosure is not limited in this respect. Wherein the feature database is maintained, for example, by performing a word segmentation process and/or a feature extraction process on the mass text data set.
FIG. 1 shows a feature database preparation flow diagram according to one embodiment of the present disclosure.
As shown in FIG. 1, the text data set used to maintain the feature database may include, by way of example, a new-retail search history data set and a web encyclopedia data set (e.g., a wiki encyclopedia).
First, the text data set may be preprocessed in order to obtain element and/or feature data. The data preprocessing may include processing involved in natural language processing, including but not limited to word segmentation, part-of-speech (POS) tagging, dependency parsing, and the like, performed on the text data set.
After data preprocessing, the elements contained in the text, such as characters and/or words and/or bigrams, can be obtained.
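As an illustration of this preprocessing step, the following minimal sketch uses the jieba toolkit; both the tool and the query string are assumptions, since the patent does not name specific segmentation or tagging tools.

```python
# A minimal preprocessing sketch; jieba and the sample query are
# assumptions, not named by the patent.
import jieba
import jieba.posseg as pseg

text = "连衣裙夏季新款"  # hypothetical query text

words = jieba.lcut(text)                         # word segmentation -> word elements
word_pos = [(w, p) for w, p in pseg.lcut(text)]  # POS tagging -> (word, POS) pairs

# Dependency parsing (for the inter-element dependency features) would
# require a full parser and is omitted from this sketch.
print(words)
print(word_pos)
```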
Secondly, features may be extracted from the data after data preprocessing, for example, feature extraction processing may be performed on the obtained elements to obtain corresponding feature data.
In the embodiment of the present disclosure, as much feature data as possible may be extracted in various ways to enrich the relationships between elements, thereby providing better support for applications based on the feature data (e.g., training a text error prediction model, or applying the text error prediction model).
The feature data can include the elements themselves, such as characters and/or words and/or bigrams; attribute features of the elements, such as part-of-speech features of characters and/or words; and inter-element association features, such as the position features of characters within words and inter-element dependency features.
In one embodiment, features of different granularity may also be extracted to enrich the relationship between elements.
For example, discrete features may be extracted for an element. The discrete features may include, for example, word segmentation features, part-of-speech (POS) features, dependency-type features, dependent-word features, character features, bigram features, position features, and the like.
For example, inter-element correlation features may be extracted for elements. The inter-element correlation feature may be, for example, a PMI score. PMI (Pointwise Mutual Information) measures the correlation between two items (e.g., two words).
In the embodiment of the present invention, the correlation between two elements (including characters and/or words and/or bigrams, etc.) may be measured based on PMI. The PMI score may be obtained by the following formula:
PMI(w1, w2) = log( P(w1, w2) / (P(w1) * P(w2)) )
where PMI(w1, w2) is the correlation score of element 1 and element 2, w1 denotes element 1 and w2 denotes element 2; P(w1) is the probability of element 1 occurring in the text data set, i.e. the ratio of the number of occurrences of element 1 to the total number of words in the text data set; P(w2) is the probability of element 2 occurring in the text data set, defined likewise; and P(w1, w2) is the probability of elements 1 and 2 co-occurring in the text data set, i.e. the ratio of the number of their co-occurrences to the total number of words in the text data set. Preferably, element 1 and element 2 may be two words, Word1 and Word2.
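As a worked illustration of the formula, the sketch below computes a PMI score from unigram and co-occurrence counts; the counting scheme and toy numbers are assumptions for illustration only.

```python
# A sketch of the PMI score above; the toy counts are hypothetical.
import math
from collections import Counter

def pmi_score(w1, w2, unigram_counts, pair_counts, total_tokens):
    # PMI(w1, w2) = log( P(w1, w2) / (P(w1) * P(w2)) )
    p_w1 = unigram_counts[w1] / total_tokens
    p_w2 = unigram_counts[w2] / total_tokens
    p_pair = pair_counts[(w1, w2)] / total_tokens
    if p_pair == 0 or p_w1 == 0 or p_w2 == 0:
        return float("-inf")  # elements never (co-)occur
    return math.log(p_pair / (p_w1 * p_w2))

unigrams = Counter({"Word1": 40, "Word2": 30})
pairs = Counter({("Word1", "Word2"): 12})
score = pmi_score("Word1", "Word2", unigrams, pairs, total_tokens=1000)
print(score)  # > 0 indicates the two words co-occur more often than chance
```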
For example, the above-described feature data may also include a combined feature of two or more features of an element, which may be extracted for the element.
The combined feature may include, for example, a combined feature of a position feature of an element in a word and a part-of-speech feature of the word, a combined feature of a position feature of an element in a word and/or a binary segment and the word and/or the binary segment, a combined feature of a part-of-speech feature of an element and a correlation feature between the elements, and the like.
As an example, the combination of an element's position feature within a word and the word's part-of-speech feature is also referred to as the position-POS feature. When extracting the position-POS feature, each character in a word may be traversed, and the position marker together with the POS of the word to which the character belongs constitutes a character-granularity feature, that is, a combined feature of the character's position within the word and the part of speech of that word.
In one embodiment, positions may be marked with the markers {"B", "I"}, where "B" represents a start position and "I" a non-start position. For example, for a three-character noun tagged "NN", the corresponding position-POS features may be {B_NN, I_NN, I_NN}.
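A small sketch of this extraction step follows; the three-character noun is hypothetical, and the B/I marking matches the example above.

```python
# A sketch of position-POS feature extraction: each character in a word
# gets a B/I position marker combined with the word's POS tag.
def position_pos_features(word, pos):
    return [("B" if i == 0 else "I") + "_" + pos for i in range(len(word))]

# A three-character noun tagged NN yields {B_NN, I_NN, I_NN}.
assert position_pos_features("电视机", "NN") == ["B_NN", "I_NN", "I_NN"]
```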
By way of example, the combination of an element's position feature within a word and/or bigram with that word and/or bigram itself may also be referred to as the word segmentation feature. When extracting the word segmentation feature, it may be determined, for example, from the segmentation of the text in which the word and/or bigram occurs.
For example, for a query statement (query) A1A2B1B2B3C1D1 segmented as A1A2_B1B2B3_C1_D1, the word segmentation feature of character B2 is I_word(B1B2B3), and the word segmentation feature of character A1 is B_word(A1A2).
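The sketch below reproduces this example; words are given as lists of characters because A1, B2, and so on each denote a single character.

```python
# A sketch of word segmentation feature extraction for the query
# A1A2_B1B2B3_C1_D1; each word is given as a list of its characters.
def segmentation_features(words):
    feats = []
    for word in words:
        for i, ch in enumerate(word):
            marker = "B" if i == 0 else "I"
            feats.append((ch, marker + "_word(" + "".join(word) + ")"))
    return feats

feats = dict(segmentation_features([["A1", "A2"], ["B1", "B2", "B3"], ["C1"], ["D1"]]))
assert feats["B2"] == "I_word(B1B2B3)"
assert feats["A1"] == "B_word(A1A2)"
```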
As an example, the combination of elements' part-of-speech features with the inter-element correlation features may be referred to as the POS-PMI combined feature. Since the same adjacent word pair can have different meanings under different POS pairs, in the embodiment of the present disclosure the POS-PMI combined feature is used as a supplement to the PMI feature, so as to obtain more feature data. In addition, for features involving the PMI score, the PMI score may be discretized, and corresponding feature identifiers assigned to the discretized feature data.
It should be understood that the above description is only illustrative and not restrictive of the characteristic data to which the disclosure relates. In other embodiments, other characteristic data may also be included, which is not described in detail herein.
In order to facilitate the use of the elements or feature data in the feature database, in an embodiment, after the element or feature data is obtained, a feature identifier may be assigned to the obtained element or feature data, so that in a subsequent use process, the required related elements or feature data may be quickly found based on the feature identifier.
Each feature, or each class of features, may be assigned a unique feature identifier. The identifier may be a binary sequence or any computer-recognizable character string, which the present disclosure does not limit.
In the maintained feature database, a plurality of features obtained based on the text data set and feature identifiers respectively corresponding to the features may be stored in association with each other. For example, as shown in fig. 1, mapping data of feature names and feature identifications of feature data may be saved as a file, such as a feature name and feature ID mapping table of discrete features, a PMI Score feature and feature ID mapping table, and a feature name and feature ID mapping table of combined features.
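As an illustration of such mapping tables, the following sketch assigns sequential integer IDs to feature names and saves the mapping; the JSON file format and the sample feature names (including the discretized PMI bucket) are assumptions.

```python
# A sketch of building and saving a feature-name -> feature-ID mapping
# table; file format and sample names are illustrative assumptions.
import json

def build_feature_id_map(feature_names):
    return {name: idx for idx, name in enumerate(sorted(set(feature_names)))}

feature_id_map = build_feature_id_map(
    ["B_NN", "I_NN", "I_word(B1B2B3)", "PMI_BIN_3"]  # PMI_BIN_3: hypothetical discretized PMI bucket
)
with open("feature_id_map.json", "w", encoding="utf-8") as f:
    json.dump(feature_id_map, f, ensure_ascii=False)
```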
Returning to the flowchart shown in fig. 1, feature pre-training may also be performed based on the text data set in order to obtain the relevant feature vectors. Feature pre-training can be performed on the text data set with a predetermined tool, and the feature vectors corresponding to the respective elements or feature data can be stored in the maintained feature database in an associated manner.
In the embodiment of the present invention, a corpus composed of predetermined text data sets (including but not limited to the text data set described above and the error correction parallel corpus data set described below) may be used, and word vectors (word embeddings), character vectors (character embeddings), bigram vectors (bigram embeddings), and the like may be obtained by training with the fastText tool.
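A sketch of this pre-training step with the fastText library follows; the corpus file names, model type, and dimension are assumptions, and each granularity is assumed to be trained on a corpus file tokenized at that granularity.

```python
# A sketch of embedding pre-training with fastText; file names and
# hyperparameters are assumptions.
import fasttext

word_model = fasttext.train_unsupervised("corpus_words.txt", model="skipgram", dim=100)
char_model = fasttext.train_unsupervised("corpus_chars.txt", model="skipgram", dim=100)
bigram_model = fasttext.train_unsupervised("corpus_bigrams.txt", model="skipgram", dim=100)

word_model.save_model("word_embedding.bin")  # loaded later at training/prediction time
```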
Thus, the preparation phase of the feature database in the embodiment of the present disclosure has been described in detail with reference to fig. 1, and the feature database maintained at this phase may be used in any of the phases described below, which are described in detail below.
The above feature database is applicable to any scenario, and in particular to a search scenario. For example, feature data can be maintained from a domain text data set and a web encyclopedia data set, with as many features as possible extracted, so that query statements input by users in a search scenario can be corrected, results meeting the users' query requirements can be returned, and the user experience improved.
Generation of feature sets for text
FIG. 2 shows a flow diagram of a method of generating a feature set of text according to one embodiment of the present disclosure. The details of this scheme are the same as or similar to those of the feature database preparation stage, and the specific implementation can be found above, so it is not repeated here. It should be understood that the text herein may include massive text, as well as the corpora involved in the model training phase and the query sentences involved in the text error prediction and/or text error correction phases, which the present disclosure does not limit.
As shown in fig. 2, in step S210, a plurality of elements included in the text are acquired.
The text may be a Chinese text, an English text, or a character string containing both Chinese and English (including punctuation marks and the like).
The text may be obtained from a variety of sources. Also, the meaning of text is not exactly the same in different scenarios. For example, in the model training phase described below, the text may represent a text data set, and in the text misprediction phase described below, the text may be text to be mispredicted, such as a query sentence input by a user in a search box, or text obtained by text-converting a query speech of the user, which is not limited by the present disclosure.
In addition, the text may also be an article by a predetermined author, of a predetermined owner, or from a predetermined source, for example a popular online article. By performing error prediction or error correction on such articles, specialized services can be provided to the relevant users.
The elements may be data processed from text, and the elements obtained may be associated with the particular data processing being performed. In one embodiment, the data processing may include, for example, word segmentation processing, part-of-speech tagging (POS), Dependency Parsing (Dependency Parsing), and the like, and the element may include, for example, a character and/or a word and/or a binary word segmentation obtained by segmenting a text.
In the disclosed embodiments, the text may be processed with multiple word segmentation algorithms. Alternatively, the text can be processed at different segmentation granularities, such as unary segmentation, binary segmentation, multi-gram segmentation, and accurate segmentation; the present disclosure does not limit the word segmentation method adopted. Preferably, the text is processed with multiple segmentation algorithms or at multiple segmentation granularities, in order to obtain as many elements or features as possible.
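The sketch below illustrates extraction at several granularities, with jieba's full mode standing in for multi-algorithm segmentation; both the tool and the sample text are assumptions.

```python
# A sketch of multi-granularity element extraction; jieba and the
# sample text are illustrative assumptions.
import jieba

text = "夏季连衣裙"  # hypothetical query
unigrams = list(text)                                    # unary segmentation
bigrams = [text[i:i + 2] for i in range(len(text) - 1)]  # binary segmentation
precise_words = jieba.lcut(text)                         # accurate segmentation
candidates = jieba.lcut(text, cut_all=True)              # full mode: overlapping candidate words
```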
In step S220, feature data of at least one of the plurality of elements is determined.
Here, the corresponding feature data may be obtained by performing a feature extraction process on the elements.
The feature data may include the elements themselves, feature data associated with the elements, and feature data between different elements.
As in the preceding feature database preparation phase, here the elements may include, for example, characters and/or words and/or bigrams. The feature data may include part-of-speech features of characters and/or words (including words derived from bigram segmentation), and may also include inter-element association features.
The inter-element association features include, for example, position features of characters within words, inter-element dependency features (e.g., dependency-type and dependent-word features), and inter-element correlation features.
The feature data may also include a combined feature of two or more features of the element. For example, a combination feature of a position feature of the element in a word and a part-of-speech feature of the word; the position characteristics of the elements in the words and/or the binary participles and the combination characteristics of the words and/or the binary participles; a combination of part-of-speech features of the elements and the inter-element relevance features.
The at least one element of interest here may be a word or phrase of a specific domain. For example, the plurality of elements may include brand names and/or trade names. For error prediction or error correction, the system may be set to focus on words or phrases of this specific domain. In this way, in the process of predicting the error condition of the text described below, the error condition of the brand names and/or trade names in the text can be given particular attention. Likewise, in the process of correcting the text described below, correction of the brand names and/or trade names in the text can be emphasized.
Wherein, in step S210, the element may be obtained by performing word segmentation processing on the text. In step S220, part-of-speech tagging may be performed on the text to obtain part-of-speech features of the characters and/or words; or, performing dependency syntax analysis processing on the text to obtain the inter-element dependency relationship characteristics; alternatively, the inter-element correlation features may also be obtained from a feature database.
Then, in step S230, a feature set of the text is generated by using the plurality of elements and the feature data as features.
Here, the plurality of elements and the feature data may be feature-trained to obtain corresponding feature vectors, from which the feature set of the text is generated.
In the feature database maintained as described above, a plurality of features and the corresponding feature vectors have already been derived from the massive text data set. In one embodiment, the feature identifier corresponding to each feature may be obtained from the feature database based on the obtained elements and feature data, and the feature set may be generated based on the feature identifiers. In the feature database, the plurality of features include the plurality of elements extracted from the text data set and the feature data of those elements.
In an embodiment, the feature database may further store feature vectors corresponding to the plurality of features in an associated manner, where the feature identifiers and the feature vectors of the features correspond to each other, and when the feature set is generated, for example, the feature vectors corresponding to the plurality of features may be obtained based on the feature identifiers, and the obtained feature vectors may be combined to obtain the feature set.
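A sketch of this lookup-and-combine step follows; representing the feature database as two in-memory dictionaries is a simplifying assumption.

```python
# A sketch of assembling a feature set from the feature database; the
# two-dict storage layout is an assumption.
import numpy as np

def build_feature_set(feature_names, feature_id_map, feature_vectors):
    vectors = []
    for name in feature_names:
        feat_id = feature_id_map.get(name)
        if feat_id is not None:                 # unknown features are skipped
            vectors.append(feature_vectors[feat_id])
    # Combine the per-feature vectors into the feature set; a sequence
    # model may instead concatenate the vectors token by token.
    return np.stack(vectors) if vectors else np.empty((0, 0))
```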
The method of generating a feature set of a text according to the present disclosure has now been described in detail in connection with FIG. 2. It can be applied to any text, and to any scenario in which a feature set of a text needs to be generated, including but not limited to the model training and text error prediction scenarios below; how the feature set is generated will not be repeated there. The extracted part-of-speech features, PMI score features, combined features, and the like play an important role in judging the error condition of a text. Training the error prediction model on these features can greatly improve the accuracy of the model, and hence the accuracy of text error prediction based on the model.
Training the error prediction model
FIG. 3 shows a flow diagram of a method of training an error prediction model, according to one embodiment of the present disclosure. FIG. 4 shows a model training flow diagram according to one embodiment of the present disclosure. The model may be trained based on a corpus, i.e. text serving as training samples. In a preferred embodiment, the corpus may consist of query sentences historically input by users, in particular erroneous queries input by users that were subsequently corrected.
In one embodiment, the corpus may be chosen according to the application scenario of the model. In the following examples, the model training scheme of the present disclosure is explained with a historical error-correction parallel corpus as the training corpus. The historical error-correction parallel corpus is a data set consisting of <error query, corresponding correct query> pairs.
As shown in FIG. 3, in step S310, a feature set of the corpus may be generated. Here, for example, the feature set of the corpus may be generated using the aforementioned method.
Thereafter, in step S320, the error prediction model is trained based on the feature set. The error prediction model may be a BiLSTM-CRF.
Specifically, referring to the flowchart shown in fig. 4, with the error-correction parallel corpus as the corpus for training the model, first, in step S411, data preprocessing may be performed on the corpus. As before, word segmentation, part-of-speech tagging, dependency parsing, and the like may be performed on the corpus to obtain a plurality of elements, such as characters and/or words and/or bigrams.
Thereafter, in step S412, for example, a feature extraction process may be further performed to obtain a corresponding feature.
Here, the extracted feature data may include part-of-speech features of characters and/or words, or inter-element association features, the same as or similar to the foregoing. The inter-element association feature includes at least one of: the position characteristics of the characters in the words; inter-element dependency characteristics; inter-element correlation characteristics. The feature data may also include a combined feature of two or more features of the element. The combined features include at least one of: the position feature of the element in the word and the combination feature of the part of speech feature of the word; the position characteristics of the elements in the words and/or the binary participles and the combination characteristics of the words and/or the binary participles; a combination of part-of-speech features of the elements and the inter-element relevance features.
The step of acquiring a plurality of elements included in the text may include: and performing word segmentation processing on the text to obtain the element. The step of determining the characteristic data of at least one of the plurality of elements comprises at least one of: performing part-of-speech tagging on the text to obtain part-of-speech characteristics of the characters and/or words; performing dependency syntax analysis processing on the text to obtain the inter-element dependency relationship characteristics; and acquiring the inter-element correlation characteristics from the characteristic database.
The error prediction model may be associated with the feature database, and in step S413, feature identifiers corresponding to the respective features may be obtained from the feature database based on the obtained feature data.
The feature identifier (ID) of each feature may be used as an input to the error prediction model. For word embeddings, character embeddings, and bigram embeddings, the corresponding embedding vectors can be loaded from the pre-trained embedding file. For the remaining feature IDs, the corresponding embedding vectors can be taken from the randomly initialized embedding table of the BiLSTM.
Thereafter, the obtained feature vectors may be combined together to generate a feature set, and a misprediction model may be trained based on the feature set in step S414.
In one embodiment, the corpus may be labeled and the model may be trained in conjunction with the feature sets and corresponding labeling sequences.
Referring to fig. 4, for example, in step S415, a labeling sequence corresponding to the corpus may be obtained, and then the misprediction model may be trained by combining a feature set and the labeling sequence.
Here, the sequence annotation data of the corpus may be generated using a predetermined annotation tool. The annotation sequence can characterize the error condition of the corpus. Here, the tagging sequence corresponding to the corpus may be obtained based on the error-corrected parallel corpus data set.
In one embodiment, the annotation sequence may be a binary sequence, and each bit of the binary sequence may represent the correctness of the corresponding character in the corpus. For example, the annotation sequence may be a sequence [0,1,0, … 1], where the ith bit being 0 indicates that the ith character of the original query is correct, and the ith bit being 1 indicates that the ith character is erroneous.
In one embodiment, the annotation sequence can include error labels and/or correct labels. Similar to the binary sequence, the correctness of the character corresponding to each bit of the corpus can be marked in the labeled sequence with error labels and/or correct labels, which is not repeated here.
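As an illustration, the sketch below derives such a binary tagging sequence from one <error query, correct query> pair; it assumes substitution-only errors of equal length, a simplification not imposed by the patent.

```python
# A sketch of deriving the 0/1 tagging sequence from an error-correction
# parallel corpus pair; assumes substitution-only (equal-length) errors.
def label_sequence(wrong_query, correct_query):
    assert len(wrong_query) == len(correct_query)
    # 0: character is correct; 1: character is erroneous.
    return [0 if a == b else 1 for a, b in zip(wrong_query, correct_query)]

assert label_sequence("abcd", "abed") == [0, 0, 1, 0]
```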
Then, in step S414, the foregoing embedding vectors may be combined as the input of the BiLSTM, yielding its forward and backward output vectors. These vectors are combined and used as the input of the CRF model, with the parallel-corpus tagging sequence data as the CRF output, and the CRF model is trained.
Thereafter, in step S416, the trained error prediction model may be saved to the BiLSTM-CRF model file.
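For orientation only, the following PyTorch sketch (using the pytorch-crf package) shows the overall shape of such a BiLSTM-CRF; the single shared embedding table and all dimensions are simplifying assumptions, since the patent combines several per-feature embeddings.

```python
# A structural sketch of a BiLSTM-CRF error prediction model; dimensions
# and the single embedding table are simplifying assumptions.
import torch
import torch.nn as nn
from torchcrf import CRF  # pip install pytorch-crf

class BiLSTMCRF(nn.Module):
    def __init__(self, num_feature_ids, embed_dim=100, hidden_dim=128, num_tags=2):
        super().__init__()
        self.embedding = nn.Embedding(num_feature_ids, embed_dim)
        self.bilstm = nn.LSTM(embed_dim, hidden_dim // 2,
                              bidirectional=True, batch_first=True)
        self.hidden2tag = nn.Linear(hidden_dim, num_tags)  # tags: 0 correct, 1 wrong
        self.crf = CRF(num_tags, batch_first=True)

    def _emissions(self, feature_ids):
        lstm_out, _ = self.bilstm(self.embedding(feature_ids))
        return self.hidden2tag(lstm_out)

    def loss(self, feature_ids, tags):
        # Negative log-likelihood of the tagging sequences under the CRF.
        return -self.crf(self._emissions(feature_ids), tags)

    def predict(self, feature_ids):
        # Viterbi decoding -> one binary tag sequence per input text.
        return self.crf.decode(self._emissions(feature_ids))
```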
So far, the scheme of training the error prediction model of the present disclosure has been described in detail with reference to FIGS. 3-4.
The trained error prediction model (such as the BiLSTM-CRF model) can be used for predicting the error condition of a predetermined text.
Text error prediction
FIG. 5 shows a flow diagram of a text error prediction method according to one embodiment of the present disclosure.
FIG. 6 illustrates a flow diagram of an error prediction model application, according to one embodiment of the present disclosure.
Referring to the flowchart shown in fig. 6, text error prediction may be applied in a search scenario, where a query sentence input by the user serves as the text to be predicted. For this text, referring to fig. 5, at step S510 a feature set of the text may be generated, using the feature set generation method described previously herein. At step S520, the error condition of the text may be predicted with the error prediction model based on the feature set.
Returning to fig. 6, the same or similar to the foregoing, in step S610, data preprocessing may be performed on the text (e.g., the query sentence input by the user) to obtain a plurality of elements included in the text. The data preprocessing may include word segmentation processing, part-of-speech tagging processing, dependency parsing processing, and the like.
In step S620, feature extraction processing may be performed on the processed elements to obtain a plurality of features. The feature identifiers corresponding to the plurality of features can be determined, and the feature identifiers are used as the input of the error prediction model to predict the text.
Specifically, for example, feature identifiers corresponding to a plurality of features may be acquired from a feature database based on the extracted features, and the acquired feature identifiers may be used as an input of the error prediction model.
In step S630, the error prediction model may automatically load a corresponding embedding value from the trained embedding file based on the feature identifier, and start the prediction of the BiLSTM-CRF, so as to obtain a prediction result.
In one embodiment, the prediction result is a binary sequence, and each bit of the binary sequence represents the correctness of the corresponding character in the text.
Text error correction
In the embodiment of the present disclosure, the text may be subjected to error correction processing with respect to the prediction result of the text.
FIG. 7 shows a flow diagram of a text correction method according to one embodiment of the present disclosure. FIG. 8 shows a flow diagram of text correction according to one embodiment of the present disclosure.
As shown in fig. 7, in step S710, an error condition of the text may be predicted. Wherein the prediction of error conditions for the text can be implemented based on the ways shown in fig. 5-6.
In step S720, the text may be corrected based on the prediction result.
As shown in fig. 8, in step S721, the prediction result may be judged to determine the correctness of the text.
If the result of the determination is yes, that is, if the prediction result indicates that the text contains an error, the process proceeds to step S722, where error correction is performed on the text and the error correction result, i.e. the corrected text, is returned.
If the result of the determination is no, that is, if the prediction result indicates that the text contains no error, the process proceeds to step S723, and the original text is returned as the error correction result.
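A sketch of this control flow follows; correct_text() is a hypothetical placeholder, since the candidate generation and ranking inside step S722 are not detailed at this point.

```python
# A sketch of the judge-then-correct flow of FIG. 8; correct_text() is
# a hypothetical placeholder for the actual correction step.
def correct_text(text, prediction):
    # A real corrector would generate and rank replacement candidates
    # for each character flagged with 1; returned unchanged here.
    return text

def error_correct(text, prediction):
    # prediction: binary sequence, 1 marks an erroneous character (S721).
    if any(prediction):                      # error detected -> S722
        return correct_text(text, prediction)
    return text                              # no error -> S723, return original
```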
In this way, by generating the feature set of the text and applying the pre-trained error prediction model, the present disclosure can perform error prediction and error correction on the query sentence input by the user in real time: the input text is analyzed for errors, erroneous input is corrected, and results meeting the user's query requirements are returned, improving the user experience.
FIG. 9 shows a schematic block diagram of an apparatus to generate a feature set of a text according to one embodiment of the present disclosure. FIG. 10 shows a schematic block diagram of an apparatus to train a misprediction model, according to one embodiment of the present disclosure. FIG. 11 shows a schematic block diagram of a text misprediction apparatus, according to one embodiment of the present disclosure. FIG. 12 shows a schematic block diagram of a text correction apparatus according to one embodiment of the present disclosure. Wherein the functional blocks of the device can be implemented by hardware, software, or a combination of hardware and software implementing the principles of the present disclosure. It will be appreciated by those skilled in the art that the functional blocks described in fig. 9-12 may be combined or divided into sub-blocks to implement the principles of the invention described above. Thus, the description herein may support any possible combination, or division, or further definition of the functional modules described herein.
In the following, functional modules that the device can have and operations that each functional module can perform are briefly described, and for the details related thereto, reference may be made to the above description, and details are not described here again.
As shown in fig. 9, the apparatus 900 for generating a feature set of a text may include an element acquiring apparatus 910, a feature extracting apparatus 920, and a feature set apparatus 930.
Wherein, the element acquiring means 910 may acquire a plurality of elements included in the text. The feature extraction device 920 may determine feature data of at least one element of the plurality of elements. Feature set unit 930 may generate a feature set for the text by using the plurality of elements and the feature data as features.
As shown in fig. 10, the apparatus 1000 for training an error prediction model may include a feature set generating device 1010 and a training device 1020.
The feature set generating device 1010 can generate the feature set of the corpus using the aforementioned method for generating the feature set of a text. The training device 1020 may train the error prediction model based on the feature set.
As shown in fig. 11, the text error prediction apparatus 1100 may include a feature set generating device 1110 and an error prediction device 1120. The feature set generating device 1110 can generate the feature set of the text using the method described above. The error prediction device 1120 can predict the error condition of the text with the error prediction model based on the feature set.
As shown in fig. 12, the text correction apparatus 1200 may include an error prediction apparatus 1210 and an error correction apparatus 1220. The error prediction unit 1210 may predict an error condition of the text by using the text error prediction method. The error correction unit 1220 may correct the error of the text based on the prediction result.
As shown in fig. 1 to 8, the present invention can also be implemented as a text error correction method or a text error prediction method.
FIG. 13 shows a flow diagram of a text correction method according to one embodiment of the present disclosure. For the details involved, reference may be made to the above description, which is not repeated here.
As shown in FIG. 13, in step 1310, a plurality of elements contained in the text are obtained. Wherein the element may comprise a character and/or a word and/or a bigram.
At step 1320, feature data for at least one of the plurality of elements is determined.
In one embodiment, the feature data may include: part-of-speech characteristics of characters and/or words; and/or inter-element association features. The inter-element association feature may include at least one of: the position characteristics of the characters in the words; inter-element dependency characteristics; inter-element correlation characteristics.
In another embodiment, the characteristic data may further include: a combination of two or more features of an element. The combined features include at least one of: the position feature of the element in the word and the combination feature of the part of speech feature of the word; the position characteristics of the elements in the words and/or the binary participles and the combination characteristics of the words and/or the binary participles; a combination of part-of-speech features of the elements and the inter-element relevance features.
At step 1330, a feature set of the text is generated by taking the plurality of elements and the feature data as features.
At step 1340, an error condition of the text is predicted based on the feature set.
In step 1350, the text is corrected based on the prediction.
In an embodiment of the present disclosure, the step of obtaining a plurality of elements included in the text may include performing word segmentation on the text to obtain the elements; and/or the step of determining feature data of at least one of the plurality of elements includes at least one of: performing part-of-speech tagging on the text to obtain the part-of-speech features of the characters and/or words; performing dependency syntax analysis on the text to obtain the inter-element dependency relationship features; and acquiring the inter-element correlation features from a feature database.
In an embodiment of the present disclosure, the step of generating the feature set of the text may include: acquiring feature identifiers corresponding to the features from a feature database, wherein the feature database is obtained by processing a text data set, a plurality of features obtained based on the text data set and feature identifiers corresponding to the features are stored in the feature database in an associated manner, and the features comprise a plurality of elements extracted from the text data set and feature data of the elements; and generating the feature set based on the feature identification. Wherein the text data set comprises at least one of: a general domain data set; a vertical domain data set; a networked encyclopedia data set.
In this embodiment of the present disclosure, the feature database may further store feature vectors corresponding to the plurality of features respectively in an associated manner, and the step of generating the feature set based on the feature identifier further includes: acquiring feature vectors corresponding to the plurality of features respectively based on the feature identifiers; and combining the obtained feature vectors to obtain the feature set.
In the embodiment of the present disclosure, the feature vector is obtained by performing feature training on the plurality of elements and feature data of the elements extracted from the text data set.
In the embodiment of the disclosure, based on the feature set, an error prediction model may be utilized to predict the error condition of the text. The prediction result may be a binary sequence, each bit of which represents the correctness of the corresponding character in the text.
In an embodiment of the present disclosure, the step of predicting the error condition of the text may include: determining feature identifications corresponding to the plurality of features respectively; and taking the feature identification as an input of the error prediction model to predict the text.
In this embodiment of the present disclosure, feature identifiers corresponding to the plurality of features may be acquired from a feature database.
In embodiments of the present disclosure, the error prediction model may also be trained. The step of training the error prediction model may comprise: generating a feature set of a corpus, wherein the corpus is a text; and training the error prediction model based on the feature set of the corpus.
A labeling sequence corresponding to the corpus can also be obtained; the labeling sequence represents the error condition of the corpus, and the error prediction model is trained based on the feature set of the corpus and the labeling sequence. The labeling sequence corresponding to the corpus is obtained based on the error-correction parallel corpus data set.
In this disclosure, the tagging sequence is a binary sequence, and each bit of the binary sequence represents the correctness of the corresponding character in the corpus; and/or the annotation sequence comprises error labels and/or correct labels.
In an embodiment of the present disclosure, the error prediction model is a BiLSTM-CRF model. The text may include a query statement entered by the user.
FIG. 14 shows a flow diagram of a text error prediction method according to one embodiment of the present disclosure. For the details involved, reference may be made to the above description, which is not repeated here.
As shown in fig. 14, in step S1410, a plurality of elements included in the text are acquired. In step S1420, feature data of at least one of the plurality of elements is determined. In step S1430, a feature set of the text is generated by using the plurality of elements and the feature data as features. The method for producing the feature set is the same as described above, and the details thereof can be found in the related description above. In step S1440, an error condition of the text is predicted based on the feature set by using an error prediction model.
In the embodiment of the present disclosure, the prediction result of the error condition of the text may be a binary sequence, where each bit of the binary sequence represents the correctness of the corresponding character in the text.
In an embodiment of the present disclosure, the step of predicting the error condition of the text includes: determining feature identifications corresponding to the plurality of features respectively; and taking the feature identification as an input of the error prediction model to predict the text.
In this embodiment of the present disclosure, feature identifiers corresponding to the plurality of features may be acquired from a feature database.
In an embodiment of the present disclosure, the error prediction model is a BiLSTM-CRF model.
In embodiments of the present disclosure, the text may include a query statement entered by a user. Alternatively, the text may include articles of a predetermined author or a predetermined owner or a predetermined source.
In embodiments of the present disclosure, the plurality of elements may include product names and/or brand names. In this case, the step of predicting the error condition of the text may include predicting the error condition of the product names and/or brand names in the text, and the step of correcting the text may include correcting the product names and/or brand names in the text.
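A hedged sketch of restricting the prediction to such elements follows; the lexicon contents and the function name are hypothetical examples, not part of the disclosure.

```python
PRODUCT_BRAND_LEXICON = {"苹果", "华为", "手机"}  # hypothetical merchant lexicon

def predict_name_errors(elements, labels):
    """Keep only the error flags that fall on product or brand name elements."""
    return [(e, b) for e, b in zip(elements, labels)
            if e in PRODUCT_BRAND_LEXICON]
```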
In the embodiment of the disclosure, a lexicon of network new words may also be maintained. When the error condition of the text is predicted, the lexicon may be consulted so as to avoid identifying network new words as errors.
In the embodiment of the present disclosure, a knowledge base may be further maintained, in which corresponding correct participles and incorrect participles are recorded, where the corresponding correct participles and incorrect participles are obtained based on the prediction result. The knowledge base may be referred to in the step of predicting an error condition of the text.
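The two safeguards just described might be combined into a post-filter along the following lines; the lexicon and knowledge-base contents shown are hypothetical examples, and the per-segment flags are a simplification of the per-character labeling described above.

```python
NEW_WORD_LEXICON = {"耗子尾汁"}    # hypothetical internet neologism
KNOWLEDGE_BASE = {"平果": "苹果"}  # wrong participle -> correct participle

def filter_and_correct(segments, labels):
    corrected = []
    for seg, flagged in zip(segments, labels):
        if seg in NEW_WORD_LEXICON:
            flagged = 0               # never treat a known new word as an error
        if flagged and seg in KNOWLEDGE_BASE:
            seg = KNOWLEDGE_BASE[seg] # apply the recorded correction
        corrected.append(seg)
    return "".join(corrected)
```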
In addition, the present disclosure also provides a text error correction apparatus. The text error correction apparatus may include an element acquisition device, a feature extraction device, a feature set device, an error prediction device, and an error correction device. The element acquisition device is configured to acquire a plurality of elements contained in the text; the feature extraction device is configured to determine feature data of at least one of the plurality of elements; the feature set device is configured to generate a feature set of the text by using the plurality of elements and the feature data as features; the error prediction device is configured to predict the error condition of the text based on the feature set; and the error correction device is configured to correct the text based on the prediction result.
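One possible, non-limiting object layout for such an apparatus is sketched below; the class and method names are illustrative only and are not mandated by the disclosure.

```python
class TextErrorCorrector:
    def __init__(self, element_acquirer, feature_extractor,
                 feature_set_builder, error_predictor, corrector):
        self.element_acquirer = element_acquirer      # element acquisition device
        self.feature_extractor = feature_extractor    # feature extraction device
        self.feature_set_builder = feature_set_builder  # feature set device
        self.error_predictor = error_predictor        # error prediction device
        self.corrector = corrector                    # error correction device

    def correct(self, text):
        elements = self.element_acquirer(text)
        feats = self.feature_extractor(elements)
        feature_set = self.feature_set_builder(elements, feats)
        prediction = self.error_predictor(feature_set)
        return self.corrector(text, prediction)
```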
FIG. 15 shows a schematic structural diagram of a computing device according to an embodiment of the invention.
Referring to fig. 15, the computing device 1500 includes a memory 1510 and a processor 1520.
The processor 1520 may be a multi-core processor or may include a plurality of processors. In some embodiments, the processor 1520 may include a general-purpose host processor and one or more special-purpose coprocessors, such as a graphics processing unit (GPU) or a digital signal processor (DSP). In some embodiments, the processor 1520 may be implemented using custom circuitry, such as an application-specific integrated circuit (ASIC) or a field-programmable gate array (FPGA).
The memory 1510 may include various types of storage units, such as system memory, read-only memory (ROM), and permanent storage. The ROM may store static data or instructions needed by the processor 1520 or other modules of the computer. The permanent storage may be a readable and writable, non-volatile storage device that does not lose stored instructions and data even after the computer is powered off. In some embodiments, a mass storage device (e.g., a magnetic or optical disk, or flash memory) is employed as the permanent storage; in other embodiments, the permanent storage may be a removable storage device (e.g., a floppy disk or an optical drive). The system memory may be a readable and writable memory device or a volatile readable and writable memory device, such as dynamic random access memory, and may store instructions and data that some or all of the processors require at runtime. Furthermore, the memory 1510 may include any combination of computer-readable storage media, including various types of semiconductor memory chips (DRAM, SRAM, SDRAM, flash memory, programmable read-only memory) and magnetic and/or optical disks. In some embodiments, the memory 1510 may include a readable and/or writable removable storage device, such as a compact disc (CD), a read-only digital versatile disc (e.g., DVD-ROM, dual-layer DVD-ROM), a read-only Blu-ray disc, an ultra-density optical disc, a flash memory card (e.g., SD card, mini SD card, Micro-SD card), or a magnetic floppy disk. Computer-readable storage media do not contain carrier waves or transitory electronic signals transmitted by wireless or wired means.
The memory 1510 has executable code stored thereon that, when executed by the processor 1520, causes the processor 1520 to perform the methods described above.
The text error correction method and apparatus according to the present invention, together with the feature set generation and the model training and application involved therein, have been described in detail above with reference to the accompanying drawings.
Furthermore, the method according to the invention may also be implemented as a computer program or computer program product comprising computer program code instructions for carrying out the above-mentioned steps defined in the above-mentioned method of the invention.
Alternatively, the invention may also be embodied as a non-transitory machine-readable storage medium (or computer-readable storage medium, or machine-readable storage medium) having stored thereon executable code (or a computer program, or computer instruction code) which, when executed by a processor of an electronic device (or computing device, server, etc.), causes the processor to perform the steps of the above-described method according to the invention.
Those of skill would further appreciate that the various illustrative logical blocks, modules, circuits, and algorithm steps described in connection with the disclosure herein may be implemented as electronic hardware, computer software, or combinations of both.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems and methods according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
Having described embodiments of the present invention, the foregoing description is intended to be exemplary, not exhaustive, and not limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein is chosen in order to best explain the principles of the embodiments, the practical application, or improvements made to the technology in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

Claims (35)

1. A text error correction method, comprising:
acquiring a plurality of elements contained in a text;
determining feature data for at least one of the plurality of elements;
generating a feature set of the text by taking at least the plurality of elements and the feature data as features;
predicting error conditions of the text based on the feature set; and
correcting the text based on the prediction result.
2. The method of claim 1, wherein
the elements comprise characters and/or words and/or binary participles.
3. The method of claim 2, wherein the feature data comprises:
part-of-speech features of characters and/or words; and/or
inter-element association features.
4. The method of claim 3, wherein the inter-element association features comprise at least one of:
position features of the characters in the words;
inter-element dependency features;
inter-element correlation features.
5. The method of claim 4, wherein the feature data further comprises:
a combination of two or more features of an element.
6. The method of claim 5, wherein the combined features comprise at least one of:
a combination feature of the position feature of the element in the word and the part-of-speech feature of the word;
a combination feature of the position feature of the element in the word and/or binary participle and the word and/or binary participle;
a combination feature of the part-of-speech feature of the element and the inter-element correlation feature.
7. The method of claim 4, wherein the step of acquiring the plurality of elements contained in the text comprises:
performing word segmentation on the text to obtain the elements;
and/or
the step of determining the feature data of at least one of the plurality of elements comprises at least one of:
performing part-of-speech tagging on the text to obtain the part-of-speech features of the characters and/or words;
performing dependency syntax analysis on the text to obtain the inter-element dependency features;
acquiring the inter-element correlation features from a feature database.
8. The method of claim 1, wherein the step of generating a feature set for the text comprises:
acquiring feature identifiers corresponding to the features from a feature database, wherein the feature database is obtained by processing a text data set, a plurality of features obtained based on the text data set and feature identifiers corresponding to the features are stored in the feature database in an associated manner, and the features comprise a plurality of elements extracted from the text data set and feature data of the elements; and
generating the feature set based on the feature identification.
9. The method of claim 8, wherein the text data set comprises at least one of:
a general domain data set;
a vertical domain data set;
an online encyclopedia data set.
10. The method according to claim 8, wherein the feature database further stores feature vectors respectively corresponding to the plurality of features in association, and the step of generating the feature set based on the feature identification further comprises:
acquiring feature vectors corresponding to the plurality of features respectively based on the feature identifiers;
and combining the obtained feature vectors to obtain the feature set.
11. The method of claim 10, wherein the feature vector is obtained by feature training the plurality of elements and feature data of the elements extracted from the text dataset.
12. The method of claim 1, wherein
the error condition of the text is predicted by using an error prediction model based on the feature set.
13. The method of claim 12, wherein
the prediction result is a binary sequence, and each bit of the binary sequence represents the correctness of the corresponding character in the text.
14. The method of claim 12, wherein predicting the error condition of the text comprises:
determining feature identifications corresponding to the plurality of features respectively;
and taking the feature identification as an input of the error prediction model to predict the text.
15. The method of claim 14, wherein
the feature identifications corresponding to the plurality of features are acquired from a feature database.
16. The method of claim 12, further comprising:
training the error prediction model.
17. The method of claim 16, wherein the step of training the error prediction model comprises:
generating a feature set of a corpus, wherein the corpus is a text;
and training the error prediction model based on the feature set of the corpus.
18. The method of claim 17, further comprising:
acquiring a labeling sequence corresponding to the corpus, wherein the labeling sequence represents an error condition of the corpus, and wherein
the error prediction model is trained based on the feature set of the corpus and the labeling sequence.
19. The method of claim 18, wherein
the labeling sequence corresponding to the corpus is acquired based on an error-correction parallel corpus data set.
20. The method of claim 18, wherein
the labeling sequence is a binary sequence, and each bit of the binary sequence represents the correctness of the corresponding character in the corpus; and/or
the labeling sequence includes wrong labels and/or correct labels.
21. The method of claim 12, wherein
the error prediction model is a BiLSTM-CRF model.
22. The method of claim 1,
the text comprises a query sentence input by a user; or
The text includes articles of a predetermined author or a predetermined owner or a predetermined source.
23. The method of claim 1, wherein the plurality of elements comprise product names and/or brand names,
the step of predicting the error condition of the text comprises: predicting the error condition of the product names and/or brand names in the text; and/or
the step of correcting the text comprises: correcting the product names and/or brand names in the text.
24. The method of claim 1, further comprising:
maintaining a lexicon of network new words,
wherein, in the step of predicting the error condition of the text, the lexicon is referred to so as to avoid identifying network new words as errors.
25. The method of claim 1, further comprising:
maintaining a knowledge base in which corresponding correct and incorrect participles are recorded, wherein the corresponding correct and incorrect participles are derived based on the prediction result,
wherein, in the step of predicting the error condition of the text, the knowledge base is referred to.
26. A text error prediction method, comprising:
acquiring a plurality of elements contained in a text;
determining feature data for at least one of the plurality of elements;
generating a feature set of the text by taking the plurality of elements and the feature data as features;
and predicting the error condition of the text by using an error prediction model based on the feature set.
27. The method of claim 26, wherein
the prediction result of the error condition of the text is a binary sequence, and each bit of the binary sequence represents the correctness of the corresponding character in the text.
28. The method of claim 26, wherein predicting the error condition of the text comprises:
determining feature identifications corresponding to the plurality of features respectively;
and taking the feature identification as an input of the error prediction model to predict the text.
29. The method of claim 28, wherein
the feature identifications corresponding to the plurality of features are acquired from a feature database.
30. The method of claim 26, wherein
the error prediction model is a BiLSTM-CRF model.
31. The method of claim 26,
the text comprises a query sentence input by a user; or
The text includes articles of a predetermined author or a predetermined owner or a predetermined source.
32. The method of claim 26, wherein the plurality of elements comprise product names and/or brand names,
the step of predicting the error condition of the text comprises: predicting the error condition of the product names and/or brand names in the text.
33. The method of claim 26, further comprising:
maintaining a lexicon of network new words,
wherein, in the step of predicting the error condition of the text, the lexicon is referred to so as to avoid identifying network new words as errors.
34. The method of claim 26, further comprising:
maintaining a knowledge base in which corresponding correct and incorrect participles are recorded, wherein the corresponding correct and incorrect participles are derived based on the prediction result,
wherein, in the step of predicting the error condition of the text, the knowledge base is referred to.
35. A text error correction apparatus, comprising:
an element acquisition device configured to acquire a plurality of elements contained in a text;
a feature extraction device configured to determine feature data of at least one of the plurality of elements;
a feature set device configured to generate a feature set of the text by using the plurality of elements and the feature data as features;
an error prediction device configured to predict an error condition of the text based on the feature set; and
an error correction device configured to correct the text based on a prediction result.
CN201911029376.4A 2019-10-28 2019-10-28 Text error correction method and device Active CN112733529B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911029376.4A CN112733529B (en) 2019-10-28 2019-10-28 Text error correction method and device

Publications (2)

Publication Number Publication Date
CN112733529A (en) 2021-04-30
CN112733529B (en) 2023-09-29

Family

ID=75589459

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911029376.4A Active CN112733529B (en) 2019-10-28 2019-10-28 Text error correction method and device

Country Status (1)

Country Link
CN (1) CN112733529B (en)

Patent Citations (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102156551A (en) * 2011-03-30 2011-08-17 北京搜狗科技发展有限公司 Method and system for correcting error of word input
CN105468468A (en) * 2015-12-02 2016-04-06 北京光年无限科技有限公司 Data error correction method and apparatus facing question answering system
CN106527756A (en) * 2016-10-26 2017-03-22 长沙军鸽软件有限公司 Method and device for intelligently correcting input information
CN106776501A (en) * 2016-12-13 2017-05-31 深圳爱拼信息科技有限公司 A kind of automatic method for correcting of text wrong word and server
CN107122346A (en) * 2016-12-28 2017-09-01 平安科技(深圳)有限公司 The error correction method and device of a kind of read statement
CN107562718A (en) * 2017-07-24 2018-01-09 科大讯飞股份有限公司 Text normalization method and device, storage medium and electronic equipment
CN108052499A (en) * 2017-11-20 2018-05-18 北京百度网讯科技有限公司 Text error correction method, device and computer-readable medium based on artificial intelligence
CN107832447A (en) * 2017-11-22 2018-03-23 北京百度网讯科技有限公司 User feedback error correction method, device and its equipment for mobile terminal
CN107992211A (en) * 2017-12-08 2018-05-04 中山大学 A kind of Chinese character spelling wrong word correcting method based on CNN-LSTM
CN108874174A (en) * 2018-05-29 2018-11-23 腾讯科技(深圳)有限公司 A kind of text error correction method, device and relevant device
CN108932342A (en) * 2018-07-18 2018-12-04 腾讯科技(深圳)有限公司 A kind of method of semantic matches, the learning method of model and server
CN109492202A (en) * 2018-11-12 2019-03-19 浙江大学山东工业技术研究院 A kind of Chinese error correction of coding and decoded model based on phonetic
CN110147549A (en) * 2019-04-19 2019-08-20 阿里巴巴集团控股有限公司 For executing the method and system of text error correction
CN110276069A (en) * 2019-05-17 2019-09-24 中国科学院计算技术研究所 A kind of Chinese braille mistake automatic testing method, system and storage medium

Also Published As

Publication number Publication date
CN112733529B (en) 2023-09-29

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant