CN112733529B - Text error correction method and device - Google Patents

Text error correction method and device

Info

Publication number
CN112733529B
Authority
CN
China
Prior art keywords
text
feature
features
elements
error
Prior art date
Legal status
Active
Application number
CN201911029376.4A
Other languages
Chinese (zh)
Other versions
CN112733529A
Inventor
刘恒友
李辰
包祖贻
徐光伟
李林琳
Current Assignee
Alibaba Group Holding Ltd
Original Assignee
Alibaba Group Holding Ltd
Priority date
Filing date
Publication date
Application filed by Alibaba Group Holding Ltd
Priority to CN201911029376.4A
Publication of CN112733529A
Application granted
Publication of CN112733529B
Legal status: Active

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • G06F16/3332Query translation

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a text error correction method and device. A plurality of elements contained in a text are acquired; feature data of at least one of the elements is determined; a feature set of the text is generated with the elements and the feature data as features; an error condition of the text is predicted based on the feature set; and the text is corrected based on the prediction result. This provides support for reducing the rate of erroneous corrections (mis-corrections) and improving the quality of text error correction.

Description

Text error correction method and device
Technical Field
The present invention relates to the field of information retrieval technologies, and in particular, to a text error correction method and apparatus.
Background
With the popularization of the internet, information on the internet has become increasingly abundant. In various search scenarios, people can conveniently obtain the information they need by entering query sentences (queries) in a search bar. In practice, however, users often enter erroneous queries when searching.
Currently, most search engines introduce an error correction mechanism: the query input by the user is checked and, if erroneous, corrected into the correct query, so that search results matching the user's need can be returned. However, mis-corrections often occur, for example, a correct query input by the user being "corrected" into an erroneous one. In such cases, the search results returned to the user do not meet the user's needs, which greatly degrades the search experience.
Accordingly, there remains a need for an improved error correction scheme that reduces the mis-correction rate and improves the user experience.
Disclosure of Invention
The invention aims to provide a text error correction method and apparatus that provide support for reducing the mis-correction rate and improving the user experience.
According to a first aspect of the present disclosure, there is provided a text error correction method, comprising: acquiring a plurality of elements contained in a text; determining feature data of at least one element of the plurality of elements; generating a feature set of the text with the plurality of elements and the feature data as features; predicting an error condition of the text based on the feature set; and correcting the text based on the prediction result.
Optionally, the elements comprise characters and/or words and/or bigrams (binary word segments).
Optionally, the feature data includes: part-of-speech features of characters and/or words; and/or inter-element association features.
Optionally, the inter-element association features include at least one of: position features of characters within words; inter-element dependency features; inter-element correlation features.
Optionally, the feature data further includes: combined features formed from two or more features of an element.
Optionally, the combined features include at least one of: a combination of the position feature of an element in a word and the part-of-speech feature of that word; a combination of the position feature of an element in a word and/or bigram and that word and/or bigram; and a combination of the part-of-speech feature of an element and an inter-element correlation feature.
Optionally, the step of acquiring a plurality of elements contained in the text includes performing word segmentation on the text to obtain the elements; and/or the step of determining feature data of at least one element of the plurality of elements includes at least one of: performing part-of-speech tagging on the text to obtain the part-of-speech features of the characters and/or words; performing dependency parsing on the text to obtain the inter-element dependency features; and obtaining the inter-element correlation features from a feature database.
Optionally, the step of generating the feature set of the text includes: obtaining, from a feature database, the feature identifiers corresponding to the respective features, wherein the feature database is obtained by processing a text data set and stores, in association, a plurality of features obtained from the text data set and the feature identifiers corresponding to those features, the plurality of features including a plurality of elements extracted from the text data set and feature data of those elements; and generating the feature set based on the feature identifiers.
Optionally, the text data set includes at least one of: a general domain data set; a vertical domain data set; a web encyclopedia data set.
Optionally, the feature database further stores, in association, feature vectors corresponding to the respective features, and the step of generating the feature set based on the feature identifiers further includes: obtaining the feature vectors corresponding to the plurality of features based on the feature identifiers; and combining the obtained feature vectors to obtain the feature set.
Optionally, the feature vectors are obtained by feature training on the plurality of elements extracted from the text data set and the feature data of those elements.
Optionally, based on the feature set, an error condition of the text is predicted using an error prediction model.
Optionally, the prediction result is a binary sequence, and each bit of the binary sequence represents the correctness of the corresponding character in the text.
Optionally, the step of predicting an error condition of the text comprises: determining the feature identifiers corresponding to the respective features; and taking the feature identifiers as input to the error prediction model to predict the text.
Optionally, the feature identifiers corresponding to the respective features are obtained from a feature database.
Optionally, the text error correction method may further include: training the error prediction model.
Optionally, the step of training the error prediction model includes: generating a feature set of a corpus, wherein the corpus is text; and training the error prediction model based on the feature set of the corpus.
Optionally, the method further comprises: acquiring a labeling sequence corresponding to the corpus, wherein the labeling sequence characterizes the error condition of the corpus, and the error prediction model is trained based on the feature set of the corpus and the labeling sequence.
Optionally, the labeling sequence corresponding to the corpus is obtained based on an error correction parallel corpus data set.
Optionally, the labeling sequence is a binary sequence, and each bit of the binary sequence represents the correctness of the corresponding character in the corpus; and/or the labeling sequence comprises error labels and/or correct labels.
Optionally, the error prediction model is a BiLSTM-CRF model.
Optionally, the text includes a query sentence entered by a user. Alternatively, the text comprises an article of a predetermined author, predetermined owner, or predetermined source.
Optionally, the plurality of elements includes brand names and/or trade names, and the step of predicting an error condition of the text includes: predicting the error condition of the brand names and/or trade names in the text; and/or the step of correcting the text comprises: correcting the brand names and/or trade names in the text.
Optionally, the method may further include: maintaining a lexicon of new internet words, wherein in the step of predicting the error condition of the text, the lexicon is consulted so as to avoid identifying a new internet word as an error.
Optionally, the method may further include: maintaining a knowledge base in which corresponding correct and incorrect segments are recorded, the corresponding correct and incorrect segments being obtained based on prediction results, wherein the knowledge base is consulted in the step of predicting the error condition of the text.
According to a second aspect of the present disclosure, there is provided a text error prediction method, comprising: acquiring a plurality of elements contained in a text; determining feature data of at least one element of the plurality of elements; generating a feature set of the text with the plurality of elements and the feature data as features; and predicting an error condition of the text using an error prediction model based on the feature set.
Optionally, the prediction result of the error condition of the text is a binary sequence, and each bit of the binary sequence represents the correctness of the corresponding character in the text.
Optionally, the step of predicting an error condition of the text comprises: determining the feature identifiers corresponding to the respective features; and taking the feature identifiers as input to the error prediction model to predict the text.
Optionally, the feature identifiers corresponding to the respective features are obtained from a feature database.
Optionally, the error prediction model is a BiLSTM-CRF model.
Optionally, the text includes a query sentence entered by a user. Alternatively, the text comprises an article of a predetermined author, predetermined owner, or predetermined source.
Optionally, the plurality of elements includes brand names and/or trade names, and the step of predicting an error condition of the text includes: predicting the error condition of the brand names and/or trade names in the text.
Optionally, the method may further include: maintaining a lexicon of new internet words, wherein in the step of predicting the error condition of the text, the lexicon is consulted so as to avoid identifying a new internet word as an error.
Optionally, the method may further include: maintaining a knowledge base in which corresponding correct and incorrect segments are recorded, the corresponding correct and incorrect segments being obtained based on prediction results, wherein the knowledge base is consulted in the step of predicting the error condition of the text.
According to a third aspect of the present disclosure, there is also provided a text error correction apparatus, including: element acquisition means for acquiring a plurality of elements contained in a text; feature extraction means for determining feature data of at least one element of the plurality of elements; feature set means for generating a feature set of the text with the plurality of elements and the feature data as features; error prediction means for predicting an error condition of the text based on the feature set; and error correction means for correcting the text based on the prediction result.
Thus, by processing the text to acquire the plurality of elements it contains and generating a feature set from those elements and their corresponding feature data, richer inter-element relationships can be captured, providing support for improving the quality of related services. This method of generating a feature set of a text is suitable both for training a text error prediction model and for predicting errors in a text, and can therefore reduce the mis-correction rate and improve the quality of text error correction.
Drawings
The foregoing and other objects, features and advantages of the disclosure will be apparent from the following more particular descriptions of exemplary embodiments of the disclosure, as illustrated in the accompanying drawings, wherein like reference numbers generally represent like parts throughout the exemplary embodiments of the disclosure.
FIG. 1 illustrates a feature database preparation flow diagram according to one embodiment of the present disclosure.
Fig. 2 shows a flow diagram of a method of generating a feature set of text according to one embodiment of the present disclosure.
FIG. 3 illustrates a flow diagram of a method of training an error prediction model in accordance with one embodiment of the present disclosure.
FIG. 4 illustrates a model training flow diagram according to one embodiment of the present disclosure.
Fig. 5 shows a flow diagram of a text error prediction method in accordance with one embodiment of the present disclosure.
FIG. 6 illustrates an error prediction model application flow diagram in accordance with one embodiment of the present disclosure.
Fig. 7 shows a flow diagram of a text error correction method according to one embodiment of the present disclosure.
Fig. 8 shows a flow diagram of text error correction according to one embodiment of the present disclosure.
Fig. 9 shows a schematic block diagram of an apparatus for generating a feature set of text according to one embodiment of the present disclosure.
FIG. 10 illustrates a schematic block diagram of an apparatus for training an error prediction model in accordance with one embodiment of the present disclosure.
FIG. 11 shows a schematic block diagram of a text error prediction apparatus in accordance with one embodiment of the present disclosure.
Fig. 12 shows a schematic block diagram of a text error correction apparatus according to one embodiment of the present disclosure.
Fig. 13 shows a flow diagram of a text error correction method according to one embodiment of the present disclosure.
FIG. 14 shows a flow diagram of a text error prediction method in accordance with one embodiment of the present disclosure.
FIG. 15 illustrates a schematic diagram of a computing device that may be used to implement a method according to an embodiment of the present disclosure.
Detailed Description
Preferred embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While the preferred embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
As described above, in order to avoid mis-corrections as far as possible and reduce the mis-correction rate, for example in search scenarios, the present disclosure proposes a text error correction scheme. Within it, a scheme for generating a feature set of a text is proposed; this scheme may be applied to training a text error prediction model as well as to predicting the error condition of a text, thereby avoiding mis-corrections, reducing the mis-correction rate, and improving user experience.
For a better understanding of the text error correction scheme of the present disclosure, the following description is organized by stage: a feature database preparation stage, a feature set generation stage, an error prediction model training stage, a text error prediction stage, and an error correction stage.
It should be understood that the stages are distinguished only for better understanding of the technical solution of the present disclosure and do not limit the embodiments in any way. For example, the feature set generation method described in the feature set generation stage below may be applied in the error prediction model training stage, the text error prediction stage, and the text error correction stage; this will not be repeated below.
The text error prediction and error correction schemes of the present disclosure, and the details involved, are described below with reference to the drawings and examples.
Feature database (including feature vectors) preparation stage
In the embodiments of the present invention, a feature database may first be maintained in advance to support later feature set generation, model training, model application, and so on.
In one embodiment, the feature database may be obtained by processing a text data set. The text data set may include, but is not limited to, one or more types of text data sets.
The feature database may be maintained, for example, based on a massive text data set. The text data sets used, and hence the feature databases maintained, differ across application scenarios. The text data set may be, for example, a general domain data set, yielding a general domain feature database; or a vertical domain data set, yielding a vertical domain feature database; or a web encyclopedia data set. The disclosure is not limited in this regard. The feature database is maintained, for example, through word segmentation and/or feature extraction processing on the massive text data set.
FIG. 1 illustrates a feature database preparation flow diagram according to one embodiment of the present disclosure.
As shown in fig. 1, the text data sets used to maintain the feature database may include, by way of example, a new-retail e-commerce search history data set and a web encyclopedia data set (e.g., Wikipedia).
First, the text data set may be preprocessed in order to obtain elements and/or feature data. The preprocessing may include common natural language processing steps, including but not limited to word segmentation, part-of-speech (POS) tagging, and dependency parsing (Dependency Parsing) of the text data set.
After preprocessing, the elements contained in the text, such as characters and/or words and/or bigram segments, can be obtained.
Next, features may be extracted from the preprocessed data; for example, feature extraction may be performed on the obtained elements to obtain the corresponding feature data.
In the embodiments of the disclosure, as much feature data as possible may be extracted, in various ways, to enrich the relationships between elements and thereby better support applications based on that feature data (such as training a text error prediction model or applying such a model).
The feature data may include the elements' own features, such as characters and/or words and/or bigrams; attribute features of elements, such as part-of-speech features of characters and/or words; and inter-element association features, such as position features of characters within words, inter-element dependency features, and inter-element correlation features.
In one embodiment, features of different granularities may also be extracted to enrich the relationships between elements.
For example, discrete features may be extracted for an element. The discrete features may include, for example, word segmentation features, part-of-speech (POS) features, dependency type features (Dependency type), dependency word features (Dependency word), character features (Character), bigram features (Bigram), position features (Position), and the like.
For example, inter-element correlation features may be extracted for elements. An inter-element correlation feature may be, for example, a PMI score. PMI (Pointwise Mutual Information) measures the correlation between two things (e.g., two words).
In the embodiments of the invention, the correlation between two elements (characters and/or words and/or bigram segments, etc.) may be measured based on PMI. The PMI score can be obtained by the following formula:
PMI(w1, w2) = log( P(w1, w2) / (P(w1) * P(w2)) )
where PMI(w1, w2) is the correlation score of element 1 and element 2; w1 denotes element 1 and w2 denotes element 2; P(w1) is the probability that element 1 appears in the text data set, i.e., the ratio of the number of occurrences of element 1 to the total number of tokens in the text data set; P(w2) is the probability that element 2 appears in the text data set, defined analogously; and P(w1, w2) is the probability that element 1 and element 2 co-occur in the text data set, i.e., the ratio of the number of their co-occurrences to the total number of tokens in the text data set. Typically, element 1 and element 2 are two words, Word1 and Word2.
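By way of illustration only, the sketch below computes PMI scores from adjacency counts in a tokenized corpus; the toy corpus and the adjacency-based definition of co-occurrence are assumptions made for this example rather than requirements of the scheme:

```python
import math
from collections import Counter

def pmi_scores(sentences):
    """Compute PMI(w1, w2) for adjacent word pairs in a tokenized corpus.

    `sentences` is a list of token lists; co-occurrence is taken to mean
    adjacency within a sentence.
    """
    word_counts = Counter()
    pair_counts = Counter()
    total = 0
    for tokens in sentences:
        word_counts.update(tokens)
        pair_counts.update(zip(tokens, tokens[1:]))
        total += len(tokens)
    scores = {}
    for (w1, w2), c12 in pair_counts.items():
        p1 = word_counts[w1] / total    # P(w1)
        p2 = word_counts[w2] / total    # P(w2)
        p12 = c12 / total               # P(w1, w2)
        scores[(w1, w2)] = math.log(p12 / (p1 * p2))
    return scores

corpus = [["红米", "手机"], ["苹果", "手机"], ["红米", "手机"]]
print(pmi_scores(corpus)[("红米", "手机")])  # higher score = stronger correlation
```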
The feature data may further include combined features, formed from two or more features of an element, which can likewise be extracted for each element.
The combined features may include, for example, a combination of the position feature of an element in a word with the part-of-speech feature of that word; a combination of the position feature of an element in a word and/or bigram with that word and/or bigram; a combination of the part-of-speech feature of an element with an inter-element correlation feature; and so on.
As an example, the combination of the position feature of an element in a word with the part-of-speech feature of that word may also be referred to as a position-POS feature. To extract position-POS features, each Chinese character (Character) in a word is traversed, and the character's position marker is combined with the POS of the word containing it to form a character-granularity feature, i.e., the combination of the character's position in the word with the part of speech of the word.
In one embodiment, the position markers may be { "B", "I" }, where "B" represents a starting position and "I" a non-starting position. For example, for the noun "television __ NN", the corresponding position-POS features are { B_NN, I_NN }.
As an example, the combination of the position feature of an element in a word and/or bigram with that word and/or bigram may also be referred to as a segmentation feature. When extracting segmentation features, the feature may be determined based on the word and/or bigram in which the element occurs within the text.
For example, for a query sentence (query) that is segmented as A1A2_B1B2B3_C1_D1, the segmentation feature of character B2 is I_word(B1B2B3), and the segmentation feature of character A1 is B_word(A1A2).
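For illustration, a sketch of extracting both character-granularity feature types from a segmented, POS-tagged sentence; the B_/I_ tag format follows the examples above, while the function names are our own:

```python
def position_pos_features(words_with_pos):
    """Position-POS features: for each character, a B_/I_ position marker
    combined with the POS of the word containing it (e.g., B_NN, I_NN)."""
    feats = []
    for word, pos in words_with_pos:
        for i, _ch in enumerate(word):
            feats.append(("B_" if i == 0 else "I_") + pos)
    return feats

def segmentation_features(words):
    """Segmentation features: for each character, B_word(w) or I_word(w),
    where w is the word or bigram segment containing the character."""
    feats = []
    for word in words:
        for i, _ch in enumerate(word):
            feats.append(("B_word" if i == 0 else "I_word") + f"({word})")
    return feats

print(position_pos_features([("电视", "NN")]))  # ['B_NN', 'I_NN'], as above
print(segmentation_features(["红米", "手机壳"]))  # B_word/I_word per character
```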
As an example, the combination of the part-of-speech features of elements with the inter-element correlation (PMI) features may also be referred to as a POS-PMI combined feature. Because the same adjacent word pair can have different meanings under different POS pairs, the POS-PMI combined features are used in the embodiments of the disclosure as a supplement to the PMI features, so as to obtain more feature data. In addition, in the embodiments of the present disclosure, for features involving the PMI score, the PMI score may further be discretized, and a corresponding feature identifier assigned to each discretized feature value.
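One plausible way to discretize a continuous PMI score into bucketed values that can each receive a feature identifier; the bucket width and clipping range here are assumptions:

```python
def discretize_pmi(score, bucket_width=0.5, lo=-5.0, hi=5.0):
    """Map a continuous PMI score to a discrete bucket label so that it
    can be assigned a feature ID like any other discrete feature."""
    clipped = max(lo, min(hi, score))
    bucket = int((clipped - lo) // bucket_width)
    return f"PMI_BUCKET_{bucket}"

print(discretize_pmi(1.37))  # 'PMI_BUCKET_12'
```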
It is to be understood that the above is merely illustrative of the characteristic data to which the present disclosure relates and not limiting. In other embodiments, other characteristic data may also be included, which is not described in detail herein.
To facilitate use of the elements and feature data in the feature database, in one embodiment a feature identifier may be assigned to each element or feature datum as it is acquired, so that in subsequent use the required element or feature data can be quickly looked up by its feature identifier.
A unique feature identifier may be assigned to each feature or each class of features. The feature identifier may be a binary sequence or any computer-recognizable character string; the disclosure is not limited in this regard.
In the feature database, the features obtained from the text data set and their corresponding feature identifiers may be stored in association. For example, as shown in fig. 1, mappings from feature names to feature identifiers may be saved as files, such as a feature-name-to-feature-ID mapping table for discrete features, a PMI-score-feature-to-feature-ID mapping table, and a feature-name-to-feature-ID mapping table for combined features.
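A minimal sketch of such a feature-name-to-feature-ID mapping table, persisted to a file so that training and prediction use the same identifiers; the class and file names are illustrative:

```python
import json

class FeatureVocab:
    """Feature-name-to-feature-ID mapping; IDs are assigned on first sight."""
    def __init__(self):
        self.name_to_id = {}

    def feature_id(self, name):
        if name not in self.name_to_id:
            self.name_to_id[name] = len(self.name_to_id)
        return self.name_to_id[name]

    def save(self, path):
        with open(path, "w", encoding="utf-8") as f:
            json.dump(self.name_to_id, f, ensure_ascii=False)

vocab = FeatureVocab()
print(vocab.feature_id("B_NN"), vocab.feature_id("I_word(手机壳)"))  # 0 1
vocab.save("discrete_feature_ids.json")
```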
Returning to the flowchart shown in fig. 1, feature pre-training may also be performed on the text data set in order to obtain the related feature vectors. Feature pre-training may be performed based on the text data set using a predetermined tool, and the feature vectors corresponding to the elements or feature data may also be stored in association in the maintained feature database.
In the embodiments of the present invention, a corpus composed of predetermined text data sets (including but not limited to the text data sets described above and the error correction parallel corpus data set described below) may be used, and word vectors (word embeddings), character vectors (character embeddings), bigram vectors (bigram embeddings), and the like may be trained using the fastText tool.
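The disclosure names fastText but not its invocation; one possible sketch of unsupervised pre-training over a segmented corpus file (the file names and hyperparameters are assumptions) is:

```python
import fasttext

# corpus_words.txt: one segmented sentence per line, tokens space-separated.
# Character and bigram vectors would be trained the same way on character-
# and bigram-tokenized versions of the corpus.
model = fasttext.train_unsupervised("corpus_words.txt", model="skipgram", dim=100)

vec = model.get_word_vector("手机")   # pre-trained word embedding
model.save_model("word_embeddings.bin")
```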
Thus far, the preparation stage of the feature database in the embodiments of the present disclosure has been described in detail with reference to fig. 1. The feature database maintained at this stage may be used in any of the stages described below.
The above feature database is applicable to any scenario, and particularly to search scenarios. By maintaining the feature data from a domain text data set and a web encyclopedia data set, and extracting as many features as possible, query sentences entered by users in a search scenario can be conveniently corrected and results matching the query intent returned, thereby improving user experience.
Generation of feature sets for text
Fig. 2 shows a flow diagram of a method of generating a feature set of text according to one embodiment of the present disclosure. The details of this feature set generation scheme are the same as or similar to those of the feature database preparation stage; for the specific implementation, refer to the description above, which is not repeated here. It should be appreciated that the text here may be massive text, a corpus involved in the model training stage, text involved in the text error prediction stage, and/or query sentences involved in the text error correction stage; the disclosure is not limited in this regard.
As shown in fig. 2, in step S210, a plurality of elements included in the text are acquired.
The text may be Chinese text, English text, or a character string mixing Chinese and English (including punctuation marks, etc.).
The text may be obtained from a variety of sources, and its meaning differs across scenarios. For example, in the model training stage described below, the text may be a text data set, while in the text error prediction stage described below, the text may be the text to be checked, such as a query sentence entered by a user in a search box, or text obtained by converting the user's spoken query to text; the disclosure is not limited in this regard.
In addition, the text may be an article of a predetermined author, predetermined owner, or predetermined source, for example, certain articles by online writers. With the method of the invention, such articles can be checked for errors or corrected, thereby providing specialized services to the relevant users.
The elements are obtained by data processing of the text, and which elements are obtained depends on the specific processing performed. In one embodiment, the processing may include, for example, word segmentation, part-of-speech (POS) tagging, and dependency parsing (Dependency Parsing), and the elements may include, for example, the characters and/or words and/or bigram segments resulting from segmenting the text.
In the embodiments of the present disclosure, the text may be processed with a multi-granularity word segmentation algorithm, or segmented at granularities such as unigram, bigram, multi-gram, and exact segmentation; the disclosure does not limit the word segmentation method used. Preferably, a multi-granularity segmentation algorithm is used, so as to obtain as many elements and features as possible.
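Purely as an illustration, multi-granularity views of a text can be produced with the open-source jieba tokenizer plus character- and bigram-level splits; jieba is our choice for this sketch, not a tool named by the disclosure:

```python
import jieba

text = "红米手机壳"
words = jieba.lcut(text)                                 # word-level segments
chars = list(text)                                       # character-level elements
bigrams = [text[i:i + 2] for i in range(len(text) - 1)]  # bigram segments

print(words, chars, bigrams)
```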
In step S220, feature data of at least one element of the plurality of elements is determined.
Here, the corresponding feature data may be obtained by performing feature extraction processing on the elements.
The feature data may include the elements themselves, feature data associated with individual elements, and feature data relating different elements.
As in the feature database preparation stage described above, the elements may include, for example, characters and/or words and/or bigrams. The feature data may include part-of-speech features of characters (Character) and/or words (Word, including segments produced by bigram segmentation), and may also include inter-element association features.
The inter-element association features include, for example, the position (Position) feature of a character in a word, inter-element dependency features (e.g., the dependency type feature (Dependency type) and the dependency word feature (Dependency word)), and inter-element correlation features.
The feature data may also include combined features of two or more features of an element, for example: a combination of the position feature of an element in a word and the part-of-speech feature of that word; a combination of the position feature of an element in a word and/or bigram and that word and/or bigram; and a combination of the part-of-speech feature of an element and an inter-element correlation feature.
The at least one element of interest may be a domain-specific word or term; for example, brand names and/or trade names may be included among the plurality of elements. For error prediction or correction, attention may be focused on the words of such a specific domain. Thus, in the process of predicting the error condition of the text described below, the error condition of the brand names and/or trade names in the text may be emphasized; likewise, in the process of correcting the text described below, the brand names and/or trade names may be emphasized.
In step S210, the elements may be obtained by performing word segmentation on the text. In step S220, part-of-speech tagging may be performed on the text to obtain the part-of-speech features of the characters and/or words; dependency parsing may be performed on the text to obtain the inter-element dependency features; and the inter-element correlation features may be obtained from a feature database.
Then, in step S230, a feature set of the text is generated by using the plurality of elements and the feature data as features.
Here, feature training may be performed on the plurality of elements and the feature data to obtain corresponding feature vectors, and generate a feature set of the text.
In the feature database maintained as described above, a number of features and corresponding feature vectors have already been obtained for the massive text data set. In one embodiment, based on the obtained elements and feature data, the feature identifiers corresponding to the respective features may be obtained from the feature database and the feature set generated from them; in the feature database, the plurality of features includes the plurality of elements extracted from the text data set and the feature data of those elements.
In an embodiment, the feature database may further store, in association, the feature vectors corresponding to the plurality of features, with the feature identifiers and feature vectors in correspondence. When generating the feature set, the feature vectors corresponding to the plurality of features may be obtained based on the feature identifiers, and the obtained feature vectors combined to obtain the feature set.
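A sketch of this lookup-and-combine step, assuming an embedding table keyed by feature ID; the names and dimensions are illustrative:

```python
import numpy as np

def build_feature_set(feature_names, name_to_id, embeddings):
    """Map each feature name to its feature ID, fetch the associated
    feature vector, and stack the vectors into the text's feature set."""
    vectors = [embeddings[name_to_id[name]] for name in feature_names]
    return np.stack(vectors)  # shape: (num_features, dim)

name_to_id = {"B_NN": 0, "I_NN": 1, "PMI_BUCKET_12": 2}
embeddings = np.random.rand(3, 100).astype(np.float32)  # stand-in table
feature_set = build_feature_set(["B_NN", "I_NN", "PMI_BUCKET_12"],
                                name_to_id, embeddings)
print(feature_set.shape)  # (3, 100)
```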
Thus far, the method of generating a feature set of a text has been described in detail in connection with fig. 2. This method can be applied to any text, and to any scenario where a feature set of a text is needed, including but not limited to the error prediction model training scenario and the text error prediction scenario described below; how the feature set is generated will therefore not be described again there. The extracted part-of-speech features, PMI score features, combined features, and the like play an important role in judging the error condition of a text; training the error prediction model on these features can greatly improve the accuracy of the model, and hence the accuracy of text error prediction based on it.
Training an error prediction model
FIG. 3 illustrates a flow diagram of a method of training an error prediction model in accordance with one embodiment of the present disclosure. FIG. 4 illustrates a model training flow diagram according to one embodiment of the present disclosure. The model may be trained on a training corpus, which is text serving as training samples. In a preferred embodiment, the training corpus may be query sentences historically entered by users, in particular erroneous user queries that were subsequently corrected.
In one embodiment, the training corpus may be chosen according to the application scenario of the model. In the following examples of the present disclosure, the model training scheme is illustrated with a historical error correction parallel corpus as the training corpus; this corpus is a data set consisting of <erroneous query, corresponding correct query> pairs.
As shown in fig. 3, at step S310, a feature set of a training corpus may be generated. Here, the feature set of the training corpus may be generated using the aforementioned method, for example.
Thereafter, at step S320, the error prediction model is trained based on the feature set. The error prediction model may be a BiLSTM-CRF.
Specifically, referring to the flow shown in fig. 4, the error correction parallel corpus serves as the training corpus. First, in step S411, data preprocessing may be performed on the corpus; as before, this may include word segmentation, part-of-speech tagging, and dependency parsing, yielding a plurality of elements such as characters and/or words and/or bigram segments.
After that, in step S412, feature extraction may be performed to obtain the corresponding features.
As before, the extracted feature data may include part-of-speech features of characters and/or words, and inter-element association features. The inter-element association features include at least one of: position features of characters within words; inter-element dependency features; and inter-element correlation features. The feature data may also include combined features of two or more features of an element, including at least one of: a combination of the position feature of an element in a word and the part-of-speech feature of that word; a combination of the position feature of an element in a word and/or bigram and that word and/or bigram; and a combination of the part-of-speech feature of an element and an inter-element correlation feature.
The step of obtaining the plurality of elements contained in the text may include performing word segmentation on the text to obtain the elements. The step of determining feature data of at least one element of the plurality of elements includes at least one of: performing part-of-speech tagging on the text to obtain the part-of-speech features of the characters and/or words; performing dependency parsing on the text to obtain the inter-element dependency features; and obtaining the inter-element correlation features from a feature database.
The error prediction model may be associated with the aforementioned feature database. In step S413, the feature identifiers corresponding to the respective features may be obtained from the feature database based on the extracted feature data.
The feature identifiers (IDs) of the features serve as inputs to the error prediction model. For word embeddings, character embeddings, and bigram embeddings, the corresponding embedding vectors may be loaded from the pre-trained embedding file. For the remaining feature IDs, the corresponding embedding vectors may be taken from a randomly initialized embedding table of the BiLSTM.
Thereafter, at step S414, the obtained embedding vectors may be combined to generate the feature set, and the error prediction model trained based on it.
In one embodiment, the corpus may be labeled, and the model trained using the feature set together with the corresponding labeling sequence.
Referring to fig. 4, in step S415 the labeling sequence corresponding to the corpus may be obtained, and the error prediction model then trained by combining the feature set and the labeling sequence.
A predetermined labeling tool may be used to generate the sequence labeling data of the corpus. The labeling sequence characterizes the error condition of the corpus and may be obtained based on the error correction parallel corpus data set.
In one embodiment, the labeling sequence may be a binary sequence whose bits each represent the correctness of the corresponding character in the corpus. For example, the labeling sequence may be [0, 1, 0, ..., 1], where a 0 in position i indicates that the i-th character of the original query is correct, and a 1 indicates that the i-th character is erroneous.
In one embodiment, the labeling sequence may include error labels and/or correct labels. Similar to the binary sequence, the correctness of each character of the corpus may be labeled with error and/or correct labels; details are omitted here.
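For illustration, a per-character labeling derived from one <erroneous query, correct query> pair; the position alignment assumes substitution errors only, which is a simplification of ours:

```python
def label_sequence(wrong_query, correct_query):
    """One 0/1 label per character: 0 = correct, 1 = erroneous. Assumes
    the two queries are position-aligned (substitution errors only)."""
    assert len(wrong_query) == len(correct_query)
    return [0 if a == b else 1 for a, b in zip(wrong_query, correct_query)]

print(label_sequence("红米手机课", "红米手机壳"))  # [0, 0, 0, 0, 1]
```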
Thereafter, in step S414, the foregoing embedding vectors may be combined and fed as input to the BiLSTM to obtain its forward and backward output vectors. These vectors are concatenated and used as input to the CRF layer, with the parallel-corpus labeling sequence data as the CRF output, and the CRF model is trained.
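A compact sketch of such a BiLSTM-CRF tagger in PyTorch, with the CRF layer supplied by the third-party pytorch-crf package; all layer sizes are assumptions:

```python
import torch
import torch.nn as nn
from torchcrf import CRF  # pip install pytorch-crf

class BiLSTMCRF(nn.Module):
    """Feature-ID inputs -> embeddings -> BiLSTM -> CRF over {0, 1} tags."""
    def __init__(self, num_feature_ids, emb_dim=100, hidden=128, num_tags=2):
        super().__init__()
        self.embed = nn.Embedding(num_feature_ids, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden, bidirectional=True, batch_first=True)
        self.proj = nn.Linear(2 * hidden, num_tags)  # forward+backward concat
        self.crf = CRF(num_tags, batch_first=True)

    def _emissions(self, feature_ids):
        return self.proj(self.lstm(self.embed(feature_ids))[0])

    def loss(self, feature_ids, tags):
        return -self.crf(self._emissions(feature_ids), tags)  # neg. log-likelihood

    def predict(self, feature_ids):
        return self.crf.decode(self._emissions(feature_ids))  # best 0/1 sequence

model = BiLSTMCRF(num_feature_ids=50000)
ids = torch.randint(0, 50000, (1, 5))   # feature IDs for one 5-character query
tags = torch.tensor([[0, 0, 0, 0, 1]])  # its labeling sequence
print(model.loss(ids, tags), model.predict(ids))
```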
Thereafter, in step S416, the trained error prediction model may be saved to a BiLSTM-CRF model file.
Thus far, the scheme for training the error prediction model of the present disclosure has been described in detail in connection with FIGS. 3-4.
The trained error prediction model (e.g., the BiLSTM-CRF model) may be used to predict error conditions of a given text.
Text error prediction
Fig. 5 shows a flow diagram of a text error prediction method in accordance with one embodiment of the present disclosure.
FIG. 6 illustrates an error prediction model application flow diagram in accordance with one embodiment of the present disclosure.
Referring to the flow shown in fig. 6, text error prediction may be applied in a search scenario, where a query sentence entered by a user serves as the text to be predicted. For this text, referring to fig. 5, at step S510 a feature set of the text may be generated, using the feature set generation method described above. In step S520, an error condition of the text may be predicted using the error prediction model based on the feature set.
Returning to fig. 6, as before, at step S610 data preprocessing may be performed on the text (e.g., a query sentence entered by a user) to obtain the plurality of elements it contains. The preprocessing may include word segmentation, part-of-speech tagging, dependency parsing, and the like.
In step S620, feature extraction may be performed on the obtained elements to obtain a plurality of features. The feature identifiers corresponding to the respective features are determined and used as input to the error prediction model to predict the text.
Specifically, the feature identifiers corresponding to the plurality of features may be obtained from the feature database based on the extracted features, and the obtained identifiers used as input to the error prediction model.
In step S630, the error prediction model may automatically load the corresponding embedding vectors from the trained embedding file based on the feature identifiers, and run BiLSTM-CRF prediction to obtain the prediction result.
In one embodiment, the prediction result is a binary sequence, each bit of which represents the correctness of the corresponding character in the text.
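Tying the stages together, a hypothetical end-to-end prediction call that reuses the helpers sketched earlier (segmentation_features, FeatureVocab, BiLSTMCRF); none of these names come from the disclosure itself:

```python
import torch
import jieba

def predict_errors(text, vocab, model):
    """Preprocess -> extract features -> look up feature IDs -> BiLSTM-CRF.
    Returns one 0/1 flag per character of `text`."""
    words = jieba.lcut(text)                      # word segmentation
    feature_names = segmentation_features(words)  # one feature per character
    ids = torch.tensor([[vocab.feature_id(n) for n in feature_names]])
    return model.predict(ids)[0]

print(predict_errors("红米手机课", vocab, model))  # e.g. [0, 0, 0, 0, 1]
```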
Text error correction
In the embodiments of the disclosure, error correction processing may be applied to the text based on the prediction result for that text.
Fig. 7 shows a flow diagram of a text error correction method according to one embodiment of the present disclosure. Fig. 8 shows a flow diagram of text error correction according to one embodiment of the present disclosure.
As shown in fig. 7, in step S710, the error condition of the text may be predicted; this prediction may be implemented in the manner shown in figs. 5-6.
In step S720, the text may be error corrected based on the prediction result.
As shown in fig. 8, in step S721, the prediction result may be examined to determine whether the text is correct.
If the prediction result indicates that the text contains an error, the process proceeds to step S722, where error correction is applied to the text and the corrected text is returned as the error correction result.
If the prediction result indicates that the text contains no error, the process proceeds to step S723, and the original text is returned as the error correction result.
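A minimal sketch of this branching step; the corrector itself is left abstract, since the disclosure does not fix a particular candidate-generation algorithm:

```python
def correct_text(text, vocab, model, corrector):
    """Return the original text if no error is predicted; otherwise hand
    the flagged character positions to a corrector for rewriting."""
    flags = predict_errors(text, vocab, model)
    if not any(flags):
        return text  # step S723: no error, return the original text
    error_positions = [i for i, f in enumerate(flags) if f == 1]
    return corrector(text, error_positions)  # step S722: corrected text
```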
By generating the feature set of the text and applying the pre-trained error prediction model to the query sentence entered by the user, errors in the user's input can be analyzed in real time and erroneous input corrected in real time, so that results matching the user's query intent are returned, thereby improving user experience.
Fig. 9 shows a schematic block diagram of an apparatus for generating a feature set of text according to one embodiment of the present disclosure. FIG. 10 illustrates a schematic block diagram of an apparatus for training an error prediction model in accordance with one embodiment of the present disclosure. FIG. 11 shows a schematic block diagram of a text error prediction apparatus in accordance with one embodiment of the present disclosure. Fig. 12 shows a schematic block diagram of a text error correction apparatus according to one embodiment of the present disclosure. The functional modules of these apparatuses may be implemented by hardware, by software, or by a combination of hardware and software implementing the principles of the present disclosure. Those skilled in the art will appreciate that the functional modules depicted in figs. 9-12 may be combined or divided into sub-modules to implement the principles described above. Accordingly, the description herein supports any possible combination or division, or even further definition, of the functional modules described.
The following is a brief description of the functional modules that the apparatus may have and the operations that each functional module may perform, and details related thereto may be referred to the above related description, which is not repeated herein.
As shown in fig. 9, the means 900 for generating a feature set of a text may include an element acquisition means 910, a feature extraction means 920, and a feature set means 930.
The element acquisition means 910 may acquire the plurality of elements contained in the text. The feature extraction means 920 may determine feature data of at least one element of the plurality of elements. The feature set means 930 may generate the feature set of the text with the plurality of elements and the feature data as features.
As shown in fig. 10, the training apparatus 1000 of the error prediction model may include a feature set generation means 1010 and a training means 1020.
The feature set generation means 1010 may generate the feature set of the corpus using the method of generating a text feature set described above. The training means 1020 may train the error prediction model based on the feature set.
As shown in fig. 11, the text error prediction apparatus 1100 may include a feature set generation means 1110 and an error prediction means 1120. The feature set generation means 1110 may generate the feature set of the text using the method described above. The error prediction means 1120 may predict the error condition of the text based on the feature set using the error prediction model.
As shown in fig. 12, the text error correction apparatus 1200 may include an error prediction means 1210 and an error correction means 1220. The error prediction means 1210 may predict the error condition of the text using the text error prediction method described above. The error correction means 1220 may correct the text based on the prediction result.
As shown in fig. 1-8, the present invention may also be implemented as a text error correction method or a text error prediction method.
Fig. 13 shows a flow diagram of a text error correction method according to one embodiment of the present disclosure. For details that are the same as those described above, refer to the related description; they are not repeated here.
As shown in fig. 13, at step 1310, a plurality of elements contained in the text are acquired. Wherein the element may comprise a character and/or a word and/or a bigram.
At step 1320, feature data for at least one element of the plurality of elements is determined.
In one embodiment, the feature data may include: part-of-speech features of characters and/or words; and/or inter-element association features. The inter-element association features may include at least one of: position features of characters within words; inter-element dependency features; inter-element correlation features.
In another embodiment, the feature data may further include combined features of two or more features of an element. The combined features include at least one of: a combination of the position feature of an element in a word and the part-of-speech feature of that word; a combination of the position feature of an element in a word and/or bigram and that word and/or bigram; and a combination of the part-of-speech feature of an element and an inter-element correlation feature.
At step 1330, a feature set of the text is generated featuring the plurality of elements and the feature data.
In step 1340, an error condition of the text is predicted based on the feature set.
In step 1350, the text is error corrected based on the prediction.
In an embodiment of the present disclosure, the step of obtaining the plurality of elements contained in the text may include performing word segmentation on the text to obtain the elements; and/or the step of determining feature data of at least one of the plurality of elements may include at least one of: performing part-of-speech tagging on the text to obtain the part-of-speech features of the characters and/or words; performing dependency parsing on the text to obtain the inter-element dependency features; and obtaining the inter-element correlation features from a feature database.
In an embodiment of the present disclosure, the step of generating the feature set of the text may include: obtaining, from a feature database, the feature identifiers corresponding to the respective features, wherein the feature database is obtained by processing a text data set and stores, in association, a plurality of features obtained from the text data set and their corresponding feature identifiers, the plurality of features including a plurality of elements extracted from the text data set and feature data of those elements; and generating the feature set based on the feature identifiers. The text data set includes at least one of: a general domain data set; a vertical domain data set; a web encyclopedia data set.
In an embodiment of the present disclosure, the feature database may further store, in association, the feature vectors corresponding to the plurality of features, and the step of generating the feature set based on the feature identifiers further includes: obtaining the feature vectors corresponding to the plurality of features based on the feature identifiers; and combining the obtained feature vectors to obtain the feature set.
In an embodiment of the disclosure, the feature vectors are obtained by feature training on the plurality of elements extracted from the text data set and the feature data of those elements.
In embodiments of the present disclosure, based on the feature set, an error prediction model may be utilized to predict the error condition of the text. The prediction result of the error condition of the text may be a binary sequence, each bit of which represents the correctness of the corresponding character in the text.
In an embodiment of the present disclosure, the step of predicting an error condition of the text may include: determining the feature identifiers corresponding to the respective features; and taking the feature identifiers as input to the error prediction model to predict the text.
In the embodiments of the present disclosure, the feature identifiers corresponding to the plurality of features may be obtained from a feature database.
In embodiments of the present disclosure, the error prediction model may also be trained. The step of training the error prediction model may comprise: generating a feature set of a corpus, wherein the corpus is text; and training the error prediction model based on the feature set of the corpus.
The method may further comprise obtaining a labeling sequence corresponding to the corpus, the labeling sequence characterizing the error condition of the corpus, and training the error prediction model based on the feature set of the corpus and the labeling sequence. The labeling sequence corresponding to the corpus may be obtained based on the error correction parallel corpus data set.
In the embodiments of the disclosure, the labeling sequence is a binary sequence, each bit of which represents the correctness of the corresponding character in the corpus; and/or the labeling sequence comprises error labels and/or correct labels.
In an embodiment of the present disclosure, the error prediction model is a BiLSTM-CRF model. The text may include a query sentence entered by the user.
FIG. 14 shows a flow diagram of a text error prediction method in accordance with one embodiment of the present disclosure. For details that are the same as those described above, refer to the related description; they are not repeated here.
As shown in fig. 14, in step S1410, a plurality of elements contained in the text are acquired. In step S1420, feature data of at least one element of the plurality of elements is determined. In step S1430, a feature set of the text is generated with the plurality of elements and the feature data as features; the method of generating the feature set is the same as described above. In step S1440, based on the feature set, an error condition of the text is predicted using the error prediction model.
In the embodiment of the disclosure, the prediction result of the error condition of the text may be a binary sequence, and each bit of the binary sequence represents the correctness of the corresponding character in the text.
In an embodiment of the present disclosure, the step of predicting the error condition of the text includes: determining the feature identifiers corresponding to the plurality of features; and feeding the feature identifiers to the error prediction model as input to predict the text.
In the embodiment of the present disclosure, the feature identifiers corresponding to the plurality of features may be obtained from a feature database.
In an embodiment of the present disclosure, the error prediction model may be a BiLSTM-CRF model.
In the disclosed embodiments, the text may include a query statement entered by a user. Alternatively, the text may include articles from a predetermined author, a predetermined owner, or a predetermined source.
In embodiments of the present disclosure, the plurality of elements may include a trade name and/or a brand name. The step of predicting the error condition of the text may thus comprise: predicting the error condition of the trade name and/or brand name in the text. Likewise, the step of correcting the text may include: correcting errors in the trade name and/or brand name in the text.
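Scoping prediction and correction to such elements can be pictured as a post-filter over the binary sequence; the following sketch assumes the entity spans come from the word segmentation described earlier, and restrict_to_spans is an illustrative helper, not part of the patent:

    def restrict_to_spans(labels: list[int], spans: list[tuple[int, int]]):
        """Keep error flags only inside the given (start, end) spans,
        e.g. spans occupied by trade names or brand names."""
        keep = {i for start, end in spans for i in range(start, end)}
        return [flag if i in keep else 0 for i, flag in enumerate(labels)]

    # The brand name occupies positions 0-1; a flag outside it is dropped.
    print(restrict_to_spans([1, 0, 0, 1], [(0, 2)]))  # [1, 0, 0, 0]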
In the embodiment of the disclosure, a lexicon of network new words may also be maintained. When predicting the error condition of the text, this lexicon may be consulted so as to avoid identifying a network new word as an error.
In an embodiment of the present disclosure, a knowledge base may also be maintained, in which pairs of corresponding correct and incorrect segmentations are recorded, the pairs being derived from earlier prediction results. The knowledge base may likewise be consulted in the step of predicting the error condition of the text.
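Both resources, the network new-word lexicon and the segmentation knowledge base, can be pictured as post-filters over the predicted binary sequence: flags are cleared inside known new words and forced on inside segmentations recorded as wrong. A sketch with toy lookup tables (the entries and the helper name are assumptions for illustration):

    NEW_WORDS = {"躺平", "内卷"}  # toy network new-word lexicon
    KNOWN_BAD = {"平果"}          # toy knowledge base: known wrong segmentations

    def apply_lexicons(text: str, labels: list[int]) -> list[int]:
        """Clear error flags inside known network new words; set them
        inside segmentations recorded as incorrect."""
        labels = list(labels)
        for word, flag in [(w, 0) for w in NEW_WORDS] + [(w, 1) for w in KNOWN_BAD]:
            start = text.find(word)
            while start != -1:
                for i in range(start, start + len(word)):
                    labels[i] = flag
                start = text.find(word, start + 1)
        return labels

    print(apply_lexicons("他想躺平", [0, 0, 1, 1]))  # [0, 0, 0, 0]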
In addition, a text error correction apparatus may be implemented. The text error correction apparatus may include an element acquisition device, a feature extraction device, a feature set device, an error prediction device, and an error correction device. The element acquisition device acquires a plurality of elements contained in the text; the feature extraction device determines feature data of at least one element of the plurality of elements; the feature set device generates a feature set of the text with the plurality of elements and the feature data as features; the error prediction device predicts the error condition of the text based on the feature set; and the error correction device corrects the text based on the prediction result.
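The division of labor among the five devices can be mirrored as a small composition; every name and stub below is illustrative scaffolding, not the claimed apparatus:

    class TextCorrector:
        """Composes the five devices described above into one pipeline."""

        def __init__(self, acquire, extract, featurize, predict, correct):
            self.acquire = acquire      # element acquisition device
            self.extract = extract      # feature extraction device
            self.featurize = featurize  # feature set device
            self.predict = predict      # error prediction device
            self.correct = correct      # error correction device

        def run(self, text: str) -> str:
            elements = self.acquire(text)
            feature_data = self.extract(elements)
            feature_set = self.featurize(elements, feature_data)
            binary_sequence = self.predict(feature_set)
            return self.correct(text, binary_sequence)

    corrector = TextCorrector(
        acquire=lambda t: list(t),
        extract=lambda els: [{} for _ in els],
        featurize=lambda els, data: list(zip(els, data)),
        predict=lambda fs: [0] * len(fs),  # stub: everything is correct
        correct=lambda t, seq: t,          # stub: no-op correction
    )
    print(corrector.run("苹果手机"))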
FIG. 15 shows a schematic diagram of a computing device in accordance with an embodiment of the invention.
Referring to fig. 15, a computing device 1500 includes a memory 1510 and a processor 1520.
Processor 1520 may be a multi-core processor or may include multiple processors. In some embodiments, processor 1520 may comprise a general-purpose main processor and one or more special-purpose coprocessors, such as a graphics processing unit (GPU) or a digital signal processor (DSP). In some embodiments, processor 1520 may be implemented using custom circuitry, for example an application-specific integrated circuit (ASIC) or a field-programmable gate array (FPGA).
Memory 1510 may include various types of storage units, such as system memory, read-only memory (ROM), and persistent storage. The ROM may store static data or instructions required by processor 1520 or other modules of the computer. The persistent storage may be a readable and writable storage device, that is, a non-volatile storage device that does not lose stored instructions and data even after the computer is powered down. In some embodiments, a mass storage device (e.g., a magnetic or optical disk, or flash memory) serves as the persistent storage. In other embodiments, the persistent storage may be a removable storage device (e.g., a diskette or an optical drive). The system memory may be a readable and writable volatile storage device, such as dynamic random access memory, and may store instructions and data needed by some or all of the processors at runtime. Furthermore, memory 1510 may comprise any combination of computer-readable storage media, including various types of semiconductor memory chips (DRAM, SRAM, SDRAM, flash memory, programmable read-only memory), magnetic disks, and/or optical disks. In some implementations, memory 1510 may include a readable and/or writable removable storage device, such as a compact disc (CD), a read-only digital versatile disc (e.g., DVD-ROM or dual-layer DVD-ROM), a read-only Blu-ray disc, an ultra-density optical disc, a flash memory card (e.g., an SD card, a mini SD card, or a micro SD card), a magnetic floppy disk, and so forth. Computer-readable storage media do not contain carrier waves or transient electronic signals transmitted wirelessly or by wire.
The memory 1510 stores executable code that, when executed by the processor 1520, causes the processor 1520 to perform the methods described above.
The text error correction method and apparatus according to the present invention, together with the associated feature set generation, model training, and model application methods and apparatus, have been described in detail above with reference to the accompanying drawings.
Furthermore, the method according to the invention may also be implemented as a computer program or computer program product comprising computer program code instructions for performing the steps defined in the above-mentioned method of the invention.
Alternatively, the invention may also be embodied as a non-transitory machine-readable storage medium (or computer-readable storage medium, or machine-readable storage medium) having stored thereon executable code (or a computer program, or computer instruction code) which, when executed by a processor of an electronic device (or computing device, server, etc.), causes the processor to perform the steps of the above-described method according to the invention.
Those of skill would further appreciate that the various illustrative logical blocks, modules, circuits, and algorithm steps described in connection with the disclosure herein may be implemented as electronic hardware, computer software, or combinations of both.
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems and methods according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The foregoing description of embodiments of the invention has been presented for purposes of illustration and description, and is not intended to be exhaustive or limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein was chosen in order to best explain the principles of the embodiments, the practical application, or the technical improvement over technologies found in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

Claims (32)

1. A method for text correction, comprising:
acquiring a plurality of elements contained in the text;
determining characteristic data of at least one element of the plurality of elements;
at least taking the plurality of elements and the feature data as features to generate a feature set of the text;
based on the feature set, predicting the error condition of the text by using an error prediction model; and
based on the prediction result, correcting the text,
wherein the prediction result is a binary sequence, and each bit of the binary sequence represents the correctness of the corresponding character in the text.
2. The method of claim 1, wherein
the elements include characters and/or words and/or bigrams.
3. The method of claim 2, wherein the feature data comprises:
part-of-speech features of characters and/or words; and/or
Inter-element association features.
4. A method according to claim 3, wherein the inter-element association features include at least one of:
the location characteristics of the characters in the words;
inter-element dependency characteristics;
inter-element correlation features.
5. The method of claim 4, wherein the feature data further comprises:
a combined feature of two or more features of an element.
6. The method of claim 5, wherein the combined features comprise at least one of:
a combination of a position feature of the element in a word and a part-of-speech feature of the word;
the position characteristics of the elements in the words and/or the binary word and the combination characteristics of the words and/or the binary word;
and combining the part-of-speech feature of the element with the inter-element correlation feature.
7. The method of claim 4, wherein the step of obtaining a plurality of elements contained in the text comprises:
word segmentation processing is carried out on the text to obtain the elements,
and/or
The step of determining characteristic data of at least one element of the plurality of elements comprises at least one of:
performing part-of-speech tagging on the text to obtain part-of-speech features of the characters and/or words;
performing dependency syntax analysis processing on the text to obtain the inter-element dependency characteristics;
the inter-element correlation features are obtained from a feature database.
8. The method of claim 1, wherein the step of generating the feature set of the text comprises:
obtaining, from a feature database, feature identifiers corresponding to the features, wherein the feature database is obtained by processing a text data set and stores, in association, a plurality of features obtained based on the text data set and the feature identifiers corresponding to those features, the plurality of features comprising a plurality of elements extracted from the text data set and feature data of the elements; and
generating the feature set based on the feature identifiers.
9. The method of claim 8, wherein the text data set comprises at least one of:
a universal domain data set;
a vertical domain data set;
network encyclopedia data set.
10. The method of claim 8, wherein the feature database further stores feature vectors corresponding to the plurality of features, respectively, in association, and wherein generating the feature set based on the feature identification further comprises:
acquiring feature vectors respectively corresponding to the plurality of features based on the feature identifiers;
combining the obtained feature vectors to obtain the feature set.
11. The method of claim 10, wherein the feature vectors are obtained by performing feature training on the plurality of elements extracted from the text data set and on the feature data of the elements.
12. The method of claim 1, wherein predicting an error condition of the text comprises:
determining the feature identifiers corresponding to the features respectively;
and feeding the feature identifiers to the error prediction model as input to predict the text.
13. The method of claim 12, wherein
the feature identifiers corresponding to the features are obtained from a feature database.
14. The method as recited in claim 1, further comprising:
training the error prediction model.
15. The method of claim 14, wherein the step of training the error prediction model comprises:
generating a feature set of a corpus, wherein the corpus is text;
and training the error prediction model based on the feature set of the corpus.
16. The method as recited in claim 15, further comprising:
obtaining a labeling sequence corresponding to the corpus, wherein the labeling sequence characterizes the error condition of the corpus,
and training the error prediction model based on the feature set of the corpus and the labeling sequence.
17. The method of claim 16, wherein
the labeling sequence corresponding to the corpus is acquired based on an error-correction parallel corpus data set.
18. The method of claim 16, wherein
the labeling sequence is a binary sequence, and each bit of the binary sequence represents the correctness of the corresponding character in the corpus; and/or
the labeling sequence includes incorrect labels and/or correct labels.
19. The method of claim 1, wherein
the error prediction model is a BiLSTM-CRF model.
20. The method of claim 1, wherein
the text comprises a query sentence input by a user; or
the text comprises articles of a predetermined author or a predetermined owner or a predetermined source.
21. The method of claim 1, wherein the plurality of elements comprises trade names and/or brand names,
the step of predicting the error condition of the text comprises: predicting the error condition of the trade names and/or brand names in the text; and/or
the step of correcting the text comprises: correcting errors in the trade names and/or brand names in the text.
22. The method as recited in claim 1, further comprising:
maintaining a lexicon of network new words,
wherein in the step of predicting the error condition of the text, the lexicon of network new words is consulted so as to avoid identifying a network new word as an error.
23. The method as recited in claim 1, further comprising:
maintaining a knowledge base in which pairs of corresponding correct and incorrect segmentations are recorded, the pairs being derived based on the prediction result,
wherein in the step of predicting the error condition of the text, the knowledge base is consulted.
24. A method of text misprediction, comprising:
acquiring a plurality of elements contained in the text;
determining characteristic data of at least one element of the plurality of elements;
generating a feature set of the text by taking the elements and the feature data as features;
based on the feature set, predicting an error condition of the text using an error prediction model,
wherein the prediction result of the error condition of the text is a binary sequence, and each bit of the binary sequence represents the correctness of the corresponding character in the text.
25. The method of claim 24, wherein predicting an error condition of the text comprises:
determining the feature identifiers corresponding to the features respectively;
and feeding the feature identifiers to the error prediction model as input to predict the text.
26. The method of claim 25, wherein
the feature identifiers corresponding to the features are obtained from a feature database.
27. The method of claim 24, wherein
the error prediction model is a BiLSTM-CRF model.
28. The method of claim 24, wherein
the text comprises a query sentence input by a user; or
the text comprises articles of a predetermined author or a predetermined owner or a predetermined source.
29. The method of claim 24, wherein the plurality of elements comprises trade names and/or brand names, and
the step of predicting the error condition of the text comprises: predicting the error condition of the trade names and/or brand names in the text.
30. The method as recited in claim 24, further comprising:
maintaining a lexicon of network new words,
wherein in the step of predicting the error condition of the text, the lexicon of network new words is consulted so as to avoid identifying a network new word as an error.
31. The method as recited in claim 24, further comprising:
maintaining a knowledge base in which pairs of corresponding correct and incorrect segmentations are recorded, the pairs being derived based on the prediction result,
wherein in the step of predicting the error condition of the text, the knowledge base is consulted.
32. A text error correction apparatus, comprising:
element acquisition means for acquiring a plurality of elements contained in the text;
feature extraction means for determining feature data of at least one element of the plurality of elements;
feature set means for generating a feature set of the text with the plurality of elements and the feature data as features;
error prediction means for predicting the error condition of the text using an error prediction model based on the feature set; and
error correction means for correcting the text based on the prediction result,
wherein the prediction result is a binary sequence, and each bit of the binary sequence represents the correctness of the corresponding character in the text.