WO2014036827A1 - Text correcting method and user equipment - Google Patents

Text correcting method and user equipment Download PDF

Info

Publication number
WO2014036827A1
WO2014036827A1 (PCT/CN2013/073382)
Authority
WO
WIPO (PCT)
Prior art keywords
text
corrected
model
correction
language model
Prior art date
Application number
PCT/CN2013/073382
Other languages
French (fr)
Chinese (zh)
Inventor
胡楠
杨锦春
Original Assignee
华为技术有限公司 (Huawei Technologies Co., Ltd.)
Priority date
Filing date
Publication date
Application filed by 华为技术有限公司 (Huawei Technologies Co., Ltd.)
Publication of WO2014036827A1 publication Critical patent/WO2014036827A1/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/232Orthographic correction, e.g. spell checking or vowelisation

Definitions

  • the present invention relates to the field of language processing, and in particular, to a text correction method and user equipment.
  • W is the original string sequence <W1, W2, ..., Wn>, that is, the completely correct text; after passing through the noise channel, the noisy text O = <O1, O2, ..., On> is produced.
  • the method of text correction using noise channel theory is to establish a noise channel probability model and find a string sequence W' such that, given the observed string sequence O, the probability of occurrence of W' is the largest; the string sequence O is the text to be corrected.
  • the string sequence W' is an ideal corrected text, which can also be called an ideal string, but the ideal corrected text is not necessarily identical to the correct text W.
  • the string sequence W' is the string with the highest probability P(W|O).
  • P(O|W) is called the channel probability or generation model.
  • the probability P(W) is the probability of occurrence of the string sequence W in the language model.
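The bullets above describe the noisy-channel formulation W' = argmax_W P(W)·P(O|W). A minimal illustrative sketch of scoring candidates this way is given below; the probability tables and candidate set are toy assumptions, not values from the patent.

```python
# Minimal noisy-channel correction sketch: W' = argmax_W P(W) * P(O | W).
# All probabilities below are illustrative toy values, not real model output.

def correct(observed, candidates, language_model, channel_model):
    """Return the candidate W with the highest P(W) * P(O | W)."""
    best, best_score = observed, 0.0
    for w in candidates:
        score = language_model.get(w, 1e-9) * channel_model.get((observed, w), 1e-9)
        if score > best_score:
            best, best_score = w, score
    return best

# P(W): how likely each candidate string is under the language model.
language_model = {"the cat sat": 3e-4, "the cot sat": 1e-6}
# P(O | W): how likely the noisy channel turns W into the observed string O.
channel_model = {("the cot sat", "the cat sat"): 0.02,
                 ("the cot sat", "the cot sat"): 0.90}

print(correct("the cot sat", ["the cat sat", "the cot sat"],
              language_model, channel_model))   # -> "the cat sat"
```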
  • Embodiments of the present invention provide a text correction method and user equipment for improving correction flexibility and correctness.
  • the embodiment of the present invention uses the following technical solutions:
  • a text correction method including:
  • the obtained two or more sub-language models to be combined are combined into a mixed language model; the text to be corrected is corrected according to the mixed language model to obtain correction suggestion text.
  • the preset text classification standard is any one of: a language environment, a subject background, an author, a writing style, and a genre.
  • the method further includes:
  • Two or more sub-language models are established according to the text type in the preset text classification standard.
  • Combining the obtained two or more sub-language models to be combined into a mixed language model includes:
  • the method further includes:
  • An error location of the to-be-processed text is determined by the error detection model, the error location including an erroneous character or an erroneous character string.
  • the error detection model includes any one or more of: a word-continuation model, a part-of-speech continuation model, a similar-sound dictionary, and a similar-shape dictionary.
  • the correcting the text to be corrected according to the mixed language model to obtain correction suggestion text includes:
  • the first several string sequences in which the ideal string has a high probability of occurrence are obtained from the at least one screening sequence by the noise channel probability model and used as the correction suggestion text.
  • a user equipment including:
  • An obtaining unit configured to obtain two or more text types of the text to be corrected in a preset text classification standard
  • the obtaining unit is further configured to acquire, in the correction knowledge base, a sub-language model to be combined corresponding to each text type of the text to be corrected, and send the acquired information of the two or more sub-language models to be combined to the generating unit;
  • a generating unit configured to receive information about the acquired two or more sub-language models to be combined sent by the acquiring unit, and combine the acquired two or more sub-language models to be combined into a mixed language model, where The information of the mixed language model is sent to the correction unit;
  • a correction unit configured to receive information of the mixed language model sent by the generating unit, and correct the text to be corrected according to the mixed language model to obtain correction suggestion text.
  • the preset text classification standard is any one of: a language environment, a subject background, an author, a writing style, and a genre.
  • the user equipment further includes:
  • the obtaining unit is configured to acquire the preset text classification standard, and send the preset text classification standard to an establishing unit;
  • an establishing unit configured to receive the preset text classification standard sent by the acquiring unit, and establish two or more sub-language models according to the text type in the preset text classification standard.
  • the generating unit is specifically configured to:
  • the user equipment further includes:
  • a model obtaining unit configured to acquire an error detection model in the correction knowledge base, and send information of the error detection model to a determining unit;
  • a determining unit configured to receive information about the error detection model sent by the model obtaining unit, and determine an error location of the to-be-processed text by using the error detection model, where the error location includes an error character or an error string.
  • the error detection model includes any one or more of: a word-continuation model, a part-of-speech continuation model, a similar-sound dictionary, and a similar-shape dictionary.
  • the correction unit is specifically configured to:
  • the first few string sequences with a high probability of occurrence of the ideal string are obtained as the correction suggestion text.
  • An embodiment of the present invention provides a text correction method and a user equipment. The text correction method includes: acquiring two or more text types of the text to be corrected in a preset text classification standard; acquiring, in a correction knowledge base, a sub-language model to be combined corresponding to each text type of the text to be corrected; combining the acquired two or more sub-language models to be combined into a mixed language model; and correcting the text to be corrected according to the mixed language model to obtain correction suggestion text.
  • in this way, the mixed language model on which the correction is based can change dynamically with the text type of the text to be corrected; when the preset text classification standard or the text type of the text to be corrected differs, different correction choices can be provided for the text to be corrected, thereby reducing correction errors and improving correction flexibility and correctness.
  • FIG. 1 is a schematic flowchart of a text correction method according to an embodiment of the present invention
  • FIG. 2 is a schematic flowchart of another text correction method according to an embodiment of the present invention
  • FIG. 3 is a schematic structural diagram of a user equipment according to an embodiment of the present disclosure
  • FIG. 4 is a schematic structural diagram of another user equipment according to an embodiment of the present disclosure.
  • FIG. 5 is a schematic structural diagram of still another user equipment according to an embodiment of the present disclosure.
  • FIG. 6 is a schematic structural diagram of still another user equipment according to an embodiment of the present invention.
  • An embodiment of the present invention provides a text correction method, including:
  • the above preset text classification standard may include any one of: a language environment, a subject background, an author, a writing style, and a genre.
  • texts can be divided into text types such as sports, economics, politics, and technology according to the theme.
  • the user equipment may establish a corresponding sub-language model according to the text type of the theme background in the correction knowledge base.
  • the text classification technique can be utilized to determine the classification to which the text to be corrected belongs.
  • text classification technology can be used to determine that the text types to which the text belongs are technology and economics. The technology-class and economics-class sub-language models corresponding to the text types of the text to be corrected are selected in the correction knowledge base and are then combined into a mixed language model.
  • the mixed language model on which the correction is based can change dynamically with the text type of the text to be corrected, thereby reducing correction errors and improving correction flexibility and correctness.
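A minimal sketch of the S101–S104 flow described above is given below. The classifier, knowledge base, candidate generator, and whole-string sub-models are toy stand-ins, not the patent's implementation; only the overall flow (classify, fetch sub-language models, mix, rank candidates) follows the text.

```python
# Sketch of the S101-S104 flow. The classifier, knowledge base and candidate
# generator below are toy stand-ins; only the overall flow follows the text.

def mix_models(sub_models, weights):
    """S103: combine sub-language models by weighted (linear) interpolation."""
    def prob(candidate):
        # For simplicity each sub-model maps a whole candidate string to a probability.
        return sum(w * m.get(candidate, 1e-9) for m, w in zip(sub_models, weights))
    return prob

def correct_text(text, knowledge_base, classify_text, generate_candidates):
    types = classify_text(text)                       # S101: text types and proportions
    models = [knowledge_base[t] for t in types]       # S102: sub-language models
    mixed = mix_models(models, list(types.values()))  # S103: mixed language model
    return max(generate_candidates(text), key=mixed)  # S104: best correction candidate

# Toy stand-ins so the sketch runs end to end.
knowledge_base = {"tech": {"dell estimates revenue": 0.02},
                  "economy": {"dell estimates revenue": 0.03, "dell estimate revenue": 0.001}}
classify_text = lambda text: {"tech": 0.6, "economy": 0.4}
generate_candidates = lambda text: ["dell estimates revenue", "dell estimate revenue"]

print(correct_text("dell estimate revenue", knowledge_base,
                   classify_text, generate_candidates))
```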
  • another embodiment of the present invention provides a specific method 20 for text correction, including: S201.
  • the user equipment classifies the acquired corpus according to the preset text classification standard into each sub-language model according to the text type.
  • the user equipment first needs to obtain the preset text classification standard, which may include any one of: a language environment, a subject background, an author, a writing style, and a genre, and which is usually preset by the user according to the specific situation.
  • the user equipment establishes two or more sub-language models according to the text type in the preset text classification standard.
  • sub-language models of the following types can be obtained according to the language environment: a business environment, a daily-life environment, an official environment, and so on.
  • sub-language models of the following types can be obtained according to the subject background: sports, politics, literature, history, and so on.
  • the actual types of the sub-language models are also related to the types of corpus available. For example, if no history-type corpus exists in the correction knowledge base, the history-class sub-language model can be regarded as idle or invalid; when the user equipment obtains a certain amount of history-class corpus through active acquisition, user input, or the like, a new history-class sub-language model can be established from that corpus and regarded as a valid sub-language model.
  • the acquired corpus is classified into the sub-language model according to the type.
  • the user equipment can enrich the correction knowledge base by obtaining corpus on a regular or irregular basis.
  • the corpus may be obtained actively by the user equipment by searching over an Internet connection, through periodic updates, or the like, or the user may provide classified corpus data to the user equipment through an input interface such as the configuration management interface of the user equipment. The user equipment then classifies the corpus into a sub-language model of an existing type, or establishes a new sub-language model, according to the corpus type indicated by the user.
  • for example, if history-class corpus data is missing, the user can add a collection of history-class corpus through periodic updates, Internet searches, or even through the configuration management interface, and then establish a history-class sub-language model; if history-class corpus data already exists, new history-class corpus can also be added in the above manner to update the sub-language model.
  • most of the time, however, the corpus obtained by the user equipment is unclassified, and the user equipment needs to classify it by type into the sub-language models according to the preset text classification standard.
  • for example, for the computer technology news text mentioned above that contains economic content such as the stock market, part of the content reads: "Dell estimates that its first-quarter revenue was about $14.2 billion, with earnings per share of 33 cents. The company had previously forecast revenue for the quarter of $14.2 billion to $14.6 billion and earnings per share of 35 to 38 cents, while analysts on average had predicted Dell revenue of $14.52 billion for the same period, with earnings per share of 38 cents."
  • the text classification technology is used to automatically classify unclassified corpus.
  • the classification process is divided into two phases: training phase and classification phase.
  • in the training phase, the texts in the classified corpus collection are processed by word segmentation; the word segmentation process is the same as in the prior art and is not described here again.
  • after word segmentation, the above content can be represented as "戴/尔/公司/估计/,/其/第一/季度/收入/约/为 ..."; for convenience of presentation, the embodiments of the present invention use '/' to indicate the segmentation between words. Stop words such as "地" and "的" are removed from the segmented text, and a word-vector representation of the text is then built according to the words appearing in the text and the ratio of each word's frequency to the total number of words: each distinct word corresponds to one dimension of the vector, and the ratio of its frequency to the total word count is the value of that dimension.
  • the collection of word vectors of the different texts in the corpus, after processing such as dimensionality reduction, is combined with the known classification labels to train a classifier; in the classification phase, the corpus text to be classified is processed and represented as a vector, which is input into the classifier to classify the text into types such as sports and finance.
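The two-phase classification just described (segment, drop stop words, build word-frequency vectors, train a classifier, then classify new text) could be sketched as follows. scikit-learn is used only as an assumed stand-in for the unspecified vectorizer and classifier, and the toy texts are assumed to be already word-segmented.

```python
# Sketch of the two-phase classification described above, using scikit-learn
# as an assumed stand-in for the unspecified vectorizer/classifier.
# Inputs are assumed to be already word-segmented, with words separated by spaces.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

train_texts = ["dell first quarter revenue earnings per share",   # toy, pre-segmented
               "match score team player season"]
train_labels = ["economy", "sports"]

# Training phase: word-frequency vectors (stop words dropped) + classifier.
vectorizer = TfidfVectorizer(stop_words="english")
clf = MultinomialNB().fit(vectorizer.fit_transform(train_texts), train_labels)

# Classification phase: represent the text to be classified as a vector and
# predict class membership probabilities, i.e. the mixing proportions.
text = "analysts predicted revenue per share"
proba = clf.predict_proba(vectorizer.transform([text]))[0]
print(dict(zip(clf.classes_, proba)))
```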
  • the corpus is classified into corresponding sub-language models according to different classifications, and the probability of the corresponding sub-language model is updated.
  • the texts in the corpus are used to establish 2-Gram and 3-Gram statistical models as the word-continuation model. For example, if a corpus text contains the string "知识库构建模块" ("knowledge base building module"), the 2-Gram groups created are 知识, 识库, 库构, 构建, 建模 and 模块, and the statistical probability of occurrence of each 2-Gram group in the classification corpus of that text is then calculated.
  • for the Dell text above, the established 2-Gram groups include 戴尔, 尔公, 公司, 估计, 第一, 一季, 季度, and so on.
  • first, the number of occurrences of each character is counted, and the proportion of that character in the entire corpus is calculated as the probability of occurrence of the character.
  • then, for each 2-Gram group, the number of times the second character appears after the first character is counted; for example, for the group 戴尔, how often the character 尔 follows the character 戴 in the text contained in the entire corpus is counted, and the conditional probability of the group is the ratio of this count to the number of occurrences of the first character.
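A minimal sketch of the 2-Gram continuation statistics described above: count character pairs and estimate P(next character | character) as the pair count divided by the count of the first character. The toy corpus is an assumption for illustration.

```python
# Minimal character 2-Gram ("continuation") statistics: for each pair,
# P(next_char | char) = count(char, next_char) / count(char).
from collections import Counter

def bigram_model(corpus):
    unigrams, bigrams = Counter(), Counter()
    for text in corpus:
        unigrams.update(text)              # counts every character (spaces included)
        bigrams.update(zip(text, text[1:]))
    return {pair: n / unigrams[pair[0]] for pair, n in bigrams.items()}

corpus = ["the cat sat on the mat", "the cat ate"]   # toy corpus
model = bigram_model(corpus)
print(model[("t", "h")])   # probability that "h" follows "t" in this corpus
```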
  • the corpus after word segmentation may also be tagged with parts of speech, and a 2-gram part-of-speech statistical model and a 3-gram part-of-speech statistical model are then established as the part-of-speech continuation model, where "2-gram" in the 2-gram part-of-speech statistical model means two words (or two characters).
  • suppose the corpus contains "knowledge base building module"; after word segmentation, the words "knowledge base", "build" and "module" are obtained, and they are tagged as a noun, a verb and a noun respectively.
  • the established 2-gram part-of-speech groups are "knowledge base" + "build" and "build" + "module", whose part-of-speech patterns are noun + verb and verb + noun; the established 3-gram part-of-speech group is "knowledge base" + "build" + "module", whose part-of-speech pattern is noun + verb + noun. That is, when establishing the 2-gram and 3-gram part-of-speech statistical models, the corresponding parts of speech also need to be labeled.
  • the calculation method of this statistical model is similar to the method of establishing the 2-Gram and 3-Gram statistical models described above, and details are not described here again.
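The part-of-speech continuation model can be built in the same way over tag sequences, as sketched below. The tagged input is supplied directly here; any real part-of-speech tagger would be a separate, unspecified component.

```python
# Part-of-speech continuation model built over tag sequences. The tagged
# sentences are supplied directly; no particular tagger is assumed.
from collections import Counter

def pos_bigram_model(tagged_sentences):
    uni, bi = Counter(), Counter()
    for sent in tagged_sentences:
        tags = [tag for _, tag in sent]
        uni.update(tags)
        bi.update(zip(tags, tags[1:]))
    return {pair: n / uni[pair[0]] for pair, n in bi.items()}

tagged = [[("knowledge base", "NOUN"), ("build", "VERB"), ("module", "NOUN")]]
print(pos_bigram_model(tagged)[("NOUN", "VERB")])   # P(VERB follows NOUN)
```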
  • the user equipment acquires two or more text types of the text to be corrected in a preset text classification standard.
  • the user equipment can obtain the text to be corrected in various ways, for example, the user enters it directly through a user interface of the user equipment, or the user transmits it to the user equipment through an input interface such as a configuration management interface. The user equipment then performs automatic text classification on the text to be corrected by using text classification technology; the classification process is divided into two stages: a training stage and a classification stage. In the training stage, word segmentation processing is performed, and the word segmentation process is the same as in the prior art and is not described here again.
  • a classifier is trained with the known classification labels; in the classification stage, the processed text to be corrected is represented as a vector and input into the classifier to classify the text into types such as sports and finance.
  • the text to be corrected is classified into the corresponding sub-language models according to the different classifications, and the probabilities of the corresponding sub-language models are updated.
  • the user equipment acquires a mixed language model.
  • the user equipment may acquire, in the correction knowledge base, a sub-language model to be combined corresponding to each text type of the text to be corrected.
  • the correction knowledge base may include: sub-language models, a word-continuation model, a part-of-speech continuation model, a similar-sound dictionary, a similar-shape dictionary, and the like. Since the correction knowledge base contains many text types, only the sub-language models corresponding to the text types of the text to be corrected need to be selected to obtain the mixed language model.
  • the user equipment can obtain, by calculation, the proportion of each sub-language model in the text to be corrected.
  • the acquired two or more sub-language models to be combined are then combined according to these proportions to obtain the mixed language model.
  • the proportion of each sub-language model to be combined in the mixed language model can be calculated using the expectation-maximization (EM) algorithm, and the sub-language models to be combined are combined according to these proportions to obtain the mixed language model.
  • each sub-language model can also be multiplied by its corresponding weight, which achieves the same effect of obtaining the mixed language model by combination according to the proportions.
  • the mixed language model is formed by linear interpolation of the sub-language models and can be written in terms of the sub-language models approximately as P(w_1 ... w_i) = Π_{m=1..i} Σ_{j=1..k} λ_j · P_{M_j}(w_m | history), where i is the length of the string to be corrected, k is the number of sub-language models, λ_j is the weight of the j-th sub-language model, and P_{M_j}(·) is the probability given by sub-language model M_j for the string in that model.
  • given this expression, a likelihood function of the text to be processed can be written; the weights λ that maximize the likelihood function are then found, and these λ are taken as the sub-language model weights.
  • in the weight estimation, λ^(t) denotes the t-th weight estimate; in this embodiment of the present invention, t finally equals the number of words in the text to be processed, M denotes a language model, M_j denotes the j-th sub-language model in the mixed language model provided by this embodiment, and k is the number of sub-language models involved in determining the text.
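A hedged sketch of the linear interpolation and EM-style weight estimation described above follows. The sub-language models are simplified to per-word probability tables and the toy tables are assumptions; the update itself is the standard EM re-estimation for mixture weights.

```python
# Linear interpolation of sub-language models with EM-style weight updates.
# Each sub-model here is simplified to a per-word probability table.

def mixture_prob(word, models, lambdas):
    return sum(l * m.get(word, 1e-9) for m, l in zip(models, lambdas))

def estimate_weights(words, models, iterations=20):
    k = len(models)
    lambdas = [1.0 / k] * k                       # start from uniform weights
    for _ in range(iterations):
        expected = [0.0] * k
        for w in words:                           # E-step: posterior of each sub-model
            denom = mixture_prob(w, models, lambdas)
            for j, m in enumerate(models):
                expected[j] += lambdas[j] * m.get(w, 1e-9) / denom
        lambdas = [e / len(words) for e in expected]   # M-step: renormalise
    return lambdas

tech = {"server": 0.02, "revenue": 0.001, "market": 0.002}     # toy sub-models
econ = {"server": 0.0005, "revenue": 0.01, "market": 0.01}
print(estimate_weights(["revenue", "market", "server", "revenue"], [tech, econ]))
```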
  • the user equipment determines, by using an error detection model, an error location of the to-be-processed text, where the error location includes an error character or an error string.
  • the error detection model can include any one or more of: a word-continuation model, a part-of-speech continuation model, a similar-sound dictionary, and a similar-shape dictionary.
  • the error detection model may also include other models, which are not described again here.
  • step S201 has already obtained the word-continuation model, the part-of-speech continuation model, the similar-sound dictionary, the similar-shape dictionary, and the like, so the user equipment can obtain one or more error detection models according to preset detection rules.
  • the user equipment can perform word segmentation and part-of-speech tagging processing on the text to be processed.
  • for the specific process, refer to the explanation in step S201; details are not described here again.
  • a single character or a scattered string that appears consecutively after the word segmentation can be checked by the word continuity model to see if it is correct.
  • the part-of-speech continuation model can be used to check the continuity of parts of speech.
  • the specific process can refer to the prior art.
  • a "non-multi-character-word error" destroys the surface structure of a word and produces isolated strings, so the original multi-character word cannot be found in the word segmentation dictionary; for example, if one character of a word such as "loyalty" is mistyped, the word segmentation program splits the string into several individual Chinese characters or fragments. Statistically, the probability of the erroneous fragment following the preceding character is very small, so this type of error can be detected by the word-continuation model with an appropriately set threshold.
  • for a "true multi-character-word error", the erroneous string is itself a word in the segmentation vocabulary; there is usually no word-level error, and this kind of error is generally a grammatical-structure or collocation error.
  • for example, an erroneous string may consist of a noun followed by another noun where the correct string, such as "extend the time", is a verb followed by a noun; statistically, the noun + noun sequence is far less probable, while the verb + noun combination of the correct string is statistically reasonable. Such errors can therefore be found by using the part-of-speech continuation model to judge the part-of-speech relationships.
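A minimal sketch of flagging error positions with the continuation model described above: a position is flagged when the probability of the observed character pair falls below a preset threshold. The model table and the threshold are illustrative assumptions.

```python
# Flag error positions where the 2-Gram continuation probability drops below
# a preset threshold; the model table and threshold are illustrative only.

def detect_errors(text, bigram_model, threshold=1e-4):
    positions = []
    for i, pair in enumerate(zip(text, text[1:])):
        if bigram_model.get(pair, 0.0) < threshold:
            positions.append(i + 1)        # index of the suspicious character
    return positions

bigram_model = {("t", "h"): 0.3, ("h", "e"): 0.4}    # toy continuation model
print(detect_errors("thx", bigram_model))             # -> [2]: "x" after "h" is unlikely
```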
  • for the method of determining the error position by means of the similar-sound dictionary, the similar-shape dictionary, and the like, refer to the prior art.
  • the methods for detecting the error position described above are only schematic; any variation or substitution that can be readily conceived by a person skilled in the art within the technical scope disclosed by the present invention shall fall within the protection scope of the present invention.
  • the correction method may include: setting the first character in the string sequence to be corrected as the editing position and performing correction operations on the string to be corrected according to the word-continuation relationships in the language model to generate a new set of N string-sequence combinations, and then repeating the above operation with the second character position of each string sequence in the newly generated string-sequence set as the editing position.
  • N correction strings can be obtained after a finite number of such operations.
  • this procedure assumes by default that the entire string of the text to be corrected may be erroneous, so correction must be attempted at almost every position in the text to be corrected, which makes the operation complex; if the string sequence of the text to be corrected is long, a state explosion may occur.
  • by screening for error positions before correction, the number of correction operations is effectively reduced and correction efficiency is improved.
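The sketch below generates candidate correction strings only at a detected error position, using a confusion set (for example, similar-sound or similar-shape characters) as an assumed stand-in for the dictionaries mentioned above.

```python
# Generate candidate strings only at the detected error position, using a
# confusion set (e.g. similar-sound / similar-shape characters) as a stand-in.

def candidates_at(text, pos, confusion_set):
    out = []
    for repl in confusion_set.get(text[pos], []):
        out.append(text[:pos] + repl + text[pos + 1:])      # substitution
    out.append(text[:pos] + text[pos + 1:])                 # deletion
    return out

confusion_set = {"x": ["e", "a"]}                           # toy confusion set
print(candidates_at("thx", 2, confusion_set))               # ['the', 'tha', 'th']
```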
  • the user equipment performs correction according to the mixed language model to obtain the correction suggestion text.
  • a sequence of strings to be corrected can be generated from the error location.
  • the user equipment may perform a correction operation on the string sequence to be corrected, by error-detection-model matching or other methods, to obtain at least one corrected string sequence; the at least one corrected string sequence may constitute a set of corrected string sequences. For the specific correction operation, refer to the prior art.
  • the user equipment may obtain the first m and the last n characters of the error position in the to-be-corrected text, and combine with the corrected character string sequence to obtain at least one screening sequence.
  • m and n are positive integers or 0, which can be preset or dynamic. In this way, the sequence of correction strings is more closely related to the context of the text to be corrected.
  • for example, suppose the string sequence to be corrected is a three-character string meaning "intermittent", and the correction operation produces the corrected string sequence "intermittently". The characters before and after the error position, here the preceding word "sound", are obtained and combined with the corrected string to form the screening sequence "sound intermittently"; the statistical language model can then be used to calculate the probability that "intermittently" appears after "sound", and a high probability indicates that the correction string generated here is appropriate.
  • multiple corrected string sequences may be obtained after correction; the above is only a schematic description.
  • according to the mixed language model, the user equipment may obtain, from the at least one screening sequence by using a noise channel probability model, the string sequence in which the ideal string has the highest probability of occurrence as the correction suggestion text, or
  • according to the mixed language model, obtain, from the at least one screening sequence by using the noise channel probability model, the first several string sequences in which the ideal string has a high probability of occurrence as the correction suggestion text.
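Putting the pieces together, the sketch below builds screening sequences from the m characters before and n characters after the error position, scores them with toy stand-ins for the mixed language model and the channel model, and returns the highest-ranked suggestions. All names and values here are illustrative assumptions.

```python
# Build screening sequences (m chars before + corrected string + n chars after)
# and keep the candidates that the noisy-channel score ranks highest.

def screening_sequences(text, start, end, corrected_strings, m=2, n=2):
    left, right = text[max(0, start - m):start], text[end:end + n]
    return [left + c + right for c in corrected_strings]

def best_suggestions(observed, sequences, p_w, p_o_given_w, top=1):
    scored = sorted(sequences,
                    key=lambda s: p_w(s) * p_o_given_w(observed, s),
                    reverse=True)
    return scored[:top]

# Toy scoring functions standing in for the mixed language model / channel model.
p_w = lambda s: 0.01 if "the" in s else 1e-6
p_o_given_w = lambda o, s: 0.9 if o == s else 0.05

text = "athxm"                                   # toy text; suspicious character at index 3
seqs = screening_sequences(text, 3, 4, ["e", "x"])        # -> ["them", "thxm"]
print(best_suggestions("thxm", seqs, p_w, p_o_given_w))   # -> ['them']
```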
  • the correction suggestion text can be provided to the user through the human-machine interaction interface of the user equipment for the user to confirm the correction scheme; the positions of the corrected strings can be emphasized by underlining or the like, and corrections for different types of errors can also be marked with different colors, symbols, or shading.
  • in the text correction method provided by this embodiment of the present invention, the text to be corrected is classified and the corresponding mixed language model is then acquired, so that the mixed language model on which the correction is based can change dynamically with the text type of the text to be corrected, and the language model can more accurately reflect the linguistic phenomena of the text.
  • different correction choices can thus be provided for the text to be corrected, thereby reducing correction errors and improving correction flexibility and correctness.
  • the number of corrections is effectively reduced, and the efficiency of the correction is improved.
  • supplemented by a named-entity recognition technique during correction, the present invention can also identify named entities that may cause anomalies in word segmentation and part-of-speech tagging, and exclude them from correction processing.
  • An embodiment of the present invention provides a user equipment 30, as shown in FIG. 3, including: an obtaining unit 301, configured to acquire two or more text types of the text to be corrected in a preset text classification standard.
  • the preset text classification standard may be any one of: a language environment, a subject background, an author, a writing style, and a genre.
  • the obtaining unit 301 is further configured to acquire, in the correction knowledge base, a sub-language model to be combined corresponding to each text type of the text to be corrected, and send the acquired information of the two or more sub-language models to be combined to the generating unit 302.
  • the generating unit 302 is configured to receive the acquired information of the two or more sub-language models to be combined sent by the obtaining unit 301, combine the acquired two or more sub-language models to be combined into a mixed language model, and send the information of the mixed language model to the correction unit 303.
  • the generating unit 302 is specifically configured to: obtain the proportion of each text type in the text to be corrected; and combine the acquired two or more sub-language models to be combined according to the proportions of the text types to obtain the mixed language model.
  • the correcting unit 303 is configured to receive information about the mixed language model sent by the generating unit 302, and correct the text to be corrected according to the mixed language model to obtain a corrected suggestion text.
  • the correction unit 303 is specifically configured to: generate a string sequence to be corrected from the error position; perform a correction operation on the string sequence to be corrected to obtain at least one corrected string sequence; obtain the m characters before and the n characters after the error position in the text to be corrected and combine them with the corrected string sequence to obtain at least one screening sequence; and, according to the mixed language model, obtain, from the at least one screening sequence by using a noise channel probability model, the string sequence in which the ideal string has the highest probability of occurrence as the correction suggestion text, or obtain the first several string sequences in which the ideal string has a high probability of occurrence as the correction suggestion text.
  • in this way, the obtaining unit classifies the text to be corrected and the generating unit then acquires the corresponding mixed language model, so that the mixed language model on which the correction unit bases its correction can change dynamically with the text type of the text to be corrected; when the preset text classification standard or the text type of the text to be corrected differs, different correction choices can be provided, thereby reducing correction errors and improving correction flexibility and correctness.
  • the user equipment 10 may further include: the obtaining unit 301, configured to acquire the preset text classification standard and send the preset text classification standard to the establishing unit 304;
  • the establishing unit 304 is configured to receive the preset text classification standard sent by the obtaining unit 301, and establish two or more sub-language models according to the text types in the preset text classification standard.
  • the model obtaining unit 305 is configured to acquire an error detection model in the correction knowledge base, and send the information of the error detection model to the determining unit 306;
  • the error detection model may include any one or more of: a word-continuation model, a part-of-speech continuation model, a similar-sound dictionary, and a similar-shape dictionary.
  • a determining unit 306 configured to receive information about the error detection model sent by the model obtaining unit 305, and determine, by using the error detection model, an error location of the to-be-processed text, where the error location includes an incorrect character or an incorrect character string.
  • the user equipment provided by this embodiment of the present invention classifies the text to be corrected and then obtains the corresponding mixed language model, so that the mixed language model on which the correction is based can change dynamically with the text type of the text to be corrected, and the language model can more accurately reflect the linguistic phenomena of the text.
  • different correction choices can thus be provided for the text to be corrected, thereby reducing correction errors and improving correction flexibility and correctness.
  • screening the error positions is effective, reducing the number of correction operations and improving correction efficiency.
  • the disclosed system, apparatus, and method may be implemented in other manners.
  • the device embodiments described above are merely illustrative.
  • the division of the unit is only a logical function division.
  • there may be other division manners in actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not performed.
  • the coupling or direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection through some interface, device or unit, and may be electrical, mechanical or otherwise.
  • the units described as separate components may or may not be physically separated, and the components displayed as the units may or may not be physical units, and may be located in one place or may be distributed to multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the embodiment of the present embodiment.
  • each functional unit in each embodiment of the present invention may be integrated into one processing unit, or each unit may be physically included separately, or two or more units may be integrated into one unit.
  • the above integrated unit can be implemented in the form of hardware or in the form of hardware plus software functional units.
  • the embodiment of the present invention provides a user equipment 50, as shown in FIG. 5, including: a processor 501, configured to acquire two or more text types of the text to be corrected in a preset text classification standard.
  • the preset text classification standard may be any one of: a language environment, a subject background, an author, a writing style, and a genre.
  • the processor 501 is further configured to: acquire, in the correction knowledge base, a sub-language model to be combined corresponding to each text type of the text to be corrected; combine the acquired two or more sub-language models to be combined into a mixed language model; and correct the text to be corrected according to the mixed language model to obtain the correction suggestion text.
  • the processor 501 is specifically configured to: obtain the proportion of each text type in the text to be corrected; and combine the acquired two or more sub-language models to be combined according to the proportions of the text types to obtain the mixed language model.
  • the processor 501 is further specifically configured to: generate a string sequence to be corrected from the error position; perform a correction operation on the string sequence to be corrected to obtain at least one corrected string sequence; obtain the m characters before and the n characters after the error position in the text to be corrected and combine them with the corrected string sequence to obtain at least one screening sequence; and, according to the mixed language model, obtain, from the at least one screening sequence by using a noise channel probability model, the string sequence in which the ideal string has the highest probability of occurrence as the correction suggestion text, or obtain the first several string sequences in which the ideal string has a high probability of occurrence as the correction suggestion text.
  • in this way, the processor classifies the text to be corrected and then obtains the corresponding mixed language model, so that the mixed language model on which the correction is based can change dynamically with the text type of the text to be corrected; when the preset text classification standard or the text type of the text to be corrected differs, different correction choices can be provided, so correction errors can be reduced and correction flexibility and accuracy can be improved.
  • processor 501 is further configured to: obtain the preset text classification standard.
  • the user equipment 50 further includes: a memory 502, configured to establish two or more sub-language models according to the text types in the preset text classification standard, and send the information of the sub-language models to the processor 501.
  • the processor 501 is further configured to acquire an error detection model in the correction knowledge base.
  • the error detection model may include any one or more of: a word-continuation model, a part-of-speech continuation model, a similar-sound dictionary, and a similar-shape dictionary.
  • the processor 501 is further configured to determine, by using the error detection model, an error location of the to-be-processed text, where the error location includes an error character or an error string.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

Embodiments of the present invention provide a text correcting method and user equipment, and relate to the language processing field, which may reduce correction mistakes and improve correction flexibility and accuracy. The text correcting method comprises: obtaining two or more text types of a to-be-corrected text in a preset text classification standard; obtaining, in a correction knowledge base, a to-be-combined sub-language model corresponding to each text type of the to-be-corrected text; combining the obtained two or more to-be-combined sub-language models into a mixed language model; and correcting the to-be-corrected text according to the mixed language model to obtain a correction suggestion text. The text correcting method and user equipment provided in the embodiments of the present invention are used for correcting erroneous text.

Description

Text correction method and user equipment

This application claims priority to Chinese Patent Application No. 201210332263.3, filed with the Chinese Patent Office on September 10, 2012 and entitled "Text correction method and user equipment", which is incorporated herein by reference in its entirety.

Technical field

The present invention relates to the field of language processing, and in particular, to a text correction method and user equipment.

Background

With the advent of the digital age, text correction techniques that correct erroneous text to be corrected are ever more widely applied. In the prior art, noise channel theory holds that errors in the text to be corrected mainly come from input errors produced during manual input and from input errors produced in optical character recognition and speech recognition. Noise channel theory regards the text to be corrected as the real text after it has passed through a channel mixed with noise. For example, W is the original string sequence <W1, W2, ..., Wn>, that is, the completely correct text; after passing through the noise channel, the noisy text O = <O1, O2, ..., On> is produced. The method of performing text correction using noise channel theory is to establish a noise channel probability model and find a string sequence W' such that, given the observed string sequence O, the probability of occurrence of W' is maximized. The string sequence O is the text to be corrected, and the string sequence W' is the ideal corrected text, which can also be called the ideal string, although the ideal corrected text is not necessarily identical to the correct text W. The string sequence W' is the string with the highest probability P(W|O); P(O|W) is called the channel probability or generation model, and the probability P(W) is the probability of occurrence of the string sequence W in the language model.

In the method of implementing text correction using noise channel theory, the string W' that maximizes this probability must be obtained according to the language model. However, when the language environment, subject background, and so on of the text to be corrected differ, the same word or string may have different meanings and therefore requires different correction choices. The language model in the prior art is relatively fixed, so only a fixed correction choice can be applied to the text to be corrected; correction errors therefore occur easily, resulting in poor correction flexibility and low correctness.
Summary of the invention

Embodiments of the present invention provide a text correction method and user equipment for improving correction flexibility and correctness.

To achieve the above objective, the embodiments of the present invention use the following technical solutions:

In one aspect, a text correction method is provided, including:

obtaining two or more text types of the text to be corrected in a preset text classification standard; and obtaining, in a correction knowledge base, a sub-language model to be combined corresponding to each text type of the text to be corrected;

combining the obtained two or more sub-language models to be combined into a mixed language model; and correcting the text to be corrected according to the mixed language model to obtain correction suggestion text.

The preset text classification standard is any one of a language environment, a subject background, an author, a writing style, and a genre.
The method further includes:

obtaining the preset text classification standard; and

establishing two or more sub-language models according to the text types in the preset text classification standard.

Combining the obtained two or more sub-language models to be combined into a mixed language model includes:

obtaining the proportion of each text type in the text to be corrected; and

combining the obtained two or more sub-language models to be combined according to the proportions of the text types to obtain the mixed language model.

Before the text to be corrected is corrected according to the mixed language model to obtain the correction suggestion text, the method further includes:

obtaining an error detection model in the correction knowledge base; and

determining an error position of the to-be-processed text by using the error detection model, where the error position includes an erroneous character or an erroneous character string.

The error detection model includes any one or more of a word-continuation model, a part-of-speech continuation model, a similar-sound dictionary, and a similar-shape dictionary.

Correcting the text to be corrected according to the mixed language model to obtain the correction suggestion text includes:

generating a string sequence to be corrected from the error position;

performing a correction operation on the string sequence to be corrected to obtain at least one corrected string sequence;

obtaining the m characters before and the n characters after the error position in the text to be corrected, and combining them with the corrected string sequence to obtain at least one screening sequence; and

according to the mixed language model, obtaining, from the at least one screening sequence by using a noise channel probability model, the string sequence in which the ideal string has the highest probability of occurrence as the correction suggestion text, or

according to the mixed language model, obtaining, from the at least one screening sequence by using the noise channel probability model, the first several string sequences in which the ideal string has a high probability of occurrence as the correction suggestion text.
In another aspect, a user equipment is provided, including:

an obtaining unit, configured to obtain two or more text types of the text to be corrected in a preset text classification standard, where

the obtaining unit is further configured to obtain, in a correction knowledge base, a sub-language model to be combined corresponding to each text type of the text to be corrected, and send information about the obtained two or more sub-language models to be combined to a generating unit;

the generating unit, configured to receive the information about the obtained two or more sub-language models to be combined sent by the obtaining unit, combine the obtained two or more sub-language models to be combined into a mixed language model, and send information about the mixed language model to a correction unit; and

the correction unit, configured to receive the information about the mixed language model sent by the generating unit, and correct the text to be corrected according to the mixed language model to obtain correction suggestion text.

The preset text classification standard is any one of a language environment, a subject background, an author, a writing style, and a genre.

The user equipment further includes:

the obtaining unit, configured to obtain the preset text classification standard and send the preset text classification standard to an establishing unit; and

the establishing unit, configured to receive the preset text classification standard sent by the obtaining unit, and establish two or more sub-language models according to the text types in the preset text classification standard.

The generating unit is specifically configured to:

obtain the proportion of each text type in the text to be corrected; and

combine the obtained two or more sub-language models to be combined according to the proportions of the text types to obtain the mixed language model.

The user equipment further includes:

a model obtaining unit, configured to obtain an error detection model in the correction knowledge base and send information about the error detection model to a determining unit; and

the determining unit, configured to receive the information about the error detection model sent by the model obtaining unit, and determine an error position of the to-be-processed text by using the error detection model, where the error position includes an erroneous character or an erroneous character string.

The error detection model includes any one or more of a word-continuation model, a part-of-speech continuation model, a similar-sound dictionary, and a similar-shape dictionary.

The correction unit is specifically configured to:

generate a string sequence to be corrected from the error position;

perform a correction operation on the string sequence to be corrected to obtain at least one corrected string sequence;

obtain the m characters before and the n characters after the error position in the text to be corrected, and combine them with the corrected string sequence to obtain at least one screening sequence; and

according to the mixed language model, obtain, from the at least one screening sequence by using a noise channel probability model, the string sequence in which the ideal string has the highest probability of occurrence as the correction suggestion text, or

according to the mixed language model, obtain, from the at least one screening sequence by using the noise channel probability model, the first several string sequences in which the ideal string has a high probability of occurrence as the correction suggestion text.
The embodiments of the present invention provide a text correction method and user equipment. The text correction method includes: obtaining two or more text types of the text to be corrected in a preset text classification standard; obtaining, in a correction knowledge base, a sub-language model to be combined corresponding to each text type of the text to be corrected; combining the obtained two or more sub-language models to be combined into a mixed language model; and correcting the text to be corrected according to the mixed language model to obtain correction suggestion text. In this way, by classifying the text to be corrected and then obtaining the corresponding mixed language model, the mixed language model on which the correction is based can change dynamically with the text type of the text to be corrected; when the preset text classification standard or the text type of the text to be corrected differs, different correction choices can be provided for the text to be corrected, thereby reducing correction errors and improving correction flexibility and correctness.
Brief description of the drawings

To describe the technical solutions in the embodiments of the present invention or in the prior art more clearly, the accompanying drawings needed for describing the embodiments or the prior art are briefly introduced below. Apparently, the accompanying drawings in the following description show merely some embodiments of the present invention, and a person of ordinary skill in the art may still derive other drawings from these accompanying drawings without creative effort.

FIG. 1 is a schematic flowchart of a text correction method according to an embodiment of the present invention;

FIG. 2 is a schematic flowchart of another text correction method according to an embodiment of the present invention;

FIG. 3 is a schematic structural diagram of a user equipment according to an embodiment of the present invention;

FIG. 4 is a schematic structural diagram of another user equipment according to an embodiment of the present invention;

FIG. 5 is a schematic structural diagram of still another user equipment according to an embodiment of the present invention; and

FIG. 6 is a schematic structural diagram of yet another user equipment according to an embodiment of the present invention.
具体实施方式 detailed description
下面将结合本发明实施例中的附图, 对本发明实施例中的技术 方案进行清楚、 完整地描述, 显然, 所描述的实施例仅仅是本发明 一部分实施例, 而不是全部的实施例。 基于本发明中的实施例, 本 领域普通技术人员在没有做出创造性劳动前提下所获得的所有其他 实施例, 都属于本发明保护的范围。 The technical solutions in the embodiments of the present invention are clearly and completely described in the following with reference to the accompanying drawings in the embodiments of the present invention. It is obvious that the described embodiments are only a part of the embodiments of the present invention, but not all of the embodiments. Based on an embodiment of the present invention, All other embodiments obtained by those skilled in the art without creative efforts are within the scope of the present invention.
本发明实施例提供一种文本校正方法, 包括:  An embodiment of the present invention provides a text correction method, including:
5101、 获取待校正文本在预设文本分类标准中的两个以上文本 类型。  5101. Obtain two or more text types of the text to be corrected in the preset text classification standard.
上述预设文本分类标准可以包括: 语言环境、 主题背景、 作者、 写作风格 和题材中的任意一项。 示例的, 按照主题背景可以将文本 分为体育、 经济、 政治、 科技等文本类型。  The above preset text classification criteria may include: any one of a language environment, a theme, an author, a writing style, and a theme. For example, texts can be divided into text types such as sports, economics, politics, and technology according to the theme.
若用户预设的文本分类标准为主题背景, 则用户设备可以在校 正知识库中依据该主题背景的文本类型建立相应的子语言模型。 在 获取待校正文本的文本类型时, 可以利用文本分类技术确定待校正 文本所属的分类。  If the text classification standard preset by the user is the theme background, the user equipment may establish a corresponding sub-language model according to the text type of the theme background in the correction knowledge base. When the text type of the text to be corrected is obtained, the text classification technique can be utilized to determine the classification to which the text to be corrected belongs.
5102、 在校正知识库中获取与所述待校正文本的每一个文本类 型对应的待组合子语言模型。  5102. Obtain, in the correction knowledge base, a sub-language model to be combined corresponding to each text type of the text to be corrected.
5103、 将获取的两个以上待组合子语言模型组合成为混合语言 模型。  5103. Combine the obtained two or more sub-language models to be combined into a mixed language model.
例如, 当输入一段包含有股市等经济方面内容的计算机科技咨 询文本时, 利用文本分类技术可以确定该文本所属的文本类型为科 技类和经济类。 在校正知识库中选择与待校正文本的文本类型对应 的科技类与经济类子语言模型, 然后将该科技类与经济类子语言模 型组合成为混合语言模型。  For example, when entering a piece of computer technology consulting text that contains economic aspects such as the stock market, text classification techniques can be used to determine the type of text to which the text belongs is science and economic. Select the technology and economic sub-language model corresponding to the text type of the text to be corrected in the correction knowledge base, and then combine the technology class with the economic sub-language model into a mixed language model.
5104、 根据混合语言模型对待校正文本进行校正得到校正建议 文本。  5104. Correcting the corrected text according to the mixed language model to obtain a corrected suggestion text.
这样一来, 通过将待校正文本进行分类, 然后获取相应的混合 语言模型, 使得校正时所依据的混合语言模型能够根据待校正文本 的文本类型动态变化, 因此能够减少校正错误, 提高校正灵活性和 正确性。  In this way, by classifying the text to be corrected and then obtaining the corresponding mixed language model, the mixed language model on which the correction is based can dynamically change according to the text type of the text to be corrected, thereby reducing correction errors and improving correction flexibility. And correctness.
示例的, 本发明另一个实施例提供一种文本校正的具体方法 20 , 包括: S201、 用户设备根据预设文本分类标准将获取的语料按照文本 类型归类至各子语言模型中。 For example, another embodiment of the present invention provides a specific method 20 for text correction, including: S201. The user equipment classifies the acquired corpus according to the preset text classification standard into each sub-language model according to the text type.
First, the user equipment needs to obtain the preset text classification standard. The preset text classification standard may include any one of: language environment, theme background, author, writing style, and subject matter, and is usually preset by the user according to the specific situation.

Then, in the correction knowledge base, the user equipment establishes two or more sub-language models according to the text types under the preset text classification standard.
For example, according to language environment, sub-language models of types such as a business environment, a daily-life environment, or an official environment may be obtained. According to theme background, sub-language models of types such as sports, politics, literature, or history may be obtained. Meanwhile, the actual kinds of sub-language models also depend on the kinds of corpus available. For example, if no history-type corpus exists in the correction knowledge base, the history sub-language model may be regarded as idle or invalid; when the user equipment obtains a certain amount of history-type corpus through active acquisition, user input, or other means, a new history sub-language model may be established according to that corpus, and this history sub-language model is regarded as a valid sub-language model.

Then, according to the preset text classification standard, the obtained corpus is assigned by type to the sub-language models.

Specifically, the user equipment may enrich the correction knowledge base by obtaining corpus periodically or aperiodically. The corpus may be obtained actively by the user equipment through an Internet connection, searching, periodic updates, and the like, or the user may provide classified corpus data to the user equipment through an input interface such as a configuration management interface of the user equipment. The user equipment then assigns the corpus to a sub-language model of an existing type, or establishes a new sub-language model, according to the corpus type indicated by the user. For example, if history-type corpus data is missing from the corpus base, the user may add a collection of history-type corpus through periodic updates, Internet search, or even through the configuration management interface, and then establish a history sub-language model; if history-type corpus data already exists, new history-type corpus may also be added in the above ways to update the sub-language model.
Most of the time, however, the corpus obtained by the user equipment is unclassified, and the user equipment needs to assign the obtained corpus by type to the sub-language models according to the preset text classification standard, that is, to classify the corpus. For example, for the above-mentioned computer technology news text containing economic content such as the stock market, part of its content reads: "Dell estimates that its first-quarter revenue was about 14.2 billion US dollars, with earnings per share of 33 cents. The company previously forecast revenue of 14.2 to 14.6 billion US dollars for the quarter, with earnings per share of 35 to 38 cents, while analysts on average forecast Dell's revenue for the same period at 14.52 billion US dollars, with earnings per share of 38 cents." A text classification technique is used to perform automatic text classification on the unclassified corpus. The classification process is divided into two stages: a training stage and a classification stage. In the training stage, word segmentation is performed on the texts in the classified corpus collection; the word segmentation process is the same as in the prior art and is not described here again. After word segmentation, the above content may be represented as "戴/尔/公司/估计/,/其/第一/季度/收入/约/为…"; for convenience of presentation, the embodiments of the present invention use "/" to indicate the boundary between words. Stop words such as "地" and "的" are removed from the segmented text, and a word-vector representation of the text is then established according to the words appearing in the text and the ratio of each word's frequency to the total number of words: each distinct word corresponds to one dimension of the vector, and the ratio of its frequency to the total word count is the value of that dimension. The set of word vectors of the different texts in the corpus is then processed by dimensionality reduction and the like and combined with the known class labels to train a classifier. In the classification stage, the corpus text to be classified is processed into a vector representation and input into the classifier, which classifies the text into types such as sports or finance. The corpus is assigned to the corresponding sub-language models according to the resulting classes, and the probabilities of the corresponding sub-language models are updated.
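As a rough sketch only, the word-vector classification described above might be prototyped as follows; the stop-word list, the toy training corpus, and the nearest-centroid classifier (standing in for whatever classifier and dimensionality reduction an actual implementation uses) are all assumptions made for illustration.

```python
from collections import Counter

STOP_WORDS = {"的", "地"}  # illustrative stop-word list

def word_vector(tokens):
    """Term-frequency vector: each word's count divided by the total word count."""
    tokens = [t for t in tokens if t not in STOP_WORDS]
    counts = Counter(tokens)
    total = sum(counts.values()) or 1
    return {w: c / total for w, c in counts.items()}

def cosine(u, v):
    dot = sum(u[w] * v.get(w, 0.0) for w in u)
    nu = sum(x * x for x in u.values()) ** 0.5
    nv = sum(x * x for x in v.values()) ** 0.5
    return dot / (nu * nv) if nu and nv else 0.0

def centroid(vectors):
    merged = Counter()
    for v in vectors:
        merged.update(v)
    return {w: s / len(vectors) for w, s in merged.items()}

# Training stage: one centroid per labelled class (toy, pre-segmented corpus).
training = {
    "technology": [["戴尔", "公司", "发布", "服务器"]],
    "economics": [["季度", "收入", "每股", "收益"]],
}
centroids = {label: centroid([word_vector(t) for t in texts])
             for label, texts in training.items()}

def classify(tokens, top_n=2):
    """Classification stage: return the top_n classes closest to the text's vector."""
    vec = word_vector(tokens)
    ranked = sorted(centroids, key=lambda c: cosine(vec, centroids[c]), reverse=True)
    return ranked[:top_n]

print(classify(["戴尔", "公司", "季度", "收入"]))  # e.g. ['technology', 'economics']
```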
In particular, character-level 2-Gram and 3-Gram statistical models are built from the texts in the corpus as the character continuation model. For example, assuming a corpus text contains the text "知识库构建模块" (knowledge base building module), the character 2-Gram pairs built are "知识", "识库", "库构", "构建", "建模", and "模块", and the statistical probability of each 2-Gram pair occurring in the classified corpus to which the text belongs is then calculated. Further, for the above-mentioned computer technology news text containing economic content such as the stock market, the character 2-Gram pairs built include "戴尔", "而公", "公司", "司估", "估计", "其第", "第一", "一季", "季度", and so on. First, the number of occurrences of each single character is counted, and the proportion of that character in the entire corpus is calculated as its occurrence probability. For each 2-Gram pair, the number of times the second character appears after the first character is counted. For example, "戴尔" indicates one occurrence of the character "尔" following the character "戴"; if "尔" appears after "戴" 1000 times in the texts contained in the entire corpus, the recorded count of "尔" following "戴" is 1000, and likewise the count of "帽" following "戴" may be, say, 10000. Many different characters may follow "戴", each with a different count; the total number of times "戴" is followed by any character is counted, for example 500000, and the probability of each possibility is then calculated. The probability of "戴" being followed by "尔" may be roughly estimated as 1000/500000 = 0.2%, while the probability of "戴" being followed by "帽" may be roughly estimated as 10000/500000 = 2%. The 3-Gram statistical model is obtained in the same way as the 2-Gram statistical model and is not described here again. The character 2-Gram and 3-Gram continuation models facilitate locating error positions in the text to be processed in the subsequent process.
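The character 2-Gram counting just described could be sketched roughly as follows; the toy corpus and the decision to leave unseen pairs unsmoothed are assumptions for illustration, not details taken from the disclosure.

```python
from collections import Counter, defaultdict

def train_char_bigrams(texts):
    """Count character pairs and turn the counts into conditional probabilities P(c2 | c1)."""
    pair_counts = defaultdict(Counter)
    for text in texts:
        for c1, c2 in zip(text, text[1:]):
            pair_counts[c1][c2] += 1
    probs = {}
    for c1, followers in pair_counts.items():
        total = sum(followers.values())
        for c2, n in followers.items():
            probs[(c1, c2)] = n / total
    return probs

bigrams = train_char_bigrams(["知识库构建模块", "戴尔公司估计其第一季度收入"])
print(bigrams[("知", "识")])            # P("识" | "知") in this toy corpus
print(bigrams.get(("戴", "帽"), 0.0))   # unseen pair -> 0.0 here; a real model would smooth
```

The 3-Gram case follows the same counting pattern with character triples instead of pairs.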
Further, part-of-speech tagging may also be performed on the segmented corpus, and a 2-gram part-of-speech statistical model and a 3-gram part-of-speech statistical model may be built as the part-of-speech continuation model, where "2-gram" in the 2-gram part-of-speech statistical model refers to two word units (two words or two characters). For example, assuming the corpus contains "知识库构建模块", word segmentation yields the three words "知识库" (knowledge base), "构建" (build), and "模块" (module), whose tagged parts of speech are noun, verb, and noun. The 2-gram part-of-speech units built are "知识库构建" and "构建模块", whose part-of-speech sequences are noun + verb and verb + noun respectively, and the 3-gram part-of-speech unit built is "知识库构建模块", whose part-of-speech sequence is noun + verb + noun; that is, when the 2-gram and 3-gram part-of-speech statistical models are built, the corresponding parts of speech also need to be tagged. The calculation of the specific statistical models is similar to the method of building the character 2-Gram and 3-Gram statistical models described above, and is not described again in the present invention.

Finally, phonetically-similar and visually-similar character dictionaries may be built by using encoding methods such as Pinyin and the Wubi input method, for example "处" – "出", "形" – "型", and "磬" – "罄". This is not described in detail in the present invention.
S202. The user equipment obtains two or more text types of the text to be corrected under the preset text classification standard.

The user equipment may obtain the text to be corrected in various ways; for example, the user may enter it directly into the user equipment through a user interface, or the user may transmit it directly to the user equipment through an input interface such as a configuration management interface. The user equipment then uses a text classification technique to perform automatic text classification on the text to be corrected. The classification process is divided into two stages: a training stage and a classification stage. In the training stage, word segmentation is performed; the word segmentation process is the same as in the prior art and is not described here again. Stop words such as "地" and "的" are removed from the segmented text, a word-vector representation of the text is established according to the words appearing in the text and the ratio of each word's frequency to the total number of words, and a classifier is trained through dimensionality reduction and the like in combination with the known class labels. In the classification stage, the text to be corrected is processed into a vector representation and input into the classifier, which classifies the text into types such as sports or finance. The text to be corrected is classified into the corresponding sub-language models according to the resulting classes, and the probabilities of the corresponding sub-language models are updated.
S203. The user equipment obtains the mixed language model.

First, the user equipment may obtain, from the correction knowledge base, a sub-language model to be combined corresponding to each text type of the text to be corrected. The correction knowledge base may include: the sub-language models, the character continuation model, the part-of-speech continuation model, the phonetically-similar character dictionary, the visually-similar character dictionary, and so on. Since the correction knowledge base covers many text types, only the sub-language models corresponding to the text types of the text to be corrected need to be selected and combined to obtain the mixed language model.

Then, the user equipment may obtain, through calculation, the weight of each sub-language model for the text to be corrected. Finally, the two or more obtained sub-language models to be combined are combined according to the weights of the respective sub-language models to obtain the mixed language model. Specifically, an expectation-maximization (EM) algorithm may be used to obtain the weight of each sub-language model to be combined in the mixed language model, and the sub-language models to be combined are then combined into the mixed language model according to their respective weights. Of course, each sub-language model may also be multiplied by its corresponding weight to achieve the effect of combining the models according to the weights to obtain the mixed language model.
Specifically, the mixed language model is formed by combining the sub-language models through linear interpolation. For N-Gram sub-language models, the mixed language model is expressed in terms of the sub-language models as follows:

P(W) = ∏_{m=1…i} Σ_{j=1…k} λj · P(wm | wm−N+1 … wm−1; Mj)

where i is the length of the character string to be corrected, k is the number of sub-language models, λj is the weight of the j-th sub-language model, P(wm | wm−N+1 … wm−1; Mj) is the probability, in sub-language model Mj, of the character wm appearing after the preceding character sequence, and 1 ≤ j ≤ k. P(W) is used in the same way as in the prior-art method of obtaining P(W) based on noisy channel theory, which is not described here again.
According to the expectation-maximization algorithm, a likelihood function of the text to be processed can be given for the above mixed language model. Based on this likelihood function, the sub-language model weights λj that maximize the likelihood need to be found, and those λj are then the weights of the sub-language models. Assuming that the text to be processed of a given text type contains T characters in total, the update formula for the weight corresponding to that text type is:

λj(t) = (1/t) · Σ_{τ=1…t} [ λj(t−1) · P(wτ | w1 … wτ−1; Mj) / Σ_{l=1…k} λl(t−1) · P(wτ | w1 … wτ−1; Ml) ]

where t denotes the t-th weight estimate (in this embodiment of the present invention t finally equals the number T of characters in the text to be processed), M denotes a language model, Mj denotes the j-th sub-language model in the mixed language model provided in this embodiment of the present invention, and k is the number of sub-language models determined to be involved for the text.
For example, assume that the sub-language models determined for the text to be corrected are the technology and economics sub-language models, so k = 2. In the initial state, set λ1 = λ2 = 0.1 or some other small positive value. For the first character {w1} of the text to be processed, the occurrence probabilities of the single character w1 in the technology and economics sub-language models are obtained as P(w1; M1) and P(w1; M2), and the weights are then calculated according to the above formula. At this point t = 1, and the update formula yields new values of λ1 and λ2. For the second character {w2} in the text, the conditional probabilities P(w2|w1; M1) and P(w2|w1; M2) of w2 appearing given w1 are calculated in the technology and economics sub-language models, and the weights are then updated following the same steps as above; subsequent steps are similar. The final weights are obtained after T updates.
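One plausible reading of this update procedure, a standard EM re-estimation of linear-interpolation weights, is sketched below; the per-model probabilities and the example numbers are assumptions for illustration only.

```python
def estimate_weights(word_probs, n_models, init=0.1, rounds=1):
    """EM-style re-estimation of interpolation weights.

    word_probs: one entry per character of the text; each entry lists
                P(w_t | history; M_j) for j = 1..n_models (assumed given).
    """
    lam = [init] * n_models
    for _ in range(rounds):
        resp_sums = [0.0] * n_models
        for probs in word_probs:
            denom = sum(l * p for l, p in zip(lam, probs)) or 1e-12
            for j, (l, p) in enumerate(zip(lam, probs)):
                resp_sums[j] += l * p / denom      # responsibility of model j for this character
        lam = [s / len(word_probs) for s in resp_sums]
    return lam

# Toy example with k = 2 (technology, economics) and T = 3 characters.
word_probs = [[0.004, 0.001], [0.002, 0.006], [0.003, 0.003]]
print(estimate_weights(word_probs, n_models=2, rounds=5))
```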
S204. The user equipment determines error positions of the text to be processed by using an error detection model, where an error position includes an erroneous character or an erroneous character string.

Before the user equipment determines the error positions of the text to be processed, it needs to obtain the error detection model from the correction knowledge base. The error detection model may include any one or more of: the character continuation model, the part-of-speech continuation model, the phonetically-similar character dictionary, and the visually-similar character dictionary. In particular, the error detection model may also include other models, which are not described again in the present invention. In this embodiment, the character continuation model, the part-of-speech continuation model, the phonetically-similar character dictionary, the visually-similar character dictionary, and so on have already been obtained in step S201, and the user equipment may select one or more of them as the error detection model according to a preset detection rule.
First, the user equipment may perform word segmentation and part-of-speech tagging on the text to be processed; for the specific process, refer to the related explanation in step S201, which is not repeated here. For single characters or scattered character strings that appear consecutively after word segmentation, the character continuation model may be used to check whether their continuation is correct. Meanwhile, the part-of-speech continuation model may be used to check the continuation of parts of speech; for the specific process, refer to the prior art. Common text errors can be divided into two categories: "non-multi-word errors" and "true multi-word errors". A "non-multi-word error" breaks the surface structure of a word and produces a string of single characters, so that the original multi-character word string cannot be found in the word segmentation dictionary. For example, for "忠耿耿" the correct word is "忠心耿耿", but because it cannot be found in the word segmentation dictionary, it is segmented by the word segmentation program into the individual characters or words "忠", "耿", "耿". Statistically, the probability of "耿" appearing after "忠" is very small, so this kind of error can be detected by setting an appropriate threshold; such errors can therefore be detected with the character continuation model. A "true multi-word error" is an erroneous string consisting of multi-character words that do exist in the segmentation lexicon, so no word-level error usually appears; such an error is generally an error of grammatical structure or part-of-speech collocation. For "我我的书" the correct string is "我的书", and for "处长时间" the correct string is "延长时间": in "处长时间", "处长" is a noun and the following "时间" is also a noun, and statistically the probability of a noun being followed directly by a noun is small, whereas the correct "延长时间" is a verb + noun collocation, which is statistically more reasonable. Such errors can therefore be found by using the part-of-speech continuation model to judge part-of-speech continuation relations. For methods of determining error positions by means of the phonetically-similar dictionary, the visually-similar dictionary, and the like, refer to the prior art. In particular, the above error-position detection methods are only illustrative; any variation or replacement readily conceivable by a person skilled in the art within the technical scope disclosed in the present invention shall fall within the protection scope of the present invention.
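As an illustrative sketch only, the threshold check on the character continuation model might look like the following; the probability table, the threshold value, and the function name are hypothetical.

```python
def find_error_positions(chars, bigram_probs, threshold=1e-4):
    """Flag positions where the continuation probability P(c[i] | c[i-1]) is suspiciously low."""
    suspects = []
    for i in range(1, len(chars)):
        if bigram_probs.get((chars[i - 1], chars[i]), 0.0) < threshold:
            suspects.append(i)          # index of the character that breaks the continuation
    return suspects

# Toy probability table: "忠" is rarely followed by "耿" in the corpus.
bigram_probs = {("忠", "心"): 0.02, ("心", "耿"): 0.01, ("耿", "耿"): 0.03, ("忠", "耿"): 0.00001}
print(find_error_positions(list("忠耿耿"), bigram_probs))  # flags position 1 ("耿" right after "忠")
```

The same thresholding idea applies to the part-of-speech continuation model, with part-of-speech pairs in place of character pairs.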
It should be noted that, in the prior art, a text correction method based on noisy channel theory may include: setting the first character of the character string sequence to be corrected as the editing position, performing a correction operation on the string to be corrected according to the character continuation relations in the language model to generate a new set of N candidate string sequences, then setting the second character position of each string sequence in the newly generated set as the editing position and repeating the above operation. By limiting the size of N and the depth of each editing operation, it can be guaranteed that N corrected strings with relatively high probability are obtained after a limited number of operations. However, this procedure assumes by default that errors may exist anywhere in the string of the text to be corrected, so correction operations need to be performed at nearly every position in the text to be corrected; the operation is complicated, and if the string sequence of the text to be corrected is long, a state explosion may occur. In the embodiments of the present invention, error positions are screened before correction, which effectively reduces the number of correction operations and improves correction efficiency.
S205. The user equipment corrects the text to be corrected according to the mixed language model to obtain correction suggestion text.

First, a character string sequence to be corrected may be generated from the error position.

Then, the user equipment may perform a correction operation on the character string sequence to be corrected by means of error detection model matching or other methods to obtain at least one corrected character string sequence, and the at least one corrected character string sequence may form a set of corrected character string sequences. For the specific correction operation, refer to the prior art.
Next, the user equipment may obtain the m characters before and the n characters after the error position in the text to be corrected and combine them with the corrected character string sequences to obtain at least one screening sequence, where m and n are positive integers or 0 and may be preset values or dynamic values. In this way, the corrected character string sequences are linked more closely to the context of the text to be corrected. For example, if the error position is determined to be the three characters "断续续" in "声音断续续的", the character string sequence to be corrected is the string consisting of those three characters "断续续". Correcting the string to be corrected yields the corrected string sequence "断断续续", and taking the two characters before and the two characters after the error position yields "声音断断续续的" as a screening sequence. Using the statistical language model, it can be calculated that the probability of "断断续续" appearing after "声音" is high, which indicates that the corrected string generated here is appropriate. Of course, in practical applications there may be multiple corrected string sequences after correction; this is only an illustrative description. Finally, according to the mixed language model, the user equipment may obtain, from the at least one screening sequence by using the noisy channel probability model, the one string sequence with the highest probability of being the ideal string as the correction suggestion text, or it may obtain, according to the mixed language model and by using the noisy channel probability model, the several string sequences with the highest probabilities of being the ideal string from the at least one screening sequence as the correction suggestion text. The correction suggestion text may be provided to the user through a human-machine interaction interface of the user equipment for the user to confirm the correction scheme; corrected string positions may be emphasized by underlining or the like, and corrections of different error types may also be marked with symbols or shading of different colors.
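A minimal sketch of this final ranking step is given below; the bigram scorer standing in for the mixed language model, the channel probabilities, and the candidate strings are all assumptions made for illustration.

```python
import math

def sequence_log_prob(chars, bigram_probs, floor=1e-8):
    """log P(sequence) under a character bigram model (stand-in for the mixed language model)."""
    return sum(math.log(bigram_probs.get((a, b), floor))
               for a, b in zip(chars, chars[1:]))

def rank_candidates(before, candidates, after, channel_probs, bigram_probs, top_n=1):
    """Noisy-channel ranking: score = log P(error span | candidate) + log P(screening sequence)."""
    scored = []
    for cand in candidates:
        screening = before + cand + after            # screening sequence with surrounding context
        score = math.log(channel_probs.get(cand, 1e-8)) + sequence_log_prob(screening, bigram_probs)
        scored.append((score, cand))
    return [c for _, c in sorted(scored, reverse=True)[:top_n]]

# Toy data: the error span "断续续" with two hypothetical candidate corrections.
bigram_probs = {("声", "音"): 0.3, ("音", "断"): 0.1, ("断", "断"): 0.2, ("断", "续"): 0.3,
                ("续", "续"): 0.25, ("续", "的"): 0.2, ("断", "开"): 0.001, ("开", "的"): 0.01}
channel_probs = {"断断续续": 0.5, "断开": 0.01}   # assumed P(observed error | candidate)
print(rank_candidates("声音", ["断断续续", "断开"], "的", channel_probs, bigram_probs))  # ['断断续续']
```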
It should be noted that the order of the steps of the text correction method provided in the embodiments of the present invention may be adjusted appropriately, and steps may also be added or removed according to the situation. Any varied method readily conceivable by a person skilled in the art within the technical scope disclosed in the present invention shall fall within the protection scope of the present invention, and is therefore not described again.

In the text correction method provided in the embodiments of the present invention, the text to be corrected is classified and the corresponding mixed language model is then obtained, so that the mixed language model on which the correction is based can change dynamically with the text types of the text to be corrected, and this language model can reflect the linguistic phenomena of the text more accurately. When the preset text classification standard or the text types of the text to be corrected differ, different correction choices can be provided for the text to be corrected, so correction errors can be reduced and correction flexibility and correctness can be improved. Meanwhile, because error positions are screened, the number of correction operations is effectively reduced and correction efficiency is improved.

As an example, consider the text "Dell estimates that its first-quarter revenue was about 14.2 billion US dollars, with earnings per share of 33 cents. The company previously forecast revenue of 14.2 to 14.6 billion US dollars for the quarter, with earnings per share of 35 to 38 cents, while analysts on average forecast Dell's revenue for the same period at 14.52 billion US dollars, with earnings per share of 38 cents." In the Chinese text, the word "收入" (revenue) is misrecognized by OCR (Optical Character Recognition) software as "收人", producing an error. Prior-art correction can correct "收人" to "收入", but the proper noun "戴尔" (Dell) is mistakenly regarded as an error and deleted, yielding the erroneous correction "公司估计". With the present invention, selecting the technology sub-language improves recognition of the term "戴尔公司" (Dell Inc.), so no such error occurs. Likewise, the present invention may also apply named entity recognition alongside the correction, so that named entities that may cause anomalies in word segmentation and part-of-speech tagging are recognized and excluded from correction processing.
An embodiment of the present invention provides a user equipment 30, as shown in FIG. 3, including: an obtaining unit 301, configured to obtain two or more text types of text to be corrected under a preset text classification standard.

For example, the preset text classification standard may be any one of: language environment, theme background, author, writing style, and subject matter.

The obtaining unit 301 is further configured to obtain, from the correction knowledge base, a sub-language model to be combined corresponding to each text type of the text to be corrected, and send information about the two or more obtained sub-language models to be combined to a generating unit 302.

The generating unit 302 is configured to receive the information about the two or more obtained sub-language models to be combined sent by the obtaining unit 301, combine the two or more obtained sub-language models to be combined into a mixed language model, and send information about the mixed language model to a correcting unit 303.

The generating unit 302 is specifically configured to: obtain the weight of each text type in the text to be corrected; and combine, according to the weights of the respective text types, the two or more obtained sub-language models to be combined to obtain the mixed language model.

The correcting unit 303 is configured to receive the information about the mixed language model sent by the generating unit 302, and correct the text to be corrected according to the mixed language model to obtain correction suggestion text.

The correcting unit 303 may be specifically configured to: generate a character string sequence to be corrected from the error position; perform a correction operation on the character string sequence to be corrected to obtain at least one corrected character string sequence; obtain the m characters before and the n characters after the error position in the text to be corrected, and combine them with the corrected character string sequences to obtain at least one screening sequence; and, according to the mixed language model, obtain from the at least one screening sequence, by using the noisy channel probability model, the one string sequence with the highest probability of being the ideal string as the correction suggestion text, or, according to the mixed language model, obtain from the at least one screening sequence, by using the noisy channel probability model, the several string sequences with the highest probabilities of being the ideal string as the correction suggestion text.

In this way, the obtaining unit classifies the text to be corrected and the generating unit then obtains the corresponding mixed language model, so that the mixed language model on which the correcting unit bases its correction can change dynamically with the text types of the text to be corrected. When the preset text classification standard or the text types of the text to be corrected differ, different correction choices can be provided for the text to be corrected, so correction errors can be reduced and correction flexibility and correctness can be improved.
Further, as shown in FIG. 4, the user equipment 30 may further include: the obtaining unit 301, configured to obtain the preset text classification standard and send the preset text classification standard to an establishing unit 304; and

the establishing unit 304, configured to receive the preset text classification standard sent by the obtaining unit 301 and establish two or more sub-language models according to the text types under the preset text classification standard.

A model obtaining unit 305 is configured to obtain the error detection model from the correction knowledge base and send information about the error detection model to a determining unit 306.

For example, the error detection model may include any one or more of: the character continuation model, the part-of-speech continuation model, the phonetically-similar character dictionary, and the visually-similar character dictionary.

The determining unit 306 is configured to receive the information about the error detection model sent by the model obtaining unit 305, and determine error positions of the text to be processed by using the error detection model, where an error position includes an erroneous character or an erroneous character string.

A person skilled in the art may clearly understand that, for convenience and brevity of description, for the specific operating steps of the user equipment described above, reference may be made to the corresponding processes in the foregoing embodiments of the text correction method, and details are not described herein again.

With the user equipment provided in the embodiments of the present invention, the text to be corrected is classified and the corresponding mixed language model is then obtained, so that the mixed language model on which the correction is based can change dynamically with the text types of the text to be corrected, and this language model can reflect the linguistic phenomena of the text more accurately. When the preset text classification standard or the text types of the text to be corrected differ, different correction choices can be provided for the text to be corrected, so correction errors can be reduced and correction flexibility and correctness can be improved. Meanwhile, because error positions are screened, the number of correction operations is effectively reduced and correction efficiency is improved.
A person skilled in the art may clearly understand that, for convenience and brevity of description, for the specific working processes of the apparatus and units described above, reference may be made to the corresponding processes in the foregoing method embodiments, and details are not described herein again.

In the several embodiments provided in this application, it should be understood that the disclosed system, apparatus, and method may be implemented in other manners. For example, the described apparatus embodiments are merely illustrative. The division of the units is merely a division by logical function, and there may be other division manners in actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not performed. In addition, the mutual couplings, direct couplings, or communication connections shown or discussed may be implemented through some interfaces, and the indirect couplings or communication connections between apparatuses or units may be electrical, mechanical, or in other forms.

The units described as separate components may or may not be physically separate, and components displayed as units may or may not be physical units; they may be located in one place, or may be distributed over multiple network elements. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.

In addition, the functional units in the embodiments of the present invention may be integrated into one processing unit, or each of the units may exist alone physically, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware, or in the form of hardware plus a software functional unit.

An embodiment of the present invention provides a user equipment 50, as shown in FIG. 5, including: a processor 501, configured to obtain two or more text types of text to be corrected under a preset text classification standard.
For example, the preset text classification standard may be any one of: language environment, theme background, author, writing style, and subject matter.

The processor 501 is further configured to: obtain, from the correction knowledge base, a sub-language model to be combined corresponding to each text type of the text to be corrected; combine the two or more obtained sub-language models to be combined into a mixed language model; and correct the text to be corrected according to the mixed language model to obtain correction suggestion text.

The processor 501 is specifically configured to: obtain the weight of each text type in the text to be corrected; and combine, according to the weights of the respective text types, the two or more obtained sub-language models to be combined to obtain the mixed language model.

The processor 501 is specifically configured to: generate a character string sequence to be corrected from the error position; perform a correction operation on the character string sequence to be corrected to obtain at least one corrected character string sequence; obtain the m characters before and the n characters after the error position in the text to be corrected, and combine them with the corrected character string sequences to obtain at least one screening sequence; and, according to the mixed language model, obtain from the at least one screening sequence, by using the noisy channel probability model, the one string sequence with the highest probability of being the ideal string as the correction suggestion text, or, according to the mixed language model, obtain from the at least one screening sequence, by using the noisy channel probability model, the several string sequences with the highest probabilities of being the ideal string as the correction suggestion text.

In this way, the processor classifies the text to be corrected and then obtains the corresponding mixed language model, so that the mixed language model on which the correction is based can change dynamically with the text types of the text to be corrected. When the preset text classification standard or the text types of the text to be corrected differ, different correction choices can be provided for the text to be corrected, so correction errors can be reduced and correction flexibility and correctness can be improved.
Further, the processor 501 is also configured to obtain the preset text classification standard.

As shown in FIG. 6, the user equipment 50 further includes: a memory 502, configured to establish two or more sub-language models according to the types under the preset text classification standard and send information about the sub-language models to the processor 501.

The processor 501 is further configured to obtain the error detection model from the correction knowledge base. For example, the error detection model may include any one or more of: the character continuation model, the part-of-speech continuation model, the phonetically-similar character dictionary, and the visually-similar character dictionary.

The processor 501 is further configured to determine error positions of the text to be processed by using the error detection model, where an error position includes an erroneous character or an erroneous character string. A person skilled in the art may clearly understand that, for convenience and brevity of description, for the specific use of the memory and the processor in the user equipment described above, reference may be made to the corresponding processes in the foregoing embodiments of the text correction method, and details are not described herein again.
A person of ordinary skill in the art may understand that all or some of the steps of the foregoing method embodiments may be implemented by a program instructing relevant hardware. The foregoing program may be stored in a computer-readable storage medium; when the program is executed, the steps of the foregoing method embodiments are performed. The foregoing storage medium includes any medium that can store program code, such as a ROM, a RAM, a magnetic disk, or an optical disc.

The foregoing descriptions are merely specific embodiments of the present invention, but the protection scope of the present invention is not limited thereto. Any variation or replacement readily conceivable by a person skilled in the art within the technical scope disclosed in the present invention shall fall within the protection scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims

1. A text correction method, comprising:

obtaining two or more text types of text to be corrected under a preset text classification standard; obtaining, from a correction knowledge base, a sub-language model to be combined corresponding to each text type of the text to be corrected;

combining the two or more obtained sub-language models to be combined into a mixed language model; and correcting the text to be corrected according to the mixed language model to obtain correction suggestion text.
2. The method according to claim 1, wherein the preset text classification standard is any one of: language environment, theme background, author, writing style, and subject matter.

3. The method according to claim 2, wherein the method further comprises: obtaining the preset text classification standard; and

establishing two or more sub-language models according to the text types under the preset text classification standard.
4. The method according to claim 3, wherein the combining the two or more obtained sub-language models to be combined into a mixed language model comprises:

obtaining a weight of each text type in the text to be corrected; and

combining, according to the weights of the respective text types, the two or more obtained sub-language models to be combined to obtain the mixed language model.
5. The method according to any one of claims 1 to 4, wherein before the correcting the text to be corrected according to the mixed language model to obtain correction suggestion text, the method further comprises:

obtaining an error detection model from the correction knowledge base; and

determining an error position of the text to be processed by using the error detection model, wherein the error position comprises an erroneous character or an erroneous character string.
6. The method according to claim 5, wherein the error detection model comprises any one or more of: a character continuation model, a part-of-speech continuation model, a phonetically-similar character dictionary, and a visually-similar character dictionary.
7. The method according to claim 5 or 6, wherein the correcting the text to be corrected according to the mixed language model to obtain correction suggestion text comprises: generating a character string sequence to be corrected from the error position;

performing a correction operation on the character string sequence to be corrected to obtain at least one corrected character string sequence;

obtaining the m characters before and the n characters after the error position in the text to be corrected, and combining them with the corrected character string sequence to obtain at least one screening sequence; and

obtaining, according to the mixed language model and by using a noisy channel probability model, from the at least one screening sequence, the one string sequence with the highest probability of being the ideal string as the correction suggestion text, or

obtaining, according to the mixed language model and by using a noisy channel probability model, from the at least one screening sequence, the several string sequences with the highest probabilities of being the ideal string as the correction suggestion text.
8. A user equipment, comprising:

an obtaining unit, configured to obtain two or more text types of text to be corrected under a preset text classification standard;

wherein the obtaining unit is further configured to obtain, from a correction knowledge base, a sub-language model to be combined corresponding to each text type of the text to be corrected, and send information about the two or more obtained sub-language models to be combined to a generating unit;

the generating unit is configured to receive the information about the two or more obtained sub-language models to be combined sent by the obtaining unit, combine the two or more obtained sub-language models to be combined into a mixed language model, and send information about the mixed language model to a correcting unit; and

the correcting unit is configured to receive the information about the mixed language model sent by the generating unit, and correct the text to be corrected according to the mixed language model to obtain correction suggestion text.
9. The user equipment according to claim 8, wherein the preset text classification standard is any one of: language environment, theme background, author, writing style, and subject matter.

10. The user equipment according to claim 9, wherein the user equipment further comprises:

the obtaining unit, configured to obtain the preset text classification standard and send the preset text classification standard to an establishing unit; and

the establishing unit, configured to receive the preset text classification standard sent by the obtaining unit, and establish two or more sub-language models according to the text types under the preset text classification standard.
11. The user equipment according to claim 10, wherein the generating unit is specifically configured to:

obtain a weight of each text type in the text to be corrected; and

combine, according to the weights of the respective text types, the two or more obtained sub-language models to be combined to obtain the mixed language model.
12. The user equipment according to any one of claims 8 to 11, wherein the user equipment further comprises:

a model obtaining unit, configured to obtain an error detection model from the correction knowledge base and send information about the error detection model to a determining unit; and

the determining unit, configured to receive the information about the error detection model sent by the model obtaining unit, and determine an error position of the text to be processed by using the error detection model, wherein the error position comprises an erroneous character or an erroneous character string.
13. The user equipment according to claim 12, wherein the error detection model comprises any one or more of: a character continuation model, a part-of-speech continuation model, a phonetically-similar character dictionary, and a visually-similar character dictionary.
14. The user equipment according to claim 12 or 13, wherein the correcting unit is specifically configured to:

generate a character string sequence to be corrected from the error position;

perform a correction operation on the character string sequence to be corrected to obtain at least one corrected character string sequence;

obtain the m characters before and the n characters after the error position in the text to be corrected, and combine them with the corrected character string sequence to obtain at least one screening sequence; and, according to the mixed language model, obtain, from the at least one screening sequence by using a noisy channel probability model, the one string sequence with the highest probability of being the ideal string as the correction suggestion text, or

according to the mixed language model, obtain, from the at least one screening sequence by using a noisy channel probability model, the several string sequences with the highest probabilities of being the ideal string as the correction suggestion text.
PCT/CN2013/073382 2012-09-10 2013-03-28 Text correcting method and user equipment WO2014036827A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201210332263.3 2012-09-10
CN201210332263.3A CN103678271B (en) 2012-09-10 2012-09-10 A kind of text correction method and subscriber equipment

Publications (1)

Publication Number Publication Date
WO2014036827A1 true WO2014036827A1 (en) 2014-03-13

Family

ID=50236498

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2013/073382 WO2014036827A1 (en) 2012-09-10 2013-03-28 Text correcting method and user equipment

Country Status (2)

Country Link
CN (1) CN103678271B (en)
WO (1) WO2014036827A1 (en)



Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101021838A (en) * 2007-03-02 2007-08-22 华为技术有限公司 Text handling method and system
CN101031913A (en) * 2004-09-30 2007-09-05 皇家飞利浦电子股份有限公司 Automatic text correction
CN101655837A (en) * 2009-09-08 2010-02-24 北京邮电大学 Method for detecting and correcting error on text after voice recognition
JP2011113099A (en) * 2009-11-21 2011-06-09 Kddi R & D Laboratories Inc Text correction program and method for correcting text containing unknown word, and text analysis server
CN102165435A (en) * 2007-08-01 2011-08-24 金格软件有限公司 Automatic context sensitive language generation, correction and enhancement using an internet corpus

Also Published As

Publication number Publication date
CN103678271B (en) 2016-09-14
CN103678271A (en) 2014-03-26

Similar Documents

Publication Publication Date Title
US11810568B2 (en) Speech recognition with selective use of dynamic language models
WO2014036827A1 (en) Text correcting method and user equipment
US11693894B2 (en) Conversation oriented machine-user interaction
US20210201143A1 (en) Computing device and method of classifying category of data
US10114809B2 (en) Method and apparatus for phonetically annotating text
WO2018219023A1 (en) Speech keyword identification method and device, terminal and server
CN105988990B (en) Chinese zero-reference resolution device and method, model training method and storage medium
US9069753B2 (en) Determining proximity measurements indicating respective intended inputs
JP5901001B1 (en) Method and device for acoustic language model training
US9881010B1 (en) Suggestions based on document topics
US8176419B2 (en) Self learning contextual spell corrector
US20090192781A1 (en) System and method of providing machine translation from a source language to a target language
WO2018076450A1 (en) Input method and apparatus, and apparatus for input
US10902211B2 (en) Multi-models that understand natural language phrases
WO2017161899A1 (en) Text processing method, device, and computing apparatus
US9082404B2 (en) Recognizing device, computer-readable recording medium, recognizing method, generating device, and generating method
US20150242386A1 (en) Using language models to correct morphological errors in text
Fusayasu et al. Word-error correction of continuous speech recognition based on normalized relevance distance
US9251141B1 (en) Entity identification model training
WO2020052060A1 (en) Method and apparatus for generating correction statement
CN117043859A (en) Lookup table cyclic language model
CN102955770A (en) Method and system for automatic recognition of pinyin
US20230186898A1 (en) Lattice Speech Corrections
Ray et al. Iterative delexicalization for improved spoken language understanding
CN111382322B (en) Method and device for determining similarity of character strings

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 13835272

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 13835272

Country of ref document: EP

Kind code of ref document: A1