WO2014036827A1 - Text correcting method and user equipment - Google Patents

Text correcting method and user equipment Download PDF

Info

Publication number
WO2014036827A1
WO2014036827A1 (PCT/CN2013/073382)
Authority
WO
WIPO (PCT)
Prior art keywords
text
corrected
model
correction
language model
Prior art date
Application number
PCT/CN2013/073382
Other languages
French (fr)
Chinese (zh)
Inventor
胡楠
杨锦春
Original Assignee
华为技术有限公司 (Huawei Technologies Co., Ltd.)
Priority date
Filing date
Publication date
Application filed by 华为技术有限公司 (Huawei Technologies Co., Ltd.)
Publication of WO2014036827A1 publication Critical patent/WO2014036827A1/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/232Orthographic correction, e.g. spell checking or vowelisation

Definitions

  • the present invention relates to the field of language processing, and in particular, to a text correction method and user equipment.
  • W is the original string sequence <W1, W2, ..., Wn>, that is, the completely correct text; after passing through the noise channel, the noisy text O = <O1, O2, ..., On> is produced.
  • the method of text correction using noise channel theory is to establish a noise channel probability model and find a string sequence W' such that, given the observed string sequence O, the probability of occurrence of W' is the largest; the string sequence O is the text to be corrected.
  • the string sequence W' is an ideal corrected text, which can also be called an ideal string, but the ideal corrected text is not necessarily identical to the correct text W.
  • the string sequence W' is the string with the highest probability P(W|O).
  • P(O|W) is called the channel probability or generation model.
  • the probability P(W) is the probability of occurrence of the string sequence W in the language model.
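The bullets above describe the noisy-channel formulation W' = argmax_W P(W)·P(O|W). A minimal illustrative sketch of scoring candidates this way is given below; the probability tables and candidate set are toy assumptions, not values from the patent.

```python
# Minimal noisy-channel correction sketch: W' = argmax_W P(W) * P(O | W).
# All probabilities below are illustrative toy values, not real model output.

def correct(observed, candidates, language_model, channel_model):
    """Return the candidate W with the highest P(W) * P(O | W)."""
    best, best_score = observed, 0.0
    for w in candidates:
        score = language_model.get(w, 1e-9) * channel_model.get((observed, w), 1e-9)
        if score > best_score:
            best, best_score = w, score
    return best

# P(W): how likely each candidate string is under the language model.
language_model = {"the cat sat": 3e-4, "the cot sat": 1e-6}
# P(O | W): how likely the noisy channel turns W into the observed string O.
channel_model = {("the cot sat", "the cat sat"): 0.02,
                 ("the cot sat", "the cot sat"): 0.90}

print(correct("the cot sat", ["the cat sat", "the cot sat"],
              language_model, channel_model))   # -> "the cat sat"
```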
  • Embodiments of the present invention provide a text correction method and user equipment for improving correction flexibility and correctness.
  • the embodiment of the present invention uses the following technical solutions:
  • a text correction method including:
  • the obtained two or more sub-language models to be combined are combined into a mixed language model; the text to be corrected is corrected according to the mixed language model to obtain correction suggestion text.
  • the preset text classification standard is any one of: a language environment, a subject background, an author, a writing style, and a genre.
  • the method further includes:
  • Two or more sub-language models are established according to the text type in the preset text classification standard.
  • Combining the obtained two or more sub-language models to be combined into a mixed language model includes:
  • the method further includes:
  • An error location of the to-be-processed text is determined by the error detection model, the error location including an erroneous character or an erroneous character string.
  • the error detection model includes any one or more of: a word-continuation model, a part-of-speech continuation model, a similar-sound dictionary, and a similar-shape dictionary.
  • the correcting the text to be corrected according to the mixed language model to obtain correction suggestion text includes:
  • the first several string sequences in which the ideal string has a high probability of occurrence are obtained from the at least one screening sequence by the noise channel probability model and used as the correction suggestion text.
  • a user equipment including:
  • An obtaining unit configured to obtain two or more text types of the text to be corrected in a preset text classification standard
  • the obtaining unit is further configured to acquire, in the correction knowledge base, a sub-language model to be combined corresponding to each text type of the text to be corrected, and send the acquired information of the two or more sub-language models to be combined to the generating unit;
  • a generating unit configured to receive information about the acquired two or more sub-language models to be combined sent by the acquiring unit, and combine the acquired two or more sub-language models to be combined into a mixed language model, where The information of the mixed language model is sent to the correction unit;
  • a correction unit configured to receive information of the mixed language model sent by the generating unit, and correct the text to be corrected according to the mixed language model to obtain correction suggestion text.
  • the preset text classification standard is any one of: a language environment, a subject background, an author, a writing style, and a genre.
  • the user equipment further includes:
  • the obtaining unit is configured to acquire the preset text classification standard, and send the preset text classification standard to an establishing unit;
  • an establishing unit configured to receive the preset text classification standard sent by the acquiring unit, and establish two or more sub-language models according to the text type in the preset text classification standard.
  • the generating unit is specifically configured to:
  • the user equipment further includes:
  • a model obtaining unit configured to acquire an error detection model in the correction knowledge base, and send information of the error detection model to a determining unit;
  • a determining unit configured to receive information about the error detection model sent by the model obtaining unit, and determine an error location of the to-be-processed text by using the error detection model, where the error location includes an error character or an error string.
  • the error detection model includes any one or more of: a word-continuation model, a part-of-speech continuation model, a similar-sound dictionary, and a similar-shape dictionary.
  • the correction unit is specifically configured to:
  • the first few string sequences with a high probability of occurrence of the ideal string are obtained as the correction suggestion text.
  • An embodiment of the present invention provides a text correction method and a user equipment. The text correction method includes: acquiring two or more text types of the text to be corrected in a preset text classification standard; acquiring, in a correction knowledge base, a sub-language model to be combined corresponding to each text type of the text to be corrected; combining the acquired two or more sub-language models to be combined into a mixed language model; and correcting the text to be corrected according to the mixed language model to obtain correction suggestion text.
  • in this way, the mixed language model on which the correction is based can change dynamically with the text type of the text to be corrected; when the preset text classification standard or the text type of the text to be corrected differs, different correction choices can be provided for the text to be corrected, thereby reducing correction errors and improving correction flexibility and correctness.
  • FIG. 1 is a schematic flowchart of a text correction method according to an embodiment of the present invention
  • FIG. 2 is a schematic flowchart of another text correction method according to an embodiment of the present invention
  • FIG. 3 is a schematic structural diagram of a user equipment according to an embodiment of the present disclosure
  • FIG. 4 is a schematic structural diagram of another user equipment according to an embodiment of the present disclosure.
  • FIG. 5 is a schematic structural diagram of still another user equipment according to an embodiment of the present disclosure.
  • FIG. 6 is a schematic structural diagram of still another user equipment according to an embodiment of the present invention.
  • An embodiment of the present invention provides a text correction method, including:
  • the above preset text classification standard may include any one of: a language environment, a subject background, an author, a writing style, and a genre.
  • texts can be divided into text types such as sports, economics, politics, and technology according to the theme.
  • the user equipment may establish a corresponding sub-language model according to the text type of the theme background in the correction knowledge base.
  • the text classification technique can be utilized to determine the classification to which the text to be corrected belongs.
  • text classification technology can be used to determine that the text types to which the text belongs are technology and economics. The technology-class and economics-class sub-language models corresponding to the text types of the text to be corrected are selected in the correction knowledge base and are then combined into a mixed language model.
  • the mixed language model on which the correction is based can change dynamically with the text type of the text to be corrected, thereby reducing correction errors and improving correction flexibility and correctness.
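A minimal sketch of the S101–S104 flow described above is given below. The classifier, knowledge base, candidate generator, and whole-string sub-models are toy stand-ins, not the patent's implementation; only the overall flow (classify, fetch sub-language models, mix, rank candidates) follows the text.

```python
# Sketch of the S101-S104 flow. The classifier, knowledge base and candidate
# generator below are toy stand-ins; only the overall flow follows the text.

def mix_models(sub_models, weights):
    """S103: combine sub-language models by weighted (linear) interpolation."""
    def prob(candidate):
        # For simplicity each sub-model maps a whole candidate string to a probability.
        return sum(w * m.get(candidate, 1e-9) for m, w in zip(sub_models, weights))
    return prob

def correct_text(text, knowledge_base, classify_text, generate_candidates):
    types = classify_text(text)                       # S101: text types and proportions
    models = [knowledge_base[t] for t in types]       # S102: sub-language models
    mixed = mix_models(models, list(types.values()))  # S103: mixed language model
    return max(generate_candidates(text), key=mixed)  # S104: best correction candidate

# Toy stand-ins so the sketch runs end to end.
knowledge_base = {"tech": {"dell estimates revenue": 0.02},
                  "economy": {"dell estimates revenue": 0.03, "dell estimate revenue": 0.001}}
classify_text = lambda text: {"tech": 0.6, "economy": 0.4}
generate_candidates = lambda text: ["dell estimates revenue", "dell estimate revenue"]

print(correct_text("dell estimate revenue", knowledge_base,
                   classify_text, generate_candidates))
```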
  • another embodiment of the present invention provides a specific method 20 for text correction, including: S201.
  • the user equipment classifies the acquired corpus according to the preset text classification standard into each sub-language model according to the text type.
  • the user equipment first needs to obtain the preset text classification standard, which may include any one of: a language environment, a subject background, an author, a writing style, and a genre, and which is usually preset by the user according to the specific situation.
  • the user equipment establishes two or more sub-language models according to the text type in the preset text classification standard.
  • sub-language models of the following types can be obtained according to the language environment: a business environment, a daily-life environment, an official environment, and so on.
  • sub-language models of the following types can be obtained according to the subject background: sports, politics, literature, history, and so on.
  • the actual types of the sub-language models are also related to the types of corpus available. For example, if no history-type corpus exists in the correction knowledge base, the history-class sub-language model can be regarded as idle or invalid; when the user equipment obtains a certain amount of history-class corpus through active acquisition, user input, or the like, a new history-class sub-language model can be established from that corpus and regarded as a valid sub-language model.
  • the acquired corpus is classified into the sub-language model according to the type.
  • the user equipment can enrich the correction knowledge base by obtaining corpus on a regular or irregular basis.
  • the corpus may be obtained actively by the user equipment by searching over an Internet connection, through periodic updates, or the like, or the user may provide classified corpus data to the user equipment through an input interface such as the configuration management interface of the user equipment. The user equipment then classifies the corpus into a sub-language model of an existing type, or establishes a new sub-language model, according to the corpus type indicated by the user.
  • for example, if history-class corpus data is missing, the user can add a collection of history-class corpus through periodic updates, Internet searches, or even through the configuration management interface, and then establish a history-class sub-language model; if history-class corpus data already exists, new history-class corpus can also be added in the above manner to update the sub-language model.
  • most of the time, however, the corpus obtained by the user equipment is unclassified, and the user equipment needs to classify it by type into the sub-language models according to the preset text classification standard.
  • for example, for the computer technology news text mentioned above that contains economic content such as the stock market, part of the content reads: "Dell estimates that its first-quarter revenue was about $14.2 billion, with earnings per share of 33 cents. The company had previously forecast revenue for the quarter of $14.2 billion to $14.6 billion and earnings per share of 35 to 38 cents, while analysts on average had predicted Dell revenue of $14.52 billion for the same period, with earnings per share of 38 cents."
  • the text classification technology is used to automatically classify unclassified corpus.
  • the classification process is divided into two phases: training phase and classification phase.
  • in the training phase, the texts in the classified corpus collection are processed by word segmentation; the word segmentation process is the same as in the prior art and is not described here again.
  • after word segmentation, the above content can be represented as "戴/尔/公司/估计/,/其/第一/季度/收入/约/为 ..."; for convenience of presentation, the embodiments of the present invention use '/' to indicate the segmentation between words. Stop words such as "地" and "的" are removed from the segmented text, and a word-vector representation of the text is then built according to the words appearing in the text and the ratio of each word's frequency to the total number of words: each distinct word corresponds to one dimension of the vector, and the ratio of its frequency to the total word count is the value of that dimension.
  • the collection of word vectors of the different texts in the corpus, after processing such as dimensionality reduction, is combined with the known classification labels to train a classifier; in the classification phase, the corpus text to be classified is processed and represented as a vector, which is input into the classifier to classify the text into types such as sports and finance.
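The two-phase classification just described (segment, drop stop words, build word-frequency vectors, train a classifier, then classify new text) could be sketched as follows. scikit-learn is used only as an assumed stand-in for the unspecified vectorizer and classifier, and the toy texts are assumed to be already word-segmented.

```python
# Sketch of the two-phase classification described above, using scikit-learn
# as an assumed stand-in for the unspecified vectorizer/classifier.
# Inputs are assumed to be already word-segmented, with words separated by spaces.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

train_texts = ["dell first quarter revenue earnings per share",   # toy, pre-segmented
               "match score team player season"]
train_labels = ["economy", "sports"]

# Training phase: word-frequency vectors (stop words dropped) + classifier.
vectorizer = TfidfVectorizer(stop_words="english")
clf = MultinomialNB().fit(vectorizer.fit_transform(train_texts), train_labels)

# Classification phase: represent the text to be classified as a vector and
# predict class membership probabilities, i.e. the mixing proportions.
text = "analysts predicted revenue per share"
proba = clf.predict_proba(vectorizer.transform([text]))[0]
print(dict(zip(clf.classes_, proba)))
```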
  • the corpus is classified into corresponding sub-language models according to different classifications, and the probability of the corresponding sub-language model is updated.
  • the texts in the corpus are used to establish 2-Gram and 3-Gram statistical models as the word-continuation model. For example, if a corpus text contains the string "知识库构建模块" ("knowledge base building module"), the 2-Gram groups created are 知识, 识库, 库构, 构建, 建模 and 模块, and the statistical probability of occurrence of each 2-Gram group in the classification corpus of that text is then calculated.
  • for the Dell text above, the established 2-Gram groups include 戴尔, 尔公, 公司, 估计, 第一, 一季, 季度, and so on.
  • first, the number of occurrences of each character is counted, and the proportion of that character in the entire corpus is calculated as the probability of occurrence of the character.
  • then, for each 2-Gram group, the number of times the second character appears after the first character is counted; for example, for the group 戴尔, how often the character 尔 follows the character 戴 in the text contained in the entire corpus is counted, and the conditional probability of the group is the ratio of this count to the number of occurrences of the first character.
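A minimal sketch of the 2-Gram continuation statistics described above: count character pairs and estimate P(next character | character) as the pair count divided by the count of the first character. The toy corpus is an assumption for illustration.

```python
# Minimal character 2-Gram ("continuation") statistics: for each pair,
# P(next_char | char) = count(char, next_char) / count(char).
from collections import Counter

def bigram_model(corpus):
    unigrams, bigrams = Counter(), Counter()
    for text in corpus:
        unigrams.update(text)              # counts every character (spaces included)
        bigrams.update(zip(text, text[1:]))
    return {pair: n / unigrams[pair[0]] for pair, n in bigrams.items()}

corpus = ["the cat sat on the mat", "the cat ate"]   # toy corpus
model = bigram_model(corpus)
print(model[("t", "h")])   # probability that "h" follows "t" in this corpus
```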
  • the corpus after word segmentation may also be tagged with parts of speech, and a 2-gram part-of-speech statistical model and a 3-gram part-of-speech statistical model are then established as the part-of-speech continuation model, where "2-gram" in the 2-gram part-of-speech statistical model means two words (or two characters).
  • suppose the corpus contains "knowledge base building module"; after word segmentation, the words "knowledge base", "build" and "module" are obtained, and they are tagged as a noun, a verb and a noun respectively.
  • the established 2-gram part-of-speech groups are "knowledge base" + "build" and "build" + "module", whose part-of-speech patterns are noun + verb and verb + noun; the established 3-gram part-of-speech group is "knowledge base" + "build" + "module", whose part-of-speech pattern is noun + verb + noun. That is, when establishing the 2-gram and 3-gram part-of-speech statistical models, the corresponding parts of speech also need to be labeled.
  • the calculation method of this statistical model is similar to the method of establishing the 2-Gram and 3-Gram statistical models described above, and details are not described here again.
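The part-of-speech continuation model can be built in the same way over tag sequences, as sketched below. The tagged input is supplied directly here; any real part-of-speech tagger would be a separate, unspecified component.

```python
# Part-of-speech continuation model built over tag sequences. The tagged
# sentences are supplied directly; no particular tagger is assumed.
from collections import Counter

def pos_bigram_model(tagged_sentences):
    uni, bi = Counter(), Counter()
    for sent in tagged_sentences:
        tags = [tag for _, tag in sent]
        uni.update(tags)
        bi.update(zip(tags, tags[1:]))
    return {pair: n / uni[pair[0]] for pair, n in bi.items()}

tagged = [[("knowledge base", "NOUN"), ("build", "VERB"), ("module", "NOUN")]]
print(pos_bigram_model(tagged)[("NOUN", "VERB")])   # P(VERB follows NOUN)
```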
  • the user equipment acquires two or more text types of the text to be corrected in a preset text classification standard.
  • the user equipment can obtain the text to be corrected in various ways, for example, the user enters it directly through a user interface of the user equipment, or the user transmits it to the user equipment through an input interface such as a configuration management interface. The user equipment then performs automatic text classification on the text to be corrected by using text classification technology; the classification process is divided into two stages: a training stage and a classification stage. In the training stage, word segmentation processing is performed, and the word segmentation process is the same as in the prior art and is not described here again.
  • a classifier is trained with the known classification labels; in the classification stage, the processed text to be corrected is represented as a vector and input into the classifier to classify the text into types such as sports and finance.
  • the text to be corrected is classified into the corresponding sub-language models according to the different classifications, and the probabilities of the corresponding sub-language models are updated.
  • the user equipment acquires a mixed language model.
  • the user equipment may acquire, in the correction knowledge base, a sub-language model to be combined corresponding to each text type of the text to be corrected.
  • the correction knowledge base may include: sub-language models, a word-continuation model, a part-of-speech continuation model, a similar-sound dictionary, a similar-shape dictionary, and the like. Since the correction knowledge base contains many text types, only the sub-language models corresponding to the text types of the text to be corrected need to be selected to obtain the mixed language model.
  • the user equipment can obtain, by calculation, the proportion of each sub-language model in the text to be corrected.
  • the acquired two or more sub-language models to be combined are then combined according to these proportions to obtain the mixed language model.
  • the proportion of each sub-language model to be combined in the mixed language model can be calculated using the expectation-maximization (EM) algorithm, and the sub-language models to be combined are combined according to these proportions to obtain the mixed language model.
  • each sub-language model can also be multiplied by its corresponding weight, which achieves the same effect of obtaining the mixed language model by combination according to the proportions.
  • the mixed language model is formed by linear interpolation of the sub-language models and can be written in terms of the sub-language models approximately as P(w_1 ... w_i) = Π_{m=1..i} Σ_{j=1..k} λ_j · P_{M_j}(w_m | history), where i is the length of the string to be corrected, k is the number of sub-language models, λ_j is the weight of the j-th sub-language model, and P_{M_j}(·) is the probability given by sub-language model M_j for the string in that model.
  • given this expression, a likelihood function of the text to be processed can be written; the weights λ that maximize the likelihood function are then found, and these λ are taken as the sub-language model weights.
  • in the weight estimation, λ^(t) denotes the t-th weight estimate; in this embodiment of the present invention, t finally equals the number of words in the text to be processed, M denotes a language model, M_j denotes the j-th sub-language model in the mixed language model provided by this embodiment, and k is the number of sub-language models involved in determining the text.
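A hedged sketch of the linear interpolation and EM-style weight estimation described above follows. The sub-language models are simplified to per-word probability tables and the toy tables are assumptions; the update itself is the standard EM re-estimation for mixture weights.

```python
# Linear interpolation of sub-language models with EM-style weight updates.
# Each sub-model here is simplified to a per-word probability table.

def mixture_prob(word, models, lambdas):
    return sum(l * m.get(word, 1e-9) for m, l in zip(models, lambdas))

def estimate_weights(words, models, iterations=20):
    k = len(models)
    lambdas = [1.0 / k] * k                       # start from uniform weights
    for _ in range(iterations):
        expected = [0.0] * k
        for w in words:                           # E-step: posterior of each sub-model
            denom = mixture_prob(w, models, lambdas)
            for j, m in enumerate(models):
                expected[j] += lambdas[j] * m.get(w, 1e-9) / denom
        lambdas = [e / len(words) for e in expected]   # M-step: renormalise
    return lambdas

tech = {"server": 0.02, "revenue": 0.001, "market": 0.002}     # toy sub-models
econ = {"server": 0.0005, "revenue": 0.01, "market": 0.01}
print(estimate_weights(["revenue", "market", "server", "revenue"], [tech, econ]))
```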
  • the user equipment determines, by using an error detection model, an error location of the to-be-processed text, where the error location includes an error character or an error string.
  • the error detection model can include any one or more of: a word-continuation model, a part-of-speech continuation model, a similar-sound dictionary, and a similar-shape dictionary.
  • the error detection model may also include other models, which are not described again here.
  • step S201 has already obtained the word-continuation model, the part-of-speech continuation model, the similar-sound dictionary, the similar-shape dictionary, and the like, so the user equipment can obtain one or more error detection models according to preset detection rules.
  • the user equipment can perform word segmentation and part-of-speech tagging processing on the text to be processed.
  • for the specific process, refer to the explanation in step S201; details are not described here again.
  • a single character or a scattered string that appears consecutively after the word segmentation can be checked by the word continuity model to see if it is correct.
  • the part-of-speech continuation model can be used to check the continuity of parts of speech.
  • the specific process can refer to the prior art.
  • a "non-multi-character-word error" destroys the surface structure of a word and produces isolated strings, so the original multi-character word cannot be found in the word segmentation dictionary; for example, if one character of a word such as "loyalty" is mistyped, the word segmentation program splits the string into several individual Chinese characters or fragments. Statistically, the probability of the erroneous fragment following the preceding character is very small, so this type of error can be detected by the word-continuation model with an appropriately set threshold.
  • for a "true multi-character-word error", the erroneous string is itself a word in the segmentation vocabulary; there is usually no word-level error, and this kind of error is generally a grammatical-structure or collocation error.
  • for example, an erroneous string may consist of a noun followed by another noun where the correct string, such as "extend the time", is a verb followed by a noun; statistically, the noun + noun sequence is far less probable, while the verb + noun combination of the correct string is statistically reasonable. Such errors can therefore be found by using the part-of-speech continuation model to judge the part-of-speech relationships.
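A minimal sketch of flagging error positions with the continuation model described above: a position is flagged when the probability of the observed character pair falls below a preset threshold. The model table and the threshold are illustrative assumptions.

```python
# Flag error positions where the 2-Gram continuation probability drops below
# a preset threshold; the model table and threshold are illustrative only.

def detect_errors(text, bigram_model, threshold=1e-4):
    positions = []
    for i, pair in enumerate(zip(text, text[1:])):
        if bigram_model.get(pair, 0.0) < threshold:
            positions.append(i + 1)        # index of the suspicious character
    return positions

bigram_model = {("t", "h"): 0.3, ("h", "e"): 0.4}    # toy continuation model
print(detect_errors("thx", bigram_model))             # -> [2]: "x" after "h" is unlikely
```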
  • for the method of determining the error position by means of the similar-sound dictionary, the similar-shape dictionary, and the like, refer to the prior art.
  • the methods for detecting the error position described above are only schematic; any variation or substitution that can be readily conceived by a person skilled in the art within the technical scope disclosed by the present invention shall fall within the protection scope of the present invention.
  • the correction method may include: setting the first character in the string sequence to be corrected as the editing position and performing correction operations on the string to be corrected according to the word-continuation relationships in the language model to generate a new set of N string-sequence combinations, and then repeating the above operation with the second character position of each string sequence in the newly generated string-sequence set as the editing position.
  • N correction strings can be obtained after a finite number of such operations.
  • this procedure assumes by default that the entire string of the text to be corrected may be erroneous, so correction must be attempted at almost every position in the text to be corrected, which makes the operation complex; if the string sequence of the text to be corrected is long, a state explosion may occur.
  • by screening for error positions before correction, the number of correction operations is effectively reduced and correction efficiency is improved.
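The sketch below generates candidate correction strings only at a detected error position, using a confusion set (for example, similar-sound or similar-shape characters) as an assumed stand-in for the dictionaries mentioned above.

```python
# Generate candidate strings only at the detected error position, using a
# confusion set (e.g. similar-sound / similar-shape characters) as a stand-in.

def candidates_at(text, pos, confusion_set):
    out = []
    for repl in confusion_set.get(text[pos], []):
        out.append(text[:pos] + repl + text[pos + 1:])      # substitution
    out.append(text[:pos] + text[pos + 1:])                 # deletion
    return out

confusion_set = {"x": ["e", "a"]}                           # toy confusion set
print(candidates_at("thx", 2, confusion_set))               # ['the', 'tha', 'th']
```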
  • the user equipment performs correction according to the mixed language model to obtain the correction suggestion text.
  • a sequence of strings to be corrected can be generated from the error location.
  • the user equipment may perform a correction operation on the string sequence to be corrected, by error-detection-model matching or other methods, to obtain at least one corrected string sequence; the at least one corrected string sequence may constitute a set of corrected string sequences. For the specific correction operation, refer to the prior art.
  • the user equipment may obtain the first m and the last n characters of the error position in the to-be-corrected text, and combine with the corrected character string sequence to obtain at least one screening sequence.
  • m and n are positive integers or 0, which can be preset or dynamic. In this way, the sequence of correction strings is more closely related to the context of the text to be corrected.
  • for example, suppose the string sequence to be corrected is a three-character string meaning "intermittent", and the correction operation produces the corrected string sequence "intermittently". The characters before and after the error position, here the preceding word "sound", are obtained and combined with the corrected string to form the screening sequence "sound intermittently"; the statistical language model can then be used to calculate the probability that "intermittently" appears after "sound", and a high probability indicates that the correction string generated here is appropriate.
  • multiple corrected string sequences may be obtained after correction; the above is only a schematic description.
  • according to the mixed language model, the user equipment may obtain, from the at least one screening sequence by using a noise channel probability model, the string sequence in which the ideal string has the highest probability of occurrence as the correction suggestion text, or
  • according to the mixed language model, obtain, from the at least one screening sequence by using the noise channel probability model, the first several string sequences in which the ideal string has a high probability of occurrence as the correction suggestion text.
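Putting the pieces together, the sketch below builds screening sequences from the m characters before and n characters after the error position, scores them with toy stand-ins for the mixed language model and the channel model, and returns the highest-ranked suggestions. All names and values here are illustrative assumptions.

```python
# Build screening sequences (m chars before + corrected string + n chars after)
# and keep the candidates that the noisy-channel score ranks highest.

def screening_sequences(text, start, end, corrected_strings, m=2, n=2):
    left, right = text[max(0, start - m):start], text[end:end + n]
    return [left + c + right for c in corrected_strings]

def best_suggestions(observed, sequences, p_w, p_o_given_w, top=1):
    scored = sorted(sequences,
                    key=lambda s: p_w(s) * p_o_given_w(observed, s),
                    reverse=True)
    return scored[:top]

# Toy scoring functions standing in for the mixed language model / channel model.
p_w = lambda s: 0.01 if "the" in s else 1e-6
p_o_given_w = lambda o, s: 0.9 if o == s else 0.05

text = "athxm"                                   # toy text; suspicious character at index 3
seqs = screening_sequences(text, 3, 4, ["e", "x"])        # -> ["them", "thxm"]
print(best_suggestions("thxm", seqs, p_w, p_o_given_w))   # -> ['them']
```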
  • the correction suggestion text can be provided to the user through the human-machine interaction interface of the user equipment for the user to confirm the correction scheme; the positions of the corrected strings can be emphasized by underlining or the like, and corrections for different types of errors can also be marked with different colors, symbols, or shading.
  • in the text correction method provided by this embodiment of the present invention, the text to be corrected is classified and the corresponding mixed language model is then acquired, so that the mixed language model on which the correction is based can change dynamically with the text type of the text to be corrected, and the language model can more accurately reflect the linguistic phenomena of the text.
  • different correction choices can thus be provided for the text to be corrected, thereby reducing correction errors and improving correction flexibility and correctness.
  • the number of corrections is effectively reduced, and the efficiency of the correction is improved.
  • supplemented by a named-entity recognition technique during correction, the present invention can also identify named entities that may cause anomalies in word segmentation and part-of-speech tagging, and exclude them from correction processing.
  • An embodiment of the present invention provides a user equipment 30, as shown in FIG. 3, including: an obtaining unit 301, configured to acquire two or more text types of the text to be corrected in a preset text classification standard.
  • the preset text classification standard may be any one of: a language environment, a subject background, an author, a writing style, and a genre.
  • the obtaining unit 301 is further configured to acquire, in the correction knowledge base, a sub-language model to be combined corresponding to each text type of the text to be corrected, and send the acquired information of the two or more sub-language models to be combined to the generating unit 302.
  • the generating unit 302 is configured to receive the acquired information of the two or more sub-language models to be combined sent by the obtaining unit 301, combine the acquired two or more sub-language models to be combined into a mixed language model, and send the information of the mixed language model to the correction unit 303.
  • the generating unit 302 is specifically configured to: obtain the proportion of each text type in the text to be corrected; and combine the acquired two or more sub-language models to be combined according to the proportions of the text types to obtain the mixed language model.
  • the correcting unit 303 is configured to receive information about the mixed language model sent by the generating unit 302, and correct the text to be corrected according to the mixed language model to obtain a corrected suggestion text.
  • the correction unit 303 is specifically configured to: generate a string sequence to be corrected from the error position; perform a correction operation on the string sequence to be corrected to obtain at least one corrected string sequence; obtain the m characters before and the n characters after the error position in the text to be corrected and combine them with the corrected string sequence to obtain at least one screening sequence; and, according to the mixed language model, obtain, from the at least one screening sequence by using a noise channel probability model, the string sequence in which the ideal string has the highest probability of occurrence as the correction suggestion text, or obtain the first several string sequences in which the ideal string has a high probability of occurrence as the correction suggestion text.
  • in this way, the obtaining unit classifies the text to be corrected and the generating unit then acquires the corresponding mixed language model, so that the mixed language model on which the correction unit bases its correction can change dynamically with the text type of the text to be corrected; when the preset text classification standard or the text type of the text to be corrected differs, different correction choices can be provided, thereby reducing correction errors and improving correction flexibility and correctness.
  • the user equipment 10 may further include: the obtaining unit 301, configured to acquire the preset text classification standard and send the preset text classification standard to the establishing unit 304;
  • the establishing unit 304 is configured to receive the preset text classification standard sent by the obtaining unit 301, and establish two or more sub-language models according to the text types in the preset text classification standard.
  • the model obtaining unit 305 is configured to acquire an error detection model in the correction knowledge base, and send the information of the error detection model to the determining unit 306;
  • the error detection model may include any one or more of: a word-continuation model, a part-of-speech continuation model, a similar-sound dictionary, and a similar-shape dictionary.
  • a determining unit 306 configured to receive information about the error detection model sent by the model obtaining unit 305, and determine, by using the error detection model, an error location of the to-be-processed text, where the error location includes an incorrect character or an incorrect character string.
  • the user equipment provided by this embodiment of the present invention classifies the text to be corrected and then obtains the corresponding mixed language model, so that the mixed language model on which the correction is based can change dynamically with the text type of the text to be corrected, and the language model can more accurately reflect the linguistic phenomena of the text.
  • different correction choices can thus be provided for the text to be corrected, thereby reducing correction errors and improving correction flexibility and correctness.
  • screening the error positions is effective, reducing the number of correction operations and improving correction efficiency.
  • the disclosed system, apparatus, and method may be implemented in other manners.
  • the device embodiments described above are merely illustrative.
  • the division of the unit is only a logical function division.
  • there may be other division manners in actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not performed.
  • the coupling or direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection through some interface, device or unit, and may be electrical, mechanical or otherwise.
  • the units described as separate components may or may not be physically separated, and the components displayed as the units may or may not be physical units, and may be located in one place or may be distributed to multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the embodiment of the present embodiment.
  • each functional unit in each embodiment of the present invention may be integrated into one processing unit, or each unit may be physically included separately, or two or more units may be integrated into one unit.
  • the above integrated unit can be implemented in the form of hardware or in the form of hardware plus software functional units.
  • the embodiment of the present invention provides a user equipment 50, as shown in FIG. 5, including: a processor 501, configured to acquire two or more text types of the text to be corrected in a preset text classification standard.
  • the preset text classification standard may be any one of: a language environment, a subject background, an author, a writing style, and a genre.
  • the processor 501 is further configured to: acquire, in the correction knowledge base, a sub-language model to be combined corresponding to each text type of the text to be corrected; combine the acquired two or more sub-language models to be combined into a mixed language model; and correct the text to be corrected according to the mixed language model to obtain the correction suggestion text.
  • the processor 501 is specifically configured to: obtain the proportion of each text type in the text to be corrected; and combine the acquired two or more sub-language models to be combined according to the proportions of the text types to obtain the mixed language model.
  • the processor 501 is further specifically configured to: generate a string sequence to be corrected from the error position; perform a correction operation on the string sequence to be corrected to obtain at least one corrected string sequence; obtain the m characters before and the n characters after the error position in the text to be corrected and combine them with the corrected string sequence to obtain at least one screening sequence; and, according to the mixed language model, obtain, from the at least one screening sequence by using a noise channel probability model, the string sequence in which the ideal string has the highest probability of occurrence as the correction suggestion text, or obtain the first several string sequences in which the ideal string has a high probability of occurrence as the correction suggestion text.
  • in this way, the processor classifies the text to be corrected and then obtains the corresponding mixed language model, so that the mixed language model on which the correction is based can change dynamically with the text type of the text to be corrected; when the preset text classification standard or the text type of the text to be corrected differs, different correction choices can be provided, so correction errors can be reduced and correction flexibility and accuracy can be improved.
  • processor 501 is further configured to: obtain the preset text classification standard.
  • the user equipment 50 further includes: a memory 502, configured to establish two or more sub-language models according to the text types in the preset text classification standard, and send the information of the sub-language models to the processor 501.
  • the processor 501 is further configured to acquire an error detection model in the correction knowledge base.
  • the error detection model may include any one or more of: a word-continuation model, a part-of-speech continuation model, a similar-sound dictionary, and a similar-shape dictionary.
  • the processor 501 is further configured to determine, by using the error detection model, an error location of the to-be-processed text, where the error location includes an error character or an error string.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

Embodiments of the present invention provide a text correcting method and user equipment, and relate to the language processing field, which may reduce correction mistakes and improve correction flexibility and accuracy. The text correcting method comprises: obtaining two or more text types of a to-be-corrected text in a preset text classification standard; obtaining, in a correction knowledge base, a to-be-combined sub-language model corresponding to each text type of the to-be-corrected text; combining the obtained two or more to-be-combined sub-language models into a mixed language model; and correcting the to-be-corrected text according to the mixed language model to obtain a correction suggestion text. The text correcting method and user equipment provided in the embodiments of the present invention are used for correcting erroneous text.

Description

Text correction method and user equipment

This application claims priority to Chinese Patent Application No. 201210332263.3, filed with the Chinese Patent Office on September 10, 2012 and entitled "Text correction method and user equipment", which is incorporated herein by reference in its entirety.

Technical field

The present invention relates to the field of language processing, and in particular, to a text correction method and user equipment.

Background

With the advent of the digital age, text correction techniques that correct erroneous text to be corrected are ever more widely applied. In the prior art, noise channel theory holds that errors in the text to be corrected mainly come from input errors produced during manual input and from input errors produced in optical character recognition and speech recognition. Noise channel theory regards the text to be corrected as the real text after it has passed through a channel mixed with noise. For example, W is the original string sequence <W1, W2, ..., Wn>, that is, the completely correct text; after passing through the noise channel, the noisy text O = <O1, O2, ..., On> is produced. The method of performing text correction using noise channel theory is to establish a noise channel probability model and find a string sequence W' such that, given the observed string sequence O, the probability of occurrence of W' is maximized. The string sequence O is the text to be corrected, and the string sequence W' is the ideal corrected text, which can also be called the ideal string, although the ideal corrected text is not necessarily identical to the correct text W. The string sequence W' is the string with the highest probability P(W|O); P(O|W) is called the channel probability or generation model, and the probability P(W) is the probability of occurrence of the string sequence W in the language model.

In the method of implementing text correction using noise channel theory, the string W' that maximizes this probability must be obtained according to the language model. However, when the language environment, subject background, and so on of the text to be corrected differ, the same word or string may have different meanings and therefore requires different correction choices. The language model in the prior art is relatively fixed, so only a fixed correction choice can be applied to the text to be corrected; correction errors therefore occur easily, resulting in poor correction flexibility and low correctness.
Summary of the invention

Embodiments of the present invention provide a text correction method and user equipment for improving correction flexibility and correctness.

To achieve the above objective, the embodiments of the present invention use the following technical solutions:

In one aspect, a text correction method is provided, including:

obtaining two or more text types of the text to be corrected in a preset text classification standard; and obtaining, in a correction knowledge base, a sub-language model to be combined corresponding to each text type of the text to be corrected;

combining the obtained two or more sub-language models to be combined into a mixed language model; and correcting the text to be corrected according to the mixed language model to obtain correction suggestion text.

The preset text classification standard is any one of a language environment, a subject background, an author, a writing style, and a genre.
The method further includes:

obtaining the preset text classification standard; and

establishing two or more sub-language models according to the text types in the preset text classification standard.

Combining the obtained two or more sub-language models to be combined into a mixed language model includes:

obtaining the proportion of each text type in the text to be corrected; and

combining the obtained two or more sub-language models to be combined according to the proportions of the text types to obtain the mixed language model.

Before the text to be corrected is corrected according to the mixed language model to obtain the correction suggestion text, the method further includes:

obtaining an error detection model in the correction knowledge base; and

determining an error position of the to-be-processed text by using the error detection model, where the error position includes an erroneous character or an erroneous character string.

The error detection model includes any one or more of a word-continuation model, a part-of-speech continuation model, a similar-sound dictionary, and a similar-shape dictionary.

Correcting the text to be corrected according to the mixed language model to obtain the correction suggestion text includes:

generating a string sequence to be corrected from the error position;

performing a correction operation on the string sequence to be corrected to obtain at least one corrected string sequence;

obtaining the m characters before and the n characters after the error position in the text to be corrected, and combining them with the corrected string sequence to obtain at least one screening sequence; and

according to the mixed language model, obtaining, from the at least one screening sequence by using a noise channel probability model, the string sequence in which the ideal string has the highest probability of occurrence as the correction suggestion text, or

according to the mixed language model, obtaining, from the at least one screening sequence by using the noise channel probability model, the first several string sequences in which the ideal string has a high probability of occurrence as the correction suggestion text.
In another aspect, a user equipment is provided, including:

an obtaining unit, configured to obtain two or more text types of the text to be corrected in a preset text classification standard, where

the obtaining unit is further configured to obtain, in a correction knowledge base, a sub-language model to be combined corresponding to each text type of the text to be corrected, and send information about the obtained two or more sub-language models to be combined to a generating unit;

the generating unit, configured to receive the information about the obtained two or more sub-language models to be combined sent by the obtaining unit, combine the obtained two or more sub-language models to be combined into a mixed language model, and send information about the mixed language model to a correction unit; and

the correction unit, configured to receive the information about the mixed language model sent by the generating unit, and correct the text to be corrected according to the mixed language model to obtain correction suggestion text.

The preset text classification standard is any one of a language environment, a subject background, an author, a writing style, and a genre.

The user equipment further includes:

the obtaining unit, configured to obtain the preset text classification standard and send the preset text classification standard to an establishing unit; and

the establishing unit, configured to receive the preset text classification standard sent by the obtaining unit, and establish two or more sub-language models according to the text types in the preset text classification standard.

The generating unit is specifically configured to:

obtain the proportion of each text type in the text to be corrected; and

combine the obtained two or more sub-language models to be combined according to the proportions of the text types to obtain the mixed language model.

The user equipment further includes:

a model obtaining unit, configured to obtain an error detection model in the correction knowledge base and send information about the error detection model to a determining unit; and

the determining unit, configured to receive the information about the error detection model sent by the model obtaining unit, and determine an error position of the to-be-processed text by using the error detection model, where the error position includes an erroneous character or an erroneous character string.

The error detection model includes any one or more of a word-continuation model, a part-of-speech continuation model, a similar-sound dictionary, and a similar-shape dictionary.

The correction unit is specifically configured to:

generate a string sequence to be corrected from the error position;

perform a correction operation on the string sequence to be corrected to obtain at least one corrected string sequence;

obtain the m characters before and the n characters after the error position in the text to be corrected, and combine them with the corrected string sequence to obtain at least one screening sequence; and

according to the mixed language model, obtain, from the at least one screening sequence by using a noise channel probability model, the string sequence in which the ideal string has the highest probability of occurrence as the correction suggestion text, or

according to the mixed language model, obtain, from the at least one screening sequence by using the noise channel probability model, the first several string sequences in which the ideal string has a high probability of occurrence as the correction suggestion text.
The embodiments of the present invention provide a text correction method and user equipment. The text correction method includes: obtaining two or more text types of the text to be corrected in a preset text classification standard; obtaining, in a correction knowledge base, a sub-language model to be combined corresponding to each text type of the text to be corrected; combining the obtained two or more sub-language models to be combined into a mixed language model; and correcting the text to be corrected according to the mixed language model to obtain correction suggestion text. In this way, by classifying the text to be corrected and then obtaining the corresponding mixed language model, the mixed language model on which the correction is based can change dynamically with the text type of the text to be corrected; when the preset text classification standard or the text type of the text to be corrected differs, different correction choices can be provided for the text to be corrected, thereby reducing correction errors and improving correction flexibility and correctness.
Brief description of the drawings

To describe the technical solutions in the embodiments of the present invention or in the prior art more clearly, the accompanying drawings needed for describing the embodiments or the prior art are briefly introduced below. Apparently, the accompanying drawings in the following description show merely some embodiments of the present invention, and a person of ordinary skill in the art may still derive other drawings from these accompanying drawings without creative effort.

FIG. 1 is a schematic flowchart of a text correction method according to an embodiment of the present invention;

FIG. 2 is a schematic flowchart of another text correction method according to an embodiment of the present invention;

FIG. 3 is a schematic structural diagram of a user equipment according to an embodiment of the present invention;

FIG. 4 is a schematic structural diagram of another user equipment according to an embodiment of the present invention;

FIG. 5 is a schematic structural diagram of still another user equipment according to an embodiment of the present invention; and

FIG. 6 is a schematic structural diagram of yet another user equipment according to an embodiment of the present invention.
具体实施方式 detailed description
下面将结合本发明实施例中的附图, 对本发明实施例中的技术 方案进行清楚、 完整地描述, 显然, 所描述的实施例仅仅是本发明 一部分实施例, 而不是全部的实施例。 基于本发明中的实施例, 本 领域普通技术人员在没有做出创造性劳动前提下所获得的所有其他 实施例, 都属于本发明保护的范围。 The technical solutions in the embodiments of the present invention are clearly and completely described in the following with reference to the accompanying drawings in the embodiments of the present invention. It is obvious that the described embodiments are only a part of the embodiments of the present invention, but not all of the embodiments. Based on an embodiment of the present invention, All other embodiments obtained by those skilled in the art without creative efforts are within the scope of the present invention.
本发明实施例提供一种文本校正方法, 包括:  An embodiment of the present invention provides a text correction method, including:
5101、 获取待校正文本在预设文本分类标准中的两个以上文本 类型。  5101. Obtain two or more text types of the text to be corrected in the preset text classification standard.
上述预设文本分类标准可以包括: 语言环境、 主题背景、 作者、 写作风格 和题材中的任意一项。 示例的, 按照主题背景可以将文本 分为体育、 经济、 政治、 科技等文本类型。  The above preset text classification criteria may include: any one of a language environment, a theme, an author, a writing style, and a theme. For example, texts can be divided into text types such as sports, economics, politics, and technology according to the theme.
若用户预设的文本分类标准为主题背景, 则用户设备可以在校 正知识库中依据该主题背景的文本类型建立相应的子语言模型。 在 获取待校正文本的文本类型时, 可以利用文本分类技术确定待校正 文本所属的分类。  If the text classification standard preset by the user is the theme background, the user equipment may establish a corresponding sub-language model according to the text type of the theme background in the correction knowledge base. When the text type of the text to be corrected is obtained, the text classification technique can be utilized to determine the classification to which the text to be corrected belongs.
5102、 在校正知识库中获取与所述待校正文本的每一个文本类 型对应的待组合子语言模型。  5102. Obtain, in the correction knowledge base, a sub-language model to be combined corresponding to each text type of the text to be corrected.
5103、 将获取的两个以上待组合子语言模型组合成为混合语言 模型。  5103. Combine the obtained two or more sub-language models to be combined into a mixed language model.
例如, 当输入一段包含有股市等经济方面内容的计算机科技咨 询文本时, 利用文本分类技术可以确定该文本所属的文本类型为科 技类和经济类。 在校正知识库中选择与待校正文本的文本类型对应 的科技类与经济类子语言模型, 然后将该科技类与经济类子语言模 型组合成为混合语言模型。  For example, when entering a piece of computer technology consulting text that contains economic aspects such as the stock market, text classification techniques can be used to determine the type of text to which the text belongs is science and economic. Select the technology and economic sub-language model corresponding to the text type of the text to be corrected in the correction knowledge base, and then combine the technology class with the economic sub-language model into a mixed language model.
5104、 根据混合语言模型对待校正文本进行校正得到校正建议 文本。  5104. Correcting the corrected text according to the mixed language model to obtain a corrected suggestion text.
这样一来, 通过将待校正文本进行分类, 然后获取相应的混合 语言模型, 使得校正时所依据的混合语言模型能够根据待校正文本 的文本类型动态变化, 因此能够减少校正错误, 提高校正灵活性和 正确性。  In this way, by classifying the text to be corrected and then obtaining the corresponding mixed language model, the mixed language model on which the correction is based can dynamically change according to the text type of the text to be corrected, thereby reducing correction errors and improving correction flexibility. And correctness.
示例的, 本发明另一个实施例提供一种文本校正的具体方法 20 , 包括: S201、 用户设备根据预设文本分类标准将获取的语料按照文本 类型归类至各子语言模型中。 For example, another embodiment of the present invention provides a specific method 20 for text correction, including: S201. The user equipment classifies the acquired corpus according to the preset text classification standard into each sub-language model according to the text type.
First, the user equipment needs to obtain the preset text classification standard. The preset text classification standard may include any one of: language environment, theme background, author, writing style, and subject matter, and is usually preset by the user according to the specific situation.

Then, in the correction knowledge base, the user equipment establishes two or more sub-language models according to the text types under the preset text classification standard.
For example, according to language environment, sub-language models of types such as a business environment, a daily-life environment, or an official environment may be obtained. According to theme background, sub-language models of types such as sports, politics, literature, or history may be obtained. Meanwhile, the actual kinds of sub-language models also depend on the kinds of corpus available. For example, if no history-type corpus exists in the correction knowledge base, the history sub-language model may be regarded as idle or invalid; when the user equipment obtains a certain amount of history-type corpus through active acquisition, user input, or other means, a new history sub-language model may be established according to that corpus, and this history sub-language model is regarded as a valid sub-language model.

Then, according to the preset text classification standard, the obtained corpus is assigned by type to the sub-language models.

Specifically, the user equipment may enrich the correction knowledge base by obtaining corpus periodically or aperiodically. The corpus may be obtained actively by the user equipment through an Internet connection, searching, periodic updates, and the like, or the user may provide classified corpus data to the user equipment through an input interface such as a configuration management interface of the user equipment. The user equipment then assigns the corpus to a sub-language model of an existing type, or establishes a new sub-language model, according to the corpus type indicated by the user. For example, if history-type corpus data is missing from the corpus base, the user may add a collection of history-type corpus through periodic updates, Internet search, or even through the configuration management interface, and then establish a history sub-language model; if history-type corpus data already exists, new history-type corpus may also be added in the above ways to update the sub-language model.
Most of the time, however, the corpus obtained by the user equipment is unclassified, and the user equipment needs to assign the obtained corpus by type to the sub-language models according to the preset text classification standard, that is, to classify the corpus. For example, for the above-mentioned computer technology news text containing economic content such as the stock market, part of its content reads: "Dell estimates that its first-quarter revenue was about 14.2 billion US dollars, with earnings per share of 33 cents. The company previously forecast revenue of 14.2 to 14.6 billion US dollars for the quarter, with earnings per share of 35 to 38 cents, while analysts on average forecast Dell's revenue for the same period at 14.52 billion US dollars, with earnings per share of 38 cents." A text classification technique is used to perform automatic text classification on the unclassified corpus. The classification process is divided into two stages: a training stage and a classification stage. In the training stage, word segmentation is performed on the texts in the classified corpus collection; the word segmentation process is the same as in the prior art and is not described here again. After word segmentation, the above content may be represented as "戴/尔/公司/估计/,/其/第一/季度/收入/约/为…"; for convenience of presentation, the embodiments of the present invention use "/" to indicate the boundary between words. Stop words such as "地" and "的" are removed from the segmented text, and a word-vector representation of the text is then established according to the words appearing in the text and the ratio of each word's frequency to the total number of words: each distinct word corresponds to one dimension of the vector, and the ratio of its frequency to the total word count is the value of that dimension. The set of word vectors of the different texts in the corpus is then processed by dimensionality reduction and the like and combined with the known class labels to train a classifier. In the classification stage, the corpus text to be classified is processed into a vector representation and input into the classifier, which classifies the text into types such as sports or finance. The corpus is assigned to the corresponding sub-language models according to the resulting classes, and the probabilities of the corresponding sub-language models are updated.
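As a rough sketch only, the word-vector classification described above might be prototyped as follows; the stop-word list, the toy training corpus, and the nearest-centroid classifier (standing in for whatever classifier and dimensionality reduction an actual implementation uses) are all assumptions made for illustration.

```python
from collections import Counter

STOP_WORDS = {"的", "地"}  # illustrative stop-word list

def word_vector(tokens):
    """Term-frequency vector: each word's count divided by the total word count."""
    tokens = [t for t in tokens if t not in STOP_WORDS]
    counts = Counter(tokens)
    total = sum(counts.values()) or 1
    return {w: c / total for w, c in counts.items()}

def cosine(u, v):
    dot = sum(u[w] * v.get(w, 0.0) for w in u)
    nu = sum(x * x for x in u.values()) ** 0.5
    nv = sum(x * x for x in v.values()) ** 0.5
    return dot / (nu * nv) if nu and nv else 0.0

def centroid(vectors):
    merged = Counter()
    for v in vectors:
        merged.update(v)
    return {w: s / len(vectors) for w, s in merged.items()}

# Training stage: one centroid per labelled class (toy, pre-segmented corpus).
training = {
    "technology": [["戴尔", "公司", "发布", "服务器"]],
    "economics": [["季度", "收入", "每股", "收益"]],
}
centroids = {label: centroid([word_vector(t) for t in texts])
             for label, texts in training.items()}

def classify(tokens, top_n=2):
    """Classification stage: return the top_n classes closest to the text's vector."""
    vec = word_vector(tokens)
    ranked = sorted(centroids, key=lambda c: cosine(vec, centroids[c]), reverse=True)
    return ranked[:top_n]

print(classify(["戴尔", "公司", "季度", "收入"]))  # e.g. ['technology', 'economics']
```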
In particular, character-level 2-Gram and 3-Gram statistical models are built from the texts in the corpus as the character continuation model. For example, assuming a corpus text contains the text "知识库构建模块" (knowledge base building module), the character 2-Gram pairs built are "知识", "识库", "库构", "构建", "建模", and "模块", and the statistical probability of each 2-Gram pair occurring in the classified corpus to which the text belongs is then calculated. Further, for the above-mentioned computer technology news text containing economic content such as the stock market, the character 2-Gram pairs built include "戴尔", "而公", "公司", "司估", "估计", "其第", "第一", "一季", "季度", and so on. First, the number of occurrences of each single character is counted, and the proportion of that character in the entire corpus is calculated as its occurrence probability. For each 2-Gram pair, the number of times the second character appears after the first character is counted. For example, "戴尔" indicates one occurrence of the character "尔" following the character "戴"; if "尔" appears after "戴" 1000 times in the texts contained in the entire corpus, the recorded count of "尔" following "戴" is 1000, and likewise the count of "帽" following "戴" may be, say, 10000. Many different characters may follow "戴", each with a different count; the total number of times "戴" is followed by any character is counted, for example 500000, and the probability of each possibility is then calculated. The probability of "戴" being followed by "尔" may be roughly estimated as 1000/500000 = 0.2%, while the probability of "戴" being followed by "帽" may be roughly estimated as 10000/500000 = 2%. The 3-Gram statistical model is obtained in the same way as the 2-Gram statistical model and is not described here again. The character 2-Gram and 3-Gram continuation models facilitate locating error positions in the text to be processed in the subsequent process.
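The character 2-Gram counting just described could be sketched roughly as follows; the toy corpus and the decision to leave unseen pairs unsmoothed are assumptions for illustration, not details taken from the disclosure.

```python
from collections import Counter, defaultdict

def train_char_bigrams(texts):
    """Count character pairs and turn the counts into conditional probabilities P(c2 | c1)."""
    pair_counts = defaultdict(Counter)
    for text in texts:
        for c1, c2 in zip(text, text[1:]):
            pair_counts[c1][c2] += 1
    probs = {}
    for c1, followers in pair_counts.items():
        total = sum(followers.values())
        for c2, n in followers.items():
            probs[(c1, c2)] = n / total
    return probs

bigrams = train_char_bigrams(["知识库构建模块", "戴尔公司估计其第一季度收入"])
print(bigrams[("知", "识")])            # P("识" | "知") in this toy corpus
print(bigrams.get(("戴", "帽"), 0.0))   # unseen pair -> 0.0 here; a real model would smooth
```

The 3-Gram case follows the same counting pattern with character triples instead of pairs.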
Further, part-of-speech tagging may also be performed on the segmented corpus, and a 2-gram part-of-speech statistical model and a 3-gram part-of-speech statistical model may be built as the part-of-speech continuation model, where "2-gram" in the 2-gram part-of-speech statistical model refers to two word units (two words or two characters). For example, assuming the corpus contains "知识库构建模块", word segmentation yields the three words "知识库" (knowledge base), "构建" (build), and "模块" (module), whose tagged parts of speech are noun, verb, and noun. The 2-gram part-of-speech units built are "知识库构建" and "构建模块", whose part-of-speech sequences are noun + verb and verb + noun respectively, and the 3-gram part-of-speech unit built is "知识库构建模块", whose part-of-speech sequence is noun + verb + noun; that is, when the 2-gram and 3-gram part-of-speech statistical models are built, the corresponding parts of speech also need to be tagged. The calculation of the specific statistical models is similar to the method of building the character 2-Gram and 3-Gram statistical models described above, and is not described again in the present invention.

Finally, phonetically-similar and visually-similar character dictionaries may be built by using encoding methods such as Pinyin and the Wubi input method, for example "处" – "出", "形" – "型", and "磬" – "罄". This is not described in detail in the present invention.
S202. The user equipment obtains two or more text types of the text to be corrected under the preset text classification standard.

The user equipment may obtain the text to be corrected in various ways; for example, the user may enter it directly into the user equipment through a user interface, or the user may transmit it directly to the user equipment through an input interface such as a configuration management interface. The user equipment then uses a text classification technique to perform automatic text classification on the text to be corrected. The classification process is divided into two stages: a training stage and a classification stage. In the training stage, word segmentation is performed; the word segmentation process is the same as in the prior art and is not described here again. Stop words such as "地" and "的" are removed from the segmented text, a word-vector representation of the text is established according to the words appearing in the text and the ratio of each word's frequency to the total number of words, and a classifier is trained through dimensionality reduction and the like in combination with the known class labels. In the classification stage, the text to be corrected is processed into a vector representation and input into the classifier, which classifies the text into types such as sports or finance. The text to be corrected is classified into the corresponding sub-language models according to the resulting classes, and the probabilities of the corresponding sub-language models are updated.
S203. The user equipment obtains the mixed language model.

First, the user equipment may obtain, from the correction knowledge base, a sub-language model to be combined corresponding to each text type of the text to be corrected. The correction knowledge base may include: the sub-language models, the character continuation model, the part-of-speech continuation model, the phonetically-similar character dictionary, the visually-similar character dictionary, and so on. Since the correction knowledge base covers many text types, only the sub-language models corresponding to the text types of the text to be corrected need to be selected and combined to obtain the mixed language model.

Then, the user equipment may obtain, through calculation, the weight of each sub-language model for the text to be corrected. Finally, the two or more obtained sub-language models to be combined are combined according to the weights of the respective sub-language models to obtain the mixed language model. Specifically, an expectation-maximization (EM) algorithm may be used to obtain the weight of each sub-language model to be combined in the mixed language model, and the sub-language models to be combined are then combined into the mixed language model according to their respective weights. Of course, each sub-language model may also be multiplied by its corresponding weight to achieve the effect of combining the models according to the weights to obtain the mixed language model.
Specifically, the mixed language model is formed by combining the sub-language models through linear interpolation. For N-Gram sub-language models, the mixed language model is expressed in terms of the sub-language models as follows:

P(W) = ∏_{m=1…i} Σ_{j=1…k} λj · P(wm | wm−N+1 … wm−1; Mj)

where i is the length of the character string to be corrected, k is the number of sub-language models, λj is the weight of the j-th sub-language model, P(wm | wm−N+1 … wm−1; Mj) is the probability, in sub-language model Mj, of the character wm appearing after the preceding character sequence, and 1 ≤ j ≤ k. P(W) is used in the same way as in the prior-art method of obtaining P(W) based on noisy channel theory, which is not described here again.
According to the expectation-maximization algorithm, a likelihood function of the text to be processed can be given for the above mixed language model. Based on this likelihood function, the sub-language model weights λj that maximize the likelihood need to be found, and those λj are then the weights of the sub-language models. Assuming that the text to be processed of a given text type contains T characters in total, the update formula for the weight corresponding to that text type is:

λj(t) = (1/t) · Σ_{τ=1…t} [ λj(t−1) · P(wτ | w1 … wτ−1; Mj) / Σ_{l=1…k} λl(t−1) · P(wτ | w1 … wτ−1; Ml) ]

where t denotes the t-th weight estimate (in this embodiment of the present invention t finally equals the number T of characters in the text to be processed), M denotes a language model, Mj denotes the j-th sub-language model in the mixed language model provided in this embodiment of the present invention, and k is the number of sub-language models determined to be involved for the text.
For example, assume that the sub-language models determined for the text to be corrected are the technology and economics sub-language models, so k = 2. In the initial state, set λ1 = λ2 = 0.1 or some other small positive value. For the first character {w1} of the text to be processed, the occurrence probabilities of the single character w1 in the technology and economics sub-language models are obtained as P(w1; M1) and P(w1; M2), and the weights are then calculated according to the above formula. At this point t = 1, and the update formula yields new values of λ1 and λ2. For the second character {w2} in the text, the conditional probabilities P(w2|w1; M1) and P(w2|w1; M2) of w2 appearing given w1 are calculated in the technology and economics sub-language models, and the weights are then updated following the same steps as above; subsequent steps are similar. The final weights are obtained after T updates.
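One plausible reading of this update procedure, a standard EM re-estimation of linear-interpolation weights, is sketched below; the per-model probabilities and the example numbers are assumptions for illustration only.

```python
def estimate_weights(word_probs, n_models, init=0.1, rounds=1):
    """EM-style re-estimation of interpolation weights.

    word_probs: one entry per character of the text; each entry lists
                P(w_t | history; M_j) for j = 1..n_models (assumed given).
    """
    lam = [init] * n_models
    for _ in range(rounds):
        resp_sums = [0.0] * n_models
        for probs in word_probs:
            denom = sum(l * p for l, p in zip(lam, probs)) or 1e-12
            for j, (l, p) in enumerate(zip(lam, probs)):
                resp_sums[j] += l * p / denom      # responsibility of model j for this character
        lam = [s / len(word_probs) for s in resp_sums]
    return lam

# Toy example with k = 2 (technology, economics) and T = 3 characters.
word_probs = [[0.004, 0.001], [0.002, 0.006], [0.003, 0.003]]
print(estimate_weights(word_probs, n_models=2, rounds=5))
```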
S204. The user equipment determines error positions of the text to be processed by using an error detection model, where an error position includes an erroneous character or an erroneous character string.

Before the user equipment determines the error positions of the text to be processed, it needs to obtain the error detection model from the correction knowledge base. The error detection model may include any one or more of: the character continuation model, the part-of-speech continuation model, the phonetically-similar character dictionary, and the visually-similar character dictionary. In particular, the error detection model may also include other models, which are not described again in the present invention. In this embodiment, the character continuation model, the part-of-speech continuation model, the phonetically-similar character dictionary, the visually-similar character dictionary, and so on have already been obtained in step S201, and the user equipment may select one or more of them as the error detection model according to a preset detection rule.
First, the user equipment may perform word segmentation and part-of-speech tagging on the text to be processed; for the specific process, refer to the related explanation in step S201, which is not repeated here. For single characters or scattered character strings that appear consecutively after word segmentation, the character continuation model may be used to check whether their continuation is correct. Meanwhile, the part-of-speech continuation model may be used to check the continuation of parts of speech; for the specific process, refer to the prior art. Common text errors can be divided into two categories: "non-multi-word errors" and "true multi-word errors". A "non-multi-word error" breaks the surface structure of a word and produces a string of single characters, so that the original multi-character word string cannot be found in the word segmentation dictionary. For example, for "忠耿耿" the correct word is "忠心耿耿", but because it cannot be found in the word segmentation dictionary, it is segmented by the word segmentation program into the individual characters or words "忠", "耿", "耿". Statistically, the probability of "耿" appearing after "忠" is very small, so this kind of error can be detected by setting an appropriate threshold; such errors can therefore be detected with the character continuation model. A "true multi-word error" is an erroneous string consisting of multi-character words that do exist in the segmentation lexicon, so no word-level error usually appears; such an error is generally an error of grammatical structure or part-of-speech collocation. For "我我的书" the correct string is "我的书", and for "处长时间" the correct string is "延长时间": in "处长时间", "处长" is a noun and the following "时间" is also a noun, and statistically the probability of a noun being followed directly by a noun is small, whereas the correct "延长时间" is a verb + noun collocation, which is statistically more reasonable. Such errors can therefore be found by using the part-of-speech continuation model to judge part-of-speech continuation relations. For methods of determining error positions by means of the phonetically-similar dictionary, the visually-similar dictionary, and the like, refer to the prior art. In particular, the above error-position detection methods are only illustrative; any variation or replacement readily conceivable by a person skilled in the art within the technical scope disclosed in the present invention shall fall within the protection scope of the present invention.
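As an illustrative sketch only, the threshold check on the character continuation model might look like the following; the probability table, the threshold value, and the function name are hypothetical.

```python
def find_error_positions(chars, bigram_probs, threshold=1e-4):
    """Flag positions where the continuation probability P(c[i] | c[i-1]) is suspiciously low."""
    suspects = []
    for i in range(1, len(chars)):
        if bigram_probs.get((chars[i - 1], chars[i]), 0.0) < threshold:
            suspects.append(i)          # index of the character that breaks the continuation
    return suspects

# Toy probability table: "忠" is rarely followed by "耿" in the corpus.
bigram_probs = {("忠", "心"): 0.02, ("心", "耿"): 0.01, ("耿", "耿"): 0.03, ("忠", "耿"): 0.00001}
print(find_error_positions(list("忠耿耿"), bigram_probs))  # flags position 1 ("耿" right after "忠")
```

The same thresholding idea applies to the part-of-speech continuation model, with part-of-speech pairs in place of character pairs.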
It should be noted that, in the prior art, a text correction method based on noisy channel theory may include: setting the first character of the character string sequence to be corrected as the editing position, performing a correction operation on the string to be corrected according to the character continuation relations in the language model to generate a new set of N candidate string sequences, then setting the second character position of each string sequence in the newly generated set as the editing position and repeating the above operation. By limiting the size of N and the depth of each editing operation, it can be guaranteed that N corrected strings with relatively high probability are obtained after a limited number of operations. However, this procedure assumes by default that errors may exist anywhere in the string of the text to be corrected, so correction operations need to be performed at nearly every position in the text to be corrected; the operation is complicated, and if the string sequence of the text to be corrected is long, a state explosion may occur. In the embodiments of the present invention, error positions are screened before correction, which effectively reduces the number of correction operations and improves correction efficiency.
S205. The user equipment corrects the text to be corrected according to the mixed language model to obtain correction suggestion text.

First, a character string sequence to be corrected may be generated from the error position.

Then, the user equipment may perform a correction operation on the character string sequence to be corrected by means of error detection model matching or other methods to obtain at least one corrected character string sequence, and the at least one corrected character string sequence may form a set of corrected character string sequences. For the specific correction operation, refer to the prior art.
Next, the user equipment may obtain the m characters before and the n characters after the error position in the text to be corrected and combine them with the corrected character string sequences to obtain at least one screening sequence, where m and n are positive integers or 0 and may be preset values or dynamic values. In this way, the corrected character string sequences are linked more closely to the context of the text to be corrected. For example, if the error position is determined to be the three characters "断续续" in "声音断续续的", the character string sequence to be corrected is the string consisting of those three characters "断续续". Correcting the string to be corrected yields the corrected string sequence "断断续续", and taking the two characters before and the two characters after the error position yields "声音断断续续的" as a screening sequence. Using the statistical language model, it can be calculated that the probability of "断断续续" appearing after "声音" is high, which indicates that the corrected string generated here is appropriate. Of course, in practical applications there may be multiple corrected string sequences after correction; this is only an illustrative description. Finally, according to the mixed language model, the user equipment may obtain, from the at least one screening sequence by using the noisy channel probability model, the one string sequence with the highest probability of being the ideal string as the correction suggestion text, or it may obtain, according to the mixed language model and by using the noisy channel probability model, the several string sequences with the highest probabilities of being the ideal string from the at least one screening sequence as the correction suggestion text. The correction suggestion text may be provided to the user through a human-machine interaction interface of the user equipment for the user to confirm the correction scheme; corrected string positions may be emphasized by underlining or the like, and corrections of different error types may also be marked with symbols or shading of different colors.
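A minimal sketch of this final ranking step is given below; the bigram scorer standing in for the mixed language model, the channel probabilities, and the candidate strings are all assumptions made for illustration.

```python
import math

def sequence_log_prob(chars, bigram_probs, floor=1e-8):
    """log P(sequence) under a character bigram model (stand-in for the mixed language model)."""
    return sum(math.log(bigram_probs.get((a, b), floor))
               for a, b in zip(chars, chars[1:]))

def rank_candidates(before, candidates, after, channel_probs, bigram_probs, top_n=1):
    """Noisy-channel ranking: score = log P(error span | candidate) + log P(screening sequence)."""
    scored = []
    for cand in candidates:
        screening = before + cand + after            # screening sequence with surrounding context
        score = math.log(channel_probs.get(cand, 1e-8)) + sequence_log_prob(screening, bigram_probs)
        scored.append((score, cand))
    return [c for _, c in sorted(scored, reverse=True)[:top_n]]

# Toy data: the error span "断续续" with two hypothetical candidate corrections.
bigram_probs = {("声", "音"): 0.3, ("音", "断"): 0.1, ("断", "断"): 0.2, ("断", "续"): 0.3,
                ("续", "续"): 0.25, ("续", "的"): 0.2, ("断", "开"): 0.001, ("开", "的"): 0.01}
channel_probs = {"断断续续": 0.5, "断开": 0.01}   # assumed P(observed error | candidate)
print(rank_candidates("声音", ["断断续续", "断开"], "的", channel_probs, bigram_probs))  # ['断断续续']
```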
It should be noted that the order of the steps of the text correction method provided in the embodiments of the present invention may be adjusted appropriately, and steps may also be added or removed according to the situation. Any varied method readily conceivable by a person skilled in the art within the technical scope disclosed in the present invention shall fall within the protection scope of the present invention, and is therefore not described again.

In the text correction method provided in the embodiments of the present invention, the text to be corrected is classified and the corresponding mixed language model is then obtained, so that the mixed language model on which the correction is based can change dynamically with the text types of the text to be corrected, and this language model can reflect the linguistic phenomena of the text more accurately. When the preset text classification standard or the text types of the text to be corrected differ, different correction choices can be provided for the text to be corrected, so correction errors can be reduced and correction flexibility and correctness can be improved. Meanwhile, because error positions are screened, the number of correction operations is effectively reduced and correction efficiency is improved.

As an example, consider the text "Dell estimates that its first-quarter revenue was about 14.2 billion US dollars, with earnings per share of 33 cents. The company previously forecast revenue of 14.2 to 14.6 billion US dollars for the quarter, with earnings per share of 35 to 38 cents, while analysts on average forecast Dell's revenue for the same period at 14.52 billion US dollars, with earnings per share of 38 cents." In the Chinese text, the word "收入" (revenue) is misrecognized by OCR (Optical Character Recognition) software as "收人", producing an error. Prior-art correction can correct "收人" to "收入", but the proper noun "戴尔" (Dell) is mistakenly regarded as an error and deleted, yielding the erroneous correction "公司估计". With the present invention, selecting the technology sub-language improves recognition of the term "戴尔公司" (Dell Inc.), so no such error occurs. Likewise, the present invention may also apply named entity recognition alongside the correction, so that named entities that may cause anomalies in word segmentation and part-of-speech tagging are recognized and excluded from correction processing.
An embodiment of the present invention provides a user equipment 30, as shown in FIG. 3, including: an obtaining unit 301, configured to obtain two or more text types of text to be corrected under a preset text classification standard.

For example, the preset text classification standard may be any one of: language environment, theme background, author, writing style, and subject matter.

The obtaining unit 301 is further configured to obtain, from the correction knowledge base, a sub-language model to be combined corresponding to each text type of the text to be corrected, and send information about the two or more obtained sub-language models to be combined to a generating unit 302.

The generating unit 302 is configured to receive the information about the two or more obtained sub-language models to be combined sent by the obtaining unit 301, combine the two or more obtained sub-language models to be combined into a mixed language model, and send information about the mixed language model to a correcting unit 303.

The generating unit 302 is specifically configured to: obtain the weight of each text type in the text to be corrected; and combine, according to the weights of the respective text types, the two or more obtained sub-language models to be combined to obtain the mixed language model.

The correcting unit 303 is configured to receive the information about the mixed language model sent by the generating unit 302, and correct the text to be corrected according to the mixed language model to obtain correction suggestion text.

The correcting unit 303 may be specifically configured to: generate a character string sequence to be corrected from the error position; perform a correction operation on the character string sequence to be corrected to obtain at least one corrected character string sequence; obtain the m characters before and the n characters after the error position in the text to be corrected, and combine them with the corrected character string sequences to obtain at least one screening sequence; and, according to the mixed language model, obtain from the at least one screening sequence, by using the noisy channel probability model, the one string sequence with the highest probability of being the ideal string as the correction suggestion text, or, according to the mixed language model, obtain from the at least one screening sequence, by using the noisy channel probability model, the several string sequences with the highest probabilities of being the ideal string as the correction suggestion text.

In this way, the obtaining unit classifies the text to be corrected and the generating unit then obtains the corresponding mixed language model, so that the mixed language model on which the correcting unit bases its correction can change dynamically with the text types of the text to be corrected. When the preset text classification standard or the text types of the text to be corrected differ, different correction choices can be provided for the text to be corrected, so correction errors can be reduced and correction flexibility and correctness can be improved.
Further, as shown in FIG. 4, the user equipment 30 may further include: the obtaining unit 301, configured to obtain the preset text classification standard and send the preset text classification standard to an establishing unit 304; and

the establishing unit 304, configured to receive the preset text classification standard sent by the obtaining unit 301 and establish two or more sub-language models according to the text types under the preset text classification standard.

A model obtaining unit 305 is configured to obtain the error detection model from the correction knowledge base and send information about the error detection model to a determining unit 306.

For example, the error detection model may include any one or more of: the character continuation model, the part-of-speech continuation model, the phonetically-similar character dictionary, and the visually-similar character dictionary.

The determining unit 306 is configured to receive the information about the error detection model sent by the model obtaining unit 305, and determine error positions of the text to be processed by using the error detection model, where an error position includes an erroneous character or an erroneous character string.

A person skilled in the art may clearly understand that, for convenience and brevity of description, for the specific operating steps of the user equipment described above, reference may be made to the corresponding processes in the foregoing embodiments of the text correction method, and details are not described herein again.

With the user equipment provided in the embodiments of the present invention, the text to be corrected is classified and the corresponding mixed language model is then obtained, so that the mixed language model on which the correction is based can change dynamically with the text types of the text to be corrected, and this language model can reflect the linguistic phenomena of the text more accurately. When the preset text classification standard or the text types of the text to be corrected differ, different correction choices can be provided for the text to be corrected, so correction errors can be reduced and correction flexibility and correctness can be improved. Meanwhile, because error positions are screened, the number of correction operations is effectively reduced and correction efficiency is improved.
A person skilled in the art may clearly understand that, for convenience and brevity of description, for the specific working processes of the apparatus and units described above, reference may be made to the corresponding processes in the foregoing method embodiments, and details are not described herein again.

In the several embodiments provided in this application, it should be understood that the disclosed system, apparatus, and method may be implemented in other manners. For example, the described apparatus embodiments are merely illustrative. The division of the units is merely a division by logical function, and there may be other division manners in actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not performed. In addition, the mutual couplings, direct couplings, or communication connections shown or discussed may be implemented through some interfaces, and the indirect couplings or communication connections between apparatuses or units may be electrical, mechanical, or in other forms.

The units described as separate components may or may not be physically separate, and components displayed as units may or may not be physical units; they may be located in one place, or may be distributed over multiple network elements. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.

In addition, the functional units in the embodiments of the present invention may be integrated into one processing unit, or each of the units may exist alone physically, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware, or in the form of hardware plus a software functional unit.

An embodiment of the present invention provides a user equipment 50, as shown in FIG. 5, including: a processor 501, configured to obtain two or more text types of text to be corrected under a preset text classification standard.
For example, the preset text classification standard may be any one of: language environment, theme background, author, writing style, and subject matter.

The processor 501 is further configured to: obtain, from the correction knowledge base, a sub-language model to be combined corresponding to each text type of the text to be corrected; combine the two or more obtained sub-language models to be combined into a mixed language model; and correct the text to be corrected according to the mixed language model to obtain correction suggestion text.

The processor 501 is specifically configured to: obtain the weight of each text type in the text to be corrected; and combine, according to the weights of the respective text types, the two or more obtained sub-language models to be combined to obtain the mixed language model.

The processor 501 is specifically configured to: generate a character string sequence to be corrected from the error position; perform a correction operation on the character string sequence to be corrected to obtain at least one corrected character string sequence; obtain the m characters before and the n characters after the error position in the text to be corrected, and combine them with the corrected character string sequences to obtain at least one screening sequence; and, according to the mixed language model, obtain from the at least one screening sequence, by using the noisy channel probability model, the one string sequence with the highest probability of being the ideal string as the correction suggestion text, or, according to the mixed language model, obtain from the at least one screening sequence, by using the noisy channel probability model, the several string sequences with the highest probabilities of being the ideal string as the correction suggestion text.

In this way, the processor classifies the text to be corrected and then obtains the corresponding mixed language model, so that the mixed language model on which the correction is based can change dynamically with the text types of the text to be corrected. When the preset text classification standard or the text types of the text to be corrected differ, different correction choices can be provided for the text to be corrected, so correction errors can be reduced and correction flexibility and correctness can be improved.
Further, the processor 501 is also configured to obtain the preset text classification standard.

As shown in FIG. 6, the user equipment 50 further includes: a memory 502, configured to establish two or more sub-language models according to the types under the preset text classification standard and send information about the sub-language models to the processor 501.

The processor 501 is further configured to obtain the error detection model from the correction knowledge base. For example, the error detection model may include any one or more of: the character continuation model, the part-of-speech continuation model, the phonetically-similar character dictionary, and the visually-similar character dictionary.

The processor 501 is further configured to determine error positions of the text to be processed by using the error detection model, where an error position includes an erroneous character or an erroneous character string. A person skilled in the art may clearly understand that, for convenience and brevity of description, for the specific use of the memory and the processor in the user equipment described above, reference may be made to the corresponding processes in the foregoing embodiments of the text correction method, and details are not described herein again.
A person of ordinary skill in the art may understand that all or some of the steps of the foregoing method embodiments may be implemented by a program instructing relevant hardware. The foregoing program may be stored in a computer-readable storage medium; when the program is executed, the steps of the foregoing method embodiments are performed. The foregoing storage medium includes any medium that can store program code, such as a ROM, a RAM, a magnetic disk, or an optical disc.

The foregoing descriptions are merely specific embodiments of the present invention, but the protection scope of the present invention is not limited thereto. Any variation or replacement readily conceivable by a person skilled in the art within the technical scope disclosed in the present invention shall fall within the protection scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims

1. A text correction method, comprising:

obtaining two or more text types of text to be corrected under a preset text classification standard; obtaining, from a correction knowledge base, a sub-language model to be combined corresponding to each text type of the text to be corrected;

combining the two or more obtained sub-language models to be combined into a mixed language model; and correcting the text to be corrected according to the mixed language model to obtain correction suggestion text.
2. The method according to claim 1, wherein the preset text classification standard is any one of: language environment, theme background, author, writing style, and subject matter.

3. The method according to claim 2, wherein the method further comprises: obtaining the preset text classification standard; and

establishing two or more sub-language models according to the text types under the preset text classification standard.
4. The method according to claim 3, wherein the combining the two or more obtained sub-language models to be combined into a mixed language model comprises:

obtaining a weight of each text type in the text to be corrected; and

combining, according to the weights of the respective text types, the two or more obtained sub-language models to be combined to obtain the mixed language model.
5. The method according to any one of claims 1 to 4, wherein before the correcting the text to be corrected according to the mixed language model to obtain correction suggestion text, the method further comprises:

obtaining an error detection model from the correction knowledge base; and

determining an error position of the text to be processed by using the error detection model, wherein the error position comprises an erroneous character or an erroneous character string.
6. The method according to claim 5, wherein the error detection model comprises any one or more of: a character continuation model, a part-of-speech continuation model, a phonetically-similar character dictionary, and a visually-similar character dictionary.
7. The method according to claim 5 or 6, wherein the correcting the text to be corrected according to the mixed language model to obtain correction suggestion text comprises: generating a character string sequence to be corrected from the error position;

performing a correction operation on the character string sequence to be corrected to obtain at least one corrected character string sequence;

obtaining the m characters before and the n characters after the error position in the text to be corrected, and combining them with the corrected character string sequence to obtain at least one screening sequence; and

obtaining, according to the mixed language model and by using a noisy channel probability model, from the at least one screening sequence, the one string sequence with the highest probability of being the ideal string as the correction suggestion text, or

obtaining, according to the mixed language model and by using a noisy channel probability model, from the at least one screening sequence, the several string sequences with the highest probabilities of being the ideal string as the correction suggestion text.
8. A user equipment, comprising:

an obtaining unit, configured to obtain two or more text types of text to be corrected under a preset text classification standard;

wherein the obtaining unit is further configured to obtain, from a correction knowledge base, a sub-language model to be combined corresponding to each text type of the text to be corrected, and send information about the two or more obtained sub-language models to be combined to a generating unit;

the generating unit is configured to receive the information about the two or more obtained sub-language models to be combined sent by the obtaining unit, combine the two or more obtained sub-language models to be combined into a mixed language model, and send information about the mixed language model to a correcting unit; and

the correcting unit is configured to receive the information about the mixed language model sent by the generating unit, and correct the text to be corrected according to the mixed language model to obtain correction suggestion text.
9. The user equipment according to claim 8, wherein the preset text classification standard is any one of: language environment, theme background, author, writing style, and subject matter.

10. The user equipment according to claim 9, wherein the user equipment further comprises:

the obtaining unit, configured to obtain the preset text classification standard and send the preset text classification standard to an establishing unit; and

the establishing unit, configured to receive the preset text classification standard sent by the obtaining unit, and establish two or more sub-language models according to the text types under the preset text classification standard.
11. The user equipment according to claim 10, wherein the generating unit is specifically configured to:

obtain a weight of each text type in the text to be corrected; and

combine, according to the weights of the respective text types, the two or more obtained sub-language models to be combined to obtain the mixed language model.
12. The user equipment according to any one of claims 8 to 11, wherein the user equipment further comprises:

a model obtaining unit, configured to obtain an error detection model from the correction knowledge base and send information about the error detection model to a determining unit; and

the determining unit, configured to receive the information about the error detection model sent by the model obtaining unit, and determine an error position of the text to be processed by using the error detection model, wherein the error position comprises an erroneous character or an erroneous character string.
13. The user equipment according to claim 12, wherein the error detection model comprises any one or more of: a character continuation model, a part-of-speech continuation model, a phonetically-similar character dictionary, and a visually-similar character dictionary.
14. The user equipment according to claim 12 or 13, wherein the correcting unit is specifically configured to:

generate a character string sequence to be corrected from the error position;

perform a correction operation on the character string sequence to be corrected to obtain at least one corrected character string sequence;

obtain the m characters before and the n characters after the error position in the text to be corrected, and combine them with the corrected character string sequence to obtain at least one screening sequence; and, according to the mixed language model, obtain, from the at least one screening sequence by using a noisy channel probability model, the one string sequence with the highest probability of being the ideal string as the correction suggestion text, or

according to the mixed language model, obtain, from the at least one screening sequence by using a noisy channel probability model, the several string sequences with the highest probabilities of being the ideal string as the correction suggestion text.
PCT/CN2013/073382 2012-09-10 2013-03-28 Text correcting method and user equipment WO2014036827A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201210332263.3 2012-09-10
CN201210332263.3A CN103678271B (en) 2012-09-10 2012-09-10 A kind of text correction method and subscriber equipment

Publications (1)

Publication Number Publication Date
WO2014036827A1 true WO2014036827A1 (en) 2014-03-13

Family

ID=50236498

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2013/073382 WO2014036827A1 (en) 2012-09-10 2013-03-28 Text correcting method and user equipment

Country Status (2)

Country Link
CN (1) CN103678271B (en)
WO (1) WO2014036827A1 (en)



Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101021838A (en) * 2007-03-02 2007-08-22 华为技术有限公司 Text handling method and system
CN101031913A (en) * 2004-09-30 2007-09-05 皇家飞利浦电子股份有限公司 Automatic text correction
CN101655837A (en) * 2009-09-08 2010-02-24 北京邮电大学 Method for detecting and correcting error on text after voice recognition
JP2011113099A (en) * 2009-11-21 2011-06-09 Kddi R & D Laboratories Inc Text correction program and method for correcting text containing unknown word, and text analysis server
CN102165435A (en) * 2007-08-01 2011-08-24 金格软件有限公司 Automatic context sensitive language generation, correction and enhancement using an internet corpus

Also Published As

Publication number Publication date
CN103678271B (en) 2016-09-14
CN103678271A (en) 2014-03-26

Similar Documents

Publication Publication Date Title
US11810568B2 (en) Speech recognition with selective use of dynamic language models
WO2014036827A1 (en) Text correcting method and user equipment
US11693894B2 (en) Conversation oriented machine-user interaction
US20210201143A1 (en) Computing device and method of classifying category of data
US10114809B2 (en) Method and apparatus for phonetically annotating text
WO2018219023A1 (en) Speech keyword identification method and device, terminal and server
CN105988990B (en) Chinese zero-reference resolution device and method, model training method and storage medium
US9069753B2 (en) Determining proximity measurements indicating respective intended inputs
JP5901001B1 (en) Method and device for acoustic language model training
US9881010B1 (en) Suggestions based on document topics
US8176419B2 (en) Self learning contextual spell corrector
US20090192781A1 (en) System and method of providing machine translation from a source language to a target language
WO2018076450A1 (en) Input method and apparatus, and apparatus for input
US10902211B2 (en) Multi-models that understand natural language phrases
WO2017161899A1 (en) Text processing method, device, and computing apparatus
US9082404B2 (en) Recognizing device, computer-readable recording medium, recognizing method, generating device, and generating method
US20150242386A1 (en) Using language models to correct morphological errors in text
Fusayasu et al. Word-error correction of continuous speech recognition based on normalized relevance distance
US9251141B1 (en) Entity identification model training
WO2020052060A1 (en) Method and apparatus for generating correction statement
CN117043859A (en) Lookup table cyclic language model
CN102955770A (en) Method and system for automatic recognition of pinyin
US20230186898A1 (en) Lattice Speech Corrections
Ray et al. Iterative delexicalization for improved spoken language understanding
CN111382322B (en) Method and device for determining similarity of character strings

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 13835272

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 13835272

Country of ref document: EP

Kind code of ref document: A1