WO2014036827A1 - Method and user equipment for text correction - Google Patents


Info

Publication number
WO2014036827A1
Authority
WO
WIPO (PCT)
Prior art keywords
text
corrected
model
correction
language model
Prior art date
Application number
PCT/CN2013/073382
Other languages
English (en)
Chinese (zh)
Inventor
胡楠
杨锦春
Original Assignee
华为技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华为技术有限公司
Publication of WO2014036827A1


Classifications

    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06F — ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 — Handling natural language data
    • G06F40/20 — Natural language analysis
    • G06F40/232 — Orthographic correction, e.g. spell checking or vowelisation

Definitions

  • the present invention relates to the field of language processing, and in particular, to a text correction method and user equipment.
  • W = ⟨w1, w2, …, wn⟩ is the original string sequence, that is, the completely correct text; after passing through the noise channel, the noisy text O = ⟨o1, o2, …, on⟩ is generated.
  • the method of text correction is to establish a noise channel probability model and find a string sequence W' such that, when the string sequence O is observed, the probability of occurrence of W' is the largest; the string sequence O is the text to be corrected.
  • the string sequence W' is an ideal corrected text, which can also be called an ideal string, but the ideal corrected text is not necessarily identical to the correct text W.
  • the string sequence W' is the string with the highest probability: W' = argmax_W P(W | O) = argmax_W P(O | W) P(W).
  • the probability P(O | W) is called the channel probability or generation model.
  • the probability P(W) is the probability of occurrence of the string sequence W in the language model.
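The noisy-channel decision rule above — choose the candidate W that maximizes P(O | W) · P(W) — can be sketched as follows. The probability tables here are invented toy values, not the patent's actual models:

```python
# Hypothetical toy probabilities; a real system would estimate these
# from a corpus (language model P(W)) and an error model (channel P(O|W)).
LM = {"their car": 0.004, "there car": 0.0001}          # P(W)
CHANNEL = {("there car", "their car"): 0.6,             # P(O | W)
           ("there car", "there car"): 0.4}

def best_correction(observed, candidates):
    """Pick W' = argmax_W P(O | W) * P(W) for the observed string O."""
    def score(w):
        return CHANNEL.get((observed, w), 1e-9) * LM.get(w, 1e-9)
    return max(candidates, key=score)

print(best_correction("there car", ["their car", "there car"]))
```

Here the ideal string "their car" wins because its language-model probability outweighs the channel's preference for leaving the observed text unchanged.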
  • Embodiments of the present invention provide a text correction method and user equipment for improving correction flexibility and correctness.
  • the embodiment of the present invention uses the following technical solutions:
  • a text correction method including:
  • the obtained two or more sub-language models to be combined are combined into a mixed language model; the text to be corrected is corrected according to the mixed language model to obtain correction suggestion text.
  • the preset text classification criteria are: any one of a language environment, a theme background, an author, a writing style, and a theme.
  • the method further includes:
  • Two or more sub-language models are established according to the text type in the preset text classification standard.
  • Combining the obtained two or more sub-language models to be combined into a mixed language model includes:
  • the method further includes:
  • An error location of the to-be-processed text is determined by the error detection model, the error location including an erroneous character or an erroneous character string.
  • the error detection model includes: any one or more of a word connection model, a part-of-speech connection model, a sound near dictionary, and a shape near dictionary.
  • the correcting the text to be corrected according to the mixed language model to obtain correction suggestion text includes:
  • the first several string sequences with a high probability of being the ideal string are obtained, in the at least one screening sequence, by the noise channel probability model as the correction suggestion text.
  • a user equipment including:
  • An obtaining unit configured to obtain two or more text types of the text to be corrected in a preset text classification standard
  • the obtaining unit is further configured to acquire, in the correction knowledge base, a sub-language model to be combined corresponding to each text type of the text to be corrected, and send the acquired information of the two or more sub-language models to be combined to the generating unit;
  • a generating unit configured to receive the information of the acquired two or more sub-language models to be combined sent by the obtaining unit, combine the acquired two or more sub-language models to be combined into a mixed language model, and send the information of the mixed language model to the correction unit;
  • a correction unit configured to receive information of the mixed language model sent by the generating unit, and correct the text to be corrected according to the mixed language model to obtain correction suggestion text.
  • the preset text classification criteria are: any one of a language environment, a theme background, an author, a writing style, and a theme.
  • the user equipment further includes:
  • the obtaining unit is configured to acquire the preset text classification standard, and send the preset text classification standard to an establishing unit;
  • an establishing unit configured to receive the preset text classification standard sent by the acquiring unit, and establish two or more sub-language models according to the text type in the preset text classification standard.
  • the generating unit is specifically configured to:
  • the user equipment further includes:
  • a model obtaining unit configured to acquire an error detection model in the correction knowledge base, and send information of the error detection model to a determining unit;
  • a determining unit configured to receive information about the error detection model sent by the model obtaining unit, and determine an error location of the to-be-processed text by using the error detection model, where the error location includes an error character or an error string.
  • the error detection model includes: one or more of a word connection model, a part-of-speech connection model, a sound near dictionary, and a shape near dictionary.
  • the correction unit is specifically configured to:
  • the first few string sequences with a high probability of occurrence of the ideal string are obtained as the correction suggestion text.
  • An embodiment of the present invention provides a text correction method and a user equipment. The text correction method includes: acquiring two or more text types of the text to be corrected in a preset text classification standard; acquiring, in the correction knowledge base, the sub-language model to be combined corresponding to each text type of the text to be corrected; combining the acquired two or more sub-language models to be combined into a mixed language model; and correcting the text to be corrected according to the mixed language model to obtain correction suggestion text.
  • Since the mixed language model on which the correction is based can dynamically change according to the text type of the text to be corrected, the correction can provide different correction options when the preset text classification standard or the text type of the text to be corrected differs, thus reducing correction errors and improving correction flexibility and correctness.
  • FIG. 1 is a schematic flowchart of a text correction method according to an embodiment of the present invention
  • FIG. 2 is a schematic flowchart of another text correction method according to an embodiment of the present invention
  • FIG. 3 is a schematic structural diagram of a user equipment according to an embodiment of the present disclosure
  • FIG. 4 is a schematic structural diagram of another user equipment according to an embodiment of the present disclosure.
  • FIG. 5 is a schematic structural diagram of still another user equipment according to an embodiment of the present disclosure.
  • FIG. 6 is a schematic structural diagram of still another user equipment according to an embodiment of the present invention.
  • An embodiment of the present invention provides a text correction method, including:
  • the above preset text classification criteria may include: any one of a language environment, a theme background, an author, a writing style, and a theme.
  • texts can be divided into text types such as sports, economics, politics, and technology according to the theme.
  • the user equipment may establish a corresponding sub-language model according to the text type of the theme background in the correction knowledge base.
  • the text classification technique can be utilized to determine the classification to which the text to be corrected belongs.
  • text classification techniques can be used to determine that the text types to which the text belongs are technology and economics. The technology and economics sub-language models corresponding to the text types of the text to be corrected are selected in the correction knowledge base, and the technology sub-language model is then combined with the economics sub-language model into a mixed language model.
  • since the mixed language model on which the correction is based can dynamically change according to the text type of the text to be corrected, correction errors are reduced and correction flexibility and correctness are improved.
  • another embodiment of the present invention provides a specific text correction method, including: S201.
  • the user equipment classifies the acquired corpus according to the preset text classification standard into each sub-language model according to the text type.
  • the user equipment needs to obtain the preset text classification standard, which may include: any one of a language environment, a theme background, an author, a writing style, and a theme, and is usually preset by the user according to specific conditions.
  • the user equipment establishes two or more sub-language models according to the text type in the preset text classification standard.
  • the following types of sub-language models can be obtained by language environment, such as a business environment, a living environment, or an official environment.
  • the following types of sub-language models, such as sports, politics, literature, or history, can be obtained by theme.
  • the actual type of a sub-language model is also related to the type of the corpus. For example, if there is no history-type corpus in the correction knowledge base, the history sub-language model can be regarded as idle or invalid; when the user equipment obtains a certain amount of history corpus, through active acquisition or user input, a new history sub-language model can be established according to that corpus and regarded as an effective sub-language model.
  • the acquired corpus is classified into the sub-language model according to the type.
  • the user equipment can enrich the correction knowledge base by obtaining corpus on a regular or irregular basis.
  • the corpus may be obtained by the user equipment through a network connection to the Internet, through periodic updates, or the like, or the user may provide corpus data to the user equipment through an input interface such as the configuration management interface. The user equipment then classifies the corpus into an existing sub-language model of the corresponding type, or creates a new sub-language model according to the corpus type indicated by the user.
  • the user can add a history corpus collection through regular updates, Internet search, or even the configuration management interface, and then establish a history sub-language model; if history corpus data already exists, new history corpus can also be added in the above way to update the sub-language model.
  • the corpus obtained by the user equipment is an unclassified corpus, and the user equipment needs to classify the obtained corpus according to the preset text classification standard according to the type.
  • the corpus is then classified. For example, the above-mentioned computer technology information text containing economic content such as the stock market includes the passage: "Dell estimates that its first quarter revenue was about $14.2 billion, and earnings per share were 33 cents. Revenues for the quarter were between $14.2 billion and $14.6 billion, and earnings per share were between 35 and 38 cents, while analysts on average predicted Dell's revenue for the same period at $14.52 billion, with earnings per share of 38 cents."
  • the text classification technology is used to automatically classify unclassified corpus.
  • the classification process is divided into two phases: training phase and classification phase.
  • training phase the text in the classified corpus is processed by word segmentation, and the word segmentation process is the same as the prior art, and will not be described here.
  • after word segmentation, the above content can be expressed as "Dell / estimates / its / first / quarter / revenue / about / …"; for convenience of presentation, the embodiment of the present invention uses "/" to represent the segmentation between words.
  • the collection of word vectors of the different texts in the above corpus, after dimensionality-reduction processing, is combined with the known classification labels to train the classifier; in the classification stage, the corpus text to be classified is processed and represented as a vector, and input into the classifier to classify the text into types such as sports and finance.
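A minimal sketch of the train-then-classify pipeline described above. The patent does not name a particular classifier, so a simple Naive Bayes model over word counts stands in here; all labels and documents are invented for illustration:

```python
import math
from collections import Counter, defaultdict

def train(labeled_docs):
    """labeled_docs: list of (label, token list). Collect per-class counts."""
    word_counts = defaultdict(Counter)   # per-label word frequencies
    doc_counts = Counter()               # per-label document counts
    for label, tokens in labeled_docs:
        doc_counts[label] += 1
        word_counts[label].update(tokens)
    return word_counts, doc_counts

def classify(tokens, word_counts, doc_counts, vocab_size=1000):
    """Pick the label maximizing log P(label) + sum log P(token | label),
    with add-one smoothing over an assumed vocabulary size."""
    total_docs = sum(doc_counts.values())
    best, best_lp = None, float("-inf")
    for label in doc_counts:
        lp = math.log(doc_counts[label] / total_docs)
        n = sum(word_counts[label].values())
        for tok in tokens:
            lp += math.log((word_counts[label][tok] + 1) / (n + vocab_size))
        if lp > best_lp:
            best, best_lp = label, lp
    return best

docs = [("finance", ["revenue", "earnings", "share", "quarter"]),
        ("sports", ["match", "score", "team", "season"])]
wc, dc = train(docs)
print(classify(["earnings", "per", "share"], wc, dc))
```

A real system would use the segmented, dimensionality-reduced word vectors the patent describes; the counting-and-argmax structure is the same.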
  • the corpus is classified into corresponding sub-language models according to different classifications, and the probability of the corresponding sub-language model is updated.
  • for the text in the corpus, a 2-Gram statistical model and a 3-Gram statistical model of the words are established as the word continuation model. For example, if a corpus text contains the string "knowledge base building module", the 2-Gram groups created are the pairs of adjacent characters in that string, and the statistical probability of occurrence of each 2-Gram group in the classification corpus of the text is then calculated.
  • for the Dell text above, the established 2-Gram groups likewise include the adjacent character pairs of the sentence, such as "Dell", "company", "estimates", "its first", "first quarter", and so on.
  • First count the number of occurrences of each word and calculate the proportion of the word in the entire corpus as a probability of occurrence of the word.
  • For each 2-Gram group, count the number of times the second character appears after the first; for example, the group "Dell" indicates that the character "Dai" is followed by the character "er", and its count is the number of such occurrences in the text contained in the entire corpus.
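The 2-Gram counting above can be sketched as follows; the strings are illustrative ASCII placeholders rather than the Chinese characters of the original example:

```python
from collections import Counter

def char_bigram_model(texts):
    """Build a character 2-Gram model:
    P(c2 | c1) = count(c1 c2) / count(c1 followed by anything)."""
    bigrams, starts = Counter(), Counter()
    for text in texts:
        for c1, c2 in zip(text, text[1:]):
            bigrams[(c1, c2)] += 1
            starts[c1] += 1
    def prob(c1, c2):
        if starts[c1] == 0:
            return 0.0
        return bigrams[(c1, c2)] / starts[c1]
    return prob

prob = char_bigram_model(["abcab", "abd"])
# "a" is followed by "b" in all 3 cases where "a" has a successor
print(prob("a", "b"))
```

The same counting scheme extends directly to word 2-Grams and 3-Grams, and (with part-of-speech tags substituted for characters) to the part-of-speech continuation model described below.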
  • the corpus after word segmentation may be tagged with parts of speech, and then a 2-gram part-of-speech statistical model and a 3-gram part-of-speech statistical model are established as the part-of-speech continuation model, wherein "2-gram" in the 2-gram part-of-speech statistical model refers to two phrases, or 2 characters.
  • suppose the corpus contains "knowledge base building module"; after word segmentation, the words "knowledge base", "build", and "module" are obtained, and their tagged parts of speech are noun, verb, and noun.
  • the established 2-gram part-of-speech statistical models are "knowledge base build" and "build module", whose parts of speech are noun plus verb and verb plus noun respectively; the established 3-gram part-of-speech statistical model is "knowledge base build module", whose part of speech is noun plus verb plus noun. That is, when establishing the 2-gram and 3-gram part-of-speech statistical models, the corresponding parts of speech also need to be labeled.
  • the calculation method of the specific statistical model is similar to the method of establishing the 2-Gram and 3-Gram statistical models of the words above, and is not described again in the present invention.
  • the user equipment acquires two or more text types of the text to be corrected in a preset text classification standard.
  • the user equipment can obtain the text to be corrected in various ways; for example, the user directly enters it into the user equipment through the user interface, or transmits it to the user equipment through an input interface such as the configuration management interface. The user equipment then uses text classification technology to perform automatic classification of the text to be corrected. The classification process is divided into two stages: a training stage and a classification stage. In the training stage, the text is subjected to word segmentation processing; the word segmentation process is the same as in the prior art and is not described herein.
  • in the classification stage, the classifier trained with the known classification labels is used: the text to be corrected is processed and represented as a vector, and input into the classifier to classify the text into types such as sports and finance.
  • the text to be corrected is classified into the corresponding sub-language models according to its classifications, and the probabilities of the corresponding sub-language models are updated.
  • the user equipment acquires a mixed language model.
  • the user equipment may acquire, in the correction knowledge base, a sub-language model to be combined corresponding to each text type of the text to be corrected.
  • the correction knowledge base may include: a sub-language model, a word continuation model, a part-of-speech continuation model, a sound near dictionary, a shape near dictionary, and the like. Since there are many text types in the correction knowledge base, it is only necessary to select the sub-language models corresponding to the text types of the text to be corrected to obtain a mixed language model.
  • the user equipment can obtain the proportion of each sub-language model in the text to be corrected by calculation.
  • the acquired two or more sub-language models to be combined are combined to obtain the mixed language model.
  • the proportion of each sub-language model can be calculated, for example, by the expectation-maximization (EM) algorithm;
  • the sub-language models to be combined are then combined according to the proportion of each sub-language model in the mixed language model to obtain the mixed language model.
  • each sub-language model can also be multiplied by its corresponding weight, achieving the effect of combining according to the proportions to obtain the mixed language model.
  • the mixed language model is formed by linear interpolation for each sub-language model.
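Linear interpolation of sub-language models can be sketched as follows; the model probability tables and the 70/30 proportions are invented for illustration:

```python
# Hypothetical per-model bigram probabilities (word pairs, toy values).
tech_model = {("cloud", "computing"): 0.02, ("stock", "market"): 0.001}
econ_model = {("cloud", "computing"): 0.001, ("stock", "market"): 0.03}

def mixed_prob(bigram, models_with_weights):
    """P_mix = sum_j lambda_j * P_j, with the lambda weights summing to 1."""
    return sum(w * m.get(bigram, 0.0) for m, w in models_with_weights)

# A text judged 70% technology / 30% economics:
mix = [(tech_model, 0.7), (econ_model, 0.3)]
print(mixed_prob(("cloud", "computing"), mix))  # 0.7*0.02 + 0.3*0.001
```

Because the weights track the proportion of each text type in the text to be corrected, the mixture shifts automatically as the detected text types change.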
  • the mixed language model is represented by the sub-language models as follows: P_mix(w_m) = λ_1 P_1(w_m) + λ_2 P_2(w_m) + … + λ_k P_k(w_m), for m = 1, …, i
  • where i is the length of the string to be corrected, k is the number of sub-language models, λ_j is the weight of the j-th sub-language model, and P_j(w_m) is the probability of the string in the j-th sub-language model.
  • a likelihood function of the text to be processed can be given.
  • given the likelihood function, it is necessary to find the sub-language-model weights λ that maximize the likelihood function; those λ are then the weights of the sub-language models.
  • t represents the t-th weight estimate; in the embodiment of the present invention, t is finally equal to the number of words in the text to be processed; M represents a language model, Mj represents the j-th sub-language model in the mixed language model provided by the embodiment of the present invention, and k is the number of sub-language models involved in determining the text.
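A sketch of estimating the interpolation weights λ with the EM algorithm suggested above: each iteration computes, for every token, each sub-model's responsibility λ_j·P_j / Σ_l λ_l·P_l, then re-estimates λ_j as the average responsibility. The per-token probabilities below are invented:

```python
def em_weights(token_probs, k, iterations=50):
    """token_probs: for each token of the text, a length-k list of the
    probability that each sub-language model assigns to that token.
    Returns interpolation weights maximizing the text's likelihood."""
    lam = [1.0 / k] * k                      # uniform initialization
    for _ in range(iterations):
        resp_sums = [0.0] * k
        for probs in token_probs:
            denom = sum(l * p for l, p in zip(lam, probs)) or 1e-12
            for j in range(k):               # E-step: responsibilities
                resp_sums[j] += lam[j] * probs[j] / denom
        lam = [s / len(token_probs) for s in resp_sums]  # M-step
    return lam

# Two hypothetical sub-models scoring four tokens of the text:
probs = [[0.02, 0.001], [0.015, 0.002], [0.001, 0.03], [0.018, 0.001]]
weights = em_weights(probs, k=2)
print(round(sum(weights), 6))  # the weights always sum to 1
```

Three of the four tokens favor the first sub-model, so EM assigns it the larger weight, matching the intuition that the mixture should follow the dominant text type.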
  • the user equipment determines, by using an error detection model, an error location of the to-be-processed text, where the error location includes an error character or an error string.
  • the error detection model can include: any one or more of a word connection model, a part-of-speech connection model, a sound near dictionary, and a shape near dictionary.
  • the error detection model may also include other models, which are not described again in the present invention.
  • step S201 has obtained a word connection model, a part-of-speech connection model, a sound near dictionary, a shape near dictionary, and the like, and the user equipment can obtain one or more error detection models according to preset detection rules.
  • the user equipment can perform word segmentation and part-of-speech tagging processing on the text to be processed.
  • for details, refer to the specific explanation in step S201; they are not described herein again.
  • a single character or a scattered string that appears consecutively after the word segmentation can be checked by the word continuity model to see if it is correct.
  • the part-of-speech continuation model can be used to check the continuity of parts of speech.
  • the specific process can refer to the prior art.
  • "Non-multiple-word errors" are errors that destroy the surface structure of a word and form scattered single characters, so that the original string of a multi-character word cannot be found in the word-segmentation dictionary. For example, when a character of "loyalty" is miswritten, the word cannot be found in the word-segmentation dictionary, and the word-segmentation program splits it into several individual characters. Statistically, the probability of the wrong character following the preceding characters is very small, so this type of error can be detected by setting an appropriate threshold; such errors can therefore be detected by the word continuity model.
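The threshold-based detection with the word continuity model can be sketched as follows; the bigram probability table and the threshold value are illustrative assumptions:

```python
def detect_errors(chars, bigram_prob, threshold=0.001):
    """Flag positions where the character-continuation probability falls
    below a threshold, suggesting a broken word such as a typo."""
    errors = []
    for i, (c1, c2) in enumerate(zip(chars, chars[1:])):
        if bigram_prob.get((c1, c2), 0.0) < threshold:
            errors.append(i + 1)  # index of the suspicious character
    return errors

# Hypothetical probabilities learned from a corpus; "x" after "y" is unseen.
model = {("l", "o"): 0.05, ("o", "y"): 0.04, ("y", "a"): 0.03}
print(detect_errors("loyx", model))
```

A pair that never (or almost never) occurs in the training corpus drops below the threshold, so the position of the offending character is reported as an error location.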
  • the wrong string of "true multi-word error” is a multi-word in the segmentation vocabulary. Usually there is no word-level error, and this kind of error is generally a grammatical structure or a morphological error.
  • For example, for one wrong string the correct string is "my book", and for another the correct string is "extend the time"; in the wrong string, the first word is a noun and the following "time" is also a noun, and statistically the probability of a noun being followed by this noun is small, while the correct "extend the time" is a combination of a verb and a noun, which is statistically reasonable. Therefore, such errors can be found by using the part-of-speech model to judge part-of-speech continuation relationships.
  • the method of determining the error position by the sound near dictionary, the shape near dictionary, and the like can refer to the prior art.
  • the methods for detecting the above-mentioned error positions are only schematic descriptions; any variation or substitution that can be easily conceived by a person skilled in the art within the technical scope disclosed by the present invention shall fall within the protection scope of the present invention.
  • the correction method may include: setting the first character in the string sequence to be corrected as the editing position, performing a correction operation on the string to be corrected according to the word connection relationships in the language model, and generating a new set of N string-sequence combinations; the above operation is then repeated with the second character position of each string sequence in the newly generated set as the editing position.
  • N corrected strings can be obtained after a finite number of operations.
  • this operation process assumes by default that the entire string of the text to be corrected is wrong, so almost every position in the text to be corrected must be corrected, and the operation is complicated; if the string sequence of the text to be corrected is long, a state explosion may occur.
  • since screening of the error positions is performed before correction, the number of correction operations is effectively reduced and the efficiency of correction is improved.
  • the user equipment performs correction according to the mixed language model to obtain correction suggestion text.
  • a sequence of strings to be corrected can be generated from the error location.
  • the user equipment may perform a correction operation on the string sequence to be corrected, by error-detection-model matching or other methods, to obtain at least one corrected string sequence; the at least one corrected string sequence may constitute a set of corrected string sequences. The specific correction operation can refer to the prior art.
  • the user equipment may obtain the first m and the last n characters of the error position in the to-be-corrected text, and combine with the corrected character string sequence to obtain at least one screening sequence.
  • m and n are positive integers or 0, and can be preset or dynamically determined. In this way, the corrected string sequences are more closely related to the context of the text to be corrected.
  • for example, the string sequence to be corrected is the three-character string "intermittent", and the corrected string sequence is "intermittently"; obtaining the first 2 and last 2 characters around the error position yields the screening sequence "sound intermittently", and the statistical language model can then be used to calculate the probability that "intermittently" appears after "sound"; a high probability means that the correction string generated here is appropriate.
  • there may be multiple corrected string sequences obtained after correction; the above is only a schematic description.
  • the user equipment may, according to the mixed language model, use the noise channel probability model to obtain, in the at least one screening sequence, the string sequence with the highest probability of being the ideal string as the correction suggestion text, or to obtain the first several string sequences with a high probability of being the ideal string as the correction suggestion text.
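A sketch of the screening-and-ranking step: each corrected candidate is embedded between the surrounding characters, and the resulting screening sequence is scored with the language model. A hypothetical lookup table stands in for the mixed-model and noise-channel score here:

```python
def screen_and_rank(candidates, prefix, suffix, seq_prob, top_n=3):
    """Combine each corrected candidate with the m preceding (prefix) and
    n following (suffix) characters, score each screening sequence with
    the language model, and return the top-n candidates."""
    scored = []
    for cand in candidates:
        screening_seq = prefix + cand + suffix
        scored.append((seq_prob(screening_seq), cand))
    scored.sort(reverse=True)
    return [cand for _, cand in scored[:top_n]]

# Hypothetical probabilities the mixed model might assign:
probs = {"the sound intermittent fades": 0.02,
         "the sound interminable fades": 0.001}
rank = screen_and_rank(["intermittent", "interminable"],
                       "the sound ", " fades",
                       lambda s: probs.get(s, 0.0))
print(rank[0])
```

Returning the full ranked list rather than only the top candidate corresponds to offering the first several high-probability sequences as correction suggestions.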
  • the correction suggestion text can be provided to the user through the human-machine interaction interface of the user equipment for the user to confirm the correction scheme; the corrected string position can be emphasized by underlining or the like, and corrections for different types of errors can also be marked with different colors, symbols, or shading.
  • In the text correction method provided by the embodiment of the present invention, the text to be corrected is classified and the corresponding mixed language model is then acquired, so that the mixed language model on which the correction is based can dynamically change according to the text type of the text to be corrected, and the language model can more accurately reflect the linguistic phenomena of the text.
  • the correction can provide different correction options, thereby reducing correction errors and improving correction flexibility and correctness.
  • the number of corrections is effectively reduced, and the efficiency of the correction is improved.
  • the present invention can also be supplemented by a named-entity recognition technique during correction, to identify named entities that may cause anomalies in word segmentation and part-of-speech tagging and exempt them from correction processing.
  • An embodiment of the present invention provides a user equipment 30, as shown in FIG. 3, including: an obtaining unit 301, configured to acquire two or more text types of the text to be corrected in a preset text classification standard.
  • the preset text classification standard may be: any one of a language environment, a theme background, an author, a writing style, and a theme.
  • the obtaining unit 301 is further configured to acquire, in the correction knowledge base, a sub-language model to be combined corresponding to each text type of the text to be corrected, and send the acquired information of the two or more sub-language models to be combined to the generating unit 302.
  • the generating unit 302 is configured to receive the acquired information of the two or more sub-language models to be combined sent by the obtaining unit 301, combine the acquired two or more sub-language models to be combined into a mixed language model, and send the information of the mixed language model to the correction unit 303.
  • the generating unit 302 is specifically configured to: obtain the proportion of each text type in the text to be corrected; and combine the acquired two or more sub-language models to be combined according to the proportion of each text type to obtain the mixed language model.
  • the correcting unit 303 is configured to receive information about the mixed language model sent by the generating unit 302, and correct the text to be corrected according to the mixed language model to obtain a corrected suggestion text.
  • the correcting unit 303 is specifically configured to: generate a string sequence to be corrected from the error location; perform a correction operation on the string sequence to be corrected to obtain at least one corrected string sequence; obtain the first m and last n characters of the error position in the text to be corrected and combine them with the corrected string sequences to obtain at least one screening sequence; and, according to the mixed language model, use the noise channel probability model to obtain, in the at least one screening sequence, the string sequence with the highest probability of being the ideal string as the correction suggestion text, or to obtain the first several string sequences with a high probability of being the ideal string as the correction suggestion text.
  • the obtaining unit classifies the text to be corrected, and the generating unit then acquires the corresponding mixed language model, so that the mixed language model on which the correcting unit performs the correction can dynamically change according to the text type of the text to be corrected; when the preset text classification standard or the text type of the text to be corrected differs, the correction can provide different correction options, thereby reducing correction errors and improving correction flexibility and correctness.
  • the user equipment 30 may further include: the obtaining unit 301, configured to acquire the preset text classification standard and send it to the establishing unit 304;
  • the establishing unit 304 is configured to receive the preset text classification standard sent by the obtaining unit 301, and establish two or more sub-language models according to the text types in the preset text classification standard.
  • the model obtaining unit 305 is configured to acquire an error detection model in the correction knowledge base, and send the information of the error detection model to the determining unit 306;
  • the error detection model may include: any one or more of a word connection model, a part-of-speech connection model, a sound near dictionary, and a shape near dictionary.
  • a determining unit 306 configured to receive information about the error detection model sent by the model obtaining unit 305, and determine, by using the error detection model, an error location of the to-be-processed text, where the error location includes an incorrect character or an incorrect character string.
  • the user equipment provided by the embodiment of the present invention classifies the text to be corrected and then obtains the corresponding mixed language model, so that the mixed language model on which the correction is based can dynamically change according to the text type of the text to be corrected, and the language model can more accurately reflect the linguistic phenomena of the text.
  • the correction text can provide different correction options, thereby reducing correction errors and improving correction flexibility and correctness.
  • the screening is effective, reducing the number of corrections and improving the efficiency of calibration.
  • the disclosed system, apparatus, and method may be implemented in other manners.
  • the device embodiments described above are merely illustrative.
  • the division into units is only a division by logical function; in actual implementation there may be another division manner: for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed.
  • the coupling or direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, devices, or units, and may be electrical, mechanical, or in other forms.
  • the units described as separate components may or may not be physically separate, and the components displayed as units may or may not be physical units; they may be located in one place or distributed across multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
  • each functional unit in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist physically on its own, or two or more units may be integrated into one unit.
  • the above integrated unit may be implemented in the form of hardware, or in the form of hardware plus software functional units.
  • the embodiment of the present invention provides a user equipment 50, as shown in FIG. 5, including: a processor 501, configured to acquire two or more text types of the text to be corrected in a preset text classification standard.
  • the preset text classification standard may be: any one of a language environment, a theme background, an author, a writing style, and a theme.
  • the processor 501 is further configured to: acquire, from the correction knowledge base, a to-be-combined sub-language model corresponding to each text type of the text to be corrected; combine the acquired sub-language models into a mixed language model; and correct the text to be corrected according to the mixed language model to obtain correction suggestion text.
  • the processor 501 is specifically configured to: obtain the weight (proportion) of each text type in the text to be corrected; and combine the acquired two or more to-be-combined sub-language models into the mixed language model according to the weight of each text type.
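One plausible reading of this weighted combination is linear interpolation of the sub-language models. The text types, model contents, and weights below are illustrative assumptions, not values from the patent:

```python
# Hypothetical sketch of combining sub-language models into a mixed model
# by linear interpolation, weighted by each text type's proportion in the
# text to be corrected.
def mix_language_models(sub_models, weights):
    """sub_models: text type -> {n-gram: probability}.
    weights: text type -> proportion of that type in the text.
    Returns a function scoring an n-gram as the weighted average probability."""
    total = sum(weights.values())
    norm = {t: w / total for t, w in weights.items()}  # normalise proportions

    def mixed_prob(ngram):
        return sum(norm[t] * sub_models[t].get(ngram, 0.0) for t in sub_models)

    return mixed_prob

sports = {("kick", "ball"): 0.4, ("score", "goal"): 0.3}
finance = {("stock", "price"): 0.5, ("score", "goal"): 0.01}
mixed = mix_language_models({"sports": sports, "finance": finance},
                            {"sports": 0.7, "finance": 0.3})
print(round(mixed(("score", "goal")), 3))  # 0.213
```

Because the weights track the composition of the specific text being corrected, the same sub-language models yield a different mixed model for, say, a mostly-sports text than for a mostly-finance one, which is what lets the correction adapt per text.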
  • the processor 501 is specifically configured to: generate a to-be-corrected string sequence from the error location; perform correction operations on the to-be-corrected string sequence to obtain at least one corrected string sequence; acquire the m characters before and the n characters after the error location in the text to be corrected, and combine them with each corrected string sequence to obtain at least one selected sequence; and, according to the mixed language model, use the noise channel probability model to obtain, among the at least one selected sequence, either the string sequence with the highest probability of being the ideal string as the correction suggestion text, or the first string sequence found with a high probability of being the ideal string as the correction suggestion text.
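The noise-channel selection step amounts to maximizing P(W)·P(O|W) over the selected sequences, with P(W) taken from the mixed language model. The toy probabilities and the dictionary-based channel model below are assumptions for illustration, not the patent's actual models:

```python
# Hypothetical sketch of the noise-channel selection step: among candidate
# replacement sequences, pick the one maximising P(W) * P(O|W), where P(W)
# comes from the (mixed) language model and P(O|W) from a channel model.
def best_correction(observed, candidates, lm_prob, channel_prob):
    return max(candidates,
               key=lambda w: lm_prob(w) * channel_prob(observed, w))

lm = {"the cat sat": 0.02, "the cap sat": 0.001}          # toy P(W)
channel = {("the cxt sat", "the cat sat"): 0.3,           # toy P(O|W)
           ("the cxt sat", "the cap sat"): 0.2}
suggestion = best_correction(
    "the cxt sat",
    ["the cat sat", "the cap sat"],
    lambda w: lm.get(w, 0.0),
    lambda o, w: channel.get((o, w), 0.0),
)
print(suggestion)  # the cat sat
```

The "first string sequence found with a high probability" variant described above would simply return the first candidate whose score clears a threshold instead of scanning all candidates, trading a little accuracy for speed.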
  • the processor classifies the text to be corrected and then obtains the corresponding mixed language model, so that the mixed language model on which the correction is based can change dynamically with the text type of the text to be corrected; when the preset text classification standard or the sub-language models differ, different correction options can be provided for the same text, so correction errors can be reduced and correction flexibility and accuracy improved.
  • the processor 501 is further configured to obtain the preset text classification standard.
  • the user equipment 50 further includes a memory 502, configured to establish two or more sub-language models according to the text types in the preset text classification standard, and send the information of the sub-language models to the processor 501.
  • the processor 501 is further configured to acquire an error detection model in the correction knowledge base.
  • the error detection model may include any one or more of: a word connection model, a part-of-speech connection model, a phonetically-similar-character dictionary, and a visually-similar-character dictionary.
  • the processor 501 is further configured to determine, by using the error detection model, an error location in the text to be corrected, where the error location contains an incorrect character or an incorrect character string.


Abstract

Embodiments of the present invention provide a text correction method and user equipment, relating to the field of language processing, which can reduce correction errors and improve correction flexibility and accuracy. The text correction method includes: obtaining two or more text types of text to be corrected under a preset text classification standard; obtaining, in a correction knowledge base, a to-be-combined sub-language model corresponding to each text type of the text to be corrected; combining the two or more obtained to-be-combined sub-language models into a mixed language model; and correcting the text to be corrected according to the mixed language model to obtain correction suggestion text. The text correction method and user equipment provided in the embodiments of the present invention are used to correct erroneous text.
PCT/CN2013/073382 2012-09-10 2013-03-28 Text correction method and user equipment WO2014036827A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201210332263.3 2012-09-10
CN201210332263.3A CN103678271B (zh) 2012-09-10 2012-09-10 一种文本校正方法及用户设备

Publications (1)

Publication Number Publication Date
WO2014036827A1 true WO2014036827A1 (fr) 2014-03-13

Family

ID=50236498

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2013/073382 WO2014036827A1 (fr) 2012-09-10 2013-03-28 Procédé et équipement utilisateur de correction de texte

Country Status (2)

Country Link
CN (1) CN103678271B (fr)
WO (1) WO2014036827A1 (fr)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113051894A (zh) * 2021-03-16 2021-06-29 京东数字科技控股股份有限公司 一种文本纠错的方法和装置
US11093712B2 (en) 2018-11-21 2021-08-17 International Business Machines Corporation User interfaces for word processors

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104409075B (zh) 2014-11-28 2018-09-04 深圳创维-Rgb电子有限公司 语音识别方法和系统
CN105550173A (zh) * 2016-02-06 2016-05-04 北京京东尚科信息技术有限公司 文本校正方法和装置
CN108628873B (zh) * 2017-03-17 2022-09-27 腾讯科技(北京)有限公司 一种文本分类方法、装置和设备
CN107729318B (zh) * 2017-10-17 2021-04-20 语联网(武汉)信息技术有限公司 一种自动更正部分文字的方法-由中文词性判断
CN111412925B (zh) * 2019-01-08 2023-07-18 阿里巴巴集团控股有限公司 一种poi位置的纠错方法及装置
CN112036273A (zh) * 2020-08-19 2020-12-04 泰康保险集团股份有限公司 一种图像识别方法及装置
CN115713934B (zh) * 2022-11-30 2023-08-15 中移互联网有限公司 一种语音转文本的纠错方法、装置、设备及介质

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101021838A (zh) * 2007-03-02 2007-08-22 华为技术有限公司 文本处理方法和系统
CN101031913A (zh) * 2004-09-30 2007-09-05 皇家飞利浦电子股份有限公司 自动文本校正
CN101655837A (zh) * 2009-09-08 2010-02-24 北京邮电大学 一种对语音识别后文本进行检错并纠错的方法
JP2011113099A (ja) * 2009-11-21 2011-06-09 Kddi R & D Laboratories Inc 未知語を含む文章を修正するための文章修正プログラム、方法及び文章解析サーバ
CN102165435A (zh) * 2007-08-01 2011-08-24 金格软件有限公司 使用因特网语料库的自动上下文相关语言产生、校正和增强



Also Published As

Publication number Publication date
CN103678271B (zh) 2016-09-14
CN103678271A (zh) 2014-03-26

Similar Documents

Publication Publication Date Title
US11810568B2 (en) Speech recognition with selective use of dynamic language models
WO2014036827A1 (fr) Procédé et équipement utilisateur de correction de texte
US11693894B2 (en) Conversation oriented machine-user interaction
US20210201143A1 (en) Computing device and method of classifying category of data
US10114809B2 (en) Method and apparatus for phonetically annotating text
WO2018219023A1 (fr) Procédé et dispositif d'identification de mot-clé vocal, terminal et serveur
CN105988990B (zh) 汉语零指代消解装置和方法、模型训练方法和存储介质
JP5901001B1 (ja) 音響言語モデルトレーニングのための方法およびデバイス
US9069753B2 (en) Determining proximity measurements indicating respective intended inputs
US9881010B1 (en) Suggestions based on document topics
US8176419B2 (en) Self learning contextual spell corrector
WO2018076450A1 (fr) Procédé et appareil d'entrée, et appareil pour entrée
US20090192781A1 (en) System and method of providing machine translation from a source language to a target language
US10902211B2 (en) Multi-models that understand natural language phrases
WO2017161899A1 (fr) Procédé, dispositif et appareil informatique de traitement de texte
JP5799733B2 (ja) 認識装置、認識プログラムおよび認識方法
US9251141B1 (en) Entity identification model training
US20150242386A1 (en) Using language models to correct morphological errors in text
CN117043859A (zh) 查找表循环语言模型
Fusayasu et al. Word-error correction of continuous speech recognition based on normalized relevance distance
WO2020052060A1 (fr) Procédé et appareil permettant de générer une instruction de correction
CN102955770A (zh) 一种拼音自动识别方法及系统
US20230186898A1 (en) Lattice Speech Corrections
Ray et al. Iterative delexicalization for improved spoken language understanding
US20240296837A1 (en) Mask-conformer augmenting conformer with mask-predict decoder unifying speech recognition and rescoring

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 13835272

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 13835272

Country of ref document: EP

Kind code of ref document: A1