CN113807081A - Method and device for correcting chat text content based on context - Google Patents

Method and device for correcting chat text content based on context

Info

Publication number
CN113807081A
Authority
CN
China
Prior art keywords
error correction
word
language model
chat text
words
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111101950.XA
Other languages
Chinese (zh)
Inventor
元成
陈振标
杜晓祥
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Yunshang Technology Co ltd
Original Assignee
Beijing Yunshang Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Yunshang Technology Co ltd filed Critical Beijing Yunshang Technology Co ltd
Priority to CN202111101950.XA priority Critical patent/CN113807081A/en
Publication of CN113807081A publication Critical patent/CN113807081A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 - Handling natural language data
    • G06F 40/20 - Natural language analysis
    • G06F 40/232 - Orthographic correction, e.g. spell checking or vowelisation
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 - Handling natural language data
    • G06F 40/20 - Natural language analysis
    • G06F 40/237 - Lexical tools
    • G06F 40/242 - Dictionaries
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 - Handling natural language data
    • G06F 40/20 - Natural language analysis
    • G06F 40/279 - Recognition of textual entities
    • G06F 40/289 - Phrasal analysis, e.g. finite state techniques or chunking
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 - Handling natural language data
    • G06F 40/40 - Processing or translation of natural language
    • G06F 40/42 - Data-driven translation
    • G06F 40/44 - Statistical methods, e.g. probability models
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 - Handling natural language data
    • G06F 40/40 - Processing or translation of natural language
    • G06F 40/42 - Data-driven translation
    • G06F 40/47 - Machine-assisted translation, e.g. using translation memory
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 - Handling natural language data
    • G06F 40/40 - Processing or translation of natural language
    • G06F 40/58 - Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation

Abstract

The invention discloses a method and a device for correcting chat text content based on context. An N-Gram language model is trained on a preset amount of monolingual training corpora of a given language, and a high-frequency lexicon for that language is generated. Adjacent repeated letters are removed from the words in the high-frequency lexicon to obtain their reductions; a mapping from each reduction result to the corresponding lexicon words is generated, and candidate words for a reduction result are obtained from that mapping. The N-Gram language model score of the original word in the chat text and the N-Gram language model score of each candidate word substituted back into the chat text are calculated respectively, and the word with the highest N-Gram language model score is taken as the error correction result. The invention uses context information to improve error correction accuracy and user experience, which helps attract more users and increases enterprise revenue.

Description

Method and device for correcting chat text content based on context
Technical Field
The invention relates to the technical field of text processing, and in particular to a method and a device for correcting chat text content based on context, applicable to English, Russian, French, and other alphabetic languages written with spaces.
Background
In recent years, with the popularization of the internet, communication by text chat has become more and more common in exchanges between speakers of different languages at home and abroad. Text chat conveys meaning in written characters, and translation software can provide semantic translation, reducing barriers to cross-language communication.
Although text chat is convenient, it also causes trouble for mutual communication, especially for non-native speakers and for machine translation. This is because most linguistic knowledge comes from standard text, whereas in text chat people often type irregular text out of mood or carelessness, for example: repeating one or more letters of a word when excited; misspelling a word when careless; omitting spaces between words when lazy; and so on. Such input is poorly understood by the receiver, especially in translation chat software, which degrades the user experience and ultimately reduces revenue. Traditional error correction techniques perform suggestion and correction based on word frequency information in a dictionary or on the positions of letters on the keyboard; they do not use context information, so their correction accuracy is low. A new technical solution for chat text content error correction is therefore needed.
Disclosure of Invention
Therefore, the invention provides a method and a device for correcting chat text content based on context, aiming to solve the problems that traditional content error correction techniques have low correction accuracy and impair semantic understanding.
In order to achieve the above purpose, the invention provides the following technical scheme: the method for correcting the chat text content based on the context comprises the following steps:
training an N-Gram language model: training an N-Gram language model of a given language through a preset amount of monolingual training corpora of the given language, and generating a high-frequency lexicon corresponding to the language;
duplication reduction: removing adjacent repeated letters from the words in the high-frequency lexicon to obtain their reductions, generating a mapping from each reduction result to the corresponding lexicon words, and obtaining candidate words of a reduction result according to the mapping; respectively calculating the N-Gram language model score of the original word in the chat text and the N-Gram language model score of each candidate word substituted back into the chat text, and taking the word with the highest N-Gram language model score as the error correction result.
As a preferred scheme of the context-based chat text content error correction method, after the duplication reduction processing is performed on the original words of the chat text, if the error correction result of the original words of the chat text is not changed, entering an editing distance error correction:
and performing at least one of processing of adding letters, deleting letters, exchanging adjacent letters and changing letters on the original words in the chat text, and taking the number of times of executing operations of adding letters, deleting letters, exchanging adjacent letters and changing letters as an editing distance.
And screening out candidate words from the high-frequency lexicon within a preset edit distance, respectively calculating the N-Gram language model score of the original word in the chat text and the N-Gram language model score of each candidate word substituted back into the chat text, and taking the word with the highest N-Gram language model score as the error correction result.
As a preferred scheme of the context-based chat text content error correction method, a dictionary tree is generated for the words in the high-frequency lexicon, and after the edit distance error correction processing is performed on the original words of the chat text, if the error correction result of the original words of the chat text is unchanged, continuous-writing word segmentation error correction is entered:
and searching paths for the original words in the chat text according to the dictionary tree, and simultaneously calculating and storing corresponding N-Gram language model scores to generate a candidate result of continuous writing word segmentation of the original words in the chat text.
And as a preferred scheme of the chat text content error correction method based on the context, when a path is searched, truncation is carried out according to the score of the N-Gram language model, the total scores of the N-Gram language models of candidate paths are compared, and a candidate result with the maximum score is selected.
The invention also provides a chat text content error correction device based on context, which comprises:
the language model training module is used for training an N-Gram language model of a given language through a preset amount of monolingual training linguistic data of the given language and generating a high-frequency word stock corresponding to the language;
the duplication reduction module is used for removing adjacent repeated letters from the words in the high-frequency word bank and reducing the words according to the high-frequency word bank to generate mapping from a reduction result to the words in the high-frequency word bank and obtain candidate words of the reduction result according to the mapping; and respectively calculating the N-Gram language model score of the original word in the chat text and the N-Gram language model score of the candidate word to be replaced back to the chat text, and taking the word with the maximum N-Gram language model score as an error correction result.
The device for correcting the chat text content based on the context further comprises an editing distance error correction module, wherein the editing distance error correction module is used for entering editing distance error correction if the error correction result of the original words of the chat text does not change after the original words of the chat text are subjected to the duplication reduction processing by the duplication reduction module;
and processing at least one of letter addition, letter deletion, adjacent letter exchange and letter change on the original word in the chat text by the editing distance error correction module, and taking the executed operation times of letter addition, letter deletion, adjacent letter exchange and letter change as the editing distance.
And as a preferred scheme of the context-based chat text content error correction device, candidate words from the high-frequency lexicon within a preset edit distance are screened out through the edit distance error correction module; the N-Gram language model score of the original word in the chat text and the N-Gram language model score of each candidate word substituted back into the chat text are calculated respectively, and the word with the highest N-Gram language model score is taken as the error correction result.
As a preferred scheme of the context-based chat text content error correction device, the device further comprises a hyphenation word segmentation error correction module, wherein the hyphenation word segmentation error correction module is used for entering hyphenation word segmentation error correction if an error correction result of the original word of the chat text does not change after the editing distance error correction module performs the editing distance error correction processing on the original word of the chat text;
and the hyphenated word error correction module searches paths for the original words in the chat text according to the dictionary tree, and simultaneously calculates and stores corresponding N-Gram language model scores to generate hyphenated word candidate results of the original words in the chat text.
The hyphenation word segmentation error correction module is used as a preferable scheme of the context-based chat text content error correction device, truncation is carried out according to the score of the N-Gram language model when a path is searched, the total scores of the N-Gram language models of candidate paths are compared, and a candidate result with the maximum score is selected.
The invention has the following advantages: training an N-Gram language model of a language by using a preset amount of monolingual training corpora of a given language, and generating a high-frequency lexicon corresponding to the language; removing adjacent repeated letters from the words in the high-frequency word library according to the high-frequency word library, reducing, generating mapping from a reduction result to the words in the high-frequency word library, and obtaining candidate words of the reduction result according to the mapping; and respectively calculating the N-Gram language model score of the original word in the chat text and the N-Gram language model score of the candidate word replaced back to the chat text, and taking the word with the maximum N-Gram language model score as an error correction result. The invention utilizes the context information to improve the error correction accuracy and the user experience, is beneficial to attracting more users and increases the enterprise income.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below. It should be apparent that the drawings in the following description are merely exemplary, and that other embodiments can be derived from the drawings provided by those of ordinary skill in the art without inventive effort.
The structures, ratios, and sizes shown in this specification are used only to match the content disclosed in the specification so that those skilled in the art can understand and read the invention; they do not limit the conditions under which the invention can be implemented and therefore carry no technical significance on their own. Any structural modification, change of ratio relationship, or adjustment of size that does not affect the functions and purposes of the invention shall still fall within the scope of the invention.
Fig. 1 is a schematic diagram illustrating a method for correcting chat text content based on context according to an embodiment of the present invention;
FIG. 2 is a flowchart of duplication reduction in a context-based chat text content correction method provided in an embodiment of the present invention;
FIG. 3 is a flowchart illustrating editing distance error correction in a context-based chat text content error correction method according to an embodiment of the present invention;
FIG. 4 is a flow chart illustrating error correction of hyphenated participles in the context-based method for error correction of chat text content according to an embodiment of the present invention;
fig. 5 is a schematic diagram of an apparatus for correcting chat text content based on context according to an embodiment of the present invention.
Detailed Description
The present invention is described below in terms of particular embodiments, and other advantages and features of the invention will become apparent to those skilled in the art from the following disclosure. It is to be understood that the described embodiments are merely exemplary of the invention and are not intended to limit the invention to the particular forms disclosed. All other embodiments obtained by a person skilled in the art from the embodiments given herein without creative effort shall fall within the protection scope of the present invention.
It is well known that, given a word sequence w_1, w_2, ..., w_T of length T, a language model computes the probability of that sequence, namely:

P(w_1, w_2, \ldots, w_T) = \prod_{t=1}^{T} P(w_t \mid w_1, \ldots, w_{t-1})
Because the language model computes the probability of the current word w_T conditioned on w_1, w_2, ..., w_{T-1}, the cost of computing and storing the co-occurrence probabilities of many words grows exponentially as the sequence length increases, so language model computation over word sequences of arbitrary length is difficult to realize.
The N-Gram language model is obtained by applying a Markov assumption: an N-Gram corresponds to a Markov chain of order N-1, meaning that the occurrence of a word depends only on the preceding N-1 words. The N-Gram language model is computed as:

P(w_1, w_2, \ldots, w_T) \approx \prod_{t=1}^{T} P(w_t \mid w_{t-N+1}, \ldots, w_{t-1})
For example, for the sequence w_1, w_2, w_3, w_4, the 3-gram language model probability is:

P(w_1, w_2, w_3, w_4) = P(w_1) P(w_2 \mid w_1) P(w_3 \mid w_1, w_2) P(w_4 \mid w_2, w_3).
The N-Gram language model is an important technique in the field of natural language processing and captures the information surrounding a word well. The following embodiments apply an N-Gram language model to chat text error correction.
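As a concrete sketch outside the patent text itself, an N-Gram language model of this kind can be built by counting n-grams over tokenized monolingual corpora and scoring a sequence with the chained conditional probabilities above. The function names and the add-alpha smoothing below are illustrative assumptions, not part of the disclosed method:

```python
import math
from collections import Counter

def train_ngram_lm(sentences, n=3):
    """Count n-grams and their (n-1)-gram contexts from tokenized sentences."""
    ngrams, contexts = Counter(), Counter()
    for tokens in sentences:
        padded = ["<s>"] * (n - 1) + tokens + ["</s>"]
        for i in range(len(padded) - n + 1):
            gram = tuple(padded[i:i + n])
            ngrams[gram] += 1
            contexts[gram[:-1]] += 1
    return ngrams, contexts

def sequence_score(tokens, ngrams, contexts, n=3, alpha=1.0, vocab=10000):
    """Add-alpha smoothed log-probability of a token sequence under the model."""
    padded = ["<s>"] * (n - 1) + tokens + ["</s>"]
    score = 0.0
    for i in range(len(padded) - n + 1):
        gram = tuple(padded[i:i + n])
        # P(w_t | w_{t-N+1}, ..., w_{t-1}) with add-alpha smoothing
        score += math.log((ngrams[gram] + alpha) /
                          (contexts[gram[:-1]] + alpha * vocab))
    return score
```

A sequence seen in training scores higher than a shuffled one; this ranking signal is what the error correction stages described below use to compare candidate words substituted back into the chat text.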
Example 1
Referring to fig. 1, embodiment 1 of the present invention provides a method for correcting chat text content based on context, including the following steps:
training an N-Gram language model: training an N-Gram language model of a given language through a preset amount of monolingual training corpora of the given language, and generating a high-frequency lexicon corresponding to the language;
duplication reduction: removing adjacent repeated letters from the words in the high-frequency word bank according to the high-frequency word bank, reducing, generating mapping from a reduction result to the words in the high-frequency word bank, and obtaining candidate words of the reduction result according to the mapping; and respectively calculating the N-Gram language model score of the original word in the chat text and the N-Gram language model score of the candidate word to be replaced back to the chat text, and taking the word with the maximum N-Gram language model score as an error correction result.
In this embodiment, the N-Gram language model of the language is trained through sufficient monolingual training corpora, and a high-frequency lexicon of the corresponding language is generated.
Referring to fig. 2, specifically, duplication reduction mainly addresses the case where one or more letters of a word are typed too many or too few times. A reduction map is first generated: adjacent repeated letters are removed from each word in the high-frequency lexicon to obtain its reduction, and a mapping from each reduction to the lexicon words is generated. When several lexicon words correspond to the same reduction, the corresponding words are put together separated by spaces; for example, the reductions of the English words too and to are both to, and the generated mapping is to -> to too.
And when an input chat-text word is not in the high-frequency lexicon and is not a proper noun such as a person or place name, duplication reduction is entered: adjacent repeated letters are removed to obtain the reduction, candidate words are obtained according to the mapping, the language model score of the original chat-text word and the language model score of each candidate substituted into the chat text are then calculated, and the word with the highest language model score is the error correction result.
In this embodiment, after the duplication reduction processing is performed on the original words of the chat text, if the error correction result of the original words of the chat text is not changed, the editing distance error correction is performed:
and performing at least one of processing of adding letters, deleting letters, exchanging adjacent letters and changing letters on the original words in the chat text, and taking the number of times of executing operations of adding letters, deleting letters, exchanging adjacent letters and changing letters as an editing distance.
In this embodiment, under a preset editing distance, and in the candidate words of the high frequency thesaurus, the N-Gram language model score of the original word in the chat text and the N-Gram language model score of the candidate word in the chat text are respectively calculated and replaced, and the word with the highest N-Gram language model score is used as an error correction result.
Referring to fig. 3, specifically, when the result of duplication-reduction error correction is unchanged, edit distance error correction is entered. The edit distance is the number of operations of adding a letter, deleting a letter, exchanging adjacent letters, or changing a letter; an edit distance of 1 means one of these operations was performed exactly once. First, candidate words that are at edit distance 1 and appear in the high-frequency lexicon are screened out; the N-Gram language model score of each candidate substituted back into the chat text is calculated and compared with the score of the original chat-text word, and the maximum gives the error correction result. Otherwise, the case of edit distance 2 is screened and the same operation performed; once the edit distance would reach 3, the original word is returned.
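The distance-1 candidate generation can be sketched as follows, in the spirit of well-known spelling correctors; the staged widening to distance 2 mirrors the paragraph above, and all function names are assumptions for illustration:

```python
import string

def edits1(word, alphabet=string.ascii_lowercase):
    """All strings at edit distance 1: one deletion, adjacent transposition,
    replacement, or insertion of a single letter."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = {L + R[1:] for L, R in splits if R}
    transposes = {L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1}
    replaces = {L + c + R[1:] for L, R in splits if R for c in alphabet}
    inserts = {L + c + R for L, R in splits for c in alphabet}
    return (deletes | transposes | replaces | inserts) - {word}

def staged_candidates(word, lexicon, max_dist=2):
    """Lexicon hits at edit distance 1; if none, widen to distance 2, etc.
    Returns the empty set when max_dist is exhausted, in which case the
    caller keeps the original word, as in the method above."""
    frontier = {word}
    for _ in range(max_dist):
        frontier = {e for w in frontier for e in edits1(w)}
        hits = frontier & lexicon
        if hits:
            return hits
    return set()
```

The surviving candidates would then be ranked by their N-Gram language model score after substitution back into the chat text.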
Referring to fig. 4, in this embodiment, a dictionary tree is generated from the words in the high-frequency lexicon, and after the edit distance error correction processing is performed on the original words of the chat text, if the error correction result of the original words of the chat text is unchanged, continuous-writing word segmentation error correction is entered:
and searching paths for the original words in the chat text according to the dictionary tree, and simultaneously calculating and storing corresponding N-Gram language model scores to generate a candidate result of continuous writing word segmentation of the original words in the chat text.
In the embodiment, when the path is searched, truncation is performed according to the score of the N-Gram language model, the total scores of the N-Gram language models of the candidate paths are compared, and the candidate result with the maximum score is selected.
See FIG. 4, where threshold is a threshold on the n-gram language model score. In continuous-writing word segmentation error correction, a dictionary tree is first generated from the high-frequency lexicon to make word lookup convenient. When the error correction result of edit distance error correction is unchanged, continuous-writing word segmentation is performed: the data to be corrected is searched for paths along the dictionary tree while the corresponding N-Gram language model scores are calculated and stored, generating candidate segmentations of the original word; truncating by N-Gram language model score during the path search makes the search more efficient. The overall N-Gram language model scores of the candidate paths are then compared and the candidate result with the highest score is selected. Because the candidate results do not cover the case where the original word needs no segmentation, the n-gram language model score is used: since segmented candidates differ in length, the average n-gram language model score over all words after segmentation is taken as the reference, and setting a reasonable threshold on this average effectively solves the problem. The larger the n-gram language model score, the more plausible the segmented combination.
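The trie construction and score-guided path search can be sketched as follows (illustrative only: `score` stands for any function returning an N-Gram language model score for a word list, the pruning uses a simple threshold standing in for the patent's threshold, and all names are assumptions):

```python
class TrieNode:
    def __init__(self):
        self.children = {}
        self.is_word = False

def build_trie(lexicon):
    """Dictionary tree over the high-frequency lexicon."""
    root = TrieNode()
    for word in lexicon:
        node = root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_word = True
    return root

def segment(text, root, score, threshold=float("-inf")):
    """Best segmentation of run-together text into lexicon words.
    `score(words)` is a language-model score for a word list; partial
    paths scoring below `threshold` are pruned (the truncation step)."""
    best = {0: ([], 0.0)}  # best[i] = (words covering text[:i], their score)
    for start in range(len(text)):
        if start not in best:
            continue                      # no path reaches this position
        words, _ = best[start]
        node = root
        for end in range(start, len(text)):
            node = node.children.get(text[end])
            if node is None:
                break                     # no lexicon word continues here
            if node.is_word:
                cand = words + [text[start:end + 1]]
                cs = score(cand)
                if cs < threshold:
                    continue              # pruned by the threshold
                if end + 1 not in best or cs > best[end + 1][1]:
                    best[end + 1] = (cand, cs)
    return best.get(len(text), ([], float("-inf")))[0]
```

In practice the returned segmentation would be accepted only when its average n-gram score clears the threshold, as the paragraph above describes.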
In summary, an N-Gram language model of a language is trained on a preset amount of monolingual training corpora of a given language, and a high-frequency lexicon of the corresponding language is generated; adjacent repeated letters are removed from the lexicon words to obtain their reductions, a mapping from each reduction result to the lexicon words is generated, and candidate words are obtained according to the mapping; the N-Gram language model score of the original word in the chat text and the score of each candidate word substituted back into the chat text are calculated respectively, and the word with the highest N-Gram language model score is taken as the error correction result. The invention can be applied to translation chat services, raises the lower bound of translation quality, and handles irregular text input by users well. It uses context information to improve error correction accuracy and user experience, which helps attract more users and increases enterprise revenue.
Example 2
Referring to fig. 5, embodiment 2 of the present invention provides a device for correcting chat text content based on context, including:
the language model training module 1 is used for training an N-Gram language model of a given language through a preset amount of monolingual training linguistic data of the given language and generating a high-frequency word bank corresponding to the language;
the duplication reduction module 2 is used for removing adjacent repeated letters from the words in the high-frequency word bank to reduce the words according to the high-frequency word bank, generating mapping from a reduction result to the words in the high-frequency word bank, and obtaining candidate words of the reduction result according to the mapping; and respectively calculating the N-Gram language model score of the original word in the chat text and the N-Gram language model score of the candidate word to be replaced back to the chat text, and taking the word with the maximum N-Gram language model score as an error correction result.
In this embodiment, the device further includes an edit distance error correction module 3; the edit distance error correction module 3 is configured to enter edit distance error correction if, after the duplication reduction module 2 performs the duplication reduction processing on the original words of the chat text, the error correction result of the original words of the chat text is unchanged;
and performing at least one of processing of adding letters, deleting letters, exchanging adjacent letters and changing letters on the original words in the chat text through the editing distance error correction module 3, and taking the executed operation times of adding letters, deleting letters, exchanging adjacent letters and changing letters as the editing distance.
In this embodiment, the edit distance error correction module 3 screens out candidate words from the high-frequency lexicon within a preset edit distance, respectively calculates the N-Gram language model score of the original word in the chat text and the N-Gram language model score of each candidate word substituted back into the chat text, and takes the word with the highest N-Gram language model score as the error correction result.
In this embodiment, the method further includes a hyphenation word segmentation error correction module 4, where the hyphenation word segmentation error correction module 4 is configured to enter hyphenation word segmentation error correction if an error correction result of a native word of the chat text does not change after the edit distance error correction module 3 performs the edit distance error correction on the native word of the chat text;
and the hyphenated word error correction module 4 searches a path for the original words in the chat text according to the dictionary tree, and simultaneously calculates and stores corresponding N-Gram language model scores to generate hyphenated word candidate results of the original words in the chat text.
In this embodiment, the hyphenation word segmentation error correction module 4 performs truncation according to the score of the N-Gram language model when searching for a path, compares the scores of the overall N-Gram language model of candidate paths, and selects a candidate result with the largest score.
It should be noted that, because the contents of information interaction, execution process, and the like between the modules/units of the apparatus are based on the same concept as the method embodiment in embodiment 1 of the present application, the technical effect brought by the contents is the same as the method embodiment of the present application, and specific contents may refer to the description in the foregoing method embodiment of the present application, and are not described herein again.
Example 3
Embodiment 3 of the present invention provides a computer-readable storage medium, where a program code of a context-based chat text content error correction method is stored in the computer-readable storage medium, where the program code includes an instruction for executing the context-based chat text content error correction method of embodiment 1 or any possible implementation manner of the embodiment.
The computer-readable storage medium can be any available medium that can be accessed by a computer or a data storage device, such as a server, a data center, etc., that incorporates one or more of the available media. The usable medium may be a magnetic medium (e.g., floppy Disk, hard Disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., Solid State Disk (SSD)), among others.
Example 4
Embodiment 4 of the present invention provides an electronic device, where the electronic device includes a processor, the processor is coupled to a storage medium, and when the processor executes an instruction in the storage medium, the electronic device is enabled to execute the method for correcting the chat text content based on the context according to embodiment 1 or any possible implementation manner of the method.
Specifically, the processor may be implemented by hardware or software, and when implemented by hardware, the processor may be a logic circuit, an integrated circuit, or the like; when implemented in software, the processor may be a general-purpose processor implemented by reading software code stored in a memory, which may be integrated in the processor, located external to the processor, or stand-alone.
In the above embodiments, the implementation may be realized wholly or partly by software, hardware, firmware, or any combination thereof. When implemented in software, it may be realized wholly or partly in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the processes or functions described in the embodiments of the invention are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable device. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another, for example from one website, computer, server, or data center to another website, computer, server, or data center by wired (e.g., coaxial cable, optical fiber, Digital Subscriber Line (DSL)) or wireless (e.g., infrared, radio, microwave) means.
It will be apparent to those skilled in the art that the modules or steps of the present invention described above may be implemented by a general-purpose computing device. They may be centralized on a single computing device or distributed across a network of multiple computing devices. Alternatively, they may be implemented by program code executable by a computing device, so that they may be stored in a storage device and executed by the computing device; in some cases, the steps shown or described may be performed in an order different from that described herein. They may also be fabricated separately as individual integrated circuit modules, or multiple modules or steps among them may be fabricated as a single integrated circuit module. Thus, the present invention is not limited to any specific combination of hardware and software.
Although the invention has been described in detail above with reference to a general description and specific examples, it will be apparent to those skilled in the art that modifications or improvements may be made on the basis of the invention. Accordingly, such modifications and improvements fall within the scope of the invention as claimed.

Claims (10)

1. A method for correcting chat text content based on context, characterized by comprising the following steps:
training an N-Gram language model: training an N-Gram language model of a given language through a preset amount of monolingual training corpora of the given language, and generating a high-frequency lexicon corresponding to the language;
duplication reduction: collapsing adjacent repeated letters in the words of the high-frequency lexicon to obtain reduced forms, generating a mapping from each reduced form to the corresponding words in the high-frequency lexicon, and obtaining candidate words for the reduced form of an original word in the chat text according to the mapping; respectively calculating the N-Gram language model score of the original word in the chat text and the N-Gram language model score obtained by substituting each candidate word back into the chat text, and taking the word with the highest N-Gram language model score as the error correction result.
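The duplication-reduction step of claim 1 can be sketched as follows. This is a minimal illustration only, not the patented implementation; the toy lexicon and all function names (`reduce_repeats`, `build_reduction_map`, `candidates_by_reduction`) are invented for the example, and the N-Gram rescoring of candidates is left out.

```python
from collections import defaultdict

def reduce_repeats(word):
    """Collapse runs of the same adjacent letter to a single letter,
    e.g. 'coooool' -> 'col'."""
    out = []
    for ch in word:
        if not out or out[-1] != ch:
            out.append(ch)
    return "".join(out)

def build_reduction_map(lexicon):
    """Map each reduced form to the high-frequency lexicon words
    that reduce to it."""
    mapping = defaultdict(set)
    for w in lexicon:
        mapping[reduce_repeats(w)].add(w)
    return mapping

def candidates_by_reduction(word, mapping):
    """Reduce a chat-text word and look up lexicon candidates."""
    return sorted(mapping.get(reduce_repeats(word), set()))

lexicon = {"cool", "col", "hello", "helo", "so"}
m = build_reduction_map(lexicon)
print(candidates_by_reduction("coooool", m))  # -> ['col', 'cool']
```

Each candidate would then be substituted back into the chat text and scored by the N-Gram language model, keeping the highest-scoring word.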
2. The method as claimed in claim 1, wherein after the duplication reduction is performed on an original word of the chat text, if the error correction result of the original word is unchanged, the method enters editing distance error correction:
performing at least one of adding a letter, deleting a letter, exchanging adjacent letters, and changing a letter on the original word in the chat text, and taking the number of adding, deleting, exchanging, and changing operations performed as the editing distance.
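The four edit operations of claim 2 can be sketched as a distance-1 candidate generator. This is an illustrative sketch under the assumption of a lowercase Latin alphabet; the function name `edit_candidates` is invented for the example.

```python
def edit_candidates(word, alphabet="abcdefghijklmnopqrstuvwxyz"):
    """All strings at editing distance 1 from `word`: one letter
    addition, deletion, adjacent exchange, or change."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    adds = {l + c + r for l, r in splits for c in alphabet}
    deletes = {l + r[1:] for l, r in splits if r}
    swaps = {l + r[1] + r[0] + r[2:] for l, r in splits if len(r) > 1}
    changes = {l + c + r[1:] for l, r in splits if r for c in alphabet}
    return (adds | deletes | swaps | changes) - {word}

cands = edit_candidates("teh")
print("the" in cands)  # True: one adjacent-letter exchange away
```

Larger editing distances can be obtained by applying the generator repeatedly, counting one per operation as the claim specifies.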
3. The method according to claim 2, wherein for candidate words in the high-frequency lexicon within a preset editing distance, the N-Gram language model score of the original word in the chat text and the N-Gram language model score obtained by substituting each candidate word back into the chat text are respectively calculated, and the word with the highest N-Gram language model score is taken as the error correction result.
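The substitute-and-score selection of claim 3 can be sketched with a toy bigram model standing in for the patent's N-Gram language model. The two-sentence corpus, the class name `BigramLM`, and the add-one smoothing are all assumptions made for the example.

```python
import math
from collections import Counter

class BigramLM:
    """Tiny add-one-smoothed bigram model standing in for the
    patent's N-Gram language model."""
    def __init__(self, corpus):
        self.unigrams = Counter()
        self.bigrams = Counter()
        for sent in corpus:
            toks = ["<s>"] + sent.split() + ["</s>"]
            self.unigrams.update(toks)
            self.bigrams.update(zip(toks, toks[1:]))
        self.vocab = len(self.unigrams)

    def score(self, sentence):
        """Log-probability of the sentence under the bigram model."""
        toks = ["<s>"] + sentence.split() + ["</s>"]
        return sum(
            math.log((self.bigrams[(a, b)] + 1) /
                     (self.unigrams[a] + self.vocab))
            for a, b in zip(toks, toks[1:]))

lm = BigramLM(["see you at the party", "the party starts now"])
sent = "see you at the paty"
# Substitute each candidate back into the chat text and keep the
# word whose sentence scores highest.
best = max(("party", "paty"), key=lambda w: lm.score(sent.replace("paty", w)))
print(best)  # 'party' scores higher in this toy corpus
```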
4. The method of claim 3, wherein the words in the high-frequency lexicon are organized into a dictionary tree, and after the editing distance error correction is performed on an original word of the chat text, if the error correction result of the original word is unchanged, the method enters run-together word segmentation error correction:
searching paths for the original word in the chat text according to the dictionary tree while calculating and storing the corresponding N-Gram language model scores, so as to generate candidate segmentation results for the run-together original word in the chat text.
5. The method of claim 4, wherein during the path search, paths are pruned according to their N-Gram language model scores, the overall N-Gram language model scores of the candidate paths are compared, and the candidate result with the highest score is selected.
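The segmentation path search of claims 4 and 5 can be sketched as follows. This sketch substitutes a plain word-set prefix lookup for the explicit dictionary tree and omits the score-based pruning (which would discard low-scoring partial paths during the walk and rank the surviving paths by overall N-Gram score); the toy lexicon and the names `split_runon` and `max_words` are invented for the example.

```python
def split_runon(text, lexicon, max_words=3):
    """Enumerate segmentations of a run-together string into lexicon
    words via depth-first path search; each segmentation is one
    candidate path through the (implicit) dictionary tree."""
    results = []

    def walk(rest, path):
        if not rest:
            results.append(path)  # a complete path: keep as candidate
            return
        if len(path) >= max_words:
            return  # bound the search depth
        for i in range(1, len(rest) + 1):
            prefix = rest[:i]
            if prefix in lexicon:  # trie lookup stand-in
                walk(rest[i:], path + [prefix])

    walk(text, [])
    return results

lexicon = {"good", "morning", "go", "od"}
print(split_runon("goodmorning", lexicon))
# candidates include ['good', 'morning'] and ['go', 'od', 'morning']
```

In the claimed method, each candidate path would additionally carry an accumulated N-Gram language model score, and the highest-scoring path would be returned as the correction.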
6. A context-based chat text content correction apparatus, comprising:
the language model training module is used for training an N-Gram language model of a given language on a preset amount of monolingual training corpora of that language and generating a high-frequency lexicon corresponding to the language;
the duplication reduction module is used for collapsing adjacent repeated letters in the words of the high-frequency lexicon to obtain reduced forms, generating a mapping from each reduced form to the corresponding words in the high-frequency lexicon, and obtaining candidate words for the reduced form of an original word in the chat text according to the mapping; and for respectively calculating the N-Gram language model score of the original word in the chat text and the N-Gram language model score obtained by substituting each candidate word back into the chat text, and taking the word with the highest N-Gram language model score as the error correction result.
7. The apparatus of claim 6, further comprising an editing distance error correction module, configured to enter editing distance error correction if the error correction result of the original chat text word is unchanged after the duplication reduction module performs duplication reduction on the original word;
the editing distance error correction module performs at least one of adding a letter, deleting a letter, exchanging adjacent letters, and changing a letter on the original word in the chat text, and takes the number of adding, deleting, exchanging, and changing operations performed as the editing distance.
8. The context-based chat text content error correction apparatus according to claim 7, wherein the editing distance error correction module screens candidate words from the high-frequency lexicon within a preset editing distance, respectively calculates the N-Gram language model score of the original word in the chat text and the N-Gram language model score obtained by substituting each candidate word back into the chat text, and takes the word with the highest N-Gram language model score as the error correction result.
9. The apparatus according to claim 8, further comprising a run-together word segmentation error correction module, configured to enter run-together word segmentation error correction if the error correction result of the original chat text word is unchanged after the editing distance error correction module performs editing distance error correction on the original word;
the run-together word segmentation error correction module searches paths for the original word in the chat text according to the dictionary tree while calculating and storing the corresponding N-Gram language model scores, so as to generate candidate segmentation results for the run-together original word in the chat text.
10. The apparatus of claim 9, wherein the run-together word segmentation error correction module prunes paths according to their N-Gram language model scores, compares the overall N-Gram language model scores of the candidate paths, and selects the candidate result with the highest score.
CN202111101950.XA 2021-09-18 2021-09-18 Method and device for correcting chat text content based on context Pending CN113807081A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111101950.XA CN113807081A (en) 2021-09-18 2021-09-18 Method and device for correcting chat text content based on context

Publications (1)

Publication Number Publication Date
CN113807081A true CN113807081A (en) 2021-12-17

Family

ID=78896024

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111101950.XA Pending CN113807081A (en) 2021-09-18 2021-09-18 Method and device for correcting chat text content based on context

Country Status (1)

Country Link
CN (1) CN113807081A (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105975625A (en) * 2016-05-26 2016-09-28 同方知网数字出版技术股份有限公司 Chinglish inquiring correcting method and system oriented to English search engine
CN111401012A (en) * 2020-03-09 2020-07-10 北京声智科技有限公司 Text error correction method, electronic device and computer readable storage medium
CN111613214A (en) * 2020-05-21 2020-09-01 重庆农村商业银行股份有限公司 Language model error correction method for improving voice recognition capability
CN111767731A (en) * 2020-07-09 2020-10-13 北京猿力未来科技有限公司 Training method and device of grammar error correction model and grammar error correction method and device
CN111933129A (en) * 2020-09-11 2020-11-13 腾讯科技(深圳)有限公司 Audio processing method, language model training method and device and computer equipment
CN112016304A (en) * 2020-09-03 2020-12-01 平安科技(深圳)有限公司 Text error correction method and device, electronic equipment and storage medium
CN112163585A (en) * 2020-11-10 2021-01-01 平安普惠企业管理有限公司 Text auditing method and device, computer equipment and storage medium


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
达吾勒・阿布都哈依尔; 努尔买买提・尤鲁瓦斯; 刘艳: "Research on Language Model Construction Methods for Kazakh LVCSR", Computer Engineering and Applications, no. 24 *

Similar Documents

Publication Publication Date Title
US11556713B2 (en) System and method for performing a meaning search using a natural language understanding (NLU) framework
US10402493B2 (en) System and method for inputting text into electronic devices
US20190087403A1 (en) Online spelling correction/phrase completion system
US9223779B2 (en) Text segmentation with multiple granularity levels
Duan et al. Online spelling correction for query completion
US9471566B1 (en) Method and apparatus for converting phonetic language input to written language output
US8612206B2 (en) Transliterating semitic languages including diacritics
US7493251B2 (en) Using source-channel models for word segmentation
KR101465770B1 (en) Word probability determination
US8095356B2 (en) Method and apparatus for processing natural language using tape-intersection
US20090150139A1 (en) Method and apparatus for translating a speech
CN112395385B (en) Text generation method and device based on artificial intelligence, computer equipment and medium
US20080208566A1 (en) Automated word-form transformation and part of speech tag assignment
JP6705318B2 (en) Bilingual dictionary creating apparatus, bilingual dictionary creating method, and bilingual dictionary creating program
CN112256860A (en) Semantic retrieval method, system, equipment and storage medium for customer service conversation content
US8660969B1 (en) Training dependency parsers by jointly optimizing multiple objectives
JP7413630B2 (en) Summary generation model training method, apparatus, device and storage medium
JP3992348B2 (en) Morphological analysis method and apparatus, and Japanese morphological analysis method and apparatus
CN113673228A (en) Text error correction method, text error correction device, computer storage medium and computer program product
CN110889295A (en) Machine translation model, and method, system and equipment for determining pseudo-professional parallel corpora
CN107168950B (en) Event phrase learning method and device based on bilingual semantic mapping
CN113807081A (en) Method and device for correcting chat text content based on context
JP2006004366A (en) Machine translation system and computer program for it
CN115688748A (en) Question error correction method and device, electronic equipment and storage medium
CN113240485A (en) Training method of text generation model, and text generation method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination