CN113807081A - Method and device for correcting chat text content based on context - Google Patents

Method and device for correcting chat text content based on context

Info

Publication number
CN113807081A
Authority
CN
China
Prior art keywords
error correction
word
language model
chat text
words
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111101950.XA
Other languages
Chinese (zh)
Inventor
元成
陈振标
杜晓祥
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Yunshang Technology Co ltd
Original Assignee
Beijing Yunshang Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Yunshang Technology Co ltd filed Critical Beijing Yunshang Technology Co ltd
Priority to CN202111101950.XA priority Critical patent/CN113807081A/en
Publication of CN113807081A publication Critical patent/CN113807081A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 - Handling natural language data
    • G06F 40/20 - Natural language analysis
    • G06F 40/232 - Orthographic correction, e.g. spell checking or vowelisation
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 - Handling natural language data
    • G06F 40/20 - Natural language analysis
    • G06F 40/237 - Lexical tools
    • G06F 40/242 - Dictionaries
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 - Handling natural language data
    • G06F 40/20 - Natural language analysis
    • G06F 40/279 - Recognition of textual entities
    • G06F 40/289 - Phrasal analysis, e.g. finite state techniques or chunking
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 - Handling natural language data
    • G06F 40/40 - Processing or translation of natural language
    • G06F 40/42 - Data-driven translation
    • G06F 40/44 - Statistical methods, e.g. probability models
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 - Handling natural language data
    • G06F 40/40 - Processing or translation of natural language
    • G06F 40/42 - Data-driven translation
    • G06F 40/47 - Machine-assisted translation, e.g. using translation memory
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 - Handling natural language data
    • G06F 40/40 - Processing or translation of natural language
    • G06F 40/58 - Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation

Abstract

The invention discloses a method and a device for correcting chat text content based on context. An N-Gram language model is trained on a preset amount of monolingual training corpora of a given language, and a high-frequency lexicon for that language is generated. Adjacent repeated letters are removed from the words in the high-frequency lexicon to obtain their reductions; a mapping from each reduction result to the corresponding lexicon words is generated, and candidate words for a reduction result are obtained from that mapping. The N-Gram language model score of the original word in the chat text and the N-Gram language model score of each candidate word substituted back into the chat text are calculated respectively, and the word with the highest N-Gram language model score is taken as the error correction result. The invention uses context information to improve error correction accuracy and user experience, which helps attract more users and increases enterprise revenue.

Description

Method and device for correcting chat text content based on context
Technical Field
The invention relates to the technical field of text processing, and in particular to a method and a device for correcting chat text content based on context, applicable to English, Russian, French, and other alphabetic languages written with spaces.
Background
In recent years, with the popularization of the internet, communication by text chat has become more and more common in exchanges between speakers of different languages at home and abroad. Text chat conveys meaning in written characters, and translation software can provide semantic translation, reducing barriers to cross-language communication.
Although text chat is convenient, it also causes trouble for mutual communication, especially for non-native speakers and for machine translation. This is because most linguistic knowledge comes from standard text, whereas in text chat people often type irregular text out of mood or carelessness, for example: repeating one or more letters of a word when excited; misspelling a word when careless; omitting spaces between words when lazy; and so on. Such input is poorly understood by the receiver, especially in translation chat software, which degrades the user experience and ultimately reduces revenue. Traditional error correction techniques perform suggestion and correction based on word frequency information in a dictionary or on the positions of letters on the keyboard; they do not use context information, so their correction accuracy is low. A new technical solution for chat text content error correction is therefore needed.
Disclosure of Invention
Therefore, the invention provides a method and a device for correcting chat text content based on context, aiming to solve the problems that traditional content error correction techniques have low correction accuracy and impair semantic understanding.
In order to achieve the above purpose, the invention provides the following technical scheme: the method for correcting the chat text content based on the context comprises the following steps:
training an N-Gram language model: training an N-Gram language model of a given language through a preset amount of monolingual training corpora of the given language, and generating a high-frequency lexicon corresponding to the language;
duplication reduction: removing adjacent repeated letters from the words in the high-frequency lexicon to obtain their reductions, generating a mapping from each reduction result to the corresponding lexicon words, and obtaining candidate words of a reduction result according to the mapping; respectively calculating the N-Gram language model score of the original word in the chat text and the N-Gram language model score of each candidate word substituted back into the chat text, and taking the word with the highest N-Gram language model score as the error correction result.
As a preferred scheme of the context-based chat text content error correction method, after the duplication reduction processing is performed on the original words of the chat text, if the error correction result of the original words of the chat text is not changed, entering an editing distance error correction:
and performing at least one of processing of adding letters, deleting letters, exchanging adjacent letters and changing letters on the original words in the chat text, and taking the number of times of executing operations of adding letters, deleting letters, exchanging adjacent letters and changing letters as an editing distance.
And screening out candidate words from the high-frequency lexicon within a preset edit distance, respectively calculating the N-Gram language model score of the original word in the chat text and the N-Gram language model score of each candidate word substituted back into the chat text, and taking the word with the highest N-Gram language model score as the error correction result.
As a preferred scheme of the context-based chat text content error correction method, a dictionary tree is generated for the words in the high-frequency lexicon, and after the edit distance error correction processing is performed on the original words of the chat text, if the error correction result of the original words of the chat text is unchanged, continuous-writing word segmentation error correction is entered:
and searching paths for the original words in the chat text according to the dictionary tree, and simultaneously calculating and storing corresponding N-Gram language model scores to generate a candidate result of continuous writing word segmentation of the original words in the chat text.
And as a preferred scheme of the chat text content error correction method based on the context, when a path is searched, truncation is carried out according to the score of the N-Gram language model, the total scores of the N-Gram language models of candidate paths are compared, and a candidate result with the maximum score is selected.
The invention also provides a chat text content error correction device based on context, which comprises:
the language model training module is used for training an N-Gram language model of a given language through a preset amount of monolingual training linguistic data of the given language and generating a high-frequency word stock corresponding to the language;
the duplication reduction module is used for removing adjacent repeated letters from the words in the high-frequency word bank and reducing the words according to the high-frequency word bank to generate mapping from a reduction result to the words in the high-frequency word bank and obtain candidate words of the reduction result according to the mapping; and respectively calculating the N-Gram language model score of the original word in the chat text and the N-Gram language model score of the candidate word to be replaced back to the chat text, and taking the word with the maximum N-Gram language model score as an error correction result.
The device for correcting the chat text content based on the context further comprises an editing distance error correction module, wherein the editing distance error correction module is used for entering editing distance error correction if the error correction result of the original words of the chat text does not change after the original words of the chat text are subjected to the duplication reduction processing by the duplication reduction module;
and processing at least one of letter addition, letter deletion, adjacent letter exchange and letter change on the original word in the chat text by the editing distance error correction module, and taking the executed operation times of letter addition, letter deletion, adjacent letter exchange and letter change as the editing distance.
And as a preferred scheme of the context-based chat text content error correction device, candidate words from the high-frequency lexicon within a preset edit distance are screened out through the edit distance error correction module; the N-Gram language model score of the original word in the chat text and the N-Gram language model score of each candidate word substituted back into the chat text are calculated respectively, and the word with the highest N-Gram language model score is taken as the error correction result.
As a preferred scheme of the context-based chat text content error correction device, the device further comprises a hyphenation word segmentation error correction module, wherein the hyphenation word segmentation error correction module is used for entering hyphenation word segmentation error correction if an error correction result of the original word of the chat text does not change after the editing distance error correction module performs the editing distance error correction processing on the original word of the chat text;
and the hyphenated word error correction module searches paths for the original words in the chat text according to the dictionary tree, and simultaneously calculates and stores corresponding N-Gram language model scores to generate hyphenated word candidate results of the original words in the chat text.
The hyphenation word segmentation error correction module is used as a preferable scheme of the context-based chat text content error correction device, truncation is carried out according to the score of the N-Gram language model when a path is searched, the total scores of the N-Gram language models of candidate paths are compared, and a candidate result with the maximum score is selected.
The invention has the following advantages: training an N-Gram language model of a language by using a preset amount of monolingual training corpora of a given language, and generating a high-frequency lexicon corresponding to the language; removing adjacent repeated letters from the words in the high-frequency word library according to the high-frequency word library, reducing, generating mapping from a reduction result to the words in the high-frequency word library, and obtaining candidate words of the reduction result according to the mapping; and respectively calculating the N-Gram language model score of the original word in the chat text and the N-Gram language model score of the candidate word replaced back to the chat text, and taking the word with the maximum N-Gram language model score as an error correction result. The invention utilizes the context information to improve the error correction accuracy and the user experience, is beneficial to attracting more users and increases the enterprise income.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below. It should be apparent that the drawings in the following description are merely exemplary, and that other embodiments can be derived from the drawings provided by those of ordinary skill in the art without inventive effort.
The structures, ratios, and sizes shown in this specification are used only to match the content disclosed in the specification so that those skilled in the art can understand and read the invention; they do not limit the conditions under which the invention can be implemented and therefore carry no technical significance on their own. Any structural modification, change of ratio relationship, or adjustment of size that does not affect the functions and purposes of the invention shall still fall within the scope of the invention.
Fig. 1 is a schematic diagram illustrating a method for correcting chat text content based on context according to an embodiment of the present invention;
FIG. 2 is a flowchart of duplication reduction in a context-based chat text content correction method provided in an embodiment of the present invention;
FIG. 3 is a flowchart illustrating editing distance error correction in a context-based chat text content error correction method according to an embodiment of the present invention;
FIG. 4 is a flow chart illustrating error correction of hyphenated participles in the context-based method for error correction of chat text content according to an embodiment of the present invention;
fig. 5 is a schematic diagram of an apparatus for correcting chat text content based on context according to an embodiment of the present invention.
Detailed Description
The present invention is described below in terms of particular embodiments, and other advantages and features of the invention will become apparent to those skilled in the art from the following disclosure. It is to be understood that the described embodiments are merely exemplary of the invention and are not intended to limit the invention to the particular forms disclosed. All other embodiments obtained by a person skilled in the art from the embodiments given herein without creative effort shall fall within the protection scope of the present invention.
It is well known that, given a word sequence w_1, w_2, ..., w_T of length T, a language model computes the probability of that sequence, namely:

P(w_1, w_2, \ldots, w_T) = \prod_{t=1}^{T} P(w_t \mid w_1, \ldots, w_{t-1})
Because the language model computes the probability of the current word w_T conditioned on w_1, w_2, ..., w_{T-1}, the cost of computing and storing the co-occurrence probabilities of many words grows exponentially as the sequence length increases, so language model computation over word sequences of arbitrary length is difficult to realize.
The N-Gram language model is obtained by applying a Markov assumption: an N-Gram corresponds to a Markov chain of order N-1, meaning that the occurrence of a word depends only on the preceding N-1 words. The N-Gram language model is computed as:

P(w_1, w_2, \ldots, w_T) \approx \prod_{t=1}^{T} P(w_t \mid w_{t-N+1}, \ldots, w_{t-1})
For example, for the sequence w_1, w_2, w_3, w_4, the 3-gram language model probability is:

P(w_1, w_2, w_3, w_4) = P(w_1) P(w_2 \mid w_1) P(w_3 \mid w_1, w_2) P(w_4 \mid w_2, w_3).
The N-Gram language model is an important technique in the field of natural language processing and captures the information surrounding a word well. The following embodiments apply an N-Gram language model to chat text error correction.
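As a concrete sketch outside the patent text itself, an N-Gram language model of this kind can be built by counting n-grams over tokenized monolingual corpora and scoring a sequence with the chained conditional probabilities above. The function names and the add-alpha smoothing below are illustrative assumptions, not part of the disclosed method:

```python
import math
from collections import Counter

def train_ngram_lm(sentences, n=3):
    """Count n-grams and their (n-1)-gram contexts from tokenized sentences."""
    ngrams, contexts = Counter(), Counter()
    for tokens in sentences:
        padded = ["<s>"] * (n - 1) + tokens + ["</s>"]
        for i in range(len(padded) - n + 1):
            gram = tuple(padded[i:i + n])
            ngrams[gram] += 1
            contexts[gram[:-1]] += 1
    return ngrams, contexts

def sequence_score(tokens, ngrams, contexts, n=3, alpha=1.0, vocab=10000):
    """Add-alpha smoothed log-probability of a token sequence under the model."""
    padded = ["<s>"] * (n - 1) + tokens + ["</s>"]
    score = 0.0
    for i in range(len(padded) - n + 1):
        gram = tuple(padded[i:i + n])
        # P(w_t | w_{t-N+1}, ..., w_{t-1}) with add-alpha smoothing
        score += math.log((ngrams[gram] + alpha) /
                          (contexts[gram[:-1]] + alpha * vocab))
    return score
```

A sequence seen in training scores higher than a shuffled one; this ranking signal is what the error correction stages described below use to compare candidate words substituted back into the chat text.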
Example 1
Referring to fig. 1, embodiment 1 of the present invention provides a method for correcting chat text content based on context, including the following steps:
training an N-Gram language model: training an N-Gram language model of a given language through a preset amount of monolingual training corpora of the given language, and generating a high-frequency lexicon corresponding to the language;
duplication reduction: removing adjacent repeated letters from the words in the high-frequency word bank according to the high-frequency word bank, reducing, generating mapping from a reduction result to the words in the high-frequency word bank, and obtaining candidate words of the reduction result according to the mapping; and respectively calculating the N-Gram language model score of the original word in the chat text and the N-Gram language model score of the candidate word to be replaced back to the chat text, and taking the word with the maximum N-Gram language model score as an error correction result.
In this embodiment, the N-Gram language model of the language is trained through sufficient monolingual training corpora, and a high-frequency lexicon of the corresponding language is generated.
Referring to fig. 2, specifically, duplication reduction mainly addresses the case where one or more letters of a word are typed too many or too few times. A reduction map is first generated: adjacent repeated letters are removed from each word in the high-frequency lexicon to obtain its reduction, and a mapping from each reduction to the lexicon words is generated. When several lexicon words correspond to the same reduction, the corresponding words are put together separated by spaces; for example, the reductions of the English words too and to are both to, and the generated mapping is to -> to too.
And when an input chat-text word is not in the high-frequency lexicon and is not a proper noun such as a person or place name, duplication reduction is entered: adjacent repeated letters are removed to obtain the reduction, candidate words are obtained according to the mapping, the language model score of the original chat-text word and the language model score of each candidate substituted into the chat text are then calculated, and the word with the highest language model score is the error correction result.
In this embodiment, after the duplication reduction processing is performed on the original words of the chat text, if the error correction result of the original words of the chat text is not changed, the editing distance error correction is performed:
and performing at least one of processing of adding letters, deleting letters, exchanging adjacent letters and changing letters on the original words in the chat text, and taking the number of times of executing operations of adding letters, deleting letters, exchanging adjacent letters and changing letters as an editing distance.
In this embodiment, under a preset editing distance, and in the candidate words of the high frequency thesaurus, the N-Gram language model score of the original word in the chat text and the N-Gram language model score of the candidate word in the chat text are respectively calculated and replaced, and the word with the highest N-Gram language model score is used as an error correction result.
Referring to fig. 3, specifically, when the result of duplication-reduction error correction is unchanged, edit distance error correction is entered. The edit distance is the number of operations of adding a letter, deleting a letter, exchanging adjacent letters, or changing a letter; an edit distance of 1 means one of these operations was performed exactly once. First, candidate words that are at edit distance 1 and appear in the high-frequency lexicon are screened out; the N-Gram language model score of each candidate substituted back into the chat text is calculated and compared with the score of the original chat-text word, and the maximum gives the error correction result. Otherwise, the case of edit distance 2 is screened and the same operation performed; once the edit distance would reach 3, the original word is returned.
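The distance-1 candidate generation can be sketched as follows, in the spirit of well-known spelling correctors; the staged widening to distance 2 mirrors the paragraph above, and all function names are assumptions for illustration:

```python
import string

def edits1(word, alphabet=string.ascii_lowercase):
    """All strings at edit distance 1: one deletion, adjacent transposition,
    replacement, or insertion of a single letter."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = {L + R[1:] for L, R in splits if R}
    transposes = {L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1}
    replaces = {L + c + R[1:] for L, R in splits if R for c in alphabet}
    inserts = {L + c + R for L, R in splits for c in alphabet}
    return (deletes | transposes | replaces | inserts) - {word}

def staged_candidates(word, lexicon, max_dist=2):
    """Lexicon hits at edit distance 1; if none, widen to distance 2, etc.
    Returns the empty set when max_dist is exhausted, in which case the
    caller keeps the original word, as in the method above."""
    frontier = {word}
    for _ in range(max_dist):
        frontier = {e for w in frontier for e in edits1(w)}
        hits = frontier & lexicon
        if hits:
            return hits
    return set()
```

The surviving candidates would then be ranked by their N-Gram language model score after substitution back into the chat text.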
Referring to fig. 4, in this embodiment, a dictionary tree is generated from the words in the high-frequency lexicon, and after the edit distance error correction processing is performed on the original words of the chat text, if the error correction result of the original words of the chat text is unchanged, continuous-writing word segmentation error correction is entered:
and searching paths for the original words in the chat text according to the dictionary tree, and simultaneously calculating and storing corresponding N-Gram language model scores to generate a candidate result of continuous writing word segmentation of the original words in the chat text.
In the embodiment, when the path is searched, truncation is performed according to the score of the N-Gram language model, the total scores of the N-Gram language models of the candidate paths are compared, and the candidate result with the maximum score is selected.
See FIG. 4, where threshold is a threshold on the n-gram language model score. In continuous-writing word segmentation error correction, a dictionary tree is first generated from the high-frequency lexicon to make word lookup convenient. When the error correction result of edit distance error correction is unchanged, continuous-writing word segmentation is performed: the data to be corrected is searched for paths along the dictionary tree while the corresponding N-Gram language model scores are calculated and stored, generating candidate segmentations of the original word; truncating by N-Gram language model score during the path search makes the search more efficient. The overall N-Gram language model scores of the candidate paths are then compared and the candidate result with the highest score is selected. Because the candidate results do not cover the case where the original word needs no segmentation, the n-gram language model score is used: since segmented candidates differ in length, the average n-gram language model score over all words after segmentation is taken as the reference, and setting a reasonable threshold on this average effectively solves the problem. The larger the n-gram language model score, the more plausible the segmented combination.
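The trie construction and score-guided path search can be sketched as follows (illustrative only: `score` stands for any function returning an N-Gram language model score for a word list, the pruning uses a simple threshold standing in for the patent's threshold, and all names are assumptions):

```python
class TrieNode:
    def __init__(self):
        self.children = {}
        self.is_word = False

def build_trie(lexicon):
    """Dictionary tree over the high-frequency lexicon."""
    root = TrieNode()
    for word in lexicon:
        node = root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_word = True
    return root

def segment(text, root, score, threshold=float("-inf")):
    """Best segmentation of run-together text into lexicon words.
    `score(words)` is a language-model score for a word list; partial
    paths scoring below `threshold` are pruned (the truncation step)."""
    best = {0: ([], 0.0)}  # best[i] = (words covering text[:i], their score)
    for start in range(len(text)):
        if start not in best:
            continue                      # no path reaches this position
        words, _ = best[start]
        node = root
        for end in range(start, len(text)):
            node = node.children.get(text[end])
            if node is None:
                break                     # no lexicon word continues here
            if node.is_word:
                cand = words + [text[start:end + 1]]
                cs = score(cand)
                if cs < threshold:
                    continue              # pruned by the threshold
                if end + 1 not in best or cs > best[end + 1][1]:
                    best[end + 1] = (cand, cs)
    return best.get(len(text), ([], float("-inf")))[0]
```

In practice the returned segmentation would be accepted only when its average n-gram score clears the threshold, as the paragraph above describes.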
In summary, an N-Gram language model of a language is trained on a preset amount of monolingual training corpora of a given language, and a high-frequency lexicon of the corresponding language is generated; adjacent repeated letters are removed from the lexicon words to obtain their reductions, a mapping from each reduction result to the lexicon words is generated, and candidate words are obtained according to the mapping; the N-Gram language model score of the original word in the chat text and the score of each candidate word substituted back into the chat text are calculated respectively, and the word with the highest N-Gram language model score is taken as the error correction result. The invention can be applied to translation chat services, raises the lower bound of translation quality, and handles irregular text input by users well. It uses context information to improve error correction accuracy and user experience, which helps attract more users and increases enterprise revenue.
Example 2
Referring to fig. 5, embodiment 2 of the present invention provides a device for correcting chat text content based on context, including:
the language model training module 1 is used for training an N-Gram language model of a given language through a preset amount of monolingual training linguistic data of the given language and generating a high-frequency word bank corresponding to the language;
the duplication reduction module 2 is used for removing adjacent repeated letters from the words in the high-frequency word bank to reduce the words according to the high-frequency word bank, generating mapping from a reduction result to the words in the high-frequency word bank, and obtaining candidate words of the reduction result according to the mapping; and respectively calculating the N-Gram language model score of the original word in the chat text and the N-Gram language model score of the candidate word to be replaced back to the chat text, and taking the word with the maximum N-Gram language model score as an error correction result.
In this embodiment, the device further includes an edit distance error correction module 3; the edit distance error correction module 3 is configured to enter edit distance error correction if, after the duplication reduction module 2 performs the duplication reduction processing on the original words of the chat text, the error correction result of the original words of the chat text is unchanged;
and performing at least one of processing of adding letters, deleting letters, exchanging adjacent letters and changing letters on the original words in the chat text through the editing distance error correction module 3, and taking the executed operation times of adding letters, deleting letters, exchanging adjacent letters and changing letters as the editing distance.
In this embodiment, the edit distance error correction module 3 screens out candidate words from the high-frequency lexicon within a preset edit distance, respectively calculates the N-Gram language model score of the original word in the chat text and the N-Gram language model score of each candidate word substituted back into the chat text, and takes the word with the highest N-Gram language model score as the error correction result.
In this embodiment, the method further includes a hyphenation word segmentation error correction module 4, where the hyphenation word segmentation error correction module 4 is configured to enter hyphenation word segmentation error correction if an error correction result of a native word of the chat text does not change after the edit distance error correction module 3 performs the edit distance error correction on the native word of the chat text;
and the hyphenated word error correction module 4 searches a path for the original words in the chat text according to the dictionary tree, and simultaneously calculates and stores corresponding N-Gram language model scores to generate hyphenated word candidate results of the original words in the chat text.
In this embodiment, the hyphenation word segmentation error correction module 4 performs truncation according to the score of the N-Gram language model when searching for a path, compares the scores of the overall N-Gram language model of candidate paths, and selects a candidate result with the largest score.
It should be noted that, because the contents of information interaction, execution process, and the like between the modules/units of the apparatus are based on the same concept as the method embodiment in embodiment 1 of the present application, the technical effect brought by the contents is the same as the method embodiment of the present application, and specific contents may refer to the description in the foregoing method embodiment of the present application, and are not described herein again.
Example 3
Embodiment 3 of the present invention provides a computer-readable storage medium, where a program code of a context-based chat text content error correction method is stored in the computer-readable storage medium, where the program code includes an instruction for executing the context-based chat text content error correction method of embodiment 1 or any possible implementation manner of the embodiment.
The computer-readable storage medium can be any available medium that can be accessed by a computer or a data storage device, such as a server, a data center, etc., that incorporates one or more of the available media. The usable medium may be a magnetic medium (e.g., floppy Disk, hard Disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., Solid State Disk (SSD)), among others.
Example 4
Embodiment 4 of the present invention provides an electronic device, where the electronic device includes a processor, the processor is coupled to a storage medium, and when the processor executes an instruction in the storage medium, the electronic device is enabled to execute the method for correcting the chat text content based on the context according to embodiment 1 or any possible implementation manner of the method.
Specifically, the processor may be implemented by hardware or software, and when implemented by hardware, the processor may be a logic circuit, an integrated circuit, or the like; when implemented in software, the processor may be a general-purpose processor implemented by reading software code stored in a memory, which may be integrated in the processor, located external to the processor, or stand-alone.
In the above embodiments, the implementation may be realized wholly or partly by software, hardware, firmware, or any combination thereof. When implemented in software, it may be realized wholly or partly in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the processes or functions described in the embodiments of the invention are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable device. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another, for example from one website, computer, server, or data center to another website, computer, server, or data center by wired (e.g., coaxial cable, optical fiber, Digital Subscriber Line (DSL)) or wireless (e.g., infrared, radio, microwave) means.
It will be apparent to those skilled in the art that the modules or steps of the present invention described above may be implemented by a general-purpose computing device. They may be centralized on a single computing device or distributed across a network of multiple computing devices. Alternatively, they may be implemented by program code executable by a computing device, so that they may be stored in a storage device and executed by the computing device; in some cases, the steps shown or described may be performed in an order different from that described herein. They may also be fabricated separately as individual integrated circuit modules, or multiple modules or steps among them may be fabricated as a single integrated circuit module. Thus, the present invention is not limited to any specific combination of hardware and software.
Although the invention has been described in detail above with reference to a general description and specific examples, it will be apparent to those skilled in the art that modifications or improvements may be made on the basis of the invention. Accordingly, such modifications and improvements fall within the scope of the invention as claimed.

Claims (10)

1. A method for correcting chat text content based on context, characterized by comprising the following steps:
training an N-Gram language model: training an N-Gram language model of a given language through a preset amount of monolingual training corpora of the given language, and generating a high-frequency lexicon corresponding to the language;
duplication reduction: collapsing adjacent repeated letters in the words of the high-frequency lexicon to obtain reduced forms, generating a mapping from each reduced form to the corresponding words in the high-frequency lexicon, and obtaining candidate words for the reduced form of an original word in the chat text according to the mapping; respectively calculating the N-Gram language model score of the original word in the chat text and the N-Gram language model score obtained by substituting each candidate word back into the chat text, and taking the word with the highest N-Gram language model score as the error correction result.
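The duplication-reduction step of claim 1 can be sketched as follows. This is a minimal illustration only, not the patented implementation; the toy lexicon and all function names (`reduce_repeats`, `build_reduction_map`, `candidates_by_reduction`) are invented for the example, and the N-Gram rescoring of candidates is left out.

```python
from collections import defaultdict

def reduce_repeats(word):
    """Collapse runs of the same adjacent letter to a single letter,
    e.g. 'coooool' -> 'col'."""
    out = []
    for ch in word:
        if not out or out[-1] != ch:
            out.append(ch)
    return "".join(out)

def build_reduction_map(lexicon):
    """Map each reduced form to the high-frequency lexicon words
    that reduce to it."""
    mapping = defaultdict(set)
    for w in lexicon:
        mapping[reduce_repeats(w)].add(w)
    return mapping

def candidates_by_reduction(word, mapping):
    """Reduce a chat-text word and look up lexicon candidates."""
    return sorted(mapping.get(reduce_repeats(word), set()))

lexicon = {"cool", "col", "hello", "helo", "so"}
m = build_reduction_map(lexicon)
print(candidates_by_reduction("coooool", m))  # -> ['col', 'cool']
```

Each candidate would then be substituted back into the chat text and scored by the N-Gram language model, keeping the highest-scoring word.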
2. The method as claimed in claim 1, wherein after the duplication reduction is performed on an original word of the chat text, if the error correction result of the original word is unchanged, the method enters editing distance error correction:
performing at least one of adding a letter, deleting a letter, exchanging adjacent letters, and changing a letter on the original word in the chat text, and taking the number of adding, deleting, exchanging, and changing operations performed as the editing distance.
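The four edit operations of claim 2 can be sketched as a distance-1 candidate generator. This is an illustrative sketch under the assumption of a lowercase Latin alphabet; the function name `edit_candidates` is invented for the example.

```python
def edit_candidates(word, alphabet="abcdefghijklmnopqrstuvwxyz"):
    """All strings at editing distance 1 from `word`: one letter
    addition, deletion, adjacent exchange, or change."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    adds = {l + c + r for l, r in splits for c in alphabet}
    deletes = {l + r[1:] for l, r in splits if r}
    swaps = {l + r[1] + r[0] + r[2:] for l, r in splits if len(r) > 1}
    changes = {l + c + r[1:] for l, r in splits if r for c in alphabet}
    return (adds | deletes | swaps | changes) - {word}

cands = edit_candidates("teh")
print("the" in cands)  # True: one adjacent-letter exchange away
```

Larger editing distances can be obtained by applying the generator repeatedly, counting one per operation as the claim specifies.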
3. The method according to claim 2, wherein for candidate words in the high-frequency lexicon within a preset editing distance, the N-Gram language model score of the original word in the chat text and the N-Gram language model score obtained by substituting each candidate word back into the chat text are respectively calculated, and the word with the highest N-Gram language model score is taken as the error correction result.
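The substitute-and-score selection of claim 3 can be sketched with a toy bigram model standing in for the patent's N-Gram language model. The two-sentence corpus, the class name `BigramLM`, and the add-one smoothing are all assumptions made for the example.

```python
import math
from collections import Counter

class BigramLM:
    """Tiny add-one-smoothed bigram model standing in for the
    patent's N-Gram language model."""
    def __init__(self, corpus):
        self.unigrams = Counter()
        self.bigrams = Counter()
        for sent in corpus:
            toks = ["<s>"] + sent.split() + ["</s>"]
            self.unigrams.update(toks)
            self.bigrams.update(zip(toks, toks[1:]))
        self.vocab = len(self.unigrams)

    def score(self, sentence):
        """Log-probability of the sentence under the bigram model."""
        toks = ["<s>"] + sentence.split() + ["</s>"]
        return sum(
            math.log((self.bigrams[(a, b)] + 1) /
                     (self.unigrams[a] + self.vocab))
            for a, b in zip(toks, toks[1:]))

lm = BigramLM(["see you at the party", "the party starts now"])
sent = "see you at the paty"
# Substitute each candidate back into the chat text and keep the
# word whose sentence scores highest.
best = max(("party", "paty"), key=lambda w: lm.score(sent.replace("paty", w)))
print(best)  # 'party' scores higher in this toy corpus
```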
4. The method of claim 3, wherein the words in the high-frequency lexicon are organized into a dictionary tree, and after the editing distance error correction is performed on an original word of the chat text, if the error correction result of the original word is unchanged, the method enters run-together word segmentation error correction:
searching paths for the original word in the chat text according to the dictionary tree while calculating and storing the corresponding N-Gram language model scores, so as to generate candidate segmentation results for the run-together original word in the chat text.
5. The method of claim 4, wherein during the path search, paths are pruned according to their N-Gram language model scores, the overall N-Gram language model scores of the candidate paths are compared, and the candidate result with the highest score is selected.
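The segmentation path search of claims 4 and 5 can be sketched as follows. This sketch substitutes a plain word-set prefix lookup for the explicit dictionary tree and omits the score-based pruning (which would discard low-scoring partial paths during the walk and rank the surviving paths by overall N-Gram score); the toy lexicon and the names `split_runon` and `max_words` are invented for the example.

```python
def split_runon(text, lexicon, max_words=3):
    """Enumerate segmentations of a run-together string into lexicon
    words via depth-first path search; each segmentation is one
    candidate path through the (implicit) dictionary tree."""
    results = []

    def walk(rest, path):
        if not rest:
            results.append(path)  # a complete path: keep as candidate
            return
        if len(path) >= max_words:
            return  # bound the search depth
        for i in range(1, len(rest) + 1):
            prefix = rest[:i]
            if prefix in lexicon:  # trie lookup stand-in
                walk(rest[i:], path + [prefix])

    walk(text, [])
    return results

lexicon = {"good", "morning", "go", "od"}
print(split_runon("goodmorning", lexicon))
# candidates include ['good', 'morning'] and ['go', 'od', 'morning']
```

In the claimed method, each candidate path would additionally carry an accumulated N-Gram language model score, and the highest-scoring path would be returned as the correction.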
6. A context-based chat text content correction apparatus, comprising:
the language model training module is used for training an N-Gram language model of a given language on a preset amount of monolingual training corpora of that language and generating a high-frequency lexicon corresponding to the language;
the duplication reduction module is used for collapsing adjacent repeated letters in the words of the high-frequency lexicon to obtain reduced forms, generating a mapping from each reduced form to the corresponding words in the high-frequency lexicon, and obtaining candidate words for the reduced form of an original word in the chat text according to the mapping; and for respectively calculating the N-Gram language model score of the original word in the chat text and the N-Gram language model score obtained by substituting each candidate word back into the chat text, and taking the word with the highest N-Gram language model score as the error correction result.
7. The apparatus of claim 6, further comprising an editing distance error correction module, configured to enter editing distance error correction if the error correction result of the original chat text word is unchanged after the duplication reduction module performs duplication reduction on the original word;
the editing distance error correction module performs at least one of adding a letter, deleting a letter, exchanging adjacent letters, and changing a letter on the original word in the chat text, and takes the number of adding, deleting, exchanging, and changing operations performed as the editing distance.
8. The context-based chat text content error correction apparatus according to claim 7, wherein the editing distance error correction module screens candidate words from the high-frequency lexicon within a preset editing distance, respectively calculates the N-Gram language model score of the original word in the chat text and the N-Gram language model score obtained by substituting each candidate word back into the chat text, and takes the word with the highest N-Gram language model score as the error correction result.
9. The apparatus according to claim 8, further comprising a run-together word segmentation error correction module, configured to enter run-together word segmentation error correction if the error correction result of the original chat text word is unchanged after the editing distance error correction module performs editing distance error correction on the original word;
the run-together word segmentation error correction module searches paths for the original word in the chat text according to the dictionary tree while calculating and storing the corresponding N-Gram language model scores, so as to generate candidate segmentation results for the run-together original word in the chat text.
10. The apparatus of claim 9, wherein the run-together word segmentation error correction module prunes paths according to their N-Gram language model scores, compares the overall N-Gram language model scores of the candidate paths, and selects the candidate result with the highest score.
CN202111101950.XA 2021-09-18 2021-09-18 Method and device for correcting chat text content based on context Pending CN113807081A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111101950.XA CN113807081A (en) 2021-09-18 2021-09-18 Method and device for correcting chat text content based on context

Publications (1)

Publication Number Publication Date
CN113807081A true CN113807081A (en) 2021-12-17

Family

ID=78896024

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111101950.XA Pending CN113807081A (en) 2021-09-18 2021-09-18 Method and device for correcting chat text content based on context

Country Status (1)

Country Link
CN (1) CN113807081A (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105975625A (en) * 2016-05-26 2016-09-28 同方知网数字出版技术股份有限公司 Chinglish inquiring correcting method and system oriented to English search engine
CN111401012A (en) * 2020-03-09 2020-07-10 北京声智科技有限公司 Text error correction method, electronic device and computer readable storage medium
CN111613214A (en) * 2020-05-21 2020-09-01 重庆农村商业银行股份有限公司 Language model error correction method for improving voice recognition capability
CN111767731A (en) * 2020-07-09 2020-10-13 北京猿力未来科技有限公司 Training method and device of grammar error correction model and grammar error correction method and device
CN111933129A (en) * 2020-09-11 2020-11-13 腾讯科技(深圳)有限公司 Audio processing method, language model training method and device and computer equipment
CN112016304A (en) * 2020-09-03 2020-12-01 平安科技(深圳)有限公司 Text error correction method and device, electronic equipment and storage medium
CN112163585A (en) * 2020-11-10 2021-01-01 平安普惠企业管理有限公司 Text auditing method and device, computer equipment and storage medium


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
达吾勒・阿布都哈依尔; 努尔买买提・尤鲁瓦斯; 刘艳: "Research on Language Model Construction Methods for Kazakh LVCSR", Computer Engineering and Applications, no. 24 *

Similar Documents

Publication Publication Date Title
US11556713B2 (en) System and method for performing a meaning search using a natural language understanding (NLU) framework
US10402493B2 (en) System and method for inputting text into electronic devices
US20190087403A1 (en) Online spelling correction/phrase completion system
US9223779B2 (en) Text segmentation with multiple granularity levels
Duan et al. Online spelling correction for query completion
US9471566B1 (en) Method and apparatus for converting phonetic language input to written language output
US8612206B2 (en) Transliterating semitic languages including diacritics
US7493251B2 (en) Using source-channel models for word segmentation
KR101465770B1 (en) Word probability determination
US8095356B2 (en) Method and apparatus for processing natural language using tape-intersection
US20090150139A1 (en) Method and apparatus for translating a speech
CN112395385B (en) Text generation method and device based on artificial intelligence, computer equipment and medium
US20080208566A1 (en) Automated word-form transformation and part of speech tag assignment
JP6705318B2 (en) Bilingual dictionary creating apparatus, bilingual dictionary creating method, and bilingual dictionary creating program
CN112256860A (en) Semantic retrieval method, system, equipment and storage medium for customer service conversation content
US8660969B1 (en) Training dependency parsers by jointly optimizing multiple objectives
JP7413630B2 (en) Summary generation model training method, apparatus, device and storage medium
JP3992348B2 (en) Morphological analysis method and apparatus, and Japanese morphological analysis method and apparatus
CN113673228A (en) Text error correction method, text error correction device, computer storage medium and computer program product
CN110889295A (en) Machine translation model, and method, system and equipment for determining pseudo-professional parallel corpora
CN107168950B (en) Event phrase learning method and device based on bilingual semantic mapping
CN113807081A (en) Method and device for correcting chat text content based on context
JP2006004366A (en) Machine translation system and computer program for it
CN115688748A (en) Question error correction method and device, electronic equipment and storage medium
CN113240485A (en) Training method of text generation model, and text generation method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination