CN113807081B - Chat text content error correction method and device based on context - Google Patents

Chat text content error correction method and device based on context

Publication number: CN113807081B (granted from application CN202111101950.XA; earlier publication CN113807081A)
Authority: CN (China)
Legal status: Active
Original language: Chinese (zh)
Prior art keywords: error correction, words, chat text, language model, gram language
Inventors: 元成, 陈振标, 杜晓祥
Assignee (original and current): Beijing Yunshang Technology Co., Ltd.

Classifications

    • G06F40/232 Orthographic correction, e.g. spell checking or vowelisation
    • G06F40/242 Dictionaries
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/44 Statistical methods, e.g. probability models
    • G06F40/47 Machine-assisted translation, e.g. using translation memory
    • G06F40/58 Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation

(All under G06F40/00 Handling natural language data, within G PHYSICS / G06 COMPUTING / G06F ELECTRIC DIGITAL DATA PROCESSING.)


Abstract

The invention discloses a context-based method and device for correcting chat text content. An N-Gram language model of a given language is trained on a preset amount of monolingual training corpus, and a high-frequency lexicon of that language is generated. Based on the high-frequency lexicon, each lexicon word is reduced by removing adjacent repeated letters, a mapping from each reduced form to the corresponding lexicon words is generated, and candidate words for a reduced form are obtained from this mapping. The N-Gram language model score of the original word in the chat text and the score of each candidate substituted back into the chat text are then computed, and the word with the highest N-Gram language model score is taken as the error correction result. By exploiting context information, the method and device improve error correction accuracy and user experience, help attract more users, and increase enterprise revenue.

Description

Chat text content error correction method and device based on context
Technical Field
The invention relates to the technical field of text processing, and in particular to a context-based chat text content error correction method and device for space-delimited languages such as English and Russian.
Background
In recent years, with the popularization of the internet, text chat has become an increasingly popular way to communicate across different languages at home and abroad. Text chat expresses meaning in written form and can be translated semantically by translation software, reducing language barriers.

Although convenient, text chat also causes certain difficulties in mutual communication, especially for non-native speakers and for machine translation, because the language knowledge they rely on comes mostly from standard text. In text chat, people often type irregular text out of emotion or carelessness, for example: when excited, repeating one or several letters of a word many times; when careless, misspelling a word; when lazy, writing words together without spaces. All of these prevent the recipient from understanding the message well, particularly in translation chat software, degrading the user experience and ultimately reducing revenue. Traditional error correction techniques suggest and correct words using word-frequency information in a dictionary or the positions of letters on the keyboard, without using context information, so their accuracy is low. A new solution for chat text content error correction is needed.
Disclosure of Invention
Therefore, the invention provides a context-based chat text content error correction method and device to address the low accuracy of traditional content error correction techniques and the resulting harm to semantic understanding.

To achieve the above object, the invention provides the following technical solution: a context-based chat text content error correction method, comprising the following steps:

Training an N-Gram language model: training an N-Gram language model of a given language on a preset amount of monolingual training corpus of that language, and generating a high-frequency lexicon for the language;

Repeated-letter reduction: according to the high-frequency lexicon, reducing each lexicon word by removing adjacent repeated letters, generating a mapping from each reduced form to the corresponding lexicon words, and obtaining candidate words for a reduced form from the mapping; computing the N-Gram language model score of the original word in the chat text and the score of each candidate substituted back into the chat text, and taking the word with the highest N-Gram language model score as the error correction result.

As a preferred scheme of the context-based chat text content error correction method, if the error correction result of a chat text word is unchanged after repeated-letter reduction, edit distance error correction is entered:

At least one of inserting a letter, deleting a letter, transposing adjacent letters, and substituting a letter is applied to the original word in the chat text, and the number of such operations is taken as the edit distance.

As a preferred scheme of the context-based chat text content error correction method, candidate words in the high-frequency lexicon within a preset edit distance are screened out; the N-Gram language model score of the original word in the chat text and the score of each candidate substituted back into the chat text are computed, and the word with the highest N-Gram language model score is taken as the error correction result.

As a preferred scheme of the context-based chat text content error correction method, the words in the high-frequency lexicon are organized into a dictionary tree (trie); if the error correction result of a chat text word is unchanged after edit distance error correction, run-on word segmentation error correction is entered:

Paths are searched for the original word in the chat text along the dictionary tree, the corresponding N-Gram language model scores are computed and stored, and run-on segmentation candidates for the original word are generated.

As a preferred scheme of the context-based chat text content error correction method, during the path search, branches are pruned according to the N-Gram language model score; the total N-Gram language model scores of the candidate paths are compared, and the candidate with the highest score is selected.
The invention also provides a context-based chat text content error correction device, comprising:

a language model training module, used for training an N-Gram language model of a given language on a preset amount of monolingual training corpus of that language, and generating a high-frequency lexicon for the language;

a repeated-letter reduction module, used for reducing the words in the high-frequency lexicon by removing adjacent repeated letters, generating a mapping from reduced forms to lexicon words, and obtaining candidate words from the mapping; and for computing the N-Gram language model score of the original word in the chat text and the score of each candidate substituted back into the chat text, taking the word with the highest N-Gram language model score as the error correction result.

The context-based chat text content error correction device further comprises an edit distance error correction module, which enters edit distance error correction if the error correction result of a chat text word is unchanged after repeated-letter reduction by the repeated-letter reduction module;

the edit distance error correction module applies at least one of inserting a letter, deleting a letter, transposing adjacent letters, and substituting a letter to the original word in the chat text, and takes the number of such operations as the edit distance.

As a preferred scheme of the context-based chat text content error correction device, the edit distance error correction module screens out candidate words in the high-frequency lexicon within a preset edit distance, computes the N-Gram language model score of the original word in the chat text and the score of each candidate substituted back into the chat text, and takes the word with the highest N-Gram language model score as the error correction result.

The context-based chat text content error correction device further comprises a run-on word segmentation error correction module, which enters run-on word segmentation error correction if the error correction result of a chat text word is unchanged after edit distance error correction by the edit distance error correction module;

the run-on word segmentation error correction module searches paths for the original word in the chat text along the dictionary tree, computes and stores the corresponding N-Gram language model scores, and generates run-on segmentation candidates for the original word.

As a preferred scheme of the context-based chat text content error correction device, the run-on word segmentation error correction module prunes by N-Gram language model score during the path search, compares the total N-Gram language model scores of the candidate paths, and selects the candidate with the highest score.
The invention has the following advantages: an N-Gram language model of a given language is trained on a preset amount of monolingual training corpus, and a high-frequency lexicon of the corresponding language is generated; according to the high-frequency lexicon, words are reduced by removing adjacent repeated letters, a mapping from reduced forms to lexicon words is generated, and candidate words are obtained from the mapping; the N-Gram language model score of the original word in the chat text and the score of each candidate substituted back into the chat text are computed, and the word with the highest N-Gram language model score is taken as the error correction result. By exploiting context information, the method and device improve error correction accuracy and user experience, help attract more users, and increase enterprise revenue.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in describing them are briefly introduced below. It will be apparent to those of ordinary skill in the art that the following drawings are exemplary only, and that other implementations can be derived from them without inventive effort.

The structures, proportions, and sizes shown in this specification are provided only for illustration and description and do not limit the scope of the invention, which is defined by the claims; any structural modification, change of proportion, or adjustment of size that does not affect the efficacy or purpose of the invention falls within its scope.
FIG. 1 is a schematic diagram of the context-based chat text content error correction method provided in an embodiment of the invention;

FIG. 2 is a flow chart of repeated-letter reduction in the context-based chat text content error correction method provided in an embodiment of the invention;

FIG. 3 is a flow chart of edit distance error correction in the context-based chat text content error correction method provided in an embodiment of the invention;

FIG. 4 is a flow chart of run-on word segmentation error correction in the context-based chat text content error correction method provided in an embodiment of the invention;

FIG. 5 is a schematic diagram of the context-based chat text content error correction apparatus provided in an embodiment of the invention.
Detailed Description
Further advantages and benefits of the present invention will become apparent to those skilled in the art from the following detailed description, which describes, by way of illustration, certain specific embodiments of the invention, but not all embodiments. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the invention without inventive effort fall within the scope of the invention.
It is known that, given a word sequence w_1, w_2, ..., w_T of length T, a language model computes the probability of the sequence by the chain rule, namely:

P(w_1, w_2, ..., w_T) = P(w_1) P(w_2 | w_1) ... P(w_T | w_1, w_2, ..., w_{T-1})

Since the probability of the current word w_T depends on all of w_1, w_2, ..., w_{T-1}, the cost of computing and storing the co-occurrence probabilities of many words grows exponentially as the sequence gets longer, making language model computation over sequences of arbitrary length impractical.

The N-Gram language model applies the Markov assumption (an (N-1)-order Markov chain): the occurrence of a word depends only on the previous N-1 words. The N-Gram language model therefore computes:

P(w_1, w_2, ..., w_T) ≈ ∏_{t=1}^{T} P(w_t | w_{t-N+1}, ..., w_{t-1})

For example, for the sequence w_1, w_2, w_3, w_4, the 3-gram language model probability is:

P(w_1, w_2, w_3, w_4) = P(w_1) P(w_2 | w_1) P(w_3 | w_1, w_2) P(w_4 | w_2, w_3).

The N-Gram language model is an important technique in natural language processing and captures the information surrounding a word well. The following embodiments apply an N-Gram language model to chat text error correction.
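A minimal Python sketch of the factorization above, training an unsmoothed maximum-likelihood N-Gram model on a toy tokenized corpus and scoring a sequence under the (N-1)-order Markov assumption. All function names and the toy corpus are illustrative; the patent does not specify an implementation.

```python
from collections import Counter

def train_ngram(corpus, n=3):
    """Count n-grams and their (n-1)-gram contexts over tokenized sentences (MLE, no smoothing)."""
    grams, contexts = Counter(), Counter()
    for sent in corpus:
        toks = ["<s>"] * (n - 1) + sent + ["</s>"]
        for i in range(len(toks) - n + 1):
            grams[tuple(toks[i:i + n])] += 1
            contexts[tuple(toks[i:i + n - 1])] += 1
    return grams, contexts

def sequence_prob(sent, grams, contexts, n=3):
    """P(w_1..w_T) with each word conditioned only on the previous n-1 words."""
    toks = ["<s>"] * (n - 1) + sent + ["</s>"]
    p = 1.0
    for i in range(n - 1, len(toks)):
        ctx = tuple(toks[i - n + 1:i])
        p *= grams[tuple(toks[i - n + 1:i + 1])] / contexts[ctx] if contexts[ctx] else 0.0
    return p
```

A production system would use smoothing and log-probabilities; this sketch only illustrates the Markov factorization itself.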
Example 1
Referring to fig. 1, embodiment 1 of the present invention provides a context-based chat text content error correction method, comprising the following steps:

Training an N-Gram language model: training an N-Gram language model of a given language on a preset amount of monolingual training corpus of that language, and generating a high-frequency lexicon for the language;

Repeated-letter reduction: according to the high-frequency lexicon, reducing each lexicon word by removing adjacent repeated letters, generating a mapping from reduced forms to lexicon words, and obtaining candidate words from the mapping; computing the N-Gram language model score of the original word in the chat text and the score of each candidate substituted back into the chat text, and taking the word with the highest N-Gram language model score as the error correction result.
In this embodiment, the N-Gram language model of the language is trained on a sufficiently large monolingual training corpus, and a high-frequency lexicon of the corresponding language is generated.
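The embodiment does not fix how the high-frequency lexicon is extracted from the corpus; a simple frequency-threshold pass, as sketched below, is one plausible reading (the threshold value and whitespace tokenization are assumptions, not from the patent):

```python
from collections import Counter

def build_high_freq_lexicon(corpus_lines, min_count=2):
    """Collect words whose corpus frequency meets a threshold into the high-frequency lexicon."""
    counts = Counter(w for line in corpus_lines for w in line.lower().split())
    return {w for w, c in counts.items() if c >= min_count}
```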
Referring to fig. 2, repeated-letter reduction mainly addresses the case where one or more letters of a word are typed too many (or too few) times. First, a reduction mapping is generated: according to the high-frequency lexicon, adjacent repeated letters are removed from each lexicon word to obtain its reduced form, and a mapping from reduced forms to lexicon words is generated. When several lexicon words share the same reduced form, they are grouped together, separated by spaces; for example, in English both "to" and "too" reduce to "to", giving the mapping to -> to too.

When an input chat text word is neither in the high-frequency lexicon nor a proper noun such as a personal or place name, it enters repeated-letter reduction: its adjacent repeated letters are removed to obtain the reduced form, candidate words are obtained from the mapping, the language model score of the original word and the score of each candidate substituted back into the chat text are computed, and the word with the highest language model score is the error correction result.
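The reduction mapping and candidate selection just described can be sketched as follows. The `score` callable stands in for the N-Gram language model score of a word substituted back into the chat text; in the usage below it is replaced by a toy frequency table purely for illustration.

```python
import re
from collections import defaultdict

def reduce_repeats(word):
    """Collapse each run of a repeated letter to a single letter, e.g. 'soooon' -> 'son'."""
    return re.sub(r"(.)\1+", r"\1", word)

def build_reduction_map(lexicon):
    """Map each reduced form to the set of lexicon words sharing it."""
    mapping = defaultdict(set)
    for w in lexicon:
        mapping[reduce_repeats(w)].add(w)
    return mapping

def correct_repeats(word, mapping, score):
    """Return the original word or a mapped candidate, whichever scores highest."""
    candidates = {word} | mapping.get(reduce_repeats(word), set())
    return max(candidates, key=score)
```

For example, `correct_repeats("tooo", ...)` reduces "tooo" to "to", retrieves the candidates {"to", "too"}, and lets the score function pick among them and the original.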
In this embodiment, after repeated-letter reduction has been applied to a chat text word, if the error correction result is unchanged, edit distance error correction is entered:

At least one of inserting a letter, deleting a letter, transposing adjacent letters, and substituting a letter is applied to the original word in the chat text, and the number of such operations is taken as the edit distance.

In this embodiment, candidate words from the high-frequency lexicon within a preset edit distance are screened out, the N-Gram language model score of the original word in the chat text and the score of each candidate substituted back into the chat text are computed, and the word with the highest N-Gram language model score is taken as the error correction result.

Referring to fig. 3, specifically, when repeated-letter reduction leaves the result unchanged, edit distance error correction is entered. The edit distance is the number of operations of inserting one letter, deleting one letter, transposing adjacent letters, or substituting one letter; an edit distance of 1 means exactly one such operation was performed. Candidate words at edit distance 1 that appear in the high-frequency lexicon are screened out, the N-Gram language model score of each candidate substituted back into the chat text is computed and compared with that of the original word, and the highest-scoring word is the error correction result; otherwise the same procedure is repeated at edit distance 2, and if the edit distance reaches 3 without a correction, the original word is returned.
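The edit operations and the escalation from distance 1 to distance 2 described above resemble the classic candidate-generation scheme for spelling correction; the sketch below is one such reading (the lowercase alphabet and the escalation loop are illustrative assumptions, and a real system would then rank the returned candidates by N-Gram language model score in context):

```python
ALPHABET = "abcdefghijklmnopqrstuvwxyz"

def edits1(word):
    """All strings at edit distance 1: one insertion, deletion, adjacent transposition, or substitution."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    inserts = {L + c + R for L, R in splits for c in ALPHABET}
    deletes = {L + R[1:] for L, R in splits if R}
    transposes = {L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1}
    substitutes = {L + c + R[1:] for L, R in splits if R for c in ALPHABET}
    return inserts | deletes | transposes | substitutes

def edit_candidates(word, lexicon, max_dist=2):
    """Expand edits at distance 1, 2, ... until lexicon hits appear or max_dist is exhausted."""
    frontier = {word}
    for _ in range(max_dist):
        frontier = {e for w in frontier for e in edits1(w)}
        hits = frontier & lexicon
        if hits:
            return hits
    return set()  # caller falls back to the original word
```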
Referring to fig. 4, in this embodiment, the words in the high-frequency lexicon are organized into a dictionary tree (trie); if the error correction result of a chat text word is unchanged after edit distance error correction, run-on word segmentation error correction is entered:

Paths are searched for the original word in the chat text along the dictionary tree, the corresponding N-Gram language model scores are computed and stored, and run-on segmentation candidates for the original word are generated.

In this embodiment, during the path search, branches are pruned according to the N-Gram language model score; the total N-Gram language model scores of the candidate paths are compared, and the candidate with the highest score is selected.

See fig. 4, where threshold is the threshold on the n-gram score. In run-on word segmentation error correction, a dictionary tree is first generated from the high-frequency lexicon to make word lookup efficient. When edit distance error correction leaves the result unchanged, run-on word segmentation is entered: paths are searched for the data to be corrected along the dictionary tree, and the corresponding N-Gram language model scores are computed and stored to generate run-on segmentation candidates; pruning by N-Gram language model score during the search makes the path search more efficient. The overall N-Gram language model scores of the candidate paths are then compared, and the candidate with the highest score is selected. Because the candidates do not cover the case where the original word needs no segmentation, and because segmentations differ in length, the average per-word n-gram language model score of each segmentation is used as the reference, and a reasonable threshold can be set on it; the higher the n-gram language model score, the more plausible the segmented combination.
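The dictionary tree and path search above can be sketched as follows. For clarity this version enumerates all segmentations and only then applies the average-score threshold; the patent prunes by score during the search itself, and the `score` callable here stands in for the N-Gram language model score (replaced by a toy frequency table in the usage below):

```python
def build_trie(lexicon):
    """Dictionary tree as nested dicts; '$' marks the end of a lexicon word."""
    root = {}
    for word in lexicon:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node["$"] = True
    return root

def segment(text, trie):
    """Enumerate every way to split `text` into lexicon words by walking trie paths."""
    if not text:
        return [[]]
    results = []
    node, prefix = trie, ""
    for ch in text:
        if ch not in node:
            break
        node, prefix = node[ch], prefix + ch
        if "$" in node:  # a lexicon word ends here; recurse on the remainder
            results += [[prefix] + rest for rest in segment(text[len(prefix):], trie)]
    return results

def best_segmentation(text, trie, score, threshold=0.0):
    """Pick the split with the highest average per-word score, if it clears the threshold."""
    candidates = segment(text, trie)
    if not candidates:
        return None
    best = max(candidates, key=lambda seg: sum(map(score, seg)) / len(seg))
    return best if sum(map(score, best)) / len(best) > threshold else None
```

For example, "goodmorning" yields both ["good", "morning"] and ["go", "od", "morning"] as trie paths, and the average-score comparison keeps the former.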
According to the method, an N-Gram language model of a given language is trained on a preset amount of monolingual training corpus, and a high-frequency lexicon of the corresponding language is generated; according to the high-frequency lexicon, words are reduced by removing adjacent repeated letters, a mapping from reduced forms to lexicon words is generated, and candidate words are obtained from the mapping; the N-Gram language model score of the original word in the chat text and the score of each candidate substituted back into the chat text are computed, and the word with the highest N-Gram language model score is taken as the error correction result. The invention can be applied to translation chat services, raising the lower bound of translation quality so that irregular text input by users can still be translated well. By exploiting context information, the method improves error correction accuracy and user experience, helps attract more users, and increases enterprise revenue.
Example 2
Referring to fig. 5, embodiment 2 of the present invention provides a context-based chat text content correction apparatus, including:
the language model training module 1 is used for training an N-Gram language model of a given language on a preset amount of monolingual training corpus of that language, and generating a high-frequency lexicon for the language;

the repeated-letter reduction module 2 is used for reducing the words in the high-frequency lexicon by removing adjacent repeated letters, generating a mapping from reduced forms to lexicon words, and obtaining candidate words from the mapping; and for computing the N-Gram language model score of the original word in the chat text and the score of each candidate substituted back into the chat text, taking the word with the highest N-Gram language model score as the error correction result.

In this embodiment, the apparatus further includes an edit distance error correction module 3, which enters edit distance error correction if the error correction result of a chat text word is unchanged after repeated-letter reduction by module 2;

the edit distance error correction module 3 applies at least one of inserting a letter, deleting a letter, transposing adjacent letters, and substituting a letter to the original word in the chat text, and takes the number of such operations as the edit distance.

In this embodiment, the edit distance error correction module 3 screens out candidate words in the high-frequency lexicon within a preset edit distance, computes the N-Gram language model score of the original word in the chat text and the score of each candidate substituted back into the chat text, and takes the word with the highest N-Gram language model score as the error correction result.

In this embodiment, the apparatus further includes a run-on word segmentation error correction module 4, which enters run-on word segmentation error correction if the error correction result of a chat text word is unchanged after edit distance error correction by module 3;

the run-on word segmentation error correction module 4 searches paths for the original word in the chat text along the dictionary tree, computes and stores the corresponding N-Gram language model scores, and generates run-on segmentation candidates for the original word.

In this embodiment, the run-on word segmentation error correction module 4 prunes by N-Gram language model score during the path search, compares the total N-Gram language model scores of the candidate paths, and selects the candidate with the highest score.
It should be noted that, since the information exchange between and execution of the modules/units of the above apparatus are based on the same concept as the method embodiment of embodiment 1, the technical effects are the same as those of the method embodiment; for details, refer to the description of the method embodiment above, which is not repeated here.
Example 3
Embodiment 3 of the present invention provides a computer-readable storage medium having stored therein program code for a context-based chat text content correction method, the program code comprising instructions for performing the context-based chat text content correction method of embodiment 1 or any possible implementation thereof.
A computer-readable storage medium can be any available medium accessible by a computer, or a data storage device such as a server or data center containing one or more available media. The available medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., a solid state disk (SSD)), etc.
Example 4
Embodiment 4 of the present invention provides an electronic device including a processor coupled to a storage medium, which when executing instructions in the storage medium, causes the electronic device to perform the context-based chat text content correction method of embodiment 1 or any possible implementation thereof.
Specifically, the processor may be implemented by hardware or software, and when implemented by hardware, the processor may be a logic circuit, an integrated circuit, or the like; when implemented in software, the processor may be a general-purpose processor, implemented by reading software code stored in a memory, which may be integrated in the processor, or may reside outside the processor, and which may reside separately.
In the above embodiments, the methods may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, they may be implemented in whole or in part in the form of a computer program product, which includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the flows or functions of the embodiments of the present invention are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in, or transmitted from one computer-readable storage medium to another by wired means (e.g., coaxial cable, optical fiber, digital subscriber line (DSL)) or wireless means (e.g., infrared, radio, microwave).
Those skilled in the art will appreciate that the modules or steps of the invention described above may be implemented on a general-purpose computing device; they may be concentrated on a single computing device or distributed across a network of computing devices. Alternatively, they may be implemented in program code executable by computing devices, so that they may be stored in a memory device and executed by computing devices. In some cases, the steps shown or described may be performed in a different order than that presented here; alternatively, the modules or steps may be fabricated into individual integrated circuit modules, or multiple of them may be fabricated into a single integrated circuit module. Thus, the present invention is not limited to any specific combination of hardware and software.
While the invention has been described in detail through the foregoing general description and specific examples, it will be apparent to those skilled in the art that modifications and improvements can be made. Accordingly, such modifications or improvements made without departing from the spirit of the invention are intended to fall within the scope of the invention as claimed.

Claims (2)

1. A context-based chat text content error correction method, characterized by comprising the following steps:
training an N-Gram language model: training an N-Gram language model of a given language on a preset quantity of monolingual training corpus of that language, and generating a high-frequency word stock for the language;
copy reduction: according to the high-frequency word stock, reducing the words in the word stock by removing adjacent repeated letters, generating a mapping from each reduced form to the corresponding words in the word stock, and obtaining candidate words for a reduced form from the mapping; calculating the N-Gram language model score of the original word in the chat text and the N-Gram language model scores of the candidate words substituted back into the chat text, and taking the word with the highest N-Gram language model score as the error correction result;
if the error correction result of an original word in the chat text is unchanged after copy reduction, entering edit distance error correction:
performing at least one of adding a letter, deleting a letter, swapping adjacent letters, and changing a letter on the original word in the chat text, with the number of such operations taken as the edit distance;
screening out candidate words from the high-frequency word stock within a preset edit distance, calculating the N-Gram language model score of the original word in the chat text and the N-Gram language model scores of the candidate words substituted back into the chat text, and taking the word with the highest N-Gram language model score as the error correction result;
generating a dictionary tree from the words in the high-frequency word stock, and, if the error correction result of the original word is unchanged after edit distance error correction, entering continuous-writing word segmentation error correction:
searching paths for the original word in the chat text through the dictionary tree while calculating and storing the corresponding N-Gram language model scores, and generating continuous-writing segmentation candidates for the original word;
during the path search, pruning according to the N-Gram language model scores, comparing the total N-Gram language model scores of the candidate paths, and selecting the candidate result with the highest score.
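The patent gives no reference implementation; the following is a minimal Python sketch of the first step of claim 1: counting N-Grams over a monolingual corpus and keeping frequent words as the high-frequency word stock. Whitespace tokenization, add-alpha smoothing, and the names `train_ngram`, `score`, and `min_freq` are our own assumptions, not part of the patent.

```python
from collections import Counter
import math

def train_ngram(corpus_sentences, n=2, min_freq=2):
    """Count n-grams and build a high-frequency word stock."""
    ngram_counts = Counter()
    context_counts = Counter()
    word_freq = Counter()
    for sent in corpus_sentences:
        words = sent.lower().split()
        word_freq.update(words)
        tokens = ["<s>"] * (n - 1) + words + ["</s>"]
        for i in range(len(tokens) - n + 1):
            gram = tuple(tokens[i:i + n])
            ngram_counts[gram] += 1
            context_counts[gram[:-1]] += 1
    # high-frequency word stock: words seen at least min_freq times
    word_stock = {w for w, c in word_freq.items() if c >= min_freq}
    return ngram_counts, context_counts, word_stock

def score(tokens, ngram_counts, context_counts, n=2, alpha=1.0):
    """Add-alpha smoothed log-probability of a token sequence."""
    tokens = ["<s>"] * (n - 1) + tokens + ["</s>"]
    vocab = len(context_counts) + 1
    logp = 0.0
    for i in range(len(tokens) - n + 1):
        gram = tuple(tokens[i:i + n])
        num = ngram_counts.get(gram, 0) + alpha
        den = context_counts.get(gram[:-1], 0) + alpha * vocab
        logp += math.log(num / den)
    return logp
```

Each correction stage of the claim can then compare `score(...)` for the original sentence against each candidate substituted back into it.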
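The copy-reduction step (collapsing adjacent repeated letters, mapping reduced forms to word-stock entries, and picking the candidate that scores best in context) could be sketched as follows. The names `reduce_repeats`, `build_reduction_map`, and `correct_by_reduction` are illustrative; the scoring function is passed in, standing for the N-Gram model score of the claim.

```python
import re
from collections import defaultdict

def reduce_repeats(word):
    """Collapse every run of the same letter down to a single letter."""
    return re.sub(r"(.)\1+", r"\1", word)

def build_reduction_map(word_stock):
    """Map each reduced form to the word-stock words that produce it."""
    mapping = defaultdict(set)
    for w in word_stock:
        mapping[reduce_repeats(w)].add(w)
    return mapping

def correct_by_reduction(tokens, idx, mapping, score_fn):
    """Replace tokens[idx] by the candidate scoring best in context."""
    word = tokens[idx]
    candidates = mapping.get(reduce_repeats(word), set()) | {word}
    return max(candidates,
               key=lambda c: score_fn(tokens[:idx] + [c] + tokens[idx + 1:]))
```

For instance, "soooo" reduces to "so"; if "so" is in the word stock, it becomes a candidate, and the claim keeps whichever of "soooo" and "so" gives the higher language model score in the chat text.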
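The four edit operations of the edit distance step (adding, deleting, swapping adjacent, and changing letters) can be enumerated in the style of the well-known Norvig spelling corrector; this is one common way to realize the claimed candidate screening, not necessarily the patentee's. `edits1`, `candidates_within`, and the lowercase ASCII alphabet are our assumptions, and `word_stock` is expected to be a set.

```python
def edits1(word, alphabet="abcdefghijklmnopqrstuvwxyz"):
    """All strings one edit away: delete, transpose (swap adjacent),
    replace (change), or insert (add) a letter."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in splits if b for c in alphabet]
    inserts = [a + c + b for a, b in splits for c in alphabet]
    return set(deletes + transposes + replaces + inserts)

def candidates_within(word, word_stock, max_dist=2):
    """Word-stock entries within max_dist edits of the input word."""
    cands = {word} & word_stock
    near = edits1(word)
    cands |= near & word_stock
    if max_dist >= 2:
        for w in near:
            cands |= edits1(w) & word_stock
    return cands
```

The surviving candidates are then re-scored with the N-Gram model exactly as in the copy-reduction step.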
2. A context-based chat text content error correction apparatus, comprising:
a language model training module for training an N-Gram language model of a given language on a preset quantity of monolingual training corpus of that language, and generating a high-frequency word stock for the language;
a copy reduction module for reducing the words in the high-frequency word stock by removing adjacent repeated letters, generating a mapping from each reduced form to the corresponding words in the word stock, and obtaining candidate words for a reduced form from the mapping; calculating the N-Gram language model score of the original word in the chat text and the N-Gram language model scores of the candidate words substituted back into the chat text, and taking the word with the highest N-Gram language model score as the error correction result;
an edit distance error correction module for entering edit distance error correction if the error correction result of an original word in the chat text is unchanged after the copy reduction module has processed it;
the edit distance error correction module performing at least one of adding a letter, deleting a letter, swapping adjacent letters, and changing a letter on the original word in the chat text, with the number of such operations taken as the edit distance;
the edit distance error correction module screening out candidate words from the high-frequency word stock within a preset edit distance, calculating the N-Gram language model score of the original word in the chat text and the N-Gram language model scores of the candidate words substituted back into the chat text, and taking the word with the highest N-Gram language model score as the error correction result;
a continuous-writing word segmentation error correction module for entering continuous-writing word segmentation error correction if the error correction result of the original word is unchanged after the edit distance error correction module has processed it;
the continuous-writing word segmentation error correction module searching paths for the original word in the chat text through a dictionary tree while calculating and storing the corresponding N-Gram language model scores, and generating continuous-writing segmentation candidates for the original word;
the continuous-writing word segmentation error correction module pruning according to the N-Gram language model scores during the path search, comparing the total N-Gram language model scores of the candidate paths, and selecting the candidate result with the highest score.
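The dictionary-tree path search with score-based pruning described for continuous-writing word segmentation could be sketched as below: build a trie from the word stock, then search paths through a run-together string, keeping only the best-scoring segmentation ending at each character position. Keeping one path per position is a Viterbi-style pruning, one possible reading of the claimed score-based cutting; `build_trie`, `segment`, and the `"$"` end-of-word marker are illustrative choices, not from the patent.

```python
def build_trie(word_stock):
    """Dictionary tree as nested dicts; '$' marks a complete word."""
    root = {}
    for word in word_stock:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node["$"] = word
    return root

def segment(text, trie, score_fn):
    """Best in-stock segmentation of a run-together string.
    best[i] holds the highest-scoring word sequence covering text[:i]."""
    n = len(text)
    best = [None] * (n + 1)
    best[0] = []
    for start in range(n):
        if best[start] is None:
            continue  # no path reaches this position; pruned
        node = trie
        for end in range(start, n):
            ch = text[end]
            if ch not in node:
                break  # no word in the trie continues this path
            node = node[ch]
            if "$" in node:  # a complete word ends here
                cand = best[start] + [node["$"]]
                if best[end + 1] is None or score_fn(cand) > score_fn(best[end + 1]):
                    best[end + 1] = cand
    return best[n]  # None if no full segmentation exists
```

With an N-Gram model score as `score_fn`, segmenting "goodmorning" would compare the total scores of candidate paths such as good|morning against go|od|morning and return the highest-scoring one.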
CN202111101950.XA 2021-09-18 2021-09-18 Chat text content error correction method and device based on context Active CN113807081B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111101950.XA CN113807081B (en) 2021-09-18 2021-09-18 Chat text content error correction method and device based on context

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111101950.XA CN113807081B (en) 2021-09-18 2021-09-18 Chat text content error correction method and device based on context

Publications (2)

Publication Number Publication Date
CN113807081A CN113807081A (en) 2021-12-17
CN113807081B true CN113807081B (en) 2024-06-14

Family

ID=78896024

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111101950.XA Active CN113807081B (en) 2021-09-18 2021-09-18 Chat text content error correction method and device based on context

Country Status (1)

Country Link
CN (1) CN113807081B (en)

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105975625A (en) * 2016-05-26 2016-09-28 同方知网数字出版技术股份有限公司 Chinglish inquiring correcting method and system oriented to English search engine

Family Cites Families (6)

Publication number Priority date Publication date Assignee Title
CN111401012B (en) * 2020-03-09 2023-11-21 北京声智科技有限公司 Text error correction method, electronic device and computer readable storage medium
CN111613214A (en) * 2020-05-21 2020-09-01 重庆农村商业银行股份有限公司 Language model error correction method for improving voice recognition capability
CN111767731A (en) * 2020-07-09 2020-10-13 北京猿力未来科技有限公司 Training method and device of grammar error correction model and grammar error correction method and device
CN112016304A (en) * 2020-09-03 2020-12-01 平安科技(深圳)有限公司 Text error correction method and device, electronic equipment and storage medium
CN111933129B (en) * 2020-09-11 2021-01-05 腾讯科技(深圳)有限公司 Audio processing method, language model training method and device and computer equipment
CN112163585B (en) * 2020-11-10 2023-11-10 上海七猫文化传媒有限公司 Text auditing method and device, computer equipment and storage medium

Patent Citations (1)

Publication number Priority date Publication date Assignee Title
CN105975625A (en) * 2016-05-26 2016-09-28 同方知网数字出版技术股份有限公司 Chinglish inquiring correcting method and system oriented to English search engine

Also Published As

Publication number Publication date
CN113807081A (en) 2021-12-17

Similar Documents

Publication Publication Date Title
US11556713B2 (en) System and method for performing a meaning search using a natural language understanding (NLU) framework
KR102268875B1 (en) System and method for inputting text into electronic devices
US20190087403A1 (en) Online spelling correction/phrase completion system
US8612206B2 (en) Transliterating semitic languages including diacritics
KR101465770B1 (en) Word probability determination
US20180089169A1 (en) Method, non-transitory computer-readable recording medium storing a program, apparatus, and system for creating similar sentence from original sentences to be translated
JP6705318B2 (en) Bilingual dictionary creating apparatus, bilingual dictionary creating method, and bilingual dictionary creating program
WO2012095696A2 (en) Text segmentation with multiple granularity levels
KR20060043682A (en) Systems and methods for improved spell checking
CN104077275A (en) Method and device for performing word segmentation based on context
US8660969B1 (en) Training dependency parsers by jointly optimizing multiple objectives
AU2019203783B2 (en) Extraction of tokens and relationship between tokens from documents to form an entity relationship map
US7398210B2 (en) System and method for performing analysis on word variants
US8972244B2 (en) Sampling and optimization in phrase-based machine translation using an enriched language model representation
CN112507721B (en) Method, apparatus, device and computer readable storage medium for generating text theme
CN113255329A (en) English text spelling error correction method and device, storage medium and electronic equipment
CN110705285B (en) Government affair text subject word library construction method, device, server and readable storage medium
CN113807081B (en) Chat text content error correction method and device based on context
CN107168950B (en) Event phrase learning method and device based on bilingual semantic mapping
CN113553833B (en) Text error correction method and device and electronic equipment
WO2022227166A1 (en) Word replacement method and apparatus, electronic device, and storage medium
CN111625579B (en) Information processing method, device and system
JP2006004366A (en) Machine translation system and computer program for it
CN115688748A (en) Question error correction method and device, electronic equipment and storage medium
Jose et al. Lexical normalization model for noisy SMS text

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant