CN113807081B - Chat text content error correction method and device based on context - Google Patents

Chat text content error correction method and device based on context

Publication number: CN113807081B (granted from application CN202111101950.XA; earlier publication CN113807081A)
Authority: CN (China)
Legal status: Active
Original language: Chinese (zh)
Prior art keywords: error correction, words, chat text, language model, gram language
Inventors: 元成, 陈振标, 杜晓祥
Assignee (original and current): Beijing Yunshang Technology Co., Ltd.

Classifications

    • G06F40/232 Orthographic correction, e.g. spell checking or vowelisation
    • G06F40/242 Dictionaries
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/44 Statistical methods, e.g. probability models
    • G06F40/47 Machine-assisted translation, e.g. using translation memory
    • G06F40/58 Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation

(All under G06F40/00 Handling natural language data, within G PHYSICS / G06 COMPUTING / G06F ELECTRIC DIGITAL DATA PROCESSING.)


Abstract

The invention discloses a context-based method and device for correcting chat text content. An N-Gram language model of a given language is trained on a preset amount of monolingual training corpus, and a high-frequency lexicon of that language is generated. Based on the high-frequency lexicon, each lexicon word is reduced by removing adjacent repeated letters, a mapping from each reduced form to the corresponding lexicon words is generated, and candidate words for a reduced form are obtained from this mapping. The N-Gram language model score of the original word in the chat text and the score of each candidate substituted back into the chat text are then computed, and the word with the highest N-Gram language model score is taken as the error correction result. By exploiting context information, the method and device improve error correction accuracy and user experience, help attract more users, and increase enterprise revenue.

Description

Chat text content error correction method and device based on context
Technical Field
The invention relates to the technical field of text processing, and in particular to a context-based chat text content error correction method and device for space-delimited languages such as English and Russian.
Background
In recent years, with the popularization of the internet, text chat has become an increasingly popular way to communicate across different languages at home and abroad. Text chat expresses meaning in written form and can be translated semantically by translation software, reducing language barriers.

Although convenient, text chat also causes certain difficulties in mutual communication, especially for non-native speakers and for machine translation, because the language knowledge they rely on comes mostly from standard text. In text chat, people often type irregular text out of emotion or carelessness, for example: when excited, repeating one or several letters of a word many times; when careless, misspelling a word; when lazy, writing words together without spaces. All of these prevent the recipient from understanding the message well, particularly in translation chat software, degrading the user experience and ultimately reducing revenue. Traditional error correction techniques suggest and correct words using word-frequency information in a dictionary or the positions of letters on the keyboard, without using context information, so their accuracy is low. A new solution for chat text content error correction is needed.
Disclosure of Invention
Therefore, the invention provides a context-based chat text content error correction method and device to address the low accuracy of traditional content error correction techniques and the resulting harm to semantic understanding.

To achieve the above object, the invention provides the following technical solution: a context-based chat text content error correction method, comprising the following steps:

Training an N-Gram language model: training an N-Gram language model of a given language on a preset amount of monolingual training corpus of that language, and generating a high-frequency lexicon for the language;

Repeated-letter reduction: according to the high-frequency lexicon, reducing each lexicon word by removing adjacent repeated letters, generating a mapping from each reduced form to the corresponding lexicon words, and obtaining candidate words for a reduced form from the mapping; computing the N-Gram language model score of the original word in the chat text and the score of each candidate substituted back into the chat text, and taking the word with the highest N-Gram language model score as the error correction result.

As a preferred scheme of the context-based chat text content error correction method, if the error correction result of a chat text word is unchanged after repeated-letter reduction, edit distance error correction is entered:

At least one of inserting a letter, deleting a letter, transposing adjacent letters, and substituting a letter is applied to the original word in the chat text, and the number of such operations is taken as the edit distance.

As a preferred scheme of the context-based chat text content error correction method, candidate words in the high-frequency lexicon within a preset edit distance are screened out; the N-Gram language model score of the original word in the chat text and the score of each candidate substituted back into the chat text are computed, and the word with the highest N-Gram language model score is taken as the error correction result.

As a preferred scheme of the context-based chat text content error correction method, the words in the high-frequency lexicon are organized into a dictionary tree (trie); if the error correction result of a chat text word is unchanged after edit distance error correction, run-on word segmentation error correction is entered:

Paths are searched for the original word in the chat text along the dictionary tree, the corresponding N-Gram language model scores are computed and stored, and run-on segmentation candidates for the original word are generated.

As a preferred scheme of the context-based chat text content error correction method, during the path search, branches are pruned according to the N-Gram language model score; the total N-Gram language model scores of the candidate paths are compared, and the candidate with the highest score is selected.
The invention also provides a context-based chat text content error correction device, comprising:

a language model training module, used for training an N-Gram language model of a given language on a preset amount of monolingual training corpus of that language, and generating a high-frequency lexicon for the language;

a repeated-letter reduction module, used for reducing the words in the high-frequency lexicon by removing adjacent repeated letters, generating a mapping from reduced forms to lexicon words, and obtaining candidate words from the mapping; and for computing the N-Gram language model score of the original word in the chat text and the score of each candidate substituted back into the chat text, taking the word with the highest N-Gram language model score as the error correction result.

The context-based chat text content error correction device further comprises an edit distance error correction module, which enters edit distance error correction if the error correction result of a chat text word is unchanged after repeated-letter reduction by the repeated-letter reduction module;

the edit distance error correction module applies at least one of inserting a letter, deleting a letter, transposing adjacent letters, and substituting a letter to the original word in the chat text, and takes the number of such operations as the edit distance.

As a preferred scheme of the context-based chat text content error correction device, the edit distance error correction module screens out candidate words in the high-frequency lexicon within a preset edit distance, computes the N-Gram language model score of the original word in the chat text and the score of each candidate substituted back into the chat text, and takes the word with the highest N-Gram language model score as the error correction result.

The context-based chat text content error correction device further comprises a run-on word segmentation error correction module, which enters run-on word segmentation error correction if the error correction result of a chat text word is unchanged after edit distance error correction by the edit distance error correction module;

the run-on word segmentation error correction module searches paths for the original word in the chat text along the dictionary tree, computes and stores the corresponding N-Gram language model scores, and generates run-on segmentation candidates for the original word.

As a preferred scheme of the context-based chat text content error correction device, the run-on word segmentation error correction module prunes by N-Gram language model score during the path search, compares the total N-Gram language model scores of the candidate paths, and selects the candidate with the highest score.
The invention has the following advantages: an N-Gram language model of a given language is trained on a preset amount of monolingual training corpus, and a high-frequency lexicon of the corresponding language is generated; according to the high-frequency lexicon, words are reduced by removing adjacent repeated letters, a mapping from reduced forms to lexicon words is generated, and candidate words are obtained from the mapping; the N-Gram language model score of the original word in the chat text and the score of each candidate substituted back into the chat text are computed, and the word with the highest N-Gram language model score is taken as the error correction result. By exploiting context information, the method and device improve error correction accuracy and user experience, help attract more users, and increase enterprise revenue.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in describing them are briefly introduced below. It will be apparent to those of ordinary skill in the art that the following drawings are exemplary only, and that other implementations can be derived from them without inventive effort.

The structures, proportions, and sizes shown in this specification are provided only for illustration and description and do not limit the scope of the invention, which is defined by the claims; any structural modification, change of proportion, or adjustment of size that does not affect the efficacy or purpose of the invention falls within its scope.
FIG. 1 is a schematic diagram of the context-based chat text content error correction method provided in an embodiment of the invention;

FIG. 2 is a flow chart of repeated-letter reduction in the context-based chat text content error correction method provided in an embodiment of the invention;

FIG. 3 is a flow chart of edit distance error correction in the context-based chat text content error correction method provided in an embodiment of the invention;

FIG. 4 is a flow chart of run-on word segmentation error correction in the context-based chat text content error correction method provided in an embodiment of the invention;

FIG. 5 is a schematic diagram of the context-based chat text content error correction apparatus provided in an embodiment of the invention.
Detailed Description
Further advantages and benefits of the present invention will become apparent to those skilled in the art from the following detailed description, which describes, by way of illustration, certain specific embodiments of the invention, but not all embodiments. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the invention without inventive effort fall within the scope of the invention.
It is known that, given a word sequence w_1, w_2, ..., w_T of length T, a language model computes the probability of the sequence by the chain rule, namely:

P(w_1, w_2, ..., w_T) = P(w_1) P(w_2 | w_1) ... P(w_T | w_1, w_2, ..., w_{T-1})

Since the probability of the current word w_T depends on all of w_1, w_2, ..., w_{T-1}, the cost of computing and storing the co-occurrence probabilities of many words grows exponentially as the sequence gets longer, making language model computation over sequences of arbitrary length impractical.

The N-Gram language model applies the Markov assumption (an (N-1)-order Markov chain): the occurrence of a word depends only on the previous N-1 words. The N-Gram language model therefore computes:

P(w_1, w_2, ..., w_T) ≈ ∏_{t=1}^{T} P(w_t | w_{t-N+1}, ..., w_{t-1})

For example, for the sequence w_1, w_2, w_3, w_4, the 3-gram language model probability is:

P(w_1, w_2, w_3, w_4) = P(w_1) P(w_2 | w_1) P(w_3 | w_1, w_2) P(w_4 | w_2, w_3).

The N-Gram language model is an important technique in natural language processing and captures the information surrounding a word well. The following embodiments apply an N-Gram language model to chat text error correction.
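A minimal Python sketch of the factorization above, training an unsmoothed maximum-likelihood N-Gram model on a toy tokenized corpus and scoring a sequence under the (N-1)-order Markov assumption. All function names and the toy corpus are illustrative; the patent does not specify an implementation.

```python
from collections import Counter

def train_ngram(corpus, n=3):
    """Count n-grams and their (n-1)-gram contexts over tokenized sentences (MLE, no smoothing)."""
    grams, contexts = Counter(), Counter()
    for sent in corpus:
        toks = ["<s>"] * (n - 1) + sent + ["</s>"]
        for i in range(len(toks) - n + 1):
            grams[tuple(toks[i:i + n])] += 1
            contexts[tuple(toks[i:i + n - 1])] += 1
    return grams, contexts

def sequence_prob(sent, grams, contexts, n=3):
    """P(w_1..w_T) with each word conditioned only on the previous n-1 words."""
    toks = ["<s>"] * (n - 1) + sent + ["</s>"]
    p = 1.0
    for i in range(n - 1, len(toks)):
        ctx = tuple(toks[i - n + 1:i])
        p *= grams[tuple(toks[i - n + 1:i + 1])] / contexts[ctx] if contexts[ctx] else 0.0
    return p
```

A production system would use smoothing and log-probabilities; this sketch only illustrates the Markov factorization itself.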
Example 1
Referring to fig. 1, embodiment 1 of the present invention provides a context-based chat text content error correction method, comprising the following steps:

Training an N-Gram language model: training an N-Gram language model of a given language on a preset amount of monolingual training corpus of that language, and generating a high-frequency lexicon for the language;

Repeated-letter reduction: according to the high-frequency lexicon, reducing each lexicon word by removing adjacent repeated letters, generating a mapping from reduced forms to lexicon words, and obtaining candidate words from the mapping; computing the N-Gram language model score of the original word in the chat text and the score of each candidate substituted back into the chat text, and taking the word with the highest N-Gram language model score as the error correction result.
In this embodiment, the N-Gram language model of the language is trained on a sufficiently large monolingual training corpus, and a high-frequency lexicon of the corresponding language is generated.
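The embodiment does not fix how the high-frequency lexicon is extracted from the corpus; a simple frequency-threshold pass, as sketched below, is one plausible reading (the threshold value and whitespace tokenization are assumptions, not from the patent):

```python
from collections import Counter

def build_high_freq_lexicon(corpus_lines, min_count=2):
    """Collect words whose corpus frequency meets a threshold into the high-frequency lexicon."""
    counts = Counter(w for line in corpus_lines for w in line.lower().split())
    return {w for w, c in counts.items() if c >= min_count}
```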
Referring to fig. 2, repeated-letter reduction mainly addresses the case where one or more letters of a word are typed too many (or too few) times. First, a reduction mapping is generated: according to the high-frequency lexicon, adjacent repeated letters are removed from each lexicon word to obtain its reduced form, and a mapping from reduced forms to lexicon words is generated. When several lexicon words share the same reduced form, they are grouped together, separated by spaces; for example, in English both "to" and "too" reduce to "to", giving the mapping to -> to too.

When an input chat text word is neither in the high-frequency lexicon nor a proper noun such as a personal or place name, it enters repeated-letter reduction: its adjacent repeated letters are removed to obtain the reduced form, candidate words are obtained from the mapping, the language model score of the original word and the score of each candidate substituted back into the chat text are computed, and the word with the highest language model score is the error correction result.
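The reduction mapping and candidate selection just described can be sketched as follows. The `score` callable stands in for the N-Gram language model score of a word substituted back into the chat text; in the usage below it is replaced by a toy frequency table purely for illustration.

```python
import re
from collections import defaultdict

def reduce_repeats(word):
    """Collapse each run of a repeated letter to a single letter, e.g. 'soooon' -> 'son'."""
    return re.sub(r"(.)\1+", r"\1", word)

def build_reduction_map(lexicon):
    """Map each reduced form to the set of lexicon words sharing it."""
    mapping = defaultdict(set)
    for w in lexicon:
        mapping[reduce_repeats(w)].add(w)
    return mapping

def correct_repeats(word, mapping, score):
    """Return the original word or a mapped candidate, whichever scores highest."""
    candidates = {word} | mapping.get(reduce_repeats(word), set())
    return max(candidates, key=score)
```

For example, `correct_repeats("tooo", ...)` reduces "tooo" to "to", retrieves the candidates {"to", "too"}, and lets the score function pick among them and the original.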
In this embodiment, after repeated-letter reduction has been applied to a chat text word, if the error correction result is unchanged, edit distance error correction is entered:

At least one of inserting a letter, deleting a letter, transposing adjacent letters, and substituting a letter is applied to the original word in the chat text, and the number of such operations is taken as the edit distance.

In this embodiment, candidate words from the high-frequency lexicon within a preset edit distance are screened out, the N-Gram language model score of the original word in the chat text and the score of each candidate substituted back into the chat text are computed, and the word with the highest N-Gram language model score is taken as the error correction result.

Referring to fig. 3, specifically, when repeated-letter reduction leaves the result unchanged, edit distance error correction is entered. The edit distance is the number of operations of inserting one letter, deleting one letter, transposing adjacent letters, or substituting one letter; an edit distance of 1 means exactly one such operation was performed. Candidate words at edit distance 1 that appear in the high-frequency lexicon are screened out, the N-Gram language model score of each candidate substituted back into the chat text is computed and compared with that of the original word, and the highest-scoring word is the error correction result; otherwise the same procedure is repeated at edit distance 2, and if the edit distance reaches 3 without a correction, the original word is returned.
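The edit operations and the escalation from distance 1 to distance 2 described above resemble the classic candidate-generation scheme for spelling correction; the sketch below is one such reading (the lowercase alphabet and the escalation loop are illustrative assumptions, and a real system would then rank the returned candidates by N-Gram language model score in context):

```python
ALPHABET = "abcdefghijklmnopqrstuvwxyz"

def edits1(word):
    """All strings at edit distance 1: one insertion, deletion, adjacent transposition, or substitution."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    inserts = {L + c + R for L, R in splits for c in ALPHABET}
    deletes = {L + R[1:] for L, R in splits if R}
    transposes = {L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1}
    substitutes = {L + c + R[1:] for L, R in splits if R for c in ALPHABET}
    return inserts | deletes | transposes | substitutes

def edit_candidates(word, lexicon, max_dist=2):
    """Expand edits at distance 1, 2, ... until lexicon hits appear or max_dist is exhausted."""
    frontier = {word}
    for _ in range(max_dist):
        frontier = {e for w in frontier for e in edits1(w)}
        hits = frontier & lexicon
        if hits:
            return hits
    return set()  # caller falls back to the original word
```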
Referring to fig. 4, in this embodiment, the words in the high-frequency lexicon are organized into a dictionary tree (trie); if the error correction result of a chat text word is unchanged after edit distance error correction, run-on word segmentation error correction is entered:

Paths are searched for the original word in the chat text along the dictionary tree, the corresponding N-Gram language model scores are computed and stored, and run-on segmentation candidates for the original word are generated.

In this embodiment, during the path search, branches are pruned according to the N-Gram language model score; the total N-Gram language model scores of the candidate paths are compared, and the candidate with the highest score is selected.

See fig. 4, where threshold is the threshold on the n-gram score. In run-on word segmentation error correction, a dictionary tree is first generated from the high-frequency lexicon to make word lookup efficient. When edit distance error correction leaves the result unchanged, run-on word segmentation is entered: paths are searched for the data to be corrected along the dictionary tree, and the corresponding N-Gram language model scores are computed and stored to generate run-on segmentation candidates; pruning by N-Gram language model score during the search makes the path search more efficient. The overall N-Gram language model scores of the candidate paths are then compared, and the candidate with the highest score is selected. Because the candidates do not cover the case where the original word needs no segmentation, and because segmentations differ in length, the average per-word n-gram language model score of each segmentation is used as the reference, and a reasonable threshold can be set on it; the higher the n-gram language model score, the more plausible the segmented combination.
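The dictionary tree and path search above can be sketched as follows. For clarity this version enumerates all segmentations and only then applies the average-score threshold; the patent prunes by score during the search itself, and the `score` callable here stands in for the N-Gram language model score (replaced by a toy frequency table in the usage below):

```python
def build_trie(lexicon):
    """Dictionary tree as nested dicts; '$' marks the end of a lexicon word."""
    root = {}
    for word in lexicon:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node["$"] = True
    return root

def segment(text, trie):
    """Enumerate every way to split `text` into lexicon words by walking trie paths."""
    if not text:
        return [[]]
    results = []
    node, prefix = trie, ""
    for ch in text:
        if ch not in node:
            break
        node, prefix = node[ch], prefix + ch
        if "$" in node:  # a lexicon word ends here; recurse on the remainder
            results += [[prefix] + rest for rest in segment(text[len(prefix):], trie)]
    return results

def best_segmentation(text, trie, score, threshold=0.0):
    """Pick the split with the highest average per-word score, if it clears the threshold."""
    candidates = segment(text, trie)
    if not candidates:
        return None
    best = max(candidates, key=lambda seg: sum(map(score, seg)) / len(seg))
    return best if sum(map(score, best)) / len(best) > threshold else None
```

For example, "goodmorning" yields both ["good", "morning"] and ["go", "od", "morning"] as trie paths, and the average-score comparison keeps the former.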
According to the method, an N-Gram language model of a given language is trained on a preset amount of monolingual training corpus, and a high-frequency lexicon of the corresponding language is generated; according to the high-frequency lexicon, words are reduced by removing adjacent repeated letters, a mapping from reduced forms to lexicon words is generated, and candidate words are obtained from the mapping; the N-Gram language model score of the original word in the chat text and the score of each candidate substituted back into the chat text are computed, and the word with the highest N-Gram language model score is taken as the error correction result. The invention can be applied to translation chat services, raising the lower bound of translation quality so that irregular text input by users can still be translated well. By exploiting context information, the method improves error correction accuracy and user experience, helps attract more users, and increases enterprise revenue.
Example 2
Referring to fig. 5, embodiment 2 of the present invention provides a context-based chat text content correction apparatus, including:
the language model training module 1 is used for training an N-Gram language model of a given language on a preset amount of monolingual training corpus of that language, and generating a high-frequency lexicon for the language;

the repeated-letter reduction module 2 is used for reducing the words in the high-frequency lexicon by removing adjacent repeated letters, generating a mapping from reduced forms to lexicon words, and obtaining candidate words from the mapping; and for computing the N-Gram language model score of the original word in the chat text and the score of each candidate substituted back into the chat text, taking the word with the highest N-Gram language model score as the error correction result.

In this embodiment, the apparatus further includes an edit distance error correction module 3, which enters edit distance error correction if the error correction result of a chat text word is unchanged after repeated-letter reduction by module 2;

the edit distance error correction module 3 applies at least one of inserting a letter, deleting a letter, transposing adjacent letters, and substituting a letter to the original word in the chat text, and takes the number of such operations as the edit distance.

In this embodiment, the edit distance error correction module 3 screens out candidate words in the high-frequency lexicon within a preset edit distance, computes the N-Gram language model score of the original word in the chat text and the score of each candidate substituted back into the chat text, and takes the word with the highest N-Gram language model score as the error correction result.

In this embodiment, the apparatus further includes a run-on word segmentation error correction module 4, which enters run-on word segmentation error correction if the error correction result of a chat text word is unchanged after edit distance error correction by module 3;

the run-on word segmentation error correction module 4 searches paths for the original word in the chat text along the dictionary tree, computes and stores the corresponding N-Gram language model scores, and generates run-on segmentation candidates for the original word.

In this embodiment, the run-on word segmentation error correction module 4 prunes by N-Gram language model score during the path search, compares the total N-Gram language model scores of the candidate paths, and selects the candidate with the highest score.
It should be noted that, since the information exchange between and execution of the modules/units of the above apparatus are based on the same concept as the method embodiment of embodiment 1, the technical effects are the same as those of the method embodiment; for details, refer to the description of the method embodiment above, which is not repeated here.
Example 3
Embodiment 3 of the present invention provides a computer-readable storage medium having stored therein program code for a context-based chat text content correction method, the program code comprising instructions for performing the context-based chat text content correction method of embodiment 1 or any possible implementation thereof.
A computer-readable storage medium can be any available medium accessible by a computer, or a data storage device such as a server or data center containing one or more available media. The available medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., a solid state disk (SSD)), etc.
Example 4
Embodiment 4 of the present invention provides an electronic device including a processor coupled to a storage medium, which when executing instructions in the storage medium, causes the electronic device to perform the context-based chat text content correction method of embodiment 1 or any possible implementation thereof.
Specifically, the processor may be implemented by hardware or software, and when implemented by hardware, the processor may be a logic circuit, an integrated circuit, or the like; when implemented in software, the processor may be a general-purpose processor, implemented by reading software code stored in a memory, which may be integrated in the processor, or may reside outside the processor, and which may reside separately.
In the above embodiments, the methods may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, they may be implemented in whole or in part in the form of a computer program product, which includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the flows or functions of the embodiments of the present invention are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in, or transmitted from one computer-readable storage medium to another by wired means (e.g., coaxial cable, optical fiber, digital subscriber line (DSL)) or wireless means (e.g., infrared, radio, microwave).
Those skilled in the art will appreciate that the modules or steps of the invention described above may be implemented on a general-purpose computing device; they may be concentrated on a single computing device or distributed across a network of computing devices. Alternatively, they may be implemented in program code executable by computing devices, so that they may be stored in a memory device and executed by computing devices. In some cases, the steps shown or described may be performed in a different order than that presented here; alternatively, the modules or steps may be fabricated into individual integrated circuit modules, or multiple of them may be fabricated into a single integrated circuit module. Thus, the present invention is not limited to any specific combination of hardware and software.
While the invention has been described in detail through the foregoing general description and specific examples, it will be apparent to those skilled in the art that modifications and improvements can be made. Accordingly, such modifications or improvements made without departing from the spirit of the invention are intended to fall within the scope of the invention as claimed.

Claims (2)

1. A context-based chat text content error correction method, characterized by comprising the following steps:
training an N-Gram language model: training an N-Gram language model of a given language on a preset quantity of monolingual training corpus of that language, and generating a high-frequency word stock for the language;
copy reduction: according to the high-frequency word stock, reducing the words in the word stock by removing adjacent repeated letters, generating a mapping from each reduced form to the corresponding words in the word stock, and obtaining candidate words for a reduced form from the mapping; calculating the N-Gram language model score of the original word in the chat text and the N-Gram language model scores of the candidate words substituted back into the chat text, and taking the word with the highest N-Gram language model score as the error correction result;
if the error correction result of an original word in the chat text is unchanged after copy reduction, entering edit distance error correction:
performing at least one of adding a letter, deleting a letter, swapping adjacent letters, and changing a letter on the original word in the chat text, with the number of such operations taken as the edit distance;
screening out candidate words from the high-frequency word stock within a preset edit distance, calculating the N-Gram language model score of the original word in the chat text and the N-Gram language model scores of the candidate words substituted back into the chat text, and taking the word with the highest N-Gram language model score as the error correction result;
generating a dictionary tree from the words in the high-frequency word stock, and, if the error correction result of the original word is unchanged after edit distance error correction, entering continuous-writing word segmentation error correction:
searching paths for the original word in the chat text through the dictionary tree while calculating and storing the corresponding N-Gram language model scores, and generating continuous-writing segmentation candidates for the original word;
during the path search, pruning according to the N-Gram language model scores, comparing the total N-Gram language model scores of the candidate paths, and selecting the candidate result with the highest score.
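The patent gives no reference implementation; the following is a minimal Python sketch of the first step of claim 1: counting N-Grams over a monolingual corpus and keeping frequent words as the high-frequency word stock. Whitespace tokenization, add-alpha smoothing, and the names `train_ngram`, `score`, and `min_freq` are our own assumptions, not part of the patent.

```python
from collections import Counter
import math

def train_ngram(corpus_sentences, n=2, min_freq=2):
    """Count n-grams and build a high-frequency word stock."""
    ngram_counts = Counter()
    context_counts = Counter()
    word_freq = Counter()
    for sent in corpus_sentences:
        words = sent.lower().split()
        word_freq.update(words)
        tokens = ["<s>"] * (n - 1) + words + ["</s>"]
        for i in range(len(tokens) - n + 1):
            gram = tuple(tokens[i:i + n])
            ngram_counts[gram] += 1
            context_counts[gram[:-1]] += 1
    # high-frequency word stock: words seen at least min_freq times
    word_stock = {w for w, c in word_freq.items() if c >= min_freq}
    return ngram_counts, context_counts, word_stock

def score(tokens, ngram_counts, context_counts, n=2, alpha=1.0):
    """Add-alpha smoothed log-probability of a token sequence."""
    tokens = ["<s>"] * (n - 1) + tokens + ["</s>"]
    vocab = len(context_counts) + 1
    logp = 0.0
    for i in range(len(tokens) - n + 1):
        gram = tuple(tokens[i:i + n])
        num = ngram_counts.get(gram, 0) + alpha
        den = context_counts.get(gram[:-1], 0) + alpha * vocab
        logp += math.log(num / den)
    return logp
```

Each correction stage of the claim can then compare `score(...)` for the original sentence against each candidate substituted back into it.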
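The copy-reduction step (collapsing adjacent repeated letters, mapping reduced forms to word-stock entries, and picking the candidate that scores best in context) could be sketched as follows. The names `reduce_repeats`, `build_reduction_map`, and `correct_by_reduction` are illustrative; the scoring function is passed in, standing for the N-Gram model score of the claim.

```python
import re
from collections import defaultdict

def reduce_repeats(word):
    """Collapse every run of the same letter down to a single letter."""
    return re.sub(r"(.)\1+", r"\1", word)

def build_reduction_map(word_stock):
    """Map each reduced form to the word-stock words that produce it."""
    mapping = defaultdict(set)
    for w in word_stock:
        mapping[reduce_repeats(w)].add(w)
    return mapping

def correct_by_reduction(tokens, idx, mapping, score_fn):
    """Replace tokens[idx] by the candidate scoring best in context."""
    word = tokens[idx]
    candidates = mapping.get(reduce_repeats(word), set()) | {word}
    return max(candidates,
               key=lambda c: score_fn(tokens[:idx] + [c] + tokens[idx + 1:]))
```

For instance, "soooo" reduces to "so"; if "so" is in the word stock, it becomes a candidate, and the claim keeps whichever of "soooo" and "so" gives the higher language model score in the chat text.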
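The four edit operations of the edit distance step (adding, deleting, swapping adjacent, and changing letters) can be enumerated in the style of the well-known Norvig spelling corrector; this is one common way to realize the claimed candidate screening, not necessarily the patentee's. `edits1`, `candidates_within`, and the lowercase ASCII alphabet are our assumptions, and `word_stock` is expected to be a set.

```python
def edits1(word, alphabet="abcdefghijklmnopqrstuvwxyz"):
    """All strings one edit away: delete, transpose (swap adjacent),
    replace (change), or insert (add) a letter."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in splits if b for c in alphabet]
    inserts = [a + c + b for a, b in splits for c in alphabet]
    return set(deletes + transposes + replaces + inserts)

def candidates_within(word, word_stock, max_dist=2):
    """Word-stock entries within max_dist edits of the input word."""
    cands = {word} & word_stock
    near = edits1(word)
    cands |= near & word_stock
    if max_dist >= 2:
        for w in near:
            cands |= edits1(w) & word_stock
    return cands
```

The surviving candidates are then re-scored with the N-Gram model exactly as in the copy-reduction step.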
2. A context-based chat text content error correction apparatus, comprising:
a language model training module for training an N-Gram language model of a given language on a preset quantity of monolingual training corpus of that language, and generating a high-frequency word stock for the language;
a copy reduction module for reducing the words in the high-frequency word stock by removing adjacent repeated letters, generating a mapping from each reduced form to the corresponding words in the word stock, and obtaining candidate words for a reduced form from the mapping; calculating the N-Gram language model score of the original word in the chat text and the N-Gram language model scores of the candidate words substituted back into the chat text, and taking the word with the highest N-Gram language model score as the error correction result;
an edit distance error correction module for entering edit distance error correction if the error correction result of an original word in the chat text is unchanged after the copy reduction module has processed it;
the edit distance error correction module performing at least one of adding a letter, deleting a letter, swapping adjacent letters, and changing a letter on the original word in the chat text, with the number of such operations taken as the edit distance;
the edit distance error correction module screening out candidate words from the high-frequency word stock within a preset edit distance, calculating the N-Gram language model score of the original word in the chat text and the N-Gram language model scores of the candidate words substituted back into the chat text, and taking the word with the highest N-Gram language model score as the error correction result;
a continuous-writing word segmentation error correction module for entering continuous-writing word segmentation error correction if the error correction result of the original word is unchanged after the edit distance error correction module has processed it;
the continuous-writing word segmentation error correction module searching paths for the original word in the chat text through a dictionary tree while calculating and storing the corresponding N-Gram language model scores, and generating continuous-writing segmentation candidates for the original word;
the continuous-writing word segmentation error correction module pruning according to the N-Gram language model scores during the path search, comparing the total N-Gram language model scores of the candidate paths, and selecting the candidate result with the highest score.
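The dictionary-tree path search with score-based pruning described for continuous-writing word segmentation could be sketched as below: build a trie from the word stock, then search paths through a run-together string, keeping only the best-scoring segmentation ending at each character position. Keeping one path per position is a Viterbi-style pruning, one possible reading of the claimed score-based cutting; `build_trie`, `segment`, and the `"$"` end-of-word marker are illustrative choices, not from the patent.

```python
def build_trie(word_stock):
    """Dictionary tree as nested dicts; '$' marks a complete word."""
    root = {}
    for word in word_stock:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node["$"] = word
    return root

def segment(text, trie, score_fn):
    """Best in-stock segmentation of a run-together string.
    best[i] holds the highest-scoring word sequence covering text[:i]."""
    n = len(text)
    best = [None] * (n + 1)
    best[0] = []
    for start in range(n):
        if best[start] is None:
            continue  # no path reaches this position; pruned
        node = trie
        for end in range(start, n):
            ch = text[end]
            if ch not in node:
                break  # no word in the trie continues this path
            node = node[ch]
            if "$" in node:  # a complete word ends here
                cand = best[start] + [node["$"]]
                if best[end + 1] is None or score_fn(cand) > score_fn(best[end + 1]):
                    best[end + 1] = cand
    return best[n]  # None if no full segmentation exists
```

With an N-Gram model score as `score_fn`, segmenting "goodmorning" would compare the total scores of candidate paths such as good|morning against go|od|morning and return the highest-scoring one.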
CN202111101950.XA 2021-09-18 2021-09-18 Chat text content error correction method and device based on context Active CN113807081B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111101950.XA CN113807081B (en) 2021-09-18 2021-09-18 Chat text content error correction method and device based on context

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111101950.XA CN113807081B (en) 2021-09-18 2021-09-18 Chat text content error correction method and device based on context

Publications (2)

Publication Number Publication Date
CN113807081A CN113807081A (en) 2021-12-17
CN113807081B true CN113807081B (en) 2024-06-14

Family

ID=78896024

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111101950.XA Active CN113807081B (en) 2021-09-18 2021-09-18 Chat text content error correction method and device based on context

Country Status (1)

Country Link
CN (1) CN113807081B (en)

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105975625A (en) * 2016-05-26 2016-09-28 同方知网数字出版技术股份有限公司 Chinglish inquiring correcting method and system oriented to English search engine

Family Cites Families (6)

Publication number Priority date Publication date Assignee Title
CN111401012B (en) * 2020-03-09 2023-11-21 北京声智科技有限公司 Text error correction method, electronic device and computer readable storage medium
CN111613214A (en) * 2020-05-21 2020-09-01 重庆农村商业银行股份有限公司 Language model error correction method for improving voice recognition capability
CN111767731A (en) * 2020-07-09 2020-10-13 北京猿力未来科技有限公司 Training method and device of grammar error correction model and grammar error correction method and device
CN112016304A (en) * 2020-09-03 2020-12-01 平安科技(深圳)有限公司 Text error correction method and device, electronic equipment and storage medium
CN111933129B (en) * 2020-09-11 2021-01-05 腾讯科技(深圳)有限公司 Audio processing method, language model training method and device and computer equipment
CN112163585B (en) * 2020-11-10 2023-11-10 上海七猫文化传媒有限公司 Text auditing method and device, computer equipment and storage medium

Patent Citations (1)

Publication number Priority date Publication date Assignee Title
CN105975625A (en) * 2016-05-26 2016-09-28 同方知网数字出版技术股份有限公司 Chinglish inquiring correcting method and system oriented to English search engine

Also Published As

Publication number Publication date
CN113807081A (en) 2021-12-17

Similar Documents

Publication Publication Date Title
US11556713B2 (en) System and method for performing a meaning search using a natural language understanding (NLU) framework
KR102268875B1 (en) System and method for inputting text into electronic devices
US20190087403A1 (en) Online spelling correction/phrase completion system
US8612206B2 (en) Transliterating semitic languages including diacritics
KR101465770B1 (en) Word probability determination
US20180089169A1 (en) Method, non-transitory computer-readable recording medium storing a program, apparatus, and system for creating similar sentence from original sentences to be translated
JP6705318B2 (en) Bilingual dictionary creating apparatus, bilingual dictionary creating method, and bilingual dictionary creating program
WO2012095696A2 (en) Text segmentation with multiple granularity levels
KR20060043682A (en) Systems and methods for improved spell checking
CN104077275A (en) Method and device for performing word segmentation based on context
US8660969B1 (en) Training dependency parsers by jointly optimizing multiple objectives
AU2019203783B2 (en) Extraction of tokens and relationship between tokens from documents to form an entity relationship map
US7398210B2 (en) System and method for performing analysis on word variants
US8972244B2 (en) Sampling and optimization in phrase-based machine translation using an enriched language model representation
CN112507721B (en) Method, apparatus, device and computer readable storage medium for generating text theme
CN113255329A (en) English text spelling error correction method and device, storage medium and electronic equipment
CN110705285B (en) Government affair text subject word library construction method, device, server and readable storage medium
CN113807081B (en) Chat text content error correction method and device based on context
CN107168950B (en) Event phrase learning method and device based on bilingual semantic mapping
CN113553833B (en) Text error correction method and device and electronic equipment
WO2022227166A1 (en) Word replacement method and apparatus, electronic device, and storage medium
CN111625579B (en) Information processing method, device and system
JP2006004366A (en) Machine translation system and computer program for it
CN115688748A (en) Question error correction method and device, electronic equipment and storage medium
Jose et al. Lexical normalization model for noisy SMS text

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant