CN114282527A - Multi-language text detection and correction method, system, electronic device and storage medium - Google Patents

Multi-language text detection and correction method, system, electronic device and storage medium

Info

Publication number
CN114282527A
Authority
CN
China
Prior art keywords
detected
language
word
sentence
preset
Prior art date
Legal status
Pending
Application number
CN202111576592.8A
Other languages
Chinese (zh)
Inventor
杨子清
韦菁
崔一鸣
伍大勇
陈志刚
Current Assignee
Hebei Xunfei Institute Of Artificial Intelligence
Zhongke Xunfei Internet Beijing Information Technology Co ltd
iFlytek Co Ltd
Original Assignee
Hebei Xunfei Institute Of Artificial Intelligence
Zhongke Xunfei Internet Beijing Information Technology Co ltd
iFlytek Co Ltd
Priority date
Filing date
Publication date
Application filed by Hebei Xunfei Institute Of Artificial Intelligence, Zhongke Xunfei Internet Beijing Information Technology Co ltd, iFlytek Co Ltd filed Critical Hebei Xunfei Institute Of Artificial Intelligence
Priority to CN202111576592.8A
Publication of CN114282527A
Legal status: Pending

Landscapes

  • Machine Translation (AREA)

Abstract

The invention provides a multilingual text detection and correction method, system, electronic device and storage medium. The method comprises: obtaining a text to be detected and performing multilingual character recognition on it to obtain at least one sentence to be recognized; performing language detection on the characters of a target language in the sentence to be recognized to obtain language words to be detected, and performing spelling detection and semantic detection on the language words to be detected; and, if at least one language word to be detected has a spelling error and/or a semantic error, performing corresponding spelling correction and/or semantic correction on the word that has the spelling error and/or the semantic error. The invention can better understand text semantics in a cross-language context, detect all characters of the target language in the text, and correct only the words that contain errors.

Description

Multi-language text detection and correction method, system, electronic device and storage medium
Technical Field
The invention relates to the technical field of artificial intelligence, in particular to a multi-language text detection and correction method, a system, electronic equipment and a storage medium.
Background
Some existing error correction schemes, whether rule-based or neural-network-based, can only correct text against the background of a single language and do not consider code-switching, for example "Xiao Zhang is gafe" ("gafe" being Spanish slang for an unlucky person, a jinx). When foreign characters appear in a Chinese context in this way, spelling errors and semantic errors can occur, which causes the system to misunderstand the semantics of the cross-language text.
Disclosure of Invention
The invention provides a method, a system, an electronic device and a storage medium for detecting and correcting multilingual text, which are used to solve the problem of incorrect semantic understanding of cross-language text in the prior art.
In a first aspect, the present invention provides a method for detecting and correcting a multilingual text, the method comprising:
acquiring a text to be detected, and performing multi-language character recognition on the text to be detected to obtain at least one sentence to be recognized, wherein the sentence to be recognized comprises characters of a main language and characters of at least one target language, and the main language is different from the target language;
performing language detection on characters of a target language in the sentence to be recognized to obtain a language word to be detected, and performing spelling detection and semantic detection on the language word to be detected;
and if at least one language word to be detected has spelling errors and/or semantic errors, carrying out corresponding spelling error correction and/or semantic error correction on the word with the spelling errors and/or the semantic errors.
In an embodiment of the present invention, the obtaining the text to be detected, and performing multilingual character recognition on the text to be detected to obtain at least one sentence to be recognized includes:
carrying out data cleaning on the text to be detected so as to delete illegal characters in the text to be detected and garbled characters caused by encoding errors;
sentence dividing is carried out on the text to be detected to obtain at least one sentence to be recognized, and blank characters of a sentence head and a sentence tail of each sentence to be recognized are deleted;
and identifying characters of the sentence to be identified, and recording the position of the characters of the target language in the sentence to be identified when the characters of the target language exist.
In an embodiment of the present invention, the language detection on the characters of the target language in the sentence to be recognized includes:
inputting a sentence to be recognized with characters of a target language into a preset language detection model;
the preset language detection model divides the input sentence to be recognized based on a sequence labeling mechanism and outputs the language corresponding to the word of the target language in the sentence to be recognized.
In an embodiment of the present invention, the segmenting, by the preset language detection model, the input sentence to be recognized based on a sequence tagging mechanism, and outputting the language corresponding to the word of the target language existing in the sentence to be recognized includes:
performing word segmentation on the sentence to be recognized to obtain a word segmentation list with at least one word segmentation, and adding preset special characters to the head and the tail of the word segmentation list respectively to represent the beginning and the end;
mapping each participle in the participle list to a corresponding identification number to obtain an identification number list;
inputting the identification number list into an embedding layer of the preset language detection model so as to convert the identification number list into a first matrix with a first preset dimensionality;
inputting the first matrix into a multi-layer Transformer of the preset language detection model for calculation so as to output a second matrix with a second preset dimensionality;
inputting the second matrix into a full-connection layer of the preset language detection model, and performing normalization calculation on the output of the full-connection layer to obtain the language probability of the participle corresponding to each identification number;
and determining the language corresponding to each participle according to the language probability of each participle.
In an embodiment of the present invention, the performing spelling detection and semantic detection on the language word to be detected includes:
inputting each language word to be detected into a preset spelling detection model to detect whether spelling errors exist or not;
if the preset spelling detection model detects that each language word to be detected has no spelling error, inputting a sentence to be detected containing at least one language word to be detected into a preset semantic detection model so as to detect whether a semantic error exists;
and if the sentence to be detected has no semantic error, returning the text to be detected corresponding to the sentence to be detected as the detected text.
In an embodiment of the present invention, if at least one of the language words to be detected has a spelling error and/or a semantic error, performing corresponding spelling error correction and/or semantic error correction on the word having the spelling error and/or the semantic error includes:
if the language word to be detected has spelling errors, inputting the language word to be detected into a preset spelling error correction model to carry out spelling error correction;
inputting the sentence to be detected containing the language word to be detected after spelling error correction processing into the preset semantic detection model so as to detect whether semantic errors exist or not;
and if the sentence to be detected has no semantic error, returning the text to be detected corresponding to the sentence to be detected after the spelling error correction as the detected text.
In an embodiment of the present invention, if at least one of the language words to be detected has a spelling error and/or a semantic error, performing corresponding spelling error correction and/or semantic error correction on the word having the spelling error and/or the semantic error includes:
if the sentence to be detected has semantic errors, inputting the sentence to be detected into a preset semantic error correction model to perform semantic error correction processing;
and returning the text to be detected corresponding to the sentence to be detected after semantic error correction as the detected text.
In an embodiment of the present invention, before the inputting each language word to be detected into the preset spelling detection model to detect whether there is a spelling error, the method further includes:
inputting a preset number of reference language words into an encoder of the preset spelling detection model for encoding, wherein the preset number of reference language words comprises a marked correctly spelled word set and an incorrectly spelled word set;
performing word segmentation on each coded word, inputting a word segmentation result into a hidden layer of the preset spelling detection model, and extracting the output of the hidden layer to obtain various representations of each word;
aggregating the multiple representations of each word in respective spaces to obtain clustered results of the correctly spelled word set and the misspelled word set.
In an embodiment of the present invention, the inputting each language word to be detected into a preset spelling detection model to detect whether there is a spelling error includes:
inputting each language word to be detected into an encoder of the preset spelling detection model for encoding;
segmenting the coded language words to be detected, inputting segmentation results into a hidden layer of the preset spelling detection model, and extracting the output of the hidden layer to obtain multiple representations of each language word to be detected;
calculating a first average distance between each word to be detected and all words in the correctly spelled word set and calculating a second average distance between each word to be detected and all words in the incorrectly spelled word set according to the multiple representations of each word to be detected;
and for each language word to be detected, if the value of the first average distance exceeds a first preset threshold value, determining that the language word to be detected is a word with misspelling, and if the value of the second average distance exceeds a second preset threshold value, determining that the language word to be detected is a word with correct spelling.
In an embodiment of the present invention, the inputting the sentence to be detected including at least one language word to be detected into a preset semantic detection model to detect whether there is a semantic error includes:
performing word segmentation on the sentence to be detected, and inputting a word segmentation result into an encoder of the preset semantic detection model for processing to obtain a target matrix;
inputting each column of the target matrix into a classifier of the preset semantic detection model for processing to obtain corresponding two classification probabilities;
and determining whether semantic errors exist in the words of the target language in the sentence to be detected according to the two classification probabilities.
In an embodiment of the present invention, if the language word to be detected has a misspelling, the inputting the language word to be detected into a preset spelling correction model to perform a spelling correction process includes:
the following operations are executed for each language word to be detected with spelling errors:
calculating the editing distance between each language word to be detected and each word in a preset dictionary through the preset spelling error correction model;
if a word with the minimum editing distance exists in the preset dictionary, replacing the language word to be detected with the word with the minimum editing distance so as to correct the language word to be detected;
if a plurality of words with the minimum editing distance exist in the preset dictionary, replacing the language word to be detected with the word that has the highest frequency among the words with the minimum editing distance, so as to correct the language word to be detected;
and the preset dictionary is a language dictionary corresponding to the target language.
In an embodiment of the present invention, the inputting the sentence to be detected into a preset semantic error correction model to perform semantic error correction processing includes:
extracting at least one word to be corrected from the sentence to be detected with semantic errors;
querying a preset dictionary, and taking all candidate words in the preset dictionary, of which the editing distance from the word to be corrected is smaller than or equal to a preset threshold value, as a candidate set;
masking the words to be corrected in the sentences to be detected by using preset marker symbols, and inputting the sentences to be detected after masking treatment into a preset semantic error correction model;
and predicting the probability of replacing the hidden word to be corrected by each candidate word in the candidate set by using the preset semantic error correction model, and taking the candidate word with the highest probability value as a corrected word.
In an embodiment of the present invention, the returning the text to be detected corresponding to the semantic error-corrected sentence to be detected as the detected text includes:
merging the correction words with the sentences to be detected;
during merging, if the length of a correction word is inconsistent with the length of a corresponding word to be corrected or a plurality of correction words exist in a sentence, deleting the word to be corrected in the sentence to be detected, and then inserting the corresponding correction word;
and returning the text to be detected corresponding to the combined sentence to be detected as the detected text.
In an embodiment of the present invention, before the sentence to be detected is input into a preset semantic error correction model, the method further includes:
training the preset semantic error correction model according to the following mode:
taking a preset number of correct target language texts or main language texts containing target language words as a training set;
inputting the texts in the training set into the preset semantic error correction model, and randomly masking part of words in the input texts by the preset semantic error correction model;
for each masked word, training the preset semantic error correction model to replace the masked word with the predicted word.
In a second aspect, the present invention further provides a multilingual text detection and correction system, comprising:
the recognition module is used for acquiring a text to be detected and performing multi-language character recognition on the text to be detected to obtain at least one sentence to be recognized, wherein the sentence to be recognized comprises characters of a main language and characters of at least one target language, and the main language is different from the target language;
the detection module is used for performing language detection on the characters of the target language in the sentence to be recognized to obtain the language words to be detected and performing spelling detection and semantic detection on the language words to be detected;
and the error correction module is used for carrying out corresponding spelling error correction and/or semantic error correction on the words with spelling errors and/or semantic errors if at least one language word to be detected has spelling errors and/or semantic errors.
In a third aspect, the present invention further provides an electronic device, comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the steps of the multilingual text detection and correction method according to any one of the first aspects when executing the program.
In a fourth aspect, the present invention also provides a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of the multilingual text detection and correction method of any of the first aspects.
The multilingual text detection and correction method, system, electronic device and storage medium provided by the invention identify the foreign characters in the input text and determine their language, first perform spelling and semantic detection on the foreign characters by means of artificial intelligence, and then perform the corresponding spelling correction and semantic correction according to the spelling errors and semantic errors that are found. The invention can better understand text semantics in a cross-language context, detect all characters of the target language in the text, and correct only the words that contain errors.
Drawings
In order to more clearly illustrate the technical solutions of the present invention or the prior art, the drawings needed for the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and it is obvious for those skilled in the art that other embodiments can be obtained by using the drawings without creative efforts.
FIG. 1 is a flow chart of a multi-language text detection and correction method provided by the present invention;
FIG. 2 is a schematic view of a language detection process according to an embodiment of the present invention;
FIG. 3 is a flow chart of semantic detection provided by an embodiment of the present invention;
FIG. 4 is a diagram illustrating a detection module and an error correction module according to an embodiment of the present invention;
FIG. 5 is a diagram illustrating a method for detecting and correcting multiple language texts according to an embodiment of the present invention;
FIG. 6 is a schematic flow diagram corresponding to FIG. 5;
FIG. 7 is a schematic diagram of a multi-lingual text detection and correction system according to the present invention;
fig. 8 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is obvious that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The terms "first," "second," and the like in the description and in the claims, and in the drawings described above, are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It will be appreciated that the data so used may be interchanged under appropriate circumstances such that the embodiments described herein may be practiced otherwise than as specifically illustrated or described herein.
The following describes a technical scenario to which the present invention relates:
With the popularization and rapid development of intelligent electronic devices, a large number of documents are produced in electronic form, and their volume keeps growing. Meanwhile, as language use evolves, code-switching (for example, between Chinese and English) appears more and more often in social networks: speakers mix more and more foreign words into their sentences to express themselves more precisely, and checking and correcting the non-Chinese characters that appear in Chinese-context manuscripts costs a great deal of labor and time. In modern social networks there are many cases of foreign characters embedded in a Chinese context, for example "Xiao Zhang is gafe" ("gafe" being Spanish slang for an unlucky person, a jinx). In these cases, spelling errors and semantic contradictions are likely to occur.
In order to solve the problem of semantic understanding errors of cross-language texts in the prior art, the invention provides a multilingual text detection and correction method, system, electronic device and storage medium. The invention can better understand text semantics in a cross-language context, detect all characters of the target language in the text, and correct only the words that contain errors.
The multilingual text detection and correction method, system, electronic device, and storage medium of the present invention are described below with reference to fig. 1-8.
FIG. 1 is a flow chart of the multilingual text detection and correction method provided by the present invention. As shown in FIG. 1, an embodiment of the invention provides a multilingual text detection and correction method, which comprises the following steps:
step 101, acquiring a text to be detected, and performing multilingual character recognition on the text to be detected to obtain at least one sentence to be recognized, wherein the sentence to be recognized comprises characters of a main language and characters of at least one target language, and the main language is different from the target language.
Illustratively, the characters of the main language are Chinese characters and the characters of the target language are foreign characters.
Illustratively, the non-Chinese characters include foreign characters and preset symbol characters. Non-Chinese characters include Arabic numerals, Latin letters, punctuation marks and the like; the foreign characters are those composed of Latin letters, while the Arabic numerals and punctuation marks belong to the preset symbol characters.
Illustratively, one text to be detected includes one or more sentences to be recognized, and each sentence to be recognized may be composed of chinese characters and non-chinese characters.
For example, the sentence "hello, guten Tag" is composed of the Chinese word for "hello", the foreign characters "guten Tag" and a punctuation mark, wherein the foreign character string "guten Tag" is composed of the word "guten" and the word "Tag".
Step 102, performing language detection on characters of a target language in the sentence to be recognized to obtain a language word to be detected, and performing spelling detection and semantic detection on the language word to be detected.
The language word to be detected refers to a word corresponding to a certain language to be detected, for example, "guten" and "Tag" are the language words to be detected.
Step 103, if at least one language word to be detected has spelling errors and/or semantic errors, performing corresponding spelling error correction and/or semantic error correction on the word with the spelling errors and/or the semantic errors.
It should be noted that, in the embodiment of the present invention, the text content to be detected contains non-Chinese characters in a Chinese context, that is, the sentences to be recognized in the text to be detected are mixed with foreign characters; if a sentence to be recognized does not contain foreign characters, its Chinese characters are detected and corrected by an existing Chinese text detection and correction system.
The invention can adopt different error correction modes for different detection errors of foreign characters: for example, spelling error correction is adopted when a spelling error is detected, and semantic error correction is adopted when a semantic error is detected, thereby improving error correction efficiency.
The steps 101 to 103 are described in detail below.
In step 101, the obtaining a text to be detected and performing multilingual character recognition on the text to be detected to obtain at least one sentence to be recognized includes:
and step 1011, performing data cleaning on the text to be detected to delete the illegal characters in the text to be detected and messy codes caused by coding errors.
Step 1012, performing sentence segmentation on the text to be detected to obtain the at least one sentence to be recognized, and deleting the sentence head and the blank character of the sentence tail of each sentence to be recognized.
Illustratively, the text to be detected is divided into sentences using an LTP (Language Technology Platform) tool, and blank characters at the beginning and end of the sentence are removed, so that the subsequent model can be input in units of sentences.
LTP provides a series of Chinese natural language processing tools that users can use to perform word segmentation, part-of-speech tagging, syntactic analysis, etc. on Chinese text.
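By way of illustration only, the cleaning and sentence-splitting steps above can be sketched roughly as follows. This is a minimal Python stand-in and does not use the LTP API; the splitting rule, the treatment of control characters and the function names are assumptions made for this sketch.

```python
import re
import unicodedata

def clean_text(text: str) -> str:
    """Drop illegal characters and garbled characters caused by encoding errors (sketch)."""
    kept = []
    for ch in text:
        if ch == "\ufffd":                      # U+FFFD is a typical sign of an encoding error
            continue
        if unicodedata.category(ch).startswith("C") and ch not in "\n\t":
            continue                            # control characters treated as illegal here
        kept.append(ch)
    return "".join(kept)

def split_sentences(text: str) -> list[str]:
    """Split on sentence-ending punctuation and strip blank characters at both ends."""
    parts = re.split(r"(?<=[。！？!?.])", text)
    return [p.strip() for p in parts if p.strip()]
```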
Step 1013, recognizing characters of the sentence to be recognized, and recording the position of the characters of the target language in the sentence to be recognized if the characters of the target language exist.
Illustratively, the language classification of the different characters in a sentence is identified using the unicodedata module (a module in the Python standard library), and the positions of the different characters in the sentence are recorded at the same time, taking single characters as the unit.
The unicodedata module provides a category function, which takes a single character as input and returns a short string denoting its Unicode category; this can be used to distinguish the Chinese-character parts of a sentence from the rest.
Specifically, each character is sent to the category function one by one. If the return value begins with "N", the character is marked as "Num" (number); if it begins with "P", the character is marked as "Punc" (punctuation); if it begins with "L", a further decision is made: if the Unicode code point of the character falls within any one of the following intervals:
[0x4E00, 0x9FFF], [0x3400, 0x4DBF], [0x20000, 0x2A6DF], [0x2A700, 0x2B73F], [0x2B740, 0x2B81F], [0x2B820, 0x2CEAF], [0xF900, 0xFAFF], [0x2F800, 0x2FA1F], then the character is marked as a Chinese character "ZH"; otherwise it is marked as a foreign character "OT". Adjacent identical character marks are then merged into marks in interval form, and the result is returned.
For example, for the sentence to be recognized "我明天presantation的PTT还要修改。" ("My presantation PTT for tomorrow still needs to be revised."), the result "[0, 3, "ZH"), [3, 15, "OT"), [15, 16, "ZH"), [16, 19, "OT"), [19, 23, "ZH"), [23, 24, "Punc")" is returned. Character counting starts at 0 and each interval is closed on the left and open on the right. For example, [0, 3, "ZH") indicates that the characters from position 0 to position 2 of the text are Chinese characters, [3, 15, "OT") indicates that the characters from position 3 to position 14 are foreign characters, [23, 24, "Punc") indicates that position 23 is a punctuation mark, and so on.
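A minimal sketch of this character marking, assuming Python's unicodedata module, is shown below; the "Other" catch-all for characters that are neither letters, digits nor punctuation is an illustrative assumption.

```python
import unicodedata

# Unicode ranges treated as Chinese characters, as listed above.
CJK_RANGES = [
    (0x4E00, 0x9FFF), (0x3400, 0x4DBF), (0x20000, 0x2A6DF), (0x2A700, 0x2B73F),
    (0x2B740, 0x2B81F), (0x2B820, 0x2CEAF), (0xF900, 0xFAFF), (0x2F800, 0x2FA1F),
]

def mark_characters(sentence: str) -> list[tuple[int, int, str]]:
    """Return half-open intervals [start, end) marked ZH / OT / Num / Punc."""
    marks = []
    for ch in sentence:
        cat = unicodedata.category(ch)
        if cat.startswith("N"):
            marks.append("Num")
        elif cat.startswith("P"):
            marks.append("Punc")
        elif cat.startswith("L"):
            cp = ord(ch)
            marks.append("ZH" if any(lo <= cp <= hi for lo, hi in CJK_RANGES) else "OT")
        else:
            marks.append("Other")               # e.g. whitespace (assumption)
    spans, start = [], 0
    for i in range(1, len(marks) + 1):          # merge adjacent identical marks
        if i == len(marks) or marks[i] != marks[start]:
            spans.append((start, i, marks[start]))
            start = i
    return spans

# mark_characters("我明天presantation的PTT还要修改。")
# -> [(0, 3, 'ZH'), (3, 15, 'OT'), (15, 16, 'ZH'), (16, 19, 'OT'), (19, 23, 'ZH'), (23, 24, 'Punc')]
```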
Illustratively, if there are no target language characters (i.e., foreign characters), the original text is returned directly or the system is switched to the existing Chinese text detection and correction system for processing.
Fig. 2 is a schematic flow chart of language detection according to an embodiment of the present invention, as shown in fig. 2. In the step 102, the language detection of the characters of the target language in the sentence to be recognized includes:
step 1021, inputting the sentence to be recognized with the character of the target language into the preset language detection model.
Illustratively, the preset language detection model adopts an XLM-Roberta-based sequence labeling model.
The pre-training models are classified according to languages and can be divided into single-language pre-training models and multi-language pre-training models. The single language pre-training model is a model pre-trained on a single language corpus and can only process tasks of a single language; the multi-language pre-training model refers to a model pre-trained on multi-language linguistic data and can process tasks of multiple languages.
XLM-Roberta is a typical multilingual pre-training model; it is a Transformer-based language model trained with a masked language modeling objective and can process text in more than 100 different languages. The neural network model of the embodiment of the invention is developed based on XLM-Roberta, but the embodiment of the invention is not limited to a neural network structure based on XLM-Roberta; other neural network structures may also be used.
Illustratively, the configuration of the present invention based on XLM-Roberta may adopt the following first or second configuration:
first, 12 layers/dimension 768 (XLM-Roberta-base);
second, 24 layers/dimension 1024 (XLM-Roberta-large).
Illustratively, in the XLM-Roberta-based sequence labeling model, the language detection task can be represented as a multi-class classification task. The data set X contains n sentences, X = {x_1, x_2, …, x_n}; each sentence can be represented as x_i = {w_1, w_2, …, w_m}, in which at least one word is a foreign word. Each foreign word in a sentence corresponds to a label representing its language, with labels Y = {y_1, y_2, …, y_n}, where y ∈ {0, 1, …, q} and q is the number of foreign languages.
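As a toy illustration of this labelling scheme (the concrete words and label values here are assumptions for illustration), one annotated sentence could look as follows, with 0 for non-foreign words, 1 for English and 2 for German:

```python
# One sentence x_i and its per-word language labels (0 = non-foreign, 1 = en, 2 = de).
x_i = ["我", "明天", "presantation", "的", "PTT", "还要", "修改", "。"]
y_i = [0,    0,      1,              0,    1,     0,      0,      0]
```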
And 1022, the preset language detection model performs word segmentation on the input sentence to be recognized based on a sequence tagging mechanism, and outputs a language corresponding to a word of a target language existing in the sentence to be recognized.
Illustratively, the step 1022 includes:
step 10221, performing word segmentation on the sentence to be recognized to obtain a word segmentation list with at least one word segmentation, and adding preset special characters to the head and the tail of the word segmentation list respectively to indicate the beginning and the end.
For example, assume that the input sentence to be recognized is "我明天presantation的PTT还要修改。". It is first tokenized by the XLM-Roberta tokenizer into ['▁我', '明天', 'pres', 'an', 'tation', '的', 'P', 'TT', '还要', '修改', '。']. Then, the preset special characters <s> and </s> are added to the head and the tail of the token list, respectively, to represent the beginning and the end.
Step 10222, mapping each word in the word list to a corresponding identification number to obtain an identification number list.
For example, each token (participle) in the above participle list is mapped to an identification number (ID) to obtain an identification number list (ID list), where the ID list is:
[0,13129,72938,9518,66,22062,43,683,13739,41604,45123,30,2]。
step 10223, inputting the identification number list into the embedding layer of the preset language detection model to convert it into a first matrix with a first preset dimension (e.g., (13, 768)).
For example, the ID list is entered into the XLM-Roberta model, first through the Embedding layer (Embedding) of the XLM-Roberta model. The Embedding layer converts the ID list into a first matrix of length the same as the ID list and width XLM-Roberta dimension (768 or 1024 dimensions).
Step 10224, inputting the first matrix into the multi-layer Transformer of the language detection model for calculation, so as to output a second matrix with a second preset dimension (e.g., (13, 768)). The second matrix represents semantic information of the context of the participle corresponding to each identification number.
A Transformer is a transformation model that relies entirely on the self-attention mechanism to compute representations of its input and output. The most outstanding feature of the Transformer is its excellent parallel computing capability.
For example, the first matrix is sent to the encoder of the XLM-Roberta model, and after calculation by the multi-layer Transformer a second matrix is output whose length is the same as that of the ID list and whose width is the XLM-Roberta dimension (768 or 1024 dimensions).
Step 10225, inputting the second matrix into the fully connected layer of the preset language detection model, and performing normalization calculation on the output of the fully connected layer to obtain the language probability of the participle corresponding to each identification number.
For example, a linear transformation (the fully connected layer) converts the second matrix into a prediction matrix whose length is the same as that of the ID list and whose width is the number of label categories. At the position of each ID, a vector whose length is the number of label categories is obtained; after normalization (softmax) calculation, this gives the probability of each label category, i.e., the probability that the token (participle) corresponding to the ID belongs to each category.
Step 10226, determining the language corresponding to each participle according to the language probability of the participle.
The following describes the flow of the above-mentioned language detection with an application example.
For the sake of this example, it is assumed that the target languages (i.e., foreign languages) identified by the preset language detection model are English and German (but the invention is not limited to these two languages), denoted by the numerals 1 and 2, while "others" (non-foreign characters) is denoted by 0, so that there are 3 categories in total.
For example, take the input sentence to be recognized "我明天presantation的PTT还要修改。".
After the example sentence is tokenized and the special characters are added, the result is:
['<s>', '▁我', '明天', 'pres', 'an', 'tation', '的', 'P', 'TT', '还要', '修改', '。', '</s>'].
After tokenization, the length of the example sentence is 13. If the 768-dimensional XLM-Roberta is used, the dimension of the second matrix M output by the encoder is 13 × 768. The output layer is a linear layer which maps the second matrix M to a matrix with dimension 13 × 3; softmax normalization is then applied to obtain a 13 × 3 matrix in which each column is the category probability of the token (participle) at that position. Taking the position of the maximum probability in each column yields a vector of length 13: (0, 0, 0, 1, 1, 1, 0, 1, 1, 0, 0, 0, 0).
According to the above correspondence between category labels and numerals, this vector can be interpreted as ('o', 'o', 'o', 'en', 'en', 'en', 'o', 'en', 'en', 'o', 'o', 'o', 'o'), where 'o' represents a non-foreign character and 'en' represents English. Since the above example sentence does not contain German, the corresponding label is not output.
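A minimal sketch of this forward pass, assuming the HuggingFace transformers library, is shown below. The label set and the fine-tuning of the token-classification head on labelled code-switched sentences are assumptions; an untrained head will not reproduce the predictions above.

```python
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

LABELS = {0: "o", 1: "en", 2: "de"}   # "others", English, German (q = 2)

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModelForTokenClassification.from_pretrained("xlm-roberta-base",
                                                        num_labels=len(LABELS))
model.eval()

sentence = "我明天presantation的PTT还要修改。"
enc = tokenizer(sentence, return_tensors="pt")        # adds <s> ... </s> automatically
with torch.no_grad():
    logits = model(**enc).logits                      # shape (1, sequence_length, 3)
probs = torch.softmax(logits, dim=-1)                 # per-token language probabilities
pred = probs.argmax(dim=-1)[0].tolist()

for tok, label_id in zip(tokenizer.convert_ids_to_tokens(enc["input_ids"][0]), pred):
    print(tok, LABELS[label_id])
```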
After the target language corresponding to the language word to be detected is determined, the corresponding model (or called as "module") and the preset dictionary corresponding to the target language can be used for subsequent detection and error correction of the language word to be detected.
In step 102, the performing spelling detection and semantic detection on the language word to be detected includes: step 1021, inputting each word to be detected into a preset spelling detection model to detect whether there is a spelling error.
Illustratively, before performing the step 1021, the method further comprises:
Step 1020, inputting a preset number of language words into the preset spelling detection model for learning. This specifically comprises the following steps:
step 10201, inputting a preset number of reference language words into the encoder of the preset spelling detection model for encoding, where the preset number of reference language words includes a labeled correctly spelled word set and an incorrectly spelled word set.
Illustratively, the preset spelling detection model uses an XLM-Roberta-base model as an encoder and uses an unsupervised approach to detect word spelling errors.
Wherein, the difference between supervision and unsupervised lies in:
first, supervised is with tags, while unsupervised is without tags.
The supervised process is to train through known training samples (known inputs and corresponding outputs) to obtain an optimal model, and then apply the model to new data to map to an output result. After the process, the model has the prediction capability. Compared with supervision, the unsupervised method has no training process, and directly takes data for modeling analysis.
Second, supervised is classification and unsupervised is clustering.
The core of supervision is classification and the core of unsupervised is clustering (the division of a data set into classes consisting of similar objects). The supervised work is to select classifiers and determine weights, and the unsupervised work is to estimate density (find statistics describing data), i.e. the unsupervised algorithm can start working as long as it knows how to calculate similarity.
Third, supervised is the same dimension, while unsupervised is the dimension reduction.
If the supervised input is n-dimensional, the features are also identified in n dimensions, i.e., y = f(x_i) or p(y|x_i) with i = n, and supervised methods usually do not reduce dimensionality. Unsupervised methods, by contrast, often take part in deep learning by performing feature extraction, or directly adopt clustering over layers or items to reduce the dimensionality of the data features, so that i < n.
Fourth, supervised results are qualitative classifications, while unsupervised results are labeled qualitatively only after clustering.
The supervised output results, i.e., the classification results, are labeled directly. Unsupervised results are just clusters.
Step 10202, performing word segmentation on each coded word, inputting a word segmentation result into a hidden layer of the preset spelling detection model, and extracting output of the hidden layer to obtain multiple representations of each word.
Illustratively, before the predetermined spelling detection model performs spelling detection on words, a predetermined number of reference language words, including a correctly spelled word set { Ci } and an incorrectly spelled word set { Wi }, are encoded by the same encoder, and 12 hidden layer outputs in the model are extracted, so that 12 representations can be obtained for each word. Because the characteristics of words captured by each layer of model are different, usually, the low-level network can well capture the characteristics of grammatical phrases, and the high-level network extracts richer semantic characteristics, the clustering effect on each layer has different meanings for word error detection.
Note that the correctly spelled word set { Ci } and the incorrectly spelled word set { Wi } are collected in advance. The correctly spelled word set { Ci } may come from a standard dictionary in various languages; the misspelled word set Wi may come from common misspelled words collected manually.
Illustratively, for each of the 12 representations of each word in the above set of correctly spelled words {Ci} and set of misspelled words {Wi}, average pooling is applied over all of the word's tokens.
Suppose that tokenizing a certain word w yields n tokens (participles) {t_1, t_2, …, t_n}. The 12-layer representations of {t_1, t_2, …, t_n} are then computed, giving a total of n × 12 vectors:
{v_{1,1}, …, v_{1,12}, v_{2,1}, …, v_{2,12}, …, v_{n,1}, …, v_{n,12}};
where v_{i,j} denotes the representation of token t_i at the j-th layer. Average pooling averages all tokens at each layer: r_j = (v_{1,j} + v_{2,j} + … + v_{n,j}) / n. Here r_j is the average-pooled representation of the word at layer j, with dimension 768.
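A sketch of this 12-layer extraction and average pooling, assuming the HuggingFace transformers implementation of XLM-Roberta-base, might look as follows:

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
encoder = AutoModel.from_pretrained("xlm-roberta-base", output_hidden_states=True)
encoder.eval()

def layer_representations(word: str) -> torch.Tensor:
    """Return the 12 average-pooled representations r_1..r_12 of `word`, shape (12, 768)."""
    enc = tokenizer(word, return_tensors="pt", add_special_tokens=False)
    with torch.no_grad():
        out = encoder(**enc)
    # out.hidden_states = (embedding output, layer 1, ..., layer 12) for the base model
    layers = torch.stack(out.hidden_states[1:])   # (12, 1, n_tokens, 768)
    return layers.mean(dim=2).squeeze(1)          # average over the word's tokens

reps = layer_representations("presantation")
print(reps.shape)                                  # torch.Size([12, 768])
```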
Step 10203, performing aggregation processing on the multiple representations of each word in their respective spaces to obtain a clustering result of the correctly spelled word set and the incorrectly spelled word set.
Illustratively, the TSNE algorithm (a dimensionality reduction algorithm) is used to reduce the dimensionality of the above r_j: the average-pooled representations of all words are embedded into the same low-dimensional space, and the correctly spelled word set {Ci} and the incorrectly spelled word set {Wi} aggregate into different clusters in that planar space. The 12 representations are aggregated in their respective spaces, and in each space the correctly spelled word set {Ci} and the misspelled word set {Wi} are clearly separated, each aggregating into its own cluster.
For example, the set of correctly spelled words {Ci} and the set of misspelled words {Wi} yield 12 representations of each word in these sets, and the TSNE algorithm is used to embed them into 12 low-dimensional spaces (e.g., with dimension less than 10, such as 2, 3 or 4); in each space, the correctly spelled word set {Ci} and the misspelled word set {Wi} each aggregate into their own cluster.
When the projection space obtained by the correctly spelled word set { Ci } and the incorrectly spelled word set { Wi } is used as a reference, subsequent detection can be carried out on the language word to be detected.
Step 1021, inputting each word to be detected into a preset spelling detection model to detect whether a spelling error exists, includes:
step 10211, inputting each language word to be detected into the encoder of the preset spelling detection model for encoding.
Illustratively, when a certain language word to be detected is detected, the same encoder as the encoder is used for encoding, namely, an encoder based on an XLM-Roberta model is adopted.
Step 10212, performing word segmentation on the encoded language words to be detected, inputting the word segmentation result into the hidden layer of the preset spelling detection model, and extracting the output of the hidden layer to obtain multiple representations of each language word to be detected.
Illustratively, after calculation, the output is the average of the representations at each of the 12 Transformer layers, i.e., the average pooling described above.
For example, the language word to be detected is "presantation", which after tokenization becomes ['pres', 'an', 'tation'] with length 3. Each of the 12 Transformer layers yields a matrix of size 3 × 768. The average is then taken over the first dimension, giving 12 vectors of length 768, i.e., the 12 representations of the language word to be detected.
Step 10213, calculating a first average distance from each of the plurality of representations of the language words to be detected to all words in the correctly spelled word set, and calculating a second average distance from each of the plurality of representations of the language words to be detected to all words in the incorrectly spelled word set.
Illustratively, the TSNE algorithm is used to embed each of the words of the language to be detected into the 12 spaces described above under the control of the same parameters, calculate a first average distance from all the words in the correctly spelled cluster, and calculate a second average distance from all the words in the incorrectly spelled cluster. Resulting in 12 pairs of distances.
Step 10214, for each language word to be detected, if the value of the first average distance exceeds a first preset threshold, determining that the language word to be detected is a word with misspelling, and if the value of the second average distance exceeds a second preset threshold, determining that the language word to be detected is a word with correct spelling.
For example, assume that the 12 representations of the language word to be detected "presantation" are obtained; a first average distance to all words in the correctly spelled cluster and a second average distance to all words in the misspelled cluster are calculated. If, across the 12 spaces, the word "presantation" is more often closer to the misspelled cluster (e.g., the count exceeds a preset threshold), the word is determined to be a misspelled word and the subsequent spelling correction step is performed.
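The per-layer distance test can be sketched as follows, assuming scikit-learn's TSNE (which only offers fit_transform, so the word to be detected is embedded jointly with the reference words under the same parameters). The two thresholds and the final vote over the 12 layers are tuning choices not fixed by the description above.

```python
import numpy as np
from sklearn.manifold import TSNE

def spelling_vote_at_layer(word_rep, correct_reps, wrong_reps,
                           first_threshold, second_threshold):
    """One of the 12 per-layer votes for a single word (sketch).

    word_rep:     (768,) average-pooled representation of the word at this layer
    correct_reps: (Nc, 768) reference representations of correctly spelled words
    wrong_reps:   (Nw, 768) reference representations of misspelled words
    Returns "wrong", "correct" or None (undecided at this layer).
    """
    points = np.vstack([correct_reps, wrong_reps, word_rep[None, :]])
    low = TSNE(n_components=2, random_state=0, init="pca").fit_transform(points)
    nc = len(correct_reps)
    correct_low, wrong_low, word_low = low[:nc], low[nc:-1], low[-1]

    first_avg = np.linalg.norm(correct_low - word_low, axis=1).mean()   # to correct cluster
    second_avg = np.linalg.norm(wrong_low - word_low, axis=1).mean()    # to misspelled cluster
    if first_avg > first_threshold:     # far from the correctly spelled cluster
        return "wrong"
    if second_avg > second_threshold:   # far from the misspelled cluster
        return "correct"
    return None
```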
Step 1022, if the preset spelling detection model detects that there is no spelling error in each of the language words to be detected, inputting the sentence to be detected including at least one of the language words to be detected into a preset semantic detection model to detect whether there is a semantic error.
Illustratively, the preset semantic detection model also adopts the XLM-Roberta model as its backbone. The semantic detection task of the preset semantic detection model can be represented as a binary classification task: the data set X contains n sentences, X = {x_1, x_2, …, x_n}, and each sentence corresponds to a semantic label indicating whether its semantics are wrong.
The step 1022 specifically includes:
step 10221, performing word segmentation on the sentence to be detected, and inputting the word segmentation result into the encoder of the preset semantic detection model for processing to obtain a target matrix.
Illustratively, the input of the preset semantic detection model is a sentence to be detected, and the output of the preset semantic detection model is a two-classification prediction probability which represents whether the sentence to be detected has semantic errors.
Step 10222, inputting each column of the target matrix into the classifier of the preset semantic detection model for processing, and obtaining corresponding two-classification probabilities.
Illustratively, the input of the preset semantic detection model is a first column of a target matrix output by the encoder, and the output of the preset semantic detection model is a classification probability.
For example, the encoder outputs a target matrix M with dimension (13, 768) or (13, 1024). The output layer takes the first column M[0] of the target matrix M as input and feeds it into a linear binary classifier, which outputs a probability; the higher the probability value, the more likely the sentence is to contain a semantic error.
For the example sentence "我明天presentation的PTT还要修改。" (after spelling correction), the preset semantic detection model outputs a prediction close to 1, which indicates that the sentence to be detected contains a word with a semantic error, namely "PTT": although "PTT" is a correctly spelled abbreviation, the preset semantic detection model judges it to be a semantic error and considers "PPT" more appropriate.
Step 10223, determining whether semantic errors exist in the language words to be detected in the sentence to be detected according to the two-classification probability.
Step 1023, if the sentence to be detected has no semantic error, returning the text to be detected corresponding to the sentence to be detected as the detected text.
The semantic detection described above is described below by way of an application example.
Fig. 3 is a schematic flow chart of semantic detection provided in the embodiment of the present invention, as shown in fig. 3.
Step 301, inputting the sentence to be detected, which has no spelling error or is subjected to spelling correction, into the preset semantic detection model.
For example, "i like to see MBA games" is entered into the preset semantic detection model.
Step 302, performing word segmentation on the sentence to be detected and passing it through the Embedding layer of the preset semantic detection model and the multi-layer Transformer to obtain the last-layer output.
Step 303, extracting the representation of the special marker [CLS] from the last-layer output of the multi-layer Transformer for classification.
And step 304, judging whether the statement to be detected has semantic errors or not through a normalization function (softmax).
If the sentence to be detected has a semantic error, it enters the subsequent preset semantic error correction module for error correction; if there is no error, the text corresponding to the sentence to be detected is returned. The "MBA" shown in FIG. 3, although spelled correctly, is judged to be a semantic error by the preset semantic detection model, which considers "NBA" more appropriate.
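A minimal sketch of this binary semantic check, assuming a fine-tuned XLM-Roberta sequence classifier (label 1 meaning "contains a semantic error"), is shown below; without fine-tuning on labelled sentences the probability is meaningless.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
# Binary classifier over the <s> ([CLS]) representation; must be fine-tuned first.
detector = AutoModelForSequenceClassification.from_pretrained("xlm-roberta-base",
                                                              num_labels=2)
detector.eval()

def semantic_error_prob(sentence: str) -> float:
    enc = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        logits = detector(**enc).logits
    return torch.softmax(logits, dim=-1)[0, 1].item()   # probability of "semantic error"

print(semantic_error_prob("I like to watch MBA games"))  # close to 1 after fine-tuning
```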
Fig. 4 is a schematic diagram of a detection module and an error correction module according to an embodiment of the present invention, as shown in fig. 4. In step 103, if at least one of the language words to be detected has misspelling and/or semantic error, performing corresponding spell error correction and/or semantic error correction on the word with misspelling and/or semantic error includes:
and step 1031, if the language word to be detected has misspelling, inputting the language word to be detected into a preset spelling error correction model for spelling error correction processing.
Illustratively, for a spelling error of a language word to be detected (i.e., a foreign word), a preset dictionary (i.e., the language dictionary corresponding to the target language) is used for error correction. Each foreign word in the preset dictionary is annotated with its frequency. The error-correction candidate word with the minimum edit distance is obtained by a dynamic programming edit-distance algorithm, and if there are several candidate words, the word with the highest frequency among them is returned. The edit distance refers to the minimum number of edit operations required to convert one string into another. The edit operations used in embodiments of the invention include replacing one character with another, inserting a character and deleting a character, or combinations of these.
Exemplarily, the step 1031 includes:
the following operations are executed for each language word to be detected with spelling errors:
step 10311, calculating the edit distance between each word to be detected and each word in a preset dictionary through the preset spelling error correction model. The preset dictionary is a language dictionary corresponding to the target language.
Illustratively, the predetermined spell correction model may be a non-neural network model, the input of which is a word of the language to be detected (i.e., a foreign word), and the output of which is a corrected word. If the input foreign language word has no error, returning the foreign language word.
For example, taking the word "presantation" in the sentence to be detected as an example, the preset spelling correction model first calculates the edit distance between the word "presantation" and each word in the preset dictionary, and selects the word in the preset dictionary with the smallest edit distance from "presantation".
And step 10312, if a word with the minimum editing distance exists in the preset dictionary, replacing the language word to be detected with the word with the minimum editing distance so as to correct the language word to be detected.
And step 10313, if a plurality of words with the minimum editing distance exist in the preset dictionary, replacing the language words to be detected with the words with the highest returning frequency in the words with the minimum editing distance so as to correct the language words to be detected.
For example, in the above example sentence, if several words had the same edit distance, the word with the highest word frequency would be returned. The word with the smallest edit distance from "presantation" is "presentation"; the edit distance is 1 ("a" -> "e"), and there is one and only one such word, so "presentation" is returned.
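A sketch of this dictionary-based correction (plain dynamic-programming edit distance with a frequency tie-break; the dictionary format is an assumption) could be:

```python
def edit_distance(a: str, b: str) -> int:
    # Classic dynamic-programming edit distance (insert / delete / substitute).
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            cur = min(dp[j] + 1,            # delete ca
                      dp[j - 1] + 1,        # insert cb
                      prev + (ca != cb))    # substitute ca -> cb
            prev, dp[j] = dp[j], cur
    return dp[-1]

def correct_spelling(word: str, dictionary: dict[str, int]) -> str:
    """`dictionary` maps each word of the target language to its frequency."""
    if word in dictionary:
        return word                          # no error, return the word as-is
    # Smallest edit distance first; ties broken by the highest frequency.
    return min(dictionary, key=lambda w: (edit_distance(word, w), -dictionary[w]))

# e.g. correct_spelling("presantation", {"presentation": 1200, "precaution": 300})
# -> "presentation" (edit distance 1, "a" -> "e")
```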
Step 1032, inputting the sentence to be detected containing the language word to be detected after the spelling error correction processing into the preset semantic detection model to detect whether a semantic error exists.
And 1033, if the sentence to be detected has no semantic error, returning the text to be detected corresponding to the sentence to be detected after the spelling error correction as the detected text.
Step 1034, if the sentence to be detected has semantic error, inputting the sentence to be detected into a preset semantic error correction model for semantic error correction processing, and returning the text to be detected corresponding to the sentence to be detected after semantic error correction as a detected text.
Illustratively, prior to performing the step 1034, the method further includes building and pre-training a pre-training model.
The pre-training model is a model pre-trained on a large-scale corpus (generally, the data size exceeds 10 GB) using a masked language modeling (MLM) task.
Illustratively, the pre-training model is the XLM-Roberta model.
Illustratively, the pre-training model is pre-trained in the following manner:
a piece of text is used as input, some words in the text are masked, and the masked words are restored by the pre-training model.
For example, suppose a sentence in the input text is "this is the training process of the pre-training model". The pre-training model randomly masks it with [MASK], obtaining, say, "this is the training [MASK] of the pre-training [MASK][MASK]". Using this sentence as training data, the pre-training model is required to restore the characters hidden at each [MASK], i.e., "process" and "model".
It should be noted that the XLM-Roberta model is pre-trained and then trained again in a specific downstream task.
Illustratively, the structure of the pre-training model consists of an Embedding layer, an encoder layer and an output layer. The Embedding layer maps the input text and converts it into a series of vectors, called the vector representation of the text. The encoder layer is formed by stacking a number of Transformer layers, usually 12 or 24, which apply a series of nonlinear transformations to the vector representation. The output layer makes a prediction for a specific task using the output of the encoder (i.e., the output of the last Transformer layer). The design of the output layer may differ for different tasks. For example, in the above preset language detection model, the output of the last Transformer layer is taken and sent to the fully connected layer and softmax for classification, to judge the language of each token.
Illustratively, prior to performing the step 1034, the method further includes:
and training the preset semantic error correction model.
Illustratively, the preset semantic error correction model is trained as follows:
and taking a preset number of correct target language texts or main language texts containing target language words as a training set, inputting the texts in the training set into the preset semantic error correction model, and randomly masking partial words in the input texts by the preset semantic error correction model to obtain [ MASK ]. For each masked word, training the preset semantic error correction model to replace the masked word with the predicted word, namely training the model to restore the [ MASK ] word.
It should be noted that the above preset semantic error correction model is based on the XLM-Roberta model, but the unit hidden by [MASK] during training is a complete word that must be restored as a whole, rather than a finer-grained token (sub-word); it therefore differs from the MLM task of the above pre-training model. For example:
suppose a sentence in The input text is "The green-rooted brilliant is a large hummingbird". The hummingbird was then entirely replaced by [ MASK ]: the green-grown brilliant a large [ MASK ] ", and The preset semantic error correction model is trained to predict [ MASK ] as hummingbird.
In the MLM task of the pre-training model, "hummingbird" is segmented into two tokens (i.e. "humming" and "bird"), and the whole word "hummingbird" does not exist in the output vocabulary, so the restoration prediction task cannot be realized directly. To realize it, the output prediction vocabulary of the pre-training model is expanded by adding all common words, such as "hummingbird", so that the model can restore "hummingbird" directly from [MASK].
Therefore, the preset semantic error correction model differs from the pre-training model (MLM) as follows: when restoring [MASK], the output vocabulary of the pre-training model (MLM) is the same as its input vocabulary, whereas the output vocabulary of the semantic error correction model is different from, and larger than, its input vocabulary, so that complete words can be restored.
Specifically, after the encoder computes the matrix M for the input "The green-crowned brilliant is a large [MASK]", the preset semantic error correction model computes the dot product between the vector at the [MASK] position and the vector of each word in the vocabulary (v1, v2, ..., vn). During training the word with the maximum product is taken; during error correction the words corresponding to the top three products (top-3, e.g. including "hummingbird") can be taken as predictions. The vocabulary is typically on the order of hundreds of thousands of words, so as to cover as many words as possible.
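A small numerical sketch of this dot-product step is given below, assuming the encoder vector at the [MASK] position and the vocabulary word vectors are already available; the toy vocabulary, vector values and dimensions are purely illustrative.

import numpy as np

def restore_mask(mask_vec, vocab_vecs, vocab_words, k=3):
    # Score every vocabulary word by the dot product between the vector at the
    # [MASK] position and that word's vector, and return the top-k words.
    scores = vocab_vecs @ mask_vec                # one dot product per word
    top = np.argsort(scores)[::-1][:k]
    return [vocab_words[i] for i in top]

# Toy example: a 4-word vocabulary with 3-dimensional vectors; the [MASK] vector
# is deliberately aligned with the vector of "hummingbird".
vocab_words = ["hummingbird", "sparrow", "mammal", "flower"]
vocab_vecs = np.array([[ 0.9,  0.1, 0.0],
                       [ 0.2,  0.8, 0.1],
                       [-0.5,  0.3, 0.7],
                       [ 0.1, -0.4, 0.6]])
mask_vec = np.array([1.0, 0.2, 0.0])
print(restore_mask(mask_vec, vocab_vecs, vocab_words))   # -> ['hummingbird', 'sparrow', 'flower']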
Illustratively, the step 1034 includes:
step 10341, extracting at least one word to be corrected from the sentence to be detected with semantic error.
For example, a sentence to be detected in the input text is "the zhurou Mars rover successfully landed on Mars", where "zhurou" is judged to be valid pinyin by the preset spelling detection model, i.e. no spelling error exists, but the preset semantic detection model detects that the semantics are incorrect.
Step 10342, querying a preset dictionary, and taking all candidate words in the preset dictionary whose edit distance from the word to be corrected is less than or equal to a preset threshold as a candidate set.
For example, for the above example, the valid pinyin strings whose edit distance from "zhurou" is less than or equal to the preset threshold (for example, the preset threshold is set to 3) include zhurong (edit distance 2), chirou (edit distance 2), wenrou (edit distance 3), and the like.
Step 10343, masking the word to be error-corrected in the sentence to be detected with a preset marker symbol, and inputting the masked sentence to be detected into the preset semantic error correction model.
For example, "zhurou" is masked, and the input becomes "[MASK] Mars rover successfully landed on Mars".
Step 10344, predicting, by the preset semantic error correction model, a probability of replacing the masked word to be error corrected with each candidate word in the candidate set, and using the candidate word corresponding to the highest probability value as the corrected word.
For example, the preset semantic error correction model predicts that the word with the highest probability at [MASK] is "zhurong", so "zhurong" is returned as the correction word.
The error correction process of the preset semantic error correction model is further described below with an application example.
For example, the earlier example sentence "the PTT for my presantation tomorrow is still being revised" (in the original, a Chinese sentence containing the English words "presantation" and "PTT") is used for illustration. A semantic error has been detected by the above preset semantic detection model, so the possibly erroneous words "presentation" and "PTT" are extracted from the sentence. Note that "presantation" has already been corrected to "presentation" by the above preset spelling error correction model.
Then, a preset threshold is set (for example, the threshold is 1), the preset dictionary is queried, and all candidate words whose edit distance from the word to be corrected is less than or equal to the preset threshold are taken as a candidate set. The candidate set of "presentation" contains only itself, so "presentation" does not need to be modified, while the candidate set of "PTT" includes (PTT, PPT, pot), etc.
Both "presentation" and "PTT" are then masked with [MASK], giving roughly "the [MASK] for my [MASK] tomorrow is still being revised". This is fed into the preset semantic error correction model, which restores the words at the two [MASK] positions respectively. For each [MASK], the top three predictions (top-3) are taken and ranked by probability from high to low; for the "PTT" position this gives "topic, PPT, paper", in which "PPT" appears in the candidate set (PTT, PPT, pot) and has the highest probability among the candidates, so the prediction result "PPT" is returned.
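The candidate-constrained selection just described can be sketched as follows; here predict_topk merely stands in for the preset semantic error correction model (returning word/probability pairs for a masked position) and is not part of any real library.

def correct_word(word, candidates, predict_topk, k=3):
    # Mask `word`, take the model's top-k restorations (already sorted by
    # probability, high to low), and return the first one that appears in the
    # candidate set; keep the original word if no prediction is admissible.
    if candidates <= {word}:                 # e.g. "presentation": nothing to change
        return word
    for predicted, _prob in predict_topk(word, k):
        if predicted in candidates:
            return predicted
    return word

def fake_topk(word, k):
    # Stand-in for the model's assumed top-3 output at the masked "PTT" position,
    # mirroring the example above.
    return [("topic", 0.41), ("PPT", 0.38), ("paper", 0.11)]

print(correct_word("PTT", {"PTT", "PPT", "pot"}, fake_topk))   # -> "PPT"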
In step 1034, returning the text to be detected corresponding to the sentence to be detected after semantic error correction as the detected text includes:
and 10345, combining the correction words with the sentences to be detected.
Illustratively, during merging, if the length of a correction word is inconsistent with the length of a corresponding word to be corrected or a plurality of correction words exist in a sentence, deleting the word to be corrected in the sentence to be detected, and then inserting the corresponding correction word.
And 10346, returning the text to be detected corresponding to the combined sentence to be detected as the detected text.
For example, taking the above example sentence, the corrected text is "the PPT for my presentation tomorrow is still being revised", and the returned error correction positions are ([3,15), [17,20)) (0-based, left-closed right-open intervals), corresponding respectively to "presantation" and "PTT" in the original sentence.
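The merge step of steps 10345 and 10346 can be sketched as follows, assuming words are replaced from left to right and that error positions are reported as 0-based, left-closed right-open character intervals of the original sentence; the function name and the toy English sentence are illustrative (the patent example uses a mixed Chinese-English sentence).

def merge_corrections(sentence, corrections):
    # Delete each word to be corrected and insert its correction, left to right;
    # return the merged sentence and the [start, end) span of every original
    # error as 0-based character offsets in the original sentence.
    merged, spans, search_from, offset = sentence, [], 0, 0
    for wrong, right in corrections:
        start = sentence.index(wrong, search_from)
        end = start + len(wrong)
        spans.append((start, end))
        merged = merged[:start + offset] + right + merged[end + offset:]
        offset += len(right) - len(wrong)
        search_from = end
    return merged, spans

text = "my presantation PTT needs revision"
print(merge_corrections(text, [("presantation", "presentation"), ("PTT", "PPT")]))
# -> ('my presentation PPT needs revision', [(3, 15), (16, 19)])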
In summary, the multilingual text detection and correction method provided by the present invention identifies the foreign-language characters in the input text, determines their language, performs spelling detection and semantic detection on them with artificial intelligence models, and performs corresponding spelling correction and/or semantic correction according to whether spelling errors and semantic errors exist. The method covers cross-language scenarios, can better understand text semantics in a cross-language context, detects all characters of the target language in the text, and corrects only the words that contain errors.
The multilingual text detection and correction method provided by the embodiment of the present invention is described below with reference to fig. 5 and 6. For a more detailed description of the steps in fig. 6, reference may also be made to the above description, which is not repeated below.
Fig. 5 is a schematic diagram of a multi-language text detection and correction method according to an embodiment of the present invention, and fig. 6 is a schematic flowchart corresponding to fig. 5, as shown in fig. 5 and fig. 6.
Step 601, acquiring an input text to be detected.
Step 602, a preprocessing module preprocesses the text to be detected.
Illustratively, the preprocessing includes data processing and cleaning to delete illegal characters and garbled characters caused by encoding errors in the text to be detected, and an LTP tool is used to split the text to be detected into sentences to obtain the sentences to be recognized.
Step 603, the preprocessing module performs foreign language character recognition on the sentence to be recognized.
Exemplarily, foreign character recognition is performed on each sentence to be recognized by using the unicodedata module; if foreign characters exist, their positions in the sentence to be recognized are recorded. If no foreign characters exist, the original text is returned and the process ends.
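As a minimal sketch, foreign character recognition with the standard Python unicodedata module may look as follows; treating Latin-script characters as foreign relative to a Chinese main language is a simplifying assumption for illustration.

import unicodedata

def foreign_char_positions(sentence):
    # Return the positions of Latin-script characters in the sentence, as a
    # simple proxy for foreign characters when the main language is Chinese.
    return [i for i, ch in enumerate(sentence)
            if unicodedata.name(ch, "").startswith("LATIN")]

print(foreign_char_positions("zhurou火星车成功登陆火星"))   # -> [0, 1, 2, 3, 4, 5]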
Step 604, the preset language detection model determines the language of each foreign word in the sentence to be recognized.
Illustratively, the sequence tagging mechanism in the XLM-Roberta model is utilized to determine the language of each foreign word in the sentence to be recognized.
Step 605, the foreign word is input into the preset spelling detection model.
Step 606, the preset spelling detection model detects whether the foreign word has a spelling error.
Illustratively, the XLM-Roberta model is used to detect whether a foreign word has a spelling error in an unsupervised manner. If a spelling error exists, go to step 607; if no spelling errors exist, step 608 is performed.
Step 607, the preset spelling correction model calculates the minimum edit distance using the foreign language dictionary corresponding to the foreign language word, and corrects the foreign language word.
Step 608, the text to be detected is input into a preset semantic detection model.
Step 609, a preset semantic detection model detects whether each foreign word in the text to be detected has semantic error.
Illustratively, an XLM-Roberta model is used to detect whether semantic errors exist in each foreign word in the text to be detected. If there is a semantic error, go to step 611; if there is no semantic error, step 610 is performed.
Step 610, the preset semantic detection model returns the spell corrected text or the original text.
Step 611, the preset semantic error correction model corrects the foreign words with semantic errors.
Illustratively, the XLM-Roberta model is used to correct foreign words that have semantic errors.
Step 612, the preset semantic error correction model returns the text after error correction.
The multi-language text detection and correction system provided by the invention is described below, and the multi-language text detection and correction system described below and the multi-language text detection and correction method described above can be referred to correspondingly.
FIG. 7 is a schematic structural diagram of a multi-language text detection and correction system provided by the present invention. As shown in FIG. 7, the present invention provides a multilingual text detection and correction system 700, which comprises a recognition module 710, a detection module 720 and an error correction module 730.
The recognition module 710 is configured to obtain a text to be detected, and perform multilingual character recognition on the text to be detected to obtain at least one sentence to be recognized, where the sentence to be recognized includes characters of a main language and characters of at least one target language, and the main language is different from the target language.
The detecting module 720 is configured to perform language detection on the characters of the target language in the sentence to be recognized to obtain a language word to be detected, and perform spelling detection and semantic detection on the language word to be detected.
And the error correction module 730 is configured to, if at least one of the language words to be detected has a spelling error and/or a semantic error, perform corresponding spelling error correction and/or semantic error correction on the word having the spelling error and/or the semantic error.
Illustratively, the identifying module 710 is further configured to:
carrying out data cleaning on the text to be detected so as to delete illegal characters in the text to be detected and messy codes caused by coding errors;
sentence dividing is carried out on the text to be detected to obtain at least one sentence to be recognized, and blank characters of a sentence head and a sentence tail of each sentence to be recognized are deleted;
and identifying characters of the sentence to be identified, and recording the position of the characters of the target language in the sentence to be identified when the characters of the target language exist.
Illustratively, the detecting module 720 is further configured to:
inputting a sentence to be recognized with characters of a target language into a preset language detection model;
the preset language detection model divides the input sentence to be recognized based on a sequence labeling mechanism and outputs the language corresponding to the word of the target language in the sentence to be recognized.
Illustratively, the detecting module 720 is further configured to:
performing word segmentation on the sentence to be recognized to obtain a word segmentation list with at least one word segmentation, and adding preset special characters to the head and the tail of the word segmentation list respectively to represent the beginning and the end;
mapping each participle in the participle list to a corresponding identification number to obtain an identification number list;
inputting the identification number list into an embedding layer of the preset language detection model so as to convert the identification number list into a first matrix with a first preset dimensionality;
inputting the first matrix into the multi-layer Transformer of the preset language detection model for calculation, so as to output a second matrix with a second preset dimensionality;
inputting the second matrix into a full-connection layer of the preset language detection model, and performing normalization calculation on the output of the full-connection layer to obtain the language probability of the participle corresponding to each identification number;
and determining the language corresponding to each participle according to the language probability of each participle.
Illustratively, the detecting module 720 is further configured to:
inputting each language word to be detected into a preset spelling detection model to detect whether spelling errors exist or not;
if the preset spelling detection model detects that each language word to be detected has no spelling error, inputting a sentence to be detected containing at least one language word to be detected into a preset semantic detection model so as to detect whether a semantic error exists;
and if the sentence to be detected has no semantic error, returning the text to be detected corresponding to the sentence to be detected as the detected text.
Illustratively, the error correction module 730 is further configured to:
if the language word to be detected has spelling errors, inputting the language word to be detected into a preset spelling error correction model to carry out spelling error correction;
inputting the sentence to be detected containing the language word to be detected after spelling error correction processing into the preset semantic detection model so as to detect whether semantic errors exist or not;
and if the sentence to be detected has no semantic error, returning the text to be detected corresponding to the sentence to be detected after the spelling error correction as the detected text.
Illustratively, the error correction module 730 is further configured to:
if the sentence to be detected has semantic errors, inputting the sentence to be detected into a preset semantic error correction model to perform semantic error correction processing;
and returning the text to be detected corresponding to the sentence to be detected after semantic error correction as the detected text.
Illustratively, the system further comprises a cluster referencing module for:
inputting a preset number of reference language words into an encoder of the preset spelling detection model for encoding, wherein the preset number of reference language words comprises a marked correctly spelled word set and an incorrectly spelled word set;
performing word segmentation on each coded word, inputting a word segmentation result into a hidden layer of the preset spelling detection model, and extracting the output of the hidden layer to obtain various representations of each word;
aggregating the multiple representations of each word in respective spaces to obtain clustered results of the correctly spelled word set and the misspelled word set.
Illustratively, the detecting module 720 is further configured to:
inputting each language word to be detected into an encoder of the preset spelling detection model for encoding;
segmenting the coded language words to be detected, inputting segmentation results into a hidden layer of the preset spelling detection model, and extracting the output of the hidden layer to obtain multiple representations of each language word to be detected;
calculating a first average distance between each word to be detected and all words in the correctly spelled word set and calculating a second average distance between each word to be detected and all words in the incorrectly spelled word set according to the multiple representations of each word to be detected;
and for each language word to be detected, if the value of the first average distance exceeds a first preset threshold value, determining that the language word to be detected is a word with misspelling, and if the value of the second average distance exceeds a second preset threshold value, determining that the language word to be detected is a word with correct spelling.
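The distance-based spelling decision described above can be sketched with NumPy as follows, assuming each word has already been reduced to a single vector representation; the thresholds, the Euclidean metric and the toy clusters are illustrative assumptions.

import numpy as np

def avg_distance(vec, reference_vecs):
    # Mean Euclidean distance between one word vector and a set of reference vectors.
    return float(np.mean(np.linalg.norm(reference_vecs - vec, axis=1)))

def classify_spelling(word_vec, correct_vecs, misspelled_vecs, t1=1.0, t2=1.0):
    # First average distance: to the correctly spelled reference set; exceeding the
    # first threshold flags the word as misspelled. Second average distance: to the
    # misspelled reference set; exceeding the second threshold flags it as correct.
    if avg_distance(word_vec, correct_vecs) > t1:
        return "misspelled"
    if avg_distance(word_vec, misspelled_vecs) > t2:
        return "correct"
    return "undetermined"

# Toy clusters: correctly spelled words near the origin, misspelled words shifted away.
rng = np.random.default_rng(1)
correct_vecs = rng.normal(0.0, 0.1, size=(20, 8))
misspelled_vecs = rng.normal(3.0, 0.1, size=(20, 8))
print(classify_spelling(misspelled_vecs[0], correct_vecs, misspelled_vecs))  # -> "misspelled"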
Illustratively, the detecting module 720 is further configured to:
performing word segmentation on the sentence to be detected, and inputting a word segmentation result into an encoder of the preset semantic detection model for processing to obtain a target matrix;
inputting each column of the target matrix into a classifier of the preset semantic detection model for processing to obtain corresponding two classification probabilities;
and determining whether semantic errors exist in the words of the target language in the sentence to be detected according to the two classification probabilities.
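As a rough sketch, the per-position binary classification over the target matrix may be expressed as follows; the dimensions and the use of a single linear layer as the classifier are illustrative assumptions (here each row of the random matrix plays the role of a column of the target matrix).

import torch
import torch.nn as nn

hidden, seq_len = 256, 8                        # illustrative dimensions
target_matrix = torch.randn(seq_len, hidden)    # one encoder vector per token position

classifier = nn.Linear(hidden, 2)               # binary classes: [no error, error]
probs = classifier(target_matrix).softmax(dim=-1)   # (seq_len, 2) probabilities

# A target-language word is judged semantically wrong when, at its position,
# the "error" probability exceeds the "no error" probability.
print(probs[:, 1] > probs[:, 0])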
Illustratively, the error correction module 730 is further configured to:
the following operations are executed for each language word to be detected with spelling errors:
calculating the editing distance between each language word to be detected and each word in a preset dictionary through the preset spelling error correction model;
if a word with the minimum editing distance exists in the preset dictionary, replacing the language word to be detected with the word with the minimum editing distance so as to correct the language word to be detected;
if a plurality of words with the minimum edit distance exist in the preset dictionary, replacing the language word to be detected with the word of highest word frequency among the words with the minimum edit distance, so as to correct the language word to be detected;
and the preset dictionary is a language dictionary corresponding to the target language.
Illustratively, the error correction module 730 is further configured to:
extracting at least one word to be corrected from the sentence to be detected with semantic errors;
querying a preset dictionary, and taking all candidate words in the preset dictionary, of which the editing distance from the word to be corrected is smaller than or equal to a preset threshold value, as a candidate set;
masking the words to be corrected in the sentences to be detected by using preset marker symbols, and inputting the sentences to be detected after masking treatment into a preset semantic error correction model;
and predicting the probability of replacing the hidden word to be corrected by each candidate word in the candidate set by using the preset semantic error correction model, and taking the candidate word with the highest probability value as a corrected word.
Illustratively, the error correction module 730 is further configured to:
merging the correction words with the sentences to be detected;
during merging, if the length of a correction word is inconsistent with the length of a corresponding word to be corrected or a plurality of correction words exist in a sentence, deleting the word to be corrected in the sentence to be detected, and then inserting the corresponding correction word;
and returning the text to be detected corresponding to the combined sentence to be detected as the detected text.
Illustratively, the system further comprises a training module for:
training the preset semantic error correction model according to the following mode:
taking a preset number of correct target language texts or main language texts containing target language words as a training set;
inputting the texts in the training set into the preset semantic error correction model, and randomly masking part of words in the input texts by the preset semantic error correction model;
for each masked word, training the preset semantic error correction model to replace the masked word with the predicted word.
Fig. 8 illustrates a physical structure diagram of an electronic device, and as shown in fig. 8, the electronic device may include: a Processor (Processor)810, a communication Interface 820, a Memory 830 and a communication bus 840, wherein the Processor 810, the communication Interface 820 and the Memory 830 communicate with each other via the communication bus 840. The processor 810 may invoke logic instructions in the memory 830 to perform the multilingual text detection and correction method, the method comprising:
acquiring a text to be detected, and performing multi-language character recognition on the text to be detected to obtain at least one sentence to be recognized, wherein the sentence to be recognized comprises characters of a main language and characters of at least one target language, and the main language is different from the target language;
performing language detection on characters of a target language in the sentence to be recognized to obtain a language word to be detected, and performing spelling detection and semantic detection on the language word to be detected;
and if at least one language word to be detected has spelling errors and/or semantic errors, carrying out corresponding spelling error correction and/or semantic error correction on the word with the spelling errors and/or the semantic errors.
In addition, the logic instructions in the memory 830 may be implemented in software functional units and stored in a computer readable storage medium when the logic instructions are sold or used as independent products. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
In another aspect, the present invention also provides a computer program product comprising a computer program stored on a non-transitory computer readable storage medium, the computer program comprising program instructions which, when executed by a computer, enable the computer to perform the multilingual text detection and correction method provided by the above-described methods.
In yet another aspect, the present invention also provides a non-transitory computer readable storage medium having stored thereon a computer program, which when executed by a processor is implemented to perform the multi-lingual text detection and correction methods provided above.
The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (17)

1. A method for multi-lingual text detection and correction, the method comprising:
acquiring a text to be detected, and performing multi-language character recognition on the text to be detected to obtain at least one sentence to be recognized, wherein the sentence to be recognized comprises characters of a main language and characters of at least one target language, and the main language is different from the target language;
performing language detection on characters of a target language in the sentence to be recognized to obtain a language word to be detected, and performing spelling detection and semantic detection on the language word to be detected;
and if at least one language word to be detected has spelling errors and/or semantic errors, carrying out corresponding spelling error correction and/or semantic error correction on the word with the spelling errors and/or the semantic errors.
2. The method for detecting and correcting the multilingual text according to claim 1, wherein the obtaining the text to be detected and performing multilingual character recognition on the text to be detected to obtain at least one sentence to be recognized comprises:
carrying out data cleaning on the text to be detected so as to delete illegal characters in the text to be detected and messy codes caused by coding errors;
sentence dividing is carried out on the text to be detected to obtain at least one sentence to be recognized, and blank characters of a sentence head and a sentence tail of each sentence to be recognized are deleted;
and identifying characters of the sentence to be identified, and recording the position of the characters of the target language in the sentence to be identified when the characters of the target language exist.
3. The multilingual text-detection and error-correction method of claim 1, wherein the language-detection of the characters of the target language in the sentence to be recognized comprises:
inputting a sentence to be recognized with characters of a target language into a preset language detection model;
the preset language detection model divides the input sentence to be recognized based on a sequence labeling mechanism and outputs the language corresponding to the word of the target language in the sentence to be recognized.
4. The method for detecting and correcting the multilingual text according to claim 3, wherein the predetermined language detection model segmenting the input sentence to be recognized based on a sequential labeling mechanism and outputting the language corresponding to the word of the target language in the sentence to be recognized comprises:
performing word segmentation on the sentence to be recognized to obtain a word segmentation list with at least one word segmentation, and adding preset special characters to the head and the tail of the word segmentation list respectively to represent the beginning and the end;
mapping each participle in the participle list to a corresponding identification number to obtain an identification number list;
inputting the identification number list into an embedding layer of the preset language detection model so as to convert the identification number list into a first matrix with a first preset dimensionality;
inputting the first matrix into the multi-layer Transformer of the preset language detection model for calculation, so as to output a second matrix with a second preset dimensionality;
inputting the second matrix into a full-connection layer of the preset language detection model, and performing normalization calculation on the output of the full-connection layer to obtain the language probability of the participle corresponding to each identification number;
and determining the language corresponding to each participle according to the language probability of each participle.
5. The multilingual text detection and correction method of claim 1, wherein performing the spelling detection and semantic detection on the language words to be detected comprises:
inputting each language word to be detected into a preset spelling detection model to detect whether spelling errors exist or not;
if the preset spelling detection model detects that each language word to be detected has no spelling error, inputting a sentence to be detected containing at least one language word to be detected into a preset semantic detection model so as to detect whether a semantic error exists;
and if the sentence to be detected has no semantic error, returning the text to be detected corresponding to the sentence to be detected as the detected text.
6. The method for multi-lingual text detection and correction according to claim 5, wherein, if at least one of the language words to be detected has misspelling and/or semantic errors, performing corresponding misspelling and/or semantic correction on the misspelling and/or semantic error-containing word comprises:
if the language word to be detected has spelling errors, inputting the language word to be detected into a preset spelling error correction model to carry out spelling error correction;
inputting the sentence to be detected containing the language word to be detected after spelling error correction processing into the preset semantic detection model so as to detect whether semantic errors exist or not;
and if the sentence to be detected has no semantic error, returning the text to be detected corresponding to the sentence to be detected after the spelling error correction as the detected text.
7. The method according to claim 6, wherein if at least one of the language words to be detected has misspelling and/or semantic error, the performing spelling and/or semantic error correction on the misspelling and/or semantic error-containing word comprises:
if the sentence to be detected has semantic errors, inputting the sentence to be detected into a preset semantic error correction model to perform semantic error correction processing;
and returning the text to be detected corresponding to the sentence to be detected after semantic error correction as the detected text.
8. The method of claim 5, wherein before entering each of said language words to be detected into a predetermined spelling detection model to detect the presence of spelling errors, said method further comprises:
inputting a preset number of reference language words into an encoder of the preset spelling detection model for encoding, wherein the preset number of reference language words comprises a marked correctly spelled word set and an incorrectly spelled word set;
performing word segmentation on each coded word, inputting a word segmentation result into a hidden layer of the preset spelling detection model, and extracting the output of the hidden layer to obtain various representations of each word;
aggregating the multiple representations of each word in respective spaces to obtain clustered results of the correctly spelled word set and the misspelled word set.
9. The method of claim 8, wherein said entering each of said language words to be detected into a predetermined spelling detection model to detect the presence of spelling errors comprises:
inputting each language word to be detected into an encoder of the preset spelling detection model for encoding;
segmenting the coded language words to be detected, inputting segmentation results into a hidden layer of the preset spelling detection model, and extracting the output of the hidden layer to obtain multiple representations of each language word to be detected;
calculating a first average distance between each word to be detected and all words in the correctly spelled word set and calculating a second average distance between each word to be detected and all words in the incorrectly spelled word set according to the multiple representations of each word to be detected;
and for each language word to be detected, if the value of the first average distance exceeds a first preset threshold value, determining that the language word to be detected is a word with misspelling, and if the value of the second average distance exceeds a second preset threshold value, determining that the language word to be detected is a word with correct spelling.
10. The method for detecting and correcting multi-lingual text according to claim 5, wherein the step of inputting the sentence to be detected containing at least one of the language words to be detected into a predetermined semantic detection model to detect whether a semantic error exists comprises:
performing word segmentation on the sentence to be detected, and inputting a word segmentation result into an encoder of the preset semantic detection model for processing to obtain a target matrix;
inputting each column of the target matrix into a classifier of the preset semantic detection model for processing to obtain corresponding two classification probabilities;
and determining whether semantic errors exist in the words of the target language in the sentence to be detected according to the two classification probabilities.
11. The method for multi-lingual text detection and correction according to claim 6, wherein said entering the language word to be detected into a predetermined spelling error correction model for spelling error correction if the language word to be detected has a spelling error comprises:
the following operations are executed for each language word to be detected with spelling errors:
calculating the editing distance between each language word to be detected and each word in a preset dictionary through the preset spelling error correction model;
if a word with the minimum editing distance exists in the preset dictionary, replacing the language word to be detected with the word with the minimum editing distance so as to correct the language word to be detected;
if a plurality of words with the minimum edit distance exist in the preset dictionary, replacing the language word to be detected with the word of highest word frequency among the words with the minimum edit distance, so as to correct the language word to be detected;
and the preset dictionary is a language dictionary corresponding to the target language.
12. The method for detecting and correcting the multilingual text according to claim 7, wherein the entering the sentence to be detected into a preset semantic error correction model for semantic error correction comprises:
extracting at least one word to be corrected from the sentence to be detected with semantic errors;
querying a preset dictionary, and taking all candidate words in the preset dictionary, of which the editing distance from the word to be corrected is smaller than or equal to a preset threshold value, as a candidate set;
masking the words to be corrected in the sentences to be detected by using preset marker symbols, and inputting the sentences to be detected after masking treatment into a preset semantic error correction model;
and predicting the probability of replacing the hidden word to be corrected by each candidate word in the candidate set by using the preset semantic error correction model, and taking the candidate word with the highest probability value as a corrected word.
13. The multilingual text-detecting and correcting method of claim 12, wherein returning the text to be detected corresponding to the sentence to be detected after semantic correction as the detected text comprises:
merging the correction words with the sentences to be detected;
during merging, if the length of a correction word is inconsistent with the length of a corresponding word to be corrected or a plurality of correction words exist in a sentence, deleting the word to be corrected in the sentence to be detected, and then inserting the corresponding correction word;
and returning the text to be detected corresponding to the combined sentence to be detected as the detected text.
14. The method for detecting and correcting the multilingual text according to claim 7, wherein before the sentence to be detected is input into a predetermined semantic error correction model, the method further comprises:
training the preset semantic error correction model according to the following mode:
taking a preset number of correct target language texts or main language texts containing target language words as a training set;
inputting the texts in the training set into the preset semantic error correction model, and randomly masking part of words in the input texts by the preset semantic error correction model;
for each masked word, training the preset semantic error correction model to replace the masked word with the predicted word.
15. A multilingual text detection and correction system, the system comprising:
the recognition module is used for acquiring a text to be detected and performing multi-language character recognition on the text to be detected to obtain at least one sentence to be recognized, wherein the sentence to be recognized comprises characters of a main language and characters of at least one target language, and the main language is different from the target language;
the detection module is used for performing language detection on the characters of the target language in the sentence to be recognized to obtain the language words to be detected and performing spelling detection and semantic detection on the language words to be detected;
and the error correction module is used for carrying out corresponding spelling error correction and/or semantic error correction on the words with spelling errors and/or semantic errors if at least one language word to be detected has spelling errors and/or semantic errors.
16. An electronic device comprising a memory, a processor and a computer program stored on said memory and executable on said processor, wherein said processor when executing said program performs the steps of the method of multi-lingual text detection and correction according to any of claims 1 to 14.
17. A non-transitory computer readable storage medium having stored thereon a computer program, wherein the computer program when executed by a processor implements the steps of the multilingual text detection and correction method of any of claims 1-14.
CN202111576592.8A 2021-12-22 2021-12-22 Multi-language text detection and correction method, system, electronic device and storage medium Pending CN114282527A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111576592.8A CN114282527A (en) 2021-12-22 2021-12-22 Multi-language text detection and correction method, system, electronic device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111576592.8A CN114282527A (en) 2021-12-22 2021-12-22 Multi-language text detection and correction method, system, electronic device and storage medium

Publications (1)

Publication Number Publication Date
CN114282527A true CN114282527A (en) 2022-04-05

Family

ID=80873905

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111576592.8A Pending CN114282527A (en) 2021-12-22 2021-12-22 Multi-language text detection and correction method, system, electronic device and storage medium

Country Status (1)

Country Link
CN (1) CN114282527A (en)


Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114818669A (en) * 2022-04-26 2022-07-29 北京中科智加科技有限公司 Method for constructing name error correction model and computer equipment
CN114818669B (en) * 2022-04-26 2023-06-27 北京中科智加科技有限公司 Method for constructing name error correction model and computer equipment
CN115587599A (en) * 2022-09-16 2023-01-10 粤港澳大湾区数字经济研究院(福田) Quality detection method and device for machine translation corpus
CN115809662A (en) * 2023-02-03 2023-03-17 北京匠数科技有限公司 Text content abnormity detection method, device, equipment and medium
CN115910035A (en) * 2023-03-01 2023-04-04 广州小鹏汽车科技有限公司 Voice interaction method, server and computer readable storage medium
CN116306601A (en) * 2023-05-17 2023-06-23 上海蜜度信息技术有限公司 Training method, error correction method, system, medium and equipment for small language error correction model
CN116306601B (en) * 2023-05-17 2023-09-08 上海蜜度信息技术有限公司 Training method, error correction method, system, medium and equipment for small language error correction model
CN117392985A (en) * 2023-12-11 2024-01-12 飞狐信息技术(天津)有限公司 Voice processing method, device, terminal and storage medium


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 100193 311-2, floor 3, building 5, East District, courtyard 10, northwest Wangdong Road, Haidian District, Beijing

Applicant after: iFLYTEK (Beijing) Co.,Ltd.

Applicant after: Hebei Xunfei Institute of Artificial Intelligence

Applicant after: IFLYTEK Co.,Ltd.

Address before: 100193 311-2, floor 3, building 5, East District, courtyard 10, northwest Wangdong Road, Haidian District, Beijing

Applicant before: Zhongke Xunfei Internet (Beijing) Information Technology Co.,Ltd.

Applicant before: Hebei Xunfei Institute of Artificial Intelligence

Applicant before: IFLYTEK Co.,Ltd.

CB02 Change of applicant information