Text multilingual recognition method based on feature word weighting
Technical Field
The invention relates to a language identification method, in particular to a text multilingual identification method based on feature word weighting.
Background
Language is the most important communication tool of human beings and the main means by which people express themselves. People preserve and pass on the achievements of human civilization by means of language. Writing, as the visual expression of language, breaks through the temporal and spatial limits of spoken language; it is through writing that human beings can fully inherit the wisdom and spiritual wealth of their predecessors, perfect their education systems, improve their own intellect, develop science and technology, and enter civilized society.
There are more than 5,000 languages in the world. Chinese has the largest number of speakers, and Chinese and English are the most widely used languages in the world, but there are also languages spoken by only a few hundred to a few thousand people, such as some Native American languages and certain minority languages of China. People in different countries have different language habits, and their languages likewise have different characteristics. Because of characteristics of language such as variability and complexity, there are various classification standards. Linguists divide the world's languages into language systems, language families, language branches and individual languages according to their similarity; in the language classification method of Peking University in China, the world's languages are divided into 13 language systems and 45 language families. In language identification, the analysis is then performed according to the characteristics of each language: identifying languages from different language systems is relatively easy, but, owing to the complexity of language, identifying highly similar languages within the same language system can be very difficult.
In natural language processing, text language identification is the task of determining which language a given text is written in. With the development of cross-language retrieval technology, text language identification has drawn attention as one of its core technologies, and text multilingual identification is applied mainly to machine translation and multilingual retrieval tasks. Current research on text multilingual identification mainly falls into rule-based methods and machine-learning-based methods. Rule-based methods require language rules to be summarized and generalized manually and then applied by string matching; they need a large number of professional linguists to analyze the languages, and their accuracy is difficult to guarantee.
Machine-learning-based methods are mostly text multilingual identification based on the N-Gram language model or on neural networks. Compared with rule-based methods, they achieve higher accuracy and save a large amount of human resources. However, such methods still have room for improvement in recognition accuracy for different languages of the same language family. For example, Portuguese and Spanish both belong to the Western Romance branch of the Romance group within the Indo-European family, and both derive from Latin. Example sentences: "1. She always closes the window before dinner. 2. Text language identification is a complex research effort." After translation:
1. Ela fecha sempre a janela antes de jantar (Portuguese)
1. Ella cierra siempre la ventana antes de cenar (Spanish)
2. O reconhecimento de linguagem textual é um trabalho de pesquisa complexo (Portuguese)
2. El reconocimiento del lenguaje textual es un trabajo de investigación complejo (Spanish)
It can be seen that written Portuguese and Spanish are very close, and many words are spelled in exactly the same way. The smaller the difference between languages, the worse text language identification with conventional machine learning methods performs.
Disclosure of Invention
Aiming at problems of conventional text language identification methods in actual use, such as low accuracy and low speed when identifying similar languages of the same language family, the invention aims to provide a text multilingual identification method based on feature word weighting which can quickly and accurately identify the language of text content and which is simple to implement, highly robust, and so on.
In order to solve the technical problems, the invention adopts the following technical scheme:
The invention discloses a text multilingual recognition method based on feature word weighting, which comprises the following steps:
1) Data preprocessing, comprising performing generalization preprocessing on data of a plurality of languages to obtain a generalized corpus;
2) Performing N-Gram language model training with the generalized corpus, wherein single-byte languages train a 5-Gram language model and multi-byte languages train a 3-Gram language model;
3) Carrying out word segmentation processing on the generalized corpus to obtain word segmentation data; through word frequency statistics, selecting the top 5% of words by frequency and deduplicating them, generating a feature word list for each language;
4) Training the feature word weights, namely training the weights of the feature words in the feature word list on development set data by a stochastic gradient descent method;
5) Language similarity calculation: inputting the generalized text to be identified, calculating the byte length ratio of the text to be identified, selecting a language model to perform the language similarity calculation, and taking the language with the highest similarity score as the final identification result.
In step 1), the preprocessing of the data comprises:
101) Dividing the data of each language into training set, test set and development set data in the ratio 8:1:1, and performing generalization preprocessing on the training, test and development set data;
102) Generalization preprocessing, including lowercasing uppercase letters, digit replacement and punctuation replacement;
In step 2), the N-Gram language model is as follows:
It is assumed that the probability of occurrence of the current word X_{n+1} is related only to the preceding n words and is unrelated to earlier words; this is the model of an (n+1)-order language model. For example, in a 3-Gram model the probability of occurrence of the current word, P(X_{n+1} | X_1 X_2 … X_n), depends only on the two preceding words X_{n-1} and X_n, giving the formula:
P(X_{n+1} | X_1 X_2 … X_n) = P(X_{n+1} | X_n X_{n-1})
When calculating the transition probability P(X_{n+1} | X_1 X_2 … X_n), the maximum likelihood estimation method is adopted, where C(X_1 X_2 … X_n) denotes the number of occurrences of the sequence X_1 X_2 … X_n:
P(X_{n+1} | X_1 X_2 … X_n) = C(X_1 X_2 … X_n X_{n+1}) / C(X_1 X_2 … X_n)
the input data of the N-Gram language model is acquired by adopting a sliding window method, a window with N is dragged along a sentence, and then a word sequence for training the N-Gram model is established;
the English, french, spanish and other languages are defined as single byte languages, and the Chinese, japanese, korean and other languages are defined as multi-byte languages.
In step 3), different word segmentation methods are selected for word segmentation preprocessing according to the characteristics of each language, as follows:
Chinese, Japanese, Korean and Thai have no obvious word boundary marks, and word segmentation is carried out by a word segmentation method based on a language model; languages of the same language system as English contain spaces and are segmented on the space marks, while attention is paid to issues such as keywords.
In step 3), word frequency refers to the number of times a given word appears in the data, and word frequency statistics refers to counting the occurrences of all words in the data.
Generating the feature word list includes:
performing generalization preprocessing and word segmentation preprocessing on the data, carrying out word frequency statistics, and selecting the top 5% of words by frequency in each language to generate an initial feature word list for that language; each language's initial feature word list is deduplicated against the set of all languages' initial feature word lists, finally obtaining feature word lists whose entries are unique.
In step 5), the language similarity calculation includes:
501) Before calculating the similarity, performing generalization preprocessing on the input text data;
502) Calculating the byte length ratio of the generalized text to determine whether the text to be identified is in a single-byte or a multi-byte language;
503) Locating the feature words in the text to be identified with a reverse maximum length matching algorithm, according to the different feature word lengths of each language;
504) Calculating the similarity score of each language with the language similarity algorithm; the maximum similarity score is taken, and the language corresponding to the maximum value is the final recognition result.
In step 502), the byte length ratio of the text to be recognized is calculated. One letter in languages of the same language system as English occupies one byte, whereas one word in Chinese, Japanese, Korean or Thai occupies multiple bytes. According to the byte length ratio it is decided whether the single-byte or the multi-byte language model is selected for the language similarity calculation; computing the byte length ratio thus prunes the candidates before the language similarity calculation, improving the language identification speed. The byte length ratio calculation formula is:
len_rate = len(str.encode()) / len(str)
where len(str) is the character length, len(str.encode()) is the byte length, and len_rate is the byte length ratio (len_rate ≥ 1).
In step 503), the reverse maximum length matching algorithm matches from the end of the sentence toward the front according to the feature word list: if a feature word is matched, the position of the current word is returned; if not, the leftmost word is removed and matching continues, until all sentences of the text to be recognized have been matched. The specific steps are as follows:
50301) Dividing the text to be identified at punctuation marks into a set of sentences;
50302) At the tail of the unmatched part of a sentence, intercepting a span of text as long as the longest word in the feature word list;
50303) Matching the intercepted text against the feature word list;
50304) If the match succeeds, returning the position of the word and going back to 50302) until all sentences are matched;
50305) If the match fails, removing the leftmost word of the intercepted text and going back to 50303).
In step 504), the text language similarity probability calculation formula is as follows:
P(s) = ∑ p(x_i) + ∑ λ·p(x_j)
where λ is the feature word weight (λ > 1), p(x_i) is the transition probability of a non-feature word, p(x_j) is the transition probability of a feature word, and P(s) is the language similarity probability.
The invention has the following beneficial effects and advantages:
1. The text multilingual recognition method based on feature word weighting can accurately and efficiently identify the language to which a text belongs; the number of languages it can identify far exceeds that of most text language identification methods, and that number can be continuously expanded as long as language data is available;
2. The method generates feature word lists, and the identification accuracy of the feature-word-weighted text language identification method for highly similar languages within the same language system far exceeds that of general methods;
3. The method defines single-byte and multi-byte languages, prunes the language similarity calculation with a byte length ratio threshold, optimizes the text language similarity algorithm, and greatly improves the speed of text multilingual identification.
Drawings
FIG. 1 illustrates the sliding window method for acquiring N-Gram language model input data in the method of the present invention;
FIG. 2 is a flowchart of the language similarity algorithm of the present invention.
Detailed Description
The invention is further described below with reference to the drawings.
The invention provides a text language identification method based on feature word weighting, which performs the language similarity calculation on the basis of feature words, thereby quickly and accurately identifying the language of a text among multiple languages. Meanwhile, the method defines single-byte and multi-byte languages, prunes the language similarity calculation with a byte length ratio threshold, optimizes the text language similarity algorithm, and improves the speed of text multilingual identification.
The invention discloses a text multilingual recognition method based on feature word weighting, which comprises the following steps:
1) Data preprocessing, comprising performing generalization preprocessing on data of a plurality of languages to obtain a generalized corpus;
2) Training an N-Gram language model with the generalized corpus, wherein a 5-Gram language model is trained for single-byte languages (English, French, Spanish and Portuguese) and a 3-Gram language model is trained for multi-byte languages (Chinese, Japanese and Korean);
3) Carrying out word segmentation processing on the generalized corpus to obtain word segmentation data; through word frequency statistics, selecting the top 5% of words by frequency and deduplicating them, generating a feature word list for each language;
4) Training the feature word weights, namely training the weights of the feature words in the feature word list on development set data by a stochastic gradient descent method;
5) Language similarity calculation: inputting the generalized text to be identified, calculating the byte length ratio of the text to be identified, selecting a language model to perform the language similarity calculation, and taking the language with the highest similarity score as the final identification result.
In step 1), the preprocessing of the data comprises:
101) Dividing the data of each language into training set, test set and development set data in the ratio 8:1:1, and performing generalization preprocessing on the training, test and development set data;
102) To reduce the complexity of the N-Gram language model, the data used to train it is subjected to generalization preprocessing, including lowercasing uppercase letters, digit replacement and punctuation replacement.
For example, the English data: "A scientist took home $25,000 from a national science competition for inventing a liquid bandage that could replace antibiotics"
After generalization: "a scientist took home @punc@num from a national science competition for inventing a liquid bandage that could replace antibiotics @punc"
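The generalization step can be sketched in Python as follows; the @num and @punc placeholder tokens follow the example above, while the regular expressions and the function name are illustrative assumptions rather than the exact rules of the invention.

    import re

    def generalize(text: str) -> str:
        """Generalization preprocessing: lowercase the text, then replace
        digit runs and punctuation marks with placeholder tokens (the exact
        replacement rules are assumptions based on the example above)."""
        text = text.lower()                          # uppercase letter lowercasing
        text = re.sub(r"\d[\d,.]*", "@num", text)    # digit replacement, e.g. 25,000 -> @num
        text = re.sub(r"[^\w\s@]", "@punc", text)    # punctuation replacement -> @punc
        return re.sub(r"\s+", " ", text).strip()     # normalize whitespace

    print(generalize("A scientist took home $25,000 from a national science competition."))
    # -> "a scientist took home @punc@num from a national science competition@punc"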
In step 2), the N-Gram language model is as follows:
It is assumed that the probability of occurrence of the current word X_{n+1} is related only to the preceding n words and is unrelated to earlier words; this is the model of an (n+1)-order language model. For example, in a 3-Gram model the probability of occurrence of the current word, P(X_{n+1} | X_1 X_2 … X_n), depends only on the two preceding words X_{n-1} and X_n, giving the formula:
P(X_{n+1} | X_1 X_2 … X_n) = P(X_{n+1} | X_n X_{n-1})
When calculating the transition probability P(X_{n+1} | X_1 X_2 … X_n), the maximum likelihood estimation method is adopted, where C(X_1 X_2 … X_n) denotes the number of occurrences of the sequence X_1 X_2 … X_n:
P(X_{n+1} | X_1 X_2 … X_n) = C(X_1 X_2 … X_n X_{n+1}) / C(X_1 X_2 … X_n)
The input data of the N-Gram language model is acquired by a sliding window method: a window of size N is slid along the sentence, establishing the word sequences used to train the N-Gram model;
Languages such as English, French and Spanish are defined as single-byte languages, and languages such as Chinese, Japanese and Korean are defined as multi-byte languages.
The input data of the N-Gram language model is obtained by the sliding window method shown in FIG. 1: a window of size N is slid along the sentence to establish the word sequences for training the N-Gram model. For example, in the word sequence "text word", "text" is the current-word sequence and "word" is the next word; the current-word sequence and the next word together serve as one input to the N-Gram language model. As the order of the N-Gram language model increases, its computational cost grows exponentially, and data sparsity and model complexity increase. Single-byte languages train a 5-Gram language model, i.e. the current-word sequence is 4 words long and the next word is 1 word; multi-byte languages train a 3-Gram language model, i.e. the current-word sequence is 2 words long and the next word is 1 word.
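The sliding-window extraction and the maximum likelihood counting can be sketched in Python as follows; the function names and the plain count dictionaries are assumptions for illustration.

    from collections import defaultdict

    def ngrams(tokens, n):
        """Slide a window of size n along the token sequence (cf. FIG. 1)."""
        return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

    def train_ngram(corpus, n):
        """Maximum likelihood estimation of transition probabilities:
        P(x_{n+1} | context) = C(context x_{n+1}) / C(context)."""
        gram_counts = defaultdict(int)
        context_counts = defaultdict(int)
        for sentence in corpus:                 # each sentence is a token list
            for gram in ngrams(sentence, n):
                gram_counts[gram] += 1          # C(X_1 ... X_n X_{n+1})
                context_counts[gram[:-1]] += 1  # C(X_1 ... X_n)
        return {g: c / context_counts[g[:-1]] for g, c in gram_counts.items()}

    # a 3-Gram model as used for multi-byte languages: 2 context words, 1 next word
    model = train_ngram([["text", "language", "identification", "task"]], 3)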
In step 3), word frequency refers to the number of times a given word appears in the data, and word frequency statistics refers to counting the occurrences of all words in the data; the data used for word frequency statistics must first undergo generalization preprocessing and word segmentation preprocessing.
Different word segmentation methods are selected for word segmentation preprocessing according to the characteristics of each language, specifically as follows:
Languages such as Chinese, Japanese, Korean and Thai have no obvious word boundary marks, and are segmented by a word segmentation method based on a language model; languages of the same language system as English contain spaces and are segmented on the space marks, while attention is paid to issues such as keywords (see the sketch below).
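A minimal dispatch between the two segmentation strategies might look as follows; the character-level branch is only a stand-in for the language-model-based segmenter described above, which the text does not specify in detail.

    def segment(text: str, multibyte: bool) -> list[str]:
        """Word segmentation preprocessing (a sketch under stated assumptions)."""
        if multibyte:
            # stand-in for language-model-based segmentation of Chinese,
            # Japanese, Korean and Thai, which lack word boundary marks
            return [ch for ch in text if not ch.isspace()]
        # space-delimited languages of the same language system as English
        return text.split()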
Generating the feature word list includes:
after word frequency statistics on the data, the top 5% of words by frequency in each language are selected to generate an initial feature word list for that language; to ensure the effectiveness of the feature word lists, deduplication is required: each language's initial feature word list is deduplicated against the set of all languages' initial feature word lists, finally obtaining the feature word lists and ensuring that the feature words in each language's list are unique, as sketched below.
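The top-5% selection and cross-language deduplication could be sketched as follows; interpreting "top 5%" as the top 5% of distinct words by frequency, and uniqueness as keeping only words that occur in exactly one language's initial list, are assumptions.

    from collections import Counter

    def feature_word_lists(corpora: dict[str, list[str]]) -> dict[str, set[str]]:
        """Build per-language feature word lists from segmented, generalized
        token lists; words shared between languages are removed so that each
        remaining feature word is unique to one language."""
        initial = {}
        for lang, tokens in corpora.items():
            counts = Counter(tokens)
            k = max(1, int(len(counts) * 0.05))       # top 5% of distinct words
            initial[lang] = {w for w, _ in counts.most_common(k)}
        seen = Counter(w for words in initial.values() for w in words)
        return {lang: {w for w in words if seen[w] == 1}
                for lang, words in initial.items()}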
In step 4), the feature word weights are trained on the development set data by stochastic gradient descent. In the stochastic gradient descent parameter training, the number of iterations is set to 1000 and the step size to 0.001. The objective function is as follows, where x_j denotes a feature word, x_i a non-feature word, and θ the feature word weight:
h(θ) = x_i + θ·x_j
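A minimal sketch of this training loop is given below. The text supplies only the objective h(θ) = x_i + θ·x_j, the iteration count (1000) and the step size (0.001), so the per-sample targets and the squared-error loss used here are illustrative assumptions.

    def train_weight(samples, epochs=1000, lr=0.001):
        """Stochastic gradient descent on the feature word weight theta.

        samples: (x_i, x_j, target) triples, where x_i is a non-feature-word
        score, x_j a feature-word score, and target the desired value of h.
        The squared-error loss is an assumption; the iteration count and
        step size follow the text."""
        theta = 1.0                             # start from lambda = 1 (unweighted)
        for _ in range(epochs):
            for x_i, x_j, target in samples:
                h = x_i + theta * x_j           # h(theta) = x_i + theta * x_j
                grad = 2 * (h - target) * x_j   # d/dtheta of (h - target)^2
                theta -= lr * grad              # gradient descent step
        return theta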
In step 5), the language similarity of the input text is calculated and the language to which the text belongs is finally identified; the specific flow is shown in FIG. 2.
501) Before calculating the similarity, performing generalization preprocessing on the input text data;
502) Calculating the byte length ratio of the generalized text to determine whether the text to be identified is in a single-byte or a multi-byte language;
503) Locating the feature words in the text to be identified with a reverse maximum length matching algorithm;
504) Calculating the similarity score of each language with the language similarity algorithm; the maximum similarity score is taken, and the language corresponding to that value is the final recognition result.
In step 502), the byte length ratio of the text to be recognized is calculated. One letter in languages such as English and French occupies one byte, whereas one word in languages such as Chinese and Japanese occupies multiple bytes. According to the byte length ratio, either the single-byte or the multi-byte language model is selected for the language similarity calculation; computing the byte length ratio allows pruning before the similarity calculation, improving the language identification speed. The byte length ratio calculation formula is:
len_rate = len(str.encode()) / len(str)
where len(str) is the character length, len(str.encode()) is the byte length, and len_rate is the byte length ratio (len_rate ≥ 1).
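In Python this ratio falls out directly from the built-in len and str.encode (UTF-8 by default); the threshold separating single-byte from multi-byte text is not given in the text, so the 1.5 below is an illustrative assumption.

    def byte_length_ratio(text: str) -> float:
        """len_rate = byte length / character length: 1.0 for pure ASCII,
        larger for multi-byte scripts (a CJK character is 3 bytes in UTF-8)."""
        return len(text.encode()) / len(text)

    def is_multibyte(text: str, threshold: float = 1.5) -> bool:
        # the exact threshold is not specified in the text; 1.5 is an assumption
        return byte_length_ratio(text) >= threshold

    print(byte_length_ratio("hello"))    # 1.0 -> single-byte language model
    print(byte_length_ratio("文本识别"))   # 3.0 -> multi-byte language model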
In step 503), the idea of the reverse maximum length matching algorithm is as follows: matching proceeds from the end of each sentence toward the front according to the feature word list; if a feature word is matched, the position of the current word is returned, and if not, the leftmost word is removed and matching continues, until all sentences of the text to be recognized have been matched. The specific steps are as follows (a minimal sketch follows the steps):
50301) Dividing the text to be identified at punctuation marks into a set of sentences;
50302) At the tail of the unmatched part of a sentence, intercepting a span of text as long as the longest word in the feature word list;
50303) Matching the intercepted text against the feature word list;
50304) If the match succeeds, returning the position of the word and going back to 50302) until all sentences are matched;
50305) If the match fails, removing the leftmost word of the intercepted text and going back to 50303).
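The steps above can be sketched as follows; representing the feature word list as a set, splitting sentences with a simple regular expression, and shrinking the window by characters are assumptions for illustration.

    import re

    def reverse_max_match(text: str, feature_words: set[str]) -> list[str]:
        """Locate feature words by reverse maximum length matching:
        split at punctuation (50301), take a tail span as long as the longest
        feature word (50302), match it against the list (50303), shrink from
        the left on failure (50305), and move leftward on success (50304)."""
        max_len = max((len(w) for w in feature_words), default=0)
        found = []
        for sentence in re.split(r"[^\w\s]+", text):     # 50301
            sentence = sentence.strip()
            end = len(sentence)
            while end > 0 and max_len > 0:
                start = max(0, end - max_len)            # 50302
                while start < end:
                    if sentence[start:end] in feature_words:  # 50303
                        found.append(sentence[start:end])     # 50304
                        break
                    start += 1                           # 50305
                end = start if start < end else end - 1  # continue leftward
        return found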
In step 504), the text language similarity calculation formula is as follows:
P(s) = ∑ p(x_i) + ∑ λ·p(x_j)
where λ is the feature word weight (λ > 1), p(x_i) is the transition probability of a non-feature word, p(x_j) is the transition probability of a feature word, and P(s) is the language similarity probability.
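Putting the pieces together, the per-language score could be computed as below. Summing raw probabilities follows the formula P(s) = ∑p(x_i) + ∑λ·p(x_j) as written (summing log probabilities would be the usual numerically stable variant); the argument layout is an assumption.

    def similarity_score(tokens, ngram_probs, feature_words, lam, n):
        """P(s): sum of non-feature-word transition probabilities p(x_i)
        plus lambda times the feature word transition probabilities p(x_j).
        ngram_probs maps n-gram tuples to transition probabilities, as
        trained above; lam is the trained feature word weight (lambda > 1)."""
        score = 0.0
        for i in range(len(tokens) - n + 1):
            gram = tuple(tokens[i:i + n])
            p = ngram_probs.get(gram, 0.0)    # unseen n-grams contribute 0 here
            weight = lam if gram[-1] in feature_words else 1.0
            score += weight * p
        return score

    # the language with the highest similarity score is the recognition result:
    # result = max(langs, key=lambda L: similarity_score(toks, models[L], feats[L], lam, n))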
The method is illustrated below with a text language recognition example covering 13 languages including Chinese, English and Japanese: test texts in the 13 different languages were verified with the text multilingual recognition method based on feature word weighting, and all recognition results were found to be correct.
Text multilingual recognition result example
According to the experimental cases, the method can accurately identify languages including Chinese, Japanese, Korean, English, French, Spanish, Portuguese, Italian, Arabic, Russian, Thai and Vietnamese, among which even the highly similar Portuguese and Spanish are identified accurately. The number of languages the method can identify far exceeds that of most text language identification methods, and that number can be continuously expanded as long as language data is available. In addition, the text language similarity algorithm is optimized with the byte length ratio threshold, so that the text multilingual recognition speed is far higher than that of common methods, and the method is simple to implement, highly robust, and so on.