CN111178009B - Text multilingual recognition method based on feature word weighting

Info

Publication number
CN111178009B
Authority
CN
China
Prior art keywords
language
word
text
feature
languages
Prior art date
Legal status
Active
Application number
CN201911324134.8A
Other languages
Chinese (zh)
Other versions
CN111178009A (en)
Inventors
Du Quan
Bi Dong
Current Assignee
Shenyang Yayi Network Technology Co ltd
Original Assignee
Shenyang Yayi Network Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Shenyang Yayi Network Technology Co ltd
Priority to CN201911324134.8A
Publication of CN111178009A
Application granted
Publication of CN111178009B

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • G06N3/082 - Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a text multilingual recognition method based on feature word weighting, which comprises the following steps: preprocessing data to obtain a generalized corpus; training N-Gram language models with the generalized corpus; performing word segmentation on the generalized corpus to obtain segmented data, selecting the top 5% most frequent words through word frequency statistics and deduplicating them, and generating a feature word list for each language; training feature word weights, namely training the weights of the feature words in the feature word list on development set data by a stochastic gradient descent method; and calculating language similarity, namely inputting the generalized text to be recognized, calculating its byte length ratio, selecting a language model for language similarity calculation, and taking the language with the highest similarity score as the final recognition result. The method can accurately and efficiently identify the language to which a text belongs, can recognize far more languages than most text language recognition methods, and can continuously expand the number of recognized languages provided language data is available.

Description

Text multilingual recognition method based on feature word weighting
Technical Field
The invention relates to a language identification method, in particular to a text multilingual identification method based on feature word weighting.
Background
Language is the most important communication tool of human beings and the main medium through which people express themselves. By means of language, people preserve and pass on the achievements of human civilization. Writing, as the visual form of language, breaks through the temporal and spatial limitations of speech; it is through writing that human beings can fully inherit the wisdom and spiritual wealth of their predecessors, perfect education systems, improve their own knowledge, develop science and technology, and enter civilized society.
There are more than 5,000 languages in the world. Chinese is the language with the most native speakers, and Chinese and English are the most widely used languages in the world, but there are also languages spoken by only hundreds to thousands of people, such as some Native American languages and small ethnic-minority languages in China. People of different countries have different habits in using their languages, and the languages themselves have different characteristics. Because of properties of language such as variability and complexity, there are various standards for classifying languages. Linguists divide the world's languages into language systems, families, branches and individual languages according to their similarity; in the language classification method of Peking University in China, the world's languages are divided into 13 language systems and 45 language families. In language identification, the corresponding analysis is then performed according to the characteristics of each language: identifying languages from different language systems is relatively easy, but owing to the complexity of language, identifying highly similar languages within the same language system can be very difficult.
In natural language processing, text language recognition determines which language a given text is written in. With the development of cross-language retrieval technology, text language recognition has attracted attention as one of its core technologies, and text multilingual recognition is mainly applied to machine translation and multilingual retrieval tasks. Current research on text multilingual recognition mainly follows rule-based methods and machine-learning-based methods. Rule-based methods require manually summarizing and generalizing language rules and then performing string matching; they require a large number of professional linguists to analyze the languages, and their accuracy is difficult to guarantee.
Machine-learning-based methods mostly perform text multilingual recognition with N-Gram language models or neural networks. Compared with rule-based methods, machine-learning-based text multilingual recognition achieves higher accuracy and saves a large amount of human labor. However, these methods still have room for improvement in recognition accuracy for different languages of the same language family. For example, Portuguese and Spanish both belong to the Western Romance branch of the Indo-European language family and are both written in the Latin alphabet. Example sentences: "1. She always closes the window before dinner. 2. Text language recognition is a complex research task." After translation:
1. Ela fecha sempre a janela antes de jantar. (Portuguese)
1. Ella cierra siempre la ventana antes de cenar. (Spanish)
2. O reconhecimento de linguagem textual é um trabalho de pesquisa complexo. (Portuguese)
2. El reconocimiento del lenguaje textual es un trabajo de investigación complejo. (Spanish)
As can be seen, written Portuguese and Spanish are very close, and many words are spelled identically. The smaller the difference between languages, the worse text language recognition with conventional machine learning methods performs.
Disclosure of Invention
Aiming at the problems of existing text language identification methods in actual use, such as low accuracy and low speed when identifying similar languages of the same language family, the invention provides a text multilingual recognition method based on feature word weighting which can quickly and accurately identify the language of text content and which is simple to implement and highly robust.
In order to solve the technical problems, the invention adopts the following technical scheme:
The invention discloses a text multilingual recognition method based on feature word weighting, which comprises the following steps:
1) Data preprocessing, comprising performing generalization preprocessing on data of a plurality of languages to obtain a generalized corpus;
2) Performing N-Gram language model training with the generalized corpus, wherein single-byte languages train 5-Gram language models and multi-byte languages train 3-Gram language models;
3) Performing word segmentation on the generalized corpus to obtain segmented data, selecting the top 5% most frequent words through word frequency statistics and deduplicating them, and generating a feature word list for each language;
4) Training feature word weights, namely training the weights of the feature words in the feature word list on development set data by a stochastic gradient descent method;
5) Calculating language similarity: inputting the generalized text to be recognized, calculating its byte length ratio, selecting a language model for language similarity calculation, and taking the language with the highest similarity score as the final recognition result.
In step 1), the preprocessing of the data comprises:
101) Dividing the data of each language into training set, test set and development set data in the ratio 8:1:1, and performing generalization preprocessing on them;
102) Generalization preprocessing, including lowercasing uppercase letters, replacing digits, and replacing punctuation;
in the step 2), the N-Gram language model is as follows:
assume that the probability of occurrence of the current word X_{n+1} is related only to the preceding n words and is unrelated to earlier words, i.e. an (n+1)-order language model; taking a 3-Gram model as an example, the probability of occurrence of the current word X_{n+1}, P(X_{n+1} | X_1 X_2 ... X_n), depends only on the two preceding words X_{n-1} and X_n:
P(X_{n+1} | X_1 X_2 ... X_n) = P(X_{n+1} | X_{n-1} X_n)
when computing the transition probability P(X_{n+1} | X_1 X_2 ... X_n), maximum likelihood estimation is used, where C(X_1 X_2 ... X_n) denotes the number of occurrences of X_1 X_2 ... X_n:
P(X_{n+1} | X_1 X_2 ... X_n) = C(X_1 X_2 ... X_n X_{n+1}) / C(X_1 X_2 ... X_n)
the input data of the N-Gram language model is acquired by a sliding window method: a window of size N is slid along the sentence, producing the word sequences used to train the N-Gram model;
languages such as English, French and Spanish are defined as single-byte languages, and languages such as Chinese, Japanese and Korean are defined as multi-byte languages.
In the step 3), different word segmentation methods are selected according to the characteristics of each language for word segmentation preprocessing, specifically:
Chinese, Japanese, Korean and Thai have no explicit word boundaries, so a language-model-based word segmentation method is used for segmentation; languages of the same family as English contain spaces, so the text is segmented on spaces as delimiters, while paying attention to issues such as keywords.
In step 3), word frequency refers to the number of times a given word appears in the data, and word frequency statistics refers to counting the occurrences of all words in the data.
Generating the feature vocabulary includes:
performing generalization preprocessing and word segmentation preprocessing on the data, performing word frequency statistics, and selecting the top 5% most frequent words of each language to generate that language's initial feature word list; deduplicating each language's initial feature word list against the set of all languages' initial feature word lists, finally obtaining feature word lists whose entries are unique.
In step 5), the language similarity calculation includes:
501) Before calculating the similarity, generalize the input text data;
502) Calculate the byte length ratio of the generalized text and determine whether the text to be recognized is in a single-byte or a multi-byte language;
503) Locate the feature words in the text to be recognized with a reverse maximum length matching algorithm, according to the different lengths of each language's feature words;
504) Calculate the similarity score of each language with the language similarity algorithm; the language corresponding to the maximum similarity score is the final recognition result.
In step 502), the byte length ratio of the text to be recognized is calculated. One letter in languages of the same family as English occupies one byte, while one character in Chinese, Japanese, Korean and Thai occupies multiple bytes; whether the text to be recognized uses a single-byte or a multi-byte language model for language similarity calculation is decided according to the byte length ratio. Computing the byte length ratio prunes the candidates before similarity calculation and improves language recognition speed. The byte length ratio is calculated as:
len_rate = len(str.encode()) / len(str)
where len(str) is the character length, len(str.encode()) is the byte length, and len_rate is the byte length ratio (len_rate ≥ 1).
In step 503), the reverse maximum length matching algorithm matches from back to front according to the feature word list: if a feature word is matched, the position of the current word is returned; if not, the leftmost character is removed and matching continues, until all sentences of the text to be recognized have been matched. The specific steps are:
50301) Split the text to be recognized at punctuation marks into a set of sentences;
50302) At the tail of the unmatched part of a sentence, intercept a text segment whose length equals that of the longest word in the feature word list;
50303) Match the intercepted text against the feature word list;
50304) If the match succeeds, return the position of the word and go back to 50302), until all sentences are matched;
50305) If the match does not succeed, remove the leftmost character of the intercepted text and go back to 50303).
In step 504), the text language similarity probability calculation formula is as follows:
P(s) = Σ p(x_i) + Σ λ p(x_j)
where λ is the feature word weight (λ > 1), p(x_i) is the transition probability of a non-feature word, p(x_j) is the transition probability of a feature word, and P(s) is the language similarity probability.
The invention has the following beneficial effects and advantages:
1. The text multilingual recognition method based on feature word weighting can accurately and efficiently identify the language to which a text belongs; it can recognize far more languages than most text language recognition methods, and the number of recognized languages can be continuously expanded provided language data is available;
2. The method generates feature word lists; the recognition accuracy of this feature-word-weighted text language identification method for highly similar languages within the same language system far exceeds that of general methods;
3. The method defines single-byte and multi-byte languages, prunes the language similarity calculation with a byte length ratio threshold, optimizes the text language similarity algorithm, and greatly improves the speed of text multilingual recognition.
Drawings
FIG. 1 illustrates the sliding window method for acquiring N-Gram language model input data in the method of the present invention;
FIG. 2 is a flowchart of the language similarity algorithm of the present invention.
Detailed Description
The invention is further described below with reference to the drawings.
The invention provides a text language identification method based on feature word weighting, which performs language similarity calculation on the basis of feature words, thereby achieving fast and accurate multilingual recognition of text. Meanwhile, the method defines single-byte and multi-byte languages and prunes the language similarity calculation with a byte length ratio threshold, optimizing the text language similarity algorithm and improving the speed of text multilingual recognition.
The invention discloses a text multilingual recognition method based on feature word weighting, which comprises the following steps:
1) Data preprocessing, comprising performing generalization preprocessing on data of a plurality of languages to obtain a generalized corpus;
2) Training N-Gram language models with the generalized corpus, wherein single-byte languages (English, French, Spanish and Portuguese) train a 5-Gram language model and multi-byte languages (Chinese, Japanese and Korean) train a 3-Gram language model;
3) Performing word segmentation on the generalized corpus to obtain segmented data, selecting the top 5% most frequent words through word frequency statistics and deduplicating them, and generating a feature word list for each language;
4) Training feature word weights, namely training the weights of the feature words in the feature word list on development set data by a stochastic gradient descent method;
5) Calculating language similarity: inputting the generalized text to be recognized, calculating its byte length ratio, selecting a language model for language similarity calculation, and taking the language with the highest similarity score as the final recognition result.
In step 1), the preprocessing of the data comprises:
101) Dividing the data of each language into training set, test set and development set data in the ratio 8:1:1, and performing generalization preprocessing on them;
102) To reduce the complexity of the N-Gram language model, the data used to train it is generalization-preprocessed, including lowercasing uppercase letters, replacing digits, and replacing punctuation.
For example, English data: "A scientist took home $25,000 from a national science competition for inventing a liquid bandage that could replace antibiotics."
After generalization: "a scientist took home @punc@num from a national science competition for inventing a liquid bandage that could replace antibiotics@punc"
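To make the generalization step concrete, the following is a minimal Python sketch of such preprocessing; the regular expressions and the exact behavior beyond this one example are assumptions, since the patent only specifies lowercasing, digit replacement and punctuation replacement:

```python
import re

def generalize(text: str) -> str:
    """Generalization preprocessing sketch: lowercase the text, replace
    numbers with @num and punctuation with @punc (token forms follow the
    patent's example; the exact rules are assumptions)."""
    text = text.lower()                        # uppercase letter lowercasing
    text = re.sub(r"\d[\d,.]*", "@num", text)  # digit replacement (before punctuation)
    text = re.sub(r"[^\w\s@]", "@punc", text)  # punctuation replacement
    return re.sub(r"\s+", " ", text).strip()

print(generalize("A scientist took home $25,000 from a national science competition."))
# -> "a scientist took home @punc@num from a national science competition@punc"
```

Digits are replaced before punctuation so that the comma inside "25,000" is absorbed into @num rather than becoming a spurious @punc.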
In the step 2), the N-Gram language model is as follows:
assume that the probability of occurrence of the current word X_{n+1} is related only to the preceding n words and is unrelated to earlier words, i.e. an (n+1)-order language model; taking a 3-Gram model as an example, the probability of occurrence of the current word X_{n+1}, P(X_{n+1} | X_1 X_2 ... X_n), depends only on the two preceding words X_{n-1} and X_n:
P(X_{n+1} | X_1 X_2 ... X_n) = P(X_{n+1} | X_{n-1} X_n)
when computing the transition probability P(X_{n+1} | X_1 X_2 ... X_n), maximum likelihood estimation is used, where C(X_1 X_2 ... X_n) denotes the number of occurrences of X_1 X_2 ... X_n:
P(X_{n+1} | X_1 X_2 ... X_n) = C(X_1 X_2 ... X_n X_{n+1}) / C(X_1 X_2 ... X_n)
the input data of the N-Gram language model is acquired by a sliding window method: a window of size N is slid along the sentence, producing the word sequences used to train the N-Gram model;
languages such as English, French and Spanish are defined as single-byte languages, and languages such as Chinese, Japanese and Korean are defined as multi-byte languages.
The input data of the N-Gram language model is obtained by the sliding window method, as shown in FIG. 1. A window of size N is slid along the sentence, creating the word sequences used to train the N-Gram model: at each position, the preceding words form the history of the current position and the next word is the prediction target, and together they serve as the input of the N-Gram language model. As the order of an N-Gram language model increases, its computation grows exponentially, and data sparsity and model complexity increase. Single-byte languages train a 5-Gram language model, i.e. the current-word history is 4 words long and the next word is 1 word; multi-byte languages train a 3-Gram language model, i.e. the current-word history is 2 words long and the next word is 1 word.
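As an illustration of this sliding-window construction and the maximum likelihood estimate above, the following Python sketch counts n-grams and computes transition probabilities; the function names and the absence of smoothing are assumptions, as the patent gives no implementation:

```python
from collections import Counter

def ngram_windows(tokens, n):
    """Slide a window of size n along the sentence; each window is the
    (n-1)-word history plus the next word."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def train_ngram(corpus_sentences, n):
    """Count n-grams and their (n-1)-gram histories for MLE."""
    grams, hists = Counter(), Counter()
    for sent in corpus_sentences:
        for w in ngram_windows(sent, n):
            grams[w] += 1
            hists[w[:-1]] += 1
    return grams, hists

def mle_prob(grams, hists, window):
    """P(x_n | history) = C(history + x_n) / C(history)."""
    h = hists[window[:-1]]
    return grams[window] / h if h else 0.0

# 3-Gram (multi-byte) toy example on segmented Chinese; single-byte
# languages would use n=5 in the same way.
grams, hists = train_ngram([["文本", "语种", "识别"]], n=3)
print(mle_prob(grams, hists, ("文本", "语种", "识别")))  # 1.0
```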
In step 3), word frequency refers to the number of times a given word appears in the data, and word frequency statistics counts the occurrences of all words in the data; the data used for word frequency statistics must first undergo generalization preprocessing and word segmentation preprocessing.
Different word segmentation methods are selected according to the characteristics of each language for word segmentation preprocessing, specifically:
languages such as Chinese, Japanese, Korean and Thai have no explicit word boundaries, so a language-model-based word segmentation method is used for segmentation; languages of the same family as English contain spaces, so the text is segmented on spaces as delimiters, while paying attention to issues such as keywords.
Generating the feature vocabulary includes:
after word frequency statistics on the data, select the top 5% most frequent words of each language to generate that language's initial feature word list. To ensure the effectiveness of the feature word lists, deduplication is required: each language's initial feature word list is deduplicated against the set of all languages' initial feature word lists, finally obtaining the feature word lists and guaranteeing that the feature words in each language's list are unique.
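A minimal Python sketch of this vocabulary construction, assuming segmented corpora are available as per-language word lists (the 5% cut-off is from the patent; the data structures and the cut-off over distinct word types are assumptions):

```python
from collections import Counter

def build_feature_vocab(segmented_corpora, top_ratio=0.05):
    """segmented_corpora: {language: list of segmented words}.
    Take each language's top 5% most frequent word types as its initial
    feature word list, then keep only words unique to that language."""
    initial = {}
    for lang, words in segmented_corpora.items():
        freq = Counter(words)
        k = max(1, int(len(freq) * top_ratio))
        initial[lang] = {w for w, _ in freq.most_common(k)}
    vocab = {}
    for lang, wordset in initial.items():
        # deduplicate against every other language's initial list
        others = set()
        for other, ws in initial.items():
            if other != lang:
                others |= ws
        vocab[lang] = wordset - others
    return vocab
```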
In step 4), the feature word weights are trained on development set data by stochastic gradient descent. In the stochastic gradient descent parameter training, the number of iterations is set to 1000 and the step size to 0.001. The objective function is as follows, where x_j denotes a feature word, x_i a non-feature word, and θ the feature word weight:
h(θ) = x_i + θ x_j
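The patent states only the objective h(θ) = x_i + θx_j with 1000 iterations and step size 0.001; the loss being optimized is not spelled out. The sketch below is therefore an assumption: it treats training as nudging λ so that, under the similarity score P(s) = Σp(x_i) + λΣp(x_j) (whose derivative with respect to λ is Σp(x_j)), the correct language outscores its closest rival on each development sample:

```python
def train_feature_weight(dev_samples, n_iters=1000, lr=0.001):
    """dev_samples: list of (feat_true, feat_wrong) pairs, i.e. the summed
    feature-word transition probabilities of a dev sentence under its true
    language and under the best-scoring wrong language. (An assumed setup;
    the patent does not specify the loss.)"""
    lam = 1.0
    for _ in range(n_iters):
        for feat_true, feat_wrong in dev_samples:
            # gradient of the score difference with respect to λ
            lam += lr * (feat_true - feat_wrong)
    return max(lam, 1.0 + 1e-6)  # the patent requires λ > 1
```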
In step 5), the language similarity of the input text is calculated and the language to which the text belongs is finally identified; the specific flow is shown in FIG. 2.
501) Before calculating the similarity, generalize the input text data;
502) Calculate the byte length ratio of the generalized text and determine whether the text to be recognized is in a single-byte or a multi-byte language;
503) Locate the feature words in the text to be recognized with the reverse maximum length matching algorithm;
504) Calculate the similarity score of each language with the language similarity algorithm; after taking the maximum similarity score, the language corresponding to that value is the final recognition result.
In step 502), the byte length ratio of the text to be recognized is calculated. One letter in languages such as English and French occupies one byte, while one character in languages such as Chinese and Japanese occupies multiple bytes; whether the text to be recognized uses a single-byte or a multi-byte language model for language similarity calculation is selected according to the byte length ratio. Computing the byte length ratio allows pruning before the language similarity calculation and improves language recognition speed. The byte length ratio is calculated as:
len_rate = len(str.encode()) / len(str)
where len(str) is the character length, len(str.encode()) is the byte length, and len_rate is the byte length ratio (len_rate ≥ 1).
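In UTF-8, a Latin letter encodes to one byte while a CJK character typically encodes to three, so the ratio separates the two groups cleanly. A sketch follows; the concrete threshold value is an assumption, since the patent only states that a byte-length-ratio threshold is used:

```python
def byte_length_ratio(text: str) -> float:
    """len_rate = byte length / character length; about 1 for single-byte
    (Latin-alphabet) text, noticeably larger for multi-byte (CJK) text."""
    return len(text.encode("utf-8")) / len(text)

def pick_model_group(text: str, threshold: float = 1.5) -> str:
    """Prune: route the text to the single-byte or multi-byte language
    models (the threshold value here is an assumption)."""
    return "multi-byte" if byte_length_ratio(text) >= threshold else "single-byte"

print(byte_length_ratio("hello"))       # 1.0
print(byte_length_ratio("文本语种识别"))   # 3.0 under UTF-8
```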
In step 503), the idea of the reverse maximum length matching algorithm is to match sentences from right to left according to the feature word list: if a feature word is matched, the position of the current word is returned; if not, the leftmost character is removed and matching continues, until all sentences of the text to be recognized have been matched. The specific steps are as follows (a sketch follows the steps):
50301) Split the text to be recognized at punctuation marks into a set of sentences;
50302) At the tail of the unmatched part of a sentence, intercept a text segment whose length equals that of the longest word in the feature word list;
50303) Match the intercepted text against the feature word list;
50304) If the match succeeds, return the position of the word and go back to 50302), until all sentences are matched;
50305) If the match does not succeed, remove the leftmost character of the intercepted text and go back to 50303).
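A sketch of this matching loop in Python, under the assumption that feature words are located as character spans within each punctuation-split sentence:

```python
def reverse_max_match(sentence, vocab, max_len):
    """Reverse maximum length matching: start from the sentence tail,
    try the longest candidate first, drop the leftmost character on
    failure, and step leftwards after each match or exhausted position."""
    matches = []
    end = len(sentence)
    while end > 0:
        size = min(max_len, end)
        while size > 0:
            if sentence[end - size:end] in vocab:
                matches.append((end - size, end))  # feature word position
                break
            size -= 1  # remove the leftmost character and retry
        end -= size if size else 1  # jump over the match, or shift by one
    return matches

print(reverse_max_match("ela fecha sempre a janela", {"sempre"}, max_len=6))
# -> [(10, 16)]
```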
In step 504), the text language similarity calculation formula is as follows:
P(s) = Σ p(x_i) + Σ λ p(x_j)
where λ is the feature word weight (λ > 1), p(x_i) is the transition probability of a non-feature word, p(x_j) is the transition probability of a feature word, and P(s) is the language similarity probability.
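Putting the pieces together, a sketch of the scoring and the final argmax; the per-position probability lists and feature flags are assumed inputs produced by the language models and the reverse maximum matching step:

```python
def language_score(ngram_probs, feature_flags, lam):
    """P(s) = Σ p(x_i) + Σ λ p(x_j): weight transition probabilities at
    positions inside feature words by λ, sum the rest unweighted."""
    return sum(lam * p if f else p for p, f in zip(ngram_probs, feature_flags))

def identify(scores):
    """The language with the highest similarity score is the result."""
    return max(scores, key=scores.get)

# toy usage with assumed transition probabilities
scores = {
    "pt": language_score([0.21, 0.55, 0.30], [False, True, False], lam=1.3),
    "es": language_score([0.20, 0.40, 0.30], [False, True, False], lam=1.3),
}
print(identify(scores))  # -> "pt"
```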
The following illustrates text language recognition with an example covering 13 languages, including Chinese, English and Japanese. Test texts in the 13 different languages were verified with the feature-word-weighted text language recognition method, and all recognition results were found to be correct.
Text multilingual recognition result example (the result table appears only as an image in the original publication)
These experimental cases show that the method accurately identifies languages including Chinese, Japanese, Korean, English, French, Spanish, Portuguese, Italian, Arabic, Russian, Thai and Vietnamese, and in particular correctly distinguishes the highly similar pair Portuguese and Spanish. The method can recognize far more languages than most text language recognition methods and can continuously expand the number of recognized languages provided language data is available. In addition, because the byte length ratio threshold optimizes the text language similarity algorithm, text multilingual recognition is much faster than with general methods, and the method is simple to implement and highly robust.

Claims (9)

1. A text multilingual recognition method based on feature word weighting is characterized by comprising the following steps:
1) Data preprocessing, comprising performing generalization preprocessing on data of a plurality of languages to obtain a generalized corpus;
2) Performing N-Gram language model training with the generalized corpus, wherein single-byte languages train 5-Gram language models and multi-byte languages train 3-Gram language models;
3) Performing word segmentation on the generalized corpus to obtain segmented data, selecting the top 5% most frequent words through word frequency statistics and deduplicating them, and generating a feature word list for each language;
4) Training feature word weights, namely training the weights of the feature words in the feature word list on development set data by a stochastic gradient descent method;
5) Calculating language similarity: inputting the generalized text to be recognized, calculating its byte length ratio, selecting a language model for language similarity calculation, and taking the language with the highest similarity score as the final recognition result;
the byte length ratio calculation formula is:
len_rate = len(str.encode()) / len(str)
where len(str) is the character length, len(str.encode()) is the byte length, and len_rate is the byte length ratio (len_rate ≥ 1).
2. The method for recognition of multiple languages of text based on feature word weighting according to claim 1, wherein in step 1), the preprocessing of data includes:
101) Dividing the data of each language into training set, test set and development set data in the ratio 8:1:1, and performing generalization preprocessing on them;
102) Generalization preprocessing, including lowercasing uppercase letters, replacing digits, and replacing punctuation.
3. The method for recognizing multiple languages of text based on feature word weighting according to claim 1, wherein in the step 2), the N-Gram language model is:
assume that the probability of occurrence of the current word X_{n+1} is related only to the preceding n words and is unrelated to earlier words, i.e. an (n+1)-order language model; taking a 3-Gram model as an example, the probability of occurrence of the current word X_{n+1}, P(X_{n+1} | X_1 X_2 ... X_n), depends only on the two preceding words X_{n-1} and X_n:
P(X_{n+1} | X_1 X_2 ... X_n) = P(X_{n+1} | X_{n-1} X_n)
when computing the transition probability P(X_{n+1} | X_1 X_2 ... X_n), maximum likelihood estimation is used, where C(X_1 X_2 ... X_n) denotes the number of occurrences of X_1 X_2 ... X_n:
P(X_{n+1} | X_1 X_2 ... X_n) = C(X_1 X_2 ... X_n X_{n+1}) / C(X_1 X_2 ... X_n)
the input data of the N-Gram language model is acquired by a sliding window method: a window of size N is slid along the sentence, producing the word sequences used to train the N-Gram model;
English, French and Spanish are defined as single-byte languages, and Chinese, Japanese and Korean are defined as multi-byte languages.
4. The text multilingual recognition method based on feature word weighting according to claim 1, wherein in step 3), different word segmentation methods are selected according to the characteristics of each language for word segmentation preprocessing, specifically:
Chinese, Japanese, Korean and Thai have no explicit word boundaries, so a language-model-based word segmentation method is used for segmentation; languages of the same family as English contain spaces, so the text is segmented on spaces as delimiters, while paying attention to issues such as keywords.
5. The method for recognizing multiple languages of text based on feature word weighting according to claim 1, wherein in step 3), word frequency refers to the number of occurrences of a given word in the data, and word frequency statistics refers to counting the occurrences of all words in the data;
generating the feature vocabulary includes:
performing generalization preprocessing and word segmentation preprocessing on the data, performing word frequency statistics, and selecting the top 5% most frequent words of each language to generate that language's initial feature word list; deduplicating each language's initial feature word list against the set of all languages' initial feature word lists, finally obtaining feature word lists whose entries are unique.
6. The method for recognizing multiple languages of text based on feature word weighting according to claim 1, wherein in step 5), the language similarity calculation includes:
501) Before calculating the similarity, generalize the input text data;
502) Calculate the byte length ratio of the generalized text and determine whether the text to be recognized is in a single-byte or a multi-byte language;
503) Locate the feature words in the text to be recognized with a reverse maximum length matching algorithm, according to the different lengths of each language's feature words;
504) Calculate the similarity score of each language with the language similarity algorithm; the language corresponding to the maximum similarity score is the final recognition result.
7. The method for recognizing multiple languages of text based on feature word weighting according to claim 5, wherein in step 502), the byte length ratio of the text to be recognized is calculated; one letter in languages of the same family as English occupies one byte, while one character in Chinese, Japanese, Korean and Thai occupies multiple bytes; whether the text to be recognized uses a single-byte or a multi-byte language model for the language similarity calculation is determined according to the byte length ratio, and computing the byte length ratio allows pruning before the language similarity calculation, improving the language recognition speed.
8. The method for recognizing multiple languages of text based on feature word weighting according to claim 5, wherein in step 503), the reverse maximum length matching algorithm matches from back to front according to the feature word list: if a feature word is matched, the position of the current word is returned; if not, the leftmost character is removed and matching continues, until all sentences of the text to be recognized have been matched; the specific steps are:
50301) Split the text to be recognized at punctuation marks into a set of sentences;
50302) At the tail of the unmatched part of a sentence, intercept a text segment whose length equals that of the longest word in the feature word list;
50303) Match the intercepted text against the feature word list;
50304) If the match succeeds, return the position of the word and go back to 50302), until all sentences are matched;
50305) If the match does not succeed, remove the leftmost character of the intercepted text and go back to 50303).
9. The method for text multilingual recognition based on feature word weighting of claim 5, wherein in step 504), the text language similarity probability calculation formula is as follows:
P(s) = Σ p(x_i) + Σ λ p(x_j)
where λ is the feature word weight (λ > 1), p(x_i) is the transition probability of a non-feature word, p(x_j) is the transition probability of a feature word, and P(s) is the language similarity probability.
CN201911324134.8A 2019-12-20 2019-12-20 Text multilingual recognition method based on feature word weighting Active CN111178009B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911324134.8A CN111178009B (en) 2019-12-20 2019-12-20 Text multilingual recognition method based on feature word weighting

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911324134.8A CN111178009B (en) 2019-12-20 2019-12-20 Text multilingual recognition method based on feature word weighting

Publications (2)

Publication Number Publication Date
CN111178009A CN111178009A (en) 2020-05-19
CN111178009B true CN111178009B (en) 2023-05-09

Family

ID=70650260

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911324134.8A Active CN111178009B (en) 2019-12-20 2019-12-20 Text multilingual recognition method based on feature word weighting

Country Status (1)

Country Link
CN (1) CN111178009B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112329454A (en) * 2020-11-03 2021-02-05 腾讯科技(深圳)有限公司 Language identification method and device, electronic equipment and readable storage medium
CN117236347B (en) * 2023-11-10 2024-03-05 腾讯科技(深圳)有限公司 Interactive text translation method, interactive text display method and related device


Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9740687B2 (en) * 2014-06-11 2017-08-22 Facebook, Inc. Classifying languages for objects and entities

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106598937A (en) * 2015-10-16 2017-04-26 阿里巴巴集团控股有限公司 Language recognition method and device for text and electronic equipment
CN106528535A (en) * 2016-11-14 2017-03-22 北京赛思信安技术股份有限公司 Multi-language identification method based on coding and machine learning
CN108197109A (en) * 2017-12-29 2018-06-22 北京百分点信息科技有限公司 A kind of multilingual analysis method and device based on natural language processing
CN109934251A (en) * 2018-12-27 2019-06-25 国家计算机网络与信息安全管理中心广东分中心 A kind of method, identifying system and storage medium for rare foreign languages text identification

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Wang Hao; Li Sishu; Deng Sanhong. Research on text language identification based on N-Gram. New Technology of Library and Information Service, 2013, (04), full text. *

Also Published As

Publication number Publication date
CN111178009A (en) 2020-05-19

Similar Documents

Publication Publication Date Title
Cotterell et al. Labeled morphological segmentation with semi-markov models
CN110489760A (en) Based on deep neural network text auto-collation and device
CN105138514B (en) It is a kind of based on dictionary it is positive gradually plus a word maximum matches Chinese word cutting method
CN107102983B (en) Word vector representation method of Chinese concept based on network knowledge source
CN106610951A (en) Improved text similarity solving algorithm based on semantic analysis
CN105068997B (en) The construction method and device of parallel corpora
CN112948543A (en) Multi-language multi-document abstract extraction method based on weighted TextRank
CN111046660B (en) Method and device for identifying text professional terms
CN106611041A (en) New text similarity solution method
CN106202065B (en) Across the language topic detecting method of one kind and system
US20200311345A1 (en) System and method for language-independent contextual embedding
US20230103728A1 (en) Method for sample augmentation
Patil et al. Issues and challenges in marathi named entity recognition
CN111178009B (en) Text multilingual recognition method based on feature word weighting
CN103744837B (en) Many texts contrast method based on keyword abstraction
CN107894975A (en) A kind of segmenting method based on Bi LSTM
Bedrick et al. Robust kaomoji detection in Twitter
CN110705285B (en) Government affair text subject word library construction method, device, server and readable storage medium
CN108491383A (en) A kind of Thai sentence cutting method based on maximum entropy disaggregated model and the correction of Thai syntax rule
Nehar et al. Rational kernels for Arabic root extraction and text classification
CN109325237B (en) Complete sentence recognition method and system for machine translation
Doush et al. Improving post-processing optical character recognition documents with Arabic language using spelling error detection and correction
CN106776590A (en) A kind of method and system for obtaining entry translation
CN109960782A (en) A kind of Tibetan language segmenting method and device based on deep neural network
CN111310452A (en) Word segmentation method and device

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
CB03 Change of inventor or designer information

Inventor after: Du Quan

Inventor after: Bi Dong

Inventor before: Du Quan

Inventor before: Bi Dong

Inventor before: Zhu Jingbo

Inventor before: Xiao Tong

Inventor before: Zhang Chunliang

CB03 Change of inventor or designer information
GR01 Patent grant
GR01 Patent grant
PE01 Entry into force of the registration of the contract for pledge of patent right

Denomination of invention: A text multilingual recognition method based on feature word weighting

Granted publication date: 20230509

Pledgee: China Construction Bank Shenyang Hunnan sub branch

Pledgor: SHENYANG YAYI NETWORK TECHNOLOGY CO.,LTD.

Registration number: Y2024210000102

PE01 Entry into force of the registration of the contract for pledge of patent right