CN106598937A - Language recognition method and device for text and electronic equipment - Google Patents
- Publication number
- CN106598937A CN106598937A CN201510672933.XA CN201510672933A CN106598937A CN 106598937 A CN106598937 A CN 106598937A CN 201510672933 A CN201510672933 A CN 201510672933A CN 106598937 A CN106598937 A CN 106598937A
- Authority
- CN
- China
- Prior art keywords
- languages
- text
- feature
- identified
- candidate
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Landscapes
- Machine Translation (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a language identification method and device for text, and electronic equipment. The method comprises: extracting language features from a text to be identified; taking the extracted features as the input of a pre-generated text language classifier; and computing, with the classifier, the language to which the text to be identified belongs. The language features include at least one of N-gram word features, N-gram character features, and affix features. The method improves both the accuracy and the robustness of language identification; moreover, the training corpus need only be a set of historical queries annotated with the correct language, with no further annotation required, so the method is highly practical.
Description
Technical field
The present application relates to the field of language identification technology, and in particular to a language identification method, device, and electronic equipment for text.
Background technology
An international e-commerce website typically comprises an English main site and multilingual sub-sites, all of which are open to users worldwide. When a user logs in to any of these sites to search for products, the query may be written in any language. To understand user intent accurately, the first problem to solve is automatically identifying the language of the user's query text, i.e., text language identification. Only once the language of a text is accurately known can subsequent processing, such as translation or search, be carried out correctly.
At present, common text language identification methods include the following:
1) A United States patent obtained by Xerox in 2000, entitled "AUTOMATIC LANGUAGE IDENTIFICATION USING BOTH N-GRAM AND WORD INFORMATION" and published as US6167369A. The text language identification method proposed in that patent comprises the following steps:
Step 1: preprocess each word in the text to be identified.
Step 2: for each word, first judge whether it is a short word (limited in this scheme to at most 5 characters). If it is short, directly compute its probability of occurrence under each language; if it is long, obtain all of its character 3-grams and compute the probability of occurrence of each 3-gram under each language.
Step 3: for each language, combine all probability scores under that language, and select the language to which the text to be identified most probably belongs.
In summary, this method relies mainly on the relative frequencies of whole words and of character 3-grams to identify the language. In theory it amounts to the most basic N-Gram language model and is very simple.
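The scheme described in these steps can be sketched as follows. This is a minimal illustration of the short-word/character-trigram scoring idea, not the patented implementation; the probability tables, smoothing floor, and names are invented for the example:

```python
import math

def score_language(text, short_word_probs, trigram_probs, min_p=1e-8):
    """Log-score one candidate language: short words (<= 5 characters)
    are scored by whole-word probability; longer words by the
    probabilities of their character 3-grams."""
    log_score = 0.0
    for word in text.lower().split():
        if len(word) <= 5:
            log_score += math.log(short_word_probs.get(word, min_p))
        else:
            for i in range(len(word) - 2):
                log_score += math.log(trigram_probs.get(word[i:i + 3], min_p))
    return log_score

def identify(text, models):
    """Pick the language whose (short-word, trigram) model scores highest."""
    return max(models, key=lambda lang: score_language(text, *models[lang]))
```

With tiny stand-in tables such as `{"en": ({"the": 0.05}, {"ing": 0.01}), "fr": ({"le": 0.05}, {"eau": 0.01})}`, the highest-scoring language is selected exactly as in step 3 above.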
2) An article by Yahoo published in 2009 at ACL, the top conference in computational linguistics: "Language Identification of Search Engine Queries". The method proposed in this article uses a decision-tree model to combine three scores of the text to be identified under each language — a word-frequency probability score, an N-gram character-sequence probability score, and an affix score — and thereby identifies the language of the text.
This method has the advantage of being simple to implement with little computation. However, because the three scores it relies on are all simple probability scores, its discrimination is limited and both its coverage and its flexibility are poor. In addition, the decision-tree model used to combine the three scores must be trained on training data that is difficult to obtain, so the method has poor engineering practicability.
3) A United States patent application filed by Google in 2011, entitled "Query Language Identification" and published as US2011231423A1. The method proposed in that application comprises the following steps:
Step 1: receive a query (the text to be identified) from an interface.
Step 2: generate a class vector based on the interface language information.
Step 3: for each word in the query, generate a class vector according to the word's relative frequency in the corpus of each language.
Step 4: combine the interface class vector with the class vectors of all the query words.
Step 5: generate a single language class vector, in which the value of each dimension is the score of the corresponding language.
This method is also relatively simple: the information it actually uses is the relative frequency of each query word in each language's corpus, combined with the interface language information, from which the final classification result is drawn. It therefore still suffers from poor coverage and flexibility.
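The class-vector combination in the steps above can be sketched as follows. The combination operator is assumed here to be a simple per-language sum, and all names and vocabularies are invented for the example:

```python
def language_scores(query_words, interface_vec, corpus_freq,
                    langs=("en", "fr", "de")):
    """Combine an interface-language class vector with one class vector
    per query word (the word's relative frequency in each language's
    corpus). Summation is an illustrative choice of combination."""
    totals = dict(interface_vec)
    for word in query_words:
        word_vec = corpus_freq.get(word, {})
        for lang in langs:
            totals[lang] = totals.get(lang, 0.0) + word_vec.get(lang, 0.0)
    return totals   # value of each dimension = score of that language
```

The language with the largest dimension of the resulting vector is then taken as the classification result.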
In summary, existing text language identification methods are generally based on language-model technology: for a text to be identified, each language is scored using a language model trained offline for that language, and the highest-scoring language is taken as the final decision. Because the design of the prior art computes probability scores from simple relative-frequency statistics, it suffers from low identification accuracy and poor robustness.
Summary of the invention
The present application provides a language identification method, device, and electronic equipment for text, to solve the prior art's problems of low identification accuracy and poor robustness.
The application provides a language identification method for text, comprising:
extracting language features from a text to be identified;
taking the extracted language features as the input of a pre-generated text language classifier, and computing, with the classifier, the language to which the text to be identified belongs;
wherein the language features include at least one of N-gram word features, N-gram character features, and affix features.
Optionally, computing the language of the text to be identified with the text language classifier comprises:
using the extracted language features as retrieval keys, looking up, in a pre-generated correspondence of languages, language features, and their weights, the weight of each feature in each candidate language;
computing, from the retrieved feature weights, the score of the text to be identified for each candidate language;
taking the candidate language whose score exceeds a predetermined threshold as the language of the text to be identified.
Optionally, the feature weights are computed based on a discriminative model.
Optionally, the pre-generated correspondence of languages, language features, and their weights is generated with the following steps:
acquiring a text set annotated with the language of each text;
extracting the language features from each annotated text, and counting the number of times each feature occurs in each candidate language;
computing, from the extracted features and counts, for each feature the ratio of its number of occurrences in each candidate language to its total number of occurrences across all candidate languages, and taking that ratio as the feature's weight in that candidate language;
taking the set of (candidate language, language feature, feature weight) triples as the correspondence of languages, language features, and their weights.
Optionally, the correspondence of languages, language features, and their weights is stored as follows:
the N-gram word features and N-gram character features are stored in a dictionary-tree (trie) data structure.
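A dictionary tree (trie) stores the very large N-gram feature set compactly by sharing common prefixes, with the per-language weights held on the terminal node. A minimal sketch; the class names are illustrative, not from the patent:

```python
class TrieNode:
    __slots__ = ("children", "weights")
    def __init__(self):
        self.children = {}
        self.weights = None   # {language: weight} when a feature ends here

class FeatureTrie:
    """Dictionary-tree storage for N-gram features and their weights."""
    def __init__(self):
        self.root = TrieNode()

    def insert(self, feature, weights):
        node = self.root
        for ch in feature:
            node = node.children.setdefault(ch, TrieNode())
        node.weights = dict(weights)

    def lookup(self, feature):
        """Return the per-language weights of a feature, or None."""
        node = self.root
        for ch in feature:
            node = node.children.get(ch)
            if node is None:
                return None
        return node.weights
```

Lookup cost is linear in the feature's length regardless of how many tens of millions of features are stored, which is why a trie suits this scale.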
Optionally, the correspondence of languages, language features, and their weights is stored as follows:
for each language feature in the correspondence, the feature and its weight are stored only for those candidate languages in which the weight is non-zero.
Optionally, the text language classifier is a single-language text language classifier, and the score of the text to be identified for each candidate language is computed from the retrieved feature weights using a formula in which:
Y is the random variable for the language of the text to be identified; p is the score of the text for the specific language; x is the feature vector composed of the language features extracted from the text; and w is the weight vector composed of the feature weights corresponding to the features in x.
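The equation itself is not reproduced in this text. Given the quantities described (a feature vector x, a weight vector w, and a single score p for one language), a standard logistic form p = 1 / (1 + exp(-w·x)) is a plausible reading; the sketch below assumes that form and should not be taken as the patent's exact formula:

```python
import math

def single_language_score(x, w):
    """Score p of a text for one candidate language, assuming the
    standard logistic form p = 1 / (1 + exp(-w.x)). This is a plausible
    stand-in for the patent's (unreproduced) equation."""
    dot = sum(wi * xi for wi, xi in zip(w, x))
    return 1.0 / (1.0 + math.exp(-dot))
```

The score lies in (0, 1), so comparing it to a predetermined threshold, as the method above describes, is straightforward.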
Optionally, the text language classifier is a multilingual text language classifier, and the score of the text to be identified for each candidate language is computed from the retrieved feature weights using a formula in which:
xi is the text to be identified; pj is the score of the text for a particular candidate language j; f(xi) is the set of language features extracted from the text; λ1j through λmj are the feature weights of f(xi) in candidate language j; and Z is the normalizer, computed as the sum of the scores of all candidate languages, n being the number of candidate languages.
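These equations are likewise not reproduced here. The described quantities (per-language weights λ1j…λmj, features f(xi), and a normalizer Z summing over all candidate languages) match the standard maximum-entropy (softmax) form, which the sketch below assumes; it is a plausible reconstruction, not the patent's verbatim formula:

```python
import math

def multi_language_scores(features, lam):
    """Per-language scores assuming the maximum-entropy form
    p_j = exp(sum_i lam[i][j] * f_i) / Z, where Z sums over all
    candidate languages j. `lam` is one {language: weight} dict per
    feature; names are illustrative."""
    langs = lam[0].keys() if lam else []
    raw = {j: math.exp(sum(f * lam_i[j] for f, lam_i in zip(features, lam)))
           for j in langs}
    z = sum(raw.values())            # Z: sum of scores of all candidates
    return {j: s / z for j, s in raw.items()}
```

Because of the normalizer Z, the per-language scores sum to 1 and can be compared directly against the threshold.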
Optionally, the language features further include at least one of: the number of words in the text to be identified and the average word length, preset brand-word features, preset model-word features, language-specific character features, language-specific affix features, and service features.
Optionally, the N-gram character features include N-gram character sequences together with their positional information within the word.
Optionally, the pre-generated text language classifier includes at least one text language classifier oriented to a particular candidate language, and the classifiers oriented to particular candidate languages run one by one in a preset execution order.
Computing the language of the text to be identified with the text language classifier then proceeds as follows:
if the classifier currently oriented to a particular candidate language judges that the language of the text does not belong to that classifier's candidate language, then, following the preset execution order, the next text language classifier after the current one computes the language of the text;
if the classifier currently oriented to a particular candidate language judges that the language of the text belongs to that classifier's candidate language, language identification ends;
wherein a text language classifier oriented to a particular candidate language may be a single-language text language classifier or a multilingual text language classifier.
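The cascade described above can be sketched as an ordered walk over per-language classifiers. The predicate form of each classifier and the fallback value are illustrative simplifications, not the patent's mechanism:

```python
def cascade_identify(text, classifiers, fallback="unknown"):
    """Run per-language classifiers one by one in a preset order: the
    first classifier that claims the text ends the cascade; otherwise
    control passes to the next classifier in the list."""
    for lang, accepts in classifiers:
        if accepts(text):
            return lang
    return fallback
```

A usage example with two toy predicates: a CJK-character test ordered before an ASCII test, so cheap and precise checks can run early in the order.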
Optionally, it is described languages feature is extracted from text to be identified before, also include:
With the text to be identified as search condition, the retrieval in the intervention vocabulary for previously generating is waited to know with the presence or absence of described
Other text;The vocabulary of intervening includes the corresponding record collection of text and its affiliated languages;
If above-mentioned judged result is yes, by the text to be identified, corresponding affiliated languages are made in the intervention vocabulary
For the affiliated languages of the text to be identified.
Optionally, the intervention vocabulary is generated with the following steps:
acquiring texts that were misidentified;
recording each misidentified text together with its correct language as an entry of the intervention vocabulary.
Optionally, it is described languages feature is extracted from text to be identified before, also include:
The character included with the text to be identified is examined as search condition in the specific languages character code table for previously generating
The character that rope includes with the presence or absence of the text to be identified;
If above-mentioned judged result is yes, the character that the text to be identified is included is in the specific languages character code table
In it is corresponding belonging to languages as the affiliated languages of the text to be identified.
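One way to realize the character-code-table check is a scan over language-specific Unicode ranges; the ranges below (Japanese kana, Hangul syllables) are illustrative stand-ins for the patent's tables:

```python
def identify_by_charset(text, char_tables, fallback=None):
    """Short-circuit identification by script: if the text contains a
    character inside one language's code-table range, return that
    language before any feature extraction."""
    for lang, (lo, hi) in char_tables.items():
        if any(lo <= ch <= hi for ch in text):
            return lang
    return fallback
```

When no language-specific character is found, the method falls through (here, to `fallback`) and the classifier runs as usual.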
Optionally, it is described languages feature is extracted from text to be identified before, also include:
According to the brand vocabulary and model vocabulary for previously generating at least one, remove default product from the text to be identified
Board word or default model word.
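The brand/model-word removal step can be sketched as simple token filtering against the two preset vocabularies; the sample vocabularies are invented for the example:

```python
def strip_brand_model(text, brand_words, model_words):
    """Remove preset brand and model words before feature extraction,
    since tokens like 'iphone' or '6s' carry no language signal."""
    drop = {w.lower() for w in brand_words} | {w.lower() for w in model_words}
    return " ".join(w for w in text.split() if w.lower() not in drop)
```

After filtering, only the language-bearing words remain for the classifier to score.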
Optionally, the device that performs the language identification method for text is deployed in a distributed system.
Accordingly, the application also provides a language identification device for text, comprising:
an extracting unit, for extracting language features from a text to be identified;
a predicting unit, for taking the extracted language features as the input of a pre-generated text language classifier and computing, with the classifier, the language to which the text belongs;
wherein the language features are features on the order of tens of millions, including at least one of N-gram word features, N-gram character features, and affix features.
Optionally, the predicting unit comprises:
a retrieval subunit, for using the extracted language features as retrieval keys and looking up, in the pre-generated correspondence of languages, language features, and their weights, the weight of each feature in each candidate language;
a computation subunit, for computing, from the retrieved feature weights, the score of the text to be identified for each candidate language;
a setting subunit, for taking the candidate language whose score exceeds a predetermined threshold as the language of the text to be identified.
Optionally, also include:
Signal generating unit, for generating the corresponding relation of the languages, languages feature and its weight for previously generating;
The signal generating unit includes:
Subelement is obtained, for obtaining the text set for having marked affiliated languages;
Subelement is extracted, for extracting the languages feature in the text for having marked affiliated languages from each, and is counted
The number of times that the languages feature occurs respectively in each candidate's languages;
Computation subunit, for according to described in extracting each marked affiliated languages text languages feature and system
The number of times that the languages feature counted out occurs respectively in each candidate's languages, calculates and obtains each languages feature in each time
Select the number of times occurred respectively in languages and occur the ratio of total degree in all candidate's languages, exist as each languages feature
Feature weight in each candidate's languages;
Setting subelement, for by the tlv triple of described each candidate's languages, each languages feature and the feature weight
Set, as the corresponding relation of the languages, languages feature and its weight.
Optionally, the predicting unit includes at least one prediction subunit oriented to a particular candidate language; the prediction subunits run one by one in a preset execution order, each judging whether the language of the text to be identified belongs to its candidate language. If so, language identification ends; if not, the next prediction subunit after the current one computes the language of the text.
Each prediction subunit oriented to a particular candidate language computes the language of the text with a text language classifier oriented to that candidate language;
wherein a text language classifier oriented to a particular candidate language may be a single-language text language classifier or a multilingual text language classifier.
Optionally, also include:
Intervene unit, for the text to be identified as search condition, retrieving in the intervention vocabulary for previously generating to be
It is no to there is the text to be identified;It is if above-mentioned judged result is yes, the text to be identified is right in the intervention vocabulary
The affiliated languages answered are used as the affiliated languages of the text to be identified;
Wherein, the vocabulary of intervening includes the corresponding record collection of text and its affiliated languages.
Optionally, also include:
Character recognition unit, for the character that included with the text to be identified as search condition, in the spy for previously generating
The character that retrieval includes with the presence or absence of the text to be identified in attribute kind character code table;If above-mentioned judged result is yes, will
Languages are used as described to be identified belonging to the character that the text to be identified includes is corresponding in the specific languages character code table
The affiliated languages of text.
Optionally, also include:
At least one of removal noise unit, the brand vocabulary previously generated for basis and model vocabulary, treats from described
Identification text removes default brand word or default model word.
Accordingly, the application also provides an electronic device, comprising:
a display;
a processor; and
a memory, configured to store the language identification device for text; when the device is executed by the processor, the following steps are performed: extracting language features from a text to be identified; taking the extracted features as the input of a pre-generated text language classifier, and computing, with the classifier, the language to which the text belongs; wherein the language features include at least one of N-gram word features, N-gram character features, and affix features.
Additionally, the application also provides another is used for the Language Identification of text, including:
With text to be identified as search condition, the retrieval in the intervention vocabulary for previously generating whether there is the text to be identified
This;The vocabulary of intervening includes the corresponding record collection of text and its affiliated languages;
If above-mentioned judged result is yes, by the text to be identified, corresponding affiliated languages are made in the intervention vocabulary
For the affiliated languages of the text to be identified.
Optionally, the intervention vocabulary is generated using following steps:
The text that acquisition is erroneously identified;
Using the text being erroneously identified and its affiliated correct languages as the record for intervening vocabulary.
Optionally, also include:
If above-mentioned judged result is no, the text to be identified is obtained by the text languages classifier calculated for previously generating
Languages belonging to this.
Accordingly, the application also provides a language identification device for text, comprising:
a retrieval unit, for using a text to be identified as a retrieval key and looking up whether it exists in a pre-generated intervention vocabulary, the intervention vocabulary being a record set of texts and their languages;
a judging unit, for taking, if the text exists in the vocabulary, the language recorded for it as the language of the text to be identified.
Optionally, the device further comprises:
a predicting unit, for computing, if the text does not exist in the vocabulary, the language of the text with a pre-generated text language classifier.
Additionally, the application also provides another is used for the Language Identification of text, including:
As search condition, the retrieval in the specific languages character code table for previously generating is the character included with text to be identified
It is no to there is the character that the text to be identified includes;
If above-mentioned judged result is yes, the character that the text to be identified is included is in the specific languages character code table
In it is corresponding belonging to languages as the affiliated languages of the text to be identified.
Optionally, also include:
If above-mentioned judged result is no, the text to be identified is obtained by the text languages classifier calculated for previously generating
Languages belonging to this.
Accordingly, the application also provides a kind of languages identifying device for text, including:
Retrieval unit, for the character that included with text to be identified as search condition, in the specific languages word for previously generating
The character that retrieval includes with the presence or absence of the text to be identified in symbol table;
Judging unit, if being yes for above-mentioned judged result, the character that the text to be identified is included is in the spy
In attribute kind character code table it is corresponding belonging to languages as the affiliated languages of the text to be identified.
Optionally, also include:
Predicting unit, if being no for above-mentioned judged result, is obtained by the text languages classifier calculated for previously generating
Take the affiliated languages of the text to be identified.
Additionally, the application also provides another is used for the Language Identification of text, including:
According to the brand vocabulary and model vocabulary for previously generating at least one, remove default brand word from text to be identified
Or default model word;
The affiliated languages of the text to be identified are obtained by the text languages classifier calculated for previously generating.
Accordingly, the application also provides a kind of languages identifying device for text, including:
At least one of filter element, the brand vocabulary previously generated for basis and model vocabulary, from text to be identified
Remove default brand word or default model word;
Predicting unit, for obtaining the affiliated language of the text to be identified by the text languages classifier calculated for previously generating
Kind.
Additionally, the application also provides another is used for the Language Identification of text, including:
Languages feature is extracted from text to be identified;
To preset each text languages grader that execution sequence runs one by one predetermined number, by the text languages point
Class device judges whether the affiliated languages of the text to be identified belong to candidate's languages of the text languages grader;If so, then tie
Beam languages are recognized;
Wherein, the languages feature includes at least the one of the continuous word feature of N units, the continuous character feature of N units and affixe feature
Person.
Optionally, the text languages grader includes the text languages grader or multilingual text languages of single languages
Grader.
Accordingly, the application also provides a kind of languages identifying device for text, including:
Extracting unit, for extracting languages feature from text to be identified;
Predicting unit, for preset each text languages grader that execution sequence runs one by one predetermined number, passing through
The text languages grader judges whether the affiliated languages of the text to be identified belong to the candidate of the text languages grader
Languages;If so, languages identification is then terminated;
Wherein, the languages feature includes at least the one of the continuous word feature of N units, the continuous character feature of N units and affixe feature
Person.
Compared with the prior art, the application has the following advantages:
The application provides a language identification method, device, and electronic equipment for text, which extract language features from a text to be identified, take the extracted features as the input of a pre-generated text language classifier, and compute, with the classifier, the language to which the text belongs, the language features including at least one of N-gram word features, N-gram character features, and affix features. Because the method relies on features on the order of tens of millions, it improves both the accuracy and the robustness of language identification; and because the training corpus need only be a set of historical queries annotated with the correct language, with no further annotation required, the method is highly practical.
Description of the drawings
Fig. 1 is a flowchart of an embodiment of the language identification method for text of the application;
Fig. 2 is a detailed flowchart of step S103 of the method embodiment;
Fig. 3 is a detailed flowchart of generating the correspondence of languages, language features, and their weights in the method embodiment;
Fig. 4 is a schematic diagram of the storage of the correspondence of languages, language features, and their weights generated by the method embodiment;
Fig. 5 is a schematic diagram of a system in which the method embodiment is deployed in a distributed manner;
Fig. 6 is a schematic diagram of the multi-layer identification architecture of the method embodiment;
Fig. 7 is a schematic diagram of an embodiment of the language identification device for text of the application;
Fig. 8 is a schematic diagram of the predicting unit 103 of the device embodiment;
Fig. 9 is a schematic diagram of the generating unit 201 of the device embodiment;
Fig. 10 is another schematic diagram of the device embodiment;
Fig. 11 is a schematic diagram of an embodiment of the electronic device of the application.
Specific embodiments
Many specific details are set forth in the following description in order to facilitate a full understanding of the application. The application can, however, be implemented in many ways other than those described here, and those skilled in the art can make similar generalizations without departing from its essence; the application is therefore not limited by the specific embodiments disclosed below.
The application provides a language identification method, device, and electronic equipment for text, which are described in detail one by one in the following embodiments.
The core idea of the language identification method for text provided in the embodiments of the application is: to design language features on the order of tens of millions and adopt a machine learning model to identify the language of the text to be identified. Because the method carries out language identification based on features on this scale, it improves the accuracy and robustness of text language identification.
Referring to Fig. 1, which is a flowchart of an embodiment of the language identification method for text of the application, the method comprises the following steps:
Step S101: extract language features from the text to be identified.
In any pattern-recognition system built with machine learning methods, the most important part is the design of the model's features. The language features proposed by the inventors of the present technical solution number on the order of tens of millions, most of which are N-gram word features, N-gram character features, or affix features; these few classes constitute the basic language features. In addition, the inventors designed language features of the following further categories: statistical features, for example word count and average word length; brand-word and model-word features; language-specific character features and language-specific affix features; and service features, for example IP address, country of origin, website, and region setting. Each class of language feature is briefly described below.
1) N-gram word features
N-gram word features are a class of language features designed according to N-Gram theory. For example, if the text to be identified is "iphone 6s case", the word features that can be extracted from it include:
3 unigram word features: iphone, 6s, case
2 bigram word features: iphone 6s, 6s case
1 trigram word feature: iphone 6s case
By constructing the language features of the text to be identified according to N-Gram theory, and implementing the language identification method for text provided by the embodiments on that basis, the languages of the many text types in wide use on the Internet can be recognized automatically. Experimental results show that text language identification based on N-gram word features achieves a relatively high and stable correct recognition rate.
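The word N-gram extraction described above can be sketched in a few lines of Python; the function name and plain whitespace tokenization are illustrative assumptions, not details from the patent:

```python
def word_ngrams(text, n):
    # Split the query on whitespace and emit every run of n
    # consecutive words as a single feature string.
    words = text.split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

query = "iphone 6s case"
# Collect unigram, bigram, and trigram word features for the query.
features = [g for n in (1, 2, 3) for g in word_ngrams(query, n)]
```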
2) N-gram character features
In practice, one of the main application scenarios of text language identification is search. In a search scenario, the query entered by the user is usually short; a query typically contains only 1 to 3 words, and the order of the words is unconstrained. Because the language-model scores produced by the prior art are unstable on short texts, the prior art suffers from a low correct recognition rate when identifying the language of short texts, so the word-based N-gram language-model techniques common in general natural language processing do not apply here.
The embodiments of this application therefore propose character-based N-Gram techniques, i.e. N-gram character features. N-gram character features are another class of language features designed according to N-Gram theory; their first difference from the N-gram word features above is that their unit is a single character. For example, the 3-gram character features extracted from the text "iphone 6s case" of the previous example include iph, pho, one, and so on.
Furthermore, to capture the position of a character N-gram within a word, for example at its head, at its tail, or at a junction between words, the N-gram character features of the embodiments also encode that position. In this embodiment, the affix "HEAD_" marks the start of a word, "TAIL_" marks its end, and "_HYP_" marks a junction. For example, the positional 3-gram character features extracted from the text "iphone 6s case" of the previous example include HEAD_iph, HEAD_cas, TAIL_one, TAIL_ase, e_HYP_6s, and so on. By adding position information to the N-gram character features, language identification can be based on finer-grained character features, which improves the correct recognition rate.
In particular, considering that short character features (for example 1-gram or 2-gram character features) discriminate poorly between languages, the inventors further propose using higher-order character features (3-grams and above) as the N-gram character features. Experimental results show that text language identification based on higher-order N-gram character features achieves a relatively high and stable correct recognition rate.
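A minimal sketch of positional character N-grams, under the assumptions that a window at the start of a word takes "HEAD_", one at the end takes "TAIL_", and "_HYP_" joins the last character of a word to the following word (function names are invented for illustration):

```python
def char_ngrams(word, n):
    # Tag each n-character window with its position in the word:
    # HEAD_ at the start, TAIL_ at the end, untagged in the middle.
    grams = []
    last = len(word) - n
    for i in range(last + 1):
        g = word[i:i + n]
        if i == 0:
            grams.append("HEAD_" + g)
        if i == last:
            grams.append("TAIL_" + g)
        if 0 < i < last:
            grams.append(g)
    return grams

def junction_features(words):
    # _HYP_ joins the last character of a word to the next word,
    # e.g. "iphone", "6s" -> "e_HYP_6s".
    return [a[-1] + "_HYP_" + b for a, b in zip(words, words[1:])]
```

For the word "iphone" this yields HEAD_iph, pho, hon, TAIL_one, matching the examples in the text.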
3) Affix features
The affix features of the embodiments are features formed from the affixes common in each language, including prefix features and suffix features; for example, in English a string such as pre is a prefix feature and a string such as ing is a suffix feature. In implementation, affix features can be extracted from the text to be identified according to a pre-stored affix table.
4) Statistical features
The statistical features of the embodiments are language features obtained by statistical methods. Because texts of different languages differ in the number of words they contain and in their average word length, these quantities can also serve as evidence for language identification; the statistical features of the embodiments therefore include the total number of words contained in the text to be identified, the average length of those words, and the like.
5) Brand-word features and model-word features
In practice, brand words, model words, or generic descriptive words may be mixed into the text to be identified, which increases the difficulty of language identification. The difficulty is especially great for short texts (for example query words) containing brand words, model words, or generic descriptive words. By designing brand-word and model-word features, the method provided by the embodiments can take into account, when recognizing the language of a text, whether such words are mixed into it, which also helps improve the correct recognition rate. In implementation, brand-word and model-word features can be extracted from the text to be identified according to pre-stored brand and model vocabularies.
6) Language-specific character features and language-specific affix features
The language models on which the prior art is based are trained from large corpora, and the corpora of highly similar languages are themselves highly similar; moreover, some languages are inherently very similar to one another. The prior art therefore performs poorly when recognizing highly similar languages.
To solve this problem, the inventors of this technical scheme designed new language features such as language-specific character features and language-specific affix features. A language-specific character feature or language-specific affix feature of the embodiments is a character feature or affix feature exclusive to one language, which distinguishes that language from all other languages; for example, a character peculiar to Portuguese can be designed and expressed as such a feature.
Experimental results show that text language identification based on language-specific character features and language-specific affix features solves well the problem of poor recognition among highly similar languages, thereby improving the recognition of similar languages.
7) Service features
Today, real-time international e-commerce websites generally have many online service features, for example the cookie information, locale information, and IP address information associated with a query. The service features of the embodiments can provide useful signals for language identification; for example, a query issued from an IP address in a Chinese region is more likely to be Chinese. Experimental results show that the mechanism of exploiting service features can specifically improve the accuracy of language identification under different business scenarios.
It should be noted that, given the importance of identifying English and the ubiquity of English queries across websites, a text language classifier that only recognizes English text generally does not need to use service features.
The sections above have described each category of language features of the embodiments. Because the method provided by the embodiments performs language identification based on language features numbering in the tens of millions, it can effectively improve the correct recognition rate of text languages.
Step S103: Take the extracted language features as the input of a pre-generated text language classifier, and compute with that classifier the language of the text to be identified.
Once the language features of the text to be identified have been extracted in step S101, the language of that text can be computed by the pre-generated text language classifier.
The text language classifier of the embodiments is a text language classifier built with machine learning methods. Referring to Fig. 2, which is a detailed flowchart of step S103 of the embodiment of the language identification method for text of this application, in this embodiment step S103 comprises the following steps:
Step S1031: Using the extracted language features as search conditions, retrieve from a pre-generated correspondence of languages, language features, and weights the feature weight of each language feature in each candidate language.
The correspondence of languages, language features, and weights of the embodiments is the parameter model of the text language classifier obtained by training on a given corpus. The correspondence comprises the set of triples of each candidate language, each language feature, and its feature weight. To implement the language identification method for text provided by this application, this correspondence of languages, language features, and weights must first be generated, i.e. the parameter model of the text language classifier must be obtained by training on a multilingual corpus.
Referring to Fig. 3, which is a detailed flowchart of generating the correspondence of languages, language features, and weights in the embodiment of the language identification method for text of this application, in this embodiment generating the correspondence comprises the following steps:
Step S301: Obtain a text set labeled with the language of each text.
The language identification method for text provided by the embodiments is a language identification method based on machine learning that adjusts the parameters of the classifier, i.e. the feature weights, using a corpus of known classes, and therefore belongs to supervised learning. In supervised learning, each example consists of an input object (usually a vector) and a desired output value (also called the supervisory signal). The training corpus is therefore a text set labeled with the language of each text.
The labeled text set of the embodiments contains texts of each candidate language that the text language classifier can recognize. For example, corpus text 1 is: en ||| iphone 4s case plastic; corpus text 2 is: es ||| iphone 4s caso plástico; corpus text 3 is: en ||| iphone6s screen; and so on.
Step S303: Extract the language features from each labeled text, and count the number of times each language feature occurs in each candidate language.
After the labeled text set has been obtained, the language features must be extracted from each corpus text; these language features are the same concept as the language features described in step S101. While extracting the features from the corpus, the number of occurrences of each language feature in each candidate language must also be counted; for example, the unigram word iphone occurs 500 times in the English corpus, the unigram word caso occurs 300 times in Spanish, and so on.
Step S305: According to the language features extracted from each labeled text and the counted number of occurrences of each language feature in each candidate language, compute for each language feature the ratio of its occurrences in each candidate language to its total occurrences across all candidate languages, as the feature weight of that language feature in that candidate language.
After feature extraction is complete for all corpus texts and the number of occurrences of each language feature in each candidate language has been counted, the total number of occurrences of each language feature over the whole corpus must also be computed. Finally, the ratio of a feature's occurrences in each candidate language to its total occurrences across all candidate languages is taken as its feature weight in that candidate language. For example, suppose the training corpus involves texts of 3 languages (English, Spanish, and Portuguese), and the unigram word iphone occurs 500 times in the English corpus, 200 times in the Spanish corpus, and 260 times in the Portuguese corpus, for a total of 960 occurrences in the corpus; then the weight of the language feature iphone is 500/960 in English, 200/960 in Spanish, and 260/960 in Portuguese. As can be seen, the feature weights of the embodiments are computed by a discriminative method, whereas the prior art only computes the relative frequency with which each word appears within its own language, i.e. computes word frequencies by a generative method. Because the embodiments compute feature weights discriminatively, they achieve a higher correct recognition rate.
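The discriminative weight computation of step S305 reduces to a count-normalization pass; this sketch assumes the counts arrive as a mapping from (language, feature) pairs, which is an illustrative choice, not the patent's data layout:

```python
from collections import defaultdict

def feature_weights(counts):
    # counts maps (language, feature) to occurrences in that language's
    # corpus; the weight of a feature in a language is its count there
    # divided by its total count across all candidate languages.
    totals = defaultdict(int)
    for (lang, feat), c in counts.items():
        totals[feat] += c
    return {(lang, feat): c / totals[feat]
            for (lang, feat), c in counts.items()}

# The iphone example from the text: 500 + 200 + 260 = 960 occurrences.
counts = {("en", "iphone"): 500, ("es", "iphone"): 200, ("pt", "iphone"): 260}
weights = feature_weights(counts)
```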
Step S307: Take the set of triples of each candidate language, each language feature, and its feature weight as the correspondence of languages, language features, and weights.
Through steps S301 to S305, the feature weight of each language feature under each candidate language is obtained; the set of triples of each candidate language, each language feature, and its feature weight constitutes the correspondence of languages, language features, and weights.
Referring to Table 1, which is a sample of the correspondence of languages, language features, and weights generated by the embodiment of the language identification method for text of this application:
Language | Feature string | Feature weight
en | iphone | 0.1
es | iphone | 0.05
en | case | 0.3
es | plástico | 1
…
Table 1. Sample of the correspondence of languages, language features, and weights
Once the correspondence of languages, language features, and weights has been generated by training, the language features extracted from the text to be identified can be used as search conditions, and the feature weight of each such feature in each candidate language can be retrieved from the correspondence. For example, if the text to be identified is iphone 5s plástico model, the language features extracted from it include (listing only unigram word features): iphone, 5s, plástico, and model. After retrieval in the model of Table 1, the activated language features are as shown in Table 2:
Language | Feature string | Feature weight
en | iphone | 0.1
es | iphone | 0.05
es | plástico | 1
Table 2. Example of activated features
As Table 2 shows, the word 5s is a model word and is filtered out in the preprocessing stage, while the word model represents a language feature that cannot be retrieved in the parameter model; neither contributes to the language decision.
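The activation step can be pictured as a filter over the parameter model; the dictionary layout below mirrors Table 1 but is an illustrative assumption:

```python
# Parameter model keyed by (language, feature), as in Table 1.
model = {("en", "iphone"): 0.1, ("es", "iphone"): 0.05,
         ("en", "case"): 0.3, ("es", "plástico"): 1.0}

def activated_features(features, languages, model):
    # Keep only the (language, feature) pairs present in the model;
    # "model" was never seen in training and "5s" was already
    # filtered as a model word, so neither activates anything.
    return [(lang, f, model[(lang, f)])
            for f in features for lang in languages
            if (lang, f) in model]

rows = activated_features(["iphone", "plástico", "model"], ["en", "es"], model)
```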
It should be noted that, in practice, the trained correspondence of languages, language features, and weights contains more than ten million language features, so the speed of the feature retrieval step S1031 greatly affects the performance of the whole language identification. To raise the speed of feature retrieval, the embodiments optimize the storage of the correspondence of languages, language features, and weights in two respects, described below.
1) Storage mode one: store the N-gram word features and N-gram character features in a dictionary tree data structure.
The dictionary tree of the embodiments, also known as a word lookup tree or trie, is a tree structure and a variant of the hash tree. Its advantage is that it uses the common prefixes of strings to reduce query time, minimizes meaningless string comparisons, and achieves higher search efficiency than a hash tree.
The embodiments propose storing the N-gram word features and N-gram character features in a trie, so that when some language feature x finds no match, the search for any feature x+a (where a is an arbitrary string) can be abandoned immediately. Experimental results show that this storage strategy is clearly effective for both N-gram word features and N-gram character features.
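A minimal trie sketch of storage mode one; the class and function names are invented for illustration, and the per-language weight payload is an assumed representation:

```python
class TrieNode:
    def __init__(self):
        self.children = {}
        self.weights = None  # per-language weights once a feature ends here

def insert(root, feature, weights):
    node = root
    for ch in feature:
        node = node.children.setdefault(ch, TrieNode())
    node.weights = weights

def lookup(root, feature):
    # The walk fails as soon as one character is missing, so every
    # feature extending the failed prefix is ruled out in one step.
    node = root
    for ch in feature:
        node = node.children.get(ch)
        if node is None:
            return None
    return node.weights

root = TrieNode()
insert(root, "iphone", {"en": 0.1, "es": 0.05})
```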
2) Storage mode two: for each language feature in the correspondence of languages, language features, and weights, store together all the candidate languages in which the feature's weight is non-zero.
In the feature retrieval of a typical multilingual text language classifier, each language feature x must be combined with each candidate language y and the pair (x+y) searched, so each language feature requires L feature-set searches (L being the number of candidate languages). The embodiments instead propose storing each language feature in the parameter model of the text language classifier together with all of its corresponding candidate languages, in the manner of an inverted index.
Referring to Fig. 4, which is a storage diagram of the correspondence of languages, language features, and weights generated by the embodiment of the language identification method for text of this application: with the storage mode shown in Fig. 4, each language feature needs only one retrieval to return all candidate languages that may match, and overall retrieval efficiency can improve by a factor of L.
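Storage mode two amounts to regrouping the triples of Table 1 by feature string; the literal dictionaries below are only a sketch of the inverted-index idea:

```python
# (language, feature, weight) triples, as in Table 1.
triples = [("en", "iphone", 0.1), ("es", "iphone", 0.05),
           ("en", "case", 0.3), ("es", "plástico", 1.0)]

# Inverted index: each feature maps directly to every candidate
# language whose weight for it is non-zero, so a single lookup
# replaces L per-language probes.
index = {}
for lang, feat, weight in triples:
    if weight != 0:
        index.setdefault(feat, []).append((lang, weight))
```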
Step S1033: According to the retrieved feature weights of the language features in each candidate language, compute the score of the text to be identified for each candidate language.
After the feature weight of each language feature of the text to be identified in each candidate language has been obtained in step S1031, the score of the text to be identified for each candidate language can be computed from these feature weights.
The text language classifier of the embodiments may be a single-language text language classifier or a multilingual text language classifier. For example, a single-language text language classifier such as an English classifier decides a single language; a multilingual text language classifier can cover multiple candidate languages, depending on how many text languages the training corpus contains. The single-language and multilingual text language classifiers are described separately below.
1) Single-language text language classifier
When the text language classifier of the embodiments is a single-language text language classifier, the computation in step S1033 of the scores of the text to be identified for each candidate language, from the retrieved feature weights of the language features in each candidate language, can use the following formula:
where Y is the random variable of the language of the text to be identified; P is the score that the text to be identified belongs to the specific language; x is the feature vector composed of the language features extracted from the text to be identified; and w is the weight vector composed of the feature weights corresponding to the language features in x.
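The formula image itself does not survive extraction; given the variable definitions above and the logistic regression model the embodiment names, it is presumably the standard logistic form:

```latex
P(Y = 1 \mid x) \;=\; \frac{1}{1 + e^{-w^{\top} x}}
```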
In this embodiment, the single-language text language classifier is an English discriminator, and P(Y=1) denotes the probability that the text to be identified is English. The single-language text language classifier provided by the embodiments uses a logistic regression model; in practice, other machine learning models may also be used, for example support vector machines, CRFs, or decision trees. These alternative machine learning models are merely variations of the specific embodiment, none departing from the core of this application, and all therefore fall within its scope of protection.
2) Multilingual text language classifier
When the text language classifier of the embodiments is a multilingual text language classifier, the computation in step S1033 of the scores of the text to be identified for each candidate language, from the retrieved feature weights of the language features in each candidate language, can use the following formula:
where x_i is the text to be identified; p_j is the score that the text to be identified belongs to the particular candidate language j; f(x_i) are the language features extracted from the text to be identified; λ_1j to λ_mj are the feature weights of f(x_i) in the particular candidate language j; and Z is the sum of the scores of all candidate languages, computed with the following formula.
In the above formulas, n is the number of languages the multilingual text language classifier can recognize.
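The two formula images are missing from the extraction; given the variables defined above and the maximum entropy model the embodiment names, they are presumably the standard maxent form:

```latex
p_j \;=\; \frac{1}{Z}\,\exp\!\Big(\sum_{k=1}^{m} \lambda_{kj}\, f_k(x_i)\Big),
\qquad
Z \;=\; \sum_{j'=1}^{n} \exp\!\Big(\sum_{k=1}^{m} \lambda_{kj'}\, f_k(x_i)\Big)
```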
The multilingual text language classifier provided by the embodiments uses a Maximum Entropy Model. Maximum entropy is a machine learning method that performs well in many areas of natural language processing, such as part-of-speech tagging, Chinese word segmentation, sentence boundary detection, shallow parsing, and text classification. A maximum entropy model can integrate diverse related or unrelated probabilistic knowledge and handle many problems with good results. Experimental results show that language identification based on a maximum entropy model is effective: it not only yields the most consistent distribution but also guarantees the precision and recall of language identification. Likewise, other machine learning models may also be used in practice, for example support vector machines, CRFs, or decision trees; these are merely variations of the specific embodiment, none departing from the core of this application, and all therefore fall within its scope of protection.
Step S1035: Take the candidate language whose score exceeds a predetermined threshold as the language of the text to be identified.
With the scores of the text to be identified for each candidate language obtained in step S1033, the candidate language whose score exceeds the predetermined threshold is taken as the language of the text to be identified; in practice, the highest-scoring candidate language is usually chosen. For example, computing the scores of the different candidate languages from the activated features of Table 2 gives: score of es = 0.05 + 1 = 1.05; score of en = 0.1. Since the es score exceeds the en score, the text to be identified is judged to belong to es.
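The worked example above, scoring and then picking the best language over a threshold, can be sketched as follows; the additive scoring and the zero default threshold are illustrative simplifications of the classifier described in steps S1033 and S1035:

```python
def classify(features, model, languages, threshold=0.0):
    # Sum the retrieved weights per candidate language, then pick the
    # highest-scoring language if it clears the threshold.
    scores = {lang: sum(model.get((lang, f), 0.0) for f in features)
              for lang in languages}
    best = max(scores, key=scores.get)
    return (best if scores[best] > threshold else None), scores

# Model and query from Tables 1 and 2: es scores 0.05 + 1.0 = 1.05,
# en scores 0.1, so the text is judged to be es.
model = {("en", "iphone"): 0.1, ("es", "iphone"): 0.05, ("es", "plástico"): 1.0}
lang, scores = classify(["iphone", "5s", "plástico", "model"], model, ["en", "es"])
```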
The language identification method for text realized by steps S101 and S103 above is a machine-learning-based language identification method. In practice, several optimization strategies can be applied on top of it to further improve the correct recognition rate of text languages; the optimization strategies adopted by the embodiments are described in turn below.
1) Optimization strategy one
In practice, training on a corpus to obtain the correspondence of languages, language features, and weights is a very time-consuming operation, so real-time retraining of the corpus is impractical. The problem such non-real-time training may bring is that a more accurate parameter model of the text language classifier cannot be learned promptly from recent recognition results.
A practical online language identification service needs a fast reaction mechanism against error phenomena occurring online. To solve the above problem and intervene quickly against sudden errors, the embodiments use a pre-generated intervention vocabulary to intervene rapidly against sudden error phenomena in the concrete online application system, improving the correct recognition rate of text languages.
The intervention vocabulary of the embodiments records a batch of historically misidentified texts labeled with their correct languages. To illustrate what a misidentified text is: in a query search, a query word whose language was recognized incorrectly is such a misidentified text.
The scheme of optimization strategy one is: before step S101 extracts language features from the text to be identified, use the text to be identified as the search condition and check whether it exists in the pre-generated intervention vocabulary, which comprises records mapping texts to their languages; if it does, take the language recorded for the text in the intervention vocabulary as the language of the text to be identified.
The language identification method for text provided by the embodiments thus includes an intervention vocabulary mechanism: the text to be identified first passes through the intervention vocabulary module, and if the intervention vocabulary contains the text, its language can be decided directly without the text language classifier. Specifically, the intervention vocabulary lookup can use matching strategies such as exact whole-string matching, partial matching, and weighted matching, intervening quickly against sudden online error phenomena from multiple angles.
The intervention vocabulary of the embodiments is generated with the following steps: 1) obtain the misidentified texts; 2) take each misidentified text together with its correct language as a record of the intervention vocabulary. That is, once a misidentified text is obtained, the text and its correct language are appended directly to the intervention vocabulary for later lookup.
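The two generation steps above reduce to maintaining a lookup table; this sketch uses exact whole-string matching only (the embodiments also mention partial and weighted matching), and all names are invented for illustration:

```python
intervention = {}

def record_misidentified(text, correct_language):
    # Append a misidentified text with its correct language so the
    # next identical query bypasses the classifier entirely.
    intervention[text] = correct_language

def intervene(text):
    # Exact whole-string match; returns None when the classifier
    # should be consulted instead.
    return intervention.get(text)

record_misidentified("iphone 5s plástico model", "es")
```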
2) Optimization strategy two
A large international e-commerce website generally supports more than 10 languages, so the language identification technique must support the identification of at least 10 language categories. Because most languages share characters with other languages, most languages must be recognized with the machine-learning-based language identification method provided by the embodiments. However, the character set of some languages occupies its own code range in the Unicode tables, so such languages can be judged directly from their Unicode encoding; for example, Russian (Cyrillic) characters generally lie in the code range 0x0400 to 0x052F.
The scheme of optimization strategy two is: before step S101 extracts language features from the text to be identified, use the characters contained in the text as the search condition and check whether they exist in a pre-generated language-specific character code table; if they do, take the language to which those characters correspond in the language-specific character code table as the language of the text to be identified.
Optimization strategy two thus identifies the language of the text to be identified by combining the character-code recognition method with the machine-learning-based language identification method. The languages handled by the character-code recognition method include Russian, Hebrew, Korean, Thai, Arabic, and so on; experimental results show its correct recognition rate exceeds 99%. The languages handled by the machine-learning-based language identification method include English, Portuguese, Spanish, German, French, Italian, Turkish, Vietnamese, Indonesian, and Dutch; experimental results show that, except for Portuguese and Spanish, the F1 measure exceeds 90%, reaching 98% for English.
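The character-code method can be sketched as a code-range check; the Cyrillic range is the one cited in the text, while the Hebrew and Thai blocks and all names are illustrative assumptions:

```python
RANGES = {
    "ru": (0x0400, 0x052F),  # Cyrillic, the range cited above
    "he": (0x0590, 0x05FF),  # Hebrew block (assumed)
    "th": (0x0E00, 0x0E7F),  # Thai block (assumed)
}

def code_range_language(text):
    # Decide by code points alone: if every letter of the text falls
    # inside one script's dedicated block, return that language;
    # otherwise fall back to the machine-learning classifier (None).
    letters = [c for c in text if c.isalpha()]
    if not letters:
        return None
    for lang, (lo, hi) in RANGES.items():
        if all(lo <= ord(c) <= hi for c in letters):
            return lang
    return None
```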
3) Optimization strategy three
In practice, user queries are fairly free-form and may contain brand words, model words, and various descriptive words, for example iPhone 5S or Cannon D70. Brand words and model words are usually written in international English. Moreover, English queries make up a large share of the traffic of international e-commerce websites, and even users in non-English-speaking countries very commonly type English queries. Such special words add considerable noise to language identification and strongly affect its accuracy. For example, a Chinese query containing "Cannon D70" is itself a Chinese text, but because it contains a brand word and a model word it is easily misidentified as English. The prior art, however, makes no particular design for such special words.
Optimisation strategy three is that, before step S101 of extracting language features from the text to be identified, the method further includes: removing a preset brand word or a preset model word from the text to be identified according to at least one of a pre-generated brand vocabulary and a pre-generated model vocabulary.
By specially processing special words such as brand words and model words, optimisation strategy three gives particular consideration to English queries, thereby achieving the effect of improving the correct recognition rate.
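The removal of preset brand words and model words can be sketched as a vocabulary-based filter applied before feature extraction. The vocabularies and names below are hypothetical examples, not the actual pre-generated vocabularies.

```python
# Hypothetical pre-generated vocabularies for the sketch.
BRAND_WORDS = {"iphone", "cannon"}   # assumed brand vocabulary entries
MODEL_WORDS = {"5s", "d70"}          # assumed model vocabulary entries

def remove_noise_words(text):
    """Drop tokens matching a preset brand word or preset model word,
    so that they do not add noise to the language features."""
    kept = [token for token in text.split()
            if token.lower() not in BRAND_WORDS
            and token.lower() not in MODEL_WORDS]
    return " ".join(kept)
```

After filtering, only the descriptive vocabulary remains, so the language features extracted in step S101 reflect the actual language of the query.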
4) optimisation strategy four
In a large-scale international e-commerce website, the number of query requests received per second (QPS) can reach thousands or even tens of thousands, and users are highly sensitive to the waiting time (latency) of query results. Therefore, optimising the performance of language identification is of great importance.
Optimisation strategy four is to deploy the apparatus that performs the language identification method for text provided by the embodiments of the present application (the language identification apparatus) in a distributed system, and to optimise the performance of language identification from the perspective of multi-threaded concurrency. In this embodiment, the language identification apparatus adopts a Blender/Searcher distributed architecture scheme to improve the concurrent service capability of language identification. Referring to Fig. 5, it is a system schematic diagram of the distributed deployment of the embodiment of the language identification method for text of the present application.
5) optimisation strategy five
The prior art typically performs language identification based on a single-layer framework, i.e., all candidate languages are considered uniformly, without any optimisation specific to particular languages. In practical applications, languages such as English are common in query texts. In order to optimise the language identification of texts in common languages, optimisation strategy five proposed by the embodiments of the present application is to adopt a multi-level language identification framework in which a dedicated single-language identification layer is designed for a common language, for example, a language identification layer specifically for English. By adopting a multi-level language identification framework, a special optimisation capability for specific languages can be provided.
In practical applications, a similar hierarchical design can be applied to any language that actually needs optimisation, and can even be extended to a multi-layer stage-by-stage discrimination model in which each level performs two-class or three-class language discrimination. The above various multi-level language identification frameworks are all variations of specific embodiments; none departs from the core of the present application, and all therefore fall within the protection scope of the present application.
Referring to Fig. 6, it is a schematic diagram of the multi-layer identification framework of the embodiment of the language identification method for text of the present application. In Fig. 6, the upper layers (A-X) are single-language text language classifiers, each for a specific language, which only give a "yes" or "no" decision for their specific language; if the text to be identified does not belong to any of these specific languages, the final multilingual text language classifier gives the optimal language class from among multiple candidate languages. It should be noted that, in the output results of the multilingual text language classifier, it can still be specified whether to output the previously discriminated classes "A-X".
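The multi-layer framework of Fig. 6 can be sketched as a cascade: single-language yes/no layers run in a preset order, and the multilingual classifier decides only if none of them accepts. This is a minimal illustrative sketch; the classifier interfaces are assumptions.

```python
def cascade_identify(text, single_lang_classifiers, multilingual_classifier):
    """Sketch of the Fig. 6 multi-layer framework.

    single_lang_classifiers: list of (language, predicate) pairs, run
    one by one in a preset execution order; each predicate answers
    "yes"/"no" for its specific language.
    multilingual_classifier: fallback that picks the optimal language
    from all candidate languages.
    """
    for language, accepts in single_lang_classifiers:
        if accepts(text):          # "yes": terminate identification early
            return language
    return multilingual_classifier(text)
```

A usage sketch: with an English layer first and a multilingual fallback, an ASCII query terminates at the English layer, while other texts fall through to the final classifier.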
The above embodiment provides a language identification method for text; correspondingly, the present application also provides a language identification apparatus for text. The apparatus corresponds to the embodiment of the above method.
Referring to Fig. 7, it is a schematic diagram of the embodiment of the language identification apparatus for text of the present application. Since the apparatus embodiment is substantially similar to the method embodiment, it is described relatively simply; for relevant parts, reference may be made to the corresponding description of the method embodiment. The apparatus embodiment described below is merely illustrative.
A language identification apparatus for text of this embodiment includes:
an extracting unit 101, configured to extract language features from a text to be identified;
a predicting unit 103, configured to take the extracted language features as the input of a pre-generated text language classifier, and obtain the language of the text to be identified through calculation by the text language classifier;
wherein the language features include at least one of N-gram word features, N-gram character features and affix features.
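The feature extraction performed by extracting unit 101 can be sketched as follows; the choice of N=2, the prefix/suffix lengths and the feature naming are illustrative assumptions.

```python
def extract_features(text, n=2):
    """Sketch of extracting unit 101: N-gram word features, N-gram
    character features, and affix features (a short prefix and suffix
    of each word, labelled with hypothetical PRE_/SUF_ tags)."""
    words = text.split()
    feats = []
    # N-gram continuous word features
    feats += [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]
    for w in words:
        # N-gram continuous character features
        feats += [w[i:i + n] for i in range(len(w) - n + 1)]
        # affix features: word-initial and word-final character runs
        if len(w) > n:
            feats += ["PRE_" + w[:n], "SUF_" + w[-n:]]
    return feats
```

Each extracted feature then serves as a lookup key into the correspondence among languages, language features and weights.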
Referring to Fig. 8, it is a schematic diagram of the predicting unit 103 of the embodiment of the language identification apparatus for text of the present application. Optionally, the predicting unit 103 includes:
a retrieval subunit 1031, configured to take the extracted language features as a search condition and retrieve, from a pre-generated correspondence among languages, language features and weights thereof, the feature weight of each language feature in each candidate language;
a computation subunit 1033, configured to calculate, according to the retrieved feature weights of the language features in each candidate language, the score of the text to be identified belonging to each candidate language respectively;
a setting subunit 1035, configured to take the candidate language whose score is greater than a predetermined threshold as the language of the text to be identified.
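The cooperation of the retrieval, computation and setting subunits can be sketched as follows: each extracted feature's weight is looked up per candidate language, the weights are accumulated into per-language scores, and a candidate is returned only if its score exceeds the predetermined threshold. The weight table, the threshold value and the use of simple summation as the score are illustrative assumptions.

```python
# Hypothetical correspondence: feature -> {candidate language: weight}.
WEIGHTS = {
    "the": {"English": 0.9, "Dutch": 0.1},
    "een": {"English": 0.05, "Dutch": 0.95},
}

def predict_language(features, threshold=0.5):
    """Retrieve each feature's weight per candidate language (subunit
    1031), accumulate per-language scores (subunit 1033), and keep the
    best candidate only above the threshold (subunit 1035)."""
    scores = {}
    for feat in features:
        for language, w in WEIGHTS.get(feat, {}).items():
            scores[language] = scores.get(language, 0.0) + w
    best = max(scores, key=scores.get, default=None)
    return best if best is not None and scores[best] > threshold else None
```

Returning None when no candidate clears the threshold leaves room for the fallback behaviour described elsewhere in the embodiments.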
Optionally, the predicting unit 103 includes at least one prediction subunit oriented to a particular candidate language; each prediction subunit oriented to a particular candidate language is used one by one in a preset execution order to judge whether the language of the text to be identified belongs to the candidate language of the current prediction subunit oriented to a particular candidate language; if so, language identification is terminated; if not, the language of the text to be identified is obtained through calculation by the next prediction subunit oriented to a particular candidate language located after the current prediction subunit oriented to the particular candidate language;
the prediction subunit oriented to a particular candidate language is configured to obtain the language of the text to be identified through calculation by a text language classifier oriented to the particular candidate language;
wherein the text language classifier oriented to a particular candidate language includes a single-language text language classifier or a multilingual text language classifier.
Referring to Fig. 9, it is a schematic diagram of the generating unit 201 of the embodiment of the language identification apparatus for text of the present application. Optionally, the apparatus further includes:
a generating unit 201, configured to generate the pre-generated correspondence among languages, language features and weights thereof;
the generating unit 201 includes:
an acquisition subunit 2011, configured to acquire a text set annotated with the languages to which the texts belong;
an extraction subunit 2013, configured to extract the language features from each text annotated with its language, and count the number of times each language feature occurs in each candidate language respectively;
a computation subunit 2015, configured to calculate, according to the language features extracted from each annotated text and the counted numbers of occurrences of the language features in each candidate language, the ratio of the number of occurrences of each language feature in each candidate language to the total number of its occurrences in all candidate languages, as the feature weight of each language feature in each candidate language;
a setting subunit 2017, configured to take the set of triples of each candidate language, each language feature and the feature weight as the correspondence among languages, language features and weights thereof.
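The weight generation performed by the generating unit 201 can be sketched as counting feature occurrences per candidate language and taking the ratio of each per-language count to the feature's total count, yielding (language, feature, weight) triples. The function name and the input shape are assumptions for illustration.

```python
from collections import defaultdict

def build_weight_triples(labeled_texts):
    """Sketch of generating unit 201.

    labeled_texts: iterable of (language, list_of_features) pairs, i.e.
    features already extracted from texts annotated with their language.
    Returns (language, feature, weight) triples, where the weight is the
    feature's count in that language divided by its total count across
    all candidate languages.
    """
    counts = defaultdict(lambda: defaultdict(int))
    for language, features in labeled_texts:
        for feat in features:
            counts[feat][language] += 1
    triples = []
    for feat, per_lang in counts.items():
        total = sum(per_lang.values())
        for language, n in per_lang.items():
            triples.append((language, feat, n / total))
    return triples
```

The resulting triple set is exactly the correspondence among languages, language features and weights that the predicting unit queries at identification time.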
Referring to Fig. 10, it is another schematic diagram of the embodiment of the language identification apparatus for text of the present application. Optionally, the apparatus further includes:
an intervention unit 203, configured to take the text to be identified as a search condition and retrieve whether the text to be identified exists in a pre-generated intervention vocabulary; if the judgment result is yes, take the language corresponding to the text to be identified in the intervention vocabulary as the language of the text to be identified;
wherein the intervention vocabulary includes a record set of correspondences between texts and the languages thereof.
Optionally, the apparatus further includes:
a character recognition unit 205, configured to take the characters included in the text to be identified as a search condition and retrieve whether the characters included in the text to be identified exist in a pre-generated specific-language character code table; if the judgment result is yes, take the language corresponding to the characters of the text to be identified in the specific-language character code table as the language of the text to be identified.
Optionally, the apparatus further includes:
a noise removal unit 207, configured to remove a preset brand word or a preset model word from the text to be identified according to at least one of a pre-generated brand vocabulary and a pre-generated model vocabulary.
Referring to Fig. 11, it is a schematic diagram of the embodiment of the electronic device of the present application. Since the device embodiment is substantially similar to the method embodiment, it is described relatively simply; for relevant parts, reference may be made to the corresponding description of the method embodiment. The device embodiment described below is merely illustrative.
An electronic device of this embodiment includes: a display 1101; a processor 1102; and a memory 1103, the memory 1103 being configured to store a language identification apparatus for text; when the language identification apparatus for text is executed by the processor 1102, the following steps are performed: extracting language features from a text to be identified; taking the extracted language features as the input of a pre-generated text language classifier, and obtaining the language of the text to be identified through calculation by the text language classifier; wherein the language features include at least one of N-gram word features, N-gram character features and affix features.
The present application provides a language identification method for text, an apparatus and an electronic device: language features are extracted from a text to be identified, the extracted language features are taken as the input of a pre-generated text language classifier, and the language of the text to be identified is obtained through calculation by the text language classifier, wherein the language features include at least one of N-gram word features, N-gram character features and affix features. Because the language features on which the method provided by the present application relies are on the order of tens of millions, the correct recognition rate and robustness of language identification can be improved; meanwhile, because the training corpus set only needs to be a historical query set annotated with the correct languages, with no further content needing to be annotated, a highly practical effect can be achieved.
In addition, the embodiments of the present application provide several further language identification methods for text. Since explanations of these method embodiments have been given in the above method embodiment, they are described relatively simply; for relevant parts, reference may be made to the corresponding description of the above method embodiment. The method embodiments described below are merely illustrative.
The embodiments of the present application provide another language identification method for text, the method including the following steps: 1) taking the text to be identified as a search condition, and retrieving whether the text to be identified exists in a pre-generated intervention vocabulary, the intervention vocabulary including a record set of correspondences between texts and the languages thereof; 2) if the judgment result is yes, taking the language corresponding to the text to be identified in the intervention vocabulary as the language of the text to be identified.
For the related description of the intervention vocabulary and its application method, reference may be made to the corresponding description of optimisation strategy one in the above embodiment one, which will not be repeated here.
Preferably, the intervention vocabulary is generated using the following steps: 1) acquiring texts that have been misidentified; 2) taking each misidentified text and the correct language thereof as a record of the intervention vocabulary.
The method further includes the following step: if the judgment result is no, obtaining the language of the text to be identified through calculation by a pre-generated text language classifier.
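The intervention-vocabulary method can be sketched as an exact-match lookup with a classifier fallback. The vocabulary entries and the lower-casing normalization below are hypothetical.

```python
# Hypothetical intervention vocabulary: previously misidentified texts
# mapped to their correct languages.
INTERVENTION_VOCAB = {
    "cannon d70": "Chinese",
}

def identify_with_intervention(text, classifier):
    """Look the text up in the intervention vocabulary first; fall back
    to the text language classifier when the lookup misses."""
    language = INTERVENTION_VOCAB.get(text.lower())
    if language is not None:   # judgment result is yes: use stored language
        return language
    return classifier(text)    # judgment result is no: use the classifier
```

Because the vocabulary is consulted before the classifier, known misidentifications are corrected without retraining.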
The text language classifiers described in the embodiments of the present application include both the text language classifiers of the prior art and the machine-learning-based text language classifier provided in the above method embodiment one.
The above embodiment provides a language identification method for text; correspondingly, the present application also provides a language identification apparatus for text. The apparatus corresponds to the embodiment of the above method.
A language identification apparatus for text provided by the embodiments of the present application includes:
a retrieval unit, configured to take the text to be identified as a search condition and retrieve whether the text to be identified exists in a pre-generated intervention vocabulary, the intervention vocabulary including a record set of correspondences between texts and the languages thereof;
a judging unit, configured to, if the judgment result is yes, take the language corresponding to the text to be identified in the intervention vocabulary as the language of the text to be identified.
Optionally, the apparatus further includes:
a predicting unit, configured to, if the judgment result is no, obtain the language of the text to be identified through calculation by a pre-generated text language classifier.
The embodiments of the present application provide another language identification method for text, the method including the following steps: 1) taking the characters included in the text to be identified as a search condition, and retrieving whether the characters included in the text to be identified exist in a pre-generated specific-language character code table; 2) if the judgment result is yes, taking the language corresponding to the characters of the text to be identified in the specific-language character code table as the language of the text to be identified.
For the related description of the character code table and its application method, reference may be made to the corresponding description of optimisation strategy two in the above embodiment one, which will not be repeated here.
The method further includes the following step: if the judgment result is no, obtaining the language of the text to be identified through calculation by a pre-generated text language classifier.
The text language classifiers described in the embodiments of the present application include both the text language classifiers of the prior art and the machine-learning-based text language classifier provided in the above method embodiment one.
The above embodiment provides a language identification method for text; correspondingly, the present application also provides a language identification apparatus for text. The apparatus corresponds to the embodiment of the above method.
A language identification apparatus for text provided by the embodiments of the present application includes:
a retrieval unit, configured to take the characters included in the text to be identified as a search condition and retrieve whether the characters included in the text to be identified exist in a pre-generated specific-language character code table;
a judging unit, configured to, if the judgment result is yes, take the language corresponding to the characters of the text to be identified in the specific-language character code table as the language of the text to be identified.
Optionally, the apparatus further includes:
a predicting unit, configured to, if the judgment result is no, obtain the language of the text to be identified through calculation by a pre-generated text language classifier.
The embodiments of the present application provide another language identification method for text, the method including the following steps: 1) removing a preset brand word or a preset model word from the text to be identified according to at least one of a pre-generated brand vocabulary and a pre-generated model vocabulary; 2) obtaining the language of the text to be identified through calculation by a pre-generated text language classifier.
The text language classifiers described in the embodiments of the present application include both the text language classifiers of the prior art and the machine-learning-based text language classifier provided in the above method embodiment one. For the related descriptions of the brand vocabulary, the model vocabulary and the filtering method, reference may be made to the corresponding description of optimisation strategy three in the above embodiment one, which will not be repeated here.
The above embodiment provides a language identification method for text; correspondingly, the present application also provides a language identification apparatus for text. The apparatus corresponds to the embodiment of the above method.
A language identification apparatus for text provided by the embodiments of the present application includes:
a filtering unit, configured to remove a preset brand word or a preset model word from the text to be identified according to at least one of a pre-generated brand vocabulary and a pre-generated model vocabulary;
a predicting unit, configured to obtain the language of the text to be identified through calculation by a pre-generated text language classifier.
The embodiments of the present application provide another language identification method for text, the method including the following steps: 1) extracting language features from a text to be identified; 2) running a predetermined number of text language classifiers one by one in a preset execution order, and judging, through each text language classifier, whether the language of the text to be identified belongs to the candidate language of that text language classifier; if so, terminating language identification; wherein the language features include at least one of N-gram word features, N-gram character features and affix features.
The text language classifiers described in the embodiments of the present application include single-language text language classifiers or multilingual text language classifiers. For the related description of the multi-level language identification framework, reference may be made to the corresponding description of optimisation strategy five in the above embodiment one, which will not be repeated here.
The above embodiment provides a language identification method for text; correspondingly, the present application also provides a language identification apparatus for text. The apparatus corresponds to the embodiment of the above method.
A language identification apparatus for text provided by the embodiments of the present application includes:
an extracting unit, configured to extract language features from a text to be identified;
a predicting unit, configured to run a predetermined number of text language classifiers one by one in a preset execution order, and judge, through each text language classifier, whether the language of the text to be identified belongs to the candidate language of that text language classifier; if so, terminate language identification;
wherein the language features include at least one of N-gram word features, N-gram character features and affix features.
Although the present application has been disclosed above with preferred embodiments, they are not intended to limit the present application. Any person skilled in the art can make possible variations and modifications without departing from the spirit and scope of the present application; therefore, the protection scope of the present application shall be subject to the scope defined by the claims of the present application.
In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces and memory.
The memory may include forms of volatile memory in a computer-readable medium, random access memory (RAM) and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media include permanent and non-permanent, removable and non-removable media, and can implement information storage by any method or technology. The information may be computer-readable instructions, data structures, program modules or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, compact disc read-only memory (CD-ROM), digital versatile disc (DVD) or other optical storage, magnetic cassettes, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory media, such as modulated data signals and carrier waves.
Those skilled in the art should understand that the embodiments of the present application can be provided as a method, a system or a computer program product. Therefore, the present application can take the form of a complete hardware embodiment, a complete software embodiment, or an embodiment combining software and hardware aspects. Moreover, the present application can take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk memory, CD-ROM, optical memory, etc.) containing computer-usable program code.
Claims (38)
1. A language identification method for text, characterized by comprising:
extracting language features from a text to be identified;
taking the extracted language features as the input of a pre-generated text language classifier, and obtaining the language of the text to be identified through calculation by the text language classifier;
wherein the language features comprise at least one of N-gram word features, N-gram character features and affix features.
2. The language identification method for text according to claim 1, characterized in that the obtaining the language of the text to be identified through calculation by the text language classifier comprises:
taking the extracted language features as a search condition, and retrieving, from a pre-generated correspondence among languages, language features and weights thereof, the feature weight of each language feature in each candidate language;
calculating, according to the retrieved feature weights of the language features in each candidate language, the score of the text to be identified belonging to each candidate language respectively;
taking the candidate language whose score is greater than a predetermined threshold as the language of the text to be identified.
3. The language identification method for text according to claim 2, characterized in that the feature weights are calculated and obtained based on a discriminative model.
4. The language identification method for text according to claim 2, characterized in that the pre-generated correspondence among languages, language features and weights thereof is generated using the following steps:
acquiring a text set annotated with the languages to which the texts belong;
extracting the language features from each text annotated with its language, and counting the number of times each language feature occurs in each candidate language respectively;
calculating, according to the language features extracted from each annotated text and the counted numbers of occurrences of the language features in each candidate language, the ratio of the number of occurrences of each language feature in each candidate language to the total number of its occurrences in all candidate languages, as the feature weight of each language feature in each candidate language;
taking the set of triples of each candidate language, each language feature and the feature weight as the correspondence among languages, language features and weights thereof.
5. The language identification method for text according to claim 2, characterized in that the correspondence among languages, language features and weights thereof is stored in the following way:
storing the N-gram word features and the N-gram character features using a dictionary tree (trie) data structure.
6. The language identification method for text according to claim 2, characterized in that the correspondence among languages, language features and weights thereof is stored in the following way:
for each language feature in the correspondence among languages, language features and weights thereof, storing the language feature in correspondence with all candidate languages in which its weight is non-zero.
7. The language identification method for text according to claim 2, characterized in that the text language classifier is a single-language text language classifier; the calculating, according to the retrieved feature weights of the language features in each candidate language, the score of the text to be identified belonging to each candidate language is performed using the following formula:
p = P(Y = 1 | x) = 1 / (1 + e^(-w·x))
wherein Y is the random variable of the language of the text to be identified; p is the score of the text to be identified belonging to the specific language; x is the feature vector composed of the language features extracted from the text to be identified; w is the weight vector composed of the feature weights corresponding to the language features in x.
8. The language identification method for text according to claim 2, characterized in that the text language classifier is a multilingual text language classifier; the calculating, according to the retrieved feature weights of the language features in each candidate language, the score of the text to be identified belonging to each candidate language is performed using the following formula:
p_j = exp(λ_1j·f_1(x_i) + … + λ_mj·f_m(x_i)) / Z
wherein x_i is the text to be identified; p_j is the score of the text to be identified belonging to a particular candidate language j; f(x_i) = (f_1(x_i), …, f_m(x_i)) are the language features extracted from the text to be identified; λ_1j to λ_mj are the feature weights of f(x_i) in the particular candidate language j; Z is the sum of the scores of the candidate languages, calculated using the following formula:
Z = Σ_{j=1}^{n} exp(λ_1j·f_1(x_i) + … + λ_mj·f_m(x_i))
wherein n is the number of candidate languages.
9. The language identification method for text according to any one of claims 1-8, characterized in that the language features further comprise at least one of: the number of words included in the text to be identified and the average word length, a preset brand word feature, a preset model word feature, a character feature specific to each language, an affix feature specific to each language, and a service feature.
10. The language identification method for text according to any one of claims 1-8, characterized in that the N-gram character features comprise N-gram characters and their position information within a word.
11. The language identification method for text according to any one of claims 1-8, characterized in that the pre-generated text language classifier comprises at least one text language classifier oriented to a particular candidate language; the text language classifiers oriented to particular candidate languages are run one by one in a preset execution order;
the obtaining the language of the text to be identified through calculation by the text language classifier is performed in the following way:
if the current text language classifier oriented to a particular candidate language judges that the language of the text to be identified does not belong to the candidate language of the current text language classifier oriented to a particular candidate language, obtaining the language of the text to be identified through calculation by the next text language classifier located after the current text language classifier oriented to a particular candidate language, according to the preset execution order;
if the current text language classifier oriented to a particular candidate language judges that the language of the text to be identified belongs to the candidate language of the current text language classifier oriented to a particular candidate language, terminating language identification;
wherein the text language classifier oriented to a particular candidate language comprises a single-language text language classifier or a multilingual text language classifier.
12. The language identification method for text according to claim 1, characterized in that, before the extracting language features from a text to be identified, the method further comprises:
taking the text to be identified as a search condition, and retrieving whether the text to be identified exists in a pre-generated intervention vocabulary, the intervention vocabulary comprising a record set of correspondences between texts and the languages thereof;
if the judgment result is yes, taking the language corresponding to the text to be identified in the intervention vocabulary as the language of the text to be identified.
13. The language identification method for text according to claim 12, characterized in that the intervention vocabulary is generated by the following steps:
obtaining texts that were misidentified;
taking each misidentified text together with its correct language as a record of the intervention vocabulary.
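The intervention vocabulary of claims 12-13 amounts to a lookup that overrides the classifier for known misidentifications. The entries below are invented examples, not records from the patent:

```python
# Intervention vocabulary: texts the classifier is known to misidentify,
# mapped to their correct languages. Entries are illustrative only.
intervention_vocabulary = {
    "pizza": "en",      # loanword a classifier might mislabel as Italian
    "karaoke": "en",
}

def identify(text, classify):
    """Check the intervention vocabulary first; fall back to the classifier."""
    if text in intervention_vocabulary:
        return intervention_vocabulary[text]
    return classify(text)
```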
14. The language identification method for text according to claim 1, characterized in that, before extracting language features from the text to be identified, the method further includes:
using the characters of the text to be identified as a search key, retrieving whether those characters exist in a pre-generated character code table of specific languages;
if they exist, taking the language associated with those characters in the character code table as the language of the text to be identified.
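One plausible reading of the character code table in claim 14 is a mapping from script code ranges to languages. The Unicode block ranges below are real, but treating a script as uniquely identifying one language is a simplification made here for illustration:

```python
# Character code ranges for scripts that strongly indicate one language.
# Ranges are Unicode blocks; the language mapping is an assumption.
script_ranges = {
    "el": (0x0370, 0x03FF),   # Greek and Coptic
    "ko": (0xAC00, 0xD7A3),   # Hangul syllables
    "th": (0x0E00, 0x0E7F),   # Thai
}

def language_from_script(text):
    """Return the language whose code range covers a character of the text,
    or None when no range matches (so a classifier can take over)."""
    for ch in text:
        for lang, (lo, hi) in script_ranges.items():
            if lo <= ord(ch) <= hi:
                return lang
    return None
```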
15. The language identification method for text according to claim 1, characterized in that, before extracting language features from the text to be identified, the method further includes:
removing preset brand words or preset model words from the text to be identified according to at least one of a pre-generated brand vocabulary and a pre-generated model vocabulary.
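The noise removal of claim 15 can be sketched as a token filter. The brand and model word lists are invented for illustration; the patent only specifies that such vocabularies are pre-generated:

```python
# Brand and model words carry little language signal in product-search
# queries, so they are stripped before feature extraction. Lists are
# illustrative only.
brand_words = {"iphone", "nike", "samsung"}
model_words = {"6s", "s6", "air"}

def remove_noise(text):
    """Drop preset brand words and model words from the text."""
    tokens = text.lower().split()
    kept = [t for t in tokens if t not in brand_words and t not in model_words]
    return " ".join(kept)

remove_noise("iphone 6s rose case")  # -> "rose case"
```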
16. The language identification method for text according to claim 1, characterized in that the device performing the method is deployed in a distributed system.
17. A language identification device for text, characterized by including:
an extraction unit for extracting language features from the text to be identified;
a prediction unit for taking the extracted language features as input to a pre-generated text language classifier and obtaining, through that classifier, the language to which the text to be identified belongs;
wherein the language features are on the order of tens of millions and include at least one of N-gram word features, N-gram character features, and affix features.
18. The language identification device for text according to claim 17, characterized in that the prediction unit includes:
a retrieval subunit for using each extracted language feature as a search key and obtaining, from a pre-generated correspondence of languages, language features, and their weights, the feature's weight in each candidate language;
a computation subunit for calculating, from the retrieved feature weights, the score of the text to be identified for each candidate language;
a setting subunit for taking the candidate languages whose scores exceed a preset threshold as the languages of the text to be identified.
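The prediction unit of claim 18 can be sketched as summing per-language feature weights and keeping languages above a threshold. The weight-table structure and the plain summation are assumptions; the patent does not fix the exact scoring formula:

```python
def score_languages(features, weights, threshold=0.5):
    """Sum each feature's weight per candidate language and keep the
    languages whose total score exceeds the threshold.

    `weights` maps feature -> {language: weight}. Both the structure and
    the summing/thresholding scheme are a sketch, not the exact formula.
    """
    scores = {}
    for feat in features:
        for lang, w in weights.get(feat, {}).items():
            scores[lang] = scores.get(lang, 0.0) + w
    return {lang: s for lang, s in scores.items() if s > threshold}
```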
19. The language identification device for text according to claim 18, characterized by further including:
a generation unit for generating the pre-generated correspondence of languages, language features, and their weights;
the generation unit includes:
an acquisition subunit for obtaining a text set annotated with the languages the texts belong to;
an extraction subunit for extracting the language features from each annotated text and counting how many times each feature occurs in each candidate language;
a computation subunit for calculating, from the extracted features and their counts, the ratio of the number of times each feature occurs in each candidate language to its total number of occurrences across all candidate languages, and taking that ratio as the feature's weight in that candidate language;
a setting subunit for taking the triples of candidate language, language feature, and feature weight as the correspondence of languages, language features, and their weights.
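The weight computation of claim 19 (per-language count divided by total count over all candidate languages) can be sketched directly. The function names and the whitespace tokenizer used in the example are illustrative:

```python
from collections import Counter, defaultdict

def feature_weights(labelled_texts, extract):
    """Count each feature per language and weight it by the ratio of its
    per-language count to its total count over all candidate languages."""
    counts = defaultdict(Counter)          # feature -> {language: count}
    for text, lang in labelled_texts:
        for feat in extract(text):
            counts[feat][lang] += 1
    weights = {}
    for feat, per_lang in counts.items():
        total = sum(per_lang.values())
        weights[feat] = {lang: c / total for lang, c in per_lang.items()}
    return weights

# e.g. with extract = str.split:
# feature_weights([("the cat", "en"), ("the hund", "de")], str.split)["the"]
# -> {"en": 0.5, "de": 0.5}
```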
20. The language identification device for text according to claim 17, characterized in that the prediction unit includes at least one prediction subunit oriented to particular candidate languages; each such subunit is applied one by one in a preset execution order to judge whether the language of the text to be identified belongs to that subunit's candidate languages; if so, language identification terminates; if not, the next prediction subunit after the current one calculates the language of the text to be identified;
the prediction subunit oriented to particular candidate languages obtains the language of the text to be identified through a text language classifier oriented to those candidate languages;
wherein a text language classifier oriented to particular candidate languages is either a single-language text language classifier or a multi-language text language classifier.
21. The language identification device for text according to claim 17, characterized by further including:
an intervention unit for using the text to be identified as a search key, retrieving whether it exists in a pre-generated intervention vocabulary, and, if it does, taking the language associated with the text in the intervention vocabulary as the language of the text to be identified;
wherein the intervention vocabulary is a set of records mapping texts to the languages they belong to.
22. The language identification device for text according to claim 17, characterized by further including:
a character recognition unit for using the characters of the text to be identified as a search key, retrieving whether those characters exist in a pre-generated character code table of specific languages, and, if they do, taking the language associated with those characters in the character code table as the language of the text to be identified.
23. The language identification device for text according to claim 17, characterized by further including:
a noise removal unit for removing preset brand words or preset model words from the text to be identified according to at least one of a pre-generated brand vocabulary and a pre-generated model vocabulary.
24. An electronic device, characterized by including:
a display;
a processor; and
a memory configured to store a language identification device for text which, when executed by the processor, performs the following steps: extracting language features from the text to be identified; taking the extracted language features as input to a pre-generated text language classifier and obtaining, through that classifier, the language to which the text to be identified belongs; wherein the language features include at least one of N-gram word features, N-gram character features, and affix features.
25. A language identification method for text, characterized by including:
using the text to be identified as a search key, retrieving whether it exists in a pre-generated intervention vocabulary, the intervention vocabulary being a set of records mapping texts to the languages they belong to;
if it exists, taking the language associated with the text in the intervention vocabulary as the language of the text to be identified.
26. The language identification method for text according to claim 25, characterized in that the intervention vocabulary is generated by the following steps:
obtaining texts that were misidentified;
taking each misidentified text together with its correct language as a record of the intervention vocabulary.
27. The language identification method for text according to claim 25, characterized by further including:
if the text does not exist in the intervention vocabulary, obtaining the language of the text to be identified through a pre-generated text language classifier.
28. A language identification device for text, characterized by including:
a retrieval unit for using the text to be identified as a search key and retrieving whether it exists in a pre-generated intervention vocabulary, the intervention vocabulary being a set of records mapping texts to the languages they belong to;
a judging unit for taking, when the text exists in the intervention vocabulary, the language associated with it there as the language of the text to be identified.
29. The language identification device for text according to claim 28, characterized by further including:
a prediction unit for obtaining, when the text does not exist in the intervention vocabulary, the language of the text to be identified through a pre-generated text language classifier.
30. A language identification method for text, characterized by including:
using the characters of the text to be identified as a search key, retrieving whether those characters exist in a pre-generated character code table of specific languages;
if they exist, taking the language associated with those characters in the character code table as the language of the text to be identified.
31. The language identification method for text according to claim 30, characterized by further including:
if the characters do not exist in the character code table, obtaining the language of the text to be identified through a pre-generated text language classifier.
32. A language identification device for text, characterized by including:
a retrieval unit for using the characters of the text to be identified as a search key and retrieving whether those characters exist in a pre-generated character code table of specific languages;
a judging unit for taking, when the characters exist in the table, the language associated with them there as the language of the text to be identified.
33. The language identification device for text according to claim 32, characterized by further including:
a prediction unit for obtaining, when the characters do not exist in the table, the language of the text to be identified through a pre-generated text language classifier.
34. A language identification method for text, characterized by including:
removing preset brand words or preset model words from the text to be identified according to at least one of a pre-generated brand vocabulary and a pre-generated model vocabulary;
obtaining the language of the text to be identified through a pre-generated text language classifier.
35. A language identification device for text, characterized by including:
a filter unit for removing preset brand words or preset model words from the text to be identified according to at least one of a pre-generated brand vocabulary and a pre-generated model vocabulary;
a prediction unit for obtaining the language of the text to be identified through a pre-generated text language classifier.
36. A language identification method for text, characterized by including:
extracting language features from the text to be identified;
running a predetermined number of text language classifiers one by one in a preset execution order, each classifier judging whether the language of the text to be identified belongs to its candidate languages; if so, terminating language identification;
wherein the language features include at least one of N-gram word features, N-gram character features, and affix features.
37. The language identification method for text according to claim 36, characterized in that the text language classifier is either a single-language text language classifier or a multi-language text language classifier.
38. A language identification device for text, characterized by including:
an extraction unit for extracting language features from the text to be identified;
a prediction unit for running a predetermined number of text language classifiers one by one in a preset execution order, each classifier judging whether the language of the text to be identified belongs to its candidate languages, and, if so, terminating language identification;
wherein the language features include at least one of N-gram word features, N-gram character features, and affix features.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510672933.XA CN106598937B (en) | 2015-10-16 | 2015-10-16 | Language Identification, device and electronic equipment for text |
Publications (2)
Publication Number | Publication Date |
---|---|
CN106598937A true CN106598937A (en) | 2017-04-26 |
CN106598937B CN106598937B (en) | 2019-10-18 |
Family
ID=58553877
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201510672933.XA Active CN106598937B (en) | 2015-10-16 | 2015-10-16 | Language Identification, device and electronic equipment for text |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106598937B (en) |
Citations (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CA2124986A1 (en) * | 1994-06-16 | 1995-12-17 | Mitsuhiro Aida | Text input method |
CN1276077A (en) * | 1997-09-15 | 2000-12-06 | 卡艾尔公司 | Automatic language identification system for multilingual optical character recognition |
US20010041978A1 (en) * | 1997-12-24 | 2001-11-15 | Jean-Francois Crespo | Search optimization for continuous speech recognition |
US20050086046A1 (en) * | 1999-11-12 | 2005-04-21 | Bennett Ian M. | System & method for natural language processing of sentence based queries |
CN101645269A (en) * | 2008-12-30 | 2010-02-10 | 中国科学院声学研究所 | Language recognition system and method |
CN101930430A (en) * | 2009-06-19 | 2010-12-29 | 株式会社日立制作所 | Language text processing device and language learning device |
CN102779135A (en) * | 2011-05-13 | 2012-11-14 | 北京百度网讯科技有限公司 | Method and device for obtaining cross-linguistic search resources and corresponding search method and device |
CN103065622A (en) * | 2012-12-20 | 2013-04-24 | 中国科学院声学研究所 | Language model practicing method and system thereof for language recognition |
CN103116578A (en) * | 2013-02-07 | 2013-05-22 | 北京赛迪翻译技术有限公司 | Translation method integrating syntactic tree and statistical machine translation technology and translation device |
CN104572767A (en) * | 2013-10-25 | 2015-04-29 | 北大方正集团有限公司 | Method and system for language classification of sites |
CN105760901A (en) * | 2016-01-27 | 2016-07-13 | 南开大学 | Automatic language identification method for multilingual skew document image |
Cited By (38)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106959943A (en) * | 2016-01-11 | 2017-07-18 | 阿里巴巴集团控股有限公司 | Languages recognize update method and device |
WO2018209608A1 (en) * | 2017-05-17 | 2018-11-22 | Beijing Didi Infinity Technology And Development Co., Ltd. | Method and system for robust language identification |
US11183171B2 (en) | 2017-05-17 | 2021-11-23 | Beijing Didi Infinity Technology And Development Co., Ltd. | Method and system for robust language identification |
CN107957994A (en) * | 2017-10-30 | 2018-04-24 | 努比亚技术有限公司 | A kind of interpretation method, terminal and computer-readable recording medium |
CN108038189A (en) * | 2017-12-11 | 2018-05-15 | 南京茂毓通软件科技有限公司 | A kind of information extracting system of Email |
CN108172212A (en) * | 2017-12-25 | 2018-06-15 | 横琴国际知识产权交易中心有限公司 | A kind of voice Language Identification and system based on confidence level |
CN108172212B (en) * | 2017-12-25 | 2020-09-11 | 横琴国际知识产权交易中心有限公司 | Confidence-based speech language identification method and system |
CN108417205A (en) * | 2018-01-19 | 2018-08-17 | 苏州思必驰信息科技有限公司 | Semantic understanding training method and system |
CN108595443A (en) * | 2018-03-30 | 2018-09-28 | 浙江吉利控股集团有限公司 | Simultaneous interpreting method, device, intelligent vehicle mounted terminal and storage medium |
CN108682417A (en) * | 2018-05-14 | 2018-10-19 | 中国科学院自动化研究所 | Small data Speech acoustics modeling method in speech recognition |
CN108932069A (en) * | 2018-07-11 | 2018-12-04 | 科大讯飞股份有限公司 | Input method candidate entry determines method, apparatus, equipment and readable storage medium storing program for executing |
CN110888967B (en) * | 2018-09-11 | 2023-04-28 | 阿里巴巴集团控股有限公司 | Searching method, device and equipment |
CN110888967A (en) * | 2018-09-11 | 2020-03-17 | 阿里巴巴集团控股有限公司 | Searching method, device and equipment |
CN110970018A (en) * | 2018-09-28 | 2020-04-07 | 珠海格力电器股份有限公司 | Speech recognition method and device |
US11977545B2 (en) * | 2018-10-15 | 2024-05-07 | Oclient Inc. | Generation of an optimized query plan in a database system |
CN109934251A (en) * | 2018-12-27 | 2019-06-25 | 国家计算机网络与信息安全管理中心广东分中心 | A kind of method, identifying system and storage medium for rare foreign languages text identification |
CN109934251B (en) * | 2018-12-27 | 2021-08-06 | 国家计算机网络与信息安全管理中心广东分中心 | Method, system and storage medium for recognizing text in Chinese language |
CN110019821A (en) * | 2019-04-09 | 2019-07-16 | 深圳大学 | Text category training method and recognition methods, relevant apparatus and storage medium |
CN110110299A (en) * | 2019-04-28 | 2019-08-09 | 腾讯科技(上海)有限公司 | Text transform method, apparatus and server |
CN110297888B (en) * | 2019-06-27 | 2022-05-03 | 四川长虹电器股份有限公司 | Domain classification method based on prefix tree and cyclic neural network |
CN110297888A (en) * | 2019-06-27 | 2019-10-01 | 四川长虹电器股份有限公司 | A kind of domain classification method based on prefix trees and Recognition with Recurrent Neural Network |
CN110347934A (en) * | 2019-07-18 | 2019-10-18 | 腾讯科技(成都)有限公司 | A kind of text data filtering method, device and medium |
CN110347934B (en) * | 2019-07-18 | 2023-12-08 | 腾讯科技(成都)有限公司 | Text data filtering method, device and medium |
CN111178009A (en) * | 2019-12-20 | 2020-05-19 | 沈阳雅译网络技术有限公司 | Text multilingual recognition method based on feature word weighting |
CN111178009B (en) * | 2019-12-20 | 2023-05-09 | 沈阳雅译网络技术有限公司 | Text multilingual recognition method based on feature word weighting |
CN111079408A (en) * | 2019-12-26 | 2020-04-28 | 北京锐安科技有限公司 | Language identification method, device, equipment and storage medium |
CN111079408B (en) * | 2019-12-26 | 2023-05-30 | 北京锐安科技有限公司 | Language identification method, device, equipment and storage medium |
CN113255398A (en) * | 2020-02-10 | 2021-08-13 | 百度在线网络技术(北京)有限公司 | Interest point duplicate determination method, device, equipment and storage medium |
CN113255398B (en) * | 2020-02-10 | 2023-08-18 | 百度在线网络技术(北京)有限公司 | Point of interest weight judging method, device, equipment and storage medium |
CN111539207A (en) * | 2020-04-29 | 2020-08-14 | 北京大米未来科技有限公司 | Text recognition method, text recognition device, storage medium and electronic equipment |
CN111832657A (en) * | 2020-07-20 | 2020-10-27 | 上海眼控科技股份有限公司 | Text recognition method and device, computer equipment and storage medium |
CN112528682A (en) * | 2020-12-23 | 2021-03-19 | 北京百度网讯科技有限公司 | Language detection method and device, electronic equipment and storage medium |
CN112883967B (en) * | 2021-02-24 | 2023-02-28 | 北京有竹居网络技术有限公司 | Image character recognition method, device, medium and electronic equipment |
CN112883968B (en) * | 2021-02-24 | 2023-02-28 | 北京有竹居网络技术有限公司 | Image character recognition method, device, medium and electronic equipment |
CN112883966B (en) * | 2021-02-24 | 2023-02-24 | 北京有竹居网络技术有限公司 | Image character recognition method, device, medium and electronic equipment |
CN112883968A (en) * | 2021-02-24 | 2021-06-01 | 北京有竹居网络技术有限公司 | Image character recognition method, device, medium and electronic equipment |
CN112883966A (en) * | 2021-02-24 | 2021-06-01 | 北京有竹居网络技术有限公司 | Image character recognition method, device, medium and electronic equipment |
CN112883967A (en) * | 2021-02-24 | 2021-06-01 | 北京有竹居网络技术有限公司 | Image character recognition method, device, medium and electronic equipment |
Also Published As
Publication number | Publication date |
---|---|
CN106598937B (en) | 2019-10-18 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN106598937B (en) | Language Identification, device and electronic equipment for text | |
Ravichandran et al. | Learning surface text patterns for a question answering system | |
CN110321925B (en) | Text multi-granularity similarity comparison method based on semantic aggregated fingerprints | |
KR101173561B1 (en) | Question type and domain identifying apparatus and method | |
CN107992633A (en) | Electronic document automatic classification method and system based on keyword feature | |
CN107861939A (en) | A kind of domain entities disambiguation method for merging term vector and topic model | |
CN106557462A (en) | Name entity recognition method and system | |
CN106649282A (en) | Machine translation method and device based on statistics, and electronic equipment | |
CN110888991B (en) | Sectional type semantic annotation method under weak annotation environment | |
JP2004139553A (en) | Document retrieval system and question answering system | |
CN101599071A (en) | The extraction method of conversation text topic | |
Sun et al. | Mining dependency relations for query expansion in passage retrieval | |
Toselli et al. | Making two vast historical manuscript collections searchable and extracting meaningful textual features through large-scale probabilistic indexing | |
CN109271640B (en) | Text information region attribute identification method and device and electronic equipment | |
CN109299221A (en) | Entity extraction and sort method and device | |
Ranjan et al. | Question answering system for factoid based question | |
CN102929962A (en) | Evaluating method for search engine | |
Overell et al. | Geographic Co-occurrence as a tool for GIR. | |
Belz et al. | Extracting parallel fragments from comparable corpora for data-to-text generation | |
Kešelj et al. | A SUFFIX SUBSUMPTION-BASED APPROACH TO BUILDING STEMMERS AND LEMMATIZERS FOR HIGHLY INFLECTIONAL LANGUAGES WITH SPARSE RESOURCES. | |
Corrada-Emmanuel et al. | Answer passage retrieval for question answering | |
Arab et al. | A graph-based approach to word sense disambiguation. An unsupervised method based on semantic relatedness | |
Dhanjal et al. | Gravity based Punjabi question answering system | |
Zheng et al. | A novel hierarchical convolutional neural network for question answering over paragraphs | |
Garrido et al. | NEREA: Named entity recognition and disambiguation exploiting local document repositories |
Legal Events
Date | Code | Title | Description |
---|---|---|---
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |
TR01 | Transfer of patent right | |
Effective date of registration: 2021-11-15
Address after: No. 699, Wangshang Road, Binjiang District, Hangzhou, Zhejiang
Patentee after: Alibaba (China) Network Technology Co., Ltd
Address before: P.O. Box 847, 4th floor, Grand Cayman capital building, British Cayman Islands
Patentee before: Alibaba Group Holdings Limited