CN106598937A - Language recognition method and device for text and electronic equipment - Google Patents


Info

Publication number: CN106598937A
Authority: CN (China)
Prior art keywords: languages, text, feature, identified, candidate
Legal status: Granted (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Application number: CN201510672933.XA
Other languages: Chinese (zh)
Other versions: CN106598937B (en)
Inventors: 蒋宏飞, 骆卫华, 林锋
Current assignee: Alibaba China Network Technology Co Ltd
Original assignee: Alibaba Group Holding Ltd
Application filed by Alibaba Group Holding Ltd
Priority to CN201510672933.XA, patent CN106598937B/en
Publication of CN106598937A; application granted; publication of CN106598937B
Legal status: Active; anticipated expiration

Landscapes

  • Machine Translation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a language identification method and apparatus for text, and an electronic device. The method comprises: extracting language features from a text to be identified; taking the extracted features as the input of a pre-generated text-language classifier; and computing, with the classifier, the language to which the text to be identified belongs, wherein the language features include at least one of N-gram word features, N-gram character features, and affix features. The method improves the accuracy and robustness of language identification; moreover, the training corpus need only be a set of historical queries labeled with their correct language, with no further annotation required, so the method is highly practical.

Description

Language identification method and apparatus for text, and electronic device
Technical field
The present application relates to the field of language identification technology, and in particular to a language identification method and apparatus for text, and an electronic device.
Background art
International e-commerce websites generally comprise an English main site and multilingual sub-sites, and both the main site and the sub-sites are open to users worldwide. When a user logs in to any of these sites to search for products, the query may be entered in any language. To understand user intent accurately, the first problem to solve is to automatically identify the language of the query text entered by the user, i.e., text language identification. Only when the language of the text to be processed is accurately known can subsequent processing, such as translation or search, be carried out correctly.
At present, commonly used text language identification methods include the following:
1) A United States patent granted to Xerox in 2000, entitled "AUTOMATIC LANGUAGE IDENTIFICATION USING BOTH N-GRAM AND WORD INFORMATION", publication No. US6167369A. The text language identification method proposed in that patent comprises the following steps:
Step 1: pre-process each word in the text to be identified.
Step 2: for each word, first judge whether it is a short word (in that scheme, at most 5 characters). If it is a short word, directly compute its probability of occurrence under each language; if it is a long word, obtain all of its character 3-grams and compute, for each 3-gram, its probability of occurrence under each language.
Step 3: for each language, combine all probability scores under that language, and select the language to which the text to be identified most probably belongs.
In summary, this method mainly relies on the relative frequencies of words and of character 3-grams to perform language identification. In theory it is the most basic kind of N-gram language model, and is very simple.
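As a rough illustration of the Xerox-style scheme just summarized (short words scored by whole-word probability, longer words by their character 3-grams), here is a minimal sketch; the toy model tables and the smoothing constant are assumptions for the example, not values from the patent:

```python
import math

def score_language(words, word_probs, trigram_probs, smoothing=1e-6):
    """Score one candidate language: words of at most 5 characters use
    whole-word probabilities, longer words fall back to character 3-grams."""
    score = 0.0
    for word in words:
        if len(word) <= 5:
            score += math.log(word_probs.get(word, smoothing))
        else:
            for i in range(len(word) - 2):
                trigram = word[i:i + 3]
                score += math.log(trigram_probs.get(trigram, smoothing))
    return score

def identify(words, models):
    """Pick the language whose (word_probs, trigram_probs) model scores highest."""
    return max(models, key=lambda lang: score_language(words, *models[lang]))
```

With toy English and German tables, a query containing "walking" scores higher under the model that knows the trigram "ing", as expected.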
2) A 2009 Yahoo paper published at ACL, the top conference in computational linguistics: "Language Identification of Search Engine Queries". The method proposed in this paper uses a decision-tree model to combine three scores of the text to be identified under each language — a word-frequency probability score, an N-gram string probability score, and an affix score — and performs language identification on that basis.
This method has the advantage of being simple to implement with little computation. However, because the three scores it relies on are all simple probability scores, its recognition rate is not high, and both its coverage and its flexibility are poor. In addition, the decision-tree model used to combine the three scores must be trained on training data that is difficult to obtain, so the method has poor engineering practicability.
3) A United States patent application filed by Google in 2011, entitled "Query Language Identification", published as US2011231423A1. The text language identification method proposed in that application comprises the following steps:
Step 1: receive a query (the text to be identified) from an interface.
Step 2: generate a classification vector based on the interface language information.
Step 3: for each word in the query, generate a classification vector according to the word's relative frequency in the corpus of each language.
Step 4: combine the interface classification vector with all of the query-word classification vectors.
Step 5: generate a language classification vector, the value of each dimension being the score of the corresponding language.
This method is also relatively simple; the information actually used is just the relative frequency of the query words in each language's corpus, combined with the interface language information, from which the final classification result is drawn. Therefore, this method still suffers from poor coverage and flexibility.
In summary, existing text language identification methods are generally based on language-model techniques: for a text to be identified, each language is scored using a language model trained offline for that language, and the highest-scoring language is taken as the final result. Because the design of the prior art computes probability scores from simple relative-frequency statistics, the prior art suffers from low recognition accuracy and poor robustness.
Summary of the invention
The present application provides a language identification method and apparatus for text, and an electronic device, to solve the prior-art problems of low recognition accuracy and poor robustness.
The present application provides a language identification method for text, comprising:
extracting language features from a text to be identified;
taking the extracted language features as the input of a pre-generated text-language classifier, and computing, with the classifier, the language to which the text to be identified belongs;
wherein the language features include at least one of N-gram word features, N-gram character features, and affix features.
Optionally, computing the language of the text to be identified with the text-language classifier comprises:
using the extracted language features as retrieval conditions, retrieving, from a pre-generated correspondence of languages, language features and their weights, the feature weight of each language feature in each candidate language;
computing, from the retrieved feature weights, the score with which the text to be identified belongs to each candidate language;
taking the candidate language whose score exceeds a predetermined threshold as the language of the text to be identified.
Optionally, the feature weights are computed based on a discriminative model.
Optionally, the pre-generated correspondence of languages, language features and their weights is generated by the following steps:
obtaining a text set labeled with the languages the texts belong to;
extracting the language features from each labeled text, and counting the number of times each language feature occurs in each candidate language;
according to the extracted language features and the counted occurrences, computing, for each language feature, the ratio of its number of occurrences in each candidate language to its total number of occurrences across all candidate languages, as the feature weight of that language feature in that candidate language;
taking the set of (candidate language, language feature, feature weight) triples as the correspondence of languages, language features and their weights.
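The weight-generation steps above — count each feature per language, then normalize by the feature's total count across all candidate languages — can be sketched as follows; `build_weight_table` and the callable `extract_features` are illustrative names, not from the patent:

```python
from collections import defaultdict

def build_weight_table(labeled_texts, extract_features):
    """labeled_texts: iterable of (text, language) pairs.
    Returns {feature: {language: weight}}, where the weight is the
    fraction of the feature's total occurrences seen in that language."""
    counts = defaultdict(lambda: defaultdict(int))
    for text, lang in labeled_texts:
        for feat in extract_features(text):
            counts[feat][lang] += 1
    table = {}
    for feat, per_lang in counts.items():
        total = sum(per_lang.values())
        # Only languages with non-zero counts appear, matching the
        # sparse storage described later in the application.
        table[feat] = {lang: c / total for lang, c in per_lang.items()}
    return table
```

For example, with whitespace tokenization as the feature extractor, a word seen twice in English texts and once in a French text gets weight 2/3 for English and 1/3 for French.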
Optionally, the correspondence of languages, language features and their weights is stored as follows:
the N-gram word features and N-gram character features are stored in a trie (dictionary tree) data structure.
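A minimal trie of the kind this claim describes, mapping a feature string to its per-language weights, might look like the following sketch (the class and method names are illustrative, not from the patent):

```python
class Trie:
    """Minimal trie mapping a feature string to its per-language weights."""

    def __init__(self):
        self.children = {}
        self.weights = None  # set only on nodes that end a stored feature

    def insert(self, feature, weights):
        node = self
        for ch in feature:
            node = node.children.setdefault(ch, Trie())
        node.weights = weights

    def lookup(self, feature):
        node = self
        for ch in feature:
            node = node.children.get(ch)
            if node is None:
                return None
        return node.weights
```

Shared prefixes among tens of millions of N-gram features are stored once, which is the usual reason to prefer a trie over a flat hash table at this scale.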
Optionally, the correspondence of languages, language features and their weights is stored as follows:
for each language feature in the correspondence, the feature is stored together with only those candidate languages in which its weight is non-zero.
Optionally, the text-language classifier is a single-language text-language classifier, and the score with which the text to be identified belongs to each candidate language is computed from the retrieved feature weights by the following formula (a logistic function, reconstructed here from the variable definitions below):
p = P(Y = 1 | x) = 1 / (1 + e^(-w·x))
wherein Y is the random variable for the language of the text to be identified; p is the score with which the text to be identified belongs to the specific language; x is the feature vector composed of the language features extracted from the text to be identified; and w is the weight vector composed of the feature weights corresponding to the language features in x.
Optionally, the text-language classifier is a multilingual text-language classifier, and the score with which the text to be identified belongs to each candidate language is computed from the retrieved feature weights by the following formula (a maximum-entropy form, reconstructed here from the variable definitions below):
p_j = exp(λ_1j·f_1(x_i) + … + λ_mj·f_m(x_i)) / Z
wherein x_i is the text to be identified; p_j is the score with which the text to be identified belongs to a particular candidate language j; f_1(x_i) to f_m(x_i) are the language features extracted from the text to be identified; λ_1j to λ_mj are the feature weights of those features in the particular candidate language j; and Z is the sum of the scores of all candidate languages:
Z = Σ_{j=1..n} exp(λ_1j·f_1(x_i) + … + λ_mj·f_m(x_i))
wherein n is the number of candidate languages.
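The multilingual scoring described above is the standard softmax (maximum-entropy) computation over a sparse feature-weight table; a minimal sketch, with illustrative names and toy weights:

```python
import math

def softmax_scores(features, weights, languages):
    """weights: {feature: {language: lambda}}.
    Returns the normalized score p_j for each candidate language j."""
    raw = {}
    for lang in languages:
        s = sum(weights.get(f, {}).get(lang, 0.0) for f in features)
        raw[lang] = math.exp(s)
    z = sum(raw.values())  # the normalizer Z over all candidate languages
    return {lang: v / z for lang, v in raw.items()}
```

A query whose features carry higher weights for one language receives a proportionally higher normalized score, and the scores over all candidate languages sum to 1.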
Optionally, the language features further include at least one of: the number of words the text to be identified contains and its average word length, preset brand-word features, preset model-word features, characters peculiar to each language, affixes peculiar to each language, and service features.
Optionally, the N-gram character features include N-gram character strings together with their positional information within the word.
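One plausible encoding of "N-gram characters plus their position in the word" is to tag each character n-gram as beginning, middle, end, or whole-word; the tagging scheme below is an assumption for illustration, since the application does not fix a concrete encoding:

```python
def char_ngrams_with_position(word, n):
    """Emit each character n-gram of `word` with a position tag:
    'B' (begins the word), 'E' (ends it), 'M' (middle), or
    'W' (the n-gram is the whole word)."""
    grams = []
    last = len(word) - n
    for i in range(last + 1):
        gram = word[i:i + n]
        if last == 0:
            pos = "W"
        elif i == 0:
            pos = "B"
        elif i == last:
            pos = "E"
        else:
            pos = "M"
        grams.append((gram, pos))
    return grams
```

Position tags let the classifier distinguish, say, a language-typical word ending from the same letters occurring word-initially.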
Optionally, the pre-generated text-language classifiers include at least one classifier oriented to a particular candidate language, and the classifiers oriented to particular candidate languages are run one by one in a preset execution order.
Computing the language of the text to be identified with the text-language classifiers proceeds as follows:
if the classifier currently oriented to a particular candidate language judges that the text to be identified does not belong to that classifier's candidate language, then, following the preset execution order, the next classifier after the current one computes the language of the text to be identified;
if the classifier currently oriented to a particular candidate language judges that the text to be identified belongs to that classifier's candidate language, language identification ends;
wherein a classifier oriented to a particular candidate language is either a single-language text-language classifier or a multilingual text-language classifier.
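The preset-order cascade of per-language classifiers can be sketched as an ordered list of (language, accept) pairs; the predicates below are toy stand-ins for real classifiers, and the `fallback` value is an assumption for the sketch:

```python
def cascade_identify(text, classifiers, fallback="unknown"):
    """classifiers: ordered list of (language, predicate) pairs, one
    binary classifier per candidate language, run in a fixed order.
    The first classifier that accepts the text decides its language."""
    for lang, accepts in classifiers:
        if accepts(text):
            return lang
    return fallback
```

For instance, a cheap CJK-range check can run before an ASCII-based English check, so the expensive classifiers only see queries the earlier stages rejected.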
Optionally, before extracting language features from the text to be identified, the method further comprises:
using the text to be identified as a retrieval condition, retrieving in a pre-generated intervention vocabulary whether the text to be identified is present, the intervention vocabulary comprising records that map texts to the languages they belong to;
if the judgment is yes, taking the language that corresponds to the text in the intervention vocabulary as the language of the text to be identified.
Optionally, the intervention vocabulary is generated by the following steps:
obtaining texts that were incorrectly identified;
taking each incorrectly identified text and its correct language as a record of the intervention vocabulary.
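The intervention-vocabulary check reduces to a dictionary probe before the statistical classifier runs; a minimal sketch with illustrative names:

```python
def identify_with_intervention(text, intervention, classifier):
    """Consult the manually curated intervention table first; fall back
    to the statistical classifier only on a miss."""
    if text in intervention:
        return intervention[text]
    return classifier(text)
```

This gives operators a direct override for queries the classifier is known to get wrong, without retraining the model.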
Optionally, before extracting language features from the text to be identified, the method further comprises:
using the characters the text to be identified contains as retrieval conditions, retrieving in a pre-generated language-specific character code table whether those characters are present;
if the judgment is yes, taking the language that corresponds, in the language-specific character code table, to the characters the text contains as the language of the text to be identified.
Optionally, before extracting language features from the text to be identified, the method further comprises:
removing preset brand words or preset model words from the text to be identified according to at least one of a pre-generated brand vocabulary and a pre-generated model vocabulary.
Optionally, the apparatus performing the language identification method for text is deployed in a distributed system.
Correspondingly, the present application also provides a language identification apparatus for text, comprising:
an extracting unit, for extracting language features from a text to be identified;
a predicting unit, for taking the extracted language features as the input of a pre-generated text-language classifier and computing, with the classifier, the language of the text to be identified;
wherein the language features refer to language features on the order of tens of millions, including at least one of N-gram word features, N-gram character features, and affix features.
Optionally, the predicting unit comprises:
a retrieval subunit, for using the extracted language features as retrieval conditions and retrieving, from the pre-generated correspondence of languages, language features and their weights, the feature weight of each language feature in each candidate language;
a computation subunit, for computing, from the retrieved feature weights, the score with which the text to be identified belongs to each candidate language;
a setting subunit, for taking the candidate language whose score exceeds a predetermined threshold as the language of the text to be identified.
Optionally, the apparatus further comprises:
a generating unit, for generating the pre-generated correspondence of languages, language features and their weights;
the generating unit comprising:
an obtaining subunit, for obtaining a text set labeled with the languages the texts belong to;
an extracting subunit, for extracting the language features from each labeled text and counting the number of times each language feature occurs in each candidate language;
a computation subunit, for computing, from the extracted language features and the counted occurrences, the ratio of each language feature's occurrences in each candidate language to its total occurrences across all candidate languages, as the feature weight of that feature in that candidate language;
a setting subunit, for taking the set of (candidate language, language feature, feature weight) triples as the correspondence of languages, language features and their weights.
Optionally, the predicting unit comprises at least one prediction subunit oriented to a particular candidate language; the prediction subunits are used one by one, in a preset execution order, to judge whether the text to be identified belongs to the candidate language of the current subunit; if so, language identification ends; if not, the next prediction subunit after the current one computes the language of the text to be identified;
each prediction subunit oriented to a particular candidate language computes the language of the text to be identified with a text-language classifier oriented to that candidate language;
wherein the classifier oriented to a particular candidate language is either a single-language text-language classifier or a multilingual text-language classifier.
Optionally, the apparatus further comprises:
an intervention unit, for using the text to be identified as a retrieval condition, retrieving in a pre-generated intervention vocabulary whether the text is present and, if the judgment is yes, taking the language corresponding to the text in the intervention vocabulary as the language of the text to be identified;
wherein the intervention vocabulary comprises records mapping texts to the languages they belong to.
Optionally, the apparatus further comprises:
a character recognition unit, for using the characters the text to be identified contains as retrieval conditions, retrieving in a pre-generated language-specific character code table whether those characters are present and, if the judgment is yes, taking the language corresponding to those characters in the table as the language of the text to be identified.
Optionally, the apparatus further comprises:
a noise-removal unit, for removing preset brand words or preset model words from the text to be identified according to at least one of a pre-generated brand vocabulary and a pre-generated model vocabulary.
Correspondingly, the present application also provides an electronic device, comprising:
a display;
a processor; and
a memory, configured to store a language identification apparatus for text which, when executed by the processor, performs the following steps: extracting language features from a text to be identified; taking the extracted language features as the input of a pre-generated text-language classifier, and computing, with the classifier, the language of the text to be identified; wherein the language features include at least one of N-gram word features, N-gram character features, and affix features.
In addition, the present application also provides another language identification method for text, comprising:
using a text to be identified as a retrieval condition, retrieving in a pre-generated intervention vocabulary whether the text to be identified is present, the intervention vocabulary comprising records mapping texts to the languages they belong to;
if the judgment is yes, taking the language corresponding to the text in the intervention vocabulary as the language of the text to be identified.
Optionally, the intervention vocabulary is generated by the following steps:
obtaining texts that were incorrectly identified;
taking each incorrectly identified text and its correct language as a record of the intervention vocabulary.
Optionally, the method further comprises:
if the judgment is no, computing the language of the text to be identified with a pre-generated text-language classifier.
Correspondingly, the present application also provides a language identification apparatus for text, comprising:
a retrieval unit, for using a text to be identified as a retrieval condition and retrieving in a pre-generated intervention vocabulary whether the text is present, the intervention vocabulary comprising records mapping texts to the languages they belong to;
a judging unit, for taking, if the judgment is yes, the language corresponding to the text in the intervention vocabulary as the language of the text to be identified.
Optionally, the apparatus further comprises:
a predicting unit, for computing, if the judgment is no, the language of the text to be identified with a pre-generated text-language classifier.
In addition, the present application also provides another language identification method for text, comprising:
using the characters a text to be identified contains as retrieval conditions, retrieving in a pre-generated language-specific character code table whether those characters are present;
if the judgment is yes, taking the language corresponding, in the language-specific character code table, to the characters the text contains as the language of the text to be identified.
Optionally, the method further comprises:
if the judgment is no, computing the language of the text to be identified with a pre-generated text-language classifier.
Correspondingly, the present application also provides a language identification apparatus for text, comprising:
a retrieval unit, for using the characters a text to be identified contains as retrieval conditions and retrieving in a pre-generated language-specific character code table whether those characters are present;
a judging unit, for taking, if the judgment is yes, the language corresponding to those characters in the table as the language of the text to be identified.
Optionally, the apparatus further comprises:
a predicting unit, for computing, if the judgment is no, the language of the text to be identified with a pre-generated text-language classifier.
In addition, the present application also provides another language identification method for text, comprising:
removing preset brand words or preset model words from a text to be identified according to at least one of a pre-generated brand vocabulary and a pre-generated model vocabulary;
computing the language of the text to be identified with a pre-generated text-language classifier.
Correspondingly, the present application also provides a language identification apparatus for text, comprising:
a filtering unit, for removing preset brand words or preset model words from a text to be identified according to at least one of a pre-generated brand vocabulary and a pre-generated model vocabulary;
a predicting unit, for computing the language of the text to be identified with a pre-generated text-language classifier.
In addition, the present application also provides another language identification method for text, comprising:
extracting language features from a text to be identified;
running a predetermined number of text-language classifiers one by one in a preset execution order, each classifier judging whether the text to be identified belongs to that classifier's candidate language, and ending language identification if so;
wherein the language features include at least one of N-gram word features, N-gram character features, and affix features.
Optionally, each text-language classifier is either a single-language text-language classifier or a multilingual text-language classifier.
Correspondingly, the present application also provides a language identification apparatus for text, comprising:
an extracting unit, for extracting language features from a text to be identified;
a predicting unit, for running a predetermined number of text-language classifiers one by one in a preset execution order, each classifier judging whether the text to be identified belongs to that classifier's candidate language, and ending language identification if so;
wherein the language features include at least one of N-gram word features, N-gram character features, and affix features.
Compared with the prior art, the present application has the following advantages:
The present application provides a language identification method and apparatus for text, and an electronic device. Language features are extracted from a text to be identified and fed, as input, into a pre-generated text-language classifier, which computes the language of the text; the language features include at least one of N-gram word features, N-gram character features, and affix features. Because the method relies on language features on the order of tens of millions, it can improve the accuracy and robustness of language identification; moreover, since the training corpus need only be a set of historical queries labeled with their correct language, with no further annotation required, the method achieves high practicability.
Brief description of the drawings
Fig. 1 is a flowchart of an embodiment of the language identification method for text of the present application;
Fig. 2 is a detailed flowchart of step S103 of the method embodiment;
Fig. 3 is a detailed flowchart of generating the correspondence of languages, language features and their weights in the method embodiment;
Fig. 4 is a schematic diagram of the storage of the correspondence of languages, language features and their weights generated by the method embodiment;
Fig. 5 is a schematic diagram of a system in which the method embodiment is deployed in a distributed manner;
Fig. 6 is a schematic diagram of the multi-layer identification framework of the method embodiment;
Fig. 7 is a schematic diagram of an embodiment of the language identification apparatus for text of the present application;
Fig. 8 is a schematic diagram of the predicting unit 103 of the apparatus embodiment;
Fig. 9 is a schematic diagram of the generating unit 201 of the apparatus embodiment;
Fig. 10 is another schematic diagram of the apparatus embodiment;
Fig. 11 is a schematic diagram of an embodiment of the electronic device of the present application.
Detailed description of the embodiments
Many specific details are set forth in the following description in order to provide a thorough understanding of the present application. However, the application can be implemented in many ways other than those described here, and those skilled in the art can make similar generalizations without departing from its spirit; therefore the application is not limited to the specific embodiments disclosed below.
The present application provides a language identification method and apparatus for text, and an electronic device, which are described in detail one by one in the following embodiments.
The core idea of the language identification method provided by the embodiments of the present application is: design language features on the order of tens of millions and adopt a machine-learning model to identify the language of the text to be identified. Because the method performs language identification on the basis of tens of millions of language features, it can improve the accuracy and robustness of text language identification.
Referring to Fig. 1, which is a flow chart of an embodiment of the language identification method for text of the present application, the method comprises the following steps:
Step S101: extract language features from the text to be identified.
In any technique that performs pattern recognition by machine learning, the most important element is the design of the feature model. The language features proposed by the inventors of the present technical solution number on the order of ten million, most of which are N-gram continuous word features, N-gram continuous character features, or affix features; these classes constitute the basic language features. In addition, the inventors designed language features of the following categories: statistical features, for example word count and average word length; brand-word features and model-word features; language-specific character features and language-specific affix features; and business features, for example IP address, country of origin, website, and region setting. Each category of language feature is briefly described below.
1) N-gram continuous word features
An N-gram continuous word feature is a class of language feature designed on N-gram theory. For example, for a text to be identified consisting of the three words iphone 6s case plus one further word, the continuous word features that can be extracted include:
Four 1-gram continuous word features: iphone, 6s, case, and the fourth word;
Three 2-gram continuous word features: iphone 6s, 6s case, and case plus the fourth word;
Two 3-gram continuous word features: iphone 6s case, and 6s case plus the fourth word;
One 4-gram continuous word feature: iphone 6s case plus the fourth word.
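As a minimal sketch of this extraction (restricted to the three Latin-script words of the example; the function name is illustrative, not the patent's), all 1- to 4-gram continuous word features can be enumerated like this:

```python
def ngram_word_features(text, max_n=4):
    """All 1..max_n-gram continuous word features of a whitespace-split text."""
    words = text.split()
    features = []
    for n in range(1, max_n + 1):
        for i in range(len(words) - n + 1):
            features.append(" ".join(words[i:i + n]))  # n consecutive words
    return features

feats = ngram_word_features("iphone 6s case")
```

For a three-word text this yields three 1-grams, two 2-grams, and one 3-gram, mirroring the counting scheme above.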
The language features of the text to be identified are constructed on N-gram theory, and the language identification method for text provided by the embodiments of the present application, realized on this basis, can automatically identify the language of the many kinds of text widely used on the Internet. Experimental results show that text language identification based on N-gram continuous word features achieves a relatively high and stable correct recognition rate.
2) N-gram continuous character features
In practical applications, one of the main application scenarios of text language identification is search. In a search scenario, the query entered by the user is typically short — usually only 1 to 3 words — and the order of the words is not constrained. Because the language-model scores of the prior art are unstable on short text, the prior art has a low correct recognition rate when identifying the language of short text, and the word-based N-gram language-model techniques commonly used in general natural language processing therefore do not apply.
The embodiments of the present application propose a character-based N-gram technique, i.e., N-gram continuous character features. An N-gram continuous character feature is another class of language feature designed on N-gram theory; it differs from the N-gram continuous word feature above in that its unit is a single character. For example, the 3-gram continuous character features extracted from the text iphone 6s case of the example above include iph, pho, one, and so on.
In addition, to capture the position of a character N-gram within a word — for example the word-initial or word-final position, or a cross-word connection — the N-gram continuous character features described in the embodiments of the present application also encode the position within the word. In this embodiment, the marker "HEAD_" denotes a word-initial N-gram, "TAIL_" denotes a word-final N-gram, and "_HYP_" denotes a connection. For example, the 3-gram continuous character features extracted from the text iphone 6s case of the example above include HEAD_iph, HEAD_cas, TAIL_one, TAIL_ase, e_HYP_6s, and so on. By adding position information to the N-gram continuous character features, language identification can be performed on finer-grained character features, thereby improving the correct recognition rate.
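A minimal sketch of position-marked 3-gram character extraction; the exact form of the "_HYP_" connection feature (last character of the preceding word joined to the following word) is an assumption inferred from the e_HYP_6s example, and the function name is illustrative:

```python
def char_trigram_features(text):
    """3-gram character features with word-position markers."""
    words = text.split()
    features = []
    for word in words:
        if len(word) >= 3:
            features.append("HEAD_" + word[:3])    # word-initial trigram
            features.append("TAIL_" + word[-3:])   # word-final trigram
        for i in range(len(word) - 2):
            features.append(word[i:i + 3])         # plain trigram
    for left, right in zip(words, words[1:]):
        # cross-word connection feature (assumed form, e.g. e_HYP_6s)
        features.append(left[-1] + "_HYP_" + right)
    return features

feats = char_trigram_features("iphone 6s case")
```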
In particular, considering that short continuous character features (for example 1-gram or 2-gram continuous character features) have very low discriminative power between languages, the inventors further propose using higher-order continuous character features (3-gram and above) as the N-gram continuous character features. Experimental results show that text language identification based on higher-order N-gram continuous character features achieves a relatively high and stable correct recognition rate.
3) Affix features
The affix features described in the embodiments of the present application are features formed from the affixes common in each language, including prefix features and suffix features; for example, in English the character string pre belongs to the prefix features and the character string ing belongs to the suffix features. In implementation, affix features can be extracted from the text to be identified according to a pre-stored affix table.
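A minimal sketch of affix-feature extraction; the PREFIXES and SUFFIXES sets are illustrative stand-ins for the pre-stored affix table, not the patent's actual lists:

```python
# Illustrative stand-ins for the pre-stored affix table.
PREFIXES = {"pre", "un", "re"}
SUFFIXES = {"ing", "ed", "tion"}

def affix_features(text):
    """Prefix/suffix features of each word, looked up in the affix table."""
    features = []
    for word in text.split():
        for p in PREFIXES:
            if word.startswith(p):
                features.append("PREFIX_" + p)
        for s in SUFFIXES:
            if word.endswith(s):
                features.append("SUFFIX_" + s)
    return features
```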
4) Statistical features
The statistical features described in the embodiments of the present application are language features obtained by statistical methods. Because texts of different languages have their own characteristics in the number of words composing a text and in average word length, these quantities can also serve as a basis for language identification. The statistical features described in the embodiments of the present application therefore include the total number of words contained in the text to be identified, the average length of those words, and the like.
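The two statistical features the paragraph names can be computed in a few lines; this is a minimal sketch with an illustrative return shape:

```python
def statistical_features(text):
    """Total word count and average word length of the text to be identified."""
    words = text.split()
    count = len(words)
    avg_len = sum(len(w) for w in words) / count if count else 0.0
    return {"word_count": count, "avg_word_length": avg_len}

stats = statistical_features("iphone 6s case")
```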
5) Brand-word features and model-word features
In practical applications, brand words, model words, or generic description words may be mixed into the text to be identified, which increases the difficulty of language identification. In particular, language identification is very difficult for short text (for example, a query) containing brand words, model words, or generic description words. By designing brand-word features and model-word features, the method provided by the embodiments of the present application can take into account, when identifying the language of a text, whether the text mixes in brand words, model words, or generic description words, which likewise improves the correct recognition rate. In implementation, brand-word features and model-word features can be extracted from the text to be identified according to pre-stored brand and model vocabularies.
6) Language-specific character features and language-specific affix features
The language models on which the prior art is based are trained from large corpora, and the corpora of highly similar languages are themselves highly similar; moreover, some languages are inherently very similar to one another. The prior art therefore recognizes highly similar languages poorly.
To solve this problem, the inventors of the present technical solution designed new language features such as language-specific character features and language-specific affix features. A language-specific character feature or language-specific affix feature described in the embodiments of the present application is a character feature or affix feature exclusive to one language, by which that language can be distinguished from the other languages. For example, a character that occurs only in Portuguese can be expressed by a language-specific character feature designed for it.
Experimental results show that text language identification based on language-specific character features and language-specific affix features can solve the problem of poor recognition performance on highly similar languages, thereby improving the identification of similar languages.
7) Business features
At present, real-time international e-commerce websites typically have many online business features available, for example the cookie information, locale information, and IP address information associated with a query. The business features described in the embodiments of the present application can provide useful evidence for language identification; for example, a query issued from an IP address in a Chinese region is more likely to be Chinese. Experimental results show that by exploiting business features, the accuracy of language identification can be specifically optimized for different business scenarios.
It should be noted that, considering the importance of identifying English and the prevalence of English queries across websites, business features generally need not be adopted in a text language classifier that only recognizes English text.
The preceding section has described each category of language feature of the embodiments of the present application. Because the method provided by the embodiments of the present application performs language identification on language features of the ten-million order of magnitude, it can effectively improve the correct recognition rate of text language.
Step S103: take the extracted language features as the input of a previously generated text language classifier, and compute, by the text language classifier, the language to which the text to be identified belongs.
After each language feature of the text to be identified has been extracted in step S101, the language to which the text belongs can be computed by the previously generated text language classifier.
The text language classifier described in the embodiments of the present application is a text language classifier built by a machine learning method. Referring to Fig. 2, which is a detailed flow chart of step S103 of the embodiment of the language identification method for text of the present application, in this embodiment step S103 comprises the following steps:
Step S1031: taking the extracted language features as the retrieval condition, retrieve the feature weight of each language feature in each candidate language from the previously generated correspondence among languages, language features, and their weights.
The correspondence among languages, language features, and their weights described in the embodiments of the present application is the parameter model of the text language classifier obtained by training on a given corpus. The correspondence comprises the set of triples of each candidate language, each language feature, and its feature weight. To implement the language identification method for text provided by the present application, the correspondence among languages, language features, and their weights must first be generated, i.e., the parameter model of the text language classifier is obtained by training on a multilingual corpus.
Referring to Fig. 3, which is a detailed flow chart of generating the correspondence among languages, language features, and their weights in the embodiment of the language identification method for text of the present application, in this embodiment generating the correspondence comprises the following steps:
Step S301: acquire a text set annotated with the language each text belongs to.
The language identification method for text provided by the embodiments of the present application is a language identification method based on a machine learning algorithm, and the parameters of the classifier — the feature weights — are adjusted using a corpus of known classes; it therefore belongs to supervised learning. In supervised learning, each example consists of an input object (usually a vector) and a desired output value (also called the supervisory signal). The training corpus therefore consists of texts annotated with the language each belongs to.
The annotated text set described in the embodiments of the present application contains texts of every candidate language the text language classifier is able to identify. For example, corpus text 1 is: en ||| iphone 4s case plastic; corpus text 2 is: es ||| iphone 4s caso plástico; corpus text 3 is: en ||| iphone6s screen; and so on.
Step S303: extract the language features from each text annotated with its language, and count the number of times each language feature occurs in each candidate language.
After the annotated text set has been acquired, the language features need to be extracted from each corpus text; these are the same concept as the language features explained in step S101. While extracting the language features from the corpus, the number of occurrences of each language feature in each candidate language must also be counted; for example, the 1-gram continuous word iphone occurs 500 times in the English corpus, or the 1-gram continuous word caso occurs 300 times in Spanish.
Step S305: according to the language features extracted from each annotated text and the counted number of times each language feature occurs in each candidate language, compute the ratio of the number of occurrences of each language feature in each candidate language to its total number of occurrences across all candidate languages, as the feature weight of that language feature in that candidate language.
After the language features have been extracted from all corpus texts and the number of occurrences of each language feature in each candidate language has been counted, the total number of occurrences of each language feature in the whole corpus must also be computed. Finally, the ratio of the number of occurrences of each language feature in each candidate language to its total number of occurrences across all candidate languages is taken as the feature weight of that language feature in that candidate language. For example, suppose the training corpus involves texts of three languages (English, Spanish, and Portuguese), and the 1-gram continuous word iphone occurs 500 times in the English corpus, 200 times in the Spanish corpus, and 260 times in the Portuguese corpus, for a total of 960 occurrences in the corpus; then the feature weight of the language feature iphone in English is 500/960, its feature weight in Spanish is 200/960, and its feature weight in Portuguese is 260/960. As can be seen, the feature weights described in the embodiments of the present application are computed by a discriminative method, whereas the prior art only computes the relative frequency with which each word occurs within its own language — that is, the prior art computes word frequencies by a generative method. Because the embodiments of the present application compute feature weights by a discriminative method, the correct recognition rate can be improved.
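The worked iphone example above (500/960, 200/960, 260/960) can be sketched as a small function over per-language occurrence counts; the data layout is an assumption for illustration:

```python
def feature_weights(counts):
    """counts: {feature: {language: occurrences}}.
    Discriminative weight of a feature in a language = occurrences in that
    language / occurrences summed over all candidate languages."""
    weights = {}
    for feature, per_lang in counts.items():
        total = sum(per_lang.values())
        weights[feature] = {lang: n / total for lang, n in per_lang.items()}
    return weights

w = feature_weights({"iphone": {"en": 500, "es": 200, "pt": 260}})
```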
Step S307: take the set of triples of each candidate language, each language feature, and its feature weight as the correspondence among languages, language features, and their weights.
Through the above steps S301 to S305, the feature weight of each language feature under each candidate language is obtained, and the set of triples of each candidate language, each language feature, and its feature weight is taken as the correspondence among languages, language features, and their weights.
Referring to Table 1, which is a sample of the correspondence among languages, language features, and their weights generated by the embodiment of the language identification method for text of the present application:

Language | Feature string | Feature weight
en | iphone | 0.1
es | iphone | 0.05
en | case | 0.3
es | plástico | 1

Table 1: sample of the correspondence among languages, language features, and their weights
After the correspondence among languages, language features, and their weights has been generated by training, the language features extracted from the text to be identified can be used as the retrieval condition, and the feature weight of each language feature in each candidate language retrieved from the correspondence. For example, suppose the text to be identified is: iphone 5s plástico model. The language features extracted from it include (listing only the 1-gram continuous word features): iphone, 5s, plástico, and model. After retrieval in the model described in Table 1 above, the activated language features are as shown in Table 2:
Language | Feature string | Feature weight
en | iphone | 0.1
es | iphone | 0.05
es | plástico | 1

Table 2: activated features
As can be seen from Table 2, the word 5s is a model word and was filtered out in the preprocessing stage, and the language feature represented by the word model was not found in the parameter model; neither therefore contributes to the language decision.
It should be noted that, in practical applications, because the trained correspondence among languages, language features, and their weights contains language features numbering in the ten millions, the speed of the language-feature retrieval of step S1031 greatly affects the performance of the whole language identification process. To improve retrieval speed, the embodiments of the present application propose optimizing the storage of the correspondence among languages, language features, and their weights in two respects; both storage modes are described below.
1) Storage mode one: store the N-gram continuous word features and N-gram continuous character features in a dictionary-tree data structure.
The dictionary tree described in the embodiments of the present application, also known as a word lookup tree or trie, is a tree structure and a variant of the hash tree. Its advantage is that it uses the common prefixes of strings to reduce query time and minimize meaningless string comparisons, so its search efficiency is higher than that of a hash tree.
The embodiments of the present application propose storing the N-gram continuous word features and N-gram continuous character features in a dictionary tree, so that when some language feature x fails to match, the search for any feature x+a (where a is an arbitrary string) can be abandoned directly. Experimental results show that the effect of this storage strategy on the N-gram continuous word features and N-gram continuous character features is clear.
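A minimal dictionary-tree sketch showing the pruning property just described: once a prefix x fails to match, any extension x+a can be skipped without further comparisons. The nested-dict representation and the "$" end-of-feature marker are implementation choices for this sketch, not the patent's:

```python
class FeatureTrie:
    """Dictionary tree keyed by feature strings; leaves hold per-language weights."""

    def __init__(self):
        self.root = {}

    def insert(self, feature, weights):
        node = self.root
        for ch in feature:
            node = node.setdefault(ch, {})
        node["$"] = weights                 # end marker holds the payload

    def lookup(self, feature):
        node = self.root
        for ch in feature:
            if ch not in node:
                return None                 # prefix x failed: skip every x+a
            node = node[ch]
        return node.get("$")

trie = FeatureTrie()
trie.insert("iph", {"en": 0.1})
```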
2) Storage mode two: for each language feature in the correspondence among languages, language features, and their weights, store that feature together with all candidate languages for which its weight is nonzero.
Ordinarily, in the feature retrieval of a multilingual text language classifier, each language feature x is combined with each candidate language y and the pair (x+y) is searched, so each language feature requires L feature-set searches (where L is the number of candidate languages). The embodiments of the present application propose storing each language feature in the parameter model of the text language classifier together with all of its corresponding candidate languages, in the manner of an inverted index.
Referring to Fig. 4, which is a schematic diagram of how the correspondence among languages, language features, and their weights generated by the embodiment of the language identification method for text of the present application is stored: with the storage mode shown in Fig. 4, each language feature needs to be retrieved only once to return all candidate languages that may match, and overall retrieval efficiency can be improved by a factor of L.
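A minimal sketch of this inverted-index style storage, using the illustrative values of Table 1; one lookup per feature replaces L per-language probes:

```python
# Inverted-index style storage: one retrieval per feature returns every
# candidate language with a nonzero weight. Values are illustrative.
feature_index = {
    "iphone":   [("en", 0.1), ("es", 0.05)],
    "plástico": [("es", 1.0)],
}

def retrieve(feature):
    """Single lookup instead of one (feature, language) probe per candidate."""
    return feature_index.get(feature, [])
```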
Step S1033: according to the retrieved feature weight of each language feature in each candidate language, compute the score with which the text to be identified belongs to each candidate language.
After the feature weight of each language feature of the text to be identified in each candidate language has been obtained through step S1031, the score with which the text to be identified belongs to each candidate language can be computed from these feature weights.
The text language classifier described in the embodiments of the present application may be a single-language text language classifier or a multilingual text language classifier. For example, a single-language text language classifier may be a classifier that decides a single language, such as an English classifier, while a multilingual text language classifier may cover multiple candidate languages, depending on how many text languages the training corpus contains. The single-language and multilingual text language classifiers are described separately below.
1) Single-language text language classifier
When the text language classifier described in the embodiments of the present application is a single-language text language classifier, the computation in step S1033 of the score with which the text to be identified belongs to each candidate language, according to the retrieved feature weight of each language feature in each candidate language, can use the following formula:

P(Y = 1 | x) = 1 / (1 + exp(−w · x))

where Y is the random variable for the language of the text to be identified; P is the score with which the text to be identified belongs to the specific language; x is the feature vector composed of the language features extracted from the text to be identified; and w is the weight vector composed of the feature weights corresponding to each language feature in x.
In this embodiment, the single-language text language classifier is an English discriminator, and P(Y = 1) represents the probability that the text to be identified is English. The single-language text language classifier provided by the embodiments of the present application adopts a logistic regression model. In practical applications, other machine learning models can also be adopted, for example a support vector machine, CRF, or decision tree. These machine learning models are all variations of the specific embodiment, none departing from the core of the present application, and all therefore fall within the scope of protection of the present application.
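A minimal sketch of the logistic-regression score, assuming a sparse binary feature vector in which each extracted feature contributes its weight once; the weight values below are illustrative, not trained:

```python
import math

def english_score(weights, features):
    """Logistic regression: P(Y=1 | x) = 1 / (1 + exp(-w.x)) for a sparse
    binary feature vector x (each extracted feature fires once)."""
    z = sum(weights.get(f, 0.0) for f in features)   # w . x
    return 1.0 / (1.0 + math.exp(-z))

# Illustrative weights, not trained values.
p = english_score({"iphone": 2.0, "case": 1.0}, ["iphone", "case"])
```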
2) Multilingual text language classifier
When the text language classifier described in the embodiments of the present application is a multilingual text language classifier, the computation in step S1033 of the score with which the text to be identified belongs to each candidate language, according to the retrieved feature weight of each language feature in each candidate language, can use the following formula:

p_j = exp(λ_1j · f_1(x_i) + … + λ_mj · f_m(x_i)) / Z

where x_i is the text to be identified; p_j is the score with which the text to be identified belongs to the particular candidate language j; f(x_i) are the language features extracted from the text to be identified; λ_1j to λ_mj are the feature weights of f(x_i) in the particular candidate language j; and Z is the sum of the scores of all candidate languages, computed by the following formula:

Z = Σ_{j=1..n} exp(λ_1j · f_1(x_i) + … + λ_mj · f_m(x_i))

In the above formula, n is the number of languages the multilingual text language classifier is able to identify.
The multilingual text language classifier provided by the embodiments of the present application adopts a maximum entropy model. The maximum entropy model is a machine learning method that performs well in many fields of natural language processing, such as part-of-speech tagging, Chinese word segmentation, sentence-boundary detection, shallow parsing, and text classification. The maximum entropy model can integrate all kinds of related or unrelated probabilistic knowledge and achieve good results on many problems. Experimental results show that language identification based on the maximum entropy model is effective: it not only yields the most consistent distribution but also guarantees the precision and recall of language identification. Likewise, in practical applications other machine learning models can also be adopted, for example a support vector machine, CRF, or decision tree; these are all variations of the specific embodiment, none departing from the core of the present application, and all therefore fall within the scope of protection of the present application.
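The normalized scoring above can be sketched as a softmax over per-language feature-weight sums, assuming binary features (a feature fires once if extracted); the λ values below are illustrative:

```python
import math

def maxent_scores(lambdas, features, languages):
    """p_j = exp(sum_k lambda_kj * f_k(x)) / Z, with Z summed over all
    candidate languages; features are binary (fire once if extracted)."""
    raw = {}
    for lang in languages:
        raw[lang] = math.exp(sum(lambdas.get((f, lang), 0.0) for f in features))
    z = sum(raw.values())                      # normalizer Z
    return {lang: v / z for lang, v in raw.items()}

# Illustrative lambda values, not trained.
scores = maxent_scores({("iphone", "en"): 1.0, ("iphone", "es"): 0.2},
                       ["iphone"], ["en", "es"])
```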
Step S1035: take the candidate language whose score exceeds a preset threshold as the language to which the text to be identified belongs.
The score with which the text to be identified belongs to each candidate language is obtained through step S1033, and on this basis the candidate language whose score exceeds the preset threshold is taken as the language to which the text to be identified belongs; in practical applications, the highest-scoring candidate language is usually taken. For example, the scores of the candidate languages computed from the activated features in Table 2 above are as follows: the score of es is 0.05 + 1 = 1.05, and the score of en is 0.1; because the es score exceeds the en score, the text to be identified is judged to belong to es.
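The decision in the example above can be sketched as summing the activated weights per candidate language and taking the maximum:

```python
# Weights of the activated features from Table 2, grouped per candidate language.
activated = {"en": [0.1], "es": [0.05, 1.0]}
scores = {lang: sum(weights) for lang, weights in activated.items()}
best = max(scores, key=scores.get)   # highest-scoring candidate language
```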
The language identification method for text realized by steps S101 and S103 above is a language identification method based on machine learning. In practical applications, some optimization strategies can be applied on top of it to further improve the correct recognition rate of text language; the optimization strategies adopted by the embodiments of the present application are described separately below.
1) Optimization strategy one
In practical applications, training on the corpus to obtain the correspondence among languages, language features, and their weights is a very time-consuming operation, so training the corpus in real time is impractical. The problem this non-real-time training may bring is that a more accurate text-language-classifier parameter model cannot be learned in time from newer historical recognition results.
A practical online language identification service needs a fast reaction mechanism for online errors. To solve the above problem and intervene quickly on burst errors, the embodiments of the present application use a previously generated intervention vocabulary to intervene quickly on burst errors in a concrete online application system, thereby improving the correct recognition rate of text language.
The intervention vocabulary described in the embodiments of the present application records a batch of historically misrecognized texts annotated with their correct languages. As an illustration of such a misrecognized text: in a query search, a query whose language was identified incorrectly is called a misrecognized text.
The scheme of optimization strategy one is that, before step S101 extracts language features from the text to be identified, the method further comprises: taking the text to be identified as the retrieval condition, retrieving whether the text to be identified exists in the previously generated intervention vocabulary, the intervention vocabulary comprising a record set of texts and the languages they belong to; and if the judgment is affirmative, taking the language corresponding to the text to be identified in the intervention vocabulary as the language to which the text to be identified belongs.
The language identification method for text provided by the embodiments of the present application thus designs an intervention vocabulary mechanism: the text to be identified first passes through the intervention-vocabulary identification module, and if the intervention vocabulary contains the text to be identified, its language can be decided directly without going through the text language classifier. Specifically, intervention-vocabulary identification can adopt matching strategies such as exact whole-string matching, partial matching, and weighted matching, intervening quickly on burst online errors from multiple angles.
The intervention vocabulary described in the embodiments of the present application is generated by the following steps: 1) acquire the misrecognized text; 2) take the misrecognized text and the correct language it belongs to as a record of the intervention vocabulary. That is, once a misrecognized text is obtained, the text and its correct language are appended directly to the intervention vocabulary for later retrieval.
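A minimal sketch of the exact whole-string matching variant of the intervention lookup; the table entry is hypothetical, and the case/whitespace normalization is an assumption of this sketch:

```python
# Hypothetical intervention record: a previously misrecognized query and its
# manually assigned correct language.
intervention_table = {"cannon d70 box": "zh"}

def lookup_intervention(text):
    """Exact whole-string match; None means fall through to the classifier."""
    return intervention_table.get(text.strip().lower())
```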
2) Optimization strategy two
A typical large international e-commerce website supports more than 10 languages, so the language identification technique must support the identification of at least 10 categories of language. Because most languages share characters with other languages, most language identification must be performed with the machine-learning-based language identification method provided by the embodiments of the present application. However, the character sets of some languages occupy their own code ranges in the Unicode code tables, and such languages can be judged directly from the Unicode encoding; for example, Russian characters generally fall in the code range 0x0400 to 0x052F.
The scheme of optimization strategy two is that, before step S101 extracts language features from the text to be identified, the method further comprises: taking the characters contained in the text to be identified as the retrieval condition, retrieving whether those characters exist in a previously generated language-specific character code table; and if the judgment is affirmative, taking the language to which those characters correspond in the language-specific character code table as the language to which the text to be identified belongs.
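A minimal sketch of the code-range check for the Russian example, using the 0x0400–0x052F range the passage gives; restricting the test to alphabetic characters is an assumption of this sketch:

```python
def is_cyrillic_text(text):
    """True when every alphabetic character falls in 0x0400-0x052F, the
    range given above for Russian (Cyrillic) characters."""
    letters = [ch for ch in text if ch.isalpha()]
    return bool(letters) and all(0x0400 <= ord(ch) <= 0x052F for ch in letters)
```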
Optimisation strategy two treats knowledge by the Language Identification with reference to character code recognition methodss and based on machine learning Other text carries out languages identification.The manageable languages of character code recognition methodss include:Russian, Hebrew, Korean, Thailand Language, Arabic etc., test result indicate that, its correct recognition rata is more than 99%.Language Identification based on machine learning Manageable languages include:English, Portugal language, Spanish, German, French, Italian, Turkish, Vietnamese, Indonesia Language, Dutch.Test result indicate that, in addition to Portugal language and Spanish, F1 is estimated more than 90%, its Chinese and English 98%.
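The code-range check of optimization strategy two can be sketched as follows. The Russian range (0x0400–0x052F, the Cyrillic blocks) is the one cited in the text; the other ranges are illustrative approximations of the standard Unicode blocks, not values from the patent:

```python
# Map a language to the Unicode code ranges its script occupies exclusively.
# Russian (Cyrillic) range is from the text; the others approximate Unicode blocks.
CODE_RANGES = {
    "ru": [(0x0400, 0x052F)],   # Cyrillic + Cyrillic Supplement
    "he": [(0x0590, 0x05FF)],   # Hebrew
    "th": [(0x0E00, 0x0E7F)],   # Thai
    "ar": [(0x0600, 0x06FF)],   # Arabic
    "ko": [(0xAC00, 0xD7AF)],   # Hangul syllables
}

def identify_by_encoding(text):
    """Return the language whose dedicated range covers all letters, else None."""
    if not any(ch.isalpha() for ch in text):
        return None
    for lang, ranges in CODE_RANGES.items():
        if all(any(lo <= ord(ch) <= hi for lo, hi in ranges)
               for ch in text if ch.isalpha()):
            return lang
    return None  # shared-script text: fall back to the machine-learning classifier

assert identify_by_encoding("привет мир") == "ru"
assert identify_by_encoding("hello world") is None
```

Returning `None` for Latin-script text reflects the division of labor in the text: shared-script languages go to the machine-learning classifier.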
3) Optimization strategy three
In practical applications, queries entered by users are typically relatively free-form and may therefore contain brand words, model words, and various descriptive vocabulary, for example iPhone 5S, Cannon D70, and so on. Brand words and model words are usually written in international English. Moreover, English queries account for a large proportion of the traffic of international e-commerce websites; even for users from non-English-speaking countries, entering English queries is very common. These special words introduce considerable noise into language identification and thus strongly affect its accuracy. For example, the text string "Cannon D70 boxes" (in the original, the last word is Chinese) is itself a Chinese text, but because it contains a brand word and a model word it is easily identified as English. The prior art, however, makes no particular design for these special words.
Optimization strategy three is, before step S101 extracts the language features from the text to be identified, to further include: removing, according to at least one of a pre-generated brand vocabulary and a pre-generated model vocabulary, preset brand words or preset model words from the text to be identified.
By applying special handling to special words such as brand words and model words, and in particular by giving special consideration to English queries, optimization strategy three achieves the effect of improving the correct recognition rate.
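A sketch of the brand/model-word filtering of optimization strategy three, assuming simple token-level matching against pre-built vocabularies (the example vocabularies and token normalization are illustrative):

```python
# Pre-generated vocabularies; in practice these would be large curated lists.
BRAND_WORDS = {"iphone", "cannon", "samsung"}
MODEL_WORDS = {"5s", "d70", "galaxy"}

def remove_special_words(text):
    """Strip brand and model words so they cannot bias language identification."""
    kept = [tok for tok in text.split()
            if tok.lower() not in BRAND_WORDS and tok.lower() not in MODEL_WORDS]
    return " ".join(kept)

# "Cannon D70" is stripped; only the descriptive vocabulary reaches the classifier.
assert remove_special_words("Cannon D70 camera bag") == "camera bag"
```

After filtering, only the genuinely language-bearing words remain, so a mixed query such as the "Cannon D70" example above no longer pulls the classifier toward English.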
4) Optimization strategy four
In a large-scale international e-commerce website, the queries per second (QPS) received can reach thousands or even tens of thousands, and users are highly sensitive to query latency. Therefore, optimizing the performance of language identification is of great importance.
Optimization strategy four is to deploy the device that executes the language identification method for text provided by the embodiments of the present application (the language identification device) in a distributed system, optimizing the performance of language identification from the angle of multi-threaded concurrency. In this embodiment, the language identification device adopts a Blender/Searcher distributed architecture to improve the concurrent service capacity of language identification. Please refer to Fig. 5, which is a schematic diagram of the distributed deployment of the embodiment of the language identification method for text of the present application.
5) Optimization strategy five
The prior art generally performs language identification with a single-layer architecture, that is, all candidate languages are considered uniformly, without any optimization targeted at specific languages. In practical applications, languages such as English account for the majority of query texts. To optimize language identification for texts in such common languages, optimization strategy five proposed by the embodiments of the present application is to adopt a multi-level language identification architecture in which a dedicated single-language identification layer is designed for each common language, for example a layer dedicated to English. By adopting the multi-level language identification architecture, the ability to specially optimize for specific languages can be provided.
In practical applications, a similar hierarchical design may be applied to whichever languages actually need optimization, and may even be extended to a multi-layer, stage-by-stage discrimination model in which each level performs two-class or three-class language discrimination. All such variations of the multi-level language identification architecture are changes of specific embodiments that do not depart from the core of the present application, and therefore all fall within the protection scope of the present application.
Please refer to Fig. 6, which is a schematic diagram of the multi-layer identification architecture of the embodiment of the language identification method for text of the present application. In Fig. 6, the upper layers (A–X) are single-language text language classifiers for specific languages, each giving only a "yes" or "no" answer for its specific language; if the text to be identified does not belong to any of those specific languages, the final multilingual text language classifier selects the optimal language category from the multiple candidate languages. It should be noted that in the output of the multilingual text language classifier, one may still specify whether to output the categories "A–X" that have already been discriminated.
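The multi-layer architecture of Fig. 6 can be sketched as a cascade: per-language yes/no classifiers run first, and a multilingual classifier serves as the final fallback. The classifier internals are stubbed out here; only the control flow follows the text:

```python
def make_single_language_layer(lang, predicate):
    """A yes/no classifier for one specific language (A-X in Fig. 6)."""
    def layer(text):
        return lang if predicate(text) else None
    return layer

def cascade(text, single_layers, multilingual_classifier):
    """Try each dedicated layer in order; fall back to the multilingual classifier."""
    for layer in single_layers:
        lang = layer(text)
        if lang is not None:
            return lang          # terminate identification early
    return multilingual_classifier(text)

# Toy predicates standing in for real single-language classifiers.
english_layer = make_single_language_layer("en", lambda t: " the " in f" {t} ")
layers = [english_layer]
fallback = lambda t: "fr"        # stub multilingual classifier

assert cascade("where is the station", layers, fallback) == "en"
assert cascade("où est la gare", layers, fallback) == "fr"
```

Because common languages terminate at an early layer, the expensive multilingual decision is reserved for the minority of texts that fall through all dedicated layers.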
The above embodiments provide a language identification method for text; correspondingly, the present application also provides a language identification device for text. The device corresponds to the embodiments of the above method.
Please refer to Fig. 7, which is a schematic diagram of the embodiment of the language identification device for text of the present application. Since the device embodiment is substantially similar to the method embodiment, its description is relatively simple; for relevant parts, refer to the corresponding description of the method embodiment. The device embodiment described below is only schematic.
A language identification device for text of this embodiment includes:
an extracting unit 101, configured to extract language features from a text to be identified;
a predicting unit 103, configured to take the extracted language features as the input of a pre-generated text language classifier, and to obtain, through calculation by the text language classifier, the language of the text to be identified;
wherein the language features include at least one of N-gram word features, N-gram character features, and affix features.
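A sketch of the three feature families just named (N-gram word, N-gram character, and affix features); the particular values of N and the affix length are illustrative choices not specified by the patent:

```python
def word_ngrams(text, n=2):
    """Contiguous n-word features."""
    words = text.split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

def char_ngrams(text, n=3):
    """Contiguous n-character features."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def affixes(text, k=3):
    """Prefix/suffix features of each word, up to length k."""
    feats = []
    for w in text.split():
        for j in range(1, min(k, len(w)) + 1):
            feats.append("PRE:" + w[:j])
            feats.append("SUF:" + w[-j:])
    return feats

assert word_ngrams("red dress for women") == ["red dress", "dress for", "for women"]
assert char_ngrams("cat") == ["cat"]
assert "PRE:re" in affixes("red") and "SUF:ed" in affixes("red")
```

Over a corpus of historical queries, these extractors yield the tens of millions of features that the summary credits for the method's recognition rate and robustness.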
Please refer to Fig. 8, which is a schematic diagram of the predicting unit 103 of the embodiment of the language identification device for text of the present application. Optionally, the predicting unit 103 includes:
a retrieval subunit 1031, configured to take the extracted language features as a retrieval condition, and to retrieve, from a pre-generated correspondence of languages, language features, and their weights, the feature weight of each language feature in each candidate language;
a computation subunit 1033, configured to calculate, according to the retrieved feature weights of the language features in each candidate language, the scores of the text to be identified belonging to each candidate language respectively;
a setting subunit 1035, configured to take a candidate language whose score exceeds a preset threshold as the language of the text to be identified.
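The retrieve-score-threshold flow of subunits 1031-1035 can be sketched as follows; the additive scoring here is a simplification for illustration (claims 7 and 8 give the logistic and maximum-entropy forms), and the weight table is a toy example:

```python
# Pre-generated correspondence: feature -> {candidate language: weight}.
WEIGHTS = {
    "SUF:ing": {"en": 0.9, "de": 0.1},
    "PRE:sch": {"de": 0.8, "nl": 0.2},
    "the":     {"en": 1.0},
}

def score_languages(features):
    """Sum each feature's weight into the score of every candidate language."""
    scores = {}
    for feat in features:
        for lang, w in WEIGHTS.get(feat, {}).items():
            scores[lang] = scores.get(lang, 0.0) + w
    return scores

def identify(features, threshold=1.0):
    """Return the best-scoring language if it clears the preset threshold."""
    scores = score_languages(features)
    if not scores:
        return None
    lang, best = max(scores.items(), key=lambda kv: kv[1])
    return lang if best > threshold else None

assert identify(["the", "SUF:ing"]) == "en"   # 1.9 > 1.0
assert identify(["PRE:sch"]) is None          # 0.8 below threshold
```

The threshold corresponds to the "score exceeds a preset threshold" condition of subunit 1035: a text whose best score is too low yields no confident decision.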
Optionally, the predicting unit 103 includes at least one prediction subunit oriented to a particular candidate language; in a preset execution order, each prediction subunit oriented to a particular candidate language judges one by one whether the language of the text to be identified belongs to the candidate language of the current prediction subunit; if so, language identification terminates; if not, the next prediction subunit oriented to a particular candidate language, located after the current one, calculates and obtains the language of the text to be identified;
the prediction subunit oriented to a particular candidate language is configured to obtain, through calculation by a text language classifier oriented to the particular candidate language, the language of the text to be identified;
wherein the text language classifier oriented to a particular candidate language includes a single-language text language classifier or a multilingual text language classifier.
Please refer to Fig. 9, which is a schematic diagram of the generating unit 201 of the embodiment of the language identification device for text of the present application. Optionally, the device further includes:
a generating unit 201, configured to generate the pre-generated correspondence of languages, language features, and their weights;
the generating unit 201 includes:
an obtaining subunit 2011, configured to obtain a text set labeled with its languages;
an extracting subunit 2013, configured to extract the language features from each text labeled with its language, and to count the number of times each language feature occurs in each candidate language respectively;
a computation subunit 2015, configured to calculate, according to the language features extracted from each text labeled with its language and the counted numbers of times the language features occur in each candidate language, the ratio between the number of times each language feature occurs in each candidate language and the total number of times it occurs in all candidate languages, as the feature weight of each language feature in each candidate language;
a setting subunit 2017, configured to take the set of triples of each candidate language, each language feature, and the feature weight as the correspondence of languages, language features, and their weights.
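The weight generation of subunits 2011-2017 reduces to counting and normalizing; a sketch under the assumption of a simple list of (text, language) training pairs and a pluggable feature extractor:

```python
from collections import defaultdict

def build_weight_table(labeled_texts, extract):
    """For each feature, weight in language L = count(feature in L) / total count."""
    counts = defaultdict(lambda: defaultdict(int))
    for text, lang in labeled_texts:
        for feat in extract(text):
            counts[feat][lang] += 1
    table = {}
    for feat, per_lang in counts.items():
        total = sum(per_lang.values())
        # (language, feature, weight) triples, keyed here by feature for lookup
        table[feat] = {lang: c / total for lang, c in per_lang.items()}
    return table

corpus = [("the red dress", "en"), ("the blue coat", "en"), ("de rode jurk", "nl")]
table = build_weight_table(corpus, extract=str.split)
assert table["the"] == {"en": 1.0}
assert table["de"] == {"nl": 1.0}
```

Because the input is just a labeled query set, this matches the summary's claim that the training corpus needs only historical queries marked with their correct language, with no further annotation.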
Please refer to Fig. 10, which is another schematic diagram of the embodiment of the language identification device for text of the present application. Optionally, the device further includes:
an intervention unit 203, configured to take the text to be identified as a retrieval condition, and to retrieve whether the text to be identified exists in a pre-generated intervention vocabulary; and, if the result of the above judgment is yes, to take the language corresponding to the text to be identified in the intervention vocabulary as the language of the text to be identified;
wherein the intervention vocabulary includes a record set of correspondences between texts and their languages.
Optionally, the device further includes:
a character recognition unit 205, configured to take the characters contained in the text to be identified as a retrieval condition, and to retrieve whether the characters contained in the text to be identified exist in a pre-generated specific-language character code table; and, if the result of the above judgment is yes, to take the language to which the characters contained in the text to be identified correspond in the specific-language character code table as the language of the text to be identified.
Optionally, the device further includes:
a noise removal unit 207, configured to remove, according to at least one of a pre-generated brand vocabulary and a pre-generated model vocabulary, preset brand words or preset model words from the text to be identified.
Please refer to Fig. 11, which is a schematic diagram of the electronic equipment embodiment of the present application. Since the equipment embodiment is substantially similar to the method embodiment, its description is relatively simple; for relevant parts, refer to the corresponding description of the method embodiment. The equipment embodiment described below is only schematic.
The electronic equipment of this embodiment includes: a display 1101; a processor 1102; and a memory 1103, the memory 1103 being configured to store the language identification device for text. When the language identification device for text is executed by the processor 1102, the following steps are performed: extracting language features from a text to be identified; taking the extracted language features as the input of a pre-generated text language classifier, and obtaining, through calculation by the text language classifier, the language of the text to be identified; wherein the language features include at least one of N-gram word features, N-gram character features, and affix features.
The present application provides a language identification method for text, a device, and electronic equipment, which extract language features from a text to be identified, take the extracted language features as the input of a pre-generated text language classifier, and obtain the language of the text to be identified through calculation by the text language classifier, wherein the language features include at least one of N-gram word features, N-gram character features, and affix features. Because the language features on which the method provided by the present application relies are features on the order of tens of millions, the correct recognition rate and robustness of language identification can be improved; meanwhile, because the training corpus set only needs to be a historical query set labeled with correct languages, and no further content needs to be labeled, the effect of high practicality can be achieved.
In addition, the embodiments of the present application provide several other language identification methods for text. Since descriptions of these method embodiments have already been given in the above method embodiment, their descriptions are relatively simple; for relevant parts, refer to the corresponding description of the above method embodiment. The method embodiments described below are only schematic.
The embodiments of the present application provide another language identification method for text, the method comprising the following steps: 1) taking a text to be identified as a retrieval condition, retrieving whether the text to be identified exists in a pre-generated intervention vocabulary, wherein the intervention vocabulary includes a record set of correspondences between texts and their languages; 2) if the result of the above judgment is yes, taking the language corresponding to the text to be identified in the intervention vocabulary as the language of the text to be identified.
For the relevant description of the intervention vocabulary and its application, refer to the corresponding description of optimization strategy one in the above embodiment one, which is not repeated here.
Preferably, the intervention vocabulary is generated by the following steps: 1) obtaining a text that was incorrectly identified; 2) taking the incorrectly identified text and its correct language as a record of the intervention vocabulary.
The method further comprises the following step: if the result of the above judgment is no, obtaining the language of the text to be identified through calculation by a pre-generated text language classifier.
The text language classifier described in the embodiments of the present application includes both prior-art text language classifiers and the machine-learning-based text language classifier provided in the above method embodiment one.
The above embodiment provides a language identification method for text; correspondingly, the present application also provides a language identification device for text. The device corresponds to the embodiment of the above method.
A language identification device for text provided by the embodiments of the present application includes:
a retrieval unit, configured to take a text to be identified as a retrieval condition, and to retrieve whether the text to be identified exists in a pre-generated intervention vocabulary, wherein the intervention vocabulary includes a record set of correspondences between texts and their languages;
a judging unit, configured to, if the result of the above judgment is yes, take the language corresponding to the text to be identified in the intervention vocabulary as the language of the text to be identified.
Optionally, the device further includes:
a predicting unit, configured to, if the result of the above judgment is no, obtain the language of the text to be identified through calculation by a pre-generated text language classifier.
The embodiments of the present application provide another language identification method for text, the method comprising the following steps: 1) taking the characters contained in a text to be identified as a retrieval condition, retrieving whether the characters contained in the text to be identified exist in a pre-generated specific-language character code table; 2) if the result of the above judgment is yes, taking the language to which the characters contained in the text to be identified correspond in the specific-language character code table as the language of the text to be identified.
For the relevant description of the character code table and its application, refer to the corresponding description of optimization strategy two in the above embodiment one, which is not repeated here.
The method further comprises the following step: if the result of the above judgment is no, obtaining the language of the text to be identified through calculation by a pre-generated text language classifier.
The text language classifier described in the embodiments of the present application includes both prior-art text language classifiers and the machine-learning-based text language classifier provided in the above method embodiment one.
The above embodiment provides a language identification method for text; correspondingly, the present application also provides a language identification device for text. The device corresponds to the embodiment of the above method.
A language identification device for text provided by the embodiments of the present application includes:
a retrieval unit, configured to take the characters contained in a text to be identified as a retrieval condition, and to retrieve whether the characters contained in the text to be identified exist in a pre-generated specific-language character code table;
a judging unit, configured to, if the result of the above judgment is yes, take the language to which the characters contained in the text to be identified correspond in the specific-language character code table as the language of the text to be identified.
Optionally, the device further includes:
a predicting unit, configured to, if the result of the above judgment is no, obtain the language of the text to be identified through calculation by a pre-generated text language classifier.
The embodiments of the present application provide another language identification method for text, the method comprising the following steps: 1) removing, according to at least one of a pre-generated brand vocabulary and a pre-generated model vocabulary, preset brand words or preset model words from a text to be identified; 2) obtaining the language of the text to be identified through calculation by a pre-generated text language classifier.
The text language classifier described in the embodiments of the present application includes both prior-art text language classifiers and the machine-learning-based text language classifier provided in the above method embodiment one. For the relevant descriptions of the brand vocabulary, the model vocabulary, and the filtering method, refer to the corresponding description of optimization strategy three in the above embodiment one, which is not repeated here.
The above embodiment provides a language identification method for text; correspondingly, the present application also provides a language identification device for text. The device corresponds to the embodiment of the above method.
A language identification device for text provided by the embodiments of the present application includes:
a filtering unit, configured to remove, according to at least one of a pre-generated brand vocabulary and a pre-generated model vocabulary, preset brand words or preset model words from a text to be identified;
a predicting unit, configured to obtain the language of the text to be identified through calculation by a pre-generated text language classifier.
The embodiments of the present application provide another language identification method for text, the method comprising the following steps: 1) extracting language features from a text to be identified; 2) running, in a preset execution order, each of a preset number of text language classifiers one by one, and judging by each text language classifier whether the language of the text to be identified belongs to the candidate languages of that text language classifier; if so, terminating language identification; wherein the language features include at least one of N-gram word features, N-gram character features, and affix features.
The text language classifier described in the embodiments of the present application includes a single-language text language classifier or a multilingual text language classifier. For the relevant description of the multi-level language identification architecture, refer to the corresponding description of optimization strategy five in the above embodiment one, which is not repeated here.
The above embodiment provides a language identification method for text; correspondingly, the present application also provides a language identification device for text. The device corresponds to the embodiment of the above method.
A language identification device for text provided by the embodiments of the present application includes:
an extracting unit, configured to extract language features from a text to be identified;
a predicting unit, configured to run, in a preset execution order, each of a preset number of text language classifiers one by one, and to judge by each text language classifier whether the language of the text to be identified belongs to the candidate languages of that text language classifier; if so, language identification terminates;
wherein the language features include at least one of N-gram word features, N-gram character features, and affix features.
Although the present application is disclosed above with preferred embodiments, they are not intended to limit the present application. Any person skilled in the art may make possible variations and modifications without departing from the spirit and scope of the present application; therefore, the protection scope of the present application shall be the scope defined by the claims of the present application.
In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include volatile memory among computer-readable media, such as random access memory (RAM), and/or nonvolatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media include permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media (transitory media), such as modulated data signals and carrier waves.
Those skilled in the art should understand that the embodiments of the present application may be provided as a method, a system, or a computer program product. Therefore, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Moreover, the present application may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk memory, CD-ROM, optical memory, and so on) containing computer-usable program code.

Claims (38)

1. A language identification method for text, characterized by comprising:
extracting language features from a text to be identified;
taking the extracted language features as the input of a pre-generated text language classifier, and obtaining, through calculation by the text language classifier, the language of the text to be identified;
wherein the language features include at least one of N-gram word features, N-gram character features, and affix features.
2. The language identification method for text according to claim 1, characterized in that the obtaining, through calculation by the text language classifier, of the language of the text to be identified includes:
taking the extracted language features as a retrieval condition, retrieving, from a pre-generated correspondence of languages, language features, and their weights, the feature weight of each language feature in each candidate language;
calculating, according to the retrieved feature weights of the language features in each candidate language, the scores of the text to be identified belonging to each candidate language respectively;
taking a candidate language whose score exceeds a preset threshold as the language of the text to be identified.
3. The language identification method for text according to claim 2, characterized in that the feature weights are obtained through calculation based on a discriminative model.
4. The language identification method for text according to claim 2, characterized in that the pre-generated correspondence of languages, language features, and their weights is generated by the following steps:
obtaining a text set labeled with its languages;
extracting the language features from each text labeled with its language, and counting the number of times each language feature occurs in each candidate language respectively;
calculating, according to the language features extracted from each text labeled with its language and the counted numbers of times the language features occur in each candidate language, the ratio between the number of times each language feature occurs in each candidate language and the total number of times it occurs in all candidate languages, as the feature weight of each language feature in each candidate language;
taking the set of triples of each candidate language, each language feature, and the feature weight as the correspondence of languages, language features, and their weights.
5. The language identification method for text according to claim 2, characterized in that the correspondence of languages, language features, and their weights is stored in the following way:
storing the N-gram word features and the N-gram character features using a dictionary tree (trie) data structure.
6. The language identification method for text according to claim 2, characterized in that the correspondence of languages, language features, and their weights is stored in the following way:
for each language feature in the correspondence of languages, language features, and their weights, storing the language feature together with all candidate languages in which its weight is non-zero.
7. The language identification method for text according to claim 2, characterized in that the text language classifier is a single-language text language classifier; the calculating, according to the retrieved feature weights of the language features in each candidate language, of the scores of the text to be identified belonging to each candidate language is performed using the following formula:
P(Y=1|x) = exp(w·x) / (1 + exp(w·x)) = 1 / (1 + exp(−w·x))
wherein Y is a random variable for the language of the text to be identified; P is the score of the text to be identified belonging to the specific language; x is a feature vector composed of the language features extracted from the text to be identified; and w is a weight vector composed of the feature weights corresponding to each language feature in x.
8. The language identification method for text according to claim 2, characterized in that the text language classifier is a multilingual text language classifier; the calculating, according to the retrieved feature weights of the language features in each candidate language, of the scores of the text to be identified belonging to each candidate language is performed using the following formula:
p_j(x_i) = (1 / Z(λ_1, …, λ_m)) · exp[λ_{1j}·f_1(x_i) + … + λ_{mj}·f_m(x_i)]
wherein x_i is the text to be identified; p_j is the score of the text to be identified belonging to the particular candidate language j; f_1(x_i) to f_m(x_i) are the language features extracted from the text to be identified; λ_{1j} to λ_{mj} are the feature weights of f_1(x_i) to f_m(x_i) in the particular candidate language j; and Z is the sum of the scores of all candidate languages, calculated using the following formula:
Z(λ_1, …, λ_m) = Σ_{j=1..n} exp[λ_{1j}·f_1(x_i) + … + λ_{mj}·f_m(x_i)]
wherein n is the number of candidate languages.
9. The language identification method for text according to any one of claims 1–8, characterized in that the language features further include at least one of: the number of words contained in the text to be identified and the average word length, preset brand word features, preset model word features, character features peculiar to each language, affix features peculiar to each language, and business features.
10. The language identification method for text according to any one of claims 1–8, characterized in that the N-gram character features include N-gram characters and their positional information within a word.
11. The language identification method for text according to any one of claims 1-8, wherein the pre-generated text language classifier includes at least one text language classifier oriented to a particular candidate language, and each classifier oriented to a particular candidate language is run one by one in a preset execution order;
the obtaining, by the text language classifier, of the language of the text to be identified is performed as follows:
if the classifier currently oriented to a particular candidate language judges that the language of the text to be identified does not belong to that classifier's candidate language, then, according to the preset execution order, the next classifier following the current one calculates and obtains the language of the text to be identified;
if the classifier currently oriented to a particular candidate language judges that the language of the text to be identified belongs to that classifier's candidate language, the language identification terminates;
wherein a classifier oriented to a particular candidate language is either a single-language text language classifier or a multilingual text language classifier.
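The cascade of claim 11 can be sketched as a short-circuiting loop over per-language classifiers. This is a minimal illustration under assumed interfaces; the toy predicates (script-range checks) stand in for the real per-language classifiers and are not from the patent:

```python
def cascade_identify(text, classifiers, fallback="unknown"):
    """Run per-language classifiers one by one in a preset order (claim 11).
    `classifiers` is a list of (language, predicate) pairs; identification
    terminates at the first classifier that accepts the text."""
    for language, accepts in classifiers:
        if accepts(text):
            return language      # claim 11: terminate on acceptance
    return fallback              # no classifier accepted (fallback is assumed)

# Illustrative stand-ins for real classifiers, ordered by execution priority.
classifiers = [
    ("zh", lambda t: any("\u4e00" <= c <= "\u9fff" for c in t)),  # CJK ideographs
    ("ru", lambda t: any("\u0400" <= c <= "\u04ff" for c in t)),  # Cyrillic
    ("en", lambda t: all(c.isascii() for c in t)),
]
```

Ordering cheap, high-precision classifiers first lets most texts exit the cascade early, which is the practical benefit of this design.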
12. The language identification method for text according to claim 1, wherein before the extracting of language features from the text to be identified, the method further includes:
using the text to be identified as a retrieval condition, retrieving in a pre-generated intervention vocabulary whether the text to be identified exists therein, the intervention vocabulary including a record set of texts and their corresponding languages;
if the result of the judgment is yes, taking the language corresponding to the text to be identified in the intervention vocabulary as the language of the text to be identified.
13. The language identification method for text according to claim 12, wherein the intervention vocabulary is generated by the following steps:
obtaining texts that were incorrectly identified;
taking each incorrectly identified text and its correct language as a record of the intervention vocabulary.
14. The language identification method for text according to claim 1, wherein before the extracting of language features from the text to be identified, the method further includes:
using the characters included in the text to be identified as a retrieval condition, retrieving in a pre-generated language-specific character code table whether any character included in the text to be identified exists therein;
if the result of the judgment is yes, taking the language corresponding, in the language-specific character code table, to the character included in the text to be identified as the language of the text to be identified.
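The character code table of claim 14 exploits scripts that map to a single language. A minimal sketch follows; the specific Unicode ranges and the table layout are illustrative assumptions, not the patent's table:

```python
# Language-specific character code table (claim 14): code-point ranges whose
# script strongly indicates one language. Ranges here are illustrative.
CODE_TABLE = [
    (0x3040, 0x30FF, "ja"),   # Hiragana + Katakana
    (0xAC00, 0xD7A3, "ko"),   # Hangul syllables
    (0x0E00, 0x0E7F, "th"),   # Thai
]

def identify_by_code_table(text):
    """Return the language of the first character found in the code table,
    or None so the caller falls back to the statistical classifier."""
    for ch in text:
        cp = ord(ch)
        for lo, hi, lang in CODE_TABLE:
            if lo <= cp <= hi:
                return lang
    return None
```

Like the intervention vocabulary, this is a cheap pre-check: a single Hangul or Thai character settles the language without running the classifier at all.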
15. The language identification method for text according to claim 1, wherein before the extracting of language features from the text to be identified, the method further includes:
removing preset brand words or preset model words from the text to be identified according to at least one of a pre-generated brand vocabulary and a pre-generated model vocabulary.
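Claim 15's pre-filtering can be sketched as below. This is an assumed implementation: the set-based vocabularies and whitespace tokenization are illustrative, since the patent leaves the structures unspecified:

```python
def strip_brand_and_model_words(text, brand_vocab, model_vocab):
    """Claim 15: brand names and model numbers (e.g. 'samsung', 'S7') carry
    little language signal, so remove them before feature extraction."""
    noise = {w.lower() for w in brand_vocab} | {w.lower() for w in model_vocab}
    kept = [w for w in text.split() if w.lower() not in noise]
    return " ".join(kept)
```

This matters for e-commerce queries, where a query like "samsung galaxy s7 charger" should be identified from "galaxy charger" rather than from language-neutral brand and model tokens.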
16. The language identification method for text according to claim 1, wherein the device performing the language identification method for text is deployed in a distributed system.
17. A language identification device for text, including:
an extracting unit for extracting language features from a text to be identified;
a predicting unit for taking the extracted language features as the input of a pre-generated text language classifier and obtaining, through calculation by the text language classifier, the language of the text to be identified;
wherein the language features, numbering on the order of tens of millions, include at least one of an N-gram continuous word feature, an N-gram continuous character feature and an affix feature.
18. The language identification device for text according to claim 17, wherein the predicting unit includes:
a retrieval subunit for using the extracted language features as a retrieval condition and retrieving, from a pre-generated correspondence among languages, language features and their weights, the feature weights of the language features in each candidate language;
a computation subunit for calculating, according to the retrieved feature weights of the language features in each candidate language, the scores with which the text to be identified belongs to each candidate language respectively;
a setting subunit for taking a candidate language whose score exceeds a preset threshold as the language of the text to be identified.
19. The language identification device for text according to claim 18, further including:
a generating unit for generating the pre-generated correspondence among languages, language features and their weights;
the generating unit including:
an obtaining subunit for obtaining a text set whose languages have been labeled;
an extracting subunit for extracting the language features from each labeled text and counting the number of times each language feature occurs in each candidate language;
a computation subunit for calculating, from the extracted language features and the counted occurrences, the ratio of the number of times each language feature occurs in each candidate language to the total number of times it occurs in all candidate languages, as the feature weight of that language feature in that candidate language;
a setting subunit for taking the triples of each candidate language, each language feature and the corresponding feature weight as the correspondence among languages, language features and their weights.
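The weight-generation procedure of claim 19 (feature weight = per-language count divided by total count across all candidate languages) can be sketched as follows. Function names and the labeled-corpus format are illustrative assumptions:

```python
from collections import defaultdict

def train_feature_weights(labeled_texts, extract):
    """Claim 19 training sketch: for each (feature, language) pair, the
    weight is the ratio of the feature's count in that language to its
    total count over all candidate languages.
    `labeled_texts` is an iterable of (text, language) pairs; `extract`
    is any feature extractor returning a list of features for a text."""
    counts = defaultdict(lambda: defaultdict(int))   # feature -> language -> n
    for text, lang in labeled_texts:
        for feat in extract(text):
            counts[feat][lang] += 1
    weights = {}
    for feat, per_lang in counts.items():
        total = sum(per_lang.values())
        for lang, n in per_lang.items():
            weights[(feat, lang)] = n / total
    return weights
```

Note this is a simple count-ratio estimate rather than iterative maximum-entropy training; per the abstract, the only supervision required is a historical query set labeled with the correct language.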
20. The language identification device for text according to claim 17, wherein the predicting unit includes at least one prediction subunit oriented to a particular candidate language; each prediction subunit oriented to a particular candidate language is applied one by one in a preset execution order to judge whether the language of the text to be identified belongs to that subunit's candidate language; if so, the language identification terminates; if not, the next prediction subunit following the current one calculates and obtains the language of the text to be identified;
the prediction subunit oriented to a particular candidate language obtains the language of the text to be identified through calculation by a text language classifier oriented to that candidate language;
wherein the text language classifier oriented to a particular candidate language is either a single-language text language classifier or a multilingual text language classifier.
21. The language identification device for text according to claim 17, further including:
an intervention unit for using the text to be identified as a retrieval condition, retrieving in a pre-generated intervention vocabulary whether the text to be identified exists therein, and, if the result of the judgment is yes, taking the language corresponding to the text to be identified in the intervention vocabulary as the language of the text to be identified;
wherein the intervention vocabulary includes a record set of texts and their corresponding languages.
22. The language identification device for text according to claim 17, further including:
a character recognition unit for using the characters included in the text to be identified as a retrieval condition, retrieving in a pre-generated language-specific character code table whether any character included in the text to be identified exists therein, and, if the result of the judgment is yes, taking the language corresponding to that character in the language-specific character code table as the language of the text to be identified.
23. The language identification device for text according to claim 17, further including:
a noise removal unit for removing preset brand words or preset model words from the text to be identified according to at least one of a pre-generated brand vocabulary and a pre-generated model vocabulary.
24. An electronic device, including:
a display;
a processor; and
a memory configured to store a language identification device for text, wherein when the language identification device for text is executed by the processor, the following steps are performed: extracting language features from a text to be identified; taking the extracted language features as the input of a pre-generated text language classifier and obtaining, through calculation by the text language classifier, the language of the text to be identified; wherein the language features include at least one of an N-gram continuous word feature, an N-gram continuous character feature and an affix feature.
25. A language identification method for text, including:
using a text to be identified as a retrieval condition, retrieving in a pre-generated intervention vocabulary whether the text to be identified exists therein, the intervention vocabulary including a record set of texts and their corresponding languages;
if the result of the judgment is yes, taking the language corresponding to the text to be identified in the intervention vocabulary as the language of the text to be identified.
26. The language identification method for text according to claim 25, wherein the intervention vocabulary is generated by the following steps:
obtaining texts that were incorrectly identified;
taking each incorrectly identified text and its correct language as a record of the intervention vocabulary.
27. The language identification method for text according to claim 25, further including:
if the result of the judgment is no, obtaining the language of the text to be identified through calculation by a pre-generated text language classifier.
28. A language identification device for text, including:
a retrieval unit for using a text to be identified as a retrieval condition and retrieving in a pre-generated intervention vocabulary whether the text to be identified exists therein, the intervention vocabulary including a record set of texts and their corresponding languages;
a judging unit for taking, if the result of the judgment is yes, the language corresponding to the text to be identified in the intervention vocabulary as the language of the text to be identified.
29. The language identification device for text according to claim 28, further including:
a predicting unit for obtaining, if the result of the judgment is no, the language of the text to be identified through calculation by a pre-generated text language classifier.
30. A language identification method for text, including:
using the characters included in a text to be identified as a retrieval condition, retrieving in a pre-generated language-specific character code table whether any character included in the text to be identified exists therein;
if the result of the judgment is yes, taking the language corresponding, in the language-specific character code table, to the character included in the text to be identified as the language of the text to be identified.
31. The language identification method for text according to claim 30, further including:
if the result of the judgment is no, obtaining the language of the text to be identified through calculation by a pre-generated text language classifier.
32. A language identification device for text, including:
a retrieval unit for using the characters included in a text to be identified as a retrieval condition and retrieving in a pre-generated language-specific character code table whether any character included in the text to be identified exists therein;
a judging unit for taking, if the result of the judgment is yes, the language corresponding to that character in the language-specific character code table as the language of the text to be identified.
33. The language identification device for text according to claim 32, further including:
a predicting unit for obtaining, if the result of the judgment is no, the language of the text to be identified through calculation by a pre-generated text language classifier.
34. A language identification method for text, including:
removing preset brand words or preset model words from a text to be identified according to at least one of a pre-generated brand vocabulary and a pre-generated model vocabulary;
obtaining the language of the text to be identified through calculation by a pre-generated text language classifier.
35. A language identification device for text, including:
a filter unit for removing preset brand words or preset model words from a text to be identified according to at least one of a pre-generated brand vocabulary and a pre-generated model vocabulary;
a predicting unit for obtaining the language of the text to be identified through calculation by a pre-generated text language classifier.
36. A language identification method for text, including:
extracting language features from a text to be identified;
running a predetermined number of text language classifiers one by one in a preset execution order, each text language classifier judging whether the language of the text to be identified belongs to that classifier's candidate language, and, if so, terminating the language identification;
wherein the language features include at least one of an N-gram continuous word feature, an N-gram continuous character feature and an affix feature.
37. The language identification method for text according to claim 36, wherein the text language classifier is either a single-language text language classifier or a multilingual text language classifier.
38. A language identification device for text, including:
an extracting unit for extracting language features from a text to be identified;
a predicting unit for running a predetermined number of text language classifiers one by one in a preset execution order, each text language classifier judging whether the language of the text to be identified belongs to that classifier's candidate language, and, if so, terminating the language identification;
wherein the language features include at least one of an N-gram continuous word feature, an N-gram continuous character feature and an affix feature.
CN201510672933.XA 2015-10-16 2015-10-16 Language identification method, device and electronic equipment for text Active CN106598937B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510672933.XA CN106598937B (en) 2015-10-16 2015-10-16 Language Identification, device and electronic equipment for text


Publications (2)

Publication Number Publication Date
CN106598937A true CN106598937A (en) 2017-04-26
CN106598937B CN106598937B (en) 2019-10-18

Family

ID=58553877

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510672933.XA Active CN106598937B (en) 2015-10-16 2015-10-16 Language Identification, device and electronic equipment for text

Country Status (1)

Country Link
CN (1) CN106598937B (en)


Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CA2124986A1 (en) * 1994-06-16 1995-12-17 Mitsuhiro Aida Text input method
CN1276077A (en) * 1997-09-15 2000-12-06 卡艾尔公司 Automatic language identification system for multilingual optical character recognition
US20010041978A1 (en) * 1997-12-24 2001-11-15 Jean-Francois Crespo Search optimization for continuous speech recognition
US20050086046A1 (en) * 1999-11-12 2005-04-21 Bennett Ian M. System & method for natural language processing of sentence based queries
CN101645269A (en) * 2008-12-30 2010-02-10 中国科学院声学研究所 Language recognition system and method
CN101930430A (en) * 2009-06-19 2010-12-29 株式会社日立制作所 Language text processing device and language learning device
CN102779135A (en) * 2011-05-13 2012-11-14 北京百度网讯科技有限公司 Method and device for obtaining cross-linguistic search resources and corresponding search method and device
CN103065622A (en) * 2012-12-20 2013-04-24 中国科学院声学研究所 Language model practicing method and system thereof for language recognition
CN103116578A (en) * 2013-02-07 2013-05-22 北京赛迪翻译技术有限公司 Translation method integrating syntactic tree and statistical machine translation technology and translation device
CN104572767A (en) * 2013-10-25 2015-04-29 北大方正集团有限公司 Method and system for language classification of sites
CN105760901A (en) * 2016-01-27 2016-07-13 南开大学 Automatic language identification method for multilingual skew document image


Cited By (38)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106959943A (en) * 2016-01-11 2017-07-18 阿里巴巴集团控股有限公司 Languages recognize update method and device
WO2018209608A1 (en) * 2017-05-17 2018-11-22 Beijing Didi Infinity Technology And Development Co., Ltd. Method and system for robust language identification
US11183171B2 (en) 2017-05-17 2021-11-23 Beijing Didi Infinity Technology And Development Co., Ltd. Method and system for robust language identification
CN107957994A (en) * 2017-10-30 2018-04-24 努比亚技术有限公司 A kind of interpretation method, terminal and computer-readable recording medium
CN108038189A (en) * 2017-12-11 2018-05-15 南京茂毓通软件科技有限公司 A kind of information extracting system of Email
CN108172212A (en) * 2017-12-25 2018-06-15 横琴国际知识产权交易中心有限公司 A kind of voice Language Identification and system based on confidence level
CN108172212B (en) * 2017-12-25 2020-09-11 横琴国际知识产权交易中心有限公司 Confidence-based speech language identification method and system
CN108417205A (en) * 2018-01-19 2018-08-17 苏州思必驰信息科技有限公司 Semantic understanding training method and system
CN108595443A (en) * 2018-03-30 2018-09-28 浙江吉利控股集团有限公司 Simultaneous interpreting method, device, intelligent vehicle mounted terminal and storage medium
CN108682417A (en) * 2018-05-14 2018-10-19 中国科学院自动化研究所 Small data Speech acoustics modeling method in speech recognition
CN108932069A (en) * 2018-07-11 2018-12-04 科大讯飞股份有限公司 Input method candidate entry determines method, apparatus, equipment and readable storage medium storing program for executing
CN110888967B (en) * 2018-09-11 2023-04-28 阿里巴巴集团控股有限公司 Searching method, device and equipment
CN110888967A (en) * 2018-09-11 2020-03-17 阿里巴巴集团控股有限公司 Searching method, device and equipment
CN110970018A (en) * 2018-09-28 2020-04-07 珠海格力电器股份有限公司 Speech recognition method and device
US11977545B2 (en) * 2018-10-15 2024-05-07 Oclient Inc. Generation of an optimized query plan in a database system
CN109934251A (en) * 2018-12-27 2019-06-25 国家计算机网络与信息安全管理中心广东分中心 A kind of method, identifying system and storage medium for rare foreign languages text identification
CN109934251B (en) * 2018-12-27 2021-08-06 国家计算机网络与信息安全管理中心广东分中心 Method, system and storage medium for recognizing text in Chinese language
CN110019821A (en) * 2019-04-09 2019-07-16 深圳大学 Text category training method and recognition methods, relevant apparatus and storage medium
CN110110299A (en) * 2019-04-28 2019-08-09 腾讯科技(上海)有限公司 Text transform method, apparatus and server
CN110297888B (en) * 2019-06-27 2022-05-03 四川长虹电器股份有限公司 Domain classification method based on prefix tree and cyclic neural network
CN110297888A (en) * 2019-06-27 2019-10-01 四川长虹电器股份有限公司 A kind of domain classification method based on prefix trees and Recognition with Recurrent Neural Network
CN110347934A (en) * 2019-07-18 2019-10-18 腾讯科技(成都)有限公司 A kind of text data filtering method, device and medium
CN110347934B (en) * 2019-07-18 2023-12-08 腾讯科技(成都)有限公司 Text data filtering method, device and medium
CN111178009A (en) * 2019-12-20 2020-05-19 沈阳雅译网络技术有限公司 Text multilingual recognition method based on feature word weighting
CN111178009B (en) * 2019-12-20 2023-05-09 沈阳雅译网络技术有限公司 Text multilingual recognition method based on feature word weighting
CN111079408A (en) * 2019-12-26 2020-04-28 北京锐安科技有限公司 Language identification method, device, equipment and storage medium
CN111079408B (en) * 2019-12-26 2023-05-30 北京锐安科技有限公司 Language identification method, device, equipment and storage medium
CN113255398A (en) * 2020-02-10 2021-08-13 百度在线网络技术(北京)有限公司 Interest point duplicate determination method, device, equipment and storage medium
CN113255398B (en) * 2020-02-10 2023-08-18 百度在线网络技术(北京)有限公司 Point of interest weight judging method, device, equipment and storage medium
CN111539207A (en) * 2020-04-29 2020-08-14 北京大米未来科技有限公司 Text recognition method, text recognition device, storage medium and electronic equipment
CN111832657A (en) * 2020-07-20 2020-10-27 上海眼控科技股份有限公司 Text recognition method and device, computer equipment and storage medium
CN112528682A (en) * 2020-12-23 2021-03-19 北京百度网讯科技有限公司 Language detection method and device, electronic equipment and storage medium
CN112883967B (en) * 2021-02-24 2023-02-28 北京有竹居网络技术有限公司 Image character recognition method, device, medium and electronic equipment
CN112883968B (en) * 2021-02-24 2023-02-28 北京有竹居网络技术有限公司 Image character recognition method, device, medium and electronic equipment
CN112883966B (en) * 2021-02-24 2023-02-24 北京有竹居网络技术有限公司 Image character recognition method, device, medium and electronic equipment
CN112883968A (en) * 2021-02-24 2021-06-01 北京有竹居网络技术有限公司 Image character recognition method, device, medium and electronic equipment
CN112883966A (en) * 2021-02-24 2021-06-01 北京有竹居网络技术有限公司 Image character recognition method, device, medium and electronic equipment
CN112883967A (en) * 2021-02-24 2021-06-01 北京有竹居网络技术有限公司 Image character recognition method, device, medium and electronic equipment

Also Published As

Publication number Publication date
CN106598937B (en) 2019-10-18

Similar Documents

Publication Publication Date Title
CN106598937B (en) Language identification method, device and electronic equipment for text
Ravichandran et al. Learning surface text patterns for a question answering system
CN110321925B (en) Text multi-granularity similarity comparison method based on semantic aggregated fingerprints
KR101173561B1 (en) Question type and domain identifying apparatus and method
CN107992633A (en) Electronic document automatic classification method and system based on keyword feature
CN107861939A (en) A kind of domain entities disambiguation method for merging term vector and topic model
CN106557462A (en) Name entity recognition method and system
CN106649282A (en) Machine translation method and device based on statistics, and electronic equipment
CN110888991B (en) Sectional type semantic annotation method under weak annotation environment
JP2004139553A (en) Document retrieval system and question answering system
CN101599071A (en) The extraction method of conversation text topic
Sun et al. Mining dependency relations for query expansion in passage retrieval
Toselli et al. Making two vast historical manuscript collections searchable and extracting meaningful textual features through large-scale probabilistic indexing
CN109271640B (en) Text information region attribute identification method and device and electronic equipment
CN109299221A (en) Entity extraction and sort method and device
Ranjan et al. Question answering system for factoid based question
CN102929962A (en) Evaluating method for search engine
Overell et al. Geographic Co-occurrence as a tool for GIR.
Belz et al. Extracting parallel fragments from comparable corpora for data-to-text generation
Kešelj et al. A suffix subsumption-based approach to building stemmers and lemmatizers for highly inflectional languages with sparse resources.
Corrada-Emmanuel et al. Answer passage retrieval for question answering
Arab et al. A graph-based approach to word sense disambiguation. An unsupervised method based on semantic relatedness
Dhanjal et al. Gravity based Punjabi question answering system
Zheng et al. A novel hierarchical convolutional neural network for question answering over paragraphs
Garrido et al. NEREA: Named entity recognition and disambiguation exploiting local document repositories

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20211115

Address after: No. 699, Wangshang Road, Binjiang District, Hangzhou, Zhejiang

Patentee after: Alibaba (China) Network Technology Co., Ltd.

Address before: P.O. Box 847, 4th floor, Grand Cayman capital building, British Cayman Islands

Patentee before: Alibaba Group Holding Limited