CN115374779B - Text language identification method, device, equipment and medium - Google Patents

Text language identification method, device, equipment and medium Download PDF

Info

Publication number: CN115374779B (application CN202211306400.6A)
Authority: CN (China)
Prior art keywords: binary, recognized, language, preset, probability
Legal status: Active (the status is an assumption, not a legal conclusion; Google has not performed a legal analysis)
Other languages: Chinese (zh)
Other versions: CN115374779A (en)
Inventors: 杨萌萌, 贺琳, 郝玉峰, 辛晓峰, 黄宇凯, 李科
Current and original assignee: Beijing Speechocean Technology Co., Ltd. (the listed assignees may be inaccurate)
Application filed by Beijing Speechocean Technology Co., Ltd.
Priority to CN202211306400.6A (CN115374779B/en)
Publication of CN115374779A, application granted, publication of CN115374779B

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/20 - Natural language analysis
    • G06F40/279 - Recognition of textual entities
    • G06F40/284 - Lexical analysis, e.g. tokenisation or collocates
    • G06F40/237 - Lexical tools
    • G06F40/242 - Dictionaries


Abstract

The invention discloses a text language identification method, apparatus, device, and medium. The method determines each vocabulary to be recognized in a text to be recognized; for the current vocabulary to be recognized, it determines at least one corresponding binary group to be recognized comprising the current vocabulary and an adjacent vocabulary, so that each binary group captures the vocabulary's context. For each binary group to be recognized, the binary probability under each preset language is then determined from a binary probability dictionary corresponding to that language, and the target language of the current vocabulary is determined from these binary probabilities. Because the probability of each binary group under each language is obtained from a pre-constructed dictionary, language recognition is based on the vocabulary's context, which improves recognition accuracy; vocabularies are recognized directly against each language, so manual labeling of samples is avoided. This solves the prior-art problems of high cost and low recognition accuracy.

Description

Text language identification method, device, equipment and medium
Technical Field
The present invention relates to the field of natural language processing technologies, and in particular, to a text language identification method, apparatus, device, and medium.
Background
Code-switching is a common language phenomenon in which a person alternates between two or more languages, or variants of one language, within a single sentence; it occurs frequently in the everyday speech of multilingual speakers. Beyond everyday dialogue, code-switching also appears in written text. Recognizing language code-switching is important for speech models and natural language processing tasks, and as the phenomenon becomes more common, research on code-switching recognition is receiving more and more attention.
Language code-switching can be divided into two cases according to the combination of language and characters. In one case, the code-switched text is composed of different types of characters: for example, Chinese written in Chinese characters and English written in Latin letters use different character types. In the other case, the code-switched text is composed of characters of the same type: for example, Portuguese and English are both written in Latin letters, and the same word (e.g., "no") appears in both languages. This makes it very difficult to identify whether a sentence is code-switched and to determine the language attribute of each word in it.
At present, recognition of code-switched text composed of characters of the same type mainly relies on manually collecting and labeling samples, that is, manually annotating whether a text is code-switched and the language attribute of each word it contains, then training a model on those samples and recognizing text with the trained model.
In the process of implementing the invention, at least the following technical problems were found in the prior art: manually collecting and labeling samples consumes a large amount of manpower and is costly, and manual labeling can introduce annotation errors, which lowers recognition accuracy.
Disclosure of Invention
The invention provides a text language identification method, a text language identification device, text language identification equipment and a text language identification medium, which aim to solve the technical problems of high cost and low identification accuracy in the prior art.
According to an aspect of the present invention, there is provided a text language recognition method including:
the method comprises the steps of acquiring a text to be recognized, determining each vocabulary to be recognized in the text to be recognized, and determining, for the current vocabulary to be recognized, at least one binary group to be recognized corresponding to it, wherein the binary group to be recognized comprises the current vocabulary to be recognized and an adjacent vocabulary of the current vocabulary to be recognized;
aiming at each binary group to be recognized, determining the binary probability of the binary group to be recognized under each preset language based on the binary probability dictionary corresponding to each preset language;
and determining a target language corresponding to the current vocabulary to be recognized in each preset language based on the binary probability of each binary group to be recognized in each preset language.
According to another aspect of the present invention, there is provided a text language recognition apparatus including:
the system comprises a binary group determining module, a binary probability determining module and a language identification module, wherein the binary group determining module is used for acquiring a text to be recognized, determining each vocabulary to be recognized in the text to be recognized, and determining, for the current vocabulary to be recognized, at least one binary group to be recognized corresponding to it, wherein the binary group to be recognized comprises the current vocabulary to be recognized and an adjacent vocabulary of the current vocabulary to be recognized;
the binary probability determining module is used for determining the binary probability of each to-be-identified binary group under each preset language based on the binary probability dictionary corresponding to each preset language;
and the language identification module is used for determining a target language corresponding to the current vocabulary to be identified in each preset language based on the binary probability of each binary group to be identified in each preset language.
According to another aspect of the present invention, there is provided an electronic apparatus including:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores a computer program executable by the at least one processor, the computer program being executable by the at least one processor to enable the at least one processor to perform the text language identification method according to any of the embodiments of the present invention.
According to another aspect of the present invention, there is provided a computer-readable storage medium storing computer instructions for causing a processor to implement the text language identification method according to any one of the embodiments of the present invention when the computer instructions are executed.
The technical scheme of the embodiment of the invention determines each vocabulary to be recognized in the text to be recognized and, for the current vocabulary to be recognized, determines at least one corresponding binary group to be recognized comprising the current vocabulary and an adjacent vocabulary. For each binary group to be recognized, the binary probability under each preset language is determined according to the binary probability dictionary corresponding to that language, and the target language corresponding to the current vocabulary to be recognized is determined from these binary probabilities. Because the probability of each binary group under each language is obtained from a pre-constructed dictionary, language recognition is based on the vocabulary's context, which improves the accuracy of language recognition.
It should be understood that the statements in this section are not intended to identify key or critical features of the embodiments of the present invention, nor are they intended to limit the scope of the invention. Other features of the present invention will become apparent from the following description.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings needed to be used in the description of the embodiments will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
Fig. 1 is a schematic flowchart of a text language identification method according to an embodiment of the present invention;
fig. 2 is a flowchart illustrating a text language identification method according to a second embodiment of the present invention;
fig. 3A is a schematic flowchart of a text language identification method according to a third embodiment of the present invention;
FIG. 3B is a diagram of a text language identification process according to a third embodiment of the present invention;
fig. 4 is a flowchart illustrating a text language identification method according to a fourth embodiment of the present invention;
fig. 5 is a schematic flowchart of a text language identification method according to a fifth embodiment of the present invention;
fig. 6 is a schematic structural diagram of a text language recognition apparatus according to a sixth embodiment of the present invention;
fig. 7 is a schematic structural diagram of an electronic device according to a seventh embodiment of the present invention.
Detailed Description
In order to make those skilled in the art better understand the technical solutions of the present invention, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be obtained by a person skilled in the art without making any creative effort based on the embodiments in the present invention, shall fall within the protection scope of the present invention.
It should be noted that the terms "first," "second," and the like in the description and claims of the present invention and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the invention described herein are capable of operation in sequences other than those illustrated or described herein. Moreover, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
Before describing the text language identification method provided by the embodiment of the invention in detail, an application scenario of the method is explained in an exemplary manner. The text language identification method provided by the embodiment can be used for identifying the language of each vocabulary in the text, and can also be used for identifying whether one text is a language code conversion text, namely one text consists of at least two languages.
Specifically, the text language identification method can be used not only for language identification of non-language-code conversion texts but also for language identification of language-code conversion texts, whether composed of the same type of characters or of different types of characters. For example, for the same type of characters, the method may be used to identify the languages of the words in a language-code conversion text composed of English and Portuguese, or of Portuguese and Spanish, or of Polish, English and Spanish, and so on. It should be noted that the text language identification method provided in the embodiment of the present invention may be used to identify a language-code conversion text composed of two or more languages written in the same type of characters.
Example one
Fig. 1 is a flowchart of a text language identification method according to an embodiment of the present invention. The present embodiment is applicable to identifying whether a text is a language-code conversion text and/or identifying the language of each vocabulary in a text, and is particularly applicable to identifying the language of each vocabulary in a language-code conversion text composed of characters of the same type. As shown in fig. 1, the method includes:
s110, a text to be recognized is obtained, each vocabulary to be recognized in the text to be recognized is determined, and at least one duplet to be recognized corresponding to the current vocabulary to be recognized is determined aiming at the current vocabulary to be recognized, wherein the duplet to be recognized comprises the current vocabulary to be recognized and adjacent vocabularies of the current vocabulary to be recognized.
In this embodiment, the text to be recognized may be a non-language-code-converted text, a language-code-converted text composed of multiple languages of the same type of characters, and a language-code-converted text composed of multiple languages of different types of characters.
Specifically, a pre-trained word segmenter may be used to process the text to be recognized, so as to obtain each vocabulary to be recognized in the text. Each vocabulary to be recognized may then be processed in turn; for the current vocabulary to be recognized, at least one binary group to be recognized may be determined based on the adjacent vocabularies of the current vocabulary to be recognized.
For example, if the current vocabulary to be recognized is the first vocabulary in the text to be recognized, a binary group to be recognized may be constructed from the current vocabulary and its next adjacent vocabulary; if it is the last vocabulary, a binary group to be recognized may be constructed from the current vocabulary and its previous adjacent vocabulary; if it is an intermediate vocabulary (neither first nor last), one binary group to be recognized may be constructed from the previous adjacent vocabulary and the current vocabulary, and another from the current vocabulary and the next adjacent vocabulary.
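The three cases can be sketched in Python as follows (a minimal illustration; the function name `bigrams_for` and the list-of-tuples representation are ours, not the patent's):

```python
def bigrams_for(words, i):
    """Return the binary group(s) to be recognized for the vocabulary at
    position i: the first word yields (current, next), the last word yields
    (previous, current), and a middle word yields both, capturing its left
    and right context."""
    pairs = []
    if i > 0:                      # a previous adjacent vocabulary exists
        pairs.append((words[i - 1], words[i]))
    if i < len(words) - 1:         # a next adjacent vocabulary exists
        pairs.append((words[i], words[i + 1]))
    return pairs
```

For a four-word sentence, position 0 yields one binary group while the two middle positions each yield two.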
It should be noted that, in this embodiment, the purpose of constructing at least one binary group to be recognized for the current vocabulary to be recognized is to obtain information containing both the current vocabulary and its surrounding context, so that language recognition of the current vocabulary can be performed based on that context, improving the accuracy of language recognition.
S120, aiming at each binary group to be recognized, determining the binary probability of the binary group to be recognized under each preset language based on the binary probability dictionary corresponding to each preset language.
The preset language may be a language that may appear in a pre-set transcoding text composed of characters of the same type, such as english and portuguese, or polish, english, spanish, and the like. Specifically, for each preset language, a binary probability dictionary corresponding to the preset language may be pre-established, where the binary probability dictionary includes each bigram and a binary probability corresponding to each bigram.
In this embodiment, each binary group to be recognized may be looked up in each binary probability dictionary to obtain its binary probability under each preset language. The binary probability of a binary group to be recognized under a preset language represents the probability that the binary group occurs in that language.
Exemplarily, take two preset languages, English and Portuguese. If the binary groups to be recognized corresponding to the current vocabulary to be recognized are binary group A and binary group B, then the binary probabilities of binary group A under English and under Portuguese can be determined from the English binary probability dictionary and the Portuguese binary probability dictionary respectively, and likewise the binary probabilities of binary group B under English and under Portuguese.
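A minimal sketch of this lookup, with toy dictionaries and an assumed floor probability for binary groups absent from a dictionary (none of the words or numbers come from the patent):

```python
# Toy binary probability dictionaries, one per preset language (assumed data).
BIGRAM_DICTS = {
    "english": {("i", "like"): 1e-4, ("like", "it"): 2e-4},
    "portuguese": {("eu", "gosto"): 1e-4, ("gosto", "disso"): 2e-4},
}

FLOOR = 1e-9  # assumed probability for a binary group unseen in a language

def bigram_prob(bigram, lang):
    """Binary probability of a binary group under the given preset language."""
    return BIGRAM_DICTS[lang].get(bigram, FLOOR)
```

The floor value stands in for whatever smoothing the pre-built dictionaries use for unseen binary groups.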
S130, determining a target language corresponding to the current vocabulary to be recognized in each preset language based on the binary probability of each binary group to be recognized in each preset language.
For example, after obtaining the binary probabilities of the binary groups to be recognized in the preset languages, a maximum value of the binary probabilities may be determined as a probability to be verified, and if the probability to be verified is greater than a preset threshold, a preset language corresponding to the probability to be verified may be determined as a target language corresponding to the vocabulary to be recognized.
In a specific embodiment, the determining, in each preset language, a target language corresponding to a current vocabulary to be recognized based on a binary probability of each binary group to be recognized in each preset language may be: for each preset language, determining the maximum value of the binary probabilities of the binary groups to be recognized under the preset language as the binary reference probability corresponding to the preset language; and determining a target language corresponding to the current vocabulary to be recognized based on the binary reference probabilities corresponding to the preset languages and the preset binary probability threshold values corresponding to the preset languages respectively.
Specifically, when there are multiple binary groups to be recognized, the maximum of their binary probabilities under a preset language may be determined, for each preset language, as the binary reference probability corresponding to that language.
Further, after the binary reference probability corresponding to each preset language is obtained, the target language can be determined according to the preset binary probability threshold corresponding to each preset language. Taking the maximum of the binary probabilities under a preset language as the binary reference probability, and then determining the target language from the reference probabilities, preserves recognition precision while avoiding a separate language decision for every binary group under every preset language, improving recognition efficiency.
The preset binary probability threshold may be a preset probability threshold value appearing in a preset language. In one embodiment, the determining a target language corresponding to a current vocabulary to be recognized based on a binary reference probability corresponding to each preset language and a preset binary probability threshold corresponding to each preset language may be: and aiming at each preset language, comparing the binary reference probability corresponding to the preset language with a preset binary probability threshold corresponding to the preset language, and if the binary reference probability is greater than the preset binary probability threshold, determining the preset language as a target language.
The preset binary probability threshold may also be a critical value for the ratio between the probability of occurrence in one preset language and the probabilities of occurrence in the other preset languages. In another embodiment, determining the target language corresponding to the vocabulary to be recognized based on the binary reference probabilities and the preset binary probability thresholds may be: determining a binary probability ratio corresponding to the current vocabulary to be recognized based on the binary reference probability corresponding to each preset language; and determining the target language corresponding to the current vocabulary to be recognized based on the binary probability ratio and the preset binary probability thresholds respectively corresponding to the preset languages.
Specifically, the logarithm of each binary reference probability may be calculated, and the ratio of these logarithms used as the binary probability ratio. For example, with two preset languages the binary probability ratio r can be obtained based on the following formula:

    r = log( max( P1(w_prev, w_cur), P1(w_cur, w_next) ) ) / log( max( P2(w_prev, w_cur), P2(w_cur, w_next) ) )

where r denotes the binary probability ratio; P1(w_prev, w_cur) denotes the binary probability, under the first preset language, of the binary group to be recognized composed of the previous vocabulary and the current vocabulary to be recognized; P1(w_cur, w_next) denotes the binary probability, under the first preset language, of the binary group composed of the current vocabulary to be recognized and the next vocabulary; and P2(w_prev, w_cur) and P2(w_cur, w_next) denote the corresponding binary probabilities under the second preset language.
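The binary probability ratio can be computed from the per-language binary probabilities as a ratio of logarithms; a sketch in Python (the function names are ours; `math.log` is the natural logarithm, though any base gives the same ratio):

```python
import math

def reference_prob(bigram_probs):
    """Binary reference probability: the maximum binary probability among the
    binary groups to be recognized under one preset language."""
    return max(bigram_probs)

def probability_ratio(probs_lang1, probs_lang2):
    """Ratio of the logarithms of the two binary reference probabilities.
    Probabilities lie in (0, 1), so both logarithms are negative and the
    ratio is positive; a ratio below 1 favours the numerator language."""
    return math.log(reference_prob(probs_lang1)) / math.log(reference_prob(probs_lang2))
```

For instance, reference probabilities of 1e-2 under language 1 and 1e-4 under language 2 give a ratio of 0.5, favouring language 1.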
The above formula is suitable for the case that the number of the binary groups to be recognized is multiple and the number of the preset languages is 2. Of course, other situations may also be applicable. For example, if the number of the to-be-identified binary groups is one, the ratio between the logarithmic values of the binary probabilities in each preset language in the to-be-identified binary group can be directly determined as the binary probability ratio. If the number of the preset languages is greater than 2, for example, 3, the binary probability ratio may be calculated by taking the sum of the binary reference probabilities of any two preset languages as a numerator and the binary reference probability of the other preset language as a denominator.
Further, after the binary probability ratio is obtained, it may be compared with the preset binary probability thresholds respectively corresponding to the preset languages. Specifically, if the binary probability ratio is smaller than the preset binary probability threshold of the preset language corresponding to the numerator of the ratio, the target language corresponding to the current vocabulary to be recognized may be determined to be that numerator language; if the binary probability ratio is greater than the preset binary probability threshold of the preset language corresponding to the denominator, the target language may be determined to be that denominator language.
For example, taking English and Portuguese as the preset languages, the preset binary probability thresholds include a threshold T_en corresponding to English and a threshold T_pt corresponding to Portuguese. If, when the binary probability ratio r is calculated, the preset language corresponding to the numerator is English and the preset language corresponding to the denominator is Portuguese, the target language corresponding to the current vocabulary to be recognized satisfies the following rule:

    when r < T_en, the target language is English;
    when r > T_pt, the target language is Portuguese;
    when T_en <= r <= T_pt, the current vocabulary to be recognized is determined to be an unknown word;

where 0 < T_en < 1 and T_pt > 1.
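The threshold decision can be sketched as follows; the threshold values 0.8 and 1.25 are illustrative assumptions satisfying 0 < T_en < 1 < T_pt, not values taken from the patent:

```python
def decide(ratio, t_en=0.8, t_pt=1.25):
    """Apply the preset binary probability thresholds to the binary
    probability ratio; English is the numerator language and Portuguese
    the denominator language."""
    if ratio < t_en:
        return "english"
    if ratio > t_pt:
        return "portuguese"
    return "unknown"  # t_en <= ratio <= t_pt: unknown word
```

The middle band deliberately refuses a decision, deferring unknown words to a fallback (such as the unary dictionaries mentioned below in the description).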
It should be noted that, the above process is exemplified by the number of the preset languages being 2, and the method provided in this embodiment does not limit the number of the preset languages.
Specifically, if the number of the preset languages is greater than 2 (taking 3 as an example), determining the target language corresponding to the current vocabulary to be recognized based on the binary probability ratio and the preset binary probability thresholds may proceed as follows: if the binary probability ratio is greater than the preset binary probability threshold of the preset language corresponding to the denominator, the target language is determined to be that denominator language; if the binary probability ratio is smaller than the sum of the preset binary probability thresholds of the two preset languages corresponding to the numerator, the denominator language can be excluded as the target language; a new binary probability ratio can then be recalculated from the binary reference probabilities of the two numerator languages, and the target language determined from the new ratio and the preset binary probability thresholds of those two languages.
By the above method, language identification based on the binary probability ratio and the preset binary probability threshold of each preset language is realized. Calculating the binary probability ratio yields comparative information between the binary probabilities of the binary groups to be recognized under the different preset languages, and determining the target language from this comparative information and the thresholds further improves recognition accuracy compared with determining the target language directly from the binary probability under each preset language.
It should be noted that, in the process of determining the target language corresponding to the current vocabulary to be recognized based on the binary probability ratio and the preset binary probability thresholds, when the binary probability ratio is greater than the threshold of the numerator language and less than the threshold of the denominator language, the current vocabulary to be recognized may be determined to be an unknown word. When the current vocabulary to be recognized is determined to be an unknown word, its target language can be further determined according to the unary probability dictionaries corresponding to the preset languages.
By the method, each vocabulary to be recognized in the text to be recognized can be sequentially used as the current vocabulary to be recognized, so that all vocabularies to be recognized in the text to be recognized are recognized, the target languages corresponding to all vocabularies to be recognized are determined, and the language recognition of the text to be recognized is realized.
Furthermore, whether the text to be recognized is a language code conversion text or not can be determined according to the target language corresponding to all the vocabularies to be recognized in the text to be recognized; or, determining whether the text to be recognized is a language code conversion text composed of characters of the same type; or, determining whether the text to be recognized is a language-code conversion text composed of different types of characters, and the like.
According to the technical scheme of this embodiment, each vocabulary to be recognized in the text to be recognized is determined, and for the current vocabulary to be recognized, at least one corresponding binary group to be recognized comprising the current vocabulary and an adjacent vocabulary is determined. The binary probability of each binary group to be recognized under each preset language is then determined according to the binary probability dictionary corresponding to that preset language, and the target language corresponding to the current vocabulary to be recognized is determined according to the binary probabilities. Because the probability of each binary group under each language is determined through pre-constructed dictionaries, language recognition based on the vocabulary context is realized and the accuracy of the language recognition is improved.
Example two
Fig. 2 is a schematic flow diagram of a text language identification method according to a second embodiment of the present invention, and this embodiment performs supplementary explanation on a process of determining a target language corresponding to a current vocabulary to be identified based on a binary probability ratio and preset binary probability thresholds corresponding to preset languages, respectively, on the basis of the foregoing embodiments. As shown in fig. 2, the method includes:
S210, obtaining a text to be recognized, determining each vocabulary to be recognized in the text to be recognized, and determining at least one binary group to be recognized corresponding to the current vocabulary to be recognized aiming at the current vocabulary to be recognized.
S220, aiming at each binary group to be recognized, determining the binary probability of the binary group to be recognized under each preset language based on the binary probability dictionary respectively corresponding to each preset language.
S230, aiming at each preset language, determining the maximum value of the binary probabilities of the binary groups to be recognized under the preset language as the binary reference probability corresponding to the preset language.
S240, determining a binary probability ratio corresponding to the current vocabulary to be recognized based on the binary reference probabilities corresponding to the preset languages, and if the binary probability ratio does not meet preset binary probability threshold values corresponding to the preset languages, determining a unary probability of the current vocabulary to be recognized under the preset languages based on unary probability dictionaries corresponding to the preset languages.
Specifically, if the binary probability ratio is not greater than the preset binary probability threshold of the preset language corresponding to the denominator in the binary probability ratio, and the binary probability ratio is not less than the preset binary probability threshold of the preset language corresponding to the numerator in the binary probability ratio, it may be determined that the binary probability ratio does not satisfy the preset binary probability thresholds corresponding to the preset languages, respectively.
In this embodiment, for a case that the binary probability ratio does not satisfy the preset binary probability threshold corresponding to each preset language, the target language corresponding to the current vocabulary to be recognized may be determined further based on the unary probability dictionary corresponding to each preset language. The unary probability dictionary comprises vocabularies and unary probabilities corresponding to the vocabularies, and the unary probabilities can represent the probabilities of the vocabularies appearing in the preset language.
Specifically, the vocabulary to be recognized currently may be compared with each vocabulary in the unary probability dictionary corresponding to each preset language, so as to determine the unary probability in each preset language corresponding to the vocabulary to be recognized currently.
For example, taking the preset languages including English and Portuguese as an example, the unary probability under English and the unary probability under Portuguese of the current vocabulary to be recognized may be determined based on the unary probability dictionary of English and the unary probability dictionary of Portuguese, respectively.
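A minimal sketch of this lookup step; the dictionary contents and the out-of-vocabulary floor value below are illustrative assumptions, not data from the patent:

```python
def unary_probs(word, unary_dicts, oov_floor=1e-8):
    """Look up a word's unary probability in each preset language's
    unary probability dictionary; words absent from a dictionary fall
    back to a small floor probability (an assumption of this sketch)."""
    return {lang: d.get(word, oov_floor) for lang, d in unary_dicts.items()}
```

Example: `unary_probs("casa", {"english": {...}, "portuguese": {...}})` yields the per-language probabilities used to form the unary probability ratio.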
S250, determining a target language corresponding to the current vocabulary to be recognized based on the unary probability of the current vocabulary to be recognized under each preset language.
Exemplarily, unary probabilities of the current vocabulary to be recognized under each preset language can be compared, and the preset language corresponding to the maximum unary probability is determined as the target language; or, it may be determined whether the maximum unitary probability is greater than a preset threshold of the preset language corresponding to the maximum unitary probability, and if so, the preset language corresponding to the maximum unitary probability may be determined as the target language.
In a specific embodiment, the determining, based on unary probabilities of the current vocabulary to be recognized in each preset language, a target language corresponding to the current vocabulary to be recognized may be: determining the unitary probability ratio corresponding to the current vocabulary to be recognized based on the unitary probability of the current vocabulary to be recognized under each preset language; and determining a target language corresponding to the current vocabulary to be recognized based on the unitary probability ratio and the preset unitary probability threshold value corresponding to each preset language.
The predetermined unary probability threshold may be a ratio threshold between a preset probability of occurrence in a predetermined language and probabilities of occurrence in other predetermined languages. Specifically, the ratio of the logarithmic values of the unary probabilities may be used as the unitary probability ratio by calculating the logarithmic values of the unary probabilities. For example, the unitary probability ratio can be obtained based on the following formula:
$$ r = \frac{\log P_{1}(w)}{\log P_{2}(w)} $$

wherein $r$ is the unary probability ratio, $P_{1}(w)$ represents the unary probability of the current vocabulary to be recognized under the first preset language, and $P_{2}(w)$ represents the unary probability of the current vocabulary to be recognized under the second preset language.
The above formula is applicable to the case where the number of preset languages is 2, and of course, may also be applicable to other cases. For example, if the number of the predetermined languages is greater than 2, for example, 3, the unitary probability ratio may be calculated by taking the sum of unitary probabilities of any two predetermined languages as a numerator and the unitary probability of another predetermined language as a denominator.
Further, after the unary probability ratio is obtained, the unary probability ratio may be compared with the preset unary probability thresholds respectively corresponding to the preset languages. For example, if the unary probability ratio is smaller than the preset unary probability threshold of the preset language corresponding to the numerator in the unary probability ratio, it may be determined that the target language corresponding to the current vocabulary to be recognized is the preset language corresponding to the numerator; if the unary probability ratio is greater than the preset unary probability threshold of the preset language corresponding to the denominator in the unary probability ratio, it can be determined that the target language corresponding to the current vocabulary to be recognized is the preset language corresponding to the denominator.
For example, taking the example that each preset language includes English and Portuguese, the preset unary probability thresholds respectively corresponding to the preset languages include a threshold $\theta_{en}$ corresponding to English and a threshold $\theta_{pt}$ corresponding to Portuguese. If, in the calculation of the unary probability ratio $r$, the preset language corresponding to the numerator is English and the preset language corresponding to the denominator is Portuguese, the target language corresponding to the current vocabulary to be recognized satisfies the following formula:

$$ \text{lang}(w) = \begin{cases} \text{English}, & r < \theta_{en} \\ \text{Portuguese}, & r > \theta_{pt} \\ \text{unknown word}, & \theta_{en} \le r \le \theta_{pt} \end{cases} $$

That is, when the unary probability ratio is less than the threshold $\theta_{en}$ corresponding to English, the target language is English; when the unary probability ratio is greater than the threshold $\theta_{pt}$ corresponding to Portuguese, the target language is Portuguese; and when the unary probability ratio is greater than $\theta_{en}$ and less than $\theta_{pt}$, the current vocabulary to be recognized can be determined as an unknown word. The thresholds satisfy $0 < \theta_{en} < 1$, $\theta_{pt} > 1$, and $\theta_{pt} > \theta_{en}$.
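The English/Portuguese decision rule above can be sketched as follows; the threshold values, dictionary contents, and the out-of-vocabulary probability floor are illustrative assumptions:

```python
import math

def classify(word, en_dict, pt_dict, th_en=0.9, th_pt=1.1, oov_floor=1e-9):
    """Decide English / Portuguese / unknown from the unary probability
    ratio r = log P_en(w) / log P_pt(w), with 0 < th_en < 1 < th_pt.
    The concrete thresholds and the OOV floor here are assumptions."""
    p_en = en_dict.get(word, oov_floor)
    p_pt = pt_dict.get(word, oov_floor)
    r = math.log(p_en) / math.log(p_pt)
    if r < th_en:
        return "english"     # word is far more probable under English
    if r > th_pt:
        return "portuguese"  # word is far more probable under Portuguese
    return "unknown"         # fall back to further recognition steps
```

Since both probabilities are below 1 their logarithms are negative, so a word frequent in English and rare in Portuguese yields a small positive ratio, and vice versa.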
It should be noted that the above process is exemplified with the number of preset languages being 2, and the method provided in this embodiment does not limit the number of preset languages. If the number of preset languages is greater than 2, taking 3 as an example: if the unary probability ratio is greater than the preset unary probability threshold of the preset language corresponding to the denominator in the unary probability ratio, it may be determined that the target language corresponding to the current vocabulary to be recognized is the preset language corresponding to the denominator; if the unary probability ratio is smaller than the sum of the preset unary probability thresholds of the two preset languages corresponding to the numerator in the unary probability ratio, it can be excluded that the target language of the current vocabulary to be recognized is the preset language corresponding to the denominator. Further, a new unary probability ratio can be calculated from the unary probabilities of the two preset languages corresponding to the numerator, and the target language can be determined based on the new unary probability ratio and the preset unary probability thresholds of those two preset languages.
By the above method, language recognition based on the unary probability ratio and the preset unary probability threshold of each preset language is realized. Calculating the unary probability ratio makes it possible to compare the unary probabilities of the current vocabulary to be recognized under the preset languages, and the target language is determined from this comparison against the thresholds, further improving recognition accuracy.
It should be noted that, in the method provided in the embodiment of the present invention, the language is first recognized according to each binary group to be recognized corresponding to the current vocabulary to be recognized; when recognition cannot be completed based on the binary probabilities of the binary groups, recognition is further performed based on the unary probabilities of the current vocabulary to be recognized itself, thereby realizing language recognition combined with context and improving the accuracy of language recognition.
In an optional embodiment, before determining the at least one binary group to be recognized corresponding to the current vocabulary to be recognized, at least one triple to be recognized corresponding to the current vocabulary may be determined first, so that the target language is determined according to the ternary probability of each triple under each preset language. If the target language cannot be recognized based on the ternary probabilities of the triples, the at least one binary group to be recognized corresponding to the current vocabulary may then be determined. The ternary probability can be determined by a pre-constructed ternary probability dictionary corresponding to each preset language.
Through this optional implementation, language recognition combined with context can be realized and the accuracy of language recognition improved. Considering that a ternary probability dictionary corresponding to each preset language needs to be constructed in advance, the method can, according to actual requirements, perform language recognition for the current vocabulary to be recognized starting either from the binary groups corresponding to the vocabulary or from the triples corresponding to the vocabulary.
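The triple-first fallback described above amounts to trying n-gram orders in sequence until one yields a decision. A minimal sketch, in which the recognizer interface is an illustrative assumption:

```python
def recognize_with_backoff(word, context, recognizers):
    """Try each n-gram recognizer in order (e.g. trigram, then bigram,
    then unigram) and return the first definite language. A recognizer
    returns a language name, or None when it cannot decide."""
    for recognize in recognizers:
        lang = recognize(word, context)
        if lang is not None:
            return lang
    return "unknown"  # no order could decide
```

The ordered list lets a deployment choose, per the actual requirements mentioned above, whether to start from triples or directly from binary groups.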
According to the technical scheme of the embodiment, under the condition that the binary probability ratio does not meet the preset binary probability threshold corresponding to all preset languages, the unary probability of the current vocabulary to be recognized under each preset language is determined according to the unary probability dictionary corresponding to each preset language, further, the target language corresponding to the current vocabulary to be recognized is determined according to each unary probability, when the binary group is formed by the vocabularies of two languages, language recognition is carried out according to the unary probability of the vocabularies in the dictionary, the language recognition accuracy is further improved, and misjudgment of the binary group containing the vocabularies of the two languages is avoided.
Example three
Fig. 3A is a schematic flow chart of a text language identification method according to a third embodiment of the present invention, and in this embodiment, on the basis of the foregoing embodiments, a determination process of each preset binary probability threshold and each preset unary probability threshold is added. As shown in fig. 3A, the method includes:
S310, obtaining a text test set, wherein the text test set comprises each sample text and a sample label corresponding to each sample text, and each sample label comprises a sample language corresponding to each word in each sample text.
Specifically, the text test set includes a plurality of samples, each sample includes a sample text and a sample label corresponding to the sample text, where the sample label may be a sample language corresponding to each word in the sample text.
S320, obtaining initial binary probability threshold values and initial unitary probability threshold values corresponding to the preset languages respectively, and determining prediction labels corresponding to the sample texts aiming at each sample text based on the initial binary probability threshold values and the initial unitary probability threshold values, wherein the prediction labels comprise prediction languages corresponding to all words in the sample text.
Each initial binary probability threshold and each initial unary probability threshold may be default initial values set in advance, or may also be initial values set by human experience.
Specifically, each sample binary group corresponding to a sample word in a sample text can be determined, further, the binary probability of the sample binary group in each preset language is determined based on the binary probability dictionary corresponding to each preset language respectively, the predicted language is determined according to each initial binary probability threshold and each binary probability, if the sample word is determined to be an unknown word, the unary probability of the sample word in each preset language can be further determined according to each unary probability dictionary, and the predicted language is determined according to each initial unary probability threshold and the unary probability.
S330, adjusting each initial binary probability threshold and/or each initial unitary probability threshold based on each prediction label, each sample label and a preset target function to obtain each preset binary probability threshold and each preset unitary probability threshold.
In one embodiment, the preset objective function may be a loss function; such as a logarithmic loss function, a mean square error, a cross entropy cost function, or an exponential loss function, etc. Specifically, the preset objective function may be calculated according to each prediction label and each sample label, so as to obtain a calculation result of the preset objective function, and each initial binary probability threshold value or each initial unitary probability threshold value may be adjusted in a reverse direction according to the calculation result of the preset objective function, or each initial binary probability threshold value and each initial unitary probability threshold value may be adjusted at the same time.
It should be noted that, in the above process of determining, for each sample text, a prediction label corresponding to the sample text, and adjusting each initial binary probability threshold and/or each initial unitary probability threshold according to each prediction label, each sample label, and a preset objective function, the process may be performed in a loop, and the condition of loop cutoff may be that the calculation result of the preset objective function meets a preset threshold condition, or the number of loops reaches a preset number of times, or the calculation result of the preset objective function reaches a minimum value, and so on.
In a specific embodiment, the adjusting of each initial binary probability threshold and/or each initial unitary probability threshold based on each prediction label, each sample label, and a preset objective function may be: determining current prediction parameters corresponding to the text test set based on the prediction labels and the sample labels, wherein the current prediction parameters comprise sentence-level accuracy, sentence-level recall rate, word-level accuracy and word-level recall rate; and calculating the preset objective function according to the current prediction parameters, maximizing the calculation result of the preset objective function as an optimization target, and adjusting each initial binary probability threshold and/or each initial unitary probability threshold.
Specifically, sentence-level accuracy, sentence-level recall, word-level accuracy, and word-level recall of the text test set may be determined according to the differences between the predicted labels and the sample labels.
Further, a sentence-level F1 value can be calculated according to the sentence-level accuracy and the sentence-level recall rate, a word-level F1 value can be calculated according to the word-level accuracy and the word-level recall rate, and a preset objective function can be calculated according to the sentence-level F1 value and the word-level F1 value. Illustratively, the preset objective function satisfies the following formula:
$$ F = F1_{sentence} + F1_{word} $$

wherein $F$ represents the calculation result of the preset objective function, $F1_{sentence}$ represents the sentence-level F1 value, and $F1_{word}$ represents the word-level F1 value. Specifically, $F1_{sentence}$ and $F1_{word}$ can be obtained by the following formulas:

$$ F1_{sentence} = \frac{2 \times P_{sentence} \times R_{sentence}}{P_{sentence} + R_{sentence}} $$

$$ F1_{word} = \frac{2 \times P_{word} \times R_{word}}{P_{word} + R_{word}} $$

wherein $P_{sentence}$ represents the sentence-level accuracy, $R_{sentence}$ represents the sentence-level recall rate, $P_{word}$ represents the word-level accuracy, and $R_{word}$ represents the word-level recall rate.
Further, the initial binary probability threshold values or the initial unitary probability threshold values may be adjusted by maximizing the calculation result of the preset objective function as the optimization objective, or the initial binary probability threshold values and the initial unitary probability threshold values may be adjusted at the same time. Optionally, a bayesian automatic parameter adjusting method may be adopted to adjust the initial binary probability threshold and/or each initial unitary probability threshold.
Specifically, each initial binary probability threshold and each initial unitary probability threshold corresponding to a maximum value in each calculation result of the preset objective function may be used as each preset binary probability threshold and each preset unitary probability threshold.
The sentence-level accuracy, the sentence-level recall rate, the word-level accuracy and the word-level recall rate are calculated, and the sentence-level F1 value and the word-level F1 value are calculated through the sentence-level accuracy, the sentence-level recall rate, the word-level accuracy and the word-level recall rate, so that the sum maximization of the sentence-level F1 value and the word-level F1 value is taken as an optimization target, each initial binary probability threshold value and/or each initial unitary probability threshold value is adjusted, the automatic optimization of parameters is realized, the precision of each preset binary probability threshold value and each preset unitary probability threshold value is improved, and further, the language identification effect on texts is improved.
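The objective to be maximized during threshold tuning can be sketched directly from the four prediction parameters. Function names are illustrative; the patent applies Bayesian automatic parameter tuning on top of this objective, which is omitted here:

```python
def f1(precision, recall):
    # Harmonic mean of precision and recall; guard the degenerate zero case.
    return 0.0 if precision + recall == 0 else 2 * precision * recall / (precision + recall)

def objective(p_sent, r_sent, p_word, r_word):
    """Preset objective: the sum of the sentence-level and word-level F1
    values, to be maximized when adjusting the probability thresholds."""
    return f1(p_sent, r_sent) + f1(p_word, r_word)
```

The threshold set (binary and/or unary) achieving the maximum `objective` value over the text test set is kept as the preset thresholds.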
S340, obtaining a text to be recognized, determining each vocabulary to be recognized in the text to be recognized, and determining at least one binary group to be recognized corresponding to the current vocabulary to be recognized aiming at the current vocabulary to be recognized.
S350, aiming at each binary group to be recognized, determining the binary probability of the binary group to be recognized under each preset language based on the binary probability dictionary corresponding to each preset language.
S360, aiming at each preset language, determining the maximum value of the binary probabilities of the binary groups to be recognized under the preset language as the binary reference probability corresponding to the preset language.
S370, determining a binary probability ratio corresponding to the current vocabulary to be recognized based on the binary reference probabilities corresponding to the preset languages, and if the binary probability ratio does not meet preset binary probability threshold values corresponding to the preset languages, determining a unary probability of the current vocabulary to be recognized under the preset languages based on unary probability dictionaries corresponding to the preset languages.
S380, determining the unary probability ratio corresponding to the current vocabulary to be recognized based on the unary probability of the current vocabulary to be recognized under each preset language; and determining a target language corresponding to the current vocabulary to be recognized based on the unary probability ratio and the preset unary probability thresholds respectively corresponding to the preset languages.
According to the technical scheme, the text test set is obtained, the prediction labels corresponding to the sample texts in the text test set are determined according to the initial binary probability threshold values and the initial unitary probability threshold values, and then the initial binary probability threshold values and/or the initial unitary probability threshold values are adjusted according to the prediction labels, the sample labels and the target functions to obtain the preset binary probability threshold values and the preset unitary probability threshold values.
For example, referring to fig. 3B, fig. 3B is a schematic diagram of a text language recognition process provided in this embodiment, and the recognition process of code-switched text is exemplarily illustrated by taking preset languages including English and Portuguese as an example. Massive pure-English and pure-Portuguese text corpora are obtained in advance; an English unary frequency dictionary, an English binary frequency dictionary, a Portuguese unary frequency dictionary and a Portuguese binary frequency dictionary are created from the corpora, from which an English unary probability dictionary, an English binary probability dictionary, a Portuguese unary probability dictionary and a Portuguese binary probability dictionary are created. Further, the binary probability ratio of the current vocabulary to be recognized is calculated, and it is judged whether the binary probability ratio meets each preset binary probability threshold: if so, the target language is updated according to the preset language word list; if not, the unary probability ratio is calculated, the target language is determined according to the unary probability ratio and each preset unary probability threshold, the target language is updated according to the preset language word list, and finally the recognition result is output. The preset unary probability thresholds and preset binary probability thresholds can be obtained through the steps of constructing a text test set, Bayesian automatic parameter tuning, and outputting the optimal parameters.
In the above process, a language recognition method based on vocabulary context is realized, improving recognition accuracy. In addition, the thresholds are determined automatically by setting up Bayesian automatic parameter tuning for each threshold, and an English special-word vocabulary is collected for forced recognition in code-mixed text recognition, saving manual proofreading cost and improving the recognition accuracy for code-mixed text.
Example four
Fig. 4 is a flowchart illustrating a text language identification method according to a fourth embodiment of the present invention, where in this embodiment, a process of updating a target language with a currently recognized vocabulary based on a preset language vocabulary is added on the basis of the foregoing embodiments. As shown in fig. 4, the method includes:
S410, obtaining a text to be recognized, determining each vocabulary to be recognized in the text to be recognized, and determining at least one binary group to be recognized corresponding to the current vocabulary to be recognized aiming at the current vocabulary to be recognized.
S420, determining the binary probability of the binary group to be recognized under each preset language based on the binary probability dictionary respectively corresponding to each preset language aiming at each binary group to be recognized.
S430, determining a target language corresponding to the current vocabulary to be recognized in each preset language based on the binary probability of each binary group to be recognized in each preset language.
S440, acquiring a preset language word list corresponding to at least one reference language in each preset language, determining whether each preset language word list contains the current vocabulary to be recognized, and if yes, updating the target language corresponding to the current vocabulary to be recognized based on the reference language corresponding to the preset language word list containing the current vocabulary to be recognized.
Specifically, among the preset languages there may be a preset language with common abbreviations or a proprietary vocabulary, such as the abbreviations WIFI, GPS, WTO, exam, etc. in English. Therefore, the preset language with abbreviation words or a special word list can be used as a reference language, and a preset language word list corresponding to the reference language can be constructed.
In this embodiment, after the target language of each current vocabulary to be recognized is determined according to the binary probability dictionary of each preset language, whether the current vocabulary to be recognized exists is queried in each preset language vocabulary, and if yes, the reference language corresponding to the preset language vocabulary including the current vocabulary to be recognized is directly used as the target language of the current vocabulary to be recognized.
Of course, in another embodiment, before determining each to-be-recognized binary group corresponding to the current to-be-recognized vocabulary, it may also be determined whether the current to-be-recognized vocabulary exists in each preset language vocabulary, and if so, the target language may be determined according to the preset language vocabulary including the current to-be-recognized vocabulary, and the step of determining each to-be-recognized binary group corresponding to the current to-be-recognized vocabulary is not required to be performed, so that the recognition efficiency is improved.
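The word-list override described above can be sketched as follows; the data layout (a mapping from reference language to a word set) and all names are illustrative assumptions:

```python
def apply_vocab_override(word, predicted_lang, preset_vocabs):
    """Force the target language when the word appears in a reference
    language's preset word list (e.g. English abbreviations such as
    WIFI, GPS, WTO); otherwise keep the language predicted from the
    n-gram probabilities."""
    for lang, vocab in preset_vocabs.items():
        if word in vocab:
            return lang
    return predicted_lang
```

Running this check before the binary-group step, as the alternative embodiment suggests, lets known special words skip the probability computation entirely.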
According to the technical scheme of the embodiment, whether the current vocabulary to be recognized is contained in each preset language vocabulary is determined through the preset language vocabulary corresponding to at least one reference language, if yes, the target language can be directly determined according to the reference language corresponding to the preset language vocabulary, so that re-recognition based on common abbreviation words or special words is realized, and the recognition accuracy is further improved.
Example five
Fig. 5 is a schematic flow diagram of a text language identification method according to a fifth embodiment of the present invention, and this embodiment exemplarily illustrates a process of creating a unary probability dictionary and a binary probability dictionary corresponding to each preset language on the basis of the foregoing embodiments. As shown in fig. 5, the method includes:
S510, obtaining a corpus corresponding to each preset language, determining the use frequency of each vocabulary and the use frequency of each bigram according to the corpus, constructing a unary frequency dictionary corresponding to the preset language according to the use frequency of each vocabulary, and constructing a binary frequency dictionary corresponding to the preset language according to the use frequency of each bigram.
The corpus corresponding to each preset language may include a large number of sample texts in that preset language. Specifically, for each preset language, the use frequency of each vocabulary can be determined according to the total number of vocabularies in the corpus and the occurrence count of each vocabulary, and the use frequency of each binary group can be determined according to the total number of binary groups in the corpus and the occurrence count of each binary group.
Further, a unary frequency dictionary can be constructed according to the use frequency of each vocabulary, wherein the unary frequency dictionary comprises each vocabulary and the unary frequency (i.e., the use frequency) corresponding to each vocabulary; and a binary frequency dictionary can be constructed according to the use frequency of each binary group, wherein the binary frequency dictionary comprises each binary group and the binary frequency (i.e., the use frequency) corresponding to each binary group.
Based on the above manner, the unary frequency dictionary and the binary frequency dictionary corresponding to each preset language can be obtained.
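Step S510 can be sketched as a simple counting pass over one preset language's corpus. This is an illustrative sketch assuming whitespace tokenization; the patent does not specify a tokenizer.

```python
from collections import Counter

def build_frequency_dicts(corpus_sentences):
    # Count each vocabulary (unary frequency) and each adjacent-word pair,
    # i.e. binary group (binary frequency), over a tokenized corpus.
    unary_freq, binary_freq = Counter(), Counter()
    for sentence in corpus_sentences:
        words = sentence.split()
        unary_freq.update(words)
        binary_freq.update(zip(words, words[1:]))
    return unary_freq, binary_freq
```

Running this once per preset language yields the per-language unary and binary frequency dictionaries used in the following steps.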
S520, respectively constructing a unitary probability dictionary and a binary probability dictionary corresponding to the preset language based on the unitary frequency dictionary and the binary frequency dictionary.
Specifically, the unary probability of each vocabulary in the unary frequency dictionary can be calculated according to the unary frequency dictionary to obtain a unary probability dictionary; and the binary probability of each binary group in the binary frequency dictionary can be calculated according to the binary frequency dictionary to obtain a binary probability dictionary.
In a specific embodiment, constructing the unary probability dictionary and the binary probability dictionary corresponding to the preset language respectively based on the unary frequency dictionary and the binary frequency dictionary may include the following steps:
Step 5201: for each vocabulary in the unary frequency dictionary, determining the unary probability of the vocabulary under the preset language based on the use frequency of the vocabulary, a preset unary smooth value, the length of the unary frequency dictionary, and the intersection dictionary length corresponding to each preset language;
Step 5202: for each binary group in the binary frequency dictionary, determining the binary probability of the binary group under the preset language based on the use frequency of the binary group, a preset binary smooth value, and the length of the binary frequency dictionary;
Step 5203: constructing a unary probability dictionary corresponding to the preset language according to the unary probability of each vocabulary under the preset language, and constructing a binary probability dictionary corresponding to the preset language according to the binary probability of each binary group under the preset language.
The intersection dictionary length corresponding to each preset language may be the length of a dictionary formed by the vocabularies repeated among the corpora of all the preset languages. For example, in step 5201, the unary probability of the vocabulary under the preset language can be calculated by the following formula:

P1(w) = (C(w) + α) / (N1 + α · (L1 + L∩))

wherein P1(w) represents the unary probability of the vocabulary w under the preset language, C(w) represents the use frequency of the vocabulary w, α is the preset unary smooth value, N1 is the total use frequency of all vocabularies in the unary frequency dictionary, L1 represents the length of the unary frequency dictionary, and L∩ represents the intersection dictionary length corresponding to each preset language.
For example, in step 5202, the binary probability of the binary group under the preset language can be calculated by the following formula:

P2(b) = (C(b) + β) / (N2 + β · L2)

wherein P2(b) represents the binary probability of the binary group b under the preset language, C(b) represents the use frequency of the binary group b, β is the preset binary smooth value, N2 is the total use frequency of all binary groups in the binary frequency dictionary, and L2 is the length of the binary frequency dictionary.
Further, a unary probability dictionary of the preset language can be constructed according to unary probabilities of the vocabularies under the preset language, and a binary probability dictionary of the preset language can be constructed according to binary probabilities of the binary groups under the preset language. The unary probability dictionary comprises vocabularies and unary probabilities corresponding to the vocabularies; the binary probability dictionary comprises each binary group and binary probability corresponding to each binary group.
Through steps 5201 to 5203, the probability dictionaries of each preset language are constructed; determining the probabilities in combination with the smooth values and the lengths of the frequency dictionaries improves the accuracy of the probability dictionaries.
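Steps 5201 to 5203 can be sketched as follows. The formulas in the published text are rendered as images, so this sketch assumes a standard additive (Lidstone) smoothing form built from the quantities the steps enumerate — the use frequencies, the smooth values, the dictionary lengths, and the intersection dictionary length; it is an illustration, not the patented formula.

```python
def build_probability_dicts(unary_freq, binary_freq, alpha, beta, inter_len):
    # alpha / beta: preset unary / binary smooth values (assumed additive
    # smoothing); inter_len: intersection dictionary length shared by all
    # preset languages, which enlarges the unary smoothing denominator.
    n1, n2 = sum(unary_freq.values()), sum(binary_freq.values())
    l1, l2 = len(unary_freq), len(binary_freq)
    unary_prob = {w: (c + alpha) / (n1 + alpha * (l1 + inter_len))
                  for w, c in unary_freq.items()}
    binary_prob = {b: (c + beta) / (n2 + beta * l2)
                   for b, c in binary_freq.items()}
    return unary_prob, binary_prob
```

The smoothing reserves probability mass for vocabularies and binary groups unseen in the corpus, so the listed probabilities sum to less than one.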
It should be noted that the preset unary smooth value and the preset binary smooth value may also be obtained through automatic optimization on a text test set. Specifically, a unary probability dictionary and a binary probability dictionary can be constructed according to an initial unary smooth value and an initial binary smooth value, the sample texts in the text test set are predicted to obtain prediction labels, and the initial unary smooth value and the initial binary smooth value are adjusted in reverse according to the calculation result of a preset objective function.
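The automatic optimization of the smooth values can be sketched as a grid search. The `evaluate` callable is a hypothetical stand-in for the full loop of rebuilding the probability dictionaries, predicting the text test set, and computing the preset objective function; the patent does not fix a particular search strategy.

```python
def tune_smooth_values(candidate_alphas, candidate_betas, evaluate):
    # evaluate(alpha, beta) -> objective score (higher is better); assumed
    # to rebuild the dictionaries and score predictions on the test set.
    best = max(((a, b) for a in candidate_alphas for b in candidate_betas),
               key=lambda ab: evaluate(*ab))
    return best
```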
S530, obtaining a text to be recognized, determining each vocabulary to be recognized in the text to be recognized, and determining, for the current vocabulary to be recognized, at least one binary group to be recognized corresponding to the current vocabulary to be recognized.
And S540, aiming at each binary group to be recognized, determining the binary probability of the binary group to be recognized under each preset language based on the binary probability dictionary respectively corresponding to each preset language.
And S550, determining a target language corresponding to the current vocabulary to be recognized in each preset language based on the binary probability of each binary group to be recognized in each preset language.
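Steps S530 to S550, together with the unary fallback described in the earlier embodiments, can be sketched per word as follows. The ratio threshold, the unseen-probability floor, and the max-over-languages decision rule are simplifying assumptions for illustration.

```python
def recognize_word(idx, words, binary_prob_dicts, unary_prob_dicts,
                   ratio_threshold=0.6, unseen=1e-9):
    word = words[idx]
    # Binary groups pairing the word with its left and right neighbours.
    pairs = []
    if idx > 0:
        pairs.append((words[idx - 1], word))
    if idx + 1 < len(words):
        pairs.append((word, words[idx + 1]))
    # Binary reference probability per language: the best pair score.
    ref = {lang: max((d.get(p, unseen) for p in pairs), default=unseen)
           for lang, d in binary_prob_dicts.items()}
    best_lang, best_p = max(ref.items(), key=lambda kv: kv[1])
    if best_p / sum(ref.values()) >= ratio_threshold:
        return best_lang
    # Fallback: unary probabilities when the bigram evidence is indecisive.
    uni = {lang: d.get(word, unseen) for lang, d in unary_prob_dicts.items()}
    return max(uni.items(), key=lambda kv: kv[1])[0]
```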
According to the technical scheme of this embodiment, the unary frequency dictionary and the binary frequency dictionary corresponding to each preset language are constructed from the corpus corresponding to that preset language, and the unary probability dictionary and the binary probability dictionary corresponding to each preset language are then constructed according to the unary frequency dictionary and the binary frequency dictionary, so that dictionary-based language recognition is achieved without requiring manual labeling of samples or model training.
Example six
Fig. 6 is a schematic structural diagram of a text language recognition apparatus according to a sixth embodiment of the present invention. As shown in fig. 6, the apparatus includes a bigram determination module 610, a bigram probability determination module 620, and a language identification module 630.
The binary group determining module 610 is configured to obtain a text to be recognized, determine each vocabulary to be recognized in the text to be recognized, and determine, for a current vocabulary to be recognized, at least one binary group to be recognized corresponding to the current vocabulary to be recognized, where the binary group to be recognized includes the current vocabulary to be recognized and adjacent vocabularies of the current vocabulary to be recognized;
a binary probability determining module 620, configured to determine, for each to-be-identified binary group, a binary probability of the to-be-identified binary group in each preset language based on a binary probability dictionary corresponding to each preset language;
the language identification module 630 is configured to determine, in each of the preset languages, a target language corresponding to the current vocabulary to be identified based on a binary probability of each of the binary groups to be identified in each of the preset languages.
According to the technical scheme of this embodiment, each vocabulary to be recognized in the text to be recognized is determined, and at least one binary group to be recognized comprising the current vocabulary to be recognized and its adjacent vocabularies is determined for the current vocabulary to be recognized, so that the binary probability of each binary group to be recognized under each preset language is determined according to the binary probability dictionary corresponding to each preset language, and the target language corresponding to the current vocabulary to be recognized is determined according to each binary probability. Because the probability of each binary group under each language is determined through pre-constructed dictionaries, language recognition based on vocabulary context is realized and the accuracy of language recognition is improved.
On the basis of the foregoing embodiment, optionally, the language identification module 630 is further configured to determine, for each preset language, a maximum value of binary probabilities of the to-be-identified binary groups in the preset language as a binary reference probability corresponding to the preset language; and determining a target language corresponding to the current vocabulary to be recognized based on the binary reference probability corresponding to each preset language and the preset binary probability threshold corresponding to each preset language.
On the basis of the foregoing embodiment, optionally, the language identification module 630 is further configured to determine a binary probability ratio corresponding to the current vocabulary to be identified based on the binary reference probability corresponding to each preset language; and determining the target language corresponding to the current vocabulary to be recognized based on the binary probability ratio and the preset binary probability threshold value respectively corresponding to each preset language.
On the basis of the foregoing embodiment, optionally, the language identification module 630 is further configured to determine, based on the unary probability dictionaries respectively corresponding to the preset languages, the unary probability of the current vocabulary to be recognized in each preset language if the binary probability ratio does not satisfy the preset binary probability threshold respectively corresponding to each preset language; and determining a target language corresponding to the current vocabulary to be recognized based on the unary probability of the current vocabulary to be recognized under each preset language.
On the basis of the foregoing embodiment, optionally, the language identification module 630 is further configured to determine a unary probability ratio corresponding to the current vocabulary to be recognized based on the unary probabilities of the current vocabulary to be recognized in each preset language; and determine the target language corresponding to the current vocabulary to be recognized based on the unary probability ratio and the preset unary probability threshold corresponding to each preset language.
On the basis of the foregoing embodiment, optionally, the apparatus further includes a parameter determining module, where the parameter determining module is configured to obtain a text test set, where the text test set includes sample texts and sample labels corresponding to the sample texts, and the sample labels include sample languages corresponding to words in the sample texts; acquire initial binary probability thresholds and initial unary probability thresholds corresponding to the preset languages respectively; for each sample text, determine a prediction label corresponding to the sample text based on each initial binary probability threshold and each initial unary probability threshold, where the prediction label includes a prediction language corresponding to each word in the sample text; and adjust each initial binary probability threshold and/or each initial unary probability threshold based on each prediction label, each sample label, and a preset objective function to obtain each preset binary probability threshold and each preset unary probability threshold.
Based on the foregoing embodiment, optionally, the parameter determining module is further configured to determine, based on each of the prediction labels and each of the sample labels, current prediction parameters corresponding to the text test set, where the current prediction parameters include a sentence-level accuracy rate, a sentence-level recall rate, a word-level accuracy rate, and a word-level recall rate; and calculate the preset objective function according to the current prediction parameters, and adjust each initial binary probability threshold and/or each initial unary probability threshold with maximizing the calculation result of the preset objective function as the optimization target.
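The word-level prediction parameters feeding the objective function can be sketched as below. This assumes a one-versus-rest precision/recall per language over per-word label sequences; the patent names the metrics but not their exact computation.

```python
def word_level_metrics(pred_labels, sample_labels, lang):
    # pred_labels / sample_labels: one list of per-word language tags per
    # sample text. Returns (precision, recall) for the given language.
    tp = fp = fn = 0
    for pred, gold in zip(pred_labels, sample_labels):
        for p, g in zip(pred, gold):
            if p == lang and g == lang:
                tp += 1
            elif p == lang:
                fp += 1
            elif g == lang:
                fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```

Sentence-level metrics would follow the same pattern with whole label sequences compared instead of individual words.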
On the basis of the foregoing embodiment, optionally, the apparatus further includes a vocabulary determining module, where the vocabulary determining module is configured to, after determining the target language corresponding to the current vocabulary to be recognized, obtain a preset language vocabulary corresponding to at least one reference language in each preset language; and determining whether each preset language word list contains the current vocabulary to be recognized, if so, updating the target language corresponding to the current vocabulary to be recognized based on the reference language corresponding to the preset language word list containing the current vocabulary to be recognized.
On the basis of the above embodiment, optionally, the apparatus further includes a dictionary building module, where the dictionary building module is configured to obtain a corpus corresponding to each preset language; for each preset language, determine the use frequency of each vocabulary and the use frequency of each binary group based on the corpus, construct a unary frequency dictionary corresponding to the preset language based on the use frequency of each vocabulary, and construct a binary frequency dictionary corresponding to the preset language based on the use frequency of each binary group; and respectively construct a unary probability dictionary and a binary probability dictionary corresponding to the preset language based on the unary frequency dictionary and the binary frequency dictionary.
On the basis of the foregoing embodiment, optionally, the dictionary building module is further configured to determine, for each vocabulary in the unary frequency dictionary, an unary probability of the vocabulary in the preset language based on the usage frequency of the vocabulary, a preset unary smooth value, the length of the unary frequency dictionary, and the intersection dictionary length corresponding to each preset language; for each binary group in the binary frequency dictionary, determining a binary probability of the binary group under the preset language based on the use frequency of the binary group, a preset binary smooth value and the length of the binary frequency dictionary; and constructing a unary probability dictionary corresponding to the preset language according to the unary probability of each vocabulary in the preset language, and constructing a binary probability dictionary corresponding to the preset language according to the binary probability of each binary group in the preset language.
The text language identification device provided by the embodiment of the invention can execute the text language identification method provided by any embodiment of the invention, and has the corresponding functional modules and beneficial effects of the execution method.
EXAMPLE seven
Fig. 7 is a schematic structural diagram of an electronic device according to a seventh embodiment of the present invention. The electronic device 10 is intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smart phones, wearable devices (e.g., helmets, glasses, watches, etc.), and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the inventions described and/or claimed herein.
As shown in fig. 7, the electronic device 10 includes at least one processor 11, and a memory communicatively connected to the at least one processor 11, such as a Read Only Memory (ROM) 12, a Random Access Memory (RAM) 13, and the like, wherein the memory stores a computer program executable by the at least one processor, and the processor 11 can perform various suitable actions and processes according to the computer program stored in the Read Only Memory (ROM) 12 or the computer program loaded from a storage unit 18 into the Random Access Memory (RAM) 13. In the RAM 13, various programs and data necessary for the operation of the electronic apparatus 10 may also be stored. The processor 11, the ROM 12, and the RAM 13 are connected to each other via a bus 14. An input/output (I/O) interface 15 is also connected to the bus 14.
A number of components in the electronic device 10 are connected to the I/O interface 15, including: an input unit 16 such as a keyboard, a mouse, or the like; an output unit 17 such as various types of displays, speakers, and the like; a storage unit 18 such as a magnetic disk, an optical disk, or the like; and a communication unit 19 such as a network card, modem, wireless communication transceiver, etc. The communication unit 19 allows the electronic device 10 to exchange information/data with other devices via a computer network such as the internet and/or various telecommunication networks.
Processor 11 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of processor 11 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various dedicated Artificial Intelligence (AI) computing chips, various processors running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, and so forth. The processor 11 performs the various methods and processes described above, such as a text language recognition method.
In some embodiments, the text language identification method may be implemented as a computer program tangibly embodied in a computer-readable storage medium, such as storage unit 18. In some embodiments, part or all of the computer program may be loaded and/or installed onto the electronic device 10 via the ROM 12 and/or the communication unit 19. When the computer program is loaded into RAM 13 and executed by processor 11, one or more steps of the text language recognition method described above may be performed. Alternatively, in other embodiments, the processor 11 may be configured to perform the text language recognition method by any other suitable means (e.g., by means of firmware).
Various implementations of the systems and techniques described here may be implemented in digital electronic circuitry, integrated circuitry, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on Chips (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
A computer program for implementing the text language identification method of the present invention can be written in any combination of one or more programming languages. These computer programs may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the computer programs, when executed by the processor, cause the functions/acts specified in the flowchart and/or block diagram block or blocks to be performed. A computer program can execute entirely on a machine, partly on a machine, as a stand-alone software package partly on a machine and partly on a remote machine or entirely on a remote machine or server.
Example eight
An eighth embodiment of the present invention provides a computer-readable storage medium, where computer instructions are stored, and the computer instructions are configured to enable a processor to execute a text language identification method, where the method includes:
the method comprises the steps of obtaining a text to be recognized, determining each vocabulary to be recognized in the text to be recognized, and determining, for the current vocabulary to be recognized, at least one binary group to be recognized corresponding to the current vocabulary to be recognized, wherein the binary group to be recognized comprises the current vocabulary to be recognized and adjacent vocabularies of the current vocabulary to be recognized;
aiming at each binary group to be recognized, determining the binary probability of the binary group to be recognized under each preset language based on the binary probability dictionary respectively corresponding to each preset language;
and determining a target language corresponding to the current vocabulary to be recognized in each preset language based on the binary probability of each binary group to be recognized in each preset language.
In the context of the present invention, a computer-readable storage medium may be a tangible medium that can contain, or store a computer program for use by or in connection with an instruction execution system, apparatus, or device. A computer readable storage medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. Alternatively, the computer readable storage medium may be a machine readable signal medium. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on an electronic device having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the electronic device. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user can be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), wide Area Networks (WANs), blockchain networks, and the internet.
The computing system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server can be a cloud server, also called a cloud computing server or a cloud host, which is a host product in a cloud computing service system, thereby overcoming the defects of high management difficulty and weak service expansibility in traditional physical host and Virtual Private Server (VPS) services.
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present invention may be executed in parallel, sequentially, or in different orders, and are not limited herein as long as the desired results of the technical solution of the present invention can be achieved.
The above-described embodiments should not be construed as limiting the scope of the invention. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (13)

1. A method for text language identification, comprising:
the method comprises the steps of obtaining a text to be recognized, determining each vocabulary to be recognized in the text to be recognized, and determining at least one binary group to be recognized corresponding to the current vocabulary to be recognized aiming at the current vocabulary to be recognized, wherein the binary group to be recognized comprises the current vocabulary to be recognized and adjacent vocabularies of the current vocabulary to be recognized;
aiming at each binary group to be recognized, determining the binary probability of the binary group to be recognized under each preset language based on the binary probability dictionary respectively corresponding to each preset language;
and determining a target language corresponding to the current vocabulary to be recognized in each preset language based on the binary probability of each binary group to be recognized in each preset language.
2. The method as claimed in claim 1, wherein the determining a target language corresponding to the current vocabulary to be recognized based on the binary probability of each of the binary tuples to be recognized in each of the preset languages comprises:
for each preset language, determining the maximum value of the binary probabilities of the to-be-identified binary groups under the preset language as the binary reference probability corresponding to the preset language;
and determining a target language corresponding to the current vocabulary to be recognized based on the binary reference probability corresponding to each preset language and the preset binary probability threshold corresponding to each preset language.
3. The method according to claim 2, wherein the determining the target language corresponding to the vocabulary to be recognized based on the binary reference probability corresponding to each of the predetermined languages and the predetermined binary probability threshold corresponding to each of the predetermined languages respectively comprises:
determining a binary probability ratio corresponding to the current vocabulary to be recognized based on the binary reference probability corresponding to each preset language;
and determining the target language corresponding to the current vocabulary to be recognized based on the binary probability ratio and the preset binary probability threshold value respectively corresponding to each preset language.
4. The method according to claim 3, wherein the determining a target language corresponding to the current vocabulary to be recognized based on the binary probability ratio and a preset binary probability threshold corresponding to each preset language respectively comprises:
if the binary probability proportion does not meet the preset binary probability threshold value corresponding to each preset language, determining the unary probability of the current vocabulary to be recognized under each preset language based on the unary probability dictionary corresponding to each preset language;
and determining a target language corresponding to the current vocabulary to be recognized based on the unary probability of the current vocabulary to be recognized under each preset language.
5. The method of claim 4, wherein the determining the target language corresponding to the current vocabulary to be recognized based on the unary probability of the current vocabulary to be recognized under each of the preset languages comprises:
determining the unary probability ratio corresponding to the current vocabulary to be recognized based on the unary probability of the current vocabulary to be recognized under each preset language;
and determining the target language corresponding to the current vocabulary to be recognized based on the unary probability ratio and the preset unary probability threshold value corresponding to each preset language.
6. The method of claim 5, further comprising:
obtaining a text test set, wherein the text test set comprises sample texts and sample labels corresponding to the sample texts, and the sample labels comprise sample languages corresponding to words in the sample texts;
acquiring initial binary probability threshold values and initial unary probability threshold values corresponding to the preset languages respectively;
for each sample text, determining a prediction label corresponding to the sample text based on each initial binary probability threshold and each initial unitary probability threshold, wherein the prediction label comprises a prediction language corresponding to each word in the sample text;
and adjusting each initial binary probability threshold and/or each initial unary probability threshold based on each prediction label, each sample label and a preset objective function to obtain each preset binary probability threshold and each preset unary probability threshold.
7. The method of claim 6, wherein adjusting each of the initial binary probability thresholds and/or each of the initial unary probability thresholds based on each of the prediction labels, each of the sample labels, and a preset objective function comprises:
determining current prediction parameters corresponding to the text test set based on the prediction labels and the sample labels, wherein the current prediction parameters comprise sentence-level accuracy, sentence-level recall, word-level accuracy and word-level recall;
and calculating the preset objective function according to the current prediction parameters, and adjusting each initial binary probability threshold and/or each initial unary probability threshold with maximization of the calculation result of the preset objective function as the optimization target.
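Claims 6-7 do not fix the form of the objective function, only that it combines sentence- and word-level prediction parameters and is maximized. One plausible sketch (all function names hypothetical) averages the F1 scores at the two granularities and tunes a threshold by exhaustive search over candidate values:

```python
def f1(precision, recall):
    # Harmonic mean of precision and recall; zero when both are zero.
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def objective(sent_precision, sent_recall, word_precision, word_recall):
    """Assumed objective: the mean of sentence-level and word-level F1,
    so tuning balances both granularities."""
    return 0.5 * (f1(sent_precision, sent_recall) + f1(word_precision, word_recall))

def tune_threshold(candidates, evaluate):
    """Pick the candidate threshold maximizing the objective.

    evaluate(t) must return (sent_p, sent_r, word_p, word_r) measured on the
    text test set with threshold t applied.
    """
    return max(candidates, key=lambda t: objective(*evaluate(t)))
```

With several thresholds (one binary and one unary per preset language), the same idea extends to a coordinate-wise or grid search over the threshold vector.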
8. The method of claim 1, wherein after determining the target language corresponding to the current vocabulary to be recognized, the method further comprises:
acquiring a preset language word list corresponding to each of at least one reference language in the preset languages;
and determining whether each preset language word list contains the current vocabulary to be recognized, and if so, updating the target language corresponding to the current vocabulary to be recognized based on the reference language corresponding to the preset language word list that contains the current vocabulary to be recognized.
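The word-list override of claim 8 amounts to a membership check that takes precedence over the probabilistic prediction. A minimal sketch, with hypothetical names:

```python
def override_with_wordlists(word, predicted_language, reference_wordlists):
    """If any reference language's preset word list contains the word,
    that reference language overrides the model's prediction.

    reference_wordlists: {reference_language: set of words}
    """
    for language, wordlist in reference_wordlists.items():
        if word in wordlist:
            return language
    # No word list matched; keep the probabilistic prediction.
    return predicted_language
```

Using sets for the word lists keeps each lookup O(1), which matters when every recognized word is checked.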
9. The method according to any one of claims 1-8, further comprising:
obtaining a corpus corresponding to each preset language respectively;
for each preset language, determining the use frequency of each vocabulary and the use frequency of each binary group based on the corpus, constructing a unary frequency dictionary corresponding to the preset language based on the use frequency of each vocabulary, and constructing a binary frequency dictionary corresponding to the preset language based on the use frequency of each binary group;
and respectively constructing a unary probability dictionary and a binary probability dictionary corresponding to the preset language based on the unary frequency dictionary and the binary frequency dictionary.
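The frequency-dictionary construction of claim 9 is plain unigram/bigram counting over a tokenized corpus. A sketch (names hypothetical; the corpus is assumed to be a list of tokenized sentences):

```python
from collections import Counter

def build_frequency_dicts(corpus):
    """Build the unary (unigram) and binary (bigram) frequency dictionaries
    for one preset language from its corpus of tokenized sentences."""
    unigram_freq, bigram_freq = Counter(), Counter()
    for sentence in corpus:
        # Count each vocabulary item.
        unigram_freq.update(sentence)
        # Count each adjacent word pair (binary group) within the sentence.
        bigram_freq.update(zip(sentence, sentence[1:]))
    return unigram_freq, bigram_freq

corpus = [["the", "cat", "sat"], ["the", "dog", "sat"]]
uni, bi = build_frequency_dicts(corpus)
```

Counting within sentences (not across sentence boundaries) keeps bigrams linguistically meaningful.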
10. The method according to claim 9, wherein constructing the unary probability dictionary and the binary probability dictionary corresponding to the preset language based on the unary frequency dictionary and the binary frequency dictionary respectively comprises:
for each vocabulary in the unary frequency dictionary, determining an unary probability of the vocabulary in the preset language based on the use frequency of the vocabulary, a preset unary smoothing value, the length of the unary frequency dictionary and the intersection dictionary length corresponding to each preset language;
for each binary group in the binary frequency dictionary, determining a binary probability of the binary group under the preset language based on the use frequency of the binary group, a preset binary smoothing value and the length of the binary frequency dictionary;
and constructing a unary probability dictionary corresponding to the preset language according to the unary probability of each vocabulary under the preset language, and constructing a binary probability dictionary corresponding to the preset language according to the binary probability of each binary group under the preset language.
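Claim 10 names the ingredients (use frequency, smoothing value, dictionary length) but not the exact formula. A standard reading is additive (Lidstone) smoothing; the sketch below uses it and, for simplicity, omits the intersection-dictionary term mentioned for the unary case. All names and the default smoothing value are hypothetical.

```python
def smoothed_probability(count, total, k, vocab_size):
    # Additive (Lidstone) smoothing: an unseen item would still
    # receive mass k / (total + k * vocab_size).
    return (count + k) / (total + k * vocab_size)

def build_probability_dicts(unigram_freq, bigram_freq, k=0.5):
    """Turn frequency dictionaries into smoothed probability dictionaries."""
    uni_total = sum(unigram_freq.values())
    bi_total = sum(bigram_freq.values())
    uni_probs = {w: smoothed_probability(c, uni_total, k, len(unigram_freq))
                 for w, c in unigram_freq.items()}
    bi_probs = {b: smoothed_probability(c, bi_total, k, len(bigram_freq))
                for b, c in bigram_freq.items()}
    return uni_probs, bi_probs

uni_probs, bi_probs = build_probability_dicts({"a": 3, "b": 1},
                                              {("a", "b"): 2, ("b", "a"): 2})
```

The smoothing value k prevents zero probabilities from vetoing a language outright when a word or bigram is simply absent from its corpus.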
11. A text language recognition apparatus, comprising:
a binary group determining module, used for acquiring a text to be recognized, determining each vocabulary to be recognized in the text to be recognized, and, for a current vocabulary to be recognized, determining at least one binary group to be recognized corresponding to the current vocabulary to be recognized, wherein the binary group to be recognized comprises the current vocabulary to be recognized and an adjacent vocabulary of the current vocabulary to be recognized;
a binary probability determining module, used for determining, for each binary group to be recognized, the binary probability of the binary group to be recognized under each preset language based on the binary probability dictionaries respectively corresponding to the preset languages;
and a language identification module, used for determining, from among the preset languages, a target language corresponding to the current vocabulary to be recognized based on the binary probability of each binary group to be recognized under each preset language.
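Putting the apparatus of claim 11 together: for a word at some position, collect the bigrams it participates in, score them under each language's binary probability dictionary, and pick the best-scoring language. A sketch under the same hypothetical-naming caveat as above:

```python
def identify_word_language(tokens, index, bigram_probs):
    """Pick the preset language whose binary probability dictionary best
    explains the bigrams containing tokens[index].

    bigram_probs: {language: {(w1, w2): probability}}
    """
    word = tokens[index]
    bigrams = []
    if index > 0:
        bigrams.append((tokens[index - 1], word))   # (previous word, current word)
    if index < len(tokens) - 1:
        bigrams.append((word, tokens[index + 1]))   # (current word, next word)

    def score(lang):
        probs = bigram_probs[lang]
        # Multiply bigram probabilities, flooring unseen bigrams.
        total = 1.0
        for bg in bigrams:
            total *= probs.get(bg, 1e-12)
        return total

    return max(bigram_probs, key=score)

# Toy dictionaries for two preset languages.
bigram_probs = {
    "en": {("good", "morning"): 0.01},
    "fr": {("bon", "jour"): 0.01},
}
```

In the full method, words whose bigram evidence is inconclusive would fall back to the unary-probability path of claims 4-5.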
12. An electronic device, characterized in that the electronic device comprises:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores a computer program executable by the at least one processor, the computer program enabling the at least one processor to perform the text language identification method of any one of claims 1-10.
13. A computer-readable storage medium storing computer instructions which, when executed, cause a processor to perform the text language identification method of any one of claims 1-10.
CN202211306400.6A 2022-10-25 2022-10-25 Text language identification method, device, equipment and medium Active CN115374779B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211306400.6A CN115374779B (en) 2022-10-25 2022-10-25 Text language identification method, device, equipment and medium

Publications (2)

Publication Number Publication Date
CN115374779A CN115374779A (en) 2022-11-22
CN115374779B (en) 2023-01-10

Family

ID=84073755

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211306400.6A Active CN115374779B (en) 2022-10-25 2022-10-25 Text language identification method, device, equipment and medium

Country Status (1)

Country Link
CN (1) CN115374779B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2015079591A1 (en) * 2013-11-27 2015-06-04 Nec Corporation Crosslingual text classification method using expected frequencies
CN108628822A (en) * 2017-03-24 2018-10-09 阿里巴巴集团控股有限公司 Recognition methods without semantic text and device
CN111865752A (en) * 2019-04-23 2020-10-30 北京嘀嘀无限科技发展有限公司 Text processing device, method, electronic device and computer readable storage medium
CN112131350A (en) * 2020-09-30 2020-12-25 腾讯科技(深圳)有限公司 Text label determination method, text label determination device, terminal and readable storage medium
CN112711943A (en) * 2020-12-17 2021-04-27 厦门市美亚柏科信息股份有限公司 Uygur language identification method, device and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Word-Level Language Identification and Predicting Codeswitching Points in Swahili-English Language Data; Mario Piergallini et al.; Association for Computational Linguistics; 2016-12-31; full text *

Also Published As

Publication number Publication date
CN115374779A (en) 2022-11-22

Similar Documents

Publication Publication Date Title
JP5901001B1 (en) Method and device for acoustic language model training
US9396723B2 (en) Method and device for acoustic language model training
CN112528655B (en) Keyword generation method, device, equipment and storage medium
CN113553412B (en) Question-answering processing method, question-answering processing device, electronic equipment and storage medium
CN114416943B (en) Training method and device for dialogue model, electronic equipment and storage medium
CN112509566A (en) Voice recognition method, device, equipment, storage medium and program product
CN113053367A (en) Speech recognition method, model training method and device for speech recognition
CN114399772B (en) Sample generation, model training and track recognition methods, devices, equipment and media
CN114611625A (en) Language model training method, language model training device, language model data processing method, language model data processing device, language model data processing equipment, language model data processing medium and language model data processing product
CN112632987B (en) Word slot recognition method and device and electronic equipment
CN117216275A (en) Text processing method, device, equipment and storage medium
CN115658903B (en) Text classification method, model training method, related device and electronic equipment
CN115374779B (en) Text language identification method, device, equipment and medium
CN116662484A (en) Text regularization method, device, equipment and storage medium
CN114758649B (en) Voice recognition method, device, equipment and medium
CN115577705A (en) Method, device and equipment for generating text processing model and storage medium
CN114417862A (en) Text matching method, and training method and device of text matching model
CN114416990A (en) Object relationship network construction method and device and electronic equipment
CN114647727A (en) Model training method, device and equipment applied to entity information recognition
CN114201953A (en) Keyword extraction and model training method, device, equipment and storage medium
CN113362809A (en) Voice recognition method and device and electronic equipment
CN112560437A (en) Text smoothness determination method and device and target model training method and device
CN113807099B (en) Entity information identification method, device, electronic equipment and storage medium
CN114926847B (en) Image processing method, device, equipment and storage medium for minority languages
US20230386237A1 (en) Classification method and apparatus, electronic device and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant