CN110633456B - Language identification method, language identification device, server and storage medium - Google Patents


Publication number
CN110633456B
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910888663.4A
Other languages
Chinese (zh)
Other versions
CN110633456A (en)
Inventor
李应弟
张雨辰
贾鹏飞
阳安娜
张忠恺
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN201910888663.4A
Publication of CN110633456A
Application granted
Publication of CN110633456B

Abstract

The application provides a language identification method, a language identification device, a server and a storage medium, and belongs to the technical field of big data. The method comprises the following steps: converting the coding format of at least one text to be recognized into Unicode; recognizing the at least one text to be recognized according to a preset grammar rule, and determining the language to which the at least one text to be recognized belongs, wherein the grammar rule comprises at least one of language special characters, positions of target common characters in a vocabulary and unique affixes; and when an unrecognized text to be recognized remains, determining the language to which the unrecognized text belongs according to the high-frequency vocabulary set corresponding to each language. By recognizing languages that share many common characters in multiple dimensions and at multiple levels, the method and the device improve the accuracy of the recognition result and the coverage of language recognition, thereby realizing effective language recognition.

Description

Language identification method, language identification device, server and storage medium
Technical Field
The present application relates to the field of big data technologies, and in particular, to a language identification method, apparatus, server, and storage medium.
Background
With the rapid development of internet technology, a large amount of data information is stored in various websites, and text data in the websites can be subjected to big data processing through a big data algorithm, so that data with high value is obtained. At present, a plurality of websites using different languages exist, and part of the languages are from the same language family, and common characters exist among the languages, so that the languages of the texts in the websites cannot be visually distinguished. Therefore, how to identify text data of different languages is a problem which needs to be solved urgently at present.
In the prior art, unique characters in each language are usually recognized, that is, when a unique character of a certain language appears in text data, the language to which the text data belongs can be determined.
The problem with this technical scheme is that languages with many common characters and few unique characters, such as Uyghur, Kazakh, and Arabic, cannot be effectively identified through their unique characters.
Disclosure of Invention
The embodiment of the application provides a language identification method, a language identification device, a server and a storage medium, which are used for solving the problem that languages with many common characters and few unique characters cannot be effectively identified through unique characters alone. The technical scheme is as follows:
In one aspect, a language identification method is provided, including:
converting the coding format of at least one text to be recognized into Unicode;
recognizing the at least one text to be recognized according to a preset grammar rule, and determining the language to which the at least one text to be recognized belongs, wherein the grammar rule comprises at least one of language special characters, positions of target common characters in vocabularies and unique affixes;
and when an unrecognized text to be recognized exists, determining the language to which the unrecognized text belongs according to the high-frequency vocabulary sets corresponding to the languages.
In another aspect, a language identification device is provided, which includes:
the conversion module is used for converting the coding format of at least one text to be recognized into Unicode;
the recognition module is used for recognizing the at least one text to be recognized according to a preset grammar rule and determining the language to which the at least one text to be recognized belongs, wherein the grammar rule comprises at least one of language unique characters, positions of target common characters in a vocabulary and unique affixes;
and the determining module is used for determining the language to which the unidentified text to be identified belongs according to the high-frequency vocabulary sets corresponding to the languages when the unidentified text to be identified exists.
In a possible implementation manner, the conversion module is further configured to convert, for any text to be recognized in the at least one text to be recognized, the text to be recognized from a first character code into Unicode, where the first character code is the original coding format of the text to be recognized; and to convert the font codes in the text to be recognized into corresponding second character codes according to the correspondence between the font codes and the second character codes, where each second character code is formed by at least two Unicode codes.
In a possible implementation manner, the recognition module is further configured to, for any text to be recognized, when the text to be recognized includes a special character of a first target language, determine, according to the Unicode code of the special character, that the language to which the text to be recognized belongs is the first target language.
In a possible implementation manner, the recognition module is further configured to perform word segmentation on any text to be recognized to obtain a plurality of words; and when the target position in any vocabulary has target common characters, determining a second target language to which the text to be recognized belongs.
In a possible implementation manner, the recognition module is further configured to perform word segmentation on any text to be recognized to obtain a plurality of words; and when the grammar affix of any vocabulary is the unique affix of a third target language, determining the language to which the text to be recognized belongs as the third target language.
In a possible implementation manner, the determining module is further configured to obtain a high-frequency vocabulary set corresponding to each language, where the high-frequency vocabulary set includes a target number of high-frequency vocabularies; for any unidentified text to be identified, performing word segmentation on the text to be identified to obtain a plurality of words; and when the plurality of words comprise words in a target high-frequency word set, determining the language to which the text to be recognized belongs as the language corresponding to the target high-frequency word set.
In a possible implementation manner, the method for creating the high-frequency vocabulary sets corresponding to the languages includes:
for any language, removing numbers, English text, blank spaces, and any other text that is not in the language from a first sample text comprising text in the language, to obtain a second sample text; performing word segmentation on the second sample text, and counting the word frequency of each word; removing the vocabulary common to all languages, and acquiring a target number of high-frequency vocabularies in descending order of word frequency; and taking the set formed by the target number of high-frequency vocabularies as the high-frequency vocabulary set of the language.
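The construction steps above can be sketched in Python as follows. This is a minimal illustration, not the patented implementation: it assumes the languages are written in Arabic script (so "text of the language" is approximated by letters and combining marks in the basic Arabic block), that word segmentation is whitespace splitting, and that "vocabulary common to all languages" means words present in every language's sample. The function names `clean_sample` and `build_high_frequency_sets` are hypothetical.

```python
import unicodedata
from collections import Counter
from typing import Dict, Set


def _is_script_char(ch: str) -> bool:
    """True for letters and combining marks in the basic Arabic block (assumed script)."""
    return "\u0600" <= ch <= "\u06FF" and unicodedata.category(ch)[0] in ("L", "M")


def clean_sample(first_sample: str) -> str:
    """Drop digits, Latin letters, punctuation, and extra spaces; keep script words."""
    kept = "".join(ch if _is_script_char(ch) else " " for ch in first_sample)
    return " ".join(kept.split())


def build_high_frequency_sets(samples: Dict[str, str],
                              target_count: int) -> Dict[str, Set[str]]:
    """Per language, keep the target_count most frequent words not shared by all languages."""
    # Assumption: whitespace splitting stands in for the word segmentation step.
    freqs = {lang: Counter(clean_sample(text).split()) for lang, text in samples.items()}
    # Assumption: "common vocabulary" means words seen in every language's sample.
    shared = set.intersection(*(set(counter) for counter in freqs.values()))
    result = {}
    for lang, counter in freqs.items():
        ranked = [word for word, _ in counter.most_common() if word not in shared]
        result[lang] = set(ranked[:target_count])
    return result
```

With only one language in `samples`, every word counts as shared and the resulting set is empty, so the sketch is meaningful only when at least two languages are supplied.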
In another aspect, a server is provided, where the server includes a processor and a memory, where the memory is used to store program codes, and the program codes are loaded and executed by the processor to implement the operations performed in the language identification method in the embodiment of the present application.
In another aspect, a storage medium is provided, where program codes are stored, and the program codes are used for executing the language identification method in the embodiment of the present application.
The technical scheme provided by the embodiment of the application has the following beneficial effects:
The coding formats of the texts to be recognized are unified into Unicode, so that the texts can be recognized in multiple dimensions according to grammar rules such as language special characters, the positions of target common characters in vocabularies, and unique affixes; texts that the rules fail to recognize are further recognized through the high-frequency vocabulary sets corresponding to the languages, so the coverage is high. By recognizing languages that share many common characters in multiple dimensions and at multiple levels, the accuracy of the recognition results and the coverage of language recognition are improved, thereby realizing effective language recognition.
Drawings
In order to illustrate the technical solutions in the embodiments of the present application more clearly, the drawings needed in the description of the embodiments are briefly introduced below. The drawings in the following description show only some embodiments of the present application, and those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a schematic diagram of an implementation environment provided by an embodiment of the present application;
FIG. 2 is a flowchart of a language identification method according to an embodiment of the present application;
FIG. 3 is a comparison table of partial character codes of the Uyghur, Kazakh, and Arabic languages according to an embodiment of the present application;
FIG. 4 is a schematic diagram of a Uyghur font code conversion table provided in an embodiment of the present application;
FIG. 5 is a schematic diagram of the special character "hamza" provided by an embodiment of the present application;
FIG. 6 is a diagram illustrating language determination according to the location of a common character according to an embodiment of the present application;
FIG. 7 is a diagram illustrating another example of determining languages according to the positions of common characters according to an embodiment of the present application;
FIG. 8 is a diagram illustrating language determination based on affix according to an embodiment of the present application;
FIG. 9 is a diagram illustrating another example of determining languages according to affixes according to an embodiment of the present application;
FIG. 10 is a system framework diagram provided by an embodiment of the present application;
FIG. 11 is a flowchart illustrating a language identification system process according to an embodiment of the present application;
FIG. 12 is a graph of chapter-level test results provided by an embodiment of the present application;
FIG. 13 is a sentence-level test result diagram provided by an embodiment of the present application;
FIG. 14 is a block diagram of a language identification device according to an embodiment of the present application;
FIG. 15 is a schematic structural diagram of a server according to an embodiment of the present application.
Detailed Description
To make the objects, technical solutions and advantages of the present application more clear, embodiments of the present application will be described in further detail below with reference to the accompanying drawings.
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with the present application. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present application, as detailed in the appended claims.
The method provided by the embodiment of the application can be applied to big data processing scenarios in the field of artificial intelligence, and also to scenarios in which translation software automatically distinguishes languages. Taking the data cleaning stage in the big data processing of web corpora as an example, the method can classify text data in which multiple languages are mixed according to the language to which each text belongs, yielding corpora sorted by language; such corpora have high purity and can be used for machine learning. The web corpora come from various websites, including Chinese websites and websites in other national languages.
Corpora in languages that differ greatly can be distinguished directly; the method provided by the embodiment of the application is mainly used for distinguishing similar languages, such as Uyghur, Kazakh, and Arabic. Uyghur, Kazakh, and Arabic websites are relatively numerous, contain large volumes of corpora, and are written in Arabic script. Because the three languages share a large number of characters and words, their vocabularies overlap heavily in written form, which makes the three languages difficult to distinguish.
The following describes the main flow of language identification in the embodiment of the present application:
First, at least one text to be recognized is obtained, and the coding format of the at least one text is unified into Unicode. Second, the at least one text to be recognized is recognized according to preset rules, which include but are not limited to the special characters of each language, the positions of target common characters in each vocabulary, and the unique affixes of each language. Finally, when unrecognized text remains, the language to which each unrecognized text belongs is determined according to the high-frequency vocabulary corresponding to each language.
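The three-stage flow just described can be sketched as below. This is an illustrative outline under stated assumptions: each preset rule is modeled as a hypothetical callable that returns a language name or `None`, word segmentation is whitespace splitting, and the function name `identify` and its parameters are not taken from the application.

```python
from typing import Callable, Dict, Iterable, Optional, Set

# A rule inspects Unicode text and returns a language name, or None if undecided.
Rule = Callable[[str], Optional[str]]


def identify(raw: bytes, source_encoding: str, rules: Iterable[Rule],
             high_freq_sets: Dict[str, Set[str]]) -> Optional[str]:
    # Stage 1: unify the coding format into Unicode (a Python str).
    text = raw.decode(source_encoding)

    # Stage 2: apply the preset grammar rules; the first rule that decides wins.
    for rule in rules:
        language = rule(text)
        if language is not None:
            return language

    # Stage 3: fall back to the high-frequency vocabulary sets.
    # Assumption: whitespace splitting stands in for word segmentation.
    words = set(text.split())
    for language, vocabulary in high_freq_sets.items():
        if words & vocabulary:
            return language
    return None  # still unrecognized
```

Keeping the rules as an ordered list makes it easy to try cheap character-level checks before rules that need word segmentation.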
The embodiment of the application mainly relates to a certain link of a big data processing technology in artificial intelligence, and the corpus obtained by the method provided by the embodiment of the application can be used as a training sample for machine learning/deep learning. The related technologies in the field of the embodiments of the present application will be briefly introduced as follows:
Artificial Intelligence (AI) is a theory, method, technique and application system that uses a digital computer or a machine controlled by a digital computer to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use the knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive technique of computer science that attempts to understand the essence of intelligence and produce a new kind of intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning and decision making.
Artificial intelligence technology is a comprehensive discipline that covers a wide range of fields, involving both hardware-level and software-level technologies. The artificial intelligence infrastructure generally includes technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, mechatronics, and the like. Artificial intelligence software technology mainly comprises computer vision technology, voice processing technology, natural language processing technology, machine learning/deep learning and the like.
Machine Learning (ML) is a multi-domain interdisciplinary subject that involves probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory and other disciplines. It studies how a computer can simulate or realize human learning behavior so as to acquire new knowledge or skills and reorganize existing knowledge structures to continuously improve its own performance. Machine learning is the core of artificial intelligence, is the fundamental approach to making computers intelligent, and is applied in various fields of artificial intelligence. Machine learning and deep learning generally include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and teaching learning.
Fig. 1 is a schematic diagram of an implementation environment provided by an embodiment of the present application, and referring to fig. 1, the implementation environment includes a plurality of terminals 110 and a server 120.
The terminal 110 may be connected to the server 120 through a wireless network or a wired network. The terminal 110 may be at least one of a smartphone, a camcorder, a desktop computer, a tablet computer, an MP4 player, and a laptop computer. An application program with a text processing function is installed and runs on the terminal 110. The application may be a social application, a text processing application, a news information application, or the like. Illustratively, the terminal 110 may be a terminal used by a user, and an account of the user is logged in to the application program run by the terminal 110.
The server 120 includes at least one of a single server, a plurality of servers, and a cloud computing platform. The server 120 is used for providing a background service of language identification. Optionally, the server 120 undertakes the primary language identification work and the terminal 110 undertakes the secondary language identification work; or the server 120 undertakes the secondary language identification work and the terminal 110 undertakes the primary language identification work; alternatively, the server 120 and the terminal 110 may each undertake the language identification work independently. When the terminal 110 undertakes the primary language identification work, the terminal 110 may download from the server 120, and store, the grammar rules of each language and the high-frequency vocabulary sets corresponding to each language that are needed in the identification process.
Optionally, the server 120 includes an access server, a language identification server and a database. The access server is used to provide access services for the terminal 110. The language identification server is used for identifying the language to which the text to be recognized belongs. There may be one or more language identification servers. When there are multiple language identification servers, at least two of them provide different services, and/or at least two of them provide the same service, for example in a load-balancing manner or in a main-server and mirror-server manner, which is not limited in the embodiment of the present application. The database is used for storing texts to be recognized, grammar rules, coding format conversion relations and high-frequency vocabulary sets. The information stored in the database is information that the user has authorized for use.
The terminal 110 may refer to one of a plurality of terminals, and the embodiment is only illustrated by the terminal 110. Those skilled in the art will appreciate that the number of terminals described above may be greater or fewer. For example, the number of the terminal 110 may be only one, several tens or several hundreds, or more, and other terminals may be included in the implementation environment. The number and type of the terminals are not limited in the embodiments of the present disclosure.
Fig. 2 is a flowchart of a language identification method according to an embodiment of the present application. As shown in fig. 2, the method comprises the following steps:
201. The server obtains at least one text to be recognized, and converts the coding format of the text to be recognized into Unicode.
In the embodiment of the application, after the server acquires at least one text to be recognized, the coding format of each text to be recognized can be converted from its original coding format to Unicode. The text to be recognized may be a chapter-level text, a paragraph-level text, or a sentence-level text, which is not specifically limited in the embodiment of the present application.
In an alternative implementation manner, for any text to be recognized in the at least one text to be recognized, the server may convert the text to be recognized from a first character code into Unicode, where the first character code is the code of the original coding format of the text to be recognized. For the font codes in the text to be recognized, the server may convert each font code into a corresponding second character code according to the correspondence between font codes and second character codes, where a second character code may be composed of at least two Unicode codes.
In an alternative implementation, the at least one text to be recognized includes at least one of Uyghur text, Kazakh text, and Arabic text. Since Uyghur, Kazakh, and Arabic texts are usually encoded with Arabic character encodings, the server may unify the coding format of the at least one text to be recognized from the Arabic character encoding format into Unicode. For example, partial character codes of the three languages can be seen in fig. 3, which is a comparison table of partial character codes of Uyghur, Kazakh, and Arabic provided in an embodiment of the present application.
In an optional implementation manner, owing to the characteristics of the three languages, their texts also contain font codes, which cannot be directly converted into Unicode. The server may therefore obtain the correspondence between the font codes of the three languages and second character codes, and convert the font codes of each language into the corresponding second character codes, where a second character code may be formed by two Unicode codes. For example, the correspondence between Uyghur font codes and second character codes can be seen in fig. 4, which is a schematic diagram of a Uyghur font code conversion table provided in the embodiment of the present application.
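A minimal Python sketch of this two-part conversion follows. The contents of the conversion table in fig. 4 are not reproduced in the text, so the table below is a hypothetical stand-in: it assumes that "font codes" behave like Unicode Arabic presentation-form ligatures, and it lists two such ligatures together with their two-code-point compatibility decompositions. The names `FONT_CODE_TO_SECOND_CODE` and `to_unicode` are illustrative only.

```python
# Hypothetical excerpt of a font-code table (assumption: font codes are
# treated like Arabic presentation-form ligatures). Each entry maps one
# font code to a "second character code" made of two Unicode code points.
FONT_CODE_TO_SECOND_CODE = {
    "\uFBEA": "\u0626\u0627",  # ligature YEH WITH HAMZA ABOVE + ALEF, isolated form
    "\uFBEB": "\u0626\u0627",  # the same ligature, final form
}

_TRANSLATION = str.maketrans(FONT_CODE_TO_SECOND_CODE)


def to_unicode(raw: bytes, first_encoding: str) -> str:
    """Decode from the original (first) character code, then expand font codes."""
    text = raw.decode(first_encoding)    # first character code -> Unicode
    return text.translate(_TRANSLATION)  # font code -> two Unicode code points


# Usage: bytes in the Windows-1256 Arabic code page become a Unicode string.
legacy_bytes = "\u0633\u0644\u0627\u0645".encode("cp1256")
print(to_unicode(legacy_bytes, "cp1256") == "\u0633\u0644\u0627\u0645")
```

Performing the font-code expansion after decoding means every later rule only has to compare plain Unicode code points.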
It should be noted that when the coding format of the text to be recognized is converted into Unicode, special characters with similar appearances may appear, and such look-alike special characters correspond to different Unicode codes in different languages. For special characters that look alike but have different Unicode codes, the server can extract a large number of sample texts containing the special characters and compare the contexts in which they appear, so as to determine the Unicode code corresponding to the special character in each language.
For example, take the Arabic-script character "hamza" (a transliteration; other renderings exist), which appears in Uyghur, Kazakh, and Arabic texts. Referring to fig. 5, fig. 5 is a schematic diagram of the special character "hamza" provided in the embodiment of the present application. The probability that the hamza occurs in Uyghur, Kazakh, and Arabic texts is close to 80%. The server extracts a large number of hamza occurrences from the various languages for comparison and finds that the hamza is generally represented by Unicode code 0626 in Uyghur, by Unicode code 0624 or 0621 in Kazakh, and by Unicode code 0621 in Arabic.
202. The server identifies at least one text to be identified according to a preset grammar rule, and determines the language to which the at least one text to be identified belongs, wherein the grammar rule comprises at least one of language special characters, positions of target common characters in vocabularies and unique affixes.
In the embodiment of the application, after the server converts the coding format of at least one text to be recognized into Unicode, the at least one text to be recognized can be recognized according to a preset grammar rule. The server can determine the language to which the text to be recognized belongs through the special characters of each language, through the positions of the common characters of each language in the vocabulary, or through the unique affixes of each language. In addition, the server may also determine the language to which the text to be recognized belongs through vowel rules, consonant rules, and the like, which is not specifically limited in the embodiment of the present application. The server may also apply two or more rules simultaneously to determine the language to which the text to be recognized belongs, which is likewise not specifically limited in the embodiment of the present application.
In an alternative implementation manner, the step of determining, by the server, the language to which the text to be recognized belongs according to the special characters of each language may be: for any text to be recognized in the at least one text to be recognized, when the server determines that the text to be recognized includes a special character of a first target language, the server may determine, according to the Unicode code of the special character, that the language to which the text to be recognized belongs is the first target language. The server can thus quickly determine the language of any text to be recognized that includes the special characters, which improves the efficiency of language recognition.
For example, taking the "hamza" mentioned in step 201, when the server determines that a text to be recognized includes the hamza and the Unicode code of the hamza is 0626, the server may determine that the language to which the text to be recognized belongs is Uyghur; when the Unicode code of the hamza is 0621, the server may determine that the language to which the text to be recognized belongs is Kazakh or Arabic, and may further determine the language by using other grammar rules.
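The special-character rule can be sketched as a lookup on the code point of the special character of fig. 5 (the hamza). The mapping below restates the code points given in step 201 (0626 for Uyghur, 0624 or 0621 for Kazakh, 0621 for Arabic); the function name and the choice to intersect candidates when several forms occur are assumptions of this sketch, and the source describes these usages as general tendencies rather than strict rules.

```python
from typing import Set

# Code point of the special character -> languages that generally use it,
# as stated in step 201.
HAMZA_CODE_TO_LANGUAGES = {
    "\u0626": ("Uyghur",),
    "\u0624": ("Kazakh",),
    "\u0621": ("Kazakh", "Arabic"),
}


def candidates_from_hamza(text: str) -> Set[str]:
    """Return the languages consistent with every hamza form found in the text.

    An empty set means the rule cannot decide (no hamza, or conflicting forms).
    """
    candidates = None
    for ch in text:
        languages = HAMZA_CODE_TO_LANGUAGES.get(ch)
        if languages is None:
            continue
        candidates = set(languages) if candidates is None else candidates & set(languages)
    return candidates or set()
```

A single-element result identifies the language directly; a two-element result such as Kazakh or Arabic signals that another grammar rule is needed to finish the decision.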
In an alternative implementation manner, the step of determining, by the server, the language to which the text to be recognized belongs according to the position of the common characters of each language in the vocabulary may be: for any text to be recognized in the at least one text to be recognized, the server can perform word segmentation on the text to be recognized to obtain a plurality of words. When the target common characters appear at the target position in any vocabulary, the server can determine a second target language to which the text to be recognized belongs. The server can quickly determine the language to which the text to be recognized belongs according to the positions of part of the shared characters in the vocabulary, so that the shared characters can also be used for language recognition, the dimensionality of the language recognition is expanded, and the language recognition efficiency is improved.
For example, referring to fig. 6, fig. 6 is a schematic diagram illustrating language determination according to the positions of common characters according to an embodiment of the present application. Fig. 6 takes the character with Unicode code 06C6 as an example: when the 06C6 character is at the beginning of a word, the server may determine that the word belongs to Kazakh; when the 06C6 character is at the beginning of a word in combination with the 0626 character, the server may determine that the word belongs to Uyghur.
In addition, as shown in fig. 7, fig. 7 is another schematic diagram for determining a language according to the position of a common character according to the embodiment of the present application. Fig. 7 takes the character with Unicode code 06C7 as an example: when the 06C7 character is at the beginning of a word, the server may determine that the word belongs to Kazakh; when the 06C7 character is at the beginning of a word in combination with the 0621 character, the server may determine that the word belongs to Uyghur.
It should be noted that the embodiment of the present application shows, by way of example only, two common characters that can be used for distinguishing languages. The other common characters that can distinguish the above three languages, and the corresponding common characters in other languages, are not listed one by one in the embodiment of the present application.
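The position rule for the two example characters can be sketched as follows. The sketch assumes that "in combination with" means the hamza form immediately precedes the shared character at the start of the word, and that words are separated by whitespace; both are interpretations, since fig. 6 and fig. 7 are not reproduced in the text. The code-point pairs are the ones named in the description, and the function names are hypothetical.

```python
from typing import Optional

# (shared character, hamza form that accompanies it at word start),
# using the code points named for fig. 6 and fig. 7.
POSITION_RULES = (
    ("\u06C6", "\u0626"),
    ("\u06C7", "\u0621"),
)


def language_from_word_start(word: str) -> Optional[str]:
    """Classify one word by what stands at its beginning, or return None."""
    for shared_char, hamza in POSITION_RULES:
        # Assumption: the hamza directly precedes the shared character.
        if word.startswith(hamza + shared_char):
            return "Uyghur"
        if word.startswith(shared_char):
            return "Kazakh"
    return None


def language_from_positions(text: str) -> Optional[str]:
    """Return the language of the first word that the position rule can decide."""
    for word in text.split():  # whitespace splitting stands in for segmentation
        language = language_from_word_start(word)
        if language is not None:
            return language
    return None
```

Further shared characters could be handled by appending pairs to `POSITION_RULES` without changing the functions.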
In an alternative implementation, the step in which the server determines the language to which the text to be recognized belongs according to the unique affixes of each language may be as follows: for any text to be recognized in the at least one text to be recognized, the server may perform word segmentation on the text to obtain a plurality of words. When the grammatical affix of any of the words is a unique affix of a third target language, the server may determine that the language to which the text belongs is the third target language. Because the server can quickly determine the language from the unique affixes of each language, grammatical affixes can also be used for language identification, which adds a dimension to the identification and improves its efficiency.
For example, taking Kazakh and Uyghur, the server may distinguish the two languages by their nominative case markers (a grammatical affix), or by their future-tense verb forms (also a grammatical affix). See Fig. 8 and Fig. 9. Fig. 8 is a schematic diagram of determining a language according to affixes, provided in an embodiment of the present application, and illustrates how to distinguish Kazakh from Uyghur by the nominative case. Fig. 9 is another schematic diagram of determining a language according to affixes, provided in an embodiment of the present application, and illustrates how to distinguish Kazakh from Uyghur by the future-tense verb form.
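A minimal sketch of the affix rule is given below. The actual unique affixes are shown only in Fig. 8 and Fig. 9, so the suffix strings here are hypothetical placeholders, as is the function name. The sketch also assumes that only suffixes are checked and that words are separated by whitespace.

```python
from typing import Dict, Optional, Tuple

# Hypothetical placeholders; the real unique affixes appear only in the figures.
UNIQUE_SUFFIXES: Dict[str, Tuple[str, ...]] = {
    "Kazakh": ("KAZ_SUFFIX",),
    "Uyghur": ("UIG_SUFFIX",),
}


def language_by_affix(text: str) -> Optional[str]:
    """Return the language whose unique suffix ends any word, else None."""
    for word in text.split():
        for language, suffixes in UNIQUE_SUFFIXES.items():
            # str.endswith accepts a tuple of candidate suffixes.
            if word.endswith(suffixes):
                return language
    return None
```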
In an optional implementation, the server may also determine the language to which the text to be recognized belongs according to the unique characters of each language, as follows: for any text to be recognized in the at least one text to be recognized, when the server determines that the text includes a unique character of a fourth target language, the server may determine that the language to which the text belongs is the fourth target language.
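The unique-character rule amounts to a table lookup on code points. The sketch below assumes such a table exists. The entries are hypothetical placeholders (Private Use Area code points and made-up language labels), because this passage does not list the actual unique characters.

```python
from typing import Dict, Optional

# Hypothetical table: Private Use Area code points stand in for the real
# unique characters, which are not listed in this passage.
UNIQUE_CHARS: Dict[str, str] = {
    "\uE000": "language_a",
    "\uE001": "language_b",
}


def language_by_unique_char(text: str) -> Optional[str]:
    """Return the language of the first unique character found, else None."""
    for ch in text:
        language = UNIQUE_CHARS.get(ch)
        if language is not None:
            return language
    return None
```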
It should be noted that, after the server has recognized the at least one text to be recognized according to the preset grammar rules, the identification is complete if no unrecognized text remains; if unrecognized text remains, the server may perform step 203 and determine the language of that text according to the high-frequency vocabulary set corresponding to each language.
203. When there is unrecognized text to be recognized, the server determines the language to which the unrecognized text belongs according to the high-frequency vocabulary set corresponding to each language.
In the embodiment of the application, for text to be recognized whose language cannot be determined by the preset grammar rules, the server may perform further recognition according to the high-frequency vocabulary set corresponding to each language.
In an optional implementation, the step in which the server determines the language of the unrecognized text according to the high-frequency vocabulary set corresponding to each language may be as follows: the server may obtain the high-frequency vocabulary set corresponding to each language, where each set includes a target number of high-frequency words. For any unrecognized text to be recognized, the server may perform word segmentation on the text to obtain a plurality of words. When the server determines that the plurality of words includes a word from a target high-frequency vocabulary set, the server may determine that the language to which the text belongs is the language corresponding to that set. In this way the server can further recognize text that the grammar rules could not recognize, so that more texts are recognized and the coverage of language identification is improved.
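As a rough sketch of this fallback step, the lookup can be expressed as a set intersection. The function name and the whitespace segmentation are assumptions. When several sets match, this sketch returns the first one in dictionary order, which the text does not specify.

```python
from typing import Dict, Optional, Set


def language_by_vocabulary(text: str, vocab_sets: Dict[str, Set[str]]) -> Optional[str]:
    """Return the first language whose high-frequency set shares a word with the text."""
    words = set(text.split())  # assumption: whitespace word segmentation
    for language, vocab in vocab_sets.items():
        if words & vocab:
            return language
    return None
```

Because each set is built to contain only words unique to its language, a single hit is treated as sufficient here.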
In an alternative implementation, the high-frequency vocabulary set may be created by the following steps. For any language, the server may obtain a first sample text that includes text of that language; the first sample text contains text data of at least a target order of magnitude, which may be GB (gigabytes), TB (terabytes), or PB (petabytes). The server may remove numbers, English text, spaces, and any text other than text of that language from the first sample text to obtain a second sample text. The server may perform word segmentation on the second sample text and count the frequency of each word. The server may then remove the words common to the languages so that only words unique to the language remain, and select a target number of high-frequency words, for example 300, 500 or 800, in descending order of word frequency. The server may use the set composed of the target number of high-frequency words as the high-frequency vocabulary set of the language; the sets may be stored in a database of the server in one-to-one correspondence with the languages.
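The construction steps above can be sketched as follows. This is an illustration under stated assumptions rather than the patent's procedure: the function name is hypothetical, the languages are assumed to be written in Arabic script so that cleaning and segmentation reduce to extracting runs of Arabic-block letters (the character ranges are approximate), and a word counts as "common" if it occurs in the sample of any other language.

```python
import re
from collections import Counter
from typing import Dict, Set

# Approximate Arabic-script letter ranges (an assumption); digits, Latin
# letters, whitespace and punctuation are dropped because they never match.
ARABIC_WORD = re.compile(r"[\u0620-\u064A\u066E-\u06D3\u06D5]+")


def build_high_frequency_sets(samples: Dict[str, str],
                              target_count: int = 300) -> Dict[str, Set[str]]:
    """Map each language to its most frequent words that no other language uses."""
    counts = {lang: Counter(ARABIC_WORD.findall(text))
              for lang, text in samples.items()}
    result: Dict[str, Set[str]] = {}
    for lang, counter in counts.items():
        # Words seen in any other language's sample are treated as common.
        others: Set[str] = set()
        for other, other_counter in counts.items():
            if other != lang:
                others.update(other_counter)
        unique = Counter({w: n for w, n in counter.items() if w not in others})
        # most_common returns words in descending order of frequency.
        result[lang] = {w for w, _ in unique.most_common(target_count)}
    return result
```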
It should be noted that the high-frequency vocabulary set can be used for further identifying the unrecognized text to be identified, and can also be used for verifying the text which is identified by the grammar rule, so that the accuracy of the language identification result is improved.
It should be noted that, the foregoing steps 201 to 203 are preferred manners of the language identification method provided in the embodiment of the present application, and the server may also implement language identification in other manners, such as identifying the language to which the text to be identified belongs by using a preset grammar rule and a high-frequency vocabulary set.
It should be further noted that, to make the framework of steps 201 to 203 clearer, reference may be made to Fig. 10, which is a system framework diagram provided in an embodiment of the present application. The figure shows the recognition of a large amount of online data containing Uyghur, Kazakh and Kyrgyz, carried out through three aspects that correspond to the three steps above: differences in language characters, differences in language grammar, and a vocabulary verification set. For step 202, the characters common to the three languages can be distinguished by their position and by grammar, and the special characters of each language can be recognized by the codes unique to that language.
It should be further noted that the language identification method provided in the embodiment of the present application may also be applied to a language identification system, whose processing flow may be as shown in Fig. 11, a processing flow chart of the language identification system provided in the embodiment of the present application. The input of the language identification system is a corpus in which several languages are mixed. First, the encoding format of the corpus is converted through unified encoding, which mainly converts font codes into character codes. Second, recognition is performed through grammar rules; for Uyghur, Kazakh and Kyrgyz, for example, recognition can use special characters, character positions, unique affixes, and vowel-consonant rules. Finally, further identification and verification are performed through the high-frequency vocabulary sets. The language identification system outputs the identification result for the multilingual corpus. The figure also indicates the source of the high-frequency vocabulary: the high-frequency vocabulary sets of the various languages are extracted from massive corpora.
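The three-stage flow of Fig. 11 can be sketched as a simple cascade. All names below are hypothetical: `normalize` stands for the encoding unification, `rules` for the grammar-rule checks, and `fallback` for the high-frequency vocabulary lookup, each supplied by the caller.

```python
from typing import Callable, Iterable, List, Optional, Sequence

Rule = Callable[[str], Optional[str]]


def identify_all(texts: Iterable[str],
                 normalize: Callable[[str], str],
                 rules: Sequence[Rule],
                 fallback: Rule) -> List[Optional[str]]:
    """Label each text: unify encoding, try the rules in order, then fall back."""
    labels: List[Optional[str]] = []
    for text in texts:
        text = normalize(text)
        label: Optional[str] = None
        for rule in rules:
            label = rule(text)
            if label is not None:
                break  # a grammar rule recognized the text
        if label is None:
            label = fallback(text)  # high-frequency vocabulary stage
        labels.append(label)
    return labels
```

Texts labeled `None` remain unrecognized after both stages.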
It should be further noted that, to verify the validity of the language identification method provided in the embodiment of the present application, tests were also performed on chapter-level and sentence-level texts to be recognized. The languages tested were Kyrgyz, Kazakh and Uyghur.
Fig. 12 is a chart of chapter-level test results provided in the embodiments of the present application. Fig. 13 is a chart of sentence-level test results provided in an embodiment of the present application. The quantity column gives the total number of samples involved in the experiment; note that a single sample may be identified as more than one language. Precision = number of samples correctly identified as a given language / total number of samples identified as that language; coverage (Recall) = number of samples correctly identified as a given language / total number of samples in that language.
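The two metrics defined above can be computed as in the following sketch. The function name is hypothetical, and for simplicity each sample is assumed to receive exactly one predicted label, whereas the experiment allows a sample to be identified as several languages.

```python
from typing import Sequence, Tuple


def precision_recall(true_labels: Sequence[str],
                     predicted_labels: Sequence[str],
                     language: str) -> Tuple[float, float]:
    """Return (precision, recall) for one language."""
    correct = sum(1 for t, p in zip(true_labels, predicted_labels)
                  if t == language and p == language)
    identified = sum(1 for p in predicted_labels if p == language)
    actual = sum(1 for t in true_labels if t == language)
    # Guard against division by zero when a language never occurs.
    precision = correct / identified if identified else 0.0
    recall = correct / actual if actual else 0.0
    return precision, recall
```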
Experiments show that the language identification method provided by the embodiment of the application achieves a good identification effect on Uyghur, Kazakh and Kyrgyz. The method is likewise applicable to other languages, such as those of the Altaic and Turkic language families, and is highly extensible; these are not verified one by one in the embodiment of the application.
In the embodiment of the application, the coding formats of the texts to be recognized are unified into Unicode, so that the texts can be recognized in multiple dimensions according to grammar rules such as language-specific characters, the positions of target common characters within words, and unique affixes. Texts not recognized by the rules are further recognized through the high-frequency vocabulary sets corresponding to the languages, which gives high coverage. Through multi-dimensional, multi-level recognition of languages that share many characters, the accuracy of the recognition result and the coverage of language recognition are both improved, thereby achieving effective language recognition.
Fig. 14 is a block diagram of a language identification device according to an embodiment of the present application, and as shown in fig. 14, the language identification device includes: a conversion module 1401, a recognition module 1402 and a determination module 1403.
a conversion module 1401, configured to convert the encoding format of at least one text to be recognized into Unicode;
the recognition module 1402 is configured to recognize at least one text to be recognized according to a preset grammar rule, and determine a language to which the at least one text to be recognized belongs, where the grammar rule includes at least one of language unique characters, a position of a target common character in a vocabulary, and unique affix;
the determining module 1403 is configured to determine, when there is an unidentified text to be identified, a language to which the unidentified text to be identified belongs according to the high-frequency vocabulary sets corresponding to the languages.
In a possible implementation, the conversion module 1401 is further configured to: for any text to be recognized in the at least one text to be recognized, convert a first character code of the text into Unicode, where the first character code is the original coding format of the text; and convert each font code in the text into the corresponding second character code according to the correspondence between font codes and second character codes, where a second character code is composed of at least two Unicode codes.
In a possible implementation, the recognition module 1402 is further configured to: for any text to be recognized, when the text includes a special character of a first target language, determine, according to the Unicode code of the special character, that the language to which the text belongs is the first target language.
In a possible implementation, the recognition module 1402 is further configured to: perform word segmentation on any text to be recognized to obtain a plurality of words; and when a target common character appears at a target position in any of the words, determine a second target language to which the text belongs.
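The two conversion steps can be sketched as follows. The function name and the one-entry table are illustrative assumptions; the patent's own correspondence table is not reproduced in this passage. The single entry uses a standard Unicode fact: the presentation-form ligature U+FEFB corresponds to the two characters U+0644 and U+0627.

```python
from typing import Dict

# Illustrative correspondence table (font code -> sequence of character codes).
# U+FEFB (LAM WITH ALEF ligature, isolated form) -> U+0644 LAM + U+0627 ALEF.
GLYPH_TO_CHARS: Dict[str, str] = {"\uFEFB": "\u0644\u0627"}


def to_unicode(raw: bytes, source_encoding: str) -> str:
    """Decode from the original encoding, then expand font codes into characters."""
    text = raw.decode(source_encoding)  # step 1: original encoding -> Unicode
    # Step 2: replace each font code by its character-code sequence.
    return "".join(GLYPH_TO_CHARS.get(ch, ch) for ch in text)
```

For example, `to_unicode(data, "utf-16-le")` would decode UTF-16 input and expand any ligature listed in the table.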
In a possible implementation manner, the recognition module 1402 is further configured to perform word segmentation on any text to be recognized to obtain a plurality of words; and when the grammatical affix of any vocabulary is the unique affix of the third target language, determining the language to which the text to be recognized belongs as the third target language.
In a possible implementation manner, the determining module 1403 is further configured to obtain a high-frequency vocabulary set corresponding to each language, where the high-frequency vocabulary set includes a target number of high-frequency vocabularies; for any unidentified text to be identified, performing word segmentation on the text to be identified to obtain a plurality of words; and when the plurality of vocabularies comprise the vocabularies in the target high-frequency vocabulary set, determining the language to which the text to be recognized belongs as the language corresponding to the target high-frequency vocabulary set.
In a possible implementation manner, the method for creating the high-frequency vocabulary sets corresponding to various languages includes:
for any language, removing numbers, English text, spaces, and text other than text of that language from a first sample text that includes text of that language, to obtain a second sample text; performing word segmentation on the second sample text, and counting the frequency of each word; removing the words common to the languages, and selecting a target number of high-frequency words in descending order of word frequency; and using the set composed of the target number of high-frequency words as the high-frequency vocabulary set of the language.
In the embodiment of the application, the coding formats of the texts to be recognized are unified into Unicode, so that the texts can be recognized in multiple dimensions according to grammar rules such as language-specific characters, the positions of target common characters within words, and unique affixes. Texts not recognized by the rules are further recognized through the high-frequency vocabulary sets corresponding to the languages, which gives high coverage. Through multi-dimensional, multi-level recognition of languages that share many characters, the accuracy of the recognition result and the coverage of language recognition are both improved, thereby achieving effective language recognition.
Fig. 15 is a schematic structural diagram of a server 1500 according to an embodiment of the present invention. The server 1500 may vary considerably depending on its configuration or performance, and may include one or more processors (CPUs) 1501 and one or more memories 1502, where at least one instruction is stored in the memory 1502 and is loaded and executed by the processor 1501 to implement the methods provided by the foregoing method embodiments. The server may also have components such as a wired or wireless network interface, a keyboard, and an input/output interface for performing input and output, and may include other components for implementing the functions of the device, which are not described herein again.
The embodiment of the present application further provides a storage medium, where the storage medium is applied to a server, and program codes are stored in the storage medium, and the program codes are used to execute the language identification method in the embodiment of the present application.
It will be understood by those skilled in the art that all or part of the steps of implementing the above embodiments may be implemented by hardware, or may be implemented by a program instructing relevant hardware, and the program may be stored in a storage medium, such as a read-only memory, a magnetic disk or an optical disk.
The present invention is not intended to be limited to the particular embodiments shown and described, but is to be accorded the widest scope consistent with the principles and novel features herein disclosed.

Claims (7)

1. A language identification method, the method comprising:
converting the coding format of at least one text to be recognized into Unicode;
recognizing the at least one text to be recognized according to a preset grammar rule, and determining the language to which the at least one text to be recognized belongs, wherein the grammar rule comprises at least one of language special characters, positions of target common characters in vocabularies and unique affixes;
when the unidentified text to be identified exists, determining the language to which the unidentified text to be identified belongs according to the high-frequency vocabulary sets corresponding to all languages;
the identifying the at least one text to be identified according to a preset grammar rule and determining the language to which the at least one text to be identified belongs includes:
when the grammar rule is the language special character, for any text to be recognized, when the text to be recognized comprises a special character of a first target language, determining the language to which the text to be recognized belongs as the first target language according to the Unicode code of the special character;
when the grammar rule is the position of the target common character in the vocabulary, for any text to be recognized, performing word segmentation on the text to be recognized to obtain a plurality of vocabularies, and when the target common character appears at the target position in any vocabulary, determining a second target language to which the text to be recognized belongs;
when the grammar rule is the unique affix, performing word segmentation on any text to be recognized to obtain a plurality of words, and when the grammar affix of any word is the unique affix of a third target language, determining the language to which the text to be recognized belongs as the third target language.
2. The method according to claim 1, wherein the converting the coding format of the at least one text to be recognized into Unicode comprises:
for any text to be recognized in the at least one text to be recognized, converting a first character code of the text to be recognized into Unicode, wherein the first character code is an original coding format of the text to be recognized;
and converting the font codes in the text to be recognized into corresponding second character codes according to the corresponding relation between the font codes and the second character codes, wherein the second character codes are formed by at least two Unicode codes.
3. The method according to claim 1, wherein the determining the language to which the unrecognized text to be recognized belongs according to the high-frequency vocabulary sets corresponding to the languages comprises:
acquiring high-frequency vocabulary sets corresponding to various languages, wherein the high-frequency vocabulary sets comprise high-frequency vocabularies with target quantity;
for any unidentified text to be identified, performing word segmentation on the text to be identified to obtain a plurality of words;
and when the plurality of words comprise words in a target high-frequency word set, determining the language to which the text to be recognized belongs as the language corresponding to the target high-frequency word set.
4. The method according to claim 3, wherein the method for creating the high frequency vocabulary sets corresponding to the languages comprises:
for any language, removing numbers, English text, blank spaces and texts other than the text of the language from a first sample text comprising the text of the language to obtain a second sample text;
performing word segmentation on the second sample text, and counting the word frequency of each word;
removing the vocabulary common to all languages, and acquiring high-frequency vocabularies of a target quantity according to the word frequency of each vocabulary from high to low;
and taking the set formed by the high-frequency vocabularies of the target number as the high-frequency vocabulary set of the language.
5. A language identification apparatus, comprising:
the conversion module is used for converting the coding format of at least one text to be recognized into Unicode;
the recognition module is used for recognizing the at least one text to be recognized according to a preset grammar rule and determining the language to which the at least one text to be recognized belongs, wherein the grammar rule comprises at least one of language special characters, positions of target common characters in vocabularies and unique affixes;
the determining module is used for determining the language to which the unidentified text to be identified belongs according to the high-frequency vocabulary sets corresponding to all languages when the unidentified text to be identified exists;
the identifying the at least one text to be identified according to a preset grammar rule and determining the language to which the at least one text to be identified belongs includes:
when the grammar rule is the language special character, for any text to be recognized, when the text to be recognized comprises a special character of a first target language, determining the language to which the text to be recognized belongs as the first target language according to the Unicode code of the special character;
when the grammar rule is the position of the target common character in the vocabulary, performing word segmentation on any text to be recognized to obtain a plurality of vocabularies, and when the target common character appears at the target position of any vocabulary, determining a second target language to which the text to be recognized belongs;
and when the grammar rule is the unique affix, performing word segmentation on any text to be recognized to obtain a plurality of vocabularies, and when the grammar affix of any vocabulary is the unique affix of a third target language, determining the language to which the text to be recognized belongs as the third target language.
6. A server, comprising a processor and a memory, wherein the memory is configured to store program code, and the program code is loaded and executed by the processor to perform the language identification method according to any one of claims 1 to 4.
7. A storage medium for storing program code, the program code being loaded and executed by a processor to perform the language identification method of any one of claims 1 to 4.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910888663.4A CN110633456B (en) 2019-09-19 2019-09-19 Language identification method, language identification device, server and storage medium

Publications (2)

Publication Number Publication Date
CN110633456A CN110633456A (en) 2019-12-31
CN110633456B true CN110633456B (en) 2023-04-07

Family

ID=68971800

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910888663.4A Active CN110633456B (en) 2019-09-19 2019-09-19 Language identification method, language identification device, server and storage medium

Country Status (1)

Country Link
CN (1) CN110633456B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112818212B (en) * 2020-04-23 2023-10-13 腾讯科技(深圳)有限公司 Corpus data acquisition method, corpus data acquisition device, computer equipment and storage medium
CN112329454A (en) * 2020-11-03 2021-02-05 腾讯科技(深圳)有限公司 Language identification method and device, electronic equipment and readable storage medium
CN112749639B (en) * 2020-12-29 2022-01-14 中电金信软件有限公司 Model training method and device, computer equipment and storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106297764A (en) * 2015-05-27 2017-01-04 科大讯飞股份有限公司 A kind of multilingual mixed Chinese language treatment method and system
CN106383818A (en) * 2015-07-30 2017-02-08 阿里巴巴集团控股有限公司 Machine translation method and device
CN106528535A (en) * 2016-11-14 2017-03-22 北京赛思信安技术股份有限公司 Multi-language identification method based on coding and machine learning
CN107945805A (en) * 2017-12-19 2018-04-20 程海波 A kind of intelligent across language voice identification method for transformation
CN110211588A (en) * 2019-06-03 2019-09-06 北京达佳互联信息技术有限公司 Audio recognition method, device and electronic equipment

Also Published As

Publication number Publication date
CN110633456A (en) 2019-12-31

Legal Events

Code Title
PB01 Publication
REG Reference to a national code (Ref country code: HK; Ref legal event code: DE; Ref document number: 40019572; Country of ref document: HK)
SE01 Entry into force of request for substantive examination
GR01 Patent grant