CN113919330A - Language identification method, information distribution method, device and medium - Google Patents

Language identification method, information distribution method, device and medium


Publication number
CN113919330A
Authority
CN
China
Prior art keywords
language
text
model
identification
recognized
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111198143.4A
Other languages
Chinese (zh)
Inventor
吴臻
郭子嘉
孙玉霞
韩宝龙
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ctrip Travel Information Technology Shanghai Co Ltd
Original Assignee
Ctrip Travel Information Technology Shanghai Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ctrip Travel Information Technology Shanghai Co Ltd
Priority to CN202111198143.4A
Publication of CN113919330A

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/20: Natural language analysis
    • G06F40/263: Language identification

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

An embodiment of the invention provides a language identification method, an information distribution method, a device, equipment and a medium. The language identification method comprises the following steps: determining a text to be recognized; preprocessing the text to be recognized to obtain semantically relevant text from which semantically irrelevant text has been removed; inputting the semantically relevant text into a plurality of language judgment models to obtain a recognition result from each of the language judgment models; and determining the language of the text to be recognized based on the recognition result of each of the language judgment models. The technical scheme of the embodiment can improve the accuracy of language identification.

Description

Language identification method, information distribution method, device and medium
Technical Field
The present invention relates to the field of software, and in particular to a language identification method, an information distribution method, a device, and a medium.
Background
Against the background of globalization, more and more interactions take place across the world, and how to identify the language of texts more accurately, so as to keep pace with this trend, has become a problem to be solved urgently.
Disclosure of Invention
To solve the problems in the prior art, an embodiment of the present invention provides a language identification method, including:
determining a text to be recognized;
preprocessing the text to be recognized to obtain semantically relevant text from which semantically irrelevant text has been removed;
inputting the semantically relevant text into a plurality of language judgment models to obtain a recognition result from each of the language judgment models;
and determining the language of the text to be recognized based on the recognition result of each of the language judgment models.
Optionally, the language judgment models use different judgment strategies.
Optionally, the language judgment models include a rule model based on rules set for different languages.
Optionally, the training of the rule model and the recognition results it can output are determined according to the application scenario.
Optionally, the language judgment models include at least one of the following: a joint recognition model for specific languages and a general language identification model; the specific languages are determined according to the application scenario, and the recognition result of the joint recognition model is selected from the specific languages; the language identification range of the general language identification model is larger than that of the joint recognition model.
Optionally, determining the language of the text to be recognized based on the recognition result of each of the language judgment models includes: preferentially determining the language of the text to be recognized based on the output of the joint recognition model.
Optionally, determining the language of the text to be recognized based on the recognition result of each of the language judgment models includes:
determining a logical order in which the recognition results of the language judgment models are used;
and performing logical judgment on the recognition results in that logical order to determine the language of the text to be recognized.
Optionally, the language judgment models include a rule model, a joint recognition model and a general language identification model, and determining the language of the text to be recognized based on the recognition result of each of the language judgment models includes:
determining the language of the text to be recognized according to the recognition result of the rule model and the recognition result of the joint recognition model;
and if determining the language of the text to be recognized according to the recognition result of the rule model and the recognition result of the joint recognition model fails, determining the language of the text to be recognized according to the recognition result of the general language identification model.
Optionally, determining the language of the text to be recognized according to the recognition result of the rule model and the recognition result of the joint recognition model includes: if the recognition result of the rule model and the recognition result of the joint recognition model both contain the same language, determining that language as the language of the text to be recognized; wherein the recognition result of the rule model, when recognition succeeds, comprises one or more languages, and the recognition result of the joint recognition model, when recognition succeeds, is a single language.
The embodiment of the invention also provides an information distribution method, which comprises the following steps:
adopting the language identification method to identify the language of the client;
and distributing the information of the corresponding language to the client according to the language identification result.
An embodiment of the present invention further provides a language identification device, including:
the text to be recognized determining unit is suitable for determining a text to be recognized;
the preprocessing unit is suitable for preprocessing the text to be recognized to obtain semantically relevant text from which semantically irrelevant text has been removed;
the recognition unit is suitable for inputting the semantically relevant text into a plurality of language judgment models to obtain a recognition result from each of the language judgment models;
and the language determining unit is suitable for determining the language of the text to be recognized based on the recognition result of each of the language judgment models.
An embodiment of the present invention further provides an electronic device, including:
a processor;
a memory having stored therein executable instructions of the processor;
wherein the processor is configured to implement the language identification method or the information distribution method by executing the executable instructions.
An embodiment of the present invention further provides a computer-readable storage medium, configured to store a program, where the program implements the language identification method or the information distribution method when executed.
In the embodiment of the invention, obtaining the semantically relevant text by removing the semantically irrelevant text eliminates interference in the text to be recognized and improves recognition accuracy. By providing a plurality of language judgment models and determining the language of the text to be recognized from the combined recognition results of all of them, the strengths of the individual models can be used together and language identification can be carried out more accurately.
Drawings
Other features, objects and advantages of the present invention will become more apparent upon reading of the following detailed description of non-limiting embodiments thereof, with reference to the accompanying drawings.
FIG. 1 is a flow chart of a language identification method according to an embodiment of the present invention;
FIG. 2 is a flowchart of one specific implementation of step S14 in FIG. 1;
FIG. 3 is a flow chart of another specific implementation of step S14 in FIG. 1;
FIG. 4 is a flow chart of another language identification method according to an embodiment of the present invention;
FIG. 5 is a flow chart of an information distribution method according to an embodiment of the present invention;
FIG. 6 is a schematic diagram of a unit structure of a language identification device according to an embodiment of the present invention;
FIG. 7 is a schematic diagram of a client or server according to an embodiment of the present invention; and
FIG. 8 is a schematic structural diagram of a computer-readable storage medium according to one embodiment of the present invention.
Detailed Description
Example embodiments will now be described more fully with reference to the accompanying drawings. Example embodiments may, however, be embodied in many different forms and should not be construed as limited to the embodiments set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of example embodiments to those skilled in the art. The same reference numerals in the drawings denote the same or similar structures, and thus their repetitive description will be omitted.
Referring to fig. 1 in combination, an embodiment of the present invention provides a language identification method, which may specifically include the following steps:
step S11, determining a text to be recognized;
step S12, preprocessing the text to be recognized to obtain semantically relevant text from which semantically irrelevant text has been removed;
step S13, inputting the semantically relevant text into a plurality of language judgment models to obtain a recognition result from each of the language judgment models;
step S14, determining the language of the text to be recognized based on the recognition result of each of the language judgment models.
In the embodiment of the invention, obtaining the semantically relevant text by removing the semantically irrelevant text eliminates interference in the text to be recognized and improves recognition accuracy. By providing a plurality of language judgment models and determining the language of the text to be recognized from the combined recognition results of all of them, the strengths of the individual models can be used together and language identification can be carried out more accurately.
In a specific implementation, the text to be recognized may be any of a variety of texts, for example data generated by an APP (application software) at a client. Information-oriented APPs facing global users are increasing, and as applications become content-driven, the numbers of international content producers and consumers are growing, so how to accurately identify the language of texts generated while the application is in use has become a problem to be solved urgently. In the embodiment of the invention, the text generated while the APP is in use can serve as the text to be recognized. The language identification method of the embodiment can improve the accuracy of language identification in this scenario.
In a specific implementation, preprocessing the text to be recognized may mean preprocessing the sentences in the text after the text has been input: semantically irrelevant text, such as web addresses, telephone numbers and e-mail addresses, is cleaned out, and the cleaned text is stored, for example in a list. The cleaned text may include the input sentences.
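The cleaning step can be sketched as follows. This is a minimal illustration, not the patent's implementation: the three regular expressions and the name `clean_text` are assumptions, since the source only names web addresses, telephone numbers and mailboxes as examples of semantically irrelevant text.

```python
from __future__ import annotations

import re

# Hypothetical patterns (assumptions): the source does not give its own expressions.
URL_RE = re.compile(r"(?:https?://|www\.)\S+", re.IGNORECASE)
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
PHONE_RE = re.compile(r"\+?\d[\d ().-]{6,}\d")


def clean_text(text: str) -> list[str]:
    """Remove URLs, e-mail addresses and phone numbers; return the cleaned lines as a list."""
    for pattern in (URL_RE, EMAIL_RE, PHONE_RE):
        text = pattern.sub(" ", text)
    # Split on line breaks only, because sentence segmentation is language-dependent.
    cleaned = [" ".join(line.split()) for line in text.splitlines()]
    # Drop lines that became empty after cleaning.
    return [line for line in cleaned if line]
```

A line that consisted only of removed material disappears from the list, so an empty list signals that no semantically relevant text remains.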
In a specific implementation, the judgment strategies of the language judgment models can differ. For example, one of them may be a rule model built from rules for different languages. The specific training of the rule model and the recognition results it can output are determined according to the application scenario. For example, if the languages that mainly appear in the application scenario are Chinese, English and Japanese, the rule model may be trained on these languages, and its output may be one or more of them. It is understood that there may be any number of mainly appearing languages, that they are not limited to the above examples, and that they may be determined in advance from statistics on how often each language appears in the application scenario, or in other ways that do not depend on the application scenario.
In an embodiment of the present invention, there are 12 mainly appearing languages, and an influence factor e is set for each of the 12 languages (the influence factors are mainly used to decide the language when several languages occur in the same text). For an input text, the model first preprocesses the text and removes the space characters from the sentence, then checks the sentence length. If the length is 0, English is output by default; otherwise, regular expressions are used to count the characters of each language in the sentence. The weight of each language is then obtained from the ratio of that language's character count to the length of the whole sentence and from the set influence factor e, and the language with the largest weight is judged to be the target language. For convenience, this output is denoted rr. Specifically, the weight can be calculated using the following formula:
Weight=len(s)/len(string)*e
wherein len(s) is the number of characters in the sentence that belong to a given language; len(string) is the length of the input sentence; e is the influence factor of that language.
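The weight calculation can be illustrated with a small sketch. It assumes three languages rather than 12, and the character classes in `LANG_PATTERNS` and the values in `INFLUENCE` are hypothetical, because the source does not list its regular expressions or influence factors.

```python
from __future__ import annotations

import re

# Hypothetical character classes for three languages (assumption).
LANG_PATTERNS = {
    "en": re.compile(r"[A-Za-z]"),
    "zh": re.compile(r"[\u4e00-\u9fff]"),  # CJK unified ideographs
    "ja": re.compile(r"[\u3040-\u30ff]"),  # hiragana and katakana
}
# Hypothetical influence factors e (assumption): kana is weighted up so that
# Japanese text containing many kanji is not mistaken for Chinese.
INFLUENCE = {"en": 1.0, "zh": 1.0, "ja": 2.0}


def language_weights(sentence: str) -> dict[str, float]:
    """Weight = len(s) / len(string) * e for every configured language."""
    stripped = "".join(sentence.split())  # remove space characters
    if not stripped:
        return {}
    return {
        lang: len(pattern.findall(stripped)) / len(stripped) * INFLUENCE[lang]
        for lang, pattern in LANG_PATTERNS.items()
    }


def rule_model(sentence: str) -> str:
    """Return rr: the language with the largest weight, or English for empty input."""
    weights = language_weights(sentence)
    if not weights:
        return "en"  # default stated in the source for length 0
    return max(weights, key=weights.get)
```

For the sentence 東京はとても楽しい, three of nine characters are ideographs (weight 3/9 × 1.0) and six are kana (weight 6/9 × 2.0), so this sketch returns "ja".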
In a specific implementation, the language judgment models may further include at least one of: a joint recognition model for specific languages and a general language identification model; the specific languages are determined according to the application scenario, and the recognition result of the joint recognition model is selected from the specific languages; the language identification range of the general language identification model is larger than that of the joint recognition model.
In a specific implementation, the specific languages may be languages for which the other models among the language judgment models have a low recognition rate, for example two or more languages that the other models easily confuse. The specific languages may also be determined in combination with how often each language appears in the application scenario.
In an embodiment of the present invention, the joint recognition model is a Chinese-Japanese model, because Japanese and Chinese are highly similar. The Chinese-Japanese model is dedicated to distinguishing Chinese from Japanese and is trained on a training set consisting entirely of Chinese and Japanese text, which can further improve the accuracy of language identification. In specific implementations, the Chinese-Japanese model can verify a Chinese or Japanese recognition result a second time, thereby improving accuracy.
In a specific implementation, the general language identification model may be implemented in various ways. An existing open source model may be used, for example a fasttext model, which supports identification of 200 languages. Providing the general language identification model extends the set of languages that can be identified and avoids the drop in accuracy that occurs when a language cannot be identified at all.
In a specific implementation, when the language judgment models include a joint recognition model for specific languages and a general language identification model, determining the language of the text to be recognized based on the recognition result of each of the language judgment models may include: preferentially determining the language of the text to be recognized based on the output of the joint recognition model. As described above, the joint recognition model is set up specifically for cases in which the other language judgment models recognize poorly, so giving priority to its output when determining the language of the text to be recognized can improve the accuracy of language identification.
In a specific implementation, referring to fig. 1 and fig. 2 in combination, determining the language of the text to be recognized based on the recognition result of each of the language judgment models in step S14 of fig. 1 may include the following steps:
step S21, determining a logical order in which the recognition results of the language judgment models are used;
and step S22, performing logical judgment on the recognition results in that logical order to determine the language of the text to be recognized.
In an implementation, the recognition results of the language judgment models may take different forms. For example, the recognition result of the rule model may be one or more languages when recognition succeeds, or may be empty when no language is recognized. The output of the joint recognition model may be the single language it recognized, or may be empty when no language is recognized.
Different language judgment models have different strengths. By setting a logical order for the recognition results of the models and judging the results in that order, the strengths of each model can be brought out and the final recognition result becomes more accurate.
With reference to fig. 1 and fig. 3 in combination, in a specific implementation, the language judgment models include a rule model, a joint recognition model and a general language identification model, and determining the language of the text to be recognized based on the recognition result of each of the language judgment models in step S14 of fig. 1 may further include:
step S31, determining the language of the text to be recognized according to the recognition result of the rule model and the recognition result of the joint recognition model;
step S32, if determining the language of the text to be recognized according to the recognition result of the rule model and the recognition result of the joint recognition model fails, determining the language of the text to be recognized according to the recognition result of the general language identification model.
In a specific implementation, determining the language of the text to be recognized according to the recognition result of the rule model and the recognition result of the joint recognition model may include: if the recognition result of the rule model and the recognition result of the joint recognition model both contain the same language, determining that language as the language of the text to be recognized; wherein the recognition result of the rule model, when recognition succeeds, comprises one or more languages, and the recognition result of the joint recognition model, when recognition succeeds, is a single language. In this way, the accuracy of identification can be improved more efficiently.
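Steps S31 and S32 can be sketched as a single function. The function name, the use of a set for the rule model's result, and `None` for an empty joint-model result are assumptions made for illustration; the source only says that an unsuccessful recognition yields an empty result.

```python
from __future__ import annotations

from typing import Optional


def decide_language(
    rule_result: set[str],        # zero, one or more languages from the rule model
    joint_result: Optional[str],  # a single language from the joint model, or None
    general_result: str,          # output of the general language identification model
) -> str:
    # Step S31: accept a language only when both models agree on it.
    if joint_result is not None and joint_result in rule_result:
        return joint_result
    # Step S32: agreement failed, so fall back to the general model.
    return general_result
```

Because the joint model returns at most one language, the agreement check reduces to a membership test on the rule model's result.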
Fig. 4 is a flowchart of a language identification method according to an embodiment of the present invention. Text input means inputting the text to be recognized. Text cleaning removes semantically irrelevant text, such as web addresses, telephone numbers and e-mail addresses, and the cleaned text is stored in a list. The text in the list is then checked: if its length len equals 0, the language finalresult of the text to be recognized is determined to be English. Otherwise, the preprocessed semantically relevant text is fed into three language judgment models, where the rule-based model is the rule model, the fasttext model is a specific implementation of the general language identification model, and the Chinese-Japanese model is a specific implementation of the joint recognition model.
For convenience of description, the outputs of the three language judgment models are named as follows: the output of the rule model is rr; the output of the fasttext model is rb; the output of the Chinese-Japanese model is rzj.
The outputs are judged in a logical order to determine the final output, namely the language of the text to be recognized. The following conditions are evaluated from top to bottom, and only one of them takes effect for an input sentence. (For example, if the first condition is not satisfied, the second condition is evaluated; if the first condition is satisfied, the remaining conditions are skipped and the result of the first condition is output directly.)
a) If the output rr of the rule model is Chinese or Japanese and the output rzj of the Chinese-Japanese model is Japanese, the content is judged to be Japanese.
b) If the output rr of the rule model is Chinese or Traditional Chinese (Hong Kong, Taiwan), the method jumps to the Chinese-internal subdivision logic.
In this logic, the character tables corresponding to Chinese and Traditional Chinese (Hong Kong, Taiwan) are read, and the input sentence is checked character by character to obtain the numbers of Simplified Chinese characters and Traditional Chinese characters. The number of characters other than Chinese characters is also counted. If these other characters make up more than 80 percent of the sentence, the fasttext model is used to judge again and its output rb is output directly as the recognition result; otherwise, whether the input sentence is Simplified Chinese or Traditional Chinese is determined from the proportions of the two kinds of Chinese characters.
c) If the output rb of the fasttext model is not among the 12 predefined common languages, the output rr of the rule model is output directly.
d) Finally, the fasttext model serves as the fallback, and its output rb is output.
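The conditions a) to d) can be read as a first-match cascade, sketched below under several assumptions: the language codes, the 12-language set `COMMON_LANGS`, and the tiny character tables `SIMPLIFIED` and `TRADITIONAL` are all hypothetical stand-ins, since the source names none of them. Real tables would hold thousands of variant characters.

```python
from __future__ import annotations

from typing import Optional

# Hypothetical set of 12 common languages (assumption).
COMMON_LANGS = {"en", "zh", "zh-Hant", "ja", "ko", "fr", "de", "es", "ru", "th", "it", "pt"}
# Hypothetical miniature character tables (assumption).
SIMPLIFIED = set("国语简汉书东")
TRADITIONAL = set("國語簡漢書東")


def subdivide_chinese(sentence: str, rb: str) -> str:
    """Chinese-internal subdivision: Simplified, Traditional, or defer to fasttext."""
    chars = [c for c in sentence if not c.isspace()]
    simplified = sum(c in SIMPLIFIED for c in chars)
    traditional = sum(c in TRADITIONAL for c in chars)
    han = sum("\u4e00" <= c <= "\u9fff" for c in chars)
    other = len(chars) - han
    # More than 80 percent non-Chinese characters: trust the fasttext output rb.
    if chars and other / len(chars) > 0.8:
        return rb
    return "zh-Hant" if traditional > simplified else "zh"


def final_language(sentence: str, rr: str, rzj: Optional[str], rb: str) -> str:
    """Evaluate conditions a) to d) from top to bottom; the first match wins."""
    if rr in {"zh", "ja"} and rzj == "ja":  # a)
        return "ja"
    if rr in {"zh", "zh-Hant"}:             # b)
        return subdivide_chinese(sentence, rb)
    if rb not in COMMON_LANGS:              # c)
        return rr
    return rb                               # d) fasttext as the fallback
```

Writing the cascade as early returns mirrors the statement that an input sentence takes effect on only one of the conditions.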
The embodiment of the invention also provides an information distribution method. With reference to fig. 5, the method may specifically include:
step S51, performing language identification on the client by using the language identification method;
and step S52, distributing the information of the corresponding language to the client according to the language identification result.
The information corresponding to the language may be recommendation information, reply information in an interactive process, or translation information consistent with the language of the client provided to the client.
In this way, matched information can be provided to clients of an internationalized APP, improving user experience. Specifically, when a user acts as a content producer, the user does not consider whether the currently selected language matches the language of the produced content, so the APP may be set to language A while the content is produced in language B. As a result, part of the high-quality content is wrongly distributed on the language-A platform. To improve the consumption experience of most language-A users, the distribution logic needs to be optimized: instead of distributing content according to the place of publication, content is distributed according to the language selected by the user, which improves the accuracy of content recommendation. On the other hand, language identification can be used in translation scenarios for content distribution, identifying content that is not in the user's selected language and providing a translation button for translating the text. When platform content is insufficient, this can increase the distribution rate of the content.
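A rough sketch of language-based distribution with a translation option follows. The function `distribute`, the dictionary layout of a feed item and the stand-in identifier `toy_identify` are all hypothetical; in practice the identifier would be the language identification method described above.

```python
from __future__ import annotations

from typing import Callable


def distribute(
    contents: list[str],
    identify: Callable[[str], str],
    client_language: str,
) -> list[dict]:
    """Tag each item with its language and rank items in the client's language first."""
    feed = []
    for text in contents:
        language = identify(text)
        feed.append({
            "text": text,
            "language": language,
            # Content in another language is still shown, with a translation button.
            "offer_translation": language != client_language,
        })
    # Stable sort: matching-language items (False) come before the others (True).
    feed.sort(key=lambda item: item["offer_translation"])
    return feed


def toy_identify(text: str) -> str:
    """Stand-in identifier (assumption): kana means Japanese, anything else English."""
    return "ja" if any("\u3040" <= c <= "\u30ff" for c in text) else "en"
```

Called as `distribute(["Great hotel", "素晴らしいホテル"], toy_identify, "ja")`, the Japanese item is ranked first and the English item is flagged for translation.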
The embodiment of the present invention further provides a language identification device, which, with reference to fig. 6, may specifically include the following units:
a text to be recognized determining unit 61 adapted to determine a text to be recognized;
a preprocessing unit 62 adapted to preprocess the text to be recognized to obtain semantically relevant text from which semantically irrelevant text has been removed;
a recognition unit 63 adapted to input the semantically relevant text into a plurality of language judgment models to obtain a recognition result from each of the language judgment models;
a language determining unit 64 adapted to determine the language of the text to be recognized based on the recognition result of each of the language judgment models.
The functional blocks in the above embodiments may be implemented wholly or partially by software, hardware, firmware, or any combination thereof. When implemented in software, they may be implemented in whole or in part in the form of a program instruction product. The program instruction product includes one or more program instructions. The processes or functions according to the present application occur in whole or in part when the program instructions are loaded and executed on a computer. The computer may be a general purpose computer, a special purpose computer, a network of computers, or other programmable device. The program instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another computer-readable storage medium.
Moreover, the apparatuses disclosed in the above embodiments may be implemented in other ways. The above-described embodiments of the apparatus are merely illustrative. For example, the division into modules is merely a logical division, and in actual implementation there may be other divisions: multiple units or modules may be combined or integrated into another system, or some features may be omitted or not implemented. In addition, the mutual coupling, direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, devices or modules, and may be electrical or in another form.
In addition, the functional modules and sub-modules in the above embodiments may be integrated in one processing unit, or each module may exist alone physically, or two or more modules may be integrated in one unit. The integrated unit can be realized in the form of hardware or in the form of a software functional module. The integrated unit, if implemented in the form of a software functional module and sold or used as a stand-alone product, may also be stored in a computer-readable storage medium. The storage medium may be a read-only memory, a magnetic or optical disk, or the like.
An embodiment of the present invention further provides an electronic device, including:
a processor;
a memory having stored therein executable instructions of the processor;
wherein the processor is configured to implement the aforementioned language identification method or information distribution method by executing the executable instructions.
An embodiment of the present invention further provides a computer-readable storage medium, which is used for storing a program, and when the program is executed, the language identification method or the information distribution method is implemented.
The implementation principles and beneficial effects of the device, the method, the medium, the equipment and the like in the embodiment of the invention are the same, and can be referred to each other.
As will be appreciated by one skilled in the art, aspects of the present invention may be embodied as a system, method or program product. Thus, various aspects of the invention may be embodied in the form of: an entirely hardware embodiment, an entirely software embodiment (including firmware, microcode, etc.) or an embodiment combining hardware and software aspects that may all generally be referred to herein as a "circuit," "module," or "platform."
Fig. 7 is a schematic structural diagram of a client or a server in an embodiment of the present invention. An electronic device 600 for implementing the aforementioned language identification method according to this embodiment of the present invention is described below with reference to fig. 7. The electronic device 600 shown in fig. 7 is only an example, and should not impose any limitation on the functions and the scope of use of the embodiments of the present invention.
As shown in fig. 7, the electronic device 600 is embodied in the form of a general purpose computing device. The components of the electronic device 600 may include, but are not limited to: at least one processing unit 610, at least one storage unit 620, a bus 630 connecting the different platform components (including the storage unit 620 and the processing unit 610), a display unit 640, etc.
The storage unit 620 stores program code which can be executed by the processing unit 610, such that the processing unit 610 performs the steps according to various exemplary embodiments of the present invention as described in the above-mentioned method section of the present specification.
The storage unit 620 may include readable media in the form of volatile memory units, such as a random access memory unit (RAM) 6201 and/or a cache memory unit 6202, and may further include a read-only memory unit (ROM) 6203.
The storage unit 620 may also include a program/utility 6204 having a set (at least one) of program modules 6205, such program modules 6205 including, but not limited to: an operating system, one or more application programs, other program modules, and program data, each of which, or some combination thereof, may comprise an implementation of a network environment.
Bus 630 may be one or more of several types of bus structures, including a memory unit bus or memory unit controller, a peripheral bus, an accelerated graphics port, a processing unit, or a local bus using any of a variety of bus architectures.
The electronic device 600 may also communicate with one or more external devices 700 (e.g., keyboard, pointing device, bluetooth device, etc.), with one or more devices that enable a user to interact with the electronic device 600, and/or with any devices (e.g., router, modem, etc.) that enable the electronic device 600 to communicate with one or more other computing devices. Such communication may occur via an input/output (I/O) interface 650. Also, the electronic device 600 may communicate with one or more networks (e.g., a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public network such as the Internet) via the network adapter 660. The network adapter 660 may communicate with other modules of the electronic device 600 via the bus 630. It should be appreciated that although not shown in the figures, other hardware and/or software modules may be used in conjunction with the electronic device 600, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage platforms, to name a few.
Fig. 8 is a schematic structural diagram of a computer-readable storage medium of the present invention. Referring to fig. 8, a program product 800 for implementing the above method according to an embodiment of the present invention is described, which may employ a portable compact disc read only memory (CD-ROM) and include program code, and may be run on a terminal device, such as a personal computer. However, the program product of the present invention is not limited in this regard and, in the present document, a readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
The program product may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection having one or more wires, a portable disk, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
A computer readable signal medium may include a propagated data signal with readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A readable signal medium may also be any readable medium that is not a readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Program code for carrying out operations for aspects of the present invention may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server. In the case of a remote computing device, the remote computing device may be connected to the user computing device through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computing device (e.g., through the internet using an internet service provider).
In the description herein, references to "an embodiment," "a further embodiment," "specifically," or "some examples," etc., mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Moreover, various embodiments or examples and features of different embodiments or examples described in this specification can be combined by one skilled in the art provided they do not conflict with one another.
Furthermore, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In the description of the present invention, "a plurality" means two or more unless specifically defined otherwise.
Any process or method descriptions in flow charts or otherwise described herein may be understood as representing modules, segments, or portions of code which include one or more executable instructions for implementing specific logical functions or steps of the process. And the scope of the preferred embodiments of the present invention includes additional implementations in which functions may be executed out of order from that shown or discussed, including substantially concurrently or in reverse order, depending on the functionality involved.
The logic and/or steps represented in the flowcharts or otherwise described herein, such as an ordered listing of executable instructions that can be considered to implement logical functions, can be embodied in any computer-readable medium for use by or in connection with an instruction execution system, apparatus, or device, such as a computer-based system, processor-containing system, or other system that can fetch the instructions from the instruction execution system, apparatus, or device and execute the instructions.
In summary, according to the technical scheme in the embodiment of the present invention, by preprocessing the text to be recognized to obtain the semantic related text from which the semantic unrelated text has been removed, interference in the text to be recognized can be eliminated and the recognition accuracy is improved. By setting a plurality of language judgment models and judging the language of the text to be recognized by combining the recognition result of each of the plurality of language judgment models, the recognition results of the plurality of language judgment models can be comprehensively utilized, and language recognition can be carried out more accurately.
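As a minimal illustrative sketch of the scheme summarized above, and not the implementation of the embodiments, the following Python code first strips semantic unrelated text and then passes the remaining text to several language judgment models. The regular expressions, the function names `preprocess` and `identify_all`, and the two toy models are hypothetical; what counts as semantic unrelated text (here URLs, e-mail addresses, digits, and punctuation) is assumed for illustration only.

```python
import re
from typing import Callable, Dict, List

# Hypothetical patterns for "semantic unrelated" text (assumed, not from the source).
_URL = re.compile(r"https?://\S+")
_EMAIL = re.compile(r"\S+@\S+\.\S+")
_NOISE = re.compile(r"[\d_]+|[^\w\s]+")  # digits, underscores, punctuation, symbols
_SPACES = re.compile(r"\s+")

# A language judgment model maps cleaned text to candidate language codes;
# an empty list means the model could not decide.
Model = Callable[[str], List[str]]


def preprocess(text: str) -> str:
    """Strip URLs, e-mail addresses, digits, and symbols, then normalize spaces."""
    for pattern in (_URL, _EMAIL, _NOISE):
        text = pattern.sub(" ", text)
    return _SPACES.sub(" ", text).strip()


def script_rule_model(text: str) -> List[str]:
    """Toy rule model: guess candidate languages from the Unicode script."""
    if any("\uac00" <= ch <= "\ud7a3" for ch in text):  # Hangul syllables
        return ["ko"]
    if any("\u3040" <= ch <= "\u30ff" for ch in text):  # Hiragana / Katakana
        return ["ja"]
    if any("\u4e00" <= ch <= "\u9fff" for ch in text):  # CJK ideographs
        return ["zh", "ja"]
    return []


def keyword_model(text: str) -> List[str]:
    """Toy stand-in for a trained model: match a few common function words."""
    words = set(text.lower().split())
    if words & {"the", "and", "is"}:
        return ["en"]
    if words & {"le", "la", "et"}:
        return ["fr"]
    return []


def identify_all(text: str, models: Dict[str, Model]) -> Dict[str, List[str]]:
    """Preprocess once, then collect the recognition result of every model."""
    cleaned = preprocess(text)
    return {name: model(cleaned) for name, model in models.items()}


MODELS: Dict[str, Model] = {"rule": script_rule_model, "keyword": keyword_model}
print(identify_all("The hotel is great!!! https://example.com 12345", MODELS))
```

Preprocessing once and sharing the cleaned text keeps every model working on the same input, so their recognition results can be compared directly in a later judgment step.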
The foregoing is a more detailed description of the invention in connection with specific preferred embodiments and it is not intended that the invention be limited to these specific details. For those skilled in the art to which the invention pertains, several simple deductions or substitutions can be made without departing from the spirit of the invention, and all shall be considered as belonging to the protection scope of the invention.

Claims (13)

1. A language identification method, comprising:
determining a text to be recognized;
preprocessing the text to be recognized to obtain a semantic related text after the semantic unrelated text is removed;
inputting the semantic related text to a plurality of language judgment models to obtain an identification result of each language judgment model in the plurality of language judgment models;
and judging the language of the text to be recognized based on the recognition result of each language judgment model in the plurality of language judgment models.
2. The language identification method as claimed in claim 1, wherein said plurality of language judgment models differ in judgment strategy.
3. The language identification method of claim 1 wherein said plurality of language judgment models comprises rule models based on rule settings of different languages.
4. The language identification method according to claim 3, wherein the training of said rule model and the identification result of said rule model are determined according to an application scenario.
5. The language identification method of claim 1, wherein said plurality of language judgment models comprises at least one of: a joint identification model of a specific language and a general language identification model; the specific language is determined according to an application scene, and the identification result of the joint identification model is selected from the specific language; the language identification range of the general language identification model is larger than that of the combined identification model.
6. The language identification method according to claim 5, wherein said determining the language of the text to be identified based on the identification result of each of said plurality of language judgment models comprises: preferentially determining the language of the text to be recognized based on the output result of the joint recognition model.
7. The language identification method according to claim 1, wherein said determining the language of the text to be identified based on the identification result of each of the plurality of language judgment models comprises:
determining a logical order of recognition results using each of the plurality of language judgment models;
and carrying out logic judgment on the recognition result according to the logic sequence so as to determine the language of the text to be recognized.
8. The language identification method according to claim 1, wherein said plurality of language judgment models comprises a rule model, a joint recognition model and a general language identification model, and said judging the language of said text to be identified based on the recognition result of each of said plurality of language judgment models comprises:
determining the language of the text to be recognized according to the recognition result of the rule model and the recognition result of the joint recognition model;
and if determining the language of the text to be recognized according to the recognition result of the rule model and the recognition result of the joint recognition model fails, determining the language of the text to be recognized according to the recognition result of the general language identification model.
9. The language identification method according to claim 8, wherein said determining the language of the text to be identified according to the identification result of the rule model and the identification result of the joint recognition model comprises: if the recognition result of the rule model and the recognition result of the joint recognition model both contain the same language, determining that language as the language of the text to be recognized; wherein the recognition result of the rule model, when recognition succeeds, comprises one or more languages, and the recognition result of the joint recognition model, when recognition succeeds, comprises one language.
10. An information distribution method, comprising:
adopting the language identification method of any one of claims 1 to 9 to identify the language of the client;
and distributing the information of the corresponding language to the client according to the language identification result.
11. A language identification device, comprising:
the text to be recognized determining unit is suitable for determining a text to be recognized;
the preprocessing unit is suitable for preprocessing the text to be recognized to obtain a semantic related text from which the semantic unrelated text is removed;
the recognition unit is suitable for inputting the semantic related text to a plurality of language judgment models to obtain a recognition result of each language judgment model in the plurality of language judgment models;
and the language determining unit is suitable for judging the language of the text to be recognized based on the recognition result of each language judgment model in the plurality of language judgment models.
12. An electronic device, comprising:
a processor;
a memory having stored therein executable instructions of the processor;
wherein the processor is configured to perform, by executing the executable instructions, the language identification method of any one of claims 1 to 9 or the information distribution method of claim 10.
13. A computer-readable storage medium storing a program, wherein the program is executed to implement the language identification method according to any one of claims 1 to 9 or the information distribution method according to claim 10.
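The judgment order recited in claims 8 and 9 and the distribution step of claim 10 can be sketched as follows. This is a hedged illustration under stated assumptions, not the claimed implementation: the function names `decide_language` and `distribute`, the use of `None` to signal a failed recognition, the message catalog, and the fallback to a default language are all hypothetical.

```python
from typing import Dict, List, Optional


def decide_language(
    rule_result: List[str],
    joint_result: Optional[str],
    general_result: Optional[str],
) -> Optional[str]:
    """Combine three recognition results in the order of claims 8 and 9."""
    # Claim 9: the rule model may return several candidates, the joint model one.
    # If the joint model's language is among the rule candidates, accept it.
    if joint_result is not None and joint_result in rule_result:
        return joint_result
    # Claim 8: otherwise fall back to the general language identification model.
    return general_result


def distribute(language: Optional[str], catalog: Dict[str, str], default: str = "en") -> str:
    """Claim 10 sketch: return the information in the identified language.

    Falling back to a default language is an assumption, not from the source.
    """
    if language in catalog:
        return catalog[language]
    return catalog[default]


# Hypothetical catalog of information keyed by language code.
CATALOG = {"en": "Welcome", "ja": "ようこそ", "zh": "欢迎"}
language = decide_language(["zh", "ja"], "ja", "zh")
print(language, distribute(language, CATALOG))
```

Checking the agreement of the rule model and the joint recognition model first means the general model's result is consulted only when the two narrower models do not agree on a language.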
CN202111198143.4A 2021-10-14 2021-10-14 Language identification method, information distribution method, device and medium Pending CN113919330A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111198143.4A CN113919330A (en) 2021-10-14 2021-10-14 Language identification method, information distribution method, device and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111198143.4A CN113919330A (en) 2021-10-14 2021-10-14 Language identification method, information distribution method, device and medium

Publications (1)

Publication Number Publication Date
CN113919330A true CN113919330A (en) 2022-01-11

Family

ID=79240561

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111198143.4A Pending CN113919330A (en) 2021-10-14 2021-10-14 Language identification method, information distribution method, device and medium

Country Status (1)

Country Link
CN (1) CN113919330A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114596566A (en) * 2022-04-18 2022-06-07 腾讯科技(深圳)有限公司 Text recognition method and related device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination