CN112528682A - Language detection method and device, electronic equipment and storage medium - Google Patents

Language detection method and device, electronic equipment and storage medium Download PDF

Info

Publication number
CN112528682A
Authority
CN
China
Prior art keywords
language
identification result
input text
classification model
feature vector
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011540408.XA
Other languages
Chinese (zh)
Inventor
王曦阳
张睿卿
何中军
李芝
吴华
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202011540408.XA priority Critical patent/CN112528682A/en
Publication of CN112528682A publication Critical patent/CN112528682A/en
Pending legal-status Critical Current

Links

Images

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/40 Processing or translation of natural language
    • G06F 40/58 Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F 3/01 Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F 3/02 Input arrangements using manually operated switches, e.g. using keyboards or dials
    • G06F 3/023 Arrangements for converting discrete items of information into a coded form, e.g. arrangements for interpreting keyboard generated codes as alphanumeric codes, operand codes or instruction codes
    • G06F 3/0233 Character input methods

Abstract

The application discloses a language detection method, a language detection device, electronic equipment and a storage medium, and relates to the technical field of computers, in particular to the technical field of artificial intelligence such as natural language processing and deep learning. The specific implementation scheme is as follows: acquiring an input text; calling a first classification model to perform language detection on an input text to generate a first language identification result; and if the first language identification result meets the preset condition, calling a second classification model to perform language detection on the input text to generate a second language identification result, wherein the identification precision of the second classification model is higher than that of the first classification model. The language detection method of the embodiment of the application can effectively detect the language of the input text, and further improves the accuracy of the language identification result.

Description

Language detection method and device, electronic equipment and storage medium
Technical Field
The application relates to the technical field of computers, in particular to the technical field of artificial intelligence such as natural language processing and deep learning, and particularly relates to a language detection method and device, electronic equipment and a storage medium.
Background
In natural language processing, language detection is the task of identifying, for a given text, the language in which it is written. Language detection is a key step in natural language processing, particularly when processing large-scale real-world text, where subsequent processing usually depends on first determining the language of the text.
For example, in machine translation a user may enter text in any language as the source, and the system must determine the language of the input in order to translate it correctly. Language detection is essentially a text classification task, with languages corresponding to text categories.
Disclosure of Invention
The application provides a language detection method, a language detection device, electronic equipment and a storage medium.
According to an aspect of the present application, there is provided a language detection method, including:
acquiring an input text;
calling a first classification model to perform language detection on the input text to generate a first language identification result; and
if the first language identification result meets a preset condition, calling a second classification model to perform language detection on the input text to generate a second language identification result, wherein the identification precision of the second classification model is higher than that of the first classification model.
According to another aspect of the present application, there is provided a language detection apparatus, including:
the acquisition module is used for acquiring an input text;
the first detection module is used for calling a first classification model to carry out language detection on the input text so as to generate a first language identification result; and
the second detection module is used for calling a second classification model to perform language detection on the input text to generate a second language identification result if the first language identification result meets a preset condition, wherein the identification precision of the second classification model is higher than that of the first classification model.
According to another aspect of the present application, there is provided an electronic device including:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the language detection method of an embodiment of an aspect described above.
According to another aspect of the present application, there is provided a non-transitory computer-readable storage medium storing thereon a computer program for causing a computer to execute a language detection method according to an embodiment of the above-described aspect.
According to another aspect of the present application, there is provided a computer program product comprising a computer program, which when executed by a processor implements the language detection method according to an embodiment of the above-mentioned aspect.
Other effects of the above-described alternative will be described below with reference to specific embodiments.
Drawings
The drawings are included to provide a better understanding of the present solution and are not intended to limit the present application. Wherein:
fig. 1 is a schematic flow chart of a language detection method according to an embodiment of the present application;
fig. 2 is a schematic flow chart of another language detection method according to an embodiment of the present application;
fig. 3 is a schematic network structure diagram of a first classification model provided in an embodiment of the present application;
FIG. 4 is a schematic flow chart illustrating another language detection method according to an embodiment of the present application;
FIG. 5 is a schematic network structure diagram of a second classification model provided in an embodiment of the present application;
fig. 6 is a schematic structural diagram of a language detection device according to an embodiment of the present application; and
fig. 7 is a block diagram of an electronic device of a language detection apparatus according to an embodiment of the present application.
Detailed Description
The following description of the exemplary embodiments of the present application, taken in conjunction with the accompanying drawings, includes various details of the embodiments of the application for the understanding of the same, which are to be considered exemplary only. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present application. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
A language detection method, a language detection apparatus, an electronic device, and a storage medium according to an embodiment of the present application are described below with reference to the drawings.
Artificial intelligence is the discipline that studies using computers to simulate certain human thought processes and intelligent behaviors (such as learning, reasoning, thinking, planning, etc.), covering both hardware-level and software-level technologies. Artificial intelligence hardware technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, and the like; artificial intelligence software technologies include computer vision, speech recognition, natural language processing, deep learning, big data processing, knowledge graph technology, and the like.
Natural language processing is an important direction in the fields of computer science and artificial intelligence. It studies various theories and methods that enable efficient communication between humans and computers using natural language. Natural language processing is a science integrating linguistics, computer science and mathematics.
Deep learning is a new research direction in the field of machine learning. Deep learning learns the intrinsic regularities and representation levels of sample data, and the information obtained during learning is very helpful for interpreting data such as text, images, and sound. Its ultimate goal is to give machines human-like analysis and learning abilities so that they can recognize data such as text, images, and sound. Deep learning is a complex machine learning algorithm, and its results in speech and image recognition far exceed those of earlier related techniques.
The language detection method provided in the embodiment of the present application may be executed by an electronic device, where the electronic device may be a Personal Computer (PC), a tablet Computer, a palmtop Computer, or the like, and is not limited herein.
In the embodiment of the application, the electronic device can be provided with a processing component, a storage component and a driving component. Optionally, the driving component and the processing component may be integrated, the storage component may store an operating system, an application program, or other program modules, and the processing component implements the language detection method provided in the embodiment of the present application by executing the application program stored in the storage component.
Fig. 1 is a schematic flow chart of a language detection method according to an embodiment of the present application.
The language detection method according to the embodiment of the application may also be implemented by a language detection device provided in the embodiment of the application. The device may be configured in an electronic device, so that a first classification model is invoked to perform language detection on the acquired input text to generate a first language identification result and, when the first language identification result meets a preset condition, a second classification model is invoked to perform language detection on the input text to generate a second language identification result.
As a possible situation, the language detection method in the embodiment of the present application may also be executed at a server, where the server may be a cloud server, and the language detection method may be executed at a cloud end.
As shown in fig. 1, the language detection method may include:
step 101, an input text is obtained. It should be noted that the input text described in this embodiment may be text expressed in various written languages, for example, it may be chinese text, english text, russian text, malaysian text, mixed chinese and english text, and the like. The input text may contain a sentence, a paragraph, or a chapter, such as a news article.
In this embodiment of the application, the input text may include text information input by a user through speech recognition as well as content entered by the user into an input method system through an input method. The input method system may convert the entered content into word candidates of the input text according to the user's current input manner and present them for selection. The user may input text information through various input means, such as a keyboard, a touch pad, or a mouse, and may also choose any input manner, such as Pinyin, Wubi, stroke, handwriting, English, or a keypad, which is not limited herein.
As a possible scenario, the input text may further include text information obtained by the user through copy and paste.
Specifically, the electronic device may obtain the input information (input text) entered by the user into the input method system through the input method, for example, a piece of Chinese text typed via the input method.
Step 102, calling a first classification model to perform language detection on the input text to generate a first language identification result.
As a possible scenario, the first classification model may be improved based on an open-source FastText model.
It should be noted that the first classification model described in this embodiment may be trained in advance and pre-stored in the storage space of the electronic device to facilitate retrieval and use, where the storage space is not limited to local physical storage such as a hard disk; it may also be the storage space of a network drive connected to the electronic device (cloud storage).
In the embodiment of the application, after the electronic device acquires the input text, it may first preprocess the input text by removing punctuation, consecutive whitespace characters, Arabic numerals, emoticons, and the like, and converting the text to lowercase, thereby eliminating interference for subsequent language detection and further improving its accuracy.
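As a minimal illustrative sketch (not part of the patent disclosure), the preprocessing described above could look roughly like the following; the function name, regular expressions and character ranges are assumptions chosen only for illustration.

```python
import re

def preprocess(text: str) -> str:
    """Hypothetical preprocessing: strip punctuation, Arabic numerals and
    emoticons, collapse consecutive whitespace, then lowercase."""
    # Remove common ASCII and CJK punctuation marks
    text = re.sub(r"[!-/:-@\[-`{-~。，、；：？！「」『』（）《》【】…]", " ", text)
    # Remove Arabic numerals
    text = re.sub(r"[0-9]+", " ", text)
    # Remove emoji / pictographic symbols (rough Unicode ranges)
    text = re.sub("[\U0001F300-\U0001FAFF\u2600-\u27BF]", " ", text)
    # Collapse consecutive whitespace and lowercase
    return re.sub(r"\s+", " ", text).strip().lower()

# e.g. preprocess("Hello, World!! 2020 🙂") -> "hello world"
```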
Specifically, after acquiring the preprocessed input text, the electronic device may first call a first classification model from a storage space of the electronic device, and then input the input text into the first classification model, so that the language detection is performed on the preprocessed input text through the first classification model to obtain a first language identification result output by the first classification model.
Step 103, if the first language identification result meets a preset condition, calling a second classification model to perform language detection on the input text to generate a second language identification result, wherein the identification precision of the second classification model is higher than that of the first classification model. The preset condition can be calibrated according to the actual situation.
It should be noted that the second classification model described in this embodiment may likewise be trained in advance and pre-stored in the storage space of the electronic device to facilitate retrieval and use.
specifically, after obtaining the first language identification result, the electronic device may first determine whether the first language identification result meets a preset condition, and if the first language identification result meets the preset condition, call a second classification model from a storage space of the electronic device, and input the preprocessed input text into the second classification model, so as to perform language detection on the preprocessed input text through the second classification model, to obtain a second language identification result output by the second classification model, and use the second language identification result as a final identification result.
In this embodiment of the present application, training and generation of the first classification model and the second classification model may be performed by a related server, where the server may be a cloud server or a computer host, and a communication connection is established between the server and the electronic device capable of executing the language detection method provided in the embodiment of the present application, where the communication connection may be at least one of a wireless network connection and a wired network connection. The server can send the trained first classification model and second classification model to the electronic device so that the electronic device can call them when needed, thereby greatly reducing the computational load on the electronic device.
In the embodiment of the application, an input text is first obtained; a first classification model is then called to perform language detection on the input text to generate a first language identification result; if the first language identification result meets a preset condition, a second classification model is called to perform language detection on the input text to generate a second language identification result, which can finally be used as the final identification result. In this way, the language of the input text can be effectively detected, and the accuracy of the language identification result is improved.
Further, in an embodiment of the present application, the language detection method may further include taking the first language identification result as a final identification result if the first language identification result does not satisfy a preset condition.
Specifically, after acquiring the preprocessed input text, the electronic device may first call the first classification model from its own storage space and then input the preprocessed input text into the first classification model, so as to perform language detection on the preprocessed input text through the first classification model and obtain the first language identification result output by the first classification model. After obtaining the first language identification result, the electronic device determines whether it meets the preset condition; if not, there is no need to call the second classification model to perform language detection on the preprocessed input text, and the electronic device can directly provide the first language identification result to the user as the final identification result, thereby increasing the language detection speed.
Further, in an embodiment of the present application, whether the first language identification result satisfies the preset condition may be determined through the following steps: querying the similar-language groups according to the first language identification result to determine whether a similar language exists; if so, determining that the first language identification result satisfies the preset condition; if not, determining that the first language identification result does not satisfy the preset condition.
Specifically, after obtaining the first language identification result, the electronic device may analyze it to determine whether the language given by the first language identification result belongs to a similar-language group. If so, a similar language may exist for the input text; in this case it may be determined that the first language identification result satisfies the preset condition, and the second classification model may be invoked to perform language detection on the input text to generate the second language identification result. If not, no similar language exists for the input text; in this case it may be determined that the first language identification result does not satisfy the preset condition, and the first language identification result may be directly provided to the user as the final identification result. Because the first language identification result does not meet the preset condition, there is no need to call the second classification model to continue language detection on this input text, which streamlines the logic of the language detection method and increases the language detection speed.
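A minimal sketch of this two-stage dispatch, assuming the models expose a `predict` method and that the similar-language groups are available as sets of language codes (both assumptions; the patent does not prescribe an interface or the group contents):

```python
# Hypothetical groups of easily confused languages, given only for illustration.
SIMILAR_LANGUAGE_GROUPS = [
    {"ms", "id"},        # Malay / Indonesian
    {"bs", "hr", "sr"},  # Bosnian / Croatian / Serbian
]

def satisfies_preset_condition(first_result: str) -> bool:
    """Preset condition: the first result names a language belonging to a similar-language group."""
    return any(first_result in group for group in SIMILAR_LANGUAGE_GROUPS)

def detect_language(text, first_model, second_model) -> str:
    first_result = first_model.predict(text)      # fast, coarse first pass
    if satisfies_preset_condition(first_result):
        return second_model.predict(text)         # higher-precision second pass
    return first_result                           # no similar language: keep the first result
```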
For clarity of the above embodiment, in an embodiment of the present application, as shown in fig. 2, the first classification model may perform language detection on the input text to generate a first language recognition result by the following steps:
step 201, a plurality of first characters are generated according to an input text.
In the embodiment of the application, a plurality of first characters can be generated according to an input text and a preset character generation algorithm, wherein the preset character generation algorithm can be calibrated according to actual conditions.
Step 202, generating corresponding first character category feature vectors according to the plurality of first characters, and generating the first character feature vectors according to the plurality of first characters.
Specifically, after the electronic device obtains the input text, it may first preprocess the input text and input the preprocessed input text into the first classification model. The first classification model can generate a plurality of first characters according to the preprocessed input text and a preset character generation algorithm, then classify the plurality of first characters according to their character encodings, count the distribution of the plurality of first characters over different character categories to obtain first character category features, and encode (i.e., vectorize) the first character category features by means of word embedding to obtain first character category feature vectors.
Meanwhile, the first classification model can also extract character-level n-gram features from the plurality of first characters and compute their hash values to obtain the first character features. The first classification model may then encode (i.e., vectorize) the first character features using word embedding to obtain a first character feature vector.
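The hashed character-level n-gram features might be computed roughly as follows; the n-gram length, bucket count and the use of Python's built-in `hash` are assumptions made only to illustrate the hashing trick (a production model would use a fixed hash function so bucket ids are stable across runs).

```python
def char_ngram_bucket_ids(chars: list[str], n: int = 3, num_buckets: int = 2_000_000) -> list[int]:
    """Map character-level n-grams to hash buckets; the bucket embeddings are later averaged."""
    text = "".join(chars)
    ngrams = [text[i:i + n] for i in range(max(len(text) - n + 1, 0))]
    # hash() stands in for a fixed hashing function here.
    return [hash(g) % num_buckets for g in ngrams]

# e.g. char_ngram_bucket_ids(list("hello")) hashes the trigrams "hel", "ell", "llo"
```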
In this embodiment of the present application, the first classification model may obtain multiple first character category feature vectors and multiple first character feature vectors; in that case, it may average the multiple first character category feature vectors and the multiple first character feature vectors and use the corresponding averages as the final first character category feature vector and first character feature vector.
Step 203, extracting a plurality of first words from the input text, and generating a first word feature vector according to the plurality of first words.
In the embodiment of the application, the input text can be extracted according to a preset word extraction algorithm to obtain a plurality of first words, wherein the preset word extraction algorithm can be calibrated according to actual conditions.
Specifically, the first classification model may extract a plurality of first words from the preprocessed input text according to a preset word extraction algorithm, for example by splitting the preprocessed input text on spaces. The first classification model then extracts first word features from the plurality of first words and may encode (i.e., vectorize) the first word features by means of word embedding to obtain a first word feature vector.
In this embodiment of the application, the first classification model may obtain multiple first word feature vectors; in that case, the electronic device may average the multiple first word feature vectors and use the corresponding average as the final first word feature vector.
Step 204, generating a first language identification result according to the first character category feature vector, the first character feature vector and the first word feature vector.
Specifically, referring to fig. 3, the first classification model may include a hidden layer and a classifier. After obtaining the first character category feature vector, the first character feature vector and the first word feature vector, the first classification model may first concatenate the three feature vectors and feed the concatenated vector into the hidden layer, and then feed the data output by the hidden layer into the classifier, which performs the classification to obtain the first language identification result. In this way, the language of the input text can be effectively detected by the first classification model.
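Under the same caveats, the structure sketched in fig. 3 (concatenate the three averaged feature vectors, pass them through a hidden layer, then a classifier) could be expressed as the following PyTorch-style sketch; the dimensions, the ReLU activation and the plain softmax classifier are assumptions, not details taken from the patent.

```python
import torch
import torch.nn as nn

class FirstClassifierSketch(nn.Module):
    """Illustrative first classification model: three feature vectors -> hidden layer -> classifier."""
    def __init__(self, feat_dim: int = 128, hidden_dim: int = 256, num_languages: int = 100):
        super().__init__()
        self.hidden = nn.Linear(3 * feat_dim, hidden_dim)      # concatenated char-category, char, word vectors
        self.classifier = nn.Linear(hidden_dim, num_languages)

    def forward(self, char_cat_vec, char_vec, word_vec):
        x = torch.cat([char_cat_vec, char_vec, word_vec], dim=-1)
        h = torch.relu(self.hidden(x))
        return torch.softmax(self.classifier(h), dim=-1)       # probability distribution over languages
```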
For clarity of the above embodiment, in an embodiment of the present application, as shown in fig. 4, the second classification model may perform language detection on the input text to generate a second language recognition result by the following steps:
step 401, a plurality of second characters are generated according to the input text.
In the embodiment of the present application, a plurality of second characters may also be generated according to the input text and a preset character generation algorithm.
Step 402, generating corresponding second character category feature vectors according to the plurality of second characters, and generating second character feature vectors according to the plurality of second characters.
Specifically, after the electronic device determines that the first language identification result meets the preset condition, it may input the preprocessed input text into the second classification model. The second classification model can generate a plurality of second characters according to the preprocessed input text and a preset character generation algorithm, then classify the plurality of second characters according to their character encodings, count the distribution of the plurality of second characters over different character categories to obtain second character category features, and encode (i.e., vectorize) the second character category features by means of word embedding to obtain second character category feature vectors.
Meanwhile, the second classification model can also extract character-level n-gram features from the plurality of second characters and compute their hash values to obtain the second character features. The second classification model may then encode (i.e., vectorize) the second character features using word embedding to obtain a second character feature vector.
In this embodiment of the application, the second classification model may obtain multiple second character category feature vectors and multiple second character feature vectors; in that case, it may average them and use the corresponding averages as the final second character category feature vector and second character feature vector.
Step 403, extracting a plurality of second words from the input text, and generating a second word feature vector and a word feature vector according to the plurality of second words.
In the embodiment of the present application, the input text may also be extracted according to a preset word extraction algorithm to obtain a plurality of second words.
Specifically, the second classification model may extract a plurality of second words from the preprocessed input text according to a preset word extraction algorithm, for example by splitting the preprocessed input text on spaces. The second classification model then extracts second word features from the plurality of second words and may encode (i.e., vectorize) the second word features using word embedding to obtain a second word feature vector.
Meanwhile, the second classification model can also extract word-level n-gram features from the plurality of second words and compute their hash values to obtain the word features. The second classification model may then encode (i.e., vectorize) the word features using word embedding to obtain a word feature vector.
In this embodiment of the application, the second classification model may obtain multiple second word feature vectors and multiple word feature vectors; in that case, it may average them and use the corresponding averages as the final second word feature vector and word feature vector.
Step 404, generating a second language identification result according to the second character category feature vector, the second character feature vector, the second word feature vector and the word feature vector.
Specifically, referring to fig. 5, the second classification model may also include a hidden layer and a classifier. After obtaining the second character category feature vector, the second character feature vector, the second word feature vector and the word feature vector, the second classification model may first concatenate the four feature vectors and feed the concatenated vector into the hidden layer, and then feed the data output by the hidden layer into the classifier, which performs the classification to obtain the second language identification result. Therefore, for input text containing similar languages, a second pass with the higher-precision second classification model improves the ability to distinguish similar languages and significantly improves the detection effect for them.
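Continuing the earlier sketch, the second classification model differs mainly in taking the word-level n-gram feature vector as a fourth input before the hidden layer; again the dimensions and activation are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class SecondClassifierSketch(nn.Module):
    """Illustrative second classification model: four feature vectors -> hidden layer -> classifier."""
    def __init__(self, feat_dim: int = 128, hidden_dim: int = 256, num_languages: int = 100):
        super().__init__()
        self.hidden = nn.Linear(4 * feat_dim, hidden_dim)      # adds the word-level n-gram feature vector
        self.classifier = nn.Linear(hidden_dim, num_languages)

    def forward(self, char_cat_vec, char_vec, word_vec, word_ngram_vec):
        x = torch.cat([char_cat_vec, char_vec, word_vec, word_ngram_vec], dim=-1)
        h = torch.relu(self.hidden(x))
        return torch.softmax(self.classifier(h), dim=-1)
```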
In order to improve the language detection effect of the short text, in an embodiment of the present application, before invoking the first classification model to perform language detection on the input text to generate the first language recognition result, the method may further include: and matching the input text with a preset word list, and if the corresponding word list is matched, taking the language corresponding to the word list as a recognition result.
It should be noted that the preset vocabulary described in this embodiment may be generated in advance and pre-stored in the storage space of the electronic device, so as to facilitate the retrieval and use.
Specifically, after the electronic device obtains the preprocessed input text, a preset word list can be called from the storage space of the electronic device and the preprocessed input text matched against it; if a corresponding entry is matched, the language corresponding to the word list is used as the recognition result and can be provided directly to the user. Performing language detection on the input text with a preset word list before calling the first classification model in this way compensates for the weakness of classification models on short texts and significantly improves the language detection effect for short text.
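A small sketch of this word-list lookup performed before the first classification model is called; the example entries and the exact matching rule (whole-text lookup) are assumptions for illustration only.

```python
# Hypothetical preset word lists keyed by language code.
PRESET_WORD_LISTS = {
    "de": {"danke", "bitte", "hallo"},
    "fr": {"merci", "bonjour"},
}

def match_word_list(preprocessed_text: str):
    """Return the language whose preset word list contains the short input text, else None."""
    for lang, words in PRESET_WORD_LISTS.items():
        if preprocessed_text in words:
            return lang
    return None   # fall through to the first classification model
```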
Fig. 6 is a schematic structural diagram of a language detection device according to an embodiment of the present application.
The language detection device according to the embodiment of the application may also be configured in an electronic device, so that a first classification model is invoked to perform language detection on the acquired input text to generate a first language identification result and, when the first language identification result meets a preset condition, a second classification model is invoked to perform language detection on the input text to generate a second language identification result.
As shown in fig. 6, the language detection device 600 may include: an acquisition module 610, a first detection module 620, and a second detection module 630.
The obtaining module 610 is configured to obtain an input text. It should be noted that the input text described in this embodiment may be text expressed in various written languages, for example Chinese text, English text, Russian text, Malay text, mixed Chinese-English text, and the like. The input text may contain a sentence, a paragraph, or a chapter, such as a news article.
In this embodiment of the application, the input text may include text information input by a user through speech recognition as well as content entered by the user into an input method system through an input method. The input method system may convert the entered content into word candidates of the input text according to the user's current input manner and present them for selection. The user may input text information through various input means, such as a keyboard, a touch pad, or a mouse, and may also choose any input manner, such as Pinyin, Wubi, stroke, handwriting, English, or a keypad, which is not limited herein.
As a possible scenario, the input text may further include text information obtained by the user through copy and paste.
Specifically, the obtaining module 610 may obtain the input information (input text) entered by the user into the input method system through the input method, for example, a piece of Chinese text typed via the input method.
The first detection module 620 is configured to invoke the first classification model to perform language detection on the input text to generate a first language identification result.
As a possible scenario, the first classification model may be improved based on an open-source FastText model.
It should be noted that the first classification model described in this embodiment may be trained in advance and pre-stored in a storage space of the electronic device to facilitate retrieval of the application, where the storage space is not limited to an entity-based storage space, such as a hard disk, and the storage space may also be a storage space of a network hard disk connected to the electronic device (cloud storage space).
In this embodiment of the application, after the obtaining module 610 obtains the input text, the first detecting module 620 may first preprocess the input text by removing punctuation, consecutive whitespace characters, Arabic numerals, emoticons, and the like, and converting the text to lowercase, thereby eliminating interference for subsequent language detection and further improving its accuracy.
Specifically, after the first detection module 620 obtains the preprocessed input text, the first classification model may be called from the storage space of the electronic device, and then the input text is input into the first classification model, so that the language of the preprocessed input text is detected by the first classification model, and the first language identification result output by the first classification model is obtained.
The second detecting module 630 is configured to, if the first language identification result meets a preset condition, invoke a second classification model to perform language detection on the input text to generate a second language identification result, where identification accuracy of the second classification model is higher than identification accuracy of the first classification model. The preset conditions can be calibrated according to actual conditions.
It should be noted that the second classification model described in this embodiment may likewise be trained in advance and pre-stored in the storage space of the electronic device to facilitate retrieval and use.
specifically, after the first detecting module 620 obtains the first language identification result, the second detecting module 630 may first determine whether the first language identification result meets a preset condition, if the first language identification result meets the preset condition, call a second classification model from a storage space of the second detecting module, and input the preprocessed input text into the second classification model, so as to perform language detection on the preprocessed input text through the second classification model, to obtain a second language identification result output by the second classification model, and use the second language identification result as a final identification result.
In the embodiment of the application, the input text is obtained through the obtaining module, the first classification model is called through the first detection module to perform language detection on the input text to generate a first language identification result, and if the first language identification result meets a preset condition, the second classification model is called through the second detection module to perform language detection on the input text to generate a second language identification result. Therefore, the language of the input text can be effectively detected, and the accuracy of the language identification result is improved.
In an embodiment of the present application, the first detecting module 620 is specifically configured to: generating a plurality of first characters according to an input text; generating corresponding first character category feature vectors according to the first characters, and generating first character feature vectors according to the first characters; extracting a plurality of first words from the input text and generating a first word feature vector according to the first words; and generating a first language recognition result according to the first character category feature vector, the first character feature vector and the first word feature vector.
In an embodiment of the present application, the second detecting module 630 is specifically configured to: generating a plurality of second characters according to the input text; generating corresponding second character category feature vectors according to the plurality of second characters, and generating second character feature vectors according to the plurality of second characters; extracting a plurality of second words from the input text, and generating a second word feature vector and a word feature vector according to the plurality of second words; and generating a second language identification result according to the second character category feature vector, the second character feature vector, the second word feature vector and the word feature vector.
In an embodiment of the present application, as shown in fig. 6, the language detection apparatus 600 may further include a recognition module 640, where the recognition module 640 is configured to match the input text with a preset vocabulary before the first detection module invokes the first classification model to perform language detection on the input text to generate a first language recognition result; and if the word list is matched with the corresponding word list, taking the language corresponding to the word list as a recognition result.
In an embodiment of the present application, the second detecting module 630 determines whether the first language identification result satisfies the preset condition by: querying the similar-language groups according to the first language identification result to determine whether a similar language exists; if so, determining that the first language identification result satisfies the preset condition; if not, determining that the first language identification result does not satisfy the preset condition.
In one embodiment of the present application, the second detection module 630 is further configured to: and if the first language identification result does not meet the preset condition, taking the first language identification result as a final identification result.
It should be noted that the foregoing explanation for the embodiment of the language detection method is also applicable to the language detection apparatus of the embodiment, and is not repeated herein.
The language detection device of the embodiment of the application acquires the input text through the acquisition module, calls the first classification model through the first detection module to perform language detection on the input text so as to generate a first language identification result, and calls the second classification model through the second detection module to perform language detection on the input text so as to generate a second language identification result if the first language identification result meets the preset condition. Therefore, the language of the input text can be effectively detected, and the accuracy of the language identification result is improved.
The present disclosure also provides an electronic device, a readable storage medium, and a computer program product according to embodiments of the present disclosure.
FIG. 7 illustrates a schematic block diagram of an example electronic device 700 that can be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. Electronic devices may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions are meant to be examples only and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 7, the device 700 comprises a computing unit 701, which may perform various suitable actions and processes according to a computer program stored in a Read Only Memory (ROM) 702 or a computer program loaded from a storage unit 708 into a Random Access Memory (RAM) 703. In the RAM 703, various programs and data required for the operation of the device 700 can also be stored. The computing unit 701, the ROM 702, and the RAM 703 are connected to each other by a bus 704. An input/output (I/O) interface 705 is also connected to the bus 704.
Various components in the device 700 are connected to the I/O interface 705, including: an input unit 706 such as a keyboard, a mouse, or the like; an output unit 707 such as various types of displays, speakers, and the like; a storage unit 708 such as a magnetic disk, optical disk, or the like; and a communication unit 709 such as a network card, modem, wireless communication transceiver, etc. The communication unit 709 allows the device 700 to exchange information/data with other devices via a computer network, such as the internet, and/or various telecommunication networks.
The computing unit 701 may be any of a variety of general purpose and/or special purpose processing components with processing and computing capabilities. Some examples of the computing unit 701 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, and so forth. The computing unit 701 executes the respective methods and processes described above, such as the language detection method. For example, in some embodiments, the language detection method may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as the storage unit 708. In some embodiments, part or all of the computer program may be loaded onto and/or installed onto the device 700 via the ROM 702 and/or the communication unit 709. When the computer program is loaded into the RAM 703 and executed by the computing unit 701, one or more steps of the language detection method described above may be performed. Alternatively, in other embodiments, the computing unit 701 may be configured to perform the language detection method by any other suitable means (e.g., by means of firmware).
Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuitry, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on a Chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. These program codes may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program codes, when executed by the processor or controller, cause the functions/operations specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), Wide Area Networks (WANs), the internet, and blockchain networks.
The computer system may include clients and servers. A client and a server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server can be a cloud server, also called a cloud computing server or cloud host, which is a host product in the cloud computing service system intended to overcome the defects of difficult management and weak service scalability of traditional physical hosts and VPS ("Virtual Private Server") services. The server may also be a server of a distributed system, or a server incorporating a blockchain.
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present disclosure may be executed in parallel, sequentially, or in different orders, as long as the desired results of the technical solutions disclosed in the present disclosure can be achieved, and the present disclosure is not limited herein.
The above detailed description should not be construed as limiting the scope of the disclosure. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present disclosure should be included in the scope of protection of the present disclosure.

Claims (15)

1. A language detection method comprises the following steps:
acquiring an input text;
calling a first classification model to perform language detection on the input text to generate a first language identification result; and
if the first language identification result meets a preset condition, calling a second classification model to perform language detection on the input text to generate a second language identification result, wherein the identification precision of the second classification model is higher than that of the first classification model.
2. The language detection method of claim 1, wherein said first classification model performs language detection on said input text to generate a first language identification result by:
generating a plurality of first characters according to the input text;
generating corresponding first character category feature vectors according to the first characters, and generating first character feature vectors according to the first characters;
extracting a plurality of first words from the input text and generating a first word feature vector according to the first words; and
generating the first language identification result according to the first character category feature vector, the first character feature vector and the first word feature vector.
3. The language detection method of claim 1, wherein said second classification model performs language detection on said input text to generate a second language identification result by:
generating a plurality of second characters according to the input text;
generating corresponding second character category feature vectors according to the second characters, and generating second character feature vectors according to the second characters;
extracting a plurality of second words from the input text, and generating a second word feature vector and a word feature vector according to the plurality of second words; and
generating the second language identification result according to the second character category feature vector, the second character feature vector, the second word feature vector and the word feature vector.
4. The language detection method as claimed in claim 1, wherein before said invoking the first classification model to perform language detection on the input text to generate the first language identification result, further comprising:
matching the input text with a preset word list; and
if the corresponding word list is matched, taking the language corresponding to the word list as a recognition result.
5. The language detection method as claimed in claim 1, wherein the first language identification result is determined to satisfy a predetermined condition by:
inquiring similar language groups according to the first language identification result to judge whether similar languages exist or not;
if yes, judging that the first language identification result meets a preset condition; and
if not, judging that the first language identification result does not meet the preset condition.
6. The language detection method according to claim 1 or 5, further comprising:
if the first language identification result does not meet the preset condition, taking the first language identification result as a final identification result.
7. A language detection device, comprising:
the acquisition module is used for acquiring an input text;
the first detection module is used for calling a first classification model to carry out language detection on the input text so as to generate a first language identification result; and
the second detection module is used for calling a second classification model to perform language detection on the input text to generate a second language identification result if the first language identification result meets a preset condition, wherein the identification precision of the second classification model is higher than that of the first classification model.
8. The language detection device of claim 7, wherein the first detection module is specifically configured to:
generating a plurality of first characters according to the input text;
generating corresponding first character category feature vectors according to the first characters, and generating first character feature vectors according to the first characters;
extracting a plurality of first words from the input text and generating a first word feature vector according to the first words; and
generating the first language identification result according to the first character category feature vector, the first character feature vector and the first word feature vector.
9. The language detection device of claim 7, wherein the second detection module is specifically configured to:
generating a plurality of second characters according to the input text;
generating corresponding second character category feature vectors according to the second characters, and generating second character feature vectors according to the second characters;
extracting a plurality of second words from the input text, and generating a second word feature vector and a word feature vector according to the plurality of second words; and
generating the second language identification result according to the second character category feature vector, the second character feature vector, the second word feature vector and the word feature vector.
10. The language detection device of claim 7, further comprising:
a recognition module, configured to match the input text with a preset word list before the first detection module calls the first classification model to perform language detection on the input text to generate the first language identification result, and, if a corresponding word list is matched, to take the language corresponding to the word list as the recognition result.
11. The language detection device of claim 7, wherein the second detection module determines whether the first language identification result satisfies the preset condition by:
querying similar language groups according to the first language identification result to determine whether a similar language exists;
if so, determining that the first language identification result satisfies the preset condition; and
if not, determining that the first language identification result does not satisfy the preset condition.
12. The language detection device of claim 7 or 11, wherein the second detection module is further configured to:
if the first language identification result does not satisfy the preset condition, taking the first language identification result as the final identification result.
13. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the language detection method of any one of claims 1-6.
14. A non-transitory computer readable storage medium storing computer instructions for causing a computer to perform the language detection method of any one of claims 1-6.
15. A computer program product comprising a computer program which, when executed by a processor, implements a language detection method according to any one of claims 1-6.
CN202011540408.XA 2020-12-23 2020-12-23 Language detection method and device, electronic equipment and storage medium Pending CN112528682A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011540408.XA CN112528682A (en) 2020-12-23 2020-12-23 Language detection method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN112528682A true CN112528682A (en) 2021-03-19

Family

ID=74976552

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011540408.XA Pending CN112528682A (en) 2020-12-23 2020-12-23 Language detection method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN112528682A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114926847A (en) * 2021-12-06 2022-08-19 百度在线网络技术(北京)有限公司 Image processing method, device, equipment and storage medium for minority language

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106598937A (en) * 2015-10-16 2017-04-26 阿里巴巴集团控股有限公司 Language recognition method and device for text and electronic equipment
CN106959943A (en) * 2016-01-11 2017-07-18 阿里巴巴集团控股有限公司 Languages recognize update method and device
US20200242302A1 (en) * 2019-01-29 2020-07-30 Ricoh Company, Ltd. Intention identification method, intention identification apparatus, and computer-readable recording medium
CN111027528A (en) * 2019-11-22 2020-04-17 华为技术有限公司 Language identification method and device, terminal equipment and computer readable storage medium
CN111079408A (en) * 2019-12-26 2020-04-28 北京锐安科技有限公司 Language identification method, device, equipment and storage medium
CN111539207A (en) * 2020-04-29 2020-08-14 北京大米未来科技有限公司 Text recognition method, text recognition device, storage medium and electronic equipment
CN111724766A (en) * 2020-06-29 2020-09-29 合肥讯飞数码科技有限公司 Language identification method, related equipment and readable storage medium

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
LAWRENCE A. KLEIN: "传感器数据融合理论及应用" [Sensor and Data Fusion Theory and Applications], 29 February 2004, 北京理工大学出版社 (Beijing Institute of Technology Press), pages: 58 *
孙茂松 (SUN Maosong): "自然语言处理研究前沿" [Frontiers of Natural Language Processing Research], 31 December 2019, 上海交通大学出版社 (Shanghai Jiao Tong University Press), pages: 266 *
张琳琳; 杨雅婷; 陈沾衡; 潘一荣; 李毓: "基于深度学习的相似语言短文本的语种识别方法" [Language identification method for short texts in similar languages based on deep learning], 计算机应用与软件 (Computer Applications and Software), no. 02 *

Similar Documents

Publication Publication Date Title
CN112800919A (en) Method, device and equipment for detecting target type video and storage medium
EP4123496A2 (en) Method and apparatus for extracting text information, electronic device and storage medium
CN114416943A (en) Training method and device for dialogue model, electronic equipment and storage medium
CN112528641A (en) Method and device for establishing information extraction model, electronic equipment and readable storage medium
US20230103728A1 (en) Method for sample augmentation
CN114429633A (en) Text recognition method, model training method, device, electronic equipment and medium
CN113553412A (en) Question and answer processing method and device, electronic equipment and storage medium
US20230096921A1 (en) Image recognition method and apparatus, electronic device and readable storage medium
CN112699237B (en) Label determination method, device and storage medium
CN114021548A (en) Sensitive information detection method, training method, device, equipment and storage medium
CN112528682A (en) Language detection method and device, electronic equipment and storage medium
CN116383382A (en) Sensitive information identification method and device, electronic equipment and storage medium
CN114417029A (en) Model training method and device, electronic equipment and storage medium
CN113221566B (en) Entity relation extraction method, entity relation extraction device, electronic equipment and storage medium
CN112560848B (en) Training method and device for POI (Point of interest) pre-training model and electronic equipment
CN113886543A (en) Method, apparatus, medium, and program product for generating an intent recognition model
CN114692778A (en) Multi-modal sample set generation method, training method and device for intelligent inspection
CN114416974A (en) Model training method and device, electronic equipment and storage medium
CN114444514A (en) Semantic matching model training method, semantic matching method and related device
CN114254650A (en) Information processing method, device, equipment and medium
CN113408269A (en) Text emotion analysis method and device
CN114078274A (en) Face image detection method and device, electronic equipment and storage medium
CN113887630A (en) Image classification method and device, electronic equipment and storage medium
CN112784599A (en) Poetry sentence generation method and device, electronic equipment and storage medium
CN113051396A (en) Document classification identification method and device and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination