CN112749556B - Multi-language model training method and device, storage medium and electronic equipment - Google Patents

Multi-language model training method and device, storage medium and electronic equipment

Info

Publication number
CN112749556B
CN112749556B (application CN202010774741.0A / CN202010774741A)
Authority
CN
China
Prior art keywords
vector
language
sentence
model
participle
Prior art date
Legal status
Active
Application number
CN202010774741.0A
Other languages
Chinese (zh)
Other versions
CN112749556A (en)
Inventor
童丽霞
雷植程
杨念民
Current Assignee
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202010774741.0A priority Critical patent/CN112749556B/en
Publication of CN112749556A publication Critical patent/CN112749556A/en
Application granted granted Critical
Publication of CN112749556B publication Critical patent/CN112749556B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G06F 40/289: Phrasal analysis, e.g. finite state techniques or chunking (G06F 40/20 Natural language analysis; G06F 40/279 Recognition of textual entities)
    • G06F 16/35: Clustering; Classification (G06F 16/30 Information retrieval of unstructured textual data)
    • G06F 40/126: Character encoding (G06F 40/12 Use of codes for handling textual entities)
    • G06F 40/216: Parsing using statistical methods (G06F 40/205 Parsing)
    • G06F 40/30: Semantic analysis
    • G06N 3/045: Combinations of networks (G06N 3/04 Architecture, e.g. interconnection topology)
    • G06N 3/08: Learning methods (G06N 3/02 Neural networks)

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Data Mining & Analysis (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • Evolutionary Computation (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Probability & Statistics with Applications (AREA)
  • Databases & Information Systems (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a multi-language model training method and device, a storage medium, and an electronic device. The method comprises the following steps: inputting a multilingual corpus package and a multilingual shared vocabulary into a language model to be trained, and training the language model to be trained to obtain a pre-trained language model, where the multilingual corpus package contains corpora of multiple languages and the multilingual shared vocabulary stores the set of word segments obtained by segmenting the multilingual corpus package; adjusting the pre-trained language model with a first corpus set of a first language carrying word-segmentation labels and a second corpus set of a second language carrying word-segmentation labels to obtain an intention recognition model, where the intention recognition model is used to recognize the semantics expressed by sentences of the first language and the second language and the relations between those semantics; and inputting sentences of the multiple languages into the intention recognition model to obtain a target multi-language model, where the target multi-language model is used to recognize the semantics expressed by sentences of the multiple languages and the relations between those semantics.

Description

Multi-language model training method and device, storage medium and electronic equipment
Technical Field
The invention relates to the field of computers, in particular to a method and a device for training a multi-language model, a storage medium and electronic equipment.
Background
With the gradual maturation of AI technology, intelligent customer service products are increasingly offered as ToB services through unified external service platforms. When facing customers of different nationalities who speak different languages, the customer service robot needs to correctly identify the user's intent.
At present, the common approach for products on the market is to train a separate model for each language. However, languages with little corpus data yield poor classification performance, and the user's intent is difficult to understand correctly.
For the problem in the related art that intent recognition and classification for customers of different nationalities and languages performs poorly when a given language has little corpus data, no effective solution has yet been proposed.
Disclosure of Invention
The embodiments of the invention provide a multi-language model training method and apparatus, a storage medium, and an electronic device, so as to at least solve the problem in the related art that, when intent recognition and classification are performed for customers of different nationalities and languages, the classification performance is poor because some individual languages have little corpus data.
According to an aspect of an embodiment of the present invention, a method for training a multi-language model is provided, including: inputting a multilingual corpus package and a multilingual shared vocabulary into a language model to be trained, and training the language model to be trained to obtain a pre-trained language model, where the pre-trained language model is used to perform semantic recognition on corpora of multiple languages, the multilingual corpus package is a corpus package containing the multiple languages, and the multilingual shared vocabulary stores the set of word segments obtained by segmenting the multilingual corpus package; adjusting the pre-trained language model with a first corpus set of a first language carrying word-segmentation labels and a second corpus set of a second language carrying word-segmentation labels to obtain an intention recognition model, where the intention recognition model is used to recognize the semantics expressed by sentences of the first language and the second language and the relations between those semantics, and the multiple languages include the first language and the second language; and inputting sentences of the multiple languages into the intention recognition model to obtain a target multi-language model, where the target multi-language model is used to recognize the semantics expressed by sentences of the multiple languages and the relations between those semantics.
According to another aspect of the embodiments of the present invention, a multi-language model training apparatus is provided, including: a first input unit, configured to input a multilingual corpus package and a multilingual shared vocabulary into a language model to be trained and train the language model to be trained to obtain a pre-trained language model, where the pre-trained language model is used to perform semantic recognition on corpora of multiple languages, the multilingual corpus package is a corpus package containing the multiple languages, and the multilingual shared vocabulary stores the set of word segments obtained by segmenting the multilingual corpus package; a first processing unit, configured to adjust the pre-trained language model with a first corpus set of a first language carrying word-segmentation labels and a second corpus set of a second language carrying word-segmentation labels to obtain an intention recognition model, where the intention recognition model is used to recognize the semantics expressed by sentences of the first language and the second language and the relations between those semantics, and the multiple languages include the first language and the second language; and a second processing unit, configured to input sentences of the multiple languages into the intention recognition model to obtain a target multi-language model, where the target multi-language model is used to recognize the semantics expressed by sentences of the multiple languages and the relations between those semantics.
According to another aspect of the embodiments of the present invention, there is also provided a computer-readable storage medium, in which a computer program is stored, wherein the computer program is configured to execute the above training method for multiple language models when running.
According to another aspect of the embodiments of the present invention, there is also provided an electronic device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor executes the method for training the multi-language model through the computer program.
With the invention, the multilingual corpus package is segmented to obtain the multilingual shared vocabulary, and the corpus package together with the shared vocabulary is input into the language model to be trained, which is trained to obtain the pre-trained language model. The pre-trained language model is then adjusted with a first corpus set of a first language carrying word-segmentation labels and a second corpus set of a second language carrying word-segmentation labels to obtain an intention recognition model that can recognize the semantics expressed by sentences of the first and second languages and the relations between those semantics, where the multiple languages include the first language and the second language. Finally, sentences of the multiple languages are input into the intention recognition model so that its ability to recognize intent in the first and second languages is generalized to the target multi-language model; the resulting target multi-language model can recognize the semantics expressed by sentences of the multiple languages and the relations between those semantics. In this way, the pre-trained language model is adjusted with only the labeled first-language and second-language corpus sets to obtain the intention recognition model, and the target multi-language model then acquires, under training with corpus sets of the multiple languages that carry no segmentation labels, the ability to recognize the semantics of sentences in all of those languages and the relations between those semantics. This solves the problem in the related art that, when intent recognition and classification are performed for customers of different nationalities and languages, the classification performance is poor because some individual languages have little corpus data.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the invention and together with the description serve to explain the invention without limiting the invention. In the drawings:
FIG. 1 is a diagram illustrating an application environment of a multi-language model training method according to an embodiment of the present invention;
FIG. 2 is a flow chart illustrating an alternative multi-lingual model training method according to an embodiment of the present invention;
FIG. 3 is a diagram illustrating an application environment of another multi-language model training method according to an embodiment of the present invention;
FIG. 4 is a flow chart illustrating an alternative multi-language model training method according to an embodiment of the present invention;
FIG. 5 is a flow chart illustrating an alternative multi-lingual model training method according to an embodiment of the present invention;
FIG. 6 is a schematic diagram of an alternative intent recognition model according to an embodiment of the present invention;
FIG. 7 is a schematic diagram of an alternative multi-language model training apparatus according to an embodiment of the present invention;
fig. 8 is a schematic structural diagram of an alternative electronic device according to an embodiment of the invention.
Detailed Description
In order to make those skilled in the art better understand the technical solutions of the present invention, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be noted that the terms "first," "second," and the like in the description and claims of the present invention and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the invention described herein are capable of operation in other sequences than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
Technical terms involved in the embodiments of the present invention include:
1. Transfer learning: a branch of machine learning in which a model developed for task A is taken as the starting point and reused when developing a model for task B.
2. Bert: Bidirectional Encoder Representations from Transformers, a bidirectional encoding representation based on the Transformer model.
3. TextCNN: a convolutional neural network (CNN) applied to text classification tasks. Convolution kernels of several different sizes are used to extract key information from a sentence (similar to n-grams with multiple window sizes), so that local correlations can be captured better.
4. Word2vec: a group of related models used to generate word vectors. These models are shallow two-layer neural networks trained to reconstruct the linguistic context of words: the network takes a word as input and predicts the words in adjacent positions, and under the bag-of-words assumption in word2vec the order of those words is unimportant. After training, the word2vec model can map each word to a vector that represents word-to-word relationships; the vector is the hidden layer of the neural network.
5. Fine-tune: performing a number of subtasks on the basis of an already trained model. Compared with training from scratch, this saves a large amount of computing resources and time and improves computational efficiency and even accuracy.
According to an aspect of an embodiment of the present invention, a method for training a multi-language model is provided. Optionally, the method can be applied, but is not limited, to the application environment shown in fig. 1. As shown in fig. 1, a terminal device 102 inputs a multilingual corpus package and a multilingual shared vocabulary into a language model to be trained, and a server 104 trains the language model to be trained to obtain a pre-trained language model, where the pre-trained language model is used to perform semantic recognition on corpora of multiple languages, the multilingual corpus package is a corpus package containing the multiple languages, and the multilingual shared vocabulary stores the set of word segments obtained by segmenting the multilingual corpus package. The server 104 adjusts the pre-trained language model with a first corpus set of a first language carrying word-segmentation labels and a second corpus set of a second language carrying word-segmentation labels to obtain an intention recognition model, where the intention recognition model is used to recognize the semantics expressed by sentences of the first language and the second language and the relations between those semantics, and the multiple languages include the first language and the second language. The server 104 inputs sentences of the multiple languages into the intention recognition model to obtain a target multi-language model, where the target multi-language model is used to recognize the semantics expressed by sentences of the multiple languages and the relations between those semantics. The above is merely an example, and the embodiments of the present application are not limited here.
Artificial Intelligence (AI) is a theory, method, technique and application system that uses a digital computer or a machine controlled by a digital computer to simulate, extend and expand human Intelligence, perceive the environment, acquire knowledge and use the knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive technique of computer science that attempts to understand the essence of intelligence and produce a new intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence is the research of the design principle and the realization method of various intelligent machines, so that the machines have the functions of perception, reasoning and decision making.
Artificial intelligence technology is a comprehensive discipline covering a wide range of fields, including both hardware-level and software-level technologies. The basic artificial intelligence infrastructure generally includes technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems, and mechatronics. Artificial intelligence software technology mainly includes computer vision, speech processing, natural language processing, and machine learning/deep learning.
Optionally, the method may be applied to an artificial intelligence natural language processing technology, machine learning/deep learning, for example, in a scenario where an intelligent customer service product processes multiple languages, and this embodiment is not limited in any way here.
It should be noted that Natural Language Processing (NLP) is an important direction in the fields of computer science and artificial intelligence. It studies theories and methods that enable effective communication between humans and computers in natural language. Natural language processing is a science that integrates linguistics, computer science, and mathematics; research in this field involves natural language, i.e. the language people use daily, and is therefore closely related to the study of linguistics. Natural language processing techniques typically include text processing, semantic understanding, machine translation, robot question answering, knowledge graphs, and the like.
Machine Learning (ML) is a multi-domain cross discipline, and relates to a plurality of disciplines such as probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory and the like. The method specially studies how a computer simulates or realizes the learning behavior of human beings so as to acquire new knowledge or skills and reorganize the existing knowledge structure to continuously improve the performance of the computer. Machine learning is the core of artificial intelligence, is the fundamental approach for computers to have intelligence, and is applied to all fields of artificial intelligence. Machine learning and deep learning generally include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and formal education learning.
Optionally, in this embodiment, the terminal device may be a terminal device configured with a target client, and may include, but is not limited to, at least one of the following: mobile phones (such as Android phones, iOS phones, etc.), notebook computers, tablet computers, palm computers, MID (Mobile Internet Devices), PAD, desktop computers, smart televisions, etc. The target client may be a video client, an instant messaging client, a browser client, an educational client, etc. Such networks may include, but are not limited to: a wired network, a wireless network, wherein the wired network comprises: a local area network, a metropolitan area network, and a wide area network, the wireless network comprising: bluetooth, WIFI, and other networks that enable wireless communication. The server may be a single server, a server cluster composed of a plurality of servers, or a cloud server. The above is only an example, and the present embodiment does not limit this.
Optionally, in this embodiment, as an optional implementation manner, the method may be executed by a server, or may be executed by a terminal device, or may be executed by both the server and the terminal device, and in this embodiment, the description is given by taking an example that the server (for example, the server 104) executes. As shown in fig. 2, the flow of the multi-language model training method may include the steps of:
step S202, inputting a multilingual speech packet and a multilingual shared vocabulary into a language model to be trained, and training the language model to be trained to obtain a pre-trained language model, wherein the pre-trained language model is used for performing semantic recognition on the linguistic data of the multiple languages, the multilingual speech packet is a speech packet comprising the multiple languages, and a participle set obtained by participle division on the multilingual speech packet is stored in the multilingual shared vocabulary.
Optionally, the multilingual corpus package may contain corpora in dozens of languages, such as English, Chinese, Indonesian, Arabic, and Turkish.
The multilingual corpus package is pre-trained to obtain a common multilingual pre-trained language model, and the multilingual shared vocabulary is a vocabulary shared among the multiple languages, which strengthens generalization across languages.
Step S204: using a first corpus set of a first language carrying word-segmentation labels and a second corpus set of a second language carrying word-segmentation labels, adjusting the pre-trained language model to obtain an intention recognition model, where the intention recognition model is used to recognize the semantics expressed by sentences of the first language and the second language and the relations between those semantics, and the multiple languages include the first language and the second language.
Optionally, the first language and the second language may be relatively common languages for which corpus sets with word-segmentation labels are relatively easy to obtain; for example, the first language may be English and the second language may be Chinese.
The multiple languages include the first language and the second language. For example, the multiple languages may be dozens of languages such as English, Chinese, Indonesian, Arabic, and Turkish, with English as the first language and Chinese as the second language.
Optionally, the pre-trained language model may be adjusted by using a first corpus set of a first language with word segmentation labels and a second corpus set of a second language with word segmentation labels, so as to obtain an intention recognition model.
For example, the pre-trained language model is adjusted using the labeled chinese and english corpus to obtain the intent recognition model.
The intention recognition model performs intent recognition on the sentences input to it. The application fields of intent recognition mainly involve the following scenarios: the search engine field; the dialog system field, where what the user wants is recognized from the intent, whether business (e.g. e-commerce, buying tickets, checking the weather) or chit-chat; the intelligent Internet of Things field; and the robotics field. In other words, intent recognition can be regarded as a classification problem: identifying the user intent corresponding to a sentence or utterance input by the user. For example, for "What will the weather be like tomorrow?" the corresponding intent is a weather query; for "Are there any good recommendations for a casual game?" the corresponding intent is a game consultation.
Step S206: inputting sentences of the multiple languages into the intention recognition model to obtain a target multi-language model, where the target multi-language model is used to recognize the semantics expressed by sentences of the multiple languages and the relations between those semantics.
Optionally, the intention recognition model is migrated directly to the languages other than the first and second languages for intent recognition, so that those languages are basically usable with zero samples; the resulting target multi-language model can recognize the semantics expressed by sentences of the multiple languages and the relations between those semantics.
Through this embodiment, the multilingual corpus package is segmented to obtain the multilingual shared vocabulary, the corpus package and the shared vocabulary are input into the language model to be trained, and that model is trained to obtain the pre-trained language model. The pre-trained language model is then adjusted with a first corpus set of a first language carrying word-segmentation labels and a second corpus set of a second language carrying word-segmentation labels to obtain an intention recognition model, which can recognize the semantics expressed by sentences of the first and second languages and the relations between those semantics, where the multiple languages include the first language and the second language. Finally, sentences of the multiple languages are input into the intention recognition model so that its ability to recognize intent in the first and second languages is generalized to the target multi-language model; the resulting target multi-language model can recognize the semantics expressed by sentences of the multiple languages and the relations between those semantics. In this way, the pre-trained language model is adjusted with only the labeled first-language and second-language corpus sets to obtain the intention recognition model, and the target multi-language model then acquires, under training with corpus sets of the multiple languages that carry no segmentation labels, the ability to recognize the semantics of sentences in all of those languages and the relations between those semantics. This solves the problem in the related art that, when intent recognition and classification are performed for customers of different nationalities and languages, the classification performance is poor because some individual languages have little corpus data.
Optionally, in this embodiment, before the multilingual corpus package and the multilingual shared vocabulary are input into the language model to be trained and the language model to be trained is trained to obtain the pre-trained language model, the method further includes: segmenting the words in the multilingual corpus package and determining the word frequency of each word segment in the package; and determining the set of word segments whose word frequency is greater than or equal to a preset threshold as the multilingual shared vocabulary.
In other words, the multilingual corpus package is segmented, the word frequency of each word segment in the package is determined, and the set of word segments whose frequency is greater than or equal to the preset threshold is taken as the multilingual shared vocabulary.
For example, after the multilingual corpus package is segmented, the frequency of each word segment is calculated, the word segments whose frequency is smaller than the preset threshold are removed, and the remaining word segments are collected into the multilingual shared vocabulary.
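By way of a non-limiting illustration, the following is a minimal Python sketch of building a shared vocabulary by frequency filtering; the tokenizer, threshold, and toy corpus here are assumptions for the example and are not part of the claimed method.

```python
from collections import Counter

def build_shared_vocab(corpus_sentences, tokenize, min_freq=5):
    """Segment a multilingual corpus and keep only the word segments whose
    frequency is at least min_freq; the surviving segments form the shared vocabulary."""
    counts = Counter()
    for sentence in corpus_sentences:
        counts.update(tokenize(sentence))  # word segmentation
    # Word segments below the preset threshold are removed; the rest form the vocabulary.
    return [word for word, freq in counts.most_common() if freq >= min_freq]

# Hypothetical usage with a whitespace tokenizer and a tiny mixed-language corpus:
corpus = ["how is the weather tomorrow", "天气 怎么样", "how is the weather"]
print(build_shared_vocab(corpus, str.split, min_freq=2))
```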
Through this embodiment, the vocabulary is shared among the multiple languages via the multilingual shared vocabulary, which improves cross-language generalization.
Optionally, in this embodiment, before the pre-trained language model is adjusted with the first corpus set of the first language carrying word-segmentation labels and the second corpus set of the second language carrying word-segmentation labels, the method further includes performing the following steps on a sentence in the first corpus set or the second corpus set: segmenting the sentence to obtain a sentence vector corresponding to the sentence, where the sentence vector is composed of N word-segment vectors, and each of the N word-segment vectors includes word-sense information of the word segment corresponding to that vector and position information of that word segment, the word-sense information indicating the meaning of the word segment, the position information indicating the position of the word segment in the sentence, and N being an integer greater than 0; and inputting the sentence vector into the pre-trained language model.
In other words, before the pre-trained language model is adjusted with the first and second corpus sets, the following steps are performed on a sentence in the first corpus set or the second corpus set:
the sentence is segmented to obtain the corresponding sentence vector, which is composed of N word-segment vectors; each word-segment vector may carry the word-sense information of its word segment (representing the meaning of the word segment) and the position information of that word segment (representing where the segment occurs in the sentence), with N an integer greater than 0. The resulting sentence vector is finally input into the pre-trained language model to adjust it.
It should be noted that the above describes how one sentence in the first or second corpus set is turned into its sentence vector; the sentence vectors of the other sentences in the first or second corpus set are obtained in the same way.
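A minimal sketch of composing such a sentence vector from word-sense (embedding) and position information is given below; the vector dimension, vocabulary, and lookup tables are illustrative assumptions, not the embeddings actually learned by the model.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8                                         # vector dimension (illustrative)
vocab = {"[CLS]": 0, "how": 1, "is": 2, "the": 3, "weather": 4}
word_emb = rng.normal(size=(len(vocab), d))   # word-sense information per word segment
pos_emb = rng.normal(size=(32, d))            # position information per position

def sentence_vector(words):
    """Return an N x d sentence vector: one row per word segment, each row
    combining the segment's meaning embedding with its position embedding."""
    ids = [vocab[w] for w in words]
    return np.stack([word_emb[t] + pos_emb[i] for i, t in enumerate(ids)])

X = sentence_vector(["[CLS]", "how", "is", "the", "weather"])
print(X.shape)  # (5, 8): N word-segment vectors of dimension d
```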
Optionally, in this embodiment, adjusting the pre-trained language model with the first corpus set of the first language carrying word-segmentation labels and the second corpus set of the second language carrying word-segmentation labels to obtain the intention recognition model includes: inputting a first encoded vector corresponding to the sentence vector into a text classification model, where the first encoded vector is obtained by encoding the sentence vector; classifying the first encoded vector to obtain a first classification label for each word-segment vector in the first encoded vector; and determining that the adjustment of the pre-trained language model is complete, yielding the intention recognition model, when the sentences contained in the first and second corpus sets have been input into the pre-trained language model and the second classification labels corresponding to those sentences have been obtained.
In other words, after the sentence vector obtained in the above manner is encoded, the first encoded vector corresponding to the sentence vector is obtained; the first encoded vector is input into the text classification model and classified, giving the first classification label corresponding to each word-segment vector in the first encoded vector.
All sentences contained in the first and second corpus sets are input into the pre-trained language model to obtain the first encoded vectors corresponding to those sentences, and these first encoded vectors are input into the text classification model; when the second classification labels corresponding to the sentences contained in the first and second corpus sets are obtained, the adjustment of the pre-trained language model is complete and the intention recognition model is obtained.
Optionally, in this embodiment, before the first encoded vector corresponding to the sentence vector is input into the text classification model, the method further includes: encoding the sentence vector to obtain the first encoded vector.
The following describes a specific process of encoding the sentence vector to obtain the first encoded vector.
Optionally, in this embodiment, encoding the sentence vector to obtain the first encoded vector includes: segmenting the sentence to obtain the corresponding sentence vector X = {w_1, w_2, …, w_i}, i = 1…N, where w_i is the i-th word segment in the sentence vector and w_1 is the CLS token, used to receive the hidden state of the sentence vector; encoding each word segment in the sentence vector X to obtain the second encoded vector XE = {x_1e_1, x_2e_2, …, x_ie_i}, i = 1…N, corresponding to X, where x_ie_i ∈ R^d, d is the vector dimension, and each word segment in X corresponds one-to-one to a vector in the second encoded vector; and encoding the second encoded vector of the i-th word segment according to the (i-1)-th and (i+1)-th word segments to obtain the first encoded vector {x_1b_1e_1, x_2b_2e_2, …, x_ib_ie_i}, i = 1…N, where the (i-1)-th word segment precedes the position of the i-th word segment in the sentence, the (i+1)-th word segment follows it, and each word segment in X corresponds one-to-one to a vector in the first encoded vector.
In other words, the sentence may be segmented to obtain its sentence vector X = {w_1, w_2, …, w_i}, i = 1…N, where w_i is the i-th word segment and w_1 is the CLS token used to receive the hidden state of the sentence vector.
The first word segment w_1 is always "[CLS]"; it receives the hidden state of the sentence vector generated after encoding, which is convenient for the model when performing downstream tasks (such as classification or entity extraction).
Optionally, after the sentence vector is obtained, it further needs to be converted, as follows:
each word segment in the sentence vector X = {w_1, w_2, …, w_i} is encoded to obtain the second encoded vector XE = {x_1e_1, x_2e_2, …, x_ie_i}, i = 1…N, corresponding to X, where x_ie_i ∈ R^d, d denotes the vector dimension, and each word segment in X corresponds one-to-one to a vector in the second encoded vector.
Optionally, after the second encoded vector is obtained through the above steps, the information of the word segments before and after the i-th word segment can be used simultaneously to weight the i-th word segment over the whole input sentence and obtain a new representation.
For example, the second encoded vector of the i-th word segment may be encoded according to the (i-1)-th and (i+1)-th word segments to obtain the first encoded vector {x_1b_1e_1, x_2b_2e_2, …, x_ib_ie_i}, i = 1…N, where each word segment in the sentence vector X corresponds one-to-one to a vector in the first encoded vector.
Here, the (i-1)-th word segment is the word segment preceding the position of the i-th word segment in the sentence, and the (i+1)-th word segment is the word segment following it.
In this way, by using the information of the word segments before and after the current position, each word segment is weighted over the whole input sequence to obtain a new representation; a richer meaning of each word segment is captured, and the final classification can be more accurate.
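The weighting described above can be sketched as a simple bidirectional, self-attention-style re-encoding in which each position is re-expressed as a weighted combination of all positions. The scoring function below is an illustrative assumption rather than the exact encoder of this embodiment.

```python
import numpy as np

def bidirectional_encode(XE):
    """XE: N x d matrix of word-segment embeddings (the second encoded vector).
    Each row is weighted over the whole sentence, so the i-th output mixes
    information from the segments before and after position i."""
    scores = XE @ XE.T / np.sqrt(XE.shape[1])                       # pairwise relevance
    weights = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)
    return weights @ XE                                             # first encoded vector (N x d)

XE = np.random.default_rng(1).normal(size=(5, 8))
XB = bidirectional_encode(XE)
print(XB.shape)  # (5, 8)
```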
Optionally, in this embodiment, after the sentences of the multiple languages are input into the intention recognition model to obtain the target multi-language model, the method further includes: inputting a target sentence into the target multi-language model; segmenting the target sentence to obtain a target sentence vector corresponding to it, where the target sentence vector includes a plurality of target word-segment vectors; converting the target sentence vector into a first target encoded vector corresponding to it; encoding the first target encoded vector to obtain a second target encoded vector; and inputting the second target encoded vector into a text classification model and classifying it to obtain the target classification label of each word-segment vector in the second target encoded vector.
Alternatively, the target sentence may be a sentence in any one of the plurality of languages.
After the target multi-language model is obtained, the trained model can be used to classify a target sentence, as follows:
first, the target sentence is input into the target multi-language model and segmented to obtain the corresponding target sentence vector, which includes a plurality of target word-segment vectors; the target sentence vector is then converted into the corresponding first target encoded vector (d-dimensional vectors), which is encoded to obtain a second target encoded vector; finally, the second target encoded vector is input into a text classification model and classified to obtain the target classification label of each word-segment vector in the second target encoded vector.
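Put together, inference with the trained target multi-language model can be sketched as the pipeline below; the component functions and the toy stand-ins in the usage example are placeholders for the parts described above, not a definitive implementation.

```python
def classify_sentence(sentence, tokenize, embed, encode, textcnn_classify):
    """Hypothetical end-to-end inference: segment -> sentence vector ->
    encoded vector -> classification labels."""
    segments = ["[CLS]"] + tokenize(sentence)   # word segmentation
    target_vec = embed(segments)                # first target encoded vector
    encoded = encode(target_vec)                # second target encoded vector
    return textcnn_classify(encoded)            # target classification label(s)

# Toy usage with stand-in components (purely illustrative):
label = classify_sentence(
    "how is the weather tomorrow",
    tokenize=str.split,
    embed=lambda segs: [[len(s)] for s in segs],
    encode=lambda vecs: vecs,
    textcnn_classify=lambda vecs: "query_weather",
)
print(label)
```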
After the target multi-language model is obtained in the above manner, the target multi-language model can be used in an intelligent customer service system, which can be understood as an application program based on a dialog system, and the intelligent customer service system can be installed in a robot, a terminal device (a mobile phone, a notebook computer, a tablet computer, a palm computer, an MID, a PAD, a desktop computer, an intelligent television, etc.) to perform intent recognition on a sentence or voice input by a user. The sentence or voice input by the user can be dozens of languages such as english, chinese, indonesia, arabic, turkish, and the like.
For example, as shown in fig. 3, a user inputs a sentence or voice to a client installed in a robot or a terminal device, an intelligent customer service sends the sentence or voice input by the user to a server, and the server performs intention recognition on the sentence or voice input by the user and outputs the corresponding sentence. It is understood that fig. 3 exemplifies that the input language is chinese, but the input language is not limited in this embodiment.
Through this embodiment, when a sentence of any one of the multiple languages is input into the trained target multi-language model, the target classification label of each word segment of the target sentence can be obtained. The target multi-language model uses segmentation-labeled data only for the first and second languages, i.e. relatively little labeled corpus, which avoids the dilemma of having to re-label and retrain a model for every language (hard to maintain, little corpus, poor performance). This can effectively solve the cold-start problem of intent recognition for minority languages and greatly reduce manual labeling cost.
It should be noted that the scheme commonly used by intelligent customer service products on the market for handling multiple languages is to label data for each language and then train an individual language model on the labeled corpus. When a customer accesses the service platform, one language is configured first, and a robot using the model of that language is assigned to the customer; if several languages are needed, several robots with different language models are configured and operated individually.
The above scheme has the following disadvantages:
1. Each language requires enough labeled data for its model to perform well, so a large amount of corpus data must be labeled manually, which consumes manpower;
2. Every time a language is added to the service platform, a model has to be built from scratch, so cold start of the added language is impossible;
3. Maintaining robots for multiple languages at the same time requires monitoring the state of every robot, and the operating cost is high.
In order to solve the above disadvantages, the following describes the flow of the multi-language model training method with reference to an optional example. The method comprises the following specific steps:
as shown in fig. 4, a flow chart of multi-language model training mainly includes the following three steps:
Step S301: collecting a multilingual corpus package, and pre-training on it with Bert to obtain a pre-trained language model.
Optionally, a corpus package of dozens of languages such as English, Chinese, Indonesian, Arabic, and Turkish is collected and pre-trained with Bert to obtain a common pre-trained language model, which lets the multiple languages share one vocabulary and thus strengthens cross-language generalization.
Step S302: the pre-trained language model is adjusted with labeled Chinese and English corpora to obtain the intention recognition model.
Optionally, the pre-trained language model may be fine-tuned with the labeled Chinese and English corpora to train the intention recognition model.
It should be noted that the above intention recognition can be understood as dividing sentences into corresponding intention categories by classification.
Step S303: through transfer learning, the intention recognition model directly applies its Chinese and English intent recognition capability to other languages with zero samples, yielding the target multi-language model.
Optionally, through transfer learning the intention recognition model achieves a cold start: intent recognition for other languages becomes possible without a large number of labeled samples, which greatly reduces manual effort.
Step S304: the target multi-language model is maintained jointly for the multiple languages.
Optionally, the target multi-language model can be updated and maintained with a small amount of labeled corpus from the multiple languages, which avoids monitoring, operating, and maintaining several robot customer services; using a single model also saves memory.
It should be noted that the pre-training process described below can be performed using 500 million unlabeled corpus entries of Mandarin Chinese and Cantonese.
In one possible embodiment, in an intelligent customer service application the platform already supports intent recognition for a certain game product Y in Chinese and English, and several languages such as Arabic, Turkish, and Russian need to be added. If the design approach of the Chinese-English robot were followed, separate word vectors for each additional language would have to be pre-trained from scratch, a corpus similar to the Chinese and English one would have to be labeled, and a separate intent classification model would have to be trained. That scheme is cumbersome, has a long cycle, cannot go online quickly, cannot draw on the existing shared knowledge structure, and scales poorly.
The following describes in detail a transfer-learning-based training method for a multi-language model. The aim is to let the machine learn a deep representation of the knowledge structure from the existing languages and transfer that structure to a new language, so that the new language achieves good performance after only a small amount of data is labeled. The detailed flow is shown in fig. 5:
step S401, capturing dozens of language document sets such as Arabic, Turkish, Russian and the like from a Wikipedia page by using a crawler tool, screening and filtering the document sets, eliminating useless data, and combining all Chinese and English conversations in the cloud intelligent service process in the last year to form a multilingual speech packet.
And step S402, after the multilingual speech material packet is subjected to word segmentation, calculating the word frequency of each word, eliminating the words with the frequency less than a threshold value, and sorting the rest words into a multilingual shared vocabulary list.
Step S403, inputting the multilingual speech packet and the multilingual shared vocabulary obtained in steps S401 and S402 into a Bert Base model in an unsupervised manner for training, so as to obtain a multilingual pre-training language model, where the Bert Base model may include 12 layers, 768 hidden units, 12 Attention heads, and 110M parameters. And is not limited herein.
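As one possible way to instantiate a Bert Base sized model, the sketch below uses the Hugging Face transformers library; this library choice and the shared-vocabulary size are assumptions for illustration and are not specified by this embodiment.

```python
from transformers import BertConfig, BertForMaskedLM

# Bert Base sized configuration: 12 layers, 768 hidden units, 12 attention heads
# (on the order of 110M parameters). vocab_size would come from the shared vocabulary.
config = BertConfig(
    vocab_size=120_000,        # assumed size of the multilingual shared vocabulary
    hidden_size=768,
    num_hidden_layers=12,
    num_attention_heads=12,
)
model = BertForMaskedLM(config)   # pre-trained with a masked-language-model objective
print(sum(p.numel() for p in model.parameters()))
```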
Step S404: the pre-trained language model obtained in step S403 is fine-tuned with the labeled Chinese and English corpora to train the intention recognition model Model_Intent_cls for game Y.
Step S405: Model_Intent_cls obtains a deep cross-language representation through joint Chinese-English training and is migrated directly to the other language versions of game Y for intent recognition, so that the other languages are basically usable with zero samples, achieving a cold start.
Step S406: problems of inaccurate model recognition during online use of the other languages are collected, and a small amount of data is labeled to fine-tune Model_Intent_cls, so that the model is updated iteratively and its accuracy keeps improving.
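A minimal sketch of such an iterative fine-tuning step is given below, using PyTorch; the model's forward signature, optimizer, and hyperparameters are assumptions for the example rather than the settings used in this embodiment.

```python
import torch

def fine_tune(model, labeled_batches, epochs=3, lr=2e-5):
    """Minimal fine-tuning loop: a small amount of newly labeled data is used to
    update the intent classifier so that accuracy keeps improving iteratively."""
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    loss_fn = torch.nn.CrossEntropyLoss()
    model.train()
    for _ in range(epochs):
        for input_ids, attention_mask, labels in labeled_batches:
            logits = model(input_ids, attention_mask)   # assumed forward signature
            loss = loss_fn(logits, labels)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return model
```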
It should be noted that the intention recognition model is built with Bert + TextCNN: after the word vectors are obtained with Bert, the text classification network TextCNN is further used to perform text classification and obtain the multi-class intent label.
The framework of the intent recognition model of Bert + TextCNN is shown in FIG. 6, and is as follows:
(1) input layer
The input layer has two main parts: the word representation itself (Word Embeddings) and the position information of the word (Position Embeddings). The user input sentence X is segmented to obtain X = {w_1, w_2, …, w_i}, where w_i denotes the i-th word of sentence X and the first word is always "[CLS]", used to receive the hidden state of the sentence vector generated after encoding, which is convenient for the model when performing downstream tasks (such as classification or entity extraction). After word embedding is performed on each word, XE = {x_1e_1, x_2e_2, …, x_ie_i}, i = 1…N, is obtained, where x_ie_i ∈ R^d and d is the vector dimension.
(2) Coding layer
In the coding layer, Bert's bidirectional Transformer encoding is performed; that is, the information of the words before and after the current position is used simultaneously, and each word is weighted over the whole input sequence to obtain a new representation. At the same time, 15% of all the words in the corpus are randomly selected and masked, which increases uncertainty, raises the capacity of the model, and enriches the word-level representation information.
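The random masking step can be sketched as follows; the mask rate matches the 15% mentioned above, while the token names and seed are illustrative assumptions.

```python
import random

MASK_TOKEN = "[MASK]"

def mask_words(segments, mask_rate=0.15, seed=None):
    """Randomly select ~15% of the word segments and replace them with [MASK],
    as in the masked-language-model pre-training objective described above."""
    rng = random.Random(seed)
    masked, targets = [], []
    for word in segments:
        if word != "[CLS]" and rng.random() < mask_rate:
            masked.append(MASK_TOKEN)
            targets.append(word)   # the model must recover the original word
        else:
            masked.append(word)
            targets.append(None)
    return masked, targets

print(mask_words(["[CLS]", "how", "is", "the", "weather", "tomorrow"], seed=3))
```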
(3) A classification layer
The rich word feature information produced by the Bert model in the coding layer is input into a TextCNN model, where further convolution and pooling capture more interaction information between words to obtain the final sentence vector final_sent_vec, and softmax then yields the final multi-class intent label.
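A minimal PyTorch sketch of such a TextCNN classification head over the encoder's word features is shown below; the kernel sizes, channel count, and class count are assumptions for illustration, not the values used in this embodiment.

```python
import torch
import torch.nn as nn

class TextCNNHead(nn.Module):
    """TextCNN classifier over Bert word features: convolutions with several
    kernel sizes, max-pooling into final_sent_vec, then a softmax over intents."""
    def __init__(self, d=768, num_classes=20, kernel_sizes=(2, 3, 4), channels=128):
        super().__init__()
        self.convs = nn.ModuleList(nn.Conv1d(d, channels, k) for k in kernel_sizes)
        self.fc = nn.Linear(channels * len(kernel_sizes), num_classes)

    def forward(self, word_features):             # (batch, seq_len, d) from Bert
        x = word_features.transpose(1, 2)         # (batch, d, seq_len) for Conv1d
        pooled = [conv(x).relu().max(dim=2).values for conv in self.convs]
        final_sent_vec = torch.cat(pooled, dim=1)  # concatenated pooled features
        return self.fc(final_sent_vec).softmax(dim=-1)   # multi-class intent label

probs = TextCNNHead()(torch.randn(2, 32, 768))
print(probs.shape)  # (2, 20)
```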
In the above manner, the pre-trained language model is fine-tuned with the Chinese, English, and mixed Chinese-English corpus sets respectively to obtain the intent recognition classification model Model_Intent_cls. Model_Intent_cls is then tested on Chinese, English, Arabic, Turkish, Russian, and Indonesian; each language test set contains 20 categories with 10 sentences per category, i.e. 200 test sentences, yielding the evaluation results shown in the table below. The experiments show a good transfer effect on the languages that did not participate in fine-tuning.
TABLE 1
(The evaluation results of Table 1 are provided as an image in the original publication.)
Through this embodiment, the transfer-learning-based target multi-language model uses a shared vocabulary and deep cross-language representations, so the Chinese-English model can be transferred to other languages. This avoids the dilemma of having to re-label and retrain a model for every language (hard to maintain, little corpus, poor performance), effectively solves the cold-start problem of intent recognition for minority languages, and greatly reduces manual labeling cost. As data accumulate, because the pre-trained language model has learned the semantic information of different languages during pre-training, it generalizes better, and its performance exceeds that of an ordinary model retrained directly on the same data.
It should be noted that for simplicity of description, the above-mentioned method embodiments are shown as a series of combinations of acts, but those skilled in the art will recognize that the present invention is not limited by the order of acts, as some steps may occur in other orders or concurrently in accordance with the invention. Further, those skilled in the art should also appreciate that the embodiments described in the specification are preferred embodiments and that the acts and modules referred to are not necessarily required by the invention.
According to another aspect of the embodiments of the present invention, there is also provided a multi-language model training apparatus, as shown in fig. 7, the apparatus including:
a first input unit 602, configured to input a multilingual corpus package and a multilingual shared vocabulary into a language model to be trained and train the language model to be trained to obtain a pre-trained language model, where the pre-trained language model is used to perform semantic recognition on corpora of multiple languages, the multilingual corpus package is a corpus package containing the multiple languages, and the multilingual shared vocabulary stores the set of word segments obtained by segmenting the multilingual corpus package;
a first processing unit 604, configured to adjust the pre-training language model by using a first corpus set of a first language with word segmentation and a second corpus set of a second language with word segmentation, so as to obtain an intention recognition model, where the intention recognition model is used to recognize a relationship between semantics and semantics represented by sentences of the first language and the second language, and the multiple languages include the first language and the second language;
a second processing unit 606, configured to input the sentences in the multiple languages into the intent recognition model, so as to obtain a target multi-language model, where the target multi-language model is used to recognize semantics represented by the sentences in the multiple languages and a relationship between the semantics.
With this embodiment, the multilingual corpus bag is segmented into words to obtain the multilingual shared vocabulary; the multilingual corpus bag and the multilingual shared vocabulary are input into the language model to be trained, and the model is trained to obtain the pre-trained language model. The pre-trained language model is then adjusted with the first corpus set of the first language with word segmentation labels and the second corpus set of the second language with word segmentation labels to obtain the intention recognition model, which can recognize the semantics expressed by sentences of the first and second languages and the relationship between those semantics (the multiple languages include the first language and the second language). Finally, sentences of the multiple languages are input into the intention recognition model so that its ability to perform intention recognition on the first and second languages is generalized to the target multi-language model; the resulting target multi-language model can identify the semantics represented by sentences of the multiple languages and the relationship between the semantics. In this way, the target multi-language model acquires this capability even for languages whose corpus sets carry no word segmentation labels, which solves the problem in the related art that intention recognition and classification for customers of different nationalities and languages performs poorly because the corpora of individual languages are scarce.
As an optional technical solution, the apparatus further includes: a first determining unit, configured to perform word segmentation on the multilingual corpus bag and determine the word frequency of each word in the multilingual corpus bag; and a second determining unit, configured to determine the participles whose word frequency is greater than or equal to a preset threshold as the multilingual shared vocabulary.
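A minimal sketch of what the first and second determining units compute is shown below; the segmentation function and the threshold value of 5 are illustrative assumptions rather than parameters fixed by this embodiment.

```python
from collections import Counter

def build_shared_vocab(multilingual_corpus, tokenize, min_freq=5):
    """Sketch of the frequency-threshold construction of the shared vocabulary.

    multilingual_corpus: iterable of raw sentences in several languages.
    tokenize: any word-segmentation function (language-specific or subword);
              which segmenter is used is not fixed by this sketch.
    min_freq: the 'preset threshold'; 5 is only an illustrative value.
    """
    freq = Counter()
    for sentence in multilingual_corpus:
        freq.update(tokenize(sentence))  # word frequency of each participle
    # Keep only participles whose frequency reaches the preset threshold.
    return {w for w, c in freq.items() if c >= min_freq}
```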
As an optional technical solution, the apparatus further includes: a third processing unit, configured to execute the following steps for a sentence in the first corpus set or the second corpus set: performing word segmentation on the sentence to obtain a sentence vector corresponding to the sentence, wherein the sentence vector is composed of N word segmentation vectors, and one word segmentation vector of the N word segmentation vectors includes: word sense information of the participle corresponding to the word segmentation vector, and position information of the participle, wherein the word sense information is used for indicating the meaning of the participle, the position information is used for indicating the position of the participle in the sentence, and N is an integer greater than 0; and a second input unit, configured to input the sentence vector into the pre-trained language model.
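The following sketch illustrates how such a participle vector can combine word sense information with position information, in the style of Bert input embeddings; the vocabulary size, dimension and maximum length are placeholder values, not parameters prescribed by this embodiment.

```python
import torch
import torch.nn as nn

class SentenceVectorizer(nn.Module):
    """Sketch: each of the N participle vectors carries word-sense and position information."""

    def __init__(self, vocab_size=30000, dim=768, max_len=512):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, dim)  # word sense information
        self.pos_emb = nn.Embedding(max_len, dim)       # position in the sentence

    def forward(self, token_ids):                        # token_ids: tensor of shape (N,)
        positions = torch.arange(token_ids.size(0))
        # Summing the two embeddings gives one participle vector per token.
        return self.token_emb(token_ids) + self.pos_emb(positions)
```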
As an optional technical solution, the first processing unit includes: a first input module, configured to input a first coding vector corresponding to the sentence vector into a text classification model, where the first coding vector is a vector obtained by coding the sentence vector; a first processing module, configured to classify the first coding vector to obtain a first classification label of each word segmentation vector in the first coding vector; and a first determining module, configured to determine that the pre-trained language model has been adjusted and the intention recognition model obtained when the sentences included in the first corpus set and the second corpus set have all been input into the pre-trained language model and the second classification labels corresponding to those sentences have been obtained.
As an optional technical solution, the apparatus further includes a second processing module, configured to code the sentence vector to obtain the first coding vector.
As an optional technical solution, the second processing module includes: a processing sub-module, configured to perform word segmentation on the one sentence to obtain the sentence vector X = {w_1, w_2, …, w_i} corresponding to the one sentence, where i = 1…N, w_i is the i-th participle in the sentence vector, w_1 is CLS, and w_1 is used for receiving the hidden state of the sentence vector; a first coding sub-module, configured to code each participle in the sentence vector X to obtain a second coding vector corresponding to the sentence vector X, where the second coding vector is XE = {x_1e_1, x_2e_2, …, x_ie_i}, i = 1…N, x_ie_i ∈ R^d, d is the vector dimension, and each participle in the sentence vector X corresponds one to one to each vector in the second coding vector; and a second coding sub-module, configured to code the second coding vector corresponding to the i-th participle according to the (i-1)-th participle and the (i+1)-th participle to obtain the first coding vector, where the (i-1)-th participle is the participle preceding the i-th participle in the sentence, the (i+1)-th participle is the participle following the i-th participle in the sentence, and the first coding vector is XE = {x_1b_1e_1, x_2b_2e_2, …, x_ib_ie_i}, i = 1…N, each participle in the sentence vector X corresponding one to one to each vector in the first coding vector.
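To make the second coding sub-module concrete, the sketch below re-encodes each position of the second coding vector from its left and right neighbours. A simple window average stands in here for the attention-based encoder of the actual pre-trained language model, so it should be read only as an illustration of the dependency on the (i-1)-th and (i+1)-th participles, not as the real encoder.

```python
import torch
import torch.nn.functional as F

def contextual_encode(XE: torch.Tensor) -> torch.Tensor:
    """XE: second coding vector of shape (N, d), rows x_i e_i.

    Returns a first-coding-vector-style tensor whose i-th row depends on the
    (i-1)-th and (i+1)-th participle vectors (zero-padded at the sentence ends).
    """
    padded = F.pad(XE.unsqueeze(0), (0, 0, 1, 1))  # add zero rows before and after
    left = padded[0, :-2]    # participle i-1 for every position i
    right = padded[0, 2:]    # participle i+1 for every position i
    return (XE + left + right) / 3.0  # rows play the role of x_i b_i e_i
```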
As an optional technical solution, the apparatus further includes: a third input unit for inputting a target sentence to the target multilingual model; a fourth processing unit, configured to perform word segmentation on the target sentence to obtain a target sentence vector corresponding to the target sentence, where the target sentence vector includes a plurality of target word segmentation vectors; a first encoding unit, configured to convert the target sentence vector into a first target encoding vector corresponding to the target sentence vector; a second encoding unit, configured to encode the first target encoding vector to obtain a second target encoding vector; and the fifth processing unit is used for inputting the second target coding vector into a text classification model, classifying the second target coding vector and obtaining a target classification label of each word segmentation vector in the second target coding vector.
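A compact sketch of this inference path (third input unit through fifth processing unit) might look as follows; the tokenizer checkpoint and label names are assumptions used only for illustration.

```python
import torch
from transformers import BertTokenizer

def classify_sentence(text, model, tokenizer, label_names):
    """Segment the target sentence, build its sentence vector, encode it with the
    target multi-language model, and return the target classification label."""
    encoded = tokenizer(text, return_tensors="pt")  # word segmentation + ids
    with torch.no_grad():
        probs = model(encoded["input_ids"], encoded["attention_mask"])
    return label_names[probs.argmax(dim=-1).item()]

# Illustrative usage; the checkpoint name and labels are assumptions:
# tokenizer = BertTokenizer.from_pretrained("bert-base-multilingual-cased")
# print(classify_sentence("Where is my order?", model, tokenizer, intent_labels))
```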
According to a further aspect of embodiments of the present invention, there is also provided a storage medium having a computer program stored therein, wherein the computer program is arranged to perform the steps of any of the above-mentioned method embodiments when executed.
Alternatively, in the present embodiment, the storage medium may be configured to store a computer program for executing the steps of:
S1, inputting a multilingual corpus bag and a multilingual shared vocabulary into a language model to be trained, and training the language model to be trained to obtain a pre-trained language model, wherein the pre-trained language model is used for performing semantic recognition on the linguistic data of multiple languages, the multilingual corpus bag is a corpus bag including the multiple languages, and a participle set obtained by performing word segmentation on the multilingual corpus bag is stored in the multilingual shared vocabulary;
S2, adjusting the pre-trained language model by using a first corpus set of a first language with word segmentation labels and a second corpus set of a second language with word segmentation labels to obtain an intention recognition model, wherein the intention recognition model is used for recognizing the semantics expressed by sentences of the first language and the second language and the relationship between the semantics, and the multiple languages include the first language and the second language;
and S3, inputting the sentences of the multiple languages into the intention recognition model to obtain a target multi-language model, wherein the target multi-language model is used for recognizing the semantics represented by the sentences of the multiple languages and the relationship between the semantics.
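Putting the three steps together, a hypothetical orchestration could look like the sketch below; pretrain_language_model, fine_tune_intent_model and transfer_to_languages are placeholder names for the routines described in the method embodiments, not functions defined by this specification.

```python
def train_target_multilingual_model(corpus_bag, shared_vocab, zh_set, en_set, multi_sentences):
    """Hypothetical orchestration of steps S1-S3 (all helper functions are placeholders)."""
    # S1: pre-train on the multilingual corpus bag with the shared vocabulary
    pretrained = pretrain_language_model(corpus_bag, shared_vocab)
    # S2: adjust the pre-trained model with the labelled first/second corpus sets
    intent_model = fine_tune_intent_model(pretrained, zh_set, en_set)
    # S3: feed sentences of the multiple languages to obtain the target multi-language model
    return transfer_to_languages(intent_model, multi_sentences)
```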
Alternatively, in this embodiment, a person skilled in the art may understand that all or part of the steps in the methods of the foregoing embodiments may be implemented by instructing hardware related to the terminal device through a program, where the program may be stored in a computer-readable storage medium, and the storage medium may include: flash disks, ROM (Read-Only memories), RAM (Random Access memories), magnetic or optical disks, etc.
According to another aspect of the embodiment of the present invention, there is also provided an electronic device for implementing the above training method for a multilingual model, where the electronic device may be a terminal device or a server as shown in fig. 1. The present embodiment takes the electronic device as a server as an example for explanation. As shown in fig. 8, the electronic device comprises a memory 702 and a processor 704, wherein the memory 702 stores a computer program, and the processor 704 is configured to execute the steps of any of the above method embodiments by the computer program.
Optionally, in this embodiment, the processor may be configured to execute the following steps by a computer program:
S1, inputting a multilingual corpus bag and a multilingual shared vocabulary into a language model to be trained, and training the language model to be trained to obtain a pre-trained language model, wherein the pre-trained language model is used for performing semantic recognition on the linguistic data of multiple languages, the multilingual corpus bag is a corpus bag including the multiple languages, and a participle set obtained by performing word segmentation on the multilingual corpus bag is stored in the multilingual shared vocabulary;
S2, adjusting the pre-trained language model by using a first corpus set of a first language with word segmentation labels and a second corpus set of a second language with word segmentation labels to obtain an intention recognition model, wherein the intention recognition model is used for recognizing the semantics expressed by sentences of the first language and the second language and the relationship between the semantics, and the multiple languages include the first language and the second language;
and S3, inputting the sentences of the multiple languages into the intention recognition model to obtain a target multi-language model, wherein the target multi-language model is used for recognizing the semantics represented by the sentences of the multiple languages and the relationship between the semantics.
Alternatively, it is understood by those skilled in the art that the structure shown in fig. 8 is only an illustration and is not a limitation to the structure of the electronic device. For example, the electronic device may also include more or fewer components (e.g., network interfaces, etc.) than shown in FIG. 8, or have a different configuration than shown in FIG. 8.
The memory 702 may be used to store software programs and modules, such as program commands/modules corresponding to the method and apparatus for training a multi-language model in the embodiment of the present invention, and the processor 704 executes various functional applications and data processing by running the software programs and modules stored in the memory 702, so as to implement the above-mentioned method for training a multi-language model. The memory 702 may include high-speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the memory 702 can further include memory located remotely from the processor 704, which can be connected to the terminal over a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof. As an example, as shown in fig. 8, the memory 702 may include, but is not limited to, a first input unit 602, a first processing unit 604, and a second processing unit 606 of the multi-language model training apparatus. In addition, other module units in the training apparatus for the multi-language model may also be included, but are not limited to these, and are not described in this example again.
Optionally, the transmission device 706 is used for receiving or sending data via a network. Examples of the network may include wired and wireless networks. In one example, the transmission device 706 includes a network adapter (NIC) that can be connected to a router and other network devices via a network cable so as to communicate with the internet or a local area network. In one example, the transmission device 706 is a radio frequency (RF) module, which is used to communicate with the internet wirelessly.
In addition, the electronic device further includes: a connection bus 708 for connecting the respective module components in the electronic apparatus.
In other embodiments, the terminal device or the server may be a node in a distributed system, where the distributed system may be a blockchain system, and the blockchain system may be a distributed system formed by connecting a plurality of nodes through a network communication. Nodes can form a Peer-To-Peer (P2P, Peer To Peer) network, and any type of computing device, such as a server, a terminal, and other electronic devices, can become a node in the blockchain system by joining the Peer-To-Peer network.
Alternatively, in this embodiment, a person skilled in the art may understand that all or part of the steps in the methods of the foregoing embodiments may be implemented by instructing hardware related to the terminal device through a program, where the program may be stored in a computer-readable storage medium, and the storage medium may include: flash disks, Read-Only memories (ROMs), Random Access Memories (RAMs), magnetic or optical disks, and the like.
The above-mentioned serial numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments.
The integrated unit in the above embodiments, if implemented in the form of a software functional unit and sold or used as a separate product, may be stored in the above computer-readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes several commands for enabling one or more computer devices (which may be personal computers, servers, network devices, etc.) to execute all or part of the steps of the above methods according to the embodiments of the present invention.
In the above embodiments of the present invention, the description of each embodiment has its own emphasis, and reference may be made to the related description of other embodiments for parts that are not described in detail in a certain embodiment.
In the several embodiments provided in the present application, it should be understood that the disclosed client may be implemented in other manners. The above-described embodiments of the apparatus are merely illustrative, and for example, the above-described division of the units is only one type of division of logical functions, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, units or modules, and may be in an electrical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one position, or may be distributed on multiple network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit may be implemented in the form of hardware, or may also be implemented in the form of a software functional unit.
The foregoing is only a preferred embodiment of the present invention, and it should be noted that, for those skilled in the art, various improvements and modifications can be made without departing from the principle of the present invention, and these improvements and modifications should also fall within the protection scope of the present invention.

Claims (15)

1. A method for training a multilingual model, comprising:
inputting a multilingual corpus bag and a multilingual shared vocabulary into a language model to be trained, and training the language model to be trained to obtain a pre-trained language model, wherein the pre-trained language model is used for performing semantic recognition on linguistic data of multiple languages, the multilingual corpus bag is a corpus bag comprising the multiple languages, and a participle set obtained by performing word segmentation on the multilingual corpus bag is stored in the multilingual shared vocabulary;
inputting all sentences included in a first corpus set of a first language with word segmentation annotations and a second corpus set of a second language with word segmentation annotations into the pre-trained language model, and adjusting the pre-trained language model to obtain an intention recognition model, wherein the intention recognition model is used for recognizing the semantics expressed by the sentences of the first language and the second language and the relationship between the semantics, and the plurality of languages comprise the first language and the second language;
and inputting the sentences of the multiple languages into the intention identification model to obtain a target multi-language model, wherein the target multi-language model is used for identifying the semantics expressed by the sentences of the multiple languages and the relation between the semantics.
2. The method of claim 1, wherein before inputting the multilingual corpus bag and the multilingual shared vocabulary into a language model to be trained and training the language model to be trained to obtain a pre-trained language model, the method further comprises:
performing word segmentation on the multilingual corpus bag, and determining the word frequency of each word in the multilingual corpus bag;
and determining the word segmentation set corresponding to a word frequency greater than or equal to a preset threshold value as the multilingual shared vocabulary.
3. The method according to claim 1, wherein before the sentences included in the first corpus set of the first language with word segmentation labels and the second corpus set of the second language with word segmentation labels are all input into the pre-trained language model and the pre-trained language model is adjusted, the method further comprises:
for a sentence in the first corpus set or the second corpus set, performing the following steps:
performing word segmentation on the sentence to obtain a sentence vector corresponding to the sentence, wherein the sentence vector is composed of N word segmentation vectors, and one word segmentation vector of the N word segmentation vectors includes: word sense information of a word corresponding to the word segmentation vector and position information of the word segmentation, wherein the word sense information is used for indicating the meaning of the word segmentation, the position information is used for indicating the position of the word segmentation in the sentence, and N is an integer greater than 0;
inputting the sentence vector into the pre-training language model.
4. The method according to claim 3, wherein the step of inputting all sentences included in the first corpus set of the first language with word segmentation labels and the second corpus set of the second language with word segmentation labels into the pre-trained language model, and adjusting the pre-trained language model to obtain the intention recognition model comprises:
inputting a first coding vector corresponding to the sentence vector into a text classification model, wherein the first coding vector is obtained by coding the sentence vector;
classifying the first coding vector to obtain a first classification label of each word segmentation vector in the first coding vector;
and under the condition that the sentences included in the first corpus set and the second corpus set are input into the pre-training language model and the second classification labels corresponding to the sentences included in the first corpus set and the second corpus set are obtained, determining that the pre-training language model is adjusted, and obtaining the intention recognition model.
5. The method of claim 4, wherein prior to said inputting the first encoding vector corresponding to the sentence vector into the text classification model, the method further comprises:
and coding the sentence vector to obtain the first coding vector.
6. The method of claim 5, wherein the encoding the sentence vector to obtain the first encoded vector comprises:
performing word segmentation on the sentence to obtain the sentence vector X = {w_1, w_2, …, w_i} corresponding to the sentence, wherein i = 1…N, said w_i is the i-th participle in the sentence vector, said w_1 is CLS, and said w_1 is used for receiving the hidden state of the sentence vector;
coding each participle in the sentence vector X to obtain a second coding vector corresponding to the sentence vector X, wherein the second coding vector is XE = {x_1e_1, x_2e_2, …, x_ie_i}, i = 1…N, said x_ie_i ∈ R^d, d is the vector dimension, and each participle in the sentence vector X corresponds one to one to each vector in the second coding vector;
encoding the second coding vector corresponding to the i-th participle according to the (i-1)-th participle and the (i+1)-th participle to obtain the first coding vector, wherein the (i-1)-th participle is the participle preceding the position of the i-th participle in the sentence, the (i+1)-th participle is the participle following the position of the i-th participle in the sentence, and the first coding vector is XE = {x_1b_1e_1, x_2b_2e_2, …, x_ib_ie_i}, i = 1…N, each participle in the sentence vector X corresponding one to one to each vector in the first coding vector.
7. The method of any of claims 1-6, wherein after said inputting said multilingual sentences into said intent recognition model to obtain a target multilingual model, said method further comprises:
inputting a target statement to the target multi-language model;
performing word segmentation on the target sentence to obtain a target sentence vector corresponding to the target sentence, wherein the target sentence vector comprises a plurality of target word segmentation vectors;
converting the target sentence vector into a first target coding vector corresponding to the target sentence vector;
coding the first target coding vector to obtain a second target coding vector;
and inputting the second target coding vector into a text classification model, and classifying the second target coding vector to obtain a target classification label of each word segmentation vector in the second target coding vector.
8. An apparatus for training a multilingual model, comprising:
the apparatus comprises a first input unit, a first processing unit and a second processing unit, wherein the first input unit is configured to input a multilingual corpus bag and a multilingual shared vocabulary into a language model to be trained, and train the language model to be trained to obtain a pre-trained language model, the pre-trained language model is used for performing semantic recognition on linguistic data of multiple languages, the multilingual corpus bag is a corpus bag comprising the multiple languages, and a word segmentation set obtained by performing word segmentation on the multilingual corpus bag is stored in the multilingual shared vocabulary;
a first processing unit, configured to input all sentences included in a first corpus set of a first language with word segmentation labeling and a second corpus set of a second language with word segmentation labeling into the pre-trained language model, and adjust the pre-trained language model to obtain an intention recognition model, where the intention recognition model is configured to recognize the semantics represented by the sentences of the first language and the second language and the relationship between the semantics, and the multiple languages include the first language and the second language;
and the second processing unit is used for inputting the sentences of the multiple languages into the intention identification model to obtain a target multi-language model, wherein the target multi-language model is used for identifying the semantics represented by the sentences of the multiple languages and the relation between the semantics.
9. The apparatus of claim 8, further comprising:
a first determining unit, configured to perform word segmentation on the multilingual corpus bag and determine the word frequency of each word in the multilingual corpus bag;
and a second determining unit, configured to determine the participle set corresponding to a word frequency greater than or equal to a preset threshold as the multilingual shared vocabulary.
10. The apparatus of claim 8, further comprising:
a third processing unit, configured to execute the following steps for a sentence in the first corpus set or the second corpus set: performing word segmentation on the sentence to obtain a sentence vector corresponding to the sentence, wherein the sentence vector is composed of N word segmentation vectors, and one word segmentation vector of the N word segmentation vectors includes: the word meaning information of a participle corresponding to the participle vector and the position information of the participle are used for representing the meaning of the participle, the position information is used for representing the position of the participle in the sentence, and N is an integer greater than 0;
and the second input unit is used for inputting the sentence vector into the pre-training language model.
11. The apparatus of claim 10, wherein the first processing unit includes:
a first input module, configured to input a first coding vector corresponding to the sentence vector into a text classification model, where the first coding vector is a vector obtained by coding the sentence vector;
the first processing module is used for classifying the first coding vector to obtain a first classification label of each word segmentation vector in the first coding vector;
a first determining module, configured to determine that the pre-training language model has been adjusted to obtain the intention recognition model when the sentences included in the first corpus set and the second corpus set are both input into the pre-training language model and a second classification label corresponding to the sentences included in the first corpus set and the second corpus set is obtained.
12. The apparatus of claim 11, further comprising:
and the second processing module is used for coding the sentence vector to obtain the first coding vector.
13. The apparatus of claim 12, wherein the second processing module comprises:
a processing submodule, configured to perform word segmentation on the sentence and obtain the sentence vector X = {w_1, w_2, …, w_i} corresponding to the sentence, wherein i = 1…N, said w_i is the i-th participle in the sentence vector, said w_1 is CLS, and said w_1 is used for receiving the hidden state of the sentence vector;
a first encoding submodule, configured to encode each participle in the sentence vector X to obtain a second encoding vector corresponding to the sentence vector X, wherein the second encoding vector is XE = {x_1e_1, x_2e_2, …, x_ie_i}, i = 1…N, said x_ie_i ∈ R^d, d is the vector dimension, and each participle in the sentence vector X corresponds one to one to each vector in the second encoding vector;
a second encoding submodule, configured to encode the second encoding vector corresponding to the i-th participle according to the (i-1)-th participle and the (i+1)-th participle to obtain the first encoding vector, wherein the (i-1)-th participle is the participle preceding the i-th participle in the sentence, the (i+1)-th participle is the participle following the i-th participle in the sentence, and the first encoding vector is XE = {x_1b_1e_1, x_2b_2e_2, …, x_ib_ie_i}, i = 1…N, each participle in the sentence vector X corresponding one to one to each vector in the first encoding vector.
14. A computer-readable storage medium comprising a stored program, wherein the program when executed performs the method of any of claims 1 to 7.
15. An electronic device comprising a memory and a processor, characterized in that the memory has stored therein a computer program, the processor being arranged to execute the method of any of claims 1 to 7 by means of the computer program.
CN202010774741.0A 2020-08-04 2020-08-04 Multi-language model training method and device, storage medium and electronic equipment Active CN112749556B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010774741.0A CN112749556B (en) 2020-08-04 2020-08-04 Multi-language model training method and device, storage medium and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010774741.0A CN112749556B (en) 2020-08-04 2020-08-04 Multi-language model training method and device, storage medium and electronic equipment

Publications (2)

Publication Number Publication Date
CN112749556A CN112749556A (en) 2021-05-04
CN112749556B true CN112749556B (en) 2022-09-13

Family

ID=75645267

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010774741.0A Active CN112749556B (en) 2020-08-04 2020-08-04 Multi-language model training method and device, storage medium and electronic equipment

Country Status (1)

Country Link
CN (1) CN112749556B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115238708B (en) * 2022-08-17 2024-02-27 腾讯科技(深圳)有限公司 Text semantic recognition method, device, equipment, storage medium and program product

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103853703A (en) * 2014-02-19 2014-06-11 联想(北京)有限公司 Information processing method and electronic equipment
CN108563640A (en) * 2018-04-24 2018-09-21 中译语通科技股份有限公司 A kind of multilingual pair of neural network machine interpretation method and system
CN109388793A (en) * 2017-08-03 2019-02-26 阿里巴巴集团控股有限公司 Entity mask method, intension recognizing method and corresponding intrument, computer storage medium
CN111125331A (en) * 2019-12-20 2020-05-08 京东方科技集团股份有限公司 Semantic recognition method and device, electronic equipment and computer-readable storage medium
CN111382568A (en) * 2020-05-29 2020-07-07 腾讯科技(深圳)有限公司 Training method and device of word segmentation model, storage medium and electronic equipment
CN111460164A (en) * 2020-05-22 2020-07-28 南京大学 Intelligent barrier judgment method for telecommunication work order based on pre-training language model

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3616083A4 (en) * 2017-04-23 2021-01-13 Nuance Communications, Inc. Multi-lingual semantic parser based on transferred learning
CN111563208B (en) * 2019-01-29 2023-06-30 株式会社理光 Method and device for identifying intention and computer readable storage medium

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103853703A (en) * 2014-02-19 2014-06-11 联想(北京)有限公司 Information processing method and electronic equipment
CN109388793A (en) * 2017-08-03 2019-02-26 阿里巴巴集团控股有限公司 Entity mask method, intension recognizing method and corresponding intrument, computer storage medium
CN108563640A (en) * 2018-04-24 2018-09-21 中译语通科技股份有限公司 A kind of multilingual pair of neural network machine interpretation method and system
CN111125331A (en) * 2019-12-20 2020-05-08 京东方科技集团股份有限公司 Semantic recognition method and device, electronic equipment and computer-readable storage medium
CN111460164A (en) * 2020-05-22 2020-07-28 南京大学 Intelligent barrier judgment method for telecommunication work order based on pre-training language model
CN111382568A (en) * 2020-05-29 2020-07-07 腾讯科技(深圳)有限公司 Training method and device of word segmentation model, storage medium and electronic equipment

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
Classification of Traditional Chinese Medicine Cases based on Character-level Bert and Deep Learning; Zihao Song et al.; 2019 IEEE 8th Joint International Information Technology and Artificial Intelligence Conference (ITAIC 2019); 20191231; pp. 1383-1387 *
Language-agnostic BERT Sentence Embedding; Fangxiaoyu Feng et al.; arXiv:2007.01852v1; 20200703; pp. 1-12 *
Multi-Language Neural Network Language Models; Anton Ragni et al.; INTERSPEECH 2016; 20161231; pp. 3042-3046 *
Multilingual translation model based on an incremental self-learning strategy; Zhou Zhangping et al.; Journal of Xiamen University (Natural Science Edition); 20190331; vol. 58, no. 2; pp. 170-175 *
A survey of pre-training techniques for natural language processing; Li Zhoujun et al.; Computer Science; 20200410; no. 3; pp. 162-173 *

Also Published As

Publication number Publication date
CN112749556A (en) 2021-05-04

Similar Documents

Publication Publication Date Title
CN111554268B (en) Language identification method based on language model, text classification method and device
CN111444340B (en) Text classification method, device, equipment and storage medium
CN112131350B (en) Text label determining method, device, terminal and readable storage medium
CN110704576B (en) Text-based entity relationship extraction method and device
CN111026842A (en) Natural language processing method, natural language processing device and intelligent question-answering system
CN109086265B (en) Semantic training method and multi-semantic word disambiguation method in short text
CN112182166A (en) Text matching method and device, electronic equipment and storage medium
CN114676234A (en) Model training method and related equipment
WO2022253074A1 (en) Data processing method and related device
CN114298121A (en) Multi-mode-based text generation method, model training method and device
CN113761868B (en) Text processing method, text processing device, electronic equipment and readable storage medium
CN111783903A (en) Text processing method, text model processing method and device and computer equipment
CN113705191A (en) Method, device and equipment for generating sample statement and storage medium
CN114282055A (en) Video feature extraction method, device and equipment and computer storage medium
CN113761220A (en) Information acquisition method, device, equipment and storage medium
CN111767720B (en) Title generation method, computer and readable storage medium
CN112749556B (en) Multi-language model training method and device, storage medium and electronic equipment
CN114398903B (en) Intention recognition method, device, electronic equipment and storage medium
CN115934891A (en) Question understanding method and device
CN114611529B (en) Intention recognition method and device, electronic equipment and storage medium
CN110362734A (en) Text recognition method, device, equipment and computer readable storage medium
CN113741759B (en) Comment information display method and device, computer equipment and storage medium
CN113239143B (en) Power transmission and transformation equipment fault processing method and system fusing power grid fault case base
CN115062123A (en) Knowledge base question-answer pair generation method of conversation generation system
CN111597306B (en) Sentence recognition method and device, storage medium and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
REG Reference to a national code

Ref country code: HK

Ref legal event code: DE

Ref document number: 40043540

Country of ref document: HK

GR01 Patent grant
GR01 Patent grant