CN113408619B - Language model pre-training method and device - Google Patents
Language model pre-training method and device
- Publication number
- CN113408619B (application CN202110683642.6A)
- Authority
- CN
- China
- Prior art keywords
- training
- language model
- word
- data
- words
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
- G06F40/295—Named entity recognition
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The invention provides a language model pre-training method comprising the following steps: acquiring a first word vector initialized from first features, where the first features include image features; acquiring a randomly initialized second word vector; and training a language model based on the first word vector and the second word vector. Pre-training on multi-modal features that combine images and words strengthens the association between language and real-world objects, reduces the corpus required for pre-training, makes effective use of external knowledge, and thereby improves the language model's performance on downstream tasks. The invention also provides a language model pre-training apparatus that implements the method and shares its advantages.
Description
Technical Field
The invention belongs to the technical field of artificial intelligence, and specifically relates to a language model pre-training method based on multi-modal initialization from image features, and to a corresponding language model pre-training apparatus.
Background
Natural Language Processing (NLP) is an important direction in the field of artificial intelligence. Language model pre-training is widely applied in NLP: on many text tasks, a pre-trained language model significantly reduces the amount of training data required and improves accuracy. Commonly used pre-trained language models such as RNNLM, word2vec, GloVe, ELMo, GPT and BERT all involve word representations: words must be represented as vectors during training, and these word vectors are trained jointly with the language model, usually starting from random initialization. Such models learn only relations between words, so the learned representations of sentences and words do not truly capture real-world meaning; they capture mere co-occurrence regularities. For example, given the three words "wolf", "dog" and "cat", existing language models judge "cat" and "dog" to be highly similar simply because their co-occurrence frequency is very high. The model does not actually understand what "cat", "dog" or "wolf" means, so it assigns a higher probability to "a cat looks a lot like a dog" than to "a cat looks a lot like a wolf". In some downstream tasks, such as factual judgment, this makes a wrong final result more likely; in other tasks, such as text classification, it also degrades performance.
For example, suppose the downstream task is factual judgment on the sentence ""dongle" also likes eating bones". If the word "dongle" does not appear in the corpus used to pre-train the language model, and no similar sentence appears either, the language model may judge the statement to be false. Humans learn language in combination with the real world, and language in turn expresses the real world, but existing language model training methods cannot fully fit real-world meaning, so accuracy remains too low to meet the demands of natural language processing tasks. Research on language model pre-training methods and apparatus is therefore much needed, so that pre-trained language models connect better with real-world objects and reflect them more accurately, further promoting the in-depth development and wide application of natural language processing technology.
Disclosure of Invention
To solve all or some of the problems in the prior art, one aspect of the invention provides a language model pre-training method suitable for pre-training a language model. Another aspect of the invention provides a language model pre-training apparatus for performing such pre-training.
The language model pre-training method provided by one aspect of the invention comprises the following steps: S1, acquiring a first word vector initialized from first features, where the first features include image features; S2, acquiring a randomly initialized second word vector; S3, training a language model based on the first word vector and the second word vector. Taking the factual-judgment task on ""dongle" likes eating bones" again as an example: in a language model pre-trained by the method of the invention, the word vectors of "dongle" and "dog" are very similar, because "dongle" and "dog" are highly similar in images; thus, as long as the sentence "dogs like eating bones" appears in the training corpus, the sentence ""dongle" likes eating bones" is judged to be a true statement with high probability.
In general, step S1 further includes: preparing text training data and acquiring the part of speech of the words in it; screening out entity words based on part of speech, marking the entity words as first words and all other words as second words; and acquiring the first features of the first words. For step S2, the word vectors of the second words are randomly initialized to obtain the second word vectors. An "entity word" is a word that denotes a concrete object whose images are relatively uniform (on hearing such a word, a relatively fixed image immediately comes to mind). Image features are extracted from images matching the entity word, and the first word vector is initialized from the extracted image features; the word vectors of the remaining non-entity words are randomly initialized to give the second word vectors.
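As an illustrative sketch (the function name and the noun tag "n" are hypothetical; a real implementation would obtain the (word, part-of-speech) pairs from a tagger such as LTP), the split into noun-based entity-word candidates and randomly initialized second word vectors might look like:

```python
import random

def split_and_init(tagged_words, embedding_dim=8, seed=0):
    """Split POS-tagged words into entity-word candidates (nouns) and
    other words, and randomly initialize vectors for the latter.

    tagged_words: list of (word, pos) pairs, e.g. [("dog", "n"), ("runs", "v")].
    Returns (first_words, second_vectors): first_words are the noun
    candidates still to be screened against image features; second_vectors
    maps every other word to a randomly initialized vector.
    """
    rng = random.Random(seed)
    first_words = [w for w, pos in tagged_words if pos == "n"]
    second_vectors = {
        w: [rng.uniform(-0.5, 0.5) for _ in range(embedding_dim)]
        for w, pos in tagged_words if pos != "n"
    }
    return first_words, second_vectors

first, second = split_and_init([("dog", "n"), ("likes", "v"), ("bone", "n")])
# first == ["dog", "bone"]; "likes" gets a random 8-dimensional vector
```

The noun candidates still have to pass the image-similarity screening described below before they count as first words.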
Acquiring the part of speech of the words in the text training data includes performing word segmentation and part-of-speech tagging with a segmentation and tagging tool; such tools include, but are not limited to, LTP, NLPIR and Stanford CoreNLP. LTP (Language Technology Platform) provides a suite of Chinese natural language processing tools with which users can perform word segmentation, part-of-speech tagging, syntactic analysis and so on for Chinese text. NLPIR is a text processing system developed by the Big Data Search and Mining Laboratory of Beijing Institute of Technology, with word segmentation, part-of-speech tagging and related functions. Stanford CoreNLP (used here through its Python wrapper) provides a simple API for text processing tasks such as word segmentation, part-of-speech tagging, named entity recognition and syntactic analysis.
Screening out the entity words proceeds as follows: extract all words whose part-of-speech tag is noun and mark them as initial words; prepare a first training model trained on a computer vision dataset; preset a number N and take the first N pictures from a search engine's image search results for each initial word as first training data; input the first training data into the first training model and compute the image feature vectors of the N pictures; judge the similarity of the N pictures from these image feature vectors, and if the similarity meets a preset condition, the initial word is a first word.
If the initial word is a first word, step S1 computes the average of the image feature vectors of the N pictures as the initialized first word vector.
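The screening condition and the averaging step can be sketched in plain Python (an illustrative sketch: `screen_and_init` is a hypothetical name, and the 0.5 threshold follows the cosine condition stated in the disclosure):

```python
import math

def cosine(u, v):
    """Cosine of the angle between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def screen_and_init(feature_vectors, threshold=0.5):
    """If every pair of the N image feature vectors has cosine similarity
    above `threshold` (pairwise angles below 60 degrees), accept the word
    as an entity word and return the mean of the N vectors as its
    initialized first word vector; otherwise return None."""
    n = len(feature_vectors)
    for i in range(n):
        for j in range(i + 1, n):
            if cosine(feature_vectors[i], feature_vectors[j]) <= threshold:
                return None  # images too dissimilar: not an entity word
    dim = len(feature_vectors[0])
    return [sum(v[k] for v in feature_vectors) / n for k in range(dim)]

vec = screen_and_init([[1.0, 0.1], [0.9, 0.2], [1.1, 0.0]])  # similar pictures
# vec is approximately [1.0, 0.1], the component-wise mean
```

In practice the feature vectors would come from the first training model's forward pass over the N downloaded pictures.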
The first training model is a deep learning model for extracting the image features, such as a ResNet (residual network) model, a VGGNet (Visual Geometry Group Network) model or an Inception model; the computer vision dataset includes the ImageNet dataset. ResNet is built from residual blocks, which effectively mitigate the exploding- and vanishing-gradient problems during training. VGG is a popular model for extracting CNN features from images, with the advantages of small convolution kernels and small pooling kernels. Inception increases network depth and width while keeping the parameter count down. The ImageNet dataset is well documented and maintained by a dedicated team; it contains more than 14 million pictures covering more than 20,000 categories, over a million of which carry explicit category labels and annotations of object positions. N can be adjusted manually according to the actual application scenario, the particular search engine or search results, or other requirements; in a typical embodiment N is preferably in the range of 10-30 pictures.
"Computing the image feature vectors of the N pictures" includes: inputting the first training data into the first training model for deep learning, presetting a layer count M, and taking the output of the first M layers as the image feature vector. M is generally chosen with the specific depth of the first training model in mind (larger for deeper models, smaller for shallower ones) and is not limited, but in general the first 3-8 layers are preferred.
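Taking the output after the first M layers rather than the model's final output can be sketched with a toy stand-in for a deep network (the real model would be, e.g., a trained ResNet; the lambda "layers" here are purely illustrative):

```python
def feature_from_first_m_layers(layers, x, m):
    """Pass input x through only the first m layers of a trained model and
    use that intermediate activation as the image feature vector, rather
    than the model's final classification output."""
    for layer in layers[:m]:
        x = layer(x)
    return x

# Toy stand-in for a deep network: each "layer" is a simple transform.
toy_layers = [
    lambda v: [2 * a for a in v],    # layer 1
    lambda v: [a + 1 for a in v],    # layer 2
    lambda v: [a * 10 for a in v],   # layer 3 (e.g. classifier head, skipped)
]
feat = feature_from_first_m_layers(toy_layers, [1.0, 2.0], m=2)
# feat == [3.0, 5.0]: the activation after the first two layers
```

With a real convolutional network the same idea is usually realized by registering a forward hook on, or slicing off, the first M layers of the trained model.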
Judging the similarity of the N pictures from the image feature vectors includes computing the cosine of the angle between the image feature vectors of every pair among the N pictures. The "preset condition" is that the cosine for every such pair is greater than 0.5, i.e. every angle is smaller than 60°. Picture similarity is thus judged from the geometric relation between feature vectors: if the angle between the feature vectors of any two of the N pictures is below 60°, the N pictures are considered similar enough, meaning the images of the object denoted by the word are uniform.
In step S3, the first word vector and the second word vector are input into a preset second training model to obtain the pre-trained language model; the second training model is a BERT-class model based on the attention mechanism. The full name of BERT is Bidirectional Encoder Representations from Transformers, i.e. a Transformer-based bidirectional encoder representation. BERT uses the encoder portion of the Transformer; when processing a word, it can also take into account the words before and after it, giving the word its meaning in context.
The attention-based BERT-class model is a masked language model with an attention mechanism. Training with the self-attention mechanism and the masked language model proceeds as follows: replace one of the words with [MASK], then predict the probability distribution of that word from its context, with the objective of maximizing the probability assigned to the masked word; compute the cross-entropy loss function and its gradient; backpropagate the error and update the language model parameters; repeat these steps until the decrease in the loss function becomes very small, at which point training ends and the pre-trained language model is obtained.
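A minimal sketch of this masked-word training loop, in plain Python rather than a real BERT (a single softmax layer over averaged context vectors with manual gradients; all names are illustrative, and real training would use a Transformer encoder):

```python
import math
import random

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def train_masked_lm(sentences, dim=16, lr=0.3, epochs=100, seed=0):
    """Mask each word position in turn, predict the masked word from the
    average of the context word vectors through a softmax output layer,
    and minimize cross-entropy loss by plain gradient descent."""
    rng = random.Random(seed)
    vocab = sorted({w for s in sentences for w in s})
    idx = {w: i for i, w in enumerate(vocab)}
    vsize = len(vocab)
    emb = [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(vsize)]
    out = [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(vsize)]
    losses = []
    for _ in range(epochs):
        total = 0.0
        for sent in sentences:
            for mask_pos in range(len(sent)):           # this word becomes [MASK]
                target = idx[sent[mask_pos]]
                ctx = [idx[w] for p, w in enumerate(sent) if p != mask_pos]
                h = [sum(emb[c][k] for c in ctx) / len(ctx) for k in range(dim)]
                logits = [sum(out[w][k] * h[k] for k in range(dim)) for w in range(vsize)]
                p = softmax(logits)
                total += -math.log(p[target] + 1e-12)   # cross-entropy loss
                dlogits = [p[w] - (1.0 if w == target else 0.0) for w in range(vsize)]
                dh = [sum(dlogits[w] * out[w][k] for w in range(vsize)) for k in range(dim)]
                for w in range(vsize):                  # update output weights
                    for k in range(dim):
                        out[w][k] -= lr * dlogits[w] * h[k]
                for c in ctx:                           # backpropagate into context vectors
                    for k in range(dim):
                        emb[c][k] -= lr * dh[k] / len(ctx)
        losses.append(total)
        if len(losses) > 1 and losses[-2] - losses[-1] < 1e-6:
            break                                       # loss decrease is very small: stop
    return emb, losses
```

The stopping rule mirrors the description above: training ends once the per-epoch decrease in the loss becomes very small.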
Another aspect of the invention provides a language model pre-training apparatus comprising a network communication module, an acquisition module, a storage module and a processing module. The network communication module is communicatively connected to the Internet and acquires Internet data, including search result data from search engines; the acquisition module acquires multi-modal data, including image data and text data; the storage module stores at least preset text training data and several neural network deep learning models; the processing module runs the neural network deep learning models on the Internet data, the multi-modal data and the text training data, and trains a language model according to the language model pre-training method of the first aspect of the invention.
Compared with the prior art, the main beneficial effects of the invention are as follows:
1. The language model pre-training method trains a language model from a first word vector initialized from first features that include image features, together with a randomly initialized second word vector. Pre-training on multi-modal features that combine images and words strengthens the association between language and real-world objects, reduces the corpus required for pre-training, makes effective use of external knowledge, improves the language model's ability to understand real-world meaning, and can further improve its performance on downstream tasks.
2. The language model pre-training apparatus implements the language model pre-training method of the invention and has the corresponding advantages. It has a simple structure and is easy to set up on existing intelligent electronic equipment, which helps improve the performance of existing devices and promotes the practical application of language processing technology in wider fields.
Drawings
Fig. 1 is a schematic diagram of a language model pre-training method according to a first embodiment of the present invention.
Fig. 2 is a schematic diagram of a word vector pre-training process according to a first embodiment of the present invention.
Fig. 3 is a schematic diagram of a language model pre-training apparatus according to a first embodiment of the present invention.
Fig. 4 is a schematic diagram of an initialization word vector flow according to a second embodiment of the present invention.
Detailed Description
The following describes the embodiments of the invention clearly and completely; evidently, the described embodiments are only some, not all, of the embodiments of the invention. All other embodiments obtained by those skilled in the art from these embodiments without inventive effort fall within the scope of the invention.
The foregoing and additional aspects and advantages of the invention will become apparent from the following description of the embodiments taken in conjunction with the accompanying drawings. In the figures, parts with the same structure or function are denoted by the same reference numerals; for clarity, not every such part is labeled in every figure.
The operations of the embodiments are described below in a particular order. This order is presented to aid understanding of the details of the embodiments and of the invention as a whole; it does not necessarily correspond one-to-one with the methods of the invention, nor does it limit the scope of the invention.
It should be noted that the flowcharts and block diagrams in the figures illustrate operational processes that may be implemented by methods according to embodiments of the invention. In some alternative implementations, the functions noted in a block may occur out of the order noted in the figures; for example, two blocks shown in succession may in fact be executed substantially concurrently, or sometimes in the reverse order, depending on the objectives of the steps involved. Each block of the block diagrams and/or flowcharts, and combinations of such blocks, can be implemented by special-purpose hardware-based systems that perform the specified functions or actions, or by combinations of special-purpose hardware and computer instructions.
Example 1
In a first embodiment of the present invention, as shown in fig. 1, a language model pre-training method includes: s1, acquiring a first word vector initialized based on first features, wherein the first features comprise image features; s2, acquiring a second word vector initialized randomly; s3, training a language model based on the first word vector and the second word vector.
In this embodiment, before step S1, the method further includes: preparing text training data and acquiring the part of speech of the words in it; screening out entity words based on part of speech, marking the entity words as first words and all other words as second words; acquiring the first features of the first words; and, for step S2, randomly initializing the word vectors of the second words to obtain the second word vectors. The text training data in this embodiment are prepared to be as large and as broad as possible, covering text types such as news, BBS forums, reviews, games and television subtitles; they may be crawled from the network, with sources including social platforms, news websites and the like, without limitation. The part of speech of the words in the text training data is obtained by word segmentation and part-of-speech tagging with a segmentation and tagging tool, and entity words are then screened based on part of speech. The tool used in this embodiment is LTP; in some embodiments NLPIR or Stanford CoreNLP is used instead, the choice weighing the strengths and weaknesses of the different tools for the specific application, without limitation.
In this embodiment, screening out the entity words proceeds as follows: extract all words whose part-of-speech tag is noun and mark them as initial words; prepare a first training model trained on a computer vision dataset; preset a number N and take the first N pictures from a search engine's image search results for each initial word as first training data; input the first training data into the first training model and compute the image feature vectors of the N pictures; judge the similarity of the N pictures from these feature vectors, and if the similarity meets a preset condition, the initial word is a first word. The search engine in this embodiment is an Internet search engine such as Baidu or Google. After obtaining the image search results for an initial word tagged as a noun, the top N pictures in the results are taken. N can be adjusted manually: too large an N raises the computational cost, while too small an N makes it impossible to reliably judge whether the initial word is an entity word whose images are basically similar. In this embodiment, N is preferably preset to 20, i.e. 20 pictures are captured.
In this embodiment, "judging the similarity of the N pictures from the image feature vectors" is performed by computing the distance between the image feature vectors of every pair among the N pictures. The distance is obtained with a Hamming-distance algorithm, which approximates the similarity (i.e. distance) of two vectors very efficiently and with little computational cost and power consumption. In some embodiments a Euclidean-distance algorithm is used instead; the choice weighs the advantages of each algorithm against the configuration of the actual hardware or overall network system, without limitation.
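A sketch of the Hamming-distance comparison (the patent does not specify how real-valued feature vectors are discretized for Hamming distance; sign binarization is an assumption made here for illustration):

```python
def binarize(vec):
    """Binarize a real-valued feature vector by sign. Assumption: Hamming
    distance needs discrete symbols, and sign binarization is one common
    way to obtain them; the patent leaves this step unspecified."""
    return [1 if x > 0 else 0 for x in vec]

def hamming_distance(a, b):
    """Number of positions at which two equal-length bit vectors differ;
    a small distance means the two feature vectors are similar."""
    return sum(x != y for x, y in zip(a, b))

d = hamming_distance(binarize([0.3, -0.2, 0.7]), binarize([0.1, 0.4, 0.9]))
# → 1 (the bit patterns [1, 0, 1] and [1, 1, 1] differ in one position)
```

Comparing bit patterns in this way is cheap because it needs only element-wise comparisons, which matches the efficiency argument made above.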
Here the "preset condition" is that the distance between the image feature vectors of any two of the N pictures is smaller than a preset value. Picture similarity is judged from the distance between feature vectors: if the distance between the feature vectors of any two of the N pictures is below the preset value, the N pictures are considered similar enough, meaning the images of the object denoted by the word are uniform. In this embodiment, if the initial word is a first word, i.e. an entity word, step S1 computes the average of the image feature vectors of the N pictures as the initialized first word vector; in some embodiments another statistic of the N feature vectors is used instead, without limitation. For non-entity words, this embodiment obtains the initialized second word vectors by random initialization. The first training model of this embodiment is a deep learning model for image feature extraction, specifically a ResNet model trained on the public computer vision dataset ImageNet. In other embodiments the first training model is chosen by weighing the strengths and weaknesses of each model for the actual application, e.g. VGG, Inception or any other deep learning model usable for image feature extraction, and training may also be completed on other computer vision datasets, without limitation.
In step S3, the first word vector and the second word vector are input into a preset second training model to obtain a pre-trained language model; an example second training model is the BERT-class model based on the attention mechanism. As shown in fig. 2, this embodiment describes the process of pre-training with the first and second word vectors, taking as an example the case where the attention-based BERT-class model is a masked language model: replace one of the words with [MASK], then predict the probability distribution of that word from its context, with the objective of maximizing the probability assigned to the masked word; compute the cross-entropy loss function and its gradient; backpropagate the error and update the language model parameters; repeat these steps until the decrease in the loss function becomes very small, at which point training ends and the pre-trained language model is obtained.
The language model pre-training apparatus of this embodiment is integrated in a device A with network communication, GPU computing and large-scale data storage functions. As shown in fig. 3, it comprises a network communication module 1, an acquisition module 2, a storage module 3 and a processing module 4. The network communication module 1 is communicatively connected to the Internet and acquires Internet data, including search result data from search engines; the acquisition module 2 acquires multi-modal data, including image data and text data; the storage module 3 stores preset text training data and several neural network deep learning models, and may store other data according to actual application requirements, without limitation; the processing module 4 runs the neural network deep learning models on the Internet data, the multi-modal data and the text training data, and trains a language model according to the language model pre-training method of this embodiment.
Example two
In this embodiment, "computing the image feature vectors of the N pictures" specifically includes: inputting the first training data into the first training model for deep learning, presetting a layer count M, and taking the output of the first M layers as the image feature vector. M is generally chosen with the specific depth of the first training model in mind (larger for deeper models, smaller for shallower ones); in this embodiment the output of the first 5 convolutional layers is preferably selected.
As shown in fig. 4, the second embodiment differs from the first in how "the similarity of the N pictures is judged from the image feature vectors": denote each initial word by W(i), and compute the cosine of the angle between the image feature vectors of every pair among the N pictures of W(i). The "preset condition" is that the cosine for every such pair is greater than 0.5, i.e. every angle is smaller than 60°. Picture similarity is judged by the cosine similarity between feature vectors: if the angle between the feature vectors of any two of the N pictures is below 60°, the N pictures are considered similar enough, meaning the images of the object denoted by W(i) are uniform, and W(i) is selected as an entity word.
Certain conventional English terms or letters are used for clarity of description only; they are exemplary and should not be construed, through any possible Chinese translation or the specific letters used, as limiting the scope of the invention.
It should also be noted that in this document, relational terms such as "first" and "second" and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus.
The foregoing detailed description of the invention is provided so that its structure and operation may be better understood. It should be noted that those skilled in the art can make various improvements and modifications to the present invention without departing from its principles, and such improvements and modifications fall within the scope of the appended claims.
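Taken together, steps S1-S3 of the described method amount to a hybrid embedding initialization: entity (first) words receive word vectors averaged from image feature vectors, other (second) words receive random vectors, and the combined table is then fed to a BERT-style model. The sketch below illustrates only the initialization under simplifying assumptions (toy dimensionality, precomputed image features; the attention-based training of S3 is omitted).

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 16  # toy embedding dimensionality, an assumption of this sketch

def init_embeddings(vocab, image_feats_by_word):
    """Build the initial word-vector table: averaged image features for
    entity words (S1), random vectors for all other words (S2)."""
    table = {}
    for word in vocab:
        feats = image_feats_by_word.get(word)
        if feats is not None:                     # entity word: S1
            table[word] = np.mean(feats, axis=0)  # average of the N picture vectors
        else:                                     # other word: S2
            table[word] = rng.standard_normal(DIM)
    return table

# "cat" plays the entity word with 3 (hypothetical) image feature vectors.
image_feats = {"cat": [rng.standard_normal(DIM) for _ in range(3)]}
emb = init_embeddings(["cat", "the", "runs"], image_feats)
# S3 would pass these initialized vectors to a BERT-class model.
print(sorted(emb), emb["cat"].shape)
```

In the full method, `emb` would replace the usual fully random embedding initialization of the second training model before pre-training begins.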
Claims (8)
1. A language model pre-training method, characterized by comprising the following steps:
S1, acquiring a first word vector initialized based on first features, wherein the first features comprise image features;
S2, acquiring a randomly initialized second word vector;
S3, training a language model based on the first word vector and the second word vector;
the step S1 further includes:
preparing text training data, and acquiring the part of speech of the words in the text training data;
screening out entity words based on the part of speech, marking the entity words as first words, and marking other words except the entity words as second words; acquiring the first feature of the first word;
randomly initializing a word vector of the second word in the step S2 to obtain the second word vector;
the process of screening out the entity words comprises the following steps:
extracting all words whose part-of-speech tagging result is a noun and marking them as initial words;
preparing a first training model trained on a computer vision data set;
presetting a number N, and selecting the first N pictures from the search engine's picture search results for the initial word as first training data;
inputting the first training data into the first training model, and calculating the image feature vectors corresponding to the N pictures;
and judging the similarity of the N pictures according to the image feature vectors; if the similarity meets a preset condition, the initial word is a first word.
2. The language model pre-training method of claim 1, wherein the step of obtaining the part of speech of the words in the text training data comprises: performing word segmentation and part-of-speech tagging with a word segmentation and part-of-speech tagging tool;
the word segmentation and part-of-speech tagging tools include, but are not limited to, LTP, NLPIR, or StanfordCoreNLP.
3. The language model pre-training method of claim 1, wherein if the initial word is a first word, in step S1 the average of the image feature vectors corresponding to the N pictures is calculated as the initialized first word vector.
4. The language model pre-training method of claim 1, wherein the first training model is a deep learning model for extracting the image features, and comprises a ResNet model, a VGGNet model or an Inception model; the computer vision dataset comprises the ImageNet dataset.
5. The language model pre-training method according to any one of claims 1, 3 and 4, wherein "calculating the image feature vectors corresponding to the N pictures" comprises:
inputting the first training data into the first training model for deep learning;
denoting the preset number of layers as M, and selecting the outputs of the first M layers as the image feature vectors;
wherein M ranges from 3 to 8.
6. The language model pre-training method according to any one of claims 1, 3 and 4, wherein the step of judging the similarity of the N pictures according to the image feature vectors comprises:
calculating the cosine of the angle between the image feature vectors corresponding to every 2 of the N pictures;
wherein the "preset condition" is that the cosine values of the angles between the image feature vectors corresponding to any 2 pictures are all greater than 0.5.
7. The language model pre-training method according to any one of claims 1-4, wherein in step S3, the first word vector and the second word vector are input into a preset second training model to obtain a pre-trained language model; the second training model is a BERT-class model based on an attention mechanism.
8. A language model pre-training device, characterized by comprising a network communication module, an acquisition module, a storage module and a processing module;
the network communication module is in communication connection with the Internet and is used for acquiring Internet data, wherein the Internet data comprises search result data in a search engine;
the acquisition module is used for acquiring multi-modal data, wherein the multi-modal data comprises image data and text data;
the storage module is used for storing at least preset text training data, a computer vision data set and a plurality of neural network deep learning models;
the processing module runs the neural network deep learning models on the Internet data, the multi-modal data and the text training data, and trains a language model according to the language model pre-training method of any one of claims 1 to 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110683642.6A CN113408619B (en) | 2021-06-21 | 2021-06-21 | Language model pre-training method and device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113408619A CN113408619A (en) | 2021-09-17 |
CN113408619B true CN113408619B (en) | 2024-02-13 |
Family
ID=77681831
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110683642.6A Active CN113408619B (en) | 2021-06-21 | 2021-06-21 | Language model pre-training method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113408619B (en) |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115123885A (en) * | 2022-07-22 | 2022-09-30 | 江苏苏云信息科技有限公司 | Estimated arrival time estimation system for elevator |
CN116580445B (en) * | 2023-07-14 | 2024-01-09 | 江西脑控科技有限公司 | Large language model face feature analysis method, system and electronic equipment |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109710923A (en) * | 2018-12-06 | 2019-05-03 | 浙江大学 | Based on across the entity language matching process across media information |
CN111126068A (en) * | 2019-12-25 | 2020-05-08 | 中电云脑(天津)科技有限公司 | Chinese named entity recognition method and device and electronic equipment |
CN111858954A (en) * | 2020-06-29 | 2020-10-30 | 西南电子技术研究所(中国电子科技集团公司第十研究所) | Task-oriented text-generated image network model |
WO2021042904A1 (en) * | 2019-09-06 | 2021-03-11 | 平安国际智慧城市科技股份有限公司 | Conversation intention recognition method, apparatus, computer device, and storage medium |
CN112733533A (en) * | 2020-12-31 | 2021-04-30 | 浙大城市学院 | Multi-mode named entity recognition method based on BERT model and text-image relation propagation |
Non-Patent Citations (1)
Title |
---|
Di Qi, Lin Su, Jia Song, Edward Cui, et al. "ImageBERT: Cross-modal Pre-training with Large-scale Weak-supervised Image-text Data". arXiv:2001.07966v1, 2020, pp. 1-12. |
Also Published As
Publication number | Publication date |
---|---|
CN113408619A (en) | 2021-09-17 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN113254599B (en) | Multi-label microblog text classification method based on semi-supervised learning | |
Bai et al. | A survey on automatic image caption generation | |
CN109840287B (en) | Cross-modal information retrieval method and device based on neural network | |
CN110750959B (en) | Text information processing method, model training method and related device | |
CN108846017A (en) | The end-to-end classification method of extensive newsletter archive based on Bi-GRU and word vector | |
CN110175221B (en) | Junk short message identification method by combining word vector with machine learning | |
CN113408619B (en) | Language model pre-training method and device | |
CN110889282B (en) | Text emotion analysis method based on deep learning | |
CN109711465A (en) | Image method for generating captions based on MLL and ASCA-FR | |
CN111475622A (en) | Text classification method, device, terminal and storage medium | |
CN111881292B (en) | Text classification method and device | |
CN112148831B (en) | Image-text mixed retrieval method and device, storage medium and computer equipment | |
KR20200087977A (en) | Multimodal ducument summary system and method | |
CN113220890A (en) | Deep learning method combining news headlines and news long text contents based on pre-training | |
CN107909014A (en) | A kind of video understanding method based on deep learning | |
CN111581364B (en) | Chinese intelligent question-answer short text similarity calculation method oriented to medical field | |
CN106227836B (en) | Unsupervised joint visual concept learning system and unsupervised joint visual concept learning method based on images and characters | |
CN114691864A (en) | Text classification model training method and device and text classification method and device | |
CN115203421A (en) | Method, device and equipment for generating label of long text and storage medium | |
CN110956038A (en) | Repeated image-text content judgment method and device | |
CN114462385A (en) | Text segmentation method and device | |
CN111813993A (en) | Video content expanding method and device, terminal equipment and storage medium | |
CN112949293B (en) | Similar text generation method, similar text generation device and intelligent equipment | |
CN113434636A (en) | Semantic-based approximate text search method and device, computer equipment and medium | |
CN107122378B (en) | Object processing method and device and mobile terminal |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
PE01 | Entry into force of the registration of the contract for pledge of patent right | ||
Denomination of invention: Language model pre-training method and device Granted publication date: 20240213 Pledgee: Bank of Suzhou Co., Ltd. Shishan Road sub-branch Pledgor: Jiangsu Suyun Information Technology Co., Ltd. Registration number: Y2024980011833 |