CN113408619A - Language model pre-training method and device - Google Patents

Language model pre-training method and device

Info

Publication number
CN113408619A
Authority
CN
China
Prior art keywords
training
language model
word
words
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110683642.6A
Other languages
Chinese (zh)
Other versions
CN113408619B (en)
Inventor
陈桂兴
黄羿衡
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jiangsu Suyun Information Technology Co ltd
Original Assignee
Jiangsu Suyun Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jiangsu Suyun Information Technology Co ltd filed Critical Jiangsu Suyun Information Technology Co ltd
Priority to CN202110683642.6A priority Critical patent/CN113408619B/en
Publication of CN113408619A publication Critical patent/CN113408619A/en
Application granted granted Critical
Publication of CN113408619B publication Critical patent/CN113408619B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/279 Recognition of textual entities
    • G06F 40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/279 Recognition of textual entities
    • G06F 40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F 40/295 Named entity recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a language model pre-training method comprising the following steps: acquiring a first word vector initialized from first features, the first features comprising image features; acquiring a randomly initialized second word vector; and training a language model based on the first word vector and the second word vector. Because pre-training combines multi-modal features covering both images and words, the association between language and real-world objects is strengthened, the amount of corpus required for pre-training is reduced, external knowledge is used effectively, and the performance of the language model in downstream tasks is further improved. The language model pre-training device provided by the invention implements the above method and shares the corresponding advantages.

Description

Language model pre-training method and device
Technical Field
The invention belongs to the technical field of artificial intelligence, and in particular relates to a language model pre-training method based on multi-modal initialization from image features, together with a corresponding language model pre-training device.
Background
Natural Language Processing (NLP) is an important direction in the field of artificial intelligence. Pre-trained language models are used ever more widely in NLP: in many text tasks, adopting a pre-trained model markedly reduces the amount of training data required and improves accuracy. The commonly used pre-trained language models — the main types being RNNLM, word2vec, GloVe, ELMo, GPT and BERT — all involve word representations: during training, words are first represented as vectors, and these word vectors are trained together with the language model. In all of these training procedures, the word vectors are obtained by random initialization. RNNLM, word2vec, GloVe, ELMo, GPT and BERT learn relations between words, so the learned sentence or word representations do not really capture real-world meaning; they capture only co-occurrence regularities between words. For example, a model trained by existing methods may consider the words "wolf", "dog" and "cat" highly similar, because "cat" and "dog" co-occur very frequently. Since the model does not really understand what "cat", "dog" and "wolf" mean, it can assign a higher probability to "a cat looks very much like a dog" than to "a wolf looks very much like a dog". In downstream tasks — for example some factual-judgment tasks — this can produce incorrect final results; in other tasks, such as text classification, it can likewise degrade performance.
For example, suppose the downstream task is factual judgment of the sentence "native dogs like to eat bones". If the word "native dog" never appears in the corpus used to pre-train the language model, and no similar sentence appears either, the model may judge the statement to be false. Humans learn language in combination with the real world, and language is used to express the real world; existing language model training methods cannot fully fit real-world meaning, their accuracy is low, and they cannot meet the requirements of natural language processing tasks. There is therefore a need for a language model pre-training method and device that combine the pre-trained language model more closely with real-world objects, so that it reflects them more accurately and further promotes the deep development and wide application of natural language processing technology.
Disclosure of Invention
To solve all or part of the problems in the prior art, one aspect of the invention provides a language model pre-training method suitable for pre-training a language model. Another aspect of the invention provides a pre-training apparatus for pre-training a language model.
The language model pre-training method provided by the invention comprises the following steps: S1, acquiring a first word vector initialized from first features, the first features comprising image features; S2, acquiring a randomly initialized second word vector; and S3, training a language model based on the first word vector and the second word vector. Taking the factual-judgment example "native dogs also like to eat bones" again: with a language model pre-trained by the method of the invention, images of native dogs are highly similar to images of dogs, so the word vectors of "native dog" and "dog" are very similar; as long as the sentence "dogs like to eat bones" appears in the training corpus, the sentence "native dogs also like to eat bones" is very likely to be judged a true statement.
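The division of labor between steps S1–S3 can be sketched as follows; this is a minimal illustration, all function names are hypothetical, and it assumes (as a simplification) that the image feature dimension equals the word embedding dimension.

```python
import numpy as np

EMBED_DIM = 128
rng = np.random.default_rng(0)

def init_first_word_vector(image_features):
    """S1: an entity word's vector is derived from its picture features
    (here: the mean of the per-picture feature vectors)."""
    return np.mean(image_features, axis=0)

def init_second_word_vector():
    """S2: non-entity words get small random vectors."""
    return rng.normal(0.0, 0.02, size=EMBED_DIM)

def build_embedding_table(vocab, entity_features):
    """Assemble the table of initial word vectors handed to S3 for training."""
    return {
        word: init_first_word_vector(entity_features[word])
        if word in entity_features
        else init_second_word_vector()
        for word in vocab
    }
```

In S3 this table would be trained further together with the language model, so the image-derived vectors are a starting point rather than a frozen representation.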
In general, step S1 is preceded by: preparing text training data and acquiring the part of speech of the words in it; screening out entity words based on part of speech, recording the entity words as first words and the remaining words as second words; and acquiring the first features of the first words. In step S2, the word vectors of the second words are randomly initialized to obtain the second word vectors. Entity words are words that denote concrete objects, for which the associated images are relatively uniform (when such a word is mentioned, relatively fixed mental images immediately come to mind). Instead of randomly initializing the word vectors of entity words, image features are extracted from these relatively similar pictures and the first word vectors are initialized from the extracted features. The word vectors of the remaining non-entity words are randomly initialized to obtain the second word vectors.
Acquiring the part of speech of the words in the text training data comprises performing word segmentation and part-of-speech tagging with a segmentation and tagging tool, including but not limited to LTP, NLPIR or Stanford CoreNLP. LTP (Language Technology Platform) provides a series of Chinese natural language processing tools for word segmentation, part-of-speech tagging, syntactic analysis and so on. NLPIR is a text processing system developed by the Big Data Search and Mining Laboratory of Beijing Institute of Technology, with word segmentation, part-of-speech tagging and related functions. Stanford CoreNLP, from Stanford University, offers a simple API (with Python wrappers available) for text processing tasks such as word segmentation, part-of-speech tagging, named entity recognition and syntactic analysis.
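Whichever tagger is used, its output can be reduced to (word, part-of-speech) pairs and the noun candidates filtered out. The sketch below is tool-agnostic: the sample tagged sentence is illustrative only, and the tag value follows the LTP-style convention where "n" marks a common noun.

```python
def extract_noun_candidates(tagged):
    """Initial-word selection: keep tokens whose tag marks a common noun
    ('n' in LTP-style tagsets; other tools use other tag strings)."""
    return [word for word, pos in tagged if pos == "n"]

# Hypothetical tagger output for 土狗喜欢吃骨头 ("native dogs like to eat bones").
tagged = [("土狗", "n"), ("喜欢", "v"), ("吃", "v"), ("骨头", "n")]
candidates = extract_noun_candidates(tagged)
```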
The process of screening out entity words comprises: extracting all words whose part-of-speech tag is noun and recording them as initial words; preparing a first training model already trained on a computer vision dataset; presetting a number N and taking the first N pictures from a search engine's image search results for each initial word as first training data; inputting the first training data into the first training model and computing the image feature vectors of the N pictures; and judging the similarity of the N pictures from these feature vectors — if the similarity meets a preset condition, the initial word is a first word.
If the initial word is a first word, then in step S1 the average of the image feature vectors of the N pictures is computed as the initialized first word vector.
The first training model is a deep learning model for image feature extraction, such as a ResNet (residual network), VGGNet (Visual Geometry Group network) or Inception model; the computer vision dataset may be the ImageNet dataset. ResNet is composed of residual blocks and effectively mitigates exploding and vanishing gradients during training. The VGG models are classical CNN feature extractors for images, characterized by small convolution kernels and small pooling kernels. The Inception models increase the depth and width of the network while reducing the number of parameters. The ImageNet dataset is well documented, maintained by a dedicated team and convenient to use: its more than 14 million pictures cover over 20,000 categories, and more than a million pictures carry explicit category labels and annotations of object positions. N may be configured manually according to the application scenario, the particular search engine or search results, or other requirements; in a typical embodiment N is preferably in the range of 10–30 pictures.
Computing the image feature vectors of the N pictures comprises: inputting the first training data into the first training model for deep learning; and, with the preset number of layers recorded as M, taking the output of the first M layers as the image feature vector. M is preset with the actual application scenario and the depth of the particular first training model in mind — some models have more layers, some fewer — and is not limited, but the first 3–8 layers are generally the preferred range.
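The idea of truncating a network at layer M and using the intermediate activation as the feature vector can be sketched with a toy layer stack (a stand-in for a real pretrained vision model; the sizes and layer count are illustrative only):

```python
import numpy as np

rng = np.random.default_rng(1)

def make_layer(n_in, n_out):
    # Toy layer: linear map followed by ReLU, with fixed random weights.
    W = rng.normal(0, 0.1, (n_in, n_out))
    return lambda x: np.maximum(x @ W, 0.0)

# Stand-in for a pretrained model's full layer stack (8 layers here).
layers = [make_layer(64, 64) for _ in range(8)]

def image_feature_vector(x, M=5):
    """Run the input through only the first M layers and use that
    intermediate activation as the picture's feature vector."""
    for layer in layers[:M]:
        x = layer(x)
    return x
```

With a real model the same effect is obtained by reading an intermediate layer's output instead of the final classification head.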
Judging the similarity of the N pictures from the image feature vectors comprises: computing the cosine of the angle between the feature vectors of every pair of the N pictures. The preset condition is that the cosine of the angle between the feature vectors of any two pictures is greater than 0.5, i.e. every such angle is less than 60°. Picture similarity is judged from the geometric relation between the feature vectors: if the angle between the feature vectors of any two of the N pictures is below 60°, the N pictures are judged similar enough, i.e. the images of the object denoted by the word are uniform.
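The pairwise check above amounts to the following (a direct sketch of the preset condition; the function name is illustrative):

```python
import numpy as np
from itertools import combinations

def all_pairs_similar(vectors, min_cos=0.5):
    """Preset condition: the cosine of the angle between every pair of the
    N feature vectors must exceed 0.5, i.e. every angle is below 60 degrees
    (since cos 60 degrees = 0.5)."""
    for a, b in combinations(vectors, 2):
        cos = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
        if cos <= min_cos:
            return False
    return True
```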
In step S3, the first word vector and the second word vector are input into a preset second training model to obtain a pre-trained language model; the second training model is a BERT-type model based on the attention mechanism. BERT stands for Bidirectional Encoder Representations from Transformers, i.e. Transformer-based bidirectional encoder representations. BERT uses the encoder portion of the Transformer; when processing a word it can also take the words before and after it into account, obtaining the word's meaning in context.
The attention-based BERT-type model is a masked language model with self-attention. Training takes the self-attention mechanism and the masked-language-model objective as its target, as follows: replace one of the words with [MASK] and predict that word's probability distribution from the context, the goal being to maximize the probability assigned to the masked word; compute the cross-entropy loss function and its gradient; backpropagate the error and update the language model parameters; repeat until the loss decreases only marginally, at which point training is complete and the pre-trained language model is obtained.
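The loss-and-gradient computation at the [MASK] position can be written out explicitly. This is a sketch of one training step's mathematics only (the logits would come from the attention model; the function names are illustrative):

```python
import numpy as np

def softmax(z):
    z = z - z.max()  # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def mlm_step(logits, target_id):
    """One masked-LM step: distribution over the vocabulary at the [MASK]
    position, cross-entropy loss against the hidden word, and the gradient
    of the loss with respect to the logits (softmax + CE gives p - onehot)."""
    p = softmax(logits)
    loss = -np.log(p[target_id])
    grad = p.copy()
    grad[target_id] -= 1.0
    return loss, grad
```

The returned gradient is what backpropagation would push through the rest of the network when updating the model parameters.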
The invention also provides a language model pre-training device comprising a network communication module, an acquisition module, a storage module and a processing module. The network communication module is communicatively connected to the Internet and acquires Internet data, including search result data from a search engine; the acquisition module acquires multi-modal data, including image data and text data; the storage module stores at least preset text training data and a number of neural network deep learning models; and the processing module runs these deep learning models on the Internet data, the multi-modal data and the text training data, training the language model by the language model pre-training method of the above aspect of the invention.
Compared with the prior art, the invention has the main beneficial effects that:
1. The language model pre-training method trains the language model from a first word vector initialized from first features including image features and a randomly initialized second word vector. Pre-training on multi-modal features combining images and words improves the association between language and real things; it reduces the corpus required for pre-training, makes effective use of external knowledge, improves the model's ability to understand the meaning of real objects, and can thereby further improve the model's performance in downstream tasks.
2. The language model pre-training device implements the language model pre-training method and has the corresponding advantages; it is simple in structure, easy to deploy on existing intelligent electronic equipment, helps improve the performance of existing equipment, and promotes the practical use of speech recognition technology in wider fields.
Drawings
Fig. 1 is a schematic process diagram of a language model pre-training method according to a first embodiment of the present invention.
Fig. 2 is a schematic diagram of a word vector pre-training process according to a first embodiment of the present invention.
Fig. 3 is a schematic diagram of a language model pre-training apparatus according to a first embodiment of the present invention.
Fig. 4 is a schematic diagram of a process of initializing word vectors according to a second embodiment of the present invention.
Detailed Description
The technical solutions in the specific embodiments of the present invention will be clearly and completely described below, and it should be understood that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The above and/or additional aspects and advantages of the present invention will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings. In the figures, parts of the same structure or function are denoted by the same reference numerals, and not all parts shown are denoted by the associated reference numerals in all figures for reasons of clarity of presentation.
The operations of the embodiments below are described in a particular order. This order is provided for a better understanding of the details of the embodiments and a thorough understanding of the invention; it does not necessarily correspond one-to-one to the methods of the invention and is not intended to limit its scope.
It is to be noted that the flow charts and block diagrams in the figures illustrate operational procedures that may be implemented by the methods of the embodiments of the invention. In some alternative implementations, the functions noted in a block may occur out of the order noted in the figures: two blocks shown in succession may in fact be executed substantially concurrently, or sometimes in the reverse order, depending on the functionality involved. Each block of the block diagrams and/or flowcharts, and combinations of such blocks, can be implemented by special-purpose hardware-based systems that perform the specified functions or acts, or by combinations of special-purpose hardware and computer instructions.
Example one
In an embodiment of the invention, as shown in fig. 1, a language model pre-training method comprises: S1, acquiring a first word vector initialized from first features, the first features comprising image features; S2, acquiring a randomly initialized second word vector; and S3, training a language model based on the first word vector and the second word vector.
In this embodiment, step S1 is preceded by: preparing text training data and acquiring the part of speech of the words in it; screening out entity words based on part of speech, recording the entity words as first words and the other words as second words; and acquiring the first features of the first words. In step S2, the word vectors of the second words are randomly initialized to obtain the second word vectors. The text training data of this embodiment is prepared to be as large and complete as possible and includes various texts such as news, BBS forums, comments, games, TV series subtitles and the like. Parts of speech are acquired by word segmentation and part-of-speech tagging with a segmentation and tagging tool, and entity-word screening is performed based on part of speech. The tool used in this embodiment is LTP; in some embodiments NLPIR or Stanford CoreNLP is used instead, the choice being made according to the specific application by weighing the tools' respective advantages and disadvantages, without limitation.
In this embodiment, the process of "screening out entity words" comprises: extracting all words whose part-of-speech tag is noun and recording them as initial words; preparing a first training model already trained on a computer vision dataset; presetting a number N and taking the first N pictures from a search engine's image search results for each initial word as the first training data; inputting the first training data into the first training model and computing the image feature vectors of the N pictures; and judging the similarity of the N pictures from these feature vectors — if the similarity meets a preset condition, the initial word is a first word. The search engine of this embodiment is an Internet search engine such as Baidu or Google. After the image search results for a noun-tagged initial word are obtained, the first N pictures of the results are taken. N may be adjusted manually: too large a value raises the cost in computing resources, while too small a value makes it impossible to evaluate accurately whether the initial word is an entity word denoting essentially similar images. In this embodiment N is preferably preset to 20, i.e. 20 pictures are captured.
In this embodiment, "judging the similarity of the N pictures from the image feature vectors" comprises computing the distance between the feature vectors of every pair of the N pictures. The distance is obtained with a Hamming-distance algorithm, which approximates the similarity (i.e. distance) of two vectors very efficiently and with little computational cost. In some embodiments a Euclidean-distance algorithm is adopted instead; the choice is made according to the configuration of the actual hardware or the overall network system, weighing the advantages of each algorithm, without limitation.
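Since the Hamming distance is defined over discrete symbols, applying it to real-valued feature vectors requires binarizing them first. The sketch below assumes sign-based binarization — the patent does not specify the binarization scheme, so this detail is an illustrative assumption:

```python
import numpy as np

def hamming_distance(a, b, threshold=0.0):
    """Approximate-similarity sketch: binarize each feature vector by
    thresholding (an assumed scheme), then count positions where the
    resulting bits differ. Smaller distance means more similar."""
    bits_a = a > threshold
    bits_b = b > threshold
    return int(np.count_nonzero(bits_a != bits_b))
```

A Euclidean variant would simply replace the bit comparison with `np.linalg.norm(a - b)`.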
The "preset condition" means that the image feature vectors corresponding to any 2 pictures are all smaller than a preset value. And judging the similarity of the pictures by calculating the distance relationship between the image feature vectors, and if the image feature vectors corresponding to any 2 pictures in the N pictures are smaller than a preset value, judging that the similarity of the N pictures is high enough, namely the images of the objects represented by the words are uniform. In this embodiment, if the initial word is the first word, that is, the entity word, in the step S1, an average value of image feature vectors corresponding to the N pictures is calculated as an initialized first word vector. In some embodiments, other statistics of the image feature vectors corresponding to the N pictures are used as the initialized first word vector, and are not limited. For non-entity words, the embodiment obtains the initialized second word vector by using a random vector initialization method. The first training model of this embodiment is a deep learning model for the image feature extraction, and specifically is a ResNet model trained on a public computer vision data set ImageNet. In other embodiments, the first training model also considers advantages and disadvantages of each model according to practical application, selects a suitable model, such as VGG, inclusion, or any other deep learning model that can be used for image feature extraction, and may also complete training on other computer vision data sets, without limitation.
In step S3, the first word vector and the second word vector are input into a preset second training model to obtain a pre-trained language model; an example second training model is an attention-based BERT-type model. As shown in fig. 2, this embodiment takes a BERT-type masked language model with self-attention as the example to describe the process of pre-training the first and second word vectors into a language model: replace one of the words with [MASK] and predict that word's probability distribution from the context, the goal being to maximize the probability assigned to the masked word; compute the cross-entropy loss function and gradient; backpropagate the error and update the language model parameters; repeat until the loss decreases only marginally, at which point training is complete and the pre-trained language model is obtained.
The language model pre-training device of this embodiment is integrated in a device A having network communication, GPU computation and large-scale data storage functions. As shown in fig. 3, it comprises a network communication module 1, an acquisition module 2, a storage module 3 and a processing module 4. The network communication module 1 is communicatively connected to the Internet and acquires Internet data, including search result data from a search engine; the acquisition module 2 acquires multi-modal data, including image data and text data; the storage module 3 stores preset text training data and a number of neural network deep learning models, and may store other data according to the requirements of the application, without limitation; the processing module 4 runs the deep learning models on the Internet data, the multi-modal data and the text training data, training the language model by the pre-training method of this embodiment.
Example two
In this embodiment, "computing the image feature vectors of the N pictures" specifically comprises: inputting the first training data into the first training model for deep learning; and, with the preset number of layers recorded as M, taking the output of the first M layers as the image feature vector. M is preset with the actual application scenario and the depth of the first training model in mind — some models have more layers, some fewer — and is not limited; in this embodiment the output of the first 5 convolutional layers is preferred.
As shown in fig. 4, the second embodiment differs from the first mainly in how "the similarity of the N pictures is judged from the image feature vectors": with each initial word denoted W(i), the cosine of the angle between the feature vectors of every pair of W(i)'s N pictures is computed; the preset condition is that the cosine of the angle between the feature vectors of any two pictures is greater than 0.5, i.e. every such angle is less than 60°. Picture similarity is judged by the cosine similarity between the feature vectors: if the angle between the feature vectors of any two of the N pictures is below 60°, the N pictures are judged similar enough, i.e. the images of the object denoted by W(i) are uniform, and W(i) is selected as an entity word.
For clarity of description, certain conventional and specific terms and phrases are used; they are intended to be illustrative rather than restrictive, and are not meant to limit the scope of the invention to particular wording or its translation.
It is further noted that relational terms such as "first" and "second" are used herein only to distinguish one entity or action from another, and do not necessarily require or imply any actual such relationship or order between those entities or actions. The terms "comprises", "comprising" and any variation thereof are intended to cover a non-exclusive inclusion, so that a process, method, article or apparatus comprising a list of elements includes not only those elements but may also include other elements not expressly listed, or elements inherent to such a process, method, article or apparatus.
The present invention has been described in detail above, and specific embodiments have been used to explain its structure and operating principle; the above description of the embodiments is intended only to aid understanding of the method and its core idea. It should be noted that those skilled in the art can make various improvements and modifications to the present invention without departing from its principle, and such improvements and modifications also fall within the scope of the claims of the present invention.

Claims (10)

1. A language model pre-training method, characterized by comprising the following steps:
s1, acquiring a first word vector initialized based on first characteristics, wherein the first characteristics comprise image characteristics;
s2, acquiring a second word vector initialized randomly;
and S3, training a language model based on the first word vector and the second word vector.
2. The language model pre-training method of claim 1, wherein: the step S1 is preceded by:
preparing text training data and acquiring the part of speech of words in the text training data;
screening out entity words based on the part of speech, and recording the entity words as first words, and recording other words except the entity words as second words; acquiring the first feature of the first word;
in step S2, the word vector of the second word is initialized randomly to obtain the second word vector.
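As a non-limiting sketch of the word partitioning and initialization described in claims 1 and 2, the following plain-Python fragment splits tagged words into entity words (nouns with an available image feature) and other words, initializing the former from image features and the latter randomly. The POS tag "n", the toy dimension, and the feature lookup table are hypothetical:

```python
import random

DIM = 4  # toy embedding size for illustration

def init_word_vectors(tagged_words, image_features, dim=DIM, seed=0):
    """Entity words (nouns with image features) take their image-feature
    vector (first word vector); all other words take a random
    initialization (second word vector)."""
    rng = random.Random(seed)  # seeded for reproducibility
    vectors = {}
    for word, pos in tagged_words:
        if pos == "n" and word in image_features:
            vectors[word] = image_features[word]
        else:
            vectors[word] = [rng.uniform(-1, 1) for _ in range(dim)]
    return vectors

# Hypothetical tagged sentence and a precomputed image feature for "cat".
tagged = [("the", "det"), ("cat", "n"), ("sleeps", "v")]
feats = {"cat": [0.2, 0.4, 0.1, 0.3]}
vecs = init_word_vectors(tagged, feats)
```

Both kinds of vectors would then be fed jointly into the language model training of step S3.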
3. The language model pre-training method of claim 2, wherein: the acquiring the part of speech of the words in the text training data comprises: performing word segmentation and part-of-speech tagging by adopting a word segmentation and part-of-speech tagging tool;
the word segmentation and part-of-speech tagging tools include, but are not limited to, LTP, NLPIR, or StanfordCoreNLP.
4. The language model pre-training method of claim 2, wherein: the process of screening out entity words comprises the following steps:
extracting all words with part-of-speech tagging results as nouns and marking the words as initial words;
preparing a first training model which is trained on a computer vision data set;
presetting a number N, and selecting the first N pictures from the picture search results for the initial word retrieved from a search engine as the first training data;
inputting the first training data into the first training model, and calculating image feature vectors corresponding to the N pictures;
and judging the similarity of the N pictures according to the image feature vector, wherein if the similarity meets a preset condition, the initial word is the first word.
5. The language model pre-training method of claim 4, wherein: if the initial word is the first word, in step S1, an average value of image feature vectors corresponding to the N pictures is calculated as an initialized first word vector.
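The averaging in claim 5 can be sketched as an element-wise mean over the N image feature vectors; the toy vectors below are assumptions for illustration:

```python
def mean_vector(vectors):
    """Element-wise mean of N equal-length image feature vectors,
    used as the initialized first word vector."""
    if not vectors:
        raise ValueError("need at least one vector")
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

avg = mean_vector([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
```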
6. The language model pre-training method of claim 4, wherein: the first training model is a deep learning model for image feature extraction, and comprises a ResNet model, a VGGNet model, or an Inception model; the computer vision dataset comprises the ImageNet dataset.
7. A language model pre-training method according to any one of claims 4-6, characterized by: the step of calculating the image feature vectors corresponding to the N pictures comprises the following steps:
inputting the first training data into the first training model for deep learning;
denoting a preset number of layers as M, and selecting the output of the first M layers as the image feature vector;
the range of M is 3 to 8.
8. A language model pre-training method according to any one of claims 4-6, characterized by: the step of judging the similarity of the N pictures according to the image feature vectors comprises the following steps:
calculating cosine values of included angles between the image characteristic vectors corresponding to each 2 pictures in the N pictures;
the preset condition means that cosine values of included angles among the image feature vectors corresponding to any 2 pictures are all larger than 0.5.
9. A language model pre-training method according to any one of claims 1 to 6, characterized in that: in step S3, the first word vector and the second word vector are input into a preset second training model to obtain a pre-trained language model; the second training model is a BERT-like model based on an attention mechanism.
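As a rough illustration of the core operation of an attention-based, BERT-like second training model (not the claimed model itself), the following is a toy single-head scaled dot-product attention over a short sequence of word vectors; the 2-dimensional token vectors are assumptions for illustration:

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention(queries, keys, values):
    """Toy single-head scaled dot-product attention: each query mixes
    the value vectors, weighted by its similarity to each key."""
    d = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d)
                  for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[i] for w, v in zip(weights, values))
                    for i in range(len(values[0]))])
    return out

# The first and second word vectors enter the model as one token sequence.
tokens = [[1.0, 0.0], [0.0, 1.0]]
contextual = attention(tokens, tokens, tokens)
```

Each output row is a convex combination of the value vectors, with each token attending most strongly to itself in this symmetric example.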
10. A language model pre-training device, characterized by comprising: a network communication module, an acquisition module, a storage module and a processing module;
the network communication module is in communication connection with the Internet and is used for acquiring Internet data, and the Internet data comprises search result data in a search engine;
the acquisition module is used for acquiring multi-modal data, and the multi-modal data comprises image data and text data;
the storage module at least stores preset text training data, a computer vision data set and a plurality of neural network deep learning models;
the processing module runs the plurality of neural network deep learning models based on the internet data, the multi-modal data and the text training data, and trains the language model based on the language model pre-training method of any one of claims 1 to 9.
CN202110683642.6A 2021-06-21 2021-06-21 Language model pre-training method and device Active CN113408619B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110683642.6A CN113408619B (en) 2021-06-21 2021-06-21 Language model pre-training method and device


Publications (2)

Publication Number Publication Date
CN113408619A true CN113408619A (en) 2021-09-17
CN113408619B CN113408619B (en) 2024-02-13

Family

ID=77681831

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110683642.6A Active CN113408619B (en) 2021-06-21 2021-06-21 Language model pre-training method and device

Country Status (1)

Country Link
CN (1) CN113408619B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115123885A (en) * 2022-07-22 2022-09-30 江苏苏云信息科技有限公司 Estimated arrival time estimation system for elevator
CN116580445A (en) * 2023-07-14 2023-08-11 江西脑控科技有限公司 Large language model face feature analysis method, system and electronic equipment

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109710923A (en) * 2018-12-06 2019-05-03 浙江大学 Entity language matching method based on cross-media information
CN111126068A (en) * 2019-12-25 2020-05-08 中电云脑(天津)科技有限公司 Chinese named entity recognition method and device and electronic equipment
CN111858954A (en) * 2020-06-29 2020-10-30 西南电子技术研究所(中国电子科技集团公司第十研究所) Task-oriented text-generated image network model
WO2021042904A1 (en) * 2019-09-06 2021-03-11 平安国际智慧城市科技股份有限公司 Conversation intention recognition method, apparatus, computer device, and storage medium
CN112733533A (en) * 2020-12-31 2021-04-30 浙大城市学院 Multi-mode named entity recognition method based on BERT model and text-image relation propagation


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
DI QI, LIN SU, JIA SONG, EDWARD CUI, ET AL.: "ImageBERT: Cross-modal Pre-training with Large-scale Weak-supervised Image-Text Data", arXiv:2001.07966v1, pages 1-12 *



Similar Documents

Publication Publication Date Title
CN113254599B (en) Multi-label microblog text classification method based on semi-supervised learning
CN110750959B (en) Text information processing method, model training method and related device
CN109255118B (en) Keyword extraction method and device
CN107798140B (en) Dialog system construction method, semantic controlled response method and device
CN113268995B (en) Chinese academy keyword extraction method, device and storage medium
CN110134954B (en) Named entity recognition method based on Attention mechanism
CN107315734B (en) A kind of method and system to be standardized based on time window and semantic variant word
CN106503055A (en) A generation method from structured text to image description
CN111061861B (en) Text abstract automatic generation method based on XLNet
CN111581364B (en) Chinese intelligent question-answer short text similarity calculation method oriented to medical field
CN113408619B (en) Language model pre-training method and device
CN113220890A (en) Deep learning method combining news headlines and news long text contents based on pre-training
CN110929022A (en) Text abstract generation method and system
CN114880447A (en) Information retrieval method, device, equipment and storage medium
CN106227836B (en) Unsupervised joint visual concept learning system and unsupervised joint visual concept learning method based on images and characters
CN111813993A (en) Video content expanding method and device, terminal equipment and storage medium
CN114462385A (en) Text segmentation method and device
CN111191413B (en) Method, device and system for automatically marking event core content based on graph sequencing model
CN110852071A (en) Knowledge point detection method, device, equipment and readable storage medium
CN112949293B (en) Similar text generation method, similar text generation device and intelligent equipment
CN110347807B (en) Problem information processing method and device
CN112084788A (en) Automatic marking method and system for implicit emotional tendency of image captions
CN114722774B (en) Data compression method, device, electronic equipment and storage medium
Nandwalkar et al. Descriptive Handwritten Paper Grading System using NLP and Fuzzy Logic
CN114943236A (en) Keyword extraction method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
PE01 Entry into force of the registration of the contract for pledge of patent right

Denomination of invention: Language model pre-training method and device

Granted publication date: 20240213

Pledgee: Bank of Suzhou Co.,Ltd. Shishan road sub branch

Pledgor: Jiangsu Suyun Information Technology Co.,Ltd.

Registration number: Y2024980011833