CN113408619B - Language model pre-training method and device - Google Patents
Language model pre-training method and device
- Publication number
- CN113408619B (application CN202110683642.6A)
- Authority
- CN
- China
- Prior art keywords
- training
- language model
- word
- data
- words
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
- G06F40/295—Named entity recognition
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The invention provides a language model pre-training method comprising the following steps: acquiring a first word vector initialized from first features, where the first features include image features; acquiring a randomly initialized second word vector; and training a language model based on the first word vector and the second word vector. Pre-training on multi-modal features that combine images and words strengthens the association between language and real-world objects, reduces the corpus required for pre-training, makes effective use of external knowledge, and thereby improves the language model's performance on downstream tasks. The invention also provides a language model pre-training apparatus that implements the method and shares its advantages.
Description
Technical Field
The invention belongs to the technical field of artificial intelligence, and specifically relates to a language model pre-training method based on multi-modal initialization from image features, and to a corresponding language model pre-training apparatus.
Background
Natural Language Processing (NLP) is an important direction in the field of artificial intelligence. Language model pre-training is widely applied in NLP: on many text tasks, a pre-trained language model significantly reduces the amount of training data required and improves accuracy. Commonly used pre-trained language models such as RNNLM, word2vec, GloVe, ELMo, GPT and BERT all involve word representations: words must be represented as vectors during training, and these word vectors are trained jointly with the language model, usually starting from random initialization. Such models learn only relations between words, so the learned representations of sentences and words do not truly capture real-world meaning; they capture mere co-occurrence regularities. For example, given the three words "wolf", "dog" and "cat", existing language models judge "cat" and "dog" to be highly similar simply because their co-occurrence frequency is very high. The model does not actually understand what "cat", "dog" or "wolf" means, so it assigns a higher probability to "a cat looks a lot like a dog" than to "a cat looks a lot like a wolf". In some downstream tasks, such as factual judgment, this makes a wrong final result more likely; in other tasks, such as text classification, it also degrades performance.
For example, suppose the downstream task is factual judgment on the sentence ""dongle" also likes eating bones". If the word "dongle" does not appear in the corpus used to pre-train the language model, and no similar sentence appears either, the language model may judge the statement to be false. Humans learn language in combination with the real world, and language in turn expresses the real world, but existing language model training methods cannot fully fit real-world meaning, so accuracy remains too low to meet the demands of natural language processing tasks. Research on language model pre-training methods and apparatus is therefore much needed, so that pre-trained language models connect better with real-world objects and reflect them more accurately, further promoting the in-depth development and wide application of natural language processing technology.
Disclosure of Invention
To solve all or some of the problems in the prior art, one aspect of the invention provides a language model pre-training method suitable for pre-training a language model. Another aspect of the invention provides a language model pre-training apparatus for performing such pre-training.
The language model pre-training method provided by one aspect of the invention comprises the following steps: S1, acquiring a first word vector initialized from first features, where the first features include image features; S2, acquiring a randomly initialized second word vector; S3, training a language model based on the first word vector and the second word vector. Taking the factual-judgment task on ""dongle" likes eating bones" again as an example: in a language model pre-trained by the method of the invention, the word vectors of "dongle" and "dog" are very similar, because "dongle" and "dog" are highly similar in images; thus, as long as the sentence "dogs like eating bones" appears in the training corpus, the sentence ""dongle" likes eating bones" is judged to be a true statement with high probability.
In general, step S1 further includes: preparing text training data and acquiring the part of speech of the words in it; screening out entity words based on part of speech, marking the entity words as first words and all other words as second words; and acquiring the first features of the first words. For step S2, the word vectors of the second words are randomly initialized to obtain the second word vectors. An "entity word" is a word that denotes a concrete object whose images are relatively uniform (on hearing such a word, a relatively fixed image immediately comes to mind). Image features are extracted from images matching the entity word, and the first word vector is initialized from the extracted image features; the word vectors of the remaining non-entity words are randomly initialized to give the second word vectors.
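As an illustrative sketch (the function name and the noun tag "n" are hypothetical; a real implementation would obtain the (word, part-of-speech) pairs from a tagger such as LTP), the split into noun-based entity-word candidates and randomly initialized second word vectors might look like:

```python
import random

def split_and_init(tagged_words, embedding_dim=8, seed=0):
    """Split POS-tagged words into entity-word candidates (nouns) and
    other words, and randomly initialize vectors for the latter.

    tagged_words: list of (word, pos) pairs, e.g. [("dog", "n"), ("runs", "v")].
    Returns (first_words, second_vectors): first_words are the noun
    candidates still to be screened against image features; second_vectors
    maps every other word to a randomly initialized vector.
    """
    rng = random.Random(seed)
    first_words = [w for w, pos in tagged_words if pos == "n"]
    second_vectors = {
        w: [rng.uniform(-0.5, 0.5) for _ in range(embedding_dim)]
        for w, pos in tagged_words if pos != "n"
    }
    return first_words, second_vectors

first, second = split_and_init([("dog", "n"), ("likes", "v"), ("bone", "n")])
# first == ["dog", "bone"]; "likes" gets a random 8-dimensional vector
```

The noun candidates still have to pass the image-similarity screening described below before they count as first words.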
Acquiring the part of speech of the words in the text training data includes performing word segmentation and part-of-speech tagging with a segmentation and tagging tool; such tools include, but are not limited to, LTP, NLPIR and Stanford CoreNLP. LTP (Language Technology Platform) provides a suite of Chinese natural language processing tools with which users can perform word segmentation, part-of-speech tagging, syntactic analysis and so on for Chinese text. NLPIR is a text processing system developed by the Big Data Search and Mining Laboratory of Beijing Institute of Technology, with word segmentation, part-of-speech tagging and related functions. Stanford CoreNLP (used here through its Python wrapper) provides a simple API for text processing tasks such as word segmentation, part-of-speech tagging, named entity recognition and syntactic analysis.
Screening out the entity words proceeds as follows: extract all words whose part-of-speech tag is noun and mark them as initial words; prepare a first training model trained on a computer vision dataset; preset a number N and take the first N pictures from a search engine's image search results for each initial word as first training data; input the first training data into the first training model and compute the image feature vectors of the N pictures; judge the similarity of the N pictures from these image feature vectors, and if the similarity meets a preset condition, the initial word is a first word.
If the initial word is a first word, step S1 computes the average of the image feature vectors of the N pictures as the initialized first word vector.
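The screening condition and the averaging step can be sketched in plain Python (an illustrative sketch: `screen_and_init` is a hypothetical name, and the 0.5 threshold follows the cosine condition stated in the disclosure):

```python
import math

def cosine(u, v):
    """Cosine of the angle between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def screen_and_init(feature_vectors, threshold=0.5):
    """If every pair of the N image feature vectors has cosine similarity
    above `threshold` (pairwise angles below 60 degrees), accept the word
    as an entity word and return the mean of the N vectors as its
    initialized first word vector; otherwise return None."""
    n = len(feature_vectors)
    for i in range(n):
        for j in range(i + 1, n):
            if cosine(feature_vectors[i], feature_vectors[j]) <= threshold:
                return None  # images too dissimilar: not an entity word
    dim = len(feature_vectors[0])
    return [sum(v[k] for v in feature_vectors) / n for k in range(dim)]

vec = screen_and_init([[1.0, 0.1], [0.9, 0.2], [1.1, 0.0]])  # similar pictures
# vec is approximately [1.0, 0.1], the component-wise mean
```

In practice the feature vectors would come from the first training model's forward pass over the N downloaded pictures.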
The first training model is a deep learning model for extracting the image features, such as a ResNet (residual network) model, a VGGNet (Visual Geometry Group Network) model or an Inception model; the computer vision dataset includes the ImageNet dataset. ResNet is built from residual blocks, which effectively mitigate the exploding- and vanishing-gradient problems during training. VGG is a popular model for extracting CNN features from images, with the advantages of small convolution kernels and small pooling kernels. Inception increases network depth and width while keeping the parameter count down. The ImageNet dataset is well documented and maintained by a dedicated team; it contains more than 14 million pictures covering more than 20,000 categories, over a million of which carry explicit category labels and annotations of object positions. N can be adjusted manually according to the actual application scenario, the particular search engine or search results, or other requirements; in a typical embodiment N is preferably in the range of 10-30 pictures.
"Computing the image feature vectors of the N pictures" includes: inputting the first training data into the first training model for deep learning, presetting a layer count M, and taking the output of the first M layers as the image feature vector. M is generally chosen with the specific depth of the first training model in mind (larger for deeper models, smaller for shallower ones) and is not limited, but in general the first 3-8 layers are preferred.
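Taking the output after the first M layers rather than the model's final output can be sketched with a toy stand-in for a deep network (the real model would be, e.g., a trained ResNet; the lambda "layers" here are purely illustrative):

```python
def feature_from_first_m_layers(layers, x, m):
    """Pass input x through only the first m layers of a trained model and
    use that intermediate activation as the image feature vector, rather
    than the model's final classification output."""
    for layer in layers[:m]:
        x = layer(x)
    return x

# Toy stand-in for a deep network: each "layer" is a simple transform.
toy_layers = [
    lambda v: [2 * a for a in v],    # layer 1
    lambda v: [a + 1 for a in v],    # layer 2
    lambda v: [a * 10 for a in v],   # layer 3 (e.g. classifier head, skipped)
]
feat = feature_from_first_m_layers(toy_layers, [1.0, 2.0], m=2)
# feat == [3.0, 5.0]: the activation after the first two layers
```

With a real convolutional network the same idea is usually realized by registering a forward hook on, or slicing off, the first M layers of the trained model.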
Judging the similarity of the N pictures from the image feature vectors includes computing the cosine of the angle between the image feature vectors of every pair among the N pictures. The "preset condition" is that the cosine for every such pair is greater than 0.5, i.e. every angle is smaller than 60°. Picture similarity is thus judged from the geometric relation between feature vectors: if the angle between the feature vectors of any two of the N pictures is below 60°, the N pictures are considered similar enough, meaning the images of the object denoted by the word are uniform.
In step S3, the first word vector and the second word vector are input into a preset second training model to obtain the pre-trained language model; the second training model is a BERT-class model based on the attention mechanism. The full name of BERT is Bidirectional Encoder Representations from Transformers, i.e. a Transformer-based bidirectional encoder representation. BERT uses the encoder portion of the Transformer; when processing a word, it can also take into account the words before and after it, giving the word its meaning in context.
The attention-based BERT-class model is a masked language model with an attention mechanism. Training with the self-attention mechanism and the masked language model proceeds as follows: replace one of the words with [MASK], then predict the probability distribution of that word from its context, with the objective of maximizing the probability assigned to the masked word; compute the cross-entropy loss function and its gradient; backpropagate the error and update the language model parameters; repeat these steps until the decrease in the loss function becomes very small, at which point training ends and the pre-trained language model is obtained.
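A minimal sketch of this masked-word training loop, in plain Python rather than a real BERT (a single softmax layer over averaged context vectors with manual gradients; all names are illustrative, and real training would use a Transformer encoder):

```python
import math
import random

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def train_masked_lm(sentences, dim=16, lr=0.3, epochs=100, seed=0):
    """Mask each word position in turn, predict the masked word from the
    average of the context word vectors through a softmax output layer,
    and minimize cross-entropy loss by plain gradient descent."""
    rng = random.Random(seed)
    vocab = sorted({w for s in sentences for w in s})
    idx = {w: i for i, w in enumerate(vocab)}
    vsize = len(vocab)
    emb = [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(vsize)]
    out = [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(vsize)]
    losses = []
    for _ in range(epochs):
        total = 0.0
        for sent in sentences:
            for mask_pos in range(len(sent)):           # this word becomes [MASK]
                target = idx[sent[mask_pos]]
                ctx = [idx[w] for p, w in enumerate(sent) if p != mask_pos]
                h = [sum(emb[c][k] for c in ctx) / len(ctx) for k in range(dim)]
                logits = [sum(out[w][k] * h[k] for k in range(dim)) for w in range(vsize)]
                p = softmax(logits)
                total += -math.log(p[target] + 1e-12)   # cross-entropy loss
                dlogits = [p[w] - (1.0 if w == target else 0.0) for w in range(vsize)]
                dh = [sum(dlogits[w] * out[w][k] for w in range(vsize)) for k in range(dim)]
                for w in range(vsize):                  # update output weights
                    for k in range(dim):
                        out[w][k] -= lr * dlogits[w] * h[k]
                for c in ctx:                           # backpropagate into context vectors
                    for k in range(dim):
                        emb[c][k] -= lr * dh[k] / len(ctx)
        losses.append(total)
        if len(losses) > 1 and losses[-2] - losses[-1] < 1e-6:
            break                                       # loss decrease is very small: stop
    return emb, losses
```

The stopping rule mirrors the description above: training ends once the per-epoch decrease in the loss becomes very small.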
Another aspect of the invention provides a language model pre-training apparatus comprising a network communication module, an acquisition module, a storage module and a processing module. The network communication module is communicatively connected to the Internet and acquires Internet data, including search result data from search engines; the acquisition module acquires multi-modal data, including image data and text data; the storage module stores at least preset text training data and several neural network deep learning models; the processing module runs the neural network deep learning models on the Internet data, the multi-modal data and the text training data, and trains a language model according to the language model pre-training method of the first aspect of the invention.
Compared with the prior art, the main beneficial effects of the invention are as follows:
1. The language model pre-training method trains a language model from a first word vector initialized from first features that include image features, together with a randomly initialized second word vector. Pre-training on multi-modal features that combine images and words strengthens the association between language and real-world objects, reduces the corpus required for pre-training, makes effective use of external knowledge, improves the language model's ability to understand real-world meaning, and can further improve its performance on downstream tasks.
2. The language model pre-training apparatus implements the language model pre-training method of the invention and has the corresponding advantages. It has a simple structure and is easy to set up on existing intelligent electronic equipment, which helps improve the performance of existing devices and promotes the practical application of language processing technology in wider fields.
Drawings
Fig. 1 is a schematic diagram of a language model pre-training method according to a first embodiment of the present invention.
Fig. 2 is a schematic diagram of a word vector pre-training process according to a first embodiment of the present invention.
Fig. 3 is a schematic diagram of a language model pre-training apparatus according to a first embodiment of the present invention.
Fig. 4 is a schematic diagram of an initialization word vector flow according to a second embodiment of the present invention.
Detailed Description
The following describes the embodiments of the invention clearly and completely; evidently, the described embodiments are only some, not all, of the embodiments of the invention. All other embodiments obtained by those skilled in the art from these embodiments without inventive effort fall within the scope of the invention.
The foregoing and additional aspects and advantages of the invention will become apparent from the following description of the embodiments taken in conjunction with the accompanying drawings. In the figures, parts with the same structure or function are denoted by the same reference numerals; for clarity, not every such part is labeled in every figure.
The operations of the embodiments are described below in a particular order. This order is presented to aid understanding of the details of the embodiments and of the invention as a whole; it does not necessarily correspond one-to-one with the methods of the invention, nor does it limit the scope of the invention.
It should be noted that the flowcharts and block diagrams in the figures illustrate operational processes that may be implemented by methods according to embodiments of the invention. In some alternative implementations, the functions noted in a block may occur out of the order noted in the figures; for example, two blocks shown in succession may in fact be executed substantially concurrently, or sometimes in the reverse order, depending on the objectives of the steps involved. Each block of the block diagrams and/or flowcharts, and combinations of such blocks, can be implemented by special-purpose hardware-based systems that perform the specified functions or actions, or by combinations of special-purpose hardware and computer instructions.
Example 1
In a first embodiment of the present invention, as shown in fig. 1, a language model pre-training method includes: s1, acquiring a first word vector initialized based on first features, wherein the first features comprise image features; s2, acquiring a second word vector initialized randomly; s3, training a language model based on the first word vector and the second word vector.
In this embodiment, before step S1, the method further includes: preparing text training data and acquiring the part of speech of the words in it; screening out entity words based on part of speech, marking the entity words as first words and all other words as second words; acquiring the first features of the first words; and, for step S2, randomly initializing the word vectors of the second words to obtain the second word vectors. The text training data in this embodiment are prepared to be as large and as broad as possible, covering text types such as news, BBS forums, reviews, games and television subtitles; they may be crawled from the network, with sources including social platforms, news websites and the like, without limitation. The part of speech of the words in the text training data is obtained by word segmentation and part-of-speech tagging with a segmentation and tagging tool, and entity words are then screened based on part of speech. The tool used in this embodiment is LTP; in some embodiments NLPIR or Stanford CoreNLP is used instead, the choice weighing the strengths and weaknesses of the different tools for the specific application, without limitation.
In this embodiment, screening out the entity words proceeds as follows: extract all words whose part-of-speech tag is noun and mark them as initial words; prepare a first training model trained on a computer vision dataset; preset a number N and take the first N pictures from a search engine's image search results for each initial word as first training data; input the first training data into the first training model and compute the image feature vectors of the N pictures; judge the similarity of the N pictures from these feature vectors, and if the similarity meets a preset condition, the initial word is a first word. The search engine in this embodiment is an Internet search engine such as Baidu or Google. After obtaining the image search results for an initial word tagged as a noun, the top N pictures in the results are taken. N can be adjusted manually: too large an N raises the computational cost, while too small an N makes it impossible to reliably judge whether the initial word is an entity word whose images are basically similar. In this embodiment, N is preferably preset to 20, i.e. 20 pictures are captured.
In this embodiment, "judging the similarity of the N pictures from the image feature vectors" is performed by computing the distance between the image feature vectors of every pair among the N pictures. The distance is obtained with a Hamming-distance algorithm, which approximates the similarity (i.e. distance) of two vectors very efficiently and with little computational cost and power consumption. In some embodiments a Euclidean-distance algorithm is used instead; the choice weighs the advantages of each algorithm against the configuration of the actual hardware or overall network system, without limitation.
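A sketch of the Hamming-distance comparison (the patent does not specify how real-valued feature vectors are discretized for Hamming distance; sign binarization is an assumption made here for illustration):

```python
def binarize(vec):
    """Binarize a real-valued feature vector by sign. Assumption: Hamming
    distance needs discrete symbols, and sign binarization is one common
    way to obtain them; the patent leaves this step unspecified."""
    return [1 if x > 0 else 0 for x in vec]

def hamming_distance(a, b):
    """Number of positions at which two equal-length bit vectors differ;
    a small distance means the two feature vectors are similar."""
    return sum(x != y for x, y in zip(a, b))

d = hamming_distance(binarize([0.3, -0.2, 0.7]), binarize([0.1, 0.4, 0.9]))
# → 1 (the bit patterns [1, 0, 1] and [1, 1, 1] differ in one position)
```

Comparing bit patterns in this way is cheap because it needs only element-wise comparisons, which matches the efficiency argument made above.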
Here the "preset condition" is that the distance between the image feature vectors of any two of the N pictures is smaller than a preset value. Picture similarity is judged from the distance between feature vectors: if the distance between the feature vectors of any two of the N pictures is below the preset value, the N pictures are considered similar enough, meaning the images of the object denoted by the word are uniform. In this embodiment, if the initial word is a first word, i.e. an entity word, step S1 computes the average of the image feature vectors of the N pictures as the initialized first word vector; in some embodiments another statistic of the N feature vectors is used instead, without limitation. For non-entity words, this embodiment obtains the initialized second word vectors by random initialization. The first training model of this embodiment is a deep learning model for image feature extraction, specifically a ResNet model trained on the public computer vision dataset ImageNet. In other embodiments the first training model is chosen by weighing the strengths and weaknesses of each model for the actual application, e.g. VGG, Inception or any other deep learning model usable for image feature extraction, and training may also be completed on other computer vision datasets, without limitation.
In step S3, the first word vector and the second word vector are input into a preset second training model to obtain a pre-trained language model; an example second training model is the BERT-class model based on the attention mechanism. As shown in fig. 2, this embodiment describes the process of pre-training with the first and second word vectors, taking as an example the case where the attention-based BERT-class model is a masked language model: replace one of the words with [MASK], then predict the probability distribution of that word from its context, with the objective of maximizing the probability assigned to the masked word; compute the cross-entropy loss function and its gradient; backpropagate the error and update the language model parameters; repeat these steps until the decrease in the loss function becomes very small, at which point training ends and the pre-trained language model is obtained.
The language model pre-training apparatus of this embodiment is integrated in a device A with network communication, GPU computing and large-scale data storage functions. As shown in fig. 3, it comprises a network communication module 1, an acquisition module 2, a storage module 3 and a processing module 4. The network communication module 1 is communicatively connected to the Internet and acquires Internet data, including search result data from search engines; the acquisition module 2 acquires multi-modal data, including image data and text data; the storage module 3 stores preset text training data and several neural network deep learning models, and may store other data according to actual application requirements, without limitation; the processing module 4 runs the neural network deep learning models on the Internet data, the multi-modal data and the text training data, and trains a language model according to the language model pre-training method of this embodiment.
Example two
In this embodiment, "computing the image feature vectors of the N pictures" specifically includes: inputting the first training data into the first training model for deep learning, presetting a layer count M, and taking the output of the first M layers as the image feature vector. M is generally chosen with the specific depth of the first training model in mind (larger for deeper models, smaller for shallower ones); in this embodiment the output of the first 5 convolutional layers is preferably selected.
As shown in fig. 4, the second embodiment differs from the first in how "the similarity of the N pictures is judged from the image feature vectors": denote each initial word by W(i), and compute the cosine of the angle between the image feature vectors of every pair among the N pictures of W(i). The "preset condition" is that the cosine for every such pair is greater than 0.5, i.e. every angle is smaller than 60°. Picture similarity is judged by the cosine similarity between feature vectors: if the angle between the feature vectors of any two of the N pictures is below 60°, the N pictures are considered similar enough, meaning the images of the object denoted by W(i) are uniform, and W(i) is selected as an entity word.
Certain conventional English terms or letters are used for clarity of description only; they are exemplary and should not be construed, through any possible Chinese translation or the specific letters used, as limiting the scope of the invention.
It should also be noted that in this document, relational terms such as "first" and "second" and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus.
The foregoing detailed description of the invention is provided so that its structure and operation may be better understood. It should be noted that those skilled in the art can make various improvements and modifications to the present invention without departing from its principles, and such improvements and modifications fall within the scope of the appended claims.
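Taken together, steps S1-S3 of the described method amount to a hybrid embedding initialization: entity (first) words receive word vectors averaged from image feature vectors, other (second) words receive random vectors, and the combined table is then fed to a BERT-style model. The sketch below illustrates only the initialization under simplifying assumptions (toy dimensionality, precomputed image features; the attention-based training of S3 is omitted).

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 16  # toy embedding dimensionality, an assumption of this sketch

def init_embeddings(vocab, image_feats_by_word):
    """Build the initial word-vector table: averaged image features for
    entity words (S1), random vectors for all other words (S2)."""
    table = {}
    for word in vocab:
        feats = image_feats_by_word.get(word)
        if feats is not None:                     # entity word: S1
            table[word] = np.mean(feats, axis=0)  # average of the N picture vectors
        else:                                     # other word: S2
            table[word] = rng.standard_normal(DIM)
    return table

# "cat" plays the entity word with 3 (hypothetical) image feature vectors.
image_feats = {"cat": [rng.standard_normal(DIM) for _ in range(3)]}
emb = init_embeddings(["cat", "the", "runs"], image_feats)
# S3 would pass these initialized vectors to a BERT-class model.
print(sorted(emb), emb["cat"].shape)
```

In the full method, `emb` would replace the usual fully random embedding initialization of the second training model before pre-training begins.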
Claims (8)
1. A language model pre-training method, characterized by comprising the following steps:
S1, acquiring a first word vector initialized based on first features, wherein the first features comprise image features;
S2, acquiring a randomly initialized second word vector;
S3, training a language model based on the first word vector and the second word vector;
the step S1 further includes:
preparing text training data, and acquiring the part of speech of the words in the text training data;
screening out entity words based on the part of speech, marking the entity words as first words, and marking other words except the entity words as second words; acquiring the first feature of the first word;
randomly initializing a word vector of the second word in the step S2 to obtain the second word vector;
the process of screening out the entity words comprises the following steps:
extracting all words whose part-of-speech tagging result is a noun and marking them as initial words;
preparing a first training model trained on a computer vision data set;
presetting a number N, and selecting the first N pictures from the search engine's picture search results for the initial word as first training data;
inputting the first training data into the first training model, and calculating the image feature vectors corresponding to the N pictures;
and judging the similarity of the N pictures according to the image feature vectors; if the similarity meets a preset condition, the initial word is a first word.
2. The language model pre-training method of claim 1, wherein the step of obtaining the part of speech of the words in the text training data comprises: performing word segmentation and part-of-speech tagging with a word segmentation and part-of-speech tagging tool;
the word segmentation and part-of-speech tagging tools include, but are not limited to, LTP, NLPIR, or StanfordCoreNLP.
3. The language model pre-training method of claim 1, wherein if the initial word is a first word, in step S1 the average of the image feature vectors corresponding to the N pictures is calculated as the initialized first word vector.
4. The language model pre-training method of claim 1, wherein the first training model is a deep learning model for extracting the image features, and comprises a ResNet model, a VGGNet model or an Inception model; the computer vision dataset comprises the ImageNet dataset.
5. The language model pre-training method according to any one of claims 1, 3 and 4, wherein "calculating the image feature vectors corresponding to the N pictures" comprises:
inputting the first training data into the first training model for deep learning;
denoting the preset number of layers as M, and selecting the outputs of the first M layers as the image feature vectors;
wherein M ranges from 3 to 8.
6. The language model pre-training method according to any one of claims 1, 3 and 4, wherein the step of judging the similarity of the N pictures according to the image feature vectors comprises:
calculating the cosine of the angle between the image feature vectors corresponding to every 2 of the N pictures;
wherein the "preset condition" is that the cosine values of the angles between the image feature vectors corresponding to any 2 pictures are all greater than 0.5.
7. The language model pre-training method according to any one of claims 1-4, wherein in step S3, the first word vector and the second word vector are input into a preset second training model to obtain a pre-trained language model; the second training model is a BERT-class model based on an attention mechanism.
8. A language model pre-training device, characterized by comprising a network communication module, an acquisition module, a storage module and a processing module;
the network communication module is in communication connection with the Internet and is used for acquiring Internet data, wherein the Internet data comprises search result data in a search engine;
the acquisition module is used for acquiring multi-modal data, wherein the multi-modal data comprises image data and text data;
the storage module is used for storing at least preset text training data, a computer vision data set and a plurality of neural network deep learning models;
the processing module runs the neural network deep learning models on the Internet data, the multi-modal data and the text training data, and trains a language model according to the language model pre-training method of any one of claims 1 to 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110683642.6A CN113408619B (en) | 2021-06-21 | 2021-06-21 | Language model pre-training method and device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113408619A CN113408619A (en) | 2021-09-17 |
CN113408619B true CN113408619B (en) | 2024-02-13 |
Family
ID=77681831
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110683642.6A Active CN113408619B (en) | 2021-06-21 | 2021-06-21 | Language model pre-training method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113408619B (en) |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115123885A (en) * | 2022-07-22 | 2022-09-30 | 江苏苏云信息科技有限公司 | Estimated arrival time estimation system for elevator |
CN116580445B (en) * | 2023-07-14 | 2024-01-09 | 江西脑控科技有限公司 | Large language model face feature analysis method, system and electronic equipment |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109710923A (en) * | 2018-12-06 | 2019-05-03 | 浙江大学 | Based on across the entity language matching process across media information |
CN111126068A (en) * | 2019-12-25 | 2020-05-08 | 中电云脑(天津)科技有限公司 | Chinese named entity recognition method and device and electronic equipment |
CN111858954A (en) * | 2020-06-29 | 2020-10-30 | 西南电子技术研究所(中国电子科技集团公司第十研究所) | Task-oriented text-generated image network model |
WO2021042904A1 (en) * | 2019-09-06 | 2021-03-11 | 平安国际智慧城市科技股份有限公司 | Conversation intention recognition method, apparatus, computer device, and storage medium |
CN112733533A (en) * | 2020-12-31 | 2021-04-30 | 浙大城市学院 | Multi-mode named entity recognition method based on BERT model and text-image relation propagation |
Non-Patent Citations (1)
Title |
---|
Di Qi, Lin Su, Jia Song, Edward Cui, et al. "ImageBERT: Cross-modal Pre-training with Large-scale Weak-supervised Image-text Data". arXiv:2001.07966v1, 2020, pp. 1-12. |
Also Published As
Publication number | Publication date |
---|---|
CN113408619A (en) | 2021-09-17 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN113254599B (en) | Multi-label microblog text classification method based on semi-supervised learning | |
Bai et al. | A survey on automatic image caption generation | |
CN109840287B (en) | Cross-modal information retrieval method and device based on neural network | |
CN110750959B (en) | Text information processing method, model training method and related device | |
CN108846017A (en) | The end-to-end classification method of extensive newsletter archive based on Bi-GRU and word vector | |
CN110175221B (en) | Junk short message identification method by combining word vector with machine learning | |
CN113408619B (en) | Language model pre-training method and device | |
CN110889282B (en) | Text emotion analysis method based on deep learning | |
CN109711465A (en) | Image method for generating captions based on MLL and ASCA-FR | |
CN111475622A (en) | Text classification method, device, terminal and storage medium | |
CN111881292B (en) | Text classification method and device | |
CN112148831B (en) | Image-text mixed retrieval method and device, storage medium and computer equipment | |
KR20200087977A (en) | Multimodal ducument summary system and method | |
CN113220890A (en) | Deep learning method combining news headlines and news long text contents based on pre-training | |
CN107909014A (en) | A kind of video understanding method based on deep learning | |
CN111581364B (en) | Chinese intelligent question-answer short text similarity calculation method oriented to medical field | |
CN106227836B (en) | Unsupervised joint visual concept learning system and unsupervised joint visual concept learning method based on images and characters | |
CN114691864A (en) | Text classification model training method and device and text classification method and device | |
CN115203421A (en) | Method, device and equipment for generating label of long text and storage medium | |
CN110956038A (en) | Repeated image-text content judgment method and device | |
CN114462385A (en) | Text segmentation method and device | |
CN111813993A (en) | Video content expanding method and device, terminal equipment and storage medium | |
CN112949293B (en) | Similar text generation method, similar text generation device and intelligent equipment | |
CN113434636A (en) | Semantic-based approximate text search method and device, computer equipment and medium | |
CN107122378B (en) | Object processing method and device and mobile terminal |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
PE01 | Entry into force of the registration of the contract for pledge of patent right | ||
Denomination of invention: Language model pre-training method and device Granted publication date: 20240213 Pledgee: Bank of Suzhou Co., Ltd. Shishan Road sub-branch Pledgor: Jiangsu Suyun Information Technology Co., Ltd. Registration number: Y2024980011833 |