CN113408619A - Language model pre-training method and device - Google Patents

Language model pre-training method and device

Info

Publication number
CN113408619A
Authority
CN
China
Prior art keywords
training
language model
word
words
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110683642.6A
Other languages
Chinese (zh)
Other versions
CN113408619B (en)
Inventor
陈桂兴
黄羿衡
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jiangsu Suyun Information Technology Co ltd
Original Assignee
Jiangsu Suyun Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jiangsu Suyun Information Technology Co ltd filed Critical Jiangsu Suyun Information Technology Co ltd
Priority to CN202110683642.6A priority Critical patent/CN113408619B/en
Publication of CN113408619A publication Critical patent/CN113408619A/en
Application granted granted Critical
Publication of CN113408619B publication Critical patent/CN113408619B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/279 Recognition of textual entities
    • G06F 40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/279 Recognition of textual entities
    • G06F 40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F 40/295 Named entity recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a language model pre-training method comprising the following steps: acquiring a first word vector initialized from first features, the first features comprising image features; acquiring a randomly initialized second word vector; and training a language model based on the first word vector and the second word vector. Because pre-training combines multi-modal features covering both images and words, the association between language and real-world objects is strengthened, the amount of corpus required for pre-training is reduced, external knowledge is used effectively, and the performance of the language model in downstream tasks is further improved. The language model pre-training device provided by the invention implements the above method and shares the corresponding advantages.

Description

Language model pre-training method and device
Technical Field
The invention belongs to the technical field of artificial intelligence, and in particular relates to a language model pre-training method based on multi-modal initialization from image features, together with a corresponding language model pre-training device.
Background
Natural Language Processing (NLP) is an important direction in the field of artificial intelligence. Pre-trained language models are used ever more widely in NLP: in many text tasks, adopting a pre-trained model markedly reduces the amount of training data required and improves accuracy. The commonly used pre-trained language models — the main types being RNNLM, word2vec, GloVe, ELMo, GPT and BERT — all involve word representations: during training, words are first represented as vectors, and these word vectors are trained together with the language model. In all of these training procedures, the word vectors are obtained by random initialization. RNNLM, word2vec, GloVe, ELMo, GPT and BERT learn relations between words, so the learned sentence or word representations do not really capture real-world meaning; they capture only co-occurrence regularities between words. For example, a model trained by existing methods may consider the words "wolf", "dog" and "cat" highly similar, because "cat" and "dog" co-occur very frequently. Since the model does not really understand what "cat", "dog" and "wolf" mean, it can assign a higher probability to "a cat looks very much like a dog" than to "a wolf looks very much like a dog". In downstream tasks — for example some factual-judgment tasks — this can produce incorrect final results; in other tasks, such as text classification, it can likewise degrade performance.
For example, suppose the downstream task is factual judgment of the sentence "native dogs like to eat bones". If the word "native dog" never appears in the corpus used to pre-train the language model, and no similar sentence appears either, the model may judge the statement to be false. Humans learn language in combination with the real world, and language is used to express the real world; existing language model training methods cannot fully fit real-world meaning, their accuracy is low, and they cannot meet the requirements of natural language processing tasks. There is therefore a need for a language model pre-training method and device that combine the pre-trained language model more closely with real-world objects, so that it reflects them more accurately and further promotes the deep development and wide application of natural language processing technology.
Disclosure of Invention
To solve all or part of the problems in the prior art, one aspect of the invention provides a language model pre-training method suitable for pre-training a language model. Another aspect of the invention provides a pre-training apparatus for pre-training a language model.
The language model pre-training method provided by the invention comprises the following steps: S1, acquiring a first word vector initialized from first features, the first features comprising image features; S2, acquiring a randomly initialized second word vector; and S3, training a language model based on the first word vector and the second word vector. Taking the factual-judgment example "native dogs also like to eat bones" again: with a language model pre-trained by the method of the invention, images of native dogs are highly similar to images of dogs, so the word vectors of "native dog" and "dog" are very similar; as long as the sentence "dogs like to eat bones" appears in the training corpus, the sentence "native dogs also like to eat bones" is very likely to be judged a true statement.
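The division of labor between steps S1–S3 can be sketched as follows; this is a minimal illustration, all function names are hypothetical, and it assumes (as a simplification) that the image feature dimension equals the word embedding dimension.

```python
import numpy as np

EMBED_DIM = 128
rng = np.random.default_rng(0)

def init_first_word_vector(image_features):
    """S1: an entity word's vector is derived from its picture features
    (here: the mean of the per-picture feature vectors)."""
    return np.mean(image_features, axis=0)

def init_second_word_vector():
    """S2: non-entity words get small random vectors."""
    return rng.normal(0.0, 0.02, size=EMBED_DIM)

def build_embedding_table(vocab, entity_features):
    """Assemble the table of initial word vectors handed to S3 for training."""
    return {
        word: init_first_word_vector(entity_features[word])
        if word in entity_features
        else init_second_word_vector()
        for word in vocab
    }
```

In S3 this table would be trained further together with the language model, so the image-derived vectors are a starting point rather than a frozen representation.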
In general, step S1 is preceded by: preparing text training data and acquiring the part of speech of the words in it; screening out entity words based on part of speech, recording the entity words as first words and the remaining words as second words; and acquiring the first features of the first words. In step S2, the word vectors of the second words are randomly initialized to obtain the second word vectors. Entity words are words that denote concrete objects, for which the associated images are relatively uniform (when such a word is mentioned, relatively fixed mental images immediately come to mind). Instead of randomly initializing the word vectors of entity words, image features are extracted from these relatively similar pictures and the first word vectors are initialized from the extracted features. The word vectors of the remaining non-entity words are randomly initialized to obtain the second word vectors.
Acquiring the part of speech of the words in the text training data comprises performing word segmentation and part-of-speech tagging with a segmentation and tagging tool, including but not limited to LTP, NLPIR or Stanford CoreNLP. LTP (Language Technology Platform) provides a series of Chinese natural language processing tools for word segmentation, part-of-speech tagging, syntactic analysis and so on. NLPIR is a text processing system developed by the Big Data Search and Mining Laboratory of Beijing Institute of Technology, with word segmentation, part-of-speech tagging and related functions. Stanford CoreNLP, from Stanford University, offers a simple API (with Python wrappers available) for text processing tasks such as word segmentation, part-of-speech tagging, named entity recognition and syntactic analysis.
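Whichever tagger is used, its output can be reduced to (word, part-of-speech) pairs and the noun candidates filtered out. The sketch below is tool-agnostic: the sample tagged sentence is illustrative only, and the tag value follows the LTP-style convention where "n" marks a common noun.

```python
def extract_noun_candidates(tagged):
    """Initial-word selection: keep tokens whose tag marks a common noun
    ('n' in LTP-style tagsets; other tools use other tag strings)."""
    return [word for word, pos in tagged if pos == "n"]

# Hypothetical tagger output for 土狗喜欢吃骨头 ("native dogs like to eat bones").
tagged = [("土狗", "n"), ("喜欢", "v"), ("吃", "v"), ("骨头", "n")]
candidates = extract_noun_candidates(tagged)
```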
The process of screening out entity words comprises: extracting all words whose part-of-speech tag is noun and recording them as initial words; preparing a first training model already trained on a computer vision dataset; presetting a number N and taking the first N pictures from a search engine's image search results for each initial word as first training data; inputting the first training data into the first training model and computing the image feature vectors of the N pictures; and judging the similarity of the N pictures from these feature vectors — if the similarity meets a preset condition, the initial word is a first word.
If the initial word is a first word, then in step S1 the average of the image feature vectors of the N pictures is computed as the initialized first word vector.
The first training model is a deep learning model for image feature extraction, such as a ResNet (residual network), VGGNet (Visual Geometry Group network) or Inception model; the computer vision dataset may be the ImageNet dataset. ResNet is composed of residual blocks and effectively mitigates exploding and vanishing gradients during training. The VGG models are classical CNN feature extractors for images, characterized by small convolution kernels and small pooling kernels. The Inception models increase the depth and width of the network while reducing the number of parameters. The ImageNet dataset is well documented, maintained by a dedicated team and convenient to use: its more than 14 million pictures cover over 20,000 categories, and more than a million pictures carry explicit category labels and annotations of object positions. N may be configured manually according to the application scenario, the particular search engine or search results, or other requirements; in a typical embodiment N is preferably in the range of 10–30 pictures.
Computing the image feature vectors of the N pictures comprises: inputting the first training data into the first training model for deep learning; and, with the preset number of layers recorded as M, taking the output of the first M layers as the image feature vector. M is preset with the actual application scenario and the depth of the particular first training model in mind — some models have more layers, some fewer — and is not limited, but the first 3–8 layers are generally the preferred range.
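The idea of truncating a network at layer M and using the intermediate activation as the feature vector can be sketched with a toy layer stack (a stand-in for a real pretrained vision model; the sizes and layer count are illustrative only):

```python
import numpy as np

rng = np.random.default_rng(1)

def make_layer(n_in, n_out):
    # Toy layer: linear map followed by ReLU, with fixed random weights.
    W = rng.normal(0, 0.1, (n_in, n_out))
    return lambda x: np.maximum(x @ W, 0.0)

# Stand-in for a pretrained model's full layer stack (8 layers here).
layers = [make_layer(64, 64) for _ in range(8)]

def image_feature_vector(x, M=5):
    """Run the input through only the first M layers and use that
    intermediate activation as the picture's feature vector."""
    for layer in layers[:M]:
        x = layer(x)
    return x
```

With a real model the same effect is obtained by reading an intermediate layer's output instead of the final classification head.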
Judging the similarity of the N pictures from the image feature vectors comprises: computing the cosine of the angle between the feature vectors of every pair of the N pictures. The preset condition is that the cosine of the angle between the feature vectors of any two pictures is greater than 0.5, i.e. every such angle is less than 60°. Picture similarity is judged from the geometric relation between the feature vectors: if the angle between the feature vectors of any two of the N pictures is below 60°, the N pictures are judged similar enough, i.e. the images of the object denoted by the word are uniform.
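The pairwise check above amounts to the following (a direct sketch of the preset condition; the function name is illustrative):

```python
import numpy as np
from itertools import combinations

def all_pairs_similar(vectors, min_cos=0.5):
    """Preset condition: the cosine of the angle between every pair of the
    N feature vectors must exceed 0.5, i.e. every angle is below 60 degrees
    (since cos 60 degrees = 0.5)."""
    for a, b in combinations(vectors, 2):
        cos = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
        if cos <= min_cos:
            return False
    return True
```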
In step S3, the first word vector and the second word vector are input into a preset second training model to obtain a pre-trained language model; the second training model is a BERT-type model based on the attention mechanism. BERT stands for Bidirectional Encoder Representations from Transformers, i.e. Transformer-based bidirectional encoder representations. BERT uses the encoder portion of the Transformer; when processing a word it can also take the words before and after it into account, obtaining the word's meaning in context.
The attention-based BERT-type model is a masked language model with self-attention. Training takes the self-attention mechanism and the masked-language-model objective as its target, as follows: replace one of the words with [MASK] and predict that word's probability distribution from the context, the goal being to maximize the probability assigned to the masked word; compute the cross-entropy loss function and its gradient; backpropagate the error and update the language model parameters; repeat until the loss decreases only marginally, at which point training is complete and the pre-trained language model is obtained.
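The loss-and-gradient computation at the [MASK] position can be written out explicitly. This is a sketch of one training step's mathematics only (the logits would come from the attention model; the function names are illustrative):

```python
import numpy as np

def softmax(z):
    z = z - z.max()  # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def mlm_step(logits, target_id):
    """One masked-LM step: distribution over the vocabulary at the [MASK]
    position, cross-entropy loss against the hidden word, and the gradient
    of the loss with respect to the logits (softmax + CE gives p - onehot)."""
    p = softmax(logits)
    loss = -np.log(p[target_id])
    grad = p.copy()
    grad[target_id] -= 1.0
    return loss, grad
```

The returned gradient is what backpropagation would push through the rest of the network when updating the model parameters.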
The invention also provides a language model pre-training device comprising a network communication module, an acquisition module, a storage module and a processing module. The network communication module is communicatively connected to the Internet and acquires Internet data, including search result data from a search engine; the acquisition module acquires multi-modal data, including image data and text data; the storage module stores at least preset text training data and a number of neural network deep learning models; and the processing module runs these deep learning models on the Internet data, the multi-modal data and the text training data, training the language model by the language model pre-training method of the above aspect of the invention.
Compared with the prior art, the invention has the main beneficial effects that:
1. The language model pre-training method trains the language model from a first word vector initialized from first features including image features and a randomly initialized second word vector. Pre-training on multi-modal features combining images and words improves the association between language and real things; it reduces the corpus required for pre-training, makes effective use of external knowledge, improves the model's ability to understand the meaning of real objects, and can thereby further improve the model's performance in downstream tasks.
2. The language model pre-training device implements the language model pre-training method and has the corresponding advantages; it is simple in structure, easy to deploy on existing intelligent electronic equipment, helps improve the performance of existing equipment, and promotes the practical use of speech recognition technology in wider fields.
Drawings
Fig. 1 is a schematic process diagram of a language model pre-training method according to a first embodiment of the present invention.
Fig. 2 is a schematic diagram of a word vector pre-training process according to a first embodiment of the present invention.
Fig. 3 is a schematic diagram of a language model pre-training apparatus according to a first embodiment of the present invention.
Fig. 4 is a schematic diagram of a process of initializing word vectors according to a second embodiment of the present invention.
Detailed Description
The technical solutions in the specific embodiments of the present invention will be clearly and completely described below, and it should be understood that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The above and/or additional aspects and advantages of the present invention will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings. In the figures, parts of the same structure or function are denoted by the same reference numerals, and not all parts shown are denoted by the associated reference numerals in all figures for reasons of clarity of presentation.
The operations of the embodiments below are described in a particular order. This order is provided for a better understanding of the details of the embodiments and a thorough understanding of the invention; it does not necessarily correspond one-to-one to the methods of the invention and is not intended to limit its scope.
It is to be noted that the flow charts and block diagrams in the figures illustrate operational procedures that may be implemented by the methods of the embodiments of the invention. In some alternative implementations, the functions noted in a block may occur out of the order noted in the figures: two blocks shown in succession may in fact be executed substantially concurrently, or sometimes in the reverse order, depending on the functionality involved. Each block of the block diagrams and/or flowcharts, and combinations of such blocks, can be implemented by special-purpose hardware-based systems that perform the specified functions or acts, or by combinations of special-purpose hardware and computer instructions.
Example one
In an embodiment of the invention, as shown in fig. 1, a language model pre-training method comprises: S1, acquiring a first word vector initialized from first features, the first features comprising image features; S2, acquiring a randomly initialized second word vector; and S3, training a language model based on the first word vector and the second word vector.
In this embodiment, step S1 is preceded by: preparing text training data and acquiring the part of speech of the words in it; screening out entity words based on part of speech, recording the entity words as first words and the other words as second words; and acquiring the first features of the first words. In step S2, the word vectors of the second words are randomly initialized to obtain the second word vectors. The text training data of this embodiment is prepared to be as large and complete as possible and includes various texts such as news, BBS forums, comments, games, TV series subtitles and the like. Parts of speech are acquired by word segmentation and part-of-speech tagging with a segmentation and tagging tool, and entity-word screening is performed based on part of speech. The tool used in this embodiment is LTP; in some embodiments NLPIR or Stanford CoreNLP is used instead, the choice being made according to the specific application by weighing the tools' respective advantages and disadvantages, without limitation.
In this embodiment, the process of "screening out entity words" comprises: extracting all words whose part-of-speech tag is noun and recording them as initial words; preparing a first training model already trained on a computer vision dataset; presetting a number N and taking the first N pictures from a search engine's image search results for each initial word as the first training data; inputting the first training data into the first training model and computing the image feature vectors of the N pictures; and judging the similarity of the N pictures from these feature vectors — if the similarity meets a preset condition, the initial word is a first word. The search engine of this embodiment is an Internet search engine such as Baidu or Google. After the image search results for a noun-tagged initial word are obtained, the first N pictures of the results are taken. N may be adjusted manually: too large a value raises the cost in computing resources, while too small a value makes it impossible to evaluate accurately whether the initial word is an entity word denoting essentially similar images. In this embodiment N is preferably preset to 20, i.e. 20 pictures are captured.
In this embodiment, "judging the similarity of the N pictures from the image feature vectors" comprises computing the distance between the feature vectors of every pair of the N pictures. The distance is obtained with a Hamming-distance algorithm, which approximates the similarity (i.e. distance) of two vectors very efficiently and with little computational cost. In some embodiments a Euclidean-distance algorithm is adopted instead; the choice is made according to the configuration of the actual hardware or the overall network system, weighing the advantages of each algorithm, without limitation.
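Since the Hamming distance is defined over discrete symbols, applying it to real-valued feature vectors requires binarizing them first. The sketch below assumes sign-based binarization — the patent does not specify the binarization scheme, so this detail is an illustrative assumption:

```python
import numpy as np

def hamming_distance(a, b, threshold=0.0):
    """Approximate-similarity sketch: binarize each feature vector by
    thresholding (an assumed scheme), then count positions where the
    resulting bits differ. Smaller distance means more similar."""
    bits_a = a > threshold
    bits_b = b > threshold
    return int(np.count_nonzero(bits_a != bits_b))
```

A Euclidean variant would simply replace the bit comparison with `np.linalg.norm(a - b)`.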
The "preset condition" means that the image feature vectors corresponding to any 2 pictures are all smaller than a preset value. And judging the similarity of the pictures by calculating the distance relationship between the image feature vectors, and if the image feature vectors corresponding to any 2 pictures in the N pictures are smaller than a preset value, judging that the similarity of the N pictures is high enough, namely the images of the objects represented by the words are uniform. In this embodiment, if the initial word is the first word, that is, the entity word, in the step S1, an average value of image feature vectors corresponding to the N pictures is calculated as an initialized first word vector. In some embodiments, other statistics of the image feature vectors corresponding to the N pictures are used as the initialized first word vector, and are not limited. For non-entity words, the embodiment obtains the initialized second word vector by using a random vector initialization method. The first training model of this embodiment is a deep learning model for the image feature extraction, and specifically is a ResNet model trained on a public computer vision data set ImageNet. In other embodiments, the first training model also considers advantages and disadvantages of each model according to practical application, selects a suitable model, such as VGG, inclusion, or any other deep learning model that can be used for image feature extraction, and may also complete training on other computer vision data sets, without limitation.
In step S3, the first word vector and the second word vector are input into a preset second training model to obtain a pre-trained language model; an example second training model is an attention-based BERT-type model. As shown in fig. 2, this embodiment takes a BERT-type masked language model with self-attention as the example to describe the process of pre-training the first and second word vectors into a language model: replace one of the words with [MASK] and predict that word's probability distribution from the context, the goal being to maximize the probability assigned to the masked word; compute the cross-entropy loss function and gradient; backpropagate the error and update the language model parameters; repeat until the loss decreases only marginally, at which point training is complete and the pre-trained language model is obtained.
The language model pre-training device of this embodiment is integrated in a device A having network communication, GPU computation and large-scale data storage functions. As shown in fig. 3, it comprises a network communication module 1, an acquisition module 2, a storage module 3 and a processing module 4. The network communication module 1 is communicatively connected to the Internet and acquires Internet data, including search result data from a search engine; the acquisition module 2 acquires multi-modal data, including image data and text data; the storage module 3 stores preset text training data and a number of neural network deep learning models, and may store other data according to the requirements of the application, without limitation; the processing module 4 runs the deep learning models on the Internet data, the multi-modal data and the text training data, training the language model by the pre-training method of this embodiment.
Example two
In this embodiment, "computing the image feature vectors of the N pictures" specifically comprises: inputting the first training data into the first training model for deep learning; and, with the preset number of layers recorded as M, taking the output of the first M layers as the image feature vector. M is preset with the actual application scenario and the depth of the first training model in mind — some models have more layers, some fewer — and is not limited; in this embodiment the output of the first 5 convolutional layers is preferred.
As shown in fig. 4, the second embodiment differs from the first mainly in how "the similarity of the N pictures is judged from the image feature vectors": with each initial word denoted W(i), the cosine of the angle between the feature vectors of every pair of W(i)'s N pictures is computed; the preset condition is that the cosine of the angle between the feature vectors of any two pictures is greater than 0.5, i.e. every such angle is less than 60°. Picture similarity is judged by the cosine similarity between the feature vectors: if the angle between the feature vectors of any two of the N pictures is below 60°, the N pictures are judged similar enough, i.e. the images of the object denoted by W(i) are uniform, and W(i) is selected as an entity word.
For clarity of description, certain conventional and specific terms and phrases are used; they are intended to be illustrative rather than restrictive, and are not meant to limit the scope of the invention to particular wording or its translation.
It is further noted that relational terms such as "first" and "second" are used herein only to distinguish one entity or action from another, and do not necessarily require or imply any actual such relationship or order between those entities or actions. The terms "comprises", "comprising" and any variation thereof are intended to cover a non-exclusive inclusion, so that a process, method, article or apparatus comprising a list of elements includes not only those elements but may also include other elements not expressly listed, or elements inherent to such a process, method, article or apparatus.
The present invention has been described in detail above, and specific embodiments have been used to explain its structure and operating principle; the above description of the embodiments is intended only to aid understanding of the method and its core idea. It should be noted that those skilled in the art can make various improvements and modifications to the present invention without departing from its principle, and such improvements and modifications also fall within the scope of the claims of the present invention.

Claims (10)

1. A language model pre-training method, characterized by comprising the following steps:
s1, acquiring a first word vector initialized based on first characteristics, wherein the first characteristics comprise image characteristics;
s2, acquiring a second word vector initialized randomly;
and S3, training a language model based on the first word vector and the second word vector.
2. The language model pre-training method of claim 1, wherein: the step S1 is preceded by:
preparing text training data and acquiring the part of speech of words in the text training data;
screening out entity words based on the part of speech, and recording the entity words as first words, and recording other words except the entity words as second words; acquiring the first feature of the first word;
in step S2, the word vector of the second word is initialized randomly to obtain the second word vector.
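As a non-limiting sketch of the word partitioning and initialization described in claims 1 and 2, the following plain-Python fragment splits tagged words into entity words (nouns with an available image feature) and other words, initializing the former from image features and the latter randomly. The POS tag "n", the toy dimension, and the feature lookup table are hypothetical:

```python
import random

DIM = 4  # toy embedding size for illustration

def init_word_vectors(tagged_words, image_features, dim=DIM, seed=0):
    """Entity words (nouns with image features) take their image-feature
    vector (first word vector); all other words take a random
    initialization (second word vector)."""
    rng = random.Random(seed)  # seeded for reproducibility
    vectors = {}
    for word, pos in tagged_words:
        if pos == "n" and word in image_features:
            vectors[word] = image_features[word]
        else:
            vectors[word] = [rng.uniform(-1, 1) for _ in range(dim)]
    return vectors

# Hypothetical tagged sentence and a precomputed image feature for "cat".
tagged = [("the", "det"), ("cat", "n"), ("sleeps", "v")]
feats = {"cat": [0.2, 0.4, 0.1, 0.3]}
vecs = init_word_vectors(tagged, feats)
```

Both kinds of vectors would then be fed jointly into the language model training of step S3.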
3. The language model pre-training method of claim 2, wherein: the acquiring the part of speech of the words in the text training data comprises: performing word segmentation and part-of-speech tagging by adopting a word segmentation and part-of-speech tagging tool;
the word segmentation and part-of-speech tagging tools include, but are not limited to, LTP, NLPIR, or StanfordCoreNLP.
4. The language model pre-training method of claim 2, wherein: the process of screening out entity words comprises the following steps:
extracting all words with part-of-speech tagging results as nouns and marking the words as initial words;
preparing a first training model which is trained on a computer vision data set;
presetting a number N, and selecting the first N pictures from the picture search results for the initial word retrieved from a search engine as the first training data;
inputting the first training data into the first training model, and calculating image feature vectors corresponding to the N pictures;
and judging the similarity of the N pictures according to the image feature vector, wherein if the similarity meets a preset condition, the initial word is the first word.
5. The language model pre-training method of claim 4, wherein: if the initial word is the first word, in step S1, an average value of image feature vectors corresponding to the N pictures is calculated as an initialized first word vector.
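The averaging in claim 5 can be sketched as an element-wise mean over the N image feature vectors; the toy vectors below are assumptions for illustration:

```python
def mean_vector(vectors):
    """Element-wise mean of N equal-length image feature vectors,
    used as the initialized first word vector."""
    if not vectors:
        raise ValueError("need at least one vector")
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

avg = mean_vector([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
```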
6. The language model pre-training method of claim 4, wherein: the first training model is a deep learning model for image feature extraction, and comprises a ResNet model, a VGGNet model, or an Inception model; the computer vision dataset comprises the ImageNet dataset.
7. A language model pre-training method according to any one of claims 4-6, characterized by: the step of calculating the image feature vectors corresponding to the N pictures comprises the following steps:
inputting the first training data into the first training model for deep learning;
denoting a preset number of layers as M, and selecting the output of the first M layers as the image feature vector;
the range of M is 3 to 8.
8. A language model pre-training method according to any one of claims 4-6, characterized by: the step of judging the similarity of the N pictures according to the image feature vectors comprises the following steps:
calculating cosine values of included angles between the image characteristic vectors corresponding to each 2 pictures in the N pictures;
the preset condition means that cosine values of included angles among the image feature vectors corresponding to any 2 pictures are all larger than 0.5.
9. A language model pre-training method according to any one of claims 1 to 6, characterized in that: in step S3, the first word vector and the second word vector are input into a preset second training model to obtain a pre-trained language model; the second training model is a BERT-like model based on an attention mechanism.
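As a rough illustration of the core operation of an attention-based, BERT-like second training model (not the claimed model itself), the following is a toy single-head scaled dot-product attention over a short sequence of word vectors; the 2-dimensional token vectors are assumptions for illustration:

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention(queries, keys, values):
    """Toy single-head scaled dot-product attention: each query mixes
    the value vectors, weighted by its similarity to each key."""
    d = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d)
                  for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[i] for w, v in zip(weights, values))
                    for i in range(len(values[0]))])
    return out

# The first and second word vectors enter the model as one token sequence.
tokens = [[1.0, 0.0], [0.0, 1.0]]
contextual = attention(tokens, tokens, tokens)
```

Each output row is a convex combination of the value vectors, with each token attending most strongly to itself in this symmetric example.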
10. A language model pre-training device, characterized by comprising: a network communication module, an acquisition module, a storage module and a processing module;
the network communication module is in communication connection with the Internet and is used for acquiring Internet data, and the Internet data comprises search result data in a search engine;
the acquisition module is used for acquiring multi-modal data, and the multi-modal data comprises image data and text data;
the storage module at least stores preset text training data, a computer vision data set and a plurality of neural network deep learning models;
the processing module runs the plurality of neural network deep learning models based on the internet data, the multi-modal data and the text training data, and trains the language model based on the language model pre-training method of any one of claims 1 to 9.
CN202110683642.6A 2021-06-21 2021-06-21 Language model pre-training method and device Active CN113408619B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110683642.6A CN113408619B (en) 2021-06-21 2021-06-21 Language model pre-training method and device


Publications (2)

Publication Number Publication Date
CN113408619A true CN113408619A (en) 2021-09-17
CN113408619B CN113408619B (en) 2024-02-13

Family

ID=77681831

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110683642.6A Active CN113408619B (en) 2021-06-21 2021-06-21 Language model pre-training method and device

Country Status (1)

Country Link
CN (1) CN113408619B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115123885A (en) * 2022-07-22 2022-09-30 江苏苏云信息科技有限公司 Estimated arrival time estimation system for elevator
CN116580445A (en) * 2023-07-14 2023-08-11 江西脑控科技有限公司 Large language model face feature analysis method, system and electronic equipment

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109710923A (en) * 2018-12-06 2019-05-03 浙江大学 Entity language matching method based on cross-media information
CN111126068A (en) * 2019-12-25 2020-05-08 中电云脑(天津)科技有限公司 Chinese named entity recognition method and device and electronic equipment
CN111858954A (en) * 2020-06-29 2020-10-30 西南电子技术研究所(中国电子科技集团公司第十研究所) Task-oriented text-generated image network model
WO2021042904A1 (en) * 2019-09-06 2021-03-11 平安国际智慧城市科技股份有限公司 Conversation intention recognition method, apparatus, computer device, and storage medium
CN112733533A (en) * 2020-12-31 2021-04-30 浙大城市学院 Multi-mode named entity recognition method based on BERT model and text-image relation propagation


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
DI QI, LIN SU, JIA SONG, EDWARD CUI, ET AL.: "ImageBERT: Cross-modal Pre-training with Large-scale Weak-supervised Image-Text Data", arXiv:2001.07966v1, pages 1-12 *



Similar Documents

Publication Publication Date Title
CN113254599B (en) Multi-label microblog text classification method based on semi-supervised learning
CN110750959B (en) Text information processing method, model training method and related device
CN109255118B (en) Keyword extraction method and device
CN107798140B (en) Dialog system construction method, semantic controlled response method and device
CN113268995B (en) Chinese academy keyword extraction method, device and storage medium
CN110134954B (en) Named entity recognition method based on Attention mechanism
CN107315734B (en) A kind of method and system to be standardized based on time window and semantic variant word
CN106503055A (en) A generation method from structured text to image description
CN111061861B (en) Text abstract automatic generation method based on XLNet
CN111581364B (en) Chinese intelligent question-answer short text similarity calculation method oriented to medical field
CN113408619B (en) Language model pre-training method and device
CN113220890A (en) Deep learning method combining news headlines and news long text contents based on pre-training
CN110929022A (en) Text abstract generation method and system
CN114880447A (en) Information retrieval method, device, equipment and storage medium
CN106227836B (en) Unsupervised joint visual concept learning system and unsupervised joint visual concept learning method based on images and characters
CN111813993A (en) Video content expanding method and device, terminal equipment and storage medium
CN114462385A (en) Text segmentation method and device
CN111191413B (en) Method, device and system for automatically marking event core content based on graph sequencing model
CN110852071A (en) Knowledge point detection method, device, equipment and readable storage medium
CN112949293B (en) Similar text generation method, similar text generation device and intelligent equipment
CN110347807B (en) Problem information processing method and device
CN112084788A (en) Automatic marking method and system for implicit emotional tendency of image captions
CN114722774B (en) Data compression method, device, electronic equipment and storage medium
Nandwalkar et al. Descriptive Handwritten Paper Grading System using NLP and Fuzzy Logic
CN114943236A (en) Keyword extraction method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
PE01 Entry into force of the registration of the contract for pledge of patent right

Denomination of invention: Language model pre-training method and device

Granted publication date: 20240213

Pledgee: Bank of Suzhou Co.,Ltd. Shishan road sub branch

Pledgor: Jiangsu Suyun Information Technology Co.,Ltd.

Registration number: Y2024980011833