CN115937615B - Topic label classification method and device based on multi-modal pre-training model - Google Patents

Topic label classification method and device based on multi-modal pre-training model

Info

Publication number
CN115937615B
CN115937615B (application CN202310134196.2A)
Authority
CN
China
Prior art keywords
topic
model
text
information
training
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202310134196.2A
Other languages
Chinese (zh)
Other versions
CN115937615A (en)
Inventor
尹俏
李飞阳
王霄坤
孟凡飞
薛娇
李大海
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhizhe Sihai Beijing Technology Co Ltd
Original Assignee
Zhizhe Sihai Beijing Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhizhe Sihai Beijing Technology Co Ltd
Priority to CN202310134196.2A
Publication of CN115937615A
Application granted
Publication of CN115937615B
Legal status: Active
Anticipated expiration

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application provides a topic label classification method and device based on a multi-modal pre-training model, wherein the method comprises the following steps: acquiring a training data set, wherein the training data set is topic label data obtained based on label clustering and generalization-degree ranking; training an initial multi-modal pre-training model on the training data set to obtain a trained multi-modal pre-training model; and converting the trained multi-modal pre-training model into an ONNX model and deploying the ONNX model to a target application program to classify topic labels. By constructing a high-accuracy training data set, applying serialization, parallelization and FFN (feed-forward network) processing on the text side, and fusing the picture-side and text-side feature vectors in advance, the training speed and subsequent inference speed of the model are increased and the model effect is improved; meanwhile, topic label classification of multi-modal data under different frameworks is realized through the ONNX model.

Description

Topic label classification method and device based on multi-modal pre-training model
Technical Field
The application relates to the technical field of computer applications, and in particular to a topic label classification method and device based on a multi-modal pre-training model.
Background
At present, the millions of questions and answers in online communities are generally categorized by generating topic labels, and accurate topic labels have strong business value in scenarios such as search, recommendation and marketing. User answers usually mix text, pictures, videos and other data; combining these modalities during processing lets the information complement itself, so that the generated topic labels are more accurate. Thus, how to match the content of such heterogeneous data with a set of suitable topic labels becomes a complex and important problem.
In the prior art, computation over multi-modal data mainly comprises multi-modal representation learning, multi-modal alignment, multi-modal mapping, multi-modal fusion and the like. For example, the LXMERT (Learning Cross-Modality Encoder Representations from Transformers) model uses Transformer encoders and a novel cross-modality encoder to learn vision-language relations, and is then pre-trained on a large-scale image-sentence-pair dataset with different tasks; the dual-stream model ViLBERT (Vision-and-Language Bidirectional Encoder Representations from Transformers) processes image and text inputs in separate streams and lets the streams interact at a later stage through co-attentional Transformer layers. However, LXMERT and ViLBERT are mainly aimed at English text, and their accuracy on Chinese text suffers. In addition, the co-attentional Transformer layers introduced by ViLBERT add extra computation for fusing image and text features, thereby reducing the inference speed of the model.
Disclosure of Invention
The present application provides a topic label classification method and device based on a multi-modal pre-training model, which accelerate the training and inference speed of the multi-modal pre-training model, improve the model effect, and enable the model to determine suitable topic labels for multi-modal data under different frameworks.
In a first aspect, the present application provides a topic label classification method based on a multi-modal pre-training model, the method comprising:
acquiring a training data set, wherein the training data set is topic label data obtained based on label clustering and generalization-degree ranking; the topic label data takes the form of content matched with a topic label, wherein the content comprises picture information and text information;
training an initial multi-modal pre-training model on the training data set to obtain a trained multi-modal pre-training model; the initial multi-modal pre-training model is a two-tower model comprising a picture-side model and a text-side model, wherein the picture-side model obtains a picture-side feature vector from the picture information, and the text-side model performs serialization processing and FFN processing on the text information to obtain a text-side feature vector;
and converting the trained multi-modal pre-training model into an ONNX model and deploying the ONNX model to a target application program to classify topic labels.
According to the topic label classification method based on the multi-modal pre-training model provided by the application, performing serialization processing and FFN processing on the text information to obtain the text-side feature vector specifically comprises: performing text serialization on the text information to obtain an initial word vector; segmenting the initial word vector into a plurality of text segments, wherein adjacent text segments overlap; parallelizing the text segments to obtain a serialization vector; and performing FFN processing on the serialization vector to obtain the text-side feature vector.
According to the topic label classification method based on the multi-modal pre-training model provided by the application, performing FFN processing on the serialization vector to obtain the text-side feature vector comprises: inputting the serialization vector into three FFN structures to obtain FFN weight coefficients, wherein the three FFN structures are a short-text FFN, a long-text FFN and a video-caption FFN; and obtaining the text-side feature vector based on the serialization vector and the FFN weight coefficients.
According to the topic label classification method based on the multi-modal pre-training model, the picture-side model is a 12-layer ViT model, the text-side model is a 12-layer BERT model, and the input of each layer of the picture-side model and the text-side model is a fusion vector of the previous layer's picture-side and text-side feature vectors.
According to the topic label classification method based on the multi-modal pre-training model provided by the application, obtaining topic label data based on label clustering and generalization-degree ranking specifically comprises: obtaining topic information from the question-and-answer content of a database; clustering the topic information with the K-means method to obtain K topics, each topic comprising a plurality of topic information; and obtaining topic label data based on the generalization degrees of the plurality of topic information of each topic.
According to the topic label classification method based on the multi-modal pre-training model provided by the application, obtaining topic label data based on the generalization degrees of the plurality of topic information of each topic comprises: acquiring matching content for each topic information based on the plurality of topic information of each topic; calculating the relevance of the matching content of each topic information and taking it as the generalization degree of the corresponding topic information; if more than one topic information matches a certain piece of content, ranking the matched topic information by generalization degree, taking the topic information with the lowest generalization degree as the positive topic label of the matching content, and removing the other matched topics; and determining the topic corresponding to the matching content based on its positive topic label, thereby constructing topic label data comprising the matching content and the topic.
According to the topic label classification method based on the multi-modal pre-training model provided by the application, acquiring matching content for each topic information based on the plurality of topic information of each topic comprises: using PMI computation and an AC automaton to match, for each topic information, a plurality of corresponding pieces of content from the question-and-answer content of the database as the matching content of that topic information.
In a second aspect, the present application further provides a topic label classification device based on a multi-modal pre-training model, the device comprising:
a data construction module for acquiring a training data set, wherein the training data set is topic label data obtained based on label clustering and generalization-degree ranking; the topic label data takes the form of content matched with a topic label, wherein the content comprises picture information and text information;
a model training module for training an initial multi-modal pre-training model on the training data set to obtain a trained multi-modal pre-training model; the initial multi-modal pre-training model is a two-tower model comprising a picture-side model and a text-side model, wherein the picture-side model obtains a picture-side feature vector from the picture information, and the text-side model performs serialization processing and FFN processing on the text information to obtain a text-side feature vector;
a model deployment module for converting the trained multi-modal pre-training model into an ONNX model and deploying the ONNX model to a target application program to classify topic labels.
In a third aspect, an embodiment of the present application further provides an electronic device comprising a memory and a processor, wherein the memory stores a computer program, and the processor, when running the computer program, executes the steps of any implementation of the above topic label classification method based on the multi-modal pre-training model.
In a fourth aspect, embodiments of the present application further provide a readable storage medium storing a computer program which, when executed on a processor, performs the steps of any implementation of the above topic label classification method based on the multi-modal pre-training model.
In summary, the topic label classification method and device based on the multi-modal pre-training model provided by the embodiments of the application construct a high-accuracy training data set through clustering and generalization-degree ranking. The two-tower model gives the model a richer semantic representation space; serialization and parallelization on the text side accelerate training and subsequent inference; and using different FFN structures lets the model extract text features of different modalities, improving the model effect. In addition, by designing each layer's input of the 12-layer ViT model and the 12-layer BERT model as a fusion vector of the previous layer's picture-side and text-side feature vectors, fusion of the two feature vectors is performed in advance, so that the model can learn these features from the bottom layer to the top layer during training, further improving the model effect. Converting the trained model into an ONNX model for deployment simplifies transferring the model between different frameworks.
Drawings
For a clearer description of the technical solutions in the present application or in the prior art, the drawings used in the description of the embodiments are briefly introduced below. It is apparent that the drawings described below illustrate some embodiments of the present application, and that other drawings may be obtained from them by a person skilled in the art without inventive effort.
FIG. 1 is a flow chart of a topic tag classification method based on a multi-modal pre-training model provided herein;
fig. 2 is a schematic flow chart of obtaining a feature vector of a text side by performing serialization processing and FFN processing based on the text information;
FIG. 3 is a flow chart of a method of acquiring a training data set provided herein;
FIG. 4 is a schematic structural diagram of a topic tag classification device based on a multi-modal pre-training model provided by the present application;
fig. 5 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Reference numerals: 400 - topic label classification device; 410 - data construction module; 420 - model training module; 430 - model deployment module; 500 - electronic device; 510 - memory; 520 - processor; 530 - bus.
Detailed Description
For the purposes of making the objects, technical solutions and advantages of the present application more apparent, the technical solutions in the present application will be clearly and completely described below with reference to the drawings in the present application, and it is apparent that the described embodiments are some, but not all, embodiments of the present application. All other embodiments, which can be made by one of ordinary skill in the art based on the embodiments herein without making any inventive effort, are intended to be within the scope of the present application.
Fig. 1 is a schematic flow chart of the topic label classification method based on a multi-modal pre-training model provided in the present application. As shown in fig. 1, the method includes:
s101, acquiring a training data set, wherein the training data set is topic label data obtained based on label clustering and generalization degree sequencing.
The theme tag data adopts a form of matching content with a theme tag, and the content comprises picture information and text information.
In the community, there are tens of millions of question information, hundreds of millions of answer information, and even tens of thousands of topic information. Due to the large number, there is some difficulty in classifying tags, and due to the excessive topic information, there is excessive topic matching for given data. Therefore, the topic information needs to be clustered and simplified through a topic label system, so that a training data set with high accuracy is constructed.
S102, training an initial multi-modal pre-training model on the training data set to obtain a trained multi-modal pre-training model.
The initial multi-modal pre-training model is a two-tower model comprising a picture-side model and a text-side model, wherein the picture-side model obtains a picture-side feature vector from the picture information, and the text-side model performs serialization processing and FFN (feed-forward network) processing on the text information to obtain a text-side feature vector. The picture-side model is a 12-layer ViT (Vision Transformer) model, the text-side model is a 12-layer BERT (Bidirectional Encoder Representations from Transformers) model, and the input of each layer of the picture-side model and the text-side model is a fusion vector of the previous layer's picture-side and text-side feature vectors.
Specifically, training the initial multi-modal pre-training model on the training data set includes: inputting the picture information of the training data into the picture-side model to obtain a picture-side feature vector; inputting the text information of the training data into the text-side model to obtain a text-side feature vector; fusing the picture-side and text-side feature vectors to obtain a fused feature vector; and inputting the fused feature vector into the classification layer of the initial multi-modal pre-training model to obtain predicted topic label classification scores.
It should be noted that the fused feature vector here is obtained by fusing the feature vector finally output by the 12-layer picture-side ViT model and the feature vector finally output by the 12-layer text-side BERT model. The statement that "the input of each layer of the picture-side model and the text-side model is a fusion vector of the previous layer's picture-side and text-side feature vectors" means that the feature outputs of the two models are fused at each preceding layer; since both the picture-side ViT model and the text-side BERT model have 12 layers, fusion vectors are computed at layers 2 through 11.
In some embodiments, after inputting the fused feature vector into the classification layer of the initial multi-modal pre-training model to obtain predicted topic label classification scores, the method further comprises: calculating the loss function of the model based on the predicted topic label classification scores; and optimizing the model by backpropagation based on the loss function. Since multi-topic label classification involves thousands of categories, the amount of training data differs greatly between categories, and some categories are inherently harder to distinguish. Therefore, the loss function of the initial multi-modal pre-training model can adopt Focal Loss, which modifies the binary cross-entropy loss by adding a class weight and a sample-difficulty modulating factor. This alleviates class imbalance and difficulty imbalance in classification and makes the model focus on hard-to-distinguish samples to improve accuracy: for example, if a piece of content carries label a but label a receives a low score during training, the model weights this hard sample more heavily and improves its ability to distinguish label a.
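As a concrete illustration, below is a minimal NumPy sketch of the focal-loss idea described above: a class weight α plus a difficulty modulating factor (1 − p)^γ on top of cross-entropy. The α and γ values and the toy logits are illustrative assumptions, not values from the patent.

```python
import numpy as np

def focal_loss(logits, target, alpha=0.25, gamma=2.0):
    """Multi-class focal loss: -alpha * (1 - p_t)^gamma * log(p_t),
    where p_t is the predicted probability of the true class."""
    logits = logits - logits.max()              # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    p_t = probs[target]
    return -alpha * (1.0 - p_t) ** gamma * np.log(p_t)

# A well-classified sample (high p_t) is down-weighted relative to plain
# cross-entropy; a hard sample (low p_t) keeps most of its loss.
easy = focal_loss(np.array([6.0, 0.0, 0.0]), target=0)
hard = focal_loss(np.array([0.5, 0.0, 0.0]), target=0)
```

With γ = 0 and α = 1 the expression reduces to ordinary cross-entropy, which is a quick sanity check on the modulating factor.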
It should be noted that, in the present application, both the picture-side ViT model and the text-side BERT model of the initial multi-modal pre-training model have 12 layers, and the feature outputs of the two models are fused at every layer: the input of the third-layer ViT on the picture side is the second-layer ViT output superimposed with the second-layer BERT output, and likewise the input of the third-layer BERT on the text side is the second-layer ViT output superimposed with the second-layer BERT output; that is, the fusion vector at each layer is the superposition of that layer's ViT output and BERT output. Compared with the prior art, the topic label classification method based on the multi-modal pre-training model fuses the picture-side and text-side feature vectors in advance, so that the model can learn these features from the bottom layer to the top layer during training, improving the model effect.
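The layer-by-layer fusion described above can be sketched as follows. The random linear maps stand in for real ViT/BERT layers, and additive fusion is used as one plausible reading of the "superposition" of the two towers' outputs; the dimension and layer count are toy values apart from the 12 layers stated in the text.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM, LAYERS = 8, 12

# Stand-ins for one layer of each tower: a fixed random linear map per
# layer (hypothetical; the real layers are ViT / BERT Transformer blocks).
img_layers = [rng.normal(size=(DIM, DIM)) * 0.1 for _ in range(LAYERS)]
txt_layers = [rng.normal(size=(DIM, DIM)) * 0.1 for _ in range(LAYERS)]

def fused_towers(img_feat, txt_feat):
    """Each layer k >= 2 of BOTH towers receives the fusion (sum) of both
    towers' layer k-1 outputs, as described for layers 2..11."""
    for k in range(LAYERS):
        img_out = img_layers[k] @ img_feat
        txt_out = txt_layers[k] @ txt_feat
        fused = img_out + txt_out          # early fusion at every layer
        img_feat, txt_feat = fused, fused
    return img_out + txt_out               # final fusion of the two towers

vec = fused_towers(rng.normal(size=DIM), rng.normal(size=DIM))
```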
The steps of obtaining the text-side feature vector by serialization processing and FFN processing of the text information are described below with reference to fig. 2. As shown in fig. 2, these steps specifically include:
Step a1: perform text serialization on the text information to obtain an initial word vector.
Specifically, the text information is tokenized to obtain a token sequence, and the token sequence is serialized to obtain an initial word vector token_input, whose length is generally 2048.
It should be noted that text information generally falls into short text, long text and video captions. Since short text makes up a relatively large share of the community's question-and-answer content, serialization is used for text processing here. In some embodiments, the serialization processing on the text side of the multi-modal pre-training model may be replaced with frame-extraction processing according to the actual situation; for example, if long text dominates the database, serialization performs relatively poorly, and frame-extraction processing may be considered instead.
Step a2: segment the initial word vector into a plurality of text segments, wherein adjacent segments overlap.
Specifically, the initial word vector token_input is segmented into a plurality of text segments, generally 8.
Step a3: parallelize the text segments to obtain a serialization vector.
Specifically, the text segments are reshaped into a batch and encoded simultaneously to obtain the serialization vector, realizing parallel processing of the text segments.
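Steps a2 and a3 can be sketched as below. The text fixes the sequence length (2048 tokens) and the number of segments (8) but not the overlap width, so the 32-token overlap and the derived segment length are illustrative assumptions.

```python
import numpy as np

SEQ_LEN, N_SEG, OVERLAP = 2048, 8, 32
SEG_LEN = (SEQ_LEN + (N_SEG - 1) * OVERLAP) // N_SEG   # 284 tokens per segment
STRIDE = SEG_LEN - OVERLAP                             # 252-token step

def segment(token_input):
    """Split a length-2048 token sequence into 8 overlapping segments and
    stack them into one batch, so they can be encoded in parallel."""
    assert len(token_input) == SEQ_LEN
    segs = [token_input[i * STRIDE : i * STRIDE + SEG_LEN]
            for i in range(N_SEG)]
    return np.stack(segs)          # shape (8, 284): one "reshaped" batch

batch = segment(np.arange(SEQ_LEN))
```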
Step a4: perform FFN processing on the serialization vector to obtain the text-side feature vector.
Performing FFN processing on the serialization vector to obtain the text-side feature vector includes: inputting the serialization vector into three FFN structures to obtain FFN weight coefficients; and obtaining the text-side feature vector based on the serialization vector and the FFN weight coefficients; wherein the three FFN structures are a short-text FFN, a long-text FFN and a video-caption FFN.
Specifically, the content in the training data set includes picture information and text information, and the text information mixes text data of different modalities such as short text, long text and very long text, which would all have to be fine-tuned within the same BERT. Therefore, in view of the modality types of question and answer information in the community, three different FFN structures are chosen: because questions are mostly short text, a short-text FFN is used; because answers may involve articles, videos and other types, a long-text FFN and a video-caption FFN are used, where the video-caption FFN mainly handles the textual introduction of videos in the training data set. During training of the initial multi-modal pre-training model on the training data set, text features of different modalities are extracted through the different FFN structures, improving the model effect.
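A minimal sketch of the three-FFN weighting follows. The text says the serialization vector is fed into the three FFN structures to obtain weight coefficients but does not spell out the gating mechanism, so a softmax gate is assumed here; all shapes and random weights are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
DIM = 16

def ffn(w1, w2, x):
    return w2 @ np.maximum(w1 @ x, 0.0)     # simple 2-layer feed-forward net

# One FFN per text modality: short text, long text, video captions.
ffns = {name: (rng.normal(size=(DIM, DIM)), rng.normal(size=(DIM, DIM)))
        for name in ("short", "long", "video")}
gate_w = rng.normal(size=(3, DIM))          # produces the FFN weight coefficients

def mixed_ffn(x):
    """Weight the three FFN outputs by softmax gate coefficients."""
    scores = gate_w @ x
    coeff = np.exp(scores - scores.max())
    coeff /= coeff.sum()                    # three weights summing to 1
    outs = [ffn(w1, w2, x) for (w1, w2) in ffns.values()]
    return sum(c * o for c, o in zip(coeff, outs)), coeff

out, coeff = mixed_ffn(rng.normal(size=DIM))
```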
In some embodiments, obtaining the picture-side feature vector from the picture information specifically includes:
Step b1: divide the picture information into a plurality of patches and map each patch to an embedding through a linear projection layer;
Step b2: add a one-dimensional position embedding to each patch embedding, and prepend a learnable classification token to the sequence to obtain an embedding sequence;
Step b3: input the embedding sequence into a multi-layer Transformer encoder to obtain the picture-side feature vector.
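Steps b1 and b2 can be sketched in NumPy as follows, assuming standard ViT-style dimensions (224x224 RGB input, 16x16 patches) that the text does not specify; the projection and embeddings are random stand-ins for learned parameters.

```python
import numpy as np

rng = np.random.default_rng(2)
IMG, PATCH, DIM = 224, 16, 64
N_PATCH = (IMG // PATCH) ** 2                              # 196 patches

proj = rng.normal(size=(PATCH * PATCH * 3, DIM)) * 0.01    # linear projection layer
cls_token = rng.normal(size=(1, DIM))                      # learnable [class] token
pos_embed = rng.normal(size=(N_PATCH + 1, DIM)) * 0.01     # 1-D position embedding

def embed(image):
    """Steps b1-b2: patchify, linearly project, prepend the class token,
    add position embeddings; the result feeds the Transformer encoder."""
    patches = (image.reshape(IMG // PATCH, PATCH, IMG // PATCH, PATCH, 3)
                    .transpose(0, 2, 1, 3, 4)
                    .reshape(N_PATCH, -1))
    tokens = patches @ proj                    # (196, DIM)
    tokens = np.vstack([cls_token, tokens])    # (197, DIM) with [class] first
    return tokens + pos_embed

seq = embed(rng.normal(size=(IMG, IMG, 3)))
```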
S103, converting the trained multi-modal pre-training model into an ONNX model, and deploying the ONNX model to a target application program to classify topic labels.
Here ONNX stands for Open Neural Network Exchange, and there may be one or more target application programs.
Specifically, different platforms are typically deployed using different frameworks. Step S102, training the initial multi-modal pre-training model on the training data set, is typically performed under the PyTorch framework, while the trained multi-modal pre-training model needs to be integrated into existing Java/Python applications for inference. Therefore, the trained multi-modal pre-training model is converted into an ONNX model and then deployed to the target application program to classify topic labels, so that the model can be trained in one framework and transferred to another for inference, simplifying the process of moving the model between different AI toolchains.
Specifically, deploying the ONNX model to a target application program to classify topic labels includes the following steps:
Step 1031: deploy the ONNX model to the target application program;
Step 1032: acquire question-and-answer information on the target application program;
The question-and-answer information may be question information or answer information.
Step 1033: input the question-and-answer information into the ONNX model to obtain predicted topic label classification scores.
Specifically, the ONNX model produces a picture-side feature vector and a text-side feature vector; the two are fused into a fused feature vector; and the fused feature vector is input into the classification layer of the ONNX model to obtain the predicted topic label classification scores.
Step 1034: determine the topic label of the question-and-answer information based on the predicted topic label classification scores.
Fig. 3 is a flow chart of a method for obtaining the training data set, where the training data set is topic label data obtained based on label clustering and generalization-degree ranking. As shown in fig. 3, obtaining the topic label data specifically includes the following steps:
s301, topic information of question and answer contents of a database is acquired.
The question and answer content of the database comprises question information and answer information; the question information includes text, pictures, etc., and the answer information includes text, articles, pictures, videos, etc.
In the process of classifying the topic labels for the answer information, the question information corresponding to each answer information can be combined for more accurate classification.
S302, clustering the topic information by adopting a K-means clustering method to obtainKTopics, and each topic includes a plurality of topic information.
Specifically, it can be appreciated that similar topic information is clustered under the same topic by a clustering method, so that repetitive topic information is reduced. The clustering method adopted in the embodiment of the application is K-means, and other clustering methods can be adopted in other embodiments, and the application is not particularly limited. In the clustering process adopting the K-means clustering method, converting each topic information into a topic information word vector through a word2vec technology, and projecting each topic information word vector to the corresponding topic information word vectornIn the dimensional space, the point is regarded as one point, so that subsequent clustering calculation is facilitated.
In some embodiments, step S302 specifically includes the steps of:
S3021: randomly select K points as the initial cluster centers;
S3022: compute the Euclidean distance from every other point to each of the K initial cluster centers, and assign each point to the cluster of its nearest center, yielding K clusters;
S3023: for each of the K clusters, compute the mean of all points in the cluster, and take the K means as the new K cluster centers;
S3024: repeat S3022 and S3023 until the preset number of iterations is reached, and take the resulting K clusters as final.
It should be noted that the preset number of iterations may be set to 50; in some embodiments, iteration may also stop early once the cluster centers no longer change.
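The steps above can be sketched as a minimal K-means implementation, assuming the topic information has already been converted into word vectors (e.g. via word2vec); the function and variable names and the early-stopping check are illustrative:

```python
import numpy as np

def kmeans(points, k, n_iters=50, seed=0):
    """Cluster n-dimensional points into k clusters (steps S3021-S3024)."""
    points = np.asarray(points, dtype=float)
    rng = np.random.default_rng(seed)
    # S3021: randomly select K points as the initial cluster centers
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iters):  # preset number of iterations, e.g. 50
        # S3022: Euclidean distance from every point to each center;
        # assign each point to the cluster of its nearest center
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # S3023: recompute each center as the mean of its cluster's points
        new_centers = np.array([
            points[labels == j].mean(axis=0) if (labels == j).any() else centers[j]
            for j in range(k)
        ])
        # optional early stop once the cluster centers no longer change
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers
```

For well-separated topic vectors this converges long before 50 iterations, which is why the early-stop variant mentioned above is often preferred.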
S303: acquire topic label data based on the generalization degrees of the plurality of pieces of topic information of each topic.
Specifically, the K topics obtained in step S302 still contain a large amount of topic information, which needs to be condensed in order to build a high-coverage topic label system whose condensed topic information accurately covers the possible classifications of the database's question-and-answer content. Therefore, the topic information is condensed based on the generalization degrees of the plurality of pieces of topic information under each topic, yielding topic label data with high accuracy.
In some embodiments, step S303 specifically includes the steps of:
S3031: acquire the matching content of each piece of topic information based on the plurality of pieces of topic information of each topic.
Specifically, based on the plurality of pieces of topic information of each topic, PMI (Pointwise Mutual Information) computation and an AC automaton (Aho-Corasick automaton) are used to match, from the question-and-answer content of the database, a plurality of pieces of content for each piece of topic information as its matching content.
S3032: calculate the correlation degree of the matching content of each piece of topic information, and take it as the generalization degree of the corresponding topic information.
Specifically, each piece of topic information corresponds to a plurality of pieces of matching content; the closer the correlation degree is to 1, the more closely related the matching contents are, i.e. the lower the generalization degree of the topic information. If the content data under a piece of topic information are more diverse, its generalization degree is higher; if the content data are more similar to one another, its generalization degree is lower.
Taking one piece of topic information as an example, calculating the correlation degree of its matching content comprises the following steps:
step c1: vector-encode the plurality of matching contents to obtain their vector representations;
step c2: average the vector representations of the matching contents and divide by the L2 norm to obtain the center of the topic information;
step c3: take the dot product of each matching content's vector representation with the center and average the results to obtain the correlation degree of the matching contents.
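Steps c1 to c3 can be sketched as follows. The encoder producing the vector representations in c1 is not specified by this description, so the function below simply takes the already-encoded vectors; the per-vector normalization is an added assumption, made to keep the dot products in [-1, 1]:

```python
import numpy as np

def correlation_degree(content_vecs):
    """Compute the correlation degree of one topic's matching contents.

    content_vecs: vector representations of the matching contents (step c1).
    Returns the per-content dot products with the topic center and their mean.
    """
    vecs = np.asarray(content_vecs, dtype=float)
    # c2: average the vector representations and divide by the L2 norm
    # to obtain the center of the topic information
    center = vecs.mean(axis=0)
    center = center / np.linalg.norm(center)
    # assumption: L2-normalize each content vector as well
    vecs = vecs / np.linalg.norm(vecs, axis=1, keepdims=True)
    # c3: dot each content vector with the center and average the results
    sims = vecs @ center
    return sims, sims.mean()
```

A mean close to 1 means the matching contents are tightly related, i.e. the topic information has a low generalization degree.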
S3033: if more than one piece of topic information matches a given piece of matching content, rank the matched topic information by generalization degree, take the topic information with the lowest generalization degree as the topic positive label of that matching content, and discard the other matched topics.
Specifically, the plurality of pieces of topic information matched to a given piece of matching content are ranked by generalization degree, and only the topic information with the lowest generalization degree is kept as the topic positive label of that matching content, since it is closest to the matching content, i.e. it best summarizes the matching content.
Because each piece of topic information corresponds to a plurality of pieces of matching content, any given piece of matching content may be matched by a plurality of pieces of topic information; the redundant topic information of each piece of matching content therefore needs to be removed.
S3034: determine the theme corresponding to the matching content based on its topic positive label, and construct topic label data comprising the matching content and the theme.
Specifically, step S3033 yields pairs of matching content and topic information; to build a high-accuracy topic label system, these pairs also need to be combined with the K topics obtained in S302: the matching content corresponding to each topic positive label is added to the cluster theme in which that positive label resides, thereby constructing topic label data comprising matching content and theme.
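Steps S3033 and S3034 amount to a simple selection over candidate topics followed by a lookup into the S302 clusters; the data shapes below are illustrative, and the sketch assumes the stored generalization score increases with generalization degree (so the minimum is kept):

```python
def build_topic_label_data(content_to_topics, generalization, topic_to_theme):
    """For each piece of matching content, keep only the matched topic
    information with the lowest generalization degree as the topic positive
    label (S3033), then map the positive label to its cluster theme to form
    (matching content, theme) pairs (S3034)."""
    labeled = []
    for content, candidates in content_to_topics.items():
        # S3033: rank candidates by generalization degree, keep the lowest
        positive = min(candidates, key=lambda t: generalization[t])
        # S3034: the theme is the cluster in which the positive label resides
        labeled.append((content, topic_to_theme[positive]))
    return labeled
```

The resulting (matching content, theme) pairs are exactly the topic label data used as the training data set.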
According to the above method for acquiring a training data set, a large amount of topic information is clustered by the K-means clustering method and annotation data are generated from the topic information, so that labeled data are obtained without any manual annotation; compared with a traditional supervised-learning classification task, no manually labeled data are needed, reducing labor cost. In addition, the condensed topic information is ranked by generalization degree so that it covers all possible classifications as accurately as possible, ensuring the accuracy of the annotation data; the resulting training data set is therefore topic label data with high accuracy.
Fig. 4 is a schematic structural diagram of a topic label classification device based on a multi-modal pre-training model provided in the present application, which can be used to implement the methods described in the above embodiments. As shown in Fig. 4, the apparatus includes:
the data construction module 410, configured to acquire a training data set, where the training data set is topic label data obtained based on label clustering and generalization degree ranking; the topic label data takes the form of matching content paired with a topic label, where the content comprises picture information and text information;
the model training module 420, configured to train an initial multi-modal pre-training model on the training data set to obtain a trained multi-modal pre-training model; the initial multi-modal pre-training model is a two-tower model comprising a picture-side model and a text-side model, where the picture-side model obtains a picture-side feature vector from the picture information, and the text-side model performs serialization processing and FFN processing on the text information to obtain a text-side feature vector;
the model deployment module 430, configured to convert the trained multi-modal pre-training model into an ONNX model and deploy the ONNX model to a target application program to implement classification of topic labels.
For a detailed description of the above topic label classification device based on the multi-modal pre-training model, refer to the description of the corresponding method steps in the above embodiments; repeated details are omitted. The apparatus embodiments described above are merely illustrative: the modules described as separate components may or may not be physically separate, and may be implemented as a combination of software and/or hardware realizing the intended functions. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment, and those of ordinary skill in the art can understand and implement it without undue burden.
Fig. 5 is a schematic structural diagram of an electronic device provided in the present application. As shown in Fig. 5, the electronic device 500 includes a memory 510 and a processor 520 connected by a bus 530. The memory 510 stores a computer program; when the processor 520 reads and runs the program, the electronic device 500 can execute all or part of the flow of the methods in the above embodiments, so as to implement topic label classification based on the multi-modal pre-training model.
Embodiments of the present application also provide a readable storage medium in which a computer program is stored; when the program runs on a processor, it performs the steps of the topic label classification method based on the multi-modal pre-training model.
It should be understood that the electronic device may be any device with logic computing capability, such as a personal computer, a tablet computer, or a smartphone; the readable storage medium may be a ROM (Read-Only Memory), a RAM (Random Access Memory), a magnetic disk, an optical disk, or the like.
Finally, it should be noted that the above embodiments are only intended to illustrate the technical solutions of the present application, not to limit them. Although the present application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments can still be modified, or some of their technical features can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present application, and shall fall within the scope of protection of the present application.

Claims (7)

1. A method for classifying topic labels based on a multi-modal pre-training model, the method comprising:
acquiring a training dataset, the acquiring the training dataset comprising: acquiring topic label data based on the generalization degree of a plurality of topic information of each topic; the theme tag data adopts a form of matching content with a theme tag, wherein the content comprises picture information and text information;
training the initial multi-mode pre-training model based on the training data set to obtain a trained multi-mode pre-training model; the initial multi-mode pre-training model is a double-tower model comprising a picture side model and a text side model, wherein the picture side model is used for obtaining a characteristic vector of a picture side based on the picture information, and the text side model is used for carrying out serialization processing and FFN processing based on the text information to obtain the characteristic vector of a text side;
converting the trained multi-mode pre-training model into an ONNX model, and deploying the ONNX model to a target application program to realize classification of the theme labels;
the obtaining the topic label data based on the generalization degree of the topic information of each topic comprises the following steps:
acquiring matching content of each topic information based on the topic information of each topic;
calculating the correlation degree of the matching content of each topic information, and taking the correlation degree as the generalization degree of the corresponding topic information;
if more than one piece of topic information matches a given piece of matched content, ranking the matched topic information based on generalization degree, regarding the topic information with the lowest generalization degree as the topic positive label of the matched content, and removing the other matched topics;
determining a topic corresponding to the matching content based on the topic positive tag of the matching content, and constructing topic tag data comprising the matching content and the topic;
the step of obtaining the feature vector of the text side by carrying out serialization processing and FFN processing based on the text information specifically comprises the following steps:
performing text serialization processing based on the text information to obtain an initial word vector;
based on the initial word vector, performing text segmentation processing to obtain a plurality of text segments, wherein each text segment has an overlapping part;
parallelizing the text segments to obtain a serialization vector;
FFN processing is carried out on the serialization vector to obtain a characteristic vector of a text side;
the FFN processing the serialized vector to obtain a feature vector of a text side includes:
inputting the serialization vector into three FFN structures to obtain FFN weight coefficients; wherein the three FFN structures are a short text FFN, a long text FFN and a video character FFN;
and obtaining a characteristic vector of the text side based on the serialization vector and the FFN weight coefficient.
2. The method of claim 1, wherein the picture side model is a 12-layer ViT model, the text side model is a 12-layer BERT model, and the input of each layer of the picture side model and the text side model is a fusion vector of the previous layer's picture-side feature vector and text-side feature vector.
3. The method of claim 1, wherein prior to the obtaining topic tag data based on the degree of generalization of the topic information for each topic, the obtaining a training data set further comprises:
obtaining topic information of question and answer contents of a database;
and clustering the topic information by adopting a K-means clustering method to obtain K topics, wherein each topic comprises a plurality of topic information.
4. The method according to claim 1, wherein the obtaining matching content of each topic information based on the plurality of topic information of each topic comprises: based on the topic information of each topic, adopting PMI calculation and AC automaton to match a plurality of corresponding contents for each topic information from the question-answer contents of the database as matching contents of each topic information.
5. A topic tag classification device based on a multi-modal pre-training model, the device comprising:
the data construction module is used for acquiring a training data set, and the acquiring of the training data set comprises the following steps: acquiring topic label data based on the generalization degree of a plurality of topic information of each topic; the theme tag data adopts a form of matching content with a theme tag, wherein the content comprises picture information and text information;
the model training module is used for training the initial multi-mode pre-training model based on the training data set to obtain a trained multi-mode pre-training model; the initial multi-mode pre-training model is a double-tower model comprising a picture side model and a text side model, wherein the picture side model is used for obtaining a characteristic vector of a picture side based on the picture information, and the text side model is used for carrying out serialization processing and FFN processing based on the text information to obtain the characteristic vector of a text side;
the model deployment module is used for converting the trained multi-mode pre-training model into an ONNX model, deploying the ONNX model to a target application program and realizing classification of the theme labels;
the obtaining the topic label data based on the generalization degree of the topic information of each topic comprises the following steps:
acquiring matching content of each topic information based on the topic information of each topic;
calculating the correlation degree of the matching content of each topic information, and taking the correlation degree as the generalization degree of the corresponding topic information;
if more than one piece of topic information matches a given piece of matched content, ranking the matched topic information based on generalization degree, regarding the topic information with the lowest generalization degree as the topic positive label of the matched content, and removing the other matched topics;
determining a topic corresponding to the matching content based on the topic positive tag of the matching content, and constructing topic tag data comprising the matching content and the topic;
the step of obtaining the feature vector of the text side by carrying out serialization processing and FFN processing based on the text information specifically comprises the following steps:
performing text serialization processing based on the text information to obtain an initial word vector;
based on the initial word vector, performing text segmentation processing to obtain a plurality of text segments, wherein each text segment has an overlapping part;
parallelizing the text segments to obtain a serialization vector;
FFN processing is carried out on the serialization vector to obtain a characteristic vector of a text side;
the FFN processing the serialized vector to obtain a feature vector of a text side includes:
inputting the serialization vector into three FFN structures to obtain FFN weight coefficients; wherein the three FFN structures are a short text FFN, a long text FFN and a video character FFN;
and obtaining a characteristic vector of the text side based on the serialization vector and the FFN weight coefficient.
6. An electronic device comprising a memory storing a computer program and a processor executing the method of topic label classification based on a multimodal pre-training model of any of claims 1 to 4 when the computer program is run.
7. A readable storage medium, characterized in that it has stored therein a computer program which, when run on a processor, performs the topic label classification method based on a multimodal pre-training model as claimed in any of claims 1 to 4.
CN202310134196.2A 2023-02-20 2023-02-20 Topic label classification method and device based on multi-mode pre-training model Active CN115937615B (en)

Publications (2)

CN115937615A, published 2023-04-07
CN115937615B (grant), published 2023-05-16




Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination
GR01: Patent grant