CN113762321A - Multi-modal classification model generation method and device - Google Patents
Multi-modal classification model generation method and device
- Publication number
- CN113762321A (application number CN202110394335.6A)
- Authority
- CN
- China
- Prior art keywords
- threshold
- classification model
- modal
- module
- classification
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/31—Indexing; Data structures therefor; Storage structures
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
Abstract
The disclosure provides a method and an apparatus for generating a multi-modal classification model, and relates to the technical field of artificial intelligence. One embodiment of the method comprises: acquiring a preset sample set, wherein each sample comprises at least two sub-samples of different modalities; acquiring a pre-established multi-modal fusion network comprising a threshold module, a modal fusion module and at least two classification models, each of which classifies data of a different modality; selecting samples from the sample set; inputting the sub-samples of different modalities of a sample into the classification models corresponding to those modalities to obtain the feature vectors output by the classification models; extracting a threshold vector from all the feature vectors through the threshold module; inputting all the feature vectors and the threshold vector into the modal fusion module; and, in response to determining that the multi-modal fusion network satisfies a training completion condition, taking the multi-modal fusion network as the multi-modal classification model. This embodiment improves the classification of multi-modal targets.
Description
Technical Field
The present disclosure relates to the field of computer technologies, and in particular, to a method and an apparatus for generating a multi-modal classification model, a method and an apparatus for classifying a multi-modal object, an electronic device, a computer-readable medium, and a computer program product.
Background
Data exists in many modalities, such as images, text, video and audio. Because the algorithms, principles and application ranges suited to each modality differ greatly, most traditional models process data of a single modality in isolation. In reality, however, much data exists in two or more modalities simultaneously. For example, an e-commerce product has both an image modality and a text modality; the data of the two modalities explain and complement each other, and missing either one may bias the understanding of the product.
Disclosure of Invention
Embodiments of the present disclosure propose a multimodal classification model generation method and apparatus, an electronic device, a computer readable medium, and a computer program product.
In a first aspect, an embodiment of the present disclosure provides a method for generating a multi-modal classification model, the method comprising: acquiring a preset sample set, wherein the sample set comprises at least one sample and each sample comprises at least two sub-samples of different modalities; acquiring a pre-established multi-modal fusion network comprising a threshold module, a modal fusion module and at least two classification models, each classification model classifying data of a different modality; and performing the following training steps: selecting samples from the sample set; inputting the sub-samples of different modalities of a sample into the classification models corresponding to those modalities to obtain the feature vectors output by the classification models; extracting a threshold vector from all the feature vectors through the threshold module; inputting all the feature vectors and the threshold vector into the modal fusion module; and, in response to determining that the multi-modal fusion network satisfies a training completion condition, taking the multi-modal fusion network as the multi-modal classification model.
In some embodiments, the training completion condition of the multi-modal fusion network includes: training each classification model, the threshold module and the modal fusion module in the multi-modal fusion network in turn; and, after all the classification models, the threshold module and the modal fusion module have been trained, training all modules of the multi-modal fusion network simultaneously until training of the multi-modal fusion network is completed.
In some embodiments, training each classification model, the threshold module and the modal fusion module in the multi-modal fusion network in turn includes: cutting off gradient back-propagation through the threshold module and, for each classification model in the multi-modal fusion network, setting the threshold vector of the threshold module to the constant value corresponding to that classification model and training the classification model to obtain a trained classification model; and, after all the classification models have been trained, reconnecting gradient back-propagation through the threshold module, fixing the parameters of all the trained classification models, and training the threshold module and the modal fusion module to obtain a trained threshold module and a trained modal fusion module.
In some embodiments, the classification models include an image classification model and a text classification model, and setting the threshold vector of the threshold module to the constant value corresponding to each classification model and training that classification model includes: setting the threshold vector that the threshold module allocates to the image classification model to an all-ones vector of a set dimension and the threshold vector allocated to the text classification model to an all-zeros vector of that dimension, and training the image classification model to obtain a trained image classification model; then setting the threshold vector allocated to the text classification model to an all-ones vector of the set dimension and the threshold vector allocated to the image classification model to an all-zeros vector of that dimension, and training the text classification model to obtain a trained text classification model.
In some embodiments, the modal fusion module fuses all the feature vectors and the threshold vector using the following formula:

h_i = g_i · x_i^(img) + (1 − g_i) · x_i^(txt)

where i is an integer from 0 to 511 (the index of a 512-dimensional vector), x_i^(img) is the i-th value of the feature vector of the image classification model, x_i^(txt) is the i-th value of the feature vector of the text classification model, g_i is the i-th value of the threshold vector of the threshold module, and h_i is the i-th value of the vector fused by the modal fusion module.
In some embodiments, the threshold module comprises two first threshold sub-modules and one second threshold sub-module connected in series. Each first threshold sub-module includes a fully connected layer, a batch normalization layer and a first activation layer; the second threshold sub-module includes a fully connected layer and a second activation layer, wherein the activation function of the first activation layer differs from that of the second activation layer.
In a second aspect, an embodiment of the present disclosure provides a method for multi-modal object classification, including: acquiring a target to be classified, the target having data of at least two different modalities; and inputting the target into a multi-modal classification model generated by the method described in any implementation of the first aspect to obtain a classification result output by the multi-modal classification model.
In a third aspect, an embodiment of the present disclosure provides a multi-modal classification model generation apparatus, including: a sample acquisition unit configured to acquire a preset sample set, wherein the sample set comprises at least one sample and each sample comprises at least two sub-samples of different modalities; a network acquisition unit configured to acquire a pre-established multi-modal fusion network comprising a threshold module, a modal fusion module and at least two classification models, each classification model classifying data of a different modality; a selecting unit configured to select samples from the sample set; an input unit configured to input the sub-samples of different modalities of a sample into the classification models corresponding to those modalities to obtain the feature vectors output by the classification models; an extraction unit configured to extract a threshold vector from all the feature vectors through the threshold module; a fusion unit configured to input all the feature vectors and the threshold vector into the modal fusion module; and an output unit configured to take the multi-modal fusion network as the multi-modal classification model in response to determining that the multi-modal fusion network satisfies a training completion condition.
In some embodiments, the output unit includes: the single training subunit is configured to train each classification model, the threshold module and the modal fusion module in the multi-modal fusion network in sequence; and the training subunit is configured to train all the modules in the multi-mode fusion network at the same time after the training of all the classification models, the threshold module and the modal fusion module is completed until the training of the multi-mode fusion network is completed.
In some embodiments, the single training subunit is further configured to cut off gradient back propagation of the threshold module, and for each classification model in the multi-modal fusion network, set the threshold vector of the threshold module to a constant value corresponding to the classification model; training the classification model to obtain the trained classification model; and after all the classification models are trained, connecting the gradient back propagation of the threshold module, fixing the parameters of all the trained classification models, and training the threshold module and the modal fusion module to obtain the trained threshold module and the trained modal fusion module.
In some embodiments, the classification models include an image classification model and a text classification model, and the single training subunit is further configured to: set the threshold vector that the threshold module allocates to the image classification model to an all-ones vector of a set dimension and the threshold vector allocated to the text classification model to an all-zeros vector of that dimension, and train the image classification model to obtain a trained image classification model; and set the threshold vector allocated to the text classification model to an all-ones vector of the set dimension and the threshold vector allocated to the image classification model to an all-zeros vector of that dimension, and train the text classification model to obtain a trained text classification model.
In some embodiments, the modal fusion module fuses all the feature vectors and the threshold vector using the following formula:

h_i = g_i · x_i^(img) + (1 − g_i) · x_i^(txt)

where i is an integer from 0 to 511 (the index of a 512-dimensional vector), x_i^(img) is the i-th value of the feature vector of the image classification model, x_i^(txt) is the i-th value of the feature vector of the text classification model, g_i is the i-th value of the threshold vector of the threshold module, and h_i is the i-th value of the vector fused by the modal fusion module.
In some embodiments, the threshold module comprises two first threshold sub-modules and one second threshold sub-module connected in series. Each first threshold sub-module includes a fully connected layer, a batch normalization layer and a first activation layer; the second threshold sub-module includes a fully connected layer and a second activation layer, wherein the activation function of the first activation layer differs from that of the second activation layer.
In a fourth aspect, an embodiment of the present disclosure provides a multi-modal object classification apparatus, including: an acquisition unit configured to acquire a target to be classified, the target having at least two different modality data; and the classification unit is configured to input a target into the multi-modal classification model generated by adopting the method described in any one of the implementation manners in the first aspect, and obtain a classification result output by the multi-modal classification model.
In a fifth aspect, an embodiment of the present disclosure provides an electronic device, including: one or more processors; a storage device having one or more programs stored thereon; when the one or more programs are executed by the one or more processors, the one or more processors are caused to implement the method as described in any implementation of the first aspect.
In a sixth aspect, embodiments of the present disclosure provide a computer-readable medium on which a computer program is stored, which when executed by a processor, implements the method as described in any of the implementations of the first aspect.
In a seventh aspect, embodiments of the present disclosure provide a computer program product comprising a computer program that, when executed by a processor, implements the method as described in any of the implementations of the first aspect.
According to the multi-modal classification model generation method and apparatus provided by the embodiments of the present disclosure, a preset sample set is first acquired; secondly, a pre-established multi-modal fusion network is acquired; thirdly, samples are selected from the sample set; and finally, the multi-modal fusion network is trained with the samples to obtain the multi-modal classification model. The training step includes: inputting the sub-samples of different modalities of a sample into the classification models corresponding to those modalities to obtain the feature vectors output by the classification models; extracting a threshold vector from all the feature vectors through the threshold module; inputting all the feature vectors and the threshold vector into the modal fusion module; and taking the multi-modal fusion network as the multi-modal classification model in response to determining that it satisfies the training completion condition. The resulting multi-modal classification model encourages the multi-modal data to complement and explain one another during training, which improves the accuracy of multi-modal target classification.
Drawings
Other features, objects and advantages of the disclosure will become more apparent upon reading of the following detailed description of non-limiting embodiments thereof, made with reference to the accompanying drawings in which:
FIG. 1 is an exemplary system architecture diagram in which one embodiment of the present disclosure may be applied;
FIG. 2 is a flow diagram for one embodiment of a multi-modal classification model generation method according to the present disclosure;
FIG. 3 is a schematic diagram of a multi-modal fusion network provided by the present disclosure;
FIG. 4 is a schematic diagram of another configuration of a multimodal fusion network provided by the present disclosure;
FIG. 5 is a flow diagram for one embodiment of a multi-modal object classification method according to the present disclosure;
FIG. 6 is a schematic structural diagram of an embodiment of a multi-modal classification model generation apparatus according to the present disclosure;
FIG. 7 is a schematic structural diagram of an embodiment of a multi-modal object classification apparatus according to the present disclosure;
FIG. 8 is a schematic structural diagram of an electronic device suitable for use in implementing embodiments of the present disclosure.
Detailed Description
The present disclosure is described in further detail below with reference to the accompanying drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the relevant invention and not restrictive of the invention. It should be noted that, for convenience of description, only the portions related to the related invention are shown in the drawings.
It should be noted that, in the present disclosure, the embodiments and features of the embodiments may be combined with each other without conflict. The present disclosure will be described in detail below with reference to the accompanying drawings in conjunction with embodiments.
Fig. 1 illustrates an exemplary system architecture 100 of a multi-modal classification model generation method, a multi-modal classification model generation apparatus, a multi-modal object classification method, a multi-modal object classification apparatus, to which embodiments of the present disclosure may be applied.
As shown in fig. 1, the system architecture 100 may include terminals 101, 102, a network 103, a database server 104, and a server 105. The network 103 is a medium for providing communication links between the terminals 101, 102, the database server 104 and the server 105. Network 103 may include various connection types, such as wired, wireless communication links, or fiber optic cables, to name a few.
The user 110 may use the terminals 101, 102 to interact with the server 105 over the network 103 to receive or send messages or the like. The terminals 101 and 102 may have various client applications installed thereon, such as a model training application, an image conversion application, a shopping application, a payment application, a web browser, an instant messenger, and the like.
Here, the terminals 101 and 102 may be hardware or software. When the terminals 101, 102 are hardware, they may be various electronic devices having a display screen, including but not limited to smart phones, tablet computers, e-book readers, laptop portable computers, desktop computers, and the like. When the terminals 101 and 102 are software, they can be installed in the electronic devices listed above. It may be implemented as multiple pieces of software or software modules (e.g., to provide distributed services) or as a single piece of software or software module. And is not particularly limited herein.
When the terminals 101 and 102 are hardware, a target information collecting device may be further installed thereon. The target information acquisition device can be various devices capable of acquiring multi-modal information (images, texts, videos and audios), such as a camera, a sensor and the like. The user 110 can utilize the object information collecting device on the terminal 101, 102 to collect the objects of the multiple modalities to be classified.
The server 105 may also be a server providing various services, such as a background server providing support for various applications displayed on the terminals 101, 102. The background server may train the initial model using the samples in the sample set sent by the terminals 101 and 102, and may send the training result (e.g., the generated multi-modal classification model) to the terminals 101 and 102. In this way, the user can apply the generated multi-modal classification model for object classification.
Here, the database server 104 and the server 105 may be hardware or software. When they are hardware, they can be implemented as a distributed server cluster composed of a plurality of servers, or as a single server. When they are software, they may be implemented as multiple pieces of software or software modules (e.g., to provide distributed services), or as a single piece of software or software module. And is not particularly limited herein.
It should be noted that the multi-modal classification model generation method or the multi-modal object classification method provided by the embodiment of the present disclosure is generally executed by the server 105. Accordingly, a multi-modal classification model generation means or a multi-modal object classification means is also typically provided in the server 105.
It is noted that database server 104 may not be provided in system architecture 100, as server 105 may perform the relevant functions of database server 104.
It should be understood that the number of terminals, networks, database servers, and servers in fig. 1 are merely illustrative. There may be any number of terminals, networks, database servers, and servers, as desired for implementation.
Referring to FIG. 2, a flow 200 of one embodiment of a multi-modal classification model generation method according to the present disclosure is shown, the multi-modal classification model generation method comprising the steps of:
Step 201, acquiring a preset sample set.

In this embodiment, the executing agent of the multi-modal classification model generation method (e.g., the server 105 shown in Fig. 1) may obtain the sample set in various ways. For example, it may obtain an existing sample set stored in a database server (e.g., the database server 104 shown in Fig. 1) through a wired or wireless connection. As another example, a user may collect samples via a terminal (e.g., the terminals 101 and 102 shown in Fig. 1); the executing agent then receives the samples collected by the terminal and stores them locally, thereby generating the sample set.
Here, the sample set may include at least one sample, and a sample may include at least two sub-samples of different modalities. A modality refers to a manner in which an object occurs or exists, so sub-samples of different modalities are embodied as data of different modalities; for example, a sub-sample may be text data, image data, video data or audio data. A sample is a combination of two or more sub-samples of different modalities, e.g., a sample comprising text data, image data and audio data.

The modality types of the sub-samples in a sample are not limited herein and may be any combination; for the specific implementation, refer to the sample selection in step 203.
Step 202, acquiring a pre-established multi-modal fusion network.

In this embodiment, a pre-established multi-modal fusion network with the structure shown in Fig. 3 is acquired. The multi-modal fusion network includes a threshold module, a modal fusion module and at least two classification models, each of which classifies data of a different modality. In Fig. 3, the at least two classification models are the first classification model through the N-th classification model, where N ≥ 2; they classify N kinds of modal data to obtain their respective classification results.
In this embodiment, data of different modalities have different characteristics, so the at least two classification models have different structures. For example, text data is sequential and there are logical relationships between preceding and following words, so the corresponding classification model may use a recurrent neural network to "reason over" the text and obtain its semantic information. Image data has completely different characteristics; it may be processed with a convolutional neural network, in which multiple convolution kernels capture local information and the semantic information of the picture is extracted layer by layer.
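As a purely illustrative sketch (not taken from the patent), a minimal convolutional image encoder of the kind described above could look as follows in PyTorch; the layer sizes and the output feature dimension are assumptions.

```python
import torch
import torch.nn as nn

class ImageEncoder(nn.Module):
    """Toy CNN image branch: stacked convolution kernels capture local
    information and extract image semantics layer by layer; the final
    vector serves as the image classification model's feature vector."""

    def __init__(self, feature_dim: int = 2048):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(64, 128, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(128, feature_dim, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),   # -> (batch, feature_dim, 1, 1)
        )

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        # images: (batch, 3, H, W) -> feature vector of shape (batch, feature_dim)
        return self.features(images).flatten(1)
```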
It should be noted that, the at least two classification models may be classification models trained by using the preset sample set, for example, the classification models classify images (unknown types) of products, and output feature vectors of the images belonging to types of clothes, shoes, socks, and the like; alternatively, the at least two classification models may also be models that have not been trained by the sample.
In this embodiment, the threshold module is configured to obtain a threshold vector from the output vectors of all the classification models; the threshold vector can be regarded as the weight of each classification model's feature vector among all the feature vectors. The output vectors of all the classification models are concatenated and fed into the threshold module to obtain the threshold vector, which is then input to the modal fusion module and used to fuse the output vectors of all the classification models.
In some optional implementations of this embodiment, the threshold module includes two first threshold sub-modules and one second threshold sub-module connected in series. Each first threshold sub-module includes a fully connected layer, a batch normalization layer and a first activation layer; the second threshold sub-module includes a fully connected layer and a second activation layer, where the activation function of the first activation layer differs from that of the second activation layer. For example, the first activation layer uses a ReLU function and the second activation layer uses a sigmoid function. If the number of modalities N is greater than 2, the fully connected layer of the second threshold sub-module is adjusted to output a tensor of shape (512, N) and softmax is used as the activation function of the second activation layer; each column of this tensor is multiplied element-wise with the 512-dimensional feature vector of the corresponding modality, the N resulting 512-dimensional vectors are summed into a single 512-dimensional vector, and the subsequent computation proceeds as before. In this way the threshold module improves the reliability of the threshold vector.
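The following PyTorch sketch (an illustration, not the patent's implementation) shows a two-modality threshold module built from two FC + BN + ReLU sub-modules followed by an FC + sigmoid sub-module; taking the concatenation of the two modality feature vectors as input and the particular layer sizes are assumptions.

```python
import torch
import torch.nn as nn

class ThresholdModule(nn.Module):
    """Gate ("threshold") module: two FC + BN + ReLU sub-modules in series,
    followed by an FC + sigmoid sub-module, as described above.
    Layer sizes are illustrative assumptions."""

    def __init__(self, in_dim: int = 1024, hidden_dim: int = 512, out_dim: int = 512):
        super().__init__()
        self.first_block1 = nn.Sequential(
            nn.Linear(in_dim, hidden_dim), nn.BatchNorm1d(hidden_dim), nn.ReLU())
        self.first_block2 = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim), nn.BatchNorm1d(hidden_dim), nn.ReLU())
        # Second sub-module: fully connected layer + a different activation (sigmoid).
        self.second_block = nn.Sequential(
            nn.Linear(hidden_dim, out_dim), nn.Sigmoid())

    def forward(self, concat_features: torch.Tensor) -> torch.Tensor:
        # concat_features: concatenation of the feature vectors of all
        # classification models, shape (batch, in_dim).
        x = self.first_block1(concat_features)
        x = self.first_block2(x)
        return self.second_block(x)   # threshold vector g, shape (batch, out_dim)
```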
In this embodiment, the modal fusion module fuses the output vectors of all the classification models based on the threshold vector to obtain a fusion result, which is the final classification result after fusing all the classification models. As shown in Fig. 3, this component has multiple inputs: the feature vectors output by the first through N-th classification models, and the threshold vectors output by the threshold module, one per classification model (i.e., one threshold vector corresponds to one classification model). It should be noted that the modal fusion module can only fuse vectors of the same dimension, so it contains dimension conversion components. As shown in Fig. 4, the FC + BN + ReLU block between the image classification model and the modal fusion component is such a dimension conversion component, consisting of a fully connected layer, a normalization layer and a ReLU activation layer.
In step 203, a sample is selected from the sample set.
In this embodiment, the executing subject may select a sample from the sample set obtained in step 201, and perform the training steps from step 203 to step 207. The selection manner and the number of samples are not limited in the present disclosure. For example, at least one sample may be selected randomly, or a sample with better definition (i.e. higher pixels) and clear text content may be selected from the samples.
According to the input requirements of the classification model, each sample can comprise at least two subsamples of different modalities, and each subsample is provided with a corresponding classification model. For example, in this embodiment, the at least two classification models include an image classification model and a text classification model, and the selected sample includes an image modality subsample and a text modality subsample, the image modality subsample is input into the image classification model, and the text modality subsample is input into the text classification model.
Step 204, respectively inputting the sub-samples of different modalities of the sample into the classification models corresponding to those modalities to obtain the feature vectors output by the classification models.
In this embodiment, the classification model is used to classify the respective modality data, for example, the classification model includes an image classification model, and an image including a jacket, trousers, and socks is input to the image classification model, so as to obtain a label for labeling the category of the jacket, trousers, and socks in the image by the classification model.
In this embodiment, the feature vectors output by the at least two classification models differ greatly. When fusing at least two feature vectors, in order to let the different kinds of information complement and reference each other, a learnable threshold module determines, according to the characteristics of the actual information, how the fusion should be performed; the threshold vector output by the threshold module provides the concrete values with which the feature vectors are fused.
In this embodiment, the threshold module, the modal fusion module and the at least two classification models may be trained jointly. During joint training, the threshold module updates the threshold vectors of the classification models through gradient back-propagation, and the threshold vectors used by the modal fusion module for the at least two classification models are updated at the same time.
Step 207, in response to determining that the multi-modal fusion network satisfies the training completion condition, taking the multi-modal fusion network as the multi-modal classification model.
In this embodiment, after the sample is input into the multi-modal fusion network, a loss function is calculated, and the loss function is used to represent the difference between the predicted value and the known answer. When the multi-mode fusion network is trained, all parameters in the multi-mode fusion network are continuously changed, so that the loss function is continuously reduced, and a multi-mode classification model with higher accuracy is trained.
Specifically, the network can be optimized with gradient back-propagation, i.e., back-propagation (BP) combined with gradient descent: back-propagation computes the gradient of the loss function with respect to all the weights in the network, and this gradient is fed to the gradient descent method to update the weights so as to minimize the loss function.
In this embodiment, the training completion condition optionally includes at least one of the following: the number of training iterations of the multi-modal fusion network reaches a preset iteration threshold, or the loss value of the loss function is smaller than a preset loss threshold; for example, the training reaches 5,000 iterations, or the loss value drops below 0.05. When training is completed, the multi-modal fusion network is taken as the multi-modal classification model. Setting a training completion condition can speed up model convergence.
If the multi-modal fusion network does not satisfy the training completion condition, the relevant parameters of the multi-modal fusion network are adjusted so that the loss value converges, and steps 203-207 are executed again on the adjusted network until the training completion condition is satisfied.
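The training loop with the two completion conditions might be organized as in the following sketch; the 5,000-iteration and 0.05-loss values echo the example above, while the function name, the data loader format and the criterion are placeholders, not part of the patent.

```python
def train_multimodal_fusion_network(net, data_loader, optimizer, criterion,
                                    max_iters: int = 5_000,
                                    loss_threshold: float = 0.05):
    """Repeat steps 203-207: draw samples, forward, back-propagate, and stop
    once an iteration or loss threshold (the training completion condition) is met."""
    iteration = 0
    while True:
        for image_batch, text_batch, labels in data_loader:   # step 203: select samples
            logits = net(image_batch, text_batch)              # steps 204-206
            loss = criterion(logits, labels)
            optimizer.zero_grad()
            loss.backward()                                    # gradient back-propagation
            optimizer.step()                                   # gradient descent update
            iteration += 1
            if iteration >= max_iters or loss.item() < loss_threshold:
                return net      # step 207: the trained multi-modal classification model
```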
The multi-modal classification model generation method provided by this embodiment of the present disclosure first acquires a preset sample set; secondly acquires a pre-established multi-modal fusion network; thirdly selects samples from the sample set; and finally trains the multi-modal fusion network with the samples to obtain the multi-modal classification model. The training step includes: inputting the sub-samples of different modalities of a sample into the classification models corresponding to those modalities to obtain the feature vectors output by the classification models; extracting a threshold vector from all the feature vectors through the threshold module; inputting all the feature vectors and the threshold vector into the modal fusion module; and taking the multi-modal fusion network as the multi-modal classification model in response to determining that it satisfies the training completion condition. The resulting multi-modal classification model lets multi-modal data complement and explain each other during training and improves the accuracy of multi-modal target classification.
Because data of different modalities differ greatly and the classification models corresponding to different modalities are trained differently, the training process of the multi-modal fusion network needs to support training each part separately. That is, after the multi-modal fusion network is built, each of the at least two classification models is first trained individually; once every classification model has been fully trained, all modules of the multi-modal fusion network are trained jointly.
In some optional implementations of this embodiment, the training completion condition of the multi-modal fusion network includes: training each classification model, the threshold module and the modal fusion module in the multi-modal fusion network in turn; and, after all the classification models, the threshold module and the modal fusion module have been trained, training all modules of the multi-modal fusion network simultaneously until training of the multi-modal fusion network is completed.
In this optional implementation, the individual training of each module can be controlled by switches; when all modules of the multi-modal fusion network are trained jointly, the switches simultaneously enable all models and the gradient back-propagation of all modules. The specific switch control flow is described in detail with reference to the embodiment shown in Fig. 4.
Specifically, training each classification model, the threshold module and the modal fusion module in turn means: first, each classification model is trained individually; secondly, after all classification models have been trained, the threshold module and the modal fusion module are trained, during which all classification models must be frozen. After this stage, all classification models are unfrozen and the whole multi-modal fusion network is trained jointly.
Taking an image classification model and a text classification model as the classification models, the training of the multi-modal fusion network is divided into four stages (a code sketch of the switching follows the list):
First, switch to the image classification model and train it;
secondly, switch to the text classification model and train it;
thirdly, turn on all switches, freeze the image classification model and the text classification model (i.e., fix the parameters of the trained image and text classification models), and train the threshold module and the modal fusion module;
and fourthly, unfreeze the image classification model and the text classification model (their parameters are no longer fixed and are updated together with the other parameters of the multi-modal fusion network), and train all modules and models of the multi-modal fusion network.
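A sketch of the four stages using PyTorch's requires_grad flags is shown below; the attribute names (image_model, text_model, gate, fusion_head, fixed_gate) and the helper train_some_steps are illustrative assumptions, not names from the patent.

```python
import torch
import torch.nn as nn

def set_trainable(module: nn.Module, flag: bool) -> None:
    """Freeze (flag=False) or unfreeze (flag=True) a module's parameters."""
    for p in module.parameters():
        p.requires_grad = flag

def staged_training(net, train_some_steps):
    # Stage 1: only the image classification model; gate fixed to all ones.
    set_trainable(net.image_model, True)
    set_trainable(net.text_model, False)
    set_trainable(net.gate, False)          # gradient cut off at the threshold module
    net.fixed_gate = torch.ones(1, 512)
    train_some_steps(net)

    # Stage 2: only the text classification model; gate fixed to all zeros.
    set_trainable(net.image_model, False)
    set_trainable(net.text_model, True)
    net.fixed_gate = torch.zeros(1, 512)
    train_some_steps(net)

    # Stage 3: freeze both classification models, learn threshold + fusion modules.
    set_trainable(net.image_model, False)
    set_trainable(net.text_model, False)
    set_trainable(net.gate, True)
    set_trainable(net.fusion_head, True)
    net.fixed_gate = None                   # gate vector is now predicted, not constant
    train_some_steps(net)

    # Stage 4: unfreeze everything and train the whole network jointly.
    for m in (net.image_model, net.text_model, net.gate, net.fusion_head):
        set_trainable(m, True)
    train_some_steps(net)
```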
In the optional implementation mode, each classification model, the threshold module and the mode fusion module in the multi-mode fusion network are independently trained in sequence, and then the whole multi-mode fusion network is jointly trained, so that the classification models, the threshold modules and the mode fusion modules achieve the effect of full training, and the reliability of multi-mode fusion network training is improved.
In some optional implementations of this embodiment, training each classification model, the threshold module and the modal fusion module in the multi-modal fusion network in turn includes: cutting off gradient back-propagation through the threshold module and, for each classification model in the multi-modal fusion network, setting the threshold vector of the threshold module to the constant value corresponding to that classification model and training the classification model to obtain a trained classification model; and, after all the classification models have been trained, reconnecting gradient back-propagation through the threshold module, fixing the parameters of all the trained classification models, and training the threshold module and the modal fusion module to obtain a trained threshold module and a trained modal fusion module.
The cutting-off and reconnecting of the threshold module's gradient back-propagation described above can be implemented by switches; the specific switch control flow is described in detail with reference to the embodiment shown in Fig. 4.
In this optional implementation, gradient back-propagation through the threshold module is cut off and the threshold vector of the threshold module is set separately for each classification model in the network, so that each classification model is trained individually; after all the classification models have been trained, gradient back-propagation through the threshold module is reconnected and the threshold module and the modal fusion module are trained, yielding a trained threshold module and a trained modal fusion module.
In some optional implementations of this embodiment, as shown in fig. 4, the at least two classification models include: an image classification model and a text classification model.
Specifically, the model structures of the image classification model and the text classification model can be chosen according to the scenario. For example, TensorFlow 2.0 can be used as the image classification model infrastructure and fine-tuned from pre-trained weights via transfer learning, with image data augmentation added on top; the model's output is used as the feature vector, i.e., the model acts as an image encoder.
Text data is sequential, so a recurrent neural network or an attention-based BERT-type network (Bidirectional Encoder Representations from Transformers) can be used. For example, the text classification model may adopt a Bi-LSTM (bidirectional Long Short-Term Memory) model, which can reason over the text modality data in both directions and better mine the relationship between a target carrying text modality data (such as a title) and its category.
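A minimal Bi-LSTM text encoder in PyTorch could look like the following sketch; the vocabulary size, embedding size and hidden size are assumptions, and the concatenated final hidden states serve as the text feature vector.

```python
import torch
import torch.nn as nn

class TextEncoder(nn.Module):
    """Bi-LSTM text branch sketch: token ids -> embedding -> bidirectional LSTM
    -> one feature vector per title (the text classification model's output)."""

    def __init__(self, vocab_size: int = 30_000, embed_dim: int = 128, hidden_dim: int = 512):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True, bidirectional=True)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        # token_ids: (batch, seq_len) integer tensor of word indices.
        emb = self.embedding(token_ids)
        _, (h_n, _) = self.lstm(emb)                 # h_n: (2, batch, hidden_dim)
        # Concatenate the final forward and backward hidden states.
        return torch.cat([h_n[0], h_n[1]], dim=1)    # (batch, 2 * hidden_dim)
```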
In practice, when the target is the title of a product, the title is usually a pile of keywords. The text classification model first segments the title into words and then shuffles the words locally. Note that local shuffling is a text data augmentation method: the word order is scrambled only within small local windows, for example of 3 to 6 words. This is done because a title is largely unordered yet still carries a certain overall preference for order; for instance, the brand word of a product usually appears at the very beginning of the title, while the size and color of the product are very likely to appear at the end.
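The local word-shuffling augmentation could be sketched as below; the 3-6 word window follows the description above, while whitespace tokenization and the function name are assumptions.

```python
import random

def shuffle_title_locally(title: str, min_win: int = 3, max_win: int = 6) -> str:
    """Data augmentation for product titles: split the title into words and
    shuffle words only inside small local windows, so that the coarse ordering
    (brand first, size/colour last, etc.) is largely preserved."""
    words = title.split()
    augmented, i = [], 0
    while i < len(words):
        win = random.randint(min_win, max_win)
        chunk = words[i:i + win]
        random.shuffle(chunk)          # scramble order only inside this window
        augmented.extend(chunk)
        i += win
    return " ".join(augmented)

# Example: shuffle_title_locally("BrandX men summer cotton T-shirt short sleeve white XL")
```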
In this optional implementation, the input of the modal fusion module has three parts. The first is the feature vector output by the image classification model, which passes through a dimension conversion component composed of a fully connected layer, a normalization layer and a ReLU activation layer (the FC + BN + ReLU block between the image classification model and the modal fusion component in Fig. 4) to yield a 512-dimensional feature vector. The second is the feature vector of the text classification model, which passes through a dimension conversion component composed of a fully connected layer, a normalization layer and a Tanh activation layer (the FC + BN + Tanh block between the text classification model and the modal fusion component in Fig. 4) to yield a 512-dimensional feature vector. The third is the threshold vector output by the gate component, also a 512-dimensional vector. The modal fusion component fuses the feature vectors of the image classification model and the text classification model according to the values of the threshold vector; the fused 512-dimensional vector then passes through a fully connected layer, a normalization layer and a ReLU activation layer (the FC + BN + ReLU block connected to the output of the modal fusion component in Fig. 4), followed by a fully connected layer (FC in Fig. 4) that maps it to a vector whose dimension equals the number of classification categories, giving the final output of the modal fusion module.
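Putting the pieces together, the fusion head described in this paragraph might be sketched as follows in PyTorch, reusing the ThresholdModule sketch from above; the backbone feature dimensions, the number of categories, the fixed_gate switch attribute and feeding the gate with the concatenated backbone features are all assumptions rather than values from the patent.

```python
import torch
import torch.nn as nn

class ModalFusionHead(nn.Module):
    """Fusion head sketch: image and text feature vectors are converted to 512
    dimensions, fused element-wise with the gate vector g, then classified."""

    def __init__(self, img_dim: int = 2048, txt_dim: int = 1024,
                 fused_dim: int = 512, num_classes: int = 100):
        super().__init__()
        # Dimension conversion components (FC + BN + activation).
        self.img_proj = nn.Sequential(
            nn.Linear(img_dim, fused_dim), nn.BatchNorm1d(fused_dim), nn.ReLU())
        self.txt_proj = nn.Sequential(
            nn.Linear(txt_dim, fused_dim), nn.BatchNorm1d(fused_dim), nn.Tanh())
        # ThresholdModule is the gate sketched earlier in this description.
        self.gate = ThresholdModule(in_dim=img_dim + txt_dim, out_dim=fused_dim)
        self.head = nn.Sequential(
            nn.Linear(fused_dim, fused_dim), nn.BatchNorm1d(fused_dim), nn.ReLU(),
            nn.Linear(fused_dim, num_classes))
        self.fixed_gate = None  # set to an all-ones/all-zeros tensor by the training "switch"

    def forward(self, img_feat: torch.Tensor, txt_feat: torch.Tensor) -> torch.Tensor:
        v_img = self.img_proj(img_feat)                     # (batch, 512)
        v_txt = self.txt_proj(txt_feat)                     # (batch, 512)
        if self.fixed_gate is not None:
            g = self.fixed_gate                              # constant gate, gradient cut off
        else:
            g = self.gate(torch.cat([img_feat, txt_feat], dim=1))
        fused = g * v_img + (1.0 - g) * v_txt               # h_i = g_i*img_i + (1-g_i)*txt_i
        return self.head(fused)                              # logits over the categories
```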
In this optional implementation, setting the threshold vector of the threshold module to the constant value corresponding to each classification model in the multi-modal fusion network and training that classification model includes: setting the threshold vector that the threshold module allocates to the image classification model to an all-ones vector of a set dimension and the threshold vector allocated to the text classification model to an all-zeros vector of that dimension, and training the image classification model to obtain a trained image classification model; then setting the threshold vector allocated to the text classification model to an all-ones vector of the set dimension and the threshold vector allocated to the image classification model to an all-zeros vector of that dimension, and training the text classification model to obtain a trained text classification model.
In this optional implementation, the control process can use the switches shown in Fig. 4. In practice, a switch simply adds an interface through which the threshold vector can be assigned and the gradient back-propagation channel opened or closed. For example, when only the image classification model is to be trained, the threshold vector is set to a 512-dimensional all-ones vector and propagation of gradients into the threshold module is cut off; this directly allocates an all-ones threshold vector to the image classification model and an all-zeros threshold vector to the text classification model in the modal fusion component. Conversely, if the threshold vector is set to a 512-dimensional all-zeros vector, only the text classification model is trained (equivalent to switching to the text classification model). When the threshold vector is not assigned, both classification models are active, their outputs are fused according to the threshold vector output by the threshold module, and the threshold vector itself is learned by the threshold module; during training, gradient back-propagation reaches every part of the modal fusion network and all weights are updated.
In this embodiment, the switch in fig. 4 is used to control the training process of the multi-modal fusion network, and may also be used to observe the trained multi-modal classification model after the model training is completed. E.g. after training is completed, one of the classification models is turned off, the difference between the output and the output of the model before joint training is observed, etc.
In the optional implementation mode, the image classification model or the text classification model is independently trained by respectively allocating the all-one threshold vectors to the image classification model or the text classification model, so that the purpose of independently training each classification model of at least two classification models at different time is realized, and the reliability of independent training of the classification models is improved.
In some optional implementations of this embodiment, the modal fusion component in the modal fusion module may fuse all the feature vectors and the threshold vector using the following formula (1):

h_i = g_i · x_i^(img) + (1 − g_i) · x_i^(txt)    (1)

where i is an integer from 0 to 511 (the index of a 512-dimensional vector), x_i^(img) is the i-th value of the feature vector of the image classification model, x_i^(txt) is the i-th value of the feature vector of the text classification model, g_i is the i-th value of the threshold vector of the threshold module, and h_i is the i-th value of the vector fused by the modal fusion module.
In this optional implementation, the above formula fuses the feature vectors output by at least two different classification models in a simple and convenient way and improves the effect of fusing the classification results of data of different modalities.
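For intuition, a tiny numeric check of formula (1) with made-up values (not from the patent) behaves as follows:

```python
import torch

# Toy 3-dimensional example of the gated fusion (real vectors are 512-dimensional).
v_img = torch.tensor([0.2, 0.8, 0.5])   # image feature values
v_txt = torch.tensor([0.9, 0.1, 0.5])   # text feature values
g     = torch.tensor([1.0, 0.0, 0.5])   # threshold (gate) values

h = g * v_img + (1.0 - g) * v_txt
print(h)   # tensor([0.2000, 0.1000, 0.5000]) -> g=1 keeps the image value, g=0 keeps the text value
```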
Referring to fig. 5, a flow 500 of one embodiment of a multi-modal object classification method provided by the present disclosure is shown. The multi-modal object classification method may include the steps of:
Step 501, acquiring a target to be classified, the target having data of at least two different modalities.

In this embodiment, the executing subject of the multi-modal object classification method (e.g., the server 105 shown in Fig. 1) may acquire the target to be classified in various ways. For example, it may obtain a target to be classified stored in a database server (e.g., the database server 104 shown in Fig. 1) through a wired or wireless connection, the target having data of at least two different modalities. As another example, the executing subject may receive targets to be classified collected by terminals (e.g., the terminals 101 and 102 shown in Fig. 1) or other devices.
In this embodiment, the target to be classified may be product information containing data of at least two different modalities, such as images, text, video and voice, where an image may be a color image and/or a grayscale image; the image format is not limited in the present disclosure. For example, a product on the internet is a target with both image and text: the product pictures are the image data, and the title, product details, filled-in attributes and so on are the text data.
Step 502, inputting the target into the multi-modal classification model to obtain a classification result output by the multi-modal classification model.

In this embodiment, the executing subject may input the target acquired in step 501 into the multi-modal classification model, thereby generating a classification result. The multi-modal classification model generated through steps 201-207 determines the type of the input target. For example, if the input target includes image data (a picture of a garment) and text data (the garment's title, details, attributes, etc.), the multi-modal classification model may output "T-shirt".
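Step 502 might be exercised with a helper such as the following sketch; the model interface, the preprocessed inputs and the category list are placeholders, not part of the patent.

```python
import torch

def classify_product(model, image_tensor, token_ids, category_names):
    """Feed one two-modality target (preprocessed product image + tokenized title)
    to the trained multi-modal classification model and return (category, confidence)."""
    model.eval()
    with torch.no_grad():
        logits = model(image_tensor, token_ids)       # shape (1, num_categories)
        probs = torch.softmax(logits, dim=1)
        confidence, index = probs.max(dim=1)
    return category_names[index.item()], float(confidence)
    # e.g. ("T-shirt", 0.9) for a garment picture plus its title
```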
In this embodiment, the multi-modal classification model may be generated using the method described above in the embodiment of FIG. 2. For a specific generation process, reference may be made to the related description of the embodiment in fig. 2, which is not described herein again.
It should be noted that the multi-modal object classification method of this embodiment can be used to test the multi-modal classification model generated by the above embodiments, and the multi-modal classification model can then be further optimized according to the classification results. The method may also be a practical application of the multi-modal classification model generated by the above embodiments: classifying targets with the generated multi-modal classification model helps improve the accuracy of target classification.
When a multi-modal product (for example, a product consisting of a picture and a title, where the title describes the product in the picture and the picture visually shows the product described by the title) is processed (for example, when its attributes are extracted, it is categorized, or personalized recommendations are made for it), a fine-grained understanding and identification of the product is required. If only one modality of the product's data is used for identification, the amount of information is insufficient. For example, a garment may involve attributes such as size, color, style, collar type, fit, material and whether it is sold as a single item or a set, and a laptop may involve attributes such as appearance, CPU model, memory and hard disk; a recognition result obtained from only the picture or only the text is not accurate enough, and simply superimposing the two modalities without fusion learning is not accurate enough either.
In this embodiment, inputting the multi-modal product into the multi-modal classification model generated by the multi-modal classification model generation method yields a classification result in which the product is identified and classified accurately. For example, suppose the multi-modal product consists of a picture of a T-shirt and the T-shirt's title, where the keyword "one-piece dress" has been added to the title in order to capture traffic from dress searches. A single-modality classification model (a text classification model or an image classification model alone) may output: T-shirt (type), 60% (confidence), whereas the multi-modal classification model outputs: T-shirt (type), 90% (confidence). The multi-modal classification model therefore understands the product information accurately by fusing the multi-modal data, and classifies the product accurately on the basis of that understanding.
With continuing reference to FIG. 6, as an implementation of the method illustrated in FIG. 2 above, the present disclosure provides one embodiment of a multi-modal classification model generation apparatus. The embodiment of the device corresponds to the embodiment of the method shown in fig. 2, and the device can be applied to various electronic devices.
As shown in fig. 6, the multi-modal classification model generation apparatus 600 of the present embodiment may include: a sample acquiring unit 601, a network acquiring unit 602, a selecting unit 603, an input unit 604, an extracting unit 605, a fusion unit 606, and an output unit 607. The sample acquiring unit 601 is configured to acquire a preset sample set, where the sample set includes at least one sample and each sample includes at least two sub-samples of different modalities. The network acquiring unit 602 is configured to acquire a pre-established multi-modal fusion network, the multi-modal fusion network including a threshold module, a modality fusion module, and at least two classification models, where each classification model classifies data of a different modality. The selecting unit 603 is configured to select samples from the sample set. The input unit 604 is configured to input the sub-samples of different modalities of a sample into the classification models corresponding to those modalities respectively, to obtain the feature vectors output by the classification models. The extracting unit 605 is configured to extract a threshold vector from all the feature vectors through the threshold module. The fusion unit 606 is configured to input all the feature vectors and the threshold vector into the modality fusion module. The output unit 607 is configured to, in response to determining that the multi-modal fusion network satisfies the training completion condition, treat the multi-modal fusion network as the multi-modal classification model.
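As an aid to reading, the cooperation of the units 601 to 607 can be pictured with the following minimal sketch of the multi-modal fusion network they operate on. The PyTorch framework, the 512-dimensional feature size, the gated form of the fusion, and the gate_override hook are assumptions made only for illustration and do not limit the apparatus:

```python
# Minimal structural sketch (assumptions: PyTorch, 512-dim features, gated fusion).
import torch
import torch.nn as nn

class MultiModalFusionNet(nn.Module):
    def __init__(self, image_encoder: nn.Module, text_encoder: nn.Module,
                 threshold_module: nn.Module, feat_dim: int = 512, num_classes: int = 1000):
        super().__init__()
        self.image_encoder = image_encoder        # classification model for the image modality
        self.text_encoder = text_encoder          # classification model for the text modality
        self.threshold_module = threshold_module  # produces the threshold (gate) vector
        self.head = nn.Linear(feat_dim, num_classes)
        self.gate_override = None                 # set to 1.0 / 0.0 during single-modality training

    def forward(self, image: torch.Tensor, text: torch.Tensor) -> torch.Tensor:
        v_img = self.image_encoder(image)         # feature vector of the image sub-sample
        v_txt = self.text_encoder(text)           # feature vector of the text sub-sample
        if self.gate_override is None:
            g = self.threshold_module(torch.cat([v_img, v_txt], dim=-1))
        else:                                     # constant gate: the threshold module is bypassed
            g = torch.full_like(v_img, self.gate_override)
        fused = g * v_img + (1 - g) * v_txt       # modality fusion of the two feature vectors
        return self.head(fused)                   # classification over the fused vector
```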
In some optional implementations of this embodiment, the output unit 607 includes a single training subunit (not shown) and a training subunit (not shown). The single training subunit is configured to train each classification model, the threshold module, and the modality fusion module in the multi-modal fusion network in sequence; the training subunit is configured to train all the modules in the multi-modal fusion network simultaneously, after each classification model, the threshold module, and the modality fusion module have been trained, until the training of the multi-modal fusion network is completed.
In some optional implementations of this embodiment, the single training subunit is further configured to: cut off the gradient back propagation of the threshold module, and, for each classification model in the multi-modal fusion network, set the threshold vector of the threshold module to a constant value corresponding to that classification model and train the classification model to obtain a trained classification model; and, after all the classification models have been trained, reconnect the gradient back propagation of the threshold module, fix the parameters of all the trained classification models, and train the threshold module and the modality fusion module to obtain a trained threshold module and a trained modality fusion module.
In some optional implementations of this embodiment, the classification models include an image classification model and a text classification model, and the single training subunit is further configured to: set the threshold vector allocated by the threshold module to the image classification model to an all-ones vector of a set dimension and the threshold vector allocated to the text classification model to an all-zeros vector of the set dimension, and train the image classification model to obtain a trained image classification model; then set the threshold vector allocated by the threshold module to the text classification model to an all-ones vector of the set dimension and the threshold vector allocated to the image classification model to an all-zeros vector of the set dimension, and train the text classification model to obtain a trained text classification model.
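A hedged sketch of this staged training, continuing the MultiModalFusionNet sketch above, might look as follows; the Adam optimizer, learning rate, cross-entropy loss, and the gate_override mechanism are illustrative assumptions rather than the claimed procedure:

```python
# Staged training sketch (assumptions: Adam, cross-entropy loss, gate_override hook).
import torch
import torch.nn as nn

def train_single_modality(net, loader, modality: str, epochs: int = 1):
    """Stage 1: train one classification model with the threshold module cut off.
    The gate is a constant: all-ones for the image model (fused == image features),
    all-zeros for the text model (fused == text features)."""
    net.gate_override = 1.0 if modality == "image" else 0.0
    encoder = net.image_encoder if modality == "image" else net.text_encoder
    opt = torch.optim.Adam(list(encoder.parameters()) + list(net.head.parameters()), lr=1e-4)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for image, text, label in loader:
            loss = loss_fn(net(image, text), label)
            opt.zero_grad()
            loss.backward()
            opt.step()

def train_threshold_and_fusion(net, loader, epochs: int = 1):
    """Stage 2: reconnect the threshold module, fix the trained classifiers,
    and train only the threshold module and the fusion/classification head."""
    net.gate_override = None
    for encoder in (net.image_encoder, net.text_encoder):
        for p in encoder.parameters():
            p.requires_grad = False               # freeze the trained classification models
    opt = torch.optim.Adam(list(net.threshold_module.parameters())
                           + list(net.head.parameters()), lr=1e-4)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for image, text, label in loader:
            loss = loss_fn(net(image, text), label)
            opt.zero_grad()
            loss.backward()
            opt.step()
```

After both stages, all parameters can be unfrozen and the whole network fine-tuned jointly until the training completion condition is met.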
In some optional implementations of this embodiment, the modality fusion module fuses all the feature vectors and the threshold vector using the following formula:

h_i = g_i * v_i^img + (1 - g_i) * v_i^text

where i is an integer between 0 and 511 and is the index of a 512-dimensional vector, v_i^img is the ith value of the feature vector of the image classification model, v_i^text is the ith value of the feature vector of the text classification model, g_i is the ith value of the threshold vector of the threshold module, and h_i is the ith value of the vector after fusion by the modality fusion module.
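A short numerical illustration of this element-wise fusion follows; the gated form shown is the one implied by the variable descriptions above and is, in that sense, an assumption:

```python
# Element-wise gated fusion of two 512-dimensional feature vectors.
import torch

v_img = torch.randn(512)   # feature vector from the image classification model
v_txt = torch.randn(512)   # feature vector from the text classification model
g = torch.rand(512)        # threshold vector in [0, 1] from the threshold module

fused = g * v_img + (1 - g) * v_txt   # h_i = g_i * v_img_i + (1 - g_i) * v_txt_i
assert fused.shape == (512,)
# A gate close to all-ones keeps mostly image information; close to all-zeros keeps mostly text.
```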
In some optional implementations of this embodiment, the threshold module includes two first threshold sub-modules and a second threshold sub-module connected in series. Each first threshold sub-module includes a fully connected layer, a batch normalization layer, and a first activation layer; the second threshold sub-module includes a fully connected layer and a second activation layer, where the activation function of the first activation layer is different from the activation function of the second activation layer.
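A minimal sketch of such a threshold module is given below; the ReLU/Sigmoid activation pair and the 1024-to-512 dimensions are assumptions chosen only to satisfy the "different activation functions" constraint and the 512-dimensional gate used above:

```python
# Threshold module sketch: two first sub-modules (FC + BatchNorm + first activation)
# in series, followed by a second sub-module (FC + a different, second activation).
import torch
import torch.nn as nn

class ThresholdModule(nn.Module):
    def __init__(self, in_dim: int = 1024, feat_dim: int = 512):
        super().__init__()
        def first_sub_module(d_in: int, d_out: int) -> nn.Sequential:
            return nn.Sequential(nn.Linear(d_in, d_out),
                                 nn.BatchNorm1d(d_out),
                                 nn.ReLU())              # first activation (assumed ReLU)
        self.first_1 = first_sub_module(in_dim, feat_dim)
        self.first_2 = first_sub_module(feat_dim, feat_dim)
        self.second = nn.Sequential(nn.Linear(feat_dim, feat_dim),
                                    nn.Sigmoid())        # second activation (assumed Sigmoid)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.second(self.first_2(self.first_1(x)))  # threshold (gate) vector in [0, 1]
```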
It will be understood that the elements described in the apparatus 600 correspond to various steps in the method described with reference to fig. 2. Thus, the operations, features and resulting advantages described above with respect to the method are also applicable to the apparatus 600 and the units included therein, and are not described herein again.
With continuing reference to FIG. 7, as an implementation of the method illustrated in FIG. 5 above, the present disclosure provides one embodiment of a multi-modal object classification apparatus. The embodiment of the device corresponds to the embodiment of the method shown in fig. 5, and the device can be applied to various electronic devices.
As shown in fig. 7, the multi-modal object classification apparatus 700 of the present embodiment may include: an acquiring unit 701 configured to acquire a target to be classified, the target having at least two different modality data. The classifying unit 702 is configured to input the target into the multi-modal classification model generated by the method described in the embodiment of fig. 2 or fig. 5, and obtain a classification result output by the multi-modal classification model.
It will be understood that the elements described in the apparatus 700 correspond to various steps in the method described with reference to fig. 5. Thus, the operations, features and resulting advantages described above with respect to the method are also applicable to the apparatus 700 and the units included therein, and will not be described herein again.
Referring now to FIG. 8, shown is a schematic diagram of an electronic device 800 suitable for use in implementing embodiments of the present disclosure.
As shown in fig. 8, an electronic device 800 may include a processing means (e.g., central processing unit, graphics processor, etc.) 801 that may perform various appropriate actions and processes in accordance with a program stored in a Read Only Memory (ROM)802 or a program loaded from a storage means 808 into a Random Access Memory (RAM) 803. In the RAM 803, various programs and data necessary for the operation of the electronic apparatus 800 are also stored. The processing apparatus 801, the ROM 802, and the RAM 803 are connected to each other by a bus 804. An input/output (I/O) interface 805 is also connected to bus 804.
Generally, the following devices may be connected to the I/O interface 805: input devices 806 including, for example, a touch screen, touch pad, keyboard, mouse, etc.; an output device 807 including, for example, a Liquid Crystal Display (LCD), a speaker, a vibrator, and the like; storage 808 including, for example, magnetic tape, hard disk, etc.; and a communication device 809. The communication means 809 may allow the electronic device 800 to communicate wirelessly or by wire with other devices to exchange data. While fig. 8 illustrates an electronic device 800 having various means, it is to be understood that not all illustrated means are required to be implemented or provided. More or fewer devices may alternatively be implemented or provided. Each block shown in fig. 8 may represent one device or may represent multiple devices as desired.
In particular, according to an embodiment of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated in the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network through the communication means 809, or installed from the storage means 808, or installed from the ROM 802. The computer program, when executed by the processing apparatus 801, performs the above-described functions defined in the methods of the embodiments of the present disclosure.
It should be noted that the computer readable medium of the embodiments of the present disclosure may be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In embodiments of the disclosure, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In embodiments of the present disclosure, however, a computer readable signal medium may comprise a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: electrical wires, optical cables, RF (Radio Frequency), etc., or any suitable combination of the foregoing.
The computer readable medium may be embodied in the server, or may exist separately without being assembled into the server. The computer readable medium carries one or more programs which, when executed by the server, cause the server to: acquire a preset sample set, wherein the sample set includes at least one sample and the sample includes at least two sub-samples of different modalities; acquire a pre-established multi-modal fusion network, the multi-modal fusion network including a threshold module, a modality fusion module, and at least two classification models, wherein each classification model classifies data of a different modality; and perform the following training steps: selecting a sample from the sample set; inputting the sub-samples of different modalities of the sample into the classification models corresponding to those modalities respectively to obtain the feature vectors output by the classification models; extracting the threshold vector of all the feature vectors through the threshold module; inputting all the feature vectors and the threshold vector into the modality fusion module; and, in response to determining that the multi-modal fusion network satisfies the training completion condition, taking the multi-modal fusion network as the multi-modal classification model.
Computer program code for carrying out operations for embodiments of the present disclosure may be written in any combination of one or more programming languages, including object oriented programming languages such as Java, Smalltalk, or C++, and conventional procedural programming languages such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units described in the embodiments of the present disclosure may be implemented by software or hardware. The described units may also be provided in a processor, which may be described as: a processor including a sample acquiring unit, a network acquiring unit, a selecting unit, an input unit, an extracting unit, a fusion unit, and an output unit. The names of these units do not in some cases constitute a limitation of the units themselves; for example, the sample acquiring unit may also be described as a unit "configured to acquire a preset sample set".
The foregoing description is only exemplary of the preferred embodiments of the present disclosure and is illustrative of the principles of the technology employed. It will be appreciated by those skilled in the art that the scope of the invention in the embodiments of the present disclosure is not limited to technical solutions formed by the specific combination of the above-mentioned technical features, and also covers other technical solutions formed by any combination of the above-mentioned technical features or their equivalents without departing from the above inventive concept, for example, technical solutions formed by replacing the above features with (but not limited to) technical features having similar functions disclosed in the embodiments of the present disclosure.
Claims (12)
1. A method of generating a multi-modal classification model, the method comprising:
obtaining a preset sample set, wherein the sample set comprises at least one sample, and the sample comprises at least two sub-samples of different modalities;
obtaining a pre-established multi-modal fusion network, the multi-modal fusion network comprising: a threshold module, a modality fusion module, and at least two classification models, wherein each classification model classifies data of a different modality;
performing the following training steps: selecting a sample from the sample set; inputting the sub-samples of different modalities of the sample into the classification models corresponding to those modalities respectively to obtain the feature vectors output by the classification models; calculating, through the threshold module, all the spliced feature vectors to obtain a threshold vector; inputting all the feature vectors and the threshold vector into the modality fusion module; and, in response to determining that the multi-modal fusion network satisfies the training completion condition, taking the multi-modal fusion network as the multi-modal classification model.
2. The method of claim 1, wherein the training completion condition of the multi-modal fusion network comprises:
training each classification model, the threshold module, and the modality fusion module in the multi-modal fusion network in sequence;
after all the classification models, the threshold module, and the modality fusion module have been trained, training all the modules in the multi-modal fusion network simultaneously until the training of the multi-modal fusion network is completed.
3. The method of claim 2, wherein the training each classification model, the threshold module, and the modality fusion module in the multi-modal fusion network in sequence comprises:
cutting off the gradient back propagation of the threshold module, and, for each classification model in the multi-modal fusion network, setting the threshold vector of the threshold module to a constant value corresponding to the classification model and training the classification model to obtain the trained classification model;
and, after all the classification models have been trained, reconnecting the gradient back propagation of the threshold module, fixing the parameters of all the trained classification models, and training the threshold module and the modality fusion module to obtain the trained threshold module and the trained modality fusion module.
4. The method of claim 3, wherein the classification models comprise: an image classification model and a text classification model; and the setting the threshold vector of the threshold module to a constant value corresponding to each classification model in the multi-modal fusion network and training the classification model to obtain the trained classification model comprises:
setting the threshold vector allocated to the image classification model by the threshold module to an all-ones vector of a set dimension, and setting the threshold vector allocated to the text classification model by the threshold module to an all-zeros vector of the set dimension;
training the image classification model to obtain a trained image classification model;
setting the threshold vector allocated to the text classification model by the threshold module to an all-ones vector of the set dimension, and setting the threshold vector allocated to the image classification model by the threshold module to an all-zeros vector of the set dimension;
and training the text classification model to obtain a trained text classification model.
5. The method according to claim 3, wherein the modality fusion module fuses all the feature vectors and the threshold vector using the following formula:

h_i = g_i * v_i^img + (1 - g_i) * v_i^text

wherein i is an integer between 0 and 511 and is the index of a 512-dimensional vector, v_i^img represents the ith value of the feature vector of the image classification model, v_i^text represents the ith value of the feature vector of the text classification model, g_i represents the ith value of the threshold vector of the threshold module, and h_i represents the ith value of the vector after fusion by the modality fusion module.
6. The method of one of claims 1 to 5, wherein the threshold module comprises:
two first threshold sub-modules and a second threshold sub-module connected in series;
the first threshold sub-module comprises: a fully connected layer, a batch normalization layer, and a first activation layer;
the second threshold sub-module comprises: a fully connected layer and a second activation layer, wherein the activation function of the first activation layer is different from the activation function of the second activation layer.
7. A method of multi-modal object classification, the method comprising:
acquiring a target to be classified, wherein the target has at least two different modal data;
inputting the target into a multi-modal classification model generated by adopting the method of any one of claims 1-6 to obtain a classification result output by the multi-modal classification model.
8. An apparatus for generating a multi-modal classification model, the apparatus comprising:
a sample acquiring unit configured to acquire a preset sample set, wherein the sample set comprises at least one sample, and the sample comprises at least two sub-samples of different modalities;
a network acquisition unit configured to acquire a pre-established multi-modal fusion network, the multi-modal fusion network comprising: a threshold module, a modality fusion module, and at least two classification models which respectively classify targets of different modalities;
a selecting unit configured to select a sample from the set of samples;
the input unit is configured to input the subsamples of different modes of the sample to the classification models corresponding to the modes respectively to obtain the feature vectors output by the classification models;
an extraction unit configured to extract threshold vectors of all the feature vectors by the threshold module;
a fusion unit configured to input all feature vectors and the threshold vector into the modality fusion module;
an output unit configured to treat the multi-modal fusion network as a multi-modal classification model in response to determining that the multi-modal fusion network satisfies a training completion condition.
9. A multi-modal object classification apparatus, the apparatus comprising:
an acquisition unit configured to acquire a target to be classified, the target having at least two different modality data;
a classification unit configured to input the target into a multi-modal classification model generated by the method according to any one of claims 1 to 6, and obtain a classification result output by the multi-modal classification model.
10. An electronic device, comprising:
one or more processors;
a storage device having one or more programs stored thereon;
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any one of claims 1-7.
11. A computer-readable medium, on which a computer program is stored which, when being executed by a processor, carries out the method according to any one of claims 1-7.
12. A computer program product comprising a computer program which, when executed by a processor, implements the method of any one of claims 1-7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110394335.6A CN113762321A (en) | 2021-04-13 | 2021-04-13 | Multi-modal classification model generation method and device |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110394335.6A CN113762321A (en) | 2021-04-13 | 2021-04-13 | Multi-modal classification model generation method and device |
Publications (1)
Publication Number | Publication Date |
---|---|
CN113762321A true CN113762321A (en) | 2021-12-07 |
Family
ID=78786876
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110394335.6A Pending CN113762321A (en) | 2021-04-13 | 2021-04-13 | Multi-modal classification model generation method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113762321A (en) |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113961710A (en) * | 2021-12-21 | 2022-01-21 | 北京邮电大学 | Fine-grained thesis classification method and device based on multi-mode layered fusion network |
CN114187605A (en) * | 2021-12-13 | 2022-03-15 | 苏州方兴信息技术有限公司 | Data integration method and device and readable storage medium |
CN115600091A (en) * | 2022-12-16 | 2023-01-13 | 珠海圣美生物诊断技术有限公司(Cn) | Classification model recommendation method and device based on multi-modal feature fusion |
CN117746069A (en) * | 2024-02-18 | 2024-03-22 | 浙江啄云智能科技有限公司 | Graph searching model training method and graph searching method |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20120230568A1 (en) * | 2011-03-09 | 2012-09-13 | Siemens Aktiengesellschaft | Method and System for Model-Based Fusion of Multi-Modal Volumetric Images |
CN106855947A (en) * | 2016-12-28 | 2017-06-16 | 西安电子科技大学 | Multispectral image change detecting method based on the mutual modal factor analysis core fusion of core |
CN109816039A (en) * | 2019-01-31 | 2019-05-28 | 深圳市商汤科技有限公司 | A kind of cross-module state information retrieval method, device and storage medium |
CN110334689A (en) * | 2019-07-16 | 2019-10-15 | 北京百度网讯科技有限公司 | Video classification methods and device |
CN110674339A (en) * | 2019-09-18 | 2020-01-10 | 北京工业大学 | Chinese song emotion classification method based on multi-mode fusion |
WO2020083073A1 (en) * | 2018-10-23 | 2020-04-30 | 苏州科达科技股份有限公司 | Non-motorized vehicle image multi-label classification method, system, device and storage medium |
CN111580078A (en) * | 2020-04-14 | 2020-08-25 | 哈尔滨工程大学 | Single hydrophone target identification method based on fusion mode flicker index |
- 2021-04-13: CN CN202110394335.6A patent/CN113762321A/en active Pending
Patent Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20120230568A1 (en) * | 2011-03-09 | 2012-09-13 | Siemens Aktiengesellschaft | Method and System for Model-Based Fusion of Multi-Modal Volumetric Images |
CN106855947A (en) * | 2016-12-28 | 2017-06-16 | 西安电子科技大学 | Multispectral image change detecting method based on the mutual modal factor analysis core fusion of core |
WO2020083073A1 (en) * | 2018-10-23 | 2020-04-30 | 苏州科达科技股份有限公司 | Non-motorized vehicle image multi-label classification method, system, device and storage medium |
CN109816039A (en) * | 2019-01-31 | 2019-05-28 | 深圳市商汤科技有限公司 | A kind of cross-module state information retrieval method, device and storage medium |
CN110334689A (en) * | 2019-07-16 | 2019-10-15 | 北京百度网讯科技有限公司 | Video classification methods and device |
US20210019531A1 (en) * | 2019-07-16 | 2021-01-21 | Beijing Baidu Netcom Science And Technology Co., Ltd. | Method and apparatus for classifying video |
CN110674339A (en) * | 2019-09-18 | 2020-01-10 | 北京工业大学 | Chinese song emotion classification method based on multi-mode fusion |
CN111580078A (en) * | 2020-04-14 | 2020-08-25 | 哈尔滨工程大学 | Single hydrophone target identification method based on fusion mode flicker index |
Non-Patent Citations (2)
Title |
---|
张力行; 叶宁; 黄海平; 王汝传: "Dual-Modal Emotion Recognition System Based on Galvanic Skin Response Signals and Text Information" (基于皮肤电信号与文本信息的双模态情感识别系统), 计算机系统应用 (Computer Systems & Applications), no. 11, 14 November 2018 (2018-11-14) *
林子杰 et al.: "A Multi-Modal Emotion Recognition Method Based on Multi-Task Learning" (一种基于多任务学习的多模态情感识别方法), 北京大学学报 (Journal of Peking University), vol. 57, no. 1, 20 January 2021 (2021-01-20), pages 1 *
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114187605A (en) * | 2021-12-13 | 2022-03-15 | 苏州方兴信息技术有限公司 | Data integration method and device and readable storage medium |
CN113961710A (en) * | 2021-12-21 | 2022-01-21 | 北京邮电大学 | Fine-grained thesis classification method and device based on multi-mode layered fusion network |
CN115600091A (en) * | 2022-12-16 | 2023-01-13 | 珠海圣美生物诊断技术有限公司(Cn) | Classification model recommendation method and device based on multi-modal feature fusion |
CN115600091B (en) * | 2022-12-16 | 2023-03-10 | 珠海圣美生物诊断技术有限公司 | Classification model recommendation method and device based on multi-modal feature fusion |
CN117746069A (en) * | 2024-02-18 | 2024-03-22 | 浙江啄云智能科技有限公司 | Graph searching model training method and graph searching method |
CN117746069B (en) * | 2024-02-18 | 2024-05-14 | 浙江啄云智能科技有限公司 | Graph searching model training method and graph searching method |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110334689B (en) | Video classification method and device | |
CN109740018B (en) | Method and device for generating video label model | |
CN108427939B (en) | Model generation method and device | |
CN108229478B (en) | Image semantic segmentation and training method and device, electronic device, storage medium, and program | |
CN113762321A (en) | Multi-modal classification model generation method and device | |
CN110688528B (en) | Method, apparatus, electronic device, and medium for generating classification information of video | |
CN109145828B (en) | Method and apparatus for generating video category detection model | |
CN111611436A (en) | Label data processing method and device and computer readable storage medium | |
CN111711869B (en) | Label data processing method and device and computer readable storage medium | |
CN111625645B (en) | Training method and device for text generation model and electronic equipment | |
CN109816023B (en) | Method and device for generating picture label model | |
CN113140012B (en) | Image processing method, device, medium and electronic equipment | |
CN112766284B (en) | Image recognition method and device, storage medium and electronic equipment | |
CN112861896A (en) | Image identification method and device | |
CN116127080A (en) | Method for extracting attribute value of description object and related equipment | |
CN115578570A (en) | Image processing method, device, readable medium and electronic equipment | |
US20230367972A1 (en) | Method and apparatus for processing model data, electronic device, and computer readable medium | |
CN111444335B (en) | Method and device for extracting central word | |
CN113742590A (en) | Recommendation method and device, storage medium and electronic equipment | |
CN117351192A (en) | Object retrieval model training, object retrieval method and device and electronic equipment | |
CN115761410A (en) | Training method and device for attribute recognition model and electronic equipment | |
CN114625876B (en) | Method for generating author characteristic model, method and device for processing author information | |
CN116188887A (en) | Attribute recognition pre-training model generation method and attribute recognition model generation method | |
CN115249361A (en) | Instructional text positioning model training, apparatus, device, and medium | |
CN115115836A (en) | Image recognition method, image recognition device, storage medium and electronic equipment |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination |