CN114329036A - Cross-modal characteristic fusion system based on attention mechanism - Google Patents

Cross-modal characteristic fusion system based on attention mechanism

Info

Publication number
CN114329036A
Authority
CN
China
Prior art keywords
video
audio
segment
rgb
features
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210256553.8A
Other languages
Chinese (zh)
Other versions
CN114329036B (en)
Inventor
王青
兰浩源
刘阳
林倞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sun Yat Sen University
Original Assignee
Sun Yat Sen University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sun Yat Sen University filed Critical Sun Yat Sen University
Priority to CN202210256553.8A priority Critical patent/CN114329036B/en
Publication of CN114329036A publication Critical patent/CN114329036A/en
Application granted granted Critical
Publication of CN114329036B publication Critical patent/CN114329036B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Image Analysis (AREA)

Abstract

The invention provides a cross-modal feature fusion system based on an attention mechanism. Exploiting the complementary relationship between audio and video images, it extracts features of the audio and video modalities within a supervised contrast learning framework, constructs an audio-video correlation analysis module to align audio with video, and designs an attention-based cross-modal feature fusion module to fuse the audio and video features. Audio and RGB frames are taken as input to learn the video representation.

Description

Cross-modal characteristic fusion system based on attention mechanism
Technical Field
The invention relates to the technical field of audio and video processing, in particular to a cross-modal feature fusion system based on an attention mechanism.
Background
For video representation learning, supervised learning methods have received increasing attention. They include both conventional and deep learning methods. For example, the two-stream CNN processes video frames and dense optical flow separately and then fuses the class scores of the two networks to obtain a classification result. C3D processes video with three-dimensional convolution kernels. The Temporal Segment Network (TSN) samples several segments from each video to model its long-range temporal structure. The Temporal Relation Network (TRN) introduces an interpretable network to learn and reason about temporal dependencies between video frames at multiple time scales. The Temporal Shift Module (TSM) shifts part of the channels along the time dimension to facilitate information exchange between adjacent frames. Although these supervised methods model temporal dependencies well, most of them extract information only from the RGB image modality of video. With the development of multi-modal learning, researchers have begun to introduce it into video representation learning. Because video is dynamic and strictly ordered in time, learning its dynamic characteristics would undoubtedly improve a network's ability to learn video features. Optical flow is the instantaneous velocity of the pixel motion of a moving object projected onto the imaging plane; it uses the temporal change of pixels in an image sequence and the correlation between adjacent frames to find correspondences between the previous frame and the current frame and thereby compute the motion of objects between adjacent frames. Since optical flow captures the motion information of video well, most researchers use it as an additional modality to improve the performance of video representation learning.
Although RGB images contain the static information of a video and optical flow contains its dynamic information, optical flow is itself generated from RGB images and is not fully independent of the RGB modality; moreover, existing 3D convolutional networks already extract the dynamic information in an input clip well. The benefit of the optical flow modality has therefore reached a bottleneck. Besides rich visual information, video also carries a large amount of useful sound information. For example, the action of cutting down a tree is usually accompanied by the sound of chopping, and the action of hitting a baseball by the sound of a bat striking the ball; several studies have likewise demonstrated the effectiveness of audio. In previous related work, networks are trained by judging whether the audio and video are aligned or whether they belong to the same sample. Although such methods enable information interaction between modalities, they cannot handle the problem of large intra-class variation combined with small inter-class variation. Although these methods learn reasonably good features, they share one shortcoming: the correlation between features of actions belonging to the same category is not taken into account.
The prior art discloses a patent for a bimodal emotion recognition method based on multi-modal deep learning, which obtains three-channel input matrices of audio and video RGB images to form audio and video data samples; constructs an audio deep convolutional neural network and a video deep convolutional neural network to obtain high-level audio and video features; builds a fusion network of fully connected layers to construct a unified high-level audio-video feature; aggregates the unified features output by the last fully connected layer into a global feature and feeds it to a classifier to obtain the audio-video emotion recognition result. The fully connected fusion network fuses the audio and video emotion information, constructs a unified high-level audio-video representation, and effectively improves emotion recognition performance. However, that invention does not involve learning a video representation from audio and RGB frames as input.
Disclosure of Invention
The invention provides a cross-modal characteristic fusion system based on an attention mechanism, which realizes the fusion of audio and video characteristics and takes audio and RGB pictures as input to learn the video representation.
In order to achieve the technical effects, the technical scheme of the invention is as follows:
a cross-modal feature fusion system based on an attention mechanism, comprising:
the audio and video correlation analysis module is used for aligning two modes of audio and video RGB images;
the supervised contrast learning module is used for extracting the characteristics of the modes from the two modes of the audio and video RGB images;
and the cross-modal feature fusion module is used for learning the global context representation by utilizing the related knowledge between the modalities.
Further, the audio-video correlation analysis module takes 16 consecutive RGB frames from a video i to form an RGB segment v_i as the input of the RGB image modality; at this time only one segment is sampled per video, and in order to fully exploit the useful audio information of the whole video, the audio extracted from the entire video i is converted into its Mel spectrogram a_i as the input of the audio modality; where i = 1, ..., N.
Further, the specific processing procedure of the supervised contrast learning module is as follows:
1) modal feature extraction: the RGB segment v_i of video i is passed through a 3D convolutional network with R3D as the backbone, yielding the feature z_i^v; correspondingly, the audio Mel spectrogram a_i is passed through the audio 3D convolutional network, yielding the feature z_i^a;
2) generating a self-supervised contrast loss through self-supervised contrast learning;
3) generating a supervised contrast loss through supervised contrast learning.
Further, the specific process of generating the self-supervised contrast loss through self-supervised contrast learning is as follows:
The positive pairs {z_i^v, z_i^a} (i = 1, ..., N) are the RGB segment feature z_i^v and the Mel spectrogram feature z_i^a generated from the same video i; the negative pairs {z_i^v, z_j^v} ∪ {z_i^v, z_j^a} (i, j = 1, ..., N and i ≠ j) are the RGB segment feature z_i^v generated from video i paired with all RGB segment features z_j^v and Mel spectrogram features z_j^a generated from any other video j (i ≠ j). The self-supervised contrast loss of the video RGB image modality is expressed as:

L_{self}^{v} = -\sum_{i=1}^{N} \log \frac{\exp\left(\mathrm{sim}(z_i^v, z_i^a)/\tau\right)}{\sum_{j=1}^{N} \left[ \mathbb{1}_{[j \neq i]} \exp\left(\mathrm{sim}(z_i^v, z_j^v)/\tau\right) + \exp\left(\mathrm{sim}(z_i^v, z_j^a)/\tau\right) \right]} \quad (1)
where τ is a scalar temperature parameter and sim(·, ·) denotes the similarity between two features; the numerator is the similarity term of the positive pair and the denominator is the sum over all positive and negative pairs;
similarly, the self-supervised contrast loss of the audio modality is:

L_{self}^{a} = -\sum_{i=1}^{N} \log \frac{\exp\left(\mathrm{sim}(z_i^a, z_i^v)/\tau\right)}{\sum_{j=1}^{N} \left[ \mathbb{1}_{[j \neq i]} \exp\left(\mathrm{sim}(z_i^a, z_j^a)/\tau\right) + \exp\left(\mathrm{sim}(z_i^a, z_j^v)/\tau\right) \right]} \quad (2)
the overall self-supervised contrast loss is given by equations (1) and (2):

L_{self} = L_{self}^{v} + L_{self}^{a} \quad (3)
further, the specific process of generating the supervised contrast loss by the supervised contrast learning is as follows:
directly facing
Figure 372296DEST_PATH_IMAGE001
,
Figure 780143DEST_PATH_IMAGE008
}i, j =1, ·, N, and yi = yj∪{
Figure 71447DEST_PATH_IMAGE001
,
Figure 666377DEST_PATH_IMAGE003
}i, j =1, ·, N with i ≠ j and yi = yjExpressed as:
Figure 334118DEST_PATH_IMAGE001
and all RGB segment features from video i and video j of the same class
Figure 268576DEST_PATH_IMAGE003
And Mel-frequency-map features generated by audio
Figure 855415DEST_PATH_IMAGE002
Figure 813007DEST_PATH_IMAGE008
(ii) a The rest are negative pairs; the supervised contrast loss formula is as follows:
the supervised contrast loss for the RGB image modality of video is expressed as:
Figure 644959DEST_PATH_IMAGE009
where τ is a scalar temperature parameter, P(i) is the set of positive samples paired with the anchor (all RGB segment and Mel spectrogram features sharing the label y_i), and A(i) is the set of all positive and negative samples other than the anchor itself; the numerator sums over the positive samples and the denominator sums over all positive and negative samples;
similarly, the supervised contrast loss of the audio modality, with P(i) and A(i) defined analogously for the anchor z_i^a, is:

L_{sup}^{a} = -\sum_{i=1}^{N} \frac{1}{|P(i)|} \sum_{p \in P(i)} \log \frac{\exp\left(\mathrm{sim}(z_i^a, z_p)/\tau\right)}{\sum_{k \in A(i)} \exp\left(\mathrm{sim}(z_i^a, z_k)/\tau\right)} \quad (5)
the overall supervised contrast loss is given by equations (4) and (5):

L_{sup} = L_{sup}^{v} + L_{sup}^{a} \quad (6)
further, the cross-modal feature fusion module receives features from different modalities and learns global context embedding, which is then used to recalibrate input features from different segments, using video segment features learned from the supervised contrast learning framework as inputs, fused features as outputs, and computing the loss function of the fused portion by cross entropy.
Further, the specific processing procedure of the cross-modal feature fusion module is as follows:
The two modalities of a video i are v_i and a_i, and the features extracted by the three-dimensional convolutional networks in the supervised contrast learning framework are z_i^v and z_i^a.
To exploit the correlation between the two modalities, the two feature vectors are concatenated and a joint representation is obtained through a fully connected layer:

Z_u = W_s [z_i^v, z_i^a] + b_s \quad (7)
where [·, ·] denotes the concatenation operation, Z_u denotes the joint representation, and W_s and b_s are the weight and bias of the fully connected layer; a reduced dimension of the joint representation is chosen to limit model capacity and increase its generalization ability. To exploit the global context information aggregated in the joint representation Z_u, its excitation signal is predicted by a fully connected layer:

E = W_e Z_u + b_e \quad (8)
where W_e and b_e are the weight and bias of the fully connected layer; after the excitation signal E is obtained, it is used to adaptively recalibrate the input features z_i^v and z_i^a through a simple gating mechanism:

\tilde{z}_i^v = \delta(E) \odot z_i^v, \qquad \tilde{z}_i^a = \delta(E) \odot z_i^a \quad (9)
where ⊙ is the element-wise product along the channel dimension and δ(·) is the linear rectifying function; in this way the feature of one segment is allowed to recalibrate the feature of the other segment while the correlation between the different segments is preserved;
the two recalibrated feature vectors \tilde{z}_i^v and \tilde{z}_i^a are concatenated and fed into a fully connected layer with the normalized exponential function softmax as the classification output, and a cross-entropy loss is used to measure the correctness of the classification:

L_{cross} = -\sum_{i=1}^{C} y_i \log p_i \quad (10)
where y_i and p_i respectively denote the ground-truth and predicted probability that the sample belongs to class i, and C denotes the number of classes;
the overall loss function is obtained by combining equation (6) and equation (10), where λ_sup and λ_cross respectively control the contributions of L_sup and L_cross:

L = \lambda_{sup} L_{sup} + \lambda_{cross} L_{cross} \quad (11)
further, each video segment is sized to have a size of
Figure 654406DEST_PATH_IMAGE025
Where c is the number of channels, l is the number of frames, and h and w represent the height and width of the frames.
Further, the size of the 3D convolution kernel is t × d × d, where t is the temporal length and d is the spatial size; the video RGB image sequence is defined as V = {v_1, v_2, ..., v_N}, where v_i is the RGB segment generated by sampling m consecutive frames from video i (i = 1, ..., N).
Further, the audio modality is the Mel spectrogram generated by applying the short-time Fourier transform to the whole audio track of a video; one RGB segment of a video is aligned with the Mel spectrogram generated from the whole video as the input pair; the audio Mel spectrogram sequence is represented as A = {a_1, a_2, ..., a_N}, where a_i is the Mel spectrogram generated from the audio extracted from video i.
Compared with the prior art, the technical scheme of the invention has the beneficial effects that:
the invention provides a method for extracting characteristics of two modes of audio and video by using supervised contrast learning as a framework based on a complementary relation of information of the audio and video images, constructs an audio and video association analysis module to realize audio and video alignment, and designs a cross-mode characteristic fusion module based on an attention mechanism to realize the fusion of the audio and video characteristics. The audio and RGB pictures are used as input to achieve the goal of learning the video representation.
Drawings
FIG. 1 is a block diagram of the overall process of the system of the present invention;
FIG. 2 is an exemplary diagram of audio/video comparison learning according to the present invention;
FIG. 3 is a block diagram of a Supervised Contrast Learning (SCL) process of the present invention;
fig. 4 is a framework diagram of a cross-modal feature fusion module (MFAM) processing procedure in the present invention.
Detailed Description
The drawings are for illustrative purposes only and are not to be construed as limiting the patent;
for the purpose of better illustrating the embodiments, certain features of the drawings may be omitted, enlarged or reduced, and do not represent the size of an actual product;
it will be understood by those skilled in the art that certain well-known structures in the drawings and descriptions thereof may be omitted.
The technical solution of the present invention is further described below with reference to the accompanying drawings and examples.
Example 1
As shown in fig. 1, a cross-modal feature fusion system based on attention mechanism includes:
the audio and video correlation analysis module is used for aligning two modes of audio and video RGB images;
the supervised contrast learning module is used for extracting the characteristics of the modes from the two modes of the audio and video RGB images;
and the cross-modal feature fusion module is used for learning the global context representation by utilizing the related knowledge between the modalities.
The audio-video correlation analysis module takes 16 consecutive RGB frames from a video i to form an RGB segment v_i as the input of the RGB image modality; at this time only one segment is sampled per video, and in order to fully exploit the useful audio information of the whole video, the audio extracted from the entire video i is converted into its Mel spectrogram a_i as the input of the audio modality; where i = 1, ..., N.
The specific processing procedure of the supervised contrast learning module is as follows:
1) modal feature extraction: the RGB segment v_i of video i is passed through a 3D convolutional network with R3D as the backbone, yielding the feature z_i^v; correspondingly, the audio Mel spectrogram a_i is passed through the audio 3D convolutional network, yielding the feature z_i^a;
2) generating a self-supervised contrast loss through self-supervised contrast learning;
3) generating a supervised contrast loss through supervised contrast learning.
The two modalities, audio and video RGB images, are aligned through the audio-video correlation analysis (AVCA) module. In this module, a video RGB image modality and an audio modality are used as input for each video. The video RGB image modality is a segment formed by randomly sampling 16 consecutive frames from the video. The audio modality is the Mel spectrogram generated by applying the short-time Fourier transform to the whole audio track of the video. A segment of the video RGB images is thus aligned with the Mel spectrogram of the whole video as the input pair.
In a Supervised Contrast Learning (SCL) module, spatio-temporal features are extracted for video RGB image segments and audio mel-frequency spectrograms respectively using two different three-dimensional convolutional neural networks (3D CNNs), and all 3D CNNs share the same weights. Then, supervised contrast loss is designed for the features generated by the two modalities to enhance the discriminative power of the homogeneous sample representation learning.
In multi-modal fusion, a cross-modal feature fusion module (MFAM) based on the attention mechanism is introduced; the features learned from the supervised contrast learning framework are propagated through the MFAM module and the channel features are adaptively recalibrated. After the recalibrated features are concatenated, the loss function is calculated through cross entropy.
Example 2
As shown in fig. 1, a cross-modal feature fusion system based on attention mechanism includes:
the audio and video correlation analysis module is used for aligning two modes of audio and video RGB images;
the supervised contrast learning module is used for extracting the characteristics of the modes from the two modes of the audio and video RGB images;
and the cross-modal feature fusion module is used for learning the global context representation by utilizing the related knowledge between the modalities.
The audio-video correlation analysis module takes 16 consecutive RGB frames from a video i to form an RGB segment v_i as the input of the RGB image modality; at this time only one segment is sampled per video, and in order to fully exploit the useful audio information of the whole video, the audio extracted from the entire video i is converted into its Mel spectrogram a_i as the input of the audio modality; where i = 1, ..., N.
The specific processing procedure of the supervised contrast learning module is as follows:
1) modal feature extraction: the RGB segment v_i of video i is passed through a 3D convolutional network with R3D as the backbone, yielding the feature z_i^v; correspondingly, the audio Mel spectrogram a_i is passed through the audio 3D convolutional network, yielding the feature z_i^a;
2) generating a self-supervised contrast loss through self-supervised contrast learning;
3) generating a supervised contrast loss through supervised contrast learning.
The specific process of generating the self-supervised contrast loss through self-supervised contrast learning is as follows:
The positive pairs {z_i^v, z_i^a} (i = 1, ..., N) are the RGB segment feature z_i^v and the Mel spectrogram feature z_i^a generated from the same video i; the negative pairs {z_i^v, z_j^v} ∪ {z_i^v, z_j^a} (i, j = 1, ..., N and i ≠ j) are the RGB segment feature z_i^v generated from video i paired with all RGB segment features z_j^v and Mel spectrogram features z_j^a generated from any other video j (i ≠ j). The self-supervised contrast loss of the video RGB image modality is expressed as:

L_{self}^{v} = -\sum_{i=1}^{N} \log \frac{\exp\left(\mathrm{sim}(z_i^v, z_i^a)/\tau\right)}{\sum_{j=1}^{N} \left[ \mathbb{1}_{[j \neq i]} \exp\left(\mathrm{sim}(z_i^v, z_j^v)/\tau\right) + \exp\left(\mathrm{sim}(z_i^v, z_j^a)/\tau\right) \right]} \quad (1)
where τ is a scalar temperature parameter and sim(·, ·) denotes the similarity between two features; the numerator is the similarity term of the positive pair and the denominator is the sum over all positive and negative pairs;
similarly, the self-supervised contrast loss of the audio modality is:

L_{self}^{a} = -\sum_{i=1}^{N} \log \frac{\exp\left(\mathrm{sim}(z_i^a, z_i^v)/\tau\right)}{\sum_{j=1}^{N} \left[ \mathbb{1}_{[j \neq i]} \exp\left(\mathrm{sim}(z_i^a, z_j^a)/\tau\right) + \exp\left(\mathrm{sim}(z_i^a, z_j^v)/\tau\right) \right]} \quad (2)
the overall self-supervised contrast loss is given by equations (1) and (2):

L_{self} = L_{self}^{v} + L_{self}^{a} \quad (3)
the specific process of generating the supervised contrast loss by the supervised contrast learning is as follows:
directly facing
Figure 255731DEST_PATH_IMAGE001
,
Figure 104738DEST_PATH_IMAGE008
}i, j =1, ·, N, and yi = yj∪{
Figure 823295DEST_PATH_IMAGE001
,
Figure 136465DEST_PATH_IMAGE003
}i, j =1, ·, N with i ≠ j and yi = yjExpressed as:
Figure 915065DEST_PATH_IMAGE001
and all RGB segment features from video i and video j of the same class
Figure 579264DEST_PATH_IMAGE003
And Mel-frequency-map features generated by audio
Figure 101513DEST_PATH_IMAGE002
Figure 206872DEST_PATH_IMAGE008
(ii) a The rest are negative pairs; the supervised contrast loss formula is as follows:
the supervised contrast loss for the RGB image modality of video is expressed as:
Figure 782472DEST_PATH_IMAGE009
where τ is a scalar temperature parameter, P(i) is the set of positive samples paired with the anchor (all RGB segment and Mel spectrogram features sharing the label y_i), and A(i) is the set of all positive and negative samples other than the anchor itself; the numerator sums over the positive samples and the denominator sums over all positive and negative samples;
similarly, the supervised contrast loss of the audio modality, with P(i) and A(i) defined analogously for the anchor z_i^a, is:

L_{sup}^{a} = -\sum_{i=1}^{N} \frac{1}{|P(i)|} \sum_{p \in P(i)} \log \frac{\exp\left(\mathrm{sim}(z_i^a, z_p)/\tau\right)}{\sum_{k \in A(i)} \exp\left(\mathrm{sim}(z_i^a, z_k)/\tau\right)} \quad (5)
the overall supervised contrast loss is given by equations (4) and (5):

L_{sup} = L_{sup}^{v} + L_{sup}^{a} \quad (6)
the cross-modal feature fusion module receives features from different modalities and learns global context embedding, then the embedding is used for recalibrating input features from different segments, video segment features learned from a supervised contrast learning framework are used as input, the fused features are used as output, and loss functions of a fusion part are calculated through cross entropy.
The specific processing process of the cross-modal feature fusion module is as follows:
The two modalities of a video i are v_i and a_i, and the features extracted by the three-dimensional convolutional networks in the supervised contrast learning framework are z_i^v and z_i^a.
To exploit the correlation between the two modalities, the two feature vectors are concatenated and a joint representation is obtained through a fully connected layer:

Z_u = W_s [z_i^v, z_i^a] + b_s \quad (7)
where [·, ·] denotes the concatenation operation, Z_u denotes the joint representation, and W_s and b_s are the weight and bias of the fully connected layer; a reduced dimension of the joint representation is chosen to limit model capacity and increase its generalization ability. To exploit the global context information aggregated in the joint representation Z_u, its excitation signal is predicted by a fully connected layer:

E = W_e Z_u + b_e \quad (8)
where W_e and b_e are the weight and bias of the fully connected layer; after the excitation signal E is obtained, it is used to adaptively recalibrate the input features z_i^v and z_i^a through a simple gating mechanism:

\tilde{z}_i^v = \delta(E) \odot z_i^v, \qquad \tilde{z}_i^a = \delta(E) \odot z_i^a \quad (9)
where ⊙ is the element-wise product along the channel dimension and δ(·) is the linear rectifying function; in this way the feature of one segment is allowed to recalibrate the feature of the other segment while the correlation between the different segments is preserved;
the two recalibrated feature vectors \tilde{z}_i^v and \tilde{z}_i^a are concatenated and fed into a fully connected layer with the normalized exponential function softmax as the classification output, and a cross-entropy loss is used to measure the correctness of the classification:

L_{cross} = -\sum_{i=1}^{C} y_i \log p_i \quad (10)
where y_i and p_i respectively denote the ground-truth and predicted probability that the sample belongs to class i, and C denotes the number of classes;
the overall loss function is obtained by combining equation (6) and equation (10), where λ_sup and λ_cross respectively control the contributions of L_sup and L_cross:

L = \lambda_{sup} L_{sup} + \lambda_{cross} L_{cross} \quad (11)
example 3
To facilitate the description of each module, N different videos are given, and each video segment has a size of c × l × h × w, where c is the number of channels, l is the number of frames, and h and w are the height and width of the frames. The size of the 3D convolution kernel is t × d × d, where t is the temporal length and d is the spatial size. The video RGB image sequence is defined as V = {v_1, v_2, ..., v_N}, where v_i is the RGB segment generated by sampling m consecutive frames from video i (i = 1, ..., N). The audio modality is the Mel spectrogram generated by applying the short-time Fourier transform to the whole audio track of a video; one RGB segment of a video is aligned with the Mel spectrogram of the whole video as the input pair; the audio Mel spectrogram sequence is represented as A = {a_1, a_2, ..., a_N}, where a_i is the Mel spectrogram generated from the audio extracted from video i, and y_i is the category label of video i.
1) Audio video association analysis (audio video alignment)
A sound signal is one-dimensional: only its time-domain information can be seen directly, not its frequency-domain information. The signal can be transformed to the frequency domain by the Fourier Transform (FT), but the time-domain information is then lost and the time-frequency relationship cannot be observed. Many methods have been proposed to solve this problem; the short-time Fourier transform and wavelets are common time-frequency analysis methods.
The short-time Fourier transform (STFT) is the Fourier transform applied to short segments of a signal. Its principle is as follows: a long audio signal is divided into frames and windowed, the Fourier transform is applied to each frame, and the per-frame results are stacked along another dimension to obtain an image (similar to a two-dimensional signal) called a spectrogram.
Since the resulting spectrogram is large, it is usually passed through Mel-scale filter banks to obtain a Mel spectrogram of a suitable size, as sketched in the example below.
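As a concrete illustration, the following sketch computes such a Mel spectrogram with the librosa library; the sampling rate, FFT size, hop length, number of Mel bands, and the helper name mel_spectrogram are illustrative assumptions rather than values taken from the patent.

```python
import librosa
import numpy as np

def mel_spectrogram(wav_path, sr=16000, n_fft=1024, hop_length=512, n_mels=128):
    """Load a video's audio track and convert it to a log-Mel spectrogram."""
    y, _ = librosa.load(wav_path, sr=sr)                  # 1-D waveform (time domain)
    mel = librosa.feature.melspectrogram(                 # STFT followed by Mel-scale filter banks
        y=y, sr=sr, n_fft=n_fft, hop_length=hop_length, n_mels=n_mels)
    return librosa.power_to_db(mel, ref=np.max)           # (n_mels, num_frames)
```

Converting the power spectrogram to decibels simply keeps the dynamic range manageable; any comparable time-frequency front end would serve the same role here.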
In conventional audio-video alignment, RGB images are mostly aligned with the Mel spectrogram generated from an audio clip of a fixed duration. This aligns the two modalities and extracts the still image information and the audio information of the video, but it ignores the temporal information contained in the video itself.
In order to exploit the temporal information of the video, the invention takes 16 consecutive RGB frames from each video i (i = 1, ..., N) to form an RGB segment v_i as the input of the RGB image modality. At this time only one segment is sampled per video, and in order to fully exploit the useful audio information of the whole video, the audio extracted from the entire video i is converted into its Mel spectrogram a_i as the input of the audio modality, as in the pairing sketch below.
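The following is a minimal pairing sketch; it reuses the hypothetical mel_spectrogram helper from the previous block, and the decoded-frame tensor layout and helper names are illustrative assumptions, not details from the original.

```python
import random
import torch

def sample_rgb_clip(frames: torch.Tensor, clip_len: int = 16) -> torch.Tensor:
    """frames: (T, C, H, W) decoded video; returns one clip shaped (C, l, H, W) with l = 16."""
    start = random.randint(0, frames.shape[0] - clip_len)
    return frames[start:start + clip_len].permute(1, 0, 2, 3)   # 16 consecutive frames

def make_pair(frames: torch.Tensor, wav_path: str):
    """Pair one RGB segment v_i with the Mel spectrogram a_i of the whole video."""
    v_i = sample_rgb_clip(frames)                                       # RGB modality input
    a_i = torch.from_numpy(mel_spectrogram(wav_path)).unsqueeze(0)      # (1, n_mels, num_frames)
    return v_i, a_i
```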
1.1) Audio video comparison learning
In the image domain, self-supervised contrast learning is a common learning method. Its core idea is to pull features of the same sample as close as possible while pushing features of different samples as far apart as possible. Images are typically augmented by flipping or cropping, so that the augmented picture and the original picture form a positive pair while all other pictures form negative pairs with the original. The contrast loss then pulls positive pairs together and pushes negative pairs apart.
In order to make the features of actions of the same class similar to each other, researchers proposed a new contrast learning method: supervised contrast learning. Its core idea is to pull features from samples of the same category as close as possible while pushing features from different categories as far apart as possible. The positive pairs are then extended to the pictures generated by augmenting the original picture plus the pictures sharing its category label, and the negative pairs are all pictures not belonging to the same category as the original.
Although contrast learning has been widely applied to image learning and some researchers have introduced it into video representation learning, its combination with the multi-modal setting has only been proposed in recent years. In the multi-modal field, most work treats RGB images and optical flow as the two modalities, and audio is used as a modality far less often. The invention therefore introduces supervised contrast learning into audio-video multi-modal learning, so that the model can better extract the features of the different modalities and better distinguish samples with large intra-class variation and small inter-class variation.
2) Modal feature extraction
The RGB segment v_i of video i is passed through a 3D convolutional network with R3D as the backbone, yielding the feature z_i^v; correspondingly, the audio Mel spectrogram a_i is passed through the audio 3D convolutional network, yielding the feature z_i^a. The difference between the two networks is the number of input channels.
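A minimal sketch of the two feature extractors is given below, assuming torchvision's r3d_18 as the R3D backbone; adapting the stem to a single-channel spectrogram input, the 128-dimensional embedding, and feeding the 2-D Mel spectrogram as a depth-1 3-D volume are illustrative assumptions.

```python
import torch.nn as nn
from torchvision.models.video import r3d_18

def build_encoder(in_channels: int, embed_dim: int = 128) -> nn.Module:
    """3D-conv feature extractor: in_channels=3 for RGB clips, 1 for the Mel spectrogram input."""
    net = r3d_18(weights=None)
    if in_channels != 3:
        # Swap the stem convolution so the network accepts a single-channel spectrogram volume
        net.stem[0] = nn.Conv3d(in_channels, 64, kernel_size=(3, 7, 7),
                                stride=(1, 2, 2), padding=(1, 3, 3), bias=False)
    net.fc = nn.Linear(net.fc.in_features, embed_dim)   # projection producing z_i^v / z_i^a
    return net

video_encoder = build_encoder(3)   # z_v = video_encoder(v), v of shape (B, 3, 16, H, W)
audio_encoder = build_encoder(1)   # z_a = audio_encoder(a), a given a dummy depth dimension
```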
3) Supervised contrast learning
3.1) self-supervised contrast learning
As shown in Fig. 3, in the Supervised Contrast Learning (SCL) framework, the core of self-supervised contrast learning is to make features from the same sample as close as possible and features from different samples as far apart as possible.
In the invention, taking the RGB segment feature z_i^v of video i as an example, the positive pairs {z_i^v, z_i^a} (i = 1, ..., N) are the RGB segment feature z_i^v and the Mel spectrogram feature z_i^a generated from the same video i; the negative pairs {z_i^v, z_j^v} ∪ {z_i^v, z_j^a} (i, j = 1, ..., N and i ≠ j) are the RGB segment feature z_i^v generated from video i paired with all RGB segment features z_j^v and Mel spectrogram features z_j^a generated from any other video j (i ≠ j), as shown in Fig. 2. At this time, the self-supervised contrast loss of the video RGB image modality is expressed as:

L_{self}^{v} = -\sum_{i=1}^{N} \log \frac{\exp\left(\mathrm{sim}(z_i^v, z_i^a)/\tau\right)}{\sum_{j=1}^{N} \left[ \mathbb{1}_{[j \neq i]} \exp\left(\mathrm{sim}(z_i^v, z_j^v)/\tau\right) + \exp\left(\mathrm{sim}(z_i^v, z_j^a)/\tau\right) \right]} \quad (1)
where τ is a scalar temperature parameter and sim(·, ·) denotes the similarity between two features; the numerator is the similarity term of the positive pair and the denominator is the sum over all positive and negative pairs;
similarly, the self-supervised contrast loss of the audio modality is:

L_{self}^{a} = -\sum_{i=1}^{N} \log \frac{\exp\left(\mathrm{sim}(z_i^a, z_i^v)/\tau\right)}{\sum_{j=1}^{N} \left[ \mathbb{1}_{[j \neq i]} \exp\left(\mathrm{sim}(z_i^a, z_j^a)/\tau\right) + \exp\left(\mathrm{sim}(z_i^a, z_j^v)/\tau\right) \right]} \quad (2)
the overall self-supervised contrast loss is given by equations (1) and (2):

L_{self} = L_{self}^{v} + L_{self}^{a} \quad (3)
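A minimal sketch of the symmetric self-supervised contrast loss of equations (1)-(3) is given below; cosine similarity, the batch-wise mean, and the temperature value are illustrative assumptions on top of the reconstructed notation.

```python
import torch
import torch.nn.functional as F

def self_supervised_contrast_loss(z_v, z_a, tau=0.07):
    """z_v, z_a: (N, d) RGB and Mel features of the same N videos; returns L_self."""
    z_v, z_a = F.normalize(z_v, dim=1), F.normalize(z_a, dim=1)
    n = z_v.shape[0]
    vv = z_v @ z_v.t() / tau                      # RGB-RGB similarities
    va = z_v @ z_a.t() / tau                      # RGB-Mel similarities (diagonal = positive pairs)
    aa = z_a @ z_a.t() / tau                      # Mel-Mel similarities
    off_diag = ~torch.eye(n, dtype=torch.bool, device=z_v.device)

    # Eq. (1): anchor z_i^v; denominator covers z_j^v (j != i) and all z_j^a
    denom_v = torch.cat([vv[off_diag].view(n, n - 1), va], dim=1)
    loss_v = -(va.diag() - torch.logsumexp(denom_v, dim=1)).mean()

    # Eq. (2): anchor z_i^a, symmetric to eq. (1)
    denom_a = torch.cat([aa[off_diag].view(n, n - 1), va.t()], dim=1)
    loss_a = -(va.diag() - torch.logsumexp(denom_a, dim=1)).mean()

    return loss_v + loss_a                        # Eq. (3), averaged over the batch
```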
3.2) supervised contrast learning
Although self-supervised contrast learning can learn good features, it has a disadvantage: the correlation between features of actions belonging to the same category is not considered. In order to make the features of actions of the same class close to each other, a new contrast learning method is adopted: supervised contrast learning. Its core is to make features from samples of the same class as close as possible and features from different classes as far apart as possible.
In the present invention, taking the RGB segment feature z_i^v as an example, the positive pairs {z_i^v, z_j^a} (i, j = 1, ..., N and y_i = y_j) ∪ {z_i^v, z_j^v} (i, j = 1, ..., N with i ≠ j and y_i = y_j) are z_i^v paired with all RGB segment features z_j^v and Mel spectrogram features z_i^a, z_j^a coming from video i and from every video j of the same class; the remaining pairs are negative pairs. The supervised contrast loss of the video RGB image modality is expressed as:

L_{sup}^{v} = -\sum_{i=1}^{N} \frac{1}{|P(i)|} \sum_{p \in P(i)} \log \frac{\exp\left(\mathrm{sim}(z_i^v, z_p)/\tau\right)}{\sum_{k \in A(i)} \exp\left(\mathrm{sim}(z_i^v, z_k)/\tau\right)} \quad (4)
where τ is a scalar temperature parameter, P(i) is the set of positive samples paired with the anchor (all RGB segment and Mel spectrogram features sharing the label y_i), and A(i) is the set of all positive and negative samples other than the anchor itself; the numerator sums over the positive samples and the denominator sums over all positive and negative samples;
similarly, the supervised contrast loss of the audio modality, with P(i) and A(i) defined analogously for the anchor z_i^a, is:

L_{sup}^{a} = -\sum_{i=1}^{N} \frac{1}{|P(i)|} \sum_{p \in P(i)} \log \frac{\exp\left(\mathrm{sim}(z_i^a, z_p)/\tau\right)}{\sum_{k \in A(i)} \exp\left(\mathrm{sim}(z_i^a, z_k)/\tau\right)} \quad (5)
the overall supervised contrast loss is given by equations (4) and (5):

L_{sup} = L_{sup}^{v} + L_{sup}^{a} \quad (6)
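A minimal sketch of the label-aware loss of equations (4)-(6) is given below, under the same assumptions as the previous block; treating the RGB and Mel features of a batch as a single pool of 2N anchors, so that L_sup^v and L_sup^a are computed jointly, is an implementation shortcut and not a detail confirmed by the original.

```python
import torch
import torch.nn.functional as F

def supervised_contrast_loss(z_v, z_a, labels, tau=0.07):
    """z_v, z_a: (N, d) features; labels: (N,) class labels; returns the supervised contrast loss."""
    feats = F.normalize(torch.cat([z_v, z_a], dim=0), dim=1)        # 2N anchors (both modalities)
    lbls = torch.cat([labels, labels], dim=0)
    sim = feats @ feats.t() / tau
    n2 = feats.shape[0]
    not_self = ~torch.eye(n2, dtype=torch.bool, device=feats.device)
    pos = (lbls.unsqueeze(0) == lbls.unsqueeze(1)) & not_self       # same class, anchor excluded

    # log-probability against A(i): every feature except the anchor itself
    log_prob = sim - torch.logsumexp(sim.masked_fill(~not_self, float('-inf')), dim=1, keepdim=True)
    # Eqs. (4)/(5): average log-probability over the positive set P(i), then over anchors
    loss = -(log_prob * pos).sum(dim=1) / pos.sum(dim=1).clamp(min=1)
    return loss.mean()                                              # batch-wise version of eq. (6)
```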
4) multimodal fusion
As shown in Fig. 4, in order to better fuse the information between the different modalities, a cross-modal feature fusion module (MFAM) based on the attention mechanism is proposed. Since features from different modalities are correlated, the module receives features from the different modalities and learns a global context embedding, which is then used to recalibrate the input features from the different segments. The video segment features learned under the supervised contrast learning framework are used as input, the fused features are the output, and the loss of the fusion part is computed by cross entropy.
To fix notation, assume that the two modalities of a video i are v_i and a_i, and that the features extracted by the three-dimensional convolutional networks in the supervised contrast learning framework are z_i^v and z_i^a.
To exploit the correlation between the two modalities, the two feature vectors are concatenated and a joint representation is obtained through a fully connected layer:

Z_u = W_s [z_i^v, z_i^a] + b_s \quad (7)
where [·, ·] denotes the concatenation operation, Z_u denotes the joint representation, and W_s and b_s are the weight and bias of the fully connected layer; a reduced dimension of the joint representation is chosen to limit model capacity and increase its generalization ability. To exploit the global context information aggregated in the joint representation Z_u, its excitation signal is predicted by a fully connected layer:

E = W_e Z_u + b_e \quad (8)
where W_e and b_e are the weight and bias of the fully connected layer; after the excitation signal E is obtained, it is used to adaptively recalibrate the input features z_i^v and z_i^a through a simple gating mechanism:

\tilde{z}_i^v = \delta(E) \odot z_i^v, \qquad \tilde{z}_i^a = \delta(E) \odot z_i^a \quad (9)
where ⊙ is the element-wise product along the channel dimension and δ(·) is the linear rectifying function; in this way the feature of one segment is allowed to recalibrate the feature of the other segment while the correlation between the different segments is preserved;
the two recalibrated feature vectors \tilde{z}_i^v and \tilde{z}_i^a are concatenated and fed into a fully connected layer with the normalized exponential function softmax as the classification output, and a cross-entropy loss is used to measure the correctness of the classification:

L_{cross} = -\sum_{i=1}^{C} y_i \log p_i \quad (10)
where y_i and p_i respectively denote the ground-truth and predicted probability that the sample belongs to class i, and C denotes the number of classes;
the overall loss function is obtained by combining equation (6) and equation (10), where λ_sup and λ_cross respectively control the contributions of L_sup and L_cross:

L = \lambda_{sup} L_{sup} + \lambda_{cross} L_{cross} \quad (11)
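A minimal sketch of the fusion module and the overall objective of equations (7)-(11) is given below; the feature dimension, joint-representation dimension, class count, and the use of a single shared excitation signal for both modalities follow the reconstructed equations and are assumptions rather than details confirmed by the original figures.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MFAM(nn.Module):
    """Attention-based cross-modal fusion of z_v and z_a, following eqs. (7)-(10)."""
    def __init__(self, dim: int = 512, joint_dim: int = 128, num_classes: int = 101):
        super().__init__()
        self.squeeze = nn.Linear(2 * dim, joint_dim)   # eq. (7): Z_u = W_s [z_v, z_a] + b_s
        self.excite = nn.Linear(joint_dim, dim)        # eq. (8): E = W_e Z_u + b_e
        self.relu = nn.ReLU()                          # delta(.) in eq. (9)
        self.classifier = nn.Linear(2 * dim, num_classes)

    def forward(self, z_v, z_a):
        z_u = self.squeeze(torch.cat([z_v, z_a], dim=1))
        e = self.relu(self.excite(z_u))
        z_v_r, z_a_r = e * z_v, e * z_a                # eq. (9): gated recalibration of both modalities
        return self.classifier(torch.cat([z_v_r, z_a_r], dim=1))

def total_loss(logits, labels, l_sup, lambda_sup=1.0, lambda_cross=1.0):
    """Eq. (11): combine the supervised contrast loss with the cross-entropy loss of eq. (10)."""
    l_cross = F.cross_entropy(logits, labels)
    return lambda_sup * l_sup + lambda_cross * l_cross
```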
the same or similar reference numerals correspond to the same or similar parts;
the positional relationships depicted in the drawings are for illustrative purposes only and are not to be construed as limiting the present patent;
it should be understood that the above-described embodiments of the present invention are merely examples for clearly illustrating the present invention, and are not intended to limit the embodiments of the present invention. Other variations and modifications will be apparent to persons skilled in the art in light of the above description. And are neither required nor exhaustive of all embodiments. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present invention should be included in the protection scope of the claims of the present invention.

Claims (10)

1. A cross-modal feature fusion system based on an attention mechanism, comprising:
the audio and video correlation analysis module is used for aligning two modes of audio and video RGB images;
the supervised contrast learning module is used for extracting the characteristics of the modes from the two modes of the audio and video RGB images;
and the cross-modal feature fusion module is used for learning the global context representation by utilizing the related knowledge between the modalities.
2. The attention-based cross-modal feature fusion system of claim 1, wherein the audio-video correlation analysis module takes 16 consecutive RGB frames from a video i to form an RGB segment v_i as the input of the RGB image modality; at this time only one segment is sampled per video, and in order to fully exploit the useful audio information of the whole video, the audio extracted from the entire video i is converted into its Mel spectrogram a_i as the input of the audio modality; where i = 1, ..., N.
3. The attention-based cross-modal feature fusion system of claim 2, wherein the specific processing procedure of the supervised contrast learning module is as follows:
1) modal feature extraction: the RGB segment v_i of video i is passed through a 3D convolutional network with R3D as the backbone, yielding the feature z_i^v; correspondingly, the audio Mel spectrogram a_i is passed through the audio 3D convolutional network, yielding the feature z_i^a;
2) generating a self-supervised contrast loss through self-supervised contrast learning;
3) generating a supervised contrast loss through supervised contrast learning.
4. The cross-modal feature fusion system based on an attention mechanism as claimed in claim 3, wherein the specific process of generating the self-supervised contrast loss through self-supervised contrast learning is:
the positive pairs {z_i^v, z_i^a} (i = 1, ..., N) are the RGB segment feature z_i^v and the Mel spectrogram feature z_i^a generated from the same video i; the negative pairs {z_i^v, z_j^v} ∪ {z_i^v, z_j^a} (i, j = 1, ..., N and i ≠ j) are the RGB segment feature z_i^v generated from video i paired with all RGB segment features z_j^v and Mel spectrogram features z_j^a generated from any other video j (i ≠ j); the self-supervised contrast loss of the video RGB image modality is expressed as:

L_{self}^{v} = -\sum_{i=1}^{N} \log \frac{\exp\left(\mathrm{sim}(z_i^v, z_i^a)/\tau\right)}{\sum_{j=1}^{N} \left[ \mathbb{1}_{[j \neq i]} \exp\left(\mathrm{sim}(z_i^v, z_j^v)/\tau\right) + \exp\left(\mathrm{sim}(z_i^v, z_j^a)/\tau\right) \right]} \quad (1)
where τ is a scalar temperature parameter and sim(·, ·) denotes the similarity between two features; the numerator is the similarity term of the positive pair and the denominator is the sum over all positive and negative pairs;
similarly, the self-supervised contrast loss of the audio modality is:

L_{self}^{a} = -\sum_{i=1}^{N} \log \frac{\exp\left(\mathrm{sim}(z_i^a, z_i^v)/\tau\right)}{\sum_{j=1}^{N} \left[ \mathbb{1}_{[j \neq i]} \exp\left(\mathrm{sim}(z_i^a, z_j^a)/\tau\right) + \exp\left(\mathrm{sim}(z_i^a, z_j^v)/\tau\right) \right]} \quad (2)
the overall self-supervised contrast loss is given by equations (1) and (2):

L_{self} = L_{self}^{v} + L_{self}^{a} \quad (3)
5. The attention-based cross-modal feature fusion system of claim 4, wherein the specific process of generating the supervised contrast loss through supervised contrast learning is as follows:
the positive pairs {z_i^v, z_j^a} (i, j = 1, ..., N and y_i = y_j) ∪ {z_i^v, z_j^v} (i, j = 1, ..., N with i ≠ j and y_i = y_j) are z_i^v paired with all RGB segment features z_j^v and Mel spectrogram features z_i^a, z_j^a coming from video i and from every video j of the same class; the remaining pairs are negative pairs; the supervised contrast loss of the video RGB image modality is expressed as:

L_{sup}^{v} = -\sum_{i=1}^{N} \frac{1}{|P(i)|} \sum_{p \in P(i)} \log \frac{\exp\left(\mathrm{sim}(z_i^v, z_p)/\tau\right)}{\sum_{k \in A(i)} \exp\left(\mathrm{sim}(z_i^v, z_k)/\tau\right)} \quad (4)
where τ is a scalar temperature parameter, P(i) is the set of positive samples paired with the anchor (all RGB segment and Mel spectrogram features sharing the label y_i), and A(i) is the set of all positive and negative samples other than the anchor itself; the numerator sums over the positive samples and the denominator sums over all positive and negative samples;
similarly, the supervised contrast loss of the audio modality, with P(i) and A(i) defined analogously for the anchor z_i^a, is:

L_{sup}^{a} = -\sum_{i=1}^{N} \frac{1}{|P(i)|} \sum_{p \in P(i)} \log \frac{\exp\left(\mathrm{sim}(z_i^a, z_p)/\tau\right)}{\sum_{k \in A(i)} \exp\left(\mathrm{sim}(z_i^a, z_k)/\tau\right)} \quad (5)
the overall supervised contrast loss is given by equations (4) and (5):

L_{sup} = L_{sup}^{v} + L_{sup}^{a} \quad (6)
6. the attention-based cross-modal feature fusion system of claim 5 wherein the cross-modal feature fusion module receives features from different modalities and learns global context embedding, which is then used to recalibrate input features from different segments, using video segment features learned from the supervised contrast learning framework as input, fused features as output, and computing the loss function of the fused portion by cross entropy.
7. The attention-based cross-modal feature fusion system of claim 6, wherein the specific processing procedure of the cross-modal feature fusion module is:
the two modalities of a video i are v_i and a_i, and the features extracted by the three-dimensional convolutional networks in the supervised contrast learning framework are z_i^v and z_i^a.
To exploit the correlation between the two modalities, the two feature vectors are concatenated and a joint representation is obtained through a fully connected layer:

Z_u = W_s [z_i^v, z_i^a] + b_s \quad (7)
where [·, ·] denotes the concatenation operation, Z_u denotes the joint representation, and W_s and b_s are the weight and bias of the fully connected layer; a reduced dimension of the joint representation is chosen to limit model capacity and increase its generalization ability. To exploit the global context information aggregated in the joint representation Z_u, its excitation signal is predicted by a fully connected layer:

E = W_e Z_u + b_e \quad (8)
where W_e and b_e are the weight and bias of the fully connected layer; after the excitation signal E is obtained, it is used to adaptively recalibrate the input features z_i^v and z_i^a through a simple gating mechanism:

\tilde{z}_i^v = \delta(E) \odot z_i^v, \qquad \tilde{z}_i^a = \delta(E) \odot z_i^a \quad (9)
where ⊙ is the element-wise product along the channel dimension and δ(·) is the linear rectifying function; in this way the feature of one segment is allowed to recalibrate the feature of the other segment while the correlation between the different segments is preserved;
the two recalibrated feature vectors \tilde{z}_i^v and \tilde{z}_i^a are concatenated and fed into a fully connected layer with the normalized exponential function softmax as the classification output, and a cross-entropy loss is used to measure the correctness of the classification:

L_{cross} = -\sum_{i=1}^{C} y_i \log p_i \quad (10)
where y_i and p_i respectively denote the ground-truth and predicted probability that the sample belongs to class i, and C denotes the number of classes;
the overall loss function is obtained by combining equation (6) and equation (10), where λ_sup and λ_cross respectively control the contributions of L_sup and L_cross:

L = \lambda_{sup} L_{sup} + \lambda_{cross} L_{cross} \quad (11)
8. The attention-based cross-modal feature fusion system of any one of claims 1-7, wherein each video segment has a size of c × l × h × w, where c is the number of channels, l is the number of frames, and h and w are the height and width of the frames.
9. The attention-based cross-modal feature fusion system of claim 8, wherein the size of the 3D convolution kernel is t × d × d, where t is the temporal length and d is the spatial size; the video RGB image sequence is defined as V = {v_1, v_2, ..., v_N}, where v_i is the RGB segment generated by sampling m consecutive frames from video i (i = 1, ..., N).
10. The attention-based cross-modal feature fusion system of claim 9, wherein the audio modality is the Mel spectrogram generated by applying the short-time Fourier transform to the whole audio track of a video; one RGB segment of a video is aligned with the Mel spectrogram generated from the whole video as the input pair; the audio Mel spectrogram sequence is represented as A = {a_1, a_2, ..., a_N}, where a_i is the Mel spectrogram generated from the audio extracted from video i.
CN202210256553.8A 2022-03-16 2022-03-16 Cross-modal characteristic fusion system based on attention mechanism Active CN114329036B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210256553.8A CN114329036B (en) 2022-03-16 2022-03-16 Cross-modal characteristic fusion system based on attention mechanism

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210256553.8A CN114329036B (en) 2022-03-16 2022-03-16 Cross-modal characteristic fusion system based on attention mechanism

Publications (2)

Publication Number Publication Date
CN114329036A true CN114329036A (en) 2022-04-12
CN114329036B CN114329036B (en) 2022-07-05

Family

ID=81033312

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210256553.8A Active CN114329036B (en) 2022-03-16 2022-03-16 Cross-modal characteristic fusion system based on attention mechanism

Country Status (1)

Country Link
CN (1) CN114329036B (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115019397A (en) * 2022-06-15 2022-09-06 北京大学深圳研究生院 Comparison self-monitoring human behavior recognition method and system based on temporal-spatial information aggregation
CN115100390A (en) * 2022-08-24 2022-09-23 华东交通大学 Image emotion prediction method combining contrast learning and self-supervision region positioning
CN115116448A (en) * 2022-08-29 2022-09-27 四川启睿克科技有限公司 Voice extraction method, neural network model training method, device and storage medium
CN115620110A (en) * 2022-12-16 2023-01-17 华南理工大学 Video event positioning and identifying method, device and storage medium
CN116824495A (en) * 2023-06-26 2023-09-29 华东交通大学 Dangerous behavior identification method, system, storage medium and computer equipment
CN117173394A (en) * 2023-08-07 2023-12-05 山东大学 Weak supervision salient object detection method and system for unmanned aerial vehicle video data
WO2024087337A1 (en) * 2022-10-24 2024-05-02 深圳先进技术研究院 Method for directly synthesizing speech from tongue ultrasonic images

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112820320A (en) * 2020-12-31 2021-05-18 中国科学技术大学 Cross-modal attention consistency network self-supervision learning method
US20210342646A1 (en) * 2020-04-30 2021-11-04 Arizona Board Of Regents On Behalf Of Arizona State University Systems, methods, and apparatuses for training a deep model to learn contrastive representations embedded within part-whole semantics via a self-supervised learning framework
CN114118200A (en) * 2021-09-24 2022-03-01 杭州电子科技大学 Multi-modal emotion classification method based on attention-guided bidirectional capsule network

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20210342646A1 (en) * 2020-04-30 2021-11-04 Arizona Board Of Regents On Behalf Of Arizona State University Systems, methods, and apparatuses for training a deep model to learn contrastive representations embedded within part-whole semantics via a self-supervised learning framework
CN112820320A (en) * 2020-12-31 2021-05-18 中国科学技术大学 Cross-modal attention consistency network self-supervision learning method
CN114118200A (en) * 2021-09-24 2022-03-01 杭州电子科技大学 Multi-modal emotion classification method based on attention-guided bidirectional capsule network

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Tan Huadong: "Research on Cross-modal Generation and Synchronization Discrimination for Audio-visual Data", China Master's Theses Full-text Database, Information Science and Technology *

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115019397A (en) * 2022-06-15 2022-09-06 北京大学深圳研究生院 Comparison self-monitoring human behavior recognition method and system based on temporal-spatial information aggregation
CN115019397B (en) * 2022-06-15 2024-04-19 北京大学深圳研究生院 Method and system for identifying contrasting self-supervision human body behaviors based on time-space information aggregation
CN115100390A (en) * 2022-08-24 2022-09-23 华东交通大学 Image emotion prediction method combining contrast learning and self-supervision region positioning
CN115116448A (en) * 2022-08-29 2022-09-27 四川启睿克科技有限公司 Voice extraction method, neural network model training method, device and storage medium
CN115116448B (en) * 2022-08-29 2022-11-15 四川启睿克科技有限公司 Voice extraction method, neural network model training method, device and storage medium
WO2024087337A1 (en) * 2022-10-24 2024-05-02 深圳先进技术研究院 Method for directly synthesizing speech from tongue ultrasonic images
CN115620110A (en) * 2022-12-16 2023-01-17 华南理工大学 Video event positioning and identifying method, device and storage medium
CN115620110B (en) * 2022-12-16 2023-03-21 华南理工大学 Video event positioning and identifying method, device and storage medium
CN116824495A (en) * 2023-06-26 2023-09-29 华东交通大学 Dangerous behavior identification method, system, storage medium and computer equipment
CN117173394A (en) * 2023-08-07 2023-12-05 山东大学 Weak supervision salient object detection method and system for unmanned aerial vehicle video data
CN117173394B (en) * 2023-08-07 2024-04-02 山东大学 Weak supervision salient object detection method and system for unmanned aerial vehicle video data

Also Published As

Publication number Publication date
CN114329036B (en) 2022-07-05

Similar Documents

Publication Publication Date Title
CN114329036B (en) Cross-modal characteristic fusion system based on attention mechanism
Liu et al. Deep learning for generic object detection: A survey
CN108804453B (en) Video and audio recognition method and device
Lee et al. Multi-view automatic lip-reading using neural network
US20200004493A1 (en) Electronic apparatus, document displaying method thereof and non-transitory computer readable recording medium
CN112818861A (en) Emotion classification method and system based on multi-mode context semantic features
WO2020177673A1 (en) Video sequence selection method, computer device and storage medium
JP2023546173A (en) Facial recognition type person re-identification system
Zong et al. Emotion recognition in the wild via sparse transductive transfer linear discriminant analysis
US20220415023A1 (en) Model update method and related apparatus
CN113822125B (en) Processing method and device of lip language recognition model, computer equipment and storage medium
CN114519809A (en) Audio-visual video analysis device and method based on multi-scale semantic network
CN110991500A (en) Small sample multi-classification method based on nested integrated depth support vector machine
Agbo-Ajala et al. A lightweight convolutional neural network for real and apparent age estimation in unconstrained face images
CN115147641A (en) Video classification method based on knowledge distillation and multi-mode fusion
Islam et al. Representation for action recognition with motion vector termed as: SDQIO
Aliakbarian et al. Deep action-and context-aware sequence learning for activity recognition and anticipation
Chen et al. Dual-bottleneck feature pyramid network for multiscale object detection
Rastgoo et al. Word separation in continuous sign language using isolated signs and post-processing
US20220086401A1 (en) System and method for language-guided video analytics at the edge
Liu et al. A multimodal approach for multiple-relation extraction in videos
Afrasiabi et al. Spatial-temporal dual-actor CNN for human interaction prediction in video
de Souza et al. Building semantic understanding beyond deep learning from sound and vision
CN116958852A (en) Video and text matching method and device, electronic equipment and storage medium
CN115222047A (en) Model training method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant