CN111291204B - Multimedia data fusion method and device - Google Patents

Multimedia data fusion method and device

Info

Publication number
CN111291204B
CN111291204B (application CN201911259689.9A)
Authority
CN
China
Prior art keywords
multimedia data
data
vector
multimedia
feature vectors
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201911259689.9A
Other languages
Chinese (zh)
Other versions
CN111291204A (en)
Inventor
何志强
刘鑫
张继勇
庄浩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hebei Finance University
Original Assignee
Hebei Finance University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hebei Finance University filed Critical Hebei Finance University
Priority to CN201911259689.9A priority Critical patent/CN111291204B/en
Publication of CN111291204A publication Critical patent/CN111291204A/en
Application granted granted Critical
Publication of CN111291204B publication Critical patent/CN111291204B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/40Information retrieval; Database structures therefor; File system structures therefor of multimedia data, e.g. slideshows comprising image and additional audio data
    • G06F16/45Clustering; Classification

Abstract

An embodiment of the present application provides a multimedia data fusion method and device. The method includes: receiving multimedia data from a plurality of terminal devices, the data types of the multimedia data comprising at least two of: text, image, audio; identifying the multimedia data of each data type to obtain a feature vector of each multimedia data, the feature vector representing the features of that multimedia data; performing vector conversion on the feature vector of each multimedia data based on the relation between the feature vector and a preset conversion vector, so that the feature vectors of multimedia data of different data types lie in the same vector space; and clustering the multimedia data of different data types according to the converted feature vectors.

Description

Multimedia data fusion method and device
Technical Field
The present application relates to the field of computer technologies, and in particular, to a method and apparatus for multimedia data fusion.
Background
With the rapid development of information technology, large-scale multimedia data is generated in many dimensions. For example, video and picture data are captured by video cameras, text data is extracted from documents, and audio data is collected through event-tracking (buried-point) techniques. The same subject can be presented in many different forms of data whose high-level semantics are very similar but whose low-level features differ greatly across media; such data is strongly correlated. Correlated data of this kind can be applied in many ways, such as search, where a query about one item can retrieve related items in other media. For example, a celebrity's name can be used as a keyword in the Baidu search engine to find information about that celebrity, including photographs, personal data, speech audio, and video. Multimedia data fusion therefore becomes critical.
In existing data fusion technology, multimedia data is manually annotated with labels, and clustering is performed on those labels to fuse the multimedia data. This approach requires many annotators and reviewers and consumes a great deal of manpower. Moreover, because of the subjectivity of annotators and reviewers and the richness of semantic content, the labels are often insufficient to express the meaning of the data clearly and completely, which weakens the relevance among the multimedia data.
Disclosure of Invention
The embodiments of this specification provide a multimedia data fusion method and device to solve the problems of low efficiency and poor quality in multimedia data fusion caused by the need for manual annotation in the prior art.
In one aspect, an embodiment of the present application provides a multimedia data fusion method. The method includes: receiving multimedia data from each of a plurality of terminal devices, the data types of the multimedia data including at least two of: text, image, audio; identifying the multimedia data of each data type to obtain a feature vector of each multimedia data, the feature vector representing the features of that multimedia data; performing vector conversion on the feature vector of each multimedia data based on the relation between the feature vector and a preset conversion vector, so that the feature vectors of multimedia data of different data types lie in the same vector space; and clustering the multimedia data of different data types according to the converted feature vectors.
In one possible implementation, vector conversion is performed on the feature vector of each multimedia data based on the feature vector and the preset number of categories of multimedia data; the specific preset algorithm has the following (inferred) softmax form:
P(i) = exp(θ_i^T x) / Σ_{j=1}^{k} exp(θ_j^T x)
wherein k is the preset number of categories of multimedia data, θ_j is the feature vector of the j-th multimedia data, x is a preset conversion vector, T denotes transposition, and P(i) is the vector-converted feature vector.
In one possible implementation, clustering the multimedia data of different data types according to the vector-converted feature vectors specifically includes: determining, according to the converted feature vector of each multimedia data, whether multimedia data of different data types belong to one class; and clustering the multimedia data of different data types based on a preset clustering algorithm.
In one possible implementation, determining whether multimedia data of different data types belong to one class according to the vector-converted feature vectors specifically includes: calculating the Euclidean distance between the vector-converted feature vectors of multimedia data of different data types; and determining that the multimedia data of different data types belong to one class when the Euclidean distance is smaller than a preset threshold.
In one possible implementation, the data types of the multimedia data further include: video.
In one possible implementation manner, before respective identification is performed on the multimedia data of each data type, and a feature vector of each multimedia data is obtained, the method further includes: and respectively preprocessing the multimedia data with different data types.
On the other hand, the embodiment of the application also provides a multimedia data fusion device, which comprises: the receiving module is used for receiving the multimedia data from a plurality of terminal devices, and the data types of the multimedia data comprise at least two of the following: text, image, audio; the identification module is used for respectively carrying out corresponding identification on the multimedia data of each data type to obtain the feature vector of each multimedia data; wherein, the feature vector is used for representing the feature of each multimedia data; the vector conversion module is used for carrying out vector conversion on the feature vectors of the multimedia data based on the relation between the feature vectors of the multimedia data and the preset conversion vectors so as to enable the feature vectors of the multimedia data with different data types to be in the same vector space; and the clustering module is used for clustering the multimedia data with different data types according to the feature vectors of the converted multimedia data.
In one possible implementation, vector conversion is performed on the feature vector of each multimedia data based on the feature vector and the preset number of categories of multimedia data; the specific preset algorithm has the following (inferred) softmax form:
P(i) = exp(θ_i^T x) / Σ_{j=1}^{k} exp(θ_j^T x)
wherein k is the preset number of categories of multimedia data, θ_j is the feature vector of the j-th multimedia data, x is a preset conversion vector, T denotes transposition, and P(i) is the vector-converted feature vector.
In one possible implementation, the clustering module includes: a determining unit and a clustering unit; the determining unit is used for determining whether the multimedia data with different data types are of one type according to the feature vectors of the converted multimedia data; and the clustering unit is used for clustering the multimedia data of different data types based on a preset clustering algorithm.
In one possible implementation, the determining unit is specifically configured to: calculate the Euclidean distance between the vector-converted feature vectors of multimedia data of different data types; and determine that the multimedia data of different data types belong to one class when the Euclidean distance is smaller than a preset threshold.
According to the multimedia data fusion method and device provided by the embodiments of the present application, multimedia data of different data types can be classified through their feature vectors, and multimedia data of one class can be clustered. On the one hand, compared with manual labeling, this saves a great deal of manpower and material resources and is more objective. On the other hand, when fusing multimedia data, clustering through manual labeling is avoided, which further improves the efficiency and quality of data fusion and also improves user experience.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this specification, illustrate embodiments of the application and together with the description serve to explain the application and do not constitute a limitation on the application. In the drawings:
fig. 1 is a flowchart of a multimedia data fusion method according to an embodiment of the present application;
fig. 2 is a schematic structural diagram of a multimedia data fusion device according to an embodiment of the present application.
Detailed Description
To make the objects, technical solutions, and advantages of the present specification clearer, the technical solutions of the present application are described clearly and completely below with reference to specific embodiments of the specification and the corresponding drawings. It will be apparent that the described embodiments are only some, not all, embodiments of the application. All other embodiments obtained by those skilled in the art based on the embodiments in this specification without inventive effort fall within the scope of the application.
The following describes in detail the technical solutions provided by the embodiments of the present application with reference to the accompanying drawings.
Fig. 1 is a flowchart of a multimedia data fusion method according to an embodiment of the present application. As shown in fig. 1, the method includes the following steps:
s101, the server receives multimedia data from a plurality of terminal devices.
The data types of the multimedia data include at least two of: text, image, audio. In some embodiments of the present application, the data types further include video, which may be video with sound or video without audio.
The terminal device may be hardware or software. When the terminal device is hardware, it may be any of various electronic devices, such as a computer, a camera, or a scanner. When the terminal device is software, it can be installed in the electronic devices listed above. For example, when the terminal device is a video camera, the multimedia data received by the server is video data; when the terminal device is music software, the multimedia data received by the server is audio data; when the terminal device is a still camera, the multimedia data received by the server is image data.
S102, respectively preprocessing the multimedia data with different data types.
Preprocessing of text-type multimedia data (hereinafter, text data) may include, for example, case normalization, semantic disambiguation, and synonym substitution.
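As a rough illustration of this kind of text preprocessing (the patent names only the categories; the synonym table and helper below are hypothetical), case normalization and synonym substitution might look like:

```python
import re

# Hypothetical synonym table; the patent does not specify one.
SYNONYMS = {"photo": "picture", "image": "picture"}

def preprocess_text(text: str) -> str:
    """Normalize case, collapse whitespace, and substitute synonyms."""
    text = text.lower()                       # case normalization
    text = re.sub(r"\s+", " ", text).strip()  # collapse runs of whitespace
    words = [SYNONYMS.get(w, w) for w in text.split(" ")]
    return " ".join(words)

print(preprocess_text("A  Photo of\nthe IMAGE"))  # → "a picture of the picture"
```

Semantic disambiguation would require a language model and is omitted from this sketch.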
Preprocessing of image-type multimedia data (hereinafter, image data) may discard low-quality images as appropriate, such as blurred images and images whose scenes are overly complex.
The preprocessing of text data and image data is not limited to the above methods; other methods may also be used, for example, editing the image data (e.g., with image-editing software such as Photoshop) to increase its resolution.
For preprocessing audio-type multimedia data (hereinafter, audio data), noise reduction processing may be performed on the audio data to reduce the influence of noise.
The preprocessing of multimedia data of video type (hereinafter referred to as video data) may be performed by generating a combined picture of the video data from a sequence of video frames of the video data, and processing the combined picture according to a preprocessing method of the image data.
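A minimal sketch of building a combined picture from a frame sequence (frames are treated as equal-sized 2D grayscale arrays, and the row-major grid layout is an assumption — the patent does not specify how frames are combined):

```python
def combine_frames(frames, cols):
    """Tile equally sized 2D frames into one combined picture, row-major."""
    h = len(frames[0])
    # Pad with blank frames so the grid is completely filled.
    blank = [[0] * len(frames[0][0]) for _ in range(h)]
    while len(frames) % cols:
        frames = frames + [blank]
    combined = []
    for r in range(0, len(frames), cols):
        row_frames = frames[r:r + cols]
        for y in range(h):
            # Concatenate row y of each frame in this grid row.
            combined.append([px for f in row_frames for px in f[y]])
    return combined

frames = [[[i] * 2 for _ in range(2)] for i in range(3)]  # three 2x2 frames
grid = combine_frames(frames, cols=2)  # 2x2 grid; the last slot is blank
```

The resulting combined picture can then go through the same preprocessing path as ordinary image data.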
It should be noted that the server may send request information to the corresponding terminal devices, and each terminal device sends its multimedia data to the server based on the received request information.
S103, respectively carrying out corresponding identification on the multimedia data of each data type to obtain the feature vector of each multimedia data.
The feature vector here is a vector used to represent the features of multimedia data. For example, the feature vector corresponding to image-type multimedia data is an image feature vector used to represent features such as shapes in the image.
For text data, the feature vector can be obtained through a preset text feature extraction model. The text feature extraction model may be a pre-trained neural network model, such as a BERT model. BERT training is divided into two steps: pre-training and fine-tuning. Pre-training is independent of downstream tasks but is very time-consuming and costly; therefore, an open-source pre-trained model can be called instead of repeating this process. The pre-trained model encapsulates prior knowledge of the language and, once obtained, need not be rebuilt. A network extension architecture fine-tuned to a specific downstream task may then be employed. Overall, fine-tuning BERT is a lightweight task; the main tuning targets the extension network rather than BERT itself. Furthermore, one important role of the BERT model is to generate contextual word vectors, which can solve the polysemy problem that the word2vec model cannot.
For image data, the feature vector can be obtained through an image feature extraction model, which is a neural network model, for example, a fairly classical deep convolutional neural network combined with pooling layers. Because an image is a large signal source, the network's parameters are huge; to reduce the amount of training computation, pooling layers further abstract the outputs of the convolutional layers, reducing the number of weights to be trained while also preventing overfitting.
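The pooling step described above can be illustrated in isolation. A 2x2 max-pooling pass (a sketch, not the patent's actual network) reduces a feature map to a quarter of its size, which is why it cuts the downstream weight count:

```python
def max_pool_2x2(fmap):
    """2x2 max pooling with stride 2 over a 2D feature map."""
    return [
        [max(fmap[i][j], fmap[i][j + 1], fmap[i + 1][j], fmap[i + 1][j + 1])
         for j in range(0, len(fmap[0]), 2)]
        for i in range(0, len(fmap), 2)
    ]

fmap = [[1, 3, 2, 0],
        [4, 2, 1, 1],
        [0, 1, 5, 6],
        [2, 2, 7, 8]]
print(max_pool_2x2(fmap))  # → [[4, 2], [2, 8]]
```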
For audio data, the feature vector can be obtained directly through a corresponding audio feature extraction model; alternatively, the audio data can be converted into text data, which is then input into a corresponding text feature extraction model to obtain the feature vector.
For video data, the feature vector can be obtained directly through a corresponding video feature extraction model; alternatively, a combined picture can be generated from the video frame sequence and input into a corresponding image feature extraction model to obtain the feature vector of the video data.
The audio feature extraction model and the video feature extraction model are all pre-trained neural network models.
It should be noted that, the feature vector of the multimedia data may be obtained not only through a corresponding model, but also through other algorithms, which is not limited in the embodiment of the present application.
And S104, carrying out vector conversion on the feature vectors of the multimedia data based on the relation between the feature vectors of the multimedia data and the preset conversion vectors so that the feature vectors of the multimedia data with different data types are in the same vector space.
The preset conversion vector may be obtained by learning through a neural network model.
Because the data types of the multimedia data differ, whether multimedia data of different data types belong to one class cannot be determined directly from their feature vectors.
Therefore, in some embodiments of the present application, the feature vectors of each multimedia data may be subjected to vector conversion according to a preset algorithm, so that the feature vectors of the multimedia data of different data types are in the same vector space.
In some embodiments of the present application, vector conversion is performed on the feature vector of each multimedia data based on the feature vector and the preset number of categories of multimedia data; the specific formula has the following (inferred) softmax form:
P(i) = exp(θ_i^T x) / Σ_{j=1}^{k} exp(θ_j^T x)
wherein k is the preset number of categories of multimedia data, θ_j is the feature vector of the j-th multimedia data, x is a preset conversion vector, T denotes transposition, and P(i) is the vector-converted feature vector.
The value of k may be a user-defined parameter.
Through the formula, the characteristic vectors of the multimedia data with different data types can be converted into vectors in the same vector space.
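Under the assumption that the preset algorithm is the softmax suggested by the variable definitions (k feature vectors θ, a conversion vector x), the conversion step can be sketched as:

```python
import math

def convert(theta, x):
    """Map feature vectors theta[0..k-1] into a shared probability-like
    space via a softmax of their inner products with conversion vector x."""
    scores = [sum(t * xi for t, xi in zip(th, x)) for th in theta]  # theta_j^T x
    m = max(scores)                        # subtract the max for stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]           # P(i), summing to 1

theta = [[1.0, 0.0], [0.0, 1.0]]  # feature vectors for k = 2 items
p = convert(theta, x=[2.0, 0.0])  # converted vector in the shared space
```

The max-subtraction is a standard numerical-stability trick and does not change the result.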
S105, determining whether the multimedia data with different data types are of one type according to the feature vector of each converted multimedia data.
Specifically, the Euclidean distance between the vector-converted feature vectors of multimedia data of different data types is calculated; when the Euclidean distance is smaller than a preset threshold, the multimedia data of different data types are determined to belong to one class.
For example, the Euclidean distance between the converted feature vector of a text item and the converted feature vector of an image item is calculated, and the text data and the image data are determined to belong to one class when that distance is smaller than a preset threshold.
For another example, where a text item and an image item already belong to one class, if the Euclidean distance between the converted feature vector of the text item and that of an audio item is also smaller than the preset threshold, the text, image, and audio data are all determined to belong to one class.
The preset threshold value may be set in advance, or may be adjusted in real time according to actual situations.
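The distance test in S105 can be sketched as follows (the vectors and the threshold value are illustrative, not from the patent):

```python
import math

def same_class(v1, v2, threshold=0.5):
    """Return True when the Euclidean distance between two converted
    feature vectors falls below the preset threshold."""
    dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(v1, v2)))
    return dist < threshold

text_vec  = [0.20, 0.70, 0.10]   # converted feature vector of a text item
image_vec = [0.25, 0.65, 0.10]   # converted feature vector of an image item
print(same_class(text_vec, image_vec))  # → True
```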
S106, based on a preset clustering algorithm, clustering the multimedia data of different data types.
In the embodiment of the application, the multimedia data of different data types can be clustered through a preset clustering algorithm, such as a k-means clustering algorithm.
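A minimal k-means pass over the converted vectors (a plain sketch; the patent only names k-means as one possible clustering algorithm):

```python
import math
import random

def kmeans(points, k, iters=20, seed=0):
    """Plain k-means: returns a cluster index for each point."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        # Assign each point to its nearest center (Euclidean distance).
        labels = [min(range(k), key=lambda c: math.dist(p, centers[c]))
                  for p in points]
        # Recompute each center as the mean of its members.
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                centers[c] = [sum(d) / len(members) for d in zip(*members)]
    return labels

points = [[0.1, 0.1], [0.2, 0.1], [0.9, 0.8], [1.0, 0.9]]
labels = kmeans(points, k=2)  # two well-separated groups of two
```

In practice a library implementation (e.g., scikit-learn's KMeans) would be used instead of this sketch.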
Based on the above scheme, the multimedia data fusion method provided by the embodiments of the present application can determine, through the feature vector of each multimedia data, whether multimedia data of different data types belong to one class, and cluster those that do, thereby realizing the fusion of multimedia data. On the one hand, compared with manual labeling, this saves a great deal of manpower and material resources and is more objective. On the other hand, when fusing multimedia data, clustering through manual labeling is avoided, which further improves the efficiency and quality of data fusion and also improves user experience.
Based on the same thought, some embodiments of the present application further provide a device corresponding to the above method.
Fig. 2 is a schematic structural diagram of a multimedia data fusion device according to an embodiment of the present application. As shown in fig. 2, the apparatus 200 includes: a receiving module 210, an identifying module 220, a vector conversion module 230, and a clustering module 240.
The receiving module 210 is configured to receive multimedia data from a plurality of terminal devices, where data types of the multimedia data include at least two of the following: text, images, audio. The identifying module 220 is configured to identify the multimedia data of each data type respectively, so as to obtain a feature vector of each multimedia data; wherein the feature vector is used for representing the features of each multimedia data. The vector conversion module 230 is configured to perform vector conversion on the feature vectors of each multimedia data based on a relationship between the feature vectors of each multimedia data and a preset conversion vector, so that the feature vectors of the multimedia data with different data types are in the same vector space. The clustering module 240 is configured to cluster the multimedia data with different data types according to the feature vector of each of the converted multimedia data.
In one possible implementation, vector conversion is performed on the feature vector of each multimedia data based on the feature vector and the preset number of categories of multimedia data; the specific preset algorithm has the following (inferred) softmax form:
P(i) = exp(θ_i^T x) / Σ_{j=1}^{k} exp(θ_j^T x)
wherein k is the preset number of categories of multimedia data, θ_j is the feature vector of the j-th multimedia data, x is a preset conversion vector, T denotes transposition, and P(i) is the vector-converted feature vector.
In one possible implementation, the clustering module 240 includes: a determining unit (not shown in the figure) and a clustering unit (not shown in the figure). And the determining unit is used for determining whether the multimedia data with different data types are of one type according to the feature vector of each converted multimedia data. And the clustering unit is used for clustering the multimedia data of different data types based on a preset clustering algorithm.
In one possible implementation, the determining unit is specifically configured to: calculate the Euclidean distance between the vector-converted feature vectors of multimedia data of different data types; and determine that the multimedia data of different data types belong to one class when the Euclidean distance is smaller than a preset threshold.
The embodiments of the present application are described in a progressive manner, and the same and similar parts of the embodiments are all referred to each other, and each embodiment is mainly described in the differences from the other embodiments. In particular, for the apparatus embodiments, since they are substantially similar to the method embodiments, the description is relatively simple, and reference is made to the description of the method embodiments in part.
The devices provided in the embodiments of the present application correspond one-to-one with the methods, so the devices have beneficial technical effects similar to those of the corresponding methods; since those effects have been described in detail above, they are not repeated here.
The foregoing is merely exemplary of the present application and is not intended to limit the present application. Various modifications and variations of the present application will be apparent to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application should be included in the scope of the claims of the present application.

Claims (4)

1. A method of multimedia data fusion, the method comprising:
receiving multimedia data from a plurality of terminal devices, wherein the data types of the multimedia data comprise at least two of the following: text, image, audio;
respectively carrying out corresponding identification on the multimedia data of each data type to obtain feature vectors of each multimedia data, wherein the feature vectors are used for representing the features of each multimedia data;
based on the relation between the feature vector of each multimedia data and a preset conversion vector, performing vector conversion on the feature vector of each multimedia data so that the feature vectors of multimedia data of different data types are in the same vector space, where the specific formula has the following (inferred) softmax form:
P(i) = exp(θ_i^T x) / Σ_{j=1}^{k} exp(θ_j^T x)
wherein k is the preset number of categories of multimedia data, θ_j is the feature vector of the j-th multimedia data, x is a preset conversion vector, T denotes transposition, and P(i) is the vector-converted feature vector;
clustering the multimedia data with different data types according to the feature vector of each converted multimedia data, wherein the clustering comprises the following steps:
determining, according to the converted feature vector of each multimedia data, whether multimedia data of different data types belong to one class; specifically, calculating the Euclidean distance between the converted feature vectors of multimedia data of different data types, and determining that the multimedia data of different data types belong to one class when the Euclidean distance is smaller than a preset threshold;
based on a preset clustering algorithm, clustering the multimedia data of different data types.
2. The method of claim 1, wherein the data type of the multimedia data further comprises: video.
3. The method of claim 1, wherein before the respective identification of the multimedia data of each data type is performed to obtain the feature vector of each multimedia data, the method further comprises:
and respectively preprocessing the multimedia data with different data types.
4. A multimedia data fusion device, the device comprising:
a receiving module, configured to receive multimedia data from a plurality of terminal devices, where data types of the multimedia data include at least two of the following: text, image, audio;
the identification module is used for respectively carrying out corresponding identification on the multimedia data of each data type to obtain the feature vector of each multimedia data; wherein the feature vector is used for representing the features of each multimedia data;
the vector conversion module is used for performing vector conversion on the feature vector of each multimedia data based on the relation between the feature vector and a preset conversion vector, so that the feature vectors of multimedia data of different data types are in the same vector space, where the specific formula has the following (inferred) softmax form:
P(i) = exp(θ_i^T x) / Σ_{j=1}^{k} exp(θ_j^T x)
wherein k is the preset number of categories of multimedia data, θ_j is the feature vector of the j-th multimedia data, x is a preset conversion vector, T denotes transposition, and P(i) is the vector-converted feature vector;
the clustering module is used for clustering the multimedia data of different data types according to the feature vectors of the converted multimedia data;
the clustering module comprises: a determining unit and a clustering unit;
the determining unit is configured to determine, according to the converted feature vector of each multimedia data, whether multimedia data of different data types belong to one class; specifically, to calculate the Euclidean distance between the converted feature vectors of multimedia data of different data types, and to determine that the multimedia data of different data types belong to one class when the Euclidean distance is smaller than a preset threshold;
the clustering unit is used for clustering the multimedia data of different data types based on a preset clustering algorithm.
CN201911259689.9A 2019-12-10 2019-12-10 Multimedia data fusion method and device Active CN111291204B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911259689.9A CN111291204B (en) 2019-12-10 2019-12-10 Multimedia data fusion method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911259689.9A CN111291204B (en) 2019-12-10 2019-12-10 Multimedia data fusion method and device

Publications (2)

Publication Number Publication Date
CN111291204A CN111291204A (en) 2020-06-16
CN111291204B true CN111291204B (en) 2023-08-29

Family

ID=71021287

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911259689.9A Active CN111291204B (en) 2019-12-10 2019-12-10 Multimedia data fusion method and device

Country Status (1)

Country Link
CN (1) CN111291204B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6086006A (en) * 1996-10-29 2000-07-11 Scerbo, III; Frank C. Evidence maintaining tape recording reels and cassettes
CN101021849A (en) * 2006-09-14 2007-08-22 浙江大学 Transmedia searching method based on content correlation
CN103440292A (en) * 2013-08-16 2013-12-11 新浪网技术(中国)有限公司 Method and system for retrieving multimedia information based on bit vector
CN104182421A (en) * 2013-05-27 2014-12-03 华东师范大学 Video clustering method and detecting method
CN104679902A (en) * 2015-03-20 2015-06-03 湘潭大学 Information abstract extraction method combined with cross-media fusion
CN110209844A (en) * 2019-05-17 2019-09-06 腾讯音乐娱乐科技(深圳)有限公司 Multi-medium data matching process, device and storage medium


Also Published As

Publication number Publication date
CN111291204A (en) 2020-06-16

Similar Documents

Publication Publication Date Title
CN107491534B (en) Information processing method and device
US11288444B2 (en) Optimization techniques for artificial intelligence
US20220121906A1 (en) Task-aware neural network architecture search
CN110532554A (en) A kind of Chinese abstraction generating method, system and storage medium
CN110619051B (en) Question sentence classification method, device, electronic equipment and storage medium
CN114298158A (en) Multi-mode pre-training method based on image-text linear combination
CN112883731B (en) Content classification method and device
CN106354856B (en) Artificial intelligence-based deep neural network enhanced search method and device
CN109885796B (en) Network news matching detection method based on deep learning
CN112307164A (en) Information recommendation method and device, computer equipment and storage medium
CN115292470B (en) Semantic matching method and system for intelligent customer service of petty loan
CN111723295A (en) Content distribution method, device and storage medium
CN110659392B (en) Retrieval method and device, and storage medium
CN113486174B (en) Model training, reading understanding method and device, electronic equipment and storage medium
CN111460224B (en) Comment data quality labeling method, comment data quality labeling device, comment data quality labeling equipment and storage medium
CN111291204B (en) Multimedia data fusion method and device
CN115599953A (en) Training method and retrieval method of video text retrieval model and related equipment
CN114782752B (en) Small sample image integrated classification method and device based on self-training
CN115623134A (en) Conference audio processing method, device, equipment and storage medium
CN112749556B (en) Multi-language model training method and device, storage medium and electronic equipment
CN114973086A (en) Video processing method and device, electronic equipment and storage medium
CN114842301A (en) Semi-supervised training method of image annotation model
Jiao et al. Realization and improvement of object recognition system on raspberry pi 3b+
CN110472140B (en) Object word recommendation method and device and electronic equipment
CN114357111A (en) Policy association influence analysis method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20230803

Address after: No.3188 Hengxiang North Street, Baoding City, Hebei Province 071051

Applicant after: Hebei Finance University

Address before: No.3188 Hengxiang North Street, Baoding City, Hebei Province 071051

Applicant before: Hebei Finance University

Applicant before: HUARUI XINZHI TECHNOLOGY (BEIJING) Co.,Ltd.

GR01 Patent grant