CN111291204A - Multimedia data fusion method and device - Google Patents

Multimedia data fusion method and device Download PDF

Info

Publication number
CN111291204A
CN111291204A
Authority
CN
China
Prior art keywords
multimedia data
data
vector
feature
types
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201911259689.9A
Other languages
Chinese (zh)
Other versions
CN111291204B (en)
Inventor
何志强
刘鑫
张继勇
庄浩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hebei Finance University
Original Assignee
Huarui Xinzhi Technology Beijing Co ltd
Hebei Finance University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huarui Xinzhi Technology Beijing Co ltd, Hebei Finance University filed Critical Huarui Xinzhi Technology Beijing Co ltd
Priority to CN201911259689.9A priority Critical patent/CN111291204B/en
Publication of CN111291204A publication Critical patent/CN111291204A/en
Application granted granted Critical
Publication of CN111291204B publication Critical patent/CN111291204B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/40Information retrieval; Database structures therefor; File system structures therefor of multimedia data, e.g. slideshows comprising image and additional audio data
    • G06F16/45Clustering; Classification

Abstract

Embodiments of the present application provide a multimedia data fusion method and device. The method comprises the following steps: receiving multimedia data from a plurality of terminal devices, where the data types of the multimedia data include at least two of the following: text, image, audio; performing corresponding recognition on the multimedia data of each data type to obtain a feature vector of each piece of multimedia data, where the feature vector represents the features of that piece of multimedia data; performing vector conversion on the feature vectors of the multimedia data based on the relationship between each feature vector and a preset conversion vector, so that the feature vectors of multimedia data of different data types lie in the same vector space; and clustering the multimedia data of different data types according to the converted feature vectors.

Description

Multimedia data fusion method and device
Technical Field
The present application relates to the field of computer technologies, and in particular, to a multimedia data fusion method and device.
Background
With the rapid development of information technology, large-scale multimedia data is generated across multiple dimensions. For example, video and picture data are obtained from cameras, text data is obtained from documents, and audio data is obtained through event-tracking ("buried point") instrumentation. Many different forms of data express the same theme: their high-level semantics are very similar, but their underlying features differ greatly across media, and such data are strongly correlated. Correlated data of this kind can be applied in many scenarios, such as search, where any one item can be linked to other related items. For example, a celebrity's name can be entered as a keyword in the Baidu search engine to retrieve information about that celebrity, including photographs, personal data, lecture audio, videos, and so on. Multimedia data fusion therefore becomes crucial.
In existing data fusion technology, corresponding labels are often assigned to multimedia data by manual annotation, and the multimedia data are clustered according to these labels to realize fusion. On one hand, this method requires a large number of annotators and auditors and consumes a great deal of manpower. On the other hand, owing to the subjectivity of annotators and auditors and the richness of semantic content, the labels used to annotate multimedia data cannot express the meaning of the data sufficiently clearly and completely, so the established relevance between multimedia data is weak.
Disclosure of Invention
Embodiments of this specification provide a multimedia data fusion method and device, which are used to solve the problems of low efficiency, poor quality, and the like in multimedia data fusion caused by the need for manual labeling in the prior art.
In one aspect, an embodiment of the present application provides a multimedia data fusion method, where the method includes: receiving multimedia data from each of a plurality of terminal devices, the data types of the multimedia data including at least two of the following: text, image, audio; performing corresponding recognition on the multimedia data of each data type to obtain a feature vector of each piece of multimedia data, where the feature vector represents the features of that piece of multimedia data; performing vector conversion on the feature vectors of the multimedia data based on the relationship between the feature vectors and preset conversion vectors, so that the feature vectors of multimedia data of different data types are in the same vector space; and clustering the multimedia data of different data types according to the converted feature vectors.
In a possible implementation manner, based on the feature vector of each multimedia data and the number of preset multimedia data categories, vector conversion is performed on the feature vector of each multimedia data, and a specific preset algorithm is shown in the following formula:
P(i) = exp(θ_i^T x) / Σ_k exp(θ_k^T x)

where k is the number of preset multimedia data categories (the summation runs over all k categories), θ_k is the feature vector of the k-th multimedia data, x is the preset conversion vector, T denotes transposition, and P(i) is the feature vector after vector conversion.
In a possible implementation manner, clustering multimedia data of different data types according to the converted feature vector of each multimedia data specifically includes: determining whether multimedia data of different data types belong to one type according to the converted feature vector of each multimedia data; and clustering the multimedia data of different data types that belong to one type based on a preset clustering algorithm.
In a possible implementation manner, determining whether multimedia data of different data types belong to one type according to the converted feature vector of each multimedia data specifically includes: calculating the Euclidean distance between the converted feature vectors of multimedia data of different data types; and determining the multimedia data of different data types to be one type when the Euclidean distance is smaller than a preset threshold.
In one possible implementation, the data types of the multimedia data further include: video.
In a possible implementation manner, before performing corresponding identification on the multimedia data of each data type to obtain the feature vector of each multimedia data, the method further includes: respectively carrying out corresponding preprocessing on the multimedia data of different data types.
On the other hand, an embodiment of the present application further provides a multimedia data fusion device, which includes: the receiving module is used for receiving multimedia data from a plurality of terminal devices, and the data types of the multimedia data comprise at least two of the following types: text, images, audio; the identification module is used for correspondingly identifying the multimedia data of each data type to obtain the characteristic vector of each multimedia data; wherein, the feature vector is used for representing the feature of each multimedia data; the vector conversion module is used for performing vector conversion on the feature vectors of the multimedia data based on the relationship between the feature vectors of the multimedia data and the preset conversion vectors so as to enable the feature vectors of the multimedia data of different data types to be in the same vector space; and the clustering module is used for clustering the multimedia data of different data types according to the converted feature vectors of the multimedia data.
In a possible implementation manner, based on the feature vector of each multimedia data and the number of preset multimedia data categories, vector conversion is performed on the feature vector of each multimedia data, and a specific preset algorithm is shown in the following formula:
P(i) = exp(θ_i^T x) / Σ_k exp(θ_k^T x)

where k is the number of preset multimedia data categories (the summation runs over all k categories), θ_k is the feature vector of the k-th multimedia data, x is the preset conversion vector, T denotes transposition, and P(i) is the feature vector after vector conversion.
In one possible implementation, the clustering module includes: a determining unit and a clustering unit. The determining unit is configured to determine whether multimedia data of different data types belong to one type according to the converted feature vector of each multimedia data; the clustering unit is configured to cluster the multimedia data of different data types that belong to one type based on a preset clustering algorithm.
In a possible implementation manner, the determining unit is specifically configured to: calculate the Euclidean distance between the converted feature vectors of multimedia data of different data types; and determine the multimedia data of different data types to be one type when the Euclidean distance is smaller than a preset threshold.
According to the multimedia data fusion method and device provided by the embodiments of the present application, multimedia data of different data types can be classified through their feature vectors, and multimedia data of one type can be clustered. On one hand, compared with classification by manual labeling, a large amount of manpower and material resources can be saved, and the classification is objective. On the other hand, when fusing multimedia data, clustering by manual labeling is avoided, so the efficiency and quality of data fusion are further improved, and user experience is improved as well.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the application and together with the description serve to explain the application and not to limit the application. In the drawings:
fig. 1 is a flowchart of a multimedia data fusion method according to an embodiment of the present application;
fig. 2 is a schematic structural diagram of a multimedia data fusion device according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present disclosure more apparent, the technical solutions of the present disclosure will be clearly and completely described below with reference to the specific embodiments of the present disclosure and the accompanying drawings. It should be apparent that the described embodiments are only some of the embodiments of the present application, and not all of the embodiments. All other embodiments obtained by a person skilled in the art without making any inventive step based on the embodiments in the description belong to the protection scope of the present application.
The technical solutions provided by the embodiments of the present application are described in detail below with reference to the accompanying drawings.
Fig. 1 is a flowchart of a multimedia data fusion method according to an embodiment of the present application. As shown in fig. 1, the data processing method includes the steps of:
s101, the server receives multimedia data from a plurality of terminal devices.
The data types of the multimedia data include at least two of the following: text, image, audio. In some embodiments of the present application, the data types of the multimedia data further include video, which may be video with or without audio.
The terminal device may be hardware or software. When the terminal device is hardware, it may be various electronic devices such as a computer, a camera, a scanner, and the like. When the terminal device is software, the software can be installed in the electronic devices listed above. For example, when the terminal device is a video camera, the multimedia data received by the server is video data; when the terminal equipment is music software, the multimedia data received by the server is audio data; when the terminal device is a camera, the multimedia data received by the server is image data.
S102, respectively preprocessing the multimedia data with different data types.
Text-type multimedia data (hereinafter referred to as text data) can be preprocessed by data cleaning, for example case normalization with regular expressions, semantic disambiguation, and synonym replacement.
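As a rough sketch of this cleaning step (the function name, synonym table, and rules below are illustrative assumptions, not taken from the patent):

```python
import re

def preprocess_text(text, synonyms=None):
    """Minimal text-cleaning sketch: case normalization, whitespace
    collapsing via a regular expression, and naive word-for-word
    synonym replacement standing in for the semantic steps."""
    synonyms = synonyms or {}
    text = text.lower()                       # normalize case
    text = re.sub(r"\s+", " ", text).strip()  # collapse whitespace
    words = [synonyms.get(w, w) for w in text.split(" ")]
    return " ".join(words)

print(preprocess_text("The  FAMOUS   Singer", {"famous": "well-known"}))
```

A real pipeline would add context-aware disambiguation; this only shows where such rules would plug in.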
Preprocessing image-type multimedia data (hereinafter referred to as image data) may appropriately discard low-quality images, for example blurred images or images with highly complex scenes.
The preprocessing of text data and image data is not limited to the above methods; other methods may also be used. For example, image data may be retouched to improve image resolution.
Audio-type multimedia data (hereinafter referred to as audio data) may be preprocessed by noise reduction to reduce the influence of noise.
For preprocessing video-type multimedia data (hereinafter referred to as video data), a composite image may be generated from the video frame sequence of the video data and then processed according to the image-data preprocessing method.
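The frame-combination idea can be sketched by tiling sampled frames (represented here as plain 2-D lists of pixel values) side by side; the tiling layout and function name are illustrative assumptions:

```python
def composite_frames(frames, cols=2):
    """Tile video frames (2-D lists of pixel values) into one
    composite image, placing `cols` frames per tile row."""
    composite = []
    for start in range(0, len(frames), cols):
        group = frames[start:start + cols]
        for y in range(len(group[0])):
            row = []
            for frame in group:
                row.extend(frame[y])  # concatenate pixel rows
            composite.append(row)
    return composite

# two 2x2 "frames" tiled side by side into a 2x4 composite
grid = composite_frames([[[1, 2], [3, 4]], [[5, 6], [7, 8]]])
```

The composite can then be fed to whatever image preprocessing is in use.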
It should be noted that the server may send request information to the corresponding terminal device, and the terminal device sends the corresponding multimedia data to the server based on the received request information.
S103, respectively carrying out corresponding identification on the multimedia data of each data type to obtain the feature vector of each multimedia data.
The feature vector referred to here is a vector representing the features of the multimedia data. For example, the feature vector corresponding to image-type multimedia data is an image feature vector, which represents features such as shapes in the image.
For text data, the feature vector can be obtained through a preset text feature extraction model. The text feature extraction model may be a pre-trained neural network model, such as the BERT model. Training BERT involves two stages: pre-training and fine-tuning. Pre-training is independent of downstream tasks but is very time-consuming and expensive; an open-source pre-trained model can therefore be called instead of repeating this process. Such a model summarizes prior knowledge of the language and, once obtained, need not be rebuilt. A network-extension architecture can then be adopted to fine-tune for the specific downstream task. In general, fine-tuning BERT is a lightweight task: it mainly adjusts the extension network rather than BERT itself. Furthermore, an important role of the BERT model is to generate word vectors, which can resolve the polysemy problem that the word2vec model cannot.
For image data, feature vectors can be obtained through an image feature extraction model, which is a neural network model — for example, a classic deep convolutional neural network combined with pooling layers. Because images are large signal sources, the network has a huge number of parameters; to reduce the training computation, pooling layers further abstract the outputs of the convolution layers, reduce the number of weights to be trained, and also help prevent overfitting.
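A minimal 2x2 max-pooling pass illustrates how pooling abstracts a convolution output and shrinks the number of downstream weights (a generic example, not the patent's specific network):

```python
def max_pool_2x2(feature_map):
    """2x2 max pooling with stride 2 over a 2-D feature map (list of
    lists): each output value is the maximum of a 2x2 window, so the
    map's height and width are both halved."""
    pooled = []
    for i in range(0, len(feature_map) - 1, 2):
        row = []
        for j in range(0, len(feature_map[0]) - 1, 2):
            window = (feature_map[i][j], feature_map[i][j + 1],
                      feature_map[i + 1][j], feature_map[i + 1][j + 1])
            row.append(max(window))
        pooled.append(row)
    return pooled
```

A 4x4 map thus becomes 2x2, quartering the activations the next layer must consume.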
For audio data, the feature vector can be obtained directly through a corresponding audio feature extraction model; alternatively, the audio data can be converted into text data, and the feature vector obtained by inputting that text data into a corresponding text feature extraction model.
For video data, the feature vector can be obtained directly through a corresponding video feature extraction model; alternatively, a composite image can be generated from the video frame sequence and input into a corresponding image feature extraction model to obtain the feature vector of the video data.
Both the audio feature extraction model and the video feature extraction model are pre-trained neural network models.
It should be noted that the feature vector of the multimedia data can be obtained not only by the corresponding model, but also by other algorithms, which is not limited in the embodiment of the present application.
And S104, performing vector conversion on the feature vectors of the multimedia data based on the relationship between the feature vectors of the multimedia data and the preset conversion vectors so as to enable the feature vectors of the multimedia data of different data types to be in the same vector space.
The preset transformation vector may be obtained by learning through a neural network model.
Because the data types of the multimedia data differ, it cannot be directly determined from the corresponding feature vectors whether multimedia data of different data types belong to one type.
Therefore, in some embodiments of the present application, the feature vectors of the multimedia data may be vector-converted according to a preset algorithm, so that the feature vectors of the multimedia data of different data types are in the same vector space.
In some embodiments of the present application, vector conversion is performed on the feature vector of each multimedia data based on the feature vector of each multimedia data and the number of preset categories of multimedia data, and the following formula is specifically shown:
P(i) = exp(θ_i^T x) / Σ_k exp(θ_k^T x)

where k is the number of preset multimedia data categories (the summation runs over all k categories), θ_k is the feature vector of the k-th multimedia data, x is the preset conversion vector, T denotes transposition, and P(i) is the feature vector after vector conversion.
The value of k may be a user-defined parameter.
Through the formula, the feature vectors of the multimedia data with different data types can be converted into the vectors of the same vector space.
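Assuming the preset algorithm is a softmax-style mapping over the scores θ_k^T x (an assumption consistent with the symbol definitions, since the patent's formula appears only as a figure), the conversion could be sketched as:

```python
import math

def softmax_convert(theta, x):
    """Map each score theta_k . x to a component of P via a softmax.
    `theta` is a list of k feature vectors and `x` the preset
    conversion vector; this is an illustrative reconstruction, not
    the patent's verified formula."""
    scores = [sum(t * xi for t, xi in zip(theta_k, x)) for theta_k in theta]
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

p = softmax_convert([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])
```

Because the outputs are positive and sum to 1, vectors from any modality land in the same bounded space, which is what makes the later distance comparison meaningful.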
And S105, determining whether the multimedia data of different data types are of one type or not according to the converted feature vectors of the multimedia data.
Specifically, the Euclidean distance between the converted feature vectors of multimedia data of different data types is calculated;
and when the Euclidean distance is smaller than a preset threshold, the multimedia data of different data types are determined to be one type.
For example, the Euclidean distance between the converted feature vector of a piece of text data and that of a piece of image data is calculated; if this distance is smaller than the preset threshold, the text data and the image data are determined to be one type.
For another example, after a piece of text data and a piece of image data have been determined to be one type, if the Euclidean distance between the converted feature vector of the text data and that of a piece of audio data is also smaller than the preset threshold, then the text data, the image data, and the audio data are all determined to be one type.
It should be noted that the preset threshold may be set in advance, or may be adjusted in real time according to actual conditions.
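The distance test of S105 can be sketched as follows (the threshold value is arbitrary, and `same_type` is a hypothetical helper name):

```python
import math

def same_type(vec_a, vec_b, threshold):
    """Return True when the Euclidean distance between two converted
    feature vectors falls below the preset threshold."""
    dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(vec_a, vec_b)))
    return dist < threshold

print(same_type([0.0, 0.0], [3.0, 4.0], 5.1))  # distance here is 5.0
```

In practice the threshold would be tuned, or adjusted in real time as the text above notes.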
And S106, clustering the multimedia data of different data types based on a preset clustering algorithm.
In the embodiment of the present application, multimedia data of different data types that belong to one type can be clustered through a preset clustering algorithm, such as the k-means clustering algorithm.
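A bare-bones k-means pass over the converted feature vectors might look like this sketch (fixed initial centers are an assumption made for determinism; a production version would initialize them randomly):

```python
import math

def kmeans(points, centers, iters=10):
    """Naive k-means: assign each point to its nearest center, then
    move each center to the mean of its assigned points."""
    clusters = [[] for _ in centers]
    for _ in range(iters):
        clusters = [[] for _ in centers]
        for p in points:
            dists = [math.dist(p, c) for c in centers]
            clusters[dists.index(min(dists))].append(p)
        centers = [
            [sum(dim) / len(cluster) for dim in zip(*cluster)]
            if cluster else c  # keep a center whose cluster went empty
            for cluster, c in zip(clusters, centers)
        ]
    return centers, clusters

centers, clusters = kmeans([[0, 0], [0, 1], [10, 10], [10, 11]],
                           [[0.0, 0.0], [10.0, 10.0]])
```

Each cluster then groups multimedia data items of mixed types whose converted vectors sit close together.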
Based on the above scheme, the multimedia data fusion method provided in the embodiments of the present application can determine, through the feature vector of each multimedia data, whether multimedia data of different data types belong to one type, and cluster the multimedia data of different data types that belong to one type, thereby realizing multimedia data fusion. On one hand, compared with classification by manual labeling, a large amount of manpower and material resources can be saved, and the classification is objective. On the other hand, when fusing multimedia data, clustering by manual labeling is avoided, so the efficiency and quality of data fusion are further improved, and user experience is improved as well.
Based on the same idea, some embodiments of the present application further provide a device corresponding to the above method.
Fig. 2 is a schematic structural diagram of a multimedia data fusion device according to an embodiment of the present application. As shown in fig. 2, the apparatus 200 includes: a receiving module 210, an identification module 220, a vector conversion module 230, and a clustering module 240.
The receiving module 210 is configured to receive multimedia data from a plurality of terminal devices, where data types of the multimedia data include at least two of the following: text, image, audio. The identification module 220 is configured to perform corresponding identification on the multimedia data of each data type to obtain a feature vector of each multimedia data; wherein the feature vector is used to represent the feature of each multimedia data. The vector transformation module 230 is configured to perform vector transformation on the feature vectors of the multimedia data based on a relationship between the feature vectors of the multimedia data and a preset transformation vector, so that the feature vectors of the multimedia data of different data types are in the same vector space. The clustering module 240 is configured to cluster the multimedia data of different data types according to the feature vector of each converted multimedia data.
In a possible implementation manner, based on the feature vector of each multimedia data and the number of preset multimedia data categories, vector conversion is performed on the feature vector of each multimedia data, and a specific preset algorithm is shown in the following formula:
P(i) = exp(θ_i^T x) / Σ_k exp(θ_k^T x)

where k is the number of preset multimedia data categories (the summation runs over all k categories), θ_k is the feature vector of the k-th multimedia data, x is the preset conversion vector, T denotes transposition, and P(i) is the feature vector after vector conversion.
In one possible implementation, the clustering module 240 includes: a determining unit (not shown in the figure) and a clustering unit (not shown in the figure). The determining unit is configured to determine whether multimedia data of different data types belong to one type according to the converted feature vector of each multimedia data. The clustering unit is configured to cluster the multimedia data of different data types that belong to one type based on a preset clustering algorithm.
In a possible implementation manner, the determining unit is specifically configured to: calculate the Euclidean distance between the converted feature vectors of multimedia data of different data types; and determine the multimedia data of different data types to be one type when the Euclidean distance is smaller than a preset threshold.
The embodiments in the present application are described in a progressive manner, and the same and similar parts among the embodiments can be referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, for the apparatus embodiment, since it is substantially similar to the method embodiment, the description is relatively simple, and for the relevant points, reference may be made to the partial description of the method embodiment.
The devices and the methods provided by the embodiment of the application are in one-to-one correspondence, so the devices also have beneficial technical effects similar to the corresponding methods.
The above are merely examples of the present application and are not intended to limit the present application. Various modifications and changes may occur to those skilled in the art. Any modification, equivalent replacement, improvement or the like made within the spirit and principle of the present application shall be included in the scope of the claims of the present application.

Claims (10)

1. A method for multimedia data fusion, the method comprising:
receiving multimedia data from a plurality of terminal devices, wherein the data types of the multimedia data comprise at least two of the following types: text, images, audio;
respectively carrying out corresponding identification on the multimedia data of each data type to obtain a feature vector of each multimedia data, wherein the feature vector is used for expressing the feature of each multimedia data;
performing vector conversion on the feature vectors of the multimedia data based on the relationship between the feature vectors of the multimedia data and preset conversion vectors so as to enable the feature vectors of the multimedia data of different data types to be in the same vector space;
and clustering the multimedia data of different data types according to the converted feature vectors of the multimedia data.
2. The method of claim 1, wherein the feature vector of each multimedia data is vector-converted based on the feature vector of each multimedia data and a predetermined number of categories of the multimedia data, as shown in the following formula:
P(i) = exp(θ_i^T x) / Σ_k exp(θ_k^T x)

wherein k is the number of preset multimedia data categories (the summation runs over all k categories), θ_k is the feature vector of the k-th multimedia data, x is the preset conversion vector, T denotes transposition, and P(i) is the feature vector after vector conversion.
3. The method of claim 1, wherein clustering multimedia data of different data types according to the feature vectors of the multimedia data after vector conversion specifically comprises:
determining whether the multimedia data of different data types are of one type or not according to the converted feature vectors of the multimedia data;
and clustering the multimedia data of different data types that belong to one type based on a preset clustering algorithm.
4. The method according to claim 3, wherein determining whether multimedia data of different data types are of one type according to the feature vector of each multimedia data after vector conversion is specifically:
calculating the Euclidean distance between the converted feature vectors of the multimedia data of different data types;
and determining the multimedia data of different data types to be one type when the Euclidean distance is smaller than a preset threshold.
5. The method of claim 1, wherein the data type of the multimedia data further comprises: video.
6. The method of claim 1, wherein before the corresponding identification of the multimedia data of each data type is performed to obtain the feature vector of each multimedia data, the method further comprises:
respectively carrying out corresponding preprocessing on the multimedia data of different data types.
7. A multimedia data fusion device, characterized in that the device comprises:
the receiving module is used for receiving multimedia data from a plurality of terminal devices, and the data types of the multimedia data comprise at least two of the following types: text, images, audio;
the identification module is used for correspondingly identifying the multimedia data of each data type to obtain the characteristic vector of each multimedia data; wherein, the feature vector is used for representing the feature of each multimedia data;
the vector conversion module is used for performing vector conversion on the feature vectors of the multimedia data based on the relationship between the feature vectors of the multimedia data and the preset conversion vectors so as to enable the feature vectors of the multimedia data of different data types to be in the same vector space;
and the clustering module is used for clustering the multimedia data of different data types according to the converted feature vectors of the multimedia data.
8. The apparatus of claim 7, wherein the feature vector of each multimedia data is vector-converted based on the feature vector of each multimedia data and a number of predetermined categories of multimedia data, as shown in the following formula:
P(i) = exp(θ_i^T x) / Σ_k exp(θ_k^T x)

wherein k is the number of preset multimedia data categories (the summation runs over all k categories), θ_k is the feature vector of the k-th multimedia data, x is the preset conversion vector, T denotes transposition, and P(i) is the feature vector after vector conversion.
9. The apparatus of claim 7, wherein the clustering module comprises: a determining unit and a clustering unit;
the determining unit is used for determining whether the multimedia data of different data types are of one type or not according to the converted feature vectors of the multimedia data;
the clustering unit is used for clustering multimedia data of different data types based on a preset clustering algorithm.
10. The device according to claim 9, wherein the determining unit is specifically configured to:
calculating the Euclidean distance between the converted feature vectors of the multimedia data of different data types;
and determining the multimedia data of different data types to be one type when the Euclidean distance is smaller than a preset threshold.
CN201911259689.9A 2019-12-10 2019-12-10 Multimedia data fusion method and device Active CN111291204B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911259689.9A CN111291204B (en) 2019-12-10 2019-12-10 Multimedia data fusion method and device

Publications (2)

Publication Number Publication Date
CN111291204A true CN111291204A (en) 2020-06-16
CN111291204B CN111291204B (en) 2023-08-29

Family

ID=71021287

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911259689.9A Active CN111291204B (en) 2019-12-10 2019-12-10 Multimedia data fusion method and device

Country Status (1)

Country Link
CN (1) CN111291204B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6086006A (en) * 1996-10-29 2000-07-11 Scerbvo, III; Frank C. Evidence maintaining tape recording reels and cassettes
CN101021849A (en) * 2006-09-14 2007-08-22 浙江大学 Transmedia searching method based on content correlation
CN103440292A (en) * 2013-08-16 2013-12-11 新浪网技术(中国)有限公司 Method and system for retrieving multimedia information based on bit vector
CN104182421A (en) * 2013-05-27 2014-12-03 华东师范大学 Video clustering method and detecting method
CN104679902A (en) * 2015-03-20 2015-06-03 湘潭大学 Information abstract extraction method in conjunction with cross-media fusion
CN110209844A (en) * 2019-05-17 2019-09-06 腾讯音乐娱乐科技(深圳)有限公司 Multi-medium data matching process, device and storage medium

Also Published As

Publication number Publication date
CN111291204B (en) 2023-08-29

Similar Documents

Publication Publication Date Title
US8818916B2 (en) System and method for linking multimedia data elements to web pages
CN111259215A (en) Multi-modal-based topic classification method, device, equipment and storage medium
CN107301170B (en) Method and device for segmenting sentences based on artificial intelligence
CN110580500A (en) Character interaction-oriented network weight generation few-sample image classification method
CN113255755A (en) Multi-modal emotion classification method based on heterogeneous fusion network
CN109885796B (en) Network news matching detection method based on deep learning
CN111464881B (en) Full-convolution video description generation method based on self-optimization mechanism
CN106354856B (en) Artificial intelligence-based deep neural network enhanced search method and device
CN111753133A (en) Video classification method, device and storage medium
CN115292470B (en) Semantic matching method and system for intelligent customer service of petty loan
CN110717421A (en) Video content understanding method and device based on generation countermeasure network
CN113704506A (en) Media content duplication eliminating method and related device
US11537636B2 (en) System and method for using multimedia content as search queries
CN112381114A (en) Deep learning image annotation system and method
CN116662565A (en) Heterogeneous information network keyword generation method based on contrast learning pre-training
CN116091836A (en) Multi-mode visual language understanding and positioning method, device, terminal and medium
CN115599953A (en) Training method and retrieval method of video text retrieval model and related equipment
CN116977701A (en) Video classification model training method, video classification method and device
CN114782752B (en) Small sample image integrated classification method and device based on self-training
CN111291204B (en) Multimedia data fusion method and device
CN114973086A (en) Video processing method and device, electronic equipment and storage medium
CN112749556B (en) Multi-language model training method and device, storage medium and electronic equipment
Jiao et al. Realization and improvement of object recognition system on raspberry pi 3b+
CN114491010A (en) Training method and device of information extraction model
Liu et al. Research on graphic-text relationship in film and television works based on big data model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20230803

Address after: No.3188 Hengxiang North Street, Baoding City, Hebei Province 071051

Applicant after: Hebei Finance University

Address before: No.3188 Hengxiang North Street, Baoding City, Hebei Province 071051

Applicant before: Hebei Finance University

Applicant before: HUARUI XINZHI TECHNOLOGY (BEIJING) Co.,Ltd.

GR01 Patent grant