CN111061907B - Media data processing method, device and storage medium - Google Patents

Media data processing method, device and storage medium

Info

Publication number
CN111061907B
Authority
CN
China
Prior art keywords
media data
feature vector
historical
spectrogram
vector
Prior art date
Legal status
Active
Application number
CN201911259305.3A
Other languages
Chinese (zh)
Other versions
CN111061907A (en)
Inventor
缪畅宇
Current Assignee
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN201911259305.3A
Publication of CN111061907A
Application granted
Publication of CN111061907B

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/60 - Information retrieval; Database structures therefor; File system structures therefor of audio data
    • G06F 16/65 - Clustering; Classification
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/60 - Information retrieval; Database structures therefor; File system structures therefor of audio data
    • G06F 16/63 - Querying
    • G06F 16/635 - Filtering based on additional data, e.g. user or group profiles

Abstract

The present disclosure provides a media data processing method, device and storage medium. The method comprises the following steps: acquiring an initial feature vector of the first media data according to a first spectrogram of the first media data; determining one or more second media data in the historical media data set adjacent to the first media data according to the initial feature vector and the historical media data set, wherein the historical media data set comprises a group of historical media data with relevance to the user selection behavior; and obtaining the optimized feature vector of the first media data according to the initial feature vector and the one or more second media data. By extracting the preliminary feature vector of the media data and further extracting the optimized feature vector according to the preliminary feature vector and a plurality of media data adjacent to the media data in the discrete media data set, the features of the media data can be accurately obtained.

Description

Media data processing method, device and storage medium
Technical Field
The disclosure relates to the technical field of media processing, and in particular relates to a media data processing method, a device and a storage medium.
Background
In media data recommendation systems, an embedded representation of media data is required, i.e., each item of media data is mapped onto a vector without manual labeling. The current main practices include:
item2vec: the media data is treated as a word in natural language processing (NLP), the sequence of media data is treated as a sentence in NLP, and the media data is then represented by word2vec.
Joint embedding representation of items and users.
Either way, the media data sequence is constructed purely from the standpoint of user behavior, without taking into account the characteristics of the media data itself.
Thus, there is currently no scheme that extracts features from the content of the media data itself.
Disclosure of Invention
The present disclosure provides a media data processing method, apparatus and storage medium to accurately obtain characteristics of media data.
In a first aspect, a media data processing method is provided, including:
acquiring a first feature vector of first media data according to a first spectrogram of the first media data, wherein the first feature vector is an initial feature vector of the first media data;
determining one or more second media data in the historical media data set adjacent to the first media data according to the first feature vector and the historical media data set, wherein the historical media data set comprises a group of historical media data with relevance of user selection behaviors;
and obtaining a second feature vector of the first media data according to the first feature vector and the one or more second media data, wherein the second feature vector is an optimized feature vector of the first media data.
In one implementation, the method further comprises:
and extracting a time domain feature vector and a frequency domain feature vector of the first media data, respectively, to obtain a first spectrogram of the first media data.
In yet another implementation, the obtaining the first feature vector of the first media data according to the first spectrogram of the first media data includes:
and encoding the first spectrogram of the first media data to obtain the first feature vector.
In yet another implementation, the method further comprises:
decoding the first characteristic vector obtained after the encoding to obtain a reconstructed second spectrogram;
updating a reconstruction error according to the first spectrogram and the second spectrogram;
training the encoding parameters based on the reconstruction error.
In yet another implementation, the method further comprises:
and classifying the plurality of second media data according to the relevance of the behavior of the user for selecting the plurality of second media data to obtain one or more historical media data sets.
In yet another implementation, the determining, from the first feature vector and the set of historical media data, one or more second media data in the set of historical media data that is adjacent to the first media data includes:
searching the one or more second media data in the historical media data set, wherein the searched one or more second media data and the first feature vector meet a first functional relation.
In yet another implementation, the determining, from the first feature vector and the set of historical media data, one or more second media data in the set of historical media data that is adjacent to the first media data includes:
and obtaining a result of a second function according to the first feature vector and the second function, wherein the result of the second function is one or more second media data adjacent to the first media data in the historical media data set.
In yet another implementation, the obtaining the second feature vector of the first media data according to the first feature vector and the one or more second media data includes:
and extracting the second feature vector from the first feature vector in a neural network for extracting the media data, wherein the neural network for extracting the media data is trained according to one or more initial feature vectors and one or more second media data.
In yet another implementation, the method further comprises:
receiving a recall instruction of a server, wherein the recall instruction is used for indicating deletion of first media data with the similarity of the second feature vector and a feature vector of reference media data being greater than or equal to a first set value;
and comparing the second characteristic vector with the characteristic vector of the reference media data, and deleting the first media data if the similarity between the second characteristic vector and the characteristic vector of the reference media data is larger than or equal to a first set value.
In yet another implementation, the method further comprises:
and comparing the second characteristic vector with the characteristic vectors of the plurality of third media data to obtain third media data with the similarity between the characteristic vector and the second characteristic vector being greater than or equal to a second set value.
In a second aspect, there is provided a media data processing device comprising:
a first obtaining unit, configured to obtain a first feature vector of first media data according to a first spectrogram of the first media data, where the first feature vector is an initial feature vector of the first media data;
a determining unit, configured to determine one or more second media data adjacent to the first media data in the historical media data set according to the first feature vector and the historical media data set, where the historical media data set includes a group of historical media data with relevance to a user selection behavior;
and a second obtaining unit, configured to obtain a second feature vector of the first media data according to the first feature vector and the one or more second media data, where the second feature vector is an optimized feature vector of the first media data.
In one implementation, the apparatus further comprises:
and the extraction unit is used for extracting a time domain feature vector and a frequency domain feature vector of the first media data, respectively, to obtain a first spectrogram of the first media data.
In yet another implementation, the first obtaining unit is configured to encode a first spectrogram of the first media data to obtain the first feature vector.
In yet another implementation, the apparatus further comprises:
the decoding unit is used for decoding the first characteristic vector obtained after the encoding to obtain a reconstructed second spectrogram;
the updating unit is used for updating the reconstruction error according to the first spectrogram and the second spectrogram;
and the training unit is used for training the encoding parameters according to the reconstruction error.
In yet another implementation, the apparatus further comprises:
and the classification unit is used for classifying the plurality of second media data according to the relevance of the behaviors of the plurality of second media data selected by the user to obtain one or more historical media data sets.
In yet another implementation, the determining unit is configured to search the historical media data set for the one or more second media data that satisfies the first functional relationship with the first feature vector.
In yet another implementation, the determining unit is configured to obtain a result of a second function according to the first feature vector and the second function, where the result of the second function is one or more second media data adjacent to the first media data in the historical media data set.
In yet another implementation, the second obtaining unit is configured to extract the second feature vector from the first feature vector in a neural network for extracting media data, where the neural network for extracting media data is trained from one or more initial feature vectors and one or more second media data.
In yet another implementation, the apparatus further comprises:
the receiving unit is used for receiving a recall instruction of the server, wherein the recall instruction is used for indicating to delete the first media data of which the similarity between the second feature vector and the feature vector of the reference media data is greater than or equal to a first set value;
a first comparing unit for comparing the second feature vector with a feature vector of reference media data;
and the deleting unit is used for deleting the first media data if the similarity between the second feature vector and the feature vector of the reference media data is greater than or equal to a first set value.
In yet another implementation, the apparatus further comprises:
a second comparing unit for comparing the second feature vector with feature vectors of a plurality of third media data;
and the third acquisition unit is used for acquiring third media data with the similarity between the feature vector and the second feature vector being greater than or equal to a second set value.
In a third aspect, there is provided a computer storage medium storing one or more instructions adapted to be loaded by a processor and to perform a method as described in the first aspect or any one of the first aspects.
By adopting the scheme disclosed by the invention, the following technical effects are achieved:
by extracting the preliminary feature vector of the media data and further extracting the optimized feature vector according to the preliminary feature vector and a plurality of media data adjacent to the media data in the discrete media data set, the features of the media data can be accurately obtained.
Drawings
Fig. 1 is a flowchart of a media data processing method according to an embodiment of the present disclosure;
FIG. 2 is a flow chart of yet another media data processing method provided by an embodiment of the present disclosure;
FIG. 3 is an exemplary spectrogram;
FIG. 4 is a schematic diagram of a model for extracting initial feature vectors of media data;
FIG. 5 is a flow chart of yet another media data processing method provided by an embodiment of the present disclosure;
FIG. 6 is a schematic diagram of a media data processing device according to an embodiment of the present disclosure;
fig. 7 is a schematic structural diagram of yet another media data processing device according to an embodiment of the present disclosure.
Detailed Description
As shown in fig. 1, which is a flowchart of a media data processing method according to an embodiment of the disclosure, the method may include:
S101, according to a first spectrogram of first media data, a first feature vector of the first media data is obtained, wherein the first feature vector is an initial feature vector of the first media data.
The present embodiment extracts features of the first media data. First, an initial feature vector of the first media data is obtained according to a spectrogram of the first media data. The spectrogram of the first media data includes a time domain feature vector and a frequency domain feature vector of the first media data. Generally, the horizontal axis of the spectrogram corresponds to the time domain feature vector of the first media data, and the vertical axis corresponds to the frequency domain feature vector of the first media data. The first feature vector is obtained based only on the time domain feature vector and the frequency domain feature vector of the first media data, without taking into account any other factors such as the user's historical behavior of selecting media data, and is therefore referred to as an initial feature vector.
S102, determining one or more second media data adjacent to the first media data in the historical media data set according to the first feature vector and the historical media data set, wherein the historical media data set comprises a group of historical media data with relevance of user selection behaviors.
In addition to obtaining the initial feature vector of the first media data, the feature vector of the first media data may be optimized based on the user's historical behavior of selecting media data. Specifically, a historical media data set is first determined: media data that the user selected one after another with relevance are grouped into one class, or one set, which forms the historical media data set. One or more second media data in the historical media data set that are adjacent to the first media data are then determined according to the initial feature vector and the historical media data set. "Adjacent" may mean that the feature vectors of the media data are similar. The number of second media data to determine may be chosen empirically.
S103, obtaining a second feature vector of the first media data according to the first feature vector and one or more second media data, wherein the second feature vector is an optimized feature vector of the first media data.
Having obtained the initial feature vector of the first media data and one or more second media data whose feature vectors are similar to that of the first media data, the second feature vector of the first media data can be obtained based on both the features of the first media data itself and the user's historical media selection behavior. The second feature vector is optimized relative to the first feature vector because it takes this historical behavior into account, so the features of the media data are obtained more accurately, and the second feature vector can be used in application scenarios such as selection of media data by the user.
According to the media data processing method provided by the embodiment of the disclosure, the features of the media data can be accurately obtained by extracting the preliminary feature vector of the media data and further extracting the optimized feature vector according to the preliminary feature vector and a plurality of media data adjacent to the media data in the discrete media data set.
As shown in fig. 2, which is a flowchart of yet another media data processing method according to an embodiment of the disclosure, the method may include:
S201, extracting a time domain feature vector and a frequency domain feature vector of first media data, respectively, to obtain a first spectrogram of the first media data.
The present embodiment takes feature extraction of an audio signal as an example; the extraction principle can also be applied to feature extraction of other media data having characteristics similar to an audio signal. An audio signal has two dimensional expressions, time domain and frequency domain, i.e., it can be expressed either as a time sequence or as a frequency sequence. Specifically, the audio signal is sampled in the time dimension, for example every 0.1 s, to obtain a discrete time sequence T1 to Tn, where each value represents the magnitude of the audio at the sampling point. The samples are then grouped according to a fixed time period (for example 3 s): with a period length of 3 s and a sampling interval of 0.1 s, each group contains 3 s / 0.1 s = 30 values; for example, T1 to T30 form one group, called G1, T31 to T60 form G2, and so on. A frequency domain signal is then obtained by frequency-domain transforming each group of the time sequence (using, for example but not limited to, FFT, MFCC or DFT), which represents the distribution of the different frequencies contained within that group; the frequency signal is also sampled, for example at 10 Hz, to obtain a discrete frequency sequence. Assuming the frequency range is 0 to f, each frequency sequence contains f/10 values, and each Gi can be expressed as such a frequency sequence; the Gi differ only in the magnitudes of the values at the same frequencies. For music, the Gi covering bass-heavy passages have large values at the low frequencies, while the Gi covering treble-heavy passages have large values at the high frequencies. Therefore, each Gi may be expressed either as a time sequence (such as T1 to T30) or as a frequency sequence, and these frequency sequences together form a spectrogram. The spectrogram illustrated in fig. 3 was obtained by decomposing real audio: the horizontal axis is time, with a period of about 1.75 s, that is, a time slice is cut every 1.75 s; the vertical axis is the frequency corresponding to each time segment, with frequency limits of 110 Hz to 3520 Hz, and the gray scale represents the magnitude of the value at each frequency.
For another example, after sampling and grouping an audio signal, several groups of time sequences are obtained; for a unified notation we denote them ti, with n groups t1 to tn in total. Each ti can be transformed into a frequency-domain sequence in the manner described above and sampled to obtain the values corresponding to m discrete frequencies, and these m values form an m-dimensional vector representing ti. The spectrogram of the whole audio signal is then a two-dimensional m×n matrix.
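To make the construction above concrete, the following is a minimal sketch in Python (not taken from the patent) of building such an m×n spectrogram: the waveform is split into fixed-length groups Gi and each group is transformed to the frequency domain with an FFT. The sample rate, window length and use of numpy's rfft are illustrative assumptions.

import numpy as np

def build_spectrogram(signal: np.ndarray, sample_rate: int = 8000,
                      window_seconds: float = 3.0) -> np.ndarray:
    """Return an (m, n) matrix: m frequency values x n time groups Gi."""
    window_len = int(window_seconds * sample_rate)   # samples per group Gi
    n_groups = len(signal) // window_len             # number of groups t1..tn
    columns = []
    for i in range(n_groups):
        frame = signal[i * window_len:(i + 1) * window_len]
        # magnitude of the one-sided FFT: distribution of frequencies within this group
        columns.append(np.abs(np.fft.rfft(frame)))
    # each Gi becomes an m-dimensional frequency vector; stack them as columns
    return np.stack(columns, axis=1)

# usage: 30 s of a synthetic 440 Hz tone sampled at 8 kHz
t = np.arange(0, 30.0, 1.0 / 8000)
audio = np.sin(2 * np.pi * 440.0 * t)
spec = build_spectrogram(audio)   # shape (m, n) = (12001, 10)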
S202, encoding a first spectrogram of the first media data to obtain a first feature vector.
This step is to obtain an initial feature vector of the first media data. The essence is to initially compress the spectrogram into a vector.
This may be done with an autoencoder, for example a standard autoencoder or a variational autoencoder. Specifically, the two-dimensional spectrogram is input into the encoder, and after multiple transformations an intermediate hidden-layer vector h, i.e., the first feature vector, is output.
Further, before or after S202, the following steps may be further included:
decoding the first feature vector obtained after encoding to obtain a reconstructed second spectrogram;
updating the reconstruction error according to the first spectrogram and the second spectrogram;
training the encoding parameters based on the reconstruction error.
As shown in fig. 4, after the two-dimensional spectrogram is input into the encoder and transformed multiple times to output the intermediate hidden-layer vector h, h can be restored back to a spectrogram by the decoder, yielding a reconstructed spectrogram. The autoencoder parameters are learned by constructing the reconstruction error and minimizing it.
By this operation, a hidden vector hi can be obtained for each media data si. It is the result of transforming the spectrogram and therefore contains the time-frequency domain information of the media data. However, since hi is obtained only from the reconstruction error of the spectrogram of the music itself, this stage of the operation is called preliminary feature extraction.
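As an illustration of this preliminary feature extraction stage, the following is a minimal autoencoder sketch assuming PyTorch; the layer sizes, the flattening of the spectrogram into a single vector, and the use of a mean-squared reconstruction error are assumptions for illustration rather than details given by the disclosure.

import torch
import torch.nn as nn

class SpectrogramAutoencoder(nn.Module):
    def __init__(self, spec_dim: int, hidden_dim: int = 128):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(spec_dim, 512), nn.ReLU(),
                                     nn.Linear(512, hidden_dim))
        self.decoder = nn.Sequential(nn.Linear(hidden_dim, 512), nn.ReLU(),
                                     nn.Linear(512, spec_dim))

    def forward(self, x):
        h = self.encoder(x)       # intermediate hidden-layer vector h (first feature vector)
        x_rec = self.decoder(h)   # reconstructed second spectrogram
        return h, x_rec

# train by minimising the reconstruction error between the input and reconstructed spectrograms
model = SpectrogramAutoencoder(spec_dim=1024)          # 1024 = flattened spectrogram size (illustrative)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()
spectrograms = torch.randn(16, 1024)                   # a batch of flattened spectrograms
for _ in range(10):
    h, reconstruction = model(spectrograms)
    loss = loss_fn(reconstruction, spectrograms)       # reconstruction error
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()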
A pre-training model is then further built on the initially extracted features, in combination with the user behavior sequence, to extract a second feature vector. The second feature vector takes into account the user's previous behavior of selecting media data and is optimized compared with the initial feature vector, so that more accurate results are obtained in subsequent application scenarios of the media data features, such as music recall and similarity calculation.
S203, classifying the plurality of second media data according to the relevance of the actions of the user selecting the plurality of second media data to obtain one or more historical media data sets.
A historical media data set is first determined: media data that the user selected one after another with relevance are grouped into one class, or one set, which forms the historical media data set. One or more second media data in the historical media data set that are adjacent to the first media data are then determined according to the initial feature vector and the historical media data set. "Adjacent" may mean that the feature vectors of the media data are similar. The number of second media data to determine may be chosen empirically.
S204, searching one or more second media data in the historical media data set, wherein the one or more searched second media data and the first feature vector meet a first functional relation.
This step determines one or more second media data in the historical media data set that are adjacent to the first media data. The determination is similar to CBOW: given the one or more second media data, it is determined whether they and the first media data satisfy a first functional relationship f(si-1, si+1, …) = si. The second media data si-1, si+1, … are the media data surrounding si, and the number of second media data to select may be chosen empirically, for example by fixing a window and selecting 3 or 4 historical media data surrounding si within that window.
As for the specific form of f, i.e., how to construct the pre-training model, the same model structure as word2vec is used here. Note, however, that si is initialized with the vector hi learned by the autoencoder above, rather than, as in word2vec/item2vec, representing si only by a one-hot vector of its id and initializing the corresponding vector randomly. The features hi initially extracted in the first stage are used to represent si and are then optimized so that the user behavior sequence is taken into account, in order to fully fuse the time-frequency information of the media data with the user behavior sequence.
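A hedged sketch of this CBOW-style pre-training step is given below, assuming PyTorch: the item vectors are initialised from the autoencoder outputs hi rather than randomly, the surrounding items are averaged, and the target item si is predicted. The class name, window handling and cross-entropy loss are illustrative assumptions, not the disclosure's exact construction.

import torch
import torch.nn as nn

class CbowItemModel(nn.Module):
    def __init__(self, initial_vectors: torch.Tensor):
        super().__init__()
        n_items, dim = initial_vectors.shape
        # item embeddings start from the hidden vectors hi learned by the autoencoder
        self.item_vectors = nn.Embedding(n_items, dim)
        self.item_vectors.weight.data.copy_(initial_vectors)
        self.output = nn.Linear(dim, n_items)

    def forward(self, context_ids: torch.Tensor) -> torch.Tensor:
        # f(si-1, si+1, ...): average the surrounding item vectors, then score every candidate si
        context = self.item_vectors(context_ids).mean(dim=1)
        return self.output(context)

# usage: a window (si-2, si-1, si+1, si+2) around the target item si
h_init = torch.randn(1000, 128)                   # stand-in for the hi vectors of 1000 items
model = CbowItemModel(h_init)
logits = model(torch.tensor([[3, 4, 6, 7]]))      # ids of the surrounding items
loss = nn.CrossEntropyLoss()(logits, torch.tensor([5]))   # predict the centre item si
loss.backward()
# after training, model.item_vectors.weight holds the optimised vectors h'i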
S205, extracting a second feature vector from the neural network for extracting the media data according to the first feature vector, wherein the neural network for extracting the media data is trained according to one or more initial feature vectors and one or more second media data.
S206, receiving a recall instruction of the server, wherein the recall instruction is used for indicating deletion of the first media data with the similarity of the second feature vector and the feature vector of the reference media data being greater than or equal to a first set value.
S207, comparing the second feature vector with the feature vector of the reference media data, and deleting the first media data if the similarity between the second feature vector and the feature vector of the reference media data is greater than or equal to a first set value.
After the pre-training model is trained, the model parameters, i.e., the optimized features h'i, can be output as representations (portraits) of the music si for use in downstream tasks such as music recall.
In a music recall scenario, some released music may need to be recalled, for example because the ownership of the music is unclear or the releasing party was not qualified to release it, yet certain user terminals have already downloaded the music; the server then needs to instruct those user terminals to delete the music to be recalled. The feature vector of the media data to be recalled (the reference media data) may be extracted first, and the second feature vector of the first media data is then compared with the feature vector of the reference media data; if the similarity between the two is greater than or equal to the first set value, the first media data is deleted on the user terminal. The feature vector of the reference media data may itself be an optimized feature vector.
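A minimal sketch of this recall check follows, using cosine similarity over the optimised feature vectors; the threshold value, dictionary layout and function names are assumptions for illustration.

import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def items_to_recall(local_items: dict, reference_vector: np.ndarray,
                    threshold: float = 0.9) -> list:
    """Return ids of local media data whose similarity to the reference reaches the first set value."""
    return [item_id for item_id, feature_vector in local_items.items()
            if cosine_similarity(feature_vector, reference_vector) >= threshold]

# usage: the terminal deletes every item returned by items_to_recall(...)
library = {"song_a": np.random.rand(128), "song_b": np.random.rand(128)}
reference = np.random.rand(128)      # feature vector of the media data to be recalled
to_delete = items_to_recall(library, reference)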
According to the media data processing method provided by the embodiment of the disclosure, the features of the media data can be accurately obtained by extracting the preliminary feature vector of the media data and further extracting the optimized feature vector according to the preliminary feature vector and a plurality of media data adjacent to the media data in the discrete media data set.
As shown in fig. 5, which is a flowchart of yet another media data processing method according to an embodiment of the disclosure, the method may include:
S301, extracting a time domain feature vector and a frequency domain feature vector of first media data, respectively, to obtain a first spectrogram of the first media data.
S302, encoding a first spectrogram of the first media data to obtain a first feature vector.
S303, classifying the plurality of second media data according to the relevance of the actions of the user for selecting the plurality of second media data, and obtaining one or more historical media data sets.
S304, according to the first feature vector and the second function, obtaining a result of the second function, wherein the result of the second function is one or more second media data adjacent to the first media data in the historical media data set.
This step determines one or more second media data in the historical media data set that are adjacent to the first media data. The determination is similar to skip-gram: the known center item (here playing a role analogous to the first feature vector) is used to predict one or more second media data adjacent to the first media data. Assuming that the first media data is si and the second function is f, the result of the second function, f(si) = si-1, si+1, …, can be obtained. The second media data si-1, si+1, … are the media data surrounding si, and the number of second media data to select may be chosen empirically.
As for the specific form of f, i.e., how to construct the pre-training model, the same model structure as word2vec is used here. Note, however, that si is initialized with the vector hi learned by the autoencoder above, rather than, as in word2vec/item2vec, representing si only by a one-hot vector of its id and initializing the corresponding vector randomly. The features hi initially extracted in the first stage are used to represent si and are then optimized so that the user behavior sequence is taken into account, in order to fully fuse the time-frequency information of the media data with the user behavior sequence.
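For comparison with the CBOW-style variant above, the following is a companion sketch of the skip-gram-style step, again assuming PyTorch: the centre item si, initialised with hi, is used to predict each surrounding item. Names and the single-neighbour loss shown here are illustrative assumptions.

import torch
import torch.nn as nn

class SkipGramItemModel(nn.Module):
    def __init__(self, initial_vectors: torch.Tensor):
        super().__init__()
        n_items, dim = initial_vectors.shape
        self.item_vectors = nn.Embedding(n_items, dim)    # initialised from the hi vectors
        self.item_vectors.weight.data.copy_(initial_vectors)
        self.output = nn.Linear(dim, n_items)

    def forward(self, center_ids: torch.Tensor) -> torch.Tensor:
        # f(si): score every candidate neighbour si-1, si+1, ... of the centre item
        return self.output(self.item_vectors(center_ids))

model = SkipGramItemModel(torch.randn(1000, 128))         # stand-in hi vectors for 1000 items
logits = model(torch.tensor([5]))                         # centre item si
loss = nn.CrossEntropyLoss()(logits, torch.tensor([4]))   # one of its neighbours, e.g. si-1
loss.backward()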
And S305, extracting the second feature vector from the neural network for extracting the media data according to the first feature vector, wherein the neural network for extracting the media data is trained according to one or more initial feature vectors and one or more second media data.
This embodiment differs from the embodiment shown in fig. 2 in step S304, and other steps S301 to S303 may refer to steps S201 to S203 of the embodiment shown in fig. 2, and step S305 may refer to step S205 of the embodiment shown in fig. 2.
S306, comparing the second characteristic vector with characteristic vectors of a plurality of third media data to obtain third media data with the similarity between the characteristic vector and the second characteristic vector being larger than or equal to a second set value.
After the pre-training model is trained, the model parameters, i.e., the optimized features h'i, can be used as representations (portraits) of the music si for downstream tasks such as similarity calculation.
For example, to recommend similar media data to the user, the second feature vector may be compared to the feature vectors of the plurality of third media data. The feature vector of the third media data may be an optimized feature vector. And acquiring the similarity between the second feature vector of the first media data and the feature vector of each third media data, acquiring the third media data with the similarity value larger than or equal to a second set value, and recommending the acquired third media data to a user. This allows more targeted recommendation of similar media data to the user.
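A minimal sketch of this similarity-based recommendation step is shown below: candidate third media data are scored against the optimised feature vector of the first media data, and those whose similarity reaches the second set value are kept, sorted from most to least similar. The threshold and data layout are illustrative assumptions.

import numpy as np

def similar_items(query_vector: np.ndarray, candidates: dict,
                  min_similarity: float = 0.8) -> list:
    """Return (item_id, similarity) pairs whose similarity reaches the second set value."""
    def cos(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    scored = [(item_id, cos(query_vector, vector)) for item_id, vector in candidates.items()]
    kept = [pair for pair in scored if pair[1] >= min_similarity]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)

# usage: recommend the highest-scoring third media data to the user
query = np.random.rand(128)          # optimised feature vector of the first media data
catalogue = {"song_c": np.random.rand(128), "song_d": np.random.rand(128)}
recommendations = similar_items(query, catalogue)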
According to the media data processing method provided by the embodiment of the disclosure, the features of the media data can be accurately obtained by extracting the preliminary feature vector of the media data and further extracting the optimized feature vector according to the preliminary feature vector and a plurality of media data adjacent to the media data in the discrete media data set.
Based on the same concept as the media data processing method in the foregoing embodiments, as shown in fig. 6, an embodiment of the disclosure further provides a media data processing device. The apparatus 1000 comprises a first acquiring unit 11, a determining unit 12 and a second acquiring unit 13, and may further include an extracting unit 14, a decoding unit 15, an updating unit 16 and a training unit 17 (shown in broken lines in the figure).
Wherein:
a first obtaining unit 11, configured to obtain a first feature vector of first media data according to a first spectrogram of the first media data, where the first feature vector is an initial feature vector of the first media data;
a determining unit 12, configured to determine one or more second media data adjacent to the first media data in the historical media data set according to the first feature vector and the historical media data set, where the historical media data set includes a group of historical media data with relevance to the user selection behavior;
a second obtaining unit 13, configured to obtain a second feature vector of the first media data according to the first feature vector and the one or more second media data, where the second feature vector is an optimized feature vector of the first media data.
In one implementation, the extracting unit 14 is configured to extract a time domain feature vector and a frequency domain feature vector of the first media data, respectively, to obtain a first spectrogram of the first media data.
In yet another implementation, the first obtaining unit 11 is configured to encode a first spectrogram of the first media data to obtain the first feature vector.
In yet another implementation, the decoding unit 15 is configured to decode the first feature vector obtained after the encoding, to obtain a reconstructed second spectrogram;
the updating unit 16 is configured to update a reconstruction error according to the first spectrogram and the second spectrogram;
the training unit 17 is configured to train the encoding parameters according to the reconstruction error.
In yet another implementation, the determining unit 12 is configured to search the historical media data set for the one or more second media data that satisfies the first functional relationship with the first feature vector.
In yet another implementation, the determining unit 12 is configured to obtain a result of the second function according to the first feature vector and a second function, where the result of the second function is one or more second media data adjacent to the first media data in the historical media data set.
In yet another implementation, the apparatus further comprises:
the receiving unit is used for receiving a recall instruction of the server, wherein the recall instruction is used for indicating to delete the first media data of which the similarity between the second feature vector and the feature vector of the reference media data is greater than or equal to a first set value;
a first comparing unit for comparing the second feature vector with a feature vector of reference media data;
and the deleting unit is used for deleting the first media data if the similarity between the second feature vector and the feature vector of the reference media data is greater than or equal to a first set value.
In yet another implementation, the apparatus further comprises:
a second comparing unit for comparing the second feature vector with feature vectors of a plurality of third media data;
and the third acquisition unit is used for acquiring third media data with the similarity between the feature vector and the second feature vector being greater than or equal to a second set value.
According to the media data processing device provided by the embodiment of the disclosure, the features of the media data can be accurately obtained by extracting the preliminary feature vector of the media data and further extracting the optimized feature vector according to the preliminary feature vector and a plurality of media data adjacent to the media data in the discrete media data set.
Fig. 7 is a schematic structural diagram of yet another media data processing device according to an embodiment of the present disclosure. In one embodiment, the media data processing device may correspond to the embodiment of fig. 1, fig. 2 or fig. 5. As shown in fig. 7, the media data processing device may include: the processor, the network interface and the memory, and in addition, the above media data processing device may further include: a user interface, and at least one communication bus. Wherein the communication bus is used to enable connection communication between these components. The user interface may include a display screen (display), a keyboard (keypad), and the optional user interface may further include a standard wired interface, a wireless interface, among others. The network interface may optionally include a standard wired interface, a wireless interface (e.g., WI-FI interface). The memory may be a high-speed RAM memory or a non-volatile memory, such as at least one disk memory. The memory may optionally also be at least one storage device located remotely from the aforementioned processor. As shown in fig. 7, an operating system, a network communication module, a user interface module, and a device control application program may be included in a memory as one type of computer-readable storage medium.
In the media data processing device shown in fig. 7, the network interface may provide a network communication function; the user interface is mainly used for providing input for users; the processor may be configured to invoke the device control application stored in the memory to implement the description of the media data processing method in any of the embodiments corresponding to fig. 1, fig. 2, or fig. 5, which is not described herein. In addition, the description of the beneficial effects of the same method is omitted.
It should be understood that the media data processing device described in the embodiments of the present disclosure may perform the description of the media data processing method in any of the foregoing embodiments corresponding to fig. 1, fig. 2 or fig. 5, and will not be described herein. In addition, the description of the beneficial effects of the same method is omitted.
Furthermore, it should be noted here that: the embodiments of the present disclosure further provide a computer readable storage medium, in which a computer program executed by the aforementioned media data processing device 1000 is stored, and the computer program includes program instructions, when executed by a processor, can perform the description of the media data processing method in any of the foregoing embodiments corresponding to fig. 1, fig. 2, or fig. 5, and therefore, a detailed description thereof will not be provided herein. In addition, the description of the beneficial effects of the same method is omitted. For technical details not disclosed in the embodiments of the computer-readable storage medium according to the present disclosure, please refer to the description of the embodiments of the method according to the present disclosure.
It will be clear to those skilled in the art that, for convenience and brevity of description, specific working procedures of the above-described systems, apparatuses and units may refer to corresponding procedures in the foregoing method embodiments, and are not repeated herein.
In the several embodiments provided in this application, it should be understood that the disclosed systems, devices, and methods may be implemented in other manners. For example, the division of units is merely a logical function division, and there may be other division manners in actual implementation; for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted or not performed. The coupling or direct coupling or communication connection shown or discussed between components may be indirect coupling or communication connection through some interfaces, devices or units, and may be in electrical, mechanical or other forms.
The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In the above embodiments, the implementation may be realized in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, it may be realized in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the procedures or functions according to the embodiments of the present application are produced in whole or in part. The computer may be a general purpose computer, a special purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another computer-readable storage medium. For example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wired (e.g., coaxial cable, optical fiber, digital subscriber line (DSL)) or wireless (e.g., infrared, radio, microwave) means. The computer readable storage medium may be any available medium that can be accessed by a computer, or a data storage device such as a server or data center that integrates one or more available media. The available medium may be a read-only memory (ROM), a random access memory (RAM), a magnetic medium such as a floppy disk, a hard disk, a magnetic tape or a magnetic disk, an optical medium such as a digital versatile disc (DVD), or a semiconductor medium such as a solid state disk (SSD).

Claims (9)

1. A method of media data processing, comprising:
extracting a time domain feature vector and a frequency domain feature vector of first media data, respectively, to obtain a first spectrogram of the first media data;
encoding a first spectrogram of the first media data to obtain a first feature vector of the first media data, wherein the first feature vector is an initial feature vector of the first media data;
determining one or more second media data in the historical media data set adjacent to the first media data according to the first feature vector and the historical media data set, wherein the historical media data set comprises a group of historical media data with relevance of user selection behaviors;
and obtaining a second feature vector of the first media data according to the first feature vector and the one or more second media data, wherein the second feature vector is an optimized feature vector of the first media data.
2. The method according to claim 1, wherein the method further comprises:
decoding the first characteristic vector obtained after the encoding to obtain a reconstructed second spectrogram;
updating a reconstruction error according to the first spectrogram and the second spectrogram;
and training the encoding parameters according to the reconstruction error.
3. The method according to claim 1, wherein the method further comprises:
and classifying the plurality of second media data according to the relevance of the behavior of the user for selecting the plurality of second media data to obtain one or more historical media data sets.
4. A method according to claim 1 or 3, wherein said determining one or more second media data in said set of historical media data that are adjacent to said first media data from said first feature vector and set of historical media data comprises:
searching the one or more second media data in the historical media data set, wherein the searched one or more second media data and the first feature vector meet a first functional relation.
5. A method according to claim 1 or 3, wherein said determining one or more second media data in said set of historical media data that are adjacent to said first media data from said first feature vector and set of historical media data comprises:
and obtaining a result of a second function according to the first feature vector and the second function, wherein the result of the second function is one or more second media data adjacent to the first media data in the historical media data set.
6. The method according to claim 1, wherein the method further comprises:
receiving a recall instruction of a server, wherein the recall instruction is used for indicating deletion of first media data with the similarity of the second feature vector and a feature vector of reference media data being greater than or equal to a first set value;
and comparing the second characteristic vector with the characteristic vector of the reference media data, and deleting the first media data if the similarity between the second characteristic vector and the characteristic vector of the reference media data is larger than or equal to a first set value.
7. A media data processing device, comprising:
the first acquisition unit is used for extracting a time domain feature vector and a frequency domain feature vector of first media data, respectively, to obtain a first spectrogram of the first media data; and encoding the first spectrogram of the first media data to obtain a first feature vector of the first media data, wherein the first feature vector is an initial feature vector of the first media data;
a determining unit, configured to determine one or more second media data adjacent to the first media data in the historical media data set according to the first feature vector and the historical media data set, where the historical media data set includes a group of historical media data with relevance to a user selection behavior;
and a second obtaining unit, configured to obtain a second feature vector of the first media data according to the first feature vector and the one or more second media data, where the second feature vector is an optimized feature vector of the first media data.
8. A media data processing device comprising a processor and a storage device, the processor and the storage device being interconnected, wherein the storage device is adapted to store a computer program, the computer program comprising program instructions, the processor being configured to invoke the program instructions to perform the method of any of claims 1-6.
9. A computer readable storage medium storing one or more instructions adapted to be loaded by a processor and to perform the method of any one of claims 1 to 6.
CN201911259305.3A 2019-12-10 2019-12-10 Media data processing method, device and storage medium Active CN111061907B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911259305.3A CN111061907B (en) 2019-12-10 2019-12-10 Media data processing method, device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911259305.3A CN111061907B (en) 2019-12-10 2019-12-10 Media data processing method, device and storage medium

Publications (2)

Publication Number Publication Date
CN111061907A CN111061907A (en) 2020-04-24
CN111061907B true CN111061907B (en) 2023-06-20

Family

ID=70300531

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911259305.3A Active CN111061907B (en) 2019-12-10 2019-12-10 Media data processing method, device and storage medium

Country Status (1)

Country Link
CN (1) CN111061907B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105706369A (en) * 2013-11-12 2016-06-22 瑞典爱立信有限公司 Split gain shape vector coding
CN106484777A (en) * 2016-09-12 2017-03-08 腾讯科技(深圳)有限公司 A kind of multimedia data processing method and device

Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10911829B2 (en) * 2010-06-07 2021-02-02 Affectiva, Inc. Vehicle video recommendation via affect
CN102521368B (en) * 2011-12-16 2013-08-21 武汉科技大学 Similarity matrix iteration based cross-media semantic digesting and optimizing method
CN103995903B (en) * 2014-06-12 2017-04-12 武汉科技大学 Cross-media search method based on isomorphic subspace mapping and optimization
US20170140260A1 (en) * 2015-11-17 2017-05-18 RCRDCLUB Corporation Content filtering with convolutional neural networks
US20170344900A1 (en) * 2016-05-24 2017-11-30 Sultan Saad ALZAHRANI Method and apparatus for automated organization of visual-content media files according to preferences of a user
CN106375780B (en) * 2016-10-20 2019-06-04 腾讯音乐娱乐(深圳)有限公司 A kind of multimedia file producting method and its equipment
CN108733669A (en) * 2017-04-14 2018-11-02 优路(北京)信息科技有限公司 A kind of personalized digital media content recommendation system and method based on term vector
CN107818785A (en) * 2017-09-26 2018-03-20 平安普惠企业管理有限公司 A kind of method and terminal device that information is extracted from multimedia file
CN108737872A (en) * 2018-06-08 2018-11-02 百度在线网络技术(北京)有限公司 Method and apparatus for output information

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105706369A (en) * 2013-11-12 2016-06-22 瑞典爱立信有限公司 Split gain shape vector coding
CN106484777A (en) * 2016-09-12 2017-03-08 腾讯科技(深圳)有限公司 A kind of multimedia data processing method and device

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Hairu Zhang; Shan Ouyang; Guofu Wang; Suolu Wu; Faquan Zhang. Dielectric Spectrum Feature Vector Extraction Algorithm of Ground Penetrating Radar Signal in Frequency Bands. IEEE Geoscience and Remote Sensing Letters, 2014, pp. 958-962. *
石大明. Research on Wideband Spectrum Sensing Technology Based on Compressed Sensing in Cognitive Radio Networks. China Master's Theses Full-text Database, 2015, pp. I136-737. *

Also Published As

Publication number Publication date
CN111061907A (en) 2020-04-24

Similar Documents

Publication Publication Date Title
CN106407381B (en) A kind of method and apparatus of the pushed information based on artificial intelligence
CN112037792B (en) Voice recognition method and device, electronic equipment and storage medium
CN109582825B (en) Method and apparatus for generating information
CN111767394A (en) Abstract extraction method and device based on artificial intelligence expert system
CN112116903A (en) Method and device for generating speech synthesis model, storage medium and electronic equipment
JP2023550211A (en) Method and apparatus for generating text
CN116702723A (en) Training method, device and equipment for contract paragraph annotation model
CN111738010A (en) Method and apparatus for generating semantic matching model
CN116128055A (en) Map construction method, map construction device, electronic equipment and computer readable medium
CN116306603A (en) Training method of title generation model, title generation method, device and medium
CN112182255A (en) Method and apparatus for storing media files and for retrieving media files
JP2024508502A (en) Methods and devices for pushing information
CN113053362A (en) Method, device, equipment and computer readable medium for speech recognition
CN111061907B (en) Media data processing method, device and storage medium
CN110955789B (en) Multimedia data processing method and equipment
CN111626044B (en) Text generation method, text generation device, electronic equipment and computer readable storage medium
CN112651231B (en) Spoken language information processing method and device and electronic equipment
CN112836476B (en) Summary generation method, device, equipment and medium
CN111460214B (en) Classification model training method, audio classification method, device, medium and equipment
CN114706943A (en) Intention recognition method, apparatus, device and medium
CN111582456B (en) Method, apparatus, device and medium for generating network model information
CN111506812B (en) Recommended word generation method and device, storage medium and computer equipment
CN113763922A (en) Audio synthesis method and device, storage medium and electronic equipment
CN111859220A (en) Method and device for displaying information
CN110990528A (en) Question answering method and device and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
REG Reference to a national code

Ref country code: HK

Ref legal event code: DE

Ref document number: 40021734

Country of ref document: HK

GR01 Patent grant