CN111061907A

CN111061907A - Media data processing method, device and storage medium

Info

Publication number: CN111061907A
Application number: CN201911259305.3A
Authority: CN
Inventors: 缪畅宇
Original assignee: Tencent Technology Shenzhen Co Ltd
Current assignee: Tencent Technology Shenzhen Co Ltd
Priority date: 2019-12-10
Filing date: 2019-12-10
Publication date: 2020-04-24
Anticipated expiration: 2039-12-10
Also published as: CN111061907B

Abstract

The disclosure provides a media data processing method, a device and a storage medium. The method comprises the following steps: acquiring an initial feature vector of the first media data according to a first spectrogram of the first media data; determining one or more second media data adjacent to the first media data in the historical media data set according to the initial feature vector and the historical media data set, wherein the historical media data set comprises a group of historical media data with relevance in user selection behavior; and acquiring the optimized feature vector of the media data according to the initial feature vector and one or more second media data. The features of the media data can be accurately obtained by extracting the preliminary feature vector of the media data and further extracting the optimized feature vector according to the preliminary feature vector and a plurality of media data adjacent to the media data in the discrete media data set.

Description

Media data processing method, device and storage medium

Technical Field

The present disclosure relates to the field of media processing technologies, and in particular, to a method and an apparatus for processing media data, and a storage medium.

Background

In a media data recommendation system, an embedding representation of media data is required; that is, each item of media data is mapped to a vector without manual labeling. The main current methods include:

item2vec: the media data is regarded as a word in natural language processing (NLP), a sequence of media data is regarded as a sentence, and the representation is then learned with the word2vec method.

Joint embedding representation of items and users.

In either case, the media data sequence is constructed from the perspective of user behavior, without considering the characteristics of the media data itself.

Thus, there is no scheme for extracting features of media data.

Disclosure of Invention

The present disclosure provides a media data processing method, apparatus, and storage medium to accurately obtain characteristics of media data.

In a first aspect, a media data processing method is provided, including:

acquiring a first feature vector of first media data according to a first spectrogram of the first media data, wherein the first feature vector is an initial feature vector of the first media data;

determining one or more second media data adjacent to the first media data in the historical media data set according to the first feature vector and the historical media data set, wherein the historical media data set comprises a group of historical media data with relevance in user selection behavior;

and acquiring a second feature vector of the first media data according to the first feature vector and the one or more second media data, wherein the second feature vector is an optimized feature vector of the first media data.

In one implementation, the method further comprises:

extracting a time-domain feature vector and a frequency-domain feature vector of the first media data, respectively, to obtain a first spectrogram of the first media data.

In yet another implementation, the obtaining a first feature vector of the first media data according to a first spectrogram of the first media data includes:

and encoding a first spectrogram of the first media data to obtain the first feature vector.

In yet another implementation, the method further comprises:

decoding the first feature vector obtained after encoding to obtain a reconstructed second spectrogram;

updating a reconstruction error according to the first spectrogram and the second spectrogram;

and training the parameters of the encoding according to the reconstruction error.

In yet another implementation, the method further comprises:

and classifying the plurality of second media data according to the relevance of the behavior of selecting the plurality of second media data by the user to obtain one or more historical media data sets.

In yet another implementation, the determining, from the first feature vector and a set of historical media data, one or more second media data of the set of historical media data that are adjacent to the first media data includes:

searching for the one or more second media data in the historical media data set, wherein the found one or more second media data and the first feature vector satisfy a first functional relation.

In yet another implementation, the determining, from the first feature vector and a set of historical media data, one or more second media data of the set of historical media data that are adjacent to the first media data includes:

obtaining a result of a second function according to the first feature vector and the second function, wherein the result of the second function is one or more second media data adjacent to the first media data in the historical media data set.

In yet another implementation, the obtaining a second feature vector of the first media data according to the first feature vector and the one or more second media data includes:

and extracting the second feature vector in a neural network for extracting the media data according to the first feature vector, wherein the neural network for extracting the media data is trained according to one or more initial feature vectors and one or more second media data.

In yet another implementation, the method further comprises:

receiving a recall indication of a server, wherein the recall indication is used for indicating to delete first media data of which the similarity of the second feature vector and the feature vector of the reference media data is greater than or equal to a first set value;

and comparing the second feature vector with the feature vector of the reference media data, and deleting the first media data if the similarity between the second feature vector and the feature vector of the reference media data is greater than or equal to a first set value.

In yet another implementation, the method further comprises:

and comparing the second feature vector with a plurality of feature vectors of third media data to obtain third media data of which the similarity between the feature vectors and the second feature vector is greater than or equal to a second set value.

In a second aspect, there is provided a media data processing apparatus comprising:

a first obtaining unit, configured to obtain a first feature vector of first media data according to a first spectrogram of the first media data, where the first feature vector is an initial feature vector of the first media data;

a determining unit, configured to determine, according to the first feature vector and a historical media data set, one or more second media data adjacent to the first media data in the historical media data set, where the historical media data set includes a set of historical media data with relevance for a user selection behavior;

a second obtaining unit, configured to obtain a second feature vector of the first media data according to the first feature vector and the one or more second media data, where the second feature vector is an optimized feature vector of the first media data.

In one implementation, the apparatus further comprises:

and the extraction unit is used for respectively extracting the time domain characteristic vector and the frequency domain characteristic vector of the first media data to obtain a first spectrogram of the first media data.

In yet another implementation, the first obtaining unit is configured to encode a first spectrogram of the first media data, and obtain the first feature vector.

In yet another implementation, the apparatus further comprises:

a decoding unit, configured to decode the first feature vector obtained after the encoding to obtain a reconstructed second spectrogram;

an updating unit, configured to update a reconstruction error according to the first spectrogram and the second spectrogram;

and a training unit, configured to train the parameters of the encoding according to the reconstruction error.

In yet another implementation, the apparatus further comprises:

and the classification unit is used for classifying the second media data according to the relevance of the behavior of selecting the second media data by the user to obtain one or more historical media data sets.

In yet another implementation, the determining unit is configured to search the historical media data set for one or more second media data that satisfy the first functional relationship with the first feature vector.

In yet another implementation, the determining unit is configured to obtain a result of a second function according to the first feature vector and the second function, where the result of the second function is one or more second media data adjacent to the first media data in the historical media data set.

In yet another implementation, the second obtaining unit is configured to extract the second feature vector in a neural network for extracting media data according to the first feature vector, where the neural network for extracting media data is trained according to one or more initial feature vectors and one or more second media data.

In yet another implementation, the apparatus further comprises:

a receiving unit, configured to receive a recall instruction of a server, where the recall instruction is used to instruct to delete first media data in which a similarity between the second feature vector and a feature vector of reference media data is greater than or equal to a first set value;

a first comparing unit for comparing the second feature vector with a feature vector of reference media data;

and the deleting unit is used for deleting the first media data if the similarity between the second characteristic vector and the characteristic vector of the reference media data is larger than or equal to a first set value.

In yet another implementation, the apparatus further comprises:

a second comparing unit for comparing the second feature vector with feature vectors of a plurality of third media data;

and the third acquisition unit is used for acquiring third media data of which the similarity between the feature vector and the second feature vector is greater than or equal to a second set value.

In a third aspect, there is provided a computer storage medium storing one or more instructions adapted to be loaded by a processor to perform the method described in the first aspect or in any implementation of the first aspect.

By adopting the scheme disclosed by the invention, the following technical effects are achieved:

the features of the media data can be accurately obtained by extracting the preliminary feature vector of the media data and further extracting the optimized feature vector according to the preliminary feature vector and a plurality of media data adjacent to the media data in the discrete media data set.

Drawings

Fig. 1 is a schematic flow chart of a media data processing method provided by an embodiment of the present disclosure;

Fig. 2 is a schematic flowchart of another media data processing method provided by an embodiment of the present disclosure;

Fig. 3 is an exemplary spectrogram;

Fig. 4 is a schematic diagram of a model for extracting initial feature vectors of media data;

Fig. 5 is a schematic flowchart of another media data processing method provided by an embodiment of the present disclosure;

Fig. 6 is a schematic structural diagram of a media data processing apparatus according to an embodiment of the present disclosure;

Fig. 7 is a schematic structural diagram of another media data processing apparatus according to an embodiment of the present disclosure.

Detailed Description

Fig. 1 is a schematic flowchart of a media data processing method provided in an embodiment of the present disclosure. The method may include:

S101, obtaining a first feature vector of the first media data according to a first spectrogram of the first media data, wherein the first feature vector is an initial feature vector of the first media data.

The present embodiment extracts the feature of the first media data. First, according to a spectrogram of first media data, an initial feature vector of the first media data is obtained. The spectrogram of the first media data comprises time-domain feature vectors and frequency-domain feature vectors of the first media data. Generally, the horizontal axis of the spectrogram is a time domain feature vector of the first media data, and the vertical axis of the spectrogram is a frequency domain feature vector of the first media data. The first feature vector is obtained based on the time domain feature vector and the frequency domain feature vector of the first media data without considering any other factors, such as the historical behavior of the user selecting the media data, and thus, the first feature vector is referred to as an initial feature vector.

S102, determining one or more second media data adjacent to the first media data in the historical media data set according to the first feature vector and the historical media data set, wherein the historical media data set comprises a group of historical media data with relevance in user selection behavior.

In addition to obtaining the initial feature vector of the first media data, the feature vector of the first media data may be optimized based on the historical behavior of users selecting media data. Specifically, a historical media data set is determined first: media data that users selected one after another, and that are therefore related in selection behavior, are grouped into one class or set to form the historical media data set. One or more second media data in the historical media data set that are adjacent to the first media data are then determined based on the initial feature vector and the historical media data set. Here, "adjacent" may mean that the feature vectors of the media data are close. The number of second media data may be determined empirically.

S103, obtaining a second feature vector of the first media data according to the first feature vector and one or more second media data, wherein the second feature vector is an optimized feature vector of the first media data.

Once the initial feature vector of the first media data and the one or more second media data whose feature vectors are close to that of the first media data have been obtained, the second feature vector of the first media data can be derived from both the features of the first media data itself and the historical behavior of users selecting media data. The second feature vector is optimized relative to the first feature vector because it takes that historical selection behavior into account, so the features of the media data are obtained more accurately and can be used in application scenarios such as users selecting media data.

According to the media data processing method provided by the embodiment of the disclosure, the optimized feature vector is further extracted according to the preliminary feature vector of the media data and the plurality of media data adjacent to the media data in the discrete media data set, so that the features of the media data can be accurately obtained.

Fig. 2 is a schematic flowchart of another media data processing method provided in an embodiment of the present disclosure, where the method may include:

S201, extracting a time-domain feature vector and a frequency-domain feature vector of the first media data, respectively, to obtain a first spectrogram of the first media data.

This embodiment takes feature extraction from an audio signal as an example; the same extraction principle can also be applied to other media data with characteristics similar to those of an audio signal. An audio signal has two dimensions, the time domain and the frequency domain; that is, it can be expressed either as a time sequence or as a frequency sequence. Specifically, the audio signal is first sampled in the time dimension, for example once every 0.1 s, to obtain a discrete time sequence T1 to Tn, where each value represents the magnitude of the audio at that sampling point. The discrete time sequence is then grouped by a fixed time period (for example, 3 s). With a sampling interval of 0.1 s, each group contains 3 s / 0.1 s = 30 values: T1 to T30 form one group, denoted G1, T31 to T60 form G2, and so on. A frequency-domain transform (including but not limited to FFT, MFCC, DFT, etc.) is then applied to each group to obtain a frequency-domain signal, which represents the distribution of the different frequencies contained in that group. The frequency-domain signal is also sampled, for example every 10 Hz, to obtain a discrete frequency sequence. Assuming the frequency range is 0 to f, each frequency sequence contains f/10 values. Every Gi can be represented as such a frequency sequence; different Gi differ only in the values at the same frequencies. In terms of music, passages with heavy bass give Gi large low-frequency values, and passages with strong treble give Gi large high-frequency values. Therefore, Gi may be represented either as the time sequence T1 to T30 or as a frequency sequence, and the two representations are unified in a spectrogram. The spectrogram illustrated in Fig. 3 is obtained by decomposing real audio: the horizontal axis represents time, with a time period of about 1.75 s, that is, one time slice is cut every 1.75 s; the vertical axis represents the frequencies of each time slice, ranging from 110 Hz to 3520 Hz; and the gray level represents the value at each frequency.

As another example, sampling and grouping an audio signal yields a plurality of groups of time sequences. For uniformity, each group is denoted ti, giving n groups t1 to tn in total. Each ti may be transformed into a frequency-domain sequence in the manner described above and sampled to obtain values at m discrete frequencies, which form an m-dimensional vector. The spectrogram of the entire audio signal is therefore a two-dimensional m × n matrix.
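The grouping and frequency-domain transform described above can be sketched as follows. This is a minimal illustration rather than the disclosed implementation: the function name `spectrogram`, the parameters `frame_len` and `num_bins`, and the use of a naive DFT magnitude (instead of an FFT or MFCC) are assumptions made for the example.

```python
import cmath
import math


def spectrogram(samples, frame_len, num_bins):
    """Return an m x n magnitude matrix (m = num_bins rows, n = time groups)."""
    n_frames = len(samples) // frame_len
    columns = []
    for g in range(n_frames):
        # One group Gi of consecutive time samples.
        frame = samples[g * frame_len:(g + 1) * frame_len]
        column = []
        for k in range(num_bins):
            # Naive DFT coefficient of the group at frequency bin k.
            acc = sum(x * cmath.exp(-2j * math.pi * k * t / frame_len)
                      for t, x in enumerate(frame))
            column.append(abs(acc))
        columns.append(column)
    # Transpose: rows are frequency bins, columns are time groups.
    return [[columns[j][k] for j in range(n_frames)] for k in range(num_bins)]


# Toy signal: 3 groups of 30 samples each, a pure tone located at bin 3.
signal = [math.sin(2 * math.pi * 3 * t / 30) for t in range(90)]
spec = spectrogram(signal, frame_len=30, num_bins=15)
```

Here `spec` has 15 rows (m) and 3 columns (n), and the energy of every column is concentrated in the row for bin 3, mirroring how a passage with a dominant pitch shows up as a dark band in Fig. 3.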

S202, encoding a first spectrogram of the first media data to obtain a first feature vector.

This step obtains the initial feature vector of the first media data. In essence, it preliminarily compresses the spectrogram into a vector.

This can be done with an autoencoder, for example a plain AutoEncoder or a Variational AutoEncoder. Specifically, the two-dimensional spectrogram is input into the encoder, and after multiple transformations an intermediate hidden-layer vector h, namely the first feature vector, is output.

Further, before or after S202, the following steps may be further included:

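decoding the first feature vector obtained after encoding to obtain a reconstructed second spectrogram;

updating the reconstruction error according to the first spectrogram and the second spectrogram;

and training the parameters of the encoding according to the reconstruction error.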

Fig. 4 shows the model for extracting the initial feature vector of the media data. After the two-dimensional spectrogram is input into the encoder and transformed multiple times to output the intermediate hidden-layer vector h, the decoder can restore h to a spectrogram, yielding a reconstructed spectrogram. A reconstruction error is constructed between the input spectrogram and the reconstructed spectrogram, and the autoencoder parameters are learned by minimizing this error.

In this way, a hidden vector h_i is obtained for each media data s_i. Because h_i is the result of transforming the spectrogram, it contains the time-domain and frequency-domain information of the media data. However, because h_i is learned only from the reconstruction error of the spectrogram of the music, it is called a preliminary feature, and this stage is called preliminary feature extraction.
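The encode, decode, and reconstruction-error loop can be illustrated with a deliberately small sketch. It assumes a linear autoencoder trained by plain gradient descent on flattened toy spectrograms; the disclosure only states that an AutoEncoder or Variational AutoEncoder may be used, so the architecture, the name `train_autoencoder`, and all hyperparameters here are hypothetical.

```python
import random


def train_autoencoder(data, hidden, epochs=500, lr=0.01, seed=0):
    """Train a linear autoencoder; return (encode, decode, error history)."""
    rng = random.Random(seed)
    dim = len(data[0])
    enc = [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(hidden)]
    dec = [[rng.uniform(-0.1, 0.1) for _ in range(hidden)] for _ in range(dim)]

    def encode(x):
        # Hidden vector h: the preliminary feature of the input.
        return [sum(w * xi for w, xi in zip(row, x)) for row in enc]

    def decode(h):
        # Reconstructed (second) spectrogram from h.
        return [sum(w * hi for w, hi in zip(row, h)) for row in dec]

    def error(x):
        # Squared reconstruction error between input and reconstruction.
        return sum((r - xi) ** 2 for r, xi in zip(decode(encode(x)), x))

    history = []
    for _ in range(epochs):
        for x in data:
            h = encode(x)
            diff = [r - xi for r, xi in zip(decode(h), x)]
            # Gradient of the error with respect to h (uses dec before update).
            grad_h = [2 * sum(dec[j][k] * diff[j] for j in range(dim))
                      for k in range(hidden)]
            for j in range(dim):
                for k in range(hidden):
                    dec[j][k] -= lr * 2 * diff[j] * h[k]
            for k in range(hidden):
                for j in range(dim):
                    enc[k][j] -= lr * grad_h[k] * x[j]
        history.append(sum(error(x) for x in data))
    return encode, decode, history


# Toy "flattened spectrograms" (assumed data, 4 values each).
data = [
    [1.0, 0.5, 0.0, 0.0],
    [0.0, 0.0, 0.5, 1.0],
    [1.0, 0.5, 0.5, 1.0],
    [0.5, 0.25, 0.0, 0.0],
]
encode, decode, history = train_autoencoder(data, hidden=2)
h = encode(data[0])  # preliminary feature vector of the first item
```

The total reconstruction error recorded in `history` falls over the epochs, and `h` plays the role of the preliminary feature h_i. A practical system would more likely use a deep or convolutional encoder from a neural network library.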

Further, a pre-trained model is built on the preliminarily extracted features in combination with the user behavior sequence in order to extract the second feature vector. The second feature vector takes into account the users' earlier behavior of selecting media data and is therefore optimized relative to the initial feature vector, so that more accurate results can be obtained in subsequent applications of the media data features, such as music recall and similarity calculation.

S203, classifying the plurality of second media data according to the relevance of the behavior of selecting the plurality of second media data by the user to obtain one or more historical media data sets.

A historical media data set is determined first: media data that users selected one after another, and that are therefore related in selection behavior, are grouped into one class or set to form the historical media data set. One or more second media data in the historical media data set that are adjacent to the first media data are then determined based on the initial feature vector and the historical media data set. Here, "adjacent" may mean that the feature vectors of the media data are close. The number of second media data may be determined empirically.
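One possible way to form such sets is sketched below. The disclosure does not specify how relevance in selection behavior is measured; this example assumes, purely for illustration, that selections by one user separated by no more than a fixed time gap belong to the same set. The function name `group_into_sets` and the parameter `max_gap_seconds` are hypothetical.

```python
def group_into_sets(events, max_gap_seconds=1800):
    """Group one user's (timestamp_seconds, media_id) events into ordered sets."""
    sets, current, last_ts = [], [], None
    for ts, media_id in sorted(events):
        # Start a new set when the pause since the previous selection is too long.
        if last_ts is not None and ts - last_ts > max_gap_seconds:
            sets.append(current)
            current = []
        current.append(media_id)
        last_ts = ts
    if current:
        sets.append(current)
    return sets


events = [(0, "a"), (60, "b"), (120, "c"), (10000, "d"), (10050, "e")]
history_sets = group_into_sets(events)
```

With the toy events above, the first three selections form one set and the last two form another. Each resulting set is an ordered sequence, which is the form the window-based neighbor selection in the following steps operates on.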

S204, one or more second media data are searched in the historical media data set, and the searched one or more second media data and the first feature vector meet a first functional relation.

This step determines one or more second media data in the historical media data set that are adjacent to the first media data. The determination is similar to CBOW: given one or more known second media data, it is determined whether they and the first media data satisfy a first functional relationship f(s_{i-1}, s_{i+1}, …) = s_i. The second media data s_{i-1}, s_{i+1}, … are the media data surrounding s_i. The number of second media data selected may be chosen empirically; for example, a fixed window is set and the 3 or 4 historical media data around s_i within the window are selected.

As for the specific form of f, that is, how the pre-trained model is constructed, the model structure is the same as that of word2vec. Note, however, that s_i is initialized with the h_i learned by the autoencoder, whereas word2vec/item2vec uses only the one-hot code of the item ID and initializes the vector for s_i randomly. In other words, s_i is first represented by the feature h_i preliminarily extracted in the first stage and is then optimized so that the user behavior sequence is taken into account. The purpose is to fully fuse the time-frequency information of the media data with the user behavior sequence.
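The idea of a CBOW-style model whose item vectors start from the autoencoder features can be sketched as follows. This is an assumption-laden illustration: it uses a full softmax over a tiny catalogue, plain stochastic gradient descent, and hypothetical names (`train_cbow`, `init_vectors`); the disclosure only states that the structure matches word2vec and that s_i is initialized with h_i.

```python
import math
import random


def train_cbow(sequences, init_vectors, window=1, epochs=50, lr=0.05, seed=0):
    """Refine item vectors with a CBOW objective, starting from init_vectors."""
    rng = random.Random(seed)
    items = sorted(init_vectors)
    dim = len(init_vectors[items[0]])
    # Input vectors are initialized from h_i instead of random values.
    emb = {s: list(init_vectors[s]) for s in items}
    # Output (prediction) vectors are initialized randomly.
    out = {s: [rng.uniform(-0.1, 0.1) for _ in range(dim)] for s in items}
    for _ in range(epochs):
        for seq in sequences:
            for i, center in enumerate(seq):
                lo, hi = max(0, i - window), min(len(seq), i + window + 1)
                context = [seq[j] for j in range(lo, hi) if j != i]
                if not context:
                    continue
                # Average of the surrounding items' vectors.
                avg = [sum(emb[c][d] for c in context) / len(context)
                       for d in range(dim)]
                scores = {s: sum(a * b for a, b in zip(avg, out[s]))
                          for s in items}
                top = max(scores.values())
                exps = {s: math.exp(v - top) for s, v in scores.items()}
                total = sum(exps.values())
                grad_avg = [0.0] * dim
                for s in items:
                    # Softmax probability minus the one-hot target.
                    err = exps[s] / total - (1.0 if s == center else 0.0)
                    for d in range(dim):
                        grad_avg[d] += err * out[s][d]
                        out[s][d] -= lr * err * avg[d]
                for c in context:
                    for d in range(dim):
                        emb[c][d] -= lr * grad_avg[d] / len(context)
    return emb


# Hypothetical preliminary features h_i and two historical media data sets.
h = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [0.5, 0.5]}
sets = [["a", "b", "c"], ["a", "c", "b"]]
refined = train_cbow(sets, h)
```

The returned `refined` vectors correspond to the optimized features: they begin at the time-frequency features and are then shifted by co-occurrence in the selection sequences. For a large catalogue, negative sampling or hierarchical softmax would normally replace the full softmax.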

S205, extracting a second feature vector in a neural network for extracting the media data according to the first feature vector, wherein the neural network for extracting the media data is obtained by training according to one or more initial feature vectors and one or more second media data.

S206, receiving a recall instruction of the server, wherein the recall instruction is used for instructing to delete the first media data of which the similarity between the second feature vector and the feature vector of the reference media data is greater than or equal to a first set value.

S207, comparing the second feature vector with the feature vector of the reference media data, and deleting the first media data if the similarity between the second feature vector and the feature vector of the reference media data is larger than or equal to a first set value.

After training of the pre-trained model is completed, the model parameters, i.e., the optimized features h'_i, can be output as the representation of the music s_i, that is, a profile of the music, for use in downstream tasks such as music recall.

Specifically, in a music recall scenario, some released music needs to be recalled, for example because ownership of the music is unclear or the publisher is not qualified to release it. If a user terminal has already downloaded the music, the server needs to instruct the user terminal to delete the music to be recalled. The feature vector of the media data to be recalled (the reference media data) may be extracted first; the second feature vector of the first media data is then compared with the feature vector of the reference media data, and if their similarity is greater than or equal to a first set value, the first media data is deleted from the user terminal. The feature vector of the reference media data may be an optimized feature vector.
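The comparison and deletion on the user terminal can be sketched as below. The disclosure does not name the similarity measure; cosine similarity is assumed here, and the names `apply_recall`, `local_library`, and `first_set_value` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def apply_recall(local_library, reference_vector, first_set_value):
    """Split a library into kept items and items deleted by the recall."""
    kept, deleted = {}, []
    for media_id, vector in local_library.items():
        # Delete items at least as similar to the reference as the threshold.
        if cosine(vector, reference_vector) >= first_set_value:
            deleted.append(media_id)
        else:
            kept[media_id] = vector
    return kept, deleted


library = {"song1": [1.0, 0.0], "song2": [0.0, 1.0], "song3": [0.9, 0.1]}
kept, deleted = apply_recall(library, [1.0, 0.0], first_set_value=0.95)
```

In this toy library, `song1` and `song3` are close enough to the reference vector to be deleted, while `song2` is kept. The choice of threshold determines how aggressively near-duplicates of the recalled item are removed.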

Fig. 5 is a schematic flowchart of another media data processing method provided in an embodiment of the present disclosure, where the method may include:

S301, extracting a time-domain feature vector and a frequency-domain feature vector of the first media data, respectively, to obtain a first spectrogram of the first media data.

S302, encoding a first spectrogram of the first media data to obtain a first feature vector.

S303, classifying the plurality of second media data according to the relevance of the behavior of selecting the plurality of second media data by the user to obtain one or more historical media data sets.

S304, obtaining a result of the second function according to the first feature vector and the second function, wherein the result of the second function is one or more second media data adjacent to the first media data in the historical media data set.

This step determines one or more second media data in the historical media data set that are adjacent to the first media data. The determination is similar to skip-gram: given w (which here can be understood as playing a role similar to that of the first feature vector), the one or more second media data adjacent to the first media data are predicted. Assuming that the first media data is s_i and the second function is f, the result of the second function is obtained as f(s_i) = s_{i-1}, s_{i+1}, …. The second media data s_{i-1}, s_{i+1}, … are the media data surrounding s_i, and the number of second media data selected may be determined empirically.
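The skip-gram view can be made concrete by listing the (center, neighbor) training pairs drawn from one historical media data set. The function name `skipgram_pairs` and the `window` parameter are assumptions for illustration; the disclosure leaves the number of neighbors to experience.

```python
def skipgram_pairs(sequence, window=1):
    """Return (center, neighbor) pairs for every item within the window."""
    pairs = []
    for i, center in enumerate(sequence):
        lo, hi = max(0, i - window), min(len(sequence), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                # The center item s_i is used to predict its neighbor s_j.
                pairs.append((center, sequence[j]))
    return pairs


pairs = skipgram_pairs(["s1", "s2", "s3"], window=1)
# pairs == [("s1", "s2"), ("s2", "s1"), ("s2", "s3"), ("s3", "s2")]
```

Compared with the CBOW-like relation of the embodiment shown in Fig. 2, the direction of prediction is reversed: each s_i is the input and its surrounding items are the targets.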

S305, extracting the second feature vector in a neural network for extracting the media data according to the first feature vector, wherein the neural network for extracting the media data is obtained by training according to one or more initial feature vectors and one or more second media data.

This embodiment differs from the embodiment shown in Fig. 2 in step S304. For steps S301 to S303, refer to steps S201 to S203 of the embodiment shown in Fig. 2; for step S305, refer to step S205 of that embodiment.

S306, comparing the second feature vector with a plurality of feature vectors of third media data to obtain third media data of which the similarity between the feature vectors and the second feature vector is greater than or equal to a second set value.

After training of the pre-trained model is completed, the model parameters, i.e., the optimized features h'_i, can be output as the representation of the music s_i, that is, a profile of the music, for use in downstream tasks such as similarity calculation.

For example, to recommend similar media data to the user, the second feature vector may be compared to feature vectors of a plurality of third media data. The feature vector of the third media data may be an optimized feature vector. And acquiring the similarity between the second feature vector of the first media data and the feature vector of each third media data, acquiring third media data with the similarity value larger than or equal to a second set value, and recommending the acquired third media data to the user. This may allow for more targeted recommendation of similar media data to the user.
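A minimal sketch of this similarity-based selection follows. Cosine similarity and the descending sort are assumptions, as are the names `similar_media`, `candidates`, and `second_set_value`; the disclosure only requires that the similarity be greater than or equal to the second set value.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def similar_media(query_vector, candidates, second_set_value):
    """Return (media_id, similarity) pairs at or above the threshold."""
    scored = [(media_id, cosine(query_vector, vector))
              for media_id, vector in candidates.items()]
    hits = [(m, s) for m, s in scored if s >= second_set_value]
    # Most similar candidates first.
    return sorted(hits, key=lambda pair: pair[1], reverse=True)


third_media = {"t1": [1.0, 0.0], "t2": [0.0, 1.0], "t3": [1.0, 1.0]}
recommended = similar_media([1.0, 0.0], third_media, second_set_value=0.7)
```

For the toy vectors above, `t1` and `t3` pass the threshold and `t2` does not. With a large catalogue, an approximate nearest-neighbor index would typically replace this linear scan.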

Based on the same concept as the media data processing method in the foregoing embodiments, an embodiment of the present disclosure further provides a media data processing apparatus, as shown in Fig. 6. The apparatus 1000 comprises a first obtaining unit 11, a determining unit 12, and a second obtaining unit 13, and may further include an extracting unit 14, a decoding unit 15, an updating unit 16, and a training unit 17 (shown by dotted lines in the figure).

Wherein:

a first obtaining unit 11, configured to obtain a first feature vector of first media data according to a first spectrogram of the first media data, where the first feature vector is an initial feature vector of the first media data;

a determining unit 12, configured to determine, according to the first feature vector and a historical media data set, one or more second media data adjacent to the first media data in the historical media data set, where the historical media data set includes a set of historical media data with relevance to a user selection behavior;

a second obtaining unit 13, configured to obtain a second feature vector of the first media data according to the first feature vector and the one or more second media data, where the second feature vector is an optimized feature vector of the first media data.

In one implementation, the extracting unit 14 is configured to extract a time-domain feature vector and a frequency-domain feature vector of the first media data, respectively, to obtain a first spectrogram of the first media data.

In yet another implementation, the first obtaining unit 11 is configured to encode a first spectrogram of the first media data, and obtain the first feature vector.

In yet another implementation, the decoding unit 15 is configured to decode the first feature vector obtained after the encoding to obtain a reconstructed second spectrogram;

the updating unit 16 is configured to update a reconstruction error according to the first spectrogram and the second spectrogram;

the training unit 17 is configured to train the parameters of the encoding according to the reconstruction error.

In yet another implementation, the determining unit 12 is configured to search the historical media data set for one or more second media data that satisfy the first functional relationship with the first feature vector.

In yet another implementation, the determining unit 12 is configured to obtain a result of a second function according to the first feature vector and the second function, where the result of the second function is one or more second media data adjacent to the first media data in the historical media data set.

In yet another implementation, the apparatus further comprises:

According to the media data processing device provided by the embodiment of the disclosure, the optimized feature vector is further extracted according to the preliminary feature vector of the media data and the plurality of media data adjacent to the media data in the discrete media data set, so that the features of the media data can be accurately obtained.

Fig. 7 is a schematic structural diagram of another media data processing apparatus according to an embodiment of the disclosure. In one embodiment, the media data processing apparatus may correspond to the embodiments of Fig. 1, Fig. 2, or Fig. 5 described above. As shown in Fig. 7, the media data processing apparatus may include a processor, a network interface, and a memory, and may further include a user interface and at least one communication bus. The communication bus implements connection and communication among these components. The user interface may include a display and a keyboard, and may optionally also include a standard wired interface and a standard wireless interface. The network interface may optionally include a standard wired interface and a wireless interface (e.g., a Wi-Fi interface). The memory may be a high-speed RAM memory or a non-volatile memory, such as at least one disk memory. The memory may optionally also be at least one storage device located remotely from the processor. As shown in Fig. 7, the memory, which is a kind of computer-readable storage medium, may include an operating system, a network communication module, a user interface module, and a device control application program.

In the media data processing apparatus shown in fig. 7, the network interface may provide a network communication function; the user interface mainly provides an input interface for a user; and the processor may be configured to call the device control application program stored in the memory to implement the media data processing method described in the embodiment corresponding to any one of fig. 1, fig. 2, or fig. 5, which is not described again here. The beneficial effects are the same as those of the method and are likewise not repeated.

It should be understood that the media data processing apparatus described in the embodiments of the present disclosure may perform the media data processing method described in the embodiment corresponding to any one of fig. 1, fig. 2, or fig. 5, which is not described again here. The beneficial effects are the same as those of the method and are likewise not repeated.

Further, it should be noted that the present disclosure also provides a computer-readable storage medium that stores the computer program executed by the aforementioned media data processing apparatus 1000. The computer program includes program instructions, and when the processor executes the program instructions, the media data processing method described in the embodiment corresponding to any one of fig. 1, fig. 2, or fig. 5 can be performed, so details are not repeated here. The beneficial effects are the same as those of the method and are likewise not repeated. For technical details not disclosed in the embodiments of the computer-readable storage medium of the present disclosure, refer to the description of the method embodiments of the present disclosure.

It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.

In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other ways. For example, the division into units is only a logical functional division, and other divisions are possible in an actual implementation; for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted or not executed. The mutual coupling, direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection between devices or units through some interfaces, and may be electrical, mechanical or in another form.

Units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.

In the above embodiments, the implementation may be realized wholly or partially by software, hardware, firmware, or any combination thereof. When implemented in software, the embodiments may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, the procedures or functions according to the embodiments of the present application are wholly or partially generated. The computer may be a general purpose computer, a special purpose computer, a network of computers, or another programmable device. The computer instructions may be stored on or transmitted over a computer-readable storage medium. The computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wire (e.g., coaxial cable, optical fiber, or Digital Subscriber Line (DSL)) or wirelessly (e.g., by infrared, radio, or microwave). The computer-readable storage medium may be any available medium that can be accessed by a computer, or a data storage device, such as a server or a data center, that integrates one or more available media. The available medium may be a read-only memory (ROM), a random access memory (RAM), a magnetic medium such as a floppy disk, a hard disk, a magnetic tape, or a magnetic disk, an optical medium such as a Digital Versatile Disc (DVD), or a semiconductor medium such as a Solid State Disk (SSD).

Claims

1. A method for media data processing, comprising:

2. The method of claim 1, further comprising:

3. The method of claim 1 or 2, wherein the obtaining a first feature vector of the first media data according to the first spectrogram of the first media data comprises:

4. The method according to claim 1 or 2, characterized in that the method further comprises:

and training the coded parameters according to the re-error.

5. The method according to claim 1 or 2, characterized in that the method further comprises:

6. The method of claim 1 or 5, wherein the determining one or more second media data of the set of historical media data that are adjacent to the first media data according to the first feature vector and the set of historical media data comprises:

7. The method of claim 1 or 5, wherein the determining one or more second media data of the set of historical media data that are adjacent to the first media data according to the first feature vector and the set of historical media data comprises:

8. The method of claim 1, further comprising:

9. A media data processing apparatus, comprising:

10. A computer storage medium having stored thereon one or more instructions adapted to be loaded by a processor and to perform the method of any of claims 1 to 8.