CN111583890A - Audio classification method and device - Google Patents


Info

Publication number
CN111583890A
CN111583890A
Authority
CN
China
Prior art keywords
audio
classification
classified
spectrogram
audio frequency
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910117805.7A
Other languages
Chinese (zh)
Inventor
陈燕青
李腾
陈斯枫
黄杰
陆品冰
马辉
任佳亮
张启晟
张宏吉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alibaba Group Holding Ltd
Original Assignee
Alibaba Group Holding Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alibaba Group Holding Ltd filed Critical Alibaba Group Holding Ltd
Priority to CN201910117805.7A
Publication of CN111583890A
Legal status: Pending

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10H ELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
    • G10H 1/00 Details of electrophonic musical instruments
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/03 Speech or voice analysis techniques characterised by the type of extracted parameters
    • G10L 25/18 Speech or voice analysis techniques in which the extracted parameters are spectral information of each sub-band
    • G10L 25/48 Speech or voice analysis techniques specially adapted for particular use
    • G10H 2210/00 Aspects or methods of musical processing having intrinsic musical character, i.e. involving musical theory or musical parameters or relying on musical knowledge, as applied in electrophonic musical tools or instruments
    • G10H 2210/031 Musical analysis, i.e. isolation, extraction or identification of musical elements or musical parameters from a raw acoustic signal or from an encoded audio signal
    • G10H 2210/036 Musical analysis of musical genre, i.e. analysing the style of musical pieces, usually for selection, filtering or classification

Abstract

A machine learning based audio classification and corresponding song recommendation scheme is disclosed. The classification method comprises the following steps: performing spectral transformation on the audio to be classified to obtain an audio spectrogram; feeding the audio spectrogram into a Machine Learning (ML) image classifier for classification; and determining the category of the audio to be classified according to the classification result of the ML image classifier. Here, the ML image classifier may in particular be a trained ANN, such as a CNN classifier, or a labeler that makes only a binary (yes/no) decision. By visualizing the spectrum of the audio to be classified and applying machine learning to image classification, the audio can be classified more objectively and accurately from the perspective of its spectral distribution. Before image classification, the spectral images can be fed into, for example, a specially trained ANN for dimension reduction, so that the main audio information is preserved while the complexity of image classification is greatly reduced, improving classification and labeling efficiency.

Description

Audio classification method and device
Technical Field
The invention relates to the field of audio analysis, in particular to an audio classification method and device based on machine learning.
Background
When listening to music with music playing software, a user often wants not only to play songs he or she already likes, but also to expand the range of songs heard and discover songs not yet listened to that may appeal. To that end, the user may click on a recommended song list for a particular topic. Whether the recommended song list really contains songs the user loves thus becomes an important criterion for measuring user experience.
In the prior art, a song library labels and classifies each song according to its characteristics; the labeling parameters may include the singer, performer, album, year, language, music style, and the like. Whether the labeling is manual or automatic, existing labeling and classification methods struggle to select, with a relatively uniform standard, songs similar in style, or even in "feel", to the songs the user likes, so it is difficult to generate a high-quality recommended song list.
For this reason, a solution capable of efficiently and objectively classifying audio is required.
Disclosure of Invention
In view of the above, the present invention provides an audio classification and corresponding song recommendation scheme based on artificial intelligence, which visualizes the spectrum of the audio to be classified and performs image classification with a machine learning classifier, so as to classify the audio more objectively and accurately from the perspective of its spectral distribution. Before the spectral image is classified, dimension-reduction processing, for example by an auto-encoder, can be performed, which greatly reduces the complexity of image classification while preserving the main audio information, further improving classification and labeling efficiency.
According to an aspect of the present invention, there is provided an audio classification method, including: performing spectral transformation on the audio to be classified to obtain an audio spectrogram, for example, performing framing and spectral transformation on the audio to be classified to obtain per-frame audio spectrograms; feeding the audio spectrogram into a Machine Learning (ML) image classifier for classification; and determining the category of the audio to be classified according to the classification result of the ML image classifier. Here, the ML image classifier may in particular be a trained CNN classifier and/or a labeler for determining whether the input picture belongs to a certain style or "listening feel". The invention can thus use an artificial neural network to mine spectral features and objectively and efficiently grasp the audio content.
Preferably, the classification method may further include: the audio spectrogram that is spectrally reduced, for example, using an auto-encoder, is reduced in dimension to obtain a simplified audio spectrogram, and what is fed into the ML image classifier for classification is the simplified audio spectrogram. Therefore, the classification processing capacity of the classifier is improved.
For per-frame audio spectrograms, the method of the present invention may further comprise: combining the per-frame audio spectrograms to obtain a frame-combined audio spectrogram, and it is this frame-combined audio spectrogram that is fed into the ML image classifier for classification.
Preferably, the song to be classified includes a plurality of audios to be classified, and the method may further include: and determining the classification of the song to be classified according to the classification result of the included multiple audios to be classified.
Preferably, feeding the audio spectrogram into a Machine Learning (ML) image classifier for classification may include: feeding the audio spectrogram into a plurality of different labelers for label determination; and determining the classification of the audio to be classified according to the classification result of the ML image classifier may include: marking the audio to be classified with the corresponding labels according to the determination result of each labeler. This improves the throughput of parallel labeling.
Preferably, the classification method of the present invention may further include: collecting a plurality of audio frequency spectrograms of the audio frequency to be classified and the classification results of the audio frequency spectrograms; and retraining the ML image classifier for classification based on a plurality of audio frequency spectrograms of the audio to be classified and the classification results thereof. Therefore, the training material library of the model can be enriched continuously, and a richer sample is provided for subsequent model training while the classification precision of the existing model (for example, an ANN classification model) is improved.
In one embodiment, the invention may also be implemented as a song recommendation method comprising: performing image classification on the audio to be classified according to the above, wherein the audio to be classified is a song to be classified or a part of the song to be classified; and generating a recommended song list containing other songs in the category of the songs based on at least the category of the songs in the playing history of the user.
According to another aspect of the present invention, there is provided an audio classification apparatus including: an audio imaging device for performing spectral transformation on the audio to be classified to obtain an audio spectrogram, for example, performing framing and spectral transformation to obtain per-frame audio spectrograms; an image classification device for feeding the audio spectrogram into a Machine Learning (ML) image classifier for classification; and a classification determining device for determining the category of the audio to be classified according to the classification result of the ML image classifier. Here, the ML image classifier may be a trained CNN classifier and/or a labeler for determining whether the input picture belongs to a certain style.
In one embodiment, the audio classification apparatus may further include: the image simplifying device is used for sending the audio frequency spectrogram into a self-encoder used for reducing the spectral dimension of the sent audio frequency spectrogram so as to obtain a reduced-dimension audio frequency spectrogram, and the image classifying device is used for sending the ML image classifier to classify the reduced-dimension audio frequency spectrogram.
In the case of using framed audio spectrograms, the audio classification apparatus may further include: an audio image combination device for combining the per-frame audio spectrograms to obtain a frame-combined audio spectrogram, which is fed into the ML image classifier for classification.
Preferably, the song to be classified includes a plurality of audios to be classified, and the classification determining means is further configured to: and determining the classification of the song to be classified according to the classification result of the included multiple audios to be classified.
Preferably, the image classification device is further configured to send the audio spectrogram to a plurality of different labelers for label determination, and the classification determination device is configured to: and marking corresponding labels for the audio to be classified according to the judgment result of each labeler.
Preferably, the audio classification apparatus may further include: retraining apparatus for: collecting a plurality of audio frequency spectrograms of the audio frequency to be classified and the classification results of the audio frequency spectrograms; and retraining the ML image classifier for classification based on a plurality of audio frequency spectrograms of the audio to be classified and the classification results thereof. The retraining apparatus may be the training apparatus itself or a part thereof.
According to another aspect of the present invention, there is provided a song recommending apparatus including: the audio classification apparatus as described above, wherein the audio to be classified is a song to be classified or a part thereof; and a song list generating device for generating, based at least on the categories of songs in the user's playing history, a recommended song list containing other songs in those categories.
According to yet another aspect of the invention, there is provided a computing device comprising: a processor; and a memory having executable code stored thereon, which when executed by the processor, causes the processor to perform the audio classification and/or song recommendation method as described above.
According to yet another aspect of the present invention, a non-transitory machine-readable storage medium is presented having executable code stored thereon, which when executed by a processor of an electronic device, causes the processor to perform the audio classification and/or song recommendation method as described above.
The audio classification and corresponding song recommendation scheme according to the invention is described in detail herein with reference to the accompanying drawings. By spectrally imaging the audio and classifying the images with a machine learning classifier, the classification scheme provided by the invention can more objectively find audio with similar spectral distribution characteristics, thereby improving the objectivity and efficiency of song classification. Furthermore, the spectrogram to be classified can be reduced in dimension and simplified, further improving the efficiency of the subsequent classification computation.
Drawings
The above and other objects, features and advantages of the present disclosure will become more apparent by describing in greater detail exemplary embodiments thereof with reference to the attached drawings, in which like reference numerals generally represent like parts throughout.
Fig. 1 shows an example of the composition layers of a typical CNN.
Fig. 2 shows a schematic flow diagram of a method for audio classification using images according to an embodiment of the invention.
Fig. 3 shows an example of calculating the classification of an entire song based on the classification result of the segmented audio to which it belongs.
Fig. 4 shows an example of audio marking according to the invention.
Fig. 5 shows an example of song sorting according to the present invention.
Fig. 6 shows a schematic composition diagram of an audio classification apparatus according to an embodiment of the invention.
FIG. 7 illustrates a block diagram of a computing device that may be used to implement the audio classification and/or song recommendation methods described above, according to one embodiment of the invention.
Detailed Description
Preferred embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While the preferred embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
Today, with ever-increasing personalization demands, users increasingly want their listening to match a "feel". For example, a user may wish to listen to relaxing jazz while reading, to fast-paced pop while exercising, or to keep listening to ancient-style music after chancing upon an ancient-style song. In general, the overall "atmosphere" presented by a song is difficult to capture accurately with existing song labeling parameters (such as singer, year, language, or even music style), and song lists generated by music editors or classified manually by other users rarely always match the current user's taste. In other words, the prior art lacks an efficient solution that can capture the overall "song atmosphere" from all aspects with relatively objective classification rules.
In view of the above, the present invention provides an audio classification and corresponding song recommendation scheme based on artificial intelligence, which performs spectrum visualization on audio to be classified, and performs image classification by using a specially trained machine learning classifier, so as to classify the audio more objectively and accurately from the perspective of spectrum distribution. Before the spectral image is classified, for example, the dimension reduction processing of an autoencoder can be performed, so that the complexity of image classification is greatly reduced while the main audio information is saved, and the classification and marking efficiency is further improved. Here, "audio to be classified" may refer to a complete song to be classified, or may be a part of a complete song.
Basic vocal music knowledge tells us that the melody, the singer's vocal style, and the timbre of the instruments or accompaniment are all reflected in the frequency spectrum of the audio. For example, the high, middle, and low registers of male and female voices are reflected in the fundamental frequency; pop, bel canto, and folk singing styles differ in their overtone distributions; and different musical instruments present different harmonic distributions. In other words, the frequency spectrum of audio can, to some extent, reflect the essential features of its melody and timbre. Thus, the music style or "listening feel" of the audio, i.e., the user's perception while listening to it, can be inferred from an analysis of the spectrum.
The invention adopts a machine learning model to analyze the spectral image of the audio. Herein, "machine learning" refers to an artificial-intelligence method that trains an algorithmic model on large amounts of data, learning regularities from the data in order to make decisions and predictions about real-world events. Existing Machine Learning (ML) classifier models include relatively simple Softmax and SVM classifiers as well as more complex Artificial Neural Network (ANN) models. Among these, the Convolutional Neural Network (CNN) model in particular achieves good results on a wide range of picture classification tasks. The invention therefore visualizes the audio spectrum and exploits the strengths of ANNs (particularly CNN models) in image feature extraction to classify and label the audio objectively, thereby improving the accuracy of song recommendation based on the classification labels.
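To make the spectrum-visualization idea concrete, the following numpy sketch turns a time-domain signal into a framed magnitude spectrogram. The frame length, hop size, and sample rate are illustrative assumptions, not values taken from the patent.

```python
import numpy as np

def framed_spectrogram(signal, frame_len=512, hop=256):
    """Split a mono signal into overlapping frames and take the magnitude
    spectrum of each frame (a short-time Fourier transform). Returns an
    array of shape (num_frames, frame_len // 2 + 1), one row per frame."""
    window = np.hanning(frame_len)
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len] * window
        frames.append(np.abs(np.fft.rfft(frame)))
    return np.array(frames)

# A 1-second 440 Hz tone sampled at 8 kHz: its energy should concentrate
# in one frequency bin across all frames.
sr = 8000
t = np.arange(sr) / sr
spec = framed_spectrogram(np.sin(2 * np.pi * 440 * t))
# Bin spacing is sr / frame_len = 15.625 Hz, so 440 Hz lands near bin 28.
peak_bin = int(np.argmax(spec.mean(axis=0)))
```

Each row of `spec` is one time-divided frequency-domain "slice"; stacking the rows side by side yields the spectrogram image that the classifier consumes.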
As described above, the ANN classifier used in the present invention may be, in particular, a CNN classifier that achieves good results in various image classification tasks. Fig. 1 shows an example of the composition layers of a typical CNN. As shown in fig. 1, a typical CNN consists of a series of layers that run in order.
A CNN is composed of an input layer, an output layer, and a plurality of hidden layers connected in series. The first layer of the CNN reads an input value, such as an input image, and outputs a series of activation values (which may also be referred to as a feature map). Each subsequent layer reads the activation values generated by the previous layer and outputs new activation values. A final classifier outputs the probability of each class to which the input image may belong.
These layers can be broadly divided into weighted layers (e.g., convolutional layers, fully-connected layers, batch normalization layers, etc.) and unweighted layers (e.g., pooling layers, ReLU layers, Softmax layers, etc.). The CONV (convolution) layer takes a series of feature maps as input and convolves them with a convolution kernel to obtain output activation values. A pooling layer is typically connected to a CONV layer and outputs the maximum or average value of each subarea of each feature map, thereby reducing the computational effort by sub-sampling while maintaining some degree of invariance to displacement, scale, and deformation. A CNN may alternate multiple times between convolutional and pooling layers, gradually reducing the spatial resolution while increasing the number of feature maps. The network may then be connected to at least one fully connected layer, which applies a linear transformation to its input feature vector to produce a one-dimensional vector output comprising a plurality of feature values.
In general, the operation of weighted layers can be represented as:
Y=WX+b,
where W is the weight value, b is the bias, X is the input activation value, and Y is the output activation value.
The operation of the unweighted layer can be represented as:
Y=f(X),
wherein f (X) is a non-linear function.
Here, "weights" refer to the parameters in the hidden layers (which in a broad sense can include the biases); they are learned through the training process and remain unchanged at inference time. Activation values, also referred to as feature values, are the values transferred between layers, starting from the input layer; the output of each layer is obtained by operating on its input values with the weights. Unlike the weights, the distribution of activation values varies dynamically with the input data sample.
Before using CNN for reasoning (e.g., image classification), CNN needs to be trained first. Parameters, such as weights and biases, of the various layers of the neural network model are determined through a large import of training data.
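The weighted-layer formula Y = WX + b and the unweighted formula Y = f(X) above can be sketched directly in numpy. The layer sizes and random weights below are placeholders for parameters that a real network would learn during training.

```python
import numpy as np

rng = np.random.default_rng(0)

def weighted_layer(X, W, b):
    """Weighted layer: Y = WX + b (the core of a fully-connected layer)."""
    return W @ X + b

def relu(X):
    """Unweighted layer: Y = f(X) with f the ReLU non-linearity."""
    return np.maximum(X, 0.0)

def softmax(X):
    """Final classifier layer: turns scores into class probabilities."""
    e = np.exp(X - X.max())
    return e / e.sum()

# Toy 2-layer network: 80-dim input -> 16 hidden units -> 4 classes.
X = rng.standard_normal(80)
W1, b1 = rng.standard_normal((16, 80)), np.zeros(16)
W2, b2 = rng.standard_normal((4, 16)), np.zeros(4)
probs = softmax(weighted_layer(relu(weighted_layer(X, W1, b1)), W2, b2))
```

Training adjusts W1, b1, W2, b2 so that `probs` assigns high probability to the correct class; at inference they stay fixed, matching the description above.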
Fig. 2 shows a schematic flow diagram of a method for audio classification using images according to an embodiment of the invention. The method is particularly suitable for classifying massive songs contained in a large library of songs using a trained machine learning model, particularly a CNN model.
In step S210, the audio to be classified is subjected to frequency spectrum conversion to obtain an audio frequency spectrogram. In step S220, the audio spectrogram is fed into a Machine Learning (ML) image classifier for classification. In step S230, the classification of the audio to be classified is determined according to the classification result of the ML image classifier.
Thus, an ML model for classification may first be trained using, for example, audio with existing manual classifications as classification samples; during training, the ML model learns, through iterative convergence, the spectral classification rules in the sample images that embody the classification features. The spectrogram of the audio to be classified can then be fed into the trained ML model for classification, and the category of the audio determined from the image classification result. Further, the audio spectrograms of a plurality of audios classified over a period of time, together with their classification results, may be collected, and the ML image classifier retrained on them. In one embodiment, the collected audio classification results may be subsequently revised classification results. For example, when the classifier is a binary classifier used for labeling as described below, for audio for which the classifier outputs a probability near 0.5, a final confirmation of whether the audio should be labeled may be made via other means (e.g., manual confirmation).
In step S210, the audio framing and spectral transformation may be performed using a short-time Fourier transform, thereby obtaining per-frame audio spectrograms. In order to capture melody-related features across frames, the per-frame spectrograms may be combined to obtain a frame-combined audio spectrogram, and the audio spectrogram fed into the ML image classifier for classification in step S220 may be this frame-combined spectrogram. Accordingly, the classification samples previously used to train the ML image classifier may also be frame-combined audio spectrograms. In this way, the ML image classifier learns, during training, time-varying regularities of pitch, overtones, rhythm, and the like from stitched audio spectra of a specific duration, and extracts these regularities during classification.
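The frame-combination step can be sketched as stacking the per-frame spectra into a single frequency-by-time image for the image classifier; the shapes below are illustrative.

```python
import numpy as np

def combine_frames(frame_spectra):
    """Stack per-frame spectra side by side into a single 2-D 'spectrogram
    image' (frequency x time) that an image classifier can consume."""
    return np.stack(frame_spectra, axis=1)

# 30 dummy 257-bin frame spectra; frame i is filled with the value i so we
# can see that column i of the image corresponds to frame i.
frames = [np.full(257, float(i)) for i in range(30)]
image = combine_frames(frames)
```

Rows of `image` are frequency bins and columns are time frames, so time-varying patterns (rhythm, pitch movement) appear as horizontal structure that a CNN can learn.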
According to different implementations, the ML image classifier used in the present invention may be a relatively simple Softmax or SVM classifier, or an ANN; in particular, it may be a CNN classifier based on the principle shown in fig. 1. CNN classifiers of different depths (e.g., different numbers of hidden layers) and different complexities can be trained. For example, a CNN classifier capable of multi-class classification, e.g., one covering ten or more label classes, may be trained on a large number of accurately classified samples. In one embodiment, simpler classifiers may also be trained. Such a classifier may be, for example, a labeler for determining whether an input picture belongs to a certain style. The labeler, which may also be referred to hereinafter as a marker, is the simplest classifier, comprising only two categories: yes and no. A labeler can be realized by a trained CNN model, or by a Softmax or SVM model that is simpler in structure and training.
To this end, step S220 may include: feeding the audio spectrogram into a plurality of different labelers for label determination. Step S230 may then include: marking the audio to be classified with the corresponding labels according to the determination result of each labeler. In other words, each classifier may be a labeler with only two classes, positive and negative, and the spectrogram of the audio (or song) to be classified may be fed into multiple parallel labelers as needed, which describe the atmosphere or style of the song from different angles. For example, a song may simultaneously be judged "yes" by a "relaxing jazz" labeler and also "yes" by a "Western lyrical voice" labeler. It should be appreciated that models with fewer classes are more lightweight, easier to train, less computationally intensive at classification time, and relatively accurate in their results. In other words, compared with a complex CNN multi-class classifier, an implementation using multiple parallel labelers is more flexible, and its classification is relatively more accurate.
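A bank of parallel binary labelers as described above might be organized as follows. The label names and the constant "models" are purely hypothetical stand-ins for trained yes/no classifiers.

```python
def apply_labelers(spectrogram, labelers):
    """Run one spectrogram through several independent yes/no labelers and
    collect the tags whose labeler answered 'yes'. Each labeler maps a
    spectrogram to a probability in [0, 1]; >= 0.5 counts as 'yes'."""
    return [tag for tag, model in labelers.items() if model(spectrogram) >= 0.5]

# Hypothetical stand-in labelers; a real system would use trained models
# (e.g., small CNN, Softmax, or SVM classifiers) instead of constants.
labelers = {
    "relaxing jazz": lambda s: 0.9,
    "lyrical voice": lambda s: 0.7,
    "fast pop":      lambda s: 0.2,
}
tags = apply_labelers(None, labelers)
```

Because each labeler is independent, new style tags can be added by training one more lightweight binary model, without retraining a monolithic multi-class classifier.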
An example of how to determine whether a song may be tagged based on a plurality of audio segments and the categorization of the tagger will be described below in conjunction with fig. 3. Fig. 3 shows an example of calculating the classification of an entire song based on the classification result of the segmented audio to which it belongs. As described above, in order to facilitate classification while expressing a time-varying rule, the length of the spectrum image processed by the ML image classifier may be selected appropriately. In one embodiment, a song to be classified may include a plurality of audios to be classified, and the classification method of the present invention may further include a step of determining the classification of the song to be classified according to a classification result of the included plurality of audios to be classified.
As shown, a particular length is selected for partitioning the audio of a song. For example, a 3-minute song (time-domain audio signal) is divided into 36 audio segments of 5 seconds each, and each segment is spectrally imaged (yielding time-divided frequency-domain signals). The frame-combined audio spectrogram of each segment (i.e., the spectrogram representing 5 seconds of audio) is fed into a specific labeler A, and labeler A marks whether each time-divided frequency-domain signal has feature A (Y) or not (N). A weighted score can then be computed from the labeling results of the time-divided frequency-domain signals (for example, with Y = 1 and N = 0, if the total score over the 36 segments is greater than 18, or equivalently the average score per segment is greater than 0.5, the song is considered eligible for the corresponding label A).
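The weighted-scoring rule (Y = 1, N = 0, label the song when the mean segment mark exceeds 0.5) can be checked in a few lines:

```python
def song_label(segment_marks, threshold=0.5):
    """Aggregate per-segment yes/no marks (Y=1, N=0) into a song-level
    decision: the label applies when the mean mark exceeds the threshold,
    i.e., when more than half of the segments were marked Y."""
    return sum(segment_marks) / len(segment_marks) > threshold

# 36 five-second segments of a 3-minute song; 20 of them were marked Y.
marks = [1] * 20 + [0] * 16
labeled = song_label(marks)  # 20/36 > 0.5, so the label applies
```

An exact 18-of-36 split gives a mean of exactly 0.5 and does not exceed the threshold, matching the "greater than 18" wording above.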
It should be understood that, in other embodiments, the classification and marking of the song itself may also be performed by performing deduplication processing, sampling processing, and the like on the audio to be classified to which the song belongs, which is not limited by the present invention.
In the invention, for the convenience of subsequent classification computation, the spectrally transformed audio spectrogram can be simplified, on the premise of not significantly reducing its information content, to obtain a simplified audio spectrogram. It is then this simplified audio spectrogram that is fed into the ML image classifier for classification.
The raw spectrogram can be simplified using a machine learning model, in particular an artificial neural network. In one embodiment, the image-simplification model re-encodes the original audio spectrogram so as to preserve the dominant frequency information and discard the relatively less important frequency information. For example, the spectrogram can be simplified using an auto-encoder: a dedicated self-encoder may be constructed and trained to reduce the spectral dimension of the incoming audio spectrogram (e.g., from an 800-dimensional original to an 80-dimensional simplified representation), yielding a reduced-dimension audio spectrogram as the simplified audio spectrogram. The computational cost spent on secondary or useless spectral content is thereby greatly reduced while the main spectral information is retained.
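A linear sketch of the dimension-reducing self-encoder (800 to 80 dimensions) is shown below. The weights are random stand-ins: a real self-encoder would learn them by minimizing reconstruction error over a corpus of spectrograms.

```python
import numpy as np

rng = np.random.default_rng(1)

class LinearAutoencoder:
    """Minimal linear self-encoder sketch: the encoder projects an 800-bin
    spectrum down to an 80-dim code, the decoder maps it back. Weights here
    are untrained placeholders used only to show the data flow."""
    def __init__(self, in_dim=800, code_dim=80):
        self.We = rng.standard_normal((code_dim, in_dim)) / np.sqrt(in_dim)
        self.Wd = rng.standard_normal((in_dim, code_dim)) / np.sqrt(code_dim)

    def encode(self, x):
        return self.We @ x    # 800 -> 80: the simplified representation

    def decode(self, z):
        return self.Wd @ z    # 80 -> 800: reconstruction (used in training)

ae = LinearAutoencoder()
spectrum = rng.standard_normal(800)
code = ae.encode(spectrum)    # fed onward to the classifier/labeler
recon = ae.decode(code)
```

At classification time only `encode` is needed; `decode` exists so that training can compare `recon` against the original spectrum.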
Fig. 4 shows an example of audio marking according to the invention. As shown, the audio spectrogram on the right can be fed into the autoencoder in the middle, which encodes the 801-dimensional input image (Input size: 801) into an 80-dimensional reduced-dimension image (Encode size: 80). The reduced-dimension images (e.g., per-frame images) can be combined (e.g., by sequentially stitching the reduced-dimension images within a 5 s span) and fed into multiple markers (sfc1-4) to perform four markings of the audio to be classified. The marking results of the audio to be classified from each marker can then be integrated to decide whether to mark the song accordingly.
Fig. 5 shows an example of song classification according to the present invention. As shown, the song to be classified is first divided into segments of a length whose stitched spectrogram can be processed by the classifier or marker (e.g., into 5 s audio segments to be classified). Each audio segment is then Fourier transformed to obtain a framed spectral transform map. The spectrogram of each frame is fed into the autoencoder for dimension reduction, and the reduced-dimension frame images belonging to the same audio segment to be classified are stitched together. The combined audio spectrogram is then fed into the classifier/marker to obtain a classification/marking result for that audio segment. It is then determined whether any audio segments of the song remain to be classified. If so, the spectral transformation, dimension reduction, stitching, and classification are repeated; if not, the classification results of the audio segments are integrated to obtain the final classification/marking result for the song. In a broader embodiment, the song classification method of Fig. 5 may further include training the classifier and the autoencoder in advance and later retraining and updating them.
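The Fig. 5 loop can be sketched roughly as below. The segment length, frame size, the slice to the first 80 bins (standing in for the trained autoencoder), and the toy threshold classifier are all assumptions for illustration, not the patent's actual components:

```python
import numpy as np

def toy_classifier(features):
    """Placeholder for a trained ML image classifier/marker (assumption)."""
    return int(features.mean() > 0.1)

def classify_song(song, rate=8000, seg_seconds=5, frame_len=1024, keep=80):
    """Sketch of the Fig. 5 loop: split, transform, reduce, stitch,
    classify each segment, then integrate the per-segment results."""
    seg_len = seg_seconds * rate
    marks = []
    for start in range(0, len(song) - seg_len + 1, seg_len):
        segment = song[start:start + seg_len]
        usable = (len(segment) // frame_len) * frame_len
        frames = segment[:usable].reshape(-1, frame_len)   # framing
        spec = np.abs(np.fft.rfft(frames, axis=1))         # spectral transform
        reduced = spec[:, :keep]          # stand-in for autoencoder reduction
        stitched = reduced.reshape(-1)    # stitch the segment's frame images
        marks.append(toy_classifier(stitched))
    return sum(marks) / len(marks) > 0.5  # integrate segment results

song = np.sin(2 * np.pi * 300 * np.arange(30 * 8000) / 8000)  # 30 s tone
print(classify_song(song))
```

The loop structure (per-segment processing followed by integration) mirrors the figure; everything inside a loop iteration would be replaced by the trained autoencoder and classifier in a real system.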
The audio classification method according to the present invention and its preferred embodiments have thus been described above with reference to Figs. 2 to 5. In one embodiment, the invention may also be implemented as a song recommendation method comprising: performing image-based classification on audio to be classified as described above, wherein the audio to be classified is a song to be classified or a part thereof; and generating, based at least on the classification of songs in the user's playing history, a recommended song list containing other songs of that classification.
Fig. 6 shows a schematic composition diagram of an audio classification apparatus according to an embodiment of the invention. As shown, the audio classification device 600 may include an audio imaging device 610, an image classification device 620, and a classification determination device 630.
The audio imaging apparatus 610 may be configured to perform a spectral transformation on the audio to be classified to obtain an audio spectrogram, for example, by framing the audio to be classified and spectrally transforming each frame to obtain audio frame spectrograms. The image classification apparatus 620 is used to send the audio spectrogram into a machine learning (ML) image classifier for classification. The classification determination apparatus 630 is configured to determine the classification of the audio to be classified according to the classification result of the ML image classifier. Here, the ML image classifier may be a trained CNN classifier, or a marker for determining whether an input picture belongs to a certain style.
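A minimal sketch of what the audio imaging apparatus 610 might compute — framing plus a per-frame spectral transform — is shown below; the frame length, hop size, and Hann window are assumptions, not values specified by the patent:

```python
import numpy as np

def frame_spectrogram(signal, frame_len=1024, hop=512):
    """Frame a 1-D signal with overlap and take the magnitude of the real
    FFT of each windowed frame: a (n_frames, frame_len//2 + 1) image."""
    n_frames = 1 + (len(signal) - frame_len) // hop
    window = np.hanning(frame_len)
    frames = np.stack([signal[i * hop : i * hop + frame_len] * window
                       for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1))

rate = 8000
t = np.arange(5 * rate) / rate               # 5 seconds of audio
audio = np.sin(2 * np.pi * 440 * t)          # a pure 440 Hz tone
spec = frame_spectrogram(audio)
print(spec.shape)                            # (77, 513)
```

Each row of the result is one audio frame spectrogram; stacking the rows yields the two-dimensional image that is fed downstream.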
In one embodiment, the audio classification apparatus 600 may further include an image simplification apparatus configured to send the audio spectrogram into an autoencoder that reduces its spectral dimension to obtain a reduced-dimension audio spectrogram; the image classification apparatus 620 then sends the reduced-dimension audio spectrogram into the ML image classifier for classification.
In the case of framed audio spectrograms, the audio classification apparatus 600 may further include an audio image combination apparatus for combining the audio frame spectrograms into a frame-combined audio spectrogram, which is then sent into the ML image classifier for classification.
In one embodiment, the song to be classified includes a plurality of audio segments to be classified, and the classification determination apparatus is further configured to determine the classification of the song to be classified according to the classification results of the included audio segments.
In one embodiment, the image classification apparatus 620 is further configured to send the audio spectrogram to a plurality of different labelers for label determination, and the classification determination apparatus 630 is configured to attach the corresponding labels to the audio to be classified according to the judgment result of each labeler.
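The fan-out to multiple labelers (e.g. the sfc1-4 markers of Fig. 4) can be sketched as follows; the label names and the threshold rules standing in for trained markers are purely illustrative assumptions:

```python
def tag_audio(spectrogram, labelers):
    """Run each (label, judge) pair and keep the labels judged positive."""
    return [label for label, judge in labelers if judge(spectrogram)]

labelers = [
    ("rock",  lambda s: max(s) > 0.8),          # toy stand-ins for
    ("quiet", lambda s: sum(s) / len(s) < 0.2),  # trained markers
    ("dense", lambda s: min(s) > 0.5),
]
print(tag_audio([0.9, 0.0, 0.0, 0.0, 0.0, 0.0], labelers))  # ['rock', 'quiet']
```

Each labeler answers an independent yes/no question about the same spectrogram, so one piece of audio can carry several labels at once, matching the multi-marker design above.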
In one embodiment, the audio classification apparatus 600 may further include a retraining apparatus for: collecting a plurality of audio spectrograms of audio to be classified together with their classification results; and retraining the ML image classifier used for classification based on those spectrograms and classification results. The retraining apparatus may be the training apparatus itself or a part of it; for example, it may be part of a training apparatus for training an ANN, Softmax, or SVM model.
Accordingly, the present invention can also be realized as a song recommendation apparatus comprising: the audio classification apparatus described above, wherein the audio to be classified is a song to be classified or a part thereof; and a song list generation apparatus for generating, based at least on the classification of songs in the user's playing history, a recommended song list containing other songs of that classification.
FIG. 7 illustrates a block diagram of a computing device that may be used to implement the audio classification and/or song recommendation methods described above, according to one embodiment of the invention.
Referring to fig. 7, computing device 700 includes memory 710 and processor 720.
Processor 720 may be a multi-core processor or may include multiple processors. In some embodiments, processor 720 may include a general-purpose host processor and one or more special-purpose coprocessors, such as a graphics processing unit (GPU), a digital signal processor (DSP), or the like. In some embodiments, processor 720 may be implemented using custom circuits, such as an application-specific integrated circuit (ASIC) or a field-programmable gate array (FPGA).
The memory 710 may include various types of storage units, such as system memory, read-only memory (ROM), and permanent storage. The ROM may store static data or instructions required by processor 720 or other modules of the computer. The permanent storage may be a read-write storage device, and may be non-volatile so that stored instructions and data are not lost even after the computer is powered off. In some embodiments, a mass storage device (e.g., a magnetic or optical disk, or flash memory) is employed as the permanent storage; in other embodiments, the permanent storage may be a removable storage device (e.g., a floppy disk or optical drive). The system memory may be a read-write memory device or a volatile read-write memory device, such as dynamic random access memory, and may store instructions and data that some or all of the processors require at runtime. In addition, the memory 710 may include any combination of computer-readable storage media, including various types of semiconductor memory chips (DRAM, SRAM, SDRAM, flash memory, programmable read-only memory) and magnetic and/or optical disks. In some embodiments, memory 710 may include a readable and/or writable removable storage device, such as a compact disc (CD), a read-only digital versatile disc (e.g., DVD-ROM, dual-layer DVD-ROM), a read-only Blu-ray disc, an ultra-density disc, a flash memory card (e.g., an SD card, a mini SD card, a Micro-SD card, etc.), a magnetic floppy disk, or the like. Computer-readable storage media do not contain carrier waves or transitory electronic signals transmitted by wireless or wired means.
The memory 710 has stored thereon executable code that, when processed by the processor 720, may cause the processor 720 to perform the audio classification and/or song recommendation methods described above.
The audio classification and the corresponding song recommendation schemes according to the invention have been described in detail above with reference to the accompanying drawings. By rendering audio spectra as images and applying machine-learning image classification, the classification scheme provided by the invention can use artificial intelligence to find audio with similar spectral distribution characteristics more objectively, thereby improving the objectivity and efficiency of song classification. Furthermore, the spectrogram to be classified can be simplified by AI-based dimension reduction, further improving the efficiency of subsequent classification computation.
Furthermore, the method according to the invention may also be implemented as a computer program or computer program product comprising computer program code instructions for carrying out the above-mentioned steps defined in the above-mentioned method of the invention.
Alternatively, the invention may also be embodied as a non-transitory machine-readable storage medium (or computer-readable storage medium, or machine-readable storage medium) having stored thereon executable code (or a computer program, or computer instruction code) which, when executed by a processor of an electronic device (or computing device, server, etc.), causes the processor to perform the steps of the above-described method according to the invention.
Those of skill would further appreciate that the various illustrative logical blocks, modules, circuits, and algorithm steps described in connection with the disclosure herein may be implemented as electronic hardware, computer software, or combinations of both.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems and methods according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
Having described embodiments of the present invention, the foregoing description is intended to be exemplary, not exhaustive, and not limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein is chosen in order to best explain the principles of the embodiments, the practical application, or improvements made to the technology in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

Claims (22)

1. A method of audio classification, comprising:
performing spectral transformation on audio to be classified to obtain an audio spectrogram;
sending the audio frequency spectrogram into a Machine Learning (ML) image classifier for classification;
and determining the category of the audio to be classified according to the classification result of the ML image classifier.
2. The method of claim 1, further comprising:
and simplifying the spectrally transformed audio spectrogram to obtain a simplified audio spectrogram, and sending the simplified audio spectrogram to the ML image classifier for classification.
3. The method of claim 2, wherein the spectral dimensions of the incoming audio spectrogram are reduced using an auto-encoder to obtain a reduced-dimension audio spectrogram as the simplified audio spectrogram.
4. The method of claim 1, wherein performing spectral transformation on the audio to be classified to obtain an audio spectrogram comprises:
framing the audio to be classified and performing spectral transformation on each frame to obtain audio frame spectrograms.
5. The method of claim 4, further comprising:
and combining the audio frame spectrograms to obtain a frame-combined audio spectrogram, and sending the frame-combined audio spectrogram to the ML image classifier for classification.
6. The method of claim 1, wherein the song to be categorized includes a plurality of audios to be categorized, and,
the method further comprises the following steps:
and determining the classification of the song to be classified according to the classification result of the included multiple audios to be classified.
7. The method of claim 1, wherein the ML image classifier is at least one of:
an ANN classifier;
a Softmax classifier; and
an SVM classifier.
8. The method of claim 7, wherein the ML image classifier is a labeler for determining whether an input picture belongs to a certain style or listening feel.
9. The method of claim 8, wherein feeding the audio spectrogram into an ML image classifier for classification comprises:
sending the audio spectrogram to a plurality of different labelers for label determination, and
Determining the belonging classification of the audio to be classified according to the classification result of the ML image classifier comprises:
and marking corresponding labels for the audio to be classified according to the judgment result of each labeler.
10. The method of claim 1, further comprising:
collecting a plurality of audio spectrograms of the audio to be classified and their classification results; and
retraining the ML image classifier used for classification based on the plurality of audio spectrograms of the audio to be classified and their classification results.
11. A song recommendation method, comprising:
the audio classification step according to any of claims 1-10, wherein the audio to be classified is a song to be classified or a part thereof; and
and generating a recommended song list containing other songs in the category of the songs based on at least the category of the songs in the playing history of the user.
12. An audio classification apparatus comprising:
an audio imaging device for performing spectral transformation on audio to be classified to obtain an audio spectrogram;
the image classification device is used for sending the audio frequency spectrogram into a Machine Learning (ML) image classifier for classification;
and the classification determining device is used for determining the classification of the audio to be classified according to the classification result of the ML image classifier.
13. The apparatus of claim 12, further comprising:
an image simplification device for sending the audio spectrogram into an autoencoder for reducing the spectral dimension of the sent audio spectrogram to obtain a reduced-dimension audio spectrogram, the image classification device sending the reduced-dimension audio spectrogram into the ML image classifier for classification.
14. The apparatus of claim 12, wherein the audio imaging apparatus is further configured to:
framing the audio to be classified and performing spectral transformation on each frame to obtain audio frame spectrograms.
15. The apparatus of claim 14, further comprising:
and the audio image combination device is used for combining the audio frame spectrograms to obtain the audio frequency spectrogram subjected to frame combination, and the audio frequency spectrogram subjected to frame combination is sent to the ML image classifier for classification.
16. The apparatus of claim 12, wherein the song to be categorized comprises a plurality of audios to be categorized, the category determining means further for:
and determining the classification of the song to be classified according to the classification result of the included multiple audios to be classified.
17. The apparatus of claim 12, wherein the ML image classifier is at least one of:
an ANN classifier;
a Softmax classifier;
an SVM classifier; and/or
And the labeler is used for judging whether the input picture belongs to a certain style or a certain listening feeling.
18. The apparatus of claim 17, wherein the image classification means is configured to:
sending the audio spectrogram to a plurality of different labelers for label determination, and
The classification determination means is for:
and marking corresponding labels for the audio to be classified according to the judgment result of each labeler.
19. The apparatus of claim 12, further comprising retraining means for:
collecting a plurality of audio frequency spectrograms of the audio frequency to be classified and the classification results of the audio frequency spectrograms; and
retraining the ML image classifier for classification based on a plurality of audio frequency spectrograms of the audio to be classified and the classification results to which the audio frequency spectrograms belong.
20. A song recommendation apparatus comprising:
the audio classification apparatus according to any of claims 12-19, wherein the audio to be classified is a song to be classified or a part thereof; and
and the song list generating device is used for generating a recommended song list containing other songs in the affiliated classification at least based on the affiliated classification of the songs in the user playing history.
21. A computing device, comprising:
a processor; and
a memory having executable code stored thereon, which when executed by the processor, causes the processor to perform the method of any one of claims 1-11.
22. A non-transitory machine-readable storage medium having stored thereon executable code, which when executed by a processor of an electronic device, causes the processor to perform the method of any one of claims 1-11.
CN201910117805.7A 2019-02-15 2019-02-15 Audio classification method and device Pending CN111583890A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910117805.7A CN111583890A (en) 2019-02-15 2019-02-15 Audio classification method and device

Publications (1)

Publication Number Publication Date
CN111583890A true CN111583890A (en) 2020-08-25

Family

ID=72112588

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910117805.7A Pending CN111583890A (en) 2019-02-15 2019-02-15 Audio classification method and device

Country Status (1)

Country Link
CN (1) CN111583890A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113010728A (en) * 2021-04-06 2021-06-22 金宝贝网络科技(苏州)有限公司 Song recommendation method, system, intelligent device and storage medium
WO2023201635A1 (en) * 2022-04-21 2023-10-26 中国科学院深圳理工大学(筹) Audio classification method and apparatus, terminal device, and storage medium

Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20040260550A1 (en) * 2003-06-20 2004-12-23 Burges Chris J.C. Audio processing system and method for classifying speakers in audio data
US20080022844A1 (en) * 2005-08-16 2008-01-31 Poliner Graham E Methods, systems, and media for music classification
US20130090926A1 (en) * 2011-09-16 2013-04-11 Qualcomm Incorporated Mobile device context information using speech detection
CN106407960A (en) * 2016-11-09 2017-02-15 浙江师范大学 Multi-feature-based classification method and system for music genres
CN108122562A (en) * 2018-01-16 2018-06-05 四川大学 A kind of audio frequency classification method based on convolutional neural networks and random forest
US20180276540A1 (en) * 2017-03-22 2018-09-27 NextEv USA, Inc. Modeling of the latent embedding of music using deep neural network
CN108648767A (en) * 2018-04-08 2018-10-12 中国传媒大学 A kind of popular song emotion is comprehensive and sorting technique
CN108804609A (en) * 2018-05-30 2018-11-13 平安科技(深圳)有限公司 Song recommendations method and apparatus
CN108932950A (en) * 2018-05-18 2018-12-04 华南师范大学 It is a kind of based on the tag amplified sound scenery recognition methods merged with multifrequency spectrogram
CN109087634A (en) * 2018-10-30 2018-12-25 四川长虹电器股份有限公司 A kind of sound quality setting method based on audio classification
CN109166593A (en) * 2018-08-17 2019-01-08 腾讯音乐娱乐科技(深圳)有限公司 audio data processing method, device and storage medium
CN109271550A (en) * 2018-07-27 2019-01-25 华南理工大学 A kind of music personalization classification recommended method based on deep learning




Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination