WO2019109787A1 - Audio classification method, apparatus, smart device and storage medium (音频分类方法、装置、智能设备和存储介质) - Google Patents

Audio classification method, apparatus, smart device and storage medium (音频分类方法、装置、智能设备和存储介质) Download PDF

Info

Publication number
WO2019109787A1
Authority
WO
WIPO (PCT)
Prior art keywords
audio
classified
audio file
neural network
network model
Prior art date
Application number
PCT/CN2018/115544
Other languages
English (en)
French (fr)
Inventor
程亮
甄德聪
Original Assignee
腾讯科技(深圳)有限公司 (Tencent Technology (Shenzhen) Company Limited)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 腾讯科技(深圳)有限公司 (Tencent Technology (Shenzhen) Company Limited)
Publication of WO2019109787A1 publication Critical patent/WO2019109787A1/zh

Links

Images

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/60 - Information retrieval; Database structures therefor; File system structures therefor of audio data
    • G06F 16/68 - Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F 16/683 - Retrieval characterised by using metadata automatically derived from the content
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology
    • G06N 3/045 - Combinations of networks
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/02 - Feature extraction for speech recognition; Selection of recognition unit

Definitions

  • The present application relates to the field of artificial intelligence technology, and in particular to audio classification technology.
  • Current technology relies mainly on manual classification of audio, which consumes large amounts of human resources, takes a long time, is inefficient, and is affected by factors such as the limits of individual knowledge and personal preference, so its objectivity is low.
  • Existing machine-assisted methods still rely on manually related meta-information about the audio, such as the singer or the era, to build models, and likewise suffer from low efficiency and low objectivity. Moreover, as the amount of audio grows ever larger and a great deal of new audio is added every day, missing meta-information is common among these files, making it difficult to classify them accurately.
  • The embodiments of the present application provide an audio classification method, apparatus, smart device and storage medium, which can overcome the limitations of the prior art and improve the accuracy and efficiency of audio classification.
  • An audio classification method, including:
  • acquiring an audio file to be classified; processing an audio signal of the audio file to be classified to generate an input vector representing a first audio feature, the first audio feature being the audio feature corresponding to the audio file to be classified; inputting the input vector into a pre-trained neural network model for audio classification; and analyzing the input vector by means of the neural network model to generate a classification result for the audio file to be classified.
  • An audio classification device comprising:
  • an audio-file acquisition module, configured to acquire an audio file to be classified;
  • An input vector generating module configured to process an audio signal of the audio file to be classified, to generate an input vector representing a first audio feature, where the first audio feature is an audio feature corresponding to the audio file to be classified;
  • An input module configured to input the input vector to a pre-trained neural network model for audio classification
  • a classification result generating module configured to analyze the input vector by using the neural network model to generate a classification result of the audio file to be classified.
  • a smart device that includes:
  • a processor and a memory, the processor and the memory being connected through a communication bus,
  • the processor is configured to invoke and execute a program stored in the memory
  • the memory is configured to store a program, and the program is at least used to execute the audio classification method described above.
  • a storage medium having stored therein computer executable instructions for performing the audio classification method described above.
  • a computer program product comprising instructions which, when run on a computer, cause the computer to perform the audio classification method described above.
  • Compared with the prior art, the embodiments of the present application provide an audio classification method, apparatus, smart device and storage medium.
  • The technical solution provided by the embodiments of the present application first acquires an audio file to be classified, then processes the audio signal of that file to generate an input vector representing a first audio feature, where the first audio feature is an audio feature extracted from the audio file to be classified itself.
  • The input vector is input into a pre-trained neural network model for audio classification, and the neural network model analyzes the input vector to generate a classification result for the audio file to be classified.
  • In other words, the technical solution provided by the embodiments of the present application classifies the audio file based on the audio features of the file itself, by means of a pre-trained neural network model for audio classification, rather than relying on manually related meta-information.
  • It is therefore not only objective but also highly accurate, is little affected by human subjective factors, and, because audio files to be classified can be classified automatically, significantly improves work efficiency. The technical solution provided by the embodiments of the present application can therefore overcome the limitations of the prior art, has high reliability, and is better suited for practical application.
  • FIG. 1 is a flowchart of an audio classification method according to an embodiment of the present application
  • FIG. 2 is a flowchart of a method for generating an input vector representing a first audio feature according to an embodiment of the present application
  • FIG. 3 is a flowchart of a method for extracting an audio signal of an audio file to be classified according to an embodiment of the present disclosure
  • FIG. 4 is a mel-scale spectrogram according to an embodiment of the present application.
  • FIG. 5 is a flowchart of another method for generating an input vector representing a first audio feature according to an embodiment of the present disclosure
  • FIG. 6 is a flowchart of another method for generating an input vector representing a first audio feature according to an embodiment of the present disclosure
  • FIG. 7 is a flowchart of another method for generating an input vector representing a first audio feature according to an embodiment of the present disclosure
  • FIG. 8 is a structural diagram of a pre-established convolutional neural network model according to an embodiment of the present application.
  • FIG. 9 is a flowchart of another audio classification method according to an embodiment of the present application.
  • FIG. 10 is a structural diagram of an audio classification apparatus according to an embodiment of the present application.
  • FIG. 11 is a structural diagram of an input vector generating module according to an embodiment of the present application.
  • FIG. 12 is a structural diagram of an input vector generating module according to an embodiment of the present application.
  • FIG. 13 is a structural diagram of an input vector generating module according to an embodiment of the present application.
  • FIG. 14 is a structural diagram of an input vector generating module according to an embodiment of the present application.
  • FIG. 15 is a structural diagram of another audio classification apparatus according to an embodiment of the present application.
  • FIG. 16 is a hardware structural diagram of a smart device according to an embodiment of the present disclosure.
  • FIG. 17 is a structural diagram of a hardware topology environment applied to an audio classification method according to an embodiment of the present disclosure.
  • FIG. 1 is a flowchart of an audio classification method according to an embodiment of the present application. As shown in Figure 1, the method includes:
  • When there is an audio file that needs to be classified, the smart device first acquires the audio file corresponding to the audio to be classified, i.e., the audio file to be classified.
  • the smart device may extract an audio signal of the audio file to be classified, process the audio signal, and generate an input vector representing the first audio feature.
  • the first audio feature is an audio feature corresponding to the audio file to be classified, and the input vector indicating the first audio feature may be a two-dimensional vector.
  • The pre-trained neural network model for audio classification may be: a convolutional neural network (CNN) model; or a neural network model formed by combining a convolutional recurrent neural network (CRNN) model with a convolutional neural network model.
  • It can be understood that the input of the neural network model is the input vector and its output is the classification result of the audio file to be classified, and the input vector can represent first audio features of the audio file to be analyzed such as timbre, rhythm, intensity, melody, harmony and instrumentation. Thus, after the input vector is input into the pre-trained neural network model for audio classification, the model analyzes the input vector, thereby determining at least first audio features of the audio file such as timbre, rhythm, intensity, melody, harmony and instrumentation, and finally generating the classification result for the audio file to be classified.
  • The classification result of the audio file to be classified is thus determined from the audio features of the audio file itself (the first audio feature) and does not depend on manually related information.
  • The technical solution provided by the embodiments of the present application classifies the audio file based on the audio features of the file itself, by means of a pre-trained neural network model for audio classification, rather than relying on manually related meta-information.
  • Compared with the prior art, it is not only objective but also highly accurate, is little affected by human subjective factors, and, because audio files to be classified can be classified automatically, significantly improves work efficiency. The technical solution provided by the embodiments of the present application can therefore overcome the limitations of the prior art, has high reliability, and is better suited for practical application.
  • Optionally, in the embodiments of the present application, S12 may be implemented in multiple ways; specific implementations of S12 are described in detail below.
  • FIG. 2 is a flowchart of a method for generating an input vector representing a first audio feature according to an embodiment of the present application. As shown in Figure 2, the method includes:
  • FIG. 3 is a flowchart of a method for extracting an audio signal of an audio file to be classified according to an embodiment of the present application. As shown in FIG. 3, the method for extracting an audio signal of the audio file to be classified includes:
  • S1211: Convert the audio file to be classified into a mono audio file.
  • S1212: Adjust the sampling frequency of the mono audio file to a preset sampling frequency, and sample the mono audio file at the preset sampling frequency to extract the audio signal of the audio file to be classified.
  • It can be understood that the audio file to be classified, or a classified audio file, records a time-based signal, which needs to be converted into a time-frequency signal to reduce the data size, filter out irrelevant information, and facilitate subsequent training or classification with the neural network.
  • Optionally, the preset sampling frequency may be 12 kHz (kilohertz).
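  • As a hedged illustration only (librosa is an assumed library choice, not named in the patent), S1211 and S1212 could be sketched as follows; the function name and parameter values are placeholders:

```python
# Minimal sketch of S1211/S1212: downmix to mono and resample to the
# preset 12 kHz sampling frequency mentioned above.
import librosa

PRESET_SR = 12000  # preset sampling frequency (12 kHz)

def extract_audio_signal(path):
    # librosa.load performs both the mono conversion and the resampling.
    signal, _ = librosa.load(path, sr=PRESET_SR, mono=True)
    return signal
```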
  • S122: Apply a short-time Fourier transform and mel-frequency conversion to the audio signal to generate a mel-scale spectrogram representing the first audio feature as the input vector.
  • Optionally, the extracted audio signal of the audio file to be classified is first pre-processed, then a short-time Fourier transform (STFT) is applied to obtain a spectrogram of the audio signal, and a mel-scale frequency conversion is then performed on the spectrogram, converting the actual frequencies into frequencies adapted to human auditory characteristics and generating a mel-scale spectrogram representing the first audio feature as the input vector.
  • The pre-processing of the extracted audio signal of the audio file to be classified may include framing and windowing operations; after the mel-scale frequency conversion of the spectrogram, the amplitude may also be log-scaled, so that low-amplitude components are raised relative to high-amplitude components, making it possible to observe periodic signals masked by low-amplitude noise.
  • Referring to FIG. 4, FIG. 4 is a mel-scale spectrogram according to an embodiment of the present application.
  • The amplitude of the mel-scale spectrogram in the figure has been log-scaled.
  • The figure shows the distribution of a piece of audio over different frequencies along the time axis, represented by a two-dimensional vector, which serves as the input for the next step of neural network model training, or as the input of the neural network model for audio classification.
  • The left vertical axis represents frequency in hertz (Hz); the horizontal axis represents time in minutes; the right vertical axis represents sound intensity in decibels (dB).
  • The caption "mel power spectrogram" indicates that the figure is a mel-scale spectrogram whose amplitude has been log-scaled.
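  • A hedged sketch of S122, again assuming librosa; n_fft, hop_length and n_mels are illustrative values that the patent does not specify:

```python
# STFT -> mel-scale frequency conversion -> log amplitude, producing the
# two-dimensional mel-scale spectrogram used as the input vector.
import librosa
import numpy as np

def mel_input_vector(signal, sr=12000, n_mels=96):
    mel = librosa.feature.melspectrogram(
        y=signal, sr=sr, n_fft=512, hop_length=256, n_mels=n_mels)
    # Log-scaling raises low-amplitude components relative to
    # high-amplitude ones, as described above.
    return librosa.power_to_db(mel, ref=np.max)  # shape: (n_mels, frames)
```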
  • FIG. 5 is a flowchart of another method for generating an input vector representing a first audio feature according to an embodiment of the present application. As shown in FIG. 5, the method includes:
  • S123: Apply a short-time Fourier transform to the audio signal to generate a spectrogram representing the first audio feature as the input vector.
  • Optionally, the extracted audio signal of the audio file to be classified is first pre-processed, and then a short-time Fourier transform (STFT) is applied to obtain a spectrogram of the audio signal, generating a spectrogram representing the first audio feature as the input vector.
  • The pre-processing of the extracted audio signal of the audio file to be classified may include framing and windowing operations.
  • This method directly uses the spectrogram obtained by the short-time Fourier transform as the input vector; compared with using the mel-scale spectrogram described above as the input vector, it requires no mel-frequency conversion and therefore improves processing efficiency.
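  • A hedged sketch of S123 under the same assumptions; framing and windowing happen inside librosa.stft (a Hann window by default), and the parameters are placeholders:

```python
# Magnitude spectrogram taken straight from the STFT, with no mel conversion.
import librosa
import numpy as np

def stft_input_vector(signal):
    return np.abs(librosa.stft(signal, n_fft=512, hop_length=256))
```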
  • FIG. 6 is a flowchart of another method for generating an input vector representing a first audio feature according to an embodiment of the present application. As shown in FIG. 6, the method includes:
  • S124: Apply a short-time Fourier transform, mel-frequency conversion and mel-frequency cepstral coefficient conversion to the audio signal to generate mel-frequency cepstral coefficients representing the first audio feature as the input vector.
  • Optionally, the extracted audio signal of the audio file to be classified is first pre-processed, then a short-time Fourier transform (STFT) is applied to obtain a spectrogram of the audio signal, and a mel-scale frequency conversion is performed on the spectrogram, converting the actual frequencies into frequencies adapted to human auditory characteristics and yielding a mel-scale spectrogram; the mel-scale spectrogram is then subjected to mel-frequency cepstral coefficient conversion to generate mel-frequency cepstral coefficients representing the first audio feature as the input vector.
  • This method uses mel-frequency cepstral coefficients representing the first audio feature as the input vector; compared with using the mel-scale spectrogram described above as the input vector, it yields higher accuracy when the subsequent neural network model classifies the audio file to be classified.
  • The pre-processing of the extracted audio signal of the audio file to be classified may include framing and windowing operations; after the mel-scale frequency conversion of the spectrogram and before the mel-frequency cepstral coefficient conversion of the mel-scale spectrogram, the amplitude may also be log-scaled, so that low-amplitude components are raised relative to high-amplitude components, making it possible to observe periodic signals masked by low-amplitude noise.
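  • A hedged sketch of S124; librosa bundles the STFT, mel conversion, log scaling and cepstral-coefficient step into one call, and n_mfcc is an assumed value:

```python
# Mel-frequency cepstral coefficients as the input vector.
import librosa

def mfcc_input_vector(signal, sr=12000, n_mfcc=20):
    return librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=n_mfcc)
```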
  • FIG. 7 is a flowchart of another method for generating an input vector representing a first audio feature according to an embodiment of the present application. As shown in FIG. 7, the method includes:
  • S125: Apply a constant-Q transform (CQT) to the audio signal to generate a spectrogram representing the first audio feature as the input vector.
  • Optionally, the extracted audio signal of the audio file to be classified is first pre-processed and then subjected to a constant-Q transform to obtain a spectrogram of the audio signal, generating a spectrogram representing the first audio feature as the input vector. The most distinctive feature of a constant-Q spectrogram is that its frequency axis uses a log scale rather than a linear scale, and the window length changes with frequency, which makes it well suited to analyzing many different types of audio files to be classified.
  • The pre-processing of the extracted audio signal of the audio file to be classified may include framing and windowing operations.
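  • A hedged sketch of S125; hop_length and n_bins are illustrative assumptions:

```python
# Constant-Q transform: log-scaled frequency axis and a window length that
# varies with frequency, as described above.
import librosa
import numpy as np

def cqt_input_vector(signal, sr=12000):
    return np.abs(librosa.cqt(signal, sr=sr, hop_length=256, n_bins=84))
```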
  • The audio classification method provided by another embodiment of the present application further includes, before S13:
  • A. Acquire classified audio files and the classification identification information of the classified audio files. It can be understood that, in order to train the neural network model for audio classification, a large number of classified audio files are first acquired, together with the classification identification information corresponding to each of them.
  • Optionally, the classification identification information includes, but is not limited to, the genre to which a classified audio file belongs and its tag information.
  • For example, the genres to which classified audio files belong include Pop Music, Rhythm & Blues (R&B), Rap, Jazz, Rock and Country Music; tag information is freer and broader, and may cover multiple angles,
  • such as whether an audio file is a lyrical song, a lullaby, quiet or agitated audio, a piano performance, a guzheng performance, and so on.
  • B. Process the training audio signal to generate a training vector representing a second audio feature.
  • The training audio signal is the audio signal of a classified audio file, and the second audio feature is the audio feature corresponding to the classified audio file.
  • Optionally, the process of processing the training audio signal to generate a training vector representing the second audio feature is substantially the same as the process of S12 explained in the foregoing embodiments of the present application; the difference is that the object processed in S12 is an audio file to be classified, while the object processed in step B is a classified audio file. Step B is therefore not described in detail here; for specifics, refer to the S12 part of the above embodiments.
  • C. Train the pre-established neural network model with the training vector and the classification identification information corresponding to the training vector, to obtain the neural network model for audio classification.
  • It can be understood that the embodiments of the present application require a neural network model to be established in advance; the pre-established neural network model may be a convolutional neural network model, or a neural network model formed by combining a convolutional recurrent neural network model with a convolutional neural network model. The pre-established model is then trained with the training vectors and the classification identification information corresponding to the training vectors as input, yielding the neural network model for audio classification.
  • If the pre-established neural network model is a convolutional neural network model, or a combination of a convolutional recurrent neural network model and a convolutional neural network model, training the pre-established model mainly means training the weights of the pre-established neural network model.
  • FIG. 8 is a structural diagram of a pre-established convolutional neural network model according to an embodiment of the present application.
  • As shown in FIG. 8, the pre-established convolutional neural network model is a five-layer 2-D convolution model.
  • For the scenario of selecting, from multiple genres, the one genre to which a classified audio file belongs, the corresponding activation function may be softmax and the loss function may be categorical cross-entropy; for the scenario of selecting tag information for a classified audio file, the corresponding activation function may be sigmoid and the loss function may be binary cross-entropy.
  • Specifically, in FIG. 8, the number of convolution layers of the convolutional neural network model can be adjusted, BatchNormalization is optional, the pooling layers may use methods other than Max Pooling, and the ELU activation function may be replaced with other functions; the present application places no restrictions on these choices.
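  • The following Keras sketch of the five-layer 2-D convolution model of FIG. 8 is an illustration only: the filter counts and the (96, 1366, 1) input shape are assumptions, and only the layer count, the optional BatchNormalization, the ELU activation, Max Pooling, and the two activation/loss pairings come from the text above.

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_model(n_classes, multi_label):
    model = keras.Sequential([keras.Input(shape=(96, 1366, 1))])
    for filters in (32, 64, 64, 128, 128):      # five 2-D convolution blocks
        model.add(layers.Conv2D(filters, (3, 3), padding="same"))
        model.add(layers.BatchNormalization())  # optional per the text
        model.add(layers.Activation("elu"))     # ELU; other functions work too
        model.add(layers.MaxPooling2D((2, 2)))  # Max Pooling; also replaceable
    model.add(layers.Flatten())
    if multi_label:  # selecting tag information for an audio file
        model.add(layers.Dense(n_classes, activation="sigmoid"))
        loss = "binary_crossentropy"
    else:            # selecting one genre from multiple genres
        model.add(layers.Dense(n_classes, activation="softmax"))
        loss = "categorical_crossentropy"
    model.compile(optimizer="adam", loss=loss)
    return model
```

  • Training (step C) would then be a standard fit call such as model.fit(training_vectors, classification_labels, ...), where both argument names are placeholders for the training vectors and classification identification information described above.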
  • In addition, the method used in S12 of the above embodiments to generate the input vector should be the same as the method used in steps B and C of this embodiment to generate the training vector, so as to ensure that the input of the trained neural network model for audio classification matches the input vector obtained in S12.
  • FIG. 9 is a flowchart of another audio classification method according to an embodiment of the present application. As shown in FIG. 9, the method includes:
  • S21: Acquire classified audio files and the classification identification information of the classified audio files.
  • Optionally, S21 includes: acquiring classified audio files, together with the tag information of the classified audio files and the genres to which they belong.
  • S22: Process the training audio signal to generate a training vector representing the second audio feature.
  • S23: Train the pre-established neural network model with the training vector and the classification identification information corresponding to the training vector, to obtain the neural network model for audio classification.
  • S24: Acquire an audio file to be classified.
  • S25: Process the audio signal of the audio file to be classified to generate an input vector representing the first audio feature.
  • S26: Input the input vector into the pre-trained neural network model for audio classification.
  • S27: Analyze the input vector by means of the neural network model to generate a classification result for the audio file to be classified.
  • Optionally, S27 includes: analyzing the input vector by means of the neural network model to generate the tag information of the audio file to be classified and the genre to which it belongs.
  • Generating the tag information of the audio file to be classified and the genre to which it belongs may optionally mean: generating multiple items of tag information for the audio file to be classified, and determining, from multiple genres, the one genre to which the audio file belongs.
  • Optionally, the classification result may also be the probability of each item of tag information matched by the audio file to be classified, together with the probability of the genre to which it belongs.
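  • A hedged sketch of the inference side (S26/S27) under the same assumptions; tag_names and the axis-expansion step are placeholders that depend on how the model above was actually built:

```python
import numpy as np

def classify(model, input_vector, tag_names):
    # Add batch and channel axes so the 2-D input vector matches the model.
    x = input_vector[np.newaxis, ..., np.newaxis]
    probs = model.predict(x)[0]
    # Multi-label case: one probability per tag; in the single-genre case,
    # take the argmax over the softmax output instead.
    return dict(zip(tag_names, probs.tolist()))
```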
  • The technical solution provided by the embodiments of the present application classifies the audio file based on the audio features of the file itself, by means of a pre-trained neural network model for audio classification, rather than relying on manually related meta-information.
  • Compared with the prior art, it is not only objective but also highly accurate, is little affected by human subjective factors, and, because audio files to be classified can be classified automatically, significantly improves work efficiency. The technical solution provided by the embodiments of the present application can therefore overcome the limitations of the prior art, has high reliability, and is better suited for practical application.
  • It can be understood that, because the technical solution provided by the embodiments of the present application can automatically classify audio files, such as determining the items of tag information of an audio file to be classified and the genre to which it belongs, these classification results enable audio application software (i.e., application software providing audio playback services) to obtain relatively complete basic metadata for the audio files to be classified. This in turn makes it convenient for the audio application software to perform personalized audio recommendation, audio classification management, content editing and the like on these files.
  • These processes can be performed automatically by the server of the audio application software; especially for a service such as Tencent's JOOX (an audio application) with its huge existing and ever-growing audio song library, this can save a great deal of manpower and time, and the accuracy is also higher.
  • To describe the technical solution provided by the present application more comprehensively, corresponding to the audio classification method provided by the embodiments of the present application, the present application discloses an audio classification apparatus.
  • FIG. 10 is a structural diagram of an audio classification apparatus according to an embodiment of the present application. As shown in Figure 10, the device includes:
  • the audio-file acquisition module 11 is configured to acquire an audio file to be classified;
  • the input vector generating module 12 is configured to process the audio signal of the audio file to be classified to generate an input vector representing the first audio feature; the first audio feature is an audio feature corresponding to the audio file to be classified;
  • the input module 13 is configured to input the input vector to a pre-trained neural network model for audio classification
  • Optionally, the pre-trained neural network model for audio classification may be: a convolutional neural network (CNN) model; or a neural network model formed by combining a convolutional recurrent neural network (CRNN) model with a convolutional neural network model.
  • the classification result generating module 14 is configured to analyze the input vector by using the neural network model to generate a classification result of the audio file to be classified.
  • It can be understood that the input of the neural network model is the input vector and its output is the classification result of the audio file to be classified, and the input vector can represent first audio features of the audio file to be analyzed such as timbre, rhythm, intensity, melody, harmony and instrumentation. Thus, after the input vector is input into the pre-trained neural network model for audio classification, the classification result generation module 14 analyzes the input vector by means of the pre-trained model, thereby determining at least first audio features of the audio file such as timbre, rhythm, intensity, melody, harmony and instrumentation, and finally generating the classification result for the audio file to be classified.
  • The classification result of the audio file to be classified is thus determined from the audio features of the audio file itself (the first audio feature) and does not depend on manually related information.
  • The audio classification apparatus provided by the embodiments of the present application classifies the audio file based on the audio features of the file itself, by means of a pre-trained neural network model for audio classification, rather than relying on manually related meta-information. Compared with the prior art, it is not only objective but also highly accurate and little affected by human subjective factors; moreover, because audio files to be classified can be classified automatically, work efficiency can be significantly improved compared with the prior art. The audio classification apparatus provided by the embodiments of the present application can therefore overcome the limitations of the prior art, has high reliability, and is better suited for practical application.
  • the input vector generating module 12 may have multiple implementation manners. The specific implementation of the input vector generating module 12 is described in detail below.
  • FIG. 11 is a structural diagram of an input vector generating module according to an embodiment of the present application. As shown in Figure 11, the module includes:
  • the audio signal extracting unit 121 is configured to extract an audio signal of the audio file to be classified
  • the audio signal extracting unit 121 includes:
  • a mono conversion subunit 1211 configured to convert the to-be-classified audio file into a mono audio file
  • the sampling subunit 1212 is configured to adjust the sampling frequency of the mono audio file to a preset sampling frequency, and to sample the mono audio file at the preset sampling frequency so as to extract the audio signal of the audio file to be classified;
  • It can be understood that the audio file to be classified, or a classified audio file, records a time-based signal, which needs to be converted into a time-frequency signal to reduce the data size, filter out irrelevant information, and facilitate subsequent training or classification with the neural network.
  • the preset sampling frequency may be 12 kHz (kilohertz).
  • the input vector first generation unit 122 is configured to apply a short-time Fourier transform and mel-frequency conversion to the audio signal to generate a mel-scale spectrogram representing the first audio feature as the input vector.
  • Optionally, the input vector first generation unit 122 first pre-processes the extracted audio signal of the audio file to be classified, then applies a short-time Fourier transform (STFT) to obtain a spectrogram of the audio signal, and then performs a mel-scale frequency conversion on the spectrogram, converting the actual frequencies into frequencies adapted to human auditory characteristics and generating a mel-scale spectrogram representing the first audio feature as the input vector.
  • The pre-processing of the extracted audio signal of the audio file to be classified may include framing and windowing operations; after the mel-scale frequency conversion of the spectrogram, the amplitude may also be log-scaled, so that low-amplitude components are raised relative to high-amplitude components, making it possible to observe periodic signals masked by low-amplitude noise.
  • FIG. 12 is a structural diagram of an input vector generating module according to an embodiment of the present application. As shown in Figure 12, the module includes:
  • the audio signal extracting unit 121 is configured to extract an audio signal of the audio file to be classified
  • the structure of the audio signal extraction unit 121 can be referred to the audio signal extraction unit 121 in FIG. 11, and details are not described herein again.
  • the input vector second generation unit 123 is configured to apply a short-time Fourier transform to the audio signal to generate a spectrogram representing the first audio feature as the input vector.
  • Optionally, the input vector second generation unit 123 first pre-processes the extracted audio signal of the audio file to be classified, and then applies a short-time Fourier transform (STFT) to obtain a spectrogram of the audio signal, generating a spectrogram representing the first audio feature as the input vector.
  • The pre-processing of the extracted audio signal of the audio file to be classified may include framing and windowing operations.
  • This approach directly uses the spectrogram obtained by the short-time Fourier transform as the input vector; compared with using the mel-scale spectrogram described above as the input vector, it requires no mel-frequency conversion and therefore improves processing efficiency.
  • FIG. 13 is a structural diagram of an input vector generating module according to an embodiment of the present application. As shown in Figure 13, the module includes:
  • the audio signal extracting unit 121 is configured to extract an audio signal of the audio file to be classified
  • the structure of the audio signal extracting unit 121 can be referred to the audio signal extracting unit 121 in FIG. 11 , and details are not described herein again.
  • the input vector third generation unit 124 is configured to apply a short-time Fourier transform, mel-frequency conversion and mel-frequency cepstral coefficient conversion to the audio signal to generate mel-frequency cepstral coefficients representing the first audio feature as the input vector.
  • Optionally, the input vector third generation unit 124 first pre-processes the extracted audio signal of the audio file to be classified, then applies a short-time Fourier transform (STFT) to obtain a spectrogram of the audio signal, then performs a mel-scale frequency conversion on the spectrogram, converting the actual frequencies into frequencies adapted to human auditory characteristics and yielding a mel-scale spectrogram, and finally performs a mel-frequency cepstral coefficient conversion on the mel-scale spectrogram to generate mel-frequency cepstral coefficients representing the first audio feature as the input vector.
  • This approach uses mel-frequency cepstral coefficients representing the first audio feature as the input vector; compared with using the mel-scale spectrogram described above as the input vector, it yields higher accuracy when the subsequent neural network model classifies the audio file to be classified.
  • The pre-processing of the extracted audio signal of the audio file to be classified may include framing and windowing operations; after the mel-scale frequency conversion of the spectrogram and before the mel-frequency cepstral coefficient conversion of the mel-scale spectrogram, the amplitude may also be log-scaled, so that low-amplitude components are raised relative to high-amplitude components, making it possible to observe periodic signals masked by low-amplitude noise.
  • FIG. 14 is a structural diagram of an input vector generating module according to an embodiment of the present application. As shown in Figure 14, the module includes:
  • the audio signal extracting unit 121 is configured to extract an audio signal of the audio file to be classified
  • the structure of the audio signal extracting unit 121 can be referred to the audio signal extracting unit 121 in FIG. 11 , and details are not described herein again.
  • the input vector fourth generation unit 125 is configured to apply a constant-Q transform to the audio signal to generate a spectrogram representing the first audio feature as the input vector.
  • Optionally, the input vector fourth generation unit 125 first pre-processes the extracted audio signal of the audio file to be classified, and then applies a constant-Q transform to obtain a spectrogram of the audio signal, generating a spectrogram representing the first audio feature as the input vector.
  • The most distinctive feature of a constant-Q spectrogram is that its frequency axis uses a log scale rather than a linear scale, and the window length changes with frequency, which makes it well suited to analyzing many different types of audio files to be classified.
  • The pre-processing of the extracted audio signal of the audio file to be classified may include framing and windowing operations.
  • the audio classification device provided by another embodiment of the present application further includes:
  • a classified audio file and classification identification information acquisition module, configured to acquire classified audio files and the classification identification information of the classified audio files;
  • It can be understood that the classified audio file and classification identification information acquisition module is configured to acquire a large number of classified audio files, together with the classification identification information corresponding to each of them.
  • Optionally, the classification identification information includes, but is not limited to, the genre to which a classified audio file belongs and its tag information.
  • For example, the genres to which classified audio files belong include Pop Music, Rhythm & Blues (R&B), Rap, Jazz, Rock and Country Music; tag information is freer and broader, and may cover multiple angles,
  • such as whether an audio file is a lyrical song, a lullaby, quiet or agitated audio, a piano performance, a guzheng performance, and so on.
  • a training vector generation module, configured to process the training audio signal to generate a training vector representing a second audio feature;
  • The training audio signal is the audio signal of a classified audio file, and the second audio feature is the audio feature corresponding to the classified audio file.
  • Optionally, the process by which the training vector generation module processes the training audio signal to generate a training vector representing the second audio feature is substantially the same as the implementation of the input vector generation module 12 explained in the above embodiments of the present application; the difference is that the object processed by the input vector generation module 12 is an audio file to be classified, while the object processed by the training vector generation module is a classified audio file. The training vector generation module is therefore not elaborated here; for specifics, refer to the input vector generation module 12 in the above embodiments.
  • a neural network model training module, configured to train the pre-established neural network model with the training vector and the classification identification information corresponding to the training vector, to obtain the neural network model for audio classification.
  • It can be understood that the embodiments of the present application require a neural network model to be established in advance; the pre-established neural network model may be a convolutional neural network model, or a neural network model formed by combining a convolutional recurrent neural network model with a convolutional neural network model. The pre-established model is then trained with the training vectors and the classification identification information corresponding to the training vectors as input, yielding the neural network model for audio classification.
  • If the pre-established neural network model is a convolutional neural network model, or a combination of a convolutional recurrent neural network model and a convolutional neural network model, training the pre-established model mainly means training the convolution kernels (i.e., weights) of the pre-established neural network model.
  • In addition, the method by which the input vector generation module 12 in the above embodiments generates the input vector should be the same as the method by which the training vector generation module in this embodiment generates the training vector, so as to ensure that the input of the trained neural network model for audio classification matches the input vector obtained by the input vector generation module 12.
  • FIG. 15 is a structural diagram of another audio classification apparatus according to an embodiment of the present application. As shown in Figure 15, the device includes:
  • the classified audio file and classification identification information acquisition module 21 is configured to acquire classified audio files and the classification identification information of the classified audio files;
  • Optionally, the classified audio file and classification identification information acquisition module 21 is specifically configured to: acquire classified audio files, together with the tag information of the classified audio files and the genres to which they belong;
  • the training vector generation module 22 is configured to process the training audio signal to generate a training vector representing the second audio feature;
  • the neural network model training module 23 is configured to train the pre-established neural network model with the training vector and the classification identification information corresponding to the training vector, to obtain the neural network model for audio classification;
  • the audio-file acquisition module 24 is configured to acquire an audio file to be classified;
  • the input vector generation module 25 is configured to process the audio signal of the audio file to be classified to generate an input vector representing the first audio feature;
  • the input module 26 is configured to input the input vector into the pre-trained neural network model for audio classification;
  • the classification result generation module 27 is configured to analyze the input vector by means of the neural network model to generate a classification result for the audio file to be classified.
  • Optionally, the classification result generation module 27 is specifically configured to:
  • analyze the input vector by means of the neural network model to generate the tag information of the audio file to be classified and the genre to which it belongs.
  • The audio classification apparatus provided by the embodiments of the present application classifies the audio file based on the audio features of the file itself, by means of a pre-trained neural network model for audio classification, rather than relying on manually related meta-information.
  • It is therefore not only objective but also highly accurate, is little affected by human subjective factors, and, because audio files to be classified can be classified automatically, significantly improves work efficiency. The audio classification apparatus provided by the embodiments of the present application can therefore overcome the limitations of the prior art, has high reliability, and is better suited for practical application.
  • To describe the technical solution provided by the present application more comprehensively, the present application discloses a smart device; the audio classification method provided by the present application may be applied to the smart device, which may be a computer, a server, or the like.
  • FIG. 16 is a hardware structural diagram of a smart device according to an embodiment of the present application. As shown in FIG. 16, the smart device includes:
  • a processor 1, a communication interface 2, a memory 3 and a communication bus 4;
  • the processor 1, the communication interface 2 and the memory 3 communicate with one another through the communication bus 4;
  • a processor 1, for executing a program;
  • a memory 3, for storing the program;
  • The program may include program code, and the program code includes computer operation instructions.
  • Optionally, the program may include the program corresponding to the audio classification method described above.
  • Optionally, the processor 1 may be a central processing unit (CPU), an application-specific integrated circuit (ASIC), or one or more integrated circuits configured to implement the embodiments of the present application.
  • the memory 3 may include a high speed RAM memory and may also include a non-volatile memory such as at least one disk memory.
  • The program can be specifically used to: acquire an audio file to be classified; process the audio signal of the audio file to be classified to generate an input vector representing a first audio feature; input the input vector into a pre-trained neural network model for audio classification; and analyze the input vector by means of the neural network model to generate a classification result for the audio file to be classified.
  • The smart device provided by the embodiments of the present application classifies the audio file based on the audio features of the file to be classified itself, using a pre-trained neural network model for audio classification, rather than relying on manually related meta-information.
  • It is therefore not only objective but also highly accurate, is little affected by human subjective factors, and, because audio files to be classified can be classified automatically, significantly improves work efficiency. The smart device provided by the embodiments of the present application can therefore overcome the limitations of the prior art, has high reliability, and is better suited for practical application.
  • the embodiment of the present application further provides a storage medium storing computer executable instructions for performing the audio classification method described in the foregoing embodiments.
  • the embodiment of the present application further provides a computer program product, comprising instructions, when executed on a computer, causing a computer to execute the audio classification method described in the foregoing embodiments.
  • FIG. 17 is a structural diagram of a hardware topology environment to which an audio classification method is applied according to an embodiment of the present disclosure.
  • the hardware topology environment to which the audio classification method provided by the embodiment of the present application is applied includes a server 31, and a client 32 connected to the server 31.
  • Optionally, the client 32 may be a computer terminal 321 or a mobile terminal 322;
  • In the process of training the neural network model, the server 31 is configured to: acquire classified audio files and the classification identification information of the classified audio files; process the training audio signal to generate a training vector representing a second audio feature, where the training audio signal is the audio signal of a classified audio file and the second audio feature is the audio feature corresponding to the classified audio file; and train the pre-established neural network model with the training vector and the classification identification information corresponding to the training vector, to obtain the neural network model for audio classification.
  • After the training is completed, a new audio file (i.e., an audio file to be classified) may be classified, and the server 31 is further configured to: acquire the audio file to be classified; process the audio signal of the audio file to be classified to generate an input vector representing the first audio feature;
  • input the input vector into the pre-trained neural network model for audio classification; and analyze the input vector by means of the neural network model to generate a classification result for the audio file to be classified.
  • Optionally, the server 31 may deploy the neural network model for audio classification on the client 32; the client 32 may be a client local to the server, such as a local client of the merchant that provides the audio classification software service, or it may be a user's client.
  • In this case, the client 32 may be used to: acquire the audio file to be classified; process the audio signal of the audio file to be classified to generate an input vector representing the first audio feature;
  • input the input vector into the pre-trained neural network model for audio classification; and analyze the input vector by means of the neural network model to generate a classification result for the audio file to be classified.
  • A client 32 configured with the neural network model for audio classification can, on its own, classify new (e.g., newly released) audio files obtained from the server. For example, if a user downloads a song, the user's own client (such as the user's mobile terminal or the user's computer) can classify the song and assign it to the resulting category (such as rock music); when the user later listens to the songs in that category, the song is played along with them automatically, which can effectively improve the user experience.
  • Optionally, the client 32 can also send a new audio file to the server 31; the server 31 classifies the new audio file and then feeds the classification result back to the client 32.
  • In the hardware topology environment to which the audio classification method provided by the embodiments of the present application is applied, audio files are classified based on their own audio features by means of a pre-trained neural network model for audio classification, rather than by relying on manually related meta-information. Compared with the prior art, this is not only objective but also highly accurate and little affected by human subjective factors; and because audio files to be classified can be classified automatically, it can significantly improve work efficiency and also helps improve the user experience.
  • the steps of a method or algorithm described in connection with the embodiments disclosed herein can be implemented directly in hardware, a software module executed by a processor, or a combination of both.
  • The software modules can be located in random access memory (RAM), internal memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, or any other form of storage medium known in the art.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Library & Information Science (AREA)
  • Multimedia (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Biomedical Technology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Molecular Biology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Auxiliary Devices For Music (AREA)

Abstract

An audio classification method, apparatus, smart device and storage medium. The method includes: acquiring an audio file to be classified (S11); processing the audio signal of the audio file to be classified to generate an input vector representing a first audio feature (S12), the first audio feature being the audio feature corresponding to the audio file to be classified; inputting the input vector into a pre-trained neural network model for audio classification (S13); and analyzing the input vector by means of the neural network model to generate a classification result for the audio file to be classified (S14). The method can overcome the limitations of the prior art and improve the accuracy and efficiency of classifying audio files.

Description

Audio classification method, apparatus, smart device and storage medium
This application claims priority to Chinese patent application No. 201711265842.X, entitled "Audio classification method, apparatus, smart device and storage medium" and filed with the Chinese Patent Office on December 5, 2017, the entire contents of which are incorporated herein by reference.
Technical Field
The present application relates to the field of artificial intelligence technology, and in particular to audio classification technology.
Background
With rapid economic and social development, people's living standards keep improving; as material needs are gradually met, people pay ever more attention to spiritual pursuits and enjoyment. Audio such as music can greatly enrich people's lives and can, to a certain extent, satisfy these spiritual pursuits and enjoyment. In real life, different people may prefer different kinds of audio, and the same person may want to hear different audio at different times and in different states; it is therefore necessary to classify audio.
In current technology, audio is classified mainly by manual means, which consumes large amounts of human resources, takes a long time, is inefficient, and is affected by factors such as the limits of individual knowledge and personal preference, so its objectivity is low. Existing machine-assisted methods still rely on manually related meta-information about the audio, such as the singer or the era, to build models, and likewise suffer from low efficiency and low objectivity. Moreover, as the amount of audio grows ever larger, the volume of data is huge and a great deal of new audio is generally added every day; among these audio items, missing meta-information is common, which makes it difficult to classify them accurately.
Therefore, in current technology, whether classification is manual or machine-assisted, efficiency is low and classification accuracy is limited, which constitutes a significant limitation.
Summary
In view of this, the embodiments of the present application provide an audio classification method, apparatus, smart device and storage medium, which can overcome the limitations of the prior art and improve the accuracy and efficiency of audio classification.
To achieve the above object, the embodiments of the present application provide the following technical solutions:
An audio classification method, including:
acquiring an audio file to be classified;
processing an audio signal of the audio file to be classified to generate an input vector representing a first audio feature, the first audio feature being the audio feature corresponding to the audio file to be classified;
inputting the input vector into a pre-trained neural network model for audio classification; and
analyzing the input vector by means of the neural network model to generate a classification result for the audio file to be classified.
An audio classification apparatus, including:
an audio-file acquisition module, configured to acquire an audio file to be classified;
an input-vector generation module, configured to process an audio signal of the audio file to be classified to generate an input vector representing a first audio feature, the first audio feature being the audio feature corresponding to the audio file to be classified;
an input module, configured to input the input vector into a pre-trained neural network model for audio classification; and
a classification-result generation module, configured to analyze the input vector by means of the neural network model to generate a classification result for the audio file to be classified.
A smart device, including:
a processor and a memory, the processor and the memory being connected through a communication bus,
wherein the processor is configured to invoke and execute a program stored in the memory; and
the memory is configured to store the program, the program being at least used to execute the audio classification method described above.
A storage medium storing computer-executable instructions for performing the audio classification method described above.
A computer program product including instructions which, when run on a computer, cause the computer to perform the audio classification method described above.
As can be seen from the above technical solutions, compared with the prior art, the embodiments of the present application provide an audio classification method, apparatus, smart device and storage medium. The technical solution provided by the embodiments of the present application first acquires an audio file to be classified, then processes the audio signal of the audio file to be classified to generate an input vector representing a first audio feature, the first audio feature being an audio feature extracted from the audio file to be classified itself; the input vector is input into a pre-trained neural network model for audio classification, and the neural network model analyzes the input vector to generate a classification result for the audio file to be classified. In other words, the technical solution provided by the embodiments of the present application classifies the audio file based on the audio features of the file itself, by means of a pre-trained neural network model for audio classification, rather than relying on manually related meta-information. Compared with the prior art, it is not only objective but also highly accurate and little affected by human subjective factors; moreover, because audio files to be classified can be classified automatically, work efficiency can be significantly improved. The technical solution provided by the embodiments of the present application can therefore overcome the limitations of the prior art, has high reliability, and is better suited for practical application.
Brief Description of the Drawings
To describe the technical solutions in the embodiments of the present application or in the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below are merely embodiments of the present application, and those of ordinary skill in the art may obtain other drawings from the provided drawings without creative effort.
FIG. 1 is a flowchart of an audio classification method according to an embodiment of the present application;
FIG. 2 is a flowchart of a method for generating an input vector representing a first audio feature according to an embodiment of the present application;
FIG. 3 is a flowchart of a method for extracting an audio signal of an audio file to be classified according to an embodiment of the present application;
FIG. 4 is a mel-scale spectrogram according to an embodiment of the present application;
FIG. 5 is a flowchart of another method for generating an input vector representing a first audio feature according to an embodiment of the present application;
FIG. 6 is a flowchart of another method for generating an input vector representing a first audio feature according to an embodiment of the present application;
FIG. 7 is a flowchart of another method for generating an input vector representing a first audio feature according to an embodiment of the present application;
FIG. 8 is a structural diagram of a pre-established convolutional neural network model according to an embodiment of the present application;
FIG. 9 is a flowchart of another audio classification method according to an embodiment of the present application;
FIG. 10 is a structural diagram of an audio classification apparatus according to an embodiment of the present application;
FIG. 11 is a structural diagram of an input vector generation module according to an embodiment of the present application;
FIG. 12 is a structural diagram of an input vector generation module according to an embodiment of the present application;
FIG. 13 is a structural diagram of an input vector generation module according to an embodiment of the present application;
FIG. 14 is a structural diagram of an input vector generation module according to an embodiment of the present application;
FIG. 15 is a structural diagram of another audio classification apparatus according to an embodiment of the present application;
FIG. 16 is a hardware structural diagram of a smart device according to an embodiment of the present application;
FIG. 17 is a structural diagram of a hardware topology environment to which an audio classification method is applied according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be described clearly and completely below with reference to the drawings in the embodiments of the present application. Obviously, the described embodiments are merely some, rather than all, of the embodiments of the present application. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present application without creative effort fall within the scope of protection of the present application.
To make the above objects, features and advantages of the present application clearer and easier to understand, the present application is described in further detail below with reference to the drawings and specific implementations.
Embodiments
Referring to FIG. 1, FIG. 1 is a flowchart of an audio classification method according to an embodiment of the present application. As shown in FIG. 1, the method includes:
S11: Acquire an audio file to be classified.
When there is an audio file that needs to be classified, the smart device first acquires the audio file corresponding to the audio to be classified, i.e., the audio file to be classified.
S12: Process the audio signal of the audio file to be classified to generate an input vector representing a first audio feature.
The smart device may extract the audio signal of the audio file to be classified, process the audio signal, and generate an input vector representing the first audio feature, where the first audio feature is the audio feature corresponding to the audio file to be classified, and the input vector representing the first audio feature may be a two-dimensional vector.
S13: Input the input vector into a pre-trained neural network model for audio classification.
Optionally, the pre-trained neural network model for audio classification may be:
a convolutional neural network (CNN) model;
or a neural network model formed by combining a convolutional recurrent neural network (CRNN) model with a convolutional neural network model.
S14: Analyze the input vector by means of the neural network model to generate a classification result for the audio file to be classified.
It can be understood that the input of the neural network model is the input vector and its output is the classification result of the audio file to be classified, and the input vector can represent first audio features of the audio file to be analyzed such as timbre, rhythm, intensity, melody, harmony and instrumentation. Thus, after the input vector is input into the pre-trained neural network model for audio classification, the model analyzes the input vector, thereby determining at least first audio features of the audio file such as timbre, rhythm, intensity, melody, harmony and instrumentation, and finally generating the classification result for the audio file to be classified. The classification result of the audio file to be classified is thus determined from the audio features of the audio file itself (the first audio feature) and does not depend on manually related information.
The technical solution provided by the embodiments of the present application classifies the audio file based on the audio features of the file itself, by means of a pre-trained neural network model for audio classification, rather than relying on manually related meta-information. Compared with the prior art, it is not only objective but also highly accurate and little affected by human subjective factors; moreover, because audio files to be classified can be classified automatically, work efficiency can be significantly improved. The technical solution provided by the embodiments of the present application can therefore overcome the limitations of the prior art, has high reliability, and is better suited for practical application.
Optionally, in the embodiments of the present application, S12 may be implemented in multiple ways; specific implementations of S12 are described in detail below.
Referring to FIG. 2, FIG. 2 is a flowchart of a method for generating an input vector representing a first audio feature according to an embodiment of the present application. As shown in FIG. 2, the method includes:
S121: Extract the audio signal of the audio file to be classified.
In one implementation, referring to FIG. 3, FIG. 3 is a flowchart of a method for extracting the audio signal of the audio file to be classified according to an embodiment of the present application. As shown in FIG. 3, the method includes:
S1211: Convert the audio file to be classified into a mono audio file.
S1212: Adjust the sampling frequency of the mono audio file to a preset sampling frequency, and sample the mono audio file at the preset sampling frequency to extract the audio signal of the audio file to be classified.
It can be understood that the audio file to be classified, or a classified audio file, records a time-based signal, which needs to be converted into a time-frequency signal to reduce the data size, filter out irrelevant information, and facilitate subsequent training or classification with the neural network.
Optionally, the preset sampling frequency may be 12 kHz (kilohertz).
S122: Apply a short-time Fourier transform and mel-frequency conversion to the audio signal to generate a mel-scale spectrogram representing the first audio feature as the input vector.
Optionally, the extracted audio signal of the audio file to be classified is first pre-processed, then a short-time Fourier transform (STFT) is applied to obtain a spectrogram of the audio signal, and a mel-scale frequency conversion is then performed on the spectrogram, converting the actual frequencies into frequencies adapted to human auditory characteristics and generating a mel-scale spectrogram representing the first audio feature as the input vector.
The pre-processing of the extracted audio signal of the audio file to be classified may include framing and windowing operations; after the mel-scale frequency conversion of the spectrogram, the amplitude may also be log-scaled, so that low-amplitude components are raised relative to high-amplitude components, making it possible to observe periodic signals masked by low-amplitude noise.
Referring to FIG. 4, FIG. 4 is a mel-scale spectrogram according to an embodiment of the present application. The amplitude of the mel-scale spectrogram in the figure has been log-scaled. The figure shows the distribution of a piece of audio over different frequencies along the time axis, represented by a two-dimensional vector, which serves as the input for the next step of neural network model training, or as the input of the neural network model for audio classification. As shown in FIG. 4, the left vertical axis represents frequency in hertz (Hz); the horizontal axis represents time in minutes; the right vertical axis represents sound intensity in decibels (dB). In FIG. 4, "mel power spectrogram" indicates that the figure is a mel-scale spectrogram whose amplitude has been log-scaled.
Referring to FIG. 5, FIG. 5 is a flowchart of another method for generating an input vector representing a first audio feature according to an embodiment of the present application. As shown in FIG. 5, the method includes:
S121: Extract the audio signal of the audio file to be classified.
In one implementation, for the method of extracting the audio signal of the audio file to be classified, refer to the description of the embodiment corresponding to FIG. 3, which is not repeated here.
S123: Apply a short-time Fourier transform to the audio signal to generate a spectrogram representing the first audio feature as the input vector.
Optionally, the extracted audio signal of the audio file to be classified is first pre-processed, and then a short-time Fourier transform (STFT) is applied to obtain a spectrogram of the audio signal, generating a spectrogram representing the first audio feature as the input vector. The pre-processing of the extracted audio signal may include framing and windowing operations. This method directly uses the spectrogram obtained by the short-time Fourier transform as the input vector; compared with using the mel-scale spectrogram described above as the input vector, it requires no mel-frequency conversion and therefore improves processing efficiency.
Referring to FIG. 6, FIG. 6 is a flowchart of another method for generating an input vector representing a first audio feature according to an embodiment of the present application. As shown in FIG. 6, the method includes:
S121: Extract the audio signal of the audio file to be classified.
In one implementation, for the method of extracting the audio signal of the audio file to be classified, refer to the description of the embodiment corresponding to FIG. 3, which is not repeated here.
S124: Apply a short-time Fourier transform, mel-frequency conversion and mel-frequency cepstral coefficient conversion to the audio signal to generate mel-frequency cepstral coefficients representing the first audio feature as the input vector.
Optionally, the extracted audio signal of the audio file to be classified is first pre-processed, then a short-time Fourier transform (STFT) is applied to obtain a spectrogram of the audio signal, and a mel-scale frequency conversion is performed on the spectrogram, converting the actual frequencies into frequencies adapted to human auditory characteristics and yielding a mel-scale spectrogram; the mel-scale spectrogram is then subjected to mel-frequency cepstral coefficient conversion to generate mel-frequency cepstral coefficients representing the first audio feature as the input vector. This method uses mel-frequency cepstral coefficients representing the first audio feature as the input vector; compared with using the mel-scale spectrogram described above as the input vector, it yields higher accuracy when the subsequent neural network model classifies the audio file to be classified.
The pre-processing of the extracted audio signal of the audio file to be classified may include framing and windowing operations; after the mel-scale frequency conversion of the spectrogram and before the mel-frequency cepstral coefficient conversion of the mel-scale spectrogram, the amplitude may also be log-scaled, so that low-amplitude components are raised relative to high-amplitude components, making it possible to observe periodic signals masked by low-amplitude noise.
Referring to FIG. 7, FIG. 7 is a flowchart of another method for generating an input vector representing a first audio feature according to an embodiment of the present application. As shown in FIG. 7, the method includes:
S121: Extract the audio signal of the audio file to be classified.
In one implementation, for the method of extracting the audio signal of the audio file to be classified, refer to the description of the embodiment corresponding to FIG. 3, which is not repeated here.
S125: Apply a constant-Q transform (CQT) to the audio signal to generate a spectrogram representing the first audio feature as the input vector.
Optionally, the extracted audio signal of the audio file to be classified is first pre-processed and then subjected to a constant-Q transform to obtain a spectrogram of the audio signal, generating a spectrogram representing the first audio feature as the input vector. The most distinctive feature of a constant-Q spectrogram is that its frequency axis uses a log scale rather than a linear scale, and the window length changes with frequency, which makes it well suited to analyzing many different types of audio files to be classified. The pre-processing of the extracted audio signal of the audio file to be classified may include framing and windowing operations.
可选的,本申请另外一个实施例提供的音频分类方法,所述S13之前,还包括:
A、获取已分类音频文件和已分类音频文件的分类标识信息;
可以理解的是,为了训练用于音频分类的神经网络模型,首先获取大量已分类音频文件,以及这些已分类音频文件各自对应的分类标识信息。可选的,所述分类标识信息包括但不限于:已分类音频文件所属于的流派和标签信息。比如,已分类音频文件所属于的流派包括:流行(Pop Music)、节奏布鲁斯(Rhythm&Blues,R&B)、说唱(Rap)、爵士(Jazz)、摇滚(Rock)以及乡村乐等;标签信息则更加自由和广泛,可以有多角度的标签信息,如音频文件属于抒情歌曲、催眠曲、安静或者躁动的音频、钢琴演奏的音频、古筝演奏的音频等等。
B. Process training audio signals to generate training vectors representing second audio features.
Here, the training audio signals are the audio signals of the already-classified audio files, and the second audio features are the audio features corresponding to the already-classified audio files.
Optionally, the process of processing the training audio signals to generate training vectors representing the second audio features is essentially the same as the process of step S12 described in the foregoing embodiments; the difference is that step S12 operates on audio files to be classified, whereas step B here operates on already-classified audio files. Step B is therefore not described in detail here; refer to the S12 portion of the foregoing embodiments.
C. Train a pre-established neural network model with the training vectors and the classification identification information corresponding to the training vectors, to obtain the neural network model for audio classification.
It can be understood that the embodiments of the present application require a neural network model to be established in advance. The pre-established neural network model may be a convolutional neural network model, or a neural network model formed by combining a convolutional recurrent neural network model with a convolutional neural network model. The pre-established neural network model is then trained with the training vectors and the corresponding classification identification information as input, to obtain the neural network model for audio classification. If the pre-established neural network model is a convolutional neural network model, or a combination of a convolutional recurrent neural network model and a convolutional neural network model, training the pre-established model chiefly means training its weights.
It should be noted that, when the neural network model is established, an appropriate activation function, loss function, and optimizer need to be selected according to the training vectors of the second audio features, so that the data can pass through multiple convolutional layers and multiple latent features can be uncovered.
Optionally, referring to FIG. 8, FIG. 8 is a structural diagram of a pre-established convolutional neural network model according to an embodiment of the present application. As shown in FIG. 8, the pre-established convolutional neural network model is a five-layer 2D convolutional model. For the scenario of selecting, from multiple genres, the single genre to which an already-classified audio file belongs, the activation function may be softmax and the loss function may be categorical cross-entropy; for the scenario of selecting tag information for an already-classified audio file, the activation function may be sigmoid and the loss function may be binary cross-entropy. Specifically, in FIG. 8, the number of convolutional layers of the model can be adjusted, BatchNormalization is optional, the pooling layers may use methods other than max pooling, and the ELU activation function may be replaced with other functions; the present application places no restriction on these choices.
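A minimal Keras sketch, under our own assumptions, of a five-layer 2D convolutional model in the spirit of FIG. 8; the filter counts, kernel sizes, input shape, and genre count are illustrative, while the softmax/categorical cross-entropy pairing for the genre scenario follows the text (the tag scenario would instead pair a sigmoid output with binary cross-entropy):

    from tensorflow import keras
    from tensorflow.keras import layers

    def build_genre_model(input_shape=(96, 1024, 1), n_genres=6):
        model = keras.Sequential()
        model.add(keras.Input(shape=input_shape))
        for filters in (32, 64, 128, 128, 256):        # five 2D conv layers
            model.add(layers.Conv2D(filters, (3, 3), padding="same"))
            model.add(layers.BatchNormalization())     # optional per the text
            model.add(layers.Activation("elu"))        # ELU, replaceable
            model.add(layers.MaxPooling2D((2, 2)))     # max pooling, replaceable
        model.add(layers.GlobalAveragePooling2D())
        model.add(layers.Dense(n_genres, activation="softmax"))
        model.compile(optimizer="adam",
                      loss="categorical_crossentropy",
                      metrics=["accuracy"])
        return model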
It should be noted that the pre-established convolutional neural network model shown in FIG. 8 is merely an example; those skilled in the art may, following the teaching of the embodiments of the present application, establish other similar neural network models for audio classification, and such variations still fall within the scope of protection of the present application.
In addition, the method by which step S12 in the foregoing embodiments generates the input vector and the method by which steps B and C in this embodiment generate the training vectors should be the same, so as to ensure that the input of the trained neural network model for audio classification matches the input vector obtained in S12.
Referring to FIG. 9, FIG. 9 is a flowchart of another audio classification method according to an embodiment of the present application. As shown in FIG. 9, the method includes:
S21: Obtain already-classified audio files and the classification identification information of the already-classified audio files.
Optionally, S21 includes:
obtaining the already-classified audio files, together with their tag information and the genres to which they belong.
S22: Process training audio signals to generate training vectors representing second audio features.
S23: Train a pre-established neural network model with the training vectors and the classification identification information corresponding to the training vectors, to obtain a neural network model for audio classification.
S24: Obtain an audio file to be classified.
S25: Process the audio signal of the audio file to be classified to generate an input vector representing first audio features.
S26: Input the input vector into the pre-trained neural network model for audio classification.
S27: Analyze the input vector through the neural network model to generate a classification result of the audio file to be classified.
Optionally, step S27 includes:
analyzing the input vector through the neural network model to generate the tag information of the audio file to be classified and the genre to which it belongs. Optionally, this means generating multiple items of tag information for the audio file to be classified, together with the single genre, determined from among multiple genres, to which the file belongs.
Optionally, the classification result may also be the probabilities of the respective tags matched by the audio file to be classified and the probability of the genre to which it belongs.
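A minimal inference sketch of S27 under our assumptions (two trained models as sketched earlier, with hypothetical genre and tag vocabularies): the genre is taken as the arg-max of the softmax probabilities, the matched tags are those whose sigmoid probabilities exceed a threshold, and the raw probabilities themselves can also be returned as the classification result, as described above:

    import numpy as np

    def classify(genre_model, tag_model, input_vector, genres, tags,
                 tag_threshold=0.5):
        x = input_vector[np.newaxis, ..., np.newaxis]  # add batch/channel dims
        genre_probs = genre_model.predict(x)[0]        # probability per genre
        tag_probs = tag_model.predict(x)[0]            # probability per tag
        genre = genres[int(np.argmax(genre_probs))]    # the single genre
        matched = [t for t, p in zip(tags, tag_probs) if p >= tag_threshold]
        return genre, matched, genre_probs, tag_probs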
In the technical solution provided by the embodiments of the present application, the audio file to be classified is classified on the basis of its own audio features, with the aid of a pre-trained neural network model for audio classification, rather than by relying on manually supplied meta information. Compared with the prior art, this is not only objective but also highly accurate, being little affected by subjective human factors; moreover, because automatic classification of audio files is achieved, working efficiency can be significantly improved. The technical solution provided by the embodiments of the present application can therefore overcome the limitations of the prior art, offers high reliability, and is better suited to practical application.
It can be understood that, because the technical solution provided by the embodiments of the present application can automatically classify audio files, for example by determining the individual tags of an audio file to be classified and the genre to which it belongs, these classification results allow audio application software (that is, application software providing audio playback services) to obtain relatively complete basic metadata for such files. This makes it convenient for the audio application software to perform personalized audio recommendation, audio classification management, content editing, and the like, and these processes can be executed automatically by the server of the audio application software. Especially for a service with a huge existing and growing library of audio songs, such as Tencent's JOOX (an audio application), this can save considerable manpower and time while maintaining high accuracy.
To set forth the technical solution provided by the present application more comprehensively, and corresponding to the audio classification method provided by the embodiments of the present application, the present application discloses an audio classification apparatus.
Referring to FIG. 10, FIG. 10 is a structural diagram of an audio classification apparatus according to an embodiment of the present application. As shown in FIG. 10, the apparatus includes:
an audio-file-to-be-classified obtaining module 11, configured to obtain an audio file to be classified;
an input vector generation module 12, configured to process the audio signal of the audio file to be classified to generate an input vector representing first audio features, the first audio features being the audio features corresponding to the audio file to be classified;
an input module 13, configured to input the input vector into a pre-trained neural network model for audio classification;
Optionally, the pre-trained neural network model for audio classification may be:
a convolutional neural network (CNN) model;
or a neural network model formed by combining a convolutional recurrent neural network (CRNN) model with a convolutional neural network model.
a classification result generation module 14, configured to analyze the input vector through the neural network model to generate a classification result of the audio file to be classified.
It can be understood that the input of the neural network model is the input vector and the output is the classification result of the audio file to be classified, and the input vector can represent first audio features of the audio file to be classified, such as timbre, rhythm, intensity, melody, harmony, and instrumentation. Thus, after the input vector is fed into the pre-trained neural network model for audio classification, the classification result generation module 14 analyzes the input vector through the pre-trained model, thereby determining at least these first audio features of the audio file to be classified, and finally generates the classification result of the audio file to be classified. The classification result is therefore determined from the audio features of the audio file itself (the first audio features) and does not depend on manually supplied meta information.
The audio classification apparatus provided by the embodiments of the present application classifies the audio file to be classified on the basis of its own audio features, with the aid of a pre-trained neural network model for audio classification, rather than by relying on manually supplied meta information. Compared with the prior art, this is not only objective but also highly accurate, being little affected by subjective human factors; moreover, because automatic classification of audio files is achieved, working efficiency can be significantly improved compared with the prior art. The audio classification apparatus provided by the embodiments of the present application can therefore overcome the limitations of the prior art, offers high reliability, and is better suited to practical application.
Optionally, in the embodiments of the present application, the input vector generation module 12 can be implemented in multiple ways; specific implementations of the input vector generation module 12 are described in detail below.
Referring to FIG. 11, FIG. 11 is a structural diagram of an input vector generation module according to an embodiment of the present application. As shown in FIG. 11, the module includes:
an audio signal extraction unit 121, configured to extract the audio signal of the audio file to be classified;
In one implementation, as shown in FIG. 11, the audio signal extraction unit 121 includes:
a mono conversion subunit 1211, configured to convert the audio file to be classified into a mono audio file;
a sampling subunit 1212, configured to adjust the sampling frequency of the mono audio file to a preset sampling frequency and sample the mono audio file at the preset sampling frequency, so as to extract the audio signal of the audio file to be classified;
It can be understood that an audio file to be classified (or an already-classified audio file) records a time-domain signal, which needs to be converted into a time–frequency signal in order to reduce the data size and filter out irrelevant information, facilitating subsequent training or classification by the neural network.
Optionally, the preset sampling frequency may be 12 kHz (kilohertz).
an input vector first generation unit 122, configured to subject the audio signal to a short-time Fourier transform and a mel-frequency conversion, and generate a mel-scale spectrogram representing the first audio features as the input vector.
Optionally, the input vector first generation unit 122 first preprocesses the extracted audio signal of the audio file to be classified, then subjects it to a short-time Fourier transform (STFT) to obtain the spectrogram of the audio signal, and then applies a mel-scale frequency conversion to the spectrogram, converting the actual frequencies into frequencies adapted to human auditory characteristics and generating a mel-scale spectrogram representing the first audio features as the input vector.
Here, preprocessing the extracted audio signal of the audio file to be classified may include framing and windowing operations; after the mel-scale frequency conversion is applied to the spectrogram, the logarithm of the amplitude may additionally be taken, so that low-amplitude components are raised relative to high-amplitude components, making it easier to observe periodic signals masked by low-amplitude noise.
Referring to FIG. 12, FIG. 12 is a structural diagram of an input vector generation module according to an embodiment of the present application. As shown in FIG. 12, the module includes:
an audio signal extraction unit 121, configured to extract the audio signal of the audio file to be classified;
In one implementation, as shown in FIG. 12, for the structure of the audio signal extraction unit 121, refer to the audio signal extraction unit 121 shown in FIG. 11; details are not repeated here.
an input vector second generation unit 123, configured to subject the audio signal to a short-time Fourier transform and generate a spectrogram representing the first audio features as the input vector.
Optionally, the input vector second generation unit 123 first preprocesses the extracted audio signal of the audio file to be classified and then subjects it to a short-time Fourier transform (STFT) to obtain the spectrogram of the audio signal, generating a spectrogram representing the first audio features as the input vector. Preprocessing the extracted audio signal may include framing and windowing operations. This approach directly uses the STFT spectrogram of the audio signal as the input vector; compared with the mel-scale spectrogram described above, it requires no mel-frequency conversion and therefore improves processing efficiency.
Referring to FIG. 13, FIG. 13 is a structural diagram of an input vector generation module according to an embodiment of the present application. As shown in FIG. 13, the module includes:
an audio signal extraction unit 121, configured to extract the audio signal of the audio file to be classified;
In one implementation, as shown in FIG. 13, for the structure of the audio signal extraction unit 121, refer to the audio signal extraction unit 121 shown in FIG. 11; details are not repeated here.
an input vector third generation unit 124, configured to subject the audio signal to a short-time Fourier transform, a mel-frequency conversion, and a mel-frequency cepstral coefficient conversion, and generate mel-frequency cepstral coefficients representing the first audio features as the input vector.
Optionally, the input vector third generation unit 124 first preprocesses the extracted audio signal of the audio file to be classified, then subjects it to a short-time Fourier transform (STFT) to obtain the spectrogram of the audio signal, then applies a mel-scale frequency conversion to the spectrogram, converting the actual frequencies into frequencies adapted to human auditory characteristics to obtain a mel-scale spectrogram, and then subjects the mel-scale spectrogram to a mel-frequency cepstral coefficient conversion, generating mel-frequency cepstral coefficients representing the first audio features as the input vector. This approach uses the mel-frequency cepstral coefficients representing the first audio features as the input vector; compared with the mel-scale spectrogram described above, it yields higher accuracy when the subsequent neural network model classifies the audio file to be classified.
Here, preprocessing the extracted audio signal of the audio file to be classified may include framing and windowing operations; after the mel-scale frequency conversion is applied to the spectrogram and before the mel-scale spectrogram is subjected to the mel-frequency cepstral coefficient conversion, the logarithm of the amplitude may additionally be taken, so that low-amplitude components are raised relative to high-amplitude components, making it easier to observe periodic signals masked by low-amplitude noise.
Referring to FIG. 14, FIG. 14 is a structural diagram of an input vector generation module according to an embodiment of the present application. As shown in FIG. 14, the module includes:
an audio signal extraction unit 121, configured to extract the audio signal of the audio file to be classified;
In one implementation, as shown in FIG. 14, for the structure of the audio signal extraction unit 121, refer to the audio signal extraction unit 121 shown in FIG. 11; details are not repeated here.
an input vector fourth generation unit 125, configured to subject the audio signal to a constant-Q transform and generate a spectrogram representing the first audio features as the input vector.
Optionally, the input vector fourth generation unit 125 first preprocesses the extracted audio signal of the audio file to be classified and then subjects it to a constant-Q transform to obtain the spectrogram of the audio signal, thereby generating a spectrogram representing the first audio features as the input vector. The most distinctive feature of the CQT spectrogram is that its frequency axis is on a log scale rather than a linear scale, and its window length varies with frequency, making it well suited to analyzing audio files to be classified of various different types. Preprocessing the extracted audio signal may include framing and windowing operations.
Optionally, the audio classification apparatus provided by another embodiment of the present application further includes:
an already-classified audio file and classification identification information obtaining module, configured to obtain already-classified audio files and the classification identification information of the already-classified audio files;
It can be understood that, in order to train the neural network model for audio classification, the already-classified audio file and classification identification information obtaining module is configured to obtain a large number of already-classified audio files, together with the classification identification information corresponding to each of these files. Optionally, the classification identification information includes, but is not limited to, the genre to which an already-classified audio file belongs and its tag information. For example, the genres include pop music, rhythm and blues (R&B), rap, jazz, rock, country, and so on; tag information is freer and broader and can describe a file from multiple angles, for example the singer of the audio file, its release era, ballads, lullabies, calm, agitated, and the like.
a training vector generation module, configured to process training audio signals to generate training vectors representing second audio features.
The training audio signals are the audio signals of the already-classified audio files, and the second audio features are the audio features corresponding to the already-classified audio files.
Optionally, the process by which the training vector generation module processes the training audio signals to generate training vectors representing the second audio features is essentially the same as the implementation of the input vector generation module 12 described in the foregoing embodiments; the difference is that the input vector generation module 12 operates on audio files to be classified, whereas the training vector generation module here operates on already-classified audio files. The training vector generation module is therefore not described in detail here; refer to the input vector generation module 12 portion of the foregoing embodiments.
a neural network model training module, configured to train a pre-established neural network model with the training vectors and the classification identification information corresponding to the training vectors, to obtain the neural network model for audio classification.
It can be understood that the embodiments of the present application require a neural network model to be established in advance. The pre-established neural network model may be a convolutional neural network model, or a neural network model formed by combining a convolutional recurrent neural network model with a convolutional neural network model. The pre-established neural network model is then trained with the training vectors and the corresponding classification identification information as input, to obtain the neural network model for audio classification. If the pre-established neural network model is a convolutional neural network model, or a combination of a convolutional recurrent neural network model and a convolutional neural network model, training the pre-established model chiefly means training its convolution kernels (also called weights).
It should be noted that, when the neural network model is established, an appropriate activation function, loss function, and optimizer need to be selected according to the training vectors of the second audio features, so that the data can pass through multiple convolutional layers and multiple latent features can be uncovered.
In addition, the method executed by the input vector generation module 12 in the foregoing embodiments to generate the input vector and the method executed by the training vector generation module in this embodiment to generate the training vectors should be the same, so as to ensure that the input of the trained neural network model for audio classification matches the input vector obtained by the input vector generation module 12.
Referring to FIG. 15, FIG. 15 is a structural diagram of another audio classification apparatus according to an embodiment of the present application. As shown in FIG. 15, the apparatus includes:
an already-classified audio file and classification identification information obtaining module 21, configured to obtain already-classified audio files and the classification identification information of the already-classified audio files;
Optionally, the already-classified audio file and classification identification information obtaining module 21 is specifically configured to:
obtain the already-classified audio files, together with their tag information and the genres to which they belong.
a training vector generation module 22, configured to process training audio signals to generate training vectors representing second audio features;
a neural network model training module 23, configured to train a pre-established neural network model with the training vectors and the classification identification information corresponding to the training vectors, to obtain a neural network model for audio classification;
an audio-file-to-be-classified obtaining module 24, configured to obtain an audio file to be classified;
an input vector generation module 25, configured to process the audio signal of the audio file to be classified to generate an input vector representing first audio features;
an input module 26, configured to input the input vector into the pre-trained neural network model for audio classification;
a classification result generation module 27, configured to analyze the input vector through the neural network model to generate a classification result of the audio file to be classified.
The classification result generation module 27 is specifically configured to:
analyze the input vector through the neural network model to generate the tag information of the audio file to be classified and the genre to which it belongs.
The audio classification apparatus provided by the embodiments of the present application classifies the audio file to be classified on the basis of its own audio features, with the aid of a pre-trained neural network model for audio classification, rather than by relying on manually supplied meta information. Compared with the prior art, this is not only objective but also highly accurate, being little affected by subjective human factors; moreover, because automatic classification of audio files is achieved, working efficiency can be significantly improved. The audio classification apparatus provided by the embodiments of the present application can therefore overcome the limitations of the prior art, offers high reliability, and is better suited to practical application.
To set forth the technical solution provided by the present application more comprehensively, and corresponding to the audio classification method provided by the embodiments of the present application, the present application discloses a smart device. The audio classification method provided by the present application can be applied to a smart device, which may be a computer, a server, or the like.
Referring to FIG. 16, FIG. 16 is a hardware structural diagram of a smart device according to an embodiment of the present application. As shown in FIG. 16, the smart device includes:
a processor 1, a communication interface 2, a memory 3, and a communication bus 4;
the processor 1, the communication interface 2, and the memory 3 communicate with one another through the communication bus 4;
the processor 1 is configured to execute a program;
the memory 3 is configured to store the program;
the program may include program code, the program code including computer operation instructions; in the embodiments of the present application, the program may include the program corresponding to the audio classification method described above.
The processor 1 may be a central processing unit (CPU), an application-specific integrated circuit (ASIC), or one or more integrated circuits configured to implement the embodiments of the present application.
The memory 3 may include high-speed RAM, and may further include non-volatile memory, for example at least one disk memory.
The program may be specifically configured to:
obtain an audio file to be classified;
process the audio signal of the audio file to be classified to generate an input vector representing first audio features, the first audio features being the audio features corresponding to the audio file to be classified;
input the input vector into a pre-trained neural network model for audio classification; and
analyze the input vector through the neural network model to generate a classification result of the audio file to be classified.
The smart device provided by the embodiments of the present application classifies the audio file to be classified on the basis of its own audio features, with the aid of a pre-trained neural network model for audio classification, rather than by relying on manually supplied meta information. Compared with the prior art, this is not only objective but also highly accurate, being little affected by subjective human factors; moreover, because automatic classification of audio files is achieved, working efficiency can be significantly improved. The smart device provided by the embodiments of the present application can therefore overcome the limitations of the prior art, offers high reliability, and is better suited to practical application.
In addition, an embodiment of the present application further provides a storage medium storing computer-executable instructions, the computer-executable instructions being configured to execute the audio classification method described in the foregoing embodiments.
An embodiment of the present application further provides a computer program product including instructions which, when run on a computer, cause the computer to execute the audio classification method described in the foregoing embodiments.
To set forth the technical solution provided by the present application more comprehensively, the hardware topology environment in which the audio classification method provided by the embodiments of the present application is applied is introduced below.
Referring to FIG. 17, FIG. 17 is a structural diagram of a hardware topology environment in which an audio classification method according to an embodiment of the present application is applied. As shown in FIG. 17, the environment includes a server 31 and a client 32 connected to the server 31, where the client 32 may be a computer terminal 321 or a mobile terminal 322.
The server 31 is configured to: obtain already-classified audio files and the classification identification information of the already-classified audio files; process training audio signals to generate training vectors representing second audio features, the training audio signals being the audio signals of the already-classified audio files and the second audio features being the audio features corresponding to the already-classified audio files; and train a pre-established neural network model with the training vectors and the classification identification information corresponding to the training vectors, to obtain a neural network model for audio classification.
Optionally, after training the neural network model for audio classification, the server 31 can classify new audio files (that is, audio files to be classified), in which case the server 31 is further configured to:
obtain an audio file to be classified; process the audio signal of the audio file to be classified to generate an input vector representing first audio features, the first audio features being the audio features corresponding to the audio file to be classified; input the input vector into the pre-trained neural network model for audio classification; and analyze the input vector through the neural network model to generate a classification result of the audio file to be classified.
In addition, after training the neural network model for audio classification, the server 31 can deploy the model on the client 32. The client 32 may be a client local to the server, such as a client local to a merchant providing audio classification software services, or may be a user's client. In this case, the client 32 may be configured to:
obtain an audio file to be classified; process the audio signal of the audio file to be classified to generate an input vector representing first audio features, the first audio features being the audio features corresponding to the audio file to be classified; input the input vector into the pre-trained neural network model for audio classification; and analyze the input vector through the neural network model to generate a classification result of the audio file to be classified.
That is to say, a client 32 configured with the neural network model for audio classification can classify new (for example, newly released) audio files independently of the server. For example, if a user downloads a song, the user can classify it with his or her own client (such as the user's mobile terminal or computer), and the song is then assigned to the resulting category (such as rock music); when the user subsequently listens to songs in that category, the song is automatically included, which can effectively improve the user experience.
It should be noted that the client 32 may also send new audio files to the server 31, which classifies them and returns the classification results to the client 32.
From the above, it can be seen that the hardware topology environment in which the audio classification method provided by the embodiments of the present application is applied classifies audio files on the basis of their own audio features, with the aid of a pre-trained neural network model for audio classification, rather than by relying on manually supplied meta information. Compared with the prior art, this is not only objective but also highly accurate and little affected by subjective human factors; moreover, because automatic classification of audio files is achieved, it can significantly improve working efficiency and also helps improve the user experience.
Finally, it should also be noted that, in this document, relational terms such as "first" and "second" are used merely to distinguish one entity or operation from another and do not necessarily require or imply any actual relationship or order between those entities or operations. Moreover, the terms "comprise", "include", or any other variant thereof are intended to cover non-exclusive inclusion, so that a process, method, article, or smart device that includes a list of elements includes not only those elements but also other elements not expressly listed, or further includes elements inherent to such a process, method, article, or smart device. In the absence of further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of additional identical elements in the process, method, article, or smart device that includes the element.
The embodiments in this specification are described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and for the same or similar parts of the embodiments, the embodiments can be referred to one another. As for the apparatus, smart device, and storage medium disclosed in the embodiments, since they correspond to the methods disclosed in the embodiments, their descriptions are relatively brief; refer to the description of the method parts where relevant.
Those skilled in the art may further realize that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of the two. To clearly illustrate the interchangeability of hardware and software, the composition and steps of the examples have been described above generally in terms of their functions. Whether these functions are executed in hardware or software depends on the specific application and design constraints of the technical solution. Skilled practitioners may use different methods to implement the described functions for each particular application, but such implementations should not be considered beyond the scope of the present application.
The steps of the methods or algorithms described in connection with the embodiments disclosed herein can be implemented directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in random access memory (RAM), internal memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, or any other form of storage medium known in the technical field.
The above description of the disclosed embodiments enables those skilled in the art to implement or use the present application. Various modifications to these embodiments will be apparent to those skilled in the art, and the general principles defined herein may be implemented in other embodiments without departing from the spirit or scope of the present application. Therefore, the present application will not be limited to the embodiments shown herein, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (16)

  1. An audio classification method, applied to a smart device, comprising:
    obtaining an audio file to be classified;
    processing an audio signal of the audio file to be classified to generate an input vector representing first audio features, the first audio features being audio features corresponding to the audio file to be classified;
    inputting the input vector into a pre-trained neural network model for audio classification; and
    analyzing the input vector through the neural network model to generate a classification result of the audio file to be classified.
  2. The method according to claim 1, wherein the processing the audio signal of the audio file to be classified to generate the input vector representing the first audio features comprises:
    extracting the audio signal of the audio file to be classified; and
    subjecting the audio signal to a short-time Fourier transform and a mel-frequency conversion to generate a mel-scale spectrogram representing the first audio features as the input vector.
  3. The method according to claim 1, wherein the processing the audio signal of the audio file to be classified to generate the input vector representing the first audio features comprises:
    extracting the audio signal of the audio file to be classified; and
    subjecting the audio signal to a short-time Fourier transform to generate a spectrogram representing the first audio features as the input vector.
  4. The method according to any one of claims 2 to 3, wherein the extracting the audio signal of the audio file to be classified comprises:
    converting the audio file to be classified into a mono audio file; and
    adjusting the sampling frequency of the mono audio file to a preset sampling frequency, and sampling the mono audio file at the preset sampling frequency, so as to extract the audio signal of the audio file to be classified.
  5. The method according to claim 1, further comprising, before the inputting the input vector into the pre-trained neural network model:
    obtaining already-classified audio files and classification identification information of the already-classified audio files;
    processing training audio signals to generate training vectors representing second audio features, the training audio signals being audio signals of the already-classified audio files, and the second audio features being audio features corresponding to the already-classified audio files; and
    training a pre-established neural network model with the training vectors and the classification identification information corresponding to the training vectors, to obtain the neural network model for audio classification.
  6. The method according to claim 5, wherein the obtaining already-classified audio files and classification identification information of the already-classified audio files comprises:
    obtaining the already-classified audio files, together with tag information of the already-classified audio files and genres to which the already-classified audio files belong;
    and the analyzing the input vector through the neural network model to generate the classification result of the audio file to be classified comprises:
    analyzing the input vector through the neural network model to generate tag information of the audio file to be classified and a genre to which the audio file to be classified belongs.
  7. The method according to any one of claims 1 to 3, wherein the neural network model is:
    a convolutional neural network model;
    or a neural network model formed by combining a convolutional recurrent neural network model with a convolutional neural network model.
  8. An audio classification apparatus, comprising:
    an audio-file-to-be-classified obtaining module, configured to obtain an audio file to be classified;
    an input vector generation module, configured to process an audio signal of the audio file to be classified to generate an input vector representing first audio features, the first audio features being audio features corresponding to the audio file to be classified;
    an input module, configured to input the input vector into a pre-trained neural network model for audio classification; and
    a classification result generation module, configured to analyze the input vector through the neural network model to generate a classification result of the audio file to be classified.
  9. The apparatus according to claim 8, wherein the input vector generation module comprises:
    an audio signal extraction unit, configured to extract the audio signal of the audio file to be classified; and
    an input vector first generation unit, configured to subject the audio signal to a short-time Fourier transform and a mel-frequency conversion, and generate a mel-scale spectrogram representing the first audio features as the input vector.
  10. The apparatus according to claim 8, wherein the input vector generation module comprises:
    an audio signal extraction unit, configured to extract the audio signal of the audio file to be classified; and
    an input vector second generation unit, configured to subject the audio signal to a short-time Fourier transform, and generate a spectrogram representing the first audio features as the input vector.
  11. The apparatus according to any one of claims 9 to 10, wherein the audio signal extraction unit comprises:
    a mono conversion subunit, configured to convert the audio file to be classified into a mono audio file; and
    a sampling subunit, configured to adjust the sampling frequency of the mono audio file to a preset sampling frequency, and sample the mono audio file at the preset sampling frequency, so as to extract the audio signal of the audio file to be classified.
  12. The apparatus according to claim 8, further comprising:
    an already-classified audio file and classification identification information obtaining module, configured to obtain already-classified audio files and classification identification information of the already-classified audio files;
    a training vector generation module, configured to process training audio signals to generate training vectors representing second audio features, the training audio signals being audio signals of the already-classified audio files, and the second audio features being audio features corresponding to the already-classified audio files; and
    a neural network model training module, configured to train a pre-established neural network model with the training vectors and the classification identification information corresponding to the training vectors, to obtain the neural network model for audio classification.
  13. The apparatus according to claim 12, wherein the already-classified audio file and classification identification information obtaining module is configured to:
    obtain the already-classified audio files, together with tag information of the already-classified audio files and genres to which the already-classified audio files belong;
    and the classification result generation module is configured to:
    analyze the input vector through the neural network model to generate tag information of the audio file to be classified and a genre to which the audio file to be classified belongs.
  14. A smart device, comprising:
    a processor and a memory, the processor being connected to the memory through a communication bus;
    wherein the processor is configured to call and execute a program stored in the memory; and
    the memory is configured to store the program, the program being configured at least to execute the audio classification method according to any one of claims 1 to 7.
  15. A storage medium, storing computer-executable instructions, the computer-executable instructions being configured to execute the audio classification method according to any one of claims 1 to 7.
  16. A computer program product comprising instructions which, when run on a computer, cause the computer to execute the audio classification method according to any one of claims 1 to 7.
PCT/CN2018/115544 2017-12-05 2018-11-15 Audio classification method and apparatus, smart device and storage medium WO2019109787A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201711265842.X 2017-12-05
CN201711265842.XA CN110019931B (zh) 2017-12-05 2017-12-05 Audio classification method and apparatus, smart device and storage medium

Publications (1)

Publication Number Publication Date
WO2019109787A1 true WO2019109787A1 (zh) 2019-06-13

Family

ID=66750762

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/115544 WO2019109787A1 (zh) 2017-12-05 2018-11-15 Audio classification method and apparatus, smart device and storage medium

Country Status (2)

Country Link
CN (1) CN110019931B (zh)
WO (1) WO2019109787A1 (zh)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111508480A (zh) * 2020-04-20 2020-08-07 网易(杭州)网络有限公司 Training method for an audio recognition model, audio recognition method, apparatus and device

Families Citing this family (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110580914A (zh) * 2019-07-24 2019-12-17 安克创新科技股份有限公司 Audio processing method and device, and apparatus with storage function
CN110827837B (zh) * 2019-10-18 2022-02-22 中山大学 Whale activity audio classification method based on deep learning
CN110929087A (zh) * 2019-10-21 2020-03-27 量子云未来(北京)信息科技有限公司 Audio classification method and apparatus, electronic device and storage medium
CN111179971A (zh) * 2019-12-03 2020-05-19 杭州网易云音乐科技有限公司 Lossless audio detection method and apparatus, electronic device and storage medium
CN111048099A (zh) * 2019-12-16 2020-04-21 随手(北京)信息技术有限公司 Sound source identification method and apparatus, server and storage medium
CN111081275B (zh) * 2019-12-20 2023-05-26 惠州TCL移动通信有限公司 Terminal processing method and apparatus based on sound analysis, storage medium and terminal
CN111415644B (zh) * 2020-03-26 2023-06-20 腾讯音乐娱乐科技(深圳)有限公司 Audio soothing-degree prediction method and apparatus, server and storage medium
CN111488486B (zh) * 2020-04-20 2021-08-17 武汉大学 Electronic music classification method and system based on multi-sound-source separation
CN111968670A (zh) * 2020-08-19 2020-11-20 腾讯音乐娱乐科技(深圳)有限公司 Audio recognition method and apparatus
CN112148754A (zh) * 2020-09-01 2020-12-29 腾讯音乐娱乐科技(深圳)有限公司 Song recognition method and apparatus
CN112165634B (zh) * 2020-09-29 2022-09-16 北京百度网讯科技有限公司 Method for building an audio classification model, and method and apparatus for automatically converting video
CN112237740B (zh) * 2020-10-26 2024-03-15 网易(杭州)网络有限公司 Beat data extraction method and apparatus, electronic device and computer-readable medium
CN112466298B (zh) * 2020-11-24 2023-08-11 杭州网易智企科技有限公司 Voice detection method and apparatus, electronic device and storage medium
CN113421585A (zh) * 2021-05-10 2021-09-21 云境商务智能研究院南京有限公司 Audio fingerprint library generation method and apparatus
CN113450828A (zh) * 2021-06-25 2021-09-28 平安科技(深圳)有限公司 Music genre identification method, apparatus, device and storage medium
CN114333908B (zh) * 2021-12-29 2022-09-30 广州方硅信息技术有限公司 Online audio classification method and apparatus, and computer device

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105161092A (zh) * 2015-09-17 2015-12-16 百度在线网络技术(北京)有限公司 Speech recognition method and apparatus
CN105427858A (zh) * 2015-11-06 2016-03-23 科大讯飞股份有限公司 Method and system for automatic speech classification
CN105788592A (zh) * 2016-04-28 2016-07-20 乐视控股(北京)有限公司 Audio classification method and apparatus
CN105895087A (zh) * 2016-03-24 2016-08-24 海信集团有限公司 Speech recognition method and apparatus

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
TWI297486B (en) * 2006-09-29 2008-06-01 Univ Nat Chiao Tung Intelligent classification of sound signals with applicaation and method
CN103854646B (zh) * 2014-03-27 2018-01-30 成都康赛信息技术有限公司 Method for automatic classification of digital audio
US20170140260A1 (en) * 2015-11-17 2017-05-18 RCRDCLUB Corporation Content filtering with convolutional neural networks
US10460747B2 (en) * 2016-05-10 2019-10-29 Google Llc Frequency based audio analysis using neural networks
CN106407960A (zh) * 2016-11-09 2017-02-15 浙江师范大学 Multi-feature-based music genre classification method and system

Also Published As

Publication number Publication date
CN110019931B (zh) 2023-01-24
CN110019931A (zh) 2019-07-16

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 18887129; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 Ep: pct application non-entry in european phase (Ref document number: 18887129; Country of ref document: EP; Kind code of ref document: A1)