CN112712820A - Tone classification method, device, equipment and medium - Google Patents

Tone classification method, device, equipment and medium

Info

Publication number
CN112712820A
CN112712820A (application CN202011565974.6A)
Authority
CN
China
Prior art keywords
classified
feature
audio file
parameter
tone
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011565974.6A
Other languages
Chinese (zh)
Inventor
汪暾
马金龙
熊佳
罗箫
焦南凯
徐志坚
谢睿
陈光尧
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangzhou Huancheng Culture Media Co ltd
Original Assignee
Guangzhou Huancheng Culture Media Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangzhou Huancheng Culture Media Co ltd
Priority to CN202011565974.6A
Publication of CN112712820A
Legal status: Pending

Classifications

    • G10L 25/24: Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00, characterised by the type of extracted parameters, the extracted parameters being the cepstrum
    • G06F 18/214: Pattern recognition; generating training patterns; bootstrap methods, e.g. bagging or boosting
    • G06F 18/24: Pattern recognition; classification techniques
    • G06F 18/253: Pattern recognition; fusion techniques of extracted features
    • G06N 3/045: Neural networks; combinations of networks
    • G06N 3/08: Neural networks; learning methods
    • G06V 10/40: Image or video recognition or understanding; extraction of image or video features
    • G10L 25/30: Speech or voice analysis techniques characterised by the analysis technique, using neural networks
    • G10L 25/51: Speech or voice analysis techniques specially adapted for particular use, for comparison or discrimination

Abstract

The application discloses a timbre classification method, apparatus, device, and medium, wherein the method comprises: acquiring an audio file to be classified; extracting a first feature parameter and a second feature parameter of the audio file to be classified; performing feature fusion on the first and second feature parameters, and converting the fused feature parameters into an image to obtain the image to be classified; and inputting the image to be classified into a preset convolutional neural network model for timbre classification, and outputting a timbre classification result of the audio file to be classified. This solves the technical problem of the prior art, where timbre classification based on a single extracted feature parameter can hardly distinguish different timbre classes effectively for a given classification task, and misclassification occurs easily when intonation or speaking rate changes or noise interference appears, resulting in low classification accuracy.

Description

Tone classification method, device, equipment and medium
Technical Field
The present application relates to the field of audio processing technologies, and in particular, to a method, an apparatus, a device, and a medium for classifying timbres.
Background
Language is a communication mode unique to human beings; besides the information a speaker intends to express, it carries a large amount of hidden information, including the speaker's identity. Each individual has unique vocal organs, and the vocal organs of different individuals differ markedly. In everyday life, natural attributes such as a speaker's age and gender, as well as social attributes such as language type and regional cultural background, can be roughly judged from the voice alone, which shows that a speech signal carries a large amount of information. Because each person's voice information is unique and reflects that person's attributes, voice information can be used to classify people, for example by gender, age, or timbre.
In the prior art, timbre classification mainly relies on a single extracted feature. For a given classification task, a single feature parameter can hardly distinguish different timbre classes effectively; moreover, when intonation or speaking rate changes or noise interference appears, misclassification occurs easily, so the accuracy of timbre classification is low.
Disclosure of Invention
The application provides a timbre classification method, apparatus, device, and medium, intended to solve the technical problem that, in the prior art, timbre classification by a single extracted feature parameter can hardly distinguish different timbre classes effectively for a given classification task, and misclassification occurs easily when intonation or speaking rate changes or noise interference appears, resulting in low classification accuracy.
In view of the above, a first aspect of the present application provides a method for classifying timbres, including:
acquiring an audio file to be classified;
extracting a first characteristic parameter and a second characteristic parameter of the audio file to be classified;
performing feature fusion on the first feature parameter and the second feature parameter of the audio file to be classified, and converting the feature parameters obtained after fusion into images to obtain images to be classified;
and inputting the image to be classified into a preset convolution neural network model for tone classification, and outputting a tone classification result of the audio file to be classified.
Optionally, the configuration process of the preset convolutional neural network model is as follows:
acquiring an audio file to be trained;
extracting a first characteristic parameter and a second characteristic parameter of the audio file to be trained;
performing feature fusion on the first feature parameter and the second feature parameter of the audio file to be trained, and converting the feature parameters obtained after fusion into images to obtain images to be trained;
performing tone label marking on the image to be trained to obtain a training sample;
and inputting the training sample into a convolutional neural network for training to obtain the preset convolutional neural network model.
Optionally, the convolutional neural network is VGG, ResNet, MobileNet, or ShuffleNet.
Optionally, the number of feature parameters in each of the first feature parameter and the second feature parameter is one, or two or more.
Optionally, when the number of feature parameters in each of the first feature parameter and the second feature parameter is one, the first feature parameter is a linear prediction cepstral coefficient and the second feature parameter is a Mel-frequency cepstral coefficient.
A second aspect of the present application provides a tone color classification apparatus, comprising:
the acquisition unit is used for acquiring the audio files to be classified;
the extraction unit is used for extracting a first characteristic parameter and a second characteristic parameter of the audio file to be classified;
the conversion unit is used for performing feature fusion on the first feature parameter and the second feature parameter of the audio file to be classified, and converting the feature parameters obtained after fusion into images to obtain images to be classified;
and the classification unit is used for inputting the image to be classified into a preset convolution neural network model for tone classification and outputting a tone classification result of the audio file to be classified.
Optionally, the configuration process of the preset convolutional neural network model is as follows:
acquiring an audio file to be trained;
extracting a first characteristic parameter and a second characteristic parameter of the audio file to be trained;
performing feature fusion on the first feature parameter and the second feature parameter of the audio file to be trained, and converting the feature parameters obtained after fusion into images to obtain images to be trained;
performing tone label marking on the image to be trained to obtain a training sample;
and inputting the training sample into a convolutional neural network for training to obtain the preset convolutional neural network model.
Optionally, the number of feature parameters in each of the first feature parameter and the second feature parameter is one, or two or more.
A third aspect of the present application provides a timbre classification device comprising a processor and a memory;
the memory is used for storing program codes and transmitting the program codes to the processor;
the processor is configured to perform the method for tone color classification according to any one of the first aspect according to instructions in the program code.
A fourth aspect of the present application provides a computer-readable storage medium for storing program code for performing the method for classifying timbres according to any one of the first aspects.
According to the technical scheme, the method has the following advantages:
the application provides a tone classification method, which comprises the following steps: acquiring an audio file to be classified; extracting a first characteristic parameter and a second characteristic parameter of the acquired audio file to be classified; performing feature fusion on a first feature parameter and a second feature parameter of the audio file to be classified, and converting the feature parameters obtained after fusion into images to obtain the images to be classified; and inputting the image to be classified into a preset convolution neural network model for tone classification, and outputting a tone classification result of the audio file to be classified.
According to the method and the device, the first characteristic parameter and the second characteristic parameter of the audio file to be classified are extracted, the two characteristic parameters are fused and then subjected to tone classification, different types of tones can be effectively distinguished, and tone classification accuracy is improved; the characteristic parameters obtained after fusion are converted into the image, and then the preset convolutional neural network model is used for extracting the characteristic parameters in the image and the characteristic information useful for classification, so that the influence caused by the change of the intonation and the speed of speech or noise interference can be effectively avoided, the accuracy and the robustness of tone classification are further improved, and the technical problem that in the prior art, the tone classification is carried out by extracting single characteristic parameters, the condition that different types of tones are difficult to be effectively distinguished through single characteristic parameters for a certain specific tone classification task, and the condition that mis-classification is easy to occur when the intonation and the speed of speech are changed or the noise interference occurs, and the tone classification accuracy is low is solved.
Drawings
In order to illustrate the embodiments of the present application or the technical solutions in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description show only some embodiments of the present application, and those skilled in the art can derive other drawings from them without inventive effort.
Fig. 1 is a schematic flow chart of a tone classification method according to an embodiment of the present application;
fig. 2 is a schematic structural diagram of a tone color classification apparatus according to an embodiment of the present application;
fig. 3 is a schematic diagram of a feature parameter fusion method according to an embodiment of the present application;
fig. 4 is a schematic diagram of another feature parameter fusion method provided in the embodiment of the present application.
Detailed Description
The application provides a timbre classification method, apparatus, device, and medium to solve the technical problem that, in the prior art, timbre classification by a single extracted feature parameter can hardly distinguish different timbre classes effectively for a given classification task, and misclassification occurs easily when intonation or speaking rate changes or noise interference appears, resulting in low classification accuracy.
In order to make the technical solutions of the present application better understood, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
For easy understanding, referring to fig. 1, an embodiment of a method for classifying timbres provided by the present application includes:
Step 101: obtain an audio file to be classified.
The audio file to be classified may be obtained through a recording device; for example, an audio file recorded and uploaded by a user through a music service may be used as the audio file to be classified.
Further, invalid audio may be screened out of the obtained audio files to be classified, for example files whose duration is too short.
Step 102: extract a first feature parameter and a second feature parameter of the audio file to be classified.
Feature parameters of the audio file to be classified are extracted to obtain the first feature parameter and the second feature parameter; the feature parameters may be the pitch period, linear prediction cepstral coefficients (LPCC), Mel-frequency cepstral coefficients (MFCC), or other known feature parameters.
In the embodiment of the present application, the number of feature parameters in each of the first and second feature parameters may be one, or two or more. When that number is one, the first feature parameter is preferably the LPCC and the second the MFCC. The extraction of LPCC and MFCC feature parameters belongs to the prior art and is not described in detail here.
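By way of illustration only, the following Python sketch extracts per-frame MFCC and LPCC features with librosa and NumPy; the frame length, hop size, coefficient order, and helper names are assumptions made for the example, not values fixed by this application.

```python
import numpy as np
import librosa

def lpcc(frame, order=12, n_coeff=12):
    """LPCC via the standard LPC-to-cepstrum recursion.

    librosa.lpc returns the prediction-error filter [1, a_1, ..., a_p];
    the classic recursion below uses the negated predictor coefficients.
    """
    a = -librosa.lpc(frame.astype(float), order=order)[1:]
    c = np.zeros(n_coeff)
    for m in range(1, n_coeff + 1):
        acc = a[m - 1] if m <= order else 0.0
        for k in range(max(1, m - order), m):
            acc += (k / m) * c[k - 1] * a[m - k - 1]
        c[m - 1] = acc
    return c

def extract_features(path, n_coeff=12, frame_len=1024, hop=512):
    """Per-frame LPCC and MFCC matrices for one audio file."""
    y, sr = librosa.load(path, sr=None)
    y = librosa.effects.preemphasis(y)           # pre-emphasis, as in the text
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_coeff,
                                n_fft=frame_len, hop_length=hop)
    frames = librosa.util.frame(y, frame_length=frame_len, hop_length=hop)
    lpccs = np.stack([lpcc(f, order=n_coeff, n_coeff=n_coeff)
                      for f in frames.T], axis=1)
    return lpccs, mfcc                           # both shaped (n_coeff, n_frames)
```

The LPCC helper converts LPC coefficients to cepstra with the standard recursion; any equivalent extraction method would serve the same purpose.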
Step 103: perform feature fusion on the first and second feature parameters of the audio file to be classified, and convert the fused feature parameters into an image to obtain the image to be classified.
Feature fusion is performed on the first and second feature parameters of the audio file to be classified to obtain fused feature parameters, and timbre classification could in principle be performed directly on them. To further improve accuracy, however, this embodiment converts the fused feature parameters into an image, obtaining the image to be classified, and performs timbre classification on that image with the preset convolutional neural network model.
The specific feature-fusion process is described taking the LPCC and MFCC feature parameters as the first and second feature parameters. The two can be fused based on the Fisher criterion, whose basic principle is to find a projection subspace of the feature vector space in which the feature points are optimally separated. The Fisher ratio, a frequently used feature-selection measure, is derived from the Fisher criterion: the larger the Fisher ratio of a feature dimension, the more effectively and accurately that dimension reflects the characteristic information of the speech signal.
1. Scheme one for fusing the LPCC and MFCC feature parameters based on the Fisher criterion (see Fig. 3):
The LPCC and MFCC feature parameters are fused first, and Fisher-ratio selection is applied afterwards. The specific process is as follows:
pre-emphasis, framing and other pre-processing are carried out on the voice signals s (n) in the audio file to be classified, and then the LPCC characteristic parameter C of each frame of voice signal s (n) is obtained through calculationn(N ═ 1,2,. cndot., N) and MFCC characteristic parameters cmel (k), k ═ 1,2,. cndot., N. Construction of LPCC characteristic parameter sequence C based on LPCC characteristic parameterslpcc={c1,c2,...cN}; MFCC characteristic parameter sequence C constructed based on MFCC characteristic parametersmfcc={Cmel(1),Cmel(2),...,Cmel(N)}。
Fusing the LPCC characteristic parameter sequence and the MFCC characteristic parameter sequence to obtain a fused characteristic parameter sequence Cnew={Clpcc,Cmfcc}={c1,c2,...cN,Cmel(1),Cmel(2),...,Cmel(N)}。
According to the feature selection calculation formula of Fisher criterion, the fused feature parameter sequence C is solvednewThe Fisher ratio of the characteristic parameters of each dimension in the (A) is determined. Selecting the characteristic parameters of Fisher ratio larger than a preset threshold value to form a final fused characteristic parameter sequence C'newI.e. the fusion characteristic parameters. The preset threshold may be selected according to actual situations, and is not specifically limited herein.
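In code, scheme one reduces to a per-dimension ratio of between-class to within-class variance followed by thresholding. A minimal NumPy sketch, assuming feature matrices C_lpcc and C_mfcc of shape (n_samples, N) (one row per labeled frame or file), integer class labels, and a threshold tuned on training data; all names are hypothetical:

```python
import numpy as np

def fisher_ratio(X, y):
    """Per-dimension Fisher ratio: between-class variance divided by
    within-class variance. X: (n_samples, n_dims); y: class labels."""
    mu = X.mean(axis=0)
    between = np.zeros(X.shape[1])
    within = np.zeros(X.shape[1])
    for c in np.unique(y):
        Xc = X[y == c]
        mu_c = Xc.mean(axis=0)
        between += len(Xc) * (mu_c - mu) ** 2
        within += ((Xc - mu_c) ** 2).sum(axis=0)
    return between / np.maximum(within, 1e-12)

# Scheme one: concatenate first, then keep high-Fisher-ratio dimensions.
C_new = np.concatenate([C_lpcc, C_mfcc], axis=1)      # (n_samples, 2N)
C_new_selected = C_new[:, fisher_ratio(C_new, labels) > threshold]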
2. Scheme two for fusing the LPCC and MFCC feature parameters based on the Fisher criterion (see Fig. 4):
Fisher-ratio selection is applied to each set of feature parameters first, and fusion is performed afterwards. The specific process is as follows:
pre-emphasis, framing and other pre-processing are carried out on the voice signals s (n) in the audio file to be classified, and then the LPCC characteristic parameter C of each frame of voice signal s (n) is obtained through calculationn(N ═ 1,2,. cndot., N) and MFCC characteristic parameters cmel (k), k ═ 1,2,. cndot., N. Construction of LPCC characteristic parameter sequence C based on LPCC characteristic parameterslpcc={c1,c2,...cN}; MFCC characteristic parameter sequence C constructed based on MFCC characteristic parametersmfcc={Cmel(1),Cmel(2),...,Cmel(N)}。
Obtaining an LPCC characteristic parameter sequence C according to a characteristic selection calculation formula of a Fisher criterionlpccAnd MFCC characteristic parameter sequence CmfccSelecting the characteristic parameters of the Fisher ratio which is greater than a preset threshold value from the Fisher ratio of the characteristic parameters of each dimension to obtain a screened LPCC characteristic parameter sequence C'lpccAnd MFCC characteristic parameter sequence C'mfcc
C 'of the screened LPCC characteristic parameter sequence'lpccAnd MFCC characteristic parameter sequence C'mfccCarrying out fusion to obtain a fused characteristic parameter sequence C'new={C'lpcc,C'mfccAnd f, fusing characteristic parameters.
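Under the same assumptions, scheme two merely swaps the order of selection and concatenation, reusing the fisher_ratio helper from the sketch above:

```python
# Scheme two: select within each feature set first, then fuse.
keep_lpcc = fisher_ratio(C_lpcc, labels) > threshold
keep_mfcc = fisher_ratio(C_mfcc, labels) > threshold
C_new_selected = np.concatenate(
    [C_lpcc[:, keep_lpcc], C_mfcc[:, keep_mfcc]], axis=1)
```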
Step 104: input the image to be classified into the preset convolutional neural network model for timbre classification, and output the timbre classification result of the audio file to be classified.
The converted image to be classified is input into the preset convolutional neural network model for timbre classification, and the timbre classification result of the audio file to be classified is output; the result can then be used for personalized recommendation services and the like. The strong fitting and characterization capabilities of the convolutional neural network model automatically extract, from the multiple feature parameters in the image, the information and features most useful for the classification task, yielding higher accuracy and robustness of timbre classification.
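The text does not fix how the fused parameters become an image; one plausible reading, sketched below under that assumption, treats the fused feature matrix as a grayscale image, normalizes it to 8 bits, resizes it, and runs a PyTorch forward pass (fused_feats and the trained model are hypothetical names):

```python
import numpy as np
import torch
from PIL import Image

def features_to_image(feat, size=(224, 224)):
    """Min-max normalize a 2-D fused feature matrix to an 8-bit image."""
    f = (feat - feat.min()) / (feat.max() - feat.min() + 1e-12)
    return Image.fromarray((f * 255).astype(np.uint8)).resize(size)

img = features_to_image(fused_feats)
x = torch.from_numpy(np.asarray(img, dtype=np.float32) / 255.0)
x = x.unsqueeze(0).unsqueeze(0).repeat(1, 3, 1, 1)  # replicate to 3 channels
                                                    # for a stock CNN backbone
with torch.no_grad():
    predicted_class = model(x).argmax(dim=1)        # index of the timbre label
```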
Further, the preset convolutional neural network model is a trained convolutional neural network model, and the training process is as follows:
acquiring an audio file to be trained; extracting a first characteristic parameter and a second characteristic parameter of an audio file to be trained; performing feature fusion on a first feature parameter and a second feature parameter of an audio file to be trained, and converting the feature parameters obtained after fusion into images to obtain images to be trained; performing tone label marking on an image to be trained to obtain a training sample; and inputting the training sample into a convolutional neural network for training to obtain a preset convolutional neural network model.
In the embodiment of the present application, the audio file to be trained is processed in the same way as the audio file to be classified, except that after the two feature parameters of the audio file to be trained are fused and converted into an image, the resulting image to be trained must be labeled with a timbre label to obtain a training sample for training the convolutional neural network. The timbre label categories may include: Zhengtai, youth, uncle, old age, Luoli, maiden, Yujie, or Ma, etc.
The training samples are input into the convolutional neural network, which outputs a timbre prediction for each sample; a loss value is computed by a loss function from the prediction and the sample's label, and the network parameters are updated accordingly until convergence. The trained convolutional neural network model thus obtained serves as the preset convolutional neural network model for timbre classification.
Further, the convolutional neural network in the embodiment of the present application may be a network structure such as VGG, ResNet, MobileNet, or ShuffleNet.
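For concreteness, here is a skeletal PyTorch training loop using the MobileNet variant named above; the data loader, class count, epoch count, and learning rate are illustrative assumptions rather than values fixed by this application:

```python
import torch
import torch.nn as nn
from torchvision import models

# Assumptions: `train_loader` yields (image, label) batches where images are
# 3-channel tensors, and there are 8 timbre labels as enumerated in the text.
model = models.mobilenet_v2(num_classes=8)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

model.train()
for epoch in range(20):                     # run until convergence in practice
    for images, labels in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(images), labels)  # loss from prediction vs label
        loss.backward()
        optimizer.step()
```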
In the embodiment of the application, the first and second feature parameters of the audio file to be classified are extracted and fused before timbre classification, so different timbre classes can be distinguished effectively and classification accuracy improves. Converting the fused feature parameters into an image and letting the preset convolutional neural network model extract the feature information useful for classification effectively mitigates the influence of intonation or speaking-rate changes and noise interference, further improving the accuracy and robustness of timbre classification. This solves the technical problem of the prior art, where classification based on a single extracted feature parameter can hardly distinguish different timbre classes effectively for a given task and misclassifies easily under intonation or speaking-rate changes or noise interference, resulting in low accuracy.
The above is an embodiment of a method for classifying timbre provided by the present application, and the following is an embodiment of a device for classifying timbre provided by the present application.
Referring to fig. 2, an embodiment of a timbre classification apparatus provided in the present application includes:
an obtaining unit 201, configured to obtain an audio file to be classified;
the extracting unit 202 is configured to extract a first feature parameter and a second feature parameter of an audio file to be classified;
the conversion unit 203 is configured to perform feature fusion on a first feature parameter and a second feature parameter of the audio file to be classified, and convert the feature parameters obtained after the feature fusion into an image to obtain an image to be classified;
and the classifying unit 204 is used for inputting the image to be classified into a preset convolutional neural network model for tone classification, and outputting a tone classification result of the audio file to be classified.
As a further improvement, the configuration process of the preset convolutional neural network model is as follows:
acquiring an audio file to be trained;
extracting a first characteristic parameter and a second characteristic parameter of an audio file to be trained;
performing feature fusion on a first feature parameter and a second feature parameter of an audio file to be trained, and converting the feature parameters obtained after fusion into images to obtain images to be trained;
performing tone label marking on an image to be trained to obtain a training sample;
and inputting the training sample into a convolutional neural network for training to obtain a preset convolutional neural network model.
As a further improvement, the convolutional neural network is a network structure such as VGG, ResNet, MobileNet or ShuffleNet.
As a further improvement, the number of feature parameters in each of the first feature parameter and the second feature parameter is one, or two or more.
As a further improvement, when the number of feature parameters in each of the first feature parameter and the second feature parameter is one, the first feature parameter is a linear prediction cepstral coefficient and the second feature parameter is a Mel-frequency cepstral coefficient.
In the embodiment of the application, the first and second feature parameters of the audio file to be classified are extracted and fused before timbre classification, so different timbre classes can be distinguished effectively and classification accuracy improves. Converting the fused feature parameters into an image and letting the preset convolutional neural network model extract the feature information useful for classification effectively mitigates the influence of intonation or speaking-rate changes and noise interference, further improving the accuracy and robustness of timbre classification. This solves the technical problem of the prior art, where classification based on a single extracted feature parameter can hardly distinguish different timbre classes effectively for a given task and misclassifies easily under intonation or speaking-rate changes or noise interference, resulting in low accuracy.
The embodiment of the application also provides tone color classification equipment, which comprises a processor and a memory;
the memory is used for storing the program codes and transmitting the program codes to the processor;
the processor is configured to execute the tone color classification method in the foregoing method embodiment according to instructions in the program code.
The embodiment of the present application further provides a computer-readable storage medium, which is used for storing a program code, where the program code is used for executing the tone color classification method in the foregoing method embodiment.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the several embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the units is only one logical division, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application may be substantially implemented or contributed to by the prior art, or all or part of the technical solution may be embodied in a software product, which is stored in a storage medium and includes instructions for executing all or part of the steps of the method described in the embodiments of the present application through a computer device (which may be a personal computer, a server, or a network device). And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
The above embodiments are only used for illustrating the technical solutions of the present application, and not for limiting the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions in the embodiments of the present application.

Claims (10)

1. A method for classifying timbres, comprising:
acquiring an audio file to be classified;
extracting a first characteristic parameter and a second characteristic parameter of the audio file to be classified;
performing feature fusion on the first feature parameter and the second feature parameter of the audio file to be classified, and converting the feature parameters obtained after fusion into images to obtain images to be classified;
and inputting the image to be classified into a preset convolution neural network model for tone classification, and outputting a tone classification result of the audio file to be classified.
2. The timbre classification method according to claim 1, wherein the preset convolutional neural network model is configured by the following process:
acquiring an audio file to be trained;
extracting a first characteristic parameter and a second characteristic parameter of the audio file to be trained;
performing feature fusion on the first feature parameter and the second feature parameter of the audio file to be trained, and converting the feature parameters obtained after fusion into images to obtain images to be trained;
performing tone label marking on the image to be trained to obtain a training sample;
and inputting the training sample into a convolutional neural network for training to obtain the preset convolutional neural network model.
3. The timbre classification method of claim 2 wherein the convolutional neural network is VGG, ResNet, MobileNet or ShuffleNet.
4. The method for classifying timbres according to any one of claims 1 to 3, wherein the number of feature parameters in each of the first feature parameter and the second feature parameter is one, or two or more.
5. The method for classifying timbres according to claim 4, wherein when the number of feature parameters in the first feature parameter and the second feature parameter is 1, the first feature parameter is a linear prediction cepstrum coefficient, and the second feature parameter is a mel-frequency cepstrum coefficient.
6. A tone color classification apparatus, comprising:
the acquisition unit is used for acquiring the audio files to be classified;
the extraction unit is used for extracting a first characteristic parameter and a second characteristic parameter of the audio file to be classified;
the conversion unit is used for performing feature fusion on the first feature parameter and the second feature parameter of the audio file to be classified, and converting the feature parameters obtained after fusion into images to obtain images to be classified;
and the classification unit is used for inputting the image to be classified into a preset convolution neural network model for tone classification and outputting a tone classification result of the audio file to be classified.
7. The timbre classification device according to claim 6, wherein the preset convolutional neural network model is configured by the following process:
acquiring an audio file to be trained;
extracting a first characteristic parameter and a second characteristic parameter of the audio file to be trained;
performing feature fusion on the first feature parameter and the second feature parameter of the audio file to be trained, and converting the feature parameters obtained after fusion into images to obtain images to be trained;
performing tone label marking on the image to be trained to obtain a training sample;
and inputting the training sample into a convolutional neural network for training to obtain the preset convolutional neural network model.
8. The tone color classification apparatus according to claim 6 or 7, wherein the number of feature parameters in each of the first feature parameter and the second feature parameter is one, or two or more.
9. A timbre classification device, the device comprising a processor and a memory;
the memory is used for storing program codes and transmitting the program codes to the processor;
the processor is configured to execute the method of tone classification of any of claims 1 to 5 according to instructions in the program code.
10. A computer-readable storage medium characterized in that the computer-readable storage medium stores a program code for executing the tone color classification method of any one of claims 1 to 5.
CN202011565974.6A 2020-12-25 2020-12-25 Tone classification method, device, equipment and medium Pending CN112712820A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011565974.6A CN112712820A (en) 2020-12-25 2020-12-25 Tone classification method, device, equipment and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011565974.6A CN112712820A (en) 2020-12-25 2020-12-25 Tone classification method, device, equipment and medium

Publications (1)

Publication Number Publication Date
CN112712820A true CN112712820A (en) 2021-04-27

Family

ID=75546749

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011565974.6A Pending CN112712820A (en) 2020-12-25 2020-12-25 Tone classification method, device, equipment and medium

Country Status (1)

Country Link
CN (1) CN112712820A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116778951A (en) * 2023-05-25 2023-09-19 上海蜜度信息技术有限公司 Audio classification method, device, equipment and medium based on graph enhancement

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107610692A (en) * 2017-09-22 2018-01-19 杭州电子科技大学 The sound identification method of self-encoding encoder multiple features fusion is stacked based on neutral net
CN108962231A (en) * 2018-07-04 2018-12-07 武汉斗鱼网络科技有限公司 A kind of method of speech classification, device, server and storage medium
CN111261141A (en) * 2018-11-30 2020-06-09 北京嘀嘀无限科技发展有限公司 Voice recognition method and voice recognition device
WO2020155584A1 (en) * 2019-01-31 2020-08-06 北京声智科技有限公司 Method and device for fusing voiceprint features, voice recognition method and system, and storage medium
CN111524522A (en) * 2020-04-23 2020-08-11 上海依图网络科技有限公司 Voiceprint recognition method and system based on fusion of multiple voice features
CN111554306A (en) * 2020-04-26 2020-08-18 兰州理工大学 Voiceprint recognition method based on multiple features
CN111613240A (en) * 2020-05-22 2020-09-01 杭州电子科技大学 Camouflage voice detection method based on attention mechanism and Bi-LSTM
CN111785285A (en) * 2020-05-22 2020-10-16 南京邮电大学 Voiceprint recognition method for home multi-feature parameter fusion
CN111816218A (en) * 2020-07-31 2020-10-23 平安科技(深圳)有限公司 Voice endpoint detection method, device, equipment and storage medium


Similar Documents

Publication Publication Date Title
CN105976809B (en) Identification method and system based on speech and facial expression bimodal emotion fusion
CN108305615A (en) A kind of object identifying method and its equipment, storage medium, terminal
JP4220449B2 (en) Indexing device, indexing method, and indexing program
CN110288975B (en) Voice style migration method and device, electronic equipment and storage medium
CN112712809B (en) Voice detection method and device, electronic equipment and storage medium
CN111326139B (en) Language identification method, device, equipment and storage medium
CN112259104A (en) Training device of voiceprint recognition model
CN110717410A (en) Voice emotion and facial expression bimodal recognition system
CN112232276A (en) Emotion detection method and device based on voice recognition and image recognition
CN111489763A (en) Adaptive method for speaker recognition in complex environment based on GMM model
CN112712820A (en) Tone classification method, device, equipment and medium
CN107358946B (en) Voice emotion recognition method based on slice convolution
CN112584238A (en) Movie and television resource matching method and device and smart television
CN116564269A (en) Voice data processing method and device, electronic equipment and readable storage medium
WO2021217979A1 (en) Voiceprint recognition method and apparatus, and device and storage medium
CN114492579A (en) Emotion recognition method, camera device, emotion recognition device and storage device
CN110033786B (en) Gender judgment method, device, equipment and readable storage medium
Devnath et al. Emotion recognition from isolated Bengali speech
Ankışhan A new approach for the acoustic analysis of the speech pathology
Feng et al. Noise Classification Speech Enhancement Generative Adversarial Network
CN117475360B (en) Biological feature extraction and analysis method based on audio and video characteristics of improved MLSTM-FCN
CN112614510B (en) Audio quality assessment method and device
US20230154487A1 (en) Method, system and device of speech emotion recognition and quantization based on deep learning
WO2024042970A1 (en) Information processing device, information processing method, and computer-readable non-transitory storage medium
CN117373433A (en) Dialect voice recognition system and method based on small sample data

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination