CN112712820A - Tone classification method, device, equipment and medium - Google Patents

Tone classification method, device, equipment and medium

Info

Publication number
CN112712820A
CN112712820A (application CN202011565974.6A)
Authority
CN
China
Prior art keywords
classified
feature
audio file
parameter
tone
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011565974.6A
Other languages
Chinese (zh)
Inventor
汪暾
马金龙
熊佳
罗箫
焦南凯
徐志坚
谢睿
陈光尧
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangzhou Huancheng Culture Media Co ltd
Original Assignee
Guangzhou Huancheng Culture Media Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangzhou Huancheng Culture Media Co ltd
Priority to CN202011565974.6A
Publication of CN112712820A
Legal status: Pending

Classifications

    • G10L 25/24: Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00, characterised by the type of extracted parameters, the extracted parameters being the cepstrum
    • G06F 18/214: Pattern recognition; generating training patterns; bootstrap methods, e.g. bagging or boosting
    • G06F 18/24: Pattern recognition; classification techniques
    • G06F 18/253: Pattern recognition; fusion techniques of extracted features
    • G06N 3/045: Neural networks; combinations of networks
    • G06N 3/08: Neural networks; learning methods
    • G06V 10/40: Image or video recognition or understanding; extraction of image or video features
    • G10L 25/30: Speech or voice analysis techniques characterised by the analysis technique, using neural networks
    • G10L 25/51: Speech or voice analysis techniques specially adapted for particular use, for comparison or discrimination

Abstract

The application discloses a timbre classification method, apparatus, device, and medium, wherein the method comprises: acquiring an audio file to be classified; extracting a first feature parameter and a second feature parameter of the audio file to be classified; performing feature fusion on the first and second feature parameters, and converting the fused feature parameters into an image to obtain the image to be classified; and inputting the image to be classified into a preset convolutional neural network model for timbre classification, and outputting a timbre classification result of the audio file to be classified. This solves the technical problem of the prior art, where timbre classification based on a single extracted feature parameter can hardly distinguish different timbre classes effectively for a given classification task, and misclassification occurs easily when intonation or speaking rate changes or noise interference appears, resulting in low classification accuracy.

Description

Tone classification method, device, equipment and medium
Technical Field
The present application relates to the field of audio processing technologies, and in particular, to a method, an apparatus, a device, and a medium for classifying timbres.
Background
Language is a communication mode unique to human beings; besides the information a speaker intends to express, it carries a large amount of hidden information, including the speaker's identity. Each individual has unique vocal organs, and the vocal organs of different individuals differ markedly. In everyday life, natural attributes such as a speaker's age and gender, as well as social attributes such as language type and regional cultural background, can be roughly judged from the voice alone, which shows that a speech signal carries a large amount of information. Because each person's voice information is unique and reflects that person's attributes, voice information can be used to classify people, for example by gender, age, or timbre.
In the prior art, timbre classification mainly relies on a single extracted feature. For a given classification task, a single feature parameter can hardly distinguish different timbre classes effectively; moreover, when intonation or speaking rate changes or noise interference appears, misclassification occurs easily, so the accuracy of timbre classification is low.
Disclosure of Invention
The application provides a timbre classification method, apparatus, device, and medium, intended to solve the technical problem that, in the prior art, timbre classification by a single extracted feature parameter can hardly distinguish different timbre classes effectively for a given classification task, and misclassification occurs easily when intonation or speaking rate changes or noise interference appears, resulting in low classification accuracy.
In view of the above, a first aspect of the present application provides a method for classifying timbres, including:
acquiring an audio file to be classified;
extracting a first characteristic parameter and a second characteristic parameter of the audio file to be classified;
performing feature fusion on the first feature parameter and the second feature parameter of the audio file to be classified, and converting the feature parameters obtained after fusion into images to obtain images to be classified;
and inputting the image to be classified into a preset convolution neural network model for tone classification, and outputting a tone classification result of the audio file to be classified.
Optionally, the configuration process of the preset convolutional neural network model is as follows:
acquiring an audio file to be trained;
extracting a first characteristic parameter and a second characteristic parameter of the audio file to be trained;
performing feature fusion on the first feature parameter and the second feature parameter of the audio file to be trained, and converting the feature parameters obtained after fusion into images to obtain images to be trained;
performing tone label marking on the image to be trained to obtain a training sample;
and inputting the training sample into a convolutional neural network for training to obtain the preset convolutional neural network model.
Optionally, the convolutional neural network is VGG, ResNet, MobileNet, or ShuffleNet.
Optionally, the number of feature parameters in each of the first feature parameter and the second feature parameter is one, or two or more.
Optionally, when the number of feature parameters in each of the first feature parameter and the second feature parameter is one, the first feature parameter is a linear prediction cepstral coefficient and the second feature parameter is a Mel-frequency cepstral coefficient.
A second aspect of the present application provides a tone color classification apparatus, comprising:
the acquisition unit is used for acquiring the audio files to be classified;
the extraction unit is used for extracting a first characteristic parameter and a second characteristic parameter of the audio file to be classified;
the conversion unit is used for performing feature fusion on the first feature parameter and the second feature parameter of the audio file to be classified, and converting the feature parameters obtained after fusion into images to obtain images to be classified;
and the classification unit is used for inputting the image to be classified into a preset convolution neural network model for tone classification and outputting a tone classification result of the audio file to be classified.
Optionally, the configuration process of the preset convolutional neural network model is as follows:
acquiring an audio file to be trained;
extracting a first characteristic parameter and a second characteristic parameter of the audio file to be trained;
performing feature fusion on the first feature parameter and the second feature parameter of the audio file to be trained, and converting the feature parameters obtained after fusion into images to obtain images to be trained;
performing tone label marking on the image to be trained to obtain a training sample;
and inputting the training sample into a convolutional neural network for training to obtain the preset convolutional neural network model.
Optionally, the number of feature parameters in each of the first feature parameter and the second feature parameter is one, or two or more.
A third aspect of the present application provides a timbre classification device comprising a processor and a memory;
the memory is used for storing program codes and transmitting the program codes to the processor;
the processor is configured to perform the method for tone color classification according to any one of the first aspect according to instructions in the program code.
A fourth aspect of the present application provides a computer-readable storage medium for storing program code for performing the method for classifying timbres according to any one of the first aspects.
According to the technical scheme, the method has the following advantages:
the application provides a tone classification method, which comprises the following steps: acquiring an audio file to be classified; extracting a first characteristic parameter and a second characteristic parameter of the acquired audio file to be classified; performing feature fusion on a first feature parameter and a second feature parameter of the audio file to be classified, and converting the feature parameters obtained after fusion into images to obtain the images to be classified; and inputting the image to be classified into a preset convolution neural network model for tone classification, and outputting a tone classification result of the audio file to be classified.
According to the method and the device, the first characteristic parameter and the second characteristic parameter of the audio file to be classified are extracted, the two characteristic parameters are fused and then subjected to tone classification, different types of tones can be effectively distinguished, and tone classification accuracy is improved; the characteristic parameters obtained after fusion are converted into the image, and then the preset convolutional neural network model is used for extracting the characteristic parameters in the image and the characteristic information useful for classification, so that the influence caused by the change of the intonation and the speed of speech or noise interference can be effectively avoided, the accuracy and the robustness of tone classification are further improved, and the technical problem that in the prior art, the tone classification is carried out by extracting single characteristic parameters, the condition that different types of tones are difficult to be effectively distinguished through single characteristic parameters for a certain specific tone classification task, and the condition that mis-classification is easy to occur when the intonation and the speed of speech are changed or the noise interference occurs, and the tone classification accuracy is low is solved.
Drawings
In order to illustrate the embodiments of the present application or the technical solutions in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description show only some embodiments of the present application, and those skilled in the art can derive other drawings from them without inventive effort.
Fig. 1 is a schematic flow chart of a tone classification method according to an embodiment of the present application;
fig. 2 is a schematic structural diagram of a tone color classification apparatus according to an embodiment of the present application;
fig. 3 is a schematic diagram of a feature parameter fusion method according to an embodiment of the present application;
fig. 4 is a schematic diagram of another feature parameter fusion method provided in the embodiment of the present application.
Detailed Description
The application provides a timbre classification method, apparatus, device, and medium to solve the technical problem that, in the prior art, timbre classification by a single extracted feature parameter can hardly distinguish different timbre classes effectively for a given classification task, and misclassification occurs easily when intonation or speaking rate changes or noise interference appears, resulting in low classification accuracy.
In order to make the technical solutions of the present application better understood, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
For easy understanding, referring to fig. 1, an embodiment of a method for classifying timbres provided by the present application includes:
Step 101: obtain an audio file to be classified.
The audio file to be classified may be obtained through a recording device; for example, an audio file recorded and uploaded by a user through a music service may be used as the audio file to be classified.
Further, invalid audio may be screened out of the obtained audio files to be classified, for example files whose duration is too short.
Step 102: extract a first feature parameter and a second feature parameter of the audio file to be classified.
Feature parameters of the audio file to be classified are extracted to obtain the first feature parameter and the second feature parameter; the feature parameters may be the pitch period, linear prediction cepstral coefficients (LPCC), Mel-frequency cepstral coefficients (MFCC), or other known feature parameters.
In the embodiment of the present application, the number of feature parameters in each of the first and second feature parameters may be one, or two or more. When that number is one, the first feature parameter is preferably the LPCC and the second the MFCC. The extraction of LPCC and MFCC feature parameters belongs to the prior art and is not described in detail here.
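By way of illustration only, the following Python sketch extracts per-frame MFCC and LPCC features with librosa and NumPy; the frame length, hop size, coefficient order, and helper names are assumptions made for the example, not values fixed by this application.

```python
import numpy as np
import librosa

def lpcc(frame, order=12, n_coeff=12):
    """LPCC via the standard LPC-to-cepstrum recursion.

    librosa.lpc returns the prediction-error filter [1, a_1, ..., a_p];
    the classic recursion below uses the negated predictor coefficients.
    """
    a = -librosa.lpc(frame.astype(float), order=order)[1:]
    c = np.zeros(n_coeff)
    for m in range(1, n_coeff + 1):
        acc = a[m - 1] if m <= order else 0.0
        for k in range(max(1, m - order), m):
            acc += (k / m) * c[k - 1] * a[m - k - 1]
        c[m - 1] = acc
    return c

def extract_features(path, n_coeff=12, frame_len=1024, hop=512):
    """Per-frame LPCC and MFCC matrices for one audio file."""
    y, sr = librosa.load(path, sr=None)
    y = librosa.effects.preemphasis(y)           # pre-emphasis, as in the text
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_coeff,
                                n_fft=frame_len, hop_length=hop)
    frames = librosa.util.frame(y, frame_length=frame_len, hop_length=hop)
    lpccs = np.stack([lpcc(f, order=n_coeff, n_coeff=n_coeff)
                      for f in frames.T], axis=1)
    return lpccs, mfcc                           # both shaped (n_coeff, n_frames)
```

The LPCC helper converts LPC coefficients to cepstra with the standard recursion; any equivalent extraction method would serve the same purpose.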
Step 103: perform feature fusion on the first and second feature parameters of the audio file to be classified, and convert the fused feature parameters into an image to obtain the image to be classified.
Feature fusion is performed on the first and second feature parameters of the audio file to be classified to obtain fused feature parameters, and timbre classification could in principle be performed directly on them. To further improve accuracy, however, this embodiment converts the fused feature parameters into an image, obtaining the image to be classified, and performs timbre classification on that image with the preset convolutional neural network model.
The specific feature-fusion process is described taking the LPCC and MFCC feature parameters as the first and second feature parameters. The two can be fused based on the Fisher criterion, whose basic principle is to find a projection subspace of the feature vector space in which the feature points are optimally separated. The Fisher ratio, a frequently used feature-selection measure, is derived from the Fisher criterion: the larger the Fisher ratio of a feature dimension, the more effectively and accurately that dimension reflects the characteristic information of the speech signal.
1. Scheme one for fusing the LPCC and MFCC feature parameters based on the Fisher criterion (see Fig. 3):
The LPCC and MFCC feature parameters are fused first, and Fisher-ratio selection is applied afterwards. The specific process is as follows:
pre-emphasis, framing and other pre-processing are carried out on the voice signals s (n) in the audio file to be classified, and then the LPCC characteristic parameter C of each frame of voice signal s (n) is obtained through calculationn(N ═ 1,2,. cndot., N) and MFCC characteristic parameters cmel (k), k ═ 1,2,. cndot., N. Construction of LPCC characteristic parameter sequence C based on LPCC characteristic parameterslpcc={c1,c2,...cN}; MFCC characteristic parameter sequence C constructed based on MFCC characteristic parametersmfcc={Cmel(1),Cmel(2),...,Cmel(N)}。
Fusing the LPCC characteristic parameter sequence and the MFCC characteristic parameter sequence to obtain a fused characteristic parameter sequence Cnew={Clpcc,Cmfcc}={c1,c2,...cN,Cmel(1),Cmel(2),...,Cmel(N)}。
According to the feature selection calculation formula of Fisher criterion, the fused feature parameter sequence C is solvednewThe Fisher ratio of the characteristic parameters of each dimension in the (A) is determined. Selecting the characteristic parameters of Fisher ratio larger than a preset threshold value to form a final fused characteristic parameter sequence C'newI.e. the fusion characteristic parameters. The preset threshold may be selected according to actual situations, and is not specifically limited herein.
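In code, scheme one reduces to a per-dimension ratio of between-class to within-class variance followed by thresholding. A minimal NumPy sketch, assuming feature matrices C_lpcc and C_mfcc of shape (n_samples, N) (one row per labeled frame or file), integer class labels, and a threshold tuned on training data; all names are hypothetical:

```python
import numpy as np

def fisher_ratio(X, y):
    """Per-dimension Fisher ratio: between-class variance divided by
    within-class variance. X: (n_samples, n_dims); y: class labels."""
    mu = X.mean(axis=0)
    between = np.zeros(X.shape[1])
    within = np.zeros(X.shape[1])
    for c in np.unique(y):
        Xc = X[y == c]
        mu_c = Xc.mean(axis=0)
        between += len(Xc) * (mu_c - mu) ** 2
        within += ((Xc - mu_c) ** 2).sum(axis=0)
    return between / np.maximum(within, 1e-12)

# Scheme one: concatenate first, then keep high-Fisher-ratio dimensions.
C_new = np.concatenate([C_lpcc, C_mfcc], axis=1)      # (n_samples, 2N)
C_new_selected = C_new[:, fisher_ratio(C_new, labels) > threshold]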
2. Scheme two for fusing the LPCC and MFCC feature parameters based on the Fisher criterion (see Fig. 4):
Fisher-ratio selection is applied to each set of feature parameters first, and fusion is performed afterwards. The specific process is as follows:
pre-emphasis, framing and other pre-processing are carried out on the voice signals s (n) in the audio file to be classified, and then the LPCC characteristic parameter C of each frame of voice signal s (n) is obtained through calculationn(N ═ 1,2,. cndot., N) and MFCC characteristic parameters cmel (k), k ═ 1,2,. cndot., N. Construction of LPCC characteristic parameter sequence C based on LPCC characteristic parameterslpcc={c1,c2,...cN}; MFCC characteristic parameter sequence C constructed based on MFCC characteristic parametersmfcc={Cmel(1),Cmel(2),...,Cmel(N)}。
Obtaining an LPCC characteristic parameter sequence C according to a characteristic selection calculation formula of a Fisher criterionlpccAnd MFCC characteristic parameter sequence CmfccSelecting the characteristic parameters of the Fisher ratio which is greater than a preset threshold value from the Fisher ratio of the characteristic parameters of each dimension to obtain a screened LPCC characteristic parameter sequence C'lpccAnd MFCC characteristic parameter sequence C'mfcc
C 'of the screened LPCC characteristic parameter sequence'lpccAnd MFCC characteristic parameter sequence C'mfccCarrying out fusion to obtain a fused characteristic parameter sequence C'new={C'lpcc,C'mfccAnd f, fusing characteristic parameters.
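Under the same assumptions, scheme two merely swaps the order of selection and concatenation, reusing the fisher_ratio helper from the sketch above:

```python
# Scheme two: select within each feature set first, then fuse.
keep_lpcc = fisher_ratio(C_lpcc, labels) > threshold
keep_mfcc = fisher_ratio(C_mfcc, labels) > threshold
C_new_selected = np.concatenate(
    [C_lpcc[:, keep_lpcc], C_mfcc[:, keep_mfcc]], axis=1)
```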
Step 104: input the image to be classified into the preset convolutional neural network model for timbre classification, and output the timbre classification result of the audio file to be classified.
The converted image to be classified is input into the preset convolutional neural network model for timbre classification, and the timbre classification result of the audio file to be classified is output; the result can then be used for personalized recommendation services and the like. The strong fitting and characterization capabilities of the convolutional neural network model automatically extract, from the multiple feature parameters in the image, the information and features most useful for the classification task, yielding higher accuracy and robustness of timbre classification.
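The text does not fix how the fused parameters become an image; one plausible reading, sketched below under that assumption, treats the fused feature matrix as a grayscale image, normalizes it to 8 bits, resizes it, and runs a PyTorch forward pass (fused_feats and the trained model are hypothetical names):

```python
import numpy as np
import torch
from PIL import Image

def features_to_image(feat, size=(224, 224)):
    """Min-max normalize a 2-D fused feature matrix to an 8-bit image."""
    f = (feat - feat.min()) / (feat.max() - feat.min() + 1e-12)
    return Image.fromarray((f * 255).astype(np.uint8)).resize(size)

img = features_to_image(fused_feats)
x = torch.from_numpy(np.asarray(img, dtype=np.float32) / 255.0)
x = x.unsqueeze(0).unsqueeze(0).repeat(1, 3, 1, 1)  # replicate to 3 channels
                                                    # for a stock CNN backbone
with torch.no_grad():
    predicted_class = model(x).argmax(dim=1)        # index of the timbre label
```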
Further, the preset convolutional neural network model is a trained convolutional neural network model, and the training process is as follows:
acquiring an audio file to be trained; extracting a first characteristic parameter and a second characteristic parameter of an audio file to be trained; performing feature fusion on a first feature parameter and a second feature parameter of an audio file to be trained, and converting the feature parameters obtained after fusion into images to obtain images to be trained; performing tone label marking on an image to be trained to obtain a training sample; and inputting the training sample into a convolutional neural network for training to obtain a preset convolutional neural network model.
In the embodiment of the present application, the audio file to be trained is processed in the same way as the audio file to be classified, except that after the two feature parameters of the audio file to be trained are fused and converted into an image, the resulting image to be trained must be labeled with a timbre label to obtain a training sample for training the convolutional neural network. The timbre label categories may include: Zhengtai, youth, uncle, old age, Luoli, maiden, Yujie, or Ma, etc.
The training samples are input into the convolutional neural network, which outputs a timbre prediction for each sample; a loss value is computed by a loss function from the prediction and the sample's label, and the network parameters are updated accordingly until convergence. The trained convolutional neural network model thus obtained serves as the preset convolutional neural network model for timbre classification.
Further, the convolutional neural network in the embodiment of the present application may be a network structure such as VGG, ResNet, MobileNet, or ShuffleNet.
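For concreteness, here is a skeletal PyTorch training loop using the MobileNet variant named above; the data loader, class count, epoch count, and learning rate are illustrative assumptions rather than values fixed by this application:

```python
import torch
import torch.nn as nn
from torchvision import models

# Assumptions: `train_loader` yields (image, label) batches where images are
# 3-channel tensors, and there are 8 timbre labels as enumerated in the text.
model = models.mobilenet_v2(num_classes=8)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

model.train()
for epoch in range(20):                     # run until convergence in practice
    for images, labels in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(images), labels)  # loss from prediction vs label
        loss.backward()
        optimizer.step()
```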
In the embodiment of the application, the first and second feature parameters of the audio file to be classified are extracted and fused before timbre classification, so different timbre classes can be distinguished effectively and classification accuracy improves. Converting the fused feature parameters into an image and letting the preset convolutional neural network model extract the feature information useful for classification effectively mitigates the influence of intonation or speaking-rate changes and noise interference, further improving the accuracy and robustness of timbre classification. This solves the technical problem of the prior art, where classification based on a single extracted feature parameter can hardly distinguish different timbre classes effectively for a given task and misclassifies easily under intonation or speaking-rate changes or noise interference, resulting in low accuracy.
The above is an embodiment of a method for classifying timbre provided by the present application, and the following is an embodiment of a device for classifying timbre provided by the present application.
Referring to fig. 2, an embodiment of a timbre classification apparatus provided in the present application includes:
an obtaining unit 201, configured to obtain an audio file to be classified;
the extracting unit 202 is configured to extract a first feature parameter and a second feature parameter of an audio file to be classified;
the conversion unit 203 is configured to perform feature fusion on a first feature parameter and a second feature parameter of the audio file to be classified, and convert the feature parameters obtained after the feature fusion into an image to obtain an image to be classified;
and the classifying unit 204 is used for inputting the image to be classified into a preset convolutional neural network model for tone classification, and outputting a tone classification result of the audio file to be classified.
As a further improvement, the configuration process of the preset convolutional neural network model is as follows:
acquiring an audio file to be trained;
extracting a first characteristic parameter and a second characteristic parameter of an audio file to be trained;
performing feature fusion on a first feature parameter and a second feature parameter of an audio file to be trained, and converting the feature parameters obtained after fusion into images to obtain images to be trained;
performing tone label marking on an image to be trained to obtain a training sample;
and inputting the training sample into a convolutional neural network for training to obtain a preset convolutional neural network model.
As a further improvement, the convolutional neural network is a network structure such as VGG, ResNet, MobileNet or ShuffleNet.
As a further improvement, the number of feature parameters in each of the first feature parameter and the second feature parameter is one, or two or more.
As a further improvement, when the number of feature parameters in each of the first feature parameter and the second feature parameter is one, the first feature parameter is a linear prediction cepstral coefficient and the second feature parameter is a Mel-frequency cepstral coefficient.
In the embodiment of the application, the first and second feature parameters of the audio file to be classified are extracted and fused before timbre classification, so different timbre classes can be distinguished effectively and classification accuracy improves. Converting the fused feature parameters into an image and letting the preset convolutional neural network model extract the feature information useful for classification effectively mitigates the influence of intonation or speaking-rate changes and noise interference, further improving the accuracy and robustness of timbre classification. This solves the technical problem of the prior art, where classification based on a single extracted feature parameter can hardly distinguish different timbre classes effectively for a given task and misclassifies easily under intonation or speaking-rate changes or noise interference, resulting in low accuracy.
The embodiment of the application also provides tone color classification equipment, which comprises a processor and a memory;
the memory is used for storing the program codes and transmitting the program codes to the processor;
the processor is configured to execute the tone color classification method in the foregoing method embodiment according to instructions in the program code.
The embodiment of the present application further provides a computer-readable storage medium, which is used for storing a program code, where the program code is used for executing the tone color classification method in the foregoing method embodiment.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the several embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the units is only one logical division, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application may be substantially implemented or contributed to by the prior art, or all or part of the technical solution may be embodied in a software product, which is stored in a storage medium and includes instructions for executing all or part of the steps of the method described in the embodiments of the present application through a computer device (which may be a personal computer, a server, or a network device). And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
The above embodiments are only used for illustrating the technical solutions of the present application, and not for limiting the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions in the embodiments of the present application.

Claims (10)

1. A method for classifying timbres, comprising:
acquiring an audio file to be classified;
extracting a first characteristic parameter and a second characteristic parameter of the audio file to be classified;
performing feature fusion on the first feature parameter and the second feature parameter of the audio file to be classified, and converting the feature parameters obtained after fusion into images to obtain images to be classified;
and inputting the image to be classified into a preset convolution neural network model for tone classification, and outputting a tone classification result of the audio file to be classified.
2. The timbre classification method according to claim 1, wherein the preset convolutional neural network model is configured by the following process:
acquiring an audio file to be trained;
extracting a first characteristic parameter and a second characteristic parameter of the audio file to be trained;
performing feature fusion on the first feature parameter and the second feature parameter of the audio file to be trained, and converting the feature parameters obtained after fusion into images to obtain images to be trained;
performing tone label marking on the image to be trained to obtain a training sample;
and inputting the training sample into a convolutional neural network for training to obtain the preset convolutional neural network model.
3. The timbre classification method of claim 2 wherein the convolutional neural network is VGG, ResNet, MobileNet or ShuffleNet.
4. The method for classifying timbres according to any one of claims 1 to 3, wherein the number of feature parameters in each of the first feature parameter and the second feature parameter is one, or two or more.
5. The method for classifying timbres according to claim 4, wherein when the number of feature parameters in the first feature parameter and the second feature parameter is 1, the first feature parameter is a linear prediction cepstrum coefficient, and the second feature parameter is a mel-frequency cepstrum coefficient.
6. A tone color classification apparatus, comprising:
the acquisition unit is used for acquiring the audio files to be classified;
the extraction unit is used for extracting a first characteristic parameter and a second characteristic parameter of the audio file to be classified;
the conversion unit is used for performing feature fusion on the first feature parameter and the second feature parameter of the audio file to be classified, and converting the feature parameters obtained after fusion into images to obtain images to be classified;
and the classification unit is used for inputting the image to be classified into a preset convolution neural network model for tone classification and outputting a tone classification result of the audio file to be classified.
7. The timbre classification device according to claim 6, wherein the preset convolutional neural network model is configured by the following process:
acquiring an audio file to be trained;
extracting a first characteristic parameter and a second characteristic parameter of the audio file to be trained;
performing feature fusion on the first feature parameter and the second feature parameter of the audio file to be trained, and converting the feature parameters obtained after fusion into images to obtain images to be trained;
performing tone label marking on the image to be trained to obtain a training sample;
and inputting the training sample into a convolutional neural network for training to obtain the preset convolutional neural network model.
8. The tone color classification apparatus according to claim 6 or 7, wherein the number of feature parameters in each of the first feature parameter and the second feature parameter is one, or two or more.
9. A timbre classification device, the device comprising a processor and a memory;
the memory is used for storing program codes and transmitting the program codes to the processor;
the processor is configured to execute the method of tone classification of any of claims 1 to 5 according to instructions in the program code.
10. A computer-readable storage medium characterized in that the computer-readable storage medium stores a program code for executing the tone color classification method of any one of claims 1 to 5.
CN202011565974.6A 2020-12-25 2020-12-25 Tone classification method, device, equipment and medium Pending CN112712820A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011565974.6A CN112712820A (en) 2020-12-25 2020-12-25 Tone classification method, device, equipment and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011565974.6A CN112712820A (en) 2020-12-25 2020-12-25 Tone classification method, device, equipment and medium

Publications (1)

Publication Number Publication Date
CN112712820A true CN112712820A (en) 2021-04-27

Family

ID=75546749

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011565974.6A Pending CN112712820A (en) 2020-12-25 2020-12-25 Tone classification method, device, equipment and medium

Country Status (1)

Country Link
CN (1) CN112712820A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116778951A (en) * 2023-05-25 2023-09-19 上海蜜度信息技术有限公司 Audio classification method, device, equipment and medium based on graph enhancement

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107610692A (en) * 2017-09-22 2018-01-19 杭州电子科技大学 The sound identification method of self-encoding encoder multiple features fusion is stacked based on neutral net
CN108962231A (en) * 2018-07-04 2018-12-07 武汉斗鱼网络科技有限公司 A kind of method of speech classification, device, server and storage medium
CN111261141A (en) * 2018-11-30 2020-06-09 北京嘀嘀无限科技发展有限公司 Voice recognition method and voice recognition device
WO2020155584A1 (en) * 2019-01-31 2020-08-06 北京声智科技有限公司 Method and device for fusing voiceprint features, voice recognition method and system, and storage medium
CN111524522A (en) * 2020-04-23 2020-08-11 上海依图网络科技有限公司 Voiceprint recognition method and system based on fusion of multiple voice features
CN111554306A (en) * 2020-04-26 2020-08-18 兰州理工大学 Voiceprint recognition method based on multiple features
CN111613240A (en) * 2020-05-22 2020-09-01 杭州电子科技大学 Camouflage voice detection method based on attention mechanism and Bi-LSTM
CN111785285A (en) * 2020-05-22 2020-10-16 南京邮电大学 Voiceprint recognition method for home multi-feature parameter fusion
CN111816218A (en) * 2020-07-31 2020-10-23 平安科技(深圳)有限公司 Voice endpoint detection method, device, equipment and storage medium


Similar Documents

Publication Publication Date Title
CN105976809B (en) Identification method and system based on speech and facial expression bimodal emotion fusion
CN108305615A (en) A kind of object identifying method and its equipment, storage medium, terminal
JP4220449B2 (en) Indexing device, indexing method, and indexing program
CN110288975B (en) Voice style migration method and device, electronic equipment and storage medium
CN112712809B (en) Voice detection method and device, electronic equipment and storage medium
CN111326139B (en) Language identification method, device, equipment and storage medium
CN112259104A (en) Training device of voiceprint recognition model
CN110717410A (en) Voice emotion and facial expression bimodal recognition system
CN112232276A (en) Emotion detection method and device based on voice recognition and image recognition
CN111489763A (en) Adaptive method for speaker recognition in complex environment based on GMM model
CN112712820A (en) Tone classification method, device, equipment and medium
CN107358946B (en) Voice emotion recognition method based on slice convolution
CN112584238A (en) Movie and television resource matching method and device and smart television
CN116564269A (en) Voice data processing method and device, electronic equipment and readable storage medium
WO2021217979A1 (en) Voiceprint recognition method and apparatus, and device and storage medium
CN114492579A (en) Emotion recognition method, camera device, emotion recognition device and storage device
CN110033786B (en) Gender judgment method, device, equipment and readable storage medium
Devnath et al. Emotion recognition from isolated Bengali speech
Ankışhan A new approach for the acoustic analysis of the speech pathology
Feng et al. Noise Classification Speech Enhancement Generative Adversarial Network
CN117475360B (en) Biological feature extraction and analysis method based on audio and video characteristics of improved MLSTM-FCN
CN112614510B (en) Audio quality assessment method and device
US20230154487A1 (en) Method, system and device of speech emotion recognition and quantization based on deep learning
WO2024042970A1 (en) Information processing device, information processing method, and computer-readable non-transitory storage medium
CN117373433A (en) Dialect voice recognition system and method based on small sample data

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination