CN106997765A - Quantitative characterization method of vocal timbre - Google Patents
Quantitative characterization method of vocal timbre
- Publication number
- CN106997765A CN106997765A CN201710207110.9A CN201710207110A CN106997765A CN 106997765 A CN106997765 A CN 106997765A CN 201710207110 A CN201710207110 A CN 201710207110A CN 106997765 A CN106997765 A CN 106997765A
- Authority
- CN
- China
- Prior art keywords
- tone color
- audio
- frame
- cqt
- convolutional neural
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L17/00—Speaker identification or verification techniques
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
- G10L25/18—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being spectral information of each sub-band
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/27—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
- G10L25/30—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/45—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of analysis window
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/16—Speech classification or search using artificial neural networks
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L17/00—Speaker identification or verification techniques
- G10L17/18—Artificial neural networks; Connectionist approaches
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/003—Changing voice quality, e.g. pitch or formants
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Signal Processing (AREA)
- Computational Linguistics (AREA)
- Artificial Intelligence (AREA)
- Evolutionary Computation (AREA)
- Spectroscopy & Molecular Physics (AREA)
- Auxiliary Devices For Music (AREA)
- Reverberation, Karaoke And Other Acoustics (AREA)
Abstract
The invention relates to a quantitative characterization method of vocal timbre. The method analyzes clean (a cappella) recordings of several songs sung by professional singers, computes a 192-dimensional CQT feature vector for each frame, and assembles the CQT features of 60 audio frames into a 60*192 input matrix used to train a deep convolutional neural network, yielding the trained network and a corresponding vocal-timbre embedding space. A clean recording by an amateur singer is then analyzed in the same way and fed into the trained network, producing a timbre vector in the same embedding space. The method thus represents a subjective timbre characteristic in a quantitative, objective way.
Description
Technical field
The present invention relates to acoustic signal processing for singing, and in particular to a quantitative characterization method of vocal timbre.
Background art
The American National Standards Institute defines timbre as follows: "Timbre is that attribute of auditory sensation by which a listener can judge that two sounds, similarly presented and having the same pitch and loudness, are dissimilar." Accordingly, vocal timbre in singing refers to the sound characteristics by which people identify which singer is performing when different singers sing the same song.
In acoustics experiments, sound is commonly analyzed with spectrograms. A spectrogram shows how amplitude varies with frequency and time: the ordinate represents frequency, the abscissa represents time, and amplitude is represented by shades of grey or by different colors. From the spectrogram's point of view, the factors determining timbre are the presence or absence of overtones and their relative strengths.
Many scholars have long studied audio signal processing in the hope of establishing quantitative, objective indices of timbre alongside pitch and loudness. To this day, however, timbre evaluation remains largely qualitative and subjective; timbre has not been effectively modeled or quantized, and no quantitative measurement system has been constructed. Continued research is therefore needed on the characterization, measurement indices, and similarity metrics of timbre.
Current research on timbre mainly addresses musical-instrument classification and identification and singer identification, realized chiefly through various physical timbre features and classification models. The physical features commonly used for timbre classification fall into three categories: time-domain features, frequency-domain features, and cepstral-domain features.
Time-domain features: time-domain features reflect the dynamic changes of a sound, and the temporal envelopes of different audio signals all differ. For comprehensive analysis, a musical tone can be divided into three stages: attack, steady state, and decay. The attack is the initial portion where the tone rises from silence, the steady state is the main body of the tone, and the decay is the final portion where the tone fades to silence. The attack and decay portions may last only a few tens of milliseconds, but the attack stage plays a very important role in distinguishing timbres.
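The attack/steady-state/decay division described above can be located programmatically from a short-time amplitude envelope. Below is a minimal numpy sketch; the 256-sample analysis frame and the 0.5 peak-ratio threshold are assumptions for illustration, not values taken from the patent.

```python
import numpy as np

def envelope_segments(env, fs, frame=256, thresh=0.5):
    """Split an amplitude envelope into attack / steady-state / decay by
    when the framed envelope first and last reaches `thresh` of its peak.
    frame and thresh are illustrative assumptions."""
    n = len(env) // frame
    e = env[:n * frame].reshape(n, frame).mean(axis=1)   # framed envelope
    above = np.where(e >= thresh * e.max())[0]
    attack_end = above[0] * frame / fs                   # seconds
    decay_start = (above[-1] + 1) * frame / fs
    return attack_end, decay_start

# synthetic envelope: 50 ms linear attack, 200 ms sustain, 100 ms decay
fs = 16000
env = np.concatenate([np.linspace(0.0, 1.0, int(0.05 * fs)),
                      np.ones(int(0.20 * fs)),
                      np.linspace(1.0, 0.0, int(0.10 * fs))])
a_end, d_start = envelope_segments(env, fs)
print(round(a_end, 3), round(d_start, 3))
```

On this synthetic tone the attack boundary is found near 0.03 s and the decay onset near 0.3 s, matching the envelope's construction to within one analysis frame.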
Frequency-domain features: frequency-domain analysis at different scales yields different spectra. Common spectra include the STFT spectrum and the CQT spectrum.
1) The filter-bank center frequencies of the STFT spectrum rise linearly, and each filter has a constant bandwidth. The calculation formula is:
X(k) = Σ_{n=0}^{N-1} x(n) w(n) e^{-j2πnk/N}
where x(n) is the speech signal of a given frame and w(n) is the window function.
2) The CQT uses a bank of filters whose center frequencies are distributed exponentially, expressing a note signal as spectral energy at defined musical semitones; the quality factor Q of the filter bank remains constant. The CQT provides higher frequency resolution but lower temporal resolution at low frequencies, and higher temporal resolution but lower frequency resolution at high frequencies. Since the low-frequency portion of a note signal usually carries the more valuable information, the CQT matches this property well.
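The constant-Q property can be made concrete in a few lines of numpy: with geometrically spaced center frequencies, the ratio of center frequency to bandwidth (the quality factor Q) is the same for every bin. The starting pitch (27.5 Hz) and 24 bins per octave are assumptions chosen so that 192 bins span 8 octaves below the 8 kHz Nyquist limit of 16 kHz audio; the patent itself fixes only the 192-dimensional output.

```python
import numpy as np

# CQT filter-bank geometry (fmin and bins_per_octave are assumed values)
fmin = 27.5              # A0; 27.5 * 2**8 = 7040 Hz stays below Nyquist
bins_per_octave = 24
n_bins = 192             # the patent's per-frame feature dimension
freqs = fmin * 2.0 ** (np.arange(n_bins) / bins_per_octave)
# constant quality factor shared by all bins
Q = 1.0 / (2.0 ** (1.0 / bins_per_octave) - 1.0)

print(round(freqs[0], 1), round(freqs[24] / freqs[0], 1), round(Q, 1))
```

Every 24 bins the center frequency doubles (one octave), while Q stays fixed near 34, which is what makes low-frequency bins narrow in hertz and high-frequency bins wide.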
Cepstral-domain features: Mel-frequency cepstral coefficients (MFCC) were proposed on the basis of the human auditory perception model and have proven to be among the most important features in fields such as musical-tone and speech classification and recognition. Human subjective perception of the frequency domain is nonlinear: f_mel = 1125 ln(1 + f/700), where f_mel is the perceived frequency in mels and f is the actual frequency in hertz. Transforming the signal spectrum into the perceptual domain simulates auditory processing well. To compute MFCC, the signal is first preprocessed by framing, windowing, and pre-emphasis; each frame is then transformed to the frequency domain by FFT, the spectral line energy of each frame is computed and passed through the mel filter bank, and the energy in each filter is obtained. Taking the logarithm of the filter-bank energies and applying the DCT yields the MFCC.
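The mel warping quoted above is easy to sketch; the functions below implement the stated formula f_mel = 1125 ln(1 + f/700) and its inverse, which is what the mel filter bank's band edges are derived from.

```python
import math

def hz_to_mel(f):
    # the warping quoted in the text: f_mel = 1125 * ln(1 + f / 700)
    return 1125.0 * math.log(1.0 + f / 700.0)

def mel_to_hz(m):
    # inverse warping, used to place filter-bank edges back in hertz
    return 700.0 * (math.exp(m / 1125.0) - 1.0)

print(round(hz_to_mel(700.0)))            # 1125 * ln 2
print(round(mel_to_hz(hz_to_mel(8000.0))))  # round trip recovers 8000 Hz
```

Equal steps on the mel axis cover ever-wider hertz bands at high frequencies, which is the nonlinearity the filter bank exploits.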
Although existing research on vocal timbre can solve some singer-identification problems fairly effectively, it does not describe vocal timbre quantitatively. Based on the above analysis, this patent therefore trains a deep convolutional neural network, builds a corresponding vocal-timbre embedding space, and characterizes timbre quantitatively within that space.
Summary of the invention
In view of this, the object of the present invention is to provide a quantitative characterization method of vocal timbre that analyzes the timbre of an amateur singer while singing.
The present invention is realized by the following scheme: a quantitative characterization method of vocal timbre, comprising the following steps:
Step S1: obtain clean (a cappella) recordings of professional singers singing songs;
Step S2: build a three-dimensional vocal-timbre embedding space R;
Step S3: analyze the clean recording of an amateur singer, and quantitatively characterize the amateur singer's timbre in the vocal-timbre embedding space R.
Further, step S2 specifically comprises the following steps:
Step S21: analyze 75 clean recordings from 15 professional singers (5 recordings per singer) using the constant-Q transform, and compute the CQT coefficients of each frame in the audio; the CQT coefficients are 192-dimensional;
Step S22: choose 60 audio frames and assemble their CQT coefficients into the input matrix of the neural network; the matrix size is 60*192;
Step S23: build a deep convolutional neural network, input the CQT coefficient matrices of 60 audio frames into it, and train the network;
Step S231: build the deep convolutional neural network with the following structure:
The first layer is a convolutional layer with 20 convolution kernels of size (1, 5, 13) and (2, 2) max-pooling; the input is a 60*192 matrix and the activation function is the hyperbolic tangent;
the second layer is a convolutional layer with 40 convolution kernels of size (20, 5, 11) and (2, 2) max-pooling; the activation function is the hyperbolic tangent;
the third layer is a convolutional layer with 80 convolution kernels of size (1, 1, 9) and (2, 2) max-pooling; the activation function is the hyperbolic tangent;
the fourth layer is a fully connected layer with 256 output nodes; the activation function is the hyperbolic tangent;
the fifth layer is a fully connected layer with 3 output nodes and a linear activation function; its output is a 3-dimensional vector;
Step S232: input the matrices obtained in step S22 into the deep convolutional neural network and train it iteratively with the pairwise training method, obtaining the trained network and the corresponding timbre embedding space R.
Further, step S21 is specifically: frame the professional singers' clean recordings with a frame length of 4096 samples and a frame shift of half a frame, using a Hamming window; then remove mute frames by judging whether the short-time energy of each audio frame falls below a threshold; the k-th component of the CQT spectrum of the n-th frame signal is computed from the resulting time-domain frame by the formula:
X_CQT(n, k) = (1/N_k) Σ_{m=0}^{N_k-1} x(m) w_{N_k}(m) e^{-j2πQm/N_k}
where N_k is the window length and w_{N_k} is the window function.
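A single CQT coefficient of the kind this formula describes can be sketched in numpy as an inner product of the frame with a windowed complex exponential whose length N_k shrinks as the bin frequency rises. The minimum frequency (27.5 Hz) and 24 bins per octave are assumptions; the patent fixes only the 192-bin output, the 4096-sample frame, and the Hamming window.

```python
import numpy as np

def cqt_bin(frame, k, fs, fmin=27.5, bpo=24):
    """One constant-Q coefficient of one time-domain frame, following the
    standard constant-Q definition. fmin and bpo are assumed values."""
    Q = 1.0 / (2.0 ** (1.0 / bpo) - 1.0)        # constant quality factor
    fk = fmin * 2.0 ** (k / bpo)                # bin center frequency
    Nk = int(round(Q * fs / fk))                # window shrinks as fk rises
    n = np.arange(Nk)
    w = np.hamming(Nk)                          # Hamming window, as in the patent
    kernel = w * np.exp(-2j * np.pi * Q * n / Nk)
    return np.abs(np.dot(frame[:Nk], kernel)) / Nk

fs = 16000
t = np.arange(4096) / fs
frame = np.sin(2 * np.pi * 220.0 * t)           # a 220 Hz tone (A3)
k_a3 = 24 * 3                                   # 3 octaves above fmin = 27.5
resp_on = cqt_bin(frame, k_a3, fs)              # matched bin
resp_off = cqt_bin(frame, k_a3 + 12, fs)        # bin half an octave away
print(resp_on > 5 * resp_off)
```

The matched bin responds strongly to the tone while a bin half an octave away barely responds, which is the selectivity the 192-bin analysis relies on.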
Further, step S232 is specifically: input the matrices obtained in step S22 into the deep convolutional neural network and train it iteratively with the pairwise training method, obtaining the trained network and the corresponding timbre embedding space R. The two paired training data X_1 and X_2 are fed into two independent neural networks G_A and G_B; the two networks have identical structure and identical parameters. The outputs of G_A and G_B are Z_1 and Z_2 respectively, and the cost function is defined as the Euclidean distance D between Z_1 and Z_2:
D = ||Z_1 - Z_2||_2
Depending on whether the voice labels of the two inputs X_1 and X_2 are identical, different loss functions are set:
L_sim = D^2            (labels identical)
L_dis = max(0, M_d - D)^2    (labels different)
where M_d is a constant related to the extent of the vocal-timbre embedding space.
Further, with M_d = 10: the loss function when the voice labels of the two inputs X_1 and X_2 are identical is L_sim, and the loss function when they differ is L_dis. Using the similarity of the labels of X_1 and X_2, the two loss functions are merged into:
Loss = V * L_sim + (1 - V) * L_dis
where V = 1 if the voice labels of X_1 and X_2 are identical, and V = 0 otherwise.
The learning rate of the network is set to 0.02 and the maximum number of iterations to 30000.
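The pairwise loss described above matches the standard contrastive loss. A minimal numpy sketch with the stated margin M_d = 10 follows; the quadratic forms of L_sim and L_dis are an assumption, since the patent text here omits the exact formulas.

```python
import numpy as np

def contrastive_loss(z1, z2, same_singer, margin=10.0):
    """Pull same-singer embeddings together, push different singers at
    least `margin` apart; margin plays the role of the patent's M_d."""
    d = np.linalg.norm(z1 - z2)            # Euclidean distance D
    l_sim = d ** 2                         # applied when labels match
    l_dis = max(0.0, margin - d) ** 2      # applied when labels differ
    v = 1.0 if same_singer else 0.0        # V selects which term is active
    return v * l_sim + (1.0 - v) * l_dis

a = np.array([1.0, 0.0, 0.0])
b = np.array([4.0, 4.0, 0.0])              # distance 5 from a
print(contrastive_loss(a, b, True))        # same singer far apart: 25.0
print(contrastive_loss(a, b, False))       # different, (10 - 5)^2 = 25.0
print(contrastive_loss(a, a, False))       # different but identical: 100.0
```

Note the gradient vanishes for different-singer pairs already more than the margin apart, so only "hard" negatives keep moving the embedding.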
Further, step S3 specifically comprises the following steps:
Step S31: analyze the amateur singer's clean recording using the constant-Q transform and compute the CQT coefficients of each frame; the CQT coefficients are 192-dimensional;
Step S32: choose the CQT coefficients of 60 audio frames and assemble them into the input matrix of the neural network, of size 60*192;
Step S33: input the matrix built in step S32 into the deep convolutional neural network trained in step S23; the network's output is the 3-dimensional timbre feature vector in the corresponding timbre embedding space R.
Further, step S31 is specifically: frame the amateur singer's clean recording with a frame length of 4096 samples and a frame shift of half a frame, using a Hamming window; then remove mute frames by judging whether the short-time energy of each audio frame falls below a threshold; the k-th component of the CQT spectrum of the n-th frame signal is computed from the resulting time-domain frame by the formula:
X_CQT(n, k) = (1/N_k) Σ_{m=0}^{N_k-1} x(m) w_{N_k}(m) e^{-j2πQm/N_k}
where N_k is the window length and w_{N_k} is the window function.
Compared with the prior art, the invention has the following advantages. The method preprocesses the professional singers' clean song recordings by framing, windowing, and removing mute frames, then performs CQT analysis on the resulting time-domain frames to obtain the per-frame CQT coefficients. The CQT coefficients of 60 selected audio frames form the input matrix of a deep convolutional neural network, which is trained in pairwise fashion to obtain the trained network and the corresponding vocal-timbre embedding space; within that embedding space, quantitative characterization of vocal timbre is achieved. During training, every two input CQT coefficient matrices form one training sample (when the two matrices belong to the same professional singer, the sample's label is 1; when they belong to different professional singers, the label is 0). Because the training labels are nominal, the deep convolutional neural network adopts a "Siamese structure" and is trained on the relation between the two CQT coefficient matrices in each sample. This pairwise training method effectively avoids the need for prior knowledge to determine the outputs.
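Constructing the nominally labeled pairs the Siamese training relies on can be sketched as follows; the alternating same/different sampling scheme is an illustrative assumption, not the patent's exact procedure.

```python
import random

def make_pairs(matrices_by_singer, n_pairs, seed=0):
    """Build (X1, X2, label) training pairs: label 1 when both CQT
    matrices come from the same singer, 0 otherwise."""
    rng = random.Random(seed)
    singers = list(matrices_by_singer)
    pairs = []
    for i in range(n_pairs):
        if i % 2 == 0:                           # same-singer pair
            s = rng.choice(singers)
            x1, x2 = rng.sample(matrices_by_singer[s], 2)
            pairs.append((x1, x2, 1))
        else:                                    # different-singer pair
            s1, s2 = rng.sample(singers, 2)
            pairs.append((rng.choice(matrices_by_singer[s1]),
                          rng.choice(matrices_by_singer[s2]), 0))
    return pairs

# toy data: 3 "singers", each with 2 placeholder matrices
data = {s: [f"{s}_m{j}" for j in range(2)] for s in ["s0", "s1", "s2"]}
pairs = make_pairs(data, 6)
print([lbl for _, _, lbl in pairs])              # alternating 1 / 0 labels
print(all(x1.split("_")[0] == x2.split("_")[0]
          for x1, x2, lbl in pairs if lbl == 1))
```

Balancing same and different pairs this way keeps the contrastive loss from being dominated by the far more numerous different-singer combinations.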
Brief description of the drawings
Fig. 1 is a schematic flow diagram of the method of the present invention.
Fig. 2 is a structure diagram of the deep convolutional neural network used in an embodiment of the invention.
Embodiment
The present invention is further described below with reference to the accompanying drawings and an embodiment.
This embodiment provides a quantitative characterization method of vocal timbre comprising the following steps, as shown in Fig. 1:
Step S1: obtain clean (a cappella) recordings of professional singers singing songs;
Step S2: build a three-dimensional vocal-timbre embedding space R;
Step S3: analyze the clean recording of an amateur singer, and quantitatively characterize the amateur singer's timbre in the vocal-timbre embedding space R.
In this embodiment, step S2 specifically comprises the following steps:
Step S21: analyze the 75 clean recordings from 15 professional singers (5 recordings per singer) using the constant-Q transform (CQT), and compute the 192-dimensional CQT coefficients of each frame;
Step S22: choose 60 audio frames and assemble their 192-dimensional CQT coefficients into the input matrix of the neural network; the matrix size is 60*192;
Step S23: build a deep convolutional neural network, input the CQT coefficient matrices of 60 audio frames into it, and train the network;
Step S231: build the deep convolutional neural network with the following structure:
The first layer (convolutional) uses 20 convolution kernels of size (1, 5, 13) with (2, 2) max-pooling; the input is a 60*192 matrix and the activation function is the hyperbolic tangent;
the second layer (convolutional) uses 40 convolution kernels of size (20, 5, 11) with (2, 2) max-pooling; the activation function is the hyperbolic tangent;
the third layer (convolutional) uses 80 convolution kernels of size (1, 1, 9) with (2, 2) max-pooling; the activation function is the hyperbolic tangent;
the fourth layer (fully connected) has 256 output nodes; the activation function is the hyperbolic tangent;
the fifth layer (fully connected) has 3 output nodes with a linear activation function; its output is a 3-dimensional vector;
Step S232: input the matrices obtained in step S22 into the deep convolutional neural network and train it iteratively with the pairwise training method, obtaining the trained network and the corresponding timbre embedding space R.
In this embodiment, step S3 specifically comprises the following steps:
Step S31: analyze the amateur singer's clean recording using the constant-Q transform (CQT) and compute the 192-dimensional CQT coefficients of each frame;
Step S32: choose the CQT coefficients of 60 audio frames and assemble them into the input matrix of the neural network (of size 60*192);
Step S33: input the matrix built in step S32 into the deep convolutional neural network trained in step S23; the network's output is the 3-dimensional timbre feature vector in the corresponding timbre embedding space R.
In this embodiment, taking a library of clean recordings from 15 professional singers as an example, the method described above specifically comprises the following steps:
Step 1: obtain clean recordings of professional singers singing songs. The song library contains 15 professional singers, each with 5 clean recordings; the audio format is wav, the sampling precision is 16 bit, and the sample rate is 16 kHz.
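The corpus format (mono WAV, 16 kHz, 16-bit) can be produced and checked with Python's standard wave module. The 440 Hz test tone and in-memory buffer below are purely illustrative; the embodiment's actual recordings are files on disk.

```python
import io
import math
import struct
import wave

fs, f0 = 16000, 440.0
buf = io.BytesIO()
with wave.open(buf, "wb") as wf:
    wf.setnchannels(1)        # mono
    wf.setsampwidth(2)        # 16 bit = 2 bytes per sample
    wf.setframerate(fs)       # 16 kHz sample rate
    pcm = (int(32767 * 0.5 * math.sin(2 * math.pi * f0 * n / fs))
           for n in range(fs))                      # one second of tone
    wf.writeframes(b"".join(struct.pack("<h", s) for s in pcm))

buf.seek(0)
with wave.open(buf, "rb") as wf:
    print(wf.getframerate(), 8 * wf.getsampwidth(), wf.getnframes())
```

Reading the header back confirms the 16000 Hz / 16-bit / 16000-frame layout, i.e. one second of audio in the corpus format.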
Step 2: build the three-dimensional vocal-timbre embedding space R, as follows:
Step 21: frame the professional singers' clean recordings with a frame length of 4096 samples and a frame shift of half a frame, using a Hamming window; then remove mute frames by judging whether the short-time energy of each audio frame falls below a threshold; the k-th component of the CQT spectrum of the n-th frame signal is computed from the resulting time-domain frame by the formula:
X_CQT(n, k) = (1/N_k) Σ_{m=0}^{N_k-1} x(m) w_{N_k}(m) e^{-j2πQm/N_k}
where N_k is the window length and w_{N_k} is the window function.
Step 22: the CQT coefficients of each frame are 192-dimensional; choose the CQT coefficients of 60 frames and assemble them into the input matrix of the neural network; the matrix size is 60*192.
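Assembling the 60*192 input matrix after silence removal can be sketched as follows. The energy threshold and the choice of the first 60 voiced frames are assumptions, since the patent does not specify how the 60 frames are selected.

```python
import numpy as np

def input_matrix(cqt_frames, energies, thresh=0.01, n=60):
    """Stack the 192-dim CQT vectors of the first n non-silent frames
    into the n x 192 network input. thresh is an assumed value."""
    voiced = [f for f, e in zip(cqt_frames, energies) if e >= thresh]
    if len(voiced) < n:
        raise ValueError("not enough voiced frames")
    return np.stack(voiced[:n])

# stand-in data: 100 random 192-dim "CQT" frames, every 5th one silent
rng = np.random.default_rng(0)
frames = [rng.standard_normal(192) for _ in range(100)]
energies = [0.0 if i % 5 == 0 else 1.0 for i in range(100)]
X = input_matrix(frames, energies)
print(X.shape)
```

The resulting (60, 192) array is exactly the shape the first convolutional layer expects.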
Step 23: build a deep convolutional neural network, input the CQT coefficient matrices of 60 audio frames into it, and train the network, as follows:
Step 231: build the deep convolutional neural network shown in Fig. 2 with the following structure:
The first layer (convolutional) uses 20 convolution kernels of size (1, 5, 13) with (2, 2) max-pooling; the input is a 60*192 matrix and the activation function is the hyperbolic tangent;
the second layer (convolutional) uses 40 convolution kernels of size (20, 5, 11) with (2, 2) max-pooling; the activation function is the hyperbolic tangent;
the third layer (convolutional) uses 80 convolution kernels of size (1, 1, 9) with (2, 2) max-pooling; the activation function is the hyperbolic tangent;
the fourth layer (fully connected) has 256 output nodes; the activation function is the hyperbolic tangent;
the fifth layer (fully connected) has 3 output nodes with a linear activation function; its output is a 3-dimensional vector;
Step 232: input the matrices obtained in step 22 into the deep convolutional neural network and train it iteratively with the pairwise training method, obtaining the trained network and the corresponding timbre embedding space R. The two paired training data X_1 and X_2 are fed into two independent neural networks G_A and G_B; the two networks have identical structure and identical parameters. The outputs of G_A and G_B are Z_1 and Z_2 respectively, and the cost function is defined as the Euclidean distance D between Z_1 and Z_2:
D = ||Z_1 - Z_2||_2
Depending on whether the voice labels of the two inputs X_1 and X_2 are identical, different loss functions are set:
L_sim = D^2            (labels identical)
L_dis = max(0, M_d - D)^2    (labels different)
where M_d is a constant related to the extent of the vocal-timbre embedding space; in this embodiment, M_d = 10. The loss function when the voice labels of X_1 and X_2 are identical is L_sim, and when they differ it is L_dis. Using the similarity of the labels, the two loss functions are merged:
Loss = V * L_sim + (1 - V) * L_dis
where V = 1 if the voice labels of X_1 and X_2 are identical, and V = 0 otherwise.
The learning rate of network is set to 0.02, and maximum iteration is set to 30000.
Step 3: analyze the amateur singer's clean recording and quantitatively characterize the amateur singer's timbre in the vocal-timbre embedding space R, as follows:
Step 31: frame the amateur singer's clean recording with a frame length of 4096 samples and a frame shift of half a frame, using a Hamming window; then remove mute frames by judging whether the short-time energy of each audio frame falls below a threshold; the k-th component of the CQT spectrum of the n-th frame signal is computed from the resulting time-domain frame by the formula:
X_CQT(n, k) = (1/N_k) Σ_{m=0}^{N_k-1} x(m) w_{N_k}(m) e^{-j2πQm/N_k}
where N_k is the window length and w_{N_k} is the window function.
Step 32: the CQT coefficients of each frame are 192-dimensional; choose the CQT coefficients of 60 frames and assemble them into the input matrix of the neural network; the matrix size is 60*192.
Step 33: input the 60*192 matrix built in step 32 into the deep convolutional neural network trained in step 23; the network's output is the 3-dimensional timbre feature vector in the corresponding vocal-timbre embedding space.
The foregoing is merely a preferred embodiment of the present invention; all equivalent changes and modifications made within the scope of the patent claims of the present invention shall fall within the scope of the present invention.
Claims (7)
1. A quantitative characterization method of vocal timbre, characterized by comprising the following steps:
Step S1: obtaining clean (a cappella) recordings of professional singers singing songs;
Step S2: building a three-dimensional vocal-timbre embedding space R;
Step S3: analyzing the clean recording of an amateur singer, and quantitatively characterizing the amateur singer's timbre in the vocal-timbre embedding space R.
2. The quantitative characterization method of vocal timbre according to claim 1, characterized in that step S2 specifically comprises the following steps:
Step S21: analyzing 75 clean recordings from 15 professional singers (5 recordings per singer) using the constant-Q transform, and computing the 192-dimensional CQT coefficients of each frame in the audio;
Step S22: choosing 60 audio frames and assembling their CQT coefficients into the input matrix of the neural network, the matrix size being 60*192;
Step S23: building a deep convolutional neural network, inputting the CQT coefficient matrices of 60 audio frames into it, and training the network;
Step S231: building the deep convolutional neural network with the following structure:
the first layer is a convolutional layer with 20 convolution kernels of size (1, 5, 13) and (2, 2) max-pooling; the input is a 60*192 matrix and the activation function is the hyperbolic tangent;
the second layer is a convolutional layer with 40 convolution kernels of size (20, 5, 11) and (2, 2) max-pooling; the activation function is the hyperbolic tangent;
the third layer is a convolutional layer with 80 convolution kernels of size (1, 1, 9) and (2, 2) max-pooling; the activation function is the hyperbolic tangent;
the fourth layer is a fully connected layer with 256 output nodes; the activation function is the hyperbolic tangent;
the fifth layer is a fully connected layer with 3 output nodes and a linear activation function; its output is a 3-dimensional vector;
Step S232: inputting the matrices obtained in step S22 into the deep convolutional neural network and training it iteratively with the pairwise training method, obtaining the trained network and the corresponding timbre embedding space R.
3. The quantitative characterization method of human voice timbre according to claim 2, characterized in that step S21 specifically is: frame the professional singer's a cappella audio with a frame length of 4096 sampled points and a fixed frame shift, using a Hamming window as the window function; then remove mute frames by checking whether the short-time energy of each audio frame falls below a threshold; finally, compute the k-th component of the CQT spectrum of the n-th frame signal from the resulting time-domain audio frames with the following formula:

X_CQT(n, k) = (1/N_k) * Σ_{m=0}^{N_k−1} w(N_k, m) · x_n(m) · e^{−j2πQm/N_k}

where N_k is the window length, w(N_k, m) is the window function, x_n(m) is the m-th sample of the n-th frame, and Q is the constant quality factor.
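The per-bin sum above can be sketched in pure Python; the helper names and the value of the quality factor Q are illustrative assumptions (Q depends on the number of bins per octave, which the claim does not fix):

```python
import cmath
import math

def hamming(m, n_k):
    """Hamming window sample, matching the window function chosen in the claim."""
    return 0.54 - 0.46 * math.cos(2 * math.pi * m / (n_k - 1))

def cqt_bin(frame, q):
    """One CQT spectrum component of a time-domain frame.

    The bin index k enters through the window length N_k = len(frame);
    q is the constant quality factor (value assumed, not fixed by the claim).
    """
    n_k = len(frame)
    total = 0j
    for m, x in enumerate(frame):
        total += hamming(m, n_k) * x * cmath.exp(-2j * math.pi * q * m / n_k)
    return total / n_k
```

A frame containing a sinusoid at the bin's centre frequency (q cycles per window) yields a large magnitude, while off-frequency content is strongly attenuated.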
4. The quantitative characterization method of human voice timbre according to claim 2, characterized in that step S232 specifically is: input the input matrix obtained in step S22 into the deep convolutional neural network and iteratively train the network with the pairwise training method, obtaining the trained deep convolutional neural network and the corresponding timbre embedding space R. The two paired training samples X1 and X2 are fed into two independent neural networks GA and GB with identical structure and identical parameters; the outputs of GA and GB are Z1 and Z2 respectively, and the cost function is defined as the Euclidean distance D between Z1 and Z2:

D = ||Z1 − Z2||2
Depending on whether the voice labels of the two inputs X1 and X2 are identical, different loss functions are set (in the standard contrastive form, consistent with the merged loss of claim 5):

Lsim = D^2 (labels identical)
Ldis = max(Md − D, 0)^2 (labels different)

where Md is a constant related to the range of the voice timbre embedding space.
5. The quantitative characterization method of human voice timbre according to claim 4, characterized in that: with Md = 10, the loss function when the two inputs X1 and X2 carry the same voice label is Lsim, and the loss function when their voice labels differ is Ldis; based on the similarity of the voice labels of X1 and X2, the two loss functions are merged into the loss shown below:

Loss = V*Lsim + (1 − V)*Ldis

where the value of V is determined as follows: V = 1 when X1 and X2 have the same voice label, and V = 0 otherwise.

The learning rate of the network is set to 0.02 and the maximum number of iterations to 30000.
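The merged pairwise loss can be sketched as follows, assuming the standard contrastive forms Lsim = D^2 and Ldis = max(Md − D, 0)^2 (the function names are hypothetical):

```python
import math

def euclidean(z1, z2):
    """D = ||Z1 - Z2||_2, the distance between the two embeddings."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(z1, z2)))

def pairwise_loss(z1, z2, same_label, m_d=10.0):
    """Merged loss V*Lsim + (1 - V)*Ldis with V = 1 for same-label pairs.

    Standard contrastive form assumed: Lsim = D^2 pulls same-singer
    embeddings together, Ldis = max(Md - D, 0)^2 pushes different-singer
    embeddings apart until they are Md apart (Md = 10 as in claim 5).
    """
    d = euclidean(z1, z2)
    v = 1.0 if same_label else 0.0
    l_sim = d ** 2
    l_dis = max(m_d - d, 0.0) ** 2
    return v * l_sim + (1.0 - v) * l_dis
```

Note that a different-label pair already separated by more than Md contributes zero loss, so the gradient only acts on pairs that violate the margin.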
6. The quantitative characterization method of human voice timbre according to claim 1, characterized in that step S3 specifically comprises the following steps:
Step S31: analyze the amateur singer's a cappella audio with the constant-Q transform method, computing the CQT coefficients of each frame in the audio; the CQT coefficients are 192-dimensional;
Step S32: select the CQT coefficients of 60 audio frames and assemble them into the input matrix of the neural network; the matrix size is 60*192;
Step S33: input the matrix built in step S32 into the deep convolutional neural network trained in step S23; the output of the neural network is the 3-dimensional timbre feature vector in the corresponding timbre embedding space R.
7. The quantitative characterization method of human voice timbre according to claim 6, characterized in that step S31 specifically is: frame the amateur singer's a cappella audio file with a frame length of 4096 sampled points and a fixed frame shift, using a Hamming window as the window function; then remove mute frames by checking whether the short-time energy of each audio frame falls below a threshold; finally, compute the k-th component of the CQT spectrum of the n-th frame signal from the resulting time-domain audio frames with the following formula:

X_CQT(n, k) = (1/N_k) * Σ_{m=0}^{N_k−1} w(N_k, m) · x_n(m) · e^{−j2πQm/N_k}

where N_k is the window length, w(N_k, m) is the window function, x_n(m) is the m-th sample of the n-th frame, and Q is the constant quality factor.
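The framing and mute-frame removal described above can be sketched as follows; the half-frame hop and the energy threshold are illustrative assumptions, since the claim fixes only the 4096-sample frame length:

```python
def frame_signal(x, frame_len=4096, hop=2048):
    """Split a sample sequence into frames of frame_len samples.

    The half-frame hop is an assumption; the claim fixes only frame_len.
    """
    return [x[i:i + frame_len] for i in range(0, len(x) - frame_len + 1, hop)]

def short_time_energy(frame):
    """Sum of squared samples within one frame."""
    return sum(s * s for s in frame)

def drop_silent(frames, threshold):
    """Remove mute frames whose short-time energy is below the threshold."""
    return [f for f in frames if short_time_energy(f) >= threshold]
```

The surviving frames would then be passed, frame by frame, through the per-bin CQT computation to build the 60*192 input matrix of step S32.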
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710207110.9A CN106997765B (en) | 2017-03-31 | 2017-03-31 | Quantitative characterization method for human voice timbre |
Publications (2)
Publication Number | Publication Date |
---|---|
CN106997765A true CN106997765A (en) | 2017-08-01 |
CN106997765B CN106997765B (en) | 2020-09-01 |
Family
ID=59433894
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101650940A (en) * | 2008-12-26 | 2010-02-17 | 中国科学院声学研究所 | Objective evaluation method for singing tone purity based on audio frequency spectrum characteristic analysis |
CN103177722A (en) * | 2013-03-08 | 2013-06-26 | 北京理工大学 | Tone-similarity-based song retrieval method |
CN104183245A (en) * | 2014-09-04 | 2014-12-03 | 福建星网视易信息系统有限公司 | Method and device for recommending music stars with tones similar to those of singers |
Non-Patent Citations (1)
Title |
---|
Eric J. Humphrey et al.: "Non-Linear Semantic Embedding for Organizing Large Instrument Sample Libraries", Proceedings of the 2011 10th International Conference on Machine Learning and Applications and Workshops * |
Cited By (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108364637A (en) * | 2018-02-01 | 2018-08-03 | 福州大学 | A kind of audio sentence boundary detection method |
CN108364637B (en) * | 2018-02-01 | 2021-07-13 | 福州大学 | Audio sentence boundary detection method |
CN108417228A (en) * | 2018-02-02 | 2018-08-17 | 福州大学 | Voice tone color method for measuring similarity under instrument tamber migration |
CN108417228B (en) * | 2018-02-02 | 2021-03-30 | 福州大学 | Human voice tone similarity measurement method under musical instrument tone migration |
CN108986798A (en) * | 2018-06-27 | 2018-12-11 | 百度在线网络技术(北京)有限公司 | Processing method, device and the equipment of voice data |
CN110277106A (en) * | 2019-06-21 | 2019-09-24 | 北京达佳互联信息技术有限公司 | Audio quality determines method, apparatus, equipment and storage medium |
CN110277106B (en) * | 2019-06-21 | 2021-10-22 | 北京达佳互联信息技术有限公司 | Audio quality determination method, device, equipment and storage medium |
CN111488485A (en) * | 2020-04-16 | 2020-08-04 | 北京雷石天地电子技术有限公司 | Music recommendation method based on convolutional neural network, storage medium and electronic device |
CN111488485B (en) * | 2020-04-16 | 2023-11-17 | 北京雷石天地电子技术有限公司 | Music recommendation method based on convolutional neural network, storage medium and electronic device |
CN112037766A (en) * | 2020-09-09 | 2020-12-04 | 广州华多网络科技有限公司 | Voice tone conversion method and related equipment |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
CF01 | Termination of patent right due to non-payment of annual fee | ||
Granted publication date: 20200901 |