CN106997765A - Quantitative characterization method for human voice timbre - Google Patents

Quantitative characterization method for human voice timbre

Info

Publication number
CN106997765A
Authority
CN
China
Prior art keywords
tone color
audio
frame
cqt
convolutional neural
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201710207110.9A
Other languages
Chinese (zh)
Other versions
CN106997765B (en)
Inventor
余春艳
苏金池
齐子铭
郭文忠
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Fuzhou University
Original Assignee
Fuzhou University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Fuzhou University filed Critical Fuzhou University
Priority to CN201710207110.9A priority Critical patent/CN106997765B/en
Publication of CN106997765A publication Critical patent/CN106997765A/en
Application granted granted Critical
Publication of CN106997765B publication Critical patent/CN106997765B/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 17/00: Speaker identification or verification techniques
    • G10L 25/18: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00, characterised by the type of extracted parameters, the extracted parameters being spectral information of each sub-band
    • G10L 25/30: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00, characterised by the analysis technique using neural networks
    • G10L 25/45: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00, characterised by the type of analysis window
    • G10L 15/16: Speech classification or search using artificial neural networks
    • G10L 17/18: Artificial neural networks; Connectionist approaches
    • G10L 21/003: Changing voice quality, e.g. pitch or formants

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Auxiliary Devices For Music (AREA)
  • Reverberation, Karaoke And Other Acoustics (AREA)

Abstract

The present invention relates to a quantitative characterization method for human voice timbre. The method analyzes a cappella recordings of songs sung by professional singers, computes 192-dimensional CQT features for every frame, assembles the CQT features of 60 audio frames into a 60*192 input matrix, and uses these matrices to train a deep convolutional neural network, yielding a trained deep convolutional neural network and a corresponding voice timbre embedding space. An amateur singer's a cappella audio is then analyzed and processed in the same way and fed into the trained deep convolutional neural network, which produces a timbre vector in the same voice timbre embedding space, so that the subjectively perceived timbre characteristic is represented in a quantitative, objective way.

Description

Quantitative characterization method for human voice timbre
Technical field
The present invention relates to acoustic signal processing methods in the field of singing, and in particular to a quantitative characterization method for human voice timbre.
Background technology
The American National Standards Institute defines timbre as follows: "Timbre is that attribute of auditory sensation in terms of which a listener can judge that two sounds, similarly presented and having the same pitch and loudness, are dissimilar." Accordingly, voice timbre in singing refers to the sound characteristics by which people determine which singer is performing when different singers sing the same song.
In acoustic experiments, the spectrogram is commonly used to analyze sound. A spectrogram shows how amplitude varies with frequency and time: the ordinate represents frequency, the abscissa represents time, and the magnitude of the amplitude is represented by the depth of grey or by different colours. From the point of view of the spectrogram, the factors that determine timbre are the presence or absence of overtones and their relative strength.
Scholars have long studied audio signal processing in the hope of establishing, alongside pitch and loudness, a quantitative and objective characteristic index for timbre. To this day, however, the evaluation of timbre largely remains at a qualitative, subjective stage; timbre has not yet been effectively modelled and quantified, and no quantitative measurement system has been constructed. Further research is therefore needed on the characterization, measurement indices and similarity measures of timbre.
Current research on timbre mainly covers musical instrument classification and recognition and singer identification, realized chiefly with various physical timbre features and classification models. The physical features commonly used for timbre classification fall into three broad categories: time-domain features, frequency-domain features and cepstral-domain features.
Time-domain features: time-domain features reflect the dynamic change of a sound, and the temporal envelopes of different audio signals all differ. For a comprehensive analysis, a musical tone can be divided into three stages: attack, steady state and decay. The attack is the part where the tone rises from nothing, the steady state is the main body of the tone, and the decay is the final part where the tone fades away. The attack and decay portions may last only a few tens of milliseconds, but the attack stage plays a very important role in distinguishing timbre.
Frequency-domain features: different scales of frequency-domain analysis yield different spectra. Common spectra are the STFT spectrum and the CQT spectrum.
1) The filter-bank centre frequencies of the STFT spectrum rise linearly and every filter has the same bandwidth; the spectrum of one frame is computed as

X(k) = Σ_{n=0}^{N-1} x(n) · w(n) · e^{-j2πkn/N}

where x(n) is the speech signal of a given frame, w(n) is the window function and N is the frame length.
2) The CQT uses a filter bank whose centre frequencies are exponentially distributed, expressing the spectral energy of a note signal on the scale of musical pitches, and the quality factor Q of the filter bank is kept constant. The CQT transform therefore has higher frequency resolution and lower temporal resolution at low frequencies, and higher temporal resolution and lower frequency resolution at high frequencies. Since the information in the low-frequency part of a note signal is usually the more valuable, the CQT transform matches this property well (the short sketch below contrasts the two kinds of bin spacing).
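To make the difference between the two spectra concrete, the following Python sketch (using librosa) compares the linearly spaced STFT bins with the exponentially spaced, constant-Q CQT bins; the sampling rate, FFT length, minimum frequency and bins-per-octave values are illustrative assumptions, not values taken from the patent.

```python
# Illustrative comparison of STFT and CQT frequency axes; all parameter values
# here are assumptions chosen only for the example.
import numpy as np
import librosa

stft_freqs = librosa.fft_frequencies(sr=16000, n_fft=4096)
cqt_freqs = librosa.cqt_frequencies(n_bins=192,
                                    fmin=librosa.note_to_hz("A0"),
                                    bins_per_octave=24)

print(np.diff(stft_freqs[:4]))         # constant spacing of sr/n_fft (about 3.9 Hz)
print(np.diff(cqt_freqs[:4]))          # spacing grows with frequency
print(cqt_freqs[1:4] / cqt_freqs[:3])  # constant ratio 2**(1/24) between adjacent bins
```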
Cepstral-domain features: Mel-frequency cepstral coefficients (Mel Frequency Cepstrum Coefficient, MFCC) were proposed on the basis of the human auditory perception model and have proved to be among the most important features in fields such as music and speech classification and recognition. Human subjective perception of the frequency domain is nonlinear, i.e. f_mel = 1125 log(1 + f/700), where f_mel is the perceived frequency in mel and f is the actual frequency in hertz. Transforming the signal spectrum into the perceptual domain simulates the process of auditory processing well. To compute MFCC, the signal is first pre-processed by framing, windowing and pre-emphasis; each frame is then transformed to the frequency domain by FFT and the spectral line energy of every frame is computed; the line energies are passed through the mel filter bank to obtain the energy in each filter; the logarithm of the filter-bank energies is taken and a DCT is applied, which yields the MFCC.
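As a point of reference, the MFCC pipeline sketched above (framing, windowing, FFT, mel filter bank, log and DCT) can be reproduced in a few lines. This is only an illustration with librosa; the file name and parameter choices are assumptions, and the method of this patent uses CQT rather than MFCC features.

```python
# Illustrative MFCC extraction with librosa; "voice.wav" and all parameters are
# assumptions for the example only.
import librosa

y, sr = librosa.load("voice.wav", sr=16000)      # mono audio resampled to 16 kHz
mfcc = librosa.feature.mfcc(y=y, sr=sr,
                            n_mfcc=13,           # number of cepstral coefficients kept
                            n_fft=4096,          # frame length in samples
                            hop_length=2048)     # half-frame hop
print(mfcc.shape)                                # (13, number_of_frames)
```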
Although existing research related to voice timbre can solve certain singer-identification problems fairly effectively, it does not describe voice timbre quantitatively. Based on the above analysis, this patent therefore trains a deep convolutional neural network, builds a corresponding voice timbre embedding space, and characterizes timbre quantitatively in that embedding space.
Summary of the invention
In view of this, the object of the present invention is to provide a quantitative characterization method for human voice timbre, with which the timbre of an amateur singer's singing can be analyzed.
The present invention is realized with the following scheme: a quantitative characterization method for human voice timbre, comprising the following steps:
Step S1: obtain the a cappella audio of professional singers singing songs;
Step S2: build a three-dimensional voice timbre embedding space R;
Step S3: analyze the a cappella audio of an amateur singer, and quantitatively characterize the amateur singer's timbre in the voice timbre embedding space R.
Further, the step S2 specifically comprises the following steps:
Step S21: analyze, using the constant-Q transform, a total of 75 a cappella recordings from 15 professional singers, each professional singer contributing 5 recordings, and compute the CQT coefficients of every frame in the audio; the CQT coefficients are 192-dimensional;
Step S22: choose 60 audio frames and assemble the CQT coefficients of these 60 audio frames into the input matrix of the neural network; the matrix size is 60*192;
Step S23: build a deep convolutional neural network, feed the CQT coefficient matrices of the 60 audio frames into the deep convolutional neural network, and train it;
Step S231: build the deep convolutional neural network according to the following structure (a code sketch of this structure follows the layer list):
The first layer is a convolutional layer with 20 convolution kernels of size (1,5,13) and (2,2) max-pooling; the input is a 60*192 matrix and the activation function is the hyperbolic tangent;
the second layer is a convolutional layer with 40 convolution kernels of size (20,5,11) and (2,2) max-pooling; the activation function is the hyperbolic tangent;
the third layer is a convolutional layer with 80 convolution kernels of size (1,1,9) and (2,2) max-pooling; the activation function is the hyperbolic tangent;
the fourth layer is a fully connected layer with 256 output nodes; the activation function is the hyperbolic tangent;
the fifth layer is a fully connected layer with 3 output nodes; the activation function is linear and the output is a 3-dimensional vector.
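A minimal PyTorch sketch of this five-layer structure is given below. The kernel-size triples above are read here as (input channels, kernel height, kernel width), the pooled feature maps are flattened before the fully connected layers, and a lazily initialized linear layer is used to infer the flattened size; these reading choices are assumptions, and the sketch is not the patent's reference implementation.

```python
# A sketch of the five-layer deep convolutional network described above,
# under the interpretation stated in the preceding paragraph.
import torch
import torch.nn as nn

class TimbreNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 20, kernel_size=(5, 13)), nn.Tanh(), nn.MaxPool2d(2),   # layer 1
            nn.Conv2d(20, 40, kernel_size=(5, 11)), nn.Tanh(), nn.MaxPool2d(2),  # layer 2
            nn.Conv2d(40, 80, kernel_size=(1, 9)), nn.Tanh(), nn.MaxPool2d(2),   # layer 3
        )
        self.fc = nn.Sequential(
            nn.Flatten(),
            nn.LazyLinear(256), nn.Tanh(),   # layer 4: 256 output nodes, tanh
            nn.Linear(256, 3),               # layer 5: linear, 3-dimensional output
        )

    def forward(self, x):                    # x: (batch, 1, 60, 192) CQT matrices
        return self.fc(self.features(x))

net = TimbreNet()
print(net(torch.randn(2, 1, 60, 192)).shape)  # torch.Size([2, 3])
```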
Step S232: feed the input matrices obtained in step S22 into the deep convolutional neural network and train the network iteratively with the paired training method, obtaining the trained deep convolutional neural network and the corresponding timbre embedding space R.
Further, the step S21 is specifically: frame the professional singers' a cappella audio with a frame length of 4096 samples and a frame shift of half a frame, using a Hamming window; then remove silent frames by judging whether the short-time energy of each audio frame is below a threshold; finally compute, from every resulting time-domain audio frame, the k-th component of the CQT spectrum of the n-th frame signal with the following formula:

X_cq(k) = (1/N_k) · Σ_{n=0}^{N_k-1} x(n) · w_{N_k}(n) · e^{-j2πQn/N_k}

where N_k is the window length and w_{N_k}(n) is the window function.
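The preprocessing of steps S21 and S22 could look roughly as follows in Python. The frame length (4096 samples), half-frame hop, Hamming window, silence removal by short-time energy and 192 CQT coefficients per frame come from the text above; the file name, the minimum frequency, the bins-per-octave value and the energy threshold are assumptions added only so that the sketch runs.

```python
# Rough sketch of steps S21/S22: 192-bin CQT per frame, silence removal by
# short-time energy, and a 60x192 input matrix. Assumed values are marked.
import numpy as np
import librosa

y, sr = librosa.load("pro_singer.wav", sr=16000)          # assumed file, 16 kHz mono

# 192 CQT bins (24 bins per octave over 8 octaves from A0, an assumed layout),
# 4096-sample frames with a 2048-sample (half-frame) hop and a Hamming window.
C = np.abs(librosa.cqt(y, sr=sr, hop_length=2048,
                       fmin=librosa.note_to_hz("A0"),
                       n_bins=192, bins_per_octave=24, window="hamming"))

# Short-time energy per 4096-sample frame, used to drop silent frames.
frames = librosa.util.frame(y, frame_length=4096, hop_length=2048)
energy = (frames ** 2).sum(axis=0)
n = min(C.shape[1], energy.shape[0])
voiced = energy[:n] > 0.01 * energy[:n].max()              # assumed threshold
C = C[:, :n][:, voiced]

# Step S22: the CQT coefficients of 60 frames form one 60x192 input matrix.
X = C[:, :60].T
print(X.shape)                                             # (60, 192)
```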
Further, the step S232 is specifically: feed the input matrices obtained in step S22 into the deep convolutional neural network and train the network iteratively with the paired training method, obtaining the trained deep convolutional neural network and the corresponding timbre embedding space R. The two data items X1 and X2 of a training pair are fed into two independent neural networks G_A and G_B, whose structures and parameters are identical; the outputs of G_A and G_B are Z1 and Z2 respectively, and the cost function is defined as the Euclidean distance D between Z1 and Z2:

D = ||Z1 - Z2||_2

Depending on whether the voice labels of the two input data X1 and X2 are identical, different loss functions are set:

L_sim = max(0, D²)
L_dis = max(0, M_d - D)²

where M_d is a constant related to the range of the voice timbre embedding space.
Further, when M_d = 10, the loss function used when the voice labels of the two inputs X1 and X2 are identical is L_sim, and the loss function used when the voice labels differ is L_dis. Using the similarity of the voice labels of X1 and X2, the two loss functions are merged; the merged loss function is:

Loss = V*L_sim + (1 - V)*L_dis

where the value of V is determined by the following formula:

V = 1 if Label(X1) = Label(X2), and V = 0 if Label(X1) ≠ Label(X2).
The learning rate of the network is set to 0.02, and the maximum number of iterations is set to 30000.
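Written out in code, the paired objective above amounts to a contrastive loss on the two outputs Z1 and Z2. The following Python sketch mirrors the formulas for D, L_sim, L_dis and the merged Loss; it is an illustration only, applied here to random stand-in embeddings rather than to the network's outputs.

```python
# Sketch of the merged paired-training loss: D = ||Z1 - Z2||_2,
# L_sim = max(0, D^2), L_dis = max(0, M_d - D)^2, Loss = V*L_sim + (1-V)*L_dis.
import torch
import torch.nn.functional as F

M_D = 10.0   # M_d as set in the patent

def paired_loss(z1, z2, v):
    """v is 1.0 when X1 and X2 share the same voice label, else 0.0."""
    d = F.pairwise_distance(z1, z2)               # Euclidean distance D per pair
    l_sim = torch.clamp(d ** 2, min=0.0)          # max(0, D^2)
    l_dis = torch.clamp(M_D - d, min=0.0) ** 2    # max(0, M_d - D)^2
    return (v * l_sim + (1.0 - v) * l_dis).mean()

z1, z2 = torch.randn(8, 3), torch.randn(8, 3)     # stand-in 3-D embeddings of 8 pairs
v = torch.randint(0, 2, (8,)).float()             # pair labels V
print(paired_loss(z1, z2, v))
```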
Further, the step S3 specifically comprises the following steps:
Step S31: analyze the amateur singer's a cappella audio using the constant-Q transform and compute the CQT coefficients of every frame in the audio; the CQT coefficients are 192-dimensional;
Step S32: choose the CQT coefficients of 60 audio frames and assemble them into the input matrix of the neural network; the size is 60*192;
Step S33: feed the input matrix built in step S32 into the deep convolutional neural network trained in step S23; the output of the neural network is the 3-dimensional timbre feature vector in the corresponding timbre embedding space R (a sketch of this inference step follows).
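In code, step S33 reduces to a single forward pass through the trained network. The sketch below assumes the TimbreNet class from the earlier sketch, a 60*192 NumPy matrix X produced by the CQT preprocessing sketch, and a weight file name that is a placeholder rather than anything specified by the patent.

```python
# Sketch of step S33: map one 60x192 CQT matrix of an amateur singer to a
# 3-dimensional point in the timbre embedding space R.
import torch

net = TimbreNet()                                   # class from the earlier sketch
net(torch.zeros(1, 1, 60, 192))                     # materialize the lazy layer
net.load_state_dict(torch.load("timbre_net.pt"))    # placeholder weight file
net.eval()

x = torch.from_numpy(X).float().view(1, 1, 60, 192) # X from the CQT preprocessing sketch
with torch.no_grad():
    timbre_vector = net(x).squeeze(0)               # 3-D timbre feature vector
print(timbre_vector)
```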
Further, the step S31 is specifically: frame the amateur singer's a cappella audio file with a frame length of 4096 samples and a frame shift of half a frame, using a Hamming window; then remove silent frames by judging whether the short-time energy of each audio frame is below a threshold; finally compute, from every resulting time-domain audio frame, the k-th component of the CQT spectrum of the n-th frame signal with the following formula:

X_cq(k) = (1/N_k) · Σ_{n=0}^{N_k-1} x(n) · w_{N_k}(n) · e^{-j2πQn/N_k}

where N_k is the window length and w_{N_k}(n) is the window function.
Compared with the prior art, the invention has the following advantages. The method pre-processes the professional singers' a cappella recordings (framing, windowing and removal of silent frames) and then performs CQT analysis on the resulting time-domain audio frames to obtain the CQT coefficients of every frame. The CQT coefficients of 60 selected audio frames form the input matrix of the deep convolutional neural network, which is trained in a paired fashion, yielding the trained deep convolutional neural network and the corresponding voice timbre embedding space; in this timbre embedding space, the quantitative characterization of voice timbre can be realized. When training the deep convolutional neural network, every two input CQT coefficient matrices form one training example (when the two CQT coefficient matrices belong to the same professional singer the label of the example is 1; when they belong to different professional singers the label is 0). Because the labels of the training data are nominal, the deep convolutional neural network uses a Siamese structure and is trained on the relation between the two CQT coefficient matrices in a training example; this paired training method effectively avoids the problem of having to determine the outputs with prior knowledge (a sketch of the pair construction follows).
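The pair construction described in the previous paragraph can be sketched as follows; the random sampling strategy and the balance between positive and negative pairs are assumptions, since the patent only specifies how a pair is labelled.

```python
# Sketch of building labelled training pairs: label 1 when both CQT matrices
# come from the same professional singer, label 0 otherwise.
import random

def make_pairs(matrices_by_singer, n_pairs):
    """matrices_by_singer: dict mapping singer id -> list of 60x192 CQT matrices."""
    singers = list(matrices_by_singer)
    pairs = []
    for _ in range(n_pairs):
        if random.random() < 0.5:                    # positive pair, label 1
            s = random.choice(singers)
            x1, x2 = random.sample(matrices_by_singer[s], 2)
            pairs.append((x1, x2, 1))
        else:                                        # negative pair, label 0
            s1, s2 = random.sample(singers, 2)
            pairs.append((random.choice(matrices_by_singer[s1]),
                          random.choice(matrices_by_singer[s2]), 0))
    return pairs
```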
Brief description of the drawings
Fig. 1 is a schematic flow diagram of the method of the present invention.
Fig. 2 is a structure diagram of the deep convolutional neural network used in the embodiment of the present invention.
Embodiment
The present invention is further described below with reference to the accompanying drawings and an embodiment.
This embodiment provides a quantitative characterization method for human voice timbre which, as shown in Fig. 1, comprises the following steps:
Step S1: obtain the a cappella audio of professional singers singing songs;
Step S2: build a three-dimensional voice timbre embedding space R;
Step S3: analyze the a cappella audio of an amateur singer, and quantitatively characterize the amateur singer's timbre in the voice timbre embedding space R.
In this embodiment, the step S2 specifically comprises the following steps:
Step S21: analyze, using the constant-Q transform (CQT), a total of 75 a cappella recordings from 15 professional singers (each professional singer has 5 a cappella recordings), and compute the CQT coefficients (192-dimensional) of every frame in the audio;
Step S22: choose 60 audio frames and assemble the CQT coefficients of these 60 audio frames (192 dimensions per frame) into the input matrix of the neural network; the matrix size is 60*192;
Step S23: build a deep convolutional neural network, feed the CQT coefficient matrices of the 60 audio frames into the deep convolutional neural network, and train it;
Step S231: build the deep convolutional neural network according to the following structure:
The first layer (convolutional layer) uses 20 convolution kernels of size (1,5,13) with (2,2) max-pooling; the input is a 60*192 matrix and the activation function is the hyperbolic tangent;
the second layer (convolutional layer) uses 40 convolution kernels of size (20,5,11) with (2,2) max-pooling; the activation function is the hyperbolic tangent;
the third layer (convolutional layer) uses 80 convolution kernels of size (1,1,9) with (2,2) max-pooling; the activation function is the hyperbolic tangent;
the fourth layer (fully connected layer) has 256 output nodes; the activation function is the hyperbolic tangent;
the fifth layer (fully connected layer) has 3 output nodes; the activation function is linear and the output is a 3-dimensional vector;
Step S232: feed the input matrices obtained in step S22 into the deep convolutional neural network and train the network iteratively with the paired training method, obtaining the trained deep convolutional neural network and the corresponding timbre embedding space R.
In this embodiment, the step S3 specifically comprises the following steps:
Step S31: analyze the amateur singer's a cappella audio using the constant-Q transform (CQT) and compute the CQT coefficients (192-dimensional) of every frame in the audio;
Step S32: choose the CQT coefficients of 60 audio frames and assemble them into the input matrix of the neural network (the size is 60*192);
Step S33: feed the input matrix built in step S32 into the deep convolutional neural network trained in step S23; the output of the neural network is the 3-dimensional timbre feature vector in the corresponding timbre embedding space R.
In this embodiment, taking an a cappella song library containing 15 professional singers as an example, the above method specifically comprises the following steps:
Step 1: obtain the a cappella audio of the professional singers singing songs. The song library contains 15 professional singers, each singer corresponding to 5 a cappella recordings; the audio format is wav, the sampling precision is 16 bit and the sampling rate is 16 kHz.
Step 2: build the three-dimensional voice timbre embedding space R, with the following specific steps:
Step 21: frame the professional singers' a cappella audio with a frame length of 4096 samples and a frame shift of half a frame, using a Hamming window; then remove silent frames by judging whether the short-time energy of each audio frame is below a threshold; finally compute, from every resulting time-domain audio frame, the k-th component of the CQT spectrum of the n-th frame signal with the following formula: X_cq(k) = (1/N_k) · Σ_{n=0}^{N_k-1} x(n) · w_{N_k}(n) · e^{-j2πQn/N_k}, where N_k is the window length and w_{N_k}(n) is the window function.
Step 22: the CQT coefficients of each frame are 192-dimensional; choose the CQT coefficients of 60 frames and assemble them into the input matrix of the neural network; the matrix size is 60*192.
Step 23: build a deep convolutional neural network, feed the CQT coefficient matrices of the 60 audio frames into the deep convolutional neural network, and train it, with the following specific steps:
Step 231: build the deep convolutional neural network shown in Fig. 2 according to the following structure:
The first layer (convolutional layer) uses 20 convolution kernels of size (1,5,13) with (2,2) max-pooling; the input is a 60*192 matrix and the activation function is the hyperbolic tangent;
the second layer (convolutional layer) uses 40 convolution kernels of size (20,5,11) with (2,2) max-pooling; the activation function is the hyperbolic tangent;
the third layer (convolutional layer) uses 80 convolution kernels of size (1,1,9) with (2,2) max-pooling; the activation function is the hyperbolic tangent;
the fourth layer (fully connected layer) has 256 output nodes; the activation function is the hyperbolic tangent;
the fifth layer (fully connected layer) has 3 output nodes; the activation function is linear and the output is a 3-dimensional vector;
Step 232: feed the input matrices obtained in step 22 into the deep convolutional neural network and train the network iteratively with the paired training method, obtaining the trained deep convolutional neural network and the corresponding timbre embedding space R. The two data items X1 and X2 of a training pair are fed into two independent neural networks G_A and G_B, whose structures and parameters are identical; the outputs of G_A and G_B are Z1 and Z2 respectively, and the cost function is defined as the Euclidean distance D between Z1 and Z2:

D = ||Z1 - Z2||_2

Depending on whether the voice labels of the two input data X1 and X2 are identical, different loss functions are set:

L_sim = max(0, D²)
L_dis = max(0, M_d - D)²

where M_d is a constant related to the range of the voice timbre embedding space; in this embodiment, M_d = 10. When the voice labels of X1 and X2 are identical the loss function is L_sim; when the voice labels differ the loss function is L_dis. Using the similarity of the voice labels of X1 and X2, the two loss functions are merged; the merged loss function is:

Loss = V*L_sim + (1 - V)*L_dis

where the value of V is determined by the following formula:

V = 1 if Label(X1) = Label(X2), and V = 0 if Label(X1) ≠ Label(X2).

The learning rate of the network is set to 0.02, and the maximum number of iterations is set to 30000 (a skeleton of this training loop follows).
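A training-loop skeleton tying the earlier sketches together might look as follows; it assumes the TimbreNet class, the paired_loss function and the make_pairs helper defined in the sketches above, plus a placeholder dictionary of CQT matrices, and the choice of plain SGD with one pair per update is an assumption beyond what the patent states.

```python
# Skeleton of step 232: 30000 paired updates with learning rate 0.02 and M_d = 10.
import torch

net = TimbreNet()
net(torch.zeros(1, 1, 60, 192))                        # materialize the lazy layer
optimizer = torch.optim.SGD(net.parameters(), lr=0.02)

pairs = make_pairs(cqt_matrices_by_singer, n_pairs=30000)   # placeholder data dict
for x1, x2, v in pairs:                                # one pair per iteration
    x1 = torch.from_numpy(x1).float().view(1, 1, 60, 192)
    x2 = torch.from_numpy(x2).float().view(1, 1, 60, 192)
    loss = paired_loss(net(x1), net(x2), torch.tensor([float(v)]))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```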
Step 3: analyze the a cappella audio of the amateur singer, and quantitatively characterize the amateur singer's timbre in the voice timbre embedding space R, with the following specific steps:
Step 31: frame the amateur singer's a cappella audio file with a frame length of 4096 samples and a frame shift of half a frame, using a Hamming window; then remove silent frames by judging whether the short-time energy of each audio frame is below a threshold; finally compute, from every resulting time-domain audio frame, the k-th component of the CQT spectrum of the n-th frame signal with the following formula: X_cq(k) = (1/N_k) · Σ_{n=0}^{N_k-1} x(n) · w_{N_k}(n) · e^{-j2πQn/N_k}, where N_k is the window length and w_{N_k}(n) is the window function.
Step 32: the CQT coefficients of each frame are 192-dimensional; choose the CQT coefficients of 60 frames and assemble them into the input matrix of the neural network; the matrix size is 60*192.
Step 33: feed the 60*192 input matrix built in step 32 into the deep convolutional neural network trained in step 23; the output of the neural network is the 3-dimensional timbre feature vector in the corresponding voice timbre embedding space.
The above is merely a preferred embodiment of the present invention; all equivalent changes and modifications made within the scope of the patent claims of the present invention shall fall within the scope covered by the present invention.

Claims (7)

1. A quantitative characterization method for human voice timbre, characterized by comprising the following steps:
Step S1: obtain the a cappella audio of professional singers singing songs;
Step S2: build a three-dimensional voice timbre embedding space R;
Step S3: analyze the a cappella audio of an amateur singer, and quantitatively characterize the amateur singer's timbre in the voice timbre embedding space R.
2. The quantitative characterization method for human voice timbre according to claim 1, characterized in that the step S2 specifically comprises the following steps:
Step S21: analyze, using the constant-Q transform, a total of 75 a cappella recordings from 15 professional singers, each professional singer contributing 5 recordings, and compute the CQT coefficients of every frame in the audio; the CQT coefficients are 192-dimensional;
Step S22: choose 60 audio frames and assemble the CQT coefficients of these 60 audio frames into the input matrix of the neural network; the matrix size is 60*192;
Step S23: build a deep convolutional neural network, feed the CQT coefficient matrices of the 60 audio frames into the deep convolutional neural network, and train the deep convolutional neural network;
Step S231: build the deep convolutional neural network according to the following structure:
the first layer is a convolutional layer with 20 convolution kernels of size (1,5,13) and (2,2) max-pooling; the input is a 60*192 matrix and the activation function is the hyperbolic tangent;
the second layer is a convolutional layer with 40 convolution kernels of size (20,5,11) and (2,2) max-pooling; the activation function is the hyperbolic tangent;
the third layer is a convolutional layer with 80 convolution kernels of size (1,1,9) and (2,2) max-pooling; the activation function is the hyperbolic tangent;
the fourth layer is a fully connected layer with 256 output nodes; the activation function is the hyperbolic tangent;
the fifth layer is a fully connected layer with 3 output nodes; the activation function is linear and the output is a 3-dimensional vector;
Step S232: feed the input matrices obtained in step S22 into the deep convolutional neural network and train the network iteratively with the paired training method, obtaining the trained deep convolutional neural network and the corresponding timbre embedding space R.
3. The quantitative characterization method for human voice timbre according to claim 2, characterized in that the step S21 is specifically: frame the professional singers' a cappella audio with a frame length of 4096 samples and a frame shift of half a frame, using a Hamming window; then remove silent frames by judging whether the short-time energy of each audio frame is below a threshold; finally compute, from every resulting time-domain audio frame, the k-th component of the CQT spectrum of the n-th frame signal with the following formula:

X_cq(k) = (1/N_k) · Σ_{n=0}^{N_k-1} x(n) · w_{N_k}(n) · e^{-j2πQn/N_k}

where N_k is the window length and w_{N_k}(n) is the window function.
4. The quantitative characterization method for human voice timbre according to claim 2, characterized in that the step S232 is specifically: feed the input matrices obtained in step S22 into the deep convolutional neural network and train the network iteratively with the paired training method, obtaining the trained deep convolutional neural network and the corresponding timbre embedding space R; the two data items X1 and X2 of a training pair are fed into two independent neural networks G_A and G_B, whose structures and parameters are identical; the outputs of G_A and G_B are Z1 and Z2 respectively, and the cost function is defined as the Euclidean distance D between Z1 and Z2:

D = ||Z1 - Z2||_2

Depending on whether the voice labels of the two input data X1 and X2 are identical, different loss functions are set:

L_sim = max(0, D²)
L_dis = max(0, M_d - D)²

where M_d is a constant related to the range of the voice timbre embedding space.
5. The quantitative characterization method for human voice timbre according to claim 4, characterized in that when M_d = 10, the loss function used when the voice labels of the two inputs X1 and X2 are identical is L_sim, and the loss function used when the voice labels differ is L_dis; using the similarity of the voice labels of X1 and X2, the two loss functions are merged, and the merged loss function is:

Loss = V*L_sim + (1 - V)*L_dis

where the value of V is determined by the following formula:

V = 1 if Label(X1) = Label(X2), and V = 0 if Label(X1) ≠ Label(X2);

the learning rate of the network is set to 0.02, and the maximum number of iterations is set to 30000.
6. The quantitative characterization method for human voice timbre according to claim 1, characterized in that the step S3 specifically comprises the following steps:
Step S31: analyze the amateur singer's a cappella audio using the constant-Q transform and compute the CQT coefficients of every frame in the audio; the CQT coefficients are 192-dimensional;
Step S32: choose the CQT coefficients of 60 audio frames and assemble them into the input matrix of the neural network; the size is 60*192;
Step S33: feed the input matrix built in step S32 into the deep convolutional neural network trained in step S23; the output of the neural network is the 3-dimensional timbre feature vector in the corresponding timbre embedding space R.
7. The quantitative characterization method for human voice timbre according to claim 6, characterized in that the step S31 is specifically: frame the amateur singer's a cappella audio file with a frame length of 4096 samples and a frame shift of half a frame, using a Hamming window; then remove silent frames by judging whether the short-time energy of each audio frame is below a threshold; finally compute, from every resulting time-domain audio frame, the k-th component of the CQT spectrum of the n-th frame signal with the following formula:

X_cq(k) = (1/N_k) · Σ_{n=0}^{N_k-1} x(n) · w_{N_k}(n) · e^{-j2πQn/N_k}

where N_k is the window length and w_{N_k}(n) is the window function.
CN201710207110.9A 2017-03-31 2017-03-31 Quantitative characterization method for human voice timbre Expired - Fee Related CN106997765B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710207110.9A CN106997765B (en) 2017-03-31 2017-03-31 Quantitative characterization method for human voice timbre

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710207110.9A CN106997765B (en) 2017-03-31 2017-03-31 Quantitative characterization method for human voice timbre

Publications (2)

Publication Number Publication Date
CN106997765A true CN106997765A (en) 2017-08-01
CN106997765B CN106997765B (en) 2020-09-01

Family

ID=59433894

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710207110.9A Expired - Fee Related CN106997765B (en) 2017-03-31 2017-03-31 Quantitative characterization method for human voice timbre

Country Status (1)

Country Link
CN (1) CN106997765B (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108364637A (en) * 2018-02-01 2018-08-03 福州大学 A kind of audio sentence boundary detection method
CN108417228A (en) * 2018-02-02 2018-08-17 福州大学 Voice tone color method for measuring similarity under instrument tamber migration
CN108986798A (en) * 2018-06-27 2018-12-11 百度在线网络技术(北京)有限公司 Processing method, device and the equipment of voice data
CN110277106A (en) * 2019-06-21 2019-09-24 北京达佳互联信息技术有限公司 Audio quality determines method, apparatus, equipment and storage medium
CN111488485A (en) * 2020-04-16 2020-08-04 北京雷石天地电子技术有限公司 Music recommendation method based on convolutional neural network, storage medium and electronic device
CN112037766A (en) * 2020-09-09 2020-12-04 广州华多网络科技有限公司 Voice tone conversion method and related equipment

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101650940A (en) * 2008-12-26 2010-02-17 中国科学院声学研究所 Objective evaluation method for singing tone purity based on audio frequency spectrum characteristic analysis
CN103177722A (en) * 2013-03-08 2013-06-26 北京理工大学 Tone-similarity-based song retrieval method
CN104183245A (en) * 2014-09-04 2014-12-03 福建星网视易信息系统有限公司 Method and device for recommending music stars with tones similar to those of singers

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101650940A (en) * 2008-12-26 2010-02-17 中国科学院声学研究所 Objective evaluation method for singing tone purity based on audio frequency spectrum characteristic analysis
CN103177722A (en) * 2013-03-08 2013-06-26 北京理工大学 Tone-similarity-based song retrieval method
CN104183245A (en) * 2014-09-04 2014-12-03 福建星网视易信息系统有限公司 Method and device for recommending music stars with tones similar to those of singers

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
ERIC J. HUMPHREY et al.: "Non-Linear Semantic Embedding for Organizing Large Instrument Sample Libraries", Proceedings of 2011 10th International Conference on Machine Learning and Applications and Workshops *

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108364637A (en) * 2018-02-01 2018-08-03 福州大学 A kind of audio sentence boundary detection method
CN108364637B (en) * 2018-02-01 2021-07-13 福州大学 Audio sentence boundary detection method
CN108417228A (en) * 2018-02-02 2018-08-17 福州大学 Voice tone color method for measuring similarity under instrument tamber migration
CN108417228B (en) * 2018-02-02 2021-03-30 福州大学 Human voice tone similarity measurement method under musical instrument tone migration
CN108986798A (en) * 2018-06-27 2018-12-11 百度在线网络技术(北京)有限公司 Processing method, device and the equipment of voice data
CN110277106A (en) * 2019-06-21 2019-09-24 北京达佳互联信息技术有限公司 Audio quality determines method, apparatus, equipment and storage medium
CN110277106B (en) * 2019-06-21 2021-10-22 北京达佳互联信息技术有限公司 Audio quality determination method, device, equipment and storage medium
CN111488485A (en) * 2020-04-16 2020-08-04 北京雷石天地电子技术有限公司 Music recommendation method based on convolutional neural network, storage medium and electronic device
CN111488485B (en) * 2020-04-16 2023-11-17 北京雷石天地电子技术有限公司 Music recommendation method based on convolutional neural network, storage medium and electronic device
CN112037766A (en) * 2020-09-09 2020-12-04 广州华多网络科技有限公司 Voice tone conversion method and related equipment

Also Published As

Publication number Publication date
CN106997765B (en) 2020-09-01

Similar Documents

Publication Publication Date Title
CN108417228B (en) Human voice tone similarity measurement method under musical instrument tone migration
CN103854646B (en) A kind of method realized DAB and classified automatically
CN106997765A (en) The quantitatively characterizing method of voice tone color
Li et al. An auditory-based feature extraction algorithm for robust speaker identification under mismatched conditions
CN101599271B (en) Recognition method of digital music emotion
US6691090B1 (en) Speech recognition system including dimensionality reduction of baseband frequency signals
Wältermann et al. Quality dimensions of narrowband and wideband speech transmission
CN101366078A (en) Neural network classifier for separating audio sources from a monophonic audio signal
CN102723079B (en) Music and chord automatic identification method based on sparse representation
CN107851444A (en) For acoustic signal to be decomposed into the method and system, target voice and its use of target voice
Dubey et al. Non-intrusive speech quality assessment using several combinations of auditory features
CN103054586B (en) Chinese speech automatic audiometric method based on Chinese speech audiometric dynamic word list
Terasawa et al. In search of a perceptual metric for timbre: Dissimilarity judgments among synthetic sounds with MFCC-derived spectral envelopes
Zwan et al. System for automatic singing voice recognition
Jokinen et al. Estimating the spectral tilt of the glottal source from telephone speech using a deep neural network
Giannoulis et al. On the disjointess of sources in music using different time-frequency representations
CN101650940A (en) Objective evaluation method for singing tone purity based on audio frequency spectrum characteristic analysis
Singh et al. Efficient pitch detection algorithms for pitched musical instrument sounds: A comparative performance evaluation
Ganapathy et al. Temporal resolution analysis in frequency domain linear prediction
Faruqe et al. Template music transcription for different types of musical instruments
Brandner et al. Classification of phonation modes in classical singing using modulation power spectral features
Koolagudi et al. Spectral features for emotion classification
Kumar et al. Speech quality evaluation for different pitch detection algorithms in LPC speech analysis–synthesis system
Li [Retracted] Automatic Piano Harmony Arrangement System Based on Deep Learning
Shen et al. Solfeggio Teaching Method Based on MIDI Technology in the Background of Digital Music Teaching

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20200901