CN106997765A - Quantitative characterization method of vocal timbre - Google Patents
Quantitative characterization method of vocal timbre
- Publication number
- CN106997765A CN106997765A CN201710207110.9A CN201710207110A CN106997765A CN 106997765 A CN106997765 A CN 106997765A CN 201710207110 A CN201710207110 A CN 201710207110A CN 106997765 A CN106997765 A CN 106997765A
- Authority
- CN
- China
- Prior art keywords
- tone color
- audio
- frame
- cqt
- convolutional neural
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L17/00—Speaker identification or verification techniques
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
- G10L25/18—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being spectral information of each sub-band
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/27—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
- G10L25/30—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/45—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of analysis window
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/16—Speech classification or search using artificial neural networks
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L17/00—Speaker identification or verification techniques
- G10L17/18—Artificial neural networks; Connectionist approaches
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/003—Changing voice quality, e.g. pitch or formants
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Signal Processing (AREA)
- Computational Linguistics (AREA)
- Artificial Intelligence (AREA)
- Evolutionary Computation (AREA)
- Spectroscopy & Molecular Physics (AREA)
- Auxiliary Devices For Music (AREA)
- Reverberation, Karaoke And Other Acoustics (AREA)
Abstract
The invention relates to a quantitative characterization method of vocal timbre. The method analyzes clean (a cappella) recordings of several songs sung by professional singers, computes a 192-dimensional CQT feature vector for each frame, and assembles the CQT features of 60 audio frames into a 60*192 input matrix used to train a deep convolutional neural network, yielding the trained network and a corresponding vocal-timbre embedding space. A clean recording by an amateur singer is then analyzed in the same way and fed into the trained network, producing a timbre vector in the same embedding space. The method thus represents a subjective timbre characteristic in a quantitative, objective way.
Description
Technical field
The present invention relates to acoustic signal processing for singing, and in particular to a quantitative characterization method of vocal timbre.
Background art
The American National Standards Institute defines timbre as follows: "Timbre is that attribute of auditory sensation by which a listener can judge that two sounds, similarly presented and having the same pitch and loudness, are dissimilar." Accordingly, vocal timbre in singing refers to the sound characteristics by which people identify which singer is performing when different singers sing the same song.
In acoustics experiments, sound is commonly analyzed with spectrograms. A spectrogram shows how amplitude varies with frequency and time: the ordinate represents frequency, the abscissa represents time, and amplitude is represented by shades of grey or by different colors. From the spectrogram's point of view, the factors determining timbre are the presence or absence of overtones and their relative strengths.
Many scholars have long studied audio signal processing in the hope of establishing quantitative, objective indices of timbre alongside pitch and loudness. To this day, however, timbre evaluation remains largely qualitative and subjective; timbre has not been effectively modeled or quantized, and no quantitative measurement system has been constructed. Continued research is therefore needed on the characterization, measurement indices, and similarity metrics of timbre.
Current research on timbre mainly addresses musical-instrument classification and identification and singer identification, realized chiefly through various physical timbre features and classification models. The physical features commonly used for timbre classification fall into three categories: time-domain features, frequency-domain features, and cepstral-domain features.
Time-domain features: time-domain features reflect the dynamic changes of a sound, and the temporal envelopes of different audio signals all differ. For comprehensive analysis, a musical tone can be divided into three stages: attack, steady state, and decay. The attack is the initial portion where the tone rises from silence, the steady state is the main body of the tone, and the decay is the final portion where the tone fades to silence. The attack and decay portions may last only a few tens of milliseconds, but the attack stage plays a very important role in distinguishing timbres.
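The attack/steady-state/decay division described above can be located programmatically from a short-time amplitude envelope. Below is a minimal numpy sketch; the 256-sample analysis frame and the 0.5 peak-ratio threshold are assumptions for illustration, not values taken from the patent.

```python
import numpy as np

def envelope_segments(env, fs, frame=256, thresh=0.5):
    """Split an amplitude envelope into attack / steady-state / decay by
    when the framed envelope first and last reaches `thresh` of its peak.
    frame and thresh are illustrative assumptions."""
    n = len(env) // frame
    e = env[:n * frame].reshape(n, frame).mean(axis=1)   # framed envelope
    above = np.where(e >= thresh * e.max())[0]
    attack_end = above[0] * frame / fs                   # seconds
    decay_start = (above[-1] + 1) * frame / fs
    return attack_end, decay_start

# synthetic envelope: 50 ms linear attack, 200 ms sustain, 100 ms decay
fs = 16000
env = np.concatenate([np.linspace(0.0, 1.0, int(0.05 * fs)),
                      np.ones(int(0.20 * fs)),
                      np.linspace(1.0, 0.0, int(0.10 * fs))])
a_end, d_start = envelope_segments(env, fs)
print(round(a_end, 3), round(d_start, 3))
```

On this synthetic tone the attack boundary is found near 0.03 s and the decay onset near 0.3 s, matching the envelope's construction to within one analysis frame.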
Frequency-domain features: frequency-domain analysis at different scales yields different spectra. Common spectra include the STFT spectrum and the CQT spectrum.
1) The filter-bank center frequencies of the STFT spectrum rise linearly, and each filter has a constant bandwidth. The calculation formula is:
X(k) = Σ_{n=0}^{N-1} x(n) w(n) e^{-j2πnk/N}
where x(n) is the speech signal of a given frame and w(n) is the window function.
2) The CQT uses a bank of filters whose center frequencies are distributed exponentially, expressing a note signal as spectral energy at defined musical semitones; the quality factor Q of the filter bank remains constant. The CQT provides higher frequency resolution but lower temporal resolution at low frequencies, and higher temporal resolution but lower frequency resolution at high frequencies. Since the low-frequency portion of a note signal usually carries the more valuable information, the CQT matches this property well.
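The constant-Q property can be made concrete in a few lines of numpy: with geometrically spaced center frequencies, the ratio of center frequency to bandwidth (the quality factor Q) is the same for every bin. The starting pitch (27.5 Hz) and 24 bins per octave are assumptions chosen so that 192 bins span 8 octaves below the 8 kHz Nyquist limit of 16 kHz audio; the patent itself fixes only the 192-dimensional output.

```python
import numpy as np

# CQT filter-bank geometry (fmin and bins_per_octave are assumed values)
fmin = 27.5              # A0; 27.5 * 2**8 = 7040 Hz stays below Nyquist
bins_per_octave = 24
n_bins = 192             # the patent's per-frame feature dimension
freqs = fmin * 2.0 ** (np.arange(n_bins) / bins_per_octave)
# constant quality factor shared by all bins
Q = 1.0 / (2.0 ** (1.0 / bins_per_octave) - 1.0)

print(round(freqs[0], 1), round(freqs[24] / freqs[0], 1), round(Q, 1))
```

Every 24 bins the center frequency doubles (one octave), while Q stays fixed near 34, which is what makes low-frequency bins narrow in hertz and high-frequency bins wide.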
Cepstral-domain features: Mel-frequency cepstral coefficients (MFCC) were proposed on the basis of the human auditory perception model and have proven to be among the most important features in fields such as musical-tone and speech classification and recognition. Human subjective perception of the frequency domain is nonlinear: f_mel = 1125 ln(1 + f/700), where f_mel is the perceived frequency in mels and f is the actual frequency in hertz. Transforming the signal spectrum into the perceptual domain simulates auditory processing well. To compute MFCC, the signal is first preprocessed by framing, windowing, and pre-emphasis; each frame is then transformed to the frequency domain by FFT, the spectral line energy of each frame is computed and passed through the mel filter bank, and the energy in each filter is obtained. Taking the logarithm of the filter-bank energies and applying the DCT yields the MFCC.
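The mel warping quoted above is easy to sketch; the functions below implement the stated formula f_mel = 1125 ln(1 + f/700) and its inverse, which is what the mel filter bank's band edges are derived from.

```python
import math

def hz_to_mel(f):
    # the warping quoted in the text: f_mel = 1125 * ln(1 + f / 700)
    return 1125.0 * math.log(1.0 + f / 700.0)

def mel_to_hz(m):
    # inverse warping, used to place filter-bank edges back in hertz
    return 700.0 * (math.exp(m / 1125.0) - 1.0)

print(round(hz_to_mel(700.0)))            # 1125 * ln 2
print(round(mel_to_hz(hz_to_mel(8000.0))))  # round trip recovers 8000 Hz
```

Equal steps on the mel axis cover ever-wider hertz bands at high frequencies, which is the nonlinearity the filter bank exploits.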
Although existing research on vocal timbre can solve some singer-identification problems fairly effectively, it does not describe vocal timbre quantitatively. Based on the above analysis, this patent therefore trains a deep convolutional neural network, builds a corresponding vocal-timbre embedding space, and characterizes timbre quantitatively within that space.
Summary of the invention
In view of this, the object of the present invention is to provide a quantitative characterization method of vocal timbre that analyzes the timbre of an amateur singer while singing.
The present invention is realized by the following scheme: a quantitative characterization method of vocal timbre, comprising the following steps:
Step S1: obtain clean (a cappella) recordings of professional singers singing songs;
Step S2: build a three-dimensional vocal-timbre embedding space R;
Step S3: analyze the clean recording of an amateur singer, and quantitatively characterize the amateur singer's timbre in the vocal-timbre embedding space R.
Further, step S2 specifically comprises the following steps:
Step S21: analyze 75 clean recordings from 15 professional singers (5 recordings per singer) using the constant-Q transform, and compute the CQT coefficients of each frame in the audio; the CQT coefficients are 192-dimensional;
Step S22: choose 60 audio frames and assemble their CQT coefficients into the input matrix of the neural network; the matrix size is 60*192;
Step S23: build a deep convolutional neural network, input the CQT coefficient matrices of 60 audio frames into it, and train the network;
Step S231: build the deep convolutional neural network with the following structure:
The first layer is a convolutional layer with 20 convolution kernels of size (1, 5, 13) and (2, 2) max-pooling; the input is a 60*192 matrix and the activation function is the hyperbolic tangent;
the second layer is a convolutional layer with 40 convolution kernels of size (20, 5, 11) and (2, 2) max-pooling; the activation function is the hyperbolic tangent;
the third layer is a convolutional layer with 80 convolution kernels of size (1, 1, 9) and (2, 2) max-pooling; the activation function is the hyperbolic tangent;
the fourth layer is a fully connected layer with 256 output nodes; the activation function is the hyperbolic tangent;
the fifth layer is a fully connected layer with 3 output nodes and a linear activation function; its output is a 3-dimensional vector;
Step S232: input the matrices obtained in step S22 into the deep convolutional neural network and train it iteratively with the pairwise training method, obtaining the trained network and the corresponding timbre embedding space R.
Further, step S21 is specifically: frame the professional singers' clean recordings with a frame length of 4096 samples and a frame shift of half a frame, using a Hamming window; then remove mute frames by judging whether the short-time energy of each audio frame falls below a threshold; the k-th component of the CQT spectrum of the n-th frame signal is computed from the resulting time-domain frame by the formula:
X_CQT(n, k) = (1/N_k) Σ_{m=0}^{N_k-1} x(m) w_{N_k}(m) e^{-j2πQm/N_k}
where N_k is the window length and w_{N_k} is the window function.
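A single CQT coefficient of the kind this formula describes can be sketched in numpy as an inner product of the frame with a windowed complex exponential whose length N_k shrinks as the bin frequency rises. The minimum frequency (27.5 Hz) and 24 bins per octave are assumptions; the patent fixes only the 192-bin output, the 4096-sample frame, and the Hamming window.

```python
import numpy as np

def cqt_bin(frame, k, fs, fmin=27.5, bpo=24):
    """One constant-Q coefficient of one time-domain frame, following the
    standard constant-Q definition. fmin and bpo are assumed values."""
    Q = 1.0 / (2.0 ** (1.0 / bpo) - 1.0)        # constant quality factor
    fk = fmin * 2.0 ** (k / bpo)                # bin center frequency
    Nk = int(round(Q * fs / fk))                # window shrinks as fk rises
    n = np.arange(Nk)
    w = np.hamming(Nk)                          # Hamming window, as in the patent
    kernel = w * np.exp(-2j * np.pi * Q * n / Nk)
    return np.abs(np.dot(frame[:Nk], kernel)) / Nk

fs = 16000
t = np.arange(4096) / fs
frame = np.sin(2 * np.pi * 220.0 * t)           # a 220 Hz tone (A3)
k_a3 = 24 * 3                                   # 3 octaves above fmin = 27.5
resp_on = cqt_bin(frame, k_a3, fs)              # matched bin
resp_off = cqt_bin(frame, k_a3 + 12, fs)        # bin half an octave away
print(resp_on > 5 * resp_off)
```

The matched bin responds strongly to the tone while a bin half an octave away barely responds, which is the selectivity the 192-bin analysis relies on.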
Further, step S232 is specifically: input the matrices obtained in step S22 into the deep convolutional neural network and train it iteratively with the pairwise training method, obtaining the trained network and the corresponding timbre embedding space R. The two paired training data X_1 and X_2 are fed into two independent neural networks G_A and G_B; the two networks have identical structure and identical parameters. The outputs of G_A and G_B are Z_1 and Z_2 respectively, and the cost function is defined as the Euclidean distance D between Z_1 and Z_2:
D = ||Z_1 - Z_2||_2
Depending on whether the voice labels of the two inputs X_1 and X_2 are identical, different loss functions are set:
L_sim = D^2            (labels identical)
L_dis = max(0, M_d - D)^2    (labels different)
where M_d is a constant related to the extent of the vocal-timbre embedding space.
Further, with M_d = 10: the loss function when the voice labels of the two inputs X_1 and X_2 are identical is L_sim, and the loss function when they differ is L_dis. Using the similarity of the labels of X_1 and X_2, the two loss functions are merged into:
Loss = V * L_sim + (1 - V) * L_dis
where V = 1 if the voice labels of X_1 and X_2 are identical, and V = 0 otherwise.
The learning rate of the network is set to 0.02 and the maximum number of iterations to 30000.
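The pairwise loss described above matches the standard contrastive loss. A minimal numpy sketch with the stated margin M_d = 10 follows; the quadratic forms of L_sim and L_dis are an assumption, since the patent text here omits the exact formulas.

```python
import numpy as np

def contrastive_loss(z1, z2, same_singer, margin=10.0):
    """Pull same-singer embeddings together, push different singers at
    least `margin` apart; margin plays the role of the patent's M_d."""
    d = np.linalg.norm(z1 - z2)            # Euclidean distance D
    l_sim = d ** 2                         # applied when labels match
    l_dis = max(0.0, margin - d) ** 2      # applied when labels differ
    v = 1.0 if same_singer else 0.0        # V selects which term is active
    return v * l_sim + (1.0 - v) * l_dis

a = np.array([1.0, 0.0, 0.0])
b = np.array([4.0, 4.0, 0.0])              # distance 5 from a
print(contrastive_loss(a, b, True))        # same singer far apart: 25.0
print(contrastive_loss(a, b, False))       # different, (10 - 5)^2 = 25.0
print(contrastive_loss(a, a, False))       # different but identical: 100.0
```

Note the gradient vanishes for different-singer pairs already more than the margin apart, so only "hard" negatives keep moving the embedding.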
Further, step S3 specifically comprises the following steps:
Step S31: analyze the amateur singer's clean recording using the constant-Q transform and compute the CQT coefficients of each frame; the CQT coefficients are 192-dimensional;
Step S32: choose the CQT coefficients of 60 audio frames and assemble them into the input matrix of the neural network, of size 60*192;
Step S33: input the matrix built in step S32 into the deep convolutional neural network trained in step S23; the network's output is the 3-dimensional timbre feature vector in the corresponding timbre embedding space R.
Further, step S31 is specifically: frame the amateur singer's clean recording with a frame length of 4096 samples and a frame shift of half a frame, using a Hamming window; then remove mute frames by judging whether the short-time energy of each audio frame falls below a threshold; the k-th component of the CQT spectrum of the n-th frame signal is computed from the resulting time-domain frame by the formula:
X_CQT(n, k) = (1/N_k) Σ_{m=0}^{N_k-1} x(m) w_{N_k}(m) e^{-j2πQm/N_k}
where N_k is the window length and w_{N_k} is the window function.
Compared with the prior art, the invention has the following advantages. The method preprocesses the professional singers' clean song recordings by framing, windowing, and removing mute frames, then performs CQT analysis on the resulting time-domain frames to obtain the per-frame CQT coefficients. The CQT coefficients of 60 selected audio frames form the input matrix of a deep convolutional neural network, which is trained in pairwise fashion to obtain the trained network and the corresponding vocal-timbre embedding space; within that embedding space, quantitative characterization of vocal timbre is achieved. During training, every two input CQT coefficient matrices form one training sample (when the two matrices belong to the same professional singer, the sample's label is 1; when they belong to different professional singers, the label is 0). Because the training labels are nominal, the deep convolutional neural network adopts a "Siamese structure" and is trained on the relation between the two CQT coefficient matrices in each sample. This pairwise training method effectively avoids the need for prior knowledge to determine the outputs.
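Constructing the nominally labeled pairs the Siamese training relies on can be sketched as follows; the alternating same/different sampling scheme is an illustrative assumption, not the patent's exact procedure.

```python
import random

def make_pairs(matrices_by_singer, n_pairs, seed=0):
    """Build (X1, X2, label) training pairs: label 1 when both CQT
    matrices come from the same singer, 0 otherwise."""
    rng = random.Random(seed)
    singers = list(matrices_by_singer)
    pairs = []
    for i in range(n_pairs):
        if i % 2 == 0:                           # same-singer pair
            s = rng.choice(singers)
            x1, x2 = rng.sample(matrices_by_singer[s], 2)
            pairs.append((x1, x2, 1))
        else:                                    # different-singer pair
            s1, s2 = rng.sample(singers, 2)
            pairs.append((rng.choice(matrices_by_singer[s1]),
                          rng.choice(matrices_by_singer[s2]), 0))
    return pairs

# toy data: 3 "singers", each with 2 placeholder matrices
data = {s: [f"{s}_m{j}" for j in range(2)] for s in ["s0", "s1", "s2"]}
pairs = make_pairs(data, 6)
print([lbl for _, _, lbl in pairs])              # alternating 1 / 0 labels
print(all(x1.split("_")[0] == x2.split("_")[0]
          for x1, x2, lbl in pairs if lbl == 1))
```

Balancing same and different pairs this way keeps the contrastive loss from being dominated by the far more numerous different-singer combinations.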
Brief description of the drawings
Fig. 1 is a schematic flow diagram of the method of the present invention.
Fig. 2 is a structure diagram of the deep convolutional neural network used in an embodiment of the invention.
Embodiment
The present invention is further described below with reference to the accompanying drawings and an embodiment.
This embodiment provides a quantitative characterization method of vocal timbre comprising the following steps, as shown in Fig. 1:
Step S1: obtain clean (a cappella) recordings of professional singers singing songs;
Step S2: build a three-dimensional vocal-timbre embedding space R;
Step S3: analyze the clean recording of an amateur singer, and quantitatively characterize the amateur singer's timbre in the vocal-timbre embedding space R.
In this embodiment, step S2 specifically comprises the following steps:
Step S21: analyze the 75 clean recordings from 15 professional singers (5 recordings per singer) using the constant-Q transform (CQT), and compute the 192-dimensional CQT coefficients of each frame;
Step S22: choose 60 audio frames and assemble their 192-dimensional CQT coefficients into the input matrix of the neural network; the matrix size is 60*192;
Step S23: build a deep convolutional neural network, input the CQT coefficient matrices of 60 audio frames into it, and train the network;
Step S231: build the deep convolutional neural network with the following structure:
The first layer (convolutional) uses 20 convolution kernels of size (1, 5, 13) with (2, 2) max-pooling; the input is a 60*192 matrix and the activation function is the hyperbolic tangent;
the second layer (convolutional) uses 40 convolution kernels of size (20, 5, 11) with (2, 2) max-pooling; the activation function is the hyperbolic tangent;
the third layer (convolutional) uses 80 convolution kernels of size (1, 1, 9) with (2, 2) max-pooling; the activation function is the hyperbolic tangent;
the fourth layer (fully connected) has 256 output nodes; the activation function is the hyperbolic tangent;
the fifth layer (fully connected) has 3 output nodes with a linear activation function; its output is a 3-dimensional vector;
Step S232: input the matrices obtained in step S22 into the deep convolutional neural network and train it iteratively with the pairwise training method, obtaining the trained network and the corresponding timbre embedding space R.
In this embodiment, step S3 specifically comprises the following steps:
Step S31: analyze the amateur singer's clean recording using the constant-Q transform (CQT) and compute the 192-dimensional CQT coefficients of each frame;
Step S32: choose the CQT coefficients of 60 audio frames and assemble them into the input matrix of the neural network (of size 60*192);
Step S33: input the matrix built in step S32 into the deep convolutional neural network trained in step S23; the network's output is the 3-dimensional timbre feature vector in the corresponding timbre embedding space R.
In this embodiment, taking a library of clean recordings from 15 professional singers as an example, the method described above specifically comprises the following steps:
Step 1: obtain clean recordings of professional singers singing songs. The song library contains 15 professional singers, each with 5 clean recordings; the audio format is wav, the sampling precision is 16 bit, and the sample rate is 16 kHz.
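The corpus format (mono WAV, 16 kHz, 16-bit) can be produced and checked with Python's standard wave module. The 440 Hz test tone and in-memory buffer below are purely illustrative; the embodiment's actual recordings are files on disk.

```python
import io
import math
import struct
import wave

fs, f0 = 16000, 440.0
buf = io.BytesIO()
with wave.open(buf, "wb") as wf:
    wf.setnchannels(1)        # mono
    wf.setsampwidth(2)        # 16 bit = 2 bytes per sample
    wf.setframerate(fs)       # 16 kHz sample rate
    pcm = (int(32767 * 0.5 * math.sin(2 * math.pi * f0 * n / fs))
           for n in range(fs))                      # one second of tone
    wf.writeframes(b"".join(struct.pack("<h", s) for s in pcm))

buf.seek(0)
with wave.open(buf, "rb") as wf:
    print(wf.getframerate(), 8 * wf.getsampwidth(), wf.getnframes())
```

Reading the header back confirms the 16000 Hz / 16-bit / 16000-frame layout, i.e. one second of audio in the corpus format.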
Step 2: build the three-dimensional vocal-timbre embedding space R, as follows:
Step 21: frame the professional singers' clean recordings with a frame length of 4096 samples and a frame shift of half a frame, using a Hamming window; then remove mute frames by judging whether the short-time energy of each audio frame falls below a threshold; the k-th component of the CQT spectrum of the n-th frame signal is computed from the resulting time-domain frame by the formula:
X_CQT(n, k) = (1/N_k) Σ_{m=0}^{N_k-1} x(m) w_{N_k}(m) e^{-j2πQm/N_k}
where N_k is the window length and w_{N_k} is the window function.
Step 22: the CQT coefficients of each frame are 192-dimensional; choose the CQT coefficients of 60 frames and assemble them into the input matrix of the neural network; the matrix size is 60*192.
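Assembling the 60*192 input matrix after silence removal can be sketched as follows. The energy threshold and the choice of the first 60 voiced frames are assumptions, since the patent does not specify how the 60 frames are selected.

```python
import numpy as np

def input_matrix(cqt_frames, energies, thresh=0.01, n=60):
    """Stack the 192-dim CQT vectors of the first n non-silent frames
    into the n x 192 network input. thresh is an assumed value."""
    voiced = [f for f, e in zip(cqt_frames, energies) if e >= thresh]
    if len(voiced) < n:
        raise ValueError("not enough voiced frames")
    return np.stack(voiced[:n])

# stand-in data: 100 random 192-dim "CQT" frames, every 5th one silent
rng = np.random.default_rng(0)
frames = [rng.standard_normal(192) for _ in range(100)]
energies = [0.0 if i % 5 == 0 else 1.0 for i in range(100)]
X = input_matrix(frames, energies)
print(X.shape)
```

The resulting (60, 192) array is exactly the shape the first convolutional layer expects.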
Step 23: build a deep convolutional neural network, input the CQT coefficient matrices of 60 audio frames into it, and train the network, as follows:
Step 231: build the deep convolutional neural network shown in Fig. 2 with the following structure:
The first layer (convolutional) uses 20 convolution kernels of size (1, 5, 13) with (2, 2) max-pooling; the input is a 60*192 matrix and the activation function is the hyperbolic tangent;
the second layer (convolutional) uses 40 convolution kernels of size (20, 5, 11) with (2, 2) max-pooling; the activation function is the hyperbolic tangent;
the third layer (convolutional) uses 80 convolution kernels of size (1, 1, 9) with (2, 2) max-pooling; the activation function is the hyperbolic tangent;
the fourth layer (fully connected) has 256 output nodes; the activation function is the hyperbolic tangent;
the fifth layer (fully connected) has 3 output nodes with a linear activation function; its output is a 3-dimensional vector;
Step 232: input the matrices obtained in step 22 into the deep convolutional neural network and train it iteratively with the pairwise training method, obtaining the trained network and the corresponding timbre embedding space R. The two paired training data X_1 and X_2 are fed into two independent neural networks G_A and G_B; the two networks have identical structure and identical parameters. The outputs of G_A and G_B are Z_1 and Z_2 respectively, and the cost function is defined as the Euclidean distance D between Z_1 and Z_2:
D = ||Z_1 - Z_2||_2
Depending on whether the voice labels of the two inputs X_1 and X_2 are identical, different loss functions are set:
L_sim = D^2            (labels identical)
L_dis = max(0, M_d - D)^2    (labels different)
where M_d is a constant related to the extent of the vocal-timbre embedding space; in this embodiment, M_d = 10. The loss function when the voice labels of X_1 and X_2 are identical is L_sim, and when they differ it is L_dis. Using the similarity of the labels, the two loss functions are merged:
Loss = V * L_sim + (1 - V) * L_dis
where V = 1 if the voice labels of X_1 and X_2 are identical, and V = 0 otherwise.
The learning rate of network is set to 0.02, and maximum iteration is set to 30000.
Step 3: analyze the amateur singer's clean recording and quantitatively characterize the amateur singer's timbre in the vocal-timbre embedding space R, as follows:
Step 31: frame the amateur singer's clean recording with a frame length of 4096 samples and a frame shift of half a frame, using a Hamming window; then remove mute frames by judging whether the short-time energy of each audio frame falls below a threshold; the k-th component of the CQT spectrum of the n-th frame signal is computed from the resulting time-domain frame by the formula:
X_CQT(n, k) = (1/N_k) Σ_{m=0}^{N_k-1} x(m) w_{N_k}(m) e^{-j2πQm/N_k}
where N_k is the window length and w_{N_k} is the window function.
Step 32: the CQT coefficients of each frame are 192-dimensional; choose the CQT coefficients of 60 frames and assemble them into the input matrix of the neural network; the matrix size is 60*192.
Step 33: input the 60*192 matrix built in step 32 into the deep convolutional neural network trained in step 23; the network's output is the 3-dimensional timbre feature vector in the corresponding vocal-timbre embedding space.
The foregoing is merely a preferred embodiment of the present invention; all equivalent changes and modifications made within the scope of the patent claims of the present invention shall fall within the scope of the present invention.
Claims (7)
1. A quantitative characterization method of vocal timbre, characterized by comprising the following steps:
Step S1: obtaining clean (a cappella) recordings of professional singers singing songs;
Step S2: building a three-dimensional vocal-timbre embedding space R;
Step S3: analyzing the clean recording of an amateur singer, and quantitatively characterizing the amateur singer's timbre in the vocal-timbre embedding space R.
2. The quantitative characterization method of vocal timbre according to claim 1, characterized in that step S2 specifically comprises the following steps:
Step S21: analyzing 75 clean recordings from 15 professional singers (5 recordings per singer) using the constant-Q transform, and computing the 192-dimensional CQT coefficients of each frame in the audio;
Step S22: choosing 60 audio frames and assembling their CQT coefficients into the input matrix of the neural network, the matrix size being 60*192;
Step S23: building a deep convolutional neural network, inputting the CQT coefficient matrices of 60 audio frames into it, and training the network;
Step S231: building the deep convolutional neural network with the following structure:
the first layer is a convolutional layer with 20 convolution kernels of size (1, 5, 13) and (2, 2) max-pooling; the input is a 60*192 matrix and the activation function is the hyperbolic tangent;
the second layer is a convolutional layer with 40 convolution kernels of size (20, 5, 11) and (2, 2) max-pooling; the activation function is the hyperbolic tangent;
the third layer is a convolutional layer with 80 convolution kernels of size (1, 1, 9) and (2, 2) max-pooling; the activation function is the hyperbolic tangent;
the fourth layer is a fully connected layer with 256 output nodes; the activation function is the hyperbolic tangent;
the fifth layer is a fully connected layer with 3 output nodes and a linear activation function; its output is a 3-dimensional vector;
Step S232: inputting the matrices obtained in step S22 into the deep convolutional neural network and training it iteratively with the pairwise training method, obtaining the trained network and the corresponding timbre embedding space R.
3. The quantitative characterization method of human voice timbre according to claim 2, characterized in that step S21 specifically is: frame the professional singer's a cappella audio with a frame length of 4096 sampled points and a fixed frame shift, using a Hamming window as the window function; then remove mute frames by checking whether the short-time energy of each audio frame falls below a threshold; finally, compute the k-th component of the CQT spectrum of the n-th frame signal from the resulting time-domain audio frames with the following formula:

X_CQT(n, k) = (1/N_k) * Σ_{m=0}^{N_k−1} w(N_k, m) · x_n(m) · e^{−j2πQm/N_k}

where N_k is the window length, w(N_k, m) is the window function, x_n(m) is the m-th sample of the n-th frame, and Q is the constant quality factor.
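The per-bin sum above can be sketched in pure Python; the helper names and the value of the quality factor Q are illustrative assumptions (Q depends on the number of bins per octave, which the claim does not fix):

```python
import cmath
import math

def hamming(m, n_k):
    """Hamming window sample, matching the window function chosen in the claim."""
    return 0.54 - 0.46 * math.cos(2 * math.pi * m / (n_k - 1))

def cqt_bin(frame, q):
    """One CQT spectrum component of a time-domain frame.

    The bin index k enters through the window length N_k = len(frame);
    q is the constant quality factor (value assumed, not fixed by the claim).
    """
    n_k = len(frame)
    total = 0j
    for m, x in enumerate(frame):
        total += hamming(m, n_k) * x * cmath.exp(-2j * math.pi * q * m / n_k)
    return total / n_k
```

A frame containing a sinusoid at the bin's centre frequency (q cycles per window) yields a large magnitude, while off-frequency content is strongly attenuated.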
4. The quantitative characterization method of human voice timbre according to claim 2, characterized in that step S232 specifically is: input the input matrix obtained in step S22 into the deep convolutional neural network and iteratively train the network with the pairwise training method, obtaining the trained deep convolutional neural network and the corresponding timbre embedding space R. The two paired training samples X1 and X2 are fed into two independent neural networks GA and GB with identical structure and identical parameters; the outputs of GA and GB are Z1 and Z2 respectively, and the cost function is defined as the Euclidean distance D between Z1 and Z2:

D = ||Z1 − Z2||2
Depending on whether the voice labels of the two inputs X1 and X2 are identical, different loss functions are set (in the standard contrastive form, consistent with the merged loss of claim 5):

Lsim = D^2 (labels identical)
Ldis = max(Md − D, 0)^2 (labels different)

where Md is a constant related to the range of the voice timbre embedding space.
5. The quantitative characterization method of human voice timbre according to claim 4, characterized in that: with Md = 10, the loss function when the two inputs X1 and X2 carry the same voice label is Lsim, and the loss function when their voice labels differ is Ldis; based on the similarity of the voice labels of X1 and X2, the two loss functions are merged into the loss shown below:

Loss = V*Lsim + (1 − V)*Ldis

where the value of V is determined as follows: V = 1 when X1 and X2 have the same voice label, and V = 0 otherwise.

The learning rate of the network is set to 0.02 and the maximum number of iterations to 30000.
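The merged pairwise loss can be sketched as follows, assuming the standard contrastive forms Lsim = D^2 and Ldis = max(Md − D, 0)^2 (the function names are hypothetical):

```python
import math

def euclidean(z1, z2):
    """D = ||Z1 - Z2||_2, the distance between the two embeddings."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(z1, z2)))

def pairwise_loss(z1, z2, same_label, m_d=10.0):
    """Merged loss V*Lsim + (1 - V)*Ldis with V = 1 for same-label pairs.

    Standard contrastive form assumed: Lsim = D^2 pulls same-singer
    embeddings together, Ldis = max(Md - D, 0)^2 pushes different-singer
    embeddings apart until they are Md apart (Md = 10 as in claim 5).
    """
    d = euclidean(z1, z2)
    v = 1.0 if same_label else 0.0
    l_sim = d ** 2
    l_dis = max(m_d - d, 0.0) ** 2
    return v * l_sim + (1.0 - v) * l_dis
```

Note that a different-label pair already separated by more than Md contributes zero loss, so the gradient only acts on pairs that violate the margin.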
6. The quantitative characterization method of human voice timbre according to claim 1, characterized in that step S3 specifically comprises the following steps:
Step S31: analyze the amateur singer's a cappella audio with the constant-Q transform method, computing the CQT coefficients of each frame in the audio; the CQT coefficients are 192-dimensional;
Step S32: select the CQT coefficients of 60 audio frames and assemble them into the input matrix of the neural network; the matrix size is 60*192;
Step S33: input the matrix built in step S32 into the deep convolutional neural network trained in step S23; the output of the neural network is the 3-dimensional timbre feature vector in the corresponding timbre embedding space R.
7. The quantitative characterization method of human voice timbre according to claim 6, characterized in that step S31 specifically is: frame the amateur singer's a cappella audio file with a frame length of 4096 sampled points and a fixed frame shift, using a Hamming window as the window function; then remove mute frames by checking whether the short-time energy of each audio frame falls below a threshold; finally, compute the k-th component of the CQT spectrum of the n-th frame signal from the resulting time-domain audio frames with the following formula:

X_CQT(n, k) = (1/N_k) * Σ_{m=0}^{N_k−1} w(N_k, m) · x_n(m) · e^{−j2πQm/N_k}

where N_k is the window length, w(N_k, m) is the window function, x_n(m) is the m-th sample of the n-th frame, and Q is the constant quality factor.
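The framing and mute-frame removal described above can be sketched as follows; the half-frame hop and the energy threshold are illustrative assumptions, since the claim fixes only the 4096-sample frame length:

```python
def frame_signal(x, frame_len=4096, hop=2048):
    """Split a sample sequence into frames of frame_len samples.

    The half-frame hop is an assumption; the claim fixes only frame_len.
    """
    return [x[i:i + frame_len] for i in range(0, len(x) - frame_len + 1, hop)]

def short_time_energy(frame):
    """Sum of squared samples within one frame."""
    return sum(s * s for s in frame)

def drop_silent(frames, threshold):
    """Remove mute frames whose short-time energy is below the threshold."""
    return [f for f in frames if short_time_energy(f) >= threshold]
```

The surviving frames would then be passed, frame by frame, through the per-bin CQT computation to build the 60*192 input matrix of step S32.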
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710207110.9A CN106997765B (en) | 2017-03-31 | 2017-03-31 | Quantitative characterization method for human voice timbre |
Publications (2)
Publication Number | Publication Date |
---|---|
CN106997765A true CN106997765A (en) | 2017-08-01 |
CN106997765B CN106997765B (en) | 2020-09-01 |
Family
ID=59433894
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101650940A (en) * | 2008-12-26 | 2010-02-17 | 中国科学院声学研究所 | Objective evaluation method for singing tone purity based on audio frequency spectrum characteristic analysis |
CN103177722A (en) * | 2013-03-08 | 2013-06-26 | 北京理工大学 | Tone-similarity-based song retrieval method |
CN104183245A (en) * | 2014-09-04 | 2014-12-03 | 福建星网视易信息系统有限公司 | Method and device for recommending music stars with tones similar to those of singers |
Non-Patent Citations (1)
Title |
---|
Eric J. Humphrey et al.: "Non-Linear Semantic Embedding for Organizing Large Instrument Sample Libraries", Proceedings of the 2011 10th International Conference on Machine Learning and Applications and Workshops * |
Cited By (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108364637A (en) * | 2018-02-01 | 2018-08-03 | 福州大学 | A kind of audio sentence boundary detection method |
CN108364637B (en) * | 2018-02-01 | 2021-07-13 | 福州大学 | Audio sentence boundary detection method |
CN108417228A (en) * | 2018-02-02 | 2018-08-17 | 福州大学 | Voice tone color method for measuring similarity under instrument tamber migration |
CN108417228B (en) * | 2018-02-02 | 2021-03-30 | 福州大学 | Human voice tone similarity measurement method under musical instrument tone migration |
CN108986798A (en) * | 2018-06-27 | 2018-12-11 | 百度在线网络技术(北京)有限公司 | Processing method, device and the equipment of voice data |
CN110277106A (en) * | 2019-06-21 | 2019-09-24 | 北京达佳互联信息技术有限公司 | Audio quality determines method, apparatus, equipment and storage medium |
CN110277106B (en) * | 2019-06-21 | 2021-10-22 | 北京达佳互联信息技术有限公司 | Audio quality determination method, device, equipment and storage medium |
CN111488485A (en) * | 2020-04-16 | 2020-08-04 | 北京雷石天地电子技术有限公司 | Music recommendation method based on convolutional neural network, storage medium and electronic device |
CN111488485B (en) * | 2020-04-16 | 2023-11-17 | 北京雷石天地电子技术有限公司 | Music recommendation method based on convolutional neural network, storage medium and electronic device |
CN112037766A (en) * | 2020-09-09 | 2020-12-04 | 广州华多网络科技有限公司 | Voice tone conversion method and related equipment |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
CF01 | Termination of patent right due to non-payment of annual fee | ||
Granted publication date: 20200901 |