CN109272986A - A kind of dog sound sensibility classification method based on artificial neural network - Google Patents
- Publication number
- CN109272986A (application number CN201810995254.XA)
- Authority
- CN
- China
- Prior art keywords
- sound
- dog
- frame
- neural network
- zero
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS / G10—MUSICAL INSTRUMENTS; ACOUSTICS / G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING / G10L15/00—Speech recognition
- G10L15/16—Speech classification or search using artificial neural networks
- G10L15/02—Feature extraction for speech recognition; Selection of recognition unit
- G10L15/063—Training (under G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice)
- G10L15/08—Speech classification or search
- G10L15/25—Speech recognition using non-acoustical features: position of the lips, movement of the lips or face analysis
Abstract
The present invention relates to a dog sound emotion classification method based on an artificial neural network, and belongs to the field of audio signal processing. The invention studies the relationships between the short-time energy, zero-crossing rate and pitch period of the sound signal and the dog's emotion. These three features are effective in distinguishing four emotions: happiness, anger, pain and fear. They are extracted as parameters for emotion classification, a sound emotion model is obtained by training a BP neural network, and the model is then used to classify the dog's vocal emotions. Finally, classification errors are corrected by comparing the results against the dog's facial-expression features, reducing the misrecognition rate. The algorithm of the invention is simple, theoretically clear, and easy to implement.
Description
Technical field
The present invention relates to a dog sound emotion classification method based on an artificial neural network, and belongs to the technical field of audio signal processing.
Background technique
Considerable progress has been made in research on human speech emotion, but research on the vocal emotion of animals is still at a blank stage. The vocalization frequencies of most animals differ greatly from those of humans, but the vocalization frequency of dogs is very close to the human range, so the vocal emotions of dogs and of humans are similar in character, and dogs likewise experience emotions such as happiness, pain, anger and fear. This does not mean, however, that methods from human speech-emotion research can be applied directly to dog sound emotion classification: sound has many characteristic parameters, but not all of them reflect emotion.
Summary of the invention
The technical problem to be solved by the present invention is to provide a dog sound emotion classification method based on an artificial neural network. Short-time energy, zero-crossing rate and pitch period are extracted, and the extracted sound feature parameters are used to train an artificial neural network. The trained model classifies the dog's emotional sounds automatically, and the classification results are finally corrected using the dog's facial-expression features, reducing the misclassification rate.
The technical solution adopted by the present invention is a dog sound emotion classification method based on an artificial neural network, comprising the following steps:
(1) Dog sound and expression acquisition: collect sounds for the four emotions of happiness, pain, anger and fear, and collect facial-expression images corresponding to the four emotions.
(2) Sound preprocessing: mainly pre-emphasis, framing and windowing.
(3) Sound feature parameter extraction: extract feature parameters from the sound sequence to be measured, namely short-time energy, zero-crossing rate and pitch period; these feature parameters can effectively separate the four emotional sounds.
(4) Expression feature parameter extraction: extract expression feature parameters, i.e. local texture feature parameters, from the corresponding facial images; the extracted expression features can effectively distinguish the dog's emotions.
(5) Build the artificial neural network: obtain the dog sound emotion classification model by training with the BP neural network algorithm.
(6) Training and testing: use 80% of the collected sound samples as the training set and 20% as the test set.
(7) Correct the model's misclassifications using the expression features of the images.
Specifically, in the audio acquisition of step (1), the sampling frequency satisfies the Nyquist sampling theorem, f_s ≥ 2f_h, where f_h is the highest frequency in the signal. The channel count is set to mono, the sampling frequency to 4.8 kHz, and the quantization precision to 16 bits. The dog's facial expressions are captured by photographing, and each collected sound is labeled with the expression corresponding to it.
Specifically, the preprocessing of step (2) comprises the following steps:
(2.1) Pre-emphasis: boosting the high-frequency part of the audio spectrum makes the spectrum flatter. It can usually be realized in two ways, by an analog circuit or by a digital circuit, and is generally implemented with a first-order high-pass digital filter whose transfer function is H(z) = 1 − αz⁻¹, where α lies in the range [0.9, 1.0] and is usually taken as 0.95.
(2.2) Framing: because the sound signal is only short-term stationary, it must be divided into frames so that each frame can be processed as a stationary signal. To reduce the variation between adjacent frames, consecutive frames overlap. A frame length of 25 ms is typical, with a frame shift of half the frame length.
(2.3) Windowing: windowing keeps the signal more continuous at the frame boundaries before Fourier expansion and avoids the Gibbs effect; after windowing, the originally aperiodic sound signal takes on some characteristics of a periodic function. In speech signal analysis, common window functions include the rectangular window, the Hanning window and the Hamming window.
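For illustration only (the function names are ours, not the patent's), framing with 50% overlap and a Hamming window can be sketched as:

```python
import math

def frame_signal(x, frame_len, hop):
    """Split a signal into overlapping frames; hop = frame_len // 2 gives 50% overlap."""
    return [x[i:i + frame_len] for i in range(0, len(x) - frame_len + 1, hop)]

def hamming(N):
    """Hamming window of length N: w[n] = 0.54 - 0.46 * cos(2*pi*n / (N-1))."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (N - 1)) for n in range(N)]

frames = frame_signal(list(range(10)), 4, 2)
print(len(frames))  # frames start at samples 0, 2, 4, 6, so 4 frames
windowed = [[w * s for w, s in zip(hamming(4), f)] for f in frames]
```

In the embodiment below the frame length is 128 samples and the hop 64, which is exactly this 50%-overlap scheme.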
Specifically, the sound feature parameter extraction of step (3) comprises the following steps:
(3.1) Short-time energy extraction: the short-time energy is the energy of one frame of the sound signal, from which the amplitude characteristics of the signal can be observed. Let the sound signal be x(n) and let the l-th frame obtained after preprocessing be x_l(n); the short-time energy is then
E_l = Σ x_l(n)² (n = 0, 1, …, N−1),
where E_l is the short-time energy of the l-th frame of the sound signal and N is the length of one frame.
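The short-time energy of a frame is a one-line computation; this sketch is purely illustrative:

```python
def short_time_energy(frame):
    """E_l = sum of squared samples over one frame."""
    return sum(s * s for s in frame)

print(short_time_energy([3.0, -4.0]))  # 9.0 + 16.0 = 25.0
```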
(3.2) Zero-crossing rate extraction: the short-time zero-crossing rate is a common time-domain feature of a sound signal; it is the number of times the signal passes through zero within a short interval. The method differs for continuous and discrete signals: the zero-crossing rate of a continuous signal can be obtained by inspecting its waveform, while for a discrete signal it is obtained by counting the sign changes between successive samples. The number of zero crossings per unit time is called the average zero-crossing rate.
The short-time average zero-crossing count z_l of the frame signal x_l(n) is defined as
z_l = (1/2) Σ |sgn[x_l(n)] − sgn[x_l(n−1)]| (n = 1, …, N−1),
where x_l(n) is the l-th frame of the sound signal, N is the length of one frame, z_l is the short-time zero-crossing count of the l-th frame, and sgn[·] is the sign function:
sgn[x] = 1 for x ≥ 0, and sgn[x] = −1 for x < 0.
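A hypothetical sketch (our own helper names) of the zero-crossing count, using the sign function defined above:

```python
def sgn(v):
    """Sign function as defined above: 1 for v >= 0, -1 otherwise."""
    return 1 if v >= 0 else -1

def zero_crossings(frame):
    """z = (1/2) * sum over n of |sgn(x[n]) - sgn(x[n-1])|."""
    return sum(abs(sgn(frame[n]) - sgn(frame[n - 1]))
               for n in range(1, len(frame))) // 2

print(zero_crossings([1, -1, 1, -1]))  # alternating signs: 3 crossings
```

Each sign flip contributes |1 − (−1)| = 2 to the sum, and the factor 1/2 converts that back into a count of crossings.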
(3.3) Pitch period extraction: compute the short-time autocorrelation function of each frame,
R(k) = Σ x_i(m) x_i(m+k),
where x_i(m) is the windowed sound signal and k is the time lag, and estimate the pitch period of each frame from it.
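A hedged sketch of autocorrelation-based pitch estimation: the usual approach is to take the lag at which R(k) peaks within a plausible range. The search range [k_min, k_max] is our assumption; the patent does not specify it.

```python
import math

def autocorr(frame, k):
    """R(k) = sum over m of x[m] * x[m + k] within one frame."""
    return sum(frame[m] * frame[m + k] for m in range(len(frame) - k))

def pitch_period(frame, k_min, k_max):
    """Take the lag maximizing R(k) over a plausible range as the pitch period."""
    return max(range(k_min, k_max + 1), key=lambda k: autocorr(frame, k))

fs = 4800  # the sampling rate used in the patent
frame = [math.sin(2 * math.pi * 200 * n / fs) for n in range(240)]  # 200 Hz tone
print(pitch_period(frame, 10, 100))  # 24 samples, i.e. 4800 / 200
```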
Specifically, in step (4), expression feature parameters are extracted using the LBP (local binary pattern) algorithm:
LBP is a local texture extraction algorithm for images; it preserves grayscale information and captures the relationship between a pixel and the values in its neighborhood.
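The basic 3×3 LBP operator, in its common textbook form, is sketched below; the patent does not specify which LBP variant is used, so this is an assumption for illustration:

```python
def lbp_code(patch):
    """Basic 3x3 LBP: threshold the 8 neighbours against the centre pixel
    and pack the comparison bits into one 8-bit code."""
    center = patch[1][1]
    # neighbours taken clockwise starting from the top-left corner
    neighbours = [patch[0][0], patch[0][1], patch[0][2], patch[1][2],
                  patch[2][2], patch[2][1], patch[2][0], patch[1][0]]
    return sum(1 << i for i, p in enumerate(neighbours) if p >= center)

patch = [[9, 9, 9],
         [1, 5, 1],
         [1, 1, 1]]
print(lbp_code(patch))  # only the top row exceeds the centre: bits 0-2 set -> 7
```

A histogram of these codes over an image region is what serves as the local texture feature vector.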
Specifically, the BP neural network algorithm of step (5) is as follows:
Let the input layer have n nodes, the hidden layer l nodes, and the output layer m nodes. The weight from input node i to hidden node j is w_ij, the weight from hidden node j to output node k is w_jk, the bias of hidden node j is a_j, and the bias of output node k is b_k. The learning rate is η, and the excitation function g(x) is the sigmoid function:
g(x) = 1 / (1 + e^(−x)).
The output of hidden node j is
H_j = g(Σ_i w_ij x_i + a_j).
The output of output node k is
O_k = Σ_j w_jk H_j + b_k.
The error is calculated as
E = (1/2) Σ_k (Y_k − O_k)²,
where Y_k is the desired output; writing e_k = Y_k − O_k, E can be expressed as E = (1/2) Σ_k e_k². In these formulas, i = 1, …, n, j = 1, …, l, k = 1, …, m.
The weight update formulas are:
w_ij ← w_ij + η H_j (1 − H_j) x_i Σ_k w_jk e_k,
w_jk ← w_jk + η H_j e_k.
The bias update formulas are:
a_j ← a_j + η H_j (1 − H_j) Σ_k w_jk e_k,
b_k ← b_k + η e_k.
Finally, judge whether the algorithm iteration has finished: there are several ways to decide convergence, commonly either running a specified number of iterations or checking whether the difference between two consecutive errors falls below a specified value.
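A minimal runnable sketch of one BP update step, assuming (as the formulas above suggest) a sigmoid hidden layer and a linear output layer; the variable names follow the patent's notation, but the code is illustrative, not the patented implementation:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def bp_step(x, y, w_ih, w_ho, a, b, eta=0.1):
    """One forward/backward pass; returns the squared error E before the update."""
    n, l, m = len(x), len(a), len(b)
    # forward pass: H_j = g(sum_i w_ij x_i + a_j), O_k = sum_j w_jk H_j + b_k
    H = [sigmoid(sum(w_ih[i][j] * x[i] for i in range(n)) + a[j]) for j in range(l)]
    O = [sum(w_ho[j][k] * H[j] for j in range(l)) + b[k] for k in range(m)]
    e = [y[k] - O[k] for k in range(m)]
    # hidden-layer deltas through the sigmoid derivative H(1 - H)
    d = [H[j] * (1 - H[j]) * sum(w_ho[j][k] * e[k] for k in range(m)) for j in range(l)]
    # weight and bias updates, matching the update formulas in the text
    for j in range(l):
        for k in range(m):
            w_ho[j][k] += eta * H[j] * e[k]
        for i in range(n):
            w_ih[i][j] += eta * x[i] * d[j]
        a[j] += eta * d[j]
    for k in range(m):
        b[k] += eta * e[k]
    return 0.5 * sum(ek * ek for ek in e)

# toy check: the error shrinks over repeated updates on a single sample
w_ih = [[0.1, -0.2], [0.05, 0.1], [0.0, 0.2]]
w_ho = [[0.1], [-0.1]]
a, b = [0.0, 0.0], [0.0]
errors = [bp_step([0.5, -0.3, 0.8], [1.0], w_ih, w_ho, a, b) for _ in range(50)]
print(errors[-1] < errors[0])  # True
```

The hidden deltas are computed before any weights are changed, so the backward pass uses the same weights as the forward pass, as standard backpropagation requires.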
The beneficial effects of the present invention are: the invention can be applied in the field of audio recognition. Compared with the prior art, the neural network has good self-learning, self-organization and fault tolerance, and its calculation process is relatively simple. Three characteristic parameters are chosen for the sound characteristics of dogs, and combining them with expression features reduces the classification misrecognition rate.
Detailed description of the invention
Fig. 1 is the overall flow chart of the present invention;
Fig. 2 is the BP neural network training flow chart of the present invention.
Specific embodiment
The invention is described in further detail below with reference to the drawings and specific embodiments, but the protection scope of the present invention is not limited to this content.
Embodiment 1: as shown in Figs. 1 and 2, a dog sound emotion recognition method based on an artificial neural network comprises the following steps:
(1) Dog sound and expression acquisition: collect sounds for the four emotions of happiness, pain, anger and fear, and collect facial-expression images corresponding to the four emotions.
(2) The sampling frequency satisfies the Nyquist sampling theorem, f_s ≥ 2f_h, where f_h is the highest frequency in the signal. The channel count is set to mono, the sampling frequency to 4.8 kHz, and the quantization precision to 16 bits.
(3) The pre-emphasis filter parameter α is taken as 0.95, the frame length used for framing is 128 samples, the frame shift is 64 samples, and the window function is the Hamming window.
(4) Pre-emphasis: boosting the high-frequency part of the speech spectrum makes the spectrum flatter. It can usually be realized in two ways, by an analog circuit or by a digital circuit, and is generally implemented with a first-order high-pass digital filter whose transfer function is H(z) = 1 − αz⁻¹, where α lies in the range [0.9, 1.0] and is usually taken as 0.95.
(5) Framing: because the sound signal is only short-term stationary, it must be divided into frames so that each frame can be processed as a stationary signal. To reduce the variation between adjacent frames, consecutive frames overlap. A frame length of 25 ms is typical, with a frame shift of half the frame length.
(6) Windowing: windowing keeps the signal more continuous at the frame boundaries before Fourier expansion and avoids the Gibbs effect; after windowing, the originally aperiodic sound signal takes on some characteristics of a periodic function. In speech signal analysis, common window functions include the rectangular window, the Hanning window and the Hamming window.
(7) Read the preprocessed data: this step is implemented in software.
(8) Short-time energy extraction: the short-time energy is the energy of one frame of the sound signal, from which the amplitude characteristics of the signal can be observed. Let the sound signal be x(n) and let the l-th frame obtained after preprocessing be x_l(n); the short-time energy is then E_l = Σ x_l(n)² (n = 0, 1, …, N−1), where E_l is the short-time energy of the l-th frame of the sound signal and N is the length of one frame.
(9) Zero-crossing rate extraction: the short-time zero-crossing rate is a common time-domain feature of a sound signal; it is the number of times the signal passes through zero within a short interval. The method differs for continuous and discrete signals: the zero-crossing rate of a continuous signal can be obtained by inspecting its waveform, while for a discrete signal it is obtained by counting the sign changes between successive samples. The number of zero crossings per unit time is called the average zero-crossing rate.
The short-time average zero-crossing count z_l of the frame signal x_l(n) is defined as z_l = (1/2) Σ |sgn[x_l(n)] − sgn[x_l(n−1)]| (n = 1, …, N−1), where x_l(n) is the l-th frame of the sound signal, N is the length of one frame, z_l is the short-time zero-crossing count of the l-th frame, and sgn[·] is the sign function: sgn[x] = 1 for x ≥ 0, and sgn[x] = −1 for x < 0.
(10) Pitch period extraction: compute the short-time autocorrelation function of each frame, R(k) = Σ x_i(m) x_i(m+k), where x_i(m) is the windowed sound signal and k is the time lag, and estimate the pitch period of each frame from it.
(11) Expression feature parameter extraction: expression feature parameters are extracted using the LBP algorithm. LBP is a local texture extraction algorithm for images; it preserves grayscale information and captures the relationship between a pixel and the values in its neighborhood.
(12) Build the artificial neural network: let the input layer have n nodes and the hidden layer l nodes; the output layer has m nodes, where m = 4 (one per emotion). The weight from input node i to hidden node j is w_ij, the weight from hidden node j to output node k is w_jk, the bias of hidden node j is a_j, and the bias of output node k is b_k. The learning rate η is set to 0.01, and the excitation function g(x) is the sigmoid function:
g(x) = 1 / (1 + e^(−x)).
The output of hidden node j is H_j = g(Σ_i w_ij x_i + a_j).
The output of output node k is O_k = Σ_j w_jk H_j + b_k.
The error is calculated as E = (1/2) Σ_k (Y_k − O_k)², where Y_k is the desired output; writing e_k = Y_k − O_k, E can be expressed as E = (1/2) Σ_k e_k². In these formulas, i = 1, …, n, j = 1, …, l, k = 1, …, m.
The weight update formulas are:
w_ij ← w_ij + η H_j (1 − H_j) x_i Σ_k w_jk e_k,
w_jk ← w_jk + η H_j e_k.
The bias update formulas are:
a_j ← a_j + η H_j (1 − H_j) Σ_k w_jk e_k,
b_k ← b_k + η e_k.
Finally, judge whether the algorithm iteration has finished: there are several ways to decide convergence, commonly either running a specified number of iterations or checking whether the difference between two consecutive errors falls below a specified value.
(13) Training and testing: the sound samples are divided into two independent parts, a training set and a test set. The training set is used to train the model so that the neural network meets the expected requirements, and the test set is used to evaluate the model. The training set accounts for 80% of the samples and the test set for 20%, with both parts drawn at random from the samples.
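The random 80/20 split can be sketched as follows; the fixed seed is an illustrative choice, not something the patent specifies:

```python
import random

def split_samples(samples, train_frac=0.8, seed=42):
    """Randomly partition samples into a training set and a test set."""
    rng = random.Random(seed)
    idx = list(range(len(samples)))
    rng.shuffle(idx)
    cut = int(len(samples) * train_frac)
    return [samples[i] for i in idx[:cut]], [samples[i] for i in idx[cut:]]

train, test = split_samples(list(range(100)))
print(len(train), len(test))  # 80 20
```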
(14) Error correction: classifying emotional sounds with the trained model inevitably produces some wrong results; correcting the classification results with the expression features of the images reduces the misclassification rate.
For the vocal characteristics of dogs, this invention chooses three characteristic parameters, extracts them for classification, and combines them with the dog's expression features: the classification results are corrected by the expression features, reducing the classification error rate.
The embodiments of the present invention have been explained in detail above with reference to the drawings, but the present invention is not limited to the above embodiments; within the scope of knowledge of a person skilled in the art, various changes can also be made without departing from the concept of the invention.
Claims (6)
1. A dog sound emotion classification method based on an artificial neural network, characterized by comprising the following steps:
(1) dog sound and expression acquisition: collecting sounds for the four emotions of happiness, pain, anger and fear, and collecting facial-expression images corresponding to the four emotions;
(2) sound preprocessing: including pre-emphasis, framing and windowing;
(3) sound feature parameter extraction: extracting feature parameters from the sound sequence to be measured, namely short-time energy, zero-crossing rate and pitch period;
(4) expression feature parameter extraction: extracting expression feature parameters, i.e. local texture feature parameters of the image, from the corresponding facial images;
(5) building the artificial neural network: obtaining the dog sound emotion classification model by training with the BP neural network algorithm;
(6) training and testing: using 80% of the collected sound samples as the training set and 20% as the test set;
(7) correcting the model's misclassifications using the expression features of the images.
2. The dog sound emotion classification method based on an artificial neural network according to claim 1, characterized in that: in step (1), sound is collected by a recording device, and the sampling frequency satisfies the Nyquist sampling theorem, f_s ≥ 2f_h, where f_h is the highest frequency in the signal; the dog's facial expressions can be captured by photographing, and each collected sound is labeled with the expression corresponding to it.
3. The dog sound emotion classification method based on an artificial neural network according to claim 1, characterized in that: in step (2), the pre-emphasis filter parameter α is taken as 0.95, the frame length used for framing is 128 samples, the frame shift is 64 samples, and the window function is the Hanning window.
4. The dog sound emotion classification method based on an artificial neural network according to claim 1, characterized in that the sound feature parameter extraction in step (3) comprises the following steps:
(3.1) short-time energy extraction: the short-time energy is the energy of one frame of the sound signal; letting the sound signal be x(n) and the l-th frame obtained after preprocessing be x_l(n), the short-time energy is E_l = Σ x_l(n)² (n = 0, 1, …, N−1), where E_l is the short-time energy of the l-th frame and N is the length of one frame;
(3.2) zero-crossing rate extraction: the short-time zero-crossing rate is the number of times the signal passes through zero in a short interval; for a continuous signal the zero-crossing rate can be obtained by inspecting the waveform, while for a discrete signal it is obtained by counting the sign changes between successive samples; the number of zero crossings per unit time is called the average zero-crossing rate;
the short-time average zero-crossing count z_l of the frame signal x_l(n) is defined as z_l = (1/2) Σ |sgn[x_l(n)] − sgn[x_l(n−1)]| (n = 1, …, N−1), where x_l(n) is the l-th frame of the sound signal, N is the length of one frame, z_l is the short-time zero-crossing count of the l-th frame, and sgn[·] is the sign function: sgn[x] = 1 for x ≥ 0, and sgn[x] = −1 for x < 0;
(3.3) pitch period extraction: the short-time autocorrelation function R(k) = Σ x_i(m) x_i(m+k) is computed for each frame, where x_i(m) is the windowed sound signal and k is the time lag, and the pitch period of each frame is estimated from it.
5. The dog sound emotion classification method based on an artificial neural network according to claim 1, characterized in that: in step (4), the expression feature parameters are extracted using the LBP algorithm.
6. The dog sound emotion classification method based on an artificial neural network according to claim 1, characterized in that the BP neural network algorithm in step (5) comprises:
(5.1) network initialization:
let the input layer have n nodes, the hidden layer l nodes, and the output layer m nodes; the weight from input node i to hidden node j is w_ij, the weight from hidden node j to output node k is w_jk, the bias of hidden node j is a_j, the bias of output node k is b_k, the learning rate is η, and the excitation function g(x) is the sigmoid function g(x) = 1 / (1 + e^(−x));
(5.2) the output of hidden node j: H_j = g(Σ_i w_ij x_i + a_j);
(5.3) the output of output node k: O_k = Σ_j w_jk H_j + b_k;
(5.4) the calculation of the error: E = (1/2) Σ_k (Y_k − O_k)², where Y_k is the desired output; writing e_k = Y_k − O_k, E can be expressed as E = (1/2) Σ_k e_k²; in these formulas, i = 1, …, n, j = 1, …, l, k = 1, …, m;
(5.5) the weight update formulas: w_ij ← w_ij + η H_j (1 − H_j) x_i Σ_k w_jk e_k; w_jk ← w_jk + η H_j e_k;
(5.6) the bias update formulas: a_j ← a_j + η H_j (1 − H_j) Σ_k w_jk e_k; b_k ← b_k + η e_k;
(5.7) finally judging whether the algorithm iteration has finished: convergence is judged by running a specified number of iterations, that is, by checking whether the difference between two consecutive errors is less than a specified value.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810995254.XA CN109272986A (en) | 2018-08-29 | 2018-08-29 | A kind of dog sound sensibility classification method based on artificial neural network |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810995254.XA CN109272986A (en) | 2018-08-29 | 2018-08-29 | A kind of dog sound sensibility classification method based on artificial neural network |
Publications (1)
Publication Number | Publication Date |
---|---|
CN109272986A true CN109272986A (en) | 2019-01-25 |
Family
ID=65154951
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810995254.XA Pending CN109272986A (en) | 2018-08-29 | 2018-08-29 | A kind of dog sound sensibility classification method based on artificial neural network |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109272986A (en) |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110970037A (en) * | 2019-11-28 | 2020-04-07 | 歌尔股份有限公司 | Pet language identification method and device, electronic equipment and readable storage medium |
CN111444137A (en) * | 2020-03-26 | 2020-07-24 | 湖南搜云网络科技股份有限公司 | Multimedia file identity recognition method based on feature codes |
CN111916067A (en) * | 2020-07-27 | 2020-11-10 | 腾讯科技(深圳)有限公司 | Training method and device of voice recognition model, electronic equipment and storage medium |
CN111951812A (en) * | 2020-08-26 | 2020-11-17 | 杭州情咖网络技术有限公司 | Animal emotion recognition method and device and electronic equipment |
CN112634947A (en) * | 2020-12-18 | 2021-04-09 | 大连东软信息学院 | Animal voice and emotion feature set sequencing and identifying method and system |
Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20110082574A1 (en) * | 2009-10-07 | 2011-04-07 | Sony Corporation | Animal-machine audio interaction system |
CN103544962A (en) * | 2012-07-10 | 2014-01-29 | 腾讯科技(深圳)有限公司 | Animal status information release method and device |
CN104700829A (en) * | 2015-03-30 | 2015-06-10 | 中南民族大学 | System and method for recognizing voice emotion of animal |
CN105976809A (en) * | 2016-05-25 | 2016-09-28 | 中国地质大学(武汉) | Voice-and-facial-expression-based identification method and system for dual-modal emotion fusion |
CN106340309A (en) * | 2016-08-23 | 2017-01-18 | 南京大空翼信息技术有限公司 | Dog bark emotion recognition method and device based on deep learning |
CN106531173A (en) * | 2016-11-11 | 2017-03-22 | 努比亚技术有限公司 | Terminal-based animal data processing method and terminal |
TW201713284A (en) * | 2015-10-15 | 2017-04-16 | 昌泰科醫股份有限公司 | Sensing device for measuring physiological condition of pets capable of capturing the sound of the pet and accordingly determining the current mood or health status of the pet |
CN108320735A (en) * | 2018-01-23 | 2018-07-24 | 北京易智能科技有限公司 | A kind of emotion identification method and system of multi-data fusion |
CN110175526A (en) * | 2019-04-28 | 2019-08-27 | 平安科技(深圳)有限公司 | Dog Emotion identification model training method, device, computer equipment and storage medium |
Patent Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20110082574A1 (en) * | 2009-10-07 | 2011-04-07 | Sony Corporation | Animal-machine audio interaction system |
CN103544962A (en) * | 2012-07-10 | 2014-01-29 | 腾讯科技(深圳)有限公司 | Animal status information release method and device |
CN104700829A (en) * | 2015-03-30 | 2015-06-10 | 中南民族大学 | System and method for recognizing voice emotion of animal |
TW201713284A (en) * | 2015-10-15 | 2017-04-16 | 昌泰科醫股份有限公司 | Sensing device for measuring physiological condition of pets capable of capturing the sound of the pet and accordingly determining the current mood or health status of the pet |
CN105976809A (en) * | 2016-05-25 | 2016-09-28 | 中国地质大学(武汉) | Voice-and-facial-expression-based identification method and system for dual-modal emotion fusion |
CN106340309A (en) * | 2016-08-23 | 2017-01-18 | 南京大空翼信息技术有限公司 | Dog bark emotion recognition method and device based on deep learning |
CN106531173A (en) * | 2016-11-11 | 2017-03-22 | 努比亚技术有限公司 | Terminal-based animal data processing method and terminal |
CN108320735A (en) * | 2018-01-23 | 2018-07-24 | 北京易智能科技有限公司 | A kind of emotion identification method and system of multi-data fusion |
CN110175526A (en) * | 2019-04-28 | 2019-08-27 | 平安科技(深圳)有限公司 | Dog Emotion identification model training method, device, computer equipment and storage medium |
Non-Patent Citations (1)
Title |
---|
Xu Zhaosong et al., "Research on Speech Emotion Recognition Based on BP Neural Network" (基于BP神经网络的语音情感识别研究), Software Guide (《软件导刊》), vol. 13, no. 04, 23 April 2014, pages 11-13 |
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110970037A (en) * | 2019-11-28 | 2020-04-07 | 歌尔股份有限公司 | Pet language identification method and device, electronic equipment and readable storage medium |
CN111444137A (en) * | 2020-03-26 | 2020-07-24 | 湖南搜云网络科技股份有限公司 | Multimedia file identity recognition method based on feature codes |
CN111916067A (en) * | 2020-07-27 | 2020-11-10 | 腾讯科技(深圳)有限公司 | Training method and device of voice recognition model, electronic equipment and storage medium |
CN111951812A (en) * | 2020-08-26 | 2020-11-17 | 杭州情咖网络技术有限公司 | Animal emotion recognition method and device and electronic equipment |
CN112634947A (en) * | 2020-12-18 | 2021-04-09 | 大连东软信息学院 | Animal voice and emotion feature set sequencing and identifying method and system |
CN112634947B (en) * | 2020-12-18 | 2023-03-14 | 大连东软信息学院 | Animal voice and emotion feature set sequencing and identifying method and system |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN106878677B (en) | Student classroom mastery degree evaluation system and method based on multiple sensors | |
CN109272986A (en) | A kind of dog sound sensibility classification method based on artificial neural network | |
CN108831485A (en) | Method for distinguishing speek person based on sound spectrograph statistical nature | |
CN103996155A (en) | Intelligent interaction and psychological comfort robot service system | |
CN111798874A (en) | Voice emotion recognition method and system | |
CN103544963A (en) | Voice emotion recognition method based on core semi-supervised discrimination and analysis | |
AL-Dhief et al. | Voice pathology detection using machine learning technique | |
CN102411932B (en) | Methods for extracting and modeling Chinese speech emotion in combination with glottis excitation and sound channel modulation information | |
CN105448291A (en) | Parkinsonism detection method and detection system based on voice | |
CN102655003B (en) | Method for recognizing emotion points of Chinese pronunciation based on sound-track modulating signals MFCC (Mel Frequency Cepstrum Coefficient) | |
CN110111797A (en) | Method for distinguishing speek person based on Gauss super vector and deep neural network | |
CN108922541A (en) | Multidimensional characteristic parameter method for recognizing sound-groove based on DTW and GMM model | |
Wang et al. | Speaker recognition based on MFCC and BP neural networks | |
CN109727608A (en) | A kind of ill voice appraisal procedure based on Chinese speech | |
Murugappan et al. | DWT and MFCC based human emotional speech classification using LDA | |
CN115346561B (en) | Depression emotion assessment and prediction method and system based on voice characteristics | |
CN115410711B (en) | White feather broiler health monitoring method based on sound signal characteristics and random forest | |
CN112397074A (en) | Voiceprint recognition method based on MFCC (Mel frequency cepstrum coefficient) and vector element learning | |
Gallardo-Antolín et al. | On combining acoustic and modulation spectrograms in an attention LSTM-based system for speech intelligibility level classification | |
Warlaumont et al. | Data-driven automated acoustic analysis of human infant vocalizations using neural network tools | |
Sharma et al. | Processing and analysis of human voice for assessment of Parkinson disease | |
Chaves et al. | Katydids acoustic classification on verification approach based on MFCC and HMM | |
CN111091816B (en) | Data processing system and method based on voice evaluation | |
CN102750950B (en) | Chinese emotion speech extracting and modeling method combining glottal excitation and sound track modulation information | |
Marck et al. | Identification, analysis and characterization of base units of bird vocal communication: The white spectacled bulbul (Pycnonotus xanthopygos) as a case study |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | ||
| SE01 | Entry into force of request for substantive examination | ||
| RJ01 | Rejection of invention patent application after publication | Application publication date: 20190125 |