CN107808659A - Intelligent sound signal type recognition system device - Google Patents
Intelligent sound signal type recognition system device
- Publication number
- CN107808659A (application CN201711253194.6A)
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/02—Feature extraction for speech recognition; Selection of recognition unit
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/285—Selection of pattern recognition techniques, e.g. of classifiers in a multi-classifier system
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/063—Training
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/16—Speech classification or search using artificial neural networks
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0208—Noise filtering
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
- G10L25/24—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/27—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
- G10L25/30—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/48—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
- G10L25/51—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/78—Detection of presence or absence of voice signals
Abstract
An intelligent sound signal type recognition system device comprises a housing 10 provided with a cavity. Arranged within the housing 10 are a voice acquisition module 1, a voice recognition module 2, a central processing unit 3, a wireless signal transceiver 4, a display screen 8, a memory 33, a network module 31, a RAM card 32, a loudspeaker 35, and a power supply 9. The voice acquisition module 1 comprises a microphone 11, a wireless microphone 12, and a fixed recorder 13; the voice recognition module 2 comprises a voice input unit 20, a voice preprocessing unit 21, a voice feature extraction unit 22, and a feature matching and classification unit 23. Voice signals are collected by the voice acquisition module 1, the collected signals are processed by the voice recognition module 2, data signals are stored in the memory 33, and the interactive operating process and the output of results are visualized on the display screen 8, making speech signal recognition more convenient.
Description
Technical field
The invention discloses an intelligent sound signal type recognition system device belonging to the technical field of smart electronic products, and specifically an intelligent voice signal pattern recognition system device that integrates a voice acquisition module, a voice recognition module, a control system, and a loudspeaker.
Background technology
In daily life we are surrounded by sound signals of many kinds: the speech of people in conversation, the sound of music playing, vehicle horns, the noise of running machinery, and so on. Sound signals fill almost the entire living environment, and it is often desirable to determine accurately which object in a group of sound sources produced a given sound. For common sounds, people can usually tell which object produced them; but when several objects sound at once, especially several similar objects, or when the environment is noisy, it is difficult to tell which sound came from which object. For example, in a recording of a multi-person argument with many speakers, it is hard to tell on playback which utterance belongs to which speaker. A device capable of identifying voices is therefore often needed.
Before the present invention, some voice recognition products existed on the market, such as voice input software, but these mostly recognize the words or letters in speech, or perform simple one-to-one matching of a single voice. Others allow a user to complete simple tasks, such as placing a call or searching, by speaking to a mobile phone, which recognizes the semantics of the speech. None of these, however, can distinguish voice characteristics, i.e. accurately identify which person or object produced similar speech or identical words, and they are therefore inconvenient for flexible use.
The content of the invention
To overcome the above technical disadvantages, the object of the present invention is to provide an intelligent sound signal type recognition system device that can conveniently record voice signals, extract their characteristic parameters, and perform intelligent pattern recognition, classification, and extraction on unknown voice signals against the stored signals.
To achieve the above object, the present invention adopts the following technical scheme. The device includes a housing 10 provided with a cavity. Arranged within the housing 10 are a voice acquisition module 1, a voice recognition module 2, a central processing unit 3, a wireless signal transceiver 4, a display screen 8, a memory 33, a network module 31, a RAM card 32, a loudspeaker 35, and a power supply 9. The voice acquisition module 1 comprises a microphone 11, a wireless microphone 12, and a fixed recorder 13; the voice recognition module 2 comprises a voice input unit 20, a voice preprocessing unit 21, a voice feature extraction unit 22, and a feature matching and classification unit 23. Voice signals are collected by the voice acquisition module 1, the collected signals are processed by the voice recognition module 2, data signals are stored in the memory 33, and the interactive operating process and the output of results are visualized on the display screen 8. The loudspeaker 35 is arranged to give voice prompts for the operating steps and to announce recognition results; the network module 31 is arranged to connect the device to an internet cloud platform; the central processing unit 3 is arranged for program control and data computation of the whole system; the wireless signal transceiver 4 is arranged to receive and transmit the radio signals produced by the wireless microphone 12, a smartphone, and the network module 31 and to connect the device wirelessly to the internet; the RAM card 32 is arranged to read previously recorded external voice data into the device's database.
In the present invention, the voice input unit 20 provides two modes: a "voice enrollment mode" and a "voice test mode". Voice can be input through any of the microphone 11, the wireless microphone 12, and the fixed recorder 13 provided by the voice acquisition module 1, or through a smartphone. In "voice enrollment mode" the voice input unit 20 can enroll only one person or one object at a time; the enrolled voice is an audio signal of 5 to 30 seconds. The invention uses a multi-mode enrollment strategy: the enrolled voice may include a combination of normal speech, singing, and high-, middle-, and low-pitched voice. The display screen 8 shows the speech waveform and a progress bar in real time. After enrollment the data must be labeled; labeling is done manually. For example, if the voice of Zhang San has been collected, the note "voice of Zhang San" is entered in the dialog box shown on the display screen 8 and saved, and the enrolled voice is stored in the memory 33. In "voice test mode" the invention collects test voice through one or more of the microphone 11, the wireless microphone 12, the fixed recorder 13, and a smartphone together; test voice collection is in real time, with no limitation on the number of persons, objects, or duration.
In the present invention, the voice input unit 20 is connected with the voice acquisition module 1; the microphone 11 is connected to the voice acquisition module 1 by an audio cable, and the wireless microphone 12 is connected to the voice acquisition module 1 by radio signal.
In the present invention, a smartphone may also be used for voice signal input by pairing the phone with the voice acquisition module 1; pairing may be made via Bluetooth, infrared, WiFi, or scanning a QR code. This realizes voice enrollment with the phone acting as a wireless microphone, which is more convenient when capturing the voices of larger groups.
In the present invention, the voice preprocessing unit 21 converts the voice signal collected by the voice acquisition module 1 into an electric signal, i.e. converts the analog signal into a digital signal, and then performs conventional signal processing, including ambient background noise elimination, signal framing, filtering, pre-emphasis, windowing, and endpoint detection.
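The preprocessing chain described above (pre-emphasis, framing, windowing, endpoint detection) can be sketched as follows. The sampling rate, frame sizes, pre-emphasis coefficient, and the crude energy threshold are illustrative assumptions, not values from the patent.

```python
import numpy as np

def preprocess(signal, fs=16000, frame_ms=25, hop_ms=10, alpha=0.97):
    """Sketch of the conventional front end: pre-emphasis, framing,
    Hamming windowing, and a crude energy-based endpoint check."""
    # Pre-emphasis boosts high frequencies: y[t] = x[t] - alpha * x[t-1]
    emphasized = np.append(signal[0], signal[1:] - alpha * signal[:-1])

    frame_len = int(fs * frame_ms / 1000)   # e.g. 400 samples per frame
    hop = int(fs * hop_ms / 1000)           # e.g. 160-sample frame shift
    n_frames = 1 + max(0, (len(emphasized) - frame_len) // hop)

    window = np.hamming(frame_len)
    frames = np.stack([emphasized[i * hop : i * hop + frame_len] * window
                       for i in range(n_frames)])

    # Endpoint detection (simplified): keep frames above an energy threshold.
    energy = (frames ** 2).sum(axis=1)
    voiced = energy > 0.1 * energy.mean()
    return frames[voiced]

# Example: a synthetic tone padded with silence at both ends.
fs = 16000
t = np.arange(fs) / fs
sig = np.concatenate([np.zeros(4000), np.sin(2 * np.pi * 220 * t), np.zeros(4000)])
frames = preprocess(sig, fs)
print(frames.shape)  # (number of voiced frames, 400)
```

In practice the noise elimination and filtering steps would use a proper estimate of the background spectrum rather than a fixed energy threshold; this sketch only shows the shape of the pipeline.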
In the present invention, the voice feature extraction unit 22 is arranged to extract from the original voice signal the main characteristic parameters reflecting the essence of the voice and to form a feature vector x_i = (x_i1, x_i2, …, x_ij, …, x_in)^T, where x_ij denotes the j-th speech feature value of the i-th object or person. The feature extraction method preferably uses Mel-frequency cepstral coefficients (MFCC); acoustic features may also be obtained by the spectral envelope method, LPC interpolation, LPC root-finding, the Hilbert transform method, and the like. The feature vectors obtained after extraction are automatically saved by the system into a pattern class database, all voice features of one object or person corresponding to one pattern class. After the voices of N persons or objects have been enrolled, N pattern classes are obtained; if each pattern class has n characteristic parameters, an n-dimensional feature space is formed, and the labeled feature signal set can be written as D = {(x_1, y_1), (x_2, y_2), …, (x_i, y_i), …, (x_N, y_N)}, where x_i ∈ χ = R^n denotes the speech feature signal of the i-th enrolled object or person and y_i ∈ Y = {1, 2, …, N} denotes the i-th person or object, N being the number of enrolled persons or objects. The labeled voice feature data form the pattern class database and are stored in the memory 33 of the invention.
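The MFCC extraction preferred above can be sketched as a power spectrum passed through a triangular mel filterbank, log-compressed, and decorrelated by a DCT. The filter and coefficient counts (26 and 13) are common defaults assumed here, and averaging the per-frame coefficients into a single vector per utterance is a simplification of how the feature vector x_i would actually be formed.

```python
import numpy as np

def hz_to_mel(f): return 2595 * np.log10(1 + f / 700)
def mel_to_hz(m): return 700 * (10 ** (m / 2595) - 1)

def mel_filterbank(n_filters, n_fft, fs):
    """Triangular filters spaced equally on the mel scale."""
    mels = np.linspace(hz_to_mel(0), hz_to_mel(fs / 2), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mels) / fs).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, center):
            fb[i - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fb[i - 1, k] = (right - k) / max(right - center, 1)
    return fb

def mfcc_features(frames, fs=16000, n_filters=26, n_ceps=13):
    """Per-frame MFCCs averaged into one fixed-length feature vector x_i."""
    n_fft = frames.shape[1]
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft
    fb = mel_filterbank(n_filters, n_fft, fs)
    logmel = np.log(power @ fb.T + 1e-10)
    # DCT-II decorrelates the log filterbank energies (the cepstrum).
    n = np.arange(n_filters)
    basis = np.cos(np.pi * np.outer(np.arange(n_ceps), (2 * n + 1) / (2 * n_filters)))
    ceps = logmel @ basis.T
    return ceps.mean(axis=0)  # one n_ceps-dimensional vector per utterance

frames = np.random.randn(100, 400)  # stand-in for preprocessed frames
x_i = mfcc_features(frames)
print(x_i.shape)  # (13,)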
In the present invention, the feature matching and classification unit 23 is arranged as an intelligent multi-class classifier whose learning algorithm uses an improved neural network classification algorithm. The enrolled and labeled voice feature signal set serves as training data; the network model learns from the training data to obtain classification rules, completing the training of the classifier. The trained classifier is then used to classify and identify unknown test voice signals intelligently. After features are extracted from a test signal, the invention automatically performs feature matching: the characteristic parameters of the extracted test voice signal are matched in real time against the enrolled and labeled sample voice characteristic parameters in the memory 33, the similarity between the test voice signal and every enrolled sample voice signal is computed, and the test voice signal is assigned to the pattern class of the sample signal with the highest similarity. Finally the invention outputs the recognition result, an announcement such as "this is the voice of XXX". For example, if the invention has stored the voice feature signal of Zhang San, then when Zhang San speaks or sings to the device, it automatically computes that the test voice characteristic parameters are most similar to the enrolled and labeled voice signal of Zhang San and, through recognition, automatically outputs "this is the voice of Zhang San".
In the present invention, the multi-class classifier uses a multi-layer artificial neural network structure: one end of the network is defined as the input layer, the other end as the output layer, and the part between them as the hidden layers. The input layer receives external input signals and passes them to all neurons of the hidden layer; after the hidden layer computes, the result is passed to the output layer, which, after receiving the hidden-layer signals and computing, outputs the classification result, i.e. the recognition result. The preferred number of hidden layers is 1 to 200.
In the present invention, training of the improved artificial neural network classification algorithm comprises steps 1 to 7.
Step 1: network initialization. As voices are enrolled, the algorithm database is continually updated; when the voice signals of N objects have been enrolled, N pattern classes are formed and the sample space (X, Y) is obtained, the i-th sample being (X_i, Y_i), where X_i denotes the set of feature vectors extracted for the i-th object and Y_i denotes the label of the i-th object. From the system input-output sequence (X, Y), the number of input-layer nodes n, hidden-layer nodes l, and output-layer nodes m is determined: n is the number of feature values produced by feature extraction of the input signal, m is the number of stored speech pattern classes, and the reference value of l is l = √(n + m) + a, where a ranges over 0 to 10 and is determined automatically by the model. The connection weights ω_ij between the input-layer and hidden-layer neurons and ω_jk between the hidden-layer and output-layer neurons are initialized, the hidden-layer thresholds a and the output-layer thresholds b are initialized, and the learning rate η and the neuron excitation function are given.
Step 2: compute the hidden-layer output. From the input vector X, the connection weights ω_ij between the input-layer and hidden-layer neurons, and the hidden-layer thresholds a, compute the hidden-layer output H. The output of the j-th hidden node is H_j = f(Σ_{i=1..n} ω_ij·x_i − a_j), j = 1, 2, …, l, where l is the number of hidden nodes and f is the hidden-layer excitation function. Many excitation functions exist; the invention preferably uses f(x) = (1 + e^(−x))^(−1).
Step 3: compute the output-layer output. From the hidden-layer output H, the connection weights ω_jk between the hidden-layer and output-layer neurons, and the output-layer thresholds b, compute the output O. The output of the k-th output node is O_k = Σ_{j=1..l} H_j·ω_jk − b_k, k = 1, 2, …, m, where m is the number of output nodes, b_k is the threshold of the k-th output node, and H_j is the output value of the j-th hidden node.
Step 4: compute the prediction error. From the network output O and the desired output Y (the true values), compute the overall network prediction error e from the node errors e_k = Y_k − O_k, where e_k is the error at the k-th output node, k = 1, 2, …, m.
Step 5: update the weights. From the overall network prediction error e, update the connection weights ω_jk and ω_ij: ω_jk⁺ = ω_jk + η·H_j·E_k, j = 1, 2, …, l, k = 1, 2, …, m, where η is the learning rate and E_k denotes the sensitivity of the overall network error to output-layer node k; and ω_ij⁺ = ω_ij + η·H_j(1 − H_j)·x_i·Σ_{k=1..m} ω_jk·E_k, i = 1, 2, …, n, j = 1, 2, …, l.
Step 6: update the thresholds. From the overall network prediction error e, update the hidden-layer thresholds a and the output-layer thresholds b: a_j⁺ = a_j + η·H_j(1 − H_j)·Σ_{k=1..m} ω_jk·E_k, j = 1, 2, …, l; b_k⁺ = b_k + η·E_k, k = 1, 2, …, m.
Step 7: judge whether the algorithm iteration has converged; if not, return to step 2. The preferred minimum error at which iteration terminates is 0.001.
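Steps 1 to 7 amount to a standard one-hidden-layer back-propagation network with a linear output layer. A minimal sketch follows; the threshold-update signs here follow the usual gradient-descent derivation for the H_j = f(Σω_ij·x_i − a_j) convention, and the choice a = 4 for the hidden-node formula, the learning rate, and the toy data are all assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    # Preferred excitation function f(x) = (1 + e^-x)^-1 (step 2).
    return 1.0 / (1.0 + np.exp(-x))

def train_bp(X, Y, eta=0.05, tol=1e-3, max_iter=3000):
    """Steps 1-7: initialize, forward pass, error, weight and
    threshold updates, repeated until the overall error converges."""
    n, m = X.shape[1], Y.shape[1]
    l = int(np.sqrt(n + m)) + 4              # l = sqrt(n+m) + a, a in 0..10 (step 1)
    w_ij = rng.normal(0, 0.5, (n, l))        # input -> hidden weights
    w_jk = rng.normal(0, 0.5, (l, m))        # hidden -> output weights
    a = np.zeros(l)                          # hidden-layer thresholds
    b = np.zeros(m)                          # output-layer thresholds
    for _ in range(max_iter):
        e_total = 0.0
        for x, y in zip(X, Y):
            H = sigmoid(x @ w_ij - a)        # step 2: hidden output H_j
            O = H @ w_jk - b                 # step 3: output O_k
            E = y - O                        # step 4: node errors e_k
            e_total += 0.5 * (E ** 2).sum()
            delta_h = H * (1 - H) * (w_jk @ E)
            w_jk += eta * np.outer(H, E)     # step 5: weight updates
            w_ij += eta * np.outer(x, delta_h)
            a -= eta * delta_h               # step 6: threshold updates
            b -= eta * E                     # (gradient-descent signs)
        if e_total < tol:                    # step 7: convergence test
            break
    return w_ij, w_jk, a, b

def predict(X, w_ij, w_jk, a, b):
    H = sigmoid(X @ w_ij - a)
    return (H @ w_jk - b).argmax(axis=1)

# Toy demo: three well-separated "speakers", 3-dim feature vectors.
centers = 2.0 * np.eye(3)
X = np.vstack([c + 0.1 * rng.normal(size=(10, 3)) for c in centers])
Y = np.repeat(np.eye(3), 10, axis=0)         # one-hot labels y_i
params = train_bp(X, Y)
acc = (predict(X, *params) == np.repeat(np.arange(3), 10)).mean()
print(acc)
```

Stochastic per-sample updates are used here because the patent's steps update after each input; a batch formulation would accumulate the gradients over all samples before updating.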
In the present invention, the voice acquisition module 1 has a built-in voice acquisition card for collecting and processing the collected voice signals.
In the present invention, the fixed recorder 13 uses a windproof microphone.
In the present invention, the display screen 8 uses a touch screen or a backlit LED display.
In the present invention, several fixed recorders 13 may be provided on the outer shell of the device to strengthen voice recording.
The invention stores enrolled and labeled voice signals long-term; any voice signal stored in the voice pattern class database can be retrieved at any time for comparison and identification against an unknown test voice.
To use the invention, first turn on the power switch 5; the system then runs automatically, the display screen 8 lights up and shows the operating interface, and the user may choose between the two functions, "voice enrollment mode" and "voice test mode".
(1) When voice enrollment is selected, the central processing unit 3 switches the voice input unit 20 into "voice enrollment mode", and the display screen 8 and the loudspeaker 35 give a prompt such as "now in voice enrollment mode, please speak". Voice may be input through any of the microphone 11, the wireless microphone 12, and the fixed recorder 13 provided by the voice acquisition module 1, or through a smartphone. To ensure that the invention can accurately identify and quantify the voice features of the identified object, only one person or object may be enrolled at a time in the "voice enrollment mode" stage. Because the sound signal data produced by the same person when speaking and when singing deviate somewhat in their features, the invention uses a multi-mode enrollment strategy to improve recognition accuracy: the enrolled voice may include a combination of normal speech, singing, high-, middle-, and low-pitched voice, and other states. The recording length is 5 to 30 seconds; the display 8 shows the real-time waveform and a progress bar, and an unsatisfactory recording can be deleted and re-enrolled. After enrollment the data must be labeled manually, e.g. if the voice of Zhang San has been collected, the note "voice of Zhang San" is entered in the dialog box shown on the display screen 8 and saved; the enrolled voice is stored in the memory 33 of the invention.
(2) After voice enrollment, the control system of the invention automatically sends the labeled voice signal into the voice preprocessing unit 21, which converts the voice signal collected by the voice acquisition module 1 into an electric signal, converts the analog signal into a digital signal, and then performs conventional signal processing, including ambient background noise elimination, signal framing, filtering, pre-emphasis, windowing, and endpoint detection.
(3) The preprocessed voice signal is automatically sent into the voice feature extraction unit 22, which extracts the characteristic parameters reflecting the essence of the voice from the preprocessed voice signal and obtains the feature vector x_i. The feature extraction method preferably uses MFCC; the spectral envelope method, LPC interpolation, LPC root-finding, the Hilbert transform method, and the like may also be used to obtain acoustic features. The feature vectors obtained after extraction are automatically saved into the pattern class database, all voice features of one person corresponding to one pattern class. After the voices of N persons have been enrolled, N pattern classes are obtained, each with n characteristic parameters, yielding a database in which each person corresponds to a voice signal pattern class. All data are stored in the memory 33 of the invention; this completes the voice enrollment mode.
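The pattern class database built in step (3), i.e. the labeled set D = {(x_i, y_i)}, can be sketched as a label-to-feature-vectors mapping. The labels and feature values below are toy assumptions, not data from the patent.

```python
import numpy as np

# Pattern class database: label y_i -> list of n-dim feature vectors x_i.
pattern_db: dict[str, list] = {}

def enroll(label: str, feature_vec):
    """Add one enrolled utterance's feature vector to class `label`."""
    pattern_db.setdefault(label, []).append(np.asarray(feature_vec, dtype=float))

# Enroll N = 2 speakers with n = 3 features each (toy values).
enroll("Zhang San", [0.9, 0.1, 0.2])
enroll("Zhang San", [0.8, 0.2, 0.1])   # multi-mode enrollment: several takes
enroll("Li Si",     [0.1, 0.9, 0.3])

# The labeled set D = {(x_i, y_i)} used as classifier training data.
D = [(x, y) for y, xs in pattern_db.items() for x in xs]
print(len(pattern_db), len(D))  # 2 classes, 3 labeled samples
```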
(4) After enrollment, voice testing can be carried out by simply selecting "voice test mode" in the operating interface of the display screen 8. The central processing unit 3 switches the voice input unit 20 into "voice test mode", and the display screen 8 and the loudspeaker 35 give a prompt such as "voice test in progress…". No further operation is required: the invention collects test voice through one or more of the microphone 11, the wireless microphone 12, the fixed recorder 13, and a smartphone together, in real time, with no limitation on duration or number of speakers.
(5) The system device automatically preprocesses the voice data collected in "voice test mode" and extracts its features: the collected test signal is converted into an electric signal, conventionally filtered and de-noised, windowed, and endpoint-detected, and then passed to signal feature extraction.
(6) After features are extracted from the test signal, the invention automatically performs feature matching: the characteristic parameters of the extracted test voice signal are matched in real time against the enrolled, labeled sample voice characteristic parameters in the memory 33, the similarity between the test voice signal and the original voice signals of all enrollments is computed, and the test voice signal is assigned to the pattern class with the highest similarity. Finally the invention outputs an announcement such as "this is the voice of XXX". For example, if the invention has stored the voice feature signal of Zhang San, then when Zhang San speaks or sings to the device it automatically outputs, through recognition, "this is the voice of Zhang San".
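The patent does not fix the similarity measure used in step (6); cosine similarity between feature vectors is one common choice, sketched here against a hypothetical two-speaker pattern database.

```python
import numpy as np

def most_similar(test_vec, pattern_db):
    """Assign a test feature vector to the enrolled pattern class
    with the highest cosine similarity, as in step (6)."""
    def cos(u, v):
        return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))
    scores = {label: cos(test_vec, ref) for label, ref in pattern_db.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]

# Hypothetical database of enrolled per-speaker feature vectors.
db = {
    "Zhang San": np.array([1.0, 0.2, 0.1]),
    "Li Si":     np.array([0.1, 1.0, 0.3]),
}
label, score = most_similar(np.array([0.9, 0.25, 0.1]), db)
print(f"this is the voice of {label}")  # -> this is the voice of Zhang San
```

A deployed system would also need a rejection threshold on the best score so that an unknown speaker is reported as unidentified rather than forced into the nearest class, which matches the patent's handling of unrecognized voices.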
When the invention is tested in public, several objects may speak simultaneously during the same period, so the collected voice signal is a broadband aliased signal. To prevent errors when extracting features from such a signal, the strategy adopted by the invention is to use an intelligent algorithm to first match and identify the voice characteristic parameters of each person speaking alone, identify and store them, and then automatically screen and separate the voice signals of the simultaneous speech. Finally it outputs the recognition result with a prompt such as "this is the combined voice of Zhang San, Li Si, Wang Wu…", and indicates when some voice failed to be identified. The power-off key 6 is pressed to shut the system down.
The invention further provides that the system device can output to the user a list of the recognition results in a multi-speaker environment, including how many persons or objects were speaking at the scene under the test conditions, and can screen a recording in which several people speak at once, isolate what each person said, and filter out the other voices and the ambient sound.
When a voice feature for which the present invention has stored no sample appears in the test speech signal, the present invention automatically records the unknown voice feature signal and asks the user whether to label and store the voice signal of that object.
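The matching-and-rejection logic described above can be sketched as a nearest-match search with an unknown-voice threshold. This is a minimal illustration only: the patent does not name a similarity measure, so cosine similarity, the example labels, and the 0.8 threshold are all assumptions.

```python
import numpy as np

def identify(test_vec, sample_db, threshold=0.8):
    """Match a test feature vector against labeled sample vectors.

    sample_db maps a label ("Zhang San", ...) to a stored feature
    vector.  Returns the best-matching label, or None when the voice
    is unknown and should be recorded for optional labeling.
    """
    best_label, best_sim = None, -1.0
    for label, sample_vec in sample_db.items():
        # cosine similarity between test and sample feature vectors
        sim = np.dot(test_vec, sample_vec) / (
            np.linalg.norm(test_vec) * np.linalg.norm(sample_vec))
        if sim > best_sim:
            best_label, best_sim = label, sim
    if best_sim < threshold:
        return None  # unknown voice: store the feature, prompt the user
    return best_label

db = {"Zhang San": np.array([1.0, 0.2, 0.1]),
      "Li Si": np.array([0.1, 1.0, 0.3])}
print(identify(np.array([0.9, 0.25, 0.1]), db))   # close to Zhang San's sample
print(identify(np.array([-1.0, 0.0, 1.0]), db))   # matches no stored sample
```

A real implementation would compare sequences of frame-level feature vectors rather than single vectors, but the accept/reject structure is the same.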
Brief description of the drawings
Fig. 1 is a structural schematic diagram of the present invention.
Fig. 2 is the system framework diagram of the present invention.
Fig. 3 is a schematic diagram of the multi-layer artificial neural network of the present invention.
Fig. 4 is a flow chart of the improved neural network classification algorithm for voice signals of the present invention.
Embodiment
Fig. 1 shows one embodiment of the present invention; the present embodiment is described with reference to Figs. 1 to 4. It comprises a frame 10 provided with a cavity. Arranged in the frame 10 are a voice acquisition module 1, a voice recognition module 2, a central processing unit 3, a wireless signal transceiver 4, a display screen 8, a memory 33, a network module 31, a RAM card 32, a loudspeaker 35 and a power supply 9. The voice acquisition module 1 comprises a microphone 11, a wireless intercom 12 and a fixed recorder 13; the voice recognition module 2 comprises a voice input unit 20, a voice pretreatment unit 21, a voice feature extraction unit 22, and a feature matching identification and classification unit 23. Voice signals are gathered by the voice acquisition module 1, the collected signals are handled by the voice recognition module 2, and data signals are preserved by the memory 33. The operating process of human-machine interaction and the visualization of results are shown on the display screen 8; the loudspeaker 35 is arranged to give voice prompts for operating steps and to report recognition results; the network module 31 is arranged to connect the present invention with an internet cloud platform; the central processing unit 3 is arranged to perform program control and data operations for the whole system device; the wireless signal transceiver 4 is arranged to receive and transmit the radio signals produced by the wireless intercom 12, a smartphone and the network module 31 and to connect the present invention wirelessly with the internet; and the RAM card 32 is arranged to read previously recorded external voice data into the database of the present invention.
In the present embodiment, the voice input unit 20 is arranged to provide two modes, a "voice enrollment mode" and a "voice test mode". Voice can be input through any one of the microphone 11, wireless intercom 12 and fixed recorder 13 provided by the voice acquisition module 1, or through a smartphone. In "voice enrollment mode", the voice input unit 20 is arranged so that voice can be enrolled for only one person or one object at a time, the enrolled voice being one audio segment of 5 to 30 seconds. The present invention uses a multi-mode voice enrollment strategy, in which the enrolled voice may contain a multi-mode combination of normal speech, singing, or high/medium/low pitch. The display screen 8 shows the speech waveform and a progress bar in real time. After a voice is enrolled, the data must be labeled; the labeling method is manual marking. For example, if Zhang San's voice has been collected, "Zhang San's voice" is entered as a remark in the dialog box shown on the display screen 8 and saved; the enrolled voice is preserved in the memory 33. In "voice test mode", the present invention collects test voice through one or more of the microphone 11, wireless intercom 12, fixed recorder 13 and a smartphone together; the test voice collection process is real-time, with no limitation on the number of people, objects, or time.
In the present embodiment, the voice input unit 20 is arranged to be connected with the voice acquisition module 1; the microphone 11 is connected to the voice acquisition module 1 through an audio cable, and the wireless intercom 12 is connected with the voice acquisition module 1 by radio signal.
In the present embodiment, the voice acquisition module 1 can also use a smartphone for voice signal input: the phone is paired with the voice acquisition module 1 of the present invention, the pairing ways including Bluetooth, infrared, WIFI and scanning a QR code. This realizes voice enrollment with the phone used as a wireless microphone, which is more convenient for multi-person voice input.
In the present embodiment, the voice pretreatment unit 21 changes the voice signal collected by the voice acquisition module 1 into an electric signal, i.e. converts the analog signal into a digital signal, and then carries out conventional signal processing, including ambient background noise elimination, signal framing, filtering, pre-emphasis, windowing and endpoint detection.
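The conventional pretreatment chain of unit 21 can be sketched as follows: pre-emphasis, framing, Hamming windowing, and a simple short-time-energy endpoint detector. The frame length of 256 samples, frame shift of 128, pre-emphasis coefficient 0.97 and the 10% energy threshold are illustrative assumptions, not values given by the patent.

```python
import numpy as np

def preprocess(signal, frame_len=256, frame_shift=128, alpha=0.97):
    """Pre-emphasize, frame and window a digitized voice signal."""
    # pre-emphasis: y[t] = x[t] - alpha * x[t-1] boosts high frequencies
    emphasized = np.append(signal[0], signal[1:] - alpha * signal[:-1])
    # split into overlapping frames
    n_frames = 1 + (len(emphasized) - frame_len) // frame_shift
    frames = np.stack([emphasized[i * frame_shift:i * frame_shift + frame_len]
                       for i in range(n_frames)])
    # apply a Hamming window to each frame
    return frames * np.hamming(frame_len)

def endpoint_detect(frames, ratio=0.1):
    """Keep frames whose short-time energy exceeds a fraction of the peak."""
    energy = (frames ** 2).sum(axis=1)
    return frames[energy > ratio * energy.max()]

rng = np.random.default_rng(0)
sig = np.concatenate([0.01 * rng.standard_normal(512),              # leading silence
                      np.sin(2 * np.pi * 0.05 * np.arange(1024))])  # "speech"
frames = preprocess(sig)
voiced = endpoint_detect(frames)
print(frames.shape, voiced.shape)
```

Background noise elimination and filtering would normally precede this chain; they are omitted here to keep the sketch short.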
In the present embodiment, the voice feature extraction unit 22 is arranged to extract from the original voice signal the main characteristic parameters reflecting the essence of the voice, forming a feature vector x_i, x_i = (x_i1, x_i2, …, x_ij, …, x_in)^T, where x_ij represents the j-th speech characteristic value of the i-th object or person. The characteristic parameter extraction method preferably uses the Mel-frequency cepstral coefficient method (MFCC); acoustic features can also be obtained with the spectral envelope method, LPC interpolation, LPC root-finding, the Hilbert transform method, and so on. The feature vectors obtained after extraction are automatically saved by the system into the pattern class database, all sound features of one object or person corresponding to one pattern class. After the voices of N people or objects have been enrolled, N pattern classes are obtained; if each pattern class has n characteristic parameters, an n-dimensional feature space is formed, i.e. the labeled feature signal set can be written as D = {(x_1, y_1), (x_2, y_2), …, (x_i, y_i), …, (x_N, y_N)}, where x_i ∈ χ = R^n, x_i representing the voice feature signal of the i-th enrolled object or person, y_i ∈ Y = {1, 2, …, N}, y_i representing the i-th person or object, and N representing the number of enrolled persons or objects. The labeled voice feature data form the pattern class database and are stored in the memory 33 of the present invention.
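The preferred MFCC extraction can be sketched per windowed frame as power spectrum → mel filterbank → log → discrete cosine transform, with the result collected into the labeled set D. The 26-filter bank, 512-point FFT, 8 kHz sample rate and 12 coefficients are conventional choices assumed for illustration; the patent does not fix these values.

```python
import numpy as np

def mel_filterbank(n_filters=26, n_fft=512, sr=8000):
    """Triangular filters spaced evenly on the mel scale."""
    mel = lambda f: 2595 * np.log10(1 + f / 700.0)
    imel = lambda m: 700 * (10 ** (m / 2595.0) - 1)
    pts = imel(np.linspace(mel(0), mel(sr / 2), n_filters + 2))
    bins = np.floor((n_fft + 1) * pts / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        fb[i - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)   # rising edge
        fb[i - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)   # falling edge
    return fb

def mfcc(frame, n_coef=12, sr=8000):
    """Mel-frequency cepstral coefficients for one windowed frame."""
    power = np.abs(np.fft.rfft(frame, 512)) ** 2                 # power spectrum
    logmel = np.log(mel_filterbank(sr=sr) @ power + 1e-10)       # log mel energies
    n = len(logmel)
    # DCT-II of the log filterbank energies gives the cepstral coefficients
    dct = np.cos(np.pi / n * (np.arange(n)[:, None] + 0.5) * np.arange(n_coef))
    return logmel @ dct

# labeled feature set D = {(x_i, y_i)}: one feature vector per enrolled voice
enrolled = {1: np.sin(0.1 * np.arange(256)), 2: np.sin(0.3 * np.arange(256))}
D = [(mfcc(frame), label) for label, frame in enrolled.items()]
print(len(D), D[0][0].shape)
```

In practice each enrolled voice yields a sequence of such vectors (one per frame) rather than a single vector, but the pattern-class bookkeeping is the same.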
In the present embodiment, the feature matching identification and classification unit 23 is arranged to use an intelligent multi-class classifier, the learning algorithm of the classifier being an improved neural network classification algorithm. The enrolled and labeled voice feature signal set is used as training data; the network model learns from the training data, obtains the classification rules and completes the training of the classifier. The trained classifier then performs intelligent classification and identification of unknown test speech signals. After features are extracted from the test signal, the present invention automatically performs feature matching: the characteristic parameters of the extracted test speech signal are matched in real time against the labeled sample voice characteristic parameters enrolled in the memory 33 of the present invention, the similarity between the test speech signal and every enrolled sample speech signal is calculated, and the test speech signal is then assigned to the sample signal pattern class with the highest similarity. Finally the present invention outputs the recognition result, a report such as "this is XXX's voice". For example, if the present invention has stored Zhang San's voice feature signal, then when Zhang San speaks or sings to the device, the present invention automatically calculates that the test speech characteristic parameters are most similar to the enrolled and labeled voice signal of Zhang San and, after recognition, outputs "this is Zhang San's voice".
In the present embodiment, the multi-class classifier uses a multi-layer artificial neural network structure, characterized in that one end of the network is defined as the input layer, the other end as the output layer, and the part between the input layer and the output layer as the hidden layers. The input layer receives the external input signal and passes it to all neurons of the hidden layer; after the hidden layer has computed, the result is passed to the output layer, which, after receiving the signal from the hidden layer and computing, outputs the classification result, i.e. the recognition result. The preferred number of hidden layers of the present invention is 1 to 200.
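The input → hidden → output computation described above can be illustrated with a minimal forward-pass sketch using one hidden layer, the sigmoid excitation f(x) = (1 + e^(-x))^(-1) preferred below, and a linear output layer; the layer sizes are arbitrary illustrative values.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))          # f(x) = (1 + e^-x)^-1

def forward(x, w_ih, a, w_ho, b):
    """One forward pass: input layer -> hidden layer -> output layer.

    w_ih[i, j]: weight from input node i to hidden node j,
    a[j]: hidden threshold, w_ho[j, k]: hidden-to-output weight,
    b[k]: output threshold.
    """
    H = sigmoid(x @ w_ih - a)                 # hidden layer output H_j
    O = H @ w_ho - b                          # output layer output O_k
    return H, O

rng = np.random.default_rng(1)
n, l, m = 12, 8, 3                            # input, hidden, output node counts
H, O = forward(rng.standard_normal(n),
               rng.standard_normal((n, l)), rng.standard_normal(l),
               rng.standard_normal((l, m)), rng.standard_normal(m))
print(H.shape, O.shape)                       # recognized class: O.argmax()
```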
In the present embodiment, the process of training the improved artificial neural network classification algorithm is as follows:
Step 1: Network initialization. The algorithm database is continually updated according to the number of enrolled voice signals; when the voice signals of N objects have been enrolled, N pattern classes are formed and the sample space (X, Y) is obtained, the i-th sample group being (X_i, Y_i), where X_i represents the feature vector set extracted for the i-th object and Y_i represents the label of the i-th object. According to the system input-output sequence (X, Y), the number of input layer nodes n, hidden layer nodes l and output layer nodes m is determined, where n is determined by the number of characteristic values obtained in input signal feature extraction, m is determined by the number of stored speech pattern classes, and the reference value of l is l = √(n + m) + a, where a ranges from 0 to 10 and is determined automatically by the model. The connection weights ω_ij between the input layer and hidden layer neurons and the connection weights ω_jk between the hidden layer and output layer neurons are initialized, the hidden layer thresholds a and output layer thresholds b are initialized, and the learning rate η and the neuron excitation function are given.
Step 2: Calculate the hidden layer output. According to the input vector X, the connection weights ω_ij between the input layer and hidden layer neurons, and the hidden layer thresholds a, the hidden layer output H is calculated. The output of the j-th hidden layer node is H_j = f(Σ_{i=1..n} ω_ij·x_i − a_j), j = 1, 2, …, l, where l is the number of hidden layer nodes and f is the hidden layer excitation function; this function has many forms, and the present invention preferably uses f(x) = (1 + e^(−x))^(−1).
Step 3: Calculate the output layer output. According to the hidden layer output H, the connection weights ω_jk between the hidden layer and output layer neurons, and the output layer thresholds b, the output layer output O is calculated. The output of the k-th output layer node is O_k = Σ_{j=1..l} H_j·ω_jk − b_k, k = 1, 2, …, m, where m is the number of output layer nodes, b_k is the threshold of the k-th output layer node and H_j is the output value of the j-th hidden layer node.
Step 4: Calculate the prediction error. From the network output O and the expected output Y (the true value), the total network prediction error e = Σ_{k=1..m} e_k is calculated, where e_k = ½(Y_k − O_k)² is the error produced by the k-th output layer node.
Step 5: Update the weights. The network connection weights ω_jk and ω_ij are updated according to the total prediction error e: ω_jk⁺ = ω_jk + η·H_j·E_k, where j = 1, 2, …, l, k = 1, 2, …, m, η is the learning rate, and E_k = Y_k − O_k represents the sensitivity of the total network error to output layer node k; ω_ij⁺ = ω_ij + η·H_j·(1 − H_j)·x_i·Σ_{k=1..m} ω_jk·E_k, where i = 1, 2, …, n, j = 1, 2, …, l.
Step 6: Update the thresholds. The hidden layer thresholds a and output layer thresholds b are updated according to the total prediction error e: a_j⁺ = a_j + η·H_j·(1 − H_j)·Σ_{k=1..m} ω_jk·E_k, j = 1, 2, …, l; b_k⁺ = b_k + η·E_k, k = 1, 2, …, m.
Step 7: Judge whether the algorithm iteration has converged; if not, return to Step 2. The preferred minimum error of the present invention for terminating the iteration is 0.001.
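Steps 1 to 7 can be sketched end to end as a small training loop. The toy two-class data, the layer size, and the learning rate are illustrative assumptions; note also that the threshold updates below use the gradient-descent sign, whereas Step 6 writes them with a plus sign.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train(X, Y, l=8, eta=0.1, eps=1e-3, max_iter=5000):
    """BP training per Steps 1-7: sigmoid hidden layer, linear output layer."""
    n, m = X.shape[1], Y.shape[1]
    rng = np.random.default_rng(0)
    w_ih = rng.uniform(-0.5, 0.5, (n, l)); a = np.zeros(l)   # Step 1: initialize
    w_ho = rng.uniform(-0.5, 0.5, (l, m)); b = np.zeros(m)
    for _ in range(max_iter):
        e = 0.0
        for x, y in zip(X, Y):
            H = sigmoid(x @ w_ih - a)         # Step 2: hidden layer output
            O = H @ w_ho - b                  # Step 3: output layer output
            E = y - O                         # Step 5 sensitivity E_k = Y_k - O_k
            e += 0.5 * np.sum(E ** 2)         # Step 4: prediction error
            grad_h = H * (1 - H) * (w_ho @ E)
            w_ho += eta * np.outer(H, E)      # Step 5: update weights
            w_ih += eta * np.outer(x, grad_h)
            a -= eta * grad_h                 # Step 6: thresholds (descent sign)
            b -= eta * E
        if e < eps:                           # Step 7: convergence check
            break
    return (w_ih, a, w_ho, b), e

# two toy "pattern classes" with one-hot labels y_i
X = np.array([[0.0, 0.0], [0.0, 0.2], [1.0, 1.0], [0.9, 1.0]])
Y = np.array([[1, 0], [1, 0], [0, 1], [0, 1]], float)
params, err = train(X, Y)
print(round(err, 4))
```

After training, a test feature vector is classified by running the forward pass and taking the output node with the largest response.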
In the present embodiment, the voice acquisition module 1 has a built-in voice acquisition card for collecting and processing the gathered voice signals.
In the present embodiment, the fixed recorder 13 uses a wind-proof microphone.
In the present embodiment, the display screen 8 uses a touch screen or an LED display with backlight.
In the present embodiment, multiple fixed recorders 13 can be provided, arranged on the shell of the present invention, to increase the voice recording strength.
The present invention has a long-term storage function for enrolled and labeled voice signals; every voice signal stored in the voice pattern class database of the present invention can be retrieved at any time for contrast and identification against unknown test speech.
The process for using the present invention is as follows. First the power switch 5 is turned on; the system then runs automatically, the display screen 8 lights up and shows the operation interface, and the user can choose between the two functions "voice enrollment mode" and "voice test mode".
(1) When voice enrollment is selected, the central processing unit 3 controls the voice input unit 20 to enter "voice enrollment mode"; the display screen 8 and the loudspeaker 35 simultaneously give a prompt such as "now in voice enrollment mode, please speak". The user can input voice through any one of the microphone 11, wireless intercom 12 and fixed recorder 13 provided by the voice acquisition module 1, or through a smartphone. To ensure that the present invention can accurately identify and quantify the voice features of the identified object, voice can be enrolled for only one person or one object at a time in the "voice enrollment mode" stage. Because the sound signal data produced by the same person when speaking and when singing show a certain feature deviation, the present invention uses a multi-mode voice enrollment strategy to improve the accuracy of voice signal identification: the enrolled voice may contain a multi-mode combination of sounds under normal speech, singing, high/medium/low pitch and other states. The recording time is 5 to 30 seconds; the display screen 8 shows the real-time voice waveform and a progress bar; if the recorded voice is unsatisfactory it can be deleted and enrolled again. After a voice is enrolled, the data must be labeled; the labeling method is manual marking. For example, if Zhang San's voice has been collected, "Zhang San's voice" is entered as a remark in the dialog box shown on the display screen 8 and saved; the enrolled voice is stored in the memory 33 of the present invention.
(2) After voice signal enrollment, the control system of the present invention automatically sends the labeled voice signal into the voice pretreatment unit 21, which changes the voice signal collected by the voice acquisition module 1 into an electric signal, converts the analog signal into a digital signal, and then carries out conventional signal processing, including ambient background noise elimination, signal framing, filtering, pre-emphasis, windowing and endpoint detection.
(3) The control system of the present invention automatically sends the pretreated voice signal into the voice feature extraction unit 22, which extracts from the pretreated voice signal the characteristic parameters reflecting the essence of the voice and obtains the feature vector x_i. The characteristic parameter extraction method preferably uses the Mel-frequency cepstral coefficient method (MFCC); acoustic features can also be obtained with the spectral envelope method, LPC interpolation, LPC root-finding, the Hilbert transform method, and so on. The feature vectors obtained after feature extraction are automatically saved into the pattern class database, all sound features of one person corresponding to one pattern class. After the voices of N people have been enrolled, N pattern classes are obtained; if each pattern class has n characteristic parameters, a database in which each person corresponds to a voice signal pattern class is thereby obtained. All data are stored in the memory 33 of the present invention. This completes the voice signal enrollment mode.
(4) After voice enrollment, voice testing can be carried out. To carry out a voice test, it is only necessary to select "voice test mode" in the operation interface of the display screen 8; the central processing unit 3 controls the voice input unit 20 to enter "voice test mode", and the display screen 8 and the loudspeaker 35 simultaneously give a prompt such as "voice test in progress ...". The user need take no further action; the present invention collects test voice through one or more of the microphone 11, wireless intercom 12, fixed recorder 13 and a smartphone in the voice acquisition module 1. The test voice collection process is real-time, with no limitation on time or on the number of people.
(5) For the speech data collected in "voice test mode", the system device of the present invention automatically carries out pretreatment and feature extraction of the test speech signal: the collected voice test signal is converted into an electric signal, conventional filtering is carried out, and after noise removal, windowing and endpoint detection, signal feature extraction is performed.
(6) After features are extracted from the test signal, the present invention automatically performs feature matching: the characteristic parameters of the extracted test speech signal are matched in real time against the labeled sample voice characteristic parameters enrolled in the memory 33 of the present invention, the similarity between the test speech signal and every enrolled sample speech signal is calculated, and the test speech signal is assigned to the pattern class with the highest similarity. Finally the present invention outputs a report such as "this is XXX's voice". For example, if the present invention has stored Zhang San's voice feature signal, then when Zhang San speaks or sings to the device, the present invention automatically outputs "this is Zhang San's voice" after recognition.
When the present invention is tested in a public place, multiple objects may speak simultaneously in the test environment during the same period, i.e. the collected voice signal is a broadband aliased signal. To prevent errors when extracting features from such a signal, the strategy adopted by the present invention is: using an intelligent algorithm, first match and identify the voice characteristic parameters of each person speaking alone and store them; the system then automatically screens and separates the voice signal of the simultaneous speech, finally outputs the recognition result with a prompt such as "this is the combined voice of Zhang San, Li Si, Wang Wu ...", and indicates that certain voices failed to be identified. The power-off key 6 is pressed to shut down the system.
In the present embodiment, the system device can also output to the user a list of recognition results in a multi-speaker environment, including how many people or objects were speaking at the scene in the test environment, and can screen and play the recording of simultaneous speech, separate out what each person said, and filter out other people's voices and background sound.
When a voice feature for which the present invention has stored no sample appears in the test speech signal, the present invention automatically records the unknown voice feature signal and asks the user whether to label and store the voice signal of that object.
In the technical field of intelligent sound signal type recognition system devices: every device comprising a frame 10 provided with a cavity, in which are arranged a voice acquisition module 1, a voice recognition module 2, a central processing unit 3, a wireless signal transceiver 4, a display screen 8, a memory 33, a network module 31, a RAM card 32, a loudspeaker 35 and a power supply 9; the voice acquisition module 1 comprising a microphone 11, a wireless intercom 12 and a fixed recorder 13; the voice recognition module 2 comprising a voice input unit 20, a voice pretreatment unit 21, a voice feature extraction unit 22 and a feature matching identification and classification unit 23; voice signals gathered by the voice acquisition module 1, the collected signals handled by the voice recognition module 2, data signals preserved by the memory 33, the operating process of human-machine interaction and the visualization of results shown on the display screen 8; the loudspeaker 35 arranged to give voice prompts for operating steps and report recognition results; the network module 31 arranged to connect the present invention with an internet cloud platform; the central processing unit 3 arranged to perform program control and data operations for the whole system device; the wireless signal transceiver 4 arranged to receive and transmit the radio signals produced by the wireless intercom 12, a smartphone and the network module 31 and to connect the present invention wirelessly with the internet; and the RAM card 32 arranged to read recorded external voice data into the database of the present invention — all such technical contents are within the protection scope of the present invention. It should be noted that the scope of the present invention is not limited by appearance: the shape of the frame 10 of the present invention can be square, cylindrical, polygonal prismatic, or other shapes such as those resembling a Chinese cabbage, a watermelon or a stone, and every device whose shape differs but whose essential technical contents are identical to the technical contents of the present invention is also within the protection scope of the present invention. Likewise, obvious minor routine improvements or minor combinations made by those skilled in the art on the basis of the present invention are also within the scope of the present invention, as long as their technical contents fall within the technical contents described herein.
Claims (8)
- 1. An intelligent sound signal type recognition system device, characterized in that it comprises a frame (10) provided with a cavity, in which are arranged a voice acquisition module (1), a voice recognition module (2), a central processing unit (3), a wireless signal transceiver (4), a display screen (8), a memory (33), a network module (31), a RAM card (32), a loudspeaker (35) and a power supply (9); the voice acquisition module (1) comprises a microphone (11), a wireless intercom (12) and a fixed recorder (13); the voice recognition module (2) comprises a voice input unit (20), a voice pretreatment unit (21), a voice feature extraction unit (22) and a feature matching identification and classification unit (23); voice signals are gathered by the voice acquisition module (1), the collected signals are handled by the voice recognition module (2), and data signals are preserved by the memory (33); the operating process of human-machine interaction and the visualization of results are shown on the display screen (8); the loudspeaker (35) is arranged to give voice prompts for operating steps and to report recognition results; the network module (31) is arranged to connect the present invention with an internet cloud platform; the central processing unit (3) is arranged to perform program control and data operations for the whole system device; the wireless signal transceiver (4) is arranged to receive and transmit the radio signals produced by the wireless intercom (12), a smartphone and the network module (31) and to connect the present invention wirelessly with the internet; and the RAM card (32) is arranged to read recorded external voice data into the database of the present invention.
- 2. The intelligent sound signal type recognition system device according to claim 1, characterized in that the voice input unit (20) is arranged to provide two modes, a "voice enrollment mode" and a "voice test mode"; voice can be input through any one of the microphone (11), wireless intercom (12) and fixed recorder (13) provided by the voice acquisition module (1), or through a smartphone; in "voice enrollment mode", the voice input unit (20) is arranged so that voice can be enrolled for only one person or one object at a time, the enrolled voice being one audio segment of 5 to 30 seconds; the present invention uses a multi-mode voice enrollment strategy, in which the enrolled voice may contain a multi-mode combination of normal speech, singing, or high/medium/low pitch; the display screen (8) shows the speech waveform and a progress bar in real time; after a voice is enrolled the data must be labeled, the labeling method being manual marking, with a remark such as "XXX's voice" entered in the dialog box shown on the display screen (8) and saved, the enrolled voice being stored in the memory (33); in "voice test mode", test voice is collected through one or more of the microphone (11), wireless intercom (12), fixed recorder (13) and a smartphone together, the test voice collection process being real-time, with no limitation on the number of people, objects, or time; the smartphone is arranged to pair wirelessly with the voice acquisition module (1), the pairing ways including Bluetooth, infrared, WIFI and scanning a QR code, realizing voice enrollment with the phone used as a wireless microphone for multi-person voice input.
- 3. The intelligent sound signal type recognition system device according to claim 1, characterized in that the voice pretreatment unit (21) changes the voice signal collected by the voice acquisition module (1) into an electric signal, i.e. converts the analog signal into a digital signal, and then carries out conventional signal processing, including ambient background noise elimination, signal framing, filtering, pre-emphasis, windowing and endpoint detection.
- 4. The intelligent sound signal type recognition system device according to claim 1, characterized in that the voice feature extraction unit (22) is arranged to extract from the original voice signal the main characteristic parameters reflecting the essence of the voice, forming a feature vector x_i, x_i = (x_i1, x_i2, …, x_ij, …, x_in)^T, where x_ij represents the j-th speech characteristic value of the i-th object or person; the characteristic parameter extraction method uses the Mel-frequency cepstral coefficient method (MFCC), and acoustic features can also be obtained with the spectral envelope method, LPC interpolation, LPC root-finding, the Hilbert transform method, and so on; the feature vectors obtained after extraction are automatically saved into the pattern class database, all sound features of one object or person corresponding to one pattern class; after the voices of N people or objects have been enrolled, N pattern classes are obtained, and if each pattern class has n characteristic parameters an n-dimensional feature space is formed, i.e. the labeled feature signal set can be written as D = {(x_1, y_1), (x_2, y_2), …, (x_i, y_i), …, (x_N, y_N)}, where x_i ∈ χ = R^n, x_i representing the voice feature signal of the i-th enrolled object or person, y_i ∈ Y = {1, 2, …, N}, y_i representing the i-th person or object, and N representing the number of enrolled persons or objects; the labeled voice feature data form the pattern class database and are stored in the memory (33) of the present invention.
- 5. The intelligent sound signal type recognition system device according to claim 1, characterized in that the feature matching identification and classification unit (23) is arranged to use an intelligent multi-class classifier, the learning algorithm of the classifier being an improved neural network classification algorithm; the enrolled and labeled voice feature signal set is used as training data, the network model learns from the training data, obtains the classification rules and completes the training of the classifier; the trained classifier then performs intelligent classification and identification of unknown test speech signals; after features are extracted from the test signal, the present invention automatically performs feature matching, matching the characteristic parameters of the extracted test speech signal in real time against the labeled sample voice characteristic parameters enrolled in the memory (33) of the present invention, calculating the similarity between the test speech signal and every enrolled sample speech signal, then assigning the test speech signal to the sample signal pattern class with the highest similarity, and finally outputting the recognition result, a report such as "this is XXX's voice".
- 6. The intelligent sound signal type recognition system device according to claim 1, characterized in that the multi-class classifier uses a multi-layer artificial neural network structure, in which one end of the network is defined as the input layer, the other end as the output layer, and the part between the input layer and the output layer as the hidden layers; the input layer receives the external input signal and passes it to all neurons of the hidden layer; after the hidden layer has computed, the result is passed to the output layer, which, after receiving the signal from the hidden layer and computing, outputs the classification result, i.e. the recognition result; the preferred number of hidden layers of the present invention is 1 to 200.
- 7. The intelligent sound signal type recognition system device according to claim 1, characterized in that: the improved ANN training proceeds as follows:

Step 1: Network initialization. The algorithm database is updated continuously as voice signals are recorded. Once the voice signals of N objects have been recorded, N pattern classes are formed, giving a sample space (X, Y) whose i-th sample is (X_i, Y_i), where X_i is the set of feature vectors extracted for the i-th object and Y_i is the label of the i-th object. From the system input-output sequence (X, Y), the number of input-layer nodes n, hidden-layer nodes l, and output-layer nodes m are determined: n is the number of feature values produced by input-signal feature extraction, m is the number of stored speech pattern classes, and l is taken by the reference formula l = sqrt(n + m) + a, where a ranges over 0 to 10 and is determined automatically by the model. The connection weights ω_ij between input-layer and hidden-layer neurons and ω_jk between hidden-layer and output-layer neurons are initialized, as are the hidden-layer thresholds a and the output-layer thresholds b; a learning rate η and a neuron excitation function are given.

Step 2: Compute the hidden-layer output. From the input vector X, the input-to-hidden connection weights ω_ij, and the hidden-layer thresholds a, the hidden-layer output H is computed. The output of the j-th hidden node is H_j = f(Σ_{i=1..n} ω_ij·x_i − a_j), j = 1, 2, …, l, where l is the number of hidden nodes and f is the hidden-layer excitation function. Many excitation functions exist; the present invention preferably uses f(x) = (1 + e^(−x))^(−1).

Step 3: Compute the output-layer output. From the hidden-layer output H, the hidden-to-output connection weights ω_jk, and the output-layer thresholds b, the output O is computed. The output of the k-th output node is O_k = Σ_{j=1..l} H_j·ω_jk − b_k, k = 1, 2, …, m, where m is the number of output nodes, b_k is the threshold of the k-th output node, and H_j is the output value of the j-th hidden node.

Step 4: Compute the prediction error. From the network output O and the desired output Y (the true values), the total prediction error e = Σ_{k=1..m} e_k is computed, where e_k = ½(Y_k − O_k)² is the error contributed by the k-th output node.

Step 5: Update the weights. According to the total prediction error e, the connection weights are updated as ω_jk ← ω_jk + η·H_j·E_k, j = 1, 2, …, l, k = 1, 2, …, m, where η is the learning rate and E_k = Y_k − O_k is the sensitivity of the network's total error to output node k; and ω_ij ← ω_ij + η·H_j·(1 − H_j)·x_i·Σ_{k=1..m} ω_jk·E_k, i = 1, 2, …, n, j = 1, 2, …, l.

Step 6: Update the thresholds. According to the total prediction error e, the hidden-layer thresholds are updated as a_j ← a_j + η·H_j·(1 − H_j)·Σ_{k=1..m} ω_jk·E_k, j = 1, 2, …, l, and the output-layer thresholds as b_k ← b_k + η·E_k, k = 1, 2, …, m.

Step 7: Test whether the algorithm iteration has converged; if not, return to Step 2. Iteration preferably terminates when the error falls below a minimum of 0.001.
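Steps 2 through 6 of the training loop above can be sketched in NumPy. This is a minimal single-sample illustration under the claim's conventions (sigmoid hidden layer, linear output minus threshold); note that the claim prints "+" for both threshold updates, whereas a strict gradient-descent derivation under the "minus threshold" convention gives the signs used below, which are what make the sketch converge.

```python
import numpy as np

def bp_train_step(x, y, W_ih, a, W_ho, b, eta=0.1):
    """One iteration of the BP update in claim 7 (steps 2-6), in place.
    Symbols follow the claim: W_ih ~ omega_ij, W_ho ~ omega_jk."""
    f = lambda z: 1.0 / (1.0 + np.exp(-z))
    H = f(x @ W_ih - a)            # step 2: hidden output H_j
    O = H @ W_ho - b               # step 3: network output O_k
    E = y - O                      # step 4: E_k = Y_k - O_k
    # back-propagated hidden sensitivity: H_j(1-H_j) * sum_k w_jk E_k
    g = H * (1 - H) * (W_ho @ E)
    # step 5: weight updates
    W_ho += eta * np.outer(H, E)   # w_jk += eta * H_j * E_k
    W_ih += eta * np.outer(x, g)
    # step 6: threshold updates (gradient-descent signs, see lead-in)
    a -= eta * g
    b -= eta * E
    return 0.5 * float(np.sum(E ** 2))  # total prediction error e
```

Step 7's convergence check is then a loop that repeats until the returned error drops below the claim's preferred minimum of 0.001.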
- 8. The intelligent sound signal type recognition system device according to any one of claims 1 to 7, characterized in that the basic operating flow is set as follows:

1) The power switch (5) is turned on and the system runs automatically; the display screen (8) lights up and shows the operating interface, from which the user can select either of two functions, "voice typing mode" and "voice testing mode". When voice typing is selected, the central processing unit (3) controls the voice input unit (20) to enter "voice typing mode", while the display screen (8) and the loudspeaker (35) give a prompt such as "Now in voice typing mode, please speak". The user may input voice through any one of the microphone (11), wireless interphone (12), fixed recorder (13), or smartphone provided by the voice acquisition module (1). To ensure that the invention can accurately identify and quantify the voice features of the enrolled object, only one person or object may be recorded at a time in the "voice typing mode" stage. Because the sound signal of the same person exhibits a certain feature deviation between speaking and singing, the invention adopts a multi-mode voice typing strategy to improve recognition accuracy: the recorded voice may combine normal speech, singing, and high-, middle-, and low-pitch and other states. The recording lasts 5 to 30 seconds; the display (8) shows the real-time waveform and a progress bar, and an unsatisfactory recording can be deleted and re-recorded. After typing, the voice must be labelled; labelling is done manually, e.g. if the voice of Zhang San has been collected, the remark "voice of Zhang San" is entered in the dialog box shown on the display screen (8) and saved. The recorded voice is stored in the memory (33) of the invention.

2) After voice signal typing, the control system of the invention automatically sends the labelled voice signal to the voice pre-processing unit (21). The voice pre-processing unit (21) converts the voice signal collected by the voice acquisition module (1) into an electric signal, converts the analog signal into a digital signal, and then performs conventional signal processing, including ambient background-noise elimination, signal framing, filtering, pre-emphasis, windowing, and endpoint detection.

3) The control system automatically sends the pre-processed voice signal to the signal feature extraction unit (22), which extracts from it the characteristic parameters reflecting the essence of the voice, yielding a feature vector x_i. Characteristic parameter extraction preferably uses Mel-frequency cepstral coefficients (MFCC); the spectral envelope method, LPC interpolation, LPC root-finding, the Hilbert transform method, etc. may also be used to obtain acoustic features. The feature vectors obtained are automatically saved to the pattern class library. All the sound features of one person correspond to one pattern class, so after the voices of N people have been recorded, N pattern classes are obtained, each with n characteristic parameters, giving a database in which each person corresponds to a voice-signal pattern class. All data are stored in the memory (33) of the invention; with that, voice typing mode is complete.

4) After voice typing, voice testing can be carried out simply by selecting "voice testing mode" on the operating interface of the display screen (8). The central processing unit (3) controls the voice input unit (20) to enter "voice testing mode", and the display screen (8) and the loudspeaker (35) give a prompt such as "Voice testing in progress…". No further operation is required: the invention collects test voice in real time through one or more of the microphone (11), wireless interphone (12), fixed recorder (13), and smartphone in the voice acquisition module (1), without any limit on duration or number of takes.

5) The system automatically pre-processes the voice data collected in "voice testing mode" and extracts its features: the collected test voice signal is converted into an electric signal and subjected to conventional filtering, noise removal, windowing, and endpoint detection before signal feature extraction.

6) After feature extraction from the test signal, the invention performs feature matching automatically: the characteristic parameters of the extracted test voice signal are matched in real time against the labelled sample voice parameters stored in the memory (33), the similarity between the test signal and every enrolled original voice signal is calculated, and the test signal is assigned to the pattern class with the highest similarity. The invention then reports the result outwardly, e.g. "This is XXX's voice". For example, if the voice feature signal of Zhang San has been stored and Zhang San speaks or sings to the device, the invention automatically outputs "This is the voice of Zhang San" after recognition.

7) The system device can also output a recognition-result list for multi-speaker environments, including how many people or objects are speaking at the scene under the test conditions, and can screen a recording of several people speaking simultaneously, separate out what each individual said, and filter out the other voices and ambient sound. When a test voice signal contains sample voice features not stored by the invention, the unknown voice signal features are recorded automatically, and the user is reminded to decide whether to label and store the voice signal of that object.
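The conventional pre-processing front end named in steps 2) and 5) can be sketched as follows. This is an illustration, not the patented implementation; the frame length, hop, and pre-emphasis coefficient are assumed values (25 ms frames, 10 ms hop at 16 kHz), and noise elimination and endpoint detection are omitted.

```python
import numpy as np

def preprocess(signal, frame_len=400, hop=160, alpha=0.97):
    """Pre-emphasis, framing, and windowing of a digitized voice signal."""
    # Pre-emphasis: s'[t] = s[t] - alpha * s[t-1]
    emphasized = np.append(signal[0], signal[1:] - alpha * signal[:-1])
    # Split into overlapping frames
    n_frames = 1 + max(0, (len(emphasized) - frame_len) // hop)
    frames = np.stack([emphasized[i * hop : i * hop + frame_len]
                       for i in range(n_frames)])
    # Apply a Hamming window to each frame before feature extraction
    return frames * np.hamming(frame_len)
```

Each windowed frame would then be passed to the feature extraction unit, e.g. for MFCC computation.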
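The feature-matching step in 6) can be illustrated with a toy nearest-pattern-class search. The claim only requires that "a similarity" be computed; cosine similarity is an assumed choice here, and the dictionary-of-labels database layout is hypothetical.

```python
import numpy as np

def best_match(test_vec, pattern_db):
    """Return the label of the pattern class most similar to test_vec.
    pattern_db maps a label to the list of feature vectors enrolled
    for that person; the best per-class score is kept."""
    def cos(u, v):
        return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))
    scores = {label: max(cos(test_vec, v) for v in vecs)
              for label, vecs in pattern_db.items()}
    return max(scores, key=scores.get)
```

In the claim's example, a test vector closest to the enrolled features of Zhang San would yield the label used in the report "This is the voice of Zhang San".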
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201711253194.6A CN107808659A (en) | 2017-12-02 | 2017-12-02 | Intelligent sound signal type recognition system device |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201711253194.6A CN107808659A (en) | 2017-12-02 | 2017-12-02 | Intelligent sound signal type recognition system device |
Publications (1)
Publication Number | Publication Date |
---|---|
CN107808659A true CN107808659A (en) | 2018-03-16 |
Family
ID=61589300
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201711253194.6A Pending CN107808659A (en) | 2017-12-02 | 2017-12-02 | Intelligent sound signal type recognition system device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107808659A (en) |
Cited By (20)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108520752A (en) * | 2018-04-25 | 2018-09-11 | 西北工业大学 | A kind of method for recognizing sound-groove and device |
CN108564954A (en) * | 2018-03-19 | 2018-09-21 | 平安科技(深圳)有限公司 | Deep neural network model, electronic device, auth method and storage medium |
CN108597521A (en) * | 2018-05-04 | 2018-09-28 | 徐涌 | Audio role divides interactive system, method, terminal and the medium with identification word |
CN108877823A (en) * | 2018-07-27 | 2018-11-23 | 三星电子(中国)研发中心 | Sound enhancement method and device |
CN109448726A (en) * | 2019-01-14 | 2019-03-08 | 李庆湧 | A kind of method of adjustment and system of voice control accuracy rate |
CN109611703A (en) * | 2018-10-19 | 2019-04-12 | 宁波市鄞州利帆灯饰有限公司 | A kind of LED light being easily installed |
CN109714491A (en) * | 2019-02-26 | 2019-05-03 | 上海凯岸信息科技有限公司 | Intelligent sound outgoing call detection system based on voice mail |
CN109785855A (en) * | 2019-01-31 | 2019-05-21 | 秒针信息技术有限公司 | Method of speech processing and device, storage medium, processor |
CN109801619A (en) * | 2019-02-13 | 2019-05-24 | 安徽大尺度网络传媒有限公司 | A kind of across language voice identification method for transformation of intelligence |
CN109859763A (en) * | 2019-02-13 | 2019-06-07 | 安徽大尺度网络传媒有限公司 | A kind of intelligent sound signal type recognition system |
CN109936814A (en) * | 2019-01-16 | 2019-06-25 | 深圳市北斗智能科技有限公司 | A kind of intercommunication terminal, speech talkback coordinated dispatching method and its system |
CN110033785A (en) * | 2019-03-27 | 2019-07-19 | 深圳市中电数通智慧安全科技股份有限公司 | A kind of calling for help recognition methods, device, readable storage medium storing program for executing and terminal device |
CN110060717A (en) * | 2019-01-02 | 2019-07-26 | 孙剑 | A kind of law enforcement equipment laws for criterion speech French play system |
CN110289016A (en) * | 2019-06-20 | 2019-09-27 | 深圳追一科技有限公司 | A kind of voice quality detecting method, device and electronic equipment based on actual conversation |
CN111314451A (en) * | 2020-02-07 | 2020-06-19 | 普强时代(珠海横琴)信息技术有限公司 | Language processing system based on cloud computing application |
CN111475206A (en) * | 2019-01-04 | 2020-07-31 | 优奈柯恩(北京)科技有限公司 | Method and apparatus for waking up wearable device |
CN111603191A (en) * | 2020-05-29 | 2020-09-01 | 上海联影医疗科技有限公司 | Voice noise reduction method and device in medical scanning and computer equipment |
CN111674360A (en) * | 2019-01-31 | 2020-09-18 | 青岛科技大学 | Method for establishing distinguishing sample model in vehicle tracking system based on block chain |
CN111989742A (en) * | 2018-04-13 | 2020-11-24 | 三菱电机株式会社 | Speech recognition system and method for using speech recognition system |
CN113572492A (en) * | 2021-06-23 | 2021-10-29 | 力声通信股份有限公司 | Novel communication equipment prevents falling digital intercom |
Citations (21)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JPH11265197A (en) * | 1997-12-13 | 1999-09-28 | Hyundai Electronics Ind Co Ltd | Voice recognizing method utilizing variable input neural network |
US6026358A (en) * | 1994-12-22 | 2000-02-15 | Justsystem Corporation | Neural network, a method of learning of a neural network and phoneme recognition apparatus utilizing a neural network |
CN1662956A (en) * | 2002-06-19 | 2005-08-31 | 皇家飞利浦电子股份有限公司 | Mega speaker identification (ID) system and corresponding methods therefor |
CN1941080A (en) * | 2005-09-26 | 2007-04-04 | 吴田平 | Soundwave discriminating unlocking module and unlocking method for interactive device at gate of building |
CN101419799A (en) * | 2008-11-25 | 2009-04-29 | 浙江大学 | Speaker identification method based mixed t model |
US20100057453A1 (en) * | 2006-11-16 | 2010-03-04 | International Business Machines Corporation | Voice activity detection system and method |
CN103236260A (en) * | 2013-03-29 | 2013-08-07 | 京东方科技集团股份有限公司 | Voice recognition system |
CN103456301A (en) * | 2012-05-28 | 2013-12-18 | 中兴通讯股份有限公司 | Ambient sound based scene recognition method and device and mobile terminal |
CN103619021A (en) * | 2013-12-10 | 2014-03-05 | 天津工业大学 | Neural network-based intrusion detection algorithm for wireless sensor network |
JP2014048534A (en) * | 2012-08-31 | 2014-03-17 | Sogo Keibi Hosho Co Ltd | Speaker recognition device, speaker recognition method, and speaker recognition program |
US20140195236A1 (en) * | 2013-01-10 | 2014-07-10 | Sensory, Incorporated | Speaker verification and identification using artificial neural network-based sub-phonetic unit discrimination |
CN104008751A (en) * | 2014-06-18 | 2014-08-27 | 周婷婷 | Speaker recognition method based on BP neural network |
US20160260428A1 (en) * | 2013-11-27 | 2016-09-08 | National Institute Of Information And Communications Technology | Statistical acoustic model adaptation method, acoustic model learning method suitable for statistical acoustic model adaptation, storage medium storing parameters for building deep neural network, and computer program for adapting statistical acoustic model |
CN106128465A (en) * | 2016-06-23 | 2016-11-16 | 成都启英泰伦科技有限公司 | A kind of Voiceprint Recognition System and method |
CN106227038A (en) * | 2016-07-29 | 2016-12-14 | 中国人民解放军信息工程大学 | Grain drying tower intelligent control method based on neutral net and fuzzy control |
CN106779053A (en) * | 2016-12-15 | 2017-05-31 | 福州瑞芯微电子股份有限公司 | The knowledge point of a kind of allowed for influencing factors and neutral net is known the real situation method |
CN106782603A (en) * | 2016-12-22 | 2017-05-31 | 上海语知义信息技术有限公司 | Intelligent sound evaluating method and system |
CN106875943A (en) * | 2017-01-22 | 2017-06-20 | 上海云信留客信息科技有限公司 | A kind of speech recognition system for big data analysis |
US20170178666A1 (en) * | 2015-12-21 | 2017-06-22 | Microsoft Technology Licensing, Llc | Multi-speaker speech separation |
US20170270919A1 (en) * | 2016-03-21 | 2017-09-21 | Amazon Technologies, Inc. | Anchored speech detection and speech recognition |
CN112541533A (en) * | 2020-12-07 | 2021-03-23 | 阜阳师范大学 | Modified vehicle identification method based on neural network and feature fusion |
-
2017
- 2017-12-02 CN CN201711253194.6A patent/CN107808659A/en active Pending
Patent Citations (21)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6026358A (en) * | 1994-12-22 | 2000-02-15 | Justsystem Corporation | Neural network, a method of learning of a neural network and phoneme recognition apparatus utilizing a neural network |
JPH11265197A (en) * | 1997-12-13 | 1999-09-28 | Hyundai Electronics Ind Co Ltd | Voice recognizing method utilizing variable input neural network |
CN1662956A (en) * | 2002-06-19 | 2005-08-31 | 皇家飞利浦电子股份有限公司 | Mega speaker identification (ID) system and corresponding methods therefor |
CN1941080A (en) * | 2005-09-26 | 2007-04-04 | 吴田平 | Soundwave discriminating unlocking module and unlocking method for interactive device at gate of building |
US20100057453A1 (en) * | 2006-11-16 | 2010-03-04 | International Business Machines Corporation | Voice activity detection system and method |
CN101419799A (en) * | 2008-11-25 | 2009-04-29 | 浙江大学 | Speaker identification method based mixed t model |
CN103456301A (en) * | 2012-05-28 | 2013-12-18 | 中兴通讯股份有限公司 | Ambient sound based scene recognition method and device and mobile terminal |
JP2014048534A (en) * | 2012-08-31 | 2014-03-17 | Sogo Keibi Hosho Co Ltd | Speaker recognition device, speaker recognition method, and speaker recognition program |
US20140195236A1 (en) * | 2013-01-10 | 2014-07-10 | Sensory, Incorporated | Speaker verification and identification using artificial neural network-based sub-phonetic unit discrimination |
CN103236260A (en) * | 2013-03-29 | 2013-08-07 | 京东方科技集团股份有限公司 | Voice recognition system |
US20160260428A1 (en) * | 2013-11-27 | 2016-09-08 | National Institute Of Information And Communications Technology | Statistical acoustic model adaptation method, acoustic model learning method suitable for statistical acoustic model adaptation, storage medium storing parameters for building deep neural network, and computer program for adapting statistical acoustic model |
CN103619021A (en) * | 2013-12-10 | 2014-03-05 | 天津工业大学 | Neural network-based intrusion detection algorithm for wireless sensor network |
CN104008751A (en) * | 2014-06-18 | 2014-08-27 | 周婷婷 | Speaker recognition method based on BP neural network |
US20170178666A1 (en) * | 2015-12-21 | 2017-06-22 | Microsoft Technology Licensing, Llc | Multi-speaker speech separation |
US20170270919A1 (en) * | 2016-03-21 | 2017-09-21 | Amazon Technologies, Inc. | Anchored speech detection and speech recognition |
CN106128465A (en) * | 2016-06-23 | 2016-11-16 | 成都启英泰伦科技有限公司 | A kind of Voiceprint Recognition System and method |
CN106227038A (en) * | 2016-07-29 | 2016-12-14 | 中国人民解放军信息工程大学 | Grain drying tower intelligent control method based on neutral net and fuzzy control |
CN106779053A (en) * | 2016-12-15 | 2017-05-31 | 福州瑞芯微电子股份有限公司 | The knowledge point of a kind of allowed for influencing factors and neutral net is known the real situation method |
CN106782603A (en) * | 2016-12-22 | 2017-05-31 | 上海语知义信息技术有限公司 | Intelligent sound evaluating method and system |
CN106875943A (en) * | 2017-01-22 | 2017-06-20 | 上海云信留客信息科技有限公司 | A kind of speech recognition system for big data analysis |
CN112541533A (en) * | 2020-12-07 | 2021-03-23 | 阜阳师范大学 | Modified vehicle identification method based on neural network and feature fusion |
Non-Patent Citations (4)
Title |
---|
Liu Yongjun et al.: "Research on an Intelligent Grain Control System Based on a Neural Network Algorithm", Computer & Digital Engineering, vol. 44, no. 07, pages 1271-1276 *
Zeng Xiangyang et al.: "Fundamentals of Acoustic Signal Processing", vol. 1, 30 September 2015, Northwestern Polytechnical University Press, pages 160-163 *
Wang Xiaochuan et al.: "Analysis of 43 Cases of MATLAB Neural Networks", vol. 1, 31 August 2013, Beihang University Press, pages 8-10 *
Zhao Li: "Speech Signal Processing", vol. 1, 31 March 2003, China Machine Press, pages 141-145 *
Cited By (26)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108564954A (en) * | 2018-03-19 | 2018-09-21 | 平安科技(深圳)有限公司 | Deep neural network model, electronic device, auth method and storage medium |
CN108564954B (en) * | 2018-03-19 | 2020-01-10 | 平安科技(深圳)有限公司 | Deep neural network model, electronic device, identity verification method, and storage medium |
CN111989742A (en) * | 2018-04-13 | 2020-11-24 | 三菱电机株式会社 | Speech recognition system and method for using speech recognition system |
CN108520752B (en) * | 2018-04-25 | 2021-03-12 | 西北工业大学 | Voiceprint recognition method and device |
CN108520752A (en) * | 2018-04-25 | 2018-09-11 | 西北工业大学 | A kind of method for recognizing sound-groove and device |
CN108597521A (en) * | 2018-05-04 | 2018-09-28 | 徐涌 | Audio role divides interactive system, method, terminal and the medium with identification word |
CN108877823A (en) * | 2018-07-27 | 2018-11-23 | 三星电子(中国)研发中心 | Sound enhancement method and device |
CN109611703A (en) * | 2018-10-19 | 2019-04-12 | 宁波市鄞州利帆灯饰有限公司 | A kind of LED light being easily installed |
CN110060717A (en) * | 2019-01-02 | 2019-07-26 | 孙剑 | A kind of law enforcement equipment laws for criterion speech French play system |
CN111475206B (en) * | 2019-01-04 | 2023-04-11 | 优奈柯恩(北京)科技有限公司 | Method and apparatus for waking up wearable device |
CN111475206A (en) * | 2019-01-04 | 2020-07-31 | 优奈柯恩(北京)科技有限公司 | Method and apparatus for waking up wearable device |
CN109448726A (en) * | 2019-01-14 | 2019-03-08 | 李庆湧 | A kind of method of adjustment and system of voice control accuracy rate |
CN109936814A (en) * | 2019-01-16 | 2019-06-25 | 深圳市北斗智能科技有限公司 | A kind of intercommunication terminal, speech talkback coordinated dispatching method and its system |
CN111674360A (en) * | 2019-01-31 | 2020-09-18 | 青岛科技大学 | Method for establishing distinguishing sample model in vehicle tracking system based on block chain |
CN109785855A (en) * | 2019-01-31 | 2019-05-21 | 秒针信息技术有限公司 | Method of speech processing and device, storage medium, processor |
CN109785855B (en) * | 2019-01-31 | 2022-01-28 | 秒针信息技术有限公司 | Voice processing method and device, storage medium and processor |
CN109859763A (en) * | 2019-02-13 | 2019-06-07 | 安徽大尺度网络传媒有限公司 | A kind of intelligent sound signal type recognition system |
CN109801619A (en) * | 2019-02-13 | 2019-05-24 | 安徽大尺度网络传媒有限公司 | A kind of across language voice identification method for transformation of intelligence |
CN109714491A (en) * | 2019-02-26 | 2019-05-03 | 上海凯岸信息科技有限公司 | Intelligent sound outgoing call detection system based on voice mail |
CN110033785A (en) * | 2019-03-27 | 2019-07-19 | 深圳市中电数通智慧安全科技股份有限公司 | A kind of calling for help recognition methods, device, readable storage medium storing program for executing and terminal device |
CN110289016A (en) * | 2019-06-20 | 2019-09-27 | 深圳追一科技有限公司 | A kind of voice quality detecting method, device and electronic equipment based on actual conversation |
CN111314451A (en) * | 2020-02-07 | 2020-06-19 | 普强时代(珠海横琴)信息技术有限公司 | Language processing system based on cloud computing application |
CN111603191A (en) * | 2020-05-29 | 2020-09-01 | 上海联影医疗科技有限公司 | Voice noise reduction method and device in medical scanning and computer equipment |
CN111603191B (en) * | 2020-05-29 | 2023-10-20 | 上海联影医疗科技股份有限公司 | Speech noise reduction method and device in medical scanning and computer equipment |
CN113572492A (en) * | 2021-06-23 | 2021-10-29 | 力声通信股份有限公司 | Novel communication equipment prevents falling digital intercom |
CN113572492B (en) * | 2021-06-23 | 2022-08-16 | 力声通信股份有限公司 | Communication equipment prevents falling digital intercom |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN107808659A (en) | Intelligent sound signal type recognition system device | |
CN109559736B (en) | Automatic dubbing method for movie actors based on confrontation network | |
CN105374356B (en) | Audio recognition method, speech assessment method, speech recognition system and speech assessment system | |
CN107221320A (en) | Train method, device, equipment and the computer-readable storage medium of acoustic feature extraction model | |
CN107680582A (en) | Acoustic training model method, audio recognition method, device, equipment and medium | |
CN107610707A (en) | A kind of method for recognizing sound-groove and device | |
CN106504768B (en) | Phone testing audio frequency classification method and device based on artificial intelligence | |
CN108305615A (en) | A kind of object identifying method and its equipment, storage medium, terminal | |
CN110838286A (en) | Model training method, language identification method, device and equipment | |
CN108281137A (en) | A kind of universal phonetic under whole tone element frame wakes up recognition methods and system | |
CN108364662B (en) | Voice emotion recognition method and system based on paired identification tasks | |
CN107767869A (en) | Method and apparatus for providing voice service | |
CN110299135A (en) | Intelligent sound signal mode automatic recognition system device | |
CN110428843A (en) | A kind of voice gender identification deep learning method | |
CN110610709A (en) | Identity distinguishing method based on voiceprint recognition | |
CN105679313A (en) | Audio recognition alarm system and method | |
CN104903954A (en) | Speaker verification and identification using artificial neural network-based sub-phonetic unit discrimination | |
CN112259104B (en) | Training device for voiceprint recognition model | |
CN110047506B (en) | Key audio detection method based on convolutional neural network and multi-core learning SVM | |
CN109271533A (en) | A kind of multimedia document retrieval method | |
CN108876951A (en) | A kind of teaching Work attendance method based on voice recognition | |
CN109473119A (en) | A kind of acoustic target event-monitoring method | |
CN108806694A (en) | A kind of teaching Work attendance method based on voice recognition | |
CN107507625A (en) | Sound source distance determines method and device | |
CN103811000A (en) | Voice recognition system and voice recognition method |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination |