CN106653020A - Multi-business control method and system for smart sound and video equipment based on deep learning - Google Patents

Multi-business control method and system for smart sound and video equipment based on deep learning Download PDF

Info

Publication number
CN106653020A
CN106653020A (application CN201611144430.6A)
Authority
CN
China
Prior art keywords
control signal
mfcc
depth
audio
information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201611144430.6A
Other languages
Chinese (zh)
Inventor
曾旭龙
林格
陈小燕
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
National Sun Yat Sen University
Original Assignee
National Sun Yat Sen University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by National Sun Yat Sen University
Priority to CN201611144430.6A
Publication of CN106653020A
Legal status: Pending

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/02 - Feature extraction for speech recognition; Selection of recognition unit
    • G10L 15/06 - Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/063 - Training
    • G10L 15/08 - Speech classification or search
    • G10L 15/16 - Speech classification or search using artificial neural networks
    • G10L 15/22 - Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L 15/28 - Constructional details of speech recognition systems
    • G10L 15/34 - Adaptation of a single recogniser for parallel processing, e.g. by use of multiple processors or cloud computing
    • G10L 25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/03 - Speech or voice analysis techniques characterised by the type of extracted parameters
    • G10L 25/24 - Speech or voice analysis techniques in which the extracted parameters are the cepstrum
    • G10L 25/27 - Speech or voice analysis techniques characterised by the analysis technique
    • G10L 25/30 - Speech or voice analysis techniques using neural networks

Abstract

An embodiment of the invention discloses a deep-learning-based multi-service control method and system for smart audio-visual devices. In the method, a speech preprocessing module extracts features from a voice control signal to obtain raw MFCC speech features; a remote GPU server receives the raw MFCC features and derives deep speech features from them; an Internet connection module passes the resulting control-signal identification information to a control-signal parsing module, which generates a control-signal code from the identification information, selects the corresponding control-signal output module, and passes the code to that module; the output module then sends a control signal to the smart audio-visual device according to the code. The method can control many kinds of smart audio-visual devices that use different control protocols and implement different services, and provides a more unified and natural mode of human-machine interaction and control for such devices.

Description

Deep-learning-based multi-service control method and system for smart audio-visual devices
Technical field
The present invention relates to the field of multi-service control of smart audio-visual devices, and in particular to a deep-learning-based multi-service control method and system for smart audio-visual devices.
Background art
With the progress of Internet of Things and artificial intelligence technology, smart audio-visual device technology has developed rapidly. More and more smart audio-visual devices are being designed and produced, implementing a variety of multimedia services to meet the different needs of people's daily lives. Devices designed and produced by different vendors have different control and human-machine interaction modes: they may use control channels such as infrared, Bluetooth, or Z-Wave, and interact through voice, gesture, touch, and so on. This lack of uniformity in control and interaction raises the threshold for users learning to operate smart audio-visual devices and easily leads to a poor user experience. Providing a unified, convenient, and natural control and interaction mode for these devices across multiple service scenarios is therefore a problem in urgent need of a solution.
Deep learning is a subfield of artificial intelligence. In recent years, with advances in technologies such as graphics processing units (GPUs) and cloud computing, deep learning research has achieved breakthroughs, and the introduction of deep learning techniques has produced rapid progress in fields such as computer vision and speech recognition. This also brings new ideas to smart audio-visual device control technology.
An existing natural-interaction smart-home system based on audio and video [1] collects sound and image information with a microphone and a camera, performs signal processing in an information fusion module, derives useful commands with machine learning methods, and then sends control signals through a control-signal transmission module.
That system uses voice, gesture, face, and motion information for control, and therefore cannot offer the user a single simple, unified interaction mode, which leads to a high learning cost and a poor user experience. It uses conventional machine learning methods to recognise multimedia information such as speech and images, so its recognition rate is relatively low and its robustness is poor. Moreover, its speech and image recognition programs run locally, which increases the user's hardware and energy costs.
Summary of the invention
It is an object of the invention to overcome the deficiencies of the prior art. The invention provides a deep-learning-based multi-service control method and system for smart audio-visual devices that can control many kinds of smart audio-visual devices based on different control protocols and implementing different services, and that provides a more unified and more natural mode of human-machine interaction and control for them.
To solve the above problems, the present invention proposes a deep-learning-based multi-service control method for smart audio-visual devices, the method comprising:
a microphone array monitors, at a fixed frequency, and collects the voice control signals issued by the user;
a speech preprocessing module extracts features from the voice control signal to obtain raw Mel-frequency cepstral coefficient (MFCC) speech features; it checks whether the log energy of the raw MFCC features exceeds a threshold, and if so, sends the raw MFCC features to a remote graphics processing unit (GPU) server via an Internet connection module;
the remote GPU server receives the raw MFCC features, derives deep speech features from them, and sends the control-signal identification information corresponding to the deep features to the Internet connection module;
the Internet connection module passes the control-signal identification information to a control-signal parsing module; the parsing module generates a control-signal code from the identification information, selects the corresponding control-signal output module, and passes the code to that output module;
the control-signal output module sends a control signal to the smart audio-visual device according to the control-signal code.
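The gating decision in the second step above (forward features to the server only when their log energy clears a threshold) can be sketched in a few lines. The threshold value, frame contents, and function names below are illustrative assumptions, not values from the patent.

```python
import math

# Client-side energy gate: MFCC frames are forwarded to the remote GPU server
# only when their log energy exceeds a threshold, so silence and background
# noise never leave the device.
ENERGY_THRESHOLD = -3.0  # assumed value; tuned per deployment in practice

def log_energy(frame):
    """Log energy of one audio frame (a list of samples)."""
    energy = sum(s * s for s in frame)
    return math.log(energy + 1e-10)  # epsilon avoids log(0) on silent frames

def should_upload(frame, threshold=ENERGY_THRESHOLD):
    """True if this frame's log energy clears the gate (send MFCCs upstream)."""
    return log_energy(frame) > threshold

loud = [0.5, -0.4, 0.6, -0.5]        # voiced speech
quiet = [0.001, -0.001, 0.001, 0.0]  # near-silence
print(should_upload(loud), should_upload(quiet))  # True False
```

In a real deployment the gate would run per frame over the MFCC analysis windows; here a single toy frame stands in for a windowed stream.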
Preferably, the step in which the speech preprocessing module extracts features from the voice control signal to obtain the raw MFCC speech features comprises:
performing endpoint detection and segmentation on the voice control signal;
performing noise reduction on the segmented voice control signal;
performing raw MFCC feature extraction on the denoised voice control signal to obtain the raw MFCC speech features.
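The three preprocessing stages can be illustrated as a small pipeline. The stage bodies below are toy stand-ins invented for the sketch: a real system would use a proper voice-activity detector, spectral noise reduction, and an MFCC front end rather than the amplitude gates shown here.

```python
def endpoint_detect(samples, gate=0.01):
    """Toy endpoint detection: trim leading/trailing samples below a gate."""
    voiced = [i for i, s in enumerate(samples) if abs(s) >= gate]
    if not voiced:
        return []
    return samples[voiced[0]:voiced[-1] + 1]

def denoise(samples, floor=0.02):
    """Toy noise reduction: zero out samples below a noise floor."""
    return [s if abs(s) >= floor else 0.0 for s in samples]

def preprocess(samples):
    """Endpoint detection, then noise reduction (MFCC extraction omitted)."""
    return denoise(endpoint_detect(samples))

raw = [0.0, 0.0, 0.5, 0.01, -0.6, 0.0]
print(preprocess(raw))  # [0.5, 0.0, -0.6]
```

The composition order mirrors the text: segmentation first, noise reduction on the segmented signal, and feature extraction (not shown) on the denoised result.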
Preferably, the step in which the remote GPU server receives the raw MFCC speech features and performs deep feature extraction on them to obtain the deep speech features comprises:
the remote GPU server receives the raw MFCC features, starts a deep-learning speech recognition program, and applies a bidirectional long short-term memory (biLSTM) recurrent neural network to the raw MFCC features to extract the deep speech features.
Preferably, the step in which the remote GPU server receives the raw MFCC features, derives deep speech features from them, and sends the corresponding control-signal identification information to the Internet connection module comprises:
the remote GPU server receives the raw MFCC features, performs deep feature extraction on them to obtain the deep speech features, and sends the control-signal identification information corresponding to the deep features to the Internet connection module;
the remote GPU server classifies the deep speech features, obtains the class corresponding to the deep features, and checks whether that class corresponds to a control-signal identifier; if so, it returns the control-signal identification information to the Internet connection module.
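The check described above (does the predicted class correspond to a control-signal identifier?) amounts to a lookup with an error path. The signal names, class numbering, and error marker below are invented for illustration; only the T-commands-plus-one-fallback shape comes from the patent.

```python
# Hypothetical server-side lookup: the classifier emits one of T+1 class
# labels; T of them map to control-signal identifiers, and the extra class
# signals that no device command was recognised.
CLASS_TO_SIGNAL = {          # assumed identifiers for illustration
    0: "TV_POWER_ON",
    1: "TV_VOLUME_UP",
    2: "SPEAKER_NEXT_TRACK",
}
DEFAULT_CLASS = 3  # the (T+1)-th class: "not a command"

def resolve(class_id):
    """Return a control-signal identifier, or an error marker otherwise."""
    if class_id == DEFAULT_CLASS or class_id not in CLASS_TO_SIGNAL:
        return "ERROR_UNRECOGNISED"
    return CLASS_TO_SIGNAL[class_id]

print(resolve(1), resolve(3))  # TV_VOLUME_UP ERROR_UNRECOGNISED
```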
Correspondingly, the present invention also provides a deep-learning-based multi-service control system for smart audio-visual devices, the system comprising: a microphone array, a speech preprocessing module, a remote GPU server, an Internet connection module, a control-signal parsing module, and control-signal output modules; wherein,
the microphone array monitors at a fixed frequency and collects the voice control signals issued by the user;
the speech preprocessing module extracts features from the voice control signal to obtain raw MFCC speech features, checks whether the log energy of the raw MFCC features exceeds a threshold, and if so sends the raw MFCC features to the remote GPU server via the Internet connection module;
the remote GPU server receives the raw MFCC features, derives deep speech features from them, and sends the control-signal identification information corresponding to the deep features to the Internet connection module;
the Internet connection module passes the control-signal identification information to the control-signal parsing module; the parsing module generates a control-signal code from the identification information, selects the corresponding control-signal output module, and passes the code to that output module;
the control-signal output module sends a control signal to the smart audio-visual device according to the control-signal code.
Preferably, the speech preprocessing module comprises:
a segmentation unit for performing endpoint detection and segmentation on the voice control signal;
a noise-reduction unit for performing noise reduction on the segmented voice control signal;
an extraction unit for performing raw MFCC feature extraction on the denoised voice control signal to obtain the raw MFCC speech features.
Preferably, the remote GPU server receives the raw MFCC features, starts a deep-learning speech recognition program, and applies the biLSTM algorithm to the raw MFCC features to extract the deep speech features.
Preferably, the remote GPU server receives the raw MFCC features, performs deep feature extraction on them to obtain the deep speech features, and sends the control-signal identification information corresponding to the deep features to the Internet connection module;
the remote GPU server classifies the deep speech features, obtains the class corresponding to the deep features, and checks whether that class corresponds to a control-signal identifier; if so, it returns the control-signal identification information to the Internet connection module.
By implementing this embodiment of the invention, natural speech can be used to control many kinds of smart audio-visual devices that are based on different control protocols and implement different services, giving those devices a unified, natural, efficient, and low-cost mode of human-machine interaction. Deploying the complex deep-learning tasks on a remote server reduces the user's hardware and energy costs while providing a high-performance, low-cost voice-command recognition service for smart audio-visual devices, and improves the recognition accuracy of their voice control commands.
Brief description of the drawings
To explain the embodiments of the invention or the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. The drawings described below are clearly only some embodiments of the invention; those of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a flow chart of the deep-learning-based multi-service control method for smart audio-visual devices according to an embodiment of the invention;
Fig. 2 is a schematic diagram of the deep-learning speech recognition model in an embodiment of the invention;
Fig. 3 is a schematic diagram of the structure of the deep-learning-based multi-service control system for smart audio-visual devices according to an embodiment of the invention.
Detailed description of the embodiments
The technical solutions in the embodiments of the invention are described below clearly and completely with reference to the accompanying drawings. The described embodiments are clearly only some, not all, of the embodiments of the invention; all other embodiments obtained by those of ordinary skill in the art without creative effort, based on the embodiments of the invention, fall within the scope of protection of the invention.
Fig. 1 is a flow chart of the deep-learning-based multi-service control method for smart audio-visual devices according to an embodiment of the invention. As shown in Fig. 1, the method comprises:
S1: the microphone array monitors at a fixed frequency and collects the voice control signals issued by the user;
S2: the speech preprocessing module extracts features from the voice control signal to obtain raw MFCC speech features, and checks whether the log energy of the raw MFCC features exceeds a threshold; if so, it sends the raw MFCC features to the remote GPU server via the Internet connection module; if not, the method returns to S1;
S3: the remote GPU server receives the raw MFCC features, derives deep speech features from them, and sends the control-signal identification information corresponding to the deep features to the Internet connection module;
S4: the Internet connection module passes the control-signal identification information to the control-signal parsing module; the parsing module generates a control-signal code from the identification information, selects the corresponding control-signal output module, and passes the code to that output module;
S5: the control-signal output module sends a control signal to the smart audio-visual device according to the control-signal code.
The process in which the speech preprocessing module extracts features from the voice control signal and obtains the raw MFCC speech features includes:
performing endpoint detection and segmentation on the voice control signal;
performing noise reduction on the segmented voice control signal;
performing raw MFCC feature extraction on the denoised voice control signal to obtain the raw MFCC speech features.
Specifically, in S3 the remote GPU server receives the raw MFCC features, starts the deep-learning speech recognition program, and applies the biLSTM algorithm to the raw MFCC features to extract the deep speech features.
Further, the remote GPU server receives the raw MFCC features, performs deep feature extraction on them to obtain the deep speech features, and sends the control-signal identification information corresponding to the deep features to the Internet connection module;
the remote GPU server classifies the deep speech features, obtains the class corresponding to the deep features, and checks whether that class corresponds to a control-signal identifier; if so, it returns the control-signal identification information to the Internet connection module; if not, it returns an error identifier to the Internet connection module.
In this embodiment of the invention, as shown in Fig. 2, the main structure of the deep-learning speech recognition model consists of a biLSTM, composed of one forward and one backward long short-term memory recurrent neural network, and a softmax classifier. The input of the model is the MFCC speech features sent from the local Internet connection unit; its output is one of T+1 class identifiers. These identifiers comprise T classes corresponding to the T control signals supported by the system, plus one Default class. If the model outputs the Default class, the MFCC speech features do not correspond to any control signal for a smart audio-visual device. The model is produced in advance in a training and generation phase, and is then deployed on the remote GPU server to provide the user with a voice-command recognition service for smart audio-visual devices.
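The model just described (a forward and a backward LSTM whose outputs feed a softmax over T+1 classes) can be sketched as an untrained forward pass. The dimensions, random initialisation, and the choice to use the two final hidden states as the utterance feature are assumptions of this sketch, not details from the patent.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def lstm_step(x, h, c, W, U, b):
    """One LSTM time step; gate pre-activations packed as [i, f, o, g]."""
    H = h.shape[0]
    z = W @ x + U @ h + b
    i, f, o = sigmoid(z[:H]), sigmoid(z[H:2 * H]), sigmoid(z[2 * H:3 * H])
    g = np.tanh(z[3 * H:])
    c_new = f * c + i * g
    return o * np.tanh(c_new), c_new

def run_lstm(frames, params):
    """Run one direction over the utterance; return the final hidden state."""
    W, U, b = params
    H = U.shape[1]
    h, c = np.zeros(H), np.zeros(H)
    for x in frames:
        h, c = lstm_step(x, h, c, W, U, b)
    return h

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

D, H, N_CLASSES = 13, 8, 4  # 13 MFCCs per frame; 3 commands + Default (assumed)

def init_params():
    return (rng.normal(size=(4 * H, D)) * 0.1,   # input weights
            rng.normal(size=(4 * H, H)) * 0.1,   # recurrent weights
            np.zeros(4 * H))                     # biases

fwd, bwd = init_params(), init_params()
W_out = rng.normal(size=(N_CLASSES, 2 * H)) * 0.1

frames = [rng.normal(size=D) for _ in range(20)]        # one utterance of MFCCs
feature = np.concatenate([run_lstm(frames, fwd),        # forward direction
                          run_lstm(frames[::-1], bwd)]) # backward direction
probs = softmax(W_out @ feature)                        # T+1 class probabilities
print(probs.shape)  # (4,)
```

The concatenated forward/backward final states play the role of the "deep speech features" of the text; the softmax row with the highest probability would be the predicted class identifier.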
In a specific implementation, the training and generation process of the deep-learning speech recognition model is as follows:
Step 1: according to the kinds of smart audio-visual devices to be supported and the service functions those devices implement, simulate real device-usage scenarios and collect a large number of speech fragments with the microphone array;
Step 2: manually label each speech fragment with its corresponding control-signal class;
Step 3: extract MFCC speech features from all speech fragments with the speech preprocessing module, obtaining a labelled control-speech feature data set;
Step 4: partition the data set, taking a certain amount of the labelled control-speech feature data to form the training set (Training Set) and a certain amount to serve as the validation set (Validation Set);
Step 5: randomly initialise all parameters of the deep-learning speech recognition model;
Step 6: with the training set as input, perform the deep-learning forward propagation pass;
Step 7: perform the deep-learning backward pass using backpropagation through time (BPTT), updating all parameters of the model;
Step 8: if the number of completed cycles reaches the validation period, verify the current deep-learning speech recognition model with the validation set;
Step 9: stop training if the stopping condition is reached; otherwise return to Step 6. The stopping condition may be that the number of training iterations reaches a given value, or that the validation error falls below a given value.
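Steps 5 through 9 can be condensed into a loop skeleton. The forward and BPTT passes below are deliberately stubbed (a toy "loss" that shrinks as parameters shrink), and every numeric choice is illustrative; only the control flow (init, forward, update, periodic validation, stop condition) mirrors the text.

```python
import random

def train(train_set, val_set, max_epochs=50, target_val_error=0.05,
          validate_every=5):
    random.seed(0)
    params = [random.uniform(-0.1, 0.1) for _ in range(4)]  # Step 5: init

    def forward(batch, params):                 # Step 6 (stub forward pass)
        return sum(abs(p) for p in params) / (1 + len(batch))

    def bptt_update(params, loss, lr=0.5):      # Step 7 (stub: shrink params)
        return [p * (1 - lr) for p in params]

    val_error = float("inf")
    for epoch in range(1, max_epochs + 1):
        loss = forward(train_set, params)
        params = bptt_update(params, loss)
        if epoch % validate_every == 0:         # Step 8: periodic validation
            val_error = forward(val_set, params)
        if val_error < target_val_error:        # Step 9: stop condition
            return epoch, val_error
    return max_epochs, val_error                # alternative stop: epoch cap

epochs_run, err = train(train_set=[1, 2, 3], val_set=[4, 5])
print(epochs_run < 50 and err < 0.05)  # True
```

Both stopping conditions named in Step 9 appear: the early return when the validation error clears the target, and the epoch cap as the fallback.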
Correspondingly, an embodiment of the invention also provides a deep-learning-based multi-service control system for smart audio-visual devices. As shown in Fig. 3, the system comprises: a microphone array 1, a speech preprocessing module 2, a remote GPU server 3, an Internet connection module 4, a control-signal parsing module 5, and control-signal output modules 6; wherein,
the microphone array 1 monitors at a fixed frequency and collects the voice control signals issued by the user;
the speech preprocessing module 2 extracts features from the voice control signal to obtain raw MFCC speech features, checks whether the log energy of the raw MFCC features exceeds a threshold, and if so sends the raw MFCC features to the remote GPU server 3 via the Internet connection module 4;
the remote GPU server 3 receives the raw MFCC features, derives deep speech features from them, and sends the control-signal identification information corresponding to the deep features to the Internet connection module 4;
the Internet connection module 4 passes the control-signal identification information to the control-signal parsing module 5; the parsing module 5 generates a control-signal code from the identification information, selects the corresponding control-signal output module 6, and passes the code to that output module;
the control-signal output module 6 sends a control signal to the smart audio-visual device according to the control-signal code.
In this embodiment, the microphone array 1 collects the voice signal issued by the user in real time and sends it to the speech preprocessing module 2.
The speech preprocessing module 2 is responsible for endpoint detection, noise reduction, and raw MFCC feature extraction on the voice signal.
The Internet connection module 4 is responsible for establishing a network connection with the remote GPU server 3, sending the raw MFCC features to the server, and receiving the feedback messages returned by it.
The control-signal parsing module 5 is responsible for parsing the feedback messages from the remote GPU server 3 and, according to the message content, enabling the corresponding control-signal output module 6 or performing error handling.
There are multiple control-signal output modules 6; each is equipped with hardware supporting one wireless communication mode and is responsible for controlling all smart audio-visual devices based on that mode. These communication modes include infrared, Bluetooth, Z-Wave, and so on.
The remote GPU server 3 provides the user with the voice-command recognition service for smart audio-visual devices.
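The behaviour of modules 5 and 6 (parse the recognised signal, pick the output module for the protocol the target device speaks, hand it the code, or fall to error handling) can be sketched as a lookup table. All identifiers, protocol names, and codes below are invented for illustration; appending to a list stands in for the actual infrared/Bluetooth/Z-Wave transmission hardware.

```python
OUTPUT_MODULES = {"infrared": [], "bluetooth": [], "zwave": []}

SIGNAL_TABLE = {  # control-signal id -> (protocol, encoded command); assumed
    "TV_POWER_ON":  ("infrared",  0x20DF10EF),
    "SPEAKER_PLAY": ("bluetooth", 0x01),
    "LIGHT_OFF":    ("zwave",     0x00),
}

def dispatch(signal_id):
    """Encode the signal and hand it to the matching output module."""
    if signal_id not in SIGNAL_TABLE:
        return False  # error-handling path for unknown / Default results
    protocol, code = SIGNAL_TABLE[signal_id]
    OUTPUT_MODULES[protocol].append(code)  # stands in for radio/IR transmit
    return True

dispatch("TV_POWER_ON")
print(OUTPUT_MODULES["infrared"])  # [551489775]
```

One table row per supported command keeps the parsing module protocol-agnostic: adding a device on a new protocol only means registering another output module and its rows.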
Further, the speech preprocessing module 2 comprises:
a segmentation unit for performing endpoint detection and segmentation on the voice control signal;
a noise-reduction unit for performing noise reduction on the segmented voice control signal;
an extraction unit for performing raw MFCC feature extraction on the denoised voice control signal to obtain the raw MFCC speech features.
The remote GPU server 3 receives the raw MFCC features, starts the deep-learning speech recognition program, and applies the biLSTM algorithm to the raw MFCC features to extract the deep speech features.
The remote GPU server 3 receives the raw MFCC features, performs deep feature extraction on them to obtain the deep speech features, and sends the control-signal identification information corresponding to the deep features to the Internet connection module 4;
the remote GPU server 3 classifies the deep speech features, obtains the class corresponding to the deep features, and checks whether that class corresponds to a control-signal identifier; if so, it returns the control-signal identification information to the Internet connection module 4.
Specifically, for the operating principles of the functional modules of the system of this embodiment, see the corresponding description of the method embodiment; they are not repeated here.
By implementing this embodiment of the invention, natural speech can be used to control many kinds of smart audio-visual devices that are based on different control protocols and implement different services, giving those devices a unified, natural, efficient, and low-cost mode of human-machine interaction. Deploying the complex deep-learning tasks on a remote server reduces the user's hardware and energy costs while providing a high-performance, low-cost voice-command recognition service for smart audio-visual devices, and improves the recognition accuracy of their voice control commands.
Those of ordinary skill in the art will appreciate that all or part of the steps of the methods of the above embodiments can be completed by a program instructing the relevant hardware; the program can be stored in a computer-readable storage medium, which may include read-only memory (ROM), random-access memory (RAM), a magnetic disk, an optical disc, and so on.
The deep-learning-based multi-service control method and system for smart audio-visual devices provided by the embodiments of the invention have been described in detail above. Specific examples have been used here to explain the principles and implementation of the invention, and the description of the above embodiments is only intended to help in understanding the method of the invention and its core ideas. At the same time, those skilled in the art will, following the ideas of the invention, make changes to the specific implementation and scope of application. In summary, the content of this specification should not be construed as limiting the invention.

Claims (8)

1. A multi-service control method for smart audio-visual equipment based on deep learning, characterized in that the method comprises:
a microphone array monitoring and collecting, at a specific frequency, an audio control signal issued by a user;
a speech preprocessing module performing feature extraction on the audio control signal to obtain MFCC raw speech feature information, and detecting whether the logarithmic energy of the MFCC raw speech features exceeds a threshold; if so, transmitting the MFCC raw speech feature information to a remote GPU server through an internet connection module;
the remote GPU server receiving the MFCC raw speech feature information, obtaining deep speech feature information from the MFCC raw speech feature information, and sending control signal identification information corresponding to the deep feature information to the internet connection module;
the internet connection module passing the control signal identification information to a control signal parsing module; the control signal parsing module generating a control signal code according to the control signal identification information, selecting a corresponding control signal output module, and passing the control signal code to that control signal output module;
the control signal output module sending a control signal to the smart audio-visual equipment according to the control signal code.
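The parsing and dispatch step of claim 1 can be sketched as follows. The table of identifiers, codes, and output channels (infrared, Wi-Fi) is purely illustrative: the claims do not specify any concrete identifiers or encodings.

```python
# Hypothetical sketch of the control signal parsing module of claim 1:
# it maps a control signal identifier received from the remote GPU server
# to a control signal code and selects an output module. All identifiers,
# codes, and channel names below are illustrative assumptions.

SIGNAL_TABLE = {
    "TV_POWER_ON":  {"code": 0x01, "output": "infrared"},
    "TV_VOLUME_UP": {"code": 0x02, "output": "infrared"},
    "SPEAKER_PLAY": {"code": 0x10, "output": "wifi"},
}

def parse_control_signal(identifier):
    """Generate (control_code, output_module) for a recognized identifier."""
    entry = SIGNAL_TABLE.get(identifier)
    if entry is None:
        return None  # unknown identifier: no control signal is emitted
    return entry["code"], entry["output"]
```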
2. The multi-service control method for smart audio-visual equipment based on deep learning according to claim 1, characterized in that the step of the speech preprocessing module performing feature extraction on the audio control signal to obtain the MFCC raw speech feature information comprises:
performing endpoint detection and segmentation on the audio control signal;
performing noise reduction on the segmented audio control signal;
performing MFCC raw speech feature extraction on the noise-reduced audio control signal to obtain the MFCC raw speech feature information.
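A minimal numpy sketch of the front end described in claims 1 and 2: framing the signal and computing each frame's log energy (the quantity the claims compare against a threshold), then keeping only above-threshold frames as a crude endpoint detector. Frame length, hop, and threshold values are illustrative assumptions; a full MFCC front end would continue with mel filtering and a DCT.

```python
import numpy as np

def frame_log_energy(signal, frame_len=400, hop=160):
    """Split a 1-D signal into overlapping frames; return each frame's log energy."""
    n_frames = 1 + max(0, (len(signal) - frame_len) // hop)
    energies = np.empty(n_frames)
    for i in range(n_frames):
        frame = signal[i * hop : i * hop + frame_len]
        energies[i] = np.log(np.sum(frame ** 2) + 1e-10)  # epsilon avoids log(0)
    return energies

def voiced_frames(signal, threshold=-5.0, frame_len=400, hop=160):
    """Crude endpoint detection: indices of frames whose log energy
    exceeds the threshold (the energy test of claim 1)."""
    e = frame_log_energy(signal, frame_len, hop)
    return np.nonzero(e > threshold)[0]
```

With 16 kHz audio, `frame_len=400` and `hop=160` correspond to the common 25 ms window / 10 ms hop choice.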
3. The multi-service control method for smart audio-visual equipment based on deep learning according to claim 1, characterized in that the step of the remote GPU server receiving the MFCC raw speech feature information, performing deep speech feature extraction on the MFCC raw speech feature information, and obtaining the deep speech feature information comprises:
the remote GPU server receiving the MFCC raw speech feature information, starting a deep-learning speech recognition program, and performing deep speech feature extraction on the MFCC raw speech feature information using a BiLSTM algorithm to obtain the deep speech feature information.
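Claims 3 and 7 name a BiLSTM (bidirectional long short-term memory) network as the deep feature extractor. A toy numpy forward pass illustrates the idea: one LSTM is run over the MFCC frames left-to-right, another right-to-left, and the two hidden sequences are concatenated per frame. The random weights are placeholders; the patent's actual trained parameters and framework are not disclosed here.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_pass(xs, W, U, b, hidden):
    """Single-direction LSTM over a sequence of feature vectors."""
    h = np.zeros(hidden)
    c = np.zeros(hidden)
    outs = []
    for x in xs:
        z = W @ x + U @ h + b                 # all four gate pre-activations
        i, f, o, g = np.split(z, 4)
        i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)
        c = f * c + i * np.tanh(g)            # cell state update
        h = o * np.tanh(c)                    # hidden state
        outs.append(h)
    return outs

def bilstm_features(mfcc_frames, hidden=8, seed=0):
    """Concatenate forward and backward LSTM hidden states per frame."""
    rng = np.random.default_rng(seed)
    d = mfcc_frames.shape[1]
    params = lambda: (rng.normal(0, 0.1, (4 * hidden, d)),
                      rng.normal(0, 0.1, (4 * hidden, hidden)),
                      np.zeros(4 * hidden))
    fwd = lstm_pass(mfcc_frames, *params(), hidden)
    bwd = lstm_pass(mfcc_frames[::-1], *params(), hidden)[::-1]
    return np.stack([np.concatenate([f, b]) for f, b in zip(fwd, bwd)])
```

Each output frame thus sees both past and future context, which is the motivation for using a bidirectional rather than a unidirectional LSTM on short spoken commands.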
4. The multi-service control method for smart audio-visual equipment based on deep learning according to claim 1, characterized in that the step of the remote GPU server receiving the MFCC raw speech feature information, obtaining the deep speech feature information from the MFCC raw speech feature information, and sending the control signal identification information corresponding to the deep feature information to the internet connection module comprises:
the remote GPU server receiving the MFCC raw speech feature information, performing deep speech feature extraction on the MFCC raw speech feature information, and obtaining the deep speech feature information;
the remote GPU server classifying the deep speech feature information to obtain the class corresponding to the deep speech feature information, and detecting whether that class corresponds to a control signal identifier; if so, returning the control signal identification information to the internet connection module.
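Claims 4 and 8 classify the deep feature vector and then check whether the predicted class actually corresponds to a control signal identifier before replying. A softmax sketch under assumed weights and an assumed class list (index 0 as a "no command" class, the rest hypothetical):

```python
import numpy as np

# Hypothetical class list: only some classes map to control signals;
# class 0 stands for background / unrecognized speech.
CLASS_TO_SIGNAL = {1: "TV_POWER_ON", 2: "TV_VOLUME_UP", 3: "SPEAKER_PLAY"}

def classify(feature, weights, bias):
    """Softmax classification of a pooled deep feature vector."""
    logits = weights @ feature + bias
    probs = np.exp(logits - logits.max())   # subtract max for stability
    probs /= probs.sum()
    return int(np.argmax(probs)), probs

def control_signal_id(feature, weights, bias):
    """Return a control signal identifier if the predicted class has one,
    else None (mirroring the 'if so, return ...' test of claim 4)."""
    cls, _ = classify(feature, weights, bias)
    return CLASS_TO_SIGNAL.get(cls)
```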
5. A multi-service control system for smart audio-visual equipment based on deep learning, characterized in that the system comprises: a microphone array, a speech preprocessing module, a remote GPU server, an internet connection module, a control signal parsing module, and a control signal output module; wherein,
the microphone array monitors and collects, at a specific frequency, an audio control signal issued by a user;
the speech preprocessing module performs feature extraction on the audio control signal to obtain MFCC raw speech feature information, and detects whether the logarithmic energy of the MFCC raw speech features exceeds a threshold; if so, the MFCC raw speech feature information is transmitted to the remote GPU server through the internet connection module;
the remote GPU server receives the MFCC raw speech feature information, obtains deep speech feature information from the MFCC raw speech feature information, and sends control signal identification information corresponding to the deep feature information to the internet connection module;
the internet connection module passes the control signal identification information to the control signal parsing module; the control signal parsing module generates a control signal code according to the control signal identification information, selects the corresponding control signal output module, and passes the control signal code to that control signal output module;
the control signal output module sends a control signal to the smart audio-visual equipment according to the control signal code.
6. The multi-service control system for smart audio-visual equipment based on deep learning according to claim 5, characterized in that the speech preprocessing module comprises:
a segmentation unit, configured to perform endpoint detection and segmentation on the audio control signal;
a noise reduction unit, configured to perform noise reduction on the segmented audio control signal;
an extraction unit, configured to perform MFCC raw speech feature extraction on the noise-reduced audio control signal to obtain the MFCC raw speech feature information.
7. The multi-service control system for smart audio-visual equipment based on deep learning according to claim 5, characterized in that the remote GPU server receives the MFCC raw speech feature information, starts a deep-learning speech recognition program, and performs deep speech feature extraction on the MFCC raw speech feature information using a BiLSTM algorithm to obtain the deep speech feature information.
8. The multi-service control system for smart audio-visual equipment based on deep learning according to claim 5, characterized in that the remote GPU server receives the MFCC raw speech feature information, performs deep speech feature extraction on the MFCC raw speech feature information, and obtains the deep speech feature information;
the remote GPU server classifies the deep speech feature information to obtain the class corresponding to the deep speech feature information, and detects whether that class corresponds to a control signal identifier; if so, the control signal identification information is returned to the internet connection module.
CN201611144430.6A 2016-12-13 2016-12-13 Multi-business control method and system for smart sound and video equipment based on deep learning Pending CN106653020A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201611144430.6A CN106653020A (en) 2016-12-13 2016-12-13 Multi-business control method and system for smart sound and video equipment based on deep learning

Publications (1)

Publication Number Publication Date
CN106653020A true CN106653020A (en) 2017-05-10

Family

ID=58824998

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201611144430.6A Pending CN106653020A (en) 2016-12-13 2016-12-13 Multi-business control method and system for smart sound and video equipment based on deep learning

Country Status (1)

Country Link
CN (1) CN106653020A (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101221762A (en) * 2007-12-06 2008-07-16 上海大学 MP3 compression field audio partitioning method
CN104952448A (en) * 2015-05-04 2015-09-30 张爱英 Method and system for enhancing features by aid of bidirectional long-term and short-term memory recurrent neural networks
CN105045122A (en) * 2015-06-24 2015-11-11 张子兴 Intelligent household natural interaction system based on audios and videos
CN105700359A (en) * 2014-11-25 2016-06-22 上海天脉聚源文化传媒有限公司 Method and system for controlling smart home through speech recognition

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108074575A (en) * 2017-12-14 2018-05-25 广州势必可赢网络科技有限公司 A kind of auth method and device based on Recognition with Recurrent Neural Network
CN109559761A (en) * 2018-12-21 2019-04-02 广东工业大学 A kind of risk of stroke prediction technique based on depth phonetic feature
CN110428821A (en) * 2019-07-26 2019-11-08 广州市申迪计算机系统有限公司 A kind of voice command control method and device for crusing robot
CN111357051A (en) * 2019-12-24 2020-06-30 深圳市优必选科技股份有限公司 Speech emotion recognition method, intelligent device and computer readable storage medium
WO2021127982A1 (en) * 2019-12-24 2021-07-01 深圳市优必选科技股份有限公司 Speech emotion recognition method, smart device, and computer-readable storage medium
CN111357051B (en) * 2019-12-24 2024-02-02 深圳市优必选科技股份有限公司 Speech emotion recognition method, intelligent device and computer readable storage medium
CN111783892A (en) * 2020-07-06 2020-10-16 广东工业大学 Robot instruction identification method and device, electronic equipment and storage medium
CN113921016A (en) * 2021-10-15 2022-01-11 阿波罗智联(北京)科技有限公司 Voice processing method, device, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
CN106653020A (en) Multi-business control method and system for smart sound and video equipment based on deep learning
CN108000526B (en) Dialogue interaction method and system for intelligent robot
JP6828001B2 (en) Voice wakeup method and equipment
CN104777911B (en) A kind of intelligent interactive method based on holographic technique
CN105810200A (en) Man-machine dialogue apparatus and method based on voiceprint identification
CN110263324A (en) Text handling method, model training method and device
CN107870994A (en) Man-machine interaction method and system for intelligent robot
CN107293289A (en) A kind of speech production method that confrontation network is generated based on depth convolution
JP2020034895A (en) Responding method and device
CN101834809B (en) Internet instant message communication system
CN110379441B (en) Voice service method and system based on countermeasure type artificial intelligence network
CN102298694A (en) Man-machine interaction identification system applied to remote information service
CN107704612A (en) Dialogue exchange method and system for intelligent robot
CN205508398U (en) Intelligent robot with high in clouds interactive function
CN105244042B (en) A kind of speech emotional interactive device and method based on finite-state automata
CN106486122A (en) A kind of intelligent sound interacts robot
US20190371319A1 (en) Method for human-machine interaction, electronic device, and computer-readable storage medium
CN116431316B (en) Task processing method, system, platform and automatic question-answering method
CN111368142A (en) Video intensive event description method based on generation countermeasure network
CN108053826A (en) For the method, apparatus of human-computer interaction, electronic equipment and storage medium
CN106708950B (en) Data processing method and device for intelligent robot self-learning system
CN109933773A (en) A kind of multiple semantic sentence analysis system and method
CN209328511U (en) A kind of portable AI interactive voice control system
CN112860213B (en) Audio processing method and device, storage medium and electronic equipment
CN117193524A (en) Man-machine interaction system and method based on multi-mode feature fusion

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20170510