CN106653020A - Multi-business control method and system for smart sound and video equipment based on deep learning - Google Patents

Multi-business control method and system for smart sound and video equipment based on deep learning Download PDF

Info

Publication number
CN106653020A
CN106653020A (application CN201611144430.6A)
Authority
CN
China
Prior art keywords
control signal
mfcc
depth
audio
information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201611144430.6A
Other languages
Chinese (zh)
Inventor
曾旭龙
林格
陈小燕
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
National Sun Yat Sen University
Original Assignee
National Sun Yat Sen University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by National Sun Yat Sen University
Priority to CN201611144430.6A
Publication of CN106653020A
Legal status: Pending

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/02 - Feature extraction for speech recognition; Selection of recognition unit
    • G10L 15/06 - Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/063 - Training
    • G10L 15/08 - Speech classification or search
    • G10L 15/16 - Speech classification or search using artificial neural networks
    • G10L 15/22 - Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L 15/28 - Constructional details of speech recognition systems
    • G10L 15/34 - Adaptation of a single recogniser for parallel processing, e.g. by use of multiple processors or cloud computing
    • G10L 25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/03 - Speech or voice analysis techniques characterised by the type of extracted parameters
    • G10L 25/24 - Speech or voice analysis techniques in which the extracted parameters are the cepstrum
    • G10L 25/27 - Speech or voice analysis techniques characterised by the analysis technique
    • G10L 25/30 - Speech or voice analysis techniques using neural networks

Abstract

An embodiment of the invention discloses a deep-learning-based multi-service control method and system for smart audio-visual devices. In the method, a speech preprocessing module extracts features from a voice control signal to obtain raw MFCC speech features; a remote GPU server receives the raw MFCC features and derives deep speech features from them; an Internet connection module passes the resulting control-signal identification information to a control-signal parsing module, which generates a control-signal code from the identification information, selects the corresponding control-signal output module, and passes the code to that module; the output module then sends a control signal to the smart audio-visual device according to the code. The method can control many kinds of smart audio-visual devices that use different control protocols and implement different services, and provides a more unified and natural mode of human-machine interaction and control for such devices.

Description

Deep-learning-based multi-service control method and system for smart audio-visual devices
Technical field
The present invention relates to the field of multi-service control of smart audio-visual devices, and in particular to a deep-learning-based multi-service control method and system for smart audio-visual devices.
Background art
With the progress of Internet of Things and artificial intelligence technology, smart audio-visual device technology has developed rapidly. More and more smart audio-visual devices are being designed and produced, implementing a variety of multimedia services to meet the different needs of people's daily lives. Devices designed and produced by different vendors have different control and human-machine interaction modes: they may use control channels such as infrared, Bluetooth, or Z-Wave, and interact through voice, gesture, touch, and so on. This lack of uniformity in control and interaction raises the threshold for users learning to operate smart audio-visual devices and easily leads to a poor user experience. Providing a unified, convenient, and natural control and interaction mode for these devices across multiple service scenarios is therefore a problem in urgent need of a solution.
Deep learning is a subfield of artificial intelligence. In recent years, with advances in technologies such as graphics processing units (GPUs) and cloud computing, deep learning research has achieved breakthroughs, and the introduction of deep learning techniques has produced rapid progress in fields such as computer vision and speech recognition. This also brings new ideas to smart audio-visual device control technology.
An existing natural-interaction smart-home system based on audio and video [1] collects sound and image information with a microphone and a camera, performs signal processing in an information fusion module, derives useful commands with machine learning methods, and then sends control signals through a control-signal transmission module.
That system uses voice, gesture, face, and motion information for control, and therefore cannot offer the user a single simple, unified interaction mode, which leads to a high learning cost and a poor user experience. It uses conventional machine learning methods to recognise multimedia information such as speech and images, so its recognition rate is relatively low and its robustness is poor. Moreover, its speech and image recognition programs run locally, which increases the user's hardware and energy costs.
Summary of the invention
It is an object of the invention to overcome the deficiencies of the prior art. The invention provides a deep-learning-based multi-service control method and system for smart audio-visual devices that can control many kinds of smart audio-visual devices based on different control protocols and implementing different services, and that provides a more unified and more natural mode of human-machine interaction and control for them.
To solve the above problems, the present invention proposes a deep-learning-based multi-service control method for smart audio-visual devices, the method comprising:
a microphone array monitors, at a fixed frequency, and collects the voice control signals issued by the user;
a speech preprocessing module extracts features from the voice control signal to obtain raw Mel-frequency cepstral coefficient (MFCC) speech features; it checks whether the log energy of the raw MFCC features exceeds a threshold, and if so, sends the raw MFCC features to a remote graphics processing unit (GPU) server via an Internet connection module;
the remote GPU server receives the raw MFCC features, derives deep speech features from them, and sends the control-signal identification information corresponding to the deep features to the Internet connection module;
the Internet connection module passes the control-signal identification information to a control-signal parsing module; the parsing module generates a control-signal code from the identification information, selects the corresponding control-signal output module, and passes the code to that output module;
the control-signal output module sends a control signal to the smart audio-visual device according to the control-signal code.
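The gating decision in the second step above (forward features to the server only when their log energy clears a threshold) can be sketched in a few lines. The threshold value, frame contents, and function names below are illustrative assumptions, not values from the patent.

```python
import math

# Client-side energy gate: MFCC frames are forwarded to the remote GPU server
# only when their log energy exceeds a threshold, so silence and background
# noise never leave the device.
ENERGY_THRESHOLD = -3.0  # assumed value; tuned per deployment in practice

def log_energy(frame):
    """Log energy of one audio frame (a list of samples)."""
    energy = sum(s * s for s in frame)
    return math.log(energy + 1e-10)  # epsilon avoids log(0) on silent frames

def should_upload(frame, threshold=ENERGY_THRESHOLD):
    """True if this frame's log energy clears the gate (send MFCCs upstream)."""
    return log_energy(frame) > threshold

loud = [0.5, -0.4, 0.6, -0.5]        # voiced speech
quiet = [0.001, -0.001, 0.001, 0.0]  # near-silence
print(should_upload(loud), should_upload(quiet))  # True False
```

In a real deployment the gate would run per frame over the MFCC analysis windows; here a single toy frame stands in for a windowed stream.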
Preferably, the step in which the speech preprocessing module extracts features from the voice control signal to obtain the raw MFCC speech features comprises:
performing endpoint detection and segmentation on the voice control signal;
performing noise reduction on the segmented voice control signal;
performing raw MFCC feature extraction on the denoised voice control signal to obtain the raw MFCC speech features.
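The three preprocessing stages can be illustrated as a small pipeline. The stage bodies below are toy stand-ins invented for the sketch: a real system would use a proper voice-activity detector, spectral noise reduction, and an MFCC front end rather than the amplitude gates shown here.

```python
def endpoint_detect(samples, gate=0.01):
    """Toy endpoint detection: trim leading/trailing samples below a gate."""
    voiced = [i for i, s in enumerate(samples) if abs(s) >= gate]
    if not voiced:
        return []
    return samples[voiced[0]:voiced[-1] + 1]

def denoise(samples, floor=0.02):
    """Toy noise reduction: zero out samples below a noise floor."""
    return [s if abs(s) >= floor else 0.0 for s in samples]

def preprocess(samples):
    """Endpoint detection, then noise reduction (MFCC extraction omitted)."""
    return denoise(endpoint_detect(samples))

raw = [0.0, 0.0, 0.5, 0.01, -0.6, 0.0]
print(preprocess(raw))  # [0.5, 0.0, -0.6]
```

The composition order mirrors the text: segmentation first, noise reduction on the segmented signal, and feature extraction (not shown) on the denoised result.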
Preferably, the step in which the remote GPU server receives the raw MFCC speech features and performs deep feature extraction on them to obtain the deep speech features comprises:
the remote GPU server receives the raw MFCC features, starts a deep-learning speech recognition program, and applies a bidirectional long short-term memory (biLSTM) recurrent neural network to the raw MFCC features to extract the deep speech features.
Preferably, the step in which the remote GPU server receives the raw MFCC features, derives deep speech features from them, and sends the corresponding control-signal identification information to the Internet connection module comprises:
the remote GPU server receives the raw MFCC features, performs deep feature extraction on them to obtain the deep speech features, and sends the control-signal identification information corresponding to the deep features to the Internet connection module;
the remote GPU server classifies the deep speech features, obtains the class corresponding to the deep features, and checks whether that class corresponds to a control-signal identifier; if so, it returns the control-signal identification information to the Internet connection module.
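The check described above (does the predicted class correspond to a control-signal identifier?) amounts to a lookup with an error path. The signal names, class numbering, and error marker below are invented for illustration; only the T-commands-plus-one-fallback shape comes from the patent.

```python
# Hypothetical server-side lookup: the classifier emits one of T+1 class
# labels; T of them map to control-signal identifiers, and the extra class
# signals that no device command was recognised.
CLASS_TO_SIGNAL = {          # assumed identifiers for illustration
    0: "TV_POWER_ON",
    1: "TV_VOLUME_UP",
    2: "SPEAKER_NEXT_TRACK",
}
DEFAULT_CLASS = 3  # the (T+1)-th class: "not a command"

def resolve(class_id):
    """Return a control-signal identifier, or an error marker otherwise."""
    if class_id == DEFAULT_CLASS or class_id not in CLASS_TO_SIGNAL:
        return "ERROR_UNRECOGNISED"
    return CLASS_TO_SIGNAL[class_id]

print(resolve(1), resolve(3))  # TV_VOLUME_UP ERROR_UNRECOGNISED
```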
Correspondingly, the present invention also provides a deep-learning-based multi-service control system for smart audio-visual devices, the system comprising: a microphone array, a speech preprocessing module, a remote GPU server, an Internet connection module, a control-signal parsing module, and control-signal output modules; wherein,
the microphone array monitors at a fixed frequency and collects the voice control signals issued by the user;
the speech preprocessing module extracts features from the voice control signal to obtain raw MFCC speech features, checks whether the log energy of the raw MFCC features exceeds a threshold, and if so sends the raw MFCC features to the remote GPU server via the Internet connection module;
the remote GPU server receives the raw MFCC features, derives deep speech features from them, and sends the control-signal identification information corresponding to the deep features to the Internet connection module;
the Internet connection module passes the control-signal identification information to the control-signal parsing module; the parsing module generates a control-signal code from the identification information, selects the corresponding control-signal output module, and passes the code to that output module;
the control-signal output module sends a control signal to the smart audio-visual device according to the control-signal code.
Preferably, the speech preprocessing module comprises:
a segmentation unit for performing endpoint detection and segmentation on the voice control signal;
a noise-reduction unit for performing noise reduction on the segmented voice control signal;
an extraction unit for performing raw MFCC feature extraction on the denoised voice control signal to obtain the raw MFCC speech features.
Preferably, the remote GPU server receives the raw MFCC features, starts a deep-learning speech recognition program, and applies the biLSTM algorithm to the raw MFCC features to extract the deep speech features.
Preferably, the remote GPU server receives the raw MFCC features, performs deep feature extraction on them to obtain the deep speech features, and sends the control-signal identification information corresponding to the deep features to the Internet connection module;
the remote GPU server classifies the deep speech features, obtains the class corresponding to the deep features, and checks whether that class corresponds to a control-signal identifier; if so, it returns the control-signal identification information to the Internet connection module.
By implementing this embodiment of the invention, natural speech can be used to control many kinds of smart audio-visual devices that are based on different control protocols and implement different services, giving those devices a unified, natural, efficient, and low-cost mode of human-machine interaction. Deploying the complex deep-learning tasks on a remote server reduces the user's hardware and energy costs while providing a high-performance, low-cost voice-command recognition service for smart audio-visual devices, and improves the recognition accuracy of their voice control commands.
Brief description of the drawings
To explain the embodiments of the invention or the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. The drawings described below are clearly only some embodiments of the invention; those of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a flow chart of the deep-learning-based multi-service control method for smart audio-visual devices according to an embodiment of the invention;
Fig. 2 is a schematic diagram of the deep-learning speech recognition model in an embodiment of the invention;
Fig. 3 is a schematic diagram of the structure of the deep-learning-based multi-service control system for smart audio-visual devices according to an embodiment of the invention.
Detailed description of the embodiments
The technical solutions in the embodiments of the invention are described below clearly and completely with reference to the accompanying drawings. The described embodiments are clearly only some, not all, of the embodiments of the invention; all other embodiments obtained by those of ordinary skill in the art without creative effort, based on the embodiments of the invention, fall within the scope of protection of the invention.
Fig. 1 is a flow chart of the deep-learning-based multi-service control method for smart audio-visual devices according to an embodiment of the invention. As shown in Fig. 1, the method comprises:
S1: the microphone array monitors at a fixed frequency and collects the voice control signals issued by the user;
S2: the speech preprocessing module extracts features from the voice control signal to obtain raw MFCC speech features, and checks whether the log energy of the raw MFCC features exceeds a threshold; if so, it sends the raw MFCC features to the remote GPU server via the Internet connection module; if not, the method returns to S1;
S3: the remote GPU server receives the raw MFCC features, derives deep speech features from them, and sends the control-signal identification information corresponding to the deep features to the Internet connection module;
S4: the Internet connection module passes the control-signal identification information to the control-signal parsing module; the parsing module generates a control-signal code from the identification information, selects the corresponding control-signal output module, and passes the code to that output module;
S5: the control-signal output module sends a control signal to the smart audio-visual device according to the control-signal code.
The process in which the speech preprocessing module extracts features from the voice control signal and obtains the raw MFCC speech features includes:
performing endpoint detection and segmentation on the voice control signal;
performing noise reduction on the segmented voice control signal;
performing raw MFCC feature extraction on the denoised voice control signal to obtain the raw MFCC speech features.
Specifically, in S3 the remote GPU server receives the raw MFCC features, starts the deep-learning speech recognition program, and applies the biLSTM algorithm to the raw MFCC features to extract the deep speech features.
Further, the remote GPU server receives the raw MFCC features, performs deep feature extraction on them to obtain the deep speech features, and sends the control-signal identification information corresponding to the deep features to the Internet connection module;
the remote GPU server classifies the deep speech features, obtains the class corresponding to the deep features, and checks whether that class corresponds to a control-signal identifier; if so, it returns the control-signal identification information to the Internet connection module; if not, it returns an error identifier to the Internet connection module.
In this embodiment of the invention, as shown in Fig. 2, the main structure of the deep-learning speech recognition model consists of a biLSTM, composed of one forward and one backward long short-term memory recurrent neural network, and a softmax classifier. The input of the model is the MFCC speech features sent from the local Internet connection unit; its output is one of T+1 class identifiers. These identifiers comprise T classes corresponding to the T control signals supported by the system, plus one Default class. If the model outputs the Default class, the MFCC speech features do not correspond to any control signal for a smart audio-visual device. The model is produced in advance in a training and generation phase, and is then deployed on the remote GPU server to provide the user with a voice-command recognition service for smart audio-visual devices.
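The model just described (a forward and a backward LSTM whose outputs feed a softmax over T+1 classes) can be sketched as an untrained forward pass. The dimensions, random initialisation, and the choice to use the two final hidden states as the utterance feature are assumptions of this sketch, not details from the patent.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def lstm_step(x, h, c, W, U, b):
    """One LSTM time step; gate pre-activations packed as [i, f, o, g]."""
    H = h.shape[0]
    z = W @ x + U @ h + b
    i, f, o = sigmoid(z[:H]), sigmoid(z[H:2 * H]), sigmoid(z[2 * H:3 * H])
    g = np.tanh(z[3 * H:])
    c_new = f * c + i * g
    return o * np.tanh(c_new), c_new

def run_lstm(frames, params):
    """Run one direction over the utterance; return the final hidden state."""
    W, U, b = params
    H = U.shape[1]
    h, c = np.zeros(H), np.zeros(H)
    for x in frames:
        h, c = lstm_step(x, h, c, W, U, b)
    return h

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

D, H, N_CLASSES = 13, 8, 4  # 13 MFCCs per frame; 3 commands + Default (assumed)

def init_params():
    return (rng.normal(size=(4 * H, D)) * 0.1,   # input weights
            rng.normal(size=(4 * H, H)) * 0.1,   # recurrent weights
            np.zeros(4 * H))                     # biases

fwd, bwd = init_params(), init_params()
W_out = rng.normal(size=(N_CLASSES, 2 * H)) * 0.1

frames = [rng.normal(size=D) for _ in range(20)]        # one utterance of MFCCs
feature = np.concatenate([run_lstm(frames, fwd),        # forward direction
                          run_lstm(frames[::-1], bwd)]) # backward direction
probs = softmax(W_out @ feature)                        # T+1 class probabilities
print(probs.shape)  # (4,)
```

The concatenated forward/backward final states play the role of the "deep speech features" of the text; the softmax row with the highest probability would be the predicted class identifier.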
In a specific implementation, the training and generation process of the deep-learning speech recognition model is as follows:
Step 1: according to the kinds of smart audio-visual devices to be supported and the service functions those devices implement, simulate real device-usage scenarios and collect a large number of speech fragments with the microphone array;
Step 2: manually label each speech fragment with its corresponding control-signal class;
Step 3: extract MFCC speech features from all speech fragments with the speech preprocessing module, obtaining a labelled control-speech feature data set;
Step 4: partition the data set, taking a certain amount of the labelled control-speech feature data to form the training set (Training Set) and a certain amount to serve as the validation set (Validation Set);
Step 5: randomly initialise all parameters of the deep-learning speech recognition model;
Step 6: with the training set as input, perform the deep-learning forward propagation pass;
Step 7: perform the deep-learning backward pass using backpropagation through time (BPTT), updating all parameters of the model;
Step 8: if the number of completed cycles reaches the validation period, verify the current deep-learning speech recognition model with the validation set;
Step 9: stop training if the stopping condition is reached; otherwise return to Step 6. The stopping condition may be that the number of training iterations reaches a given value, or that the validation error falls below a given value.
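Steps 5 through 9 can be condensed into a loop skeleton. The forward and BPTT passes below are deliberately stubbed (a toy "loss" that shrinks as parameters shrink), and every numeric choice is illustrative; only the control flow (init, forward, update, periodic validation, stop condition) mirrors the text.

```python
import random

def train(train_set, val_set, max_epochs=50, target_val_error=0.05,
          validate_every=5):
    random.seed(0)
    params = [random.uniform(-0.1, 0.1) for _ in range(4)]  # Step 5: init

    def forward(batch, params):                 # Step 6 (stub forward pass)
        return sum(abs(p) for p in params) / (1 + len(batch))

    def bptt_update(params, loss, lr=0.5):      # Step 7 (stub: shrink params)
        return [p * (1 - lr) for p in params]

    val_error = float("inf")
    for epoch in range(1, max_epochs + 1):
        loss = forward(train_set, params)
        params = bptt_update(params, loss)
        if epoch % validate_every == 0:         # Step 8: periodic validation
            val_error = forward(val_set, params)
        if val_error < target_val_error:        # Step 9: stop condition
            return epoch, val_error
    return max_epochs, val_error                # alternative stop: epoch cap

epochs_run, err = train(train_set=[1, 2, 3], val_set=[4, 5])
print(epochs_run < 50 and err < 0.05)  # True
```

Both stopping conditions named in Step 9 appear: the early return when the validation error clears the target, and the epoch cap as the fallback.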
Correspondingly, an embodiment of the invention also provides a deep-learning-based multi-service control system for smart audio-visual devices. As shown in Fig. 3, the system comprises: a microphone array 1, a speech preprocessing module 2, a remote GPU server 3, an Internet connection module 4, a control-signal parsing module 5, and control-signal output modules 6; wherein,
the microphone array 1 monitors at a fixed frequency and collects the voice control signals issued by the user;
the speech preprocessing module 2 extracts features from the voice control signal to obtain raw MFCC speech features, checks whether the log energy of the raw MFCC features exceeds a threshold, and if so sends the raw MFCC features to the remote GPU server 3 via the Internet connection module 4;
the remote GPU server 3 receives the raw MFCC features, derives deep speech features from them, and sends the control-signal identification information corresponding to the deep features to the Internet connection module 4;
the Internet connection module 4 passes the control-signal identification information to the control-signal parsing module 5; the parsing module 5 generates a control-signal code from the identification information, selects the corresponding control-signal output module 6, and passes the code to that output module;
the control-signal output module 6 sends a control signal to the smart audio-visual device according to the control-signal code.
In this embodiment, the microphone array 1 collects the voice signal issued by the user in real time and sends it to the speech preprocessing module 2.
The speech preprocessing module 2 is responsible for endpoint detection, noise reduction, and raw MFCC feature extraction on the voice signal.
The Internet connection module 4 is responsible for establishing a network connection with the remote GPU server 3, sending the raw MFCC features to the server, and receiving the feedback messages returned by it.
The control-signal parsing module 5 is responsible for parsing the feedback messages from the remote GPU server 3 and, according to the message content, enabling the corresponding control-signal output module 6 or performing error handling.
There are multiple control-signal output modules 6; each is equipped with hardware supporting one wireless communication mode and is responsible for controlling all smart audio-visual devices based on that mode. These communication modes include infrared, Bluetooth, Z-Wave, and so on.
The remote GPU server 3 provides the user with the voice-command recognition service for smart audio-visual devices.
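The behaviour of modules 5 and 6 (parse the recognised signal, pick the output module for the protocol the target device speaks, hand it the code, or fall to error handling) can be sketched as a lookup table. All identifiers, protocol names, and codes below are invented for illustration; appending to a list stands in for the actual infrared/Bluetooth/Z-Wave transmission hardware.

```python
OUTPUT_MODULES = {"infrared": [], "bluetooth": [], "zwave": []}

SIGNAL_TABLE = {  # control-signal id -> (protocol, encoded command); assumed
    "TV_POWER_ON":  ("infrared",  0x20DF10EF),
    "SPEAKER_PLAY": ("bluetooth", 0x01),
    "LIGHT_OFF":    ("zwave",     0x00),
}

def dispatch(signal_id):
    """Encode the signal and hand it to the matching output module."""
    if signal_id not in SIGNAL_TABLE:
        return False  # error-handling path for unknown / Default results
    protocol, code = SIGNAL_TABLE[signal_id]
    OUTPUT_MODULES[protocol].append(code)  # stands in for radio/IR transmit
    return True

dispatch("TV_POWER_ON")
print(OUTPUT_MODULES["infrared"])  # [551489775]
```

One table row per supported command keeps the parsing module protocol-agnostic: adding a device on a new protocol only means registering another output module and its rows.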
Further, the speech preprocessing module 2 comprises:
a segmentation unit for performing endpoint detection and segmentation on the voice control signal;
a noise-reduction unit for performing noise reduction on the segmented voice control signal;
an extraction unit for performing raw MFCC feature extraction on the denoised voice control signal to obtain the raw MFCC speech features.
The remote GPU server 3 receives the raw MFCC features, starts the deep-learning speech recognition program, and applies the biLSTM algorithm to the raw MFCC features to extract the deep speech features.
The remote GPU server 3 receives the raw MFCC features, performs deep feature extraction on them to obtain the deep speech features, and sends the control-signal identification information corresponding to the deep features to the Internet connection module 4;
the remote GPU server 3 classifies the deep speech features, obtains the class corresponding to the deep features, and checks whether that class corresponds to a control-signal identifier; if so, it returns the control-signal identification information to the Internet connection module 4.
Specifically, for the operating principles of the functional modules of the system of this embodiment, see the corresponding description of the method embodiment; they are not repeated here.
By implementing this embodiment of the invention, natural speech can be used to control many kinds of smart audio-visual devices that are based on different control protocols and implement different services, giving those devices a unified, natural, efficient, and low-cost mode of human-machine interaction. Deploying the complex deep-learning tasks on a remote server reduces the user's hardware and energy costs while providing a high-performance, low-cost voice-command recognition service for smart audio-visual devices, and improves the recognition accuracy of their voice control commands.
Those of ordinary skill in the art will appreciate that all or part of the steps of the methods of the above embodiments can be completed by a program instructing the relevant hardware; the program can be stored in a computer-readable storage medium, which may include read-only memory (ROM), random-access memory (RAM), a magnetic disk, an optical disc, and so on.
The deep-learning-based multi-service control method and system for smart audio-visual devices provided by the embodiments of the invention have been described in detail above. Specific examples have been used here to explain the principles and implementation of the invention, and the description of the above embodiments is only intended to help in understanding the method of the invention and its core ideas. At the same time, those skilled in the art will, following the ideas of the invention, make changes to the specific implementation and scope of application. In summary, the content of this specification should not be construed as limiting the invention.

Claims (8)

1. A multi-service control method for smart audio-visual equipment based on deep learning, characterized in that the method comprises:
a microphone array monitoring and collecting, at a specific frequency, an audio control signal issued by a user;
a speech preprocessing module performing feature extraction on the audio control signal to obtain MFCC raw speech feature information, and detecting whether the logarithmic energy of the MFCC raw speech features exceeds a threshold; if so, transmitting the MFCC raw speech feature information to a remote GPU server through an internet connection module;
the remote GPU server receiving the MFCC raw speech feature information, obtaining deep speech feature information from the MFCC raw speech feature information, and sending control signal identification information corresponding to the deep feature information to the internet connection module;
the internet connection module passing the control signal identification information to a control signal parsing module; the control signal parsing module generating a control signal code according to the control signal identification information, selecting a corresponding control signal output module, and passing the control signal code to that control signal output module;
the control signal output module sending a control signal to the smart audio-visual equipment according to the control signal code.
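The parsing and dispatch step of claim 1 can be sketched as follows. The table of identifiers, codes, and output channels (infrared, Wi-Fi) is purely illustrative: the claims do not specify any concrete identifiers or encodings.

```python
# Hypothetical sketch of the control signal parsing module of claim 1:
# it maps a control signal identifier received from the remote GPU server
# to a control signal code and selects an output module. All identifiers,
# codes, and channel names below are illustrative assumptions.

SIGNAL_TABLE = {
    "TV_POWER_ON":  {"code": 0x01, "output": "infrared"},
    "TV_VOLUME_UP": {"code": 0x02, "output": "infrared"},
    "SPEAKER_PLAY": {"code": 0x10, "output": "wifi"},
}

def parse_control_signal(identifier):
    """Generate (control_code, output_module) for a recognized identifier."""
    entry = SIGNAL_TABLE.get(identifier)
    if entry is None:
        return None  # unknown identifier: no control signal is emitted
    return entry["code"], entry["output"]
```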
2. The multi-service control method for smart audio-visual equipment based on deep learning according to claim 1, characterized in that the step of the speech preprocessing module performing feature extraction on the audio control signal to obtain the MFCC raw speech feature information comprises:
performing endpoint detection and segmentation on the audio control signal;
performing noise reduction on the segmented audio control signal;
performing MFCC raw speech feature extraction on the noise-reduced audio control signal to obtain the MFCC raw speech feature information.
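A minimal numpy sketch of the front end described in claims 1 and 2: framing the signal and computing each frame's log energy (the quantity the claims compare against a threshold), then keeping only above-threshold frames as a crude endpoint detector. Frame length, hop, and threshold values are illustrative assumptions; a full MFCC front end would continue with mel filtering and a DCT.

```python
import numpy as np

def frame_log_energy(signal, frame_len=400, hop=160):
    """Split a 1-D signal into overlapping frames; return each frame's log energy."""
    n_frames = 1 + max(0, (len(signal) - frame_len) // hop)
    energies = np.empty(n_frames)
    for i in range(n_frames):
        frame = signal[i * hop : i * hop + frame_len]
        energies[i] = np.log(np.sum(frame ** 2) + 1e-10)  # epsilon avoids log(0)
    return energies

def voiced_frames(signal, threshold=-5.0, frame_len=400, hop=160):
    """Crude endpoint detection: indices of frames whose log energy
    exceeds the threshold (the energy test of claim 1)."""
    e = frame_log_energy(signal, frame_len, hop)
    return np.nonzero(e > threshold)[0]
```

With 16 kHz audio, `frame_len=400` and `hop=160` correspond to the common 25 ms window / 10 ms hop choice.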
3. The multi-service control method for smart audio-visual equipment based on deep learning according to claim 1, characterized in that the step of the remote GPU server receiving the MFCC raw speech feature information, performing deep speech feature extraction on the MFCC raw speech feature information, and obtaining the deep speech feature information comprises:
the remote GPU server receiving the MFCC raw speech feature information, starting a deep-learning speech recognition program, and performing deep speech feature extraction on the MFCC raw speech feature information using a BiLSTM algorithm to obtain the deep speech feature information.
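Claims 3 and 7 name a BiLSTM (bidirectional long short-term memory) network as the deep feature extractor. A toy numpy forward pass illustrates the idea: one LSTM is run over the MFCC frames left-to-right, another right-to-left, and the two hidden sequences are concatenated per frame. The random weights are placeholders; the patent's actual trained parameters and framework are not disclosed here.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_pass(xs, W, U, b, hidden):
    """Single-direction LSTM over a sequence of feature vectors."""
    h = np.zeros(hidden)
    c = np.zeros(hidden)
    outs = []
    for x in xs:
        z = W @ x + U @ h + b                 # all four gate pre-activations
        i, f, o, g = np.split(z, 4)
        i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)
        c = f * c + i * np.tanh(g)            # cell state update
        h = o * np.tanh(c)                    # hidden state
        outs.append(h)
    return outs

def bilstm_features(mfcc_frames, hidden=8, seed=0):
    """Concatenate forward and backward LSTM hidden states per frame."""
    rng = np.random.default_rng(seed)
    d = mfcc_frames.shape[1]
    params = lambda: (rng.normal(0, 0.1, (4 * hidden, d)),
                      rng.normal(0, 0.1, (4 * hidden, hidden)),
                      np.zeros(4 * hidden))
    fwd = lstm_pass(mfcc_frames, *params(), hidden)
    bwd = lstm_pass(mfcc_frames[::-1], *params(), hidden)[::-1]
    return np.stack([np.concatenate([f, b]) for f, b in zip(fwd, bwd)])
```

Each output frame thus sees both past and future context, which is the motivation for using a bidirectional rather than a unidirectional LSTM on short spoken commands.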
4. The multi-service control method for smart audio-visual equipment based on deep learning according to claim 1, characterized in that the step of the remote GPU server receiving the MFCC raw speech feature information, obtaining the deep speech feature information from the MFCC raw speech feature information, and sending the control signal identification information corresponding to the deep feature information to the internet connection module comprises:
the remote GPU server receiving the MFCC raw speech feature information, performing deep speech feature extraction on the MFCC raw speech feature information, and obtaining the deep speech feature information;
the remote GPU server classifying the deep speech feature information to obtain the class corresponding to the deep speech feature information, and detecting whether that class corresponds to a control signal identifier; if so, returning the control signal identification information to the internet connection module.
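Claims 4 and 8 classify the deep feature vector and then check whether the predicted class actually corresponds to a control signal identifier before replying. A softmax sketch under assumed weights and an assumed class list (index 0 as a "no command" class, the rest hypothetical):

```python
import numpy as np

# Hypothetical class list: only some classes map to control signals;
# class 0 stands for background / unrecognized speech.
CLASS_TO_SIGNAL = {1: "TV_POWER_ON", 2: "TV_VOLUME_UP", 3: "SPEAKER_PLAY"}

def classify(feature, weights, bias):
    """Softmax classification of a pooled deep feature vector."""
    logits = weights @ feature + bias
    probs = np.exp(logits - logits.max())   # subtract max for stability
    probs /= probs.sum()
    return int(np.argmax(probs)), probs

def control_signal_id(feature, weights, bias):
    """Return a control signal identifier if the predicted class has one,
    else None (mirroring the 'if so, return ...' test of claim 4)."""
    cls, _ = classify(feature, weights, bias)
    return CLASS_TO_SIGNAL.get(cls)
```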
5. A multi-service control system for smart audio-visual equipment based on deep learning, characterized in that the system comprises: a microphone array, a speech preprocessing module, a remote GPU server, an internet connection module, a control signal parsing module, and a control signal output module; wherein,
the microphone array monitors and collects, at a specific frequency, an audio control signal issued by a user;
the speech preprocessing module performs feature extraction on the audio control signal to obtain MFCC raw speech feature information, and detects whether the logarithmic energy of the MFCC raw speech features exceeds a threshold; if so, the MFCC raw speech feature information is transmitted to the remote GPU server through the internet connection module;
the remote GPU server receives the MFCC raw speech feature information, obtains deep speech feature information from the MFCC raw speech feature information, and sends control signal identification information corresponding to the deep feature information to the internet connection module;
the internet connection module passes the control signal identification information to the control signal parsing module; the control signal parsing module generates a control signal code according to the control signal identification information, selects the corresponding control signal output module, and passes the control signal code to that control signal output module;
the control signal output module sends a control signal to the smart audio-visual equipment according to the control signal code.
6. The multi-service control system for smart audio-visual equipment based on deep learning according to claim 5, characterized in that the speech preprocessing module comprises:
a segmentation unit, configured to perform endpoint detection and segmentation on the audio control signal;
a noise reduction unit, configured to perform noise reduction on the segmented audio control signal;
an extraction unit, configured to perform MFCC raw speech feature extraction on the noise-reduced audio control signal to obtain the MFCC raw speech feature information.
7. The multi-service control system for smart audio-visual equipment based on deep learning according to claim 5, characterized in that the remote GPU server receives the MFCC raw speech feature information, starts a deep-learning speech recognition program, and performs deep speech feature extraction on the MFCC raw speech feature information using a BiLSTM algorithm to obtain the deep speech feature information.
8. The multi-service control system for smart audio-visual equipment based on deep learning according to claim 5, characterized in that the remote GPU server receives the MFCC raw speech feature information, performs deep speech feature extraction on the MFCC raw speech feature information, and obtains the deep speech feature information;
the remote GPU server classifies the deep speech feature information to obtain the class corresponding to the deep speech feature information, and detects whether that class corresponds to a control signal identifier; if so, the control signal identification information is returned to the internet connection module.
CN201611144430.6A 2016-12-13 2016-12-13 Multi-business control method and system for smart sound and video equipment based on deep learning Pending CN106653020A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201611144430.6A CN106653020A (en) 2016-12-13 2016-12-13 Multi-business control method and system for smart sound and video equipment based on deep learning

Publications (1)

Publication Number Publication Date
CN106653020A true CN106653020A (en) 2017-05-10

Family

ID=58824998

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201611144430.6A Pending CN106653020A (en) 2016-12-13 2016-12-13 Multi-business control method and system for smart sound and video equipment based on deep learning

Country Status (1)

Country Link
CN (1) CN106653020A (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101221762A (en) * 2007-12-06 2008-07-16 上海大学 MP3 compression field audio partitioning method
CN104952448A (en) * 2015-05-04 2015-09-30 张爱英 Method and system for enhancing features by aid of bidirectional long-term and short-term memory recurrent neural networks
CN105045122A (en) * 2015-06-24 2015-11-11 张子兴 Intelligent household natural interaction system based on audios and videos
CN105700359A (en) * 2014-11-25 2016-06-22 上海天脉聚源文化传媒有限公司 Method and system for controlling smart home through speech recognition

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108074575A (en) * 2017-12-14 2018-05-25 广州势必可赢网络科技有限公司 A kind of auth method and device based on Recognition with Recurrent Neural Network
CN109559761A (en) * 2018-12-21 2019-04-02 广东工业大学 A kind of risk of stroke prediction technique based on depth phonetic feature
CN110428821A (en) * 2019-07-26 2019-11-08 广州市申迪计算机系统有限公司 A kind of voice command control method and device for crusing robot
CN111357051A (en) * 2019-12-24 2020-06-30 深圳市优必选科技股份有限公司 Speech emotion recognition method, intelligent device and computer readable storage medium
WO2021127982A1 (en) * 2019-12-24 2021-07-01 深圳市优必选科技股份有限公司 Speech emotion recognition method, smart device, and computer-readable storage medium
CN111357051B (en) * 2019-12-24 2024-02-02 深圳市优必选科技股份有限公司 Speech emotion recognition method, intelligent device and computer readable storage medium
CN111783892A (en) * 2020-07-06 2020-10-16 广东工业大学 Robot instruction identification method and device, electronic equipment and storage medium
CN113921016A (en) * 2021-10-15 2022-01-11 阿波罗智联(北京)科技有限公司 Voice processing method, device, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
CN106653020A (en) Multi-business control method and system for smart sound and video equipment based on deep learning
CN108000526B (en) Dialogue interaction method and system for intelligent robot
JP6828001B2 (en) Voice wakeup method and equipment
CN104777911B (en) A kind of intelligent interactive method based on holographic technique
CN105810200A (en) Man-machine dialogue apparatus and method based on voiceprint identification
CN110263324A (en) Text handling method, model training method and device
CN107870994A (en) Man-machine interaction method and system for intelligent robot
CN107293289A (en) A kind of speech production method that confrontation network is generated based on depth convolution
JP2020034895A (en) Responding method and device
CN101834809B (en) Internet instant message communication system
CN110379441B (en) Voice service method and system based on countermeasure type artificial intelligence network
CN102298694A (en) Man-machine interaction identification system applied to remote information service
CN107704612A (en) Dialogue exchange method and system for intelligent robot
CN205508398U (en) Intelligent robot with high in clouds interactive function
CN105244042B (en) A kind of speech emotional interactive device and method based on finite-state automata
CN106486122A (en) A kind of intelligent sound interacts robot
US20190371319A1 (en) Method for human-machine interaction, electronic device, and computer-readable storage medium
CN116431316B (en) Task processing method, system, platform and automatic question-answering method
CN111368142A (en) Video intensive event description method based on generation countermeasure network
CN108053826A (en) For the method, apparatus of human-computer interaction, electronic equipment and storage medium
CN106708950B (en) Data processing method and device for intelligent robot self-learning system
CN109933773A (en) A kind of multiple semantic sentence analysis system and method
CN209328511U (en) A kind of portable AI interactive voice control system
CN112860213B (en) Audio processing method and device, storage medium and electronic equipment
CN117193524A (en) Man-machine interaction system and method based on multi-mode feature fusion

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20170510