CN106653020A - Multi-business control method and system for smart sound and video equipment based on deep learning - Google Patents
- Publication number
- CN106653020A CN106653020A CN201611144430.6A CN201611144430A CN106653020A CN 106653020 A CN106653020 A CN 106653020A CN 201611144430 A CN201611144430 A CN 201611144430A CN 106653020 A CN106653020 A CN 106653020A
- Authority
- CN
- China
- Prior art keywords
- control signal
- mfcc
- depth
- audio
- information
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/22—Procedures used during a speech recognition process, e.g. man-machine dialogue
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/02—Feature extraction for speech recognition; Selection of recognition unit
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/063—Training
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/16—Speech classification or search using artificial neural networks
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/28—Constructional details of speech recognition systems
- G10L15/34—Adaptation of a single recogniser for parallel processing, e.g. by use of multiple processors or cloud computing
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
- G10L25/24—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/27—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
- G10L25/30—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
Abstract
An embodiment of the invention discloses a deep-learning-based multi-service control method and system for smart audio-visual equipment. In the method, a voice preprocessing module extracts features from a voice control signal and obtains MFCC raw voice feature information; a remote GPU server receives the MFCC raw voice feature information and obtains deep voice feature information from it; an Internet connection module transmits control signal identification information to a signal parsing module, which generates a control signal code from the identification information, selects the corresponding control signal output module, and passes the control signal code to that output module; the control signal output module sends a control signal to the smart audio-visual equipment according to the control signal code. The method can control a variety of smart audio-visual devices that are based on different control protocols and implement different services, and provides a more unified and natural mode of human-machine interaction and control for smart audio-visual equipment.
Description
Technical field
The present invention relates to the field of multi-service control of smart audio-visual equipment, and in particular to a deep-learning-based multi-service control method and system for smart audio-visual equipment.
Background technology
With advances in the Internet of Things and artificial intelligence, smart audio-visual equipment technology has developed rapidly. More and more smart audio-visual devices are being designed and produced, implementing a variety of multimedia services to meet the different needs in people's lives. Devices designed and produced by different vendors have different control and human-machine interaction modes: these devices may adopt various control schemes such as infrared, Bluetooth, or Z-Wave, and realize human-machine interaction through voice, gesture, touch, and other modes. This lack of uniformity in control and interaction raises the learning threshold for using smart audio-visual equipment and easily leads to a poor user experience. Integrating multiple business scenarios and providing a unified, convenient, and natural control and interaction mode for these smart audio-visual devices is therefore a problem demanding a prompt solution.
Deep learning is a subfield of artificial intelligence. In recent years, with progress in technologies such as the graphics processing unit (Graphics Processing Unit, GPU) and cloud computing, deep learning research has achieved breakthroughs. At the same time, the introduction of deep learning techniques has driven rapid advances in fields such as computer vision and speech recognition. This also brings new ideas to the control of smart audio-visual equipment.
An existing smart-home natural interaction system based on audio and video [1] collects sound and image information with a microphone and a camera, performs signal processing with an information fusion module, obtains useful commands using machine learning methods, and then sends control signals through a control signal transmission module.
That system is controlled through a variety of information sources such as voice, gesture, face, and motion, and cannot provide the user with a simple, unified interaction mode, leading to a high learning cost and a poor user experience. It uses conventional machine learning methods to recognize multimedia information such as voice and images, so its recognition rate is relatively low and its robustness is poor. Moreover, its voice and image recognition programs run locally, which increases the user's hardware and energy costs.
Summary of the invention
The object of the present invention is to overcome the deficiencies of the prior art. The invention provides a deep-learning-based multi-service control method and system for smart audio-visual equipment, which can control a variety of smart audio-visual devices that are based on different control protocols and implement different services, and provides them with a more unified and more natural mode of human-machine interaction and control.
To solve the above problems, the present invention proposes a deep-learning-based multi-service control method for smart audio-visual equipment, the method comprising:
a microphone array monitors and collects, at a specific frequency, the voice control signals issued by the user;
a voice preprocessing module extracts features from the voice control signal to obtain Mel-scale Frequency Cepstral Coefficients (MFCC) raw voice feature information, and detects whether the log energy of the MFCC raw voice features exceeds a threshold; if so, the MFCC raw voice feature information is sent through an Internet connection module to a remote graphics processing unit (Graphics Processing Unit, GPU) server;
the remote GPU server receives the MFCC raw voice feature information, obtains deep voice feature information from it, and sends the control signal identification information corresponding to the deep feature information to the Internet connection module;
the Internet connection module passes the control signal identification information to a control signal parsing module, which generates a control signal code from the identification information, selects the corresponding control signal output module, and passes the control signal code to that output module;
the control signal output module sends a control signal to the smart audio-visual equipment according to the control signal code.
Preferably, the step in which the voice preprocessing module extracts features from the voice control signal and obtains the MFCC raw voice feature information includes:
performing endpoint detection and segmentation on the voice control signal;
performing noise reduction on the segmented voice control signal;
performing MFCC raw voice feature extraction on the noise-reduced voice control signal to obtain the MFCC raw voice feature information.
Preferably, the step in which the remote GPU server receives the MFCC raw voice feature information, performs deep voice feature extraction on it, and obtains the deep voice feature information includes:
the remote GPU server receives the MFCC raw voice feature information, starts a deep learning speech recognition program, and performs deep voice feature extraction on the MFCC raw voice feature information using a bidirectional long short-term memory (Bidirectional Long Short-Term Memory, biLSTM) recurrent neural network algorithm, obtaining the deep voice feature information.
Preferably, the step in which the remote GPU server receives the MFCC raw voice feature information, obtains the deep voice feature information from it, and sends the corresponding control signal identification information to the Internet connection module includes:
the remote GPU server receives the MFCC raw voice feature information, performs deep voice feature extraction on it, obtains the deep voice feature information, and sends the corresponding control signal identification information to the Internet connection module;
the remote GPU server classifies the deep voice feature information, obtains the category corresponding to it, and detects whether that category corresponds to a control signal identifier; if so, it returns the control signal identification information to the Internet connection module.
Correspondingly, the present invention also provides a deep-learning-based multi-service control system for smart audio-visual equipment, the system comprising: a microphone array, a voice preprocessing module, a remote GPU server, an Internet connection module, a control signal parsing module, and a control signal output module; wherein,
the microphone array monitors and collects, at a specific frequency, the voice control signals issued by the user;
the voice preprocessing module extracts features from the voice control signal to obtain MFCC raw voice feature information, detects whether the log energy of the MFCC raw voice features exceeds a threshold, and if so sends the MFCC raw voice feature information through the Internet connection module to the remote GPU server;
the remote GPU server receives the MFCC raw voice feature information, obtains deep voice feature information from it, and sends the corresponding control signal identification information to the Internet connection module;
the Internet connection module passes the control signal identification information to the control signal parsing module, which generates a control signal code from the identification information, selects the corresponding control signal output module, and passes the control signal code to that output module;
the control signal output module sends a control signal to the smart audio-visual equipment according to the control signal code.
Preferably, the voice preprocessing module includes:
a segmentation unit, for performing endpoint detection and segmentation on the voice control signal;
a noise reduction unit, for performing noise reduction on the segmented voice control signal;
an extraction unit, for performing MFCC raw voice feature extraction on the noise-reduced voice control signal to obtain the MFCC raw voice feature information.
Preferably, the remote GPU server receives the MFCC raw voice feature information, starts a deep learning speech recognition program, and performs deep voice feature extraction on it using the biLSTM algorithm, obtaining the deep voice feature information.
Preferably, the remote GPU server receives the MFCC raw voice feature information, performs deep voice feature extraction on it, obtains the deep voice feature information, and sends the corresponding control signal identification information to the Internet connection module;
the remote GPU server classifies the deep voice feature information, obtains the corresponding category, and detects whether that category corresponds to a control signal identifier; if so, it returns the control signal identification information to the Internet connection module.
By implementing the embodiments of the present invention, natural voice can be used to control a variety of smart audio-visual devices that are based on different control protocols and implement different services, providing a unified, natural, efficient, and low-cost human-machine interaction mode for smart audio-visual equipment. Deploying the complex deep learning tasks on a remote server reduces the user's hardware and energy costs while providing a high-performance, low-cost recognition service for smart audio-visual equipment voice control commands, improving the recognition accuracy of those commands.
Description of the drawings
To illustrate the technical solutions in the embodiments of the present invention or in the prior art more clearly, the accompanying drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show only some embodiments of the invention; those of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a flow diagram of the deep-learning-based multi-service control method for smart audio-visual equipment according to an embodiment of the present invention;
Fig. 2 is a schematic diagram of the deep learning speech recognition model in an embodiment of the present invention;
Fig. 3 is a schematic diagram of the structure of the deep-learning-based multi-service control system for smart audio-visual equipment according to an embodiment of the present invention.
Detailed description of the embodiments
The technical solutions in the embodiments of the present invention are described below clearly and completely with reference to the accompanying drawings. Obviously, the described embodiments are only some, rather than all, of the embodiments of the invention. All other embodiments obtained by those of ordinary skill in the art on the basis of the embodiments of the present invention without creative effort fall within the scope of protection of the invention.
Fig. 1 is a flow diagram of the deep-learning-based multi-service control method for smart audio-visual equipment according to an embodiment of the present invention. As shown in Fig. 1, the method includes:
S1: a microphone array monitors and collects, at a specific frequency, the voice control signals issued by the user;
S2: a voice preprocessing module extracts features from the voice control signal to obtain MFCC raw voice feature information, and detects whether the log energy of the MFCC raw voice features exceeds a threshold; if so, the MFCC raw voice feature information is sent through an Internet connection module to a remote GPU server; if not, the method returns to S1;
S3: the remote GPU server receives the MFCC raw voice feature information, obtains deep voice feature information from it, and sends the corresponding control signal identification information to the Internet connection module;
S4: the Internet connection module passes the control signal identification information to a control signal parsing module, which generates a control signal code from the identification information, selects the corresponding control signal output module, and passes the control signal code to that output module;
S5: the control signal output module sends a control signal to the smart audio-visual equipment according to the control signal code.
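The gating check in step S2 amounts to a short-time log-energy test on the captured audio before anything is sent over the network. A minimal sketch in plain Python — the frame length, threshold value, and function names are illustrative assumptions, not taken from the patent:

```python
import math

def log_energy(frame):
    """Log energy of one audio frame (a list of PCM samples)."""
    energy = sum(s * s for s in frame)
    return math.log(energy + 1e-10)  # small offset avoids log(0)

def should_forward(frames, threshold=-5.0):
    """Gate from step S2: forward the features to the remote GPU
    server only if some frame's log energy exceeds the threshold;
    otherwise keep listening (return to S1)."""
    return any(log_energy(f) > threshold for f in frames)

silence = [[0.0] * 160]             # near-silent frame
speech = [[0.3, -0.4, 0.5] * 53]    # frame with audible energy
assert not should_forward(silence)
assert should_forward(speech)
```

In a real deployment the threshold would presumably be tuned against the ambient noise floor seen by the microphone array.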
The process in which the voice preprocessing module extracts features from the voice control signal and obtains the MFCC raw voice feature information includes:
performing endpoint detection and segmentation on the voice control signal;
performing noise reduction on the segmented voice control signal;
performing MFCC raw voice feature extraction on the noise-reduced voice control signal to obtain the MFCC raw voice feature information.
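The patent does not specify an endpoint-detection algorithm; a common choice for this kind of step is a short-time energy criterion, sketched below. The frame length and energy threshold are invented for illustration:

```python
def frame_signal(samples, frame_len=160):
    """Split a sample list into non-overlapping frames."""
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, frame_len)]

def detect_endpoints(samples, frame_len=160, energy_thresh=0.01):
    """Return (start, end) sample indices of the voiced region.
    A frame counts as voiced when its mean squared amplitude
    exceeds the threshold; returns None if nothing is voiced."""
    frames = frame_signal(samples, frame_len)
    voiced = [i for i, f in enumerate(frames)
              if sum(s * s for s in f) / frame_len > energy_thresh]
    if not voiced:
        return None
    return voiced[0] * frame_len, (voiced[-1] + 1) * frame_len

# silence, then a burst of "speech", then silence again
sig = [0.0] * 1600 + [0.5, -0.5] * 800 + [0.0] * 1600
assert detect_endpoints(sig) == (1600, 3200)
```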
Specifically, in S3, the remote GPU server receives the MFCC raw voice feature information, starts a deep learning speech recognition program, and performs deep voice feature extraction on it using the biLSTM algorithm, obtaining the deep voice feature information.
Further, the remote GPU server receives the MFCC raw voice feature information, performs deep voice feature extraction on it, obtains the deep voice feature information, and sends the corresponding control signal identification information to the Internet connection module;
the remote GPU server classifies the deep voice feature information, obtains the corresponding category, and detects whether that category corresponds to a control signal identifier; if so, it returns the control signal identification information to the Internet connection module; if not, it returns an error identifier to the Internet connection module.
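The branch on the classification result can be sketched as a lookup from category index to control signal identifier, with the fall-through producing the error identifier. The table contents and the feedback-message format below are hypothetical, not taken from the patent:

```python
# Hypothetical category table: indices 0..T-1 map one-to-one to the
# control signals the system supports; any other index (including
# the Default category at index T) is unrecognized.
SIGNAL_IDS = ["tv_power", "tv_volume_up", "speaker_play"]  # T = 3
DEFAULT = len(SIGNAL_IDS)

def to_feedback(category):
    """Map the classifier's category index to the feedback message
    returned to the Internet connection module: a control signal
    identifier on success, an error identifier otherwise."""
    if 0 <= category < len(SIGNAL_IDS):
        return {"ok": True, "signal_id": SIGNAL_IDS[category]}
    return {"ok": False, "error": "unrecognized_command"}

assert to_feedback(1) == {"ok": True, "signal_id": "tv_volume_up"}
assert to_feedback(DEFAULT)["ok"] is False
```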
In an embodiment of the present invention, as shown in Fig. 2, the main structure of the deep learning speech recognition model consists of a biLSTM, composed of one forward and one backward long short-term memory recurrent neural network, and a softmax classifier. The input of the model is the MFCC voice features sent from the local Internet connection unit; its output is one of T+1 category identifiers. These category identifiers include T categories in one-to-one correspondence with the control signals supported by the system, plus one Default category. If the model outputs the Default category, the MFCC voice features cannot be mapped to any control signal for the smart audio-visual equipment. The deep learning speech recognition model is produced in advance during its training and generation phase, and is then deployed on the remote GPU server to provide the user with a recognition service for smart audio-visual equipment voice control commands.
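The softmax layer at the top of the model turns the biLSTM's raw scores for the T+1 categories into probabilities, and the predicted category is the one with the highest probability. A minimal, framework-free sketch — the score values are invented:

```python
import math

def softmax(scores):
    """Numerically stable softmax over raw classifier scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def classify(scores):
    """Pick the most probable of the T+1 categories (index T being
    the Default 'no matching control signal' class)."""
    probs = softmax(scores)
    return max(range(len(probs)), key=probs.__getitem__)

# T = 3 control-signal classes plus one Default class
scores = [0.2, 2.5, 0.1, 0.4]
assert classify(scores) == 1
assert abs(sum(softmax(scores)) - 1.0) < 1e-9
```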
In a specific implementation, the training and generation process of the deep learning speech recognition model is as follows:
Step 1: according to the types of smart audio-visual equipment to be supported and the business functions these devices implement, simulate real device usage scenarios and collect a large number of voice fragments using the microphone array;
Step 2: manually label the control signal category corresponding to each voice fragment;
Step 3: extract MFCC voice features from all voice fragments using the voice preprocessing module, obtaining a labeled control voice feature data set;
Step 4: partition the data set, taking a certain amount of data from the labeled control voice feature data set to form a training set (Training Set) and a certain amount of data as a validation set (Validation Set);
Step 5: randomly initialize all parameters of the deep learning speech recognition model;
Step 6: with the training set as input, perform the deep learning forward propagation process;
Step 7: perform the deep learning back-propagation process using the back-propagation through time (Back Propagation Through Time, BPTT) method, updating all parameters of the deep learning speech model;
Step 8: if the number of executed cycles reaches the validation period, validate the current deep learning speech recognition model with the validation set;
Step 9: stop training if the stop condition is reached; otherwise return to Step 6. The stop condition may be that the number of training iterations reaches a certain value, or that the validation error falls below a certain value.
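Steps 6 through 9 form a standard training loop with periodic validation. The skeleton below abstracts the forward/BPTT update and the validation pass into callables; the toy stand-ins and all constants are illustrative, not taken from the patent:

```python
def train(model_step, validate, max_iters=10000, val_period=100,
          target_error=0.05):
    """Skeleton of steps 6-9: alternate forward/backward passes with
    periodic validation, stopping on an iteration cap or when the
    validation error drops below a target.  `model_step` performs one
    forward pass plus BPTT update; `validate` returns the current
    validation error."""
    for it in range(1, max_iters + 1):
        model_step()                      # steps 6-7
        if it % val_period == 0:          # step 8
            err = validate()
            if err < target_error:        # step 9: error condition
                return it, err
    return max_iters, validate()          # step 9: iteration cap

# Toy stand-in: the "validation error" shrinks as training proceeds.
state = {"err": 1.0}
iters, err = train(lambda: state.update(err=state["err"] * 0.99),
                   lambda: state["err"])
assert err < 0.05
```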
Correspondingly, an embodiment of the present invention also provides a deep-learning-based multi-service control system for smart audio-visual equipment. As shown in Fig. 3, the system includes: a microphone array 1, a voice preprocessing module 2, a remote GPU server 3, an Internet connection module 4, a control signal parsing module 5, and a control signal output module 6; wherein,
the microphone array 1 monitors and collects, at a specific frequency, the voice control signals issued by the user;
the voice preprocessing module 2 extracts features from the voice control signal to obtain MFCC raw voice feature information, detects whether the log energy of the MFCC raw voice features exceeds a threshold, and if so sends the MFCC raw voice feature information through the Internet connection module 4 to the remote GPU server 3;
the remote GPU server 3 receives the MFCC raw voice feature information, obtains deep voice feature information from it, and sends the corresponding control signal identification information to the Internet connection module 4;
the Internet connection module 4 passes the control signal identification information to the control signal parsing module 5, which generates a control signal code from the identification information, selects the corresponding control signal output module 6, and passes the control signal code to that output module 6;
the control signal output module 6 sends a control signal to the smart audio-visual equipment according to the control signal code.
In an embodiment of the present invention, the microphone array 1 collects the voice signals issued by the user in real time and sends them to the voice preprocessing module 2.
The voice preprocessing module 2 is responsible for endpoint detection, noise reduction, and MFCC raw voice feature extraction on the voice signal.
The Internet connection module 4 is responsible for establishing a network connection with the remote GPU server 3, sending the MFCC raw voice feature information to the remote GPU server 3, and receiving feedback messages from the remote GPU server 3.
The control signal parsing module 5 is responsible for parsing the feedback messages from the remote GPU server 3 and, according to the message content, enabling the corresponding control signal output module 6 or performing error handling.
There are multiple control signal output modules 6; each control signal output unit is equipped with hardware supporting one wireless communication mode and is responsible for controlling all smart audio-visual equipment based on that communication mode. These communication modes include infrared, Bluetooth, Z-Wave, and so on.
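The parsing module's selection among the per-protocol output modules can be sketched as a routing table keyed by protocol. The code-prefix convention and class names here are assumptions for illustration, not from the patent:

```python
class OutputModule:
    """One control signal output module per wireless protocol
    (infrared, Bluetooth, Z-Wave, ...)."""
    def __init__(self, protocol):
        self.protocol = protocol
        self.sent = []

    def send(self, code):
        self.sent.append(code)  # real hardware would transmit here

# Hypothetical routing table from control-signal-code prefix to module.
MODULES = {"ir": OutputModule("infrared"),
           "bt": OutputModule("bluetooth"),
           "zw": OutputModule("z-wave")}

def dispatch(signal_code):
    """Parsing-module behaviour: pick the output module from the
    code's protocol prefix and hand it the control signal code;
    an unknown protocol triggers error handling."""
    proto, _, _ = signal_code.partition(":")
    module = MODULES.get(proto)
    if module is None:
        raise ValueError("no output module for protocol: " + proto)
    module.send(signal_code)

dispatch("ir:tv_power")
dispatch("bt:speaker_play")
assert MODULES["ir"].sent == ["ir:tv_power"]
assert MODULES["bt"].sent == ["bt:speaker_play"]
```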
The remote GPU server 3 provides the user with a recognition service for smart audio-visual equipment voice control commands.
Further, the voice preprocessing module 2 includes:
a segmentation unit, for performing endpoint detection and segmentation on the voice control signal;
a noise reduction unit, for performing noise reduction on the segmented voice control signal;
an extraction unit, for performing MFCC raw voice feature extraction on the noise-reduced voice control signal to obtain the MFCC raw voice feature information.
The remote GPU server 3 receives the MFCC raw voice feature information, starts a deep learning speech recognition program, and performs deep voice feature extraction on it using the biLSTM algorithm, obtaining the deep voice feature information.
The remote GPU server 3 receives the MFCC raw voice feature information, performs deep voice feature extraction on it, obtains the deep voice feature information, and sends the corresponding control signal identification information to the Internet connection module 4;
the remote GPU server 3 classifies the deep voice feature information, obtains the corresponding category, and detects whether that category corresponds to a control signal identifier; if so, it returns the control signal identification information to the Internet connection module 4.
Specifically, for the operating principles of the functional modules of the system according to the embodiments of the present invention, reference may be made to the related description of the method embodiments, which is not repeated here.
By implementing the embodiments of the present invention, natural voice can be used to control a variety of smart audio-visual devices that are based on different control protocols and implement different services, providing a unified, natural, efficient, and low-cost human-machine interaction mode for smart audio-visual equipment. Deploying the complex deep learning tasks on a remote server reduces the user's hardware and energy costs while providing a high-performance, low-cost recognition service for smart audio-visual equipment voice control commands, improving the recognition accuracy of those commands.
Those of ordinary skill in the art will appreciate that all or part of the steps in the various methods of the above embodiments can be completed by a program instructing the relevant hardware; the program can be stored in a computer-readable storage medium, and the storage medium may include: a read-only memory (ROM, Read Only Memory), a random access memory (RAM, Random Access Memory), a magnetic disk, an optical disc, and the like.
In addition, the deep-learning-based multi-service control method and system for smart audio-visual equipment provided by the embodiments of the present invention have been described in detail above. Specific examples have been used herein to explain the principles and embodiments of the invention, and the description of the above embodiments is only intended to help understand the method of the present invention and its core idea. Meanwhile, for those skilled in the art, there will be changes in the specific embodiments and the scope of application according to the idea of the present invention. In summary, the contents of this specification should not be construed as limiting the invention.
Claims (8)
1. A deep-learning-based multi-service control method for smart audio-visual equipment, characterized in that the method comprises:
a microphone array monitors and collects, at a specific frequency, the voice control signals issued by the user;
a voice preprocessing module extracts features from the voice control signal to obtain MFCC raw voice feature information, and detects whether the log energy of the MFCC raw voice features exceeds a threshold; if so, the MFCC raw voice feature information is sent through an Internet connection module to a remote GPU server;
the remote GPU server receives the MFCC raw voice feature information, obtains deep voice feature information from it, and sends the corresponding control signal identification information to the Internet connection module;
the Internet connection module passes the control signal identification information to a control signal parsing module, which generates a control signal code from the identification information, selects the corresponding control signal output module, and passes the control signal code to that output module;
the control signal output module sends a control signal to the smart audio-visual equipment according to the control signal code.
2. The deep-learning-based multi-service control method for smart audio-visual equipment as claimed in claim 1, characterized in that the step in which the voice preprocessing module extracts features from the voice control signal and obtains the MFCC raw voice feature information includes:
performing endpoint detection and segmentation on the voice control signal;
performing noise reduction on the segmented voice control signal;
performing MFCC raw voice feature extraction on the noise-reduced voice control signal to obtain the MFCC raw voice feature information.
3. The multi-service control method for smart audio/video devices based on deep learning as claimed in claim 1, characterized in that the step in which the remote GPU server receives the MFCC raw speech feature information, performs deep speech feature extraction on the MFCC raw speech feature information, and obtains the deep speech feature information comprises:
the remote GPU server receiving the MFCC raw speech feature information, starting the deep-learning speech recognition program, performing deep speech feature extraction on the MFCC raw speech feature information with the BiLSTM algorithm, and obtaining the deep speech feature information.
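The claim names a BiLSTM for deep feature extraction. The sketch below shows the core idea in plain numpy: run an LSTM over the MFCC frame sequence forwards and backwards, then concatenate the two final hidden states into one deep feature vector. The dimensions and random weights are illustrative only; the server would use a trained network.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_pass(xs, W, U, b, hidden):
    """Run a single-direction LSTM over the frame sequence xs; return final h."""
    h = np.zeros(hidden)
    c = np.zeros(hidden)
    for x in xs:
        z = W @ x + U @ h + b  # all four gate pre-activations at once
        i, f, o = (sigmoid(z[k * hidden:(k + 1) * hidden]) for k in range(3))
        g = np.tanh(z[3 * hidden:])
        c = f * c + i * g      # cell state update
        h = o * np.tanh(c)     # hidden state update
    return h

def bilstm_feature(mfcc_frames, W, U, b, hidden):
    """Concatenate forward and backward final states into one deep feature."""
    fwd = lstm_pass(mfcc_frames, W, U, b, hidden)
    bwd = lstm_pass(mfcc_frames[::-1], W, U, b, hidden)
    return np.concatenate([fwd, bwd])

# Illustrative sizes: 13 MFCCs per frame, 8 hidden units, 20 frames.
n_mfcc, hidden, T = 13, 8, 20
W = rng.normal(scale=0.1, size=(4 * hidden, n_mfcc))
U = rng.normal(scale=0.1, size=(4 * hidden, hidden))
b = np.zeros(4 * hidden)
frames = rng.normal(size=(T, n_mfcc))  # stand-in for an MFCC sequence
feat = bilstm_feature(frames, W, U, b, hidden)
```

Because the backward pass sees the end of the utterance first, the concatenated feature summarizes context from both directions, which is the usual motivation for a bidirectional model here.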
4. The multi-service control method for smart audio/video devices based on deep learning as claimed in claim 1, characterized in that the step in which the remote GPU server receives the MFCC raw speech feature information, obtains the deep speech feature information from the MFCC raw speech feature information, and sends the control signal identification information corresponding to the deep feature information to the Internet connection module comprises:
the remote GPU server receiving the MFCC raw speech feature information, performing deep speech feature extraction on it, obtaining the deep speech feature information, and sending the control signal identification information corresponding to the deep feature information to the Internet connection module;
the remote GPU server classifying the deep speech feature information, obtaining the class corresponding to the deep speech feature information, and detecting whether that class corresponds to a control signal identifier; if so, returning the control signal identification information to the Internet connection module.
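The classification step maps a deep feature vector to a class and then checks whether that class corresponds to a control signal identifier. A minimal sketch with a softmax classifier; the class-to-identifier table, identifier names, and toy weights are hypothetical, since the patent does not enumerate them:

```python
import numpy as np

# Hypothetical mapping from predicted class index to control-signal ID.
CONTROL_SIGNAL_IDS = {0: "VOLUME_UP", 1: "VOLUME_DOWN", 2: "POWER_OFF"}

def classify(feature, weights, bias):
    """Softmax classification of a deep speech feature vector."""
    logits = weights @ feature + bias
    probs = np.exp(logits - logits.max())  # shift for numerical stability
    probs /= probs.sum()
    return int(np.argmax(probs))

def to_control_signal(feature, weights, bias):
    """Return the control-signal ID for the predicted class, or None when
    the class does not correspond to any known control signal."""
    cls = classify(feature, weights, bias)
    return CONTROL_SIGNAL_IDS.get(cls)

feature = np.array([1.0, 0.0, 0.0, 0.0])
weights = np.eye(4)   # toy weights: class i fires on feature component i
bias = np.zeros(4)
```

Returning None for an unmapped class mirrors the claim's "if so" branch: only recognized classes produce identification information for the Internet connection module.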
5. A multi-service control system for smart audio/video devices based on deep learning, characterized in that the system comprises: a microphone array, a speech preprocessing module, a remote GPU server, an Internet connection module, a control signal parsing module, and a control signal output module; wherein,
the microphone array monitors and collects, at a set frequency, the control speech uttered by the user;
the speech preprocessing module performs feature extraction on the control speech to obtain MFCC raw speech feature information, and detects whether the log energy of the MFCC raw speech features exceeds a threshold; if so, the MFCC raw speech feature information is transmitted to the remote GPU server through the Internet connection module;
the remote GPU server receives the MFCC raw speech feature information, obtains deep speech feature information from the MFCC raw speech feature information, and sends the control signal identification information corresponding to the deep feature information to the Internet connection module;
the Internet connection module passes the control signal identification information to the control signal parsing module; the control signal parsing module generates a control signal code from the control signal identification information, selects the corresponding control signal output module, and passes the control signal code to that control signal output module;
the control signal output module sends a control signal to the smart audio/video device according to the control signal code.
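The parsing step above turns an identifier into a control signal code and selects the output module that should carry it. A sketch of that dispatch; the table contents, codes, and module names ("ir", "wifi") are illustrative assumptions, not taken from the patent:

```python
# Hypothetical parsing table: each control-signal ID maps to a coded
# command plus the output module (e.g. infrared vs. Wi-Fi) that sends it.
PARSE_TABLE = {
    "VOLUME_UP": {"code": 0x01, "module": "ir"},
    "POWER_OFF": {"code": 0x10, "module": "wifi"},
}

def parse_control_signal(signal_id):
    """Generate the control-signal code and select the output module."""
    entry = PARSE_TABLE.get(signal_id)
    if entry is None:
        raise KeyError(f"unknown control signal: {signal_id}")
    return entry["code"], entry["module"]

code, module = parse_control_signal("POWER_OFF")
```

Keeping the code-to-module mapping in one table means adding a new service (a new output module) only extends the table, which matches the multi-service framing of the system.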
6. The multi-service control system for smart audio/video devices based on deep learning as claimed in claim 5, characterized in that the speech preprocessing module comprises:
a segmentation unit for performing endpoint detection and segmentation on the control speech;
a noise reduction unit for performing noise reduction on the segmented control speech;
an extraction unit for performing MFCC raw speech feature extraction on the noise-reduced control speech to obtain the MFCC raw speech feature information.
7. The multi-service control system for smart audio/video devices based on deep learning as claimed in claim 5, characterized in that the remote GPU server receives the MFCC raw speech feature information, starts the deep-learning speech recognition program, performs deep speech feature extraction on the MFCC raw speech feature information with the BiLSTM algorithm, and obtains the deep speech feature information.
8. The multi-service control system for smart audio/video devices based on deep learning as claimed in claim 5, characterized in that the remote GPU server receives the MFCC raw speech feature information, performs deep speech feature extraction on it, obtains the deep speech feature information, and sends the control signal identification information corresponding to the deep feature information to the Internet connection module; and
the remote GPU server classifies the deep speech feature information, obtains the class corresponding to the deep speech feature information, and detects whether that class corresponds to a control signal identifier; if so, it returns the control signal identification information to the Internet connection module.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201611144430.6A CN106653020A (en) | 2016-12-13 | 2016-12-13 | Multi-business control method and system for smart sound and video equipment based on deep learning |
Publications (1)
Publication Number | Publication Date |
---|---|
CN106653020A true CN106653020A (en) | 2017-05-10 |
Family
ID=58824998
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201611144430.6A Pending CN106653020A (en) | 2016-12-13 | 2016-12-13 | Multi-business control method and system for smart sound and video equipment based on deep learning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106653020A (en) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101221762A (en) * | 2007-12-06 | 2008-07-16 | 上海大学 | MP3 compression field audio partitioning method |
CN104952448A (en) * | 2015-05-04 | 2015-09-30 | 张爱英 | Method and system for enhancing features by means of bidirectional long short-term memory recurrent neural networks |
CN105045122A (en) * | 2015-06-24 | 2015-11-11 | 张子兴 | Intelligent household natural interaction system based on audios and videos |
CN105700359A (en) * | 2014-11-25 | 2016-06-22 | 上海天脉聚源文化传媒有限公司 | Method and system for controlling smart home through speech recognition |
2016-12-13: Application filed in China (CN201611144430.6A); status: Pending
Cited By (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108074575A (en) * | 2017-12-14 | 2018-05-25 | 广州势必可赢网络科技有限公司 | An identity authentication method and device based on recurrent neural networks |
CN109559761A (en) * | 2018-12-21 | 2019-04-02 | 广东工业大学 | A stroke-risk prediction method based on deep speech features |
CN110428821A (en) * | 2019-07-26 | 2019-11-08 | 广州市申迪计算机系统有限公司 | A voice command control method and device for an inspection robot |
CN111357051A (en) * | 2019-12-24 | 2020-06-30 | 深圳市优必选科技股份有限公司 | Speech emotion recognition method, intelligent device and computer readable storage medium |
WO2021127982A1 (en) * | 2019-12-24 | 2021-07-01 | 深圳市优必选科技股份有限公司 | Speech emotion recognition method, smart device, and computer-readable storage medium |
CN111357051B (en) * | 2019-12-24 | 2024-02-02 | 深圳市优必选科技股份有限公司 | Speech emotion recognition method, intelligent device and computer readable storage medium |
CN111783892A (en) * | 2020-07-06 | 2020-10-16 | 广东工业大学 | Robot instruction identification method and device, electronic equipment and storage medium |
CN113921016A (en) * | 2021-10-15 | 2022-01-11 | 阿波罗智联(北京)科技有限公司 | Voice processing method, device, electronic equipment and storage medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN106653020A (en) | Multi-business control method and system for smart sound and video equipment based on deep learning | |
CN108000526B (en) | Dialogue interaction method and system for intelligent robot | |
JP6828001B2 (en) | Voice wakeup method and equipment | |
CN104777911B (en) | A kind of intelligent interactive method based on holographic technique | |
CN105810200A (en) | Man-machine dialogue apparatus and method based on voiceprint identification | |
CN110263324A (en) | Text handling method, model training method and device | |
CN107870994A (en) | Man-machine interaction method and system for intelligent robot | |
CN107293289A (en) | A speech generation method based on deep convolutional generative adversarial networks | |
JP2020034895A (en) | Responding method and device | |
CN101834809B (en) | Internet instant message communication system | |
CN110379441B (en) | Voice service method and system based on adversarial artificial intelligence networks | |
CN102298694A (en) | Man-machine interaction identification system applied to remote information service | |
CN107704612A (en) | Dialogue exchange method and system for intelligent robot | |
CN205508398U (en) | Intelligent robot with cloud-based interaction function | |
CN105244042B (en) | A speech emotion interaction device and method based on finite-state automata | |
CN106486122A (en) | An intelligent voice interaction robot | |
US20190371319A1 (en) | Method for human-machine interaction, electronic device, and computer-readable storage medium | |
CN116431316B (en) | Task processing method, system, platform and automatic question-answering method | |
CN111368142A (en) | Video intensive event description method based on generation countermeasure network | |
CN108053826A (en) | For the method, apparatus of human-computer interaction, electronic equipment and storage medium | |
CN106708950B (en) | Data processing method and device for intelligent robot self-learning system | |
CN109933773A (en) | A multi-semantic sentence analysis system and method | |
CN209328511U (en) | A portable AI voice interaction control system | |
CN112860213B (en) | Audio processing method and device, storage medium and electronic equipment | |
CN117193524A (en) | Man-machine interaction system and method based on multi-mode feature fusion |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20170510 |