CN106997243A - Speech scene monitoring method and device based on intelligent robot - Google Patents
- Publication number
- CN106997243A (application CN201710192637.9A)
- Authority
- CN
- China
- Prior art keywords
- speech
- user
- data
- lecture
- modal
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F3/00—Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
- G06F3/01—Input arrangements or combined input and output arrangements for interaction between user and computer
- G06F3/011—Arrangements for interaction with the human body, e.g. for user immersion in virtual reality
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/70—Information retrieval; Database structures therefor; File system structures therefor of video data
- G06F16/78—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
- G06F16/783—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/10—Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
- G06V40/16—Human faces, e.g. facial parts, sketches or expressions
- G06V40/174—Facial expression recognition
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/48—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
- G10L25/51—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
Abstract
The invention discloses a speech scene monitoring method and device based on an intelligent robot. The method includes: acquiring multi-modal data of a user giving a speech in a virtual speech scene, the multi-modal data including at least voice data; parsing the multi-modal data of the user's speech; obtaining, with a speech depth model based on a deep learning algorithm, a speech reference data group corresponding to the text of the voice data, the speech reference data group being a collection of instructive, exemplary speech data; comparing the parsing result with the determined speech reference data group according to preset speech elements; and outputting, according to the comparison result, multi-modal output data for guiding the user's speech. The speech scene monitoring system based on an intelligent robot of the present invention can help the user practice public speaking, brings the robot closer to practical application scenes, satisfies user needs, enhances the multi-modal interaction capability of the intelligent robot, and improves the user experience.
Description
Technical field
The present invention relates to the field of intelligent robotics, and more particularly to a speech scene monitoring method and device based on an intelligent robot.
Background art
With the continuous development of science and technology and the introduction of information technology, computer technology and artificial intelligence, robotics research has gradually moved beyond the industrial field and extended into domains such as medical treatment, health care, the home, entertainment and the service industry. Accordingly, what people expect of a robot has risen from simple, repetitive mechanical actions to an intelligent robot capable of anthropomorphic question answering, autonomy and interaction with other robots, so that human-machine interaction has become the key factor determining the development of intelligent robots. Therefore, improving the interaction capability of intelligent robots and making them more human-like and intelligent is an important problem that urgently needs to be solved at present.
Summary of the invention
One of the technical problems to be solved by the invention is the need for a solution that can help a user practice public speaking and bring the robot closer to practical application scenes.
To solve the above technical problem, an embodiment of the application first provides a speech scene monitoring method based on an intelligent robot. The method includes: acquiring multi-modal data of a user giving a speech in a virtual speech scene, the multi-modal data including at least voice data; parsing the multi-modal data of the user's speech; obtaining, with a speech depth model based on a deep learning algorithm, the speech reference data group corresponding to the text of the voice data, the speech reference data group being a collection of instructive, exemplary speech data; comparing the parsing result with the determined speech reference data group according to preset speech elements; and outputting, according to the comparison result, multi-modal output data for guiding the user's speech.
Preferably, the multi-modal data includes voice information of the user's speech in the virtual speech scene, and, based on the voice information, whether the user's pronunciation, intonation and pause time satisfy set rules is judged by comparison.
Preferably, the multi-modal data includes image information of the user's speech in the virtual speech scene, and, based on the image information, whether the user's facial expression and posture satisfy set rules is judged by comparison.
Preferably, the method further includes: extracting the speech content of the user according to the parsing result and providing video information associated with the speech content of the user, so as to guide the user's speech; or providing, through the intelligent robot, virtual robot demonstration data associated with the speech content of the user.
Preferably, the method is realized by an intelligent robot configured with a speech APP, the robot being loaded with a robot operating system; the virtual speech scene is produced by AR/VR equipment, and the AR/VR equipment operates in cooperation with the speech APP of the intelligent robot; or the virtual robot demonstration data associated with the speech content of the user is provided in the AR/VR equipment.
An embodiment of the present invention further provides a speech scene monitoring device. The device includes: a speech data acquisition module that acquires multi-modal data of a user giving a speech in a virtual speech scene, the multi-modal data including at least voice data; one or more processors; and logic encoded in one or more tangible media for execution by the one or more processors, the logic when executed performing the following operations: parsing the multi-modal data of the user's speech; obtaining, with a speech depth model based on a deep learning algorithm, the speech reference data group corresponding to the text of the voice data, the speech reference data group being a collection of instructive, exemplary speech data; comparing the parsing result with the determined speech reference data group according to preset speech elements; and outputting, according to the comparison result, multi-modal output data for guiding the user's speech.
Preferably, the multi-modal data includes voice information of the user's speech in the virtual speech scene, and the logic when executed further performs the following operation: based on the voice information, judging by comparison whether the user's pronunciation, intonation and pause time satisfy set rules.
Preferably, the multi-modal data includes image information of the user's speech in the virtual speech scene, and the logic when executed further performs the following operation: based on the image information, judging by comparison whether the user's facial expression and posture satisfy set rules.
Preferably, the device further includes a speech video output module that extracts the speech content of the user according to the parsing result and provides video information associated with the speech content of the user, so as to guide the user's speech; or the logic when executed further performs the following operation: extracting the speech content of the user according to the parsing result and providing virtual robot demonstration data associated with the speech content of the user.
Preferably, the device is realized by an intelligent robot configured with a speech APP, the robot being loaded with a robot operating system; the virtual speech scene is produced by AR/VR equipment, and the AR/VR equipment operates in cooperation with the speech APP of the intelligent robot; or the virtual robot demonstration data associated with the speech content of the user is provided in the AR/VR equipment.
Compared with the prior art, one or more embodiments of the above scheme can have the following advantages or beneficial effects: the embodiment of the present invention gives speech guidance, through an intelligent robot, to a user in a virtual speech scene. When the user gives a speech, the multi-modal data of the user's speech in the virtual speech scene is acquired and parsed; according to preset speech elements, the parsing result is compared with the determined speech reference data group, and multi-modal output data for guiding the user's speech is output according to the comparison result. The speech scene monitoring system based on an intelligent robot of the embodiment of the present invention can help the user practice public speaking, brings the robot closer to practical application scenes, satisfies user needs, enhances the multi-modal interaction capability of the intelligent robot, and improves the user experience.
Other features and advantages of the present invention will be set forth in the following description, and in part will become apparent from the specification or be understood by implementing the technical scheme of the present invention. The objects and other advantages of the present invention can be realized and obtained through the structures and/or flows specifically pointed out in the specification, claims and accompanying drawings.
Brief description of the drawings
The accompanying drawings are provided for a further understanding of the technical scheme of the application or of the prior art, and constitute a part of the specification. The drawings illustrating the embodiments of the application are used, together with the embodiments, to explain the technical scheme of the application, but do not constitute a limitation on the technical scheme.
Fig. 1 is a schematic structural diagram of the speech scene monitoring device of the embodiment of the present invention.
Fig. 2 is a simplified flowchart of an example of the speech scene monitoring method based on an intelligent robot of the embodiment of the present invention.
Fig. 3 is a simplified flowchart of example one of the embodiment of the present invention, in which parsing processing is performed on the voice information of the user's speech.
Fig. 4 is a simplified flowchart of example two of the embodiment of the present invention, in which processing is performed on the image information of the user's speech.
Fig. 5 is a simplified flowchart of example three of the embodiment of the present invention, in which processing is performed on the ECG/EEG information of the user's speech.
Fig. 6 is a simplified flowchart of an example, in the embodiment of the present invention, of outputting the multi-modal data for guiding the user's speech according to the comparison result.
Detailed description of the embodiments
Embodiments of the present invention are described in detail below with reference to the drawings and examples, whereby the implementation process of how the present invention applies technical means to solve technical problems and achieve the relevant technical effects can be fully understood and carried out. The features of the embodiments of this application can be combined with each other provided they do not conflict, and the technical schemes so formed all fall within the protection scope of the present invention.
An embodiment of the speech scene monitoring device of the present invention is described with reference to Fig. 1. One example of the speech scene monitoring device may be a virtual experience terminal, or an intelligent robot that is loaded with a robot operating system and has multi-modal data interaction and multi-modal data parsing capabilities. Hereinafter, the speech scene monitoring device of the present invention is described taking a robot that realizes multi-modal interaction and parsing as an example. The speech scene monitoring device is applicable to various platforms and supports the applications and functions of a humanoid robot. In addition, the speech scene monitoring device may be implemented as an application carried in the robot operating system, or as a function realized in a certain mode of the robot.

The intelligent robot 1 can realize speech training for the user. As shown in Fig. 1, it mainly includes the following modules: a speech data acquisition module 10, a processor 20 and a multi-modal data display module 30. The function of each module is specifically described below.
First, the speech data acquisition module 10 is described. The module 10 mainly acquires the multi-modal data of the user's speech in the virtual speech scene; the multi-modal data can include information such as the speaker's limb actions, facial expressions, voice information (including the specific speech content, speech rate, intonation and pause frequency) and/or ECG/EEG signals. Further, as shown in Fig. 1, the module 10 mainly includes a depth camera 11, a voice input device 12 and an ECG/EEG monitoring device 13. In this example, the depth camera 11 replaces the traditional image sensor that collects two-dimensional image information and mainly provides more accurate information to the processor 20 for obtaining the limb action information of the user. The Microsoft Kinect depth camera is used in this example, and RGB images and depth images can be acquired with the OpenNI toolkit. Besides acquiring image data, the toolkit also has a skeleton tracking function: by analyzing human motion sequence images and tracking the human joints in each frame in real time, the three-dimensional coordinates of the human joint nodes can be extracted, thereby obtaining the kinematic parameters of the human body. On this basis, the limb actions of the speaker's speech in the virtual scene can be obtained. On the other hand, the depth camera 11 can also provide the facial expression information of the speaker to the processor 20, so that the processor 20 can detect the face in each frame and identify the current facial expression of the speaker. The microphone, as one kind of voice input device 12, can be a dynamic microphone, a MEMS microphone or an electret condenser microphone; among these, the electret condenser microphone is small, low in power consumption, cheap and of fairly good performance, so this kind of microphone is used as the sound sensor of the robot. In addition, in order to better train the user's speech, the device of this embodiment also includes an ECG/EEG monitoring device 13, through which the ECG/EEG data of the speaker during the simulated speech can be monitored for use by the processor 20, so as to cooperate with the image recognition result to more accurately determine the tension level or mood attribute of the current user.
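By way of a non-limiting sketch, the acquisition step might expose per-frame skeleton joints to the processor as below. The patent names Kinect and OpenNI but specifies no API, so the stub class, joint names and coordinates here are hypothetical stand-ins for the real SDK calls:

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

@dataclass
class Joint:
    name: str
    position: Tuple[float, float, float]  # (x, y, z) in metres, camera frame

class DepthCameraStub:
    """Hypothetical stand-in for the Kinect/OpenNI skeleton-tracking interface."""
    def track_skeleton(self) -> List[Joint]:
        # A real implementation would return the joints tracked in this frame.
        return [Joint("neck", (0.0, 1.5, 1.9)), Joint("hip", (0.0, 1.0, 2.0)),
                Joint("left_wrist", (-0.3, 0.9, 2.0)), Joint("left_elbow", (-0.25, 1.1, 2.0)),
                Joint("right_wrist", (0.3, 0.9, 2.0)), Joint("right_elbow", (0.25, 1.1, 2.0))]

def capture_joints(camera: DepthCameraStub) -> Dict[str, Joint]:
    """One frame of limb-action data for the graphics processing unit 211."""
    return {j.name: j for j in camera.track_skeleton()}
```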
Next, the processor 20 is described. The processor 20 executes logic encoded in one or more tangible media; when executed, the logic performs the following operations: parsing the multi-modal data of the user's speech; obtaining, with a speech depth model based on a deep learning algorithm, the speech reference data group corresponding to the text of the voice data, the speech reference data group being a collection of instructive, exemplary speech data; comparing the parsing result with the determined speech reference data group according to preset speech elements; and outputting, according to the comparison result, multi-modal output data for guiding the user's speech. As shown in Fig. 1, the processor 20 includes a processor unit 21 consisting of one processor or multiple processors (e.g., labels 211, 212, 213), an I/O interface 22 and a memory 23.
It should be noted that a "processor" includes any appropriate hardware and/or software system, mechanism or component that processes data, signals or other information. A processor can include a system with a general-purpose central processing unit (CPU), multiple processing units, a special-purpose circuit for realizing a particular function, or other systems. Processing need not be limited to a geographic location or have temporal limitations; for example, a processor can perform its functions in "real time", "offline", in "batch mode", and so on. Parts of the processing can be performed at different times and in different locations by different (or the same) processing systems. A computer can be any processor in communication with a memory. The memory can be any appropriate processor-readable storage medium, such as a random access memory (RAM), a read-only memory (ROM), a magnetic or optical disk, or another tangible medium suitable for storing instructions to be executed by the processor.
Specifically, the processor unit 21 includes a graphics processing unit 211, a sound processing unit 212, an electric-wave processing unit 213, a data resolution module 214, a guide data output module 215 and a speech video output module 216. The graphics processing unit 211, sound processing unit 212 and electric-wave processing unit 213 parse the acquired multi-modal data. Specifically, the graphics processing unit 211 possesses an image preprocessing function, a feature extraction function, a decision function and concrete application functions. Image preprocessing mainly performs basic processing on the acquired visual data, including color space conversion, edge extraction, image conversion and image thresholding. Feature extraction mainly extracts characteristic information of the target in the image, such as skin color, color, texture, motion and coordinates. Decision-making mainly distributes the characteristic information, according to a certain decision strategy, to the concrete applications that need it. The concrete application functions realize functions such as face detection, human limb recognition and motion detection. The sound processing unit 212 uses speech recognition technology to perform language comprehension analysis on natural speech information and obtain the semantic information of the user's utterance, and it further determines the speaker's speech rate, intonation and pause frequency by analyzing the voice content. The electric-wave processing unit 213 preprocesses the collected ECG/EEG signals to remove mixed-in artifacts, and then performs feature extraction on the artifact-free EEG signals; these features can be time-domain features, frequency-domain features or time-frequency features. The mood of the user is determined according to these features and the EEG features, obtained beforehand from training samples, that correspond to different moods (such as calm, happy, sad, frightened). Besides the three common kinds of features above, many other features can also be extracted from EEG signals, such as entropy, fractal dimension and customized features.
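The paragraph above only names the feature families; a minimal sketch of such frequency-domain and entropy features, assuming numpy/scipy (which the patent does not prescribe) and an illustrative 256 Hz sampling rate, could look like this:

```python
import numpy as np
from scipy.signal import welch

BANDS = {"delta": (1, 4), "theta": (4, 8), "alpha": (8, 13), "beta": (13, 30)}

def eeg_features(signal, fs=256):
    """Band powers and spectral entropy of one cleaned EEG segment."""
    freqs, psd = welch(signal, fs=fs, nperseg=fs * 2)   # power spectral density
    feats = {}
    for name, (lo, hi) in BANDS.items():
        mask = (freqs >= lo) & (freqs < hi)
        feats[f"{name}_power"] = float(np.trapz(psd[mask], freqs[mask]))
    p = psd / psd.sum()                                 # normalise to a distribution
    feats["spectral_entropy"] = float(-np.sum(p * np.log2(p + 1e-12)))
    return feats
```

A mood classifier trained on labelled samples would then consume such feature dictionaries, as described above.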
The data resolution module 214 obtains, with the speech depth model based on a deep learning algorithm, the speech reference data group corresponding to the text of the voice data, and compares the parsing result with the determined speech reference data group according to preset speech elements. The preset speech elements can include the accuracy of emotion expression, the number of walks, the frequency/monotony of limb actions, the reasonableness of the standing posture (including whether the speaker hunches and whether the hands hang naturally), the frequency of hand actions, the reasonableness of voice intonation, the reasonableness of pauses, and other elements. The speech depth model is obtained based on a deep learning algorithm. Specifically, the spoken text content, voice content and video image content of instructive speakers (for example, outstanding speakers of a high speaking level) are collected in advance using speech recognition technology and machine vision technology, and deep learning is performed on the tone, mood, limb actions and so on corresponding to the words at touching or moving passages. More particularly, a large amount of speech video data of outstanding speakers is collected in advance, and each video is first processed as follows: touching speech paragraphs are filtered out, for example periods during which the speaker's mood fluctuates more strongly; voice recognition processing is performed on the video of each such period to obtain the text content, speech intonation and pause frequency of the paragraph; and image processing is performed on the image information to determine the limb actions and emotional characteristics corresponding to the different text contents in the period. The processed data of all the videos serves as the training data set of the speech depth model network, and deep feature extraction is performed on this training data set based on a deep autoencoder and a deep neural network to complete the training of the speech depth model.
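A hedged sketch of this training step is given below: per-paragraph feature vectors distilled from the exemplary speech videos are compressed by a deep autoencoder, and a small head maps the code to the reference data group. PyTorch, the layer sizes and the 64-dimensional input / 8-dimensional output are illustrative assumptions; the patent only specifies that a deep autoencoder and a deep neural network are used:

```python
import torch
import torch.nn as nn

class SpeechDepthModel(nn.Module):
    def __init__(self, in_dim=64, code_dim=16, out_dim=8):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, 32), nn.ReLU(),
                                     nn.Linear(32, code_dim))
        self.decoder = nn.Sequential(nn.Linear(code_dim, 32), nn.ReLU(),
                                     nn.Linear(32, in_dim))
        self.head = nn.Linear(code_dim, out_dim)   # code -> reference data group

    def forward(self, x):
        code = self.encoder(x)
        return self.decoder(code), self.head(code)

model = SpeechDepthModel()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
mse = nn.MSELoss()
# Synthetic stand-ins for (paragraph features, reference data group) pairs.
dataset = [(torch.randn(64), torch.randn(8)) for _ in range(100)]
for features, reference in dataset:
    reconstruction, prediction = model(features)
    # Reconstruction keeps the code informative; the head learns the guidance.
    loss = mse(reconstruction, features) + mse(prediction, reference)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```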
The data resolution module 214 takes the text content of the voice data of the acquired speech as input and obtains the corresponding speech reference data group through the speech depth model. The data group can include contents such as the reasonable number of walks, limb action reasonableness data, speech intonation reasonableness data and mood data for the speech of this paragraph. Then the parsed content of the multi-modal data of the user's speech (the user's actual speech behaviour) is contrasted with the speech reference data group, and the reasonableness of the voice, limb actions and so on shown during the user's speech is determined.
The guide data output module 215 outputs, according to the comparison result, the multi-modal output data for guiding the user's speech. Specifically, when the comparison result does not reach the set expectation, for example when fewer than a set number of the compared speech elements match, the expectation is regarded as not reached; multi-modal output data is then generated from the speech reference data group of the paragraph, and the standard way of delivering the speech is shown to the user.
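A minimal sketch of this expectation check, with illustrative element names, tolerance and threshold (none of which are fixed by the patent), might read:

```python
PRESET_ELEMENTS = ["walk_count", "gesture_frequency", "intonation_score", "pause_rate"]

def meets_expectation(parsed: dict, reference: dict,
                      tolerance: float = 0.2, required: int = 3):
    """Return (expectation met?, list of elements that failed the comparison)."""
    matched = [e for e in PRESET_ELEMENTS
               if abs(parsed[e] - reference[e])
               <= tolerance * max(abs(reference[e]), 1e-9)]
    return len(matched) >= required, [e for e in PRESET_ELEMENTS if e not in matched]

ok, failed = meets_expectation(
    {"walk_count": 6, "gesture_frequency": 0.9, "intonation_score": 0.5, "pause_rate": 0.12},
    {"walk_count": 3, "gesture_frequency": 0.4, "intonation_score": 0.6, "pause_rate": 0.10})
# ok is False here, so the module would emit the reference data group as guidance.
```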
The speech video output module 216 extracts the speech content of the user according to the parsing result and provides video information associated with the speech content of the user, so as to guide the user's speech. As shown in Fig. 1, a speech video database indexed by subject name or video outline keywords is stored in the memory 23, and the speech video output module 216 searches this database according to the speech content and selects the matching video information. Considering the limited local storage capacity of the robot, the guiding videos can also be placed in a cloud server; the speech video output module 216 then sends a video request to the cloud server through a network communication protocol to obtain the matching video information. The structure and function of the cloud server are not particularly limited here.
The multi-modal data output module 30 presents the multi-modal output data to the user in a multi-modal way. The module 30 mainly includes a display 31, a voice output device 32 and a limb operating mechanism 33. The display 31 can be a liquid crystal display whose screen is controlled to show the received video information and/or emotion expression information. The voice output device 32 can be a loudspeaker, which audibly outputs the received information in phonetic form to the user. The limb operating mechanism 33 shows the recommended limb actions to the user according to the received limb action instructions.
Besides outputting the guiding multi-modal data by means of the robot's physical hardware as above, the intelligent robot 1 of this example can also extract the speech content of the user according to the parsing result, provide virtual robot demonstration data associated with the speech content of the user, and show it on the display 31. Specifically, the intelligent robot 1 can use the speech reference data group generated by the data resolution module 214 to generate the virtual robot demonstration data; the sound is of course still output through the voice output device 32, while the instructive facial expressions, limb actions and so on of the virtual robot during this paragraph of the speech are realized based on the virtual robot demonstration data. The virtual robot can be a virtual figure onto which the whole appearance of the current user (including looks, physique, etc.) is mapped, so that the user can better understand, through the performance of the virtual robot, information such as the expressions and voice states that should be shown in his or her own speech.
In addition, in the embodiment of the present invention, the virtual speech scene is preferably created by the AR/VR equipment 40 shown in Fig. 1. The AR/VR equipment 40 constructs a speech scene in which hundreds or thousands of people listen to the user's speech as an audience. A dynamic speech scene can also be created by projection; although the experience of this approach is not as good as the AR/VR equipment 40, it can also be implemented as an embodiment of the present invention. On the other hand, the virtual robot demonstration data associated with the speech content of the user can also be provided in the AR/VR equipment, so that the virtual robot demonstrates the states that should be shown for this paragraph of speech content.
Fig. 2 is a simplified flowchart of an example of the speech scene monitoring method based on an intelligent robot of the embodiment of the present invention.
The flow of the speech scene monitoring method based on an intelligent robot of the present invention is generally described below with reference to Fig. 1 and Fig. 2. As shown in Fig. 2, first, in step S210, the speech data acquisition module 10 acquires the multi-modal data of the user's speech in the virtual speech scene. Then the graphics processing unit 211, sound processing unit 212, electric-wave processing unit 213 and the like in the processor 20 parse the multi-modal data of the user's speech (step S220). The data resolution module 214 in the processor 20 obtains, with the speech depth model based on a deep learning algorithm, the speech reference data group corresponding to the text of the voice data (step S230). In step S240, the guide data output module 215 compares the parsing result with the determined speech reference data group according to the preset speech elements. Finally, the speech video output module 216 outputs, according to the comparison result, the multi-modal output data for guiding the user's speech (step S250).
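Condensed into code, the five steps of Fig. 2 form one routine. The helpers are injected rather than defined here because each stands for a whole module described above (acquisition 10, parsing units 211 to 213, data resolution 214, guide output 215/216), so this is a structural sketch rather than the patent's implementation:

```python
def monitor_speech_paragraph(session, acquire, parse, depth_model, compare, emit):
    """One training round for a single paragraph (steps S210 to S250 of Fig. 2)."""
    data = acquire(session)                            # S210: multi-modal capture
    parsed = parse(data)                               # S220: image/sound/EEG parsing
    reference = depth_model(parsed["text"])            # S230: speech reference data group
    ok, failed_elements = compare(parsed, reference)   # S240: preset-element comparison
    if not ok:
        emit(reference, failed_elements)               # S250: guiding multi-modal output
    return ok
```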
Next, the process of example one, in which parsing processing is performed on the voice data of the user's speech, is described with reference to Fig. 3. To facilitate the robot's processing of the multi-modal speech data in the virtual speech scene, the user receives the robot's speech training one paragraph at a time. In this process, the depth camera 11, the voice input device 12 and the ECG/EEG monitoring device 13 collect the user's multi-modal speech data for a certain paragraph. Since this example concerns the processing of voice data, as shown in Fig. 3, the voice information is first extracted in step S310, and the sound processing unit 212 performs parsing processing on the voice information (step S320): the text information of the paragraph is obtained by speech recognition technology, and information such as the user's pronunciation, intonation, pause time/count and speech rate is detected by voice detection technology. Then, in step S330, the data resolution module 214 takes the text content as input and obtains the corresponding speech reference data group through the speech depth model; the data group at least includes the reasonable speech intonation, pause information and the like corresponding to this paragraph of speech content. In step S330 the data resolution module 214 assesses, by a comparison operation, whether the speaker's speech intonation, pause time and pause count are reasonable, for example where in the paragraph a short pause should be made and where the tone should be loud and sonorous; it can also determine the exact places where the pronunciation is inaccurate. When the set rules are not satisfied, the guide data output module 215 outputs the guiding multi-modal data. The guiding multi-modal data can include the evaluation result (the unreasonable contents), reasonableness suggestions (when to pause, when the intonation should be loud and when it should be subdued, etc.) as well as video information and/or the speech reference data group.
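As one way to make the pause check concrete, the sketch below finds pauses by short-time energy, using plain numpy since the patent names no library; the frame size and thresholds are illustrative:

```python
import numpy as np

def find_pauses(samples, sr=16000, frame_ms=25, thresh_ratio=0.1, min_pause_s=0.3):
    """Return (onset s, length s) of pauses: runs of low-energy frames."""
    samples = np.asarray(samples, dtype=float)
    frame = int(sr * frame_ms / 1000)
    n_frames = len(samples) // frame
    energy = np.array([np.mean(samples[i * frame:(i + 1) * frame] ** 2)
                       for i in range(n_frames)])
    silent = energy < thresh_ratio * energy.max()      # low-energy frames as silence
    pauses, start = [], None
    for i, s in enumerate(list(silent) + [False]):     # sentinel closes a trailing pause
        if s and start is None:
            start = i
        elif not s and start is not None:
            duration = (i - start) * frame_ms / 1000
            if duration >= min_pause_s:
                pauses.append((start * frame_ms / 1000, duration))
            start = None
    return pauses
```

The pause count and durations returned here would then be compared with the pause information of the reference data group for the paragraph.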
On the other hand, the limb actions and facial expressions of the user during the simulated speech also need to be evaluated; the specific flow is shown in Fig. 4. As shown in Fig. 4, the image information of the user's speech is extracted in step S410, and the graphics processing unit 211 performs image parsing operations (step S420) to obtain the user's limb action and facial expression information. In step S430 the data resolution module 214 judges, by a comparison operation, whether the speaker's limb actions are reasonable, for example whether the walking is reasonable, whether the limb actions are too frequent or too monotonous, whether the standing posture is reasonable, whether there is a hunchback phenomenon, whether the hands hang naturally, and whether the hand actions are too frequent. When the set rules are not satisfied, the guide data output module 215 outputs the guiding multi-modal data.
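A sketch of such rule checks on the tracked joints (the Joint objects from the capture sketch earlier) is given below; the axis conventions and thresholds are illustrative assumptions:

```python
def posture_flags(joints: dict, lean_thresh_m: float = 0.15) -> list:
    """Flag posture problems from one frame of tracked joints (step S430)."""
    flags = []
    neck, hip = joints["neck"], joints["hip"]
    # Camera z points away from the camera, so a neck noticeably closer to the
    # camera than the hip suggests a forward lean (hunchback phenomenon).
    if hip.position[2] - neck.position[2] > lean_thresh_m:
        flags.append("hunchback")
    lw, le = joints["left_wrist"], joints["left_elbow"]
    rw, re = joints["right_wrist"], joints["right_elbow"]
    # Wrists held above the elbows on both sides: hands not hanging naturally.
    if lw.position[1] > le.position[1] and rw.position[1] > re.position[1]:
        flags.append("hands_not_lowered")
    return flags
```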
In yet another aspect, as shown in Fig. 5, the collected ECG/EEG signals are also parsed to obtain the mood information of the user (step S520). Whether the current mood of the user satisfies the set rules is judged by comparison, and if not, guiding multi-modal data is output, for example a reasonableness suggestion informing the user of the mood that should be shown.
Fig. 6 is a simplified flowchart of an example, in the embodiment of the present invention, of outputting the multi-modal data for guiding the user's speech according to the comparison result. As shown in Fig. 6, whether matching video information exists in the video database 231 is queried first. Specifically, keywords are extracted from the text information of the speech paragraph (step S610); a keyword can be, for example, a word or phrase that occurs repeatedly. The video database 231 is then searched with the keywords as the primary key (step S620). If a match is found ("Yes" in step S630), the video information is output as the guiding multi-modal data to the display 31 and the voice output device 32 and demonstrated to the user (step S640). Otherwise, the speech reference data group is distributed as the guiding multi-modal data to the corresponding hardware executing mechanisms for multi-modal output, showing the recommended correct pronunciation, tone and pauses, the recommended limb actions and so on, so as to correct the places where the user's expression was imperfect; or virtual robot demonstration data associated with the speech content of the user is generated based on the speech reference data group and shown in a virtual way (step S650).
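Step S610 could be sketched as a simple frequency-based keyword extractor; the stop-word list is a tiny illustrative stand-in:

```python
import re
from collections import Counter

STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "that", "it"}

def extract_keywords(paragraph_text, top_k=3):
    """Pick the most frequent content words of the paragraph (step S610)."""
    words = re.findall(r"[a-z']+", paragraph_text.lower())
    counts = Counter(w for w in words if w not in STOP_WORDS and len(w) > 2)
    return [w for w, _ in counts.most_common(top_k)]

keywords = extract_keywords("Perseverance, perseverance above all: talent "
                            "without perseverance rarely carries a speech.")
# -> ['perseverance', ...]; find_guiding_video(keywords) then queries database 231.
```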
In one embodiment, the intelligent robot is configured with a speech APP, and the above method flow is realized by the speech APP; when the APP runs, it cooperates with the AR/VR equipment 40. Here, the AR/VR equipment 40 can also provide the virtual robot demonstration data associated with the speech content of the user.
The speech scene monitoring system based on an intelligent robot of the embodiment of the present invention can help the user practice public speaking, brings the robot closer to practical application scenes, satisfies user needs, enhances the multi-modal interaction capability of the intelligent robot, and improves the user experience.
Because the method for the present invention describes what is realized in computer systems.The computer system can for example be set
In the control core processor of robot.For example, method described herein can be implemented as what can be performed with control logic
Software, it is performed by the CPU in robot operating system.Function as described herein, which can be implemented as being stored in non-transitory, to be had
Programmed instruction set in shape computer-readable medium.When implemented in this fashion, the computer program includes one group of instruction,
When group instruction is run by computer, it, which promotes computer to perform, can implement the method for above-mentioned functions.FPGA can be temporary
When or be permanently mounted in non-transitory tangible computer computer-readable recording medium, for example ROM chip, computer storage,
Disk or other storage mediums.In addition to being realized with software, logic as described herein can utilize discrete parts, integrated electricity
Road, programmable the patrolling with programmable logic device (such as, field programmable gate array (FPGA) or microprocessor) combined use
Volume, or embodied including any other equipment that they are combined.All such embodiments are intended to fall under the model of the present invention
Within enclosing.
It should be understood that the disclosed embodiments of the invention are not limited to the specific structures, process steps or materials disclosed herein, but extend to their equivalents as would be understood by those of ordinary skill in the relevant art. It should also be understood that the terms used herein are only for describing specific embodiments and are not intended to be limiting.
Reference in the specification to "one embodiment" or "an embodiment" means that a particular feature, structure or characteristic described in conjunction with the embodiment is included in at least one embodiment of the present invention. Therefore, the phrases "one embodiment" or "an embodiment" appearing in various places throughout the specification do not necessarily all refer to the same embodiment.
Although the embodiments of the invention are disclosed as above, the described contents are only embodiments adopted to facilitate understanding of the present invention and are not intended to limit it. Any person skilled in the art to which this invention pertains can make modifications and changes in the form and details of implementation without departing from the spirit and scope disclosed by the invention; however, the patent protection scope of the present invention shall still be subject to the scope defined by the appended claims.
Claims (10)
1. A speech scene monitoring method based on an intelligent robot, the method including:
acquiring multi-modal data of a user giving a speech in a virtual speech scene, the multi-modal data including at least voice data;
parsing the multi-modal data of the user's speech;
obtaining, with a speech depth model based on a deep learning algorithm, a speech reference data group corresponding to the text of the voice data, the speech reference data group being a collection of instructive, exemplary speech data;
comparing the parsing result with the determined speech reference data group according to preset speech elements;
outputting, according to the comparison result, multi-modal output data for guiding the user's speech.
2. The method according to claim 1, characterized in that
the multi-modal data includes voice information of the user's speech in the virtual speech scene, and, based on the voice information, whether the user's pronunciation, intonation and pause time satisfy set rules is judged by comparison.
3. The method according to claim 1 or 2, characterized in that
the multi-modal data includes image information of the user's speech in the virtual speech scene, and, based on the image information, whether the user's facial expression and posture satisfy set rules is judged by comparison.
4. The method according to claim 1 or 2, characterized by further including:
extracting the speech content of the user according to the parsing result, and providing video information associated with the speech content of the user, so as to guide the user's speech,
or,
providing, through the intelligent robot, virtual robot demonstration data associated with the speech content of the user.
5. The method according to any one of claims 1 to 4, characterized in that
the method is realized by an intelligent robot configured with a speech APP, the robot being loaded with a robot operating system;
the virtual speech scene is produced by AR/VR equipment, and the AR/VR equipment operates in cooperation with the speech APP of the intelligent robot; or the virtual robot demonstration data associated with the speech content of the user is provided in the AR/VR equipment.
6. A speech scene monitoring device, the device including:
a speech data acquisition module that acquires multi-modal data of a user giving a speech in a virtual speech scene, the multi-modal data including at least voice data;
one or more processors;
logic encoded in one or more tangible media for execution by the one or more processors, the logic when executed performing the following operations: parsing the multi-modal data of the user's speech; obtaining, with a speech depth model based on a deep learning algorithm, a speech reference data group corresponding to the text of the voice data, the speech reference data group being a collection of instructive, exemplary speech data; comparing the parsing result with the determined speech reference data group according to preset speech elements; and outputting, according to the comparison result, multi-modal output data for guiding the user's speech.
7. The device according to claim 6, characterized in that
the multi-modal data includes voice information of the user's speech in the virtual speech scene,
and the logic when executed further performs the following operation: based on the voice information, judging by comparison whether the user's pronunciation, intonation and pause time satisfy set rules.
8. The device according to claim 6 or 7, characterized in that
the multi-modal data includes image information of the user's speech in the virtual speech scene,
and the logic when executed further performs the following operation: based on the image information, judging by comparison whether the user's facial expression and posture satisfy set rules.
9. The device according to claim 6 or 7, characterized by further including a speech video output module that extracts the speech content of the user according to the parsing result and provides video information associated with the speech content of the user, so as to guide the user's speech, or,
the logic when executed further performs the following operation: extracting the speech content of the user according to the parsing result and providing virtual robot demonstration data associated with the speech content of the user.
10. The device according to any one of claims 6 to 9, characterized in that
the device is realized by an intelligent robot configured with a speech APP, the robot being loaded with a robot operating system;
the virtual speech scene is produced by AR/VR equipment, and the AR/VR equipment operates in cooperation with the speech APP of the intelligent robot; or the virtual robot demonstration data associated with the speech content of the user is provided in the AR/VR equipment.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710192637.9A CN106997243B (en) | 2017-03-28 | 2017-03-28 | Speech scene monitoring method and device based on intelligent robot |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710192637.9A CN106997243B (en) | 2017-03-28 | 2017-03-28 | Speech scene monitoring method and device based on intelligent robot |
Publications (2)
Publication Number | Publication Date |
---|---|
CN106997243A | 2017-08-01 |
CN106997243B CN106997243B (en) | 2019-11-08 |
Family
ID=59431715
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710192637.9A Active CN106997243B (en) | 2017-03-28 | 2017-03-28 | Speech scene monitoring method and device based on intelligent robot |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106997243B (en) |
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20130093852A1 (en) * | 2011-10-12 | 2013-04-18 | Board Of Trustees Of The University Of Arkansas | Portable robotic device |
CN103065629A (en) * | 2012-11-20 | 2013-04-24 | 广东工业大学 | Speech recognition system of humanoid robot |
CN103714248A (en) * | 2013-12-23 | 2014-04-09 | 青岛优维奥信息技术有限公司 | Training system for competitive speech |
CN105488044A (en) * | 2014-09-16 | 2016-04-13 | 华为技术有限公司 | Data processing method and device |
CN106056207A (en) * | 2016-05-09 | 2016-10-26 | 武汉科技大学 | Natural language-based robot deep interacting and reasoning method and device |
Cited By (16)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109543812A (en) * | 2017-09-22 | 2019-03-29 | 吴杰 | A kind of specific true man's behavior fast modeling method |
CN110390845A (en) * | 2018-04-18 | 2019-10-29 | 北京京东尚科信息技术有限公司 | Robotic training method and device, storage medium and computer system under virtual environment |
CN109583363B (en) * | 2018-11-27 | 2022-02-11 | 湖南视觉伟业智能科技有限公司 | Method and system for detecting and improving posture and body movement of lecturer based on human body key points |
CN109583363A (en) * | 2018-11-27 | 2019-04-05 | 湖南视觉伟业智能科技有限公司 | The method and system of speaker's appearance body movement are improved based on human body critical point detection |
CN110333781A (en) * | 2019-06-17 | 2019-10-15 | 胡勇 | Simulated scenario operation method and system |
CN110333781B (en) * | 2019-06-17 | 2024-01-12 | 胡勇 | Method and system for simulating scene operation |
CN110491372A (en) * | 2019-07-22 | 2019-11-22 | 平安科技(深圳)有限公司 | A kind of feedback information generating method, device, storage medium and smart machine |
CN110647636A (en) * | 2019-09-05 | 2020-01-03 | 深圳追一科技有限公司 | Interaction method, interaction device, terminal equipment and storage medium |
CN113571087B (en) * | 2020-04-29 | 2023-07-28 | 宏达国际电子股份有限公司 | Method for generating action according to audio signal and electronic device |
CN113571087A (en) * | 2020-04-29 | 2021-10-29 | 宏达国际电子股份有限公司 | Method for generating action according to audio signal and electronic device |
CN111596761A (en) * | 2020-05-03 | 2020-08-28 | 清华大学 | Method and device for simulating lecture based on face changing technology and virtual reality technology |
CN112232127A (en) * | 2020-09-14 | 2021-01-15 | 辽宁对外经贸学院 | Intelligent speech training system and method |
CN113377971A (en) * | 2021-05-31 | 2021-09-10 | 北京达佳互联信息技术有限公司 | Multimedia resource generation method and device, electronic equipment and storage medium |
CN113377971B (en) * | 2021-05-31 | 2024-02-27 | 北京达佳互联信息技术有限公司 | Multimedia resource generation method and device, electronic equipment and storage medium |
CN116484318A (en) * | 2023-06-20 | 2023-07-25 | 新励成教育科技股份有限公司 | Lecture training feedback method, lecture training feedback device and storage medium |
CN116484318B (en) * | 2023-06-20 | 2024-02-06 | 新励成教育科技股份有限公司 | Lecture training feedback method, lecture training feedback device and storage medium |
Also Published As
Publication number | Publication date |
---|---|
CN106997243B (en) | 2019-11-08 |
Legal Events

Date | Code | Title | Description
---|---|---|---
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| GR01 | Patent grant | |
2023-09-27 | TR01 | Transfer of patent right | Patentee after: Beijing Virtual Dynamic Technology Co., Ltd. (6198, Floor 6, Building 4, Yard 49, Badachu Road, Shijingshan District, Beijing 100000). Patentee before: Beijing Guangnian Infinite Technology Co., Ltd. (Fourth Floor, Ivy League Youth Venture Studio No. 193, Yuquan Building, No. 3 Shijingshan Road, Shijingshan District, Beijing 100000).