CN106997243A

CN106997243A - Speech scene monitoring method and device based on intelligent robot

Info

Publication number: CN106997243A
Application number: CN201710192637.9A
Authority: CN
Inventors: 许豪劲
Original assignee: Beijing Guangnian Wuxian Technology Co Ltd
Current assignee: Beijing Virtual Point Technology Co Ltd
Priority date: 2017-03-28
Filing date: 2017-03-28
Publication date: 2017-08-01
Anticipated expiration: 2037-03-28
Also published as: CN106997243B

Abstract

The invention discloses a kind of speech scene monitoring method and device based on intelligent robot.Wherein, this method includes：The multi-modal data that user is given a lecture under virtually speech scene is obtained, multi-modal data at least includes speech data；The multi-modal data given a lecture user is parsed；Specifically given a lecture depth model using based on deep learning algorithm, obtain the speech authority data group of the text of the correspondence speech data, speech authority data group is has gathered the speech exemplary data with directiveness；According to default speech element, analysis result and the speech authority data group determined are compared；Multi-modal output data for instructing user to give a lecture is exported according to comparison result.The speech scene monitoring system based on intelligent robot of the present invention can help user to do speech training, make robot closer to practical application scene, meet user's request, and enhance the multi-modal interaction capabilities of intelligent robot, improve Consumer's Experience.

Description

Speech scene monitoring method and device based on intelligent robot

Technical field

The present invention relates to field in intelligent robotics, more particularly to a kind of speech scene monitoring method based on intelligent robot And device.

Background technology

With the continuous development of scientific technology, the introducing of information technology, computer technology and artificial intelligence technology, machine Industrial circle is progressively walked out in the research of people, gradually extend to the neck such as medical treatment, health care, family, amusement and service industry Domain.And people for the requirement of robot also conform to the principle of simplicity the multiple mechanical action of substance be promoted to anthropomorphic question and answer, independence and with The intelligent robot that other robot is interacted, man-machine interaction also just turns into the key factor for determining intelligent robot development. Therefore, the interaction capabilities of intelligent robot are lifted, improves the class human nature of robot and intelligent, is the important of present urgent need to resolve Problem.

The content of the invention

One of technical problems to be solved by the invention are to need offer one kind that user can be helped to do speech training, make machine Solution of the device people closer to practical application scene.

In order to solve the above-mentioned technical problem, embodiments herein provide firstly a kind of speech based on intelligent robot Scene monitoring method, this method includes：Obtain the multi-modal data that user is given a lecture under virtually speech scene, the multimode State data at least include speech data；The multi-modal data given a lecture user is parsed；Calculated using based on deep learning Method is specifically given a lecture depth model, obtains the speech authority data group of the text of the correspondence speech data, the speech specification Data group is to have gathered the speech exemplary data with directiveness；According to default speech element, analysis result is compared and true Fixed speech authority data group；Multi-modal output data for instructing user to give a lecture is exported according to comparison result.

Preferably, the multi-modal data includes the voice messaging that user is given a lecture under virtually speech scene, is based on Whether the voice messaging, meet setting rule by the voice, intonation and dead time of user described in contrast judgement.

Preferably, the multi-modal data includes the image information that user is given a lecture under virtually speech scene, is based on Whether described image information, meet setting rule by the facial expression and posture of contrast judgement user.

Preferably, in addition to：Extracting the speech content of user according to analysis result, there is provided in the speech with the user Hold associated video information, with consumer-oriented speech, or, provide the speech content with the user by intelligent robot Associated virtual robot demonstration data.

Preferably, methods described realizes that the robot is mounted with machine by being configured with speech APP intelligent robot People's operating system, the virtual speech scene is produced by AR/VR equipment, the AR/VR equipment and the intelligent robot Speech APP synthetic operations, or, there is provided the virtual robot associated with the speech content of the user in AR/VR equipment Demonstration data.

The embodiment of the present invention additionally provides a kind of speech scene monitoring device, and the device includes：Speech data acquisition module, It obtains the multi-modal data that user is given a lecture under virtually speech scene, and the multi-modal data at least includes voice number According to；One or more processors；Coding is used for by one or more of computing devices in one or more tangible mediums Logic, and the logic when executed be used for perform following operation：The multi-modal data given a lecture user is carried out Parsing；Specifically given a lecture depth model using based on deep learning algorithm, obtain the speech of the text of the correspondence speech data Authority data group, the speech authority data group is to have gathered the speech exemplary data with directiveness；Drilled according to default Element is said, analysis result and the speech authority data group determined is compared；And exported according to comparison result for instructing user to drill The multi-modal output data said.

Preferably, the multi-modal data includes the voice messaging that user is given a lecture under virtually speech scene, described The logic is further used for performing following operation when executed：Based on the voice messaging, by being used described in contrast judgement Whether voice, intonation and the dead time at family meet setting rule.

Preferably, the multi-modal data includes the image information that user is given a lecture under virtually speech scene, described Logic is further used for performing following operation when executed：Based on described image information, pass through the face of contrast judgement user Whether expression and posture meet setting rule.

Preferably, in addition to speech Video Output Modules, its according to analysis result extract the speech content of user there is provided The video information associated with the speech content of the user, with consumer-oriented speech, or, the logic is when executed It is further used for performing following operation：Extracting the speech content of user according to analysis result, there is provided the speech with the user The associated virtual robot demonstration data of content.

Preferably, described device realizes that the robot is mounted with machine by being configured with speech APP intelligent robot People's operating system, the virtual speech scene is produced by AR/VR equipment, the AR/VR equipment and the intelligent robot Speech APP synthetic operations, or, there is provided the virtual robot associated with the speech content of the user in AR/VR equipment Demonstration data.

Compared with prior art, one or more of such scheme embodiment can have the following advantages that or beneficial effect Really：

The embodiment of the present invention carries out speech guidance by intelligent robot to the user under virtual speech scene, When user is given a lecture, the multi-modal data that user is given a lecture under virtually speech scene is obtained, user is given a lecture Multi-modal data is parsed, and according to default speech element, compares analysis result and the speech authority data group determined, according to Comparison result exports the multi-modal output data for instructing user to give a lecture.The drilling based on intelligent robot of the embodiment of the present invention Say that scene monitoring system can help user to do speech training, make robot closer to practical application scene, meet user's request, And the multi-modal interaction capabilities of intelligent robot are enhanced, Consumer's Experience is improved.

Other features and advantages of the present invention will be illustrated in the following description, also, partly becomes from specification Obtain it is clear that or being understood by implementing technical scheme.The purpose of the present invention and other advantages can by Specifically noted structure and/or flow are realized and obtained in specification, claims and accompanying drawing.

Brief description of the drawings

Accompanying drawing is used for providing to the technical scheme of the application or further understanding for prior art, and constitutes specification A part.Wherein, the accompanying drawing of expression the embodiment of the present application is used for the technical side for explaining the application together with embodiments herein Case, but do not constitute the limitation to technical scheme.

Fig. 1 is the schematic configuration schematic diagram of the speech scene monitoring device of the embodiment of the present invention.

Fig. 2 is the simple flow of the example of the speech scene monitoring method based on intelligent robot of the embodiment of the present invention Figure.

Fig. 3 performs the simple flow of the example one of processing for the voice messaging given a lecture user of the embodiment of the present invention Figure.

Fig. 4 performs the simple flow of the example two of processing for the image information given a lecture user of the embodiment of the present invention Figure.

Fig. 5 performs the letter of the example three of processing for the electrocardio given a lecture user/brain electric information of the embodiment of the present invention Change flow chart.

Fig. 6 exports the example that the multi-modal data for instructing user to give a lecture is handled according to comparison result for the embodiment of the present invention Simplified flowchart.

Embodiment

Describe embodiments of the present invention in detail below with reference to drawings and Examples, how the present invention is applied whereby Technological means solves technical problem, and reaches the implementation process of relevant art effect and can fully understand and implement according to this.This Shen Each feature that please be in embodiment and embodiment, can be combined with each other under the premise of not colliding, the technical scheme formed Within protection scope of the present invention.

The embodiment of the speech scene monitoring device of the present invention is illustrated referring to Fig. 1.It is used as speech scene prison One example of control device can be virtual experience terminal or be mounted with the intelligent robot of robot operating system, The robot multi-modal data interactive function and multi-modal data analytical capabilities.Hereinafter, to realize multi-modal interaction and solution Exemplified by the robot of analysis, to illustrate the speech scene monitoring device of the present invention.The speech scene monitoring device is applied to a variety of flat Platform, and, the application of anthropomorphic robot and function are supported.In addition, the application of speech scene monitoring device, the application can be carried In robot operating system, or the function by being realized under certain pattern of robot.

The intelligent robot 1 can realize the speech training to user, as shown in figure 1, it mainly includes following module：Drill Data acquisition module 10, processor 20 and multi-modal data display module 30 are said, the function of modules is specifically described below.

First come illustrate give a lecture data acquisition module 10, the module 10 mainly obtain user virtually speech scene under carry out Multi-modal data during speech, the multi-modal data can include limb action, facial expression, the voice letter when speaker gives a lecture Cease the information such as (including specific speech content, word speed, intonation, pause frequency) and/or electrocardio/brain wave.Further, such as Fig. 1 institutes Show, the module 10 mainly includes depth camera 11, voice-input device 12 and electrocardio/eeg monitoring equipment 13.Wherein, this example is adopted The imaging sensor of tradition collection two-dimensional image information is substituted with depth camera 11, mainly provides more smart for processor 20 True information is to obtain the limb action information of user.The Kinect depth cameras of Microsoft are used in this example, and it can be adopted RGB image and depth image are obtained with OpenNI kits.Except obtain view data in addition to, the kit also have skeleton with Track function, can extract human body pass by analyzing human motion sequence image with the human synovial in the every two field picture of real-time tracking The three-dimensional coordinate of node, so as to obtain the kinematic parameter of human body.Based on this, speaker can be obtained and drilled under virtual scene The limb action said.On the other hand, depth camera 11 can also provide the facial expression information of speaker for processor 20, so that The processor 20 can be detected to the face in every two field picture, identify the facial expression of current speaker.Microphone is made Can be dynamic microphones, MEMS microphone and electret condenser microphone for one kind in voice-input device 12, its In, electret condenser microphone size is small, low in energy consumption, cheap and performance is pretty good, therefore uses this kind of microphone conduct The sound transducer of the robot.In addition, the speech in order to more preferably train user, the device of the present embodiment also includes electrocardio/brain Pyroelectric monitor equipment 13, electrocardio/brain wave data during speaker's simulation speech can be monitored by the monitoring device 13, with Used for processor 20, so as to coordinate image recognition result more accurately to determine tensity or the user of active user Mood attribute.

Next, being illustrated to processor 20, the processor 20 performs coding in one or more tangible mediums Logic, makes the logic perform following operation when executed：The multi-modal data given a lecture user is parsed；Utilize base Specifically given a lecture depth model in deep learning algorithm, obtain the speech authority data group of the text of the correspondence speech data, The speech authority data group has guiding speech exemplary data to have gathered；According to default speech element, compare Analysis result and specific speech authority data group；And exported according to comparison result for instructing the multi-modal defeated of user's speech Go out data.As shown in figure 1, the processor 20 includes a processor or by multiple processors (such as label 211,212,213) Processor unit 21, I/O interfaces 22 and the memory 23 of composition.

It should be noted that " processor " include processing data, any appropriate hardware of signal or other information and/or Software systems, mechanism or component.Processor can be included with general Central Processing Unit (CPU), multiplied unit, for reality The system or other systems of the special circuit of existing function.Processing is not necessarily limited to geographical position, or with time restriction.For example, place Its function can be performed with " real-time ", " offline ", " batch mode " etc. by managing device.Can in different time and different location, by Different (or identical) processing systems perform the part of processing.Computer can be any processor with memory communication.Storage Device can be any appropriate processor readable storage medium, such as random access memory (RAM), read-only storage (ROM), Magnetically or optically disk or be suitable for storage by computing device instruction other tangible mediums.

Specifically, processor unit 21 includes graphics processing unit 211, sound processing unit 212, electric wave processing unit 213rd, data resolution module 214, guide data output module 215 and speech Video Output Modules 216.Image procossing list therein Member 211, sound processing unit 212, the multi-modal data of 213 pairs of acquisitions of electric wave processing unit are parsed.Specifically, image Processing unit 211 possesses image preprocessing function, feature extraction functions, decision making function and concrete application function.Image preprocessing Vision collecting data mainly to acquisition carry out basic handling, including the conversion of color space conversion, edge extracting, image and figure As thresholding.Feature extraction mainly extracts the characteristic information such as the colour of skin of target, color, texture, motion and coordinate in image.Certainly Plan is mainly to characteristic information, and being distributed to according to certain decision strategy needs the concrete application of this feature information.Concrete application Function realizes the functions such as Face datection, human limbs identification, motion detection.Sound processing unit 212 uses speech recognition technology Natural-sounding information is subjected to the semantic information that language comprehension analysing obtains user spoken utterances, moreover, true by analyzing voice content Determine word speed, intonation, the pause frequency of speaker.The heart/EEG signals that 213 pairs of electric wave processing unit is collected, which are pre-processed, to be come Remove the artefact of doping.Feature extraction then is carried out to the EEG signals for removing artefact, these features can be temporal signatures, frequency Characteristic of field or time-frequency characteristics.According to these features and before according to obtained by the sample of training different moods (such as it is tranquil, It is glad, sad, frightened) corresponding brain electrical feature determines the mood of user.In addition, in addition to three kinds of common attributes above, also Many other features, such as entropy, fractal dimension and customized feature can be extracted from EEG signals.

Data resolution module 214 is specifically given a lecture depth model using based on deep learning algorithm, obtains correspondingly voice number According to text speech authority data group, and according to default speech element, compare analysis result and the speech specification number determined According to group.Wherein, default speech element can include the accuracy of emotion expression service, the number of times that walks, frequent/monotonicity of limbs, Reasonability (including whether hunchback, whether lowering is natural for hand), the frequency of hand motion of stance；The conjunction of sound intonation The elements such as rationality, the reasonability paused.Speech depth model is obtained based on deep learning algorithm, specifically, in advance profit The speaker with directiveness is collected (for example, the higher outstanding speech of speech level with speech recognition technology, machine vision technique Person) speech word content, voice content, video image content, by deep learning algorithm, according to touching paragraph, moving section Tone, mood, the limb action corresponding to the word at moment etc. such as fall and do deep learning.More particularly, collect a large amount of in advance The speech video data of outstanding speaker, is first handled as follows for each video data：Filter out touching speech paragraph, example The speech period that such as can be more fluctuated for speaker's mood, the video progress voice recognition processing for the period obtains the section Fall corresponding content of text, speech intonation and pause frequency, carrying out image procossing to image information determines in the period not identical text The corresponding limb action of this content and emotional characteristics etc..The data after these above-mentioned processing for each video data are made For give a lecture depth model network training data set, based on depth autocoder and deep neural network to the training data Set carries out further feature and extracts training of the completion to depth model of giving a lecture.

Data resolution module 214 using the content of text of the speech data during the speech of acquisition as input, by drilling Say that depth model obtains corresponding speech authority data group.The data group can include rationally walking during this paragraph is given a lecture The contents such as number of times, limb action reasonability data, speech intonation reasonability data and mood data.Then relevant user is contrasted The parsing content (true speech reaction content) of speech multi-modal data during speech and speech authority data group, it is determined that The reasonability of voice, the limb action shown during user's speech etc..

Guide data output module 215 exports the multi-modal output data for instructing user to give a lecture according to comparison result. Specifically, when comparison result is not up to and sets expected, for example, the element of setting quantity does not reach in the speech element compared To matching, then it is assumed that not up to setting is expected, then by the speech authority data all living creatures given a lecture for the paragraph into multi-modal output Data, the speech mode of specification is shown to user.

Give a lecture Video Output Modules 216, extracting the speech content of user according to analysis result, there is provided the speech with user The associated video information of content, with consumer-oriented speech.As shown in figure 1, being stored with memory 23 with subject name Or video outline keyword is the speech video database of index, speech Video Output Modules 216 are searched according to speech content should The video information of database selection matching.The limitation of capacity is locally stored in view of robot, guiding video can be set Beyond the clouds in server, speech Video Output Modules 216 by network communication protocol to cloud server send video request come The video information of matching is obtained, about the 26S Proteasome Structure and Function of cloud server, herein without excessively limitation.

Multi-modal data output module 30, it is presented multi-modal output data in multi-modal mode to user.The module 30 mainly include display 31, voice-output device 32, limbs operating mechanism 33.Display 31 can select liquid crystal display, Video information and/or emotion expression service information that its control display screen is received with showing.Voice-output device 32 can be loudspeaker, It audibly exports the information of the phonetic matrix received to user.Limbs operating mechanism 33 is according to the limb received Body action command shows the limb action of recommendation to user.

Except the mode of robot entity hardware is used above come in addition to exporting guiding multi-modal data, the intelligence of this example Robot 1 can also extract user according to analysis result and lead speech content there is provided associated with the speech content of the user Virtual robot demonstration data, it is shown on the display 31.Specifically, intelligent robot 1 can utilize data solution Analyse the speech authority data group of the generation of module 214 to generate virtual robot demonstration data, certainly, sound therein etc. is still logical Cross voice-output device 31 to export, and there is the facial table of directiveness about virtual robot during this section speech is carried out Feelings and limb action etc. are realized based on virtual robot demonstration data.The virtual robot can be that have mapped active user's Integrality (including looks, sign etc.) is come the virtual portrait realized so that user can be more by the performance of virtual robot The information such as required expression, sound status when itself being given a lecture to understand well.

In addition, in embodiments of the present invention, it is preferable that the AR/VR equipment 40 of the establishment of virtual speech scene as shown in Figure 1 To realize.The speech scene that hundreds and thousands of people listen the user to be given a lecture as audience is constructed by the AR/VR equipment 40. In addition, dynamic speech scene can also be created that by projection pattern, although the experience property of this mode is not so good as AR/VR equipment 40, but can also implement as one embodiment of the present of invention.On the other hand, in AR/VR equipment, can also provide with The associated virtual robot demonstration data of the speech content of the user, this section of speech content is demonstrated by virtual robot The status information that should be shown.

Carrying out general description with reference to Fig. 1 and Fig. 2 below, once the of the invention speech scene based on intelligent robot is supervised The flow of prosecutor method.As shown in Fig. 2 first, in step S210, speech data acquisition module 10 obtains user and virtually given a lecture The multi-modal data given a lecture under scene.Then graphics processing unit 211, in processor 20, sound processing unit 212, The multi-modal data that the grade of electric wave processing unit 213 is given a lecture user is parsed (step S220).With in preprocessor 20 Data resolution module 214 specifically given a lecture depth model using based on deep learning algorithm, obtain the correspondingly speech data Text speech authority data group (step S230), in step S240, guide data output module 215 is drilled according to default Element is said, analysis result and the speech authority data group determined is compared.Finally, speech Video Output Modules 216 are tied according to comparison Fruit exports the multi-modal output data (step S250) for instructing user to give a lecture.

Next, the example one for the speech data execution dissection process given a lecture user is illustrated with reference to Fig. 3 Process.The processing of speech multi-modal data of the robot to user under virtually speech scene for convenience, user is given a lecture When, using each paragraph as a unit, to receive the speech training of robot.In the process, depth camera 11, phonetic entry Equipment 12 and electrocardio/collection user of eeg monitoring equipment 13 are directed to the speech multi-modal data of a certain paragraph.Because this example is pair The processing of speech data, therefore as shown in figure 3, extract voice messaging in step S310 first, 212 pairs of sound processing unit The voice messaging carries out dissection process (step S320), and the text message of the paragraph is obtained by speech recognition technology, language is utilized The information such as voice, intonation, dead time/number of times, the word speed of sound detection technique detection user.Then, in step S330, data Content of text as input, corresponding speech authority data group, the data is obtained by depth model of giving a lecture by parsing module 214 Group at least includes the corresponding reasonable speech intonation of this section of speech content, pause information etc..Data resolution module 214 is in step S330 In by compare operation assess speaker speech intonation, dead time and number of times it is whether reasonable, such as where paragraph content Should make a short pause, where should tone want loud and sonorous etc..It also can determine that the true place of cacoepy simultaneously.Setting is not being met In the case of rule, guide data output module 215 exports guiding multi-modal data.The guiding multi-modal data can be wrapped Include evaluation result (content of irrationality), reasonability suggestion (when pausing, when intonation is loud and sonorous, when want overcast etc.) with And video information and/or speech authority data group.

On the other hand, in addition it is also necessary to which limb action and facial expression when simulation speech is carried out to user are evaluated.Specifically Flow with reference to shown in Fig. 4.As shown in figure 4, extracting the image information of user's speech, graphics processing unit in step S410 211 carry out the limb action and facial expression information that image analysis operation (step S420) obtains user, in step S430, number It is whether reasonable by the limb action for comparing operation judges speaker according to parsing module 214, for example whether have rationally walk, limbs Whether excessively frequently or excessively whether dull, stance is reasonable for action, whether whether have bow-backed phenomenon, hand lowering naturally, hand Whether portion's action excessively frequently waits.In the case where not meeting setting rule, guide data output module 215 exports directiveness Multi-modal data.

Another further aspect, as shown in figure 5, also being parsed to the electrocardio/brain electricity collected, obtains the emotional information of user Whether (step S520), meet setting rule by the current mood of contrast judgement user, directiveness is exported if not meeting many Modal data, for example, provide reasonability suggestion, inform the mood that user should produce.

Fig. 6 exports the example that the multi-modal data for instructing user to give a lecture is handled according to comparison result for the embodiment of the present invention Simplified flowchart.As shown in fig. 6, first inquiring about the video information in video database 231 with the presence or absence of matching first.Specifically Ground, extracts keyword (step S610), the keyword for example can be the name repeatedly occurred from the text message of speech paragraph Word or phrase.The video information (step S620) searched for by major key of the keyword in video database 231, in the feelings searched Under condition ("Yes" in step S630), exported video information as guiding multi-modal data to display 31 and voice output Equipment 32 is exported, and is demonstrated (step S640) to user.Otherwise, it is speech authority data group is multi-modal as directiveness Data distribution to corresponding hardware executing agency carries out multi-modal output, show correct pronunciation, the tone of recommendation and pause, Limb action of recommendation etc., by user express it is imperfect place correct come.Or, based on speech authority data all living creatures into The virtual robot demonstration data associated with the speech content of the user, is showed (step S650) with virtual mode.

In one embodiment, the intelligent robot is configured with speech APP, and as above method stream is realized by speech APP Journey, when the APP is run, itself and the cooperating of AR/VR equipment 40.In this place, AR/VR equipment 40 can also be provided uses with described The associated virtual robot demonstration data of the speech content at family.

The speech scene monitoring system based on intelligent robot of the embodiment of the present invention can help user to do speech training, Make robot closer to practical application scene, meet user's request, and enhance the multi-modal interaction capabilities of intelligent robot, Improve Consumer's Experience.

Because the method for the present invention describes what is realized in computer systems.The computer system can for example be set In the control core processor of robot.For example, method described herein can be implemented as what can be performed with control logic Software, it is performed by the CPU in robot operating system.Function as described herein, which can be implemented as being stored in non-transitory, to be had Programmed instruction set in shape computer-readable medium.When implemented in this fashion, the computer program includes one group of instruction, When group instruction is run by computer, it, which promotes computer to perform, can implement the method for above-mentioned functions.FPGA can be temporary When or be permanently mounted in non-transitory tangible computer computer-readable recording medium, for example ROM chip, computer storage, Disk or other storage mediums.In addition to being realized with software, logic as described herein can utilize discrete parts, integrated electricity Road, programmable the patrolling with programmable logic device (such as, field programmable gate array (FPGA) or microprocessor) combined use Volume, or embodied including any other equipment that they are combined.All such embodiments are intended to fall under the model of the present invention Within enclosing.

It should be understood that disclosed embodiment of this invention is not limited to specific structure disclosed herein, process step Or material, and the equivalent substitute for these features that those of ordinary skill in the related art are understood should be extended to.It should also manage Solution, term as used herein is only used for describing the purpose of specific embodiment, and is not intended to limit.

" one embodiment " or " embodiment " mentioned in specification means special characteristic, the structure described in conjunction with the embodiments Or during characteristic is included at least one embodiment of the present invention.Therefore, the phrase " reality that specification various places throughout occurs Apply example " or " embodiment " same embodiment might not be referred both to.

While it is disclosed that embodiment as above, but described content is only to facilitate understanding the present invention and adopting Embodiment, is not limited to the present invention.Any those skilled in the art to which this invention pertains, are not departing from this On the premise of the disclosed spirit and scope of invention, any modification and change can be made in the implementing form and in details, But the scope of patent protection of the present invention, still should be subject to the scope of the claims as defined in the appended claims.

Claims

1. a kind of speech scene monitoring method based on intelligent robot, this method includes：

The multi-modal data that user is given a lecture under virtually speech scene is obtained, the multi-modal data at least includes voice number According to；

The multi-modal data given a lecture user is parsed；

Specifically given a lecture depth model using based on deep learning algorithm, obtain the speech rule of the text of the correspondence speech data Model data group, the speech authority data group has guiding speech exemplary data to have gathered；

According to default speech element, analysis result and the speech authority data group determined are compared；

Multi-modal output data for instructing user to give a lecture is exported according to comparison result.

2. according to the method described in claim 1, it is characterised in that

The multi-modal data includes the voice messaging that user is given a lecture under virtually speech scene, based on voice letter Whether breath, meet setting rule by the voice, intonation and dead time of user described in contrast judgement.

3. method according to claim 1 or 2, it is characterised in that

The multi-modal data includes the image information that user is given a lecture under virtually speech scene, based on described image letter Whether breath, meet setting rule by the facial expression and posture of contrast judgement user.

4. method according to claim 1 or 2, it is characterised in that also include：

Believed according to the speech content that analysis result extracts user there is provided the video associated with the speech content of the user Breath, with consumer-oriented speech,

Or,

The virtual robot demonstration data associated with the speech content of the user is provided by intelligent robot.

5. the method according to any one of Claims 1 to 4, it is characterised in that

Methods described realizes that the robot is mounted with robot operating system by being configured with speech APP intelligent robot, The virtual speech scene is produced by AR/VR equipment, and the AR/VR equipment is cooperateed with the speech APP of the intelligent robot Operation, or, there is provided the virtual robot demonstration data associated with the speech content of the user in AR/VR equipment.

6. one kind speech scene monitoring device, the device includes：

Speech data acquisition module, it obtains the multi-modal data that user is given a lecture under virtually speech scene, the multimode State data at least include speech data；

One or more processors；

Encode is used for the logic by one or more of computing devices in one or more tangible mediums, and described patrols Collecting is used to perform following operation when executed：The multi-modal data given a lecture user is parsed；Using based on depth Learning algorithm is specifically given a lecture depth model, obtains the speech authority data group of the text of the correspondence speech data, described to drill Authority data group is said to have gathered the speech exemplary data with directiveness；According to default speech element, parsing knot is compared Fruit and the speech authority data group of determination；And the multi-modal output number for instructing user to give a lecture is exported according to comparison result According to.

7. device according to claim 6, it is characterised in that

The multi-modal data includes the voice messaging that user is given a lecture under virtually speech scene,

The logic is further used for performing following operation when executed：Based on the voice messaging, pass through contrast judgement institute Whether the voice, intonation and dead time for stating user meet setting rule.

8. the device according to claim 6 or 7, it is characterised in that

The multi-modal data includes the image information that user is given a lecture under virtually speech scene,

The logic is further used for performing following operation when executed：Based on described image information, used by contrast judgement Whether the facial expression and posture at family meet setting rule.

9. the device according to claim 6 or 7, it is characterised in that also including speech Video Output Modules, it is according to parsing As a result the speech content of user is extracted there is provided the video information associated with the speech content of the user, to instruct user Speech, or,

The logic is further used for performing following operation when executed：Extracted according to analysis result in the speech of user There is provided the virtual robot demonstration data associated with the speech content of the user for appearance.

10. the device according to any one of claim 6~9, it is characterised in that

Described device realizes that the robot is mounted with robot operating system by being configured with speech APP intelligent robot, The virtual speech scene is produced by AR/VR equipment, and the AR/VR equipment is cooperateed with the speech APP of the intelligent robot Operation, or, there is provided the virtual robot demonstration data associated with the speech content of the user in AR/VR equipment.