CN110335600A

CN110335600A - Multi-modal interaction method and system for household appliances

Info

Publication number: CN110335600A
Application number: CN201910616247.9A
Authority: CN
Inventors: 刘明华; 游忍; 张欢欢; 展华益; 周建波
Original assignee: Sichuan Changhong Electric Co Ltd
Current assignee: Sichuan Changhong Electric Co Ltd
Priority date: 2019-07-09
Filing date: 2019-07-09
Publication date: 2019-10-15

Abstract

The present invention proposes a multi-modal interaction method and system for household appliances, belonging to the field of speech recognition for household appliances. It addresses the problems of traditional voice-only interaction: misrecognition, dependence on a wake word, and unfriendly interaction. The key points of the technical solution are: acquire the image and voice signals in the current environment; detect, from the voice signal, whether speech activity is present; if speech activity is detected, judge from the image signal whether someone is looking at the device and speaking; if someone is detected looking at the device and speaking, start the voice interaction function and store the current user's voice features and image features; once the voice interaction function is started, recognize the speech content of the current speaker from the voice features; and use intent recognition to determine the intent of the current speaker and provide the corresponding service. The invention can judge automatically whether voice interaction needs to be started, without a wake word, and can help the user select a service.

Description

Multi-modal interaction method and system for household appliances

Technical field

The present invention relates to speech recognition technology for household appliances, and in particular to a multi-modal interaction method and system for household appliances.

Background art

In smart-device interaction, the most common mode at present is voice interaction, in which acquired voice parameters control the operation of the household appliance or a search service. However, relying on voice parameters alone is prone to misrecognition, and the probability of misrecognition is greater when ambient noise is high or the speaker is far away. In addition, current voice interaction is a rigid mode that first requires a wake word to wake the device, which is inconvenient and unfriendly. In summary, existing interaction methods and systems for household appliances suffer from misrecognition, dependence on a wake word, and unfriendly interaction.

Summary of the invention

The object of the present invention is to provide a multi-modal interaction method and system for household appliances that solve the problems of traditional voice-only interaction: misrecognition, dependence on a wake word, and unfriendly interaction.

To solve this technical problem, the present invention adopts the following technical solution: a multi-modal interaction method for household appliances, comprising the following steps:

S1. Acquire the image and voice signals in the current environment;

S2. Detect, from the voice signal, whether speech activity is present;

S3. If speech activity is detected, judge from the image signal whether someone is looking at the device and speaking;

S4. If someone is detected looking at the device and speaking, start the voice interaction function and store the current user's voice features and image features;

S5. Once the voice interaction function is started, recognize the speech content of the current speaker from the voice features;

S6. Once the voice interaction function is started, use intent recognition to determine the intent of the current speaker and provide the corresponding service.
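The gating order of steps S1 to S6 can be sketched as follows. This is a minimal illustration rather than the patented implementation: the names `Frame` and `run_interaction` and the callback parameters are hypothetical, and each stage is passed in as a function so that no particular detector or recognizer is assumed.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Frame:
    """One synchronized capture (S1): audio and image, both opaque here."""
    audio: object
    image: object


def run_interaction(
    frame: Frame,
    has_speech: Callable[[object], bool],              # S2: speech activity detection
    is_gazing_and_speaking: Callable[[object], bool],  # S3: visual check
    store_features: Callable[[Frame], None],           # S4: feature storage
    recognize: Callable[[object], str],                # S5: speech recognition
    serve_intent: Callable[[str], str],                # S6: intent recognition and service
) -> Optional[str]:
    """Return the service result, or None if voice interaction is not started."""
    if not has_speech(frame.audio):
        return None  # no speech activity, so the image is not examined
    if not is_gazing_and_speaking(frame.image):
        return None  # speech was heard, but nobody is addressing the device
    store_features(frame)           # voice interaction starts here (S4)
    text = recognize(frame.audio)   # S5
    return serve_intent(text)       # S6
```

The image check runs only after speech activity has been found, which mirrors the order of S2 and S3; no wake word appears anywhere in the flow.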

Specifically, in step S1, the voice signal in the current environment is acquired by a voice receiver built into the household appliance, and the image signal in the current environment is acquired by a camera built into the household appliance.

Further, step S2 specifically includes the following steps:

S201. Extract traditional features or deep features from the voice signal;

S202. Make a decision on the features based on thresholds, statistical models, and machine learning, to detect whether speech activity is present.

Specifically, step S3 includes the following steps:

S301. From the image signal, compute the facial orientation of the current speaker using computer vision techniques, and judge whether someone in the current environment is looking at the device;

S302. If someone is looking at the device, use computer vision techniques on the image signal to judge whether the person looking at the device is speaking.

Further, in step S4, the voice features include the age, gender, and identity of the speaker; the image features include the face, position, gender, age, and identity of the speaker.

Specifically, in step S5, the speech content of the current speaker is recognized by extracting the speech parameters from the voice features.

Further, step S6 specifically includes the following steps:

S601. Use intent recognition to analyze the speech content and extract the intent of the current speaker;

S602. A command word database is built into the household appliance;

S603. Match the intent of the current speaker against the database to confirm the command the user wants to input;

S604. Provide the service required by the current speaker.

A multi-modal interaction system for household appliances, applied to the above multi-modal interaction method for household appliances, comprises a signal acquisition module, a speaker change detection module, a voice interaction module, a feature storage module, a speech recognition module, and an intent recognition module. The signal acquisition module is connected to the speaker change detection module, the speaker change detection module is connected to the voice interaction module, the voice interaction module is connected to the feature storage module, the feature storage module is connected to the speech recognition module, and the speech recognition module is connected to the intent recognition module;

The signal acquisition module is used to acquire voice and image signals;

The speaker change detection module is used to judge whether someone is speaking to the household appliance;

The voice interaction module is used to judge, from the image and voice signals, whether to start the voice interaction function;

The feature storage module is used to store the voice features and image features of the current speaker;

The speech recognition module is used to recognize the user's speech content;

The intent recognition module is used to understand the user's intent and recommend service content.

The beneficial effects of the present invention are as follows. With the above multi-modal interaction method and system for household appliances, image and voice signal inputs are used together with computer vision and speech recognition technology to judge automatically whether voice interaction needs to be started, without a wake word, which makes interaction more accurate and efficient and raises the intelligence level of the household appliance. Speech recognition and intent recognition technology are used to confirm the user's search intent and help the user select a service, which improves the accuracy and efficiency of interaction and provides a better interaction experience.

Brief description of the drawings

Fig. 1 is a flow chart of the multi-modal interaction method for household appliances of the present invention.

Detailed description of the embodiments

The technical solution of the present invention is described in detail below with reference to the embodiments and the drawing.

The flow chart of the multi-modal interaction method for household appliances of the present invention is shown in Fig. 1. The method includes the following steps:

S1. Acquire the image and voice signals in the current environment.

Here, to save input cost and make voice acquisition more convenient, the voice signal in the current environment is preferably acquired by the voice receiver built into the household appliance; to acquire the image signal accurately, the image signal in the current environment is preferably acquired by the camera built into the household appliance.

S2. Detect, from the voice signal, whether speech activity is present.

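Here, step S2 specifically includes the following steps:

S201. Extract traditional features or deep features from the voice signal;

S202. Make a decision on the features based on thresholds, statistical models, and machine learning, to detect whether speech activity is present.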

S3. If speech activity is detected, judge from the image signal whether someone is looking at the device and speaking.

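Here, step S3 specifically includes the following steps:

S301. From the image signal, compute the facial orientation of the current speaker using computer vision techniques, and judge whether someone in the current environment is looking at the device;

S302. If someone is looking at the device, use computer vision techniques on the image signal to judge whether the person looking at the device is speaking.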

S4. If someone is detected looking at the device and speaking, start the voice interaction function and store the current user's voice features and image features.

Here, the voice features include the age, gender, identity, and so on of the speaker; the image features include the face, position, gender, age, identity, and so on of the speaker.

S5. Once the voice interaction function is started, recognize the speech content of the current speaker from the voice features.

Here, under normal operating conditions, the speech content of the current speaker can be recognized by extracting the speech parameters from the voice features.

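S6. Once the voice interaction function is started, use intent recognition to determine the intent of the current speaker and provide the corresponding service.

Here, step S6 specifically includes the following steps:

S601. Use intent recognition to analyze the speech content and extract the intent of the current speaker;

S602. A command word database is built into the household appliance;

S603. Match the intent of the current speaker against the database to confirm the command the user wants to input;

S604. Provide the service required by the current speaker.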

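The multi-modal interaction system for household appliances of the present invention, applied to the above method, comprises a signal acquisition module, a speaker change detection module, a voice interaction module, a feature storage module, a speech recognition module, and an intent recognition module, connected in that order;

The signal acquisition module is used to acquire voice and image signals;

The speaker change detection module is used to judge whether someone is speaking to the household appliance;

The voice interaction module is used to judge, from the image and voice signals, whether to start the voice interaction function;

The feature storage module is used to store the voice features and image features of the current speaker;

The speech recognition module is used to recognize the user's speech content;

The intent recognition module is used to understand the user's intent and recommend service content.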

Embodiment 1

This embodiment provides a multi-modal interaction method for household appliances, comprising the following steps:

S1. Acquire the image and voice signals in the current environment. The voice signal in the current environment is acquired by a voice receiver built into the household appliance, such as a remote control or a far-field microphone array; the image signal in the current environment is acquired by a camera built into the household appliance, such as an RGB camera or an infrared camera.

S2. Detect, from the voice signal, whether speech activity is present. First, traditional features or deep features are extracted from the voice signal; in this embodiment, the energy of the speech at each moment can be computed as the feature. Then a threshold k is set: if the energy is greater than k, the moment is marked 1 (speech); otherwise it is marked 0 (non-speech). The duration of continuous speech is then evaluated, and if it exceeds a set threshold t, speech activity is detected.
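The energy-and-duration rule of step S2 can be sketched as below. This is a minimal illustration under stated assumptions: the function names, the use of non-overlapping frames, and mean squared amplitude as the "energy" are choices made here for illustration, and t is counted in frames. The embodiment only specifies an energy threshold k and a duration threshold t.

```python
from typing import List, Sequence


def frame_energies(samples: Sequence[float], frame_len: int) -> List[float]:
    """Mean squared amplitude of each non-overlapping frame (assumed energy measure)."""
    energies = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        energies.append(sum(x * x for x in frame) / frame_len)
    return energies


def detect_speech_activity(energies: Sequence[float], k: float, t: int) -> bool:
    """Mark each frame 1 if its energy exceeds k, else 0, and report speech
    activity once a run of consecutive 1-frames is longer than t frames."""
    run = 0
    for energy in energies:
        run = run + 1 if energy > k else 0
        if run > t:
            return True
    return False
```

For example, `detect_speech_activity(frame_energies(samples, 160), k=0.01, t=20)` would require more than 20 consecutive loud frames; the values 160, 0.01, and 20 are placeholders, not values from the patent.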

S3. If speech activity is detected, judge from the image signal whether someone is looking at the device and speaking. First, face detection and key point localization are performed on the acquired image signal to judge whether anyone is in front of the device; for each located person, head pose estimation from the key points gives the facial orientation and hence the deflection angle relative to the device, and if this angle is less than a threshold r, the person is judged to be facing the device. Then, if someone is looking at the device, the key points of that person's face over several consecutive frames are examined to see whether the spacing between the upper and lower lips, viewed over its dynamic range, exceeds a threshold d; if so, the person is judged to be speaking, that is, someone is looking at the device and speaking.
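The two visual checks in step S3 can be sketched as follows, assuming that face detection, key point localization, and head pose estimation have already produced a deflection angle and a per-frame lip spacing. The function names are hypothetical, and reading the lip test as "maximum minus minimum spacing over the frames exceeds d" is one interpretation of the dynamic-range wording, not a confirmed detail of the patent.

```python
from typing import Sequence


def is_facing_device(deflection_deg: float, r: float) -> bool:
    """The person faces the device if the deflection angle is below threshold r."""
    return abs(deflection_deg) < r


def is_speaking(lip_gaps: Sequence[float], d: float) -> bool:
    """lip_gaps holds the upper/lower lip spacing for several consecutive frames.
    Assumed rule: the person is speaking if the range of spacings exceeds d."""
    if len(lip_gaps) < 2:
        return False  # a single frame has no dynamic range
    return max(lip_gaps) - min(lip_gaps) > d


def is_gazing_and_speaking(deflection_deg: float, lip_gaps: Sequence[float],
                           r: float, d: float) -> bool:
    """Combine both checks: lip motion is only examined for a facing person."""
    return is_facing_device(deflection_deg, r) and is_speaking(lip_gaps, d)
```

Using the range of the spacing rather than a single-frame value avoids treating a mouth that is merely held open as speech, which is one reason to prefer this reading.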

S4. If someone is detected looking at the device and speaking, start the voice interaction function and store the current user's voice features and image features. First, the speaker's voice features are stored, including age "25", gender "male", identity "user 1", and so on; second, the speaker's image features are stored, including the face image and its coordinates, position "30 degrees left of the device", gender "male", age "25", identity "user 1", and so on.
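The records stored in step S4 might be shaped as in the sketch below. The class and field names are hypothetical, the face image itself is omitted in favor of a bounding box, and keying the store by identity is an assumption; the patent lists which attributes are stored but not how.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class VoiceFeatures:
    age: int
    gender: str
    identity: str


@dataclass(frozen=True)
class ImageFeatures:
    face_box: Tuple[int, int, int, int]  # assumed (x, y, width, height) of the face
    position: str
    gender: str
    age: int
    identity: str


class FeatureStore:
    """In-memory store keyed by speaker identity (an assumed keying scheme)."""

    def __init__(self) -> None:
        self._records: Dict[str, Tuple[VoiceFeatures, ImageFeatures]] = {}

    def save(self, voice: VoiceFeatures, image: ImageFeatures) -> None:
        self._records[voice.identity] = (voice, image)

    def get(self, identity: str) -> Optional[Tuple[VoiceFeatures, ImageFeatures]]:
        return self._records.get(identity)
```

With the values from the embodiment, a record would be saved as `store.save(VoiceFeatures(25, "male", "user 1"), ImageFeatures((120, 80, 64, 64), "30 degrees left of the device", "male", 25, "user 1"))`, where the box coordinates are made up for illustration.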

S5. Once the voice interaction function is started, recognize the speech content of the current speaker from the voice features. The user's speech content is recognized by extracting the speech parameters from the voice features, for example the TV instructions "I want to watch Journey to the West" and "make the sound a bit louder", or the air-conditioner instructions "make the temperature a bit higher" and "make the wind a bit weaker".

S6. Once the voice interaction function is started, use intent recognition to determine the intent of the current speaker and provide the corresponding service. First, intent recognition is used to analyze the speech content and extract the user's intent: for the TV instruction "I want to watch Journey to the West", "Journey to the West" is extracted; for the air-conditioner instruction "make the wind a bit weaker", "wind" and "weaker" are extracted. Second, a command word database is built into the household appliance, containing entries such as "Journey to the West", "wind", and "weaker". Then the user's intent is matched against the database to confirm the command the user wants to input. Finally, the service required by the current speaker is provided, for example searching for Journey to the West video sources for the user to choose from, or lowering the air-conditioner fan speed.
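The keyword extraction and database matching of step S6 can be sketched as below. This is a deliberately naive illustration: the database contents, the command strings, and the substring-based extraction are all assumptions made here, and a real intent recognizer would be considerably more robust. The English phrases are illustrative renderings of the embodiment's examples.

```python
from typing import Dict, List, Optional, Sequence, Tuple

# Hypothetical built-in command word database: keyword tuple -> command.
COMMAND_DB: Dict[Tuple[str, ...], str] = {
    ("journey to the west",): "search_video:Journey to the West",
    ("wind", "weaker"): "fan_speed_down",
    ("sound", "louder"): "volume_up",
    ("temperature", "higher"): "temperature_up",
}


def extract_intent(utterance: str) -> List[str]:
    """Return every database keyword that occurs in the utterance
    (naive case-insensitive substring match)."""
    text = utterance.lower()
    found: List[str] = []
    for keywords in COMMAND_DB:
        for keyword in keywords:
            if keyword in text and keyword not in found:
                found.append(keyword)
    return found


def match_command(intent_keywords: Sequence[str]) -> Optional[str]:
    """Return the first command whose keywords are all present, else None."""
    for keywords, command in COMMAND_DB.items():
        if all(k in intent_keywords for k in keywords):
            return command
    return None
```

Calling `match_command(extract_intent("Make the wind a bit weaker"))` yields `"fan_speed_down"`, while an utterance with no database keywords yields `None`, in which case no service would be offered.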

Embodiment 2

This embodiment provides a multi-modal interaction system for household appliances. The system comprises a signal acquisition module, a speaker change detection module, a voice interaction module, a feature storage module, a speech recognition module, and an intent recognition module. The signal acquisition module is connected to the speaker change detection module, the speaker change detection module is connected to the voice interaction module, the voice interaction module is connected to the feature storage module, the feature storage module is connected to the speech recognition module, and the speech recognition module is connected to the intent recognition module.

The signal acquisition module acquires the image and voice signals in the current scene through sensors; the image acquisition device is, for example, an RGB camera, and the voice receiving device is, for example, a remote control or a far-field microphone array.

The speaker change detection module is mainly used to judge whether someone is speaking to the household appliance. If a person speaks without facing the appliance, or faces the appliance without speaking, the voice interaction module is not engaged. The judgment method is as follows:

A. Extract traditional features or deep features from the voice signal; in this embodiment, the energy of the speech at each moment can be computed as the feature;

B. Set a threshold k: if the energy is greater than k, the moment is marked 1 (speech); otherwise it is marked 0 (non-speech). Evaluate the duration of continuous speech, and if it exceeds a set threshold t, speech activity is detected;

C. If speech activity is detected, perform face detection and key point localization on the acquired image signal to judge whether anyone is in front of the device; for each located person, perform head pose estimation from the key points to obtain the facial orientation and hence the deflection angle relative to the device, and if this angle is less than a threshold r, the person is judged to be facing the device;

D. If someone is looking at the device, examine the key points of that person's face over several consecutive frames to see whether the spacing between the upper and lower lips, viewed over its dynamic range, exceeds a threshold d; if so, the person is judged to be speaking, that is, someone is looking at the device and speaking.

The voice interaction module judges, from the image and voice signals, whether to start the voice interaction function:

If no speech activity is detected, the voice interaction function is not started. If speech activity is detected but no one is detected looking at the device and speaking, the voice interaction function is not started. If speech activity is detected and someone is detected looking at the device and speaking, the voice interaction function is started.
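The three cases handled by the voice interaction module reduce to a small decision function. The sketch below uses hypothetical names (`Decision`, `decide`) and simply encodes the rule stated above, with the two detector outputs supplied as booleans.

```python
from enum import Enum


class Decision(Enum):
    NO_SPEECH = "no speech activity detected"
    NOT_ADDRESSING_DEVICE = "speech detected, but nobody is looking at the device and speaking"
    START = "start voice interaction"


def decide(speech_detected: bool, gazing_and_speaking: bool) -> Decision:
    """Apply the voice interaction module's rule to the two detector outputs."""
    if not speech_detected:
        return Decision.NO_SPEECH
    if not gazing_and_speaking:
        return Decision.NOT_ADDRESSING_DEVICE
    return Decision.START
```

Returning a reason rather than a bare boolean is a design choice made for this sketch; it makes it easier to log why interaction was not started, for instance when a television soundtrack triggers speech activity but nobody is facing the device.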

The feature storage module stores the voice features and image features of the current speaker. The stored voice features include age "25", gender "male", identity "user 1", and so on; the stored image features include the face image and its coordinates, position "30 degrees left of the device", gender "male", age "25", identity "user 1", and so on.

The speech recognition module recognizes the speaker's speech content, for example the TV instructions "I want to watch Journey to the West" and "make the sound a bit louder", or the air-conditioner instructions "make the temperature a bit higher" and "make the wind a bit weaker".

After the speaker's speech content has been recognized, the intent recognition module performs intent recognition for the current speaker and understands the user's intent, such as "Journey to the West", "wind", or "weaker". The household appliance then provides the service required by the current speaker, for example searching for Journey to the West video sources for the user to choose from, or lowering the air-conditioner fan speed.

Embodiment 1 and Embodiment 2 can also be extended to voice interaction with other household appliances, for example to set the temperature of a refrigerator or switch a lamp. The voice interaction function can thus be started without a wake word, and multi-modal recognition improves interaction efficiency and provides users with more intelligent service.

Claims

1. A multi-modal interaction method for household appliances, characterized by comprising the following steps:

S1. Acquire the image and voice signals in the current environment;

S2. Detect, from the voice signal, whether speech activity is present;

S3. If speech activity is detected, judge from the image signal whether someone is looking at the device and speaking;

S4. If someone is detected looking at the device and speaking, start the voice interaction function and store the current user's voice features and image features;

S5. Once the voice interaction function is started, recognize the speech content of the current speaker from the voice features;

S6. Once the voice interaction function is started, use intent recognition to determine the intent of the current speaker and provide the corresponding service.

2. The multi-modal interaction method for household appliances according to claim 1, characterized in that, in step S1, the voice signal in the current environment is acquired by a voice receiver built into the household appliance, and the image signal in the current environment is acquired by a camera built into the household appliance.

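3. The multi-modal interaction method for household appliances according to claim 1, characterized in that step S2 specifically includes the following steps:

S201. Extract traditional features or deep features from the voice signal;

S202. Make a decision on the features based on thresholds, statistical models, and machine learning, to detect whether speech activity is present.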

4. The multi-modal interaction method for household appliances according to claim 1, characterized in that step S3 specifically includes the following steps:

S301. From the image signal, compute the facial orientation of the current speaker using computer vision techniques, and judge whether someone in the current environment is looking at the device;

S302. If someone is looking at the device, use computer vision techniques on the image signal to judge whether the person looking at the device is speaking.

5. The multi-modal interaction method for household appliances according to claim 1, characterized in that, in step S4, the voice features include the age, gender, and identity of the speaker; the image features include the face, position, gender, age, and identity of the speaker.

6. The multi-modal interaction method for household appliances according to claim 1, characterized in that, in step S5, the speech content of the current speaker is recognized by extracting the speech parameters from the voice features.

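7. The multi-modal interaction method for household appliances according to claim 1, characterized in that step S6 specifically includes the following steps:

S601. Use intent recognition to analyze the speech content and extract the intent of the current speaker;

S602. A command word database is built into the household appliance;

S603. Match the intent of the current speaker against the database to confirm the command the user wants to input;

S604. Provide the service required by the current speaker.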

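8. A multi-modal interaction system for household appliances, applied to the multi-modal interaction method for household appliances according to any one of claims 1-7, characterized by comprising a signal acquisition module, a speaker change detection module, a voice interaction module, a feature storage module, a speech recognition module, and an intent recognition module, wherein the signal acquisition module is connected to the speaker change detection module, the speaker change detection module is connected to the voice interaction module, the voice interaction module is connected to the feature storage module, the feature storage module is connected to the speech recognition module, and the speech recognition module is connected to the intent recognition module;

The signal acquisition module is used to acquire voice and image signals;

The speaker change detection module is used to judge whether someone is speaking to the household appliance;

The voice interaction module is used to judge, from the image and voice signals, whether to start the voice interaction function;

The feature storage module is used to store the voice features and image features of the current speaker;

The speech recognition module is used to recognize the user's speech content;

The intent recognition module is used to understand the user's intent and recommend service content.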