CN110910887B - Voice wake-up method and device - Google Patents

Voice wake-up method and device

Info

Publication number
CN110910887B
CN110910887B (application CN201911394715.9A)
Authority
CN
China
Prior art keywords
user
face
voice signal
voice
continuous data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201911394715.9A
Other languages
Chinese (zh)
Other versions
CN110910887A (en)
Inventor
孙尔伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sipic Technology Co Ltd
Original Assignee
Sipic Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sipic Technology Co Ltd filed Critical Sipic Technology Co Ltd
Priority to CN201911394715.9A priority Critical patent/CN110910887B/en
Publication of CN110910887A publication Critical patent/CN110910887A/en
Application granted granted Critical
Publication of CN110910887B publication Critical patent/CN110910887B/en

Classifications

    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06V — IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 — Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/10 — Human or animal bodies, e.g. vehicle occupants or pedestrians; body parts, e.g. hands
    • G06V 40/16 — Human faces, e.g. facial parts, sketches or expressions
    • G06V 40/161 — Detection; localisation; normalisation
    • G06V 40/20 — Movements or behaviour, e.g. gesture recognition
    • G10 — MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L — SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 13/00 — Speech synthesis; text-to-speech systems
    • G10L 13/02 — Methods for producing synthetic speech; speech synthesisers
    • G10L 15/00 — Speech recognition
    • G10L 15/22 — Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L 15/24 — Speech recognition using non-acoustical features
    • G10L 15/25 — Speech recognition using non-acoustical features using position of the lips, movement of the lips or face analysis
    • G10L 17/00 — Speaker identification or verification techniques
    • G10L 17/06 — Decision making techniques; pattern matching strategies
    • G10L 17/14 — Use of phonemic categorisation or speech recognition prior to speaker recognition or verification
    • G10L 25/00 — Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00–G10L 21/00
    • G10L 25/78 — Detection of presence or absence of voice signals
    • G10L 2015/223 — Execution procedure of a spoken command

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Social Psychology (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Psychiatry (AREA)
  • Business, Economics & Management (AREA)
  • Game Theory and Decision Science (AREA)
  • Signal Processing (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

The invention discloses a voice wake-up method and device, wherein the method comprises the following steps: performing VAD detection on acquired audio data to judge whether a voice signal is input; in response to the VAD detection indicating that a voice signal is input, performing wake-up word detection on the input voice signal to judge whether it contains a preset wake-up word; if the voice signal does not contain the preset wake-up word, starting image recognition to acquire continuous data of the user's face at the current moment; judging, based on the continuous face data, whether the user has a dialogue intent; and if the user is judged to have a dialogue intent, performing wake-up. According to the scheme of the embodiments of the application, when the device cannot be woken by voice, continuous data of the user's face is acquired and used to judge whether the user has a dialogue intent; if so, the device is woken. Whether to wake can thus be decided from the user's intent rather than relying solely on a wake-up word, which is more natural and gives a better user experience.

Description

Voice wake-up method and device
Technical Field
The invention belongs to the technical field of voice wake-up, and in particular relates to a voice wake-up method and device.
Background
In the related art, most current devices support voice interaction, which has become an essential capability of smart devices: it makes human-machine interaction more natural and dialogue more intelligent. Waking up the device is a key step in the voice-interaction process.
In implementing the present application, the inventor found that existing schemes have at least the following defect: with the voice interaction used by most current smart devices, the device must first be woken with a preset wake-up word before a dialogue can begin, which makes interaction cumbersome and unfriendly.
Disclosure of Invention
An embodiment of the present invention provides a voice wake-up method and apparatus, which are used to solve at least one of the above technical problems.
In a first aspect, an embodiment of the present invention provides a voice wake-up method, including: performing VAD detection on acquired audio data to judge whether a voice signal is input; in response to the VAD detection indicating that a voice signal is input, performing wake-up word detection on the input voice signal to judge whether it contains a preset wake-up word; if the voice signal does not contain the preset wake-up word, starting image recognition to acquire continuous data of the user's face at the current moment; judging, based on the continuous face data, whether the user has a dialogue intent; and if the user is judged to have a dialogue intent, performing wake-up.
In a second aspect, an embodiment of the present invention provides a voice wake-up apparatus, including: a detection module configured to perform VAD detection on acquired audio data to judge whether a voice signal is input; a wake-up determining module configured to, in response to the VAD detection indicating that a voice signal is input, perform wake-up word detection on the input voice signal to judge whether it contains a preset wake-up word; an image recognition module configured to start image recognition to acquire continuous data of the user's face at the current moment if the voice signal does not contain the preset wake-up word; an intent judging module configured to judge, based on the continuous face data, whether the user has a dialogue intent; and a wake-up execution module configured to perform wake-up if the user is judged to have a dialogue intent.
In a third aspect, an embodiment of the present invention provides an electronic device, comprising: at least one processor, and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the steps of the voice wake-up method of any embodiment of the present invention.
In a fourth aspect, an embodiment of the present invention further provides a computer program product, the computer program product comprising a computer program stored on a non-volatile computer-readable storage medium, the computer program including program instructions which, when executed by a computer, cause the computer to execute the steps of the voice wake-up method according to any embodiment of the present invention.
According to the scheme provided by the embodiments of the present application, when the device cannot be woken by voice, continuous data of the user's face is acquired and used to judge whether the user has a dialogue intent; if so, the device is woken. Whether to wake can thus be decided from the user's intent rather than relying solely on a wake-up word, which is more natural and gives a better user experience.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed in the description of the embodiments are briefly introduced below. The drawings described below show some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.
Fig. 1 is a flowchart of a voice wake-up method according to an embodiment of the present invention;
fig. 2 is a flowchart of another voice wake-up method according to an embodiment of the present invention;
fig. 3 is a flowchart of another voice wake-up method according to an embodiment of the present invention;
fig. 4 is a flowchart illustrating an embodiment of a voice wake-up system according to the present invention;
fig. 5 is a block diagram of a voice wake-up apparatus according to an embodiment of the present invention;
fig. 6 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Referring to fig. 1, a flowchart of an embodiment of a voice wake-up method according to the present application is shown, where the voice wake-up method according to the present embodiment may be applied to a smart voice device with a voice wake-up function, such as a smart voice phone, a smart speaker, a smart voice robot, a smart voice toy, and the like, and the present application is not limited herein.
As shown in fig. 1, in step 101, VAD detection is performed on the acquired audio data to determine whether a voice signal is input;
in step 102, in response to the VAD detection indicating that a voice signal is input, wake-up word detection is performed on the input voice signal to determine whether it contains a preset wake-up word;
in step 103, if the voice signal does not contain the preset wake-up word, image recognition is started to acquire continuous data of the user's face at the current moment;
in step 104, whether the user has a dialogue intent is determined based on the continuous face data;
in step 105, if the user is determined to have a dialogue intent, wake-up is performed.
In this embodiment, for step 101, the voice wake-up apparatus performs VAD detection on the acquired audio data to determine whether a voice signal from the user is input.
Then, for step 102, if the VAD detection result indicates that a voice signal is input, wake-up word detection is performed on the voice signal input by the user to determine whether it contains the preset device wake-up word.
Next, for step 103, if the device detects that the user's voice signal does not contain the preset wake-up word, it starts image recognition to acquire continuous data of the user's face at the current moment. Then, for step 104, the device determines, based on the continuous face data, whether the user has a dialogue intent. Finally, for step 105, if the device determines that the user has a dialogue intent, it performs wake-up.
For example, a user walks up to a smart television and speaks. The smart television acquires the user's audio data, performs VAD detection, and then checks whether the audio contains the wake-up word. If it does not, the smart television starts image recognition in the background to determine whether the user has a dialogue intent, for example from whether the user is facing the television; if the smart television determines that the user has a dialogue intent, it wakes the device directly.
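The flow of steps 101–105 can be sketched as a single decision function. This is only an illustrative Python sketch; the helper callables (`vad_has_speech`, `contains_wake_word`, `capture_face_frames`, `infer_dialog_intent`) are hypothetical stand-ins for the VAD, wake-up-word, camera, and intent components described above and are not part of any real API.

```python
def should_wake(audio, vad_has_speech, contains_wake_word,
                capture_face_frames, infer_dialog_intent):
    """Return True if the device should wake for this audio chunk."""
    # Step 101: VAD decides whether the audio contains a voice signal.
    if not vad_has_speech(audio):
        return False
    # Step 102: check the voice signal for the preset wake-up word.
    if contains_wake_word(audio):
        return True
    # Step 103: no wake-up word -- fall back to image recognition and
    # grab continuous frames of the user's face at the current moment.
    frames = capture_face_frames()
    # Steps 104-105: wake only if the face data shows dialogue intent.
    return infer_dialog_intent(frames)
```

A caller would supply real detector implementations for the four callables; with stubs, the function simply reproduces the branching of fig. 1.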
According to the scheme of this embodiment of the application, when the device cannot be woken by voice, continuous data of the user's face is acquired and used to judge whether the user has a dialogue intent; if so, the device is woken. Whether to wake can thus be decided from the user's real intent rather than relying solely on a wake-up word, which is friendlier in scenarios where the user has forgotten the wake-up word or does not know the concept of a wake-up word at all, and gives a more natural and better user experience.
With further reference to fig. 2, a flowchart of another embodiment of the voice wake-up method of the present application is shown. This flowchart mainly covers the steps that follow "determining whether the user has a dialogue intent based on the continuous face data" in step 104 of fig. 1.
As shown in fig. 2, in step 201, if the user is determined not to have a dialogue intent, a portrait of the user is determined based on the voice signal and the continuous face data;
in step 202, recommendation information for the user is determined based on the portrait;
in step 203, the recommendation information is fed back to the user.
In this embodiment, for step 201, if the voice wake-up apparatus determines that the user has no dialogue intent, it determines a portrait of the user based on the voice signal and the continuous face data. For example, an intelligent robot at the entrance of a shopping mall or supermarket collects a user's voice signal and continuous face data to determine the user's portrait, which includes information such as gender, age, and interests.
Then, for step 202, the device determines recommendation information for the user based on the portrait. For example, from the collected voice information of the user mentioning "clothing of a certain brand" together with the passerby's portrait, the intelligent robot may generate recommendation information such as "Hello, the stores of that brand are on the second floor; enjoy your shopping", or a greeting such as "Hello, is there anything I can help you with?"; this is not limited herein.
Finally, for step 203, the device feeds back the recommendation information to the user. For example, after the intelligent robot gives the generated recommendation information to the user, the user may continue with follow-up voice-interaction inquiries, and so on.
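Steps 201–203 can be illustrated with a toy portrait-and-recommendation routine. The portrait fields (`gender`, `age`, `interests`) and the rule mapping an interest to a recommendation are invented here for illustration; the patent does not prescribe a portrait schema or a recommendation algorithm.

```python
def build_portrait(voice_keywords, face_attrs):
    """Step 201: combine keywords heard in the voice signal with
    attributes estimated from the continuous face data."""
    return {
        "gender": face_attrs.get("gender"),
        "age": face_attrs.get("age"),
        "interests": sorted(voice_keywords),
    }

def recommend(portrait):
    """Step 202: pick recommendation text from the portrait.
    A real system would consult a database, as the description suggests."""
    if "clothing" in portrait["interests"]:
        return "Hello, the brand stores are on the second floor."
    return "Hello, is there anything I can help you with?"
```

Step 203 is then simply speaking the returned string through the voice synthesis module.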
In some optional embodiments, after performing VAD detection on the acquired audio data to determine whether a voice signal is input, the method further includes: in response to the VAD detection finding no voice signal within a certain preset time, starting image recognition to acquire continuous image data at the current moment. Thus, when no voice signal is detected for a long time, image data can be actively acquired, so that a heuristic dialogue can subsequently be initiated with the user. This is friendlier to users who are unaware of the device or do not know how to use a voice device.
Referring further to fig. 3, a flowchart of another embodiment of the voice wake-up method of the present application is shown. This flowchart mainly covers the steps that follow "in response to the VAD detection finding no voice signal within a certain preset time, starting image recognition to acquire continuous image data at the current moment".
As shown in fig. 3, in step 301, whether the data is dynamic image data is determined based on the continuous image data, wherein the dynamic image data includes at least one user;
in step 302, if it is determined to be dynamic image data, a portrait of the at least one user is determined based on the dynamic image data;
in step 303, recommendation information for the at least one user is determined based on the portrait;
in step 304, the recommendation information is fed back to the at least one user.
In this embodiment, for step 301, the voice wake-up apparatus determines whether the collected continuous image data is dynamic image data. For example, in a shopping mall, the intelligent robot continuously photographs and caches images of users passing by, and then determines from the cached continuous images whether they constitute dynamic image data, where the dynamic image data includes at least one user.
Then, for step 302, if the device determines the data to be dynamic image data, it determines a portrait of the at least one user based on that data; for example, the intelligent robot determines information such as gender, age, and interests of at least one user from the dynamic image data.
Thereafter, for step 303, the voice wake-up apparatus determines recommendation information for the at least one user based on the portrait. For example, from chat voice information between the user and friends containing "what shall we eat", together with the user portraits, the intelligent robot may generate recommendation information such as "Hello, the fifth floor of our mall has many restaurants; if you want local specialties you can go to a certain store, and if you want a buffet you can go to a certain store".
Finally, for step 304, the voice wake-up apparatus feeds back the recommendation information to the at least one user. For example, the intelligent robot may stop a passerby and make further inquiries after giving the generated recommendation information to the user.
The method of this embodiment actively initiates a heuristic dialogue with the user when the acquired continuous images contain user motion. This can help users who do not know how to use a voice device, or do not even know it exists, to obtain relevant information through it, giving a better user experience.
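Step 301's check — whether the cached continuous images constitute dynamic image data — can be approximated with simple frame differencing. This sketch assumes frames are equal-length sequences of pixel intensities and uses an arbitrary threshold; a production system would more likely run a person detector on each frame.

```python
def is_dynamic(frames, threshold=10):
    """Return True if consecutive cached frames differ enough to
    suggest motion, i.e. the data is 'dynamic image data'."""
    for prev, cur in zip(frames, frames[1:]):
        # Sum of absolute pixel differences between adjacent frames.
        diff = sum(abs(a - b) for a, b in zip(prev, cur))
        if diff > threshold:
            return True
    return False
```

If `is_dynamic` returns True, the frames would be passed on to portrait estimation (step 302) and the heuristic dialogue module.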
In some optional embodiments, the recommendation information includes a greeting, small talk, or a reminder. The device can thus actively open a dialogue with the user by greeting, chatting, or reminding, better simulating a real-person dialogue scenario; the user accepts it more readily, and the user experience is better.
The following specific example describes some problems the inventor encountered in implementing the present invention and the final solution, so that those skilled in the art can better understand the scheme of the present application.
The technical scheme adopted by the embodiments of the application adds panoramic image recognition on top of the original customized wake-up word, and combines the wake-up word with image recognition, to bring a better experience.
A robot wake-up and heuristic dialogue method and apparatus comprises a voice wake-up/recognition module, a voice synthesis module, an image recognition module, and a heuristic dialogue module. After system initialization the wake-up module is in a dormant state; whether a voice signal is input is detected from the VAD result, and wake-up and dialogue are performed according to whether the voice input is the preset wake-up word. If the detected audio signal is not the preset wake-up word, the panoramic image recognition module is started, and whether there is a dialogue intent is derived from the continuous face data produced by image recognition, so that the device is woken and dialogue detection is performed. Meanwhile, if the input shows no dialogue intent, the data is passed to the heuristic dialogue module.
The heuristic dialogue module collects voiceprint and face data and compares them against a database — for example, the face result (man or woman) or results pre-stored in the database — and uses them to conduct a heuristic dialogue, including greeting, chatting, reminding, and the like.
The working principle of the invention is as follows:
1. VAD detection: after system initialization, VAD detects a signal and wake-up detection is then performed;
2. after the wake-up word is detected, a dialogue is entered according to the result;
3. when the wake-up word is not detected, frames from the panoramic camera are fed to the face recognition module for detection, and the detection results are sent to the heuristic dialogue module for processing;
4. meanwhile, when no sound is detected for a long time, camera data is processed on a timer, and when image motion is detected, the data is likewise fed to the heuristic dialogue module for processing;
5. the heuristic dialogue module conducts a dialogue after comprehensively analyzing the data (e.g., voiceprints, faces, etc.).
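The five working-principle steps above can be tied together as one dispatch cycle. The sketch below is a hypothetical rendering: the components (VAD, wake-word check, camera, heuristic dialogue module) are injected as callables, and none of the names come from a real library.

```python
def run_cycle(audio, vad, has_wake_word, camera_frames,
              heuristic_dialogue, silence_timed_out=False):
    """One pass over steps 1-5 of the working principle."""
    if audio is not None and vad(audio):          # 1. VAD detects a signal
        if has_wake_word(audio):                  # 2. wake word found:
            return "dialogue"                     #    enter the dialogue
        # 3. no wake word: panoramic-camera frames go to face
        #    recognition; results go to the heuristic dialogue module
        return heuristic_dialogue(camera_frames())
    if silence_timed_out:
        # 4. long silence: timed camera processing, and on detected
        #    motion the data is fed to the heuristic dialogue module,
        # 5. which analyzes it (voiceprints, faces) and converses.
        return heuristic_dialogue(camera_frames())
    return "sleep"
```

The caller would loop over audio chunks, setting `silence_timed_out` from a timer, so the device either enters a normal dialogue, starts a heuristic one, or stays dormant.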
Referring to fig. 5, a block diagram of a voice wake-up apparatus according to an embodiment of the invention is shown.
As shown in fig. 5, the voice wake-up apparatus 500 includes a detection module 510, a wake-up determining module 520, an image recognition module 530, an intent judging module 540, and a wake-up execution module 550.
The detection module 510 is configured to perform VAD detection on acquired audio data to determine whether a voice signal is input; the wake-up determining module 520 is configured to, in response to the VAD detection indicating that a voice signal is input, perform wake-up word detection on the input voice signal to determine whether it contains a preset wake-up word; the image recognition module 530 is configured to start image recognition to acquire continuous data of the user's face at the current moment if the voice signal does not contain the preset wake-up word; the intent judging module 540 is configured to determine, based on the continuous face data, whether the user has a dialogue intent; and the wake-up execution module 550 is configured to perform wake-up if the user is determined to have a dialogue intent.
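The module decomposition of fig. 5 can be sketched as a small class with the five modules injected as callables. The interfaces are assumptions made for illustration; the patent only names the modules, not their signatures.

```python
class VoiceWakeUpApparatus:
    """Toy stand-in for apparatus 500: detection (510), wake-up
    determining (520), image recognition (530), intent judging (540),
    and wake-up execution (550) are injected callables."""

    def __init__(self, detect, has_wake_word, recognize_face,
                 judge_intent, execute_wake):
        self.detect = detect                  # module 510: VAD detection
        self.has_wake_word = has_wake_word    # module 520: wake-word check
        self.recognize_face = recognize_face  # module 530: face frames
        self.judge_intent = judge_intent      # module 540: dialogue intent
        self.execute_wake = execute_wake      # module 550: perform wake-up

    def process(self, audio):
        """Run one audio chunk through the fig. 5 pipeline."""
        if not self.detect(audio):
            return False
        # Face recognition runs only when no wake-up word was heard.
        if self.has_wake_word(audio) or self.judge_intent(self.recognize_face()):
            self.execute_wake()
            return True
        return False
```

Wiring in real detector objects instead of the callables would give the concrete apparatus; the control flow stays the same.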
In some optional embodiments, the voice wake-up apparatus 500 further includes: a portrait module (not shown) configured to determine a portrait of the user based on the voice signal and the continuous face data if the user is determined not to have a dialogue intent; a recommendation module (not shown) configured to determine recommendation information for the user based on the portrait; and a feedback module (not shown) configured to feed back the recommendation information to the user.
In some optional embodiments, the image recognition module 530 is further configured to: in response to the VAD detection finding no voice signal within a certain preset time, start image recognition to acquire continuous image data at the current moment.
It should be understood that the modules recited in fig. 5 correspond to various steps in the methods described with reference to fig. 1, 2, and 3. Thus, the operations and features described above for the method and the corresponding technical effects are also applicable to the modules in fig. 5, and are not described again here.
It should be noted that the modules in the embodiments of the present application are not intended to limit the scheme of the present application; for example, a word segmentation module could be described as a module that divides received sentence text into a sentence and at least one entry. In addition, the related functional modules may also be implemented by a hardware processor; for example, the word segmentation module may likewise be implemented by a processor, which is not described herein again.
In other embodiments, an embodiment of the present invention further provides a non-volatile computer storage medium storing computer-executable instructions that can execute the voice wake-up method in any of the above method embodiments.
As one embodiment, the non-volatile computer storage medium of the present invention stores computer-executable instructions configured to:
performing VAD detection on the acquired audio data to determine whether a voice signal is input;
in response to the VAD detection indicating that a voice signal is input, performing wake-up word detection on the input voice signal to determine whether it contains a preset wake-up word;
if the voice signal does not contain the preset wake-up word, starting image recognition to acquire continuous data of the user's face at the current moment;
determining, based on the continuous face data, whether the user has a dialogue intent;
and if the user is determined to have a dialogue intent, performing wake-up.
The non-volatile computer-readable storage medium may include a storage program area and a storage data area, wherein the storage program area may store an operating system and an application program required for at least one function; the storage data area may store data created according to use of the voice wake-up apparatus, and the like. Further, the non-volatile computer-readable storage medium may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage device. In some embodiments, the non-volatile computer-readable storage medium optionally includes memory located remotely from the processor, which may be connected to the voice wake-up apparatus over a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
Embodiments of the present invention further provide a computer program product, the computer program product including a computer program stored on a non-volatile computer-readable storage medium, the computer program including program instructions that, when executed by a computer, cause the computer to perform any of the above voice wake-up methods.
Fig. 6 is a schematic structural diagram of an electronic device according to an embodiment of the present invention. As shown in fig. 6, the electronic device includes: one or more processors 610 and a memory 620, with one processor 610 taken as an example in fig. 6. The apparatus performing the voice wake-up method may further include: an input device 630 and an output device 640. The processor 610, the memory 620, the input device 630, and the output device 640 may be connected by a bus or other means, with a bus connection taken as the example in fig. 6. The memory 620 is a non-volatile computer-readable storage medium as described above. The processor 610 implements the voice wake-up method of the above method embodiments by running the non-volatile software programs, instructions, and modules stored in the memory 620, thereby executing the various functional applications and data processing of the server. The input device 630 may receive input numeric or character information and generate key-signal inputs related to user settings and function control of the voice wake-up apparatus. The output device 640 may include a display device such as a display screen.
The product can execute the method provided by the embodiment of the invention, and has corresponding functional modules and beneficial effects of the execution method. For technical details that are not described in detail in this embodiment, reference may be made to the method provided by the embodiment of the present invention.
As an embodiment, the electronic device is applied to a voice wake-up apparatus, and includes:
at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to cause the at least one processor to:
performing VAD (voice activity detection) on the acquired audio data to determine whether a voice signal is input;
in response to the VAD indicating that a voice signal is input, performing wake-up word detection on the input voice signal to determine whether the voice signal contains a preset wake-up word;
if the voice signal does not contain the preset wake-up word, starting image recognition to acquire continuous data of the user's face at the current moment;
determining, based on the continuous data of the face, whether the user has a dialog intention; and
if it is determined that the user has a dialog intention, executing wake-up.
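The steps above amount to a cascaded decision: wake on the preset wake-up word if it is present, otherwise fall back to a visual check of the user's dialog intention. A minimal sketch of that flow follows; all detector callables (VAD, wake-word detector, face capture, and dialog-intention classifier) are hypothetical placeholders, not implementations disclosed in this document.

```python
# Sketch of the cascaded wake-up decision described in the embodiment above.
# Every detector passed in is a placeholder model, not a disclosed implementation.
from typing import Callable, Sequence

def voice_wakeup(
    audio: bytes,
    vad: Callable[[bytes], bool],                   # step 1: is a voice signal present?
    has_wake_word: Callable[[bytes], bool],         # step 2: contains the preset wake-up word?
    capture_faces: Callable[[], Sequence[object]],  # step 3: continuous face data at the current moment
    has_dialog_intent: Callable[[Sequence[object]], bool],  # step 4: dialog intention?
) -> str:
    """Return 'wake' if the device should wake up, 'no_wake' otherwise."""
    if not vad(audio):
        return "no_wake"   # no voice signal detected: nothing to do on this path
    if has_wake_word(audio):
        return "wake"      # conventional wake-word wake-up
    # No wake-up word: fall back to image recognition of the user's face.
    faces = capture_faces()
    if has_dialog_intent(faces):
        return "wake"      # multimodal wake-up without a wake-up word
    return "no_wake"
```

Note that the image-recognition branch only runs when voice was detected but the wake-up word was absent, which keeps the camera pipeline off during ordinary wake-word use.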
The electronic device of the embodiments of the present application exists in various forms, including but not limited to:
(1) Mobile communication devices: such devices are characterized by mobile communication capabilities and are primarily aimed at providing voice and data communication. Such terminals include smart phones (e.g., the iPhone), multimedia phones, feature phones, and low-end phones.
(2) Ultra-mobile personal computer devices: such devices belong to the category of personal computers, have computing and processing functions, and generally also support mobile internet access. Such terminals include PDA, MID, and UMPC devices, e.g., the iPad.
(3) Portable entertainment devices: such devices can display and play multimedia content. They include audio and video players (e.g., the iPod), handheld game consoles, e-book readers, smart toys, and portable car navigation devices.
(4) Servers: similar in architecture to general-purpose computers, but subject to higher requirements on processing capability, stability, reliability, security, scalability, manageability, and so on, because they must provide highly reliable services.
(5) Other electronic devices with data interaction functions.
The apparatus embodiments described above are merely illustrative. The units described as separate parts may or may not be physically separate, and the parts displayed as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. A person of ordinary skill in the art can understand and implement the solution without inventive effort.
Finally, it should be noted that the above embodiments are only intended to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art will understand that the technical solutions described in the foregoing embodiments may still be modified, or some of their technical features may be equivalently replaced, and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (8)

1. A voice wake-up method, comprising:
performing VAD (voice activity detection) on acquired audio data to determine whether a voice signal is input;
in response to the VAD indicating that a voice signal is input, performing wake-up word detection on the input voice signal to determine whether the voice signal contains a preset wake-up word;
if the voice signal does not contain the preset wake-up word, starting image recognition to acquire continuous data of a user's face at the current moment;
determining, based on the continuous data of the face, whether the user has a dialog intention;
if it is determined that the user has a dialog intention, executing wake-up;
wherein the continuous data of the face is collected by a panoramic camera, and after determining whether the user has a dialog intention based on the continuous data of the face, the method further comprises:
if the user does not have a dialog intention, determining a portrait of the user based on the voice signal and the continuous data of the face;
determining recommendation information for the user based on the portrait; and
feeding the recommendation information back to the user.
2. The method of claim 1, wherein after performing VAD on the acquired audio data to determine whether a voice signal is input, the method further comprises:
in response to the VAD detecting no voice signal within a preset time, starting image recognition to acquire continuous data of an image at the current moment.
3. The method of claim 2, wherein the method further comprises:
determining, based on the continuous data of the image, whether the image is dynamic image data, wherein the dynamic image data includes at least one user;
if it is determined to be dynamic image data, determining a portrait of the at least one user based on the dynamic image data;
determining recommendation information for the at least one user based on the portrait; and
feeding the recommendation information back to the at least one user.
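Claims 1 through 3 share a common fallback: when wake-up is not triggered, the device builds a user portrait and feeds recommendation information back instead. The following sketch illustrates that fallback under the assumption of placeholder portrait and recommendation models; none of these callables are implementations disclosed in the claims.

```python
# Hypothetical sketch of the recommendation fallback in claims 1-3.
# The dynamic-image check, portrait model, and recommender are placeholders.
from typing import Callable, Optional, Sequence

def recommend_without_wakeup(
    voice_signal: Optional[bytes],                   # None when VAD found no voice (claim 2/3 path)
    frames: Sequence[object],                        # continuous image/face data
    is_dynamic: Callable[[Sequence[object]], bool],  # does the image stream contain at least one user?
    build_portrait: Callable[[Optional[bytes], Sequence[object]], dict],
    recommend: Callable[[dict], str],
) -> Optional[str]:
    """Return recommendation information to feed back to the user, or None."""
    # Claim 3 path: with no voice signal, proceed only if the continuous
    # image data is dynamic image data containing at least one user.
    if voice_signal is None and not is_dynamic(frames):
        return None
    # Claims 1 and 3: determine the user's portrait, then derive the
    # recommendation information (e.g., a call, a chat, or a reminder).
    portrait = build_portrait(voice_signal, frames)
    return recommend(portrait)
```

In the claim 1 path both the voice signal and the face data feed the portrait; in the claim 3 path only the dynamic image data is available, which the signature models by passing `None` for the voice signal.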
4. The method of claim 1 or 3, wherein the recommendation information includes a call, a chat, and a reminder.
5. A voice wake-up apparatus, comprising:
a detection module configured to perform VAD (voice activity detection) on acquired audio data to determine whether a voice signal is input;
a voice signal detection module configured to, in response to the VAD indicating that a voice signal is input, perform wake-up word detection on the input voice signal to determine whether the voice signal contains a preset wake-up word;
an image recognition module configured to start image recognition to acquire continuous data of a user's face at the current moment if the voice signal does not contain the preset wake-up word;
an intention determination module configured to determine, based on the continuous data of the face, whether the user has a dialog intention;
a wake-up execution module configured to execute wake-up if it is determined that the user has a dialog intention;
wherein the continuous data of the face is collected by a panoramic camera, and the apparatus further comprises:
a portrait module configured to determine a portrait of the user based on the voice signal and the continuous data of the face if it is determined that the user does not have a dialog intention;
a recommendation module configured to determine recommendation information for the user based on the portrait; and
a feedback module configured to feed the recommendation information back to the user.
6. The apparatus of claim 5, wherein the image recognition module is further configured to: in response to the VAD detecting no voice signal within a preset time, start image recognition to acquire continuous data of an image at the current moment.
7. An electronic device, comprising: at least one processor, and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the steps of the method of any one of claims 1 to 4.
8. A storage medium on which a computer program is stored which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 4.
CN201911394715.9A 2019-12-30 2019-12-30 Voice wake-up method and device Active CN110910887B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911394715.9A CN110910887B (en) 2019-12-30 2019-12-30 Voice wake-up method and device


Publications (2)

Publication Number Publication Date
CN110910887A (en) 2020-03-24
CN110910887B (en) 2022-06-28

Family

ID=69813882

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911394715.9A Active CN110910887B (en) 2019-12-30 2019-12-30 Voice wake-up method and device

Country Status (1)

Country Link
CN (1) CN110910887B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111445910A (en) * 2020-03-26 2020-07-24 杭州涂鸦信息技术有限公司 Non-contact-based voice interaction method, system and equipment
CN111722696B (en) * 2020-06-17 2021-11-05 思必驰科技股份有限公司 Voice data processing method and device for low-power-consumption equipment
CN111880854B (en) * 2020-07-29 2024-04-30 百度在线网络技术(北京)有限公司 Method and device for processing voice
CN112558911B (en) * 2020-12-04 2022-07-08 思必驰科技股份有限公司 Voice interaction method and device for massage chair
CN112669837B (en) * 2020-12-15 2022-12-06 北京百度网讯科技有限公司 Awakening method and device of intelligent terminal and electronic equipment
CN114187909A (en) * 2021-12-14 2022-03-15 思必驰科技股份有限公司 Voice wake-up method and system for medical scene

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2016071192A (en) * 2014-09-30 2016-05-09 株式会社Nttドコモ Interaction device and interaction method
CN105868827A (en) * 2016-03-25 2016-08-17 北京光年无限科技有限公司 Multi-mode interaction method for intelligent robot, and intelligent robot
JP2018045675A (en) * 2016-09-07 2018-03-22 パナソニックIpマネジメント株式会社 Information presentation method, information presentation program and information presentation system
CN107977852A (en) * 2017-09-29 2018-05-01 京东方科技集团股份有限公司 A kind of intelligent sound purchase guiding system and method
CN108432190A (en) * 2015-09-01 2018-08-21 三星电子株式会社 Response message recommends method and its equipment
CN110076769A (en) * 2019-03-20 2019-08-02 广东工业大学 A kind of acoustic control patrol navigation robot system and its control method based on the movement of magnetic suspension sphere

Family Cites Families (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100030578A1 (en) * 2008-03-21 2010-02-04 Siddique M A Sami System and method for collaborative shopping, business and entertainment
US10884503B2 (en) * 2015-12-07 2021-01-05 Sri International VPA with integrated object recognition and facial expression recognition
CN105912092B (en) * 2016-04-06 2019-08-13 北京地平线机器人技术研发有限公司 Voice awakening method and speech recognition equipment in human-computer interaction
CN107679506A (en) * 2017-10-12 2018-02-09 Tcl通力电子(惠州)有限公司 Awakening method, intelligent artifact and the computer-readable recording medium of intelligent artifact
CN109767774A (en) * 2017-11-08 2019-05-17 阿里巴巴集团控股有限公司 A kind of exchange method and equipment
CN108198553B (en) * 2018-01-23 2021-08-06 北京百度网讯科技有限公司 Voice interaction method, device, equipment and computer readable storage medium
CN109343706B (en) * 2018-09-18 2022-03-11 周文 Interactive system and implementation method thereof
CN109887503A (en) * 2019-01-20 2019-06-14 北京联合大学 A kind of man-machine interaction method of intellect service robot
CN109817211B (en) * 2019-02-14 2021-04-02 珠海格力电器股份有限公司 Electric appliance control method and device, storage medium and electric appliance
CN110163715A (en) * 2019-04-03 2019-08-23 阿里巴巴集团控股有限公司 Information recommendation method, device, equipment and system
CN110171005A (en) * 2019-06-10 2019-08-27 杭州任你说智能科技有限公司 A kind of tourism robot system based on intelligent sound box
CN110196900A (en) * 2019-06-13 2019-09-03 三星电子(中国)研发中心 Exchange method and device for terminal


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Prediction of Who Will Be Next Speaker and When Using Mouth-Opening Pattern in Multi-Party Conversation; Ryo Ishii et al.; Multimodal Technologies and Interaction; 2019-10-26; pp. 1-7 *
Response Mode Recognition of Flight Pilots; Xie Xiang et al.; Journal of Beijing Institute of Technology; 2017-07-15; Vol. 37, No. 07; pp. 743-746 *


Similar Documents

Publication Publication Date Title
CN110910887B (en) Voice wake-up method and device
CN110634483B (en) Man-machine interaction method and device, electronic equipment and storage medium
CN108363706B (en) Method and device for man-machine dialogue interaction
CN107632706B (en) Application data processing method and system of multi-modal virtual human
CN112331193B (en) Voice interaction method and related device
US11631408B2 (en) Method for controlling data, device, electronic equipment and computer storage medium
CN110364145A (en) A kind of method and device of the method for speech recognition, voice punctuate
CN110570840B (en) Intelligent device awakening method and device based on artificial intelligence
CN110263131B (en) Reply information generation method, device and storage medium
CN106384591A (en) Method and device for interacting with voice assistant application
CN108564943B (en) Voice interaction method and system
WO2017166651A1 (en) Voice recognition model training method, speaker type recognition method and device
CN109377979B (en) Method and system for updating welcome language
CN112634911B (en) Man-machine conversation method, electronic device and computer readable storage medium
CN109032554B (en) Audio processing method and electronic equipment
CN112634895A (en) Voice interaction wake-up-free method and device
CN107809674A (en) A kind of customer responsiveness acquisition, processing method, terminal and server based on video
CN112863508A (en) Wake-up-free interaction method and device
CN110491384B (en) Voice data processing method and device
CN114065168A (en) Information processing method, intelligent terminal and storage medium
CN114333774A (en) Speech recognition method, speech recognition device, computer equipment and storage medium
JP2016189121A (en) Information processing device, information processing method, and program
WO2023006033A1 (en) Speech interaction method, electronic device, and medium
CN112863511B (en) Signal processing method, device and storage medium
CN112820265B (en) Speech synthesis model training method and related device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 215123 14 Tengfei Innovation Park, 388 Xinping street, Suzhou Industrial Park, Suzhou, Jiangsu.

Applicant after: Sipic Technology Co.,Ltd.

Address before: 215123 14 Tengfei Innovation Park, 388 Xinping street, Suzhou Industrial Park, Suzhou, Jiangsu.

Applicant before: AI SPEECH Ltd.

GR01 Patent grant