CN117119102A - Wake-up method for a voice interaction function, and electronic device - Google Patents

Wake-up method for a voice interaction function, and electronic device

Info

Publication number
CN117119102A
CN117119102A (application CN202310310747.6A)
Authority
CN
China
Prior art keywords
electronic device
user
voice
microphone
image
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310310747.6A
Other languages
Chinese (zh)
Inventor
吴满意
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Honor Device Co Ltd
Original Assignee
Honor Device Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Honor Device Co Ltd filed Critical Honor Device Co Ltd
Priority to CN202310310747.6A priority Critical patent/CN117119102A/en
Publication of CN117119102A publication Critical patent/CN117119102A/en
Pending legal-status Critical Current

Classifications

    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04M - TELEPHONIC COMMUNICATION
    • H04M 1/00 - Substation equipment, e.g. for use by subscribers
    • H04M 1/72 - Mobile telephones; Cordless telephones, i.e. devices for establishing wireless links to base stations without route selection
    • H04M 1/724 - User interfaces specially adapted for cordless or mobile telephones
    • H04M 1/72448 - User interfaces specially adapted for cordless or mobile telephones with means for adapting the functionality of the device according to specific conditions
    • H04M 1/72454 - User interfaces specially adapted for cordless or mobile telephones with means for adapting the functionality of the device according to specific conditions according to context-related or environment-related conditions
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 3/00 - Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F 3/01 - Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F 3/011 - Arrangements for interaction with the human body, e.g. for user immersion in virtual reality
    • G06F 3/013 - Eye tracking input arrangements
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 - Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/10 - Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V 40/18 - Eye characteristics, e.g. of the iris
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/22 - Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/24 - Speech recognition using non-acoustical features
    • G10L 15/25 - Speech recognition using non-acoustical features using position of the lips, movement of the lips or face analysis
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04M - TELEPHONIC COMMUNICATION
    • H04M 1/00 - Substation equipment, e.g. for use by subscribers
    • H04M 1/02 - Constructional features of telephone sets
    • H04M 1/03 - Constructional features of telephone transmitters or receivers, e.g. telephone hand-sets
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/22 - Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L 2015/223 - Execution procedure of a spoken command
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/22 - Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L 2015/225 - Feedback of the input speech
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04M - TELEPHONIC COMMUNICATION
    • H04M 2250/00 - Details of telephonic subscriber devices
    • H04M 2250/12 - Details of telephonic subscriber devices including a sensor for measuring a physical value, e.g. temperature or motion
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04M - TELEPHONIC COMMUNICATION
    • H04M 2250/00 - Details of telephonic subscriber devices
    • H04M 2250/52 - Details of telephonic subscriber devices including functional features of a camera
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04M - TELEPHONIC COMMUNICATION
    • H04M 2250/00 - Details of telephonic subscriber devices
    • H04M 2250/74 - Details of telephonic subscriber devices with voice recognition means
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04R - LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; DEAF-AID SETS; PUBLIC ADDRESS SYSTEMS
    • H04R 2499/00 - Aspects covered by H04R or H04S not otherwise provided for in their subgroups
    • H04R 2499/10 - General applications
    • H04R 2499/11 - Transducers incorporated or for use in hand-held devices, e.g. mobile phones, PDA's, camera's

Abstract

The application provides a wake-up method for a voice interaction function, and an electronic device. When the microphones of the electronic device are unoccupied, the electronic device detects the user state from a user image collected by the camera. When the user is detected to be gazing at the electronic device at close range, the electronic device further detects, from voice signals collected by a plurality of microphones, whether the difference in the distribution of the voice signal across the microphones meets a preset condition, and if so, wakes up the voice interaction function of the electronic device.

Description

Wake-up method for a voice interaction function, and electronic device
Technical Field
The present application relates to the field of terminals, and in particular, to a method for waking up a voice interaction function and an electronic device.
Background
Voice interaction functions are now widely used in electronic devices, and how to wake up the voice interaction function of an electronic device is a problem to be solved.
Disclosure of Invention
The application provides a wake-up method for a voice interaction function, and an electronic device. When the microphones of the electronic device are unoccupied, the electronic device detects the user state from a user image collected by the camera. When the user is detected to be gazing at the electronic device at close range, the electronic device further detects, from voice signals collected by a plurality of microphones, whether the difference in the distribution of the voice signal across the microphones meets a preset condition, and if so, wakes up the voice interaction function of the electronic device.
In a first aspect, the present application provides a method for waking up a voice interaction function. The method includes: the electronic device collects, through a camera, a first image containing the user's face; the electronic device determines, from the first image, the distance between the user's face and the electronic device and the user's gaze direction, where the distance is within a preset range and the gaze direction indicates that the user is gazing at the electronic device; the electronic device collects a first voice signal through a first microphone and a second voice signal through a second microphone, the first voice signal and the second voice signal differing in intensity; and the electronic device wakes up the voice interaction function.
With the method provided in the first aspect, the electronic device can accurately recognize, from the user state and the user's voice, the user's intention to wake up the voice interaction function, and start the voice interaction function for the user.
In combination with the method provided in the first aspect, before the electronic device collects the first image containing the user's face through the camera, the method further includes: the electronic device detects that it has been picked up or raised.
Since a user who wishes to wake up the voice interaction function usually picks up or raises the phone first, checking in advance whether the electronic device has been picked up or raised, and performing the subsequent image detection and voice detection only after it has, spares the electronic device unnecessary detection work and saves power.
In combination with the method provided in the first aspect, before the electronic device collects the first image including the face of the user through the camera, the method further includes: the electronic device detects that both the first microphone and the second microphone are unoccupied.
In this way, scenarios unsuitable for turning on the voice interaction function can be preliminarily excluded. Specifically, after the voice interaction function is started, the electronic device needs to occupy a microphone to collect the user's voice signal and respond to the user's voice input. To avoid conflicts between the voice interaction function and other services that use the microphone, before waking up the voice interaction function the electronic device checks in advance whether its microphones are occupied, thereby ruling out situations in which the function should not be started, for example when the user is using the electronic device to answer a call, record audio, record video, send a voice message, or attend an online meeting.
With reference to the method provided in the first aspect, after the electronic device collects the first image containing the user's face through the camera, the method further includes: determining, from the first image, that the user's mouth shape has changed.
In this way, the plurality of microphones are turned on to collect voice signals and to check whether the voice signals on the microphones differ only after the user's mouth shape is determined to have changed, which avoids turning on the microphones when the user is not speaking and saves power.
In combination with the method provided in the first aspect, the electronic device waking up the voice interaction function specifically includes: the electronic device collects a voice instruction through a third microphone and executes the task corresponding to the voice instruction.
In this way, after the voice interaction function is woken up, the electronic device can continuously receive the user's voice instructions and execute the corresponding tasks, improving the user experience.
With reference to the method provided in the first aspect, after the electronic device wakes up the voice interaction function, the method further includes: the electronic equipment extracts voice instructions from the first voice signal and/or the second voice signal and executes tasks corresponding to the voice instructions.
In this way, the user's voice instructions are not missed: after the voice interaction function is woken up, the electronic device can perform semantic analysis on the voice signals collected before the wake-up and execute the corresponding tasks.
With reference to the method provided in the first aspect, after the electronic device wakes up the voice interaction function, the method further includes: the electronic equipment outputs prompt information, and the prompt information is used for prompting that the voice interaction function is started currently; the prompt message comprises: speech and/or interface elements displayed on a display screen.
In this way, the user is promptly informed that the voice interaction function has been woken up, improving the user experience.
With reference to the method provided in the first aspect, the first microphone is a microphone at the top of the electronic device, and the second microphone is a microphone at the bottom of the electronic device.
With reference to the method provided in the first aspect, the first microphone and the second microphone differ in sensitivity, and the first voice signal and the second voice signal differing in intensity specifically includes: the first voice signal or the second voice signal, after being compensated based on the sensitivity difference, differ in intensity. Alternatively, the first voice signal and the second voice signal differing in intensity specifically includes: the number of corresponding sub-segments in which the intensity difference between the first voice signal and the second voice signal is greater than a first value is greater than a second value.
This helps the electronic device further determine whether the user's speaking at this moment is intended to wake up the voice interaction function.
With reference to the method provided in the first aspect, after the electronic device collects the first image containing the user's face through the camera, the method further includes: performing identity verification based on the user's face in the first image.
Therefore, the voice interaction function can be awakened through face authentication under the condition of ensuring the privacy of the user.
With reference to the method provided in the first aspect, after the electronic device collects the first voice signal through the first microphone and the second voice signal through the second microphone, the method further includes: extracting voiceprint information from the first voice signal or the second voice signal, and performing identity verification based on the voiceprint information.
In this way, the voice interaction function can be provided after voiceprint authentication while the user's privacy is protected.
In combination with the method provided in the first aspect, the electronic device determines the distance between the user's face and the electronic device from the first image specifically as follows: the electronic device extracts the imaged interpupillary distance from the first image; based on the imaged interpupillary distance, the focal length of the camera, and the actual interpupillary distance, the electronic device calculates the distance between the user's face and the electronic device using the similar-triangles principle.
Therefore, the distance between the face of the user and the electronic equipment can be accurately known.
In combination with the method provided in the first aspect, the electronic device determines, according to the first image, a gaze direction of the user, specifically by: the electronic device analyzes the position of the pupil in the eye from the first image, and determines a gaze direction of the user.
In this way, it can be accurately determined whether the user's gaze direction is toward the electronic device.
In combination with the method provided in the first aspect, the first voice signal is obtained by bandpass filtering a voice signal collected by a first microphone; the second voice signal is obtained by band-pass filtering the voice signal collected by the second microphone.
In this way, non-speech components such as environmental noise are filtered out, which facilitates the subsequent difference detection by the electronic device.
In a second aspect, the present application provides an electronic device comprising: one or more processors, one or more memories, at least two microphones, a camera, and a display screen; the one or more memories are coupled with one or more processors, the one or more memories being for storing computer program code comprising computer instructions that, when executed by the one or more processors, cause the electronic device to perform the method as described in any of the first aspects.
In a third aspect, the present application provides a computer readable storage medium comprising computer instructions which, when run on an electronic device, cause the electronic device to perform a method as described in any of the first aspects.
Drawings
FIG. 1 is a schematic view of a scene provided in an embodiment of the present application;
FIG. 2 is a flow chart of a method according to an embodiment of the present application;
FIGS. 3A-3D are schematic diagrams of a set of operation interfaces for turning on the "close-range wake-up" function according to an embodiment of the present application;
fig. 4 is a schematic diagram of a distance calculation principle between a face and an electronic device according to an embodiment of the present application;
fig. 5 is a schematic diagram of an implementation process for determining a gaze direction of a user according to an embodiment of the present application;
fig. 6 is a schematic diagram of the distribution of the same section of voice signals detected by the microphones at two distances according to the embodiment of the present application;
fig. 7 is a schematic hardware architecture of an electronic device 100 according to an embodiment of the present application;
fig. 8 is a schematic software architecture of the electronic device 100 according to an embodiment of the present application.
Detailed Description
The technical solutions of the embodiments of the present application will be clearly and thoroughly described below with reference to the accompanying drawings. In the description of the embodiments of the present application, unless otherwise indicated, "/" means "or"; for example, A/B may represent A or B. The term "and/or" merely describes an association between the associated objects and indicates that three relations may exist; for example, A and/or B may indicate: A alone, both A and B, or B alone.
The terms "first," "second," and the like, are used below for descriptive purposes only and are not to be construed as implying or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defining "a first" or "a second" may explicitly or implicitly include one or more such feature, and in the description of embodiments of the application, unless otherwise indicated, the meaning of "a plurality" is two or more.
Reference in the specification to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment may be included in at least one embodiment of the application. The appearances of such phrases in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. Those of skill in the art will explicitly and implicitly appreciate that the described embodiments of the application may be combined with other embodiments.
The term "User Interface (UI)" in the following embodiments of the present application is a media interface for interaction and information exchange between an application program or an operating system and a user, which enables conversion between an internal form of information and a form acceptable to the user. The user interface is a source code written in a specific computer language such as java, extensible markup language (extensible markup language, XML) and the like, and the interface source code is analyzed and rendered on the electronic equipment to finally be presented as content which can be identified by a user. A commonly used presentation form of the user interface is a graphical user interface (graphic user interface, GUI), which refers to a user interface related to computer operations that is displayed in a graphical manner. It may be a visual interface element of text, icons, buttons, menus, tabs, text boxes, dialog boxes, status bars, navigation bars, widgets, etc., displayed in a display of the electronic device.
Technology for implementing man-machine interaction by voice has been widely used, however, in practical application, if a user wants to perform voice interaction with an electronic device, the user needs to wake up the voice interaction function of the electronic device first. In some implementations, the user may wake up the voice interactive functions of the electronic device by voice input of wake words/breath sounds, or by operation of physical buttons/virtual controls, etc.
These wake-up methods have a number of drawbacks. For example, speaking a wake-up word or breath sound aloud is not natural and can be socially awkward. As another example, operating physical buttons or virtual controls is cumbersome, does not free the user's hands, and also carries a privacy risk.
Referring to fig. 1, fig. 1 illustrates a schematic view of a scenario provided by the present application.
As shown in fig. 1, when a user holds the electronic device in front of them where they can look at it, and then speaks to the electronic device at close range, the electronic device can wake up the voice interaction function and, according to the user's voice input, complete various tasks such as responding to instruction input, querying information, or voice chat.
It should be understood that fig. 1 is only an exemplary scenario and should not be construed as limiting the implementation of the present application. For example, the scenarios to which the present application applies further include: when the user picks up the electronic device and holds it in front of them where they can look at it, and then speaks to the electronic device at close range, the electronic device can likewise wake up the voice interaction function and complete various tasks such as responding to instruction input, querying information, or voice chat according to the user's voice input.
In connection with the scenario illustrated in fig. 1, when a user intends to wake up the voice interaction function, they will typically gaze at the electronic device at close range and speak with their mouth near a certain microphone of the electronic device, e.g. the bottom microphone.
In order to solve the problems, the application provides a wake-up method of a voice interaction function and electronic equipment. Specifically, the electronic device may detect a user state through a user image collected by the camera, detect a voice signal distribution condition on each microphone through voice signals collected by the plurality of microphones, and wake up a voice interaction function after detecting that the user is in a state of closely gazing at the electronic device and detecting that a difference exists in the voice signal distribution on each microphone.
In one implementation, the order of the two detections, namely detecting the user state from the user image collected by the camera and detecting the distribution of the voice signals across the microphones from the voice signals collected by the plurality of microphones, is not limited; the two may be performed simultaneously or one after the other. For example, after the user image collected by the camera shows that the user is gazing at the electronic device at close range, the voice signals collected by the plurality of microphones are examined to determine whether their distribution across the microphones differs. As another example, after a difference between the voice signals collected by the microphones is detected, the user state is detected from the user image collected by the camera to determine whether the user is gazing at the electronic device at close range.
In another implementation, before waking up the voice interaction function, the electronic device may also detect whether its plurality of microphones are occupied; only when the microphones are determined to be unoccupied does the electronic device detect the user state from the user image collected by the camera, or detect whether the voice signals on the microphones differ based on the voice signals collected by the plurality of microphones.
In another implementation, before detecting the user state, the electronic device may also detect its own motion state; for example, only after detecting that it has been picked up or raised does it start the next detection, namely detecting the user state.
In another implementation, when the electronic device detects the user state based on the user image, the electronic device may detect the mouth shape of the user in addition to the distance between the user and the electronic device and the gaze direction of the user.
In the embodiment of the application, the voice interaction function is equivalent to the function provided by the voice assistant in the electronic device, and after the voice assistant is awakened, the user can perform voice interaction with the electronic device by inputting voice into the voice assistant so as to control the electronic device to execute corresponding tasks, such as executing tasks of instruction input, information viewing or voice chat.
After the wake-up method of the voice interaction function provided by the application is implemented, the electronic equipment can accurately recognize the intention of the user for waking up the voice interaction function, so that the voice interaction function is started for the user, and the user experience is improved.
The wake-up method of the voice interaction function provided by the application is described next with reference to the method flow shown in fig. 2.
As shown in fig. 2, the method flow includes the steps of:
Optional S11, detecting whether the microphones of the electronic device are occupied.
Specifically, the electronic device detecting whether the microphones are occupied includes detecting whether the first microphone and the second microphone are occupied. When it detects that the first microphone and the second microphone are not occupied by other services, it executes the subsequent S12. When it detects that a microphone is occupied by another service, it stops executing the subsequent steps and continues to check whether the microphone is occupied; only once the microphone is detected to be unoccupied does it proceed with the subsequent S12. In the embodiment of the application, the first microphone and the second microphone are microphones at two different positions on the electronic device; for example, the first microphone is a top microphone and the second microphone is a bottom microphone.
It will be appreciated that S11 is an optional step; if S11 is omitted, the electronic device may directly execute S12 or S13. Preferably, performing S11 can preliminarily exclude scenarios unsuitable for turning on the voice interaction function. Specifically, after the voice interaction function is started, the electronic device needs to occupy a microphone to collect the user's voice signal and respond to the user's voice input. To avoid conflicts between the voice interaction function and other services that use the microphone, before waking up the voice interaction function the electronic device checks in advance whether its microphones are occupied, thereby ruling out situations in which the function should not be started, for example when the user is using the electronic device to answer a call, record audio, record video, send a voice message, or attend an online meeting.
In one implementation, whether a microphone is occupied may be detected by the audio manager in the application framework layer of the electronic device.
In one implementation, the condition that triggers the electronic device to detect whether the microphones are occupied may be any of the following: the electronic device is powered on, or the electronic device turns on a corresponding function, such as the "close-range wake-up" function.
Referring to fig. 3A-3D, fig. 3A-3D schematically illustrate operation interfaces for turning on the "close-range wake-up" function.
As shown in fig. 3A, the user interface 310 displayed by the electronic device is a settings interface, in which a plurality of setting options including a smart assistant option 311 are displayed. When the electronic device detects an operation on the smart assistant option 311, it displays the user interface 320 shown in fig. 3B in response to the operation.
As shown in FIG. 3B, user interface 320 is a smart assistant details page in which a series of functional options provided by the smart assistant, such as voice assistant options 321, etc., are displayed. When the electronic device detects an operation on the voice assistant option 321, the electronic device displays the user interface 330 shown in fig. 3C in response to the operation.
As shown in fig. 3C, the user interface 330 displays a plurality of options for waking up the voice assistant (also called waking up the voice interaction function) and the controls corresponding to these options, for example the switch control 331 corresponding to voice wake-up and the switch control 332 corresponding to close-range wake-up. At this point both the switch control 331 and the switch control 332 are in the off state. After the electronic device detects an operation acting on the switch control 332 corresponding to close-range wake-up, it displays the user interface 330 shown in fig. 3D in response to the operation, so as to prompt the user that the close-range wake-up function has been turned on.
As shown in fig. 3D, the user interface 330 is similar to the user interface 330 shown in fig. 3C, except that the switch control 332 corresponding to close-range wake-up is now in the on state.
Voice wake-up means that after the user turns on the voice wake-up function and enters a wake-up word, the electronic device can detect whether the user has spoken the wake-up word, and if so, wakes up the voice interaction function of the electronic device to provide convenient services for the user.
After the user turns on the close-range wake-up function, the electronic device may detect whether the conditions corresponding to close-range wake-up are met, for example whether the microphones are occupied by other services, whether the user is gazing at the electronic device and speaking at close range, and whether the distribution of the voice signals across the plurality of microphones differs; if so, it may wake up the voice interaction function of the electronic device to provide convenient services for the user.
In addition, "voice assistant" and "close-range wake-up" are merely one possible set of names for the corresponding functions, and the embodiments of the present application are not limited thereto. The functions corresponding to the voice assistant and close-range wake-up are described in detail below and are only briefly introduced here. The voice assistant may also be called smart voice or voice interaction, and close-range wake-up may also be called smart wake-up, and so on.
It will be appreciated that the foregoing merely illustrates one interface for turning on the "close-range wake-up" function; in other embodiments, the electronic device may also turn on the "close-range wake-up" function by default, which is not limited by the embodiments of the present application.
S12, acquiring an image through a camera.
Specifically, after detecting that the microphones are not occupied by other services, the electronic device may turn on the camera to collect images. Generally, the electronic device continuously collects a plurality of images over a period of time, which are used in the subsequent S13 to determine whether the user state meets the preset condition.
In the embodiment of the application, the camera is a front camera of the electronic equipment and is a low-power consumption camera.
Optionally, besides being turned on when the microphones are detected to be unoccupied by other services, the front camera of the electronic device may also be kept on continuously, or turned on after the electronic device is picked up or raised.
S13, detecting whether the user state meets the preset condition based on the image.
Specifically, based on the image, the electronic device may detect through image analysis whether a face is present in the current image, extract the user's eye information from the face information, and thereby determine the user's gaze direction, calculate the distance between the face and the electronic device, and so on, in order to determine whether the user state meets the preset condition. When the user state is detected to meet the preset condition, the subsequent S14 is executed; when it does not, the subsequent steps are not executed and the user state continues to be detected until it meets the preset condition, after which the subsequent S14 is executed. The image containing the face may also be referred to as the first image.
Executing S13 allows the electronic device to identify, from the user state, whether the user intends to wake up the voice interaction function, further excluding scenarios in which the user does not intend to turn it on.
Optionally, the electronic device may further extract a face based on the acquired image to perform face recognition, so as to determine whether the current user is a registered user of the electronic device.
In the embodiment of the present application, the preset conditions required to be satisfied by the user state include: the face is within a preset range (e.g., 15-25 cm) from the electronic device and the user is looking at the electronic device. Optionally, the preset conditions may further include: the user is speaking and the face authentication passes.
The following describes in detail an implementation of detecting a distance between a face and an electronic device, a gaze direction of a user, and whether the user is speaking based on an image of the user.
(1) Distance between face and electronic device.
Referring to the schematic diagram of the principle for calculating the distance between the face and the electronic device shown in fig. 4, the known quantities are the actual interpupillary distance H, the focal length f, and the imaged interpupillary distance h, and the unknown quantity is the object distance D. The actual interpupillary distance H is obtained from interpupillary-distance statistics of a large number of users, the focal length f is a parameter of the front camera of the electronic device, the imaged interpupillary distance h is the interpupillary distance extracted from the captured user image, and the object distance D is approximately the distance between the face and the electronic device. Therefore, according to the similar-triangles principle, D = (H × f) / h, from which D is calculated.
Optionally, the H used in the present application is an average interpupillary distance obtained from interpupillary-distance statistics of a large number of users, or may be the average interpupillary distance of a large number of users in the same age group as the current user of the electronic device; the embodiment of the present application does not limit this.
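As a minimal numerical sketch of this similar-triangles relation (the 63 mm population-average interpupillary distance and the pixel values below are illustrative assumptions, not values taken from the application):

```python
def face_distance_mm(imaged_ipd_px: float,
                     focal_length_px: float,
                     actual_ipd_mm: float = 63.0) -> float:
    """Estimate the face-to-device distance from the pinhole-camera
    similar-triangles relation h / f = H / D, i.e. D = H * f / h."""
    # imaged_ipd_px: interpupillary distance h measured in the image (pixels)
    # focal_length_px: camera focal length f expressed in pixels
    # actual_ipd_mm: assumed average interpupillary distance H (hypothetical value)
    return actual_ipd_mm * focal_length_px / imaged_ipd_px


# Example: with f = 1500 px and an imaged interpupillary distance of 470 px,
# D = 63 * 1500 / 470, roughly 201 mm (about 20 cm), which lies inside the
# 15-25 cm preset range mentioned above.
print(round(face_distance_mm(imaged_ipd_px=470.0, focal_length_px=1500.0)))
```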
(2) The gaze direction of the user.
Referring to the implementation process schematic diagram of determining the gazing direction of the user shown in fig. 5, the electronic device may detect whether a face exists in the current user image through image analysis based on the user image, and if the face exists, extract binocular images in the face, and then determine the gazing direction of the user by analyzing the position of the pupil in the eyeball.
When the pupil is at, or within a certain range of, the center of the eye, the user's gaze direction is considered to be the direction of the electronic device. Correspondingly, when the pupil lies outside that range from the center of the eye, the user is considered not to be gazing at the electronic device at that moment.
In particular, when the user image contains only a side view of the face or the facial outline is otherwise incomplete, the image is treated as containing no face. This is because, when the front camera of the electronic device can only capture the side of the user's face, the user is very likely not looking at the electronic device at that moment and therefore very unlikely to intend to wake up the voice interaction function; so, in the case where only a side face is captured or the facial outline is otherwise incomplete, the eye image is not further extracted to judge the user's gaze direction, in order to save power.
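A minimal sketch of this pupil-position check is given below; the normalized offsets and the 0.15 tolerance are illustrative assumptions rather than values from the application:

```python
def is_gazing_at_device(pupil_x: float, pupil_y: float,
                        eye_center_x: float, eye_center_y: float,
                        eye_width: float, eye_height: float,
                        tolerance: float = 0.15) -> bool:
    """Treat the user as gazing at the device when the pupil lies within a
    small range of the eye center; offsets are normalized by the eye size
    so the check does not depend on how large the eye appears in the image."""
    dx = abs(pupil_x - eye_center_x) / eye_width
    dy = abs(pupil_y - eye_center_y) / eye_height
    return dx <= tolerance and dy <= tolerance
```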
(3) The mouth shape of the user.
Specifically, the electronic device may also extract mouth-shape change information of the user from a plurality of consecutive user images, to detect whether the user is speaking, or even what the user is saying. When the user's mouth shape is detected to change, the user is considered to be speaking and therefore to have the intention of waking up the voice interaction function; when the user's mouth shape is detected not to change, the user is considered not to be speaking and therefore to have no intention of waking up the voice interaction function.
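One simple way to realize such a check, sketched below under the assumption that lip landmarks are available per frame, is to track a mouth-opening ratio over consecutive frames and treat a sufficiently large variation as a mouth-shape change; the ratio and the threshold are illustrative, not the application's own criterion:

```python
def mouth_shape_changed(mouth_heights: list[float],
                        mouth_widths: list[float],
                        change_threshold: float = 0.08) -> bool:
    """Decide from consecutive frames whether the user's mouth shape changes.
    mouth_heights / mouth_widths are per-frame lip-landmark measurements;
    the opening ratio (height / width) is compared across frames, and a
    spread above the threshold is taken to mean the user is speaking."""
    ratios = [h / w for h, w in zip(mouth_heights, mouth_widths)]
    return max(ratios) - min(ratios) > change_threshold
```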
(4) And (5) face authentication.
Specifically, the electronic device may further compare the collected user image with a pre-stored face image of the user; when the similarity meets the condition, the face authentication is considered to pass, and otherwise it fails.
S14, voice signals are collected through a plurality of microphones.
Specifically, after detecting that the user state meets the preset condition, the electronic device may turn on a plurality of microphones to collect voice signals. Generally, the electronic device continuously collects a segment of voice signal over a period of time, which is used in the subsequent S15 to determine whether the difference in voice signal distribution across the microphones meets the preset condition.
In an embodiment of the present application, the plurality of microphones includes at least two microphones disposed at different positions, for example, a microphone disposed at a top of the electronic device and a microphone disposed at a bottom of the electronic device.
S15, detecting whether the voice signal distribution difference on the plurality of microphones meets a preset condition.
Specifically, because the microphones are located at different physical positions on the electronic device, if the acoustic parameters of the two microphones are identical, then when the user speaks to the electronic device at close range with the device roughly parallel to the face, the same segment of speech spoken by the user is distributed differently across the two microphones. Concretely, the energy of the voice signal on the microphone at the bottom of the electronic device (which may also be referred to as the second voice signal) is greater than the energy of the voice signal on the microphone at the top of the electronic device (which may also be referred to as the first voice signal). This is because, when the user speaks to the electronic device at close range with the device roughly parallel to the face, the microphone at the bottom of the electronic device is closer to the user's mouth than the microphone at the top.
In summary, taking two microphones, a top microphone and a bottom microphone, as an example, the preset condition set in the present application for the difference in voice signal distribution across the microphones is as follows: a voice segment of fixed duration is cut into a number of consecutive sub-segments of fixed duration, and when the number of sub-segments in which the signal intensity on the bottom microphone exceeds that on the top microphone is greater than N, the difference in voice signal distribution across the microphones is determined to meet the preset condition. Here N is a preset value, which is not limited by the present application.
In a specific implementation, based on the signal intensity distribution in the time domain/frequency domain on the two microphones, the electronic device may periodically compare the signal intensity on the bottom microphone with that on the top microphone over a segment of continuous speech, incrementing a count by 1 whenever the signal intensity on the bottom microphone is greater than that on the top microphone and the difference exceeds a preset intensity difference (also referred to as the first value). After the whole segment of continuous speech has been compared, if the count is greater than N (N may also be referred to as the second value), the difference in speech signal distribution on the two microphones is considered to satisfy the preset condition.
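The counting logic described above can be sketched as follows; the segment length, the intensity margin standing in for the first value, and the count threshold standing in for the second value are illustrative parameters:

```python
import numpy as np

def mic_distribution_differs(top_sig: np.ndarray,
                             bottom_sig: np.ndarray,
                             sample_rate: int = 48000,
                             segment_s: float = 0.1,
                             margin_db: float = 3.0,
                             n_required: int = 10) -> bool:
    """Cut two synchronized microphone signals into fixed-length sub-segments
    and count how often the bottom microphone is louder than the top one by
    more than margin_db; the preset condition is met when the count exceeds
    n_required."""
    seg = int(segment_s * sample_rate)
    count = 0
    for start in range(0, min(len(top_sig), len(bottom_sig)) - seg + 1, seg):
        top_rms = np.sqrt(np.mean(top_sig[start:start + seg] ** 2)) + 1e-12
        bot_rms = np.sqrt(np.mean(bottom_sig[start:start + seg] ** 2)) + 1e-12
        if 20.0 * np.log10(bot_rms / top_rms) > margin_db:
            count += 1
    return count > n_required
```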
In another specific implementation, the signal intensity detected by the microphones differs with the distance between the user's face and the electronic device, so the aforementioned preset intensity difference also differs: the farther the distance, the smaller the corresponding signal intensity difference. Referring specifically to fig. 6, fig. 6 illustrates the distribution of the same segment of voice signal detected by the microphones at two distances; it can be seen that the signal intensity/energy on the top microphone and on the bottom microphone is smaller when the user's face is farther from the electronic device (e.g., 30 cm) than when it is closer (e.g., 20 cm).
Optionally, since the frequency range of human speech usually lies within 200 Hz to 8000 Hz, the collected voice signals may be band-pass filtered before detecting whether the difference in voice signal distribution across the plurality of microphones meets the preset condition, so as to filter out components below 200 Hz and above 8000 Hz and avoid analyzing the non-speech parts of the signals picked up by the microphones.
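Such a band-pass pre-filter could, for instance, be realized with a Butterworth design; the 4th-order filter and the 48 kHz sample rate below are assumptions for illustration:

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

def bandpass_speech(signal: np.ndarray, sample_rate: int = 48000,
                    low_hz: float = 200.0, high_hz: float = 8000.0) -> np.ndarray:
    """Keep only the 200 Hz - 8000 Hz band where human speech lies,
    suppressing low-frequency rumble and high-frequency noise."""
    sos = butter(4, [low_hz, high_hz], btype="bandpass",
                 fs=sample_rate, output="sos")
    return sosfiltfilt(sos, signal)
```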
It can be understood that, when the acoustic parameters of the microphones on the electronic device are not identical, the signals on the microphones are compensated during the detection of whether the voice signal distribution difference meets the preset condition, and the compensated signals are then compared to determine whether the preset condition is met. In a specific example in which the acoustic parameter is sensitivity, if the sensitivity of the top microphone is lower than that of the bottom microphone by 16 dB, then during the detection it is necessary either to boost the signal on the top microphone by 16 dB or to attenuate the signal on the bottom microphone by 16 dB, and then compare the compensated top-microphone signal with the bottom-microphone signal.
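A minimal sketch of this compensation step, applied to sample amplitudes before the comparison (the 16 dB gap is simply the example figure above):

```python
import numpy as np

def compensate_sensitivity(top_sig: np.ndarray,
                           sensitivity_gap_db: float = 16.0) -> np.ndarray:
    """Boost the top-microphone samples by the known sensitivity gap (in dB)
    so the two microphones can be compared on an equal footing."""
    gain = 10.0 ** (sensitivity_gap_db / 20.0)
    return top_sig * gain
```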
In the embodiment of the present application, the foregoing shows only one execution order of S12-S13 and S14-S15, namely executing S12-S13 first and S14-S15 afterwards. In another implementation of the present application, the electronic device may execute S12-S13 and S14-S15 simultaneously, or execute S14-S15 first and S12-S13 afterwards; that is, the electronic device first detects, from the voice signals collected by the plurality of microphones, whether the difference in voice signal distribution across the microphones meets the preset condition, and if so, collects an image through the camera and detects, based on the image, whether the user state meets the preset condition. One possible gating order is sketched after this paragraph.
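The following is a minimal sketch of such an end-to-end gating flow; the callables passed in (occupancy check, user-state check, recording, distribution check, wake-up action) are hypothetical stand-ins for S11-S16 rather than real platform APIs, and the image-first ordering shown is only one of the orderings described above:

```python
from typing import Callable, Tuple
import numpy as np

def try_close_range_wake(mics_unoccupied: Callable[[], bool],
                         user_state_ok: Callable[[], bool],
                         record_mics: Callable[[], Tuple[np.ndarray, np.ndarray]],
                         distribution_differs: Callable[[np.ndarray, np.ndarray], bool],
                         wake: Callable[[], None]) -> bool:
    """One possible ordering of the checks: S11 -> S12/S13 -> S14/S15 -> S16."""
    if not mics_unoccupied():        # S11: another service holds a microphone
        return False
    if not user_state_ok():          # S12-S13: close-range gaze (and mouth movement)
        return False
    top_sig, bottom_sig = record_mics()                  # S14
    if not distribution_differs(top_sig, bottom_sig):    # S15
        return False
    wake()                           # S16: start the voice interaction flow
    return True
```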
S16, waking up the voice interaction function.
Specifically, after detecting that the voice signal distribution difference on the plurality of microphones meets the preset condition, the electronic device wakes up the voice interaction function, which is equivalent to starting the voice interaction flow.
Turning on the voice interaction function includes: collecting voice commands through a third microphone and performing semantic analysis on them to execute the corresponding tasks, such as command input, information query, or voice chat. The third microphone may be a microphone different from the first microphone and the second microphone, or it may be the first microphone or the second microphone.
Optionally, in order to avoid missing the voice command of the user, the electronic device caches the voice signal collected on the microphone in the process of executing S14-S15, and performs semantic analysis on the cached voice after waking up the voice interaction function in step S16, so as to execute the corresponding task.
Optionally, after the electronic device wakes up the voice interaction function, it may output prompt information for prompting the user that the voice interaction function has been woken up. Output forms of the prompt information include, but are not limited to: a voice prompt, or a corresponding interface element displayed on the display screen.
Optionally, after the electronic device wakes up the voice interaction function, it may directly perform semantic analysis on the newly collected voice or the cached voice to execute the corresponding task, or it may first perform voiceprint verification on the newly collected or cached voice and, only after the voiceprint verification succeeds, analyze the semantics of the voice and execute the corresponding task. It can be appreciated that if the electronic device has previously performed face authentication on the collected user image and the authentication passed, voiceprint authentication is not performed again.
The electronic device to which the method relates is next described based on the wake-up method of the voice interaction function described above.
The electronic device may be a device running a particular operating system or another operating system, such as a mobile phone, tablet computer, desktop computer, laptop computer, handheld computer, notebook computer, ultra-mobile personal computer (UMPC), netbook, cellular telephone, personal digital assistant (PDA), augmented reality (AR) device, virtual reality (VR) device, artificial intelligence (AI) device, wearable device, vehicle-mounted device, smart home device, and/or smart city device, among others.
Fig. 7 shows a hardware architecture diagram of the electronic device 100.
The electronic device 100 may include: processor 110, external memory interface 120, internal memory 121, universal serial bus (universal serial bus, USB) interface 130, charge management module 140, power management module 141, battery 142, antenna 1, antenna 2, mobile communication module 150, wireless communication module 160, audio module 170, speaker 170A, receiver 170B, microphone 170C, headset interface 170D, sensor module 180, keys 190, camera 193, display 194, etc. The sensor module 180 may include a pressure sensor 180A, a touch sensor 180B, an acceleration sensor 180C, and the like.
It should be understood that the illustrated structure of the embodiment of the present application does not constitute a specific limitation on the electronic device 100. In other embodiments of the application, electronic device 100 may include more or fewer components than shown, or certain components may be combined, or certain components may be split, or different arrangements of components. The illustrated components may be implemented in hardware, software, or a combination of software and hardware.
The processor 110 may include one or more processing units, such as: the processor 110 may include an application processor (application processor, AP), a modem processor, a graphics processor (graphics processing unit, GPU), an image signal processor (image signal processor, ISP), a controller, a memory, a video codec, a digital signal processor (digital signal processor, DSP), a baseband processor, and/or a neural network processor (neural-network processing unit, NPU), etc. Wherein the different processing units may be separate devices or may be integrated in one or more processors.
The controller may be a neural hub and a command center of the electronic device 100, among others. The controller can generate operation control signals according to the instruction operation codes and the time sequence signals to finish the control of instruction fetching and instruction execution.
A memory may also be provided in the processor 110 for storing instructions and data. In some embodiments, the memory in the processor 110 is a cache memory. The memory may hold instructions or data that the processor 110 has just used or recycled. If the processor 110 needs to reuse the instruction or data, it can be called directly from the memory. Repeated accesses are avoided and the latency of the processor 110 is reduced, thereby improving the efficiency of the system.
In the embodiment of the present application, the processor 110 is configured to control the corresponding software and hardware modules to execute the method flow described in fig. 2. In addition, after the voice interaction function is woken up, the processor 110 may control the display screen or the audio module to output corresponding prompt information for prompting the user that the voice interaction function has been started. The processor may also perform semantic analysis on the collected voice signals in order to execute the corresponding tasks.
In some embodiments, the processor 110 may include one or more interfaces. The interfaces may include an integrated circuit (inter-integrated circuit, I2C) interface, an integrated circuit built-in audio (inter-integrated circuit sound, I2S) interface, a pulse code modulation (pulse code modulation, PCM) interface, a universal asynchronous receiver transmitter (universal asynchronous receiver/transmitter, UART) interface, a mobile industry processor interface (mobile industry processor interface, MIPI), a general-purpose input/output (GPIO) interface, a subscriber identity module (subscriber identity module, SIM) interface, and/or a universal serial bus (universal serial bus, USB) interface, among others.
The USB interface 130 is an interface conforming to the USB standard specification, and may specifically be a Mini USB interface, a Micro USB interface, a USB Type C interface, or the like. The USB interface 130 may be used to connect a charger to charge the electronic device 100, and may also be used to transfer data between the electronic device 100 and a peripheral device. And can also be used for connecting with a headset, and playing audio through the headset. The interface may also be used to connect other electronic devices, such as AR devices, etc.
It should be understood that the interfacing relationship between the modules illustrated in the embodiments of the present application is only illustrative, and is not meant to limit the structure of the electronic device 100. In other embodiments of the present application, the electronic device 100 may also employ different interfacing manners in the above embodiments, or a combination of multiple interfacing manners.
The charge management module 140 is configured to receive a charge input from a charger. The charger can be a wireless charger or a wired charger. In some wired charging embodiments, the charge management module 140 may receive a charging input of a wired charger through the USB interface 130. In some wireless charging embodiments, the charge management module 140 may receive wireless charging input through a wireless charging coil of the electronic device 100. The charging management module 140 may also supply power to the electronic device through the power management module 141 while charging the battery 142.
The power management module 141 is used for connecting the battery 142, and the charge management module 140 and the processor 110. The power management module 141 receives input from the battery 142 and/or the charge management module 140 and provides power to the processor 110, the internal memory 121, the external memory, the display 194, the camera 193, the wireless communication module 160, and the like. The power management module 141 may also be configured to monitor battery capacity, battery cycle number, battery health (leakage, impedance) and other parameters. In other embodiments, the power management module 141 may also be provided in the processor 110. In other embodiments, the power management module 141 and the charge management module 140 may be disposed in the same device.
The wireless communication function of the electronic device 100 may be implemented by the antenna 1, the antenna 2, the mobile communication module 150, the wireless communication module 160, a modem processor, a baseband processor, and the like.
The antennas 1 and 2 are used for transmitting and receiving electromagnetic wave signals. Each antenna in the electronic device 100 may be used to cover a single or multiple communication bands. Different antennas may also be multiplexed to improve the utilization of the antennas. For example: the antenna 1 may be multiplexed into a diversity antenna of a wireless local area network. In other embodiments, the antenna may be used in conjunction with a tuning switch.
The mobile communication module 150 may provide a solution for wireless communication including 2G/3G/4G/5G, etc., applied to the electronic device 100. The mobile communication module 150 may include at least one filter, switch, power amplifier, low noise amplifier (low noise amplifier, LNA), etc. The mobile communication module 150 may receive electromagnetic waves from the antenna 1, perform processes such as filtering, amplifying, and the like on the received electromagnetic waves, and transmit the processed electromagnetic waves to the modem processor for demodulation. The mobile communication module 150 can amplify the signal modulated by the modem processor, and convert the signal into electromagnetic waves through the antenna 1 to radiate. In some embodiments, at least some of the functional modules of the mobile communication module 150 may be disposed in the processor 110. In some embodiments, at least some of the functional modules of the mobile communication module 150 may be provided in the same device as at least some of the modules of the processor 110.
The modem processor may include a modulator and a demodulator. The modulator is used to modulate the low-frequency baseband signal to be transmitted into a medium- or high-frequency signal. The demodulator is used to demodulate the received electromagnetic wave signal into a low-frequency baseband signal. The demodulator then transmits the demodulated low-frequency baseband signal to the baseband processor for processing. After being processed by the baseband processor, the low-frequency baseband signal is passed to the application processor. The application processor outputs sound signals through an audio device (not limited to the speaker 170A, the receiver 170B, etc.), or displays images or video through the display screen 194. In some embodiments, the modem processor may be a stand-alone device. In other embodiments, the modem processor may be independent of the processor 110 and provided in the same device as the mobile communication module 150 or another functional module.
The wireless communication module 160 may provide solutions for wireless communication including wireless local area network (wireless local area networks, WLAN) (e.g., wireless fidelity (wireless fidelity, wi-Fi) network), bluetooth (BT), global navigation satellite system (global navigation satellite system, GNSS), frequency modulation (frequency modulation, FM), near field wireless communication technology (near field communication, NFC), infrared technology (IR), etc., as applied to the electronic device 100. The wireless communication module 160 may be one or more devices that integrate at least one communication processing module. The wireless communication module 160 receives electromagnetic waves via the antenna 2, demodulates and filters the electromagnetic wave signals, and transmits the processed signals to the processor 110. The wireless communication module 160 may also receive a signal to be transmitted from the processor 110, frequency modulate it, amplify it, and convert it to electromagnetic waves for radiation via the antenna 2.
In some embodiments, antenna 1 and mobile communication module 150 of electronic device 100 are coupled, and antenna 2 and wireless communication module 160 are coupled, such that electronic device 100 may communicate with a network and other devices through wireless communication techniques. The wireless communication techniques may include the Global System for Mobile communications (global system for mobile communications, GSM), general packet radio service (general packet radio service, GPRS), code division multiple access (code division multiple access, CDMA), wideband code division multiple access (wideband code division multiple access, WCDMA), time division code division multiple access (time-division code division multiple access, TD-SCDMA), long term evolution (long term evolution, LTE), BT, GNSS, WLAN, NFC, FM, and/or IR techniques, among others. The GNSS may include a global satellite positioning system (global positioning system, GPS), a global navigation satellite system (global navigation satellite system, GLONASS), a beidou satellite navigation system (beidou navigation satellite system, BDS), a quasi zenith satellite system (quasi-zenith satellite system, QZSS) and/or a satellite based augmentation system (satellite based augmentation systems, SBAS).
The electronic device 100 implements display functions through a GPU, a display screen 194, an application processor, and the like. The GPU is a microprocessor for image processing, and is connected to the display 194 and the application processor. The GPU is used to perform mathematical and geometric calculations for graphics rendering. Processor 110 may include one or more GPUs that execute program instructions to generate or change display information.
The display screen 194 is used to display images, videos, and the like. The display 194 includes a display panel. The display panel may employ a liquid crystal display (liquid crystal display, LCD). The display panel may also be manufactured using an organic light-emitting diode (OLED), an active-matrix organic light-emitting diode (AMOLED), a flexible light-emitting diode (FLED), a Mini-LED, a Micro-LED, a Micro-OLED, a quantum dot light-emitting diode (quantum dot light emitting diodes, QLED), or the like. In some embodiments, the electronic device 100 may include 1 or N display screens 194, N being a positive integer greater than 1.
In the embodiment of the present application, the electronic device may display the user interfaces shown in fig. 3A-3D. In addition, in the process in which the electronic device wakes up the voice interaction function and performs voice interaction with the user, the electronic device may also display a voice interaction interface. The voice interaction interface may display information queried in response to the user's voice instruction, may output the text obtained by recognizing the user's voice, or may output the text corresponding to the voice with which the voice assistant replies to the user, and so on.
The electronic device 100 may implement photographing functions through an ISP, a camera 193, a video codec, a GPU, a display screen 194, an application processor, and the like.
The ISP is used to process data fed back by the camera 193. For example, when a photograph is taken, the shutter is opened and light is transmitted through the lens to the camera's photosensitive element; the photosensitive element converts the optical signal into an electrical signal and transmits it to the ISP for processing, and the ISP converts it into an image visible to the naked eye. The ISP can also optimize the noise, brightness, and skin tone of the image, as well as parameters such as the exposure and color temperature of the shooting scene. In some embodiments, the ISP may be disposed in the camera 193.
The camera 193 is used to capture still images or video. The object generates an optical image through the lens and projects the optical image onto the photosensitive element. The photosensitive element may be a charge coupled device (charge coupled device, CCD) or a Complementary Metal Oxide Semiconductor (CMOS) phototransistor. The photosensitive element converts the optical signal into an electrical signal, which is then transferred to the ISP to be converted into a digital image signal. The ISP outputs the digital image signal to the DSP for processing. The DSP converts the digital image signal into an image signal in a standard RGB, YUV, or the like format. In some embodiments, electronic device 100 may include 1 or N cameras 193, N being a positive integer greater than 1.
In the embodiment of the present application, the camera 193 includes a front camera disposed on the screen side of the electronic device. The front camera can operate continuously in a low-power mode to collect images of the user, providing data support for the electronic device to execute S12-S13 described in fig. 2.
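For illustration only, the following sketch shows one way the distance between the user's face and the device could be estimated from a front-camera frame using the similar-triangle relation between the imaging pupil distance, the camera focal length, and the actual pupil distance (see claim 12 below). The function names, the assumed average pupil distance, and the preset distance range are hypothetical and are not values taken from the embodiments.

```python
# Illustrative sketch only; the constants below are assumptions, not values
# disclosed in the embodiments.

AVERAGE_PUPIL_DISTANCE_MM = 63.0  # assumed typical adult inter-pupil distance


def estimate_face_distance_mm(imaging_pupil_distance_px: float,
                              focal_length_px: float) -> float:
    """Similar-triangle estimate:
    distance ~= focal_length * actual_pupil_distance / imaging_pupil_distance."""
    if imaging_pupil_distance_px <= 0:
        raise ValueError("imaging pupil distance must be positive")
    return focal_length_px * AVERAGE_PUPIL_DISTANCE_MM / imaging_pupil_distance_px


def user_state_meets_condition(imaging_pupil_distance_px: float,
                               focal_length_px: float,
                               gazing_at_device: bool,
                               preset_range_mm=(200.0, 500.0)) -> bool:
    """True when the estimated distance falls within the preset range and the
    gaze direction indicates the user is looking at the device."""
    distance = estimate_face_distance_mm(imaging_pupil_distance_px, focal_length_px)
    return preset_range_mm[0] <= distance <= preset_range_mm[1] and gazing_at_device
```

In practice the focal length expressed in pixels and the preset distance range would be device-specific calibration values rather than the fixed numbers used here.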
The digital signal processor is used to process digital signals, and can process other digital signals in addition to digital image signals. For example, when the electronic device 100 selects a frequency bin, the digital signal processor is used to perform a Fourier transform on the frequency bin energy, and the like.
Video codecs are used to compress or decompress digital video. The electronic device 100 may support one or more video codecs. In this way, the electronic device 100 may play or record video in a variety of encoding formats, such as moving picture experts group (moving picture experts group, MPEG)-1, MPEG-2, MPEG-3, MPEG-4, and so on.
The NPU is a neural-network (NN) computing processor, and can rapidly process input information by referencing a biological neural network structure, for example, referencing a transmission mode between human brain neurons, and can also continuously perform self-learning. Applications such as intelligent awareness of the electronic device 100 may be implemented through the NPU, for example: image recognition, face recognition, speech recognition, text understanding, etc.
The internal memory 121 may include one or more random access memories (random access memory, RAM) and one or more non-volatile memories (NVM).
The random access memory may include a static random-access memory (SRAM), a dynamic random-access memory (dynamic random access memory, DRAM), a synchronous dynamic random-access memory (synchronous dynamic random access memory, SDRAM), a double data rate synchronous dynamic random-access memory (double data rate synchronous dynamic random access memory, DDR SDRAM, such as fifth generation DDR SDRAM is commonly referred to as DDR5 SDRAM), etc.;
the nonvolatile memory may include a magnetic disk storage device and a flash memory.
Divided according to operating principle, the flash memory may include NOR FLASH, NAND FLASH, 3D NAND FLASH, etc.; divided according to the potential level of the storage cell, it may include single-level memory cells (SLC), multi-level memory cells (MLC), triple-level memory cells (TLC), quad-level memory cells (QLC), etc.; and divided according to storage specifications, it may include universal flash storage (universal flash storage, UFS), embedded multimedia memory cards (embedded multi media card, eMMC), etc.
The random access memory may be read from and written to directly by the processor 110. It may be used to store executable programs (e.g., machine instructions) of an operating system or other programs that are running, and may also be used to store data of users and applications, and the like.
The nonvolatile memory may store executable programs, store data of users and applications, and the like, and may be loaded into the random access memory in advance for the processor 110 to directly read and write.
The external memory interface 120 may be used to connect external non-volatile memory to enable expansion of the memory capabilities of the electronic device 100. The external nonvolatile memory communicates with the processor 110 through the external memory interface 120 to implement a data storage function. For example, files such as music and video are stored in an external nonvolatile memory.
The electronic device 100 may implement audio functions through an audio module 170, a speaker 170A, a receiver 170B, a microphone 170C, an earphone interface 170D, an application processor, and the like. Such as music playing, recording, etc.
The audio module 170 is used to convert digital audio information into an analog audio signal output and also to convert an analog audio input into a digital audio signal. The audio module 170 may also be used to encode and decode audio signals. In some embodiments, the audio module 170 may be disposed in the processor 110, or a portion of the functional modules of the audio module 170 may be disposed in the processor 110.
The speaker 170A, also referred to as a "horn," is used to convert audio electrical signals into sound signals. The electronic device 100 may listen to music, or to hands-free conversations, through the speaker 170A.
A receiver 170B, also referred to as a "earpiece", is used to convert the audio electrical signal into a sound signal. When electronic device 100 is answering a telephone call or voice message, voice may be received by placing receiver 170B in close proximity to the human ear.
The microphone 170C, also referred to as a "mic" or "sound transducer", is used to convert sound signals into electrical signals. When making a call or sending voice information, the user can speak with the mouth close to the microphone 170C to input a sound signal into the microphone 170C. The electronic device 100 may be provided with at least one microphone 170C. In other embodiments, the electronic device 100 may be provided with two microphones 170C, which may implement a noise reduction function in addition to collecting sound signals. In other embodiments, the electronic device 100 may also be provided with three, four, or more microphones 170C to implement sound signal collection, noise reduction, sound source identification, directional recording functions, and so on.
In an embodiment of the present application, the electronic device includes at least two microphones 170C, for example, a top microphone and a bottom microphone. The two microphones are used to collect voice signals and provide data support for the electronic device 100 to perform the aforementioned S14-S15. This further prevents voice casually uttered by the user from waking up the voice interaction function of the electronic device by mistake.
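As a rough illustration of how the voice signals on the top and bottom microphones might be compared, the following sketch measures the level of each short sub-segment on both microphones and counts how many sub-segments differ by more than a first value, in the spirit of claim 9 below. The frame length, the thresholds, and the simple RMS level measure are assumptions made for illustration and are not the detection algorithm of the embodiments.

```python
# Illustrative sketch only; frame length and thresholds are assumed values.
import numpy as np


def near_mouth_speech_detected(top_signal: np.ndarray,
                               bottom_signal: np.ndarray,
                               frame_len: int = 1600,      # e.g. 100 ms at 16 kHz
                               first_value_db: float = 6.0,
                               second_value: int = 3) -> bool:
    """Count sub-segments whose level difference exceeds first_value_db and
    compare the count against second_value."""
    n_frames = min(len(top_signal), len(bottom_signal)) // frame_len
    count = 0
    for i in range(n_frames):
        seg = slice(i * frame_len, (i + 1) * frame_len)
        top = top_signal[seg].astype(float)
        bottom = bottom_signal[seg].astype(float)
        rms_top = np.sqrt(np.mean(top ** 2)) + 1e-12        # avoid division by zero
        rms_bottom = np.sqrt(np.mean(bottom ** 2)) + 1e-12
        diff_db = 20.0 * np.log10(rms_bottom / rms_top)     # level difference per sub-segment
        if abs(diff_db) > first_value_db:
            count += 1
    return count > second_value
```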
The earphone interface 170D is used to connect a wired earphone. The earphone interface 170D may be the USB interface 130, a 3.5 mm open mobile terminal platform (open mobile terminal platform, OMTP) standard interface, or a cellular telecommunications industry association of the USA (cellular telecommunications industry association of the USA, CTIA) standard interface.
The pressure sensor 180A is used to sense a pressure signal and may convert the pressure signal into an electrical signal. In some embodiments, the pressure sensor 180A may be disposed on the display screen 194. There are many types of pressure sensor 180A, such as resistive pressure sensors, inductive pressure sensors, and capacitive pressure sensors. A capacitive pressure sensor may comprise at least two parallel plates of conductive material. When a force is applied to the pressure sensor 180A, the capacitance between the electrodes changes, and the electronic device 100 determines the strength of the pressure from the change in capacitance. When a touch operation acts on the display screen 194, the electronic device 100 detects the intensity of the touch operation through the pressure sensor 180A. The electronic device 100 may also calculate the location of the touch based on the detection signal of the pressure sensor 180A. In some embodiments, touch operations that act on the same touch location but with different touch operation intensities may correspond to different operation instructions. For example, when a touch operation whose intensity is less than a first pressure threshold acts on the SMS application icon, an instruction to view the SMS message is executed; when a touch operation whose intensity is greater than or equal to the first pressure threshold acts on the SMS application icon, an instruction to create a new SMS message is executed.
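The pressure-dependent dispatch described above can be expressed as a minimal sketch; the threshold value and the action names below are hypothetical.

```python
# Illustrative sketch only; the threshold and action names are assumptions.
FIRST_PRESSURE_THRESHOLD = 0.5  # normalized touch force


def handle_sms_icon_touch(force: float) -> str:
    """Map the measured touch intensity to a different instruction."""
    if force < FIRST_PRESSURE_THRESHOLD:
        return "view_messages"        # light press: view the SMS message
    return "compose_new_message"      # firm press: create a new SMS message
```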
The touch sensor 180B is also referred to as a "touch panel". The touch sensor 180B may be disposed on the display screen 194, and the touch sensor 180B and the display screen 194 form a touch screen, also referred to as a "touchscreen". The touch sensor 180B is used to detect a touch operation acting on or near it. The touch sensor may pass the detected touch operation to the application processor to determine the type of touch event. Visual output related to the touch operation may be provided through the display screen 194. In other embodiments, the touch sensor 180B may also be disposed on the surface of the electronic device 100 at a location different from the display screen 194.
The acceleration sensor 180C may detect the magnitude of acceleration of the electronic device 100 in various directions (typically the three X, Y, and Z axes). The magnitude and direction of gravity may be detected when the electronic device 100 is stationary. The acceleration sensor may also be used to recognize the posture of the electronic device, and is applied to landscape/portrait switching, pedometers, and other applications.
In the embodiment of the present application, the electronic device may further determine its motion posture from the acceleration on the X, Y, and Z axes detected by the acceleration sensor 180C, for example, whether the user raises or picks up the electronic device.
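A minimal sketch of a lift/raise detector driven by the three-axis acceleration samples is given below; the window, the threshold, and the use of the Z-axis swing are illustrative assumptions rather than the embodiment's actual posture-recognition logic.

```python
# Illustrative sketch only; threshold and axis choice are assumptions.
import numpy as np


def lift_detected(samples: np.ndarray, delta_threshold: float = 3.0) -> bool:
    """samples: array of shape (n, 3) holding (x, y, z) acceleration in m/s^2
    over a short time window. A lift or pick-up typically shows a large swing
    of the Z component as the device rotates toward the user's face."""
    if samples.ndim != 2 or samples.shape[0] < 2:
        return False
    z = samples[:, 2]
    return float(np.max(z) - np.min(z)) > delta_threshold
```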
The keys 190 include a power key, a volume key, and the like. The keys 190 may be mechanical keys or touch keys. The electronic device 100 may receive key input and generate key signal input related to user settings and function control of the electronic device 100.
In an embodiment of the present application, the electronic device may wake up the voice interaction function of the electronic device by an operation on the key 190.
The software system of the electronic device 100 may employ a layered architecture, an event driven architecture, a microkernel architecture, a microservice architecture, or a cloud architecture. In the embodiment of the application, taking an Android system with a layered architecture as an example, a software structure of the electronic device 100 is illustrated.
Fig. 8 is a schematic software architecture of the electronic device 100 according to an embodiment of the present application.
The layered architecture divides the software into several layers, each with a clear role and division of labor. The layers communicate with each other through software interfaces. In some embodiments, the Android system is divided into four layers, which are, from top to bottom, the application layer, the application framework layer, the Android runtime (Android runtime) and system libraries, and the kernel layer.
The application layer may include a series of application packages.
As shown in fig. 8, the application package may include applications such as voice assistant, camera, gallery, calendar, phone, map, navigation, Bluetooth, video, SMS, etc. The voice assistant may be a sub-function integrated into the settings application of the electronic device, or a separate application; this is not limited in the embodiments of the present application.
In the embodiment of the present application, after the voice assistant is woken up, the user can perform voice interaction with the electronic device by inputting voice to the voice assistant, so as to control the electronic device to execute corresponding tasks, such as inputting instructions, viewing information, or voice chat. Waking up the voice assistant is equivalent to waking up the voice interaction function; for the wake-up method, reference may be made to the description of the method flow shown in fig. 2, which is not repeated herein.
The application framework layer provides an application programming interface (application programming interface, API) and programming framework for application programs of the application layer. The application framework layer includes a number of predefined functions.
As shown in fig. 8, the application framework layer may include a media manager, a window manager, a content provider, a view system, a telephony manager, a notification manager, and the like.
The media manager is used to manage media services, including services that occupy the microphone, such as calls, recording, video, and conferencing. In one implementation, after the electronic device turns on the "close-range wake-up" function provided in the voice assistant, the voice assistant may control the front camera through the camera driver to collect images of the user, so as to detect whether the user state meets a preset condition, and may control the top microphone and the bottom microphone through the microphone driver to collect voice, so as to detect whether the distribution difference of the voice signals on the two microphones meets a preset condition. When it is determined that the user state meets the preset condition and the distribution difference of the voice signals on the two microphones meets the preset condition, the voice assistant is confirmed to be woken up, that is, the subsequent voice interaction flow is started. Optionally, the voice assistant further queries the media manager to detect whether any other service occupies the microphone, and determines to wake up the voice assistant only when no other service occupies the microphone, the user state meets the preset condition, and the distribution difference of the voice signals on the two microphones meets the preset condition.
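The overall close-range wake-up decision implied by this flow can be summarized by the following sketch. The three inputs stand in for the camera-based user-state check, the two-microphone distribution check, and the media manager's microphone-occupancy query; the names are hypothetical and do not correspond to real framework APIs.

```python
# Illustrative sketch only; the predicate names are hypothetical stand-ins.
def should_wake_voice_assistant(user_state_ok: bool,
                                mic_difference_ok: bool,
                                microphone_occupied: bool) -> bool:
    """Wake only when no other service holds the microphone, the user state
    meets the preset condition, and the distribution difference of the voice
    signals on the two microphones meets the preset condition."""
    return (not microphone_occupied) and user_state_ok and mic_difference_ok
```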
The window manager is used for managing window programs. The window manager can acquire the size of the display screen, judge whether a status bar exists, lock the screen, intercept the screen and the like.
The content provider is used to store and retrieve data and make such data accessible to applications. The data may include video, images, audio, calls made and received, browsing history and bookmarks, phonebooks, etc.
The view system includes visual controls, such as controls to display text, controls to display pictures, and the like. The view system may be used to build applications. The display interface may be composed of one or more views. For example, a display interface including a text message notification icon may include a view displaying text and a view displaying a picture.
The telephony manager is used to provide the communication functions of the electronic device 100. Such as the management of call status (including on, hung-up, etc.).
The notification manager enables an application to display notification information in the status bar, and can be used to convey notification-type messages that disappear automatically after a short stay without user interaction. For example, the notification manager is used to notify that a download is complete, to give message reminders, and so on. The notification manager may also present notifications in the form of a chart or scroll-bar text in the system top status bar, such as notifications of applications running in the background, or notifications in the form of a dialog window on the screen. For example, text information is prompted in the status bar, a prompt tone is emitted, the electronic device vibrates, an indicator light blinks, and so on.
The Android runtime includes core libraries and a virtual machine. The Android runtime is responsible for scheduling and management of the Android system.
The core library consists of two parts: one part is the functions that the Java language needs to call, and the other part is the core library of Android.
The application layer and the application framework layer run in a virtual machine. The virtual machine executes java files of the application program layer and the application program framework layer as binary files. The virtual machine is used for executing the functions of object life cycle management, stack management, thread management, security and exception management, garbage collection and the like.
The system library may include a plurality of functional modules. For example: surface manager (surface manager), media Libraries (Media Libraries), three-dimensional graphics processing Libraries (e.g., openGL ES), two-dimensional (2D) graphics engines (e.g., SGL), etc.
The surface manager is used to manage the display subsystem and provides a fusion of 2D and 3D layers for multiple applications.
The media libraries support playback and recording in a variety of commonly used audio and video formats, as well as still image files, etc. The media libraries may support a variety of audio and video encoding formats, such as MPEG-4, H.264, MP3, AAC, AMR, JPG, and PNG.
The three-dimensional graphic processing library is used for realizing three-dimensional graphic drawing, image rendering, synthesis, layer processing and the like.
The 2D graphics engine is a drawing engine for 2D drawing.
The kernel layer is a layer between hardware and software. The kernel layer includes at least a microphone driver, a camera driver, a display driver, a sensor driver, and the like.
The workflow of the electronic device 100 software and hardware is illustrated below in connection with capturing a photo scene.
When the touch sensor 180B receives a touch operation, a corresponding hardware interrupt is issued to the kernel layer. The kernel layer processes the touch operation into an original input event (including information such as the touch coordinates and the timestamp of the touch operation). The original input event is stored at the kernel layer. The application framework layer acquires the original input event from the kernel layer and identifies the control corresponding to the input event. Taking the example in which the touch operation is a tap and the control corresponding to the tap is the control of the camera application icon, the camera application calls an interface of the application framework layer to start the camera application, then starts the camera driver by calling the kernel layer, and captures a still image or video through the camera 193.
It should be understood that each step in the above method embodiments provided by the present application may be implemented by an integrated logic circuit of hardware in a processor or an instruction in software form. The steps of the method disclosed in connection with the embodiments of the present application may be embodied directly in a hardware processor for execution, or in a combination of hardware and software modules in the processor for execution.
The present application also provides an electronic device, which may include: memory and a processor. Wherein the memory is operable to store a computer program; the processor may be operative to invoke a computer program in said memory to cause the electronic device to perform the method of any of the embodiments described above.
The application also provides a chip system comprising at least one processor for implementing the functions involved in the method performed by the electronic device in any of the above embodiments.
In one possible design, the system on a chip further includes a memory to hold program instructions and data, the memory being located either within the processor or external to the processor.
The chip system may be formed of a chip or may include a chip and other discrete devices.
Alternatively, the processor in the system-on-chip may be one or more. The processor may be implemented in hardware or in software. When implemented in hardware, the processor may be a logic circuit, an integrated circuit, or the like. When implemented in software, the processor may be a general purpose processor, implemented by reading software code stored in a memory.
Alternatively, the memory in the system-on-chip may be one or more memories. The memory may be integrated with the processor or disposed separately from the processor, which is not limited in the embodiments of the present application. For example, the memory may be a non-transitory memory such as a ROM, which may be integrated on the same chip as the processor or disposed separately on different chips; the type of the memory and the manner in which the memory and the processor are arranged are not particularly limited in the embodiments of the present application.
Illustratively, the system-on-chip may be a field programmable gate array (field programmable gate array, FPGA), an application-specific integrated circuit (application specific integrated circuit, ASIC), a system on chip (system on chip, SoC), a central processing unit (central processor unit, CPU), a network processor (network processor, NP), a digital signal processor (digital signal processor, DSP), a microcontroller (micro controller unit, MCU), a programmable logic device (programmable logic device, PLD), or another integrated chip.
The present application also provides a computer program product comprising: a computer program (which may also be referred to as code, or instructions), which when executed, causes a computer to perform the method performed by the electronic device in any of the embodiments described above.
The present application also provides a computer-readable storage medium storing a computer program (which may also be referred to as code, or instructions). The computer program, when executed, causes a computer to perform the method performed by the electronic device in any of the embodiments described above.
The embodiments of the present application may be arbitrarily combined to achieve different technical effects.
In the above embodiments, the implementation may be in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, it may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, the processes or functions according to the present application are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another computer-readable storage medium; for example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wire (e.g., coaxial cable, optical fiber, digital subscriber line) or wirelessly (e.g., infrared, radio, microwave, etc.). The computer-readable storage medium may be any available medium that can be accessed by a computer, or a data storage device such as a server or data center that integrates one or more available media. The available medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., solid state disk), etc.
Those of ordinary skill in the art will appreciate that all or part of the processes in the above method embodiments may be implemented by a computer program instructing related hardware. The program may be stored in a computer-readable storage medium, and when executed, may include the processes of the above method embodiments. The aforementioned storage medium includes: a ROM, a random access memory (RAM), a magnetic disk, an optical disc, or the like.
In summary, the foregoing description is only exemplary embodiments of the present invention and is not intended to limit the scope of the present invention. Any modification, equivalent replacement, improvement, etc. made according to the disclosure of the present invention should be included in the protection scope of the present invention.

Claims (15)

1. A method for waking up a voice interactive function, the method comprising:
the electronic equipment acquires a first image containing a face of a user through a camera;
determining the distance between the face of the user and the electronic equipment and the gazing direction of the user according to the first image; the distance is within a preset range, and the gazing direction of the user indicates that the user gazes at the electronic equipment;
The electronic equipment collects a first voice signal through a first microphone, and collects a second voice signal through a second microphone, wherein the strengths of the first voice signal and the second voice signal are different;
the electronic device wakes up the voice interaction function.
2. The method of claim 1, wherein prior to the electronic device capturing the first image containing the user's face with the camera, the method further comprises:
the electronic device detects that the electronic device is lifted or picked up.
3. The method of claim 1 or 2, wherein before the electronic device captures a first image containing a face of a user via a camera, the method further comprises:
the electronic device detects that both the first microphone and the second microphone are unoccupied.
4. A method according to any of claims 1-3, wherein after the electronic device acquires the first image containing the face of the user via the camera, the method further comprises:
and determining that the mouth shape of the user changes according to the first image.
5. The method according to any one of claims 1-4, wherein the electronic device waking up a voice interaction function specifically comprises:
The electronic equipment collects voice instructions through the third microphone and executes tasks corresponding to the voice instructions.
6. The method of any of claims 1-5, wherein after the electronic device wakes up a voice interaction function, the method further comprises:
the electronic equipment extracts voice instructions from the first voice signals and/or the second voice signals and executes tasks corresponding to the voice instructions.
7. The method of any of claims 1-6, wherein after the electronic device wakes up a voice interaction function, the method further comprises:
the electronic equipment outputs prompt information, wherein the prompt information is used for prompting that a voice interaction function is started currently;
the prompt message comprises: speech and/or interface elements displayed on a display screen.
8. The method of any of claims 1-7, wherein the first microphone is a microphone on a top of the electronic device and the second microphone is a microphone on a bottom of the electronic device.
9. The method of any one of claims 1-8, wherein the sensitivity of the first microphone and the second microphone are different;
The first voice signal and the second voice signal having different intensities specifically includes: the intensities of the first voice signal and the second voice signal are different after the first voice signal or the second voice signal is compensated based on the sensitivity difference value;
or, the first voice signal and the second voice signal having different intensities specifically includes: the number of times that the intensity difference between corresponding sub-segment voice signals of the first voice signal and the second voice signal is greater than a first value is greater than a second value.
10. The method of any of claims 1-9, wherein after the electronic device captures a first image containing a face of a user via a camera, the method further comprises: and carrying out identity verification according to the face of the user in the first image.
11. The method of any of claims 1-10, wherein after the electronic device collects a first voice signal through a first microphone and collects a second voice signal through a second microphone, the method further comprises:
and extracting voiceprint information from the first voice signal or the second voice signal, and performing identity verification based on the voiceprint information.
12. The method according to any of claims 1-11, wherein the electronic device determines the distance between the user's face and the electronic device from the first image, in particular by:
The electronic device analyzes an imaging pupil distance from the first image;
and the electronic equipment calculates the distance between the face of the user and the electronic equipment by using a similar triangle principle based on the imaging pupil distance, the focal length of the camera and the actual pupil distance.
13. The method according to any of claims 1-12, wherein the electronic device determines the gaze direction of the user from the first image, in particular by:
the electronic device analyzes the position of the pupil in the eyeball from the first image to determine the gaze direction of the user.
14. An electronic device, comprising: one or more processors, one or more memories, at least two microphones, a camera, and a display screen; the one or more memories are coupled to the one or more processors, and the one or more memories are configured to store computer program code comprising computer instructions that, when executed by the one or more processors, cause the electronic device to perform the method of any of claims 1-13.
15. A computer readable storage medium comprising computer instructions which, when run on an electronic device, cause the electronic device to perform the method of any of claims 1-13.
CN202310310747.6A 2023-03-21 2023-03-21 Awakening method of voice interaction function and electronic equipment Pending CN117119102A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310310747.6A CN117119102A (en) 2023-03-21 2023-03-21 Awakening method of voice interaction function and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310310747.6A CN117119102A (en) 2023-03-21 2023-03-21 Awakening method of voice interaction function and electronic equipment

Publications (1)

Publication Number Publication Date
CN117119102A true CN117119102A (en) 2023-11-24

Family

ID=88793538

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310310747.6A Pending CN117119102A (en) 2023-03-21 2023-03-21 Awakening method of voice interaction function and electronic equipment

Country Status (1)

Country Link
CN (1) CN117119102A (en)

Similar Documents

Publication Publication Date Title
US11450322B2 (en) Speech control method and electronic device
CN113704014B (en) Log acquisition system, method, electronic device and storage medium
CN110910872B (en) Voice interaction method and device
CN110543287B (en) Screen display method and electronic equipment
CN111819533B (en) Method for triggering electronic equipment to execute function and electronic equipment
CN112671976B (en) Control method and device of electronic equipment, electronic equipment and storage medium
CN112740152B (en) Handwriting pen detection method, handwriting pen detection system and related device
CN114650363B (en) Image display method and electronic equipment
WO2022052897A1 (en) Method and device for adjusting memory configuration parameter
CN115589051B (en) Charging method and terminal equipment
CN114115512A (en) Information display method, terminal device, and computer-readable storage medium
CN113641271A (en) Application window management method, terminal device and computer readable storage medium
CN113380240B (en) Voice interaction method and electronic equipment
CN115206308A (en) Man-machine interaction method and electronic equipment
CN117119102A (en) Awakening method of voice interaction function and electronic equipment
WO2020024087A1 (en) Working method of touch control apparatus, and terminal
CN117271170B (en) Activity event processing method and related equipment
CN114942741B (en) Data transmission method and electronic equipment
CN116048831B (en) Target signal processing method and electronic equipment
CN113050864B (en) Screen capturing method and related equipment
CN117273687B (en) Card punching recommendation method and electronic equipment
WO2023246783A1 (en) Method for adjusting device power consumption and electronic device
CN117271170A (en) Activity event processing method and related equipment
CN117177216A (en) Information interaction method and device and electronic equipment
CN117708009A (en) Signal transmission method and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination