CN111182385B - Voice interaction control method and intelligent sound box - Google Patents


Info

Publication number
CN111182385B
CN111182385B (application CN201911136301.6A)
Authority
CN
China
Prior art keywords
sound box
intelligent sound
microphone
camera device
image
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201911136301.6A
Other languages
Chinese (zh)
Other versions
CN111182385A (en)
Inventor
张卓 (Zhang Zhuo)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangdong Genius Technology Co Ltd
Original Assignee
Guangdong Genius Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangdong Genius Technology Co Ltd filed Critical Guangdong Genius Technology Co Ltd
Priority to CN201911136301.6A priority Critical patent/CN111182385B/en
Publication of CN111182385A publication Critical patent/CN111182385A/en
Application granted granted Critical
Publication of CN111182385B publication Critical patent/CN111182385B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04R - LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; DEAF-AID SETS; PUBLIC ADDRESS SYSTEMS
    • H04R1/00 - Details of transducers, loudspeakers or microphones
    • H04R1/02 - Casings; Cabinets; Supports therefor; Mountings therein
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 - Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/70 - Protecting specific internal or peripheral components, in which the protection of a component leads to protection of the entire computer
    • G06F21/82 - Protecting input, output or interconnection devices
    • G06F21/84 - Protecting input, output or interconnection devices; output devices, e.g. displays or monitors
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 - Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/01 - Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F3/011 - Arrangements for interaction with the human body, e.g. for user immersion in virtual reality
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 - Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/16 - Sound input; Sound output
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/22 - Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/22 - Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L2015/223 - Execution procedure of a spoken command
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04R - LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; DEAF-AID SETS; PUBLIC ADDRESS SYSTEMS
    • H04R2201/00 - Details of transducers, loudspeakers or microphones covered by H04R1/00 but not provided for in any of its subgroups
    • H04R2201/02 - Details of casings, cabinets or mountings therein for transducers covered by H04R1/02 but not provided for in any of its subgroups

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Human Computer Interaction (AREA)
  • Computer Hardware Design (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Acoustics & Sound (AREA)
  • General Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Computational Linguistics (AREA)
  • Computer Security & Cryptography (AREA)
  • Software Systems (AREA)
  • User Interface Of Digital Computer (AREA)
  • Telephone Function (AREA)

Abstract

A voice interaction control method and an intelligent sound box are provided. The method comprises the following steps: identifying whether a human body exists in an interaction area by using a camera device and a pyroelectric sensor of the intelligent sound box, wherein the interaction area is an area whose distance from the intelligent sound box is smaller than a first preset distance threshold; and if so, turning on a microphone of the intelligent sound box so as to carry out voice interaction through the microphone. By implementing the embodiment of the invention, the microphone is turned on only when the camera device and the pyroelectric sensor indicate that an interaction demand exists, so the possibility that the microphone intercepts user conversation in a non-interaction scene can be reduced, the privacy of the intelligent sound box user is protected, and the intelligence of voice interaction is improved.

Description

Voice interaction control method and intelligent sound box
Technical Field
The invention relates to the technical field of man-machine interaction, in particular to a voice interaction control method and an intelligent sound box.
Background
At present, most intelligent sound boxes have a voice interaction function, and can respond to voice information input by a user and execute the operation indicated by the voice information. In practice, however, it is found that, in order to receive voice information input by a user at any time, the smart speaker generally keeps the microphone turned on for long periods in a low power consumption state. This may turn the smart speaker into a portal for eavesdropping on user information.
Disclosure of Invention
The embodiment of the invention discloses a voice interaction control method and an intelligent sound box, which can protect the user privacy of an intelligent sound box user and improve the intelligence of voice interaction.
The first aspect of the embodiments of the present invention discloses a voice interaction control method, which includes:
identifying whether a human body exists in the interaction area by using a camera device and a pyroelectric sensor of the intelligent sound box, wherein the interaction area is an area whose distance from the intelligent sound box is smaller than a first preset distance threshold;
if so, starting a microphone of the intelligent sound box so as to utilize the microphone to carry out voice interaction.
As an optional implementation manner, in the first aspect of the embodiment of the present invention, the identifying whether there is a human body in the interaction area by using the camera device and the pyroelectric sensor of the smart speaker includes:
controlling a camera device of the intelligent sound box to shoot a first image;
controlling a pyroelectric sensor of the intelligent sound box to detect whether a movable object exists in the interaction area;
if a human body is identified from the first image and a movable object is detected to exist in the interaction area, determining that the human body exists in the interaction area;
and the effective detection distance of the pyroelectric sensor is less than or equal to the first preset distance threshold.
As an optional implementation manner, in a first aspect of an embodiment of the present invention, the controlling a camera device of a smart speaker to capture a first image includes:
controlling the camera device to shoot a first image at a first preset posture;
and, after said turning on a microphone of said smart speaker, said method further comprising:
controlling the camera device to turn over from the first preset posture to a second preset posture;
controlling the camera device to shoot a second image at the second preset posture;
performing corresponding processing on the second image according to the voice information acquired by the microphone;
when the camera device is in the first preset posture, a camera lens of the camera device and a display screen of the intelligent sound box face to the same side; and when the camera device is in the second preset posture, the camera lens of the camera device faces the placing surface of the intelligent sound box.
As an optional implementation manner, in the first aspect of the embodiment of the present invention, before the controlling the image capturing apparatus to flip from the first preset posture to the second preset posture, the method further includes:
identifying an orientation of a face in the first image;
and if the face faces the display screen of the intelligent sound box, executing the step of controlling the camera device to turn from the first preset posture to the second preset posture.
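The face-orientation gate above can be sketched as follows. This is a minimal illustration, not the patent's implementation; the `camera` interface, the `flip_to_second_posture` method, and the orientation labels are all assumed names.

```python
def maybe_flip_camera(camera, face_orientation):
    """Flip from the first preset posture (lens facing the same side as the
    display screen) to the second preset posture (lens facing the placement
    surface) only if the face in the first image faces the display screen."""
    if face_orientation != "toward_display":
        return "first_posture"  # face not toward the screen: stay put
    camera.flip_to_second_posture()
    return "second_posture"
```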
As an optional implementation manner, in the first aspect of the embodiment of the present invention, after the turning on the microphone of the smart sound box, the method further includes:
when the microphone is in the on state and the display screen of the intelligent sound box is occupied, displaying, overlaid on the user interface currently displayed on the display screen, a prompt interface indicating that the microphone is on;
or when the microphone is in an on state and the display screen of the intelligent sound box is occupied, controlling the light particles on the intelligent sound box to light up with a preset light effect;
or, when the human body is recognized to leave the interaction area, the microphone is turned off.
A second aspect of the embodiments of the present invention discloses an intelligent speaker, including: a first identification unit, used for identifying whether a human body exists in the interaction area by utilizing the camera device and the pyroelectric sensor of the intelligent sound box, wherein the interaction area is an area whose distance from the intelligent sound box is smaller than a first preset distance threshold;
and the first control unit is used for starting the microphone of the intelligent sound box when the recognition unit recognizes that a human body exists in the interaction area, so that voice interaction is carried out by utilizing the microphone.
As an optional implementation manner, in a second aspect of the embodiment of the present invention, the first identifying unit includes:
the first control subunit is used for controlling the camera device of the intelligent sound box to shoot a first image;
the second control subunit is used for controlling a pyroelectric sensor of the intelligent sound box to detect whether a movable object exists in the interaction area;
a determining subunit, configured to determine that a human body exists in the interaction region when the human body is identified from the first image and a moving object exists in the interaction region is detected;
and the effective detection distance of the pyroelectric sensor is less than or equal to the first preset distance threshold.
As an optional implementation manner, in the second aspect of the embodiment of the present invention, the first control subunit is specifically configured to control the image capturing apparatus to capture a first image in a first preset posture;
and, the smart sound box further comprises:
the second control unit is used for controlling the camera device to turn over from the first preset posture to a second preset posture after the first control unit starts the microphone of the intelligent sound box;
the third control unit is used for controlling the camera device to shoot a second image at the second preset posture;
the image processing unit is used for executing corresponding processing on the second image according to the voice information acquired by the microphone; when the camera device is in the first preset posture, a camera lens of the camera device and a display screen of the intelligent sound box face to the same side; and when the camera device is in the second preset posture, the camera lens of the camera device faces the placing surface of the intelligent sound box.
As an optional implementation manner, in the second aspect of the embodiment of the present invention, the method further includes:
the second identification unit is used for identifying the orientation of the face in the first image;
and the second control unit is specifically used for controlling the camera device to turn over from the first preset posture to a second preset posture after the first control unit turns on the microphone of the intelligent sound box and the second recognition unit recognizes that the face faces the display screen of the intelligent sound box.
As an optional implementation manner, in the second aspect of the embodiment of the present invention, the method further includes:
the fourth control unit is used for, after the first control unit turns on the microphone of the intelligent sound box, displaying, overlaid on the user interface currently displayed on the display screen, a prompt interface indicating that the microphone is on, when the microphone is in the on state and the display screen of the intelligent sound box is occupied; or controlling the light particles on the intelligent sound box to light up with a preset light effect when the microphone is in the on state and the display screen of the intelligent sound box is occupied;
and the first control unit is further used for closing the microphone if the human body is identified to leave the interaction area after the microphone of the intelligent sound box is started.
A third aspect of the embodiments of the present invention discloses an intelligent speaker, including:
a memory storing executable program code;
a processor coupled with the memory;
the processor calls the executable program code stored in the memory to execute any one of the methods disclosed in the first aspect of the embodiments of the present invention.
A fourth aspect of the embodiments of the present invention discloses a computer-readable storage medium storing a computer program, wherein the computer program causes a computer to execute any one of the methods disclosed in the first aspect of the embodiments of the present invention.
A fifth aspect of the embodiments of the present invention discloses a computer program product, which, when running on a computer, causes the computer to execute any one of the methods disclosed in the first aspect of the embodiments of the present invention.
Compared with the prior art, the embodiment of the invention has the following beneficial effects:
in the embodiment of the invention, whether a human body exists in the interaction area is detected by using the camera device and the pyroelectric sensor, and when the human body is detected, the microphone of the intelligent sound box is turned on, so that the microphone is turned on when the interaction requirement is judged to exist. Because the microphone does not need to be opened for a long time, the possibility that the microphone intercepts the user conversation in a non-interactive scene can be reduced, the privacy of the intelligent sound box user is protected, and the intelligence of voice interaction is improved.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings needed to be used in the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art that other drawings can be obtained according to these drawings without creative efforts.
Fig. 1 is an exemplary diagram of a smart sound box disclosed in an embodiment of the present invention;
FIG. 2 is a flow chart of a voice interaction control method disclosed in the embodiment of the present invention;
FIG. 3 is a flow chart illustrating another exemplary method for controlling voice interaction according to the present invention;
fig. 4 is an exemplary diagram of the camera device 20 of the smart speaker in the first preset posture according to the embodiment of the present invention;
fig. 5 is an exemplary diagram of the camera device 20 of the smart speaker in the second preset posture, which is disclosed in the embodiment of the present invention;
fig. 6 is a schematic structural diagram of an intelligent sound box disclosed in the embodiment of the present invention;
fig. 7 is a schematic structural diagram of another smart sound box disclosed in the embodiment of the present invention;
fig. 8 is a schematic structural diagram of another smart sound box disclosed in the embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It is to be noted that the terms "comprises" and "comprising" and any variations thereof in the embodiments and drawings of the present invention are intended to cover non-exclusive inclusions. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to only those steps or elements listed, but may alternatively include other steps or elements not listed, or inherent to such process, method, article, or apparatus.
The embodiment of the invention discloses a voice interaction control method and an intelligent sound box, which can protect the user privacy of an intelligent sound box user and improve the intelligence of voice interaction. The following are detailed below.
In order to better introduce the voice interaction control method disclosed by the embodiment of the invention, the intelligent sound box applicable to the method is described below. Referring to fig. 1, fig. 1 is an exemplary diagram of an intelligent sound box according to an embodiment of the present invention. As shown in fig. 1, a display screen 11 may be disposed on one side surface of the smart sound box main body 10. The smart sound box may further include a camera device 20, the camera device 20 may be lifted, and after the camera device 20 is lifted, the camera device 20 may further rotate, so as to change a viewing range of the camera device 20. For example, the camera 20 may be disposed on the top of the main body 10, and the camera 20 may be turned around a rotation axis parallel to the top of the main body 10, so that the viewing range of the camera 20 may be changed at least from a placement surface facing the smart sound box (as shown in fig. 1) to a position facing the front of the display screen 11. Alternatively, the camera device 20 can rotate around a rotating shaft perpendicular to the top of the main body 10, so that the view range of the camera device 20 includes a circle with the smart sound box as the center. When the camera device 20 descends, the camera device 20 can be hidden in the groove of the main body 10, so that the intelligent sound box is more attractive in appearance.
On the top or side of the main body 10, light particles 121 may further be provided, and their number is not limited. Alternatively, a plurality of light particles 121 may form a light strip 12, with the light particles 121 disposed on the top or the side of the main body 10 in the form of the light strip 12; the number of light strips 12 is likewise not limited.
In addition, one or more of a pyroelectric sensor, a microphone, and a speaker may be provided inside the main body 10. The pyroelectric sensor utilizes the radiation heat effect to enable the detection element to receive radiation energy and then cause temperature rise, and further enables the performance of the detector depending on the temperature to change. For example, the pyroelectric sensor may be a pyroelectric infrared sensor, and a pyroelectric element of the pyroelectric infrared sensor releases electric charges outwards when detecting that the infrared radiation temperature of a human body changes, so as to trigger a signal for indicating that the human body is detected.
Example one
Referring to fig. 2, fig. 2 is a schematic flow chart illustrating a voice interaction control method according to an embodiment of the present invention. As shown in fig. 2, the voice interaction control method may include the steps of:
201. the intelligent sound box utilizes a camera device and a pyroelectric sensor of the intelligent sound box to identify whether a human body exists in an interaction area; if yes, go to step 202; if not, the flow is ended.
In the embodiment of the present invention, the interaction area may be an area whose distance from the smart sound box is smaller than a first preset distance threshold, and the shape of the area is not limited. Optionally, the interaction region may be a circular region with the smart sound box as a center and the first preset distance threshold as a radius; or, the interaction area may also be an area located in front of the display screen 11 and having a distance from the smart speaker smaller than a first preset distance threshold. It is understood that in the embodiment of the present invention, the viewing range of the image pickup device 20 and the effective detection area of the pyroelectric sensor may include the above-described interaction area. For example, if the interaction area is a circular area with the smart sound box as a center, the smart sound box may be provided with a plurality of pyroelectric sensors, and a set of effective detection areas of the pyroelectric sensors may cover the interaction area; if the interaction area is an area located in front of the display screen 11 of the smart sound box, the pyroelectric sensor may be disposed on the same side of the display screen 11, and the effective detection area of the pyroelectric sensor covers the interaction area.
The first preset distance threshold may be set with reference to at least one of the following empirical values: the distance a user generally keeps from the intelligent sound box when carrying out voice interaction with it; or the distance from the intelligent sound box to the farthest edge of its placement surface. As an optional implementation manner, in the embodiment of the present invention, the intelligent sound box may turn on its microphone when it detects a preset gesture. Meanwhile, each time the microphone is started through the preset gesture, the intelligent sound box may use the pyroelectric sensor to detect the distance between the user's position and the intelligent sound box as a sample distance; when the number of times the user starts the microphone through the gesture reaches several times, the intelligent sound box collects several sample distances. The first preset distance threshold may then be set to the average of these sample distances, so that it objectively and effectively reflects the distance between the user and the intelligent sound box during voice interaction.
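The sample-averaging scheme described above can be sketched as follows. This is an illustrative sketch, assuming a minimum sample count and a default threshold that the patent does not specify; the function and parameter names are not from the patent.

```python
def update_distance_threshold(sample_distances, min_samples=5, default_threshold=1.5):
    """Derive the first preset distance threshold (meters) from sample
    distances measured each time the user opened the microphone with the
    preset gesture; fall back to a default until enough samples exist."""
    if len(sample_distances) < min_samples:
        return default_threshold  # not enough gesture-triggered samples yet
    return sum(sample_distances) / len(sample_distances)
```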
In the embodiment of the present invention, the camera device 20 may be configured to capture an image, and the smart speaker may perform image recognition on the captured image to identify whether a human body exists in the image. However, since the maximum viewing distance of the camera device 20 can reach the hundred-meter level, when the first preset distance threshold is small (e.g. lower than 1 meter), even if a human body is photographed, that human body may be located outside the interaction region. The pyroelectric sensor, in turn, is sensitive to changes in external infrared radiation reaching it, such as those caused by a moving human body. The effective detection distance of the pyroelectric sensor can be set to be smaller than or equal to the first preset distance threshold, so that when a user enters the interaction area from outside it, the pyroelectric sensor can detect the temperature change caused by the user's movement. However, since the smart speaker is generally placed in a house, if pets such as cats and dogs are kept there, these animals may be mistakenly identified as human bodies by the pyroelectric sensor.
Therefore, whether a human body exists in the interaction area is recognized by the camera device 20 alone, or whether a human body exists in the interaction area is recognized by the pyroelectric sensor alone, and a certain false recognition rate may exist. In order to reduce the false recognition rate, in the embodiment of the present invention, the smart sound box may perform double verification through the camera device 20 and the pyroelectric sensor, so as to improve the accuracy of detecting a human body in the interaction area by the smart sound box and reduce false activation of the microphone. As an alternative implementation, step 201 may include:
controlling a camera device 20 of the smart sound box to shoot a first image;
controlling a pyroelectric sensor of the intelligent sound box to detect whether a movable object exists in an interaction area;
if the human body is identified from the first image and the existence of the movable object in the interaction area is detected, the existence of the human body in the interaction area is determined;
the effective detection distance of the pyroelectric sensor is smaller than or equal to a first preset distance threshold value.
It is understood that, as an alternative implementation manner, the smart speaker may first control the camera device 20 to capture a first image, recognize that a human body exists in the first image, and then control the pyroelectric sensor to detect whether a moving object exists in the interaction area. As another alternative, the smart speaker may control the pyroelectric sensor to detect whether a moving object exists in the interaction area, and after detecting that the moving object exists, control the camera device 20 to capture the first image and identify whether a human body exists in the first image.
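The double verification (camera first, then pyroelectric sensor) can be sketched as follows; the reverse ordering simply swaps the two checks. The `camera` and `pir` interfaces here are assumptions for illustration, not APIs defined by the patent.

```python
def human_in_interaction_area(camera, pir):
    """Require both: a human recognized in the first image captured by the
    camera device, and a moving object reported by the pyroelectric (PIR)
    sensor, before concluding a human body is in the interaction area."""
    first_image = camera.capture()
    if not camera.detect_human(first_image):
        return False  # e.g. a pet trips the PIR sensor but no human is visible
    return pir.moving_object_detected()
```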
In addition, the intelligent sound box may have a built-in battery, or it may have no battery and be plugged into a power supply during use. If a battery is provided in the smart sound box, as an optional implementation manner, in order to prolong the endurance time of the smart sound box, in the embodiment of the present invention, before performing step 201, the following steps may further be performed:
detecting whether a Bluetooth signal of wearable equipment bound with the intelligent sound box is searched;
if yes, go to step 201 above.
In the above embodiment, the Bluetooth communication range of small mobile terminals such as a smart watch is generally 6 to 8 meters, so when the smart speaker finds the Bluetooth signal of the wearable device, the user may be considered to be near the smart speaker, and only then are the camera device 20 and the pyroelectric sensor controlled to perform recognition. This reduces the power consumption of the smart speaker and prolongs its battery life.
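The Bluetooth gating step can be sketched as a thin wrapper around the presence check. Both callables here are illustrative assumptions; the patent does not name them.

```python
def gated_presence_check(wearable_bt_visible, presence_check):
    """Battery-saving gate: run the camera/PIR presence check only after
    the Bluetooth signal of the bound wearable device is found (i.e. the
    user is assumed within the ~6-8 m Bluetooth range)."""
    if not wearable_bt_visible():
        return False  # wearable out of range: keep camera and PIR idle
    return presence_check()
```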
202. The intelligent sound box turns on its microphone so as to carry out voice interaction through the microphone.
In the embodiment of the invention, after the microphone is turned on, the smart sound box can recognize the audio signal collected by the microphone and execute corresponding operations, such as searching and playing audio, according to the recognized voice information, so as to perform voice interaction with the user.
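The recognize-then-execute step can be sketched as follows. The command prefix, the `player` interface, and the return labels are assumptions made for this sketch, not specifics from the patent.

```python
def handle_voice_command(text, player):
    """Execute the operation indicated by recognized voice information,
    e.g. searching for and playing audio when the command asks for it."""
    if text.startswith("play "):
        player.search_and_play(text[len("play "):])  # e.g. "play lullaby"
        return "playing"
    return "no_op"  # unrecognized command: take no action
```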
For example, the smart speaker may be a learning aid placed on a desk, and the first preset distance threshold may be set to the distance between the smart speaker and the position where a child sits in front of the desk. When the child starts learning, the pyroelectric sensor can recognize the motion of the child approaching the desk and sitting down in front of it, and the image pickup device 20 can photograph the child. The intelligent sound box then turns on the microphone to listen for voice input from the child.
As an optional implementation manner, in the embodiment of the present invention, after the microphone of the smart speaker is turned on, the following steps may be further performed:
when the microphone is in an on state, controlling the light particles 121 on the smart sound box to light up with a preset light effect;
or when the microphone is in an open state, controlling the display screen 11 of the intelligent sound box to output a preset image;
the preset light effect may include, but is not limited to, normally on, flashing, running light, and the like; the preset image may include a still image and a moving image. According to the implementation mode, the microphone of the user can be prompted to be in the open state through light or preset images, so that the possibility that the intelligent sound box eavesdrops a user conversation under the condition that the user does not know is reduced, and the privacy of the user can be protected.
Further optionally, when the microphone is detected to be in the on state, whether the display screen 11 of the smart sound box is occupied or not can be further detected;
if the display screen 11 of the intelligent sound box is occupied, the controlling the display screen 11 of the intelligent sound box to output the preset image includes:
if the display screen 11 of the intelligent sound box is occupied, displaying, overlaid on the user interface currently displayed on the display screen 11, a prompt interface indicating that the microphone is on; optionally, the area of the prompt interface may be smaller than that of the user interface;
if the display screen 11 of the smart sound box is not occupied, the step of controlling the light particles 121 on the smart sound box to light up with the preset light effect is executed.
By implementing this implementation, if the display screen 11 of the smart sound box is detected to be occupied, it can be considered that the user is probably interacting with the smart sound box and is likely within the interaction area, so the distance between the user and the smart sound box is short, and outputting a prompt interface through the display screen 11 can effectively prompt the user; if the display screen 11 of the smart sound box is not occupied, the distance between the user and the smart sound box may be far, and outputting light allows the user to receive the prompt of the smart sound box in a long-distance scene.
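The prompt-routing decision above can be sketched as follows. This is a minimal illustration only; the function and return-value names are assumptions for exposition, not part of the patent:

```python
# Hypothetical sketch of the microphone-state prompt routing described above.

def choose_mic_prompt(display_occupied: bool) -> str:
    """Pick how to tell the user the microphone is on.

    If the display screen is occupied, the user is likely nearby and
    interacting, so a floating prompt interface is overlaid on the
    currently displayed user interface; otherwise the user may be far
    away, so the light particles are lit with a preset light effect.
    """
    if display_occupied:
        return "floating_prompt_interface"
    return "preset_light_effect"
```

Either branch keeps the user informed that audio capture is active, which is the privacy property the embodiment is after.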
It can be seen that, in the method described in fig. 2, when the smart speaker detects through the camera device 20 and the pyroelectric sensor that a human body is present in the interaction area, the microphone is triggered and started. In this way, the microphone is turned on only when there is an interaction demand and does not need to stay on for a long time, which reduces the possibility that the microphone intercepts a user conversation in a non-interaction scene, protects the privacy of the smart speaker's user, and improves the intelligence of voice interaction. Further, reasonably setting the first preset distance threshold can improve the accuracy of recognizing the voice interaction demand, and double verification by the camera device 20 and the pyroelectric sensor can improve the accuracy of recognizing a human body in the interaction area. In addition, after the microphone is started, the smart speaker can remind the user that the microphone is on by outputting a prompt interface, a preset light effect, or the like, so that the user can manually turn off the microphone in time according to the prompt, to reduce the occurrence of eavesdropping.
Example two
Referring to fig. 3, fig. 3 is a schematic flow chart of another voice interaction control method according to an embodiment of the present invention. As shown in fig. 3, the voice interaction control method may include the steps of:
301. The intelligent sound box controls a camera device of the intelligent sound box to shoot a first image in a first preset posture, and identifies whether a human body exists in the first image; if yes, go to step 302; if not, the flow is ended.
In the embodiment of the invention, the intelligent sound box can be used for learning and tutoring. Under the scene of learning and tutoring, the intelligent sound box can output and display teaching materials through the display screen 11, for example, play teaching videos, display the problem solving process and the like. At this time, if the user needs to interact with the smart speaker, the user generally faces the display screen 11, so as to view the teaching materials output from the display screen 11.
Therefore, in the embodiment of the present invention, when the camera device 20 is in the first preset posture, the camera lens of the camera device 20 faces the same side as the display screen 11 of the smart sound box, so that the user can be shot by the camera device 20 when the user moves towards the display screen 11. For example, please refer to fig. 4, fig. 4 is an exemplary diagram of the camera device 20 of the smart speaker in the first preset posture according to the embodiment of the present invention.
As an alternative implementation, before performing step 301, the following steps may also be performed:
acquiring the current posture of the camera device 20; if the current posture is the first preset posture, executing step 302;
if the current posture is not the first preset posture, controlling the pyroelectric sensor to detect whether a movable object exists in the interaction area; if yes, controlling the camera device 20 to turn to a first preset posture, shooting an image in the first preset posture, and identifying whether a human body exists in the image; if the human body exists, go to step 304; the non-first preset posture may include a second preset posture described below.
By implementing the above embodiment, if the camera device 20 is already in the first preset posture, image capture is performed directly by the camera device 20; if the camera device 20 is not in the first preset posture, detection is first performed by the pyroelectric sensor. This can reduce frequent flipping of the camera device 20, reduce the power consumption of the smart speaker, and prolong the service life of the turning device that controls the flipping of the camera device 20.
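The posture-gated capture decision can be sketched as below. The constants and return convention are assumptions made for illustration; the patent does not prescribe a concrete data structure:

```python
# Hypothetical sketch of the posture check performed before step 301:
# if the camera is already in the first preset posture it shoots directly;
# otherwise the pyroelectric sensor is consulted first, so the camera is
# flipped only when a moving object is actually present.

FIRST_POSTURE = "first"    # lens faces the same side as the display screen
SECOND_POSTURE = "second"  # lens faces the placement surface

def maybe_capture(camera_posture: str, pyro_detects_motion: bool):
    """Return (flip_needed, capture) for the current detection cycle."""
    if camera_posture == FIRST_POSTURE:
        return (False, True)   # shoot immediately, no flip needed
    if pyro_detects_motion:
        return (True, True)    # flip to the first posture, then shoot
    return (False, False)      # stay put: saves power and flip cycles
```

Only the pyro-positive case pays the cost of a flip, which is the power/longevity benefit the embodiment describes.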
302. The intelligent sound box controls a pyroelectric sensor thereof to detect whether a moving object exists in an interaction area; if yes, go to step 303; if not, the flow is ended.
303. The intelligent sound box turns on its microphone, so as to perform voice interaction through the microphone.
In the embodiment of the present invention, as an optional implementation manner, when the microphone is in the on state and the display screen 11 of the smart sound box is occupied, a prompt interface for indicating that the microphone is in the on state may be displayed in a floating manner on a user interface currently displayed on the display screen 11; or, when the microphone is in an on state and the display screen 11 of the smart speaker is occupied, the light particles 121 on the smart speaker may be controlled to light with a preset light effect.
304. The intelligent sound box identifies the orientation of the face in the first image and judges whether the face faces the display screen of the intelligent sound box; if yes, executing step 305 to step 307; if not, step 308 is performed.
In the embodiment of the present invention, when the user walks into the interaction area, the following cases may occur. The user may only need to perform voice interaction with the smart speaker; for example, the user asks the smart speaker by voice "what's the weather like today?", and the smart speaker broadcasts the queried weather conditions by voice. Alternatively, the user may need to interact with the smart speaker by combining vision and hearing; for example, in a learning-assistance scene, the user asks the smart speaker by voice "how do I solve this question?". The smart speaker can recognize from the voice information input by the user that the user's search target is a question, and in addition it needs to further determine the question content corresponding to the "question" specified by the user's voice. In the case where the user needs to interact with the smart speaker by combining vision and hearing, the smart speaker may need to control the camera device 20 so that the camera device 20 can capture the content (such as the question content) that the user does not input by voice.
In the embodiment of the present invention, after the face is recognized to face the display screen 11 of the smart speaker by performing step 304, it may be determined that an application scenario may be in a learning assistance or the like that requires a combination of hearing and vision. Therefore, step 305 is executed, so that the number of times of flipping of the image pickup apparatus 20 can be further reduced.
305. The intelligent sound box controls the camera device to turn over from the first preset posture to the second preset posture.
306. And the intelligent sound box controls the camera device to shoot a second image at a second preset posture.
When the camera device 20 is in the second preset posture, the camera lens of the camera device 20 faces the placing surface of the smart sound box. For example, please refer to fig. 5, fig. 5 is an exemplary diagram of the camera device 20 of the smart speaker in the second preset posture according to the embodiment of the present invention. As shown in fig. 5, the placement surface of the smart speaker may be a desktop. When the camera device 20 is in the second predetermined posture, if a book, an exercise book or other learning materials are placed on the desk, the camera device can capture a second image including the learning materials.
307. And the intelligent sound box executes corresponding processing on the second image according to the voice information acquired by the microphone.
As an alternative implementation, step 307 may include:
if the voice information acquired by the microphone comprises a search instruction, the smart speaker identifies whether a preset specified object such as a finger or a pen exists in the second image; if so, the smart speaker can further identify the learning content specified by the specified object in the second image; the smart speaker can then determine, according to the voice information and the learning content, the content that finally needs to be searched, and search for that content. Illustratively, search keywords such as "how to do" and "what" may be set in advance. If the voice information is recognized to contain a search keyword, it can be determined that the voice information includes a search instruction; for example, "how do I solve this question?" may be voice information containing a search instruction. If the smart speaker further identifies that the specified object exists in the second image, it identifies the learning content specified by the specified object, such as the question stem of a certain question-and-answer item. Based on the voice information "how do I solve this question?" and the learning content (the question stem), it may be determined that the content to be searched is the answer to the question or the process of solving it.
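The search-instruction branch of step 307 can be sketched as follows. The keyword list, function name, and query format are illustrative assumptions, not the patent's actual implementation:

```python
# Minimal sketch of the search-instruction flow: a search keyword in the
# voice plus learning content pointed at by a specified object (finger or
# pen) in the second image together determine the content to be searched.
from typing import Optional

SEARCH_KEYWORDS = ("how to do", "how do i solve", "what")  # assumed presets

def build_search_query(voice_text: str,
                       pointed_content: Optional[str]) -> Optional[str]:
    """Return the content to be searched, or None when either the voice
    carries no search instruction or no learning content was identified."""
    has_search = any(k in voice_text.lower() for k in SEARCH_KEYWORDS)
    if has_search and pointed_content:
        # Combine the pointed-at question stem with the spoken request.
        return f"{pointed_content}: {voice_text}"
    return None
```

Only when both signals are present does a search fire, matching the two-stage recognition (voice keyword, then image object) described above.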
Or if the voice information acquired by the microphone comprises a help seeking instruction, the intelligent sound box acquires a target account number which has an association relation with a user account number according to the currently logged-in user account number, and sends the second image to the intelligent sound box bound with the target account number, so that the intelligent sound box bound with the target account number outputs and displays the second image. Illustratively, help keywords such as "ask a glance", "teach me", and the like may be preset. If the voice information is identified to contain the help-seeking keyword, the voice information can be determined to comprise the help-seeking instruction, and the second image is sent to another intelligent sound box bound in advance, so that a user of the other intelligent sound box can see the help-seeking content of the user, and help is provided for the user.
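The help-seeking branch can be sketched similarly. The keyword list, the account-binding table, and the send callback are hypothetical stand-ins for whatever account service the speaker actually uses:

```python
# Sketch of the help-seeking flow: look up the target account associated
# with the currently logged-in account and forward the second image to the
# smart speaker bound to that target account. All names are assumptions.

HELP_KEYWORDS = ("ask a glance", "teach me")        # assumed presets
ASSOCIATED_ACCOUNTS = {"child_account": "parent_account"}  # assumed binding

def forward_for_help(voice_text, current_account, second_image, send):
    """If the voice contains a help keyword, send the second image to the
    bound target account's speaker; return the target account or None."""
    if not any(k in voice_text for k in HELP_KEYWORDS):
        return None
    target = ASSOCIATED_ACCOUNTS.get(current_account)
    if target is not None:
        send(target, second_image)  # e.g. push to the bound speaker
    return target
```

The receiving speaker then outputs and displays the image so its user can see the help-seeking content.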
308. And when the human body is identified to leave the interaction area, the intelligent sound box closes the microphone.
In this embodiment of the present invention, as an optional implementation manner, step 308 may include:
when the microphone is in an on state, if the pyroelectric sensor detects a moving object in the interaction area again, it can be determined that the human body is leaving the interaction area, and the microphone is therefore turned off.
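This open/close state machine can be sketched as below; the class and method names are assumptions for illustration:

```python
# Minimal state sketch: while the microphone is already on, a renewed
# motion event from the pyroelectric sensor is interpreted as the human
# body leaving the interaction area, so the microphone is switched off.

class MicController:
    def __init__(self):
        self.mic_on = False

    def on_human_in_area(self):
        # Steps 301-303: human confirmed in the interaction area.
        self.mic_on = True

    def on_pyro_motion(self):
        # Step 308: motion while the mic is on => the user is leaving.
        if self.mic_on:
            self.mic_on = False
```

Turning the microphone off as soon as the user leaves is what further reduces the eavesdropping risk noted in the summary.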
It can be seen that, in the method described in fig. 3, before the microphone is turned on, the camera device 20 of the smart speaker can be in the first preset posture so as to shoot a human body that may enter the interaction area; after the microphone is turned on, the camera device 20 of the smart speaker can be in the second preset posture so as to shoot the learning materials and the like placed on the placement surface. In addition, the smart speaker controls the camera device 20 to turn from the first preset posture to the second preset posture only after recognizing that the face in the first image faces the display screen 11, that is, only after judging that the current application scene, such as learning assistance, requires combining hearing and vision. This reduces the number of flips of the camera device 20, reduces power consumption, and prolongs the service life of the turning device. When the smart speaker recognizes that the human body has left the interaction area, it turns off the microphone, which can further reduce the risk of eavesdropping by the smart speaker.
EXAMPLE III
Referring to fig. 6, fig. 6 is a schematic structural diagram of an intelligent sound box according to an embodiment of the present invention. As shown in fig. 6, the smart speaker may include:
the first identification unit 601 is used for identifying whether a human body exists in the interaction area by using a camera device and a pyroelectric sensor of the intelligent sound box; the interaction area is an area with a distance from the intelligent sound box smaller than a first preset distance threshold value.
The first preset distance threshold may be set with reference to at least one of the following empirical values: the distance generally kept between a user and the smart speaker when the user performs voice interaction with it; or the distance between the smart speaker and the edge of its placement surface farthest from it. As an optional implementation manner, in an embodiment of the present invention, the first recognition unit 601 may turn on the microphone of the smart speaker when detecting a preset gesture. Meanwhile, the first recognition unit 601 may use the pyroelectric sensor to detect, as a sample distance, the distance between the user's position and the smart speaker at the moment the microphone is turned on by the preset gesture; after the user has triggered the microphone by gesture several times, the first recognition unit 601 will have collected several sample distances. The first preset distance threshold can then be set to the average of the sample distances, which objectively and effectively measures the distance between the user and the smart speaker during voice interaction.
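The calibration of the threshold from gesture-triggered samples can be sketched as follows. The minimum sample count is an assumption; the patent only says "a plurality of times":

```python
# Sketch of calibrating the first preset distance threshold: each time the
# user opens the microphone with the preset gesture, the distance measured
# at that moment is stored as a sample; once enough samples exist, the
# threshold is set to their average.

MIN_SAMPLES = 5  # assumed number of gesture-triggered activations

def update_threshold(sample_distances):
    """Return the first preset distance threshold, or None if not enough
    sample distances have been collected yet."""
    if len(sample_distances) < MIN_SAMPLES:
        return None
    return sum(sample_distances) / len(sample_distances)
```

Averaging over the user's own activations adapts the interaction area to how far that user actually stands when talking to the speaker.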
In addition, the intelligent sound box can be internally provided with a battery or not comprise the battery, and a power supply is plugged in when the intelligent sound box is used. If the battery is built in the smart speaker, in order to prolong the duration of the smart speaker, as an optional implementation manner, in an embodiment of the present invention, before performing an operation of identifying whether a human body exists in the interaction area by using the camera device and the pyroelectric sensor of the smart speaker, the first identification unit 601 may further be configured to:
detecting whether a Bluetooth signal of wearable equipment bound with the intelligent sound box is searched;
if yes, the operation of identifying whether a human body exists in the interaction area by using the camera device and the pyroelectric sensor of the intelligent sound box is executed.
By implementing the implementation mode, the electric quantity consumption of the intelligent sound box can be reduced, and the endurance time of the intelligent sound box is prolonged.
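The battery-saving gate can be sketched as a single predicate; the parameter names are illustrative:

```python
# Sketch of the Bluetooth gate: when the speaker runs on its built-in
# battery, the camera/pyroelectric detection pipeline runs only if the
# Bluetooth signal of the bound wearable device is found, i.e. the user
# is presumably within range.

def should_run_detection(bt_signal_found: bool, battery_powered: bool) -> bool:
    if not battery_powered:
        return True          # mains-powered: detection may always run
    return bt_signal_found   # battery: gate on the bound wearable's signal
```

Skipping detection when the wearable is absent avoids spending battery on a user who is not nearby, prolonging endurance time as described above.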
As an optional implementation manner, the first identifying unit 601 may include:
the first control subunit 6011 is configured to control the camera device of the smart speaker to capture a first image;
a second control subunit 6012, configured to control a pyroelectric sensor of the smart speaker to detect whether a moving object exists in the interaction area;
a determining subunit 6013, configured to determine that a human body exists in the interaction region when the human body is identified from the first image and a moving object exists in the interaction region is detected;
the effective detection distance of the pyroelectric sensor is smaller than or equal to a first preset distance threshold value.
It can be understood that, in the embodiment of the present invention, the first control subunit 6011 may control the image capturing device of the smart speaker to capture a first image, and after recognizing that a human body exists in the first image, trigger the second control subunit 6012 to perform an operation of controlling the pyroelectric sensor to detect whether a moving object exists in the interaction area;
or, the second control subunit 6012 may control the pyroelectric sensor of the smart speaker to detect whether there is a moving object in the interaction area, and after detecting that there is a moving object, trigger the first control subunit 6011 to perform an operation of controlling the camera of the smart speaker to capture the first image.
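The two trigger orders described above can be summarized in one predicate; the names are assumptions for illustration:

```python
# Sketch of the double verification performed by subunits 6011-6013:
# camera-first (shoot, then confirm with the pyroelectric sensor) or
# pyro-first (detect motion, then confirm with the camera). Either way,
# a human body is confirmed only when both checks pass.

def detect_human(camera_sees_human: bool,
                 pyro_sees_motion: bool,
                 camera_first: bool = True) -> bool:
    """Both sensors must agree before the microphone is opened; the order
    only decides which sensor gates (and thus saves work for) the other."""
    if camera_first:
        return camera_sees_human and pyro_sees_motion
    return pyro_sees_motion and camera_sees_human
```

The result is order-independent; in practice the ordering matters for power, since the second check runs only when the first fires.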
The first control unit 602 is configured to, when the first identification unit 601 identifies that a human body exists in the interaction area, turn on a microphone of the smart speaker to perform voice interaction by using the microphone.
It can be seen that the smart speaker shown in fig. 6 can trigger and start the microphone when detecting, through the camera device and the pyroelectric sensor, that a human body is present in the interaction area, thereby reducing the possibility that the microphone eavesdrops on a user conversation in a non-interaction scene, protecting the privacy of the smart speaker's user, and improving the intelligence of voice interaction. Furthermore, by reasonably setting the first preset distance threshold and performing double verification with the camera device and the pyroelectric sensor, the accuracy of recognizing a human body in the interaction area can be improved.
Example four
Referring to fig. 7, fig. 7 is a schematic structural diagram of another intelligent sound box disclosed in the embodiment of the present invention. The smart sound box shown in fig. 7 is obtained by optimizing the smart sound box shown in fig. 6. In the smart speaker shown in fig. 7:
as an optional implementation manner, the first control subunit 6011 may be specifically configured to control the camera device of the smart speaker to capture a first image in a first preset posture;
and, the smart speaker shown in fig. 7 may further include:
the second control unit 603 is configured to control the camera to turn over from the first preset posture to a second preset posture after the first control unit 602 turns on the microphone of the smart sound box;
a third control unit 604 for controlling the camera to take a second image in a second preset posture;
the image processing unit 605 is configured to perform corresponding processing on the second image according to the voice information acquired by the microphone;
when the camera device is in the first preset posture, the camera lens of the camera device and the display screen of the intelligent sound box face to the same side; when the camera device is in the second preset posture, the camera lens of the camera device faces the placing surface of the intelligent sound box.
As an optional implementation manner, the image processing unit 605 may be specifically configured to, when the voice information acquired by the microphone includes a search instruction, identify whether a preset specified object such as a finger or a pen exists in the second image; if the learning content exists, the learning content specified by the specified object in the second image can be further identified; according to the voice information and the learning content, the content to be searched which needs to be searched finally can be determined, and the content to be searched is searched; or, the method and the device can be used for acquiring a target account number which has an association relation with a user account number according to the currently logged-in user account number when the voice information acquired by the microphone includes a help seeking instruction, and sending the second image to the smart sound box bound with the target account number, so that the smart sound box bound with the target account number outputs and displays the second image.
Further optionally, the smart sound box shown in fig. 7 may further include:
a second recognition unit 606 for recognizing the orientation of the face in the first image;
the second control unit 603 is specifically configured to control the camera device to turn from the first preset posture to the second preset posture after the first control unit 602 turns on the microphone of the smart speaker and the second recognition unit 606 recognizes that the face of the person faces the display screen of the smart speaker.
By recognizing the orientation of the face through the second recognition unit 606, the second control unit 603 can be triggered to turn over the image capture device after determining that the current application scene such as learning assistance needs to combine auditory sense and visual sense, so that the turn-over frequency of the image capture device can be reduced.
Still further optionally, the smart sound box shown in fig. 7 may further include:
a fourth control unit 607, configured to, after the first control unit 602 turns on the microphone of the smart speaker, when the microphone is in an on state and the display screen of the smart speaker is occupied, suspend and display a prompt interface for indicating that the microphone is in the on state on a user interface currently displayed on the display screen; or when the microphone is in an open state and the display screen of the intelligent sound box is occupied, controlling the light particles on the intelligent sound box to light up with a preset light effect. Fourth control unit 607 passes through light or predetermined image suggestion user microphone and is in the on-state, can reduce the intelligent audio amplifier and eavesdrop the probably emergence of user's conversation under the condition that the user is unknown to can protect user's privacy.
And the first control unit 602 is further configured to turn off the microphone after turning on the microphone of the smart speaker, if it is recognized that the human body leaves the interaction area, so that the risk of eavesdropping by the smart speaker can be further reduced.
As can be seen, with the implementation of the smart sound box shown in fig. 7, before the microphone is turned on, the camera device of the smart sound box may be located in the first preset posture, so as to shoot a human body that may enter the interaction area; after the microphone is opened, the camera device of the intelligent sound box can be located in the second preset posture, so that the learning materials and the like placed on the placing surface can be shot conveniently. In addition, when judging that the current application scene such as learning assistance which needs to combine hearing and vision is in, the intelligent sound box controls the camera device to turn over, so that the turning times of the camera device 20 can be reduced, the power consumption is reduced, and the service life of the turning device is prolonged. When the microphone is in the open state, the intelligent sound box prompts the user that the microphone is in the open state through light or a preset image, and further privacy of the user can be protected. And when the fact that the human body leaves the interaction area is identified, the microphone is turned off, and the risk of eavesdropping of the intelligent sound box can be further reduced.
EXAMPLE five
Referring to fig. 8, fig. 8 is a schematic structural diagram of another intelligent sound box disclosed in the embodiment of the present invention. As shown in fig. 8, the smart speaker may include:
a memory 801 in which executable program code is stored;
a processor 802 coupled with the memory 801;
the processor 802 calls the executable program code stored in the memory 801 to execute any one of the voice interaction control methods shown in fig. 2 or fig. 3.
It should be noted that the smart sound box shown in fig. 8 may further include components, which are not shown, such as a power supply, a speaker, a display screen, an RF circuit, a Wi-Fi module, a bluetooth module, and a pyroelectric sensor, and are not described in detail in this embodiment.
The embodiment of the invention discloses a computer-readable storage medium which stores a computer program, wherein the computer program enables a computer to execute any one of the voice interaction control methods shown in fig. 2 or fig. 3.
An embodiment of the invention discloses a computer program product, which includes a non-transitory computer-readable storage medium storing a computer program, and the computer program is operable to make a computer execute any one of the voice interaction control methods shown in fig. 2 or fig. 3.
It should be appreciated that reference throughout this specification to "one embodiment" or "an embodiment" means that a particular feature, structure or characteristic described in connection with the embodiment is included in at least one embodiment of the present invention. Thus, the appearances of the phrases "in one embodiment" or "in an embodiment" in various places throughout this specification are not necessarily all referring to the same embodiment. Furthermore, the particular features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. Those skilled in the art should also appreciate that the embodiments described in this specification are exemplary and alternative embodiments, and that the acts and modules illustrated are not required in order to practice the invention. In various embodiments of the present invention, it should be understood that the sequence numbers of the above-mentioned processes do not imply an inevitable order of execution, and the execution order of the processes should be determined by their functions and inherent logic, and should not constitute any limitation on the implementation process of the embodiments of the present invention.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated units, if implemented as software functional units and sold or used as a stand-alone product, may be stored in a computer-accessible memory. Based on such understanding, the technical solution of the present invention, in essence the part that contributes to the prior art, or all or part of the technical solution, can be embodied in the form of a software product, which is stored in a memory and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like, and may specifically be a processor in the computer device) to execute part or all of the steps of the above-described method of each embodiment of the present invention.
It will be understood by those skilled in the art that all or part of the steps in the methods of the embodiments described above may be implemented by a program instructing the relevant hardware, and the program may be stored in a computer-readable storage medium. The storage medium includes Read-Only Memory (ROM), Random Access Memory (RAM), Programmable Read-Only Memory (PROM), Erasable Programmable Read-Only Memory (EPROM), One-time Programmable Read-Only Memory (OTPROM), Electrically Erasable Programmable Read-Only Memory (EEPROM), Compact Disc Read-Only Memory (CD-ROM) or other optical disc memory, magnetic disk memory, tape memory, or any other computer-readable medium that can be used to carry or store data.
The voice interaction control method and the smart speaker disclosed in the embodiments of the present invention are described in detail above, and specific examples are applied in this text to explain the principle and the implementation of the present invention, and the description of the above embodiments is only used to help understanding the method and the core idea of the present invention. Meanwhile, for a person skilled in the art, according to the idea of the present invention, there may be variations in the specific embodiments and the application scope, and in summary, the content of the present specification should not be construed as a limitation to the present invention.

Claims (6)

1. A voice interaction control method is characterized by comprising the following steps:
controlling a camera device of the intelligent sound box to shoot a first image at a first preset posture;
controlling a pyroelectric sensor of the intelligent sound box to detect whether a movable object exists in an interaction area;
if a human body is identified from the first image and a movable object is detected to exist in the interaction area, determining that the human body exists in the interaction area;
the effective detection distance of the pyroelectric sensor is smaller than or equal to a first preset distance threshold; the interaction area is an area, the distance between the interaction area and the intelligent sound box is smaller than the first preset distance threshold;
starting a microphone of the intelligent sound box so as to perform voice interaction by using the microphone;
controlling the camera device to turn over from the first preset posture to a second preset posture;
controlling the camera device to shoot a second image at the second preset posture;
if the voice information acquired by the microphone comprises a help seeking instruction, acquiring a target account number which has an association relation with a user account number according to the currently logged user account number, and sending the second image to an intelligent sound box bound with the target account number so that the intelligent sound box bound with the target account number outputs and displays the second image;
when the camera device is in the first preset posture, a camera lens of the camera device and a display screen of the intelligent sound box face to the same side; and when the camera device is in the second preset posture, the camera lens of the camera device faces the placing surface of the intelligent sound box.
2. The method of claim 1, wherein prior to the controlling the camera to flip from the first preset pose to a second preset pose, the method further comprises: identifying an orientation of a face in the first image;
and if the face faces the display screen of the intelligent sound box, executing the step of controlling the camera device to turn from the first preset posture to the second preset posture.
3. The method of claim 1 or 2, wherein after the turning on a microphone of the smart sound box, the method further comprises:
when the microphone is in an open state and the display screen of the intelligent sound box is occupied, displaying a prompt interface for indicating that the microphone is in the open state in a suspending manner on a user interface currently displayed on the display screen;
or, when the microphone is in an on state and the display screen of the intelligent sound box is occupied, controlling the LED light beads on the intelligent sound box to light up with a preset light effect;
or, when it is recognized that the human body has left the interaction area, turning off the microphone.
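Claim 3 offers two feedback channels for the mic-on state while the screen is occupied, plus an auto-off rule when the user leaves. This sketch picks one action per state; the parameter and action names are illustrative, not from the patent.

```python
# Hypothetical state -> action mapping for the claim-3 feedback rules.
def mic_feedback(mic_on: bool, screen_occupied: bool,
                 human_in_area: bool, has_light_beads: bool) -> str:
    if mic_on and not human_in_area:
        return "turn_off_microphone"   # user left the interaction area
    if mic_on and screen_occupied:
        # Either channel satisfies claim 3; a device with LED light
        # beads can signal mic state without covering the busy screen.
        return "light_effect" if has_light_beads else "floating_prompt"
    return "no_action"

print(mic_feedback(True, True, True, False))
```

Checking the user has not left before choosing a display channel keeps the auto-off rule dominant: the microphone never stays on, lit or prompted, for an empty room.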
4. An intelligent sound box, comprising:
the first control subunit is used for controlling the camera device of the intelligent sound box to shoot a first image at a first preset posture;
the second control subunit is used for controlling the pyroelectric sensor of the intelligent sound box to detect whether a movable object exists in the interaction area;
a determining subunit, configured to determine that a human body exists in the interaction area when a human body is identified from the first image and a moving object is detected in the interaction area;
wherein the effective detection distance of the pyroelectric sensor is smaller than or equal to a first preset distance threshold, and the interaction area is an area whose distance from the intelligent sound box is smaller than the first preset distance threshold;
the first control unit is used for starting a microphone of the intelligent sound box so as to perform voice interaction by utilizing the microphone;
the second control unit is used for controlling the camera device to turn over from the first preset posture to a second preset posture after the first control unit starts the microphone of the intelligent sound box;
the third control unit is used for controlling the camera device to shoot a second image at the second preset posture;
the image processing unit is used for, if the voice information acquired by the microphone comprises a help-seeking instruction, acquiring, according to the currently logged-in user account, a target account associated with that user account, and sending the second image to the intelligent sound box bound to the target account, so that the intelligent sound box bound to the target account outputs and displays the second image;
when the camera device is in the first preset posture, the camera lens of the camera device and the display screen of the intelligent sound box face the same side; and when the camera device is in the second preset posture, the camera lens of the camera device faces the placing surface of the intelligent sound box.
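Claim 4 recasts the claim-1 method as cooperating units inside the sound box. The skeleton below mirrors that decomposition, one unit per method; the bodies are placeholders and the class name is an assumption, not the patented implementation.

```python
# Structural skeleton of the claim-4 apparatus decomposition.
class SmartSpeaker:
    def first_control_subunit(self):   # shoot first image, first posture
        ...
    def second_control_subunit(self):  # PIR check of the interaction area
        ...
    def determining_subunit(self):     # fuse both cues -> human present
        ...
    def first_control_unit(self):      # turn the microphone on/off
        ...
    def second_control_unit(self):     # flip camera to second posture
        ...
    def third_control_unit(self):      # shoot the second image
        ...
    def image_processing_unit(self):   # forward image on help instruction
        ...
```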
5. The intelligent sound box of claim 4, further comprising:
the second recognition unit is used for recognizing the orientation of the face in the first image;
and the second control unit is specifically used for controlling the camera device to turn over from the first preset posture to the second preset posture after the first control unit turns on the microphone of the intelligent sound box and the second recognition unit recognizes that the face faces the display screen of the intelligent sound box.
6. The intelligent sound box of claim 4 or 5, further comprising:
the fourth control unit is used for, after the first control unit turns on the microphone of the intelligent sound box: when the microphone is in an on state and the display screen of the intelligent sound box is occupied, displaying, as a floating overlay on the user interface currently displayed on the display screen, a prompt interface indicating that the microphone is in the on state; or, when the microphone is in an on state and the display screen of the intelligent sound box is occupied, controlling the LED light beads on the intelligent sound box to light up with a preset light effect;
and the first control unit is further used for turning off the microphone if it is recognized that the human body has left the interaction area after the microphone of the intelligent sound box is turned on.
CN201911136301.6A 2019-11-19 2019-11-19 Voice interaction control method and intelligent sound box Active CN111182385B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911136301.6A CN111182385B (en) 2019-11-19 2019-11-19 Voice interaction control method and intelligent sound box

Publications (2)

Publication Number Publication Date
CN111182385A CN111182385A (en) 2020-05-19
CN111182385B true CN111182385B (en) 2021-08-20

Family

ID=70651905

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911136301.6A Active CN111182385B (en) 2019-11-19 2019-11-19 Voice interaction control method and intelligent sound box

Country Status (1)

Country Link
CN (1) CN111182385B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111669674B (en) * 2020-06-24 2022-06-03 歌尔科技有限公司 MIC anti-recording assembly, loudspeaker box and anti-recording method
CN111918453A (en) * 2020-08-18 2020-11-10 深圳市秀骑士科技有限公司 LED light scene control system and control method thereof
CN112436987A (en) * 2020-11-12 2021-03-02 中国联合网络通信集团有限公司 Method and system for controlling terminal equipment switch
CN112565207B (en) * 2020-11-20 2022-06-21 南京大学 Non-invasive intelligent sound box safety evidence obtaining system and method thereof
CN113192509A (en) * 2021-05-28 2021-07-30 北京京东方显示技术有限公司 Voice interaction system and method and intelligent device

Citations (4)

Publication number Priority date Publication date Assignee Title
JP2009147708A (en) * 2007-12-14 2009-07-02 Canon Inc Image pickup device
CN108231079A (en) * 2018-02-01 2018-06-29 北京百度网讯科技有限公司 For the method, apparatus, equipment and computer readable storage medium of control electronics
CN109327657A (en) * 2018-07-16 2019-02-12 广东小天才科技有限公司 A kind of taking pictures based on camera searches topic method and private tutor's equipment
CN110277094A (en) * 2018-03-14 2019-09-24 阿里巴巴集团控股有限公司 Awakening method, device and the electronic equipment of equipment

Family Cites Families (1)

Publication number Priority date Publication date Assignee Title
CN107993656A (en) * 2017-12-06 2018-05-04 海信(山东)空调有限公司 Speech identifying function awakening method and device

Also Published As

Publication number Publication date
CN111182385A (en) 2020-05-19

Similar Documents

Publication Publication Date Title
CN111182385B (en) Voice interaction control method and intelligent sound box
CN104021788B (en) Sound-controlled apparatus and acoustic-controlled method
JP6388706B2 (en) Unmanned aircraft shooting control method, shooting control apparatus, and electronic device
CN107065586B (en) Interactive intelligent home service system and method
CN104333654B (en) Dangerous based reminding method and device and portable electric appts
CN105512685B (en) Object identification method and device
KR20180100131A (en) Information processing apparatus, information processing method, and program
CN104506649B (en) Notification message method for pushing and device
CN110826358A (en) Animal emotion recognition method and device and storage medium
CN108469772A (en) A kind of control method and device of smart machine
CN108668080A (en) Prompt method and device, the electronic equipment of camera lens degree of fouling
CN109756626B (en) Reminding method and mobile terminal
CN115132224A (en) Abnormal sound processing method, device, terminal and storage medium
CN111182408B (en) Information playing method, sound box equipment and storage medium
CN109598120A (en) Security postures intelligent analysis method, device and the storage medium of mobile terminal
US11819996B2 (en) Expression feedback method and smart robot
US20180005024A1 (en) Monitoring
CN105334720A (en) Method and device for turning off alarm clock
CN111176604B (en) Message information output method, intelligent sound box and storage medium
CN111951787A (en) Voice output method, device, storage medium and electronic equipment
CN106598445A (en) Method and device for outputting communication message
CN111744170A (en) Game control method based on wearable device and wearable device
CN105608469A (en) Image resolution determination method and device
US20180165099A1 (en) Information processing device, information processing method, and program
CN110174924B (en) Friend making method based on wearable device and wearable device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant