CN115359789A - Voice interaction method and related device, equipment and storage medium

Voice interaction method and related device, equipment and storage medium

Info

Publication number
CN115359789A
Authority
CN
China
Prior art keywords
voice
voice interaction
change
risk
target area
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210930532.XA
Other languages
Chinese (zh)
Inventor
肖建辉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
iFlytek Co Ltd
Original Assignee
iFlytek Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by iFlytek Co Ltd filed Critical iFlytek Co Ltd
Priority to CN202210930532.XA
Publication of CN115359789A
Priority to PCT/CN2022/144166 (published as WO2024027100A1)
Legal status: Pending

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/22 - Procedures used during a speech recognition process, e.g. man-machine dialogue
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04S - STEREOPHONIC SYSTEMS
    • H04S 7/00 - Indicating arrangements; Control arrangements, e.g. balance control
    • H04S 7/30 - Control circuits for electronic adaptation of the sound field
    • H04S 7/302 - Electronic adaptation of stereophonic sound system to listener position or orientation
    • H04S 7/303 - Tracking of listener position or orientation
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/22 - Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L 2015/223 - Execution procedure of a spoken command
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/22 - Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L 2015/226 - Procedures used during a speech recognition process, e.g. man-machine dialogue using non-speech characteristics

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Signal Processing (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Multimedia (AREA)
  • User Interface Of Digital Computer (AREA)
  • Fittings On The Vehicle Exterior For Carrying Loads, And Devices For Holding Or Mounting Articles (AREA)

Abstract

The application discloses a voice interaction method and a related apparatus, device, and storage medium. The voice interaction method includes: in response to detecting a start request for voice interaction, performing target detection based on a captured image, and determining the target area where the requester who initiated the start request is located; locking a first sound zone for the voice interaction based on the target area, where the voice interaction is performed based on a first voice collected in the first sound zone; and re-executing the step of performing target detection based on the captured image and determining the target area where the requester who initiated the start request is located, together with the subsequent steps. With this scheme, the sound zone can be dynamically locked, and the reliability and continuity of voice interaction are improved.

Description

Voice interaction method and related device, equipment and storage medium
Technical Field
The present application relates to the field of human-computer interaction technologies, and in particular to a voice interaction method and a related apparatus, device, and storage medium.
Background
With the rapid development of artificial intelligence technology, human-computer interaction through voice has been widely applied to smart products such as smart homes, mobile terminals, and in-vehicle devices.
At present, to improve the anti-interference performance of voice interaction, the industry generally locks the sound zone for voice interaction according to the area where the wake-up word was uttered or the position of the pressed voice key. For example, a wake-up voice from the left locks the left sound zone, and a wake-up voice from the right locks the right sound zone. However, with this approach, once the user moves beyond the locked sound zone, the user can no longer use voice interaction continuously. In view of this, how to dynamically lock the sound zone and improve the reliability and continuity of voice interaction has become an urgent problem to be solved.
Disclosure of Invention
The technical problem mainly solved by the present application is to provide a voice interaction method and a related apparatus, device, and storage medium, which can dynamically lock the sound zone and improve the reliability and continuity of voice interaction.
To solve the above technical problem, a first aspect of the present application provides a voice interaction method, including: in response to detecting a start request for voice interaction, performing target detection based on a captured image, and determining the target area where the requester who initiated the start request is located; locking a first sound zone for the voice interaction based on the target area, where the voice interaction is performed based on a first voice collected in the first sound zone; and re-executing the step of performing target detection based on the captured image and determining the target area where the requester who initiated the start request is located, together with the subsequent steps.
To solve the above technical problem, a second aspect of the present application provides a voice interaction apparatus, including an area determining module, a sound zone locking module, and a loop execution module. The area determining module is configured to, in response to a start request for voice interaction, perform target detection based on a captured image and determine the target area where the requester who initiated the start request is located; the sound zone locking module is configured to lock a first sound zone for the voice interaction based on the target area, where the voice interaction is performed based on a first voice collected in the first sound zone; and the loop execution module is configured to re-execute the step of performing target detection based on the captured image and determining the target area where the requester who initiated the start request is located, together with the subsequent steps.
To solve the above technical problem, a third aspect of the present application provides a voice interaction device, which includes a microphone, a camera, a memory, and a processor, where the microphone, the camera, and the memory are respectively coupled to the processor; the memory stores program instructions, and the processor is configured to execute the program instructions to implement the voice interaction method of the first aspect.
To solve the above technical problem, a fourth aspect of the present application provides a computer-readable storage medium storing program instructions executable by a processor, the program instructions being configured to implement the voice interaction method of the first aspect.
According to the above scheme, in response to detecting a start request for voice interaction, target detection is performed based on a captured image to determine the target area where the requester who initiated the start request is located; a first sound zone for the voice interaction is locked based on the target area; the voice interaction is performed based on a first voice collected in the first sound zone; and the step of performing target detection based on the captured image and determining the target area where the requester is located, together with the subsequent steps, is re-executed. In this way, the sound zone can be dynamically locked as the requester moves, improving the reliability and continuity of voice interaction.
Drawings
FIG. 1 is a schematic flow chart diagram illustrating an embodiment of a voice interaction method of the present application;
FIG. 2 is a schematic diagram of an embodiment of sound zone division;
FIG. 3 is a schematic diagram of an embodiment of dividing sound zones via the user terminal;
FIG. 4 is a schematic flow chart diagram illustrating another embodiment of a voice interaction method of the present application;
FIG. 5 is a block diagram of an embodiment of a voice interaction apparatus;
FIG. 6 is a block diagram of an embodiment of an electronic device of the present application;
FIG. 7 is a block diagram of an embodiment of a computer-readable storage medium of the present application.
Detailed Description
The following describes in detail the embodiments of the present application with reference to the drawings attached hereto.
In the following description, for purposes of explanation rather than limitation, specific details such as particular system architectures, interfaces, and techniques are set forth in order to provide a thorough understanding of the present application.
The terms "system" and "network" are often used interchangeably herein. The term "and/or" herein merely describes an association relationship between associated objects, indicating that three relationships may exist; for example, A and/or B may mean: A exists alone, A and B exist simultaneously, or B exists alone. In addition, the character "/" herein generally indicates an "or" relationship between the associated objects before and after it. Further, the term "plurality" herein means two or more.
Referring to fig. 1, fig. 1 is a schematic flowchart of an embodiment of the voice interaction method of the present application. Specifically, the method may include the following steps:
Step S11: in response to detecting a start request for voice interaction, perform target detection based on a captured image, and determine the target area where the requester who initiated the start request is located.
In one implementation scenario, the start request may be initiated by voice. Specifically, the requester may utter a voice containing a wake-up word, so that the voice interaction device performing the voice interaction determines that a start request for voice interaction is detected when it recognizes that the collected voice contains the wake-up word. For example, taking the voice interaction device as an in-vehicle device, a person at any position in the vehicle (e.g., the driver's seat, the front passenger seat, the left rear seat, or the right rear seat) may act as the requester and initiate a start request by voice; or, taking the voice interaction device as a smart home device, a person at any indoor location (e.g., the living room, the dining room, or a bedroom) may act as the requester and initiate a start request by voice. Other scenarios can be deduced by analogy and are not enumerated here.
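For illustration only, the following Python sketch shows one possible way to detect such a wake-word-based start request from an already recognized transcript; the wake words and the function name are assumptions made for this example and are not prescribed by the present application.

```python
# Minimal sketch of wake-word-based start-request detection. It assumes the
# transcript of a short audio buffer has already been produced by a speech
# recognizer; the wake words below are illustrative placeholders only.
WAKE_WORDS = ("hello assistant", "hi assistant")

def is_start_request(transcript: str) -> bool:
    """Return True if the recognized text contains a wake word."""
    text = transcript.lower().strip()
    return any(wake in text for wake in WAKE_WORDS)

if __name__ == "__main__":
    print(is_start_request("Hello assistant, play some music"))  # True
    print(is_start_request("turn left at the next junction"))    # False
```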
In another implementation scenario, unlike the above, the start request may be initiated by a key. Specifically, the requester may press a key on the voice interaction device performing the voice interaction, so that the device determines that a start request for voice interaction is detected in response to the key being triggered. For example, taking the voice interaction device as an in-vehicle device, the device may include keys arranged at multiple positions in the vehicle (e.g., the driver's seat, the front passenger seat, the left rear seat, and the right rear seat), so that a person at any position in the vehicle may act as the requester and initiate a start request via a key; or, taking the voice interaction device as a smart home device, a person anywhere indoors (e.g., in the living room, the dining room, or a bedroom) may walk to the voice interaction device and trigger its key, thereby acting as the requester and initiating a start request via the key. Other scenarios can be deduced by analogy and are not enumerated here.
In one implementation scenario, after the start request is detected, a captured image may be acquired and subjected to target detection to determine the target area where the requester is located. It should be noted that the voice interaction device may integrate a camera for capturing the environment in which the device is located, thereby obtaining the captured image. Taking the voice interaction device as an in-vehicle device as an example, it may include a camera arranged at the top of the vehicle interior (or at the front of the cabin) for capturing the in-vehicle environment to obtain the captured image; or, taking the voice interaction device as a smart home device as an example, a camera may be arranged on top of the device for capturing the indoor environment to obtain the captured image. In particular, considering that the indoor environment may be relatively spacious, this camera may be a wide-angle camera. Other scenarios can be deduced by analogy and are not enumerated here.
In a specific implementation scenario, to facilitate subsequent locking of sound zones, the sound pickup range of the voice interaction device may be divided into several sound zones in advance. Referring to fig. 2, fig. 2 is a schematic diagram of an embodiment of sound zone division. As shown in fig. 2, for convenience of description, taking a rectangular pickup range as an example, the pickup range can be divided into four sound zones arranged in two rows and two columns, denoted sound zone 1, sound zone 2, sound zone 3, and sound zone 4. In practical applications, the pickup range is not limited to a rectangle and may take other shapes, such as a circle. For example, a circular pickup range may be divided into several sound zones by dividing lines passing through its center, e.g., into four sound zones each spanning a 90-degree central angle. Other cases can be deduced by analogy and are not enumerated here.
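As a sketch of the rectangular case (see also the examples after the next paragraphs), the pickup range can be modeled as a 2 x 2 grid and a position mapped to its sound zone; the coordinate system, dimensions, and row-major zone numbering below are assumptions chosen for this illustration.

```python
# Illustrative sketch of dividing a rectangular sound pickup range into the
# 2 x 2 grid of sound zones of FIG. 2 and mapping a position to its zone.
from dataclasses import dataclass

@dataclass
class PickupRange:
    width: float   # extent of the pickup range along x, e.g. in metres
    height: float  # extent along y

    def zone_of(self, x: float, y: float) -> int:
        """Map a point (x, y) inside the pickup range to sound zone 1..4,
        numbered row-major: 1 top-left, 2 top-right, 3 bottom-left,
        4 bottom-right (an assumed numbering matching FIG. 2)."""
        col = 0 if x < self.width / 2 else 1
        row = 0 if y < self.height / 2 else 1
        return row * 2 + col + 1

cabin = PickupRange(width=2.0, height=3.0)
print(cabin.zone_of(0.5, 2.0))  # bottom-left point -> zone 3
```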
In a specific implementation scenario, referring to fig. 2, taking the voice interaction device as an in-vehicle device whose sound pickup range is the vehicle interior, the pickup range can be divided into four sound zones according to the driver's area, the front passenger area, the left rear seat, and the right rear seat; for example, sound zone 3 corresponds to the driver's area, sound zone 1 to the front passenger area, sound zone 4 to the left rear seat area, and sound zone 2 to the right rear seat area. Or, taking the voice interaction device as a smart home device located at the intersection of the four sound zones shown in fig. 2, with its pickup range being the largest rectangle shown in fig. 2, the pickup range can be divided into four sound zones corresponding to different indoor spaces; for example, sound zone 3 corresponds to the living room, sound zone 4 to the dining room, sound zone 2 to the kitchen, and sound zone 1 to a bedroom. Other cases can be deduced by analogy and are not enumerated here. It should be noted that the above correspondences are merely one possible implementation in practice and do not limit the specific manner of dividing the sound zones.
In a specific implementation scenario, when the voice interaction device is used for the first time, a user terminal (e.g., a smartphone or a tablet computer) may establish a communication connection with the voice interaction device, for example via Bluetooth or a wireless local area network, which is not limited here. On this basis, the user terminal can display a real-time image captured by the voice interaction device of its environment and receive pickup data sent by the voice interaction device, where the pickup data may include the device's sound pickup range. The user terminal can then highlight, in a preset pattern, the image area of the real-time image that lies within the pickup range. For example, that image area may be filled with a color (e.g., red) at a certain transparency (e.g., 60%) to indicate that the user can divide sound zones within it. On this basis, the user terminal may further receive a dividing instruction operated by the user on the real-time image, thereby initializing several sound zones. Referring to fig. 3, fig. 3 is a schematic diagram of an embodiment of dividing sound zones via the user terminal. As shown in fig. 3, taking the voice interaction device as an in-vehicle device, its camera may be arranged at the top of the vehicle interior so as to capture each position in the vehicle. The real-time image displayed by the user terminal is a top view of the vehicle interior, with the image area inside the pickup range filled in gray (in fig. 3 the entire real-time image lies within the pickup range). The user can issue a dividing instruction on the real-time image, for example drawing one horizontal and one vertical dividing line, thereby dividing the pickup range into four sound zones: one for the driver's area, one for the front passenger area, one for the left rear seat area, and one for the right rear seat area. Other cases can be deduced by analogy and are not enumerated here.
In a specific implementation scenario, target detection may be performed on the captured image to obtain the target position of the requester in the captured image, so that the target area of the requester in the environment of the voice interaction device can be determined from that position. Specifically, to improve detection efficiency, a target detection model may be trained in advance and used to detect the requester's position in the captured image. The target detection model may include, but is not limited to, Fast R-CNN, YOLO, and the like; the network structure of the model is not limited here. In addition, since the voice interaction device usually does not move during voice interaction, the background of the captured image is essentially consistent with that of the real-time image, so the requester's target area can be determined from the requester's target position. Referring to fig. 3, if the requester is detected at the driver's position in the lower left corner of the captured image, it can be determined that the requester is in the driver's area, i.e., the target area is the driver's area; similarly, if the requester is detected at the front passenger position in the upper left corner of the captured image, the target area is the front passenger area. Other cases can be deduced by analogy and are not enumerated here.
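The mapping from a detected position to a target area can be sketched as follows; detect_persons() is a hypothetical wrapper around whatever detector (e.g., Fast R-CNN or YOLO) has been trained, and the zone layout reuses the assumed 2 x 2 grid above.

```python
# Hedged sketch: run a (hypothetical) person detector on the captured image
# and map the centre of the requester's bounding box to a sound zone.
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x, y, w, h) in pixels

def detect_persons(image) -> List[Box]:
    """Placeholder for a trained detector such as Fast R-CNN or YOLO."""
    raise NotImplementedError

def target_area(box: Box, img_w: int, img_h: int) -> int:
    """Map the centre of a person box to sound zone 1..4 (2 x 2 grid)."""
    x, y, w, h = box
    cx, cy = x + w / 2, y + h / 2
    col = 0 if cx < img_w / 2 else 1
    row = 0 if cy < img_h / 2 else 1
    return row * 2 + col + 1

# Example: a box centred in the lower-left quadrant maps to zone 3,
# i.e. the driver's area in the layout of FIG. 3.
print(target_area((50.0, 300.0, 100.0, 150.0), img_w=640, img_h=480))  # 3
```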
In a specific implementation scenario, to allow the voice interaction device to continue performing voice interaction accurately after being repositioned, the device may check at preset intervals whether the background of the captured image has changed, and if so, prompt the user to reset the sound zones. The user can then operate the user terminal to re-establish a communication connection with the voice interaction device and re-execute the step of displaying the real-time image of the device's environment and receiving the pickup data sent by the device, together with the subsequent steps, thereby resetting the sound zones.
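The periodic background check might, for instance, compare the current frame with a stored reference frame; the difference measure and threshold below are assumptions for illustration, as the application does not fix a particular mechanism.

```python
# Sketch of the periodic background-change check using mean absolute
# difference between a reference background and the current frame.
import numpy as np

def background_changed(reference: np.ndarray, frame: np.ndarray,
                       threshold: float = 25.0) -> bool:
    """Return True if the scene differs enough to suggest the device moved;
    the threshold value is an illustrative assumption."""
    diff = np.abs(reference.astype(np.int16) - frame.astype(np.int16))
    return float(diff.mean()) > threshold

reference = np.zeros((480, 640), dtype=np.uint8)
print(background_changed(reference, reference))  # False: identical frames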
Step S12: based on the target area, lock a first sound zone for the voice interaction.
In the embodiment of the present disclosure, the voice interaction is performed based on the first voice collected in the first sound zone; that is, although voice signals may also be collected in sound zones other than the first sound zone, those signals are masked during the voice interaction, so the interaction proceeds only on the basis of the first voice collected in the first sound zone. Continuing to refer to fig. 3, if the locked first sound zone is the driver's area in the lower left corner, the voice interaction may be performed based only on the first voice collected in the driver's area, while voice signals collected in the front passenger area, the left rear seat area, and the right rear seat area are masked during the interaction. Other cases can be deduced by analogy and are not enumerated here. In addition, during voice interaction, the direction of each collected voice signal can be determined by techniques such as sound source localization; if the direction of a voice signal is consistent with the first sound zone, the signal can be used as the first voice; otherwise, it can be masked during the interaction. For the process of determining a voice signal's direction, reference may be made to the technical details of sound source localization, which are not repeated here.
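One possible realization of this masking, sketched below under stated assumptions, estimates the direction of arrival (DOA) of each audio frame and keeps the frame only if its direction falls inside the locked sound zone; the sector table assumes the circular, four-sector division mentioned earlier, and estimate_doa() stands in for an actual sound source localization routine.

```python
# Hedged sketch of direction-based masking for a circular pickup range
# divided into four 90-degree sectors (sector angles are assumptions).
ZONE_SECTORS = {1: (0.0, 90.0), 2: (90.0, 180.0),
                3: (180.0, 270.0), 4: (270.0, 360.0)}

def estimate_doa(frame) -> float:
    """Placeholder for a real sound source localization routine; it would
    return the azimuth of the dominant source in degrees."""
    raise NotImplementedError

def accept_frame(doa_deg: float, locked_zone: int) -> bool:
    """Keep a frame only if its DOA lies inside the locked zone's sector;
    frames from other directions are masked during the interaction."""
    lo, hi = ZONE_SECTORS[locked_zone]
    return lo <= doa_deg % 360.0 < hi

print(accept_frame(200.0, locked_zone=3))  # True: 200 degrees is in zone 3
print(accept_frame(45.0, locked_zone=3))   # False: masked
```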
In one implementation scenario, after the target area is determined, the target area may be locked as the first sound zone of the voice interaction. Referring to fig. 3, if the target area is determined to be the driver's area, the driver's area may be locked as the first sound zone of the voice interaction; similarly, if the target area is determined to be the front passenger area, the front passenger area can be locked as the first sound zone. Other cases can be deduced by analogy and are not enumerated here.
In another implementation scenario, unlike the foregoing manner, in order to adapt to scenarios in which the requester may move considerably during voice interaction (for example, after the requester in the driver's area initiates a start request, the requester turns his or her head toward the storage compartment while speaking), the target area and the sound zones adjacent to it may be locked together as the first sound zone of the voice interaction.
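A minimal sketch of both locking strategies (target area only, or target area plus adjacent zones) follows; the adjacency table is an assumption read off the 2 x 2 layout of FIG. 2.

```python
# Locking the first sound zone, optionally together with adjacent zones,
# for the 2 x 2 layout of FIG. 2 (adjacency assumed from the figure).
ADJACENT = {1: {2, 3}, 2: {1, 4}, 3: {1, 4}, 4: {2, 3}}

def lock_first_sound_zone(target_zone: int,
                          include_adjacent: bool = False) -> set:
    """Return the set of zones whose audio is used for voice interaction."""
    locked = {target_zone}
    if include_adjacent:
        locked |= ADJACENT[target_zone]
    return locked

print(lock_first_sound_zone(3))                         # {3}
print(lock_first_sound_zone(3, include_adjacent=True))  # e.g. {1, 3, 4}
```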
It should be noted that after the first sound zone of the voice interaction is locked, the first voice collected in the first sound zone can be acquired, and the voice interaction with the requester can be performed based on that first voice.
In one implementation scenario, to further improve the robustness of voice interaction, before the first sound zone is locked based on the target area, it may be detected whether the target area lies within the coverage of the voice interaction. If it does, the step of locking the first sound zone of the voice interaction based on the target area is executed; otherwise, it can be determined that voice interaction is currently not supported, and the requester may be prompted accordingly, for example by broadcasting a prompt such as "I cannot hear you from where you are; please change your position." It should be noted that the coverage of the voice interaction is the sound pickup range of the voice interaction device. In this manner, detecting whether the target area is within the coverage of the voice interaction before locking the first sound zone, and locking the first sound zone based on the target area only when it is, ensures that subsequent voice interaction is executed only after the target area is confirmed to be within coverage, which improves the robustness of voice interaction; and prompting the requester that voice interaction is not currently supported when the target area is outside the coverage lets the user perceive the situation, improving the user experience.
Step S13: re-execute the step of performing target detection based on the captured image and determining the target area where the requester who initiated the start request is located, together with the subsequent steps.
In one implementation scenario, after the first sound zone of the voice interaction is locked, the step of performing target detection based on the captured image and determining the target area where the requester is located, together with the subsequent steps, can be re-executed, so that the sound zone is dynamically locked as the requester moves.
In another implementation scenario, unlike the foregoing manner, in order to reduce the considerable overhead of performing target detection on captured images, before re-executing the step of performing target detection based on the captured image and determining the target area where the requester is located and the subsequent steps, it may first be predicted whether the currently locked first sound zone is at risk of change; if so, that step and the subsequent steps are re-executed. For the specific process of risk prediction, reference may be made to the embodiments disclosed below, which are not repeated here.
According to the above scheme, in response to detecting a start request for voice interaction, target detection is performed based on a captured image to determine the target area where the requester who initiated the start request is located; a first sound zone for the voice interaction is locked based on the target area; the voice interaction is performed based on a first voice collected in the first sound zone; and the step of performing target detection based on the captured image and determining the target area where the requester is located, together with the subsequent steps, is re-executed. In this way, the sound zone can be dynamically locked as the requester moves, improving the reliability and continuity of voice interaction.
Referring to fig. 4, fig. 4 is a flowchart illustrating a voice interaction method according to another embodiment of the present application. Specifically, the following steps may be included:
step S401: and receiving a voice interaction starting request initiated by a requester.
Reference may be made to the above-mentioned embodiments, which are not described herein again.
Step S402: perform target detection based on the captured image, and determine the target area where the requester who initiated the start request is located.
Reference may be made specifically to the foregoing disclosure embodiments, which are not described herein again.
Step S403: detect whether the target area is within the coverage of the voice interaction; if so, execute step S404; otherwise, execute step S408.
It should be noted that the coverage of the voice interaction is the sound pickup range of the voice interaction device. If the target area is within the coverage of the voice interaction, it can be confirmed that voice interaction can proceed normally; otherwise, voice interaction cannot proceed normally.
Step S404: based on the target area, lock a first sound zone for the voice interaction.
Reference may be made specifically to the foregoing disclosure embodiments, which are not described herein again.
Step S405: predict whether the currently locked first sound zone is at risk of change; if so, execute step S406; otherwise, execute step S407.
Specifically, if the currently locked first sound zone is predicted to be at risk of change, step S402 and its subsequent steps are re-executed, so that target detection is performed on the newly acquired captured image and the locked first sound zone can be changed dynamically and promptly whenever a change risk exists; if no change risk is predicted, step S405 is executed again to keep predicting whether the currently locked first sound zone is at risk of change, so that as soon as a risk is predicted, the locked sound zone can be updated in time. For example, while the first voice is collected in the first sound zone, a second voice may be collected simultaneously in a second sound zone adjacent to the first sound zone. Referring to fig. 2, if sound zone 3 is the first sound zone, the adjacent second sound zone may include at least one of sound zone 1 and sound zone 4. Other cases can be deduced by analogy and are not enumerated here. On this basis, a prediction can be made based on the first voice and the second voice to determine whether the currently locked first sound zone is at risk of change. In this manner, by acquiring a second voice collected in a second sound zone adjacent to the first sound zone and predicting from the first and second voices alone, the change risk of the currently locked first sound zone can be predicted without processing the captured image, which reduces the computational load.
In one implementation scenario, after the first voice and the second voice are obtained, a first change condition of the first voice's signal strength and a second change condition of the second voice's signal strength can be acquired, and whether the currently locked first sound zone is at risk of change can be determined from the first and second change conditions. It should be noted that the first change condition may include any of: the signal strength of the first voice becomes stronger, does not become stronger (i.e., weakens or stays constant), weakens, or does not weaken (i.e., strengthens or stays constant); similarly, the second change condition may include: the signal strength of the second voice becomes stronger, does not become stronger, weakens, or does not weaken. Furthermore, the change in signal strength is a change along the time dimension: the first change condition describes how the signal strength of the first voice at the current acquisition compares with that at the previous acquisition, and the second change condition likewise compares the second voice's current and previous acquisitions. In this manner, whether the currently locked first sound zone is at risk of change is determined from the first and second change conditions of signal strength; that is, the risk can be predicted merely by measuring signal strength changes, which helps to further reduce the computational load.
In a specific implementation scenario, in response to the first change condition including that the signal strength of the first voice weakens and the second change condition including that the signal strength of the second voice does not weaken, it may be determined that the currently locked first sound zone is at risk of change. In this manner, as soon as the requester shows a tendency to move away from the first sound zone, the currently locked first sound zone is immediately judged to be at risk of change and target detection is performed again on the latest captured image, which helps to change the locked first sound zone even more promptly.
In a specific implementation scenario, in response to the first change condition including that the signal strength of the first voice does not weaken and the second change condition including that the signal strength of the second voice weakens, it may be determined that the currently locked first sound zone is not at risk of change. In this case there is no need to perform target detection on the latest captured image again, which helps to reduce the computational load.
In a specific implementation scenario, to maximize the immediacy of dynamically locking the sound zone, in cases other than the above two (e.g., when the signal strengths of both the first and second voices weaken, or when neither weakens), it may be determined that the currently locked first sound zone is at risk of change, so that target detection can again be performed on the latest captured image.
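The three cases above can be condensed into a small decision routine; the sketch below uses RMS energy as the signal-strength measure, which is an assumption, since the application does not fix a particular measure.

```python
# Hedged sketch of the change-risk decision from the first and second
# change conditions; RMS energy stands in for "signal strength".
import math

def rms(samples) -> float:
    """Root-mean-square energy of an audio frame (one strength measure)."""
    return math.sqrt(sum(s * s for s in samples) / max(len(samples), 1))

def change_risk(first_prev: float, first_cur: float,
                second_prev: float, second_cur: float) -> bool:
    """True if the currently locked first sound zone is at risk of change."""
    first_weakened = first_cur < first_prev
    second_weakened = second_cur < second_prev
    if first_weakened and not second_weakened:
        return True   # requester tends to leave the first sound zone
    if not first_weakened and second_weakened:
        return False  # requester stays in the first sound zone
    return True       # remaining cases are treated as risky for promptness

print(change_risk(0.8, 0.5, 0.2, 0.4))  # True: first weakened, second not
print(change_risk(0.8, 0.9, 0.4, 0.2))  # False: first not weakened
```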
In one implementation scenario, to further improve the immediacy of dynamically locking the sound zone, after the first and second voices are obtained and before the prediction is made from them, it may be detected whether the first and second voices are valid speech (i.e., contain a human voice signal); only when both are valid speech is the prediction made based on the first and second voices to determine whether the currently locked first sound zone is at risk of change. In this manner, checking validity first ensures that the signal-strength change conditions reflect the requester's movement tendency as accurately as possible, so that predicting the change risk from those conditions is as accurate as possible.
In a specific implementation scenario, unlike the case where both voices are valid speech, if at least one of the first and second voices is blank (i.e., contains no human voice signal), the step of performing target detection based on the captured image and determining the target area where the requester who initiated the start request is located, together with the subsequent steps, may be re-executed; that is, target detection is performed again on the latest captured image. In this case the requester can be considered not to be speaking; immediately re-running target detection on the latest captured image avoids failing to follow a moving requester in time, which would break subsequent voice interaction, and thus further improves the immediacy of dynamically locking the sound zone.
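A simple energy-based check can stand in for a full voice activity detector in this flow; the threshold and function names are illustrative assumptions.

```python
# Sketch of the validity gate before risk prediction: predict only when
# both voices are valid speech, otherwise fall back to target detection.
def is_valid_speech(samples, energy_threshold: float = 1e-3) -> bool:
    """Treat a frame as valid speech if its mean energy exceeds a threshold
    (a crude stand-in for a real voice activity detector)."""
    if not samples:
        return False
    return sum(s * s for s in samples) / len(samples) > energy_threshold

def should_predict(first_voice, second_voice) -> bool:
    """Risk prediction runs only when both voices are valid speech; if not,
    target detection on the latest captured image is re-executed instead."""
    return is_valid_speech(first_voice) and is_valid_speech(second_voice)

print(should_predict([0.2, -0.3, 0.25], [0.1, -0.15, 0.12]))  # True
print(should_predict([0.0, 0.0, 0.0], [0.1, -0.15, 0.12]))    # False
```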
Step S406: step S402 and subsequent steps are re-executed.
Specifically, as described above, if the currently locked first sound zone is predicted to be at risk of change, the step of performing target detection based on the captured image and the subsequent steps may be re-executed to maximize the immediacy of dynamically locking the sound zone.
It should be noted that, in the embodiment of the present disclosure, when target detection based on a captured image is re-executed, the captured image referred to may be the latest one. For example, the voice interaction device may keep capturing images, and when it is determined that target detection needs to be performed again, the detection can be run on the most recently captured image.
Step S407: step S405 and subsequent steps are re-executed.
Specifically, as described above, if the currently locked first sound zone is predicted not to be at risk of change, whether the currently locked first sound zone is at risk of change may be predicted again.
Step S408: prompting the requestor that voice interaction is not currently supported.
Reference may be made to the related description in the foregoing embodiments, which are not repeated herein.
It should be noted that after the first sound zone of the voice interaction is locked in step S404, the requester may also speak during the risk prediction, the re-executed target detection, and similar processes; in that case voice interaction can still be performed based on the first voice collected in the currently locked first sound zone, and once the first sound zone of the voice interaction is re-locked, the interaction proceeds based on the first voice collected in the newly locked first sound zone.
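Putting steps S401 to S408 together, the overall control flow can be summarized by the following skeleton; every method here is a hypothetical placeholder for the routines sketched earlier, not an interface defined by this application.

```python
# High-level sketch of the control loop of FIG. 4 (steps S401-S408).
class VoiceInteractionLoop:
    # Placeholders to be backed by real implementations.
    def detect_target_area(self):   raise NotImplementedError  # S402
    def in_coverage(self, area):    raise NotImplementedError  # S403
    def lock_zone(self, area):      raise NotImplementedError  # S404
    def has_change_risk(self):      raise NotImplementedError  # S405
    def prompt_unsupported(self):   raise NotImplementedError  # S408

    def run(self):
        """Entered once a start request has been received (S401)."""
        while True:
            area = self.detect_target_area()   # S402, on the latest frame
            if not self.in_coverage(area):     # S403
                self.prompt_unsupported()      # S408
                return
            self.lock_zone(area)               # S404
            while not self.has_change_risk():  # S405 -> S407 loop
                pass  # interaction continues in the locked first sound zone
            # change risk predicted (S406): loop back to S402
```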
According to the above scheme, after the first sound zone of the voice interaction is locked based on the target area, and before the step of performing target detection based on the captured image and determining the target area where the requester who initiated the start request is located, together with the subsequent steps, is re-executed, it is predicted whether the currently locked first sound zone is at risk of change, and only when such a risk exists are those steps re-executed. In this way, the frequency of performing target detection on captured images is reduced as much as possible while interaction is maintained as the requester moves, thereby reducing the computational load while improving the reliability and continuity of voice interaction.
Referring to fig. 5, fig. 5 is a schematic block diagram of a voice interaction apparatus 50 according to an embodiment of the present application. The voice interaction apparatus 50 includes an area determining module 51, a sound zone locking module 52, and a loop execution module 53. The area determining module 51 is configured to, in response to a start request for voice interaction, perform target detection based on a captured image and determine the target area where the requester who initiated the start request is located; the sound zone locking module 52 is configured to lock a first sound zone for the voice interaction based on the target area, where the voice interaction is performed based on a first voice collected in the first sound zone; and the loop execution module 53 is configured to re-execute the step of performing target detection based on the captured image and determining the target area where the requester who initiated the start request is located, together with the subsequent steps.
According to the above scheme, in response to detecting a start request for voice interaction, target detection is performed based on a captured image to determine the target area where the requester who initiated the start request is located; a first sound zone for the voice interaction is locked based on the target area; the voice interaction is performed based on a first voice collected in the first sound zone; and the step of performing target detection based on the captured image and determining the target area where the requester is located, together with the subsequent steps, is re-executed. In this way, the sound zone can be dynamically locked as the requester moves, improving the reliability and continuity of voice interaction.
In some disclosed embodiments, the voice interaction apparatus 50 further includes a risk prediction module configured to predict whether the currently locked first sound zone is at risk of change, and the loop execution module 53 is specifically configured to, in response to the currently locked first sound zone being at risk of change, re-execute the step of performing target detection based on the captured image and determining the target area where the requester who initiated the start request is located, together with the subsequent steps.
In some disclosed embodiments, the risk prediction module includes an acquisition submodule for acquiring a second voice collected in a second sound zone, wherein the second sound zone is adjacent to the first sound zone; the risk prediction module includes a prediction submodule configured to predict based on the first voice and the second voice, and determine whether the currently locked first sound zone is at risk of change.
In some disclosed embodiments, the prediction submodule includes a strength variation acquiring unit configured to acquire a first change condition of the first voice with respect to signal strength and a second change condition of the second voice with respect to signal strength; the prediction submodule includes a change risk determination unit for determining, based on the first change condition and the second change condition, whether the currently locked first sound zone is at risk of change.
In some disclosed embodiments, the change risk determination unit includes a first determination subunit for determining that the currently locked first sound zone is at risk of change in response to the first change condition including that the signal strength of the first voice weakens and the second change condition including that the signal strength of the second voice does not weaken; the change risk determination unit includes a second determination subunit for determining that the currently locked first sound zone is not at risk of change in response to the first change condition including that the signal strength of the first voice does not weaken and the second change condition including that the signal strength of the second voice weakens.
In some disclosed embodiments, the risk prediction module includes a detection submodule for detecting whether the first voice and the second voice are valid speech; the prediction submodule is specifically configured to, in response to both the first voice and the second voice being valid speech, predict based on the first voice and the second voice and determine whether the currently locked first sound zone is at risk of change.
In some disclosed embodiments, the loop execution module 53 is further configured to, in response to at least one of the first voice and the second voice being blank voice, re-execute the step of performing target detection based on the captured image and determining the target area where the requester who initiated the start request is located, together with the subsequent steps.
In some disclosed embodiments, the voice interaction apparatus 50 further comprises a loop prediction module for, in response to the currently locked first sound zone not being at risk of change, re-executing the step of predicting whether the currently locked first sound zone is at risk of change, together with the subsequent steps.
In some disclosed embodiments, the voice interaction apparatus 50 further includes a coverage detection module for detecting whether the target area is within the coverage of the voice interaction; the sound zone locking module 52 is specifically configured to lock the first sound zone of the voice interaction based on the target area in response to the target area being within the coverage of the voice interaction.
In some disclosed embodiments, the voice interaction apparatus 50 further comprises an interaction prompt module for prompting the requester that voice interaction is not currently supported in response to the target area not being within the coverage of the voice interaction.
Referring to fig. 6, fig. 6 is a schematic block diagram of a voice interaction device 60 according to an embodiment of the present application. The voice interaction device 60 includes a memory 61, a processor 62, a microphone 63, and a camera 64, where the memory 61, the microphone 63, and the camera 64 are respectively coupled to the processor 62; the memory 61 stores program instructions, and the processor 62 is configured to execute the program instructions to implement the steps in any of the above voice interaction method embodiments. Specifically, the voice interaction device 60 may include, but is not limited to: desktop computers, notebook computers, servers, mobile phones, tablet computers, smart speakers, learning robots, storytelling robots, in-vehicle terminals, head units, and the like, without limitation. It should be noted that the number of microphones 63 included in the voice interaction device 60 is not limited to the two shown in fig. 6 and may also be three, four, or more. Likewise, the number of cameras 64 included in the voice interaction device 60 is not limited to the one shown in fig. 6 and may also be two, three, or more.
In particular, the processor 62 is configured to control itself, as well as the memory 61, the microphone 63, and the camera 64, to implement the steps in any of the above voice interaction method embodiments. The processor 62 may also be referred to as a CPU (Central Processing Unit). The processor 62 may be an integrated circuit chip having signal processing capabilities. The processor 62 may also be a general-purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or discrete hardware components. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like. In addition, the processor 62 may be implemented jointly by multiple integrated circuit chips.
According to the above scheme, processes such as target detection and sound zone locking can be executed repeatedly in a loop during voice interaction, so that even if the requester moves during the interaction, the sound zone is dynamically locked to follow the movement. This ensures, as far as possible, that the movement does not go beyond the locked sound zone and that the interaction voice is not masked; in other words, interaction is maintained while moving, which further improves the reliability and continuity of voice interaction.
Referring to fig. 7, fig. 7 is a block diagram illustrating an embodiment of a computer readable storage medium 70 according to the present application. The computer readable storage medium 70 stores program instructions 71 capable of being executed by the processor, the program instructions 71 being configured to implement the steps in any of the above-described embodiments of the voice interaction method.
According to the above scheme, processes such as target detection and sound zone locking can be executed repeatedly in a loop during voice interaction, so that even if the requester moves during the interaction, the sound zone is dynamically locked to follow the movement. This ensures, as far as possible, that the movement does not go beyond the locked sound zone and that the interaction voice is not masked; in other words, interaction is maintained while moving, which further improves the reliability and continuity of voice interaction.
In some embodiments, the functions or modules of the apparatus provided in the embodiments of the present disclosure may be used to execute the methods described in the above method embodiments; for specific implementations, reference may be made to the descriptions of those method embodiments, which, for brevity, are not repeated here.
The descriptions of the various embodiments above tend to emphasize the differences between them; for their same or similar parts, the embodiments may be cross-referenced, and for brevity those parts are not repeated here.
In the several embodiments provided in the present application, it should be understood that the disclosed method and apparatus may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, a division of a module or a unit is merely a logical division, and an actual implementation may have another division, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection of devices or units through some interfaces, and may be in an electrical, mechanical or other form.
Units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application may be substantially implemented or contributed to by the prior art, or all or part of the technical solution may be embodied in a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) or a processor (processor) to execute all or part of the steps of the method according to the embodiments of the present application. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk, and various media capable of storing program codes.
If the technical solution of the present application involves personal information, a product applying this solution shall clearly inform users of the personal information processing rules and obtain their separate consent before processing the personal information. If the technical solution involves sensitive personal information, the product shall obtain separate consent before processing it and shall additionally satisfy the requirement of "explicit consent". For example, a clear and conspicuous sign may be set up at a personal information collection device such as a camera to inform people that they are entering a personal information collection range and that personal information will be collected; if a person voluntarily enters the collection range, this is regarded as consent to the collection of his or her personal information. Alternatively, on a device that processes personal information, with the personal information processing rules communicated via conspicuous signs or notices, personal authorization may be obtained by means such as pop-up messages or asking the person to upload his or her personal information. The personal information processing rules may include information such as the personal information processor, the purpose and method of processing, and the types of personal information to be processed.

Claims (13)

1. A voice interaction method, comprising:
in response to detecting a start request for voice interaction, performing target detection based on a captured image, and determining a target area where a requester who initiated the start request is located;
locking a first sound zone for the voice interaction based on the target area; wherein the voice interaction is performed based on a first voice collected in the first sound zone;
and re-executing the step of performing target detection based on the captured image and determining the target area where the requester who initiated the start request is located, and the subsequent steps.
2. The method according to claim 1, wherein after the locking of the first sound zone of the voice interaction based on the target area, and before the re-executing of the step of performing target detection based on the captured image and determining the target area where the requester who initiated the start request is located and the subsequent steps, the method further comprises:
predicting whether the currently locked first sound zone is at risk of change;
the re-executing of the step of performing target detection based on the captured image and determining the target area where the requester who initiated the start request is located and the subsequent steps comprises:
in response to the currently locked first sound zone being at risk of change, re-executing the step of performing target detection based on the captured image and determining the target area where the requester who initiated the start request is located, and the subsequent steps.
3. The method according to claim 2, wherein the predicting whether the currently locked first sound zone is at risk of change comprises:
acquiring a second voice collected in a second sound zone, wherein the second sound zone is adjacent to the first sound zone;
and performing prediction based on the first voice and the second voice to determine whether the currently locked first sound zone is at risk of change.
4. The method according to claim 3, wherein the performing prediction based on the first voice and the second voice to determine whether the currently locked first sound zone is at risk of change comprises:
acquiring a first change condition of the signal strength of the first voice, and acquiring a second change condition of the signal strength of the second voice;
and determining whether the currently locked first sound zone is at risk of change based on the first change condition and the second change condition.
5. The method according to claim 4, wherein the determining whether the currently locked first sound zone is at risk of change based on the first change condition and the second change condition comprises:
in response to the first change condition indicating that the signal strength of the first voice is weakening and the second change condition indicating that the signal strength of the second voice is not weakening, determining that the currently locked first sound zone is at risk of change;
and in response to the first change condition indicating that the signal strength of the first voice is not weakening and the second change condition indicating that the signal strength of the second voice is weakening, determining that the currently locked first sound zone is not at risk of change.
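Claims 3 to 5 reduce the change-risk prediction to comparing signal-strength trends between the locked zone and an adjacent zone. A minimal sketch follows, assuming RMS energy as the signal-strength measure and a relative drop margin; neither is fixed by the claims.

import math

def rms(samples):
    # Signal strength is taken here as RMS energy; the claims do not
    # prescribe a particular measure, so this is an assumption.
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def is_weakening(prev_strength, curr_strength, margin=0.1):
    # "Weakening" is modelled as a relative drop beyond a hypothetical
    # margin between two consecutive measurements.
    return curr_strength < prev_strength * (1.0 - margin)

def first_zone_change_risk(first_prev, first_curr, second_prev, second_curr):
    # Claim 5's rule: risk of change if the first voice weakens while
    # the second (adjacent-zone) voice does not; no risk if the first
    # voice holds while the second weakens. Other combinations are not
    # decided by the claim, so None is returned for them.
    first_weak = is_weakening(rms(first_prev), rms(first_curr))
    second_weak = is_weakening(rms(second_prev), rms(second_curr))
    if first_weak and not second_weak:
        return True
    if not first_weak and second_weak:
        return False
    return None

Per claims 2 and 8, a True result would trigger re-execution of target detection, while a False result keeps polling the currently locked zone.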
6. The method according to claim 3, wherein after the acquiring of the second voice collected in the second sound zone, and before the performing prediction based on the first voice and the second voice to determine whether the currently locked first sound zone is at risk of change, the method further comprises:
detecting whether the first voice and the second voice are valid voices;
and the performing prediction based on the first voice and the second voice to determine whether the currently locked first sound zone is at risk of change comprises:
in response to both the first voice and the second voice being valid voices, performing prediction based on the first voice and the second voice to determine whether the currently locked first sound zone is at risk of change.
7. The method according to claim 6, further comprising:
in response to at least one of the first voice and the second voice being a blank voice, re-executing the step of performing target detection based on the captured image and determining the target area where the requester who initiated the start request is located, and the subsequent steps.
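Claims 6 and 7 gate the change-risk prediction on both voices being valid. A sketch follows, under the assumption that validity is a simple energy test; a real system would more likely use a voice activity detector, which the claims leave open.

def is_valid_voice(samples, energy_threshold=1e-3):
    # Hypothetical validity test: near-silent audio counts as "blank".
    if not samples:
        return False
    return sum(s * s for s in samples) / len(samples) > energy_threshold

def gated_change_prediction(first_voice, second_voice, predict):
    # Claim 6: run the change-risk prediction only when both voices are
    # valid. Claim 7: if either voice is blank, fall back to re-running
    # target detection (modelled here by the sentinel string "relock").
    if is_valid_voice(first_voice) and is_valid_voice(second_voice):
        return predict(first_voice, second_voice)
    return "relock"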
8. The method according to claim 2, further comprising:
in response to the currently locked first sound zone not being at risk of change, re-executing the step of predicting whether the currently locked first sound zone is at risk of change, and the subsequent steps.
9. The method according to claim 1, wherein before the locking of the first sound zone for the voice interaction based on the target area, the method further comprises:
detecting whether the target area is within a coverage range of the voice interaction;
and the locking of the first sound zone for the voice interaction based on the target area comprises:
in response to the target area being within the coverage range of the voice interaction, locking the first sound zone for the voice interaction based on the target area.
10. The method according to claim 9, further comprising:
in response to the target area not being within the coverage range of the voice interaction, prompting the requester that voice interaction is not currently supported.
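Claims 9 and 10 add a coverage check before locking. A sketch follows, assuming coverage is expressed as a set of lockable zone indices; the patent does not specify a representation.

def try_lock_first_zone(target_area, coverage_zones, area_to_zone):
    # Claim 9: lock the first sound zone only if the target area falls
    # within the voice interaction's coverage range. Claim 10: otherwise
    # prompt the requester that voice interaction is not supported.
    zone = area_to_zone(target_area)
    if zone in coverage_zones:
        return zone
    print("Voice interaction is not currently supported at your position.")
    return None

if __name__ == "__main__":
    # Hypothetical usage: zone mapping stubbed with a constant lambda.
    locked = try_lock_first_zone((120, 40, 80, 160), {0, 1, 2, 3},
                                 lambda area: 1)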
11. A voice interaction apparatus, comprising:
an area determination module, configured to perform, in response to detecting a start request for voice interaction, target detection based on a captured image and determine a target area where the requester who initiated the start request is located;
a sound zone locking module, configured to lock a first sound zone for the voice interaction based on the target area, wherein the voice interaction is performed based on a first voice collected in the first sound zone;
and a loop execution module, configured to re-execute the step of performing target detection based on the captured image and determining the target area where the requester who initiated the start request is located, and the subsequent steps.
12. A voice interaction device, comprising a microphone, a camera, a memory, and a processor, wherein the microphone, the camera, and the memory are each coupled to the processor, the memory stores program instructions, and the processor is configured to execute the program instructions to implement the voice interaction method according to any one of claims 1 to 10.
13. A computer-readable storage medium storing program instructions executable by a processor to implement the voice interaction method according to any one of claims 1 to 10.
CN202210930532.XA 2022-08-02 2022-08-02 Voice interaction method and related device, equipment and storage medium Pending CN115359789A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202210930532.XA CN115359789A (en) 2022-08-02 2022-08-02 Voice interaction method and related device, equipment and storage medium
PCT/CN2022/144166 WO2024027100A1 (en) 2022-08-02 2022-12-30 Speech interaction method, and related apparatus, device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210930532.XA CN115359789A (en) 2022-08-02 2022-08-02 Voice interaction method and related device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN115359789A (en) 2022-11-18

Family

ID=84001630

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210930532.XA Pending CN115359789A (en) 2022-08-02 2022-08-02 Voice interaction method and related device, equipment and storage medium

Country Status (2)

Country Link
CN (1) CN115359789A (en)
WO (1) WO2024027100A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2024027100A1 (en) * 2022-08-02 2024-02-08 科大讯飞股份有限公司 Speech interaction method, and related apparatus, device and storage medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111640428A (en) * 2020-05-29 2020-09-08 北京百度网讯科技有限公司 Voice recognition method, device, equipment and medium
CN111724797A (en) * 2019-03-22 2020-09-29 比亚迪股份有限公司 Voice control method and system based on image and voiceprint recognition and vehicle
CN112397065A (en) * 2020-11-04 2021-02-23 深圳地平线机器人科技有限公司 Voice interaction method and device, computer readable storage medium and electronic equipment
WO2021156946A1 (en) * 2020-02-04 2021-08-12 三菱電機株式会社 Voice separation device and voice separation method
CN113270095A (en) * 2021-04-26 2021-08-17 镁佳(北京)科技有限公司 Voice processing method, device, storage medium and electronic equipment
CN113990319A (en) * 2021-10-28 2022-01-28 思必驰科技股份有限公司 Voice interaction method and system for vehicle-mounted sound system

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP6551155B2 (en) * 2015-10-28 2019-07-31 株式会社リコー Communication system, communication apparatus, communication method and program
CN110310633B (en) * 2019-05-23 2022-05-20 阿波罗智联(北京)科技有限公司 Multi-vocal-zone voice recognition method, terminal device and storage medium
CN114089945A (en) * 2021-10-29 2022-02-25 歌尔科技有限公司 Volume real-time adjustment method, electronic device and readable storage medium
CN114194128A (en) * 2021-12-02 2022-03-18 广州小鹏汽车科技有限公司 Vehicle volume control method, vehicle, and storage medium
CN115359789A (en) * 2022-08-02 2022-11-18 科大讯飞股份有限公司 Voice interaction method and related device, equipment and storage medium

Also Published As

Publication number Publication date
WO2024027100A1 (en) 2024-02-08

Similar Documents

Publication Publication Date Title
EP3163498B1 (en) Alarming method and device
WO2019206270A1 (en) Distance measurement method, intelligent control method and apparatus, electronic device and storage medium
CN110077402B (en) Target object tracking method, target object tracking device and storage medium
CN107278301B (en) Method and device for assisting user in finding object
US20200327353A1 (en) Image processing method and apparatus, electronic device, and storage medium
WO2020238699A1 (en) State detection method and device for signal indicator, driving control method and device
EP2913780A2 (en) Clustering method, apparatus and terminal apparatus
CN112368748A (en) Electronic device for performing operation based on moving direction of external electronic device and method thereof
CN106557759B (en) Signpost information acquisition method and device
US20170310885A1 (en) Object monitoring method and device
US11995934B2 (en) Electronic device for controlling entry or exit by using wireless communication, and method therefor
CN111261160B (en) Signal processing method and device
CN105426658A (en) Vehicle pre-starting method and related apparatus
CN114267041B (en) Method and device for identifying object in scene
CN109599104A (en) Multi-beam choosing method and device
CN115359789A (en) Voice interaction method and related device, equipment and storage medium
US20200151855A1 (en) Noise processing method and apparatus
CN112633218B (en) Face detection method, face detection device, terminal equipment and computer readable storage medium
CN113627277A (en) Method and device for identifying parking space
CN112991439B (en) Method, device, electronic equipment and medium for positioning target object
US20190098127A1 (en) Driver device locking
CN115825979A (en) Environment sensing method and device, electronic equipment, storage medium and vehicle
CN113129484B (en) Door lock control method and device, automobile and storage medium
CN112990424B (en) Neural network model training method and device
CN112329909B (en) Method, apparatus and storage medium for generating neural network model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination