CN116705016A - Control method and device of voice interaction equipment, electronic equipment and medium - Google Patents
- Publication number
- CN116705016A (application CN202210183692.2A)
- Authority
- CN
- China
- Prior art keywords
- determining
- target
- endpoint time
- audio data
- face image
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/04—Segmentation; Word boundary detection
- G10L15/08—Speech classification or search
- G10L15/18—Speech classification or search using natural language modelling
- G10L15/20—Speech recognition techniques specially adapted for robustness in adverse environments, e.g. in noise, of stress induced speech
- G10L15/22—Procedures used during a speech recognition process, e.g. man-machine dialogue
- G10L15/24—Speech recognition using non-acoustical features
- G10L15/25—Speech recognition using non-acoustical features using position of the lips, movement of the lips or face analysis
- G10L15/28—Constructional details of speech recognition systems
- G10L15/32—Multiple recognisers used in sequence or in parallel; Score combination systems therefor, e.g. voting systems
- G10L2015/223—Execution procedure of a spoken command
Abstract
The application discloses a control method and apparatus for a voice interaction device, an electronic device, and a medium, in the field of computer technology. In one embodiment, the method comprises the following steps: acquiring a face image sequence captured by a visual sensor and audio data captured by a sound sensor; determining a target starting endpoint time and a target ending endpoint time by combining the face image sequence and the audio data; extracting the face images to be recognized between the target starting and ending endpoint times from the face image sequence, and extracting the audio data to be recognized between those times from the audio data; determining the user's instruction from the face images to be recognized and/or the audio data to be recognized; and controlling the voice interaction device to perform the corresponding operation according to the instruction. The method can accurately identify voice endpoints in complex, noisy environments, improving the precision of speech-sequence segmentation and the accuracy of user-instruction recognition.
Description
Technical Field
The present application relates to the field of computer technology, and in particular to a control method and apparatus for a voice interaction device, an electronic device, and a medium.
Background
With the rapid development of computer technology and the popularization of various intelligent devices, demand for human-computer interaction keeps growing. Among all human-computer interaction means, interaction by speech is clearly the most convenient and efficient. At present, voice interaction generally relies on audio features alone to detect human voice activity and recognize user instructions. This approach suits scenes with little environmental noise; a vehicle cabin, however, contains considerable mixed noise, so the quality of the raw audio signal is low, which greatly hinders correct recognition of the user's instructions. In addition, the loudness of the speaker's voice and the speaking distance also affect the quality of the audio captured by the microphone; if audio processing cannot compensate well for these variations, speech recognition accuracy suffers as well.
Disclosure of Invention
In order to solve, or at least partially solve, the above technical problems, embodiments of the present application provide a control method and apparatus for a voice interaction device, an electronic device, and a medium.
In a first aspect, an embodiment of the present application provides a method for controlling a voice interaction device, including:
acquiring a face image sequence captured by a visual sensor and audio data captured by a sound sensor;
combining the face image sequence and the audio data to determine a target starting endpoint moment and a target ending endpoint moment;
extracting face images to be recognized between the target starting endpoint time and the target ending endpoint time from the face image sequence, and/or extracting audio data to be recognized between the target starting endpoint time and the target ending endpoint time from the audio data;
determining a user instruction according to the face image to be recognized and/or the audio data to be recognized;
and controlling the voice interaction equipment to execute corresponding operation according to the instruction.
Optionally, determining the target starting endpoint moment and the target ending endpoint moment in combination with the face image sequence and the audio data includes: determining lip action information of the user according to the face image sequence; and determining target starting endpoint time and target ending endpoint time according to the lip motion information and the audio data.
Optionally, determining the target starting endpoint time and the target ending endpoint time according to the lip motion information and the audio data includes: determining a first starting end point moment and a first ending end point moment according to the lip action information; the first starting end point moment is the moment for determining the lip movement of the user, and the first ending end point moment is the moment for determining the lip movement stopping of the user; determining a second starting endpoint time and a second ending endpoint time according to the audio data; the second starting endpoint time is the time for determining the user to start voice, and the second ending endpoint time is the time for determining the user to stop voice; and determining the target starting endpoint time according to the first starting endpoint time and the second starting endpoint time, and determining the target ending endpoint time according to the first ending endpoint time and the second ending endpoint time.
Optionally, the determining the instruction of the user according to the face image to be identified and/or the audio data to be identified includes: carrying out semantic recognition on the audio data to be recognized, and determining the semantics corresponding to the audio data to be recognized; determining a first similarity between the semantic meaning and a preset keyword; determining lip motion information to be recognized of the user according to the face image to be recognized, and determining second similarity between the lip motion information to be recognized and a lip motion sequence of the preset keyword; obtaining the confidence coefficient of the preset keyword according to the first similarity and the second similarity; and determining the preset keywords as the instructions of the users under the condition that the confidence degrees of the preset keywords are larger than or equal to a preset threshold value.
Optionally, determining lip motion information of the user according to the face image sequence includes: performing key point detection on the face image aiming at each frame of face image in the face image sequence, and determining a lip region key point sequence corresponding to the face image; determining accumulated relative variation of the lip areas according to the lip area key point sequences corresponding to the two adjacent frames of face images; and determining lip action information of the user according to the accumulated relative variation of the lip area.
Optionally, the method further comprises determining the target starting endpoint time T1 and the target ending endpoint time T2 by a formula combining the first starting endpoint time Tk1, the second starting endpoint time Tk2, the first ending endpoint time Tm1, and the second ending endpoint time Tm2.
Optionally, the preset keyword includes a wake word.
In a second aspect, an embodiment of the present application further provides a control device of a voice interaction device, including:
the acquisition module is used for acquiring the face image sequence captured by the visual sensor and the audio data captured by the sound sensor;
the time determining module is used for combining the face image sequence and the audio data to determine a target starting endpoint moment and a target ending endpoint moment;
the extraction module is used for extracting a face image to be identified between the target starting endpoint time and the target ending endpoint time from the face image sequence and/or extracting audio data to be identified between the target starting endpoint time and the target ending endpoint time from the audio data;
the identification module is used for determining a user instruction according to the face image to be identified and/or the audio data to be identified;
and the control module is used for controlling the voice interaction equipment to execute corresponding operation according to the instruction.
Optionally, the time determining module is further configured to: determining lip action information of the user according to the face image sequence; and determining target starting endpoint time and target ending endpoint time according to the lip motion information and the audio data.
Optionally, the time determining module is further configured to: determining a first starting end point moment and a first ending end point moment according to the lip action information; the first starting end point moment is the moment for determining the lip movement of the user, and the first ending end point moment is the moment for determining the lip movement stopping of the user; determining a second starting endpoint time and a second ending endpoint time according to the audio data; the second starting endpoint time is the time for determining the user to start voice, and the second ending endpoint time is the time for determining the user to stop voice; and determining the target starting endpoint time according to the first starting endpoint time and the second starting endpoint time, and determining the target ending endpoint time according to the first ending endpoint time and the second ending endpoint time.
Optionally, the identification module is further configured to: carrying out semantic recognition on the audio data to be recognized, and determining the semantics corresponding to the audio data to be recognized; determining a first similarity between the semantic meaning and a preset keyword; determining lip motion information to be recognized of the user according to the face image to be recognized, and determining second similarity between the lip motion information to be recognized and a lip motion sequence of the preset keyword; obtaining the confidence coefficient of the preset keyword according to the first similarity and the second similarity; and determining the preset keywords as the instructions of the users under the condition that the confidence degrees of the preset keywords are larger than or equal to a preset threshold value.
Optionally, the time determining module is further configured to: performing key point detection on the face image aiming at each frame of face image in the face image sequence, and determining a lip region key point sequence corresponding to the face image; determining accumulated relative variation of the lip areas according to the lip area key point sequences corresponding to the two adjacent frames of face images; and determining lip action information of the user according to the accumulated relative variation of the lip area.
Optionally, the time determining module is further configured to determine the target starting endpoint time T1 and the target ending endpoint time T2 by a formula combining the first starting endpoint time Tk1, the second starting endpoint time Tk2, the first ending endpoint time Tm1, and the second ending endpoint time Tm2.
In a third aspect, an embodiment of the present application further provides an electronic device, including: one or more processors; and a storage device for storing one or more programs which, when executed by the one or more processors, cause the one or more processors to implement the control method of the voice interaction device described above.
In a fourth aspect, an embodiment of the present application further provides a computer-readable medium storing a computer program which, when executed by a processor, implements the control method of the voice interaction device according to the embodiments of the present application.
One embodiment of the above application has the following advantages or benefits:
the method has the advantages that the target starting endpoint time and the target ending endpoint time are determined by combining the acquired face image sequence and the acquired audio data, namely, the multi-mode information is obtained by fusing the vision and the audio, so that the starting point and the ending point of the voice activity are determined, the voice endpoint can be accurately identified in a noisy environment with large noise or small voice, the precision of dividing the voice sequence is improved, the defect of endpoint detection by independently using the audio is overcome, and the accuracy of user instruction identification is further improved; after the starting point and the ending point of the voice activity are determined, the voice fragment spoken by the user and the face change image when speaking can be accurately extracted, further, the instruction spoken by the user can be identified independently according to the voice fragment or the face change image, the instruction spoken by the user can be comprehensively identified by combining the voice fragment or the face change image, the identification rate is further improved, and the false identification rate is reduced.
Further effects of the above optional implementations are described below in connection with the specific embodiments.
Drawings
The drawings are included to provide a better understanding of the application and are not to be construed as unduly limiting the application. Wherein:
fig. 1 schematically shows a schematic diagram of a main flow of a control method of a voice interaction device according to an embodiment of the present application;
FIG. 2 schematically illustrates a schematic diagram of a sub-flow of a control method of a voice interaction device according to an embodiment of the present application;
FIG. 3 schematically illustrates another sub-flow of a control method of a voice interaction device according to an embodiment of the application;
fig. 4 schematically illustrates a schematic diagram of a lip key point in a control method of a voice interaction device according to an embodiment of the present application;
fig. 5 schematically shows a schematic diagram of main modules of a control device of a voice interaction device according to an embodiment of the present application;
fig. 6 schematically shows a structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
Exemplary embodiments of the present application will now be described with reference to the accompanying drawings, in which various details of the embodiments of the present application are included to facilitate understanding, and are to be considered merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the application. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
The terms "first", "second", and the like in the description and claims are used to distinguish between similar elements and do not necessarily describe a particular sequence or chronological order. It is to be understood that the data so used may be interchanged where appropriate, so that embodiments of the present application may be implemented in sequences other than those illustrated or described herein. Objects identified by "first", "second", etc. are generally of one type, and the number of objects is not limited; for example, the first object may be one or more. Furthermore, in the description and claims, "and/or" denotes at least one of the connected objects, and the character "/" generally indicates an "or" relationship between the associated objects.
The embodiments of the present application provide a control method for a voice interaction device that fuses visual features and audio features into multi-modal information to determine the start and end of voice activity. It can accurately identify voice endpoints in noisy environments or when the speaker's voice is quiet, improves the precision of speech-sequence segmentation, overcomes the shortcomings of audio-only endpoint detection, and thereby improves the accuracy of user-instruction recognition. Once the start and end of voice activity are determined, the speech segment spoken by the user and the accompanying face-change images can be accurately extracted; the user's instruction can then be recognized from the speech segment alone, from the face-change images alone, or from both combined, further raising the recognition rate and lowering the false-recognition rate. The method can be applied to the wake-up stage of a voice interaction device, raising the wake-up rate and lowering the false wake-up rate. For example, in a smart-cabin wake-up scenario, the driver's face image sequence is captured by monitoring equipment such as a camera, a depth sensor, or an eye tracker mounted in front of the driver's seat, and the driver's audio is collected by a sound sensor installed in the vehicle. Fusing the features of the speech signal and the visual signal gives strong in-vehicle noise resistance and strong adaptability to fluctuations in volume and speaking distance, so the overall system is robust, the voice wake-up rate is high, and the false wake-up rate is low.
Fig. 1 schematically illustrates a schematic diagram of a main flow of a control method of a voice interaction device according to an embodiment of the present application, as shown in fig. 1, where the method includes:
step S101: and acquiring a human face image sequence acquired by the vision sensor and audio data acquired by the sound sensor. The visual sensor may be a camera, a depth sensor, or the like. The sound sensor may be a microphone.
Step S102: and combining the face image sequence and the audio data to determine the target starting endpoint time and the target ending endpoint time.
This step extracts visual features from the face image sequence. These features serve as supplementary information to the audio data when detecting the start and end of human voice activity, improving detection accuracy.
In one embodiment, the target start endpoint time and the target end endpoint time may be determined according to the following procedure:
determining lip action information of the user according to the face image sequence;
and determining target starting endpoint time and target ending endpoint time according to the lip motion information and the audio data.
More specifically, as shown in fig. 2, the process of determining the target start endpoint time and the target end endpoint time according to the lip motion information and the audio data may include:
step S201: determining a first starting end point moment and a first ending end point moment according to the lip action information; the first starting end point moment is the moment for determining the lip movement of the user, and the first ending end point moment is the moment for determining the lip movement stopping of the user;
step S202: determining a second starting endpoint time and a second ending endpoint time according to the audio data; the second starting endpoint time is the time for determining the user to start voice, and the second ending endpoint time is the time for determining the user to stop voice;
step S203: and determining the target starting endpoint time according to the first starting endpoint time and the second starting endpoint time, and determining the target ending endpoint time according to the first ending endpoint time and the second ending endpoint time.
Wherein, as shown in fig. 3, the lip motion information of the user can be determined according to the following procedure:
step S301: performing key point detection on the face image aiming at each frame of face image in the face image sequence, and determining a lip region key point sequence corresponding to the face image;
step S302: determining accumulated relative variation of the lip areas according to the lip area key point sequences corresponding to the two adjacent frames of face images;
step S303: and determining lip action information of the user according to the accumulated relative variation of the lip area.
The lip motion information in this embodiment may be the change of the keypoints used to characterize the lip contour. For example, segmentation and extraction of the lip region may be achieved by first locating the face in the face image and then coarsely segmenting the image region containing the lips, separating the lips from the surrounding skin. After the lip region is segmented, the lip keypoints are extracted. The keypoints of the lip region characterize the contour of the lips, as shown in fig. 4, and can be used to describe lip changes such as opening and closing. In one embodiment, keypoints may be acquired at equal intervals along the upper edge of the upper lip, the upper edge of the lower lip, and the lower edge of the lower lip. In another embodiment, the keypoints of the lip region may include the left mouth corner, the right mouth corner, the mouth-corner center, sampling points on the inner boundary of the upper and lower lips, and sampling points on the outer boundary of the upper and lower lips.
The positional changes of the lip-region keypoints are correlated when a person speaks. Therefore, in this embodiment, the relative positions of the lip-region keypoints are used as lip motion information: the cumulative relative variation of the lip region is determined from the lip-region keypoint sequences of two adjacent face-image frames, and this cumulative variation serves as the lip motion information. "Relative position" refers to position after normalization. For example, given 15 ordered lip keypoints, normalization subtracts the coordinates of the first keypoint from the coordinates of every keypoint, yielding normalized coordinates that describe the shape of the lips independently of their position. This matters because the user's head may move while speaking, shifting the coordinate origin from frame to frame when the lip keypoints are extracted. Normalizing the keypoints in each face image therefore eliminates the deviation caused by head movement and improves accuracy.
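The normalization step can be sketched as follows; the function name and the small three-point example are illustrative assumptions, not taken from the patent.

```python
from typing import List, Tuple

Point = Tuple[float, float]

def normalize_keypoints(keypoints: List[Point]) -> List[Point]:
    """Subtract the first keypoint's coordinates from every keypoint,
    so the result describes lip shape independently of head position."""
    if not keypoints:
        return []
    ox, oy = keypoints[0]
    return [(x - ox, y - oy) for (x, y) in keypoints]

# A translated copy of the same lip shape normalizes to identical
# coordinates, so head translation between frames does not register
# as lip motion.
shape = [(10.0, 20.0), (12.0, 19.0), (14.0, 20.0)]
shifted = [(x + 5.0, y - 3.0) for (x, y) in shape]
assert normalize_keypoints(shape) == normalize_keypoints(shifted)
```

Note that this compensates only for translation of the head, which is the deviation the passage above attributes to head movement; rotation or scale changes would need a fuller alignment step.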
After the lip-region keypoints are extracted from each face-image frame, the change between the keypoints of the current frame and those of the previous frame is compared continuously. The relative position changes of all keypoints are summed; if the sum exceeds a specified threshold, obvious lip movement is considered to have occurred, and the capture time of the current frame is taken as the first starting endpoint time. When, during lip movement, the sum of the relative position changes of all keypoints falls below the specified threshold, the lips are considered to have stopped moving, and the capture time of the current frame is taken as the first ending endpoint time.
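This threshold test can be sketched as a small scan over per-frame variation values; the threshold value, function name, and frame-index (rather than timestamp) output are illustrative assumptions.

```python
from typing import List, Optional, Tuple

def lip_motion_endpoints(variations: List[float],
                         threshold: float = 1.0) -> Optional[Tuple[int, int]]:
    """Given the per-frame cumulative relative variation of the lip
    keypoints, return (start_frame, end_frame): the frame where the
    variation first exceeds the threshold (lips start moving) and the
    frame where it next falls below it (lips stop moving)."""
    start = None
    for i, v in enumerate(variations):
        if start is None and v > threshold:
            start = i          # first starting endpoint
        elif start is not None and v < threshold:
            return start, i    # first ending endpoint
    # Motion that never stopped before the sequence ended.
    return None if start is None else (start, len(variations) - 1)

# Quiet frames, then motion in frames 2-4, then quiet again:
assert lip_motion_endpoints([0.1, 0.2, 2.5, 3.0, 1.8, 0.3]) == (2, 5)
```

Mapping the returned frame indices to capture timestamps would give the first starting and ending endpoint times described above.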
For example, the lip-region keypoint sequence at time t (assuming 15 keypoints in total) is:
keypoint_t = [(a1, b1), (a2, b2), ..., (a15, b15)]
and the lip-region keypoint sequence at time t+1 is:
keypoint_{t+1} = [(x1, y1), (x2, y2), ..., (x15, y15)]
The cumulative relative variation is then the sum, over all keypoints, of the change in their normalized (first-keypoint-relative) positions between the two frames.
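The published formula for the cumulative relative variation is not reproduced in this text; one plausible reading, consistent with the normalization described above, sums the Euclidean displacement of each normalized keypoint between the two frames. The Euclidean metric is an assumption.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]

def cumulative_relative_variation(prev: List[Point],
                                  curr: List[Point]) -> float:
    """Sum, over all keypoints, of how far each normalized
    (first-keypoint-relative) position moved between two frames."""
    def normalize(pts: List[Point]) -> List[Point]:
        ox, oy = pts[0]
        return [(x - ox, y - oy) for (x, y) in pts]

    p, c = normalize(prev), normalize(curr)
    return sum(math.hypot(cx - px, cy - py)
               for (px, py), (cx, cy) in zip(p, c))

# Identical lip shapes, even with the head translated between frames,
# yield zero variation; an opening mouth yields a positive value that
# is compared against the lip-motion threshold.
closed = [(0.0, 0.0), (5.0, 1.0), (10.0, 0.0)]
moved_head = [(2.0, 3.0), (7.0, 4.0), (12.0, 3.0)]
assert cumulative_relative_variation(closed, moved_head) == 0.0
```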
for the second start endpoint time and the second end endpoint time, it may be determined according to a voice endpoint detection algorithm (VAD, voice Activity Detection).
After the first starting endpoint time, first ending endpoint time, second starting endpoint time, and second ending endpoint time are determined, the target starting endpoint time T1 and the target ending endpoint time T2 are determined by a formula combining the first starting endpoint time Tk1, the second starting endpoint time Tk2, the first ending endpoint time Tm1, and the second ending endpoint time Tm2.
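The fusion formula itself is not reproduced in this text. One natural reconstruction, consistent with the stated goal of extracting the whole utterance but an assumption rather than the published formula, takes the earlier of the two starting endpoints and the later of the two ending endpoints.

```python
from typing import Tuple

def fuse_endpoints(t_k1: float, t_k2: float,
                   t_m1: float, t_m2: float) -> Tuple[float, float]:
    """Fuse lip-motion endpoints (t_k1, t_m1) with audio-VAD endpoints
    (t_k2, t_m2): start as early and end as late as either modality
    suggests, so the extracted segment covers the whole utterance.
    This min/max rule is an assumed reconstruction, not the patent's
    published formula."""
    t1 = min(t_k1, t_k2)   # target starting endpoint time T1
    t2 = max(t_m1, t_m2)   # target ending endpoint time T2
    return t1, t2

# Lips start moving slightly before audible speech and stop after it:
assert fuse_endpoints(1.8, 2.0, 5.1, 4.9) == (1.8, 5.1)
```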
Step S103: extracting the face images to be recognized between the target starting endpoint time and the target ending endpoint time from the face image sequence, and/or extracting the audio data to be recognized between those times from the audio data.
Step S104: determining the user's instruction according to the face images to be recognized and/or the audio data to be recognized.
Step S105: controlling the voice interaction device to perform the corresponding operation according to the instruction.
In an alternative embodiment, the instruction spoken by the user may be determined only by the face image to be recognized, the instruction spoken by the user may be determined only by the audio data to be recognized, and the instruction spoken by the user may be determined by combining the face image to be recognized and the audio data to be recognized.
The process for determining the instruction spoken by the user by combining the face image to be recognized and the audio data to be recognized comprises the following steps:
carrying out semantic recognition on the audio data to be recognized, and determining the semantics corresponding to the audio data to be recognized;
determining a first similarity between the semantic meaning and a preset keyword;
determining lip motion information to be recognized of the user according to the face image to be recognized, and determining second similarity between the lip motion information to be recognized and a lip motion sequence of the preset keyword;
obtaining the confidence coefficient of the preset keyword according to the first similarity and the second similarity;
and determining the preset keywords as the instructions of the users under the condition that the confidence degrees of the preset keywords are larger than or equal to a preset threshold value.
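The text above does not specify how the two similarities are combined into a confidence score. A minimal sketch, assuming a simple weighted average with an assumed audio weight and threshold, is:

```python
def keyword_confidence(sim_audio, sim_lip, weight_audio=0.6):
    """Combine semantic similarity and lip-motion similarity into one confidence.

    weight_audio is an assumed parameter; the patent does not disclose
    how the first and second similarities are weighted.
    """
    return weight_audio * sim_audio + (1.0 - weight_audio) * sim_lip

def is_user_instruction(sim_audio, sim_lip, threshold=0.7):
    """Accept the preset keyword as the user's instruction when the
    confidence meets the preset threshold."""
    return keyword_confidence(sim_audio, sim_lip) >= threshold

print(is_user_instruction(0.9, 0.8))  # True  (confidence ~0.86)
print(is_user_instruction(0.4, 0.5))  # False (confidence ~0.44)
```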
The preset keyword may be a wake-up word of the voice interaction device, or may be a preset control word, for example, a control word for controlling the voice interaction device to increase or decrease the volume. And waking up the voice interaction device under the condition that the instruction uttered by the user is determined to be a wake-up word. And under the condition that the instruction uttered by the user is determined to be a preset control word, controlling the voice interaction equipment to execute corresponding operation.
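The dispatch described above — wake the device on a wake-up word, otherwise execute the matched control word — can be sketched as follows. The class, method, and keyword names are hypothetical, introduced only for illustration:

```python
class VoiceDevice:
    """Hypothetical device stub; names are illustrative, not from the patent."""

    def __init__(self):
        self.awake = False
        self.volume = 5

    def handle_instruction(self, keyword):
        if keyword == "wake_word":
            self.awake = True       # wake-up word wakes the device
        elif keyword == "volume_up":
            self.volume += 1        # preset control word: increase volume
        elif keyword == "volume_down":
            self.volume -= 1        # preset control word: decrease volume

dev = VoiceDevice()
dev.handle_instruction("wake_word")
dev.handle_instruction("volume_up")
print(dev.awake, dev.volume)  # True 6
```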
According to the control method of the voice interaction device, visual features and audio features are fused to obtain multi-modal information for determining the start point and end point of voice activity. As a result, voice endpoints can be identified accurately even in noisy environments or when the user speaks quietly, the precision of segmenting the voice sequence is improved, the shortcomings of audio-only endpoint detection are overcome, and the accuracy of user instruction recognition is further improved. Once the start point and end point of voice activity are determined, the voice segment spoken by the user and the facial images captured while speaking can be extracted accurately; the instruction spoken by the user can then be recognized from the voice segment or the facial images alone, or from the two in combination, further raising the recognition rate and reducing the false recognition rate.
Fig. 5 schematically illustrates a structural diagram of a control apparatus 500 of a voice interaction device according to an embodiment of the present application, and as shown in fig. 5, the control apparatus 500 of a voice interaction device includes:
the acquisition module 501 is used for acquiring a human face image sequence acquired by the vision sensor and audio data acquired by the sound sensor;
a time determining module 502, configured to combine the face image sequence and the audio data to determine a target start endpoint time and a target end endpoint time;
an extracting module 503, configured to extract a face image to be identified between the target starting endpoint time and the target ending endpoint time from the face image sequence, and/or extract audio data to be identified between the target starting endpoint time and the target ending endpoint time from the audio data;
the recognition module 504 is configured to determine a user instruction according to the face image to be recognized and/or the audio data to be recognized;
and the control module 505 is configured to control the voice interaction device to execute a corresponding operation according to the instruction.
Optionally, the time determining module 502 is further configured to: determining lip action information of the user according to the face image sequence; and determining target starting endpoint time and target ending endpoint time according to the lip motion information and the audio data.
Optionally, the time determining module 502 is further configured to: determining a first starting end point moment and a first ending end point moment according to the lip action information; the first starting end point moment is the moment for determining the lip movement of the user, and the first ending end point moment is the moment for determining the lip movement stopping of the user; determining a second starting endpoint time and a second ending endpoint time according to the audio data; the second starting endpoint time is the time for determining the user to start voice, and the second ending endpoint time is the time for determining the user to stop voice; and determining the target starting endpoint time according to the first starting endpoint time and the second starting endpoint time, and determining the target ending endpoint time according to the first ending endpoint time and the second ending endpoint time.
Optionally, the identifying module 504 is further configured to: carrying out semantic recognition on the audio data to be recognized, and determining the semantics corresponding to the audio data to be recognized; determining a first similarity between the semantic meaning and a preset keyword; determining lip motion information to be recognized of the user according to the face image to be recognized, and determining second similarity between the lip motion information to be recognized and a lip motion sequence of the preset keyword; obtaining the confidence coefficient of the preset keyword according to the first similarity and the second similarity; and determining the preset keywords as the instructions of the users under the condition that the confidence degrees of the preset keywords are larger than or equal to a preset threshold value.
Optionally, the time determining module 502 is further configured to: performing key point detection on the face image aiming at each frame of face image in the face image sequence, and determining a lip region key point sequence corresponding to the face image; determining accumulated relative variation of the lip areas according to the lip area key point sequences corresponding to the two adjacent frames of face images; and determining lip action information of the user according to the accumulated relative variation of the lip area.
Optionally, the time determining module 502 is further configured to determine the target starting endpoint time and the target ending endpoint time according to the following formula:
where T1 denotes the target starting endpoint time, Tk1 the first starting endpoint time, Tk2 the second starting endpoint time, T2 the target ending endpoint time, Tm1 the first ending endpoint time, and Tm2 the second ending endpoint time.
The embodiment of the application provides a control apparatus for a voice interaction device that fuses visual features and audio features to obtain multi-modal information for determining the start point and end point of voice activity. Voice endpoints can therefore be identified accurately even in noisy environments or when the user speaks quietly, the precision of segmenting the voice sequence is improved, the shortcomings of audio-only endpoint detection are overcome, and the accuracy of user instruction recognition is further improved. Once the start point and end point of voice activity are determined, the voice segment spoken by the user and the facial images captured while speaking can be extracted accurately; the instruction spoken by the user can then be recognized from the voice segment or the facial images alone, or from the two in combination, further raising the recognition rate and reducing the false recognition rate.
The device can execute the method provided by the embodiment of the application, and has the corresponding functional modules and beneficial effects of the execution method. Technical details not described in detail in this embodiment may be found in the methods provided in the embodiments of the present application.
Fig. 6 schematically illustrates an electronic device according to an embodiment of the application. As shown in fig. 6, an electronic device 600 provided by an embodiment of the present application includes a processor 601, a communication interface 602, a memory 603, and a communication bus 604; the processor 601, the communication interface 602, and the memory 603 communicate with one another through the communication bus 604. The memory 603 is configured to store at least one executable instruction, and the processor 601 is configured to implement the control method of the voice interaction device described above when executing the executable instructions stored in the memory 603.
Specifically, when the control method of the voice interaction device is implemented, the executable instructions cause the processor to perform the following steps: acquiring a human face image sequence acquired by a visual sensor and audio data acquired by a sound sensor; combining the face image sequence and the audio data to determine a target starting endpoint moment and a target ending endpoint moment; extracting face images to be recognized between the target starting endpoint time and the target ending endpoint time from the face image sequence, and/or extracting audio data to be recognized between the target starting endpoint time and the target ending endpoint time from the audio data; determining a user instruction according to the face image to be recognized and/or the audio data to be recognized; and controlling the voice interaction equipment to execute corresponding operation according to the instruction.
The memory 603 may be an electronic memory such as a flash memory, an EEPROM (electrically erasable programmable read only memory), an EPROM, a hard disk, or a ROM. The memory 603 has storage space for program code for performing any of the method steps described above. For example, the memory space for the program code may include individual program code for implementing individual steps in the above method, respectively. The program code can be read from or written to one or more computer program products. These computer program products comprise a program code carrier such as a hard disk, compact Disk (CD), memory card or floppy disk. Such computer program products are typically portable or fixed storage units. The storage unit may have a memory segment or a memory space or the like arranged similarly to the memory 603 in the electronic device described above. The program code may be compressed, for example, in a suitable form. Typically, the storage unit comprises a program for performing the method steps according to an embodiment of the application, i.e. code that can be read by a processor, such as 601, for example, which when run by an electronic device causes the electronic device to perform the various steps in the method described above.
As another aspect, the present application also provides a computer-readable medium that may be contained in the apparatus described in the above embodiments; or may be present alone without being fitted into the device. The computer readable medium carries one or more programs which, when executed by a device, cause the device to: acquiring a human face image sequence acquired by a visual sensor and audio data acquired by a sound sensor; combining the face image sequence and the audio data to determine a target starting endpoint moment and a target ending endpoint moment; extracting face images to be recognized between the target starting endpoint time and the target ending endpoint time from the face image sequence, and/or extracting audio data to be recognized between the target starting endpoint time and the target ending endpoint time from the audio data; determining a user instruction according to the face image to be recognized and/or the audio data to be recognized; and controlling the voice interaction equipment to execute corresponding operation according to the instruction.
The computer readable medium shown in the present application may be a computer readable signal medium or a computer readable storage medium, or any combination of the two. The computer readable storage medium can be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples of the computer-readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In the present application, however, the computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, with the computer-readable program code embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The above embodiments do not limit the scope of the present application. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives can occur depending upon design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present application should be included in the scope of the present application.
Claims (10)
1. A control method of a voice interaction device, comprising:
acquiring a human face image sequence acquired by a visual sensor and audio data acquired by a sound sensor;
combining the face image sequence and the audio data to determine a target starting endpoint moment and a target ending endpoint moment;
extracting face images to be recognized between the target starting endpoint time and the target ending endpoint time from the face image sequence, and/or extracting audio data to be recognized between the target starting endpoint time and the target ending endpoint time from the audio data;
determining a user instruction according to the face image to be recognized and/or the audio data to be recognized;
and controlling the voice interaction equipment to execute corresponding operation according to the instruction.
2. The method of claim 1, wherein determining a target start endpoint time and a target end endpoint time in combination with the sequence of face images and the audio data comprises:
determining lip action information of the user according to the face image sequence;
and determining target starting endpoint time and target ending endpoint time according to the lip motion information and the audio data.
3. The method of claim 2, wherein determining a target start endpoint time and a target end endpoint time based on the lip motion information and the audio data comprises:
determining a first starting end point moment and a first ending end point moment according to the lip action information; the first starting end point moment is the moment for determining the lip movement of the user, and the first ending end point moment is the moment for determining the lip movement stopping of the user;
determining a second starting endpoint time and a second ending endpoint time according to the audio data; the second starting endpoint time is the time for determining the user to start voice, and the second ending endpoint time is the time for determining the user to stop voice;
and determining the target starting endpoint time according to the first starting endpoint time and the second starting endpoint time, and determining the target ending endpoint time according to the first ending endpoint time and the second ending endpoint time.
4. The method according to claim 2, wherein determining the user's instructions from the face image to be identified and/or the audio data to be identified comprises:
carrying out semantic recognition on the audio data to be recognized, and determining the semantics corresponding to the audio data to be recognized;
determining a first similarity between the semantic meaning and a preset keyword;
determining lip motion information to be recognized of the user according to the face image to be recognized, and determining second similarity between the lip motion information to be recognized and a lip motion sequence of the preset keyword;
obtaining the confidence coefficient of the preset keyword according to the first similarity and the second similarity;
and determining the preset keywords as the instructions of the users under the condition that the confidence degrees of the preset keywords are larger than or equal to a preset threshold value.
5. The method of any of claims 2-4, wherein determining lip motion information of the user from the sequence of face images comprises:
performing key point detection on the face image aiming at each frame of face image in the face image sequence, and determining a lip region key point sequence corresponding to the face image;
determining accumulated relative variation of the lip areas according to the lip area key point sequences corresponding to the two adjacent frames of face images;
and determining lip action information of the user according to the accumulated relative variation of the lip area.
6. A method according to claim 3, wherein the target start endpoint time and the target end endpoint time are determined according to the following equation:
where T1 denotes the target starting endpoint time, Tk1 the first starting endpoint time, Tk2 the second starting endpoint time, T2 the target ending endpoint time, Tm1 the first ending endpoint time, and Tm2 the second ending endpoint time.
7. The method of claim 4, wherein the preset keyword comprises a wake word.
8. A control apparatus for a voice interaction device, comprising:
the acquisition module is used for acquiring the human face image sequence acquired by the vision sensor and the audio data acquired by the sound sensor;
the time determining module is used for combining the face image sequence and the audio data to determine a target starting endpoint moment and a target ending endpoint moment;
the extraction module is used for extracting a face image to be identified between the target starting endpoint time and the target ending endpoint time from the face image sequence and/or extracting audio data to be identified between the target starting endpoint time and the target ending endpoint time from the audio data;
the identification module is used for determining a user instruction according to the face image to be identified and/or the audio data to be identified;
and the control module is used for controlling the voice interaction equipment to execute corresponding operation according to the instruction.
9. An electronic device, comprising:
one or more processors;
storage means for storing one or more programs,
when executed by the one or more processors, causes the one or more processors to implement the method of any of claims 1-7.
10. A computer readable medium, on which a computer program is stored, characterized in that the program, when being executed by a processor, implements the method according to any of claims 1-7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210183692.2A CN116705016A (en) | 2022-02-24 | 2022-02-24 | Control method and device of voice interaction equipment, electronic equipment and medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210183692.2A CN116705016A (en) | 2022-02-24 | 2022-02-24 | Control method and device of voice interaction equipment, electronic equipment and medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116705016A true CN116705016A (en) | 2023-09-05 |
Family
ID=87828090
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210183692.2A Pending CN116705016A (en) | 2022-02-24 | 2022-02-24 | Control method and device of voice interaction equipment, electronic equipment and medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116705016A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117370961A (en) * | 2023-12-05 | 2024-01-09 | 江西五十铃汽车有限公司 | Vehicle voice interaction method and system |
-
2022
- 2022-02-24 CN CN202210183692.2A patent/CN116705016A/en active Pending
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117370961A (en) * | 2023-12-05 | 2024-01-09 | 江西五十铃汽车有限公司 | Vehicle voice interaction method and system |
CN117370961B (en) * | 2023-12-05 | 2024-03-15 | 江西五十铃汽车有限公司 | Vehicle voice interaction method and system |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
WO2021082941A1 (en) | Video figure recognition method and apparatus, and storage medium and electronic device | |
CN107622770B (en) | Voice wake-up method and device | |
CN109410957B (en) | Front human-computer interaction voice recognition method and system based on computer vision assistance | |
EP3614377A1 (en) | Object identifying method, computer device and computer readable storage medium | |
WO2016150001A1 (en) | Speech recognition method, device and computer storage medium | |
US9899025B2 (en) | Speech recognition system adaptation based on non-acoustic attributes and face selection based on mouth motion using pixel intensities | |
US20150325240A1 (en) | Method and system for speech input | |
US8416998B2 (en) | Information processing device, information processing method, and program | |
CN111797632B (en) | Information processing method and device and electronic equipment | |
CN109785846B (en) | Role recognition method and device for mono voice data | |
CN111048113A (en) | Sound direction positioning processing method, device and system, computer equipment and storage medium | |
CN110706707B (en) | Method, apparatus, device and computer-readable storage medium for voice interaction | |
Ivanko et al. | Multimodal speech recognition: increasing accuracy using high speed video data | |
CN112397093B (en) | Voice detection method and device | |
CN111326152A (en) | Voice control method and device | |
CN113593597B (en) | Voice noise filtering method, device, electronic equipment and medium | |
CN116705016A (en) | Control method and device of voice interaction equipment, electronic equipment and medium | |
CN109065026B (en) | Recording control method and device | |
CN112669837B (en) | Awakening method and device of intelligent terminal and electronic equipment | |
CN114239610A (en) | Multi-language speech recognition and translation method and related system | |
KR20210066774A (en) | Method and Apparatus for Distinguishing User based on Multimodal | |
Seong et al. | A review of audio-visual speech recognition | |
CN111477226A (en) | Control method, intelligent device and storage medium | |
CN114399992B (en) | Voice instruction response method, device and storage medium | |
CN114282621B (en) | Multi-mode fused speaker role distinguishing method and system |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||