CN116705016A - Control method and device of voice interaction equipment, electronic equipment and medium - Google Patents
- Publication number
- CN116705016A (application CN202210183692.2A)
- Authority
- CN
- China
- Prior art keywords
- determining
- target
- endpoint time
- audio data
- face image
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/04—Segmentation; Word boundary detection
- G10L15/08—Speech classification or search
- G10L15/18—Speech classification or search using natural language modelling
- G10L15/20—Speech recognition techniques specially adapted for robustness in adverse environments, e.g. in noise, of stress induced speech
- G10L15/22—Procedures used during a speech recognition process, e.g. man-machine dialogue
- G10L15/24—Speech recognition using non-acoustical features
- G10L15/25—Speech recognition using non-acoustical features using position of the lips, movement of the lips or face analysis
- G10L15/28—Constructional details of speech recognition systems
- G10L15/32—Multiple recognisers used in sequence or in parallel; Score combination systems therefor, e.g. voting systems
- G10L2015/223—Execution procedure of a spoken command
Abstract
The application discloses a control method and apparatus for a voice interaction device, an electronic device, and a medium, in the field of computer technology. In one embodiment, the method comprises the following steps: acquiring a face image sequence captured by a visual sensor and audio data captured by a sound sensor; determining a target starting endpoint time and a target ending endpoint time by combining the face image sequence and the audio data; extracting the face images to be recognized between the target starting and ending endpoint times from the face image sequence, and extracting the audio data to be recognized between those times from the audio data; determining the user's instruction from the face images to be recognized and/or the audio data to be recognized; and controlling the voice interaction device to perform the corresponding operation according to the instruction. The method can accurately identify voice endpoints in complex, noisy environments, improving the precision of speech-sequence segmentation and the accuracy of user-instruction recognition.
Description
Technical Field
The present application relates to the field of computer technology, and in particular to a control method and apparatus for a voice interaction device, an electronic device, and a medium.
Background
With the rapid development of computer technology and the popularization of various intelligent devices, demand for human-computer interaction keeps growing. Among all human-computer interaction means, interaction by speech is clearly the most convenient and efficient. At present, voice interaction generally relies on audio features alone to detect human voice activity and recognize user instructions. This approach suits scenes with little environmental noise; a vehicle cabin, however, contains considerable mixed noise, so the quality of the raw audio signal is low, which greatly hinders correct recognition of the user's instructions. In addition, the loudness of the speaker's voice and the speaking distance also affect the quality of the audio captured by the microphone; if audio processing cannot compensate well for these variations, speech recognition accuracy suffers as well.
Disclosure of Invention
In order to solve, or at least partially solve, the above technical problems, embodiments of the present application provide a control method and apparatus for a voice interaction device, an electronic device, and a medium.
In a first aspect, an embodiment of the present application provides a method for controlling a voice interaction device, including:
acquiring a face image sequence captured by a visual sensor and audio data captured by a sound sensor;
combining the face image sequence and the audio data to determine a target starting endpoint moment and a target ending endpoint moment;
extracting face images to be recognized between the target starting endpoint time and the target ending endpoint time from the face image sequence, and/or extracting audio data to be recognized between the target starting endpoint time and the target ending endpoint time from the audio data;
determining a user instruction according to the face image to be recognized and/or the audio data to be recognized;
and controlling the voice interaction equipment to execute corresponding operation according to the instruction.
Optionally, determining the target starting endpoint moment and the target ending endpoint moment in combination with the face image sequence and the audio data includes: determining lip action information of the user according to the face image sequence; and determining target starting endpoint time and target ending endpoint time according to the lip motion information and the audio data.
Optionally, determining the target starting endpoint time and the target ending endpoint time according to the lip motion information and the audio data includes: determining a first starting end point moment and a first ending end point moment according to the lip action information; the first starting end point moment is the moment for determining the lip movement of the user, and the first ending end point moment is the moment for determining the lip movement stopping of the user; determining a second starting endpoint time and a second ending endpoint time according to the audio data; the second starting endpoint time is the time for determining the user to start voice, and the second ending endpoint time is the time for determining the user to stop voice; and determining the target starting endpoint time according to the first starting endpoint time and the second starting endpoint time, and determining the target ending endpoint time according to the first ending endpoint time and the second ending endpoint time.
Optionally, the determining the instruction of the user according to the face image to be identified and/or the audio data to be identified includes: carrying out semantic recognition on the audio data to be recognized, and determining the semantics corresponding to the audio data to be recognized; determining a first similarity between the semantic meaning and a preset keyword; determining lip motion information to be recognized of the user according to the face image to be recognized, and determining second similarity between the lip motion information to be recognized and a lip motion sequence of the preset keyword; obtaining the confidence coefficient of the preset keyword according to the first similarity and the second similarity; and determining the preset keywords as the instructions of the users under the condition that the confidence degrees of the preset keywords are larger than or equal to a preset threshold value.
Optionally, determining lip motion information of the user according to the face image sequence includes: performing key point detection on the face image aiming at each frame of face image in the face image sequence, and determining a lip region key point sequence corresponding to the face image; determining accumulated relative variation of the lip areas according to the lip area key point sequences corresponding to the two adjacent frames of face images; and determining lip action information of the user according to the accumulated relative variation of the lip area.
Optionally, the method further comprises determining the target starting endpoint time T1 and the target ending endpoint time T2 by a formula combining the first starting endpoint time Tk1, the second starting endpoint time Tk2, the first ending endpoint time Tm1, and the second ending endpoint time Tm2.
Optionally, the preset keyword includes a wake word.
In a second aspect, an embodiment of the present application further provides a control device of a voice interaction device, including:
the acquisition module is used for acquiring the face image sequence captured by the visual sensor and the audio data captured by the sound sensor;
the time determining module is used for combining the face image sequence and the audio data to determine a target starting endpoint moment and a target ending endpoint moment;
the extraction module is used for extracting a face image to be identified between the target starting endpoint time and the target ending endpoint time from the face image sequence and/or extracting audio data to be identified between the target starting endpoint time and the target ending endpoint time from the audio data;
the identification module is used for determining a user instruction according to the face image to be identified and/or the audio data to be identified;
and the control module is used for controlling the voice interaction equipment to execute corresponding operation according to the instruction.
Optionally, the time determining module is further configured to: determining lip action information of the user according to the face image sequence; and determining target starting endpoint time and target ending endpoint time according to the lip motion information and the audio data.
Optionally, the time determining module is further configured to: determining a first starting end point moment and a first ending end point moment according to the lip action information; the first starting end point moment is the moment for determining the lip movement of the user, and the first ending end point moment is the moment for determining the lip movement stopping of the user; determining a second starting endpoint time and a second ending endpoint time according to the audio data; the second starting endpoint time is the time for determining the user to start voice, and the second ending endpoint time is the time for determining the user to stop voice; and determining the target starting endpoint time according to the first starting endpoint time and the second starting endpoint time, and determining the target ending endpoint time according to the first ending endpoint time and the second ending endpoint time.
Optionally, the identification module is further configured to: carrying out semantic recognition on the audio data to be recognized, and determining the semantics corresponding to the audio data to be recognized; determining a first similarity between the semantic meaning and a preset keyword; determining lip motion information to be recognized of the user according to the face image to be recognized, and determining second similarity between the lip motion information to be recognized and a lip motion sequence of the preset keyword; obtaining the confidence coefficient of the preset keyword according to the first similarity and the second similarity; and determining the preset keywords as the instructions of the users under the condition that the confidence degrees of the preset keywords are larger than or equal to a preset threshold value.
Optionally, the time determining module is further configured to: performing key point detection on the face image aiming at each frame of face image in the face image sequence, and determining a lip region key point sequence corresponding to the face image; determining accumulated relative variation of the lip areas according to the lip area key point sequences corresponding to the two adjacent frames of face images; and determining lip action information of the user according to the accumulated relative variation of the lip area.
Optionally, the time determining module is further configured to determine the target starting endpoint time T1 and the target ending endpoint time T2 by a formula combining the first starting endpoint time Tk1, the second starting endpoint time Tk2, the first ending endpoint time Tm1, and the second ending endpoint time Tm2.
In a third aspect, an embodiment of the present application further provides an electronic device, including: one or more processors; and a storage device for storing one or more programs which, when executed by the one or more processors, cause the one or more processors to implement the control method of the voice interaction device described above.
In a fourth aspect, an embodiment of the present application further provides a computer-readable medium storing a computer program which, when executed by a processor, implements the control method of the voice interaction device according to the embodiments of the present application.
One embodiment of the above application has the following advantages or benefits:
the method has the advantages that the target starting endpoint time and the target ending endpoint time are determined by combining the acquired face image sequence and the acquired audio data, namely, the multi-mode information is obtained by fusing the vision and the audio, so that the starting point and the ending point of the voice activity are determined, the voice endpoint can be accurately identified in a noisy environment with large noise or small voice, the precision of dividing the voice sequence is improved, the defect of endpoint detection by independently using the audio is overcome, and the accuracy of user instruction identification is further improved; after the starting point and the ending point of the voice activity are determined, the voice fragment spoken by the user and the face change image when speaking can be accurately extracted, further, the instruction spoken by the user can be identified independently according to the voice fragment or the face change image, the instruction spoken by the user can be comprehensively identified by combining the voice fragment or the face change image, the identification rate is further improved, and the false identification rate is reduced.
Further effects of the above optional implementations are described below in connection with the specific embodiments.
Drawings
The drawings are included to provide a better understanding of the application and are not to be construed as unduly limiting the application. Wherein:
fig. 1 schematically shows a schematic diagram of a main flow of a control method of a voice interaction device according to an embodiment of the present application;
FIG. 2 schematically illustrates a schematic diagram of a sub-flow of a control method of a voice interaction device according to an embodiment of the present application;
FIG. 3 schematically illustrates another sub-flow of a control method of a voice interaction device according to an embodiment of the application;
fig. 4 schematically illustrates a schematic diagram of a lip key point in a control method of a voice interaction device according to an embodiment of the present application;
fig. 5 schematically shows a schematic diagram of main modules of a control device of a voice interaction device according to an embodiment of the present application;
fig. 6 schematically shows a structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
Exemplary embodiments of the present application will now be described with reference to the accompanying drawings, in which various details of the embodiments of the present application are included to facilitate understanding, and are to be considered merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the application. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
The terms "first", "second", and the like in the description and claims are used to distinguish between similar elements and do not necessarily describe a particular sequence or chronological order. It is to be understood that the data so used may be interchanged where appropriate, so that embodiments of the present application may be implemented in sequences other than those illustrated or described herein. Objects identified by "first", "second", etc. are generally of one type, and the number of objects is not limited; for example, the first object may be one or more. Furthermore, in the description and claims, "and/or" denotes at least one of the connected objects, and the character "/" generally indicates an "or" relationship between the associated objects.
The embodiments of the present application provide a control method for a voice interaction device that fuses visual features and audio features into multi-modal information to determine the start and end of voice activity. It can accurately identify voice endpoints in noisy environments or when the speaker's voice is quiet, improves the precision of speech-sequence segmentation, overcomes the shortcomings of audio-only endpoint detection, and thereby improves the accuracy of user-instruction recognition. Once the start and end of voice activity are determined, the speech segment spoken by the user and the accompanying face-change images can be accurately extracted; the user's instruction can then be recognized from the speech segment alone, from the face-change images alone, or from both combined, further raising the recognition rate and lowering the false-recognition rate. The method can be applied to the wake-up stage of a voice interaction device, raising the wake-up rate and lowering the false wake-up rate. For example, in a smart-cabin wake-up scenario, the driver's face image sequence is captured by monitoring equipment such as a camera, a depth sensor, or an eye tracker mounted in front of the driver's seat, and the driver's audio is collected by a sound sensor installed in the vehicle. Fusing the features of the speech signal and the visual signal gives strong in-vehicle noise resistance and strong adaptability to fluctuations in volume and speaking distance, so the overall system is robust, the voice wake-up rate is high, and the false wake-up rate is low.
Fig. 1 schematically illustrates a schematic diagram of a main flow of a control method of a voice interaction device according to an embodiment of the present application, as shown in fig. 1, where the method includes:
step S101: and acquiring a human face image sequence acquired by the vision sensor and audio data acquired by the sound sensor. The visual sensor may be a camera, a depth sensor, or the like. The sound sensor may be a microphone.
Step S102: and combining the face image sequence and the audio data to determine the target starting endpoint time and the target ending endpoint time.
This step extracts visual features from the face image sequence. These features serve as supplementary information to the audio data when detecting the start and end of human voice activity, improving detection accuracy.
In one embodiment, the target start endpoint time and the target end endpoint time may be determined according to the following procedure:
determining lip action information of the user according to the face image sequence;
and determining target starting endpoint time and target ending endpoint time according to the lip motion information and the audio data.
More specifically, as shown in fig. 2, the process of determining the target start endpoint time and the target end endpoint time according to the lip motion information and the audio data may include:
step S201: determining a first starting end point moment and a first ending end point moment according to the lip action information; the first starting end point moment is the moment for determining the lip movement of the user, and the first ending end point moment is the moment for determining the lip movement stopping of the user;
step S202: determining a second starting endpoint time and a second ending endpoint time according to the audio data; the second starting endpoint time is the time for determining the user to start voice, and the second ending endpoint time is the time for determining the user to stop voice;
step S203: and determining the target starting endpoint time according to the first starting endpoint time and the second starting endpoint time, and determining the target ending endpoint time according to the first ending endpoint time and the second ending endpoint time.
Wherein, as shown in fig. 3, the lip motion information of the user can be determined according to the following procedure:
step S301: performing key point detection on the face image aiming at each frame of face image in the face image sequence, and determining a lip region key point sequence corresponding to the face image;
step S302: determining accumulated relative variation of the lip areas according to the lip area key point sequences corresponding to the two adjacent frames of face images;
step S303: and determining lip action information of the user according to the accumulated relative variation of the lip area.
The lip motion information in this embodiment may be the change of the keypoints used to characterize the lip contour. For example, segmentation and extraction of the lip region may be achieved by first locating the face in the face image and then coarsely segmenting the image region containing the lips, separating the lips from the surrounding skin. After the lip region is segmented, the lip keypoints are extracted. The keypoints of the lip region characterize the contour of the lips, as shown in fig. 4, and can be used to describe lip changes such as opening and closing. In one embodiment, keypoints may be acquired at equal intervals along the upper edge of the upper lip, the upper edge of the lower lip, and the lower edge of the lower lip. In another embodiment, the keypoints of the lip region may include the left mouth corner, the right mouth corner, the mouth-corner center, sampling points on the inner boundary of the upper and lower lips, and sampling points on the outer boundary of the upper and lower lips.
The positional changes of the lip-region keypoints are correlated when a person speaks. Therefore, in this embodiment, the relative positions of the lip-region keypoints are used as lip motion information: the cumulative relative variation of the lip region is determined from the lip-region keypoint sequences of two adjacent face-image frames, and this cumulative variation serves as the lip motion information. "Relative position" refers to position after normalization. For example, given 15 ordered lip keypoints, normalization subtracts the coordinates of the first keypoint from the coordinates of every keypoint, yielding normalized coordinates that describe the shape of the lips independently of their position. This matters because the user's head may move while speaking, shifting the coordinate origin from frame to frame when the lip keypoints are extracted. Normalizing the keypoints in each face image therefore eliminates the deviation caused by head movement and improves accuracy.
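The normalization step can be sketched as follows; the function name and the small three-point example are illustrative assumptions, not taken from the patent.

```python
from typing import List, Tuple

Point = Tuple[float, float]

def normalize_keypoints(keypoints: List[Point]) -> List[Point]:
    """Subtract the first keypoint's coordinates from every keypoint,
    so the result describes lip shape independently of head position."""
    if not keypoints:
        return []
    ox, oy = keypoints[0]
    return [(x - ox, y - oy) for (x, y) in keypoints]

# A translated copy of the same lip shape normalizes to identical
# coordinates, so head translation between frames does not register
# as lip motion.
shape = [(10.0, 20.0), (12.0, 19.0), (14.0, 20.0)]
shifted = [(x + 5.0, y - 3.0) for (x, y) in shape]
assert normalize_keypoints(shape) == normalize_keypoints(shifted)
```

Note that this compensates only for translation of the head, which is the deviation the passage above attributes to head movement; rotation or scale changes would need a fuller alignment step.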
After the lip-region keypoints are extracted from each face-image frame, the change between the keypoints of the current frame and those of the previous frame is compared continuously. The relative position changes of all keypoints are summed; if the sum exceeds a specified threshold, obvious lip movement is considered to have occurred, and the capture time of the current frame is taken as the first starting endpoint time. When, during lip movement, the sum of the relative position changes of all keypoints falls below the specified threshold, the lips are considered to have stopped moving, and the capture time of the current frame is taken as the first ending endpoint time.
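This threshold test can be sketched as a small scan over per-frame variation values; the threshold value, function name, and frame-index (rather than timestamp) output are illustrative assumptions.

```python
from typing import List, Optional, Tuple

def lip_motion_endpoints(variations: List[float],
                         threshold: float = 1.0) -> Optional[Tuple[int, int]]:
    """Given the per-frame cumulative relative variation of the lip
    keypoints, return (start_frame, end_frame): the frame where the
    variation first exceeds the threshold (lips start moving) and the
    frame where it next falls below it (lips stop moving)."""
    start = None
    for i, v in enumerate(variations):
        if start is None and v > threshold:
            start = i          # first starting endpoint
        elif start is not None and v < threshold:
            return start, i    # first ending endpoint
    # Motion that never stopped before the sequence ended.
    return None if start is None else (start, len(variations) - 1)

# Quiet frames, then motion in frames 2-4, then quiet again:
assert lip_motion_endpoints([0.1, 0.2, 2.5, 3.0, 1.8, 0.3]) == (2, 5)
```

Mapping the returned frame indices to capture timestamps would give the first starting and ending endpoint times described above.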
For example, the lip-region keypoint sequence at time t (assuming 15 keypoints in total) is:
keypoint_t = [(a1, b1), (a2, b2), ..., (a15, b15)]
and the lip-region keypoint sequence at time t+1 is:
keypoint_{t+1} = [(x1, y1), (x2, y2), ..., (x15, y15)]
The cumulative relative variation is then the sum, over all keypoints, of the change in their normalized (first-keypoint-relative) positions between the two frames.
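The published formula for the cumulative relative variation is not reproduced in this text; one plausible reading, consistent with the normalization described above, sums the Euclidean displacement of each normalized keypoint between the two frames. The Euclidean metric is an assumption.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]

def cumulative_relative_variation(prev: List[Point],
                                  curr: List[Point]) -> float:
    """Sum, over all keypoints, of how far each normalized
    (first-keypoint-relative) position moved between two frames."""
    def normalize(pts: List[Point]) -> List[Point]:
        ox, oy = pts[0]
        return [(x - ox, y - oy) for (x, y) in pts]

    p, c = normalize(prev), normalize(curr)
    return sum(math.hypot(cx - px, cy - py)
               for (px, py), (cx, cy) in zip(p, c))

# Identical lip shapes, even with the head translated between frames,
# yield zero variation; an opening mouth yields a positive value that
# is compared against the lip-motion threshold.
closed = [(0.0, 0.0), (5.0, 1.0), (10.0, 0.0)]
moved_head = [(2.0, 3.0), (7.0, 4.0), (12.0, 3.0)]
assert cumulative_relative_variation(closed, moved_head) == 0.0
```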
for the second start endpoint time and the second end endpoint time, it may be determined according to a voice endpoint detection algorithm (VAD, voice Activity Detection).
After the first starting endpoint time, first ending endpoint time, second starting endpoint time, and second ending endpoint time are determined, the target starting endpoint time T1 and the target ending endpoint time T2 are determined by a formula combining the first starting endpoint time Tk1, the second starting endpoint time Tk2, the first ending endpoint time Tm1, and the second ending endpoint time Tm2.
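The fusion formula itself is not reproduced in this text. One natural reconstruction, consistent with the stated goal of extracting the whole utterance but an assumption rather than the published formula, takes the earlier of the two starting endpoints and the later of the two ending endpoints.

```python
from typing import Tuple

def fuse_endpoints(t_k1: float, t_k2: float,
                   t_m1: float, t_m2: float) -> Tuple[float, float]:
    """Fuse lip-motion endpoints (t_k1, t_m1) with audio-VAD endpoints
    (t_k2, t_m2): start as early and end as late as either modality
    suggests, so the extracted segment covers the whole utterance.
    This min/max rule is an assumed reconstruction, not the patent's
    published formula."""
    t1 = min(t_k1, t_k2)   # target starting endpoint time T1
    t2 = max(t_m1, t_m2)   # target ending endpoint time T2
    return t1, t2

# Lips start moving slightly before audible speech and stop after it:
assert fuse_endpoints(1.8, 2.0, 5.1, 4.9) == (1.8, 5.1)
```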
Step S103: extracting the face images to be recognized between the target starting endpoint time and the target ending endpoint time from the face image sequence, and/or extracting the audio data to be recognized between those times from the audio data.
Step S104: determining the user's instruction according to the face images to be recognized and/or the audio data to be recognized.
Step S105: controlling the voice interaction device to perform the corresponding operation according to the instruction.
In an alternative embodiment, the instruction spoken by the user may be determined only by the face image to be recognized, the instruction spoken by the user may be determined only by the audio data to be recognized, and the instruction spoken by the user may be determined by combining the face image to be recognized and the audio data to be recognized.
The process for determining the instruction spoken by the user by combining the face image to be recognized and the audio data to be recognized comprises the following steps:
carrying out semantic recognition on the audio data to be recognized, and determining the semantics corresponding to the audio data to be recognized;
determining a first similarity between the semantic meaning and a preset keyword;
determining lip motion information to be recognized of the user according to the face image to be recognized, and determining second similarity between the lip motion information to be recognized and a lip motion sequence of the preset keyword;
obtaining the confidence coefficient of the preset keyword according to the first similarity and the second similarity;
and determining the preset keywords as the instructions of the users under the condition that the confidence degrees of the preset keywords are larger than or equal to a preset threshold value.
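The text above does not specify how the two similarities are combined into a confidence score. A minimal sketch, assuming a simple weighted average with an assumed audio weight and threshold, is:

```python
def keyword_confidence(sim_audio, sim_lip, weight_audio=0.6):
    """Combine semantic similarity and lip-motion similarity into one confidence.

    weight_audio is an assumed parameter; the patent does not disclose
    how the first and second similarities are weighted.
    """
    return weight_audio * sim_audio + (1.0 - weight_audio) * sim_lip

def is_user_instruction(sim_audio, sim_lip, threshold=0.7):
    """Accept the preset keyword as the user's instruction when the
    confidence meets the preset threshold."""
    return keyword_confidence(sim_audio, sim_lip) >= threshold

print(is_user_instruction(0.9, 0.8))  # True  (confidence ~0.86)
print(is_user_instruction(0.4, 0.5))  # False (confidence ~0.44)
```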
The preset keyword may be a wake-up word of the voice interaction device, or may be a preset control word, for example, a control word for controlling the voice interaction device to increase or decrease the volume. And waking up the voice interaction device under the condition that the instruction uttered by the user is determined to be a wake-up word. And under the condition that the instruction uttered by the user is determined to be a preset control word, controlling the voice interaction equipment to execute corresponding operation.
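The dispatch described above — wake the device on a wake-up word, otherwise execute the matched control word — can be sketched as follows. The class, method, and keyword names are hypothetical, introduced only for illustration:

```python
class VoiceDevice:
    """Hypothetical device stub; names are illustrative, not from the patent."""

    def __init__(self):
        self.awake = False
        self.volume = 5

    def handle_instruction(self, keyword):
        if keyword == "wake_word":
            self.awake = True       # wake-up word wakes the device
        elif keyword == "volume_up":
            self.volume += 1        # preset control word: increase volume
        elif keyword == "volume_down":
            self.volume -= 1        # preset control word: decrease volume

dev = VoiceDevice()
dev.handle_instruction("wake_word")
dev.handle_instruction("volume_up")
print(dev.awake, dev.volume)  # True 6
```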
According to the control method of the voice interaction device, visual features and audio features are fused to obtain multi-modal information for determining the start point and end point of voice activity. As a result, voice endpoints can be identified accurately even in noisy environments or when the user speaks quietly, the precision of segmenting the voice sequence is improved, the shortcomings of audio-only endpoint detection are overcome, and the accuracy of user instruction recognition is further improved. Once the start point and end point of voice activity are determined, the voice segment spoken by the user and the facial images captured while speaking can be extracted accurately; the instruction spoken by the user can then be recognized from the voice segment or the facial images alone, or from the two in combination, further raising the recognition rate and reducing the false recognition rate.
Fig. 5 schematically illustrates a structural diagram of a control apparatus 500 of a voice interaction device according to an embodiment of the present application, and as shown in fig. 5, the control apparatus 500 of a voice interaction device includes:
the acquisition module 501 is used for acquiring a human face image sequence acquired by the vision sensor and audio data acquired by the sound sensor;
a time determining module 502, configured to combine the face image sequence and the audio data to determine a target start endpoint time and a target end endpoint time;
an extracting module 503, configured to extract a face image to be identified between the target starting endpoint time and the target ending endpoint time from the face image sequence, and/or extract audio data to be identified between the target starting endpoint time and the target ending endpoint time from the audio data;
the recognition module 504 is configured to determine a user instruction according to the face image to be recognized and/or the audio data to be recognized;
and the control module 505 is configured to control the voice interaction device to execute a corresponding operation according to the instruction.
Optionally, the time determining module 502 is further configured to: determining lip action information of the user according to the face image sequence; and determining target starting endpoint time and target ending endpoint time according to the lip motion information and the audio data.
Optionally, the time determining module 502 is further configured to: determining a first starting end point moment and a first ending end point moment according to the lip action information; the first starting end point moment is the moment for determining the lip movement of the user, and the first ending end point moment is the moment for determining the lip movement stopping of the user; determining a second starting endpoint time and a second ending endpoint time according to the audio data; the second starting endpoint time is the time for determining the user to start voice, and the second ending endpoint time is the time for determining the user to stop voice; and determining the target starting endpoint time according to the first starting endpoint time and the second starting endpoint time, and determining the target ending endpoint time according to the first ending endpoint time and the second ending endpoint time.
Optionally, the identifying module 504 is further configured to: carrying out semantic recognition on the audio data to be recognized, and determining the semantics corresponding to the audio data to be recognized; determining a first similarity between the semantic meaning and a preset keyword; determining lip motion information to be recognized of the user according to the face image to be recognized, and determining second similarity between the lip motion information to be recognized and a lip motion sequence of the preset keyword; obtaining the confidence coefficient of the preset keyword according to the first similarity and the second similarity; and determining the preset keywords as the instructions of the users under the condition that the confidence degrees of the preset keywords are larger than or equal to a preset threshold value.
Optionally, the time determining module 502 is further configured to: performing key point detection on the face image aiming at each frame of face image in the face image sequence, and determining a lip region key point sequence corresponding to the face image; determining accumulated relative variation of the lip areas according to the lip area key point sequences corresponding to the two adjacent frames of face images; and determining lip action information of the user according to the accumulated relative variation of the lip area.
Optionally, the time determining module 502 is further configured to determine the target starting endpoint time and the target ending endpoint time according to the following formula:
where T1 denotes the target starting endpoint time, Tk1 the first starting endpoint time, Tk2 the second starting endpoint time, T2 the target ending endpoint time, Tm1 the first ending endpoint time, and Tm2 the second ending endpoint time.
The embodiment of the application provides a control apparatus for a voice interaction device that fuses visual features and audio features to obtain multi-modal information for determining the start point and end point of voice activity. Voice endpoints can therefore be identified accurately even in noisy environments or when the user speaks quietly, the precision of segmenting the voice sequence is improved, the shortcomings of audio-only endpoint detection are overcome, and the accuracy of user instruction recognition is further improved. Once the start point and end point of voice activity are determined, the voice segment spoken by the user and the facial images captured while speaking can be extracted accurately; the instruction spoken by the user can then be recognized from the voice segment or the facial images alone, or from the two in combination, further raising the recognition rate and reducing the false recognition rate.
The device can execute the method provided by the embodiment of the application, and has the corresponding functional modules and beneficial effects of the execution method. Technical details not described in detail in this embodiment may be found in the methods provided in the embodiments of the present application.
Fig. 6 schematically illustrates an electronic device according to an embodiment of the application. As shown in fig. 6, an electronic device 600 provided by an embodiment of the present application includes a processor 601, a communication interface 602, a memory 603, and a communication bus 604; the processor 601, the communication interface 602, and the memory 603 communicate with one another through the communication bus 604. The memory 603 is configured to store at least one executable instruction, and the processor 601 is configured to implement the control method of the voice interaction device described above when executing the executable instructions stored in the memory 603.
Specifically, when the control method of the voice interaction device is implemented, the executable instructions cause the processor to perform the following steps: acquiring a human face image sequence acquired by a visual sensor and audio data acquired by a sound sensor; combining the face image sequence and the audio data to determine a target starting endpoint moment and a target ending endpoint moment; extracting face images to be recognized between the target starting endpoint time and the target ending endpoint time from the face image sequence, and/or extracting audio data to be recognized between the target starting endpoint time and the target ending endpoint time from the audio data; determining a user instruction according to the face image to be recognized and/or the audio data to be recognized; and controlling the voice interaction equipment to execute corresponding operation according to the instruction.
The memory 603 may be an electronic memory such as a flash memory, an EEPROM (electrically erasable programmable read only memory), an EPROM, a hard disk, or a ROM. The memory 603 has storage space for program code for performing any of the method steps described above. For example, the memory space for the program code may include individual program code for implementing individual steps in the above method, respectively. The program code can be read from or written to one or more computer program products. These computer program products comprise a program code carrier such as a hard disk, compact Disk (CD), memory card or floppy disk. Such computer program products are typically portable or fixed storage units. The storage unit may have a memory segment or a memory space or the like arranged similarly to the memory 603 in the electronic device described above. The program code may be compressed, for example, in a suitable form. Typically, the storage unit comprises a program for performing the method steps according to an embodiment of the application, i.e. code that can be read by a processor, such as 601, for example, which when run by an electronic device causes the electronic device to perform the various steps in the method described above.
As another aspect, the present application also provides a computer-readable medium that may be contained in the apparatus described in the above embodiments; or may be present alone without being fitted into the device. The computer readable medium carries one or more programs which, when executed by a device, cause the device to: acquiring a human face image sequence acquired by a visual sensor and audio data acquired by a sound sensor; combining the face image sequence and the audio data to determine a target starting endpoint moment and a target ending endpoint moment; extracting face images to be recognized between the target starting endpoint time and the target ending endpoint time from the face image sequence, and/or extracting audio data to be recognized between the target starting endpoint time and the target ending endpoint time from the audio data; determining a user instruction according to the face image to be recognized and/or the audio data to be recognized; and controlling the voice interaction equipment to execute corresponding operation according to the instruction.
The computer readable medium shown in the present application may be a computer readable signal medium or a computer readable storage medium, or any combination of the two. The computer readable storage medium can be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples of the computer-readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In the present application, however, the computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, with the computer-readable program code embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The above embodiments do not limit the scope of the present application. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives can occur depending upon design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present application should be included in the scope of the present application.
Claims (10)
1. A control method of a voice interaction device, comprising:
acquiring a human face image sequence acquired by a visual sensor and audio data acquired by a sound sensor;
combining the face image sequence and the audio data to determine a target starting endpoint moment and a target ending endpoint moment;
extracting face images to be recognized between the target starting endpoint time and the target ending endpoint time from the face image sequence, and/or extracting audio data to be recognized between the target starting endpoint time and the target ending endpoint time from the audio data;
determining a user instruction according to the face image to be recognized and/or the audio data to be recognized;
and controlling the voice interaction equipment to execute corresponding operation according to the instruction.
2. The method of claim 1, wherein determining a target start endpoint time and a target end endpoint time in combination with the sequence of face images and the audio data comprises:
determining lip action information of the user according to the face image sequence;
and determining target starting endpoint time and target ending endpoint time according to the lip motion information and the audio data.
3. The method of claim 2, wherein determining a target start endpoint time and a target end endpoint time based on the lip motion information and the audio data comprises:
determining a first starting end point moment and a first ending end point moment according to the lip action information; the first starting end point moment is the moment for determining the lip movement of the user, and the first ending end point moment is the moment for determining the lip movement stopping of the user;
determining a second starting endpoint time and a second ending endpoint time according to the audio data; the second starting endpoint time is the time for determining the user to start voice, and the second ending endpoint time is the time for determining the user to stop voice;
and determining the target starting endpoint time according to the first starting endpoint time and the second starting endpoint time, and determining the target ending endpoint time according to the first ending endpoint time and the second ending endpoint time.
4. The method according to claim 2, wherein determining the user's instructions from the face image to be identified and/or the audio data to be identified comprises:
carrying out semantic recognition on the audio data to be recognized, and determining the semantics corresponding to the audio data to be recognized;
determining a first similarity between the semantic meaning and a preset keyword;
determining lip motion information to be recognized of the user according to the face image to be recognized, and determining second similarity between the lip motion information to be recognized and a lip motion sequence of the preset keyword;
obtaining the confidence coefficient of the preset keyword according to the first similarity and the second similarity;
and determining the preset keywords as the instructions of the users under the condition that the confidence degrees of the preset keywords are larger than or equal to a preset threshold value.
5. The method of any of claims 2-4, wherein determining lip motion information of the user from the sequence of face images comprises:
performing key point detection on the face image aiming at each frame of face image in the face image sequence, and determining a lip region key point sequence corresponding to the face image;
determining accumulated relative variation of the lip areas according to the lip area key point sequences corresponding to the two adjacent frames of face images;
and determining lip action information of the user according to the accumulated relative variation of the lip area.
6. A method according to claim 3, wherein the target start endpoint time and the target end endpoint time are determined according to the following equation:
where T1 denotes the target starting endpoint time, Tk1 the first starting endpoint time, Tk2 the second starting endpoint time, T2 the target ending endpoint time, Tm1 the first ending endpoint time, and Tm2 the second ending endpoint time.
7. The method of claim 4, wherein the preset keyword comprises a wake word.
8. A control apparatus for a voice interaction device, comprising:
the acquisition module is used for acquiring the human face image sequence acquired by the vision sensor and the audio data acquired by the sound sensor;
the time determining module is used for combining the face image sequence and the audio data to determine a target starting endpoint moment and a target ending endpoint moment;
the extraction module is used for extracting a face image to be identified between the target starting endpoint time and the target ending endpoint time from the face image sequence and/or extracting audio data to be identified between the target starting endpoint time and the target ending endpoint time from the audio data;
the identification module is used for determining a user instruction according to the face image to be identified and/or the audio data to be identified;
and the control module is used for controlling the voice interaction equipment to execute corresponding operation according to the instruction.
9. An electronic device, comprising:
one or more processors;
storage means for storing one or more programs,
when executed by the one or more processors, causes the one or more processors to implement the method of any of claims 1-7.
10. A computer readable medium, on which a computer program is stored, characterized in that the program, when being executed by a processor, implements the method according to any of claims 1-7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210183692.2A CN116705016A (en) | 2022-02-24 | 2022-02-24 | Control method and device of voice interaction equipment, electronic equipment and medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210183692.2A CN116705016A (en) | 2022-02-24 | 2022-02-24 | Control method and device of voice interaction equipment, electronic equipment and medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116705016A true CN116705016A (en) | 2023-09-05 |
Family
ID=87828090
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210183692.2A Pending CN116705016A (en) | 2022-02-24 | 2022-02-24 | Control method and device of voice interaction equipment, electronic equipment and medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116705016A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117370961A (en) * | 2023-12-05 | 2024-01-09 | 江西五十铃汽车有限公司 | Vehicle voice interaction method and system |
-
2022
- 2022-02-24 CN CN202210183692.2A patent/CN116705016A/en active Pending
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117370961A (en) * | 2023-12-05 | 2024-01-09 | 江西五十铃汽车有限公司 | Vehicle voice interaction method and system |
CN117370961B (en) * | 2023-12-05 | 2024-03-15 | 江西五十铃汽车有限公司 | Vehicle voice interaction method and system |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
WO2021082941A1 (en) | Video figure recognition method and apparatus, and storage medium and electronic device | |
CN107622770B (en) | Voice wake-up method and device | |
CN109410957B (en) | Front human-computer interaction voice recognition method and system based on computer vision assistance | |
EP3614377A1 (en) | Object identifying method, computer device and computer readable storage medium | |
WO2016150001A1 (en) | Speech recognition method, device and computer storage medium | |
US9899025B2 (en) | Speech recognition system adaptation based on non-acoustic attributes and face selection based on mouth motion using pixel intensities | |
US20150325240A1 (en) | Method and system for speech input | |
US8416998B2 (en) | Information processing device, information processing method, and program | |
CN111797632B (en) | Information processing method and device and electronic equipment | |
CN109785846B (en) | Role recognition method and device for mono voice data | |
CN111048113A (en) | Sound direction positioning processing method, device and system, computer equipment and storage medium | |
CN110706707B (en) | Method, apparatus, device and computer-readable storage medium for voice interaction | |
Ivanko et al. | Multimodal speech recognition: increasing accuracy using high speed video data | |
CN112397093B (en) | Voice detection method and device | |
CN111326152A (en) | Voice control method and device | |
CN113593597B (en) | Voice noise filtering method, device, electronic equipment and medium | |
CN116705016A (en) | Control method and device of voice interaction equipment, electronic equipment and medium | |
CN109065026B (en) | Recording control method and device | |
CN112669837B (en) | Awakening method and device of intelligent terminal and electronic equipment | |
CN114239610A (en) | Multi-language speech recognition and translation method and related system | |
KR20210066774A (en) | Method and Apparatus for Distinguishing User based on Multimodal | |
Seong et al. | A review of audio-visual speech recognition | |
CN111477226A (en) | Control method, intelligent device and storage medium | |
CN114399992B (en) | Voice instruction response method, device and storage medium | |
CN114282621B (en) | Multi-mode fused speaker role distinguishing method and system |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||