WO2021196648A1 - Method, apparatus, device, and storage medium for driving an interactive object - Google Patents

Method, apparatus, device, and storage medium for driving an interactive object

Info

Publication number
WO2021196648A1
Authority
WO
WIPO (PCT)
Prior art keywords
image
key point
target object
distance
processed
Prior art date
Application number
PCT/CN2020/129855
Other languages
English (en)
French (fr)
Inventor
陈智辉
Original Assignee
北京市商汤科技开发有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 北京市商汤科技开发有限公司
Priority to JP2021549762A priority Critical patent/JP2022531055A/ja
Priority to KR1020217027719A priority patent/KR20210124313A/ko
Priority to SG11202109202VA priority patent/SG11202109202VA/en
Publication of WO2021196648A1 publication Critical patent/WO2021196648A1/zh

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/01 Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F3/011 Arrangements for interaction with the human body, e.g. for user immersion in virtual reality
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168 Feature extraction; Face representation
    • G06V40/171 Local features and components; Facial parts; Occluding parts, e.g. glasses; Geometrical relationships
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V40/174 Facial expression recognition
    • G06V40/176 Dynamic expression

Definitions

  • The present disclosure relates to the field of computer technology, and in particular to a method, apparatus, device, and storage medium for driving interactive objects.
  • Human-computer interaction mostly works as follows: the user provides input through keys, touch, or voice, and the device responds by presenting images, text, or virtual characters on a display screen.
  • Virtual characters are mostly improvements built on voice assistants, and the interaction between users and virtual characters remains superficial.
  • The embodiments of the present disclosure provide a driving solution for interactive objects.
  • A method for driving an interactive object includes: acquiring a first image; identifying a facial region image in the first image that contains at least the mouth of a target object, and determining key point information of the mouth contained in the facial region image; determining, according to the key point information of the mouth, whether the target object in the first image is in a speaking state; and in response to determining that the target object in the first image is in a speaking state, driving the interactive object to respond.
  • The key point information of the mouth includes position information of multiple key points located at the mouth of the target object; the multiple key points include at least one group of key point pairs, each key point pair including two key points located at the upper lip and the lower lip respectively; determining whether the target object is in a speaking state according to the key point information of the mouth includes: determining, according to the position information of the at least one group of key point pairs, a first distance between the two key points located at the upper lip and the lower lip in each key point pair; and determining, according to the first distance of each group of key point pairs, whether the target object in the first image is in a speaking state.
  • The first image is a frame in an image sequence; determining, according to the first distance of each group of key point pairs, whether the target object in the first image is in a speaking state includes: acquiring a set number of images to be processed from the image sequence, the images to be processed including the first image and at least one frame of second image; for each frame of second image, acquiring the first distance of each group of key point pairs in the second image; and determining whether the target object in the first image is in a speaking state according to the first distance of each group of key point pairs in the first image and the first distance of each group of key point pairs in each frame of second image.
  • Acquiring a set number of images to be processed includes: sliding a window of a set length over the image sequence with a set step size, and acquiring the set number of images to be processed at each slide, wherein the first image is the last frame in the window.
  • The first distance of a key point pair includes the Euclidean distance between the two key points in the key point pair.
  • Determining whether the target object in the first image is in a speaking state includes: identifying target images among the images to be processed; determining the number of target images contained in the images to be processed; and in response to the ratio between the number of target images and the set number of images to be processed being greater than a set ratio, determining that the target object in the first image is in a speaking state.
  • Identifying the target images among the images to be processed includes: determining an image in which the average Euclidean distance of the groups of key point pairs is greater than a first set threshold as a target image; or determining an image in which the weighted average of the Euclidean distances of the groups of key point pairs is greater than a second set threshold as a target image.
  • the first set threshold and the second set threshold are determined according to the resolution of the image to be processed.
  • Driving the interactive object to respond includes: when the interactive object is in a standby state, in response to determining for the first time that the target object in the first image is in a speaking state, driving the interactive object to enter a state of interacting with the target object.
  • A device for driving an interactive object includes: an acquisition unit for acquiring a first image; an identification unit for identifying a facial region image in the first image that contains at least the mouth of a target object, and determining key point information of the mouth contained in the facial region image; a determining unit for determining, according to the key point information of the mouth, whether the target object in the first image is in a speaking state; and a driving unit for driving the interactive object to respond in response to determining that the target object in the first image is in a speaking state.
  • The key point information of the mouth includes position information of multiple key points located at the mouth of the target object; the multiple key points include at least one group of key point pairs, each key point pair including two key points located at the upper lip and the lower lip respectively; when determining whether the target object is in a speaking state according to the key point information of the mouth, the determining unit is configured to: determine, according to the position information of the at least one group of key point pairs, the first distance between the two key points located at the upper lip and the lower lip in each key point pair; and determine, according to the first distance of each group of key point pairs, whether the target object in the first image is in a speaking state.
  • The first image is a frame in an image sequence; when determining, according to the first distance of each group of key point pairs, whether the target object in the first image is in a speaking state, the determining unit is configured to: acquire a set number of images to be processed from the image sequence, the images to be processed including the first image and at least one frame of second image; for each frame of second image, acquire the first distance of each group of key point pairs in the second image; and determine whether the target object in the first image is in a speaking state according to the first distance of each group of key point pairs in the first image and the first distance of each group of key point pairs in each frame of second image.
  • When acquiring a set number of images to be processed from the image sequence, the determining unit is configured to: slide a window of a set length over the image sequence with a set step size, and acquire the set number of images to be processed at each slide, wherein the first image is the last frame in the window.
  • the first distance of the key point pair includes the Euclidean distance between two key points in the key point pair
  • the determining unit is configured according to each of the key points in the first image.
  • When identifying the target images among the images to be processed, the determining unit is configured to: determine an image in which the average Euclidean distance of the groups of key point pairs is greater than the first set threshold as a target image; or determine an image in which the weighted average of the Euclidean distances of the groups of key point pairs is greater than the second set threshold as a target image.
  • the first set threshold and the second set threshold are determined according to the resolution of the image to be processed.
  • The driving unit is specifically configured to: when the interactive object is in a standby state, in response to determining for the first time that the target object in the first image is in a speaking state, drive the interactive object to enter a state of interacting with the target object.
  • The method, apparatus, device, and computer-readable storage medium for driving an interactive object recognize the first image to obtain a facial region image that contains at least the mouth of the target object, determine the key point information of the mouth in the facial region image, and determine whether the target object in the first image is in a speaking state according to the key point information of the mouth, so as to drive the interactive object to respond.
  • By judging in real time from the first image whether the target object is speaking, the interactive object can respond promptly to the target object's speech and enter the interactive state even when the target object has no touch interaction with the terminal device that displays the interactive object, which improves the interactive experience of the target object.
  • FIG. 1 is a schematic diagram of a display device in a method for driving an interactive object according to an embodiment of the present disclosure.
  • FIG. 2 is a flowchart of a method for driving an interactive object according to an embodiment of the present disclosure.
  • FIG. 3 is a schematic diagram of key points of the mouth in the method for driving an interactive object according to an embodiment of the present disclosure.
  • FIG. 4 is a schematic structural diagram of a driving device for an interactive object according to an embodiment of the present disclosure.
  • FIG. 5 is a schematic structural diagram of an electronic device according to an embodiment of the present disclosure.
  • At least one embodiment of the present disclosure provides a method for driving interactive objects.
  • the driving method may be executed by electronic devices such as a terminal device or a server.
  • The terminal device may be a fixed terminal or a mobile terminal, such as a mobile phone, a tablet, or a game console.
  • the server includes a local server or a cloud server, etc., and the method can also be implemented by a processor calling computer-readable instructions stored in a memory.
  • The interactive object can be any object capable of interacting with the target object. It can be a virtual character, virtual animal, virtual item, cartoon figure, or other avatar that can realize interactive functions. The display form of the avatar may be either 2D or 3D, which is not limited in the present disclosure.
  • the target object may be a user, a robot, or other smart devices.
  • the interaction manner between the interaction object and the target object may be an active interaction manner or a passive interaction manner.
  • the target object can make a demand by making gestures or body movements, and trigger the interactive object to interact with it by means of active interaction.
  • the interactive object may actively greet the target object, prompt the target object to make an action, etc., so that the target object interacts with the interactive object in a passive manner.
  • The interactive object may be displayed through electronic equipment. The electronic equipment may also be a TV, an all-in-one machine with a display function, a projector, a virtual reality (VR) device, an augmented reality (AR) device, etc.; the present disclosure does not limit the specific form of the electronic device.
  • FIG. 1 shows a display device according to an embodiment of the present disclosure.
  • the display device has a display screen, which can display a stereoscopic picture on the display screen to present a virtual scene and interactive objects.
  • The interactive objects displayed on the display screen in FIG. 1 are virtual cartoon characters.
  • The electronic device described in the present disclosure may include a built-in display, through which a stereoscopic picture may be displayed to present a virtual scene and interactive objects.
  • Alternatively, the electronic device described in the present disclosure may not include a built-in display; the content to be displayed may be sent through a wired or wireless connection to an external display, which presents the virtual scene and interactive objects.
  • In response to the electronic device receiving sound driving data for driving the interactive object to output voice, the interactive object may emit a specified voice to the target object.
  • Sound driving data can be generated according to the actions, expressions, identities, preferences, etc. of the target object around the electronic device to drive the interactive object to respond by emitting a specified voice, thereby providing anthropomorphic services for the target object.
  • at least one embodiment of the present disclosure proposes a method for driving an interactive object, so as to improve the interaction experience between the target object and the interactive object.
  • FIG. 2 shows a flowchart of a method for driving an interactive object according to an embodiment of the present disclosure. As shown in FIG. 2, the method includes steps 201 to 204.
  • In step 201, a first image is acquired.
  • the first image may be an image around an electronic device (such as a terminal device, a server, etc.) that displays an interactive object.
  • the image can be obtained through the image acquisition module of the electronic device, for example, through a built-in camera.
  • the image of the periphery of the electronic device includes images in any direction within a certain range of the electronic device, for example, may include images in one or more of the front, side, rear, and upper directions of the electronic device.
  • the range is determined according to the range of the audio signal of the set intensity that can be received by the sound detection module for detecting the audio signal.
  • the sound detection module can be provided in the electronic device as a built-in module of the electronic device, or can be used as an external device, independent of the electronic device.
  • The first image may also be an image captured by an image acquisition device and obtained through a network.
  • the image capture device may be a camera independent of the terminal device, and the camera may transmit the captured image to the electronic device that executes the method through a wired or wireless network.
  • the number of the image acquisition device may be one or more.
  • a target object such as a user
  • The first image can be captured by a camera of the terminal device or by an external device.
  • The image collected by the camera can be uploaded to the server through the network, and the server analyzes it and determines from the analysis result whether the interactive object needs to be controlled to respond; alternatively, the image can be analyzed directly by the terminal device, which determines from the analysis result whether the interactive object needs to be controlled to respond.
  • In step 202, the facial region image that contains at least the mouth of the target object in the first image is identified, and the key point information of the mouth contained in the facial region image is determined.
  • The facial region image containing the mouth of the target object may be cropped from the first image so that it becomes an independent image; facial key point detection is then performed on the facial region image to determine the key points of the mouth and obtain key point information of the mouth, such as position information.
  • Alternatively, facial key point detection may be performed directly on the image block of the facial region containing the mouth of the target object in the first image, and the key point information of the mouth contained in the first image may be determined.
  • In step 203, it is determined whether the target object in the first image is in a speaking state according to the key point information of the mouth.
  • The key point information (for example, position information) of the detected mouth differs depending on whether the mouth is open or closed.
  • When the mouth is open, the distance between the key points on the upper lip and the key points on the lower lip is usually greater than a certain value; when the mouth is closed, the distance between the key points on the upper lip and the key points on the lower lip is usually small.
  • the distance threshold used to determine whether the mouth is in an open state or a closed state is related to the position of the mouth where the key points of the upper lip and the key points of the lower lip are located. For example, the threshold for the distance between the key point at the center of the upper lip and the key point at the center of the lower lip is generally greater than the threshold for the distance between the key point at the edge of the upper lip and the key point at the edge of the lower lip.
  • If, within a set time, more than a set ratio of the images show the mouth of the target object in an open state, it can be determined that the target object is in a speaking state. Conversely, if, within the set time, the images in which the mouth is detected as open do not exceed the set ratio, it can be determined that the target object is not speaking.
  • In step 204, in response to the target object in the first image being in a speaking state, the interactive object is driven to respond.
  • Since there may be no touch interaction between the target object and the terminal device displaying the interactive object, when there are many target objects around the electronic device or image acquisition device, or a large amount of audio is being received, the electronic device may not be able to determine in time that a target object has started to interact with the interactive object when that target object starts to speak or issues a voice command.
  • By detecting whether a target object around the electronic device or the image acquisition device is in a speaking state, the interactive object can be driven to respond to the target object promptly once a target object is determined to be speaking, for example by making a gesture of listening to the target object or by making a specific response to the target object. For example, when the target object is a lady, the interactive object can be driven to say "Madam, what can I do for you?".
  • By judging in real time from the first image whether the target object is speaking, the interactive object can respond promptly to the target object's speech and enter the interactive state even when the target object has no touch interaction with the terminal device that displays the interactive object, which improves the interactive experience of the target object.
  • the key point information of the mouth includes position information of multiple key points located in the mouth of the target object; the multiple key points include at least one set of key point pairs, and the key point pairs At least include two key points located at the upper lip and the lower lip.
  • FIG. 3 shows a schematic diagram of key points of the mouth in the method for driving an interactive object according to an embodiment of the present disclosure.
  • From the key points of the mouth shown in FIG. 3, at least one group of key point pairs can be obtained, such as the key point pair (98, 102), where key point 98 is located in the middle of the upper lip and key point 102 is located in the middle of the lower lip.
  • the first distance between the two key points located at the upper lip and the lower lip in each key point pair can be determined.
  • the first distance between the key point 98 and the key point 102 can be determined according to the position information of the key point 98 and the key point 102.
  • The first distance between key point 98 and key point 102 differs depending on whether the mouth is open or closed. If the first distance between key point 98 and key point 102 is greater than the distance setting threshold, it can be determined that the mouth of the target object in the first image is in an open state; conversely, if the first distance between key point 98 and key point 102 is less than or equal to the distance setting threshold, it can be determined that the mouth of the target object is in a closed state. According to the closed or open state of the mouth, it can be determined whether the target object is in a speaking state, that is, whether the target object is currently speaking.
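The open/closed decision for a single key point pair can be sketched as follows. This is a minimal illustration, not the disclosed implementation: it assumes the first distance is a Euclidean distance in pixels (as the disclosure states elsewhere), the coordinates given for key points 98 and 102 are hypothetical, and the threshold of 9 pixels is borrowed from the example value mentioned in this document.

```python
import math


def first_distance(upper, lower):
    """Euclidean distance between an upper-lip and a lower-lip key point, in pixels."""
    return math.dist(upper, lower)


def mouth_is_open(upper, lower, distance_threshold):
    """Open if the first distance is strictly greater than the threshold;
    a distance less than or equal to the threshold counts as closed."""
    return first_distance(upper, lower) > distance_threshold


# Hypothetical pixel coordinates for key point 98 (upper lip) and 102 (lower lip).
keypoint_98 = (360.0, 600.0)
keypoint_102 = (360.0, 612.0)
print(mouth_is_open(keypoint_98, keypoint_102, distance_threshold=9.0))
```

The strict comparison mirrors the text: a distance exactly equal to the threshold is treated as a closed mouth.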
  • The selection of the key point pair is not limited to (98, 102); any key point pair in which one key point is located in the upper lip area and the other is located in the lower lip area may be used.
  • When multiple key point pairs are obtained, the average distance between the upper-lip key points and the lower-lip key points in the first image can be determined from the average or the weighted average of the first distances of the multiple key point pairs.
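As a rough sketch of the averaging described above, the function below combines the first distances of several key point pairs into a plain or weighted mean. The function name, the pair coordinates, and the weights are illustrative assumptions; the disclosure does not specify how weights are chosen.

```python
import math


def mean_first_distance(pairs, weights=None):
    """Average (or weighted average) of the Euclidean distances of key point pairs.

    pairs   -- list of ((x, y) upper-lip point, (x, y) lower-lip point)
    weights -- optional list of non-negative weights, one per pair (assumed input)
    """
    distances = [math.dist(upper, lower) for upper, lower in pairs]
    if weights is None:
        return sum(distances) / len(distances)
    if len(weights) != len(distances):
        raise ValueError("one weight is required per key point pair")
    return sum(w * d for w, d in zip(weights, distances)) / sum(weights)


# Hypothetical pairs: one at the lip center, one nearer the lip edge.
pairs = [((0.0, 0.0), (0.0, 10.0)), ((5.0, 0.0), (5.0, 4.0))]
print(mean_first_distance(pairs))                  # plain average
print(mean_first_distance(pairs, weights=[3, 1]))  # center pair weighted more
```

Weighting the center pair more heavily is one plausible choice, since the text notes that center key points separate more than edge key points when the mouth opens.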
  • The distance threshold used to determine whether the mouth is closed or open is determined according to the selected key points.
  • the first image is a frame in an image sequence.
  • the image sequence may be a video stream acquired by an image acquisition device, or multiple frames of images captured at a set frequency.
  • a set number of images to be processed may be acquired in the image sequence, and the first distance of the key point pair in each image to be processed may be obtained.
  • the image to be processed includes the first image and at least one frame of second image other than the first image.
  • For each frame of second image, the first distance of each group of key point pairs in the second image is obtained; whether the target object is in a speaking state is then determined according to the first distance of each group of key point pairs in the first image and the first distance of each group of key point pairs in each frame of second image.
  • Two second images among the images to be processed may be the two consecutive frames adjacent to the first image, or two frames spaced at the same frame interval from the first image.
  • For example, if the first image is the Nth frame in the image sequence, the two second images can be the (N-1)th and (N-2)th frames, or the (N-2)th and (N-4)th frames, and so on.
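The frame selection in this example can be sketched as an index computation. The helper name and its parameters are hypothetical; it simply returns the indices of the images to be processed, ending at the first image (frame N) and spaced by a fixed interval.

```python
def frames_to_process(n, count, interval=1):
    """Indices of the images to be processed, ending at the first image (frame n).

    count    -- total number of images, including the first image
    interval -- frame spacing between consecutive selected images
    """
    if count < 2 or interval < 1:
        raise ValueError("need at least one second image and a positive interval")
    start = n - (count - 1) * interval
    if start < 0:
        raise ValueError("not enough earlier frames in the sequence")
    return list(range(start, n + 1, interval))


print(frames_to_process(10, count=3, interval=1))  # frames N-2, N-1, N
print(frames_to_process(10, count=3, interval=2))  # frames N-4, N-2, N
```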
  • From the first distances, it can be determined whether the target object's mouth is in an open state or a closed state in each of the set number of images to be processed, and thereby whether the target object is speaking.
  • A window of a set length may be slid over the image sequence with a set step size, a set number of images to be processed may be obtained at each slide, and the first image is the last image in the window.
  • the method described in the present disclosure can detect in real time whether the target object is in a speaking state.
  • The number of collected images may keep increasing over time.
  • The first image may be the image newly added to the window, and the frame that was added to the window earliest, that is, the frame with the earliest acquisition time in the window, may be discarded when the first image is added. This ensures that the images in the window are relatively recent.
  • All the images to be processed in the window can be processed at the same time, and the mouth state of the target object in the images to be processed can be determined to determine whether the target object is in a speaking state.
  • Alternatively, the images to be processed in the window can be processed separately: whenever a new frame is added to the window, that image is processed and the mouth state of the target object in it is determined; when determining whether the target object is in a speaking state, the mouth states of the frames currently saved in the window are used.
  • The length of the window determines the number of images to be processed that it contains: the longer the window, the more images to be processed. The step size is related to the time interval at which the speaking state of the target object is judged.
  • The length and step size of the window can be set according to the actual interactive scene. For example, when the length of the window is 10 and the step size is 2, the window includes 10 images to be processed, and each slide moves the window by 2 frames in the image sequence.
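A minimal sketch of the window described here, using the example values (length 10, step size 2). The generator name is an assumption, and the integers standing in for frames are placeholders for real images.

```python
def sliding_windows(sequence, length, step):
    """Yield consecutive windows of `length` items, advancing `step` items per slide.

    The last item of each window plays the role of the first image.
    """
    for end in range(length, len(sequence) + 1, step):
        yield sequence[end - length:end]


frames = list(range(14))  # placeholder frame indices 0..13
for window in sliding_windows(frames, length=10, step=2):
    print(window[0], "...", window[-1])
```

With 14 frames this produces three windows, ending at frames 9, 11, and 13; each decision therefore reuses 8 frames from the previous window.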
  • The setting of the window length is related to the accuracy of the detection. For example, if the state of the target object is determined based on the detection result of a single image to be processed, the accuracy of the determination may be low. Judging the state of the target object based on the detection results of multiple images to be processed can improve the accuracy of judgment. However, if the window is too long, the real-time performance of the judgment suffers.
  • For example, suppose the target object starts to speak at time t1, corresponding to the Nth frame of image. Because the detection results of the other frames in the window (such as N-1, N-2, ...) still indicate that the target object is not speaking, the target object is still judged at t1 not to have started speaking. Only at time t2, when the (N+i)th frame is acquired and the detection results of more than the set ratio of the images in the window indicate that the mouth is open, is the target object judged to have started speaking, where i depends at least on the length of the window, the step size, and the set ratio. Therefore, the longer the window, the greater the time difference between t2 and t1, which impairs the real-time performance of detection.
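The dependence of the delay i on window length, step size, and set ratio can be illustrated with a small calculation. This is a simplified model under stated assumptions, not a formula from the disclosure: the mouth is assumed closed in every frame before N and open in every frame from N onward, and a window is assumed to end exactly at frame N, then at N + step, and so on.

```python
def detection_delay(window_length, set_ratio, step=1):
    """Number of frames i after frame N at which the window first reports speaking.

    Simplifying assumptions: mouth closed before frame N, open from frame N on,
    and windows end at frames N, N + step, N + 2 * step, ...
    """
    i = 0
    while True:
        open_frames = min(i + 1, window_length)  # open frames inside the window
        if open_frames / window_length > set_ratio:
            return i
        if open_frames == window_length:
            raise ValueError("the set ratio can never be exceeded")
        i += step


print(detection_delay(window_length=10, set_ratio=0.4))
print(detection_delay(window_length=20, set_ratio=0.4))
```

Under these assumptions, doubling the window length from 10 to 20 at the same ratio doubles the delay, which matches the trade-off described in the text.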
  • Whether the target object is in a speaking state in the first image is determined from the mouth state of the target object in the first image and in the second images preceding the first image, and the first image is taken as the last frame in the window, so that whether the target object is in a speaking state can be detected in real time.
  • the first distance includes the Euclidean distance between two key points in the key point pair.
  • the Euclidean distance can more accurately measure the distance and position relationship between two key points.
  • Whether the target object is speaking may be determined according to the first distance of each group of key point pairs in the first image and the first distance of each group of key point pairs in each frame of second image in the following manner.
  • Target images among the images to be processed are identified: an image in which the average Euclidean distance of the key point pairs is greater than the first set threshold is determined to be a target image, or an image in which the weighted average of the Euclidean distances of the key point pairs is greater than the second set threshold is determined to be a target image. That is, among the images to be processed, an image in which the mouth of the target object is in an open state is determined as a target image.
  • The number of target images contained in the images to be processed is determined. That is, the number of images among the images to be processed (each of which may be the first image or a second image) in which the mouth is in an open state is determined.
  • In response to the ratio between the number of target images and the set number of images to be processed being greater than the set ratio, it is determined that the target object in the first image is in a speaking state; otherwise, in response to the ratio being less than or equal to the set ratio, it is determined that the target object is not currently speaking.
  • Different Euclidean distance setting thresholds may be set according to different resolutions of the image to be processed. That is, the first set threshold and the second set threshold may be determined according to the resolution of the image to be processed.
  • the Euclidean distance setting threshold may be set to 9 (for example, 9 pixels).
  • The length of the window can be set to 10, that is, the window includes 10 images to be processed, and the window is moved with a step size of 1.
  • If the set ratio is 0.4, then when the window slides to the current image frame, the target object is determined to be in a speaking state if more than 4 of the 10 images to be processed in the window show the mouth in an open state.
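The example values above (window length 10, step size 1, set ratio 0.4, threshold 9) can be combined into a streaming sketch. The class and method names are hypothetical, the per-frame input is assumed to be an already computed average Euclidean distance, and returning False until the window is full is an assumption made here, since the text does not say how a partially filled window is handled.

```python
from collections import deque


class SpeakingDetector:
    """Keeps the mouth state of the most recent frames and applies the set ratio."""

    def __init__(self, window_length=10, set_ratio=0.4, distance_threshold=9.0):
        # deque(maxlen=...) discards the oldest state when a new one is appended.
        self.window = deque(maxlen=window_length)
        self.set_ratio = set_ratio
        self.distance_threshold = distance_threshold

    def update(self, mean_distance):
        """Add the newest frame (the first image) and return the speaking decision."""
        self.window.append(mean_distance > self.distance_threshold)
        if len(self.window) < self.window.maxlen:
            return False  # assumption: no decision until the window is full
        return sum(self.window) / len(self.window) > self.set_ratio


detector = SpeakingDetector()
# Hypothetical distances: ten closed-mouth frames, then five open-mouth frames.
decisions = [detector.update(d) for d in [3.0] * 10 + [12.0] * 5]
print(decisions[-2], decisions[-1])
```

In this run the decision flips only on the fifth open-mouth frame, when 5 of the 10 frames in the window exceed the threshold.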
  • The resolution of the image to be processed can be adjusted to 720*1080 by cropping or by scaling up or down; alternatively, the corresponding Euclidean distance setting threshold can be calculated for the actual resolution of the image to be processed.
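One way to calculate a threshold for another resolution is sketched below. The disclosure does not give a formula, so everything here is an assumption: the threshold of 9 pixels is taken to apply at 720*1080, and it is scaled by the square root of the pixel-count ratio so that uniformly enlarging the image enlarges the threshold by the same linear factor.

```python
import math

REFERENCE_RESOLUTION = (720, 1080)  # assumed reference for the example threshold
REFERENCE_THRESHOLD = 9.0           # example threshold in pixels


def scaled_threshold(width, height,
                     reference=REFERENCE_RESOLUTION,
                     reference_threshold=REFERENCE_THRESHOLD):
    """Scale the distance threshold by the linear size ratio of the two resolutions.

    Assumption: lip distances in pixels grow linearly with image size.
    """
    ref_width, ref_height = reference
    scale = math.sqrt((width * height) / (ref_width * ref_height))
    return reference_threshold * scale


print(scaled_threshold(720, 1080))
print(scaled_threshold(1440, 2160))
```

A scheme based on the detected face size rather than the image resolution would be another reasonable choice; the text only states that the threshold depends on resolution.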
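  • When the interactive object is in a standby state, that is, a state in which it is not interacting with the target object, in response to determining for the first time that the target object in the first image is in a speaking state, the interactive object may be driven to enter a state of interacting with the target object.
  • When the target object has no touch interaction with the terminal device displaying the interactive object, the above method enables the interactive object to respond promptly to the target object being in a speaking state and enter the interactive state, improving the target object's interactive experience.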
  • FIG. 4 shows a schematic structural diagram of an apparatus for driving an interactive object according to an embodiment of the present disclosure.
  • As shown in FIG. 4, the apparatus may include: an acquiring unit 401 configured to acquire a first image; an identifying unit 402 configured to identify, in the first image, a face region image that contains at least the mouth of a target object, and to determine key point information of the mouth contained in the face region image; a determining unit 403 configured to determine, according to the key point information of the mouth, whether the target object in the first image is in a speaking state; and a driving unit 404 configured to drive the interactive object to respond in response to determining that the target object in the first image is in a speaking state.
  • In some embodiments, the key point information of the mouth includes position information of a plurality of key points located at the mouth of the target object; the plurality of key points include at least one key point pair, and each key point pair includes two key points located at the upper lip and the lower lip, respectively. When determining whether the target object is in a speaking state according to the key point information of the mouth, the determining unit 403 is further configured to: determine, according to the position information of the at least one key point pair, a first distance between the two key points of each key point pair; and determine, according to the first distance of each key point pair, whether the target object in the first image is in a speaking state.
  • In some embodiments, the first image is one frame in an image sequence. When determining, according to the first distance of each key point pair, whether the target object in the first image is in a speaking state, the determining unit 403 is configured to: acquire a set number of images to be processed from the image sequence, the images to be processed including the first image and at least one frame of second image; for each frame of second image, acquire the first distance of each key point pair in that second image; and determine, according to the first distance of each key point pair in the first image and the first distance of each key point pair in each frame of second image, whether the target object in the first image is in a speaking state.
  • In some embodiments, when acquiring a set number of images to be processed from the image sequence, the determining unit 403 is configured to: slide a window of a set length over the image sequence with a set step size, acquiring a set number of images to be processed at each slide, where the first image is the last frame in the window.
  • In some embodiments, the first distance of a key point pair includes the Euclidean distance between the two key points of the pair. When determining whether the target object in the first image is in a speaking state from the first distance of each key point pair in the first image and in each frame of second image, the determining unit 403 is configured to: identify target images among the images to be processed; determine the number of target images contained in the images to be processed; and, in response to the ratio between the number of target images and the set number of images to be processed being greater than a set ratio, determine that the target object in the first image is in a speaking state.
  • In some embodiments, when determining the target images among the images to be processed, the determining unit 403 is configured to determine as a target image an image in which the average of the Euclidean distances of the key point pairs is greater than a first set threshold; or to determine as a target image an image in which the weighted average of the Euclidean distances of the key point pairs is greater than a second set threshold.
  • In some embodiments, the first set threshold and the second set threshold are determined according to the resolution of the images to be processed.
  • In some embodiments, the driving unit 404 is configured to: when the interactive object is in a standby state, in response to determining for the first time that the target object in the first image is in a speaking state, drive the interactive object to enter a state of interacting with the target object.
  • An embodiment of the present disclosure also provides an electronic device. As shown in FIG. 5, the device includes a memory and a processor. The memory is configured to store computer instructions executable on the processor, and the processor is configured to implement, when executing the computer instructions, the method for driving an interactive object described in any embodiment of the present disclosure.
  • In some embodiments, the device is, for example, a server or a terminal device, which determines the speaking state of the target object according to the key point information of the mouth in the first image and thereby controls the interactive object shown on a display.
  • When the terminal device includes a display, the display further includes a display screen or a transparent display screen for displaying animations of the interactive object.
  • An embodiment of the present disclosure also provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the method for driving an interactive object described in any embodiment of the present disclosure.
  • Those skilled in the art should understand that one or more embodiments of the present disclosure may be provided as a method, a system, or a computer program product. Therefore, one or more embodiments may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware. Moreover, one or more embodiments may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, and optical storage) containing computer-usable program code.
  • Embodiments of the subject matter described in the present disclosure may be implemented as one or more computer programs, that is, one or more modules of computer program instructions encoded on a tangible non-transitory program carrier to be executed by a data processing apparatus or to control the operation of the data processing apparatus.
  • Alternatively or additionally, the program instructions may be encoded on an artificially generated propagated signal, such as a machine-generated electrical, optical, or electromagnetic signal, which is generated to encode information and transmit it to a suitable receiver apparatus for execution by the data processing apparatus.
  • The computer storage medium may be a machine-readable storage device, a machine-readable storage substrate, a random or serial access memory device, or a combination of one or more of them.
  • The processing and logic flows described in the present disclosure can be executed by one or more programmable computers executing one or more computer programs to perform corresponding functions by operating on input data and generating output.
  • The processing and logic flows can also be executed by dedicated logic circuitry, such as an FPGA (Field Programmable Gate Array) or an ASIC (Application Specific Integrated Circuit), and the apparatus can also be implemented as dedicated logic circuitry.
  • Computers suitable for executing computer programs include, for example, general-purpose and/or special-purpose microprocessors, or any other type of central processing unit.
  • the central processing unit will receive instructions and data from a read-only memory and/or a random access memory.
  • the basic components of a computer include a central processing unit for implementing or executing instructions and one or more memory devices for storing instructions and data.
  • Generally, the computer will also include one or more mass storage devices for storing data, such as magnetic disks, magneto-optical disks, or optical disks, or will be operatively coupled to such a mass storage device to receive data from it, transfer data to it, or both.
  • However, the computer does not have to have such devices.
  • In addition, the computer can be embedded in another device, such as a mobile phone, a personal digital assistant (PDA), a mobile audio or video player, a game console, a global positioning system (GPS) receiver, or a portable storage device such as a universal serial bus (USB) flash drive, to name just a few.
  • Computer-readable media suitable for storing computer program instructions and data include all forms of non-volatile memory, media, and memory devices, including, for example, semiconductor memory devices (such as EPROM, EEPROM, and flash memory devices), magnetic disks (such as internal hard disks or removable disks), magneto-optical disks, and CD-ROM and DVD-ROM disks.
  • the processor and the memory can be supplemented by or incorporated into a dedicated logic circuit.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

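Disclosed are a method, an apparatus, a device, and a storage medium for driving an interactive object. The method includes: acquiring a first image; identifying, in the first image, a face region image that contains at least the mouth of a target object, and determining key point information of the mouth contained in the face region image; determining, according to the key point information of the mouth, whether the target object in the first image is in a speaking state; and in response to determining that the target object in the first image is in a speaking state, driving the interactive object to respond.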

Description

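Method, apparatus, device, and storage medium for driving an interactive object
Technical Field
The present disclosure relates to the field of computer technology, and in particular to a method, an apparatus, a device, and a storage medium for driving an interactive object.
Background
Human-computer interaction mostly works as follows: the user provides input by key presses, touch, or voice, and the device responds by presenting images, text, or a virtual character on a display screen. At present, most virtual characters are improvements on voice assistants, and the interaction between the user and the virtual character remains superficial.
Summary
Embodiments of the present disclosure provide a solution for driving an interactive object.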
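According to one aspect of the present disclosure, a method for driving an interactive object is provided. The method includes: acquiring a first image; identifying, in the first image, a face region image that contains at least the mouth of a target object, and determining key point information of the mouth contained in the face region image; determining, according to the key point information of the mouth, whether the target object in the first image is in a speaking state; and in response to determining that the target object in the first image is in a speaking state, driving the interactive object to respond.
In combination with any embodiment provided by the present disclosure, the key point information of the mouth includes position information of a plurality of key points located at the mouth of the target object; the plurality of key points include at least one key point pair, and each key point pair includes two key points located at the upper lip and the lower lip, respectively. Determining, according to the key point information of the mouth, whether the target object is in a speaking state includes: determining, according to the position information of the at least one key point pair, a first distance between the two key points of each key point pair; and determining, according to the first distance of each key point pair, whether the target object in the first image is in a speaking state.
In combination with any embodiment provided by the present disclosure, the first image is one frame in an image sequence. Determining, according to the first distance of each key point pair, whether the target object in the first image is in a speaking state includes: acquiring a set number of images to be processed from the image sequence, the images to be processed including the first image and at least one frame of second image; for each frame of second image, acquiring the first distance of each key point pair in that second image; and determining, according to the first distance of each key point pair in the first image and the first distance of each key point pair in each frame of second image, whether the target object in the first image is in a speaking state.
In combination with any embodiment provided by the present disclosure, acquiring a set number of images to be processed from the image sequence includes: sliding a window of a set length over the image sequence with a set step size, and acquiring the set number of images to be processed at each slide, where the first image is the last frame in the window.
In combination with any embodiment provided by the present disclosure, the first distance of a key point pair includes the Euclidean distance between the two key points of the pair. Determining whether the target object in the first image is in a speaking state from the first distance of each key point pair in the first image and in each frame of second image includes: identifying target images among the images to be processed; determining the number of target images contained in the images to be processed; and in response to the ratio between the number of target images and the set number of images to be processed being greater than a set ratio, determining that the target object in the first image is in a speaking state.
In combination with any embodiment provided by the present disclosure, identifying target images among the images to be processed includes: determining as a target image an image in which the average of the Euclidean distances of the key point pairs is greater than a first set threshold; or determining as a target image an image in which the weighted average of the Euclidean distances of the key point pairs is greater than a second set threshold.
In combination with any embodiment provided by the present disclosure, the first set threshold and the second set threshold are determined according to the resolution of the images to be processed.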
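In combination with any embodiment provided by the present disclosure, driving the interactive object to respond in response to the target object being in a speaking state includes: when the interactive object is in a standby state, in response to determining for the first time that the target object in the first image is in a speaking state, driving the interactive object to enter a state of interacting with the target object.
According to one aspect of the present disclosure, an apparatus for driving an interactive object is provided. The apparatus includes: an acquiring unit configured to acquire a first image; an identifying unit configured to identify, in the first image, a face region image that contains at least the mouth of a target object, and to determine key point information of the mouth contained in the face region image; a determining unit configured to determine, according to the key point information of the mouth, whether the target object in the first image is in a speaking state; and a driving unit configured to drive the interactive object to respond in response to determining that the target object in the first image is in a speaking state.
In combination with any embodiment provided by the present disclosure, the key point information of the mouth includes position information of a plurality of key points located at the mouth of the target object; the plurality of key points include at least one key point pair, and each key point pair includes two key points located at the upper lip and the lower lip, respectively. When determining whether the target object is in a speaking state according to the key point information of the mouth, the determining unit is further configured to: determine, according to the position information of the at least one key point pair, a first distance between the two key points of each key point pair; and determine, according to the first distance of each key point pair, whether the target object in the first image is in a speaking state.
In combination with any embodiment provided by the present disclosure, the first image is one frame in an image sequence. When determining, according to the first distance of each key point pair, whether the target object in the first image is in a speaking state, the determining unit is configured to: acquire a set number of images to be processed from the image sequence, the images to be processed including the first image and at least one frame of second image; for each frame of second image, acquire the first distance of each key point pair in that second image; and determine, according to the first distance of each key point pair in the first image and the first distance of each key point pair in each frame of second image, whether the target object in the first image is in a speaking state.
In combination with any embodiment provided by the present disclosure, when acquiring a set number of images to be processed from the image sequence, the determining unit is configured to: slide a window of a set length over the image sequence with a set step size, acquiring a set number of images to be processed at each slide, where the first image is the last frame in the window.
In combination with any embodiment provided by the present disclosure, the first distance of a key point pair includes the Euclidean distance between the two key points of the pair. When determining whether the target object in the first image is in a speaking state from the first distance of each key point pair in the first image and in each frame of second image, the determining unit is configured to: identify target images among the images to be processed; determine the number of target images contained in the images to be processed; and in response to the ratio between the number of target images and the set number of images to be processed being greater than a set ratio, determine that the target object in the first image is in a speaking state.
In combination with any embodiment provided by the present disclosure, when determining the target images among the images to be processed, the determining unit is configured to determine as a target image an image in which the average of the Euclidean distances of the key point pairs is greater than a first set threshold; or to determine as a target image an image in which the weighted average of the Euclidean distances of the key point pairs is greater than a second set threshold.
In combination with any embodiment provided by the present disclosure, the first set threshold and the second set threshold are determined according to the resolution of the images to be processed.
In combination with any embodiment provided by the present disclosure, the driving unit is specifically configured to: when the interactive object is in a standby state, in response to determining for the first time that the target object in the first image is in a speaking state, drive the interactive object to enter a state of interacting with the target object.
With the method, apparatus, device, and computer-readable storage medium for driving an interactive object of one or more embodiments of the present disclosure, the first image is recognized to obtain a face region image that contains at least the mouth of the target object, the key point information of the mouth in the face region image is determined, and whether the target object in the first image is in a speaking state is determined from that key point information so as to drive the interactive object to respond. Because whether the target object is speaking is judged in real time from the first image, the interactive object can respond promptly to the target object's speech and enter an interactive state even when the target object has no touch interaction with the terminal device displaying the interactive object, which improves the target object's interactive experience.
It should be understood that the above general description and the following detailed description are merely exemplary and explanatory, and do not limit the present disclosure.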
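Brief Description of the Drawings
The accompanying drawings are incorporated into and constitute a part of the present disclosure, illustrate embodiments consistent with the present disclosure, and together with the present disclosure serve to explain its principles.
FIG. 1 is a schematic diagram of a display in a method for driving an interactive object according to an embodiment of the present disclosure;
FIG. 2 is a flowchart of a method for driving an interactive object according to an embodiment of the present disclosure;
FIG. 3 is a schematic diagram of mouth key points in a method for driving an interactive object according to an embodiment of the present disclosure;
FIG. 4 is a schematic structural diagram of an apparatus for driving an interactive object according to an embodiment of the present disclosure;
FIG. 5 is a schematic structural diagram of an electronic device according to an embodiment of the present disclosure.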
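Detailed Description
Exemplary embodiments are described in detail here, with examples shown in the accompanying drawings. When the following description refers to the drawings, the same numbers in different drawings denote the same or similar elements unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with the present disclosure. Rather, they are merely examples of apparatuses and methods consistent with some aspects of the present disclosure as detailed in the appended claims.
The term "and/or" herein merely describes an association between associated objects and indicates that three relationships are possible. For example, "A and/or B" may mean: A alone, both A and B, or B alone. In addition, the term "at least one" herein means any one of multiple items or any combination of at least two of them. For example, "including at least one of A, B, and C" may mean including any one or more elements selected from the set consisting of A, B, and C.
At least one embodiment of the present disclosure provides a method for driving an interactive object. The method may be executed by an electronic device such as a terminal device or a server. The terminal device may be a fixed or mobile terminal, such as a mobile phone, tablet computer, game console, desktop computer, advertising machine, all-in-one machine, or vehicle-mounted terminal. The server includes a local server, a cloud server, or the like. The method may also be implemented by a processor invoking computer-readable instructions stored in a memory.
In the embodiments of the present disclosure, the interactive object may be any object capable of interacting with the target object. It may be a virtual character, or another avatar capable of interaction such as a virtual animal, a virtual item, or a cartoon figure. The avatar may be presented in 2D or in 3D, which the present disclosure does not limit. The target object may be a user, a robot, or another smart device. The interaction between the interactive object and the target object may be active or passive. In one example, the target object expresses a need by making a gesture or body movement, actively triggering the interactive object to interact with it. In another example, the interactive object actively greets the target object, prompts it to perform an action, or the like, so that the target object interacts with the interactive object passively.
The interactive object may be displayed by an electronic device. The electronic device may also be a television, an all-in-one machine with a display function, a projector, a virtual reality (VR) device, an augmented reality (AR) device, or the like. The present disclosure does not limit the specific form of the electronic device.
FIG. 1 shows a display device according to an embodiment of the present disclosure. As shown in FIG. 1, the display device has a display screen on which a stereoscopic picture can be displayed to present a virtual scene and an interactive object. For example, the interactive object displayed on the screen in FIG. 1 is a virtual cartoon character.
In some embodiments, the electronic device described in the present disclosure may include a built-in display, through which a stereoscopic picture can be displayed to present the virtual scene and the interactive object. In other embodiments, the electronic device may have no built-in display; the content to be displayed is sent over a wired or wireless connection to an external display, which presents the virtual scene and the interactive object.
In some embodiments, in response to the electronic device receiving sound driving data for driving the interactive object to output speech, the interactive object may utter a specified speech to the target object. The sound driving data may be generated according to the actions, expressions, identity, preferences, and so on of the target object near the electronic device, so as to drive the interactive object to respond by uttering the specified speech, thereby providing an anthropomorphic service to the target object. On this basis, at least one embodiment of the present disclosure proposes a method for driving an interactive object, to improve the target object's experience of interacting with the interactive object.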
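FIG. 2 shows a flowchart of a method for driving an interactive object according to an embodiment of the present disclosure. As shown in FIG. 2, the method includes steps 201 to 204.
In step 201, a first image is acquired.
The first image may be an image of the surroundings of the electronic device (for example, a terminal device or a server) that displays the interactive object. The image may be obtained by an image acquisition module of the electronic device, for example a built-in camera. Images of the surroundings of the electronic device include images in any direction within a certain range of the electronic device, for example images in one or more of the forward, lateral, rearward, and upward directions. By way of example, the range is determined by the range within which a sound detection module, used to detect audio signals, can receive an audio signal of a set intensity. The sound detection module may be built into the electronic device or may be an external device independent of it. The first image may also be an image captured by an image acquisition device and obtained over a network. The image acquisition device may be a camera independent of the terminal device, which transmits the captured image over a wired or wireless network to the electronic device executing the method. There may be one or more image acquisition devices. For example, a target object (such as a user) may use the terminal device to perform some operation, such as using a client on the terminal device for a service that involves interacting with the interactive object. The first image may then be an image captured by the camera of the terminal device or by an external camera. The image may be uploaded over a network to a server, which analyzes it and decides from the analysis result whether the interactive object needs to be controlled to respond; alternatively, the image may be analyzed directly by the terminal device, which decides from the analysis result whether the interactive object needs to be controlled to respond.
In step 202, a face region image that contains at least the mouth of the target object is identified in the first image, and the key point information of the mouth contained in the face region image is determined.
In one example, the face region image containing the mouth of the target object may be cropped from the first image so that it becomes an independent image; facial key point detection is then performed on the face region image to determine the mouth key points and obtain the key point information of the mouth, such as position information.
In another example, facial key point detection may be performed directly on the face region image block of the first image that contains the mouth of the target object, to determine the key point information of the mouth contained in the first image.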
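In step 203, whether the target object in the first image is in a speaking state is determined according to the key point information of the mouth.
The detected key point information of the mouth (for example, position information) differs depending on whether the mouth of the target object is open or closed. For example, when the mouth is open, the distance between a key point on the upper lip and a key point on the lower lip usually exceeds a certain value; when the mouth is closed, that distance is usually small. The distance threshold used to judge whether the mouth is open or closed depends on where on the mouth the selected upper-lip and lower-lip key points lie. For example, the threshold for the distance between a key point at the center of the upper lip and a key point at the center of the lower lip is usually greater than the threshold for the distance between a key point at the edge of the upper lip and a key point at the edge of the lower lip.
In one example, if, among multiple first images within a set time, the mouth of the target object is detected as open in more than a set proportion of the images, it may be determined that the target object is in a speaking state. Conversely, if within the set time the mouth is detected as open in no more than the set proportion of the images, it may be determined that the target object is not speaking.
In step 204, in response to the target object in the first image being in a speaking state, the interactive object is driven to respond.
The target object may have no touch interaction with the terminal device displaying the interactive object. When there are many target objects around the electronic device or the image acquisition device, or many audio signals are received, the electronic device may be unable to determine promptly that a target object has begun interacting with the interactive object when that target object starts speaking or issues a voice command. By detecting whether a target object around the electronic device or the image acquisition device is in a speaking state, the interactive object can be driven promptly to respond to that target object once it is determined to be speaking, for example by adopting a listening posture, or by giving a specific response. For example, if the target object is a woman, the interactive object may be driven to say "Madam, how may I help you?"
In the embodiments of the present disclosure, by judging in real time from the first image whether the target object is speaking, the interactive object can respond promptly to the target object's speech and enter an interactive state even when the target object has no touch interaction with the terminal device displaying the interactive object, which improves the target object's interactive experience.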
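In the embodiments of the present disclosure, the key point information of the mouth includes position information of a plurality of key points located at the mouth of the target object; the plurality of key points include at least one key point pair, and each key point pair includes at least two key points located at the upper lip and the lower lip, respectively.
FIG. 3 shows a schematic diagram of mouth key points in the method for driving an interactive object according to an embodiment of the present disclosure. Among the mouth key points shown in FIG. 3, at least one key point pair can be obtained, for example the pair (98, 102), where key point 98 is located at the middle of the upper lip and key point 102 at the middle of the lower lip.
From the position information of the at least one key point pair of the mouth, the first distance between the two key points of each pair, located at the upper lip and the lower lip respectively, can be determined. For example, once the pair (98, 102) has been obtained, the first distance between key point 98 and key point 102 can be determined from their position information.
Whether the target object is in a speaking state can be determined from the first distance of each key point pair.
The first distance between key point 98 and key point 102 differs between the open and closed states of the mouth. If the first distance between key point 98 and key point 102 is greater than a set distance threshold, it can be determined that the mouth of the target object in the first image is open; conversely, if it is less than or equal to the set distance threshold, it can be determined that the mouth is closed. From whether the mouth is closed or open, it can then be determined whether the target object is in a speaking state, that is, whether the target object is currently speaking.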
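The open/closed test on a single key point pair can be sketched as follows. This is a minimal illustration under stated assumptions, not the disclosed implementation: the dictionary layout of the key points and the function names `first_distance` and `mouth_is_open` are hypothetical; only the key point indices 98 and 102 and the strictly-greater-than comparison come from the text.

```python
import math

def first_distance(landmarks, upper_idx=98, lower_idx=102):
    """Euclidean distance between an upper-lip and a lower-lip key point.

    `landmarks` is assumed to map a key point index to an (x, y) pixel
    position; this container format is an assumption of the sketch.
    """
    return math.dist(landmarks[upper_idx], landmarks[lower_idx])

def mouth_is_open(landmarks, distance_threshold):
    """The mouth counts as open only if the first distance exceeds the threshold."""
    return first_distance(landmarks) > distance_threshold

# Illustrative positions: the two key points are 12 pixels apart vertically.
example = {98: (100.0, 200.0), 102: (100.0, 212.0)}
print(first_distance(example), mouth_is_open(example, 9.0))
```

A distance exactly equal to the threshold is treated as closed, matching the "less than or equal to" branch described above.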
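Those skilled in the art should understand that the choice of key point pair is not limited to (98, 102); any other pair with one key point in the upper-lip region and the other in the lower-lip region may be used. When multiple key point pairs are selected, the average distance between upper-lip and lower-lip key points in the first image can be determined from the average or weighted average of the first distances of those pairs. The set distance threshold used to judge whether the mouth is closed or open is determined by where the selected key point pairs are located.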
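Combining several key point pairs by average or weighted average might look like the sketch below. The helper names and the weights are assumptions for illustration; the text does not specify how weights are chosen.

```python
import math

def pair_distances(landmarks, pairs):
    """First distance of each (upper_index, lower_index) key point pair."""
    return [math.dist(landmarks[u], landmarks[l]) for u, l in pairs]

def mean_distance(distances):
    """Plain average of the first distances."""
    return sum(distances) / len(distances)

def weighted_mean_distance(distances, weights):
    """Weighted average of the first distances (weights are hypothetical)."""
    if len(distances) != len(weights):
        raise ValueError("one weight is required per key point pair")
    return sum(d * w for d, w in zip(distances, weights)) / sum(weights)
```

One plausible use of weights, offered only as a design note, is to emphasize pairs near the lip center, where the text says the open/closed distance gap is usually larger than at the lip edges.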
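In the embodiments of the present disclosure, the first image is one frame in an image sequence. The image sequence may be a video stream obtained by an image acquisition device, or multiple frames captured at a set frequency. When the first image is one frame in an image sequence, a set number of images to be processed can be acquired from the image sequence, and whether the target object is in a speaking state can be determined from the first distances of the key point pairs in each image to be processed. The images to be processed include the first image and at least one frame of second image other than the first image. For each frame of second image, the first distance of each key point pair in that second image is acquired; whether the target object is in a speaking state is then determined from the first distance of each key point pair in the first image and the first distance of each key point pair in each frame of second image.
For example, two frames of second image among the images to be processed may be the two consecutive frames adjacent to the first image, or two frames spaced at equal intervals from the first image. Suppose the first image is the N-th frame in the image sequence; the two second images may then be frames N-1 and N-2, or frames N-2 and N-4, and so on.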
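The frame selection in this example can be written as a small index calculation. The function name `frames_to_process` and its `gap` parameter are hypothetical names introduced for this sketch.

```python
def frames_to_process(n, count, gap=1):
    """Indices of the images to be processed, oldest first, ending at frame n.

    `count` includes the first image (frame n); `gap` is the spacing between
    selected frames (1 selects N-1, N-2; 2 selects N-2, N-4).
    """
    if count < 1 or gap < 1:
        raise ValueError("count and gap must be positive")
    indices = [n - k * gap for k in range(count)]
    if indices[-1] < 0:
        raise ValueError("not enough frames before frame n")
    return indices[::-1]
```

With `n = 10` and `count = 3`, a gap of 1 selects frames 8, 9, 10 and a gap of 2 selects frames 6, 8, 10.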
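In this embodiment, from the first distances of the key point pairs in the first image and in each second image, it can be determined whether the mouth of the target object is open or closed in the set number of images to be processed, and hence whether the target object is in a speaking state.
In some embodiments, a window of a set length may be slid over the image sequence with a set step size, acquiring a set number of images to be processed at each slide, with the first image being the last frame in the window.
It should be noted that the method described in the present disclosure can detect in real time whether the target object is in a speaking state. In other words, the number of captured first images may keep growing. With a window in place, the first image may be the image most recently added to the window, and when a first image is added, the earliest image in the window, that is, the frame with the earliest capture time, can be discarded. This keeps the capture times of the images in the window recent.
In one implementation, all images to be processed in the window may be processed together to determine the mouth state of the target object in them and so judge whether the target object is in a speaking state. In another implementation, the images in the window may be processed individually: whenever a new frame is added to the window, that frame is examined, the mouth state of the target object in it is determined and saved, and when later judging whether the target object is in a speaking state, the saved mouth states of the frames currently in the window are used.
The length of the window determines the number of images to be processed that it contains: the longer the window, the more images it contains. The step size of the sliding window is related to the time interval (frequency) at which images to be processed are acquired, that is, to the time interval at which the speaking state of the target object is judged. The window length and step size can be set according to the actual interaction scenario. For example, a window length of 10 and a step size of 2 mean that the window contains 10 images to be processed and moves 2 frames along the image sequence at each slide.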
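The sliding window with a set length and step size can be sketched with a bounded queue. This is one possible realization, assuming frames arrive one at a time; the generator name `sliding_windows` is hypothetical.

```python
from collections import deque

def sliding_windows(frames, length, step=1):
    """Yield lists of `length` consecutive frames, advancing `step` frames per slide.

    The last element of each yielded window plays the role of the first image.
    """
    window = deque(maxlen=length)  # the oldest frame is discarded automatically
    since_last = 0                 # frames seen since the last yielded window
    for frame in frames:
        window.append(frame)
        if len(window) < length:
            continue               # the window is not yet full
        if since_last == 0:
            yield list(window)
        since_last = (since_last + 1) % step
```

For a window length of 10 and a step size of 2, as in the example above, windows would end at frames 9, 11, 13, and so on (counting from 0).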
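In addition, the window length affects detection accuracy. If the state of the target object is judged from the detection result of a single image to be processed, accuracy may be low; judging from the detection results of multiple images improves accuracy. If the window is too long, however, real-time performance suffers. For example, suppose the target object starts speaking at time t1, corresponding to the N-th frame. Because the detection results of the other frames in the window (such as N-1, N-2, ...) still indicate that the target object is not speaking, the target object is still judged at t1 not to have started speaking. Only at time t2, when frame N+i has been acquired and more than the set proportion of the images in the window show the mouth open, is the target object judged to have started speaking, where i depends at least on the window length, the step size, and the set proportion. The longer the window, therefore, the greater the difference between t2 and t1, which degrades real-time detection.
In the embodiments of the present disclosure, whether the target object is in a speaking state in the first image can be determined from the mouth state of the target object in the first image and in the second images preceding it. Moreover, with the sliding window, each newly captured frame, that is, the first image, becomes the last frame in the window, so whether the target object is in a speaking state can be detected in real time.
In the embodiments of the present disclosure, the first distance includes the Euclidean distance between the two key points of a key point pair. For a three-dimensional face image, the Euclidean distance measures the distance and positional relationship between two key points more accurately.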
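In some embodiments, whether the target object is in a speaking state may be determined in the following manner from the first distance of each key point pair in the first image and the first distance of each key point pair in each frame of second image.
First, among the first image and each frame of second image, an image in which the average of the Euclidean distances of the key point pairs is greater than the first set threshold is determined as a target image; alternatively, an image in which the weighted average of those Euclidean distances is greater than the second set threshold is determined as a target image. That is, among the images to be processed, an image in which the mouth of the target object is open is determined as a target image.
Then, the number of target images contained in the images to be processed is determined. That is, the number of images in which the mouth is open (each may be the first image or a second image among the images to be processed) is counted.
Next, whether the target object is in a speaking state is determined from the ratio between the number of target images and the set number of images to be processed.
In response to the ratio being greater than the set ratio, it is determined that the target object in the first image is in a speaking state; otherwise, in response to the ratio being less than or equal to the set ratio, it is determined that the target object is not currently speaking.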
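The counting and ratio steps above reduce to a few lines. In this sketch the per-image open/closed flags are assumed to have been computed already, and the function name `is_speaking` is hypothetical.

```python
def is_speaking(mouth_open_flags, set_ratio):
    """Decide the speaking state from one flag per image to be processed.

    A True flag marks a target image (mouth open). The target object counts
    as speaking only if the share of target images exceeds `set_ratio`.
    """
    if not mouth_open_flags:
        raise ValueError("at least one image to be processed is required")
    target_count = sum(1 for flag in mouth_open_flags if flag)
    return target_count / len(mouth_open_flags) > set_ratio
```

Because the comparison is strict, a share exactly equal to the set ratio yields "not speaking", consistent with the "less than or equal to" branch in the text.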
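In some embodiments, different Euclidean distance thresholds may be set for different resolutions of the images to be processed. That is, the first set threshold and the second set threshold may be determined according to the resolution of the images to be processed.
In one example, when the resolution of the images to be processed is 720*1080, the Euclidean distance threshold may be set to 9 (for example, 9 pixels). The window length may be set to 10, that is, the window contains 10 images to be processed, and the window is moved with a step size of 1. With the set ratio at 0.4, when the window slides to the current image frame, if more than 4 of the 10 images to be processed show the mouth open, it is determined that the target object is in a speaking state.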
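The numbers in this example (threshold 9, window length 10, step size 1, set ratio 0.4) can be put together in a short end-to-end sketch. The input is assumed to be one average lip distance per frame, already measured in pixels; the function name `speaking_states` is hypothetical.

```python
from collections import deque

DISTANCE_THRESHOLD = 9.0  # pixels, for 720*1080 images (value from the example)
WINDOW_LENGTH = 10        # images to be processed per window
SET_RATIO = 0.4           # share of open-mouth images that must be exceeded

def speaking_states(mean_distances):
    """Per-frame decision with step size 1; None until the window is full."""
    window = deque(maxlen=WINDOW_LENGTH)
    results = []
    for distance in mean_distances:
        window.append(distance > DISTANCE_THRESHOLD)  # True marks a target image
        if len(window) < WINDOW_LENGTH:
            results.append(None)
        else:
            results.append(sum(window) / WINDOW_LENGTH > SET_RATIO)
    return results

# Ten closed-mouth frames followed by five open-mouth frames.
print(speaking_states([3.0] * 10 + [12.0] * 5)[9:])
```

In this illustrative run the speaking state is first reported on the fifth open-mouth frame, which also shows the detection delay that grows with the window length.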
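In another example, if the resolution of the images to be processed is not 720*1080, the images may be adjusted to 720*1080 by cropping, enlarging, or shrinking; alternatively, the Euclidean distance threshold corresponding to the actual resolution may be calculated.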
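The text does not say how the threshold for another resolution is calculated. One simple possibility, shown purely as an assumption, is to scale the reference threshold of 9 pixels in proportion to the image diagonal relative to a 720*1080 image; the function name `scaled_threshold` and the scaling rule are hypothetical.

```python
import math

REFERENCE_SIZE = (720, 1080)  # reference resolution from the example
REFERENCE_THRESHOLD = 9.0     # pixels at the reference resolution

def scaled_threshold(width, height):
    """Scale the distance threshold linearly with the image diagonal.

    This proportional rule is an assumption of the sketch, not a rule
    stated in the disclosure.
    """
    scale = math.hypot(width, height) / math.hypot(*REFERENCE_SIZE)
    return REFERENCE_THRESHOLD * scale
```

Under this rule an image with twice the reference width and height would use a threshold of about 18 pixels. A real system would likely need to validate any such rule against labeled data.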
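When the interactive object is in a standby state, that is, a state in which it is not interacting with the target object, in response to determining for the first time that the target object in the first image is in a speaking state, the interactive object may be driven to enter a state of interacting with the target object.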
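The standby-to-interaction transition can be modeled as a two-state machine. The class and method names below are hypothetical, and how the object leaves the interactive state is not covered by this sketch.

```python
from enum import Enum

class State(Enum):
    STANDBY = "standby"
    INTERACTING = "interacting"

class InteractiveObject:
    """Minimal state holder for the interactive object (illustrative only)."""

    def __init__(self):
        self.state = State.STANDBY

    def on_speaking_result(self, speaking):
        """Feed one speaking-state decision; return True if it triggered the transition."""
        if self.state is State.STANDBY and speaking:
            self.state = State.INTERACTING
            return True
        return False
```

Only the first positive decision while in standby triggers the transition; later positive decisions leave the state unchanged.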
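When the target object has no touch interaction with the terminal device displaying the interactive object, the above approach enables the interactive object to respond promptly to the target object being in a speaking state and enter the interactive state, improving the target object's interactive experience.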
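FIG. 4 shows a schematic structural diagram of an apparatus for driving an interactive object according to an embodiment of the present disclosure. As shown in FIG. 4, the apparatus may include: an acquiring unit 401 configured to acquire a first image; an identifying unit 402 configured to identify, in the first image, a face region image that contains at least the mouth of a target object, and to determine key point information of the mouth contained in the face region image; a determining unit 403 configured to determine, according to the key point information of the mouth, whether the target object in the first image is in a speaking state; and a driving unit 404 configured to drive the interactive object to respond in response to determining that the target object in the first image is in a speaking state.
In some embodiments, the key point information of the mouth includes position information of a plurality of key points located at the mouth of the target object; the plurality of key points include at least one key point pair, and each key point pair includes two key points located at the upper lip and the lower lip, respectively. When determining whether the target object is in a speaking state according to the key point information of the mouth, the determining unit 403 is further configured to: determine, according to the position information of the at least one key point pair, a first distance between the two key points of each key point pair; and determine, according to the first distance of each key point pair, whether the target object in the first image is in a speaking state.
In some embodiments, the first image is one frame in an image sequence. When determining, according to the first distance of each key point pair, whether the target object in the first image is in a speaking state, the determining unit 403 is configured to: acquire a set number of images to be processed from the image sequence, the images to be processed including the first image and at least one frame of second image; for each frame of second image, acquire the first distance of each key point pair in that second image; and determine, according to the first distance of each key point pair in the first image and the first distance of each key point pair in each frame of second image, whether the target object in the first image is in a speaking state.
In some embodiments, when acquiring a set number of images to be processed from the image sequence, the determining unit 403 is configured to: slide a window of a set length over the image sequence with a set step size, acquiring a set number of images to be processed at each slide, where the first image is the last frame in the window.
In some embodiments, the first distance of a key point pair includes the Euclidean distance between the two key points of the pair. When determining whether the target object in the first image is in a speaking state from the first distance of each key point pair in the first image and in each frame of second image, the determining unit 403 is configured to: identify target images among the images to be processed; determine the number of target images contained in the images to be processed; and in response to the ratio between the number of target images and the set number of images to be processed being greater than a set ratio, determine that the target object in the first image is in a speaking state.
In some embodiments, when determining the target images among the images to be processed, the determining unit 403 is configured to determine as a target image an image in which the average of the Euclidean distances of the key point pairs is greater than a first set threshold; or to determine as a target image an image in which the weighted average of the Euclidean distances of the key point pairs is greater than a second set threshold.
In some embodiments, the first set threshold and the second set threshold are determined according to the resolution of the images to be processed.
In some embodiments, the driving unit 404 is configured to: when the interactive object is in a standby state, in response to determining for the first time that the target object in the first image is in a speaking state, drive the interactive object to enter a state of interacting with the target object.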
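An embodiment of the present disclosure also provides an electronic device. As shown in FIG. 5, the device includes a memory and a processor; the memory is configured to store computer instructions executable on the processor, and the processor is configured to implement, when executing the computer instructions, the method for driving an interactive object described in any embodiment of the present disclosure.
In some embodiments, the device is, for example, a server or a terminal device, which determines the speaking state of the target object according to the key point information of the mouth in the first image and thereby controls the interactive object shown on a display. When the terminal device includes a display, the display further includes a display screen or a transparent display screen for displaying animations of the interactive object.
An embodiment of the present disclosure also provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the method for driving an interactive object described in any embodiment of the present disclosure.
Those skilled in the art should understand that one or more embodiments of the present disclosure may be provided as a method, a system, or a computer program product. Therefore, one or more embodiments of the present disclosure may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware. Moreover, one or more embodiments of the present disclosure may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, and optical storage) containing computer-usable program code.
The embodiments in the present disclosure are described progressively; for parts that are the same or similar across embodiments, reference may be made to one another, and each embodiment focuses on its differences from the others. In particular, the data processing device embodiment is described relatively briefly because it is substantially similar to the method embodiment; for relevant details, refer to the corresponding description of the method embodiment.
Specific embodiments of the present disclosure have been described above. Other embodiments are within the scope of the appended claims. In some cases, the actions or steps recited in the claims may be performed in an order different from that in the embodiments and still achieve the desired results. In addition, the processes depicted in the drawings do not necessarily require the particular order shown, or sequential order, to achieve the desired results. In some implementations, multitasking and parallel processing are also possible or may be advantageous.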
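Embodiments of the subject matter and functional operations described in the present disclosure may be implemented in: digital electronic circuitry, tangibly embodied computer software or firmware, computer hardware including the structures disclosed herein and their structural equivalents, or a combination of one or more of them. Embodiments of the subject matter described in the present disclosure may be implemented as one or more computer programs, that is, one or more modules of computer program instructions encoded on a tangible non-transitory program carrier to be executed by a data processing apparatus or to control the operation of the data processing apparatus. Alternatively or additionally, the program instructions may be encoded on an artificially generated propagated signal, such as a machine-generated electrical, optical, or electromagnetic signal, which is generated to encode information and transmit it to a suitable receiver apparatus for execution by the data processing apparatus. The computer storage medium may be a machine-readable storage device, a machine-readable storage substrate, a random or serial access memory device, or a combination of one or more of them.
The processing and logic flows described in the present disclosure can be executed by one or more programmable computers executing one or more computer programs to perform corresponding functions by operating on input data and generating output. The processing and logic flows can also be executed by dedicated logic circuitry, such as an FPGA (Field Programmable Gate Array) or an ASIC (Application Specific Integrated Circuit), and the apparatus can also be implemented as dedicated logic circuitry.
Computers suitable for executing computer programs include, for example, general-purpose and/or special-purpose microprocessors, or any other type of central processing unit. Generally, the central processing unit receives instructions and data from a read-only memory and/or a random access memory. The basic components of a computer include a central processing unit for implementing or executing instructions and one or more memory devices for storing instructions and data. Generally, the computer will also include one or more mass storage devices for storing data, such as magnetic disks, magneto-optical disks, or optical disks, or will be operatively coupled to such a mass storage device to receive data from it, transfer data to it, or both. However, the computer does not have to have such devices. In addition, the computer can be embedded in another device, such as a mobile phone, a personal digital assistant (PDA), a mobile audio or video player, a game console, a global positioning system (GPS) receiver, or a portable storage device such as a universal serial bus (USB) flash drive, to name just a few.
Computer-readable media suitable for storing computer program instructions and data include all forms of non-volatile memory, media, and memory devices, including, for example, semiconductor memory devices (such as EPROM, EEPROM, and flash memory devices), magnetic disks (such as internal hard disks or removable disks), magneto-optical disks, and CD-ROM and DVD-ROM disks. The processor and the memory can be supplemented by or incorporated into dedicated logic circuitry.
Although the present disclosure contains many specific implementation details, these should not be construed as limiting the scope of any invention or of what is claimed, but rather as describing features of specific embodiments of particular inventions. Certain features described in the present disclosure in multiple embodiments can also be implemented in combination in a single embodiment. Conversely, various features described in a single embodiment can also be implemented in multiple embodiments separately or in any suitable sub-combination. Furthermore, although features may be described above as acting in certain combinations and even initially claimed as such, one or more features from a claimed combination can in some cases be removed from that combination, and the claimed combination may be directed to a sub-combination or a variation of a sub-combination.
Similarly, although operations are depicted in the drawings in a specific order, this should not be understood as requiring that these operations be performed in the specific order shown or in sequence, or that all illustrated operations be performed, to achieve the desired results. In some cases, multitasking and parallel processing may be advantageous. In addition, the separation of the various system modules and components in the above embodiments should not be understood as requiring such separation in all embodiments, and it should be understood that the described program components and systems can generally be integrated together in a single software product or packaged into multiple software products.
Thus, specific embodiments of the subject matter have been described. Other embodiments are within the scope of the appended claims. In some cases, the actions recited in the claims can be performed in a different order and still achieve the desired results. In addition, the processes depicted in the drawings do not necessarily require the specific order shown, or sequential order, to achieve the desired results. In some implementations, multitasking and parallel processing may be advantageous.
The above are merely one or more embodiments of the present disclosure and are not intended to limit it. Any modification, equivalent replacement, improvement, and the like made within the spirit and principles of one or more embodiments of the present disclosure shall fall within the scope of protection of one or more embodiments of the present disclosure.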

Claims (18)

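  1. A method for driving an interactive object, comprising:
    acquiring a first image;
    identifying, in the first image, a face region image that contains at least the mouth of a target object, and determining key point information of the mouth contained in the face region image;
    determining, according to the key point information of the mouth, whether the target object in the first image is in a speaking state; and
    in response to determining that the target object in the first image is in a speaking state, driving the interactive object to respond.
  2. The method according to claim 1, wherein the key point information of the mouth includes position information of a plurality of key points located at the mouth of the target object; the plurality of key points include at least one key point pair, and each key point pair includes two key points located at the upper lip and the lower lip, respectively;
    determining, according to the key point information of the mouth, whether the target object is in a speaking state comprises:
    determining, according to the position information of the at least one key point pair, a first distance between the two key points of each key point pair located at the upper lip and the lower lip, respectively; and
    determining, according to the first distance of each key point pair, whether the target object in the first image is in a speaking state.
  3. The method according to claim 2, wherein the first image is one frame in an image sequence;
    determining, according to the first distance of each key point pair, whether the target object in the first image is in a speaking state comprises:
    acquiring a set number of images to be processed from the image sequence, the images to be processed including the first image and at least one frame of second image;
    for each frame of second image, acquiring the first distance of each key point pair in the second image; and
    determining, according to the first distance of each key point pair in the first image and the first distance of each key point pair in each frame of second image, whether the target object in the first image is in a speaking state.
  4. The method according to claim 3, wherein acquiring a set number of images to be processed from the image sequence comprises:
    sliding a window of a set length over the image sequence with a set step size, and acquiring the set number of images to be processed at each slide, wherein the first image is the last frame in the window.
  5. The method according to claim 3 or 4, wherein the first distance of a key point pair includes the Euclidean distance between the two key points of the key point pair, and determining, according to the first distance of each key point pair in the first image and the first distance of each key point pair in each frame of second image, whether the target object in the first image is in a speaking state comprises:
    identifying target images among the images to be processed;
    determining the number of target images contained in the images to be processed; and
    in response to the ratio between the number of target images and the set number of images to be processed being greater than a set ratio, determining that the target object in the first image is in a speaking state.
  6. The method according to claim 5, wherein identifying target images among the images to be processed comprises:
    determining as a target image an image in which the average of the Euclidean distances of the key point pairs is greater than a first set threshold; or
    determining as a target image an image in which the weighted average of the Euclidean distances of the key point pairs is greater than a second set threshold.
  7. The method according to claim 6, wherein the first set threshold and the second set threshold are determined according to the resolution of the images to be processed.
  8. The method according to any one of claims 1 to 7, wherein driving the interactive object to respond in response to the target object being in a speaking state comprises:
    when the interactive object is in a standby state, in response to determining for the first time that the target object in the first image is in a speaking state, driving the interactive object to enter a state of interacting with the target object.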
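  9. An apparatus for driving an interactive object, wherein the apparatus comprises:
    an acquiring unit configured to acquire a first image;
    an identifying unit configured to identify, in the first image, a face region image that contains at least the mouth of a target object, and to determine key point information of the mouth contained in the face region image;
    a determining unit configured to determine, according to the key point information of the mouth, whether the target object in the first image is in a speaking state; and
    a driving unit configured to drive the interactive object to respond in response to determining that the target object in the first image is in a speaking state.
  10. The apparatus according to claim 9, wherein the key point information of the mouth includes position information of a plurality of key points located at the mouth of the target object; the plurality of key points include at least one key point pair, and each key point pair includes two key points located at the upper lip and the lower lip, respectively;
    the determining unit is configured to:
    determine, according to the position information of the at least one key point pair, a first distance between the two key points of each key point pair located at the upper lip and the lower lip, respectively; and
    determine, according to the first distance of each key point pair, whether the target object in the first image is in a speaking state.
  11. The apparatus according to claim 10, wherein the first image is one frame in an image sequence;
    when determining, according to the first distance of each key point pair, whether the target object in the first image is in a speaking state, the determining unit is configured to:
    acquire a set number of images to be processed from the image sequence, the images to be processed including the first image and at least one frame of second image;
    for each frame of second image, acquire the first distance of each key point pair in the second image; and
    determine, according to the first distance of each key point pair in the first image and the first distance of each key point pair in each frame of second image, whether the target object in the first image is in a speaking state.
  12. The apparatus according to claim 11, wherein, when acquiring a set number of images to be processed from the image sequence, the determining unit is specifically configured to:
    slide a window of a set length over the image sequence with a set step size, and acquire the set number of images to be processed at each slide, wherein the first image is the last frame in the window.
  13. The apparatus according to claim 11 or 12, wherein the first distance of a key point pair includes the Euclidean distance between the two key points of the key point pair, and when determining, according to the first distance of each key point pair in the first image and the first distance of each key point pair in each frame of second image, whether the target object in the first image is in a speaking state, the determining unit is configured to:
    identify target images among the images to be processed;
    determine the number of target images contained in the images to be processed; and
    in response to the ratio between the number of target images and the set number of images to be processed being greater than a set ratio, determine that the target object in the first image is in a speaking state.
  14. The apparatus according to claim 13, wherein, when identifying target images among the images to be processed, the determining unit is configured to:
    determine as a target image an image in which the average of the Euclidean distances of the key point pairs is greater than a first set threshold; or
    determine as a target image an image in which the weighted average of the Euclidean distances of the key point pairs is greater than a second set threshold.
  15. The apparatus according to claim 14, wherein the first set threshold and the second set threshold are determined according to the resolution of the images to be processed.
  16. The apparatus according to any one of claims 10 to 15, wherein the driving unit is specifically configured to:
    when the interactive object is in a standby state, in response to determining for the first time that the target object in the first image is in a speaking state, drive the interactive object to enter a state of interacting with the target object.
  17. An electronic device, wherein the device comprises a memory and a processor, the memory being configured to store computer instructions executable on the processor, and the processor being configured to implement the method according to any one of claims 1 to 8 when executing the computer instructions.
  18. A computer-readable storage medium storing a computer program, wherein the program, when executed by a processor, implements the method according to any one of claims 1 to 8.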
PCT/CN2020/129855 2020-03-31 2020-11-18 交互对象的驱动方法、装置、设备以及存储介质 WO2021196648A1 (zh)

Priority Applications (3)

Application Number Priority Date Filing Date Title
JP2021549762A JP2022531055A (ja) 2020-03-31 2020-11-18 インタラクティブ対象の駆動方法、装置、デバイス、及び記録媒体
KR1020217027719A KR20210124313A (ko) 2020-03-31 2020-11-18 인터랙티브 대상의 구동 방법, 장치, 디바이스 및 기록 매체
SG11202109202VA SG11202109202VA (en) 2020-03-31 2020-11-18 Methods, apparatuses, electronic devices and storage media for driving an interactive object

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010247255.3 2020-03-31
CN202010247255.3A CN111428672A (zh) 2020-03-31 2020-03-31 交互对象的驱动方法、装置、设备以及存储介质

Publications (1)

Publication Number Publication Date
WO2021196648A1 true WO2021196648A1 (zh) 2021-10-07

Family

ID=71550226

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/129855 WO2021196648A1 (zh) 2020-03-31 2020-11-18 交互对象的驱动方法、装置、设备以及存储介质

Country Status (6)

Country Link
JP (1) JP2022531055A (zh)
KR (1) KR20210124313A (zh)
CN (1) CN111428672A (zh)
SG (1) SG11202109202VA (zh)
TW (1) TW202139064A (zh)
WO (1) WO2021196648A1 (zh)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111428672A (zh) * 2020-03-31 2020-07-17 北京市商汤科技开发有限公司 交互对象的驱动方法、装置、设备以及存储介质
CN113018858B (zh) * 2021-04-12 2023-07-25 深圳市腾讯计算机系统有限公司 一种虚拟角色检测方法、计算机设备以及可读存储介质
CN113139491A (zh) * 2021-04-30 2021-07-20 厦门盈趣科技股份有限公司 视频会议控制方法、系统、移动终端及存储介质
CN113822205A (zh) * 2021-09-26 2021-12-21 北京市商汤科技开发有限公司 会议记录生成方法、装置、电子设备以及存储介质
CN115063867A (zh) * 2022-06-30 2022-09-16 上海商汤临港智能科技有限公司 说话状态识别方法及模型训练方法、装置、车辆、介质

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106709400A (zh) * 2015-11-12 2017-05-24 阿里巴巴集团控股有限公司 一种感官张闭状态的识别方法、装置及客户端
US20170244891A1 (en) * 2016-02-24 2017-08-24 Beijing Xiaomi Mobile Software Co., Ltd. Method for automatically capturing photograph, electronic device and medium
CN108646920A (zh) * 2018-05-16 2018-10-12 Oppo广东移动通信有限公司 识别交互方法、装置、存储介质及终端设备
CN109241907A (zh) * 2018-09-03 2019-01-18 北京旷视科技有限公司 标注方法、装置及电子设备
CN110750152A (zh) * 2019-09-11 2020-02-04 云知声智能科技股份有限公司 一种基于唇部动作的人机交互方法和系统
CN111428672A (zh) * 2020-03-31 2020-07-17 北京市商汤科技开发有限公司 交互对象的驱动方法、装置、设备以及存储介质

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109918975B (zh) * 2017-12-13 2022-10-21 腾讯科技(深圳)有限公司 一种增强现实的处理方法、对象识别的方法及终端
CN108492350A (zh) * 2018-04-02 2018-09-04 吉林动画学院 基于唇读技术的角色口型动画制作方法
CN109377539B (zh) * 2018-11-06 2023-04-11 北京百度网讯科技有限公司 用于生成动画的方法和装置
CN109977811A (zh) * 2019-03-12 2019-07-05 四川长虹电器股份有限公司 基于嘴部关键位置特征检测实现免语音唤醒的系统及方法
CN110309799B (zh) * 2019-07-05 2022-02-08 四川长虹电器股份有限公司 基于摄像头的说话判断方法
CN110620884B (zh) * 2019-09-19 2022-04-22 平安科技(深圳)有限公司 基于表情驱动的虚拟视频合成方法、装置及存储介质
CN110647865B (zh) * 2019-09-30 2023-08-08 腾讯科技(深圳)有限公司 人脸姿态的识别方法、装置、设备及存储介质
CN110826441B (zh) * 2019-10-25 2022-10-28 深圳追一科技有限公司 交互方法、装置、终端设备及存储介质

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106709400A (zh) * 2015-11-12 2017-05-24 阿里巴巴集团控股有限公司 一种感官张闭状态的识别方法、装置及客户端
US20170244891A1 (en) * 2016-02-24 2017-08-24 Beijing Xiaomi Mobile Software Co., Ltd. Method for automatically capturing photograph, electronic device and medium
CN108646920A (zh) * 2018-05-16 2018-10-12 Oppo广东移动通信有限公司 识别交互方法、装置、存储介质及终端设备
CN109241907A (zh) * 2018-09-03 2019-01-18 北京旷视科技有限公司 标注方法、装置及电子设备
CN110750152A (zh) * 2019-09-11 2020-02-04 云知声智能科技股份有限公司 一种基于唇部动作的人机交互方法和系统
CN111428672A (zh) * 2020-03-31 2020-07-17 北京市商汤科技开发有限公司 交互对象的驱动方法、装置、设备以及存储介质

Also Published As

Publication number Publication date
CN111428672A (zh) 2020-07-17
SG11202109202VA (en) 2021-11-29
TW202139064A (zh) 2021-10-16
KR20210124313A (ko) 2021-10-14
JP2022531055A (ja) 2022-07-06

Similar Documents

Publication Publication Date Title
WO2021196648A1 (zh) 交互对象的驱动方法、装置、设备以及存储介质
CN106651955B (zh) 图片中目标物的定位方法及装置
US10241990B2 (en) Gesture based annotations
CN106664376B (zh) 增强现实设备和方法
EP3195601B1 (en) Method of providing visual sound image and electronic device implementing the same
TW202105331A (zh) 一種人體關鍵點檢測方法及裝置、電子設備和電腦可讀儲存介質
TW202113757A (zh) 目標對象匹配方法及目標對象匹配裝置、電子設備和電腦可讀儲存媒介
EP2998960B1 (en) Method and device for video browsing
CN109190449A (zh) 年龄识别方法、装置、电子设备及存储介质
CN105631408A (zh) 基于视频的面孔相册处理方法和装置
CN111045511B (zh) 基于手势的操控方法及终端设备
US11935294B2 (en) Real time object surface identification for augmented reality environments
CN105095881A (zh) 人脸识别方法、装置及终端
CN109063580A (zh) 人脸识别方法、装置、电子设备及存储介质
CN105528078B (zh) 控制电子设备的方法及装置
CN105354560A (zh) 指纹识别方法及装置
US9799376B2 (en) Method and device for video browsing based on keyframe
US20220222831A1 (en) Method for processing images and electronic device therefor
CN109344703B (zh) 对象检测方法及装置、电子设备和存储介质
CN107977636B (zh) 人脸检测方法及装置、终端、存储介质
CN105335714A (zh) 照片处理方法、装置和设备
CN104573642A (zh) 人脸识别方法及装置
CN105608469A (zh) 图像分辨率的确定方法及装置
US20230019181A1 (en) Device and method for device localization
Korchagin et al. Multimodal cue detection engine for orchestrated entertainment

Legal Events

Date Code Title Description
ENP Entry into the national phase Ref document number: 2021549762; Country of ref document: JP; Kind code of ref document: A
ENP Entry into the national phase Ref document number: 20217027719; Country of ref document: KR; Kind code of ref document: A
121 Ep: the epo has been informed by wipo that ep was designated in this application Ref document number: 20928347; Country of ref document: EP; Kind code of ref document: A1
NENP Non-entry into the national phase Ref country code: DE
122 Ep: pct application non-entry in european phase Ref document number: 20928347; Country of ref document: EP; Kind code of ref document: A1
WWE Wipo information: entry into national phase Ref document number: 521430715; Country of ref document: SA