WO2019205016A1 - Systems and methods for nod action recognition based on facial feature points

Publication number
WO2019205016A1
Authority
WO
WIPO (PCT)
Prior art keywords
image frames
image frame
action
feature point
action parameter
Prior art date
Application number
PCT/CN2018/084426
Other languages
French (fr)
Inventor
Xiubao Zhang
Original Assignee
Beijing Didi Infinity Technology And Development Co., Ltd.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Didi Infinity Technology And Development Co., Ltd.
Priority to PCT/CN2018/084426 (WO2019205016A1)
Priority to CN201880038528.4A (CN110753931A)
Publication of WO2019205016A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V40/174 Facial expression recognition
    • G06V40/176 Dynamic expression
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168 Feature extraction; Face representation
    • G06V40/171 Local features and components; Facial parts; Occluding parts, e.g. glasses; Geometrical relationships

Definitions

  • The present disclosure generally relates to systems and methods for action recognition, and in particular, to systems and methods for automated identification of the presence of a nod action from sequential image frames.
  • Living body detection based on human action recognition has become increasingly important in many scenarios (e.g., system login, identity authentication, Human-Computer Interaction).
  • Take “system login” as an example. When a user intends to sign in to the system via face recognition, the system may need to identify an action (e.g., a nod action) of the user in order to verify that the “user” is a living person rather than a deceptive object (e.g., a picture).
  • Existing technology achieves this goal by using complex algorithms that require excessive computing capacity, which places a heavy burden on the computing system. Therefore, it is desirable to provide systems and methods that identify the presence of an action of a user quickly and efficiently, preferably with less demand on computing capacity.
  • An aspect of the present disclosure relates to a system for automated identification of presence of a facial action from sequential images.
  • the system may include at least one storage medium including a set of instructions and at least one processor in communication with the at least one storage medium.
  • the at least one processor may be directed to cause the system to perform one or more of the following operations.
  • the at least one processor may obtain a plurality of sequential candidate image frames containing a facial object.
  • Each of the plurality of candidate image frames may include one or more first feature points associated with an upper part of the facial object, a second feature point associated with a middle part of the facial object, and one or more third feature points associated with a lower part of the facial object.
  • the at least one processor may determine one or more first distances, each based on one of the one or more first feature points and the second feature point, and one or more second distances, each based on one of the one or more third feature points and the second feature point.
  • the at least one processor may determine an action parameter based on the one or more first distances and the one or more second distances in each of the plurality of sequential candidate image frames.
  • the at least one processor may identify the presence of a nod action in response to that the action parameters satisfy one or more preset conditions.
  • the method may include one or more of the following operations.
  • the at least one processor may obtain a plurality of sequential candidate image frames containing a facial object.
  • Each of the plurality of candidate image frames may include one or more first feature points associated with an upper part of the facial object, a second feature point associated with a middle part of the facial object, and one or more third feature points associated with a lower part of the facial object.
  • the at least one processor may determine one or more first distances, each based on one of the one or more first feature points and the second feature point, and one or more second distances, each based on one of the one or more third feature points and the second feature point.
  • the at least one processor may determine an action parameter based on the one or more first distances and the one or more second distances in each of the plurality of sequential candidate image frames.
  • the at least one processor may identify the presence of a nod action in response to that the action parameters satisfy one or more preset conditions.
  • a further aspect of the present disclosure relates to a non-transitory computer readable medium.
  • the non-transitory computer readable medium may include executable instructions. When the executable instructions are executed by at least one processor, the executable instructions may direct the at least one processor to perform a method. The method may include one or more of the following operations.
  • the at least one processor may obtain a plurality of sequential candidate image frames containing a facial object. Each of the plurality of candidate image frames may include one or more first feature points associated with an upper part of the facial object, a second feature point associated with a middle part of the facial object, and one or more third feature points associated with a lower part of the facial object.
  • the at least one processor may determine one or more first distances, each based on one of the one or more first feature points and the second feature point, and one or more second distances, each based on one of the one or more third feature points and the second feature point.
  • the at least one processor may determine an action parameter based on the one or more first distances and the one or more second distances in each of the plurality of sequential candidate image frames.
  • the at least one processor may identify the presence of a nod action in response to that the action parameters satisfy one or more preset conditions.
  • the one or more first feature points may be points associated with at least one of a left brow or a right brow of the facial object
  • the second feature point may be a point on a tip of a nose of the facial object
  • the one or more third feature points may be points on a chin of the facial object.
  • first distances may include one or more first left distances and one or more first right distances.
  • Each first left distance may be determined based on a corresponding first feature point associated with the left brow and the second feature point.
  • Each first right distance may be determined based on a corresponding first feature point associated with the right brow and the second feature point.
  • the at least one processor may determine one or more first ratios of the one or more first left distances to the one or more second distances, each of the one or more first ratios corresponding to a first left distance and a second distance.
  • the at least one processor may determine a first average ratio of the one or more first ratios.
  • the at least one processor may determine one or more second ratios of the one or more first right distances to the one or more second distances, each of the one or more second ratios corresponding to a first right distance and a second distance.
  • the at least one processor may determine a second average ratio of the one or more second ratios.
  • the at least one processor may determine the action parameter based on the first average ratio and the second average ratio.
  • the at least one processor may determine one or more distance ratios of the one or more first distances to the one or more second distances, each of the one or more distance ratios corresponding to a first distance and a second distance.
  • the at least one processor may determine a compound distance ratio of the one or more distance ratios as the action parameter.
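The distance and ratio steps above can be sketched in Python as follows. This is a minimal illustration, not the claimed implementation: the function name `action_parameter`, the pairing of every first distance with every second distance, and the use of a plain mean to combine the left and right average ratios are all assumptions, since the text only says the action parameter is "based on" the two average ratios.

```python
import math
from statistics import mean

Point = tuple[float, float]  # (x, y) pixel coordinates of a feature point


def action_parameter(
    left_brow: list[Point],
    right_brow: list[Point],
    nose_tip: Point,
    chin: list[Point],
) -> float:
    """Return a per-frame action parameter from brow, nose-tip, and chin points."""
    # Second distances: each chin point (lower part) to the nose tip (middle part).
    second = [math.dist(c, nose_tip) for c in chin]
    # First left/right distances: each brow point (upper part) to the nose tip.
    first_left = [math.dist(b, nose_tip) for b in left_brow]
    first_right = [math.dist(b, nose_tip) for b in right_brow]

    # Assumption: every first distance is paired with every second distance.
    first_average_ratio = mean(l / s for l in first_left for s in second)
    second_average_ratio = mean(r / s for r in first_right for s in second)

    # Assumption: the two average ratios are combined by a plain mean.
    return (first_average_ratio + second_average_ratio) / 2
```

Because the result is a ratio of distances measured in the same frame, it does not depend on how large the face appears in the image, which is one plausible reason for preferring ratios over raw distances.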
  • the at least one processor may obtain an initial image frame including a first initial feature point on a center of a left eye of the facial object, a second initial feature point on a center of a right eye of the facial object, a third initial feature point on a tip of a nose of the facial object, a fourth initial feature point on a left end of a lip of the facial object, and a fifth initial feature point on a right end of the lip of the facial object.
  • the at least one processor may determine whether the third initial feature point is within a quadrangle determined based on the first initial feature point, the second initial feature point, the fourth initial feature point, and the fifth initial feature point.
  • the at least one processor may determine the initial image frame as a candidate image frame in response to that the third initial feature point is within the quadrangle determined based on the first initial feature point, the second initial feature point, the fourth initial feature point, and the fifth initial feature point.
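The candidate-frame check above amounts to a point-in-quadrangle test. The sketch below uses a standard cross-product sign test; it assumes the quadrangle is convex and that its vertices are visited in the order left eye, right eye, right lip end, left lip end. The function names are hypothetical, and points lying exactly on an edge are treated as outside, which the source does not specify.

```python
Point = tuple[float, float]  # (x, y) pixel coordinates


def _cross(a: Point, b: Point, p: Point) -> float:
    """Cross product of (b - a) and (p - a); its sign tells which side of edge ab p is on."""
    return (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0])


def is_candidate_frame(
    left_eye: Point,
    right_eye: Point,
    nose_tip: Point,
    lip_left: Point,
    lip_right: Point,
) -> bool:
    """Return True if the nose tip lies strictly inside the eye/lip quadrangle."""
    # Walk the quadrangle in a consistent order (assumed convex).
    quad = [left_eye, right_eye, lip_right, lip_left]
    signs = [_cross(quad[i], quad[(i + 1) % 4], nose_tip) for i in range(4)]
    # Inside a convex polygon, the point is on the same side of every edge.
    return all(s > 0 for s in signs) or all(s < 0 for s in signs)
```

A frame in which the nose tip falls outside the quadrangle suggests the face is turned too far from the camera, so rejecting it keeps the later distance ratios meaningful.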
  • the at least one processor may identify a plurality of sequential target image frames from the plurality of sequential candidate image frames.
  • the plurality of sequential target image frames may include a start image frame and an end image frame.
  • the at least one processor may identify a maximum action parameter from a plurality of action parameters corresponding to the plurality of sequential target image frames.
  • the at least one processor may identify a minimum action parameter associated with the plurality of action parameters corresponding to the plurality of sequential target image frames.
  • the at least one processor may determine an asymmetry parameter based on the maximum action parameter and the minimum action parameter.
  • the at least one processor may determine a first number count of target image frames from the start image frame to a target image frame corresponding to the maximum action parameter.
  • the at least one processor may determine a second number count of target image frames from the target image frame corresponding to the maximum action parameter to the end image frame.
  • the at least one processor may determine an estimated line by fitting the second feature points in the plurality of sequential target image frames.
  • the at least one processor may identify the presence of the nod action in response to that the asymmetry parameter is larger than an asymmetry threshold, the first number count is larger than a first number count threshold, the second number count is larger than a second number count threshold, and an angle between the estimated line and a vertical line is less than an angle threshold.
  • the at least one processor may select a plurality of previous image frames before the candidate image frame and a plurality of subsequent image frames after the candidate image frame along a sequence of the plurality of sequential candidate image frames.
  • the at least one processor may determine a first average action parameter based on a plurality of first action parameters corresponding to the plurality of previous image frames.
  • the at least one processor may determine a second average action parameter based on a plurality of second action parameters corresponding to the plurality of subsequent image frames.
  • the at least one processor may identify the candidate image frame as the start image frame in response to that the first average action parameter is less than the second average action parameter and each of the plurality of second action parameters is larger than or equal to an action parameter corresponding to the candidate image frame.
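The start-frame test above can be sketched as a sliding-window comparison. In this illustration `params` is the list of per-frame action parameters, `i` is the index of the candidate frame, and `window` (the number of previous and subsequent frames) is a hypothetical setting; the source does not state how many frames are selected.

```python
from statistics import mean


def is_start_frame(params: list[float], i: int, window: int = 3) -> bool:
    """Return True if frame i satisfies the start-frame conditions."""
    # Require a full window of frames on both sides of the candidate.
    if i < window or i + window >= len(params):
        return False
    previous = params[i - window:i]
    subsequent = params[i + 1:i + 1 + window]
    first_average = mean(previous)
    second_average = mean(subsequent)
    # Condition 1: the parameter is rising on average after the candidate.
    # Condition 2: no subsequent frame drops below the candidate's value.
    return first_average < second_average and all(p >= params[i] for p in subsequent)
```

Several neighbouring frames may satisfy the conditions; taking the first one in sequence order is one possible choice.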
  • the at least one processor may select a plurality of previous image frames before the candidate image frame and a plurality of subsequent image frames after the candidate image frame along the sequence of the plurality of sequential candidate image frames.
  • the at least one processor may determine a third average action parameter based on a plurality of third action parameters corresponding to the plurality of previous image frames.
  • the at least one processor may determine a fourth average action parameter based on a plurality of fourth action parameters corresponding to the plurality of subsequent image frames.
  • the at least one processor may identify the candidate image frame as the end image frame in response to that the third average action parameter is larger than the fourth average action parameter, each of the plurality of third action parameters is larger than or equal to an action parameter corresponding to the candidate image frame, an action parameter corresponding to a subsequent image frame adjacent to the candidate image frame is smaller than or equal to the action parameter corresponding to the candidate image frame, and a ratio associated with the first average action parameter and the fourth average action parameter is less than a ratio threshold.
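The end-frame test can be sketched in the same way. The source says only that "a ratio associated with" the first and fourth average action parameters must be below a ratio threshold, so the definition used here (larger average divided by smaller average) and the default `ratio_threshold` of 1.1 are assumptions chosen for illustration, as is the `window` size. Action parameters are assumed to be positive.

```python
from statistics import mean


def is_end_frame(
    params: list[float],
    i: int,
    first_average: float,
    window: int = 3,
    ratio_threshold: float = 1.1,
) -> bool:
    """Return True if frame i satisfies the end-frame conditions.

    first_average is the average action parameter of the frames
    preceding the start image frame.
    """
    # Require a full window of frames on both sides of the candidate.
    if i < window or i + window >= len(params):
        return False
    previous = params[i - window:i]
    subsequent = params[i + 1:i + 1 + window]
    third_average = mean(previous)
    fourth_average = mean(subsequent)
    # Assumption: the "ratio" compares the baseline before the nod with
    # the baseline after it, as larger average over smaller average.
    ratio = max(first_average, fourth_average) / min(first_average, fourth_average)
    return (
        third_average > fourth_average
        and all(p >= params[i] for p in previous)
        and params[i + 1] <= params[i]
        and ratio < ratio_threshold
    )
```

Under this reading, the ratio condition checks that the action parameter has returned to roughly the level it had before the nod began.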
  • the asymmetry threshold may be 2-3.
  • the first number count threshold may be 4-6
  • the second number count threshold may be 4-6
  • the angle threshold may be 10°-15°.
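The four nod conditions and their thresholds can be combined as in the sketch below. Several details are assumptions rather than statements from the source: the asymmetry parameter is taken as the maximum action parameter divided by the minimum, both frame counts include the peak frame, and the estimated line is fitted by least squares as x = a·y + b so that a near-vertical line is well conditioned. Default thresholds are picked from within the stated ranges.

```python
import math

Point = tuple[float, float]  # (x, y) pixel coordinates


def angle_to_vertical_deg(points: list[Point]) -> float:
    """Fit x = a*y + b by least squares and return the line's angle to vertical."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    syy = sum((y - mean_y) ** 2 for _, y in points)
    if syy == 0:
        return 90.0  # no vertical motion at all: treat the line as horizontal
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    return math.degrees(math.atan(abs(sxy / syy)))


def is_nod(
    params: list[float],
    nose_tips: list[Point],
    asymmetry_threshold: float = 2.0,
    first_count_threshold: int = 4,
    second_count_threshold: int = 4,
    angle_threshold_deg: float = 15.0,
) -> bool:
    """Check the nod conditions over target frames from start frame to end frame."""
    peak = max(range(len(params)), key=params.__getitem__)
    asymmetry = params[peak] / min(params)  # assumption: max / min
    first_count = peak + 1                  # start frame through peak frame
    second_count = len(params) - peak       # peak frame through end frame
    return (
        asymmetry > asymmetry_threshold
        and first_count > first_count_threshold
        and second_count > second_count_threshold
        and angle_to_vertical_deg(nose_tips) < angle_threshold_deg
    )
```

The angle condition separates a nod, in which the nose tip moves mostly up and down, from a head shake, in which it moves mostly sideways.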
  • the at least one processor may provide an authentication to a terminal device associated with a user corresponding to the facial object in response to the identification of the presence of the nod action.
  • the system may further include a camera, which may be configured to provide video data from which the plurality of sequential candidate image frames may be obtained.
  • the at least one processor may obtain the plurality of sequential candidate image frames from video data provided by a camera.
  • FIG. 1 is a schematic diagram illustrating an exemplary action recognition system according to some embodiments of the present disclosure
  • FIG. 2 is a schematic diagram illustrating exemplary hardware and/or software components of a computing device according to some embodiments of the present disclosure
  • FIG. 3 is a schematic diagram illustrating exemplary hardware and/or software components of a mobile device according to some embodiments of the present disclosure
  • FIG. 4 is a block diagram illustrating an exemplary processing engine according to some embodiments of the present disclosure
  • FIG. 5 is a flowchart illustrating an exemplary process for identifying presence of a nod action according to some embodiments of the present disclosure
  • FIGs. 6-A and 6-B are schematic diagrams illustrating exemplary feature points according to some embodiments of the present disclosure.
  • FIG. 7-A is a flowchart illustrating an exemplary process for determining an action parameter according to some embodiments of the present disclosure
  • FIG. 7-B is a flowchart illustrating an exemplary process for determining an action parameter according to some embodiments of the present disclosure
  • FIG. 8-A is a flowchart illustrating an exemplary process for determining a candidate image frame according to some embodiments of the present disclosure
  • FIG. 8-B is a schematic diagram illustrating exemplary initial feature points according to some embodiments of the present disclosure.
  • FIG. 9 is a flowchart illustrating an exemplary process for identifying presence of a nod action according to some embodiments of the present disclosure.
  • FIG. 10 is a flowchart illustrating an exemplary process for determining a start image frame according to some embodiments of the present disclosure
  • FIG. 11 is a flowchart illustrating an exemplary process for determining an end image frame according to some embodiments of the present disclosure.
  • FIG. 12 is a schematic diagram illustrating an exemplary curve indicating a variation process of an action parameter during a nod action according to some embodiments of the present disclosure.
  • The flowcharts used in the present disclosure illustrate operations that systems implement according to some embodiments of the present disclosure. It is to be expressly understood that the operations of the flowcharts need not be implemented in order. Conversely, the operations may be implemented in inverted order, or simultaneously. Moreover, one or more other operations may be added to the flowcharts. One or more operations may be removed from the flowcharts.
  • Although the systems and methods in the present disclosure are described primarily with regard to nod action identification, it should be understood that this is only one exemplary embodiment.
  • The systems and methods of the present disclosure may be applied to any other kind of action recognition.
  • The systems and methods of the present disclosure may be applied to other action recognitions including an eye movement, a shaking action, a blink action, a head up action, a mouth opening action, or the like, or any combination thereof.
  • The action recognition system may be applied in many application scenarios such as system login, identity authentication, Human-Computer Interaction (HCI), etc.
  • The application of the systems and methods of the present disclosure may include but not be limited to a web page, a plug-in of a browser, a client terminal, a custom system, an internal analysis system, an artificial intelligence robot, or the like, or any combination thereof.
  • The terms “subject,” “human,” and “user” in the present disclosure are used interchangeably to refer to a living body whose action is to be identified.
  • The terms “image frame,” “image,” “candidate image frames,” and “target image frames” in the present disclosure are used to refer to frames in video data or images captured by a camera device.
  • The terms “camera,” “camera device,” and “capture device” in the present disclosure may be used interchangeably to refer to a device that can capture video data or image data.
  • An aspect of the present disclosure relates to systems and methods for identifying the presence of a nod action.
  • During a nod action, a distance between an upper part of a facial object (e.g., a face of a human) and a middle part of the facial object changes dynamically; a distance between the middle part of the facial object and a lower part of the facial object also changes dynamically.
  • Accordingly, an action parameter (e.g., a ratio of the two distances) changes dynamically during the nod action.
  • the systems and methods may identify the presence of the nod action based on the change of the action parameter.
  • the systems and methods may obtain a plurality of sequential candidate image frames associated with the facial object.
  • Each of the plurality of sequential candidate image frames may include one or more first feature points associated with the upper part, a second feature point associated with the middle part, and one or more third feature points associated with the lower part.
  • the systems and methods may determine one or more first distances based on the one or more first feature points and the second feature point, and one or more second distances based on the one or more third feature points and the second feature point.
  • the systems and methods may determine the action parameter based on the one or more first distances and the one or more second distances. Accordingly, the systems and methods may identify the presence of the nod action based on the action parameters corresponding to the plurality of sequential candidate image frames.
  • FIG. 1 is a schematic diagram illustrating an exemplary action recognition system according to some embodiments of the present disclosure.
  • the action recognition system 100 may be an online action recognition platform for living body recognition based on information of a facial object (e.g., a face 160 of a human).
  • the action recognition system 100 may be used in a variety of application scenarios such as Human-Computer Interaction (HCI), system login, identity authentication, or the like, or any combination thereof.
  • the action recognition system 100 may execute instructions to perform operations defined by a user in response to an identification of an action. For example, after extracting facial information of the user and identifying an action (e.g., a nod action) of the user, the action recognition system 100 may execute instructions to perform defined operations such as turning a page of an e-book, adding animation effects during a video chat, controlling a robot to perform an operation (e.g., mopping the floor) , requesting a service (e.g., a taxi hailing service) , etc.
  • the action recognition system 100 may determine a login permission and allow a user account associated with the user to log in the system.
  • the action recognition system 100 may determine the user’s identity and provide a permission to access an account (e.g., a terminal device, a payment account, or a membership account) or a permission to enter a restricted place (e.g., a company, a library, a hospital, or an apartment) .
  • the action recognition system 100 may be an online platform including a server 110, a network 120, a camera device 130, a user terminal 140, and a storage 150.
  • the server 110 may be a single server or a server group.
  • the server group may be centralized, or distributed (e.g., server 110 may be a distributed system) .
  • the server 110 may be local or remote.
  • the server 110 may access information and/or data stored in the camera device 130, the user terminal 140, and/or the storage 150 via the network 120.
  • the server 110 may be directly connected to the camera device 130, the user terminal 140, and/or the storage 150 to access stored information and/or data.
  • the server 110 may be implemented on a cloud platform.
  • the cloud platform may include a private cloud, a public cloud, a hybrid cloud, a community cloud, a distributed cloud, an inter-cloud, a multi-cloud, or the like, or any combination thereof.
  • the server 110 may be implemented on a computing device 200 having one or more components illustrated in FIG. 2 in the present disclosure.
  • the server 110 may include a processing engine 112.
  • the processing engine 112 may process information and/or data relating to action recognition to perform one or more functions described in the present disclosure. For example, the processing engine 112 may identify the presence of a nod action based on a plurality of sequential candidate image frames containing a facial object.
  • the processing engine 112 may include one or more processing engines (e.g., single-core processing engine(s) or multi-core processor(s)).
  • the processing engine 112 may include one or more hardware processors, such as a central processing unit (CPU), an application-specific integrated circuit (ASIC), an application-specific instruction-set processor (ASIP), a graphics processing unit (GPU), a physics processing unit (PPU), a digital signal processor (DSP), a field-programmable gate array (FPGA), a programmable logic device (PLD), a controller, a microcontroller unit, a reduced instruction-set computer (RISC), a microprocessor, or the like, or any combination thereof.
  • the server 110 may be unnecessary and all or part of the functions of the server 110 may be implemented by other components (e.g., the camera device 130, the user terminal 140) of the action recognition system 100.
  • the processing engine 112 may be integrated in the camera device 130 or the user terminal 140 and the functions (e.g., identifying presence of an action of a facial object based on image frames associated with the facial object) of the processing engine 112 may be implemented by the camera device 130 or the user terminal 140.
  • the network 120 may facilitate the exchange of information and/or data.
  • one or more components of the action recognition system 100 (e.g., the server 110, the camera device 130, the user terminal 140, the storage 150) may exchange information and/or data with one another via the network 120.
  • the server 110 may obtain information and/or data (e.g., image frames) from the camera device 130 via the network 120.
  • the network 120 may be any type of wired or wireless network, or a combination thereof.
  • the network 120 may include a cable network, a wireline network, an optical fiber network, a telecommunications network, an intranet, the Internet, a local area network (LAN), a wide area network (WAN), a wireless local area network (WLAN), a metropolitan area network (MAN), a public switched telephone network (PSTN), a Bluetooth network, a ZigBee network, a near field communication (NFC) network, or the like, or any combination thereof.
  • the network 120 may include one or more network access points.
  • the network 120 may include wired or wireless network access points such as base stations and/or internet exchange points 120-1, 120-2, ..., through which one or more components of the action recognition system 100 may be connected to the network 120 to exchange data and/or information.
  • the camera device 130 may capture image data or video data containing a facial object.
  • the camera device 130 may capture a video including a plurality of image frames containing the facial object.
  • the camera device 130 may include a black-white camera, a color camera, an infrared camera, a 3-D camera, an X-ray camera, etc.
  • the camera device 130 may include a monocular camera, a binocular camera, a multi-camera, etc.
  • the camera device 130 may be a smart device including or connected to a camera.
  • the smart device may include a smart home device (e.g., a smart lighting device, a smart television), an intelligent robot (e.g., a sweeping robot, a mopping robot, a chatting robot, an industrial robot), etc.
  • the camera device 130 may be a surveillance camera.
  • the surveillance camera may include a wireless color camera, a low light camera, a vandal proof camera, a bullet camera, a pinhole camera, a hidden spy camera, a fixed box camera, or the like, or any combination thereof.
  • the camera device 130 may be an IP camera which can transmit the captured image data or video data to any component (e.g., the server 110, the user terminal 140, the storage 150) of the action recognition system 100 via the network 120.
  • the camera device 130 may independently identify the presence of an action of the facial object based on the captured image frames. In some embodiments, the camera device 130 may transmit the captured image frames to the server 110 or the user terminal 140 to be further processed. In some embodiments, the camera device 130 may transmit the captured image frames to the storage 150 to be stored. In some embodiments, the camera device 130 may be integrated in the user terminal 140. For example, the camera device 130 may be part of the user terminal 140, such as a camera of a mobile phone, a camera of a computer, etc.
  • the user terminal 140 may include a mobile device, a tablet computer, a laptop computer, or the like, or any combination thereof.
  • the mobile device may include a wearable device, a smart mobile device, a virtual reality device, an augmented reality device, or the like, or any combination thereof.
  • the wearable device may include a smart bracelet, a smart footgear, smart glasses, a smart helmet, a smart watch, smart clothing, a smart backpack, a smart accessory, or the like, or any combination thereof.
  • the smart mobile device may include a mobile phone, a personal digital assistant (PDA), a gaming device, a navigation device, a point of sale (POS) device, or the like, or any combination thereof.
  • the virtual reality device and/or the augmented reality device may include a virtual reality helmet, a virtual reality glass, a virtual reality patch, an augmented reality helmet, augmented reality glasses, an augmented reality patch, or the like, or any combination thereof.
  • the virtual reality device and/or the augmented reality device may include a Google Glass™, a RiftCon™, a Fragments™, a Gear VR™, etc.
  • the user terminal 140 may exchange information and/or data with other components of the action recognition system 100 (e.g., the server 110, the camera device 130, the storage 150) directly or via the network 120.
  • the user terminal 140 may obtain image frames from the camera device 130 or the storage 150 to identify the presence of an action of a facial object based on the image frames.
  • the user terminal 140 may receive a message (e.g., an authentication) from the server 110.
  • the storage 150 may store data and/or instructions. In some embodiments, the storage 150 may store data obtained from the camera device 130 and/or the user terminal 140. In some embodiments, the storage 150 may store data and/or instructions that the server 110 may execute or use to perform exemplary methods described in the present disclosure. In some embodiments, storage 150 may include a mass storage, a removable storage, a volatile read-and-write memory, a read-only memory (ROM) , or the like, or any combination thereof. Exemplary mass storage may include a magnetic disk, an optical disk, a solid-state drive, etc. Exemplary removable storage may include a flash drive, a floppy disk, an optical disk, a memory card, a zip disk, a magnetic tape, etc.
  • Exemplary volatile read-and-write memory may include a random access memory (RAM) .
  • RAM may include a dynamic RAM (DRAM), a double data rate synchronous dynamic RAM (DDR SDRAM), a static RAM (SRAM), a thyristor RAM (T-RAM), a zero-capacitor RAM (Z-RAM), etc.
  • Exemplary ROM may include a mask ROM (MROM), a programmable ROM (PROM), an erasable programmable ROM (EPROM), an electrically-erasable programmable ROM (EEPROM), a compact disk ROM (CD-ROM), a digital versatile disk ROM, etc.
  • the storage 150 may be implemented on a cloud platform.
  • the cloud platform may include a private cloud, a public cloud, a hybrid cloud, a community cloud, a distributed cloud, an inter-cloud, a multi-cloud, or the like, or any combination thereof.
  • the storage 150 may be connected to the network 120 to communicate with one or more components of the action recognition system 100 (e.g., the server 110, the camera device 130, the user terminal 140, etc. ) .
  • One or more components of the action recognition system 100 may access the data or instructions stored in the storage 150 via the network 120.
  • the storage 150 may be directly connected to or communicate with one or more components of the action recognition system 100 (e.g., the server 110, the camera device 130, the user terminal 140, etc. ) .
  • the storage 150 may be part of the server 110.
  • one or more components (e.g., the server 110, the camera device 130, the user terminal 140) of the action recognition system 100 may access the storage 150.
  • the user terminal 140 may access information/data (e.g., image frames containing the facial object) from the storage 150.
  • the storage 150 may be a data storage including cloud computing platforms, such as public clouds, private clouds, community clouds, hybrid clouds, etc. However, those variations and modifications do not depart from the scope of the present disclosure.
  • FIG. 2 is a schematic diagram illustrating exemplary hardware and/or software components of a computing device 200 according to some embodiments of the present disclosure.
  • the server 110, the camera device 130, and/or the user terminal 140 may be implemented on the computing device 200.
  • the processing engine 112 may be implemented on the computing device 200 and configured to perform functions of the processing engine 112 disclosed in this disclosure.
  • the computing device 200 may be used to implement any component of the action recognition system 100 as described herein.
  • the processing engine 112 may be implemented on the computing device 200, via its hardware, software program, firmware, or a combination thereof.
  • Although only one such computer is shown for convenience, the computer functions relating to the action recognition as described herein may be implemented in a distributed fashion on a number of similar platforms to distribute the processing load.
  • the computing device 200 may include COM ports 250 connected to and from a network connected thereto to facilitate data communications.
  • the computing device 200 may also include a processor 220, in the form of one or more processors (e.g., logic circuits) , for executing program instructions.
  • the processor 220 may include interface circuits and processing circuits therein.
  • the interface circuits may be configured to receive electronic signals from a bus 210, wherein the electronic signals encode structured data and/or instructions for the processing circuits to process.
  • the processing circuits may conduct logic calculations, and then determine a conclusion, a result, and/or an instruction encoded as electronic signals. Then the interface circuits may send out the electronic signals from the processing circuits via the bus 210.
  • the computing device 200 may further include program storage and data storage of different forms including, for example, a disk 270, and a read only memory (ROM) 230, or a random access memory (RAM) 240, for various data files to be processed and/or transmitted by the computing device.
  • the exemplary computer platform may also include program instructions stored in the ROM 230, RAM 240, and/or other type of non-transitory storage medium to be executed by the processor 220.
  • the methods and/or processes of the present disclosure may be implemented as the program instructions.
  • the computing device 200 also includes an I/O component 260, supporting input/output between the computer and other components.
  • the computing device 200 may also receive programming and data via network communications.
  • step A and step B may also be performed by two different CPUs and/or processors jointly or separately in the computing device 200 (e.g., the first processor executes step A and the second processor executes step B, or the first and second processors jointly execute steps A and B) .
  • FIG. 3 is a schematic diagram illustrating exemplary hardware and/or software components of a mobile device 300 on which the camera device 130, the user terminal 140, or part of the camera device 130 or the user terminal 140 may be implemented according to some embodiments of the present disclosure.
  • the mobile device 300 may include a communication platform 310, a display 320, a graphic processing unit (GPU) 330, a central processing unit (CPU) 340, an I/O 350, a memory 360, a mobile operating system (OS) 370, and a storage 390.
  • any other suitable component including but not limited to a system bus or a controller (not shown) , may also be included in the mobile device 300.
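  • the mobile operating system 370 (e.g., iOS™, Android™, Windows Phone™, etc.) and one or more applications 380 may be loaded into the memory 360 from the storage 390 in order to be executed by the CPU 340.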
  • the applications 380 may include a browser or any other suitable mobile apps for receiving and rendering information relating to action recognition or other information from the action recognition system 100.
  • User interactions with the information stream may be achieved via the I/O 350 and provided to the processing engine 112 and/or other components of the action recognition system 100 via the network 120.
  • computer hardware platforms may be used as the hardware platform(s) for one or more of the elements described herein.
  • a computer with user interface elements may be used to implement a personal computer (PC) or any other type of work station or terminal device.
  • a computer may also act as a system if appropriately programmed.
  • FIG. 4 is a block diagram illustrating an exemplary processing engine according to some embodiments of the present disclosure.
  • the processing engine 112 may include an obtaining module 410, a distance determination module 420, an action parameter determination module 430, and an identification module 440.
  • the obtaining module 410 may be configured to obtain a plurality of sequential candidate image frames containing a facial object.
  • the facial object may refer to a face of a subject (e.g., a human, an animal) .
  • the obtaining module 410 may obtain the plurality of sequential candidate image frames from the camera device 130, the user terminal 140, or a storage device (e.g., the storage 150) disclosed elsewhere in the present disclosure.
  • an “image frame” may refer to a frame in a video, and “sequential” may mean that the image frames are ordered according to a sequence (e.g., a temporal sequence) in the video.
  • the camera device 130 may capture a video in chronological order.
  • the video includes a plurality of image frames corresponding to a plurality of capture time points respectively. Accordingly, the image frames are aligned in chronological order based on the capture time points.
  • each of the plurality of candidate image frames may include a plurality of feature points associated with the facial object.
  • the plurality of feature points may include one or more first feature points associated with an upper part of the facial object, a second feature point associated with a middle part of the facial object, and one or more third feature points associated with a lower part of the facial object.
  • the upper part may refer to an upper region above a nose of the facial object
  • the middle part may refer to a middle region including the nose of the facial object
  • the lower part may refer to a lower region below the nose of the facial object.
  • the distance determination module 420 may be configured to determine one or more first distances, each based on one of the one or more first feature points and the second feature point, and one or more second distances, each based on one of the one or more third feature points and the second feature point in each of the plurality of sequential candidate image frames.
  • the first distance may indicate a distance between the upper part of the facial object and the middle part of the facial object.
  • the second distance may indicate a distance between the middle part of the facial object and the lower part of the facial object.
  • the action parameter determination module 430 may be configured to determine an action parameter based on the one or more first distances and the one or more second distances in each of the plurality of sequential candidate image frames.
  • the action parameter refers to a parameter associated with a ratio of a distance between the upper part and the middle part to a distance between the middle part and the lower part.
  • the identification module 440 may be configured to identify the presence of a nod action in response to a determination that the action parameters satisfy a preset condition. During the nod action, the facial object may move along a downward direction from a start position to a middle position and then move along an upward direction from the middle position to an end position. Therefore, during the nod action, the distance between the upper part of the facial object and the middle part of the facial object and the distance between the middle part of the facial object and the lower part of the facial object change dynamically across the plurality of sequential candidate image frames. Accordingly, the action parameter changes dynamically during the nod action.
  • the modules in the processing engine 112 may be connected to or communicated with each other via a wired connection or a wireless connection.
  • the wired connection may include a metal cable, an optical cable, a hybrid cable, or the like, or any combination thereof.
  • the wireless connection may include a Local Area Network (LAN) , a Wide Area Network (WAN) , a Bluetooth, a ZigBee, a Near Field Communication (NFC) , or the like, or any combination thereof.
  • the distance determination module 420 and the action parameter determination module 430 may be combined as a single module which may both determine the one or more first distances and the one or more second distances, and determine the action parameter based on the first distances and the second distances.
  • the processing engine 112 may include a storage module (not shown) which may be used to store data generated by the above-mentioned modules.
  • FIG. 5 is a flowchart illustrating an exemplary process for identifying presence of a nod action according to some embodiments of the present disclosure.
  • the process 500 may be implemented as a set of instructions (e.g., an application) stored in the storage ROM 230 or RAM 240.
  • the processor 220 and/or the modules in FIG. 4 may execute the set of instructions, and when executing the instructions, the processor 220 and/or the modules may be configured to perform the process 500.
  • the operations of the illustrated process presented below are intended to be illustrative. In some embodiments, the process 500 may be accomplished with one or more additional operations not described and/or without one or more of the operations herein discussed. Additionally, the order in which the operations of the process are illustrated in FIG. 5 and described below is not intended to be limiting.
  • the processing engine 112 may obtain a plurality of sequential candidate image frames containing a facial object.
  • the facial object may refer to a face of a subject (e.g., a human, an animal) .
  • the processing engine 112 may obtain the plurality of sequential candidate image frames from the camera device 130, the user terminal 140, or a storage device (e.g., the storage 150) disclosed elsewhere in the present disclosure.
  • an “image frame” may refer to a frame in a video, and “sequential” may mean that the image frames are ordered according to a sequence (e.g., a temporal sequence) in the video.
  • the camera device 130 may capture a video in chronological order.
  • the video includes a plurality of image frames corresponding to a plurality of capture time points respectively. Accordingly, the image frames are aligned in chronological order based on the capture time points.
  • the plurality of sequential candidate image frames may be expressed as an ordered set as illustrated in formula (1) below:
  • $F = [F_1, F_2, \ldots, F_i, \ldots, F_m]$ (1)
  • $F$ refers to the ordered set
  • $F_i$ refers to an ith candidate image frame
  • $m$ refers to the number of the plurality of candidate image frames.
  • the plurality of sequential candidate image frames are ordered in chronological order based on capture time points of the plurality of candidate image frames.
  • the candidate image frame F 1 corresponds to a first capture time point
  • the candidate image frame F 2 corresponds to a second capture time point, wherein the second capture time point is later than the first capture time point and a time interval between the first capture time point and the second capture time point may be a default parameter of the camera device 130 or may be set by the action recognition system 100.
  • the camera device 130 may capture 24 image frames per second; in certain embodiments, the intervals between neighboring candidate image frames may be 1/24 second, meaning that all the captured image frames are used as candidate image frames; in certain other embodiments, the intervals between neighboring candidate image frames may be 1/12 second, meaning that certain (half) captured image frames are used as candidate image frames but the others are skipped.
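  • The frame-skipping choice described above can be sketched as follows. This is a minimal illustration only: the function name `select_candidate_frames` and the `step` parameter are hypothetical and do not appear in the disclosure, and frames are represented by their capture indices.

```python
from typing import List, Sequence, TypeVar

Frame = TypeVar("Frame")


def select_candidate_frames(captured: Sequence[Frame], step: int = 1) -> List[Frame]:
    """Keep every `step`-th captured frame, preserving chronological order.

    `step` is a hypothetical parameter: step=1 keeps all frames,
    step=2 keeps half of them and skips the others.
    """
    if step < 1:
        raise ValueError("step must be a positive integer")
    return list(captured[::step])


# One second of video at 24 frames per second, represented by frame indices.
captured = list(range(24))
all_frames = select_candidate_frames(captured, step=1)   # neighbors 1/24 s apart
half_frames = select_candidate_frames(captured, step=2)  # neighbors 1/12 s apart
```

  • With `step=2`, neighboring candidate image frames are two capture intervals apart, which corresponds to the 1/12 second spacing in the example above.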
  • each of the plurality of candidate image frames may include a plurality of feature points associated with the facial object.
  • a “feature point” may refer to a point located on the face; in certain embodiments, the feature point is a point on the face and is measurably recognizable, for example, a point on an end of an eye, a point on a brow, a point on a nose, etc.
  • the processing engine 112 may determine the plurality of feature points based on a facial recognition process.
  • the facial recognition process may include a process based on geometric features, a local face analysis process, a principal component analysis process, a deep-learning-based process, or the like, or any combination thereof.
  • the plurality of feature points may include one or more first feature points associated with an upper part of the facial object, a second feature point associated with a middle part of the facial object, and one or more third feature points associated with a lower part of the facial object.
  • the upper part may refer to an upper region above a nose of the facial object
  • the middle part may refer to a middle region including the nose of the facial object
  • the lower part may refer to a lower region below the nose of the facial object.
  • the one or more first feature points may include one or more first left points associated with a left upper part of the facial object and one or more first right points associated with a right upper part of the facial object.
  • the one or more first left points may include any point (e.g., point $a_{l1}$, ..., point $a_{li}$, ..., and point $a_{ln_1}$) on a left brow.
  • the one or more first right points may include any point (e.g., point $a_{r1}$, ..., point $a_{ri}$, ..., and point $a_{rn_2}$) on a right brow.
  • the one or more first feature points may include any point located on the upper part of the facial object.
  • the one or more first feature points may include any point (e.g., point $a_i$) located on or above a line 610 determined based on a highest point of the left brow and a highest point of the right brow, or any point (e.g., point $a'_i$) located on a line 620 determined based on a right end point of the left brow and a left end point of the right brow.
  • the second feature point may be a point (e.g., a tip point of the nose) on or around the nose of the facial object.
  • the one or more third feature points may include any point located on the lower part of the facial object.
  • the one or more third feature points may include any point (e.g., point $c_1$, ..., point $c_i$, ..., and point $c_{n_3}$) on a chin of the facial object.
  • the processing engine 112 (e.g., the distance determination module 420) (e.g., the processing circuits of the processor 220) may determine one or more first distances, each based on one of the one or more first feature points and the second feature point, and one or more second distances, each based on one of the one or more third feature points and the second feature point.
  • the first distance may indicate a distance between the upper part of the facial object and the middle part of the facial object.
  • the second distance may indicate a distance between the middle part of the facial object and the lower part of the facial object.
  • the processing engine 112 may determine the first distance or the second distance according to formula (2) below:
  • $D = \sqrt{(x_i - x_0)^2 + (y_i - y_0)^2}$ (2)
  • $D$ refers to the first distance or the second distance
  • $(x_i, y_i)$ refers to a coordinate of an ith first feature point associated with the upper part of the facial object or an ith third feature point associated with the lower part of the facial object
  • $(x_0, y_0)$ refers to a coordinate of the second feature point associated with the middle part of the facial object.
  • Although the present disclosure takes a rectangular coordinate system as an example, it should be noted that the coordinates of the feature points may be expressed in any coordinate system (e.g., a polar coordinate system) and an origin of the coordinate system may be any point in the image frame.
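  • A minimal sketch of such a distance computation, assuming a rectangular coordinate system and a Euclidean distance between a feature point and the second feature point (e.g., the nose tip). The function name `feature_distance` and the sample coordinates are hypothetical.

```python
import math
from typing import Tuple

Point = Tuple[float, float]


def feature_distance(point: Point, second_feature_point: Point) -> float:
    """Euclidean distance between a feature point (x_i, y_i) and the
    second feature point (x_0, y_0)."""
    x_i, y_i = point
    x_0, y_0 = second_feature_point
    return math.hypot(x_i - x_0, y_i - y_0)


# Hypothetical pixel coordinates: a brow point and a chin point relative to a nose tip.
nose_tip = (50.0, 60.0)
first_distance = feature_distance((53.0, 56.0), nose_tip)   # upper part to middle part
second_distance = feature_distance((50.0, 90.0), nose_tip)  # lower part to middle part
```

  • Because only differences of coordinates enter the computation, the result does not depend on where the origin of the coordinate system is placed in the image frame.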
  • the processing engine 112 (e.g., the action parameter determination module 430) (e.g., the processing circuits of the processor 220) may determine an action parameter based on the one or more first distances and the one or more second distances in each of the plurality of sequential candidate image frames.
  • the action parameter refers to a parameter associated with a ratio of a distance between the upper part and the middle part to a distance between the middle part and the lower part. More descriptions of the action parameter may be found elsewhere in the present disclosure (e.g., FIG. 7-A, FIG. 7-B, and the descriptions thereof) .
  • the processing engine 112 (e.g., the identification module 440) (e.g., the processing circuits of the processor 220) may identify the presence of a nod action in response to a determination that the action parameters satisfy a preset condition. During the nod action, the facial object may move along a downward direction from a start position to a middle position and then move along an upward direction from the middle position to an end position. Therefore, during the nod action, the distance between the upper part of the facial object and the middle part of the facial object and the distance between the middle part of the facial object and the lower part of the facial object change dynamically across the plurality of sequential candidate image frames. Accordingly, the action parameter changes dynamically during the nod action.
  • the action parameter corresponding to the start position and the action parameter corresponding to the end position are fixed values that are approximately equal to each other.
  • the middle position may be a stop position where the facial object stops moving down (or starts moving up back) , which corresponds to a time point when the action parameter is maximum.
  • the processing engine 112 may identify a plurality of sequential target image frames including a start image frame which corresponds to or substantially corresponds to the start position, an end image frame which corresponds to or substantially corresponds to the end position, and a middle image frame which corresponds to or substantially corresponds to the middle position, and identify the presence of the nod action based on the action parameters of the start image frame, the end image frame, and the middle image frame. More descriptions of the identification of the nod action may be found elsewhere in the present disclosure (e.g., FIGs. 9-11 and the descriptions thereof) .
  • as used herein, “substantially corresponds to” means that a time interval between the capture time point of the image frame and the time point corresponding to the position is less than a time threshold recognizable by a person of ordinary skill in the art.
  • the camera device 130 captures image frames according to a frame rate (which may be a default parameter), so capture time points of two adjacent image frames are not continuous (i.e., there is a time interval between the two capture time points). Taking the “start position” as an example, the start image frame may not strictly correspond to the time point of the start position, but the capture time point of the start image frame may be very close to that time point. In ideal conditions, the two time points can be considered the same because the intervals between the candidate image frames are usually short.
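  • The behavior of the action parameter described above (approximately equal values at the start and end positions, and a maximum at the middle position) can be illustrated with a simplified check. This is only a sketch under stated assumptions and is not the preset condition of the disclosure, which is described in connection with FIGs. 9-11: the function name `looks_like_nod` and the thresholds `tol` and `min_rise` are hypothetical.

```python
from typing import Sequence


def looks_like_nod(action_params: Sequence[float],
                   tol: float = 0.05,
                   min_rise: float = 0.2) -> bool:
    """Simplified check over action parameters of sequential frames.

    Assumptions (hypothetical): the first and last values stand for the
    start and end image frames, the largest value marks the middle image
    frame, `tol` bounds the start/end difference, and `min_rise` is the
    minimum increase from the start value to the maximum.
    """
    if len(action_params) < 3:
        return False
    peak = max(range(len(action_params)), key=action_params.__getitem__)
    # The maximum must lie strictly between the start and the end frames.
    if peak == 0 or peak == len(action_params) - 1:
        return False
    start, middle, end = action_params[0], action_params[peak], action_params[-1]
    return abs(start - end) <= tol and (middle - start) >= min_rise


nod = looks_like_nod([1.0, 1.2, 1.5, 1.2, 1.0])
still = looks_like_nod([1.0, 1.01, 1.0])
only_down = looks_like_nod([1.0, 1.2, 1.5])
```

  • The third sequence rises without returning, so it is not treated as a complete nod in this sketch.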
  • the processing engine 112 may further provide an authentication to a terminal device (e.g., the user terminal 140) associated with a user corresponding to the facial object in response to the identification of the presence of the nod action. After receiving the authentication, the user may be permitted to access the terminal device.
  • the processing engine 112 may store information (e.g., the plurality of sequential candidate image frames, the one or more first distances, the one or more second distances, the action parameters) associated with the action identification in a storage device (e.g., the storage 150) disclosed elsewhere in the present disclosure.
  • FIGs. 6-A and 6-B are schematic diagrams illustrating exemplary feature points according to some embodiments of the present disclosure.
  • each of the plurality of candidate image frames may include one or more first feature points associated with the upper part of the facial object, a second feature point associated with the middle part of the facial object, and one or more third feature points associated with the lower part of the facial object.
  • the upper part of the facial object refers to the area of the facial object that includes the eyebrows and above.
  • the lower part of the facial object refers to the area of the facial object that includes the lips and below.
  • the middle part of the facial object refers to the area between the eyebrows and the lips.
  • the one or more first feature points may include first left points (e.g., point $a_{l1}$, ..., point $a_{li}$, ..., and point $a_{ln_1}$ on a left brow) associated with the left upper part of the facial object and first right points (e.g., point $a_{r1}$, ..., point $a_{ri}$, ..., and point $a_{rn_2}$ on a right brow) associated with the right upper part of the facial object.
  • the second feature point may be a tip point (e.g., point $b$) of a nose of the facial object.
  • the one or more third feature points may include points (e.g., point $c_1$, ..., point $c_i$, ..., and point $c_{n_3}$) on a chin of the facial object.
  • the values of $n_1$, $n_2$, and $n_3$ above may be the same as each other or may be different from each other.
  • the one or more first feature points may include any point (e.g., point $a_i$) located on or above a line 610 determined based on a highest point of the left brow and a highest point of the right brow, or any point (e.g., point $a'_i$) located on or above a line 620 determined based on a right end point of the left brow and a left end point of the right brow.
  • the first feature point may be any point located on the upper part of the facial object, for example, an end point of an eye, a point located on a line determined based on two end points of the eye, etc.
  • the second feature point may be any point (e.g., a nasal root point) located on or around the nose of the facial object.
  • the third feature point may be any point located on the lower part of the facial object, for example, an end point of a lip, a point located on a line determined based on two end points of the lip, etc.
  • FIG. 7-A is a flowchart illustrating an exemplary process for determining an action parameter according to some embodiments of the present disclosure.
  • the process 710 may be implemented as a set of instructions (e.g., an application) stored in the storage ROM 230 or RAM 240.
  • the processor 220 and/or the modules in FIG. 4 may execute the set of instructions, and when executing the instructions, the processor 220 and/or the modules may be configured to perform the process 710.
  • the operations of the illustrated process presented below are intended to be illustrative. In some embodiments, the process 710 may be accomplished with one or more additional operations not described and/or without one or more of the operations herein discussed. Additionally, the order in which the operations of the process are illustrated in FIG. 7-A and described below is not intended to be limiting. In some embodiments, operation 530 may be performed based on the process 710.
  • the processing engine 112 (e.g., the action parameter determination module 430) (e.g., the processing circuits of the processor 220) may determine one or more first ratios of one or more first left distances to one or more second distances. Each of the one or more first ratios may correspond to a first left distance and a second distance. As described in connection with 510, the one or more first feature points may include one or more first left points associated with a left upper part of the facial object and one or more first right points associated with a right upper part of the facial object. Accordingly, each of the one or more first left distances is determined based on a corresponding first left point and the second feature point.
  • the processing engine 112 may determine the first ratio according to formula (3) below:
  • $R_{li} = \dfrac{D_{li}}{C_i}$ (3)
  • $R_{li}$ refers to an ith first ratio
  • $D_{li}$ refers to an ith first left distance
  • $C_i$ refers to an ith second distance
  • the processing engine 112 (e.g., the action parameter determination module 430) (e.g., the processing circuits of the processor 220) may determine a first average ratio of the one or more first ratios. For example, the processing engine 112 may determine the first average ratio according to formula (4) below:
  • $\bar{R}_l = \dfrac{1}{s_1} \sum_{i=1}^{s_1} R_{li}$ (4)
  • $\bar{R}_l$ refers to the first average ratio
  • $R_{li}$ refers to the ith first ratio
  • $s_1$ refers to the number of the one or more first ratios.
  • the processing engine 112 (e.g., the action parameter determination module 430) (e.g., the processing circuits of the processor 220) may determine one or more second ratios of one or more first right distances to the one or more second distances.
  • Each of the one or more second ratios may correspond to a first right distance and a second distance.
  • each of the one or more first right distances is determined based on a corresponding first right point and the second feature point.
  • the processing engine 112 may determine the second ratio according to formula (5) below:
  • $R_{ri} = \dfrac{D_{ri}}{C_i}$ (5)
  • $R_{ri}$ refers to an ith second ratio
  • $D_{ri}$ refers to an ith first right distance
  • $C_i$ refers to the ith second distance
  • the processing engine 112 (e.g., the action parameter determination module 430) (e.g., the processing circuits of the processor 220) may determine a second average ratio of the one or more second ratios. For example, the processing engine 112 may determine the second average ratio according to formula (6) below:
  • $\bar{R}_r = \dfrac{1}{s_2} \sum_{i=1}^{s_2} R_{ri}$ (6)
  • $\bar{R}_r$ refers to the second average ratio
  • $R_{ri}$ refers to the ith second ratio
  • $s_2$ refers to the number of the one or more second ratios, wherein $s_2$ may be the same as or different from $s_1$.
  • the processing engine 112 (e.g., the action parameter determination module 430) (e.g., the processing circuits of the processor 220) may determine an action parameter based on the first average ratio and the second average ratio. For example, the processing engine 112 may determine the action parameter according to formula (7) below:
  • $A = \dfrac{\bar{R}_l + \bar{R}_r}{2}$ (7)
  • $A$ refers to the action parameter, and $\bar{R}_l$ and $\bar{R}_r$ refer to the first average ratio and the second average ratio, respectively. It should be noted that formula (7) above is provided for illustration purposes; the processing engine 112 may alternatively determine the action parameter based on a weighted average value of the first average ratio and the second average ratio, a larger one of the first average ratio and the second average ratio, etc.
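  • The process 710 can be sketched as follows, under stated assumptions: distances are Euclidean, the ith first left (or right) distance is paired with the ith second distance by index, and the action parameter is the simple mean of the two average ratios. The function names and sample coordinates are hypothetical.

```python
import math
from statistics import mean
from typing import Sequence, Tuple

Point = Tuple[float, float]


def _distance(p: Point, q: Point) -> float:
    """Euclidean distance between two feature points."""
    return math.hypot(p[0] - q[0], p[1] - q[1])


def action_parameter_710(left_points: Sequence[Point],
                         right_points: Sequence[Point],
                         chin_points: Sequence[Point],
                         nose: Point) -> float:
    """Average of the left-brow and right-brow ratio averages.

    Assumes non-empty inputs and that no chin point coincides with the
    nose point (otherwise a second distance would be zero).
    """
    second = [_distance(c, nose) for c in chin_points]
    left = [_distance(a, nose) for a in left_points]
    right = [_distance(a, nose) for a in right_points]
    # Pair distances by index (an assumption of this sketch).
    first_ratios = [d / c for d, c in zip(left, second)]
    second_ratios = [d / c for d, c in zip(right, second)]
    return (mean(first_ratios) + mean(second_ratios)) / 2


# Hypothetical coordinates with the nose tip at the origin.
A = action_parameter_710(
    left_points=[(-3.0, 4.0)],
    right_points=[(3.0, 4.0)],
    chin_points=[(0.0, -10.0)],
    nose=(0.0, 0.0),
)
```

  • In this example each brow point is 5 units from the nose tip and the chin point is 10 units away, so both average ratios are 0.5 and the action parameter is 0.5.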
  • FIG. 7-B is a flowchart illustrating an exemplary process for determining an action parameter according to some embodiments of the present disclosure.
  • the process 720 may be implemented as a set of instructions (e.g., an application) stored in the storage ROM 230 or RAM 240.
  • the processor 220 and/or the modules in FIG. 4 may execute the set of instructions, and when executing the instructions, the processor 220 and/or the modules may be configured to perform the process 720.
  • the operations of the illustrated process presented below are intended to be illustrative. In some embodiments, the process 720 may be accomplished with one or more additional operations not described and/or without one or more of the operations herein discussed. Additionally, the order in which the operations of the process are illustrated in FIG. 7-B and described below is not intended to be limiting. In some embodiments, operation 530 may be performed based on the process 720.
  • the processing engine 112 (e.g., the action parameter determination module 430) (e.g., the processing circuits of the processor 220) may determine one or more distance ratios of the one or more first distances to the one or more second distances. Each of the one or more distance ratios may correspond to a first distance and a second distance.
  • the processing engine 112 may determine the distance ratio according to formula (8) below:
  • $R_i = \dfrac{D_i}{C_i}$ (8)
  • $R_i$ refers to an ith distance ratio
  • $D_i$ refers to an ith first distance
  • $C_i$ refers to an ith second distance
  • the processing engine 112 may determine a compound distance ratio of the one or more distance ratios as the action parameter. For example, the processing engine 112 may determine the compound distance ratio (i.e., the action parameter) according to formula (9) below:
  • $A = \dfrac{1}{q} \sum_{i=1}^{q} R_i$ (9)
  • $A$ refers to the action parameter
  • $R_i$ refers to the ith distance ratio
  • $q$ refers to the number of the one or more distance ratios. It should be noted that formula (9) above is provided for illustration purposes; the processing engine 112 may alternatively determine the action parameter based on a weighted average value of the one or more distance ratios, a larger one of the one or more distance ratios, etc.
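  • A minimal sketch of the process 720, assuming that the compound distance ratio is the arithmetic mean of the distance ratios and that the ith first distance is paired with the ith second distance. The function name `compound_distance_ratio` is hypothetical.

```python
from statistics import mean
from typing import Sequence


def compound_distance_ratio(first_distances: Sequence[float],
                            second_distances: Sequence[float]) -> float:
    """Mean of the ratios D_i / C_i, used here as the action parameter."""
    if not first_distances or len(first_distances) != len(second_distances):
        raise ValueError("expected two non-empty sequences of equal length")
    if any(c == 0 for c in second_distances):
        raise ValueError("second distances must be non-zero")
    ratios = [d / c for d, c in zip(first_distances, second_distances)]
    return mean(ratios)


# Hypothetical distances measured in one candidate image frame.
A = compound_distance_ratio([5.0, 6.0], [10.0, 12.0])
```

  • A weighted average or the larger of the ratios, as mentioned above, could be substituted for `mean` without changing the rest of the sketch.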
  • FIG. 8-A is a flowchart illustrating an exemplary process for determining a candidate image frame according to some embodiments of the present disclosure.
  • the process 800 may be implemented as a set of instructions (e.g., an application) stored in the storage ROM 230 or RAM 240.
  • the processor 220 and/or the modules in FIG. 4 may execute the set of instructions, and when executing the instructions, the processor 220 and/or the modules may be configured to perform the process 800.
  • the operations of the illustrated process presented below are intended to be illustrative. In some embodiments, the process 800 may be accomplished with one or more additional operations not described and/or without one or more of the operations herein discussed. Additionally, the order in which the operations of the process are illustrated in FIG. 8-A and described below is not intended to be limiting. In some embodiments, operation 510 may be performed based on process 800.
  • the processing engine 112 may obtain an initial image frame.
  • the processing engine 112 may obtain the initial image frame from the camera device 130, the user terminal 140, or a storage device (e.g., the storage 150) disclosed elsewhere in the present disclosure.
  • the initial image frame may include a first initial feature point on a center of a left eye of the facial object, a second initial feature point on a center of a right eye of the facial object, a third initial feature point on a tip of a nose of the facial object, a fourth initial feature point on a left end of a lip of the facial object, and a fifth initial feature point on a right end of the lip of the facial object.
  • the processing engine 112 (e.g., the obtaining module 410) (e.g., the processing circuits of the processor 220) may determine a quadrangle based on the first initial feature point, the second initial feature point, the fourth initial feature point, and the fifth initial feature point. Further, the processing engine 112 may determine whether the third initial feature point is within the quadrangle.
  • the processing engine 112 (e.g., the obtaining module 410) (e.g., the processing circuits of the processor 220) may determine the initial image frame as a candidate image frame in response to determining that the third initial feature point is within the quadrangle. As illustrated in FIG. 8-B, if the third initial feature point is within the quadrangle, it may indicate that the initial image frame contains the facial object; if the third initial feature point is not within the quadrangle, it may indicate that there was a problem during the capture of the initial image frame, so that the initial image frame cannot be used for further processing.
  • if the initial image frame only includes some of the first initial feature point, the second initial feature point, the third initial feature point, the fourth initial feature point, and the fifth initial feature point, it may indicate that the initial image frame only contains a part (e.g., the upper part) of the facial object; in this situation, the initial image frame also cannot be used for further processing.
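The candidate-frame check of process 800 can be sketched as a point-in-quadrangle test. This is only an illustration under stated assumptions: the function names are hypothetical, the quadrangle is assumed to be convex, and a same-side (cross product) test is used because the source does not say how containment is decided.

```python
from typing import Optional, Sequence, Tuple

Point = Tuple[float, float]


def _cross(o: Point, a: Point, b: Point) -> float:
    """Return the z-component of the cross product (a - o) x (b - o)."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def is_candidate_frame(points: Sequence[Optional[Point]]) -> bool:
    """points = [left eye, right eye, nose tip, left lip end, right lip end]."""
    # A frame missing any of the five points shows only part of the face.
    if len(points) != 5 or any(p is None for p in points):
        return False
    left_eye, right_eye, nose, lip_left, lip_right = points
    # Walk the corners around the perimeter so consecutive corners share an edge.
    corners = [left_eye, right_eye, lip_right, lip_left]
    sides = [_cross(corners[k], corners[(k + 1) % 4], nose) for k in range(4)]
    # Inside a convex quadrangle, the nose lies on the same side of every edge.
    return all(s > 0 for s in sides) or all(s < 0 for s in sides)


# Hypothetical pixel coordinates for the five initial feature points.
print(is_candidate_frame([(30, 40), (70, 40), (50, 60), (35, 80), (65, 80)]))
```

In this sketch a nose tip lying exactly on an edge is treated as outside the quadrangle; that boundary choice is an assumption.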
  • FIG. 8-B is a schematic diagram illustrating exemplary initial feature points according to some embodiments of the present disclosure.
  • point 841 refers to the first initial feature point on the center of the left eye
  • point 842 refers to the second initial feature point on the center of the right eye
  • point 843 refers to the third initial feature point on the tip of the nose
  • point 844 refers to the fourth initial feature point on the left end of the lip
  • point 845 refers to the fifth initial feature point on the right end of the lip. It can be seen that point 843 is within a quadrangle 840 determined based on the points 841, 842, 844, and 845.
  • FIG. 9 is a flowchart illustrating an exemplary process for identifying the presence of a nod action according to some embodiments of the present disclosure.
  • the process 900 may be implemented as a set of instructions (e.g., an application) stored in the storage ROM 230 or RAM 240.
  • the processor 220 and/or the modules in FIG. 4 may execute the set of instructions, and when executing the instructions, the processor 220 and/or the modules may be configured to perform the process 900.
  • the operations of the illustrated process presented below are intended to be illustrative. In some embodiments, the process 900 may be accomplished with one or more additional operations not described and/or without one or more of the operations herein discussed. Additionally, the order of the operations of the process as illustrated in FIG. 9 and described below is not intended to be limiting. In some embodiments, operation 540 may be performed based on process 900.
  • the processing engine 112 may identify a plurality of sequential target image frames from the plurality of sequential candidate image frames.
  • the plurality of sequential target image frames include a start image frame which corresponds to or substantially corresponds to the start position (i.e., a position where the facial object starts moving along a downward direction), an end image frame which corresponds to or substantially corresponds to the end position (i.e., a position where the facial object stops moving along the upward direction), and a middle image frame which corresponds to or substantially corresponds to the middle position (i.e., a position where the facial object stops moving down (or starts moving back up)).
  • the processing engine 112 (e.g., the identification module 440) (e.g., the processing circuits of the processor 220) may identify a maximum action parameter from a plurality of action parameters corresponding to the plurality of sequential target image frames. As described above, the maximum action parameter corresponds to the middle image frame.
  • the processing engine 112 (e.g., the identification module 440) (e.g., the processing circuits of the processor 220) may identify a minimum action parameter associated with the plurality of action parameters corresponding to the plurality of sequential target image frames. As described above, in ideal conditions, the minimum action parameter corresponds to the start image frame or the end image frame.
  • the processing engine 112 may determine the minimum action parameter based on an action parameter (also referred to as “start action parameter”) corresponding to the start image frame and an action parameter (also referred to as “end action parameter”) corresponding to the end image frame. For example, the processing engine 112 may determine an average action parameter of the start action parameter and the end action parameter as the minimum action parameter.
  • within a time period before the capture time point corresponding to the start image frame, the facial object may keep facing right to or substantially right to the camera device 130, during which the action parameter stays almost unchanged (e.g., from point 1201 to point 1202). Within a time period after the capture time point corresponding to the end image frame, the facial object may be facing right to or substantially right to the camera device 130, within which the action parameter also stays almost unchanged (e.g., from point 1208 to point 1212). Therefore, the processing engine 112 may determine two average action parameters (i.e., a first average action parameter and a fourth average action parameter described in FIG. 10 and FIG. 11 respectively) corresponding to the two time periods respectively, and further determine an average value of the two average action parameters as the minimum action parameter.
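The second way of obtaining the minimum action parameter, averaging over the two steady periods, might look like the sketch below. The function name and the sample values are hypothetical.

```python
from statistics import mean
from typing import Sequence


def minimum_action_parameter(before_start: Sequence[float],
                             after_end: Sequence[float]) -> float:
    """Average the two steady-period averages (first and fourth averages)."""
    first_average = mean(before_start)   # frames before the start image frame
    fourth_average = mean(after_end)     # frames after the end image frame
    return (first_average + fourth_average) / 2


# Hypothetical action parameters from the two steady periods.
print(minimum_action_parameter([1.0, 1.0], [1.5, 1.5]))
```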
  • the processing engine 112 may determine an asymmetry parameter based on the maximum action parameter and the minimum action parameter.
  • the asymmetry parameter may indicate an amplitude of action parameters corresponding to the plurality of sequential target image frames.
  • the processing engine 112 may determine the asymmetry parameter according to formula (10) below:
  • $a_{max}$ refers to the maximum action parameter
  • $a_{min}$ refers to the minimum action parameter
  • the processing engine 112 (e.g., the identification module 440) (e.g., the processing circuits of the processor 220) may determine a first number count of target image frames from the start image frame to a target image frame (i.e., the middle image frame) corresponding to the maximum action parameter.
  • the processing engine 112 (e.g., the identification module 440) (e.g., the processing circuits of the processor 220) may determine a second number count of target image frames from the target image frame (i.e., the middle image frame) corresponding to the maximum action parameter to the end image frame.
  • the processing engine 112 may determine an estimated line by fitting the second feature points (e.g., a tip point on the nose) in the plurality of sequential target image frames.
  • the processing engine 112 may determine the estimated line based on a fitting process.
  • the fitting process may include a least-squares estimation process, a maximum-likelihood estimation process, a Bayesian linear regression process, or the like, or any combination thereof.
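One way to obtain the angle between a fitted straight line and the vertical line is an ordinary least-squares fit of the nose-tip coordinates. The sketch below is an assumption-laden illustration: the function name is hypothetical, and x is regressed on y (rather than y on x) so that a perfectly vertical trajectory yields slope 0 instead of an infinite slope.

```python
import math
from typing import Sequence, Tuple


def angle_to_vertical_deg(points: Sequence[Tuple[float, float]]) -> float:
    """Fit x = m * y + b by least squares and return the angle to the vertical."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    s_yy = sum((y - mean_y) ** 2 for _, y in points)
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    if s_yy == 0:
        # No vertical spread at all: treated as a horizontal line (assumption).
        return 90.0
    slope = s_xy / s_yy  # change in x per unit change in y
    return math.degrees(math.atan(abs(slope)))


# Hypothetical nose-tip positions (x, y) over consecutive target image frames.
print(angle_to_vertical_deg([(50.0, 60.0), (50.5, 70.0), (51.0, 80.0)]))
```

A small angle indicates that the nose tip moved mostly up and down, which is consistent with a nod rather than a head shake.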
  • the processing engine 112 may identify the presence of a nod action based on the maximum action parameter, the minimum action parameter, the asymmetry parameter, the first number count, the second number count, and an angle between the estimated line and a vertical line.
  • the processing engine 112 may identify the presence of the nod action in response to that the asymmetry parameter is larger than an asymmetry threshold, the first number count is larger than a first number count threshold, the second number count is larger than a second number count threshold, and the angle between the estimated line and the vertical line is less than an angle threshold.
  • the asymmetry threshold may be default settings of the action recognition system 100, or may be adjustable under different situations.
  • the asymmetry threshold may be any value within a range from 2 to 3.
  • the first number count threshold and the second number count threshold may be default settings of the action recognition system 100.
  • the first number count threshold or the second number count threshold may be any value (e.g., 4) within a range from 2 to 10.
  • the first number count threshold and the second number count threshold may be adjustable according to a frame rate of the camera device 130 or the interval between neighboring image frames.
  • the frame rate may refer to a number of image frames captured by the camera device 130 per unit time (e.g., per second).
  • a larger frame rate of the camera device 130 may correspond to a larger first number count threshold or a larger second number count threshold.
  • the first number count threshold and the second number count threshold may be the same or different.
  • the estimated line fitted based on the second feature points may be a straight line.
  • the angle between the estimated line and the vertical line may be an angle between two straight lines.
  • the estimated line may be a curve.
  • the angle between the estimated line and the vertical line may be an angle between a tangent line at a point on the curve and the vertical line.
  • the angle threshold may be default settings of the action recognition system 100, or may be adjustable under specific situations.
  • the angle threshold may be any value (e.g., 10°) within a range from 5° to 20°.
  • the processing engine 112 defines the angle threshold; provided that the angle between the estimated line and the vertical line is less than the angle threshold, the identification of the nod action is considered correct.
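The four conditions for identifying a nod can be combined as in the sketch below. The names and default values are hypothetical (the defaults are picked from the ranges stated above), and since formula (10) is not reproduced here, the asymmetry parameter is assumed to be the ratio of the maximum to the minimum action parameter.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodThresholds:
    asymmetry: float = 2.5    # stated range: 2 to 3
    first_count: int = 4      # stated range: 2 to 10
    second_count: int = 4     # stated range: 2 to 10
    angle_deg: float = 10.0   # stated range: 5 to 20 degrees


def is_nod(a_max: float, a_min: float, first_count: int, second_count: int,
           angle_deg: float, t: NodThresholds = NodThresholds()) -> bool:
    """Return True when all four conditions hold."""
    # Assumption: asymmetry parameter = a_max / a_min (formula (10) not shown).
    asymmetry = a_max / a_min
    return (asymmetry > t.asymmetry
            and first_count > t.first_count
            and second_count > t.second_count
            and angle_deg < t.angle_deg)


# Hypothetical measurements for one sequence of target image frames.
print(is_nod(a_max=1.5, a_min=0.5, first_count=6, second_count=5, angle_deg=4.0))
```

Because a larger frame rate may correspond to larger count thresholds, a caller could pass a different `NodThresholds` instance per camera.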
  • FIG. 10 is a flowchart illustrating an exemplary process for determining a start image frame according to some embodiments of the present disclosure.
  • the process 1000 may be implemented as a set of instructions (e.g., an application) stored in the storage ROM 230 or RAM 240.
  • the processor 220 and/or the modules in FIG. 4 may execute the set of instructions, and when executing the instructions, the processor 220 and/or the modules may be configured to perform the process 1000.
  • the operations of the illustrated process presented below are intended to be illustrative. In some embodiments, the process 1000 may be accomplished with one or more additional operations not described and/or without one or more of the operations herein discussed. Additionally, the order of the operations of the process as illustrated in FIG. 10 and described below is not intended to be limiting. In some embodiments, operation 910 may be performed based on process 1000.
  • the processing engine 112 may select a plurality of previous image frames before the candidate image frame and a plurality of subsequent image frames after the candidate image frame along a sequence of the plurality of sequential candidate image frames.
  • the plurality of sequential candidate image frames are aligned in a chronological order based on the capture time points. Accordingly, the “sequence” here refers to the chronological order.
  • “previous image frames” here refer to continuous image frames immediately before the candidate image frame along the sequence
  • “subsequent image frames” refer to continuous image frames immediately after the candidate image frame.
  • the plurality of previous image frames before the ith candidate image frame may be expressed as an ordered set below:
  • $P_1 = [F_{i-x}, \ldots, F_{i-2}, F_{i-1}]$ (11)
  • $P_1$ refers to the ordered set including the plurality of previous image frames and x refers to a number count of the plurality of previous image frames.
  • the plurality of subsequent image frames after the ith candidate image frame may be expressed as an ordered set below:
  • $N_1 = [F_{i+1}, F_{i+2}, \ldots, F_{i+y}] \quad (i > 1,\ y \le m-1)$ (12)
  • $N_1$ refers to the ordered set including the plurality of subsequent image frames
  • y refers to a number count of the plurality of subsequent image frames
  • m refers to a number count of the plurality of candidate image frames.
  • the processing engine 112 may determine a first average action parameter based on a plurality of first action parameters corresponding to the plurality of previous image frames. For example, the processing engine 112 may determine the first average action parameter according to formula (13) below:
  • $a_{i-x}$ refers to a first action parameter corresponding to the $(i-x)$th candidate image frame.
  • the processing engine 112 may determine a second average action parameter based on a plurality of second action parameters corresponding to the plurality of subsequent image frames. For example, the processing engine 112 may determine the second average action parameter according to formula (14) below:
  • $a_{i+y}$ refers to a second action parameter corresponding to the $(i+y)$th candidate image frame.
  • the processing engine 112 (e.g., the identification module 440) (e.g., the processing circuits of the processor 220) may identify the candidate image frame as the start image frame in response to determining that the first average action parameter is less than the second average action parameter and each of the plurality of second action parameters is larger than or equal to the action parameter corresponding to the candidate image frame.
  • the start image frame (e.g., point 1202 illustrated in FIG. 12) corresponds to or substantially corresponds to the start position where the facial object is facing right to or substantially right to the camera device 130.
  • the facial object may keep facing right to or substantially right to the camera device 130, during which the action parameter stays almost unchanged (e.g., from point 1201 to point 1202 illustrated in FIG. 12).
  • the facial object moves from the start position along a downward direction during which the action parameter gradually increases (e.g., from point 1202 to point 1204 illustrated in FIG. 12).
  • the first average action parameter of the plurality of previous image frames is less than the second average action parameter of the plurality of subsequent image frames and each of the plurality of second action parameters corresponding to the subsequent image frames is larger than the action parameter of the start image frame.
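The start-frame test of process 1000 can be sketched as below. The function name, the 0-based indexing, and the sample curve are assumptions for illustration; the two conditions themselves follow the description above.

```python
from statistics import mean
from typing import Sequence


def is_start_frame(params: Sequence[float], i: int, x: int, y: int) -> bool:
    """Check whether candidate frame i (0-based) qualifies as the start frame.

    params holds one action parameter per candidate image frame in capture order;
    x and y are the numbers of previous and subsequent frames to inspect.
    """
    if x < 1 or y < 1 or i - x < 0 or i + y >= len(params):
        return False  # not enough neighbouring frames on one side
    previous = params[i - x:i]             # x frames immediately before frame i
    subsequent = params[i + 1:i + 1 + y]   # y frames immediately after frame i
    first_average = mean(previous)
    second_average = mean(subsequent)
    return (first_average < second_average
            and all(a >= params[i] for a in subsequent))


# Hypothetical action-parameter curve: steady, rising, falling, steady.
curve = [1.0, 1.0, 1.0, 1.2, 1.5, 1.9, 1.5, 1.2, 1.0, 1.0, 1.0]
print([i for i in range(len(curve)) if is_start_frame(curve, i, 2, 2)])
```

On a noisy curve several neighbouring frames may pass the test; how one of them is chosen is not covered by this sketch.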
  • FIG. 11 is a flowchart illustrating an exemplary process for determining an end image frame according to some embodiments of the present disclosure.
  • the process 1100 may be implemented as a set of instructions (e.g., an application) stored in the storage ROM 230 or RAM 240.
  • the processor 220 and/or the modules in FIG. 4 may execute the set of instructions, and when executing the instructions, the processor 220 and/or the modules may be configured to perform the process 1100.
  • the operations of the illustrated process presented below are intended to be illustrative. In some embodiments, the process 1100 may be accomplished with one or more additional operations not described and/or without one or more of the operations herein discussed. Additionally, the order of the operations of the process as illustrated in FIG. 11 and described below is not intended to be limiting. In some embodiments, operation 910 may be performed based on process 1100.
  • the processing engine 112 may select a plurality of previous image frames before the candidate image frame and a plurality of subsequent image frames after the candidate image frame along the sequence of the plurality of sequential candidate image frames.
  • the “previous image frames” refer to continuous image frames immediately before the candidate image frame along the sequence
  • the “subsequent image frames” refer to continuous image frames immediately after the candidate image frame.
  • the plurality of previous image frames before the jth candidate image frame may be expressed as an ordered set below:
  • $P_2 = [F_{j-e}, \ldots, F_{j-2}, F_{j-1}]$ (15)
  • $P_2$ refers to the ordered set including the plurality of previous image frames and e refers to a number count of the plurality of previous image frames.
  • the plurality of subsequent image frames after the jth candidate image frame may be expressed as an ordered set below:
  • $N_2 = [F_{j+1}, F_{j+2}, \ldots, F_{j+f}] \quad ((j+f) \le m)$ (16)
  • $N_2$ refers to the ordered set including the plurality of subsequent image frames and f refers to a number count of the plurality of subsequent image frames.
  • the processing engine 112 may determine a third average action parameter based on a plurality of third action parameters corresponding to the plurality of previous image frames. For example, the processing engine 112 may determine the third average action parameter according to formula (17) below:
  • $a_{j-e}$ refers to a third action parameter corresponding to the $(j-e)$th candidate image frame.
  • the processing engine 112 may determine a fourth average action parameter based on a plurality of fourth action parameters corresponding to the plurality of subsequent image frames. For example, the processing engine 112 may determine the fourth average action parameter according to formula (18) below:
  • $a_{j+f}$ refers to a fourth action parameter corresponding to the $(j+f)$th candidate image frame.
  • the processing engine 112 may identify the candidate image frame as the end image frame in response to that the third average action parameter is larger than the fourth average action parameter, each of the plurality of third action parameters is larger than or equal to the action parameter corresponding to the candidate image frame, an action parameter corresponding to a subsequent image frame adjacent to the candidate image frame is smaller than or equal to the action parameter corresponding to the candidate image frame, and a ratio associated with the first average action parameter and the fourth average action parameter is less than a ratio threshold.
  • the ratio associated with the first average action parameter and the fourth average action parameter may be expressed as formula (19) below:
  • T refers to the ratio associated with the first average action parameter and the fourth average action parameter.
  • the ratio threshold may be default settings of the action recognition system 100, or may be adjustable under different situations.
  • the ratio threshold may be any value within a range from 1.05 to 1.2.
  • the end image frame (e.g., point 1208 illustrated in FIG. 12) corresponds to or substantially corresponds to the end position where the facial object moves back and is facing right to or substantially right to the camera device 130.
  • the facial object is moving along an upward direction during which the action parameter gradually decreases (e.g., from point 1206 to point 1208 illustrated in FIG. 12); within a time period after the capture time point corresponding to the end image frame, the facial object may keep facing right to or substantially right to the camera device 130, within which the action parameter stays almost unchanged (e.g., from point 1208 to point 1212 illustrated in FIG. 12) .
  • the third average action parameter of the plurality of previous image frames is larger than the fourth average action parameter of the plurality of subsequent image frames
  • each of the plurality of third action parameters is larger than or equal to the action parameter corresponding to the end image frame
  • an action parameter corresponding to a subsequent image frame (e.g., point 1209 illustrated in FIG. 12) adjacent to the end image frame is smaller than or equal to the action parameter corresponding to the end image frame.
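The end-frame test of process 1100 can be sketched in the same style. The function name, 0-based indexing, and sample curve are hypothetical. Formula (19) is not reproduced here, so the ratio T is assumed to be the larger of the two averages divided by the smaller, which keeps T at or above 1 and comparable with the stated threshold range of 1.05 to 1.2.

```python
from statistics import mean
from typing import Sequence


def is_end_frame(params: Sequence[float], j: int, e: int, f: int,
                 first_average: float, ratio_threshold: float = 1.1) -> bool:
    """Check whether candidate frame j (0-based) qualifies as the end frame.

    first_average is the average computed before the start frame (process 1000).
    Action parameters are assumed to be positive.
    """
    if e < 1 or f < 1 or j - e < 0 or j + f >= len(params):
        return False  # not enough neighbouring frames on one side
    previous = params[j - e:j]             # e frames immediately before frame j
    subsequent = params[j + 1:j + 1 + f]   # f frames immediately after frame j
    third_average = mean(previous)
    fourth_average = mean(subsequent)
    # Assumption: T = larger average / smaller average (formula (19) not shown).
    ratio = (max(first_average, fourth_average)
             / min(first_average, fourth_average))
    return (third_average > fourth_average
            and all(a >= params[j] for a in previous)
            and subsequent[0] <= params[j]
            and ratio < ratio_threshold)


# Hypothetical action-parameter curve: steady, rising, falling, steady.
curve = [1.0, 1.0, 1.0, 1.2, 1.5, 1.9, 1.5, 1.2, 1.0, 1.0, 1.0]
print(is_end_frame(curve, 8, 2, 2, first_average=1.0))
```

The ratio condition ties the end of the motion back to the start: the face should return to roughly the same pose it held before the nod began.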
  • FIG. 12 is a schematic diagram illustrating an exemplary curve indicating a variation process of the action parameter according to some embodiments of the present disclosure. As illustrated in FIG. 12, the horizontal axis refers to “image frame” and the vertical axis refers to “action parameter.”
  • the processing engine 112 may identify a plurality of sequential target image frames associated with the facial object and identify the presence of a nod action based on the plurality of sequential target image frames.
  • the plurality of sequential target image frames include a start image frame $F_i$, an end image frame $F_j$, and a middle image frame $F_{mid}$ having the maximum action parameter. As illustrated in FIG. 12, point 1202 corresponds to the start image frame, point 1208 corresponds to the end image frame, and point 1205 corresponds to the middle image frame.
  • the facial object moves from a start position to a middle position along a downward direction and moves from the middle position to an end position along an upward direction.
  • the start image frame may correspond to or substantially correspond to the start position corresponding to a time point when the facial object is facing right to or substantially right to the camera device 130.
  • “substantially right to” means that an angle between the direction in which the facial object is facing and a direction pointing perpendicularly at the camera device 130 is less than a threshold that is recognizable for an ordinary person in the art.
  • the action parameter of the start image frame is a fixed value which may be default settings of the action recognition system 100, or may be adjustable under different situations.
  • the action parameter associated with a ratio of the two distances gradually increases, for example, as illustrated in a section of the curve from point 1202 to point 1205.
  • when the facial object moves to the middle position (e.g., point 1205), where the facial object stops moving down (or starts moving back up), the action parameter reaches the maximum value.
  • the facial object moves from the middle position along the upward direction, the distance between the upper part of facial object and the middle part of the facial object gradually decreases, and the distance between the middle part of the facial object and the lower part of the facial object gradually increases in the image frames. Accordingly, the action parameter associated with a ratio of the two distances gradually decreases, for example, as illustrated in a section of the curve from point 1205 to point 1208.
  • the facial object moves to the end position which is the same as or substantially same as the start position.
  • “substantially same as” means that an angle between the direction in which the facial object is facing at the end position and the direction in which the facial object is facing at the start position is less than an angle threshold that is recognizable for an ordinary person in the art.
  • aspects of the present disclosure may be illustrated and described herein in any of a number of patentable classes or context including any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof. Accordingly, aspects of the present disclosure may be implemented entirely in hardware, entirely in software (including firmware, resident software, micro-code, etc.), or in an implementation combining software and hardware, all of which may generally be referred to herein as a “unit,” “module,” or “system.” Furthermore, aspects of the present disclosure may take the form of a computer program product embodied in one or more computer readable media having computer readable program code embodied thereon.
  • a computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated signal may take any of a variety of forms, including electro-magnetic, optical, or the like, or any suitable combination thereof.
  • a computer readable signal medium may be any computer readable medium that is not a computer readable storage medium and that may communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
  • Program code embodied on a computer readable signal medium may be transmitted using any appropriate medium, including wireless, wireline, optical fiber cable, RF, or the like, or any suitable combination of the foregoing.
  • Computer program code for carrying out operations for aspects of the present disclosure may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, Scala, Smalltalk, Eiffel, JADE, Emerald, C++, C#, VB.NET, Python or the like, conventional procedural programming languages, such as the "C" programming language, Visual Basic, Fortran 2003, Perl, COBOL 2002, PHP, ABAP, dynamic programming languages such as Python, Ruby and Groovy, or other programming languages.
  • the program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server.
  • the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider) or in a cloud computing environment or offered as a service such as a Software as a Service (SaaS).

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • General Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)

Abstract

The present disclosure relates to systems and methods for automated identification of presence of a facial action from sequential images. The systems and methods may obtain a plurality of sequential candidate image frames containing a facial object. Each of the plurality of candidate image frames may include a plurality of feature points associated with the facial object. For each of the plurality of sequential candidate image frames, the systems and methods may determine one or more first distances and one or more second distances based on the plurality of feature points. The systems and methods may determine an action parameter based on the one or more first distances and the one or more second distances in each of the plurality of sequential candidate image frames. The systems and methods may identify the presence of a nod action in response to that the action parameters satisfy one or more preset conditions.

Description

SYSTEMS AND METHODS FOR NOD ACTION RECOGNITION BASED ON FACIAL FEATURE POINTS
TECHNICAL FIELD
The present disclosure generally relates to systems and methods for action recognition, and in particular, to systems and methods for automated identification of the presence of a nod action from sequential image frames.
BACKGROUND
Living body detection based on human action recognition (e.g., nod action recognition) has become increasingly important in many scenarios (e.g., system login, identity authentication, Human-Computer Interaction). Take “system login” as an example: when a user intends to sign in to the system via face recognition, in order to verify that the “user” is a person with a living body rather than a deceptive object (e.g., a picture), the system may need to identify an action (e.g., a nod action) of the user for the purpose of such verification. The existing technology achieves this goal by using a complex algorithm which requires excessive computing capacity, resulting in a heavy burden on the computing system. Therefore, it is desirable to provide systems and methods for automated identification of the presence of an action of a user quickly and efficiently, preferably putting less demand on computing capacity.
SUMMARY
An aspect of the present disclosure relates to a system for automated identification of presence of a facial action from sequential images. The system may include at least one storage medium including a set of instructions and at least one processor in communication with the at least one storage medium. When executing the set of instructions, the at least one processor may be directed to cause the system to perform one or more of the following operations. The at least one processor may obtain a plurality of sequential candidate image frames containing a facial object. Each of the plurality of candidate image frames may include one or more first feature points associated with an upper part of the facial object, a second feature point associated with a middle part of the facial object, and one or more third feature points associated with a lower part of the facial object. For each of the plurality of sequential candidate image frames, the at least one processor may determine one or more first distances, each based on one of the one or more first feature points and the second feature point, and one or more second distances, each based on one of the one or more third feature points and the second feature point. The at least one processor may determine an action parameter based on the one or more first distances and the one or more second distances in each of the plurality of sequential candidate image frames. The at least one processor may identify the presence of a nod action in response to that the action parameters satisfy one or more preset conditions.
Another aspect of the present disclosure relates to a method implemented on a computing device having at least one processor, at least one storage medium, and a communication platform connected to a network. The method may include one or more of the following operations. The at least one processor may obtain a plurality of sequential candidate image frames containing a facial object. Each of the plurality of candidate image frames may include one or more first feature points associated with an upper part of the facial object, a second feature point associated with a middle part of the facial object, and one or more third feature points associated with a lower part of the facial object. For each of the plurality of sequential candidate image frames, the at least one processor may determine one or more first distances, each based on one of the one or more first feature points and the second feature point, and one or more second distances, each based on one of the one or more third feature points and the second feature point. The at least one processor may determine an action parameter based on the one or more first distances and the one or more second distances in each of the plurality of sequential candidate image frames. The at least one processor may identify the presence of a nod action in response to that the action parameters satisfy one or more preset conditions.
A further aspect of the present disclosure relates to a non-transitory computer readable medium. The non-transitory computer readable medium may include executable instructions. When the executable instructions are executed by at least one processor, the executable instructions may direct the at least one processor to perform a method. The method may include one or more of the following operations. The at least one processor may obtain a plurality of sequential candidate image frames containing a facial object. Each of the plurality of candidate image frames may include one or more first feature points associated with an upper part of the facial object, a second feature point associated with a middle part of the facial object, and one or more third feature points associated with a lower part of the facial object. For each of the plurality of sequential candidate image frames, the at least one processor may determine one or more first distances, each based on one of the one or more first feature points and the second feature point, and one or more second distances, each based on one of the one or more third feature points and the second feature point. The at least one processor may determine an action parameter based on the one or more first distances and the one or more second distances in each of the plurality of sequential candidate image frames. The at least one processor may identify the presence of a nod action in response to that the action parameters satisfy one or more preset conditions.
In some embodiments, the one or more first feature points may be points associated with at least one of a left brow or a right brow of the facial object, the second feature point may be a point on a tip of a nose of the facial object, and the one or more third feature points may be points on a chin of the facial object.
In some embodiments, the one or more first distances may include one or more first left distances and one or more first right distances. Each first left distance may be determined based on a corresponding first feature point associated with the left brow and the second feature point. Each first right distance may be determined based on a corresponding first feature point associated with the right brow and the second feature point.
In some embodiments, the at least one processor may determine one or more first ratios of the one or more first left distances to the one or more second distances, each of the one or more first ratios corresponding to a first left distance and a second distance. The at least one processor may determine a first average ratio of the one or more first ratios. The at least one processor may determine one or more second ratios of the one or more first right distances to the one or more second distances, each of the one or more second ratios corresponding to a first right distance and a second distance. The at least one processor may determine a second average ratio of the one or more second ratios. The at least one processor may determine the action parameter based on the first average ratio and the second average ratio.
In some embodiments, the at least one processor may determine one or more distance ratios of the one or more first distances to the one or more second distances, each of the one or more distance ratios corresponding to a first distance and a second distance. The at least one processor may determine a compound distance ratio of the one or more distance ratios as the action parameter.
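Merely by way of illustration, the ratio-based action parameter described above may be sketched as follows. The sketch assumes Euclidean distances in image coordinates, pairs brow points with chin points by index, and combines the first average ratio and the second average ratio by taking their mean; the disclosure only states that the action parameter is "based on" these ratios, so the combination rule, the pairing, and the function names are hypothetical.

```python
import math


def euclidean(p, q):
    """Euclidean distance between two (x, y) points."""
    return math.hypot(p[0] - q[0], p[1] - q[1])


def action_parameter(left_brow, right_brow, nose_tip, chin):
    """Return an action parameter for one image frame.

    left_brow, right_brow, chin: lists of (x, y) feature points.
    nose_tip: a single (x, y) feature point.
    Assumption: brow point i is paired with chin point i, and the two
    average ratios are combined by their arithmetic mean.
    """
    first_left = [euclidean(p, nose_tip) for p in left_brow]
    first_right = [euclidean(p, nose_tip) for p in right_brow]
    second = [euclidean(p, nose_tip) for p in chin]

    # Ratios of brow-to-nose distances over chin-to-nose distances.
    first_ratios = [l / s for l, s in zip(first_left, second)]
    second_ratios = [r / s for r, s in zip(first_right, second)]

    first_average = sum(first_ratios) / len(first_ratios)
    second_average = sum(second_ratios) / len(second_ratios)
    return (first_average + second_average) / 2.0


if __name__ == "__main__":
    # Brow points are 5 units from the nose tip, the chin point 10 units.
    print(action_parameter([(-3, 4)], [(3, 4)], (0, 0), [(0, -10)]))
```

A chin point that coincides with the nose tip would cause a division by zero; a practical implementation would likely discard such a frame.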
In some embodiments, the at least one processor may obtain an initial image frame including a first initial feature point on a center of a left eye of the facial object, a second initial feature point on a center of a right eye of the facial object, a third initial feature point on a tip of a nose of the facial object, a fourth initial feature point on a left end of a lip of the facial object, and a fifth initial feature point on a right end of the lip of the facial object. The at least one processor may determine whether the third initial feature point is within a quadrangle determined based on the first initial feature point, the second initial feature point, the fourth initial feature point, and the fifth initial feature point. The at least one processor may determine the initial image frame as a candidate image frame in response to that the third initial feature point is within the quadrangle determined based on the first initial feature point, the second initial feature point, the fourth initial feature point, and the fifth initial feature point.
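Merely by way of illustration, the quadrangle test described above may be sketched with a cross-product sign check. The sketch assumes the quadrangle is convex and that its vertices are visited in the order left eye, right eye, right lip end, left lip end; the function names are hypothetical and the disclosure does not prescribe a particular point-in-polygon technique.

```python
def cross(a, b, p):
    """Z-component of the cross product (b - a) x (p - a)."""
    return (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0])


def nose_inside_quadrangle(left_eye, right_eye, nose_tip, lip_left, lip_right):
    """Return True if nose_tip lies strictly inside the quadrangle.

    Assumption: the quadrangle is convex, so the point is inside exactly
    when it lies on the same side of all four edges.
    """
    quad = [left_eye, right_eye, lip_right, lip_left]
    signs = [
        cross(quad[i], quad[(i + 1) % 4], nose_tip)
        for i in range(4)
    ]
    return all(s > 0 for s in signs) or all(s < 0 for s in signs)


if __name__ == "__main__":
    eyes_and_lips = dict(
        left_eye=(0, 0), right_eye=(4, 0), lip_left=(1, 4), lip_right=(3, 4)
    )
    # A frontal face: the nose tip lies between the eyes and the lip ends.
    print(nose_inside_quadrangle(nose_tip=(2, 2), **eyes_and_lips))
    # A strongly turned face: the nose tip falls outside the quadrangle.
    print(nose_inside_quadrangle(nose_tip=(5, 2), **eyes_and_lips))
```

Checking for a consistent sign rather than a fixed sign lets the same code handle either vertex orientation, so it does not matter whether the left eye appears on the left or the right of the image.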
In some embodiments, the at least one processor may identify a plurality of sequential target image frames from the plurality of sequential candidate image frames. The plurality of sequential target image frames may include a start image frame and an end image frame. The at least one processor may identify a maximum action parameter from a plurality of action parameters corresponding to the plurality of sequential target image frames. The at least one processor may identify a minimum action parameter associated with the plurality of action parameters corresponding to the plurality of sequential target image frames. The at least one processor may determine an asymmetry parameter based on the maximum action parameter and the minimum action parameter. The at least one processor may determine a first number count of target image frames from the start image frame to a target image frame corresponding to the maximum action parameter. The at least one processor may determine a second number count of target image frames from the target image frame corresponding to the maximum action parameter to the end image frame. The at least one processor may determine an estimated line by fitting the second feature points in the plurality of sequential target image frames. The at least one processor may identify the presence of the nod action in response to that the asymmetry parameter is larger than an asymmetry threshold, the first number count is larger than a first number count threshold, the second number count is larger than a second number count threshold, and an angle between the estimated line and a vertical line is less than an angle threshold.
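Merely by way of illustration, the four conditions described above may be sketched as follows. The sketch assumes that the asymmetry parameter is the ratio of the maximum action parameter to the minimum action parameter, that both number counts include the frame with the maximum, and that the estimated line is an ordinary least-squares fit of x on y; the disclosure does not fix these choices here, and the function names are hypothetical. The default thresholds are taken from the ranges stated in the summary.

```python
import math


def angle_to_vertical_deg(points):
    """Angle (degrees) between a least-squares line and the vertical.

    Fits x = a * y + b so that near-vertical trajectories are well
    conditioned; the angle to the vertical is then atan(|a|).
    """
    n = len(points)
    mean_x = sum(p[0] for p in points) / n
    mean_y = sum(p[1] for p in points) / n
    s_xy = sum((p[0] - mean_x) * (p[1] - mean_y) for p in points)
    s_yy = sum((p[1] - mean_y) ** 2 for p in points)
    if s_yy == 0:
        return 90.0  # no vertical movement at all
    return math.degrees(math.atan(abs(s_xy / s_yy)))


def is_nod(params, nose_tips, asymmetry_threshold=2.0,
           first_count_threshold=4, second_count_threshold=4,
           angle_threshold_deg=10.0):
    """Decide whether the target image frames contain a nod action.

    params: action parameters from the start frame to the end frame.
    nose_tips: (x, y) second feature points for the same frames.
    """
    i_max = max(range(len(params)), key=params.__getitem__)
    p_min = min(params)
    if p_min <= 0:
        return False  # ratio-based asymmetry is undefined
    asymmetry = params[i_max] / p_min      # assumed definition
    first_count = i_max + 1                # start frame .. max frame
    second_count = len(params) - i_max     # max frame .. end frame
    angle = angle_to_vertical_deg(nose_tips)
    return (asymmetry > asymmetry_threshold
            and first_count > first_count_threshold
            and second_count > second_count_threshold
            and angle < angle_threshold_deg)


if __name__ == "__main__":
    params = [1.0, 1.2, 1.5, 2.0, 2.6, 2.2, 1.8, 1.4, 1.1, 1.0]
    vertical_path = [(100.0, 50.0 + i) for i in range(len(params))]
    print(is_nod(params, vertical_path))
```

The angle condition is what separates a nod from a head shake in this sketch: a nose tip that moves mostly sideways yields an angle near 90° and is rejected.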
In some embodiments, for a candidate image frame, the at least one processor may select a plurality of previous image frames before the candidate image frame and a plurality of subsequent image frames after the candidate image frame along a sequence of the plurality of sequential candidate image frames. The at least one processor may determine a first average action parameter based on a plurality of first action parameters corresponding to the plurality of previous image frames. The at least one processor may determine a second average action parameter based on a plurality of second action parameters corresponding to the plurality of subsequent image frames. The at least one processor may identify the candidate image frame as the start image frame in response to that the first average action parameter is less than the second average action parameter and each of the plurality of second action parameters is larger than or equal to an action parameter corresponding to the candidate image frame.
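Merely by way of illustration, the start image frame test described above may be sketched as follows. The window size of three frames on each side is an assumption, as is the function name; the disclosure refers only to "a plurality of" previous and subsequent image frames.

```python
def is_start_frame(params, i, window=3):
    """Return True if frame i qualifies as the start image frame.

    params: action parameters of the sequential candidate image frames.
    window: assumed number of previous and subsequent frames to compare.
    """
    if i - window < 0 or i + window >= len(params):
        return False  # not enough neighbouring frames
    previous = params[i - window:i]
    subsequent = params[i + 1:i + 1 + window]
    first_average = sum(previous) / window
    second_average = sum(subsequent) / window
    # The parameter must be rising on average, and no subsequent frame
    # may fall below the candidate frame.
    return (first_average < second_average
            and all(p >= params[i] for p in subsequent))


def find_start_frame(params, window=3):
    """Index of the first frame that satisfies the test, or None."""
    for i in range(len(params)):
        if is_start_frame(params, i, window):
            return i
    return None


if __name__ == "__main__":
    print(find_start_frame([1.0, 1.0, 1.0, 1.0, 1.2, 1.5, 2.0]))
```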
In some embodiments, for a candidate image frame after the start image frame, the at least one processor may select a plurality of previous image frames before the candidate image frame and a plurality of subsequent image frames after the candidate image frame along the sequence of the plurality of sequential candidate image frames. The at least one processor may determine a third average action parameter based on a plurality of third action parameters corresponding to the plurality of previous image frames. The at least one processor may determine a fourth average action parameter based on a plurality of fourth action parameters corresponding to the plurality of subsequent image frames. The at least one processor may identify the candidate image frame as the end image frame in response to that the third average action parameter is larger than the fourth average action parameter, each of the plurality of third action parameters is larger than or equal to an action parameter corresponding to the candidate image frame, an action parameter corresponding to a subsequent image frame adjacent to the candidate image frame is smaller than or equal to the action parameter corresponding to the candidate image frame, and a ratio associated with the first average action parameter and the fourth average action parameter is less than a ratio threshold.
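Merely by way of illustration, the end image frame test described above may be sketched as follows. The sketch assumes a window of three frames on each side, takes the "ratio associated with the first average action parameter and the fourth average action parameter" to be the fourth average divided by the first average, and uses a hypothetical ratio threshold of 1.1; none of these values is stated in this passage.

```python
def is_end_frame(params, i, first_average, window=3, ratio_threshold=1.1):
    """Return True if frame i qualifies as the end image frame.

    params: action parameters of the sequential candidate image frames.
    first_average: first average action parameter obtained when the
        start image frame was identified.
    window, ratio_threshold: assumed values, not taken from the source.
    """
    if i - window < 0 or i + window >= len(params) or first_average <= 0:
        return False
    previous = params[i - window:i]
    subsequent = params[i + 1:i + 1 + window]
    third_average = sum(previous) / window
    fourth_average = sum(subsequent) / window
    return (third_average > fourth_average
            and all(p >= params[i] for p in previous)
            and params[i + 1] <= params[i]
            # Assumed ratio: the parameter has returned close to its
            # level before the nod began.
            and fourth_average / first_average < ratio_threshold)


if __name__ == "__main__":
    falling = [2.0, 1.6, 1.3, 1.05, 1.0, 1.0, 1.0]
    print(is_end_frame(falling, 3, first_average=1.0))
```

Under this reading, the ratio condition rejects a candidate whose following frames have not yet settled back to the pre-nod level.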
In some embodiments, the asymmetry threshold may be 2-3.
In some embodiments, the first number count threshold may be 4-6, the second number count threshold may be 4-6, or the angle threshold may be 10°-15°.
In some embodiments, the at least one processor may provide an authentication to a terminal device associated with a user corresponding to the facial object in response to the identification of the presence of the nod action.
In some embodiments, the system may further include a camera, which may be configured to provide video data from which the plurality of sequential candidate image frames may be obtained.
In some embodiments, the at least one processor may obtain the plurality of sequential candidate image frames from video data provided by a camera.
Additional features will be set forth in part in the description which follows, and in part will become apparent to those skilled in the art upon examination of the following and the accompanying drawings or may be learned by production or operation of the examples. The features of the present disclosure may be realized and attained by practice or use of various aspects of the methodologies, instrumentalities, and combinations set forth in the detailed examples discussed below.
BRIEF DESCRIPTION OF THE DRAWINGS
The present disclosure is further described in terms of exemplary embodiments. These exemplary embodiments are described in detail with reference to the drawings. These embodiments are non-limiting exemplary embodiments, in which like reference numerals represent similar structures throughout the several views of the drawings, and wherein:
FIG. 1 is a schematic diagram illustrating an exemplary action recognition system according to some embodiments of the present disclosure;
FIG. 2 is a schematic diagram illustrating exemplary hardware and/or software components of a computing device according to some embodiments of the present disclosure;
FIG. 3 is a schematic diagram illustrating exemplary hardware and/or software components of a mobile device according to some embodiments of the present disclosure;
FIG. 4 is a block diagram illustrating an exemplary processing engine according to some embodiments of the present disclosure;
FIG. 5 is a flowchart illustrating an exemplary process for identifying presence of a nod action according to some embodiments of the present disclosure;
FIGs. 6-A and 6-B are schematic diagrams illustrating exemplary feature points according to some embodiments of the present disclosure;
FIG. 7-A is a flowchart illustrating an exemplary process for determining an action parameter according to some embodiments of the present disclosure;
FIG. 7-B is a flowchart illustrating an exemplary process for determining an action parameter according to some embodiments of the present disclosure;
FIG. 8-A is a flowchart illustrating an exemplary process for determining a candidate image frame according to some embodiments of the present disclosure;
FIG. 8-B is a schematic diagram illustrating exemplary initial feature points according to some embodiments of the present disclosure;
FIG. 9 is a flowchart illustrating an exemplary process for identifying presence of a nod action according to some embodiments of the present disclosure;
FIG. 10 is a flowchart illustrating an exemplary process for determining a start image frame according to some embodiments of the present disclosure;
FIG. 11 is a flowchart illustrating an exemplary process for determining an end image frame according to some embodiments of the present disclosure; and
FIG. 12 is a schematic diagram illustrating an exemplary curve indicating a variation process of an action parameter during a nod action according to some embodiments of the present disclosure.
DETAILED DESCRIPTION
The following description is presented to enable any person skilled in the art to make and use the present disclosure and is provided in the context of a particular application and its requirements. Various modifications to the disclosed embodiments will be readily apparent to those skilled in the art, and the general principles defined herein may be applied to other embodiments and applications without departing from the spirit and scope of the present disclosure. Thus, the present disclosure is not limited to the embodiments shown but is to be accorded the widest scope consistent with the claims.
The terminology used herein is for the purpose of describing particular example embodiments only and is not intended to be limiting. As used herein, the singular forms “a,” “an,” and “the” may be intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms “comprise,” “comprises,” and/or “comprising,” “include,” “includes,” and/or “including,” when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
These and other features, and characteristics of the present disclosure, as well as the methods of operation and functions of the related elements of structure and the combination of parts and economies of manufacture, may become more apparent upon consideration of the following description with reference to the accompanying drawings, all of which form a part of this disclosure. It is to be expressly understood, however, that the drawings are for the purpose of illustration and description only and are not intended to limit the scope of the present disclosure. It is understood that the drawings are not to scale.
The flowcharts used in the present disclosure illustrate operations that systems implement according to some embodiments of the present disclosure. It is to be expressly understood that the operations of the flowcharts need not be implemented in the order shown. Instead, the operations may be implemented in inverted order or simultaneously. Moreover, one or more other operations may be added to the flowcharts. One or more operations may be removed from the flowcharts.
Moreover, while the systems and methods in the present disclosure are described primarily regarding nod action identification, it should also be understood that this is only one exemplary embodiment. The systems and methods of the present disclosure may be applied to any other kind of action recognition. For example, the systems and methods of the present disclosure may be applied to other action recognitions including an eye movement, a shaking action, a blink action, a head up action, a mouth opening action, or the like, or any combination thereof. The action recognition system may be applied in many application scenarios such as system login, identity authentication, Human-Computer Interaction (HCI), etc. The application of the systems and methods of the present disclosure may include but not be limited to a web page, a plug-in of a browser, a client terminal, a custom system, an internal analysis system, an artificial intelligence robot, or the like, or any combination thereof.
The terms “subject,” “human,” or “user” in the present disclosure are used interchangeably to refer to a living body whose action is to be identified. Also, the terms “image frame,” “image,” “candidate image frames,” and “target image frames” in the present disclosure are used to refer to frames in video data or images captured by a camera device. The terms “camera,” “camera device,” and “capture device” in the present disclosure may be used interchangeably to refer to a device that can capture video data or image data.
An aspect of the present disclosure relates to systems and methods for identifying the presence of a nod action. During the nod action, a distance between an upper part of a facial object (e.g., a face of a human) and a middle part of the facial object dynamically changes; a distance between the middle part of the facial object and a lower part of the facial object also dynamically changes. Accordingly, an action parameter (e.g., a ratio of the two distances) associated with the two distances changes during the nod action. The systems and methods may identify the presence of the nod action based on the change of the action parameter.
For example, the systems and methods may obtain a plurality of sequential candidate image frames associated with the facial object. Each of the plurality of sequential candidate image frames may include one or more first feature points associated with the upper part, a second feature point associated with the middle part, and one or more third feature points associated with the lower part. For each of the plurality of sequential candidate image frames, the systems and methods may determine one or more first distances based on the one or more first feature points and the second feature point, and one or more second distances based on the one or more third feature points and the second feature point. Further, the systems and methods may determine the action parameter based on the one or more first distances and the one or more second distances. Accordingly, the systems and methods may identify the presence of the nod action based on the action parameters corresponding to the plurality of sequential candidate image frames.
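Merely by way of illustration, the per-frame processing described above may be sketched as a mapping from a sequence of frames to a sequence of action parameters. The `Frame` structure, the function names, and the choice of the mean first distance divided by the mean second distance as the action parameter are assumptions made for the sketch, not definitions from the disclosure.

```python
from dataclasses import dataclass
from math import dist
from typing import List, Tuple

Point = Tuple[float, float]


@dataclass
class Frame:
    """Feature points of one candidate image frame (hypothetical layout)."""
    upper: List[Point]   # first feature points (upper part of the face)
    middle: Point        # second feature point (middle part of the face)
    lower: List[Point]   # third feature points (lower part of the face)


def frame_parameter(frame: Frame) -> float:
    """Assumed action parameter: mean first distance / mean second distance."""
    first = [dist(p, frame.middle) for p in frame.upper]
    second = [dist(p, frame.middle) for p in frame.lower]
    return (sum(first) / len(first)) / (sum(second) / len(second))


def parameter_series(frames: List[Frame]) -> List[float]:
    """Action parameters in frame order, ready for later condition checks."""
    return [frame_parameter(f) for f in frames]


if __name__ == "__main__":
    # The middle point moves from y=4 to y=6 between two frames, so the
    # upper distance grows while the lower distance shrinks.
    frames = [
        Frame(upper=[(0.0, 0.0)], middle=(0.0, 4.0), lower=[(0.0, 10.0)]),
        Frame(upper=[(0.0, 0.0)], middle=(0.0, 6.0), lower=[(0.0, 10.0)]),
    ]
    print(parameter_series(frames))
```

Because the parameter is a ratio of distances measured in the same frame, it is unaffected by uniform scaling of the face in the image, which is one plausible reason for preferring a ratio over a raw distance.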
FIG. 1 is a schematic diagram illustrating an exemplary action recognition system according to some embodiments of the present disclosure. For example, the action recognition system 100 may be an online action recognition platform for living body recognition based on information of a facial object (e.g., a face 160 of a human).
In some embodiments, the action recognition system 100 may be used in a variety of application scenarios such as Human-Computer Interaction (HCI), system login, identity authentication, or the like, or any combination thereof. In the application scenario of HCI, the action recognition system 100 may execute instructions to perform operations defined by a user in response to an identification of an action. For example, after extracting facial information of the user and identifying an action (e.g., a nod action) of the user, the action recognition system 100 may execute instructions to perform defined operations such as turning a page of an e-book, adding animation effects during a video chat, controlling a robot to perform an operation (e.g., mopping the floor), requesting a service (e.g., a taxi hailing service), etc. In the application scenario of system login (e.g., a bank system, a payment system, an online examination system, a security and protection system, etc.), after extracting facial information of the user and identifying an action (e.g., a nod action) of the user, the action recognition system 100 may determine a login permission and allow a user account associated with the user to log in to the system. In the application scenario of identity authentication, after extracting facial information of the user and identifying an action (e.g., a nod action) of the user, the action recognition system 100 may determine the user’s identity and provide a permission to access an account (e.g., a terminal device, a payment account, or a membership account) or a permission to enter a restricted place (e.g., a company, a library, a hospital, or an apartment).
In some embodiments, the action recognition system 100 may be an online platform including a server 110, a network 120, a camera device 130, a user terminal 140, and a storage 150.
The server 110 may be a single server or a server group. The server group may be centralized, or distributed (e.g., server 110 may be a distributed system). In some embodiments, the server 110 may be local or remote. For example, the server 110 may access information and/or data stored in the camera device 130, the user terminal 140, and/or the storage 150 via the network 120. As another example, the server 110 may be directly connected to the camera device 130, the user terminal 140, and/or the storage 150 to access stored information and/or data. In some embodiments, the server 110 may be implemented on a cloud platform. Merely by way of example, the cloud platform may include a private cloud, a public cloud, a hybrid cloud, a community cloud, a distributed cloud, an inter-cloud, a multi-cloud, or the like, or any combination thereof. In some embodiments, the server 110 may be implemented on a computing device 200 having one or more components illustrated in FIG. 2 in the present disclosure.
In some embodiments, the server 110 may include a processing engine 112. The processing engine 112 may process information and/or data relating to action recognition to perform one or more functions described in the present disclosure. For example, the processing engine 112 may identify the presence of a nod action based on a plurality of sequential candidate image frames containing a facial object. In some embodiments, the processing engine 112 may include one or more processing engines (e.g., single-core processing engine(s) or multi-core processor(s)). Merely by way of example, the processing engine 112 may include one or more hardware processors, such as a central processing unit (CPU), an application-specific integrated circuit (ASIC), an application-specific instruction-set processor (ASIP), a graphics processing unit (GPU), a physics processing unit (PPU), a digital signal processor (DSP), a field-programmable gate array (FPGA), a programmable logic device (PLD), a controller, a microcontroller unit, a reduced instruction-set computer (RISC), a microprocessor, or the like, or any combination thereof.
In some embodiments, the server 110 may be unnecessary and all or part of the functions of the server 110 may be implemented by other components (e.g., the camera device 130, the user terminal 140) of the action recognition system 100. For example, the processing engine 112 may be integrated in the camera device 130 or the user terminal 140 and the functions (e.g., identifying presence of an action of a facial object based on image frames associated with the facial object) of the processing engine 112 may be implemented by the camera device 130 or the user terminal 140.
The network 120 may facilitate the exchange of information and/or data. In some embodiments, one or more components of the action recognition system 100 (e.g., the server 110, the camera device 130, the user terminal 140, the storage 150) may exchange information and/or data with other component(s) of the action recognition system 100 via the network 120. For example, the server 110 may obtain information and/or data (e.g., image frames) from the camera device 130 via the network 120. In some embodiments, the network 120 may be any type of wired or wireless network, or a combination thereof. Merely by way of example, the network 120 may include a cable network, a wireline network, an optical fiber network, a telecommunications network, an intranet, the Internet, a local area network (LAN), a wide area network (WAN), a wireless local area network (WLAN), a metropolitan area network (MAN), a public switched telephone network (PSTN), a Bluetooth network, a ZigBee network, a near field communication (NFC) network, or the like, or any combination thereof. In some embodiments, the network 120 may include one or more network access points. For example, the network 120 may include wired or wireless network access points such as base stations and/or internet exchange points 120-1, 120-2, …, through which one or more components of the action recognition system 100 may be connected to the network 120 to exchange data and/or information.
The camera device 130 may capture image data or video data containing a facial object. For example, the camera device 130 may capture a video including a plurality of image frames containing the facial object. In some embodiments, the camera device 130 may include a black-and-white camera, a color camera, an infrared camera, a 3-D camera, an X-ray camera, etc. In some embodiments, the camera device 130 may include a monocular camera, a binocular camera, a multi-camera, etc. In some embodiments, the camera device 130 may be a smart device including or connected to a camera. The smart device may include a smart home device (e.g., a smart lighting device, a smart television), an intelligent robot (e.g., a sweeping robot, a mopping robot, a chatting robot, an industry robot), etc. In some embodiments, the camera device 130 may be a surveillance camera. The surveillance camera may include a wireless color camera, a low light camera, a vandal proof camera, a bullet camera, a pinhole camera, a hidden spy camera, a fixed box camera, or the like, or any combination thereof. In some embodiments, the camera device 130 may be an IP camera which can transmit the captured image data or video data to any component (e.g., the server 110, the user terminal 140, the storage 150) of the action recognition system 100 via the network 120.
In some embodiments, the camera device 130 may independently identify the presence of an action of the facial object based on the captured image frames. In some embodiments, the camera device 130 may transmit the captured image frames to the server 110 or the user terminal 140 to be further processed. In some embodiments, the camera device 130 may transmit the captured image frames to the storage 150 to be stored. In some embodiments, the camera device 130 may be integrated in the user terminal 140. For example, the camera device 130 may be part of the user terminal 140, such as a camera of a mobile phone, a camera of a computer, etc.
In some embodiments, the user terminal 140 may include a mobile device, a tablet computer, a laptop computer, or the like, or any combination thereof. In some embodiments, the mobile device may include a wearable device, a smart mobile device, a virtual reality device, an augmented reality device, or the like, or any combination thereof. In some embodiments, the wearable device may include a smart bracelet, a smart footgear, smart glasses, a smart helmet, a smart watch, smart clothing, a smart backpack, a smart accessory, or the like, or any combination thereof. In some embodiments, the smart mobile device may include a mobile phone, a personal digital assistant (PDA), a gaming device, a navigation device, a point of sale (POS) device, or the like, or any combination thereof. In some embodiments, the virtual reality device and/or the augmented reality device may include a virtual reality helmet, virtual reality glasses, a virtual reality patch, an augmented reality helmet, augmented reality glasses, an augmented reality patch, or the like, or any combination thereof. For example, the virtual reality device and/or the augmented reality device may include a Google Glass™, a RiftCon™, a Fragments™, a Gear VR™, etc.
In some embodiments, the user terminal 140 may exchange information and/or data with other components of the action recognition system 100 (e.g., the server 110, the camera device 130, the storage 150) directly or via the network 120. For example, the user terminal 140 may obtain image frames from the camera device 130 or the storage 150 to identify the presence of an action of a facial object based on the image frames. As another example, the user terminal 140 may receive a message (e.g., an authentication) from the server 110.
The storage 150 may store data and/or instructions. In some embodiments, the storage 150 may store data obtained from the camera device 130 and/or the user terminal 140. In some embodiments, the storage 150 may store data and/or instructions that the server 110 may execute or use to perform exemplary methods described in the present disclosure. In some embodiments, the storage 150 may include a mass storage, a removable storage, a volatile read-and-write memory, a read-only memory (ROM), or the like, or any combination thereof. Exemplary mass storage may include a magnetic disk, an optical disk, a solid-state drive, etc. Exemplary removable storage may include a flash drive, a floppy disk, an optical disk, a memory card, a zip disk, a magnetic tape, etc. Exemplary volatile read-and-write memory may include a random access memory (RAM). Exemplary RAM may include a dynamic RAM (DRAM), a double data rate synchronous dynamic RAM (DDR SDRAM), a static RAM (SRAM), a thyristor RAM (T-RAM), a zero-capacitor RAM (Z-RAM), etc. Exemplary ROM may include a mask ROM (MROM), a programmable ROM (PROM), an erasable programmable ROM (EPROM), an electrically-erasable programmable ROM (EEPROM), a compact disk ROM (CD-ROM), a digital versatile disk ROM, etc. In some embodiments, the storage 150 may be implemented on a cloud platform. Merely by way of example, the cloud platform may include a private cloud, a public cloud, a hybrid cloud, a community cloud, a distributed cloud, an inter-cloud, a multi-cloud, or the like, or any combination thereof.
In some embodiments, the storage 150 may be connected to the network 120 to communicate with one or more components of the action recognition system 100 (e.g., the server 110, the camera device 130, the user terminal 140, etc.). One or more components of the action recognition system 100 may access the data or instructions stored in the storage 150 via the network 120. In some embodiments, the storage 150 may be directly connected to or communicate with one or more components of the action recognition system 100 (e.g., the server 110, the camera device 130, the user terminal 140, etc.). In some embodiments, the storage 150 may be part of the server 110.
In some embodiments, one or more components (e.g., the server 110, the camera device 130, the user terminal 140) of the action recognition system 100 may have permission to access the storage 150. For example, the user terminal 140 may access information/data (e.g., image frames containing the facial object) from the storage 150.
This description is intended to be illustrative, and not to limit the scope of the present disclosure. Many alternatives, modifications, and variations will be apparent to those skilled in the art. The features, structures, methods, and other characteristics of the exemplary embodiments described herein may be combined in various ways to obtain additional and/or alternative exemplary embodiments. For example, the storage 150 may be a data storage including cloud computing platforms, such as public clouds, private clouds, community clouds, hybrid clouds, etc. However, those variations and modifications do not depart from the scope of the present disclosure.
FIG. 2 is a schematic diagram illustrating exemplary hardware and/or software components of a computing device 200 according to some embodiments of the present disclosure. In some embodiments, the server 110, the camera device 130, and/or the user terminal 140 may be implemented on the computing device 200. For example, the processing engine 112 may be implemented on the computing device 200 and configured to perform functions of the processing engine 112 disclosed in this disclosure.
The computing device 200 may be used to implement any component of the action recognition system 100 as described herein. For example, the processing engine 112 may be implemented on the computing device 200, via its hardware, software program, firmware, or a combination thereof. Although only one such computer is shown, for convenience, the computer functions relating to the action recognition as described herein may be implemented in a distributed fashion on a number of similar platforms to distribute the processing load.
The computing device 200, for example, may include COM ports 250 connected to and from a network connected thereto to facilitate data communications. The computing device 200 may also include a processor 220, in the form of one or more processors (e.g., logic circuits) , for executing program instructions. For example, the processor 220 may include interface circuits and processing circuits therein. The interface circuits may be configured to receive electronic signals from a bus 210, wherein the electronic signals encode structured data and/or instructions for the processing circuits to process. The processing circuits may conduct logic calculations, and then determine a conclusion, a result, and/or an instruction encoded as electronic signals. Then the interface circuits may send out the electronic signals from the processing circuits via the bus 210.
The computing device 200 may further include program storage and  data storage of different forms including, for example, a disk 270, and a read only memory (ROM) 230, or a random access memory (RAM) 240, for various data files to be processed and/or transmitted by the computing device. The exemplary computer platform may also include program instructions stored in the ROM 230, RAM 240, and/or other type of non-transitory storage medium to be executed by the processor 220. The methods and/or processes of the present disclosure may be implemented as the program instructions. The computing device 200 also includes an I/O component 260, supporting input/output between the computer and other components. The computing device 200 may also receive programming and data via network communications.
Merely for illustration, only one processor is described in FIG. 2. Multiple processors are also contemplated, thus operations and/or method steps performed by one processor as described in the present disclosure may also be jointly or separately performed by the multiple processors. For example, if in the present disclosure the processor of the computing device 200 executes both step A and step B, it should be understood that step A and step B may also be performed by two different CPUs and/or processors jointly or separately in the computing device 200 (e.g., the first processor executes step A and the second processor executes step B, or the first and second processors jointly execute steps A and B) .
FIG. 3 is a schematic diagram illustrating exemplary hardware and/or software components of a mobile device 300 on which the camera device 130, the user terminal 140, or part of the camera device 130 or the user terminal 140 may be implemented according to some embodiments of the present disclosure. As illustrated in FIG. 3, the mobile device 300 may include a communication platform 310, a display 320, a graphic processing unit (GPU) 330, a central processing unit (CPU) 340, an I/O 350, a memory 360, a mobile operating system (OS) 370, and a storage 390. In some  embodiments, any other suitable component, including but not limited to a system bus or a controller (not shown) , may also be included in the mobile device 300.
In some embodiments, the mobile operating system 370 (e.g., iOS TM, Android TM, Windows Phone TM, etc. ) and one or more applications 380 may be loaded into the memory 360 from the storage 390 in order to be executed by the CPU 340. The applications 380 may include a browser or any other suitable mobile apps for receiving and rendering information relating to action recognition or other information from the action recognition system 100. User interactions with the information stream may be achieved via the I/O 350 and provided to the processing engine 112 and/or other components of the action recognition system 100 via the network 120.
To implement various modules, units, and their functionalities described in the present disclosure, computer hardware platforms may be used as the hardware platform (s) for one or more of the elements described herein. A computer with user interface elements may be used to implement a personal computer (PC) or any other type of work station or terminal device. A computer may also act as a system if appropriately programmed.
FIG. 4 is a block diagram illustrating an exemplary processing engine according to some embodiments of the present disclosure. The processing engine 112 may include an obtaining module 410, a distance determination module 420, an action parameter determination module 430, and an identification module 440.
The obtaining module 410 may be configured to obtain a plurality of sequential candidate image frames containing a facial object. The facial object may refer to a face of a subject (e.g., a human, an animal) . The obtaining module 410 may obtain the plurality of sequential candidate image frames from the camera device 130, the user terminal 140, or a storage device (e.g., the storage 150) disclosed elsewhere in the present disclosure.
As used herein, an “image frame” may refer to a frame in a video, and “sequential” may mean that the image frames are arranged according to a sequence (e.g., a temporal sequence) in the video. For example, the camera device 130 may capture a video in chronological order. The video includes a plurality of image frames corresponding to a plurality of capture time points respectively. Accordingly, the image frames are arranged in chronological order based on the capture time points.
In some embodiments, each of the plurality of candidate image frames may include a plurality of feature points associated with the facial object. In some embodiments, the plurality of feature points may include one or more first feature points associated with an upper part of the facial object, a second feature point associated with a middle part of the facial object, and one or more third feature points associated with a lower part of the facial object. As used herein, the upper part may refer to an upper region above a nose of the facial object, the middle part may refer to a middle region including the nose of the facial object, and the lower part may refer to a lower region below the nose of the facial object.
The distance determination module 420 may be configured to determine one or more first distances, each based on one of the one or more first feature points and the second feature point, and one or more second distances, each based on one of the one or more third feature points and the second feature point in each of the plurality of sequential candidate image frames. As used herein, in certain embodiments, the first distance may indicate a distance between the upper part of the facial object and the middle part of the facial object. In certain embodiments, the second distance may indicate a distance between the middle part of the facial object and the lower part of the facial object.
The action parameter determination module 430 may be configured to determine an action parameter based on the one or more first distances and the one or more second distances in each of the plurality of sequential candidate image frames. As used herein, the action parameter refers to a parameter associated with a ratio of a distance between the upper part and the middle part to a distance between the middle part and the lower part.
The identification module 440 may be configured to identify the presence of a nod action in response to that the action parameters satisfy a preset condition. It is known that during the nod action, the facial object may move along a downward direction from a start position to a middle position and then move along an upward direction from the middle position to an end position. Therefore, during the nod action, the distance between the upper part of the facial object and the middle part of the facial object and the distance between the middle part of the facial object and the lower part of the facial object dynamically change in the plurality of sequential candidate image frames. Accordingly, the action parameter dynamically changes during the nod action.
The modules in the processing engine 112 may be connected to or communicate with each other via a wired connection or a wireless connection. The wired connection may include a metal cable, an optical cable, a hybrid cable, or the like, or any combination thereof. The wireless connection may include a Local Area Network (LAN) , a Wide Area Network (WAN) , a Bluetooth, a ZigBee, a Near Field Communication (NFC) , or the like, or any combination thereof. Two or more of the modules may be combined into a single module, and any one of the modules may be divided into two or more units. For example, the distance determination module 420 and the action parameter determination module 430 may be combined as a single module which may both determine the one or more first distances and the one or more second distances, and determine the action parameter based on the first distances and the second distances. As another example, the processing engine 112 may include a storage module (not shown) which may be used to store data generated by the above-mentioned modules.
FIG. 5 is a flowchart illustrating an exemplary process for identifying presence of a nod action according to some embodiments of the present disclosure. In some embodiments, the process 500 may be implemented as a set of instructions (e.g., an application) stored in the ROM 230 or the RAM 240. The processor 220 and/or the modules in FIG. 4 may execute the set of instructions, and when executing the instructions, the processor 220 and/or the modules may be configured to perform the process 500. The operations of the illustrated process presented below are intended to be illustrative. In some embodiments, the process 500 may be accomplished with one or more additional operations not described and/or without one or more of the operations herein discussed. Additionally, the order of the operations of the process as illustrated in FIG. 5 and described below is not intended to be limiting.
In 510, the processing engine 112 (e.g., the obtaining module 410) (e.g., the interface circuits of the processor 220) may obtain a plurality of sequential candidate image frames containing a facial object. The facial object may refer to a face of a subject (e.g., a human, an animal) . The processing engine 112 may obtain the plurality of sequential candidate image frames from the camera device 130, the user terminal 140, or a storage device (e.g., the storage 150) disclosed elsewhere in the present disclosure.
As used herein, an “image frame” may refer to a frame in a video, and “sequential” may mean that the image frames are arranged according to a sequence (e.g., a temporal sequence) in the video. For example, the camera device 130 may capture a video in chronological order. The video includes a plurality of image frames corresponding to a plurality of capture time points respectively. Accordingly, the image frames are arranged in chronological order based on the capture time points.
In some embodiments, the plurality of sequential candidate image frames may be expressed as an ordered set as illustrated below:
$$F = [F_1, F_2, \ldots, F_i, \ldots, F_m] \tag{1}$$
where $F$ refers to the ordered set, $F_i$ refers to an ith candidate image frame, and $m$ refers to the number of the plurality of candidate image frames. In the ordered set, the plurality of sequential candidate image frames are arranged in chronological order based on capture time points of the plurality of candidate image frames. For example, the candidate image frame $F_1$ corresponds to a first capture time point and the candidate image frame $F_2$ corresponds to a second capture time point, wherein the second capture time point is later than the first capture time point and a time interval between the first capture time point and the second capture time point may be a default parameter of the camera device 130 or may be set by the action recognition system 100. For example, the camera device 130 may capture 24 image frames per second. In certain embodiments, the interval between neighboring candidate image frames may be 1/24 second, meaning that all the captured image frames are used as candidate image frames. In certain other embodiments, the interval between neighboring candidate image frames may be 1/12 second, meaning that half of the captured image frames are used as candidate image frames and the others are skipped.
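The frame-skipping example above can be sketched as follows. This is a minimal illustration, not the disclosed implementation: the function names `select_candidate_frames` and `candidate_interval` and the `step` parameter are hypothetical, and integers stand in for captured image frames.

```python
def select_candidate_frames(captured_frames, step=1):
    """Keep every `step`-th captured frame, preserving chronological order.

    step=1 keeps all frames; step=2 keeps half of them (hypothetical parameter).
    """
    if step < 1:
        raise ValueError("step must be a positive integer")
    return captured_frames[::step]


def candidate_interval(frame_rate, step=1):
    """Time interval in seconds between neighboring candidate image frames."""
    return step / frame_rate


# One second of video at 24 frames per second; integers stand in for frames.
captured = list(range(24))
all_frames = select_candidate_frames(captured, step=1)
half_frames = select_candidate_frames(captured, step=2)
```

With `step=2` at 24 frames per second, the interval between neighboring candidate image frames is 2/24 = 1/12 second, matching the second case in the text.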
In some embodiments, each of the plurality of candidate image frames may include a plurality of feature points associated with the facial object. As used herein, a “feature point” may refer to a point located on the face; in certain embodiments, the feature point is a point on the face and is measurably recognizable, for example, a point on an end of an eye, a point on a brow, a point on a nose, etc. In some embodiments, the processing engine 112 may determine the plurality of feature points based on a facial recognition process. The facial recognition process may include a process based on geometric features, a local face analysis process, a principal component analysis process, a deep-learning-based process, or the like, or any combination thereof.
In some embodiments, the plurality of feature points may include one or more first feature points associated with an upper part of the facial object, a second feature point associated with a middle part of the facial object, and one or more third feature points associated with a lower part of the facial object. As used herein, the upper part may refer to an upper region above a nose of the facial object, the middle part may refer to a middle region including the nose of the facial object, and the lower part may refer to a lower region below the nose of the facial object.
In some embodiments, the one or more first feature points may include one or more first left points associated with a left upper part of the facial object and one or more first right points associated with a right upper part of the facial object. For example, as illustrated in FIG. 6-A, the one or more first left points may include any point (e.g., point $a_{l1}$, …, point $a_{li}$, …, and point $a_{ln_1}$) on a left brow. The one or more first right points may include any point (e.g., point $a_{r1}$, …, point $a_{ri}$, …, and point $a_{rn_2}$) on a right brow. In some embodiments, the one or more first feature points may include any point located on the upper part of the facial object. For example, as illustrated in FIG. 6-B, the one or more first feature points may include any point (e.g., point $a_i$) located on or above a line 610 determined based on a highest point of the left brow and a highest point of the right brow, or any point (e.g., point $a'_i$) located on a line 620 determined based on a right end point of the left brow and a left end point of the right brow.
In some embodiments, the second feature point may be a point (e.g., a tip point of the nose) on or around the nose of the facial object. In some embodiments, the one or more third feature points may include any point located on the lower part of the facial object. For example, as illustrated in FIG. 6-A, the one or more third feature points may include any point (e.g., point $c_1$, …, point $c_i$, …, and point $c_{n_3}$) on a chin of the facial object.
In 520, for each of the plurality of sequential candidate image frames, the processing engine 112 (e.g., the distance determination module 420) (e.g., the processing circuits of the processor 220) may determine one or more first distances, each based on one of the one or more first feature points and the second feature point, and one or more second distances, each based on one of the one or more third feature points and the second feature point. As used herein, in certain embodiments, the first distance may indicate a distance between the upper part of the facial object and the middle part of the facial object. In certain embodiments, the second distance may indicate a distance between the middle part of the facial object and the lower part of the facial object.
Taking a specific first distance or a specific second distance as an example, the processing engine 112 may determine the first distance or the second distance according to formula (2) below:
$$D = \sqrt{(x_i - x_0)^2 + (y_i - y_0)^2} \tag{2}$$
where $D$ refers to the first distance or the second distance, $(x_i, y_i)$ refers to a coordinate of an ith first feature point associated with the upper part of the facial object or an ith third feature point associated with the lower part of the facial object, and $(x_0, y_0)$ refers to a coordinate of the second feature point associated with the middle part of the facial object. For illustration purposes, the present disclosure takes a rectangular coordinate system as an example. It should be noted that the coordinates of the feature points may be expressed in any coordinate system (e.g., a polar coordinate system) and an origin of the coordinate system may be any point in the image frame.
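As a minimal sketch of operation 520, assuming that the distance between two feature points is the Euclidean distance in a rectangular coordinate system, the first and second distances could be computed as below. The function names and the sample coordinates are hypothetical and serve only as illustration.

```python
import math


def feature_distance(point, reference):
    """Euclidean distance between a feature point (x_i, y_i) and the
    second feature point (x_0, y_0), both given as (x, y) tuples."""
    return math.hypot(point[0] - reference[0], point[1] - reference[1])


def distances_to_reference(points, reference):
    """Distances from each first (or third) feature point to the second feature point."""
    return [feature_distance(p, reference) for p in points]


# Hypothetical pixel coordinates for one candidate image frame.
nose_tip = (100.0, 120.0)                       # second feature point
brow_points = [(70.0, 80.0), (130.0, 80.0)]     # first feature points
chin_points = [(100.0, 180.0)]                  # third feature points

first_distances = distances_to_reference(brow_points, nose_tip)
second_distances = distances_to_reference(chin_points, nose_tip)
```

Because only ratios of these distances are used later, the choice of origin and the pixel scale of the image frame do not affect the resulting action parameter.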
In 530, the processing engine 112 (e.g., the action parameter determination module 430) (e.g., the processing circuits of the processor 220) may determine an action parameter based on the one or more first distances and the one or more second distances in each of the plurality of sequential candidate image frames. As used herein, the action parameter refers to a  parameter associated with a ratio of a distance between the upper part and the middle part to a distance between the middle part and the lower part. More descriptions of the action parameter may be found elsewhere in the present disclosure (e.g., FIG. 7-A, FIG. 7-B, and the descriptions thereof) .
In 540, the processing engine 112 (e.g., the identification module 440) (e.g., the processing circuits of the processor 220) may identify the presence of a nod action in response to that the action parameters satisfy a preset condition. It is known that during the nod action, the facial object may move along a downward direction from a start position to a middle position and then move along an upward direction from the middle position to an end position. Therefore, during the nod action, the distance between the upper part of the facial object and the middle part of the facial object and the distance between the middle part of the facial object and the lower part of the facial object dynamically change in the plurality of sequential candidate image frames. Accordingly, the action parameter dynamically changes during the nod action.
Assume that the start position and the end position both correspond to time points at which the facial object is directly facing, or substantially directly facing, the camera device 130. In ideal conditions, the action parameter corresponding to the start position and the action parameter corresponding to the end position are fixed values that are approximately equal to each other. During the nod action, the middle position may be a stop position where the facial object stops moving down (or starts moving back up) , which corresponds to a time point when the action parameter is maximum. Accordingly, the processing engine 112 may identify a plurality of sequential target image frames including a start image frame which corresponds to or substantially corresponds to the start position, an end image frame which corresponds to or substantially corresponds to the end position, and a middle image frame which corresponds to or substantially corresponds to the middle position, and identify the presence of the nod action based on the action parameters of the start image frame, the end image frame, and the middle image frame. More descriptions of the identification of the nod action may be found elsewhere in the present disclosure (e.g., FIGs. 9-11 and the descriptions thereof) .
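The pattern described above (approximately equal action parameters at the start and end positions, with a maximum at the middle position) can be illustrated with the following sketch. It is not the preset condition of the present disclosure, which is described in connection with FIGs. 9-11. The function name `looks_like_nod`, the use of the first and last frames as the start and end image frames, and the values of `rise_threshold` and `tolerance` are all assumptions made for illustration.

```python
def looks_like_nod(action_params, rise_threshold=1.2, tolerance=0.1):
    """Return True if a sequence of action parameters rises to an interior
    maximum and returns to roughly its starting value.

    rise_threshold and tolerance are hypothetical illustration values.
    """
    if len(action_params) < 3:
        return False
    peak_index = max(range(len(action_params)), key=action_params.__getitem__)
    # The middle image frame must lie strictly between the start and end frames.
    if peak_index == 0 or peak_index == len(action_params) - 1:
        return False
    start = action_params[0]
    peak = action_params[peak_index]
    end = action_params[-1]
    if start <= 0:
        return False
    start_matches_end = abs(start - end) <= tolerance * start
    rises_enough = peak >= rise_threshold * start
    return start_matches_end and rises_enough


nod_sequence = [1.0, 1.1, 1.5, 1.1, 1.0]
still_sequence = [1.0, 1.0, 1.01, 1.0]
unfinished_sequence = [1.0, 1.2, 1.5]
```

Here `nod_sequence` satisfies both checks, `still_sequence` fails because the maximum barely exceeds the start value, and `unfinished_sequence` fails because its maximum falls on the last frame.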
It should be noted that “substantially corresponds to” as used herein means that the time interval between the capture time point of the image frame and the time point corresponding to the position is less than a time threshold recognizable to a person of ordinary skill in the art. The camera device 130 captures image frames according to a frame rate (which may be a default parameter) ; that is, the capture time points of two adjacent image frames are not continuous (i.e., there is a time interval between the two capture time points) . Therefore, taking the “start position” as an example, the start image frame may not strictly correspond to the time point of the start position, but the capture time point of the start image frame may be very close to that time point. In ideal conditions, the two time points can be considered the same because the intervals between the candidate image frames are usually short.
In some embodiments, the processing engine 112 may further provide an authentication to a terminal device (e.g., the user terminal 140) associated with a user corresponding to the facial object in response to the identification of the presence of the nod action. After receiving the authentication, the user can have an access permission to the terminal device.
It should be noted that the above description is merely provided for the purposes of illustration, and not intended to limit the scope of the present disclosure. For persons having ordinary skills in the art, multiple variations or modifications may be made under the teachings of the present disclosure. However, those variations and modifications do not depart from the scope of the present disclosure. For example, one or more other optional operations (e.g., a storing operation) may be added elsewhere in the process 500. In  the storing operation, the processing engine 112 may store information (e.g., the plurality of sequential candidate image frames, the one or more first distances, the one or more second distances, the action parameters) associated with the action identification in a storage device (e.g., the storage 150) disclosed elsewhere in the present disclosure.
FIGs. 6-A and 6-B are schematic diagrams illustrating exemplary feature points according to some embodiments of the present disclosure. As described in connection with 510, each of the plurality of candidate image frames may include one or more first feature points associated with the upper part of the facial object, a second feature point associated with the middle part of the facial object, and one or more third feature points associated with the lower part of the facial object. In some embodiments, the upper part of the facial object refers to the area on the facial object that includes the eyebrows and above. In some embodiments, the lower part of the facial object refers to the area on the facial object that includes the lips and below. In some embodiments, the middle part of the facial object refers to the area between the eyebrows and the lips.
As illustrated in FIG. 6-A, the one or more first feature points may include first left points (e.g., point $a_{l1}$, …, point $a_{li}$, …, and point $a_{ln_1}$ on a left brow) associated with the left upper part of the facial object and first right points (e.g., point $a_{r1}$, …, point $a_{ri}$, …, and point $a_{rn_2}$ on a right brow) associated with the right upper part of the facial object. The second feature point may be a tip point (e.g., point $b$) of a nose of the facial object. The one or more third feature points may include points (e.g., point $c_1$, …, point $c_i$, …, and point $c_{n_3}$) on a chin of the facial object. The values of $n_1$, $n_2$, and $n_3$ above may be the same as each other or may be different from each other.
As illustrated in FIG. 6-B, the one or more first feature points may include any point (e.g., point $a_i$) located on or above a line 610 determined based on a highest point of the left brow and a highest point of the right brow, or any point (e.g., point $a'_i$) located on or above a line 620 determined based on a right end point of the left brow and a left end point of the right brow.
It should be noted that the examples of the feature points illustrated in FIG. 6-A and FIG. 6-B are provided for illustration purposes, and not intended to limit the scope of the present disclosure. In some embodiments, the first feature point may be any point located on the upper part of the facial object, for example, an end point of an eye, a point located on a line determined based on two end points of the eye, etc. The second feature point may be any point (e.g., a nasal root point) located on or around the nose of the facial object. The third feature point may be any point located on the lower part of the facial object, for example, an end point of a lip, a point located on a line determined based on two end points of the lip, etc.
FIG. 7-A is a flowchart illustrating an exemplary process for determining an action parameter according to some embodiments of the present disclosure. In some embodiments, the process 710 may be implemented as a set of instructions (e.g., an application) stored in the ROM 230 or the RAM 240. The processor 220 and/or the modules in FIG. 4 may execute the set of instructions, and when executing the instructions, the processor 220 and/or the modules may be configured to perform the process 710. The operations of the illustrated process presented below are intended to be illustrative. In some embodiments, the process 710 may be accomplished with one or more additional operations not described and/or without one or more of the operations herein discussed. Additionally, the order of the operations of the process as illustrated in FIG. 7-A and described below is not intended to be limiting. In some embodiments, operation 530 may be performed based on the process 710.
In 711, the processing engine 112 (e.g., the action parameter determination module 430) (e.g., the processing circuits of the processor 220) may determine one or more first ratios of one or more first left distances to  one or more second distances. Each of the one or more first ratios may correspond to a first left distance and a second distance. As described in connection with 510, the one or more first feature points may include one or more first left points associated with a left upper part of the facial object and one or more first right points associated with a right upper part of the facial object. Accordingly, each of the one or more first left distances is determined based on a corresponding first left point and the second feature point.
Taking a specific first ratio as an example, the processing engine 112 may determine the first ratio according to formula (3) below:
$$R_{li} = D_{li} / C_i \tag{3}$$
where $R_{li}$ refers to an ith first ratio, $D_{li}$ refers to an ith first left distance, and $C_i$ refers to an ith second distance.
In 712, the processing engine 112 (e.g., the action parameter determination module 430) (e.g., the processing circuits of the processor 220) may determine a first average ratio of the one or more first ratios. For example, the processing engine 112 may determine the first average ratio according to formula (4) below:
$$\bar{R}_l = (R_{l1} + R_{l2} + \ldots + R_{li} + \ldots + R_{ls_1}) / s_1 \tag{4}$$
where $\bar{R}_l$ refers to the first average ratio, $R_{li}$ refers to the ith first ratio, and $s_1$ refers to the number of the one or more first ratios.
In 713, the processing engine 112 (e.g., the action parameter determination module 430) (e.g., the processing circuits of the processor 220) may determine one or more second ratios of one or more first right distances to the one or more second distances. Each of the one or more second ratios may correspond to a first right distance and a second distance. As described in connection with 510 and 711, each of the one or more first right distances is determined based on a corresponding first right point and the second feature point.
Similarly, taking a specific second ratio as an example, the processing engine 112 may determine the second ratio according to formula (5) below:
$$R_{ri} = D_{ri} / C_i \tag{5}$$
where $R_{ri}$ refers to an ith second ratio, $D_{ri}$ refers to an ith first right distance, and $C_i$ refers to the ith second distance.
In 714, the processing engine 112 (e.g., the action parameter determination module 430) (e.g., the processing circuits of the processor 220) may determine a second average ratio of the one or more second ratios. For example, the processing engine 112 may determine the second average ratio according to formula (6) below:
$$\bar{R}_r = (R_{r1} + R_{r2} + \ldots + R_{ri} + \ldots + R_{rs_2}) / s_2 \tag{6}$$
where $\bar{R}_r$ refers to the second average ratio, $R_{ri}$ refers to the ith second ratio, and $s_2$ refers to the number of the one or more second ratios, wherein $s_2$ may be the same as or different from $s_1$.
In 715, the processing engine 112 (e.g., the action parameter determination module 430) (e.g., the processing circuits of the processor 220) may determine an action parameter based on the first average ratio and the second average ratio. For example, the processing engine 112 may determine the action parameter according to formula (7) below:
$$A = (\bar{R}_l + \bar{R}_r) / 2 \tag{7}$$
where $A$ refers to the action parameter, and $\bar{R}_l$ and $\bar{R}_r$ refer to the first average ratio and the second average ratio respectively. It should be noted that formula (7) above is provided for illustration purposes. The processing engine 112 may also determine the action parameter based on a weighted average value of the first average ratio and the second average ratio, the larger one of the first average ratio and the second average ratio, etc.
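Operations 711 to 715 can be sketched as follows. This is an illustrative sketch under stated assumptions: the ith first left (or right) distance is paired with the ith second distance as in formulas (3) and (5), the two average ratios are combined by a plain mean (an assumption, since the text also permits a weighted average or the larger value), and the function names are hypothetical.

```python
def average_ratio(upper_distances, second_distances):
    """Mean of the element-wise ratios D_i / C_i (formulas (3)-(6)).

    Pairs the ith upper distance with the ith second distance; extra
    elements in the longer list are ignored by zip (an assumption).
    """
    ratios = [d / c for d, c in zip(upper_distances, second_distances)]
    if not ratios:
        raise ValueError("at least one distance pair is required")
    return sum(ratios) / len(ratios)


def action_parameter_left_right(left_distances, right_distances, second_distances):
    """Action parameter as the plain mean of the left and right average ratios.

    The plain mean is an assumed combination used for illustration.
    """
    first_average = average_ratio(left_distances, second_distances)
    second_average = average_ratio(right_distances, second_distances)
    return (first_average + second_average) / 2


# Hypothetical distances for one candidate image frame.
left = [2.0, 4.0]
right = [4.0, 8.0]
second = [2.0, 4.0]
param = action_parameter_left_right(left, right, second)
```

In this example both first ratios equal 1.0 and both second ratios equal 2.0, so the action parameter is 1.5.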
FIG. 7-B is a flowchart illustrating an exemplary process for determining an action parameter according to some embodiments of the present disclosure. In some embodiments, the process 720 may be implemented as a set of instructions (e.g., an application) stored in the ROM 230 or the RAM 240. The processor 220 and/or the modules in FIG. 4 may execute the set of instructions, and when executing the instructions, the processor 220 and/or the modules may be configured to perform the process 720. The operations of the illustrated process presented below are intended to be illustrative. In some embodiments, the process 720 may be accomplished with one or more additional operations not described and/or without one or more of the operations herein discussed. Additionally, the order of the operations of the process as illustrated in FIG. 7-B and described below is not intended to be limiting. In some embodiments, operation 530 may be performed based on the process 720.
In 721, the processing engine 112 (e.g., the action parameter determination module 430) (e.g., the processing circuits of the processor 220) may determine one or more distance ratios of the one or more first distances to the one or more second distances. Each of the one or more distance ratios may correspond to a first distance and a second distance.
Taking a specific distance ratio as an example, the processing engine 112 may determine the distance ratio according to formula (8) below:
$$R_i = D_i / C_i \tag{8}$$
where $R_i$ refers to an ith distance ratio, $D_i$ refers to an ith first distance, and $C_i$ refers to an ith second distance.
In 722, the processing engine 112 (e.g., the action parameter determination module 430) (e.g., the processing circuits of the processor 220) may determine a compound distance ratio of the one or more distance ratios as the action parameter. For example, the processing engine 112 may determine the compound distance ratio (i.e., the action parameter) according to formula (9) below:
$$A = (R_1 + R_2 + \ldots + R_i + \ldots + R_q) / q \tag{9}$$
where $A$ refers to the action parameter, $R_i$ refers to the ith distance ratio, and $q$ refers to the number of the one or more distance ratios. It should be noted that formula (9) above is provided for illustration purposes. The processing engine 112 may also determine the action parameter based on a weighted average value of the one or more distance ratios, the largest one of the one or more distance ratios, etc.
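Formulas (8) and (9) can be sketched as below, together with the "largest ratio" alternative mentioned in the text. The function names and the `mode` parameter are hypothetical, and the index-wise pairing of first and second distances is an assumption that follows formula (8).

```python
def distance_ratios(first_distances, second_distances):
    """Element-wise ratios R_i = D_i / C_i as in formula (8)."""
    return [d / c for d, c in zip(first_distances, second_distances)]


def compound_distance_ratio(first_distances, second_distances, mode="mean"):
    """Action parameter from the distance ratios.

    mode="mean" follows formula (9); mode="max" is the 'largest ratio'
    alternative. The `mode` parameter itself is a hypothetical convenience.
    """
    ratios = distance_ratios(first_distances, second_distances)
    if not ratios:
        raise ValueError("at least one distance pair is required")
    if mode == "mean":
        return sum(ratios) / len(ratios)
    if mode == "max":
        return max(ratios)
    raise ValueError("mode must be 'mean' or 'max'")


# Hypothetical distances for one candidate image frame.
first = [3.0, 6.0]
second = [2.0, 3.0]
mean_param = compound_distance_ratio(first, second)
max_param = compound_distance_ratio(first, second, mode="max")
```

With these sample values the ratios are 1.5 and 2.0, giving 1.75 under formula (9) and 2.0 under the largest-ratio alternative.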
It should be noted that the above description is provided for the purpose of illustration, and is not intended to limit the scope of the present disclosure. For persons having ordinary skills in the art, multiple variations and modifications may be made under the teaching of the present disclosure. However, those variations and modifications do not depart from the scope of the present disclosure.
FIG. 8-A is a flowchart illustrating an exemplary process for determining a candidate image frame according to some embodiments of the present disclosure. In some embodiments, the process 800 may be implemented as a set of instructions (e.g., an application) stored in the storage ROM 230 or RAM 240. The processor 220 and/or the modules in FIG. 4 may execute the set of instructions, and when executing the instructions, the processor 220 and/or the modules may be configured to perform the process 800. The operations of the illustrated process presented below are intended to be illustrative. In some embodiments, the process 800 may be accomplished with one or more additional operations not described and/or without one or more of the operations herein discussed. Additionally, the order of the operations of the process as illustrated in FIG. 8-A and described below is not intended to be limiting. In some embodiments, operation 510 may be performed based on process 800.
In 810, the processing engine 112 (e.g., the obtaining module 410) (e.g., the interface circuits of the processor 220) may obtain an initial image frame. The processing engine 112 may obtain the initial image frame from the camera device 130, the user terminal 140, or a storage device (e.g., the storage 150) disclosed elsewhere in the present disclosure.
In some embodiments, the initial image frame may include a first initial feature point on a center of a left eye of the facial object, a second initial feature point on a center of a right eye of the facial object, a third initial feature point on a tip of a nose of the facial object, a fourth initial feature point on a left end of a lip of the facial object, and a fifth initial feature point on a right end of the lip of the facial object.
In 820, the processing engine 112 (e.g., the obtaining module 410) (e.g., the processing circuits of the processor 220) may determine a quadrangle based on the first initial feature point, the second initial feature point, the fourth initial feature point, and the fifth initial feature point. Further, the processing engine 112 may determine whether the third initial feature point is within the quadrangle.
In 830, the processing engine 112 (e.g., the obtaining module 410) (e.g., the processing circuits of the processor 220) may determine the initial image frame as a candidate image frame in response to determining that the third initial feature point is within the quadrangle. As illustrated in FIG. 8-B, if the third initial feature point is within the quadrangle, it may indicate that the initial image frame contains the facial object; if the third initial feature point is not within the quadrangle, it may indicate that a problem occurred during the capture of the initial image frame, so that the initial image frame cannot be used for further processing. In addition, if the initial image frame includes only some of the first initial feature point, the second initial feature point, the third initial feature point, the fourth initial feature point, and the fifth initial feature point, it may indicate that the initial image frame contains only a part (e.g., the upper part) of the facial object. In this situation, the initial image frame also cannot be used for further processing.
FIG. 8-B is a schematic diagram illustrating exemplary initial feature points according to some embodiments of the present disclosure. As illustrated, point 841 refers to the first initial feature point on the center of the left eye, point 842 refers to the second initial feature point on the center of the right eye, point 843 refers to the third initial feature point on the tip of the nose, point 844 refers to the fourth initial feature point on the left end of the lip, and point 845 refers to the fifth initial feature point on the right end of the lip. It can be seen that point 843 is within a quadrangle 840 determined based on the points 841, 842, 844, and 845.
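By way of a non-limiting sketch, the check of operations 820 and 830 may be written in Python as below. The disclosure does not specify how the "within the quadrangle" test is performed; a standard ray-casting point-in-polygon test is assumed here. The dictionary keys, the function names, and the vertex ordering (left eye, right eye, right lip end, left lip end, so that the vertices trace the perimeter) are likewise assumptions made for illustration.

```python
def point_in_polygon(point, polygon):
    """Ray-casting test: count crossings of a ray cast from the point in +x."""
    x, y = point
    inside = False
    n = len(polygon)
    for k in range(n):
        x1, y1 = polygon[k]
        x2, y2 = polygon[(k + 1) % n]
        # Only edges that straddle the horizontal line through the point count.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def is_candidate_frame(points):
    """points maps feature names to (x, y) tuples; missing points mean rejection."""
    required = ("left_eye", "right_eye", "nose_tip", "lip_left", "lip_right")
    if any(points.get(name) is None for name in required):
        return False  # only part of the facial object was captured
    quadrangle = [points["left_eye"], points["right_eye"],
                  points["lip_right"], points["lip_left"]]
    return point_in_polygon(points["nose_tip"], quadrangle)


frame = {"left_eye": (30, 40), "right_eye": (70, 40), "nose_tip": (50, 60),
         "lip_left": (38, 80), "lip_right": (62, 80)}
usable = is_candidate_frame(frame)
```

Points lying exactly on an edge of the quadrangle are a boundary case that this sketch does not treat specially.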
FIG. 9 is a flowchart illustrating an exemplary process for identifying the presence of a nod action according to some embodiments of the present disclosure. In some embodiments, the process 900 may be implemented as a set of instructions (e.g., an application) stored in the storage ROM 230 or RAM 240. The processor 220 and/or the modules in FIG. 4 may execute the set of instructions, and when executing the instructions, the processor 220 and/or the modules may be configured to perform the process 900. The operations of the illustrated process presented below are intended to be illustrative. In some embodiments, the process 900 may be accomplished with one or more additional operations not described and/or without one or more of the operations herein discussed. Additionally, the order of the operations of the process as illustrated in FIG. 9 and described below is not intended to be limiting. In some embodiments, operation 540 may be performed based on process 900.
In 910, the processing engine 112 (e.g., the identification module 440) (e.g., the processing circuits of the processor 220) may identify a plurality of sequential target image frames from the plurality of sequential candidate image frames. As described in connection with 540, the plurality of sequential target image frames include a start image frame which corresponds to or substantially corresponds to the start position (i.e., a position where the facial object starts moving along a downward direction), an end image frame which corresponds to or substantially corresponds to the end position (i.e., a position where the facial object stops moving along the upward direction), and a middle image frame which corresponds to or substantially corresponds to the middle position (i.e., a position where the facial object stops moving down (or starts moving back up)).
In 920, the processing engine 112 (e.g., the identification module 440) (e.g., the processing circuits of the processor 220) may identify a maximum action parameter from a plurality of action parameters corresponding to the plurality of sequential target image frames. As described above, the maximum action parameter corresponds to the middle image frame.
In 930, the processing engine 112 (e.g., the identification module 440) (e.g., the processing circuits of the processor 220) may identify a minimum action parameter associated with the plurality of action parameters corresponding to the plurality of sequential target image frames. As described above, in ideal conditions, the minimum action parameter corresponds to the start image frame or the end image frame.
In some embodiments, the processing engine 112 may determine the minimum action parameter based on an action parameter (also referred to as “start action parameter”) corresponding to the start image frame and an action parameter (also referred to as “end action parameter”) corresponding to the end image frame. For example, the processing engine 112 may determine an average action parameter of the start action parameter and the end action parameter as the minimum action parameter.
In some embodiments, as illustrated in FIG. 12, within a time period before a capture time point corresponding to the start image frame, the facial object may keep facing right to or substantially right to the camera device 130, during which the action parameter stays almost unchanged (e.g., from point 1201 to point 1202). Similarly, within a time period after the capture time point corresponding to the end image frame, the facial object may keep facing right to or substantially right to the camera device 130, during which the action parameter also stays almost unchanged (e.g., from point 1208 to point 1212). Therefore, the processing engine 112 may determine two average action parameters (i.e., a first average action parameter and a fourth average action parameter described in FIG. 10 and FIG. 11, respectively) corresponding to the two time periods, and further determine an average value of the two average action parameters as the minimum action parameter.
In 940, the processing engine 112 (e.g., the identification module 440) (e.g., the processing circuits of the processor 220) may determine an asymmetry parameter based on the maximum action parameter and the minimum action parameter. The asymmetry parameter may indicate an amplitude of action parameters corresponding to the plurality of sequential target image frames. In some embodiments, the processing engine 112 may determine the asymmetry parameter according to formula (10) below:
Figure PCTCN2018084426-appb-000015
where Asy refers to the asymmetry parameter, A_max refers to the maximum action parameter, and A_min refers to the minimum action parameter.
In 950, the processing engine 112 (e.g., the identification module 440) (e.g., the processing circuits of the processor 220) may determine a first number count of target image frames from the start image frame to a target image frame (i.e., the middle image frame) corresponding to the maximum action parameter.
In 960, the processing engine 112 (e.g., the identification module 440) (e.g., the processing circuits of the processor 220) may determine a second number count of target image frames from the target image frame (i.e., the middle image frame) corresponding to the maximum action parameter to the end image frame.
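As a hedged illustration of operations 920 through 960, the following Python sketch locates the middle image frame and derives the two number counts from a list of action parameters. The function name `summarize_target_frames`, the use of 0-based list indices, and the choice to count both endpoint frames (an inclusive count) are assumptions; the disclosure does not state whether the counts are inclusive. The minimum action parameter is taken here as the average of the start and end action parameters, which is one of the options described above.

```python
def summarize_target_frames(params, start, end):
    """params: action parameters of the candidate frames, in capture order.
    start, end: 0-based indices of the start and end image frames."""
    if not 0 <= start < end < len(params):
        raise ValueError("require 0 <= start < end < len(params)")
    window = params[start:end + 1]
    # Middle image frame: the target frame with the maximum action parameter.
    mid = start + max(range(len(window)), key=window.__getitem__)
    a_max = params[mid]
    # One described option: average of the start and end action parameters.
    a_min = (params[start] + params[end]) / 2
    first_count = mid - start + 1   # start frame .. middle frame (inclusive, assumed)
    second_count = end - mid + 1    # middle frame .. end frame (inclusive, assumed)
    return mid, a_max, a_min, first_count, second_count


curve = [1.0, 1.0, 1.2, 1.8, 2.6, 1.9, 1.3, 1.1, 1.0]
mid, a_max, a_min, n1, n2 = summarize_target_frames(curve, start=1, end=7)
```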
In 970, the processing engine 112 (e.g., the identification module 440) (e.g., the processing circuits of the processor 220) may determine an estimated line by fitting the second feature points (e.g., a tip point on the nose) in the plurality of sequential target image frames. In some embodiments, the processing engine 112 may determine the estimated line based on a fitting process. For example, the fitting process may include a least-squares estimation process, a maximum-likelihood estimation process, a Bayesian linear regression process, or the like, or any combination thereof.
In 980, the processing engine 112 (e.g., the identification module 440) (e.g., the processing circuits of the processor 220) may identify the presence of a nod action based on the maximum action parameter, the minimum action parameter, the asymmetry parameter, the first number count, the second number count, and an angle between the estimated line and a vertical line. The processing engine 112 may identify the presence of the nod action in response to determining that the asymmetry parameter is larger than an asymmetry threshold, the first number count is larger than a first number count threshold, the second number count is larger than a second number count threshold, and the angle between the estimated line and the vertical line is less than an angle threshold.
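Merely as an example of the least-squares option, the sketch below fits a straight line through the nose-tip coordinates and reports the angle between that line and the vertical. Because a nod trajectory is nearly vertical, the sketch regresses the horizontal coordinate x on the vertical coordinate y; this choice, the function names, and the pixel coordinates in the usage lines are assumptions and are not taken from the disclosure.

```python
import math


def fit_line_x_on_y(points):
    """Ordinary least squares for x = slope * y + intercept."""
    n = len(points)
    if n < 2:
        raise ValueError("at least two feature points are required")
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    s_yy = sum((y - mean_y) ** 2 for _, y in points)
    if s_yy == 0:
        raise ValueError("the points show no vertical displacement")
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = s_xy / s_yy
    return slope, mean_x - slope * mean_y


def angle_to_vertical_deg(slope):
    """slope is dx/dy, so a perfectly vertical line has slope 0 and angle 0."""
    return math.degrees(math.atan(abs(slope)))


nose_tips = [(100, 50), (101, 60), (102, 70), (103, 80)]
slope, intercept = fit_line_x_on_y(nose_tips)
angle = angle_to_vertical_deg(slope)  # roughly 5.7 degrees for this drift
```

Regressing x on y avoids the unbounded slope that a y-on-x fit would produce for a perfectly vertical trajectory.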
In some embodiments, the asymmetry threshold may be default settings of the action recognition system 100, or may be adjustable under different situations. For example, the asymmetry threshold may be any value within a range from 2 to 3.
In some embodiments, the first number count threshold and the second number count threshold may be default settings of the action recognition system 100. For example, the first number count threshold or the second number count threshold may be any value (e.g., 4) within a range from 2 to 10. In some embodiments, the first number count threshold and the second number count threshold may be adjustable according to a frame rate of the camera device 130 or the interval between neighboring image frames. The frame rate may refer to a number of image frames captured by the camera device 130 per unit time (e.g., per second). In some embodiments, a larger frame rate of the camera device 130 may correspond to a larger first number count threshold or a larger second number count threshold. In some embodiments, the first number count threshold and the second number count threshold may be the same or different.
In some embodiments, the estimated line fitted based on the second feature points may be a straight line. The angle between the estimated line and the vertical line may be an angle between two straight lines. In some embodiments, the estimated line may be a curve. The angle between the estimated line and the vertical line may be an angle between a tangent line of a point on the curve and the vertical line. The angle threshold may be a default setting of the action recognition system 100, or may be adjustable under specific situations. For example, the angle threshold may be any value (e.g., 10°) within a range from 5° to 20°. It is known that during the nod action, the facial object may not move strictly along the vertical line, that is, the second feature point (e.g., a tip point of the nose) may not always lie strictly on the vertical line. Therefore, the processing engine 112 defines the angle threshold: provided that the angle between the estimated line and the vertical line is less than the angle threshold, the identification of the nod action is considered correct.
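The decision of operation 980 may be sketched, for illustration purposes only, as a conjunction of four threshold comparisons. The class and function names are hypothetical. The default values are merely picked from the exemplary ranges given above (2.5 from the range 2 to 3, 4 from the range 2 to 10, and 10° from the range 5° to 20°), and the asymmetry parameter is taken as an input because formula (10) is not reproduced in the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodThresholds:
    asymmetry: float = 2.5   # assumed value within the exemplary range 2 to 3
    first_count: int = 4     # exemplary value within the range 2 to 10
    second_count: int = 4    # may be the same as or different from first_count
    angle_deg: float = 10.0  # exemplary value within the range 5 to 20 degrees


def is_nod(asymmetry, first_count, second_count, angle_deg,
           thresholds=NodThresholds()):
    """True only if all four conditions of operation 980 hold."""
    return (asymmetry > thresholds.asymmetry
            and first_count > thresholds.first_count
            and second_count > thresholds.second_count
            and angle_deg < thresholds.angle_deg)


detected = is_nod(asymmetry=3.0, first_count=6, second_count=5, angle_deg=4.2)
```

A deployment with a higher camera frame rate could pass a `NodThresholds` instance with larger count thresholds, in line with the adjustment described above.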
It should be noted that the above description is provided for the purpose of illustration, and is not intended to limit the scope of the present disclosure. For persons having ordinary skills in the art, multiple variations and modifications may be made under the teaching of the present disclosure. However, those variations and modifications do not depart from the scope of the present disclosure.
FIG. 10 is a flowchart illustrating an exemplary process for determining a start image frame according to some embodiments of the present disclosure. In some embodiments, the process 1000 may be implemented as a set of instructions (e.g., an application) stored in the storage ROM 230 or RAM 240. The processor 220 and/or the modules in FIG. 4 may execute the set of instructions, and when executing the instructions, the processor 220 and/or the modules may be configured to perform the process 1000. The operations of the illustrated process presented below are intended to be illustrative. In some embodiments, the process 1000 may be accomplished with one or more additional operations not described and/or without one or more of the operations herein discussed. Additionally, the order of the operations of the process as illustrated in FIG. 10 and described below is not intended to be limiting. In some embodiments, operation 910 may be performed based on process 1000.
In 1010, for a candidate image frame, the processing engine 112 (e.g., the identification module 440) (e.g., the processing circuits of the processor 220) may select a plurality of previous image frames before the candidate image frame and a plurality of subsequent image frames after the candidate image frame along a sequence of the plurality of sequential candidate image frames. As described in connection with 510, the plurality of sequential candidate image frames are aligned in a chronological order based on the capture time points. Accordingly, the “sequence” here refers to the chronological order. Further, “previous image frames” here refer to continuous image frames immediately before the candidate image frame along the sequence, and “subsequent image frames” refer to continuous image frames immediately after the candidate image frame.
Taking an ith candidate image frame F_i as an example, the plurality of previous image frames before the ith candidate image frame may be expressed as an ordered set below:
P_1 = [F_{i-x}, …, F_{i-2}, F_{i-1}] (i > 1, x < i)       (11)
where P_1 refers to the ordered set including the plurality of previous image frames and x refers to a number count of the plurality of previous image frames.
Also taking the ith candidate image frame F_i as an example, the plurality of subsequent image frames after the ith candidate image frame may be expressed as an ordered set below:
N_1 = [F_{i+1}, F_{i+2}, …, F_{i+y}] (i > 1, y < m-1)    (12)
where N_1 refers to the ordered set including the plurality of subsequent image frames, y refers to a number count of the plurality of subsequent image frames, and m refers to a number count of the plurality of candidate image frames.
In 1020, the processing engine 112 (e.g., the identification module 440) (e.g., the processing circuits of the processor 220) may determine a first average action parameter based on a plurality of first action parameters corresponding to the plurality of previous image frames. For example, the processing engine 112 may determine the first average action parameter according to formula (13) below:
\bar{A}_1 = (A_{i-x} + … + A_{i-2} + A_{i-1}) / x    (13)
where \bar{A}_1 refers to the first average action parameter, and A_{i-x} refers to a first action parameter corresponding to an (i-x)th candidate image frame.
In 1030, the processing engine 112 (e.g., the identification module 440) (e.g., the processing circuits of the processor 220) may determine a second average action parameter based on a plurality of second action parameters corresponding to the plurality of subsequent image frames. For example, the processing engine 112 may determine the second average action parameter according to formula (14) below:
\bar{A}_2 = (A_{i+1} + A_{i+2} + … + A_{i+y}) / y    (14)
where \bar{A}_2 refers to the second average action parameter, and A_{i+y} refers to a second action parameter corresponding to an (i+y)th candidate image frame.
In 1040, the processing engine 112 (e.g., the identification module 440) (e.g., the processing circuits of the processor 220) may identify the candidate image frame as the start image frame in response to determining that the first average action parameter is less than the second average action parameter and each of the plurality of second action parameters is larger than or equal to the action parameter corresponding to the candidate image frame.
As described in connection with 540, the start image frame (e.g., point 1202 illustrated in FIG. 12) corresponds to or substantially corresponds to the start position where the facial object is facing right to or substantially right to the camera device 130. Within a time period before a capture time point corresponding to the start image frame, the facial object may keep facing right to or substantially right to the camera device 130, during which the action parameter stays almost unchanged (e.g., from point 1201 to point 1202 illustrated in FIG. 12). Subsequently, the facial object moves from the start position along a downward direction, during which the action parameter gradually increases (e.g., from point 1202 to point 1204 illustrated in FIG. 12). Therefore, for the start image frame, the first average action parameter of the plurality of previous image frames is less than the second average action parameter of the plurality of subsequent image frames, and each of the plurality of second action parameters corresponding to the subsequent image frames is larger than or equal to the action parameter of the start image frame.
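The start-frame test of process 1000 may be illustrated with the following Python sketch. It uses 0-based list indices rather than the 1-based frame numbering of the ordered sets (11) and (12), and the function name `is_start_frame` and the sample curve are hypothetical. The window sizes x and y are supplied by the caller because the disclosure does not fix them.

```python
def is_start_frame(params, i, x, y):
    """params: action parameters of the candidate frames in capture order.
    i: 0-based index of the candidate frame under test.
    x, y: number of previous and subsequent frames to examine."""
    if x <= 0 or y <= 0:
        raise ValueError("x and y must be positive")
    if i - x < 0 or i + y >= len(params):
        return False  # not enough neighbouring frames on one side
    previous = params[i - x:i]            # frames immediately before frame i
    subsequent = params[i + 1:i + y + 1]  # frames immediately after frame i
    first_avg = sum(previous) / x         # first average action parameter
    second_avg = sum(subsequent) / y      # second average action parameter
    return (first_avg < second_avg
            and all(a >= params[i] for a in subsequent))


curve = [1.0, 1.0, 1.0, 1.0, 1.3, 1.8, 2.4, 1.9, 1.4, 1.0, 1.0, 1.0]
at_plateau_end = is_start_frame(curve, i=3, x=3, y=3)
```

In this sample curve, index 3 is the last frame of the flat section before the parameter begins to rise, so it satisfies both conditions, whereas a frame partway up the rise whose later neighbours fall back below it does not.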
It should be noted that the above description is provided for the purpose of illustration, and is not intended to limit the scope of the present disclosure. For persons having ordinary skills in the art, multiple variations and modifications may be made under the teaching of the present disclosure. However, those variations and modifications do not depart from the scope of the present disclosure.
FIG. 11 is a flowchart illustrating an exemplary process for determining an end image frame according to some embodiments of the present disclosure. In some embodiments, the process 1100 may be implemented as a set of instructions (e.g., an application) stored in the storage ROM 230 or RAM 240. The processor 220 and/or the modules in FIG. 4 may execute the set of instructions, and when executing the instructions, the processor 220 and/or the modules may be configured to perform the process 1100. The operations of the illustrated process presented below are intended to be illustrative. In some embodiments, the process 1100 may be accomplished with one or more additional operations not described and/or without one or more of the operations herein discussed. Additionally, the order of the operations of the process as illustrated in FIG. 11 and described below is not intended to be limiting. In some embodiments, operation 910 may be performed based on process 1100.
In 1110, for a candidate image frame after the start image frame, the processing engine 112 (e.g., the identification module 440) (e.g., the processing circuits of the processor 220) may select a plurality of previous image frames before the candidate image frame and a plurality of subsequent image frames after the candidate image frame along the sequence of the plurality of sequential candidate image frames. As described in connection with 1010, the “previous image frames” refer to continuous image frames immediately before the candidate image frame along the sequence, and the “subsequent image frames” refer to continuous image frames immediately after the candidate image frame.
Taking a jth candidate image frame F_j after the start image frame (assuming that the start image frame is F_i) as an example, the plurality of previous image frames before the jth candidate image frame may be expressed as an ordered set below:
P_2 = [F_{j-e}, …, F_{j-2}, F_{j-1}] ((j-e) > i)    (15)
where P_2 refers to the ordered set including the plurality of previous image frames and e refers to a number count of the plurality of previous image frames.
Also taking the jth candidate image frame F_j as an example, the plurality of subsequent image frames after the jth candidate image frame may be expressed as an ordered set below:
N_2 = [F_{j+1}, F_{j+2}, …, F_{j+f}] ((j+f) ≤ m)    (16)
where N_2 refers to the ordered set including the plurality of subsequent image frames and f refers to a number count of the plurality of subsequent image frames.
In 1120, the processing engine 112 (e.g., the identification module 440) (e.g., the processing circuits of the processor 220) may determine a third average action parameter based on a plurality of third action parameters corresponding to the plurality of previous image frames. For example, the processing engine 112 may determine the third average action parameter according to formula (17) below:
\bar{A}_3 = (A_{j-e} + … + A_{j-2} + A_{j-1}) / e    (17)
where \bar{A}_3 refers to the third average action parameter, and A_{j-e} refers to a third action parameter corresponding to a (j-e)th candidate image frame.
In 1130, the processing engine 112 (e.g., the identification module 440) (e.g., the processing circuits of the processor 220) may determine a fourth average action parameter based on a plurality of fourth action parameters corresponding to the plurality of subsequent image frames. For example, the processing engine 112 may determine the fourth average action parameter according to formula (18) below:
\bar{A}_4 = (A_{j+1} + A_{j+2} + … + A_{j+f}) / f    (18)
where \bar{A}_4 refers to the fourth average action parameter, and A_{j+f} refers to a fourth action parameter corresponding to a (j+f)th candidate image frame.
In 1140, the processing engine 112 (e.g., the identification module 440) (e.g., the processing circuits of the processor 220) may identify the candidate image frame as the end image frame in response to determining that the third average action parameter is larger than the fourth average action parameter, each of the plurality of third action parameters is larger than or equal to the action parameter corresponding to the candidate image frame, an action parameter corresponding to a subsequent image frame adjacent to the candidate image frame is smaller than or equal to the action parameter corresponding to the candidate image frame, and a ratio associated with the first average action parameter and the fourth average action parameter is less than a ratio threshold.
As used herein, the ratio associated with the first average action parameter and the fourth average action parameter may be expressed as formula (19) below:
Figure PCTCN2018084426-appb-000024
where T refers to the ratio associated with the first average action parameter and the fourth average action parameter, 
Figure PCTCN2018084426-appb-000025
refers to the first average action parameter, and
Figure PCTCN2018084426-appb-000026
refers to the fourth average action parameter.
In some embodiments, the ratio threshold may be default settings of the action recognition system 100, or may be adjustable under different situations. For example, the ratio threshold may be any value within a range from 1.05 to 1.2.
As described in connection with 540, the end image frame (e.g., point 1208 illustrated in FIG. 12) corresponds to or substantially corresponds to the end position where the facial object moves back and is facing right to or substantially right to the camera device 130. Within a time period before a capture time point corresponding to the end image frame, the facial object is moving along an upward direction, during which the action parameter gradually decreases (e.g., from point 1206 to point 1208 illustrated in FIG. 12); within a time period after the capture time point corresponding to the end image frame, the facial object may keep facing right to or substantially right to the camera device 130, during which the action parameter stays almost unchanged (e.g., from point 1208 to point 1212 illustrated in FIG. 12).
Therefore, for the end image frame, the third average action parameter of the plurality of previous image frames is larger than the fourth average action parameter of the plurality of subsequent image frames, each of the plurality of third action parameters is larger than or equal to the action parameter corresponding to the end image frame, and an action parameter corresponding to a subsequent image frame (e.g., point 1209 illustrated in FIG. 12) adjacent to the end image frame is smaller than or equal to the action parameter corresponding to the end image frame.
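A corresponding sketch of the end-frame test of process 1100 is given below, again using 0-based list indices. The function name `is_end_frame` and the sample values are hypothetical. Because formula (19) is not reproduced in the text, the ratio T is not computed here; it is assumed to be supplied by the caller. The default ratio threshold of 1.1 is merely a value inside the exemplary range of 1.05 to 1.2.

```python
def is_end_frame(params, start, j, e, f, ratio, ratio_threshold=1.1):
    """params: action parameters of the candidate frames in capture order.
    start: 0-based index of the start image frame.
    j: 0-based index of the candidate frame under test.
    e, f: number of previous and subsequent frames to examine.
    ratio: the ratio T of formula (19), computed elsewhere by the caller."""
    if e <= 0 or f <= 0:
        raise ValueError("e and f must be positive")
    # Set (15) requires the previous frames to lie after the start image frame.
    if j - e <= start or j + f >= len(params):
        return False
    previous = params[j - e:j]
    subsequent = params[j + 1:j + f + 1]
    third_avg = sum(previous) / e     # third average action parameter
    fourth_avg = sum(subsequent) / f  # fourth average action parameter
    return (third_avg > fourth_avg
            and all(a >= params[j] for a in previous)
            and subsequent[0] <= params[j]
            and ratio < ratio_threshold)


curve = [1.0, 1.0, 1.0, 1.0, 1.3, 1.8, 2.4, 1.9, 1.4, 1.0, 1.0, 1.0]
found_end = is_end_frame(curve, start=3, j=9, e=3, f=2, ratio=1.0)
```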
FIG. 12 is a schematic diagram illustrating an exemplary curve indicating a variation process of the action parameter according to some embodiments of the present disclosure. As illustrated in FIG. 12, the horizontal axis refers to “image frame” and the vertical axis refers to “action parameter.”
As described elsewhere in the present disclosure, the processing engine 112 may identify a plurality of sequential target image frames associated with the facial object and identify the presence of a nod action based on the plurality of sequential target image frames. In some embodiments, the plurality of sequential target image frames include a start image frame F_i, an end image frame F_j, and a middle image frame F_mid having the maximum action parameter. As illustrated in FIG. 12, point 1202 corresponds to the start image frame, point 1208 corresponds to the end image frame, and point 1205 corresponds to the middle image frame.
During a nod action, as described elsewhere in the present disclosure, the facial object moves from a start position to a middle position along a downward direction and moves from the middle position to an end position along an upward direction. The start image frame may correspond to or substantially correspond to the start position corresponding to a time point when the facial object is facing right to or substantially right to the camera device 130. As used herein, “substantially right to” refers to that an angle between a direction that the facial object is facing to and a direction pointing perpendicularly at the camera device 130 is less than a threshold that is recognizable for an ordinary person in the art. In some embodiments, the action parameter of the start image frame is a fixed value which may be default settings of the action recognition system 100, or may be adjustable under different situations.
As the facial object moves from the start position along the downward direction, in the image frames, the distance between the upper part of the facial object and the middle part of the facial object gradually increases, and the distance between the middle part of the facial object and the lower part of the facial object gradually decreases. Accordingly, the action parameter associated with a ratio of the two distances gradually increases, for example, as illustrated in a section of the curve from point 1202 to point 1205.
Further, when the facial object moves to the middle position (e.g., point 1205) where the facial object stops moving down (or starts moving back up), the action parameter reaches the maximum value.
As the facial object moves from the middle position along the upward direction, the distance between the upper part of the facial object and the middle part of the facial object gradually decreases, and the distance between the middle part of the facial object and the lower part of the facial object gradually increases in the image frames. Accordingly, the action parameter associated with a ratio of the two distances gradually decreases, for example, as illustrated in a section of the curve from point 1205 to point 1208.
Finally, the facial object moves to the end position which is the same as or substantially same as the start position. As used herein, “substantially same as” refers to that an angle between a direction that the facial object is facing to at the end position and a direction that the facial object is facing to at the start position is less than an angle threshold that is recognizable for an ordinary person in the art.
Having thus described the basic concepts, it may be rather apparent to those skilled in the art after reading this detailed disclosure that the foregoing detailed disclosure is intended to be presented by way of example only and is not limiting. Various alterations, improvements, and modifications may occur and are intended for those skilled in the art, though not expressly stated herein. These alterations, improvements, and modifications are intended to be suggested by this disclosure, and are within the spirit and scope of the exemplary embodiments of this disclosure.
Moreover, certain terminology has been used to describe embodiments of the present disclosure. For example, the terms “one embodiment,” “an embodiment,” and/or “some embodiments” mean that a particular feature, structure or characteristic described in connection with the embodiment is included in at least one embodiment of the present disclosure. Therefore, it is emphasized and should be appreciated that two or more references to “an embodiment” or “one embodiment” or “an alternative embodiment” in various portions of this specification are not necessarily all referring to the same embodiment. Furthermore, the particular features, structures or characteristics may be combined as suitable in one or more embodiments of the present disclosure.
Further, as will be appreciated by one skilled in the art, aspects of the present disclosure may be illustrated and described herein in any of a number of patentable classes or contexts including any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof. Accordingly, aspects of the present disclosure may be implemented entirely in hardware, entirely in software (including firmware, resident software, micro-code, etc.), or in a combination of software and hardware implementations that may all generally be referred to herein as a “unit,” “module,” or “system.” Furthermore, aspects of the present disclosure may take the form of a computer program product embodied in one or more computer readable media having computer readable program code embodied thereon.
A computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated signal may take any of a variety of forms, including electro-magnetic, optical, or the like, or any suitable combination thereof. A computer readable signal medium may be any computer readable medium that is not a computer readable storage medium and that may communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable signal medium may be transmitted using any appropriate medium, including wireless, wireline, optical fiber cable, RF, or the like, or any suitable combination of the foregoing.
Computer program code for carrying out operations for aspects of the present disclosure may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, Scala, Smalltalk, Eiffel, JADE, Emerald, C++, C#, VB.NET, Python or the like, conventional procedural programming languages, such as the "C" programming language, Visual Basic, Fortran 2003, Perl, COBOL 2002, PHP, ABAP, dynamic programming languages such as Python, Ruby and Groovy, or other programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider) or in a cloud computing environment or offered as a service such as a Software as a Service (SaaS).
Furthermore, the recited order of processing elements or sequences, or the use of numbers, letters, or other designations therefor, is not intended to limit the claimed processes and methods to any order except as may be specified in the claims. Although the above disclosure discusses through various examples what is currently considered to be a variety of useful embodiments of the disclosure, it is to be understood that such detail is solely for that purpose, and that the appended claims are not limited to the disclosed embodiments, but, on the contrary, are intended to cover modifications and equivalent arrangements that are within the spirit and scope of the disclosed embodiments. For example, although the implementation of various components described above may be embodied in a hardware device, it may also be implemented as a software-only solution, e.g., an installation on an existing server or mobile device.
Similarly, it should be appreciated that in the foregoing description of embodiments of the present disclosure, various features are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the disclosure and aiding in the understanding of one or more of the various embodiments. This method of disclosure, however, is not to be interpreted as reflecting an intention that the claimed subject matter requires more features than are expressly recited in each claim. Rather, claimed subject matter may lie in less than all features of a single foregoing disclosed embodiment.

Claims (36)

  1. A system for automated identification of presence of a facial action from sequential images, comprising:
    at least one storage medium including a set of instructions; and
    at least one processor in communication with the at least one storage medium, wherein when executing the set of instructions, the at least one processor is directed to cause the system to:
    obtain a plurality of sequential candidate image frames containing a facial object, each of the plurality of candidate image frames including one or more first feature points associated with an upper part of the facial object, a second feature point associated with a middle part of the facial object, and one or more third feature points associated with a lower part of the facial object;
    for each of the plurality of sequential candidate image frames, determine one or more first distances, each based on one of the one or more first feature points and the second feature point, and one or more second distances, each based on one of the one or more third feature points and the second feature point;
    determine an action parameter based on the one or more first distances and the one or more second distances in each of the plurality of sequential candidate image frames; and
    identify the presence of a nod action in response to that the action parameters satisfy one or more preset conditions.
  2. The system of claim 1, wherein,
    the one or more first feature points are points associated with at least one of a left brow or a right brow of the facial object,
    the second feature point is a point on a tip of a nose of the facial object, and
    the one or more third feature points are points on a chin of the facial object.
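By way of non-limiting illustration, the per-frame distance determination recited in claims 1 and 2 may be sketched as follows. The claims only state that each distance is "based on" a pair of feature points; the use of a Euclidean distance in 2D pixel coordinates, the function names, and the sample coordinates are assumptions made for this sketch.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float]  # (x, y) in image pixel coordinates (assumed)


def euclidean(p: Point, q: Point) -> float:
    """Straight-line distance between two feature points (assumed metric)."""
    return math.hypot(p[0] - q[0], p[1] - q[1])


def frame_distances(
    brow_points: Sequence[Point],
    nose_tip: Point,
    chin_points: Sequence[Point],
) -> Tuple[List[float], List[float]]:
    """Return (first distances, second distances) for one candidate frame.

    First distances: each brow point to the nose tip.
    Second distances: each chin point to the nose tip.
    """
    first = [euclidean(p, nose_tip) for p in brow_points]
    second = [euclidean(p, nose_tip) for p in chin_points]
    return first, second


# Hypothetical coordinates for a single frame.
first, second = frame_distances(
    brow_points=[(40.0, 30.0), (60.0, 30.0)],
    nose_tip=(50.0, 60.0),
    chin_points=[(50.0, 100.0)],
)
print(first, second)
```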
  3. The system of claim 2, wherein the one or more first distances include one or more first left distances and one or more first right distances, wherein each first left distance is determined based on a corresponding first feature point associated with the left brow and the second feature point, and each first right distance is determined based on a corresponding first feature point associated with the right brow and the second feature point, and
    to determine the action parameter based on the one or more first distances and the one or more second distances in each of the plurality of sequential candidate image frames, the at least one processor is directed to cause the system further to:
    determine one or more first ratios of the one or more first left distances to the one or more second distances, each of the one or more first ratios corresponding to a first left distance and a second distance;
    determine a first average ratio of the one or more first ratios;
    determine one or more second ratios of the one or more first right distances to the one or more second distances, each of the one or more second ratios corresponding to a first right distance and a second distance;
    determine a second average ratio of the one or more second ratios; and
    determine the action parameter based on the first average ratio and the second average ratio.
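As a non-limiting sketch of the ratio averaging recited in claim 3, the code below pairs each brow-side distance with a chin-side distance, averages the left and right ratios separately, and combines the two averages. The index-wise pairing of distances and the use of the arithmetic mean of the two averages as the action parameter are assumptions of this sketch; the claim only requires that the action parameter be "based on" the two average ratios.

```python
from statistics import mean
from typing import Sequence


def average_ratio(first: Sequence[float], second: Sequence[float]) -> float:
    """Mean of first[i] / second[i]; index-wise pairing is an assumption."""
    if len(first) != len(second) or not first:
        raise ValueError("expected equally sized, non-empty distance lists")
    return mean([f / s for f, s in zip(first, second)])


def action_parameter(
    left_distances: Sequence[float],
    right_distances: Sequence[float],
    second_distances: Sequence[float],
) -> float:
    """Combine the left and right average ratios (assumed: simple mean)."""
    first_average = average_ratio(left_distances, second_distances)
    second_average = average_ratio(right_distances, second_distances)
    return (first_average + second_average) / 2.0


# Hypothetical distances for one frame.
param = action_parameter([30.0, 32.0], [30.0, 32.0], [40.0, 40.0])
print(param)
```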
  4. The system of any one of claims 1-3, wherein to determine the action parameter based on the one or more first distances and the one or more second distances in each of the plurality of sequential candidate image frames, the at least one processor is directed to cause the system further to:
    determine one or more distance ratios of the one or more first distances to the one or more second distances, each of the one or more distance ratios corresponding to a first distance and a second distance; and
    determine a compound distance ratio of the one or more distance ratios as the action parameter.
  5. The system of any one of claims 1-4, wherein to obtain the plurality of sequential candidate image frames containing the facial object, the at least one processor is directed to cause the system further to:
    obtain an initial image frame including a first initial feature point on a center of a left eye of the facial object, a second initial feature point on a center of a right eye of the facial object, a third initial feature point on a tip of a nose of the facial object, a fourth initial feature point on a left end of a lip of the facial object, and a fifth initial feature point on a right end of the lip of the facial object;
    determine whether the third initial feature point is within a quadrangle determined based on the first initial feature point, the second initial feature point, the fourth initial feature point, and the fifth initial feature point; and
    determine the initial image frame as a candidate image frame in response to that the third initial feature point is within the quadrangle determined based on the first initial feature point, the second initial feature point, the fourth initial feature point, and the fifth initial feature point.
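The candidate-frame screening of claim 5 may be illustrated, without limitation, by a point-in-quadrangle test. The sketch assumes that the quadrangle is convex, that its vertices are taken in the order left eye, right eye, right lip end, left lip end, and that a nose tip lying exactly on an edge counts as outside; none of these choices is stated in the claim.

```python
from typing import Tuple

Point = Tuple[float, float]


def cross(a: Point, b: Point, p: Point) -> float:
    """Z-component of (b - a) x (p - a); its sign tells which side of a->b p lies on."""
    return (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0])


def nose_inside_quadrangle(
    left_eye: Point,
    right_eye: Point,
    nose_tip: Point,
    lip_left: Point,
    lip_right: Point,
) -> bool:
    """True if the nose tip is strictly inside the eye/lip quadrangle."""
    # Walk the boundary in a consistent order (assumed convex quadrangle).
    vertices = [left_eye, right_eye, lip_right, lip_left]
    signs = [
        cross(vertices[i], vertices[(i + 1) % 4], nose_tip) for i in range(4)
    ]
    # Inside a convex polygon, the point is on the same side of every edge.
    return all(s > 0 for s in signs) or all(s < 0 for s in signs)


# Hypothetical feature points for a roughly frontal face.
is_candidate = nose_inside_quadrangle(
    left_eye=(30.0, 40.0),
    right_eye=(70.0, 40.0),
    nose_tip=(50.0, 60.0),
    lip_left=(35.0, 80.0),
    lip_right=(65.0, 80.0),
)
print(is_candidate)
```

A frame in which the nose tip falls outside the quadrangle (for example, a strongly turned head) would simply not be kept as a candidate image frame.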
  6. The system of any one of claims 1-5, wherein to identify the presence of the nod action in response to that the action parameters satisfy one or more preset conditions, the at least one processor is directed to cause the system further to:
    identify a plurality of sequential target image frames from the plurality of sequential candidate image frames, the plurality of sequential target image frames including a start image frame and an end image frame;
    identify a maximum action parameter from a plurality of action parameters corresponding to the plurality of sequential target image frames;
    identify a minimum action parameter associated with the plurality of action parameters corresponding to the plurality of sequential target image frames;
    determine an asymmetry parameter based on the maximum action parameter and the minimum action parameter;
    determine a first number count of target image frames from the start image frame to a target image frame corresponding to the maximum action parameter;
    determine a second number count of target image frames from the target image frame corresponding to the maximum action parameter to the end image frame;
    determine an estimated line by fitting the second feature points in the plurality of sequential target image frames; and
    identify the presence of the nod action in response to that the asymmetry parameter is larger than an asymmetry threshold, the first number count is larger than a first number count threshold, the second number count is larger than a second number count threshold, and an angle between the estimated line and a vertical line is less than an angle threshold.
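The four conditions of claim 6 may be sketched, in a non-limiting manner, as follows. Several details are assumptions of this sketch rather than statements of the claim: the asymmetry parameter is taken as the ratio of the maximum to the minimum action parameter, both frame counts include the frame of the maximum, and the estimated line is obtained by a total-least-squares fit of the nose-tip points. The default thresholds are picked from the ranges recited in claims 9 and 10.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float]


def angle_from_vertical_deg(points: Sequence[Point]) -> float:
    """Angle (degrees) between the best-fit line through points and the vertical."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points)
    syy = sum((y - my) ** 2 for _, y in points)
    sxy = sum((x - mx) * (y - my) for x, y in points)
    # Direction of the principal axis, measured from the x-axis.
    theta = 0.5 * math.atan2(2.0 * sxy, sxx - syy)
    return 90.0 - abs(math.degrees(theta))


def is_nod(
    params: Sequence[float],      # action parameters of the target frames
    nose_tips: Sequence[Point],   # second feature points of the same frames
    asymmetry_threshold: float = 2.0,
    first_count_threshold: int = 4,
    second_count_threshold: int = 4,
    angle_threshold_deg: float = 15.0,
) -> bool:
    peak = max(range(len(params)), key=params.__getitem__)
    p_max, p_min = params[peak], min(params)
    if p_min <= 0:
        return False  # ratio undefined; treated as "no nod" in this sketch
    asymmetry = p_max / p_min            # assumed definition
    first_count = peak + 1               # start frame .. peak frame, inclusive
    second_count = len(params) - peak    # peak frame .. end frame, inclusive
    angle = angle_from_vertical_deg(nose_tips)
    return (
        asymmetry > asymmetry_threshold
        and first_count > first_count_threshold
        and second_count > second_count_threshold
        and angle < angle_threshold_deg
    )


# Hypothetical target frames: parameter rises then falls, nose tip moves vertically.
params = [1.0, 1.2, 1.6, 2.0, 2.4, 2.0, 1.6, 1.2, 1.0]
tips = [(50.0, y) for y in (60.0, 62.0, 64.0, 66.0, 68.0, 66.0, 64.0, 62.0, 60.0)]
print(is_nod(params, tips))
```

The angle condition is what separates a nod from a head shake in this sketch: a nose tip that travels mostly horizontally yields a fitted line far from the vertical and fails the test.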
  7. The system of claim 6, wherein to identify the start image frame of the plurality of sequential target image frames, the at least one processor is directed to cause the system further to:
    for a candidate image frame, select a plurality of previous image frames before the candidate image frame and a plurality of subsequent image frames after the candidate image frame along a sequence of the plurality of sequential candidate image frames;
    determine a first average action parameter based on a plurality of first action parameters corresponding to the plurality of previous image frames;
    determine a second average action parameter based on a plurality of second action parameters corresponding to the plurality of subsequent image frames; and
    identify the candidate image frame as the start image frame in response to that the first average action parameter is less than the second average action parameter and each of the plurality of second action parameters is larger than or equal to an action parameter corresponding to the candidate image frame.
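The start-frame test of claim 7 may be sketched, without limitation, as a sliding-window comparison over the sequence of action parameters. The window size of three frames on each side and the function name are hypothetical; the claim does not fix how many previous and subsequent frames are selected.

```python
from statistics import mean
from typing import Optional, Sequence


def is_start_frame(params: Sequence[float], i: int, window: int = 3) -> bool:
    """Check the two start-frame conditions for candidate frame i."""
    if i - window < 0 or i + window >= len(params):
        return False  # not enough neighbouring frames on one side
    previous = params[i - window:i]
    subsequent = params[i + 1:i + 1 + window]
    first_average = mean(previous)
    second_average = mean(subsequent)
    return first_average < second_average and all(
        p >= params[i] for p in subsequent
    )


def find_start_frame(params: Sequence[float], window: int = 3) -> Optional[int]:
    """Index of the first frame satisfying the start conditions, if any."""
    for i in range(len(params)):
        if is_start_frame(params, i, window):
            return i
    return None


# Hypothetical action parameters: flat, then rising.
params = [1.0, 1.0, 1.0, 1.0, 1.3, 1.6, 2.0, 2.4, 2.0]
print(find_start_frame(params))
```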
  8. The system of claim 7, wherein to identify the end image frame of the plurality of sequential target image frames, the at least one processor is directed to cause the system further to:
    for a candidate image frame after the start image frame, select a plurality of previous image frames before the candidate image frame and a plurality of subsequent image frames after the candidate image frame along the sequence of the plurality of sequential candidate image frames;
    determine a third average action parameter based on a plurality of third action parameters corresponding to the plurality of previous image frames;
    determine a fourth average action parameter based on a plurality of fourth action parameters corresponding to the plurality of subsequent image frames; and
    identify the candidate image frame as the end image frame in response to that the third average action parameter is larger than the fourth average action parameter, each of the plurality of third action parameters is larger than or equal to an action parameter corresponding to the candidate image frame, an action parameter corresponding to a subsequent image frame adjacent to the candidate image frame is smaller than or equal to the action parameter corresponding to the candidate image frame, and a ratio associated with the first average action parameter and the fourth average action parameter is less than a ratio threshold.
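The end-frame test of claim 8 may likewise be sketched as a sliding-window check. The claim does not define the "ratio associated with" the first and fourth average action parameters; this sketch assumes the quotient of the fourth average by the first average, and the window size of three and the ratio threshold of 1.1 are hypothetical values chosen only for illustration.

```python
from statistics import mean
from typing import Sequence


def is_end_frame(
    params: Sequence[float],
    i: int,
    first_average: float,        # average before the start frame (claim 7)
    window: int = 3,             # hypothetical window size
    ratio_threshold: float = 1.1,  # hypothetical threshold
) -> bool:
    """Check the four end-frame conditions for candidate frame i."""
    if i - window < 0 or i + window >= len(params) or first_average <= 0:
        return False
    previous = params[i - window:i]
    subsequent = params[i + 1:i + 1 + window]
    third_average = mean(previous)
    fourth_average = mean(subsequent)
    return (
        third_average > fourth_average
        and all(p >= params[i] for p in previous)
        and params[i + 1] <= params[i]
        # Assumed ratio: the parameter has returned near its pre-nod level.
        and fourth_average / first_average < ratio_threshold
    )


# Hypothetical action parameters: falling back to a flat baseline of 1.0.
params = [2.4, 2.0, 1.6, 1.2, 1.0, 1.0, 1.0, 1.0]
print(is_end_frame(params, 4, first_average=1.0))
```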
  9. The system of claim 6, wherein the asymmetry threshold is 2-3.
  10. The system of claim 6, wherein the first number count threshold is 4-6, the second number count threshold is 4-6, or the angle threshold is 10°-15°.
  11. The system of any one of claims 1-10, wherein the at least one processor is directed to cause the system further to:
    provide an authentication to a terminal device associated with a user corresponding to the facial object in response to the identification of the presence of the nod action.
  12. The system of any of claims 1-11, further comprising a camera, which is configured to provide video data from which the plurality of sequential candidate image frames are obtained.
  13. A method implemented on a computing device having at least one processor, at least one storage medium, and a communication platform connected to a network, the method comprising:
    obtaining a plurality of sequential candidate image frames containing a facial object, each of the plurality of candidate image frames including one or more first feature points associated with an upper part of the facial object, a second feature point associated with a middle part of the facial object, and one or more third feature points associated with a lower part of the facial object;
    for each of the plurality of sequential candidate image frames, determining one or more first distances, each based on one of the one or more first feature points and the second feature point, and one or more second distances, each based on one of the one or more third feature points and the second feature point;
    determining an action parameter based on the one or more first distances and the one or more second distances in each of the plurality of sequential candidate image frames; and
    identifying the presence of a nod action in response to that the action parameters satisfy one or more preset conditions.
  14. The method of claim 13, wherein,
    the one or more first feature points are points associated with at least one of a left brow or a right brow of the facial object,
    the second feature point is a point on a tip of a nose of the facial object, and
    the one or more third feature points are points on a chin of the facial object.
  15. The method of claim 14, wherein the one or more first distances include one or more first left distances and one or more first right distances, wherein each first left distance is determined based on a corresponding first feature point associated with the left brow and the second feature point, and each first right distance is determined based on a corresponding first feature point associated with the right brow and the second feature point, and
    the determining the action parameter based on the one or more first distances and the one or more second distances in each of the plurality of sequential candidate image frames includes:
    determining one or more first ratios of the one or more first left distances to the one or more second distances, each of the one or more first ratios corresponding to a first left distance and a second distance;
    determining a first average ratio of the one or more first ratios;
    determining one or more second ratios of the one or more first right distances to the one or more second distances, each of the one or more second ratios corresponding to a first right distance and a second distance;
    determining a second average ratio of the one or more second ratios; and
    determining the action parameter based on the first average ratio and the second average ratio.
  16. The method of any one of claims 13-15, wherein the determining the action parameter based on the one or more first distances and the one or more second distances in each of the plurality of sequential candidate image frames includes:
    determining one or more distance ratios of the one or more first distances to the one or more second distances, each of the one or more distance ratios corresponding to a first distance and a second distance; and
    determining a compound distance ratio of the one or more distance ratios as the action parameter.
  17. The method of any one of claims 13-16, wherein the obtaining the plurality of sequential candidate image frames containing the facial object includes:
    obtaining an initial image frame including a first initial feature point on a center of a left eye of the facial object, a second initial feature point on a center of a right eye of the facial object, a third initial feature point on a tip of a nose of the facial object, a fourth initial feature point on a left end of a lip of the facial object, and a fifth initial feature point on a right end of the lip of the facial object;
    determining whether the third initial feature point is within a quadrangle determined based on the first initial feature point, the second initial feature point, the fourth initial feature point, and the fifth initial feature point; and
    determining the initial image frame as a candidate image frame in response to that the third initial feature point is within the quadrangle determined based on the first initial feature point, the second initial feature point, the fourth initial feature point, and the fifth initial feature point.
  18. The method of any one of claims 13-17, wherein the identifying the presence of the nod action in response to that the action parameters satisfy one or more preset conditions includes:
    identifying a plurality of sequential target image frames from the plurality of sequential candidate image frames, the plurality of sequential target image frames including a start image frame and an end image frame;
    identifying a maximum action parameter from a plurality of action parameters corresponding to the plurality of sequential target image frames;
    identifying a minimum action parameter associated with the plurality of action parameters corresponding to the plurality of sequential target image frames;
    determining an asymmetry parameter based on the maximum action parameter and the minimum action parameter;
    determining a first number count of target image frames from the start image frame to a target image frame corresponding to the maximum action parameter;
    determining a second number count of target image frames from the target image frame corresponding to the maximum action parameter to the end image frame;
    determining an estimated line by fitting the second feature points in the plurality of sequential target image frames; and
    identifying the presence of the nod action in response to that the asymmetry parameter is larger than an asymmetry threshold, the first number count is larger than a first number count threshold, the second number count is larger than a second number count threshold, and an angle between the estimated line and a vertical line is less than an angle threshold.
  19. The method of claim 18, wherein the identifying the start image frame of the plurality of sequential target image frames includes:
    for a candidate image frame, selecting a plurality of previous image frames before the candidate image frame and a plurality of subsequent image frames after the candidate image frame along a sequence of the plurality of sequential candidate image frames;
    determining a first average action parameter based on a plurality of first action parameters corresponding to the plurality of previous image frames;
    determining a second average action parameter based on a plurality of second action parameters corresponding to the plurality of subsequent image frames; and
    identifying the candidate image frame as the start image frame in response to that the first average action parameter is less than the second average action parameter and each of the plurality of second action parameters is larger than or equal to an action parameter corresponding to the candidate image frame.
  20. The method of claim 19, wherein the identifying the end image frame of the plurality of sequential target image frames includes:
    for a candidate image frame after the start image frame, selecting a plurality of previous image frames before the candidate image frame and a plurality of subsequent image frames after the candidate image frame along the sequence of the plurality of sequential candidate image frames;
    determining a third average action parameter based on a plurality of third action parameters corresponding to the plurality of previous image frames;
    determining a fourth average action parameter based on a plurality of fourth action parameters corresponding to the plurality of subsequent image frames; and
    identifying the candidate image frame as the end image frame in response to that the third average action parameter is larger than the fourth average action parameter, each of the plurality of third action parameters is larger than or equal to an action parameter corresponding to the candidate image frame, an action parameter corresponding to a subsequent image frame adjacent to the candidate image frame is smaller than or equal to the action parameter corresponding to the candidate image frame, and a ratio associated with the first average action parameter and the fourth average action parameter is less than a ratio threshold.
  21. The method of claim 18, wherein the asymmetry threshold is 2-3.
  22. The method of claim 18, wherein the first number count threshold is 4-6, the second number count threshold is 4-6, or the angle threshold is 10°-15°.
  23. The method of any one of claims 13-22, wherein the method further includes:
    providing an authentication to a terminal device associated with a user corresponding to the facial object in response to the identification of the presence of the nod action.
  24. The method of any of claims 13-23, wherein the method further includes:
    obtaining the plurality of sequential candidate image frames from video data provided by a camera.
  25. A non-transitory computer readable medium, comprising executable instructions that, when executed by at least one processor, direct the at least one processor to perform a method, the method comprising:
    obtaining a plurality of sequential candidate image frames containing a facial object, each of the plurality of candidate image frames including one or more first feature points associated with an upper part of the facial object, a second feature point associated with a middle part of the facial object, and one or more third feature points associated with a lower part of the facial object;
    for each of the plurality of sequential candidate image frames, determining one or more first distances, each based on one of the one or more first feature points and the second feature point, and one or more second distances, each based on one of the one or more third feature points and the second feature point;
    determining an action parameter based on the one or more first distances and the one or more second distances in each of the plurality of sequential candidate image frames; and
    identifying the presence of a nod action in response to that the action parameters satisfy one or more preset conditions.
  26. The non-transitory computer readable medium of claim 25, wherein,
    the one or more first feature points are points associated with at least one of a left brow or a right brow of the facial object,
    the second feature point is a point on a tip of a nose of the facial object, and
    the one or more third feature points are points on a chin of the facial object.
  27. The non-transitory computer readable medium of claim 26, wherein the one or more first distances include one or more first left distances and one or more first right distances, wherein each first left distance is determined based on a corresponding first feature point associated with the left brow and the second feature point, and each first right distance is determined based on a corresponding first feature point associated with the right brow and the second feature point, and
    the determining the action parameter based on the one or more first distances and the one or more second distances in each of the plurality of sequential candidate image frames includes:
    determining one or more first ratios of the one or more first left distances to the one or more second distances, each of the one or more first ratios corresponding to a first left distance and a second distance;
    determining a first average ratio of the one or more first ratios;
    determining one or more second ratios of the one or more first right distances to the one or more second distances, each of the one or more second ratios corresponding to a first right distance and a second distance;
    determining a second average ratio of the one or more second ratios; and
    determining the action parameter based on the first average ratio and the second average ratio.
  28. The non-transitory computer readable medium of any one of claims 25-27, wherein the determining the action parameter based on the one or more first distances and the one or more second distances in each of the plurality of sequential candidate image frames includes:
    determining one or more distance ratios of the one or more first distances to the one or more second distances, each of the one or more distance ratios corresponding to a first distance and a second distance; and
    determining a compound distance ratio of the one or more distance ratios as the action parameter.
  29. The non-transitory computer readable medium of any one of claims 25-28, wherein the obtaining the plurality of sequential candidate image frames containing the facial object includes:
    obtaining an initial image frame including a first initial feature point on a center of a left eye of the facial object, a second initial feature point on a center of a right eye of the facial object, a third initial feature point on a tip of a nose of the facial object, a fourth initial feature point on a left end of a lip of the facial object, and a fifth initial feature point on a right end of the lip of the facial object;
    determining whether the third initial feature point is within a quadrangle determined based on the first initial feature point, the second initial feature point, the fourth initial feature point, and the fifth initial feature point; and
    determining the initial image frame as a candidate image frame in response to that the third initial feature point is within the quadrangle determined based on the first initial feature point, the second initial feature point, the fourth initial feature point, and the fifth initial feature point.
  30. The non-transitory computer readable medium of any one of claims 25-29, wherein the identifying the presence of the nod action in response to that the action parameters satisfy one or more preset conditions includes:
    identifying a plurality of sequential target image frames from the plurality of sequential candidate image frames, the plurality of sequential target image frames including a start image frame and an end image frame;
    identifying a maximum action parameter from a plurality of action parameters corresponding to the plurality of sequential target image frames;
    identifying a minimum action parameter associated with the plurality of action parameters corresponding to the plurality of sequential target image frames;
    determining an asymmetry parameter based on the maximum action parameter and the minimum action parameter;
    determining a first number count of target image frames from the start image frame to a target image frame corresponding to the maximum action parameter;
    determining a second number count of target image frames from the target image frame corresponding to the maximum action parameter to the end image frame;
    determining an estimated line by fitting the second feature points in the plurality of sequential target image frames; and
    identifying the presence of the nod action in response to that the asymmetry parameter is larger than an asymmetry threshold, the first number count is larger than a first number count threshold, the second number count is larger than a second number count threshold, and an angle between the estimated line and a vertical line is less than an angle threshold.
  31. The non-transitory computer readable medium of claim 30, wherein the identifying the start image frame of the plurality of sequential target image frames includes:
    for a candidate image frame, selecting a plurality of previous image frames before the candidate image frame and a plurality of subsequent image frames after the candidate image frame along a sequence of the plurality of sequential candidate image frames;
    determining a first average action parameter based on a plurality of first action parameters corresponding to the plurality of previous image frames;
    determining a second average action parameter based on a plurality of second action parameters corresponding to the plurality of subsequent image frames; and
    identifying the candidate image frame as the start image frame in response to that the first average action parameter is less than the second average action parameter and each of the plurality of second action parameters is larger than or equal to an action parameter corresponding to the candidate image frame.
  32. The non-transitory computer readable medium of claim 31, wherein the identifying the end image frame of the plurality of sequential target image frames includes:
    for a candidate image frame after the start image frame, selecting a plurality of previous image frames before the candidate image frame and a plurality of subsequent image frames after the candidate image frame along the sequence of the plurality of sequential candidate image frames;
    determining a third average action parameter based on a plurality of third action parameters corresponding to the plurality of previous image frames;
    determining a fourth average action parameter based on a plurality of fourth action parameters corresponding to the plurality of subsequent image frames; and
    identifying the candidate image frame as the end image frame in response to that the third average action parameter is larger than the fourth average action parameter, each of the plurality of third action parameters is larger than or equal to an action parameter corresponding to the candidate image frame, an action parameter corresponding to a subsequent image frame adjacent to the candidate image frame is smaller than or equal to the action parameter corresponding to the candidate image frame, and a ratio associated with the first average action parameter and the fourth average action parameter is less than a ratio threshold.
  33. The non-transitory computer readable medium of claim 30, wherein the asymmetry threshold is 2-3.
  34. The non-transitory computer readable medium of claim 30, wherein the first number count threshold is 4-6, the second number count threshold is 4-6, or the angle threshold is 10°-15°.
  35. The non-transitory computer readable medium of any one of claims 25-34, wherein the method further includes:
    providing an authentication to a terminal device associated with a user corresponding to the facial object in response to the identification of the presence of the nod action.
  36. The non-transitory computer readable medium of any of claims 25-35, wherein the method further includes:
    obtaining the plurality of sequential candidate image frames from video data provided by a camera.
PCT/CN2018/084426 2018-04-25 2018-04-25 Systems and methods for nod action recognition based on facial feature points WO2019205016A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
PCT/CN2018/084426 WO2019205016A1 (en) 2018-04-25 2018-04-25 Systems and methods for nod action recognition based on facial feature points
CN201880038528.4A CN110753931A (en) 2018-04-25 2018-04-25 System and method for nodding action recognition based on facial feature points

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2018/084426 WO2019205016A1 (en) 2018-04-25 2018-04-25 Systems and methods for nod action recognition based on facial feature points

Publications (1)

Publication Number Publication Date
WO2019205016A1 (en) 2019-10-31

Family

ID=68293406

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/084426 WO2019205016A1 (en) 2018-04-25 2018-04-25 Systems and methods for nod action recognition based on facial feature points

Country Status (2)

Country Link
CN (1) CN110753931A (en)
WO (1) WO2019205016A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111444763A (en) * 2020-02-24 2020-07-24 珠海格力电器股份有限公司 Security control method and device, storage medium and air conditioner

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103914142A (en) * 2013-01-04 2014-07-09 三星电子株式会社 Apparatus and method for providing control service using head tracking technology in electronic device
CN104123545A (en) * 2014-07-24 2014-10-29 江苏大学 Real-time expression feature extraction and identification method
CN105975935A (en) * 2016-05-04 2016-09-28 腾讯科技(深圳)有限公司 Face image processing method and apparatus

Also Published As

Publication number Publication date
CN110753931A (en) 2020-02-04

Similar Documents

Publication Publication Date Title
US11647163B2 (en) Methods and systems for object monitoring
US11263469B2 (en) Electronic device for processing image and method for controlling the same
EP3872689A1 (en) Liveness detection method and device, electronic apparatus, storage medium and related system using the liveness detection method
WO2020147423A1 (en) Systems and methods for noise reduction
US10254831B2 (en) System and method for detecting a gaze of a viewer
WO2017088804A1 (en) Method and apparatus for detecting wearing of spectacles in facial image
JP7342366B2 (en) Avatar generation system, avatar generation method, and program
US10936867B2 (en) Systems and methods for blink action recognition based on facial feature points
WO2018121523A1 (en) Methods, systems, and media for evaluating images
US10929984B2 (en) Systems and methods for shaking action recognition based on facial feature points
US20200118349A1 (en) Information processing apparatus, information processing method, and program
WO2020133330A1 (en) Systems and methods for video surveillance
JP2018081402A (en) Image processing system, image processing method, and program
CN110189252B (en) Method and device for generating average face image
WO2019205016A1 (en) Systems and methods for nod action recognition based on facial feature points
KR101820503B1 (en) Service system based on face recognition inference, and face recognition inference method and storage medium thereof
KR20160128275A (en) Service system based on face recognition inference, and face recognition inference method and storage medium thereof
CN111033508B (en) System and method for recognizing body movement
CN111582121A (en) Method for capturing facial expression features, terminal device and computer-readable storage medium
CN112114659A (en) Method and system for determining a fine point of regard for a user
JP6287527B2 (en) Information processing apparatus, method, and program
WO2024016786A1 (en) Palm image recognition method and apparatus, and device, storage medium and program product
CN112395912B (en) Face segmentation method, electronic device and computer readable storage medium
CN116451195A (en) Living body identification method and system
CN117765621A (en) Living body detection method, living body detection device and storage medium

Legal Events

Date Code Title Description
121 EP: The EPO has been informed by WIPO that EP was designated in this application. Ref document number: 18916604; Country of ref document: EP; Kind code of ref document: A1
NENP Non-entry into the national phase. Ref country code: DE
122 EP: PCT application non-entry in European phase. Ref document number: 18916604; Country of ref document: EP; Kind code of ref document: A1