WO2020063753A1 - Action recognition and driving action analysis method and apparatus, and electronic device - Google Patents

Action recognition and driving action analysis method and apparatus, and electronic device Download PDF

Info

Publication number
WO2020063753A1
WO2020063753A1 PCT/CN2019/108167 CN2019108167W WO2020063753A1 WO 2020063753 A1 WO2020063753 A1 WO 2020063753A1 CN 2019108167 W CN2019108167 W CN 2019108167W WO 2020063753 A1 WO2020063753 A1 WO 2020063753A1
Authority
WO
WIPO (PCT)
Prior art keywords
action
frame
candidate
candidate frame
motion
Prior art date
Application number
PCT/CN2019/108167
Other languages
English (en)
French (fr)
Inventor
陈彦杰
王飞
钱晨
Original Assignee
北京市商汤科技开发有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 北京市商汤科技开发有限公司 filed Critical 北京市商汤科技开发有限公司
Priority to JP2020551540A priority Critical patent/JP7061685B2/ja
Priority to SG11202009320PA priority patent/SG11202009320PA/en
Priority to KR1020207027826A priority patent/KR102470680B1/ko
Publication of WO2020063753A1 publication Critical patent/WO2020063753A1/zh
Priority to US17/026,933 priority patent/US20210012127A1/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/20Image preprocessing
    • G06V10/25Determination of region of interest [ROI] or a volume of interest [VOI]
    • BPERFORMING OPERATIONS; TRANSPORTING
    • B60VEHICLES IN GENERAL
    • B60RVEHICLES, VEHICLE FITTINGS, OR VEHICLE PARTS, NOT OTHERWISE PROVIDED FOR
    • B60R21/00Arrangements or fittings on vehicles for protecting or preventing injuries to occupants or pedestrians in case of accidents or other traffic risks
    • B60R21/01Electrical circuits for triggering passive safety arrangements, e.g. airbags, safety belt tighteners, in case of vehicle accidents or impending vehicle accidents
    • B60R21/015Electrical circuits for triggering passive safety arrangements, e.g. airbags, safety belt tighteners, in case of vehicle accidents or impending vehicle accidents including means for detecting the presence or position of passengers, passenger seats or child seats, and the related safety parameters therefor, e.g. speed or timing of airbag inflation in relation to occupant position or seat belt use
    • B60R21/01512Passenger detection systems
    • B60R21/01542Passenger detection systems detecting passenger motion
    • BPERFORMING OPERATIONS; TRANSPORTING
    • B60VEHICLES IN GENERAL
    • B60WCONJOINT CONTROL OF VEHICLE SUB-UNITS OF DIFFERENT TYPE OR DIFFERENT FUNCTION; CONTROL SYSTEMS SPECIALLY ADAPTED FOR HYBRID VEHICLES; ROAD VEHICLE DRIVE CONTROL SYSTEMS FOR PURPOSES NOT RELATED TO THE CONTROL OF A PARTICULAR SUB-UNIT
    • B60W40/00Estimation or calculation of non-directly measurable driving parameters for road vehicle drive control systems not related to the control of a particular sub unit, e.g. by using mathematical models
    • B60W40/08Estimation or calculation of non-directly measurable driving parameters for road vehicle drive control systems not related to the control of a particular sub unit, e.g. by using mathematical models related to drivers or passengers
    • BPERFORMING OPERATIONS; TRANSPORTING
    • B60VEHICLES IN GENERAL
    • B60WCONJOINT CONTROL OF VEHICLE SUB-UNITS OF DIFFERENT TYPE OR DIFFERENT FUNCTION; CONTROL SYSTEMS SPECIALLY ADAPTED FOR HYBRID VEHICLES; ROAD VEHICLE DRIVE CONTROL SYSTEMS FOR PURPOSES NOT RELATED TO THE CONTROL OF A PARTICULAR SUB-UNIT
    • B60W50/00Details of control systems for road vehicle drive control not related to the control of a particular sub-unit, e.g. process diagnostic or vehicle driver interfaces
    • B60W50/08Interaction between the driver and the control system
    • B60W50/14Means for informing the driver, warning the driver or prompting a driver intervention
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/44Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G06V10/443Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components by matching or filtering
    • G06V10/449Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters
    • G06V10/451Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters with interaction between the filter responses, e.g. cortical complex cells
    • G06V10/454Integrating the filters into a hierarchical structure, e.g. convolutional neural networks [CNN]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/776Validation; Performance evaluation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/41Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/50Context or environment of the image
    • G06V20/59Context or environment of the image inside of a vehicle, e.g. relating to seat occupancy, driver state or inner lighting conditions
    • G06V20/597Recognising the driver's state or behaviour, e.g. attention or drowsiness
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168Feature extraction; Face representation
    • G06V40/171Local features and components; Facial parts ; Occluding parts, e.g. glasses; Geometrical relationships
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20Movements or behaviour, e.g. gesture recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20Movements or behaviour, e.g. gesture recognition
    • G06V40/28Recognition of hand or arm movements, e.g. recognition of deaf sign language
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N23/00Cameras or camera modules comprising electronic image sensors; Control thereof
    • H04N23/50Constructional details
    • H04N23/54Mounting of pick-up tubes, electronic image sensors, deviation or focusing coils
    • BPERFORMING OPERATIONS; TRANSPORTING
    • B60VEHICLES IN GENERAL
    • B60WCONJOINT CONTROL OF VEHICLE SUB-UNITS OF DIFFERENT TYPE OR DIFFERENT FUNCTION; CONTROL SYSTEMS SPECIALLY ADAPTED FOR HYBRID VEHICLES; ROAD VEHICLE DRIVE CONTROL SYSTEMS FOR PURPOSES NOT RELATED TO THE CONTROL OF A PARTICULAR SUB-UNIT
    • B60W2420/00Indexing codes relating to the type of sensors based on the principle of their operation
    • B60W2420/40Photo, light or radio wave sensitive means, e.g. infrared sensors
    • B60W2420/403Image sensing, e.g. optical camera
    • BPERFORMING OPERATIONS; TRANSPORTING
    • B60VEHICLES IN GENERAL
    • B60WCONJOINT CONTROL OF VEHICLE SUB-UNITS OF DIFFERENT TYPE OR DIFFERENT FUNCTION; CONTROL SYSTEMS SPECIALLY ADAPTED FOR HYBRID VEHICLES; ROAD VEHICLE DRIVE CONTROL SYSTEMS FOR PURPOSES NOT RELATED TO THE CONTROL OF A PARTICULAR SUB-UNIT
    • B60W2540/00Input parameters relating to occupants
    • B60W2540/229Attention level, e.g. attentive to driving, reading or sleeping

Definitions

  • The present application relates to the field of image processing technology, and in particular to an action recognition method and device, a driving action analysis method and device, and an electronic device.
  • Action recognition has become a very popular research direction in recent years and can be seen in many fields and products. The technology is also a future trend in human-computer interaction and has broad application prospects, especially in driver monitoring.
  • The embodiments of the present application provide a technical solution for action recognition and a technical solution for driving action analysis.
  • In a first aspect, an embodiment of the present application provides an action recognition method, which includes: extracting features of an image including a human face; determining, based on the features, a plurality of candidate frames that may include a predetermined action; determining an action target frame based on the plurality of candidate frames, wherein the action target frame includes a local area of a human face and an action interaction object; and classifying a predetermined action based on the action target frame to obtain an action recognition result.
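  • As a minimal illustration of this four-step pipeline, the following sketch uses hypothetical module names (`extract_features`, `propose_candidates`, `refine_to_target_box`, `classify_action`); it is not the claimed implementation, only the order of operations described above.

```python
import numpy as np

def recognize_action(image: np.ndarray, net) -> dict:
    """Illustrative only: run the four steps of the claimed method on one image
    containing a human face. `net` is a hypothetical object bundling the branches."""
    features = net.extract_features(image)               # extract features of the image
    candidates = net.propose_candidates(features)        # candidate frames that MAY contain a predetermined action
    target_box = net.refine_to_target_box(candidates)    # action target frame: face local area + action interaction object
    action = net.classify_action(features, target_box)   # classify the predetermined action
    return {"action_target_frame": target_box, "action_recognition_result": action}
```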
  • In a second aspect, an embodiment of the present application provides a driving action analysis method.
  • The method includes: collecting, through a vehicle-mounted camera, a video stream including images of the driver's face; obtaining, by using any implementation of the action recognition method described in the embodiments of the present application, an action recognition result of at least one frame of image in the video stream; and generating dangerous driving prompt information in response to the action recognition result meeting a predetermined condition.
  • An embodiment of the present application further provides an action recognition device.
  • The device includes: a first extraction unit configured to extract features of an image including a human face; a second extraction unit configured to determine, based on the features, a plurality of candidate frames that may include a predetermined action; a determination unit configured to determine an action target frame based on the plurality of candidate frames, wherein the action target frame includes a local area of a human face and an action interaction object; and a classification unit configured to classify a predetermined action based on the action target frame to obtain an action recognition result.
  • An embodiment of the present application further provides a driving action analysis device.
  • The device includes: a vehicle-mounted camera configured to collect a video stream including images of the driver's face; an obtaining unit configured to obtain, through any implementation of the action recognition device, an action recognition result of at least one frame of image in the video stream; and a generating unit configured to generate dangerous driving prompt information in response to the action recognition result meeting a predetermined condition.
  • An embodiment of the present application provides an electronic device including a memory and a processor.
  • The memory stores computer-executable instructions.
  • When the processor runs the computer-executable instructions stored in the memory, it implements the method described in the first or second aspect of the embodiments of the present application.
  • An embodiment of the present application provides a computer-readable storage medium storing instructions that, when run on a computer, cause the computer to execute the method described in the first or second aspect.
  • An embodiment of the present application provides a computer program including computer instructions.
  • When the computer instructions are run in a processor of a device, the method described in the first or second aspect of the embodiments of the present application is executed.
  • The embodiments of the present application extract features from an image containing a human face, determine multiple candidate frames that may include a predetermined action based on the extracted features, determine an action target frame based on the plurality of candidate frames, and classify the predetermined action based on the action target frame to obtain an action recognition result.
  • Because the action target frame described in the embodiments of the present application includes a local area of a human face and an action interaction object, the classification of a predetermined action based on the action target frame treats the face local area and the action interaction object as a whole, rather than separating the human body part from the action interaction object, and the classification is performed on the features of that whole. This makes it possible to recognize fine actions, in particular fine actions in or near the face area, and improves the accuracy and precision of action recognition.
  • FIG. 1 is a schematic flowchart of a motion recognition method according to an embodiment of the present application
  • FIG. 2 is a schematic diagram of a target action frame according to an embodiment of the present application.
  • FIG. 3 is a schematic flowchart of another motion recognition method according to an embodiment of the present application.
  • FIG. 4 is a schematic diagram of a negative sample image including an action similar to a predetermined action according to an embodiment of the present application
  • FIG. 5 is a schematic flowchart of a driving action analysis method according to an embodiment of the present application.
  • FIG. 6 is a schematic flowchart of a neural network training method according to an embodiment of the present application.
  • FIG. 7 is a schematic diagram of an action supervision frame for drinking water according to an embodiment of the present application.
  • FIG. 8 is a schematic diagram of a call supervision frame provided by an embodiment of the present application.
  • FIG. 9 is a schematic structural diagram of a motion recognition device according to an embodiment of the present application.
  • FIG. 10 is a schematic structural diagram of a training component of a neural network according to an embodiment of the present application.
  • FIG. 11 is a schematic structural diagram of a driving motion analysis device according to an embodiment of the present application.
  • FIG. 12 is a schematic diagram of a hardware structure of an electronic device according to an embodiment of the present application.
  • FIG. 1 is a schematic flowchart of a motion recognition method according to an embodiment of the present application. As shown in FIG. 1, the method includes:
  • the embodiments of the present application mainly identify the actions of people in the vehicle. Taking the driver as an example, the embodiment of the present application can recognize some driving actions made by the vehicle driver when driving the vehicle, and can give a reminder to the driver according to the recognition result.
  • The inventors found that some fine actions of persons in the vehicle are related to the human face, for example the driver drinking water or making a phone call, and that such actions are difficult or impossible to recognize through human key point detection or human pose estimation.
  • feature extraction is performed on an image to be processed, and actions in the image to be processed are identified according to the extracted features.
  • The above-mentioned actions may be actions of the hand region and/or actions of the local face region, actions of an action interaction object, and the like. Therefore, a vehicle-mounted camera is used to capture an image of the person in the vehicle as the image to be processed, and a convolution operation is then performed on the image to be processed to extract action features.
  • the method further includes: using a vehicle-mounted camera to capture an image of a person located in the vehicle, including a human face.
  • the person in the car includes at least one of the following: a driver in a driving zone of the car, a person in a passenger driving zone of the car, and a person in a rear seat of the car.
  • the vehicle-mounted camera may be a red-green-blue (RGB) camera, an infrared camera, or a near-infrared camera.
  • The embodiments of the present application mainly identify predetermined actions of persons in the vehicle. Taking the driver as an example, the predetermined action may be, for example, a predetermined action corresponding to dangerous driving by the driver, or a predetermined action that is dangerous for the driver to perform.
  • the features of the predetermined action are first defined, and then a neural network is used to determine whether there is a predetermined action in the image according to the defined features and the extracted features in the image. When there is a predetermined motion in the image, it is determined that a plurality of candidate frames including the predetermined motion are included in the image.
  • the neural networks in this embodiment are all trained, that is, features of a predetermined action in an image can be extracted through the neural network.
  • the neural network may be provided with multiple convolutional layers, and the multi-layered convolutional layers may be used to extract richer information in the image, thereby improving the accuracy of determining a predetermined action.
  • In the feature extraction process, a feature region including a hand region and a local face region is obtained through the neural network, a candidate region is determined based on the feature region, and the candidate region is identified by a candidate frame; the candidate frame may be represented, for example, by a rectangular frame.
  • a feature region including a hand region, a local face region, and a corresponding region of a motion interactor is identified through another candidate frame.
  • 103 Determine a motion target frame based on the plurality of candidate frames, wherein the motion target frame includes a local area of a human face and a motion interactor.
  • the actions identified in the embodiments of the present application are all fine actions related to the face.
  • The recognition of such face-related fine actions is difficult or even impossible through human key point detection, and the region corresponding to such a fine action includes at least the local face area and the region of the action interaction object, for example the local face area plus the action interaction object region, or the local face area, the action interaction object region, and the hand region. Therefore, such fine actions can be recognized by identifying the features in the action target frame obtained from the multiple candidate frames.
  • the local area of the human face includes at least one of the following: a mouth area, an ear area, and an eye area.
  • the action interaction objects include at least one of the following: containers, cigarettes, mobile phones, food, tools, beverage bottles, glasses, and masks.
  • the action target frame further includes: a hand region.
  • the target action box shown in FIG. 2 includes a local human face, a mobile phone (that is, an action interaction object), and a hand.
  • the target action frame may also include: mouth and smoke (that is, an action interaction object).
  • A candidate frame may include features other than the features corresponding to the predetermined action, or may not include all of the features corresponding to the predetermined action (that is, all features of any one predetermined action), which would affect the final action recognition result. Therefore, to ensure the accuracy of the final recognition result, the positions of the candidate frames need to be adjusted; that is, an action target frame is determined based on the plurality of candidate frames, and the position and size of the action target frame may differ from the positions and sizes of at least some of the plurality of candidate frames.
  • Specifically, the position offset and zoom factor of each candidate frame can be determined according to the position and size of the features corresponding to the predetermined action, and the position and size of the candidate frame are then adjusted according to the position offset and the zoom factor, so that the adjusted action target frame includes only the features corresponding to the predetermined action and includes all of those features. On this basis, by adjusting the position and size of each candidate frame, the adjusted candidate frame is determined as the action target frame. It can be understood that the adjusted multiple candidate frames can overlap into one candidate frame, and the overlapping candidate frame is determined as the action target frame.
  • the predetermined action includes at least one of the following: making a call, smoking, drinking water / beverage, eating, using tools, wearing glasses, and applying makeup.
  • the predetermined actions may be classified based on characteristics corresponding to the predetermined actions contained in the action target frame.
  • a neural network for action classification may be used to perform classification processing on the features corresponding to the predetermined actions contained in the action target frame to obtain classification and recognition results of the predetermined actions corresponding to the features.
  • a plurality of candidate frames that may include a predetermined motion are determined based on the extracted features, and an action target frame is determined based on the plurality of candidate frames.
  • The predetermined action can thus be classified quickly based on the action target frame.
  • Because the action target frame described in the embodiments of the present application includes a local area of a human face and an action interaction object, the classification of a predetermined action based on the action target frame treats the face local area and the action interaction object as a whole, rather than separating the human body part from the action interaction object, and the classification is performed on the features of that whole. This makes it possible to recognize fine actions, in particular fine actions in or near the face area, and improves the accuracy and precision of recognition.
  • FIG. 3 is a schematic flowchart of another motion recognition method according to an embodiment of the present application. As shown in FIG. 3, the method includes:
  • Acquiring the image to be processed may include: taking a picture of a person in the vehicle through a vehicle-mounted camera to acquire the image to be processed, or video-capturing a person in the vehicle through the vehicle-mounted camera and using a frame image of the captured video as the image to be processed.
  • the person in the car includes at least one of the following: a driver in a driving zone of the car, a person in a passenger driving zone of the car, and a person in a rear seat of the car.
  • the vehicle-mounted camera may be an RGB camera, an infrared camera, or a near-infrared camera.
  • RGB cameras provide the three basic color components over three different cables and usually use three independent charge coupled device (CCD) sensors to obtain the three color signals. RGB cameras are often used for very accurate color image acquisition.
  • The light in a real environment is complicated, and the light inside a vehicle even more so; the light intensity directly affects shooting quality. In particular, when the light intensity in the car is low, an ordinary camera cannot capture clear photos or videos, so the image or video loses useful information, which affects subsequent processing.
  • The infrared camera can emit infrared light toward the object being photographed and then form an image based on the reflected infrared light, which solves the problem of low-quality or failed shooting by ordinary cameras in low-light or dark conditions.
  • Therefore, both an ordinary camera and an infrared camera may be provided: when the light intensity is higher than a preset value, the image to be processed is acquired by the ordinary camera; when the light intensity is lower than the preset value, the image to be processed is acquired by the infrared camera.
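  • A rough sketch of this camera-switching logic (the threshold value and the camera interface are placeholders, not specified by the application):

```python
def capture_image(light_intensity: float, ordinary_camera, infrared_camera,
                  preset_value: float = 50.0):
    """Acquire the image to be processed from the camera suited to the light level.
    `preset_value` and the `capture()` interface are illustrative assumptions."""
    if light_intensity >= preset_value:
        return ordinary_camera.capture()    # ordinary (RGB) camera in sufficient light
    return infrared_camera.capture()        # infrared / near-infrared camera in low light
```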
  • a feature image is obtained by performing a convolution operation on an image to be processed through a feature extraction branch of a neural network.
  • The feature extraction branch of the neural network performs a convolution operation on the image to be processed by "sliding" a convolution kernel over the image to be processed.
  • For example, when the convolution kernel is positioned over a certain pixel in the image, the value of each pixel covered by the kernel is multiplied by the corresponding kernel weight, and all products are summed to obtain the value of the pixel corresponding to the convolution kernel; the convolution kernel is then "slid" to the next pixel, and so on, until the convolution of all pixels in the image to be processed is completed and a feature map is obtained.
  • The feature extraction branch of the neural network in this embodiment may include multiple convolution layers, where the feature map produced by the previous convolution layer is used as the input of the next convolution layer. Using multiple convolution layers extracts richer information from the image, thereby improving the accuracy of feature extraction.
  • a feature map corresponding to the image to be processed can be obtained by performing a stepwise convolution operation on the image to be processed through the feature extraction branch of the neural network including multiple convolution layers.
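  • A minimal sketch of such a multi-layer convolutional feature-extraction branch, written with PyTorch-style layers purely for illustration; the layer count, kernel sizes, and channel widths are assumptions, not values given by the application.

```python
import torch.nn as nn

class FeatureExtractionBranch(nn.Module):
    """Stacked convolution layers: each layer's feature map is the next layer's input."""
    def __init__(self):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, kernel_size=3, stride=2, padding=1), nn.ReLU(),
        )

    def forward(self, image):
        # Step-wise convolution of the image to be processed yields the feature map.
        return self.layers(image)
```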
  • the candidate frame extraction branch of the neural network determines a plurality of candidate frames that may include a predetermined action on the feature map.
  • the feature map is processed through the candidate frame extraction branch of the neural network to determine a plurality of candidate frames that may include a predetermined action.
  • the feature map may include at least one of the features corresponding to a hand, a cigarette, a cup, a mobile phone, glasses, a mask, and a local area of a human face, and a plurality of candidate frames are determined based on the at least one feature.
  • In step 302, the feature extraction branch of the neural network extracts the features of the image to be processed, and the extracted features may include features other than those corresponding to the predetermined action.
  • Therefore, among the multiple candidate frames identified here by the candidate frame extraction branch of the neural network, at least some candidate frames may contain features other than the features corresponding to the predetermined action, or may not contain all of the features corresponding to the predetermined action; the multiple candidate frames are therefore frames that may (but do not necessarily) include a predetermined action.
  • The candidate frame extraction branch of the neural network in this embodiment may likewise include multiple convolution layers, where the features extracted by the previous convolution layer are used as the input of the next convolution layer; the multiple layers extract richer information, thereby improving the accuracy of feature extraction.
  • In some embodiments, determining, via the candidate frame extraction branch of the neural network, a plurality of candidate frames that may include a predetermined action on the feature map includes: dividing the features in the feature map according to the characteristics of the predetermined action to obtain a plurality of candidate regions; and obtaining, according to the plurality of candidate regions, a plurality of candidate frames and a first confidence level of each of the plurality of candidate frames, where the first confidence level is the probability that the candidate frame is the action target frame.
  • The candidate frame extraction branch of the neural network examines the feature map; where the feature map includes hand features and features of a local face area, or includes hand features, features of an action interaction object (such as a mobile phone), and features of a local face area, those features are divided out of the feature map, candidate regions are determined based on the divided features, and the candidate regions are identified by candidate frames (for example rectangular frames). In this way, a plurality of candidate regions identified by candidate frames is obtained.
  • the candidate frame extraction branch of the neural network may also determine a first confidence level corresponding to each candidate frame, where the first confidence level is used to represent a possibility that the candidate frame is a target action frame in a form of probability.
  • The first confidence level is a predicted value, produced by the candidate frame extraction branch of the neural network from the features of a candidate frame, of whether that candidate frame is the target action frame.
  • 304: Determine, via the detection frame refinement branch of the neural network, an action target frame based on the multiple candidate frames, wherein the action target frame includes a local area of a human face and an action interaction object.
  • In some embodiments, determining an action target frame based on the plurality of candidate frames via the detection frame refinement branch of the neural network includes: removing, via the detection frame refinement branch, candidate frames whose first confidence level is less than a first threshold to obtain at least one first candidate frame; pooling the at least one first candidate frame to obtain at least one second candidate frame; and determining the action target frame according to the at least one second candidate frame.
  • For example, the target object may in turn perform actions similar to making a call, drinking water, and smoking: the right hand is placed next to the face, but there is no mobile phone, cup, or cigarette. A neural network is prone to mistakenly recognizing these actions of the target object as making a call, drinking water, and smoking.
  • When the predetermined action is a predetermined dangerous driving action, the driver may, for example, scratch an ear because the ear area itches, or open the mouth or place a hand near the face for other reasons.
  • These movements are not among the predetermined dangerous driving actions, but they greatly interfere with the candidate frame extraction branch of the neural network during candidate frame extraction, which then affects the subsequent classification of the action and leads to wrong action recognition results.
  • the detection frame refinement branch of the neural network is obtained through pre-training to remove candidate frames with a first confidence level less than a first threshold to obtain at least one first candidate frame; the first confidence level of the at least one first candidate frame Both are greater than or equal to the first threshold.
  • the first threshold may be 0.5, for example.
  • the value of the first threshold in the embodiments of the present application is not limited thereto.
  • In some embodiments, pooling the at least one first candidate frame to obtain at least one second candidate frame includes: pooling the at least one first candidate frame to obtain at least one first feature region corresponding to the at least one first candidate frame; and adjusting the position and size of the corresponding first candidate frame based on each first feature region to obtain at least one second candidate frame.
  • The number of features in the region covered by a first candidate frame may be large, and using those features directly would require a huge amount of computation. Therefore, before subsequent processing of the features in the region covered by the first candidate frame, the first candidate frame is pooled first, that is, the features in the region covered by the first candidate frame are pooled, so that the dimensionality of the features meets the needs of subsequent computation and the amount of computation in subsequent processing is greatly reduced. Similar to obtaining the candidate regions in step 303, the pooled features are divided according to the characteristics of the predetermined action to obtain multiple first feature regions. It can be understood that, in this embodiment, by pooling the region corresponding to the first candidate frame, the features corresponding to the predetermined action in the first feature region are presented in a low-dimensional form.
  • The specific implementation of the pooling process can be illustrated as follows. Suppose the size of the first candidate frame is h * w, where h represents its height and w its width. When the target feature size is H * W, the first candidate frame can be divided into H * W grid cells, each of size (h / H) * (w / W); the average gray value of the pixels in each grid cell is then calculated, or the maximum gray value in each grid cell is determined, and the average or maximum gray value is used as the value of that grid cell, thereby obtaining the pooling result of the first candidate frame.
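  • The grid pooling described in this example can be sketched as follows (NumPy, average-value variant; the way grid boundaries are rounded is an assumption made only for the illustration):

```python
import numpy as np

def pool_first_candidate_frame(region: np.ndarray, H: int, W: int) -> np.ndarray:
    """Pool an h*w candidate-frame region into a fixed H*W grid of average gray values."""
    h, w = region.shape[:2]
    pooled = np.zeros((H, W), dtype=np.float32)
    for i in range(H):
        for j in range(W):
            r0, r1 = int(i * h / H), max(int((i + 1) * h / H), int(i * h / H) + 1)
            c0, c1 = int(j * w / W), max(int((j + 1) * w / W), int(j * w / W) + 1)
            cell = region[r0:r1, c0:c1]
            pooled[i, j] = cell.mean()   # use cell.max() for the maximum-gray-value variant
    return pooled
```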
  • In some embodiments, adjusting the position and size of a corresponding first candidate frame based on each first feature region to obtain at least one second candidate frame includes: obtaining, from the features corresponding to the predetermined action in the first feature region, a first action feature frame corresponding to those features; obtaining a first position offset of the at least one first candidate frame according to the geometric center coordinates of the first action feature frame; obtaining a first zoom factor of the at least one first candidate frame according to the size of the first action feature frame; and adjusting the position and size of the at least one first candidate frame according to the at least one first position offset and the at least one first zoom factor to obtain at least one second candidate frame.
  • The features corresponding to each predetermined action in the first feature region are respectively identified by a first action feature frame, and the first action feature frame may specifically be a rectangular frame; for example, the features corresponding to each predetermined action in the first feature region are identified by a rectangular frame.
  • the geometric center coordinates of the first motion feature frame in a pre-established XOY coordinate system are obtained, and the first position offset of the first candidate frame corresponding to the first motion feature frame is determined according to the geometric center coordinates;
  • the XOY coordinate system is generally a coordinate system established by setting the coordinate origin O, with the horizontal direction as the X axis, and the direction perpendicular to the X axis as the Y axis.
  • The geometric center of the first action feature frame and the geometric center of the first candidate frame usually deviate from each other to some extent, and the first position offset of the first candidate frame is determined according to this deviation.
  • Specifically, the offset between the geometric center of the first action feature frame and the geometric center of the first candidate frame corresponding to the features of the same predetermined action may be used as the first position offset of that first candidate frame.
  • Each first candidate frame corresponds to one first position offset.
  • The first position offset includes an offset component along the X axis and an offset component along the Y axis.
  • The XOY coordinate system takes the upper-left corner of the first feature region (in the orientation in which it is input to the detection frame refinement branch of the neural network) as the coordinate origin, with the horizontal direction to the right as the positive X axis and the vertical direction downward as the positive Y axis.
  • Alternatively, the lower-left corner, upper-right corner, or lower-right corner of the first feature region, or its center point, may be used as the origin, with the horizontal direction to the right as the positive X axis and the vertical direction as the positive Y axis.
  • the size of the first motion feature frame is obtained, the length and width of the first motion feature frame are specifically obtained, and the first zoom factor of the corresponding first candidate frame is determined according to the length and width of the first motion feature frame.
  • the first zoom factor of the first candidate frame may be determined based on the length and width of the first motion feature frame and the length and width of the corresponding first candidate frame.
  • Each of the first candidate frames corresponds to a first zoom factor.
  • the first zoom factors of different first candidate frames may be the same or different.
  • the position and size of the first candidate frame are adjusted according to a first position offset and a first zoom factor corresponding to each first candidate frame.
  • the first candidate frame is moved according to the above-mentioned first position offset, and the first candidate frame is centered on the geometric center and the size is adjusted according to the first zoom factor to obtain the second candidate frame.
  • the number of second candidate frames is consistent with the number of first candidate frames.
  • the second candidate frame obtained in the above manner will contain all the features of the predetermined action in the smallest possible size, which is beneficial to improving the accuracy of the subsequent action classification results.
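  • A sketch of this offset-and-zoom adjustment, assuming boxes are represented as (cx, cy, w, h) with center coordinates; the representation is an assumption, since the application only specifies an offset of the geometric center and a zoom factor.

```python
def adjust_first_candidate_frame(first_candidate, first_action_feature_frame):
    """Shift the first candidate frame by the first position offset (center deviation)
    and rescale it by the first zoom factor to obtain a second candidate frame.
    Boxes are (cx, cy, w, h); this layout is an illustrative assumption."""
    cx, cy, w, h = first_candidate
    fx, fy, fw, fh = first_action_feature_frame
    dx, dy = fx - cx, fy - cy          # first position offset: X and Y components
    sx, sy = fw / w, fh / h            # first zoom factor derived from the feature frame size
    return (cx + dx, cy + dy, w * sx, h * sy)
```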
  • A plurality of second candidate frames whose sizes are similar and whose geometric centers are close to one another can be merged into one, and the merged second candidate frame is used as an action target frame. It should be understood that the sizes and geometric centers of second candidate frames corresponding to the same predetermined action tend to be very close, so each predetermined action may correspond to one action target frame.
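  • The merging of second candidate frames whose sizes and geometric centers are close can be sketched as follows (the center-distance and size tolerances are placeholders):

```python
def merge_second_candidate_frames(frames, center_tol=20.0, size_tol=0.2):
    """Greedily group (cx, cy, w, h) frames whose centers lie within `center_tol`
    pixels and whose areas differ by less than `size_tol` (relative), then average
    each group into a single action target frame."""
    merged, used = [], [False] * len(frames)
    for i, (cx, cy, w, h) in enumerate(frames):
        if used[i]:
            continue
        group, used[i] = [(cx, cy, w, h)], True
        for j in range(i + 1, len(frames)):
            if used[j]:
                continue
            cx2, cy2, w2, h2 = frames[j]
            close = ((cx - cx2) ** 2 + (cy - cy2) ** 2) ** 0.5 <= center_tol
            similar = abs(w * h - w2 * h2) / max(w * h, w2 * h2) <= size_tol
            if close and similar:
                group.append(frames[j])
                used[j] = True
        merged.append(tuple(sum(f[k] for f in group) / len(group) for k in range(4)))
    return merged   # ideally one action target frame per predetermined action
```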
  • For example, if the driver is smoking while making a phone call, the obtained image to be processed may include features corresponding to the two predetermined actions of making a phone call and smoking.
  • In this case, a candidate frame including features corresponding to the predetermined action of making a phone call can be obtained, which includes a hand, a mobile phone, and a local area of the human face, and a candidate frame including features corresponding to the predetermined action of smoking can also be obtained.
  • The sizes and geometric centers of the candidate frames corresponding to the predetermined action of making a call are similar to one another, and the sizes and geometric centers of the candidate frames corresponding to the predetermined action of smoking are similar to one another. In contrast, the difference in size between any candidate frame corresponding to making a call and any candidate frame corresponding to smoking is greater than the size difference between any two candidate frames corresponding to making a call and greater than the size difference between any two candidate frames corresponding to smoking; likewise, the distance between the geometric centers of a candidate frame corresponding to making a call and a candidate frame corresponding to smoking is greater than the distance between the geometric centers of any two candidate frames corresponding to making a call and greater than the distance between the geometric centers of any two candidate frames corresponding to smoking.
  • The action classification branch of the neural network obtains a region map from the region of the feature map corresponding to each action target frame, classifies the predetermined action based on the features in the region map to obtain a first action recognition result, and then obtains the action recognition result corresponding to the image to be processed from the first action recognition results corresponding to all target action frames.
  • In some embodiments, the first action recognition result obtained through the action classification branch of the neural network may include a second confidence level, the second confidence level characterizing the accuracy of the action recognition result.
  • Obtaining the action recognition result corresponding to the image to be processed from the first action recognition results corresponding to all target action frames includes: comparing the second confidence level of the first action recognition result corresponding to each target action frame with a preset threshold.
  • the driver is photographed by a vehicle-mounted camera to obtain an image including the face of the driver, and the image is input to a neural network as an image to be processed.
  • Suppose the driver in the image to be processed is performing a "making a call" action, and the processing of the neural network yields two action recognition results: a "making a call" result and a "drinking water" result, where the second confidence level of the "making a call" result is 0.8 and the second confidence level of the "drinking water" result is 0.4.
  • If the preset threshold is set to 0.6, it can be determined that the action recognition result of the image to be processed is the "making a call" action.
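  • The comparison of second confidence levels against the preset threshold in this example can be written as follows (the labels and values are those of the example above; the function name is illustrative):

```python
def select_action_recognition_result(first_results, preset_threshold=0.6):
    """Keep only first action recognition results whose second confidence level
    reaches the preset threshold."""
    return [(label, conf) for label, conf in first_results if conf >= preset_threshold]

# Example from the text: "making a call" (0.8) passes, "drinking water" (0.4) does not.
print(select_action_recognition_result([("making a call", 0.8), ("drinking water", 0.4)]))
# -> [('making a call', 0.8)]
```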
  • the method may further include: outputting reminder information.
  • the specific predetermined action may be a dangerous driving action
  • the dangerous driving action is an action that brings a dangerous event to the driving process when the driver is driving the vehicle.
  • the dangerous driving action may be an action generated by the driver himself, or an action generated by another person located in the cockpit.
  • The reminder information may be output in at least one of audio, video, and text form.
  • Prompt information can be output to the people in the vehicle (such as the driver and/or other people in the vehicle) through a terminal.
  • The prompt information may be output, for example, by using the terminal to display text or by using the terminal to output a voice prompt, and so on.
  • the terminal may be a vehicle-mounted terminal.
  • the terminal may be equipped with a display screen and / or an audio output function.
  • the specific predetermined actions are: drinking water, making phone calls, wearing glasses, and so on.
  • the prompt information is output, and the category of the specific predetermined action (such as a dangerous driving action) can also be output.
  • the prompt information may not be output, or the type of the predetermined action may be output.
  • For example, a dialog box may be displayed through a head-up display (HUD), and the displayed content delivers the prompt information to the driver.
  • Prompt information may also be output through the vehicle's built-in audio output function, for example audio such as "Please pay attention to your driving actions"; the reminder may also be delivered by releasing a gas with a refreshing effect, for example spraying a fragrant toilet water through a vehicle nozzle, which both reminds the driver and helps keep the driver alert; or a weak current may be released through the seat to stimulate the driver, thereby outputting the prompt information and achieving the effect of a prompt or warning.
  • In the embodiments of the present application, the feature extraction branch of the neural network performs feature extraction, the candidate frame extraction branch of the neural network obtains candidate frames that may include predetermined actions from the extracted features, the detection frame refinement branch of the neural network determines the action target frame from those candidate frames, and finally the action classification branch of the neural network classifies the predetermined action based on the features in the action target frame to obtain the action recognition result of the image to be processed.
  • The entire recognition process works by extracting features from the image to be processed (such as features of the hand region, the local face region, and the region corresponding to the action interaction object) and processing them, so that precise recognition of fine actions can be achieved autonomously and quickly.
  • FIG. 5 is a schematic flowchart of a driving motion analysis method according to an embodiment of the present application; as shown in FIG. 5, the method includes:
  • the driver is video-captured by a vehicle camera to obtain a video stream, and each frame of the video stream is used as an image to be processed.
  • the corresponding motion recognition results are obtained, and the driving state of the driver is identified in combination with the motion recognition results of consecutive multiple frames of images to determine whether the driving state is a dangerous driving state corresponding to a dangerous driving action .
  • The predetermined condition includes at least one of the following: the occurrence of a specific predetermined action; the number of times the specific predetermined action occurs within a predetermined time period; and the duration for which the specific predetermined action appears and is maintained in the video stream.
  • The specific predetermined action may be a predetermined action corresponding to a dangerous driving action among the classes of predetermined actions in the foregoing embodiments, for example a drinking action, a calling action, or the like.
  • Determining that the action recognition result meets the predetermined condition may include: determining that the predetermined condition is met when the specific predetermined action is included in the action recognition result; or determining that the predetermined condition is met when the specific predetermined action is included in the action recognition result and the number of occurrences of the specific predetermined action within the predetermined time period reaches a preset number; or determining that the predetermined condition is met when the specific predetermined action is included in the action recognition result and the duration for which the specific predetermined action appears in the video stream reaches a preset duration.
  • the vehicle driving terminal may generate and output dangerous driving prompt information, and may also output a specific type of predetermined action.
  • the method for outputting the dangerous driving prompt information may include: outputting the dangerous driving prompt information by displaying characters on the vehicle terminal, and outputting the dangerous driving prompt information through the audio output function of the vehicle terminal.
  • In some embodiments, the method further includes acquiring the vehicle speed of the vehicle provided with the vehicle-mounted camera, and generating dangerous driving prompt information in response to the action recognition result meeting the predetermined condition includes: generating dangerous driving prompt information in response to the vehicle speed being greater than a set threshold and the action recognition result meeting the predetermined condition.
  • That is, when the vehicle speed is not greater than the set threshold, the dangerous driving prompt information may not be generated or output even if the action recognition result meets the predetermined condition; the dangerous driving prompt information is generated and output only when the vehicle speed is greater than the set threshold and the action recognition result meets the predetermined condition.
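  • A sketch combining the predetermined-condition checks with the vehicle-speed gate described above; the window length, occurrence count, duration, and speed threshold are all placeholders that a real system would tune.

```python
def should_generate_prompt(frame_actions, fps, vehicle_speed,
                           specific_action="making a call", speed_threshold=20.0,
                           preset_count=3, preset_duration_s=2.0, window_s=60.0):
    """frame_actions: recognized action label per frame of the video stream.
    Returns True when dangerous driving prompt information should be generated."""
    if vehicle_speed <= speed_threshold:
        return False                                  # speed gate: no prompt at low speed
    window = frame_actions[-int(window_s * fps):]     # predetermined time period
    if specific_action not in window:
        return False                                  # condition 1: the action must occur
    count, run, longest, prev = 0, 0, 0, None
    for label in window:
        if label == specific_action:
            run += 1
            if prev != specific_action:
                count += 1                            # condition 2: number of occurrences
        else:
            run = 0
        longest = max(longest, run)                   # condition 3: sustained duration
        prev = label
    return count >= preset_count or (longest / fps) >= preset_duration_s
```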
  • a video is taken of a driver through a vehicle-mounted camera, and each frame of the captured video is used as an image to be processed.
  • Each frame of the picture taken by the camera is recognized to obtain the corresponding recognition result, and the actions of the driver are recognized in combination with the results of multiple consecutive frames.
  • the driver may be warned through the display terminal, and the type of the dangerous driving action may be prompted.
  • the ways to raise warnings include: pop-up dialog box to warn by text, and warn by built-in voice data.
  • the neural network in the embodiment of the present application is obtained by pre-supervised training based on a training image set.
  • the neural network may include network layers such as a convolution layer, a non-linear layer, and a pooling layer.
  • The embodiments of the present application do not limit the specific network structure. After the structure of the neural network is determined, iterative training can be performed on the neural network based on sample images with labeling information through supervised gradient back-propagation; the specific training method is not limited in the embodiments of the present application.
  • FIG. 6 is a schematic flowchart of a neural network training method according to an embodiment of the present application. As shown in FIG. 6, the method includes:
  • a sample image for training a neural network may be obtained from a training image set, where the training image set may include multiple sample images.
  • the sample images in the training image set include a positive sample image and a negative sample image.
  • the positive sample image includes at least one predetermined action corresponding to the target object, such as the target object drinking water, smoking, making a call, wearing glasses, wearing a mask, and the like;
  • The negative sample image includes at least one action similar to a predetermined action, for example the target object putting a hand near the lips, scratching an ear, touching the nose, and so on.
  • a sample image containing an action very similar to a predetermined action is used as a negative sample image.
  • a first feature map of a sample image may be extracted through a convolution layer in a neural network.
  • A plurality of third candidate frames that may include a predetermined action are extracted from the first feature map.
  • step 303 For the detailed process of this step, reference may be made to the description of step 303 in the foregoing embodiment, and details are not described herein again.
  • In some embodiments, determining an action target frame based on the plurality of third candidate frames includes: obtaining a first action supervision frame according to the predetermined action, wherein the first action supervision frame includes a local face area and an action interaction object, or a local face area, a hand region, and an action interaction object; obtaining second confidence levels of the plurality of third candidate frames, wherein the second confidence level includes a first probability that the third candidate frame is the action target frame and a second probability that the third candidate frame is not the action target frame; determining the area coincidence degree between each of the plurality of third candidate frames and the first action supervision frame; if the area coincidence degree is greater than or equal to a second threshold, taking the second confidence level of the third candidate frame corresponding to that area coincidence degree as the first probability; and if the area coincidence degree is less than the second threshold, taking the second confidence level of the third candidate frame corresponding to that area coincidence degree as the second probability.
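  • The area coincidence used here is an overlap measure between a third candidate frame and the action supervision frame; the sketch below uses intersection-over-union as that measure, which is an assumption, and the second threshold value is a placeholder.

```python
def area_coincidence(box_a, box_b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

def is_positive_candidate(third_candidate, first_action_supervision_frame, second_threshold=0.5):
    """Treat the candidate's confidence as the first probability (target frame) when the
    coincidence reaches the second threshold, otherwise as the second probability."""
    return area_coincidence(third_candidate, first_action_supervision_frame) >= second_threshold
```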
  • a feature of a predetermined motion may be defined in advance.
  • For example: the action features of drinking water include the features of the hand region, the local face region, and the cup region (i.e., the region corresponding to the action interactor); the action features of smoking include the hand region, the local face region, and the cigarette region; the action features of making a phone call include the hand region, the local face region, and the mobile phone region; the action features of wearing glasses include the hand region, the local face region, and the glasses region; and the action features of wearing a mask include the hand region, the local face region, and the mask region.
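  • The per-action feature composition listed above can be captured in a small lookup table; the sketch below is illustrative only, and all region and action names are hypothetical identifiers rather than terms fixed by the patent.

```python
# Hypothetical mapping from each predetermined action to the region types whose
# features jointly define it (hand region, local face region, interactor region).
PREDEFINED_ACTION_REGIONS = {
    "drink":        ("hand", "face_local", "cup"),
    "smoke":        ("hand", "face_local", "cigarette"),
    "phone_call":   ("hand", "face_local", "mobile_phone"),
    "wear_glasses": ("hand", "face_local", "glasses"),
    "wear_mask":    ("hand", "face_local", "mask"),
}

def required_regions(action: str) -> tuple:
    """Return the region types whose features jointly define the given action."""
    return PREDEFINED_ACTION_REGIONS[action]
```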
  • the label information of the sample image includes: an action supervision frame and an action category corresponding to the action supervision frame. It can be understood that before processing the sample images through a neural network, it is also necessary to obtain labeling information corresponding to each sample image.
  • the action monitoring box is specifically used to identify a predetermined action in the sample image. For details, see the action monitoring box of the target object drinking water in FIG. 7 and the action monitoring box of the target object calling in FIG. 8.
  • Actions that are very similar to the predetermined actions often interfere strongly with the candidate-frame extraction performed by the neural network.
  • In FIG. 4, from left to right, the target object performs actions similar to making a phone call, drinking water, and smoking: the right hand is placed next to the face, but the target object is not holding a mobile phone, a water glass, or a cigarette. A neural network is prone to mistakenly recognizing these actions as making a phone call, drinking water, and smoking, and to marking corresponding candidate frames for them. Therefore, in the embodiments of the present application, the neural network is trained to distinguish positive sample images from negative sample images: the first action supervision frame corresponding to a positive sample image may contain a predetermined action, while the first action supervision frame corresponding to a negative sample image may contain an action that is merely similar to a predetermined action.
  • While the neural network marks the third candidate frames, a second confidence corresponding to each third candidate frame can also be obtained. The second confidence includes the probability that the third candidate frame is the action target frame, that is, the first probability, and the probability that the third candidate frame is not the action target frame, that is, the second probability.
  • The second confidence is the value, predicted by the neural network from the features in the third candidate frame, of whether the third candidate frame is the target action frame.
  • In addition, the processing of the neural network also yields the coordinates (x3, y3) of each third candidate frame in the coordinate system xoy and the size of the third candidate frame, which may be represented by the product of its length and width.
  • The coordinates (x3, y3) of the third candidate frame may be the coordinates of one vertex of the third candidate frame, such as its upper-left, upper-right, lower-left, or lower-right corner. Taking (x3, y3) as the upper-left vertex, the x-coordinate x4 of the upper-right vertex and the y-coordinate y4 of the lower-left vertex can also be obtained, so the third candidate frame can be represented as bbox(x3, y3, x4, y4).
  • Similarly, the first action supervision frame may be expressed as bbox_gt(x1, y1, x2, y2).
  • The area coincidence degree IOU between each third candidate frame bbox(x3, y3, x4, y4) and the first action supervision frame bbox_gt(x1, y1, x2, y2) is determined. Optionally, the IOU is calculated as:

IOU = (A ∩ B) / (A ∪ B)

  • where A and B respectively denote the area of the third candidate frame and the area of the first action supervision frame, A ∩ B denotes the area of the region where the third candidate frame overlaps the first action supervision frame, and A ∪ B denotes the area of all regions covered by the third candidate frame and the first action supervision frame together.
  • If the area coincidence degree IOU is greater than or equal to the second threshold, the third candidate frame is determined to be a candidate frame that may contain a predetermined action, and its second confidence is taken as the first probability described above; if the area coincidence degree IOU is less than the second threshold, the third candidate frame is determined to be a candidate frame that is unlikely to contain a predetermined action, and its second confidence is taken as the second probability.
  • The value of the second threshold is greater than or equal to 0 and less than or equal to 1, and its specific value may be determined according to the training effect of the network.
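  • The area-coincidence computation and the second-threshold rule described above can be sketched as follows; the corner-coordinate box convention and the example threshold value of 0.5 are assumptions, not values fixed by the patent.

```python
def iou(box_a, box_b):
    """Area coincidence degree IOU = area(A ∩ B) / area(A ∪ B).
    Boxes are (x_min, y_min, x_max, y_max); this corner convention is an assumption."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def second_confidence(candidate_box, supervision_box, first_prob, second_prob,
                      second_threshold=0.5):
    """Take the first probability when the candidate overlaps the action supervision
    frame enough, otherwise the second probability (threshold value is illustrative)."""
    if iou(candidate_box, supervision_box) >= second_threshold:
        return first_prob
    return second_prob
```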
  • The multiple third candidate frames whose second confidence is less than the first threshold may then be removed to obtain multiple fourth candidate frames, and the positions and sizes of the fourth candidate frames are adjusted to obtain the action target frame. For details on how the action target frame is obtained, refer to step 304 in the foregoing embodiment.
  • Adjusting the position and size of the fourth candidate frame to obtain the action target frame includes: pooling the fourth candidate frame to obtain a second feature region corresponding to the fourth candidate frame, adjusting the position and size of the corresponding fourth candidate frame based on the second feature region to obtain a fifth candidate frame, and obtaining the action target frame based on the fifth candidate frame.
  • Adjusting the position and size of the corresponding fourth candidate frame based on the second feature region to obtain the fifth candidate frame includes: obtaining, according to the features in the second feature region that correspond to a predetermined action, a second action feature frame corresponding to the features of the predetermined action; obtaining a second position offset of the fourth candidate frame according to the geometric center coordinates of the second action feature frame; obtaining a second zoom factor of the fourth candidate frame according to the size of the second action feature frame; and adjusting the position and size of the fourth candidate frame according to the second position offset and the second zoom factor to obtain the fifth candidate frame.
  • The geometric center coordinates P(x_n, y_n) of each fourth candidate frame in the coordinate system xoy and the geometric center coordinates Q(x, y) of the second action feature frame in the coordinate system xoy are obtained, and the second position offset of each fourth candidate frame is computed as Δ(x_n, y_n) = P(x_n, y_n) − Q(x, y), where n is a positive integer and the number of values of n equals the number of fourth candidate frames. Δ(x_n, y_n) is thus the second position offset of the multiple fourth candidate frames.
  • The sizes of the fourth candidate frame and of the second action feature frame are obtained, and the size of the second action feature frame is divided by the size of the fourth candidate frame to obtain the second zoom factor ε of the fourth candidate frame, where ε includes a zoom factor δ for the length of the fourth candidate frame and a zoom factor η for its width.
  • Denoting the set of geometric center coordinates of the fourth candidate frames as {P(x_n, y_n)}, the set of geometric centers after position adjustment according to the second position offset Δ(x_n, y_n) is {P′(x_n, y_n)} = {P(x_n, y_n) − Δ(x_n, y_n)}, that is, P′(x_n, y_n) = Q(x, y). While the position of the geometric center is adjusted, the length and width of the fourth candidate frame remain unchanged.
  • After the position adjustment, the geometric center of each fourth candidate frame is kept fixed, its length is scaled by δ and its width by η according to the second zoom factor ε, and a fifth candidate frame is obtained.
  • obtaining the motion target frame based on the fifth candidate frame includes: merging a plurality of fifth candidate frames with similar sizes and distances, and the combined fifth candidate frame is used as the motion target frame. It should be understood that the size and distance of the fifth candidate frame corresponding to the same predetermined action will be very close, so each target frame of the action after the merge corresponds to only one predetermined action.
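  • A minimal sketch of the fourth-candidate-frame adjustment (second position offset Δ and second zoom factor ε) and of merging fifth candidate frames with similar sizes and center distances might look like the following; the merge tolerances are illustrative assumptions.

```python
import numpy as np

def adjust_fourth_candidate(p_center, q_center, frame_size, feature_size):
    """Shift the candidate's geometric center by Δ = P - Q (i.e., onto Q) and scale
    its length/width by ε = feature_size / frame_size with the center held fixed."""
    p, q = np.asarray(p_center, float), np.asarray(q_center, float)
    delta = p - q                      # second position offset Δ(x_n, y_n)
    new_center = p - delta             # equals Q(x, y)
    eps = np.asarray(feature_size, float) / np.asarray(frame_size, float)
    new_size = np.asarray(frame_size, float) * eps   # (length * δ, width * η)
    return new_center, new_size

def merge_fifth_candidates(frames, center_tol=8.0, size_tol=0.1):
    """Greedily group fifth candidate frames whose centers and sizes are close and
    average each group into one action target frame. Tolerances are assumptions."""
    groups = []
    for center, size in frames:
        for group in groups:
            c0, s0 = group[0]
            if (np.linalg.norm(np.asarray(center) - np.asarray(c0)) <= center_tol
                    and abs(size[0] - s0[0]) / s0[0] <= size_tol):
                group.append((center, size))
                break
        else:
            groups.append([(center, size)])
    return [(np.mean([c for c, _ in g], axis=0), np.mean([s for _, s in g], axis=0))
            for g in groups]
```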
  • A third confidence of the action target frame is also obtained; the third confidence indicates the probability that the action in the action target frame belongs to each predetermined action category, that is, the third probability.
  • For example, if the predetermined actions include the five categories of drinking, smoking, making a phone call, wearing glasses, and wearing a mask, the third probability of each action target frame includes five probability values: the probability a that the action in the action target frame is drinking, the probability b that it is smoking, the probability c that it is making a phone call, the probability d that it is wearing glasses, and the probability e that it is wearing a mask.
  • Step 504: Classify the predetermined action based on the action target frame to obtain an action recognition result.
  • Taking the case where the predetermined actions in the action target frame include the five categories of drinking water, smoking, making a phone call, wearing glasses, and wearing a mask as an example, the category of the predetermined action with the highest third confidence (i.e., third probability) is taken as the action recognition result, and the maximum third confidence (i.e., third probability) is taken as the fourth probability of the action target frame.
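  • A small sketch of selecting the classification result from the five third probabilities described above; the class labels are hypothetical names, not identifiers from the patent.

```python
ACTION_CLASSES = ["drink", "smoke", "phone_call", "wear_glasses", "wear_mask"]

def classify_action_target(third_probs):
    """third_probs = (a, b, c, d, e) as described above; return the predicted class
    and the maximum third probability (used later as the fourth probability)."""
    best = max(range(len(third_probs)), key=lambda i: third_probs[i])
    return ACTION_CLASSES[best], third_probs[best]
```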
  • Step 505 Determine the detection result of the candidate frame of the sample image and the first loss of the detection frame labeling information, and the second loss of the motion recognition result and the motion category labeling information.
  • Step 506 Adjust network parameters of the neural network according to the first loss and the second loss.
  • the neural network may include a feature extraction branch of the neural network, a candidate frame extraction branch of the neural network, a detection frame refinement branch of the neural network, and an action classification branch of the neural network.
  • For the functions of the branches of the neural network, reference may be made to the detailed description of steps 301 to 305 in the foregoing embodiment.
  • The network parameters of the neural network are updated by computing a candidate frame coordinate regression loss function (smooth L1) and a category loss function (softmax).
  • The expression of the loss function of the candidate frame extraction branch (Region Proposal Loss) is given in formula (3), where N and λ are weight parameters of the candidate frame extraction branch of the neural network and p_i is the supervised variable.
  • The detection frame refinement branch of the neural network updates its weight parameters through a loss function whose specific expression (Bbox Refine Loss) is given in formula (6), where M is the number of sixth candidate frames, p_i is the supervised variable, and the remaining coefficient is the weight parameter of the detection frame refinement branch of the neural network.
  • The expressions of the softmax loss function and the smooth L1 loss function can be found in formula (4) and formula (5); in particular, bbox_i in formula (6) is the geometric center coordinate of the refined target frame, and bbox_gt_j is the geometric center coordinate of the supervised action frame.
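  • The patent's exact expressions (3) to (6) are not reproduced here; the sketch below only illustrates, under that caveat, the standard softmax category loss and smooth-L1 coordinate regression loss that the text names, using PyTorch as an assumed framework.

```python
# Illustrative stand-ins for the category loss (softmax) and the coordinate regression
# loss (smooth L1); these are the standard forms, not the patent's formulas (3)-(6).
import torch
import torch.nn.functional as F

def candidate_frame_losses(pred_logits, target_class, pred_centers, supervised_centers):
    """pred_logits: (M, num_classes); target_class: (M,) long tensor of class indices;
    pred_centers / supervised_centers: (M, 2) geometric center coordinates."""
    classification_loss = F.cross_entropy(pred_logits, target_class)      # softmax loss
    regression_loss = F.smooth_l1_loss(pred_centers, supervised_centers)  # smooth L1 loss
    return classification_loss, regression_loss
```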
  • The loss function is the objective function of neural network optimization, and the process of training or optimizing the neural network is the process of minimizing the loss function: the closer the loss function value is to 0, the closer the predicted result is to the real result.
  • The supervised variable p_i in formula (3) and formula (4) is replaced with the second confidence of the fourth candidate frame and substituted into formula (3); the value of the Region Proposal Loss (that is, the first loss) is changed by adjusting the weight parameters N and λ of the candidate frame extraction branch of the neural network, and the combination of N and λ that brings the Region Proposal Loss closest to 0 is selected.
  • Similarly, the supervised variable p_i is replaced with the fourth probability of the action target frame (that is, the maximum value among the multiple third confidences, i.e., the third probabilities) and substituted into formula (6); the value of the Bbox Refine Loss (that is, the second loss) is changed by adjusting the weight parameter of the detection frame refinement branch, the weight parameter that brings the Bbox Refine Loss closest to 0 is selected, and the weight update of the neural network is completed by gradient back-propagation.
  • The candidate frame extraction branch with updated weight parameters, the detection frame refinement branch with updated weight parameters, the feature extraction branch, and the action classification branch are then trained again: the sample image is input into the neural network, processed by the neural network, and the action classification branch outputs the recognition result. Because there is an error between the output of the action classification branch and the ground truth, this error is back-propagated from the output layer through the convolutional layers until it reaches the input layer. During back-propagation, the weight parameters of the neural network are adjusted according to the error, and this process is iterated until convergence, at which point the network parameters of the neural network have been updated again.
  • According to the action characteristics, this embodiment recognizes fine actions related to the face of a person in the vehicle, for example dangerous driving actions of a driver that involve the hands and the face.
  • Some actions made by a driver that are merely similar to dangerous driving actions can easily interfere with the neural network and affect the subsequent classification and recognition of actions; this not only reduces the accuracy of the action recognition results but also causes the user experience to decline sharply.
  • In this embodiment, positive sample images and negative sample images are both used as sample images for training the neural network; training is supervised by the loss functions, and the network parameters of the neural network (in particular the weight parameters of the feature extraction branch and of the candidate frame extraction branch) are updated by gradient back-propagation until training is complete. The feature extraction branch of the trained neural network can therefore accurately extract the features of dangerous driving actions, and the candidate frame extraction branch of the neural network automatically removes candidate frames containing actions that are merely similar to the predetermined actions (such as dangerous driving actions), which greatly reduces the false detection rate of dangerous driving actions.
  • The candidate frames are pooled and adjusted to a predetermined size, which greatly reduces the amount of subsequent computation and speeds up processing; the candidate frames are refined by the detection frame refinement branch of the neural network, so that the action target frame obtained after refinement contains only the features of the predetermined actions (such as dangerous driving actions), improving the accuracy of the recognition results.
  • FIG. 9 is a schematic structural diagram of a motion recognition device according to an embodiment of the present application.
  • the recognition device 1000 includes a first extraction unit 11, a second extraction unit 12, a determination unit 13, and a classification unit 14. among them:
  • the first extraction unit 11 is configured to extract features of an image including a human face
  • the second extraction unit 12 is configured to determine a plurality of candidate frames that may include a predetermined action based on the features
  • the determining unit 13 is configured to determine a motion target frame based on the multiple candidate frames, where the motion target frame includes a local area of a human face and a motion interactor;
  • the classification unit 14 is configured to perform classification of a predetermined motion based on the motion target frame to obtain a motion recognition result.
  • the local face area includes at least one of the following: a mouth area, an ear area, and an eye area.
  • the action interacting object includes at least one of the following: a container, a cigarette, a mobile phone, food, a tool, a beverage bottle, glasses, and a mask.
  • the action target frame further includes: a hand region.
  • the predetermined action includes at least one of the following: making a call, smoking, drinking water / beverage, eating, using tools, wearing glasses, and applying makeup.
  • the motion recognition device 1000 further includes a vehicle-mounted camera for capturing an image of a person located in the vehicle, including a human face.
  • the person in the vehicle includes at least one of the following: a driver in a driving zone of the car, a person in a passenger driving zone of the car, and a rear row of the car People on seats.
  • the vehicle-mounted camera is: an RGB camera, an infrared camera, or a near-infrared camera.
  • Feature extraction is performed on the image to be processed, and the actions in the image to be processed are recognized according to the extracted features.
  • The actions may be actions of the hand region and/or the local face region, actions involving an action interactor, and so on. Therefore, the vehicle-mounted camera is used to capture an image of the person in the vehicle to obtain the image to be processed, and a convolution operation is then performed on the image to be processed to extract action features.
  • The features of the predetermined actions are first defined, and the neural network is then used to determine, according to the defined features and the features extracted from the image, whether a predetermined action is present in the image.
  • Through the feature extraction processing of the neural network, a feature region containing the hand region and the local face region is obtained; a candidate region is determined based on the feature region, and the candidate region is identified by a candidate frame, which may for example be represented by a rectangular frame.
  • Similarly, a feature region containing the hand region, the local face region, and the region corresponding to the action interactor is identified by another candidate frame.
  • the candidate frame may include features other than the features corresponding to the predetermined action, or may not include all features corresponding to the predetermined action (referring to all features of any one predetermined action), which will affect the final action recognition result. Therefore, in order to ensure the accuracy of the final recognition result, the position of the candidate frame needs to be adjusted, that is, the action target frame is determined based on a plurality of candidate frames. Based on this, by adjusting the position and size of each candidate frame, the adjusted candidate frame is determined as the action target frame. It can be understood that the adjusted multiple candidate frames can overlap into one candidate frame, and the overlapping candidate frames are determined as the action target frames.
  • the first extraction unit 11 includes a feature extraction branch 111 of a neural network for extracting features of an image including a human face to obtain a feature map.
  • The convolution operation is performed on the image to be processed through the feature extraction branch of the neural network by "sliding" a convolution kernel over the image to be processed: when the kernel covers a pixel, the gray value of each covered pixel is multiplied by the corresponding value of the convolution kernel, and all the products are summed to give the value of the pixel corresponding to the kernel; the kernel is then slid to the next pixel, and so on, until the convolution processing of all pixels in the image to be processed is completed and a feature map is obtained.
  • The feature extraction branch 111 of the neural network may include multiple convolutional layers; the feature map obtained by one convolutional layer can be used as the input data of the next convolutional layer, so that the stacked layers extract richer information from the image and thereby improve the accuracy of feature extraction.
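  • The sliding multiply-and-sum operation described above can be written, for a single-channel image and one kernel, roughly as follows (an illustrative NumPy sketch, not production code).

```python
import numpy as np

def convolve2d(image: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    """Slide the kernel over the image; at each position multiply the covered gray
    values by the kernel entries and sum the products (valid mode, stride 1)."""
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    feature_map = np.zeros((oh, ow), dtype=float)
    for i in range(oh):
        for j in range(ow):
            feature_map[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return feature_map
```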
  • the second extraction unit 12 includes: a candidate frame extraction branch 121 of the neural network, for extracting multiple candidates that may include a predetermined action on the feature map. frame.
  • the feature map may include at least one feature among features corresponding to a hand, a cigarette, a drinking cup, a mobile phone, glasses, a mask, and a local area of a human face, and a plurality of candidate frames are determined based on the at least one feature.
  • Although the features of the image to be processed can be extracted through the feature extraction branch of the neural network, the extracted features may include features other than those corresponding to the predetermined actions. Therefore, among the multiple candidate frames determined by the candidate frame extraction branch of the neural network, at least some candidate frames may contain features other than those corresponding to the predetermined actions, or may not contain all the features corresponding to a predetermined action; the multiple candidate frames are thus frames that may include a predetermined action.
  • the candidate frame extraction branch 121 of the neural network is further configured to: divide the features in the feature map according to the characteristics of the predetermined action to obtain multiple candidate regions; And, according to the plurality of candidate regions, a first confidence level of each candidate frame in the plurality of candidate frames is obtained, where the first confidence level is a probability that the candidate frame is the action target frame.
  • the candidate frame extraction branch 121 of the neural network includes: a division subunit, configured to divide the features in the feature map according to the characteristics of the predetermined action, to obtain multiple candidate regions;
  • a first acquisition subunit configured to obtain a first confidence level of each candidate frame in the plurality of candidate frames according to the multiple candidate regions, where the first confidence level is that the candidate frame is the Probability of moving target frame.
  • the candidate frame extraction branch 121 of the neural network may further determine a first confidence level corresponding to each candidate frame, where the first confidence level is used to represent a possibility that the candidate frame is a target action frame in a form of probability.
  • the first confidence degree is a candidate frame obtained by the candidate frame extraction branch of the neural network according to the characteristics of the candidate frame as a predicted value of the target action frame.
  • the determining unit 13 includes: a detection frame refinement branch 131 of the neural network for determining an action target frame based on the plurality of candidate frames.
  • the detection frame refinement branch 131 of the neural network is further configured to: remove the candidate frame with the first confidence level less than a first threshold to obtain at least one first candidate frame; And pooling the at least one first candidate frame to obtain at least one second candidate frame; and determining an action target frame according to the at least one second candidate frame.
  • the detection frame refinement branch of the neural network includes: a removal subunit configured to remove candidate frames whose first confidence is less than a first threshold, to obtain at least one first candidate frame;
  • a second acquisition subunit configured to pool the at least one first candidate frame to obtain at least one second candidate frame
  • a determining subunit configured to determine an action target frame according to the at least one second candidate frame.
  • In practice, a target object may in turn perform actions similar to making a phone call, drinking water, and smoking: the right hand is placed next to the face, but without a mobile phone, a drinking glass, or a cigarette, and a neural network is prone to mistakenly recognizing these actions of the target object as making a phone call, drinking water, and smoking.
  • The detection frame refinement branch 131 of the neural network is therefore used to remove candidate frames whose first confidence is less than the first threshold and obtain at least one first candidate frame: if the first confidence of a candidate frame is less than the first threshold, the candidate frame corresponds to one of these merely similar actions and needs to be removed, so that predetermined actions can be efficiently distinguished from similar actions, the false detection rate is reduced, and the accuracy of the action recognition results is greatly improved.
  • the detection frame refinement branch 131 (or the second acquisition subunit) of the neural network is further configured to separately process the at least one first candidate frame, Obtaining at least one first feature region corresponding to the at least one first candidate frame; and adjusting the position and size of the corresponding first candidate frame based on each first feature region to obtain at least one second candidate frame.
  • the number of features in the area where the first candidate frame is located may be large. If the features in the area where the first candidate frame is located are used directly, a huge amount of calculation will be generated. Therefore, before performing subsequent processing on the features in the area where the first candidate frame is located, pool the first candidate frame first, that is, pool the features in the area where the first candidate frame is located, and reduce the The dimension of features meets the need for calculation in the subsequent processing, and greatly reduces the calculation in subsequent processing.
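  • The dimension-reducing pooling of a candidate-frame region described above (reducing the features in the first candidate frame region to a fixed, lower-dimensional grid) can be sketched as follows; the choice of max pooling here is an assumption, since average pooling would serve the same purpose.

```python
import numpy as np

def pool_candidate_region(region: np.ndarray, target_h: int, target_w: int) -> np.ndarray:
    """Divide the h*w candidate-frame region into target_h*target_w cells and keep
    one value (here the maximum) per cell, giving a fixed-size, low-dimensional output."""
    h, w = region.shape
    pooled = np.zeros((target_h, target_w), dtype=region.dtype)
    row_edges = np.linspace(0, h, target_h + 1).astype(int)
    col_edges = np.linspace(0, w, target_w + 1).astype(int)
    for i in range(target_h):
        for j in range(target_w):
            cell = region[row_edges[i]:max(row_edges[i + 1], row_edges[i] + 1),
                          col_edges[j]:max(col_edges[j + 1], col_edges[j] + 1)]
            pooled[i, j] = cell.max()
    return pooled
```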
  • Optionally, the detection frame refinement branch 131 (or the second acquisition subunit) of the neural network is further configured to: obtain, based on the features in the first feature region that correspond to a predetermined action, a first action feature frame corresponding to the features of the predetermined action; obtain a first position offset of the at least one first candidate frame according to the geometric center coordinates of the first action feature frame; obtain a first zoom factor of the at least one first candidate frame according to the size of the first action feature frame; and adjust the position and size of the at least one first candidate frame according to the at least one first position offset and the at least one first zoom factor, respectively, to obtain at least one second candidate frame.
  • the classification unit 14 includes: an action classification branch 141 of the neural network, configured to obtain an area map corresponding to the action target frame on the feature map, and Classify a predetermined action based on the area map to obtain a motion recognition result.
  • On the one hand, the first action recognition result is obtained through the action classification branch 141 of the neural network; on the other hand, a second confidence of the first action recognition result may also be obtained through the action classification branch 141, and the second confidence represents the accuracy of the action recognition result.
  • the neural network is obtained by pre-supervising training based on a training image set, where the training image set includes a plurality of sample images, and the label information of the sample images includes: an action The action category corresponding to the supervision frame and the action supervision frame.
  • the training image set includes positive sample images and negative sample images
  • the actions of the negative sample images are similar to the actions of the positive sample images
  • the action supervision frame of the positive sample includes a local area of a human face and an action interactor, or a local area of a human face, a hand region, and an action interactor.
  • For example, if the action in the positive sample image includes making a phone call, the negative sample image includes scratching the ear; and/or, if the positive sample image includes smoking, eating, or drinking water, the negative sample image includes opening the mouth or placing a hand on the lips.
  • In this embodiment, the feature extraction branch 111 of the neural network is used for feature extraction; the candidate frame extraction branch 121 of the neural network obtains candidate frames that may include predetermined actions according to the extracted features; the detection frame refinement branch 131 of the neural network determines the action target frame; and the action classification branch 141 of the neural network classifies the features in the target action frame into predetermined actions to obtain the action recognition result of the image to be processed.
  • The entire recognition process extracts the features of the image to be processed (for example, the features of the hand region, the local face region, and the region corresponding to the action interactor) and processes them, so that fine actions can be recognized accurately, autonomously, and quickly.
  • the motion recognition device further includes a training component of the neural network.
  • FIG. 10 is a schematic structural diagram of a training component of a neural network according to an embodiment of the present application.
  • The training component 2000 includes a first extraction unit 21, a second extraction unit 22, a first determination unit 23, an acquisition unit 24, a second determination unit 25, and an adjustment unit 26.
  • the first extraction unit 21 is configured to extract a first feature map including a sample image
  • the second extraction unit 22 is configured to extract a plurality of third candidate frames in which the first feature map may include a predetermined action
  • the first determining unit 23 is configured to determine an action target frame based on the multiple third candidate frames
  • the obtaining unit 24 is configured to classify a predetermined action based on the action target frame to obtain a first action recognition result
  • the second determining unit 25 is configured to determine a detection result of the candidate frame of the sample image and a first loss of the detection frame labeling information, and a second loss of the motion recognition result and the motion category labeling information;
  • the adjusting unit 26 is configured to adjust network parameters of the neural network according to the first loss and the second loss.
  • the first determining unit 23 includes: a first obtaining subunit 231, configured to obtain a first action supervision frame according to the predetermined action, wherein the first action supervision The frame includes: a local area of a human face and an action interactor, or a local area of a human face, a hand region, and an action interactor;
  • the second acquisition subunit 232 is configured to acquire second confidences of the multiple third candidate frames, where the second confidence includes a first probability that the third candidate frame is the action target frame and a second probability that the third candidate frame is not the action target frame;
  • the determining subunit 233 is configured to determine an area overlap degree between the plurality of third candidate frames and the first action supervision frame
  • the selecting subunit 234 is configured to: if the area coincidence degree is greater than or equal to a second threshold, take the second confidence degree of the third candidate frame corresponding to the area coincidence degree as the first Probability; if the area coincidence degree is less than the second threshold, taking the second confidence degree of the third candidate frame corresponding to the area coincidence degree as the second probability;
  • the removing sub-unit 235 is configured to remove the plurality of third candidate frames with the second confidence level less than the first threshold, to obtain a plurality of fourth candidate frames;
  • the adjustment subunit 236 is configured to adjust the position and size of the fourth candidate frame to obtain the action target frame.
  • According to the action characteristics, this embodiment recognizes fine actions related to the face of a person in the vehicle, for example dangerous driving actions of a driver that involve the hands and the face.
  • Some actions made by a driver that are merely similar to dangerous driving actions can easily interfere with the neural network and affect the subsequent classification and recognition of actions; this not only reduces the accuracy of the action recognition results but also causes the user experience to decline sharply.
  • In this embodiment, positive sample images and negative sample images are both used as sample images for training the neural network; training is supervised by the loss functions, and the network parameters of the neural network (in particular the weight parameters of the feature extraction branch and of the candidate frame extraction branch) are updated by gradient back-propagation until training is complete. The feature extraction branch of the trained neural network can therefore accurately extract the features of dangerous driving actions, and the candidate frame extraction branch of the neural network automatically removes candidate frames containing actions that are merely similar to the predetermined actions (such as dangerous driving actions), which greatly reduces the false detection rate of dangerous driving actions.
  • The candidate frames are pooled and adjusted to a predetermined size, which greatly reduces the amount of subsequent computation and speeds up processing; the candidate frames are refined by the detection frame refinement branch of the neural network, so that the action target frame obtained after refinement contains only the features of the predetermined actions (such as dangerous driving actions).
  • FIG. 11 is a schematic structural diagram of a driving motion analysis device according to an embodiment of the present application.
  • the analysis device 3000 includes a vehicle-mounted camera 31, a first acquisition unit 32, and a generation unit 33. among them:
  • the vehicle-mounted camera 31 is configured to collect a video stream including a driver's face image
  • the first acquiring unit 32 is configured to acquire a motion recognition result of at least one frame of an image in the video stream through the motion recognition device according to the foregoing embodiment of the present application;
  • the generating unit 33 is configured to generate distraction or dangerous driving prompt information in response to a result of a motion recognition meeting a predetermined condition.
  • the predetermined condition includes at least one of the following: the occurrence of a specific predetermined action; the number of times the specific predetermined action occurs within a predetermined time period; and the duration of occurrence and maintenance of the specific predetermined action in the video stream .
  • Optionally, the analysis device 3000 further includes a second acquisition unit 34 configured to acquire the vehicle speed of the vehicle provided with the vehicle-mounted camera; the generation unit 33 is further configured to generate distraction or dangerous driving prompt information in response to the vehicle speed being greater than a set threshold and the action recognition result satisfying the predetermined condition.
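  • One possible way to combine the predetermined conditions listed earlier (occurrence of a specific predetermined action, its occurrence count within a time window, its duration) with the vehicle-speed gate described here is sketched below; every threshold value is an illustrative assumption.

```python
def should_warn(events, vehicle_speed, now,
                speed_threshold=5.0, window_s=60.0,
                count_threshold=3, duration_threshold=3.0):
    """events: list of (timestamp, duration_s) at which a specific predetermined action
    was recognized in the video stream. All threshold values are illustrative, not
    figures taken from the patent."""
    if vehicle_speed <= speed_threshold:
        return False                       # only warn while the vehicle is moving
    recent = [(t, d) for t, d in events if now - t <= window_s]
    if not recent:
        return False                       # the specific action did not occur recently
    occurs_often = len(recent) >= count_threshold
    lasts_long = max(d for _, d in recent) >= duration_threshold
    return occurs_often or lasts_long
```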
  • a video is taken of a driver through a vehicle-mounted camera, and each frame of the captured video is used as an image to be processed.
  • Each frame captured by the camera is recognized to obtain a corresponding recognition result, and the driver's actions are recognized by combining the results of multiple consecutive frames.
  • When a dangerous driving action is detected, the driver may be warned through the display terminal, and the type of the dangerous driving action may be indicated.
  • Warnings may be issued by popping up a dialog box with a text prompt, or by playing built-in voice data.
  • FIG. 12 is a schematic diagram of a hardware structure of an electronic device according to an embodiment of the present application.
  • the electronic device 4000 includes a memory 44 and a processor 41.
  • the memory 44 stores computer-executable instructions.
  • When the processor 41 runs the computer-executable instructions stored in the memory 44, the action recognition method or the driving action analysis method described in the embodiments of the present application is implemented.
  • the electronic device may further include an input device 42 and an output device 43.
  • the input device 42, the output device 43, the memory 44 and the processor 41 can be connected to each other through a bus.
  • The memory may include a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM), or a portable read-only memory such as a CD-ROM.
  • the input device is used to input data and / or signals
  • the output device is used to output data and / or signals.
  • the output device and the input device may be independent devices or an integrated device.
  • the processor may include one or more processors, for example, one or more central processing units (CPUs). When the processor is one CPU, the CPU may be a single-core CPU, or may be Multi-core CPU.
  • the processor may also include one or more special-purpose processors, and the special-purpose processors may include GPUs, FPGAs, and the like, for performing accelerated processing.
  • the memory is used to store program code and data of a network device.
  • the processor is configured to call program code and data in the memory and execute steps in the foregoing method embodiments. For details, refer to the description in the method embodiment, and details are not described herein again.
  • FIG. 12 only shows a simplified design of the electronic device.
  • The electronic device may also include other necessary components, including but not limited to any number of input/output devices, processors, controllers, memories, and so on, and all electronic devices that can implement the embodiments of this application fall within the protection scope of the embodiments of the present application.
  • An embodiment of the present application further provides a computer storage medium for storing computer-readable instructions that, when executed, implement operations of the action recognition method of any of the foregoing embodiments of the present application, or when the instructions are executed The operation of the driving action analysis method of any one of the foregoing embodiments of the present application is implemented.
  • An embodiment of the present application further provides a computer program including computer-readable instructions. When the computer-readable instructions run in a device, a processor in the device executes executable instructions for implementing the steps of the action recognition method of any of the foregoing embodiments of the present application, or executes executable instructions for implementing the steps of the driving action analysis method of any of the foregoing embodiments of the present application.
  • the disclosed device and method may be implemented in other ways.
  • The device embodiments described above are only illustrative. The division into units is only a division by logical function; in actual implementation there may be other division manners, for example multiple units or components may be combined or integrated into another system, or some features may be ignored or not implemented.
  • In addition, the mutual coupling, direct coupling, or communication connection between the components shown or discussed may be implemented through some interfaces, and the indirect coupling or communication connection between devices or units may be electrical, mechanical, or in other forms.
  • the units described above as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units, which may be located in one place or distributed to multiple network units; Some or all of the units may be selected according to actual needs to achieve the objective of the solution of this embodiment.
  • The functional units in the embodiments of the present invention may all be integrated into one processing unit, or each unit may serve as a separate unit, or two or more units may be integrated into one unit; the integrated unit may be implemented in the form of hardware, or in the form of hardware plus software functional units.
  • The foregoing program may be stored in a computer-readable storage medium; when the program is executed, the steps of the foregoing method embodiments are performed.
  • the foregoing storage medium includes: various types of media that can store program codes, such as a mobile storage device, a ROM, a RAM, a magnetic disk, or an optical disc.
  • If the above integrated unit of the present invention is implemented in the form of a software functional module and sold or used as an independent product, it may also be stored in a computer-readable storage medium. Based on this understanding, the computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or part of the methods described in the embodiments of the present invention.
  • the foregoing storage medium includes: various types of media that can store program codes, such as a mobile storage device, a ROM, a RAM, a magnetic disk, or an optical disc.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Human Computer Interaction (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Mathematical Physics (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Biophysics (AREA)
  • Social Psychology (AREA)
  • Psychiatry (AREA)
  • Mechanical Engineering (AREA)
  • Automation & Control Theory (AREA)
  • Transportation (AREA)
  • Biodiversity & Conservation Biology (AREA)
  • Signal Processing (AREA)
  • Image Analysis (AREA)
  • Traffic Control Systems (AREA)

Abstract

An action recognition method, a driving action analysis method and apparatus, and an electronic device. The method includes: extracting features from an image including a human face (101); extracting, based on the features, multiple candidate frames that may include a predetermined action (102); determining an action target frame based on the multiple candidate frames, where the action target frame includes a local area of the human face and an action interactor (103); and classifying the predetermined action based on the action target frame to obtain an action recognition result (104).

Description

Action recognition and driving action analysis methods and apparatuses, and electronic device
Cross-reference to related applications
This application is filed on the basis of, and claims priority to, Chinese patent application No. 201811130798.6 filed on September 27, 2018, the entire contents of which are incorporated herein by reference.
Technical field
This application relates to the field of image processing technology, and in particular to an action recognition method, a driving action analysis method and apparatus, and an electronic device.
Background
Action recognition has become a very popular applied research direction in recent years and can be found in many fields and products. This technology also represents a development trend of future human-computer interaction and has broad application prospects, in particular in the field of driver monitoring.
Summary
The embodiments of the present application provide an action recognition technical solution and a driving action analysis technical solution.
In a first aspect, the embodiments of the present application provide an action recognition method, including: extracting features of an image including a human face; determining, based on the features, multiple candidate frames that may include a predetermined action; determining an action target frame based on the multiple candidate frames, where the action target frame includes a local area of the human face and an action interactor; and classifying the predetermined action based on the action target frame to obtain an action recognition result.
In a second aspect, the embodiments of the present application provide a driving action analysis method, including: collecting, via a vehicle-mounted camera, a video stream including face images of a driver; obtaining an action recognition result of at least one frame of image in the video stream through any implementation of the action recognition method of the embodiments of the present application; and generating dangerous driving prompt information in response to the action recognition result satisfying a predetermined condition.
In a third aspect, the embodiments of the present application provide an action recognition apparatus, including: a first extraction unit configured to extract features of an image including a human face; a second extraction unit configured to determine, based on the features, multiple candidate frames that may include a predetermined action; a determination unit configured to determine an action target frame based on the multiple candidate frames, where the action target frame includes a local area of the human face and an action interactor; and a classification unit configured to classify the predetermined action based on the action target frame to obtain an action recognition result.
In a fourth aspect, the embodiments of the present application provide a driving action analysis apparatus, including: a vehicle-mounted camera configured to collect a video stream including face images of a driver; an acquisition unit configured to obtain an action recognition result of at least one frame of image in the video stream through any implementation of the action recognition apparatus of the embodiments of the present application; and a generation unit configured to generate dangerous driving prompt information in response to the action recognition result satisfying a predetermined condition.
In a fifth aspect, the embodiments of the present application provide an electronic device including a memory and a processor, where the memory stores computer-executable instructions and the processor implements the method of the first aspect or the second aspect when running the computer-executable instructions on the memory.
In a sixth aspect, the embodiments of the present application provide a computer-readable storage medium storing instructions that, when run on a computer, cause the computer to execute the method of the first aspect or the second aspect.
In a seventh aspect, the embodiments of the present application provide a computer program including computer instructions that, when run in a processor of a device, implement the method of the first aspect or the second aspect.
In the embodiments of the present application, features are extracted from an image containing a human face, multiple candidate frames that may include a predetermined action are determined based on the extracted features, an action target frame is determined based on the multiple candidate frames, and the predetermined action is then classified based on the action target frame to obtain an action recognition result. Since the action target frame in the embodiments of the present application includes a local area of the human face and an action interactor, when the predetermined action is classified based on the action target frame, the action corresponding to the local area of the face and the action interactor is treated as a whole, instead of splitting the human body part and the action interactor apart, and classification is performed based on the features corresponding to that whole. Fine actions, especially fine actions in or near the face region, can therefore be recognized, improving the accuracy and precision of action recognition.
Brief description of the drawings
To describe the technical solutions in the embodiments of the present application or in the background more clearly, the accompanying drawings required for the embodiments or the background are described below.
FIG. 1 is a schematic flowchart of an action recognition method according to an embodiment of the present application;
FIG. 2 is a schematic diagram of a target action frame according to an embodiment of the present application;
FIG. 3 is a schematic flowchart of another action recognition method according to an embodiment of the present application;
FIG. 4 is a schematic diagram of a negative sample image containing actions similar to predetermined actions according to an embodiment of the present application;
FIG. 5 is a schematic flowchart of a driving action analysis method according to an embodiment of the present application;
FIG. 6 is a schematic flowchart of a neural network training method according to an embodiment of the present application;
FIG. 7 is a schematic diagram of an action supervision frame for drinking water according to an embodiment of the present application;
FIG. 8 is a schematic diagram of an action supervision frame for making a phone call according to an embodiment of the present application;
FIG. 9 is a schematic structural diagram of an action recognition apparatus according to an embodiment of the present application;
FIG. 10 is a schematic structural diagram of a training component of a neural network according to an embodiment of the present application;
FIG. 11 is a schematic structural diagram of a driving action analysis apparatus according to an embodiment of the present application;
FIG. 12 is a schematic diagram of a hardware structure of an electronic device according to an embodiment of the present application.
Detailed description
The embodiments of the present application are described below with reference to the accompanying drawings of the embodiments of the present application.
图1是本申请实施例提供的一种动作识别方法的流程示意图,如图1所示,所述方法包括:
101、提取包括有人脸的图像中的特征。
本申请实施例主要针对车内人员的动作进行识别。以驾驶员为例,本申请实施例可对车辆驾驶员在驾驶车辆时所做的一些驾驶动作进行识别,可根据识别结果对驾驶员给出提醒。发明人在实现本申请实施例过程中发现,由于车内人员某些与人脸有关的精细动作,例如,驾驶员喝水、驾驶员打电话等,这些动作的识别很难甚至无法通过对人体关键点的检测或人体姿态的估计实现。本申请实施例通过对待处理图像进行特征提取,并根据提取到的特征实现待处理图像中动作的识别。上述的动作可以为:手部区域的动作和/或人脸局部区域的动作、针对动作交互物的动作等等,因此,需通过车载摄像头对车内人员进行图像采集,获取包括有人脸的待处理图像。再对待处理图像进行卷积操作,提取出动作特征。
在本申请的一种可选实施例中,所述方法还包括:经车载摄像头拍摄位于车内的人的包括有人脸的图像。其中,所述车内的人包括以下至少之一:所述车的驾驶区的驾驶员、所述车的副驾驶区的人、所述车的后排座椅上的人。
其中,所述车载摄像头可以为:红绿蓝(RGB)摄像头、红外摄像头或近红外摄像头。
102、基于所述特征确定可能包括预定动作的多个候选框。
本申请实施例主要针对车内人员的预定动作进行识别,以车内人员为驾驶员为例,所述预定动作例如可以是对应于驾驶员的危险驾驶的预定动作,或者针对驾驶员的某些危险动作的预定动作。在一种可选的实施方式中,首先对上述预定动作的特征进行定义,再通过神经网络根据定义的特征和提取到的图像中的特征,实现对图像中是否存在预定动作的判断,在判定图像中存在预定动作的情况下,确定图像中包括预定动作的多个候选框。
本实施例中的神经网络均是训练好的,即通过神经网络可提取图像中的预定动作的特征。在本申请的一种可选实施例中,所述神经网络可设置多层卷积层,通过多层卷积层提取图像中更为丰富的信息,由此提高预定动作的判定准确率。
本实施例中,若上述提取的特征对应于:手部区域、人脸局部区域、动作交互物对应区域等至少一种区域,则通过神经网络的特征提取处理获得包含有手部区域和人脸局部区域的特征区域,基于所述特征区域确定候选区域,通过候选框标识出所述候选区域;其中,所述候选框例如可以通过矩形框表示。同理,通过另一个候选框标识出包含手部区域、人脸局部区域和动作交互物对应区域的特征区域。这样,通过提取对应于预定动作的特征,获得多个候选区域;根据多个候选区域,确 定多个候选框。
103、基于所述多个候选框确定动作目标框,其中,所述动作目标框包括人脸的局部区域和动作交互物。
本申请实施例识别的动作均为与人脸有关的精细动作,这类与人脸有关的精细动作的识别很难甚至无法通过对人体关键点的检测实现,而这类精细动作对应区域都至少包括人脸的局部区域和动作交互物对应区域这两个区域,例如,包括人脸的局部区域和动作交互物对应区域,或者,包括人脸的局部区域、动作交互物对应区域以及手部区域,等等,因此,通过识别由多个候选框得到的动作目标框内的特征即可实现对这类精细动作的识别。
在本申请的一种可选实施例中,所述人脸的局部区域,包括以下至少之一:嘴部区域、耳部区域、眼部区域。所述动作交互物,包括以下至少之一:容器、烟、手机、食物、工具、饮料瓶、眼镜、口罩。
在本申请的一种可选实施例中,所述动作目标框还包括:手部区域。
例如,如图2所示的目标动作框内包括:局部人脸、手机(即动作交互物)以及手。又例如,对于抽烟动作,目标动作框内也可能包括:嘴部和烟(即动作交互物)。
本实施例中,由于候选框中可能包含除预定动作对应的特征以外的特征,或没有包含预定动作对应的所有特征(指任意一个预定动作的所有特征),这样都会影响最终的动作识别结果。因此,为保证最终识别结果的精度,需要对候选框的位置进行调整,即基于多个候选框确定动作目标框,所述动作目标框的位置和大小与多个候选框中的至少部分候选框的位置和大小可能存在偏差。如图2所示,可根据预定动作对应的特征的位置和大小,确定对应的候选框的位置偏移量和缩放倍数,再根据位置偏移量和缩放倍数调整候选框的位置和大小,使得调整后的动作目标框内仅包括预定动作对应的特征,并且包括预定动作对应的所有特征。基于此,通过对各个候选框的位置和大小的调整,将调整后的候选框确定为动作目标框。可以理解,调整后的多个候选框可以重叠为一个候选框,则重叠的候选框确定为动作目标框。
104、基于所述动作目标框进行预定动作的分类,获得动作识别结果。
在本申请的一种可选实施例中,所述预定动作包括以下至少之一:打电话、抽烟、喝水/饮料、进食、使用工具、戴眼镜、化妆。
本实施例中,可基于所述动作目标框内包含的预定动作对应的特征对预定动作进行分类。作为一种实施方式,可通过用于动作分类的神经网络对所述动作目标框内包含的预定动作对应的特征进行分类处理,获得特征对应的预定动作的分类识别结果。
采用本申请实施例的动作识别方法,通过对包含有人脸的图像中的特征进行提取,基于提取的特征确定可能包括预定动作的多个候选框,再基于多个候选框确定动作目标框,基于目标动作快进行预定动作的分类。由于本申请实施例中所述动作目标框包括人脸的局部区域和动作交互物,因此,在基于动作目标框对预定动作进行分类的过程中,是将对应于人脸的局部区域和动作交互物的动作作为整体,而不是割裂人体部位和动作交互物,并基于该整体对应的特征进行分类,因此可实现对精细动作的识别,特别是对人脸区域或人脸区域附近的精细动作的识别,提高识别的准确度和精度。
图3是本申请实施例提供的另一种动作识别方法的流程示意图,如图3所示,所述方法包括:
301、获取待处理图像,所述待处理图像中包括有人脸。
在本申请的一种可选实施例中,所述获取待处理图像,可包括:通过车载摄像头对车内的人进行拍照获取待处理图像,也可通过车载摄像头对车内的人进行视频拍摄,并以拍摄的视频的帧图像作为待处理图像。其中,所述车内的人包括以下至少之一:所述车的驾驶区的驾驶员、所述车的副驾驶区的人、所述车的后排座椅上的人。上述车载摄像头可以为:RGB摄像头、红外摄像头或近红外摄像头。
RGB摄像头由三根不同的线缆给出了三个基本彩色成分,这种类型的摄像头通常是用三个独立的电荷耦合元件(CCD,Charge Coupled Device)传感器来获取三种彩色信号,RGB摄像头经常被用来做非常精确的彩色图像采集。
现实环境的光线复杂,车辆内的光线复杂程度更甚,而光照强度会直接影响拍摄质量,尤其是当车内光照强度较低时,普通的摄像头无法采集到清晰的照片或视频,使图像或视频丢失一部分有用的信息,进而影响后续的处理。而红外摄像头可向被拍摄的物体发射红外光,再根据红外光反射的光线进行成像,可解决普通摄像头在暗光或黑暗条件下拍摄的图像质量较低或无法正常拍摄的问题。基于此,本实施例中,可设置有普通摄像头或红外摄像头,在光线强度高于预设值的情况下,通过普通摄像头获取待处理图像;在光线强度低于预设值的情况下,通过红外摄像头获取待处理图 像。
302、经神经网络的特征提取分支提取所述待处理图像中的特征,获得特征图。
在本申请的一种可选实施例中,通过神经网络的特征提取分支对待处理图像进行卷积操作,获得特征图。
在一示例中,通过神经网络的特征提取分支对待处理图像进行卷积操作,是利用卷积核在待处理图像上“滑动”。例如,在卷积核对应于图像某像素点时,将该像素点的灰度值与卷积核上的各数值相乘,将所有乘积加和后作为卷积核对应的所述像素点的灰度值,进一步将卷积核“滑动”至下一个像素点,以此类推,最终完成所述待处理图像中的所有像素点卷积处理,获得特征图。
需要理解的是,本实施例的神经网络的特征提取分支可包括多层卷积层,上一层卷积层通过特征提取获得的特征图可作为下一层卷积层的输入数据,通过多层卷积层提取图像中更为丰富的信息,由此提高特征提取的准确率。通过包括多层卷积层的神经网络的特征提取分支对待处理图像进行逐级的卷积操作,可获得与待处理图像相对应的特征图。
303、经上述神经网络的候选框提取分支在上述特征图上确定可能包括预定动作的多个候选框。
本实施例中,通过神经网络的候选框提取分支对特征图的处理,确定可能包括预定动作的多个候选框。例如,特征图中可包括:手、烟、水杯、手机、眼镜、口罩、人脸局部区域对应的特征中的至少一种特征,基于所述至少一种特征确定多个候选框。需要说明的是,虽然步骤302中,通过神经网络的特征提取分支能够提取出待处理图像的特征,但提取出的特征可能包括预定动作对应的特征之外的其他特征,因此,这里通过神经网络的候选框提取分支确定的多个候选框中,可能存在至少部分候选框中包含了除预定动作对应的特征以外的其他特征,或者并没有包含预定动作对应的所有特征,因此,所述多个候选框可能包括了预定动作。
需要理解的是,本实施例的神经网络的候选框提取分支可包括多层卷积层,上一层卷积层提取到的特征将作为下一层卷积层的输入数据,通过多层卷积层提取更为丰富的信息,由此提高提取的特征提取的准确率。
在本申请的一种可选实施例中,所述经所述神经网络的候选框提取分支在所述特征图上确定可能包括预定动作的多个候选框,包括:根据所述预定动作的特征对所述特征图中的特征进行划分,获得多个候选区域;根据所述多个候选区域,获得多个候选框和所述多个候选框中每个候选框的第一置信度,其中,所述第一置信度为所述候选框为所述动作目标框的概率。
本实施例中,神经网络的候选框提取分支识别所述特征图,将特征图中包含有手部特征和人脸局部区域对应特征、或者包含有手部特征、动作交互物对应特征(例如手机对应特征)和人脸局部区域对应特征从特征图中划分出,基于划分出的特征确定候选区域,通过候选框(所述候选框例如矩形框)标识出所述候选区域。这样,得到通过候选框标识出的多个候选区域。
本实施例中,神经网络的候选框提取分支还可以确定每个候选框对应的第一置信度,所述第一置信度用于以概率的形式表示候选框为目标动作框的可能性。通过神经网络的候选框提取分支对特征图的处理,在获多个候选框的同时,还获得多个候选框中每个候选框的第一置信度。需要理解的是,所述第一置信度为神经网络的候选框提取分支根据候选框中的特征得到的候选框为目标动作框的预测值。
304:经所述神经网络的检测框精修分支、基于所述多个候选框确定动作目标框;其中,所述动作目标框包括人脸的局部区域和动作交互物。
在本申请的一种可选实施例中,所述经所述神经网络的检测框精修分支、基于所述多个候选框确定动作目标框,包括:经所述神经网络的检测框精修分支去除第一置信度小于第一阈值的候选框,获得至少一个第一候选框;池化处理所述至少一个第一候选框,获得至少一个第二候选框;根据所述至少一个第二候选框,确定动作目标框。
本实施例中,由于在获得候选框的过程中,一些与预定动作很相似的动作会给神经网络的候选框提取分支带来很大的干扰。如图4中从左至右的子图片中,目标对象依次做了与打电话、喝水和抽烟等动作,这些动作比较相似,都是将右手分别放置在脸旁,但目标对象手里并没手机、水杯和烟,而神经网络容易误将目标对象的这些动作识别为打电话、喝水和抽烟。而在预定动作为预定的危险驾驶动作的情况下,驾驶员在驾驶车辆的过程中,会出现例如:因为耳部区域瘙痒的原因而做挠耳朵的动作、或者因为其他原因做张嘴或手搭着嘴唇的动作,显然,这些动作并不属于预定的危险驾驶动作,但这些动作会给神经网络的候选框提取分支在提取候选框过程带来很大的干扰,进而影响后续对动作的分类,获得错误的动作识别结果。
本申请实施例通过预先训练获得神经网络的检测框精修分支去除第一置信度小于第一阈值的候 选框,得到至少一个第一候选框;所述至少一个第一候选框的第一置信度均大于等于第一阈值。其中,若候选框的第一置信度小于第一阈值,则表明该候选框为上述与相似动作的候选框,需要将该候选框去除,从而能够高效的区分预定动作和相似动作,进而降低误检测率,大大提高动作识别结果的准确率。其中,上述第一阈值例如可取0.5,当然,本申请实施例中所述第一阈值的取值不限于此。
在本申请的一种可选实施例中,所述池化处理所述至少一个第一候选框,获得至少一个第二候选框,包括:池化处理所述至少一个第一候选框,获得与所述至少一个第一候选框对应的至少一第一特征区域;基于每个第一特征区域对相对应的第一候选框的位置和大小进行调整,获得至少一个第二候选框。
本实施例中,第一候选框所在区域中的特征的数量可能较多,若直接使用第一候选框所在区域中的特征将产生巨大的计算量。因此,在对第一候选框所在区域中的特征进行后续处理之前,先池化处理第一候选框,即池化处理第一候选框所在区域中的特征,降低第一候选框所在区域中的特征的维度,以满足后续处理过程中对计算量的需要,大大减小后续处理的计算量。同步骤303中获得候选区域相似,根据预定动作的特征对上述池化处理后的特征进行划分,获得多个第一特征区域。可以理解,本实施例通过对第一候选框对应的区域进行池化处理,将第一特征区域中、对应于预定动作的特征以低维度的形式呈现。
作为一种示例,池化处理的具体实现过程可参见下示例:假设第一候选框的大小表示为h*w,其中,h可表示第一候选框的高度,w可表示第一候选框的宽度;当期望得到的特征的目标大小为H*W时,可将该第一候选框划分成H*W个格子,每个格子的大小可表示为(h/H)*(w/W),再计算每一个格子中的像素点的平均灰度值或确定每个格子中的最大灰度值,将所述平均灰度值或所述最大灰度值作为每个格子对应的取值,从而得到第一候选框的池化处理结果。
在本申请的一种可选实施例中,所述基于每个第一特征区域对相对应的第一候选框的位置和大小进行调整,获得至少一个第二候选框,包括:基于所述第一特征区域中对应于所述预定动作的特征,获得与所述预定动作的特征对应的第一动作特征框;以及根据所述第一动作特征框的几何中心坐标,获得所述至少一个第一候选框的第一位置偏移量;以及根据所述第一动作特征框的大小,获得所述至少一个第一候选框的第一缩放倍数;以及根据至少一个第一位置偏移量和至少一个第一缩放倍数分别对所述至少一个第一候选框的位置和大小进行调整,获得至少一个第二候选框。
本实施例中,为方便后续处理,将第一特征区域中的对应于每一个预定动作的特征分别通过第一动作特征框标识出,所述第一动作特征框具体可以是矩形框,例如,通过矩形框标识出第一特征区域中的对应于每一个预定动作的特征。
本实施例中,获取第一动作特征框在预先建立的XOY坐标系下的几何中心坐标,根据几何中心坐标确定所述第一动作特征框对应的第一候选框的第一位置偏移量;其中,XOY坐标系通常是设定坐标原点O,以水平方向作为X轴,以垂直于X轴的方向作为Y轴建立的坐标系。由于第一动作特征框是基于预定动作的特征从第一特征区域中确定的,第一特征区域是基于预定动作的特征从第一候选框划分确定的,因此第一动作特征框的几何中心与第一候选框的几何中心通常存在一定的偏差,根据所述偏差确定第一候选框的第一位置偏移量。作为一种示例,可将第一动作特征框的几何中心与对应于相同预定动作的特征的第一候选框的几何中心之间的偏移量作为所述第一候选框的第一位置偏移量。
其中,在对应于相同预定动作的特征的第一候选框的数量为多个的情况下,每个第一候选框对应有第一位置偏移量,所述第一位置偏移量包括X轴方向的位置偏移量和Y轴方向的偏移量。其中,作为一种示例XOY坐标系为以第一特征区域的左上角(以输入神经网络的候选框精修分支的方位为准)为坐标原点,水平向右为X轴的正方向,竖直向下为Y轴的正方向。在其他实例中,还可以以第一特征区域的左下角、右上角、右下角或第一特征区域的中心点为原点,水平向右为X轴的正方向,竖直向下为Y轴的正方向。
本实施例中,获取第一动作特征框的尺寸,具体获取第一动作特征框的长度和宽度,根据第一动作特征框的长度和宽度确定对应的第一候选框的第一缩放倍数。在一示例中,可基于第一动作特征框的长度和宽度和对应的第一候选框的长度和宽度确定所述第一候选框的第一缩放倍数。其中,每个第一候选框均对应有第一缩放倍数,不同的第一候选框的第一缩放倍数可相同或不同。
本实施例中,根据每个第一候选框对应的第一位置偏移量和第一缩放倍数对所述第一候选框位置和大小进行调整。作为一种实施方式,将第一候选框按上述第一位置偏移量进行移动,并且将第一候选框以几何中心为中心、按照第一缩放倍数对尺寸进行调整,获得第二候选框。需要理解的是, 第二候选框的数量与第一候选框的数量一致。通过上述方式获得的第二候选框,将以尽可能小的尺寸包含预定动作的所有特征,有利于提高后续动作分类结果的精度。
本实施例中,可将多个第二候选框中尺寸相近以及几何中心之间的相近的第二候选框这合并为,合并后的第二候选框作为动作目标框。需要理解的是,对应于同一预定动作的第二候选框的尺寸和几何中心之间的距离可能非常接近,因此,针对每个预定动作,可对应一个动作目标框。
作为一种示例:驾驶员在打电话的同时还在抽烟,因此获得的待处理图像中可包含打电话和抽烟两个预定动作对应的特征。经过上述处理方式,可得到包括对应于打电话的预定动作的特征的候选框,所述候选框中包括手部、手机和人脸局部区域,还可得到包括对应于抽烟的预定动作的特征的候选框,所述候选框中包括手部、香烟和人脸局部区域。虽然对应于打电话的预定动作的候选框和对应于抽烟的预定动作的候选框都可能有多个,但所有对应于打电话的预定动作的候选框的尺寸和几何中心之间的距离都相近,所有对应于抽烟的预定动作的候选框的尺寸和几何中心之间的距离都相近,而且任一对应于打电话的预定动作的候选框的尺寸和任一对应于抽烟的预定动作的候选框的尺寸的差值,大于任意两个对应于打电话的预定动作的候选框之间的尺寸差值,也大于任意两个对应于抽烟的预定动作的候选框之间的尺寸差值,任一对应于打电话的预定动作的候选框与任一对应于抽烟的预定动作的候选框之间的几何中心之间的距离大于任意两个对应于打电话的预定动作的候选框之间的几何中心之间的距离,也大于任意两个对应于抽烟的预定动作的候选框之间的几何中心之间的距离。将所有对应于打电话的预定动作的候选框合并,得到一个动作目标框,将所有对应于抽烟的预定动作的候选框合并,得到另一个动作目标框。这样,对应于两个预定动作,分别得到两个动作目标框。
305、经所述神经网络的动作分类分支获取上述特征图上与上述动作目标框对应的区域图,基于所述区域图进行预定动作的分类,获得动作识别结果。
本实施例中,神经网络的动作分类分支根据从特征图中划分出与所述动作目标动作框对应的区域,得到区域图,基于所述区域图内的特征进行预定动作的分类,得到第一动作识别结果;根据所有目标动作框对应的第一动作识别结果,得到待处理图像对应的动作识别结果。
在本申请的一种可选实施例中,一方面,通过神经网络的动作分类分支获得第一动作识别结果,另一方面,通过神经网络的动作分类分支还可获得所述第一动作识别结果的第二置信度,所述第二置信度表征所述动作识别结果的准确率。则所述根据所有目标动作框对应的第一动作识别结果,得到待处理图像所对应的动作识别结果,包括:比较每个目标动作框对应的第一动作识别结果的第二置信度和预设阈值,获得第二置信度大于所述预设阈值的第一动作结果,基于第二置信度大于所述预设阈值的第一动作结果确定所述待处理图像对应的动作识别结果。
例如,通过车载摄像头对驾驶员进行拍摄,获得包括有驾驶员的人脸的图像,并将其作为待处理图像输入神经网络。假设待处理图像中的驾驶员对应有“打电话”的动作,通过神经网络的处理获得两个动作识别结果:“打电话”的动作识别结果和“喝水”的动作识别结果,其中,“打电话”的动作识别结果的第二置信度为0.8,“喝水”的动作识别结果的第二置信度为0.4。若设置的预设阈值为0.6,则可确定所述待处理图像的动作识别结果为“打电话”动作。
本实施例中,在动作识别结果为特定预定动作的情况下,所述方法还可包括:输出提醒信息。其中,所述特定预定动作可以是危险驾驶动作,所述危险驾驶动作为驾驶员在驾驶车辆过程中会对驾驶过程带来危险事件的动作。所述危险驾驶动作可以是驾驶员自身产生的动作,也可以是位于驾驶舱内的其他人员产生的动作。其中,所述输出提醒信息可以是通过音频、视频、文字中的至少一种方式输出提醒信息。例如,可通过终端对车内人员(例如驾驶员和/或车内其他人员)输出提示信息,输出提示信息的方式可以是:通过终端显示文字的方式进行提示、通过终端输出语音数据的方式进行提示等等。其中,所述终端可以为车载终端,可选的,终端可配备有显示屏和/或音频输出功能。
其中,若特定预定动作为:喝水、打电话、戴眼镜等等。当通过神经网络获得的动作识别结果为上述特定预定动作中的任意一个或多个动作,则输出提示信息,还可输出特定预定动作(例如危险驾驶动作)的类别。在未检测到有特定预定动作的情况下,可不输出提示信息,或者也可输出预定动作的类别。
作为一种示例,在获得的动作识别结果包含有特定预定动作(例如危险驾驶动作)的情况下,可通过抬头数字显示仪(head up display,HUD)显示对话框,通过显示的内容对驾驶员发出提示信息;还可通过车辆内置的音频输出功能输出提示信息,例如可输出:“请驾驶员注意驾驶动作”等音频信息;还可通过释放具有的醒脑提神功效的气体的方式输出提示信息,例如:通过车载喷头喷出 花露水喷雾,花露水的气味清香怡人,在对驾驶员进行提示的同时,还能起到醒脑提神的效果;还可通座椅释放出低电流刺激驾驶员的方式输出提示信息,以达到提示和警告的效果。
本申请实施例通过神经网络的特征提取分支对待处理图像进行特征提取,通过神经网络的候选框提取分支根据提取出的特征获得可能包括预定动作的候选框,通过神经网络的检测框精修分支确定动作目标框,最后通过神经网络的动作分类分支对目标动作框中的特征进行预定动作的分类,得到待处理图像的动作识别结果;整个识别过程通过提取待处理图像中的特征(例如手部区域、人脸局部区域、动作交互物对应区域的特征提取),并对其进行处理,可自主、快速的实现对精细动作的精确识别。
本申请实施例还提供了一种驾驶动作分析方法。图5为本申请实施例提供的一种驾驶动作分析方法的流程示意图;如图5所示,所述方法包括:
401:经车载摄像头采集包括有驾驶员人脸图像的视频流;
402:获取所述视频流中至少一帧图像的动作识别结果;
403:响应于动作识别结果满足预定条件,生成危险驾驶提示信息。
本实施例中,通过车载摄像头对驾驶员进行视频拍摄,获得视频流,并以视频流的每一帧图像作为待处理图像。通过对每一帧图像进行动作识别,获得相应的动作识别结果,再结合连续多帧图像的动作识别结果对驾驶员的驾驶状态进行识别,确定驾驶状态是否为对应于危险驾驶动作的危险驾驶状态。其中,对多帧图像的动作识别的处理过程参照上述实施例中所述,这里不再赘述。
在本申请的一种可选实施例中,所述预定条件包括以下至少之一:出现特定预定动作;在预定时长内出现特定预定动作的次数;所述视频流中特定预定动作出现维持的时长。
本实施例中,所述特定预定动作可以为前述实施例中预定动作的分类中对应于危险驾驶动作的预定动作,例如对应于驾驶员的喝水动作、打电话动作等等。则所述响应于动作识别结果满足预定条件可包括:在动作识别结果中包括特定预定动作的情况下,确定动作识别结果满足预定条件;或者,在动作识别结果中包括特定预定动作、且预定时长内所述特定预定动作出现的次数达到预设数量的情况下,确定动作识别结果满足预定条件;或者,在动作识别结果中包括特定预定动作、且在所述视频流中所述特定预定动作出现的时长达到预设时长的情况下,确定动作识别结果满足预定条件。
例如,当检测到驾驶员正在进行喝水、打电话、戴眼镜中的任意一个动作时,可通过车载终端生成并输出危险驾驶提示信息,还可以输出特定预定动作的类别。其中,输出危险驾驶提示信息的方式可包括:通过车载终端显示文字的方式输出危险驾驶提示信息、通过车载终端的音频输出功能输出危险驾驶提示信息。
在本申请的一种可选实施例中,所述方法还包括:获取设置有车载双摄像头的车辆的车速;所述响应于动作识别结果满足预定条件,生成危险驾驶提示信息,包括:响应于所述车速大于设定阈值且所述动作识别结果满足所述预定条件,生成危险驾驶提示信息。
本实施例中,针对车速不大于设定阈值的情况下,即使动作识别结果满足所述预设条件,也可不生成并输出危险驾驶提示信息。仅在车速大于设定阈值的情况下,在动作识别结果满足所述预设条件时,生成并输出危险驾驶提示信息。
本实施例中,通过车载摄像头对驾驶员进行视频拍摄,并以拍摄的视频的每一帧画面作为待处理图像。通过对摄像头拍摄的每一帧画面进行识别,获得相应的识别结果,再结合连续多帧画面的结果对驾驶员的动作进行识别。当检测到驾驶员正在进行喝水、打电话、戴眼镜中的任意一个动作时,可通过显示终端对驾驶员提出警告,并提示危险驾驶动作的类别。提出警告的方式包括:弹出对话框通过文字提出警告、通过内置语音数据提出警告。
本申请实施例的神经网络为基于训练图像集预先监督训练而获得，所述神经网络可包括卷积层、非线性层、池化层等网络层，本申请实施例对具体的网络结构并不限制。确定神经网络结构后，可基于带有标注信息的样本图像，采用监督方式、通过反向梯度传播等方法对神经网络进行迭代训练，具体的训练方式本申请实施例并不限制。
图6是本申请实施例提供的一种神经网络的训练方法的流程示意图,如图6所示,所述方法包括:
501、提取样本图像的第一特征图。
本实施例可从训练图像集中获取用于对神经网络进行训练的样本图像，其中，所述训练图像集中可包括多个样本图像。
在本申请的一种可选实施例中,所述训练图像集中的样本图像包括:正样本图像和负样本图像。 所述正样本图像包含对应于目标对象的至少一个预定动作,所述预定动作例如目标对象喝水、抽烟、打电话、戴眼镜、戴口罩等动作;所述负样本图像包含至少一个与预定动作相似的动作,如:目标对象的手搭着嘴唇、挠耳朵、摸鼻子等等。
本实施例将包含有与预定动作很相似的动作的样本图像作为负样本图像，通过对神经网络进行正样本图像和负样本图像的区分训练，使训练后的神经网络能高效地将与预定动作相似的动作区分出来，大大提高动作分类结果的精确率和鲁棒性。
本实施例中,可通过神经网络中的卷积层提取样本图像的第一特征图。其中,提取样本图像的第一特征图的详细过程可参照前述步骤302的描述,这里不再赘述。
502、提取第一特征图可能包括预定动作的多个第三候选框。
本步骤的详细过程可参照前述实施例中的步骤303的描述,这里不再赘述。
503:基于所述多个第三候选框确定动作目标框。
在本申请的一种可选实施例中,所述基于多个第三候选框确定动作目标框,包括:根据所述预定动作,获得第一动作监督框,其中,所述第一动作监督框包括:人脸的局部区域和动作交互物,或者,人脸的局部区域、手部区域和动作交互物;获取所述多个第三候选框的第二置信度,其中,所述第二置信度包括:所述第三候选框为所述动作目标框的第一概率,所述第三候选框非所述动作目标框的第二概率;确定所述多个第三候选框与所述第一动作监督框的面积重合度;若所述面积重合度大于或等于第二阈值,将与所述面积重合度对应的所述第三候选框的所述第二置信度取为所述第一概率;若所述面积重合度小于所述第二阈值,将与所述面积重合度对应的所述第三候选框的所述第二置信度取为所述第二概率;将所述第二置信度小于所述第一阈值的所述多个第三候选框去除,获得多个第四候选框;调整所述第四候选框的位置和大小,获得所述动作目标框。
本实施例中,对于与人脸有关的精细动作的识别,可预先对预定动作的特征进行定义。例如,喝水的动作特征包括:手部区域、人脸局部区域和水杯区域(即动作交互物对应区域)的特征;抽烟的动作特征包括:手部区域、人脸局部区域和烟的区域(即动作交互物对应区域)的特征;打电话的动作特征包括:手部区域、人脸局部区域和手机区域(即动作交互物对应区域)的特征,戴眼镜的动作特征包括:手部区域、人脸局部区域和眼镜区域(即动作交互物对应区域)的特征;戴口罩的动作特征包括:手部区域、人脸局部区域、口罩区域(即动作交互物对应区域)的特征。
本实施例中,所述样本图像的标注信息包括:动作监督框和所述动作监督框对应的动作类别。可以理解,在通过神经网络对所述样本图像进行处理之前,还需要获得各样本图像对应的标注信息。其中,所述动作监督框具体用于标识出样本图像中的预定动作,具体可参见图7中的目标对象喝水的动作监督框和图8中的目标对象打电话的动作监督框。
与预定动作很相似的动作往往会给神经网络提取候选框的过程带来很大的干扰。如：图4中从左至右，目标对象依次做了与打电话、喝水和抽烟相似的动作，即将右手分别放置在脸旁，但此时目标对象的手里并没有手机、水杯和烟，而神经网络易误将这些动作识别为打电话、喝水和抽烟，并分别标识出与之相应的候选框。因此，本申请实施例通过对神经网络进行正样本图像和负样本图像的区分训练，正样本图像对应的第一动作监督框可包括预定动作，负样本图像对应的第一动作监督框则包括与预定动作相似的动作。
本实施例中，通过神经网络标识出第三候选框的同时，还可获得所述第三候选框对应的第二置信度，第二置信度包括：所述第三候选框为动作目标框的概率，即第一概率；以及该第三候选框不是动作目标框的概率，即第二概率。这样，通过神经网络获得多个第三候选框的同时，还将获得每个第三候选框的第二置信度。需要理解的是，第二置信度为神经网络根据第三候选框中的特征得到的第三候选框为动作目标框的预测值。此外，在获得第三候选框和第二置信度的同时，通过神经网络的处理还可得到第三候选框在坐标系xoy下的坐标(x3,y3)，及所述第三候选框的尺寸，所述第三候选框的尺寸可通过长度和宽度的乘积表示。其中，所述第三候选框的坐标(x3,y3)可以是所述第三候选框的一个顶点的坐标，例如所述第三候选框的左上角、右上角、左下角或右下角的顶点的坐标。以所述第三候选框的坐标(x3,y3)为第三候选框的左上角的顶点坐标为例，则可获得第三候选框的右上角的横坐标x4以及左下角的纵坐标y4，则第三候选框可表示为bbox(x3,y3,x4,y4)。同理，所述第一动作监督框可表示为bbox_gt(x1,y1,x2,y2)。
本实施例中，确定各第三候选框bbox(x3,y3,x4,y4)分别与第一动作监督框bbox_gt(x1,y1,x2,y2)的面积重合度IOU，可选的，面积重合度IOU的计算公式如下：
IOU = (A∩B)/(A∪B)
其中,A、B分别表示第三候选框的面积和第一动作监督框的面积,A∩B表示第三候选框与第一动作监督框重合区域的面积,A∪B表示第三候选框与第一动作监督框包含的所有区域的面积。
若面积重合度IOU大于或等于第二阈值,判定第三候选框为可能包含预定动作的候选框,将该第三候选框的第二置信度取为上述第一概率;若面积重合度IOU小于所述第二阈值,判定该第三候选框为不可能包含预定动作的候选框,将该第三候选框的第二置信度取为上述第二概率。其中,所述第二阈值的取值大于等于0小于等于1;所述第二阈值的具体取值可根据网络训练效果确定。
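结合上式，下面给出面积重合度 IOU 的计算以及按第二阈值为第三候选框选取第一概率或第二概率作为第二置信度的一个示意性 Python 草图（非本申请的实际实现），其中框的表示方式、函数名及第二阈值取值 0.5 均为示例性假设。

```python
def iou(box_a, box_b):
    """box: (x1, y1, x2, y2)。面积重合度 IOU = 交集面积 / 并集面积。"""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(ix2 - ix1, 0.0) * max(iy2 - iy1, 0.0)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def assign_confidence(candidates, supervision_box, second_thr=0.5):
    """candidates: [(第三候选框, 第一概率, 第二概率), ...]。
    与第一动作监督框的面积重合度 >= 第二阈值时, 第二置信度取第一概率, 否则取第二概率。"""
    result = []
    for box, p1, p2 in candidates:
        result.append(p1 if iou(box, supervision_box) >= second_thr else p2)
    return result


if __name__ == "__main__":
    gt = (100, 100, 200, 200)  # 第一动作监督框
    cands = [((110, 105, 210, 205), 0.9, 0.1), ((300, 300, 360, 380), 0.6, 0.4)]
    print(assign_confidence(cands, gt))  # [0.9, 0.4]
```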
本实施例中,可将第二置信度小于所述第一阈值的所述多个第三候选框去除,获得多个第四候选框,调整所述第四候选框的位置和大小,获得所述动作目标框。其中,动作目标框的获取方式具体可参照前述实施例中的步骤304所述。
其中,所述调整所述第四候选框的位置和大小,获得所述动作目标框,包括:池化处理第四候选框,获得所述第四候选框对应的第二特征区域,基于所述第二特征区域对相对应的第四候选框的位置和大小进行调整,获得第五候选框,基于所述第五候选框获得动作目标框。其中,所述基于所述第二特征区域对相对应的第四候选框的位置和大小进行调整,获得第五候选框,包括:根据所述第二特征区域中对应于预定动作的特征,获得与所述预定动作的特征对应的第二动作特征框;根据所述第二动作特征框的几何中心坐标,获得所述第四候选框的第二位置偏移量;根据所述第二动作特征框的大小,获得所述第四候选框的第二缩放倍数;根据所述第二位置偏移量和所述第二缩放倍数对所述第四候选框的位置和大小进行调整,获得第五候选框。
本实施例中，分别获取上述第四候选框在坐标系xoy下的几何中心坐标P(x_n,y_n)和第二动作特征框在坐标系xoy下的几何中心坐标Q(x,y)，获得第四候选框的几何中心与第二动作特征框的几何中心的第二位置偏移量：Δ(x_n,y_n)=P(x_n,y_n)-Q(x,y)，其中，n为正整数，n的取值个数与第四候选框的数量一致。Δ(x_n,y_n)即为多个第四候选框的第二位置偏移量。
本实施例中,分别获得第四候选框与第二动作特征框的尺寸,再通过第二动作特征框的尺寸除以第四候选框的尺寸,得到第四候选框的第二缩放倍数ε,其中,第二缩放倍数ε包括第四候选框的长度的缩放倍数δ和宽度的缩放倍数η。
假设第四候选框的几何中心坐标的集合表示为：
{P(x_1,y_1), P(x_2,y_2), …, P(x_n,y_n)}
根据第二位置偏移量Δ(x_n,y_n)可得到几何中心的位置调整后的第四候选框的几何中心坐标的集合为：
{P′(x_1,y_1), P′(x_2,y_2), …, P′(x_n,y_n)}
则：
P′(x_n,y_n) = P(x_n,y_n) - Δ(x_n,y_n)
需要理解的是，在对第四候选框的几何中心的位置进行调整时，所述第四候选框的长度和宽度保持不变。
在得到几何中心位置调整后的一个或多个第四候选框后,固定第四候选框的几何中心不变,基于所述第二缩放倍数ε将所述第四候选框的长度调整至δ倍,宽度调整至η倍,获得第五候选框。
本实施例中,所述基于所述第五候选框获得动作目标框,包括:将尺寸和距离相近的多个第五候选框合并,合并后的第五候选框作为动作目标框。需要理解的是,对应于同一预定动作的第五候选框的大小和距离会非常接近,所以,合并后每个动作目标框只对应于一个预定动作。
在本申请的一种可选实施例中,通过神经网络的处理获得动作目标框的同时还会获得所述动作目标框的第三置信度,第三置信度表示所述动作目标框中的动作为预定动作类别的概率,即第三概率,如:上述预定动作可包括喝水、抽烟、打电话、戴眼镜、戴口罩五个类别,则每个动作目标框的第三概率均包含五个概率值,分别为动作目标框中的动作为喝水动作的概率a、为抽烟动作的概率b、为打电话动作的概率c、为戴眼镜动作的概率d以及为戴口罩动作的概率e。
步骤504:基于所述动作目标框进行预定动作的分类,获得动作识别结果。
本实施例中,以动作目标框中包括的预定动作包括喝水、抽烟、打电话、戴眼镜、戴口罩五个类别为例,假设动作目标框的第三置信度分别为:a=0.65,b=0.45,c=0.7,d=0.45,e=0.88,则动作识别结果可以为戴口罩动作。则本实施例中,对应于不同预定动作的动作目标框的第三置信度(即第三概率)中,可选取第三置信度(即第三概率)最大的预定动作的分类作为动作识别结果。其中,最大的第三置信度(即第三概率)可记为第四概率。
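上述"在各类别的第三置信度中选取最大者作为动作识别结果"的做法，可用下面的简短 Python 片段说明（非本申请的实际实现），数值取自上文示例。

```python
# 各预定动作类别对应的第三置信度(第三概率), 数值取自上文示例
third_confidence = {"喝水": 0.65, "抽烟": 0.45, "打电话": 0.7, "戴眼镜": 0.45, "戴口罩": 0.88}

# 取第三置信度最大的预定动作类别作为动作识别结果, 该最大值即记为第四概率
action = max(third_confidence, key=third_confidence.get)
fourth_prob = third_confidence[action]
print(action, fourth_prob)  # 戴口罩 0.88
```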
步骤505:确定所述样本图像的候选框的检测结果和检测框标注信息的第一损失、以及动作识别结果和动作类别标注信息的第二损失。
步骤506:根据所述第一损失和所述第二损失调节所述神经网络的网络参数。
本实施例中,神经网络可包括神经网络的特征提取分支、神经网络的候选框提取分支、神经网络的检测框精修分支和神经网络的动作分类分支,上述神经网络的各分支的功能具体可参见前述实施例中步骤301至步骤305的详细阐述。
本实施例中，通过计算候选框坐标回归损失函数smooth L1和类别损失函数softmax对神经网络的网络参数进行更新。
可选的,候选框提取的损失函数(Region Proposal Loss)的表达式如下:
Region Proposal Loss = (1/N)·Σ_i softmax(p_i) + α·Σ_i p_i·smooth L1(x_i)　　(3)
其中，N和α均为神经网络的候选框提取分支的权重参数，p_i为监督变量。
类别损失函数softmax和候选框坐标回归损失函数smooth L1的具体表达式如下：
softmax(p_i) = -log( e^{z_{p_i}} / Σ_j e^{z_j} )　　(4)
smooth L1(x) = { 0.5x²，|x|<1；|x|-0.5，其它 }　　(5)
其中，x=|x_1-x_3|+|y_1-y_3|+|x_2-x_4|+|y_2-y_4|。
神经网络的检测框精修分支通过损失函数来更新网络的权重参数,损失函数(Bbox Refine Loss)的具体表达式如下:
Bbox Refine Loss = (1/M)·Σ_i softmax(p_i) + β·Σ_i p_i·smooth L1(bbox_i - bbox_gt_j)　　(6)
其中，M为第六候选框的数量，β为神经网络的检测框精修分支的权重参数，p_i为监督变量，softmax损失函数和smooth L1损失函数的表达形式可参见公式(4)和公式(5)，特别地，公式(6)中的bbox_i为精修后的动作目标框的几何中心坐标，bbox_gt_j为监督动作框的几何中心坐标。
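为便于理解公式(4)、(5)所表示的两类损失函数及其与监督变量的组合方式，下面给出一个基于 NumPy 的示意性实现（非本申请的实际实现）：softmax 交叉熵与 smooth L1 按标准定义给出，region_proposal_loss 中分类项与回归项的组合形式及权重 alpha 仅为示例性假设。

```python
import numpy as np


def softmax_loss(logits, label):
    """标准 softmax 交叉熵损失: -log(softmax(logits)[label])。"""
    shifted = logits - np.max(logits)               # 数值稳定处理
    log_prob = shifted - np.log(np.sum(np.exp(shifted)))
    return -log_prob[label]


def smooth_l1(x):
    """标准 smooth L1 损失: |x|<1 时取 0.5*x^2, 否则取 |x|-0.5。"""
    x = abs(x)
    return 0.5 * x * x if x < 1.0 else x - 0.5


def region_proposal_loss(logits_list, labels, coord_errors, alpha=1.0):
    """候选框提取分支损失的示意性组合: 分类项取均值, 仅对正样本(监督变量为1)累加回归项。
    coord_errors[i] 为 |x1-x3|+|y1-y3|+|x2-x4|+|y2-y4| 形式的坐标误差。"""
    n = len(logits_list)
    cls_term = sum(softmax_loss(l, y) for l, y in zip(logits_list, labels)) / n
    reg_term = sum(smooth_l1(e) * y for e, y in zip(coord_errors, labels))
    return cls_term + alpha * reg_term


if __name__ == "__main__":
    logits = [np.array([0.5, 2.0]), np.array([1.5, 0.2])]  # 两个候选框的二分类打分
    labels = [1, 0]                                        # 监督变量: 1 正样本, 0 负样本
    errors = [0.3, 2.0]                                    # 坐标误差
    print(region_proposal_loss(logits, labels, errors, alpha=1.0))
```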
本实施例中,损失函数是神经网络优化的目标函数,神经网络训练或者优化的过程就是最小化损失函数的过程,即损失函数值越接近于0,对应预测的结果和真实结果的值就越接近。
本实施例中，用第四候选框的第二置信度替换公式(3)和公式(4)中的监督变量p_i，并代入公式(3)，通过调节神经网络的候选框提取分支的权重参数N和α，改变Region Proposal Loss的值（即第一损失），并选取使Region Proposal Loss的值最接近于0的权重参数组合N和α。
本实施例中，用动作目标框的第四概率（即多个第三置信度（即第三概率）中的最大值）替换监督变量p_i并代入公式(6)，通过调节神经网络的检测框精修分支的权重参数β，改变Bbox Refine Loss的值（即第二损失），并选取使Bbox Refine Loss的值最接近于0的权重参数β，以梯度反向传播的方式完成对神经网络的检测框精修分支的权重参数的更新。
将更新完权重参数的候选框提取分支、更新完权重参数的检测框精修分支、特征提取分支、动作分类分支再次进行训练,即向神经网络输入样本图像,经过神经网络的处理,最终由神经网络的动作分类分支输出识别结果。由于动作分类分支的输出结果与实际结果之间存在误差,将动作分类分支的输出值与实际值之间的误差从输出层向卷积层反向传播,直至传播到输入层。在反向传播的过程中,根据误差调整神经网络中的权重参数,不断迭代上述过程,直至收敛,完成对神经网络的网络参数的再次更新。
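上述"将误差从输出层向卷积层反向传播并迭代更新权重参数直至收敛"的过程，可用下面基于 PyTorch 的最小化训练循环草图来说明（非本申请的实际实现，其中的网络结构、优化器设置与随机数据均为示例性假设）。

```python
import torch
import torch.nn as nn
import torch.optim as optim

# 示例性假设的小型分类网络, 仅用于说明反向传播与参数更新的流程
model = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
                      nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(8, 5))
optimizer = optim.SGD(model.parameters(), lr=0.01)
criterion = nn.CrossEntropyLoss()

images = torch.randn(4, 3, 64, 64)        # 示例性假设的一批样本图像
labels = torch.randint(0, 5, (4,))        # 示例性假设的动作类别标注

for step in range(10):                    # 实际训练中迭代直至收敛, 此处仅示意10步
    optimizer.zero_grad()
    loss = criterion(model(images), labels)   # 输出结果与标注之间的误差(损失)
    loss.backward()                           # 误差从输出层向卷积层反向传播
    optimizer.step()                          # 根据梯度更新网络中的权重参数
```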
本实施例根据动作特征对车内人员的脸部精细动作进行识别，如与手和人脸相关的驾驶员危险驾驶动作。但在实际应用时，驾驶员所做的一些与危险驾驶动作相似的动作易对神经网络造成干扰，影响后续对动作的分类识别，这不仅会降低动作识别结果的精度，同时也会使用户体验直线下降。本实施例将正样本图像和负样本图像作为用于神经网络训练的样本图像，以损失函数进行监督，以梯度反向传播的方式更新神经网络的网络参数（特别是神经网络的特征提取分支和神经网络的候选框提取分支的权重参数）并完成训练，使训练后的神经网络的特征提取分支能准确地提取出危险驾驶动作的特征，再通过神经网络的候选框提取分支自动将包含有与预定动作（例如危险驾驶动作）相似的动作的候选框去除，可大大降低对危险驾驶动作的误检率。
另外，由于神经网络的候选框提取分支输出的动作候选框尺寸较大，若直接对其进行后续处理，计算量较大，本实施例通过对候选框进行池化处理，并调整至预定尺寸，可大大减小后续处理的计算量，加快处理速度；通过神经网络的检测框精修分支对候选框的精修，使精修后得到的动作目标框只包含预定动作（例如危险驾驶动作）的特征，提高识别结果的准确率。
请参阅图9,图9为本申请实施例提供的一种动作识别装置的结构示意图,该识别装置1000包括:第一提取单元11、第二提取单元12、确定单元13及分类单元14。其中:
所述第一提取单元11,用于提取包括有人脸的图像的特征;
所述第二提取单元12,用于基于所述特征确定可能包括预定动作的多个候选框;
所述确定单元13,用于基于所述多个候选框确定动作目标框,其中,所述动作目标框包括人脸的局部区域和动作交互物;
所述分类单元14,用于基于所述动作目标框进行预定动作的分类,获得动作识别结果。
在本申请的一种可选实施例中,所述人脸局部区域,包括以下至少之一:嘴部区域、耳部区域、眼部区域。
在本申请的一种可选实施例中,所述动作交互物,包括以下至少之一:容器、烟、手机、食物、工具、饮料瓶、眼镜、口罩。
在本申请的一种可选实施例中,所述动作目标框还包括:手部区域。
在本申请的一种可选实施例中,所述预定动作包括以下至少之一:打电话、抽烟、喝水/饮料、进食、使用工具、戴眼镜、化妆。
在本申请的一种可选实施例中,动作识别装置1000还包括:车载摄像头,用于拍摄位于车内的人的包括有人脸的图像。
在本申请的一种可选实施例中,所述车内的人包括以下至少之一:所述车的驾驶区的驾驶员、所述车的副驾驶区的人、所述车的后排座椅上的人。
在本申请的一种可选实施例中,所述车载摄像头为:RGB摄像头、红外摄像头或近红外摄像头。
本申请实施例通过对待处理图像进行特征提取,并根据提取到的特征实现待处理图像中动作的识别。上述的动作可以为:手部区域的动作和/或人脸局部区域的动作、针对动作交互物的动作等等,因此,需通过车载摄像头对车内人员进行图像采集,获取包括有人脸的待处理图像。再对待处理图像进行卷积操作,提取出动作特征。
在一种可选的实施方式中,首先对上述预定动作的特征进行定义,再通过神经网络根据定义的特征和提取到的图像中的特征,实现对图像中是否存在预定动作的判断,在判定图像中存在预定动作的情况下,确定图像中包括预定动作的多个候选框。
本实施例中,若上述提取的特征对应于:手部区域、人脸局部区域、动作交互物对应区域等至少一种区域,则通过神经网络的特征提取处理获得包含有手部区域和人脸局部区域的特征区域,基于所述特征区域确定候选区域,通过候选框标识出所述候选区域;其中,所述候选框例如可以通过矩形框表示。同理,通过另一个候选框标识出包含手部区域、人脸局部区域和动作交互物对应区域的特征区域。这样,通过提取对应于预定动作的特征,获得多个候选区域;根据多个候选区域,确定多个候选框。
本实施例中,由于候选框中可能包含除预定动作对应的特征以外的特征,或没有包含预定动作对应的所有特征(指任意一个预定动作的所有特征),这样都会影响最终的动作识别结果。因此,为保证最终识别结果的精度,需要对候选框的位置进行调整,即基于多个候选框确定动作目标框。基于此,通过对各个候选框的位置和大小的调整,将调整后的候选框确定为动作目标框。可以理解,调整后的多个候选框可以重叠为一个候选框,则重叠的候选框确定为动作目标框。
在本申请的一种可选实施例中,所述第一提取单元11包括:神经网络的特征提取分支111,用于提取包括有人脸的图像的特征,获得特征图。
本实施例中，通过神经网络的特征提取分支对待处理图像进行卷积操作，是利用卷积核在待处理图像上"滑动"。例如，在卷积核对应于图像某像素点时，将卷积核覆盖区域内各像素点的灰度值与卷积核上对应位置的数值相乘，将所有乘积加和后作为该像素点经卷积处理后的取值，进一步将卷积核"滑动"至下一个像素点，以此类推，最终完成所述待处理图像中所有像素点的卷积处理，获得特征图。
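上述卷积核在图像上"滑动"并加权求和得到特征图的过程，可用下面的 NumPy 草图表示（非本申请的实际实现，仅演示单通道、步长为1、无填充的简化情形）。

```python
import numpy as np


def conv2d(image, kernel):
    """单通道二维卷积(互相关)示意: 卷积核在图像上滑动,
    将覆盖区域内各像素灰度值与卷积核对应数值相乘后求和, 作为特征图上的一个取值。"""
    ih, iw = image.shape
    kh, kw = kernel.shape
    oh, ow = ih - kh + 1, iw - kw + 1
    feature_map = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            feature_map[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return feature_map


if __name__ == "__main__":
    img = np.arange(25, dtype=float).reshape(5, 5)   # 示例性的5x5灰度图像
    k = np.ones((3, 3)) / 9.0                        # 3x3均值卷积核
    print(conv2d(img, k).shape)                      # 输出特征图尺寸为(3, 3)
```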
其中,神经网络的特征提取分支111可包括多层卷积层,上一层卷积层通过特征提取获得的特征图可作为下一层卷积层的输入数据,通过多层卷积层提取图像中更为丰富的信息,由此提高特征提取的准确率。
在本申请的一种可选实施例中,所述第二提取单元12,包括:所述神经网络的候选框提取分支121,用于在所述特征图上提取可能包括预定动作的多个候选框。
例如,特征图中可包括:手、烟、水杯、手机、眼镜、口罩、人脸局部区域对应的特征中的至 少一种特征,基于所述至少一种特征确定多个候选框。需要说明的是,虽然通过神经网络的特征提取分支能够提取出待处理图像的特征,但提取出的特征可能包括预定动作对应的特征之外的其他特征,因此,这里通过神经网络的候选框提取分支确定的多个候选框中,可能存在至少部分候选框中包含了除预定动作对应的特征以外的其他特征,或者并没有包含预定动作对应的所有特征,因此,所述多个候选框可能包括了预定动作。
在本申请的一种可选实施例中,所述神经网络的候选框提取分支121还用于:根据所述预定动作的特征对所述特征图中的特征进行划分,获得多个候选区域;以及根据所述多个候选区域,获得所述多个候选框中每个候选框的第一置信度,其中,所述第一置信度为所述候选框为所述动作目标框的概率。
其中,所述神经网络的候选框提取分支121,包括:划分子单元,用于根据所述预定动作的特征对所述特征图中的特征进行划分,获得多个候选区域;
第一获取子单元,用于根据所述多个候选区域,获得所述多个候选框中每个候选框的第一置信度,其中,所述第一置信度为所述候选框为所述动作目标框的概率。
本实施例中，神经网络的候选框提取分支121还可以确定每个候选框对应的第一置信度，所述第一置信度用于以概率的形式表示候选框为动作目标框的可能性。通过神经网络的候选框提取分支对特征图的处理，在获得多个候选框的同时，还获得多个候选框中每个候选框的第一置信度。需要理解的是，所述第一置信度为神经网络的候选框提取分支根据候选框中的特征得到的候选框为动作目标框的预测值。
在本申请的一种可选实施例中,所述确定单元13,包括:所述神经网络的检测框精修分支131,用于基于所述多个候选框确定动作目标框。
在本申请的一种可选实施例中,所述神经网络的检测框精修分支131还用于:去除所述第一置信度小于第一阈值的候选框,获得至少一个第一候选框;以及池化处理所述至少一个第一候选框,获得至少一个第二候选框;以及根据所述至少一个第二候选框,确定动作目标框。
其中,所述神经网络的检测框精修分支,包括:去除子单元,用于去除第一置信度小于第一阈值的候选框,获得至少一个第一候选框;
第二获取子单元,用于池化处理所述至少一个第一候选框,获得至少一个第二候选框;
确定子单元,用于根据所述至少一个第二候选框,确定动作目标框。
本实施例中，由于在获得候选框的过程中，一些与预定动作很相似的动作会给神经网络的候选框提取分支带来很大的干扰。如图4中从左至右的子图片中，目标对象依次做了与打电话、喝水和抽烟相似的动作，这些动作都是将右手放置在脸旁，但目标对象手里并没有手机、水杯和烟，而神经网络容易误将目标对象的这些动作识别为打电话、喝水和抽烟。
本申请实施例通过神经网络的检测框精修分支131去除第一置信度小于第一阈值的候选框，得到至少一个第一候选框；其中，若候选框的第一置信度小于第一阈值，则表明该候选框为上述相似动作对应的候选框，需要将该候选框去除，从而能够高效地区分预定动作和相似动作，进而降低误检测率，大大提高动作识别结果的准确率。
在本申请的一种可选实施例中,所述神经网络的检测框精修分支131(或所述第二获取子单元)还用于:分别池化处理所述至少一个第一候选框,获得与所述至少一个第一候选框对应的至少一个第一特征区域;以及基于每个第一特征区域对相对应的第一候选框的位置和大小进行调整,获得至少一个第二候选框。
本实施例中，第一候选框所在区域中的特征的数量可能较多，若直接使用第一候选框所在区域中的特征将产生巨大的计算量。因此，在对第一候选框所在区域中的特征进行后续处理之前，先池化处理第一候选框，即池化处理第一候选框所在区域中的特征，降低第一候选框所在区域中的特征的维度，从而大大减小后续处理的计算量。
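上述"对第一候选框所在区域的特征进行池化、降低特征维度并统一到固定尺寸"的做法，可参考如下简化的 ROI 最大池化 Python 草图（非本申请的实际实现，输出尺寸 7x7 及函数名均为示例性假设）。

```python
import numpy as np


def roi_max_pool(feature_map, box, out_size=(7, 7)):
    """把特征图上 box=(x1, y1, x2, y2) 对应的区域划分成 out_size 个子块,
    对每个子块取最大值, 得到固定尺寸的池化特征, 从而降低后续处理的计算量。"""
    x1, y1, x2, y2 = [int(v) for v in box]
    region = feature_map[y1:y2, x1:x2]
    oh, ow = out_size
    h_edges = np.linspace(0, region.shape[0], oh + 1).astype(int)
    w_edges = np.linspace(0, region.shape[1], ow + 1).astype(int)
    pooled = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            block = region[h_edges[i]:h_edges[i + 1], w_edges[j]:w_edges[j + 1]]
            pooled[i, j] = block.max() if block.size else 0.0
    return pooled


if __name__ == "__main__":
    fmap = np.random.rand(32, 32)                    # 示例性特征图
    print(roi_max_pool(fmap, (4, 6, 25, 30)).shape)  # 输出固定为(7, 7)
```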
在本申请的一种可选实施例中,所述神经网络的检测框精修分支131(或所述第二获取子单元)还用于:基于所述第一特征区域中对应于所述预定动作的特征,获得与所述预定动作的特征对应的第一动作特征框;以及根据所述第一动作特征框的几何中心坐标,获得所述至少一个第一候选框的第一位置偏移量;以及根据所述第一动作特征框的大小,获得所述至少一个第一候选框的第一缩放倍数;以及根据至少一个第一位置偏移量和至少一个第一缩放倍数分别对至少一个第一候选框的位置和大小进行调整,获得至少一个第二候选框。
在本申请的一种可选实施例中,所述分类单元14,包括:所述神经网络的动作分类分支141,用于获取所述特征图上与所述动作目标框对应的区域图,并基于所述区域图进行预定动作的分类, 获得动作识别结果。
在本申请的一种可选实施例中,一方面,通过神经网络的动作分类分支141获得第一动作识别结果,另一方面,通过神经网络的动作分类分支141还可获得所述第一动作识别结果的第二置信度,所述第二置信度表征所述动作识别结果的准确率。
在本申请的一种可选实施例中,所述神经网络为基于训练图像集预先监督训练而得,所述训练图像集包括多个样本图像,其中,所述样本图像的标注信息包括:动作监督框和所述动作监督框对应的动作类别。
在本申请的一种可选实施例中,所述训练图像集包括正样本图像和负样本图像,所述负样本图像的动作与所述正样本图像的动作相似,所述正样本的动作监督框包括人脸的局部区域和动作交互物,或者,人脸的局部区域、手部区域和动作交互物。
在本申请的一种可选实施例中，所述正样本图像的动作包括打电话，所述负样本图像包括：挠耳朵；和/或，所述正样本图像包括抽烟、进食或喝水，所述负样本图像包括张嘴或手搭着嘴唇的动作。
本申请实施例通过神经网络的特征提取分支111对待处理图像进行特征提取，通过神经网络的候选框提取分支121根据提取出的特征获得可能包括预定动作的候选框，通过神经网络的检测框精修分支131确定动作目标框，最后通过神经网络的动作分类分支141对动作目标框中的特征进行预定动作的分类，得到待处理图像的动作识别结果；整个识别过程通过提取待处理图像中的特征（例如手部区域、人脸局部区域、动作交互物对应区域的特征），并对其进行处理，可自主、快速地实现对精细动作的精确识别。
本申请实施例的所述动作识别装置还包括所述神经网络的训练组件。请参阅图10,图10为本申请实施例提供的一种神经网络的训练组件的结构示意图,该训练组件2000包括:第一提取单元21、第二提取单元22、第一确定单元23、获取单元24、第二确定单元25及调节单元26。其中:
    所述第一提取单元21，用于提取样本图像的第一特征图；
所述第二提取单元22,用于提取所述第一特征图可能包括预定动作的多个第三候选框;
所述第一确定单元23,用于基于所述多个第三候选框确定动作目标框;
所述获取单元24,用于基于所述动作目标框进行预定动作的分类,获得第一动作识别结果;
所述第二确定单元25,用于确定所述样本图像的候选框的检测结果和检测框标注信息的第一损失、以及动作识别结果和动作类别标注信息的第二损失;
所述调节单元26,用于根据所述第一损失和所述第二损失调节所述神经网络的网络参数。
在本申请的一种可选实施例中,所述第一确定单元23包括:第一获取子单元231,用于根据所述预定动作,获得第一动作监督框,其中所述第一动作监督框包括:人脸的局部区域和动作交互物,或者,人脸的局部区域、手部区域和动作交互物;
    所述第二获取子单元232，用于获取所述多个第三候选框的第二置信度，其中，所述第二置信度包括：所述第三候选框为所述动作目标框的第一概率，所述第三候选框非所述动作目标框的第二概率；
所述确定子单元233,用于确定所述多个第三候选框与所述第一动作监督框的面积重合度;
所述选取子单元234,用于若所述面积重合度大于或等于第二阈值,将与所述面积重合度对应的所述第三候选框的所述第二置信度取为所述第一概率;若所述面积重合度小于所述第二阈值,将与所述面积重合度对应的所述第三候选框的所述第二置信度取为所述第二概率;
所述去除子单元235,用于将所述第二置信度小于所述第一阈值的所述多个第三候选框去除,获得多个第四候选框;
所述调整子单元236,用于调整所述第四候选框的位置和大小,获得所述动作目标框。
本实施例根据动作特征对车内人员的脸部精细动作进行识别，如与手和人脸相关的驾驶员危险驾驶动作。但在实际应用时，驾驶员所做的一些与危险驾驶动作相似的动作易对神经网络造成干扰，影响后续对动作的分类识别，这不仅会降低动作识别结果的精度，同时也会使用户体验直线下降。本实施例将正样本图像和负样本图像作为用于神经网络训练的样本图像，以损失函数进行监督，以梯度反向传播的方式更新神经网络的网络参数（特别是神经网络的特征提取分支和神经网络的候选框提取分支的权重参数）并完成训练，使训练后的神经网络的特征提取分支能准确地提取出危险驾驶动作的特征，再通过神经网络的候选框提取分支自动将包含有与预定动作（例如危险驾驶动作）相似的动作的候选框去除，可大大降低对危险驾驶动作的误检率。
另外，由于神经网络的候选框提取分支输出的动作候选框尺寸较大，若直接对其进行后续处理，计算量较大，本实施例通过对候选框进行池化处理，并调整至预定尺寸，可大大减小后续处理的计算量，加快处理速度；通过神经网络的检测框精修分支对候选框的精修，使精修后得到的动作目标框只包含预定动作（例如危险驾驶动作）的特征，提高识别结果的准确率。
请参阅图11,图11为本申请实施例提供的一种驾驶动作分析装置的结构示意图,该分析装置3000包括:车载摄像头31、第一获取单元32及生成单元33。其中:
所述车载摄像头31,用于采集包括有驾驶员人脸图像的视频流;
所述第一获取单元32,用于通过本申请前述实施例所述的动作识别装置,获取所述视频流中至少一帧图像的动作识别结果;
所述生成单元33,用于响应于动作识别结果满足预定条件,生成分心或危险驾驶提示信息。
在本申请的一种可选实施例中,所述预定条件包括以下至少之一:出现特定预定动作;在预定时长内出现特定预定动作的次数;所述视频流中特定预定动作出现维持的时长。
在本申请的一种可选实施例中,所述分析装置3000还包括:第二获取单元34,用于获取设置有车载双摄像头的车辆的车速;所述生成单元33还用于:响应于所述车速大于设定阈值且所述动作识别结果满足所述预定条件,生成分心或危险驾驶提示信息。
本实施例中,通过车载摄像头对驾驶员进行视频拍摄,并以拍摄的视频的每一帧画面作为待处理图像。通过对摄像头拍摄的每一帧画面进行识别,获得相应的识别结果,再结合连续多帧画面的结果对驾驶员的动作进行识别。当检测到驾驶员正在进行喝水、打电话、戴眼镜中的任意一个动作时,可通过显示终端对驾驶员提出警告,并提示危险驾驶动作的类别。提出警告的方式包括:弹出对话框通过文字提出警告、通过内置语音数据提出警告。
本申请实施例还提供了一种电子设备。图12为本申请实施例提供的一种电子设备的硬件结构示意图。该电子设备4000包括存储器44和处理器41,所述存储器44上存储有计算机可执行指令,所述处理器41运行所述存储器44上的计算机可执行指令时实现本申请实施例所述的动作识别方法,或者本申请实施例所述的驾驶动作分析方法。
在本申请的一种可选实施例中,所述电子设备还可以包括输入装置42、输出装置43。该输入装置42、输出装置43、存储器44和处理器41之间可通过总线相互连接。
存储器包括但不限于是随机存储记忆体(Random Access Memory,RAM)、只读存储器(Read-Only Memory,ROM)、可擦除可编程只读存储器(Erasable Programmable Read Only Memory,EPROM)、或便携式只读存储器(Compact Disc Read-Only Memory,CD-ROM),该存储器用于相关指令及数据。
输入装置用于输入数据和/或信号,以及输出装置用于输出数据和/或信号。输出装置和输入装置可以是独立的器件,也可以是一个整体的器件。
处理器可以包括一个或多个处理器，例如包括一个或多个中央处理器（Central Processing Unit，CPU），在处理器是一个CPU的情况下，该CPU可以是单核CPU，也可以是多核CPU。处理器还可以包括一个或多个专用处理器，专用处理器可以包括GPU、FPGA等，用于进行加速处理。
存储器用于存储网络设备的程序代码和数据。
处理器用于调用该存储器中的程序代码和数据,执行上述方法实施例中的步骤。具体可参见方法实施例中的描述,在此不再赘述。
可以理解的是,图12仅仅示出了电子设备的简化设计。在实际应用中,电子设备还可以分别包含必要的其他元件,包含但不限于任意数量的输入/输出装置、处理器、控制器、存储器等,而所有可以实现本申请实施例的电子设备都在本申请实施例的保护范围之内。
本申请实施例还提供了一种计算机存储介质,用于存储计算机可读取的指令,该指令被执行时实现本申请上述任一实施例的动作识别方法的操作,或者,该指令被执行时实现本申请上述任一实施例的驾驶动作分析方法的操作。
本申请实施例还提供了一种计算机程序,包括计算机可读取的指令,当该计算机可读取的指令在设备中运行时,该设备中的处理器执行用于实现本申请上述任一实施例的动作识别方法中的步骤的可执行指令,或者,该设备中的处理器执行用于实现本申请上述任一实施例的驾驶动作分析方法中的步骤的可执行指令。
在本申请所提供的几个实施例中，应该理解到，所揭露的设备和方法，可以通过其它的方式实现。以上所描述的设备实施例仅仅是示意性的，例如，所述单元的划分，仅仅为一种逻辑功能划分，实际实现时可以有另外的划分方式，如：多个单元或组件可以结合，或可以集成到另一个系统，或一些特征可以忽略，或不执行。另外，所显示或讨论的各组成部分相互之间的耦合、或直接耦合、或通信连接可以是通过一些接口，设备或单元的间接耦合或通信连接，可以是电性的、机械的或其它形式的。
上述作为分离部件说明的单元可以是、或也可以不是物理上分开的,作为单元显示的部件可以是、或也可以不是物理单元,即可以位于一个地方,也可以分布到多个网络单元上;可以根据实际的需要选择其中的部分或全部单元来实现本实施例方案的目的。
另外,在本发明各实施例中的各功能单元可以全部集成在一个处理单元中,也可以是各单元分别单独作为一个单元,也可以两个或两个以上单元集成在一个单元中;上述集成的单元既可以采用硬件的形式实现,也可以采用硬件加软件功能单元的形式实现。
本领域普通技术人员可以理解:实现上述方法实施例的全部或部分步骤可以通过程序指令相关的硬件来完成,前述的程序可以存储于一计算机可读取存储介质中,该程序在执行时,执行包括上述方法实施例的步骤;而前述的存储介质包括:移动存储设备、ROM、RAM、磁碟或者光盘等各种可以存储程序代码的介质。
或者,本发明上述集成的单元如果以软件功能模块的形式实现并作为独立的产品销售或使用时,也可以存储在一个计算机可读取存储介质中。基于这样的理解,本发明实施例的技术方案本质上或者说对现有技术做出贡献的部分可以以软件产品的形式体现出来,该计算机软件产品存储在一个存储介质中,包括若干指令用以使得一台计算机设备(可以是个人计算机、服务器、或者网络设备等)执行本发明各个实施例所述方法的全部或部分。而前述的存储介质包括:移动存储设备、ROM、RAM、磁碟或者光盘等各种可以存储程序代码的介质。
本申请所提供的几个方法实施例中所揭露的方法,在不冲突的情况下可以任意组合,得到新的方法实施例。
本申请所提供的几个产品实施例中所揭露的特征,在不冲突的情况下可以任意组合,得到新的产品实施例。
本申请所提供的几个方法或设备实施例中所揭露的特征,在不冲突的情况下可以任意组合,得到新的方法实施例或设备实施例。
以上所述，仅为本发明的具体实施方式，但本发明的保护范围并不局限于此，任何熟悉本技术领域的技术人员在本发明揭露的技术范围内，可轻易想到变化或替换，都应涵盖在本发明的保护范围之内。因此，本发明的保护范围应以所述权利要求的保护范围为准。

Claims (51)

  1. 一种动作识别方法,包括:
    提取包括有人脸的图像中的特征;
    基于所述特征确定可能包括预定动作的多个候选框;
    基于所述多个候选框确定动作目标框,其中,所述动作目标框包括人脸的局部区域和动作交互物;
    基于所述动作目标框进行预定动作的分类,获得动作识别结果。
  2. 根据权利要求1所述的方法,其中,所述人脸的局部区域,包括以下至少之一:嘴部区域、耳部区域、眼部区域。
  3. 根据权利要求1或2所述的方法,其中,所述动作交互物,包括以下至少之一:容器、烟、手机、食物、工具、饮料瓶、眼镜、口罩。
  4. 根据权利要求1至3任一项所述的方法,其中,所述动作目标框还包括:手部区域。
  5. 根据权利要求1至4任一项所述的方法,其中,所述预定动作包括以下至少之一:打电话、抽烟、喝水/饮料、进食、使用工具、戴眼镜、化妆。
  6. 根据权利要求1至5任一项所述的方法,其中,所述方法还包括:
    经车载摄像头拍摄位于车内的人的包括有人脸的图像。
  7. 根据权利要求6所述的方法,其中,所述车内的人包括以下至少之一:所述车的驾驶区的驾驶员、所述车的副驾驶区的人、所述车的后排座椅上的人。
  8. 根据权利要求6或7所述的方法,其中,所述车载摄像头为:RGB摄像头、红外摄像头或近红外摄像头。
  9. 根据权利要求1至8任一项所述的方法,其中,所述提取包括有人脸的图像中的特征,包括:
    经神经网络的特征提取分支提取包括有人脸的图像中的特征,获得特征图。
  10. 根据权利要求9所述的方法,其中,所述基于所述特征确定可能包括预定动作的多个候选框,包括:
    经所述神经网络的候选框提取分支在所述特征图上确定可能包括预定动作的多个候选框。
  11. 根据权利要求10所述的方法,其中,所述经所述神经网络的候选框提取分支在所述特征图上确定可能包括预定动作的多个候选框,包括:
    根据所述预定动作的特征对所述特征图中的特征进行划分,获得多个候选区域;
    根据所述多个候选区域,获得多个候选框和所述多个候选框中每个候选框的第一置信度,其中,所述第一置信度为所述候选框为所述动作目标框的概率。
  12. 根据权利要求9至11任一项所述的方法,其中,所述基于所述多个候选框确定动作目标框,包括:
    经所述神经网络的检测框精修分支、基于所述多个候选框确定动作目标框。
  13. 根据权利要求12所述的方法,其中,所述经所述神经网络的检测框精修分支、基于所述多个候选框确定动作目标框,包括:
    经所述神经网络的检测框精修分支去除第一置信度小于第一阈值的候选框,获得至少一个第一候选框;
    池化处理所述至少一个第一候选框,获得至少一个第二候选框;
    根据所述至少一个第二候选框,确定动作目标框。
  14. 根据权利要求13所述的方法,其中,所述池化处理所述至少一个第一候选框,获得至少一个第二候选框,包括:
    分别池化处理所述至少一个第一候选框,获得与所述至少一个第一候选框对应的至少一个第一特征区域;
    基于每个第一特征区域对相对应的第一候选框的位置和大小进行调整,获得至少一个第二候选框。
  15. 根据权利要求14所述的方法,其中,所述基于每个第一特征区域对相对应的第一候选框的位置和大小进行调整,获得至少一个第二候选框,包括:
    基于所述第一特征区域中对应于所述预定动作的特征，获得与所述预定动作的特征对应的第一动作特征框；
    根据所述第一动作特征框的几何中心坐标,获得所述至少一个第一候选框的第一位置偏移量;
    根据所述第一动作特征框的大小,获得所述至少一个第一候选框的第一缩放倍数;
    根据至少一个第一位置偏移量和至少一个第一缩放倍数分别对所述至少一个第一候选框的位置和大小进行调整,获得至少一个第二候选框。
  16. 根据权利要求1至15任一项所述的方法,其中,所述基于所述动作目标框进行预定动作的分类,获得动作识别结果,包括:
    经所述神经网络的动作分类分支获取所述特征图上与所述动作目标框对应的区域图,基于所述区域图进行预定动作的分类,获得动作识别结果。
  17. 根据权利要求9至16任一项所述的方法,其中,所述神经网络为基于训练图像集预先监督训练而得,所述训练图像集包括多个样本图像,其中,所述样本图像的标注信息包括:动作监督框和所述动作监督框对应的动作类别。
  18. 根据权利要求17所述的方法,其中,所述训练图像集包括正样本图像和负样本图像,所述负样本图像的动作与所述正样本图像的动作相似,所述正样本的动作监督框包括:人脸的局部区域和动作交互物,或者,人脸的局部区域、手部区域和动作交互物。
  19. 根据权利要求17或18所述的方法，其中，所述正样本图像的动作包括打电话，所述负样本图像包括：挠耳朵；和/或，所述正样本图像包括抽烟、进食或喝水，所述负样本图像包括张嘴或手搭着嘴唇的动作。
  20. 根据权利要求17至19任一项所述的方法,其中,所述神经网络的训练方法包括:
    提取样本图像的第一特征图;
    提取所述第一特征图可能包括预定动作的多个第三候选框;
    基于所述多个第三候选框确定动作目标框;
    基于所述动作目标框进行预定动作的分类,获得动作识别结果;
    确定所述样本图像的候选框的检测结果和检测框标注信息的第一损失、以及动作识别结果和动作类别标注信息的第二损失;
    根据所述第一损失和所述第二损失调节所述神经网络的网络参数。
  21. 根据权利要求20所述的方法,其中,所述基于多个第三候选框确定动作目标框,包括:
    根据所述预定动作,获得第一动作监督框,其中,所述第一动作监督框包括:人脸的局部区域和动作交互物,或者,人脸的局部区域、手部区域和动作交互物;
    获取所述多个第三候选框的第二置信度,其中,所述第二置信度包括:所述第三候选框为所述动作目标框的第一概率,所述第三候选框非所述动作目标框的第二概率;
    确定所述多个第三候选框与所述第一动作监督框的面积重合度;
    若所述面积重合度大于或等于第二阈值,将与所述面积重合度对应的所述第三候选框的所述第二置信度取为所述第一概率;若所述面积重合度小于所述第二阈值,将与所述面积重合度对应的所述第三候选框的所述第二置信度取为所述第二概率;
    将所述第二置信度小于所述第一阈值的所述多个第三候选框去除,获得多个第四候选框;
    调整所述第四候选框的位置和大小,获得所述动作目标框。
  22. 一种驾驶动作分析方法,包括:
    经车载摄像头采集包括有驾驶员人脸图像的视频流;
    通过如权利要求1至21任一所述的动作识别方法,获取所述视频流中至少一帧图像的动作识别结果;
    响应于动作识别结果满足预定条件,生成危险驾驶提示信息。
  23. 根据权利要求22所述的方法,其中,所述预定条件包括以下至少之一:出现特定预定动作;在预定时长内出现特定预定动作的次数;所述视频流中特定预定动作出现维持的时长。
  24. 根据权利要求22或23所述的方法,其中,所述方法还包括:
    获取设置有车载双摄像头的车辆的车速;
    所述响应于动作识别结果满足预定条件,生成危险驾驶提示信息,包括:响应于所述车速大于设定阈值且所述动作识别结果满足所述预定条件,生成危险驾驶提示信息。
  25. 一种动作识别装置,包括:
    第一提取单元,用于提取包括有人脸的图像的特征;
    第二提取单元,用于基于所述特征确定可能包括预定动作的多个候选框;
    确定单元,用于基于所述多个候选框确定动作目标框,其中,所述动作目标框包括人脸的局部区域和动作交互物;
    分类单元,用于基于所述动作目标框进行预定动作的分类,获得动作识别结果。
  26. 根据权利要求25所述的装置,其中,所述人脸的局部区域,包括以下至少之一:嘴部区域、耳部区域、眼部区域。
  27. 根据权利要求25或26所述的装置,其中,所述动作交互物,包括以下至少之一:容器、烟、手机、食物、工具、饮料瓶、眼镜、口罩。
  28. 根据权利要求25至27任一项所述的装置,其中,所述动作目标框还包括:手部区域。
  29. 根据权利要求25至28任一项所述的装置,其中,所述预定动作包括以下至少之一:打电话、抽烟、喝水/饮料、进食、使用工具、戴眼镜、化妆。
  30. 根据权利要求25至29任一项所述的装置,其中,还包括:
    车载摄像头,用于拍摄位于车内的人的包括有人脸的图像。
  31. 根据权利要求30所述的装置,其中,所述车内的人包括以下至少之一:所述车的驾驶区的驾驶员、所述车的副驾驶区的人、所述车的后排座椅上的人。
  32. 根据权利要求30或31所述的装置,其中,所述车载摄像头为:RGB摄像头、红外摄像头或近红外摄像头。
  33. 根据权利要求25至32任一项所述的装置,其中,所述第一提取单元包括:神经网络的特征提取分支,用于提取包括有人脸的图像的特征,获得特征图。
  34. 根据权利要求33所述的装置,其中,所述第二提取单元,包括:
    所述神经网络的候选框提取分支,用于在所述特征图上提取可能包括预定动作的多个候选框。
  35. 根据权利要求34所述的装置,其中,所述候选框提取分支,包括:
    划分子单元,用于根据所述预定动作的特征对所述特征图中的特征进行划分,获得多个候选区域;
    第一获取子单元,用于根据所述多个候选区域,获得所述多个候选框和所述多个候选框中每个候选框的第一置信度,其中,所述第一置信度为所述候选框为所述动作目标框的概率。
  36. 根据权利要求33至35任一项所述的装置,其中,所述确定单元,包括:所述神经网络的检测框精修分支,用于基于所述多个候选框确定动作目标框。
  37. 根据权利要求36所述的装置,其中,所述检测框精修分支,包括:
    去除子单元,用于去除所述第一置信度小于第一阈值的候选框,获得至少一个第一候选框;
    第二获取子单元,用于池化处理所述至少一个第一候选框,获得至少一个第二候选框;
    确定子单元,用于根据所述至少一个第二候选框,确定动作目标框。
  38. 根据权利要求37所述的装置,其中,所述第二获取子单元还用于:
    分别池化处理所述至少一个第一候选框,获得与所述至少一个第一候选框对应的至少一个第一特征区域;以及基于每个第一特征区域对相对应的第一候选框的位置和大小进行调整,获得至少一个第二候选框。
  39. 根据权利要求38所述的装置,其中,所述第二获取子单元还用于:
    基于所述第一特征区域中对应于所述预定动作的特征,获得与所述预定动作的特征对应的第一动作特征框;以及根据所述第一动作特征框的几何中心坐标,获得所述至少一个第一候选框的第一位置偏移量;以及根据所述第一动作特征框的大小,获得所述至少一个第一候选框的第一缩放倍数;以及根据至少一个第一位置偏移量和至少一个第一缩放倍数分别对所述至少一个第一候选框的位置和大小进行调整,获得至少一个第二候选框。
  40. 根据权利要求25至39任一项所述的装置,其中,所述分类单元,包括:所述神经网络的动作分类分支,用于获取所述特征图上与所述动作目标框对应的区域图,并基于所述区域图进行预定动作的分类,获得动作识别结果。
  41. 根据权利要求35至40任一项所述的装置,其中,所述神经网络为基于训练图像集预先监督训练而得,所述训练图像集包括多个样本图像,其中,所述样本图像的标注信息包括:动作监督框和所述动作监督框对应的动作类别。
  42. 根据权利要求41所述的装置,其中,所述训练图像集包括正样本图像和负样本图像,所述负样本图像的动作与所述正样本图像的动作相似,所述正样本的动作监督框包括:人脸的局部区域和动作交互物,或者,人脸的局部区域、手部区域和动作交互物。
  43. 根据权利要求41或42所述的装置，其中，所述正样本图像的动作包括打电话，所述负样本图像包括：挠耳朵；和/或，所述正样本图像包括抽烟、进食或喝水，所述负样本图像包括张嘴或手搭着嘴唇的动作。
  44. 根据权利要求41至43任一项所述的装置,其中,所述动作识别装置还包括所述神经网络的训练组件,所述神经网络的训练组件包括:
    第一提取单元,用于提取样本图像的第一特征图;
    第二提取单元,用于提取所述第一特征图可能包括预定动作的多个第三候选框;
    第一确定单元,用于基于所述多个第三候选框确定动作目标框;
    第三获取单元,用于基于所述动作目标框进行预定动作的分类,获得动作识别结果;
    第二确定单元,用于确定所述样本图像的候选框的检测结果和检测框标注信息的第一损失、以及动作识别结果和动作类别标注信息的第二损失;
    调节单元,用于根据所述第一损失和所述第二损失调节所述神经网络的网络参数。
  45. 根据权利要求44所述的装置,其中,所述第一确定单元包括:
    第一获取子单元,用于根据所述预定动作,获得第一动作监督框,其中所述第一动作监督框包括:人脸的局部区域和动作交互物,或者,人脸的局部区域、手部区域和动作交互物;
    第二获取子单元，用于获取所述多个第三候选框的第二置信度，其中，所述第二置信度包括：所述第三候选框为所述动作目标框的第一概率，所述第三候选框非所述动作目标框的第二概率；
    确定子单元,用于确定所述多个第三候选框与所述第一动作监督框的面积重合度;
    选取子单元,用于若所述面积重合度大于或等于第二阈值,将与所述面积重合度对应的所述第三候选框的所述第二置信度取为所述第一概率;若所述面积重合度小于所述第二阈值,将与所述面积重合度对应的所述第三候选框的所述第二置信度取为所述第二概率;
    去除子单元,用于将所述第二置信度小于所述第一阈值的所述多个第三候选框去除,获得多个第四候选框;
    调整子单元,用于调整所述第四候选框的位置和大小,获得所述动作目标框。
  46. 一种驾驶动作分析装置,包括:
    车载摄像头,用于采集包括有驾驶员人脸图像的视频流;
    第一获取单元,用于通过如权利要求25至45任一项所述的动作识别装置,获取所述视频流中至少一帧图像的动作识别结果;
    生成单元,用于响应于动作识别结果满足预定条件,生成危险驾驶提示信息。
  47. 根据权利要求46所述的装置,其中,所述预定条件包括以下至少之一:出现特定预定动作;在预定时长内出现特定预定动作的次数;所述视频流中特定预定动作出现维持的时长。
  48. 根据权利要求46或47所述的装置,其中,所述装置还包括:
    第二获取单元,用于获取设置有车载双摄像头的车辆的车速;
    所述生成单元还用于:响应于所述车速大于设定阈值且所述动作识别结果满足所述预定条件,生成危险驾驶提示信息。
  49. 一种电子设备,包括存储器和处理器,所述存储器上存储有计算机可执行指令,所述处理器运行所述存储器上的计算机可执行指令时实现权利要求1至21任一项所述的方法,或者权利要求22至24任一项所述的方法。
  50. 一种计算机可读存储介质,其上存储有计算机程序,该计算机程序被处理器执行时,实现权利要求1至21任一项所述的方法,或者权利要求22至24任一项所述的方法。
  51. 一种计算机程序,包括计算机指令,当所述计算机指令在设备的处理器中运行时,实现权利要求1至21任一项所述的方法,或者权利要求22至24任一项所述的方法。
PCT/CN2019/108167 2018-09-27 2019-09-26 动作识别、驾驶动作分析方法和装置、电子设备 WO2020063753A1 (zh)

Priority Applications (4)

Application Number Priority Date Filing Date Title
JP2020551540A JP7061685B2 (ja) 2018-09-27 2019-09-26 動作認識、運転動作分析の方法及び装置、並びに電子機器
SG11202009320PA SG11202009320PA (en) 2018-09-27 2019-09-26 Maneuver recognition and driving maneuver analysis method and apparatus, and electronic device
KR1020207027826A KR102470680B1 (ko) 2018-09-27 2019-09-26 동작 인식, 운전 동작 분석 방법 및 장치, 전자 기기
US17/026,933 US20210012127A1 (en) 2018-09-27 2020-09-21 Action recognition method and apparatus, driving action analysis method and apparatus, and storage medium

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201811130798.6A CN110956060A (zh) 2018-09-27 2018-09-27 动作识别、驾驶动作分析方法和装置及电子设备
CN201811130798.6 2018-09-27

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US17/026,933 Continuation US20210012127A1 (en) 2018-09-27 2020-09-21 Action recognition method and apparatus, driving action analysis method and apparatus, and storage medium

Publications (1)

Publication Number Publication Date
WO2020063753A1 true WO2020063753A1 (zh) 2020-04-02

Family

ID=69951010

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/108167 WO2020063753A1 (zh) 2018-09-27 2019-09-26 动作识别、驾驶动作分析方法和装置、电子设备

Country Status (6)

Country Link
US (1) US20210012127A1 (zh)
JP (1) JP7061685B2 (zh)
KR (1) KR102470680B1 (zh)
CN (1) CN110956060A (zh)
SG (1) SG11202009320PA (zh)
WO (1) WO2020063753A1 (zh)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112257604A (zh) * 2020-10-23 2021-01-22 北京百度网讯科技有限公司 图像检测方法、装置、电子设备和存储介质
CN112270210A (zh) * 2020-10-09 2021-01-26 珠海格力电器股份有限公司 数据处理、操作指令识别方法、装置、设备和介质
CN113205067A (zh) * 2021-05-26 2021-08-03 北京京东乾石科技有限公司 作业人员监控方法、装置、电子设备和存储介质
WO2022027895A1 (zh) * 2020-08-07 2022-02-10 上海商汤临港智能科技有限公司 异常坐姿识别方法、装置、电子设备、存储介质及程序
JP2023509572A (ja) * 2020-04-29 2023-03-09 ベイジン バイドゥ ネットコム サイエンス テクノロジー カンパニー リミテッド 車両を検出するための方法、装置、電子機器、記憶媒体およびコンピュータプログラム

Families Citing this family (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110245662B (zh) * 2019-06-18 2021-08-10 腾讯科技(深圳)有限公司 检测模型训练方法、装置、计算机设备和存储介质
US11222242B2 (en) * 2019-08-23 2022-01-11 International Business Machines Corporation Contrastive explanations for images with monotonic attribute functions
US10803334B1 (en) * 2019-10-18 2020-10-13 Alpine Electronics of Silicon Valley, Inc. Detection of unsafe cabin conditions in autonomous vehicles
KR102374211B1 (ko) * 2019-10-28 2022-03-15 주식회사 에스오에스랩 객체 인식 방법 및 이를 수행하는 객체 인식 장치
US11043003B2 (en) 2019-11-18 2021-06-22 Waymo Llc Interacted object detection neural network
CN112947740A (zh) * 2019-11-22 2021-06-11 深圳市超捷通讯有限公司 基于动作分析的人机交互方法、车载装置
CN112339764A (zh) * 2020-11-04 2021-02-09 杨华勇 一种基于大数据的新能源汽车驾驶姿态分析系统
CN113011279A (zh) * 2021-02-26 2021-06-22 清华大学 粘膜接触动作的识别方法、装置、计算机设备和存储介质
CN117203678A (zh) * 2021-04-15 2023-12-08 华为技术有限公司 目标检测方法和装置
CN113205075A (zh) * 2021-05-31 2021-08-03 浙江大华技术股份有限公司 一种检测吸烟行为的方法、装置及可读存储介质
CN113362314B (zh) * 2021-06-18 2022-10-18 北京百度网讯科技有限公司 医学图像识别方法、识别模型训练方法及装置
CN114670856B (zh) * 2022-03-30 2022-11-25 湖南大学无锡智能控制研究院 一种基于bp神经网络的参数自整定纵向控制方法及系统
CN116901975B (zh) * 2023-09-12 2023-11-21 深圳市九洲卓能电气有限公司 一种车载ai安防监控系统及其方法
CN117953589B (zh) * 2024-03-27 2024-07-05 武汉工程大学 一种交互动作检测方法、系统、设备及介质

Family Cites Families (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
TWI474264B (zh) * 2013-06-14 2015-02-21 Utechzone Co Ltd 行車警示方法及車用電子裝置
KR101386823B1 (ko) * 2013-10-29 2014-04-17 김재철 동작, 안면, 눈, 입모양 인지를 통한 2단계 졸음운전 방지 장치
CN105260705B (zh) * 2015-09-15 2019-07-05 西安邦威电子科技有限公司 一种适用于多姿态下的驾驶人员接打电话行为检测方法
CN105260703B (zh) * 2015-09-15 2019-07-05 西安邦威电子科技有限公司 一种适用于多姿态下的驾驶人员抽烟行为检测方法
JP6443393B2 (ja) * 2016-06-01 2018-12-26 トヨタ自動車株式会社 行動認識装置,学習装置,並びに方法およびプログラム
CN106096607A (zh) * 2016-06-12 2016-11-09 湘潭大学 一种车牌识别方法
CN106504233B (zh) * 2016-10-18 2019-04-09 国网山东省电力公司电力科学研究院 基于Faster R-CNN的无人机巡检图像电力小部件识别方法及系统
CN106941602B (zh) * 2017-03-07 2020-10-13 中国铁路总公司 机车司机行为识别方法及装置
CN107316001A (zh) * 2017-05-31 2017-11-03 天津大学 一种自动驾驶场景中小且密集的交通标志检测方法
CN107316058A (zh) * 2017-06-15 2017-11-03 国家新闻出版广电总局广播科学研究院 通过提高目标分类和定位准确度改善目标检测性能的方法

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130236057A1 (en) * 2007-10-10 2013-09-12 Samsung Electronics Co., Ltd. Detecting apparatus of human component and method thereof
CN104573659A (zh) * 2015-01-09 2015-04-29 安徽清新互联信息科技有限公司 一种基于svm的驾驶员接打电话监控方法
CN106780612A (zh) * 2016-12-29 2017-05-31 浙江大华技术股份有限公司 一种图像中的物体检测方法及装置
CN106815574A (zh) * 2017-01-20 2017-06-09 博康智能信息技术有限公司北京海淀分公司 建立检测模型、检测接打手机行为的方法和装置
CN107563446A (zh) * 2017-09-05 2018-01-09 华中科技大学 一种微操作系统目标检测方法

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2023509572A (ja) * 2020-04-29 2023-03-09 ベイジン バイドゥ ネットコム サイエンス テクノロジー カンパニー リミテッド 車両を検出するための方法、装置、電子機器、記憶媒体およびコンピュータプログラム
JP7357789B2 (ja) 2020-04-29 2023-10-06 ベイジン バイドゥ ネットコム サイエンス テクノロジー カンパニー リミテッド 車両を検出するための方法、装置、電子機器、記憶媒体およびコンピュータプログラム
WO2022027895A1 (zh) * 2020-08-07 2022-02-10 上海商汤临港智能科技有限公司 异常坐姿识别方法、装置、电子设备、存储介质及程序
CN112270210A (zh) * 2020-10-09 2021-01-26 珠海格力电器股份有限公司 数据处理、操作指令识别方法、装置、设备和介质
CN112270210B (zh) * 2020-10-09 2024-03-01 珠海格力电器股份有限公司 数据处理、操作指令识别方法、装置、设备和介质
CN112257604A (zh) * 2020-10-23 2021-01-22 北京百度网讯科技有限公司 图像检测方法、装置、电子设备和存储介质
CN113205067A (zh) * 2021-05-26 2021-08-03 北京京东乾石科技有限公司 作业人员监控方法、装置、电子设备和存储介质
CN113205067B (zh) * 2021-05-26 2024-04-09 北京京东乾石科技有限公司 作业人员监控方法、装置、电子设备和存储介质

Also Published As

Publication number Publication date
JP2021517312A (ja) 2021-07-15
SG11202009320PA (en) 2020-10-29
KR102470680B1 (ko) 2022-11-25
JP7061685B2 (ja) 2022-04-28
KR20200124280A (ko) 2020-11-02
US20210012127A1 (en) 2021-01-14
CN110956060A (zh) 2020-04-03

Similar Documents

Publication Publication Date Title
WO2020063753A1 (zh) 动作识别、驾驶动作分析方法和装置、电子设备
US11726577B2 (en) Systems and methods for triggering actions based on touch-free gesture detection
TWI741512B (zh) 駕駛員注意力監測方法和裝置及電子設備
US10223838B2 (en) Method and system of mobile-device control with a plurality of fixed-gradient focused digital cameras
US20210081754A1 (en) Error correction in convolutional neural networks
EP2634727B1 (en) Method and portable terminal for correcting gaze direction of user in image
CN111566612A (zh) 基于姿势和视线的视觉数据采集系统
CN110956061B (zh) 动作识别方法及装置、驾驶员状态分析方法及装置
WO2020125499A1 (zh) 一种操作提示方法及眼镜
US11715231B2 (en) Head pose estimation from local eye region
US20180239975A1 (en) Method and system for monitoring driving behaviors
EP2490155A1 (en) A user wearable visual assistance system
WO2019128101A1 (zh) 一种投影区域自适应的动向投影方法、装置及电子设备
TW201445457A (zh) 虛擬眼鏡試戴方法及其裝置
WO2022042624A1 (zh) 信息显示方法、设备及存储介质
CN112183200B (zh) 一种基于视频图像的眼动追踪方法和系统
CN111046734A (zh) 基于膨胀卷积的多模态融合视线估计方法
JP2019179390A (ja) 注視点推定処理装置、注視点推定モデル生成装置、注視点推定処理システム、注視点推定処理方法、プログラム、および注視点推定モデル
CN114463725A (zh) 驾驶员行为检测方法及装置、安全驾驶提醒方法及装置
KR20150064977A (ko) 얼굴정보 기반의 비디오 분석 및 시각화 시스템
KR20190119212A (ko) 인공신경망을 이용한 가상 피팅 시스템, 이를 위한 방법 및 이 방법을 수행하는 프로그램이 기록된 컴퓨터 판독 가능한 기록매체
Saif et al. Robust drowsiness detection for vehicle driver using deep convolutional neural network
WO2020051781A1 (en) Systems and methods for drowsiness detection
WO2018059258A1 (zh) 采用增强现实技术提供手掌装饰虚拟图像的实现方法及其装置
WO2022142079A1 (zh) 图形码显示方法、装置、终端及存储介质

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19864657

Country of ref document: EP

Kind code of ref document: A1

ENP Entry into the national phase

Ref document number: 2020551540

Country of ref document: JP

Kind code of ref document: A

ENP Entry into the national phase

Ref document number: 20207027826

Country of ref document: KR

Kind code of ref document: A

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19864657

Country of ref document: EP

Kind code of ref document: A1