WO2020063753A1 - Action recognition and driving action analysis method and device, electronic device - Google Patents
Action recognition and driving action analysis method and device, electronic device
- Publication number
- WO2020063753A1 (PCT/CN2019/108167; CN2019108167W)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- action
- frame
- candidate
- candidate frame
- motion
- Prior art date
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/20—Image preprocessing
- G06V10/25—Determination of region of interest [ROI] or a volume of interest [VOI]
-
- B—PERFORMING OPERATIONS; TRANSPORTING
- B60—VEHICLES IN GENERAL
- B60R—VEHICLES, VEHICLE FITTINGS, OR VEHICLE PARTS, NOT OTHERWISE PROVIDED FOR
- B60R21/00—Arrangements or fittings on vehicles for protecting or preventing injuries to occupants or pedestrians in case of accidents or other traffic risks
- B60R21/01—Electrical circuits for triggering passive safety arrangements, e.g. airbags, safety belt tighteners, in case of vehicle accidents or impending vehicle accidents
- B60R21/015—Electrical circuits for triggering passive safety arrangements, e.g. airbags, safety belt tighteners, in case of vehicle accidents or impending vehicle accidents including means for detecting the presence or position of passengers, passenger seats or child seats, and the related safety parameters therefor, e.g. speed or timing of airbag inflation in relation to occupant position or seat belt use
- B60R21/01512—Passenger detection systems
- B60R21/01542—Passenger detection systems detecting passenger motion
-
- B—PERFORMING OPERATIONS; TRANSPORTING
- B60—VEHICLES IN GENERAL
- B60W—CONJOINT CONTROL OF VEHICLE SUB-UNITS OF DIFFERENT TYPE OR DIFFERENT FUNCTION; CONTROL SYSTEMS SPECIALLY ADAPTED FOR HYBRID VEHICLES; ROAD VEHICLE DRIVE CONTROL SYSTEMS FOR PURPOSES NOT RELATED TO THE CONTROL OF A PARTICULAR SUB-UNIT
- B60W40/00—Estimation or calculation of non-directly measurable driving parameters for road vehicle drive control systems not related to the control of a particular sub unit, e.g. by using mathematical models
- B60W40/08—Estimation or calculation of non-directly measurable driving parameters for road vehicle drive control systems not related to the control of a particular sub unit, e.g. by using mathematical models related to drivers or passengers
-
- B—PERFORMING OPERATIONS; TRANSPORTING
- B60—VEHICLES IN GENERAL
- B60W—CONJOINT CONTROL OF VEHICLE SUB-UNITS OF DIFFERENT TYPE OR DIFFERENT FUNCTION; CONTROL SYSTEMS SPECIALLY ADAPTED FOR HYBRID VEHICLES; ROAD VEHICLE DRIVE CONTROL SYSTEMS FOR PURPOSES NOT RELATED TO THE CONTROL OF A PARTICULAR SUB-UNIT
- B60W50/00—Details of control systems for road vehicle drive control not related to the control of a particular sub-unit, e.g. process diagnostic or vehicle driver interfaces
- B60W50/08—Interaction between the driver and the control system
- B60W50/14—Means for informing the driver, warning the driver or prompting a driver intervention
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/084—Backpropagation, e.g. using gradient descent
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
- G06V10/44—Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
- G06V10/443—Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components by matching or filtering
- G06V10/449—Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters
- G06V10/451—Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters with interaction between the filter responses, e.g. cortical complex cells
- G06V10/454—Integrating the filters into a hierarchical structure, e.g. convolutional neural networks [CNN]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/764—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/77—Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
- G06V10/776—Validation; Performance evaluation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
- G06V20/41—Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/50—Context or environment of the image
- G06V20/59—Context or environment of the image inside of a vehicle, e.g. relating to seat occupancy, driver state or inner lighting conditions
- G06V20/597—Recognising the driver's state or behaviour, e.g. attention or drowsiness
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/10—Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
- G06V40/16—Human faces, e.g. facial parts, sketches or expressions
- G06V40/168—Feature extraction; Face representation
- G06V40/171—Local features and components; Facial parts ; Occluding parts, e.g. glasses; Geometrical relationships
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/20—Movements or behaviour, e.g. gesture recognition
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/20—Movements or behaviour, e.g. gesture recognition
- G06V40/28—Recognition of hand or arm movements, e.g. recognition of deaf sign language
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N23/00—Cameras or camera modules comprising electronic image sensors; Control thereof
- H04N23/50—Constructional details
- H04N23/54—Mounting of pick-up tubes, electronic image sensors, deviation or focusing coils
-
- B—PERFORMING OPERATIONS; TRANSPORTING
- B60—VEHICLES IN GENERAL
- B60W—CONJOINT CONTROL OF VEHICLE SUB-UNITS OF DIFFERENT TYPE OR DIFFERENT FUNCTION; CONTROL SYSTEMS SPECIALLY ADAPTED FOR HYBRID VEHICLES; ROAD VEHICLE DRIVE CONTROL SYSTEMS FOR PURPOSES NOT RELATED TO THE CONTROL OF A PARTICULAR SUB-UNIT
- B60W2420/00—Indexing codes relating to the type of sensors based on the principle of their operation
- B60W2420/40—Photo, light or radio wave sensitive means, e.g. infrared sensors
- B60W2420/403—Image sensing, e.g. optical camera
-
- B—PERFORMING OPERATIONS; TRANSPORTING
- B60—VEHICLES IN GENERAL
- B60W—CONJOINT CONTROL OF VEHICLE SUB-UNITS OF DIFFERENT TYPE OR DIFFERENT FUNCTION; CONTROL SYSTEMS SPECIALLY ADAPTED FOR HYBRID VEHICLES; ROAD VEHICLE DRIVE CONTROL SYSTEMS FOR PURPOSES NOT RELATED TO THE CONTROL OF A PARTICULAR SUB-UNIT
- B60W2540/00—Input parameters relating to occupants
- B60W2540/229—Attention level, e.g. attentive to driving, reading or sleeping
Definitions
- The present application relates to the field of image processing technology, and in particular to an action recognition method and device, a driving action analysis method and device, and an electronic device.
- Action recognition technology has become a very popular applied research direction in recent years and can be seen in many fields and products.
- The use of this technology is also a future trend of human-computer interaction, and it has broad application prospects in driver monitoring in particular.
- The embodiments of the present application provide a technical solution for action recognition and a technical solution for driving action analysis.
- In a first aspect, an embodiment of the present application provides an action recognition method, which includes: extracting features from an image containing a human face; determining, based on the features, a plurality of candidate frames that may include a predetermined action; determining an action target frame based on the plurality of candidate frames, wherein the action target frame includes a local area of the human face and an action interaction object; and classifying the predetermined action based on the action target frame to obtain an action recognition result.
- In a second aspect, an embodiment of the present application provides a driving action analysis method.
- The method includes: collecting, through an on-board camera, a video stream containing a driver's face image; obtaining an action recognition result for at least one frame of image in the video stream by using any implementation of the action recognition method described in the embodiments of the present application; and generating dangerous driving prompt information in response to the action recognition result meeting a predetermined condition.
- An embodiment of the present application further provides an action recognition device.
- The device includes: a first extraction unit for extracting features from an image containing a human face; a second extraction unit for determining, based on the features, a plurality of candidate frames that may include a predetermined action; a determination unit configured to determine an action target frame based on the plurality of candidate frames, wherein the action target frame includes a local area of the human face and an action interaction object; and a classification unit configured to classify the predetermined action based on the action target frame to obtain an action recognition result.
- An embodiment of the present application further provides a driving action analysis device.
- The device includes: a vehicle-mounted camera for collecting a video stream containing a driver's face image; an obtaining unit for obtaining, through any implementation of the action recognition device, an action recognition result for at least one frame of image in the video stream; and a generating unit configured to generate dangerous driving prompt information in response to the action recognition result meeting a predetermined condition.
- An embodiment of the present application further provides an electronic device including a memory and a processor.
- The memory stores computer-executable instructions, and when the processor runs the computer-executable instructions stored in the memory, it implements the method described in the first aspect or the second aspect of the embodiments of the present application.
- An embodiment of the present application further provides a computer-readable storage medium storing instructions which, when run on a computer, cause the computer to execute the method described in the first aspect or the second aspect.
- An embodiment of the present application further provides a computer program including computer instructions. When the computer instructions are run in a processor of a device, the processor executes the method described in the first aspect or the second aspect of the embodiments of the present application.
- The embodiments of the present application extract features from an image containing a human face, determine multiple candidate frames that may include a predetermined action based on the extracted features, determine an action target frame based on the plurality of candidate frames, and classify the predetermined action based on the action target frame to obtain an action recognition result.
- Because the action target frame described in the embodiments of the present application includes a local area of the human face and an action interaction object, the classification of the predetermined action treats the local face area and the action interaction object as a whole, rather than cutting the human body part and the interaction object apart, and classifies based on the features of that whole. It can therefore recognize fine actions, especially fine actions in or near the face area, improving the accuracy and precision of action recognition.
- FIG. 1 is a schematic flowchart of a motion recognition method according to an embodiment of the present application.
- FIG. 2 is a schematic diagram of a target action frame according to an embodiment of the present application.
- FIG. 3 is a schematic flowchart of another motion recognition method according to an embodiment of the present application.
- FIG. 4 is a schematic diagram of a negative sample image including an action similar to a predetermined action according to an embodiment of the present application.
- FIG. 5 is a schematic flowchart of a driving action analysis method according to an embodiment of the present application.
- FIG. 6 is a schematic flowchart of a neural network training method according to an embodiment of the present application.
- FIG. 7 is a schematic diagram of an action supervision frame for drinking water according to an embodiment of the present application.
- FIG. 8 is a schematic diagram of a call supervision frame provided by an embodiment of the present application.
- FIG. 9 is a schematic structural diagram of a motion recognition device according to an embodiment of the present application.
- FIG. 10 is a schematic structural diagram of a training component of a neural network according to an embodiment of the present application.
- FIG. 11 is a schematic structural diagram of a driving motion analysis device according to an embodiment of the present application.
- FIG. 12 is a schematic diagram of a hardware structure of an electronic device according to an embodiment of the present application.
- FIG. 1 is a schematic flowchart of a motion recognition method according to an embodiment of the present application. As shown in FIG. 1, the method includes:
- The embodiments of the present application mainly identify the actions of people in a vehicle. Taking the driver as an example, the embodiments can recognize certain driving actions made by the driver while driving the vehicle and give the driver a reminder according to the recognition result.
- The inventor found that some fine actions of a person in the vehicle are related to the human face, for example the driver drinking water or the driver making a call, and that such actions are difficult or impossible to recognize through human keypoint detection or human pose estimation.
- Feature extraction is performed on an image to be processed, and the actions in the image to be processed are identified according to the extracted features.
- The above-mentioned actions may be actions of the hand area and/or of a local area of the face, actions involving an action interaction object, and so on. Therefore, the vehicle camera is used to capture an image of the person in the vehicle as the image to be processed, and a convolution operation is then performed on that image to extract action features.
- The method further includes: using a vehicle-mounted camera to capture an image, containing a human face, of a person located in the vehicle.
- The person in the vehicle includes at least one of the following: a driver in the driving area of the vehicle, a person in the front passenger area of the vehicle, and a person in a rear seat of the vehicle.
- The vehicle-mounted camera may be a red-green-blue (RGB) camera, an infrared camera, or a near-infrared camera.
- The embodiments of the present application mainly identify predetermined actions of persons in the vehicle. Taking the driver as an example, a predetermined action may be, for example, an action corresponding to dangerous driving by the driver, or an action that presents a certain danger to the driver.
- The features of the predetermined action are first defined, and a neural network is then used to determine, from the defined features and the features extracted from the image, whether a predetermined action is present in the image. When a predetermined action is present in the image, a plurality of candidate frames including the predetermined action are determined in the image.
- The neural networks in this embodiment are all trained; that is, the features of a predetermined action in an image can be extracted through the neural network.
- The neural network may be provided with multiple convolutional layers, and the stacked convolutional layers can extract richer information from the image, thereby improving the accuracy of determining the predetermined action.
- Through the neural network, a feature region including the hand region and a local area of the human face is obtained; a candidate region is determined based on the feature region, and the candidate region is marked by a candidate frame, where the candidate frame may be represented by a rectangular box, for example.
- A feature region including the hand region, the local face region, and the region corresponding to an action interaction object is marked by another candidate frame.
- 103. Determine an action target frame based on the plurality of candidate frames, wherein the action target frame includes a local area of the human face and an action interaction object.
- The actions identified in the embodiments of the present application are all fine actions related to the face.
- Such fine face-related actions are difficult or even impossible to recognize through human keypoint detection, and the regions corresponding to such actions include at least the local face area and the region of the action interaction object; for example, the local face area and the interaction-object region alone, or the local face area, the interaction-object region, and the hand region together. Recognition of such fine actions can therefore be realized by identifying the features in the action target frame obtained from the multiple candidate frames.
- The local area of the human face includes at least one of the following: a mouth area, an ear area, and an eye area.
- The action interaction object includes at least one of the following: a container, a cigarette, a mobile phone, food, a tool, a beverage bottle, glasses, and a mask.
- The action target frame further includes: a hand region.
- The target action frame shown in FIG. 2 includes a local face area, a mobile phone (that is, the action interaction object), and a hand.
- The target action frame may also include, for example, a mouth and a cigarette (that is, the action interaction object).
- A candidate frame may include features other than those corresponding to the predetermined action, or may not include all the features corresponding to the predetermined action (meaning all features of any one predetermined action), which would affect the final action recognition result. Therefore, to ensure the accuracy of the final recognition result, the positions of the candidate frames need to be adjusted; that is, an action target frame is determined based on the plurality of candidate frames, and the position and size of the action target frame may differ from those of at least some of the candidate frames. As shown in FIG.
- A position offset and a zoom factor for each candidate frame can be determined according to the position and size of the features corresponding to the predetermined action, and the position and size of the candidate frame are then adjusted according to that position offset and zoom factor, so that the adjusted action target frame includes only, and all of, the features corresponding to the predetermined action. On this basis, the position and size of each candidate frame are adjusted, and the adjusted candidate frames are determined as the action target frame. It can be understood that the adjusted candidate frames can overlap into one frame, and the overlapping candidate frames are determined as the action target frame.
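The offset-and-scale adjustment of candidate frames described above can be sketched as follows. The box format (center_x, center_y, width, height), the function names, and merging overlapping frames by taking their bounding rectangle are illustrative assumptions, not the implementation disclosed in this application:

```python
def refine_box(box, dx, dy, sw, sh):
    """Apply a position offset (dx, dy) and zoom factors (sw, sh) to a
    candidate frame given as (center_x, center_y, width, height)."""
    cx, cy, w, h = box
    return (cx + dx, cy + dy, w * sw, h * sh)

def merge_boxes(boxes):
    """Merge adjusted candidate frames that overlap into a single action
    target frame by taking their common bounding rectangle."""
    lefts   = [cx - w / 2 for cx, cy, w, h in boxes]
    rights  = [cx + w / 2 for cx, cy, w, h in boxes]
    tops    = [cy - h / 2 for cx, cy, w, h in boxes]
    bottoms = [cy + h / 2 for cx, cy, w, h in boxes]
    l, r = min(lefts), max(rights)
    t, b = min(tops), max(bottoms)
    return ((l + r) / 2, (t + b) / 2, r - l, b - t)
```

For example, `refine_box((10.0, 10.0, 4.0, 4.0), 1.0, -1.0, 2.0, 0.5)` shifts and rescales the frame to (11.0, 9.0, 8.0, 2.0).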
- The predetermined action includes at least one of the following: making a call, smoking, drinking water or a beverage, eating, using a tool, wearing glasses, and applying makeup.
- The predetermined action may be classified based on the features corresponding to the predetermined action contained in the action target frame.
- A neural network for action classification may be used to classify the features corresponding to the predetermined action contained in the action target frame, so as to obtain a classification and recognition result of the predetermined action.
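As a toy illustration of such a classification step, the following sketch scores a feature vector taken from the action target frame with a linear layer followed by softmax. The class names, weight layout, and feature vector are hypothetical; the actual network structure is not specified here:

```python
import math

def classify_action(target_features, class_weights):
    """Score the feature vector of the action target frame against each
    predetermined-action class and return the most probable class along
    with the softmax probabilities (purely illustrative)."""
    scores = {c: sum(w * f for w, f in zip(ws, target_features))
              for c, ws in class_weights.items()}
    m = max(scores.values())  # subtract the max score for numerical stability
    exps = {c: math.exp(s - m) for c, s in scores.items()}
    z = sum(exps.values())
    probs = {c: e / z for c, e in exps.items()}
    return max(probs, key=probs.get), probs
```

A feature vector that aligns with one class's weight vector yields that class as the recognition result.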
- A plurality of candidate frames that may include the predetermined action are determined based on the extracted features, and an action target frame is determined based on the plurality of candidate frames.
- The target action is then quickly classified among the predetermined actions.
- Because the action target frame includes a local area of the human face and an action interaction object, the classification of the predetermined action treats the local face area and the action interaction object as a whole, rather than cutting the human body part and the interaction object apart, and classifies based on the features of that whole. It can therefore recognize fine actions, especially fine actions in or near the face area, improving the accuracy and precision of recognition.
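The steps summarized above can be sketched as a single pipeline; the four callables stand in for the trained network branches and are placeholders, not the disclosed implementation:

```python
def recognize_action(image, extract, propose, determine_target, classify):
    """Pipeline of the recognition method: feature extraction, candidate
    frame proposal, action target frame determination, classification."""
    features = extract(image)              # features of the image containing a face
    candidates = propose(features)         # candidate frames that may include a predetermined action
    target = determine_target(candidates)  # action target frame (face local area + interaction object)
    return classify(target)                # action recognition result
```

Any concrete backbone, proposal branch, refinement step, and classifier can be slotted into the four roles.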
- FIG. 3 is a schematic flowchart of another motion recognition method according to an embodiment of the present application. As shown in FIG. 3, the method includes:
- Acquiring the image to be processed may include: taking a picture of the person in the vehicle through the vehicle camera to acquire the image to be processed, or capturing video of the person in the vehicle through the vehicle camera and using a frame of the captured video as the image to be processed.
- The person in the vehicle includes at least one of the following: a driver in the driving area of the vehicle, a person in the front passenger area of the vehicle, and a person in a rear seat of the vehicle.
- The vehicle-mounted camera may be an RGB camera, an infrared camera, or a near-infrared camera.
- RGB cameras provide the three basic color components over three separate cables. This type of camera usually uses three independent charge-coupled device (CCD) sensors to acquire the three color signals, and RGB cameras are often used for very accurate color image acquisition.
- Lighting in real environments is complicated, and lighting inside a vehicle even more so. Light intensity directly affects shooting quality: in particular, when the light intensity in the vehicle is low, an ordinary camera cannot capture clear photos or videos, so the image or video loses useful information, which affects subsequent processing.
- An infrared camera can emit infrared light toward the photographed object and form an image from the reflected infrared light, which solves the problem of low-quality or unusable images captured by ordinary cameras in low-light or dark conditions.
- Both an ordinary camera and an infrared camera may be provided. When the light intensity is higher than a preset value, the image to be processed is acquired by the ordinary camera; when the light intensity is lower than the preset value, the infrared camera acquires the image to be processed.
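The light-based camera selection just described amounts to a simple threshold check; the preset value and the return labels below are illustrative, not values from this application:

```python
def pick_camera(light_intensity, preset_value=50.0):
    """Choose the capture source by ambient light intensity: the ordinary
    camera at or above the preset value, the infrared camera below it."""
    return "ordinary" if light_intensity >= preset_value else "infrared"
```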
- A feature map is obtained by performing a convolution operation on the image to be processed through the feature extraction branch of the neural network.
- The feature extraction branch performs the convolution operation by "sliding" a convolution kernel over the image to be processed.
- For example, when the convolution kernel is centered on a certain pixel, the gray values of the pixels covered by the kernel are multiplied by the corresponding kernel values, and all the products are summed to give the output value for that pixel. The kernel then "slides" to the next pixel, and so on, until the convolution over all pixels of the image to be processed is completed and a feature map is obtained.
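The "sliding" computation described above can be written out directly. This is a minimal pure-Python convolution (stride 1, no padding) for a single-channel gray image, not the network's actual implementation:

```python
def conv2d(image, kernel):
    """Slide the kernel over the image: at each position, multiply the
    covered gray values elementwise by the kernel values and sum the
    products, yielding one value of the feature map."""
    ih, iw = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])
    feature_map = []
    for y in range(ih - kh + 1):
        row = []
        for x in range(iw - kw + 1):
            row.append(sum(image[y + i][x + j] * kernel[i][j]
                           for i in range(kh) for j in range(kw)))
        feature_map.append(row)
    return feature_map
```

On a 3x3 image with a 2x2 kernel this produces a 2x2 feature map; stacking such layers, each consuming the previous feature map, yields progressively richer features.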
- the feature extraction branch of the neural network in this embodiment may include multiple layers of convolution layers, and the feature map obtained through feature extraction by the previous layer of convolution layers may be used as input data of the next layer of convolution layers.
- multiple convolution layers extract richer information from the image, thereby improving the accuracy of feature extraction.
- a feature map corresponding to the image to be processed can be obtained by performing a stepwise convolution operation on the image to be processed through the feature extraction branch of the neural network including multiple convolution layers.
- the candidate frame extraction branch of the neural network determines a plurality of candidate frames that may include a predetermined action on the feature map.
- the feature map is processed through the candidate frame extraction branch of the neural network to determine a plurality of candidate frames that may include a predetermined action.
- the feature map may include at least one of the features corresponding to a hand, a cigarette, a cup, a mobile phone, glasses, a mask, and a local area of a human face, and a plurality of candidate frames are determined based on the at least one feature.
- since the feature extraction branch of the neural network is used in step 302 to extract the features of the image to be processed, the extracted features may include features other than those corresponding to the predetermined action. Among the multiple candidate frames identified by the candidate frame extraction branch, at least some may contain features other than the features corresponding to the predetermined action, or may not contain all the features corresponding to the predetermined action; therefore, each of the multiple candidate frames only possibly includes a predetermined action.
- the candidate frame extraction branch of the neural network in this embodiment may likewise include multiple convolutional layers, with the features extracted by one convolutional layer used as input data for the next; the stacked layers extract richer information, thereby improving the accuracy of feature extraction.
- determining, via the candidate frame extraction branch of the neural network, a plurality of candidate frames on the feature map that may include a predetermined action includes: dividing the features in the feature map according to characteristics of the predetermined action to obtain a plurality of candidate regions; and obtaining, according to the plurality of candidate regions, a plurality of candidate frames and a first confidence level of each of the plurality of candidate frames, where the first confidence level is the probability that the candidate frame is the action target frame.
- the candidate frame extraction branch of the neural network identifies the feature map; features such as hand features and corresponding features of a local area of the face, or hand features, corresponding features of an action interaction object (such as a mobile phone), and corresponding features of the face local area, are divided from the feature map; candidate regions are determined based on the divided features; and each candidate region is identified by a candidate frame (for example, a rectangular frame). In this way, a plurality of candidate regions identified by candidate frames is obtained.
- the candidate frame extraction branch of the neural network may also determine a first confidence level corresponding to each candidate frame, where the first confidence level is used to represent a possibility that the candidate frame is a target action frame in a form of probability.
- the first confidence level is a predicted value, obtained by the candidate frame extraction branch of the neural network according to the characteristics of the candidate frame, of the probability that the candidate frame is the target action frame.
- in step 304, the detection frame refinement branch of the neural network determines an action target frame based on the multiple candidate frames, where the action target frame includes a local area of a human face and an action interaction object.
- determining an action target frame based on the plurality of candidate frames via the detection frame refinement branch of the neural network includes: removing candidate frames whose first confidence level is less than a first threshold to obtain at least one first candidate frame; pooling the at least one first candidate frame to obtain at least one second candidate frame; and determining the action target frame according to the at least one second candidate frame.
- the target object performs in turn actions similar to making a call, drinking water, and smoking: the right hand is placed next to the face, but there is no mobile phone, water glass, or cigarette. A neural network is prone to mistakenly recognizing these actions of the target object as making a call, drinking water, and smoking.
- the predetermined action is a predetermined dangerous driving action
- the driver may, for example, scratch an ear because the ear area itches, or open the mouth or raise a hand for other reasons. These movements are not among the predetermined dangerous driving movements, but they greatly interfere with the candidate frame extraction branch of the neural network when it extracts candidate frames, which in turn affects the subsequent classification of the action and leads to wrong motion recognition results.
- the detection frame refinement branch of the neural network is obtained through pre-training to remove candidate frames with a first confidence level less than a first threshold to obtain at least one first candidate frame; the first confidence level of the at least one first candidate frame Both are greater than or equal to the first threshold.
- the first threshold may be 0.5, for example.
- the value of the first threshold in the embodiments of the present application is not limited thereto.
- the pooling processing the at least one first candidate frame to obtain at least one second candidate frame includes: pooling the at least one first candidate frame to obtain At least one first feature region corresponding to the at least one first candidate frame; adjusting the position and size of the corresponding first candidate frame based on each first feature region to obtain at least one second candidate frame.
- the number of features in the area where the first candidate frame is located may be large, and using those features directly would require a huge amount of computation. Therefore, before subsequent processing, the first candidate frame is pooled first; that is, the features in its area are pooled to reduce their dimensionality, which meets the needs of subsequent computation and greatly reduces its cost. Similar to obtaining the candidate regions in step 303, the pooled features are divided according to the characteristics of the predetermined action to obtain multiple first feature regions. It can be understood that, in this embodiment, pooling the area corresponding to the first candidate frame presents the features corresponding to the predetermined action in the first feature region in a low-dimensional form.
- a specific implementation of the pooling process may be as follows: suppose the size of the first candidate frame is expressed as h * w, where h represents its height and w its width. When the target feature size is H * W, the first candidate frame can be divided into H * W grids, each of size (h / H) * (w / W); then the average gray value of the pixels in each grid is calculated, or the maximum gray value in each grid is determined, and that average or maximum is used as the value corresponding to the grid, thereby obtaining the pooling result of the first candidate frame.
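The grid-based pooling described above might be sketched as follows; the integer grid partition and the max/average choice are illustrative assumptions, and the names are our own:

```python
def roi_pool(region, H, W, mode="max"):
    """Pool an h*w region of gray values down to a fixed H*W grid.
    Each output cell takes the max (or average) of the pixels that
    fall into its (h/H)*(w/W) sub-grid, as described above."""
    h, w = len(region), len(region[0])
    out = []
    for gi in range(H):
        row = []
        r0, r1 = gi * h // H, (gi + 1) * h // H
        for gj in range(W):
            c0, c1 = gj * w // W, (gj + 1) * w // W
            cells = [region[r][c] for r in range(r0, r1) for c in range(c0, c1)]
            row.append(max(cells) if mode == "max" else sum(cells) / len(cells))
        out.append(row)
    return out

region = [[1, 2, 3, 4],
          [5, 6, 7, 8],
          [9, 10, 11, 12],
          [13, 14, 15, 16]]
print(roi_pool(region, 2, 2))  # [[6, 8], [14, 16]]
```

Dividing a 4*4 region into a 2*2 grid reduces the feature dimensionality from 16 values to 4, which is the computational saving the embodiment describes.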
- adjusting the position and size of a corresponding first candidate frame based on each first feature region to obtain at least one second candidate frame includes: obtaining, according to the feature corresponding to the predetermined action in the first feature region, a first action feature frame corresponding to that feature; obtaining a first position offset of the at least one first candidate frame according to the geometric center coordinates of the first action feature frame; obtaining a first zoom factor of the at least one first candidate frame according to the size of the first action feature frame; and adjusting the position and size of the at least one first candidate frame according to the at least one first position offset and the at least one first zoom factor to obtain at least one second candidate frame.
- the features corresponding to each predetermined action in the first feature region are respectively identified by first action feature frames; a first action feature frame may specifically be a rectangular frame, that is, the feature corresponding to each predetermined action in the first feature region is identified by a rectangular frame.
- the geometric center coordinates of the first motion feature frame in a pre-established XOY coordinate system are obtained, and the first position offset of the first candidate frame corresponding to the first motion feature frame is determined according to the geometric center coordinates;
- the XOY coordinate system is generally a coordinate system established by setting the coordinate origin O, with the horizontal direction as the X axis, and the direction perpendicular to the X axis as the Y axis.
- there is usually a certain deviation between the geometric center of the first action feature frame and the geometric center of the first candidate frame, and the first position offset of the first candidate frame is determined according to this deviation. Specifically, the offset between the geometric center of the first action feature frame and the geometric center of the first candidate frame corresponding to the feature of the same predetermined action may be used as the first position offset of that first candidate frame.
- each first candidate frame corresponds to a first position offset
- the first position offset includes an offset along the X axis and an offset along the Y axis.
- the XOY coordinate system here takes the upper left corner of the first feature region (in the orientation in which it is input to the detection frame refinement branch of the neural network) as the coordinate origin, the horizontal direction to the right as the positive X axis, and the vertical direction downward as the positive Y axis.
- alternatively, the bottom left corner, top right corner, or bottom right corner of the first feature region, or its center point, may be used as the origin, with the horizontal direction to the right as the positive X axis and the vertical direction as the positive Y axis.
- the size of the first action feature frame is obtained, specifically its length and width, and the first zoom factor of the corresponding first candidate frame is determined according to the length and width of the first action feature frame.
- the first zoom factor of the first candidate frame may be determined based on the length and width of the first motion feature frame and the length and width of the corresponding first candidate frame.
- Each of the first candidate frames corresponds to a first zoom factor.
- the first zoom factors of different first candidate frames may be the same or different.
- the position and size of the first candidate frame are adjusted according to a first position offset and a first zoom factor corresponding to each first candidate frame.
- the first candidate frame is moved according to the above-mentioned first position offset, and its size is adjusted about its geometric center according to the first zoom factor, thereby obtaining the second candidate frame.
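Applying a first position offset and a first zoom factor to a candidate frame could look like the following sketch, assuming frames are represented as (center x, center y, width, height) tuples; the representation and names are assumptions, not from the embodiment:

```python
def adjust_box(box, offset, scale):
    """Shift a candidate frame (cx, cy, w, h) by the position offset and
    rescale it about its geometric center by the zoom factor."""
    cx, cy, w, h = box
    dx, dy = offset
    sx, sy = scale
    return (cx + dx, cy + dy, w * sx, h * sy)

first_candidate = (50.0, 40.0, 30.0, 20.0)   # center x, center y, width, height
second_candidate = adjust_box(first_candidate, offset=(4.0, -2.0), scale=(0.8, 0.9))
print(second_candidate)  # (54.0, 38.0, 24.0, 18.0)
```

Because the scaling is applied about the geometric center, the frame shrinks or grows in place after being moved, as the text describes.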
- the number of second candidate frames is consistent with the number of first candidate frames.
- the second candidate frame obtained in the above manner will contain all the features of the predetermined action in the smallest possible size, which is beneficial to improving the accuracy of the subsequent action classification results.
- a plurality of second candidate frames with similar sizes and nearby geometric centers can be merged into one, and the merged second candidate frame can be used as an action target frame. It should be understood that the sizes and geometric centers of the second candidate frames corresponding to the same predetermined action may be very close, so each predetermined action may correspond to one action target frame.
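The merging of second candidate frames with similar sizes and nearby geometric centers might be sketched as a greedy grouping; the tolerance values and the averaging strategy are assumptions, not specified by the text:

```python
def merge_boxes(boxes, center_tol=10.0, size_tol=0.3):
    """Greedily group second candidate frames (cx, cy, w, h) whose geometric
    centers are within `center_tol` pixels and whose sizes differ by less
    than `size_tol` (relative), then average each group into one target frame."""
    groups = []
    for cx, cy, w, h in boxes:
        for g in groups:
            gcx, gcy, gw, gh = (sum(v) / len(g) for v in zip(*g))
            close = abs(cx - gcx) <= center_tol and abs(cy - gcy) <= center_tol
            similar = abs(w - gw) <= size_tol * gw and abs(h - gh) <= size_tol * gh
            if close and similar:
                g.append((cx, cy, w, h))
                break
        else:
            groups.append([(cx, cy, w, h)])
    return [tuple(sum(v) / len(g) for v in zip(*g)) for g in groups]

# two near-duplicate "phone call" frames and one distinct "smoking" frame
boxes = [(100, 80, 40, 30), (102, 82, 42, 30), (200, 90, 36, 28)]
print(merge_boxes(boxes))  # two action target frames remain
```

Frames for the same predetermined action collapse into a single action target frame, while frames for different actions stay separate.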
- the driver is smoking while making a phone call
- the obtained image to be processed may include features corresponding to two predetermined actions of making a phone call and smoking.
- a candidate frame including the features corresponding to the predetermined action of making a phone call can be obtained, containing a hand, a mobile phone, and a partial area of the face; likewise, a candidate frame including the features corresponding to the predetermined action of smoking can be obtained. The candidate frames corresponding to the predetermined action of making a call are similar in size and have nearby geometric centers, as are the candidate frames corresponding to the predetermined action of smoking. In contrast, the difference in size between any candidate frame for making a call and any candidate frame for smoking is greater than the size difference between any two candidate frames for making a call, and greater than that between any two candidate frames for smoking; similarly, the distance between the geometric centers of a calling candidate frame and a smoking candidate frame is greater than the distance between the geometric centers of any two calling candidate frames, and also greater than that between any two smoking candidate frames.
- the action classification branch of the neural network obtains a region map according to the region corresponding to the action target frame divided from the feature map, and classifies the predetermined action based on the features in the region map to obtain a first action recognition result; the action recognition result corresponding to the image to be processed is then obtained according to the first action recognition results corresponding to all target action frames.
- a first action recognition result is obtained through the action classification branch of the neural network, and a second confidence level may also be obtained through this branch, the second confidence level characterizing the accuracy of the action recognition result.
- obtaining the action recognition result corresponding to the image to be processed according to the first action recognition results corresponding to all target action frames includes: comparing the second confidence level of the first action recognition result corresponding to each target action frame with a preset threshold.
- the driver is photographed by a vehicle-mounted camera to obtain an image including the face of the driver, and the image is input to a neural network as an image to be processed.
- the driver in the image to be processed corresponds to a "calling" action
- two action recognition results are obtained through the processing of the neural network: a "calling" result and a "drinking water" result, where the second confidence level of the "calling" recognition result is 0.8 and the second confidence level of the "drinking water" recognition result is 0.4.
- if the preset threshold is set to 0.6, it can be determined that the action recognition result of the image to be processed is the "calling" action.
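The thresholding in this example can be sketched as follows; the result representation and names are illustrative:

```python
def filter_results(results, threshold=0.6):
    """Keep only the first action recognition results whose second
    confidence level reaches the preset threshold."""
    return [action for action, conf in results if conf >= threshold]

# the "calling" vs. "drinking water" example above
results = [("calling", 0.8), ("drinking water", 0.4)]
print(filter_results(results))  # ['calling']
```

With the threshold at 0.6, only the "calling" result (confidence 0.8) survives, matching the example above.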
- the method may further include: outputting reminder information.
- the specific predetermined action may be a dangerous driving action
- a dangerous driving action is an action that introduces danger into the driving process while the driver is driving the vehicle.
- the dangerous driving action may be an action generated by the driver himself, or an action generated by another person located in the cockpit.
- the output of the reminder information may be output of the reminder information in at least one of audio, video, and text.
- prompt information can be output to the people in the vehicle (such as the driver and/or other occupants) through a terminal, for example by displaying text on the terminal or by playing voice data through it.
- the terminal may be a vehicle-mounted terminal.
- the terminal may be equipped with a display screen and / or an audio output function.
- the specific predetermined actions are, for example, drinking water, making phone calls, wearing glasses, and so on.
- the prompt information is output, and the category of the specific predetermined action (such as a dangerous driving action) can also be output.
- otherwise, the prompt information may not be output, or only the category of the predetermined action may be output.
- prompt information may also be sent by displaying a dialog box through a head up display (HUD), prompting the driver through the displayed content; by the vehicle's built-in audio output, for example playing audio such as "Please pay attention to your driving actions"; by releasing a gas with a refreshing effect, for example spraying toilet water through a vehicle nozzle, whose fragrant and pleasant smell refreshes the driver while reminding him; or by releasing a weak current through the seat to stimulate the driver, achieving the effect of prompting and warning.
- the feature extraction branch of the neural network performs feature extraction; the candidate frame extraction branch obtains, from the extracted features, candidate frames that may include predetermined actions; the detection frame refinement branch refines the candidate frames into action target frames; and finally the action classification branch classifies the features in the target action frames to classify the predetermined action and obtain the action recognition result of the image to be processed. The entire recognition process extracts and processes features of the image to be processed (such as features of the hand area, the face local area, and the area corresponding to the action interaction object), and can thus realize precise recognition of fine movements autonomously and quickly.
- FIG. 5 is a schematic flowchart of a driving motion analysis method according to an embodiment of the present application; as shown in FIG. 5, the method includes:
- the driver is video-captured by a vehicle camera to obtain a video stream, and each frame of the video stream is used as an image to be processed.
- the corresponding motion recognition results are obtained, and the driving state of the driver is identified in combination with the motion recognition results of multiple consecutive frames of images to determine whether the driving state is a dangerous driving state corresponding to a dangerous driving action.
- the predetermined condition includes at least one of the following: the occurrence of a specific predetermined action; the number of times the specific predetermined action occurs within a predetermined time period; and the duration of occurrence and maintenance of the specific predetermined action in the video stream .
- the specific predetermined action may be a predetermined action corresponding to a dangerous driving action in the classification of the predetermined actions in the foregoing embodiments, for example, a driving action corresponding to a drinking water action, a calling action, or the like.
- the motion recognition result meeting the predetermined condition may include: the specific predetermined action being included in the motion recognition result; or the specific predetermined action being included in the motion recognition result and the number of its occurrences within a predetermined duration reaching a preset number; or the specific predetermined action being included in the motion recognition result and the duration of its occurrence in the video stream reaching a preset duration.
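The three forms of the predetermined condition might be checked over per-frame recognition results like this; the frame-list representation, parameter names, and default values are assumptions, not from the embodiment:

```python
def meets_condition(frame_actions, fps, target="calling",
                    min_count=3, min_seconds=2.0):
    """Check the predetermined conditions over per-frame recognition
    results: the specific action occurs at all, occurs at least
    `min_count` times, or persists for at least `min_seconds` of
    consecutive frames of the video stream."""
    occurred = target in frame_actions
    count = frame_actions.count(target)
    longest = run = 0
    for a in frame_actions:
        run = run + 1 if a == target else 0
        longest = max(longest, run)
    duration = longest / fps
    return occurred and (count >= min_count or duration >= min_seconds)

frames = ["none", "calling", "calling", "calling", "none", "calling"]
print(meets_condition(frames, fps=1, target="calling"))  # True
```

Here "calling" occurs four times and persists for three consecutive frames, so either the count condition or the duration condition triggers the dangerous driving prompt.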
- the vehicle driving terminal may generate and output dangerous driving prompt information, and may also output a specific type of predetermined action.
- the method for outputting the dangerous driving prompt information may include: outputting the dangerous driving prompt information by displaying characters on the vehicle terminal, and outputting the dangerous driving prompt information through the audio output function of the vehicle terminal.
- the method further includes: acquiring the vehicle speed of the vehicle provided with the vehicle-mounted camera; and generating the dangerous driving prompt information in response to the motion recognition result meeting the predetermined condition includes: generating the dangerous driving prompt information in response to the vehicle speed being greater than a set threshold and the motion recognition result meeting the predetermined condition.
- when the vehicle speed is not greater than the set threshold, the dangerous driving prompt information may not be generated or output even if the motion recognition result meets the preset condition; only when the vehicle speed is greater than the set threshold and the motion recognition result meets the preset condition is the dangerous driving prompt information generated and output.
- a video is taken of a driver through a vehicle-mounted camera, and each frame of the captured video is used as an image to be processed.
- Each frame of the picture taken by the camera is recognized to obtain the corresponding recognition result, and the actions of the driver are recognized in combination with the results of multiple consecutive frames.
- the driver may be warned through the display terminal, and the type of the dangerous driving action may be prompted.
- the ways to raise warnings include: pop-up dialog box to warn by text, and warn by built-in voice data.
- the neural network in the embodiment of the present application is obtained by pre-supervised training based on a training image set.
- the neural network may include network layers such as a convolution layer, a non-linear layer, and a pooling layer.
- the embodiment of the present application does not limit the specific network structure. After the structure of the neural network is determined, the neural network can be trained iteratively based on sample images with labeled information using supervised gradient back propagation; the specific training method is not limited in the embodiments of the present application.
- FIG. 6 is a schematic flowchart of a neural network training method according to an embodiment of the present application. As shown in FIG. 6, the method includes:
- a sample image for training a neural network may be obtained from a training image set, where the training image set may include multiple sample images.
- the sample images in the training image set include a positive sample image and a negative sample image.
- the positive sample image includes at least one predetermined action corresponding to the target object, such as the target object drinking water, smoking, making a call, wearing glasses, wearing a mask, and the like;
- the negative sample image includes actions similar to the at least one predetermined action, such as the target object's hand resting on the lips, scratching the ears, touching the nose, and so on.
- a sample image containing an action very similar to a predetermined action is used as a negative sample image.
- a first feature map of a sample image may be extracted through a convolution layer in the neural network, and multiple third candidate frames that may include a predetermined action are extracted from the first feature map.
- for the detailed process of this step, reference may be made to the description of step 303 in the foregoing embodiment; details are not described herein again.
- determining an action target frame based on the plurality of third candidate frames includes: obtaining a first action supervision frame according to the predetermined action, the first action supervision frame including a local area of a face and an action interaction object, or a local area of a face, a hand region, and an action interaction object; obtaining a second confidence level of the plurality of third candidate frames, the second confidence level including a first probability that the third candidate frame is the action target frame and a second probability that the third candidate frame is not the action target frame; determining the area coincidence degree between each of the plurality of third candidate frames and the first action supervision frame; if the area coincidence degree is greater than or equal to a second threshold, taking the second confidence level of the corresponding third candidate frame as the first probability; and if the area coincidence degree is less than the second threshold, taking the second confidence level of the corresponding third candidate frame as the second probability. Third candidate frames whose second confidence level is less than the first threshold are subsequently removed.
- a feature of a predetermined motion may be defined in advance.
- the motion characteristics of drinking water include features of the hand area, the face local area, and the cup area (that is, the area corresponding to the action interaction object);
- the motion characteristics of smoking include features of the hand area, the face local area, and the cigarette area (that is, the area corresponding to the action interaction object);
- the motion characteristics of making a call include features of the hand area, the face local area, and the mobile phone area (that is, the area corresponding to the action interaction object);
- the motion characteristics of wearing glasses include features of the hand area, the face local area, and the glasses area (that is, the area corresponding to the action interaction object);
- the motion characteristics of wearing a mask include features of the hand area, the face local area, and the mask area (that is, the area corresponding to the action interaction object).
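The feature sets listed above can be captured in a simple lookup; the feature labels and action keys are illustrative, not from the text:

```python
# Features each predetermined action is expected to contain; the hand and
# face-local areas are shared, while the action interaction object differs.
ACTION_FEATURES = {
    "drinking":        {"hand", "face_local", "cup"},
    "smoking":         {"hand", "face_local", "cigarette"},
    "calling":         {"hand", "face_local", "phone"},
    "wearing_glasses": {"hand", "face_local", "glasses"},
    "wearing_mask":    {"hand", "face_local", "mask"},
}

def matching_actions(region_features):
    """Return the predetermined actions whose full feature set is present
    in a candidate region."""
    return [a for a, req in ACTION_FEATURES.items() if req <= region_features]

print(matching_actions({"hand", "face_local", "phone"}))  # ['calling']
print(matching_actions({"hand", "face_local"}))           # []
```

A region with a hand next to the face but no interaction object matches no predetermined action, which is exactly how the negative samples (hand on lips, scratching ears) differ from the positive ones.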
- the label information of the sample image includes: an action supervision frame and an action category corresponding to the action supervision frame. It can be understood that before processing the sample images through a neural network, it is also necessary to obtain labeling information corresponding to each sample image.
- the action supervision frame is specifically used to identify a predetermined action in the sample image; for details, see the action supervision frame of the target object drinking water in FIG. 7 and the action supervision frame of the target object making a call in FIG. 8.
- Actions that are very similar to the predetermined actions will often cause great interference to the process of extracting candidate frames from the neural network.
- in Figure 4, from left to right, actions similar to making a phone call, drinking water, and smoking are performed in sequence; that is, the target object places its right hand next to its face, but holds no mobile phone, water glass, or cigarette. Neural networks are susceptible to mistakenly identifying these actions as making calls, drinking water, and smoking, and to identifying corresponding candidate frames for them. Therefore, in the embodiment of the present application, the neural network is trained to distinguish between positive sample images and negative sample images.
- the first action supervision frame corresponding to a positive sample image may include a predetermined action, while the first action supervision frame corresponding to a negative sample image may include an action similar to a predetermined action.
- a second confidence level corresponding to the third candidate frames can also be obtained; the second confidence level includes the probability that the third candidate frame is an action target frame, that is, the first probability, and the probability that the third candidate frame is not the action target frame, that is, the second probability.
- the second confidence level is a predicted value, obtained by the neural network according to the characteristics of the third candidate frame, of whether the third candidate frame is the target action frame.
- the coordinates (x3, y3) of the third candidate frame in the coordinate system xoy can also be obtained through the processing of the neural network, together with the size of the third candidate frame, which may be represented by the product of its length and width.
- the coordinates (x3, y3) of the third candidate frame may be the coordinates of a vertex of the third candidate frame, such as its upper left, upper right, lower left, or lower right corner.
- the third candidate box can be represented as bbox (x3, y3, x4, y4).
- the first action supervision box may be expressed as bbox_gt (x1, y1, x2, y2).
- the area coincidence degree IOU between each third candidate frame bbox (x3, y3, x4, y4) and the first action supervision frame bbox_gt (x1, y1, x2, y2) is determined. Optionally, the area coincidence degree IOU is calculated as IOU = (A ∩ B) / (A ∪ B), where:
- A and B respectively represent the region of the third candidate frame and the region of the first action supervision frame; A ∩ B represents the area of the region where the third candidate frame overlaps the first action supervision frame; and A ∪ B represents the area of all regions contained in the third candidate frame and the first action supervision frame together.
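With corner-coordinate boxes as in bbox(x3, y3, x4, y4), the area coincidence degree can be sketched as a standard IOU computation; names are illustrative:

```python
def iou(box_a, box_b):
    """Area coincidence IOU of two axis-aligned boxes (x1, y1, x2, y2):
    intersection area divided by union area."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))   # overlap width
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))   # overlap height
    inter = iw * ih
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    return inter / (area_a + area_b - inter)

candidate = (0, 0, 4, 4)      # bbox(x3, y3, x4, y4)
supervision = (2, 2, 6, 6)    # bbox_gt(x1, y1, x2, y2)
print(iou(candidate, supervision))  # 4 / 28, about 0.143
```

Comparing the result against the second threshold then decides whether the candidate's second confidence level is treated as the first or the second probability.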
- if the area coincidence degree IOU is greater than or equal to the second threshold, the third candidate frame is determined to be a candidate frame that may contain a predetermined action, and its second confidence level is taken as the above-mentioned first probability; if the area coincidence degree IOU is less than the second threshold, the third candidate frame is determined to be a candidate frame unlikely to contain a predetermined action, and its second confidence level is taken as the second probability.
- the value of the second threshold is greater than or equal to 0 and less than or equal to 1.
- the specific value of the second threshold may be determined according to a network training effect.
- the plurality of third candidate frames whose second confidence is less than the first threshold may be removed to obtain a plurality of fourth candidate frames, and the positions and sizes of the fourth candidate frames may be adjusted to obtain the action target frame. For details about how to obtain the action target frame, refer to step 304 in the foregoing embodiment.
- adjusting the position and size of the fourth candidate frame to obtain the action target frame includes: pooling the fourth candidate frame, obtaining a second feature region corresponding to the fourth candidate frame, and based on the The second feature region adjusts the position and size of the corresponding fourth candidate frame to obtain a fifth candidate frame, and obtains an action target frame based on the fifth candidate frame.
- adjusting the position and size of the corresponding fourth candidate frame based on the second feature region to obtain a fifth candidate frame includes: obtaining a second motion feature frame according to the feature corresponding to the predetermined action in the second feature region.
- the geometric center coordinates P(x_n, y_n) of the fourth candidate frames in the coordinate system xoy and the geometric center coordinates Q(x, y) of the second motion feature frame in the coordinate system xoy are obtained.
- Δ(x_n, y_n) = P(x_n, y_n) − Q(x, y), where n is a positive integer and the range of n is consistent with the number of fourth candidate frames.
- Δ(x_n, y_n) is the second position offset of the plurality of fourth candidate frames.
- the sizes of the fourth candidate frames and the second motion feature frame are obtained, and the size of the second motion feature frame is divided by the size of each fourth candidate frame to obtain the second zoom factor of that fourth candidate frame.
- the second zoom factor includes the zoom factor of the length of the fourth candidate frame and the zoom factor of the width.
- the set of geometric center coordinates of the fourth candidate frames can be expressed as {P(x_n, y_n)}; according to the second position offset Δ(x_n, y_n), the set of geometric center coordinates of the fourth candidate frames after position adjustment is {P(x_n, y_n) − Δ(x_n, y_n)}, each of which coincides with Q(x, y).
- with the geometric center of the fourth candidate frame fixed, the length of the fourth candidate frame is scaled by the length zoom factor and the width by the width zoom factor based on the second zoom factor, obtaining a fifth candidate frame.
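The position-offset and scaling steps above can be illustrated with a small sketch; the function name, the (x1, y1, x2, y2) box representation, and the (alpha, beta) names for the length and width zoom factors are assumptions for illustration, not part of the patent:

```python
def adjust_box(box, feature_center, scale_lw):
    """Shift a candidate box so its geometric center coincides with the
    motion feature frame's center Q(x, y), then scale its length and
    width about that fixed center.

    `box` is (x1, y1, x2, y2); `feature_center` is Q(x, y);
    `scale_lw` is (alpha, beta) for the length and width zoom factors.
    """
    x1, y1, x2, y2 = box
    cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0
    qx, qy = feature_center
    # Second position offset: move the geometric center onto Q(x, y).
    dx, dy = qx - cx, qy - cy
    x1, y1, x2, y2 = x1 + dx, y1 + dy, x2 + dx, y2 + dy
    # Scale the length (x extent) by alpha and the width (y extent) by
    # beta while keeping the geometric center fixed.
    alpha, beta = scale_lw
    half_l = (x2 - x1) / 2.0 * alpha
    half_w = (y2 - y1) / 2.0 * beta
    return (qx - half_l, qy - half_w, qx + half_l, qy + half_w)
```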
- obtaining the action target frame based on the fifth candidate frame includes: merging a plurality of fifth candidate frames with similar sizes and positions, and using the merged fifth candidate frame as the action target frame. It should be understood that the sizes and positions of the fifth candidate frames corresponding to the same predetermined action will be very close, so each action target frame after merging corresponds to only one predetermined action.
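The merging step can be sketched as a greedy grouping of boxes whose centers are close and whose areas are similar, each group averaged into one action target frame; the tolerance values below are illustrative assumptions:

```python
def merge_boxes(boxes, center_tol=10.0, size_tol=0.2):
    """Greedily group boxes (x1, y1, x2, y2) whose centers lie within
    `center_tol` of a group's first box and whose areas differ by at
    most `size_tol` (relative), then average each group into one box.
    A simplified stand-in for the merging described above."""
    def center(b):
        return ((b[0] + b[2]) / 2.0, (b[1] + b[3]) / 2.0)

    def area(b):
        return (b[2] - b[0]) * (b[3] - b[1])

    groups = []
    for b in boxes:
        for g in groups:
            ref = g[0]
            cx, cy = center(b)
            rx, ry = center(ref)
            close = abs(cx - rx) <= center_tol and abs(cy - ry) <= center_tol
            similar = abs(area(b) - area(ref)) <= size_tol * area(ref)
            if close and similar:
                g.append(b)
                break
        else:
            groups.append([b])
    # Average each group coordinate-wise into a single action target box.
    return [tuple(sum(c) / len(g) for c in zip(*g)) for g in groups]
```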
- a third confidence level of the action target frame is also obtained, where the third confidence level indicates the probability that the action in the action target frame belongs to each predetermined action category, that is, the third probability.
- the above predetermined actions may include five categories of drinking, smoking, making phone calls, wearing glasses, and wearing a mask.
- the third probability of each action target frame includes five probability values: the probability a that the action in the action target frame is a drinking action, the probability b that it is a smoking action, the probability c that it is a phone-call action, the probability d that it is a wearing-glasses action, and the probability e that it is a wearing-mask action.
- Step 504: Classify the predetermined action based on the action target frame to obtain an action recognition result.
- take as an example that the predetermined actions possibly included in the action target frame are the five actions of drinking water, smoking, making a phone call, wearing glasses, and wearing a mask.
- the classification of the predetermined action with the highest third confidence (i.e., the third probability) is taken as the action recognition result; that is, the category corresponding to the maximum third confidence (i.e., the third probability) among a, b, c, d, and e is output.
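Selecting the category with the maximum third confidence amounts to an argmax over the five probabilities a to e; a minimal sketch (the class-name strings are assumed for illustration):

```python
# Hypothetical class order matching the five predetermined actions above.
ACTIONS = ["drinking", "smoking", "calling", "wearing_glasses", "wearing_mask"]

def classify(third_probs):
    """Pick the predetermined action with the maximum third confidence.
    `third_probs` is the (a, b, c, d, e) probability vector of one
    action target frame; returns (action name, its probability)."""
    idx = max(range(len(third_probs)), key=lambda i: third_probs[i])
    return ACTIONS[idx], third_probs[idx]
```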
- Step 505: Determine a first loss between the detection result of the candidate frame of the sample image and the detection frame labeling information, and a second loss between the action recognition result and the action category labeling information.
- Step 506: Adjust the network parameters of the neural network according to the first loss and the second loss.
- the neural network may include a feature extraction branch of the neural network, a candidate frame extraction branch of the neural network, a detection frame refinement branch of the neural network, and an action classification branch of the neural network.
- for the functions of the branches of the neural network, see the detailed description of steps 301 to 305 in the foregoing embodiment.
- the network parameters of the neural network are updated by calculating a candidate frame coordinate regression loss function (smooth L1) and a category loss function (softmax).
- the expression of the candidate frame extraction loss function (Region Proposal Loss) is as follows:
- N and λ are weight parameters of the candidate frame extraction branch of the neural network, and p_i is a supervised variable.
- the refinement branch of the detection frame of the neural network updates the weight parameters of the network through a loss function.
- the specific expression of the loss function (BboxRefineLoss) is as follows:
- M is the number of the sixth candidate frame
- λ is the weight parameter of the detection frame refinement branch of the neural network
- p_i is the supervised variable
- the expressions of the softmax loss function and the smooth L1 loss function can be found in formula (4) and formula (5); in particular, bbox_i in formula (6) is the geometric center coordinate of the refined target frame, and bbox_gt_j is the geometric center coordinate of the action supervision frame.
- the loss function is the objective function of the neural network optimization
- the process of neural network training or optimization is the process of minimizing the loss function; that is, the closer the loss function value is to 0, the closer the predicted result is to the real result.
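As a rough illustration of the two loss families named above (the patent's formulas (3) to (6) are not reproduced in this excerpt, so these are generic textbook forms, not the patent's exact expressions):

```python
import math

def smooth_l1(pred, target):
    """Smooth L1 (Huber-like) loss summed over coordinates, the usual
    generic form of a box-coordinate regression loss."""
    total = 0.0
    for p, t in zip(pred, target):
        d = abs(p - t)
        total += 0.5 * d * d if d < 1.0 else d - 0.5
    return total

def softmax_cross_entropy(logits, label):
    """Softmax classification loss for one sample (generic form)."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    return -math.log(exps[label] / sum(exps))
```

Both losses approach 0 as the prediction approaches the label, which is the minimization target described above.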
- the supervised variable p_i in formula (3) and formula (4) is replaced with the second confidence of the fourth candidate frame and substituted into formula (3); by adjusting the weight parameters N and λ of the candidate frame extraction branch of the neural network, the value of the Region Proposal Loss (that is, the first loss) is changed, and the weight parameter combination N and λ that makes the value of the Region Proposal Loss closest to 0 is selected.
- the supervised variable p_i is replaced with the fourth probability of the action target frame (that is, the maximum value among the multiple third confidences (i.e., the third probabilities)) and substituted into formula (6); by adjusting the weight parameter λ of the detection frame refinement branch, the value of the Bbox Refine Loss (that is, the second loss) is changed, the weight parameter λ that makes the value of the Bbox Refine Loss closest to 0 is selected, and the weight update of the neural network is completed by gradient back propagation.
- the candidate frame extraction branch with updated weight parameters, the detection frame refinement branch with updated weight parameters, the feature extraction branch, and the action classification branch are trained again; that is, the sample image is input to the neural network, processed by the neural network, and the action classification branch of the network outputs the recognition result. Because there is an error between the output result of the action classification branch and the actual result, the error between the output value of the action classification branch and the actual value is back-propagated from the output layer through the convolution layers until it reaches the input layer. During back propagation, the weight parameters in the neural network are adjusted according to the error, and the above process is iterated continuously until convergence, so the network parameters of the neural network are updated again.
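The iterate-until-convergence idea in the paragraph above can be illustrated with a toy single-parameter gradient-descent loop; real training would use a deep-learning framework, so this is only a conceptual sketch with assumed names:

```python
def train(param, grad_fn, lr=0.1, steps=200, tol=1e-6):
    """Toy stand-in for the back-propagation loop described above:
    repeat gradient updates until the loss stops improving.
    `grad_fn(p)` returns (loss, dloss/dp) for the current parameter p."""
    prev = float("inf")
    loss = prev
    for _ in range(steps):
        loss, grad = grad_fn(param)
        if abs(prev - loss) < tol:  # converged: loss no longer changes
            break
        param -= lr * grad          # one back-propagation update step
        prev = loss
    return param, loss
```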
- this embodiment recognizes fine movements of the face of the person in the vehicle according to movement characteristics, for example a driver's dangerous driving movements involving the hands and human face.
- some actions made by drivers that are similar to dangerous driving actions can easily interfere with the neural network and affect the subsequent classification and recognition of actions; this not only reduces the accuracy of the action recognition results, but also sharply degrades the user experience.
- the positive sample images and the negative sample images are used as the sample images for neural network training, supervised by a loss function, and the network parameters of the neural network (especially the weight parameters of the feature extraction branch and the candidate frame extraction branch of the neural network) are updated by gradient back propagation to complete the training, so that the feature extraction branch of the trained neural network can accurately extract the characteristics of dangerous driving actions, and the candidate frame extraction branch of the neural network automatically removes candidate frames containing actions merely similar to predetermined actions (such as dangerous driving actions), which can greatly reduce the false detection rate of dangerous driving actions.
- the candidate frame is pooled and adjusted to a predetermined size, which can greatly reduce the calculation amount of subsequent processing and speed up processing; the candidate frame is refined by the detection frame refinement branch of the neural network, so that the action target frame obtained after refinement contains only the features of predetermined actions (such as dangerous driving actions), improving the accuracy of the recognition result.
- FIG. 9 is a schematic structural diagram of a motion recognition device according to an embodiment of the present application.
- the motion recognition device 1000 includes a first extraction unit 11, a second extraction unit 12, a determination unit 13, and a classification unit 14, wherein:
- the first extraction unit 11 is configured to extract features of an image including a human face
- the second extraction unit 12 is configured to determine a plurality of candidate frames that may include a predetermined action based on the features
- the determining unit 13 is configured to determine a motion target frame based on the multiple candidate frames, where the motion target frame includes a local area of a human face and a motion interactor;
- the classification unit 14 is configured to perform classification of a predetermined motion based on the motion target frame to obtain a motion recognition result.
- the local face area includes at least one of the following: a mouth area, an ear area, and an eye area.
- the action interacting object includes at least one of the following: a container, a cigarette, a mobile phone, food, a tool, a beverage bottle, glasses, and a mask.
- the action target frame further includes: a hand region.
- the predetermined action includes at least one of the following: making a call, smoking, drinking water / beverage, eating, using tools, wearing glasses, and applying makeup.
- the motion recognition device 1000 further includes a vehicle-mounted camera for capturing an image of a person located in the vehicle, including a human face.
- the person in the vehicle includes at least one of the following: a driver in the driving zone of the car, a person in the front passenger zone of the car, and people on the rear seats of the car.
- the vehicle-mounted camera is: an RGB camera, an infrared camera, or a near-infrared camera.
- feature extraction is performed on an image to be processed, and actions in the image to be processed are identified according to the extracted features.
- the above-mentioned actions may be: an action of the hand region and/or an action of the local face region, an action of the action interaction object, etc. Therefore, the vehicle-mounted camera is used to collect images of the person in the vehicle to obtain the image to be processed, and a convolution operation is then performed on the image to be processed to extract motion features.
- the features of the predetermined action are first defined, and then a neural network is used to determine whether there is a predetermined action in the image according to the defined features and the extracted features in the image.
- a feature region including a hand region and a local face region is obtained through the feature extraction process of the neural network, a candidate region is determined based on the feature region, and the candidate region is identified by a candidate frame; the candidate frame may be represented by a rectangular frame, for example.
- a feature region including a hand region, a local face region, and a corresponding region of a motion interactor is identified through another candidate frame.
- the candidate frame may include features other than the features corresponding to the predetermined action, or may not include all features corresponding to the predetermined action (referring to all features of any one predetermined action), which will affect the final action recognition result. Therefore, in order to ensure the accuracy of the final recognition result, the position of the candidate frame needs to be adjusted, that is, the action target frame is determined based on a plurality of candidate frames. Based on this, by adjusting the position and size of each candidate frame, the adjusted candidate frame is determined as the action target frame. It can be understood that the adjusted multiple candidate frames can overlap into one candidate frame, and the overlapping candidate frames are determined as the action target frames.
- the first extraction unit 11 includes a feature extraction branch 111 of a neural network for extracting features of an image including a human face to obtain a feature map.
- the convolution operation is performed on the image to be processed through the feature extraction branch of the neural network, which uses a convolution kernel to "slide" on the image to be processed.
- the gray values of the pixels covered by the convolution kernel are multiplied by the corresponding values of the convolution kernel, and all products are summed to give the value of the pixel corresponding to the convolution kernel; the convolution kernel then "slides" to the next pixel, and so on, until the convolution of all pixels in the image to be processed is completed, obtaining a feature map.
- the feature extraction branch 111 of the neural network may include multiple layers of convolutional layers.
- the feature map obtained by the previous convolutional layer can be used as input data for the next convolutional layer, so that richer information is extracted, thereby improving the accuracy of feature extraction.
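The "sliding" convolution described above can be sketched as follows (a valid-mode cross-correlation without kernel flipping, as is conventional in CNN feature extraction; no framework is assumed):

```python
def conv2d(image, kernel):
    """Valid-mode 2D convolution by 'sliding' the kernel over the image:
    at each position, multiply elementwise and sum.
    `image` and `kernel` are lists of lists (rows of pixel values)."""
    kh, kw = len(kernel), len(kernel[0])
    oh, ow = len(image) - kh + 1, len(image[0]) - kw + 1
    out = []
    for i in range(oh):
        row = []
        for j in range(ow):
            s = sum(image[i + di][j + dj] * kernel[di][dj]
                    for di in range(kh) for dj in range(kw))
            row.append(s)
        out.append(row)
    return out
```

Stacking such layers, with each output feature map fed to the next layer, is what the multi-layer description above refers to.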
- the second extraction unit 12 includes: a candidate frame extraction branch 121 of the neural network, for extracting, on the feature map, multiple candidate frames that may include a predetermined action.
- the feature map may include at least one feature among features corresponding to a hand, a cigarette, a drinking cup, a mobile phone, glasses, a mask, and a local area of a human face, and a plurality of candidate frames are determined based on the at least one feature.
- although the features of the image to be processed can be extracted through the feature extraction branch of the neural network, the extracted features may include features other than those corresponding to the predetermined action; therefore, among the multiple candidate frames determined by the candidate frame extraction branch of the neural network, at least some candidate frames may include features other than those corresponding to the predetermined action, or may not include all the features corresponding to the predetermined action. Thus, the multiple candidate frames are frames that may include the predetermined action.
- the candidate frame extraction branch 121 of the neural network is further configured to: divide the features in the feature map according to the characteristics of the predetermined action to obtain multiple candidate regions; And, according to the plurality of candidate regions, a first confidence level of each candidate frame in the plurality of candidate frames is obtained, where the first confidence level is a probability that the candidate frame is the action target frame.
- the candidate frame extraction branch 121 of the neural network includes: a division subunit, configured to divide the features in the feature map according to the characteristics of the predetermined action, to obtain multiple candidate regions;
- a first acquisition subunit configured to obtain a first confidence level of each candidate frame in the plurality of candidate frames according to the multiple candidate regions, where the first confidence level is the probability that the candidate frame is the action target frame.
- the candidate frame extraction branch 121 of the neural network may further determine a first confidence level corresponding to each candidate frame, where the first confidence level is used to represent a possibility that the candidate frame is a target action frame in a form of probability.
- the first confidence degree is a candidate frame obtained by the candidate frame extraction branch of the neural network according to the characteristics of the candidate frame as a predicted value of the target action frame.
- the determining unit 13 includes: a detection frame refinement branch 131 of the neural network for determining an action target frame based on the plurality of candidate frames.
- the detection frame refinement branch 131 of the neural network is further configured to: remove the candidate frame with the first confidence level less than a first threshold to obtain at least one first candidate frame; And pooling the at least one first candidate frame to obtain at least one second candidate frame; and determining an action target frame according to the at least one second candidate frame.
- the refinement branch of the detection frame of the neural network includes: removing a sub-unit for removing a candidate frame with a first confidence degree less than a first threshold, to obtain at least one first candidate frame;
- a second acquisition subunit configured to pool the at least one first candidate frame to obtain at least one second candidate frame
- a determining subunit configured to determine an action target frame according to the at least one second candidate frame.
- the target object performs actions similar to making a call, drinking water, and smoking in turn: the right hand is placed next to the face, but without a mobile phone, drinking cup, or cigarette; a neural network is prone to mistakenly recognize these actions of the target object as making calls, drinking water, and smoking.
- the detection frame refinement branch 131 of the neural network removes candidate frames whose first confidence level is less than the first threshold; if the first confidence level of a candidate frame is less than the first threshold, it indicates that the candidate frame corresponds to an action merely similar to a predetermined action, and the candidate frame needs to be removed, so as to efficiently distinguish the predetermined action from similar actions, thereby reducing the false detection rate and greatly improving the accuracy of the action recognition result.
- the detection frame refinement branch 131 (or the second acquisition subunit) of the neural network is further configured to: separately pool the at least one first candidate frame to obtain at least one first feature region corresponding to the at least one first candidate frame; and adjust the position and size of the corresponding first candidate frame based on each first feature region to obtain at least one second candidate frame.
- the number of features in the area where the first candidate frame is located may be large, and directly using all of them would generate a huge amount of calculation. Therefore, before subsequent processing of the features in the area where the first candidate frame is located, the first candidate frame is pooled first, that is, the features in its area are pooled to reduce their dimension; this meets the needs of subsequent calculation while greatly reducing the amount of calculation in subsequent processing.
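The pooling of a candidate frame to a fixed, smaller size can be sketched as max-pooling over sub-windows of the box; the grid-partition scheme below is an illustrative simplification of ROI pooling, not the patent's prescribed method:

```python
def roi_max_pool(feature, box, out_h, out_w):
    """Max-pool the feature values inside a candidate box down to a
    fixed (out_h, out_w) grid, reducing the feature dimension.
    `feature` is a 2D list; `box` is (r1, c1, r2, c2) in whole
    feature-map cells (an assumed representation)."""
    r1, c1, r2, c2 = box
    h, w = r2 - r1, c2 - c1
    out = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            # Sub-window of the box assigned to output cell (i, j).
            rs, re = r1 + i * h // out_h, r1 + (i + 1) * h // out_h
            cs, ce = c1 + j * w // out_w, c1 + (j + 1) * w // out_w
            row.append(max(feature[r][c]
                           for r in range(rs, max(re, rs + 1))
                           for c in range(cs, max(ce, cs + 1))))
        out.append(row)
    return out
```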
- the detection frame refinement branch 131 (or the second acquisition subunit) of the neural network is further configured to: obtain a first action feature frame according to the feature corresponding to the predetermined action in the first feature region; obtain a first position offset of the at least one first candidate frame according to the geometric center coordinates of the first action feature frame; obtain a first zoom factor of the at least one first candidate frame according to the size of the first action feature frame; and respectively adjust the position and size of the at least one first candidate frame according to the at least one first position offset and the at least one first zoom factor, to obtain at least one second candidate frame.
- the classification unit 14 includes: an action classification branch 141 of the neural network, configured to obtain an area map corresponding to the action target frame on the feature map, and Classify a predetermined action based on the area map to obtain a motion recognition result.
- on one hand, the first action recognition result is obtained through the action classification branch 141 of the neural network; on the other hand, a second confidence level of the first action recognition result may also be obtained through the action classification branch 141 of the neural network, where the second confidence level represents the accuracy rate of the action recognition result.
- the neural network is obtained by pre-supervising training based on a training image set, where the training image set includes a plurality of sample images, and the label information of the sample images includes: an action The action category corresponding to the supervision frame and the action supervision frame.
- the training image set includes positive sample images and negative sample images
- the actions of the negative sample images are similar to the actions of the positive sample images
- the action supervision frame of the positive sample image includes a local area of a human face and an action interactor, or a local area of a human face, a hand region, and an action interactor.
- the action of the positive sample image includes making a call
- the negative sample image includes: scratching the ear; and/or, the positive sample image includes smoking, eating, or drinking water
- the negative sample image includes a motion of opening a mouth or placing a hand on the lips.
- feature extraction branch 111 of the neural network is used for feature extraction
- candidate frame extraction branch 121 of the neural network is used to obtain candidate frames that may include predetermined actions according to the extracted features.
- the detection frame refinement branch 131 of the neural network determines the action target frame.
- the neural network's action classification branch 141 classifies the features in the target action frame into predetermined actions to obtain the action recognition result of the image to be processed.
- the entire recognition process extracts features in the image to be processed (for example, features of the hand region, the local face region, and the corresponding region of the action interaction object) and processes them, so that precise recognition of fine movements can be realized autonomously and quickly.
- the motion recognition device further includes a training component of the neural network.
- FIG. 10 is a schematic structural diagram of a training component of a neural network according to an embodiment of the present application.
- the training component 2000 includes: a first extraction unit 21, a second extraction unit 22, a first determining unit 23, an obtaining unit 24, a second determining unit 25, and an adjusting unit 26, wherein:
- the first extraction unit 21 is configured to extract a first feature map of a sample image
- the second extraction unit 22 is configured to extract a plurality of third candidate frames in which the first feature map may include a predetermined action
- the first determining unit 23 is configured to determine an action target frame based on the multiple third candidate frames
- the obtaining unit 24 is configured to classify a predetermined action based on the action target frame to obtain a first action recognition result
- the second determining unit 25 is configured to determine a first loss between the detection result of the candidate frame of the sample image and the detection frame labeling information, and a second loss between the action recognition result and the action category labeling information;
- the adjusting unit 26 is configured to adjust network parameters of the neural network according to the first loss and the second loss.
- the first determining unit 23 includes: a first obtaining subunit 231, configured to obtain a first action supervision frame according to the predetermined action, wherein the first action supervision frame includes: a local area of a human face and an action interactor, or a local area of a human face, a hand region, and an action interactor;
- the second acquisition subunit 232 is configured to acquire second confidence levels of the plurality of third candidate frames, where the second confidence level includes a first probability that the third candidate frame is the action target frame and a second probability that the third candidate frame is not the action target frame;
- the determining subunit 233 is configured to determine an area overlap degree between the plurality of third candidate frames and the first action supervision frame
- the selecting subunit 234 is configured to: if the area coincidence degree is greater than or equal to a second threshold, take the second confidence degree of the third candidate frame corresponding to the area coincidence degree as the first Probability; if the area coincidence degree is less than the second threshold, taking the second confidence degree of the third candidate frame corresponding to the area coincidence degree as the second probability;
- the removing sub-unit 235 is configured to remove the plurality of third candidate frames with the second confidence level less than the first threshold, to obtain a plurality of fourth candidate frames;
- the adjustment subunit 236 is configured to adjust the position and size of the fourth candidate frame to obtain the action target frame.
- this embodiment recognizes fine movements of the face of the person in the vehicle according to movement characteristics, for example a driver's dangerous driving movements involving the hands and human face.
- some actions made by drivers that are similar to dangerous driving actions can easily interfere with the neural network and affect the subsequent classification and recognition of actions; this not only reduces the accuracy of the action recognition results, but also sharply degrades the user experience.
- the positive sample images and the negative sample images are used as the sample images for neural network training, supervised by a loss function, and the network parameters of the neural network (especially the weight parameters of the feature extraction branch and the candidate frame extraction branch of the neural network) are updated by gradient back propagation to complete the training, so that the feature extraction branch of the trained neural network can accurately extract the characteristics of dangerous driving actions, and the candidate frame extraction branch of the neural network automatically removes candidate frames containing actions merely similar to predetermined actions (such as dangerous driving actions), which can greatly reduce the false detection rate of dangerous driving actions.
- the candidate frame is pooled and adjusted to a predetermined size, which can greatly reduce the calculation amount of subsequent processing and speed up processing; the candidate frame is refined by the detection frame refinement branch of the neural network, so that the action target frame obtained after refinement contains only the features of predetermined actions (such as dangerous driving actions).
- FIG. 11 is a schematic structural diagram of a driving motion analysis device according to an embodiment of the present application.
- the analysis device 3000 includes a vehicle-mounted camera 31, a first acquisition unit 32, and a generation unit 33, wherein:
- the vehicle-mounted camera 31 is configured to collect a video stream including a driver's face image
- the first acquiring unit 32 is configured to acquire a motion recognition result of at least one frame of an image in the video stream through the motion recognition device according to the foregoing embodiment of the present application;
- the generating unit 33 is configured to generate distraction or dangerous driving prompt information in response to a result of a motion recognition meeting a predetermined condition.
- the predetermined condition includes at least one of the following: the occurrence of a specific predetermined action; the number of times the specific predetermined action occurs within a predetermined time period; and the duration of occurrence and maintenance of the specific predetermined action in the video stream .
- the analysis device 3000 further includes: a second obtaining unit 34, configured to obtain the vehicle speed of the vehicle provided with the vehicle-mounted camera; and the generating unit 33 is further configured to: in response to the vehicle speed being greater than a set threshold and the action recognition result satisfying the predetermined condition, generate distraction or dangerous driving prompt information.
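The combined speed-and-action condition can be sketched as a small decision function; the thresholds, field names, and the choice to require repeated or sustained occurrence are illustrative assumptions, not values taken from the patent:

```python
def should_prompt(speed, recent_actions, action="calling",
                  min_count=3, min_duration=2.0, speed_threshold=20.0):
    """Decide whether to generate a distraction / dangerous-driving prompt.
    `recent_actions` is a list of (action_name, duration_seconds) results
    for recent frames within the predetermined time window."""
    if speed <= speed_threshold:
        return False  # prompt only when the vehicle speed exceeds the threshold
    hits = [d for a, d in recent_actions if a == action]
    occurred = len(hits) > 0
    frequent = len(hits) >= min_count          # repeated within the window
    sustained = any(d >= min_duration for d in hits)  # maintained long enough
    # Here we require the action to have occurred AND be either repeated
    # or sustained (one illustrative combination of the conditions above).
    return occurred and (frequent or sustained)
```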
- a video is taken of a driver through a vehicle-mounted camera, and each frame of the captured video is used as an image to be processed.
- Each frame of the picture taken by the camera is recognized to obtain the corresponding recognition result, and the actions of the driver are recognized in combination with the results of multiple consecutive frames.
- the driver may be warned through the display terminal, and the type of the dangerous driving action may be prompted.
- the ways of raising a warning include: popping up a dialog box to warn by text, and warning by built-in voice data.
- FIG. 12 is a schematic diagram of a hardware structure of an electronic device according to an embodiment of the present application.
- the electronic device 4000 includes a memory 44 and a processor 41.
- the memory 44 stores computer-executable instructions.
- when the processor 41 runs the computer-executable instructions on the memory 44, the action recognition method or the driving action analysis method according to the embodiments of the present application is implemented.
- the electronic device may further include an input device 42 and an output device 43.
- the input device 42, the output device 43, the memory 44 and the processor 41 can be connected to each other through a bus.
- the memory may include RAM (random access memory), ROM (read-only memory), EPROM (erasable programmable read-only memory), or CD-ROM (compact disc read-only memory).
- the input device is used to input data and/or signals;
- the output device is used to output data and/or signals.
- the output device and the input device may be independent devices or an integrated device.
- the processor may include one or more processors, for example, one or more central processing units (CPUs); when the processor is one CPU, the CPU may be a single-core CPU or a multi-core CPU.
- the processor may also include one or more special-purpose processors, such as GPUs or FPGAs, for accelerated processing.
- the memory is used to store program code and data of a network device.
- the processor is configured to call the program code and data in the memory to execute the steps in the foregoing method embodiments; for details, refer to the descriptions in the method embodiments, which are not repeated here.
- FIG. 12 only shows a simplified design of the electronic device.
- the electronic device may also include other necessary components, including but not limited to any number of input/output devices, processors, controllers, and memories; all electronic devices that can implement the embodiments of the present application fall within the protection scope of the embodiments of the present application.
- An embodiment of the present application further provides a computer storage medium for storing computer-readable instructions that, when executed, implement the operations of the action recognition method of any of the foregoing embodiments of the present application, or the operations of the driving action analysis method of any of the foregoing embodiments of the present application.
- An embodiment of the present application further provides a computer program including computer-readable instructions. When a processor in a device executes the instructions, the processor implements the steps of the action recognition method of any of the foregoing embodiments of the present application, or the steps of the driving action analysis method of any of the foregoing embodiments of the present application.
- the disclosed device and method may be implemented in other ways.
- the device embodiments described above are only illustrative. The division of the units is only a logical functional division; in actual implementation there may be another division manner, for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not implemented.
- the mutual coupling, direct coupling, or communication connection between the components shown or discussed may be implemented through some interfaces; the indirect coupling or communication connection between devices or units may be electrical, mechanical, or in other forms.
- the units described above as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units; they may be located in one place or distributed across multiple network units. Some or all of the units may be selected according to actual needs to achieve the objective of the solution of this embodiment.
- each functional unit in each embodiment of the present invention may be integrated into one processing unit, or each unit may separately serve as one unit, or two or more units may be integrated into one unit; the above integrated unit can be implemented in the form of hardware, or in the form of hardware plus a software functional unit.
- the foregoing program may be stored in a computer-readable storage medium.
- when the program is executed, the steps of the foregoing method embodiments are performed.
- the foregoing storage medium includes various media that can store program code, such as a removable storage device, a ROM, a RAM, a magnetic disk, or an optical disc.
- when the above-mentioned integrated unit of the present invention is implemented in the form of a software functional module and sold or used as an independent product, it may also be stored in a computer-readable storage medium.
- the computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute all or part of the methods described in the embodiments of the present invention.
- the foregoing storage medium includes various media that can store program code, such as a removable storage device, a ROM, a RAM, a magnetic disk, or an optical disc.
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Health & Medical Sciences (AREA)
- Multimedia (AREA)
- General Health & Medical Sciences (AREA)
- Evolutionary Computation (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Artificial Intelligence (AREA)
- Data Mining & Analysis (AREA)
- Software Systems (AREA)
- Life Sciences & Earth Sciences (AREA)
- General Engineering & Computer Science (AREA)
- Computing Systems (AREA)
- Human Computer Interaction (AREA)
- Computational Linguistics (AREA)
- Biomedical Technology (AREA)
- Molecular Biology (AREA)
- Mathematical Physics (AREA)
- Medical Informatics (AREA)
- Databases & Information Systems (AREA)
- Oral & Maxillofacial Surgery (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Evolutionary Biology (AREA)
- Biophysics (AREA)
- Social Psychology (AREA)
- Psychiatry (AREA)
- Mechanical Engineering (AREA)
- Automation & Control Theory (AREA)
- Transportation (AREA)
- Biodiversity & Conservation Biology (AREA)
- Signal Processing (AREA)
- Image Analysis (AREA)
- Traffic Control Systems (AREA)
Abstract
Description
Claims (51)
- An action recognition method, comprising: extracting features from an image including a human face; determining, based on the features, a plurality of candidate frames that may include a predetermined action; determining an action target frame based on the plurality of candidate frames, wherein the action target frame includes a local region of the human face and an action interaction object; and classifying the predetermined action based on the action target frame to obtain an action recognition result.
- The method according to claim 1, wherein the local region of the human face includes at least one of the following: a mouth region, an ear region, or an eye region.
- The method according to claim 1 or 2, wherein the action interaction object includes at least one of the following: a container, a cigarette, a mobile phone, food, a tool, a beverage bottle, glasses, or a mask.
- The method according to any one of claims 1 to 3, wherein the action target frame further includes a hand region.
- The method according to any one of claims 1 to 4, wherein the predetermined action includes at least one of the following: making a phone call, smoking, drinking water/beverages, eating, using a tool, wearing glasses, or applying makeup.
- The method according to any one of claims 1 to 5, wherein the method further includes: capturing, via a vehicle-mounted camera, an image including the face of a person in a vehicle.
- The method according to claim 6, wherein the person in the vehicle includes at least one of the following: a driver in the driving area of the vehicle, a person in the front passenger area of the vehicle, or a person on a rear seat of the vehicle.
- The method according to claim 6 or 7, wherein the vehicle-mounted camera is an RGB camera, an infrared camera, or a near-infrared camera.
- The method according to any one of claims 1 to 8, wherein the extracting features from an image including a human face includes: extracting features from the image including the human face via a feature extraction branch of a neural network to obtain a feature map.
- The method according to claim 9, wherein the determining, based on the features, a plurality of candidate frames that may include a predetermined action includes: determining, via a candidate frame extraction branch of the neural network, a plurality of candidate frames that may include the predetermined action on the feature map.
- The method according to claim 10, wherein the determining, via the candidate frame extraction branch of the neural network, a plurality of candidate frames that may include the predetermined action on the feature map includes: dividing the features in the feature map according to features of the predetermined action to obtain a plurality of candidate regions; and obtaining, according to the plurality of candidate regions, a plurality of candidate frames and a first confidence of each of the plurality of candidate frames, wherein the first confidence is the probability that the candidate frame is the action target frame.
- The method according to any one of claims 9 to 11, wherein the determining an action target frame based on the plurality of candidate frames includes: determining the action target frame based on the plurality of candidate frames via a detection frame refinement branch of the neural network.
- The method according to claim 12, wherein the determining the action target frame based on the plurality of candidate frames via the detection frame refinement branch of the neural network includes: removing, via the detection frame refinement branch of the neural network, candidate frames whose first confidence is less than a first threshold to obtain at least one first candidate frame; pooling the at least one first candidate frame to obtain at least one second candidate frame; and determining the action target frame according to the at least one second candidate frame.
- The method according to claim 13, wherein the pooling the at least one first candidate frame to obtain at least one second candidate frame includes: separately pooling the at least one first candidate frame to obtain at least one first feature region corresponding to the at least one first candidate frame; and adjusting the position and size of the corresponding first candidate frame based on each first feature region to obtain at least one second candidate frame.
- The method according to claim 14, wherein the adjusting the position and size of the corresponding first candidate frame based on each first feature region to obtain at least one second candidate frame includes: obtaining, based on the features in the first feature region corresponding to the predetermined action, a first action feature frame corresponding to the features of the predetermined action; obtaining a first position offset of the at least one first candidate frame according to the geometric center coordinates of the first action feature frame; obtaining a first scaling factor of the at least one first candidate frame according to the size of the first action feature frame; and adjusting the position and size of the at least one first candidate frame according to the at least one first position offset and the at least one first scaling factor, respectively, to obtain at least one second candidate frame.
- The method according to any one of claims 1 to 15, wherein the classifying the predetermined action based on the action target frame to obtain an action recognition result includes: obtaining, via an action classification branch of the neural network, a region map corresponding to the action target frame on the feature map, and classifying the predetermined action based on the region map to obtain the action recognition result.
- The method according to any one of claims 9 to 16, wherein the neural network is obtained in advance through supervised training based on a training image set, the training image set includes a plurality of sample images, and the labeling information of the sample images includes: an action supervision frame and the action category corresponding to the action supervision frame.
- The method according to claim 17, wherein the training image set includes positive sample images and negative sample images, the actions of the negative sample images are similar to the actions of the positive sample images, and the action supervision frame of a positive sample includes: a local region of a human face and an action interaction object, or a local region of a human face, a hand region, and an action interaction object.
- The method according to claim 17 or 18, wherein the action of the positive sample images includes making a phone call and the negative sample images include scratching an ear; and/or the positive sample images include smoking, eating, or drinking water, and the negative sample images include the action of opening the mouth or resting a hand on the lips.
- The method according to any one of claims 17 to 19, wherein the training method of the neural network includes: extracting a first feature map of a sample image; extracting, from the first feature map, a plurality of third candidate frames that may include the predetermined action; determining an action target frame based on the plurality of third candidate frames; classifying the predetermined action based on the action target frame to obtain an action recognition result; determining a first loss between the detection result of the candidate frames of the sample image and the detection frame labeling information, and a second loss between the action recognition result and the action category labeling information; and adjusting network parameters of the neural network according to the first loss and the second loss.
- The method according to claim 20, wherein the determining an action target frame based on the plurality of third candidate frames includes: obtaining a first action supervision frame according to the predetermined action, wherein the first action supervision frame includes: a local region of a human face and an action interaction object, or a local region of a human face, a hand region, and an action interaction object; obtaining a second confidence of the plurality of third candidate frames, wherein the second confidence includes: a first probability that the third candidate frame is the action target frame, and a second probability that the third candidate frame is not the action target frame; determining the area overlap ratio between the plurality of third candidate frames and the first action supervision frame; if the area overlap ratio is greater than or equal to a second threshold, taking the second confidence of the third candidate frame corresponding to the area overlap ratio as the first probability; if the area overlap ratio is less than the second threshold, taking the second confidence of the third candidate frame corresponding to the area overlap ratio as the second probability; removing those of the plurality of third candidate frames whose second confidence is less than the first threshold to obtain a plurality of fourth candidate frames; and adjusting the positions and sizes of the fourth candidate frames to obtain the action target frame.
- A driving action analysis method, comprising: collecting, via a vehicle-mounted camera, a video stream including face images of a driver; obtaining, through the action recognition method according to any one of claims 1 to 21, an action recognition result of at least one frame of an image in the video stream; and generating dangerous driving prompt information in response to the action recognition result satisfying a predetermined condition.
- The method according to claim 22, wherein the predetermined condition includes at least one of the following: occurrence of a specific predetermined action; the number of times the specific predetermined action occurs within a predetermined duration; or the duration for which the specific predetermined action persists in the video stream.
- The method according to claim 22 or 23, wherein the method further includes: acquiring the speed of a vehicle provided with vehicle-mounted dual cameras; and the generating dangerous driving prompt information in response to the action recognition result satisfying a predetermined condition includes: generating the dangerous driving prompt information in response to the vehicle speed being greater than a set threshold and the action recognition result satisfying the predetermined condition.
- An action recognition apparatus, comprising: a first extraction unit configured to extract features of an image including a human face; a second extraction unit configured to determine, based on the features, a plurality of candidate frames that may include a predetermined action; a determination unit configured to determine an action target frame based on the plurality of candidate frames, wherein the action target frame includes a local region of the human face and an action interaction object; and a classification unit configured to classify the predetermined action based on the action target frame to obtain an action recognition result.
- The apparatus according to claim 25, wherein the local region of the human face includes at least one of the following: a mouth region, an ear region, or an eye region.
- The apparatus according to claim 25 or 26, wherein the action interaction object includes at least one of the following: a container, a cigarette, a mobile phone, food, a tool, a beverage bottle, glasses, or a mask.
- The apparatus according to any one of claims 25 to 27, wherein the action target frame further includes a hand region.
- The apparatus according to any one of claims 25 to 28, wherein the predetermined action includes at least one of the following: making a phone call, smoking, drinking water/beverages, eating, using a tool, wearing glasses, or applying makeup.
- The apparatus according to any one of claims 25 to 29, further comprising: a vehicle-mounted camera configured to capture an image including the face of a person in a vehicle.
- The apparatus according to claim 30, wherein the person in the vehicle includes at least one of the following: a driver in the driving area of the vehicle, a person in the front passenger area of the vehicle, or a person on a rear seat of the vehicle.
- The apparatus according to claim 30 or 31, wherein the vehicle-mounted camera is an RGB camera, an infrared camera, or a near-infrared camera.
- The apparatus according to any one of claims 25 to 32, wherein the first extraction unit includes: a feature extraction branch of a neural network configured to extract features of the image including the human face to obtain a feature map.
- The apparatus according to claim 33, wherein the second extraction unit includes: a candidate frame extraction branch of the neural network configured to extract, on the feature map, a plurality of candidate frames that may include the predetermined action.
- The apparatus according to claim 34, wherein the candidate frame extraction branch includes: a division subunit configured to divide the features in the feature map according to the features of the predetermined action to obtain a plurality of candidate regions; and a first acquisition subunit configured to obtain, according to the plurality of candidate regions, the plurality of candidate frames and a first confidence of each of the plurality of candidate frames, wherein the first confidence is the probability that the candidate frame is the action target frame.
- The apparatus according to any one of claims 33 to 35, wherein the determination unit includes: a detection frame refinement branch of the neural network configured to determine the action target frame based on the plurality of candidate frames.
- The apparatus according to claim 36, wherein the detection frame refinement branch includes: a removal subunit configured to remove candidate frames whose first confidence is less than a first threshold to obtain at least one first candidate frame; a second acquisition subunit configured to pool the at least one first candidate frame to obtain at least one second candidate frame; and a determination subunit configured to determine the action target frame according to the at least one second candidate frame.
- The apparatus according to claim 37, wherein the second acquisition subunit is further configured to: separately pool the at least one first candidate frame to obtain at least one first feature region corresponding to the at least one first candidate frame; and adjust the position and size of the corresponding first candidate frame based on each first feature region to obtain at least one second candidate frame.
- The apparatus according to claim 38, wherein the second acquisition subunit is further configured to: obtain, based on the features in the first feature region corresponding to the predetermined action, a first action feature frame corresponding to the features of the predetermined action; obtain a first position offset of the at least one first candidate frame according to the geometric center coordinates of the first action feature frame; obtain a first scaling factor of the at least one first candidate frame according to the size of the first action feature frame; and adjust the position and size of the at least one first candidate frame according to the at least one first position offset and the at least one first scaling factor, respectively, to obtain at least one second candidate frame.
- The apparatus according to any one of claims 25 to 39, wherein the classification unit includes: an action classification branch of the neural network configured to obtain a region map corresponding to the action target frame on the feature map, and to classify the predetermined action based on the region map to obtain the action recognition result.
- The apparatus according to any one of claims 35 to 40, wherein the neural network is obtained in advance through supervised training based on a training image set, the training image set includes a plurality of sample images, and the labeling information of the sample images includes: an action supervision frame and the action category corresponding to the action supervision frame.
- The apparatus according to claim 41, wherein the training image set includes positive sample images and negative sample images, the actions of the negative sample images are similar to the actions of the positive sample images, and the action supervision frame of a positive sample includes: a local region of a human face and an action interaction object, or a local region of a human face, a hand region, and an action interaction object.
- The apparatus according to claim 41 or 42, wherein the action of the positive sample images includes making a phone call and the negative sample images include scratching an ear; and/or the positive sample images include smoking, eating, or drinking water, and the negative sample images include the action of opening the mouth or resting a hand on the lips.
- The apparatus according to any one of claims 41 to 43, wherein the action recognition apparatus further includes a training component of the neural network, and the training component of the neural network includes: a first extraction unit configured to extract a first feature map of a sample image; a second extraction unit configured to extract, from the first feature map, a plurality of third candidate frames that may include the predetermined action; a first determination unit configured to determine an action target frame based on the plurality of third candidate frames; a third acquisition unit configured to classify the predetermined action based on the action target frame to obtain an action recognition result; a second determination unit configured to determine a first loss between the detection result of the candidate frames of the sample image and the detection frame labeling information, and a second loss between the action recognition result and the action category labeling information; and an adjustment unit configured to adjust network parameters of the neural network according to the first loss and the second loss.
- The apparatus according to claim 44, wherein the first determination unit includes: a first acquisition subunit configured to obtain a first action supervision frame according to the predetermined action, wherein the first action supervision frame includes: a local region of a human face and an action interaction object, or a local region of a human face, a hand region, and an action interaction object; a second acquisition subunit configured to obtain a second confidence of the plurality of third candidate frames, wherein the second confidence includes: a first probability that the third candidate frame is the action target frame, and a second probability that the third candidate frame is not the action target frame; a determination subunit configured to determine the area overlap ratio between the plurality of third candidate frames and the first action supervision frame; a selection subunit configured to take, if the area overlap ratio is greater than or equal to a second threshold, the second confidence of the third candidate frame corresponding to the area overlap ratio as the first probability, and to take, if the area overlap ratio is less than the second threshold, the second confidence of the third candidate frame corresponding to the area overlap ratio as the second probability; a removal subunit configured to remove those of the plurality of third candidate frames whose second confidence is less than the first threshold to obtain a plurality of fourth candidate frames; and an adjustment subunit configured to adjust the positions and sizes of the fourth candidate frames to obtain the action target frame.
- A driving action analysis apparatus, comprising: a vehicle-mounted camera configured to collect a video stream including face images of a driver; a first acquisition unit configured to obtain, through the action recognition apparatus according to any one of claims 25 to 45, an action recognition result of at least one frame of an image in the video stream; and a generation unit configured to generate dangerous driving prompt information in response to the action recognition result satisfying a predetermined condition.
- The apparatus according to claim 46, wherein the predetermined condition includes at least one of the following: occurrence of a specific predetermined action; the number of times the specific predetermined action occurs within a predetermined duration; or the duration for which the specific predetermined action persists in the video stream.
- The apparatus according to claim 46 or 47, wherein the apparatus further includes: a second acquisition unit configured to acquire the speed of a vehicle provided with vehicle-mounted dual cameras; and the generation unit is further configured to generate the dangerous driving prompt information in response to the vehicle speed being greater than a set threshold and the action recognition result satisfying the predetermined condition.
- An electronic device, comprising a memory and a processor, wherein the memory stores computer-executable instructions, and when the processor runs the computer-executable instructions on the memory, the method according to any one of claims 1 to 21 or the method according to any one of claims 22 to 24 is implemented.
- A computer-readable storage medium having a computer program stored thereon, wherein when the computer program is executed by a processor, the method according to any one of claims 1 to 21 or the method according to any one of claims 22 to 24 is implemented.
- A computer program, comprising computer instructions, wherein when the computer instructions run in a processor of a device, the method according to any one of claims 1 to 21 or the method according to any one of claims 22 to 24 is implemented.
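The area-overlap test that claims 20-21 and 44-45 use to assign training labels to candidate frames resembles an intersection-over-union (IoU) check. A minimal sketch follows; the IoU formulation and the 0.5 value of the second threshold are illustrative assumptions, since the claims only speak of an "area overlap ratio":

```python
# Sketch: a candidate frame overlapping the action supervision frame by at
# least the second threshold is labeled positive (first probability);
# otherwise it is labeled negative (second probability).

def iou(a, b):
    """Intersection-over-union of axis-aligned boxes (x0, y0, x1, y1)."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix1 - ix0) * max(0, iy1 - iy0)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def assign_label(candidate, supervision, second_threshold=0.5):
    """Positive if the overlap with the action supervision frame reaches the threshold."""
    return "positive" if iou(candidate, supervision) >= second_threshold else "negative"

print(assign_label((0, 0, 10, 10), (0, 0, 10, 10)))  # positive: full overlap
print(assign_label((0, 0, 2, 2), (5, 5, 10, 10)))    # negative: no overlap
```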
Priority Applications (4)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
JP2020551540A JP7061685B2 (ja) | 2018-09-27 | 2019-09-26 | Action recognition and driving action analysis method and apparatus, and electronic device |
SG11202009320PA SG11202009320PA (en) | 2018-09-27 | 2019-09-26 | Maneuver recognition and driving maneuver analysis method and apparatus, and electronic device |
KR1020207027826A KR102470680B1 (ko) | 2018-09-27 | 2019-09-26 | Action recognition and driving action analysis method and apparatus, and electronic device |
US17/026,933 US20210012127A1 (en) | 2018-09-27 | 2020-09-21 | Action recognition method and apparatus, driving action analysis method and apparatus, and storage medium |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811130798.6A CN110956060A (zh) | 2018-09-27 | 2018-09-27 | Action recognition and driving action analysis method and apparatus, and electronic device |
CN201811130798.6 | 2018-09-27 |
Related Child Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US17/026,933 Continuation US20210012127A1 (en) | 2018-09-27 | 2020-09-21 | Action recognition method and apparatus, driving action analysis method and apparatus, and storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2020063753A1 true WO2020063753A1 (zh) | 2020-04-02 |
Family
ID=69951010
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2019/108167 WO2020063753A1 (zh) | 2018-09-27 | 2019-09-26 | 动作识别、驾驶动作分析方法和装置、电子设备 |
Country Status (6)
Country | Link |
---|---|
US (1) | US20210012127A1 (zh) |
JP (1) | JP7061685B2 (zh) |
KR (1) | KR102470680B1 (zh) |
CN (1) | CN110956060A (zh) |
SG (1) | SG11202009320PA (zh) |
WO (1) | WO2020063753A1 (zh) |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
- CN112257604A (zh) * | 2020-10-23 | 2021-01-22 | 北京百度网讯科技有限公司 | Image detection method and apparatus, electronic device, and storage medium |
- CN112270210A (zh) * | 2020-10-09 | 2021-01-26 | 珠海格力电器股份有限公司 | Data processing and operation instruction recognition method, apparatus, device, and medium |
- CN113205067A (zh) * | 2021-05-26 | 2021-08-03 | 北京京东乾石科技有限公司 | Operator monitoring method and apparatus, electronic device, and storage medium |
- WO2022027895A1 (zh) * | 2020-08-07 | 2022-02-10 | 上海商汤临港智能科技有限公司 | Abnormal sitting posture recognition method and apparatus, electronic device, storage medium, and program |
- JP2023509572A (ja) * | 2020-04-29 | 2023-03-09 | ベイジン バイドゥ ネットコム サイエンス テクノロジー カンパニー リミテッド | Method, apparatus, electronic device, storage medium, and computer program for detecting a vehicle |
Families Citing this family (14)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
- CN110245662B (zh) * | 2019-06-18 | 2021-08-10 | 腾讯科技(深圳)有限公司 | Detection model training method and apparatus, computer device, and storage medium |
- US11222242B2 (en) * | 2019-08-23 | 2022-01-11 | International Business Machines Corporation | Contrastive explanations for images with monotonic attribute functions |
- US10803334B1 (en) * | 2019-10-18 | 2020-10-13 | Alpine Electronics of Silicon Valley, Inc. | Detection of unsafe cabin conditions in autonomous vehicles |
- KR102374211B1 (ko) * | 2019-10-28 | 2022-03-15 | 주식회사 에스오에스랩 | Object recognition method and object recognition apparatus performing the same |
- US11043003B2 (en) | 2019-11-18 | 2021-06-22 | Waymo Llc | Interacted object detection neural network |
- CN112947740A (zh) * | 2019-11-22 | 2021-06-11 | 深圳市超捷通讯有限公司 | Human-computer interaction method based on action analysis, and vehicle-mounted apparatus |
- CN112339764A (zh) * | 2020-11-04 | 2021-02-09 | 杨华勇 | Big-data-based driving posture analysis system for new energy vehicles |
- CN113011279A (zh) * | 2021-02-26 | 2021-06-22 | 清华大学 | Method and apparatus for recognizing mucosa-contact actions, computer device, and storage medium |
- CN117203678A (zh) * | 2021-04-15 | 2023-12-08 | 华为技术有限公司 | Object detection method and apparatus |
- CN113205075A (zh) * | 2021-05-31 | 2021-08-03 | 浙江大华技术股份有限公司 | Method and apparatus for detecting smoking behavior, and readable storage medium |
- CN113362314B (zh) * | 2021-06-18 | 2022-10-18 | 北京百度网讯科技有限公司 | Medical image recognition method, recognition model training method, and apparatus |
- CN114670856B (zh) * | 2022-03-30 | 2022-11-25 | 湖南大学无锡智能控制研究院 | Parameter self-tuning longitudinal control method and system based on a BP neural network |
- CN116901975B (zh) * | 2023-09-12 | 2023-11-21 | 深圳市九洲卓能电气有限公司 | Vehicle-mounted AI security monitoring system and method |
- CN117953589B (zh) * | 2024-03-27 | 2024-07-05 | 武汉工程大学 | Interactive action detection method, system, device, and medium |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20130236057A1 (en) * | 2007-10-10 | 2013-09-12 | Samsung Electronics Co., Ltd. | Detecting apparatus of human component and method thereof |
- CN104573659A (zh) * | 2015-01-09 | 2015-04-29 | 安徽清新互联信息科技有限公司 | SVM-based method for monitoring a driver's phone-call behavior |
- CN106780612A (zh) * | 2016-12-29 | 2017-05-31 | 浙江大华技术股份有限公司 | Object detection method and apparatus in an image |
- CN106815574A (zh) * | 2017-01-20 | 2017-06-09 | 博康智能信息技术有限公司北京海淀分公司 | Method and apparatus for establishing a detection model and detecting mobile phone use behavior |
- CN107563446A (zh) * | 2017-09-05 | 2018-01-09 | 华中科技大学 | Object detection method for a micro-operation system |
Family Cites Families (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
- TWI474264B (zh) * | 2013-06-14 | 2015-02-21 | Utechzone Co Ltd | Driving warning method and vehicle electronic device |
- KR101386823B1 (ko) * | 2013-10-29 | 2014-04-17 | 김재철 | Two-stage drowsy-driving prevention device using recognition of motion, face, eyes, and mouth shape |
- CN105260705B (zh) * | 2015-09-15 | 2019-07-05 | 西安邦威电子科技有限公司 | Method for detecting a driver's phone-call behavior under multiple postures |
- CN105260703B (zh) * | 2015-09-15 | 2019-07-05 | 西安邦威电子科技有限公司 | Method for detecting a driver's smoking behavior under multiple postures |
- JP6443393B2 (ja) * | 2016-06-01 | 2018-12-26 | トヨタ自動車株式会社 | Action recognition device, learning device, method, and program |
- CN106096607A (zh) * | 2016-06-12 | 2016-11-09 | 湘潭大学 | License plate recognition method |
- CN106504233B (zh) * | 2016-10-18 | 2019-04-09 | 国网山东省电力公司电力科学研究院 | Method and system for recognizing small electric power components in UAV inspection images based on Faster R-CNN |
- CN106941602B (zh) * | 2017-03-07 | 2020-10-13 | 中国铁路总公司 | Locomotive driver behavior recognition method and apparatus |
- CN107316001A (zh) * | 2017-05-31 | 2017-11-03 | 天津大学 | Method for detecting small and dense traffic signs in autonomous driving scenarios |
- CN107316058A (zh) * | 2017-06-15 | 2017-11-03 | 国家新闻出版广电总局广播科学研究院 | Method for improving object detection performance by increasing object classification and localization accuracy |
-
2018
- 2018-09-27 CN CN201811130798.6A patent/CN110956060A/zh active Pending
-
2019
- 2019-09-26 WO PCT/CN2019/108167 patent/WO2020063753A1/zh active Application Filing
- 2019-09-26 SG SG11202009320PA patent/SG11202009320PA/en unknown
- 2019-09-26 JP JP2020551540A patent/JP7061685B2/ja active Active
- 2019-09-26 KR KR1020207027826A patent/KR102470680B1/ko active IP Right Grant
-
2020
- 2020-09-21 US US17/026,933 patent/US20210012127A1/en not_active Abandoned
Cited By (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2023509572A (ja) | 2020-04-29 | 2023-03-09 | ベイジン バイドゥ ネットコム サイエンス テクノロジー カンパニー リミテッド | Method, apparatus, electronic device, storage medium, and computer program for detecting a vehicle |
JP7357789B2 (ja) | 2020-04-29 | 2023-10-06 | ベイジン バイドゥ ネットコム サイエンス テクノロジー カンパニー リミテッド | Method, apparatus, electronic device, storage medium, and computer program for detecting a vehicle |
WO2022027895A1 (zh) * | 2020-08-07 | 2022-02-10 | 上海商汤临港智能科技有限公司 | Abnormal sitting posture recognition method and apparatus, electronic device, storage medium, and program |
CN112270210A (zh) * | 2020-10-09 | 2021-01-26 | 珠海格力电器股份有限公司 | Data processing and operation instruction recognition method, apparatus, device, and medium |
CN112270210B (zh) | 2020-10-09 | 2024-03-01 | 珠海格力电器股份有限公司 | Data processing and operation instruction recognition method, apparatus, device, and medium |
CN112257604A (zh) * | 2020-10-23 | 2021-01-22 | 北京百度网讯科技有限公司 | Image detection method and apparatus, electronic device, and storage medium |
CN113205067A (zh) * | 2021-05-26 | 2021-08-03 | 北京京东乾石科技有限公司 | Operator monitoring method and apparatus, electronic device, and storage medium |
CN113205067B (zh) | 2021-05-26 | 2024-04-09 | 北京京东乾石科技有限公司 | Operator monitoring method and apparatus, electronic device, and storage medium |
Also Published As
Publication number | Publication date |
---|---|
JP2021517312A (ja) | 2021-07-15 |
SG11202009320PA (en) | 2020-10-29 |
KR102470680B1 (ko) | 2022-11-25 |
JP7061685B2 (ja) | 2022-04-28 |
KR20200124280A (ko) | 2020-11-02 |
US20210012127A1 (en) | 2021-01-14 |
CN110956060A (zh) | 2020-04-03 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
WO2020063753A1 (zh) | Action recognition and driving action analysis method and apparatus, and electronic device | |
US11726577B2 (en) | Systems and methods for triggering actions based on touch-free gesture detection | |
TWI741512B (zh) | Driver attention monitoring method and apparatus, and electronic device | |
US10223838B2 (en) | Method and system of mobile-device control with a plurality of fixed-gradient focused digital cameras | |
US20210081754A1 (en) | Error correction in convolutional neural networks | |
EP2634727B1 (en) | Method and portable terminal for correcting gaze direction of user in image | |
CN111566612A (zh) | Pose- and gaze-based visual data acquisition system | |
CN110956061B (zh) | Action recognition method and apparatus, and driver state analysis method and apparatus | |
WO2020125499A1 (zh) | Operation prompting method and glasses | |
US11715231B2 (en) | Head pose estimation from local eye region | |
US20180239975A1 (en) | Method and system for monitoring driving behaviors | |
EP2490155A1 (en) | A user wearable visual assistance system | |
WO2019128101A1 (zh) | Projection-area-adaptive dynamic projection method, apparatus, and electronic device | |
TW201445457A (zh) | Virtual glasses try-on method and apparatus | |
WO2022042624A1 (zh) | Information display method, device, and storage medium | |
CN112183200B (zh) | Eye-tracking method and system based on video images | |
CN111046734A (zh) | Multimodal-fusion gaze estimation method based on dilated convolution | |
JP2019179390A (ja) | Gaze point estimation processing device, gaze point estimation model generation device, gaze point estimation processing system, gaze point estimation processing method, program, and gaze point estimation model | |
CN114463725A (zh) | Driver behavior detection method and apparatus, and safe driving reminder method and apparatus | |
KR20150064977A (ko) | Video analysis and visualization system based on face information | |
KR20190119212A (ko) | Virtual fitting system using an artificial neural network, method therefor, and computer-readable recording medium on which a program for performing the method is recorded | |
Saif et al. | Robust drowsiness detection for vehicle driver using deep convolutional neural network | |
WO2020051781A1 (en) | Systems and methods for drowsiness detection | |
WO2018059258A1 (zh) | Method and apparatus for providing a palm-decoration virtual image using augmented reality technology | |
WO2022142079A1 (zh) | Graphic code display method and apparatus, terminal, and storage medium | |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 19864657 Country of ref document: EP Kind code of ref document: A1 |
|
ENP | Entry into the national phase |
Ref document number: 2020551540 Country of ref document: JP Kind code of ref document: A |
|
ENP | Entry into the national phase |
Ref document number: 20207027826 Country of ref document: KR Kind code of ref document: A |
|
NENP | Non-entry into the national phase |
Ref country code: DE |
|
122 | Ep: pct application non-entry in european phase |
Ref document number: 19864657 Country of ref document: EP Kind code of ref document: A1 |