CN114842528A - Motion detection method, motion detection device, electronic device, and storage medium - Google Patents
- Publication number
- CN114842528A (application number CN202210346614.XA)
- Authority
- CN
- China
- Prior art keywords
- target
- target object
- key point
- leaning
- action
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G06V40/161—Human faces: Detection; Localisation; Normalisation
- G06V10/46—Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features
- G06V10/993—Evaluation of the quality of the acquired pattern
- G06V20/46—Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
- G06V20/59—Context or environment of the image inside of a vehicle, e.g. relating to seat occupancy, driver state or inner lighting conditions
- G06V40/18—Eye characteristics, e.g. of the iris
- G06V40/20—Movements or behaviour, e.g. gesture recognition
Abstract
The present disclosure relates to a motion detection method, a motion detection apparatus, an electronic device, and a storage medium. The motion detection method includes: acquiring an image of a target object; performing skeleton key point detection on the image of the target object to obtain skeleton key point detection information of the target object; performing face key point detection on the image of the target object to obtain face key point detection information of the target object; obtaining an estimation result of the action information of the target object according to the geometric relationship between target key points in the skeleton key point detection information; and verifying the estimation result according to the face key point detection information to obtain the action detection result of the target object.
Description
Technical Field
The present disclosure relates to the field of image detection technologies, and in particular, to a motion detection method and apparatus, an electronic device, and a storage medium.
Background
With the continuous development of artificial intelligence technology, image and video detection has become increasingly diverse and effective. Applying detection technology in the field of safety protection, in particular, can help keep users out of danger. Taking a vehicle cabin scene as an example, the safety of the cabin environment and its occupants can be monitored through image processing. By acquiring images and videos inside the vehicle while it is driving or parked, it is possible to detect whether the occupants perform dangerous actions, thereby improving driving and riding safety. However, motion detection of persons in the related art is prone to misjudgment, resulting in a poor user experience.
Disclosure of Invention
The present disclosure provides a motion detection method, apparatus, device and storage medium to solve the drawbacks of the related art.
According to a first aspect of the embodiments of the present disclosure, there is provided an action detection method, including:
acquiring an image of a target object;
carrying out bone key point detection on the image of the target object to obtain bone key point detection information of the target object;
performing face key point detection on the image of the target object to obtain face key point detection information of the target object;
obtaining an estimation result of the action information of the target object according to the geometric relationship between the target key points in the skeleton key point detection information;
and checking the estimation result according to the face key point detection information to obtain the action detection result of the target object.
In one embodiment, the obtaining an estimation result of the motion information of the target object according to a geometric relationship between target key points in the bone key point detection information includes:
determining that the target action exists in the estimation result of the action information of the target object under the condition that the geometric relation between the target key points in the skeleton key point detection information meets a first preset condition corresponding to the target action;
the verifying the estimation result according to the face key point detection information to obtain the action detection result of the target object includes:
in a case that the estimation result is that the target action exists, determining, in response to a target face key point corresponding to the target action in the face key point detection information satisfying a second preset condition corresponding to the target action, that the detection result of the target object is that the target action exists.
In one embodiment, the determining that the target action exists in the detection result of the target object in response to the target face key point corresponding to the target action in the face key point detection information meeting a second preset condition corresponding to the target action includes:
in a case that the target action is body left-leaning or body right-leaning, determining, in response to the positional relationship among a plurality of first target face key points corresponding to body left-leaning or body right-leaning in the face key point detection information satisfying a second preset condition corresponding to body left-leaning or body right-leaning, that the action detection result of the target object is that the body left-leaning action or the body right-leaning action exists.
In one embodiment, the plurality of first target face key points comprises: a left eye outer side key point and a right eye outer side key point;
when the target motion is left-leaning or right-leaning, determining that the motion detection result of the target object is that the body left-leaning or the body right-leaning exists in response to a position relationship between a plurality of first target face key points corresponding to the body left-leaning or the body right-leaning in the face key point detection information and meeting a second preset condition corresponding to the body left-leaning or the body right-leaning, including:
in a case that the target action is body left-leaning, determining, in response to the tangent value of the included angle between a first target vector (from the right eye outer side key point to the left eye outer side key point of the target object) and a horizontally rightward standard vector being a positive number whose absolute value is greater than a first threshold, that the action detection result of the target object is that the body left-leaning action exists; and/or,
in a case that the target action is body right-leaning, determining, in response to the tangent value of the included angle between the first target vector and the horizontally rightward standard vector being a negative number whose absolute value is greater than the first threshold, that the action detection result of the target object is that the body right-leaning action exists.
In one embodiment, the determining that the target action exists in the detection result of the target object in response to the target face key point corresponding to the target action in the face key point detection information meeting a second preset condition corresponding to the target action includes:
in a case that the target action is leaning forward and lying down, determining, in response to second target face key points corresponding to leaning forward and lying down being missing from the face key point detection information, that the action detection result of the target object is that the leaning-forward-and-lying-down action exists.
In one embodiment, the method further comprises:
determining that the second target face key points corresponding to leaning forward and lying down are missing in a case that it is determined, according to the face key point detection information, that the number of detected face content key points does not exceed a preset number, wherein the face content key points comprise the face key points other than the face edge key points; or,
determining that the second target face key points corresponding to leaning forward and lying down are missing in a case that it is determined, according to the face key point detection information, that the number of detected face key points does not exceed the preset number.
In one embodiment, the acquiring an image of a target object includes:
sequentially acquiring images of each frame of target object cached in a video;
after the estimation result is verified according to the face key point detection information to obtain an action detection result of the target object, the method further includes:
acquiring the action detection results of the target object obtained based on the latest preset number of frames of images of the target object;
determining, among the action detection results of the target object obtained based on the latest preset number of frames of images, the action detection result with the highest occurrence frequency as the current action of the target object. A minimal sketch of this smoothing is given below.
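The following Python sketch illustrates this sliding-window smoothing; the window size, function name, and string labels are assumptions for illustration, not part of the claims.

```python
from collections import Counter, deque

WINDOW_SIZE = 10  # the "preset number of frames"; the value is an assumption

recent_results = deque(maxlen=WINDOW_SIZE)  # keeps only the latest N results

def smooth_action(frame_result: str) -> str:
    """Record the per-frame detection result and return the action that
    occurs most frequently within the latest WINDOW_SIZE frames."""
    recent_results.append(frame_result)
    action, _count = Counter(recent_results).most_common(1)[0]
    return action
```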
In one embodiment, before sequentially acquiring the image of each frame of the target object buffered from the video, the method further includes:
detecting/tracking a target object based on each frame image in the video; determining a plurality of images containing the target object according to the detection result/tracking result of the target object;
adding, to a cache, each image frame that contains preset key information of the target object among the plurality of images containing the target object, as an image of the target object, wherein the preset key information comprises at least one of a human face, at least part of a body, and skeleton key points.
In one embodiment, the obtaining an estimation result of the motion information of the target object according to a geometric relationship between target key points in the bone key point detection information includes:
detecting the movement direction of the target object based on the images of the multiple frames of target objects in the cache;
and obtaining an estimation result of the motion information of the target object according to the geometric relationship between the target key points corresponding to the motion direction in the skeleton key point detection information.
In one embodiment, the image of the target object comprises an image of a target object in a vehicle cabin;
the acquiring of the image of the target object includes:
and acquiring the image of the target object under the condition that the door of the vehicle is in a locked state and/or the speed of the vehicle reaches a preset speed threshold value.
In one embodiment, further comprising:
acquiring an image in a vehicle cabin, and detecting a plurality of objects in the image;
and determining a target object in the plurality of objects according to the position of each object in the vehicle cabin and/or the face information of each object.
In one embodiment, further comprising:
and sending alarm information to a service platform under the condition that the action detection result of the target object is that the target object has a target action.
According to a second aspect of the embodiments of the present disclosure, there is provided a motion detection apparatus including:
the acquisition module is used for acquiring an image of a target object;
the first detection module is used for carrying out bone key point detection on the image of the target object to obtain bone key point detection information of the target object;
the second detection module is used for carrying out face key point detection on the image of the target object to obtain face key point detection information of the target object;
the estimation module is used for obtaining an estimation result of the action information of the target object according to the geometric relationship between the target key points in the skeleton key point detection information;
and the checking module is used for checking the estimation result according to the face key point detection information to obtain the action detection result of the target object.
In one embodiment, the estimation module is specifically configured to:
determining that the target action exists in the estimation result of the action information of the target object under the condition that the geometric relation between the target key points in the skeleton key point detection information meets a first preset condition corresponding to the target action;
the verification module is specifically configured to:
in a case that the estimation result is that the target action exists, determining, in response to a target face key point corresponding to the target action in the face key point detection information satisfying a second preset condition corresponding to the target action, that the detection result of the target object is that the target action exists.
In one embodiment, when determining, in response to a target face key point corresponding to the target action in the face key point detection information satisfying a second preset condition corresponding to the target action, that the detection result of the target object is that the target action exists, the verification module is specifically configured to:
in a case that the target action is body left-leaning or body right-leaning, determine, in response to the positional relationship among a plurality of first target face key points corresponding to body left-leaning or body right-leaning in the face key point detection information satisfying a second preset condition corresponding to body left-leaning or body right-leaning, that the action detection result of the target object is that the body left-leaning action or the body right-leaning action exists.
In one embodiment, the plurality of first target face key points comprises: a left eye outer side key point and a right eye outer side key point;
the verification module is configured to, when the target motion is left-leaning or right-leaning of the body, respond to a positional relationship between a plurality of first target face key points corresponding to the left-leaning or right-leaning of the body in the face key point detection information and satisfy a second preset condition corresponding to the left-leaning or right-leaning of the body, and determine that the motion detection result of the target object is that the left-leaning or right-leaning of the body exists, specifically configured to:
in a case that the target action is body left-leaning, determine, in response to the tangent value of the included angle between a first target vector (from the right eye outer side key point to the left eye outer side key point of the target object) and a horizontally rightward standard vector being a positive number whose absolute value is greater than a first threshold, that the action detection result of the target object is that the body left-leaning action exists; and/or,
in a case that the target action is body right-leaning, determine, in response to the tangent value of the included angle between the first target vector and the horizontally rightward standard vector being a negative number whose absolute value is greater than the first threshold, that the action detection result of the target object is that the body right-leaning action exists.
In one embodiment, when determining, in response to a target face key point corresponding to the target action in the face key point detection information satisfying a second preset condition corresponding to the target action, that the detection result of the target object is that the target action exists, the verification module is specifically configured to:
in a case that the target action is leaning forward and lying down, determine, in response to second target face key points corresponding to leaning forward and lying down being missing from the face key point detection information, that the action detection result of the target object is that the leaning-forward-and-lying-down action exists.
In one embodiment, the method further comprises a determining module for:
determining that the second target face key points corresponding to leaning forward and lying down are missing in a case that it is determined, according to the face key point detection information, that the number of detected face content key points does not exceed a preset number, wherein the face content key points comprise the face key points other than the face edge key points; or,
determining that the second target face key points corresponding to leaning forward and lying down are missing in a case that it is determined, according to the face key point detection information, that the number of detected face key points does not exceed the preset number.
In one embodiment, the obtaining module is specifically configured to:
sequentially acquiring images of each frame of target object cached in a video;
the apparatus further comprises a smoothing module to:
after the estimation result is verified according to the face key point detection information to obtain the action detection result of the target object, acquiring the action detection results of the target object obtained based on the latest preset number of frames of images of the target object;
and determining, among the action detection results of the target object obtained based on the latest preset number of frames of images, the action detection result with the highest occurrence frequency as the current action of the target object.
In one embodiment, the apparatus further comprises a caching module configured to:
before sequentially acquiring images of each frame of target object cached from a video, detecting/tracking the target object based on each frame of image in the video; determining a plurality of images containing the target object according to the detection result/tracking result of the target object;
adding, to a cache, each image frame that contains preset key information of the target object among the plurality of images containing the target object, as an image of the target object, wherein the preset key information comprises at least one of a human face, at least part of a body, and skeleton key points.
In one embodiment, the estimation module is specifically configured to:
detecting the movement direction of the target object based on the images of the multiple frames of target objects in the cache;
and obtaining an estimation result of the motion information of the target object according to the geometric relationship between the target key points corresponding to the motion direction in the skeleton key point detection information.
In one embodiment, the image of the target object comprises an image of a target object in a vehicle cabin;
the acquisition module is specifically configured to:
and acquiring the image of the target object under the condition that the door of the vehicle is in a locked state and/or the speed of the vehicle reaches a preset speed threshold value.
In one embodiment, the system further comprises a target module for:
acquiring an image in a vehicle cabin, and detecting a plurality of objects in the image;
and determining a target object in the plurality of objects according to the position of each object in the vehicle cabin and/or the face information of each object.
In one embodiment, the system further comprises an alarm module for:
and sending alarm information to a service platform under the condition that the action detection result of the target object is that the target object has a target action.
According to a third aspect of embodiments of the present disclosure, there is provided an electronic device comprising a memory for storing computer instructions executable on a processor, the processor being configured to implement the method of the first aspect when executing the computer instructions.
According to a fourth aspect of embodiments of the present disclosure, there is provided a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the method of the first aspect.
According to the above embodiments, an image of a target object is acquired; skeleton key point detection is performed on the image to obtain skeleton key point detection information of the target object; face key point detection is performed on the image to obtain face key point detection information of the target object; an estimation result of the action information of the target object is obtained according to the geometric relationship between target key points in the skeleton key point detection information; and finally the estimation result is verified according to the face key point detection information to obtain the action detection result of the target object. The geometric relationship between target key points makes it possible to detect objectively and accurately whether the target object performs a dangerous action, and further verifying the estimation result with the face key point detection information improves the detection accuracy. When this detection method is applied in a vehicle, it can accurately detect whether the driver or passengers are in danger, thereby improving riding safety and the user experience.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and together with the description, serve to explain the principles of the disclosure.
FIG. 1 is a flow chart illustrating a method of motion detection according to one embodiment of the present disclosure;
FIG. 2 is a schematic diagram illustrating the structure of key points of a bone according to an embodiment of the present disclosure;
fig. 3 is a schematic structural diagram of a face key point according to an embodiment of the present disclosure;
FIG. 4 is a flowchart illustrating a method for motion detection in a vehicle driving scenario, according to an embodiment of the present disclosure;
fig. 5 is a schematic structural diagram of a motion detection device shown in the embodiment of the present disclosure;
fig. 6 is a schematic structural diagram of an electronic device shown in an embodiment of the present disclosure.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The implementations described in the exemplary embodiments below are not intended to represent all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present disclosure, as detailed in the appended claims.
The terminology used in the present disclosure is for the purpose of describing particular embodiments only and is not intended to be limiting of the disclosure. As used in this disclosure and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any and all possible combinations of one or more of the associated listed items.
It is to be understood that although the terms first, second, third, etc. may be used herein to describe various information, such information should not be limited to these terms. These terms are only used to distinguish one type of information from another. For example, first information may also be referred to as second information, and similarly, second information may also be referred to as first information, without departing from the scope of the present disclosure. The word "if" as used herein may be interpreted as "at … …" or "when … …" or "in response to a determination", depending on the context.
In a first aspect, at least one embodiment of the present disclosure provides a motion detection method, please refer to fig. 1, which illustrates a flow of the method, including steps S101 to S103.
The method may be used for motion detection of a target object within an image of the target object, for example, to determine whether the target object in the image performs a target action. The target action may be a dangerous action; that is, the method may be used to detect whether the target object in the image performs a dangerous action. The method can be applied to scenes such as vehicle driving, where it can detect whether a driver or passenger performs a dangerous action. The dangerous action may be predefined, for example, leaning left, leaning right, covering the chest, or leaning forward and lying down.
The image of the target object may be an image shot by an image acquisition device or a frame of a video recorded by an image acquisition device. For example, in a vehicle scene, the image of the target object may be an image of a target object in the vehicle cabin, such as an image, or an image frame of a video, captured by a camera installed in the cabin in advance for a driver or a passenger.
In addition, the method may be performed by an electronic device such as a terminal device or a server. The terminal device may be a User Equipment (UE), a mobile device, a user terminal, a cellular phone, a cordless phone, a Personal Digital Assistant (PDA) handheld device, a computing device, a vehicle-mounted device, a wearable device, or the like, and the method may be implemented by a processor calling computer-readable instructions stored in a memory. Alternatively, the method may be performed by a server, which may be a local server, a cloud server, or the like. In a vehicle driving scene, the method can be executed by an intelligent Emergency Call system connected with a camera in the vehicle cabin, so that the video stream of the cabin scene area acquired by the camera can be obtained.
In step S101, an image of a target object is acquired.
The image of the target object may be an image shot by an image capturing device, or an image frame of a video recorded by an image capturing device, where the image capturing device may be an electronic device with an image capturing function, such as a mobile phone or a camera. The target object refers to the person in the image whose action needs to be detected; the region of the image other than the target object is a background region or another object (other objects may not exist). For example, the target object in a vehicle scene may be the driver or a specific passenger. Therefore, after the video stream of the scene area is acquired, a plurality of objects in the video stream can be detected, and the target object among the plurality of objects can then be determined according to the position of each object in the vehicle cabin and/or the face information of each object. For example, the object on the driving seat may be determined as the target object, i.e., the driver is the target object; or the object whose face information matches pre-entered reference face features may be determined as the target object, i.e., a specific person such as the vehicle owner or a registered driver is the target object. The operations on the target object in the following steps may be performed based on the target object determined in this step. A minimal sketch of this selection rule is given below.
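The following Python sketch illustrates the two selection criteria just described (seat position and/or face matching). All names, the box format, the similarity function, and the 0.8 threshold are assumptions for illustration, not part of the disclosure.

```python
import math

def inside(point, box):
    """Whether an (x, y) point lies within an (x1, y1, x2, y2) box."""
    x, y = point
    x1, y1, x2, y2 = box
    return x1 <= x <= x2 and y1 <= y <= y2

def cosine_sim(a, b):
    """Cosine similarity between two face-feature vectors."""
    dot = sum(u * v for u, v in zip(a, b))
    norm = math.sqrt(sum(u * u for u in a)) * math.sqrt(sum(v * v for v in b))
    return dot / norm if norm else 0.0

def select_target(objects, driver_seat_box=None, reference_feature=None,
                  sim_threshold=0.8):
    """Pick the target object among detected occupants by seat position
    and/or by matching pre-entered reference face features."""
    for obj in objects:  # each obj: {"position": (x, y), "face_feature": [...]}
        if driver_seat_box is not None and inside(obj["position"], driver_seat_box):
            return obj
        if (reference_feature is not None
                and cosine_sim(obj.get("face_feature", []), reference_feature) > sim_threshold):
            return obj
    return None
```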
It is to be understood that there may be one or more target objects in the image. When there are multiple target objects, they may be processed one after another, or simultaneously, according to the method provided by this embodiment. For example, the image may be captured of the whole cabin scene, and the target objects may include the driver, passengers, and so on.
In a case that the image of the target object is an image frame of a video, a plurality of images may be cached from the video, and the cached frames of the target object may then be acquired in sequence. When caching images from the video, each frame may be cached in sequence; or image frames may be extracted from the video at certain intervals for caching; or image frames may be extracted according to a caching condition. For example, the target object may be detected/tracked in each frame of the video, a plurality of images containing the target object may be determined according to the detection/tracking result, and each of those frames that contains preset key information of the target object may then be added to the cache as an image of the target object, where the preset key information comprises at least one of a human face, at least part of a body (such as the left shoulder, right shoulder, left ear, or right ear), and skeleton key points. It will be appreciated that the number of cached images may be set; whichever of the above manners is used, the cached images are kept up to date and at the set number, i.e., as the video is updated, the cached images are updated.
In one possible embodiment, in a vehicle driving scene where the target object is a driver or a passenger, a starting condition for this step may be set in advance, for example, that the vehicle door is in a locked state and/or the vehicle speed reaches a preset speed threshold. The image of the target object is then acquired only when the vehicle satisfies the starting condition, that is, when the vehicle door is locked and/or the vehicle speed reaches the preset speed threshold. In this way, images are acquired for detection only in scenes where the vehicle actually needs danger detection, which makes the detection targeted and reduces wasted computing power, memory, and power consumption.
In step S102, bone key point detection is performed on the image of the target object, so as to obtain bone key point detection information of the target object.
A pre-trained neural network may be used to detect the image of the target object, so as to obtain the skeleton key points of the target object. The skeleton key points represent joint parts in the skeletal structure of the human body, from which a skeletal structure diagram can be drawn. For example, the skeleton key points that the neural network can detect, and the skeletal structure diagram drawn from them, are shown in fig. 2: nose key point 0, left eye key point 1, right eye key point 2, left ear key point 3, right ear key point 4, left shoulder key point 5, right shoulder key point 6, left elbow key point 7, right elbow key point 8, left wrist key point 9, right wrist key point 10, left hip key point 11, right hip key point 12, left knee key point 13, right knee key point 14, left ankle key point 15, and right ankle key point 16. It should be noted that the image is captured facing the target object (directly or at an angle), so the target object in the image is in a mirror-image relationship with the target object in the real scene: the left side of the target object in the real scene is the right side of the target object in the image, and vice versa. The origin of the coordinate system of the image may be at its upper left corner, with the positive direction of the horizontal axis (e.g., the x-axis) pointing right along the horizontal side and the positive direction of the vertical axis (e.g., the y-axis) pointing down along the vertical side.
When detecting the skeleton key points of the target object in this step, only the key points contained in the portion of the target object that appears in the image may be detected. For example, if only the driver's upper body appears in the image, only the upper-body skeleton key points are detected. In other words, this step may detect all of the key points shown in fig. 2, or only some of them.
The detected bone key points can be represented by coordinate positions in the image of the target object, and the bone key points can be identified at corresponding positions on the image of the target object.
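For reference in the sketches that follow, the index constants below mirror the numbering of fig. 2, together with the image coordinate convention described above (origin at the top-left, x to the right, y downward); the constant names and the dict-based key-point representation are assumptions, not part of the disclosure.

```python
# Skeleton key-point indices as numbered in fig. 2 of the disclosure.
NOSE, LEFT_EYE, RIGHT_EYE = 0, 1, 2
LEFT_EAR, RIGHT_EAR = 3, 4
LEFT_SHOULDER, RIGHT_SHOULDER = 5, 6
LEFT_ELBOW, RIGHT_ELBOW = 7, 8
LEFT_WRIST, RIGHT_WRIST = 9, 10
LEFT_HIP, RIGHT_HIP = 11, 12
LEFT_KNEE, RIGHT_KNEE = 13, 14
LEFT_ANKLE, RIGHT_ANKLE = 15, 16

# In the sketches below, `keypoints` is a dict mapping an index above to an
# (x, y) pixel position; key points outside the frame are simply absent.
```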
In step S103, face key point detection is performed on the image of the target object to obtain face key point detection information of the target object.
A pre-trained neural network may be used to detect the image of the target object, so as to obtain the face key points of the target object. This neural network may be different from the neural network described above for detecting skeleton key points. The face key points are key points of the regions of the face, from which the contour and key areas of the face can be drawn. For example, the face key points that the neural network can detect, and the face contour drawn from them, are shown in fig. 3: face contour key points 0-32, right eyebrow upper edge contour 33-37, left eyebrow upper edge contour 38-42, nose center line 43-46, nose lower side contour 47-51, nose outer side contour 80-83, lip outer side contour 84-95, lip inner side contour 97-103, right eye center point 104, and left eye center point 105. It should be noted that, because the image is captured facing the target object, the target object in the image is in a mirror-image relationship with the target object in the real scene: the left side of the target object in the real scene is the right side of the target object in the image, and vice versa.
When detecting the face key points of the target object, only the key points contained in the portion of the face that appears in the image may be detected. In other words, the face key point detection in this step may yield all of the key points shown in fig. 3, some of them, or none of them.
In some embodiments, face key point detection may be performed directly on the target object in the image. Alternatively, face detection may be performed on the target object first, the image of the face region may be segmented out according to the face detection result, and the face key points may then be detected in the image of the face region. In this manner, the face detection may rely on features of other regions such as the trunk of the target object, so that the face position can be located, and a face key point detection result obtained, even when some key information of the face is missing.
The detected face key points can be represented by coordinate positions in the image of the target object, and the face key points can be identified at corresponding positions on the image of the target object.
In step S104, an estimation result of the motion information of the target object is determined according to a geometric relationship between target key points in the skeletal key point detection information.
The estimation result of the action information of the target object may be that the target object performs a target action or does not perform it, where the target action may be a dangerous action to be detected. One or more target actions may be preset in advance. Each target action has corresponding target key points, and when the action occurs, the geometric relationship among those target key points satisfies a certain condition. Therefore, a first preset condition may be preset for each target action, set in terms of the geometric relationship between the target key points corresponding to that action. When determining the action information of the target object, it can then be judged, for each target action, whether the geometric relationship between the corresponding target key points satisfies the first preset condition of that action; if so, it is determined that the target object performs the target action, otherwise it is determined that it does not.
In one possible embodiment, four target actions are preset: body left-leaning, body right-leaning, covering the chest, and leaning forward and lying down. Corresponding target key points and a first preset condition are set for each target action.
The target key points corresponding to body left-leaning may be set as the right shoulder key point and the left shoulder key point. The vector along the line from the right shoulder key point to the left shoulder key point is referred to as the first target vector, and the horizontally rightward vector (i.e., parallel to the transverse side of the image and pointing right) is referred to as the standard vector. The corresponding first preset condition is that the tangent value of the included angle between the first target vector and the standard vector is a positive number whose absolute value is greater than a first threshold (for example, 0.4). Taking the right shoulder key point 6 and the left shoulder key point 5 shown in fig. 2 as an example, the first preset condition for body left-leaning may be represented as tan(vec(6,5)) > 0.4. That is, in response to the tangent value of the included angle between the first target vector (from the right shoulder key point to the left shoulder key point) and the horizontally rightward standard vector being a positive number whose absolute value is greater than the first threshold, it is determined that the estimation result is that the target object performs the body left-leaning action.
Likewise, the target key points corresponding to body right-leaning may be set as the right shoulder key point and the left shoulder key point, with the first target vector and the standard vector defined as above. The corresponding first preset condition is that the tangent value of the included angle between the first target vector and the standard vector is a negative number whose absolute value is greater than the first threshold (for example, 0.4); with the key points of fig. 2, it may be represented as tan(vec(6,5)) < -0.4. That is, in response to the tangent value of the included angle between the first target vector and the horizontally rightward standard vector being a negative number whose absolute value is greater than the first threshold, it is determined that the estimation result is that the target object performs the body right-leaning action. A sketch of both lean checks is given below.
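A minimal Python sketch of the two shoulder-based lean conditions above, using the fig. 2 index constants; taking the tangent as dy/dx of the shoulder vector is an assumption consistent with the tan(vec(6,5)) notation, and the string labels are illustrative.

```python
def shoulder_lean(keypoints: dict, threshold: float = 0.4):
    """Estimate body lean from the right->left shoulder vector:
    tan > threshold means left-leaning, tan < -threshold right-leaning."""
    rs = keypoints.get(RIGHT_SHOULDER)
    ls = keypoints.get(LEFT_SHOULDER)
    if rs is None or ls is None:
        return None  # cannot estimate without both shoulders
    dx, dy = ls[0] - rs[0], ls[1] - rs[1]
    if dx == 0:
        return None  # degenerate case: shoulders vertically aligned
    tangent = dy / dx  # tangent of the angle against the horizontal standard vector
    if tangent > threshold:
        return "body_left_leaning"
    if tangent < -threshold:
        return "body_right_leaning"
    return None
```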
The target key points corresponding to covering the chest may be set as the left elbow, left wrist, left shoulder, right elbow, right wrist, and right shoulder key points. The vector from the left elbow key point to the left wrist key point is referred to as the second target vector, and the vector from the left elbow key point to the left shoulder key point as the third target vector; the vector from the right elbow key point to the right wrist key point is referred to as the fourth target vector, and the vector from the right elbow key point to the right shoulder key point as the fifth target vector. The cosine of the included angle between the second and third target vectors is referred to as the first cosine value, and the cosine of the included angle between the fourth and fifth target vectors as the second cosine value. The corresponding first preset condition is that the first cosine value is greater than a second threshold (for example, 0.2) and the vertical distance between the left wrist key point and the right shoulder key point is greater than a third threshold (for example, 100); and/or the second cosine value is greater than the second threshold and the vertical distance between the right wrist key point and the left shoulder key point is greater than the third threshold. Taking the left elbow key point 7, left wrist key point 9, left shoulder key point 5, right elbow key point 8, right wrist key point 10, and right shoulder key point 6 shown in fig. 2 as an example, the first preset condition for covering the chest may be represented as cos(vec(7,5), vec(7,9)) > 0.2 and y(9) - y(6) > 100, and/or cos(vec(8,6), vec(8,10)) > 0.2 and y(10) - y(5) > 100. That is, in response to a first situation or a second situation, it is determined that the estimation result is that the target object performs the chest-covering action. The first situation is that the cosine of the included angle between the second and third target vectors exceeds the second threshold, the left wrist key point is lower than the right shoulder key point, and their vertical distance exceeds the third threshold; the second situation is the mirrored test with the fourth and fifth target vectors, the right wrist key point, and the left shoulder key point. A sketch follows.
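A minimal Python sketch of the chest-covering condition, again over the fig. 2 indices; the helper names and the use of 2-D dot products are assumptions.

```python
import math

def _cos_angle(v1, v2) -> float:
    """Cosine of the included angle between two 2-D vectors."""
    dot = v1[0] * v2[0] + v1[1] * v2[1]
    norms = math.hypot(*v1) * math.hypot(*v2)
    return dot / norms if norms else 0.0

def covering_chest(keypoints: dict, cos_thresh: float = 0.2,
                   dist_thresh: float = 100) -> bool:
    """cos(vec(elbow, shoulder), vec(elbow, wrist)) > 0.2 and the wrist
    lies more than 100 px below the opposite shoulder, on either arm."""
    def arm_check(elbow, wrist, same_shoulder, opposite_shoulder):
        pts = [keypoints.get(i) for i in (elbow, wrist, same_shoulder, opposite_shoulder)]
        if any(p is None for p in pts):
            return False
        e, w, s, o = pts
        to_wrist = (w[0] - e[0], w[1] - e[1])
        to_shoulder = (s[0] - e[0], s[1] - e[1])
        # y grows downward, so y(wrist) - y(opposite shoulder) > 0 means lower.
        return (_cos_angle(to_wrist, to_shoulder) > cos_thresh
                and (w[1] - o[1]) > dist_thresh)

    return (arm_check(LEFT_ELBOW, LEFT_WRIST, LEFT_SHOULDER, RIGHT_SHOULDER)
            or arm_check(RIGHT_ELBOW, RIGHT_WRIST, RIGHT_SHOULDER, LEFT_SHOULDER))
```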
The target key points corresponding to leaning forward and lying down may be set as the left ear, left shoulder, right ear, and right shoulder key points, and the corresponding first preset condition as: the vertical distance between the left ear key point and the left shoulder key point is greater than a fourth threshold (for example, 50), and/or the vertical distance between the right ear key point and the right shoulder key point is greater than the fourth threshold. Taking the left ear key point 3, left shoulder key point 5, right ear key point 4, and right shoulder key point 6 shown in fig. 2 as an example, the first preset condition may be represented as y(3) - y(5) > 50 and/or y(4) - y(6) > 50. That is, in response to an ear key point being lower than the shoulder key point on the same side with a vertical distance greater than the fourth threshold, it is determined that the estimation result is that the target object performs the leaning-forward-and-lying-down action.
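A corresponding Python sketch of the leaning-forward-and-lying-down condition; the function name is an assumption.

```python
def leaning_forward_lying_down(keypoints: dict, dist_thresh: float = 50) -> bool:
    """y(ear) - y(shoulder) > threshold on either side; since y grows
    downward, this tests whether an ear has dropped below its shoulder."""
    for ear, shoulder in ((LEFT_EAR, LEFT_SHOULDER), (RIGHT_EAR, RIGHT_SHOULDER)):
        e, s = keypoints.get(ear), keypoints.get(shoulder)
        if e is not None and s is not None and (e[1] - s[1]) > dist_thresh:
            return True
    return False
```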
In step S105, the estimation result is verified according to the face key point detection information, so as to obtain an action detection result of the target object.
In each target action, the face presents a corresponding pose; for example, the face deflects in a certain direction, or at least part of the face region is blocked. The estimation result of the action information obtained in step S104 can therefore be further verified through the geometric constraint relationship, or relative positional relationship, satisfied between specific target face key points in the corresponding face pose. Specifically, one or more target face key points may be set for each action, and a second preset condition may be preset for each target action in terms of the geometric constraint or relative positional relationship satisfied between the target face key points corresponding to that action. Then, when the estimation result is that the target action exists, it is determined, in response to the target face key points corresponding to the target action in the face key point detection information satisfying the second preset condition of that action, that the detection result of the target object is that the target action exists. In other words, the target object is considered to perform the target action only when both dimensions, skeleton key points and face key points, agree. This further improves the accuracy of action detection and avoids false detections that could arise from the skeleton key points alone.
In one possible embodiment, the target action is body left-leaning or body right-leaning, i.e., the estimation result of the action information determined in step S104 is that the body left-leaning or body right-leaning action exists. In this case, the second preset condition corresponding to body left-leaning or body right-leaning may be set based on the positional relationship between the first target face key points corresponding to that action. Then, in response to the positional relationship among the plurality of first target face key points corresponding to body left-leaning or body right-leaning in the face key point detection information satisfying the second preset condition, the action detection result of the target object is determined to be that the body left-leaning or body right-leaning action exists.
For example, the first target face key points related to body left-leaning may be set as the right eye outer side key point (i.e., key point 36 shown in fig. 3) and the left eye outer side key point (i.e., key point 45 shown in fig. 3). The vector along the line from the right eye outer side key point to the left eye outer side key point is referred to as the sixth target vector, and the horizontally rightward vector (i.e., parallel to the transverse side of the image and pointing right) as the standard vector. The second preset condition for body left-leaning may then be set as: the tangent value of the included angle between the sixth target vector and the standard vector is a positive number whose absolute value is greater than a fifth threshold (for example, 0.4); with key points 36 and 45 of fig. 3, it may be represented as tan(vec(36,45)) > 0.4. That is, when the target action is body left-leaning, in response to the tangent value of the included angle between the sixth target vector and the horizontally rightward standard vector being a positive number whose absolute value is greater than the fifth threshold, it is determined that the action detection result of the target object is that the body left-leaning action exists.
The first target face key points related to body right-leaning may likewise be set as the right eye outer side key point 36 and the left eye outer side key point 45, with the sixth target vector and the standard vector defined as above. The second preset condition for body right-leaning may be set as: the tangent value of the included angle between the sixth target vector and the standard vector is a negative number whose absolute value is greater than the fifth threshold (for example, 0.4), which may be represented as tan(vec(36,45)) < -0.4. That is, when the target action is body right-leaning, in response to the tangent value of the included angle between the sixth target vector and the horizontally rightward standard vector being a negative number whose absolute value is greater than the fifth threshold, it is determined that the action detection result of the target object is that the body right-leaning action exists. A sketch of this verification is given below.
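A minimal Python sketch of this face-based verification of a lean estimate; the fig. 3 indices 36 and 45 come from the disclosure, while the function name, labels, and dy/dx tangent are assumptions.

```python
RIGHT_EYE_OUTER, LEFT_EYE_OUTER = 36, 45  # face key-point indices from fig. 3

def verify_lean_with_face(face_points: dict, estimated: str,
                          threshold: float = 0.4) -> bool:
    """Check an estimated lean against the outer-eye-corner vector:
    tan(vec(36,45)) > 0.4 confirms left-leaning, < -0.4 right-leaning."""
    r = face_points.get(RIGHT_EYE_OUTER)
    l = face_points.get(LEFT_EYE_OUTER)
    if r is None or l is None or l[0] == r[0]:
        return False  # verification fails without both eye corners
    tangent = (l[1] - r[1]) / (l[0] - r[0])
    if estimated == "body_left_leaning":
        return tangent > threshold
    if estimated == "body_right_leaning":
        return tangent < -threshold
    return False
```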
In one possible embodiment, the target action is forward-leaning and forward-lying, i.e., the estimation result of the action information of the target object determined in step S104 is that a forward-leaning and forward-lying action exists. In this case, the second preset condition corresponding to forward-leaning and forward-lying may be determined based on whether target face key points are missing. Then, when the target action is forward-leaning and forward-lying, in response to the second target face key points corresponding to forward-leaning and forward-lying being missing from the face key point detection information, the action detection result of the target object is determined to be that the forward-leaning and forward-lying action exists.
For example, the second target face key points corresponding to forward-leaning and forward-lying may be set to include the eye key points, the nose key points and the mouth key points. The corresponding second preset condition may then be set as follows: none of the eye key points, the nose key points and the mouth key points appear in the face key point detection information.
Alternatively, the second target face key points corresponding to forward-leaning and forward-lying may be set as the face content key points, where the face content key points include all face key points except the face edge key points. The corresponding second preset condition may then be set as follows: the number of face content key points does not exceed a preset number. That is, when it is determined from the face key point detection information that the number of detected face content key points does not exceed the preset number, the second target face key points corresponding to forward-leaning and forward-lying are determined to be missing.
Alternatively, the second target face key points corresponding to forward-leaning and forward-lying may be set as all of the face key points. The corresponding second preset condition may then be set as follows: the number of face key points does not exceed the preset number. That is, when it is determined from the face key point detection information that the number of detected face key points does not exceed the preset number, the second target face key points corresponding to forward-leaning and forward-lying are determined to be missing.
The above ways of determining that the second target face key points are missing can adapt to accurate detection of the forward-leaning and forward-lying action when the face presents various different postures in practice, realizing more flexible and accurate verification of the action detection result.
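The three missing-key-point variants above can be sketched as a single check. This is only an illustration: the index sets assume the 68-point layout of fig. 3, and the preset number is an arbitrary example value:

```python
from typing import Dict, Tuple

FACE_EDGE = set(range(0, 17))     # jaw-contour (face edge) key points
EYES = set(range(36, 48))
NOSE = set(range(27, 36))
MOUTH = set(range(48, 68))
PRESET_NUMBER = 5                 # illustrative assumption, not fixed by the text

def prone_fall_keypoints_missing(detected: Dict[int, Tuple[float, float]]) -> bool:
    """detected maps key point index -> (x, y) for the points the detector returned."""
    ids = set(detected)
    # Variant 1: the eye, nose and mouth key points are all absent.
    if not (ids & EYES) and not (ids & NOSE) and not (ids & MOUTH):
        return True
    # Variant 2: too few face content key points (all points except the edge contour).
    if len(ids - FACE_EDGE) <= PRESET_NUMBER:
        return True
    # Variant 3: too few face key points overall.
    return len(ids) <= PRESET_NUMBER
```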
It can be understood that, when the action detection result of the target object is that the target action exists, alarm information may be sent to a service platform. For example, if the target action is a dangerous action to be detected, alarm information may be sent to the service platform when the target object exhibits the dangerous action. In a vehicle driving scenario, the service platform may be a platform operating the vehicle, for example a ride-hailing service platform, or a medical platform; after receiving the alarm information, the service platform can take rescue measures, which improves the efficiency and effect of handling drivers in danger and better protects the life safety of the persons in the vehicle. In addition, if the image of the target object is an image frame in a video and is updated as the video is recorded, the alarm information may be sent to the service platform only when the action detection result continues to indicate that the target action exists for a preset time period, which avoids false alarms caused by fluctuations in the action information.
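The "alarm only if the action persists" idea can be sketched as a small gate object; the hold duration and the class name are illustrative assumptions, and the actual preset time period is left to the implementation:

```python
import time
from typing import Optional

ALARM_HOLD_SECONDS = 3.0  # illustrative preset time period

class AlarmGate:
    """Raises the alarm only if the target action persists for the preset period."""

    def __init__(self) -> None:
        self._since: Optional[float] = None

    def should_alarm(self, action_detected: bool) -> bool:
        now = time.monotonic()
        if not action_detected:
            self._since = None        # a fluctuation resets the timer
            return False
        if self._since is None:
            self._since = now         # action first seen: start timing
        return now - self._since >= ALARM_HOLD_SECONDS
```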
According to the above embodiment, an image of the target object is acquired; bone key point detection is performed on the image to obtain the bone key point detection information of the target object, and face key point detection is performed on the image to obtain the face key point detection information of the target object; the estimation result of the action information of the target object is then obtained according to the geometric relationship between the target key points in the bone key point detection information, and finally the estimation result is verified according to the face key point detection information to obtain the action detection result of the target object. Since the geometric relationship between the target key points is objective and accurate, whether a dangerous action exists can be detected accurately, and further verifying the estimation result obtained from the target key points against the face key point detection information further improves the detection accuracy. When the method is applied in a vehicle, whether a driver or passenger in the vehicle is in danger can be detected accurately, which improves riding safety and the user experience.
In some embodiments of the present disclosure, the image of the target object is an image frame in a video. A plurality of images may be cached from the video, and the cached images of the target object are then acquired frame by frame; the specific way of caching the images has been described in detail in step S101 and is not repeated here.
Based on the above manner of acquiring the image of the target object, step S104, i.e., determining the estimation result of the action information of the target object according to the geometric relationship between the target key points in the bone key point detection information, may be performed as follows:
First, the motion direction of the target object is detected based on the cached multiple frames of images of the target object. The features of the target object may first be extracted from each of the image frames; for example, feature points of the target object may be extracted based on the brightness-constancy assumption that the brightness of the same target does not change as it moves between frames. Optical flow information of the target object in the video stream, which characterizes the motion of the target between frames, is then determined from the features of the target object in each image frame. Finally, the motion direction of the target object in the scene area is determined from the optical flow information. Illustratively, the Lucas-Kanade algorithm is used to find the direction in which the target object moves from relative rest into motion.
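A minimal sketch of this direction estimation with OpenCV's pyramidal Lucas-Kanade tracker is shown below; the feature-tracking parameters and the coarse left/right/up/down binning of the mean displacement are assumptions for illustration, not values fixed by the disclosure:

```python
from typing import Optional

import cv2
import numpy as np

def motion_direction(prev_gray: np.ndarray, curr_gray: np.ndarray) -> Optional[str]:
    """Coarse left/right/up/down direction between two consecutive grayscale frames."""
    pts = cv2.goodFeaturesToTrack(prev_gray, maxCorners=100, qualityLevel=0.3, minDistance=7)
    if pts is None:
        return None
    nxt, status, _err = cv2.calcOpticalFlowPyrLK(prev_gray, curr_gray, pts, None)
    good = status.ravel() == 1
    flow = (nxt[good] - pts[good]).reshape(-1, 2)   # per-feature displacement vectors
    if len(flow) == 0:
        return None
    dx, dy = flow.mean(axis=0)                      # mean motion of the tracked features
    if abs(dx) >= abs(dy):
        return "right" if dx > 0 else "left"
    return "down" if dy > 0 else "up"               # the image y axis points down
```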
Then, the estimation result of the action information of the target object is obtained according to the geometric relationship between the target key points corresponding to the motion direction in the bone key point detection information. Each motion direction may correspond to one or more preset target actions, and the bone key points corresponding to each target action satisfy a certain geometric constraint relationship. Therefore, a plurality of target key points may be set for each target action, and each target action may be given a first preset condition describing the geometric constraint relationship between its target key points. When determining the action information of the target object, for each target action associated with the motion direction, it is judged whether the corresponding target key points satisfy the first preset condition of that target action; if so, the target object has the target action, and otherwise it does not.
For example, each motion direction may correspond to one target action, for which the target key points and the first preset condition are set. After the motion direction is determined, whether the target object has the target action corresponding to that direction can be judged from the corresponding target key points and first preset condition. Narrowing the detection range of target actions according to the motion direction saves energy and memory and improves detection efficiency; detecting the target key points in a targeted manner according to the motion direction also makes the action detection more specific, further improving accuracy. Specifically, the target actions corresponding to the four motion directions left, right, up and down may be preset as body left-leaning, body right-leaning, chest-clutching, and forward-leaning and forward-lying, with corresponding target key points and a first preset condition set for each target action (i.e., each motion direction). The target key points and first preset condition of each target action have already been described in detail in step S104 and are not repeated here.
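A sketch of dispatching the geometric checks by motion direction might look as follows; the direction-to-action mapping mirrors the four directions above, while the action names and the check callables are illustrative placeholders:

```python
from typing import Callable, Dict, List, Optional

GeometryCheck = Callable[[dict], bool]   # takes the bone key point detection info

# Mapping assumed from the four directions listed above; names are illustrative.
DIRECTION_TO_ACTIONS: Dict[str, List[str]] = {
    "left": ["body_left_leaning"],
    "right": ["body_right_leaning"],
    "up": ["chest_clutching"],
    "down": ["forward_leaning_forward_lying"],
}

def estimate_action(direction: str, skeleton: dict,
                    first_conditions: Dict[str, GeometryCheck]) -> Optional[str]:
    # Only the target actions associated with the detected motion direction are
    # checked, which narrows the detection range as described above.
    for action in DIRECTION_TO_ACTIONS.get(direction, []):
        if first_conditions[action](skeleton):   # the first preset condition of the action
            return action
    return None
```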
In some embodiments of the present disclosure, the image of the target object is an image frame in a video. As described above, a plurality of images may be cached from the video and the cached images of the target object acquired frame by frame; the specific way of caching the images has been described in detail in step S101 and is not repeated here.
Based on the above manner of acquiring the image of the target object, after step S104 is executed, the action detection results of the target object obtained from the latest preset number of frames of images of the target object may be acquired, and the action detection result that occurs most frequently among them is determined as the current action of the target object. The preset number of frames may be, for example, 2, 4 or 5.
Specifically, the action detection result obtained from each frame of the image of the target object may be pushed into a smoothing queue. A smoothing window whose size is the preset number of frames is then set and moved as the smoothing queue is updated; after each move, the current action is determined from the plurality of action detection results inside the window. This smooths the action detection results and improves the validity and stability of the current action.
If the action detection results detected from each of the preset number of frames are all the same, that action may be determined as the current action. For example, if the action detection result of each of 5 frames of images of the target object is body left-leaning, body left-leaning may be determined as the current action.
If the preset number of action detection results are not all consistent, the action detection result that occurs most frequently, i.e., the one with the largest count, is taken as the current action. For example, if the action detection result of 4 out of 5 frames of images of the target object is the body left-leaning action and that of the remaining frame is no target action, the body left-leaning action may be determined as the current action. Alternatively, if the count of the most frequent action detection result within the smoothing window does not exceed a preset proportion (e.g., 50%) of the window length (i.e., the total number of image frames in the window), the action detection results of that window may be discarded; the smoothing window is then moved by a preset step (a number of image frames) to determine the action detection result of the new window.
It should be noted that, in order to determine the current action unambiguously, the preset number may be set to an odd number such as 3, 5 or 7. If the preset number is set to an even number and several action detection results occur with the same highest count, the most recent of them is taken as the current action.
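Putting the smoothing rules together, a sliding-window majority vote with the discard rule and the even-window tie-break can be sketched as follows, using the example values of a 5-frame window and a 50% proportion:

```python
from collections import Counter, deque
from typing import Optional

WINDOW = 5            # preset number of frames (odd, per the note above)
MIN_PROPORTION = 0.5  # the example 50% proportion of the window length

class ActionSmoother:
    def __init__(self) -> None:
        self._queue: deque = deque(maxlen=WINDOW)

    def push(self, detection: str) -> Optional[str]:
        """Feed one per-frame detection result; returns the smoothed current action, if any."""
        self._queue.append(detection)
        if len(self._queue) < WINDOW:
            return None
        counts = Counter(self._queue).most_common()
        top, count = counts[0]
        if count / WINDOW <= MIN_PROPORTION:
            return None               # no result is dominant enough: discard this window
        if len(counts) > 1 and counts[1][1] == count:
            # Tie (only possible with an even window): take the most recent tied result.
            tied = {a for a, c in counts if c == count}
            return next(a for a in reversed(self._queue) if a in tied)
        return top
```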
It can be understood that, when the current action of the target object is a target action, alarm information may be sent to the service platform. Compared with sending alarm information whenever a single action detection result indicates a target action, this is more accurate and reliable and avoids sending false alarms.
Referring to fig. 4, a complete flow of the action detection method in a vehicle driving scenario is shown by way of example. As can be seen from fig. 4, step S11 is first executed: the cabin emergency call system is started when the vehicle state satisfies the starting condition of the method. Step S12 is then executed: a camera in the vehicle collects information of the occupants in the vehicle cabin, i.e., images of the occupants. Step S13 is then executed: an estimation result of the action information is obtained by action detection based on the bone key points, i.e., whether the target object has the target action corresponding to the motion direction. Step S14 is then executed: the estimation result of step S13 is verified using the face key points to obtain the action detection result. Step S15 is then executed: the plurality of detection results obtained in step S14 are smoothed. Finally, step S16 is executed: according to the smoothed result of S15, a distress signal is sent out if a dangerous action is detected and persists for a period of time.
The action detection method provided by this embodiment combines action detection based on bone key points with action verification based on face key points, and can evaluate the action information of occupants more accurately and comprehensively. Meanwhile, the smoothing algorithm and cache processing effectively handle jumping and fluctuating results, providing important reference data for taxi operating companies and traffic supervision departments so that safety schemes and operation management can be customized in a targeted manner, safeguarding the life and health of occupants in the vehicle cabin.
According to a second aspect of the embodiments of the present disclosure, there is provided an action detection apparatus, referring to fig. 5, the apparatus including:
an obtaining module 501, configured to obtain an image of a target object;
a first detection module 502, configured to perform bone key point detection on the image of the target object to obtain bone key point detection information of the target object;
a second detection module 503, configured to perform face key point detection on the image of the target object to obtain face key point detection information of the target object;
an estimation module 504, configured to obtain an estimation result of the motion information of the target object according to a geometric relationship between target key points in the skeletal key point detection information;
and a checking module 505, configured to check the estimation result according to the face key point detection information, to obtain an action detection result of the target object.
In some embodiments of the present disclosure, the estimation module is specifically configured to:
determining that the target action exists in the estimation result of the action information of the target object under the condition that the geometric relation between the target key points in the skeleton key point detection information meets a first preset condition corresponding to the target action;
the verification module is specifically configured to:
and under the condition that the estimation result is that the target action exists, responding to a target face key point corresponding to the target action in the face key point detection information, and meeting a second preset condition corresponding to the target action, and determining that the detection result of the target object is that the target action exists.
In some embodiments of the present disclosure, when determining, in response to a target face key point corresponding to the target action in the face key point detection information meeting a second preset condition corresponding to the target action, that the target action exists in the detection result of the target object, the verification module is specifically configured to:
and under the condition that the target movement is left-leaning or right-leaning, responding to the position relation among a plurality of first target face key points corresponding to the left-leaning or right-leaning of the body in the face key point detection information, and meeting a second preset condition corresponding to the left-leaning or right-leaning of the body, and determining that the movement detection result of the target object is the presence of the left-leaning movement or the right-leaning movement of the body.
In some embodiments of the present disclosure, the plurality of first target face keypoints comprises: a left eye lateral keypoint and a right eye lateral keypoint;
when the target action is body left-leaning or body right-leaning, and it is determined, in response to the position relationship between the plurality of first target face key points corresponding to body left-leaning or body right-leaning in the face key point detection information satisfying the second preset condition corresponding to body left-leaning or body right-leaning, that the action detection result of the target object is that body left-leaning or body right-leaning exists, the verification module is specifically configured to:
under the condition that the target movement is left-leaning, in response to the fact that the tangent value of an included angle between a first target vector from a right eye outer side key point to a left eye outer side key point of the target object and a standard vector which is horizontally right is a positive number, and the absolute value of the tangent value is larger than a first threshold value, determining that the movement detection result of the target object is that the body left-leaning movement exists; and/or,
and when the target movement is body right-leaning, determining that the movement detection result of the target object is body right-leaning movement in response to that the tangent value of an included angle between a first target vector from a right eye outer side key point to the left eye outer side key point of the target object and a standard vector which is horizontally right is a negative number and the absolute value of the tangent value is greater than a first threshold value.
In some embodiments of the present disclosure, when determining, in response to a target face key point corresponding to the target action in the face key point detection information meeting a second preset condition corresponding to the target action, that the target action exists in the detection result of the target object, the verification module is specifically configured to:
and under the condition that the target action is forward-leaning and forward-lying, in response to the second target face key points corresponding to forward-leaning and forward-lying being missing from the face key point detection information, determining that the action detection result of the target object is that the forward-leaning and forward-lying action exists.
In some embodiments of the disclosure, the apparatus further comprises a determining module for:
determining that the second target face key points corresponding to forward-leaning and forward-lying are missing under the condition that the number of detected face content key points is determined not to exceed a preset number according to the face key point detection information, wherein the face content key points comprise the face key points except the face edge key points; or,
determining that the second target face key points corresponding to forward-leaning and forward-lying are missing under the condition that the number of detected face key points is determined not to exceed the preset number according to the face key point detection information.
In some embodiments of the disclosure, the obtaining module is specifically configured to:
sequentially acquiring images of each frame of target object cached in a video;
the apparatus further comprises a smoothing module to:
after the estimation result is verified according to the face key point detection information to obtain the action detection result of the target object, obtaining the action detection result of the target object based on the latest preset number of frames of images of the target object;
and determining one action detection result with the largest occurrence frequency in the action detection results of the target objects obtained based on the images of the target objects with the latest preset number of frames as the current action of the target object.
In some embodiments of the present disclosure, the apparatus further comprises a caching module to:
before sequentially acquiring images of each frame of target object cached from a video, detecting/tracking the target object based on each frame of image in the video; determining a plurality of images containing the target object according to the detection result/tracking result of the target object;
adding, to a cache, each frame of image that contains the target object and includes preset key information of the target object, as the image of the target object, wherein the preset key information comprises at least one of a human face, at least part of a body, and bone key points.
In some embodiments of the present disclosure, the estimation module is specifically configured to:
detecting the movement direction of the target object based on the images of the multiple frames of target objects in the cache;
and obtaining an estimation result of the motion information of the target object according to the geometric relationship between the target key points corresponding to the motion direction in the skeleton key point detection information.
In some embodiments of the present disclosure, the image of the target object comprises an image of a target object within a vehicle cabin;
the acquisition module is specifically configured to:
and acquiring the image of the target object under the condition that the door of the vehicle is in a locked state and/or the speed of the vehicle reaches a preset speed threshold value.
In some embodiments of the present disclosure, a target module is further included for:
acquiring an image in a vehicle cabin, and detecting a plurality of objects in the image;
and determining a target object in the plurality of objects according to the position of each object in the vehicle cabin and/or the face information of each object.
In some embodiments of the present disclosure, the apparatus further comprises an alarm module for:
and sending alarm information to a service platform under the condition that the action detection result of the target object is that the target object has a target action.
With regard to the apparatus in the above embodiment, the specific manner in which each module performs its operation has been described in detail in the embodiments of the method and will not be elaborated here.
In a third aspect, at least one embodiment of the present disclosure provides an electronic device. Referring to fig. 6, which illustrates the structure of the device, the device includes a memory for storing computer instructions executable on a processor, the processor being configured to perform action detection based on the method according to any one of the first aspect when executing the computer instructions.
In a fourth aspect, at least one embodiment of the disclosure provides a computer readable storage medium having a computer program stored thereon, which when executed by a processor, performs the method of any of the first aspects.
In this disclosure, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance. The term "plurality" means two or more unless expressly limited otherwise.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This disclosure is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It will be understood that the present disclosure is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.
Claims (15)
1. A motion detection method, comprising:
acquiring an image of a target object;
carrying out bone key point detection on the image of the target object to obtain bone key point detection information of the target object;
performing face key point detection on the image of the target object to obtain face key point detection information of the target object;
obtaining an estimation result of the action information of the target object according to the geometric relationship between the target key points in the skeleton key point detection information;
and checking the estimation result according to the face key point detection information to obtain the action detection result of the target object.
2. The method according to claim 1, wherein the obtaining an estimation result of the motion information of the target object according to a geometric relationship between target key points in the bone key point detection information comprises:
determining that the target action exists in the estimation result of the action information of the target object under the condition that the geometric relation between the target key points in the skeleton key point detection information meets a first preset condition corresponding to the target action;
the verifying the estimation result according to the face key point detection information to obtain the action detection result of the target object includes:
and under the condition that the estimation result is that the target action exists, responding to a target face key point corresponding to the target action in the face key point detection information, and meeting a second preset condition corresponding to the target action, and determining that the detection result of the target object is that the target action exists.
3. The method according to claim 2, wherein the determining that the target action exists in the detection result of the target object in response to that a target face key point corresponding to the target action in the face key point detection information meets a second preset condition corresponding to the target action comprises:
and under the condition that the target movement is left-leaning or right-leaning, responding to the position relation among a plurality of first target face key points corresponding to the left-leaning or right-leaning of the body in the face key point detection information, and meeting a second preset condition corresponding to the left-leaning or right-leaning of the body, and determining that the movement detection result of the target object is the presence of the left-leaning movement or the right-leaning movement of the body.
4. The motion detection method according to claim 3, wherein the plurality of first target face key points comprise: a left eye lateral keypoint and a right eye lateral keypoint;
when the target motion is left-leaning or right-leaning, determining that the motion detection result of the target object is that the body left-leaning or the body right-leaning exists in response to a position relationship between a plurality of first target face key points corresponding to the body left-leaning or the body right-leaning in the face key point detection information and meeting a second preset condition corresponding to the body left-leaning or the body right-leaning, including:
under the condition that the target movement is left-leaning, in response to the fact that the tangent value of an included angle between a first target vector from a right eye outer side key point to a left eye outer side key point of the target object and a standard vector which is horizontally right is a positive number, and the absolute value of the tangent value is larger than a first threshold value, determining that the movement detection result of the target object is that the body left-leaning movement exists; and/or,
and when the target movement is body right-leaning, determining that the movement detection result of the target object is body right-leaning movement in response to that the tangent value of an included angle between a first target vector from a right eye outer side key point to the left eye outer side key point of the target object and a standard vector which is horizontally right is a negative number and the absolute value of the tangent value is greater than a first threshold value.
5. The method according to claim 2, wherein the determining that the target action exists in the detection result of the target object in response to that a target face key point corresponding to the target action in the face key point detection information meets a second preset condition corresponding to the target action comprises:
and under the condition that the target action is forward-leaning and forward-lying, in response to the second target face key points corresponding to forward-leaning and forward-lying being missing from the face key point detection information, determining that the action detection result of the target object is that the forward-leaning and forward-lying action exists.
6. The motion detection method according to claim 5, characterized in that the method further comprises:
determining that the second target face key points corresponding to forward-leaning and forward-lying are missing under the condition that the number of detected face content key points is determined not to exceed a preset number according to the face key point detection information, wherein the face content key points comprise the face key points except the face edge key points; or,
determining that the second target face key points corresponding to forward-leaning and forward-lying are missing under the condition that the number of detected face key points is determined not to exceed the preset number according to the face key point detection information.
7. The motion detection method according to any one of claims 1 to 6, wherein the acquiring an image of a target object includes:
sequentially acquiring images of each frame of target object cached in a video;
after the estimation result is verified according to the face key point detection information to obtain an action detection result of the target object, the method further includes:
acquiring a motion detection result of the target object obtained based on the latest preset number of frames of the images of the target object;
and determining one action detection result with the largest occurrence frequency in the action detection results of the target objects obtained based on the images of the target objects of the latest preset number of frames as the current action of the target object.
8. The motion detection method according to claim 7, wherein before sequentially acquiring the image of each frame of the target object buffered from the video, the method further comprises:
detecting/tracking a target object based on each frame image in the video; determining a plurality of images containing the target object according to the detection result/tracking result of the target object;
adding, to a cache, each frame of image that contains the target object and includes preset key information of the target object, as the image of the target object, wherein the preset key information comprises at least one of a human face, at least part of a body, and bone key points.
9. The motion detection method according to claim 8, wherein obtaining the estimation result of the motion information of the target object according to the geometric relationship between the target key points in the skeletal key point detection information comprises:
detecting the movement direction of the target object based on the images of the multiple frames of target objects in the cache;
and obtaining an estimation result of the motion information of the target object according to the geometric relationship between the target key points corresponding to the motion direction in the skeleton key point detection information.
10. The motion detection method according to any one of claims 1 to 9, wherein the image of the target object includes an image of a target object in a vehicle cabin;
the acquiring of the image of the target object includes:
and acquiring the image of the target object under the condition that the door of the vehicle is in a locked state and/or the speed of the vehicle reaches a preset speed threshold value.
11. The motion detection method according to claim 10, further comprising:
acquiring an image in a vehicle cabin, and detecting a plurality of objects in the image;
and determining a target object in the plurality of objects according to the position of each object in the vehicle cabin and/or the face information of each object.
12. The motion detection method according to any one of claims 1 to 11, further comprising:
and sending alarm information to a service platform under the condition that the action detection result of the target object is that the target object has a target action.
13. An action detection device, comprising:
the acquisition module is used for acquiring an image of a target object;
the first detection module is used for carrying out bone key point detection on the image of the target object to obtain bone key point detection information of the target object;
the second detection module is used for carrying out face key point detection on the image of the target object to obtain face key point detection information of the target object;
the estimation module is used for acquiring an estimation result of the action information of the target object according to the geometric relationship among the target key points in the skeleton key point detection information;
and the checking module is used for checking the estimation result according to the face key point detection information to obtain an action detection result of the target object.
14. An electronic device, comprising a memory for storing computer instructions executable on a processor, the processor being configured to implement the method of any one of claims 1 to 12 when executing the computer instructions.
15. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the method of any one of claims 1 to 12.
Priority Applications (3)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210346614.XA CN114842528A (en) | 2022-03-31 | 2022-03-31 | Motion detection method, motion detection device, electronic device, and storage medium |
KR1020247028441A KR20240140143A (en) | 2022-03-31 | 2022-11-28 | Motion detection method, device, electronic equipment and storage medium |
PCT/CN2022/134631 WO2023185034A1 (en) | 2022-03-31 | 2022-11-28 | Action detection method and apparatus, electronic device and storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210346614.XA CN114842528A (en) | 2022-03-31 | 2022-03-31 | Motion detection method, motion detection device, electronic device, and storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN114842528A true CN114842528A (en) | 2022-08-02 |
Family
ID=82564301
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210346614.XA Pending CN114842528A (en) | 2022-03-31 | 2022-03-31 | Motion detection method, motion detection device, electronic device, and storage medium |
Country Status (3)
Country | Link |
---|---|
KR (1) | KR20240140143A (en) |
CN (1) | CN114842528A (en) |
WO (1) | WO2023185034A1 (en) |
Family Cites Families (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107609517B (en) * | 2017-09-15 | 2020-10-30 | 华中科技大学 | Classroom behavior detection system based on computer vision |
CN109165552B (en) * | 2018-07-14 | 2021-02-26 | 深圳神目信息技术有限公司 | Gesture recognition method and system based on human body key points and memory |
CN111079554A (en) * | 2019-11-25 | 2020-04-28 | 恒安嘉新(北京)科技股份公司 | Method, device, electronic equipment and storage medium for analyzing classroom performance of students |
CN111242030A (en) * | 2020-01-13 | 2020-06-05 | 平安国际智慧城市科技股份有限公司 | Video data processing method, device, equipment and computer readable storage medium |
CN114842528A (en) * | 2022-03-31 | 2022-08-02 | 上海商汤临港智能科技有限公司 | Motion detection method, motion detection device, electronic device, and storage medium |
Family application events:

- 2022-03-31: CN application CN202210346614.XA, published as CN114842528A (status: Pending)
- 2022-11-28: KR application KR1020247028441A, published as KR20240140143A (status: Search and Examination)
- 2022-11-28: WO application PCT/CN2022/134631, published as WO2023185034A1
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2023185037A1 (en) * | 2022-03-31 | 2023-10-05 | 上海商汤智能科技有限公司 | Action detection method and apparatus, electronic device, and storage medium |
WO2023185034A1 (en) * | 2022-03-31 | 2023-10-05 | 上海商汤智能科技有限公司 | Action detection method and apparatus, electronic device and storage medium |
CN116682041A (en) * | 2023-06-06 | 2023-09-01 | 南京听说科技有限公司 | Intelligent auxiliary card for teaching and examination and control system thereof |
CN116682041B (en) * | 2023-06-06 | 2023-12-12 | 南京听说科技有限公司 | Intelligent auxiliary card for teaching and examination and control system thereof |
Also Published As
Publication number | Publication date |
---|---|
KR20240140143A (en) | 2024-09-24 |
WO2023185034A1 (en) | 2023-10-05 |
Similar Documents
Publication | Title |
---|---|
CN114842528A (en) | Motion detection method, motion detection device, electronic device, and storage medium | |
CN109614939B (en) | Method for detecting and identifying 'playing mobile phone' behavior based on human body posture estimation | |
CN114842459A (en) | Motion detection method, motion detection device, electronic device, and storage medium | |
JP2015007952A (en) | Device and method to detect movement of face to create signal, and computer readable storage medium | |
KR101839827B1 (en) | Smart monitoring system applied with recognition technic of characteristic information including face on long distance-moving object | |
JP2014204375A (en) | Image processing system, image processing apparatus, control method therefor, and program | |
JP2013192154A (en) | Monitoring device, reliability calculation program and reliability calculation method | |
KR20180096038A (en) | Crime prediction system based on moving behavior pattern | |
CN114764912A (en) | Driving behavior recognition method, device and storage medium | |
JP2022033805A (en) | Method, device, apparatus, and storage medium for identifying passenger's status in unmanned vehicle | |
US11710326B1 (en) | Systems and methods for determining likelihood of traffic incident information | |
Yang et al. | Dangerous Driving Behavior Recognition Based on Improved YoloV5 and Openpose [J] | |
Jin et al. | An intelligent multi-sensor surveillance system for elderly care | |
US20200372779A1 (en) | Terminal device, risk prediction method, and recording medium | |
CN117593792A (en) | Abnormal gesture detection method and device based on video frame | |
WO2020203302A1 (en) | Notification system, notification method, and computer program for notification | |
CN113221815A (en) | Gait identification method based on automatic detection technology of skeletal key points | |
CN105989328A (en) | Method and device for detecting use of handheld device by person | |
WO2023095196A1 (en) | Passenger monitoring device, passenger monitoring method, and non-transitory computer-readable medium | |
US20220406069A1 (en) | Processing apparatus, processing method, and non-transitory storage medium | |
KR102435581B1 (en) | Attendance check system using face recognition and attendance check method using same | |
JP2019159377A (en) | Monitoring system, server device, monitoring method, and monitoring program | |
CN113989886A (en) | Crew identity verification method based on face recognition | |
CN113553965A (en) | Person identity recognition method combining face recognition and human body recognition | |
JP6968248B1 (en) | Image analysis report system, image analysis report program and image analysis report method |
Legal Events

Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |