WO2023185037A1 - Action detection method and apparatus, electronic device, and storage medium - Google Patents

Action detection method and apparatus, electronic device, and storage medium

Info

Publication number
WO2023185037A1
WO2023185037A1 (PCT/CN2022/134872)
Authority
WO
WIPO (PCT)
Prior art keywords
target object
action
target
key point
video stream
Prior art date
Application number
PCT/CN2022/134872
Other languages
English (en)
French (fr)
Inventor
丁业峰
毛宁元
许亮
Original Assignee
上海商汤智能科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 上海商汤智能科技有限公司
Publication of WO2023185037A1

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/50Context or environment of the image
    • G06V20/59Context or environment of the image inside of a vehicle, e.g. relating to seat occupancy, driver state or inner lighting conditions
    • G06V20/597Recognising the driver's state or behaviour, e.g. attention or drowsiness
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/20Analysis of motion
    • G06T7/269Analysis of motion using gradient-based methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/46Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features
    • G06V10/462Salient features, e.g. scale invariant feature transforms [SIFT]

Definitions

  • the present disclosure relates to the field of image detection technology, and in particular, to an action detection method, device, electronic equipment and storage medium.
  • image processing technology can be used to detect the safety of the cabin environment and personnel. By collecting images and videos in the vehicle while the vehicle is driving or parked, it is possible to detect whether people in the vehicle are taking dangerous actions, thereby improving driving and riding safety.
  • human motion detection in related technologies is prone to misjudgment, resulting in poor user experience.
  • the present disclosure provides an action detection method, device, equipment and storage medium to solve deficiencies in related technologies.
  • an action detection method is provided, including: obtaining a video stream of a scene area; detecting, based on multiple image frames in the video stream, the action direction of a target object in the scene area; detecting skeletal key points of the target object in the multiple image frames; and
  • determining the action information of the target object based on the geometric relationship between those of the detected skeletal key points that correspond to the action direction (the target key points).
  • detecting the action direction of the target object in the scene area based on multiple image frames in the video stream includes: extracting features of the target object in each of the multiple image frames; determining, from those features, optical flow information of the target object in the video stream; and
  • determining, from the optical flow information of the target object in the video stream, the action direction of the target object in the scene area.
  • each action direction corresponds to a target action
  • determining the action information of the target object based on the geometric relationship between the detected target key points corresponding to the action direction among the detected skeletal key points includes: when that geometric relationship satisfies a first preset condition corresponding to the action direction, determining that the target object performs the target action corresponding to the action direction; otherwise, determining that the target object does not perform that target action.
  • the skeletal key points include: a left shoulder key point, a right shoulder key point, a left wrist key point, a right wrist key point, a left elbow key point, a right elbow key point, a left ear key point, and a right ear key point;
  • Determining the action information of the target object based on the geometric relationship between the detected target key points corresponding to the action direction among the detected skeletal key points includes at least one of the following:
  • when the action direction is left, in response to the tangent of the angle between a first target vector, from the right shoulder key point to the left shoulder key point, and a horizontal rightward standard vector being a positive number whose absolute value is greater than a first threshold, it is determined that the target object is leaning to the left;
  • when the action direction is right, in response to the tangent of the angle between the first target vector, from the right shoulder key point to the left shoulder key point, and the horizontal rightward standard vector being a negative number whose absolute value is greater than the first threshold, it is determined that the target object is leaning to the right;
  • when the action direction is up, in response to the following first situation or second situation, it is determined that the target object is covering its chest; where the first situation includes: the angle between a second target vector and a third target vector is greater than a second threshold, the left wrist key point is lower than the right shoulder key point, and the vertical distance between the left wrist key point and the right shoulder key point is greater than a third threshold, the second target vector being the vector from the left elbow key point to the left wrist key point and the third target vector being the vector from the left elbow key point to the left shoulder key point;
  • and the second situation includes: the angle between a fourth target vector and a fifth target vector is greater than the second threshold, the right wrist key point is lower than the left shoulder key point, and the vertical distance between the right wrist key point and the left shoulder key point is greater than the third threshold, the fourth target vector being the vector from the right elbow key point to the right wrist key point and the fifth target vector being the vector from the right elbow key point to the right shoulder key point; or,
  • when the action direction is down, in response to the left ear key point being lower than the left shoulder key point with a vertical distance between them greater than a fourth threshold, and/or the right ear key point being lower than the right shoulder key point with a vertical distance between them greater than the fourth threshold, it is determined that the target object is leaning forward and slumping over.
  • detecting the action direction of the target object in the scene area based on multiple image frames in the video stream includes: detecting whether the target object in the video stream is moving; and,
  • when the target object in the video stream is moving, detecting the action direction of the target object in the scene area based on multiple image frames in the video stream.
  • determining the action information of the target object based on the geometric relationship between the detected target key points corresponding to the action direction among the detected skeletal key points includes:
  • for each image frame among the multiple image frames in the video stream, determining the action information of the target object in that image frame based on the geometric relationship between the target key points, corresponding to the action direction, among the skeletal key points detected in that frame; and
  • smoothing the action information of the target object in the multiple image frames to obtain target action information of the target object determined based on the multiple image frames.
  • it also includes: caching, in real time while the video stream of the scene area is being acquired, the latest preset number of image frames of the video stream;
  • wherein smoothing the action information of the target object in the multiple image frames to obtain the target action information of the target object determined based on the multiple image frames includes:
  • smoothing the action information of the target object in the cached preset number of image frames to obtain the target action information of the target object determined based on the cached preset number of image frames.
  • caching the latest preset number of image frames in real time includes: detecting whether each image frame of the video stream contains preset key information of the target object,
  • where the preset key information includes at least one of a face, at least part of the body, and a skeletal key point; and caching in real time, among the image frames of the video stream that contain the preset key information of the target object, the latest preset number of image frames.
  • the scene area includes a vehicle-cabin scene area;
  • obtaining the video stream of the scene area includes:
  • obtaining the video stream of the scene area when the vehicle's doors are locked and/or the vehicle's speed reaches a preset speed threshold.
  • it also includes: detecting multiple objects in the video stream; and
  • determining the target object among the multiple objects according to the position of each of the multiple objects in the vehicle cabin and/or the face information of each of the multiple objects.
  • it also includes:
  • sending alarm information to a service platform when the action information of the target object indicates that the target object performs a target action.
  • an action detection device including:
  • an acquisition module, configured to obtain the video stream of the scene area;
  • a direction module configured to detect the action direction of the target object in the scene area based on multiple image frames in the video stream
  • a detection module configured to detect skeletal key points of the target object in the multiple image frames
  • a determination module configured to determine the action information of the target object based on the geometric relationship between the detected target key points corresponding to the action direction among the detected skeletal key points.
  • the direction module is specifically configured to: extract features of the target object in each of the multiple image frames; determine, from those features, optical flow information of the target object in the video stream; and
  • determine, from the optical flow information of the target object in the video stream, the action direction of the target object in the scene area.
  • each action direction corresponds to a target action
  • the determination module is specifically configured to: determine that the target object performs the target action corresponding to the action direction when the geometric relationship between the detected target key points corresponding to the action direction satisfies the first preset condition corresponding to the action direction, and otherwise determine that the target object does not perform that target action.
  • the skeletal key points include: a left shoulder key point, a right shoulder key point, a left wrist key point, a right wrist key point, a left elbow key point, a right elbow key point, a left ear key point, and a right ear key point;
  • the determination module is specifically used for at least one of the following:
  • when the action direction is left, in response to the tangent of the angle between a first target vector, from the right shoulder key point to the left shoulder key point, and a horizontal rightward standard vector being a positive number whose absolute value is greater than a first threshold, it is determined that the target object is leaning to the left;
  • when the action direction is right, in response to the tangent of the angle between the first target vector, from the right shoulder key point to the left shoulder key point, and the horizontal rightward standard vector being a negative number whose absolute value is greater than the first threshold, it is determined that the target object is leaning to the right;
  • when the action direction is up, in response to the following first situation or second situation, it is determined that the target object is covering its chest; where the first situation includes: the angle between a second target vector and a third target vector is greater than a second threshold, the left wrist key point is lower than the right shoulder key point, and the vertical distance between the left wrist key point and the right shoulder key point is greater than a third threshold, the second target vector being the vector from the left elbow key point to the left wrist key point and the third target vector being the vector from the left elbow key point to the left shoulder key point;
  • and the second situation includes: the angle between a fourth target vector and a fifth target vector is greater than the second threshold, the right wrist key point is lower than the left shoulder key point, and the vertical distance between the right wrist key point and the left shoulder key point is greater than the third threshold, the fourth target vector being the vector from the right elbow key point to the right wrist key point and the fifth target vector being the vector from the right elbow key point to the right shoulder key point; or,
  • when the action direction is down, in response to the left ear key point being lower than the left shoulder key point with a vertical distance between them greater than a fourth threshold, and/or the right ear key point being lower than the right shoulder key point with a vertical distance between them greater than the fourth threshold, it is determined that the target object is leaning forward and slumping over.
  • the direction module is specifically configured to: detect whether the target object in the video stream is moving; and,
  • when the target object in the video stream is moving, detect the action direction of the target object in the scene area based on multiple image frames in the video stream.
  • the determining module is specifically configured to:
  • for each image frame among the multiple image frames in the video stream, determine the action information of the target object in that image frame based on the geometric relationship between the target key points, corresponding to the action direction, among the skeletal key points detected in that frame; and
  • smooth the action information of the target object in the multiple image frames to obtain target action information of the target object determined based on the multiple image frames.
  • a cache module is also included, configured to: cache, in real time while the video stream of the scene area is being acquired, the latest preset number of image frames of the video stream;
  • when the determination module smooths the action information of the target object in the multiple image frames to obtain the target action information of the target object determined based on the multiple image frames, it is specifically configured to:
  • smooth the action information of the target object in the cached preset number of image frames to obtain the target action information of the target object determined based on the cached preset number of image frames.
  • the cache module is specifically configured to: detect whether each image frame of the video stream contains preset key information of the target object,
  • where the preset key information includes at least one of a face, at least part of the body, and a skeletal key point; and cache in real time, among the image frames of the video stream that contain the preset key information of the target object, the latest preset number of image frames.
  • the scene area includes a vehicle-cabin scene area;
  • the acquisition module is specifically configured to:
  • obtain the video stream of the scene area when the vehicle's doors are locked and/or the vehicle's speed reaches a preset speed threshold.
  • a target module is also included, configured to: detect multiple objects in the video stream; and
  • determine the target object among the multiple objects according to the position of each of the multiple objects in the vehicle cabin and/or the face information of each of the multiple objects.
  • an alarm module is also included, configured to:
  • send alarm information to a service platform when the action information of the target object indicates that the target object performs a target action.
  • an electronic device includes a memory and a processor,
  • where the memory is configured to store computer instructions executable on the processor,
  • and the processor is configured to implement the method described in the first aspect when executing the computer instructions.
  • a computer-readable storage medium on which a computer program is stored, and when the program is executed by a processor, the method described in the first aspect is implemented.
  • by obtaining a video stream of a scene area, the action direction of the target object in the scene area can be detected based on multiple image frames in the video stream; the skeletal key points of the target object in the image frames can then be detected; and finally the action information of the target object can be determined from the geometric relationship between those skeletal key points that correspond to the action direction (the target key points). Because the geometric relationship between target key points is comparatively objective and accurate, dangerous actions of the target object can be detected accurately; and because the action-direction detection result is introduced before the action information is judged, the geometric relationships between target key points are analysed in a targeted way, which reduces the spatial and temporal complexity of action detection, makes the detection more targeted, and further improves its accuracy. If this detection method is applied inside a vehicle, it can accurately detect whether the driver and passengers are in danger, thereby improving riding safety and the user experience.
  • Figure 1 shows a flow chart of an action detection method according to an embodiment of the present disclosure
  • Figure 2 shows a schematic structural diagram of skeletal key points according to an embodiment of the present disclosure
  • Figure 3 shows a complete flow chart of an action detection method in a vehicle driving scenario according to an embodiment of the present disclosure
  • Figure 4 shows a schematic structural diagram of an action detection device according to an embodiment of the present disclosure
  • FIG. 5 shows a schematic structural diagram of an electronic device according to an embodiment of the present disclosure.
  • although the terms first, second, third, etc. may be used in this disclosure to describe various information, the information should not be limited by these terms; these terms are only used to distinguish information of the same type from each other.
  • for example, without departing from the scope of the present disclosure, first information may also be called second information, and similarly, second information may also be called first information.
  • depending on the context, the word "if" as used herein may be interpreted as "when", "upon", or "in response to determining".
  • At least one embodiment of the present disclosure provides an action detection method. Please refer to FIG. 1, which shows the flow of the method, comprising steps S101 to S104.
  • this method can be used to perform action detection on a target object in the video stream of the scene area, for example to detect whether a target object in the video stream performs a target action, where the target action may be a dangerous action. That is to say, this method can be used to detect whether a target object in the video stream of a scene area performs a dangerous action.
  • This method can be applied to scenarios such as vehicle driving, where it can be detected whether the driver or passengers make dangerous actions. Dangerous actions can be defined in advance.
  • dangerous actions can include leaning to the left, leaning to the right, covering the chest, or leaning forward and lying down, etc.
  • the video stream may be a video recorded by the image capture device for the scene area.
  • the scene area may be a car cabin scene area.
  • the video stream in the scene area can be the video collected by the camera installed in the car cabin.
  • the video stream can be the video collected in the car cabin for the driver, or the video collected in the car cabin for the passengers.
  • the method can be executed by an electronic device such as a terminal device or a server.
  • the terminal device can be a user equipment (User Equipment, UE), a mobile device, a user terminal, a terminal, a cellular phone, a cordless phone, a personal digital assistant (Personal Digital Assistant, PDA) handheld device, a computing device, a vehicle-mounted device, a wearable device, and the like.
  • the method can be implemented by the processor calling computer readable instructions stored in the memory.
  • the method can be executed through a server, which can be a local server, a cloud server, etc. In the scene where the vehicle is driving, this method can be executed by the Artificial Intelligence Emergency Call system, which is connected to the camera in the cabin, so that the video stream of the cabin scene area collected by the camera can be obtained.
  • in step S101, the video stream of the scene area is obtained.
  • the video stream of the scene area can be a video recorded by an image acquisition device, where the image acquisition device can be a mobile phone, a camera, a webcam, or another electronic device with an image-acquisition function.
  • the target object refers to the person whose action needs to be detected in the video stream.
  • the rest of the video stream, apart from the target object, consists of the background area and other objects (which may be absent); for example,
  • the target object in a vehicle-driving scene can be the driver or a specific passenger. Therefore, after the video stream of the scene area is acquired, multiple objects in the video stream can be detected, and the target object can then be determined among them based on the position of each object in the cabin and/or the face information of each object.
  • for example, the object in the driver's seat of the cabin can be determined as the target object, that is, the driver can be determined as the target object; or an object whose face information matches pre-entered reference facial features can be determined as the target object, that is, a specific person such as the car owner or a registered driver is identified as the target object.
  • the operations on the target object in the following steps can be performed based on the target object determined in this step.
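  • as an illustration of this selection step, the following is a minimal sketch; pick_target, in_roi and face_similarity are hypothetical names (a seat-region test and a face-embedding comparison supplied by the caller), and the 0.7 similarity threshold is an assumed value, not taken from this disclosure:

```python
# Hypothetical sketch of selecting the target object among detected objects.
def pick_target(objects, driver_seat_roi, ref_embedding,
                in_roi, face_similarity, sim_threshold=0.7):
    for obj in objects:
        # Rule 1: the object sitting in the driver's seat is the target.
        if in_roi(obj['box'], driver_seat_roi):
            return obj
        # Rule 2: a face matching the pre-entered reference features
        # (car owner or registered driver) also makes the object the target.
        if face_similarity(obj['face_embedding'], ref_embedding) > sim_threshold:
            return obj
    return None  # no target object found in this frame
```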
  • it can be understood that there can be one target object or multiple target objects in the video stream.
  • each target object can be processed sequentially according to the method provided in this embodiment, or each target object can be processed simultaneously according to the method provided in this embodiment.
  • the target objects may include one or more of multiple target objects such as the driver, co-driver passenger, and rear seat passenger.
  • in one possible embodiment, when the scene area is a vehicle-cabin scene area, the starting condition of this step can be set in advance.
  • for example, the starting condition can be that the vehicle's doors are locked and/or the vehicle's speed reaches a preset speed threshold; the video stream of the scene area is then obtained when the vehicle meets these starting conditions, that is, when the doors are locked and/or the speed reaches the preset speed threshold.
  • in this way, the video stream is obtained for detection only in scenarios where the vehicle needs danger detection, which makes the detection method more targeted and saves computing power, memory, and power consumption.
  • in step S102, the action direction of the target object in the scene area is detected based on multiple image frames in the video stream.
  • the multiple image frames in the video stream may be a preset number of image frames, such as 2 frames, 4 frames, 5 frames, etc.
  • the latest preset number of image frames in the video stream can be cached in real time.
  • each image frame can be obtained from the video stream as a cache object; or image frames can be extracted from the video stream as a cache object at certain intervals; or image frames can be extracted from the video stream as a cache object according to certain caching conditions.
  • for example, it is possible to detect whether each image frame of the video stream contains preset key information of the target object, where the preset key information includes at least one of a face, at least part of the body, and a skeletal key point; the part of the body can be the left shoulder, right shoulder, left ear, right ear, and so on. The latest preset number of image frames among those frames of the video stream that contain the preset key information of the target object are then cached in real time. Because the cached image frames all contain the preset key information, the success rate of obtaining the required skeletal key points in subsequent action detection based on the cached frames is improved. Coarse-grained screening of the video stream against this key information at the caching stage reduces the time and memory consumed by detecting missing skeletal key points during action detection, which helps improve detection efficiency.
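  • a minimal sketch of such a cache, assuming a caller-supplied has_key_info predicate (wrapping a lightweight face/body/key-point detector) and an assumed "preset number" of five frames:

```python
from collections import deque

CACHE_SIZE = 5  # the preset number of image frames; 5 is an assumed value

frame_cache = deque(maxlen=CACHE_SIZE)  # oldest frames are evicted automatically

def on_new_frame(frame, has_key_info):
    # Coarse-grained screening: only frames containing the preset key
    # information (face, part of the body, or a skeletal key point) are
    # cached, so later key-point detection rarely runs on useless frames.
    if has_key_info(frame):
        frame_cache.append(frame)
```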
  • the action direction of the target object can be left, right, up, down, etc.
  • in one possible embodiment, the features of the target object in each of the multiple image frames can first be extracted;
  • for example, feature points of the target object can be extracted based on the basic principle that the brightness of the same target stays unchanged as it moves between frames;
  • the optical flow information of the target object in the video stream is then determined from the features of the target object in each of the multiple image frames, where the optical flow information characterises the motion of the target between frames; finally, the action direction of the target object in the scene area is determined from the optical flow information of the target object in the video stream.
  • for example, the Lucas-Kanade algorithm can be used to find the direction in which the target object moves from being relatively still to making an action.
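  • an illustrative sketch of this direction step (not necessarily the exact implementation of the disclosure) using OpenCV's pyramidal Lucas-Kanade tracker; the feature-detector parameters, the one-pixel stillness threshold and the four-way quantisation are assumed values:

```python
import cv2
import numpy as np

def action_direction(prev_gray, cur_gray):
    """Return 'left' | 'right' | 'up' | 'down' | None from two grayscale frames."""
    pts = cv2.goodFeaturesToTrack(prev_gray, maxCorners=100,
                                  qualityLevel=0.3, minDistance=7)
    if pts is None:
        return None
    nxt, status, _err = cv2.calcOpticalFlowPyrLK(prev_gray, cur_gray, pts, None)
    good = status.ravel() == 1
    if not good.any():
        return None
    flow = (nxt[good] - pts[good]).reshape(-1, 2)  # per-point displacement
    dx, dy = flow.mean(axis=0)                     # dominant motion vector
    if max(abs(dx), abs(dy)) < 1.0:                # treat tiny motion as stillness
        return None
    # Image y grows downward, so a positive dy means the target moved down.
    if abs(dx) >= abs(dy):
        return 'right' if dx > 0 else 'left'
    return 'down' if dy > 0 else 'up'
```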
  • in step S103, skeletal key points of the target object in the multiple image frames are detected.
  • Pre-trained neural networks can be used to process image frames to obtain the skeletal key points of the target object.
  • Bone key points can represent the joint parts of the human body's skeletal structure, and through these joint parts, the human body's skeletal structure diagram can be drawn.
  • the skeletal key points that the neural network can detect and the skeletal structure diagram drawn by the skeletal key points are shown in Figure 2.
  • the skeletal key points include a nose key point 0, left eye key point 1, right eye key point 2, left ear key point 3, right ear key point 4, left shoulder key point 5, right shoulder key point 6, left elbow key point 7, right elbow key point 8, left wrist key point 9, right wrist key point 10, left hip key point 11, right hip key point 12, left knee key point 13, right knee key point 14, left ankle key point 15, and right ankle key point 16.
  • note that the video stream is recorded with the camera facing the target object head-on or at an angle, so the target object in an image frame is a mirror image of the target object in the real scene: the left side of the target object in the real scene appears as the right side of the target object in the image frame, and vice versa. The origin of the coordinate system in an image frame can be at its top-left corner, with the positive direction of the horizontal axis (for example, the x-axis) pointing right along the horizontal edge and the positive direction of the vertical axis (for example, the y-axis) pointing down along the vertical edge.
  • detecting the skeletal key points of the target object can mean detecting the skeletal key points of the part of the target object that appears in the image frame. For example, if only the driver's upper body appears in the image frame, only the skeletal key points of the upper body are detected in this step. In other words, detecting the skeletal key points of the target object in this step may yield all of the key points shown in Figure 2, or only some of them.
  • the detected skeletal key points can be represented by coordinate positions in the image frame, and can also be marked at the corresponding positions on the image frame.
  • in step S104, the action information of the target object is determined based on the geometric relationship between the target key points, corresponding to the action direction, among the skeletal key points.
  • the action information of the target object can be the presence or absence of a target action of the target object, and the target action can be a dangerous action that needs to be detected, etc., wherein the dangerous action that needs to be detected can be set in advance.
  • Each action direction can correspond to one or more preset target actions.
  • each target action has multiple corresponding target key points, and for each target action those target key points satisfy corresponding geometric-relationship constraints. Therefore, a first preset condition can be set in advance for each target action, describing the geometric relationship that the target key points corresponding to that action must satisfy.
  • when determining the action information of the target object, for each target action in the action direction detected in step S102, it can be judged whether the target key points corresponding to that target action, among the skeletal key points of the target object detected in step S103, satisfy the first preset condition corresponding to the target action: if the condition is met, the target object performs the target action; otherwise, it does not.
  • in other words, the target key points and the first preset condition corresponding to each action direction can be set in advance. Then, when the geometric relationship between the detected target key points corresponding to the action direction satisfies the first preset condition corresponding to that action direction, it is determined that the target object performs the target action corresponding to the action direction; otherwise (that is, when the geometric relationship between the target key points does not satisfy the corresponding first preset condition), it is determined that the target object does not perform that target action.
  • in this way, the detection range of the target action is narrowed according to the action direction, which further saves energy and memory and improves detection efficiency; and the target key points are checked in a targeted way according to the action direction, which makes the action detection targeted and further improves its accuracy.
  • for example, the target actions corresponding to the four action directions left, right, up and down are pre-set as: leaning the body to the left (direction "left"), leaning the body to the right (direction "right"), covering the chest (direction "up") and leaning forward and slumping over (direction "down"), and corresponding target key points and a first preset condition are set for each target action (that is, each action direction).
  • the target key points corresponding to leaning left can be set as the right shoulder key point and the left shoulder key point.
  • the vector formed by connecting the right shoulder key point to the left shoulder key point is called the first target vector,
  • and the vector parallel to the horizontal edge of the image and pointing right is called the standard vector; the corresponding first preset condition is then set as: the tangent of the angle between the first target vector and the standard vector is a positive number whose absolute value is greater than the first threshold (for example, the first threshold is 0.4).
  • taking the right shoulder key point 6 and the left shoulder key point 5 shown in Figure 2 as an example, the first preset condition corresponding to leaning left can be expressed as tan(vec(6,5)) > 0.4. That is, when the action direction is left, in response to the tangent of the angle between the first target vector (right shoulder key point to left shoulder key point) and the horizontal rightward standard vector being a positive number whose absolute value is greater than the first threshold, it is determined that the target object is leaning to the left.
  • similarly, the target key points corresponding to leaning right can be set as the right shoulder key point and the left shoulder key point, with the vector from the right shoulder key point to the left shoulder key point as the first target vector
  • and the vector parallel to the horizontal edge of the image and pointing right as the standard vector; the corresponding first preset condition is then set as: the tangent of the angle between the first target vector and the standard vector is a negative number whose absolute value is greater than the first threshold (for example, the first threshold is 0.4). Taking the right shoulder key point 6 and the left shoulder key point 5 shown in Figure 2 as an example, the first preset condition corresponding to leaning right can be expressed as tan(vec(6,5)) < -0.4.
  • that is, when the action direction is right, in response to the tangent of the angle between the first target vector (right shoulder key point to left shoulder key point) and the horizontal rightward standard vector being a negative number whose absolute value is greater than the first threshold, it is determined that the target object is leaning to the right.
  • the target key points corresponding to covering the chest can be set as the elbow, wrist and shoulder key points: the vector from the left elbow key point to the left wrist key point is called the second target vector, the vector from the left elbow key point to the left shoulder key point is called the third target vector, the vector from the right elbow key point to the right wrist key point is called the fourth target vector, and the vector from the right elbow key point to the right shoulder key point is called the fifth target vector.
  • the cosine of the angle between the second and third target vectors is called the first cosine value, and the cosine of the angle between the fourth and fifth target vectors is called the second cosine value. The corresponding first preset condition is then set as: the first cosine value is greater than the second threshold (for example, the second threshold is 0.2) and the left wrist key point is lower than the right shoulder key point with a vertical distance greater than the third threshold (for example, the third threshold is 100); and/or the second cosine value is greater than the second threshold (for example, 0.2) and the right wrist key point is lower than the left shoulder key point with a vertical distance greater than the third threshold (for example, 100). Taking the left elbow key point 7, left wrist key point 9, left shoulder key point 5, right elbow key point 8, right wrist key point 10 and right shoulder key point 6 shown in Figure 2 as an example, the first part of the condition can be expressed as cos(vec(7,5), vec(7,9)) > 0.2 and y(9) - y(6) > 100, with the corresponding expression on key points 8, 10 and 5 for the second part.
  • that is, when the action direction is up, it is determined that the target object is covering its chest in response to the following first situation or second situation. The first situation includes: the angle between the second target vector and the third target vector is greater than the second threshold, the left wrist key point is lower than the right shoulder key point, and the vertical distance between the left wrist key point and the right shoulder key point is greater than the third threshold, where the second target vector is the vector from the left elbow key point to the left wrist key point and the third target vector is the vector from the left elbow key point to the left shoulder key point.
  • the second situation includes: the angle between the fourth target vector and the fifth target vector is greater than the second threshold, the right wrist key point is lower than the left shoulder key point, and the vertical distance between the right wrist key point and the left shoulder key point is greater than the third threshold, where the fourth target vector is the vector from the right elbow key point to the right wrist key point and the fifth target vector is the vector from the right elbow key point to the right shoulder key point.
  • the target key points corresponding to leaning forward and slumping over can be set as the ear and shoulder key points, and the corresponding first preset condition is that the vertical distance between the left ear key point and the left shoulder key point is greater than the fourth threshold (for example, the fourth threshold is 50), and/or the vertical distance between the right ear key point and the right shoulder key point is greater than the fourth threshold (for example, 50). Taking the left ear key point 3, right ear key point 4, left shoulder key point 5 and right shoulder key point 6 shown in Figure 2 as an example,
  • the first preset condition corresponding to leaning forward can be expressed as y(3) - y(5) > 50, and/or y(4) - y(6) > 50.
  • that is, when the action direction is down, in response to the left ear key point being lower than the left shoulder key point with a vertical distance greater than the fourth threshold, and/or the right ear key point being lower than the right shoulder key point with a vertical distance greater than the fourth threshold, it is determined that the target object is leaning forward and slumping over.
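  • the four rules above reduce to simple coordinate arithmetic on the key points of Figure 2. The following is a minimal sketch, assuming kp maps the Figure 2 indices to (x, y) image coordinates (y grows downward) produced by a pose network, and using the example thresholds quoted above (0.4, 0.2, 100 px, 50 px); treating a vertical shoulder line as an extreme lean is an added assumption:

```python
import math

L_EAR, R_EAR = 3, 4
L_SHOULDER, R_SHOULDER = 5, 6
L_ELBOW, R_ELBOW = 7, 8
L_WRIST, R_WRIST = 9, 10

T_TAN, T_COS, T_WRIST, T_EAR = 0.4, 0.2, 100, 50

def vec(kp, a, b):
    """Vector from key point a to key point b."""
    (ax, ay), (bx, by) = kp[a], kp[b]
    return (bx - ax, by - ay)

def cos_angle(u, v):
    """Cosine of the angle between vectors u and v."""
    dot = u[0] * v[0] + u[1] * v[1]
    return dot / (math.hypot(*u) * math.hypot(*v) + 1e-9)

def detect_action(kp, direction):
    if direction == 'left':              # tan(vec(6,5)) > 0.4
        dx, dy = vec(kp, R_SHOULDER, L_SHOULDER)
        t = dy / dx if dx else math.inf  # vertical shoulders: extreme lean
        return 'lean_left' if t > 0 and abs(t) > T_TAN else None
    if direction == 'right':             # tan(vec(6,5)) < -0.4
        dx, dy = vec(kp, R_SHOULDER, L_SHOULDER)
        t = dy / dx if dx else -math.inf
        return 'lean_right' if t < 0 and abs(t) > T_TAN else None
    if direction == 'up':                # chest covering, either arm
        # First situation: cos(vec(7,5), vec(7,9)) > 0.2 and y(9) - y(6) > 100.
        left_arm = (cos_angle(vec(kp, L_ELBOW, L_SHOULDER),
                              vec(kp, L_ELBOW, L_WRIST)) > T_COS
                    and kp[L_WRIST][1] - kp[R_SHOULDER][1] > T_WRIST)
        # Second situation: the mirrored test on the right arm.
        right_arm = (cos_angle(vec(kp, R_ELBOW, R_SHOULDER),
                               vec(kp, R_ELBOW, R_WRIST)) > T_COS
                     and kp[R_WRIST][1] - kp[L_SHOULDER][1] > T_WRIST)
        return 'cover_chest' if left_arm or right_arm else None
    if direction == 'down':              # y(3) - y(5) > 50 and/or y(4) - y(6) > 50
        if (kp[L_EAR][1] - kp[L_SHOULDER][1] > T_EAR
                or kp[R_EAR][1] - kp[R_SHOULDER][1] > T_EAR):
            return 'lean_forward'
        return None
    return None
```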
  • when the action information of the target object indicates that the target object performs a target action, alarm information may be sent to the service platform.
  • the service platform can be a service platform for vehicle operation, such as an online ride-hailing service platform.
  • the service platform can also be a medical platform.
  • on receiving the alarm, the service platform can take rescue measures for the endangered driver, improving rescue efficiency and outcomes and better protecting the lives of the people in the vehicle.
  • the action information of the target object can be determined in real time or at a certain frequency; therefore, the alarm information can be sent to the service platform only after the action information has persisted for the target object for a preset time period, which dampens fluctuation in the action-information detection results and reduces alarm information sent by mistake.
  • it can be seen from the above that, by obtaining a video stream of a scene area, the action direction of the target object in the scene area can be detected based on multiple image frames in the video stream; the skeletal key points of the target object in the image frames can then be detected; and finally the action information of the target object can be determined from the geometric relationship between those skeletal key points that correspond to the action direction (the target key points). Because the geometric relationship between target key points is comparatively objective and accurate, dangerous actions of the target object can be detected accurately; and because the target key points are checked in a targeted way according to the action direction, the action detection is more targeted and its accuracy is further improved. Applied inside a vehicle, this method can accurately detect whether the driver and passengers are in danger, thereby improving riding safety and the user experience.
  • in some embodiments of the present disclosure, step S103 shown in FIG. 1 may detect the skeletal key points of the target object in only one of the multiple image frames of the video stream used in step S102.
  • the specific image frame to be detected can be set in advance, for example the first frame, the last frame or a middle frame. For example, when the latest preset number of image frames of the video stream are cached in real time, step S103 may detect the skeletal key points of the target object in one of those latest cached frames.
  • in this case, in step S104 shown in Figure 1, the action information of the target object can be determined directly from the geometric relationship between the detected skeletal key points corresponding to the action direction (the target key points);
  • for the specific determination method, please refer to the detailed description of step S104 in the above embodiment.
  • in other embodiments of the present disclosure, step S103 shown in FIG. 1 may detect the skeletal key points of the target object in each of the multiple image frames of the video stream used in step S102. For example, when the latest preset number of image frames of the video stream are cached in real time, step S103 may detect the skeletal key points of the target object in each of those latest cached frames.
  • in this case, for each of the multiple image frames in the video stream, the action information of the target object in that image frame may first be determined from the geometric relationship between the target key points, corresponding to the action direction, among the skeletal key points detected in that frame;
  • the action information of the target object across the multiple image frames is then smoothed to obtain the target action information of the target object determined based on the multiple image frames.
  • for example, when the latest preset number of image frames of the video stream are cached in real time, the action information of the target object in the cached preset number of image frames is smoothed to obtain the target action information of the target object determined based on the cached preset number of image frames.
  • for example, the action information detected for each image frame can be pushed into a smoothing queue, and a smoothing window can be set whose size is the number of image frames considered from the video stream, such as the number of cached image frames;
  • the smoothing window is then moved as the smoothing queue is updated, and after each move the target action information of the target object determined based on the multiple image frames is derived from the several action-information results inside the window, thereby smoothing the action information and improving its validity and stability.
  • if the action information detected in each of the multiple image frames is the same, it can be determined to be the target action information of the target object determined based on the multiple image frames. For example, if the action information detected in each of five image frames is leaning left, then leaning left can be determined as the target action information of the target object based on those five frames.
  • if the action information detected in the multiple image frames is inconsistent, the action information that occurs most often is used as the target action information of the target object determined based on the multiple image frames. For example, if the action information detected in 4 out of 5 image frames is that the body is leaning left, and the action information detected in 1 image frame is that there is no target action, then leaning left can be determined as the target action information of the target object based on those 5 image frames.
  • optionally, if the most frequent action-information result does not exceed a preset proportion of the window (for example, 50%), the detection results in the smoothing window can be discarded, further improving the reliability of the action detection results.
  • optionally, the number of image frames may be set to an odd number such as 3, 5, 7, and so on; if it is set to an even number and several kinds of action information tie in count, the most recent of them is used as the target action information of the target object determined based on the multiple image frames.
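  • a minimal sketch of this smoothing step, assuming per-frame results arrive as strings (e.g. 'lean_left') or None and an assumed odd window size of five; the 50% floor follows the text above:

```python
from collections import Counter, deque

WINDOW = 5  # odd window size, e.g. the number of cached frames

results = deque(maxlen=WINDOW)  # sliding smoothing window of per-frame results

def smoothed_action(new_result):
    results.append(new_result)
    if len(results) < WINDOW:
        return None  # not enough frames yet
    label, count = Counter(results).most_common(1)[0]
    if count / WINDOW <= 0.5:
        return None  # dominant result too weak: discard this window's output
    return label     # may itself be None when "no action" dominates
```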
  • the complete flow of the method in a vehicle-driving scenario is shown in Figure 3. Step S11 is executed first, starting the cabin emergency-call function.
  • step S12 is then executed, in which the in-vehicle camera collects cabin-passenger information, that is, the video stream of the passengers in the cabin.
  • step S13 saves the most recent two or more frames to the cache; step S14 judges the action direction using the optical-flow method; step S15 performs action detection within the range of target actions corresponding to that action direction, that is, detects whether the target object performs the target action corresponding to the action direction; step S16 smooths the multiple detection results obtained in step S15; and finally, in step S17, if according to the smoothed result of step S16 a dangerous action is detected and persists for a period of time, a distress signal is sent.
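  • the steps above can be tied together in a simple loop. The sketch below builds on the earlier sketches (on_new_frame, action_direction, detect_action, smoothed_action); the read_frame, detect_keypoints and send_alarm callables are hypothetical wrappers for the camera, the pose network and the emergency-call uplink, frames are assumed grayscale, and the 3-second persistence window is an assumed value:

```python
import time

def run_cabin_monitor(read_frame, detect_keypoints, send_alarm):
    prev, danger_since = None, None
    HOLD = 3.0  # assumed persistence requirement before a distress signal
    while True:
        frame = read_frame()                              # S12: cabin camera
        on_new_frame(frame, has_key_info=lambda f: True)  # S13 (predicate assumed)
        action = None
        if prev is not None:
            direction = action_direction(prev, frame)     # S14: optical flow
            if direction is not None:
                kp = detect_keypoints(frame)              # pose network output
                action = detect_action(kp, direction)     # S15: targeted rules
        final = smoothed_action(action)                   # S16: smoothing
        if final is not None:                             # S17: persistence check
            danger_since = danger_since or time.time()
            if time.time() - danger_since >= HOLD:
                send_alarm(final)
                danger_since = None
        else:
            danger_since = None
        prev = frame
```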
  • the action detection method provided in this embodiment combines action geometric features, optical flow detection processing and some objective conditions, and can more accurately and comprehensively evaluate the current status of the passenger's action information.
  • in addition, smoothing algorithms and cache processing are used to handle jumpy and fluctuating results effectively, providing important reference data for car-rental companies and traffic-supervision departments, which can then customise safety plans and operational management in a targeted way and improve the safety of the lives and health of cabin passengers.
  • a motion detection device is provided. Please refer to FIG. 4.
  • the device includes:
  • the acquisition module 401 is used to acquire the video stream of the scene area
  • Direction module 402 configured to detect the action direction of the target object in the scene area based on multiple image frames in the video stream;
  • the detection module 403 is used to detect the skeletal key points of the target object in the multiple image frames
  • the determination module 404 is configured to determine the action information of the target object based on the geometric relationship between the detected target key points corresponding to the action direction among the detected skeletal key points.
  • the direction module is specifically configured to: extract features of the target object in each of the multiple image frames; determine, from those features, optical flow information of the target object in the video stream; and
  • determine, from the optical flow information of the target object in the video stream, the action direction of the target object in the scene area.
  • each action direction corresponds to a target action
  • the determination module is specifically configured to: determine that the target object performs the target action corresponding to the action direction when the geometric relationship between the detected target key points corresponding to the action direction satisfies the first preset condition corresponding to the action direction, and otherwise determine that the target object does not perform that target action.
  • the skeletal key points include: a left shoulder key point, a right shoulder key point, a left wrist key point, a right wrist key point, a left elbow key point, a right elbow key point, a left ear key point, and a right ear key point;
  • the determination module is specifically used for at least one of the following:
  • when the action direction is left, in response to the tangent of the angle between a first target vector, from the right shoulder key point to the left shoulder key point, and a horizontal rightward standard vector being a positive number whose absolute value is greater than a first threshold, it is determined that the target object is leaning to the left;
  • when the action direction is right, in response to the tangent of the angle between the first target vector, from the right shoulder key point to the left shoulder key point, and the horizontal rightward standard vector being a negative number whose absolute value is greater than the first threshold, it is determined that the target object is leaning to the right;
  • when the action direction is up, in response to the following first situation or second situation, it is determined that the target object is covering its chest; where the first situation includes: the angle between a second target vector and a third target vector is greater than a second threshold, the left wrist key point is lower than the right shoulder key point, and the vertical distance between the left wrist key point and the right shoulder key point is greater than a third threshold, the second target vector being the vector from the left elbow key point to the left wrist key point and the third target vector being the vector from the left elbow key point to the left shoulder key point;
  • and the second situation includes: the angle between a fourth target vector and a fifth target vector is greater than the second threshold, the right wrist key point is lower than the left shoulder key point, and the vertical distance between the right wrist key point and the left shoulder key point is greater than the third threshold, the fourth target vector being the vector from the right elbow key point to the right wrist key point and the fifth target vector being the vector from the right elbow key point to the right shoulder key point; or,
  • when the action direction is down, in response to the left ear key point being lower than the left shoulder key point with a vertical distance between them greater than a fourth threshold, and/or the right ear key point being lower than the right shoulder key point with a vertical distance between them greater than the fourth threshold, it is determined that the target object is leaning forward and slumping over.
  • the direction module is specifically configured to: detect whether the target object in the video stream is moving; and,
  • when the target object in the video stream is moving, detect the action direction of the target object in the scene area based on multiple image frames in the video stream.
  • the determining module is specifically configured to:
  • for each image frame among the multiple image frames in the video stream, determine the action information of the target object in that image frame based on the geometric relationship between the target key points, corresponding to the action direction, among the skeletal key points detected in that frame; and
  • smooth the action information of the target object in the multiple image frames to obtain target action information of the target object determined based on the multiple image frames.
  • a cache module is also included, configured to: cache, in real time while the video stream of the scene area is being acquired, the latest preset number of image frames of the video stream;
  • when the determination module smooths the action information of the target object in the multiple image frames to obtain the target action information of the target object determined based on the multiple image frames, it is specifically configured to:
  • smooth the action information of the target object in the cached preset number of image frames to obtain the target action information of the target object determined based on the cached preset number of image frames.
  • the cache module is specifically configured to: detect whether each image frame of the video stream contains preset key information of the target object,
  • where the preset key information includes at least one of a face, at least part of the body, and a skeletal key point; and cache in real time, among the image frames of the video stream that contain the preset key information of the target object, the latest preset number of image frames.
  • the scene area includes a vehicle-cabin scene area;
  • the acquisition module is specifically configured to:
  • obtain the video stream of the scene area when the vehicle's doors are locked and/or the vehicle's speed reaches a preset speed threshold.
  • a target module is also included, configured to: detect multiple objects in the video stream; and
  • determine the target object among the multiple objects according to the position of each of the multiple objects in the vehicle cabin and/or the face information of each of the multiple objects.
  • an alarm module is also included, configured to:
  • send alarm information to a service platform when the action information of the target object indicates that the target object performs a target action.
  • At least one embodiment of the present disclosure provides an electronic device. Please refer to FIG. 5, which shows the structure of the device.
  • the device includes a memory and a processor,
  • where the memory is configured to store computer instructions runnable on the processor,
  • and the processor is configured to perform action detection according to the method described in any embodiment of the first aspect when executing the computer instructions.
  • At least one embodiment of the present disclosure provides a computer-readable storage medium on which a computer program is stored. When the program is executed by a processor, the method described in any one of the first aspects is implemented.
  • first and second are used for descriptive purposes only and are not to be understood as indicating or implying relative importance.
  • plurality refers to two or more than two, unless expressly limited otherwise.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)

Abstract

An action detection method and apparatus, an electronic device, and a storage medium. The action detection method includes: obtaining a video stream of a scene area (S101); detecting, based on multiple image frames in the video stream, the action direction of a target object in the scene area (S102); detecting skeletal key points of the target object in the multiple image frames (S103); and determining action information of the target object based on the geometric relationship between those of the detected skeletal key points that correspond to the action direction, i.e., the target key points (S104).

Description

Action detection method and apparatus, electronic device, and storage medium
Cross-reference to related applications
The present disclosure claims priority to Chinese patent application No. 202210346615.4, filed on March 31, 2022, which is incorporated herein by reference.
Technical field
The present disclosure relates to the field of image detection technology, and in particular to an action detection method and apparatus, an electronic device, and a storage medium.
Background
With the continuous development of artificial-intelligence technology, more and more kinds of image and video detection have appeared, with ever-better results; in particular, applying detection technology in the field of safety protection can keep users out of danger. Taking the vehicle-cabin scenario as an example, image-processing technology can be used to check the safety of the cabin environment and its occupants: by collecting in-vehicle images and video while the vehicle is driving or parked, it can be detected whether people in the vehicle perform dangerous actions, thereby improving driving and riding safety. However, human action detection in the related art is prone to misjudgment, giving users a poor experience.
Summary
The present disclosure provides an action detection method and apparatus, a device, and a storage medium, to remedy deficiencies in the related art.
According to a first aspect of the embodiments of the present disclosure, an action detection method is provided, including:
obtaining a video stream of a scene area;
detecting, based on multiple image frames in the video stream, the action direction of a target object in the scene area;
detecting skeletal key points of the target object in the multiple image frames;
determining action information of the target object based on the geometric relationship between those of the detected skeletal key points that correspond to the action direction, i.e., the target key points.
In one embodiment, detecting the action direction of the target object in the scene area based on multiple image frames in the video stream includes:
extracting features of the target object in each of the multiple image frames;
determining, from the features of the target object in each of the multiple image frames, optical flow information of the target object in the video stream;
determining, from the optical flow information of the target object in the video stream, the action direction of the target object in the scene area.
In one embodiment, each action direction corresponds to one target action;
determining the action information of the target object based on the geometric relationship between the detected target key points corresponding to the action direction includes:
when the geometric relationship between the detected target key points corresponding to the action direction satisfies a first preset condition corresponding to the action direction, determining that the target object performs the target action corresponding to the action direction;
otherwise, determining that the target object does not perform the target action corresponding to the action direction.
In one embodiment, the skeletal key points include: a left shoulder key point, a right shoulder key point, a left wrist key point, a right wrist key point, a left elbow key point, a right elbow key point, a left ear key point, and a right ear key point;
determining the action information of the target object based on the geometric relationship between the detected target key points corresponding to the action direction includes at least one of the following:
when the action direction is left, in response to the tangent of the angle between a first target vector, from the right shoulder key point to the left shoulder key point, and a horizontal rightward standard vector being a positive number whose absolute value is greater than a first threshold, determining that the target object is leaning to the left;
when the action direction is right, in response to the tangent of the angle between the first target vector, from the right shoulder key point to the left shoulder key point, and the horizontal rightward standard vector being a negative number whose absolute value is greater than the first threshold, determining that the target object is leaning to the right;
when the action direction is up, in response to a first situation or a second situation as follows, determining that the target object is covering its chest; where the first situation includes: the angle between a second target vector and a third target vector is greater than a second threshold, the left wrist key point is lower than the right shoulder key point, and the vertical distance between the left wrist key point and the right shoulder key point is greater than a third threshold, the second target vector being the vector from the left elbow key point to the left wrist key point and the third target vector being the vector from the left elbow key point to the left shoulder key point; and the second situation includes: the angle between a fourth target vector and a fifth target vector is greater than the second threshold, the right wrist key point is lower than the left shoulder key point, and the vertical distance between the right wrist key point and the left shoulder key point is greater than the third threshold, the fourth target vector being the vector from the right elbow key point to the right wrist key point and the fifth target vector being the vector from the right elbow key point to the right shoulder key point; or,
when the action direction is down, in response to the left ear key point being lower than the left shoulder key point with a vertical distance between them greater than a fourth threshold, and/or the right ear key point being lower than the right shoulder key point with a vertical distance between them greater than the fourth threshold, determining that the target object is leaning forward and slumping over.
In one embodiment, detecting the action direction of the target object in the scene area based on multiple image frames in the video stream includes:
detecting whether the target object in the video stream is moving;
when the target object in the video stream is moving, detecting, based on multiple image frames in the video stream, the action direction of the target object in the scene area.
In one embodiment, determining the action information of the target object based on the geometric relationship between the detected target key points corresponding to the action direction includes:
for each image frame among the multiple image frames in the video stream, determining the action information of the target object in that image frame based on the geometric relationship between the target key points, corresponding to the action direction, among the skeletal key points detected in that frame;
smoothing the action information of the target object in the multiple image frames to obtain target action information of the target object determined based on the multiple image frames.
In one embodiment, the method further includes:
caching, in real time while the video stream of the scene area is being acquired, the latest preset number of image frames of the video stream;
wherein smoothing the action information of the target object in the multiple image frames to obtain the target action information of the target object determined based on the multiple image frames includes:
smoothing the action information of the target object in the cached preset number of image frames to obtain the target action information of the target object determined based on the cached preset number of image frames.
In one embodiment, caching the latest preset number of image frames in real time includes:
detecting whether each image frame of the video stream contains preset key information of the target object, where the preset key information includes at least one of a face, at least part of the body, and a skeletal key point;
caching in real time, among the image frames of the video stream containing the preset key information of the target object, the latest preset number of image frames.
In one embodiment, the scene area includes a vehicle-cabin scene area;
obtaining the video stream of the scene area includes:
obtaining the video stream of the scene area when the vehicle's doors are locked and/or the vehicle's speed reaches a preset speed threshold.
In one embodiment, the method further includes:
detecting multiple objects in the video stream;
determining the target object among the multiple objects according to the position of each of the multiple objects in the vehicle cabin and/or the face information of each of the multiple objects.
In one embodiment, the method further includes:
sending alarm information to a service platform when the action information of the target object indicates that the target object performs a target action.
According to a second aspect of the embodiments of the present disclosure, an action detection apparatus is provided, including:
an acquisition module, configured to obtain a video stream of a scene area;
a direction module, configured to detect, based on multiple image frames in the video stream, the action direction of a target object in the scene area;
a detection module, configured to detect skeletal key points of the target object in the multiple image frames;
a determination module, configured to determine action information of the target object based on the geometric relationship between those of the detected skeletal key points that correspond to the action direction, i.e., the target key points.
In one embodiment, the direction module is specifically configured to:
extract features of the target object in each of the multiple image frames;
determine, from the features of the target object in each of the multiple image frames, optical flow information of the target object in the video stream;
determine, from the optical flow information of the target object in the video stream, the action direction of the target object in the scene area.
In one embodiment, each action direction corresponds to one target action;
the determination module is specifically configured to:
determine that the target object performs the target action corresponding to the action direction when the geometric relationship between the detected target key points corresponding to the action direction satisfies a first preset condition corresponding to the action direction;
otherwise, determine that the target object does not perform the target action corresponding to the action direction.
In one embodiment, the skeletal key points include: a left shoulder key point, a right shoulder key point, a left wrist key point, a right wrist key point, a left elbow key point, a right elbow key point, a left ear key point, and a right ear key point;
the determination module is specifically configured for at least one of the following:
when the action direction is left, determining, in response to the tangent of the angle between a first target vector, from the right shoulder key point to the left shoulder key point, and a horizontal rightward standard vector being a positive number whose absolute value is greater than a first threshold, that the target object is leaning to the left;
when the action direction is right, determining, in response to the tangent of the angle between the first target vector, from the right shoulder key point to the left shoulder key point, and the horizontal rightward standard vector being a negative number whose absolute value is greater than the first threshold, that the target object is leaning to the right;
when the action direction is up, determining, in response to a first situation or a second situation as follows, that the target object is covering its chest; where the first situation includes: the angle between a second target vector and a third target vector is greater than a second threshold, the left wrist key point is lower than the right shoulder key point, and the vertical distance between the left wrist key point and the right shoulder key point is greater than a third threshold, the second target vector being the vector from the left elbow key point to the left wrist key point and the third target vector being the vector from the left elbow key point to the left shoulder key point; and the second situation includes: the angle between a fourth target vector and a fifth target vector is greater than the second threshold, the right wrist key point is lower than the left shoulder key point, and the vertical distance between the right wrist key point and the left shoulder key point is greater than the third threshold, the fourth target vector being the vector from the right elbow key point to the right wrist key point and the fifth target vector being the vector from the right elbow key point to the right shoulder key point; or,
when the action direction is down, determining, in response to the left ear key point being lower than the left shoulder key point with a vertical distance between them greater than a fourth threshold, and/or the right ear key point being lower than the right shoulder key point with a vertical distance between them greater than the fourth threshold, that the target object is leaning forward and slumping over.
In one embodiment, the direction module is specifically configured to:
detect whether the target object in the video stream is moving;
when the target object in the video stream is moving, detect, based on multiple image frames in the video stream, the action direction of the target object in the scene area.
In one embodiment, the determination module is specifically configured to:
for each image frame among the multiple image frames in the video stream, determine the action information of the target object in that image frame based on the geometric relationship between the target key points, corresponding to the action direction, among the skeletal key points detected in that frame;
smooth the action information of the target object in the multiple image frames to obtain target action information of the target object determined based on the multiple image frames.
In one embodiment, a cache module is also included, configured to:
cache, in real time while the video stream of the scene area is being acquired, the latest preset number of image frames of the video stream;
when the determination module smooths the action information of the target object in the multiple image frames to obtain the target action information of the target object determined based on the multiple image frames, it is specifically configured to:
smooth the action information of the target object in the cached preset number of image frames to obtain the target action information of the target object determined based on the cached preset number of image frames.
In one embodiment, the cache module is specifically configured to:
detect whether each image frame of the video stream contains preset key information of the target object, where the preset key information includes at least one of a face, at least part of the body, and a skeletal key point;
cache in real time, among the image frames of the video stream containing the preset key information of the target object, the latest preset number of image frames.
In one embodiment, the scene area includes a vehicle-cabin scene area;
the acquisition module is specifically configured to:
obtain the video stream of the scene area when the vehicle's doors are locked and/or the vehicle's speed reaches a preset speed threshold.
In one embodiment, a target module is also included, configured to:
detect multiple objects in the video stream;
determine the target object among the multiple objects according to the position of each of the multiple objects in the vehicle cabin and/or the face information of each of the multiple objects.
In one embodiment, an alarm module is also included, configured to:
send alarm information to a service platform when the action information of the target object indicates that the target object performs a target action.
According to a third aspect of the embodiments of the present disclosure, an electronic device is provided; the device includes a memory and a processor, the memory being configured to store computer instructions runnable on the processor, and the processor being configured to implement the method of the first aspect when executing the computer instructions.
According to a fourth aspect of the embodiments of the present disclosure, a computer-readable storage medium is provided, on which a computer program is stored; when the program is executed by a processor, the method of the first aspect is implemented.
It can be seen from the above embodiments that, by obtaining a video stream of a scene area, the action direction of the target object in the scene area can be detected based on multiple image frames in the video stream; the skeletal key points of the target object in the image frames can then be detected; and finally the action information of the target object can be determined from the geometric relationship between those skeletal key points that correspond to the action direction, i.e., the target key points. Because the geometric relationship between target key points is comparatively objective and accurate, dangerous actions of the target object can be detected accurately; and because the action-direction detection result is introduced before the action information is judged, so that the geometric relationships between target key points are analysed in a targeted way, the spatial and temporal complexity of action detection is reduced, the detection becomes more targeted, and its accuracy is further improved. Applied inside a vehicle, this method can accurately detect whether the driver and passengers are in danger, thereby improving riding safety and the user experience.
应当理解的是,以上的一般描述和后文的细节描述仅是示例性和解释性的,并不能限制本公开。
Brief Description of the Drawings
The accompanying drawings here are incorporated into and constitute a part of this specification; they illustrate embodiments consistent with the present disclosure and, together with the specification, serve to explain the principles of the present disclosure.
FIG. 1 shows a flowchart of an action detection method according to an embodiment of the present disclosure;
FIG. 2 shows a schematic diagram of skeletal key points according to an embodiment of the present disclosure;
FIG. 3 shows a complete flowchart of an action detection method in a vehicle driving scenario according to an embodiment of the present disclosure;
FIG. 4 shows a schematic structural diagram of an action detection apparatus according to an embodiment of the present disclosure;
FIG. 5 shows a schematic structural diagram of an electronic device according to an embodiment of the present disclosure.
Detailed Description
Exemplary embodiments will be described in detail here, examples of which are illustrated in the accompanying drawings. Where the following description refers to the drawings, the same numerals in different drawings denote the same or similar elements unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with the present disclosure; rather, they are merely examples of apparatuses and methods consistent with some aspects of the present disclosure as detailed in the appended claims.
The terms used in the present disclosure are for the purpose of describing particular embodiments only and are not intended to limit the present disclosure. The singular forms "a/an", "said", and "the" used in the present disclosure and the appended claims are also intended to include the plural forms, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any and all possible combinations of one or more of the associated listed items.
It should be understood that although the terms first, second, third, etc. may be used in the present disclosure to describe various pieces of information, the information should not be limited to these terms. These terms are only used to distinguish information of the same type from one another. For example, without departing from the scope of the present disclosure, first information may also be referred to as second information, and similarly, second information may also be referred to as first information. Depending on the context, the word "if" as used herein may be interpreted as "when", "upon", or "in response to determining".
In a first aspect, at least one embodiment of the present disclosure provides an action detection method. Referring to FIG. 1, which shows the flow of the method, the method includes steps S101 to S104.
The method can be used to perform action detection on a target object in a video stream of a scene area, for example to detect whether the target object in the video stream performs a target action. The target action may be a dangerous action; in other words, the method can be used to detect whether the target object in the video stream of the scene area performs a dangerous action. The method can be applied in scenarios such as vehicle driving, where it can detect whether the driver or a passenger performs a dangerous action. Dangerous actions can be predefined; exemplarily, a dangerous action may be leaning the body to the left, leaning the body to the right, clutching the chest, or slumping forward.
The video stream may be video recorded by an image capture device for the scene area. For example, in a vehicle driving scenario, the scene area may be a vehicle cabin scene area, and the video stream of the scene area may be video captured by a camera installed in the cabin, whether captured of the driver or of a passenger in the cabin.
In addition, the method may be executed by an electronic device such as a terminal device or a server. The terminal device may be user equipment (UE), a mobile device, a user terminal, a terminal, a cellular phone, a cordless phone, a personal digital assistant (PDA) handheld device, a computing device, an in-vehicle device, a wearable device, or the like, and the method may be implemented by a processor invoking computer-readable instructions stored in a memory. Alternatively, the method may be executed by a server, which may be a local server, a cloud server, or the like. In a vehicle driving scenario, the method may be executed by an intelligent emergency call system (Artificial Intelligence Emergency Call), which is connected to the camera in the cabin and can thereby acquire the video stream of the cabin scene area captured by the camera.
In step S101, a video stream of a scene area is acquired.
The video stream of the scene area may be video recorded by an image capture device, which may be an electronic device with image capture capability such as a mobile phone or a camera. The video stream of the scene area contains a target object, i.e., the person in the video stream whose actions are to be detected; the regions of the video stream other than the target object are the background region and other objects (other objects may also be absent). For example, in a vehicle driving scenario, the target object may be the driver or a specific passenger. Therefore, after acquiring the video stream of the scene area, multiple objects in the video stream can be detected, and the target object can then be determined among the multiple objects according to each object's position in the vehicle cabin and/or each object's face information. Exemplarily, the object in the driver's seat of the cabin may be determined as the target object, i.e., the driver is the target object; or the object whose face information matches pre-registered reference facial features may be determined as the target object, i.e., a specific person such as the vehicle owner or a registered driver is the target object. The operations on the target object in the following steps can be performed based on the target object determined in this step; a sketch of this selection logic is given below.
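By way of illustration only — the disclosure does not prescribe any particular implementation — the selection logic can be sketched in Python as follows. The helpers and values here are assumptions for the example: persons are dicts produced by some upstream detector, DRIVER_SEAT_ROI is a camera-specific calibration, and the similarity threshold is arbitrary.

    import numpy as np

    DRIVER_SEAT_ROI = (0, 200, 400, 720)  # (x1, y1, x2, y2); hypothetical calibration

    def box_overlap(box, roi):
        # Fraction of `box` that falls inside `roi`.
        x1, y1 = max(box[0], roi[0]), max(box[1], roi[1])
        x2, y2 = min(box[2], roi[2]), min(box[3], roi[3])
        inter = max(0, x2 - x1) * max(0, y2 - y1)
        area = (box[2] - box[0]) * (box[3] - box[1])
        return inter / area if area else 0.0

    def select_target(persons, ref_embedding=None, sim_threshold=0.6):
        # Prefer a face match against the pre-registered reference features.
        if ref_embedding is not None:
            for p in persons:
                emb = p.get("face_embedding")
                if emb is not None:
                    sim = float(np.dot(emb, ref_embedding) /
                                (np.linalg.norm(emb) * np.linalg.norm(ref_embedding)))
                    if sim > sim_threshold:
                        return p  # vehicle owner / registered driver
        # Otherwise fall back to whoever occupies the driver's seat the most.
        return max(persons, key=lambda p: box_overlap(p["box"], DRIVER_SEAT_ROI),
                   default=None)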
It can be understood that there may be one target object or multiple target objects in the video stream. When multiple target objects exist in the video stream, each target object can be processed in turn according to the method provided in this embodiment, or all of them can be processed simultaneously. For example, in video of a vehicle cabin scene area, the target objects may include one or more of the driver, the front-seat passenger, and the rear-seat passengers.
In a possible embodiment, when the scene area is a vehicle cabin scene area, a start condition can be preset for this step, for example that a door of the vehicle is in a locked state and/or that the vehicle speed reaches a preset speed threshold; the video stream of the scene area is then acquired only when the vehicle satisfies the start condition, i.e., when a door of the vehicle is locked and/or the vehicle speed reaches the preset speed threshold. In this way, the video stream is acquired for detection only in scenarios where the vehicle needs danger detection, which makes the detection method targeted and saves computing power, memory, and energy.
In step S102, the action direction of the target object in the scene area is detected based on multiple image frames in the video stream.
The multiple image frames in the video stream may be a preset number of image frames, for example 2, 4, or 5 frames. Exemplarily, while the video stream of the scene area is being acquired, the latest preset number of image frames in the video stream can be cached in real time. For example, every image frame of the video stream may be taken as a caching candidate; or image frames may be sampled from the video stream at a certain interval; or image frames may be selected from the video stream according to a caching condition. For example, each image frame of the video stream may be checked for preset key information of the target object, where the preset key information includes at least one of a face, at least part of a body, and skeletal key points, and the body part may be the left shoulder, right shoulder, left ear, right ear, etc.; the latest preset number of image frames, among those image frames of the video stream that contain the preset key information of the target object, are then cached in real time. Since the cached image frames contain the preset key information, the success rate of obtaining the required skeletal key points when action detection is subsequently performed on the cached frames is improved. Performing this coarse-grained filtering of the video stream based on the key information at the caching stage reduces the time, memory, and other resources that would otherwise be spent during action detection on frames whose key skeletal key points are missing, which helps improve the efficiency of action detection. A minimal caching sketch follows.
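A minimal caching sketch, assuming a hypothetical has_key_info detector standing in for whatever face/body/key-point check the implementation uses:

    from collections import deque

    PRESET_COUNT = 5  # example value; the text mentions e.g. 2, 4 or 5 frames

    frame_cache = deque(maxlen=PRESET_COUNT)  # oldest frames are evicted automatically

    def cache_frame(frame, has_key_info):
        # Admit the frame only when the preset key information (a face, part
        # of a body, or skeletal key points) is detected in it.
        if has_key_info(frame):
            frame_cache.append(frame)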
The action direction of the target object may be left, right, up, down, etc. In a possible embodiment, features of the target object can first be extracted in each of the multiple image frames; exemplarily, feature points of the target object can be extracted based on the basic principle that the brightness of the same target stays constant as it moves between frames. Then, optical flow information of the target object in the video stream can be determined according to the features of the target object in each of the multiple image frames, where the optical flow information characterizes the motion of the target between frames. Finally, the action direction of the target object in the scene area can be determined according to the optical flow information of the target object in the video stream. Exemplarily, the Lucas-Kanade algorithm can be used to find the direction in which the target object moves when it goes from relative stillness to performing an action.
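The following sketch illustrates this direction estimation with OpenCV's sparse Lucas-Kanade implementation. It is an assumption-laden example rather than the disclosed implementation; in particular, cropping to the target object is left out, and the 2-pixel motion threshold is a choice made for the sketch.

    import cv2
    import numpy as np

    def estimate_action_direction(prev_frame, frame, min_motion=2.0):
        # Returns 'left', 'right', 'up', 'down', or None when the tracked
        # points move less than `min_motion` pixels on average.
        prev_gray = cv2.cvtColor(prev_frame, cv2.COLOR_BGR2GRAY)
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)

        # Corner features to track; in practice the frames would first be
        # cropped to the target object's region.
        p0 = cv2.goodFeaturesToTrack(prev_gray, maxCorners=100,
                                     qualityLevel=0.3, minDistance=7)
        if p0 is None:
            return None

        p1, status, _err = cv2.calcOpticalFlowPyrLK(prev_gray, gray, p0, None)
        good = status.reshape(-1) == 1
        if not good.any():
            return None

        dx, dy = (p1 - p0).reshape(-1, 2)[good].mean(axis=0)
        if max(abs(dx), abs(dy)) < min_motion:
            return None  # target is relatively still

        # Image y grows downward, so positive dy means downward motion.
        if abs(dx) >= abs(dy):
            return "right" if dx > 0 else "left"
        return "down" if dy > 0 else "up"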
Optionally, a start condition can be set for this step so that it is executed in a targeted way. Exemplarily, while the video stream of the scene area is being acquired, whether the target object in the video stream is moving can be detected; only when the target object in the video stream is moving is this step executed, i.e., the action direction of the target object in the scene area is detected based on the multiple image frames in the video stream. This improves the effectiveness and pertinence of this step.
In step S103, skeletal key points of the target object in the multiple image frames are detected.
A pre-trained neural network can be used to process the image frames to obtain the skeletal key points of the target object. Skeletal key points represent the joint parts of the human skeletal structure, from which a skeleton diagram of the human body can be drawn. For example, the skeletal key points that the neural network can detect, and the skeleton diagram drawn from them, are shown in FIG. 2. As can be seen from FIG. 2, the skeletal key points include nose key point 0, left eye key point 1, right eye key point 2, left ear key point 3, right ear key point 4, left shoulder key point 5, right shoulder key point 6, left elbow key point 7, right elbow key point 8, left wrist key point 9, right wrist key point 10, left hip key point 11, right hip key point 12, left knee key point 13, right knee key point 14, left ankle key point 15, and right ankle key point 16. Note that the video stream is recorded with the camera facing the target object directly or at an angle, so the target object in an image frame mirrors the target object in the real scene: the target object's left side in the real scene is its right side in the image frame, and its right side in the real scene is its left side in the image frame. The origin of the coordinate system within an image frame can be at its upper-left corner, with the positive horizontal axis (e.g., the x-axis) running rightward along the top edge and the positive vertical axis (e.g., the y-axis) running downward along the side edge.
When detecting the skeletal key points of the target object, the key points belonging to the part of the target object that appears in the image frame can be detected; for example, if the driver's upper body appears in the image frame, only the upper-body skeletal key points are detected in this step. In other words, the skeletal key point detection for the target object in this step may yield all of the key points shown in FIG. 2, or only some of them.
The detected skeletal key points can be represented by their coordinate positions in the image frame, and the skeletal key points can also be marked at the corresponding positions on the image frame. The index convention is made concrete in the sketch below.
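To make the index convention concrete, the FIG. 2 numbering can be written down as a small lookup table. This Python sketch is illustrative only, and the sample coordinates are invented:

    # FIG. 2 key point indices.
    KEYPOINT_INDEX = {
        "nose": 0, "left_eye": 1, "right_eye": 2, "left_ear": 3, "right_ear": 4,
        "left_shoulder": 5, "right_shoulder": 6, "left_elbow": 7, "right_elbow": 8,
        "left_wrist": 9, "right_wrist": 10, "left_hip": 11, "right_hip": 12,
        "left_knee": 13, "right_knee": 14, "left_ankle": 15, "right_ankle": 16,
    }

    # A detected pose can then be a dict mapping an index to the key point's
    # (x, y) pixel coordinates; indices whose body parts lie outside the
    # frame are simply absent.
    pose = {5: (310, 208), 6: (198, 210), 3: (292, 120), 4: (214, 118)}

The rule sketches below use this pose representation.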
In step S104, the action information of the target object is determined based on the geometric relationship between the target key points, among the skeletal key points, corresponding to the action direction.
The action information of the target object may be that the target object performs, or does not perform, a target action. The target action may be a dangerous action to be detected, which can be preset. Each action direction may correspond to one or more preset target actions, each target action has multiple corresponding target key points, and in each target action the corresponding target key points satisfy a corresponding geometric constraint. Therefore, a first preset condition can be preset for each target action based on the geometric relationship satisfied by the target key points of that action. When determining the action information of the target object, for each target action within the action direction detected in step S102, it can then be judged whether, among the skeletal key points of the target object detected in step S103, the target key points corresponding to that target action satisfy its first preset condition; if so, the target object performs that target action, otherwise the target object does not perform it.
Exemplarily, when each action direction corresponds to one target action, the target key points and the first preset condition corresponding to that action direction can be set. Then, when determining the action information of the target object: if the geometric relationship between the target key points corresponding to the action direction among the detected skeletal key points satisfies the first preset condition corresponding to the action direction, it is determined that the target object performs the target action corresponding to the action direction; otherwise (i.e., the geometric relationship between the target key points does not satisfy the corresponding first preset condition), it is determined that the target object does not perform that target action. Narrowing the range of target actions to detect according to the action direction further saves energy and memory and improves detection efficiency; checking the target key points specifically for the detected action direction also makes the action detection targeted, further improving its accuracy.
In a possible embodiment, the target actions corresponding to the four action directions left, right, up, and down are preset as leaning left (direction "left"), leaning right (direction "right"), clutching the chest (direction "up"), and slumping forward (direction "down"), and the corresponding target key points and first preset condition are set for each target action (i.e., each action direction).
The target key points for leaning left can be set to the right shoulder key point and the left shoulder key point. The vector formed by the line from the right shoulder key point to the left shoulder key point is called the first target vector, and the horizontally rightward vector (i.e., parallel to the horizontal edge of the image under detection and pointing right) is called the standard vector. The corresponding first preset condition is then set as: the tangent of the angle between the target vector and the standard vector is positive, and its absolute value is greater than a first threshold (e.g., 0.4). Taking right shoulder key point 6 and left shoulder key point 5 in FIG. 2 as an example, the first preset condition for leaning left can be expressed as tan(vec(6,5)) > 0.4. That is, when the action direction is left, in response to the tangent of the angle between the first target vector, from the right shoulder key point to the left shoulder key point, and the horizontally rightward standard vector being positive with an absolute value greater than the first threshold, it is determined that the target object performs a body-leaning-left action.
The target key points for leaning right can likewise be set to the right shoulder key point and the left shoulder key point, with the vector from the right shoulder key point to the left shoulder key point as the target vector and the horizontally rightward vector (i.e., parallel to the horizontal edge of the image under detection and pointing right) as the standard vector. The corresponding first preset condition is then set as: the tangent of the angle between the target vector and the standard vector is negative, and its absolute value is greater than the first threshold (e.g., 0.4). Taking right shoulder key point 6 and left shoulder key point 5 in FIG. 2 as an example, the first preset condition for leaning right can be expressed as tan(vec(6,5)) < -0.4. That is, when the action direction is right, in response to the tangent of the angle between the first target vector, from the right shoulder key point to the left shoulder key point, and the horizontally rightward standard vector being negative with an absolute value greater than the first threshold, it is determined that the target object performs a body-leaning-right action. A sketch of the two leaning rules follows.
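A sketch of the two leaning rules under the pose representation above; the 0.4 threshold follows the example values in the text, and the division-by-zero guard is an addition made for the sketch:

    def detect_lean(pose, first_threshold=0.4):
        # Check tan(vec(6, 5)) against the first threshold; returns
        # 'lean_left', 'lean_right', or None.
        if 5 not in pose or 6 not in pose:
            return None
        (x6, y6), (x5, y5) = pose[6], pose[5]  # right shoulder -> left shoulder
        if x5 == x6:
            return None                        # shoulders vertically aligned
        tan_val = (y5 - y6) / (x5 - x6)        # angle to the horizontal-right vector
        if tan_val > first_threshold:
            return "lean_left"
        if tan_val < -first_threshold:
            return "lean_right"
        return None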
The target key points for clutching the chest can be set to the left elbow, left wrist, left shoulder, right elbow, right wrist, and right shoulder key points. The vector from the left elbow key point to the left wrist key point is called the second target vector; the vector from the left elbow key point to the left shoulder key point, the third target vector; the vector from the right elbow key point to the right wrist key point, the fourth target vector; and the vector from the right elbow key point to the right shoulder key point, the fifth target vector. The cosine of the angle between the second and third target vectors is called the first cosine value, and the cosine of the angle between the fourth and fifth target vectors is called the second cosine value. The corresponding first preset condition is then set as: the first cosine value is greater than a second threshold (e.g., 0.2) and the vertical distance between the left wrist key point and the right shoulder key point is greater than a third threshold (e.g., 100); and/or the second cosine value is greater than the second threshold (e.g., 0.2) and the vertical distance between the right wrist key point and the left shoulder key point is greater than the third threshold (e.g., 100). Taking left elbow key point 7, left wrist key point 9, left shoulder key point 5, right elbow key point 8, right wrist key point 10, and right shoulder key point 6 in FIG. 2 as an example, the first preset condition for clutching the chest can be expressed as cos(vec(7,5), vec(7,9)) > 0.2 and y(9) - y(6) > 100, and/or as cos(vec(8,6), vec(8,10)) > 0.2 and y(10) - y(5) > 100. That is, when the action direction is up, it is determined that the target object performs a chest-clutching action in response to the following first case or second case. The first case includes: the angle between the second and third target vectors satisfying the second-threshold condition, the left wrist key point being lower than the right shoulder key point, and the vertical distance between the left wrist key point and the right shoulder key point being greater than the third threshold, where the second target vector is the vector from the left elbow key point to the left wrist key point and the third target vector is the vector from the left elbow key point to the left shoulder key point. The second case includes: the angle between the fourth and fifth target vectors satisfying the second-threshold condition, the right wrist key point being lower than the left shoulder key point, and the vertical distance between the right wrist key point and the left shoulder key point being greater than the third threshold, where the fourth target vector is the vector from the right elbow key point to the right wrist key point and the fifth target vector is the vector from the right elbow key point to the right shoulder key point. A sketch of this rule follows.
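A sketch of this rule (thresholds 0.2 and 100 pixels taken from the example values above; recall that y grows downward, so "lower" means a larger y):

    import numpy as np

    def detect_chest_clutch(pose, second_threshold=0.2, third_threshold=100):
        # cos(vec(7,5), vec(7,9)) > 0.2 with y(9) - y(6) > 100, and/or
        # cos(vec(8,6), vec(8,10)) > 0.2 with y(10) - y(5) > 100.
        def cos_angle(a, b, c):
            # Cosine of the angle at key point `a` between rays a->b and a->c.
            v1 = np.subtract(pose[b], pose[a]).astype(float)
            v2 = np.subtract(pose[c], pose[a]).astype(float)
            denom = np.linalg.norm(v1) * np.linalg.norm(v2)
            return float(np.dot(v1, v2) / denom) if denom else -1.0

        left_arm = (all(i in pose for i in (5, 6, 7, 9))
                    and cos_angle(7, 5, 9) > second_threshold
                    and pose[9][1] - pose[6][1] > third_threshold)   # left wrist below right shoulder
        right_arm = (all(i in pose for i in (5, 6, 8, 10))
                     and cos_angle(8, 6, 10) > second_threshold
                     and pose[10][1] - pose[5][1] > third_threshold)  # right wrist below left shoulder
        return left_arm or right_arm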
The target key points for slumping forward can be set to the left ear, left shoulder, right ear, and right shoulder key points, and the corresponding first preset condition can be set as: the vertical distance between the left ear key point and the left shoulder key point is greater than a fourth threshold (e.g., 50), and/or the vertical distance between the right ear key point and the right shoulder key point is greater than the fourth threshold (e.g., 50). Taking left ear key point 3, left shoulder key point 5, right ear key point 4, and right shoulder key point 6 in FIG. 2 as an example, the first preset condition for slumping forward can be expressed as y(3) - y(5) > 50, and/or y(4) - y(6) > 50. That is, when the action direction is down, in response to the left ear key point being lower than the left shoulder key point with the vertical distance between the left ear key point and the left shoulder key point greater than the fourth threshold, and/or in response to the right ear key point being lower than the right shoulder key point with the vertical distance between the right ear key point and the right shoulder key point greater than the fourth threshold, it is determined that the target object performs a forward-slumping action. A sketch of this rule follows.
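A sketch of this rule, with the 50-pixel fourth threshold from the example values above:

    def detect_slump_forward(pose, fourth_threshold=50):
        # y(3) - y(5) > 50 and/or y(4) - y(6) > 50: an ear key point lies
        # below the same-side shoulder by more than the fourth threshold.
        for ear, shoulder in ((3, 5), (4, 6)):
            if ear in pose and shoulder in pose \
                    and pose[ear][1] - pose[shoulder][1] > fourth_threshold:
                return True
        return False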
It can be understood that, when the action information of the target object indicates that the target object performs a target action, alarm information can be sent to a service platform. For example, if the target action is a dangerous action to be detected, alarm information can be sent to the service platform when the target object performs the dangerous action. In a vehicle driving scenario, the service platform may be a vehicle operation service platform, such as a ride-hailing service platform, or a medical platform; after receiving the alarm information, the service platform can take rescue measures, thereby improving the rescue efficiency for a driver in danger, improving the rescue outcome, and better protecting the lives of the people in the vehicle. It can also be understood that, as the video stream of the scene area is recorded, the action information of the target object can be determined in real time or at a certain frequency; therefore, the alarm information can be sent to the service platform only after the action information has continuously indicated, for a preset duration, that the target object performs the target action, which reduces false alarms caused by fluctuations in the detection results. A minimal debouncing sketch follows.
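A minimal debouncing sketch for this persistence check; the 3-second hold time is an assumption, not a value from the disclosure:

    import time

    class AlarmDebouncer:
        # Report an alarm only after the dangerous-action result has
        # persisted for `hold_seconds`, filtering out brief fluctuations.
        def __init__(self, hold_seconds=3.0):
            self.hold_seconds = hold_seconds
            self._since = None

        def update(self, dangerous, now=None):
            # Returns True exactly when the alarm should be sent.
            now = time.monotonic() if now is None else now
            if not dangerous:
                self._since = None
                return False
            if self._since is None:
                self._since = now
            if now - self._since >= self.hold_seconds:
                self._since = None  # re-arm so the alarm is not re-sent every frame
                return True
            return False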
As can be seen from the above embodiments, by acquiring a video stream of a scene area, the action direction of a target object in the scene area can be detected based on multiple image frames in the video stream; the skeletal key points of the target object in the image frames can then be detected; and finally, the action information of the target object can be determined based on the geometric relationship between the target key points, among the skeletal key points, corresponding to the action direction. Since the geometric relationships between target key points are objective and accurate, whether the target object performs a dangerous action can be detected accurately; moreover, the target key points are checked specifically according to the action direction, which makes the action detection targeted and further improves its accuracy. When this detection method is applied inside a vehicle, whether the driver and passengers in the vehicle are in danger can be detected accurately, thereby improving riding safety and user experience.
In some embodiments of the present disclosure, step S103 shown in FIG. 1 may detect the skeletal key points of the target object in one of the multiple image frames of the video stream, i.e., in one of the multiple image frames used in step S102. Which frame to use can be preset, for example the first frame, the last frame, or a middle frame. Exemplarily, when the latest preset number of image frames of the video stream are cached in real time, step S103 may detect the skeletal key points of the target object in one of those latest cached image frames.
Based on this, in step S104 shown in FIG. 1, the action information of the target object can be determined directly based on the geometric relationship between the target key points corresponding to the action direction among the detected skeletal key points; for the specific determination manner, refer to the detailed description of step S104 in the above embodiment.
In other embodiments of the present disclosure, step S103 shown in FIG. 1 may detect the skeletal key points of the target object in each of the multiple image frames of the video stream, i.e., in each of the multiple image frames used in step S102. Exemplarily, when the latest preset number of image frames of the video stream are cached in real time, step S103 may detect the skeletal key points of the target object in each of those cached image frames.
Based on this, in step S104 shown in FIG. 1, the action information of the target object in each of the multiple image frames can first be determined based on the geometric relationship between the target key points corresponding to the action direction among the skeletal key points detected in that image frame; the action information of the target object across the multiple image frames is then smoothed to obtain the target action information of the target object determined based on the multiple image frames. Exemplarily, when the latest preset number of image frames are cached in real time, the action information of the target object in the cached preset number of image frames is smoothed to obtain the target action information of the target object determined based on those cached frames.
The action information detected for each image frame can be pushed into a smoothing queue, and a smoothing window can be set whose size equals the number of the multiple image frames, for example the number of cached frames. The smoothing window moves as the queue is updated, and after each move the target action information determined based on the multiple image frames is derived from the multiple pieces of action information within the window, thereby smoothing the action information and improving its validity and stability.
If the action information detected in every one of the multiple image frames is the same, it can be taken as the target action information of the target object determined based on those frames. For example, if the action information detected in each of 5 image frames is a body-leaning-left action, the body-leaning-left action is determined as the target action information based on those 5 frames.
If the action information detected across the multiple image frames is inconsistent, the action information with the largest count is taken as the target action information determined based on the multiple image frames. For example, if 4 of 5 image frames yield a body-leaning-left action and 1 frame yields no target action, the body-leaning-left action is determined as the target action information based on those 5 frames. Optionally, if the most frequent detection result within the smoothing window does not exceed a preset proportion (e.g., 50%), the results of that window can be discarded to further improve the reliability of the detection results.
Note that, to accurately determine the target action information based on the multiple image frames, the number of image frames can be set to an odd number such as 3, 5, or 7. If an even number is used and several kinds of action information tie in count, the most recent action information among them is taken as the target action information determined based on the multiple image frames. A sliding-window sketch of this smoothing step follows.
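A sliding-window majority-vote sketch of this smoothing step (the window size and the 50% discard ratio follow the examples above):

    from collections import Counter, deque

    class ActionSmoother:
        def __init__(self, window_size=5, min_ratio=0.5):
            # An odd window size (3, 5, 7, ...) avoids ties, as noted above.
            self.window = deque(maxlen=window_size)
            self.min_ratio = min_ratio

        def update(self, frame_action):
            # Push one per-frame result (e.g. 'lean_left' or None) and return
            # the smoothed target action, or None when no result is reliable.
            self.window.append(frame_action)
            if len(self.window) < self.window.maxlen:
                return None  # window not yet full
            action, num = Counter(self.window).most_common(1)[0]
            if num / len(self.window) <= self.min_ratio:
                return None  # discard an unreliable window
            return action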
Referring to FIG. 3, which exemplarily shows the complete flow of the action detection method in a vehicle driving scenario: first, step S11 is executed to start the cabin emergency call function when the vehicle state satisfies the start conditions of the method; then step S12, in which the camera in the vehicle captures cabin occupant information, i.e., a video stream of the occupants in the cabin; then step S13, in which more than two of the most recent image frames are saved to the cache; then step S14, in which the optical flow method is used to judge the action direction; then step S15, in which action detection is performed within the space corresponding to the action direction (i.e., the range of target actions corresponding to that action direction), that is, whether the target object performs the target action corresponding to the action direction is detected; then step S16, in which the multiple detection results obtained in step S15 are smoothed; and finally step S17, in which, according to the smoothed result of step S16, a distress signal is sent if a dangerous action is detected and persists for a period of time.
The action detection method provided in this embodiment combines geometric action features, optical flow detection, and certain objective conditions, so the current state of a passenger's actions can be assessed more precisely and comprehensively. Meanwhile, the smoothing algorithm and cache processing effectively handle jumping and fluctuating results, providing important reference data for car rental companies and traffic regulators, enabling targeted safety plans and operational management, and improving the life and health safety of cabin occupants.
According to a second aspect of the embodiments of the present disclosure, an action detection apparatus is provided. Referring to FIG. 4, the apparatus includes:
an acquisition module 401, configured to acquire a video stream of a scene area;
a direction module 402, configured to detect, based on multiple image frames in the video stream, an action direction of a target object in the scene area;
a detection module 403, configured to detect skeletal key points of the target object in the multiple image frames;
a determination module 404, configured to determine action information of the target object based on a geometric relationship between target key points corresponding to the action direction among the detected skeletal key points.
In some embodiments of the present disclosure, the direction module is specifically configured to:
extract features of the target object in each of the multiple image frames;
determine optical flow information of the target object in the video stream according to the features of the target object in each of the multiple image frames;
determine the action direction of the target object in the scene area according to the optical flow information of the target object in the video stream.
In some embodiments of the present disclosure, each action direction corresponds to one target action;
the determination module is specifically configured to:
determine that the target object performs the target action corresponding to the action direction when the geometric relationship between the target key points corresponding to the action direction among the detected skeletal key points satisfies a first preset condition corresponding to the action direction;
otherwise, determine that the target object does not perform the target action corresponding to the action direction.
In some embodiments of the present disclosure, the skeletal key points include: a left shoulder key point, a right shoulder key point, a left wrist key point, a right wrist key point, a left elbow key point, a right elbow key point, a left ear key point, and a right ear key point;
the determination module is specifically configured to perform at least one of the following:
when the action direction is left, determine that the target object performs a body-leaning-left action in response to the tangent of the angle between a first target vector, from the right shoulder key point to the left shoulder key point, and a standard horizontally rightward vector being positive with an absolute value greater than a first threshold;
when the action direction is right, determine that the target object performs a body-leaning-right action in response to the tangent of the angle between the first target vector, from the right shoulder key point to the left shoulder key point, and the standard horizontally rightward vector being negative with an absolute value greater than the first threshold;
when the action direction is up, determine that the target object performs a chest-clutching action in response to the following first case or second case, where the first case includes: the included angle between a second target vector and a third target vector is greater than a second threshold, the left wrist key point is lower than the right shoulder key point, and the vertical distance between the left wrist key point and the right shoulder key point is greater than a third threshold, the second target vector being the vector from the left elbow key point to the left wrist key point and the third target vector being the vector from the left elbow key point to the left shoulder key point; and the second case includes: the included angle between a fourth target vector and a fifth target vector is greater than the second threshold, the right wrist key point is lower than the left shoulder key point, and the vertical distance between the right wrist key point and the left shoulder key point is greater than the third threshold, the fourth target vector being the vector from the right elbow key point to the right wrist key point and the fifth target vector being the vector from the right elbow key point to the right shoulder key point; or,
when the action direction is down, determine that the target object performs a forward-slumping action in response to the left ear key point being lower than the left shoulder key point with the vertical distance between the left ear key point and the left shoulder key point greater than a fourth threshold, and/or in response to the right ear key point being lower than the right shoulder key point with the vertical distance between the right ear key point and the right shoulder key point greater than the fourth threshold.
In some embodiments of the present disclosure, the direction module is specifically configured to:
detect whether the target object in the video stream is moving;
when the target object in the video stream is moving, detect the action direction of the target object in the scene area based on the multiple image frames in the video stream.
In some embodiments of the present disclosure, the determination module is specifically configured to:
for each of the multiple image frames in the video stream, determine the action information of the target object in that image frame based on the geometric relationship between the target key points corresponding to the action direction among the skeletal key points detected in that image frame;
smooth the action information of the target object across the multiple image frames to obtain target action information of the target object determined based on the multiple image frames.
In some embodiments of the present disclosure, the apparatus further includes a caching module, configured to:
while the video stream of the scene area is being acquired, cache in real time the latest preset number of image frames in the video stream;
when smoothing the action information of the target object across the multiple image frames to obtain the target action information of the target object determined based on the multiple image frames, the determination module is specifically configured to:
smooth the action information of the target object across the cached preset number of image frames to obtain the target action information of the target object determined based on the cached preset number of image frames.
In some embodiments of the present disclosure, the caching module is specifically configured to:
detect whether preset key information of the target object is present in each image frame of the video stream, the preset key information including at least one of a face, at least part of a body, and skeletal key points;
cache in real time the latest preset number of image frames among those image frames of the video stream that contain the preset key information of the target object.
In some embodiments of the present disclosure, the scene area includes a vehicle cabin scene area;
the acquisition module is specifically configured to:
acquire the video stream of the scene area when a door of the vehicle is in a locked state and/or the speed of the vehicle reaches a preset speed threshold.
In some embodiments of the present disclosure, the apparatus further includes a target module, configured to:
detect multiple objects in the video stream;
determine the target object among the multiple objects according to the position of each of the multiple objects in the vehicle cabin and/or the face information of each of the multiple objects.
In some embodiments of the present disclosure, the apparatus further includes an alarm module, configured to:
send alarm information to a service platform when the action information of the target object indicates that the target object performs a target action.
Regarding the apparatus in the above embodiments, the specific manner in which each module performs its operations has been described in detail in the embodiments of the method in the first aspect, and will not be elaborated here.
In a third aspect, at least one embodiment of the present disclosure provides a device. Referring to FIG. 5, which shows the structure of the device, the device includes a memory and a processor, the memory being configured to store computer instructions executable on the processor, and the processor being configured to perform action detection based on the method of any one of the first aspect when executing the computer instructions.
In a fourth aspect, at least one embodiment of the present disclosure provides a computer-readable storage medium on which a computer program is stored, the program, when executed by a processor, implementing the method of any one of the first aspect.
In the present disclosure, the terms "first" and "second" are used for descriptive purposes only and shall not be understood as indicating or implying relative importance. The term "multiple" means two or more, unless expressly defined otherwise.
Other embodiments of the present disclosure will readily occur to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. The present disclosure is intended to cover any variations, uses, or adaptations that follow its general principles and include common knowledge or customary technical means in the art not disclosed herein. The specification and embodiments are to be regarded as exemplary only, with the true scope and spirit of the present disclosure being indicated by the following claims.
It should be understood that the present disclosure is not limited to the precise structures described above and shown in the accompanying drawings, and that various modifications and changes may be made without departing from its scope. The scope of the present disclosure is limited only by the appended claims.

Claims (14)

  1. An action detection method, comprising:
    acquiring a video stream of a scene area;
    detecting, based on multiple image frames in the video stream, an action direction of a target object in the scene area;
    detecting skeletal key points of the target object in the multiple image frames;
    determining action information of the target object based on a geometric relationship between target key points corresponding to the action direction among the detected skeletal key points.
  2. The action detection method according to claim 1, wherein detecting, based on the multiple image frames in the video stream, the action direction of the target object in the scene area comprises:
    extracting features of the target object in each of the multiple image frames;
    determining optical flow information of the target object in the video stream according to the features of the target object in each of the multiple image frames;
    determining the action direction of the target object in the scene area according to the optical flow information of the target object in the video stream.
  3. The action detection method according to claim 1, wherein each action direction corresponds to one target action;
    determining the action information of the target object based on the geometric relationship between the target key points corresponding to the action direction among the detected skeletal key points comprises:
    determining that the target object performs the target action corresponding to the action direction when the geometric relationship between the target key points corresponding to the action direction among the detected skeletal key points satisfies a first preset condition corresponding to the action direction;
    otherwise, determining that the target object does not perform the target action corresponding to the action direction.
  4. The action detection method according to claim 1 or 3, wherein the skeletal key points comprise: a left shoulder key point, a right shoulder key point, a left wrist key point, a right wrist key point, a left elbow key point, a right elbow key point, a left ear key point, and a right ear key point;
    determining the action information of the target object based on the geometric relationship between the target key points corresponding to the action direction among the detected skeletal key points comprises at least one of the following:
    when the action direction is left, determining that the target object performs a body-leaning-left action in response to the tangent of the angle between a first target vector, from the right shoulder key point to the left shoulder key point, and a standard horizontally rightward vector being positive with an absolute value greater than a first threshold;
    when the action direction is right, determining that the target object performs a body-leaning-right action in response to the tangent of the angle between the first target vector, from the right shoulder key point to the left shoulder key point, and the standard horizontally rightward vector being negative with an absolute value greater than the first threshold;
    when the action direction is up, determining that the target object performs a chest-clutching action in response to the following first case or second case, wherein the first case comprises: the included angle between a second target vector and a third target vector is greater than a second threshold, the left wrist key point is lower than the right shoulder key point, and the vertical distance between the left wrist key point and the right shoulder key point is greater than a third threshold, the second target vector being the vector from the left elbow key point to the left wrist key point and the third target vector being the vector from the left elbow key point to the left shoulder key point; and the second case comprises: the included angle between a fourth target vector and a fifth target vector is greater than the second threshold, the right wrist key point is lower than the left shoulder key point, and the vertical distance between the right wrist key point and the left shoulder key point is greater than the third threshold, the fourth target vector being the vector from the right elbow key point to the right wrist key point and the fifth target vector being the vector from the right elbow key point to the right shoulder key point; or,
    when the action direction is down, determining that the target object performs a forward-slumping action in response to the left ear key point being lower than the left shoulder key point with the vertical distance between the left ear key point and the left shoulder key point greater than a fourth threshold, and/or in response to the right ear key point being lower than the right shoulder key point with the vertical distance between the right ear key point and the right shoulder key point greater than the fourth threshold.
  5. The action detection method according to any one of claims 1 to 4, wherein detecting, based on the multiple image frames in the video stream, the action direction of the target object in the scene area comprises:
    detecting whether the target object in the video stream is moving;
    when the target object in the video stream is moving, detecting, based on the multiple image frames in the video stream, the action direction of the target object in the scene area.
  6. The action detection method according to claim 1, 3 or 4, wherein determining the action information of the target object based on the geometric relationship between the target key points corresponding to the action direction among the detected skeletal key points comprises:
    for each of the multiple image frames in the video stream, determining the action information of the target object in that image frame based on the geometric relationship between the target key points corresponding to the action direction among the skeletal key points detected in that image frame;
    smoothing the action information of the target object across the multiple image frames to obtain target action information of the target object determined based on the multiple image frames.
  7. The action detection method according to claim 6, further comprising:
    while acquiring the video stream of the scene area, caching in real time the latest preset number of image frames in the video stream;
    wherein smoothing the action information of the target object across the multiple image frames to obtain the target action information of the target object determined based on the multiple image frames comprises:
    smoothing the action information of the target object across the cached preset number of image frames to obtain the target action information of the target object determined based on the cached preset number of image frames.
  8. The action detection method according to claim 7, wherein caching in real time the latest preset number of image frames comprises:
    detecting whether preset key information of the target object is present in each image frame of the video stream, wherein the preset key information comprises at least one of a face, at least part of a body, and skeletal key points;
    caching in real time the latest preset number of image frames among those image frames of the video stream that contain the preset key information of the target object.
  9. The action detection method according to any one of claims 1 to 8, wherein the scene area comprises a vehicle cabin scene area;
    acquiring the video stream of the scene area comprises:
    acquiring the video stream of the scene area when a door of the vehicle is in a locked state and/or the speed of the vehicle reaches a preset speed threshold.
  10. The action detection method according to claim 9, further comprising:
    detecting multiple objects in the video stream;
    determining the target object among the multiple objects according to the position of each of the multiple objects in the vehicle cabin and/or the face information of each of the multiple objects.
  11. The action detection method according to any one of claims 1 to 10, further comprising:
    sending alarm information to a service platform when the action information of the target object indicates that the target object performs a target action.
  12. An action detection apparatus, comprising:
    an acquisition module, configured to acquire a video stream of a scene area;
    a direction module, configured to detect, based on multiple image frames in the video stream, an action direction of a target object in the scene area;
    a detection module, configured to detect skeletal key points of the target object in the multiple image frames;
    a determination module, configured to determine action information of the target object based on a geometric relationship between target key points corresponding to the action direction among the detected skeletal key points.
  13. An electronic device, comprising a memory and a processor, wherein the memory is configured to store computer instructions executable on the processor, and the processor is configured to implement the method of any one of claims 1 to 11 when executing the computer instructions.
  14. A computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the method of any one of claims 1 to 11.
PCT/CN2022/134872 2022-03-31 2022-11-29 Action detection method and apparatus, electronic device and storage medium WO2023185037A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210346615.4 2022-03-31
CN202210346615.4A 2022-03-31 2022-03-31 Action detection method and apparatus, electronic device and storage medium

Publications (1)

Publication Number Publication Date
WO2023185037A1 (zh)

Family

ID=82564640

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/134872 WO2023185037A1 (zh) 2022-03-31 2022-11-29 动作检测方法、装置、电子设备及存储介质

Country Status (2)

Country Link
CN (1) CN114842459A (zh)
WO (1) WO2023185037A1 (zh)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114842459A (zh) Action detection method and apparatus, electronic device and storage medium
CN116052273B (zh) Action comparison method and apparatus based on posture fishbone lines

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110858295A (zh) Traffic police gesture recognition method and apparatus, vehicle controller and storage medium
CN111091044A (zh) In-vehicle dangerous scene recognition method for online ride-hailing vehicles
US20200193148A1 (en) * 2018-12-14 2020-06-18 Alibaba Group Holding Limited Method and system for recognizing user actions with respect to objects
CN111814587A (zh) Human body behavior detection method, teacher behavior detection method, and related systems and apparatus
CN113569753A (zh) Action comparison method and apparatus in video, storage medium and electronic device
CN113870205A (zh) Seat belt wearing detection method and apparatus, electronic device and storage medium
CN114842528A (zh) Action detection method and apparatus, electronic device and storage medium
CN114842459A (zh) Action detection method and apparatus, electronic device and storage medium

Also Published As

Publication number Publication date
CN114842459A (zh) 2022-08-02

Similar Documents

Publication Publication Date Title
WO2023185037A1 (zh) Action detection method and apparatus, electronic device and storage medium
WO2023185034A1 (zh) Action detection method and apparatus, electronic device and storage medium
US10510157B2 (en) Method and apparatus for real-time face-tracking and face-pose-selection on embedded vision systems
JP6141079B2 (ja) Image processing system, image processing apparatus, control methods therefor, and program
KR20200124648A (ko) Method and apparatus for operating a mobile camera with low power consumption
WO2017208529A1 (ja) Driver state estimation device, driver state estimation system, driver state estimation method, driver state estimation program, subject state estimation device, subject state estimation method, subject state estimation program, and recording medium
WO2014125882A1 (ja) Information processing system, information processing method, and program
US9932000B2 (en) Information notification apparatus and information notification method
CN110826521A (zh) Driver fatigue state recognition method and system, electronic device and storage medium
CN111753711A (zh) Electric vehicle and control method and apparatus therefor, electronic device and storage medium
JP7052305B2 (ja) Rescue system and rescue method, and server and program used therein
KR20180096038A (ko) Multi-motion-based omni-view technique for behavior prediction
CN114049587A (zh) Event detection method, server and system
CN110713082B (zh) Elevator control method, system, apparatus and storage medium
WO2017209225A1 (ja) State estimation device, state estimation method, and state estimation program
US20220189038A1 (en) Object tracking apparatus, control method, and program
Yang et al. Dangerous Driving Behavior Recognition Based on Improved YoloV5 and OpenPose [J]
CN113955594A (zh) Elevator control method and apparatus, computer device and storage medium
JP6798609B2 (ja) Video analysis apparatus, video analysis method, and program
Miller et al. Intelligent Sensor Information System For Public Transport–To Safely Go…
JP6720010B2 (ja) Information processing apparatus, information processing method, and program
US11710326B1 (en) Systems and methods for determining likelihood of traffic incident information
WO2023095196A1 (ja) Passenger monitoring device, passenger monitoring method, and non-transitory computer-readable medium
CN115719347A (zh) Behavior recognition method and apparatus, electronic device and vehicle
CN114140746A (zh) In-car camera occlusion detection method, and elevator operation control method and apparatus

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22934855

Country of ref document: EP

Kind code of ref document: A1