WO2020103647A1 - Object key point positioning method, image processing method, device, and storage medium

Object key point positioning method, image processing method, device, and storage medium

Info

Publication number
WO2020103647A1
WO2020103647A1 · PCT/CN2019/113611 · CN2019113611W
Authority
WO
WIPO (PCT)
Prior art keywords
target
key point
detection area
target object
current
Prior art date
Application number
PCT/CN2019/113611
Other languages
English (en)
French (fr)
Inventor
Peng Weilong (彭伟龙)
Shen Xiaoyong (沈小勇)
Chen Yilun (陈逸伦)
Sun Yanan (孙亚楠)
Original Assignee
Tencent Technology (Shenzhen) Company Limited (腾讯科技(深圳)有限公司)
Priority date
Filing date
Publication date
Application filed by Tencent Technology (Shenzhen) Company Limited
Priority to EP19887883.7A (EP3885967A4)
Publication of WO2020103647A1
Priority to US17/088,558 (US11450080B2)

Classifications

    • G06V 40/23: Recognition of whole body movements, e.g. for sport training
    • G06F 18/24143: Distances to neighbourhood prototypes, e.g. restricted Coulomb energy networks [RCEN]
    • G06V 10/22: Image preprocessing by selection of a specific region containing or referencing a pattern; locating or processing of specific regions to guide detection or recognition
    • G06V 10/44: Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; connectivity analysis, e.g. of connected components
    • G06V 10/462: Salient features, e.g. scale invariant feature transforms [SIFT]
    • G06V 10/764: Image or video recognition or understanding using pattern recognition or machine learning, using classification, e.g. of video objects
    • G06V 10/82: Image or video recognition or understanding using pattern recognition or machine learning, using neural networks
    • G06V 20/40: Scenes; scene-specific elements in video content
    • G06V 40/103: Static body considered as a whole, e.g. static pedestrian or occupant recognition
    • H04N 21/23418: Processing of video elementary streams involving operations for analysing video streams, e.g. detecting features or characteristics
    • H04N 21/44008: Processing of video elementary streams involving operations for analysing video streams, e.g. detecting features or characteristics in the video stream
    • G06T 2207/10016: Image acquisition modality: video; image sequence
    • G06V 10/62: Extraction of image or video features relating to a temporal dimension, e.g. time-based feature extraction; pattern tracking
    • G06V 2201/07: Target detection

Definitions

  • The present application relates to the field of computer technology, and in particular, to a method for positioning key points of an object, an image processing method, a device, and a storage medium.
  • Embodiments of the present application provide a method for positioning key points of an object, an image processing method, a device, and a storage medium, so as to at least solve the technical problem of the low accuracy and low efficiency with which related technologies detect object key points.
  • According to one aspect, a method for positioning key points of an object includes: detecting a target object in a current video frame of a target video stream to obtain a current detection area of the target object; obtaining a historical detection area corresponding to the target object in a historical video frame of the target video stream; obtaining a determined current detection area according to the historical detection area and the current detection area; locating key points of the target object based on the determined current detection area to obtain a first object key point set; obtaining a second object key point set corresponding to the target object in the historical video frame of the target video stream; and stably adjusting the position of the first object key point set according to the position of the second object key point set, to obtain the position of the current target object key point set in the current video frame.
  • According to another aspect, an image processing method includes: detecting the target object in the current video frame of the target video stream to obtain the current detection area of the target object; obtaining the determined current detection area according to the historical detection area corresponding to the target object in the historical video frame of the target video stream and the current detection area; locating key points of the target object based on the determined current detection area to obtain the first object key point set; stably adjusting the position of the first object key point set according to the position of the second object key point set corresponding to the target object in the historical video frame, to obtain the position of the current target object key point set in the current video frame; identifying the target object part from the current video frame according to the position of the current target object key point set; performing adjustment processing on the identified target object part; and displaying the image of the target object after the adjustment processing.
  • According to another aspect, a device for positioning key points of an object includes: a detection unit, configured to detect the target object in the current video frame of the target video stream to obtain the current detection area of the target object; a first acquisition unit, configured to acquire the historical detection area corresponding to the target object in the historical video frame of the target video stream; a second acquisition unit, configured to acquire the determined current detection area according to the historical detection area and the current detection area; a positioning unit, configured to locate key points of the target object based on the determined current detection area to obtain the first object key point set; a third acquisition unit, configured to acquire the second object key point set corresponding to the target object in the historical video frame of the target video stream; and an adjustment unit, configured to stably adjust the position of the first object key point set according to the position of the second object key point set, to obtain the position of the current target object key point set in the current video frame.
  • According to another aspect, an image processing device includes: a detection unit, configured to detect the target object in the current video frame of the target video stream to obtain the current detection area of the target object; an acquisition unit, configured to obtain the determined current detection area according to the historical detection area corresponding to the target object in the historical video frame of the target video stream and the current detection area; a positioning unit, configured to locate key points of the target object based on the determined current detection area to obtain the first object key point set; a first adjustment unit, configured to stably adjust the position of the first object key point set according to the position of the second object key point set corresponding to the target object in the historical video frame, to obtain the position of the current target object key point set in the current video frame; a recognition unit, configured to identify the target object part from the current video frame according to the position of the current target object key point set; a second adjustment unit, configured to adjust the identified target object part; and a display unit, configured to display the image of the target object after the adjustment processing.
  • According to another aspect, a storage medium is also provided. A computer program is stored in the storage medium, and the computer program is configured to execute, at runtime, the method for locating object key points of the embodiments of the present application.
  • In the embodiments of the present application, the current detection area of the target object is detected from the current video frame of the target video stream; the determined current detection area is acquired according to the historical detection area corresponding to the target object in the historical video frame and the current detection area; key points of the target object are located based on the determined current detection area to obtain the first object key point set; and the position of the first object key point set is stably adjusted according to the position of the second object key point set corresponding to the target object in the historical video frame of the target video stream, to obtain the position of the current target object key point set in the current video frame. This stabilizes the object key points and avoids jitter of the key points between video frames, thereby achieving the technical effect of improving the accuracy of positioning object key points and solving the technical problem of the low accuracy with which related technologies detect object key points.
  • FIG. 1 is a schematic diagram of a hardware environment of a method for locating a key point of an object provided by an embodiment of the present application;
  • FIG. 2 is a flowchart of a method for locating a key point of an object provided by an embodiment of the present application;
  • FIG. 3 is a schematic diagram of stabilizing the current detection area of the current video frame and stabilizing the key point set of the target object in the current video frame provided by an embodiment of the present application;
  • FIG. 5 is a flowchart of a method for locating key points of a human body according to an embodiment of the present application;
  • FIG. 6 is a schematic diagram of a multi-frame check provided by an embodiment of the present application;
  • FIG. 7 is a schematic diagram of a position change of a human detection frame of a video frame picture provided by an embodiment of the present application;
  • FIG. 8 is a schematic diagram of a feature pyramid network structure provided by an embodiment of the present application;
  • FIG. 9 is a schematic diagram of a distribution of key points of a human body provided by an embodiment of the present application;
  • FIG. 10 is a schematic diagram of key point positioning provided by an embodiment of the present application;
  • FIG. 11 is a schematic diagram of a scene for detecting a human body detection frame of a target human body provided by an embodiment of the present application;
  • FIG. 12 is a schematic diagram of a scenario for detecting key points of a human body provided by an embodiment of the present application;
  • FIG. 13 is a schematic diagram of a body function entrance provided by an embodiment of the present application;
  • FIG. 15 is a schematic diagram of an object key point positioning device provided by an embodiment of the present application;
  • FIG. 16 is a schematic diagram of an image processing device provided by an embodiment of the present application;
  • FIG. 17 is a structural block diagram of an electronic device provided by an embodiment of the present application.
  • FIG. 1 is a schematic diagram of a hardware environment of a method for locating a key point of an object provided by an embodiment of the present application.
  • As shown in FIG. 1, the user 102 may perform data interaction with the user equipment 104, and the user equipment 104 may include, but is not limited to, a memory 106 and a processor 108.
  • The user device 104 can determine the target object to be processed in the target video and, through the processor 108, detect the target object in the current video frame of the target video stream to obtain the current detection area of the target object; obtain the determined current detection area according to the historical detection area corresponding to the target object in a historical video frame of the target video stream and the current detection area; locate key points of the target object based on the determined current detection area to obtain the first object key point set; and execute step S102, sending the first object key point set to the server 112 through the network 110.
  • The server 112 includes a database 114 and a processor 116. After the server 112 obtains the first object key point set, the processor 116 obtains from the database 114 the position of the second object key point set corresponding to the target object in the historical video frame of the target video stream, stably adjusts the position of the first object key point set according to the position of the second object key point set to obtain the position of the current target object key point set in the current video frame, and finally executes step S104, returning the position of the current target object key point set to the user device 104 through the network 110.
  • FIG. 2 is a flowchart of a method for locating key points of an object according to an embodiment of the present application. As shown in FIG. 2, the method may include the following steps:
  • Step S202: Detect the target object in the current video frame of the target video stream to obtain the current detection area of the target object.
  • The target video stream may be the video stream of any kind of video in a video application, for example a short-video stream, and the target object in the current video frame of the target video stream is located in the target scene displayed in the current video frame.
  • The target object may be any object whose key points are to be positioned, for example a human body or an animal, and the target scene may be a video scene such as a selfie scene or a dancing scene; neither the specific video scene nor the type of the target object is restricted.
  • The so-called current detection area refers to the area where the target object is located in the target scene displayed in the current video frame, and the area includes a position and a range.
  • The current detection area can be represented as an object detection frame. When the target object is a human body, the object detection frame is specifically a human body detection frame; more specifically, the object detection frame can be a rectangular frame, an elliptical frame, a hexagonal frame, or any other detection frame. These detection frames are used to mark the position and range of the target object in the target scene. Taking the rectangular frame as an example, the position marked by the rectangular frame can be understood as the upper-left corner coordinates of the area, and the range marked by the rectangular frame can be understood as the length and width of the area; the area selected by the rectangular frame is the area of the target object in the target scene.
  • This embodiment detects the target object with a detection model trained by a deep neural network. The detection model may be a network model trained based on the open-source Single Shot MultiBox Detector (SSD) architecture.
  • During implementation, the current video frame can be input into the detection model to generate multiple candidate detection frames together with their confidence levels, and the three candidate frames with the highest confidence are selected, for example, a first candidate frame, a second candidate frame, and a third candidate frame. Each of the three candidate frames is then compared against the historical detection area of the historical video frame adjacent to the current video frame, and the candidate detection frame with the largest overlap with the historical detection area is selected as the current detection area of the current video frame, thereby ensuring that the same target object is always located.
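  • The selection step just described can be sketched in a few lines. This is a minimal illustration in Python, assuming axis-aligned boxes in (x1, y1, x2, y2) form; the function names, the box format, and the use of NumPy are illustrative conventions, not prescribed by the patent.

```python
import numpy as np

def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    x1 = max(box_a[0], box_b[0]); y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2]); y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def select_current_box(candidates, scores, prev_box, top_k=3):
    """Keep the top_k most confident detector candidates, then pick the
    one that overlaps the previous frame's (historical) box the most."""
    order = np.argsort(scores)[::-1][:top_k]   # indices of top_k confidences
    best = max(order, key=lambda i: iou(candidates[i], prev_box))
    return candidates[best]
```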
  • Step S204: Acquire a historical detection area corresponding to the target object in the historical video frame of the target video stream.
  • The so-called historical video frame is historical relative to the current video frame; that is, it is a video frame of the target video stream that falls within a preset time range before the current video frame.
  • The so-called historical detection area may be the detection area corresponding to the target object in the historical video frame after stabilization, and its specific position information, that is, the historical detection result, is cached at a first predetermined position.
  • The historical video frame corresponding to the current video frame can be understood as the previous video frame, or the several previous video frames, adjacent to the current video frame in time sequence; the historical detection area corresponding to a historical video frame refers to the stabilized detection area corresponding to the target object in that video frame, and the historical detection areas are stored sequentially in the sub-positions of the first predetermined position according to the frame-number order of the detected historical video frames.
  • During implementation, the historical detection area related to the current detection area may be acquired from the first predetermined position.
  • Step S206: Acquire the determined current detection area according to the historical detection area and the current detection area.
  • The current detection area determines the input of the keypoint detection network, and the stability of the background around the target object within that area also affects the temporal stability of the object key points. A stable-frame mechanism is therefore used so that the background of the keypoint model input is locally stable in the time domain across consecutive frames. Thus, in this embodiment, after the current detection area of the target object is obtained, the current detection area needs to be adjusted.
  • That is, the current detection area is stabilized: it can be adjusted through the historical detection area so that the area change value between the determined current detection area and the historical detection area is less than a first target threshold; in other words, the areas indicated by the two remain unchanged, or change only slightly, over most of the local time domain.
  • The first target threshold is the critical value used to measure whether the change of the area is small. The change value can be determined from the coordinate values of the area indicated by the current detection area and of the area indicated by the historical detection area.
  • In this way, the curve originally representing the position change of the object detection area is converted into a stepped trajectory, and the object key point set of the target object is then detected based on the determined current detection area.
  • As a result, the area where the target object is located is locally stable in the time domain across the current video frame and the adjacent historical video frames, which provides a stable background for detecting the multiple object key points of the target object, thereby reducing keypoint detection errors caused by background changes and ensuring the accuracy of the finally output object key points.
  • Optionally, the determined current detection area may be stored in the first predetermined storage location as the historical detection area for subsequent video frames, so as to serve as basic data for stabilizing the detection areas of the target object in those frames.
  • Step S208: Based on the determined current detection area, locate key points of the target object to obtain a first object key point set.
  • The first object key point set includes multiple object key points, which identify the feature points of the key parts of the object and mark the object contour.
  • Different objects correspond to different key points, so the object key points need to be defined before the solution is implemented. When the target object is a human body, the object key points are human body key points, for example key points for the left eye, the nose and other body parts, each located at the position of the part it indicates.
  • Because the movement of objects can be very complicated and "left", "right", "inner" and "outer" are difficult to distinguish, the side of the object that appears first from left to right may be defined as the left side of the object, and the side that appears first from right to left as its right side. For example, the shoulder that appears first from left to right is the left shoulder, and the thigh that appears first from left to right is the left thigh.
  • During implementation, a keypoint detection algorithm is used to locate the key points of the target object based on the determined current detection area: each key point is positioned separately within the determined current detection area, thereby obtaining the first object key point set, which may be a human body key point set including 66 human body key points.
  • The keypoint detection algorithm may be obtained based on Feature Pyramid Networks (FPN) in a deep learning model, as sketched below.
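  • The text does not specify how the FPN-style network's outputs are turned into coordinates. A common convention, assumed in this sketch, is that the network predicts one heatmap per key point for the cropped detection box and each key point is read off the heatmap peak; all names and the decoding scheme here are illustrative, not taken from the patent.

```python
import numpy as np

def decode_heatmaps(heatmaps, box):
    """Decode per-keypoint heatmaps of shape (K, H, W) into image coordinates.

    `box` = (x1, y1, x2, y2) is the determined current detection area that was
    cropped and fed to the network; the peak value serves as a confidence.
    """
    k, h, w = heatmaps.shape
    x1, y1, x2, y2 = box
    points = np.empty((k, 3))                    # columns: x, y, confidence
    for i in range(k):
        flat = heatmaps[i].argmax()              # index of the heatmap peak
        row, col = divmod(flat, w)
        points[i, 0] = x1 + (col + 0.5) / w * (x2 - x1)
        points[i, 1] = y1 + (row + 0.5) / h * (y2 - y1)
        points[i, 2] = heatmaps[i, row, col]     # peak value as confidence
    return points
```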
  • Step S210: Acquire a second object key point set corresponding to the target object in the historical video frame of the target video stream.
  • The second object key point set in this embodiment is the object key point set corresponding to the target object in a historical video frame relative to the current video frame, and its multiple object key points correspond one-to-one to the multiple object key points included in the first object key point set. That is, before the current detection area of the target object is obtained, key points of the target object have already been located in the historical video frames of the target video stream, producing historical detection results for the keypoint positioning of the target object.
  • The second object key point set may be the stably adjusted object key point set corresponding to the target object in the historical video frame adjacent to the current video frame, stored at a second predetermined position, which may be the same as the first predetermined position. The stably adjusted object key point sets corresponding to the target object in the historical video frames are stored sequentially in the sub-positions of the second predetermined position according to the frame-number order of the historical video frames.
  • Step S212: Stably adjust the position of the first object key point set according to the position of the second object key point set, to obtain the position of the current target object key point set in the current video frame.
  • The position of the second object key point set is the detection result already obtained for the target object in the historical video frame, and it can be represented by the coordinates of the key points on the target object.
  • The position of the first object key point set is stably adjusted according to the position of the second object key point set to obtain the position of the current target object key point set in the current video frame. The position of the first object key point set can likewise be represented by its coordinates on the target object, as can the position of the current target object key point set.
  • Between the position of the current target object key point set in the current video frame and the position of the second object key point set in the adjacent historical video frame, the position change is less than a second target threshold, which is the critical value used to measure whether the change between the first object key point set and the second object key point set is small; that is, jitter of the target object's key points between video frames is reduced.
  • After the position of the first object key point set has been stably adjusted according to the position of the second object key point set, the current target object key point set can be stored in a second predetermined storage location, which may be the same as the first predetermined storage location, for stably adjusting the positions of the object key point sets of the video frames after the current video frame; that is, the current target object key point set serves as the basis for those later stable adjustments and, once determined, is stored in the second predetermined storage position.
  • For example, the current target object key point set is stored in a third sub-location of the second predetermined storage location, which may be adjacent to its second sub-location, and is used for the stable position adjustment of the object key point set of the fourth video frame (or the fifth video frame, the sixth video frame, and so on). When the current video frame becomes a historical video frame of a later video frame in the target video stream, the current target object key point set likewise becomes the second object key point set corresponding to the target object in that historical video frame; that is, it serves as the second object key point set for the fourth video frame (or the fifth, the sixth, and so on) of the target video stream.
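  • The per-frame caching described above can be modelled as a small fixed-size buffer indexed in frame order. The sketch below is an assumed data structure; the patent only requires that stabilized results be stored sequentially by frame number.

```python
from collections import deque

class HistoryCache:
    """Fixed-size cache of stabilized per-frame results, kept in frame order
    (one entry per 'sub-position' of the predetermined storage location)."""

    def __init__(self, m=5):
        self.frames = deque(maxlen=m)   # oldest sub-position is evicted first

    def push(self, frame_no, box, keypoints):
        """Store the stabilized detection box and key point set of a frame."""
        self.frames.append({"frame": frame_no, "box": box, "kpts": keypoints})

    def latest(self):
        """Most recent historical result, or None before the first frame."""
        return self.frames[-1] if self.frames else None
```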
  • In an optional implementation, in step S206, acquiring the determined current detection area according to the historical detection area and the current detection area includes: when the historical video frame is the historical video frame adjacent to the current video frame, acquiring the determined current detection area according to the historical detection area and the current detection area.
  • During implementation, the historical detection area in the historical video frame adjacent to the current video frame may be acquired, and the current detection area adjusted so that the area change value between the area indicated by the determined current detection area and the area indicated by the historical detection area is less than a predetermined threshold; in this way, the areas indicated by the determined current detection area and the historical detection area remain unchanged, or change only slightly, over most of the local time domain.
  • Optionally, acquiring the determined current detection area according to the historical detection area and the current detection area includes: when the historical video frame is the historical video frame adjacent to the current video frame, acquiring the degree of overlap between the historical detection area and the current detection area; when the degree of overlap is greater than a target threshold, using the historical detection area as the determined current detection area; and when the degree of overlap is not greater than the target threshold, using the current detection area directly as the determined current detection area.
  • That is, the current detection area is stabilized: the frame stabilization processing uses the detection result for the target object in the historical video frame adjacent to the current video frame to stabilize the current detection area.
  • During implementation, a historical detection area indicating the area where the target object is located in the target scene displayed in the image of the historical video frame is obtained, and the degree of overlap between the current detection area and the historical detection area is computed; that is, the overlap between area A, where the target object indicated by the current detection area is located in the target scene displayed in the image of the current video frame, and area B, where the target object indicated by the historical detection area is located in the target scene displayed in the image of the historical video frame.
  • The degree of overlap is obtained by comparing the area where A and B intersect with the area where A and B are combined, that is, overlap = area(A ∩ B) / area(A ∪ B).
  • The target threshold measures how much the current detection area and the historical detection area overlap, and may be 0.4.
  • When the degree of overlap between the current detection area and the historical detection area is greater than the target threshold, the overlapping region is considered large, and the historical detection area is determined as the result of stabilizing the current detection area; that is, the historical detection area of the previous historical video frame of the current video frame continues to be used as the stabilized detection area, so that the background of consecutive video frames is locally stable in the time domain, improving the accuracy of detecting the target object's key points.
  • Otherwise, the current detection area may be used directly as the determined current detection area, that is, without stabilization. It should be noted that the first video frame of the target video stream is not stabilized.
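  • The stable-frame decision reduces to a few lines; this minimal sketch reuses the iou helper from the earlier sketch, with the 0.4 threshold taken from the text above.

```python
def stabilize_box(current_box, prev_box, overlap_threshold=0.4):
    """Stable-frame mechanism: keep reusing the previous (historical) box
    while it still overlaps the newly detected box enough, so the crop fed
    to the keypoint network stays locally constant over time."""
    if prev_box is None:                      # first frame: nothing to stabilize
        return current_box
    if iou(current_box, prev_box) > overlap_threshold:
        return prev_box                       # large overlap: keep the old box
    return current_box                        # scene changed: accept the new box
```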
  • During implementation, the current detection area and the historical detection area may both be rectangular frames with determined positions in the target scene. A first predetermined threshold is the critical value used to measure the change between the first size and the second size, and a second predetermined threshold is the critical value used to measure the change between the first position and the second position. When both changes are small, the historical detection area is determined as the determined current detection area, thereby ensuring that the size and position of the human detection frame associated with the target object change little, or not at all, over most of the local time domain, which improves the accuracy of detecting the target object's key points and hence the efficiency of processing the target object.
  • In an optional implementation, in step S212, stably adjusting the position of the first object key point set according to the position of the second object key point set to obtain the position of the current target object key point set in the current video frame includes: when the historical video frame is the historical video frame adjacent to the current video frame, determining the position of the second object key point set as the position of the current target object key point set; or, when the historical video frames are multiple historical video frames adjacent to the current video frame, stably adjusting the position of the first object key point set through the positions of multiple sets of second object key points to obtain the position of the current target object key point set, where the multiple historical video frames correspond one-to-one to the multiple sets of second object key points.
  • Optionally, the historical video frame may be the historical video frame adjacent to the current video frame. The position of the second object key point set corresponding to the target object in this historical video frame is the already obtained detection result of the object key points in that frame, and it is directly determined as the position of the current target object key point set; that is, the positions of the multiple object key points in the second object key point set are directly used as the positions of the corresponding object key points in the current target object key point set.
  • Optionally, the historical video frames may be multiple historical video frames adjacent to the current video frame; for example, if the current video frame is the third video frame of the target video stream, the historical video frames are the first and second video frames of the target video stream.
  • Each historical video frame corresponds to one set of second object key points, which may include all key points on the target object; when the target object is a human body, a set of second object key points can include 66 human body key points.
  • The position of the first object key point set can then be stably adjusted through the positions of the multiple sets of second object key points: the corresponding key points included in each set are used to stably adjust the multiple object key points of the first object key point set, obtaining the position of the current target object key point set.
  • For example, the target video stream includes the Nth video frame, the (N-1)th video frame, the (N-2)th video frame, ..., the (N-m)th video frame, and the (N+1)th video frame, the (N+2)th video frame, ..., the (N+n)th video frame, where the Nth video frame is the current video frame of the target video stream, the (N-1)th to (N-m)th video frames are the multiple historical video frames adjacent to the current video frame, and the (N+1)th to (N+n)th video frames are the video frames after the current video frame in the target video stream; N is a natural number greater than or equal to 1, m is a natural number, and n is a natural number greater than or equal to 1.
  • The (N-1)th video frame in this embodiment may be the first historical video frame of the current video frame, corresponding to the first set of second object key points {key point A1, key point B1, ..., key point Z1}; the (N-2)th video frame may be the second historical video frame, corresponding to the second set {key point A2, key point B2, ..., key point Z2}; and the (N-m)th video frame may be the mth historical video frame, corresponding to the mth set {key point Am, key point Bm, ..., key point Zm}. The Nth video frame is the current video frame, corresponding to the first object key point set {key point aN, key point bN, ..., key point zN}. The m sets of second object key points are stored at the second predetermined position.
  • During implementation, the current detection area is stabilized according to the first historical detection area to obtain the determined current detection area: if the degree of overlap between them is greater than the target threshold, the first historical detection area is used as the determined current detection area; if the degree of overlap is not greater than the target threshold, the current detection area is directly used as the determined current detection area.
  • The first historical detection area in this embodiment may itself have been determined by the second historical detection area, the second by the historical detection area corresponding to the target object in the (N-3)th video frame, and so on; the detection area of the first video frame of the target video stream does not need to be stabilized.
  • After the determined current detection area is acquired according to the first historical detection area and the current detection area, it may be stored at the first predetermined position, where it can become the historical detection area for the detection area corresponding to the target object in the (N+1)th video frame and be used to stabilize that detection area.
  • The positions of the multiple object key points included in each set of second object key points may be used to stably adjust the corresponding object key points in the first object key point set. For example, the position of key point aN in the first object key point set is stably adjusted through the position of key point A1 in the first set of second object key points, the position of key point A2 in the second set, ..., and the position of key point Am in the mth set, yielding the position of key point AN in the current target object key point set; the position of key point bN is stably adjusted through the positions of key points B1, B2, ..., Bm, yielding the position of key point BN; and the position of key point zN is stably adjusted through the positions of key points Z1, Z2, ..., Zm, yielding the position of key point ZN. In this way the corresponding object key points of the first object key point set are all stably adjusted, giving the Nth set of current object key points {key point AN, key point BN, ..., key point ZN}.
  • Optionally, the current object key point set is stored in the second predetermined storage location, which may be the same as the first predetermined storage location, so that it can be used to stably adjust the positions of the object key point sets of the video frames after the current video frame, for example that of the (N+1)th video frame. The position of key point aN+1 in the object key point set of the (N+1)th video frame is stably adjusted through the position of key point AN in the current object key point set, the position of key point A1 in the first set of second object key points, the position of key point A2 in the second set, ..., and the position of key point Am in the mth set, yielding the position of key point AN+1; the position of key point bN+1 is adjusted analogously through key points BN, B1, B2, ..., Bm, yielding the position of key point BN+1; and the position of key point zN+1 through key points ZN, Z1, Z2, ..., Zm, yielding the position of key point ZN+1. Thus, based on the Nth set of current object key points {key point AN, key point BN, ..., key point ZN}, the (N+1)th set of current object key points {key point AN+1, key point BN+1, ..., key point ZN+1} is obtained.
  • In an optional implementation, stably adjusting the position of the first object key point set through the positions of the multiple sets of second object key points to obtain the position of the current target object key point set in the current video frame includes: determining, from the first object key point set, the position of a first target object key point to be stably adjusted; determining, from each set of second object key points, the position of the second target object key point corresponding to the first target object key point, to obtain the positions of multiple second target object key points, where the part of the target object indicated by each second target object key point is the same as the part indicated by the first target object key point; acquiring the weighted sum of the positions of the multiple second target object key points; determining target coefficients from the frame rate of the target video stream; and smoothing the position of the first target object key point according to the weighted sum and the target coefficients, to obtain the stably adjusted position of the first target object key point.
  • Optionally, the historical video frames in this embodiment may be multiple historical video frames adjacent to the current video frame, and the positions of the multiple key points in the first object key point set may be stably adjusted one by one through the positions of the multiple sets of second object key points corresponding to the multiple historical video frames, to obtain the position of the current target object key point set in the current video frame.
  • The first object key point set includes multiple object key points, from which the position of the first target object key point to be stably adjusted is determined; then, from each set of second object key points, the position of the second target object key point corresponding to the first target object key point is determined, giving the positions of multiple second target object key points.
  • The part of the target object indicated by each second target object key point is the same as the part indicated by the first target object key point; for example, the first and second target object key points both indicate the same eye area. The first target object key point is then stabilized using the detection results of the multiple second target object key points, for example by spatiotemporal filtering based on those detection results.
  • During implementation, the weighted sum of the positions of the multiple second target object key points is acquired, and the position of the first target object key point on the target object is smoothed according to this weighted sum and the target coefficients c1 and c2, which are determined by the frame rate of the target video stream, to obtain the stably adjusted position p_t' of the first target object key point. Here w denotes the number of historical video frames, that is, the size of the spatiotemporal filtering window.
  • After the position of the first target object key point on the target object is smoothed according to the weighted sum and the target coefficients, the change between the stably adjusted position and the position of the corresponding second target object key point in the historical video frame adjacent to the current video frame is smaller than the second target threshold, thereby achieving stable processing of the first target object key point.
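  • The exact filtering formula is not reproduced in this text, so the sketch below assumes a simple convex combination of the current detection with the mean of the last w stabilized positions, p_t' = c1 * p_t + c2 * mean(p_{t-w}, ..., p_{t-1}) with c1 + c2 = 1. The class name, the window default, and the plain average standing in for the patent's weighted sum are all assumptions.

```python
from collections import deque

class KeypointSmoother:
    """Spatiotemporal smoothing of one key point over a window of w frames."""

    def __init__(self, window=5, c1=0.6):
        self.history = deque(maxlen=window)   # stabilized positions of past frames
        self.c1 = c1                          # weight of the current detection
        self.c2 = 1.0 - c1                    # weight of the historical average

    def smooth(self, point):
        """point: (x, y) of the key point in the current frame."""
        if not self.history:
            smoothed = point                  # first frame: nothing to smooth against
        else:
            hx = sum(p[0] for p in self.history) / len(self.history)
            hy = sum(p[1] for p in self.history) / len(self.history)
            smoothed = (self.c1 * point[0] + self.c2 * hx,
                        self.c1 * point[1] + self.c2 * hy)
        self.history.append(smoothed)         # becomes history for later frames
        return smoothed
```

  • One smoother instance would be kept per key point (66 for the human-body case above), and c1 can be raised at higher frame rates, where consecutive detections lie closer together.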
  • The object key points among the multiple first object key points other than the first target object key point can be stabilized by the same method, so as to obtain multiple stabilized object key points. This ensures that the multiple key points on the target object are stable across the video sequence, thereby eliminating keypoint prediction errors, exhibiting stronger temporal and spatial consistency across consecutive video frames, reducing jitter, and improving the accuracy of positioning the object key points.
  • In an optional implementation, in step S202, detecting the target object in the current video frame of the target video stream to obtain the current detection area of the target object includes: detecting the current video frame to obtain multiple first candidate detection areas; and determining, among the multiple first candidate detection areas, the first candidate detection area with the greatest degree of overlap with the historical detection area as the current detection area.
  • During implementation, multiple target video frames, multiple target object detection areas, and the confidences of the multiple target object detection areas are used as training data for a first sub-target model: the target video frames are its input data, and the detection areas and their confidences are its output data, where each target object detection area can be the area where the target object is located in the target scene displayed in the image of a target video frame of the target video.
  • The first sub-target model is trained based on a deep neural network to obtain a first target model, which is used to detect the video frames of the target video to obtain multiple object detection areas and their confidences. The first sub-target model of this embodiment may be an initially established deep-neural-network detection model.
  • The first target model of this embodiment may be a network model (MobileNet v1) trained based on the open-source single-shot multi-box detector (SSD) architecture, with the number of channels of the network model reduced to 1/4 according to the needs of the mobile terminal, to facilitate model deployment and acceleration.
  • The detection algorithm for the object detection area in this embodiment can detect the current video frame through the first target model to obtain multiple first candidate detection areas and their confidences, where a confidence is used to indicate the probability that the corresponding first candidate detection area is determined to be the current detection area.
  • The first candidate detection area with the largest degree of overlap with the historical detection area is then selected from the multiple first candidate detection areas and determined as the current detection area. The historical detection areas in this embodiment may be the multiple historical detection areas corresponding to the target object in multiple historical video frames adjacent to the current video frame, in which case the first candidate detection area selected is the one with the largest overlap with at least two of those historical detection areas.
  • Optionally, determining the first candidate detection area with the largest overlap with the historical detection area as the current detection area includes: when the historical video frame is the historical video frame adjacent to the current video frame, determining the first candidate detection area with the largest overlap with that historical detection area as the current detection area.
  • During implementation, the historical detection area A corresponding to the target object in the historical video frame adjacent to the current video frame is used as the reference object. The picture of the current video frame is resized to 300x300 as the input of the network model to generate 1000 first candidate detection areas, and the first candidate detection area with the largest IOU with the historical detection area A is used as the current detection area of the current video frame; in this way, the human detection frame always locates the same target object.
  • Optionally, determining the first candidate detection area with the largest degree of overlap with the historical detection area as the current detection area includes: selecting a target number of target candidate detection areas from the multiple first candidate detection areas, where the confidence of each target candidate detection area is greater than or equal to the confidence of any first candidate detection area other than the target number of target candidate detection areas; and determining, among the target number of target candidate detection areas, the one with the largest overlap with the historical detection area as the current detection area.
  • For example, three target candidate detection areas B0, B1 and B2 are selected, whose confidences are the largest among the confidences of the multiple first candidate detection areas; the one with the largest overlap with the historical detection area is then determined as the current detection area, so as to ensure that the human detection frame always locates the same target object.
  • step S202 before detecting the target object in the current video frame of the target video stream to obtain the current detection area of the target object, the method further includes: Detection of one historical video frame to obtain multiple second candidate detection areas; in the case that one historical video frame adjacent to the current video frame is the first video frame of the target video stream, multiple second candidate The second candidate detection area with the highest confidence in the detection area is determined as the historical detection area, where the confidence is used to indicate the probability that the corresponding second candidate detection area is determined as the historical detection area.
• A historical video frame adjacent to the current video frame is detected by the first target model to obtain multiple second candidate detection areas and their confidences; the confidence of a second candidate detection area is used to indicate the probability that it is determined to be the historical detection area.
• The historical detection area is defined relative to the current video frame; that is, the result of detecting the target object in an adjacent historical video frame is directly determined as the historical detection area.
• For the first video frame of the target video stream, the candidate detection area with the highest confidence among the multiple candidate boxes produced by the first target model may be determined as the object detection area to be obtained; for subsequent frames, the candidate detection area with the greatest overlap with the object detection area of the previous video frame is selected as the object detection area, so as to ensure that the object detection area always locates the same target object.
• When detecting the detection area of the target object, the detection by the object area detection algorithm may be performed only once every several frames, thereby improving processing efficiency.
• Suppose the first video frame has obtained its corresponding object detection area by the above method, and the second video frame is a video frame after the first video frame in the target video stream; if the second video frame and the first video frame are separated by the first number of video frames, the condition for interval detection is met.
• In that case, the second video frame can be detected by the first target model to obtain multiple third candidate detection areas and their confidences; the confidence of a third candidate detection area indicates the probability that it is determined to be the detection area associated with the target object of the second video frame.
• Among the multiple third candidate detection areas, the one with the largest degree of overlap with the detection area corresponding to the target object in the previous video frame adjacent to the second video frame is determined as the object detection area associated with the target object in the second video frame.
• This embodiment takes processing performance into account: not every video frame passes through the object area detection algorithm to obtain the object detection area.
• That is, the object area detection algorithm can be run once every first number of video frames; the larger this number, the higher the processing efficiency and the shorter the processing time. A sketch of this interval detection follows.
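• A minimal sketch of the interval-detection loop, assuming placeholder callables for the detector and the key point model (none of these names come from the application):

```python
def track_with_interval(frames, detect, box_from_keypoints, locate_keypoints,
                        interval=5):
    """Run the (placeholder) object-area detector only on every
    `interval`-th frame; in between, derive the detection area from the
    previous frame's key points."""
    keypoints = None
    for i, frame in enumerate(frames):
        if i % interval == 0 or keypoints is None:
            box = detect(frame)                  # full detection pass
        else:
            box = box_from_keypoints(keypoints)  # cheap box from cached points
        keypoints = locate_keypoints(frame, box)
        yield box, keypoints
```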
• When the confidence of the object key points of the current video frame is generally low, the current video frame is detected by the human detection algorithm. Otherwise, not every video frame needs to go through the human detection algorithm to obtain a human detection frame: the detection result of the second object key point set of a historical video frame adjacent to the current video frame can be used to generate the human detection frame associated with the target object in the current video frame.
• The historical detection area of this embodiment indicates the area where the target object is located in the target scene displayed in the image of a historical video frame adjacent to the current video frame. The second object key point set can be used to generate the current detection area associated with the target object of the current video frame; the area of the target object indicated by the current detection area in the target scene of the current video frame includes the area where the second object key point set is located. For example, the current detection area includes all the object key points of the second object key point set, and can be obtained by expanding the minimum rectangular frame enclosing the second object key point set in the vertical direction by a target proportion of the side length, for example by 1/5 of the side length, as shown in the sketch below, so as to determine the current detection area associated with the target object in the target video.
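• A minimal sketch of deriving the current detection area from the cached key point set, assuming (x, y) key point tuples; the 1/5 default mirrors the example above:

```python
def box_from_keypoints(keypoints, expand_ratio=1 / 5):
    """Minimal rectangle enclosing the second object key point set,
    expanded vertically by `expand_ratio` of the side length."""
    xs = [x for x, _ in keypoints]
    ys = [y for _, y in keypoints]
    x1, y1, x2, y2 = min(xs), min(ys), max(xs), max(ys)
    pad = (y2 - y1) * expand_ratio
    return (x1, y1 - pad, x2, y2 + pad)
```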
• Step S208, locating key points of the target object based on the determined current detection area to obtain the first object key point set, includes: when the target object in the current video frame is not completely located within the determined current detection area, taking the center of the determined current detection area as the center and externally expanding the determined current detection area. The so-called external expansion adaptively increases the width and height of the area so that the area where the target object is located in the target scene of the current video frame lies completely within the target object detection frame; the target detection area is obtained after the expansion. The first object key point set is then acquired from the target image of the target object included in the target detection area; a sketch of this expansion follows.
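• The external expansion step could look like the following sketch; the 1.2 scale factors are illustrative, as the application does not fix concrete values:

```python
def expand_area(box, scale_w=1.2, scale_h=1.2):
    """Keep the center of the determined current detection area and
    adaptively enlarge its width and height to obtain the target
    detection area."""
    x1, y1, x2, y2 = box
    cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
    w, h = (x2 - x1) * scale_w, (y2 - y1) * scale_h
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)
```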
• Acquiring the first object key point set according to the target image of the target object included in the target detection area includes: processing the target image to obtain multiple groups of confidences for the first object key point set, where each group of confidences is used to predict the position of one object key point in the first object key point set; constructing a target matrix from each group of confidences; determining first target coordinates from the row and column, in the corresponding target matrix, of the maximum confidence in each group; and using the first target coordinates to determine the position of an object key point in the first object key point set.
• Multiple images including objects and multiple object key points can be used as training data for the second target model, and the second target model can be trained through deep learning using this training data. The object key points are used to indicate parts of the object, and the second target model may be a model established for initial detection.
• The object key point detection algorithm of this embodiment is based on the feature pyramid network (FPN) in the deep learning model, with a reduced VGG network as the basic network. This embodiment replaces the convolutional layer with a residual structure (Residual Block), and uses batch normalization (Batch Normalization) and a PReLU activation function after the convolutional layer to improve the accuracy of object key point detection.
• The target image in the detection frame of the target object is processed through the second target model: the target image is input into the FPN network to obtain heat maps of the first object key point set, corresponding to multiple target matrices; a target matrix is thus a heat map matrix. The target image includes the target object, i.e., it is a local image block including the human body region. The size of the heat map obtained from the FPN network is proportional to the size of the input target image, so the size of the obtained target matrix also has a corresponding relationship with the input target image.
  • the target matrix is constructed by each group of confidence levels, where each confidence level in each group of confidence levels is used to predict the position of the corresponding object key point on the target object.
  • the multiple target matrices are 66 matrices, which correspond to 66 object key points one-to-one.
• For each target matrix, the first confidence (the one with the largest value) is selected from the multiple confidences, and the first target coordinate P_m1 is determined according to the row number and column number of that confidence in the target matrix; the first target coordinates are then used to determine the position of an object key point in the first object key point set. Since the target matrix and the input target image of the network have a corresponding relationship, the position of the key point on the target image can be calculated back from this relationship and the first target coordinates determined by the row and column of the maximum confidence. Because the target image is determined from the initial image, the position and scale relationship of the target image within the initial image are also determined, and the position of the first object key point in the initial image can thus be calculated, as sketched below.
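• A sketch of reading one key point out of its target matrix and mapping it back to the initial image; the `stride` argument (input size divided by heat map size) and the crop-box layout are assumptions of this sketch:

```python
import numpy as np

def keypoint_from_heatmap(heatmap, crop_box, stride):
    """Row/column of the maximum confidence give the first target
    coordinates; stride and crop origin map them back to the image."""
    row, col = np.unravel_index(np.argmax(heatmap), heatmap.shape)
    x_local, y_local = col * stride, row * stride  # position on target image
    x1, y1 = crop_box[0], crop_box[1]              # crop origin in initial image
    return (x1 + x_local, y1 + y_local)
```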
• Determining the position of a corresponding object key point in the first object key point set by using the first target coordinates includes: determining second target coordinates according to the row and column, in the target matrix, of the next-largest confidence in each group of confidences; offsetting the first target coordinates towards the second target coordinates by a target distance; and determining, according to the offset coordinates, the position on the target object of the object key point corresponding to the target matrix. Specifically, the second target coordinate P_m2 corresponding to the target matrix can be determined according to the row and column of the second confidence of the target matrix, where the second confidence is less than the first confidence and greater than the third confidence, the third confidence being any confidence other than the first and second confidences. The first object key point set is determined on the target object according to the above method; for example, a first object key point set including 66 object key points is obtained, as in the sketch below.
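• The quarter offset towards the second-largest response can be written directly from the description above:

```python
import numpy as np

def refined_keypoint(heatmap):
    """Offset the maximum point P_m1 a quarter of the way towards the
    next-largest point P_m2: P = P_m1 + (P_m2 - P_m1) / 4."""
    order = np.argsort(heatmap.ravel())
    p_m1 = np.array(np.unravel_index(order[-1], heatmap.shape), dtype=float)
    p_m2 = np.array(np.unravel_index(order[-2], heatmap.shape), dtype=float)
    return p_m1 + (p_m2 - p_m1) / 4.0  # sub-pixel (row, col) estimate
```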
  • the current object detection area is adjusted according to the first target ratio; the target image including the target object in the determined current object detection area is processed to obtain the first object key point set.
• When processing the target image including the target object in the determined target object detection area to obtain the object key points in the first object key point set, the method further includes: adjusting the first target coordinates according to a second target ratio, where the second target ratio is the reciprocal of the first target ratio; and determining the position of the point on the target object corresponding to the adjusted first target coordinates as the position on the target object of the corresponding object key point in the first object key point set.
• The target image processed by the second target model has size requirements, for example a width x height of 192 x 256, i.e., a 3:4 ratio. Because it is difficult to guarantee a 3:4 ratio for the target object detection frame, the frame is adjusted according to the first target ratio, for example cropped to 3:4, so that it can conveniently be scaled to 192x256 as the input of the second target model; the target image including the target object in the determined current detection area is then processed through the second target model to obtain the first object key point set (see the sketch below).
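• A sketch of the 3:4 adjustment, assuming OpenCV for cropping and scaling (the library choice is an assumption, not stated in the application):

```python
import cv2

def crop_to_ratio(image, box, out_w=192, out_h=256):
    """Pad the detection frame around its center to the model's 3:4
    width:height ratio, then scale the crop to 192x256; the returned
    scale factors allow mapping coordinates back afterwards."""
    x1, y1, x2, y2 = box
    cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
    w, h = x2 - x1, y2 - y1
    if w / h < out_w / out_h:
        w = h * out_w / out_h      # too narrow: widen to 3:4
    else:
        h = w * out_h / out_w      # too wide: heighten to 3:4
    x1, y1 = int(round(cx - w / 2)), int(round(cy - h / 2))
    x2, y2 = int(round(cx + w / 2)), int(round(cy + h / 2))
    crop = image[max(0, y1):y2, max(0, x1):x2]
    return cv2.resize(crop, (out_w, out_h)), (w / out_w, h / out_h)
```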
• Here the first target coordinates are scaled according to the second target ratio. The size of the heat map is proportional to the input of the network (the target image), so the position of the object key point on the target image can be calculated back from this relationship; and because the target image is derived from the target detection area obtained by externally expanding the determined current detection area, that position and proportional relationship are also determined, and the position of the object key point in the image of the target detection area can be calculated.
  • each object key point in the first object key point set can be determined, and then the position of each object key point in the first object key point set can be stabilized through the key point stabilization algorithm.
  • FIG. 4 is a flowchart of an image processing method according to an embodiment of the present application. As shown in Figure 4, the method includes the following steps:
  • Step S402 Detect the target object in the current video frame of the target video stream to obtain the current detection area for the target object.
  • step S402 may include the technical solution provided in step S202.
  • Step S404 Acquire the determined current detection area according to the historical detection area and the current detection area corresponding to the target object in the historical video frame of the target video stream.
  • step S404 may include the technical solutions provided in step S204 and step S206.
  • Step S406 Based on the determined current detection area, locate key points of the target object to obtain a first object key point set.
  • step S406 may include the technical solution provided in step S208.
  • Step S408 Stably adjust the position of the key point set of the first object according to the position of the key point set of the second object corresponding to the target object in the historical video frame to obtain the position of the key point set of the current target object in the current video frame.
  • step S408 may include the technical solutions provided in step S210 and step S212.
  • Step S410 Identify the part of the target object from the current video frame according to the position of the key point set of the current target object.
• The current target object key point set is the key point set obtained after stably adjusting the first object key point set; relative to the target object key point set in the historical video frame, its jitter amplitude is small, which eliminates the prediction error of the key points.
• Each key point in the current target object key point set can be used to indicate a part of the target object. For example, when the target object is a human body, there are 66 key points in the current target object key point set, which indicate 66 different parts of the human body and cover the outline of the body.
  • the part of the target object is accurately identified from the current video frame, for example, the ear, mouth, nose, eye and other parts of the human body are identified.
• Between the area where the part of the target object is identified in the current video frame and the area where it is identified in the historical video frame, the jitter amplitude is small.
• In step S412, adjustment processing is performed on the recognized part of the target object.
• An image of the identified part of the target object is displayed, and an adjustment processing instruction for the part is received. The user determines the part that needs to be adjusted according to the displayed image and, by operating on that part, triggers the adjustment processing instruction; in response to the instruction, the target object is adjusted in real time.
• For example, the user operates the "slim waist" slider, triggering the waist-slimming instruction; in response, the degree of waist slimming is adjusted in real time.
• This embodiment can also realize adjustment processing of leg lengthening, hip lifting, etc., and is not limited in this respect.
• In step S414, the image of the target object after the adjustment process is displayed.
• The image of the target object after the adjustment process exhibits the effect of adjusting the part of the target object; when that effect does not reach the predetermined effect, the identified parts can continue to be adjusted. Since this embodiment adjusts the target object based on the stabilized first object key point set, the adjustment effect is more natural and real-time detailed processing of the target object is achieved; the problem that jitter between adjacent video frames in the target video stream causes a poor processing effect is avoided, making the processing effect on the target object more real and natural and reducing the gap between it and the natural beautification effect accepted by the user.
• The detection algorithm of the object detection area and the detection algorithm of the object key points are based on deep neural network methods, and the stabilization algorithm can be based on a spatio-temporal filtering algorithm.
• These three components are the core of real-time tracking of object key points and determine the accuracy of the final output as well as the stability across adjacent video frames. The use of deep network methods gives good convergence and generalization performance on in-domain data, so that the human detection frame and the point positioning achieve good accuracy. Stabilizing the current detection area provides a stable background for the detection of object key points and reduces errors caused by background changes, while the stable adjustment of the position of the first object key point set eliminates point prediction errors, so that adjacent frame pictures show stronger spatio-temporal consistency, jitter is reduced, and the accuracy of object key point detection improves. The position of the target object is then adjusted through the stabilized object key points, so that the adjustment effect is more natural, avoiding poor processing of the target object when adjacent video frames jitter, thereby achieving the technical effect of improving the processing efficiency of the target object.
• The solution provided in this application can detect any target object that requires key point positioning, improving the accuracy of positioning.
• The target object may be an object with motion capabilities, for example a human body or another living body such as a dog or a cat; the process of detecting any one kind of target object is basically the same.
• The following takes the target object being a human body as an example: the detection area is a human body frame, and the human detection algorithm with its stable tracking algorithm, together with the 66-point human key point detection algorithm with its stable tracking algorithm, are described.
  • FIG. 5 is a flowchart of a method for locating key points of a human body according to an embodiment of the present application. As shown in Figure 5, the method includes the following steps:
  • Step S501 input a video frame picture.
• The video in this embodiment may be a short video that takes a person as its subject and includes multiple video frame pictures. A video frame picture including an image of a human body is input, and the human body in the current video frame picture is located and tracked.
  • Step S502 Detect the current video frame picture through the human body detection algorithm to obtain the current human body detection frame.
• The human detection algorithm is based on a high-performance human detection model trained with a deep neural network. The deep neural network gives the model good convergence and generalization performance on in-domain data, so that the detection of the human detection frame achieves good accuracy.
• The human detection model can be a MobileNet v1 network model trained on the SSD architecture. The number of channels of the network model can be reduced to 1/4 of the original according to the needs of the mobile terminal, to facilitate the deployment and acceleration of the model. The human detection model can output 1000 candidate human detection frames together with their confidences, which indicate the probability that a candidate frame is selected as the human detection frame (a rough open-source analogue is sketched below).
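• For orientation only, a rough stand-in built from torchvision, which ships an SSDLite with a MobileNetV3 backbone rather than the MobileNet v1 + SSD model described here:

```python
import torch
from torchvision.models import detection

model = detection.ssdlite320_mobilenet_v3_large(weights="DEFAULT").eval()

frame = torch.rand(3, 300, 300)        # current frame, resized as in the text
with torch.no_grad():
    out = model([frame])[0]            # {'boxes': (N, 4), 'scores': (N,), 'labels': (N,)}
person = out["labels"] == 1            # COCO label 1 = person
candidate_boxes = out["boxes"][person]
confidences = out["scores"][person]    # probability of selection as the human frame
```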
• The following uses the detection and tracking of one person in a short video scene as an example.
• If the current video frame picture is the first video frame picture of the video, the candidate detection frame with the highest confidence among the 1000 candidate human detection frames of the current video frame picture is determined as the human body frame, and the current human detection frame A is obtained.
• The verification process is explained below with reference to FIG. 6, a schematic diagram of multi-frame verification provided by an embodiment of the present application.
  • Use the human body detection frame A to verify the human body detection frames B0, B1, and B2, and calculate the respective overlap IOU of the human body detection frame A and the human body detection frames B0, B1, and B2.
• The IOU is calculated as IOU(A, B) = area(A ∩ B) / area(A ∪ B), i.e., the ratio of the first area where A and B intersect to the second area of the union of A and B, where B is taken as B0, B1 and B2 respectively in the calculation. The detection frame B1, which has the largest IOU with the human detection frame A, is selected as the current human detection frame of the current video frame picture, so as to ensure that the human detection frame locates the same person at all times.
• The above human detection algorithm can be used to detect part of the video frame pictures, while for the other part a human detection frame can be calculated from the positions of the human key points in the previous video frame and used as the current human detection frame of the current video frame picture. Compared with running the human detection algorithm on every video frame picture, this method saves computing resources and improves positioning efficiency. Which video frames are detected using the human detection algorithm and which depend on the corresponding previous video frame picture for positioning can be flexibly configured in practical applications.
• For example, interval detection can be used every 4 frames: the first video frame is detected using the human detection algorithm, the second to fifth video frames rely on the human key points detected in the respective previous frame to determine the human detection frame, the sixth video frame is detected using the human detection algorithm again, and so on. Using the two methods alternately improves operating efficiency; the greater the number of spaced video frame pictures, the higher the efficiency. Alternatively, when to run the human detection algorithm can be decided based on the confidence of human key point detection: for example, the human detection algorithm is run on the current video frame picture only when the confidence of its human key point detection is generally low.
• Step S503: Use the detection result of the previous video frame picture to stabilize the human detection frame of the current video frame picture, obtain the stabilized current human detection frame, and cache it.
• The area of the current human detection frame determines the input of the human key point detection algorithm, and the stability of the background of the human body in that area also affects the stability of the human key points in the video timing. Therefore this embodiment adopts a stable tracking algorithm for human detection: the detection result of the previous video frame picture is used to stabilize the human detection frame of the current video frame picture, the stabilized current human detection frame is obtained and cached, and a stable background is thus provided for the human key point detection of the current video frame picture, so that the background input to the human key point model of adjacent video frame pictures is locally stable in the time domain, reducing errors caused by background changes.
• That is, a stable tracking algorithm for human detection, i.e., a stable frame algorithm, is used to perform the stabilization operation on the human detection frame.
• If the overlap between the current human detection frame and the human detection frame of the previous video frame picture is greater than a target threshold, the human detection frame of the previous video frame picture continues to be used as the human detection frame of the current video frame picture, where the target threshold may be 0.4. That is, in this embodiment, the frame where the human body is located can be calculated from the positions of the human key points of the previous video frame picture and used as the current human detection frame of the current video frame picture, or the human detection frame of the previous video frame picture can be used to stabilize the current human detection frame of the current video frame picture. If the current video frame picture is the first video frame picture of the video, there is no need to stabilize its current human detection frame. A sketch of this rule follows.
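• A minimal sketch of the stable-frame rule; the reuse condition is reconstructed from the surrounding description, and the IOU helper is repeated here for self-containment:

```python
def iou(a, b):
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union else 0.0

def stabilize_box(current_box, previous_box, threshold=0.4):
    """If the new detection overlaps the previous frame's stabilized box
    enough, keep the previous box so the key point model sees a locally
    constant background (the stepped trajectory of FIG. 7)."""
    if previous_box is not None and iou(current_box, previous_box) > threshold:
        return previous_box
    return current_box
```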
  • FIG. 7 is a schematic diagram of a position change of a human detection frame of a video frame picture provided by an embodiment of the present application.
• The position p (x or y component) of the human detection frame is converted from the original curved trajectory into a stepped trajectory, so that the size and position of the human detection frame are unchanged over most local time windows, making the background input to the human key point model locally stable in the time domain across adjacent video frame pictures.
  • Step S504 Input the local human interest area to the human key point detection algorithm to obtain the first human key point set.
• The human key point detection algorithm of this embodiment is based on a deep neural network, so that it has good convergence and generalization performance on in-domain data and the point positioning achieves good accuracy; the subsequent point stabilization can be computed with a spatio-temporal filtering algorithm.
  • the human key point detection algorithm of this embodiment is based on the feature pyramid network FPN in the deep learning model.
  • This FPN mainly solves the multi-scale problem in object detection. Through simple network connection changes, the performance of small object detection is greatly improved without substantially increasing the calculation amount of the original model.
  • FIG. 8 is a schematic diagram of a feature pyramid network structure provided by an embodiment of the present application.
  • the basic network is a simplified VGG network.
  • the convolutional layer is replaced with a residual structure (Residual Block), and after the convolutional layer, batch normalization and PReLU activation functions are used to improve the accuracy of key point detection.
• The human key point detection algorithm in this embodiment being based on the FPN structure is only an example of the embodiment of the present application; the algorithm may also be based on the HourGlass structure, and the basic network can also be a VGG network, MobileNetV1, MobileNetV2, ShuffleNetV1, ShuffleNetV2 or other small networks and their variants, without any restriction here. The FPN building block itself is sketched below.
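• The FPN idea can be sketched with torchvision's generic building block; the channel counts and feature map sizes below are illustrative only, not values from the application:

```python
from collections import OrderedDict
import torch
from torchvision.ops import FeaturePyramidNetwork

fpn = FeaturePyramidNetwork(in_channels_list=[64, 128, 256], out_channels=128)

features = OrderedDict(
    c3=torch.rand(1, 64, 64, 48),   # shallow, high-resolution features
    c4=torch.rand(1, 128, 32, 24),
    c5=torch.rand(1, 256, 16, 12),  # deep, low-resolution features
)
pyramid = fpn(features)             # same keys, all with 128 channels
finest = pyramid["c3"]              # the finest level could feed a heat-map head
```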
• FIG. 9 is a schematic diagram of a distribution of human key points provided by an embodiment of the present application. As shown in FIG. 9, it includes human key points 1-66, which indicate the positions of parts on the human body. Table 1 is a definition table of the human key points, where left and right are determined according to the orientation in the picture.
• The training of the human key point network model in this embodiment is based on a large amount of strictly labeled data; if there is ambiguity in the labeling, the training of the model is invalidated. The positions of the human key points are defined fairly clearly by the human body model in FIG. 9, but in actual scenes human movement is very complicated, and the left/right and inner/outer distinctions in Table 1 are often difficult to make. Therefore, this embodiment also defines labeling criteria: the side that appears first from left to right in the picture is the left side of the body, the shoulder that appears first from left to right is the left shoulder, and the thigh that appears first from left to right is the left thigh; inside and outside are defined relative to the center of the body. According to this standard, there is no ambiguity in data labeling.
  • the interest area is input into a human key point detection algorithm to obtain a first human key point set.
• However, the human detection frame located according to the stability mechanism may not exactly frame the human body.
• Therefore, this embodiment appropriately expands the rectangular area where the human body is located: the center of the human detection frame is kept unchanged, and the width and height of the frame are adaptively increased, i.e., the frame is enlarged a little outward from its center.
• In this embodiment, the size of the heat map and the input image of the network are in a proportional relationship. According to this proportional relationship and the heat map, the positions of the human key points on the target image can be calculated back; and because the target image comes from the original image, with the position and ratio relationship also determined, the positions of the human key points in the original image can be calculated.
• The input requirement of the human key point detection algorithm in this embodiment is a width of 192 and a height of 256, i.e., a width-to-height ratio of 3:4. Since it is difficult to guarantee a 3:4 ratio for the expanded historical detection area, the expanded area needs to be processed to 3:4 so that it can easily be scaled to 192x256 and used as the input of the human key point detection algorithm.
  • the area included in the expanded historical detection area may be an area of interest, which is the result of secondary processing of the historical detection area.
• After processing the expanded historical detection area to 3:4, this embodiment inputs it into the FPN network, and heat maps of the 66 human key points are predicted, i.e., 66 matrices; the 66 human key points correspond one-to-one to the 66 matrices, and each matrix element represents the confidence of the corresponding position. According to the position (row number, column number) of the maximum confidence of each heat map matrix, this embodiment can map back to the image included in the original historical detection area and inversely calculate the coordinate value of the key point of the human body in that image, i.e., the point position.
• For each heat map, this embodiment offsets the maximum point P_m1 towards the next-largest point P_m2 by 1/4 of the distance for prediction: P = P_m1 + (P_m2 - P_m1) / 4. In this way, the local coordinates of the 66 key points are obtained. Then, according to the cropping position of the human detection frame and the 3:4 scaling scale, the original coordinates are calculated back, and the coordinates of the 66 key points on the human body are determined.
  • Step S505 Use the detection results of the historical video frame pictures to stabilize the first human key point set of the current video frame picture to obtain a second human key point set, and cache the second human key point set.
• The stable tracking algorithm for human key point detection in this embodiment performs the stabilization operation on the human key points. After the local human interest area is input to the human key point detection algorithm to obtain the first human key point set, the detection results of the historical video frame pictures are used to stabilize the first human key points of the current video frame picture, obtaining the second human key point set, which is cached and output. The human key points of each frame can be calculated by the human key point algorithm, but due to prediction errors, the point positions appear to jitter in the video.
• The historical video frame pictures are the pictures of the previous w video frames adjacent to the current video frame, where w indicates the number of historical video frame pictures, i.e., the window size of the spatio-temporal filtering algorithm.
• This embodiment can eliminate the prediction errors of the key points, show stronger temporal and spatial consistency across adjacent video frame pictures, and reduce jitter, thereby ensuring the accuracy and stability of the human key points across frames; a minimal smoother is sketched below.
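• A minimal sliding-window smoother in the spirit of the spatio-temporal filtering above; the uniform weighting is an assumption, as the application does not fix the filter weights:

```python
from collections import deque
import numpy as np

class KeypointSmoother:
    """Average each key point over the last w frames (the window size w)."""
    def __init__(self, w=5):
        self.window = deque(maxlen=w)

    def smooth(self, keypoints):
        """keypoints: (66, 2) array for the current frame; returns the
        positions averaged with the cached historical frames."""
        self.window.append(np.asarray(keypoints, dtype=float))
        return np.mean(list(self.window), axis=0)
```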
• To summarize, the human detection frame is obtained by inputting the video frame picture into the human detection algorithm, giving the current human detection area; the stable tracking algorithm of human detection then uses the detection result of the previous video frame picture to stabilize the current human detection frame of the current video frame picture.
• The parts of the human body are then adjusted to meet the needs of human body beautification, achieving a detailed beautification effect close to the natural beautification effect accepted by the user.
• The human detection algorithm and the human key point detection algorithm are based on deep neural network methods, and the stabilization algorithm can be based on a spatio-temporal filtering algorithm.
  • the solution of this embodiment can realize real-time positioning of human key points covering the contours of the human body on the mobile terminal.
• The human body is located and tracked in the video, and the key points are detected and stably tracked within the local area.
• The human detection model and the key point detection model total 3M and can support real-time body shaping at 30 fps on the mobile terminal, with a natural effect.
• This embodiment can be applied to short video apps, mobile phone camera functions, picture-editing software, etc., and can be used in selfie, dancing and other scenes to achieve face slimming, lip coloring, skin smoothing, etc., as well as breast enhancement, waist slimming, leg lengthening, leg slimming, etc. It can achieve real-time waist slimming, leg stretching and hip lifting effects, adapting to complex front, side, back and squatting movements and limb postures, meeting the needs of human body beautification in various scenarios.
  • FIG. 10 is a schematic diagram of a key point positioning according to an embodiment of the present application.
  • this embodiment is a mobile terminal real-time keypoint tracking method that covers the contours of a human body.
• The current video frame picture includes a human body, and the human body includes 66 key points, which indicate the positions of parts on the human body, as shown in Table 1. The 66 human key points are stabilized: the detection results of the historical video frame pictures are used to perform the stabilization operation on the 66 key points of the current video frame picture, obtaining the stabilized 66 key points.
  • FIG. 11 is a schematic view of a scene for detecting a human body detection frame of a target human body according to an embodiment of the present application.
  • the dancing video of the target human body is shot through the mobile terminal, the target human body in the video is detected, the human body detection frame of the target human body is determined, and the outline of the target human body can be framed.
• The A human detection frame is used to indicate the position and range of the area where the target human body is located in the target scene displayed in the image of the current video frame of the dancing video, and the B human detection frame is used to indicate the position and range of the area where the target human body is located in the target scene displayed in the image of the historical video frame adjacent to the current video frame.
• When the overlap between the two is greater than the target threshold, the B human detection frame is directly determined as the human detection frame of the current video frame, so that the size and position of the human detection frame are unchanged over most local time windows, thereby realizing the stabilization of the human detection frame of the current video frame.
  • FIG. 12 is a schematic diagram of a scene for detecting key points of a human body according to an embodiment of the present application.
  • the image included in the human body detection frame B is input into the key point detection algorithm to obtain multiple human body key points.
• A human key point is selected from the multiple human key points, and based on the stabilization algorithm, the positions B1, B2 and B3 of the corresponding target human key point in the historical video frames are used to perform the stabilization operation on the position a of the target human key point in the current video frame: the positions B1, B2 and B3 may be weighted and summed, and the position a smoothed based on the spatio-temporal filtering, thereby obtaining the position A of the target key point on the target human body.
  • This position A is the position after stabilizing the position a of the key point of the target human body on the target human body, thereby reducing the jitter of the point position between video frames.
  • FIG. 13 is a schematic diagram of a body function entrance according to an embodiment of the present application.
• On the terminal, the user enters the body function entrance to detect the human body in the video frame.
• After the 66 key points of the current video frame picture are stabilized, the point prediction errors are eliminated and stronger temporal and spatial consistency is shown across adjacent video frame pictures; the stabilized key points are then used to position the human body.
  • FIG. 14 is a comparison diagram before and after weight loss according to an embodiment of the present application.
• On the terminal, the user enters the body function entrance to detect the human body in the video frame.
• The entire body is then adjusted through the stabilized key points; for example, the overall slimming function is selected to slim the human body, achieving a more detailed beautification effect closer to the natural beautification effect accepted by the user.
• This embodiment can assist in realizing slimming, leg lengthening and other body functions, as well as loading body pendants. This avoids the defects of the single-picture PS scheme, which requires manual intervention, is time- and labor-consuming, limits the application scenarios, and produces unrealistic effects, thereby improving the efficiency of processing objects in video and the user experience.
• Each human key point can be stabilized in this way, so as to realize smooth processing of multiple human key points and ensure that the point positions are stable over the video sequence.
• In this embodiment, the user enters the body function entrance to detect the human body in the video frame. After the 66 key points of the current video frame picture are stabilized, the point prediction errors are eliminated and stronger spatio-temporal consistency is shown across adjacent video frame pictures; the stabilized human key points are then used to adjust the position. For example, the "slim waist" function is selected and the slider is operated to adjust the degree of waist slimming in real time, so as to slim the waist of the human body, achieving a more detailed beautification effect closer to the natural beautification effect accepted by the user.
  • This embodiment uses a stable frame algorithm to provide a stable background for human key point detection and reduce errors caused by background changes.
• The stable point algorithm can eliminate point prediction errors, show stronger spatio-temporal consistency across adjacent video frame pictures, and reduce jitter; the target human body part is then adjusted through the stabilized human key points, so that the adjustment effect is more natural. This avoids a poor processing effect on the target human body when adjacent video frames jitter, thereby achieving the technical effect of improving the processing efficiency of the target human body.
• An embodiment of the present application further provides an object key point positioning device for implementing the above object key point positioning method. FIG. 15 is a schematic diagram of a device for locating key points of an object according to an embodiment of the present application.
  • the positioning device 150 of the key point of the object may include: a detection unit 10, a first acquisition unit 20, a second acquisition unit 30, a positioning unit 40, a third acquisition unit 50 and an adjustment unit 60.
  • the detection unit 10 is configured to detect the target object in the current video frame of the target video stream to obtain the current detection area of the target object.
  • the first acquiring unit 20 is configured to acquire a historical detection area corresponding to the target object in the historical video frame of the target video stream.
  • the second acquiring unit 30 is configured to acquire the determined current detection area according to the historical detection area and the current detection area.
  • the positioning unit 40 is configured to locate key points of the target object based on the determined current detection area to obtain a first object key point set.
  • the third acquiring unit 50 is configured to acquire a second object key point set corresponding to the target object in the historical video frame of the target video stream.
  • the adjusting unit 60 is configured to stably adjust the position of the key point set of the first object according to the position of the key point set of the second object to obtain the position of the key point set of the current target object in the current video frame.
  • the second acquisition unit 30 includes an acquisition module for acquiring the determined current detection area according to the historical detection area and the current detection area when the historical video frame is a historical video frame adjacent to the current video frame.
  • the detection unit 10 in this embodiment may be used to perform step S202 in the embodiment of the present application
  • the first obtaining unit 20 in this embodiment may be used to perform step S204 in the embodiment of the present application
  • the second acquiring unit 30 in the embodiment may be used to perform step S206 in the embodiment of the present application
  • the positioning unit 40 in the embodiment may be used to perform step S208 in the embodiment of the present application
• The third acquiring unit 50 in this embodiment may be used to perform step S210 in the embodiment of the present application.
  • the adjustment unit 60 in this embodiment may be used to perform step S212 in the embodiment of the present application.
• An embodiment of the present application further provides an image processing apparatus for implementing the above image processing method. FIG. 16 is a schematic diagram of an image processing apparatus according to an embodiment of the present application. As shown in FIG. 16, the image processing apparatus 160 may include: a detection unit 70, an acquisition unit 80, a positioning unit 90, a first adjustment unit 100, a recognition unit 110, a second adjustment unit 120 and a display unit 130.
  • the detection unit 70 is configured to detect the target object in the current video frame of the target video stream to obtain the current detection area of the target object.
  • the obtaining unit 80 is configured to obtain the determined current detection area according to the historical detection area and the current detection area corresponding to the target object in the historical video frame of the target video stream.
  • the positioning unit 90 is configured to locate key points of the target object based on the determined current detection area to obtain a first object key point set.
• The first adjustment unit 100 is used to stably adjust the position of the first object key point set according to the position of the second object key point set corresponding to the target object in the historical video frame, to obtain the position of the current target object key point set in the current video frame.
  • the identifying unit 110 is configured to identify the target object part from the current video frame according to the position of the key point set of the current target object.
  • the second adjustment unit 120 is configured to perform adjustment processing on the recognized target object part.
  • the display unit 130 is used to display the image of the target object after the adjustment process.
  • the detection unit 70 in this embodiment may be used to perform step S402 in the embodiment of the present application
  • the obtaining unit 80 in this embodiment may be used to perform step S404 in the embodiment of the present application.
• The positioning unit 90 in this embodiment may be used to perform step S406 in the embodiment of the present application.
  • the first adjustment unit 100 in this embodiment may be used to perform step S408 in the embodiment of the present application
• The identification unit 110 in this embodiment may be used to perform step S410 in the embodiment of the present application.
  • the second adjustment unit 120 in this embodiment may be used for performing step S412 in the embodiment of the present application
• The display unit 130 in this embodiment may be used to perform step S414 in the embodiment of the present application.
• The above units and modules correspond to their respective steps in terms of implemented examples and application scenarios, but are not limited to the contents disclosed in the foregoing embodiments. It should be noted that, as part of the device, the above modules can run in the hardware environment shown in FIG. 1, and can be implemented by software or by hardware.
  • the hardware environment includes a network environment.
• An embodiment of the present application further provides an electronic device for implementing the above method for locating object key points.
  • the electronic device includes a memory 172 and a processor 174.
  • the memory stores a computer program
  • the processor is configured to execute the steps in any one of the foregoing method embodiments through the computer program.
  • the above-mentioned electronic device may be located in at least one network device among multiple network devices of the computer network.
• The above processor may be configured to perform, through the computer program, steps S206 to S210 shown in FIG. 2 in the above embodiment.
  • the foregoing processor may also be configured to execute steps S402 to S412 shown in FIG. 4 in the foregoing embodiment through a computer program.
  • the structure shown in FIG. 17 is only an illustration, and the electronic device may also be a smart phone (such as an Android phone, an iOS phone, etc.), a tablet computer, a palmtop computer, and a mobile Internet device (Mobile Internet devices, MID), PAD and other terminal devices.
  • FIG. 17 does not limit the structure of the above electronic device.
  • the electronic device may also include more or fewer components (such as a network interface, etc.) than shown in FIG. 17 or have a different configuration from that shown in FIG. 17.
• The memory 172 may be used to store software programs and modules, such as the program instructions/modules corresponding to the object key point positioning method and apparatus in the embodiments of the present application; the processor 174 runs the software programs and modules stored in the memory 172, thereby performing various functional applications and data processing, that is, implementing the above method of locating key points of objects.
  • the memory 172 may include a high-speed random access memory, and may also include a non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory.
  • the memory 172 may further include memories remotely provided with respect to the processor 174, and these remote memories may be connected to the terminal through a network.
  • the above network examples include but are not limited to the Internet, intranet, local area network, mobile communication network, and combinations thereof.
  • the memory 172 may specifically but not limited to store information such as video frames of the target video, human body detection frames, object key points, and the like.
• The above memory 172 may include, but is not limited to, the units of the above object key point positioning device, such as the detection unit 10 and the adjustment unit 60; it may also include, but is not limited to, other module units of the above devices, which will not be repeated in this example.
  • the transmission device 176 described above is used to receive or transmit data via a network.
  • Specific examples of the aforementioned network may include a wired network and a wireless network.
  • the transmission device 176 includes a network adapter (Network Interface Controller, NIC), which can be connected to other network devices and routers through a network cable to communicate with the Internet or a local area network.
• In one example, the transmission device 176 is a radio frequency (RF) module, which is used to communicate with the Internet in a wireless manner.
  • the above electronic device further includes: a display 178 for displaying the execution status of the above target code in the first target function; and a connection bus 180 for connecting each module component in the above electronic device.
  • a storage medium in which a computer program is stored, wherein the computer program is configured to execute the steps in any one of the above method embodiments during runtime.
• The above storage medium may be configured to store a computer program for performing, when executed, steps S206 to S210 shown in FIG. 2 in the above embodiment, or steps S402 to S412 shown in FIG. 4 in the above embodiment.
• The above storage medium may include, but is not limited to: a USB flash disk, a read-only memory (ROM), a random access memory (RAM), a removable hard disk, a magnetic disk, an optical disc, and various other media that can store program code.
• If the integrated unit in the above embodiments is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present application, in essence, or the part contributing to the existing technology, or all or part of the technical solution, may be embodied in the form of a software product stored in a storage medium, including several instructions for enabling one or more computer devices (which may be personal computers, servers, network devices, etc.) to perform all or part of the steps of the methods described in the embodiments of the present application.
  • the disclosed client may be implemented in other ways.
  • the device embodiments described above are only schematic.
• The division of the units is only a logical function division; in actual implementation there may be other division manners, for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not implemented.
  • the displayed or discussed mutual coupling or direct coupling or communication connection may be indirect coupling or communication connection through some interfaces, units or modules, and may be in electrical or other forms.
  • the units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units, that is, they may be located in one place, or may be distributed on multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
  • each functional unit in each embodiment of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit.
  • the above integrated unit may be implemented in the form of hardware or software functional unit.

Abstract

The present application discloses an object key point positioning method, an image processing method, a device, and a storage medium. The method includes: detecting a target object in a current video frame of a target video stream to obtain a current detection area for the target object; acquiring a historical detection area corresponding to the target object in a historical video frame of the target video stream; adjusting the current detection area using the historical detection area to obtain the determined current detection area; locating key points of the target object based on the determined current detection area to obtain a first object key point set; acquiring a second object key point set corresponding to the target object in the historical video frame of the target video stream; and stably adjusting the position of the first object key point set according to the position of the second object key point set to obtain the position of the current target object key point set in the current video frame. The present application solves the technical problem of low accuracy in detecting object key points in the related art.

Description

物体关键点的定位方法、图像处理方法、装置及存储介质
本申请要求于2018年11月19日提交中国专利局、申请号为CN201811377195.6、发明名称为“物体关键点的定位方法、图像处理方法、装置及存储介质”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。
技术领域
本申请涉及计算机领域,尤其涉及一种物体关键点的定位方法、图像处理方法、装置及存储介质。
背景技术
目前,对于单张图像处理(Photoshop,简称为PS)图中的物体关键点定位,需要用户或者设计师肉眼确定人体的各个物体关键点,耗时耗力,限制了应用场景;另外,视频应用中通常对对象所在的画面区域进行整体拉伸,即使定位了物体关键点,在视频中的前后视频帧出现抖动的情况下,对物体关键点进行定位的准确性低。
针对上述对物体关键点进行定位的准确性低的问题,目前尚未提出有效的解决方案。
发明内容
本申请实施例提供了一种物体关键点的定位方法、图像处理方法、装置及存储介质,以至少解决相关技术对物体关键点进行检测的准确性低的效率低的技术问题。
根据本申请实施例的一个方面,提供了一种物体关键点的定位方法。该方法包括:对目标视频流的当前视频帧中的目标物体进行检测,获得对目标物体的当前检测区域;获取在目标视频流的历史视频帧中目标物体对应的历史检测区域;根据历史检测区域和当前检测区域,获取确定后的当前检测区域;基于确定后的当前检测区域,对目标物体进行关键点定位,得到第一物体关键点集;获取在目标视频流的历史视频帧中目标物体对应的第二物体关键点集;根据第二物体关键点集的位置对第一物体关键点集的位置进行稳定调整,得到当前视频帧中的当前目标物体关键点集的位置。
根据本申请实施例的另一方面,还提供了一种图像处理方法。该方法包括:对目标视频流的当前视频帧中的目标物体进行检测,获得对目标物体的当前检测区域;根据在目标视频流的历史视频帧中目标物体对应的历史检测区域和当前检测区域,获取确定后的当前检测区域;基于确定后的当前检测区域,对目标物体进行关键点定位,得到第一物体关键点集;根据在历史视频帧中目标物 体对应的第二物体关键点集的位置,对第一物体关键点集的位置进行稳定调整,得到当前视频帧中的当前目标物体关键点集的位置;根据当前目标物体关键点集的位置,从当前视频帧中识别出目标物体的部位;对识别出的目标物体的部位进行调整处理;显示调整处理后的目标物体的图像。
根据本申请实施例的另一方面,还提供了一种物体关键点的定位装置。该装置包括:检测单元,用于对目标视频流的当前视频帧中的目标物体进行检测,获得对目标物体的当前检测区域;第一获取单元,用于获取在目标视频流的历史视频帧中目标物体对应的历史检测区域;第二获取单元,用于根据历史检测区域和当前检测区域,获取确定后的当前检测区域;定位单元,用于基于确定后的当前检测区域,对目标物体进行关键点定位,得到第一物体关键点集;第三获取单元,用于获取在目标视频流的历史视频帧中目标物体对应的第二物体关键点集;调整单元,用于根据第二物体关键点集的位置对第一物体关键点集的位置进行稳定调整,得到当前视频帧中的当前目标物体关键点集的位置。
根据本申请实施例的另一方面,还提供了一种图像处理装置。该装置包括:检测单元,用于对目标视频流的当前视频帧中的目标物体进行检测,获得对目标物体的当前检测区域;获取单元,用于根据在目标视频流的历史视频帧中目标物体对应的历史检测区域和当前检测区域,获取确定后的当前检测区域;定位单元,用于基于确定后的当前检测区域,对目标物体进行关键点定位,得到第一物体关键点集;第一调整单元,用于根据在历史视频帧中目标物体对应的第二物体关键点集的位置,对第一物体关键点集的位置进行稳定调整,得到当前视频帧中的当前目标物体关键点集的位置;识别单元,用于根据当前目标物体关键点集的位置,从当前视频帧中识别出目标物体的部位;第二调整单元,用于对识别出的目标物体的部位进行调整处理;显示单元,用于显示调整处理后的目标物体的图像。
根据本申请实施例的另一方面,还提供了一种存储介质。该存储介质中存储有计算机程序,其中,计算机程序被设置为运行时执行本申请实施例的物体关键点的定位方法。
在本申请实施例中,由于从目标视频流的当前视频帧中,检测目标物体的当前物体检测区域,并根据历史视频帧中的目标物体对应的历史检测区域和当前检测区域,获取确定后的当前检测区域,基于确定后的当前检测区域,对目标物体进行关键点定位,得到第一物体关键点集,根据在目标视频流的历史视频帧中目标物体对应的第二物体关键点集的位置,对第一物体关键点集的位置进行稳定调整,得到当前视频帧中的当前目标物体关键点集的位置,实现了对物体关键点进行稳定,避免了视频帧间物体关键点的抖动,从而实现了提高对物体关键点进行定位的准确性的技术效果,进而解决了相关技术对物体关键点进行检测的准确性低的技术问题。
Brief Description of the Drawings
The accompanying drawings described here provide a further understanding of this application and constitute a part of it; the schematic embodiments of this application and their descriptions explain this application and do not unduly limit it. In the drawings:
FIG. 1 is a schematic diagram of a hardware environment of an object key point locating method according to an embodiment of this application;
FIG. 2 is a flowchart of an object key point locating method according to an embodiment of this application;
FIG. 3 is a schematic diagram of stabilizing the current detection region of a current video frame and stabilizing the key point set of a target object in the current video frame according to an embodiment of this application;
FIG. 4 is a flowchart of an image processing method according to an embodiment of this application;
FIG. 5 is a flowchart of a human key point locating method according to an embodiment of this application;
FIG. 6 is a schematic diagram of multi-box verification according to an embodiment of this application;
FIG. 7 is a schematic diagram of the position change of a human-body detection box across video frame pictures according to an embodiment of this application;
FIG. 8 is a schematic diagram of a feature pyramid network structure according to an embodiment of this application;
FIG. 9 is a schematic diagram of a human key point distribution according to an embodiment of this application;
FIG. 10 is a schematic diagram of key point location according to an embodiment of this application;
FIG. 11 is a schematic diagram of a scene of detecting the human-body detection box of a target person according to an embodiment of this application;
FIG. 12 is a schematic diagram of a scene of detecting the human key points of a target person according to an embodiment of this application;
FIG. 13 is a schematic diagram of a body-beautification feature entry according to an embodiment of this application;
FIG. 14 is a schematic comparison before and after slimming according to an embodiment of this application;
FIG. 15 is a schematic diagram of an object key point locating apparatus according to an embodiment of this application;
FIG. 16 is a schematic diagram of an image processing apparatus according to an embodiment of this application;
FIG. 17 is a structural block diagram of an electronic device according to an embodiment of this application.
Detailed Description
To help persons skilled in the art better understand the solutions of this application, the following clearly and completely describes the technical solutions in the embodiments with reference to the accompanying drawings. The described embodiments are only some rather than all of the embodiments of this application; all other embodiments obtained by a person of ordinary skill in the art without creative effort shall fall within the protection scope of this application.
It should be noted that the terms "first", "second", and so on in the specification, claims, and drawings are used to distinguish similar objects and do not necessarily describe a particular order or sequence. Data used in this way is interchangeable where appropriate, so that the embodiments described here can be implemented in orders other than those illustrated or described. In addition, the terms "include" and "have" and their variants are intended to cover non-exclusive inclusion: a process, method, system, product, or device that includes a series of steps or units is not necessarily limited to the listed steps or units, and may include other steps or units that are not listed or that are inherent to it.
According to one aspect of the embodiments of this application, an object key point locating method is provided. Optionally, the method may be, but is not limited to being, applied in the environment shown in FIG. 1, a schematic diagram of a hardware environment for the method. As shown in FIG. 1, a user 102 interacts with a user device 104, which may include, but is not limited to, a memory 106 and a processor 108. The user device 104 determines a target object to be processed in a target video. The processor 108 detects the target object in a current video frame of the target video stream to obtain a current detection region, obtains a determined current detection region according to the current detection region and a historical detection region corresponding to the target object in a historical video frame, and performs key point location on the target object based on the determined current detection region to obtain a first object key point set; in step S102, the first object key point set is sent to a server 112 over a network 110.
The server 112 includes a database 114 and a processor 116. After the server 112 obtains the first object key point set, the processor 116 obtains, from the database 114, the positions of a second object key point set corresponding to the target object in the historical video frame, stabilizes the positions of the first object key point set according to them to obtain the positions of the current target object key point set in the current video frame, and finally, in step S104, returns those positions to the user device 104 over the network 110.
It should be noted that the foregoing is only one example of a hardware environment for the object key point locating method of the embodiments of this application; the method may also, for example, be performed entirely on a client, and further examples are not enumerated one by one here.
It should also be noted that, in the related art, locating object key points in a single image edit requires a user or designer to identify each key point by eye, which is time-consuming and labor-intensive; in addition, video applications usually stretch the whole picture region where an object is located, and even when object key points have been located, the positioning accuracy is low if jitter occurs between consecutive video frames.
FIG. 2 is a flowchart of an object key point locating method according to an embodiment of this application. As shown in FIG. 2, the method may include the following steps.
Step S202: Detect a target object in a current video frame of a target video stream, to obtain a current detection region for the target object.
The target video stream may be the video stream of any video in a video application, for example a short video. The target object in the current video frame is located in the target scene displayed by that frame, and may be any object whose key points are to be located, such as a human body or an animal; the target scene may be a selfie scene, a dancing scene, or another video scene. The embodiments of this application place no restriction on the specific video scene or the object type of the target object.
The current detection region is the region, including both a position and an extent, in which the target object is located in the target scene displayed by the current video frame. In practice, the region may take the form of an object detection box; when the target object is a human body, it is a human-body detection box. The detection box may be rectangular, elliptical, hexagonal, or of any other shape, and marks the position and extent of the target object in the target scene. Taking a rectangular box as an example, the position it marks can be understood as the coordinates of the region's top-left corner, and the extent it marks as the region's width and height; the region framed by the rectangle is the region of the target object in the target scene.
Optionally, this embodiment detects the target object with a detection model trained on a deep neural network, for example a network model trained on the open-source Single Shot MultiBox Detector (SSD) architecture. During detection, the current video frame is fed into the model, which produces multiple candidate detection boxes and their confidences. The three candidates with the highest confidences are selected, for example a first, second, and third candidate box. Each of these is verified against the historical detection region of the historical video frame adjacent to the current frame in the target video stream, and the candidate with the largest overlap with the historical detection region is chosen as the current detection region of the current frame, which guarantees that the same target object is located at every moment.
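As a rough illustration of this multi-box verification step, the Python sketch below keeps the three most confident candidates and picks the one overlapping the previous frame's box most. The function names, the (x1, y1, x2, y2) box format, and the `top_k` parameter are illustrative assumptions, not the patent's implementation.

```python
import numpy as np

def iou(box_a, box_b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def pick_current_box(candidates, scores, prev_box, top_k=3):
    """Keep the top_k most confident candidate boxes, then choose the one
    with the largest IOU against the previous frame's box."""
    order = np.argsort(scores)[::-1][:top_k]
    best = max(order, key=lambda i: iou(candidates[i], prev_box))
    return candidates[best]
```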
Step S204: Obtain a historical detection region corresponding to the target object in a historical video frame of the target video stream.
A historical video frame is a frame, relative to the current video frame, within a preset past time range of the target video stream, taking the current frame as the starting point.
The historical detection region may be the stabilized detection region corresponding to the target object in a historical video frame; its position information, that is, the historical detection result, is cached at a first predetermined location. In a specific implementation, the historical video frame corresponding to the current frame can be understood as the one or several frames immediately preceding it in time, and the historical detection region is the stabilized detection region corresponding to the target object in that frame. Historical detection regions are stored at sub-locations of the first predetermined location in the frame-number order of the detected historical frames. After the current detection region is obtained, the relevant historical detection region can be fetched from the first predetermined location.
Step S206: Obtain a determined current detection region according to the historical detection region and the current detection region.
The current detection region determines the input of the key point detection network, and the stability of the background around the target object within the region also affects the temporal stability of the key points; a box-stabilization mechanism is therefore used so that the background fed to the key point model is locally stable in the time domain across consecutive frames. After the current detection region is obtained, it needs to be adjusted, for example stabilized using the historical detection region, to obtain the determined current detection region, such that the region change value between the determined current detection region and the historical detection region is less than a first target threshold; in other words, the regions they indicate remain unchanged, or change little, over most local time windows.
The first target threshold is the critical value for judging that the region change is small; the region change value may be determined from the change between the coordinate values of the region indicated by the determined current detection region and those of the region indicated by the historical detection region.
Optionally, after the determined current detection region is obtained, the curve originally describing the position change of the object detection region can be converted into a staircase trajectory, and the object key point set of the target object is then detected based on the determined current detection region.
By stabilizing the current detection region, this embodiment keeps the region where the target object is located locally stable in the time domain between the current frame and its adjacent historical frames, providing a stable background for detecting the object's key points, reducing detection errors caused by background changes, and ensuring the accuracy of the finally output key points.
After the determined current detection region is obtained, it can be stored at the first predetermined storage location as a historical detection region for subsequent frames, serving as base data for stabilizing the detection regions of the target object in those frames.
Step S208: Perform key point location on the target object based on the determined current detection region, to obtain a first object key point set.
The first object key point set includes multiple object key points, which identify feature points of key parts of the object and are used to mark out the object's contour.
Different objects have different key points, so the key points need to be defined before the solution is implemented. For example, when the target object is a human body, the object key points are human-body key points, such as points indicating the left ear, left eye, and nose, each located at the part it indicates. Optionally, since motion in real scenes is very complex and "left", "right", "inner", and "outer" are hard to distinguish, one may define the side of the object that appears first when scanning from left to right as the object's left side, and the side that appears first from right to left as its right side. Optionally, for a human body, the shoulder that appears first from left to right is the left shoulder, and the thigh that appears first from left to right is the left thigh. Defining the key points in this way avoids ambiguity.
Optionally, this embodiment locates the key points of the target object with a key point detection algorithm, based on the determined current detection region, locating each key point separately to obtain the first object key point set; when the target object is a human body, the set may include 66 human-body key points. The key point detection algorithm may be obtained based on Feature Pyramid Networks (FPN) in a deep learning model.
Step S210: Obtain a second object key point set corresponding to the target object in a historical video frame of the target video stream.
The second object key point set of this embodiment is the object key point set corresponding to the target object in a historical video frame relative to the current frame. Its key points correspond one-to-one with those of the first object key point set, and it is the result of locating the target object's key points in the historical frame before the current detection region was obtained, that is, the historical key point detection result.
In this embodiment, the second object key point set may be the stabilized key point set corresponding to the target object in a historical frame adjacent to the current frame, stored at a second predetermined location, which may be the same as the first predetermined location. The stabilized key point set of each historical frame is stored at sub-locations of the second predetermined location in frame-number order.
Step S212: Stabilize the positions of the first object key point set according to the positions of the second object key point set, to obtain the positions of the current target object key point set in the current video frame.
In the technical solution of step S212, the positions of the second object key point set are the already-detected results for the target object in the historical frame and may be represented by the coordinates of those key points on the target object.
In this embodiment, prediction errors cause point positions in the target video stream to jitter. After the positions of the second object key point set are obtained, the positions of the first object key point set are stabilized according to them to obtain the positions of the current target object key point set in the current frame; both the first set's positions and the current set's positions may be represented by coordinates on the target object.
The magnitude of the position change between the current target object key point set in the current frame and the second object key point set in the adjacent historical frame is less than a second target threshold, the critical value for judging that the position change between the first and second key point sets is small; in other words, the jitter between the key point sets of the current and historical frames is reduced.
After the positions of the current target object key point set are obtained, the set can be stored at the second predetermined storage location (which may be the same as the first) to be used for stabilizing the key point sets of frames after the current one; once determined, it serves as the basis for those later stabilizations. For example, if the current frame is the third frame of the stream, the current key point set is stored at the third sub-location of the second predetermined storage location, adjacent to the second sub-location, for stabilizing the key point set of the fourth frame (or the fifth, sixth, and so on).
When the current frame becomes a historical frame for a later frame of the stream, the current target object key point set likewise becomes the second object key point set corresponding to the target object in that historical frame. For example, if the current frame is the third frame, its key point set is the second object key point set in a historical frame of the fourth frame (or the fifth, sixth, and so on).
In an optional implementation, step S206 of obtaining the determined current detection region according to the historical detection region and the current detection region includes: when the historical video frame is a single historical frame adjacent to the current frame, obtaining the determined current detection region according to the historical detection region and the current detection region.
In this embodiment, the historical detection region in the adjacent historical frame can be obtained and used to adjust the current detection region, so that the region change value between the region indicated by the determined current detection region and that indicated by the historical detection region is less than a predetermined threshold; the regions then remain unchanged, or change little, over most local time windows.
In an optional implementation, this includes: when the historical frame is adjacent to the current frame, obtaining the overlap between the historical detection region and the current detection region; when the overlap is greater than a target threshold, using the historical detection region as the determined current detection region; and when the overlap is not greater than the target threshold, directly using the current detection region as the determined current detection region.
In this embodiment, after the target object in the current frame is detected and the current detection region obtained, the region is stabilized ("box stabilization") using the detection result for the target object in the adjacent historical frame.
Optionally, for the adjacent historical frame, the historical detection region indicating the region of the target object in the scene displayed by that frame's image is obtained, and the overlap between the current and historical detection regions is then computed: the overlap between region A, where the current detection region indicates the target object is located in the scene displayed by the current frame's image, and region B, where the historical detection region indicates it is located in the scene displayed by the historical frame's image.
Optionally, the overlap is the area where regions A and B intersect divided by the area of their union:

$$\mathrm{IOU}(A, B) = \frac{\operatorname{area}(A \cap B)}{\operatorname{area}(A \cup B)}$$

After the overlap is obtained, it is compared with the target threshold, which measures how much the two regions overlap and may be, for example, 0.4. When the overlap is greater than the target threshold, the overlapping region is judged large, and the historical detection region is determined as the stabilized current detection region; that is, the historical detection region of the previous frame continues to be used, keeping the background of consecutive frames locally stable in the time domain and improving the accuracy of key point detection for the target object.
When the overlap is not greater than the target threshold, the current detection region can be used directly as the determined current detection region; that is, no stabilization is applied.
Optionally, the first video frame of the target video stream need not be stabilized.
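A minimal sketch of this box-stabilization rule, reusing the `iou` helper from the earlier sketch; the 0.4 threshold is the example value given in the text.

```python
IOU_KEEP_THRESHOLD = 0.4  # example target threshold from the text

def stabilize_box(current_box, previous_box):
    """Keep the previous frame's box when the overlap is large enough, so the
    background fed to the key point model stays locally constant; the first
    frame of the stream (previous_box is None) is left unstabilized."""
    if previous_box is not None and iou(current_box, previous_box) > IOU_KEEP_THRESHOLD:
        return previous_box
    return current_box
```

Applied frame after frame, this rule is what turns the box-position curve into the staircase trajectory described above.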
As another optional example, the current and historical detection regions may be rectangular boxes with determinate positions in the target scene. Before the determined current detection region is obtained, a first size of the current detection region, a first position of it in the target scene, a second size of the historical detection region, and a second position of it in the scene may be obtained, where the sizes may be expressed as areas and the positions as coordinate values in the scene.
After the first and second sizes are obtained, the size change value between them is computed and compared with a first predetermined threshold, the critical value for judging the magnitude of the size change. After the first and second positions are obtained, the position change value between them is computed and compared with a second predetermined threshold, the critical value for judging the magnitude of the position change. If the size change is less than the first predetermined threshold and the position change is less than the second predetermined threshold, the historical detection region is determined as the determined current detection region. This keeps the size and position of the detection box associated with the target object unchanged, or nearly so, over most local time windows, improving the accuracy of key point detection and thus the efficiency of processing the target object. A sketch of this variant follows.
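The sketch below uses the same box format as the earlier sketches; the two thresholds and the use of the top-left corner for the position change are assumptions for illustration.

```python
def stabilize_box_by_change(current_box, previous_box,
                            size_threshold, position_threshold):
    """Keep the previous box when both the area change and the top-left
    position change stay below their respective thresholds."""
    def area(box):
        return (box[2] - box[0]) * (box[3] - box[1])
    size_change = abs(area(current_box) - area(previous_box))
    position_change = max(abs(current_box[0] - previous_box[0]),
                          abs(current_box[1] - previous_box[1]))
    if size_change < size_threshold and position_change < position_threshold:
        return previous_box
    return current_box
```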
In an optional implementation, step S212 of stabilizing the positions of the first object key point set includes: when the historical frame is a single historical frame adjacent to the current frame, determining the positions of the second object key point set as the positions of the current target object key point set; or, when there are multiple adjacent historical frames, stabilizing the positions of the first object key point set using multiple groups of second object key point sets, the historical frames corresponding one-to-one with the groups.
In this embodiment, when the historical frame is a single adjacent frame, the positions of its second object key point set, the already-detected key point result, are directly taken as the positions of the current target object key point set; that is, the positions of the key points in the second set are directly used as the positions of the corresponding key points in the current set.
Optionally, there may be multiple adjacent historical frames; for example, if the current frame is the third frame of the stream, the historical frames are the first and second frames. Each historical frame corresponds to one group of second object key points, which may include all key points on the target object; for a human body, each group may include 66 key points. Stabilizing the first set by the groups means adjusting each key point of the first set using the positions of the corresponding key points across the groups, yielding the positions of the current target object key point set.
FIG. 3 is a schematic diagram of stabilizing the current detection region of the current frame and the key point set of the target object in it. As shown in FIG. 3, the target video stream includes frames N, N-1, N-2, ..., N-m, and N+1, N+2, ..., N+n, where frame N is the current frame, frames N-1 through N-m are the adjacent historical frames, frames N+1 through N+n are the frames after the current one, N is a natural number greater than or equal to 1, 1 ≤ m < N with m a natural number, and n is a natural number greater than or equal to 1.
Optionally, frame N-1 is the first historical frame, corresponding to the first group of second object key points {A1, B1, ..., Z1} and the first historical detection region; frame N-2 is the second historical frame, corresponding to the second group {A2, B2, ..., Z2} and the second historical detection region; frame N-m is the m-th historical frame, corresponding to the m-th group {Am, Bm, ..., Zm} and the m-th historical detection region; and frame N is the current frame, corresponding to the first object key point set {a_N, b_N, ..., z_N}. The groups of second object key points are stored at the second predetermined location.
In this embodiment, after the target object in frame N is detected and the current detection region obtained, the determined current detection region can be obtained from the first historical detection region and the current detection region, that is, the current region is stabilized by the first historical region. Optionally, when their overlap is greater than the target threshold, the first historical detection region is used as the determined current detection region; otherwise, the current detection region is used directly. The first historical detection region may in turn have been determined from the second historical detection region, which may have been determined from the historical detection region of frame N-3, and so on. Optionally, the detection region of the first frame of the stream is not stabilized.
After the determined current detection region is obtained, it can be stored at the first predetermined location, becoming a historical detection region for frame N+1 and used to stabilize the detection region of the target object in that frame. By analogy, the detection regions of frames N+2, N+3, and so on are stabilized in the same way.
This embodiment can also stabilize each corresponding key point of the first set using the positions of the key points in each group. For example, the position of key point a_N in the first set is stabilized using the positions of A1 in the first group, A2 in the second group, ..., Am in the m-th group, yielding the position of A_N in the current set; b_N is stabilized using B1, B2, ..., Bm, yielding B_N; and z_N is stabilized using Z1, Z2, ..., Zm, yielding Z_N. In this way each key point of the first set is stabilized by the corresponding key points of the groups, yielding the N-th current object key point set {A_N, B_N, ..., Z_N}.
Optionally, after the N-th current key point set is obtained, it is stored at the second predetermined storage location (which may be the same as the first) to be used for stabilizing the key point sets of later frames, for example frame N+1: the position of a_{N+1} is stabilized using A_N, A1, A2, ..., Am, yielding A_{N+1}; b_{N+1} using B_N, B1, B2, ..., Bm, yielding B_{N+1}; and z_{N+1} using Z_N, Z1, Z2, ..., Zm, yielding Z_{N+1}. Thus, based on the N-th set {A_N, B_N, ..., Z_N}, the (N+1)-th set {A_{N+1}, B_{N+1}, ..., Z_{N+1}} is obtained. By analogy, the key point sets of frames N+2, N+3, and so on are stabilized in the same way; a sketch of the rolling storage this implies follows.
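The rolling storage described above can be pictured as a fixed-size buffer of stabilized key point sets; the sketch below is one hypothetical way to hold the last w groups, not a structure named in the text.

```python
from collections import deque

class KeypointHistory:
    """Fixed-size buffer of the last w stabilized key point sets; each newly
    stabilized result is pushed so it can help smooth the following frames."""
    def __init__(self, window):
        self.buffer = deque(maxlen=window)

    def push(self, stabilized_points):
        self.buffer.append(stabilized_points)

    def groups(self):
        return list(self.buffer)  # oldest first, newest last
```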
In an optional implementation, stabilizing the positions of the first object key point set using multiple groups of second object key point sets includes: determining, from the first set, the position of a first target object key point to be stabilized; determining, from each group of second key points, the position of the second target object key point corresponding to the first one, obtaining multiple second target key point positions, where the second target key point indicates the same part of the target object as the first; obtaining a weighted sum of the positions of the multiple second target key points; determining target coefficients from the frame rate of the target video stream; and smoothing the position of the first target key point according to the weighted sum and the target coefficients, to obtain its stabilized position.
The historical frames of this embodiment may be multiple frames adjacent to the current one, and the positions of the key points of the first set are stabilized one by one using the corresponding groups, yielding the positions of the current target object key point set. Optionally, the position of a first target key point to be stabilized is chosen from the first set, and the position of the corresponding second target key point is taken from each group, obtaining multiple second target key point positions. The second target key point indicates the same part of the object as the first; for example, both indicate the eye of a target human body.
This embodiment stabilizes the first target key point using the detection results of the multiple second target key points, for example by spatio-temporal filtering. Optionally, the first positions of the multiple second target key points on the target object are obtained, denoted $\{p_{t-i}\}_{i=0:w}$, where $t$ denotes the current video frame and $w$ denotes the number of historical frames; a weighted combination of $\{p_{t-i}\}_{i=0:w}$ is then computed. Target coefficients $c_1$ and $c_2$ are determined from the frame rate of the target video stream, and the position of the first target key point on the object is smoothed according to the weighted sum and the coefficients, yielding the stabilized position $p_t'$:

$$p_t' = \frac{\sum_{i=0}^{w} w_i\, p_{t-i}}{\sum_{i=0}^{w} w_i}, \qquad w_i = e^{-i^2 / c_1}\, e^{-(p_{t-i} - p_t)^2 / c_2}$$

The weight $w_i$ takes the temporal factor into account; $w$ is the number of historical frames, that is, the size of the spatio-temporal filtering window.
The stabilized position of the first target key point, obtained by smoothing according to the weighted sum and the coefficients, differs from the position of the second target key point in the adjacent historical frame by less than the second target threshold, which achieves the stabilization of the first target key point.
Optionally, this embodiment stabilizes the other key points of the first set besides the first target key point by the same method, yielding multiple stabilized key points. This ensures that the object's key points are steady over the video sequence, eliminates prediction error, exhibits stronger spatio-temporal consistency across consecutive frames, reduces jitter, and improves the accuracy of key point location.
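A sketch of the spatio-temporal filter, applied to one coordinate (x or y) of one key point. It follows the formula as reconstructed above; since the source renders the equations as images, the exact weight form is an assumption consistent with the surrounding description, with `c1` and `c2` as the frame-rate-dependent coefficients.

```python
import numpy as np

def smooth_keypoint(history, c1, c2):
    """history[0] is the raw position p_t in the current frame; history[i]
    is the stabilized position p_{t-i} from the i-th previous frame.
    Weights decay with temporal distance and with displacement from p_t."""
    history = np.asarray(history, dtype=float)
    p_t = history[0]
    idx = np.arange(len(history))
    weights = np.exp(-idx**2 / c1) * np.exp(-(history - p_t)**2 / c2)
    return float(np.sum(weights * history) / np.sum(weights))
```

In use, the filter runs independently on the x and y components of each of the 66 key points, with the window size w and the coefficients chosen to suit the frame rate.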
In another optional implementation, step S202 of detecting the target object in the current frame to obtain the current detection region includes: detecting the current frame to obtain multiple first candidate detection regions; and determining, among them, the first candidate region with the largest overlap with the historical detection region as the current detection region.
In this embodiment, multiple target video frames, the detection regions of multiple target objects, and the confidences of those regions serve as training data for a first sub-target model, where the frames are the input data and the regions and confidences the output data; each detection region may be the region of the target object in the scene displayed by each frame's image of the target video. Training on this data with a deep neural network yields a first target model, used to detect the frames of the target video and produce the detection regions of multiple objects together with their confidences. The first sub-target model of this embodiment may be an initially built deep-neural-network detection model.
Optionally, the first target model of this embodiment is a network model (MobileNet v1) trained on the open-source single-shot multi-box detector (SSD) architecture, with the number of channels reduced to 1/4 of the original according to mobile-side requirements, to ease deployment and acceleration of the model.
The detection algorithm for the object detection region may detect the current frame with the first target model, producing multiple first candidate regions and their confidences, where a candidate's confidence indicates the probability of it being determined as the current detection region. The candidate with the largest overlap with the historical detection region is selected among them as the current detection region. Optionally, the historical detection region of this embodiment may comprise multiple historical regions corresponding to the target object in multiple adjacent historical frames, in which case the first candidate overlapping at least two of them, with the largest overlap, is selected.
In another optional implementation, determining the candidate with the largest overlap with the historical detection region as the current detection region includes: when the historical frame is a single adjacent frame, determining, among the multiple first candidates, the one with the largest overlap with that frame's historical detection region as the current detection region.
In this embodiment, the historical detection region of the adjacent historical frame serves as the reference object, and the first candidate with the largest overlap with it is determined as the current detection region.
For example, take the historical detection region A corresponding to the target object in the adjacent historical frame as the reference. The picture of the current frame is resized to 300x300 as the input of the network model, producing 1000 first candidate regions; the candidate with the largest IOU with region A is taken as the current detection region, guaranteeing that the human-body detection box locates the same target object at every moment.
In an optional implementation, determining the candidate with the largest overlap with the historical detection region as the current detection region includes: selecting a target number of target candidate regions from the multiple first candidates, where each target candidate's confidence is greater than or equal to that of any remaining first candidate; and determining, among the target candidates, the one with the largest overlap with the historical detection region as the current detection region.
In this embodiment, a target number of candidates is selected from the first candidates, for example three candidates B0, B1, and B2 whose confidences are the largest among all the candidate regions; then, among them, the one with the largest overlap with the historical detection region is determined as the current detection region, which guarantees that the human-body detection box locates the same target object at every moment.
In an optional implementation, before step S202 of detecting the target object in the current frame, the method further includes: detecting the historical frame adjacent to the current frame to obtain multiple second candidate detection regions; and, when the adjacent historical frame is the first frame of the target video stream, determining the second candidate with the largest confidence as the historical detection region, where a confidence indicates the probability of the corresponding second candidate being determined as the historical detection region.
In this embodiment, before the current detection region associated with the target object in the target video stream is detected, the adjacent historical frame is detected with the first target model to obtain multiple second candidate regions and their confidences; the historical detection region is the detection result for the target object in that frame. Optionally, when the adjacent historical frame is the first frame of the stream, the second candidate with the largest confidence is directly determined as the historical detection region.
In this embodiment, for the first frame of the stream, the candidate with the largest confidence among the candidates determined by the first target model is determined as the object detection region; in subsequent frames, the candidate with the largest overlap with the previous frame's object detection region is selected, guaranteeing that the object detection region locates the same target object at every moment.
Optionally, in this embodiment, when detecting the target object's detection region, detection may be performed once every several frames by the foregoing detection method, improving processing efficiency. Optionally, a first video frame has already obtained its detection region by the method above; a second video frame follows it in the stream, and if the two are separated by a first number of frames, the interval-detection condition is met and the second frame is detected with the first target model, producing multiple third candidate regions and their confidences, a third candidate's confidence indicating the probability of it being determined as the detection region associated with the target object of the second frame.
After the second frame is detected with the first target model, the third candidate with the largest overlap with the detection region of the target object in the frame immediately preceding the second frame is determined as the object detection region associated with the second frame.
Considering processing performance, this embodiment does not pass every frame through the region-detection algorithm; detection can be run once every first number of frames, and the larger that number, the higher the efficiency and the shorter the time. Optionally, when the confidences of the current frame's object key points are generally low, the current frame is detected with the human-body detection algorithm.
In this embodiment, not every frame obtains its human-body detection box through the detection algorithm; the detection result of the second object key point set in the adjacent historical frame can be used to generate the human-body box associated with the target object in the current frame.
The historical detection region of this embodiment indicates the region of the target object in the scene displayed by the adjacent historical frame's image. The current detection region associated with the current frame's target object can be generated from the second object key point set within the historical detection region; the region indicated by the generated current detection region contains the region occupied by that key point set, for example all of its key points. The minimal rectangle enclosing the second key point set can be extended along the vertical direction by a target proportion of its side length, for example by 1/5, yielding the current detection region and thereby determining the region associated with the target object in the target video.
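A minimal sketch of deriving the current box from the previous frame's key points, as just described: take the minimal enclosing rectangle and extend it vertically by 1/5 of a side length. Whether the extension is split between the top and bottom edges is an assumption here.

```python
def box_from_keypoints(points, expand_ratio=0.2):
    """points: iterable of (x, y) key point coordinates from the previous
    frame. Returns (x1, y1, x2, y2): their minimal enclosing rectangle,
    padded vertically by expand_ratio of its height on each end."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    x1, x2 = min(xs), max(xs)
    y1, y2 = min(ys), max(ys)
    pad = (y2 - y1) * expand_ratio
    return (x1, y1 - pad, x2, y2 + pad)
```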
In another optional implementation, step S208 of performing key point location based on the determined current detection region includes: when the target object in the current frame does not lie entirely within the determined current detection region, expanding the determined current detection region outward about its center. The outward expansion amounts to adaptively increasing the region's width and height, yielding a target object detection box such that the region occupied by the target object in the scene lies entirely within it; the expansion yields the target detection region. The first object key point set is then obtained from the target image of the target object contained in the target detection region.
In another optional implementation, obtaining the first object key point set from the target image includes: processing the target image to obtain multiple groups of confidences of the first key point set, each group used to predict the position of one key point in the set; constructing a target matrix from each group; determining a first target coordinate from the row and column, in the corresponding target matrix, of the largest confidence in each group; and determining, from the first target coordinate, the position of one key point of the first set.
In the object key point detection algorithm of this embodiment, multiple images containing objects and multiple object key points serve as training data for a second target model, which is trained by deep learning; an object key point indicates one part of an object, and the second target model may be an initially built detection model.
Optionally, the key point detection algorithm of this embodiment is based on the Feature Pyramid Network (FPN) in a deep learning model, with a slimmed-down VGG as the base network. Optionally, the convolutional layers are replaced with residual blocks, and Batch Normalization and the PReLU activation function are applied after the convolutional layers to improve key point detection accuracy.
Through the second target model, this embodiment processes the target image within the target object detection box: the image is fed into the FPN, producing heat maps of the first key point set corresponding to multiple target matrices (heat-map matrices); the target image contains the target object, that is, it is a local image patch containing the human-body region. The size of the heat maps produced by the FPN is proportional to the size of the input target image, so the target matrices also correspond to the input image. The FPN yields multiple groups of confidences of the first key point set, each group predicting the position of one key point; a target matrix is constructed from each group, each confidence in a group predicting the position of the corresponding key point on the object. Optionally, there are 66 target matrices, corresponding one-to-one with 66 object key points. Optionally, the largest (first) confidence is selected among the confidences, the first target coordinate $P_{m1}$ is determined from its row and column in the target matrix, and the position of one key point of the first set is then determined from the first target coordinate.
When determining that position from the first target coordinate, since the target matrix corresponds to the network's input target image, the position of the first key point on the target image can be computed back from this correspondence and from the first target coordinate determined by the row and column of the matrix's maximum confidence. Optionally, if the target image was derived from an initial image, its position and scale within the initial image are also determinate, so the position of the first key point in the initial image can be computed.
In another optional implementation, determining the position of the corresponding key point from the first target coordinate includes: determining a second target coordinate from the row and column, in the target matrix, of the second-largest confidence in each group; offsetting the first target coordinate toward the second target coordinate by a target distance; and determining, from the offset coordinate, the position, on the target object, of the key point corresponding to the target matrix.
In this embodiment, because of noise, the heat-map energy is mostly not normally distributed, and predicting the point position with the maximum confidence alone is inaccurate. The second target coordinate $P_{m2}$ is determined from the row and column of the second confidence in the target matrix, where the second confidence is smaller than the first and larger than any third confidence (any confidence other than the first and second), that is, it is the second-largest; the first target coordinate is then offset toward the second by a target distance, for example

$$P = P_{m1} + 0.25\,(P_{m2} - P_{m1})$$

and the position on the target object of the key point corresponding to the target matrix is determined from the offset coordinate.
Following this method, this embodiment determines the first object key point set on the target object, for example a set of 66 object key points.
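A sketch of decoding one heat map with the quarter-offset rule above; it assumes the heat map arrives as a 2-D array and returns (row, column) coordinates in heat-map space.

```python
import numpy as np

def decode_heatmap(heatmap):
    """Take the maximum-confidence cell P_m1, then shift it a quarter of the
    way toward the second-largest cell P_m2: P = P_m1 + 0.25 * (P_m2 - P_m1)."""
    flat = heatmap.ravel()
    top2 = np.argsort(flat)[-2:][::-1]          # indices of the two largest
    p_m1 = np.array(np.unravel_index(top2[0], heatmap.shape), dtype=float)
    p_m2 = np.array(np.unravel_index(top2[1], heatmap.shape), dtype=float)
    return p_m1 + 0.25 * (p_m2 - p_m1)
```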
Optionally, the current object detection region is adjusted according to a first target proportion, and the target image containing the target object within the determined region is processed to obtain the first object key point set.
Optionally, when processing the target image within the determined target object detection region to obtain the key points of the first set, the method further includes: adjusting the first target coordinate according to a second target proportion, the reciprocal of the first; and determining the position, on the target object, of the point corresponding to the adjusted first target coordinate as the position of the key point of the first set on the target object.
In this embodiment, the target image processed by the model has a size requirement, for example width x height of 192x256, a 3:4 ratio. Since the object detection box rarely has a 3:4 ratio, it is adjusted according to the first target proportion, for example cut to 3:4, so that it can conveniently be scaled to 192x256 as the input of the second target model; the second target model then processes the target image containing the target object within the determined current detection region to obtain the first key point set.
In this embodiment, after the first confidence is selected and the first target coordinate determined from its row and column in the target matrix, the coordinate is adjusted according to the second target proportion: according to the cutting position and scaling of the human-body detection box, the first target coordinate is computed back to the coordinates of the original target image, and the point on the current frame's target object corresponding to the adjusted coordinate is determined as the key point of the first set.
Optionally, in this embodiment, the size of the heat map is proportional to the network input (the target image); from this proportionality and the heat map's size, the key point's position on the target image can be computed back. Since the target image comes from the target detection region obtained by outward expansion of the determined current detection region, whose position and scale are also determinate, the key point's position in the image of the target detection region can be computed.
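A sketch of this back-mapping step: a point predicted in the 192x256 network input is returned to original-frame coordinates by undoing the resize and then the crop translation. The argument layout is an assumption for illustration.

```python
def to_original_coords(point, crop_origin, crop_size, input_size=(192, 256)):
    """point: (x, y) in network-input coordinates; crop_origin: top-left
    (x, y) of the 3:4 crop in the original frame; crop_size: (width, height)
    of the crop. Undo the scaling to input_size, then the crop offset."""
    scale_x = crop_size[0] / input_size[0]
    scale_y = crop_size[1] / input_size[1]
    return (point[0] * scale_x + crop_origin[0],
            point[1] * scale_y + crop_origin[1])
```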
Each key point of the first object key point set can be determined by the foregoing method, and the position of each one is then stabilized by the key point stabilization algorithm.
FIG. 4 is a flowchart of an image processing method according to an embodiment of this application. As shown in FIG. 4, the method includes the following steps.
Step S402: Detect a target object in a current video frame of a target video stream, to obtain a current detection region for the target object.
It should be noted that the technical solution of step S402 may include that of step S202.
Step S404: Obtain a determined current detection region according to the current detection region and the historical detection region corresponding to the target object in a historical video frame of the target video stream.
It should be noted that the technical solution of step S404 may include those of steps S204 and S206.
Step S406: Perform key point location on the target object based on the determined current detection region, to obtain a first object key point set.
It should be noted that the technical solution of step S406 may include that of step S208.
Step S408: Stabilize the positions of the first object key point set according to the positions of the second object key point set corresponding to the target object in the historical frame, to obtain the positions of the current target object key point set in the current video frame.
It should be noted that the technical solution of step S408 may include those of steps S210 and S212.
Step S410: Recognize a part of the target object from the current video frame according to the positions of the current target object key point set.
In the technical solution of step S410, the current key point set is the stabilized version of the first target object key point set; its jitter relative to the key point set of the historical frame is small, and the key point prediction error has been eliminated. Each key point of the current set can indicate one part of the target object; for example, when the target object is a human body there are 66 key points indicating 66 different parts and covering the body's contour. On this basis, parts of the target object, such as ears, mouth, nose, and eyes, are accurately recognized from the current frame according to the positions of the current key point set; the jitter between the region of the part recognized in the current frame and that recognized in the historical frame is small.
Step S412: Adjust the recognized part of the target object.
After the part is recognized from the current frame, an image of the recognized part is displayed, and an adjustment instruction for the part is received. Optionally, the user determines from the displayed image which part needs adjusting and triggers the instruction by operating on that part; in response, the part is adjusted in real time. For example, the user drags the slider of a "waist slimming" function to trigger a slimming instruction, and the degree of slimming is adjusted in real time in response; leg lengthening, hip lifting, and similar adjustments can likewise be achieved, without limitation here.
Step S414: Display an image of the adjusted target object.
In the technical solution of step S414, the displayed image presents the adjustment effect on the target object's part; if the effect has not reached the expected result, the recognized part can continue to be adjusted. Because the adjustment is based on the stabilized first key point set, the effect is more natural, fine-grained real-time processing of the target object is achieved, and the poor results caused by jitter between consecutive frames are avoided, making the effect more realistic and natural and narrowing the gap from the natural beautification users accept.
In this embodiment, the detection algorithms for the object detection region and for the object key points are based on deep neural networks, and the stabilization algorithm may be based on spatio-temporal filtering; these three are the core of real-time key point tracking and determine the accuracy of the final key points and their stability across frames. The deep-network approach gives good convergence and generalization on domain data, so both the human-body box and the point locations reach good precision. Stabilizing the current detection region provides a stable background for key point detection and reduces errors caused by background changes, while stabilizing the positions of the first key point set eliminates point prediction errors, exhibits stronger spatio-temporal consistency across frames, reduces jitter, and improves key point detection accuracy. Adjusting the target object's parts with the stabilized key points makes the adjustment effect more natural, avoids poor processing results when jitter occurs between frames, and thereby achieves the technical effect of improving the efficiency of processing the target object.
It should be noted that the solution of this application can detect any target object whose key points need locating with improved accuracy, for example moving objects such as human bodies or other animals (dogs, cats, and so on); the detection process is essentially the same for any target object, and for ease of description the following takes a human body as the example.
The technical solution of this application is described below with another embodiment, taking the target object as a human body and the detection region as a human-body box, and illustrating the human-body detection algorithm with its stable tracking algorithm as well as the 66-point human key point detection algorithm with its stable tracking algorithm.
FIG. 5 is a flowchart of a human key point locating method according to an embodiment of this application. As shown in FIG. 5, the method includes the following steps.
Step S501: Input a video frame picture.
The video of this embodiment may be a short video with people as its subject, comprising multiple frame pictures. A frame picture containing an image of a human body is input, and the body is located and tracked in the current frame picture.
Step S502: Detect the current frame picture with the human-body detection algorithm, to obtain the current human-body detection box.
The detection algorithm is implemented with a high-performance human detection model trained on a deep neural network; the deep network gives the model good convergence and generalization on domain data, so box detection reaches good precision. In a specific implementation, the model may be a MobileNet v1 trained on the SSD architecture, and the network's channel count may be reduced to 1/4 of the original according to mobile-side requirements, easing deployment and acceleration of the model.
To ensure the model's effectiveness, during image processing the current frame picture is resized to 300x300 as the human detection model's input; the model can output 1000 candidate human-body boxes along with their confidences, a confidence indicating the probability of a candidate being selected as the human-body box.
Since most short-video scenes have one person as the subject, detection and tracking of one person in a short-video scene is taken as the example below. When the current frame picture is the first frame of the video, the candidate with the largest confidence among its 1000 candidates is determined as the box containing the person, yielding current human-body box A.
When the current frame picture is not the first frame of the video, multi-box verification is needed: from the 1000 candidates output by the model for the current frame, the three with the largest confidences, B0, B1, and B2, are selected and verified against human-body box A of the previous frame picture, to determine the current frame's human-body box.
The verification process is explained with reference to FIG. 6, the schematic diagram of multi-box verification provided by an embodiment of this application. Box A verifies boxes B0, B1, and B2: the overlap IOU between A and each of B0, B1, and B2 is computed as

$$\mathrm{IOU}(A, B) = \frac{\operatorname{area}(A \cap B)}{\operatorname{area}(A \cup B)}$$

that is, the ratio of the first area, where A and B intersect, to the second area, their union, with B taken as B0, B1, and B2 in turn.
Among B0, B1, and B2, the box with the largest IOU with A, here B1, is selected as the current frame's human-body box, guaranteeing that the box locates the same person at every moment.
Optionally, to improve locating efficiency, the human-body detection algorithm above can be applied only to some frame pictures, while for the others a human-body box is computed from the previous frame's key point positions and used as the current box. Compared with running the detector on every frame, this saves computation and improves locating efficiency. Which frames run the detector and which rely on their preceding frames can be configured flexibly in practice. For example, interval detection may be used, running the detector once every 4 frames: the first frame runs the detector, frames two through five derive their boxes from the key points detected in their respective preceding frames, the sixth frame runs the detector again, and so on. Alternating the two approaches in this way improves running efficiency, and the more frame pictures in the interval, the higher the efficiency. Alternatively, the key point detection confidences can decide when to run the detector: only when the previous frame's key point confidences are generally low is the current frame picture detected with the human-body detection algorithm. A sketch of such a schedule follows.
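A sketch of the interleaved schedule: run the full detector on some frames and derive the box from the previous frame's key points on the rest. The 5-frame interval, the `detector` callable, and the reuse of `box_from_keypoints` from the earlier sketch are illustrative assumptions.

```python
DETECT_EVERY = 5  # run the detector once every 5 frames (example interval)

def box_for_frame(frame_idx, frame, prev_keypoints, detector):
    """Full detection pass on scheduled frames or when no key point history
    exists; otherwise a cheap box computed from the last frame's key points."""
    if prev_keypoints is None or frame_idx % DETECT_EVERY == 0:
        return detector(frame)
    return box_from_keypoints(prev_keypoints)
```

A confidence-driven variant would instead call `detector` whenever the previous frame's key point confidences fall below a threshold.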
Step S503: Stabilize the current frame picture's human-body box using the previous frame picture's detection result, obtain the stabilized current box, and cache it.
The region of the current human-body box determines the input of the key point detection algorithm, and the stability of the background around the body within the region also affects the temporal stability of the key points over the video. This embodiment therefore adopts the stable tracking algorithm for human detection: the previous frame's detection result stabilizes the current frame's box, the stabilized box is cached, and a stable background is thereby provided for key point detection in the current frame, so that the background fed to the key point model is locally stable in the time domain across frames and errors caused by background changes are reduced. The stable tracking algorithm for human detection, that is, the box-stabilization algorithm, performs the stabilization of the human-body box.
Optionally, when stabilizing the current frame's box with the previous frame's result, it is set that if the IOU between the current frame's box and the previous frame's box is greater than a target threshold, which may be 0.4, the previous frame's box continues to be used as the current frame's box. That is, this embodiment can either compute a box containing the body from the previous frame's key point positions as the current box, or stabilize the current box with the previous frame's box; if the current frame picture is the first frame of the video, its box needs no stabilization.
FIG. 7 is a schematic diagram of the position change of the human-body box across video frame pictures according to an embodiment of this application. As shown in FIG. 7, the box position p (x or y component) converts from the original curved trajectory into a staircase trajectory, so that the box's size and position are unchanged over most local time windows and the background fed to the key point model across frames is locally stable in the time domain.
Step S504: Input the local human-body region of interest into the human key point detection algorithm, to obtain the first human key point set.
The key point detection algorithm of this embodiment is based on a deep neural network, which gives good convergence and generalization on domain data and brings point location to good precision; the computation can use a spatio-temporal filtering algorithm.
Optionally, the algorithm is based on the Feature Pyramid Network (FPN) in a deep learning model. The FPN mainly addresses the multi-scale problem in object detection: with simple changes to network connections and essentially no increase in the original model's computation, it greatly improves small-object detection performance.
FIG. 8 is a schematic diagram of a feature pyramid network structure according to an embodiment of this application. As shown in FIG. 8, the input is a video frame picture (Image) and the output is key point heat maps (Heat Maps); the base network is a slimmed-down VGG. Optionally, this embodiment replaces the convolutional layers with residual blocks and applies Batch Normalization and the PReLU activation function after the convolutional layers, to improve key point detection accuracy; a sketch of such a unit follows.
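The sketch below shows one way such a residual unit with Batch Normalization and PReLU could look in PyTorch; it is an illustration of the described substitution, not the patent's exact architecture, and the channel counts and kernel sizes are assumptions.

```python
import torch.nn as nn

class ResidualBlock(nn.Module):
    """A convolutional layer replaced by a residual unit, with BatchNorm and
    PReLU after the convolutions, as described for the slimmed backbone."""
    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.PReLU(channels),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
        )
        self.act = nn.PReLU(channels)

    def forward(self, x):
        return self.act(self.body(x) + x)
```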
It should be noted that basing the key point detection algorithm on the FPN structure is only one example of the embodiments of this application; the algorithm may also be designed on the HourGlass structure, and the base network may be VGG, MobileNetV1, MobileNetV2, ShuffleNetV1, ShuffleNetV2, or other small networks and their variants, without any restriction here.
FIG. 9 is a schematic diagram of the human key point distribution according to an embodiment of this application. As shown in FIG. 9, there are key points 1 to 66, each indicating the position of one part on the human body. Table 1, the human key point definition table, defines the key points, with left and right determined by the orientation in the picture.
Table 1. Human key point definition table
[The body of Table 1 is rendered as three images in the source (PCTCN2019113611-appb-000007 through -000009); the per-point definitions are not recoverable from the text.]
Training the human key point network model of this embodiment relies on a large amount of strictly annotated data; ambiguity in the annotation would invalidate the training. Defining the key point positions by the human model of FIG. 9 is fairly clear, but in real scenes human motion is very complex, and "left", "right", "inner", and "outer" in Table 1 are often hard to distinguish. This embodiment therefore also defines a standard: the side of the body that appears first when scanning from left to right is the body's left side, the shoulder that appears first from left to right is the left shoulder, and the thigh that appears first from left to right is the left thigh; inner and outer are defined relative to the body's midline. With this standard, data annotation is unambiguous.
After the previous frame picture's detection result has stabilized the current frame's human-body box, yielding the stabilized box, the region of interest is input into the human key point detection algorithm, obtaining the first human key point set.
Optionally, because of limited algorithm accuracy, the human-body box located by the stabilization mechanism does not necessarily frame the body exactly. To guarantee that the whole body is included, this embodiment expands the rectangular region containing the body appropriately: keeping the box center unchanged, its width and height are adaptively increased, that is, the box is enlarged outward a little about its center.
Optionally, in this embodiment, the size of the heat maps is proportional to the network's input image; from this proportionality and the heat-map size, the key points' positions on the target image can be computed back, and since the target image comes from the original picture, whose position and scale are also determinate, the key points' positions in the original picture can be computed.
Optionally, the key point detection algorithm of this embodiment requires an input of width 192 and height 256, that is, a 3:4 width-to-height ratio. Since the expanded historical detection region rarely has a 3:4 ratio, it is processed to 3:4 so that it can conveniently be scaled to 192x256 as the algorithm's input. The region covered by the expanded historical detection region can be the region of interest, the result of secondary processing of the historical region. After processing to 3:4, it is input into the FPN, which predicts the heat maps (Heat Maps) of the 66 human key points, that is, 66 matrices in one-to-one correspondence with the 66 key points, each matrix expressing the confidence of each element's position. Optionally, this embodiment can map the position (row, column) of a heat-map matrix's maximum confidence to the image covered by the original historical region, computing back the key point's coordinate value on the body in that image, that is, the point position.
However, because of noise, most heat maps are not normally distributed, and predicting the point position with the maximum confidence alone is inaccurate; for each heat map, this embodiment therefore predicts by offsetting the maximum-value point $P_{m1}$ toward the second-largest point $P_{m2}$ by a quarter of the distance:

$$P = P_{m1} + 0.25\,(P_{m2} - P_{m1})$$

This yields the local coordinates of the 66 key points. Then, according to the human-body box's cutting position and the 3:4 scaling, the coordinates in the original picture are computed back, determining the 66 key points' coordinates on the body.
Step S505: Stabilize the current frame picture's first human key point set using the detection results of historical frame pictures, obtain the second human key point set, and cache it.
The stable tracking algorithm for human key point detection of this embodiment, that is, the point-stabilization algorithm, performs the stabilization of the body's key points. After the local body region of interest is input into the key point detection algorithm and the first key point set obtained, the stable tracking algorithm stabilizes the current frame's first key points using the historical frame pictures' detection results, yielding the second human key point set, which is cached and output.
In this embodiment, the key point algorithm can compute each frame's key points, but prediction error makes the point positions appear to jitter in the video; a temporal point-stabilization algorithm is needed to reduce the inter-frame jitter. Suppose a key point's position in the t-th frame is $p_t$ (x or y component); the positions $\{p_{t-i}\}_{i=0:w}$ of the points corresponding to $p_t$ in the historical frame pictures adjacent to the current frame can be weighted to spatio-temporally filter $p_t$ and recompute it:

$$p_t' = \frac{\sum_{i=0}^{w} w_i\, p_{t-i}}{\sum_{i=0}^{w} w_i}, \qquad w_i = e^{-i^2 / c_1}\, e^{-(p_{t-i} - p_t)^2 / c_2}$$

where the historical frame pictures are the frames immediately preceding the current one, and $w$ is the number of historical frame pictures, that is, the window size of the spatio-temporal filtering algorithm. Suitable values of $c_1$ and $c_2$ are chosen according to the frame rate, and each of the 66 key points is smoothed separately, guaranteeing that the point positions are steady over the video sequence.
This embodiment eliminates key point prediction error, exhibits stronger spatio-temporal consistency across frame pictures, and reduces jitter, thereby guaranteeing the accuracy and stability of the human key points across consecutive frame pictures.
In this embodiment, the frame picture is input into the human-body detection algorithm to detect the box, obtaining the current human-body detection region; the stable tracking algorithm for human detection stabilizes the current frame's current detection region with the previous frame picture's detection result, yielding the historical detection region; the stabilized historical region is expanded outward to select the body region of interest, and the expanded region is input into the human key point detection algorithm, obtaining the first human key point set. Based on the stable tracking algorithm for key point detection, the current first key point set is likewise stabilized with the historical frames' detection results, obtaining the second human key point set. The body's parts are then adjusted via the second key points, meeting body-beautification needs and achieving a fine-grained beautification effect close to the natural beautification users accept.
In this embodiment, the human-body detection algorithm and the key point detection algorithm are based on deep neural networks, and the stabilization algorithm can be based on spatio-temporal filtering; these three are the core of real-time key point tracking and determine the accuracy of the final key points and their inter-frame stability. The deep-network approach gives good convergence and generalization on domain data, so both box detection and point location reach good precision; the box-stabilization algorithm provides a stable background for key point detection and reduces errors from background changes, while the point-stabilization algorithm eliminates point prediction error, exhibits stronger spatio-temporal consistency across frame pictures, and reduces jitter.
The solution of this embodiment achieves real-time locating, on mobile devices, of human key points covering the body contour: the body is first located and tracked in the video, and the local key points are then detected and stably tracked locally. The human detection model and the key point detection model total 3 MB, supporting real-time body beautification at 30 fps on mobile, with a more natural effect.
This embodiment can be applied to short-video apps, phone camera features, image-editing software, and the like. In selfie, dancing, and similar scenes it can realize face slimming, lip coloring, skin smoothing, and so on, as well as breast enhancement, waist slimming, leg lengthening, and leg slimming, and can achieve real-time waist-slimming, leg-stretching, and hip-lifting effects; it adapts to front-facing, side-on, back-facing, and squatting poses and to complex limb motions, meeting body-beautification needs in diverse scenes.
FIG. 10 is a schematic diagram of key point location according to an embodiment of this application. As shown in FIG. 10, this embodiment is a mobile real-time key point tracking method covering the human contour: the current frame picture contains a human body with 66 key points indicating the positions of the body's parts, as listed in Table 1. The 66 key points are stabilized using the detection results of historical frame pictures, yielding the stabilized 66 key points.
FIG. 11 is a schematic diagram of a scene of detecting the human-body box of a target person according to an embodiment of this application. As shown in FIG. 11, a dancing video of the target person is shot with a mobile terminal, the person in the video is detected, and the person's human-body box, which can frame the body's contour, is determined. Box A indicates the position and extent of the region where the person is located in the target scene displayed by the current frame's image of the dancing video, and box B indicates the position and extent in the scene displayed by the adjacent historical frame's image. The overlap is the ratio of the area where boxes A and B intersect to the area of their union:

$$\mathrm{IOU}(A, B) = \frac{\operatorname{area}(A \cap B)}{\operatorname{area}(A \cup B)}$$

When the overlap between boxes A and B is greater than the target threshold, box B is directly determined as the current frame's human-body box, so that the box's size and position are unchanged over most local time windows, achieving the stabilization of the current frame's box.
FIG. 12 is a schematic diagram of a scene of detecting the human key points of a target person according to an embodiment of this application. As shown in FIG. 12, after box B of FIG. 11 is determined as the current frame's box, the image it contains is input into the key point detection algorithm, yielding multiple human key points. Optionally, one key point is selected, and the stabilization algorithm uses the positions B1, B2, and B3 of the corresponding target key point in historical frames to stabilize the corresponding key point position a of the current frame: positions B1, B2, and B3 are weighted and summed, and position a is smoothed by spatio-temporal filtering, yielding the target key point's position A on the person. Position A is the result of stabilizing position a, reducing inter-frame point jitter.
FIG. 13 is a schematic diagram of a body-beautification feature entry according to an embodiment of this application. As shown in FIG. 13, on the terminal, the feature entry is opened and the human body in the video frame is detected. After the 66 key points of the current frame picture are stabilized, the point prediction error is eliminated and stronger spatio-temporal consistency appears across frame pictures; the body's parts are then adjusted via the stabilized key points. For example, the "waist slimming" function is selected, and its slider is dragged to adjust the slimming degree in real time, achieving waist slimming with a finer beautification effect closer to the natural beautification users accept.
FIG. 14 is a schematic comparison before and after slimming according to an embodiment of this application. As shown in FIG. 14, on the terminal, the feature entry is opened and the body in the video frame is detected. After the 66 key points of the current frame picture are stabilized, the body as a whole is adjusted via the stabilized key points, for example by selecting the overall slimming function, achieving body slimming with a finer effect closer to the natural beautification users accept.
This embodiment can assist body-beautification functions such as slimming and leg lengthening, as well as attaching body accessories. It avoids the drawbacks of single-image Photoshop solutions, which need manual intervention, cost time and labor, and limit application scenarios; it also avoids the drawback that simply stretching the body produces unrealistic results, thereby improving the efficiency of processing objects in video and improving the user experience.
It should be noted that every human key point can be stabilized by the foregoing method, so that the multiple key points are each smoothed and the point positions stay steady over the video sequence.
As another optional example, this embodiment opens the body-beautification entry and detects the body in the video frame. After the 66 key points of the current frame picture are stabilized, eliminating point prediction error and yielding stronger spatio-temporal consistency across frame pictures, the body's parts are adjusted via the stabilized key points, for example by selecting the "waist slimming" function and dragging its slider to adjust the slimming degree in real time, achieving a finer beautification effect closer to the natural beautification users accept.
In this embodiment, the box-stabilization algorithm provides a stable background for key point detection and reduces errors from background changes, while the point-stabilization algorithm eliminates point prediction error, exhibits stronger spatio-temporal consistency across frame pictures, and reduces jitter. Adjusting the target person's parts with the stabilized key points makes the adjustment effect more natural, avoids poor processing results when consecutive frames jitter, and thereby achieves the technical effect of improving the efficiency of processing the target person.
It should be noted that, for brevity, the foregoing method embodiments are described as combinations of action sequences, but persons skilled in the art should know that this application is not limited by the described order of actions, since some steps may be performed in other orders or simultaneously according to this application. They should also know that the embodiments described in the specification are preferred ones, and the actions and modules involved are not necessarily required by this application.
From the description of the foregoing implementations, persons skilled in the art can clearly understand that the methods of the embodiments may be implemented by software plus a necessary general hardware platform, or by hardware, though in many cases the former is the better implementation. Based on this understanding, the essence of the technical solutions of this application, or the part contributing to the related art, can be embodied as a software product stored in a storage medium (such as a ROM/RAM, magnetic disk, or optical disc), including several instructions that cause a terminal device (which may be a mobile phone, computer, server, network device, or the like) to perform the methods of the embodiments of this application.
According to another aspect of the embodiments of this application, an object key point locating apparatus for implementing the foregoing object key point locating method is also provided. FIG. 15 is a schematic diagram of such an apparatus according to an embodiment of this application. As shown in FIG. 15, the apparatus 150 may include a detection unit 10, a first obtaining unit 20, a second obtaining unit 30, a locating unit 40, a third obtaining unit 50, and an adjustment unit 60.
The detection unit 10 is configured to detect the target object in the current video frame of the target video stream, obtaining the current detection region for the target object.
The first obtaining unit 20 is configured to obtain the historical detection region corresponding to the target object in a historical video frame of the target video stream.
The second obtaining unit 30 is configured to obtain the determined current detection region according to the historical detection region and the current detection region.
The locating unit 40 is configured to perform key point location on the target object based on the determined current detection region, obtaining the first object key point set.
The third obtaining unit 50 is configured to obtain the second object key point set corresponding to the target object in the historical video frame.
The adjustment unit 60 is configured to stabilize the positions of the first object key point set according to the positions of the second object key point set, obtaining the positions of the current target object key point set in the current video frame.
The second obtaining unit 30 includes an obtaining module, configured to obtain the determined current detection region according to the historical detection region and the current detection region when the historical frame is a single historical frame adjacent to the current frame.
It should be noted that, in this embodiment, the detection unit 10 may perform step S202 of the embodiments of this application, the first obtaining unit 20 step S204, the second obtaining unit 30 step S206, the locating unit 40 step S208, the third obtaining unit 50 step S210, and the adjustment unit 60 step S212.
According to another aspect of the embodiments of this application, an image processing apparatus for implementing the foregoing image processing method is also provided. FIG. 16 is a schematic diagram of such an apparatus according to an embodiment of this application. As shown in FIG. 16, the apparatus 160 may include a detection unit 70, an obtaining unit 80, a locating unit 90, a first adjustment unit 100, a recognition unit 110, a second adjustment unit 120, and a display unit 130.
The detection unit 70 is configured to detect the target object in the current video frame of the target video stream, obtaining the current detection region for the target object.
The obtaining unit 80 is configured to obtain the determined current detection region according to the current detection region and the historical detection region corresponding to the target object in a historical video frame.
The locating unit 90 is configured to perform key point location on the target object based on the determined current detection region, obtaining the first object key point set.
The first adjustment unit 100 is configured to stabilize the positions of the first object key point set according to the positions of the second object key point set corresponding to the target object in the historical frame, obtaining the positions of the current target object key point set in the current video frame.
The recognition unit 110 is configured to recognize a part of the target object from the current video frame according to the positions of the current target object key point set.
The second adjustment unit 120 is configured to adjust the recognized part of the target object.
The display unit 130 is configured to display an image of the adjusted target object.
It should be noted that, in this embodiment, the detection unit 70 may perform step S402 of the embodiments of this application, the obtaining unit 80 step S404, the locating unit 90 step S406, the first adjustment unit 100 step S408, the recognition unit 110 step S410, the second adjustment unit 120 step S412, and the display unit 130 step S414.
It should be noted here that the examples and application scenarios implemented by the foregoing units and modules are the same as those of the corresponding steps, but are not limited to the content disclosed in the foregoing embodiments. The foregoing modules, as part of the apparatus, may run in the hardware environment shown in FIG. 1, and may be implemented in software or in hardware, where the hardware environment includes a network environment.
According to yet another aspect of the embodiments of this application, an electronic device for implementing the foregoing object key point locating method is also provided.
FIG. 17 is a structural block diagram of an electronic device according to an embodiment of this application. As shown in FIG. 17, the device includes a memory 172 and a processor 174; a computer program is stored in the memory, and the processor is configured to perform the steps of any one of the foregoing method embodiments through the computer program.
Optionally, the electronic device may be located in at least one of multiple network devices of a computer network.
Optionally, the processor may be configured to perform, through the computer program, steps S206 to S210 shown in FIG. 2 of the foregoing embodiments.
Optionally, the processor may also be configured to perform, through the computer program, steps S402 to S412 shown in FIG. 4 of the foregoing embodiments.
Optionally, when executed, the computer program may also perform the other steps of the foregoing embodiments; see the descriptions of the embodiments above for details.
Optionally, persons of ordinary skill in the art can understand that the structure shown in FIG. 17 is only schematic; the electronic device may also be a terminal device such as a smartphone (an Android phone, iOS phone, or the like), a tablet, a palmtop computer, a mobile Internet device (MID), or a PAD. FIG. 17 does not limit the structure of the electronic device; for example, the device may include more or fewer components (such as a network interface) than shown in FIG. 17, or have a configuration different from that shown.
The memory 172 may be configured to store software programs and modules, such as the program instructions/modules corresponding to the object key point locating method and apparatus in the embodiments of this application; the processor 174 runs the software programs and modules stored in the memory 172 to perform various functional applications and data processing, that is, to implement the foregoing object key point locating method. The memory 172 may include high-speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memories, or other non-volatile solid-state memories. In some examples, the memory 172 may further include memories disposed remotely from the processor 174, which can connect to the terminal through a network; examples of such networks include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks, and combinations thereof. The memory 172 may specifically, but not exclusively, store information such as the video frames of the target video, the human-body detection box, and the object key points. As an example, the memory 172 may include, but is not limited to, the units of the foregoing object key point locating apparatus, such as the detection unit 10 through the adjustment unit 60, and may further include, but is not limited to, other module units of the foregoing apparatus, which are not described again in this example.
The transmission apparatus 176 is configured to receive or send data via a network; specific examples of the network include wired and wireless networks. In one example, the transmission apparatus 176 includes a network interface controller (NIC), which can connect by network cable to other network devices and a router to communicate with the Internet or a local area network; in another example, it is a radio frequency (RF) module, which communicates with the Internet wirelessly.
In addition, the electronic device further includes a display 178, configured to display the execution status described above, and a connection bus 180, configured to connect the module components of the electronic device.
According to yet another aspect of the embodiments of this application, a storage medium is also provided, storing a computer program configured to perform, when run, the steps of any one of the foregoing method embodiments.
Optionally, in this embodiment, the storage medium may be configured to store a computer program that, when run, performs steps S206 to S210 shown in FIG. 2 of the foregoing embodiments.
Optionally, in this embodiment, the storage medium may be configured to store a computer program that, when run, performs steps S402 to S412 shown in FIG. 4 of the foregoing embodiments.
Optionally, when executed, the computer program may also perform the other steps of the foregoing embodiments; see the descriptions of the embodiments above for details. Optionally, in this embodiment, the storage medium may include, but is not limited to, various media that can store program code, such as a USB flash drive, a read-only memory (ROM), a random access memory (RAM), a removable hard disk, a magnetic disk, or an optical disc.
The sequence numbers of the foregoing embodiments of this application are only for description and do not represent the relative merits of the embodiments.
If the integrated units of the foregoing embodiments are implemented in the form of software functional units and sold or used as independent products, they may be stored in the computer-readable storage medium above. Based on this understanding, the essence of the technical solutions of this application, or the part contributing to the related art, or all or part of the solutions, can be embodied as a software product stored in a storage medium, including several instructions causing one or more computer devices (which may be personal computers, servers, network devices, or the like) to perform all or some of the steps of the methods of the embodiments of this application.
In the foregoing embodiments of this application, the descriptions each have their own emphasis; for parts not detailed in one embodiment, see the related descriptions of other embodiments.
In the several embodiments provided in this application, it should be understood that the disclosed client may be implemented in other ways. The apparatus embodiments described above are only schematic; for example, the division into units is only a logical functional division, and there may be other division manners in actual implementation, for example multiple units or components may be combined or integrated into another system, or some features may be ignored or not performed. In addition, the couplings, direct couplings, or communication connections shown or discussed may be indirect couplings or communication connections through interfaces, units, or modules, and may be electrical or take other forms.
Units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; they may be located in one place or distributed across multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solutions of the embodiments.
In addition, the functional units in the embodiments of this application may be integrated into one processing unit, each unit may exist alone physically, or two or more units may be integrated into one unit; the integrated unit may be implemented in the form of hardware or of a software functional unit.
The foregoing descriptions are only preferred implementations of this application. It should be noted that persons of ordinary skill in the art can make several improvements and refinements without departing from the principles of this application, and these improvements and refinements shall also be regarded as falling within the protection scope of this application.

Claims (29)

  1. An object key point locating method, applied to a terminal device, comprising:
    detecting a target object in a current video frame of a target video stream, to obtain a current detection region for the target object;
    obtaining a historical detection region corresponding to the target object in a historical video frame of the target video stream;
    adjusting the current detection region according to the historical detection region, to obtain a determined current detection region;
    performing key point location on the target object based on the determined current detection region, to obtain a first object key point set;
    obtaining a second object key point set corresponding to the target object in the historical video frame of the target video stream; and
    stabilizing positions of the first object key point set according to positions of the second object key point set, to obtain positions of a current target object key point set in the current video frame.
  2. The method according to claim 1, wherein adjusting the current detection region according to the historical detection region to obtain the determined current detection region comprises:
    determining an overlap between the historical detection region and the current detection region;
    using the historical detection region as the determined current detection region when the overlap is greater than a target threshold; and
    using the current detection region as the determined current detection region when the overlap is less than or equal to the target threshold.
  3. The method according to claim 1, wherein stabilizing the positions of the first object key point set according to the positions of the second object key point set comprises:
    performing coordinate smoothing on the positions of the first object key point set according to the positions of the second object key point set, to obtain the positions of the current target object key point set in the current video frame.
  4. The method according to claim 3, wherein the coordinate smoothing comprises:
    determining, from the first object key point set, a position of a first target object key point to be stabilized;
    determining, from the second object key point set, a position of a second target object key point corresponding to the part indicated by the first target object key point;
    weighting the positions of all the determined second target object key points and the position of the corresponding first target object key point, to obtain a weighted sum;
    determining target coefficients from a frame rate of the target video stream; and
    smoothing the position of the first target object key point according to the weighted sum and the target coefficients, to obtain a stabilized position of the first target object key point.
  5. The method according to claim 1, wherein detecting the target object in the current video frame to obtain the current detection region comprises:
    detecting the current video frame to obtain a plurality of first candidate detection regions; and
    determining, among the plurality of first candidate detection regions, the one with the largest overlap with the historical detection region as the current detection region.
  6. The method according to any one of claims 1 to 5, before detecting the target object in the current video frame of the target video stream, further comprising:
    detecting the first video frame of the target video stream, to obtain a plurality of second candidate detection regions; and
    using the second candidate detection region with the largest confidence as the detection region corresponding to the first video frame, the detection region corresponding to the first video frame serving as the historical detection region for the other video frames of the target video stream.
  7. The method according to any one of claims 1 to 5, wherein performing key point location on the target object based on the determined current detection region comprises:
    when the target object in the current video frame does not lie entirely within the determined current detection region, expanding the determined current detection region outward about its center, to obtain a target detection region; and
    obtaining the first object key point set from a target image of the target object contained in the target detection region.
  8. The method according to claim 7, wherein obtaining the first object key point set from the target image comprises:
    processing the target image to obtain a plurality of groups of confidences of the first object key point set, each group being used to predict a position of one object key point in the first object key point set;
    constructing a target matrix from each group of confidences;
    determining a first target coordinate from the row and column, in the corresponding target matrix, of the largest confidence in each group; and
    determining, from the first target coordinate, the position of one object key point in the first object key point set.
  9. The method according to claim 8, wherein determining the position of the corresponding object key point from the first target coordinate comprises:
    determining a second target coordinate from the row and column, in the target matrix, of the second-largest confidence in each group;
    offsetting the first target coordinate toward the second target coordinate by a target distance; and
    determining, from the second target coordinate offset by the target distance, the position, on the target object, of the object key point corresponding to the target matrix.
  10. An image processing method, comprising:
    detecting a target object in a current video frame of a target video stream, to obtain a current detection region for the target object;
    adjusting the current detection region according to a historical detection region corresponding to the target object in a historical video frame of the target video stream, to obtain a determined current detection region;
    performing key point location on the target object based on the determined current detection region, to obtain a first object key point set;
    stabilizing positions of the first object key point set according to positions of a second object key point set corresponding to the target object in the historical video frame, to obtain positions of a current target object key point set in the current video frame;
    recognizing a part of the target object from the current video frame according to the positions of the current target object key point set;
    adjusting the recognized part of the target object; and
    displaying an image of the adjusted target object.
  11. The method according to claim 10, wherein adjusting the current detection region according to the historical detection region to obtain the determined current detection region comprises:
    determining an overlap between the historical detection region and the current detection region;
    using the historical detection region as the determined current detection region when the overlap is greater than a target threshold; and
    using the current detection region as the determined current detection region when the overlap is less than or equal to the target threshold.
  12. The method according to claim 10, wherein stabilizing the positions of the first object key point set according to the positions of the second object key point set corresponding to the target object in the historical video frame comprises:
    performing coordinate smoothing on the positions of the first object key point set according to the positions of the second object key point set, to obtain the positions of the current target object key point set in the current video frame.
  13. The method according to claim 12, wherein the coordinate smoothing comprises:
    determining, from the first object key point set, a position of a first target object key point to be stabilized;
    determining, from the second object key point set, a position of a second target object key point corresponding to the part indicated by the first target object key point;
    weighting the positions of all the determined second target object key points and the position of the corresponding first target object key point, to obtain a weighted sum;
    determining target coefficients from a frame rate of the target video stream; and
    smoothing the position of the first target object key point according to the weighted sum and the target coefficients, to obtain a stabilized position of the first target object key point.
  14. The method according to claim 10, wherein detecting the target object in the current video frame to obtain the current detection region comprises:
    detecting the current video frame to obtain a plurality of first candidate detection regions; and
    determining, among the plurality of first candidate detection regions, the one with the largest overlap with the historical detection region as the current detection region.
  15. The method according to any one of claims 10 to 14, before detecting the target object in the current video frame of the target video stream, further comprising:
    detecting the first video frame of the target video stream, to obtain a plurality of second candidate detection regions; and
    using the second candidate detection region with the largest confidence as the detection region corresponding to the first video frame, the detection region corresponding to the first video frame serving as the historical detection region for the other video frames of the target video stream.
  16. The method according to any one of claims 10 to 14, wherein performing key point location on the target object based on the determined current detection region comprises:
    when the target object in the current video frame does not lie entirely within the determined current detection region, expanding the determined current detection region outward about its center, to obtain a target detection region; and
    obtaining the first object key point set from a target image of the target object contained in the target detection region.
  17. The method according to claim 16, wherein obtaining the first object key point set from the target image comprises:
    processing the target image to obtain a plurality of groups of confidences of the first object key point set, each group being used to predict a position of one object key point in the first object key point set;
    constructing a target matrix from each group of confidences;
    determining a first target coordinate from the row and column, in the corresponding target matrix, of the largest confidence in each group; and
    determining, from the first target coordinate, the position of one object key point in the first object key point set.
  18. The method according to claim 17, wherein determining the position of the corresponding object key point from the first target coordinate comprises:
    determining a second target coordinate from the row and column, in the target matrix, of the second-largest confidence in each group;
    offsetting the first target coordinate toward the second target coordinate by a target distance; and
    determining, from the second target coordinate offset by the target distance, the position, on the target object, of the object key point corresponding to the target matrix.
  19. An object key point locating apparatus, comprising:
    a memory and a processor, wherein the memory is configured to store a computer program; and
    the processor is configured to run the computer program to perform the following actions:
    detecting a target object in a current video frame of a target video stream, to obtain a current detection region for the target object;
    obtaining a historical detection region corresponding to the target object in a historical video frame of the target video stream;
    adjusting the current detection region according to the historical detection region, to obtain a determined current detection region;
    performing key point location on the target object based on the determined current detection region, to obtain a first object key point set;
    obtaining a second object key point set corresponding to the target object in the historical video frame of the target video stream; and
    stabilizing positions of the first object key point set according to positions of the second object key point set, to obtain positions of a current target object key point set in the current video frame.
  20. The locating apparatus according to claim 19, wherein, when performing the action of adjusting the current detection region according to the historical detection region to obtain the determined current detection region, the processor is specifically configured to perform the following actions:
    determining an overlap between the historical detection region and the current detection region;
    using the historical detection region as the determined current detection region when the overlap is greater than a target threshold; and
    using the current detection region as the determined current detection region when the overlap is less than or equal to the target threshold.
  21. The locating apparatus according to claim 19, wherein, when performing the action of stabilizing the positions of the first object key point set according to the positions of the second object key point set, the processor is specifically configured to perform the following action:
    performing coordinate smoothing on the positions of the first object key point set according to the positions of the second object key point set, to obtain the positions of the current target object key point set in the current video frame.
  22. The locating apparatus according to claim 21, wherein, when performing the coordinate smoothing action, the processor is specifically configured to perform the following actions:
    determining, from the first object key point set, a position of a first target object key point to be stabilized;
    determining, from the second object key point set, a position of a second target object key point corresponding to the part indicated by the first target object key point;
    weighting the positions of all the determined second target object key points and the position of the corresponding first target object key point, to obtain a weighted sum;
    determining target coefficients from a frame rate of the target video stream; and
    smoothing the position of the first target object key point according to the weighted sum and the target coefficients, to obtain a stabilized position of the first target object key point.
  23. The locating apparatus according to claim 19, wherein, when performing the action of detecting the target object in the current video frame to obtain the current detection region, the processor is specifically configured to perform the following actions:
    detecting the current video frame to obtain a plurality of first candidate detection regions; and
    determining, among the plurality of first candidate detection regions, the one with the largest overlap with the historical detection region as the current detection region.
  24. An image processing apparatus, comprising:
    a memory, a processor, and a display;
    wherein the memory is configured to store a computer program; and
    the processor is configured to run the computer program to perform the following actions:
    detecting a target object in a current video frame of a target video stream, to obtain a current detection region for the target object;
    adjusting the current detection region according to a historical detection region corresponding to the target object in a historical video frame of the target video stream, to obtain a determined current detection region;
    performing key point location on the target object based on the determined current detection region, to obtain a first object key point set;
    stabilizing positions of the first object key point set according to positions of a second object key point set corresponding to the target object in the historical video frame, to obtain positions of a current target object key point set in the current video frame;
    recognizing a part of the target object from the current video frame according to the positions of the current target object key point set;
    adjusting the recognized part of the target object; and
    controlling the display to display an image of the adjusted target object.
  25. The image processing apparatus according to claim 24, wherein, when performing the action of adjusting the current detection region according to the historical detection region to obtain the determined current detection region, the processor is specifically configured to perform the following actions:
    determining an overlap between the historical detection region and the current detection region;
    using the historical detection region as the determined current detection region when the overlap is greater than a target threshold; and
    using the current detection region as the determined current detection region when the overlap is less than or equal to the target threshold.
  26. The image processing apparatus according to claim 24, wherein, when performing the action of stabilizing the positions of the first object key point set according to the positions of the second object key point set in the historical video frame, the processor is specifically configured to perform the following action:
    performing coordinate smoothing on the positions of the first object key point set according to the positions of the second object key point set, to obtain the positions of the current target object key point set in the current video frame.
  27. The image processing apparatus according to claim 26, wherein, when performing the coordinate smoothing action, the processor is specifically configured to perform the following actions:
    determining, from the first object key point set, a position of a first target object key point to be stabilized;
    determining, from the second object key point set, a position of a second target object key point corresponding to the part indicated by the first target object key point;
    weighting the positions of all the determined second target object key points and the position of the corresponding first target object key point, to obtain a weighted sum;
    determining target coefficients from a frame rate of the target video stream; and
    smoothing the position of the first target object key point according to the weighted sum and the target coefficients, to obtain a stabilized position of the first target object key point.
  28. The image processing apparatus according to claim 24, wherein, when performing the action of detecting the target object in the current video frame to obtain the current detection region, the processor is specifically configured to perform the following actions:
    detecting the current video frame to obtain a plurality of first candidate detection regions; and
    determining, among the plurality of first candidate detection regions, the one with the largest overlap with the historical detection region as the current detection region.
  29. A storage medium, storing a computer program, wherein the computer program is configured to, when run, perform the object key point locating method according to any one of claims 1 to 9, or the image processing method according to any one of claims 10 to 18.
PCT/CN2019/113611 2018-11-19 2019-10-28 物体关键点的定位方法、图像处理方法、装置及存储介质 WO2020103647A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
EP19887883.7A EP3885967A4 (en) 2018-11-19 2019-10-28 METHOD AND APPARATUS FOR POSITIONING KEY POINTS OF AN OBJECT, METHOD AND APPARATUS FOR PROCESSING IMAGES AND MEMORY MEDIA
US17/088,558 US11450080B2 (en) 2018-11-19 2020-11-03 Image processing method and apparatus, and storage medium

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201811377195.6A CN109684920B (zh) 2018-11-19 2018-11-19 物体关键点的定位方法、图像处理方法、装置及存储介质
CN201811377195.6 2018-11-19

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US17/088,558 Continuation US11450080B2 (en) 2018-11-19 2020-11-03 Image processing method and apparatus, and storage medium

Publications (1)

Publication Number Publication Date
WO2020103647A1 true WO2020103647A1 (zh) 2020-05-28

Family

ID=66185882

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/113611 WO2020103647A1 (zh) 2018-11-19 2019-10-28 物体关键点的定位方法、图像处理方法、装置及存储介质

Country Status (4)

Country Link
US (1) US11450080B2 (zh)
EP (1) EP3885967A4 (zh)
CN (1) CN109684920B (zh)
WO (1) WO2020103647A1 (zh)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111968050A (zh) * 2020-08-07 2020-11-20 Oppo(重庆)智能科技有限公司 人体图像处理方法及相关产品
CN112906495A (zh) * 2021-01-27 2021-06-04 深圳安智杰科技有限公司 一种目标检测方法、装置、电子设备及存储介质
CN114067369A (zh) * 2022-01-17 2022-02-18 深圳爱莫科技有限公司 基于图像识别的餐桌状态识别方法及系统
TWI758205B (zh) * 2020-07-28 2022-03-11 大陸商浙江商湯科技開發有限公司 目標檢測方法、電子設備和電腦可讀儲存介質
CN112906495B (zh) * 2021-01-27 2024-04-30 深圳安智杰科技有限公司 一种目标检测方法、装置、电子设备及存储介质

Families Citing this family (44)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109684920B (zh) * 2018-11-19 2020-12-11 腾讯科技(深圳)有限公司 物体关键点的定位方法、图像处理方法、装置及存储介质
CN110215232A (zh) * 2019-04-30 2019-09-10 南方医科大学南方医院 基于目标检测算法的冠状动脉血管内超声斑块分析方法
CN110148158A (zh) * 2019-05-13 2019-08-20 北京百度网讯科技有限公司 用于处理视频的方法、装置、设备和存储介质
CN110427806A (zh) * 2019-06-20 2019-11-08 北京奇艺世纪科技有限公司 视频识别方法、装置及计算机可读存储介质
CN110345407B (zh) * 2019-06-20 2022-01-18 华南理工大学 一种基于深度学习的智能矫姿台灯及矫姿方法
CN110264430B (zh) * 2019-06-29 2022-04-15 北京字节跳动网络技术有限公司 视频美化方法、装置及电子设备
CN110288554B (zh) * 2019-06-29 2022-09-16 北京字节跳动网络技术有限公司 视频美化方法、装置及电子设备
CN110349177B (zh) * 2019-07-03 2021-08-03 广州多益网络股份有限公司 一种连续帧视频流的人脸关键点跟踪方法和系统
CN110414585B (zh) * 2019-07-22 2022-04-01 武汉理工大学 基于改进的嵌入式平台的实时颗粒物检测方法
CN110414514B (zh) * 2019-07-31 2021-12-07 北京字节跳动网络技术有限公司 图像处理方法及装置
CN110290410B (zh) * 2019-07-31 2021-10-29 合肥华米微电子有限公司 影像位置调节方法、装置、系统及调节信息生成设备
CN110570460B (zh) * 2019-09-06 2024-02-13 腾讯云计算(北京)有限责任公司 目标跟踪方法、装置、计算机设备及计算机可读存储介质
CN110533006B (zh) * 2019-09-11 2022-03-25 北京小米智能科技有限公司 一种目标跟踪方法、装置及介质
EP4052190A4 (en) * 2019-11-15 2023-12-06 Waymo Llc SPACE-TIME-INTERACTIVE NETWORKS
CN110909655A (zh) * 2019-11-18 2020-03-24 上海眼控科技股份有限公司 一种识别视频事件的方法及设备
CN111104611B (zh) * 2019-11-18 2023-01-20 腾讯科技(深圳)有限公司 一种数据处理方法、装置、设备及存储介质
CN111027412B (zh) * 2019-11-20 2024-03-08 北京奇艺世纪科技有限公司 一种人体关键点识别方法、装置及电子设备
CN111310595B (zh) * 2020-01-20 2023-08-25 北京百度网讯科技有限公司 用于生成信息的方法和装置
CN111292337B (zh) * 2020-01-21 2024-03-01 广州虎牙科技有限公司 图像背景替换方法、装置、设备及存储介质
CN111405198A (zh) * 2020-03-23 2020-07-10 北京字节跳动网络技术有限公司 视频中人体胸部美体处理方法、装置及电子设备
CN111523402B (zh) * 2020-04-01 2023-12-12 车智互联(北京)科技有限公司 一种视频处理方法、移动终端及可读存储介质
CN111723776A (zh) * 2020-07-03 2020-09-29 厦门美图之家科技有限公司 人体外轮廓点检测方法、装置、电子设备和可读存储介质
CN112164090A (zh) * 2020-09-04 2021-01-01 杭州海康威视系统技术有限公司 数据处理方法、装置、电子设备及机器可读存储介质
CN112163516A (zh) * 2020-09-27 2021-01-01 深圳市悦动天下科技有限公司 跳绳计数的方法、装置及计算机存储介质
CN112270669B (zh) * 2020-11-09 2024-03-01 北京百度网讯科技有限公司 人体3d关键点检测方法、模型训练方法及相关装置
CN113743177A (zh) * 2021-02-09 2021-12-03 北京沃东天骏信息技术有限公司 关键点检测方法、系统、智能终端和存储介质
CN113160244B (zh) * 2021-03-24 2024-03-15 北京达佳互联信息技术有限公司 视频处理方法、装置、电子设备及存储介质
CN113095232B (zh) * 2021-04-14 2022-04-22 浙江中正智能科技有限公司 一种目标实时跟踪方法
CN113158981A (zh) * 2021-05-17 2021-07-23 广东中卡云计算有限公司 一种基于级联卷积神经网络的骑行姿态分析方法
CN113223084B (zh) * 2021-05-27 2024-03-01 北京奇艺世纪科技有限公司 一种位置确定方法、装置、电子设备及存储介质
CN113223083B (zh) * 2021-05-27 2023-08-15 北京奇艺世纪科技有限公司 一种位置确定方法、装置、电子设备及存储介质
CN113361364B (zh) * 2021-05-31 2022-11-01 北京市商汤科技开发有限公司 目标行为检测方法、装置、设备及存储介质
CN113743219B (zh) * 2021-08-03 2023-09-19 北京格灵深瞳信息技术股份有限公司 运动目标检测方法、装置、电子设备及存储介质
CN113627306B (zh) * 2021-08-03 2023-04-07 展讯通信(上海)有限公司 关键点处理方法及装置、可读存储介质、终端
CN113808230A (zh) * 2021-08-26 2021-12-17 华南理工大学 提高电阻抗成像准确性的方法、系统、装置和存储介质
CN114092556A (zh) * 2021-11-22 2022-02-25 北京百度网讯科技有限公司 用于确定人体姿态的方法、装置、电子设备、介质
CN115035186B (zh) * 2021-12-03 2023-04-11 荣耀终端有限公司 目标对象标记方法和终端设备
CN114266769B (zh) * 2022-03-01 2022-06-21 北京鹰瞳科技发展股份有限公司 一种基于神经网络模型进行眼部疾病识别的系统及其方法
CN114677625B (zh) * 2022-03-18 2023-09-08 北京百度网讯科技有限公司 目标检测方法、装置、设备、存储介质和程序产品
CN115134521B (zh) * 2022-04-22 2023-09-22 咪咕视讯科技有限公司 视频拍摄防抖动方法、装置、设备及存储介质
CN114979567B (zh) * 2022-04-29 2023-03-24 北京容联易通信息技术有限公司 一种应用于视频智能监控的物体与区域交互方法及系统
CN115273243B (zh) * 2022-09-27 2023-03-28 深圳比特微电子科技有限公司 跌倒检测方法、装置、电子设备和计算机可读存储介质
CN115496911B (zh) * 2022-11-14 2023-03-24 腾讯科技(深圳)有限公司 一种目标点检测方法、装置、设备及存储介质
CN117079058B (zh) * 2023-10-11 2024-01-09 腾讯科技(深圳)有限公司 图像处理方法和装置、存储介质及电子设备

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105512627A (zh) * 2015-12-03 2016-04-20 腾讯科技(深圳)有限公司 一种关键点的定位方法及终端
CN106682619A (zh) * 2016-12-28 2017-05-17 上海木爷机器人技术有限公司 一种对象跟踪方法及装置
CN106778585A (zh) * 2016-12-08 2017-05-31 腾讯科技(上海)有限公司 一种人脸关键点跟踪方法和装置
CN108230357A (zh) * 2017-10-25 2018-06-29 北京市商汤科技开发有限公司 关键点检测方法、装置、存储介质、计算机程序和电子设备
CN109684920A (zh) * 2018-11-19 2019-04-26 腾讯科技(深圳)有限公司 物体关键点的定位方法、图像处理方法、装置及存储介质

Family Cites Families (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7916897B2 (en) * 2006-08-11 2011-03-29 Tessera Technologies Ireland Limited Face tracking for controlling imaging parameters
CN101499128B (zh) * 2008-01-30 2011-06-29 中国科学院自动化研究所 基于视频流的三维人脸动作检测和跟踪方法
CN101833771B (zh) * 2010-06-03 2012-07-25 北京智安邦科技有限公司 解决多目标交汇遮挡的跟踪装置及方法
CN102222346B (zh) * 2011-05-23 2013-03-13 北京云加速信息技术有限公司 一种车辆检测和跟踪方法
US10757369B1 (en) * 2012-10-08 2020-08-25 Supratik Mukhopadhyay Computer implemented system and method for high performance visual tracking
KR101458099B1 (ko) * 2013-04-24 2014-11-05 전자부품연구원 흔들림 영상 안정화 방법 및 이를 적용한 영상 처리 장치
US9367897B1 (en) * 2014-12-11 2016-06-14 Sharp Laboratories Of America, Inc. System for video super resolution using semantic components
US9736366B1 (en) * 2015-05-23 2017-08-15 Google Inc. Tile-based digital image correspondence
US10110846B2 (en) * 2016-02-03 2018-10-23 Sharp Laboratories Of America, Inc. Computationally efficient frame rate conversion system
CN106205126B (zh) * 2016-08-12 2019-01-15 北京航空航天大学 基于卷积神经网络的大规模交通网络拥堵预测方法及装置
WO2018152609A1 (en) * 2017-02-24 2018-08-30 Synaptive Medical (Barbados) Inc. Video stabilization system and method
CN106991388B (zh) * 2017-03-27 2020-04-21 中国科学院自动化研究所 关键点定位方法
CN107784288B (zh) * 2017-10-30 2020-01-14 华南理工大学 一种基于深度神经网络的迭代定位式人脸检测方法
US20190130191A1 (en) * 2017-10-30 2019-05-02 Qualcomm Incorporated Bounding box smoothing for object tracking in a video analytics system
CN107967693B (zh) * 2017-12-01 2021-07-09 北京奇虎科技有限公司 视频关键点处理方法、装置、计算设备及计算机存储介质
CN108492287B (zh) * 2018-03-14 2020-06-02 罗普特(厦门)科技集团有限公司 一种视频抖动检测方法、终端设备及存储介质
US10171738B1 (en) * 2018-05-04 2019-01-01 Google Llc Stabilizing video to reduce camera and face movement
CN108830900B (zh) * 2018-06-15 2021-03-12 北京字节跳动网络技术有限公司 关键点的抖动处理方法和装置

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105512627A (zh) * 2015-12-03 2016-04-20 腾讯科技(深圳)有限公司 一种关键点的定位方法及终端
CN106778585A (zh) * 2016-12-08 2017-05-31 腾讯科技(上海)有限公司 一种人脸关键点跟踪方法和装置
CN106682619A (zh) * 2016-12-28 2017-05-17 上海木爷机器人技术有限公司 一种对象跟踪方法及装置
CN108230357A (zh) * 2017-10-25 2018-06-29 北京市商汤科技开发有限公司 关键点检测方法、装置、存储介质、计算机程序和电子设备
CN109684920A (zh) * 2018-11-19 2019-04-26 腾讯科技(深圳)有限公司 物体关键点的定位方法、图像处理方法、装置及存储介质

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See also references of EP3885967A4

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
TWI758205B (zh) * 2020-07-28 2022-03-11 大陸商浙江商湯科技開發有限公司 目標檢測方法、電子設備和電腦可讀儲存介質
CN111968050A (zh) * 2020-08-07 2020-11-20 Oppo(重庆)智能科技有限公司 人体图像处理方法及相关产品
CN111968050B (zh) * 2020-08-07 2024-02-20 Oppo(重庆)智能科技有限公司 人体图像处理方法及相关产品
CN112906495A (zh) * 2021-01-27 2021-06-04 深圳安智杰科技有限公司 一种目标检测方法、装置、电子设备及存储介质
CN112906495B (zh) * 2021-01-27 2024-04-30 深圳安智杰科技有限公司 一种目标检测方法、装置、电子设备及存储介质
CN114067369A (zh) * 2022-01-17 2022-02-18 深圳爱莫科技有限公司 基于图像识别的餐桌状态识别方法及系统

Also Published As

Publication number Publication date
EP3885967A4 (en) 2022-02-16
CN109684920A (zh) 2019-04-26
EP3885967A1 (en) 2021-09-29
US11450080B2 (en) 2022-09-20
CN109684920B (zh) 2020-12-11
US20210049395A1 (en) 2021-02-18

Similar Documents

Publication Publication Date Title
WO2020103647A1 (zh) 物体关键点的定位方法、图像处理方法、装置及存储介质
JP7236545B2 (ja) ビデオターゲット追跡方法と装置、コンピュータ装置、プログラム
CN108205655B (zh) 一种关键点预测方法、装置、电子设备及存储介质
US10198823B1 (en) Segmentation of object image data from background image data
US11915514B2 (en) Method and apparatus for detecting facial key points, computer device, and storage medium
CN110998659B (zh) 图像处理系统、图像处理方法、及程序
WO2019128508A1 (zh) 图像处理方法、装置、存储介质及电子设备
JP5554984B2 (ja) パターン認識方法およびパターン認識装置
US11074430B2 (en) Directional assistance for centering a face in a camera field of view
US9443325B2 (en) Image processing apparatus, image processing method, and computer program
KR20220066366A (ko) 예측적 개인별 3차원 신체 모델
US20180268207A1 (en) Method for automatic facial impression transformation, recording medium and device for performing the method
US11417095B2 (en) Image recognition method and apparatus, electronic device, and readable storage medium using an update on body extraction parameter and alignment parameter
CN109657615B (zh) 一种目标检测的训练方法、装置及终端设备
WO2023082882A1 (zh) 一种基于姿态估计的行人摔倒动作识别方法及设备
CN109063584B (zh) 基于级联回归的面部特征点定位方法、装置、设备及介质
WO2021203644A1 (zh) 温度修正方法、装置及系统
US20120007859A1 (en) Method and apparatus for generating face animation in computer system
CN112241976A (zh) 一种训练模型的方法及装置
CN109740416B (zh) 目标跟踪方法及相关产品
WO2016165614A1 (zh) 一种即时视频中的表情识别方法和电子设备
WO2022147736A1 (zh) 虚拟图像构建方法、装置、设备及存储介质
US9165213B2 (en) Information processing apparatus, information processing method, and program
WO2021098545A1 (zh) 一种姿势确定方法、装置、设备、存储介质、芯片及产品
KR20220004009A (ko) 키 포인트 검출 방법, 장치, 전자 기기 및 저장 매체

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19887883

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

ENP Entry into the national phase

Ref document number: 2019887883

Country of ref document: EP

Effective date: 20210621