CN113489897A - Image processing method and related device - Google Patents

Image processing method and related device

Info

Publication number
CN113489897A
Authority
CN
China
Prior art keywords
detection frame
image
detection
key point
combination
Prior art date
Legal status
Granted
Application number
CN202110723162.8A
Other languages
Chinese (zh)
Other versions
CN113489897B (en)
Inventor
潘睿
Current Assignee
Hangzhou Douku Software Technology Co Ltd
Original Assignee
Hangzhou Douku Software Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Hangzhou Douku Software Technology Co Ltd
Priority to CN202110723162.8A
Publication of CN113489897A
Application granted
Publication of CN113489897B
Legal status: Active

Classifications

    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04N: PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 23/00: Cameras or camera modules comprising electronic image sensors; Control thereof
    • H04N 23/60: Control of cameras or camera modules
    • H04N 23/63: Control of cameras or camera modules by using electronic viewfinders
    • H04N 23/631: Graphical user interfaces [GUI] specially adapted for controlling image capture or setting capture parameters
    • H04N 23/632: Graphical user interfaces [GUI] specially adapted for controlling image capture or setting capture parameters for displaying or modifying preview images prior to image capturing, e.g. variety of image resolutions or capturing parameters
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04N: PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 23/00: Cameras or camera modules comprising electronic image sensors; Control thereof
    • H04N 23/60: Control of cameras or camera modules
    • H04N 23/63: Control of cameras or camera modules by using electronic viewfinders
    • H04N 23/633: Control of cameras or camera modules by using electronic viewfinders for displaying additional information relating to control or operation of the camera
    • H04N 23/635: Region indicators; Field of view indicators

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Human Computer Interaction (AREA)
  • Image Analysis (AREA)

Abstract

The application provides an image processing method and a related device. The method includes the following steps: acquiring a first detection frame in a first image and a second detection frame in a second image, as well as a first key point of a first photographed object in the first detection frame and a second key point of a second photographed object in the second detection frame; determining the matching condition of the first detection frame and the second detection frame according to the first detection frame, the second detection frame, the first key point and the second key point, wherein the matching condition is used for indicating the state in which the photographed object is captured in the two consecutive frames of images; and determining a pose estimation result of the first image according to the matching condition, wherein the pose estimation result comprises a detection frame and the key points contained in the detection frame. The method and the device improve the efficiency and accuracy of the pose estimation performed by the device.

Description

Image processing method and related device
Technical Field
The present application belongs to the field of image processing technologies, and in particular, to an image processing method and a related apparatus.
Background
At present, because the precision of real-time detection-frame prediction is not particularly high, the position of the detection frame output for each preview frame jitters randomly, and a stable output is difficult to obtain. In addition, also because of limited real-time detection accuracy, detection frames are often lost in certain frames, so that detection-frame information cannot be stably acquired in every frame for each person being tracked and photographed.
Disclosure of Invention
The application provides an image processing method and a related device, so that the maximum-weight matching of detection frames is calculated based on the already-available key point features combined with the IOU of the detection frames; no extra time needs to be consumed extracting additional features for matching, which improves the efficiency and accuracy of the pose estimation performed by the device.
In a first aspect, the present application provides an image processing method, including:
acquiring a first detection frame in a first image and a second detection frame in a second image, as well as a first key point of a first shot object in the first detection frame and a second key point of a second shot object in the second detection frame, wherein the second image is a previous frame image of the first image, the detection frames are used for indicating an area of a corresponding shot object in the image, and the key points comprise pixel points used for describing key positions of the shot object;
determining a matching condition of the first detection frame and the second detection frame according to the first detection frame, the second detection frame, the first key point and the second key point, wherein the matching condition is used for indicating the state in which the object is shot in the two consecutive frames of images;
and determining a posture estimation result of the first image according to the matching condition, wherein the posture estimation result comprises a detection frame and key points contained in the detection frame.
It can be seen that, in the embodiment of the application, the device can determine the matching condition of the first detection frame and the second detection frame according to the first detection frame in the first image, the second detection frame in the second image, the first key point of the first detection frame and the second key point of the second detection frame. Since the detection frames and the key points are data already produced by the pose estimation task, the device does not need to consume extra time to extract other image features for the matching processing, which improves the efficiency and accuracy of the pose estimation performed by the device.
In a second aspect, the present application provides an image processing apparatus comprising:
the image processing device comprises an acquisition unit, a processing unit and a processing unit, wherein the acquisition unit is used for acquiring a first detection frame in a first image and a second detection frame in a second image, and a first key point of a first shot object in the first detection frame and a second key point of a second shot object in the second detection frame, the second image is a previous frame image of the first image, the detection frames are used for indicating the area of the corresponding shot object in the image, and the key points comprise pixel points for describing the key position of the shot object;
a determining unit, configured to determine, according to the first detection frame, the second detection frame, the first key point, and the second key point, a matching situation of the first detection frame and the second detection frame, where the matching situation is used to indicate a state in which a subject is captured in two preceding and following frames of images;
the determining unit is further configured to determine a pose estimation result of the first image according to the matching condition, where the pose estimation result includes a detection frame and key points included in the detection frame.
In a third aspect, the present application provides an electronic device, including: one or more processors;
one or more memories for storing programs,
wherein the one or more memories and the programs are configured such that the one or more processors control the electronic device to execute the instructions for the steps in any of the methods of the first aspect of the embodiments of the present application.
In a fourth aspect, the present application provides a chip, including a processor configured to call and run a computer program from a memory, so that a device on which the chip is installed executes some or all of the steps described in any method of the first aspect of the embodiments of the present application.
In a fifth aspect, the present application provides a computer-readable storage medium, wherein the computer-readable storage medium stores a computer program for electronic data exchange, and wherein the computer program causes a computer to perform some or all of the steps as described in any one of the methods of the first aspect of the embodiments of the present application.
In a sixth aspect, the present application provides a computer program, wherein the computer program is operable to cause a computer to perform some or all of the steps as described in any of the methods of the first aspect of the embodiments of the present application. The computer program may be a software installation package.
Drawings
Fig. 1 is a schematic structural diagram of an electronic device 100 according to an embodiment of the present disclosure;
fig. 2a is a schematic flowchart of an image processing method according to an embodiment of the present application;
FIG. 2b is a schematic diagram illustrating a distribution of joints of a human body according to an embodiment of the present application;
FIG. 2c is a schematic diagram of gesture tracking for a single user according to an embodiment of the present application;
FIG. 2d is a schematic flowchart of another image processing method provided in the embodiment of the present application;
fig. 3 is a block diagram of functional units of an image processing apparatus according to an embodiment of the present application;
fig. 4 is a block diagram of functional units of another image processing apparatus according to an embodiment of the present application.
Detailed Description
In order to make the technical solutions of the present application better understood, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
The terms "first," "second," and the like in the description and claims of the present application and in the above-described drawings are used for distinguishing between different objects and not for describing a particular order. Furthermore, the terms "include" and "have," as well as any variations thereof, are intended to cover non-exclusive inclusions. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to only those steps or elements listed, but may alternatively include other steps or elements not listed, or inherent to such process, method, article, or apparatus.
Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the application. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. It is explicitly and implicitly understood by one skilled in the art that the embodiments described herein can be combined with other embodiments.
In the present application, "at least one" means one or more, and "a plurality" means two or more. In the present application, "and/or" describes an association relationship between associated objects and indicates that three relationships may exist; for example, A and/or B may mean: A alone, both A and B, or B alone, where A and B may be singular or plural. The character "/" generally indicates that the associated objects before and after it are in an "or" relationship. "At least one of the following" or similar expressions refer to any combination of these items, including any combination of single items or plural items. For example, at least one (item) of a, b, or c may represent: a; b; c; a and b; a and c; b and c; or a, b and c, where each of a, b and c may itself be an element or a set comprising one or more elements.
It should be noted that, in the embodiments of the present application, "equal to" may be used together with "greater than", in which case it applies to the technical solution adopted when the value is greater, or together with "less than", in which case it applies to the technical solution adopted when the value is smaller; when "equal to" is used together with "greater than", it is not used together with "less than", and when it is used together with "less than", it is not used together with "greater than". In the embodiments of the present application, "of", "relevant" and "corresponding" may sometimes be used interchangeably; it should be noted that, when the distinction is not emphasized, their intended meanings are consistent.
First, partial terms referred to in the embodiments of the present application are explained so as to be easily understood by those skilled in the art.
1. An electronic device. In the embodiment of the present application, the electronic device is a device having an image signal processing function, and may be referred to as a user equipment (UE), a terminal, a terminal device, a mobile station (MS), a mobile terminal (MT), an access terminal device, a vehicle-mounted terminal device, an industrial control terminal device, a UE unit, a UE station, a mobile station, a remote terminal device, a mobile device, a UE terminal device, a wireless communication device, a UE agent, or a UE apparatus. The user equipment may be fixed or mobile. For example, the user equipment may be a mobile phone, a tablet (pad), a desktop computer, a notebook computer, a kiosk, a vehicle-mounted terminal, a virtual reality (VR) terminal, an augmented reality (AR) terminal, a wireless terminal in industrial control, a wireless terminal in self-driving, a wireless terminal in remote surgery, a wireless terminal in a smart grid, a wireless terminal in transportation safety, a wireless terminal in a smart city, a wireless terminal in a smart home, a cellular phone, a cordless phone, a Session Initiation Protocol (SIP) phone, a wireless local loop (WLL) station, a personal digital assistant (PDA), a handheld device with a wireless communication function, a computing device or other processing device connected to a wireless modem, a wearable device, a terminal device in a future mobile communication network, or a terminal device in a future evolved public land mobile network (PLMN), etc. In some embodiments of the present application, the user equipment may also be an apparatus having a transceiving function, such as a chip system. The chip system may include a chip and may also include other discrete devices.
2. Human body pose estimation. In the embodiment of the application, human body pose estimation means estimating, through an algorithm, the two-dimensional or three-dimensional coordinates of the human skeleton points in a picture.
3. Multi-target tracking. In the embodiment of the application, multi-target tracking refers to continuous tracking and shooting of multiple people: multiple people can be stably and continuously tracked in a video stream or in preview frames, and the identity (ID) of each detection frame is kept unchanged, which facilitates smoothing the pose estimation results of preceding and following frames and keeps the output stable.
4. Intersection over union (IOU). In the embodiment of the application, the IOU is a concept used in target detection: it is the overlap rate between a detection frame predicted in the current frame image and a detection frame predicted in the previous frame image; simply put, it is the ratio of the intersection to the union of the areas of the two rectangular frames, and it is a measure of the accuracy of detecting a corresponding object in a particular data set. (A small illustrative computation is sketched after this list of terms.)
5. KM (Kuhn-Munkres) algorithm. In the embodiment of the application, the KM algorithm is a computer algorithm whose function is to solve the maximum-weight matching under a perfect matching. In a bipartite graph with left vertices X and right vertices Y, where each connection XiYj between a left vertex and a right vertex has a weight wij, the algorithm finds a matching that maximizes the sum of the selected weights wij. (The sketch below also shows such a matching being solved.)
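For intuition, the following Python sketch is an illustration only, not the patent's implementation: the corner-format boxes, the example weight matrix, and the use of SciPy's Hungarian-method solver as a stand-in for the KM algorithm are all assumptions. It shows how the IOU of two rectangular detection frames can be computed and how a maximum-weight bipartite matching can be solved.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2) corners."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

print(iou((0, 0, 10, 10), (5, 5, 15, 15)))  # intersection 25, union 175 -> 0.142857...

# Maximum-weight matching: rows are left vertices Xi (e.g. current-frame boxes),
# columns are right vertices Yj (previous-frame boxes), w[i, j] is the weight wij.
w = np.array([[0.9, 0.1, 0.0],
              [0.2, 0.8, 0.3],
              [0.0, 0.4, 0.7]])
rows, cols = linear_sum_assignment(w, maximize=True)
# Selected pairs (0,0), (1,1), (2,2); total weight 0.9 + 0.8 + 0.7 = 2.4.
print(list(zip(rows, cols)), w[rows, cols].sum())
```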
At present, multi-target tracking methods generally fall into two types. One type matches only through the IOU of detection frames, for example the IOU Tracker algorithm; such algorithms are very efficient, but their accuracy is not high enough, and when several frames have large overlapping areas they can produce matching errors that are difficult to correct. Moreover, this type of method strongly depends on the detection-frame prediction, and if a frame is lost in between, matching cannot be carried out. The other type adds additional features, such as the visual intersection-over-union tracking (V-IoU Tracker) algorithm, which adopts a tracking method to separately extract picture features and then uses those features for matching; however, extracting the features introduces additional time consumption. Because of their high algorithmic complexity and low real-time performance, these methods can only process offline videos and cannot process preview frames online in real time.
In view of the foregoing problems, embodiments of the present application provide an image processing method and a related apparatus, which calculate the maximum-weight matching of detection frames based on the already-available key point features combined with the IOU of the detection frames, without consuming additional time to extract features for matching, thereby improving the efficiency and accuracy of the pose estimation performed by the device. The method is described below with reference to the accompanying drawings.
Referring to fig. 1, fig. 1 is a schematic view of an electronic device 100 according to an embodiment of the present disclosure. The electronic device 100 comprises an application processor 120, a memory 130, a communication module 140, and one or more programs 131, wherein the application processor 120 is communicatively coupled to the memory 130 and the communication module 140 via an internal communication bus.
In a specific implementation, the one or more programs 131 are stored in the memory 130 and configured to be executed by the application processor 120, and the one or more programs 131 include instructions for performing some or all of the steps performed by the electronic device in the embodiment of the present application.
The communication module 140 includes a local area network wireless communication module and a wired communication module.
The application processor 120 may be, for example, a central processing unit (CPU), a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field programmable gate array (FPGA) or other programmable logic device, a transistor logic device, a hardware component, or any combination thereof, and may implement or execute the various illustrative logical blocks, units, and circuits described in connection with this disclosure. The processor may also be a combination of devices implementing a computing function, for example a combination of one or more microprocessors, or a combination of a DSP and a microprocessor.
The memory 130 may be either volatile memory or nonvolatile memory, or may include both volatile and nonvolatile memory. The non-volatile memory may be a read-only memory (ROM), a Programmable ROM (PROM), an Erasable PROM (EPROM), an electrically Erasable EPROM (EEPROM), or a flash memory. Volatile memory can be Random Access Memory (RAM), which acts as external cache memory. By way of example, but not limitation, many forms of Random Access Memory (RAM) are available, such as Static RAM (SRAM), Dynamic RAM (DRAM), Synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), Enhanced SDRAM (ESDRAM), synchlink DRAM (SLDRAM), and direct bus RAM (DR RAM).
Referring to fig. 2a, fig. 2a is a schematic flowchart of an image processing method according to an embodiment of the present disclosure, applied to the electronic device 100 described above; as shown in the figure, the image processing method includes the following steps.
Step 201, acquiring a first detection frame in a first image and a second detection frame in a second image, and a first key point of a first photographed object in the first detection frame and a second key point of a second photographed object in the second detection frame, where the second image is a previous frame image of the first image, the detection frame is used to indicate an area of a corresponding photographed object in the image, and the key points include pixel points used to describe key positions of the photographed object.
For example, the number of the first detection frames may be one or more, and the number of the second detection frames may be one or more, which is not limited herein. The first subject and the second subject may be the same subject or different subjects, and this is not limited herein.
For example, assuming that both the first and second objects are human bodies, the first and second key points may be any one of joint points of the human bodies as shown in fig. 2 b.
And step 202, determining a matching condition of the first detection frame and the second detection frame according to the first detection frame, the second detection frame, the first key point and the second key point, wherein the matching condition is used for indicating a shooting state of the shot object in two frames of images in front and back.
For example, as shown in fig. 2c, if the first detection frame and the second detection frame are both one, and the first object to be photographed and the second object to be photographed are both target users, the matching condition determined by the electronic device may be: the first detection frame is matched with the second detection frame.
Step 203, determining a posture estimation result of the first image according to the matching condition, wherein the posture estimation result comprises a detection frame and key points contained in the detection frame.
In a specific implementation, the first detection frame and the second detection frame are each single in number.
If the first detection frame matches the second detection frame, it indicates that the first detection frame and the second detection frame correspond to the same object whose pose is being tracked. In this case, the electronic device may add the real information of the first detection frame to the pose tracking subset corresponding to the second detection frame in a preset pose tracking set used for maintaining the pose information of objects, where the real information may specifically include identification information (e.g., a frame number) of the first image, the position information of the first detection frame, and the position information of the first key point. The pose tracking set is used for tracking and recording the pose information of at least one object over consecutive frames, and a pose tracking subset is used for tracking and recording the pose information of a single object over consecutive frames.
It can be seen that, in the embodiment of the application, the device can determine the matching condition of the first detection frame and the second detection frame according to the first detection frame in the first image, the second detection frame in the second image, the first key point of the first detection frame and the second key point of the second detection frame. Since the detection frames and the key points are data already produced by the pose estimation task, the device does not need to consume extra time to extract other image features for the matching processing; the amount of computation is at the millisecond level, and the efficiency and accuracy of the pose estimation performed by the device are improved.
In a possible embodiment, the determining, according to the first detection frame, the second detection frame, the first key point, and the second key point, a matching condition of the first detection frame and the second detection frame includes:
calculating an intersection ratio IOU of two detection frames of each detection frame combination in at least one detection frame combination to obtain a detection frame matching weight of each detection frame combination, wherein the at least one detection frame combination is obtained by dividing the first detection frame and the second detection frame according to the following mode: a first detection frame and a second detection frame form a detection frame combination;
wherein the detection box matching weight may have a value equal to the cross-over ratio of the detection box combination.
For example, assuming that the number of first detection frames is N and the number of second detection frames is M, the number of detection frame combinations is N × M, and N, M are both positive integers.
Calculating Euclidean distances of two key points of each key point combination in at least one key point combination of each detection frame combination, and calculating the sum of the Euclidean distances of the at least one key point combination to obtain a key point matching weight of each detection frame combination, wherein the at least one key point combination is obtained by dividing the key points of the two detection frames in each detection frame combination in the following way: a first key point and a second key point form a key point combination, and the position type of the second key point is consistent with that of the first key point;
wherein the keypoint matching weight may have a value equal to the sum of at least one euclidean distance of at least one combination of keypoints.
For example, as shown in fig. 2b, if the number of the first keypoints and the second keypoints in the detection box combination is 17, then correspondingly, the number of the keypoint combinations included in each detection box combination is 17.
Determining a reference matching weight of each detection frame combination according to the detection frame matching weight and the key point matching weight of each detection frame combination;
for example, the reference matching weight may be calculated by the following formula:
V=C1*Vbox+C2*Vjoint,
wherein V is the reference matching weight, Vbox is the detection frame matching weight, Vjoint is the key point matching weight, and C1 and C2 are constants.
And performing maximum weight matching on the bipartite graph according to the reference matching weight of the at least one detection frame combination to obtain the matching condition between the first detection frame and the second detection frame.
In a specific implementation, the algorithm for the electronic device to perform maximum weight matching on the bipartite graph includes, but is not limited to, a KM algorithm.
As can be seen, in this example, the electronic device may specifically calculate the detection frame matching weight and the key point matching weight according to the detection frame and the key point, calculate the reference matching weight by integrating the detection frame matching weight and the key point matching weight, and finally perform bipartite graph maximum weight matching according to the reference matching weight to obtain a matching situation between the detection frames, without consuming extra time to extract features for matching, thereby improving efficiency and accuracy.
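The following is a minimal sketch of this embodiment, not the claimed implementation: it reuses the iou() helper and the SciPy solver from the earlier sketch, represents the key points of one detection frame as a (K, 2) array ordered by position type, and uses illustrative constants C1 and C2 (the patent does not fix their values). Because the key point matching weight is a sum of Euclidean distances (smaller means more similar) while the IOU grows with similarity, the sketch assumes C2 is negative so that the combined reference weight can be maximized directly.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

C1, C2 = 1.0, -0.01  # illustrative constants; C2 < 0 so larger joint distances lower the weight

def keypoint_matching_weight(kpts_a, kpts_b):
    """Sum of Euclidean distances over key point combinations of the same position type."""
    return float(np.linalg.norm(np.asarray(kpts_a) - np.asarray(kpts_b), axis=1).sum())

def match_detection_frames(boxes_cur, kpts_cur, boxes_prev, kpts_prev, preset_weight=0.3):
    """Return (i, j) index pairs matching current-frame boxes to previous-frame boxes."""
    n, m = len(boxes_cur), len(boxes_prev)
    v = np.zeros((n, m))
    for i in range(n):
        for j in range(m):
            v_box = iou(boxes_cur[i], boxes_prev[j])                       # detection frame matching weight
            v_joint = keypoint_matching_weight(kpts_cur[i], kpts_prev[j])  # key point matching weight
            v[i, j] = C1 * v_box + C2 * v_joint                            # reference matching weight
    rows, cols = linear_sum_assignment(v, maximize=True)
    # Pairs whose target matching weight falls below the preset matching weight are treated as unmatched.
    return [(i, j) for i, j in zip(rows, cols) if v[i, j] >= preset_weight]
```

Current-frame boxes absent from the returned pairs play the role of the unmatched first detection frames, and previous-frame boxes absent from the pairs play the role of the unmatched second detection frames, matching the cases enumerated in the next embodiment.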
In one possible embodiment, the matching case includes at least one of:
a first detection frame combination with a target matching weight larger than or equal to a preset matching weight, wherein the target matching weight is the matching weight obtained after the first detection frame combination is subjected to maximum weight matching processing of the bipartite graph, and the first detection frame combination belongs to the at least one detection frame combination;
a second detection box combination with the target matching weight smaller than the preset matching weight, wherein the second detection box combination belongs to the at least one detection box combination;
a first detection frame a1 that does not match any second detection frame; and
a second detection frame b1 that does not match any first detection frame.
Wherein the preset matching weight may be a preset empirical value.
Therefore, in this example, the electronic device can accurately classify the matching condition of the first detection frame and the second detection frame, so that the different classification results are pertinently processed, and the accuracy and the refinement degree are improved.
In one possible embodiment, the determining the pose estimation result of the first image according to the matching condition includes:
for the first detection frame combination, adding the real information of the first detection frame a2 in the first detection frame combination to the pose tracking subset B2 corresponding to the second detection frame b2 in that combination, wherein the real information comprises the identification information of the first image, the position information of the detection frame and the position information of the key points in the detection frame;
the number of first detection frame combinations may be one or more, and the detection frames contained in each first detection frame combination correspond to the same object being tracked; that is, the first detection frame a2 and the second detection frame b2 in a first detection frame combination correspond to the same object, so the tracking and recording of the pose of that object can be realized by adding the real information of the first detection frame a2 of the current image frame, i.e. the first image, to the pose tracking subset B2 used for tracking and recording the image information of the corresponding object.
For the second detection frame combination, creating a corresponding pose tracking subset A3 for the first detection frame a3 in the second detection frame combination, and storing the real information of the first detection frame a3 in the second detection frame combination;
the number of second detection frame combinations may be one or more; since the target matching weight of a second detection frame combination is smaller than the preset matching weight, the electronic device determines that the first detection frame a3 of the second detection frame combination does not match the second detection frame b3, so the first detection frame a3 cannot find a detection frame that exists in the previous frame and matches it; it is therefore necessary to create a corresponding pose tracking subset A3 for the first detection frame a3 and to store the real information of the first detection frame a3, so as to realize tracking and recording of the pose of the object corresponding to the first detection frame a3.
For the first detection frame a1, a corresponding pose tracking subset A1 is created for the first detection frame a1, and the real information of the first detection frame a1 is stored.
Therefore, in this example, the electronic device can record the pose estimation results both of objects newly photographed in the current frame and of objects already photographed in the previous frame, so that the pose estimation result of every object photographed in the current frame is not missed, which improves the accuracy of the object pose tracking performed by the device.
In one possible embodiment, the method further comprises: for the second detection frame combination, increasing the value of the count identifier of the pose tracking subset B3 corresponding to the second detection frame b3 in the second detection frame combination by 1, and adding compensation information of the second detection frame b3 to the pose tracking subset B3, wherein the compensation information comprises the identification information of the first image, the position information of the detection frame b3 and the position information of the key points of the detection frame b3;
determining whether the value of the count identifier of the pose tracking subset B3 is greater than or equal to a preset value;
if yes, deleting the pose tracking subset B3 from the pose estimation result of the first image;
if not, adding, to the pose tracking subset B3, the identification information of the first image, the position information of the second detection frame b3 and the position information of the second key point of the second detection frame b3.
Illustratively, the method further comprises: for the first detection frame combination, resetting the count identifier of the pose tracking subset B2.
The preset value may be 5, 6, 7, 10, or the like, which is not limited herein.
As can be seen, in this example, for an object whose detection frame fails to be matched in the current frame, the electronic device can count the number of losses in the pose tracking subset of that object and compensate the pose information missing from the current frame, so as to avoid a misjudgment in the following case: the object is not captured in the current frame because of problems such as the shooting angle of the electronic device, but is actually still within the field of view. This helps to improve the continuity and accuracy of pose tracking.
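One reasonable reading of this bookkeeping is sketched below under stated assumptions: the data layout, the helper names and the preset value of 10 are illustrative and not taken from the patent. Matched frames append their real information and reset the counter, newly appearing frames open a new subset, and frames that fail to match increment the counter and append compensation information until the preset value is reached, at which point the subset is deleted.

```python
PRESET_VALUE = 10  # the embodiment mentions 5, 6, 7 or 10 as possible preset values

class PoseTrackSubset:
    """Tracks and records the pose information of a single object over consecutive frames."""
    def __init__(self):
        self.records = []     # per-frame entries: (frame_id, box, keypoints)
        self.lost_count = 0   # the "count identifier"

def on_matched(track, frame_id, box, keypoints):
    # Real information of the current frame is appended and the counter is reset.
    track.records.append((frame_id, box, keypoints))
    track.lost_count = 0

def on_new_object(tracks, new_id, frame_id, box, keypoints):
    # A detection frame with no usable match in the previous frame starts a new subset.
    subset = PoseTrackSubset()
    subset.records.append((frame_id, box, keypoints))
    tracks[new_id] = subset

def on_lost(tracks, track_id, frame_id):
    # The object was not captured (or not matched) in the current frame.
    track = tracks[track_id]
    track.lost_count += 1
    if track.lost_count >= PRESET_VALUE:
        del tracks[track_id]  # stop tracking this object
    else:
        _, last_box, last_kpts = track.records[-1]
        track.records.append((frame_id, last_box, last_kpts))  # compensation information
```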
In one possible embodiment, the method further comprises: for the second detection frame b1, increasing the value of the count identifier of the pose tracking subset B1 corresponding to the second detection frame b1 by 1, and adding compensation information of the second detection frame b1 to the pose tracking subset B1, wherein the compensation information comprises the identification information of the first image, the position information of the detection frame b1 and the position information of the key points of the detection frame b1.
The identification information includes, but is not limited to, information such as a frame number that can characterize the position of the first image in the time sequence.
As can be seen, in this example, for an object whose detection frame is lost for the first time in the current frame, the electronic device can count the number of losses in the pose tracking subset of that object and compensate the pose information missing from the current frame, so as to avoid a misjudgment in the case where the object is not captured in the current frame because of problems such as the shooting angle of the electronic device, but is actually still within the field of view. This helps to improve the continuity and accuracy of pose tracking.
In one possible embodiment, the method further comprises: acquiring at least one pose tracking subset in a preset pose tracking set, wherein the at least one pose tracking subset is a pose tracking subset other than the reference pose tracking subsets in the pose tracking set, a reference pose tracking subset is a pose tracking subset associated with the first detection frame and/or the second detection frame, and the pose tracking set is used for recording the pose estimation results of the objects that are being tracked and photographed;
for each of the at least one pose tracking subset, performing the following operations:
adding 1 to the value of the count identifier of the currently processed pose tracking subset;
determining whether the value of the count identifier of the currently processed pose tracking subset is greater than or equal to a preset value;
if yes, deleting the currently processed pose tracking subset from the pose tracking set;
if not, adding, to the currently processed pose tracking subset, the identification information of the first image, the position information of the detection frame associated with the second image in the currently processed pose tracking subset, and the position information of the key points of that detection frame.
The preset value may be 5, 6, 7, 10, or the like, which is not limited herein.
It can be seen that, in this example, for an object whose detection frame is not detected in consecutive frames, the electronic device can perform intelligent detection and handling in a statistical manner: for an object whose number of losses has not reached the preset value, it continues counting the losses and compensates the pose information missing from the current frame; for an object whose number of losses has reached the preset value, it deletes the pose tracking subset used for tracking that object. This helps to improve the continuity and accuracy of pose tracking.
In one possible embodiment, the acquiring a first detection frame in a first image and a second detection frame in a second image, and a first key point of a first captured object in the first detection frame and a second key point of a second captured object in the second detection frame includes: acquiring the second detection frame and the second key point corresponding to the identification information of the second image in a pre-stored gesture tracking set; processing the first image by using a pre-trained detection frame prediction model to obtain the first detection frame, and processing the first detection frame by using a pre-trained attitude estimation prediction model to obtain the first key point.
For example, the detection box prediction model and the pose estimation prediction model may be any neural network model, and are not limited herein.
In addition, after the electronic device predicts the detection frames, it can further screen them through a non-maximum suppression (NMS) algorithm and eliminate false detection frames in time, so as to improve accuracy.
Therefore, in this example, the electronic device can maintain the pose estimation results of the tracked objects through the pose tracking set, predict the detection frame for each frame of image first, and then efficiently predict the key points from the detection frame, which avoids interference from too much background image content and improves accuracy and efficiency.
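A per-frame pipeline along these lines might look like the sketch below; box_model and pose_model are hypothetical callables standing in for the pre-trained detection frame prediction model and pose estimation prediction model (not a real API), and the NMS routine reuses the iou() helper from the earlier sketch.

```python
import numpy as np

def nms(boxes, scores, iou_thresh=0.5):
    """Non-maximum suppression: keep the highest-scoring box, drop boxes overlapping it too much."""
    order = np.argsort(scores)[::-1]
    keep = []
    while order.size > 0:
        best = order[0]
        keep.append(int(best))
        rest = order[1:]
        overlaps = np.array([iou(boxes[i], boxes[best]) for i in rest])
        order = rest[overlaps < iou_thresh]
    return keep

def estimate_frame_pose(image, box_model, pose_model):
    """Predict detection frames, filter them with NMS, then predict key points per kept frame."""
    boxes, scores = box_model(image)          # detection frame prediction model (hypothetical)
    kept = nms(boxes, scores)
    return [(boxes[i], pose_model(image, boxes[i])) for i in kept]  # pose estimation prediction model (hypothetical)
```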
In one possible embodiment, the method further comprises: acquiring the second image; processing the second image by using the detection frame prediction model to obtain the second detection frame, and processing the second detection frame by using the pose estimation prediction model to obtain the second key point of the second detection frame; detecting that the second image is the first frame image; and creating the pose tracking set for the second detection frame, and storing the real information of the second detection frame, wherein the real information comprises the identification information of the second image, the position information of the second detection frame and the position information of the second key point.
As can be seen, in this example, the electronic device creates a pose tracking set for at least one photographed user starting from the first frame of image, and maintains a pose tracking subset for each user to achieve continuous recording of the pose information of that user in each frame, which improves the comprehensiveness and accuracy of pose tracking.
The following describes the image processing method in detail with reference to a specific application example. Assume that the objects for which the electronic device performs pose tracking are human bodies, the pose tracking set includes a plurality of pose tracking subsets, each pose tracking subset includes the information of the detection frames of a corresponding user in at least one frame of image, the users in the second image include user a, user b, user c and user d, the users in the first image include user a, user b, user e and user f, and the correspondence between users and detection frames is as follows:
the frames corresponding to user a include frame 1 in the second image and frame 4 in the first image,
the frame of user b comprises frame 2 in the second image and frame 5 in the first image,
the frame of user c comprises frame 3 in the second image,
the frame of user d comprises frame 7 in the second image,
the frame of user e comprises frame 6 in the first image,
the frame of user f comprises frame 8 in the first image;
as shown in fig. 2d, the image processing method provided in the embodiment of the present application includes the following steps:
step 2d01, the electronic device predicts the body posture joint points of frame 4, frame 5, frame 6, frame 8 in the first image.
In step 2d02, the electronic device determines that the first image is not the first frame image.
Step 2d03, the electronic device calculates, for each pair of detection frames, the intersection over union (IOU) of the detection frame combination to obtain the detection frame matching weight Vbox of that combination.
Specifically, the detection frame matching weights of frame 4 with respect to the four detection frame combinations with frame 1, frame 2, frame 3 and frame 7 are Vbox4_1 = IOU4_1, Vbox4_2 = IOU4_2, Vbox4_3 = IOU4_3 and Vbox4_7 = IOU4_7, where IOUx_y is the intersection over union of the detection frame combination [frame x, frame y];
the detection frame matching weights of frame 5 with respect to the four detection frame combinations with frame 1, frame 2, frame 3 and frame 7 are Vbox5_1 = IOU5_1, Vbox5_2 = IOU5_2, Vbox5_3 = IOU5_3 and Vbox5_7 = IOU5_7;
the detection frame matching weights of frame 6 with respect to the four detection frame combinations with frame 1, frame 2, frame 3 and frame 7 are Vbox6_1 = IOU6_1, Vbox6_2 = IOU6_2, Vbox6_3 = IOU6_3 and Vbox6_7 = IOU6_7;
the detection frame matching weights of frame 8 with respect to the four detection frame combinations with frame 1, frame 2, frame 3 and frame 7 are Vbox8_1 = IOU8_1, Vbox8_2 = IOU8_2, Vbox8_3 = IOU8_3 and Vbox8_7 = IOU8_7.
Step 2d04, for each detection frame combination, the electronic device calculates the sum of the Euclidean distances of the pose joint points, pair by pair, to obtain the joint point matching weight Vjoint.
Specifically, if the human body pose joint points include joint point 1 and joint point 2, then:
for the detection frame combination [frame 4, frame 1], the key point matching weight Vjoint4_1 = p41_1 + p41_2, where p denotes a Euclidean distance and pxy_i denotes the Euclidean distance of key point combination i of the detection frame combination [frame x, frame y];
for the detection frame combination [frame 4, frame 2], Vjoint4_2 = p42_1 + p42_2;
for the detection frame combination [frame 4, frame 3], Vjoint4_3 = p43_1 + p43_2;
for the detection frame combination [frame 4, frame 7], Vjoint4_7 = p47_1 + p47_2;
for the detection frame combination [frame 5, frame 1], Vjoint5_1 = p51_1 + p51_2;
for the detection frame combination [frame 5, frame 2], Vjoint5_2 = p52_1 + p52_2;
for the detection frame combination [frame 5, frame 3], Vjoint5_3 = p53_1 + p53_2;
for the detection frame combination [frame 5, frame 7], Vjoint5_7 = p57_1 + p57_2;
for the detection frame combination [frame 6, frame 1], Vjoint6_1 = p61_1 + p61_2;
for the detection frame combination [frame 6, frame 2], Vjoint6_2 = p62_1 + p62_2;
for the detection frame combination [frame 6, frame 3], Vjoint6_3 = p63_1 + p63_2;
for the detection frame combination [frame 6, frame 7], Vjoint6_7 = p67_1 + p67_2;
for the detection frame combination [frame 8, frame 1], Vjoint8_1 = p81_1 + p81_2;
for the detection frame combination [frame 8, frame 2], Vjoint8_2 = p82_1 + p82_2;
for the detection frame combination [frame 8, frame 3], Vjoint8_3 = p83_1 + p83_2;
for the detection frame combination [frame 8, frame 7], Vjoint8_7 = p87_1 + p87_2.
Step 2d05, the electronic device calculates the reference matching weight of each detection frame combination according to the detection frame matching weight and the key point matching weight of each detection frame combination.
Specifically, the reference matching weight of the detection frame combination [frame 4, frame 1] is V4_1 = C1*Vbox4_1 + C2*Vjoint4_1,
the reference matching weight of the detection frame combination [frame 4, frame 2] is V4_2 = C1*Vbox4_2 + C2*Vjoint4_2,
the reference matching weight of the detection frame combination [frame 4, frame 3] is V4_3 = C1*Vbox4_3 + C2*Vjoint4_3,
the reference matching weight of the detection frame combination [frame 4, frame 7] is V4_7 = C1*Vbox4_7 + C2*Vjoint4_7,
the reference matching weight of the detection frame combination [frame 5, frame 1] is V5_1 = C1*Vbox5_1 + C2*Vjoint5_1,
the reference matching weight of the detection frame combination [frame 5, frame 2] is V5_2 = C1*Vbox5_2 + C2*Vjoint5_2,
the reference matching weight of the detection frame combination [frame 5, frame 3] is V5_3 = C1*Vbox5_3 + C2*Vjoint5_3,
the reference matching weight of the detection frame combination [frame 5, frame 7] is V5_7 = C1*Vbox5_7 + C2*Vjoint5_7,
the reference matching weight of the detection frame combination [frame 6, frame 1] is V6_1 = C1*Vbox6_1 + C2*Vjoint6_1,
the reference matching weight of the detection frame combination [frame 6, frame 2] is V6_2 = C1*Vbox6_2 + C2*Vjoint6_2,
the reference matching weight of the detection frame combination [frame 6, frame 3] is V6_3 = C1*Vbox6_3 + C2*Vjoint6_3,
the reference matching weight of the detection frame combination [frame 6, frame 7] is V6_7 = C1*Vbox6_7 + C2*Vjoint6_7,
the reference matching weight of the detection frame combination [frame 8, frame 1] is V8_1 = C1*Vbox8_1 + C2*Vjoint8_1,
the reference matching weight of the detection frame combination [frame 8, frame 2] is V8_2 = C1*Vbox8_2 + C2*Vjoint8_2,
the reference matching weight of the detection frame combination [frame 8, frame 3] is V8_3 = C1*Vbox8_3 + C2*Vjoint8_3,
the reference matching weight of the detection frame combination [frame 8, frame 7] is V8_7 = C1*Vbox8_7 + C2*Vjoint8_7.
Step 2d06, the electronic device solves the maximum-weight matching of the bipartite graph according to the KM algorithm to obtain the matching condition between the detection frames of the first image and the detection frames of the second image.
Wherein the input of the KM algorithm comprises the reference matching weights V4_1, V4_2, V4_3, V4_7, V5_1, V5_2, V5_3, V5_7, V6_1, V6_2, V6_3, V6_7, V8_1, V8_2, V8_3 and V8_7;
the output of the KM algorithm includes:
detecting the target matching weight V' 4_1 of the frame combination [ frame 4, frame 1 ];
detecting a target matching weight V' 5_2 of a frame combination [ frame 5, frame 2 ];
detecting a target matching weight V' 6_3 of a frame combination [ frame 6, frame 3 ];
frame 8 has no match, frame 7 has no match, i.e. frame 7 does not match the detection frame in the first image, and frame 8 is the detection frame that exists in the first image and does not match the detection frame in the second image.
Step 2d07, the electronic device traverses the detection frames in the matching result, which specifically includes the following steps:
for frame 4 and frame 1, detecting that V'4_1 is greater than the preset weight |V|, the electronic device adds the information of frame 4 (the identification information, the detection frame position and the coordinates of the joint points) to the pose tracking subset 1 of frame 1 in the pose tracking set (specifically, an array named Track), and resets the count identifier of pose tracking subset 1;
for frame 5 and frame 2, detecting that V'5_2 is greater than the preset weight |V|, it adds the information of frame 5 to the pose tracking subset 2 of frame 2 in the pose tracking set, and resets the count identifier of pose tracking subset 2;
for frame 6 and frame 3, detecting that V'6_3 is smaller than the preset weight |V|, it creates a pose tracking subset 6 corresponding to frame 6 in the pose tracking set, adds the information of frame 6 to pose tracking subset 6, and adds 1 to the value of the count identifier of pose tracking subset 3; it then determines whether the value of the count identifier of pose tracking subset 3 is greater than or equal to 10;
if the value of the count identifier of pose tracking subset 3 is greater than or equal to 10, pose tracking subset 3 is deleted from the pose tracking set;
if the value of the count identifier of pose tracking subset 3 is less than 10, the incremented count is retained, and compensation information of frame 3 is added to pose tracking subset 3, the compensation information including the identification information of the first image, the position information of detection frame 3 and the coordinates of the joint points in detection frame 3.
For frame 8, a pose tracking subset 8 corresponding to frame 8 is created in the pose tracking set, and the information of frame 8 is added to pose tracking subset 8.
For frame 7, 1 is added to the count identifier of the pose tracking subset 7 corresponding to frame 7, and it is determined whether the value of the count identifier is greater than or equal to 10;
if the value of the count identifier of pose tracking subset 7 is greater than or equal to 10, pose tracking subset 7 is deleted from the pose tracking set;
if the value of the count identifier of pose tracking subset 7 is less than 10, the count is retained, and compensation information of frame 7 is added to pose tracking subset 7, the compensation information including the identification information of the first image, the position information of detection frame 7 and the coordinates of the joint points in detection frame 7.
Step 2d08, the electronic device traverses the pose tracking subsets in the pose tracking set other than the reference pose tracking subsets, where a reference pose tracking subset is a pose tracking subset associated with the first detection frame and/or the second detection frame; this step specifically includes the following:
for each traversed pose tracking subset, the following operations are performed:
adding 1 to the value of the count identifier of the currently processed pose tracking subset;
determining whether the value of the count identifier of the currently processed pose tracking subset is greater than or equal to the preset value;
if yes, deleting the currently processed pose tracking subset from the pose tracking set;
if not, adding, to the currently processed pose tracking subset, the identification information of the first image and the information of the detection frame associated with the second image in the currently processed pose tracking subset, or compensation information.
As can be seen, in this example, for a multi-user pose tracking scenario, the electronic device can calculate the maximum-weight matching of the detection frames based on the already-available key point features combined with the IOU of the human body frames, and does not need to consume extra time to extract features for matching, thereby improving the efficiency and accuracy of the pose estimation performed by the device.
The embodiment of the application provides an image processing device which can be an electronic device. Specifically, the image processing apparatus is configured to execute the steps executed by the electronic device in the above image processing method. The image processing apparatus provided by the embodiment of the present application may include modules corresponding to the respective steps.
The embodiment of the present application may perform division of functional modules on the image processing apparatus according to the above method, for example, each functional module may be divided corresponding to each function, or two or more functions may be integrated into one processing module. The integrated module can be realized in a hardware mode, and can also be realized in a software functional module mode. The division of the modules in the embodiment of the present application is schematic, and is only a logic function division, and there may be another division manner in actual implementation.
Fig. 3 shows a schematic diagram of a possible structure of the image processing apparatus according to the above-described embodiment, in a case where each functional module is divided in correspondence with each function. As shown in fig. 3, the image processing apparatus 3 is applied to an electronic device; the device comprises:
an obtaining unit 30, configured to obtain a first detection frame in a first image and a second detection frame in a second image, and a first key point of a first photographed object in the first detection frame and a second key point of a second photographed object in the second detection frame, where the second image is a previous frame image of the first image, the detection frame is used to indicate an area of a corresponding photographed object in the image, and the key points include pixel points used to describe key positions of the photographed object;
a determining unit 31, configured to determine, according to the first detection frame, the second detection frame, the first key point, and the second key point, a matching condition of the first detection frame and the second detection frame, where the matching condition is used to indicate a state in which a subject is captured in two consecutive frame images;
the determining unit 31 is further configured to determine a pose estimation result of the first image according to the matching condition, where the pose estimation result includes a detection frame and key points included in the detection frame.
In a possible embodiment, in the aspect that the determination of the matching condition between the first detection frame and the second detection frame is performed according to the first detection frame, the second detection frame, the first key point, and the second key point, the determining unit is specifically configured to: calculating an intersection ratio IOU of two detection frames of each detection frame combination in at least one detection frame combination to obtain a detection frame matching weight of each detection frame combination, wherein the at least one detection frame combination is obtained by dividing the first detection frame and the second detection frame according to the following mode: a first detection frame and a second detection frame form a detection frame combination; and calculating euclidean distances of two key points of each key point combination in at least one key point combination of each detection frame combination, and calculating the sum of the euclidean distances of the at least one key point combination to obtain a key point matching weight of each detection frame combination, wherein the at least one key point combination is obtained by dividing the key points of the two detection frames in each detection frame combination in the following manner: a first key point and a second key point form a key point combination, and the position type of the second key point is consistent with that of the first key point; determining a reference matching weight of each detection frame combination according to the detection frame matching weight and the key point matching weight of each detection frame combination; and performing maximum weight matching on the bipartite graph according to the reference matching weight of the at least one detection frame combination to obtain the matching condition between the first detection frame and the second detection frame.
In one possible embodiment, the matching condition includes at least one of:
a first detection frame combination with a target matching weight larger than or equal to a preset matching weight, wherein the target matching weight is the matching weight obtained after the first detection frame combination is subjected to maximum weight matching processing of the bipartite graph, and the first detection frame combination belongs to the at least one detection frame combination;
a second detection frame combination with the target matching weight smaller than the preset matching weight, wherein the second detection frame combination belongs to the at least one detection frame combination;
a first detection frame a1 not matched to any second detection frame;
a second detection frame b1 not matched to any first detection frame.
In a possible embodiment, in the aspect of determining the pose estimation result of the first image according to the matching condition, the determining unit 31 is specifically configured to: for the first detection frame combination, add real information of a first detection frame a2 in the first detection frame combination to a pose tracking subset B2 corresponding to the second detection frame B2, where the real information comprises identification information of the first image, position information of the detection frame, and position information of the key points in the detection frame;
for the second detection frame combination, create a corresponding pose tracking subset A3 for a first detection frame A3 in the second detection frame combination, and store real information of the first detection frame A3 in the second detection frame combination;
for the first detection frame a1, create a corresponding pose tracking subset a1 for the first detection frame a1 and store real information of the first detection frame a1.
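As a purely illustrative sketch of the pose tracking set described above, the tracking state may be kept, for example, as a dictionary of pose tracking subsets keyed by a track identifier, each subset storing per-frame real information and a count identifier. The dataclass layout, the field names, and the resetting of the count identifier when an object is observed again are assumptions made for illustration and are not mandated by this application.

```python
# Non-limiting sketch of one possible in-memory layout for the pose tracking set.
# Track identifiers, field names, and resetting the count identifier on a match
# are illustrative assumptions rather than requirements of this application.
from dataclasses import dataclass, field
from itertools import count
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]
Keypoints = List[Tuple[float, float]]

@dataclass
class PoseTrackSubset:
    count_id: int = 0                                   # count identifier (missed-frame counter)
    records: List[dict] = field(default_factory=list)   # real / compensation information per frame

_next_track_id = count()
pose_tracking_set: Dict[int, PoseTrackSubset] = {}

def add_real_info(track_id: int, image_id, box: Box, kpts: Keypoints) -> None:
    """First detection frame combination: append the real information of the matched
    first detection frame to the subset corresponding to the second detection frame."""
    subset = pose_tracking_set[track_id]
    subset.records.append({"image_id": image_id, "box": box, "keypoints": kpts})
    subset.count_id = 0   # assumption: reset the counter once the object is observed again

def create_subset(image_id, box: Box, kpts: Keypoints) -> int:
    """Second detection frame combination or unmatched first detection frame a1:
    create a new pose tracking subset and store its real information."""
    track_id = next(_next_track_id)
    pose_tracking_set[track_id] = PoseTrackSubset(
        records=[{"image_id": image_id, "box": box, "keypoints": kpts}])
    return track_id
```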
In one possible embodiment, the apparatus further comprises:
a counting unit 32, configured to, for the second detection frame combination, increase by 1 the value of the count identifier of the pose tracking subset B3 corresponding to a second detection frame B3 in the second detection frame combination, and add compensation information of the second detection frame B3 to the pose tracking subset B3, where the compensation information includes identification information of the first image, position information of the detection frame B3, and position information of the key points of the detection frame B3;
a judging unit 33, configured to judge whether the value of the count identifier of the pose tracking subset B3 is greater than or equal to a preset value;
a deleting unit 34, configured to delete the pose tracking subset B3 from the pose estimation result of the first image if yes;
an adding unit 35, configured to add, if not, the identification information of the first image, the position information of the second detection frame B3, and the position information of the second keypoint of the second detection frame B3 in the pose tracking subset B3.
In a possible embodiment, the counting unit 32 is further configured to: for the second detection frame b1, increase the value of the count identifier of the pose tracking subset b1 corresponding to the second detection frame b1 by 1;
the adding unit 35 is further configured to add compensation information of the second detection frame b1 in the pose tracking subset b1, where the compensation information includes identification information of the first image, position information of the detection frame b1, and position information of the key points of the detection frame b1.
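Continuing the same illustrative sketch, a second detection frame whose object is not reliably observed in the first image (a low-weight combination or an unmatched second detection frame b1) could be handled as below. The preset value of 5 and the reuse of the last recorded position as compensation information are hypothetical choices, and the exact order of compensating, judging, and deleting described above is simplified here.

```python
# Non-limiting sketch: compensate a pose tracking subset whose object was not
# reliably observed in the first image. PRESET_COUNT and the reuse of the last
# recorded position as compensation information are illustrative assumptions.
PRESET_COUNT = 5  # hypothetical preset value; not specified in this application

def compensate_or_delete(track_id: int, image_id) -> None:
    """Increase the count identifier by 1; delete the subset from the pose tracking
    set once the count reaches the preset value, otherwise add compensation
    information reusing the last known detection frame and key points."""
    subset = pose_tracking_set[track_id]
    subset.count_id += 1
    if subset.count_id >= PRESET_COUNT:
        del pose_tracking_set[track_id]      # the object has been lost for too long
        return
    last = subset.records[-1]                # last recorded real / compensation information
    subset.records.append({"image_id": image_id,
                           "box": last["box"],
                           "keypoints": last["keypoints"]})
```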
In a possible embodiment, the obtaining unit 30 is further configured to obtain at least one pose tracking subset in a preset pose tracking set, where the at least one pose tracking subset is a pose tracking subset in the pose tracking set other than a reference pose tracking subset, the reference pose tracking subset refers to a pose tracking subset associated with the first detection frame and/or the second detection frame, and the pose tracking set is used for recording pose estimation results of objects being tracked and photographed;
the adding unit 35 is further configured to, for each of the at least one pose tracking subset, perform the following operations: adding 1 to the value of the count identifier of the currently processed pose tracking subset;
the judging unit 33 is further configured to: judge whether the value of the count identifier of the currently processed pose tracking subset is greater than or equal to a preset value;
the deleting unit 34 is further configured to delete the currently processed pose tracking subset from the pose tracking set if so;
the adding unit 35 is further configured to, if not, add, in the currently processed pose tracking subset, the identification information of the first image, the position information of the detection frame associated with the second image in the currently processed pose tracking subset, and the position information of the key points of that detection frame.
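Still within the same sketch, the pose tracking subsets that are associated with neither the first nor the second detection frames could be swept with a short loop reusing the compensate-or-delete policy above; the reference_track_ids argument is an illustrative stand-in for the reference pose tracking subsets.

```python
# Non-limiting sketch: sweep the pose tracking subsets not associated with the
# first or second detection frames, reusing the compensate-or-delete policy above.
def sweep_unassociated_tracks(image_id, reference_track_ids) -> None:
    for track_id in list(pose_tracking_set):   # copy the keys: entries may be deleted
        if track_id in reference_track_ids:    # skip the reference pose tracking subsets
            continue
        compensate_or_delete(track_id, image_id)
```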
In a possible embodiment, in the aspect of acquiring a first detection frame in a first image and a second detection frame in a second image, and a first key point of a first photographed object in the first detection frame and a second key point of a second photographed object in the second detection frame, the obtaining unit 30 is specifically configured to: acquire the second detection frame and the second key point corresponding to the identification information of the second image in a pre-stored pose tracking set; and process the first image by using a pre-trained detection frame prediction model to obtain the first detection frame, and process the first detection frame by using a pre-trained pose estimation prediction model to obtain the first key point.
In a possible embodiment, the obtaining unit 30 is further configured to: acquire the second image;
the determining unit 31 is further configured to: process the second image by using the detection frame prediction model to obtain the second detection frame, and process the second detection frame by using the pose estimation prediction model to obtain the second key point of the second detection frame;
the device further comprises:
a detecting unit 36, configured to detect that the second image is a first frame image;
a creating unit 37, configured to create the pose tracking set for the second detection frame, and store real information of the second detection frame, where the real information includes identification information of the second image, position information of the second detection frame, and position information of the second key point.
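As a final illustrative sketch, the per-frame flow of the two pre-trained models and the first-frame initialization of the pose tracking set might be chained as below. detect_boxes and estimate_keypoints are hypothetical stand-ins for the detection frame prediction model and the pose estimation prediction model; their concrete interfaces are not specified in this application.

```python
# Non-limiting sketch: chain the detection frame prediction model and the pose
# estimation prediction model for one frame, and initialize the pose tracking set
# on the first frame. detect_boxes() and estimate_keypoints() are hypothetical
# stand-ins for the two pre-trained models.
def process_frame(image, image_id, detect_boxes, estimate_keypoints, is_first_frame):
    boxes = detect_boxes(image)                                   # detection frame prediction model
    kpts = [estimate_keypoints(image, box) for box in boxes]      # pose estimation prediction model
    if is_first_frame:
        # First frame: create a pose tracking subset for every detection frame
        # and store its real information.
        for box, kp in zip(boxes, kpts):
            create_subset(image_id, box, kp)
    return boxes, kpts
```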
In the case of using an integrated unit, a schematic structural diagram of another image processing apparatus provided in an embodiment of the present application is shown in Fig. 4. In Fig. 4, the image processing apparatus 4 includes a processing module 40 and a communication module 41. The processing module 40 is used for controlling and managing the actions of the image processing apparatus, for example the steps performed by the obtaining unit 30, the determining unit 31, the counting unit 32, the judging unit 33, the deleting unit 34, the adding unit 35, the detecting unit 36, and the creating unit 37, and/or other processes for performing the techniques described herein. The communication module 41 is used to support interaction between the image processing apparatus and other devices. As shown in Fig. 4, the image processing apparatus may further include a storage module 42, and the storage module 42 is used for storing program codes and data of the image processing apparatus.
The processing module 40 may be a processor or a controller, for example a central processing unit (CPU), a general-purpose processor, a digital signal processor (DSP), an ASIC, an FPGA or other programmable logic device, a transistor logic device, a hardware component, or any combination thereof. It may implement or perform the various illustrative logical blocks, modules, and circuits described in connection with this disclosure. The processor may also be a combination implementing computing functions, for example a combination of one or more microprocessors, or a combination of a DSP and a microprocessor. The communication module 41 may be a transceiver, an RF circuit, a communication interface, or the like. The storage module 42 may be a memory.
For all relevant details of each scenario involved in the above method embodiment, reference may be made to the functional description of the corresponding functional module; they are not repeated here. Both the image processing apparatus 3 and the image processing apparatus 4 can perform the steps performed by the electronic device in the image processing method shown in Fig. 2a.
The foregoing embodiments have been described in detail to illustrate the principles and implementations of the present application, and the above description is only provided to help understand the method and the core concept of the present application. Meanwhile, a person skilled in the art may, based on the idea of the present application, make changes to the specific embodiments and the application scope. In summary, the content of this specification should not be construed as limiting the present application.
The above embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, the above-described embodiments may be implemented in whole or in part in the form of a computer program product. The computer program product comprises one or more computer instructions or computer programs. The procedures or functions according to the embodiments of the present application are wholly or partially generated when the computer instructions or the computer program are loaded or executed on a computer. The computer may be a general purpose computer, a special purpose computer, a network of computers, or other programmable device. The computer instructions may be stored in a computer readable storage medium or transmitted from one computer readable storage medium to another computer readable storage medium, for example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wire or wirelessly. The computer-readable storage medium can be any available medium that can be accessed by a computer or a data storage device such as a server, data center, etc. that contains one or more collections of available media. The usable medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium. The semiconductor medium may be a solid state disk.
Embodiments of the present application also provide a computer storage medium, where the computer storage medium stores a computer program for electronic data exchange, the computer program enabling a computer to execute part or all of the steps of any one of the methods described in the above method embodiments, and the computer includes an electronic device.
Embodiments of the present application also provide a computer program product comprising a non-transitory computer readable storage medium storing a computer program operable to cause a computer to perform some or all of the steps of any of the methods as described in the above method embodiments. The computer program product may be a software installation package, the computer comprising an electronic device.
It should be understood that, in the various embodiments of the present application, the sequence numbers of the above-mentioned processes do not mean the execution sequence, and the execution sequence of each process should be determined by its function and inherent logic, and should not constitute any limitation to the implementation process of the embodiments of the present application.
In the several embodiments provided in the present application, it should be understood that the disclosed method, apparatus and system may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative; for example, the division of the unit is only a logic function division, and there may be another division manner in actual implementation; for example, various elements or components may be combined or may be integrated into another system, or some features may be omitted, or not implemented. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may be physically included alone, or two or more units may be integrated into one unit. The integrated unit can be realized in a form of hardware, or in a form of hardware plus a software functional unit.
The integrated unit implemented in the form of a software functional unit may be stored in a computer readable storage medium. The software functional unit is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute some steps of the methods according to the embodiments of the present invention. And the aforementioned storage medium includes: various media capable of storing program codes, such as a usb disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
Although the present invention is disclosed above, the present invention is not limited thereto. Various changes and modifications can be easily made by those skilled in the art without departing from the spirit and scope of the present invention, and it is within the scope of the present invention to include different functions, combination of implementation steps, software and hardware implementations.

Claims (12)

1. An image processing method, comprising:
acquiring a first detection frame in a first image and a second detection frame in a second image, as well as a first key point of a first shot object in the first detection frame and a second key point of a second shot object in the second detection frame, wherein the second image is a previous frame image of the first image, the detection frames are used for indicating an area of a corresponding shot object in the image, and the key points comprise pixel points used for describing key positions of the shot object;
determining a matching condition of the first detection frame and the second detection frame according to the first detection frame, the second detection frame, the first key point and the second key point, wherein the matching condition is used for indicating a state in which the photographed object is captured in the two consecutive frames of images;
and determining a pose estimation result of the first image according to the matching condition, wherein the pose estimation result comprises a detection frame and key points contained in the detection frame.
2. The method of claim 1, wherein the determining the matching condition of the first detection frame and the second detection frame according to the first detection frame, the second detection frame, the first key point and the second key point comprises:
calculating an intersection over union (IOU) of the two detection frames of each detection frame combination in at least one detection frame combination to obtain a detection frame matching weight of each detection frame combination, wherein the at least one detection frame combination is obtained by dividing the first detection frame and the second detection frame in the following manner: one first detection frame and one second detection frame form a detection frame combination;
calculating Euclidean distances of two key points of each key point combination in at least one key point combination of each detection frame combination, and calculating the sum of the Euclidean distances of the at least one key point combination to obtain a key point matching weight of each detection frame combination, wherein the at least one key point combination is obtained by dividing the key points of the two detection frames in each detection frame combination in the following way: a first key point and a second key point form a key point combination, and the position type of the second key point is consistent with that of the first key point;
determining a reference matching weight of each detection frame combination according to the detection frame matching weight and the key point matching weight of each detection frame combination;
and performing maximum weight matching on the bipartite graph according to the reference matching weight of the at least one detection frame combination to obtain the matching condition between the first detection frame and the second detection frame.
3. The method of claim 2, wherein the matching condition comprises at least one of:
a first detection frame combination with a target matching weight larger than or equal to a preset matching weight, wherein the target matching weight is the matching weight obtained after the first detection frame combination is subjected to maximum weight matching processing of the bipartite graph, and the first detection frame combination belongs to the at least one detection frame combination;
a second detection frame combination with the target matching weight smaller than the preset matching weight, wherein the second detection frame combination belongs to the at least one detection frame combination;
a first detection frame a1 not matched to any second detection frame; and,
a second detection frame b1 not matched to any first detection frame.
4. The method of claim 3, wherein determining the pose estimation result for the first image based on the match condition comprises:
for the first detection frame combination, adding real information of a first detection frame a2 in the first detection frame combination to a pose tracking subset B2 corresponding to the second detection frame B2, wherein the real information comprises identification information of the first image, position information of the detection frame, and position information of the key points in the detection frame;
for the second detection frame combination, creating a corresponding pose tracking subset A3 for a first detection frame A3 in the second detection frame combination, and storing real information of the first detection frame A3 in the second detection frame combination;
for the first detection frame a1, creating a corresponding pose tracking subset a1 for the first detection frame a1 and storing real information of the first detection frame a1.
5. The method of claim 4, further comprising:
for the second detection frame combination, increasing the value of the count identifier of the pose tracking subset B3 corresponding to a second detection frame B3 in the second detection frame combination by 1, and adding compensation information of the second detection frame B3 in the pose tracking subset B3, wherein the compensation information comprises the identification information of the first image, the position information of the detection frame B3, and the position information of the key points of the detection frame B3;
determining whether the value of the count identifier of the pose tracking subset B3 is greater than or equal to a preset value;
if yes, deleting the pose tracking subset B3 from the pose estimation result of the first image;
if not, adding the identification information of the first image, the position information of the second detection frame B3, and the position information of the second key point of the second detection frame B3 in the pose tracking subset B3.
6. The method according to claim 4 or 5, characterized in that the method further comprises:
for the second detection frame b1, increasing the value of the count identifier of the pose tracking subset b1 corresponding to the second detection frame b1 by 1, and adding compensation information of the second detection frame b1 to the pose tracking subset b1, wherein the compensation information includes the identification information of the first image, the position information of the detection frame b1, and the position information of the key points of the detection frame b1.
7. The method according to claim 4 or 5, characterized in that the method further comprises:
acquiring at least one pose tracking subset in a preset pose tracking set, wherein the at least one pose tracking subset is a pose tracking subset in the pose tracking set other than a reference pose tracking subset, the reference pose tracking subset refers to a pose tracking subset associated with the first detection frame and/or the second detection frame, and the pose tracking set is used for recording pose estimation results of objects being tracked and photographed;
for each of the at least one pose tracking subset, performing the following operations:
adding 1 to the value of the count identifier of the currently processed pose tracking subset;
determining whether the value of the count identifier of the currently processed pose tracking subset is greater than or equal to a preset value;
if yes, deleting the currently processed pose tracking subset from the pose tracking set;
and if not, adding, in the currently processed pose tracking subset, the identification information of the first image, the position information of the detection frame associated with the second image in the currently processed pose tracking subset, and the position information of the key points of that detection frame.
8. The method according to any one of claims 1 to 7, wherein the acquiring a first detection frame in the first image and a second detection frame in the second image, and a first key point of a first captured object in the first detection frame and a second key point of a second captured object in the second detection frame, comprises:
acquiring the second detection frame and the second key point corresponding to the identification information of the second image in the pose tracking set;
processing the first image by using a pre-trained detection frame prediction model to obtain the first detection frame, and processing the first detection frame by using a pre-trained pose estimation prediction model to obtain the first key point.
9. The method of claim 8, further comprising:
acquiring the second image;
processing the second image by using the detection frame prediction model to obtain the second detection frame, and processing the second detection frame by using the pose estimation prediction model to obtain the second key point of the second detection frame;
detecting that the second image is a first frame image;
and creating the pose tracking set for the second detection frame, and storing real information of the second detection frame, wherein the real information comprises identification information of the second image, position information of the second detection frame, and position information of the second key point.
10. An image processing apparatus characterized by comprising:
an obtaining unit, configured to obtain a first detection frame in a first image and a second detection frame in a second image, and a first key point of a first photographed object in the first detection frame and a second key point of a second photographed object in the second detection frame, wherein the second image is a previous frame image of the first image, the detection frames are used for indicating the area of the corresponding photographed object in the image, and the key points comprise pixel points for describing key positions of the photographed object;
a determining unit, configured to determine, according to the first detection frame, the second detection frame, the first key point, and the second key point, a matching condition of the first detection frame and the second detection frame, wherein the matching condition is used for indicating a state in which the photographed object is captured in the two consecutive frames of images;
the determining unit is further configured to determine a pose estimation result of the first image according to the matching condition, where the pose estimation result includes a detection frame and key points included in the detection frame.
11. An electronic device, comprising:
one or more processors;
one or more memories for storing a program,
wherein the program is stored in the one or more memories and is configured to be executed by the one or more processors to perform the steps in the method of any one of claims 1-9.
12. A computer-readable storage medium, characterized in that a computer program for electronic data exchange is stored, wherein the computer program causes a computer to perform the method according to any one of claims 1-9.
CN202110723162.8A 2021-06-28 2021-06-28 Image processing method and related device Active CN113489897B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110723162.8A CN113489897B (en) 2021-06-28 2021-06-28 Image processing method and related device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110723162.8A CN113489897B (en) 2021-06-28 2021-06-28 Image processing method and related device

Publications (2)

Publication Number Publication Date
CN113489897A true CN113489897A (en) 2021-10-08
CN113489897B CN113489897B (en) 2023-05-26

Family

ID=77936205

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110723162.8A Active CN113489897B (en) 2021-06-28 2021-06-28 Image processing method and related device

Country Status (1)

Country Link
CN (1) CN113489897B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114549584A (en) * 2022-01-28 2022-05-27 北京百度网讯科技有限公司 Information processing method and device, electronic equipment and storage medium

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108093153A (en) * 2017-12-15 2018-05-29 深圳云天励飞技术有限公司 Method for tracking target, device, electronic equipment and storage medium
US20180284777A1 (en) * 2015-12-10 2018-10-04 Autel Robotics Co., Ltd. Method, control apparatus, and system for tracking and shooting target
CN109118523A (en) * 2018-09-20 2019-01-01 电子科技大学 A kind of tracking image target method based on YOLO
CN110661977A (en) * 2019-10-29 2020-01-07 Oppo广东移动通信有限公司 Subject detection method and apparatus, electronic device, and computer-readable storage medium
CN111327828A (en) * 2020-03-06 2020-06-23 Oppo广东移动通信有限公司 Photographing method and device, electronic equipment and storage medium
CN111798483A (en) * 2020-06-28 2020-10-20 浙江大华技术股份有限公司 Anti-blocking pedestrian tracking method and device and storage medium
WO2020244032A1 (en) * 2019-06-03 2020-12-10 罗普特科技集团股份有限公司 Face image detection method and apparatus

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180284777A1 (en) * 2015-12-10 2018-10-04 Autel Robotics Co., Ltd. Method, control apparatus, and system for tracking and shooting target
CN108093153A (en) * 2017-12-15 2018-05-29 深圳云天励飞技术有限公司 Method for tracking target, device, electronic equipment and storage medium
CN109118523A (en) * 2018-09-20 2019-01-01 电子科技大学 A kind of tracking image target method based on YOLO
WO2020244032A1 (en) * 2019-06-03 2020-12-10 罗普特科技集团股份有限公司 Face image detection method and apparatus
CN110661977A (en) * 2019-10-29 2020-01-07 Oppo广东移动通信有限公司 Subject detection method and apparatus, electronic device, and computer-readable storage medium
CN111327828A (en) * 2020-03-06 2020-06-23 Oppo广东移动通信有限公司 Photographing method and device, electronic equipment and storage medium
CN111798483A (en) * 2020-06-28 2020-10-20 浙江大华技术股份有限公司 Anti-blocking pedestrian tracking method and device and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
孙锐等: "级联网络和金字塔光流的旋转不变人脸检测", 《光电工程》 *


Also Published As

Publication number Publication date
CN113489897B (en) 2023-05-26

Similar Documents

Publication Publication Date Title
CN111553234B (en) Pedestrian tracking method and device integrating facial features and Re-ID feature ordering
CN111145214A (en) Target tracking method, device, terminal equipment and medium
CN109598744B (en) Video tracking method, device, equipment and storage medium
US20140233800A1 (en) Method of tracking object and electronic device supporting the same
CN110298306B (en) Method, device and equipment for determining motion information of target object
WO2019242672A1 (en) Method, device and system for target tracking
CN111104925B (en) Image processing method, image processing apparatus, storage medium, and electronic device
EP3531340B1 (en) Human body tracing method, apparatus and device, and storage medium
CN110660102B (en) Speaker recognition method, device and system based on artificial intelligence
CN110796472A (en) Information pushing method and device, computer readable storage medium and computer equipment
CN114063098A (en) Multi-target tracking method, device, computer equipment and storage medium
CN111047622B (en) Method and device for matching objects in video, storage medium and electronic device
CN113489897A (en) Image processing method and related device
CN112258647B (en) Map reconstruction method and device, computer readable medium and electronic equipment
CN113298852A (en) Target tracking method and device, electronic equipment and computer readable storage medium
CN116258881A (en) Image clustering method, device, terminal and computer readable storage medium
CN105631938B (en) Image processing method and electronic equipment
CN114648556A (en) Visual tracking method and device and electronic equipment
CN114527456A (en) UWB-based motion trajectory identification method and electronic equipment
CN113936042A (en) Target tracking method and device and computer readable storage medium
CN113205079A (en) Face detection method and device, electronic equipment and storage medium
CN112329889A (en) Image processing method and device and electronic equipment
CN112465890A (en) Depth detection method and device, electronic equipment and computer readable storage medium
CN115100442B (en) Target matching method, target and part matching method and related equipment
CN110838138A (en) Repetitive texture detection method, device, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant