CN117237386A - Method, device and computer equipment for carrying out structuring processing on target object - Google Patents

Method, device and computer equipment for carrying out structuring processing on target object

Publication number: CN117237386A
Application number: CN202210633956.XA
Inventors: 徐列, 孙冲, 潘俊毅
Applicant / original assignee: Tencent Technology Shenzhen Co Ltd
Legal status: Pending
Classification (Landscapes): Image Analysis
Abstract

The present application relates to a method, an apparatus, a computer device, a storage medium and a computer program product for structuring a target object. The method comprises the following steps: detecting a target object in an image to be processed to obtain a semantic segmentation mask corresponding to the target object and first position information of key points of the target object; determining edge pixels in the image to be processed, and screening target edge pixels corresponding to the target object from the edge pixels according to the semantic segmentation mask; determining at least one candidate region corresponding to the target object based on the target edge pixels; determining a target region belonging to the target object from the at least one candidate region according to the first position information of the key points of the target object; and determining a first segmentation mask corresponding to the target object based on the target region, and determining a structuring result of the target object according to the first segmentation mask and the first position information of the key points, thereby improving the accuracy of the structuring result.

Description

Method, device and computer equipment for carrying out structuring processing on target object
Technical Field
The present application relates to the field of image processing technology, and in particular, to a method, an apparatus, a computer device, a storage medium, and a computer program product for performing structural processing on a target object.
Background
With the development of image processing technology, its applications have become increasingly widespread. For example, a specific object in an image can be extracted and a driving action designed for it, so that the object performs the driving action, which increases the fun and play diversity of an application program. Extracting the object presupposes obtaining the structured result of the object, which may include the position information of the object's key points and the segmentation mask of the object.
In the conventional technology, the position information of the key points is obtained with a key point detection algorithm, and the segmentation mask is obtained with a binarization algorithm. However, the image to be processed contains much background in addition to the object of interest, and the segmentation mask obtained by the binarization algorithm may contain much background information, so its accuracy is low; consequently, the accuracy of the structured result of the object is also low.
Disclosure of Invention
In view of the foregoing, it is desirable to provide a method, an apparatus, a computer device, a storage medium, and a computer program product for performing structuring processing on a target object, which can improve the accuracy of the structuring result.
In one aspect, the present application provides a method for structuring a target object. The method comprises the following steps:
detecting a target object in an image to be processed to obtain a semantic segmentation mask corresponding to the target object and first position information of key points of the target object;
determining edge pixels in the image to be processed, and screening target edge pixels corresponding to the target object from the edge pixels according to the semantic segmentation mask;
determining at least one candidate region corresponding to the target object based on the target edge pixels;
determining a target area belonging to the target object from at least one candidate area according to first position information of a key point of the target object;
and determining a first segmentation mask corresponding to the target object based on the target region, and determining a structuring result of the target object according to the first segmentation mask and the first position information of the key points.
On the other hand, the application also provides a device for carrying out structuring processing on the target object. The device comprises:
the detection module is used for detecting a target object in the image to be processed to obtain a semantic segmentation mask corresponding to the target object and first position information of key points of the target object;
the first screening module is used for determining edge pixels in the image to be processed and screening target edge pixels corresponding to the target object from the edge pixels according to the semantic segmentation mask;
the second screening module is used for determining at least one candidate region corresponding to the target object based on the target edge pixels; determining a target area belonging to the target object from at least one candidate area according to first position information of a key point of the target object;
and the determining module is used for determining a first segmentation mask corresponding to the target object based on the target area and determining a structuring result of the target object according to the first segmentation mask and the first position information of the key points.
In some embodiments, the detection module is specifically configured to:
inputting the image to be processed into a multi-task neural network to obtain a key point thermodynamic diagram and a semantic feature map; searching for a maximum value on the key point thermodynamic diagram, and determining a mapping proportion according to the scale of the image to be processed and the scale of the key point thermodynamic diagram; determining first position information of a key point corresponding to the target object based on the maximum value and the mapping proportion; and up-sampling the semantic feature map to obtain a target semantic feature map matched with the scale of the image to be processed, and normalizing the target semantic feature map to obtain a semantic segmentation mask corresponding to the target object.
In some embodiments, the detection module is specifically configured to:
determining a symmetrical window on a key point thermodynamic diagram by taking the maximum value as the center;
weighting the coordinates of each position in the symmetrical window based on the values of each position in the symmetrical window to obtain the coordinates of the key points before mapping;
and determining first position information of the key points corresponding to the target object based on the key point coordinates before mapping and the mapping proportion.
In some embodiments, the first screening module is specifically configured to:
for each pixel on the image to be processed, determining a binarization threshold value corresponding to the current pixel based on the pixel value of the current pixel and the pixel values of surrounding pixels of the current pixel;
updating the pixel value of the current pixel to a first value under the condition that the pixel value of the current pixel is larger than a binarization threshold value;
updating the pixel value of the current pixel to a second value if the pixel value of the current pixel is less than or equal to the binarization threshold, wherein the second value is different from the first value;
and taking the pixel with the pixel value of the first value as an edge pixel in the image to be processed.
In some embodiments, the first screening module is specifically configured to:
traversing each edge pixel, and for the current edge pixel traversed currently, searching a classification label corresponding to the current edge pixel from the semantic segmentation mask;
and under the condition that the classification label is the target object, taking the current edge pixel as a target edge pixel, until all edge pixels are traversed, so as to obtain the target edge pixels corresponding to the target object.
In some embodiments, the first screening module is specifically configured to:
determining the respective corresponding outline of each part of the target object based on the target edge pixels;
for each contour, filling from outside to inside from the edge of the contour, and obtaining at least one candidate region corresponding to the target object.
In some embodiments, the first screening module is specifically configured to:
for each candidate region in the at least one candidate region, determining whether the candidate region contains a key point according to the first position information of the key point of the target object;
in the case where the candidate region contains a key point, the candidate region is determined as a target region.
In some embodiments, the determining module is specifically configured to:
and setting the pixel values of all pixels in the target area as a first value, and setting the pixel values of all pixels in the area except the target area in the image to be processed as a second value, so as to obtain a first segmentation mask corresponding to the target object.
In some embodiments, the image to be processed is one frame of image in the video stream, and the determining module is specifically configured to:
acquiring a homography transformation matrix corresponding to an image to be processed, wherein the homography transformation matrix characterizes optical flow information of a video stream;
determining second position information of key points of a target object in the image to be processed according to the first position information of the key points in the previous frame of the image to be processed and the homography transformation matrix;
determining target position information of key points of a target object in an image to be processed according to first position information and second position information corresponding to the key points of the target object in the image to be processed;
determining a second segmentation mask corresponding to a target object in the image to be processed according to a first segmentation mask corresponding to a previous frame of the image to be processed and the homography transformation matrix;
determining a target segmentation mask corresponding to a target object in the image to be processed according to a first segmentation mask and a second segmentation mask corresponding to the target object in the image to be processed; the target position information of the key points of the target object and the target segmentation mask are used to construct a structured result of the target object.
In some embodiments, the determining module is specifically configured to:
acquiring a first image and a second image from a video stream, wherein the first image is a previous frame image of the second image;
detecting a target object in the first image to obtain a key point corresponding to the target object on the first image;
detecting a target object in the second image to obtain a key point corresponding to the target object on the second image;
a homography transformation matrix is determined based on keypoints on the first image corresponding to the target object and keypoints on the second image corresponding to the target object.
In some embodiments, the determining module is further to:
extracting an avatar corresponding to the target object from the image to be processed based on the target segmentation mask;
determining the motion trajectory of each part of the avatar based on a preset driving action and the target position information of the key points of the target object;
and obtaining a video stream matched with the preset driving action based on the motion trajectory of each part of the avatar.
On the other hand, the application also provides computer equipment. The computer device comprises a memory storing a computer program and a processor which, when executing the computer program, implements the steps of:
detecting a target object in an image to be processed to obtain a semantic segmentation mask corresponding to the target object and first position information of key points of the target object;
determining edge pixels in the image to be processed, and screening target edge pixels corresponding to the target object from the edge pixels according to the semantic segmentation mask;
determining at least one candidate region corresponding to the target object based on the target edge pixels;
determining a target area belonging to the target object from at least one candidate area according to first position information of a key point of the target object;
and determining a first segmentation mask corresponding to the target object based on the target region, and determining a structuring result of the target object according to the first segmentation mask and the first position information of the key points.
In another aspect, the present application also provides a computer-readable storage medium. The computer-readable storage medium has stored thereon a computer program which, when executed by a processor, implements the steps of:
detecting a target object in an image to be processed to obtain a semantic segmentation mask corresponding to the target object and first position information of key points of the target object;
determining edge pixels in the image to be processed, and screening target edge pixels corresponding to the target object from the edge pixels according to the semantic segmentation mask;
determining at least one candidate region corresponding to the target object based on the target edge pixels;
determining a target area belonging to the target object from at least one candidate area according to first position information of a key point of the target object;
and determining a first segmentation mask corresponding to the target object based on the target region, and determining a structuring result of the target object according to the first segmentation mask and the first position information of the key points.
In another aspect, the present application also provides a computer program product. The computer program product comprises a computer program which, when executed by a processor, implements the steps of:
detecting a target object in an image to be processed to obtain a semantic segmentation mask corresponding to the target object and first position information of key points of the target object;
determining edge pixels in the image to be processed, and screening target edge pixels corresponding to the target object from the edge pixels according to the semantic segmentation mask;
determining at least one candidate region corresponding to the target object based on the target edge pixels;
determining a target area belonging to the target object from at least one candidate area according to first position information of a key point of the target object;
and determining a first segmentation mask corresponding to the target object based on the target region, and determining a structuring result of the target object according to the first segmentation mask and the first position information of the key points.
The method, the device, the computer equipment, the storage medium and the computer program product for performing structuring processing on the target object can, on the one hand, detect the target object in the image to be processed to obtain the semantic segmentation mask corresponding to the target object and the first position information of the key points of the target object. On the other hand, edge pixels in the image to be processed can be determined; because the pixels in the semantic segmentation mask carry classification labels, target edge pixels corresponding to the target object can be screened out from the edge pixels according to the semantic segmentation mask, and at least one candidate region corresponding to the target object can be determined based on the target edge pixels. In order to filter out regions that do not belong to the target object, the target region belonging to the target object can be determined from the at least one candidate region according to the first position information of the key points of the target object; a first segmentation mask corresponding to the target object is then determined based on the target region, and the structuring result of the target object is determined according to the first segmentation mask and the first position information of the key points, where the structuring result of the target object can be used for preset action driving or other filter applications. The target region obtained in this way fits the target object more closely, so the first segmentation mask obtained based on the target region is more accurate, and the accuracy of the structuring processing is greatly improved.
Drawings
FIG. 1 is an application environment diagram of a method of structuring a target object in one embodiment;
FIG. 2 is a flow diagram of a method of structuring a target object in one embodiment;
FIG. 3 is a schematic diagram of the architecture of a multi-task neural network in one embodiment;
FIG. 4 is a schematic diagram of a symmetric window in one embodiment;
FIG. 5 is a schematic diagram of an edge pixel in one embodiment;
FIG. 6 is a schematic diagram of filtering edge pixels in one embodiment;
FIG. 7 is a schematic diagram of a target edge pixel in one embodiment;
FIG. 8 is a schematic diagram of at least one candidate region in one embodiment;
FIG. 9 is a schematic diagram of a target area in one embodiment;
FIG. 10 is a flow chart of a method for structuring a target object in another embodiment;
FIG. 11 is a schematic diagram of a method of structuring a target object in one embodiment;
FIG. 12 is a schematic diagram of a keypoint predictor in one embodiment;
FIG. 13 is a schematic diagram of the result of a structuring process in one embodiment;
FIG. 14 is a block diagram of an apparatus for structuring a target object in one embodiment;
fig. 15 is an internal structural view of a computer device in one embodiment.
Detailed Description
The present application will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present application more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the application.
The method for structuring a target object provided by the embodiments of the present application can be applied to the application environment shown in fig. 1, wherein the terminal 102 communicates with the server 104 via a network. The data storage system may store data that the server 104 needs to process; it may be integrated on the server 104 or located on the cloud or another server. The terminal 102 may collect an image or a video stream and send it to the server 104, and the server 104 may use the received image as the image to be processed, or use any frame of the received video stream as the image to be processed. The server 104 may detect the target object in the image to be processed to obtain the semantic segmentation mask corresponding to the target object and the first position information of the key points of the target object. It may also determine edge pixels in the image to be processed; because the pixels in the semantic segmentation mask carry classification labels, target edge pixels corresponding to the target object can be screened out from the edge pixels according to the semantic segmentation mask, and at least one candidate region corresponding to the target object can be determined based on the target edge pixels. In order to filter out regions that do not belong to the target object, the target region belonging to the target object can be determined from the at least one candidate region according to the first position information of the key points of the target object; a first segmentation mask corresponding to the target object is then determined based on the target region, and the structuring result of the target object is determined according to the first segmentation mask and the first position information of the key points, where the structuring result can be used for preset action driving or other filter applications. Through the process of screening the target edge pixels from the edge pixels according to the semantic segmentation mask, and the process of determining the target region from the at least one candidate region according to the first position information of the key points, the resulting target region fits the target object more closely, the first segmentation mask obtained based on it is more accurate, and the accuracy of the structuring processing is improved.
It should be noted that: the application environment shown in fig. 1 is only an example, and the method for performing the structuring processing on the target object provided by the embodiment of the present application may also be independently executed by the terminal or independently executed by the server, which is not limited by the embodiment of the present application.
The terminal 102 may be, but not limited to, various personal computers, notebook computers, smart phones, tablet computers, internet of things devices, and portable wearable devices, where the internet of things devices may be smart speakers, smart televisions, smart air conditioners, smart vehicle devices, and the like. The portable wearable device may be a smart watch, smart bracelet, headset, or the like. The server 104 may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing cloud computing services.
In some embodiments, as shown in fig. 2, a method for structuring a target object is provided. The method is described as applied to a computer device (which may specifically be the terminal or the server in fig. 1) for illustration, and includes the following steps:
s202, detecting a target object in an image to be processed to obtain a semantic segmentation mask corresponding to the target object and first position information of key points of the target object.
In the case that the computer device is the server in fig. 1, after the terminal collects the image, the image is sent to the server, and the server can use the received image as the image to be processed. Or after the terminal collects the video stream, the video stream is sent to the server, and the server can take any frame image in the received video stream as an image to be processed.
The target object may be any object in the image to be processed. For example, the target object may be an object having an anthropomorphic form in the image to be processed, such as a cartoon character or a stick figure; the embodiments of the present application are not limited in this respect.
In some embodiments, the computer device may obtain a multi-tasking neural network through training, which may include a backbone convolutional network and two branch networks, the output of the backbone convolutional network being the input of the two branch networks. One of the two branch networks is a key point detection network, and the other branch network is a semantic segmentation network. The computer equipment can input the image to be processed into the multi-task neural network, the feature layer is obtained through the processing of the main convolution network, the feature layer is respectively input into the key point detection network and the semantic segmentation network, the key point detection network can output the key point thermodynamic diagram, and the first position information of the key point of the target object can be determined based on the key point thermodynamic diagram. The semantic segmentation network outputs a semantic feature map, and a semantic segmentation mask corresponding to the target object can be determined based on the semantic feature map.
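As an illustration of this architecture, the following PyTorch sketch shows the described topology: a shared backbone convolutional network whose feature layer feeds both a key point detection head and a semantic segmentation head. The patent does not disclose concrete layers, so every layer size, stride and name below is an assumption; only the two-branch shape and the 1/4-scale outputs follow the text.

```python
import torch
import torch.nn as nn

class MultiTaskNet(nn.Module):
    def __init__(self, num_keypoints=17, num_classes=2):
        super().__init__()
        # Shared backbone convolutional network; two stride-2 convolutions
        # produce a feature layer at 1/4 of the input resolution (assumed).
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
        )
        # Branch 1: key point detection network, one thermodynamic diagram
        # (heatmap) per pre-configured key point.
        self.keypoint_head = nn.Conv2d(64, num_keypoints, 1)
        # Branch 2: semantic segmentation network, one feature map per
        # pre-configured semantic (e.g. background and foreground).
        self.segmentation_head = nn.Conv2d(64, num_classes, 1)

    def forward(self, x):                          # x: (N, 3, H, W)
        feats = self.backbone(x)                   # (N, 64, H/4, W/4)
        heatmaps = self.keypoint_head(feats)       # (N, 17, H/4, W/4)
        semantics = self.segmentation_head(feats)  # (N, 2, H/4, W/4)
        return heatmaps, semantics
```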
S204, determining edge pixels in the image to be processed, and screening target edge pixels corresponding to the target object from the edge pixels according to the semantic segmentation mask.
The edge pixels in the image to be processed comprise the pixels corresponding to the edge contours of all objects in the image. For example, if the image to be processed contains two objects, namely a cartoon character and a tree, the edge pixels in the image to be processed comprise the pixels corresponding to the edge contour of the cartoon character and the pixels corresponding to the edge contour of the tree.
In some embodiments, the computer device may binarize the image to be processed and determine the edge pixels based on the result of the binarization. Specifically, for each pixel on the image to be processed, the pixel value is compared with a specific threshold: if the pixel value is greater than the specific threshold, it is updated to 1; if it is less than or equal to the specific threshold, it is updated to 0. Performing this processing on all pixels yields a black-and-white image; the pixels with a pixel value of 1 are then found in the black-and-white image and taken as the edge pixels.
It should be explained that updating the pixel value to 1 when it is greater than the specific threshold and to 0 when it is less than or equal to the specific threshold is only one example. Alternatively, the pixel value may be updated to 0 when it is greater than the specific threshold and to 1 otherwise, in which case the pixels with a pixel value of 0 in the black-and-white image are taken as the edge pixels. The following examples use the former scheme.
The specific threshold may be a fixed value configured in advance, or an adaptive threshold. An adaptive threshold means that the threshold compared against each pixel's value may differ from the thresholds of other pixels; for example, the threshold compared with a pixel's value may be determined based on that pixel's value and the pixel values of its surrounding pixels. The embodiments of the present application are not limited in this respect.
It should be noted that after the computer device binarizes the image to be processed, a black-and-white image is obtained. Due to shooting angle and image quality, the edge contour of the target object may be unclear, so breaks may exist between the pixels with a pixel value of 1 in the binarized black-and-white image. The binarized black-and-white image can therefore be subjected to expansion (dilation) processing to repair the breaks, yielding a repaired black-and-white image, and the pixels with a pixel value of 1 in the repaired black-and-white image are taken as the edge pixels.
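One possible realization of this edge-extraction step is sketched below with OpenCV: an adaptive mean threshold compares each pixel against the mean of its neighborhood, and a dilation then repairs the breaks. The 11×11 neighborhood and the 3×3 dilation kernel are assumed values, not taken from the patent.

```python
import cv2
import numpy as np

def extract_edge_pixels(image_bgr):
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    # Adaptive binarization: each pixel is compared against the mean of its
    # 11x11 neighborhood and set to 1 if greater, 0 otherwise.
    # ADAPTIVE_THRESH_GAUSSIAN_C would give the distance-weighted variant.
    binary = cv2.adaptiveThreshold(
        gray, 1, cv2.ADAPTIVE_THRESH_MEAN_C, cv2.THRESH_BINARY, 11, 0)
    # Expansion (dilation) repairs small breaks in the edge contours.
    repaired = cv2.dilate(binary, np.ones((3, 3), np.uint8))
    return repaired  # 1 = edge pixel, 0 = non-edge
```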
As described above, the edge pixels may include edge pixels of several objects on the image to be processed. After obtaining the edge pixels, and considering that the pixels in the semantic segmentation mask carry classification labels, the computer device may filter out the edge pixels that do not belong to the target object based on the semantic segmentation mask, leaving the target edge pixels corresponding to the target object.
In some embodiments, after performing expansion processing on the binarized black-and-white image to obtain a repaired black-and-white image, the computer device may determine, based on the semantic segmentation mask, whether each pixel with a pixel value of 1 on the repaired image belongs to the target object. If the pixel belongs to the target object, its pixel value is kept; if it does not, its pixel value is updated to 0, so that all pixels not belonging to the target object are converted to black. After this processing, a semantically filtered black-and-white image is obtained, and the pixels with a pixel value of 1 in it are taken as the target edge pixels.
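A short numpy sketch of this semantic filtering step, under the assumption that the repaired black-and-white image and the semantic segmentation mask are H×W arrays and that label 1 in the mask means "target object":

```python
import numpy as np

def filter_target_edges(edges, seg_mask):
    # An edge pixel (value 1) is kept only where the semantic segmentation
    # mask also labels the position as the target object; all other pixels
    # are set to 0, i.e. converted to black.
    return np.where(seg_mask == 1, edges, 0)
```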
S206, determining at least one candidate region corresponding to the target object based on the target edge pixels.
The target edge pixels comprise a plurality of pixels, a portion of which may form one closed contour, another portion another closed contour, and so on; the target edge pixels may thus form at least one closed contour. The interior of each closed contour may be filled to obtain at least one region, which may be taken as the at least one candidate region corresponding to the target object.
In some embodiments, after obtaining the semantically filtered black-and-white image in S204, when filling the interior of each closed contour, the computer device may update the pixel values of all pixels in the corresponding contour in the semantically filtered black-and-white image to 1, to obtain a filled black-and-white image, where the filled black-and-white image includes at least one connected domain with a pixel value of 1, and the connected domain with a pixel value of 1 may be used as at least one candidate region corresponding to the target object.
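The contour detection and filling can be sketched with OpenCV as below; the specific calls are an assumption, since the patent does not name a library. Each closed contour formed by the target edge pixels is filled from its edge inward, and each resulting connected domain of 1-pixels is one candidate region.

```python
import cv2
import numpy as np

def candidate_regions(target_edges):
    # target_edges: HxW array, 1 = target edge pixel, 0 = background.
    contours, _ = cv2.findContours(target_edges.astype(np.uint8),
                                   cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    filled = np.zeros_like(target_edges, dtype=np.uint8)
    # thickness=cv2.FILLED paints each contour's whole interior with 1.
    cv2.drawContours(filled, contours, -1, color=1, thickness=cv2.FILLED)
    # Each connected domain with a pixel value of 1 is one candidate region.
    num_labels, labels = cv2.connectedComponents(filled)
    return labels, num_labels - 1  # label 0 is the background
```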
S208, determining a target area belonging to the target object from at least one candidate area according to the first position information of the key point of the target object.
The position information may be coordinates. The regions in the at least one candidate region that do not belong to the target object can be filtered out based on the first position information of the key points of the target object, leaving the target region belonging to the target object.
In some embodiments, for each candidate region, the computer device may determine, based on the first location information of the keypoints of the target object, whether there are keypoints that fall into the candidate region, if there are keypoints that fall into the candidate region, take the candidate region as the target region, and if there are no keypoints that fall into the candidate region, determine that the candidate region is not the target region.
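Continuing the sketch above, the key point screening can be written as follows; `labels` is the candidate-region label map from the previous sketch, and `keypoints` is an assumed list of (x, y) first position information.

```python
import numpy as np

def first_segmentation_mask(labels, keypoints):
    # A candidate region is kept as a target region only if at least one
    # key point falls inside it.
    kept = {int(labels[int(y), int(x)]) for x, y in keypoints}
    kept.discard(0)  # a key point landing on the background selects nothing
    # Pixels of the target regions get the first value (1), all others the
    # second value (0): this is the first segmentation mask.
    return np.isin(labels, list(kept)).astype(np.uint8)
```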
S210, determining a first segmentation mask corresponding to the target object based on the target region, and determining a structuring result of the target object according to the first segmentation mask and the first position information of the key points.
In some embodiments, after obtaining the black-and-white image produced by the filling processing in S206, the computer device determines, for each candidate region on that image, whether the candidate region belongs to the target object based on the first position information of the key points of the target object. If the candidate region belongs to the target object, no processing is performed on it; if it does not belong to the target object, the pixel values of all pixels in the candidate region are updated to 0, thereby converting the candidate region not belonging to the target object to black. The image obtained after this update may be used as the first segmentation mask corresponding to the target object.
In some embodiments, after obtaining the first position information of the key points of the target object and the first segmentation mask corresponding to the target object, the computer device may use the first position information and the first segmentation mask as the structured result of the target object.
In some embodiments, on the one hand, the computer device may predict the position information of the key points of the target object in the image to be processed to obtain second position information of the key points, and after obtaining the first position information, may obtain the target position information of the key points based on the first position information and the second position information. On the other hand, the computer device may predict the segmentation mask corresponding to the target object in the image to be processed to obtain a second segmentation mask, and after obtaining the first segmentation mask, may obtain the target segmentation mask corresponding to the target object based on the first segmentation mask and the second segmentation mask. The target position information and the target segmentation mask may be used as the structured result of the target object.
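A sketch of this temporal fusion for one video frame is given below. The patent specifies that a homography transformation matrix characterizing the optical flow relates consecutive frames and that the first and second information are fused, but not the fusion rule; the equal-weight averaging and the 0.5 threshold here are assumptions.

```python
import cv2
import numpy as np

def fuse_with_previous_frame(prev_kpts, cur_kpts, prev_mask, cur_mask,
                             alpha=0.5):
    prev_kpts = np.float32(prev_kpts)
    cur_kpts = np.float32(cur_kpts)
    # Homography between the previous frame and the current frame, estimated
    # from the matched key points of the target object.
    H, _ = cv2.findHomography(prev_kpts, cur_kpts, cv2.RANSAC)
    # Second position information: previous key points warped into this frame.
    warped = cv2.perspectiveTransform(prev_kpts.reshape(-1, 1, 2), H)
    second_kpts = warped.reshape(-1, 2)
    # Target position information: blend of first and second information.
    target_kpts = alpha * cur_kpts + (1.0 - alpha) * second_kpts
    # Second segmentation mask: previous first segmentation mask warped by
    # the same homography.
    h, w = cur_mask.shape
    second_mask = cv2.warpPerspective(prev_mask, H, (w, h))
    # Target segmentation mask: average the two masks, then threshold.
    fused = (cur_mask.astype(np.float32) + second_mask.astype(np.float32)) / 2
    return target_kpts, (fused > 0.5).astype(np.uint8)
```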
According to the above method for structuring the target object, on the one hand, the target object in the image to be processed can be detected to obtain the semantic segmentation mask corresponding to the target object and the first position information of the key points of the target object. On the other hand, edge pixels in the image to be processed can be determined; because the pixels in the semantic segmentation mask carry classification labels, target edge pixels corresponding to the target object can be screened out from the edge pixels according to the semantic segmentation mask, and at least one candidate region corresponding to the target object can be determined based on the target edge pixels. In order to filter out regions that do not belong to the target object, the target region belonging to the target object can be determined from the at least one candidate region according to the first position information of the key points; a first segmentation mask corresponding to the target object is then determined based on the target region, and the structuring result of the target object is determined according to the first segmentation mask and the first position information of the key points, where the structuring result can be used for preset action driving or other filter applications. Through the process of screening the target edge pixels according to the semantic segmentation mask and the process of determining the target region according to the first position information of the key points, the resulting target region fits the target object more closely, the first segmentation mask obtained based on it is more accurate, and the accuracy of the structuring processing is improved.
In some embodiments, the step of detecting a target object in an image to be processed to obtain a semantic segmentation mask corresponding to the target object and first location information of a key point of the target object includes:
inputting the image to be processed into a multi-task neural network to obtain a key point thermodynamic diagram and a semantic feature map; searching for a maximum value on the key point thermodynamic diagram, and determining a mapping proportion according to the scale of the image to be processed and the scale of the key point thermodynamic diagram; determining first position information of a key point corresponding to the target object based on the maximum value and the mapping proportion; and up-sampling the semantic feature map to obtain a target semantic feature map matched with the scale of the image to be processed, and normalizing the target semantic feature map to obtain a semantic segmentation mask corresponding to the target object.
In some embodiments, as shown in fig. 3, the multi-task neural network may include a backbone convolutional network and two branch networks, one of which is a key point detection network and the other a semantic segmentation network. The output of the backbone convolutional network is the input of the two branch networks. The training process of the key point detection network may, for example, proceed as follows: a certain object is photographed to obtain a video stream corresponding to the object, each frame of the video stream is taken as a sample image, a key point thermodynamic diagram corresponding to each sample image is obtained, and one sample image together with its key point thermodynamic diagram forms one sample in a training set; after the training set is obtained, an initial network for detecting key points is trained on it to obtain the key point detection network. The training process of the semantic segmentation network may, for example, reuse the video stream obtained by photographing the object: similarly, each frame of the video stream is taken as a sample image, a semantic feature map corresponding to each sample image is obtained, and one sample image together with its semantic feature map forms one sample in a training set; after the training set is obtained, an initial network for obtaining a semantic segmentation mask is trained on it to obtain the semantic segmentation network.
It should be noted that: in the video stream corresponding to the object obtained by shooting a certain object, there may be image frames which do not contain the object, and these image frames may be used to form negative samples, so that the training set contains both positive samples and negative samples, and the discrimination capability of the key point detection network and the semantic segmentation network obtained by training on the object is stronger.
In some embodiments, with continued reference to fig. 3, the computer device may input the image to be processed into the multi-task neural network, obtain the feature layer through the processing of the backbone convolution network, and after inputting the feature layer into the keypoint detection network, the keypoint detection network outputs a keypoint thermodynamic diagram corresponding to the image to be processed. For example, assuming that the number of pre-configured keypoints is 17, there are 17 keypoint thermodynamic diagrams output by the keypoint detection network, and the first location information of the 17 keypoints on the image to be processed may be determined based on the 17 keypoint thermodynamic diagrams.
In some embodiments, when the scale of the image to be processed is H×W×3, where H represents the number of pixels in each column of the image to be processed, W represents the number of pixels in each row, and 3 represents that the image has R, G and B channels, the scale of the key point thermodynamic diagram may be (H/4)×(W/4)×17, where 17 is the number of pre-configured key points. The computer device may determine the mapping proportion based on the scale of the image to be processed and the scale of the key point thermodynamic diagram. For each of the 17 key point thermodynamic diagrams output by the key point detection network, a maximum value can be searched for on the diagram, the coordinates of the key point before mapping are determined based on the maximum value, and the first position information of the key point corresponding to the diagram is determined based on the pre-mapping coordinates and the mapping proportion. Performing this operation for all 17 key point thermodynamic diagrams output by the key point detection network yields the first position information of the 17 key points.
In some embodiments, with continued reference to fig. 3, the computer device may input the feature layer output by the backbone convolutional network into the semantic segmentation network, which outputs the semantic feature map corresponding to the image to be processed. It may be understood that the number of semantics of the image to be processed may be configured in advance, and the number of semantic feature maps output by the semantic segmentation network matches the number of pre-configured semantics. For example, assuming the number of preset semantics is 2, namely foreground and background, the semantic segmentation network outputs 2 semantic feature maps, and the semantic segmentation mask corresponding to the target object can be determined based on these 2 semantic feature maps.
In some embodiments, when the scale of the image to be processed is H×W×3, the scale of the semantic feature map may be (H/4)×(W/4)×2, where 2 represents the number of pre-configured semantics; the embodiments of the present application are described taking pre-configured semantics comprising foreground and background as an example. The 2 semantic feature maps output by the semantic segmentation network can be called the foreground feature map and the background feature map, respectively, where the value of each point in the foreground feature map represents the score of that point belonging to the foreground, and the value of each point in the background feature map represents the score of that point belonging to the background. After obtaining the foreground feature map, the computer device can up-sample it based on its scale and the scale of the image to be processed, so as to obtain a foreground map matched with the scale of the image to be processed. Similarly, after obtaining the background feature map, it can be up-sampled to obtain a background map matched with the scale of the image to be processed. The foreground map and the background map may serve as the target semantic feature map in the embodiments of the present application. The target semantic feature map may be normalized to obtain a foreground score feature map, in which the value of each point represents the confidence that the point belongs to the foreground. For each point on the foreground score feature map, it is determined whether its value is greater than a preset threshold: if so, the classification label of the point is determined to be the target object; if the value of the point is less than or equal to the preset threshold, the classification label of the point is determined not to be the target object. Based on the classification labels of all points, the semantic segmentation mask is obtained.
In some embodiments, after obtaining the target semantic feature map, the computer device may normalize the target semantic feature map by a softmax function to obtain a semantic segmentation mask corresponding to the target object.
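This decoding step can be sketched in a few lines of PyTorch; the bilinear upsampling mode, the channel order (channel 1 = foreground) and the 0.5 threshold are assumptions.

```python
import torch
import torch.nn.functional as F

def decode_segmentation_mask(semantics, image_hw, threshold=0.5):
    # semantics: (N, 2, H/4, W/4) feature maps for background and foreground.
    # Up-sample to the scale of the image to be processed ...
    upsampled = F.interpolate(semantics, size=image_hw, mode="bilinear",
                              align_corners=False)
    # ... then normalize with softmax into per-point confidence scores.
    probs = torch.softmax(upsampled, dim=1)
    foreground_score = probs[:, 1]
    # Points whose foreground score exceeds the preset threshold receive the
    # target-object classification label.
    return (foreground_score > threshold).to(torch.uint8)
```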
In the above embodiment, the keypoint thermodynamic diagram and the semantic feature diagram are firstly obtained through the multi-task neural network, then, on one hand, the semantic segmentation mask corresponding to the target object is obtained based on the semantic feature diagram, the semantic segmentation mask can be used for subsequently screening out the edge pixels which do not belong to the target object, and on the other hand, the first position information of the keypoints is determined based on the keypoint thermodynamic diagram, and the first position information can be used for subsequently screening out the candidate regions which do not belong to the target object, so that the finally obtained structuring result is more accurate.
In some embodiments, the step of determining the first location information of the keypoint corresponding to the target object based on the maximum value and the mapping ratio includes:
determining a symmetrical window on a key point thermodynamic diagram by taking the maximum value as the center; weighting the coordinates of each position in the symmetrical window based on the values of each position in the symmetrical window to obtain the coordinates of the key points before mapping; and determining first position information of the key points corresponding to the target object based on the key point coordinates before mapping and the mapping proportion.
For each key point thermodynamic diagram, a maximum value is searched for on the diagram, and a symmetric window centered on the maximum value is determined on the diagram. The symmetric window may be a 3×3 window centered on the maximum value or a 5×5 window centered on the maximum value; the embodiments of the present application are not limited in this respect.
In some embodiments, after determining the symmetric window, the computer device may weight the coordinates at each position within the window based on the values at those positions. For example, as shown in fig. 4, assume the scale of a certain key point thermodynamic diagram is 10×11 and its maximum value is 0.8; the symmetric window determined from this maximum is the 3×3 window centered on it, which, as shown in fig. 4, covers rows 4 to 6 and columns 7 to 9 of the diagram. Denoting the value of the thermodynamic diagram at row $i$, column $j$ by $h(i,j)$ and the window by $W$, the abscissa of the key point before mapping can be determined by the following formula:

\[ x_{\mathrm{pre}} = \frac{\sum_{(i,j)\in W} h(i,j)\, j}{\sum_{(i,j)\in W} h(i,j)} \]

and the ordinate of the key point before mapping can be determined by the following formula:

\[ y_{\mathrm{pre}} = \frac{\sum_{(i,j)\in W} h(i,j)\, i}{\sum_{(i,j)\in W} h(i,j)} \]
In some embodiments, after obtaining the abscissa of the pre-mapping key point, the computer device may obtain the abscissa of the key point on the image to be processed based on the pre-mapping abscissa and the mapping proportion; likewise, after obtaining the ordinate of the pre-mapping key point, it may obtain the ordinate of the key point on the image to be processed based on the pre-mapping ordinate and the mapping proportion. For example, when the scale of the image to be processed is H×W×3 and the scale of the key point thermodynamic diagram is (H/4)×(W/4)×17, the mapping proportion is 4: multiplying the abscissa of the pre-mapping key point by 4 gives the abscissa of the key point on the image to be processed, and multiplying the ordinate of the pre-mapping key point by 4 gives the ordinate of the key point on the image to be processed. The abscissa and ordinate of the key point on the image to be processed constitute the first position information of the key point.
It should be noted that: in the case where the number of preconfigured keypoints is 17, 17 keypoint thermodynamic diagrams may be obtained, and the above operation is performed for each keypoint thermodynamic diagram, and the first position information of 17 keypoints may be obtained.
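The whole decoding of one key point thermodynamic diagram can be sketched as follows, under the assumptions of a 3×3 symmetric window (half-window 1) and a mapping proportion of 4, matching the examples above.

```python
import numpy as np

def decode_keypoint(heatmap, half_window=1, ratio=4):
    # heatmap: one (H/4)x(W/4) key point thermodynamic diagram.
    r, c = np.unravel_index(np.argmax(heatmap), heatmap.shape)
    r0, r1 = max(r - half_window, 0), min(r + half_window + 1, heatmap.shape[0])
    c0, c1 = max(c - half_window, 0), min(c + half_window + 1, heatmap.shape[1])
    patch = heatmap[r0:r1, c0:c1]
    ys = np.arange(r0, r1, dtype=np.float64)[:, None]
    xs = np.arange(c0, c1, dtype=np.float64)[None, :]
    # Pre-mapping coordinates: window coordinates weighted by heatmap values.
    x_pre = float((patch * xs).sum() / patch.sum())
    y_pre = float((patch * ys).sum() / patch.sum())
    # First position information: scale back up by the mapping proportion.
    return x_pre * ratio, y_pre * ratio
```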
In the above embodiment, for each key point thermodynamic diagram, a maximum value is found on the key point thermodynamic diagram, then the maximum value is taken as the center, a symmetrical window is determined on the key point thermodynamic diagram, and based on the values at each position in the symmetrical window, the coordinates at each position in the symmetrical window are weighted to obtain the coordinates of the key point before mapping.
In some embodiments, the step of determining edge pixels in the image to be processed comprises:
for each pixel on the image to be processed, determining a binarization threshold value corresponding to the current pixel based on the pixel value of the current pixel and the pixel values of surrounding pixels of the current pixel; updating the pixel value of the current pixel to a first value under the condition that the pixel value of the current pixel is larger than a binarization threshold value; updating the pixel value of the current pixel to a second value if the pixel value of the current pixel is less than or equal to the binarization threshold, wherein the second value is different from the first value; and taking the pixel with the pixel value of the first value as an edge pixel in the image to be processed.
The first value and the second value may be flexibly set according to actual situations, so long as the first value and the second value are different, and the first value may be 1, the second value may be 0, or the first value may be 0, and the second value may be 1.
The surrounding pixels of the current pixel may be at least one circle of pixels surrounding the current pixel, and exemplary, the surrounding pixels of the current pixel may be one circle of pixels surrounding the current pixel or two circles of pixels surrounding the current pixel.
In some embodiments, after obtaining the pixel value of the current pixel and the pixel values of the surrounding pixels of the current pixel, the computer device may calculate an average value of the pixel values of the current pixel and the pixel values of the surrounding pixels, and use the average value as the binarization threshold corresponding to the current pixel.
In some embodiments, the computer device may calculate a distance between each of the surrounding pixels and the current pixel, calculate a weighted average of the pixel value of the current pixel and the pixel value of the surrounding pixels according to the distance between each of the pixels and the current pixel, and use the weighted average as the binarization threshold corresponding to the current pixel.
The following is illustrative:
assuming that the first value is 1 and the second value is 0, for each pixel on the image to be processed, calculating the average value of the pixel value of the current pixel and the pixel values of the surrounding pixels of the current pixel, comparing the pixel value of the current pixel with the average value, if the pixel value of the current pixel is greater than the average value, updating the pixel value of the current pixel to be 1, if the pixel value of the current pixel is less than or equal to the average value, updating the pixel value of the current pixel to be 0, and performing the above updating processing on all the pixels on the image to be processed, so as to obtain a black-and-white image after the binarization processing, wherein the pixel with the pixel value of 1 in the black-and-white image is used as an edge pixel as shown in fig. 5.
In the above embodiment, in the binarizing process of the image to be processed, the binarization threshold corresponding to the current pixel is determined based on the pixel value of the current pixel and the pixel values of the surrounding pixels of the current pixel; that is, each pixel uses an adaptive threshold rather than a single fixed threshold.
In some embodiments, the step of screening the edge pixels of the object corresponding to the object from the edge pixels according to the semantic segmentation mask comprises:
traversing each edge pixel, and for the current edge pixel traversed currently, searching a classification label corresponding to the current edge pixel from the semantic segmentation mask; and under the condition that the classification label is a target object, taking the current edge pixel as the target edge pixel until all edge pixels are traversed, and obtaining the target edge pixel corresponding to the target object.
In some embodiments, when the scale of the image to be processed is H×W×3, the scale of the semantic feature map may be (H/4)×(W/4)×2, and the scale of the semantic segmentation mask obtained from the semantic feature map by up-sampling, normalization and the like is H×W. Each point on the semantic segmentation mask carries a classification label; after determining the edge pixels in the image to be processed, the computer device screens the edge pixels through the semantic segmentation mask to obtain the target edge pixels corresponding to the target object.
For example, referring to fig. 6, after the image to be processed is binarized, the obtained edge pixels are the pixels with a pixel value of 1 in the left image of fig. 6. For each such pixel, the classification label corresponding to the pixel is looked up in the semantic segmentation mask; if the label is the target object, the pixel value is kept, and if it is not, the pixel value is updated to 0. Doing this for all pixels with a pixel value of 1 in the left image of fig. 6 yields the right image of fig. 6, in which the pixels with a pixel value of 1 are the target edge pixels corresponding to the target object.
In the above embodiment, considering that there may be multiple objects in the image to be processed, the edge pixels obtained through binarization may include edge pixels of several objects. To screen out the target edge pixels of the target object, the classification label of each edge pixel is looked up in the semantic segmentation mask: if the label is the target object, the edge pixel is taken as an edge pixel of the target object; otherwise the edge pixel is filtered out. In this way the edge pixels that do not belong to the target object can be removed, making the segmentation result for the target object more accurate.
In some embodiments, the step of determining at least one candidate region corresponding to the target object based on the target edge pixels comprises:
determining the respective corresponding outline of each part of the target object based on the target edge pixels; for each contour, filling from outside to inside from the edge of the contour, and obtaining at least one candidate region corresponding to the target object.
In some embodiments, the target edge pixels that can enclose a region can be connected to obtain a contour; in this way, at least one contour can be obtained, and the at least one contour can be used as the contours corresponding to the respective parts of the target object.
In some embodiments, after the contours corresponding to the respective parts of the target object are obtained, the computer device may, for each contour, fill inward from the contour edge, updating the pixel values to the same value as the target edge pixels, so as to obtain at least one connected domain; the connected domains may be used as the candidate regions corresponding to the target object.
For example, referring to fig. 7, the target edge pixels corresponding to the target object are pixels with a pixel value of 1 in fig. 7, the target edge pixels capable of surrounding one region in fig. 7 may be connected to obtain 5 contours, for each contour, the pixel values of the pixels may be sequentially updated to 1 from outside to inside from the edge of the contour, so as to obtain 5 connected regions, and as shown in fig. 8, the 5 connected regions may be used as 5 candidate regions corresponding to the target object.
In the above embodiment, considering that the target object often appears as several separate regions on the image, after the target edge pixels corresponding to the target object are obtained, the contours corresponding to the respective parts of the target object are first determined based on the target edge pixels; then each contour is filled from its edge inward, yielding at least one candidate region corresponding to the target object. The candidate regions fit the target object more closely, which improves the accuracy of the resulting segmentation of the target object.
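Assuming target_edges is the 0/1 map of target edge pixels from the previous sketch, the contour determination and outside-to-inside filling could look as follows with OpenCV (the retrieval mode and names are illustrative):

import cv2
import numpy as np

def fill_candidate_regions(target_edges: np.ndarray) -> list:
    """Connect target edge pixels into closed contours and fill each one,
    returning one 0/1 candidate-region mask per contour."""
    contours, _ = cv2.findContours(
        target_edges.astype(np.uint8),
        cv2.RETR_EXTERNAL,        # one outer contour per part of the object
        cv2.CHAIN_APPROX_SIMPLE,
    )
    candidates = []
    for contour in contours:
        region = np.zeros_like(target_edges, dtype=np.uint8)
        # thickness=cv2.FILLED paints the whole interior, playing the role
        # of the outside-to-inside filling described above.
        cv2.drawContours(region, [contour], -1, color=1, thickness=cv2.FILLED)
        candidates.append(region)
    return candidates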
In some embodiments, the step of determining a target region belonging to the target object from the at least one candidate region according to the keypoints of the target object comprises:
for each candidate region in the at least one candidate region, determining whether the candidate region contains a key point according to the first position information of the key point of the target object; in the case where the candidate region contains a key point, the candidate region is determined as a target region.
In some embodiments, after obtaining at least one candidate region corresponding to the target object, the computer device considers that some of the candidate regions may not belong to the target object. The embodiments of the present application therefore propose using the position information of the key points to screen the candidate regions. Specifically, for each candidate region, the coordinates of all pixels in the region may be obtained and matched against the position information of the key points; if the coordinates of some pixel successfully match the position information of a key point, it is determined that the candidate region contains that key point, and the region is taken as a target region. Judging all candidate regions in this way yields the target regions belonging to the target object.
In the above embodiment, after at least one candidate region corresponding to the target object is obtained, considering that some of the candidate regions may not belong to the target object, the embodiment of the present application proposes screening the candidate regions with the position information of the key points, so that the target regions obtained by screening are more accurate, and the segmentation mask obtained based on the target regions is also more accurate.
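A sketch of this keypoint-based screening, assuming candidates is the list of 0/1 region masks from the previous step and keypoints is an iterable of (x, y) pixel coordinates inside the image bounds (both representations are assumptions made for illustration):

def screen_by_keypoints(candidates, keypoints):
    """Keep a candidate region only if at least one key point falls inside it."""
    target_regions = []
    for region in candidates:
        # A key point "matches" the region when the mask is 1 at its coordinates.
        if any(region[int(y), int(x)] == 1 for x, y in keypoints):
            target_regions.append(region)
    return target_regions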
In some embodiments, the step of determining a first segmentation mask corresponding to the target object based on the target region comprises:
and setting the pixel values of all pixels in the target area as a first value, and setting the pixel values of all pixels in the area except the target area in the image to be processed as a second value, so as to obtain a first segmentation mask corresponding to the target object.
Wherein the first value may be 1 and the second value may be 0.
For example, assume that the coordinates of 17 key points are obtained by the scheme of the foregoing embodiments and that the candidate regions are as shown in fig. 8, where each connected region formed by pixels with a pixel value of 1 is one candidate region. For each candidate region, the coordinates of all pixels in the region are obtained and matched against the coordinates of the 1st of the 17 key points; if the matching succeeds, the candidate region is determined to contain that key point; if it fails, the pixel coordinates are further matched against the coordinates of the 2nd key point, and so on. If the coordinates of some pixel successfully match the coordinates of some key point, the candidate region is determined to contain that key point and is taken as a target region, and the pixel values of all its pixels are kept; if no pixel matches any key point, the candidate region is determined not to be a target region, and the pixel values of all pixels corresponding to it are updated to 0. Performing this processing on all candidate regions yields the result shown in fig. 9, in which the connected region formed by pixels with a pixel value of 1 is the target region.
In the above embodiment, the position information of the key points is used to screen the candidate regions, so that the target regions obtained by screening are more accurate, and the segmentation mask obtained based on the target regions is also more accurate.
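Continuing the sketch, the first segmentation mask can then be assembled as the union of the target regions, with the first value 1 inside them and the second value 0 everywhere else (shapes and names remain illustrative):

import numpy as np

def build_first_mask(target_regions, shape):
    """Union the target regions into the first segmentation mask."""
    mask = np.zeros(shape, dtype=np.uint8)  # second value (0) everywhere
    for region in target_regions:
        mask[region == 1] = 1               # first value (1) inside target regions
    return mask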
In some embodiments, the image to be processed is one frame of image in the video stream, and the step of determining the structural result of the target object according to the first segmentation mask and the first position information of the key point includes:
acquiring a homography transformation matrix corresponding to an image to be processed, wherein the homography transformation matrix characterizes optical flow information of a video stream; determining second position information of key points of a target object in the image to be processed according to the first position information of the key points in the previous frame of the image to be processed and the homography transformation matrix; determining target position information of key points of a target object in an image to be processed according to first position information and second position information corresponding to the key points of the target object in the image to be processed; determining a second segmentation mask corresponding to a target object in the image to be processed according to a first segmentation mask corresponding to a previous frame of the image to be processed and the homography transformation matrix; determining a target segmentation mask corresponding to a target object in the image to be processed according to a first segmentation mask and a second segmentation mask corresponding to the target object in the image to be processed; the target position information of the key points of the target object and the target segmentation mask are used to construct a structured result of the target object.
The computer device may extract two adjacent frames of images from the video stream, and may acquire location information of a key point of the target object on each frame of image by using the method provided in the foregoing embodiment, and determine the homography transformation matrix based on the location information of the key point of the target object on the two frames of images.
The previous frame of the image to be processed may be any frame before the image to be processed; for example, it may be the frame immediately preceding the image to be processed, or an even earlier frame.
In some embodiments, the computer device may determine the first position information of the key points of the target object in the previous frame by the method provided in the foregoing embodiments, multiply the first position information of each key point by the homography transformation matrix to obtain the converted position information, and use the converted position information as the second position information of the key point of the target object in the image to be processed. For example, assuming that the position information of 17 key points in the previous frame is obtained by the method provided in the foregoing embodiments, the position information of each of the 17 key points is multiplied by the homography transformation matrix to obtain its converted position information, which can be used as the second position information of that key point on the image to be processed. Processing the position information of the 17 key points in the previous frame in this way yields the second position information of the 17 key points of the target object in the image to be processed.
In some embodiments, after obtaining the first position information and the second position information corresponding to the key point of the target object in the image to be processed, the computer device may perform linear filtering addition on the first position information and the second position information to obtain the target position information of the key point of the target object in the image to be processed.
Taking 17 key points as an example: after the first position information of the 17 key points in the previous frame is obtained, consider one of them, the current key point, for convenience of explanation. The first position information of the current key point is converted through the homography transformation matrix to obtain the converted position information, namely the second position information. After the first position information of the 17 key points on the image to be processed is obtained through the method provided by the foregoing embodiments, the first position information of the current key point is looked up among them, and the target position information of the current key point is calculated through the following formula:

P_{t+1} = α * P'_{t+1} + (1 - α) * P_{t+1}

wherein P_{t+1} on the left side is the target position information of the current key point, α is the weight of the second position information obtained from the previous frame, with a value range of (0, 1), P'_{t+1} is the second position information, and P_{t+1} on the right side is the first position information. This is done for all 17 key points, so that the target position information of the 17 key points can be obtained.
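A sketch of this fusion step, assuming the homography transformation matrix is a 3x3 NumPy array and the key points are given as (N, 2) arrays of (x, y) coordinates; the weight is an illustrative value within (0, 1):

import numpy as np

ALPHA = 0.5  # illustrative weight of the prediction from the previous frame

def fuse_keypoints(homography, prev_points, detected_points):
    """Project the previous frame's key points through the homography to get
    the second position information, then blend it with the detected first
    position information: P_{t+1} = α * P'_{t+1} + (1 - α) * P_{t+1}."""
    ones = np.ones((prev_points.shape[0], 1))
    projected = np.hstack([prev_points, ones]) @ homography.T
    second = projected[:, :2] / projected[:, 2:3]  # back from homogeneous coords
    return ALPHA * second + (1 - ALPHA) * detected_points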
In some embodiments, the computer device may determine the first segmentation mask corresponding to the previous frame by the method provided in the foregoing embodiments, and convert it through the homography transformation matrix to obtain the converted segmentation mask, namely the second segmentation mask. After the first segmentation mask corresponding to the target object in the image to be processed is obtained, it may be combined with the second segmentation mask by linear filtering addition to obtain the target segmentation mask corresponding to the target object in the image to be processed.
By way of example, the target segmentation mask may be calculated by the following formula:
Q_{t+1} = β * Q'_{t+1} + (1 - β) * Q_{t+1}

wherein Q_{t+1} on the left side is the target segmentation mask, β is the weight of the second segmentation mask, with a value range of (0, 1), Q'_{t+1} is the second segmentation mask corresponding to the target object in the image to be processed, and Q_{t+1} on the right side is the first segmentation mask corresponding to the target object in the image to be processed.
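The mask fusion admits an analogous sketch: the previous frame's first segmentation mask is warped into the current frame with the homography before blending (the weight is again an illustrative value in (0, 1)):

import cv2
import numpy as np

BETA = 0.5  # illustrative weight of the warped previous-frame mask

def fuse_masks(homography, prev_mask, current_mask):
    """Warp the previous frame's mask with the homography to obtain the second
    segmentation mask, then blend: Q_{t+1} = β * Q'_{t+1} + (1 - β) * Q_{t+1}."""
    h, w = current_mask.shape
    second = cv2.warpPerspective(prev_mask.astype(np.float32), homography, (w, h))
    return BETA * second + (1 - BETA) * current_mask.astype(np.float32)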
In some embodiments, after obtaining the target position information of the key point of the target object and the target segmentation mask corresponding to the target object in the image to be processed, the computer device may use the target position information and the target segmentation mask as the structured result of the target object.
In the above embodiment, considering that pixels at the same position on two consecutive frames may jitter in a video-stream scene, the embodiment proposes the following. On the one hand, the position information of the key points of the target object is predicted from the previous frame of the image to be processed to obtain the second position information of the key points in the image to be processed, which is then combined with the first position information determined in the foregoing embodiments to determine the final target position information; fusing the prediction from the previous frame makes the target position information more stable. On the other hand, the segmentation mask corresponding to the target object is predicted from the previous frame to obtain the second segmentation mask corresponding to the target object in the image to be processed, which is then combined with the first segmentation mask obtained through the foregoing embodiments to determine the final target segmentation mask.
In some embodiments, the step of obtaining a homography transformation matrix corresponding to the image to be processed includes:
acquiring a first image and a second image from a video stream, wherein the first image is a previous frame image of the second image; detecting a target object in the first image to obtain a key point corresponding to the target object on the first image; detecting a target object in the second image to obtain a key point corresponding to the target object on the second image; and determining a homography transformation matrix based on the key points corresponding to the target object on the first image and the key points corresponding to the target object on the second image.
The first image may be the frame immediately preceding the second image, or an even earlier frame, which is not limited in the embodiments of the present application.
In some embodiments, the computer device may obtain the key points corresponding to the target object on the first image and on the second image by the method provided in the foregoing embodiments (the specific process is not repeated here). After the key points on both images are obtained, the RANSAC algorithm may be used to calculate the homography transformation matrix.
For example, assuming that the number of key points is 17, the coordinates of 17 key points corresponding to the target object on the first image and the coordinates of 17 key points corresponding to the target object on the second image are obtained by the method provided in the foregoing embodiment, and the homography transformation matrix is calculated using the RANSAC algorithm based on the coordinates of 17 key points on the two-frame image.
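OpenCV exposes a RANSAC-based homography estimator directly; a sketch under the assumption that the matched key points arrive as (N, 2) float arrays with N >= 4 (the reprojection threshold is an illustrative setting):

import cv2
import numpy as np

def estimate_homography(points_first, points_second):
    """Estimate the homography between two frames from matched key points
    (e.g. the 17 key points detected on each frame) using RANSAC."""
    homography, inlier_mask = cv2.findHomography(
        np.asarray(points_first, dtype=np.float32),
        np.asarray(points_second, dtype=np.float32),
        method=cv2.RANSAC,
        ransacReprojThreshold=3.0,  # illustrative inlier threshold in pixels
    )
    return homography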
In the above embodiment, two frames are taken from the video stream and detected separately to obtain the key points on each frame, and the homography transformation matrix is determined based on those key points. The matrix can then be used for the subsequent prediction of position information and of the segmentation mask, so that the final target position information and target segmentation mask are more accurate.
In some embodiments, the target object may be an avatar object, and after obtaining the target segmentation mask corresponding to the target object, the method provided by the embodiment of the present application further includes:
extracting an avatar corresponding to the target object from the image to be processed based on the target segmentation mask; determining the motion trail of each part of the virtual image based on a preset driving action and according to the target position information of the key point of the target object; and obtaining a video stream matched with the preset driving action based on the motion trail of each part of the virtual image.
The target object may be an object having an anthropomorphic form in the image to be processed. The avatar corresponding to the target object may include a head, an arm, a leg, a foot, etc. The keypoints of the target object may include head keypoints, arm keypoints, leg keypoints, foot keypoints, and the like.
The preset driving action may be any action that is desired to be performed by the avatar, for example, jumping, turning, nodding, waving, etc., which is not limited in the embodiment of the present application.
In some embodiments, since the pixel value corresponding to the target object and the pixel value of the background are different in the target segmentation mask, the computer device may determine the contour of the target object based on the pixel value of the target object, and extract the avatar corresponding to the target object from the image to be processed based on the contour of the target object.
In some embodiments, the computer device may determine the movement trace of each part of the avatar based on the preset driving action and the target position information of the key points corresponding to each part of the target object. The computer device may configure the number of frames of the video stream to be generated in advance, determine the form of each part of the avatar in each frame based on the number of frames after determining the motion trail of each part of the avatar, and play all frames in a specific order to obtain the video stream matched with the preset driving action.
For example, if the preset driving action is nodding, only the head of each part of the avatar moves, and other parts are not moved, and assuming that the number of pre-configured frames is 50 frames, the form of the head in each frame can be obtained based on the motion track of the head, and the 50 frames are played in sequence, so that the video stream of the nodding action can be obtained.
In the above embodiment, after the target segmentation mask is obtained, the avatar corresponding to the target object may be extracted from the image to be processed based on the target segmentation mask; further, based on a preset driving action and according to target position information of key points of a target object, determining a motion track of each part of the virtual image; and finally, obtaining a video stream matched with the preset driving action based on the motion trail of each part of the virtual image. The user can design the action for the target object at will, and the target object can execute the action, so that the control feeling of the user on the target object is improved, and the interestingness is enhanced.
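As a rough sketch of the extraction step only (driving and rendering the avatar are beyond its scope), the avatar can be cut out of the image to be processed with the target segmentation mask; returning a BGRA image with a transparent background is an assumption of this illustration:

import cv2
import numpy as np

def extract_avatar(image, target_mask):
    """Keep only the pixels covered by the target segmentation mask and add
    an alpha channel so the background becomes transparent."""
    binary = (target_mask > 0).astype(np.uint8)
    bgr = cv2.bitwise_and(image, image, mask=binary)
    alpha = binary * 255
    return np.dstack([bgr, alpha])  # BGRA cut-out of the avatar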
In some embodiments, a target area corresponding to a preset image acquisition frame can be determined on an image acquired by a camera; based on a target area corresponding to a preset image acquisition frame, capturing an image in the target area from the image acquired by the camera, and taking the image in the target area as an image to be processed.
The preset image acquisition frame can be displayed on a shooting page, so that a user can adjust an object falling into the preset image acquisition frame based on the actual shooting condition.
In some embodiments, coordinates of a preset image acquisition frame are known, after the camera acquires an image, the computer device may search a corresponding area on the acquired image based on the coordinates of the preset image acquisition frame, take the searched area as a target area, further intercept a portion of the acquired image corresponding to the target area, and take the intercepted image as an image to be processed in the embodiment of the present application.
In some embodiments, after the computer device obtains the target segmentation mask of the image to be processed by the method provided in the foregoing embodiments, it may judge, based on the position of the target object in the target segmentation mask, whether the target object is about to overflow the preset image acquisition frame. If so, the direction in which the target object is drifting is obtained, and the position of the preset image acquisition frame is adjusted based on that direction, so that the acquisition frame keeps tracking the target object.
In the above embodiment, an image acquisition frame is provided on the page used for capturing images, so that the user shoots the target object inside the frame, and the structuring processing then operates on the image inside the frame. Compared with using the whole captured image as the image to be processed, the image inside the acquisition frame contains less background, which avoids wasting processing resources on binarizing, dilating and filling an excessive background.
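A minimal sketch of the cropping step, assuming the preset image acquisition frame is given as an (x, y, w, h) box in the coordinates of the captured image (this representation is an assumption, not the patent's data structure):

def crop_acquisition_frame(image, frame_box):
    """Crop the captured image to the preset image acquisition frame; the
    cropped patch is used as the image to be processed."""
    x, y, w, h = frame_box
    return image[y:y + h, x:x + w]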
In some embodiments, a method for structuring a target object is provided, as shown in fig. 10, the method comprising:
s10, inputting the image to be processed into a multi-task neural network to obtain a key point thermodynamic diagram and a semantic feature diagram.
S11, searching the maximum value on the key point thermodynamic diagram, and determining the mapping proportion according to the scale of the image to be processed and the scale of the key point thermodynamic diagram.
S12, centering on the maximum value, determining a symmetrical window on the key point thermodynamic diagram.
And S13, weighting the coordinates at each position in the symmetrical window based on the values at each position in the symmetrical window to obtain the coordinates of the key points before mapping.
S14, determining first position information of the key points corresponding to the target object based on the key point coordinates before mapping and the mapping proportion.
S15, up-sampling the semantic feature images to obtain target semantic feature images matched with the scales of the images to be processed, and normalizing the target semantic feature images to obtain semantic segmentation masks corresponding to the target objects.
S16, for each pixel on the image to be processed, determining a binarization threshold value corresponding to the current pixel based on the pixel value of the current pixel and the pixel values of surrounding pixels of the current pixel.
S17, updating the pixel value of the current pixel to a first value under the condition that the pixel value of the current pixel is larger than the binarization threshold value.
S18, updating the pixel value of the current pixel to a second value under the condition that the pixel value of the current pixel is smaller than or equal to the binarization threshold value, wherein the second value is different from the first value.
S19, taking the pixel with the pixel value of the first value as an edge pixel in the image to be processed.
S20, traversing each edge pixel, and searching a classification label corresponding to the current edge pixel from the semantic segmentation mask for the current edge pixel traversed currently.
S21, under the condition that the classification label is a target object, taking the current edge pixel as the target edge pixel until all edge pixels are traversed, and obtaining the target edge pixel corresponding to the target object.
S22, determining the corresponding outline of each part of the target object based on the target edge pixels.
S23, for each contour, filling from outside to inside from the edge of the contour to obtain at least one candidate region corresponding to the target object.
S24, for each candidate region in the at least one candidate region, determining whether the candidate region contains the key point according to the first position information of the key point of the target object.
S25, determining the candidate region as a target region when the candidate region contains the key point.
S26, setting the pixel values of all pixels in the target area as a first value, and setting the pixel values of all pixels in the area except the target area in the image to be processed as a second value, so as to obtain a first segmentation mask corresponding to the target object.
S27, acquiring a first image and a second image from the video stream, wherein the first image is a previous frame image of the second image.
S28, detecting the target object in the first image to obtain a key point corresponding to the target object on the first image.
S29, detecting the target object in the second image to obtain a key point corresponding to the target object on the second image.
S30, determining a homography transformation matrix based on the key points corresponding to the target object on the first image and the key points corresponding to the target object on the second image, wherein the homography transformation matrix represents the optical flow information of the video stream.
S31, determining second position information of the key points of the target object in the image to be processed according to the first position information of the key points in the previous frame of the image to be processed and the homography transformation matrix.
S32, determining target position information of the key points of the target object in the image to be processed according to the first position information and the second position information corresponding to the key points of the target object in the image to be processed.
S33, determining a second segmentation mask corresponding to the target object in the image to be processed according to the first segmentation mask and the homography transformation matrix corresponding to the previous frame of the image to be processed.
S34, determining a target segmentation mask corresponding to the target object in the image to be processed according to the first segmentation mask and the second segmentation mask corresponding to the target object in the image to be processed; the target position information of the key points of the target object and the target segmentation mask are used to construct a structured result of the target object.
S35, extracting the virtual image corresponding to the target object from the image to be processed based on the target segmentation mask.
S36, determining the motion trail of each part of the virtual image based on the preset driving action and according to the target position information of the key points of the target object.
S37, obtaining a video stream matched with the preset driving action based on the motion trail of each part of the virtual image.
According to the method for structuring the target object, on one hand, the target object in the image to be processed can be detected to obtain the semantic segmentation mask corresponding to the target object and the first position information of the key points of the target object. On the other hand, the edge pixels in the image to be processed can be determined; because the pixels in the semantic segmentation mask carry classification labels, the target edge pixels corresponding to the target object can be screened out from the edge pixels according to the semantic segmentation mask, and at least one candidate region corresponding to the target object can be determined based on the target edge pixels. To filter out regions that do not belong to the target object, the target region belonging to the target object can be determined from the candidate regions according to the first position information of the key points; the first segmentation mask corresponding to the target object is then determined based on the target region, and the structuring result of the target object is determined according to the first segmentation mask and the first position information of the key points. The structuring result can be used for preset action driving or other filter applications. Through the screening of the target edge pixels by the semantic segmentation mask and the screening of the candidate regions by the first position information of the key points, the final target region conforms better to the target object, the first segmentation mask obtained based on the target region is more accurate, and the accuracy of the structuring processing is improved.
In one possible scenario, as shown in fig. 11, the image to be processed is a frame of a video stream and the target object in it is a cartoon character. The image to be processed containing the cartoon character is input into the multi-task neural network, which outputs a key point thermodynamic diagram, from which the first position information of the key points of the cartoon character on the image to be processed can be obtained. The multi-task neural network also outputs a semantic feature map, from which the semantic segmentation mask corresponding to the cartoon character can be obtained.

On the other hand, the image to be processed is binarized to obtain a black-and-white image. Because of shooting angle and image-quality problems, the edge contour of the cartoon character may be unclear, so there may be breaks between the pixels with a pixel value of 1 in the binarized black-and-white image; the black-and-white image may therefore be dilated to repair the breaks, and the pixels with a pixel value of 1 in the repaired black-and-white image are taken as the edge pixels. After the edge pixels are obtained, the edge pixels that do not belong to the cartoon character can be filtered out based on the semantic segmentation mask, leaving the target edge pixels corresponding to the cartoon character. Among the target edge pixels, one portion may form one closed contour, another portion may form another closed contour, and so on, so the target edge pixels may form at least one closed contour. The interior of each closed contour may be filled to obtain at least one region, and these regions can be regarded as the candidate regions corresponding to the cartoon character.

For each candidate region, whether a key point falls into the region can be judged based on the first position information of the key points of the cartoon character: if a key point falls into the region, the region is taken as a target region; otherwise it is determined not to be a target region. A target region is left untouched, while for a region that is not a target region, the pixel values of all its pixels are updated to 0, so that the region not belonging to the cartoon character turns black; this yields the first segmentation mask corresponding to the cartoon character. The first position information of the key points of the cartoon character on the image to be processed and the first segmentation mask corresponding to the cartoon character can be taken as the structuring result of the cartoon character. Alternatively, after the first segmentation mask is obtained, the contour of the cartoon character can be calculated on the first segmentation mask, and the first position information of the key points together with the contour of the cartoon character can be taken as the structuring result.
Considering that pixels at the same position on two consecutive frames may jitter in a video-stream scene, the position information can be predicted from the previous frame of the image to be processed. As shown in fig. 12, t+1 is the acquisition time corresponding to the image to be processed and t is the acquisition time corresponding to the previous frame. The first position information of each key point in the previous frame can be determined by the method provided in the foregoing embodiments, and the position of each key point in the image to be processed is then predicted based on it; fig. 12 shows an example of such a prediction result. After the first position information of the key points of the cartoon character on the image to be processed is obtained, it can be fused with the prediction result to obtain the target position information. On the other hand, the segmentation mask of the cartoon character in the image to be processed can be predicted based on the first segmentation mask of the cartoon character in the previous frame, and the first segmentation mask of the cartoon character in the image to be processed can then be fused with this prediction to obtain the target segmentation mask. The target position information and the target segmentation mask may be taken as the structuring result of the cartoon character. Fig. 13 is an example of the result of structuring a cartoon character with the method provided by an embodiment of the present application. After the structuring result is obtained, the cartoon character can be extracted from the image to be processed based on it, or a driving action can be designed for the cartoon character so that the character moves according to the action, which improves the user's sense of control over the cartoon character and enhances the interest.
It should be understood that, although the steps in the flowcharts involved in the above embodiments are shown in sequence as indicated by the arrows, they are not necessarily performed in that order. Unless explicitly stated herein, the execution order of the steps is not strictly limited, and the steps may be performed in other orders. Moreover, at least a part of the steps in those flowcharts may include several sub-steps or stages, which are not necessarily performed at the same moment but may be performed at different moments, and their execution order is not necessarily sequential; they may be performed in turns or alternately with at least a part of the other steps, sub-steps or stages.
Based on the same inventive concept, the embodiment of the application also provides a device for structuring the target object, which is used for realizing the method for structuring the target object. The implementation of the solution provided by the device is similar to the implementation described in the above method, so the specific limitation in one or more embodiments of the device for structuring the target object provided below may refer to the limitation in the method for structuring the target object described above, and will not be repeated here.
In some embodiments, as shown in fig. 14, there is provided an apparatus for performing a structuring process on a target object, including:
the detection module 141 is configured to detect a target object in the image to be processed, and obtain a semantic segmentation mask corresponding to the target object and first location information of a key point of the target object.
The first filtering module 142 is configured to determine edge pixels in the image to be processed, and filter target edge pixels corresponding to the target object from the edge pixels according to the semantic segmentation mask.
A second filtering module 143, configured to determine at least one candidate region corresponding to the target object based on the target edge pixel; and determining a target area belonging to the target object from at least one candidate area according to the first position information of the key point of the target object.
The determining module 144 is configured to determine a first segmentation mask corresponding to the target object based on the target region, and determine a structured result of the target object according to the first segmentation mask and the first location information of the key point.
In some embodiments, the detection module 141 is specifically configured to: inputting the image to be processed into a multi-task neural network to obtain a key point thermodynamic diagram and a semantic feature diagram; searching a maximum value on the key point thermodynamic diagram, and determining a mapping proportion according to the scale of the image to be processed and the scale of the key point thermodynamic diagram; determining first position information of a key point corresponding to the target object based on the maximum value and the mapping proportion; and up-sampling the semantic feature images to obtain target semantic feature images matched with the scales of the images to be processed, and normalizing the target semantic feature images to obtain semantic segmentation masks corresponding to the target objects.
In some embodiments, the detection module 141 is specifically configured to: determining a symmetrical window on a key point thermodynamic diagram by taking the maximum value as the center; weighting the coordinates of each position in the symmetrical window based on the values of each position in the symmetrical window to obtain the coordinates of the key points before mapping; and determining first position information of the key points corresponding to the target object based on the key point coordinates before mapping and the mapping proportion.
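A sketch of the weighted-window decoding performed here, assuming heatmap is a single-channel NumPy array at 1/4 the scale of the image to be processed, so the mapping proportion is a stride of 4; the stride and window half-size are illustrative:

import numpy as np

def decode_keypoint(heatmap: np.ndarray, stride: int = 4, half_window: int = 2):
    """Find the heatmap maximum, take a symmetric window around it, compute
    the value-weighted average of the window coordinates, and map the result
    back to the scale of the image to be processed."""
    y_max, x_max = np.unravel_index(np.argmax(heatmap), heatmap.shape)

    y0 = max(0, y_max - half_window)
    y1 = min(heatmap.shape[0], y_max + half_window + 1)
    x0 = max(0, x_max - half_window)
    x1 = min(heatmap.shape[1], x_max + half_window + 1)

    window = heatmap[y0:y1, x0:x1]
    ys, xs = np.mgrid[y0:y1, x0:x1]
    total = window.sum()

    # Pre-mapping key point coordinates: value-weighted average of the window.
    y_refined = (ys * window).sum() / total
    x_refined = (xs * window).sum() / total

    # Apply the mapping proportion to return to the image scale.
    return x_refined * stride, y_refined * stride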
In some embodiments, the first screening module 142 is specifically configured to: for each pixel on the image to be processed, determining a binarization threshold value corresponding to the current pixel based on the pixel value of the current pixel and the pixel values of surrounding pixels of the current pixel; updating the pixel value of the current pixel to a first value under the condition that the pixel value of the current pixel is larger than a binarization threshold value; updating the pixel value of the current pixel to a second value if the pixel value of the current pixel is less than or equal to the binarization threshold, wherein the second value is different from the first value; and taking the pixel with the pixel value of the first value as an edge pixel in the image to be processed.
In some embodiments, the first screening module 142 is specifically configured to: traversing each edge pixel, and for the current edge pixel traversed currently, searching a classification label corresponding to the current edge pixel from the semantic segmentation mask; and under the condition that the classification label is a target object, taking the current edge pixel as the target edge pixel until all edge pixels are traversed, and obtaining the target edge pixel corresponding to the target object.
In some embodiments, the first screening module 142 is specifically configured to: determining the respective corresponding outline of each part of the target object based on the target edge pixels; for each contour, filling from outside to inside from the edge of the contour, and obtaining at least one candidate region corresponding to the target object.
In some embodiments, the first screening module 142 is specifically configured to: for each candidate region in the at least one candidate region, determining whether the candidate region contains a key point according to the first position information of the key point of the target object; in the case where the candidate region contains a key point, the candidate region is determined as a target region.
In some embodiments, the determining module 144 is specifically configured to: and setting the pixel values of all pixels in the target area as a first value, and setting the pixel values of all pixels in the area except the target area in the image to be processed as a second value, so as to obtain a first segmentation mask corresponding to the target object.
In some embodiments, the image to be processed is one of the frames in the video stream, and the determining module 144 is specifically configured to: acquiring a homography transformation matrix corresponding to an image to be processed, wherein the homography transformation matrix characterizes optical flow information of a video stream; determining second position information of key points of a target object in the image to be processed according to the first position information of the key points in the previous frame of the image to be processed and the homography transformation matrix; determining target position information of key points of a target object in an image to be processed according to first position information and second position information corresponding to the key points of the target object in the image to be processed; determining a second segmentation mask corresponding to a target object in the image to be processed according to a first segmentation mask corresponding to a previous frame of the image to be processed and the homography transformation matrix; determining a target segmentation mask corresponding to a target object in the image to be processed according to a first segmentation mask and a second segmentation mask corresponding to the target object in the image to be processed; the target position information of the key points of the target object and the target segmentation mask are used to construct a structured result of the target object.
In some embodiments, the determining module 144 is specifically configured to: acquiring a first image and a second image from a video stream, wherein the first image is a previous frame image of the second image; detecting a target object in the first image to obtain a key point corresponding to the target object on the first image; detecting a target object in the second image to obtain a key point corresponding to the target object on the second image; a homography transformation matrix is determined based on keypoints on the first image corresponding to the target object and keypoints on the second image corresponding to the target object.
In some embodiments, the determination module 144 is further to: extracting an avatar corresponding to the target object from the image to be processed based on the target segmentation mask; determining the motion trail of each part of the virtual image based on a preset driving action and according to the target position information of the key point of the target object; and obtaining a video stream matched with the preset driving action based on the motion trail of each part of the virtual image.
Each of the above modules in the apparatus for structuring the target object may be implemented in whole or in part by software, hardware, or a combination thereof. The above modules may be embedded in or independent of a processor in the computer device in the form of hardware, or stored in a memory in the computer device in the form of software, so that the processor can call and execute the operations corresponding to the above modules.
In some embodiments, a computer device is provided, which may be a terminal or a server, and the internal structure of which may be as shown in fig. 15. The computer device includes a processor, a memory, an Input/Output interface (I/O) and a communication interface. The processor, the memory and the input/output interface are connected through a system bus, and the communication interface is connected to the system bus through the input/output interface. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, computer programs, and a database. The internal memory provides an environment for the operation of the operating system and computer programs in the non-volatile storage media. The database of the computer device is used for storing data such as a multitasking neural network, a structured processing result and the like. The input/output interface of the computer device is used to exchange information between the processor and the external device. The communication interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by a processor to implement a method of structuring a target object.
It will be appreciated by those skilled in the art that the structure shown in fig. 15 is merely a block diagram of a portion of the structure associated with the present inventive arrangements and is not limiting of the computer device to which the present inventive arrangements are applied, and that a particular computer device may include more or fewer components than shown, or may combine certain components, or have a different arrangement of components.
In some embodiments, there is also provided a computer device comprising a memory and a processor, the memory having stored therein a computer program, the processor implementing the steps of the method embodiments described above when the computer program is executed.
In some embodiments, a computer-readable storage medium is provided, on which a computer program is stored which, when executed by a processor, carries out the steps of the method embodiments described above.
In some embodiments, a computer program product is provided, comprising a computer program which, when executed by a processor, implements the steps of the method embodiments described above.
Those skilled in the art will appreciate that implementing all or part of the above described methods may be accomplished by way of a computer program stored on a non-transitory computer readable storage medium, which when executed, may comprise the steps of the embodiments of the methods described above. Any reference to memory, database, or other medium used in embodiments provided herein may include at least one of non-volatile and volatile memory. The nonvolatile Memory may include Read-Only Memory (ROM), magnetic tape, floppy disk, flash Memory, optical Memory, high density embedded nonvolatile Memory, resistive random access Memory (ReRAM), magnetic random access Memory (Magnetoresistive Random Access Memory, MRAM), ferroelectric Memory (Ferroelectric Random Access Memory, FRAM), phase change Memory (Phase Change Memory, PCM), graphene Memory, and the like. Volatile memory can include random access memory (Random Access Memory, RAM) or external cache memory, and the like. By way of illustration, and not limitation, RAM can be in the form of a variety of forms, such as static random access memory (Static Random Access Memory, SRAM) or dynamic random access memory (Dynamic Random Access Memory, DRAM), and the like. The databases referred to in the embodiments provided herein may include at least one of a relational database and a non-relational database. The non-relational database may include, but is not limited to, a blockchain-based distributed database, and the like. The processor referred to in the embodiments provided in the present application may be a general-purpose processor, a central processing unit, a graphics processor, a digital signal processor, a programmable logic unit, a data processing logic unit based on quantum computing, or the like, but is not limited thereto.
The technical features of the above embodiments may be arbitrarily combined, and all possible combinations of the technical features in the above embodiments are not described for brevity of description, however, as long as there is no contradiction between the combinations of the technical features, they should be considered as the scope of the description.
The foregoing examples illustrate only a few embodiments of the application and are described in detail herein without thereby limiting the scope of the application. It should be noted that it will be apparent to those skilled in the art that several variations and modifications can be made without departing from the spirit of the application, which are all within the scope of the application. Accordingly, the scope of the application should be assessed as that of the appended claims.

Claims (15)

1. A method of structuring a target object, the method comprising:
detecting a target object in an image to be processed to obtain a semantic segmentation mask corresponding to the target object and first position information of key points of the target object;
determining edge pixels in the image to be processed, and screening target edge pixels corresponding to the target object from the edge pixels according to the semantic segmentation mask;
determining at least one candidate region corresponding to the target object based on the target edge pixel;
determining a target area belonging to the target object from the at least one candidate area according to first position information of the key point of the target object;
and determining a first segmentation mask corresponding to the target object based on the target area, and determining a structuring result of the target object according to the first segmentation mask and the first position information of the key point.
2. The method according to claim 1, wherein detecting the target object in the image to be processed to obtain the semantic segmentation mask corresponding to the target object and the first location information of the key point of the target object includes:
inputting the image to be processed into a multi-task neural network to obtain a key point thermodynamic diagram and a semantic feature diagram;
searching a maximum value on the key point thermodynamic diagram, and determining a mapping proportion according to the scale of the image to be processed and the scale of the key point thermodynamic diagram;
determining first position information of a key point corresponding to the target object based on the maximum value and the mapping proportion;
and up-sampling the semantic feature map to obtain a target semantic feature map matched with the scale of the image to be processed, and normalizing the target semantic feature map to obtain a semantic segmentation mask corresponding to the target object.
3. The method according to claim 2, wherein determining the first location information of the keypoint corresponding to the target object based on the maximum value and the mapping ratio comprises:
determining a symmetrical window on the key point thermodynamic diagram by taking the maximum value as a center;
weighting the coordinates of each position in the symmetrical window based on the value of each position in the symmetrical window to obtain the coordinates of the key point before mapping;
and determining first position information of the key point corresponding to the target object based on the pre-mapping key point coordinates and the mapping proportion.
4. The method of claim 1, wherein the determining edge pixels in the image to be processed comprises:
for each pixel on the image to be processed, determining a binarization threshold corresponding to the current pixel based on the pixel value of the current pixel and the pixel values of surrounding pixels of the current pixel;
updating the pixel value of the current pixel to a first value under the condition that the pixel value of the current pixel is larger than the binarization threshold value;
updating the pixel value of the current pixel to a second value if the pixel value of the current pixel is less than or equal to the binarization threshold, wherein the second value is different from the first value;
and taking the pixel with the pixel value of the first value as an edge pixel in the image to be processed.
5. The method of claim 1, wherein the screening target edge pixels corresponding to the target object from the edge pixels according to the semantic segmentation mask comprises:
traversing each edge pixel, and for the current edge pixel traversed currently, searching a classification label corresponding to the current edge pixel from the semantic segmentation mask;
and under the condition that the classification label is a target object, taking the current edge pixel as a target edge pixel until all edge pixels are traversed, and obtaining a target edge pixel corresponding to the target object.
6. The method of claim 1, wherein the determining at least one candidate region corresponding to the target object based on the target edge pixels comprises:
determining the respective corresponding outline of each part of the target object based on the target edge pixels;
and for each contour, filling from outside to inside from the edge of the contour to obtain at least one candidate region corresponding to the target object.
7. The method according to claim 1, wherein determining a target region belonging to the target object from the at least one candidate region according to the keypoints of the target object comprises:
for each candidate region in the at least one candidate region, determining whether the candidate region contains a key point according to first position information of the key point of the target object;
and in the case that the candidate region contains a key point, determining the candidate region as a target region.
8. The method of claim 1, wherein the determining a first segmentation mask corresponding to the target object based on the target region comprises:
and setting the pixel values of all pixels in the target area as a first value, and setting the pixel values of all pixels in the area except the target area in the image to be processed as a second value, so as to obtain a first segmentation mask corresponding to the target object.
9. The method according to any one of claims 1 to 8, wherein the image to be processed is one of the frames of the video stream, and wherein the determining the structured result of the target object according to the first segmentation mask and the first location information of the keypoint comprises:
acquiring a homography transformation matrix corresponding to the image to be processed, wherein the homography transformation matrix characterizes optical flow information of the video stream;
determining second position information of key points of a target object in the image to be processed according to the first position information of the key points in the previous frame of the image to be processed and the homography transformation matrix;
determining target position information of key points of a target object in the image to be processed according to first position information and second position information corresponding to the key points of the target object in the image to be processed;
determining a second segmentation mask corresponding to a target object in the image to be processed according to a first segmentation mask corresponding to a previous frame of the image to be processed and the homography transformation matrix;
determining a target segmentation mask corresponding to a target object in the image to be processed according to a first segmentation mask and a second segmentation mask corresponding to the target object in the image to be processed; the target position information of the key points of the target object and the target segmentation mask are used for forming a structural result of the target object.
10. The method of claim 9, wherein the acquiring a homography transformation matrix corresponding to the image to be processed comprises:
acquiring a first image and a second image from the video stream, wherein the first image is the frame preceding the second image;
detecting the target object in the first image to obtain a key point corresponding to the target object on the first image;
detecting the target object in the second image to obtain a key point corresponding to the target object on the second image;
and determining a homography transformation matrix based on the key points corresponding to the target object on the first image and the key points corresponding to the target object on the second image.
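For claim 10, the standard way to recover a homography from matched key points of the target object is a robust fit; using `cv2.findHomography` with RANSAC (which needs at least four correspondences) is an assumed but conventional choice:

```python
import cv2
import numpy as np

def estimate_homography(keypoints_first, keypoints_second):
    """Estimate the homography mapping the first image's key points onto
    the second image's; requires at least four matched key points."""
    src = np.asarray(keypoints_first, dtype=np.float32).reshape(-1, 1, 2)
    dst = np.asarray(keypoints_second, dtype=np.float32).reshape(-1, 1, 2)
    H, _inliers = cv2.findHomography(src, dst, cv2.RANSAC,
                                     ransacReprojThreshold=3.0)
    return H
```

Because the matrix is fitted only to the target object's key points, it characterizes the apparent motion of the target between frames, which is why claim 9 can use it as a stand-in for dense optical flow.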
11. The method according to claim 9, further comprising:
extracting an avatar corresponding to the target object from the image to be processed based on the target segmentation mask;
determining the motion trail of each part of the avatar according to a preset driving action and the target position information of the key points of the target object;
and obtaining a video stream matched with the preset driving action based on the motion trail of each part of the avatar.
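The extraction step of claim 11 can be sketched as a mask-to-alpha cut-out; representing the avatar as a four-channel image with the target segmentation mask as the alpha channel is our assumption (the claim does not fix a representation), and the subsequent driving of the avatar along key-point trajectories is left out:

```python
import cv2
import numpy as np

def extract_avatar(image_bgr, target_mask):
    """Cut the target object out of the image to be processed as a
    four-channel (BGRA) avatar, using the target mask as alpha."""
    b, g, r = cv2.split(image_bgr)
    alpha = np.where(target_mask > 0, 255, 0).astype(np.uint8)
    return cv2.merge([b, g, r, alpha])
```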
12. An apparatus for structuring a target object, comprising:
the detection module is used for detecting a target object in an image to be processed to obtain a semantic segmentation mask corresponding to the target object and first position information of key points of the target object;
the first screening module is used for determining edge pixels in the image to be processed and screening target edge pixels corresponding to the target object from the edge pixels according to the semantic segmentation mask;
a second screening module, configured to determine at least one candidate region corresponding to the target object based on the target edge pixels, and to determine a target region belonging to the target object from the at least one candidate region according to the first position information of the key points of the target object;
and the determining module is used for determining a first segmentation mask corresponding to the target object based on the target area and determining a structuring result of the target object according to the first segmentation mask and the first position information of the key points.
13. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor, when executing the computer program, implements the steps of the method of any one of claims 1 to 11.
14. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when executed by a processor, implements the steps of the method of any one of claims 1 to 11.
15. A computer program product comprising a computer program, characterized in that the computer program, when executed by a processor, implements the steps of the method of any one of claims 1 to 11.
CN202210633956.XA 2022-06-07 2022-06-07 Method, device and computer equipment for carrying out structuring processing on target object Pending CN117237386A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210633956.XA CN117237386A (en) 2022-06-07 2022-06-07 Method, device and computer equipment for carrying out structuring processing on target object

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210633956.XA CN117237386A (en) 2022-06-07 2022-06-07 Method, device and computer equipment for carrying out structuring processing on target object

Publications (1)

Publication Number Publication Date
CN117237386A (en) 2023-12-15

Family

ID=89081329

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210633956.XA Pending CN117237386A (en) 2022-06-07 2022-06-07 Method, device and computer equipment for carrying out structuring processing on target object

Country Status (1)

Country Link
CN (1) CN117237386A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117876428A (en) * 2024-03-12 2024-04-12 金锐同创(北京)科技股份有限公司 Target tracking method, device, computer equipment and medium based on image processing
CN117876428B (en) * 2024-03-12 2024-05-17 金锐同创(北京)科技股份有限公司 Target tracking method, device, computer equipment and medium based on image processing

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
REG Reference to a national code
Ref country code: HK
Ref legal event code: DE
Ref document number: 40098415
Country of ref document: HK