CN113487624B - Human body instance segmentation method, terminal equipment and storage medium - Google Patents


Info

Publication number
CN113487624B
Authority
CN
China
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110707258.5A
Other languages
Chinese (zh)
Other versions
CN113487624A (en)
Inventor
王宗跃
陈文平
陈智鹏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Saiwei Network Technology Co ltd
Jimei University
Original Assignee
Shenzhen Saiwei Network Technology Co ltd
Jimei University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Saiwei Network Technology Co ltd, Jimei University filed Critical Shenzhen Saiwei Network Technology Co ltd
Priority to CN202110707258.5A
Publication of CN113487624A
Application granted
Publication of CN113487624B
Legal status: Active

Links

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/10 Segmentation; Edge detection
    • G06T 7/11 Region-based segmentation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F 17/10 Complex mathematical operations
    • G06F 17/11 Complex mathematical operations for solving equations, e.g. nonlinear equations, general mathematical optimization problems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/60 Analysis of geometric attributes
    • G06T 7/62 Analysis of geometric attributes of area, perimeter, diameter or volume
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/90 Determination of colour characteristics

Abstract

The invention relates to a human body instance segmentation method, a terminal device and a storage medium. The method comprises the following steps. S1: based on the color information of a color picture, calculate the distances between pixels with different distance calculation algorithms, concatenate all the distance calculation results, and convert them into picture features through a convolution layer. S2: divide the picture into a plurality of blocks, and obtain the feature representation vector E of the picture from the block feature and pixel position of each block. S3: construct a Transformer encoder/decoder network, and convert the feature representation vector E into an optimized feature F_d through this network. S4: connect the optimized feature F_d with the feature representation vector E to obtain F_c. S5: aggregate the important features in F_c by means of an attention mechanism, and input the aggregation result into a convolutional neural network to obtain the human body instance segmentation result. The method makes full use of the advantages of the Transformer to optimize the human body instance segmentation result, and requires no conventional operations such as clustering, candidate boxes or non-maximum suppression.

Description

Human body instance segmentation method, terminal equipment and storage medium
Technical Field
The present invention relates to the field of computer vision, and in particular, to a human body instance segmentation method, a terminal device, and a storage medium.
Background
Current instance segmentation techniques fall into two main categories. 1) Proposal-free methods, whose core idea is to learn a distinguishable feature for each point and then cluster the points belonging to the same instance with a clustering method such as mean-shift. The problem with this type of approach is that the instances finally clustered together are not sufficiently accurate, and the post-processing step is usually expensive in time. 2) Proposal-based methods, which typically generate a large number of candidate boxes before a second optimization stage. Such approaches struggle when the candidate boxes deviate strongly from the objects. They also need operations similar to non-maximum suppression to filter the dense candidate boxes, which is extremely computationally intensive; moreover, when objects overlap heavily, non-maximum suppression may treat one of the candidate boxes as a repeated candidate region and delete it. This means that instance segmentation based on candidate boxes has difficulty handling highly overlapping objects.
Disclosure of Invention
In order to solve the above problems, the invention provides a human body instance segmentation method, a terminal device and a storage medium that overcome the drawbacks of both traditional proposal-free and proposal-based instance segmentation algorithms and require no clustering, candidate boxes or non-maximum suppression operations.
The specific scheme is as follows:
A human body instance segmentation method, comprising the following steps:
S1: when a color picture containing a human body is received, calculate the distances between pixels with different distance calculation algorithms based on the color information of the picture, concatenate all the distance calculation results, and convert them into picture features through a convolution layer;
S2: divide the picture into a plurality of blocks, obtain the block feature of each block from the pixel positions and picture features corresponding to that block, and obtain the feature representation vector E of the picture from the block features and pixel positions of all blocks;
S3: construct a Transformer encoder/decoder network, and convert the feature representation vector E into an optimized feature F_d through the Transformer encoder/decoder network;
S4: connect the optimized feature F_d with the feature representation vector E to obtain F_c;
S5: aggregate the important features in F_c by means of an attention mechanism, and input the aggregation result into a convolutional neural network to obtain the human body instance segmentation result.
Further, the distance calculation algorithm includes a Euclidean distance, a Chebyshev distance, and a cosine distance.
Further, the feature representation vector E = [f_1 + p_1, f_2 + p_2, ..., f_N + p_N], where f_n denotes the block feature of the n-th block, p_n denotes the pixel position of the n-th block, and N denotes the number of blocks.
Further, the implementation process of step S3 is as follows:
S301: input the feature representation vector E into a Transformer encoding block; the multi-head self-attention mechanism in the Transformer encoding block outputs x_1; the output of the multi-head self-attention mechanism is connected with the feature representation vector E through a residual connection to output x_2; x_2 is normalized to output x_3; x_3 is fed into a fully connected network layer with an activation function to output x_4; x_4 is fed into a fully connected network layer without an activation function to output x_5; x_5 is set as the output of the Transformer encoding block;
S302: stack a plurality of Transformer encoding blocks to form the Transformer encoder, so that the coding information matrix M output by the Transformer encoder has the same dimension as the feature representation vector E;
S303: take the coding information matrix M output by the Transformer encoder as the input of the Transformer decoder; each Transformer decoding block contains two passes of the multi-head self-attention mechanism;
the output of the first self-attention mechanism is y_1; the output M of the Transformer encoder and the output y_1 of the first self-attention mechanism are connected through a residual connection to output y_2; y_2 is normalized to output y_3;
the input of the first decoding layer in the second self-attention mechanism is the feature representation vector E, and the corresponding output is y_4; the output y_1 of the first self-attention mechanism is connected with the output y_4 of the second self-attention mechanism through a residual connection to output y_5; y_5 is normalized to output y_6; y_6 is fed into a fully connected network layer with an activation function to output y_7; y_7 is fed into a fully connected network layer without an activation function to output y_8; y_8 is set as the output of the Transformer decoding block;
S304: stack a plurality of Transformer decoding blocks to form the Transformer decoder, whose output is the optimized feature F_d.
Further, in step S4, the optimized feature F_d is connected with the feature representation vector E through a Concat operation.
The human body instance segmentation terminal equipment comprises a processor, a memory and a computer program stored in the memory and capable of running on the processor, wherein the processor realizes the steps of the method according to the embodiment of the invention when executing the computer program.
A computer readable storage medium storing a computer program which, when executed by a processor, implements the steps of the method described above for embodiments of the present invention.
According to the method, the terminal device and the storage medium provided by the invention, the advantages of the Transformer are fully utilized to optimize the human body instance segmentation result, a more interpretable model is generated through the multi-head self-attention mechanism, and conventional operations such as clustering, candidate boxes and non-maximum suppression are not needed.
Drawings
Fig. 1 is a flowchart of a first embodiment of the present invention.
Detailed Description
To further illustrate the various embodiments, the invention provides the accompanying drawings. The accompanying drawings, which are incorporated in and constitute a part of this disclosure, illustrate the embodiments and, together with the description, serve to explain their principles. With reference to these materials, one of ordinary skill in the art will understand other possible embodiments and advantages of the present invention.
The invention will now be further described with reference to the drawings and detailed description.
Embodiment one:
The embodiment of the invention provides a human body instance segmentation method; fig. 1 is a flowchart of the human body instance segmentation method according to this embodiment, and the method comprises the following steps:
S1: when a color picture containing a human body is received, calculate the distances between pixels with different distance calculation algorithms based on the color information of the picture, concatenate all the distance calculation results, and convert them into picture features through a convolution layer.
In this embodiment the color picture is denoted as x ∈ R^(3×H×W), where 3 represents the three R, G, B channels, H represents the height of the picture, and W represents the width of the picture.
The distance calculation algorithms comprise the Euclidean distance, the Chebyshev distance and the cosine distance, each measuring the distance between two pixels; their calculation formulas are respectively:
Euclidean distance d_1 = sqrt((R_1 − R_2)^2 + (G_1 − G_2)^2 + (B_1 − B_2)^2);
Chebyshev distance d_2 = max(|R_1 − R_2|, |G_1 − G_2|, |B_1 − B_2|);
Cosine distance d_3 = 1 − (R_1·R_2 + G_1·G_2 + B_1·B_2) / (sqrt(R_1^2 + G_1^2 + B_1^2) · sqrt(R_2^2 + G_2^2 + B_2^2));
where R_1, G_1, B_1 denote the chromaticity values of the red, green and blue channels of the first pixel, and R_2, G_2, B_2 denote the chromaticity values of the red, green and blue channels of the second pixel.
In other embodiments, other distance calculation algorithms may be selected as desired, and are not limited herein.
It should be noted that concatenating all the distance calculation results means that, for each pixel, the distances computed by all the distance calculation algorithms between that pixel and every other pixel are concatenated; assuming there are n pixels in total, a × (n − 1) distances are concatenated for each pixel, where a denotes the number of distance calculation algorithms, which is 3 in this embodiment.
The picture features produced by the convolution layer have a fixed dimension, namely d dimensions; the value of d can be set by a person skilled in the art and is not limited herein. A minimal sketch of this step is given below.
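The following is a minimal PyTorch-style sketch of step S1 under stated assumptions: the helper name, the feature dimension d, the use of a 1×1 convolution for the unspecified convolution layer, and the dense pairwise computation (practical only for tiny pictures) are illustrative choices, not details taken from the patent.

```python
import torch
import torch.nn as nn


def pixel_distance_features(x, d=64):
    """Sketch of S1: pairwise color distances -> concatenation -> convolution layer."""
    C, H, W = x.shape                                # x: (3, H, W) color picture
    n = H * W
    p = x.reshape(C, n).T                            # (n, 3) RGB values of every pixel
    diff = p[:, None, :] - p[None, :, :]             # (n, n, 3) pairwise channel differences
    d_euc = diff.pow(2).sum(-1).sqrt()               # Euclidean distance
    d_che = diff.abs().max(-1).values                # Chebyshev distance
    norm = p.norm(dim=-1, keepdim=True).clamp(min=1e-8)
    d_cos = 1 - (p @ p.T) / (norm * norm.T)          # cosine distance
    dist = torch.stack([d_euc, d_che, d_cos], -1)    # (n, n, a) with a = 3 algorithms
    keep = ~torch.eye(n, dtype=torch.bool)           # drop each pixel's distance to itself
    dist = dist[keep].reshape(n, n - 1, 3).reshape(n, -1)   # a * (n - 1) distances per pixel
    dist = dist.T.reshape(3 * (n - 1), H, W).unsqueeze(0)   # lay the distances out as a channel map
    conv = nn.Conv2d(3 * (n - 1), d, kernel_size=1)         # convolution layer -> d-dimensional features
    return conv(dist).squeeze(0)                     # (d, H, W) picture features


# toy usage on an 8x8 picture
feat = pixel_distance_features(torch.rand(3, 8, 8))
```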
S2: divide the picture into a plurality of blocks (patches), obtain the feature corresponding to each block from the pixel positions and picture features of that block, and obtain the feature representation vector E of the picture from the features and pixel positions of all blocks.
In this embodiment the number of blocks is set to N, which is calculated as N = (H × W) / s, where s denotes the area of one block.
Feature representation vector E = [f_1 + p_1, f_2 + p_2, ..., f_N + p_N], where f_n denotes the block feature of the n-th block and p_n denotes the pixel position of the n-th block.
By combining picture convolution with color coding, the block features and pixel positions are learned, so that the constructed feature representation vector is better adapted to the input of a Transformer encoder. A sketch of this step follows.
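Below is a minimal sketch of step S2 under assumptions not stated in the patent: non-overlapping square patches, mean pooling as a stand-in for the learned block feature, and a learnable per-block vector as a stand-in for the pixel-position term; the function name and patch size are illustrative.

```python
import torch
import torch.nn as nn


def build_feature_representation(feat, patch_size=4):
    """Sketch of S2: block features f_n plus pixel positions p_n -> E of shape (N, d)."""
    d, H, W = feat.shape                             # picture features from step S1
    s = patch_size * patch_size                      # area of one block
    N = (H * W) // s                                 # number of blocks, N = H*W / s (H, W assumed divisible)
    patches = feat.unfold(1, patch_size, patch_size).unfold(2, patch_size, patch_size)
    patches = patches.reshape(d, N, s).permute(1, 0, 2)   # (N, d, s): pixels grouped per block
    f = patches.mean(-1)                             # block feature f_n (illustrative pooling)
    p = nn.Parameter(torch.zeros(N, d))              # learnable pixel-position term p_n per block
    return f + p                                     # E = [f_1 + p_1, ..., f_N + p_N]


E = build_feature_representation(torch.rand(64, 8, 8), patch_size=4)   # (4, 64) for this toy size
```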
S3: construct a Transformer encoder/decoder network, and convert the feature representation vector E into the optimized feature F_d through the Transformer encoder/decoder network.
The implementation procedure of step S3 in this embodiment is as follows:
S301: input the feature representation vector E into a Transformer encoding block; the multi-head self-attention mechanism in the Transformer encoding block outputs x_1; the output of the multi-head self-attention mechanism is connected with the feature representation vector E through a residual connection to output x_2; x_2 is normalized to output x_3; x_3 is fed into a fully connected network layer with an activation function to output x_4; x_4 is fed into a fully connected network layer without an activation function to output x_5; x_5 is set as the output of the Transformer encoding block;
S302: stack a plurality of Transformer encoding blocks to form the Transformer encoder, so that the coding information matrix M output by the Transformer encoder has the same dimension as the feature representation vector E;
S303: take the coding information matrix M output by the Transformer encoder as the input of the Transformer decoder; each Transformer decoding block contains two passes of the multi-head self-attention mechanism;
the output of the first self-attention mechanism is y_1; the output M of the Transformer encoder and the output y_1 of the first self-attention mechanism are connected through a residual connection to output y_2; y_2 is normalized to output y_3;
the input of the first decoding layer in the second self-attention mechanism is the feature representation vector E, and the corresponding output is y_4; the output y_1 of the first self-attention mechanism is connected with the output y_4 of the second self-attention mechanism through a residual connection to output y_5; y_5 is normalized to output y_6; y_6 is fed into a fully connected network layer with an activation function to output y_7; y_7 is fed into a fully connected network layer without an activation function to output y_8; y_8 is set as the output of the Transformer decoding block;
S304: stack a plurality of Transformer decoding blocks to form the Transformer decoder, whose output is the optimized feature F_d; a sketch of one encoding block and one decoding block is given below.
In the above steps, the normalization process is used to accelerate network convergence.
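The following PyTorch-style sketch mirrors the step listing above under assumptions the patent leaves open: the model width, the number of attention heads, ReLU as the activation function, and the wiring of the stacked blocks are illustrative; note that y_3 is computed but, exactly as in the listing, not consumed by the later steps.

```python
import torch
import torch.nn as nn


class EncoderBlock(nn.Module):
    """One Transformer encoding block following S301 (sizes are illustrative)."""
    def __init__(self, d_model=64, n_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)
        self.fc1 = nn.Linear(d_model, d_model)
        self.fc2 = nn.Linear(d_model, d_model)

    def forward(self, E):
        x1, _ = self.attn(E, E, E)       # multi-head self-attention output x_1
        x2 = x1 + E                      # residual connection with E
        x3 = self.norm(x2)               # normalization
        x4 = torch.relu(self.fc1(x3))    # fully connected layer with activation
        return self.fc2(x4)              # fully connected layer without activation -> x_5


class DecoderBlock(nn.Module):
    """One Transformer decoding block following S303, with two self-attention passes."""
    def __init__(self, d_model=64, n_heads=4):
        super().__init__()
        self.attn1 = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.attn2 = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.fc1 = nn.Linear(d_model, d_model)
        self.fc2 = nn.Linear(d_model, d_model)

    def forward(self, M, E):
        y1, _ = self.attn1(M, M, M)      # first self-attention over the coding information matrix M
        y2 = M + y1                      # residual connection of M and y_1
        y3 = self.norm1(y2)              # normalization -> y_3 (not reused in the listed steps)
        y4, _ = self.attn2(E, E, E)      # second self-attention fed the feature representation vector E
        y5 = y1 + y4                     # residual connection of y_1 and y_4
        y6 = self.norm2(y5)              # normalization
        y7 = torch.relu(self.fc1(y6))    # fully connected layer with activation
        return self.fc2(y7)              # fully connected layer without activation -> y_8


# S302/S304: stack several blocks; M keeps the dimension of E, and the decoder output is F_d
E = torch.rand(1, 196, 64)
M = E
for blk in [EncoderBlock() for _ in range(4)]:
    M = blk(M)
F_d = M
for blk in [DecoderBlock() for _ in range(4)]:
    F_d = blk(F_d, E)
```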
S4: connect the optimized feature F_d with the feature representation vector E to obtain F_c.
In this embodiment the optimized feature F_d is connected with the feature representation vector E through a Concat operation.
S5: aggregate the important features in F_c by means of an attention mechanism, and input the aggregation result F_A into a convolutional neural network to obtain the human body instance segmentation result.
The attention mechanism enhances the discriminability of the feature maps. A sketch of steps S4 and S5 follows.
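A minimal sketch of steps S4 and S5 is given below; the gating form of the attention mechanism, the grid size used to reshape the tokens back into a spatial map, and the two-layer convolutional network are assumptions made for illustration, not details given in the patent.

```python
import torch
import torch.nn as nn


class SegmentationHead(nn.Module):
    """Sketch of S4/S5: Concat F_d with E, attention-based aggregation, then a small CNN."""
    def __init__(self, d_model=64, grid=14, out_channels=1):
        super().__init__()
        self.grid = grid
        self.gate = nn.Sequential(nn.Linear(2 * d_model, 2 * d_model), nn.Sigmoid())
        self.cnn = nn.Sequential(
            nn.Conv2d(2 * d_model, d_model, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv2d(d_model, out_channels, kernel_size=1),
        )

    def forward(self, F_d, E):
        F_c = torch.cat([F_d, E], dim=-1)                  # S4: Concat -> F_c, shape (B, N, 2d)
        F_a = self.gate(F_c) * F_c                         # S5: attention weights aggregate important features
        B, N, C = F_a.shape
        fmap = F_a.permute(0, 2, 1).reshape(B, C, self.grid, self.grid)   # back to a spatial map, N = grid*grid
        return self.cnn(fmap)                              # human body instance segmentation result


head = SegmentationHead()
mask = head(torch.rand(1, 196, 64), torch.rand(1, 196, 64))   # (1, 1, 14, 14) in this toy setting
```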
The embodiment of the invention effectively improves on traditional proposal-based and proposal-free algorithms, makes full use of the advantages of the Transformer to optimize the human body instance segmentation result, generates a more interpretable model through the multi-head self-attention mechanism, and requires no conventional operations such as clustering, candidate boxes or non-maximum suppression.
Embodiment two:
The invention also provides a human body instance segmentation terminal device, which comprises a memory, a processor and a computer program stored in the memory and capable of running on the processor, wherein the steps of the method in embodiment one of the invention are realized when the processor executes the computer program.
Further, as an executable scheme, the human body instance segmentation terminal device may be a computing device such as a desktop computer, a notebook computer, a palm computer or a cloud server. The human body instance segmentation terminal device may include, but is not limited to, a processor and a memory. It will be understood by those skilled in the art that the above-described structure is merely an example of the human body instance segmentation terminal device and does not constitute a limitation of it; the device may include more or fewer components than described above, or combine certain components, or use different components. For example, the human body instance segmentation terminal device may further include an input/output device, a network access device, a bus, etc., which is not limited in the embodiment of the present invention.
Further, as an implementation, the processor may be a central processing unit (Central Processing Unit, CPU), another general-purpose processor, a digital signal processor (Digital Signal Processor, DSP), an application specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field programmable gate array (Field-Programmable Gate Array, FPGA) or another programmable logic device, a discrete gate or transistor logic device, discrete hardware components, etc. The general-purpose processor may be a microprocessor, or the processor may be any conventional processor; it is the control center of the human body instance segmentation terminal device and connects the parts of the whole device using various interfaces and lines.
The memory may be used to store the computer program and/or modules, and the processor implements the various functions of the human body instance segmentation terminal device by running or executing the computer program and/or modules stored in the memory and invoking the data stored in the memory. The memory may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system and at least one application program required for a function, and the data storage area may store data created according to the use of the device. In addition, the memory may include high-speed random access memory, and may also include non-volatile memory, such as a hard disk, a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, a Flash Card, at least one magnetic disk storage device, a flash memory device, or other solid-state storage device.
The present invention also provides a computer readable storage medium storing a computer program which when executed by a processor implements the steps of the above-described method of an embodiment of the present invention.
The modules/units integrated by the human body instance segmentation terminal device may be stored in a computer readable storage medium if implemented in the form of software functional units and sold or used as a separate product. Based on such understanding, the present invention may implement all or part of the flow of the method of the above embodiment through a computer program instructing related hardware; the computer program may be stored in a computer readable storage medium, and when executed by a processor, it implements the steps of each of the method embodiments described above. The computer program comprises computer program code, which may be in source code form, object code form, an executable file, some intermediate form, etc. The computer readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a Read-Only Memory (ROM), a Random Access Memory (RAM), a software distribution medium, and so forth.
While the invention has been particularly shown and described with reference to a preferred embodiment, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (6)

1. A human body instance segmentation method, characterized by comprising the steps of:
S1: when a color picture containing a human body is received, calculating the distances between pixels with different distance calculation algorithms based on the color information of the picture, concatenating all the distance calculation results, and converting them into picture features through a convolution layer;
S2: dividing the picture into a plurality of blocks, acquiring the block feature of each block according to the pixel positions and picture features corresponding to that block, and acquiring the feature representation vector E of the picture according to the block features and pixel positions of all blocks;
S3: constructing a Transformer encoder/decoder network, and converting the feature representation vector E into an optimized feature F_d through the Transformer encoder/decoder network; the implementation process of step S3 is as follows:
S301: inputting the feature representation vector E into a Transformer encoding block, the multi-head self-attention mechanism in the Transformer encoding block outputting x_1; connecting the output of the multi-head self-attention mechanism with the feature representation vector E through a residual connection to output x_2; normalizing x_2 to output x_3; inputting x_3 into a fully connected network layer with an activation function to output x_4; inputting x_4 into a fully connected network layer without an activation function to output x_5; setting x_5 as the output of the Transformer encoding block;
S302: stacking a plurality of Transformer encoding blocks to form the Transformer encoder, so that the coding information matrix M output by the Transformer encoder has the same dimension as the feature representation vector E;
S303: taking the coding information matrix M output by the Transformer encoder as the input of the Transformer decoder; each Transformer decoding block containing two passes of the multi-head self-attention mechanism;
the output of the first self-attention mechanism being y_1; connecting the output M of the Transformer encoder with the output y_1 of the first self-attention mechanism through a residual connection to output y_2; normalizing y_2 to output y_3;
the input of the first decoding layer in the second self-attention mechanism being the feature representation vector E, the corresponding output being y_4; connecting the output y_1 of the first self-attention mechanism with the output y_4 of the second self-attention mechanism through a residual connection to output y_5; normalizing y_5 to output y_6; inputting y_6 into a fully connected network layer with an activation function to output y_7; inputting y_7 into a fully connected network layer without an activation function to output y_8; setting y_8 as the output of the Transformer decoding block;
S304: stacking a plurality of Transformer decoding blocks to form the Transformer decoder, the output of which is the optimized feature F_d;
S4: connecting the optimized feature F_d with the feature representation vector E to obtain F_c;
S5: aggregating the important features in F_c by means of an attention mechanism, and inputting the aggregation result into a convolutional neural network to obtain the human body instance segmentation result.
2. The human body instance segmentation method according to claim 1, characterized in that: the distance calculation algorithm includes a Euclidean distance, a Chebyshev distance, and a cosine distance.
3. The human body instance segmentation method according to claim 1, characterized in that: the feature representation vector E = [f_1 + p_1, f_2 + p_2, ..., f_N + p_N], where f_n denotes the block feature of the n-th block, p_n denotes the pixel position of the n-th block, and N denotes the number of blocks.
4. The human body instance segmentation method according to claim 1, characterized in that: in step S4, the optimized feature F_d is connected to the feature representation vector E by a Concat operation.
5. A human body instance segmentation terminal device, characterized in that: comprising a processor, a memory and a computer program stored in the memory and running on the processor, which processor, when executing the computer program, carries out the steps of the method according to any one of claims 1 to 4.
6. A computer-readable storage medium storing a computer program, characterized in that: the computer program, when executed by a processor, implements the steps of the method according to any one of claims 1 to 4.
CN202110707258.5A 2021-06-24 2021-06-24 Human body instance segmentation method, terminal equipment and storage medium Active CN113487624B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110707258.5A CN113487624B (en) 2021-06-24 2021-06-24 Human body instance segmentation method, terminal equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110707258.5A CN113487624B (en) 2021-06-24 2021-06-24 Human body instance segmentation method, terminal equipment and storage medium

Publications (2)

Publication Number Publication Date
CN113487624A (en) 2021-10-08
CN113487624B (en) 2023-10-13

Family

ID=77936230

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110707258.5A Active CN113487624B (en) 2021-06-24 2021-06-24 Human body instance segmentation method, terminal equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113487624B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112184780A (en) * 2020-10-13 2021-01-05 武汉斌果科技有限公司 Moving object instance segmentation method
CN112862830A (en) * 2021-01-28 2021-05-28 陕西师范大学 Multi-modal image segmentation method, system, terminal and readable storage medium

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10839259B2 (en) * 2017-10-27 2020-11-17 Google Llc Attention-based image generation neural networks


Also Published As

Publication number Publication date
CN113487624A (en) 2021-10-08


Legal Events

Code Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant