CN111415373A - Target tracking and segmenting method, system and medium based on twin convolutional network

Target tracking and segmenting method, system and medium based on twin convolutional network

Info

Publication number
CN111415373A
Authority
CN
China
Prior art keywords
target
tracking
feature map
map
row
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010202511.7A
Other languages
Chinese (zh)
Inventor
盛校粼
李凡平
石柱国
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Yisa Technology Co ltd
Qingdao Yisa Data Technology Co Ltd
Original Assignee
Beijing Yisa Technology Co ltd
Qingdao Yisa Data Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Yisa Technology Co ltd, Qingdao Yisa Data Technology Co Ltd filed Critical Beijing Yisa Technology Co ltd
Priority to CN202010202511.7A priority Critical patent/CN111415373A/en
Publication of CN111415373A publication Critical patent/CN111415373A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/20 Analysis of motion
    • G06T7/246 Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/10 Segmentation; Edge detection
    • G06T7/136 Segmentation; Edge detection involving thresholding

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a target tracking and segmenting method based on a twin convolutional network, which comprises the following steps: extracting image features with a densely connected convolutional neural network to obtain a target feature map and a tracking-area feature map; performing a cross-correlation operation on the target feature map and the tracking-area feature map to obtain an output feature map; convolving the output feature map and sending it to a semantic segmentation branch and a score map branch to obtain a first feature map and a score map, where each pixel of the first feature map together with its corresponding channels is defined as a RoW and each pixel of the score map is the confidence of the corresponding RoW in the first feature map; selecting on the first feature map the RoW corresponding to the pixel with the highest confidence in the score map, converting that RoW into a first matrix, performing binary classification on the first matrix to obtain a mask matrix, processing the mask matrix to obtain a target mask, and obtaining the bounding box of the tracked target from the target mask. The method improves target tracking precision and realizes pixel-level tracking of the target.

Description

Target tracking and segmenting method, system and medium based on twin convolutional network
Technical Field
The invention relates to the technical field of computer vision, in particular to a target tracking and segmenting method, a target tracking and segmenting system, a target tracking and segmenting terminal and a target tracking and segmenting medium based on a twin convolutional network.
Background
In recent years, with the rise of artificial intelligence and deep learning, convolutional neural network algorithms have gradually entered the target tracking field and achieved remarkable performance. In particular, algorithm frameworks based on the twin (Siamese) convolutional network have received great attention at international computer vision conferences and in tracking challenges in recent years by virtue of their good performance and simple network structure.
To simplify the expression of the tracking result, early tracking algorithms returned the target as an axis-aligned rectangular box. However, as tracking accuracy improved and the datasets became more difficult, VOT2015 introduced a rotated rectangular box as the annotation, and VOT2016 proposed an automatic method for generating a rotated box from a mask; even so, these representations cannot meet the requirements of diversified target tracking tasks.
Disclosure of Invention
Aiming at the defects in the prior art, the embodiment of the invention provides a target tracking and segmenting method, a target tracking and segmenting system, a target tracking and segmenting terminal and a target tracking and segmenting medium based on a twin convolutional network.
In a first aspect, a target tracking and segmenting method based on a twin convolutional network provided in an embodiment of the present invention includes:
acquiring input image information;
extracting input image features by adopting a densely connected convolutional neural network to obtain a target feature map and a tracking area feature map;
performing a cross-correlation operation on the target feature map and the tracking-area feature map to obtain an output feature map;
after the output feature map is convolved, sending it to a semantic segmentation branch and a score map branch respectively to obtain a first feature map and a score map, where each pixel of the first feature map together with its corresponding channels is defined as a RoW, and each pixel of the score map is the confidence of the corresponding RoW in the first feature map;
selecting on the first feature map the RoW corresponding to the pixel with the highest confidence in the score map, converting that RoW into a first matrix, performing binary classification on the first matrix to obtain a mask matrix, mapping the mask matrix back to the original image through an affine transformation, binarizing the values between 0 and 1 in the mask matrix with a set segmentation threshold to obtain the target mask of the tracked target in the original image, and obtaining the bounding box of the tracked target as the minimum circumscribed rectangle of the target mask.
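For illustration, the steps of the first aspect can be read as one forward pass through a weight-shared (twin) backbone followed by two convolutional heads. The following PyTorch sketch shows that control flow only under stated assumptions: backbone, mask_head and score_head are hypothetical placeholder modules rather than the patented implementation, the depth-wise form of the cross-correlation is inferred from the channel-preserving dimensions given in the embodiment described later, and the 63 × 63 mask size is taken from the same embodiment.

import torch
import torch.nn.functional as F

def track_step(backbone, mask_head, score_head, target_img, search_img):
    """One tracking-and-segmentation step (a sketch of the steps above).

    target_img: 1 x 3 x 127 x 127 template crop of the tracked target.
    search_img: 1 x 3 x 255 x 255 crop of the tracking area.
    """
    z = backbone(target_img)   # target feature map, e.g. 1 x 256 x 15 x 15
    x = backbone(search_img)   # tracking-area feature map, e.g. 1 x 256 x 31 x 31

    # Cross-correlate the two feature maps (stride 1, no padding): each channel
    # of z slides over the matching channel of x, giving 1 x 256 x 17 x 17.
    c = z.size(1)
    out = F.conv2d(x, z.view(c, 1, z.size(2), z.size(3)), groups=c)

    fmask = score = None
    fmask = mask_head(out)     # first feature map: one RoW per spatial position
    score = score_head(out)    # score map: one confidence per RoW

    # Select the RoW with the highest confidence in the score map.
    idx = score.flatten().argmax()
    row = fmask.flatten(2)[0, :, idx]      # channel vector of the selected RoW
    first_matrix = row.view(63, 63)        # the "first matrix"
    return torch.sigmoid(first_matrix)     # binary-classified mask matrix

Sharing the backbone weights between the two inputs is what makes the network a twin network: the target and the tracking area are embedded in the same feature space, so their correlation response is meaningful.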
In a second aspect, an embodiment of the present invention provides a target tracking and segmenting system based on a twin convolutional network, including: an image acquisition module, an image feature extraction module, a cross-correlation module, a first analysis module and a second analysis module,
the image acquisition module is used for acquiring input image information;
the image feature extraction module adopts a densely connected convolutional neural network to extract the features of an input image to obtain a target feature map and a tracking area feature map;
the cross-correlation module performs a cross-correlation operation on the target feature map and the tracking-area feature map to obtain an output feature map;
the first analysis module convolves the output feature map and sends it to the semantic segmentation branch and the score map branch respectively to obtain a first feature map and a score map, where each pixel of the first feature map together with its corresponding channels is defined as a RoW and each pixel of the score map is the confidence of the corresponding RoW in the first feature map;
the second analysis module selects on the first feature map the RoW corresponding to the pixel with the highest confidence in the score map, converts that RoW into a first matrix, performs binary classification on the first matrix to obtain a mask matrix whose elements are values between 0 and 1 after the classification, maps the mask matrix back to the original image through an affine transformation, binarizes the values between 0 and 1 in the mask matrix with a set segmentation threshold to obtain the target mask of the tracked target in the original image, and obtains the bounding box of the tracked target as the minimum circumscribed rectangle of the target mask.
In a third aspect, an intelligent terminal provided in an embodiment of the present invention includes a processor, an input device, an output device and a memory that are connected to one another; the memory stores a computer program comprising program instructions, and the processor is configured to call the program instructions to execute the method steps described in the foregoing embodiments.
In a fourth aspect, an embodiment of the present invention provides a computer-readable storage medium, in which a computer program is stored, the computer program comprising program instructions, which, when executed by a processor, cause the processor to perform the method steps described in the above embodiments.
The invention has the beneficial effects that:
the target tracking and segmenting method, the target tracking and segmenting system, the target tracking and segmenting terminal and the target tracking and segmenting medium based on the twin convolutional network provided by the embodiment of the invention adopt the convolutional neural network with dense connection to extract image characteristics, and add the semantic segmentation branch and the score map branch, thereby improving the target tracking precision and realizing the pixel-level tracking of the target.
Drawings
In order to more clearly illustrate the detailed description of the invention or the technical solutions in the prior art, the drawings that are needed in the detailed description of the invention or the prior art will be briefly described below. Throughout the drawings, like elements or portions are generally identified by like reference numerals. In the drawings, elements or portions are not necessarily drawn to scale.
FIG. 1 is a flow chart illustrating a target tracking and segmenting method based on a twin convolutional network according to a first embodiment of the present invention;
FIG. 2 is a schematic structural diagram of a target tracking and segmenting system based on a twin convolutional network according to another embodiment of the present invention;
fig. 3 shows a schematic structural diagram of an intelligent terminal according to another embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It will be understood that the terms "comprises" and/or "comprising," when used in this specification and the appended claims, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It is also to be understood that the terminology used in the description of the invention herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in the specification of the present invention and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
It should be further understood that the term "and/or" as used in this specification and the appended claims refers to and includes any and all possible combinations of one or more of the associated listed items.
As used in this specification and the appended claims, the term "if" may be interpreted contextually as "when", "upon", "in response to a determination" or "in response to a detection". Similarly, the phrase "if it is determined" or "if a [described condition or event] is detected" may be interpreted contextually to mean "upon determining", "in response to determining", "upon detecting [the described condition or event]" or "in response to detecting [the described condition or event]".
It is to be noted that, unless otherwise specified, technical or scientific terms used herein shall have the ordinary meaning as understood by those skilled in the art to which the invention pertains.
Fig. 1 shows a flowchart of a target tracking and segmenting method based on a twin convolutional network according to a first embodiment of the present invention; the method includes the following steps:
s1: input image information is acquired.
S2: and extracting the features of the input image by adopting a densely connected convolutional neural network to obtain a target feature map and a tracking area feature map.
S3: performing cross-correlation operation on the target characteristic diagram and the tracking area characteristic diagram to obtain an output characteristic diagram;
s4: and after convolution, the output feature map is respectively sent to a semantic segmentation branch and a score map branch to obtain a first feature map and a score map, each pixel in the first feature map and a corresponding channel thereof are set as ROW, and each pixel in the score map is a confidence corresponding to each ROW in the first feature map.
S5: selecting an ROW corresponding to a pixel point with the highest confidence degree in a score map on a first feature map, converting the corresponding ROW into a first matrix, classifying the first matrix into two classes to obtain a mask matrix, mapping the mask matrix to an original image through affine transformation, binarizing the numerical value between 0 and 1 in the mask matrix by using a set segmentation threshold value to obtain a target mask of a tracking target in the original image, and acquiring a boundary frame of the tracking target by using the minimum circumscribed rectangle of the target mask.
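To make step S5 concrete, the following sketch with OpenCV and NumPy shows one plausible post-processing path under stated assumptions: affine_mat, the 2 × 3 matrix mapping mask coordinates back to original-image coordinates, is assumed to be given (in practice it inverts the crop-and-resize that produced the tracking-area input); the default threshold is the 0.35 selected in the embodiment below; and cv2.minAreaRect is used for the minimum circumscribed rectangle, which yields the rotated-box convention discussed in the background.

import cv2
import numpy as np

def mask_to_bbox(mask_prob, affine_mat, frame_hw, thresh=0.35):
    """Map a 63 x 63 mask-probability matrix back to the frame and box the target.

    mask_prob:  63 x 63 array of sigmoid outputs in [0, 1] (the mask matrix).
    affine_mat: assumed 2 x 3 affine matrix from mask to frame coordinates.
    frame_hw:   (height, width) of the original image.
    """
    h, w = frame_hw
    # Affine-map the mask matrix back onto the original image.
    warped = cv2.warpAffine(mask_prob.astype(np.float32), affine_mat, (w, h))
    # Binarize with the set segmentation threshold to get the target mask.
    target_mask = (warped > thresh).astype(np.uint8)
    ys, xs = np.nonzero(target_mask)
    if xs.size == 0:
        return target_mask, None                 # target lost in this frame
    # Minimum circumscribed rectangle of the target mask as the bounding box.
    pts = np.stack([xs, ys], axis=1).astype(np.float32)
    rect = cv2.minAreaRect(pts)                  # ((cx, cy), (w, h), angle)
    return target_mask, cv2.boxPoints(rect)      # 4 corners of the rotated box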
The above technical solution is described in detail below using a specific example.
Two input images are acquired, one of dimension 127 × 127 × 3 and the other of dimension 255 × 255 × 3, and each is fed into the densely connected convolutional neural network for feature extraction. The feature extraction network is divided into two paths that extract the target features and the tracking-area features respectively; the fully convolutional network takes the target image (of scale 127 × 127) and the tracking-area image (of scale 255 × 255) as its inputs. The feature extraction process is represented by the following expression:
x_l = H_l([x_0, x_1, ..., x_{l-1}])
where H_l denotes the feature extraction operation of the l-th layer, [x_0, x_1, ..., x_{l-1}] denotes the channel-wise concatenation of the feature maps of all preceding layers, and x_l is the output of the layer. A target feature map of dimension 15 × 15 × 256 and a tracking-area feature map of dimension 31 × 31 × 256 are obtained from the feature extraction network. A correlation operation is performed between the 15 × 15 × 256 target feature map and the 31 × 31 × 256 tracking-area feature map, with padding 0 and stride 1, giving an output feature map of dimension 17 × 17 × 256. This output feature map is sent to a semantic segmentation branch and a score map branch respectively, each consisting of a 1 × 1 convolution; the 1 × 1 convolutions produce a first feature map (fmask) of dimension 17 × 17 × (63 × 63) and a score map of dimension 17 × 17 × 1. Each pixel of fmask together with its corresponding channels is called a RoW, that is, the response of a candidate window, so fmask contains 17 × 17 RoWs in total, each of dimension 1 × 1 × (63 × 63). Each pixel of the score map is the confidence of the corresponding RoW in fmask, and the RoW corresponding to the pixel with the highest confidence in the score map is selected on fmask as the RoW used to generate the final mask. The selected RoW of dimension 1 × 1 × (63 × 63) is resized into a 63 × 63 first matrix, and sigmoid binary classification is performed on the first matrix to decide whether each pixel of the matrix generated from the RoW belongs to the mask. After this classification a mask matrix is obtained whose elements are sigmoid values between 0 and 1; the mask matrix is mapped back to the original image through an affine transformation, the values between 0 and 1 in the mask matrix are binarized with a set segmentation threshold (0.35 is selected as the mask segmentation threshold in this embodiment), and finally the mask of the tracked target in the original image is obtained, from which the bounding box of the tracked target is derived as the minimum circumscribed rectangle of the target mask.
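As an illustration of the dense connection rule x_l = H_l([x_0, x_1, ..., x_{l-1}]) used above for feature extraction, the following is a minimal sketch of one densely connected block; the layer count, channel widths, and the composition of H_l as BatchNorm, ReLU and a 3 × 3 convolution are assumptions in the style of DenseNet, not the patented architecture:

import torch
import torch.nn as nn

class DenseBlock(nn.Module):
    """Every layer sees the channel-wise concatenation of all earlier outputs."""

    def __init__(self, in_ch=64, growth=32, num_layers=4):
        super().__init__()
        self.layers = nn.ModuleList()
        ch = in_ch
        for _ in range(num_layers):
            # H_l: BatchNorm -> ReLU -> 3x3 convolution (an assumed composition).
            self.layers.append(nn.Sequential(
                nn.BatchNorm2d(ch),
                nn.ReLU(inplace=True),
                nn.Conv2d(ch, growth, kernel_size=3, padding=1, bias=False),
            ))
            ch += growth

    def forward(self, x):
        feats = [x]                                   # [x_0]
        for layer in self.layers:
            # x_l = H_l([x_0, x_1, ..., x_{l-1}]): concatenate, then transform.
            feats.append(layer(torch.cat(feats, dim=1)))
        return torch.cat(feats, dim=1)                # dense feature reuse

Because every layer receives all earlier feature maps directly, low-level detail and gradient signal reach deep layers without attenuation, which is the feature-reuse property the embodiment relies on for stronger feature extraction.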
The method was evaluated on the DAVIS2016 dataset, and its performance was compared on various metrics with other state-of-the-art tracking algorithms (including traditional tracking algorithms based on twin networks); the results are shown in Table 1:
[Table 1: performance comparison on the DAVIS2016 dataset; the table is reproduced as an image in the original publication and its values are not recoverable as text.]
The method was likewise evaluated on the DAVIS2017 dataset against other state-of-the-art tracking algorithms (including traditional tracking algorithms based on twin networks); the results are shown in Table 2:
[Table 2: performance comparison on the DAVIS2017 dataset; the table is reproduced as an image in the original publication and its values are not recoverable as text.]
The data in Tables 1 and 2 show that the target tracking and segmenting method based on the twin convolutional network provided by the embodiment of the invention performs significantly better than the prior-art methods.
In the target tracking and segmenting method based on the twin convolutional network, image features are extracted with a densely connected convolutional neural network, which improves the feature extraction capability of the network; adding the semantic segmentation branch and the score map branch to this network improves target tracking precision and realizes pixel-level tracking of the target.
In the first embodiment, a twin convolutional network based target tracking and segmentation method is provided, and correspondingly, the present application also provides a twin convolutional network based target tracking and segmentation system. Please refer to fig. 2, which is a schematic diagram of a target tracking and segmenting system based on a twin convolutional network according to a second embodiment of the present invention. Since the apparatus embodiments are substantially similar to the method embodiments, they are described in a relatively simple manner, and reference may be made to some of the descriptions of the method embodiments for relevant points.
As shown in fig. 2, a target tracking and segmenting system based on a twin convolutional network according to another embodiment of the present invention includes an image acquisition module, an image feature extraction module, a cross-correlation module, a first analysis module and a second analysis module. The image acquisition module acquires input image information. The image feature extraction module extracts features of the input images with a densely connected convolutional neural network to obtain a target feature map and a tracking-area feature map. The cross-correlation module performs a cross-correlation operation on the target feature map and the tracking-area feature map to obtain an output feature map. The first analysis module convolves the output feature map and sends it to the semantic segmentation branch and the score map branch respectively to obtain a first feature map and a score map, where each pixel of the first feature map together with its corresponding channels is defined as a RoW and each pixel of the score map is the confidence of the corresponding RoW in the first feature map. The second analysis module selects on the first feature map the RoW corresponding to the pixel with the highest confidence in the score map, converts that RoW into a first matrix, performs binary classification on the first matrix to obtain a mask matrix whose elements are values between 0 and 1 after the classification, maps the mask matrix back to the original image through an affine transformation, binarizes the values between 0 and 1 in the mask matrix with a set segmentation threshold to obtain the target mask of the tracked target in the original image, and obtains the bounding box of the tracked target as the minimum circumscribed rectangle of the target mask.
The above technical solution is described in detail below using a specific example.
The image acquisition module acquires two input images, one of dimension 127 × 127 × 3 and the other of dimension 255 × 255 × 3, and feeds them to the image feature extraction module. The image feature extraction module extracts features with a densely connected convolutional neural network whose two paths extract the target features and the tracking-area features respectively; the fully convolutional network takes the target image (of scale 127 × 127) and the tracking-area image (of scale 255 × 255) as its inputs. The feature extraction process of the image feature extraction module is represented by the following expression:
x_l = H_l([x_0, x_1, ..., x_{l-1}])
where H_l denotes the feature extraction operation of the l-th layer, [x_0, x_1, ..., x_{l-1}] denotes the channel-wise concatenation of the feature maps of all preceding layers, and x_l is the output of the layer; a target feature map of dimension 15 × 15 × 256 and a tracking-area feature map of dimension 31 × 31 × 256 are obtained. The cross-correlation module performs a correlation operation on the 15 × 15 × 256 target feature map and the 31 × 31 × 256 tracking-area feature map, with padding 0 and stride 1, to obtain an output feature map of dimension 17 × 17 × 256, which it sends to the first analysis module. The first analysis module sends the output feature map to the semantic segmentation branch and the score map branch respectively, each consisting of a 1 × 1 convolution, obtaining a first feature map (fmask) of dimension 17 × 17 × (63 × 63) and a score map of dimension 17 × 17 × 1. Each pixel of fmask together with its corresponding channels is called a RoW, that is, the response of a candidate window, so fmask contains 17 × 17 RoWs in total, each of dimension 1 × 1 × (63 × 63), and each pixel of the score map is the confidence of the corresponding RoW in fmask. The second analysis module selects the RoW corresponding to the pixel with the highest confidence in the score map as the RoW used to generate the final mask, resizes the selected 1 × 1 × (63 × 63) RoW into a 63 × 63 first matrix, and performs sigmoid binary classification on the first matrix to decide whether each pixel of the matrix generated from the RoW belongs to the mask. After this classification a mask matrix is obtained whose elements are sigmoid values between 0 and 1; the mask matrix is mapped back to the original image through an affine transformation, the values between 0 and 1 in the mask matrix are binarized with a set segmentation threshold (0.35 is selected in this embodiment), and finally the mask of the tracked target in the original image is obtained, from which the bounding box of the tracked target is derived as the minimum circumscribed rectangle of the target mask.
In the target tracking and segmenting system based on the twin convolutional network, image features are extracted with a densely connected convolutional neural network, which improves the feature extraction capability of the network; adding the semantic segmentation branch and the score map branch to this network improves target tracking precision and realizes pixel-level tracking of the target.
As shown in fig. 3, an intelligent terminal according to a third embodiment of the present invention includes a processor, an input device, an output device and a memory that are connected to one another; the memory stores a computer program comprising program instructions, and the processor is configured to call the program instructions to execute the method described in the first embodiment.
It should be understood that in the embodiments of the present invention, the processor may be a central processing unit (CPU), or another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor.
The input devices may include a touch pad, a fingerprint sensor (for collecting fingerprint information of a user and direction information of the fingerprint), a microphone, etc., and the output devices may include a display (LCD, etc.), a speaker, etc.
The memory may include both read-only memory and random access memory, and provides instructions and data to the processor. A portion of the memory may also include non-volatile random access memory. For example, the memory may also store device type information.
In a specific implementation, the processor, the input device, and the output device described in the embodiments of the present invention may execute the implementation described in the method embodiments provided in the embodiments of the present invention, and may also execute the implementation described in the system embodiments in the embodiments of the present invention, which is not described herein again.
The invention also provides an embodiment of a computer-readable storage medium, in which a computer program is stored, which computer program comprises program instructions that, when executed by a processor, cause the processor to carry out the method described in the above embodiment.
The computer readable storage medium may be an internal storage unit of the terminal described in the foregoing embodiment, for example, a hard disk or a memory of the terminal. The computer readable storage medium may also be an external storage device of the terminal, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), and the like provided on the terminal. Further, the computer-readable storage medium may also include both an internal storage unit and an external storage device of the terminal. The computer-readable storage medium is used for storing the computer program and other programs and data required by the terminal. The computer readable storage medium may also be used to temporarily store data that has been output or is to be output.
Those of ordinary skill in the art will appreciate that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of the two, and that the components and steps of the examples have been described above generally in terms of their functionality in order to clearly illustrate the interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends on the particular application and the design constraints imposed on the implementation. Skilled artisans may implement the described functionality in different ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
It can be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working processes of the terminal and the unit described above may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the several embodiments provided in the present application, it should be understood that the disclosed terminal and method can be implemented in other manners. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the units is only one logical division, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may also be an electric, mechanical or other form of connection.
Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of the present invention, not to limit them. Although the invention has been described in detail with reference to the foregoing embodiments, those skilled in the art will understand that the technical solutions described in the foregoing embodiments may still be modified, or some or all of their technical features may be equivalently replaced; such modifications and substitutions do not depart from the spirit and scope of the present invention, and they should all be construed as being included within the scope of the claims and description.

Claims (10)

1. A target tracking and segmenting method based on a twin convolutional network is characterized by comprising the following steps:
acquiring input image information;
extracting input image features by adopting a densely connected convolutional neural network to obtain a target feature map and a tracking area feature map;
performing a cross-correlation operation on the target feature map and the tracking area feature map to obtain an output feature map;
after the output feature map is convolved, sending it to a semantic segmentation branch and a score map branch respectively to obtain a first feature map and a score map, wherein each pixel in the first feature map together with its corresponding channels is set as a RoW, and each pixel in the score map is the confidence corresponding to each RoW in the first feature map;
selecting on the first feature map the RoW corresponding to the pixel with the highest confidence in the score map, converting that RoW into a first matrix, performing binary classification on the first matrix to obtain a mask matrix, mapping the mask matrix back to the original image through an affine transformation, binarizing the values between 0 and 1 in the mask matrix with a set segmentation threshold to obtain the target mask of the tracked target in the original image, and obtaining the bounding box of the tracked target as the minimum circumscribed rectangle of the target mask.
2. The twin convolutional network-based target tracking and segmentation method of claim 1, wherein the semantic segmentation branch is composed of a 1 × 1 convolutional layer.
3. The twin convolutional network-based target tracking and segmentation method of claim 1, wherein the score map branch is composed of a 1 × 1 convolutional layer.
4. The twin convolutional network based target tracking and segmentation method of any one of claims 1 to 3, wherein the segmentation threshold is 0.35.
5. A twin convolutional network based target tracking and segmentation system, comprising: an image acquisition module, an image feature extraction module, a cross-correlation module, a first analysis module and a second analysis module,
the image acquisition module is used for acquiring input image information;
the image feature extraction module adopts a densely connected convolutional neural network to extract the features of an input image to obtain a target feature map and a tracking area feature map;
the cross-correlation module performs a cross-correlation operation on the target feature map and the tracking area feature map to obtain an output feature map;
the first analysis module convolves the output feature map and sends it to the semantic segmentation branch and the score map branch respectively to obtain a first feature map and a score map, wherein each pixel in the first feature map together with its corresponding channels is set as a RoW, and each pixel in the score map is the confidence corresponding to each RoW in the first feature map;
the second analysis module selects on the first feature map the RoW corresponding to the pixel with the highest confidence in the score map, converts that RoW into a first matrix, performs binary classification on the first matrix to obtain a mask matrix whose elements are values between 0 and 1 after the classification, maps the mask matrix back to the original image through an affine transformation, binarizes the values between 0 and 1 in the mask matrix with a set segmentation threshold to obtain the target mask of the tracked target in the original image, and obtains the bounding box of the tracked target as the minimum circumscribed rectangle of the target mask.
6. The twin convolutional network-based target tracking and segmentation system of claim 5, wherein the semantic segmentation branch is composed of a 1 × 1 convolutional layer.
7. The twin convolutional network-based target tracking and segmentation system of claim 5, wherein the score map branch is composed of a 1 × 1 convolutional layer.
8. A twin convolutional network based target tracking and segmentation system as claimed in any of claims 5 to 7 wherein the segmentation threshold is 0.35.
9. An intelligent terminal comprising a processor, an input device, an output device and a memory, the processor, the input device, the output device and the memory being interconnected, the memory being adapted to store a computer program, the computer program comprising program instructions, characterized in that the processor is configured to invoke the program instructions to perform the method steps according to any of claims 1 to 4.
10. A computer-readable storage medium, characterized in that the computer storage medium stores a computer program comprising program instructions which, when executed by a processor, cause the processor to carry out the method steps according to any one of claims 1 to 4.
CN202010202511.7A 2020-03-20 2020-03-20 Target tracking and segmenting method, system and medium based on twin convolutional network Pending CN111415373A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010202511.7A CN111415373A (en) 2020-03-20 2020-03-20 Target tracking and segmenting method, system and medium based on twin convolutional network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010202511.7A CN111415373A (en) 2020-03-20 2020-03-20 Target tracking and segmenting method, system and medium based on twin convolutional network

Publications (1)

Publication Number Publication Date
CN111415373A true CN111415373A (en) 2020-07-14

Family

ID=71493191

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010202511.7A Pending CN111415373A (en) 2020-03-20 2020-03-20 Target tracking and segmenting method, system and medium based on twin convolutional network

Country Status (1)

Country Link
CN (1) CN111415373A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112347852A (en) * 2020-10-10 2021-02-09 上海交通大学 Target tracking and semantic segmentation method and device for sports video and plug-in
CN112949458A (en) * 2021-02-26 2021-06-11 北京达佳互联信息技术有限公司 Training method of target tracking segmentation model and target tracking segmentation method and device
CN113177943A (en) * 2021-06-29 2021-07-27 中南大学 Cerebral apoplexy CT image segmentation method
CN115063594A (en) * 2022-08-19 2022-09-16 清驰(济南)智能科技有限公司 Feature extraction method and device based on automatic driving
CN117876428A (en) * 2024-03-12 2024-04-12 金锐同创(北京)科技股份有限公司 Target tracking method, device, computer equipment and medium based on image processing

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108665485A (en) * 2018-04-16 2018-10-16 华中科技大学 A kind of method for tracking target merged with twin convolutional network based on correlation filtering
CN109191491A (en) * 2018-08-03 2019-01-11 华中科技大学 The method for tracking target and system of the twin network of full convolution based on multilayer feature fusion
CN109409371A (en) * 2017-08-18 2019-03-01 三星电子株式会社 The system and method for semantic segmentation for image
CN110188753A (en) * 2019-05-21 2019-08-30 北京以萨技术股份有限公司 One kind being based on dense connection convolutional neural networks target tracking algorism
CN110276285A (en) * 2019-06-13 2019-09-24 浙江工业大学 A kind of shipping depth gauge intelligent identification Method in uncontrolled scene video
CN110633632A (en) * 2019-08-06 2019-12-31 厦门大学 Weak supervision combined target detection and semantic segmentation method based on loop guidance

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109409371A (en) * 2017-08-18 2019-03-01 三星电子株式会社 The system and method for semantic segmentation for image
CN108665485A (en) * 2018-04-16 2018-10-16 华中科技大学 A kind of method for tracking target merged with twin convolutional network based on correlation filtering
CN109191491A (en) * 2018-08-03 2019-01-11 华中科技大学 The method for tracking target and system of the twin network of full convolution based on multilayer feature fusion
CN110188753A (en) * 2019-05-21 2019-08-30 北京以萨技术股份有限公司 One kind being based on dense connection convolutional neural networks target tracking algorism
CN110276285A (en) * 2019-06-13 2019-09-24 浙江工业大学 A kind of shipping depth gauge intelligent identification Method in uncontrolled scene video
CN110633632A (en) * 2019-08-06 2019-12-31 厦门大学 Weak supervision combined target detection and semantic segmentation method based on loop guidance

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Qiang Wang et al., "Fast Online Object Tracking and Segmentation: A Unifying Approach", 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), page 1328 *

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112347852A (en) * 2020-10-10 2021-02-09 上海交通大学 Target tracking and semantic segmentation method and device for sports video and plug-in
CN112347852B (en) * 2020-10-10 2022-07-29 上海交通大学 Target tracking and semantic segmentation method and device for sports video and plug-in
CN112949458A (en) * 2021-02-26 2021-06-11 北京达佳互联信息技术有限公司 Training method of target tracking segmentation model and target tracking segmentation method and device
CN113177943A (en) * 2021-06-29 2021-07-27 中南大学 Cerebral apoplexy CT image segmentation method
CN113177943B (en) * 2021-06-29 2021-09-07 中南大学 Cerebral apoplexy CT image segmentation method
CN115063594A (en) * 2022-08-19 2022-09-16 清驰(济南)智能科技有限公司 Feature extraction method and device based on automatic driving
CN115063594B (en) * 2022-08-19 2022-12-13 清驰(济南)智能科技有限公司 Feature extraction method and device based on automatic driving
CN117876428A (en) * 2024-03-12 2024-04-12 金锐同创(北京)科技股份有限公司 Target tracking method, device, computer equipment and medium based on image processing
CN117876428B (en) * 2024-03-12 2024-05-17 金锐同创(北京)科技股份有限公司 Target tracking method, device, computer equipment and medium based on image processing

Similar Documents

Publication Publication Date Title
CN110060237B (en) Fault detection method, device, equipment and system
CN111415373A (en) Target tracking and segmenting method, system and medium based on twin convolutional network
US9367766B2 (en) Text line detection in images
CN111461170A (en) Vehicle image detection method and device, computer equipment and storage medium
CN112580668B (en) Background fraud detection method and device and electronic equipment
CN112380978B (en) Multi-face detection method, system and storage medium based on key point positioning
CN112364873A (en) Character recognition method and device for curved text image and computer equipment
CN113011253B (en) Facial expression recognition method, device, equipment and storage medium based on ResNeXt network
CN111368632A (en) Signature identification method and device
CN114972947B (en) Depth scene text detection method and device based on fuzzy semantic modeling
CN112749576B (en) Image recognition method and device, computing equipment and computer storage medium
CN113269752A (en) Image detection method, device terminal equipment and storage medium
CN113129298A (en) Definition recognition method of text image
CN110363762B (en) Cell detection method, cell detection device, intelligent microscope system and readable storage medium
CN112070035A (en) Target tracking method and device based on video stream and storage medium
CN111862159A (en) Improved target tracking and segmentation method, system and medium for twin convolutional network
CN109213515B (en) Multi-platform lower buried point normalization method and device and electronic equipment
CN115424293A (en) Living body detection method, and training method and device of living body detection model
CN115223173A (en) Object identification method and device, electronic equipment and storage medium
CN115345895A (en) Image segmentation method and device for visual detection, computer equipment and medium
CN112785601B (en) Image segmentation method, system, medium and electronic terminal
CN111582057B (en) Face verification method based on local receptive field
CN115497092A (en) Image processing method, device and equipment
CN115050066A (en) Face counterfeiting detection method, device, terminal and storage medium
CN113033256B (en) Training method and device for fingertip detection model

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information
CB02 Change of applicant information

Address after: 266400 No. 77, Lingyan Road, LINGSHANWEI sub district office, Huangdao District, Qingdao City, Shandong Province

Applicant after: Issa Technology Co.,Ltd.

Applicant after: QINGDAO YISA DATA TECHNOLOGY Co.,Ltd.

Address before: 266400 No. 77, Lingyan Road, LINGSHANWEI sub district office, Huangdao District, Qingdao City, Shandong Province

Applicant before: Qingdao Issa Technology Co.,Ltd.

Applicant before: QINGDAO YISA DATA TECHNOLOGY Co.,Ltd.

Address after: 266400 No. 77, Lingyan Road, LINGSHANWEI sub district office, Huangdao District, Qingdao City, Shandong Province

Applicant after: Qingdao Issa Technology Co.,Ltd.

Applicant after: QINGDAO YISA DATA TECHNOLOGY Co.,Ltd.

Address before: 100020 room 108, 1 / F, building 17, yard 6, Jingshun East Street, Chaoyang District, Beijing

Applicant before: BEIJING YISA TECHNOLOGY Co.,Ltd.

Applicant before: QINGDAO YISA DATA TECHNOLOGY Co.,Ltd.