CN116188804A

CN116188804A - Twin network target search system based on transformer

Info

Publication number: CN116188804A
Application number: CN202310449364.7A
Authority: CN
Inventors: 郑艳伟; 何国海; 于东晓; 李峰
Original assignee: Shandong University
Current assignee: Shandong University
Priority date: 2023-04-25
Filing date: 2023-04-25
Publication date: 2023-05-30
Anticipated expiration: 2043-04-25
Also published as: CN116188804B

Abstract

The invention belongs to the field of image retrieval and target detection in computer vision, and discloses a twin network target search system based on a transformer.

Description

Twin network target search system based on transformer

Technical Field

The invention belongs to the field of image retrieval and target detection in computer vision, and discloses a twin network target searching system based on a transformer.

Background

The computer vision means that a camera and a computer are used for replacing human eyes to perform machine vision such as recognition, tracking and measurement on targets, and further graphic processing is performed, so that the computer is processed into images which are more suitable for human eyes to observe or transmit to an instrument to detect. As a scientific discipline, computer vision research-related theory and technology has attempted to build artificial intelligence systems that can obtain 'information' from images or multidimensional data. Perception can be seen as the extraction of information from sensory signals, so computer vision can also be seen as science of how to "perceive" an artificial system from images or multidimensional data. Image processing techniques convert an input image into another image having desired characteristics. Image processing technology is often utilized in computer vision research to perform preprocessing and feature extraction, so that the computer has the capabilities of vision, hearing, speaking and the like.

Object detection and recognition are widely used in many fields of life, and is to distinguish objects in images or videos from uninteresting parts, determine whether the objects exist, determine the positions of the objects if the objects exist, and recognize the objects as a computer vision task. Target detection and recognition are very important research directions in the field of computer vision, along with the rapid development of the Internet, artificial intelligence technology and intelligent hardware, a large amount of image and video data exist in human life, so that the computer vision technology plays a larger and larger role in human life, and research on computer vision is also getting more and more hot. Object detection and recognition are also becoming increasingly important as a cornerstone in the field of computer vision. Because the current demand for the target retrieval system is larger and larger, and the technical development of the target retrieval system is slower, a mature target retrieval system is urgently needed, and the actual target retrieval problem is accurately solved, so the system is developed.

Disclosure of Invention

In order to solve the technical problems, the invention provides a twin network target retrieval system based on a transducer, which utilizes a camera to monitor targets and combines a method of image retrieval and target detection in computer vision to realize the retrieval and display of targets in a monitoring area of the camera.

In order to achieve the above purpose, the technical scheme of the invention is as follows:

a transformer-based twin network target search system, comprising the steps of:

(1) Collecting image data as a graph to be searched; extracting an interested target from part of the graphs to be searched, taking the interested target as a query graph, and designing and twinning a network target searching training model;

(2) Selecting a camera area, selecting a camera group to determine a search area, and inputting a target picture to be searched;

(3) Starting a search task, acquiring scene pictures from the inside of a camera at equal time intervals in a video frame taking mode, detecting the pictures through a model, detecting each target, comparing the target with the picture to be searched as a characteristic, calculating the matching degree of the targets, taking the maximum value, and if the matching degree exceeds a set threshold value, searching the picture sequence number

Adding into a result queue;

(4) If the result queue has new records, the current detection picture is stored in a static resource directory set by a background server, information is stored in a database, and the front-end interface screens and displays search result information of a corresponding target from the database according to requirements.

Further, the specific method of the step (1) is as follows:

(1.1) acquisition of n Zhang Sousuo plot

The default size of each search graph is 224 x 224, and the number of targets in the n search graphs is +.>

Cut out the inquiry graph to be->

Scaling each query graph from its original size to 56 x 56, denoted +.>

Then for each sheetManually classifying the query graphs, classifying the query graphs of the same target into one type, and marking the query graphs as +.>

Placing each query graph into the corresponding class of folders corresponding to the count folders, and then establishing a dictionary +.>

，/>

The value corresponds to each search graph and is marked as +.>

，/>

Each of->

Value corresponding +.>

Class names of all targets exist in the current search graph;

(1.2) designing a twin network target search model, wherein a model feature extraction backbone is divided into a vit1 and a vit2, and the vit1 is used for extracting features of a search graph; then 16 query graphs are selected, and the selection rules are as follows: querying index of current search graph

Randomly selecting 4 query graphs from the category folders which are not currently indexed, and selecting 12 query graphs from all the category files which are indexed, wherein 3 query graphs are randomly selected from each category folder, and if 12 query graphs can be selected, splicing the 16 query graphs of 56 x 56 into one 224 x 224 graph in random sequence; the vit2 is used for extracting the characteristics of a series of corresponding query graphs in the query splice graph, and the vit1 and the vit2 share the weight;

(1.3) the features extracted by vit1 are obtained by the DETR target detection head

The DETR target detection head is used for predicting the position of each target in the search graph, and the extracted characteristics of the vit1 and the vit2 jointly obtain a +.>

，

And->

The combination is performed by a proportional relationship.

Further, in step (1.2): if 12 query graphs cannot be selected, randomly selecting one query graph from the selected query graphs each time by using a data enhancement mode, generating a new query graph with the size of 56 x 56 by a turnover or rotation mode, repeating the data enhancement operation until the total number of the query graphs reaches 16, ending the data enhancement operation, and then splicing the 16 query graphs with the size of 56 x 56 into a 224 x 224 graph, wherein the newly spliced graph is named as a query splice graph and is recorded as

。

Further, in the step (1.2),

(1.2.1) adding a DETR target detection head, which can detect and frame each target from each diagram to be searched, and obtain the coordinates of each target;

(1.2.2) data are divided into n groups, each group being

，/>

，/>

Wherein->

Is the firstuZhang Sousuo figure->

Is the firstvZhang Chaxun splice; />

Extraction of features by vit1>

Then, by the DETR target detection head, the +.>

Feature vector of individual object->

Scaling the feature vectors of the m targets to feature dimensions of 56 x 384, the corresponding feature vector being +.>

，/>

Extracting features by vit2>

Due to->

Is formed by splicing 16 query graphs with the size of 56 x 56, namely, the corresponding 16 feature vectors can be obtained by extracting features according to fixed coordinate positions>

；

(1.2.3) feature vectors generated for search graphs

Characteristic vector generated by inquiring splice diagram>

When two are compared, and the two belong to the same category,defining as positive samples, defining as negative samples when the two samples do not belong to the same category, and defining a loss function by adopting a cosine distance formula:

(equation 1);

(equation 2); />

(equation 3);

，/>

，/>

，/>

，/>

to search for a graphuIs the first of (2)δFeature vectors of the target; />

Splice graph for inquiryvIs the first of (2)ηA feature vector;

when the input network is positive, the loss is calculated by using formula 1, and two eigenvectors are required to be made

The smaller the distance between them, resulting in +.>

The smaller, when the input network is a negative sample, the loss is calculated using equation 2, requiring two eigenvectors +.>

The greater the distance between them, resulting in +.>

The smaller the final +.>

The smaller.

Further, the specific method of step (1.3) is as follows:

(1.3.1) let the single target in vit1 pass through the DETR target detection head output loss be

The detection head is arranged to obtainkThe probability of each detection frame is +.>

The result box number is:

(equation 4);

(1.3.2) selecting num number detection frame region to record as A, setting preset anchor region as B, wherein A and B ensure intersection region, B does not completely contain A, and supposing

Or->

Region (2)>

Representative area->

Is a part of the area of (2);

(equation 5);

(1.3.3) features of vit1 and vit2 together obtain one

Define benchmark->

The eigenvector of vit1 is defined as +.>

The eigenvector of vit2 is defined as +.>

，/>

，/>

And->

Is a parameter which can be learned, and when all search graphs and corresponding query graphs are input in groups, the method is characterized in that +.>

To get as close to 0 as possible, otherwise, let +.>

As close to 1 as possible, +.>

；

(1.3.4) determining the final loss:

wherein->

。

Further, the specific method of the step (2) is as follows:

(2.1) establishing a new process for the current task, adding the current process ID into a process queue, starting the current process, and preparing to execute the target search task;

(2.2) when the program is startedCamera groups requiring selection of corresponding areas at the front end, assuming selectionqCorresponding to each camera

And detects whether the target picture has been added +.>

And (5) successfully starting the system when the condition is met.

Further, the specific method of the step (3) is as follows:

(3.1) transmitting a starting command to the front end to start the target searching module;

(3.2) running a video frame taking module forqEach camera takes out a picture to be detected, and the picture name is that of the cameras

Each picture generates a feature vector of +.>

Inside each picture +.>

Target (/ ->

) The corresponding feature vectors are respectively

Scaling to feature vector +.>

The same dimension is +.>

Then, feature vector of target picture +.>

Performing feature comparison to calculate matching degree, and generating a feature matching degree hash table ++>

，/>

Recording picture number->

Maximum matching degree of ∈10->

Recording picture number->

Position coordinates of the highest matching degree target area, wherein each value is:

(formula 6); ->

(equation 7);

wherein the method comprises the steps of

Represents the abscissa of the center of the region, +.>

Represents the ordinate of the center of the region,/>

Representing the height of the area>

Representing the area width;

(3.3) setting the threshold value as y, and selecting the picture sequence number with the value exceeding y from maps

Picture order->

Add to the Result pair column Result.

Further, the specific method of step (3.3) is as follows:

(3.3.1) traversing the current Map, for each

If->

Indicating that the current picture is an effective scene picture containing the target picture characteristics, and recording the sequence number of the picture +.>

；

(3.3.2) performing an addition operation to the selected picture order, and adding a newly generated picture order at the end of the Result pair column Result

，/>

And finally, returning a Result pair column Result.

Further, the specific method of the step (4) is as follows:

(4.1) detecting the result queue, if a new record is generated, obtaining the current picture sequence number

The current picture number +.>

Storing in preset server static folder, and generating picture sequence number +.>

Is +.>

The camera is +.>

Target matching degree->

Coordinates of the object->

Target picture name->

And access address->

Writing into a database;

(4.2) the front end screens and displays the current target picture in real time by setting the searching condition

Corresponding search result information.

Compared with the prior art, the invention has the following beneficial effects:

(1) According to the invention, the twin network is adopted, and the query is introduced for training, so that the trained model is more accurate and has pertinence.

(2) According to the invention, the vision transformer model and the DETR target detection head are introduced to train the target retrieval model, train the current target retrieval model end to end, complete the tasks of detection and retrieval, and better promote the accuracy of the model.

(3) The invention screens and displays the condition of the current target picture in the monitoring area in real time through the front end and updates the current target picture in real time.

Drawings

Fig. 1 is an overall schematic diagram of a twin network target retrieval system based on a transducer according to an embodiment of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention.

The invention provides a twin network target retrieval system based on a transformer, which can utilize a camera to monitor targets and combine the methods of image retrieval and target detection in computer vision to realize the retrieval and display of the targets in a monitoring area of the camera as shown in figure 1.

The DETR, collectively DEtection TRansforme, is a Transformer-based end-to-end target detection network proposed by Facebook. the transducer is an attention model and is used in the field of natural language processing; vision transformer is an attention model in the computer vision field, which is a migration application of a transducer model, and is improved to be applied to the computer vision field, so that the image processing is completed.

Specific examples are as follows:

a transformer-based twin network target retrieval system, comprising the steps of:

(1) Data acquisition and model design stage:

(1.1) acquisition of n Zhang Sousuo plot

Cutting out a query graph to be->

Scaling each query graph from its original size to 56 x 56, denoted +.>

Then manually classifying each query graph, classifying the query graph of the same target into one type, and marking the query graph as +.>

，/>

The value corresponds to each search graph and is marked as +.>

，/>

Each of->

Value corresponding +.>

Class names (same file folder names) for all targets exist in the current search graph.

(1.2) designing a twin network target search model, wherein a model feature extraction backbone is divided into a vit1 and a vit2 (both vit1 and vit2 are based on vision transformer), and the vit1 is used for extracting features of a search graph; then 16 query graphs are selected, and the selection rules are as follows: querying index of current search graph

Randomly selecting 4 query graphs from the category folders which are not currently indexed, and selecting 12 query graphs from all the category files which are indexed, wherein 3 query graphs are randomly selected from each category folder, and if 12 query graphs can be selected, splicing the 16 query graphs of 56 x 56 into one 224 x 224 graph in random sequence; if 12 query graphs cannot be selected, randomly selecting one query graph from the selected query graphs each time by using a data enhancement mode, generating a new query graph with the size of 56 x 56 by a turnover or rotation mode, repeating the data enhancement operation until the total number of the query graphs reaches 16, ending the data enhancement operation, and then splicing the 16 query graphs with the size of 56 x 56 into 224 x 224 pictures, wherein the newly spliced picture is named as a query splicing picture and is named as a query splicing picture

The vit2 is used for extracting the characteristics of a series of corresponding query graphs in the query splice graph, and the vit1 and the vit2 perform weight sharing to further improve the accuracy of the network, and the specific operation is as follows:

(1.2.1) adding a DETR target detection head, which is a functional module for detecting the position of a target in a picture, which can predict the position of each target in a current picture, by which each target can be detected and framed from each picture to be searched, and the coordinates of each target can be obtained.

(1.2.2) data are divided into n groups, each group being

，/>

，/>

Wherein->

Is the firstuZhang Sousuo figure->

Is the firstvZhang Chaxun splice; />

Extraction of features by vit1>

Then, by the DETR target detection head, the +.>

Feature vector of individual object->

Scaling the feature vectors of m targets to the feature dimension of 56 x 384 through an ROI Pooling operation, wherein the corresponding feature vector is +.>

，/>

Extracting features by vit2>

Due to->

。

(1.2.3) feature vectors generated for search graphs

Characteristic vector generated by inquiring splice diagram>

And comparing every two samples, defining positive samples when the two samples belong to the same category, defining negative samples when the two samples do not belong to the same category, and defining a loss function by adopting a cosine distance formula:

(equation 1); />

(equation 2);

(equation 3);

，/>

，/>

to search for a graphuIs the first of (2)δFeature vectors of the target; />

Splice graph for inquiryvIs the first of (2)ηA feature vector; when the input network is positive, the loss is calculated using equation 1, requiring two eigenvectors +.>

The smaller the distance between them, resulting in +.>

The greater the distance between them, resulting in +.>

The smaller the final +.>

The smaller.

(1.3) the features extracted by vit1 are passed through a DETR target detection head (the function of the DETR target detection head is to predict the position of each target in the search map), obtaining one

The extracted features of vit1 and vit2 jointly obtain one

，/>

And->

The combination is carried out through the proportion relation, and the specific operation is as follows:

（1.3.1) Let the output loss of a single target in vit1 through the DETR target detection head be

The result box number is:

(equation 4);

Or->

Region (2)>

Representative area->

Is a part of the area of (2);

(equation 5);

(1.3.3) features of vit1 and vit2 together obtain one

Define benchmark->

The eigenvector of vit1 is defined as +.>

The eigenvector of vit2 is defined as +.>

，/>

，/>

And->

To get as close to 0 as possible, otherwise, let +.>

As close to 1 as possible, +.>

。

(1.3.4) determining the final loss:

wherein->

Can be adjusted according to the requirement; at present we use +.>

Is a value of (a).

(2) Switch setting and zone setting phase:

and (2.1) establishing a new process for the current task, adding the current process ID into a process queue, starting the current process, and preparing to execute the target search task.

(2.2) when the program is started, the camera group of the corresponding region needs to be selected at the front end, assuming that the selection is performedqCorresponding to each camera

And detects whether the target picture has been added +.>

And (5) successfully starting the system when the condition is met.

(3) Model detection processing stage:

Each picture generates a feature vector of +.>

Inside each picture +.>

Target (/ ->

) The corresponding feature vectors are +.>

Through an ROI Pooling operation (scaling the feature vector), scaling to the feature vector +_with the target picture>

The same dimension is +.>

Then, feature vector of target picture +.>

，/>

Recording picture number->

Maximum matching degree of ∈10->

Recording picture number->

(equation 6);

(equation 7);

wherein the method comprises the steps of

Represents the abscissa of the center of the region, +.>

Represents the ordinate of the center of the region,/>

Representing the height of the area>

Representing the area width.

(3.3) setting the threshold toySelecting a value from Map to exceedyPicture sequence number of (2)

Picture order->

Adding to a Result pair column Result, wherein the specific operation is as follows:

(3.3.1) traversing the current Map, for each

If->

Indicating that the current picture is an effective scene picture containing target picture characteristics, recording picture sequence number +.>

。

，/>

And finally, returning a Result pair column Result.

(4) Storage and display stage:

Picture +.>

Is +.>

The camera is +.>

Degree of matching of targets

Coordinates of the object->

Target picture name->

And access address->

Writing into a database.

Corresponding search result information.

According to the method, under the scene of deploying the camera, the searching and displaying of the targets in the monitoring area of the camera can be realized, a twin network is introduced, a mode of vision transformer model and DETR target detecting head is introduced, and the methods of result queue, area loss calculation, matching degree calculation, camera area selection and the like are compatible with the use of various types of visible light cameras, and the method has high robustness, screens and displays the conditions of the current target picture in the monitoring area at the front end, and updates the current target picture in real time.

The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims

1. A transformer-based twin network target search system, comprising the steps of:

(3) Search task initiation, framing by videoScene pictures are obtained from the camera at equal time intervals in a mode, the pictures are detected through a model, each target is detected, the targets are compared with the to-be-searched picture in characteristics, the matching degree of the targets is calculated, the maximum value is taken, and if the matching degree exceeds a set threshold value, the picture number is searched

Adding into a result queue;

2. The transformer-based twin network target search system of claim 1, wherein the specific method of step (1) is as follows:

(1.1) acquisition of n Zhang Sousuo plot

Cut out the inquiry graph to be->

Scaling each query graph from its original size to 56 x 56, denoted +.>

，/>

The value corresponds to each search graph and is marked as +.>

，/>

Each of->

Value corresponding +.>

Class names of all targets exist in the current search graph;

The DETR target detection head is used for predicting the position of each target in the search graph, and the characteristics extracted by the vit1 and the vit2 are obtained togetherGet a +.>

，

And->

The combination is performed by a proportional relationship.

3. The transformer-based twin network target search system of claim 2, wherein in step (1.2): if 12 query graphs cannot be selected, randomly selecting one query graph from the selected query graphs each time by using a data enhancement mode, generating a new query graph with the size of 56 x 56 by a turnover or rotation mode, repeating the data enhancement operation until the total number of the query graphs reaches 16, ending the data enhancement operation, and then splicing the 16 query graphs with the size of 56 x 56 into a 224 x 224 graph, wherein the newly spliced graph is named as a query splice graph and is recorded as

。

4. The method of claim 2, wherein in step (1.2),

(1.2.2) data are divided into n groups, each group being

Wherein->

Is the firstuZhang Sousuo figure->

Is the firstvZhang Chaxun splice; />

Extraction of features by vit1>

Then, by the DETR target detection head, the +.>

Feature vector of individual object->

，/>

Extracting features by vit2>

Due to->

Is formed by splicing 16 query graphs with the size of 56 x 56, namely, the corresponding 16 feature vectors can be obtained by extracting features according to fixed coordinate positions

；

(1.2.3) feature vectors generated for search graphs

Feature vectors generated by querying a mosaic

(equation 1);

(equation 2);

(equation 3);

；/>

to search for a graphuIs the first of (2)δFeature vectors of the target; />

The smaller the distance between them, resulting in +.>

The greater the distance between them, resulting in

The smaller the final +.>

The smaller.

5. The transformer-based twin network target search system of claim 2, wherein the specific method of step (1.3) is as follows:

The result box number is:

(equation 4);

Or->

Region (2)>

Representative area->

Is a part of the area of (2);

(equation 5);

(1.3.3) features of vit1 and vit2 together obtain one

Define benchmark->

The eigenvector of vit1 is defined as +.>

The eigenvector of vit2 is defined as +.>

，/>

，/>

And->

To get as close to 0 as possible, otherwise, let +.>

As close to 1 as possible, +.>

；

(1.3.4) determining the final loss:

wherein->

。

6. The method of claim 2, wherein the specific method of step (2) is as follows:

(2.2) when the program is started, the camera group of the corresponding region needs to be selected at the front end, assuming that the selection is performedqThe cameras are corresponding to each other,