CN111209829B - Vision-based moving vision body static medium-small scale target identification method - Google Patents


Publication number
CN111209829B
CN111209829B (application number CN201911406950.3A)
Authority
CN
China
Prior art keywords
value, threshold, confidence score, short-term memory
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201911406950.3A
Other languages
Chinese (zh)
Other versions
CN111209829A (en)
Inventor
王滔
胡纪远
朱世强
祝义朋
张雲策
Current Assignee
Zhejiang University ZJU
Original Assignee
Zhejiang University ZJU
Priority date
Filing date
Publication date
Application filed by Zhejiang University ZJU filed Critical Zhejiang University ZJU
Priority to CN201911406950.3A priority Critical patent/CN111209829B/en
Publication of CN111209829A publication Critical patent/CN111209829A/en
Application granted granted Critical
Publication of CN111209829B publication Critical patent/CN111209829B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G06V20/00 Scenes; Scene-specific elements
    • G06F18/25 Pattern recognition; Analysing; Fusion techniques
    • G06V10/96 Management of image or video recognition tasks
    • G06V20/10 Terrestrial scenes
    • G06V2201/07 Target detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a vision-based method for identifying static small- and medium-scale targets from a moving viewing body, comprising the following steps. Step S1: obtain bounding-box information. Step S2: create a set of bounding-box objects. Step S3: maintain the short-term memory state of each object. Step S4: when an object has short-term memory, compare its confidence score with a display threshold; if the confidence score is greater than or equal to the display threshold, prepare to display the object's recognition result. Step S5: when an object has no short-term memory, compare its confidence score with a static threshold; if the confidence score is greater than or equal to the static threshold, prepare to display the object's recognition result. Step S6: display the bounding boxes of the objects that meet the condition. Step S7: update the display threshold of all objects that have short-term memory.

Description

Vision-based moving vision body static medium-small scale target identification method
Technical Field
The invention relates to image processing and computer vision, and in particular to a vision-based method for identifying static small- and medium-scale targets from a moving viewing body.
Background
As machine learning evolves and hardware computing power increases, image-based real-time object detection algorithms are advancing rapidly. Methods represented by Faster R-CNN, YOLO, and SSD, which identify objects of various classes in a dataset and mark their positions in the field of view, have made breakthrough progress in this field.
Image-based object detection is one of the main ways in which a moving vision system, such as a robot, senses and understands its environment. As the viewing body (the sensor) moves through the environment, some scenes, such as home environments, contain many static small and medium-sized objects. When recognizing them, the recognition system usually receives not a single frame but a sequence of frames with small changes in viewpoint.
Existing methods have some detection and recognition capability in such environments, but their accuracy and robustness still fall well short of practical requirements. The shortcomings mainly manifest as severe jumps in detection results, unstable recognition scores, and false detections during movement.
Disclosure of Invention
To remedy these defects of the prior art, improve recognition accuracy and robustness, and avoid severe jumps in detection results, unstable recognition scores, and false detections, the invention adopts the following technical scheme:
A vision-based method for identifying static small- and medium-scale targets from a moving viewing body comprises the following steps:
step S1: obtaining bounding-box information, including the number of bounding boxes, the bounding-box coordinates, the bounding-box categories, and the confidence scores;
step S2: creating a set of bounding-box objects for storing and maintaining the bounding-box information;
step S3: maintaining the short-term memory state of each object;
step S4: when an object has short-term memory, comparing its confidence score with a display threshold, and if the confidence score is greater than or equal to the display threshold, preparing to display the object's recognition result; the display threshold is the lower confidence-score limit at which an object with short-term memory qualifies for display;
step S5: when an object has no short-term memory, comparing its confidence score with a static threshold, and if the confidence score is greater than or equal to the static threshold, preparing to display the object's recognition result; the static threshold is the lower confidence-score limit at which an object without short-term memory qualifies for display;
step S6: displaying the bounding boxes of the objects that meet the condition;
step S7: updating the display threshold of all objects that have short-term memory.
Step S2 comprises the following steps:
step S2-1: creating a set of bounding boxes, with each bounding box as an element;
step S2-2: expanding the set by a set difference, namely computing the difference between the bounding-box set of the current frame and the existing bounding-box set, adding that difference to the existing set, and creating an object without short-term memory for each element added by the expansion; the object without short-term memory records the bounding-box information together with the bounding box's consecutive-detection count and consecutive-missed-detection count;
step S2-3: updating the consecutive-detection count and the consecutive-missed-detection count of each object: when an object was just introduced into the current set by the latest frame, both counts are set to zero; otherwise, if the latest frame recognizes the object, the consecutive-detection count is incremented and the consecutive-missed-detection count is reset to zero, and if the latest frame does not detect the object, the consecutive-detection count is reset to zero and the consecutive-missed-detection count is incremented;
step S2-4: deleting an object when its consecutive-missed-detection count reaches or exceeds a certain value, thereby pruning redundant elements from the set, the certain value being an integer greater than 2.
Step S2-3 is realized through a mapping relation between bounding boxes of adjacent frames, established from the bounding-box categories and coordinates. Fusing bounding boxes across multiple frames improves the stability of target identification.
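The set expansion of step S2-2 can be sketched in Python as follows. This is an illustrative sketch, not the patent's implementation: each bounding box is assumed to be keyed by an identifier (the patent itself maps boxes via category and coordinates), and `hits`/`misses` stand for the consecutive-detection and consecutive-missed-detection counts.

```python
def expand_set(existing, current_ids):
    """Step S2-2 sketch: add every current-frame box id not already in
    the existing set (the difference B - A), creating a fresh object
    without short-term memory for each one added."""
    added = []
    for box_id in current_ids:
        if box_id not in existing:
            # A new object starts with zeroed counters and no memory queue.
            existing[box_id] = {"hits": 0, "misses": 0, "memory": None}
            added.append(box_id)
    return added
```

Calling this once per frame keeps the set growing only by genuinely new boxes; pruning (step S2-4) then removes objects whose misses exceed the limit.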
Step S3 comprises the following steps:
step S3-1: for objects without short-term memory, if an object's consecutive-detection count reaches or exceeds a certain value (an integer greater than 2), short-term memory is created for it; creating short-term memory means creating a prior-knowledge queue for the object and resetting its consecutive-missed-detection count;
step S3-2: for objects with short-term memory, if an object's consecutive-missed-detection count reaches or exceeds a certain value (an integer greater than 2), the short-term memory is forgotten; forgetting short-term memory means deleting the object's prior-knowledge queue.
The static threshold ranges between 0.65 and 0.8.
Step S7 comprises the following steps:
step S7-1: placing the confidence score of each object with short-term memory into its prior-knowledge queue, and storing the vertex coordinates of its most recent bounding box;
step S7-2: the prior-knowledge queue comprises a qa queue and a qb queue. The qa queue stores historical confidence scores after comparison with a learning threshold: when an object's confidence score is greater than or equal to the learning threshold, the score is appended to the qa queue; when it is smaller, a 0 is appended instead. The learning threshold is the lower limit for a confidence score to enter the object's prior knowledge. The qb queue stores historical confidence scores after comparison with the display threshold: when a historical confidence score is greater than or equal to the display threshold, it is appended to the qb queue; when it is smaller, a 0 is appended instead;
step S7-3: updating the object's display threshold by a dynamic thresholding method, which weights the object's confidence scores with time-sequence-dependent weights so that the algorithm can adjust the display threshold based on prior recognition data and time-weight parameters.
In step S7-2, before the comparison with the display threshold, the historical confidence scores are filtered: a variable sliding-window mean algorithm smooths confidence scores that fluctuate sharply over time, so that excessive score fluctuation does not disturb the object's recognition result.
In step S7-3, the dynamic thresholding method takes the confidence scores in the qb queue as input data, designs an exponential filtering kernel that decays as a geometric sequence along the time series, establishes a negative correlation between the display threshold and the input data, and bounds the display threshold. The relation is:

threshold = c - (1/norm) * Σ_i a_i * s_i

where threshold is the display threshold, c is the maximum of the dynamic threshold, Σ_i a_i * s_i is the convolution term, norm is a normalization parameter, a is the exponential filtering kernel, s is the input data, and i indexes the historical confidence scores in the qb queue. The display threshold ranges over [c - scope, c]; the normalization parameter confines the convolution term to (0, scope), where scope is the width of the display threshold's adjustment range.
The invention has the following advantages:
The invention fuses bounding boxes across multiple frames through the mapping relation to improve target-identification stability; filtering prevents excessive fluctuation of the confidence score from disturbing the recognition result; and the display threshold is updated by the dynamic thresholding method so that the displayed results are more accurate. By exploiting the redundant information generated as the viewing body moves, the method largely resolves severe jumps in detection results and unstable recognition scores, avoids false detections, improves robustness, and improves the target-detection capability within a single frame.
Drawings
Fig. 1 is a flow chart of the present invention.
FIG. 2 is a flow chart of maintaining bounding box object collection elements in the present invention.
FIG. 3 is a flow chart of maintaining a short-term memory state of a subject in the present invention.
FIG. 4 is a flow chart showing the updating of thresholds by the dynamic thresholding method in the present invention.
FIG. 5 is a graph comparing an actual detection score curve with a dynamic real-time display threshold variation curve in the present invention.
Fig. 6 shows video frames of actual detection results arranged in time series in the present invention.
Detailed Description
The invention is described in detail below with reference to the drawings and the specific embodiments.
As shown in Fig. 1, the vision-based method for identifying static small- and medium-scale targets from a moving viewing body comprises the following steps:
step S1: acquiring the data required by the underlying recognition algorithm, namely the bounding-box information marked in the image field of view, which includes the number of bounding boxes and, for each bounding box, its coordinates, its target category, and its confidence score; this data serves as the input of the multi-frame fusion algorithm. The confidence score is the detection score that the single-frame target detection algorithm outputs for the corresponding bounding box;
step S2: creating a set of bounding-box objects for storing and maintaining the current bounding-box information;
step S3: processing all elements in the set and maintaining the short-term memory state of each object: creating short-term memory for objects that meet the condition but do not yet have it, and clearing the short-term memory of objects that have it but have gone undetected for several consecutive frames;
step S4: when an object has short-term memory, comparing its confidence score with the display threshold at the corresponding moment, and if the confidence score is greater than or equal to that display threshold, preparing to display the object's recognition result; the display threshold is the lower confidence-score limit at which an object with short-term memory qualifies for display;
step S5: when an object has no short-term memory, comparing its confidence score with a static threshold, and if the confidence score is greater than or equal to the static threshold, preparing to display the object's recognition result; the static threshold is the lower confidence-score limit at which an object without short-term memory qualifies for display;
step S6: displaying the bounding boxes of the objects that meet the condition;
step S7: updating the display threshold of all bounding-box objects with short-term memory according to the most recently input data; the implementation comprises two parts, maintaining the score queues and establishing a mapping between the score queues and the display threshold.
As shown in Fig. 2, step S2 comprises the following steps:
step S2-1: arranging the data obtained from the underlying recognition algorithm in the previous step into a set, with each bounding box as an element;
step S2-2: expanding the set by a set difference, namely computing the difference between the bounding-box set of the current frame and the existing bounding-box set and adding it to the existing set. Let A be the set of previously identified bounding boxes and B the bounding-box set of the current input frame; the elements (B - A) are added to the existing set, and an object without short-term memory is created for each element added by the expansion. The object without short-term memory records the basic information of the current bounding box together with its consecutive-detection count and consecutive-missed-detection count;
step S2-3: after the set is expanded, updating the consecutive-detection count and the consecutive-missed-detection count of every object in the set: when an object was just introduced into the current set by the latest frame's bounding boxes, both counts are set to zero; otherwise, if the latest frame recognizes the object, the consecutive-detection count is incremented and the consecutive-missed-detection count is reset to zero, and if the latest frame does not detect the object, the consecutive-detection count is reset to zero and the consecutive-missed-detection count is incremented;
step S2-4: deleting an object when its consecutive-missed-detection count reaches or exceeds a certain value, thereby pruning redundant elements from the set, the certain value being an integer greater than 2.
Step S2-3 is realized through a mapping relation between bounding boxes of adjacent frames, established from the bounding-box categories and coordinates. Bounding boxes are fused across multiple frames to improve the stability of target identification; the mapping relation is the correspondence, across adjacent frames, of bounding boxes that mark the same target.
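One way the adjacent-frame mapping could be realized is sketched below. The patent specifies only that the mapping uses the bounding-box category and coordinates; the IoU overlap test and its 0.5 threshold here are illustrative assumptions, not the patent's prescribed method.

```python
from dataclasses import dataclass

@dataclass
class Box:
    cls: str      # bounding-box category
    x1: float
    y1: float
    x2: float
    y2: float
    score: float  # confidence score

def iou(a: Box, b: Box) -> float:
    # Intersection-over-union of two axis-aligned boxes.
    ix1, iy1 = max(a.x1, b.x1), max(a.y1, b.y1)
    ix2, iy2 = min(a.x2, b.x2), min(a.y2, b.y2)
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a.x2 - a.x1) * (a.y2 - a.y1) + (b.x2 - b.x1) * (b.y2 - b.y1) - inter
    return inter / union if union > 0 else 0.0

def match_boxes(prev, curr, iou_thresh=0.5):
    """Map each current-frame box to the best-overlapping previous-frame
    box of the same category; boxes with no match are treated as new."""
    mapping, new_boxes = {}, []
    for j, b in enumerate(curr):
        best_i, best = None, iou_thresh
        for i, a in enumerate(prev):
            if a.cls == b.cls and i not in mapping.values():
                o = iou(a, b)
                if o >= best:
                    best_i, best = i, o
        if best_i is None:
            new_boxes.append(j)
        else:
            mapping[j] = best_i
    return mapping, new_boxes
```

The `new_boxes` list corresponds to the set difference (B - A) of step S2-2, while `mapping` carries counters and prior knowledge forward for boxes marking the same target.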
As shown in Fig. 3, step S3 comprises the following steps:
step S3-1: for objects in the set without short-term memory, if an object's consecutive-detection count reaches or exceeds a certain value (an integer greater than 2), short-term memory is created for it; creating short-term memory means creating a prior-knowledge queue for the object and resetting its consecutive-missed-detection count;
step S3-2: for objects in the set with short-term memory, if an object's consecutive-missed-detection count reaches or exceeds a certain value (an integer greater than 2), the short-term memory is forgotten; forgetting short-term memory means deleting the object's prior-knowledge queue.
The static threshold ranges between 0.65 and 0.8.
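The counter updates of step S2-3 together with the create/forget logic of steps S3-1 and S3-2 reduce to a small state machine, sketched below. The thresholds of 3 and the queue length of 32 are assumed values; the patent requires only integers greater than 2 and does not fix the queue size.

```python
from collections import deque

CREATE_AFTER = 3   # consecutive detections needed to create short-term memory (assumed)
FORGET_AFTER = 3   # consecutive missed detections before forgetting (assumed)

class BoxObject:
    def __init__(self, info):
        self.info = info       # bounding-box information from the latest frame
        self.hits = 0          # consecutive-detection count
        self.misses = 0        # consecutive-missed-detection count
        self.memory = None     # prior-knowledge queue once short-term memory exists

    def update(self, detected, info=None):
        if detected:
            self.hits += 1
            self.misses = 0
            self.info = info
        else:
            self.hits = 0
            self.misses += 1
        # Step S3-1: create short-term memory after enough consecutive detections.
        if self.memory is None and self.hits >= CREATE_AFTER:
            self.memory = deque(maxlen=32)  # prior-knowledge queue (size assumed)
            self.misses = 0
        # Step S3-2: forget after enough consecutive missed detections.
        if self.memory is not None and self.misses >= FORGET_AFTER:
            self.memory = None
```

An object that keeps being detected graduates to short-term memory; one that disappears for several consecutive frames loses it, matching the maintenance described above.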
As shown in Fig. 4, step S7 comprises the following steps:
step S7-1: placing the confidence score of each object with short-term memory into its prior-knowledge queue, and storing the vertex coordinates of its most recent bounding box for establishing the mapping between bounding boxes of adjacent frames;
step S7-2: the prior-knowledge queue comprises a qa queue and a qb queue. The qa queue stores historical confidence scores after comparison with a learning threshold: when an object's confidence score is greater than or equal to the learning threshold, the score is appended to the qa queue; when it is smaller, a 0 is appended instead. The learning threshold is the lower limit for a confidence score to enter the object's prior knowledge. The qb queue stores historical confidence scores after comparison with the display threshold: when a historical confidence score is greater than or equal to the display threshold at the current moment, it is appended to the qb queue; when it is smaller, a 0 is appended instead. The confidence-score sequences corresponding to different bounding boxes in the field of view are processed separately, taking each same-target bounding box as a unit;
step S7-3: updating the object's display threshold by the dynamic thresholding method, which adjusts the display threshold with a time-sequence weighting mechanism: the object's confidence scores are weighted by time-sequence-dependent weights so that the algorithm can adjust the display threshold in real time based on prior recognition data and time-weight parameters. The qb queue is first convolved with the time weights to obtain a convolved value, a negative correlation between the object's display threshold and this convolved value is then established through a mathematical expression, and finally the object's display threshold is updated.
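The queue gating of step S7-2 reduces to a small helper, sketched below. The learning-threshold value is an assumption; the patent does not fix it numerically.

```python
LEARN_THRESH = 0.5  # learning threshold (assumed value)

def push_score(qa, qb, score, show_thresh):
    """Step S7-2 sketch: append each new confidence score to both
    prior-knowledge queues, replacing sub-threshold scores with 0."""
    qa.append(score if score >= LEARN_THRESH else 0.0)
    qb.append(score if score >= show_thresh else 0.0)
```

The zeros matter: a run of sub-threshold detections weakens the convolution term in step S7-3 and lets the display threshold drift back toward its maximum.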
In step S7-2, before the comparison with the display threshold, the historical confidence scores are filtered: a variable sliding-window mean algorithm smooths confidence scores that fluctuate sharply over time, so that excessive score fluctuation does not disturb the object's recognition result.
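The patent does not define exactly how the sliding window varies; one plausible reading, sketched below, is a mean whose window grows with the available history up to a cap, so that early samples are averaged over however many values exist so far.

```python
def sliding_mean(scores, max_window=5):
    """Smooth a confidence-score sequence with a variable-window mean:
    the window grows with the available history up to max_window."""
    out = []
    for i in range(len(scores)):
        w = min(i + 1, max_window)
        out.append(sum(scores[i - w + 1 : i + 1]) / w)
    return out
```

A spike like 1.0, 0.0, 1.0 is flattened toward its local mean, which is the smoothing behaviour the filtering step calls for.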
In step S7-3, the dynamic thresholding method takes the confidence scores in the qb queue as the input data score_smooth, designs an exponential filtering kernel conv_geo that decays as a geometric sequence along the time series, with decay coefficient c_fee and convolution-kernel size conv_size, establishes a negative correlation between the display threshold and the input data through a functional relationship, and bounds the display threshold. The relation is:

threshold = 0.8 - (1/norm) * Σ_i a_i * s_i

where threshold is the display threshold, 0.8 is the maximum of the dynamic threshold, Σ_i a_i * s_i is the convolution term, norm is a normalization parameter, a is the exponential filtering kernel, s is the filtered input data, and i indexes the bounding box's historical confidence scores in the qb queue. The display threshold ranges over [0.8 - scope, 0.8]; the normalization parameter confines the convolution term to (0, scope), where scope is the width of the display threshold's adjustment range. For example, to keep the display threshold within [0.5, 0.8], the range width scope is 0.3, and the normalization parameter norm should be chosen so that the convolution term lies between 0 and 0.3. As shown in Fig. 5 and Fig. 6, the method shows no missed detections over many consecutive frames.
The present invention is capable of other embodiments, and its details may be modified and varied by those skilled in the art without departing from the spirit and scope of the invention as defined in the appended claims.

Claims (5)

1. A vision-based method for identifying static small- and medium-scale targets from a moving viewing body, characterized by comprising the following steps:
step S1: obtaining bounding-box information, the bounding-box information comprising the number of bounding boxes, the bounding-box coordinates, the bounding-box categories, and the confidence scores;
step S2: creating a set of bounding-box objects for storing and maintaining said bounding-box information, comprising the steps of:
step S2-1: creating a set of bounding boxes, with each bounding box as an element;
step S2-2: expanding the set by a set difference, namely computing the difference between the bounding-box set of the current frame and the existing bounding-box set, adding it to the existing set, and creating an object without short-term memory for each element added by the expansion;
the object without short-term memory records the bounding-box information together with the bounding box's consecutive-detection count and consecutive-missed-detection count;
step S2-3: updating the consecutive-detection count and the consecutive-missed-detection count of each object: when the object was just introduced into the current set by the latest frame, both counts are set to zero; otherwise, if the latest frame recognizes the object, the consecutive-detection count is incremented and the consecutive-missed-detection count is reset to zero, and if the latest frame does not detect the object, the consecutive-detection count is reset to zero and the consecutive-missed-detection count is incremented;
step S2-4: deleting the object if its consecutive-missed-detection count is greater than or equal to a certain value, the certain value being an integer greater than 2;
step S3: maintaining the short-term memory state of each object, comprising the steps of:
step S3-1: judging the objects without created short-term memory; if an object's consecutive-detection count is greater than or equal to a certain value, short-term memory is created for it, the certain value being an integer greater than 2, wherein creating short-term memory is creating a prior-knowledge queue for the object and resetting its consecutive-missed-detection count;
step S3-2: judging the objects with created short-term memory; if an object's consecutive-missed-detection count is greater than or equal to a certain value, the short-term memory is forgotten, the certain value being an integer greater than 2, wherein forgetting the short-term memory is deleting the object's prior-knowledge queue;
step S4: when the object has short-term memory, comparing the object's confidence score with a display threshold, and if the confidence score is greater than or equal to the display threshold, preparing to display the object's recognition result; the display threshold is the lower confidence-score limit at which an object with short-term memory qualifies for display;
step S5: when the object has no short-term memory, comparing the object's confidence score with a static threshold, and if the confidence score is greater than or equal to the static threshold, preparing to display the object's recognition result; the static threshold is the lower confidence-score limit at which an object without short-term memory qualifies for display;
step S6: displaying the bounding boxes of the objects that meet the condition;
step S7: updating the display threshold of all objects having short-term memory, comprising the steps of:
step S7-1: placing the confidence score of each object with short-term memory into the prior-knowledge queue, and storing the vertex coordinates of the most recent bounding box;
step S7-2: the prior-knowledge queue comprises a qa queue and a qb queue, wherein the qa queue stores historical confidence scores after comparison with a learning threshold: when the object's confidence score is greater than or equal to the learning threshold, the score is appended to the qa queue, and when it is smaller, a 0 is appended; the learning threshold is the lower limit for a confidence score to enter the object's prior knowledge; the qb queue stores historical confidence scores after comparison with the display threshold: when a historical confidence score is greater than or equal to the display threshold, it is appended to the qb queue, and when it is smaller, a 0 is appended;
step S7-3: updating the object's display threshold by a dynamic thresholding method, weighting the object's confidence scores with time-sequence-dependent weights so that the algorithm can adjust the display threshold based on prior recognition data and time-weight parameters.
2. The vision-based method for identifying static small- and medium-scale targets from a moving viewing body according to claim 1, wherein step S2-3 is implemented through a mapping relation between bounding boxes of adjacent frames, established from said bounding-box categories and said bounding-box coordinates.
3. The vision-based moving view body static small-medium scale object recognition method according to claim 1, wherein the static threshold value range is between 0.65 and 0.8.
4. The method according to claim 1, wherein in step S7-2 the historical confidence scores are filtered before the comparison with the display threshold, the filtering being a smoothing, by a variable sliding-window mean algorithm, of confidence scores that vary sharply with time.
5. The method for identifying a small-scale object in a static state of a moving view body based on vision according to claim 1, wherein in the step S7-3, the dynamic thresholding method is to use a confidence score value in the qb queue as input data, design an exponential filtering kernel which decays in an equal-ratio array based on a time sequence, establish a negative correlation between a display threshold and the input data, and set a control range of the display threshold, and the correlation expression is as follows:
threshold = c - (Σ_i a_i · s_i) / norm

wherein threshold is the display threshold, c is the maximum value of the dynamic threshold, Σ_i a_i · s_i is the convolution term, norm is the normalization parameter, a is the exponential filtering kernel, s is the input data, and i is the sequence number of a historical confidence score value in the qb queue; the display threshold ranges over [(c - scope), c], the normalization parameter limits the value of the convolution term to the range (0, scope), and scope denotes the width of the adjustment range of the display threshold.
CN201911406950.3A 2019-12-31 2019-12-31 Vision-based moving vision body static medium-small scale target identification method Active CN111209829B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911406950.3A CN111209829B (en) 2019-12-31 2019-12-31 Vision-based moving vision body static medium-small scale target identification method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911406950.3A CN111209829B (en) 2019-12-31 2019-12-31 Vision-based moving vision body static medium-small scale target identification method

Publications (2)

Publication Number Publication Date
CN111209829A CN111209829A (en) 2020-05-29
CN111209829B true CN111209829B (en) 2023-05-02

Family

ID=70787928

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911406950.3A Active CN111209829B (en) 2019-12-31 2019-12-31 Vision-based moving vision body static medium-small scale target identification method

Country Status (1)

Country Link
CN (1) CN111209829B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107564009A (en) * 2017-08-30 2018-01-09 电子科技大学 Outdoor scene Segmentation of Multi-target method based on depth convolutional neural networks
CN110097568A (en) * 2019-05-13 2019-08-06 中国石油大学(华东) A kind of the video object detection and dividing method based on the double branching networks of space-time

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106599827A (en) * 2016-12-09 2017-04-26 浙江工商大学 Small target rapid detection method based on deep convolution neural network
CN107818575A (en) * 2017-10-27 2018-03-20 深圳市唯特视科技有限公司 A kind of visual object tracking based on layering convolution
US20190130191A1 (en) * 2017-10-30 2019-05-02 Qualcomm Incorporated Bounding box smoothing for object tracking in a video analytics system
US20190370551A1 (en) * 2018-06-01 2019-12-05 Qualcomm Incorporated Object detection and tracking delay reduction in video analytics
CN108710868B (en) * 2018-06-05 2020-09-04 中国石油大学(华东) Human body key point detection system and method based on complex scene
CN110321923B (en) * 2019-05-10 2021-05-04 上海大学 Target detection method, system and medium for fusion of different-scale receptive field characteristic layers


Also Published As

Publication number Publication date
CN111209829A (en) 2020-05-29

Similar Documents

Publication Publication Date Title
CN108062531B (en) Video target detection method based on cascade regression convolutional neural network
CN109903331B (en) Convolutional neural network target detection method based on RGB-D camera
CN112836640B (en) Single-camera multi-target pedestrian tracking method
CN111127513A (en) Multi-target tracking method
CN110120065B (en) Target tracking method and system based on hierarchical convolution characteristics and scale self-adaptive kernel correlation filtering
CN111310662B (en) Flame detection and identification method and system based on integrated deep network
WO2004095358A1 (en) Human figure contour outlining in images
CN110838145B (en) Visual positioning and mapping method for indoor dynamic scene
CN110852241B (en) Small target detection method applied to nursing robot
CN113312973B (en) Gesture recognition key point feature extraction method and system
CN112364865B (en) Method for detecting small moving target in complex scene
CN110111370B (en) Visual object tracking method based on TLD and depth multi-scale space-time features
US11367206B2 (en) Edge-guided ranking loss for monocular depth prediction
CN112200056A (en) Face living body detection method and device, electronic equipment and storage medium
CN112528974A (en) Distance measuring method and device, electronic equipment and readable storage medium
CN109544632B (en) Semantic SLAM object association method based on hierarchical topic model
CN109978908B (en) Single-target rapid tracking and positioning method suitable for large-scale deformation
CN111462184A (en) Online sparse prototype tracking method based on twin neural network linear representation model
CN111429485A (en) Cross-modal filtering tracking method based on self-adaptive regularization and high-reliability updating
CN112131991B (en) Event camera-based data association method
CN111209829B (en) Vision-based moving vision body static medium-small scale target identification method
CN110751670A (en) Target tracking method based on fusion
CN110827318A (en) Target tracking method based on fusion of multilayer semantic features and multi-response graph
CN113570713B (en) Semantic map construction method and device for dynamic environment
CN111899284B (en) Planar target tracking method based on parameterized ESM network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant