CN111598078A - Object detection method and system based on sequence optimization - Google Patents

Object detection method and system based on sequence optimization

Info

Publication number
CN111598078A
CN111598078A
Authority
CN
China
Prior art keywords
candidate frames, detection, IOU
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910125642.7A
Other languages
Chinese (zh)
Inventor
董健
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Qihoo Technology Co Ltd
Original Assignee
Beijing Qihoo Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Qihoo Technology Co Ltd filed Critical Beijing Qihoo Technology Co Ltd
Priority to CN201910125642.7A priority Critical patent/CN111598078A/en
Publication of CN111598078A publication Critical patent/CN111598078A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/20 Image preprocessing
    • G06V10/255 Detecting or recognising potential candidate objects based on visual cues, e.g. shapes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/049 Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs


Abstract

The invention relates to the field of object detection, and in particular to a method and system for object detection based on sequence optimization. The method comprises the following steps: detecting an image to obtain candidate frames for the corresponding objects; calculating the overlap (IOU, intersection-over-union) between the candidate frames; clustering the candidate frames based on the overlap IOU and determining which candidate frames belong to the same class; and inputting the candidate frames belonging to the same class, together with their image content, into a neural network to obtain the detection result for the corresponding object. By combining visual feature information with the geometric relationships of the detected objects, the invention optimizes existing object detection techniques and improves the accuracy and correctness of object detection.

Description

Object detection method and system based on sequence optimization
Technical Field
The invention relates to the technical field of object detection, and in particular to a method and system for object detection based on sequence optimization.
Background
In the field of object detection, after multiple candidate frames have been detected for an object, these frames must be fused to find the best detection result. The existing non-maximum suppression (NMS) method is the mainstream way of fusing such frames. It mainly considers the geometric relationships among the detected frames: using a greedy algorithm, it sorts the frames by detection score from high to low, keeps the frame with the highest score, removes all other frames whose overlap (IOU) with that frame exceeds a fixed threshold, and then moves on to the frame with the next highest score. However, this method does not consider the visual information of the image region of the object being detected, so it handles adjacent objects poorly.
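For reference, the classic NMS procedure described above can be sketched as follows. This is an illustrative Python sketch only; the function names and the 0.5 threshold are not taken from the patent.

```python
def iou(a, b):
    """Overlap (intersection-over-union) of two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))  # width of the intersection
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))  # height of the intersection
    inter = iw * ih
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: keep the highest-scoring box, drop boxes overlapping it
    beyond a fixed threshold, then repeat with the next-highest score.
    Returns indices of kept boxes."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) <= iou_threshold]
    return keep
```

Note that the fixed `iou_threshold` is exactly the limitation the Background points out: it uses only geometry, with no knowledge of the image content inside the boxes.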
Therefore, there is a need to optimize the existing non-maximum suppression process to improve the accuracy and correctness of object detection.
Disclosure of Invention
In view of the above, the present invention has been developed to provide a method and system for object detection based on sequence optimization that overcomes the above drawbacks or at least partially solves the above problems.
In a first aspect, the present invention provides a method for object detection based on sequence optimization, comprising: detecting an image to obtain candidate frames for the corresponding objects; calculating the overlap (IOU) of the candidate frames; clustering the candidate frames based on the overlap IOU and determining the candidate frames belonging to the same class; and inputting the candidate frames belonging to the same class, together with their image content, into a neural network to obtain the detection result for the corresponding object.
Detecting the image to obtain candidate frames for the corresponding objects specifically includes: detecting foreground and/or background objects in the image with an object detection algorithm, and obtaining candidate frames for the corresponding objects, where each candidate frame has its own confidence score and position information.
Calculating the overlap (IOU) of the candidate frames specifically includes: ranking the candidate frames of a given object by confidence score and selecting the frame with the highest score; calculating the area of each of the object's candidate frames from their position information; and, from these areas, calculating the overlap IOU between the highest-scoring candidate frame and each of the object's other candidate frames.
Clustering the candidate frames based on the overlap IOU to determine the candidate frames belonging to the same class specifically includes: clustering all candidate frames in the detected image based on the overlap IOU, and outputting the sets of candidate frames grouped into the same class.
Clustering all candidate frames in the detected image based on the overlap IOU specifically includes: using a graph clustering algorithm with each candidate frame as a vertex, building a matrix of pairwise distances between vertices from the overlap IOU as the clustering input, and obtaining the sets of vertices grouped into the same class, where each set of vertices is a set of corresponding candidate frames.
Inputting the candidate frames belonging to the same class into the neural network to obtain the detection result for the corresponding object specifically includes: converting the image content of each candidate frame in a same-class set into a feature vector, inputting the feature vectors into a trained neural network, determining through a matching algorithm which candidate-frame images correspond to correctly detected objects, and outputting the corresponding candidate frames as the detection results for those objects.
The neural network is, specifically, an RNN model or an LSTM model.
In a second aspect, the present invention further provides a system for object detection based on sequence optimization, comprising: a detection module for detecting an image to obtain candidate frames for the corresponding objects; an overlap (IOU) calculation module for calculating the overlap IOU of the candidate frames; a clustering module for clustering the candidate frames based on the overlap IOU and determining the candidate frames belonging to the same class; and a detection result determination module for inputting the candidate frames belonging to the same class, together with their image content, into the neural network to obtain the detection result for the corresponding object.
The detection module is specifically configured to: detect foreground and/or background objects in the image with an object detection algorithm, and obtain candidate frames for the corresponding objects, where each candidate frame has its own confidence score and position information.
The overlap (IOU) calculation module is specifically configured to: rank the candidate frames of a given object by confidence score and select the frame with the highest score; calculate the area of each of the object's candidate frames from their position information; and, from these areas, calculate the overlap IOU between the highest-scoring candidate frame and each of the object's other candidate frames.
The clustering module is specifically configured to: cluster all candidate frames in the detected image based on the overlap IOU, and output the sets of candidate frames grouped into the same class.
The clustering module is further configured to: use a graph clustering algorithm with each candidate frame as a vertex, build a matrix of pairwise distances between vertices from the overlap IOU as the clustering input, and obtain the sets of vertices grouped into the same class, where each set of vertices is a set of corresponding candidate frames.
The detection result determination module is specifically configured to: convert the image content of each candidate frame in a same-class set into a feature vector, input the feature vectors into a trained neural network, determine through a matching algorithm which candidate-frame images correspond to correctly detected objects, and output the corresponding candidate frames as the detection results for those objects.
The neural network is, specifically, an RNN model or an LSTM model.
In a third aspect, the present invention also provides a computing device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor performing the above-mentioned method steps.
In a fourth aspect, the invention also provides a computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the above-mentioned method steps.
One or more technical solutions in the embodiments of the present invention have at least the following technical effects or advantages:
the object detection scheme based on sequence optimization improves on the non-maximum suppression approach of fusing candidate frames using only the geometric positions of the frames and a fixed threshold. By incorporating the visual information of the region each candidate frame covers, such as image content (semantics and object segmentation), the sequence-optimized non-maximum suppression handles overlapping adjacent objects better and improves the accuracy and correctness of object detection.
Drawings
Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the invention. Also, like reference numerals are used to refer to like parts throughout the drawings. In the drawings:
FIG. 1 is a schematic diagram of a detection scene corresponding to an embodiment of the method for object detection based on sequence optimization according to the present invention;
FIG. 2 is a flow chart illustrating the steps of an embodiment of the method for object detection based on sequence optimization according to the present invention;
FIG. 3 is a block diagram showing a schematic structure of an embodiment of the system for object detection based on sequence optimization according to the present invention;
FIG. 4 is a block diagram illustrating a schematic structure of a computing device in an embodiment of the invention.
Detailed Description
Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
Example one
Referring to fig. 1, an application scenario of an embodiment of the present invention is illustrated. Object detection is applied in many settings, including but not limited to real-time and non-real-time processing of videos and images from autonomous driving, dash cams, live streaming, short videos, and the like. The foreground and background of the captured images are typically very complex, and especially when special effects must be inserted onto a particular object in the image, that object needs to be detected accurately and precisely. A typical process is as follows: detect the image at the top left of fig. 1 and obtain one or more candidate frames for each target object (top right of fig. 1); filter by the confidence scores of the detected candidate frames, removing those with particularly low scores, so that one or more candidate frames remain for each object (bottom right of fig. 1); then perform non-maximum suppression (NMS) to obtain the final detection result (bottom left of fig. 1).
Specifically, after the candidate frames for object detection are obtained, frames with very low scores are filtered out directly by score, the remaining frames are clustered after their overlap (IOU) is computed, and the images of the candidate frames belonging to the same class are input into an RNN for matching and recognition before the candidate frames serving as detection results are output. A statistical method can be used here to determine which categories (e.g., pedestrians, cyclists, cars, buses, tables, chairs) tend to overlap more heavily; RNN matching and recognition is performed on those categories first and their detection results (candidate frames) are output, then on the categories with less overlap, and finally, among all output detection results, the candidate frame with the highest score is retained as the final output.
Referring to fig. 2, an embodiment of the object detection method based on sequence optimization according to the present invention is shown. The steps of this embodiment include:
s101, detecting the image to obtain a candidate frame of the corresponding object.
In particular, a standard object detection algorithm may be used to detect the objects in the image, including objects in the foreground and/or background. For example, running the detection algorithm Faster R-CNN on the image in the top left corner of fig. 1 yields one or more candidate frames for certain objects in the image, such as the several objects detected in the image shown in the top right corner of fig. 1, each with multiple detected candidate frames.
In particular, each candidate frame has a corresponding confidence score and position information. Specifically, a detected candidate frame has boundary coordinates (e.g., the coordinates of the top left and bottom right corners of its boundary), i.e., its position information, and a confidence score indicating how well the image inside the frame fits the object to be detected. For example, the coordinates and score of frame A may be represented as (x1, y1, x2, y2, score).
Further, the candidate frames with particularly low scores may be removed by filtering on the scores of the detected frames, leaving one or more candidate frames for each object. For example, the clearly overlapping candidate frames of each detected object in the bottom right corner of fig. 1 are the frames remaining after score filtering.
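The frame representation and the score filtering above can be sketched as follows. This is an illustrative Python sketch; the function name and the 0.5 cutoff are assumptions, not the patent's own values.

```python
def filter_candidates(candidates, min_score=0.5):
    """candidates: list of (x1, y1, x2, y2, score) tuples, as in the frame A
    example above; drop frames below an illustrative confidence cutoff."""
    return [c for c in candidates if c[4] >= min_score]
```

After this step only the plausible, typically overlapping, frames per object remain for the IOU and clustering steps that follow.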
In step S102, the overlap (IOU) of the candidate frames is calculated.
Specifically, for each object, the candidate frame with the highest confidence score may be selected from the filtered candidate frames according to their scores. For example, for an object 1 detected in the image, after filtering out candidate frames with scores below 0.5, frames A, B, C, D, E and F of object 1 remain, with scores of, say, 0.9, 0.8, 0.82, 0.79, 0.78 and 0.75 respectively. Sorting by score, frame A has the highest score and is selected.
Further, the areas of all the object's candidate frames are calculated from their position information. Specifically, from the boundary coordinates of each candidate frame, its area can be calculated, giving the areas of all the frames as delimited by their boundaries, for example areas SA to SF of frames A to F.
Further, from these areas, the overlap (IOU) between the object's highest-scoring candidate frame and each of its other candidate frames is calculated. Specifically, the IOU of two candidate frames is the area of their intersection divided by the area of their union. For example, the IOU of frames A and B is the intersection of SA and SB divided by their union; frames B to F each compute their overlap IOU with frame A in this way, determining how strongly frame A overlaps each of the other frames.
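The computation of step S102 can be sketched in vectorized form as follows. This is an illustrative Python/NumPy sketch; `iou_matrix` is an assumed name, not the patent's. A single row of the result gives, for instance, the overlaps of frame A with frames B to F.

```python
import numpy as np

def iou_matrix(boxes):
    """boxes: (N, 4) array of (x1, y1, x2, y2); returns the (N, N) matrix of
    pairwise overlaps (IOU), computed as intersection area / union area."""
    x1, y1, x2, y2 = boxes[:, 0], boxes[:, 1], boxes[:, 2], boxes[:, 3]
    areas = (x2 - x1) * (y2 - y1)                 # SA, SB, ... per frame
    # Pairwise intersection width and height, clipped at zero for disjoint frames.
    iw = np.maximum(0.0, np.minimum(x2[:, None], x2[None, :])
                    - np.maximum(x1[:, None], x1[None, :]))
    ih = np.maximum(0.0, np.minimum(y2[:, None], y2[None, :])
                    - np.maximum(y1[:, None], y1[None, :]))
    inter = iw * ih
    union = areas[:, None] + areas[None, :] - inter
    return inter / np.maximum(union, 1e-9)        # guard against zero union
```

The same matrix can then serve directly as the pairwise-distance input of the clustering in step S103.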
Step S103, clustering the candidate frames based on the overlap IOU and determining the candidate frames belonging to the same class.
Specifically, all candidate frames in the detected image may be clustered based on the overlap IOU. Further, a graph clustering algorithm may be used with each candidate frame as a vertex: a matrix of pairwise distances between vertices, derived from the overlap IOU, serves as the clustering input, and each set of vertices grouped into the same class is obtained, where each set of vertices is a set of corresponding candidate frames.
For example, the graph clustering algorithm may be a non-hierarchical method such as spectral clustering. The candidate frames of all objects detected in the image are taken as vertices, and the weights of the undirected edges between candidate frames (i.e., between vertices) are determined from the overlap IOU computed in step S102; related vertices are connected by undirected edges, unrelated vertices are not. From the vertices, edges, and edge weights of this weighted undirected graph, the matrices used in the cluster analysis are determined, and the graph is then cut to output sets of vertices that each form one class, i.e., sets of corresponding candidate frames; for example, frames A to E form one class while frames F and G form another, the two groups belonging to different sets. Thus, when all detected candidate frames are clustered with the graph clustering algorithm, the matrix of pairwise vertex distances determined from the overlap IOU is used as input (a large IOU means two frames are very close), fed into the cluster analysis model, and the clustering result, i.e., which vertices belong to the same class, is obtained. Vertices in the same class indicate that the candidate frames they represent are grouped into the same class.
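The text names spectral clustering as one option. As a simpler stand-in built on the same IOU-derived affinity, candidate frames can be grouped by graph connectivity: frames whose pairwise IOU exceeds a threshold share an undirected edge, and connected components become the classes. This is only an illustrative sketch, not the patent's algorithm, and the 0.5 threshold is an assumption.

```python
def cluster_boxes(iou, threshold=0.5):
    """iou: N x N matrix (nested lists) of pairwise overlaps; returns the
    groups of frame indices that end up in the same class, via union-find
    over the edges whose IOU exceeds the threshold."""
    n = len(iou)
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if iou[i][j] > threshold:      # strongly overlapping frames are "related"
                parent[find(i)] = find(j)

    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(i)
    return list(clusters.values())
```

A spectral method would instead build the graph Laplacian from this affinity matrix and cut the graph via its eigenvectors; the input and the output (sets of same-class frames) are the same.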
Step S104, inputting the candidate frames belonging to the same class, together with their images, into a neural network to obtain the detection result for the corresponding object.
Specifically, the candidate frames belonging to the same class and their image content may be input into a trained neural network. The neural network may be, for example but not limited to, an LSTM model or an RNN model.
Taking the RNN (Recurrent Neural Network) model as an example, it can be used to process sequence data: the connections between the nodes of the model form a directed graph along the sequence, which allows it to represent dynamic temporal behavior. The image data can be semantically segmented (at the pixel level) and the segmented content converted into feature vectors, so that the image inside each candidate frame is converted into a feature vector and input into the trained RNN model.
Specifically, continuing the example above, training the RNN model may use, for example, the Hungarian matching algorithm to compute which image detection results are correct and which are wrong, and these labels then guide the RNN training. For example, samples of the category corresponding to each label are accumulated and input into the RNN model for classification training, yielding the trained RNN model.
Furthermore, after the RNN model has recognized the image content of the input candidate frames, it can be confirmed which contents are correct detection results and which are false ones, so that the candidate frame of a correct image is taken as the detection result of the corresponding object, while the frames of false images are discarded. In the end, the candidate frames that remain are output as the detection results of the corresponding objects. For example, the images a to e of candidate frames A to E grouped into the same class are input into the RNN model. After the matching operation, the feature vector of image a yields a probability of 90% for the label of object W1 and below 10% for the labels of the other objects Wn, while the feature vector of image b yields 60% for object W1, 95% for object W2, and lower probabilities for the other objects Wn. If the probability threshold for each label is set to 85%, image a is determined to be the correct detection of object W1 and image b the correct detection of object W2; candidate frame A is output for object W1 and candidate frame B for object W2, while images a and b are discarded as false detections with respect to the other labels Wn.
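The shape of this per-frame decision can be sketched with a minimal vanilla RNN. Everything here is an assumption for illustration: the feature and hidden sizes, the weight names, and the randomly initialized (untrained) weights; the patent assumes a model already trained as described above, so the probabilities below carry no meaning. Only the 85% threshold is taken from the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Untrained, randomly initialized weights -- purely illustrative.
D, H = 8, 16                       # assumed feature and hidden sizes
Wx = rng.normal(0, 0.1, (H, D))    # input-to-hidden weights
Wh = rng.normal(0, 0.1, (H, H))    # hidden-to-hidden (recurrent) weights
Wo = rng.normal(0, 0.1, (1, H))    # hidden-to-output weights

def rnn_scores(features):
    """features: sequence of per-frame feature vectors from one same-class
    cluster; returns one 'correct detection' probability per frame."""
    h = np.zeros(H)
    probs = []
    for x in features:
        h = np.tanh(Wx @ x + Wh @ h)                    # recurrent update
        probs.append(float(1.0 / (1.0 + np.exp(-(Wo @ h)[0]))))  # sigmoid
    return probs

def keep_frames(frames, features, threshold=0.85):
    """Keep the frames whose probability clears the 85% threshold of the
    example; the rest are discarded as false detections."""
    return [f for f, p in zip(frames, rnn_scores(features)) if p >= threshold]
```

In practice the per-frame output would be a distribution over object labels W1..Wn rather than a single probability, with the same thresholding applied per label.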
The candidate frames corresponding to the recognized images are output in turn, giving the candidate frames of the object detection result.
Further, the candidate frames of all the clustered categories are matched through the RNN model and output as object-detection-result candidate frames (the output for a given object may not be unique; for example, object W1 may output both frame A and frame R). The lower-scoring ones are then dropped and the highest-scoring one retained, determining the unique candidate frame for each object, which is taken as the final detection result. In this way, every object of every class obtains a more accurate detection result.
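This final per-object selection can be sketched as follows. The function name and the (object_id, frame, score) tuple layout are illustrative assumptions.

```python
def best_frame_per_object(detections):
    """detections: list of (object_id, frame, score) candidates surviving the
    RNN matching step; keep only the single highest-scoring frame per object."""
    best = {}
    for obj, frame, score in detections:
        if obj not in best or score > best[obj][1]:
            best[obj] = (frame, score)
    return {obj: frame for obj, (frame, score) in best.items()}
```

For the W1 example above, frames A and R would be reduced to whichever has the higher score.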
Thus, for object detection in an image, duplicate detections are removed from the initial detection result (the candidate frames obtained by the object detector) while taking the content of the candidate frames into account, through IOU-based cluster analysis and RNN model recognition, thereby optimizing the non-maximum suppression processing. By considering both the geometric (positional) relationships of the detections and their visual feature information, the accuracy and correctness of detection results in images with complex foregrounds and/or backgrounds can be improved more effectively.
Example two
Based on the same inventive concept, a second embodiment of the present invention provides a system for object detection based on sequence optimization, as shown in fig. 3, including:
a detection module 301, configured to detect an image to obtain a candidate frame of a corresponding object;
an overlap degree IOU calculation module 302, configured to calculate an overlap degree IOU of the candidate frame;
the clustering module 303 is configured to cluster the candidate frames based on the degree of coincidence IOU, and determine candidate frames belonging to the same class;
and a detection result determining module 304, configured to input the candidate frames belonging to the same class and the images thereof into the neural network, so as to obtain a result of detection of the corresponding object.
The detection module 301 may, in particular, use a standard object detection algorithm to detect the objects in the image, including objects in the foreground and/or background. For example, running the detection algorithm Faster R-CNN on the image in the top left corner of fig. 1 yields one or more candidate frames for certain objects in the image, such as the several objects detected in the image shown in the top right corner of fig. 1, each with multiple detected candidate frames.
In particular, each candidate frame has a corresponding confidence score and position information. Specifically, a detected candidate frame has boundary coordinates (e.g., the coordinates of the top left and bottom right corners of its boundary), i.e., its position information, and a confidence score indicating how well the image inside the frame fits the object to be detected. For example, the coordinates and score of frame A may be represented as (x1, y1, x2, y2, score).
Further, the candidate frames with particularly low scores may be removed by filtering on the scores of the detected frames, leaving one or more candidate frames for each object. For example, the clearly overlapping candidate frames of each detected object in the bottom right corner of fig. 1 are the frames remaining after score filtering.
The overlap (IOU) calculation module 302 may, specifically, sort the filtered candidate frames of each object by confidence score and select the frame with the highest score. For example, for an object 1 detected in the image, after filtering out candidate frames with scores below 0.5, frames A, B, C, D, E and F of object 1 remain, with scores of, say, 0.9, 0.8, 0.82, 0.79, 0.78 and 0.75 respectively. Sorting by score, frame A has the highest score and is selected.
Further, the areas of all the object's candidate frames are calculated from their position information. Specifically, from the boundary coordinates of each candidate frame, its area can be calculated, giving the areas of all the frames as delimited by their boundaries, for example areas SA to SF of frames A to F.
Further, from these areas, the overlap (IOU) between the object's highest-scoring candidate frame and each of its other candidate frames is calculated. Specifically, the IOU of two candidate frames is the area of their intersection divided by the area of their union. For example, the IOU of frames A and B is the intersection of SA and SB divided by their union; frames B to F each compute their overlap IOU with frame A in this way, determining how strongly frame A overlaps each of the other frames.
The clustering module 303 may specifically cluster all candidate frames in the detected image based on the coincidence degree IOU. Further, a graph clustering algorithm may be used: each candidate frame serves as a vertex, a matrix of pairwise distances between vertices is determined from the coincidence degree IOU and used as the clustering input, and each set of vertices clustered into the same class is obtained, where each set of vertices corresponds to a set of candidate frames.
For example, the graph clustering algorithm may be a non-hierarchical method such as spectral clustering. All candidate frames of all objects detected in the image are taken as vertices, and the weights of the undirected edges (i.e. the connecting lines) between candidate frames (i.e. between vertices) are determined from the inter-frame coincidence degree IOU calculated by the module 302; related vertices are joined by an undirected edge, while unrelated vertices are not. The matrices used in the cluster analysis are determined from the vertices, edges and edge weights of this undirected weighted graph, the graph is then partitioned, and each set of vertices belonging to the same class is output, i.e. the set of candidate frames corresponding to those vertices. For example, frames A to E form one class and frames F to G another, the two groups belonging to different sets. Thus, when all detected candidate frames are clustered by the graph clustering algorithm, the matrix of pairwise distances determined from the coincidence degree IOU (a larger IOU corresponds to a smaller distance, i.e. closer vertices) is input into the cluster analysis model, and the clustering algorithm outputs which vertices belong to the same class. Vertices of the same class indicate that the candidate frames they represent are grouped into the same class.
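A minimal stand-in for the graph construction above can be sketched as follows. Note that the patent describes spectral clustering; connected components over a thresholded IOU graph are used here only to illustrate how the vertices and edges are formed, and the 0.5 threshold is an assumption:

```python
# Simplified stand-in for the graph clustering step: each box is a vertex,
# vertices whose IOU exceeds a threshold are joined by an edge, and the
# connected components (found via union-find) serve as clusters.
def cluster_boxes(iou_matrix, threshold=0.5):
    n = len(iou_matrix)
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if iou_matrix[i][j] >= threshold:  # related vertices get an edge
                parent[find(i)] = find(j)

    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())

# Boxes 0/1 overlap each other, as do boxes 2/3; the two pairs are disjoint.
iou_m = [[1.0, 0.6, 0.0, 0.0],
         [0.6, 1.0, 0.0, 0.0],
         [0.0, 0.0, 1.0, 0.7],
         [0.0, 0.0, 0.7, 1.0]]
clusters = cluster_boxes(iou_m)
```

A spectral clustering implementation would instead build an affinity matrix from the IOU values and partition the graph by its Laplacian eigenvectors; the grouping of heavily overlapping boxes is the same idea.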
The detection result determining module 304 may specifically input the candidate frames belonging to the same class, together with their image contents, into a trained neural network. The neural network may be, for example but not limited to, an LSTM model or an RNN model.
Taking the RNN (Recurrent Neural Network) model as an example below, it can process sequence data: the connections between nodes in the model form a directed graph along the sequence, which allows it to represent the dynamic temporal behavior of a time series. The image data may be semantically segmented (at the pixel level) and the segmented contents converted into feature vectors, so that the image inside each candidate frame is converted into a feature vector and input into the trained RNN model.
Specifically, continuing the above example, training of the RNN model may use, for example, the Hungarian matching algorithm to determine which are correct image detection results and which are wrong, and these labels then guide the training of the RNN model. For example, samples of the category corresponding to each label are accumulated and input into the RNN model for classification training, yielding the trained RNN model.
Furthermore, after the RNN model identifies the image contents inside the input candidate frames, it can be confirmed which image contents are correct detection results and which are false detections, so that the candidate frame of a correctly identified image is taken as the detection result of the corresponding object, while the candidate frame of a falsely detected image is discarded. In this way it is finally determined which candidate frames remain to be output as the detection result of the corresponding object. For example, the images a to e of the candidate frames A to E grouped into the same class are input into the RNN model. After the matching operation in the RNN model, the feature vector of image a yields a probability of 90% for the label of a certain object W1 and below 10% for the labels of the other objects Wn, while the feature vector of image b yields a probability of 95% for the label of object W2 and low probabilities for the other labels. Assuming the probability threshold of each label is set to 85%, image a is determined to be the accurate detection result of object W1 and image b the accurate detection result of object W2, so candidate frame A corresponding to object W1 and candidate frame B corresponding to object W2 are output, while images whose probabilities fall below the threshold for every label are discarded as false detection results.
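The verification step above can be sketched as follows. This is an illustrative sketch: the per-candidate label probabilities are hard-coded here (in the patent they come from the trained RNN model), and the 85% threshold is the one from the example:

```python
# Illustrative sketch of the final verification step: keep a candidate as a
# detection result only when its probability for some object label exceeds
# the threshold; otherwise discard it as a false detection.
def verify_candidates(probs, threshold=0.85):
    """probs: {candidate: {object_label: probability}}"""
    results = {}
    for cand, label_probs in probs.items():
        label, p = max(label_probs.items(), key=lambda kv: kv[1])
        if p >= threshold:
            results[cand] = label  # accurate detection of this object
        # candidates below the threshold for every label are dropped
    return results

probs = {"A": {"W1": 0.90, "Wn": 0.10},
         "B": {"W2": 0.95, "Wn": 0.05},
         "C": {"W1": 0.60, "Wn": 0.40}}  # C is a false detection
```

With the example's numbers, frames A and B are kept as detections of W1 and W2, while frame C is discarded.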
The candidate frames corresponding to the identified images are output in turn, giving the candidate frames of the object detection result.
Further, the candidate frames of all clustered classes are matched by the RNN model and output as object detection result candidate frames (the output for a class may not be unique; for example, object W1 may output frame A, frame R, etc.). The low-scoring frames are then discarded according to score, and the highest-scoring frame is determined as the unique candidate frame of each object and taken as the final detection result. In this way, every detected object of every class obtains a single, more accurate detection result.
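The final per-object selection can be sketched as follows (an illustrative sketch with assumed names; the scores reuse the earlier example):

```python
# Illustrative sketch: when several verified boxes survive for the same
# object (e.g. W1 -> box A and box R), keep only the highest-scoring one
# as the object's final detection result.
def unique_per_object(verified):
    """verified: {object_label: [(box, score), ...]}"""
    return {obj: max(boxes, key=lambda b: b[1])
            for obj, boxes in verified.items()}

verified = {"W1": [("A", 0.9), ("R", 0.7)], "W2": [("B", 0.85)]}
```

Each object thus ends up with exactly one candidate frame as its detection result.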
Thus, for object detection in an image, object segmentation is performed (object target detection yields the candidate frames and the IOU) and, at the same time, pixel segmentation is performed (through the IOU-based cluster analysis and RNN model identification), optimizing the non-maximum suppression processing. By considering both the geometric (positional) relations of object detection and the visual feature information, the accuracy and correctness of the detection result can be effectively improved for objects in complex image foregrounds and/or backgrounds.
EXAMPLE III
Based on the same inventive concept, a third embodiment of the present invention provides a computing device, as shown in fig. 4, including a memory 404, a processor 402, and a computer program stored on the memory 404 and executable on the processor 402, wherein the processor 402 implements the above-mentioned object detection method steps based on sequence optimization when executing the program.
Where in fig. 4 a bus architecture (represented by bus 400) is shown, bus 400 may include any number of interconnected buses and bridges, and bus 400 links together various circuits including one or more processors, represented by processor 402, and memory, represented by memory 404. The bus 400 may also link together various other circuits such as peripherals, voltage regulators, power management circuits, and the like, which are well known in the art, and therefore, will not be described any further herein. A bus interface 406 provides an interface between the bus 400 and the receiver 401 and transmitter 403. The receiver 401 and the transmitter 403 may be the same element, i.e., a transceiver, providing a means for communicating with various other apparatus over a transmission medium. The processor 402 is responsible for managing the bus 400 and general processing, while the memory 404 may be used for storing data used by the processor 402 in performing operations.
Example four
Based on the same inventive concept, a fourth embodiment of the present invention provides a computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, performs the above-mentioned method steps for object detection based on sequence optimization.
The algorithms and displays presented herein are not inherently related to any particular computer, virtual machine, or other apparatus. Various general purpose systems may also be used with the teachings herein. The required structure for constructing such a system will be apparent from the description above. Moreover, the present invention is not directed to any particular programming language. It is appreciated that a variety of programming languages may be used to implement the teachings of the present invention as described herein, and any descriptions of specific languages are provided above to disclose the best mode of the invention.
In the description provided herein, numerous specific details are set forth. It is understood, however, that embodiments of the invention may be practiced without these specific details. In some instances, well-known methods, structures and techniques have not been shown in detail in order not to obscure an understanding of this description.
Similarly, it should be appreciated that in the foregoing description of exemplary embodiments of the invention, various features of the invention are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the disclosure and aiding in the understanding of one or more of the various inventive aspects. However, the disclosed method should not be interpreted as reflecting an intention that the claimed invention requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in less than all features of a single foregoing disclosed embodiment. Thus, the claims following the detailed description are hereby expressly incorporated into this detailed description, with each claim standing on its own as a separate embodiment of this invention.
Those skilled in the art will appreciate that the modules in the device in an embodiment may be adaptively changed and disposed in one or more devices different from the embodiment. The modules or units or components of the embodiments may be combined into one module or unit or component, and furthermore they may be divided into a plurality of sub-modules or sub-units or sub-components. All of the features disclosed in this specification (including any accompanying claims, abstract and drawings), and all of the processes or elements of any method or apparatus so disclosed, may be combined in any combination, except combinations where at least some of such features and/or processes or elements are mutually exclusive. Each feature disclosed in this specification (including any accompanying claims, abstract and drawings) may be replaced by alternative features serving the same, equivalent or similar purpose, unless expressly stated otherwise.
Furthermore, those skilled in the art will appreciate that while some embodiments herein include some features included in other embodiments, rather than other features, combinations of features of different embodiments are meant to be within the scope of the invention and form different embodiments. For example, in the following claims, any of the claimed embodiments may be used in any combination.
The various component embodiments of the invention may be implemented in hardware, or in software modules running on one or more processors, or in a combination thereof. It will be appreciated by those skilled in the art that a microprocessor or Digital Signal Processor (DSP) may be used in practice to implement some or all of the functions of some or all of the components of the sequence-optimization-based object detection apparatus, server, etc. according to embodiments of the present invention. The present invention may also be embodied as apparatus or device programs (e.g., computer programs and computer program products) for performing a portion or all of the methods described herein. Such programs implementing the present invention may be stored on computer-readable media or may be in the form of one or more signals. Such a signal may be downloaded from an internet website or provided on a carrier signal or in any other form.
It should be noted that the above-mentioned embodiments illustrate rather than limit the invention, and that those skilled in the art will be able to design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention may be implemented by means of hardware comprising several distinct elements, and by means of a suitably programmed computer. In the unit claims enumerating several means, several of these means may be embodied by one and the same item of hardware. The usage of the words first, second and third, etcetera do not indicate any ordering. These words may be interpreted as names.
The invention discloses A1, an object detection method based on sequence optimization, characterized by comprising the following steps: detecting an image to obtain candidate frames of corresponding objects; calculating the coincidence degree IOU of the candidate frames; clustering the candidate frames based on the coincidence degree IOU, and determining the candidate frames belonging to the same class; and inputting the candidate frames belonging to the same class and the images thereof into a neural network to obtain a detection result of the corresponding object.
A2. The method of claim A1, wherein detecting the image to obtain the candidate frame of the corresponding object specifically comprises: detecting foreground and/or background objects in the image through a target detection algorithm; and obtaining a candidate frame of a corresponding object; wherein each candidate frame has a respective confidence score and position information.
A3. The method of claim A2, wherein calculating the coincidence degree IOU of the candidate frames specifically comprises: selecting the candidate frame with the highest confidence score among the candidate frames of the corresponding object according to the confidence score ranking; calculating the respective areas of all the candidate frames of the corresponding object according to the position information of the candidate frames; and calculating the coincidence degree IOU between the candidate frame with the highest confidence score of the corresponding object and the other candidate frames of the corresponding object according to the areas of the candidate frames.
A4. The method of claim A3, wherein clustering the candidate frames based on the coincidence degree IOU and determining the candidate frames belonging to the same class specifically comprises: clustering all candidate frames in the detected image based on the coincidence degree IOU; and outputting the sets of candidate frames respectively clustered into the same class.
A5. The method of claim A4, wherein clustering all candidate frames in the detected image based on the coincidence degree IOU specifically comprises: using a graph clustering algorithm, taking each candidate frame as a vertex, determining a matrix of pairwise distances between vertices based on the coincidence degree IOU as the clustering input, and obtaining the sets of vertices clustered into the same class, wherein each set of vertices is a set of corresponding candidate frames.
A6. The method of claim A5, wherein inputting the candidate frames belonging to the same class into a neural network to obtain the detection result of the corresponding object specifically comprises: converting the images of the candidate frames in each set of candidate frames clustered into the same class into feature vectors, inputting them into a trained neural network, determining through a matching algorithm which candidate frame images are correctly detected corresponding objects, and outputting the corresponding candidate frames as the detection results of the corresponding objects.
A7. The method of claim A6, wherein the neural network specifically comprises: an RNN model or an LSTM model.
The invention also discloses B8, a system for detecting objects based on sequence optimization, which is characterized by comprising: the detection module is used for detecting the image to obtain a candidate frame of the corresponding object; the coincidence degree IOU calculating module is used for calculating the coincidence degree IOU of the candidate frame; the clustering module is used for clustering the candidate frames based on the contact ratio IOU and determining the candidate frames belonging to the same class; and the detection result determining module is used for inputting the candidate frames belonging to the same class and the images thereof into the neural network so as to obtain the detection result of the corresponding object.
B9. The system of claim B8, wherein the detection module is specifically configured to: detect foreground and/or background objects in the image through a target detection algorithm; and obtain a candidate frame of a corresponding object; wherein each candidate frame has a respective confidence score and position information.
B10. The system of claim B9, wherein the coincidence degree IOU calculating module is specifically configured to: select the candidate frame with the highest confidence score among the candidate frames of the corresponding object according to the confidence score ranking; calculate the respective areas of all the candidate frames of the corresponding object according to the position information of the candidate frames; and calculate the coincidence degree IOU between the candidate frame with the highest confidence score of the corresponding object and the other candidate frames of the corresponding object according to the areas of the candidate frames.
B11. The system of claim B10, wherein the clustering module is specifically configured to: cluster all candidate frames in the detected image based on the coincidence degree IOU; and output the sets of candidate frames respectively clustered into the same class.
B12. The system of claim B11, wherein the clustering module is further configured to: use a graph clustering algorithm, take each candidate frame as a vertex, determine a matrix of pairwise distances between vertices based on the coincidence degree IOU as the clustering input, and obtain the sets of vertices clustered into the same class, wherein each set of vertices is a set of corresponding candidate frames.
B13. The system of claim B12, wherein the detection result determining module is specifically configured to: convert the images of the candidate frames in each set of candidate frames clustered into the same class into feature vectors, input them into a trained neural network, determine through a matching algorithm which candidate frame images are correctly detected corresponding objects, and output the corresponding candidate frames as the detection results of the corresponding objects.
B14. The system of claim B13, wherein the neural network specifically comprises: an RNN model or an LSTM model.
C15, a computing device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the method steps of any of claims A1-A7 when executing the program.
D16, a computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the method steps of any of claims A1-A7.

Claims (10)

1. A method for object detection based on sequence optimization, comprising:
detecting an image to obtain a candidate frame of a corresponding object;
calculating the coincidence degree IOU of the candidate frames;
clustering the candidate frames based on the coincidence degree IOU, and determining the candidate frames belonging to the same class;
and inputting the candidate frames belonging to the same class and the images thereof into a neural network to obtain a detection result of the corresponding object.
2. The method according to claim 1, wherein detecting the image to obtain the frame candidate of the corresponding object specifically comprises:
detecting foreground and/or background objects in the image through a target detection algorithm;
obtaining a candidate frame of a corresponding object; wherein each candidate box has a respective confidence score and location information.
3. The method of claim 2, wherein calculating the coincidence degree IOU of the candidate frames specifically comprises:
selecting a candidate frame with the highest confidence score in the candidate frames of the corresponding objects according to the confidence score ranking;
calculating the respective areas of all the candidate frames of the corresponding object according to the position information of the candidate frames;
and calculating the coincidence degree IOU of the candidate frame with the highest confidence score of the corresponding object and other candidate frames of the corresponding object according to the area of the candidate frame.
4. The method of claim 3, wherein clustering the candidate frames based on the coincidence degree IOU to determine the candidate frames belonging to the same class specifically comprises:
clustering all candidate frames in the detected image based on the coincidence degree IOU;
and outputting a set of candidate frames which are respectively gathered into the same class.
5. The method of claim 4, wherein clustering all candidate frames in the detected image based on the coincidence degree IOU specifically comprises:
and using a graph clustering algorithm, taking each candidate frame as a vertex, determining a matrix of the distance between every two vertexes as clustering input based on the contact degree IOU, and obtaining a set of vertexes which are clustered into the same class, wherein the set of each vertex is a set of corresponding candidate frames.
6. The method according to claim 5, wherein inputting the candidate boxes belonging to the same class into the neural network to obtain the result of the detection of the corresponding object specifically comprises:
and respectively converting images of the candidate frames in the set of the candidate frames gathered into the same class into feature vectors, inputting the feature vectors into a trained neural network, determining which images of the candidate frames are the correct detected corresponding objects through a matching algorithm, and outputting the corresponding candidate frames as the detection results of the corresponding objects.
7. The method of claim 6, wherein the neural network, in particular comprises: RNN model, or LSTM model.
8. A system for sequence-optimization-based object detection, comprising:
the detection module is used for detecting the image to obtain a candidate frame of the corresponding object;
the coincidence degree IOU calculating module is used for calculating the coincidence degree IOU of the candidate frames;
the clustering module is used for clustering the candidate frames based on the coincidence degree IOU and determining the candidate frames belonging to the same class;
and the detection result determining module is used for inputting the candidate frames belonging to the same class and the images thereof into the neural network so as to obtain the detection result of the corresponding object.
9. A computing device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the method steps of any of claims 1-7 when executing the program.
10. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the method steps of any one of claims 1 to 7.
CN201910125642.7A 2019-02-20 2019-02-20 Object detection method and system based on sequence optimization Pending CN111598078A (en)

Publications (1)

Publication number CN111598078A (en), publication date 2020-08-28

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113761245A (en) * 2021-05-11 2021-12-07 腾讯科技(深圳)有限公司 Image recognition method and device, electronic equipment and computer readable storage medium
CN113761245B (en) * 2021-05-11 2023-10-13 腾讯科技(深圳)有限公司 Image recognition method, device, electronic equipment and computer readable storage medium


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination