CN108229289B - Target retrieval method and device and electronic equipment - Google Patents

Target retrieval method and device and electronic equipment

Info

Publication number
CN108229289B
CN108229289B (application CN201710500550.3A)
Authority
CN
China
Prior art keywords
similarity
target
detected
image
feature vectors
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201710500550.3A
Other languages
Chinese (zh)
Other versions
CN108229289A (en)
Inventor
田茂清
伊帅
闫俊杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Sensetime Technology Development Co Ltd
Original Assignee
Beijing Sensetime Technology Development Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Sensetime Technology Development Co Ltd filed Critical Beijing Sensetime Technology Development Co Ltd
Priority to CN201710500550.3A priority Critical patent/CN108229289B/en
Publication of CN108229289A publication Critical patent/CN108229289A/en
Application granted granted Critical
Publication of CN108229289B publication Critical patent/CN108229289B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/46Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/58Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/583Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F16/5838Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using colour
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques

Abstract

The application discloses a target retrieval method and a target retrieval apparatus. One embodiment of the method comprises: acquiring a plurality of image sets, each image set comprising at least one image that contains at least one target to be detected; extracting feature vectors of the targets to be detected contained in at least some of the images in the plurality of image sets; for every two of the extracted feature vectors, determining the similarity between them and the ranking of that similarity within the image sets to which the targets indicated by the two feature vectors belong; and determining, according to the similarities and the similarity ranking information, the probability that the targets indicated by every two feature vectors are the same target. This embodiment realizes a preliminary retrieval of the target, reduces the workload of annotators, and improves the efficiency of target retrieval.

Description

Target retrieval method and device and electronic equipment
Technical Field
The present application relates to the field of computer vision, in particular to image processing, and more particularly to a target retrieval method and apparatus, and an electronic device.
Background
Computer vision uses a computer and related equipment to simulate biological vision; a camera and a computer can be used to acquire data and information about a subject. Target retrieval is an important research direction in computer vision: given an input image of a target, it can find all images of the same target in a large-scale data set; alternatively, a pair of images can be input to obtain a similarity value for that pair.
Target retrieval often requires a large amount of labeled target retrieval data. If all labels are marked manually, the workload of the annotators is heavy and the efficiency is low.
Disclosure of Invention
The present application aims to provide a target retrieval method and apparatus and an electronic device to solve the technical problems mentioned in the above background.
In a first aspect, the present application provides a target retrieval method, including: acquiring a plurality of image sets, each image set comprising at least one image that contains at least one target to be detected; extracting feature vectors of the targets to be detected contained in at least some of the images in the plurality of image sets; for every two of the extracted feature vectors, determining the similarity between them and the ranking of that similarity within the image sets to which the targets indicated by the two feature vectors belong; and determining, according to the similarities and the similarity ranking information, the probability that the targets indicated by every two feature vectors are the same target.
In some embodiments, the above method further comprises: in response to the probability that the targets to be detected indicated by two feature vectors are the same target satisfying a first preset condition, generating prompt information inquiring whether the two feature vectors indicate the same target.
In some embodiments, the first preset condition includes at least one of: the probability is greater than a predetermined probability threshold; the probability falls within a predetermined top proportion when all probabilities are sorted in descending order.
In some embodiments, the image sets are derived from video sources; and the extracting feature vectors of the targets to be detected contained in at least some of the images in the plurality of image sets includes: acquiring identification information of each image set, the identification information comprising first identification information of the video source to which the image set belongs and second identification information of the image set within that video source; comparing the identification information of every two image sets to generate a comparison list; and extracting the feature vectors of the targets to be detected contained in at least some of the images in each image set, and determining the similarity between the feature vectors indicated by each comparison in the comparison list.
In some embodiments, the images in each image set include the same target to be detected; and the above method further comprises: for each comparison in the comparison list, determining whether the first identification information of the two image sets indicated by the comparison is the same; in response to the first identification information of the two image sets being the same, determining whether images with the same generation time exist in the two image sets; and deleting the comparison in response to images with the same generation time existing in the two image sets, so as to optimize the comparison list.
In some embodiments, the above method further comprises: determining the average generation time of each image set according to the generation time of each image in the image set; for each comparison in the comparison list, determining whether the difference between the average generation times of the two image sets indicated by the comparison is greater than a preset duration; and deleting the comparison in response to that difference being greater than the preset duration, so as to optimize the comparison list.
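The optional time-based pruning described above can be sketched as follows. This is an illustrative sketch only, not part of the claims; the set identifiers, timestamps, and the 60-second gap are hypothetical.

```python
def prune_by_time(comparisons, gen_times, max_gap):
    """Drop comparisons whose image sets' average generation times differ
    by more than max_gap (seconds). gen_times maps set id -> list of the
    generation times of the images in that set."""
    def avg(ts):
        return sum(ts) / len(ts)
    return [
        (a, b) for a, b in comparisons
        if abs(avg(gen_times[a]) - avg(gen_times[b])) <= max_gap
    ]

# hypothetical generation times (seconds) for three image sets
gen_times = {"A": [0.0, 10.0], "B": [5.0, 15.0], "C": [600.0, 610.0]}
kept = prune_by_time([("A", "B"), ("A", "C"), ("B", "C")], gen_times, max_gap=60.0)
```

Here only the ("A", "B") comparison survives, since set C's average generation time is far from the other two.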
In some embodiments, the extracting the feature vectors of the targets to be detected contained in at least some of the images in the plurality of image sets includes: extracting, by using a preset first neural network, the feature vector of the target to be detected contained in each of the at least some images in the plurality of image sets.
In some embodiments, the determining the similarity of every two feature vectors and the ranking of the similarity within the image sets to which the targets indicated by the two feature vectors belong includes: determining the similarity between the two feature vectors indicated by each comparison in the comparison list; determining a similarity set for each target to be detected, the similarity set comprising the similarities corresponding to the comparisons in the comparison list that involve that target; and arranging the similarities in each similarity set in descending or ascending order, and determining the similarity ranking information of each target to be detected within the image set to which it belongs.
In some embodiments, the determining, according to the similarities and the ranking information of the similarities, the probability that the to-be-detected target indicated by each two feature vectors is the same target includes: and inputting the similarity between every two feature vectors and ranking information of the to-be-detected targets indicated by the two feature vectors in respective similarity sequence into a preset classifier, and determining the probability that the to-be-detected targets indicated by every two feature vectors are the same target based on the classifier.
In some embodiments, the classifier is pre-established using the following method: acquiring a plurality of pre-labeled image sets, the labeled information indicating whether two targets in the plurality of image sets are the same target; extracting the feature vectors of the targets contained in each image set, and determining the similarity between every two feature vectors; determining the similarity ranking between each feature vector and the other feature vectors according to the similarities between every two feature vectors; determining the rank of every two feature vectors in their respective similarity sequences according to the similarity rankings; and training a classifier according to the labeling information of the image sets, the similarity between every two feature vectors, and the ranks.
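The patent does not specify the classifier type. A minimal sketch of such a classifier — plain logistic regression over (similarity, rank-in-set-A, rank-in-set-B) features — might look as follows; the model choice, training hyperparameters, and toy data are all assumptions for illustration, not part of the patent.

```python
import numpy as np

def train_logistic(X, y, lr=0.1, steps=2000):
    """Gradient-descent logistic regression; a stand-in for the
    unspecified classifier in the text."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))   # predicted probabilities
        g = p - y                                 # gradient of the log-loss
        w -= lr * X.T @ g / len(y)
        b -= lr * g.mean()
    return w, b

def predict_proba(X, w, b):
    return 1.0 / (1.0 + np.exp(-(X @ w + b)))

# toy training data: [similarity, rank in own set, rank in other set] -> same target?
X = np.array([[0.95, 1, 1], [0.90, 1, 2], [0.30, 8, 9], [0.25, 10, 7]])
y = np.array([1.0, 1.0, 0.0, 0.0])
w, b = train_logistic(X, y)
```

A high similarity together with a top rank in both similarity sets should yield a probability near 1; a low similarity with deep ranks should yield a probability near 0.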
In some embodiments, the acquiring a plurality of image sets includes: acquiring a plurality of video sources, each video source comprising at least one image set; detecting at least partial images in the plurality of video sources by using a preset second neural network, and determining a to-be-detected target contained in each image in the at least partial images; and labeling each detected target to be detected to obtain a labeled image set corresponding to each target to be detected.
In some embodiments, the labeling each detected target to be detected to obtain a labeled image set corresponding to each target includes: marking each detected target with a minimum bounding rectangle; and cropping each marked area to obtain a cropped image set corresponding to each target to be detected.
In some embodiments, the acquiring a plurality of image sets further includes: determining the number of cropped images contained in each cropped image set and the following parameters of each cropped image: a first number of pixels along a first direction, a second number of pixels along a second direction, and the ratio of the first number of pixels to the second number of pixels, wherein the first direction and the second direction are the extending directions of two adjacent sides of the minimum bounding rectangle; and selecting, from each cropped image set, a preset number of cropped images satisfying a second preset condition to form a new image set.
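The crop-selection step above can be sketched in a few lines; this is illustrative only, and the second preset condition here (minimum width, minimum height, aspect-ratio range, maximum count) is a hypothetical choice, not one stated in the patent.

```python
def select_crops(crops, min_w=48, min_h=96, ratio_range=(0.3, 0.7), max_count=5):
    """Keep up to max_count crops whose width, height, and width/height
    ratio satisfy a (hypothetical) second preset condition.
    Each crop is a (width_pixels, height_pixels) pair."""
    kept = []
    for w, h in crops:
        r = w / h
        if w >= min_w and h >= min_h and ratio_range[0] <= r <= ratio_range[1]:
            kept.append((w, h))
        if len(kept) == max_count:
            break
    return kept

# hypothetical crop sizes: two pedestrian-shaped crops pass, a tiny and a
# square crop are rejected
crops = [(64, 128), (20, 40), (60, 100), (100, 100)]
selected = select_crops(crops)
```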
In a second aspect, the present application provides a target retrieval apparatus, the apparatus comprising: the image set acquisition unit is used for acquiring a plurality of image sets, wherein each image set comprises at least one image containing at least one object to be detected; a feature vector extraction unit for extracting feature vectors of the target to be detected included in at least part of the images in the plurality of image sets, respectively; the similarity determining unit is used for respectively determining the similarity of every two feature vectors and the similarity ranking information of the similarity in the image set to which the to-be-detected target indicated by the two feature vectors belongs in at least part of the extracted feature vectors; and the probability determining unit is used for determining the probability that the to-be-detected targets indicated by every two feature vectors are the same target according to the similarities and the ranking information of the similarities.
In some embodiments, the above apparatus further comprises: and the prompt information generating unit is used for responding to the condition that the probability that the targets to be detected indicated by the two characteristic vectors are the same target meets a first preset condition, and generating prompt information for inquiring whether the two characteristic vectors indicate the same target.
In some embodiments, the first preset condition includes at least one of: the probability is greater than a predetermined probability threshold; the probability falls within a predetermined top proportion when all probabilities are sorted in descending order.
In some embodiments, the set of images is derived from a video source; and the feature vector extraction unit includes: the identification information acquisition module is used for acquiring identification information of each image set, wherein the identification information comprises first identification information of a video source to which the image set belongs and second identification information of the image set in the video source to which the image set belongs; the comparison list generation module is used for comparing the identification information of every two image sets to generate a comparison list; and the characteristic vector extraction module is used for respectively extracting the characteristic vectors of the target to be detected contained in at least partial images in each image set and determining the similarity between the characteristic vectors indicated by each comparison in the comparison list.
In some embodiments, the images in each image set include the same target to be detected; the feature vector extraction unit further comprises a comparison list optimization module configured to: for each comparison in the comparison list, determine whether the first identification information of the two image sets indicated by the comparison is the same; in response to the first identification information of the two image sets being the same, determine whether images with the same generation time exist in the two image sets; and delete the comparison in response to images with the same generation time existing in the two image sets, so as to optimize the comparison list.
In some embodiments, the feature vector extraction unit further comprises a comparison list optimization module configured to: determine the average generation time of each image set according to the generation time of each image in the image set; for each comparison in the comparison list, determine whether the difference between the average generation times of the two image sets indicated by the comparison is greater than a preset duration; and delete the comparison in response to that difference being greater than the preset duration, so as to optimize the comparison list.
In some embodiments, the feature vector extracting unit is further configured to: and respectively extracting the characteristic vector of the to-be-detected target contained in each image in the at least partial images of the plurality of image sets by using a preset first neural network.
In some embodiments, the similarity determination unit includes: a similarity determining module, configured to determine a similarity between two feature vectors indicated by each comparison in the comparison list; a similarity set determining module, configured to determine a similarity set of each to-be-detected target, where the similarity set includes a similarity corresponding to the comparison of the to-be-detected target included in the comparison list; and the similarity set ranking module is used for arranging the similarities in the similarity set from large to small or from small to large and determining the similarity ranking information of each target to be detected in the image set to which the target to be detected belongs.
In some embodiments, the probability determination unit is further configured to: and inputting the similarity between every two feature vectors and ranking information of the to-be-detected targets indicated by the two feature vectors in respective similarity sequence into a preset classifier, and determining the probability that the to-be-detected targets indicated by every two feature vectors are the same target based on the classifier.
In some embodiments, the classifier is pre-established by a classifier establishing unit, and the classifier establishing unit includes: the system comprises an annotated image set acquisition module, a storage module and a display module, wherein the annotated image set acquisition module is used for acquiring a plurality of pre-annotated image sets, and annotated information is used for indicating whether two targets in the plurality of image sets are the same target or not; the similarity determining module is used for extracting the feature vectors of the targets contained in each image set and determining the similarity between every two feature vectors; the similarity ranking determining module is used for determining similarity ranking between each feature vector and other feature vectors according to the similarity between every two feature vectors; the ranking determining module is used for determining the ranking of each two feature vectors in the respective similarity ranking according to the similarity ranking; and the training module is used for training the classifier according to the labeling information of the image sets, the similarity between every two feature vectors and the ranking.
In some embodiments, the image set acquiring unit includes: the system comprises a video source acquisition module, a video source acquisition module and a video source acquisition module, wherein the video source acquisition module is used for acquiring a plurality of video sources, and each video source comprises at least one image set; the target detection module is used for detecting at least partial images in the plurality of video sources by utilizing a preset second neural network and determining a target to be detected contained in each image in the at least partial images; and the target labeling module is used for labeling each detected target to be detected to obtain a labeled image set corresponding to each target to be detected.
In some embodiments, the target labeling module is further configured to: mark each detected target to be detected with a minimum bounding rectangle; and crop each marked area to obtain a cropped image set corresponding to each target to be detected.
In some embodiments, the target labeling module is further configured to: determine the number of cropped images contained in each cropped image set and the following parameters of each cropped image: a first number of pixels along a first direction, a second number of pixels along a second direction, and the ratio of the first number of pixels to the second number of pixels, wherein the first direction and the second direction are the extending directions of two adjacent sides of the minimum bounding rectangle; and select, from each cropped image set, a preset number of cropped images satisfying a second preset condition to form a new image set.
In a third aspect, an embodiment of the present application provides a server, including: one or more processors; a storage device, configured to store one or more programs, which when executed by the one or more processors, cause the one or more processors to implement the method described in any of the above embodiments.
In a fourth aspect, the present application provides a computer-readable storage medium, on which a computer program is stored, and the computer program, when executed by a processor, implements the method described in any of the above embodiments.
In the target retrieval method and apparatus of the present application, a plurality of image sets are first determined; feature vectors of the targets to be detected contained in at least some of the images in the image sets are then extracted; for every two of the extracted feature vectors, the similarity between them and the ranking of that similarity within the image sets to which the indicated targets belong are determined; and finally, the probability that the targets indicated by every two feature vectors are the same target is determined from the similarities and the similarity ranking information. In this way, an electronic device can compute the probability that targets to be detected in different images are the same target, realizing a preliminary retrieval of the target, reducing the workload of annotators, and improving the efficiency of target retrieval.
Drawings
Other features, objects and advantages of the present application will become more apparent upon reading of the following detailed description of non-limiting embodiments thereof, made with reference to the accompanying drawings in which:
FIG. 1 is a flow diagram of one embodiment of a target retrieval method according to the present application;
FIG. 2 is a flow chart of extracting feature vectors according to the target retrieval method of the present application;
FIG. 3 is a flow chart of acquiring a plurality of image sets according to a target retrieval method of the present application;
FIG. 4 is a schematic diagram of an embodiment of a target retrieval device according to the present application;
fig. 5 is a schematic structural diagram of an electronic device suitable for implementing embodiments of the present application.
Detailed Description
The present application will be described in further detail with reference to the following drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the relevant invention and not restrictive of the invention. It should be noted that, for convenience of description, only the portions related to the related invention are shown in the drawings.
It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict. The present application will be described in detail below with reference to the embodiments with reference to the attached drawings.
FIG. 1 shows a flow 100 of one embodiment of a target retrieval method according to the present application. As shown in fig. 1, the target retrieval method of the present embodiment includes the following steps:
Step 101: a plurality of image sets are acquired.
In this embodiment, an electronic device (for example, a terminal or a server) on which the target retrieval method runs may acquire, via a wired or wireless connection, a plurality of image sets stored locally or input by a user through a terminal. Each image set may include a plurality of images, at least one of which contains a target to be detected, and the number of targets to be detected may be one or more. The target to be detected can be any target that needs to be retrieved, such as a pedestrian, a vehicle, or a non-motor vehicle.
Step 102: extracting feature vectors of the targets to be detected contained in at least some of the images in the plurality of image sets.
After the plurality of image sets are obtained, feature vectors of the targets to be detected contained in at least some of the images in the plurality of image sets may be extracted. It is understood that there may be many images containing targets to be detected in the plurality of image sets, and some or all of these images may be selected for feature extraction. The feature vectors are used to characterize the features that distinguish each target to be detected from other targets.
Step 103: for every two of the extracted feature vectors, determining the similarity between them and the ranking of that similarity within the image sets to which the targets indicated by the two feature vectors belong.
After the feature vectors of the targets to be detected are extracted, some or all of them can be selected for the following processing: determining the similarity of every two of the selected feature vectors. After the similarities are calculated, for each similarity, the ranking of that similarity within the image sets to which the targets indicated by the two feature vectors belong can be determined.
In this embodiment, the similarity may be related to the distance between two feature vectors, where the distance may be a Euclidean distance, a Minkowski distance, a Manhattan distance, a Chebyshev distance, a Mahalanobis distance, or the like. The similarity may also be a vector-space cosine similarity, a Pearson correlation coefficient, a Jaccard similarity coefficient, an adjusted cosine similarity, and so on.
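The distance measures listed above can be sketched in a few lines of NumPy. This sketch is illustrative only and not part of the patent text; the sample vectors are hypothetical.

```python
import numpy as np

def distance_measures(v1, v2):
    """Illustrative distances between two feature vectors; a smaller
    distance corresponds to a higher similarity."""
    diff = v1 - v2
    return {
        "euclidean": float(np.sqrt(np.sum(diff ** 2))),
        "manhattan": float(np.sum(np.abs(diff))),
        "chebyshev": float(np.max(np.abs(diff))),
        "minkowski_p3": float(np.sum(np.abs(diff) ** 3) ** (1.0 / 3.0)),
        # cosine similarity is computed on the vectors themselves
        "cosine_similarity": float(v1 @ v2 / (np.linalg.norm(v1) * np.linalg.norm(v2))),
    }

m = distance_measures(np.array([1.0, 0.0, 2.0]), np.array([0.0, 1.0, 2.0]))
```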
It is understood that, for two feature vectors in different image sets, the similarity between each feature vector and other feature vectors except the feature vector in the image set to which the other feature vector belongs can be determined, and then two similarity sets can be obtained. According to the two obtained similarity sets and the similarity between the two feature vectors, ranking information of the similarity between the two feature vectors in the two similarity sets can be determined. For example, for feature vector 1 in image set a and feature vector 2 in image set B, the similarity between feature vector 1 and feature vector 2 may be calculated first. Then, the similarity between the feature vector 1 and other feature vectors in the image set B can be calculated, so as to obtain a similarity set between each feature vector in the image set B and the feature vector 1. Then, the similarity between the feature vector 2 and other feature vectors in the image set a can be calculated, so as to obtain a similarity set between each feature vector in the image set a and the feature vector 2. Finally, according to the similarity between the feature vector 1 and the feature vector 2, ranking information of the similarity in two similarity sets can be determined. It can be understood that after two similarity sets are obtained, the similarities in the similarity sets may be sorted from large to small or from small to large, and then ranking information of the similarities is determined.
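The feature-vector-1 / feature-vector-2 walkthrough above can be sketched as follows; the two-dimensional feature vectors and the use of cosine similarity are hypothetical choices for illustration, not part of the patent.

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def rank_in_set(query_vec, other_set, pair_sim):
    """1-based rank of pair_sim among the similarities between query_vec
    and every vector in other_set, sorted from large to small."""
    sims = [cosine(query_vec, v) for v in other_set]
    return 1 + sum(s > pair_sim for s in sims)

# hypothetical feature vectors for image sets A and B
set_a = [np.array([1.0, 0.0]), np.array([0.6, 0.8])]
set_b = [np.array([0.9, 0.1]), np.array([0.0, 1.0])]
f1, f2 = set_a[0], set_b[0]           # feature vector 1 (set A), feature vector 2 (set B)

s12 = cosine(f1, f2)                  # similarity between the pair
rank_b = rank_in_set(f1, set_b, s12)  # rank of s12 among f1-vs-set-B similarities
rank_a = rank_in_set(f2, set_a, s12)  # rank of s12 among f2-vs-set-A similarities
```

Here s12 ranks first in both similarity sets, which (per step 104 below) would suggest a high probability that the two feature vectors indicate the same target.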
Step 104: determining, according to the similarities and the similarity ranking information, the probability that the targets indicated by every two feature vectors are the same target.
After the similarity and the similarity ranking information are obtained, the probability that the targets to be detected indicated by the two feature vectors are the same target can be determined. For example, if the similarity between two feature vectors is large, and that similarity ranks near the top of both similarity sets, it can be assumed with high probability that the two feature vectors indicate the same target.
In some optional implementations of the present embodiment, when the probability calculated in step 104 satisfies the first preset condition, a prompt message for inquiring whether the two feature vectors indicate the same target is generated.
In this implementation, the above prompt information can be used to remind the labeling personnel to confirm again whether the target to be detected indicated by the two feature vectors is the same target, so that the workload of the labeling personnel can be effectively reduced.
In some optional implementations of the present embodiment, the first preset condition may include at least one of the following: the probability is greater than a predetermined probability threshold; the probability falls within a predetermined top proportion when all probabilities are sorted in descending order.
When the probability indicates that two targets to be detected are very likely to be the same target, prompt information can be output. In this implementation, whether two targets are very likely to be the same target can be determined by setting a probability threshold or a probability proportion.
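Both forms of the first preset condition can be sketched together; the 0.9 threshold and 5% proportion are hypothetical values, not ones specified by the patent.

```python
def needs_confirmation(probs, threshold=0.9, top_ratio=0.05):
    """Return indices of pairs whose same-target probability satisfies
    either condition: above a threshold, or within the top proportion
    when all probabilities are sorted in descending order."""
    k = max(1, int(len(probs) * top_ratio))
    top = set(sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k])
    return [i for i, p in enumerate(probs) if p > threshold or i in top]

# hypothetical same-target probabilities for five comparisons
probs = [0.2, 0.95, 0.5, 0.97, 0.1]
flagged = needs_confirmation(probs)
```

The flagged pairs are the ones for which prompt information would be generated so an annotator can confirm them.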
The target retrieval method provided by the above embodiment determines a plurality of image sets, extracts feature vectors of the targets to be detected contained in at least some of the images, determines for every two extracted feature vectors the similarity between them and the ranking of that similarity within the image sets to which the indicated targets belong, and determines from the similarities and the ranking information the probability that the targets indicated by every two feature vectors are the same target. In this way, an electronic device can compute the probability that targets to be detected in different images are the same target, together with its ranking information; a preliminary retrieval of the target can thus be realized, the workload of annotators is reduced, and the efficiency of target retrieval is improved.
With continued reference to fig. 2, a flow 200 of extracting feature vectors according to the target retrieval method of the present application is shown. As shown in fig. 2, in this embodiment, the source of each image set, that is, each video source, may be a surveillance video, and the distance between the centers of the surveillance ranges of the surveillance videos may be smaller than a preset value. The preset value may be set according to the actual application scenario, for example, 300 meters (the distance between the centers of the surveillance ranges of two adjacent surveillance cameras on a street) or 10 to 50 meters (the distance between the centers of the surveillance ranges of different surveillance cameras in a store). In this embodiment, the center of the surveillance range may be the center of the projection of the range monitored by the surveillance camera onto the ground; for example, when that projection is elliptical, the center of the projection is the center of the ellipse. It will be appreciated that each video source may include multiple image sets.
In this embodiment, feature vector extraction may be implemented by the following steps:
step 201, acquiring identification information of each image set.
In this embodiment, the identification information may include first identification information of the video source to which the image set belongs and second identification information of the image set within that video source. The first identification information may be the number of the video source among the plurality of video sources, and the second identification information may be the number of the image set within its video source. For example, the identification information of an image set may be 2-5, which indicates the 5th image set in the 2nd video source.
Step 202, comparing the identification information of each two image sets to generate a comparison list.
The identification information of every two image sets in the plurality of surveillance videos is compared to generate a comparison list. The comparison list includes a plurality of comparisons, and each comparison may be represented by the identification information of the two image sets, for example, 2-5:3-1.
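The pairwise construction of the comparison list can be sketched as follows (the identification strings follow the "2-5" format from the example above; the list contents are illustrative):

```python
from itertools import combinations

# "v-s" denotes the s-th image set in the v-th video source
# (format assumed from the "2-5:3-1" example in the text).
image_set_ids = ["1-1", "1-2", "2-5", "3-1"]

# Every unordered pair of image sets becomes one comparison.
comparison_list = [f"{a}:{b}" for a, b in combinations(image_set_ids, 2)]
```

For n image sets this produces n(n-1)/2 comparisons, which is why the list optimizations described later matter.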
Step 203, respectively extracting the feature vectors of the target to be detected contained in at least part of the images in each image set, and determining the similarity between the feature vectors indicated by each comparison in the comparison list.
When determining the similarity between the feature vectors of every two targets, the comparisons in the comparison list can serve as the unit of processing, which facilitates displaying the target retrieval results and the subsequent manual review. It is to be understood that the review may be performed manually or implemented by other algorithms, which is not limited in this embodiment.
In some optional implementation manners of this embodiment, when extracting the feature vectors, a preset first neural network may be used to respectively extract the feature vectors of the target to be detected included in each of at least some images of the plurality of image sets.
The feature vector of the target to be detected contained in each image of at least part of the images in each image set can be extracted using a preset first neural network; specifically, the feature vectors can be extracted using a plurality of convolution layers in the first neural network. After the feature vectors of at least part of the images in an image set are obtained, their average or weighted average is computed, and the resulting average vector is used as the feature vector of the target to be detected indicated by that image set.
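The averaging step can be sketched as follows (a minimal illustration assuming the per-image feature vectors have already been produced by the first neural network):

```python
import numpy as np

def image_set_feature(per_image_features, weights=None):
    """Aggregate per-image feature vectors into one vector for the image set:
    a plain average, or a weighted average when weights are given."""
    feats = np.asarray(per_image_features, dtype=float)
    if weights is None:
        return feats.mean(axis=0)
    w = np.asarray(weights, dtype=float)
    return (feats * w[:, None]).sum(axis=0) / w.sum()
```

The resulting average vector stands in for the target to be detected indicated by the image set in all subsequent similarity computations.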
In some optional implementations of the present embodiment, the images in each image set contain the same target to be detected, that is, there is only one target to be detected in each image set. The method further comprises a step of optimizing the comparison list after the comparison list is generated. The step of optimizing the comparison list comprises: for each comparison in the comparison list, determining whether the first identification information of the two image sets indicated by the comparison is the same; in response to the first identification information of the two image sets indicated by the comparison being the same, determining whether images with the same generation time exist in the two image sets; and/or, in response to images with the same generation time existing in the two image sets indicated by the comparison, deleting the comparison to optimize the comparison list.
Whether the first identification information of the two image sets indicated by each comparison in the comparison list is the same is detected; if so, the two image sets belong to the same video source. After it is determined that the two image sets belong to the same video source, whether images with the same generation time exist in the two image sets is detected. If so, the two targets to be detected indicated by the two image sets appear simultaneously in at least one frame of the video source, so they cannot be the same target, and the comparison is deleted to optimize the comparison list.
In some optional implementation manners of this embodiment, the step of optimizing the comparison list may further include the following steps: determining the average generation time of each image set according to the generation time of each image in the image set; for each comparison in the comparison list, determining whether the difference between the average generation times of the two image sets indicated by the comparison is greater than a preset time length; and, in response to the difference between the average generation times of the two image sets indicated by the comparison being greater than the preset time length, deleting the comparison to optimize the comparison list.
The average generation time of each image set can be determined according to the generation time of each frame of image in the image set, and whether the difference between the average generation times of the two image sets indicated by each comparison in the comparison list is greater than a preset time length is detected. Here, when the video source is a surveillance video, the preset time length may be determined according to the positions of the actual surveillance cameras and the normal walking speed of the target to be detected. Taking a pedestrian as the target to be detected as an example, if the difference between the average generation times of the two image sets is greater than the preset time length, the probability that the pedestrians indicated by the two compared image sets are the same pedestrian is low, and the comparison is deleted. For example, two surveillance cameras located on the same street have adjacent surveillance ranges; each camera monitors a 150-meter stretch of street, so together they monitor 300 meters. A pedestrian walks at about 1 meter per second, so walking through the 300-meter monitored range takes about 300 seconds, and the difference between the average generation times of the two image sets obtained from the two surveillance videos is about 150 seconds. The preset time length may then be set to 180 seconds: when the difference between the average generation times is greater than 180 seconds, the two pedestrians are not considered to be the same pedestrian.
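Both optimizations of the comparison list — deleting comparisons within one video source whose image sets share a generation time, and deleting comparisons whose average generation times differ by more than the preset time length — can be sketched together (the data layout and the 180-second default are illustrative):

```python
def optimize_comparisons(comparisons, set_info, max_gap=180.0):
    """Drop comparisons that cannot indicate the same target.

    set_info maps an image-set id to (video_source_id, [generation times in
    seconds]); this layout is an assumption for the sketch.
    """
    kept = []
    for a, b in comparisons:
        src_a, times_a = set_info[a]
        src_b, times_b = set_info[b]
        # Same video source with a frame at the same time: two distinct targets.
        if src_a == src_b and set(times_a) & set(times_b):
            continue
        # Average generation times too far apart for one target to cover.
        mean_a = sum(times_a) / len(times_a)
        mean_b = sum(times_b) / len(times_b)
        if abs(mean_a - mean_b) > max_gap:
            continue
        kept.append((a, b))
    return kept
```

Each deleted comparison removes one similarity computation and one candidate pair from later ranking and annotation.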
In some optional implementation manners of this embodiment, the similarity ranking information in step 103 may also be determined by the following steps: determining the similarity between the two feature vectors indicated by each comparison in the comparison list; determining a similarity set for each target to be detected; and arranging the similarities in each similarity set in descending or ascending order, and determining the similarity ranking information of each target to be detected in the image set to which it belongs.
After the comparison list is determined, since each comparison indicates two image sets, the feature vector of each image set may be determined first, and then the similarity between the feature vectors of every two image sets is determined, so that each comparison corresponds to one similarity. A similarity set is then determined for each target to be detected, where the similarity set includes the similarities corresponding to the comparisons in the comparison list that involve that target. After the similarity set of each target to be detected is obtained, the similarities can be arranged in descending or ascending order, and the similarity ranking information of each target to be detected in the image set to which it belongs can be determined. For example, suppose one comparison in the comparison list is 2-5:3-1 and there are 10 image sets in total, and the image sets compared with image set 2-5, sorted by distance from small to large, are: 2-1, 1-3, 1-2, 3-2, 2-4, 3-1, 2-2, 2-3, 1-1. Then image set 3-1 is ranked 6th in this distance ordering. Similarly, from the distance ordering of image set 3-1, the rank of image set 2-5 in that ordering can be determined.
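The per-target similarity set and ranking can be sketched as follows (a minimal illustration; identifiers follow the example above, and the function name is assumed):

```python
def rank_in_similarity_order(comparisons, similarities, target):
    """Return the 1-based rank of every image set compared with `target`,
    in `target`'s own similarity ordering (most similar first)."""
    partners = []
    for (a, b), sim in zip(comparisons, similarities):
        if target == a:
            partners.append((b, sim))
        elif target == b:
            partners.append((a, sim))
    partners.sort(key=lambda p: p[1], reverse=True)   # descending similarity
    return {other: i + 1 for i, (other, sim) in enumerate(partners)}
```

For the comparison 2-5:3-1 this yields both ranks used later: the rank of 3-1 in 2-5's ordering and, by calling it again with the other target, the rank of 2-5 in 3-1's ordering.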
In this implementation manner, the feature vector of each image set may be obtained by weighting the feature vectors of at least a part of the images including the object to be detected in the image set.
In some optional implementation manners of this embodiment, when determining the probability that the target to be detected indicated by each two feature vectors is the same target in step 104, the following steps not shown in fig. 2 may be specifically implemented: and inputting the similarity between every two feature vectors and ranking information of the to-be-detected targets indicated by the two feature vectors in respective similarity sequence into a preset classifier, and determining the probability that the to-be-detected targets indicated by every two feature vectors are the same target based on the classifier.
In this implementation, a classifier may be employed to determine whether two targets to be examined are the same target. The preset classifier can be various classifiers which can output probabilities based on numerical values, and can be a support vector machine classifier, for example.
In some optional implementations of this embodiment, the classifier may be established by the following steps not shown in fig. 2: acquiring a plurality of pre-annotated image sets; extracting the feature vectors of the targets contained in each image set, and determining the similarity between every two feature vectors; determining the similarity ordering between each feature vector and the other feature vectors according to these similarities; determining, for each pair of feature vectors, the rank of each in the other's similarity ordering; and training the classifier according to the annotation information of the image sets, the similarity between every two feature vectors, and the ranks.
The annotation information of the image sets indicates whether two targets in the image sets are the same target; that is, the image sets include a plurality of pairs in which the two targets are the same target and a plurality of pairs in which they are not. The feature vectors of the targets contained in each image set are extracted, the similarity between every two feature vectors is computed, and the similarities belonging to the same target to be detected are sorted to obtain a similarity ordering. The rank of each of the two feature vectors in the other's similarity ordering is then determined. For example, suppose two feature vectors are marked 1 and 5, and there are 10 feature vectors in total; after the similarity ordering of feature vector 1 is determined, the rank of feature vector 5 in that ordering can be determined, and similarly, after the similarity ordering of feature vector 5 is determined, the rank of feature vector 1 in that ordering can be determined. After the ranks are obtained, the preset classifier can be obtained by training the classifier with the annotation information of the image sets, the similarity between every two feature vectors, and the ranks. That is, the classifier is trained with the similarity between feature vector 1 and feature vector 5, the rank of feature vector 1 in the similarity ordering of feature vector 5, the rank of feature vector 5 in the similarity ordering of feature vector 1, and the annotation result (whether the targets indicated by feature vectors 1 and 5 are the same target); through many such training iterations, the preset classifier is obtained.
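The embodiment leaves the classifier open (a support vector machine is mentioned as one option); the sketch below substitutes a hand-rolled logistic regression trained on illustrative (similarity, rank, rank) triples, only to show the input/output shape such a classifier would have:

```python
import numpy as np

def train_classifier(X, y, lr=0.1, epochs=2000):
    """Minimal logistic-regression stand-in for the preset classifier,
    fit by batch gradient descent."""
    X = np.hstack([np.ones((len(X), 1)), np.asarray(X, float)])  # bias column
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-X @ w))
        w -= lr * X.T @ (p - np.asarray(y, float)) / len(y)
    return w

def prob_same_target(w, features):
    """Probability that the pair described by `features` is the same target."""
    x = np.concatenate([[1.0], np.asarray(features, float)])
    return float(1.0 / (1.0 + np.exp(-x @ w)))

# Each sample: [similarity, rank of A in B's ordering, rank of B in A's ordering].
# The numbers are fabricated training data for illustration only.
X = [[0.92, 1, 1], [0.85, 2, 1], [0.88, 1, 2],    # annotated "same target"
     [0.40, 8, 9], [0.35, 9, 7], [0.30, 10, 10]]  # annotated "different targets"
y = [1, 1, 1, 0, 0, 0]
w = train_classifier(X, y)
```

Pairs that are mutually top-ranked and highly similar come out with a high probability of being the same target, which is exactly the signal the three input features are meant to carry.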
In this embodiment, the plurality of image sets may contain a large number of useless images; that is, only a few images in the plurality of image sets may contain the same target. If the annotating personnel had to annotate every image in every image set, the workload would be very large. When this implementation is combined with the embodiment shown in fig. 1, only the pairs of images whose targets to be detected are most likely the same target are output to the annotating personnel, which greatly reduces their workload. Meanwhile, after the output information is confirmed by the annotating personnel, the confirmed images can be used to train the classifier, which can improve the accuracy of classifier training.
According to the target retrieval method provided by this embodiment of the application, after the feature vectors of the targets to be detected are extracted, constructing the comparison list makes the retrieval results clear and organized; meanwhile, optimizing the comparison list effectively reduces the workload of subsequent computation and improves computational efficiency.
With continued reference to fig. 3, fig. 3 is a flow 300 of acquiring a plurality of image sets according to the target retrieval method of the present embodiment. As shown in fig. 3, in the present embodiment, a plurality of image sets may be acquired by:
step 301, acquiring a plurality of video sources.
In this embodiment, the video source may be various video data including at least one object to be detected, and each video data may include a plurality of images. Each video source may comprise at least one image set.
Step 302, detecting at least partial images in the plurality of video sources by using a preset second neural network, and determining the target to be detected contained in each of the at least partial images.
In this embodiment, a preset second neural network may be used to detect at least a partial image of the acquired multiple video sources, so as to determine a to-be-detected target included in each of the at least partial image. The preset second neural network can be a trained convolutional neural network and can be used for detecting the target to be detected in the input image.
Step 303, marking each detected target to be detected with its minimum circumscribed rectangle frame.
After the target in each frame of image is detected, each detected target may be labeled. Various labeling boxes can be used, such as a circle, a rectangle, or an ellipse. In this embodiment, each detected target may be labeled with its minimum circumscribed rectangle. It can be understood that, when labeling the detected targets, whether the clarity of each detected target meets the requirement may also be determined, and if not, the image may be removed to ensure that the labeled images are of good clarity.
Step 304, cropping each marked area to obtain a cropped image set corresponding to each target to be detected.
After each target is labeled, each labeled area may be cropped to obtain a cropped image set corresponding to each target. It will be appreciated that each cropped image contains only one target, and each image in a given cropped image set contains the same target.
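The cropping step can be sketched as follows (a minimal illustration; the (x1, y1, x2, y2) box format is an assumption, with boxes taken from the second neural network's detections in a real pipeline):

```python
import numpy as np

def crop_min_bounding_rect(image, boxes):
    """Crop each detected target's minimal axis-aligned bounding box
    from one frame. `image` is an H x W x C array; `boxes` is a list of
    (x1, y1, x2, y2) pixel coordinates."""
    crops = []
    for x1, y1, x2, y2 in boxes:
        crops.append(image[y1:y2, x1:x2].copy())  # rows = y range, cols = x range
    return crops
```

Each crop contains exactly one target, so the crops from one target across frames form its cropped image set.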
Step 305, determining the number of the cropped images contained in each cropped image set and the following parameters of each cropped image: a first number of pixels along a first direction, a second number of pixels along a second direction, and the ratio of the first number of pixels to the second number of pixels.
Since each cropped image is rectangular, the first number of pixels in the length direction and the second number of pixels in the width direction of each cropped image may be determined, as may the ratio of the two. Here, the first direction and the second direction are respectively one of the length direction and the width direction. In this embodiment, the number of cropped images contained in each cropped image set may also be determined, so that cropped image sets containing too few images can be deleted. In this way, when the comparison list is established, the number of comparisons in the list is effectively reduced, which reduces the amount of computation.
Step 306, selecting a preset number of cropped images meeting a second preset condition in each cropped image set to form a new image set.
In order to enable each image in a cropped image set to clearly reflect the characteristics of the target, in this embodiment, the cropped images in each set may be screened. The screening may select cropped images whose first number of pixels in the length direction is greater than a certain value, whose second number of pixels in the width direction is greater than a certain value, and whose length-to-width ratio is greater than a certain value; for example, cropped images with more than 60 pixels in the length direction, more than 30 pixels in the width direction, and a length-to-width ratio greater than 2 may be selected. It is understood that the screening can also be performed according to the clarity of the image, the integrity of the image, and the like.
Meanwhile, in order to improve the subsequent operation speed, a preset number of cut images can be selected from the screened cut image set to form a new image set. The preset number may be set according to actual application requirements, and this embodiment does not limit this. For example, 10 or 20 representative images may be used.
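The screening and selection of steps 305-306 can be sketched as follows (threshold values mirror the 60-pixel/30-pixel/ratio-2 example above; the cap of 10 images is one of the suggested preset numbers, and the function name is assumed):

```python
def filter_crops(crop_sizes, min_h=60, min_w=30, min_ratio=2.0, keep=10):
    """Keep crops meeting the size and aspect-ratio thresholds of the
    second preset condition, up to `keep` images per set.

    crop_sizes: list of (height, width) in pixels for one cropped image set."""
    passed = [(h, w) for h, w in crop_sizes
              if h > min_h and w > min_w and h / w > min_ratio]
    return passed[:keep]
```

The surviving crops form the new, smaller image set used for feature extraction, which keeps the later comparison and similarity computations cheap.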
The target retrieval method provided by the above embodiment of the application reduces the size of the image of each target, reduces the number of images in each image set, and can effectively improve the subsequent operation efficiency.
With further reference to fig. 4, as an implementation of the method shown in the above figures, the present application provides an embodiment of an object retrieval device, which corresponds to the embodiment of the method shown in fig. 1, and which can be applied in various electronic devices.
As shown in fig. 4, the target retrieval apparatus 400 of the present embodiment includes: an image set acquisition unit 401, a feature vector extraction unit 402, a similarity determination unit 403, and a probability determination unit 404.
The image set acquiring unit 401 is configured to acquire a plurality of image sets.
Wherein each image set comprises at least one image containing at least one object to be detected.
A feature vector extracting unit 402, configured to extract feature vectors of the target to be examined included in at least part of the images in the plurality of image sets, respectively.
A similarity determining unit 403, configured to determine, in at least some of the extracted feature vectors, a similarity of every two feature vectors and similarity ranking information of the similarity in an image set to which the to-be-detected target indicated by the two feature vectors belongs, respectively.
And a probability determining unit 404, configured to determine, according to the similarities and the ranking information of the similarities, a probability that the to-be-detected targets indicated by every two feature vectors are the same target.
In some optional implementations of the present embodiment, the apparatus 400 may further include a prompt information generating unit, not shown in fig. 4, configured to generate prompt information for inquiring whether two feature vectors indicate the same target in response to that a probability that two of the feature vectors indicate that the target to be detected is the same target satisfies a first preset condition.
In some optional implementations of the present embodiment, the first preset condition includes at least one of: the probability is greater than a predetermined probability threshold; the probability falls within a predetermined top proportion of all probabilities sorted in descending order.
In some alternative implementations of the present embodiment, the set of images is derived from a video source. The feature vector extraction unit 402 may further include an identification information acquisition module, a comparison list generation module, and a feature vector extraction module, which are not shown in fig. 4.
And the identification information acquisition module is used for acquiring the identification information of each image set.
The identification information comprises first identification information of a video source to which the image set belongs and second identification information of the image set in the video source to which the image set belongs.
And the comparison list generation module is used for comparing the identification information of every two image sets to generate a comparison list.
And the feature vector extraction module is used for respectively extracting the feature vectors of the to-be-detected target contained in at least part of the images in each image set and determining the similarity between the feature vectors indicated by each comparison in the comparison list.
In some optional implementations of the present embodiment, the images in each image set contain the same target to be detected. The feature vector extraction unit 402 may further include a comparison list optimization module not shown in fig. 4. The comparison list optimization module is configured to: for each comparison in the comparison list, determine whether the first identification information of the two image sets indicated by the comparison is the same; in response to the first identification information of the two image sets indicated by the comparison being the same, determine whether images with the same generation time exist in the two image sets; and/or, in response to images with the same generation time existing in the two image sets indicated by the comparison, delete the comparison so as to optimize the comparison list.
In some optional implementation manners of this embodiment, the comparison list optimization module is configured to: determine the average generation time of each image set according to the generation time of each image in the image set; for each comparison in the comparison list, determine whether the difference between the average generation times of the two image sets indicated by the comparison is greater than a preset time length; and, in response to the difference between the average generation times being greater than the preset time length, delete the comparison so as to optimize the comparison list.
In some optional implementations of the present embodiment, the feature vector extraction unit 402 may be further configured to: and respectively extracting the characteristic vector of the to-be-detected target contained in each image in the at least partial images of the plurality of image sets by using a preset first neural network.
In some optional implementations of the present embodiment, the similarity determining unit 403 may further include a similarity determining module, a similarity set determining module, and a similarity set ranking module, which are not shown in fig. 4.
The similarity determining module is used for determining the similarity between the two feature vectors indicated by each comparison in the comparison list.
And the similarity set determining module is used for determining a similarity set of each target to be detected, wherein the similarity set comprises the similarity corresponding to the comparison of the target to be detected contained in the comparison list.
And the similarity set ranking module is used for arranging the similarities in the similarity set from large to small or from small to large and determining the similarity ranking information of each target to be detected in the image set to which the target to be detected belongs.
In some optional implementations of the present embodiment, the probability determining unit 404 may be further configured to: and inputting the similarity between every two feature vectors and ranking information of the to-be-detected targets indicated by the two feature vectors in respective similarity sequence into a preset classifier, and determining the probability that the to-be-detected targets indicated by every two feature vectors are the same target based on the classifier.
In some optional implementations of this embodiment, the apparatus 400 may further include a classifier establishing unit not shown in fig. 4. The classifier establishing unit is used for establishing a classifier in advance and comprises an annotated image set acquisition module, a similarity determining module, a similarity sequencing determining module, a ranking determining module and a training module.
The system comprises an annotated image set acquisition module and a pre-annotated image set acquisition module, wherein the annotated image set acquisition module is used for acquiring a plurality of pre-annotated image sets. The marked information is used for indicating whether two targets in the plurality of image sets are the same target or not.
And the similarity determining module is used for extracting the feature vectors of the targets contained in each image set and determining the similarity between every two feature vectors.
And the similarity ranking determining module is used for determining similarity ranking between each feature vector and other feature vectors according to the similarity between every two feature vectors.
And the ranking determining module is used for determining the ranking of each two feature vectors in the respective similarity ranking according to the similarity ranking.
And the training module is used for training the classifier according to the labeling information of the image sets, the similarity between every two feature vectors and the ranking.
In some optional implementations of the present embodiment, the image set obtaining unit 401 may further include a video source obtaining module, an object detecting module, and an object labeling module, which are not shown in fig. 4.
The video source acquisition module is used for acquiring a plurality of video sources, and each video source comprises at least one image set.
And the target detection module is used for detecting at least partial images in the plurality of video sources by utilizing a preset second neural network and determining a target to be detected contained in each image in the at least partial images.
And the target labeling module is used for labeling each detected target to be detected to obtain a labeled image set corresponding to each target to be detected.
In some optional implementations of this embodiment, the target labeling module may be further configured to: mark each detected target to be detected with its minimum circumscribed rectangle frame; and crop each marked area to obtain a cropped image set corresponding to each target to be detected.
In some optional implementations of this embodiment, the target labeling module may be further configured to: determining the number of the cut images contained in each cut image set and the following parameters of each cut image: a first number of pixels along a first direction, a second number of pixels along a second direction, and a ratio of the first number of pixels to the second number of pixels, wherein the first direction and the second direction are respectively extending directions of two adjacent sides of the minimum bounding rectangle; and selecting a preset number of the cut images meeting a second preset condition in each cut image set to form a new image set.
The target retrieval device provided by the above embodiment of the application determines a plurality of image sets, extracts feature vectors of a to-be-detected target included in at least a part of images in the plurality of image sets, determines similarity of every two feature vectors and similarity ranking information of every similarity in the image set to which the to-be-detected target indicated by the two feature vectors belongs in the extracted at least part of feature vectors, and determines the probability that the to-be-detected target indicated by every two feature vectors is the same target according to the similarities and the similarity ranking information. Therefore, the probability that the targets to be detected in different images are the same target can be obtained by using the electronic equipment, the primary retrieval of the target can be realized, the workload of marking personnel is reduced, and the efficiency of target retrieval is improved.
It should be understood that units 401 to 404 recited in the target retrieval apparatus 400 correspond to respective steps in the method described with reference to fig. 1. Thus, the operations and features described above for the target retrieval method are equally applicable to the apparatus 400 and the units included therein, and are not described in detail here. The corresponding elements of the apparatus 400 may cooperate with elements in a server to implement aspects of embodiments of the present application.
The embodiment of the invention also provides an electronic device, which can be a mobile terminal, a Personal Computer (PC), a tablet computer, a server, and the like. Referring now to fig. 5, a schematic diagram of an electronic device 500 suitable for implementing a terminal device or a server according to an embodiment of the present application is shown. As shown in fig. 5, the computer system 500 includes one or more processors, a communication part, and the like, for example: one or more Central Processing Units (CPUs) 501 and/or one or more Graphics Processing Units (GPUs) 513, which may perform various appropriate actions and processes according to executable instructions stored in a Read-Only Memory (ROM) 502 or loaded from a storage section 508 into a Random Access Memory (RAM) 503. The communication part 512 may include, but is not limited to, a network card, which may include, but is not limited to, an IB (InfiniBand) network card.
The processor may communicate with the read-only memory 502 and/or the random access memory 503 to execute the executable instructions, connect with the communication part 512 through the bus 504, and communicate with other target devices through the communication part 512, thereby performing operations corresponding to any one of the methods provided by the embodiments of the present application, for example, acquiring a plurality of image sets, wherein each image set includes at least one image containing at least one target to be detected; respectively extracting the characteristic vectors of the to-be-detected target contained in at least partial images in the plurality of image sets; respectively determining the similarity of every two feature vectors and the similarity ranking information of the similarity in the image set indicated by the two feature vectors and to which the to-be-detected target belongs in at least part of the extracted feature vectors; and determining the probability that the to-be-detected target indicated by every two feature vectors is the same target according to the similarity and the similarity ranking information.
In addition, the RAM 503 may also store various programs and data necessary for the operation of the apparatus. The CPU 501, the ROM 502, and the RAM 503 are connected to one another via the bus 504. When the RAM 503 is present, the ROM 502 is an optional module. The RAM 503 stores executable instructions, or writes executable instructions into the ROM 502 at runtime, and the executable instructions cause the processor 501 to perform the operations corresponding to the above-described method. An input/output (I/O) interface 505 is also connected to the bus 504. The communication part 512 may be integrated, or may be provided with a plurality of sub-modules (e.g., a plurality of IB network cards), each connected to the bus.
The following components are connected to the I/O interface 505: an input section 506 including a keyboard, a mouse, and the like; an output section 507 including a display such as a cathode ray tube (CRT) or a liquid crystal display (LCD), and a speaker; a storage section 508 including a hard disk and the like; and a communication section 509 including a network interface card such as a LAN card or a modem. The communication section 509 performs communication processing via a network such as the Internet. A drive 510 is also connected to the I/O interface 505 as needed. A removable medium 511, such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory, is mounted on the drive 510 as needed, so that a computer program read therefrom is installed into the storage section 508 as needed.
It should be noted that the architecture shown in fig. 5 is only one optional implementation. In practice, the number and types of the components in fig. 5 may be selected, deleted, added, or replaced according to actual needs. Different functional components may be arranged separately or integrated: for example, the GPU and the CPU may be arranged separately, or the GPU may be integrated on the CPU; the communication part may be arranged separately, or may be integrated on the CPU or the GPU; and so on. These alternative embodiments all fall within the scope of the present disclosure.
In particular, according to an embodiment of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, an embodiment of the present disclosure includes a computer program product comprising a computer program tangibly embodied on a machine-readable medium, the computer program comprising program code for performing the method illustrated in the flowchart. The program code may include instructions corresponding to the method steps provided by embodiments of the present disclosure, for example: acquiring a plurality of image sets, wherein each image set includes at least one image containing at least one target to be detected; respectively extracting feature vectors of the targets to be detected contained in at least some of the images in the plurality of image sets; for every two of the extracted feature vectors, respectively determining the similarity of the two feature vectors and the ranking information of that similarity within the image sets to which the targets to be detected indicated by the two feature vectors belong; and determining, according to the similarities and the similarity ranking information, the probability that the targets to be detected indicated by every two feature vectors are the same target. In such an embodiment, the computer program may be downloaded and installed from a network through the communication section 509, and/or installed from the removable medium 511. When executed by the central processing unit (CPU) 501, the computer program performs the above-described functions defined in the method of the present application.
The methods, apparatuses, and devices of the present invention may be implemented in many ways, for example, by software, hardware, firmware, or any combination thereof. The above order of the steps of the method is for illustration only; the steps of the method of the present invention are not limited to the order specifically described above unless otherwise indicated. Furthermore, in some embodiments, the present invention may also be embodied as programs recorded in a recording medium, the programs comprising machine-readable instructions for implementing the methods according to the present invention. Thus, the present invention also covers a recording medium storing a program for executing the methods according to the present invention.
The description of the present invention has been presented for purposes of illustration and description, and is not intended to be exhaustive or to limit the invention to the forms disclosed. Many modifications and variations will be apparent to practitioners skilled in the art. The embodiments were chosen and described in order to best explain the principles of the invention and its practical application, and to enable others of ordinary skill in the art to understand the invention and its various embodiments with the various modifications suited to the particular use contemplated.

Claims (28)

1. A target retrieval method, comprising:
acquiring a plurality of image sets, wherein each image set comprises at least one image containing at least one object to be detected;
respectively extracting the feature vectors of the target to be detected contained in at least partial images in the plurality of image sets;
respectively determining the similarity of every two feature vectors in at least part of the extracted feature vectors and the similarity ranking information of the similarity in two similarity sets corresponding to the image set to which the to-be-detected target indicated by the two feature vectors belongs;
and determining the probability that the to-be-detected target indicated by every two feature vectors is the same target according to the similarity and the similarity ranking information.
2. The method of claim 1, further comprising:
and generating prompt information for inquiring whether the two feature vectors indicate the same target or not in response to the fact that the probability that the targets to be detected indicated by the two feature vectors are the same target meets a first preset condition.
3. The method of claim 2, wherein the first preset condition comprises at least one of: the probability is greater than a predetermined probability threshold; the probability falls within a predetermined top proportion when all probabilities are sorted in descending order.
4. The method of any of claims 1-3, wherein the set of images is derived from a video source; and
the step of respectively extracting the feature vectors of the target to be detected contained in at least partial images in the plurality of image sets comprises the following steps:
acquiring identification information of each image set, wherein the identification information comprises first identification information of a video source to which the image set belongs and second identification information of the image set in the video source to which the image set belongs;
comparing the identification information of every two image sets to generate a comparison list;
respectively extracting the characteristic vectors of the target to be detected contained in at least partial images in each image set, and determining the similarity between the characteristic vectors indicated by each comparison in the comparison list.
5. The method according to claim 4, characterized in that the images in each image set comprise the same object to be examined; and
the method further comprises the following steps:
for each comparison in the comparison list, determining whether the first identification information of the two image sets indicated by the comparison is the same;
in response to that the first identification information of the two image sets indicated by the comparison is the same, determining whether images with the same generation time exist in the two image sets indicated by the comparison; and/or deleting the comparison to optimize the comparison list in response to the existence of the images with the same generation time in the two image sets indicated by the comparison.
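The filter in claims 4 and 5 can be sketched as follows; the dictionary layout, camera identifiers, and integer timestamps are illustrative assumptions, not the patent's data model. Two tracks from the same video source that contain images generated at the same moment cannot be the same target, so their comparison is deleted from the list.

```python
# Each image set: (first identification info = video source id,
#                  second identification info = set id within the source,
#                  generation times of its images).
sets = {
    1: ("cam1", "s1", [10, 11, 12]),
    2: ("cam1", "s2", [11, 20, 21]),  # same source, overlapping time as set 1
    3: ("cam2", "s1", [10, 11, 12]),
}

comparisons = [(1, 2), (1, 3), (2, 3)]

def keep(a, b):
    va, _, ta = sets[a]
    vb, _, tb = sets[b]
    # Same video source AND at least one image generated at the same
    # moment -> the two sets show two distinct targets in one frame.
    if va == vb and set(ta) & set(tb):
        return False
    return True

optimized = [c for c in comparisons if keep(*c)]
```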
6. The method of claim 4, further comprising:
determining the average generation time of each image set according to the generation time of each image in each image set;
for each comparison in the comparison list, determining whether the difference between the average generation moments of the two image sets indicated by the comparison is greater than a preset time length;
and deleting the comparison to optimize the comparison list in response to the difference between the average generation moments of the two image sets indicated by the comparison being greater than a preset time length.
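Claim 6's time-based pruning can be sketched in the same style; the threshold value and the toy timestamps are assumptions, since the claim only requires some preset time length.

```python
def average_time(times):
    # Average generation time of an image set.
    return sum(times) / len(times)

MAX_GAP = 100.0  # assumed "preset time length"

set_times = {1: [0, 10, 20], 2: [15, 25, 35], 3: [500, 510, 520]}
comparisons = [(1, 2), (1, 3), (2, 3)]

# Drop comparisons whose image sets were generated too far apart in time.
optimized = [
    (a, b) for a, b in comparisons
    if abs(average_time(set_times[a]) - average_time(set_times[b])) <= MAX_GAP
]
```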
7. The method according to any one of claims 1 to 3, wherein the extracting the feature vectors of the target to be inspected contained in at least some of the images in the plurality of image sets respectively comprises:
and respectively extracting the feature vector of the target to be detected contained in each image in the at least partial images of the plurality of image sets by utilizing a preset first neural network.
8. The method according to claim 4, wherein the determining similarity of each two feature vectors and the similarity ranking information of the similarity in the image set to which the target to be detected indicated by the two feature vectors belongs respectively comprises:
determining a similarity between two feature vectors indicated by each alignment in the alignment list;
determining a similarity set of each target to be detected, wherein the similarity set comprises the similarity corresponding to the comparison of the target to be detected in the comparison list;
and arranging each similarity in the similarity set in descending order or ascending order, and determining the similarity ranking information of each target to be detected in the image set to which the target to be detected belongs.
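Claim 8's sorting step can be sketched as follows; the pair labels and similarity values are assumed toy data. Each target's similarity set is sorted and every comparison receives its rank within that set.

```python
# Similarities of all comparisons in the comparison list that involve
# target "A" (assumed values).
similarity_set = {"A-B": 0.91, "A-C": 0.35, "A-D": 0.78}

# Sort in descending order and derive each comparison's rank inside
# the target's own similarity set (rank 1 = most similar).
ordered = sorted(similarity_set.items(), key=lambda kv: kv[1], reverse=True)
ranking_info = {pair: rank + 1 for rank, (pair, _) in enumerate(ordered)}
```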
9. The method according to claim 8, wherein the determining the probability that the to-be-detected target indicated by each two feature vectors is the same target according to each similarity and each similarity ranking information comprises:
and inputting the similarity between every two feature vectors and ranking information of the to-be-detected targets indicated by the two feature vectors in respective similarity sequence into a preset classifier, and determining the probability that the to-be-detected targets indicated by every two feature vectors are the same target based on the classifier.
10. The method of claim 9, wherein the classifier is pre-established using the following method:
acquiring a plurality of pre-labeled image sets, wherein labeled information is used for indicating whether two targets in the plurality of image sets are the same target or not;
extracting the feature vectors of the targets contained in each image set, and determining the similarity between every two feature vectors;
determining similarity ranking between each feature vector and other feature vectors according to the similarity between every two feature vectors;
determining the ranking of every two eigenvectors in the respective similarity sequence according to the similarity sequence;
and training a classifier according to the labeling information of the image sets, the similarity between every two feature vectors and the ranking.
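The training procedure of claim 10 can be illustrated with a minimal binary classifier over (similarity, rank) features. The patent does not fix the classifier type; logistic regression, the feature layout, and the toy labeled pairs below are stand-in assumptions.

```python
import numpy as np

# Features per comparison: (similarity, rank in set 1, rank in set 2);
# labels come from the pre-annotated image sets (1 = same target).
X = np.array([[0.95, 1, 1], [0.90, 1, 2], [0.40, 5, 6], [0.30, 8, 7]], float)
y = np.array([1, 1, 0, 0], float)

# Train a tiny logistic-regression classifier by gradient descent.
w = np.zeros(X.shape[1])
b = 0.0
for _ in range(2000):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # predicted probabilities
    g = p - y                                # gradient of the log loss
    w -= 0.1 * X.T @ g / len(y)
    b -= 0.1 * g.mean()

def same_target_probability(similarity, rank_a, rank_b):
    # Probability that the two feature vectors indicate the same target.
    z = np.dot(w, [similarity, rank_a, rank_b]) + b
    return 1.0 / (1.0 + np.exp(-z))
```

High similarity combined with a low (good) rank in both similarity sets pushes the probability toward 1, which is exactly the signal the claimed method feeds into the preset classifier.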
11. The method of any of claims 1-3, wherein the acquiring a plurality of image sets comprises:
acquiring a plurality of video sources, each video source comprising at least one image set;
detecting at least partial images in the plurality of video sources by using a preset second neural network, and determining a to-be-detected target contained in each image in the at least partial images;
and labeling each detected target to be detected to obtain a labeled image set corresponding to each target to be detected.
12. The method according to claim 11, wherein the labeling each detected object to be detected to obtain a labeled image set corresponding to each object to be detected comprises:
marking each detected target to be detected by using a minimum circumscribed rectangular frame;
and cutting each marked area to obtain a cut image set corresponding to each target to be detected.
13. The method of claim 12, wherein the acquiring a plurality of sets of images comprises:
determining the number of the cropping images contained in each of the cropping image sets and the following parameters for each cropping image: the pixel array comprises a first pixel number along a first direction, a second pixel number along a second direction, and a ratio of the first pixel number to the second pixel number, wherein the first direction and the second direction are respectively extension directions of two adjacent side lengths of the minimum circumscribed rectangle;
and selecting a preset number of the cut images meeting a second preset condition in each cut image set to form a new image set.
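Claim 13's selection step can be sketched as follows; the concrete thresholds standing in for the "second preset condition" (minimum side lengths and an aspect-ratio band) are assumptions for illustration only.

```python
# Each cropped image described by its pixel counts along the two adjacent
# sides of the minimum circumscribed rectangle: (first, second).
crops = [(64, 128), (10, 12), (60, 130), (200, 90)]

def acceptable(first_px, second_px):
    # Assumed "second preset condition": large enough, and with a
    # first/second pixel ratio plausible for the target class.
    ratio = first_px / second_px
    return first_px >= 32 and second_px >= 64 and 0.3 <= ratio <= 0.7

PRESET_COUNT = 2  # assumed preset number of crops per new image set
selected = [c for c in crops if acceptable(*c)][:PRESET_COUNT]
```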
14. An object retrieval apparatus, characterized in that the apparatus comprises:
the image set acquisition unit is used for acquiring a plurality of image sets, wherein each image set comprises at least one image containing at least one object to be detected;
a feature vector extraction unit, configured to extract feature vectors of an object to be detected included in at least part of the images in the plurality of image sets, respectively;
the similarity determining unit is used for respectively determining the similarity of every two feature vectors in at least part of the extracted feature vectors and the similarity ranking information of the similarity in two similarity sets corresponding to the image set to which the to-be-detected target indicated by the two feature vectors belongs;
and the probability determining unit is used for determining the probability that the to-be-detected targets indicated by every two feature vectors are the same target according to the similarities and the ranking information of the similarities.
15. The apparatus of claim 14, further comprising:
and the prompt information generating unit is used for responding to the condition that the probability that the targets to be detected indicated by the two characteristic vectors are the same target meets a first preset condition, and generating prompt information for inquiring whether the two characteristic vectors indicate the same target.
16. The apparatus of claim 15, wherein the first preset condition comprises at least one of: the probability is greater than a predetermined probability threshold; the probability falls within a predetermined top proportion when all probabilities are sorted in descending order.
17. The apparatus of any of claims 14-16, wherein the set of images is derived from a video source; and
the feature vector extraction unit includes:
the identification information acquisition module is used for acquiring identification information of each image set, wherein the identification information comprises first identification information of a video source to which the image set belongs and second identification information of the image set in the video source to which the image set belongs;
the comparison list generation module is used for comparing the identification information of every two image sets to generate a comparison list;
and the feature vector extraction module is used for respectively extracting the feature vectors of the to-be-detected target contained in at least part of the images in each image set and determining the similarity between the feature vectors indicated by each comparison in the comparison list.
18. The apparatus according to claim 17, wherein the images in each image set comprise the same object to be inspected; and
the feature vector extraction unit further comprises an alignment list optimization module configured to:
for each comparison in the comparison list, determining whether the first identification information of the two image sets indicated by the comparison is the same;
in response to that the first identification information of the two image sets indicated by the comparison is the same, determining whether images with the same generation time exist in the two image sets indicated by the comparison; and/or deleting the comparison to optimize the comparison list in response to the existence of the images with the same generation time in the two image sets indicated by the comparison.
19. The apparatus of claim 17, wherein the feature vector extraction unit further comprises an alignment list optimization module configured to:
determining the average generation time of each image set according to the generation time of each image in each image set;
for each comparison in the comparison list, determining whether the difference between the average generation moments of the two image sets indicated by the comparison is greater than a preset time length;
and deleting the comparison to optimize the comparison list in response to the difference between the average generation moments of the two image sets indicated by the comparison being greater than a preset time length.
20. The apparatus according to any of claims 14-16, wherein the feature vector extraction unit is further configured to:
and respectively extracting the feature vector of the target to be detected contained in each image in the at least partial images of the plurality of image sets by utilizing a preset first neural network.
21. The apparatus of claim 17, wherein the similarity determining unit comprises:
a similarity determination module, configured to determine a similarity between two feature vectors indicated by each comparison in the comparison list;
a similarity set determining module, configured to determine a similarity set of each to-be-detected target, where the similarity set includes a similarity corresponding to the comparison of the to-be-detected target included in the comparison list;
and the similarity set ranking module is used for arranging the similarities in the similarity set from large to small or from small to large and determining the similarity ranking information of each target to be detected in the image set to which the target to be detected belongs.
22. The apparatus of claim 21, wherein the probability determination unit is further configured to:
and inputting the similarity between every two feature vectors and ranking information of the to-be-detected targets indicated by the two feature vectors in respective similarity sequence into a preset classifier, and determining the probability that the to-be-detected targets indicated by every two feature vectors are the same target based on the classifier.
23. The apparatus of claim 22, wherein the classifier is pre-established by a classifier establishment unit comprising:
the system comprises an annotated image set acquisition module, a storage module and a display module, wherein the annotated image set acquisition module is used for acquiring a plurality of pre-annotated image sets, and annotated information is used for indicating whether two targets in the plurality of image sets are the same target or not;
the similarity determining module is used for extracting the feature vectors of the targets contained in each image set and determining the similarity between every two feature vectors;
the similarity ranking determining module is used for determining similarity ranking between each feature vector and other feature vectors according to the similarity between every two feature vectors;
the ranking determining module is used for determining the ranking of each two feature vectors in respective similarity ranking according to the similarity ranking;
and the training module is used for training the classifier according to the labeling information of the image sets, the similarity between every two feature vectors and the ranking.
24. The apparatus according to any one of claims 14 to 16, wherein the image set acquisition unit includes:
the system comprises a video source acquisition module, a video source acquisition module and a video source acquisition module, wherein the video source acquisition module is used for acquiring a plurality of video sources, and each video source comprises at least one image set;
the target detection module is used for detecting at least partial images in the plurality of video sources by utilizing a preset second neural network and determining a target to be detected contained in each image in the at least partial images;
and the target labeling module is used for labeling each detected target to be detected to obtain a labeled image set corresponding to each target to be detected.
25. The apparatus of claim 24, wherein the target annotation module is further configured to:
marking each detected target to be detected by using a minimum circumscribed rectangular frame;
and cutting each marked area to obtain a cut image set corresponding to each target to be detected.
26. The apparatus of claim 25, wherein the target labeling module is further configured to:
determining the number of the cropping images contained in each of the cropping image sets and the following parameters for each cropping image: a first pixel number along a first direction, a second pixel number along a second direction, and a ratio of the first pixel number to the second pixel number, wherein the first direction and the second direction are respectively the extension directions of two adjacent side lengths of the minimum circumscribed rectangle;
and selecting a preset number of the cut images meeting a second preset condition in each cut image set to form a new image set.
27. An electronic device, comprising:
one or more processors;
a storage device for storing one or more programs,
when executed by the one or more processors, cause the one or more processors to implement the method of any one of claims 1-13.
28. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the method according to any one of claims 1-13.
CN201710500550.3A 2017-06-27 2017-06-27 Target retrieval method and device and electronic equipment Active CN108229289B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710500550.3A CN108229289B (en) 2017-06-27 2017-06-27 Target retrieval method and device and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710500550.3A CN108229289B (en) 2017-06-27 2017-06-27 Target retrieval method and device and electronic equipment

Publications (2)

Publication Number Publication Date
CN108229289A CN108229289A (en) 2018-06-29
CN108229289B true CN108229289B (en) 2021-02-05

Family

ID=62657338

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710500550.3A Active CN108229289B (en) 2017-06-27 2017-06-27 Target retrieval method and device and electronic equipment

Country Status (1)

Country Link
CN (1) CN108229289B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110929545A (en) * 2018-09-19 2020-03-27 传线网络科技(上海)有限公司 Human face image sorting method and device
CN110929546B (en) * 2018-09-19 2022-11-08 北京优酷科技有限公司 Face comparison method and device
CN110209895B (en) * 2019-06-06 2023-09-05 创新先进技术有限公司 Vector retrieval method, device and equipment
CN110955792A (en) * 2019-12-13 2020-04-03 云粒智慧科技有限公司 Searching method and device based on picture, electronic equipment and storage medium
CN111444366B (en) * 2020-04-10 2024-02-20 Oppo广东移动通信有限公司 Image classification method, device, storage medium and electronic equipment

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7130849B2 (en) * 2002-02-05 2006-10-31 Hitachi, Ltd. Similarity-based search method by relevance feedback
CN106354735A (en) * 2015-07-22 2017-01-25 杭州海康威视数字技术股份有限公司 Image target searching method and device
CN106682095A (en) * 2016-12-01 2017-05-17 浙江大学 Subjectterm and descriptor prediction and ordering method based on diagram

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8566321B2 (en) * 2011-03-11 2013-10-22 Amco Llc Relativistic concept measuring system for data clustering
US8831358B1 (en) * 2011-11-21 2014-09-09 Google Inc. Evaluating image similarity
CN104063719B (en) * 2014-06-27 2018-01-26 深圳市赛为智能股份有限公司 Pedestrian detection method and device based on depth convolutional network
CN105095884B (en) * 2015-08-31 2018-11-13 桂林电子科技大学 A kind of pedestrian's identifying system and processing method based on random forest support vector machines
CN106778464A (en) * 2016-11-09 2017-05-31 深圳市深网视界科技有限公司 A kind of pedestrian based on deep learning recognition methods and device again

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Spatial reasoning and similarity retrieval of images using 2D C-string knowledge representation; Suh-Yin Lee et al.; Pattern Recognition; March 1992; Vol. 25, No. 3; pp. 305-318 *
Research on image retrieval technology based on multi-similarity fusion; Shi Kuan; China Master's Theses Full-text Database, Information Science and Technology; 2015-08-15; No. 8; pp. I138-1285 *

Also Published As

Publication number Publication date
CN108229289A (en) 2018-06-29

Similar Documents

Publication Publication Date Title
CN108229289B (en) Target retrieval method and device and electronic equipment
CN109284729B (en) Method, device and medium for acquiring face recognition model training data based on video
US9014467B2 (en) Image processing method and image processing device
Peng et al. RGBD salient object detection: A benchmark and algorithms
JP7038744B2 (en) Face image retrieval methods and systems, imaging devices, and computer storage media
US9652694B2 (en) Object detection method, object detection device, and image pickup device
Ham et al. Automated content-based filtering for enhanced vision-based documentation in construction toward exploiting big visual data from drones
CN110532970B (en) Age and gender attribute analysis method, system, equipment and medium for 2D images of human faces
CN114258559A (en) Techniques for identifying skin tones in images with uncontrolled lighting conditions
US20160005171A1 (en) Image Analysis Device, Image Analysis System, and Image Analysis Method
US9418297B2 (en) Detecting video copies
US9626577B1 (en) Image selection and recognition processing from a video feed
US8374454B2 (en) Detection of objects using range information
JP2015082245A (en) Image processing apparatus, image processing method, and program
CN110059666B (en) Attention detection method and device
JP5936561B2 (en) Object classification based on appearance and context in images
US20220254134A1 (en) Region recognition method, apparatus and device, and readable storage medium
CN113496208B (en) Video scene classification method and device, storage medium and terminal
WO2021169642A1 (en) Video-based eyeball turning determination method and system
US20170053172A1 (en) Image processing apparatus, and image processing method
KR20180119013A (en) Method and apparatus for retrieving image using convolution neural network
Shi et al. Phenotyping multiple maize ear traits from a single image: Kernels per ear, rows per ear, and kernels per row
CN110019951B (en) Method and equipment for generating video thumbnail
Nugroho et al. Comparison of deep learning-based object classification methods for detecting tomato ripeness
CN114005140A (en) Personnel identification method, device, equipment, pedestrian monitoring system and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant