CN114880513A - Target retrieval method and related device

Target retrieval method and related device

Info

Publication number
CN114880513A
Authority
CN
China
Prior art keywords
image
features
retrieved
target
fusion
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210538456.8A
Other languages
Chinese (zh)
Inventor
鲁逸峰
周祥明
郑春煌
吴剑峰
韩加旭
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang Dahua Technology Co Ltd
Original Assignee
Zhejiang Dahua Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang Dahua Technology Co Ltd filed Critical Zhejiang Dahua Technology Co Ltd
Priority to CN202210538456.8A
Publication of CN114880513A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50 Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/58 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/583 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/44 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/74 Image or video pattern matching; Proximity measures in feature spaces
    • G06V10/75 Organisation of the matching processes, e.g. simultaneous or sequential comparisons of image or video features; Coarse-fine approaches, e.g. multi-scale approaches; using context analysis; Selection of dictionaries
    • G06V10/751 Comparing pixel values or logical combinations thereof, or feature values having positional relevance, e.g. template matching
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Databases & Information Systems (AREA)
  • Multimedia (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Library & Information Science (AREA)
  • Image Analysis (AREA)

Abstract

The application relates to the technical field of computers, and in particular to a target retrieval method and a related device, which are used to improve retrieval efficiency and accuracy. The method includes: after an image to be retrieved containing a target retrieval object is obtained, inputting the image to be retrieved into a target classification model that includes a feature extraction layer and an output layer; obtaining visual features from the feature extraction layer and semantic features from the output layer; performing feature fusion on the semantic features and the visual features to obtain a target fusion feature corresponding to the image to be retrieved; and then determining, from the candidate images based on the target fusion feature, a target image matching the image to be retrieved.

Description

Target retrieval method and related device
Technical Field
The present application relates to the field of image processing technologies, and in particular, to a target retrieval method and a related apparatus.
Background
With the continuous development of computer technology, visual target retrieval is applied in more and more scenarios. Visual target retrieval compares the feature vector of an image to be detected with the feature vectors of a massive number of images in a database, so as to determine, from those images, a target image similar to the image to be detected.
In the related art, a machine learning model is usually used to obtain the feature vector of the image to be detected. However, the feature vector obtained by the machine learning model has poor interpretability, and its feature dimension depends on the output dimension of the model, so it is difficult to adapt well to feature types outside the training set. For example, a classification model trained with images of cats is not suitable for classifying and retrieving dogs; it has poor universality and must be retrained if objects of other classes are to be retrieved, which reduces retrieval efficiency.
Disclosure of Invention
The application provides a target retrieval method and a related device, which are used to improve the robustness and scene adaptability of the features and to improve retrieval efficiency and accuracy.
The embodiment of the application provides the following specific technical scheme:
in a first aspect, a target retrieval method includes:
acquiring an image to be retrieved containing a target retrieval object;
inputting the image to be retrieved into a target classification model comprising a feature extraction layer and an output layer, obtaining visual features output by the feature extraction layer, and obtaining semantic features output by the output layer, wherein the semantic features are used for representing classification results of the target detection object;
performing feature fusion on the semantic features and the visual features to obtain target fusion features corresponding to the image to be retrieved;
and determining at least one target image matched with the image to be retrieved from the candidate images based on the target fusion characteristics corresponding to the candidate images and the image to be retrieved respectively.
Optionally, the performing feature fusion on the semantic features and the visual features to obtain target fusion features corresponding to the image to be retrieved includes:
splicing the semantic features and the visual features to obtain initial fusion features corresponding to the image to be retrieved, and directly taking the initial fusion features as the target fusion features; or,
splicing the semantic features and the visual features to obtain initial fusion features corresponding to the image to be retrieved, obtaining initial fusion features corresponding to images to be fused associated with the image to be retrieved, and obtaining the target fusion features based on the initial fusion features corresponding to the image to be retrieved and the initial fusion features corresponding to the images to be fused.
Optionally, the splicing the semantic features and the visual features to obtain initial fusion features corresponding to the image to be retrieved includes:
splicing the semantic features and the visual features according to a specified feature splicing sequence;
and weighting the spliced features based on the preset weight coefficients corresponding to the semantic features and the visual features respectively to obtain initial fusion features corresponding to the image to be retrieved.
Optionally, the obtaining the target fusion feature based on the initial fusion feature corresponding to the image to be retrieved and the initial fusion feature corresponding to each image to be fused includes:
based on the weight coefficients corresponding to the image to be retrieved and the images to be fused, carrying out weighted summation on the initial fusion features corresponding to the image to be retrieved and the initial fusion features corresponding to the images to be fused;
and averaging the fusion features obtained after weighted summation based on the number of the images to be fused to obtain the target fusion features.
Optionally, the semantic features include classification confidence degrees and normalization information; the splicing the semantic features and the visual features to obtain initial fusion features corresponding to the image to be retrieved comprises:
based on the normalization information contained in the semantic features, performing normalization processing on the classification confidence degrees and the visual features contained in the semantic features to obtain the classification confidence degrees and the visual features which accord with a preset value range;
and splicing the classification confidence coefficient and the visual characteristic which accord with a preset value range to obtain an initial fusion characteristic corresponding to the image to be retrieved.
Optionally, the classification confidences include category confidences and/or attribute confidences.
Optionally, the image to be retrieved is a video frame in a video, and the video further includes other video frames;
before the obtaining of the initial fusion features corresponding to the images to be fused associated with the images to be retrieved, the method further includes:
taking other video frames which contain the target retrieval object and have playing time earlier than the video frame as images to be fused associated with the images to be retrieved;
for any image to be fused in the images to be fused, the following operations are executed:
inputting the any image to be fused into the target classification model to obtain a visual feature and a semantic feature corresponding to the any image to be fused, wherein the semantic feature of the any image to be fused is used for representing a classification result of the any image to be fused;
and obtaining the initial fusion feature corresponding to any image to be fused based on the visual feature and the semantic feature corresponding to any image to be fused.
Optionally, before determining at least one target image matched with the image to be retrieved from each candidate image based on the target fusion features corresponding to each candidate image and the image to be retrieved, the method further includes:
determining at least one candidate area contained in each candidate image from each candidate image, wherein each candidate area contains a retrieval object of one retrieval type;
for any one of the determined candidate regions, performing the following operations:
inputting the any candidate region into the target classification model to obtain a visual feature and a semantic feature corresponding to the any candidate region, and obtaining a target fusion feature corresponding to the any candidate region based on the visual feature and the semantic feature corresponding to the any candidate region;
and recording the mapping relation between the target fusion characteristics corresponding to any one candidate region and the corresponding candidate image.
Optionally, the determining, from the candidate images, at least one target image matched with the image to be retrieved based on the target fusion features corresponding to the candidate images and the image to be retrieved includes:
calculating the similarity between the target fusion characteristics corresponding to the candidate images and the target fusion characteristics corresponding to the images to be retrieved;
and determining at least one target image matched with the image to be retrieved from candidate images with the similarity greater than a preset threshold value based on the calculated similarity.
In a second aspect, a target retrieval apparatus includes:
the device comprises an acquisition unit, a retrieval unit and a retrieval unit, wherein the acquisition unit is used for acquiring an image to be retrieved containing a target retrieval object;
the output unit is used for inputting the image to be retrieved into a target classification model comprising a feature extraction layer and an output layer, obtaining visual features output by the feature extraction layer and obtaining semantic features output by the output layer, wherein the semantic features are used for representing classification results of the target detection object;
the fusion unit is used for performing feature fusion on the semantic features and the visual features to obtain target fusion features corresponding to the image to be retrieved;
and the matching unit is used for determining at least one target image matched with the image to be retrieved from each candidate image based on the target fusion characteristics corresponding to each candidate image and the image to be retrieved.
Optionally, when performing feature fusion on the semantic features and the visual features to obtain target fusion features corresponding to the image to be retrieved, the fusion unit is specifically configured to:
splicing the semantic features and the visual features to obtain initial fusion features corresponding to the image to be retrieved, and directly taking the initial fusion features as the target fusion features; or,
splicing the semantic features and the visual features to obtain initial fusion features corresponding to the image to be retrieved, obtaining initial fusion features corresponding to images to be fused associated with the image to be retrieved, and obtaining the target fusion features based on the initial fusion features corresponding to the image to be retrieved and the initial fusion features corresponding to the images to be fused.
Optionally, when the semantic features and the visual features are spliced to obtain initial fusion features corresponding to the image to be retrieved, the fusion unit is specifically configured to:
splicing the semantic features and the visual features according to a specified feature splicing sequence;
and weighting the spliced features based on the preset weight coefficients corresponding to the semantic features and the visual features respectively to obtain initial fusion features corresponding to the image to be retrieved.
Optionally, when the target fusion feature is obtained based on the initial fusion feature corresponding to the image to be retrieved and the initial fusion feature corresponding to each image to be fused, the fusion unit is specifically configured to:
based on the weight coefficients corresponding to the image to be retrieved and the images to be fused, carrying out weighted summation on the initial fusion features corresponding to the image to be retrieved and the initial fusion features corresponding to the images to be fused;
and averaging the fusion features obtained after weighted summation based on the number of the images to be fused to obtain the target fusion features.
Optionally, the semantic features include classification confidence degrees and normalization information; when the semantic features and the visual features are spliced to obtain initial fusion features corresponding to the image to be retrieved, the fusion unit is specifically configured to:
based on the normalization information contained in the semantic features, performing normalization processing on the classification confidence degrees and the visual features contained in the semantic features to obtain the classification confidence degrees and the visual features which accord with a preset value range;
and splicing the classification confidence coefficient and the visual characteristic which accord with a preset value range to obtain an initial fusion characteristic corresponding to the image to be retrieved.
Optionally, the classification confidences include category confidences and/or attribute confidences.
Optionally, the image to be retrieved is a video frame in a video, and the video further includes other video frames;
before the initial fusion features corresponding to the images to be fused associated with the images to be retrieved are obtained, the fusion unit is further configured to:
taking other video frames which contain the target retrieval object and have playing time earlier than the video frame as images to be fused associated with the images to be retrieved;
for any image to be fused in the images to be fused, executing the following operations:
inputting any image to be fused into the target classification model to obtain a visual feature and a semantic feature corresponding to the any image to be fused, wherein the semantic feature of the any image to be fused is used for representing a classification result of the any image to be fused;
and obtaining the initial fusion feature corresponding to any image to be fused based on the visual feature and the semantic feature corresponding to any image to be fused.
Optionally, before determining at least one target image matched with the image to be retrieved from the candidate images based on the target fusion features corresponding to the candidate images and the image to be retrieved, the fusion unit is further configured to:
determining at least one candidate area contained in each candidate image from each candidate image, wherein each candidate area contains a retrieval object of one retrieval type;
for any one of the determined candidate regions, performing the following operations:
inputting the any candidate region into the target classification model to obtain a visual feature and a semantic feature corresponding to the any candidate region, and obtaining a target fusion feature corresponding to the any candidate region based on the visual feature and the semantic feature corresponding to the any candidate region;
and recording the mapping relation between the target fusion characteristics corresponding to any one candidate region and the corresponding candidate image.
Optionally, when determining at least one target image matched with the image to be retrieved from the candidate images based on the target fusion features corresponding to the candidate images and the image to be retrieved, the matching unit is specifically configured to:
calculating the similarity between the target fusion characteristics corresponding to the candidate images and the target fusion characteristics corresponding to the images to be retrieved;
and determining at least one target image matched with the image to be retrieved from candidate images with the similarity greater than a preset threshold value based on the calculated similarity.
In a third aspect, an embodiment of the present application provides an electronic device, including a processor and a memory, where the memory stores a computer program, and when the computer program is executed by the processor, the processor is caused to execute the steps of the above-mentioned object retrieval method.
In a fourth aspect, an embodiment of the present application provides a computer-readable storage medium, which includes a computer program, when the computer program runs on an electronic device, the computer program is configured to enable the electronic device to execute the steps of the above-mentioned object retrieval method.
In a fifth aspect, the present application provides a computer program product, where the program product includes a computer program, where the computer program is stored in a computer-readable storage medium, and a processor of an electronic device reads and executes the computer program from the computer-readable storage medium, so that the electronic device executes the steps of the above object retrieval method.
In summary, in the embodiment of the present application, after an image to be retrieved containing a target retrieval object is obtained, the image to be retrieved is input into a target classification model that includes a feature extraction layer and an output layer; visual features are obtained from the feature extraction layer and semantic features from the output layer; feature fusion is then performed on the semantic features and the visual features to obtain a target fusion feature corresponding to the image to be retrieved; and a target image matching the image to be retrieved is determined from the candidate images based on the target fusion feature. Because the target fusion feature contains both the visual features and the semantic features, where the visual features have good interpretability and adapt well to feature types outside the training set, and the semantic features have good accuracy and robustness, the target fusion feature has better robustness and wider scene adaptability, which improves the accuracy and efficiency of image retrieval.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art are briefly described below. It is obvious that the drawings in the following description are only some embodiments of the present application, and those skilled in the art can obtain other drawings from these drawings without creative effort.
Fig. 1 is a schematic view of an application scenario provided in an embodiment of the present application;
fig. 2 is a schematic flowchart of a target retrieval method provided in an embodiment of the present application;
FIG. 3 is a logic diagram of a feature stitching provided in an embodiment of the present application;
FIG. 4 is a schematic diagram of a logic for determining an image to be fused according to an embodiment of the present disclosure;
FIG. 5 is a logic diagram for determining a target fusion feature provided in an embodiment of the present application;
FIG. 6 is a logic diagram of a target retrieval process provided in an embodiment of the present application;
fig. 7 is a schematic structural diagram of a target retrieval apparatus provided in an embodiment of the present application;
fig. 8 is a schematic structural diagram of an electronic device provided in an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some embodiments, but not all embodiments, of the technical solutions of the present application. All other embodiments obtained by a person skilled in the art without any inventive step based on the embodiments described in the present application are within the scope of the protection of the present application.
It is to be noted that, in this document, relational terms such as first and second, and the like are used solely to distinguish one entity or operation from another entity or operation without necessarily requiring or implying any actual such relationship or order between such entities or operations.
While the preferred embodiments of the present application have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all alterations and modifications as fall within the scope of the application.
The preferred embodiments of the present application will be described in conjunction with the drawings of the specification, it should be understood that the preferred embodiments described herein are for purposes of illustration and explanation only and are not intended to limit the present application, and features of the embodiments and examples of the present application may be combined with each other without conflict.
Fig. 1 is a schematic diagram of a possible application scenario provided in an embodiment of the present application, where the application scenario at least includes a terminal device 110 and a server 120. The number of the terminal devices 110 may be one or more, the number of the servers 120 may also be one or more, and the number of the terminal devices 110 and the number of the servers 120 are not particularly limited in the present application.
The terminal device 110 has an application with information processing functions such as information verification and information retrieval, where the application may be a client application, a web page application, an applet application, and the like. The terminal device 110 may be, but is not limited to, a smart phone, a tablet computer, a laptop computer, a desktop computer, a smart speaker, a smart watch, a smart home appliance, a vehicle-mounted terminal, an aircraft, and the like.
The server 120 may be a background server of the application, and provides a corresponding information verification service for the application. The server 120 may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as cloud service, a cloud database, cloud computing, a cloud function, cloud storage, Network service, cloud communication, middleware service, domain name service, security service, Content Delivery Network (CDN), big data, and an artificial intelligence platform.
The terminal device 110 and the server 120 may be directly or indirectly connected through wired or wireless communication, and communicate and transmit data through a network, which is not limited in this application.
The target retrieval method mentioned in the present application can be applied to the terminal device or the server individually, or can be executed by the terminal device and the server together.
For example, after acquiring an image to be retrieved containing a target retrieval object, the terminal device inputs the image to be retrieved into a target classification model to obtain a visual feature output by a feature extraction layer and a semantic feature output by an output layer, then performs feature fusion on the semantic feature and the visual feature to obtain a target fusion feature corresponding to the image to be retrieved, and further determines a target image matched with the image to be retrieved from each candidate image based on the target fusion feature.
For another example, after the server acquires an image to be retrieved containing a target retrieval object, the image to be retrieved is input into the target classification model to obtain a visual feature output by the feature extraction layer and a semantic feature output by the output layer, then feature fusion is performed on the semantic feature and the visual feature to obtain a target fusion feature corresponding to the image to be retrieved, and then the target image matched with the image to be retrieved is determined from each candidate image based on the target fusion feature.
For another example, after acquiring the image to be retrieved containing the target retrieval object, the terminal device sends the image to be retrieved to the server. The method comprises the steps that after receiving an image to be retrieved, a server inputs the image to be retrieved into a target classification model to obtain visual features output by a feature extraction layer and semantic features output by an output layer, feature fusion is conducted on the semantic features and the visual features to obtain target fusion features corresponding to the image to be retrieved, the target fusion features corresponding to the image to be retrieved are sent to a terminal device, and the terminal device determines a target image matched with the image to be retrieved from candidate images based on the target fusion features.
Fig. 2 is a schematic flowchart of a target retrieval method provided in an embodiment of the present application, where the flowchart may be applied to an electronic device, and the electronic device may be a server or a terminal device, and the following description only takes the terminal device as an example, and the specific flowchart is as follows:
S201, the terminal device obtains an image to be retrieved containing a target retrieval object.
S202, the terminal device inputs the image to be retrieved into a target classification model comprising a feature extraction layer and an output layer, obtains visual features output by the feature extraction layer, and obtains semantic features output by the output layer, wherein the semantic features are used for representing classification results of target detection objects.
It should be noted that, in the embodiment of the present application, the visual features include, but are not limited to, Speeded-Up Robust Features (SURF), Histogram of Oriented Gradients (HOG) features, and Haar features.
The semantic features may include classification confidence levels, which include category confidence levels and/or attribute confidence levels. Categories include, but are not limited to, pedestrian, automotive, non-automotive, watercraft, aircraft, and the like. Attributes include, but are not limited to, color, vehicle type, whether glasses are worn, etc. The semantic features may also contain normalization information, including a normalized width and a normalized height.
The feature vector a1 is used to represent the semantic features. Assume that, in the semantic features output by the target classification model, the number of categories is n1, the number of attributes is n2, l is the normalized width, and h is the normalized height. The category confidences, attribute confidences, normalized width, and normalized height are encoded into the feature vector a1 in a specified order, so a1 can be represented as [category 1 confidence, category 2 confidence, …, category n1 confidence, attribute 1 confidence, attribute 2 confidence, …, attribute n2 confidence, l, h], and the length of feature vector a1 is n1 + n2 + 2.
Here, l = lt/lp and h = ht/hp, where lp is the width of the original image, hp is the height of the original image, lt is the width of the target image, and ht is the height of the target image.
For example, if the categories include pedestrian and vehicle, and the attributes include red, orange, yellow, and green, the feature vector a1 can be expressed as [pedestrian confidence, vehicle confidence, red confidence, orange confidence, yellow confidence, green confidence, l, h].
The visual features are represented by feature vector a2. Assuming that the feature map output by the feature extraction layer has width w, height z, and c channels, the length of feature vector a2 is w × z × c.
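For illustration only, the encoding of a1 and a2 described above can be sketched as follows; the confidence values, box sizes, and feature-map shape are hypothetical placeholders, not values from the patent:

```python
import numpy as np

# Semantic feature vector a1: category confidences, attribute confidences,
# then normalized width l = lt/lp and normalized height h = ht/hp.
category_conf = [0.92, 0.03]                # e.g. [pedestrian, vehicle]        -> n1 = 2
attribute_conf = [0.10, 0.05, 0.70, 0.15]   # e.g. [red, orange, yellow, green] -> n2 = 4
lt, ht, lp, hp = 120, 260, 1920, 1080       # assumed target size and original image size
a1 = np.array(category_conf + attribute_conf + [lt / lp, ht / hp])
assert len(a1) == len(category_conf) + len(attribute_conf) + 2   # n1 + n2 + 2

# Visual feature vector a2: the extraction-layer feature map
# (width w, height z, c channels) flattened into a vector of length w*z*c.
w, z, c = 7, 7, 256                          # assumed feature-map shape
feature_map = np.random.rand(w, z, c)        # stand-in for the real extraction-layer output
a2 = feature_map.reshape(-1)
assert len(a2) == w * z * c
```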
S203, the terminal device performs feature fusion on the semantic features and the visual features to obtain the target fusion feature corresponding to the image to be retrieved.
In S203, the semantic features and the visual features may be directly fused, or the semantic features and the visual features may be fused in combination with the image to be fused associated with the image to be retrieved. Two feature fusion methods are described below.
Mode 1: the terminal device splices the semantic features and the visual features to obtain the initial fusion feature corresponding to the image to be retrieved, and directly uses the initial fusion feature as the target fusion feature.
When splicing the semantic features and the visual features to obtain the initial fusion feature corresponding to the image to be retrieved, the terminal device may splice the semantic features and the visual features in a specified feature splicing order, and then weight the spliced features based on preset weight coefficients corresponding to the semantic features and the visual features respectively, to obtain the initial fusion feature corresponding to the image to be retrieved.
Herein, the semantic features are represented by feature vector a1, the visual features are represented by feature vector a2, and the initial fusion features are represented by feature vector a. The preset weight coefficient is used for representing the weight ratio of the corresponding features, and can be set according to a specific application scene.
In the embodiment of the present application, the feature extraction layer may also be referred to as a network feature layer, and the output layer may also be referred to as a network output layer.
Referring to fig. 3, assuming that the feature splicing order is the semantic features followed end to end by the visual features, the feature vector a can be represented as {p1·a1, p2·a2}, where p1 is the weight coefficient corresponding to the semantic features, p2 is the weight coefficient corresponding to the visual features, and the length of the fused feature vector a is the sum of the lengths of feature vector a1 and feature vector a2. Illustratively, p1 = 0.5 and p2 = 0.5.
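A minimal sketch of this weighted splicing, using small example vectors and the illustrative weights p1 = p2 = 0.5:

```python
import numpy as np

def splice_features(a1: np.ndarray, a2: np.ndarray,
                    p1: float = 0.5, p2: float = 0.5) -> np.ndarray:
    """Splice semantic and visual features end to end, each scaled by its weight coefficient."""
    return np.concatenate([p1 * a1, p2 * a2])

a1 = np.array([0.92, 0.03, 0.10, 0.05, 0.70, 0.15, 0.0625, 0.2407])  # example semantic feature
a2 = np.random.rand(7 * 7 * 256)                                      # example flattened visual feature

a = splice_features(a1, a2)            # initial fusion feature
assert len(a) == len(a1) + len(a2)     # length is the sum of both lengths
```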
In some embodiments, if the semantic features include each classification confidence and normalization information, the terminal device splices the semantic features and the visual features to obtain initial fusion features corresponding to the image to be retrieved, including:
based on normalization information contained in the semantic features, performing normalization processing on each classification confidence coefficient and each visual feature contained in the semantic features to obtain each classification confidence coefficient and each visual feature which accord with a preset value range;
and splicing the classification confidence coefficient and the visual characteristic which accord with the preset value range to obtain the initial fusion characteristic corresponding to the image to be retrieved.
For example, if the preset value range is 0 to 1, the terminal device performs normalization processing on each classification confidence and the visual features contained in the semantic features based on the normalization information contained in the semantic features, so that after normalization each component of the classification confidences and each component of the visual features takes a value between 0 and 1.
It should be noted that, in this embodiment of the application, the terminal device may also perform normalization processing in the feature extraction layer to obtain each classification confidence that meets the preset value range, and correspondingly, in the process of splicing the semantic features and the visual features, perform normalization processing on the visual features according to normalization information included in the semantic features to obtain the visual features that meet the preset value range.
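The patent does not spell out the exact normalization operation, so the sketch below uses min-max scaling as an assumption; it only illustrates the stated requirement that every component of the spliced feature lies within the preset range of 0 to 1:

```python
import numpy as np

def to_unit_range(v: np.ndarray) -> np.ndarray:
    """Assumed min-max normalization: scale every component into [0, 1]."""
    lo, hi = float(v.min()), float(v.max())
    return np.zeros_like(v) if hi == lo else (v - lo) / (hi - lo)

confidences = np.array([0.92, 0.03, 0.10, 0.05, 0.70, 0.15])   # example classification confidences
visual = np.random.rand(7 * 7 * 256) * 4.0                      # example un-normalized visual feature

initial_fusion = np.concatenate([to_unit_range(confidences), to_unit_range(visual)])
assert initial_fusion.min() >= 0.0 and initial_fusion.max() <= 1.0
```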
Mode 2: the terminal device splices the semantic features and the visual features to obtain the initial fusion feature corresponding to the image to be retrieved, obtains the initial fusion features corresponding to the images to be fused associated with the image to be retrieved, and obtains the target fusion feature based on the initial fusion feature corresponding to the image to be retrieved and the initial fusion features corresponding to the images to be fused.
The process of obtaining the initial fusion feature corresponding to the image to be retrieved by the terminal device in the mode 2 is the same as the process of obtaining the initial fusion feature corresponding to the image to be retrieved by the terminal device in the mode 1 by splicing the semantic feature and the visual feature, and is not repeated herein.
In this embodiment of the application, if the image to be retrieved is a video frame in a video, and the video further includes at least one other video frame in addition to this video frame, the terminal device may use the other video frames that contain the target retrieval object and whose playing time is earlier than this video frame as the images to be fused associated with the image to be retrieved. That is, the terminal device may use historical video frames containing the target retrieval object as the images to be fused associated with the image to be retrieved. The terminal device then performs the following operations for any image to be fused among the images to be fused:
inputting any image to be fused into a target classification model to obtain a visual feature and a semantic feature corresponding to any image to be fused, wherein the semantic feature of any image to be fused is used for representing the classification result of any image to be fused;
and obtaining the initial fusion feature corresponding to any image to be fused based on the visual feature and the semantic feature corresponding to any image to be fused.
It should be noted that, in this embodiment of the application, the fact that the image to be retrieved is one video frame in the video may refer to that the image to be retrieved is a certain video frame in the video, and may also refer to that the image to be retrieved is a partial image included in a certain video frame in the video. Since the generation process of the initial fusion features corresponding to each image to be fused is the same as the generation process of the initial fusion features corresponding to the image to be retrieved, the description is omitted here.
Suppose that a video includes m video frames, namely video frame 1, video frame 2, …, video frame i, …, video frame m, played in that order, and that the image to be retrieved is video frame i.
Suppose that n of the m video frames contain the target retrieval object, namely video frame k1, video frame k2, …, video frame ki, …, video frame kn, and that among them the frames played earlier than the video frame of the image to be retrieved are video frame k1, video frame k2, …, video frame ki-1; then the images to be fused include video frame k1, video frame k2, …, video frame ki-1.
The terminal device generates corresponding initial fusion features for video frame k1, video frame k2, …, and video frame ki-1, respectively; these initial fusion features are denoted a(k1), a(k2), …, a(ki-1), and the initial fusion feature corresponding to the image to be retrieved is denoted a(ki).
For example, referring to fig. 4, a video includes 10 video frames, namely video frame 1, video frame 2, …, video frame 10. Assume that the image to be retrieved is video frame 5 and the target retrieval object is a pedestrian in video frame 5. Among the 10 video frames, the other video frames that contain the target retrieval object and are played earlier than video frame 5 are video frame 1, video frame 2, video frame 3, and video frame 4, so the terminal device takes video frame 1, video frame 2, video frame 3, and video frame 4 as the images to be fused.
Referring to fig. 5, the terminal device obtains the initial fusion features corresponding to video frames 1, 2, 3, and 4, respectively, where the initial fusion feature corresponding to video frame 1 is a(k1), that of video frame 2 is a(k2), that of video frame 3 is a(k3), that of video frame 4 is a(k4), and the initial fusion feature corresponding to video frame 5 is a(k5).
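A sketch of how the images to be fused could be selected; the Frame structure and its fields are hypothetical stand-ins for the tracking and feature-extraction results described above:

```python
from dataclasses import dataclass
from typing import List
import numpy as np

@dataclass
class Frame:
    index: int                  # playing order within the video
    contains_target: bool       # whether the target retrieval object appears in the frame
    initial_fusion: np.ndarray  # initial fusion feature a(k) of the frame

def frames_to_fuse(frames: List[Frame], query_index: int) -> List[Frame]:
    """Keep frames that contain the target and are played earlier than the query frame."""
    return [f for f in frames if f.contains_target and f.index < query_index]
```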
Specifically, based on the initial fusion features corresponding to the images to be retrieved and the initial fusion features corresponding to the images to be fused, the terminal device may obtain the target fusion features in the following manner:
A. The terminal device performs weighted summation on the initial fusion feature corresponding to the image to be retrieved and the initial fusion features corresponding to the images to be fused, based on the weight coefficients corresponding to the image to be retrieved and each image to be fused.
It should be noted that, in the embodiment of the present application, the weight coefficient corresponding to each of the image to be retrieved and each of the image to be fused may be determined according to the number of each of the image to be fused, or may be preset, which is not limited to this.
Assuming that the weight coefficients corresponding to the images to be fused are q1, q2, …, qi-1, and the weight coefficient corresponding to the image to be retrieved is qi, the fusion feature obtained after weighted summation can be represented as q1·a(k1) + q2·a(k2) + … + qi-1·a(ki-1) + qi·a(ki).
Still taking video frame 5 as the image to be retrieved, and assuming that the weight coefficients corresponding to the image to be retrieved and each image to be fused are all 1, the fusion feature obtained after weighted summation is a(k1) + a(k2) + a(k3) + a(k4) + a(k5).
B. The terminal device averages the fusion feature obtained after weighted summation based on the number of images to be fused, to obtain the target fusion feature.
Assuming that the weight coefficients corresponding to the image to be retrieved and each image to be fused are all 1, and denoting the target fusion feature by ā, the target fusion feature can be calculated with the following formula:

ā = [a(k1) + a(k2) + … + a(ki-1) + a(ki)] / i
Still taking video frame 5 as the image to be retrieved, the number of images to be fused is 4, and averaging the fusion feature obtained after weighted summation yields the target fusion feature, which is represented as {a(k1) + a(k2) + a(k3) + a(k4) + a(k5)}/5.
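A sketch of steps A and B above, assuming unit weights as in the example; note that, following the video frame 5 example, the division is by the total number of features summed (the images to be fused plus the image to be retrieved):

```python
from typing import List, Optional
import numpy as np

def target_fusion_feature(query_feat: np.ndarray,
                          fuse_feats: List[np.ndarray],
                          weights: Optional[List[float]] = None) -> np.ndarray:
    """Weighted sum of the initial fusion features, averaged over the number of summed features."""
    feats = fuse_feats + [query_feat]
    if weights is None:
        weights = [1.0] * len(feats)          # unit weights, as in the example
    total = sum(w * f for w, f in zip(weights, feats))
    return total / len(feats)

# With video frame 5 as the query and frames 1-4 as images to be fused, this computes
# {a(k1) + a(k2) + a(k3) + a(k4) + a(k5)} / 5.
```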
It should be noted that, in the embodiment of the present application, the image to be fused may also be another video frame in the video within the specified time length, which includes the target retrieval object and has a playing time earlier than that of the video frame.
S204, the terminal device determines at least one target image matched with the image to be retrieved from the candidate images based on the target fusion characteristics corresponding to the candidate images and the image to be retrieved.
Specifically, when S204 is executed, the following manners may be adopted, but not limited to:
the terminal device calculates the target fusion features corresponding to the candidate images respectively, the similarity between the candidate images and the target fusion features corresponding to the images to be retrieved respectively, and then determines at least one target image matched with the images to be retrieved from the candidate images with the similarity between the candidate images and the images to be retrieved larger than a preset threshold value based on the calculated similarity.
For example, the similarity between the target fusion feature corresponding to the image to be retrieved and the target fusion feature corresponding to any candidate image may be, but is not limited to, the cosine similarity, calculated as follows:

s_x = (ā · b_x) / (‖ā‖ ‖b_x‖)

where s_x represents the similarity, ā represents the target fusion feature corresponding to the image to be retrieved, and b_x represents the target fusion feature corresponding to any one candidate image.

In the embodiment of the application, when the semantic features and the visual features are normalized to the range 0 to 1, the components of ā and b_x also lie between 0 and 1, so the calculated s_x likewise ranges from 0 to 1: the larger the value of s_x, the higher the similarity between ā and b_x, and the smaller the value of s_x, the lower the similarity between ā and b_x.
Based on the calculated similarities, when determining target images matching the image to be retrieved from the candidate images whose similarity is greater than the preset threshold, the terminal device first filters the candidate images by the preset threshold to obtain a candidate image set whose similarity is greater than the preset threshold, and then selects a set number of target images from the candidate image set in descending order of similarity. Further, the terminal device may also display the determined target images on the operation interface.
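A sketch of the matching step: cosine similarity against each stored candidate feature, filtering by a preset threshold, and returning a set number of best matches. The threshold value and the number of returned images are illustrative assumptions:

```python
from typing import Dict, List, Tuple
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def retrieve(query_feat: np.ndarray,
             candidate_feats: Dict[str, np.ndarray],
             threshold: float = 0.8,
             top_k: int = 10) -> List[Tuple[str, float]]:
    """Return up to top_k candidate ids whose similarity exceeds the threshold, best first."""
    scored = [(cid, cosine_similarity(query_feat, feat))
              for cid, feat in candidate_feats.items()]
    kept = [(cid, s) for cid, s in scored if s > threshold]
    return sorted(kept, key=lambda item: item[1], reverse=True)[:top_k]
```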
In some embodiments, before determining at least one target image matching the image to be retrieved from the candidate images based on the target fusion features corresponding to the candidate images and the image to be retrieved, the terminal device may further determine at least one candidate region included in each candidate image from the candidate images, where each candidate region includes a retrieval object of one retrieval type.
The terminal device may perform the following operations for any one of the determined candidate regions:
inputting any candidate region into a target classification model to obtain a visual feature and a semantic feature corresponding to any candidate region, and obtaining a target fusion feature corresponding to any candidate region based on the visual feature and the semantic feature corresponding to any candidate region;
and recording the mapping relation between the target fusion characteristics corresponding to any candidate region and the corresponding candidate image.
Since the process of obtaining the target fusion feature corresponding to any one candidate region is the same as the process of obtaining the target fusion feature corresponding to the image to be retrieved, the details are not repeated here.
In the embodiment of the application, the candidate image may be a certain video frame in a video or may be a picture.
For example, assume that candidate image 1 contains a pedestrian and a car. The terminal device determines candidate region 1 and candidate region 2 from candidate image 1, where candidate region 1 contains the pedestrian and candidate region 2 contains the car. For candidate region 1, the terminal device inputs candidate region 1 into the target classification model to obtain the visual feature and semantic feature corresponding to candidate region 1, obtains the target fusion feature corresponding to candidate region 1 based on the visual feature and semantic feature corresponding to candidate region 1, and records the mapping relationship between the target fusion feature corresponding to candidate region 1 and candidate image 1. Similarly, the terminal device obtains the target fusion feature corresponding to candidate region 2 and records the mapping relationship between the target fusion feature corresponding to candidate region 2 and candidate image 1.
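The offline indexing of candidate images can be sketched as follows; detect_regions and fusion_feature are hypothetical placeholders for the region detector and the classification-plus-fusion pipeline described above:

```python
from typing import Callable, Dict, List, Tuple
import numpy as np

def build_index(candidate_images: Dict[str, np.ndarray],
                detect_regions: Callable[[np.ndarray], List[np.ndarray]],
                fusion_feature: Callable[[np.ndarray], np.ndarray]
                ) -> List[Tuple[np.ndarray, str]]:
    """Record, for each candidate region, its target fusion feature and the candidate image it maps to."""
    index: List[Tuple[np.ndarray, str]] = []
    for image_id, image in candidate_images.items():
        for region in detect_regions(image):       # one region per retrieval object / type
            feat = fusion_feature(region)           # visual + semantic -> target fusion feature
            index.append((feat, image_id))          # mapping: fusion feature -> candidate image
    return index
```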
The present application will be described below with reference to a specific example.
Referring to fig. 6, in a video surveillance scenario, video frame 4 in a surveillance video is used as the image to be retrieved, and the target retrieval object contained in the image to be retrieved is a vehicle. After acquiring video frame 4, the terminal device obtains the initial fusion feature corresponding to video frame 4. Assuming that the images to be fused associated with video frame 4 are video frame 1, video frame 2, and video frame 3, the terminal device obtains the target fusion feature corresponding to the image to be retrieved based on the initial fusion features corresponding to video frames 1, 2, and 3 and the initial fusion feature corresponding to video frame 4.
Then, the terminal device matches the target fusion feature corresponding to the image to be retrieved against the target fusion features corresponding to candidate image 1, which include the target fusion feature corresponding to candidate region 1 and the target fusion feature corresponding to candidate region 2. Based on the target fusion features corresponding to the candidate image and the target fusion feature corresponding to the image to be retrieved, it is determined that candidate image 1 matches the image to be retrieved; specifically, candidate region 1 in candidate image 1 matches the image to be retrieved.
Based on the same inventive concept, referring to fig. 7, an embodiment of the present application provides a target retrieval apparatus, where the target retrieval apparatus 700 at least includes:
an obtaining unit 701, configured to obtain an image to be retrieved including a target retrieval object;
an output unit 702, configured to input the image to be retrieved into a target classification model including a feature extraction layer and an output layer, to obtain a visual feature output by the feature extraction layer, and obtain a semantic feature output by the output layer, where the semantic feature is used to characterize a classification result of the target detection object;
a fusion unit 703, configured to perform feature fusion on the semantic features and the visual features to obtain target fusion features corresponding to the image to be retrieved;
a matching unit 704, configured to determine, from each candidate image, at least one target image that matches the image to be retrieved based on the target fusion features corresponding to each candidate image and the image to be retrieved.
Optionally, when performing feature fusion on the semantic features and the visual features to obtain a target fusion feature corresponding to the image to be retrieved, the fusion unit 703 is specifically configured to:
splicing the semantic features and the visual features to obtain initial fusion features corresponding to the image to be retrieved, and directly taking the initial fusion features as the target fusion features; or,
splicing the semantic features and the visual features to obtain initial fusion features corresponding to the image to be retrieved, obtaining initial fusion features corresponding to images to be fused associated with the image to be retrieved, and obtaining the target fusion features based on the initial fusion features corresponding to the image to be retrieved and the initial fusion features corresponding to the images to be fused.
Optionally, when the semantic features and the visual features are spliced to obtain an initial fusion feature corresponding to the image to be retrieved, the fusion unit 703 is specifically configured to:
splicing the semantic features and the visual features according to a specified feature splicing sequence;
and weighting the spliced features based on the preset weight coefficients corresponding to the semantic features and the visual features respectively to obtain initial fusion features corresponding to the image to be retrieved.
Optionally, when the target fusion feature is obtained based on the initial fusion feature corresponding to the image to be retrieved and the initial fusion feature corresponding to each image to be fused, the fusion unit 703 is specifically configured to:
based on the weight coefficients corresponding to the image to be retrieved and the images to be fused, carrying out weighted summation on the initial fusion features corresponding to the image to be retrieved and the initial fusion features corresponding to the images to be fused;
and averaging the fusion features obtained after weighted summation based on the number of the images to be fused to obtain the target fusion features.
Optionally, the semantic features include classification confidence degrees and normalization information; when the semantic features and the visual features are spliced to obtain initial fusion features corresponding to the image to be retrieved, the fusion unit 703 is specifically configured to:
based on the normalization information contained in the semantic features, performing normalization processing on the classification confidence degrees and the visual features contained in the semantic features to obtain the classification confidence degrees and the visual features which accord with a preset value range;
and splicing the classification confidence coefficient and the visual characteristic which accord with a preset value range to obtain an initial fusion characteristic corresponding to the image to be retrieved.
Optionally, the classification confidences include a category confidence and/or an attribute confidence.
Optionally, the image to be retrieved is a video frame in a video, and the video further includes other video frames;
before the initial fusion features corresponding to the images to be fused associated with the images to be retrieved are obtained, the fusion unit 703 is further configured to:
taking other video frames which contain the target retrieval object and have playing time earlier than the video frame as images to be fused associated with the images to be retrieved;
for any image to be fused in the images to be fused, the following operations are executed:
inputting any image to be fused into the target classification model to obtain a visual feature and a semantic feature corresponding to the any image to be fused, wherein the semantic feature of the any image to be fused is used for representing a classification result of the any image to be fused;
and obtaining the initial fusion feature corresponding to any image to be fused based on the visual feature and the semantic feature corresponding to any image to be fused.
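For the video case described above, a sketch of collecting the images to be fused and their initial fusion features might look as follows; contains_target and classify are hypothetical helpers standing in for the object detector and the target classification model, and the plain concatenation used for the initial fusion feature mirrors the splicing sketch given earlier.

    import numpy as np

    def collect_fused_frame_features(video_frames, query_index,
                                     contains_target, classify):
        # Earlier frames that contain the target retrieval object serve as the
        # images to be fused; classify(frame) is assumed to return
        # (visual_feat, semantic_feat) from the target classification model.
        initial_feats = []
        for idx, frame in enumerate(video_frames):
            if idx >= query_index or not contains_target(frame):
                continue
            visual_feat, semantic_feat = classify(frame)
            initial_feats.append(np.concatenate([
                np.asarray(semantic_feat, dtype=np.float32),
                np.asarray(visual_feat, dtype=np.float32)]))
        return initial_feats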
Optionally, before determining at least one target image matched with the image to be retrieved from the candidate images based on the target fusion features corresponding to the candidate images and the image to be retrieved, the fusion unit 703 is further configured to:
determining at least one candidate area contained in each candidate image from each candidate image, wherein each candidate area contains a retrieval object of one retrieval type;
for any one of the determined candidate regions, performing the following operations:
inputting the candidate region into the target classification model to obtain a visual feature and a semantic feature corresponding to the candidate region, and obtaining a target fusion feature corresponding to the candidate region based on the visual feature and the semantic feature corresponding to the candidate region;
and recording the mapping relation between the target fusion feature corresponding to the candidate region and the corresponding candidate image.
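The offline indexing of candidate images described above could be sketched as follows; detect_regions and classify are hypothetical helpers, and storing the mapping as a flat list of (feature, image id) pairs is only one possible realization of the recorded mapping relation.

    import numpy as np

    def build_candidate_index(candidate_images, detect_regions, classify):
        # For each candidate image, extract its candidate regions (one retrieval
        # type per region), compute a target fusion feature per region, and
        # record the mapping back to the owning candidate image.
        index = []  # (target_fusion_feature, candidate_image_id) pairs
        for image_id, image in enumerate(candidate_images):
            for region in detect_regions(image):
                visual_feat, semantic_feat = classify(region)
                fusion_feat = np.concatenate([
                    np.asarray(semantic_feat, dtype=np.float32),
                    np.asarray(visual_feat, dtype=np.float32)])
                index.append((fusion_feat, image_id))
        return index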
Optionally, when determining at least one target image matched with the image to be retrieved from the candidate images based on the target fusion features corresponding to the candidate images and the image to be retrieved, the matching unit 704 is specifically configured to:
calculating the similarity between the target fusion features corresponding to each candidate image and the target fusion features corresponding to the image to be retrieved;
and determining, based on the calculated similarities, at least one target image matched with the image to be retrieved from the candidate images whose similarity is greater than a preset threshold.
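A sketch of this matching step is given below. Cosine similarity and a threshold of 0.8 are assumptions; the embodiments only require some similarity measure and a preset threshold.

    import numpy as np

    def match_candidates(query_feat, index, threshold=0.8):
        # Compare the target fusion feature of the image to be retrieved with the
        # target fusion feature of every indexed candidate region, keep candidates
        # whose similarity exceeds the preset threshold, and rank them.
        q = np.asarray(query_feat, dtype=np.float32)
        matches = []
        for feat, image_id in index:
            f = np.asarray(feat, dtype=np.float32)
            sim = float(np.dot(q, f) / (np.linalg.norm(q) * np.linalg.norm(f) + 1e-8))
            if sim > threshold:
                matches.append((sim, image_id))
        matches.sort(key=lambda pair: pair[0], reverse=True)
        return matches  # best-matching candidate images first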
For convenience of description, the above parts are divided into modules (or units) according to function and described separately. Of course, when implementing the present application, the functionality of the various modules (or units) may be implemented in one and the same piece, or in several pieces, of software or hardware.
With regard to the apparatus in the above embodiment, the specific manner in which each unit performs its operations has been described in detail in the embodiments of the method, and will not be elaborated here.
In the embodiments of the present application, after an image to be retrieved containing a target retrieval object is obtained, the image to be retrieved is input into a target classification model comprising a feature extraction layer and an output layer; visual features are obtained from the feature extraction layer and semantic features from the output layer; the semantic features and the visual features are then fused to obtain target fusion features corresponding to the image to be retrieved, and a target image matched with the image to be retrieved is determined from the candidate images based on the target fusion features. Because the target fusion features contain both the visual features, which are well interpretable and adapt well to feature types not anticipated by the training set, and the semantic features, which offer good accuracy and robustness, the target fusion features have better robustness and wider scene adaptability, improving the accuracy and efficiency of image retrieval.
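By way of illustration of the feature extraction itself, the sketch below uses a torchvision ResNet-18 as a stand-in for the target classification model: the pooled output of the last convolutional stage plays the role of the visual features from the feature extraction layer, and the softmax confidences of the final fully connected layer play the role of the semantic features from the output layer. The choice of backbone and of softmax confidences is an assumption; the embodiments do not prescribe a particular network.

    import torch
    import torchvision.models as models

    def extract_visual_and_semantic(image_tensor):
        # image_tensor: a (1, 3, 224, 224) preprocessed image to be retrieved.
        model = models.resnet18(weights=None)
        model.eval()
        with torch.no_grad():
            x = model.conv1(image_tensor)
            x = model.bn1(x)
            x = model.relu(x)
            x = model.maxpool(x)
            x = model.layer1(x)
            x = model.layer2(x)
            x = model.layer3(x)
            x = model.layer4(x)
            visual = model.avgpool(x).flatten(1)     # feature extraction layer output
            logits = model.fc(visual)                # output layer
            semantic = torch.softmax(logits, dim=1)  # classification confidences
        return visual.squeeze(0), semantic.squeeze(0)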
As will be appreciated by one skilled in the art, aspects of the present application may be embodied as a system, a method, or a program product. Accordingly, various aspects of the present application may take the form of an entirely hardware embodiment, an entirely software embodiment (including firmware, microcode, and the like), or an embodiment combining hardware and software aspects, which may all generally be referred to herein as a "circuit," a "module," or a "system."
Based on the same inventive concept, the embodiment of the application also provides the electronic equipment. In one embodiment, the electronic device may be a server or a terminal device. Referring to fig. 8, which is a schematic structural diagram of a possible electronic device provided in an embodiment of the present application, in fig. 8, an electronic device 800 includes: a processor 810 and a memory 820.
The memory 820 stores a computer program executable by the processor 810, and the processor 810 can execute the steps of the target retrieval method by executing the instructions stored in the memory 820.
The memory 820 may be a volatile memory, such as a random-access memory (RAM); the memory 820 may also be a non-volatile memory, such as a read-only memory (ROM), a flash memory, a hard disk drive (HDD), or a solid-state drive (SSD); or the memory 820 may be any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer, without being limited thereto. The memory 820 may also be a combination of the above.
The processor 810 may include one or more Central Processing Units (CPUs), a digital processing unit, or the like. The processor 810 is configured to implement the above target retrieval method when executing the computer program stored in the memory 820.
In some embodiments, processor 810 and memory 820 may be implemented on the same chip, or in some embodiments, they may be implemented separately on separate chips.
The specific connection medium between the processor 810 and the memory 820 is not limited in the embodiments of the present application. In the embodiment of the present application, the processor 810 and the memory 820 are connected by a bus, which is depicted by a thick line in fig. 8; the connection manner between other components is merely illustrative and not limiting. The bus may be divided into an address bus, a data bus, a control bus, and the like. For ease of description, only one thick line is depicted in fig. 8, but this does not mean that there is only one bus or only one type of bus.
Based on the same inventive concept, embodiments of the present application provide a computer-readable storage medium including a computer program for causing an electronic device to perform the steps of the above target retrieval method when the computer program runs on the electronic device. In some possible embodiments, the various aspects of the target retrieval method provided in the present application may also be implemented in the form of a program product including a computer program for causing an electronic device to perform the steps of the target retrieval method described above when the program product is run on the electronic device; for example, the electronic device may perform the steps shown in fig. 2.
The program product may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection having one or more wires, a portable Disk, a hard Disk, a RAM, a ROM, an erasable programmable Read-Only Memory (EPROM or flash Memory), an optical fiber, a portable Compact Disk Read-Only Memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
The program product of the embodiments of the present application may be a CD-ROM and include a computer program, and may be run on an electronic device. However, the program product of the present application is not so limited, and in this document, a readable storage medium may be any tangible medium that can contain, or store a computer program for use by or in connection with a command execution system, apparatus, or device.
A readable signal medium may include a propagated data signal with a readable computer program embodied therein, either in baseband or as part of a carrier wave. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A readable signal medium may also be any readable medium that is not a readable storage medium and that can communicate, propagate, or transport a computer program for use by or in connection with a command execution system, apparatus, or device.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While the preferred embodiments of the present application have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all alterations and modifications as fall within the scope of the application.
It will be apparent to those skilled in the art that various changes and modifications may be made in the present application without departing from the spirit and scope of the application. Thus, if such modifications and variations of the present application fall within the scope of the claims of the present application and their equivalents, the present application is intended to include such modifications and variations as well.

Claims (12)

1. A target retrieval method, comprising:
acquiring an image to be retrieved containing a target retrieval object;
inputting the image to be retrieved into a target classification model comprising a feature extraction layer and an output layer, obtaining visual features output by the feature extraction layer, and obtaining semantic features output by the output layer, wherein the semantic features are used for representing classification results of the target retrieval object;
performing feature fusion on the semantic features and the visual features to obtain target fusion features corresponding to the image to be retrieved;
and determining at least one target image matched with the image to be retrieved from the candidate images based on the target fusion characteristics corresponding to the candidate images and the image to be retrieved respectively.
2. The method of claim 1, wherein the performing feature fusion on the semantic features and the visual features to obtain target fusion features corresponding to the image to be retrieved comprises:
splicing the semantic features and the visual features to obtain initial fusion features corresponding to the image to be retrieved, and directly taking the initial fusion features as the target fusion features; or,
and splicing the semantic features and the visual features to obtain initial fusion features corresponding to the image to be retrieved, obtaining initial fusion features corresponding to each image to be fused associated with the image to be retrieved, and obtaining the target fusion features based on the initial fusion features corresponding to the image to be retrieved and the initial fusion features corresponding to the respective images to be fused.
3. The method of claim 2, wherein the stitching the semantic features and the visual features to obtain initial fusion features corresponding to the image to be retrieved comprises:
splicing the semantic features and the visual features according to a specified feature splicing sequence;
and weighting the spliced features based on the preset weight coefficients corresponding to the semantic features and the visual features respectively to obtain initial fusion features corresponding to the image to be retrieved.
4. The method according to claim 2, wherein the obtaining the target fusion feature based on the initial fusion feature corresponding to the image to be retrieved and the initial fusion features corresponding to the respective images to be fused comprises:
based on the weight coefficients corresponding to the image to be retrieved and the images to be fused, carrying out weighted summation on the initial fusion features corresponding to the image to be retrieved and the initial fusion features corresponding to the images to be fused;
and averaging the fusion features obtained after weighted summation based on the number of the images to be fused to obtain the target fusion features.
5. The method of claim 2, wherein the semantic features include respective classification confidence levels and normalization information;
the splicing the semantic features and the visual features to obtain initial fusion features corresponding to the image to be retrieved comprises:
based on the normalization information contained in the semantic features, performing normalization processing on the classification confidence degrees contained in the semantic features and on the visual features to obtain classification confidence degrees and visual features that conform to a preset value range;
and splicing the classification confidence coefficient and the visual characteristic which accord with a preset value range to obtain an initial fusion characteristic corresponding to the image to be retrieved.
6. The method of claim 5, wherein the classification confidences comprise category confidences and/or attribute confidences.
7. The method of claim 2, wherein the image to be retrieved is a video frame in a video, and the video further comprises other video frames;
before the obtaining of the initial fusion features corresponding to the images to be fused associated with the image to be retrieved, the method further includes:
taking other video frames which contain the target retrieval object and whose playing time is earlier than that of the video frame as the images to be fused associated with the image to be retrieved;
for any image to be fused among the images to be fused, the following operations are performed:
inputting the image to be fused into the target classification model to obtain a visual feature and a semantic feature corresponding to the image to be fused, wherein the semantic feature of the image to be fused is used for representing a classification result of the image to be fused;
and obtaining the initial fusion feature corresponding to the image to be fused based on the visual feature and the semantic feature corresponding to the image to be fused.
8. The method according to any one of claims 1 to 7, wherein before determining at least one target image matching the image to be retrieved from the candidate images based on the target fusion features corresponding to the candidate images and the image to be retrieved, the method further comprises:
determining at least one candidate area contained in each candidate image from each candidate image, wherein each candidate area contains a retrieval object of one retrieval type;
for any one of the determined candidate regions, performing the following operations:
inputting the candidate region into the target classification model to obtain a visual feature and a semantic feature corresponding to the candidate region, and obtaining a target fusion feature corresponding to the candidate region based on the visual feature and the semantic feature corresponding to the candidate region;
and recording the mapping relation between the target fusion feature corresponding to the candidate region and the corresponding candidate image.
9. The method according to any one of claims 1 to 7, wherein the determining at least one target image matching the image to be retrieved from the candidate images based on the target fusion features corresponding to the candidate images and the image to be retrieved respectively comprises:
calculating the similarity between the target fusion features corresponding to each candidate image and the target fusion features corresponding to the image to be retrieved;
and determining at least one target image matched with the image to be retrieved from candidate images with the similarity greater than a preset threshold value based on the calculated similarity.
10. An object retrieval apparatus, comprising:
the device comprises an acquisition unit, a retrieval unit and a retrieval unit, wherein the acquisition unit is used for acquiring an image to be retrieved containing a target retrieval object;
the output unit is used for inputting the image to be retrieved into a target classification model comprising a feature extraction layer and an output layer, obtaining visual features output by the feature extraction layer, and obtaining semantic features output by the output layer, wherein the semantic features are used for representing classification results of the target retrieval object;
the fusion unit is used for performing feature fusion on the semantic features and the visual features to obtain target fusion features corresponding to the image to be retrieved;
and the matching unit is used for determining at least one target image matched with the image to be retrieved from each candidate image based on the target fusion characteristics corresponding to each candidate image and the image to be retrieved.
11. An electronic device, characterized in that it comprises a processor and a memory, wherein the memory stores a computer program which, when executed by the processor, causes the processor to carry out the steps of the method of any of claims 1-9.
12. A computer-readable storage medium, characterized in that it comprises a computer program for causing an electronic device to carry out the steps of the method of any one of claims 1-9, when the computer program is run on the electronic device.
CN202210538456.8A 2022-05-17 2022-05-17 Target retrieval method and related device Pending CN114880513A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210538456.8A CN114880513A (en) 2022-05-17 2022-05-17 Target retrieval method and related device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210538456.8A CN114880513A (en) 2022-05-17 2022-05-17 Target retrieval method and related device

Publications (1)

Publication Number Publication Date
CN114880513A true CN114880513A (en) 2022-08-09

Family

ID=82675213

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210538456.8A Pending CN114880513A (en) 2022-05-17 2022-05-17 Target retrieval method and related device

Country Status (1)

Country Link
CN (1) CN114880513A (en)


Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115937145A (en) * 2022-12-09 2023-04-07 深圳市禾葡兰信息科技有限公司 Skin health visualization method, device and equipment based on big data analysis
CN115937145B (en) * 2022-12-09 2024-03-19 深圳市禾葡兰信息科技有限公司 Skin health visualization method, device and equipment based on big data analysis
CN117132600A (en) * 2023-10-26 2023-11-28 广东岚瑞新材料科技集团有限公司 Injection molding product quality detection system and method based on image
CN117132600B (en) * 2023-10-26 2024-04-16 广东岚瑞新材料科技集团有限公司 Injection molding product quality detection system and method based on image

Similar Documents

Publication Publication Date Title
CN111368893B (en) Image recognition method, device, electronic equipment and storage medium
US20200334830A1 (en) Method, apparatus, and storage medium for processing video image
WO2022105125A1 (en) Image segmentation method and apparatus, computer device, and storage medium
EP3893125A1 (en) Method and apparatus for searching video segment, device, medium and computer program product
CN114880513A (en) Target retrieval method and related device
US9323988B2 (en) Content-adaptive pixel processing systems, methods and apparatus
CN111327945A (en) Method and apparatus for segmenting video
CN112954450B (en) Video processing method and device, electronic equipment and storage medium
CN111666960A (en) Image recognition method and device, electronic equipment and readable storage medium
CN112614110B (en) Method and device for evaluating image quality and terminal equipment
CN112150497B (en) Local activation method and system based on binary neural network
CN113033507B (en) Scene recognition method and device, computer equipment and storage medium
CN113591527A (en) Object track identification method and device, electronic equipment and storage medium
CN110807472B (en) Image recognition method and device, electronic equipment and storage medium
CN114283351A (en) Video scene segmentation method, device, equipment and computer readable storage medium
CN114663871A (en) Image recognition method, training method, device, system and storage medium
CN116486109A (en) Modal self-adaptive descriptive query pedestrian re-identification method and system
CN116524261A (en) Image classification method and product based on multi-mode small sample continuous learning
CN113657249B (en) Training method, prediction method, device, electronic equipment and storage medium
KR101743169B1 (en) System and Method for Searching Missing Family Using Facial Information and Storage Medium of Executing The Program
CN111046232B (en) Video classification method, device and system
CN116261009B (en) Video detection method, device, equipment and medium for intelligently converting video audience
Wang et al. TIENet: task-oriented image enhancement network for degraded object detection
CN116258873A (en) Position information determining method, training method and device of object recognition model
CN113239215B (en) Classification method and device for multimedia resources, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination