CN114880513A - Target retrieval method and related device

Target retrieval method and related device

Info

Publication number
CN114880513A
Authority
CN
China
Prior art keywords
image
features
retrieved
target
fusion
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210538456.8A
Other languages
Chinese (zh)
Inventor
鲁逸峰
周祥明
郑春煌
吴剑峰
韩加旭
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang Dahua Technology Co Ltd
Original Assignee
Zhejiang Dahua Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang Dahua Technology Co Ltd filed Critical Zhejiang Dahua Technology Co Ltd
Priority to CN202210538456.8A
Publication of CN114880513A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50 Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/58 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/583 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/44 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/74 Image or video pattern matching; Proximity measures in feature spaces
    • G06V10/75 Organisation of the matching processes, e.g. simultaneous or sequential comparisons of image or video features; Coarse-fine approaches, e.g. multi-scale approaches; using context analysis; Selection of dictionaries
    • G06V10/751 Comparing pixel values or logical combinations thereof, or feature values having positional relevance, e.g. template matching
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Databases & Information Systems (AREA)
  • Multimedia (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Library & Information Science (AREA)
  • Image Analysis (AREA)

Abstract

The application relates to the technical field of computers, and in particular to a target retrieval method and a related device, which are used to improve retrieval efficiency and accuracy. The method includes: after an image to be retrieved containing a target retrieval object is obtained, inputting the image to be retrieved into a target classification model that includes a feature extraction layer and an output layer; obtaining visual features from the feature extraction layer and semantic features from the output layer; performing feature fusion on the semantic features and the visual features to obtain a target fusion feature corresponding to the image to be retrieved; and then determining, from the candidate images based on the target fusion feature, a target image matching the image to be retrieved.

Description

Target retrieval method and related device
Technical Field
The present application relates to the field of image processing technologies, and in particular, to a target retrieval method and a related apparatus.
Background
With the continuous development of computer technology, visual target retrieval is applied in more and more scenarios. Visual target retrieval compares the feature vector of an image to be detected with the feature vectors of a massive number of images in a database, so as to determine, from those images, a target image similar to the image to be detected.
In the related art, a machine learning model is usually used to obtain the feature vector of the image to be detected. However, the feature vector obtained by the machine learning model has poor interpretability, and its feature dimension depends on the output dimension of the model, so it is difficult to adapt well to feature types outside the training set. For example, a classification model trained with images of cats is not suitable for classifying and retrieving dogs; it has poor universality and must be retrained if objects of other classes are to be retrieved, which reduces retrieval efficiency.
Disclosure of Invention
The application provides a target retrieval method and a related device, which are used to improve the robustness and scene adaptability of the features and to improve retrieval efficiency and accuracy.
The embodiment of the application provides the following specific technical scheme:
in a first aspect, a target retrieval method includes:
acquiring an image to be retrieved containing a target retrieval object;
inputting the image to be retrieved into a target classification model comprising a feature extraction layer and an output layer, obtaining visual features output by the feature extraction layer, and obtaining semantic features output by the output layer, wherein the semantic features are used for representing classification results of the target detection object;
performing feature fusion on the semantic features and the visual features to obtain target fusion features corresponding to the image to be retrieved;
and determining at least one target image matched with the image to be retrieved from the candidate images based on the target fusion characteristics corresponding to the candidate images and the image to be retrieved respectively.
Optionally, the performing feature fusion on the semantic features and the visual features to obtain target fusion features corresponding to the image to be retrieved includes:
splicing the semantic features and the visual features to obtain initial fusion features corresponding to the image to be retrieved, and directly taking the initial fusion features as the target fusion features; or,
splicing the semantic features and the visual features to obtain initial fusion features corresponding to the image to be retrieved, obtaining initial fusion features corresponding to images to be fused associated with the image to be retrieved, and obtaining the target fusion features based on the initial fusion features corresponding to the image to be retrieved and the initial fusion features corresponding to the images to be fused.
Optionally, the splicing the semantic features and the visual features to obtain initial fusion features corresponding to the image to be retrieved includes:
splicing the semantic features and the visual features according to a specified feature splicing sequence;
and weighting the spliced features based on the preset weight coefficients corresponding to the semantic features and the visual features respectively to obtain initial fusion features corresponding to the image to be retrieved.
Optionally, the obtaining the target fusion feature based on the initial fusion feature corresponding to the image to be retrieved and the initial fusion feature corresponding to each image to be fused includes:
based on the weight coefficients corresponding to the image to be retrieved and the images to be fused, carrying out weighted summation on the initial fusion features corresponding to the image to be retrieved and the initial fusion features corresponding to the images to be fused;
and averaging the fusion features obtained after weighted summation based on the number of the images to be fused to obtain the target fusion features.
Optionally, the semantic features include classification confidence degrees and normalization information; the splicing the semantic features and the visual features to obtain initial fusion features corresponding to the image to be retrieved comprises:
based on the normalization information contained in the semantic features, performing normalization processing on the classification confidence degrees and the visual features contained in the semantic features to obtain the classification confidence degrees and the visual features which accord with a preset value range;
and splicing the classification confidence coefficient and the visual characteristic which accord with a preset value range to obtain an initial fusion characteristic corresponding to the image to be retrieved.
Optionally, the classification confidences include category confidences and/or attribute confidences.
Optionally, the image to be retrieved is a video frame in a video, and the video further includes other video frames;
before the obtaining of the initial fusion features corresponding to the images to be fused associated with the images to be retrieved, the method further includes:
taking other video frames which contain the target retrieval object and have playing time earlier than the video frame as images to be fused associated with the images to be retrieved;
for any image to be fused in the images to be fused, the following operations are executed:
inputting the any image to be fused into the target classification model to obtain a visual feature and a semantic feature corresponding to the any image to be fused, wherein the semantic feature of the any image to be fused is used for representing a classification result of the any image to be fused;
and obtaining the initial fusion feature corresponding to any image to be fused based on the visual feature and the semantic feature corresponding to any image to be fused.
Optionally, before determining at least one target image matched with the image to be retrieved from each candidate image based on the target fusion features corresponding to each candidate image and the image to be retrieved, the method further includes:
determining at least one candidate area contained in each candidate image from each candidate image, wherein each candidate area contains a retrieval object of one retrieval type;
for any one of the determined candidate regions, performing the following operations:
inputting the any candidate region into the target classification model to obtain a visual feature and a semantic feature corresponding to the any candidate region, and obtaining a target fusion feature corresponding to the any candidate region based on the visual feature and the semantic feature corresponding to the any candidate region;
and recording the mapping relation between the target fusion characteristics corresponding to any one candidate region and the corresponding candidate image.
Optionally, the determining, from the candidate images, at least one target image matched with the image to be retrieved based on the target fusion features corresponding to the candidate images and the image to be retrieved includes:
calculating the similarity between the target fusion characteristics corresponding to the candidate images and the target fusion characteristics corresponding to the images to be retrieved;
and determining at least one target image matched with the image to be retrieved from candidate images with the similarity greater than a preset threshold value based on the calculated similarity.
In a second aspect, a target retrieval apparatus includes:
the device comprises an acquisition unit, a retrieval unit and a retrieval unit, wherein the acquisition unit is used for acquiring an image to be retrieved containing a target retrieval object;
the output unit is used for inputting the image to be retrieved into a target classification model comprising a feature extraction layer and an output layer, obtaining visual features output by the feature extraction layer and obtaining semantic features output by the output layer, wherein the semantic features are used for representing classification results of the target detection object;
the fusion unit is used for performing feature fusion on the semantic features and the visual features to obtain target fusion features corresponding to the image to be retrieved;
and the matching unit is used for determining at least one target image matched with the image to be retrieved from each candidate image based on the target fusion characteristics corresponding to each candidate image and the image to be retrieved.
Optionally, when performing feature fusion on the semantic features and the visual features to obtain target fusion features corresponding to the image to be retrieved, the fusion unit is specifically configured to:
splicing the semantic features and the visual features to obtain initial fusion features corresponding to the image to be retrieved, and directly taking the initial fusion features as the target fusion features; or,
splicing the semantic features and the visual features to obtain initial fusion features corresponding to the image to be retrieved, obtaining initial fusion features corresponding to images to be fused associated with the image to be retrieved, and obtaining the target fusion features based on the initial fusion features corresponding to the image to be retrieved and the initial fusion features corresponding to the images to be fused.
Optionally, when the semantic features and the visual features are spliced to obtain initial fusion features corresponding to the image to be retrieved, the fusion unit is specifically configured to:
splicing the semantic features and the visual features according to a specified feature splicing sequence;
and weighting the spliced features based on the preset weight coefficients corresponding to the semantic features and the visual features respectively to obtain initial fusion features corresponding to the image to be retrieved.
Optionally, when the target fusion feature is obtained based on the initial fusion feature corresponding to the image to be retrieved and the initial fusion feature corresponding to each image to be fused, the fusion unit is specifically configured to:
based on the weight coefficients corresponding to the image to be retrieved and the images to be fused, carrying out weighted summation on the initial fusion features corresponding to the image to be retrieved and the initial fusion features corresponding to the images to be fused;
and averaging the fusion features obtained after weighted summation based on the number of the images to be fused to obtain the target fusion features.
Optionally, the semantic features include classification confidence degrees and normalization information; when the semantic features and the visual features are spliced to obtain initial fusion features corresponding to the image to be retrieved, the fusion unit is specifically configured to:
based on the normalization information contained in the semantic features, performing normalization processing on the classification confidence degrees and the visual features contained in the semantic features to obtain the classification confidence degrees and the visual features which accord with a preset value range;
and splicing the classification confidence coefficient and the visual characteristic which accord with a preset value range to obtain an initial fusion characteristic corresponding to the image to be retrieved.
Optionally, the classification confidences include category confidences and/or attribute confidences.
Optionally, the image to be retrieved is a video frame in a video, and the video further includes other video frames;
before the initial fusion features corresponding to the images to be fused associated with the images to be retrieved are obtained, the fusion unit is further configured to:
taking other video frames which contain the target retrieval object and have playing time earlier than the video frame as images to be fused associated with the images to be retrieved;
for any image to be fused in the images to be fused, executing the following operations:
inputting any image to be fused into the target classification model to obtain a visual feature and a semantic feature corresponding to the any image to be fused, wherein the semantic feature of the any image to be fused is used for representing a classification result of the any image to be fused;
and obtaining the initial fusion feature corresponding to any image to be fused based on the visual feature and the semantic feature corresponding to any image to be fused.
Optionally, before determining at least one target image matched with the image to be retrieved from the candidate images based on the target fusion features corresponding to the candidate images and the image to be retrieved, the fusion unit is further configured to:
determining at least one candidate area contained in each candidate image from each candidate image, wherein each candidate area contains a retrieval object of one retrieval type;
for any one of the determined candidate regions, performing the following operations:
inputting the any candidate region into the target classification model to obtain a visual feature and a semantic feature corresponding to the any candidate region, and obtaining a target fusion feature corresponding to the any candidate region based on the visual feature and the semantic feature corresponding to the any candidate region;
and recording the mapping relation between the target fusion characteristics corresponding to any one candidate region and the corresponding candidate image.
Optionally, when determining at least one target image matched with the image to be retrieved from the candidate images based on the target fusion features corresponding to the candidate images and the image to be retrieved, the matching unit is specifically configured to:
calculating the similarity between the target fusion characteristics corresponding to the candidate images and the target fusion characteristics corresponding to the images to be retrieved;
and determining at least one target image matched with the image to be retrieved from candidate images with the similarity greater than a preset threshold value based on the calculated similarity.
In a third aspect, an embodiment of the present application provides an electronic device, including a processor and a memory, where the memory stores a computer program, and when the computer program is executed by the processor, the processor is caused to execute the steps of the above-mentioned object retrieval method.
In a fourth aspect, an embodiment of the present application provides a computer-readable storage medium, which includes a computer program, when the computer program runs on an electronic device, the computer program is configured to enable the electronic device to execute the steps of the above-mentioned object retrieval method.
In a fifth aspect, the present application provides a computer program product, where the program product includes a computer program, where the computer program is stored in a computer-readable storage medium, and a processor of an electronic device reads and executes the computer program from the computer-readable storage medium, so that the electronic device executes the steps of the above object retrieval method.
In summary, in the embodiment of the present application, after an image to be retrieved containing a target retrieval object is obtained, the image to be retrieved is input into a target classification model that includes a feature extraction layer and an output layer; visual features are obtained from the feature extraction layer and semantic features from the output layer; feature fusion is then performed on the semantic features and the visual features to obtain a target fusion feature corresponding to the image to be retrieved; and a target image matching the image to be retrieved is determined from the candidate images based on the target fusion feature. Because the target fusion feature contains both the visual features and the semantic features, where the visual features have good interpretability and adapt well to feature types outside the training set, and the semantic features have good accuracy and robustness, the target fusion feature has better robustness and wider scene adaptability, which improves the accuracy and efficiency of image retrieval.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art are briefly described below. It is obvious that the drawings in the following description are only some embodiments of the present application, and those skilled in the art can obtain other drawings from these drawings without creative effort.
Fig. 1 is a schematic view of an application scenario provided in an embodiment of the present application;
fig. 2 is a schematic flowchart of a target retrieval method provided in an embodiment of the present application;
FIG. 3 is a logic diagram of a feature stitching provided in an embodiment of the present application;
FIG. 4 is a schematic diagram of a logic for determining an image to be fused according to an embodiment of the present disclosure;
FIG. 5 is a logic diagram for determining a target fusion feature provided in an embodiment of the present application;
FIG. 6 is a logic diagram of a target retrieval process provided in an embodiment of the present application;
fig. 7 is a schematic structural diagram of a target retrieval apparatus provided in an embodiment of the present application;
fig. 8 is a schematic structural diagram of an electronic device provided in an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some embodiments, but not all embodiments, of the technical solutions of the present application. All other embodiments obtained by a person skilled in the art without any inventive step based on the embodiments described in the present application are within the scope of the protection of the present application.
It is to be noted that, in this document, relational terms such as first and second, and the like are used solely to distinguish one entity or operation from another entity or operation without necessarily requiring or implying any actual such relationship or order between such entities or operations.
While the preferred embodiments of the present application have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all alterations and modifications as fall within the scope of the application.
The preferred embodiments of the present application will be described in conjunction with the drawings of the specification, it should be understood that the preferred embodiments described herein are for purposes of illustration and explanation only and are not intended to limit the present application, and features of the embodiments and examples of the present application may be combined with each other without conflict.
Fig. 1 is a schematic diagram of a possible application scenario provided in an embodiment of the present application, where the application scenario at least includes a terminal device 110 and a server 120. The number of the terminal devices 110 may be one or more, the number of the servers 120 may also be one or more, and the number of the terminal devices 110 and the number of the servers 120 are not particularly limited in the present application.
The terminal device 110 has an application with information processing functions such as information verification and information retrieval, where the application may be a client application, a web page application, an applet application, and the like. The terminal device 110 may be, but is not limited to, a smart phone, a tablet computer, a laptop computer, a desktop computer, a smart speaker, a smart watch, a smart home appliance, a vehicle-mounted terminal, an aircraft, and the like.
The server 120 may be a background server of the application, and provides a corresponding information verification service for the application. The server 120 may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as cloud service, a cloud database, cloud computing, a cloud function, cloud storage, Network service, cloud communication, middleware service, domain name service, security service, Content Delivery Network (CDN), big data, and an artificial intelligence platform.
The terminal device 110 and the server 120 may be directly or indirectly connected through wired or wireless communication, and communicate and transmit data through a network, which is not limited in this application.
The target retrieval method mentioned in the present application can be applied to the terminal device or the server individually, or can be executed by the terminal device and the server together.
For example, after acquiring an image to be retrieved containing a target retrieval object, the terminal device inputs the image to be retrieved into a target classification model to obtain a visual feature output by a feature extraction layer and a semantic feature output by an output layer, then performs feature fusion on the semantic feature and the visual feature to obtain a target fusion feature corresponding to the image to be retrieved, and further determines a target image matched with the image to be retrieved from each candidate image based on the target fusion feature.
For another example, after the server acquires an image to be retrieved containing a target retrieval object, the image to be retrieved is input into the target classification model to obtain a visual feature output by the feature extraction layer and a semantic feature output by the output layer, then feature fusion is performed on the semantic feature and the visual feature to obtain a target fusion feature corresponding to the image to be retrieved, and then the target image matched with the image to be retrieved is determined from each candidate image based on the target fusion feature.
For another example, after acquiring the image to be retrieved containing the target retrieval object, the terminal device sends the image to be retrieved to the server. The method comprises the steps that after receiving an image to be retrieved, a server inputs the image to be retrieved into a target classification model to obtain visual features output by a feature extraction layer and semantic features output by an output layer, feature fusion is conducted on the semantic features and the visual features to obtain target fusion features corresponding to the image to be retrieved, the target fusion features corresponding to the image to be retrieved are sent to a terminal device, and the terminal device determines a target image matched with the image to be retrieved from candidate images based on the target fusion features.
Fig. 2 is a schematic flowchart of a target retrieval method provided in an embodiment of the present application, where the flowchart may be applied to an electronic device, and the electronic device may be a server or a terminal device, and the following description only takes the terminal device as an example, and the specific flowchart is as follows:
S201, the terminal device obtains an image to be retrieved containing a target retrieval object.
S202, the terminal device inputs the image to be retrieved into a target classification model comprising a feature extraction layer and an output layer, obtains visual features output by the feature extraction layer, and obtains semantic features output by the output layer, wherein the semantic features are used for representing classification results of target detection objects.
It should be noted that, in the embodiment of the present application, the visual features include, but are not limited to, Speeded-Up Robust Features (SURF), Histogram of Oriented Gradients (HOG) features, and Haar features.
The semantic features may include classification confidence levels, which include category confidence levels and/or attribute confidence levels. Categories include, but are not limited to, pedestrian, automotive, non-automotive, watercraft, aircraft, and the like. Attributes include, but are not limited to, color, vehicle type, whether glasses are worn, etc. The semantic features may also contain normalization information, including a normalized width and a normalized height.
The feature vector a1 is used to represent the semantic features. Assume that, in the semantic features output by the target classification model, the number of categories is n1, the number of attributes is n2, l is the normalized width, and h is the normalized height. The category confidences, attribute confidences, normalized width, and normalized height are encoded into the feature vector a1 in a specified order, so a1 can be represented as [category 1 confidence, category 2 confidence, …, category n1 confidence, attribute 1 confidence, attribute 2 confidence, …, attribute n2 confidence, l, h], and the length of feature vector a1 is n1 + n2 + 2.
Here, l = lt/lp and h = ht/hp, where lp is the width of the original image, hp is the height of the original image, lt is the width of the target image, and ht is the height of the target image.
For example, if the categories include pedestrian and vehicle, and the attributes include red, orange, yellow, and green, the feature vector a1 can be expressed as [pedestrian confidence, vehicle confidence, red confidence, orange confidence, yellow confidence, green confidence, l, h].
The visual features are represented by feature vector a2. Assuming that the feature map output by the feature extraction layer has width w, height z, and c channels, the length of feature vector a2 is w × z × c.
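For illustration only, the encoding of a1 and a2 described above can be sketched as follows; the confidence values, box sizes, and feature-map shape are hypothetical placeholders, not values from the patent:

```python
import numpy as np

# Semantic feature vector a1: category confidences, attribute confidences,
# then normalized width l = lt/lp and normalized height h = ht/hp.
category_conf = [0.92, 0.03]                # e.g. [pedestrian, vehicle]        -> n1 = 2
attribute_conf = [0.10, 0.05, 0.70, 0.15]   # e.g. [red, orange, yellow, green] -> n2 = 4
lt, ht, lp, hp = 120, 260, 1920, 1080       # assumed target size and original image size
a1 = np.array(category_conf + attribute_conf + [lt / lp, ht / hp])
assert len(a1) == len(category_conf) + len(attribute_conf) + 2   # n1 + n2 + 2

# Visual feature vector a2: the extraction-layer feature map
# (width w, height z, c channels) flattened into a vector of length w*z*c.
w, z, c = 7, 7, 256                          # assumed feature-map shape
feature_map = np.random.rand(w, z, c)        # stand-in for the real extraction-layer output
a2 = feature_map.reshape(-1)
assert len(a2) == w * z * c
```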
S203, the terminal device performs feature fusion on the semantic features and the visual features to obtain the target fusion feature corresponding to the image to be retrieved.
In S203, the semantic features and the visual features may be directly fused, or the semantic features and the visual features may be fused in combination with the image to be fused associated with the image to be retrieved. Two feature fusion methods are described below.
Mode 1: the terminal device splices the semantic features and the visual features to obtain the initial fusion feature corresponding to the image to be retrieved, and directly uses the initial fusion feature as the target fusion feature.
When splicing the semantic features and the visual features to obtain the initial fusion feature corresponding to the image to be retrieved, the terminal device may splice the semantic features and the visual features in a specified feature splicing order, and then weight the spliced features based on preset weight coefficients corresponding to the semantic features and the visual features respectively, to obtain the initial fusion feature corresponding to the image to be retrieved.
Herein, the semantic features are represented by feature vector a1, the visual features are represented by feature vector a2, and the initial fusion features are represented by feature vector a. The preset weight coefficient is used for representing the weight ratio of the corresponding features, and can be set according to a specific application scene.
In the embodiment of the present application, the feature extraction layer may also be referred to as a network feature layer, and the output layer may also be referred to as a network output layer.
Referring to fig. 3, assuming that the feature splicing order is the semantic features followed end to end by the visual features, the feature vector a can be represented as {p1·a1, p2·a2}, where p1 is the weight coefficient corresponding to the semantic features, p2 is the weight coefficient corresponding to the visual features, and the length of the fused feature vector a is the sum of the lengths of feature vector a1 and feature vector a2. Illustratively, p1 = 0.5 and p2 = 0.5.
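A minimal sketch of this weighted splicing, using small example vectors and the illustrative weights p1 = p2 = 0.5:

```python
import numpy as np

def splice_features(a1: np.ndarray, a2: np.ndarray,
                    p1: float = 0.5, p2: float = 0.5) -> np.ndarray:
    """Splice semantic and visual features end to end, each scaled by its weight coefficient."""
    return np.concatenate([p1 * a1, p2 * a2])

a1 = np.array([0.92, 0.03, 0.10, 0.05, 0.70, 0.15, 0.0625, 0.2407])  # example semantic feature
a2 = np.random.rand(7 * 7 * 256)                                      # example flattened visual feature

a = splice_features(a1, a2)            # initial fusion feature
assert len(a) == len(a1) + len(a2)     # length is the sum of both lengths
```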
In some embodiments, if the semantic features include each classification confidence and normalization information, the terminal device splices the semantic features and the visual features to obtain initial fusion features corresponding to the image to be retrieved, including:
based on normalization information contained in the semantic features, performing normalization processing on each classification confidence coefficient and each visual feature contained in the semantic features to obtain each classification confidence coefficient and each visual feature which accord with a preset value range;
and splicing the classification confidence coefficient and the visual characteristic which accord with the preset value range to obtain the initial fusion characteristic corresponding to the image to be retrieved.
For example, if the preset value range is 0 to 1, the terminal device performs normalization processing on each classification confidence and the visual features contained in the semantic features based on the normalization information contained in the semantic features, so that after normalization each component of the classification confidences and each component of the visual features takes a value between 0 and 1.
It should be noted that, in this embodiment of the application, the terminal device may also perform normalization processing in the feature extraction layer to obtain each classification confidence that meets the preset value range, and correspondingly, in the process of splicing the semantic features and the visual features, perform normalization processing on the visual features according to normalization information included in the semantic features to obtain the visual features that meet the preset value range.
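The patent does not spell out the exact normalization operation, so the sketch below uses min-max scaling as an assumption; it only illustrates the stated requirement that every component of the spliced feature lies within the preset range of 0 to 1:

```python
import numpy as np

def to_unit_range(v: np.ndarray) -> np.ndarray:
    """Assumed min-max normalization: scale every component into [0, 1]."""
    lo, hi = float(v.min()), float(v.max())
    return np.zeros_like(v) if hi == lo else (v - lo) / (hi - lo)

confidences = np.array([0.92, 0.03, 0.10, 0.05, 0.70, 0.15])   # example classification confidences
visual = np.random.rand(7 * 7 * 256) * 4.0                      # example un-normalized visual feature

initial_fusion = np.concatenate([to_unit_range(confidences), to_unit_range(visual)])
assert initial_fusion.min() >= 0.0 and initial_fusion.max() <= 1.0
```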
Mode 2: the terminal device splices the semantic features and the visual features to obtain the initial fusion feature corresponding to the image to be retrieved, obtains the initial fusion features corresponding to the images to be fused associated with the image to be retrieved, and obtains the target fusion feature based on the initial fusion feature corresponding to the image to be retrieved and the initial fusion features corresponding to the images to be fused.
The process of obtaining the initial fusion feature corresponding to the image to be retrieved by the terminal device in the mode 2 is the same as the process of obtaining the initial fusion feature corresponding to the image to be retrieved by the terminal device in the mode 1 by splicing the semantic feature and the visual feature, and is not repeated herein.
In this embodiment of the application, if the image to be retrieved is a video frame in a video, and the video further includes at least one other video frame in addition to this video frame, the terminal device may use the other video frames that contain the target retrieval object and whose playing time is earlier than this video frame as the images to be fused associated with the image to be retrieved. That is, the terminal device may use historical video frames containing the target retrieval object as the images to be fused associated with the image to be retrieved. The terminal device then performs the following operations for any image to be fused among the images to be fused:
inputting any image to be fused into a target classification model to obtain a visual feature and a semantic feature corresponding to any image to be fused, wherein the semantic feature of any image to be fused is used for representing the classification result of any image to be fused;
and obtaining the initial fusion feature corresponding to any image to be fused based on the visual feature and the semantic feature corresponding to any image to be fused.
It should be noted that, in this embodiment of the application, the fact that the image to be retrieved is one video frame in the video may refer to that the image to be retrieved is a certain video frame in the video, and may also refer to that the image to be retrieved is a partial image included in a certain video frame in the video. Since the generation process of the initial fusion features corresponding to each image to be fused is the same as the generation process of the initial fusion features corresponding to the image to be retrieved, the description is omitted here.
Suppose that a video includes m video frames, namely video frame 1, video frame 2, …, video frame i, …, video frame m, played in that order, and that the image to be retrieved is video frame i.
Suppose that n of the m video frames contain the target retrieval object, namely video frame k1, video frame k2, …, video frame ki, …, video frame kn, and that among them the frames played earlier than the video frame of the image to be retrieved are video frame k1, video frame k2, …, video frame ki-1; then the images to be fused include video frame k1, video frame k2, …, video frame ki-1.
The terminal device generates corresponding initial fusion features for video frame k1, video frame k2, …, and video frame ki-1, respectively; these initial fusion features are denoted a(k1), a(k2), …, a(ki-1), and the initial fusion feature corresponding to the image to be retrieved is denoted a(ki).
For example, referring to fig. 4, a video includes 10 video frames, namely video frame 1, video frame 2, …, video frame 10. Assume that the image to be retrieved is video frame 5 and the target retrieval object is a pedestrian in video frame 5. Among the 10 video frames, the other video frames that contain the target retrieval object and are played earlier than video frame 5 are video frame 1, video frame 2, video frame 3, and video frame 4, so the terminal device takes video frame 1, video frame 2, video frame 3, and video frame 4 as the images to be fused.
Referring to fig. 5, the terminal device obtains the initial fusion features corresponding to video frames 1, 2, 3, and 4, respectively, where the initial fusion feature corresponding to video frame 1 is a(k1), that of video frame 2 is a(k2), that of video frame 3 is a(k3), that of video frame 4 is a(k4), and the initial fusion feature corresponding to video frame 5 is a(k5).
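A sketch of how the images to be fused could be selected; the Frame structure and its fields are hypothetical stand-ins for the tracking and feature-extraction results described above:

```python
from dataclasses import dataclass
from typing import List
import numpy as np

@dataclass
class Frame:
    index: int                  # playing order within the video
    contains_target: bool       # whether the target retrieval object appears in the frame
    initial_fusion: np.ndarray  # initial fusion feature a(k) of the frame

def frames_to_fuse(frames: List[Frame], query_index: int) -> List[Frame]:
    """Keep frames that contain the target and are played earlier than the query frame."""
    return [f for f in frames if f.contains_target and f.index < query_index]
```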
Specifically, based on the initial fusion features corresponding to the images to be retrieved and the initial fusion features corresponding to the images to be fused, the terminal device may obtain the target fusion features in the following manner:
A. The terminal device performs weighted summation on the initial fusion feature corresponding to the image to be retrieved and the initial fusion features corresponding to the images to be fused, based on the weight coefficients corresponding to the image to be retrieved and each image to be fused.
It should be noted that, in the embodiment of the present application, the weight coefficient corresponding to each of the image to be retrieved and each of the image to be fused may be determined according to the number of each of the image to be fused, or may be preset, which is not limited to this.
Assuming that the weight coefficients corresponding to the images to be fused are q1, q2, …, qi-1, and the weight coefficient corresponding to the image to be retrieved is qi, the fusion feature obtained after weighted summation can be represented as q1·a(k1) + q2·a(k2) + … + qi-1·a(ki-1) + qi·a(ki).
Still taking video frame 5 as the image to be retrieved, and assuming that the weight coefficients corresponding to the image to be retrieved and each image to be fused are all 1, the fusion feature obtained after weighted summation is a(k1) + a(k2) + a(k3) + a(k4) + a(k5).
B. The terminal device averages the fusion feature obtained after weighted summation based on the number of images to be fused, to obtain the target fusion feature.
Assuming that the weight coefficients corresponding to the image to be retrieved and each image to be fused are all 1, and denoting the target fusion feature by ā, the target fusion feature can be calculated with the following formula:

ā = [a(k1) + a(k2) + … + a(ki-1) + a(ki)] / i
Still taking video frame 5 as the image to be retrieved, the number of images to be fused is 4, and averaging the fusion feature obtained after weighted summation yields the target fusion feature, which is represented as {a(k1) + a(k2) + a(k3) + a(k4) + a(k5)}/5.
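A sketch of steps A and B above, assuming unit weights as in the example; note that, following the video frame 5 example, the division is by the total number of features summed (the images to be fused plus the image to be retrieved):

```python
from typing import List, Optional
import numpy as np

def target_fusion_feature(query_feat: np.ndarray,
                          fuse_feats: List[np.ndarray],
                          weights: Optional[List[float]] = None) -> np.ndarray:
    """Weighted sum of the initial fusion features, averaged over the number of summed features."""
    feats = fuse_feats + [query_feat]
    if weights is None:
        weights = [1.0] * len(feats)          # unit weights, as in the example
    total = sum(w * f for w, f in zip(weights, feats))
    return total / len(feats)

# With video frame 5 as the query and frames 1-4 as images to be fused, this computes
# {a(k1) + a(k2) + a(k3) + a(k4) + a(k5)} / 5.
```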
It should be noted that, in the embodiment of the present application, the image to be fused may also be another video frame in the video within the specified time length, which includes the target retrieval object and has a playing time earlier than that of the video frame.
S204, the terminal device determines at least one target image matched with the image to be retrieved from the candidate images based on the target fusion characteristics corresponding to the candidate images and the image to be retrieved.
Specifically, when S204 is executed, the following manners may be adopted, but not limited to:
the terminal device calculates the target fusion features corresponding to the candidate images respectively, the similarity between the candidate images and the target fusion features corresponding to the images to be retrieved respectively, and then determines at least one target image matched with the images to be retrieved from the candidate images with the similarity between the candidate images and the images to be retrieved larger than a preset threshold value based on the calculated similarity.
For example, the similarity between the target fusion feature corresponding to the image to be retrieved and the target fusion feature corresponding to any candidate image may be, but is not limited to, the cosine similarity, calculated as follows:

s_x = (ā · b_x) / (‖ā‖ ‖b_x‖)

where s_x represents the similarity, ā represents the target fusion feature corresponding to the image to be retrieved, and b_x represents the target fusion feature corresponding to any one candidate image.

In the embodiment of the application, when the semantic features and the visual features are normalized to the range 0 to 1, the components of ā and b_x also lie between 0 and 1, so the calculated s_x likewise ranges from 0 to 1: the larger the value of s_x, the higher the similarity between ā and b_x, and the smaller the value of s_x, the lower the similarity between ā and b_x.
Based on the calculated similarities, when determining target images matching the image to be retrieved from the candidate images whose similarity is greater than the preset threshold, the terminal device first filters the candidate images by the preset threshold to obtain a candidate image set whose similarity is greater than the preset threshold, and then selects a set number of target images from the candidate image set in descending order of similarity. Further, the terminal device may also display the determined target images on the operation interface.
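A sketch of the matching step: cosine similarity against each stored candidate feature, filtering by a preset threshold, and returning a set number of best matches. The threshold value and the number of returned images are illustrative assumptions:

```python
from typing import Dict, List, Tuple
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def retrieve(query_feat: np.ndarray,
             candidate_feats: Dict[str, np.ndarray],
             threshold: float = 0.8,
             top_k: int = 10) -> List[Tuple[str, float]]:
    """Return up to top_k candidate ids whose similarity exceeds the threshold, best first."""
    scored = [(cid, cosine_similarity(query_feat, feat))
              for cid, feat in candidate_feats.items()]
    kept = [(cid, s) for cid, s in scored if s > threshold]
    return sorted(kept, key=lambda item: item[1], reverse=True)[:top_k]
```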
In some embodiments, before determining at least one target image matching the image to be retrieved from the candidate images based on the target fusion features corresponding to the candidate images and the image to be retrieved, the terminal device may further determine at least one candidate region included in each candidate image from the candidate images, where each candidate region includes a retrieval object of one retrieval type.
The terminal device may perform the following operations for any one of the determined candidate regions:
inputting any candidate region into a target classification model to obtain a visual feature and a semantic feature corresponding to any candidate region, and obtaining a target fusion feature corresponding to any candidate region based on the visual feature and the semantic feature corresponding to any candidate region;
and recording the mapping relation between the target fusion characteristics corresponding to any candidate region and the corresponding candidate image.
Since the process of obtaining the target fusion feature corresponding to any one candidate region is the same as the process of obtaining the target fusion feature corresponding to the image to be retrieved, the details are not repeated here.
In the embodiment of the application, the candidate image may be a certain video frame in a video or may be a picture.
For example, assume that candidate image 1 contains a pedestrian and a car. The terminal device determines candidate region 1 and candidate region 2 from candidate image 1, where candidate region 1 contains the pedestrian and candidate region 2 contains the car. For candidate region 1, the terminal device inputs candidate region 1 into the target classification model to obtain the visual feature and semantic feature corresponding to candidate region 1, obtains the target fusion feature corresponding to candidate region 1 based on the visual feature and semantic feature corresponding to candidate region 1, and records the mapping relationship between the target fusion feature corresponding to candidate region 1 and candidate image 1. Similarly, the terminal device obtains the target fusion feature corresponding to candidate region 2 and records the mapping relationship between the target fusion feature corresponding to candidate region 2 and candidate image 1.
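The offline indexing of candidate images can be sketched as follows; detect_regions and fusion_feature are hypothetical placeholders for the region detector and the classification-plus-fusion pipeline described above:

```python
from typing import Callable, Dict, List, Tuple
import numpy as np

def build_index(candidate_images: Dict[str, np.ndarray],
                detect_regions: Callable[[np.ndarray], List[np.ndarray]],
                fusion_feature: Callable[[np.ndarray], np.ndarray]
                ) -> List[Tuple[np.ndarray, str]]:
    """Record, for each candidate region, its target fusion feature and the candidate image it maps to."""
    index: List[Tuple[np.ndarray, str]] = []
    for image_id, image in candidate_images.items():
        for region in detect_regions(image):       # one region per retrieval object / type
            feat = fusion_feature(region)           # visual + semantic -> target fusion feature
            index.append((feat, image_id))          # mapping: fusion feature -> candidate image
    return index
```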
The present application will be described below with reference to a specific example.
Referring to fig. 6, in a video surveillance scenario, video frame 4 in a surveillance video is used as the image to be retrieved, and the target retrieval object contained in the image to be retrieved is a vehicle. After acquiring video frame 4, the terminal device obtains the initial fusion feature corresponding to video frame 4. Assuming that the images to be fused associated with video frame 4 are video frame 1, video frame 2, and video frame 3, the terminal device obtains the target fusion feature corresponding to the image to be retrieved based on the initial fusion features corresponding to video frames 1, 2, and 3 and the initial fusion feature corresponding to video frame 4.
Then, the terminal device matches the target fusion feature corresponding to the image to be retrieved against the target fusion features corresponding to candidate image 1, which include the target fusion feature corresponding to candidate region 1 and the target fusion feature corresponding to candidate region 2. Based on the target fusion features corresponding to the candidate image and the target fusion feature corresponding to the image to be retrieved, it is determined that candidate image 1 matches the image to be retrieved; specifically, candidate region 1 in candidate image 1 matches the image to be retrieved.
Based on the same inventive concept, referring to fig. 7, an embodiment of the present application provides a target retrieval apparatus, where the target retrieval apparatus 700 at least includes:
an obtaining unit 701, configured to obtain an image to be retrieved including a target retrieval object;
an output unit 702, configured to input the image to be retrieved into a target classification model including a feature extraction layer and an output layer, to obtain a visual feature output by the feature extraction layer, and obtain a semantic feature output by the output layer, where the semantic feature is used to characterize a classification result of the target detection object;
a fusion unit 703, configured to perform feature fusion on the semantic features and the visual features to obtain target fusion features corresponding to the image to be retrieved;
a matching unit 704, configured to determine, from each candidate image, at least one target image that matches the image to be retrieved based on the target fusion features corresponding to each candidate image and the image to be retrieved.
Optionally, when performing feature fusion on the semantic features and the visual features to obtain a target fusion feature corresponding to the image to be retrieved, the fusion unit 703 is specifically configured to:
splicing the semantic features and the visual features to obtain initial fusion features corresponding to the image to be retrieved, and directly taking the initial fusion features as the target fusion features; or,
splicing the semantic features and the visual features to obtain initial fusion features corresponding to the image to be retrieved, obtaining initial fusion features corresponding to images to be fused associated with the image to be retrieved, and obtaining the target fusion features based on the initial fusion features corresponding to the image to be retrieved and the initial fusion features corresponding to the images to be fused.
Optionally, when the semantic features and the visual features are spliced to obtain an initial fusion feature corresponding to the image to be retrieved, the fusion unit 703 is specifically configured to:
splicing the semantic features and the visual features according to a specified feature splicing sequence;
and weighting the spliced features based on the preset weight coefficients corresponding to the semantic features and the visual features respectively to obtain initial fusion features corresponding to the image to be retrieved.
Optionally, when the target fusion feature is obtained based on the initial fusion feature corresponding to the image to be retrieved and the initial fusion feature corresponding to each image to be fused, the fusion unit 703 is specifically configured to:
based on the weight coefficients corresponding to the image to be retrieved and the images to be fused, carrying out weighted summation on the initial fusion features corresponding to the image to be retrieved and the initial fusion features corresponding to the images to be fused;
and averaging the fusion features obtained after weighted summation based on the number of the images to be fused to obtain the target fusion features.
Optionally, the semantic features include classification confidence degrees and normalization information; when the semantic features and the visual features are spliced to obtain initial fusion features corresponding to the image to be retrieved, the fusion unit 703 is specifically configured to:
based on the normalization information contained in the semantic features, performing normalization processing on the classification confidence degrees and the visual features contained in the semantic features to obtain the classification confidence degrees and the visual features which accord with a preset value range;
and splicing the classification confidence coefficient and the visual characteristic which accord with a preset value range to obtain an initial fusion characteristic corresponding to the image to be retrieved.
Optionally, the classification confidences include a category confidence and/or an attribute confidence.
Optionally, the image to be retrieved is a video frame in a video, and the video further includes other video frames;
before the initial fusion features corresponding to the images to be fused associated with the images to be retrieved are obtained, the fusion unit 703 is further configured to:
taking other video frames which contain the target retrieval object and have playing time earlier than the video frame as images to be fused associated with the images to be retrieved;
for any image to be fused in the images to be fused, the following operations are executed:
inputting any image to be fused into the target classification model to obtain a visual feature and a semantic feature corresponding to the any image to be fused, wherein the semantic feature of the any image to be fused is used for representing a classification result of the any image to be fused;
and obtaining the initial fusion feature corresponding to any image to be fused based on the visual feature and the semantic feature corresponding to any image to be fused.
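For the video case described above, a sketch of collecting the images to be fused and their initial fusion features might look as follows; contains_target and classify are hypothetical helpers standing in for the object detector and the target classification model, and the plain concatenation used for the initial fusion feature mirrors the splicing sketch given earlier.

    import numpy as np

    def collect_fused_frame_features(video_frames, query_index,
                                     contains_target, classify):
        # Earlier frames that contain the target retrieval object serve as the
        # images to be fused; classify(frame) is assumed to return
        # (visual_feat, semantic_feat) from the target classification model.
        initial_feats = []
        for idx, frame in enumerate(video_frames):
            if idx >= query_index or not contains_target(frame):
                continue
            visual_feat, semantic_feat = classify(frame)
            initial_feats.append(np.concatenate([
                np.asarray(semantic_feat, dtype=np.float32),
                np.asarray(visual_feat, dtype=np.float32)]))
        return initial_feats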
Optionally, before determining at least one target image matched with the image to be retrieved from the candidate images based on the target fusion features corresponding to the candidate images and the image to be retrieved, the fusion unit 703 is further configured to:
determining at least one candidate area contained in each candidate image from each candidate image, wherein each candidate area contains a retrieval object of one retrieval type;
for any one of the determined candidate regions, performing the following operations:
inputting the candidate region into the target classification model to obtain a visual feature and a semantic feature corresponding to the candidate region, and obtaining a target fusion feature corresponding to the candidate region based on the visual feature and the semantic feature corresponding to the candidate region;
and recording the mapping relation between the target fusion feature corresponding to the candidate region and the corresponding candidate image.
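The offline indexing of candidate images described above could be sketched as follows; detect_regions and classify are hypothetical helpers, and storing the mapping as a flat list of (feature, image id) pairs is only one possible realization of the recorded mapping relation.

    import numpy as np

    def build_candidate_index(candidate_images, detect_regions, classify):
        # For each candidate image, extract its candidate regions (one retrieval
        # type per region), compute a target fusion feature per region, and
        # record the mapping back to the owning candidate image.
        index = []  # (target_fusion_feature, candidate_image_id) pairs
        for image_id, image in enumerate(candidate_images):
            for region in detect_regions(image):
                visual_feat, semantic_feat = classify(region)
                fusion_feat = np.concatenate([
                    np.asarray(semantic_feat, dtype=np.float32),
                    np.asarray(visual_feat, dtype=np.float32)])
                index.append((fusion_feat, image_id))
        return index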
Optionally, when determining at least one target image matched with the image to be retrieved from the candidate images based on the target fusion features corresponding to the candidate images and the image to be retrieved, the matching unit 704 is specifically configured to:
calculating the similarity between the target fusion features corresponding to each candidate image and the target fusion features corresponding to the image to be retrieved;
and determining, based on the calculated similarities, at least one target image matched with the image to be retrieved from the candidate images whose similarity is greater than a preset threshold.
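A sketch of this matching step is given below. Cosine similarity and a threshold of 0.8 are assumptions; the embodiments only require some similarity measure and a preset threshold.

    import numpy as np

    def match_candidates(query_feat, index, threshold=0.8):
        # Compare the target fusion feature of the image to be retrieved with the
        # target fusion feature of every indexed candidate region, keep candidates
        # whose similarity exceeds the preset threshold, and rank them.
        q = np.asarray(query_feat, dtype=np.float32)
        matches = []
        for feat, image_id in index:
            f = np.asarray(feat, dtype=np.float32)
            sim = float(np.dot(q, f) / (np.linalg.norm(q) * np.linalg.norm(f) + 1e-8))
            if sim > threshold:
                matches.append((sim, image_id))
        matches.sort(key=lambda pair: pair[0], reverse=True)
        return matches  # best-matching candidate images first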
For convenience of description, the above parts are divided into modules (or units) according to function and described separately. Of course, when implementing the present application, the functionality of the various modules (or units) may be implemented in one and the same piece, or in several pieces, of software or hardware.
With regard to the apparatus in the above embodiment, the specific manner in which each unit performs its operations has been described in detail in the embodiments of the method, and will not be elaborated here.
In the embodiments of the present application, after an image to be retrieved containing a target retrieval object is obtained, the image to be retrieved is input into a target classification model comprising a feature extraction layer and an output layer; visual features are obtained from the feature extraction layer and semantic features from the output layer; the semantic features and the visual features are then fused to obtain target fusion features corresponding to the image to be retrieved, and a target image matched with the image to be retrieved is determined from the candidate images based on the target fusion features. Because the target fusion features contain both the visual features, which are well interpretable and adapt well to feature types not anticipated by the training set, and the semantic features, which offer good accuracy and robustness, the target fusion features have better robustness and wider scene adaptability, improving the accuracy and efficiency of image retrieval.
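By way of illustration of the feature extraction itself, the sketch below uses a torchvision ResNet-18 as a stand-in for the target classification model: the pooled output of the last convolutional stage plays the role of the visual features from the feature extraction layer, and the softmax confidences of the final fully connected layer play the role of the semantic features from the output layer. The choice of backbone and of softmax confidences is an assumption; the embodiments do not prescribe a particular network.

    import torch
    import torchvision.models as models

    def extract_visual_and_semantic(image_tensor):
        # image_tensor: a (1, 3, 224, 224) preprocessed image to be retrieved.
        model = models.resnet18(weights=None)
        model.eval()
        with torch.no_grad():
            x = model.conv1(image_tensor)
            x = model.bn1(x)
            x = model.relu(x)
            x = model.maxpool(x)
            x = model.layer1(x)
            x = model.layer2(x)
            x = model.layer3(x)
            x = model.layer4(x)
            visual = model.avgpool(x).flatten(1)     # feature extraction layer output
            logits = model.fc(visual)                # output layer
            semantic = torch.softmax(logits, dim=1)  # classification confidences
        return visual.squeeze(0), semantic.squeeze(0)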
As will be appreciated by one skilled in the art, aspects of the present application may be embodied as a system, a method, or a program product. Accordingly, various aspects of the present application may take the form of an entirely hardware embodiment, an entirely software embodiment (including firmware, microcode, and the like), or an embodiment combining hardware and software aspects, which may all generally be referred to herein as a "circuit," a "module," or a "system."
Based on the same inventive concept, the embodiment of the application also provides the electronic equipment. In one embodiment, the electronic device may be a server or a terminal device. Referring to fig. 8, which is a schematic structural diagram of a possible electronic device provided in an embodiment of the present application, in fig. 8, an electronic device 800 includes: a processor 810 and a memory 820.
The memory 820 stores a computer program executable by the processor 810, and the processor 810 can execute the steps of the target retrieval method by executing the instructions stored in the memory 820.
The memory 820 may be a volatile memory, such as a random-access memory (RAM); the memory 820 may also be a non-volatile memory, such as a read-only memory (ROM), a flash memory, a hard disk drive (HDD), or a solid-state drive (SSD); or the memory 820 may be any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer, without being limited thereto. The memory 820 may also be a combination of the above.
The processor 810 may include one or more Central Processing Units (CPUs), a digital processing unit, or the like. The processor 810 is configured to implement the above target retrieval method when executing the computer program stored in the memory 820.
In some embodiments, processor 810 and memory 820 may be implemented on the same chip, or in some embodiments, they may be implemented separately on separate chips.
The specific connection medium between the processor 810 and the memory 820 is not limited in the embodiments of the present application. In the embodiment of the present application, the processor 810 and the memory 820 are connected by a bus, which is depicted by a thick line in fig. 8; the connection manner between other components is merely illustrative and not limiting. The bus may be divided into an address bus, a data bus, a control bus, and the like. For ease of description, only one thick line is depicted in fig. 8, but this does not mean that there is only one bus or only one type of bus.
Based on the same inventive concept, embodiments of the present application provide a computer-readable storage medium including a computer program for causing an electronic device to perform the steps of the above target retrieval method when the computer program runs on the electronic device. In some possible embodiments, the various aspects of the target retrieval method provided in the present application may also be implemented in the form of a program product including a computer program for causing an electronic device to perform the steps of the target retrieval method described above when the program product is run on the electronic device; for example, the electronic device may perform the steps shown in fig. 2.
The program product may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection having one or more wires, a portable Disk, a hard Disk, a RAM, a ROM, an erasable programmable Read-Only Memory (EPROM or flash Memory), an optical fiber, a portable Compact Disk Read-Only Memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
The program product of the embodiments of the present application may be a CD-ROM and include a computer program, and may be run on an electronic device. However, the program product of the present application is not so limited, and in this document, a readable storage medium may be any tangible medium that can contain, or store a computer program for use by or in connection with a command execution system, apparatus, or device.
A readable signal medium may include a propagated data signal with a readable computer program embodied therein, either in baseband or as part of a carrier wave. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A readable signal medium may also be any readable medium that is not a readable storage medium and that can communicate, propagate, or transport a computer program for use by or in connection with a command execution system, apparatus, or device.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While the preferred embodiments of the present application have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all alterations and modifications as fall within the scope of the application.
It will be apparent to those skilled in the art that various changes and modifications may be made in the present application without departing from the spirit and scope of the application. Thus, if such modifications and variations of the present application fall within the scope of the claims of the present application and their equivalents, the present application is intended to include such modifications and variations as well.

Claims (12)

1. A target retrieval method, comprising:
acquiring an image to be retrieved containing a target retrieval object;
inputting the image to be retrieved into a target classification model comprising a feature extraction layer and an output layer, obtaining visual features output by the feature extraction layer, and obtaining semantic features output by the output layer, wherein the semantic features are used for representing classification results of the target retrieval object;
performing feature fusion on the semantic features and the visual features to obtain target fusion features corresponding to the image to be retrieved;
and determining at least one target image matched with the image to be retrieved from the candidate images based on the target fusion characteristics corresponding to the candidate images and the image to be retrieved respectively.
2. The method of claim 1, wherein the performing feature fusion on the semantic features and the visual features to obtain target fusion features corresponding to the image to be retrieved comprises:
splicing the semantic features and the visual features to obtain initial fusion features corresponding to the image to be retrieved, and directly taking the initial fusion features as the target fusion features; or,
and splicing the semantic features and the visual features to obtain initial fusion features corresponding to the image to be retrieved, obtaining initial fusion features corresponding to each image to be fused associated with the image to be retrieved, and obtaining the target fusion features based on the initial fusion features corresponding to the image to be retrieved and the initial fusion features corresponding to the respective images to be fused.
3. The method of claim 2, wherein the stitching the semantic features and the visual features to obtain initial fusion features corresponding to the image to be retrieved comprises:
splicing the semantic features and the visual features according to a specified feature splicing sequence;
and weighting the spliced features based on the preset weight coefficients corresponding to the semantic features and the visual features respectively to obtain initial fusion features corresponding to the image to be retrieved.
4. The method according to claim 2, wherein the obtaining the target fusion feature based on the initial fusion feature corresponding to the image to be retrieved and the initial fusion features corresponding to the respective images to be fused comprises:
based on the weight coefficients corresponding to the image to be retrieved and the images to be fused, carrying out weighted summation on the initial fusion features corresponding to the image to be retrieved and the initial fusion features corresponding to the images to be fused;
and averaging the fusion features obtained after weighted summation based on the number of the images to be fused to obtain the target fusion features.
5. The method of claim 2, wherein the semantic features include respective classification confidence levels and normalization information;
the splicing the semantic features and the visual features to obtain initial fusion features corresponding to the image to be retrieved comprises:
based on the normalization information contained in the semantic features, performing normalization processing on the classification confidence degrees contained in the semantic features and on the visual features to obtain classification confidence degrees and visual features that conform to a preset value range;
and splicing the classification confidence coefficient and the visual characteristic which accord with a preset value range to obtain an initial fusion characteristic corresponding to the image to be retrieved.
6. The method of claim 5, wherein the classification confidences comprise category confidences and/or attribute confidences.
7. The method of claim 2, wherein the image to be retrieved is a video frame in a video, and the video further comprises other video frames;
before the obtaining of the initial fusion features corresponding to the images to be fused associated with the image to be retrieved, the method further includes:
taking other video frames which contain the target retrieval object and whose playing time is earlier than that of the video frame as the images to be fused associated with the image to be retrieved;
for any image to be fused among the images to be fused, the following operations are performed:
inputting the image to be fused into the target classification model to obtain a visual feature and a semantic feature corresponding to the image to be fused, wherein the semantic feature of the image to be fused is used for representing a classification result of the image to be fused;
and obtaining the initial fusion feature corresponding to the image to be fused based on the visual feature and the semantic feature corresponding to the image to be fused.
8. The method according to any one of claims 1 to 7, wherein before determining at least one target image matching the image to be retrieved from the candidate images based on the target fusion features corresponding to the candidate images and the image to be retrieved, the method further comprises:
determining at least one candidate area contained in each candidate image from each candidate image, wherein each candidate area contains a retrieval object of one retrieval type;
for any one of the determined candidate regions, performing the following operations:
inputting the candidate region into the target classification model to obtain a visual feature and a semantic feature corresponding to the candidate region, and obtaining a target fusion feature corresponding to the candidate region based on the visual feature and the semantic feature corresponding to the candidate region;
and recording the mapping relation between the target fusion feature corresponding to the candidate region and the corresponding candidate image.
9. The method according to any one of claims 1 to 7, wherein the determining at least one target image matching the image to be retrieved from the candidate images based on the target fusion features corresponding to the candidate images and the image to be retrieved respectively comprises:
calculating the similarity between the target fusion features corresponding to each candidate image and the target fusion features corresponding to the image to be retrieved;
and determining at least one target image matched with the image to be retrieved from candidate images with the similarity greater than a preset threshold value based on the calculated similarity.
10. An object retrieval apparatus, comprising:
the device comprises an acquisition unit, a retrieval unit and a retrieval unit, wherein the acquisition unit is used for acquiring an image to be retrieved containing a target retrieval object;
the output unit is used for inputting the image to be retrieved into a target classification model comprising a feature extraction layer and an output layer, obtaining visual features output by the feature extraction layer, and obtaining semantic features output by the output layer, wherein the semantic features are used for representing classification results of the target retrieval object;
the fusion unit is used for performing feature fusion on the semantic features and the visual features to obtain target fusion features corresponding to the image to be retrieved;
and the matching unit is used for determining at least one target image matched with the image to be retrieved from each candidate image based on the target fusion characteristics corresponding to each candidate image and the image to be retrieved.
11. An electronic device, characterized in that it comprises a processor and a memory, wherein the memory stores a computer program which, when executed by the processor, causes the processor to carry out the steps of the method of any of claims 1-9.
12. A computer-readable storage medium, characterized in that it comprises a computer program for causing an electronic device to carry out the steps of the method of any one of claims 1-9, when the computer program is run on the electronic device.
CN202210538456.8A 2022-05-17 2022-05-17 Target retrieval method and related device Pending CN114880513A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210538456.8A CN114880513A (en) 2022-05-17 2022-05-17 Target retrieval method and related device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210538456.8A CN114880513A (en) 2022-05-17 2022-05-17 Target retrieval method and related device

Publications (1)

Publication Number Publication Date
CN114880513A true CN114880513A (en) 2022-08-09

Family

ID=82675213

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210538456.8A Pending CN114880513A (en) 2022-05-17 2022-05-17 Target retrieval method and related device

Country Status (1)

Country Link
CN (1) CN114880513A (en)


Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115937145A (en) * 2022-12-09 2023-04-07 深圳市禾葡兰信息科技有限公司 Skin health visualization method, device and equipment based on big data analysis
CN115937145B (en) * 2022-12-09 2024-03-19 深圳市禾葡兰信息科技有限公司 Skin health visualization method, device and equipment based on big data analysis
CN117132600A (en) * 2023-10-26 2023-11-28 广东岚瑞新材料科技集团有限公司 Injection molding product quality detection system and method based on image
CN117132600B (en) * 2023-10-26 2024-04-16 广东岚瑞新材料科技集团有限公司 Injection molding product quality detection system and method based on image

Similar Documents

Publication Publication Date Title
CN111368893B (en) Image recognition method, device, electronic equipment and storage medium
US20200334830A1 (en) Method, apparatus, and storage medium for processing video image
WO2022105125A1 (en) Image segmentation method and apparatus, computer device, and storage medium
EP3893125A1 (en) Method and apparatus for searching video segment, device, medium and computer program product
CN114880513A (en) Target retrieval method and related device
US9323988B2 (en) Content-adaptive pixel processing systems, methods and apparatus
CN111327945A (en) Method and apparatus for segmenting video
CN112954450B (en) Video processing method and device, electronic equipment and storage medium
CN111666960A (en) Image recognition method and device, electronic equipment and readable storage medium
CN112614110B (en) Method and device for evaluating image quality and terminal equipment
CN112150497B (en) Local activation method and system based on binary neural network
CN113033507B (en) Scene recognition method and device, computer equipment and storage medium
CN113591527A (en) Object track identification method and device, electronic equipment and storage medium
CN110807472B (en) Image recognition method and device, electronic equipment and storage medium
CN114283351A (en) Video scene segmentation method, device, equipment and computer readable storage medium
CN114663871A (en) Image recognition method, training method, device, system and storage medium
CN116486109A (en) Modal self-adaptive descriptive query pedestrian re-identification method and system
CN116524261A (en) Image classification method and product based on multi-mode small sample continuous learning
CN113657249B (en) Training method, prediction method, device, electronic equipment and storage medium
KR101743169B1 (en) System and Method for Searching Missing Family Using Facial Information and Storage Medium of Executing The Program
CN111046232B (en) Video classification method, device and system
CN116261009B (en) Video detection method, device, equipment and medium for intelligently converting video audience
Wang et al. TIENet: task-oriented image enhancement network for degraded object detection
CN116258873A (en) Position information determining method, training method and device of object recognition model
CN113239215B (en) Classification method and device for multimedia resources, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination