WO2024051350A1 - Image retrieval method, apparatus, electronic device and storage medium - Google Patents

Image retrieval method, apparatus, electronic device and storage medium

Info

Publication number: WO2024051350A1
Authority: WIPO (PCT)
Prior art keywords: image, retrieved, data, feature, retrieval
Application number: PCT/CN2023/107962
Other languages: English (en), French (fr)
Inventors: 舒秀军, 文伟, 谯睿智
Original Assignee: Tencent Technology (Shenzhen) Co., Ltd. (腾讯科技(深圳)有限公司)
Application filed by Tencent Technology (Shenzhen) Co., Ltd.
Priority to US18/421,239 (published as US20240168992A1)
Publication of WO2024051350A1


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50 Information retrieval of still image data
    • G06F16/53 Querying
    • G06F16/532 Query formulation, e.g. graphical querying
    • G06F16/538 Presentation of query results
    • G06F16/55 Clustering; Classification
    • G06F16/58 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/583 Retrieval using metadata automatically derived from the content
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/44 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G06V10/70 Arrangements using pattern recognition or machine learning
    • G06V10/74 Image or video pattern matching; Proximity measures in feature spaces
    • G06V10/761 Proximity, similarity or dissimilarity measures
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774 Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting

Description

  • This application relates to the field of artificial intelligence technology, and in particular to an image retrieval method, device, electronic equipment and storage medium.
  • With the rapid development of Internet technology, image retrieval has been widely used in a variety of scenarios.
  • In the related art, image retrieval is generally performed based on input data to be retrieved, and the data to be retrieved is generally an image; that is, this kind of image retrieval is in fact a search-by-image, in which images similar to the input query image are retrieved from an image database.
  • However, this image retrieval method cannot be generalized to other types of data to be retrieved, and the accuracy of image retrieval needs to be improved.
  • Embodiments of the present application provide an image retrieval method, device, electronic device and storage medium, which can improve the accuracy of image retrieval.
  • In one aspect, embodiments of the present application provide an image retrieval method, including:
  • the electronic device acquires a candidate image set and data to be retrieved in multiple modalities, wherein the candidate image set includes a plurality of candidate images;
  • the electronic device performs feature extraction on the data to be retrieved based on a target model to obtain a first feature of the data to be retrieved, and performs feature extraction on the candidate images multiple times based on the target model to obtain second features of the candidate images after alignment with the data to be retrieved in each of the multiple modalities;
  • the electronic device determines a first similarity between the candidate image and the data to be retrieved in each modality based on the first feature and the second feature, and determines, from the candidate image set based on the first similarity, result image sets corresponding to a plurality of retrieval data combinations, wherein each retrieval data combination includes the data to be retrieved in at least one modality;
  • the electronic device merges multiple result image sets to obtain image retrieval results.
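  • Taken together, the four steps above can be sketched as follows; a minimal illustration assuming PyTorch and precomputed features, where the function name retrieve, the fusion weights and all shapes are illustrative assumptions rather than part of the application:

    import torch

    def retrieve(f_query: dict, g_aligned: dict, weights: dict, top_k: int = 10):
        # f_query[m]: (d,) first feature of the data to be retrieved in modality m
        # g_aligned[m]: (N, d) second features of N candidates aligned to modality m
        sims = {m: g_aligned[m] @ f_query[m] for m in f_query}         # first similarities
        sims["combined"] = sum(weights[m] * sims[m] for m in f_query)  # fused combination
        result_sets = {n: s.topk(top_k).indices.tolist() for n, s in sims.items()}
        merged, seen = [], set()                                       # merge + deduplicate
        for ids in result_sets.values():
            for i in ids:
                if i not in seen:
                    seen.add(i)
                    merged.append(i)
        return merged

    # toy usage: an image query and a text query against 100 candidates
    f = {"image": torch.randn(256), "text": torch.randn(256)}
    g = {"image": torch.randn(100, 256), "text": torch.randn(100, 256)}
    print(retrieve(f, g, {"image": 0.5, "text": 0.5})[:5])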
  • embodiments of the present application also provide an image retrieval device, including:
  • a data acquisition module configured to acquire a candidate image set and data to be retrieved in multiple modalities, where the candidate image set includes multiple candidate images;
  • a model processing module configured to perform feature extraction on the data to be retrieved based on a target model to obtain the first feature of the data to be retrieved, and to perform feature extraction on the candidate images multiple times based on the target model to obtain the second features of the candidate images after alignment with the data to be retrieved in each modality;
  • a retrieval module configured to determine a first degree of similarity between the candidate image and the data to be retrieved in various modalities based on the first feature and the second feature, and based on the first similarity, Determine a result image set corresponding to a plurality of retrieval data combinations from the candidate image set, wherein the retrieval data combination includes the data to be retrieved in at least one modality;
  • a merging module configured to merge multiple result image sets to obtain image retrieval results.
  • embodiments of the present application also provide an electronic device, including a memory and a processor.
  • the memory stores a computer program.
  • When the processor executes the computer program, it implements the above image retrieval method.
  • embodiments of the present application also provide a computer-readable storage medium, the storage medium stores a computer program, and the computer program is executed by a processor to implement the above image retrieval method.
  • Embodiments of the present application also provide a computer program product.
  • the computer program product includes a computer program, and the computer program is stored in a computer-readable storage medium.
  • the processor of the computer device reads the computer program from the computer-readable storage medium, and the processor executes the computer program, so that the computer device executes the above image retrieval method.
  • Through the above scheme, the electronic device performs feature extraction on the data to be retrieved through a target model to obtain the first feature of the data to be retrieved, and performs feature extraction on the candidate images multiple times through the same target model to obtain the second features of the candidate images after alignment with the data to be retrieved in each modality. This not only uses the data to be retrieved in multiple modalities to improve the accuracy of image retrieval, but also unifies the feature framework of the multi-modal data to be retrieved and the candidate images, improving the feature-space consistency between the first feature and the second feature; moreover, using the same target model to determine the first feature and the second feature reduces the number of parameters of the target model and the overhead of deploying it.
  • On this basis, the electronic device determines the first similarity between the candidate images and the data to be retrieved in each modality based on the first feature and the second feature, determines the result image sets corresponding to multiple retrieval data combinations from the candidate image set based on the first similarity, and merges the multiple result image sets to obtain the image retrieval result. Compared with one-to-one retrieval between retrieval data and candidate images, this effectively improves the efficiency of image retrieval, and obtaining the image retrieval result from the result image sets of multiple retrieval data combinations can effectively improve the accuracy of image retrieval.
  • Figure 1 is a schematic diagram of an optional implementation environment provided by the embodiment of the present application.
  • Figure 2 is an optional flow chart of the image retrieval method provided by the embodiment of the present application.
  • Figure 3 is an optional structural schematic diagram of the target model provided by the embodiment of the present application.
  • Figure 4 is an optional flow chart for obtaining image retrieval results based on multiple retrieval data combinations provided by the embodiment of the present application.
  • Figure 5 is another optional structural schematic diagram of the target model provided by the embodiment of the present application.
  • Figure 6 is a schematic diagram of an optional training process of the target model provided by the embodiment of the present application.
  • Figure 7 is an optional flow diagram for expanding training samples provided by the embodiment of the present application.
  • Figure 8 is a schematic diagram of an optional overall architecture of the target model provided by the embodiment of the present application.
  • Figure 9 is a schematic diagram of another optional overall architecture of the target model provided by the embodiment of the present application.
  • Figure 10 is a schematic diagram of another optional overall architecture of the target model provided by the embodiment of the present application.
  • Figure 11 is a schematic diagram of another optional overall architecture of the target model provided by the embodiment of the present application.
  • Figure 12 is a schematic flow chart of image retrieval using a search engine provided by an embodiment of the present application.
  • Figure 13 is a schematic flow chart of image retrieval in a photo application provided by an embodiment of the present application.
  • Figure 14 is an optional structural schematic diagram of the image retrieval device provided by the embodiment of the present application.
  • Figure 15 is a partial structural block diagram of a terminal provided by an embodiment of the present application.
  • Figure 16 is a partial structural block diagram of a server provided by an embodiment of the present application.
  • In the embodiments of the present application, when it is necessary to perform relevant processing based on attribute information of a target object, the collection of attribute information, or other data related to the characteristics of the target object, the permission or consent of the target object is obtained first, and the collection, use and processing of such data comply with the relevant laws, regulations and standards of the relevant countries and regions; the target object may be a user.
  • When an embodiment of this application needs to obtain attribute information of the target object, the individual permission or independent consent of the target object is obtained through a pop-up window or a jump to a confirmation page; only after the individual permission or independent consent of the target object is clearly obtained are the necessary target-object-related data acquired, so that the embodiment of the present application can operate normally.
  • In the related art, image retrieval is generally performed based on input data to be retrieved, and the data to be retrieved is generally an image; that is, this kind of image retrieval is in fact a search-by-image, in which images similar to the input query image are retrieved from an image database. However, this image retrieval method cannot be generalized to other types of data to be retrieved, and the accuracy of image retrieval needs to be improved.
  • embodiments of the present application provide an image retrieval method, device, electronic device and storage medium, which can improve the accuracy of image retrieval.
  • Figure 1 is a schematic diagram of an optional implementation environment provided by an embodiment of the present application.
  • the implementation environment includes a terminal 101 and a server 102, where the terminal 101 and the server 102 are connected through a communication network.
  • The server 102 can be an independent physical server, a server cluster or distributed system composed of multiple physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communications, middleware services, domain name services, security services, CDN (Content Delivery Network), big data, and artificial intelligence platforms.
  • the server 102 can also be a node server in the blockchain network.
  • the terminal 101 may be a smartphone, a tablet computer, a notebook computer, a desktop computer, a smart speaker, a smart watch, a vehicle-mounted terminal, etc., but is not limited thereto.
  • the terminal 101 and the server 102 can be connected directly or indirectly through wired or wireless communication methods, and the embodiment of the present application is not limited here.
  • The terminal 101 can send the data to be retrieved in multiple modalities to the server 102. The server 102 receives the data to be retrieved and obtains a pre-stored candidate image set, performs feature extraction on the data to be retrieved based on the target model to obtain the first feature of the data to be retrieved, and performs feature extraction on the candidate images multiple times based on the target model to obtain the second features of the candidate images after alignment with the data to be retrieved in each modality. Based on the first feature and the second feature, the server determines the first similarity between each candidate image and the data to be retrieved in each modality; according to the first similarity, it determines the result image sets corresponding to multiple retrieval data combinations from the candidate image set, merges the multiple result image sets to obtain the image retrieval result, and sends the image retrieval result to the terminal 101; the terminal 101 displays the image retrieval result.
  • In this way, the server 102 performs feature extraction on the data to be retrieved through the target model to obtain the first feature of the data to be retrieved, and performs feature extraction on the candidate images multiple times through the same target model to obtain the second features of the candidate images after alignment with the data to be retrieved in each modality. This not only uses the data to be retrieved in multiple modalities to improve the accuracy of image retrieval, but also unifies the feature framework of the multi-modal data to be retrieved and the candidate images, improving the feature-space consistency between the first feature and the second feature. Using the same target model to determine the first feature and the second feature reduces the number of parameters of the target model and the memory overhead of deploying it; in addition, only the one target model needs to be trained, improving model training efficiency. On this basis, the first similarity between the candidate images and the data to be retrieved in each modality is determined based on the first feature and the second feature; according to the first similarity, the result image sets corresponding to multiple retrieval data combinations are determined from the candidate image set and merged to obtain the image retrieval result, which can effectively improve the accuracy of image retrieval.
  • the methods provided by the embodiments of this application can be applied to various technical fields, including but not limited to cloud technology, artificial intelligence and other technical fields.
  • Figure 2 is an optional flow diagram of an image retrieval method provided by an embodiment of the present application.
  • the image retrieval method can be executed by a server, or can be executed by a terminal, or can also be executed by a server and a terminal in cooperation.
  • the image retrieval method includes but is not limited to the following steps 201 to 204.
  • Step 201: The electronic device obtains a candidate image set and data to be retrieved in multiple modalities.
  • the candidate image set includes multiple candidate images, the candidate images are images in the retrieval database, and the image retrieval results are generated based on the candidate image set.
  • the data to be retrieved is the query data when performing image retrieval.
  • the modality is used to indicate the existence form of the data to be retrieved.
  • the modality can be image modality, text modality, voice modality, etc.
  • The data to be retrieved in the image modality is the image to be retrieved, the data to be retrieved in the text modality is the text to be retrieved, and the data to be retrieved in the voice modality is the voice to be retrieved.
  • The data to be retrieved in multiple modalities may include the image to be retrieved and the text to be retrieved; or the image to be retrieved and the voice to be retrieved; or the text to be retrieved and the voice to be retrieved; or the image to be retrieved, the text to be retrieved and the voice to be retrieved.
  • the data to be retrieved in multiple modalities are independent of each other, and the data to be retrieved in different modalities may be related or unrelated.
  • For example, the image to be retrieved may be an image containing three peonies and the text to be retrieved may be "three peonies"; in this case, the image to be retrieved and the text to be retrieved are related. Alternatively, the image to be retrieved may be an image containing three peonies while the text to be retrieved is "three cars"; in this case, there is no connection between the image to be retrieved and the text to be retrieved.
  • Step 202: The electronic device performs feature extraction on the data to be retrieved based on the target model to obtain the first feature of the data to be retrieved, and performs feature extraction on the candidate images multiple times based on the target model to obtain the second features of the candidate images after alignment with the data to be retrieved in each modality.
  • feature extraction may refer to mapping the data to be retrieved into a high-dimensional feature space.
  • Feature extraction of data to be retrieved based on the target model is to extract features of data to be retrieved in various modalities based on the target model.
  • The target model may be equipped with different feature extraction units to perform feature extraction on the data to be retrieved in each modality. For example, when the data to be retrieved in multiple modalities includes an image to be retrieved and a text to be retrieved, the target model is equipped with an image feature extraction unit and a text feature extraction unit: the image feature extraction unit extracts features of the image to be retrieved, and the text feature extraction unit extracts features of the text to be retrieved. When the data to be retrieved includes an image to be retrieved and a voice to be retrieved, the target model is equipped with an image feature extraction unit and a voice feature extraction unit, the latter extracting features of the voice to be retrieved. When the data to be retrieved includes a text to be retrieved and a voice to be retrieved, the target model is equipped with a text feature extraction unit and a voice feature extraction unit. When the data to be retrieved includes an image, a text and a voice to be retrieved, the target model is equipped with an image feature extraction unit, a text feature extraction unit and a voice feature extraction unit.
  • When feature extraction is performed on the data to be retrieved based on the target model, the data to be retrieved can first be converted into a retrieval embedding vector, and the retrieval embedding vector is input into the target model.
  • the retrieval embedding vector is used to characterize the initial features of the data to be retrieved (features before feature extraction and processing by the target model).
  • The data to be retrieved in different modalities are converted into retrieval embedding vectors with the same vector format, making it easy to represent them uniformly within the same model framework.
  • In some embodiments, the retrieval embedding vector may include an information embedding vector and a type embedding vector spliced together.
  • The information embedding vector is used to characterize the information contained in the data to be retrieved: when the data to be retrieved is the image to be retrieved, the information embedding vector represents the image information of the image to be retrieved; when the data to be retrieved is the text to be retrieved, it represents the text information of the text to be retrieved; and when the data to be retrieved is the voice to be retrieved, it represents the voice information of the voice to be retrieved.
  • The retrieval embedding vector can be expressed as $X = [f_{\mathrm{inf}}; f_{\mathrm{typ}}]$, where $X$ represents the retrieval embedding vector, $f_{\mathrm{inf}}$ represents the information embedding vector, and $f_{\mathrm{typ}}$ represents the type embedding vector.
  • the image information of the image to be retrieved can be characterized based on the information embedding vector, and the modal type characteristics of the data to be retrieved can be characterized based on the type embedding vector.
  • When processing is performed based on the target model, the target model determines the current modality of the data to be retrieved according to the type embedding vector, and then calls the corresponding feature extraction unit to extract features of the data to be retrieved. In this way, the target model can distinguish the data to be retrieved in multiple modalities, which facilitates unifying the representation of multi-modal data to be retrieved within the same model framework.
  • Aligning the candidate image to the data to be retrieved in each modality means mapping the candidate image and the data to be retrieved in each modality into the same high-dimensional feature space; that is, the first feature and the second feature are aligned with each other. For example, if the data to be retrieved in multiple modalities includes an image to be retrieved and a text to be retrieved, the candidate image is aligned with the image to be retrieved and also aligned with the text to be retrieved. Correspondingly, the number of second features obtained equals the number of modalities of the data to be retrieved: a second feature of the candidate image aligned to the image to be retrieved, and a second feature of the candidate image aligned to the text to be retrieved. It can be understood that if the data to be retrieved in multiple modalities also includes a voice to be retrieved, the candidate image is likewise aligned with the voice to be retrieved, giving a second feature of the candidate image aligned to the voice to be retrieved.
  • Different modal alignment units can be provided in the target model to perform feature extraction on the candidate images, so as to align the candidate images to the data to be retrieved in the corresponding modalities. For example, when the data to be retrieved in multiple modalities includes an image to be retrieved and a text to be retrieved, the target model is provided with an image modality alignment unit and a text modality alignment unit: the image modality alignment unit aligns the candidate images to the image to be retrieved, and the text modality alignment unit aligns the candidate images to the text to be retrieved. When the data to be retrieved includes an image to be retrieved and a voice to be retrieved, the target model is provided with an image modality alignment unit and a voice modality alignment unit, the latter aligning the candidate images to the voice to be retrieved. When the data to be retrieved includes a text to be retrieved and a voice to be retrieved, the target model is provided with a text modality alignment unit and a voice modality alignment unit. When the data to be retrieved includes an image, a text and a voice to be retrieved, the target model is provided with an image modality alignment unit, a text modality alignment unit and a voice modality alignment unit.
  • Figure 3 is an optional structural schematic diagram of a target model provided by an embodiment of the present application.
  • The target model is provided with multiple feature extraction units and multiple modal alignment units. Each feature extraction unit is used to extract features from the data to be retrieved in the corresponding modality, and each modal alignment unit is used to extract features from the candidate images so that the candidate images are aligned to the data to be retrieved in the corresponding modality.
  • the parameters between each feature extraction unit may be different, and the parameters between each modal alignment unit may be different.
  • In this way, feature extraction is performed on the data to be retrieved through the target model to obtain the first feature, and the candidate images undergo feature extraction multiple times through the same target model to obtain the second features after alignment with the data to be retrieved in each modality. This not only uses multi-modal data to be retrieved to improve retrieval accuracy, but also unifies the feature framework of the multi-modal data to be retrieved and the candidate images, improving the feature-space consistency between the first feature and the second feature. Using the same target model to determine both features reduces the number of parameters of the target model and the memory overhead of deploying it; in addition, only the one target model needs to be trained during the training phase, improving model training efficiency.
  • In some embodiments, the image feature extraction unit can also serve as the image modality alignment unit. That is, when the data to be retrieved in multiple modalities includes data of the image modality, the image feature extraction unit is used both to obtain the first feature of the image to be retrieved and to obtain the second feature of the candidate image, thereby reusing the image feature extraction unit and simplifying the structure of the target model. Alternatively, an additional image modality alignment unit can be provided to obtain the second feature of the candidate image, which is not limited in the embodiments of the present application.
  • For example, when the data to be retrieved in multiple modalities includes a text to be retrieved and an image to be retrieved, performing feature extraction on the candidate images multiple times based on the target model to obtain the second features after alignment with the data to be retrieved in each modality may specifically be: performing feature extraction on the candidate image based on the text modality alignment unit to obtain the second feature of the candidate image aligned to the text to be retrieved, and performing feature extraction on the candidate image based on the image feature extraction unit to obtain the image feature of the candidate image, which serves as the second feature of the candidate image aligned to the image to be retrieved. This reuses the image feature extraction unit and simplifies the structure of the target model. In other modality combinations, the image feature extraction unit can be reused in the same way, which will not be described again here.
  • Step 203: The electronic device determines the first similarity between the candidate image and the data to be retrieved in each modality based on the first feature and the second feature, and determines, from the candidate image set based on the first similarity, the result image sets corresponding to multiple retrieval data combinations.
  • A first similarity is determined between the candidate image and the data to be retrieved in each modality; that is, the number of first similarities is the same as the number of modalities of the data to be retrieved.
  • For example, when the data to be retrieved in multiple modalities includes an image to be retrieved and a text to be retrieved, the first similarity between the image to be retrieved and the candidate image is determined based on the first feature of the image to be retrieved and the second feature of the candidate image aligned to the image to be retrieved, and the first similarity between the text to be retrieved and the candidate image is determined based on the first feature of the text to be retrieved and the second feature of the candidate image aligned to the text to be retrieved. When the data to be retrieved in multiple modalities includes a text to be retrieved, an image to be retrieved and a voice to be retrieved, the first similarity between the image to be retrieved and the candidate image is determined from the first feature of the image to be retrieved and the second feature of the candidate image aligned to the image to be retrieved; the first similarity between the text to be retrieved and the candidate image is determined from the first feature of the text to be retrieved and the second feature of the candidate image aligned to the text to be retrieved; and the first similarity between the voice to be retrieved and the candidate image is determined from the first feature of the voice to be retrieved and the second feature of the candidate image aligned to the voice to be retrieved.
  • The retrieval data combination includes data to be retrieved in at least one modality; that is, a retrieval data combination may include data to be retrieved in one modality (a first data combination) or data to be retrieved in multiple modalities (a second data combination). For example, a first data combination may include the image to be retrieved, the text to be retrieved, or the voice to be retrieved; a second data combination may include the image and text to be retrieved, the image and voice to be retrieved, the text and voice to be retrieved, or the image, text and voice to be retrieved, and so on.
  • The first similarity may be a Euclidean distance matrix, a cosine similarity matrix, a Chebyshev distance matrix, etc., which is not limited in the embodiments of this application.
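  • As a concrete sketch of two of these choices, assuming PyTorch; the shapes are illustrative:

    import torch
    import torch.nn.functional as F

    first = torch.randn(2, 256)      # first features: 2 query modalities, d = 256
    second = torch.randn(1000, 256)  # second features: 1000 candidates, aligned

    # cosine-similarity matrix (higher values mean more similar)
    cos_sim = F.normalize(first, dim=-1) @ F.normalize(second, dim=-1).T  # (2, 1000)

    # Euclidean-distance matrix (lower values mean more similar)
    eucl_dist = torch.cdist(first, second)                                # (2, 1000)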
  • After the first similarity is obtained, the result image sets corresponding to the multiple retrieval data combinations can be determined from the candidate image set based on the first similarity. For example, the multiple retrieval data combinations may be the image to be retrieved and the text to be retrieved; correspondingly, the result image sets are the result image set corresponding to the image to be retrieved and the result image set corresponding to the text to be retrieved. In this case, each retrieval data combination is a first data combination, and the result image sets filtered out by the data to be retrieved in each modality can subsequently be merged to obtain the image retrieval result, thereby improving the accuracy of image retrieval. The first data combinations and second data combinations can also be used together; that is, the different retrieval data combinations can be the image to be retrieved, the text to be retrieved, and the image to be retrieved combined with the text to be retrieved. Correspondingly, the result image sets are the set corresponding to the image to be retrieved, the set corresponding to the text to be retrieved, and the set corresponding to the combination of the image and text to be retrieved. In this way, on the basis of using the data to be retrieved in each modality to obtain image retrieval results, combinations of multi-modal data to be retrieved are further introduced to expand the image retrieval results, further improving retrieval accuracy.
  • When determining the result image sets corresponding to the multiple retrieval data combinations from the candidate image set, the result image set corresponding to a first data combination can be determined from the candidate image set according to the first similarity corresponding to the data to be retrieved in one modality, and the first similarities corresponding to the data to be retrieved in multiple modalities can be fused to obtain a target similarity, with the result image set corresponding to a second data combination determined from the candidate image set according to the target similarity. The result image set corresponding to a first data combination is the result image set corresponding to the data to be retrieved in a single modality, and the result image set corresponding to a second data combination is the result image set corresponding to a combination of data to be retrieved in several modalities. For example, when the data to be retrieved in multiple modalities includes an image to be retrieved and a text to be retrieved, the result image sets corresponding to the first data combinations are those corresponding to the image to be retrieved and to the text to be retrieved respectively; on this basis, the first similarity corresponding to the image to be retrieved and the first similarity corresponding to the text to be retrieved can be fused to obtain the target similarity, realizing the combination of the image to be retrieved and the text to be retrieved. The fusion method may be a weighting or a multiplication of the multiple similarities.
  • The result image set may directly include the target images in the candidate image set that match each retrieval data combination. The number of target images in the result image set may also be preset; when there is more than one target image, the target images determined from the candidate image set can be further sorted, for example from largest to smallest first similarity, making the result image set clearer, as in the sketch below.
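  • A brief sketch of the two fusion methods and a preset, sorted result set, assuming PyTorch; the weights and the value of k are assumptions:

    import torch

    s_img = torch.rand(1000)  # first similarity: candidates vs. the image to be retrieved
    s_txt = torch.rand(1000)  # first similarity: candidates vs. the text to be retrieved

    target_w = 0.6 * s_img + 0.4 * s_txt  # fusion by weighting
    target_m = s_img * s_txt              # fusion by multiplication

    k = 20                                 # preset number of target images
    result_set = target_w.topk(k).indices  # top-k, already sorted large-to-small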
  • Step 204: The electronic device merges the multiple result image sets to obtain the image retrieval result.
  • After the result image sets corresponding to the retrieval data combinations are obtained, they can be merged to obtain the final image retrieval result.
  • the result image set can be deduplicated and then the final image retrieval result can be output.
  • different result image sets can be directly output side by side as the final image retrieval result.
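  • A small sketch of the two output strategies just described; the function names are illustrative:

    def merge_deduplicated(result_sets: dict) -> list:
        # one combined list: keep the first occurrence of each candidate index
        seen, merged = set(), []
        for ids in result_sets.values():
            for i in ids:
                if i not in seen:
                    seen.add(i)
                    merged.append(i)
        return merged

    def output_side_by_side(result_sets: dict) -> dict:
        # keep one ranked list per retrieval data combination
        return dict(result_sets)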
  • Figure 4 is an optional flow chart for obtaining image retrieval results based on multiple retrieval data combinations provided by an embodiment of the present application. Take the case where the data to be retrieved in multiple modalities includes an image to be retrieved and a text to be retrieved as an example: the image to be retrieved is an image of a girl carrying a bag, and the text to be retrieved is "a girl with long hair wearing a black coat, black pants, and carrying a red bag". The image alone is one retrieval data combination, the text alone is another retrieval data combination, and the image combined with the text is a third retrieval data combination. The result image sets corresponding to the different retrieval data combinations are merged to obtain the image retrieval results.
  • In the related art, the image retrieval results can also be obtained by conducting a one-to-one retrieval between the data to be retrieved and the candidate images.
  • the one-to-one retrieval means that the data to be retrieved and each candidate image are input into the retrieval model as a data pair.
  • the retrieval model outputs the matching probability between the data to be retrieved and the candidate images. Since there are multiple candidate images, one-to-one retrieval requires pairwise traversal retrieval, which increases the consumption of retrieval resources.
  • In contrast, the embodiment of the present application determines the first similarity between the candidate images and the data to be retrieved in each modality based on the first feature and the second feature, and determines the result image sets corresponding to multiple retrieval data combinations from the candidate image set based on the first similarity, which avoids pairwise traversal of the candidate images and effectively improves the efficiency of image retrieval.
  • In a possible implementation, when converting the data to be retrieved into a retrieval embedding vector, the data to be retrieved can be segmented to obtain multiple retrieval data blocks, and feature mapping is performed on the multiple retrieval data blocks to obtain a first embedding vector; the position information of each retrieval data block in the data to be retrieved is determined, and feature mapping is performed on the multiple pieces of position information to obtain a second embedding vector; feature mapping is performed on the modality corresponding to the data to be retrieved to obtain a third embedding vector; and the first embedding vector, the second embedding vector and the third embedding vector are spliced to obtain the retrieval embedding vector.
  • The first embedding vector and the second embedding vector, after splicing, are equivalent to the aforementioned information embedding vector. Obtaining the first embedding vector by segmenting the data to be retrieved, and obtaining the second embedding vector from the position information of each retrieval data block in the data to be retrieved, allows the information embedding vector to carry more information about the data to be retrieved, improving the accuracy of the information embedding vector.
  • The third embedding vector is equivalent to the aforementioned type embedding vector, which the target model uses to determine the current modality of the data to be retrieved.
  • Taking the text modality as an example, the output of the text encoder can be expressed as $t = \{[\mathrm{cls}], t_1, \ldots, t_M, [\mathrm{sep}]\}$, where $t$ represents the result obtained after encoding by the text encoder, $[\mathrm{cls}]$ represents the start marker, $[\mathrm{sep}]$ represents the end marker, $t_1, \ldots, t_M$ represent the text tokens, and $M$ is a positive integer. Pre-trained word embeddings can be used to map the result of the text encoder to symbol embedding vectors, giving the first embedding vector; the position information of each text token in the text to be retrieved is determined and feature-mapped to obtain the second embedding vector; and the text modality is feature-mapped to obtain the third embedding vector. Splicing these yields the retrieval embedding vector corresponding to the text to be retrieved, which can be expressed as $X_t = [f^t_1; f^t_2; f^t_3]$, where $X_t$ represents the retrieval embedding vector corresponding to the text to be retrieved and $f^t_1$, $f^t_2$, $f^t_3$ represent the first, second and third embedding vectors.
  • For the image modality, the output of the image encoder can be expressed as $v = \{[\mathrm{cls}], v_1, \ldots, v_N\}$, where $v$ represents the result obtained after encoding by the image encoder, $[\mathrm{cls}]$ represents the start marker, $v_1, \ldots, v_N$ represent the image blocks, and $N$ is a positive integer. A method similar to that for the text modality can be used: feature mapping is performed on the result obtained after encoding by the image encoder to obtain the first embedding vector; the position information of each image block in the image to be retrieved is determined and feature-mapped to obtain the second embedding vector; and the image modality is feature-mapped to obtain the third embedding vector. Splicing the first, second and third embedding vectors corresponding to the image to be retrieved yields its retrieval embedding vector, which can be expressed as $X_v = [f^v_1; f^v_2; f^v_3]$, where $X_v$ represents the retrieval embedding vector corresponding to the image to be retrieved.
  • For the voice modality, the output of the speech encoder can be expressed as $s = \{[\mathrm{cls}], s_1, \ldots, s_K, [\mathrm{sep}]\}$, where $s$ represents the result obtained after encoding by the speech encoder, $[\mathrm{cls}]$ represents the start marker, $[\mathrm{sep}]$ represents the end marker, $s_1, \ldots, s_K$ represent the speech frames, and $K$ is a positive integer. Again following the text-modality procedure, feature mapping is performed on the result obtained after encoding by the speech encoder to obtain the first embedding vector; the position information of each speech frame in the voice to be retrieved is determined and feature-mapped to obtain the second embedding vector; and the voice modality is feature-mapped to obtain the third embedding vector. Splicing these yields the retrieval embedding vector of the voice to be retrieved, which can be expressed as $X_s = [f^s_1; f^s_2; f^s_3]$, where $X_s$ represents the retrieval embedding vector corresponding to the voice to be retrieved.
  • The retrieval embedding vectors of the data to be retrieved in the different modalities above have the same vector format, which facilitates unifying the representation of multi-modal data to be retrieved within the same model framework, allows the target model to perform feature extraction on retrieval data of different modalities, and provides a basis for subsequently determining the target images corresponding to different retrieval data combinations from the multiple candidate images. Moreover, the results obtained by the different encoders can be mapped into different high-dimensional feature spaces, so that the obtained first embedding vector better matches the feature representation requirements of the corresponding modality, improving the accuracy and rationality of the first embedding vector.
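  • A minimal sketch of this construction for the text modality, assuming PyTorch; the vocabulary size, token ids, dimensions, and the choice to splice (concatenate) rather than sum the three embedding vectors are assumptions:

    import torch
    import torch.nn as nn

    d = 256
    word_emb = nn.Embedding(30522, d)  # pre-trained word embeddings (size assumed)
    pos_emb = nn.Embedding(512, d)     # position embeddings
    type_emb = nn.Embedding(3, d)      # modality type: 0 text, 1 image, 2 speech

    token_ids = torch.tensor([[101, 7592, 2088, 102]])        # [cls] ... [sep]
    positions = torch.arange(token_ids.size(1)).unsqueeze(0)

    f1 = word_emb(token_ids)                    # first embedding vector (content)
    f2 = pos_emb(positions)                     # second embedding vector (position)
    f3 = type_emb(torch.zeros_like(token_ids))  # third embedding vector (text type)
    X_t = torch.cat([f1, f2, f3], dim=-1)       # spliced: (1, M+2, 3d)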
  • Since the candidate images are image-modality data, feature mapping can be performed on a candidate image by referring to the aforementioned method for obtaining the retrieval embedding vector of the image to be retrieved, giving the embedding vector corresponding to the candidate image, which is then input into the target model for feature extraction.
  • Figure 5 is another optional structural diagram of a target model provided by an embodiment of the present application.
  • As shown in Figure 5, the target model can be provided with a first normalization layer, an attention layer, a second normalization layer, multiple feature extraction units and multiple modal alignment units. Based on the model structure shown in Figure 5, when feature extraction is performed on the data to be retrieved based on the target model, the retrieval embedding vector can first be normalized to obtain a first normalized vector; attention feature extraction is performed on the first normalized vector to obtain an attention vector; and feature mapping is performed on the attention vector based on the target model to obtain the first feature of the data to be retrieved.
  • The retrieval embedding vector can be normalized (Layer Normalization) through the first normalization layer, achieving data standardization of the retrieval embedding vector and improving the efficiency with which the target model processes it.
  • The attention layer can then be used to perform attention feature extraction on the first normalized vector, extracting the important information in it, so that the first feature of the data to be retrieved obtained after the subsequent feature mapping of the attention vector is more accurate.
  • the attention layer can use a multi-head attention (Multi-head Attention) mechanism to extract attention features from the first normalized vector.
  • The first normalization layer, the attention layer, the second normalization layer, the multiple feature extraction units and the multiple modal alignment units can form one overall processing module, and multiple such processing modules can be stacked in the target model. The output of each processing module serves as the input of the next, and the output of the last processing module is the final first feature, thereby improving the accuracy of the first feature.
  • When feature mapping is performed on the attention vector based on the target model, the attention vector and the retrieval embedding vector can be spliced to obtain a splicing vector; the splicing vector is normalized to obtain a second normalized vector; forward feature mapping is performed on the second normalized vector based on the target model to obtain a mapping vector; and the mapping vector and the splicing vector are spliced to obtain the first feature of the data to be retrieved.
  • Performing forward feature mapping on the second normalized vector based on the target model means performing forward feature mapping on the second normalized vector based on the corresponding feature extraction unit, which may include a feed-forward layer.
  • The splicing vector can be normalized through the second normalization layer, achieving data standardization of the splicing vector and improving processing efficiency. Splicing the mapping vector with the splicing vector lets the first feature carry the original information of the splicing vector, improving the accuracy of the first feature.
  • The principle of obtaining the second feature of the candidate image based on the target model is similar to that of obtaining the first feature of the data to be retrieved. The candidate image can likewise be converted into an image embedding vector with the same vector format as the retrieval embedding vector. The image embedding vector is normalized to obtain the first normalized vector corresponding to the candidate image; attention feature extraction is performed on it to obtain the attention vector corresponding to the candidate image; the attention vector and the image embedding vector are spliced to obtain the splicing vector corresponding to the candidate image; this splicing vector is normalized to obtain the second normalized vector corresponding to the candidate image; forward feature mapping is performed on it based on each modal alignment unit to obtain the mapping vectors corresponding to the candidate image; and each mapping vector is spliced with the splicing vector to obtain the second features of the candidate image aligned to the data to be retrieved in each modality. In this process, the same first normalization layer, attention layer and second normalization layer can be shared, with different feature extraction units or modal alignment units then called for feature extraction, thereby simplifying the structure of the target model.
  • In combination, the splicing vector can be expressed as:

    $\hat{X}^{(i)} = \mathrm{MHA}(\mathrm{LN}_1(X^{(i)})) + X^{(i)}$

    where $X^{(i)}$ represents the retrieval embedding vector input to the $i$-th processing module (that is, the first feature corresponding to the image or text to be retrieved output by the $(i-1)$-th processing module), $i$ is a positive integer, and when $i = 1$, $X^{(1)}$ represents the initial retrieval embedding vector of the image or text to be retrieved; $\mathrm{LN}_1$ denotes the first normalization layer, $\mathrm{MHA}$ the multi-head attention layer, and the splicing acts as a residual connection.
  • In combination, the first feature or the second feature can then be expressed as:

    $X^{(i+1)} = \mathrm{FFN}(\mathrm{LN}_2(\hat{X}^{(i)})) + \hat{X}^{(i)}$

    where $\mathrm{LN}_2$ denotes the second normalization layer and $\mathrm{FFN}$ the forward feature mapping of the corresponding feature extraction unit or modal alignment unit.
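  • A minimal sketch of one processing module matching the expressions above, assuming PyTorch; the dimensions, head count and per-modality feed-forward dictionary are assumptions, and the splices are implemented as residual additions:

    import torch
    import torch.nn as nn

    class ProcessingModule(nn.Module):
        def __init__(self, d=256, heads=8, modalities=("image", "text", "speech")):
            super().__init__()
            self.norm1 = nn.LayerNorm(d)  # first normalization layer
            self.attn = nn.MultiheadAttention(d, heads, batch_first=True)
            self.norm2 = nn.LayerNorm(d)  # second normalization layer
            # one feed-forward unit per modality: feature extraction or modal alignment
            self.ffn = nn.ModuleDict({
                m: nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))
                for m in modalities
            })

        def forward(self, x, modality):
            h = self.norm1(x)                # first normalized vector
            a, _ = self.attn(h, h, h)        # attention vector
            x = x + a                        # splicing vector
            h = self.norm2(x)                # second normalized vector
            return x + self.ffn[modality](h) # mapping vector spliced with splicing vector

    module = ProcessingModule()
    out = module(torch.randn(1, 16, 256), "text")  # (1, 16, 256)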
  • In some embodiments, before the candidate image set and the data to be retrieved in multiple modalities are obtained, the target model can first be trained. Specifically, a sample image and sample retrieval data in at least one modality other than the image modality are obtained, along with the similarity label between the sample image and the sample retrieval data; feature extraction is performed on the sample retrieval data based on the target model to obtain the third feature of the sample retrieval data, and feature extraction is performed on the sample image multiple times based on the target model to obtain the fourth features of the sample image after alignment with the sample retrieval data of each modality; the second similarity between the sample image and the sample retrieval data is determined based on the third feature and the fourth feature, and the first loss value is determined based on the second similarity and the corresponding similarity label; the parameters of the target model are adjusted according to the first loss value.
  • The sample retrieval data and the sample images are both used to train the target model. The sample retrieval data has a modality different from that of the sample images and can be, for example, sample text or sample speech.
  • the similarity label between the sample image and the sample retrieval data is used to indicate whether the sample image and the sample retrieval data match.
  • The similarity label can be "1" or "0". When the similarity label is "1", the sample retrieval data matches the corresponding sample image; for example, if the sample retrieval data is the sample text "boy carrying a schoolbag", the sample image is an image of a boy carrying a schoolbag. When the similarity label is "0", the sample retrieval data does not match the corresponding sample image; for example, if the sample text is "boy carrying a schoolbag", the sample image might be an image of a peony flower.
  • The principle of performing feature extraction on the sample retrieval data based on the target model to obtain the third feature of the sample retrieval data is similar to that of performing feature extraction on the data to be retrieved based on the target model to obtain the first feature of the data to be retrieved, and will not be described again here.
  • Likewise, performing feature extraction on the sample image multiple times based on the target model to obtain the fourth features of the sample image after alignment with the sample retrieval data of each modality follows the same principle as performing feature extraction on the candidate image multiple times based on the target model to obtain the second features of the candidate image after alignment with the data to be retrieved in each modality, and will not be described again here.
  • the calculation method of the second similarity is similar to that of the first similarity, which will not be described again here.
  • In a possible implementation, the first loss value can be determined based on the second similarity and the corresponding similarity label, and can be expressed as:

    $L_1 = -\frac{1}{B}\sum_{i=1}^{B}\sum_{j=1}^{B} q_{i,j}\,\log p_{i,j}$, with $p_{i,j} = \frac{\exp(s_{i,j})}{\sum_{k=1}^{B}\exp(s_{i,k})}$ and $q_{i,j} = \frac{y_{i,j}}{\sum_{k=1}^{B} y_{i,k} + \epsilon}$

  • Here $L_1$ represents the first loss value; $B$ represents the number of sample pairs composed of sample retrieval data and sample images; $i$ indexes the $i$-th sample image and $j$ the $j$-th sample retrieval data, with $i$ and $j$ both positive integers; $s_{i,j}$ denotes the second similarity between the $i$-th sample image and the $j$-th sample retrieval data, computed from the fourth feature of the sample image and the third feature $f_j$ of the sample retrieval data ($f_k$ being the third feature of the $k$-th sample retrieval data); $p_{i,j}$ represents the normalized probability value of the second similarity; $q_{i,j}$ represents the normalized probability value of the similarity label; $\epsilon$ represents a small floating point number used for numerical stability (for example, to prevent the denominator from being 0); $y_{i,j}$ represents the similarity label between the $i$-th sample image and the $j$-th sample retrieval data, and $y_{i,k}$ the similarity label between the $i$-th sample image and the $k$-th sample retrieval data.
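  • Assuming PyTorch, a compact sketch of the first loss as reconstructed above; the function name first_loss is illustrative:

    import torch

    def first_loss(sim: torch.Tensor, labels: torch.Tensor, eps: float = 1e-8):
        # sim: (B, B) second similarities, rows = sample images, cols = sample retrieval data
        # labels: (B, B) 0/1 similarity labels y_{i,j}
        p = torch.softmax(sim, dim=1)                         # p_{i,j}
        q = labels / (labels.sum(dim=1, keepdim=True) + eps)  # q_{i,j}
        return -(q * torch.log(p + eps)).sum(dim=1).mean()    # L_1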
  • Adjusting the parameters of the target model according to the first loss value may mean adjusting the parameters of the modal alignment units and the corresponding feature extraction units in the target model. This achieves joint training between each modal alignment unit and the feature extraction unit of the corresponding modality, which can effectively improve the alignment between the features extracted by the modal alignment unit and those extracted by the feature extraction unit of the corresponding modality, and improves the training efficiency of the target model.
  • In some embodiments, when the target model is equipped with an image feature extraction unit and the image feature extraction unit is reused (that is, the image feature extraction unit is used both for feature extraction of the image to be retrieved and for feature extraction of the candidate images), the parameters of the target model can be adjusted according to the first loss value as follows: obtain the category label of the sample image; perform feature extraction on the sample image based on the target model to obtain the fifth feature of the sample image corresponding to the image modality; classify the sample image according to the fifth feature to obtain the sample category, and determine the second loss value according to the sample category and the category label; adjust the parameters of the target model according to the first loss value and the second loss value.
  • The category label of the sample image indicates the category of the sample image. For example, if the sample image is an image of a dog, the category label of the sample image can be "animal", or it can also be "dog", and so on.
  • Performing feature extraction on the sample image based on the target model may be performing feature extraction based on the image feature extraction unit to obtain the fifth feature of the sample image corresponding to the image modality. After the fifth feature of the sample image is obtained, it can be input to a classifier to classify the sample image and obtain the sample category, and the second loss value is then determined based on the sample category and the category label.
  • The second loss value can be specifically expressed as:

  L_2 = − Σ_{x=1}^{m} p(x) · log q(x)

  • L_2 represents the second loss value;
  • p(x) represents the probability distribution corresponding to the category label;
  • q(x) represents the probability distribution corresponding to the sample category;
  • x represents the index of the sample image's category, m represents the total number of categories of the sample images, and x and m are both positive integers.
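  • For instance, the following is a minimal sketch of such a classification branch; the classification head, the feature dimension and the batch shapes are illustrative assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

B, D, m = 32, 512, 100        # batch size, feature dimension, total categories (assumed)
classifier = nn.Linear(D, m)  # hypothetical classifier on top of the fifth feature

def second_loss(fifth_feats, category_labels):
    """Cross-entropy between the category-label distribution p(x) (one-hot)
    and the predicted sample-category distribution q(x)."""
    logits = classifier(fifth_feats)   # (B, m): one score per category x
    return F.cross_entropy(logits, category_labels)

loss2 = second_loss(torch.randn(B, D), torch.randint(0, m, (B,)))
```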
  • Adjusting the parameters of the target model according to the first loss value and the second loss value may be adjusting the parameters of the target model separately according to the first loss value and the second loss value, or the first loss value and the second loss value may be weighted to obtain a total loss value, and the parameters of the target model adjusted according to the total loss value.
  • In this way, image classification is introduced to adjust the parameters of the image feature extraction unit, so that training methods from other scenarios can be used to adjust the parameters of the image feature extraction unit and improve the generalization ability of the image feature extraction unit.
  • In addition, when adjusting the parameters of the target model according to the first loss value, a first reference image of the same category as the sample image and a second reference image of a different category from the sample image can be obtained; feature extraction is performed on the sample image, the first reference image and the second reference image based on the target model to obtain the fifth feature of the sample image corresponding to the image modality, the sixth feature of the first reference image, and the seventh feature of the second reference image; the third similarity between the fifth feature and the sixth feature and the fourth similarity between the fifth feature and the seventh feature are determined; the third loss value is determined based on the third similarity and the fourth similarity; and the parameters of the target model are adjusted based on the first loss value and the third loss value.
  • the number of sample images may be multiple.
  • The first reference image and the second reference image may be images from the multiple sample images, or may be images other than the multiple sample images.
  • Feature extraction is performed on the sample image, the first reference image and the second reference image based on the target model, that is, based on the image feature extraction unit. Since the first reference image is of the same category as the sample image, normally the third similarity should be higher; similarly, since the second reference image is of a different category from the sample image, normally the fourth similarity should be lower.
  • The third loss value can be expressed as:

  L_3 = max( d_AP − d_AN + α, 0 )

  • L_3 represents the third loss value;
  • d_AP represents the third similarity and d_AN represents the fourth similarity, both expressed as feature distances, so that a higher similarity corresponds to a smaller distance;
  • α represents a hyperparameter (the margin).
  • Adjusting the parameters of the target model according to the first loss value and the third loss value may be adjusting the parameters of the target model separately according to the first loss value and the third loss value, or the first loss value and the third loss value may be weighted to obtain a total loss value, and the parameters of the target model adjusted according to the total loss value.
  • In this way, the distance between images of the same category becomes closer and the distance between images of different categories becomes farther, so that the features extracted by the image feature extraction unit are more accurate.
  • In addition, the first loss value, the second loss value and the third loss value can also be weighted together to obtain a total loss value L_total, and the parameters of the target model adjusted according to the total loss value, where L_total represents the total loss value.
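  • As a sketch of these two steps, the following uses Euclidean distance as the distance measure; the margin value and the specific weighting of the total loss are assumptions of the example:

```python
import torch

def third_loss(anchor, positive, negative, margin=0.3):
    """Margin-based loss L3 = max(d_AP - d_AN + margin, 0): d_AP is the
    distance between the fifth feature (sample image) and the sixth feature
    (same-category first reference image), d_AN the distance to the seventh
    feature (different-category second reference image)."""
    d_ap = torch.norm(anchor - positive, dim=1)   # smaller when the third similarity is high
    d_an = torch.norm(anchor - negative, dim=1)   # larger when the fourth similarity is low
    return torch.clamp(d_ap - d_an + margin, min=0).mean()

def total_loss(l1, l2, l3, lambda1=1.0, lambda2=1.0):
    """One possible weighting of the three loss values into a total loss."""
    return l1 + lambda1 * l2 + lambda2 * l3
```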
  • When the target model is provided with an image feature extraction unit and the image feature extraction unit is reused, introducing the first loss value, the second loss value and the third loss value at the same time allows targeted training of the image feature extraction unit and the modality alignment units, which is conducive to improving the training effect.
  • The following takes the target model performing image retrieval based on the text to be retrieved and the image to be retrieved as an example to illustrate the training process of the target model.
  • Figure 6 is a schematic diagram of an optional training process of the target model provided by an embodiment of the present application. Specifically, a sample image set and a sample text set can be obtained and input to the target model, and the first loss value and the second loss value are calculated; the first reference image and the second reference image of each sample image are determined from the sample image set, and the third loss value is calculated based on the similarity between the sample image and the first reference image and the similarity between the sample image and the second reference image; finally, the total loss value is obtained based on the sum of the first loss value, the second loss value and the third loss value, and the parameters of the target model are adjusted according to the total loss value.
  • When training the target model, if the sample retrieval data includes sample text, the training samples of the target model can be expanded to improve the training effect.
  • Figure 7 is an optional flow chart for expanding training samples provided by an embodiment of the present application.
  • The initial image and the initial text may exist in pairs, the number of data pairs composed of an initial image and an initial text may be multiple, and the data pairs composed of the initial image and the initial text may be labeled with category labels.
  • the enhanced image can be obtained by performing enhancement processing on the initial image.
  • The enhancement processing includes but is not limited to one process or a combination of multiple processes such as enlarging, reducing, cropping, flipping, color gamut conversion, and color dithering.
  • text components of any length in the initial text can be deleted to obtain enhanced text.
  • The text components can be words, sentences or paragraphs. For example, if the initial text is "This man is wearing a black and gray down jacket and a pair of light-colored pants, and he has a dark green backpack", the enhanced text can be "This man is wearing a black and gray down jacket, and he has a dark green backpack", or the enhanced text can also be "This man is wearing a black and gray down jacket and a pair of light-colored pants", and so on.
  • the text components in the reference text can also be used to adjust the text components in the initial text to obtain enhanced text, where the reference text is of the same category as the initial text.
  • Specifically, the category labels can be used to determine a reference text for the current initial text from the remaining initial texts in the training data set, and the text components in the reference text can then be used to adjust the text components in the initial text, either by using text components in the reference text to replace text components in the initial text, or by adding text components from the reference text to the text components of the initial text.
  • For example, if the initial text is "This man is wearing a black and gray down jacket and a pair of light-colored trousers, and he has a dark green backpack" and the reference text is "A man has black hair, he is wearing a gray shirt, gray trousers and gray canvas shoes, carrying a bag", the enhanced text can be "This person is wearing a black and gray down jacket, gray trousers and gray canvas shoes, and he has a dark green backpack", or the enhanced text can also be "This man is wearing a black and gray down jacket and a pair of light-colored trousers, he has black hair, and he has a dark green backpack", and so on.
  • After the enhanced image and enhanced text are obtained, the enhanced image and enhanced text can be used to train the target model.
  • The initial image and enhanced text, the enhanced image and the initial text, and the enhanced image and enhanced text can all form new data pairs, which makes the training data of the target model more diverse; especially when adjusting the parameters of the modality alignment unit, this can significantly improve the performance of the modality alignment unit.
  • Similarly, when the sample retrieval data includes sample speech, processing such as acceleration, deceleration, speech frame replacement, speech frame deletion and noise addition can be used to obtain enhanced speech, and the initial speech and enhanced speech can be used to train the target model, as sketched below.
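  • A minimal sketch of such augmentation is shown below; it assumes torchvision is available for the image processing, and splitting on commas stands in for real text-component segmentation:

```python
import random
from torchvision import transforms

# Image enhancement: a random resized crop approximates enlarging/reducing/
# cropping, plus flipping and color jitter (color dithering).
image_enhance = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.8, 1.0)),
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2),
])

def enhance_text(initial_text, reference_text=None):
    """Delete a random text component, or splice in a component from a
    same-category reference text when one is given."""
    parts = [p.strip() for p in initial_text.split(",") if p.strip()]
    if reference_text is None:
        if len(parts) > 1:
            parts.pop(random.randrange(len(parts)))     # component deletion
    else:
        ref_parts = [p.strip() for p in reference_text.split(",") if p.strip()]
        parts.append(random.choice(ref_parts))          # component addition
    return ", ".join(parts)
```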
  • In addition, when using the target model for image retrieval, the performance of the target model can be further verified, for example by means of the Cumulative Matching Characteristic (CMC) and the mean Average Precision (mAP). Specifically, the cumulative matching characteristic and the average precision can be calculated based on the target similarities in multiple modalities, so that the performance of the target model can be verified from different dimensions; when the cumulative matching characteristic and the average precision do not reach a preset threshold, the parameters of the target model can be adjusted again.
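  • For reference, the following is a minimal sketch of computing CMC Rank-k and mAP from a query-to-gallery distance matrix; the use of identity labels to decide whether a gallery image is a correct match is an assumption of the example:

```python
import numpy as np

def cmc_and_map(dist, query_ids, gallery_ids, topk=(1, 5, 10)):
    """dist: (num_query, num_gallery) distance matrix, smaller = more similar."""
    ranks = np.zeros(len(topk))
    aps, valid = [], 0
    for i in range(dist.shape[0]):
        order = np.argsort(dist[i])                          # ascending distance
        matches = (gallery_ids[order] == query_ids[i]).astype(np.float32)
        if matches.sum() == 0:
            continue                                         # no ground truth: skip query
        valid += 1
        first_hit = int(np.argmax(matches))                  # rank of first correct match
        for t, k in enumerate(topk):
            ranks[t] += first_hit < k
        hits = np.cumsum(matches)                            # average precision per query
        precision = hits / (np.arange(len(matches)) + 1)
        aps.append((precision * matches).sum() / matches.sum())
    return ranks / max(valid, 1), float(np.mean(aps))
```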
  • the following uses the CUHK-PEDES and RSTP data sets as examples to illustrate the performance of the target model in the image retrieval method provided by the embodiment of the present application.
  • Table 1 is the evaluation effect data of different image retrieval methods on the CUHK-PEDES data set provided by the embodiment of the present application.
  • Table 2 is the evaluation effect data of different image retrieval methods on the RSTP data set provided by the embodiment of the present application.
  • Among them, Rank-1, Rank-5 and Rank-10 are evaluation indicators of CMC. It can be seen from Table 1 and Table 2 that the image retrieval method provided by this application has higher accuracy than other image retrieval methods in the related art, and in the above data the image retrieval method provided by this application only uses global features.
  • Table 3 shows the evaluation effect data of different image retrieval methods using text for image retrieval provided by the embodiments of the present application.
  • Table 4 shows the evaluation effect data of different image retrieval methods provided by the embodiments of the present application when using images for image retrieval. Among them, R1 is the abbreviation of Rank-1, R5 is the abbreviation of Rank-5, and R10 is the abbreviation of Rank-10. It can be seen from Table 3 and Table 4 that, when evaluated separately on image retrieval using text and on image retrieval using images, the accuracy of the image retrieval method provided by this application is also higher than that of other image retrieval methods in the related art.
  • Table 5 is the evaluation effect data of image retrieval using text, image retrieval using images, and image retrieval using text combined with images in the image retrieval method provided by the embodiment of the present application.
  • It can be seen that using text combined with images to perform image retrieval achieves higher accuracy. Therefore, this application fuses the similarities corresponding to the data to be retrieved in different modalities, thereby achieving image retrieval that combines the data to be retrieved in different modalities, which can significantly improve the accuracy of image retrieval.
  • Figure 8 is a schematic diagram of an optional overall architecture of a target model provided by an embodiment of the present application.
  • The target model is provided with a first normalization layer, an attention layer, a second normalization layer, an image feature extraction unit, a text modality alignment unit and a text feature extraction unit.
  • the input is multiple data pairs composed of sample text and sample images.
  • The sample text of one of the input data pairs can be "This person is wearing a pair of glasses. He is wearing a black and gray down jacket and a pair of light-colored pants. He has a pair of light shoes, and he has a dark green backpack", and the input sample image is a person image; then in-class text and image augmentation processing is performed.
  • For the sample image, random enhancement processing can be performed, that is, one or more processing methods are randomly selected from enlarging, reducing, cropping, flipping, color gamut transformation, color dithering, etc. to process the sample image, and the text components of the sample text are adjusted.
  • The initial and enhanced images and texts can form new data pairs, thereby expanding the training data of the target model. The data pairs are then encoded to obtain image embedding vectors and text embedding vectors, which are input to the target model. After the normalization processing of the first normalization layer, the attention feature extraction processing of the attention layer, and the normalization processing of the second normalization layer, the image normalized vector and the text normalized vector are obtained. Then, according to the corresponding input type, the image normalized vector is forward mapped through the image feature extraction unit to obtain the image features of the sample image itself; the text normalized vector is forward mapped through the text feature extraction unit to obtain the text features of the sample text; and the image normalized vector is forward mapped through the text modality alignment unit to obtain the image features of the sample image after aligning to the sample text. Then, the first loss value is calculated based on the text features of the sample text and the image features of the sample image after aligning to the sample text, the second loss value and the third loss value are calculated based on the image features of the sample image itself, and the parameters of the target model are adjusted according to the first loss value, the second loss value and the third loss value.
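  • The following is a minimal sketch of one such processing module: a shared first normalization layer, attention layer and second normalization layer, followed by one forward-mapping head per input type (the feature extraction units and the modality alignment unit). Dimensions, head names and the residual wiring are illustrative assumptions:

```python
import torch
import torch.nn as nn

class SharedBlock(nn.Module):
    def __init__(self, dim=512, heads=8, types=("image", "text", "img2text")):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)   # first normalization layer
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)   # second normalization layer
        # "image": image feature extraction unit; "text": text feature
        # extraction unit; "img2text": text modality alignment unit that maps
        # image inputs toward the text feature space.
        self.heads = nn.ModuleDict({
            t: nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for t in types
        })

    def forward(self, x, input_type):
        h = self.norm1(x)
        a, _ = self.attn(h, h, h)                          # attention feature extraction
        x = x + a                                          # splice attention output with input
        return x + self.heads[input_type](self.norm2(x))   # forward mapping per input type

# Usage: the same block handles every modality, selected by input type.
block = SharedBlock()
img_tokens = torch.randn(2, 50, 512)          # image embedding vectors
img_self = block(img_tokens, "image")         # image features of the image itself
img_to_text = block(img_tokens, "img2text")   # image features aligned to text
```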
  • In the retrieval stage, the input is the data pair <v_q, t_q> composed of the image to be retrieved and the text to be retrieved, together with the candidate images <v_g> in the candidate image data set. The features of the image to be retrieved v_q and the features of the candidate images v_g are extracted through the image feature extraction unit of the target model; the features of the text to be retrieved t_q are extracted through the text feature extraction unit of the target model; and the features of the candidate images v_g aligned to the text to be retrieved t_q are extracted through the text modality alignment unit. The Euclidean distance matrix D_i2i between the image to be retrieved v_q and the candidate images v_g is calculated, the result image set corresponding to the image to be retrieved v_q is determined from the candidate image data set according to D_i2i, and the corresponding CMC_i2i and mAP_i2i are calculated according to D_i2i; likewise, the Euclidean distance matrix between the text to be retrieved t_q and the aligned candidate image features is calculated and used to determine the result image set corresponding to the text to be retrieved t_q. The fused Euclidean distance matrix D_ti2i determines the result image set corresponding to the data pair <v_q, t_q> from the candidate image data set, and the corresponding CMC_ti2i and mAP_ti2i are calculated based on D_ti2i. Finally, the result image set corresponding to the text to be retrieved t_q, the result image set corresponding to the image to be retrieved v_q, and the result image set corresponding to the data pair <v_q, t_q> are merged to obtain the image retrieval result.
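  • A minimal sketch of this distance computation, fusion and merging is shown below; the feature names, gallery size and fusion weight are illustrative assumptions:

```python
import torch

f_vq   = torch.randn(1, 512)      # features of the image to be retrieved v_q
f_tq   = torch.randn(1, 512)      # features of the text to be retrieved t_q
f_vg   = torch.randn(1000, 512)   # features of the candidate images v_g
f_vg2t = torch.randn(1000, 512)   # candidate features aligned to the text modality

D_i2i = torch.cdist(f_vq, f_vg)      # Euclidean distance matrix, image query
D_t2i = torch.cdist(f_tq, f_vg2t)    # Euclidean distance matrix, text query
lam = 0.5                            # fusion weight value (assumed)
D_ti2i = lam * D_t2i + (1.0 - lam) * D_i2i   # fused matrix for the pair <v_q, t_q>

topk = 10                            # size of each result image set (assumed)
result_i2i  = D_i2i.topk(topk, largest=False).indices
result_t2i  = D_t2i.topk(topk, largest=False).indices
result_ti2i = D_ti2i.topk(topk, largest=False).indices
# Merging the result image sets (a deduplicated union of candidate indices)
# yields the final image retrieval result for this query.
retrieval_result = torch.unique(torch.cat([result_i2i, result_t2i, result_ti2i], dim=1))
```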
  • Figure 9 is a schematic diagram of another optional overall architecture of the target model provided by an embodiment of the present application, in which the target model is provided with a first normalization layer, an attention layer, a second normalization layer, an image feature extraction unit, a speech modality alignment unit and a speech feature extraction unit.
  • the input is multiple data pairs composed of sample voices and sample images.
  • The input sample images are person images, and the input sample voices are voices describing the persons in the sample images; then in-class voice and image augmentation processing is performed.
  • For the sample image, random enhancement processing can be performed, that is, one or more processing methods are randomly selected from enlarging, reducing, cropping, flipping, color gamut conversion, color dithering, etc. to process the sample image; for the sample voice, random enhancement processing can also be performed, that is, one or more processing methods are randomly selected from acceleration, deceleration, speech frame replacement, speech frame deletion, noise addition, etc. to process the sample voice.
  • The initial and enhanced images and voices can form new data pairs, thereby expanding the training data of the target model; the data pairs are then encoded to obtain image embedding vectors and speech embedding vectors, and the image embedding vectors and speech embedding vectors are input to the target model.
  • After the normalization processing of the first normalization layer, the attention feature extraction processing of the attention layer, and the normalization processing of the second normalization layer, the image normalized vector and the speech normalized vector are obtained. Then, according to the corresponding input type, the image feature extraction unit performs forward mapping on the image normalized vector to obtain the image features of the sample image itself; the speech feature extraction unit performs forward mapping on the speech normalized vector to obtain the speech features of the sample voice; and the speech modality alignment unit performs forward mapping on the image normalized vector to obtain the image features of the sample image after aligning to the sample speech. Then, the first loss value is calculated based on the speech features of the sample speech and the image features of the sample image after aligning to the sample speech, the second loss value and the third loss value are calculated based on the image features of the sample image itself, and the parameters of the target model are adjusted according to the first loss value, the second loss value and the third loss value.
  • In the retrieval stage, the input is the data pair <v_q, s_q> composed of the image to be retrieved and the voice to be retrieved, together with the candidate images <v_g> in the candidate image data set. The features of the image to be retrieved v_q and the features of the candidate images v_g are extracted through the image feature extraction unit of the target model; the features of the voice to be retrieved s_q are extracted through the speech feature extraction unit of the target model; and the features of the candidate images v_g aligned to the voice to be retrieved s_q are extracted through the speech modality alignment unit. The Euclidean distance matrix D_i2i between the image to be retrieved v_q and the candidate images v_g is calculated, the result image set corresponding to the image to be retrieved v_q is determined from the candidate image data set according to D_i2i, and the corresponding CMC_i2i and mAP_i2i are calculated according to D_i2i; likewise, the Euclidean distance matrix between the voice to be retrieved s_q and the aligned candidate image features is calculated and used to determine the result image set corresponding to the voice to be retrieved s_q. The fused Euclidean distance matrix D_si2i determines the result image set corresponding to the data pair <v_q, s_q> from the candidate image data set, and the corresponding CMC_si2i and mAP_si2i are calculated according to D_si2i. Finally, the result image set corresponding to the voice to be retrieved s_q, the result image set corresponding to the image to be retrieved v_q, and the result image set corresponding to the data pair <v_q, s_q> are merged to obtain the image retrieval result.
  • Figure 10 is a schematic diagram of another optional overall architecture of the target model provided by an embodiment of the present application, in which the target model is provided with a first normalization layer, an attention layer, a second normalization layer, a text feature extraction unit, a text modality alignment unit, a speech modality alignment unit and a speech feature extraction unit.
  • the input is multiple data pairs composed of sample voices and sample texts, as well as sample images.
  • the input sample text can refer to the description in the example shown in Figure 7, which will not be repeated here.
  • The input sample voice is a voice describing the person in the sample text; then in-class voice, text and image augmentation processing is performed.
  • The data pairs and sample images are encoded to obtain the text embedding vectors, speech embedding vectors and image embedding vectors, and the text embedding vectors, speech embedding vectors and image embedding vectors are input to the target model.
  • After the normalization processing of the first normalization layer, the attention feature extraction processing of the attention layer, and the normalization processing of the second normalization layer, the text normalized vector, the speech normalized vector and the image normalized vector are obtained. Then, according to the corresponding input type, the text normalized vector is forward mapped through the text feature extraction unit to obtain the text features of the sample text; the speech normalized vector is forward mapped through the speech feature extraction unit to obtain the speech features of the sample voice; the image normalized vector is forward mapped through the speech modality alignment unit to obtain the image features of the sample image after aligning to the sample speech; and the image normalized vector is forward mapped through the text modality alignment unit to obtain the image features of the sample image after aligning to the sample text. Then, the first loss value is calculated based on the speech features of the sample speech, the image features of the sample image after aligning to the sample speech, the text features of the sample text, and the image features of the sample image after aligning to the sample text, and the parameters of the target model are adjusted according to the first loss value.
  • In the retrieval stage, the input is the data pair <t_q, s_q> composed of the text to be retrieved and the speech to be retrieved, together with the candidate images <v_g> in the candidate image data set. The features of the text to be retrieved t_q are extracted through the text feature extraction unit of the target model; the features of the speech to be retrieved s_q are extracted through the speech feature extraction unit of the target model; the features of the candidate images v_g aligned to the speech to be retrieved s_q are extracted through the speech modality alignment unit; and the features of the candidate images v_g aligned to the text to be retrieved t_q are extracted through the text modality alignment unit. Euclidean distance matrices between these query features and the corresponding aligned candidate image features are calculated and used to determine the result image sets corresponding to the text to be retrieved t_q and to the speech to be retrieved s_q; the fused Euclidean distance matrix D_st2i determines the result image set corresponding to the data pair <t_q, s_q> from the candidate image data set, and the corresponding CMC_st2i and mAP_st2i are calculated according to D_st2i. The result image sets are then merged to obtain the image retrieval result.
  • Figure 11 is a schematic diagram of another optional overall architecture of the target model provided by an embodiment of the present application, in which the target model is provided with a first normalization layer, an attention layer, a second normalization layer, an image feature extraction unit, a text feature extraction unit, a text modality alignment unit, a speech modality alignment unit and a speech feature extraction unit.
  • the input is multiple data pairs consisting of sample speech, sample images and sample text.
  • the input sample text can refer to the description in the example shown in Figure 7, which will not be repeated here.
  • The input sample image is a person image, and the input sample voice is a voice describing the person in the sample text; then in-class image, voice and text augmentation processing is performed.
  • The data pairs are encoded to obtain the text embedding vectors, speech embedding vectors and image embedding vectors, which are input to the target model. After the normalization processing of the first normalization layer, the attention feature extraction processing of the attention layer, and the normalization processing of the second normalization layer, the text normalized vector, the speech normalized vector and the image normalized vector are obtained. Then, according to the corresponding input type, the image normalized vector is forward mapped through the image feature extraction unit to obtain the image features of the sample image itself; the text normalized vector is forward mapped through the text feature extraction unit to obtain the text features of the sample text; the speech normalized vector is forward mapped through the speech feature extraction unit to obtain the speech features of the sample voice; the image normalized vector is forward mapped through the speech modality alignment unit to obtain the image features of the sample image after aligning to the sample speech; and the image normalized vector is forward mapped through the text modality alignment unit to obtain the image features of the sample image after aligning to the sample text. Then, the first loss value is calculated based on the speech features of the sample speech, the image features of the sample image after aligning to the sample speech, the text features of the sample text, and the image features of the sample image after aligning to the sample text; the second loss value and the third loss value are calculated based on the image features of the sample image itself; and the parameters of the target model are adjusted based on the first loss value, the second loss value and the third loss value.
  • In the retrieval stage, the input is the data pair <v_q, t_q, s_q> consisting of the image to be retrieved, the text to be retrieved and the voice to be retrieved, together with the candidate images <v_g> in the candidate image data set. The features of the image to be retrieved v_q and the features of the candidate images v_g are extracted through the image feature extraction unit of the target model; the features of the text to be retrieved t_q are extracted through the text feature extraction unit of the target model; the features of the speech to be retrieved s_q are extracted through the speech feature extraction unit of the target model; the features of the candidate images v_g aligned to the speech to be retrieved s_q are extracted through the speech modality alignment unit; and the features of the candidate images v_g aligned to the text to be retrieved t_q are extracted through the text modality alignment unit.
  • The Euclidean distance matrix D_i2i between the image to be retrieved v_q and the candidate images v_g is calculated, the result image set corresponding to the image to be retrieved v_q is determined from the candidate image data set according to D_i2i, and the corresponding CMC_i2i and mAP_i2i are calculated according to D_i2i.
  • The fused Euclidean distance matrix D_ti2i determines the result image set corresponding to the data pair <v_q, t_q> from the candidate image data set, and the corresponding CMC_ti2i and mAP_ti2i are calculated based on D_ti2i.
  • The fused Euclidean distance matrix D_si2i determines the result image set corresponding to the data pair <v_q, s_q> from the candidate image data set, and the corresponding CMC_si2i and mAP_si2i are calculated according to D_si2i.
  • The fused Euclidean distance matrix D_st2i determines the result image set corresponding to the data pair <t_q, s_q> from the candidate image data set, and the corresponding CMC_st2i and mAP_st2i are calculated according to D_st2i.
  • Finally, the result image set corresponding to the image to be retrieved v_q, the result image set corresponding to the speech to be retrieved s_q, the result image set corresponding to the text to be retrieved t_q, the result image set corresponding to the data pair <v_q, t_q>, the result image set corresponding to the data pair <v_q, s_q>, the result image set corresponding to the data pair <t_q, s_q>, and the result image set corresponding to the data pair <v_q, t_q, s_q> are merged to obtain the image retrieval result.
  • In the above, λ, λ_1 and λ_2 represent weight values.
  • FIG. 12 is a schematic flow chart of using a search engine to perform image retrieval provided by an embodiment of the present application.
  • the terminal displays a search engine search interface 1201.
  • the search engine search interface 1201 displays a first text input box 1202 for inputting text to be retrieved and a first image input control 1203 for inputting an image to be retrieved.
  • The terminal sends the text to be retrieved input from the first text input box 1202 and the image to be retrieved input from the first image input control 1203 to the server.
  • The server uses the aforementioned image retrieval method to determine the image retrieval result from a preset image database and sends it to the terminal, and the image retrieval result is displayed on the search engine search interface 1201.
  • Figure 13 is a schematic flow chart of image retrieval in the photo application provided by the embodiment of the present application.
  • The terminal displays the photo search interface 1301 of the photo application.
  • the photo search interface 1301 displays a second text input box 1302 for inputting text to be retrieved and a second image input control 1303 for inputting an image to be retrieved.
  • The terminal obtains the text to be retrieved input from the second text input box 1302 and the image to be retrieved input from the second image input control 1303, determines the image retrieval result from the terminal's own photo database using the aforementioned image retrieval method, and displays the result on the photo search interface 1301.
  • Figure 14 is an optional structural schematic diagram of an image retrieval device provided by an embodiment of the present application.
  • the image retrieval device 1400 may be applicable to the aforementioned electronic devices.
  • the image retrieval device 1400 includes:
  • The data acquisition module 1401 is used to acquire a candidate image set and data to be retrieved in multiple modalities, where the candidate image set includes multiple candidate images;
  • The model processing module 1402 is used to perform feature extraction on the data to be retrieved based on the target model to obtain the first features of the data to be retrieved, and to perform multiple feature extractions on the candidate images based on the target model to obtain the second features of the candidate images after being aligned to the data to be retrieved in various modalities;
  • The retrieval module 1403 is configured to determine the first similarities between the candidate images and the data to be retrieved in various modalities based on the first features and the second features, and to determine the result image sets corresponding to multiple retrieval data combinations from the candidate image set based on the first similarities, wherein a retrieval data combination includes data to be retrieved in at least one modality;
  • The merging module 1404 is used to merge the multiple result image sets to obtain the image retrieval result.
  • model processing module 1402 is specifically used to:
  • the retrieval embedding vector is input into the target model, and feature mapping is performed on the data to be retrieved based on the target model to obtain the first feature of the data to be retrieved.
  • model processing module 1402 is specifically used to:
  • the first embedding vector, the second embedding vector and the third embedding vector are spliced to obtain a retrieval embedding vector.
  • model processing module 1402 is specifically used to:
  • Feature mapping is performed on the attention vector based on the target model to obtain the first feature of the data to be retrieved.
  • model processing module 1402 is specifically used to:
  • The mapping vector and the splicing vector are spliced to obtain the first feature of the data to be retrieved.
  • the data to be retrieved in multiple modalities includes text to be retrieved and images to be retrieved.
  • The target model includes a text modality alignment unit for aligning the candidate images to the text to be retrieved, and an image feature extraction unit for feature extraction of the image to be retrieved; the above-mentioned model processing module 1402 is specifically used for:
  • Feature extraction is performed on the candidate image based on the text modality alignment unit to obtain the second feature that aligns the candidate image to the text to be retrieved;
  • Features are extracted from the candidate image based on the image feature extraction unit to obtain image features of the candidate image, and the image features are used as second features after the candidate image is aligned with the image to be retrieved.
  • the multiple retrieval data combinations include a first data combination and a second data combination.
  • the first data combination includes data to be retrieved in one modality
  • the second data combination includes data to be retrieved in multiple modalities.
  • The above-mentioned retrieval module 1403 is specifically used for:
  • the first similarities corresponding to the data to be retrieved in multiple modalities are fused to obtain the target similarity, and the result image set corresponding to the second data combination is determined from the candidate image set according to the target similarity.
  • the above-mentioned image retrieval device also includes a training module 1405.
  • the above-mentioned training module 1405 is used for:
  • feature extraction is performed on the sample retrieval data to obtain the third feature of the sample retrieval data.
  • multiple feature extractions are performed on the sample image to obtain the fourth feature after the sample image is aligned with the sample retrieval data of various modalities. ;
  • training module 1405 is specifically used for:
  • Delete text components of any length in the initial text to obtain enhanced text or use text components in the reference text to adjust text components in the initial text to obtain enhanced text, where the reference text is of the same category as the initial text;
  • The above-mentioned image retrieval device 1400 is based on the same inventive concept as the aforementioned image retrieval method. The image retrieval device 1400 performs feature extraction on the data to be retrieved through the target model to obtain the first features of the data to be retrieved, and then uses the same target model to perform multiple feature extractions on the candidate images to obtain the second features of the candidate images after being aligned to the data to be retrieved in various modalities. This can not only use the data to be retrieved in multiple modalities to improve the accuracy of image retrieval, but also unify the feature framework of the data to be retrieved in multiple modalities and of the candidate images, improving the feature space consistency between the first features and the second features. Using the same target model to determine the first features and the second features can also reduce the number of parameters of the target model and thus the memory overhead of deploying it; in addition, during the training phase, only the same target model needs to be trained, which improves model training efficiency. On this basis, the first similarities between the candidate images and the data to be retrieved in various modalities are determined based on the first features and the second features, the result image sets corresponding to multiple retrieval data combinations are determined from the candidate image set based on the first similarities, and the multiple result image sets are merged to obtain the image retrieval result. Since the image retrieval result is obtained based on the result image sets corresponding to multiple retrieval data combinations, the accuracy of image retrieval can be effectively improved.
  • the electronic device for executing the above image retrieval method provided by the embodiment of the present application may be a terminal.
  • Figure 15 is a partial structural block diagram of the terminal provided by the embodiment of the present application.
  • The terminal includes components such as a Radio Frequency (RF) circuit 1510, a memory 1520, an input unit 1530, a display unit 1540, a sensor 1550, an audio circuit 1560, a wireless fidelity (WiFi) module 1570, a processor 1580, and a power supply 1590.
  • The RF circuit 1510 can be used to receive and transmit information or signals during a call. In particular, after downlink information from the base station is received, it is processed by the processor 1580; in addition, uplink data is sent to the base station.
  • the memory 1520 can be used to store software programs and modules.
  • the processor 1580 executes various functional applications and data processing of the terminal by running the software programs and modules stored in the memory 1520 .
  • the input unit 1530 may be used to receive input numeric or character information, and generate key signal input related to settings and function control of the terminal.
  • the input unit 1530 may include a touch panel 1531 and other input devices 1532.
  • the display unit 1540 may be used to display input information or provided information as well as various menus of the terminal.
  • the display unit 1540 may include a display panel 1541.
  • the audio circuit 1560, speaker 1561, and microphone 1562 can provide an audio interface.
  • the processor 1580 included in the terminal can execute the image retrieval method of the previous embodiment.
  • the electronic device used to execute the above image retrieval method provided by the embodiment of the present application may also be a server.
  • FIG 16 is a partial structural block diagram of the server provided by the embodiment of the present application.
  • The server 1600 may vary greatly due to different configurations or performance, and may include one or more central processing units (CPU) 1622 (for example, one or more processors), memory 1632, and one or more storage media 1630 (for example, one or more mass storage devices) storing application programs 1642 or data 1644.
  • the memory 1632 and the storage medium 1630 may be short-term storage or persistent storage.
  • the program stored in the storage medium 1630 may include one or more modules (not shown in the figure), and each module may include a series of instruction operations on the server 1600 .
  • the central processor 1622 may be configured to communicate with the storage medium 1630 and execute a series of instruction operations in the storage medium 1630 on the server 1600 .
  • The server 1600 may also include one or more power supplies 1626, one or more wired or wireless network interfaces 1650, one or more input and output interfaces 1658, and/or one or more operating systems 1641, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, and so on.
  • The processor in the server 1600 may be used to perform the image retrieval method of the foregoing embodiments.
  • Embodiments of the present application also provide a computer-readable storage medium.
  • the computer-readable storage medium is used to store program codes.
  • the program codes are used to execute the image retrieval methods of the foregoing embodiments.
  • An embodiment of the present application also provides a computer program product.
  • the computer program product includes a computer program, and the computer program is stored in a computer-readable storage medium.
  • The processor of the computer device reads the computer program from the computer-readable storage medium, and the processor executes the computer program, so that the computer device executes the above image retrieval method.
  • At least one (item) refers to one or more, and “plurality” refers to two or more.
  • "And/or" is used to describe the relationship between associated objects, indicating that there can be three relationships. For example, "A and/or B" can mean: only A exists, only B exists, or A and B exist simultaneously, where A and B can be singular or plural. The character "/" generally indicates that the related objects are in an "or" relationship. "At least one of the following" or similar expressions refer to any combination of these items, including any combination of a single item or a plurality of items.
  • For example, "at least one of a, b or c" can mean: a, b, c, "a and b", "a and c", "b and c", or "a and b and c", where a, b and c can be single or multiple.
  • the disclosed systems, devices and methods can be implemented in other ways.
  • the device embodiments described above are only illustrative.
  • The division of units is only a logical function division; in actual implementation, there may be other division methods. For example, multiple units or components may be combined or integrated into another system, or some features can be ignored or not implemented.
  • the coupling or direct coupling or communication connection between each other shown or discussed may be through some interfaces, and the indirect coupling or communication connection of the devices or units may be in electrical, mechanical or other forms.
  • A unit described as a separate component may or may not be physically separate, and a component shown as a unit may or may not be a physical unit; that is, it may be located in one place, or it may be distributed to multiple network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of this embodiment.
  • each functional unit in each embodiment of the present application can be integrated into one processing unit, each unit can exist physically alone, or two or more units can be integrated into one unit.
  • the above integrated units can be implemented in the form of hardware or software functional units.
  • Integrated units may be stored in a computer-readable storage medium if they are implemented in the form of software functional units and sold or used as independent products.
  • Based on this understanding, the technical solution of the present application, in essence or in the part contributing to the existing technology, or all or part of the technical solution, can be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions to cause a computer device (which may be a personal computer, a server, a network device, etc.) to execute all or part of the steps of the methods of the various embodiments of the present application.
  • The aforementioned storage media include media that can store program code, such as a USB flash drive, a mobile hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk or an optical disk.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Library & Information Science (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Physics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Processing Or Creating Images (AREA)

Abstract

An image retrieval method, apparatus, electronic device and storage medium. The image retrieval method acquires a candidate image set and data to be retrieved in multiple modalities, wherein the candidate image set includes multiple candidate images; performs feature extraction on the data to be retrieved based on a target model to obtain first features of the data to be retrieved, and performs multiple feature extractions on the candidate images based on the target model to obtain second features of the candidate images after being aligned to the data to be retrieved in the various modalities; determines, according to the first features and the second features, first similarities between the candidate images and the data to be retrieved in the various modalities, and determines, according to the first similarities, result image sets corresponding to multiple retrieval data combinations from the candidate image set, wherein a retrieval data combination includes data to be retrieved in at least one modality; and merges the multiple result image sets to obtain an image retrieval result.

Description

图像检索方法、装置、电子设备及存储介质
本申请基于申请号为:202211089620.8,申请日为2022年09月07日的中国专利申请提出,并要求该中国专利申请的优先权,该中国专利申请的全部内容在此引入本申请作为参考。
技术领域
本申请涉及人工智能技术领域,特别是涉及一种图像检索方法、装置、电子设备及存储介质。
背景技术
随着互联网技术的快速发展,图像检索在多种场景中得到广泛的应用。相关技术中,一般基于输入的待检索数据进行图像检索,待检索数据一般也为图像,即这种图像检索方式实际上为图搜图,具体可以从图像数据库中检索出与输入的检索图像相似的图像。然而,这种图像检索方式不能泛化其他类型的待检索数据,图像检索的准确性有待提高。
发明内容
以下是对本申请详细描述的主题的概述。本概述并非是为了限制权利要求的保护范围。
本申请实施例提供了一种图像检索方法、装置、电子设备及存储介质,能够提升图像检索的准确性。
一方面,本申请实施例提供了一种图像检索方法,包括:
电子设备获取候选图像集以及多种模态的待检索数据,其中,所述候选图像集包括多个候选图像;
电子设备基于目标模型对所述待检索数据进行特征提取,得到所述待检索数据的第一特征,基于所述目标模型对所述候选图像进行多次特征提取,得到所述候选图像向各种模态的所述待检索数据对齐后的第二特征;
电子设备根据所述第一特征和所述第二特征,确定所述候选图像与各种模态的所述待检索数据之间的第一相似度,电子设备根据所述第一相似度,从所述候选图像集中确定多个检索数据组合对应的结果图像集,其中,所述检索数据组合包括至少一种模态的所述待检索数据;
电子设备将多个所述结果图像集进行合并,得到图像检索结果。
另一方面,本申请实施例还提供了一种图像检索装置,包括:
数据获取模块,用于获取候选图像集以及多种模态的待检索数据,其中,所述候选图像集包括多个候选图像;
模型处理模块,用于基于目标模型对所述待检索数据进行特征提取,得到所述待检索数据的第一特征,基于所述目标模型对所述候选图像进行多次特征提取,得到所述候选图像向各种模态的所述待检索数据对齐后的第二特征;
检索模块,用于根据所述第一特征和所述第二特征,确定所述候选图像与各种模态的所述待检索数据之间的第一相似度,根据所述第一相似度,从所述候选图像集中确定多个检索数据组合对应的结果图像集,其中,所述检索数据组合包括至少一种模态的所述待检索数据;
合并模块,用于将多个所述结果图像集进行合并,得到图像检索结果。
另一方面,本申请实施例还提供了一种电子设备,包括存储器和处理器,所述存储器存储有计算机程序,所述处理器执行所述计算机程序时实现上述的图像检索方法。
另一方面,本申请实施例还提供了一种计算机可读存储介质,所述存储介质存储有计算机程序,所述计算机程序被处理器执行实现上述的图像检索方法。
另一方面,本申请实施例还提供了一种计算机程序产品,该计算机程序产品包括计算机程序,该计算机程序存储在计算机可读存介质中。计算机设备的处理器从计算机可读存储介质读取该计算机程序,处理器执行该计算机程序,使得该计算机设备执行实现上述的图像检索方法。
本申请实施例至少包括以下有益效果:电子设备通过目标模型对待检索数据进行特征提取,得到待检索数据的第一特征,再通过同一个目标模型对候选图像进行多次特征提取,得到候选图像向各种模态的待检索数据对齐后的第二特征,既能够利用多种模态的待检索数据来提升图像检索的准确性,也能够统一多种模态的待检索数据与候选图像的特征框架,提升第一特征与第二特征之间的特征空间一致性;并且,电子设备利用同一个目标模型来确定第一特征和第二特征可以减少目标模型的参数量,降低目标模型部署的内存开销;另外,在训练阶段也只需要训练同一个目标模型,提升模型训练效率;在此基础上,电子设备通过根据第一特征和第二特征,确定候选图像与各种模态的待检索数据之间的第一相似度,电子设备根据第一相似度,从候选图像集中确定多个检索数据组合对应的结果图像集,将多个结果图像集进行合并,得到图像检索结果,无须将待检索数据与候选图像进行一对一检索,有效地提升了图像检索的效率,并且图像检索结果基于多个检索数据组合对应的结果图像集得到,能够有效地提升图像检索的准确性。
本申请的其他特征和优点将在随后的说明书中阐述,并且,部分地从说明书中变得显而易见,或者通过实施本申请而了解。
附图说明
附图用来提供对本申请技术方案的进一步理解,并且构成说明书的一部分,与本申请的实施例一起用于解释本申请的技术方案,并不构成对本申请技术方案的限制。
图1为本申请实施例提供的一种可选的实施环境的示意图;
图2为本申请实施例提供的图像检索方法的一种可选的流程示意图;
图3为本申请实施例提供的目标模型的一种可选的结构示意图;
图4为本申请实施例提供的基于多个检索数据组合得到图像检索结果的一种可选的流程示意图;
图5为本申请实施例提供的目标模型的另一种可选的结构示意图;
图6为本申请实施例提供的目标模型的一种可选的训练过程示意图;
图7为本申请实施例提供的对训练样本进行扩展的一种可选的流程示意图;
图8为本申请实施例提供的目标模型的一种可选的总体架构示意图;
图9为本申请实施例提供的目标模型的另一种可选的总体架构示意图;
图10为本申请实施例提供的目标模型的另一种可选的总体架构示意图;
图11为本申请实施例提供的目标模型的另一种可选的总体架构示意图;
图12为本申请实施例提供的利用搜索引擎来进行图像检索的流程示意图;
图13为本申请实施例提供的在照片应用中进行图像检索的流程示意图;
图14为本申请实施例提供的图像检索装置的一种可选的结构示意图;
图15为本申请实施例提供的终端的部分结构框图;
图16为本申请实施例提供的服务器的部分结构框图。
具体实施方式
为了使本申请的目的、技术方案及优点更加清楚明白,以下结合附图及实施例,对本申请进行进一步详细说明。应当理解,此处所描述的具体实施例仅用以解释本申请,并不用于限定本申请。
需要说明的是,在本申请的各个具体实施方式中,当涉及到需要根据目标对象属性信息或属性信息集合等与目标对象特性相关的数据进行相关处理时,都会先获得目标对象的许可或者同意,而且,对这些数据的收集、使用和处理等,都会遵守相关国家和地区的相关法律法规和标准其中,目标对象可以是用户。此外,当本申请实施例需要获取目标对象属性信息时,会通过弹窗或者跳转到确认页面等方式获得目标对象的单独许可或者单独同意,在明确获得目标对象的单独许可或者单独同意之后,再获取用于使本申请实施例能够正常运行的必要的目标对象相关数据。
相关技术中,一般基于输入的待检索数据进行图像检索,待检索数据一般也为图像,即这种图像检索方式实际上为图搜图,具体可以从图像数据库中检索出与输入的检索图像相似的图像。然而,这种图像检索方式不能泛化其他类型的待检索数据,图像检索的准确性有待提高。
基于此,本申请实施例提供了一种图像检索方法、装置、电子设备及存储介质,能够提升图像检索的准确性。
参照图1,图1为本申请实施例提供的一种可选的实施环境的示意图,该实施环境包括终端101和服务器102,其中,终端101和服务器102之间通过通信网络连接。
服务器102可以是独立的物理服务器,也可以是多个物理服务器构成的服务器集群或者分布式系统,还可以是提供云服务、云数据库、云计算、云函数、云存储、网络服务、云通信、中间件服务、域名服务、安全服务、CDN(Content Delivery Network,内容分发网络)、以及大数据和人工智能平台等基础云计算服务的云服务器。另外,服务器102还可以是区块链网络中的一个节点服务器。
终端101可以是智能手机、平板电脑、笔记本电脑、台式计算机、智能音箱、智能手表、车载终端等,但并不局限于此。终端101以及服务器102可以通过有线或无线通信方式进行直接或间接地连接,本申请实施例在此不做限制。
示例性地,终端101可以将多种模态的待检索数据发送至服务器102;服务器102接收待检索数据并获取预先存储的候选图像集,基于目标模型对待检索数据进行特征提取,得到待检索数据的第一特征,基于目标模型对候选图像进行多次特征提取,得到候选图像向各种模态的待检索数据对齐后的第二特征,根据第一特征和第二特征,确定候选图像与各种模态的待检索数据之间的第一相似度,根据第一相似度,从候选图像集中确定多个检索数据组合对应的结果图像集,将多个结果图像集进行合并,得到图像检索结果,将图像检索结果发送至终端101;终端101对图像检索结果进行显示。服务器102通过目标模型对待检索数据进行特征提取,得到待检索数据的第一特征,再通过同一个目标模型对候选图像进行多次特征提取,得到候选图像向各种模态的待检索数据对齐后的第二特征,既能够利用多种模态的待检索数据来提升图像检索的准确性,也能够统一多种模态的待检索数据与候选图像的特征框架,提升第一特征与第二特征之间的特征空间一致性;并且,利用同一个目标模型来确定第一特征和第二特征可以减少目标模型的参数量,降低目标模型部署的内存开销;另外,在训练阶段也只需要训练同一个目标模型,提升模型训练效率;在此基础上,通过根据第一特征和第二特征,确定候选图像与各种模态的待检索数据之间的第一相似度,根 据第一相似度,从候选图像集中确定多个检索数据组合对应的结果图像集,将多个结果图像集进行合并,得到图像检索结果,无须将待检索数据与候选图像进行一对一检索,有效地提升了图像检索的效率,并且图像检索结果基于多个检索数据组合对应的结果图像集得到,能够有效地提升图像检索的准确性。
本申请实施例提供的方法可应用于各种技术领域,包括但不限于云技术、人工智能等技术领域。
参照图2,图2为本申请实施例提供的图像检索方法的一种可选的流程示意图,该图像检索方法可以由服务器执行,或者可以由终端执行,或者也可以由服务器和终端配合执行,该图像检索方法包括但不限于以下步骤201至步骤204。
步骤201:电子设备获取候选图像集以及多种模态的待检索数据。
其中,候选图像集包括多个候选图像,候选图像即检索数据库中的图像,图像检索结果基于候选图像集产生。待检索数据即进行图像检索时的查询数据,模态用于指示待检索数据的存在形式,模态可以是图像模态、文本模态、语音模态等,图像模态的待检索数据即待检索图像,文本模态的待检索数据即待检索文本,语音模态的待检索数据即待检索语音。
在一种可能的实现方式中,多种模态的待检索数据可以包括待检索图像和待检索文本,或者,多种模态的待检索数据也可以包括待检索图像和待检索语音,或者,多种模态的待检索数据也可以包括待检索文本和待检索语音,或者,多种模态的待检索数据也可以包括待检索图像、待检索文本和待检索语音。
其中,多种模态的待检索数据是相互独立的,不同模态的待检索数据之间可以是相关联的,也可以是不相关的。以多种模态的待检索数据包括待检索图像和待检索文本为例,待检索图像可以是包括三朵牡丹花的图像,待检索文本可以是“三朵牡丹花”,此时,待检索图像和待检索文本之间相关联;又或者,待检索图像可以是包括三朵牡丹花的图像,待检索文本可以是“三辆汽车”,此时,待检索图像和待检索文本之间不相关。
步骤202:电子设备基于目标模型对待检索数据进行特征提取,得到待检索数据的第一特征,电子设备基于目标模型对候选图像进行多次特征提取,得到候选图像向各种模态的待检索数据对齐后的第二特征。
在一种可能的实现方式中,特征提取可以是指将待检索数据映射至高维特征空间。基于目标模型对待检索数据进行特征提取,是基于目标模型对各种模态的待检索数据进行特征提取,相应地,目标模型可以设置有不同的特征提取单元来对各种模态的待检索数据进行特征提取。例如,当多种模态的待检索数据包括待检索图像和待检索文本时,目标模型设置有图像特征提取单元和文本特征提取单元,图像特征提取单元用于对待检索图像进行特征提取,文本特征提取单元用于对待检索文本进行特征提取;当多种模态的待检索数据包括待检索图像和待检索语音时,目标模型设置有图像特征提取单元和语音特征提取单元,语音特征提取单元用于对待检索语音进行特征提取;当多种模态的待检索数据包括待检索文本和待检索语音时,目标模型设置有文本特征提取单元和语音特征提取单元;当多种模态的待检索数据包括待检索图像、待检索文本和待检索语音时,目标模型设置有图像特征提取单元、文本特征提取单元和语音特征提取单元。
在一种可能的实现方式中,基于目标模型对待检索数据进行特征提取,得到待检索数据的第一特征时,具体可以将待检索数据转化为检索嵌入向量,将检索嵌入向量输入至目标模型中,基于目标模型对待检索数据进行特征映射,得到待检索数据的第一特征。其中,检索嵌入向量用于表征待检索数据的初始特征(经目标模型进行特征提取处理前的特征),不同模态的待检索数据转化为向量格式相同的检索嵌入向量,从而便于在同一个模型框架内统一多种模态的待检索数据的表征。
具体地,检索嵌入向量可以包括相互拼接的信息嵌入向量和类型嵌入向量,信息嵌入向量用于表征待检索数据所包含的信息特征,例如,当待检索数据为待检索图像时,信息嵌入向量用于表征待检索图像的图像信息,当待检索数据为待检索文本时,信息嵌入向量用于表征待检索图像的文本信息,当待检索数据为待检索语音时,信息嵌入向量用于表征待检索图像的语音信息;类型嵌入向量用于表征待检索数据的模态类型特征,例如,当待检索数据为待检索图像时,类型嵌入向量用于表征该待检索数据为图像模态,当待检索数据为待检索文本时,类型嵌入向量用于表征该待检索数据为文本模态,当待检索数据为待检索语音时,类型嵌入向量用于表征该待检索数据为语音模态。基于此,检索嵌入向量可以表示为:
X=finf+ftyp
其中,X表示检索嵌入向量,finf表示信息嵌入向量,ftyp表示类型嵌入向量。
由于检索嵌入向量包括相互拼接的信息嵌入向量和类型嵌入向量,可以基于信息嵌入向量来表征待检索图像的图像信息,基于类型嵌入向量来表征待检索数据的模态类型特征,后续基于目标模型对待检索数据进行特征提取时,可以便于目标模型根据类型嵌入向量确定当前的待检索数据的模态,进而调用对应的特征提取单元来对待检索数据进行特征提取,从而使得目标模型可以区分多种模态的待检索数据,便于在同一个模型框架内统一多种模态的待检索数据的表征。
在一种可能的实现方式中,候选图像向各种模态的待检索数据对齐,即候选图像与各种模态的待检索数据映射至相同的高维特征空间中,即第一特征与第二特征是相互对齐的。例如,若多种模态的待检索数 据包括待检索图像和待检索文本,则将候选图像与待检索图像进行对齐,以及将候选图像与待检索文本进行对齐,相应地,得到的第二特征的数量与待检索数据的模态数量相等,即得到候选图像向待检索图像对齐后的第二特征,以及得到候选图像向待检索文本对齐后的第二特征。可以理解的是,若多种模态的待检索数据包括待检索图像、待检索文本和待检索语音,则也将候选图像与待检索语音进行对齐,得到候选图像向待检索语音对齐后的第二特征。
相应地,目标模型中可以设置有不同的模态对齐单元来对候选图像进行特征提取,以将候选图像向对应模态的待检索数据对齐。例如,当多种模态的待检索数据包括待检索图像和待检索文本时,目标模型设置有图像模态对齐单元和文本模态对齐单元,图像模态对齐单元用于将候选图像向待检索图像对齐,文本模态对齐单元用于将候选图像向待检索文本对齐;当多种模态的待检索数据包括待检索图像和待检索语音时,目标模型设置有图像模态对齐单元和语音模态对齐单元,语音模态对齐单元用于将候选图像向待检索语音对齐;当多种模态的待检索数据包括待检索文本和待检索语音时,目标模型设置有文本模态对齐单元和语音模态对齐单元;当多种模态的待检索数据包括待检索图像、待检索文本和待检索语音时,目标模型设置有图像模态对齐单元、文本模态对齐单元和语音模态对齐单元。
具体地,参照图3,图3为本申请实施例提供的目标模型的一种可选的结构示意图,其中,该目标模型设置有多个特征提取单元和多个模态对齐单元,各个特征提取单元分别用于对对应模态的待检索数据进行特征提取,各个模态对齐单元分别用于将候选图像进行特征提取,使得候选图像向对应模态的待检索数据对齐。各个特征提取单元之间的参数可以不相同,各个模态对齐单元之间的参数可以不相同。通过在目标模型中设置多个特征提取单元和多个模态对齐单元,通过目标模型对待检索数据进行特征提取,得到待检索数据的第一特征,再通过同一个目标模型对候选图像进行多次特征提取,得到候选图像向各种模态的待检索数据对齐后的第二特征,既能够利用多种模态的待检索数据来提升图像检索的准确性,也能够统一多种模态的待检索数据与候选图像的特征框架,提升第一特征与第二特征之间的特征空间一致性;并且,利用同一个目标模型来确定第一特征和第二特征可以减少目标模型的参数量,降低目标模型部署的内存开销;另外,在训练阶段也只需要训练同一个目标模型,提升模型训练效率。
在一种可能的实现方式中,由于候选图像本身属于图像模态的数据,因此可以将图像特征提取单元作为图像模态对齐单元,即当多种模态的待检索数据包括图像模态的待检索数据时,可以利用图像特征提取单元得到待检索图像的第一特征,同时,可以该图像特征提取单元得到候选图像的第二特征,从而达到图像特征提取单元的复用效果,简化目标模型的结构。
可以理解的是,也可以额外设置图像模态对齐单元来得到候选图像的第二特征,本申请实施例不做限定。
因此,当多种模态的待检索数据包括待检索文本和待检索图像,在基于目标模型对候选图像进行多次特征提取,得到候选图像向各种模态的待检索数据对齐后的第二特征时,具体可以基于文本模态对齐单元对候选图像进行特征提取,得到候选图像向待检索文本对齐的第二特征;基于图像特征提取单元对候选图像进行特征提取,得到候选图像的图像特征,将图像特征作为候选图像向待检索图像对齐后的第二特征,达到图像特征提取单元的复用效果,简化目标模型的结构。
当多种模态的待检索数据包括待检索语音和待检索图像,或者当多种模态的待检索数据包括待检索文本、待检索语音和待检索图像时,同样可以采用上述图像特征提取单元的复用方式,在此不再赘述。
步骤203:电子设备根据第一特征和第二特征,确定候选图像与各种模态的待检索数据之间的第一相似度,电子设备根据第一相似度,从候选图像集中确定多个检索数据组合对应的结果图像集。
其中,根据第一特征和第二特征,确定候选图像与各种模态的待检索数据之间的第一相似度,即第一相似度的数量与待检索数据的模态数量相同。例如,当多种模态的待检索数据包括待检索图像和待检索文本时,根据待检索图像的第一特征和候选图像向待检索图像对齐的第二特征,确定待检索图像与候选图像之间的第一相似度,根据待检索文本的第一特征和候选图像向待检索文本对齐的第二特征,确定待检索文本与候选图像之间的第一相似度;当多种模态的待检索数据包括待检索文本、待检索图像和待检索语音时,根据待检索图像的第一特征和候选图像向待检索图像对齐的第二特征,确定待检索图像与候选图像之间的第一相似度,根据待检索文本的第一特征和候选图像向待检索文本对齐的第二特征,确定待检索文本与候选图像之间的第一相似度,根据待检索语音的第一特征和候选图像向待检索语音对齐的第二特征,确定待检索语音与候选图像之间的第一相似度。
其中,检索数据组合包括至少一种模态的待检索数据,即检索数据组合可以包括一种模态的待检索数据(即第一数据组合),也可以包括多种模态的待检索数据(即第二数据组合),例如,第一数据组合可以包括待检索图像,或者也可以包括待检索文本,或者也可以包括待检索语音;当第二数据组合可以包括待检索图像和待检索文本,或者也可以包括待检索图像和待检索语音,或者也可以包括待检索文本和待检索语音,或者也可以包括待检索图像、待检索文本和待检索语音,等等。
在一种可能的实现方式中,第一相似度可以为欧氏距离的距离矩阵,或者余弦相似度的相似度矩阵, 或者切比雪夫距离的距离矩阵等等,本申请实施例不做限定。
其中,由于不同的检索数据组合对应有一个第一相似度或者多个不同的第一相似度,因此,可以根据第一相似度,从候选图像集中确定多个检索数据组合对应的结果图像集。
例如,当多种模态的待检索数据包括待检索图像和待检索文本时,多个检索数据组合可以为待检索图像和待检索文本,相应地,多个检索数据组合对应的结果图像集即待检索图像对应的结果图像集,以及待检索文本对应的结果图像集,这种情况下检索数据组合均为第一数据组合。因此,后续可以结合多种模态的待检索数据筛选出来的结果图像集得到图像检索结果,从而能够提升图像检索的准确性。
除此以外,当多种模态的待检索数据包括待检索图像和待检索文本时,也可以采用第一数据组合和第二数据组合结合的方式,即不同的检索数据组合可以为待检索图像、待检索文本、待检索图像结合待检索文本,相应地,不同检索数据组合对应的结果图像集即待检索图像对应的结果图像集,以及待检索文本对应的结果图像集,以及待检索图像结合待检索文本对应的结果图像集,从而在利用各种模态的待检索数据来得到图像检索结果的基础上,进一步引入多种模态的待检索数据结合来扩充图像检索结果,从而进一步提升图像检索的准确性。
在一种可能的实现方式中,若采用第一数据组合和第二数据组合结合的方式来确定图像检索结果,则根据第一相似度,从候选图像集中确定多个检索数据组合对应的结果图像集时,具体可以根据一种模态的待检索数据对应的第一相似度,从候选图像集中确定第一数据组合对应的结果图像集;将多种模态的待检索数据对应的第一相似度进行融合,得到目标相似度,根据目标相似度从候选图像集中确定第二数据组合对应的结果图像集。
具体地,第一数据组合对应的结果图像集即各种模态的待检索数据各自对应的结果图像集,第二数据组合对应的结果图像集即多种模态的待检索数据结合后对应的结果图像集,例如,当多种模态的待检索数据包括待检索图像和待检索文本时,第一数据组合对应的结果图像集即待检索图像对应的结果图像集,以及待检索文本对应的结果图像集;在此基础上,可以将待检索图像对应的第一相似度与待检索文本对应的第一相似度进行融合,进而得到目标相似度,从而实现待检索图像和待检索文本结合来进行图像检索。其中,融合的方式可以是进行加权处理,或者也可以是多个相似度相乘。
在一种可能的实现方式中,结果图像集可以直接包括候选图像集中与各个检索数据组合匹配的目标图像,另外,还可以对结果图像集中的目标图像的数量进行预设,当目标图像的数量为多个时,还可以进一步对从候选图像集中确定的目标图像进行排序,例如可以基于第一相似度由大到小进行排序,使得结果图像集更加清晰明了。
步骤204:电子设备将多个结果图像集进行合并,得到图像检索结果。
其中,由于不同检索数据组合对应有各自的结果图像集,因此可以将多个结果图像集进行合并,得到最终的图像检索结果,具体可以是对结果图像集进行去重后输出最终的图像检索结果,或者,也可以是将不同的结果图像集直接并列输出为最终的图像检索结果。
例如,参照图4,图4为本申请实施例提供的基于多个检索数据组合得到图像检索结果的一种可选的流程示意图,以多种模态的待检索数据包括待检索图像和待检索文本为例,待检索图像为背着包的女孩的图像,待检索文本为“长头发的女孩穿着黑色外套,黑色裤子,背着红色的包”,背着包的女孩的图像为一个检索数据组合,“长头发的女孩穿着黑色外套,黑色裤子,背着红色的包”为一个检索数据组合,背着包的女孩的图像结合“长头发的女孩穿着黑色外套,黑色裤子,背着红色的包”为一个检索数据组合,不同检索数据组合对应的结果图像集合并后得到图像检索结果。
在一种可能的实现方式中,可以将待检索数据与候选图像进行一对一检索来得到图像检索结果,一对一检索即将待检索数据与各个候选图像作为一个数据对输入至检索模型中,检索模型输出待检索数据与候选图像之间的匹配概率,由于候选图像有多个,因此一对一检索需要成对遍历检索,加大检索资源的消耗。而本申请实施例通过根据第一特征和第二特征,确定候选图像与各种模态的待检索数据之间的第一相似度,根据第一相似度,从候选图像集中确定多个检索数据组合对应的结果图像集,将多个结果图像集进行合并,得到图像检索结果,无须将待检索数据与候选图像进行一对一检索,有效地提升了图像检索的效率,并且图像检索结果基于多个检索数据组合对应的结果图像集得到,能够有效地提升图像检索的准确性。
In one possible implementation, when converting the data to be retrieved into a retrieval embedding vector, the data to be retrieved may be split into multiple retrieval data blocks and feature mapping applied to the blocks to obtain a first embedding vector; the position information of each block within the data to be retrieved is determined and feature-mapped to obtain a second embedding vector; the modality of the data to be retrieved is feature-mapped to obtain a third embedding vector; and the first, second, and third embedding vectors are concatenated to obtain the retrieval embedding vector.
The concatenation of the first and second embedding vectors corresponds to the aforementioned information embedding vector: splitting the data to be retrieved yields the first embedding vector, and the position of each block within the data yields the second embedding vector, so the information embedding vector carries more information about the data to be retrieved, improving its accuracy. The third embedding vector corresponds to the aforementioned type embedding vector and lets the target model determine the modality of the current data to be retrieved.
For text to be retrieved, splitting the data into retrieval data blocks may specifically be word segmentation, giving multiple text words; start and end markers are added to the text, which is then encoded by a text encoder. This can be expressed as:
t = {[cls], t_1, ..., t_M, [sep]}
where t is the output of the text encoder, [cls] is the start marker, [sep] is the end marker, t_1, ..., t_M are the text words, and M is a positive integer.
Next, a pre-trained word embedding can map the text-encoder output into the token embedding space, giving the first embedding vector; the position of each text word within the text is determined and feature-mapped into the second embedding vector; the text modality is feature-mapped into the third embedding vector; and concatenating the first, second, and third embedding vectors of the text gives its retrieval embedding vector X_t, where X_t denotes the retrieval embedding vector corresponding to the text to be retrieved and the three concatenated terms are its first, second, and third embedding vectors.
For an image to be retrieved, splitting the data into retrieval data blocks may specifically be image segmentation, giving multiple image patches; a start marker is added to the image, which is then encoded by an image encoder. This can be expressed as:
v = {[cls], v_1, ..., v_N}
where v is the output of the image encoder, [cls] is the start marker, v_1, ..., v_N are the image patches, and N is a positive integer.
Next, the image-encoder output is feature-mapped in a manner similar to the text modality above, giving the first embedding vector; the position of each image patch within the image is determined and feature-mapped into the second embedding vector; the image modality is feature-mapped into the third embedding vector; and concatenating the first, second, and third embedding vectors of the image gives its retrieval embedding vector X_v, where X_v denotes the retrieval embedding vector corresponding to the image to be retrieved and the three concatenated terms are its first, second, and third embedding vectors.
For speech to be retrieved, splitting the data into retrieval data blocks may specifically be speech segmentation, giving multiple speech frames; start and end markers are added to the speech, which is then encoded by a speech encoder. This can be expressed as:
s = {[cls], s_1, ..., s_K, [sep]}
where s is the output of the speech encoder, [cls] is the start marker, [sep] is the end marker, s_1, ..., s_K are the speech frames, and K is a positive integer.
Next, the speech-encoder output is feature-mapped in a manner similar to the text modality above, giving the first embedding vector; the position of each speech frame within the speech is determined and feature-mapped into the second embedding vector; the speech modality is feature-mapped into the third embedding vector; and concatenating the first, second, and third embedding vectors of the speech gives its retrieval embedding vector X_s, where X_s denotes the retrieval embedding vector corresponding to the speech to be retrieved and the three concatenated terms are its first, second, and third embedding vectors.
As can be seen, the retrieval embedding vectors of the different modalities above share the same vector format, which makes it convenient to unify the representations of the multi-modality data to be retrieved within a single model framework, lets the target model extract features from data of different modalities, and lays the foundation for subsequently determining, from the multiple candidate images, the target images corresponding to the different retrieval data combinations. A minimal sketch of this shared embedding construction follows.
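The sketch below assumes the blocks have already been mapped to integer ids ([cls]/[sep] included); the class name, dimensions, and learned embedding tables are illustrative assumptions, not details from the patent. The patent describes concatenating the three embeddings; summing them (BERT/ViT style) would keep the model width fixed, but the sketch follows the concatenation as described:

```python
import torch
import torch.nn as nn

class RetrievalEmbedding(nn.Module):
    """Builds the retrieval embedding vector from three parts:
    a first (block content) embedding, a second (position) embedding,
    and a third (modality type) embedding, then concatenates them."""

    MODALITIES = {"text": 0, "image": 1, "speech": 2}

    def __init__(self, vocab_size: int = 30000, dim: int = 256, max_len: int = 512):
        super().__init__()
        self.block_emb = nn.Embedding(vocab_size, dim)            # first embedding
        self.pos_emb = nn.Embedding(max_len, dim)                 # second embedding
        self.type_emb = nn.Embedding(len(self.MODALITIES), dim)   # third embedding

    def forward(self, block_ids: torch.Tensor, modality: str) -> torch.Tensor:
        # block_ids: [batch, seq_len] ids of the retrieval data blocks
        pos = torch.arange(block_ids.size(1), device=block_ids.device)
        first = self.block_emb(block_ids)                         # content per block
        second = self.pos_emb(pos)[None, :, :].expand_as(first)   # position per block
        mod = torch.tensor(self.MODALITIES[modality], device=block_ids.device)
        third = self.type_emb(mod).expand_as(first)               # same type for all blocks
        return torch.cat([first, second, third], dim=-1)          # [batch, seq, 3*dim]

# e.g. emb = RetrievalEmbedding()(token_ids, "text")  # token_ids: [B, L] LongTensor
```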
In one possible implementation, when feature-mapping the outputs of the text encoder, the image encoder, and the speech encoder, the outputs of the different encoders may be mapped into different high-dimensional feature spaces, so that the resulting first embedding vector better matches the feature-representation needs of the corresponding modality, improving the accuracy and soundness of the first embedding vector.
It can be understood that, since the candidate images are image-modality data, when extracting features from a candidate image with the target model, the candidate image can be feature-mapped into its embedding vector in the same way as the retrieval embedding vector of the image to be retrieved is obtained, and the candidate image's embedding vector is then input into the target model for feature extraction.
Referring to Fig. 5, which is another optional structural diagram of the target model according to an embodiment of this application, the target model may be provided with a first normalization layer, an attention layer, a second normalization layer, multiple feature extraction units, and multiple modality alignment units. With the structure of Fig. 5, feature-mapping the data to be retrieved with the target model to obtain its first feature may specifically comprise: normalizing the retrieval embedding vector to obtain a first normalized vector, performing attention feature extraction on the first normalized vector to obtain an attention vector, and feature-mapping the attention vector with the target model to obtain the first feature of the data to be retrieved.
The first normalization layer can apply layer normalization to the retrieval embedding vector, standardizing the data and improving the efficiency with which the target model processes it; the attention layer can perform attention feature extraction on the first normalized vector, drawing out its important information so that the first feature obtained by the subsequent feature mapping is more accurate.
In one possible implementation, the attention layer may use a multi-head attention mechanism for this extraction. The first normalization layer, attention layer, second normalization layer, feature extraction units, and modality alignment units can form one processing module, and several such processing modules can be stacked in the target model, with the output of each module serving as the input of the next; the last module outputs the final first feature, improving its accuracy.
In one possible implementation, after the attention vector is obtained, feature-mapping it with the target model to obtain the first feature may specifically comprise: concatenating the attention vector with the retrieval embedding vector to obtain a concatenated vector; normalizing the concatenated vector to obtain a second normalized vector; performing forward feature mapping on the second normalized vector with the target model to obtain a mapped vector; and concatenating the mapped vector with the concatenated vector to obtain the first feature of the data to be retrieved.
Here, the forward feature mapping of the second normalized vector is performed by the corresponding feature extraction unit, which may include a feed-forward layer. Concatenating the attention vector with the retrieval embedding vector lets the concatenated vector carry the original information of the embedding, improving its accuracy.
The second normalization layer normalizes the concatenated vector, standardizing it and improving the efficiency with which the target model processes the retrieval embedding vector; concatenating the mapped vector with the concatenated vector lets the first feature carry the original information of the concatenated vector, improving the accuracy of the first feature.
It can be understood that obtaining the second feature of a candidate image with the target model is analogous to obtaining the first feature of the data to be retrieved: the candidate image is converted into an image embedding vector with the same vector format as the retrieval embedding vector; the image embedding vector is normalized into the candidate image's first normalized vector; attention feature extraction yields the candidate image's attention vector, which is concatenated with the image embedding vector into the candidate image's concatenated vector; that is normalized into the candidate image's second normalized vector; each modality alignment unit performs forward feature mapping on it to obtain the candidate image's mapped vector; and concatenating the mapped vector with the concatenated vector gives the second features of the candidate image aligned to the data to be retrieved of each modality.
Therefore, when obtaining the first feature of the data to be retrieved and the second feature of the candidate image with the target model, the same first normalization layer, attention layer, and second normalization layer can be shared, with different feature extraction units or modality alignment units then invoked for feature extraction, which simplifies the structure of the target model.
For example, taking data to be retrieved that includes an image and a text, when several of the above processing modules are stacked in the target model, the concatenated vector can be written as:
x̂_i = MSA(LN(x_{i−1})) + x_{i−1}
where x̂_i denotes the concatenated vector generated in the i-th processing module for the image to be retrieved, the text to be retrieved, or the candidate image; MSA denotes the multi-head attention mechanism; LN denotes normalization; x_{i−1} denotes the retrieval embedding vector input to the i-th processing module (the first feature output by the (i−1)-th module); i is a positive integer, and when i = 1, x_0 is the initial retrieval embedding vector of the image or text to be retrieved.
Correspondingly, the first feature or the second feature can be written as:
x_i = MLP(LN(x̂_i)) + x̂_i
where x_i denotes the first feature of the image or text to be retrieved, or the second feature of the candidate image, generated in the i-th processing module, and MLP denotes the forward mapping.
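A minimal PyTorch sketch of one such processing module, under the assumption (consistent with the formulas above) that the "concatenation" acts as a residual addition; the unit names and dimensions are illustrative, not from the patent:

```python
import torch
import torch.nn as nn

class ProcessingModule(nn.Module):
    """One stacked processing module: h = MSA(LN(x)) + x, then
    out = FFN_u(LN(h)) + h, where FFN_u is the feature extraction unit
    or modality alignment unit selected for the current input type."""

    def __init__(self, dim: int = 512, heads: int = 8,
                 units: tuple = ("image", "text", "text_align")):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)                                   # first normalization layer
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)  # attention layer
        self.norm2 = nn.LayerNorm(dim)                                   # second normalization layer
        # one feed-forward unit per feature-extraction / modality-alignment unit
        self.ffn = nn.ModuleDict({
            name: nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for name in units
        })

    def forward(self, x: torch.Tensor, unit: str) -> torch.Tensor:
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h)           # attention feature extraction
        x = attn_out + x                           # combine with the input embedding
        return self.ffn[unit](self.norm2(x)) + x   # forward feature mapping + residual

# e.g. second_feature = ProcessingModule()(image_embedding, unit="text_align")
```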
In one possible implementation, before the candidate image set and the multi-modality data to be retrieved are acquired, the target model can first be trained. Specifically: acquire sample images and sample retrieval data of at least one modality other than the image modality, and acquire similarity labels between the sample images and the sample retrieval data; extract features of the sample retrieval data with the target model to obtain third features, and perform multiple feature extractions on the sample images with the target model to obtain fourth features of the sample images aligned to the sample retrieval data of each modality; determine second similarities between the sample images and the sample retrieval data from the third and fourth features, and determine a first loss value from the second similarities and the corresponding similarity labels; and adjust the parameters of the target model according to the first loss value.
Both the sample retrieval data and the sample images are used to train the target model; since the sample retrieval data differs in modality from the sample images, it may be sample text, sample speech, and so on. The similarity label between a sample image and a piece of sample retrieval data indicates whether they match and may be "1" or "0". A label of "1" means they match: for instance, if the sample retrieval data is the sample text "a boy carrying a schoolbag", the sample image is an image of a boy carrying a schoolbag. A label of "0" means they do not match: for the same sample text, the sample image might be an image of peonies.
Extracting the third feature of the sample retrieval data with the target model is similar in principle to extracting the first feature of the data to be retrieved, and is not repeated here. Likewise, the multiple feature extractions that yield the fourth features of the sample images aligned to each modality mirror those that yield the second features of the candidate images, and the second similarity is computed in the same way as the first similarity. Once the second similarity between a sample image and the sample retrieval data is determined, and since the corresponding similarity label is known, the first loss value can be determined from the second similarity and the label, which can be expressed as:
L_1 = -(1/B) Σ_{i=1}^{B} Σ_{j=1}^{B} q_{i,j} · log(p_{i,j} + ε)
where L_1 is the first loss value, B is the number of sample pairs formed by the sample retrieval data and the sample images, i indexes the i-th sample image and j the j-th piece of sample retrieval data (both positive integers), p_{i,j} is the probability obtained by normalizing the second similarity, q_{i,j} is the probability obtained by normalizing the similarity label, and ε is a very small floating-point number used for numerical stability (for example, preventing division by zero).
Specifically:
p_{i,j} = exp(f_i^T f_j) / Σ_{k=1}^{B} exp(f_i^T f_k),  q_{i,j} = y_{i,j} / Σ_{k=1}^{B} y_{i,k}
where f_i^T is the transpose of the fourth feature of the i-th sample image, f_j is the third feature of the j-th piece of sample retrieval data, f_k is the third feature of the k-th piece of sample retrieval data, y_{i,j} is the similarity label between the i-th sample image and the j-th piece of sample retrieval data, and y_{i,k} is the similarity label between the i-th sample image and the k-th piece of sample retrieval data.
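A minimal sketch of this first loss under the reconstruction above; the batch averaging and the placement of ε are assumptions, since the original equation was not fully legible:

```python
import torch
import torch.nn.functional as F

def alignment_loss(img_feats, query_feats, labels, eps=1e-8):
    """Soft cross-entropy between normalized similarities p and normalized labels q.

    img_feats:   [B, D] fourth features of the sample images
    query_feats: [B, D] third features of the sample retrieval data
    labels:      [B, B] float similarity labels y (1.0 = matching pair, 0.0 = not)
    """
    logits = img_feats @ query_feats.t()                   # f_i^T f_j for all pairs
    p = F.softmax(logits, dim=1)                           # p_{i,j}
    q = labels / (labels.sum(dim=1, keepdim=True) + eps)   # q_{i,j}
    return -(q * torch.log(p + eps)).sum(dim=1).mean()     # average over the batch
```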
Since the first loss value is determined from the third and fourth features, adjusting the target model's parameters according to it can mean adjusting the parameters of the modality alignment units and the corresponding feature extraction units, achieving joint training of a modality alignment unit and the feature extraction unit of the corresponding modality. This effectively improves the alignment between the features they extract and improves the training efficiency of the target model.
In one possible implementation, when the target model has an image feature extraction unit and that unit is reused (i.e., it extracts features both from the image to be retrieved and from the candidate images), adjusting the parameters according to the first loss value may specifically comprise: acquiring class labels of the sample images; extracting features of the sample images with the target model to obtain fifth features of the sample images in the image modality; classifying the sample images according to the fifth features to obtain sample classes, and determining a second loss value from the sample classes and the class labels; and adjusting the parameters of the target model according to the first and second loss values.
The class label of a sample image indicates its class: for an image of a dog, the label could be "animal", or "dog", and so on. The feature extraction here can be performed by the image feature extraction unit to obtain the fifth feature of the sample image in the image modality; the fifth feature is then input to a classifier to obtain the sample class, and the second loss value is determined from the sample class and the class label. It can be expressed as:
L_2 = -Σ_{x=1}^{m} p(x) · log q(x)
where L_2 is the second loss value, p(x) is the probability distribution corresponding to the class label, q(x) is the probability distribution corresponding to the sample class, x is the index of a sample-image class, and m is the total number of sample-image classes; x and m are positive integers.
In one possible implementation, the parameters can be adjusted separately according to the first and second loss values, or the two loss values can be weighted into a total loss value used for the adjustment.
Introducing class labels and classifying the sample images by their fifth features to obtain the second loss value brings image classification in to tune the image feature extraction unit: training signals from other scenarios can be used to adjust its parameters and improve its generalization, as sketched below.
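A minimal sketch of the second loss; the classifier width, feature dimension, and class count are illustrative:

```python
import torch
import torch.nn as nn

num_classes = 100                          # m, illustrative
classifier = nn.Linear(256, num_classes)   # feature dimension 256, illustrative

def classification_loss(fifth_features: torch.Tensor, class_labels: torch.Tensor) -> torch.Tensor:
    # fifth_features: [B, 256] image-modality features of the sample images
    # class_labels:   [B] integer class indices taken from the class labels
    logits = classifier(fifth_features)                        # unnormalized q(x)
    return nn.functional.cross_entropy(logits, class_labels)   # -sum p(x) log q(x)
```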
In one possible implementation, when the target model has an image feature extraction unit and that unit is reused, adjusting the parameters according to the first loss value may also specifically comprise: acquiring a first reference image of the same class as the sample image and a second reference image of a different class from the sample image; extracting features of the sample image, the first reference image, and the second reference image with the target model to obtain the fifth feature of the sample image in the image modality, a sixth feature of the first reference image, and a seventh feature of the second reference image; determining a third similarity between the fifth and sixth features and a fourth similarity between the fifth and seventh features, and determining a third loss value from the third and fourth similarities; and adjusting the parameters of the target model according to the first and third loss values.
There may be several sample images; for any one of them, the first and second reference images may come from among the sample images or from outside them, which the embodiments of this application do not limit. The feature extraction here is performed by the image feature extraction unit. Since the first reference image shares the sample image's class, the third similarity should normally be high; since the second reference image does not, the fourth similarity should normally be low. Accordingly, the third loss value can be expressed as:
L_3 = max(d_AP − d_AN + α, 0)
where L_3 is the third loss value, d_AP denotes the third similarity, d_AN denotes the fourth similarity, and α is a hyperparameter.
In one possible implementation, the parameters can be adjusted separately according to the first and third loss values, or the two loss values can be weighted into a total loss value used for the adjustment.
Introducing the first and second reference images, determining the third and fourth similarities, and deriving the third loss value pulls images of the same class closer together and pushes images of different classes further apart, making the features extracted by the feature extraction unit more accurate; a sketch follows.
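A minimal sketch of the third loss. The original equation appears truncated; the margin α and the hinge max(·, 0) are reconstructed from the surrounding text, and the use of Euclidean distance for d_AP and d_AN is an assumption:

```python
import torch
import torch.nn.functional as F

def triplet_loss(anchor, positive, negative, alpha=0.3):
    """Hinge loss over distances with margin alpha (a hyperparameter).

    anchor:   [B, D] fifth features of the sample images
    positive: [B, D] sixth features (first reference images, same class)
    negative: [B, D] seventh features (second reference images, different class)
    """
    d_ap = F.pairwise_distance(anchor, positive)  # corresponds to the third similarity
    d_an = F.pairwise_distance(anchor, negative)  # corresponds to the fourth similarity
    return torch.clamp(d_ap - d_an + alpha, min=0).mean()
```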
In one possible implementation, the first, second, and third loss values can also be weighted together into a total loss value used to adjust the parameters of the target model. For example, when the weights of all three loss values are 1, the total loss value can be expressed as:
L_total = L_1 + L_2 + L_3
where L_total is the total loss value.
When the target model has a reused image feature extraction unit, introducing the first, second, and third loss values together allows targeted training of the image feature extraction unit and the modality alignment units, which helps improve the training effect.
The training process of the target model is illustrated below, taking image retrieval based on text to be retrieved and an image to be retrieved as an example.
Referring to Fig. 6, an optional diagram of the target model's training process according to an embodiment of this application: a sample image set and a sample text set can be acquired and input into the target model. The text feature extraction unit of the target model extracts features of the sample texts in the sample text set, giving their third features; the text modality alignment unit extracts features of the sample images in the sample image set, giving the fourth features of the sample images aligned to the sample texts; the image feature extraction unit extracts features of the sample images, giving their fifth features. The first loss value is computed from the third and fourth features. The fifth features are normalized and input into a classifier to obtain the image classes of the sample images, and the second loss value is computed from these classes and the class labels of the sample images. First and second reference images are determined for each sample image from the sample image set, and the third loss value is computed from the similarity between a sample image and its first reference image and the similarity between the sample image and its second reference image. Finally, the total loss value is the sum of the first, second, and third loss values, and the parameters of the target model are adjusted according to the total loss value.
In one possible implementation, when the sample retrieval data includes sample text, the training samples of the target model can be expanded to improve the training effect. When acquiring the sample images and the sample retrieval data of at least one non-image modality, one can specifically: acquire an initial image and an initial text; apply augmentation to the initial image to obtain an augmented image; delete a text component of arbitrary length from the initial text to obtain an augmented text, or adjust components of the initial text using components of a reference text to obtain an augmented text; and use the initial and augmented images as sample images and the initial and augmented texts as sample texts.
Specifically, referring to Fig. 7, an optional flow diagram of expanding the training samples according to an embodiment of this application: in the training data set, initial images and initial texts can exist in pairs, there can be many such pairs, and each pair can be annotated with a class label.
For an initial image, augmentation can produce the augmented image; the augmentation includes, without limitation, one or a combination of enlarging, shrinking, cropping, flipping, color-gamut transformation, color jitter, and the like.
For an initial text, a text component of arbitrary length can be deleted to obtain the augmented text; a text component may be a word, a sentence, or a paragraph. For example, if the initial text is "this man is wearing a black-gray down jacket and light-colored trousers, and he has a dark green backpack", the augmented text may be "this man is wearing a black-gray down jacket, and he has a dark green backpack", or "this man is wearing a black-gray down jacket and light-colored trousers", and so on. Alternatively, components of the initial text can be adjusted using components of a reference text of the same class as the initial text; the reference text can be determined from the other initial texts in the training data set using the class labels, and the adjustment can replace components of the initial text with components of the reference text or add components of the reference text to those of the initial text. For example, with the initial text above and the reference text "a man has black hair; he wears a gray shirt, gray trousers and gray canvas shoes, and carries a bag", the augmented text may be "this person is wearing a black-gray down jacket, gray trousers and gray canvas shoes, and he has a dark green backpack", or "this man is wearing a black-gray down jacket and light-colored trousers, he has black hair, and he has a dark green backpack", and so on.
The augmented images and texts obtained this way can then be used to train the target model; the initial image with the augmented text, the augmented image with the initial text, and the augmented image with the augmented text can all form new data pairs, diversifying the training data of the target model and, in particular, markedly improving the performance of the modality alignment units when their parameters are adjusted.
Similarly, for initial speech, augmented speech can be obtained by speeding up, slowing down, replacing speech frames, deleting speech frames, adding noise, and so on, and the initial and augmented speech can be used to train the target model.
After training is completed and while the target model is used for image retrieval, its performance can be further verified. Specifically, for a retrieval data combination containing data to be retrieved of a single modality, the Cumulative Matching Characteristic (CMC) and mean Average Precision (mAP) can be computed from the first similarities of each modality; for a combination containing data of multiple modalities, CMC and mAP can be computed from the target similarities of the combined modalities, verifying the target model's performance along different dimensions. When CMC and mAP fail to reach preset thresholds, the parameters of the target model can be adjusted again. A sketch of the two metrics follows.
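A minimal sketch of computing CMC and mAP from a query-to-gallery distance matrix; the function name and array layout are illustrative:

```python
import numpy as np

def cmc_and_map(dist, gt):
    """Compute CMC and mAP from a distance matrix.

    dist: [Q, G] distances between queries and gallery (candidate) images
    gt:   [Q, G] boolean matrix, True where the gallery image matches the query
    Returns (cmc, mAP); cmc[k-1] is the Rank-k matching rate.
    """
    order = np.argsort(dist, axis=1)                 # gallery sorted by distance per query
    matches = np.take_along_axis(gt, order, axis=1)
    # CMC: fraction of queries with at least one match within the top k
    cmc = (np.cumsum(matches, axis=1) > 0).mean(axis=0)
    # mAP: mean over queries of the average precision at each correct hit
    aps = []
    for row in matches:
        hits = np.flatnonzero(row)
        if hits.size == 0:
            continue                                 # query with no relevant image
        precision = (np.arange(hits.size) + 1) / (hits + 1)
        aps.append(precision.mean())
    return cmc, (float(np.mean(aps)) if aps else 0.0)
```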
The performance of the target model in the image retrieval method provided by the embodiments of this application is illustrated below with the CUHK-PEDES and RSTP data sets as examples.
Referring to Tables 1 and 2, which give the evaluation results of different image retrieval methods on the CUHK-PEDES data set and the RSTP data set respectively: Rank-1, Rank-5, and Rank-10 are CMC evaluation metrics. Tables 1 and 2 show that the image retrieval method provided by this application achieves higher accuracy than the other image retrieval methods in the related art, even though only global features are used in these results.
Table 1: evaluation results of different image retrieval methods on the CUHK-PEDES data set
Table 2: evaluation results of different image retrieval methods on the RSTP data set
Referring to Tables 3 and 4, which give the evaluation results of different image retrieval methods when retrieving images by text and by image respectively (R1, R5, and R10 abbreviate Rank-1, Rank-5, and Rank-10): the method provided by this application also achieves higher accuracy than the other methods in the related art when text-based retrieval and image-based retrieval are evaluated separately.
Table 3: evaluation results of different image retrieval methods retrieving images by text
Table 4: evaluation results of different image retrieval methods retrieving images by image
In addition, referring to Table 5, which gives the evaluation results of the image retrieval method provided by this application when retrieving by text, by image, and by text combined with image: retrieval by text combined with image is the most accurate. Hence, by fusing the similarities corresponding to data to be retrieved of different modalities, this application retrieves images using combinations of modalities and significantly improves retrieval accuracy.
Table 5: evaluation results of the image retrieval method provided by the embodiments of this application
The overall architecture of the target model in the embodiments of this application is illustrated below with concrete examples.
Referring to Fig. 8, an optional overall architecture of the target model according to an embodiment of this application: this target model is provided with a first normalization layer, an attention layer, a second normalization layer, an image feature extraction unit, a text modality alignment unit, and a text feature extraction unit.
In the training stage of the target model:
The input is multiple data pairs of sample text and sample image; the sample text of one input pair may be "this person wears a pair of glasses, he is wearing a black-gray down jacket and light-colored trousers, he has a pair of light shoes, and he has a dark green backpack", and the input sample image is an image of a person. Intra-class text and image augmentation is then performed: the sample image can be randomly augmented, i.e., one or more of enlarging, shrinking, cropping, flipping, color-gamut transformation, color jitter, etc. are selected at random and applied, and the text components of the sample text are adjusted, giving for example "this man is wearing a black-gray down jacket and light-colored trousers, and he has a dark green backpack", "a man has black hair; he wears a gray shirt, gray trousers and gray canvas shoes, and carries a bag", and "this person is wearing a black-gray down jacket, gray trousers and gray canvas shoes, and he has a dark green backpack". The augmented images and the texts with adjusted components can form new data pairs, expanding the training data of the target model. The data pairs are then encoded into image embedding vectors and text embedding vectors, which are input into the target model and pass through the normalization of the first normalization layer, the attention feature extraction of the attention layer, and the normalization of the second normalization layer, producing normalized image vectors and normalized text vectors. According to the input type, the image feature extraction unit forward-maps the normalized image vector into the sample image's own image feature; the text feature extraction unit forward-maps the normalized text vector into the sample text's text feature; and the text modality alignment unit forward-maps the normalized image vector into the image feature of the sample image aligned to the sample text. The first loss value is then computed from the sample text's text feature and the aligned image feature; the second and third loss values are computed from the sample image's own image feature; and the parameters of the target model are adjusted according to the first, second, and third loss values.
In the inference stage of the target model:
The input is a data pair <v_q, t_q> of an image to be retrieved and a text to be retrieved, together with the candidate images <v_g> in the candidate image data set. The image feature extraction unit of the target model extracts the feature of the image to be retrieved v_q and the feature of each candidate image v_g; the text feature extraction unit of the target model extracts the feature of the text to be retrieved t_q; and the text modality alignment unit extracts the feature of the candidate image v_g aligned to the text to be retrieved t_q.
The Euclidean distance matrix D_t2i between the text to be retrieved t_q and the candidate images v_g is computed; the result image set of t_q is determined from the candidate image data set according to D_t2i, and the corresponding CMC_t2i and mAP_t2i are computed from D_t2i.
The Euclidean distance matrix D_i2i between the image to be retrieved v_q and the candidate images v_g is computed; the result image set of v_q is determined from the candidate image data set according to D_i2i, and the corresponding CMC_i2i and mAP_i2i are computed from D_i2i.
The fused Euclidean distance matrix between the data pair <v_q, t_q> and the candidate images v_g is computed as D_ti2i = λ·D_i2i + (1−λ)·D_t2i; the result image set of the pair <v_q, t_q> is determined from the candidate image data set according to D_ti2i, and the corresponding CMC_ti2i and mAP_ti2i are computed from D_ti2i.
Finally, the result image set of the text t_q, the result image set of the image v_q, and the result image set of the pair <v_q, t_q> are merged to obtain the image retrieval result, as sketched below.
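A minimal sketch of this inference stage for a single query pair; the function name, feature layout, and top-k cut-off are illustrative assumptions:

```python
import numpy as np

def retrieve(f_t, f_v, g_v, g_t_aligned, lam=0.5, top_k=10):
    """Sketch of the Fig. 8 inference stage.

    f_t: [D] query-text feature            f_v: [D] query-image feature
    g_v: [G, D] candidate-image features from the image feature extraction unit
    g_t_aligned: [G, D] candidate-image features aligned to the text
    Returns the merged, de-duplicated image retrieval result (gallery indices).
    """
    d_t2i = np.linalg.norm(g_t_aligned - f_t, axis=1)   # text -> image distances
    d_i2i = np.linalg.norm(g_v - f_v, axis=1)           # image -> image distances
    d_ti2i = lam * d_i2i + (1 - lam) * d_t2i            # fused distance D_ti2i
    result_sets = [np.argsort(d)[:top_k] for d in (d_t2i, d_i2i, d_ti2i)]
    merged, seen = [], set()                             # merge with de-duplication
    for idx in np.concatenate(result_sets):
        if int(idx) not in seen:
            seen.add(int(idx))
            merged.append(int(idx))
    return merged
```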
In addition, referring to Fig. 9, another optional overall architecture of the target model according to an embodiment of this application: this target model is provided with a first normalization layer, an attention layer, a second normalization layer, an image feature extraction unit, a speech modality alignment unit, and a speech feature extraction unit.
In the training stage of the target model:
The input is multiple data pairs of sample speech and sample image; the input sample image is an image of a person, and the input sample speech describes the person in the sample image. Intra-class speech and image augmentation is then performed: the sample image can be randomly augmented with one or more of enlarging, shrinking, cropping, flipping, color-gamut transformation, color jitter, etc., and the sample speech can likewise be randomly augmented with one or more of speeding up, slowing down, speech-frame replacement, speech-frame deletion, noise addition, etc.; the augmented images and speech can form new data pairs, expanding the training data of the target model. The data pairs are then encoded into image embedding vectors and speech embedding vectors, which are input into the target model and pass through the first normalization layer, the attention feature extraction of the attention layer, and the second normalization layer, producing normalized image vectors and normalized speech vectors. According to the input type, the image feature extraction unit forward-maps the normalized image vector into the sample image's own image feature; the speech feature extraction unit forward-maps the normalized speech vector into the sample speech's speech feature; and the speech modality alignment unit forward-maps the normalized image vector into the image feature of the sample image aligned to the sample speech. The first loss value is computed from the sample speech's speech feature and the aligned image feature; the second and third loss values are computed from the sample image's own image feature; and the parameters of the target model are adjusted according to the first, second, and third loss values.
In the inference stage of the target model:
The input is a data pair <v_q, s_q> of an image to be retrieved and speech to be retrieved, together with the candidate images <v_g> in the candidate image data set. The image feature extraction unit of the target model extracts the feature of the image to be retrieved v_q and the feature of each candidate image v_g; the speech feature extraction unit of the target model extracts the feature of the speech to be retrieved s_q; and the speech modality alignment unit extracts the feature of the candidate image v_g aligned to the speech to be retrieved s_q.
The Euclidean distance matrix D_s2i between the speech to be retrieved s_q and the candidate images v_g is computed; the result image set of s_q is determined from the candidate image data set according to D_s2i, and the corresponding CMC_s2i and mAP_s2i are computed from D_s2i.
The Euclidean distance matrix D_i2i between the image to be retrieved v_q and the candidate images v_g is computed; the result image set of v_q is determined according to D_i2i, and the corresponding CMC_i2i and mAP_i2i are computed from D_i2i.
The fused Euclidean distance matrix between the pair <v_q, s_q> and the candidate images v_g is computed as D_si2i = λ·D_i2i + (1−λ)·D_s2i; the result image set of the pair <v_q, s_q> is determined according to D_si2i, and the corresponding CMC_si2i and mAP_si2i are computed from D_si2i.
Finally, the result image set of the speech s_q, the result image set of the image v_q, and the result image set of the pair <v_q, s_q> are merged to obtain the image retrieval result.
In addition, referring to Fig. 10, another optional overall architecture of the target model according to an embodiment of this application: this target model is provided with a first normalization layer, an attention layer, a second normalization layer, a text feature extraction unit, a text modality alignment unit, a speech modality alignment unit, and a speech feature extraction unit.
In the training stage of the target model:
The input is multiple data pairs of sample speech and sample text together with sample images; the input sample text can follow the example described for Fig. 7 and is not repeated here, and the input sample speech describes the person in the sample text. Intra-class speech, text, and image augmentation is then performed, as described in the previous examples. The data pairs and the sample images are then encoded into text embedding vectors, speech embedding vectors, and image embedding vectors, which are input into the target model and pass through the first normalization layer, the attention feature extraction of the attention layer, and the second normalization layer, producing normalized text, speech, and image vectors. According to the input type, the text feature extraction unit forward-maps the normalized text vector into the sample text's text feature; the speech feature extraction unit forward-maps the normalized speech vector into the sample speech's speech feature; the speech modality alignment unit forward-maps the normalized image vector into the image feature of the sample image aligned to the sample speech; and the text modality alignment unit forward-maps the normalized image vector into the image feature of the sample image aligned to the sample text. The first loss value is then computed from the sample speech's speech feature and the aligned-to-speech image feature, together with the sample text's text feature and the aligned-to-text image feature, and the parameters of the target model are adjusted according to the first loss value.
In the inference stage of the target model:
The input is a data pair <t_q, s_q> of text to be retrieved and speech to be retrieved, together with the candidate images <v_g> in the candidate image data set. The text feature extraction unit of the target model extracts the feature of the text to be retrieved t_q; the speech feature extraction unit of the target model extracts the feature of the speech to be retrieved s_q; the speech modality alignment unit extracts the feature of the candidate image v_g aligned to the speech s_q; and the text modality alignment unit extracts the feature of the candidate image v_g aligned to the text t_q.
The Euclidean distance matrix D_s2i between s_q and the candidate images v_g is computed; the result image set of s_q is determined from the candidate image data set according to D_s2i, and the corresponding CMC_s2i and mAP_s2i are computed from D_s2i.
The Euclidean distance matrix D_t2i between t_q and the candidate images v_g is computed; the result image set of t_q is determined according to D_t2i, and the corresponding CMC_t2i and mAP_t2i are computed from D_t2i.
The fused Euclidean distance matrix between the pair <t_q, s_q> and the candidate images v_g is computed as D_st2i = λ·D_s2i + (1−λ)·D_t2i; the result image set of the pair <t_q, s_q> is determined according to D_st2i, and the corresponding CMC_st2i and mAP_st2i are computed from D_st2i.
Finally, the result image set of the speech s_q, the result image set of the text t_q, and the result image set of the pair <t_q, s_q> are merged to obtain the image retrieval result.
In addition, referring to Fig. 11, another optional overall architecture of the target model according to an embodiment of this application: this target model is provided with a first normalization layer, an attention layer, a second normalization layer, an image feature extraction unit, a text feature extraction unit, a text modality alignment unit, a speech modality alignment unit, and a speech feature extraction unit.
In the training stage of the target model:
The input is multiple data pairs of sample speech, sample image, and sample text; the input sample text can follow the example described for Fig. 7 and is not repeated here, the input sample image is an image of a person, and the input sample speech describes the person in the sample text. Intra-class image, speech, and text augmentation is then performed, as described in the previous examples. The data pairs are then encoded into text, speech, and image embedding vectors, which are input into the target model and pass through the first normalization layer, the attention feature extraction of the attention layer, and the second normalization layer, producing normalized text, speech, and image vectors. According to the input type, the image feature extraction unit forward-maps the normalized image vector into the sample image's own image feature; the text feature extraction unit forward-maps the normalized text vector into the sample text's text feature; the speech feature extraction unit forward-maps the normalized speech vector into the sample speech's speech feature; the speech modality alignment unit forward-maps the normalized image vector into the image feature of the sample image aligned to the sample speech; and the text modality alignment unit forward-maps the normalized image vector into the image feature of the sample image aligned to the sample text. The first loss value is then computed from the sample speech's speech feature and the aligned-to-speech image feature, together with the sample text's text feature and the aligned-to-text image feature; the second and third loss values are computed from the sample image's own image feature; and the parameters of the target model are adjusted according to the first, second, and third loss values.
In the inference stage of the target model:
The input is a data pair <v_q, t_q, s_q> of an image, text, and speech to be retrieved, together with the candidate images <v_g> in the candidate image data set. The image feature extraction unit of the target model extracts the feature of the image to be retrieved v_q and the feature of each candidate image v_g; the text feature extraction unit extracts the feature of the text to be retrieved t_q; the speech feature extraction unit extracts the feature of the speech to be retrieved s_q; the speech modality alignment unit extracts the feature of the candidate image v_g aligned to the speech s_q; and the text modality alignment unit extracts the feature of the candidate image v_g aligned to the text t_q.
The Euclidean distance matrix D_i2i between v_q and the candidate images v_g is computed; the result image set of v_q is determined from the candidate image data set according to D_i2i, and the corresponding CMC_i2i and mAP_i2i are computed from D_i2i.
The Euclidean distance matrix D_s2i between s_q and the candidate images v_g is computed; the result image set of s_q is determined according to D_s2i, and the corresponding CMC_s2i and mAP_s2i are computed from D_s2i.
The Euclidean distance matrix D_t2i between t_q and the candidate images v_g is computed; the result image set of t_q is determined according to D_t2i, and the corresponding CMC_t2i and mAP_t2i are computed from D_t2i.
The fused Euclidean distance matrix between the pair <v_q, t_q> and the candidate images v_g is computed as D_ti2i = λ·D_i2i + (1−λ)·D_t2i; the result image set of the pair <v_q, t_q> is determined according to D_ti2i, and the corresponding CMC_ti2i and mAP_ti2i are computed from D_ti2i.
The fused Euclidean distance matrix between the pair <v_q, s_q> and the candidate images v_g is computed as D_si2i = λ·D_i2i + (1−λ)·D_s2i; the result image set of the pair <v_q, s_q> is determined according to D_si2i, and the corresponding CMC_si2i and mAP_si2i are computed from D_si2i.
The fused Euclidean distance matrix between the pair <t_q, s_q> and the candidate images v_g is computed as D_st2i = λ·D_s2i + (1−λ)·D_t2i; the result image set of the pair <t_q, s_q> is determined according to D_st2i, and the corresponding CMC_st2i and mAP_st2i are computed from D_st2i.
The fused Euclidean distance matrix between the triple <v_q, t_q, s_q> and the candidate images v_g is computed as D_sti2i = λ_1·D_i2i + λ_2·D_t2i + (1−λ_1−λ_2)·D_s2i; the result image set of the triple <v_q, t_q, s_q> is determined according to D_sti2i, and the corresponding CMC_sti2i and mAP_sti2i are computed from D_sti2i.
Finally, the result image sets of the image v_q, the speech s_q, and the text t_q, of the pairs <v_q, t_q>, <v_q, s_q>, and <t_q, s_q>, and of the triple <v_q, t_q, s_q> are merged to obtain the image retrieval result.
Here, λ, λ_1, and λ_2 above denote weight values.
Two practical examples of application scenarios for the image retrieval method provided by the embodiments of this application are described below.
Scenario 1
The image retrieval method provided by the embodiments of this application can be applied to a search engine. For example, referring to Fig. 12, a flow diagram of image retrieval with a search engine according to an embodiment of this application: a terminal displays a search-engine search interface 1201, which contains a first text input box 1202 for entering the text to be retrieved and a first image input control 1203 for entering the image to be retrieved. The terminal sends the text entered in the first text input box 1202 and the image entered via the first image input control 1203 to a server; based on the text and the image, the server uses the aforementioned image retrieval method to determine the image retrieval result from a preset image database and sends it to the terminal, which displays it in the search-engine search interface 1201.
Scenario 2
The image retrieval method provided by the embodiments of this application can be applied to a photo application. For example, referring to Fig. 13, a flow diagram of image retrieval in a photo application according to an embodiment of this application: a terminal displays a photo search interface 1301 of the photo application, which contains a second text input box 1302 for entering the text to be retrieved and a second image input control 1303 for entering the image to be retrieved. The terminal obtains the text entered in the second text input box 1302 and the image entered via the second image input control 1303 and, based on them, uses the aforementioned image retrieval method to determine the image retrieval result from the terminal's own photo database, displaying it in the photo search interface 1301.
It can be understood that, although the steps in each of the above flow diagrams are shown in the order indicated by the arrows, they are not necessarily executed in that order. Unless explicitly stated in these embodiments, there is no strict order restriction on their execution, and they may be executed in other orders. Moreover, at least some of the steps may comprise multiple sub-steps or stages; these need not be completed at the same time but may be executed at different times, and their execution order need not be sequential, as they may be executed in turn or alternately with other steps or with at least part of the sub-steps or stages of other steps.
Referring to Fig. 14, an optional structural diagram of an image retrieval apparatus according to an embodiment of this application.
In some embodiments, the image retrieval apparatus 1400 is applicable to the aforementioned electronic device.
In some embodiments, the image retrieval apparatus 1400 comprises:
a data acquisition module 1401, configured to acquire a candidate image set and data to be retrieved of multiple modalities, wherein the candidate image set comprises a plurality of candidate images;
a model processing module 1402, configured to perform feature extraction on the data to be retrieved based on a target model to obtain a first feature of the data to be retrieved, and perform feature extraction on the candidate image multiple times based on the target model to obtain second features of the candidate image aligned to the data to be retrieved of each modality;
a retrieval module 1403, configured to determine, according to the first feature and the second features, first similarities between the candidate image and the data to be retrieved of each modality, and determine, from the candidate image set according to the first similarities, result image sets corresponding to a plurality of retrieval data combinations, wherein each retrieval data combination comprises data to be retrieved of at least one modality;
a merging module 1404, configured to merge the plurality of result image sets to obtain an image retrieval result.
Further, the model processing module 1402 is specifically configured to:
convert the data to be retrieved into a retrieval embedding vector, wherein data to be retrieved of different modalities is converted into retrieval embedding vectors of the same vector format;
input the retrieval embedding vector into the target model, and perform feature mapping on the data to be retrieved based on the target model to obtain the first feature of the data to be retrieved.
Further, the model processing module 1402 is specifically configured to:
split the data to be retrieved into a plurality of retrieval data blocks, and perform feature mapping on the plurality of retrieval data blocks to obtain a first embedding vector;
determine position information of each retrieval data block within the data to be retrieved, and perform feature mapping on the plurality of pieces of position information to obtain a second embedding vector;
perform feature mapping on the modality corresponding to the data to be retrieved to obtain a third embedding vector;
concatenate the first, second, and third embedding vectors to obtain the retrieval embedding vector.
Further, the model processing module 1402 is specifically configured to:
normalize the retrieval embedding vector to obtain a first normalized vector;
perform attention feature extraction on the first normalized vector to obtain an attention vector;
perform feature mapping on the attention vector based on the target model to obtain the first feature of the data to be retrieved.
Further, the model processing module 1402 is specifically configured to:
concatenate the attention vector with the retrieval embedding vector to obtain a concatenated vector;
normalize the concatenated vector to obtain a second normalized vector;
perform forward feature mapping on the second normalized vector based on the target model to obtain a mapped vector;
concatenate the mapped vector with the concatenated vector to obtain the first feature of the data to be retrieved.
Further, the data to be retrieved of multiple modalities comprises text to be retrieved and an image to be retrieved, and the target model comprises a text modality alignment unit for aligning the candidate image to the text to be retrieved and an image feature extraction unit for performing feature extraction on the image to be retrieved; the model processing module 1402 is specifically configured to:
perform feature extraction on the candidate image based on the text modality alignment unit to obtain a second feature of the candidate image aligned to the text to be retrieved;
perform feature extraction on the candidate image based on the image feature extraction unit to obtain an image feature of the candidate image, and use the image feature as the second feature of the candidate image aligned to the image to be retrieved.
Further, the plurality of retrieval data combinations comprise a first data combination and a second data combination, the first data combination comprising data to be retrieved of one modality and the second data combination comprising data to be retrieved of multiple modalities; the retrieval module 1403 is specifically configured to:
determine, from the candidate image set, the result image set corresponding to the first data combination according to the first similarity corresponding to the data to be retrieved of the one modality;
fuse the first similarities corresponding to the data to be retrieved of the multiple modalities to obtain a target similarity, and determine, from the candidate image set according to the target similarity, the result image set corresponding to the second data combination.
Further, the image retrieval apparatus further comprises a training module 1405, and the training module 1405 is configured to:
acquire sample images and sample retrieval data of at least one modality other than the image modality, and acquire similarity labels between the sample images and the sample retrieval data;
perform feature extraction on the sample retrieval data based on the target model to obtain a third feature of the sample retrieval data, and perform feature extraction on the sample image multiple times based on the target model to obtain fourth features of the sample image aligned to the sample retrieval data of each modality;
determine, according to the third feature and the fourth features, a second similarity between the sample image and the sample retrieval data, and determine a first loss value according to the second similarity and the corresponding similarity label;
adjust the parameters of the target model according to the first loss value.
Further, the training module 1405 is specifically configured to:
acquire a class label of the sample image;
perform feature extraction on the sample image based on the target model to obtain a fifth feature of the sample image in the image modality;
classify the sample image according to the fifth feature to obtain a sample class, and determine a second loss value according to the sample class and the class label;
adjust the parameters of the target model according to the first loss value and the second loss value.
Further, the training module 1405 is specifically configured to:
acquire a first reference image of the same class as the sample image and a second reference image of a different class from the sample image;
perform feature extraction on the sample image, the first reference image, and the second reference image based on the target model to obtain the fifth feature of the sample image in the image modality, a sixth feature of the first reference image, and a seventh feature of the second reference image;
determine a third similarity between the fifth feature and the sixth feature and a fourth similarity between the fifth feature and the seventh feature, and determine a third loss value according to the third similarity and the fourth similarity;
adjust the parameters of the target model according to the first loss value and the third loss value.
Further, the training module 1405 is specifically configured to:
acquire an initial image and an initial text;
perform augmentation on the initial image to obtain an augmented image;
delete a text component of arbitrary length from the initial text to obtain an augmented text, or adjust a text component of the initial text using a text component of a reference text to obtain an augmented text, wherein the reference text has the same class as the initial text;
use the initial image and the augmented image as sample images, and the initial text and the augmented text as sample texts.
The image retrieval apparatus 1400 above is based on the same inventive concept as the aforementioned image retrieval method. The apparatus extracts the first feature of the data to be retrieved with the target model and then performs multiple feature extractions on the candidate images with the same target model to obtain the second features aligned to the data to be retrieved of each modality. This both uses multi-modality data to improve image retrieval accuracy and unifies the feature framework of the data to be retrieved and the candidate images, improving the consistency of the feature space between the first and second features. Using a single target model to determine the first and second features reduces the model's parameter count and the memory overhead of deploying it; moreover, only one target model needs to be trained, improving training efficiency. On this basis, the first similarities between the candidate images and the data of each modality are determined from the first and second features, the result image sets of the retrieval data combinations are determined from the candidate image set according to the first similarities, and the result image sets are merged into the image retrieval result. No one-to-one retrieval between the data to be retrieved and the candidate images is required, which effectively improves retrieval efficiency; and since the result is obtained from the result image sets of multiple retrieval data combinations, retrieval accuracy is effectively improved as well.
The electronic device provided by the embodiments of this application for executing the above image retrieval method may be a terminal. Referring to Fig. 15, a partial structural block diagram of the terminal according to an embodiment of this application: the terminal comprises components such as a radio frequency (RF) circuit 1510, a memory 1520, an input unit 1530, a display unit 1540, a sensor 1550, an audio circuit 1560, a wireless fidelity (WiFi) module 1570, a processor 1580, and a power supply 1590. A person skilled in the art will understand that the terminal structure shown in Fig. 15 does not limit the terminal, which may include more or fewer components than shown, combine certain components, or arrange the components differently.
The RF circuit 1510 can receive and send signals during information transmission or calls; in particular, it passes downlink information received from a base station to the processor 1580 for processing and sends uplink data to the base station.
The memory 1520 can store software programs and modules; the processor 1580 executes the terminal's various functional applications and data processing by running the software programs and modules stored in the memory 1520.
The input unit 1530 can receive entered digits or characters and generate key-signal input related to the terminal's settings and function control. Specifically, the input unit 1530 may include a touch panel 1531 and other input means 1532.
The display unit 1540 can display entered or provided information and the terminal's various menus, and may include a display panel 1541.
The audio circuit 1560, loudspeaker 1561, and microphone 1562 can provide an audio interface.
In this embodiment, the processor 1580 of the terminal can execute the image retrieval method of the foregoing embodiments.
The electronic device provided by the embodiments of this application for executing the above image retrieval method may also be a server. Referring to Fig. 16, a partial structural block diagram of the server according to an embodiment of this application: the server 1600 may vary considerably depending on configuration or performance and may include one or more central processing units (CPU) 1622 (for example, one or more processors), a memory 1632, and one or more storage media 1630 (for example, one or more mass-storage devices) storing application programs 1642 or data 1644. The memory 1632 and the storage media 1630 may be transient or persistent storage. The programs stored in a storage medium 1630 may include one or more modules (not shown), each of which may comprise a series of instruction operations on the server 1600. Further, the CPU 1622 may be configured to communicate with the storage medium 1630 and execute, on the server 1600, the series of instruction operations in the storage medium 1630.
The server 1600 may also include one or more power supplies 1626, one or more wired or wireless network interfaces 1650, one or more input/output interfaces 1658, and/or one or more operating systems 1641, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, and so on.
The processor in the server 1600 can be used to execute the image retrieval method.
An embodiment of this application also provides a computer-readable storage medium storing program code, the program code being used to execute the image retrieval method of each of the foregoing embodiments.
An embodiment of this application also provides a computer program product comprising a computer program stored in a computer-readable storage medium. A processor of a computer device reads the computer program from the computer-readable storage medium and executes it, causing the computer device to perform the image retrieval method described above.
The terms "first", "second", "third", "fourth", etc. (if any) in the specification of this application and the drawings above are used to distinguish similar objects and not necessarily to describe a particular order or sequence. It should be understood that data used in this way are interchangeable where appropriate, so that the embodiments of this application described here can, for example, be implemented in orders other than those illustrated or described here. Furthermore, the terms "comprising" and "having" and any variants thereof are intended to cover non-exclusive inclusion; for example, a process, method, system, product, or apparatus comprising a series of steps or units is not necessarily limited to the steps or units expressly listed, but may include other steps or units not expressly listed or inherent to the process, method, product, or apparatus.
It should be understood that in this application, "at least one (item)" means one or more and "a plurality of" means two or more. "And/or" describes an association relationship between associated objects and indicates that three relationships may exist; for example, "A and/or B" may mean: only A exists, only B exists, or both A and B exist, where A and B may be singular or plural. The character "/" generally indicates an "or" relationship between the associated objects before and after it. "At least one of the following" or similar expressions refer to any combination of those items, including a single item or any combination of plural items. For example, at least one of a, b, or c may denote: a, b, c, "a and b", "a and c", "b and c", or "a and b and c", where each of a, b, and c may be single or plural.
It should also be understood that, in the description of the embodiments of this application, "a plurality of" (or "multiple") means two or more; "greater than", "less than", "exceeding", etc. are understood to exclude the stated number, while "above", "below", "within", etc. are understood to include it.
In the several embodiments provided in this application, it should be understood that the disclosed systems, apparatuses, and methods can be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative: the division into units is only a division by logical function, and other divisions are possible in actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the mutual couplings, direct couplings, or communication connections shown or discussed may be indirect couplings or communication connections through interfaces, apparatuses, or units, and may be electrical, mechanical, or in other forms.
Units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units, i.e., they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solutions of these embodiments.
In addition, the functional units in the embodiments of this application may be integrated into one processing unit, each unit may exist physically alone, or two or more units may be integrated into one unit. The integrated unit may be implemented in hardware or as a software functional unit.
If the integrated unit is implemented as a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of this application in essence, the part of it contributing to the prior art, or all or part of the solution may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to execute all or some of the steps of the methods of the embodiments of this application. The aforementioned storage medium includes media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.
It should also be understood that the various implementations provided by the embodiments of this application can be combined arbitrarily to achieve different technical effects.
The preferred implementations of this application have been described in detail above, but this application is not limited to the above implementations. Those skilled in the art can also make various equivalent modifications or substitutions without departing from the spirit of this application, and these equivalent modifications or substitutions are all included within the scope defined by the claims of this application.

Claims (15)

  1. An image retrieval method, comprising:
    acquiring, by an electronic device, a candidate image set and data to be retrieved of multiple modalities, wherein the candidate image set comprises a plurality of candidate images;
    performing, by the electronic device, feature extraction on the data to be retrieved based on a target model to obtain a first feature of the data to be retrieved, and performing, by the electronic device, feature extraction on the candidate image multiple times based on the target model to obtain second features of the candidate image aligned to the data to be retrieved of each modality;
    determining, by the electronic device according to the first feature and the second features, first similarities between the candidate image and the data to be retrieved of each modality, and determining, by the electronic device from the candidate image set according to the first similarities, result image sets corresponding to a plurality of retrieval data combinations, wherein each retrieval data combination comprises data to be retrieved of at least one modality;
    merging, by the electronic device, the plurality of result image sets to obtain an image retrieval result.
  2. The image retrieval method according to claim 1, wherein performing, by the electronic device, feature extraction on the data to be retrieved based on a target model to obtain a first feature of the data to be retrieved comprises:
    converting, by the electronic device, the data to be retrieved into a retrieval embedding vector, wherein data to be retrieved of different modalities is converted into retrieval embedding vectors of the same vector format;
    inputting, by the electronic device, the retrieval embedding vector into the target model, and performing feature mapping on the data to be retrieved based on the target model to obtain the first feature of the data to be retrieved.
  3. The image retrieval method according to claim 1, wherein the data to be retrieved of multiple modalities comprises text to be retrieved and an image to be retrieved, the target model comprises a text modality alignment unit for aligning the candidate image to the text to be retrieved and an image feature extraction unit for performing feature extraction on the image to be retrieved, and performing, by the electronic device, feature extraction on the candidate image multiple times based on the target model to obtain second features of the candidate image aligned to the data to be retrieved of each modality comprises:
    performing, by the electronic device, feature extraction on the candidate image based on the text modality alignment unit to obtain a second feature of the candidate image aligned to the text to be retrieved;
    performing, by the electronic device, feature extraction on the candidate image based on the image feature extraction unit to obtain an image feature of the candidate image, and using, by the electronic device, the image feature as the second feature of the candidate image aligned to the image to be retrieved.
  4. The image retrieval method according to claim 1, wherein the plurality of retrieval data combinations comprise a first data combination and a second data combination, the first data combination comprises data to be retrieved of one modality, the second data combination comprises data to be retrieved of multiple modalities, and determining, by the electronic device from the candidate image set according to the first similarities, result image sets corresponding to a plurality of retrieval data combinations comprises:
    determining, by the electronic device from the candidate image set, the result image set corresponding to the first data combination according to the first similarity corresponding to the data to be retrieved of the one modality;
    fusing, by the electronic device, the first similarities corresponding to the data to be retrieved of the multiple modalities to obtain a target similarity, and determining, from the candidate image set according to the target similarity, the result image set corresponding to the second data combination.
  5. The image retrieval method according to claim 2, wherein converting, by the electronic device, the data to be retrieved into a retrieval embedding vector comprises:
    splitting, by the electronic device, the data to be retrieved to obtain a plurality of retrieval data blocks, and performing feature mapping on the plurality of retrieval data blocks to obtain a first embedding vector;
    determining, by the electronic device, position information of each retrieval data block within the data to be retrieved, and performing feature mapping on the plurality of pieces of position information to obtain a second embedding vector;
    performing, by the electronic device, feature mapping on the modality corresponding to the data to be retrieved to obtain a third embedding vector;
    concatenating, by the electronic device, the first embedding vector, the second embedding vector, and the third embedding vector to obtain the retrieval embedding vector.
  6. The image retrieval method according to claim 2, wherein performing, by the electronic device, feature mapping on the data to be retrieved based on the target model to obtain the first feature of the data to be retrieved comprises:
    normalizing, by the electronic device, the retrieval embedding vector to obtain a first normalized vector;
    performing, by the electronic device, attention feature extraction on the first normalized vector to obtain an attention vector;
    performing, by the electronic device, feature mapping on the attention vector based on the target model to obtain the first feature of the data to be retrieved.
  7. The image retrieval method according to claim 6, wherein performing, by the electronic device, feature mapping on the attention vector based on the target model to obtain the first feature of the data to be retrieved comprises:
    concatenating, by the electronic device, the attention vector with the retrieval embedding vector to obtain a concatenated vector;
    normalizing, by the electronic device, the concatenated vector to obtain a second normalized vector;
    performing, by the electronic device, forward feature mapping on the second normalized vector based on the target model to obtain a mapped vector;
    concatenating, by the electronic device, the mapped vector with the concatenated vector to obtain the first feature of the data to be retrieved.
  8. The image retrieval method according to any one of claims 1 to 7, wherein before the electronic device acquires the candidate image set and the data to be retrieved of multiple modalities, the image retrieval method further comprises:
    acquiring, by the electronic device, sample images and sample retrieval data of at least one modality other than the image modality, and acquiring similarity labels between the sample images and the sample retrieval data;
    performing, by the electronic device, feature extraction on the sample retrieval data based on the target model to obtain a third feature of the sample retrieval data, and performing feature extraction on the sample image multiple times based on the target model to obtain fourth features of the sample image aligned to the sample retrieval data of each modality;
    determining, by the electronic device according to the third feature and the fourth features, a second similarity between the sample image and the sample retrieval data, and determining a first loss value according to the second similarity and the corresponding similarity label;
    adjusting, by the electronic device, parameters of the target model according to the first loss value.
  9. The image retrieval method according to claim 8, wherein adjusting, by the electronic device, the parameters of the target model according to the first loss value comprises:
    acquiring, by the electronic device, a class label of the sample image;
    performing, by the electronic device, feature extraction on the sample image based on the target model to obtain a fifth feature of the sample image in the image modality;
    classifying, by the electronic device, the sample image according to the fifth feature to obtain a sample class, and determining, by the electronic device, a second loss value according to the sample class and the class label;
    adjusting, by the electronic device, the parameters of the target model according to the first loss value and the second loss value.
  10. The image retrieval method according to claim 8, wherein adjusting, by the electronic device, the parameters of the target model according to the first loss value comprises:
    acquiring, by the electronic device, a first reference image of the same class as the sample image and a second reference image of a different class from the sample image;
    performing, by the electronic device, feature extraction on the sample image, the first reference image, and the second reference image based on the target model to obtain a fifth feature of the sample image in the image modality, a sixth feature of the first reference image, and a seventh feature of the second reference image;
    determining, by the electronic device, a third similarity between the fifth feature and the sixth feature and a fourth similarity between the fifth feature and the seventh feature, and determining, by the electronic device, a third loss value according to the third similarity and the fourth similarity;
    adjusting, by the electronic device, the parameters of the target model according to the first loss value and the third loss value.
  11. The image retrieval method according to claim 8, wherein the sample retrieval data comprises sample text, and acquiring, by the electronic device, sample images and sample retrieval data of at least one modality other than the image modality comprises:
    acquiring, by the electronic device, an initial image and an initial text;
    performing, by the electronic device, augmentation on the initial image to obtain an augmented image;
    deleting, by the electronic device, a text component of arbitrary length from the initial text to obtain an augmented text, or adjusting a text component of the initial text using a text component of a reference text to obtain an augmented text, wherein the reference text has the same class as the initial text;
    using, by the electronic device, the initial image and the augmented image as sample images, and the initial text and the augmented text as sample texts.
  12. An image retrieval apparatus, comprising:
    a data acquisition module, configured to acquire a candidate image set and data to be retrieved of multiple modalities, wherein the candidate image set comprises a plurality of candidate images;
    a model processing module, configured to perform feature extraction on the data to be retrieved based on a target model to obtain a first feature of the data to be retrieved, and perform feature extraction on the candidate image multiple times based on the target model to obtain second features of the candidate image aligned to the data to be retrieved of each modality;
    a retrieval module, configured to determine, according to the first feature and the second features, first similarities between the candidate image and the data to be retrieved of each modality, and determine, from the candidate image set according to the first similarities, result image sets corresponding to a plurality of retrieval data combinations, wherein each retrieval data combination comprises data to be retrieved of at least one modality;
    a merging module, configured to merge the plurality of result image sets to obtain an image retrieval result.
  13. An electronic device, comprising a memory and a processor, the memory storing a computer program, wherein the processor implements the image retrieval method according to any one of claims 1 to 11 when executing the computer program.
  14. A computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the image retrieval method according to any one of claims 1 to 11.
  15. A computer program product, comprising a computer program, wherein the computer program, when executed by a processor, implements the image retrieval method according to any one of claims 1 to 11.
PCT/CN2023/107962 2022-09-07 2023-07-18 Image retrieval method and apparatus, electronic device, and storage medium WO2024051350A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US18/421,239 US20240168992A1 (en) 2022-09-07 2024-01-24 Image retrieval method and apparatus, electronic device, and storage medium

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202211089620.8 2022-09-07
CN202211089620.8A CN116992069A (zh) 2022-09-07 Image retrieval method and apparatus, electronic device, and storage medium

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US18/421,239 Continuation US20240168992A1 (en) 2022-09-07 2024-01-24 Image retrieval method and apparatus, electronic device, and storage medium

Publications (1)

Publication Number Publication Date
WO2024051350A1 (zh)

Family

ID=88520137

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/107962 WO2024051350A1 (zh) 2022-09-07 2023-07-18 图像检索方法、装置、电子设备及存储介质

Country Status (3)

Country Link
US (1) US20240168992A1 (zh)
CN (1) CN116992069A (zh)
WO (1) WO2024051350A1 (zh)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112966127A (zh) * 2021-04-07 2021-06-15 North Minzu University — Cross-modal retrieval method based on multi-layer semantic alignment
CN113157739A (zh) * 2021-04-23 2021-07-23 Ping An Technology (Shenzhen) Co., Ltd. — Cross-modal retrieval method and apparatus, electronic device, and storage medium
CN114780777A (zh) * 2022-04-06 2022-07-22 Shanghai Advanced Research Institute, Chinese Academy of Sciences — Semantic-enhancement-based cross-modal retrieval method and apparatus, storage medium, and terminal

Also Published As

Publication number Publication date
US20240168992A1 (en) 2024-05-23
CN116992069A (zh) 2023-11-03

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application
Ref document number: 23862051
Country of ref document: EP
Kind code of ref document: A1