WO2024051730A1 - Cross-modal retrieval method, apparatus, device, storage medium and computer program - Google Patents

Cross-modal retrieval method, apparatus, device, storage medium and computer program

Info

Publication number
WO2024051730A1
WO2024051730A1 PCT/CN2023/117203 CN2023117203W WO2024051730A1 WO 2024051730 A1 WO2024051730 A1 WO 2024051730A1 CN 2023117203 W CN2023117203 W CN 2023117203W WO 2024051730 A1 WO2024051730 A1 WO 2024051730A1
Authority
WO
WIPO (PCT)
Prior art keywords
visual data
visual
text
retrieved
retrieval
Prior art date
Application number
PCT/CN2023/117203
Other languages
English (en)
French (fr)
Inventor
邓旸旸
黄泽毅
徐昀
童川
Original Assignee
华为技术有限公司 (Huawei Technologies Co., Ltd.)
Priority date
Filing date
Publication date
Application filed by 华为技术有限公司 (Huawei Technologies Co., Ltd.)
Publication of WO2024051730A1


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/332Query formulation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/40Information retrieval; Database structures therefor; File system structures therefor of multimedia data, e.g. slideshows comprising image and additional audio data
    • G06F16/48Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/483Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually

Definitions

  • This application relates to the field of information retrieval, and in particular to a cross-modal retrieval method, device, equipment, storage medium and computer program.
  • Cross-modal retrieval is a form of retrieval that uses data in one modality to retrieve data in another modality. For example, users search for images or videos by entering text.
  • However, cross-modal retrieval still faces great challenges. Returning the images or videos the user wants without restricting the content the user can input, thereby realizing cross-modal retrieval of open content and satisfying the actual user experience, is currently a very important issue. Therefore, a cross-modal retrieval method is urgently needed.
  • This application provides a cross-modal retrieval method, device, equipment, storage medium and computer program, which can realize cross-modal retrieval of open content in multiple scenarios.
  • the technical solutions are as follows:
  • a cross-modal retrieval method includes: extracting text tags and text features of the retrieved text; and determining the retrieved visual data based on the text tags and visual tags of the retrieved visual data. Whether there is at least one first visual data whose visual label matches the text label, the retrieved visual data includes images and/or videos; based on the text features and the visual features of the retrieved visual data, determine the Whether there is at least one second visual data whose visual features match the text features in the retrieved visual data; and determining a retrieval result based on the at least one first visual data and the at least one second visual data.
  • the at least one first visual data and the at least one second visual data exist in the retrieved visual data, based on the at least one first visual data and the at least one second visual data Confirm search results.
  • In this application, the first visual data determined by the visual tags of the retrieved visual data and the text tag of the retrieval text is relatively accurate; that is, the retrieval scope can be accurately controlled through tag matching. Since the retrieval text is semantically open natural language description information, the second visual data determined by the visual features of the retrieved visual data and the text features of the retrieval text has no semantic restrictions, which supports natural semantic retrieval.
  • As a result, the retrieval is more flexible, the retrieval scope is wider, and fine-grained retrieval text such as adjectives can be recognized. In this way, when both the first visual data and the second visual data exist in the retrieved visual data, fusing the first visual data and the second visual data can simultaneously improve the cross-modal retrieval accuracy and retrieval breadth.
  • the cross-modal retrieval method provided by this application can be applied to network-side scenarios and device-side scenarios.
  • Retrieval text is obtained in different ways depending on the application scenario.
  • In the network-side scenario, the user terminal provides a search page for the user to enter the retrieval text in the search box. The user terminal then sends the retrieval text entered in the search box to the server, and the server extracts the text tags and text features of the retrieval text.
  • In the device-side scenario, the user terminal provides a search page for the user to enter the retrieval text in the search box, and the user terminal directly extracts the text tags and text features of the retrieval text entered in the search box.
  • The text tag of the retrieval text is matched against the visual tags of the retrieved visual data to determine whether the retrieved visual data contains visual data whose visual tag is the same as, or a synonym of, the text tag. If such visual data exists, it is determined that at least one first visual data exists in the retrieved visual data, and the at least one first visual data is the visual data whose visual tag is the same as, or a synonym of, the text tag of the retrieval text; if no such visual data exists, it is determined that no first visual data exists in the retrieved visual data.
  • the retrieved visual data may contain at least one first visual data and at least one second visual data at the same time, or there may be only at least one first visual data or only at least one second visual data. If there is at least one first visual data and at least one second visual data in the retrieved visual data, the at least one first visual data and the at least one second visual data can be fused according to the fusion strategy to obtain the retrieval result. If at least one first visual data exists but at least one second visual data does not exist in the retrieved visual data, then at least one first visual data is used as the retrieval result. If at least one second visual data exists but at least one first visual data does not exist in the retrieved visual data, then at least one second visual data is used as the retrieval result.
  • The fusion strategy is preset. According to the emphasis of the application scenario on the quantity and accuracy of the retrieval results, the union or intersection of the at least one first visual data and the at least one second visual data can be selected as the retrieval result. That is, when the application scenario focuses more on the number of retrieval results, the union of the at least one first visual data and the at least one second visual data is used as the retrieval result; when the application scenario focuses more on the accuracy of the retrieval results, the intersection of the at least one first visual data and the at least one second visual data is used as the retrieval result.
  • The first type of label refers to a label that is uncertain when characterizing visual data. When the visual label of the at least one first visual data belongs to the first type of label, the content of the corresponding visual data may not be accurately expressed, so the intersection of the at least one first visual data and the at least one second visual data is used as the retrieval result.
  • The second type of label refers to a label that is deterministic when characterizing visual data. When the visual label of the at least one first visual data belongs to the second type of label, the visual label can accurately express the content of the corresponding visual data, so the union of the at least one first visual data and the at least one second visual data is used as the retrieval result.
  • the retrieval result can be determined directly according to the above method.
  • the retrieval result can also be determined according to the above method after processing the at least one second visual data more accurately.
  • In a possible implementation, model inference results are obtained for the at least one second visual data. The model inference results include similarity results and/or pairwise judgment results: the similarity results indicate the similarity between each of the at least one second visual data and the retrieval text, and the pairwise judgment results indicate whether each of the at least one second visual data can be paired with the retrieval text. The at least one second visual data is then processed based on the model inference results. The model inference results can include only similarity results, only pairwise judgment results, or both, and the at least one second visual data is processed in different ways accordingly, as introduced separately below.
  • In one case, the model inference results include similarity results. In this case, based on the similarity results, second visual data whose similarity to the retrieval text is greater than a first similarity threshold is selected from the at least one second visual data. Since the similarity result is obtained through fine-grained analysis by the neural network model that combines the visual features of the second visual data with the text features of the retrieval text, it characterizes the similarity between the second visual data and the retrieval text more accurately. Filtering the at least one second visual data with this similarity result removes the visual data that is not really similar to the retrieval text and retains the visual data that is, thereby improving the accuracy of the final retrieval results.
  • In another case, the model inference results include pairwise judgment results. In this case, based on the pairwise judgment results, second visual data that can be paired with the retrieval text is selected from the at least one second visual data. Since the pairwise judgment result is also obtained through fine-grained analysis that combines the visual features of the second visual data with the text features of the retrieval text, it characterizes more accurately whether the second visual data and the retrieval text can be paired. Filtering the at least one second visual data with the pairwise judgment results removes the visual data that cannot be paired with the retrieval text and retains the visual data that can, thereby filtering out unreasonable visual data and improving the accuracy of the finally determined retrieval results.
  • In yet another case, the model inference results include both similarity results and pairwise judgment results. In this case, based on the pairwise judgment results, the second visual data that can be paired with the retrieval text is retained and the second visual data that cannot be paired with the retrieval text is deleted; then, based on the similarity results, the remaining second visual data is sorted in descending order of its similarity to the retrieval text, thereby improving the rationality of the ordering of the retrieval results finally fed back to the user.
  • In a second aspect, a cross-modal retrieval method is provided. The method includes: extracting text tags and text features of the retrieval text; obtaining a tag matching result based on the text tag and the visual tags of the retrieved visual data, where the retrieved visual data includes images and/or videos; obtaining a feature matching result based on the text features and the visual features of the retrieved visual data; and obtaining the retrieval result based on the tag matching result and the feature matching result.
  • If the tag matching result includes at least one first visual data and the feature matching result includes at least one second visual data, obtaining the retrieval result based on the tag matching result and the feature matching result includes: using the union or intersection of the at least one first visual data included in the tag matching result and the at least one second visual data included in the feature matching result as the retrieval result.
  • If the tag matching result includes at least one first visual data and the feature matching result indicates that there is no matching data, obtaining the retrieval result based on the tag matching result and the feature matching result includes: using the at least one first visual data included in the tag matching result as the retrieval result.
  • If the tag matching result indicates that there is no matching data and the feature matching result includes at least one second visual data, obtaining the retrieval result based on the tag matching result and the feature matching result includes: using the at least one second visual data included in the feature matching result as the retrieval result.
  • If the tag matching result includes at least one first visual data and the visual tags of part or all of the visual data in the at least one first visual data include the text tag, the visual data in the at least one first visual data whose visual tag includes the text tag is used as the retrieval result.
  • If both the tag matching result and the feature matching result indicate that there is no matching data, the retrieval result indicates that there is no matching data.
  • Using the union or intersection of the at least one first visual data included in the tag matching result and the at least one second visual data included in the feature matching result as the retrieval result includes: if the visual tag of the at least one first visual data belongs to a preset first type of label, using the intersection of the at least one first visual data and the at least one second visual data as the retrieval result; and if the visual tag of the at least one first visual data belongs to a preset second type of label, using the union of the at least one first visual data and the at least one second visual data as the retrieval result.
  • In a possible implementation, obtaining the feature matching result based on the text features and the visual features of the retrieved visual data includes: performing feature matching between the text features and the visual features of the retrieved visual data to obtain a first feature matching result, where the first feature matching result includes at least one third visual data; and inputting the text features and the first feature matching result into a preset model to obtain the feature matching result, where the feature matching result includes part or all of the third visual data in the first feature matching result, and the third visual data included in the feature matching result is sorted by similarity to the text features.
  • In a possible implementation, the method further includes: receiving the retrieval text input by the user.
  • In a possible implementation, the retrieval result includes at least one image and/or video, and the method further includes: displaying the retrieval result.
  • In a third aspect, a cross-modal retrieval device is provided, which has the function of implementing the behavior of the cross-modal retrieval method in the first aspect. The cross-modal retrieval device includes at least one module, and the at least one module is used to implement the cross-modal retrieval method provided in the first aspect or the second aspect.
  • In a fourth aspect, an electronic device is provided. The electronic device includes a processor and a memory, and the memory is used to store a computer program for executing the cross-modal retrieval method provided in the first aspect or the second aspect. The processor is configured to execute the computer program stored in the memory to implement the cross-modal retrieval method described in the first aspect or the second aspect. Optionally, the electronic device may further include a communication bus used to establish a connection between the processor and the memory.
  • In a fifth aspect, a computer-readable storage medium is provided. Instructions are stored in the storage medium, and when the instructions are run on a computer, they cause the computer to execute the steps of the cross-modal retrieval method described in the first aspect or the second aspect.
  • a sixth aspect provides a computer program product containing instructions that, when run on a computer, cause the computer to perform the steps of the cross-modal retrieval method described in the first aspect or the second aspect.
  • A computer program is also provided; when the computer program is run on a computer, it causes the computer to perform the steps of the cross-modal retrieval method described in the first aspect or the second aspect.
  • In a seventh aspect, a chip is provided. The chip includes a processor and an interface circuit. The interface circuit is used to receive instructions and transmit them to the processor, and the processor is used to execute the steps of the cross-modal retrieval method described in the first aspect or the second aspect.
  • An eighth aspect provides a retrieval system, which includes the cross-modal retrieval device and model training device described in the third aspect.
  • Figure 1 is a flow chart of a cross-modal retrieval method provided by an embodiment of the present application
  • Figure 2 is a flow chart of another cross-modal retrieval method provided by an embodiment of the present application.
  • Figure 3 is a flow chart of a visual data fusion method provided by an embodiment of the present application.
  • Figure 4 is a schematic diagram of a cross-modal retrieval provided by an embodiment of the present application.
  • Figure 5 is a flow chart of a method for processing second visual data provided by an embodiment of the present application.
  • Figure 6 is a schematic diagram of processing second visual data provided by an embodiment of the present application.
  • Figure 7 is a schematic diagram of a user interface for cross-modal retrieval provided by an embodiment of the present application.
  • Figure 8 is a flow chart of another cross-modal retrieval method provided by an embodiment of the present application.
  • Figure 9 is a schematic structural diagram of a cross-modal retrieval device provided by an embodiment of the present application.
  • Figure 10 is a schematic structural diagram of an electronic device provided by an embodiment of the present application.
  • Figure 11 is a schematic structural diagram of a user terminal provided by an embodiment of the present application.
  • Figure 12 is a schematic structural diagram of another user terminal provided by an embodiment of the present application.
  • Cross-modal retrieval usually refers to using data of one modality to retrieve data of another modality from multi-modal data, for example, retrieving images or videos with text, retrieving text or videos with images, etc.
  • Cross-modal retrieval is a bridge for interaction between data in different modalities. Its focus is on automatically understanding and correlating key elements between data in different modalities, and achieving relatively accurate cross-matching.
  • NLP: natural language processing
  • CV: computer vision
  • The cross-modal retrieval method provided by the embodiments of this application can be applied to network-side scenarios such as search engines, and can also be applied to device-side scenarios, such as retrieving images or videos in the mobile phone album on a mobile phone. Of course, it is not limited to mobile phone albums; the same applies to other similar scenarios, such as entering text in the chat history of chat software to retrieve images or videos.
  • the cross-modal retrieval method provided by the embodiments of this application can provide users with more open and accurate retrieval results to meet the actual user experience.
  • This method can be applied not only in network-side retrieval scenarios such as search engines, but also in network-side content recommendation scenarios, such as news recommendation and product purchase recommendation, in which statistics on the user's history of browsing news or products are used to determine the retrieval text and then recommend similar content.
  • the execution subject of the embodiments of the present application can be a server or a user terminal.
  • the execution subjects of the embodiments of the present application are collectively referred to as electronic devices.
  • The electronic device can be an independent server, a server cluster or distributed system composed of multiple physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communications, middleware services, domain name services, security services, CDN (Content Delivery Network), big data and artificial intelligence platforms, or a cloud computing service center.
  • The electronic device can also be any electronic product that can conduct human-computer interaction with the user through one or more methods such as a keyboard, touch pad, touch screen, remote control, voice interaction or handwriting device, for example, a personal computer (PC), a personal digital assistant (PDA), a pocket PC (PPC), a tablet computer, a smart car, etc.
  • Figure 1 is a flow chart of a cross-modal retrieval method provided by an embodiment of the present application. The method is applied to electronic devices. The method includes the following steps.
  • Step 101 Extract text tags and text features of the retrieved text.
  • the retrieved text is input into the first text model to obtain text labels and text features of the retrieved text.
  • Alternatively, the retrieval text can be input into a second text model to obtain the text features of the retrieval text and into a third text model to obtain the text label of the retrieval text, or the text label of the retrieval text can be extracted through a text label extraction algorithm. That is, when the first text model is used to extract text tags and text features, the input of the first text model is the retrieval text, and the output of the first text model is the text tag and text features of the retrieval text.
  • the second text model is used to extract text features, the input of the second text model is the retrieval text, and the output of the second text model is the text feature of the retrieval text.
  • the third text model is used to extract text tags, the input of the third text model is the retrieval text, and the output of the third text model is the text tag of the retrieval text.
  • the text label indicates the classification result of the object retrieved through the retrieval text. For example, if the retrieval text is "giant pandas eating bamboo", then it is determined that giant pandas need to be retrieved through this retrieval text. In this way, "giant panda" can be determined as the retrieval text text label. Text features indicate characteristics of the retrieved text.
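  • As an illustration only, the following sketch shows how step 101 might look in code, assuming hypothetical text models (a label model and a feature model, not named in this application) that return a tag string and an embedding vector respectively.

```python
import numpy as np

def extract_text_tag_and_feature(retrieval_text, label_model, feature_model):
    """Step 101 sketch: extract the text tag and the text feature of the retrieval text.

    label_model and feature_model are hypothetical stand-ins for the text
    models described above; any model that maps text to a tag string and an
    embedding vector would fit these roles.
    """
    text_tag = label_model.predict(retrieval_text)       # e.g. "giant panda"
    text_feature = feature_model.encode(retrieval_text)  # e.g. a 512-d vector
    # Normalize so that later cosine similarity reduces to a dot product.
    text_feature = np.asarray(text_feature, dtype=float)
    text_feature = text_feature / np.linalg.norm(text_feature)
    return text_tag, text_feature
```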
  • the cross-modal retrieval method provided by the embodiments of this application can be applied to network-side scenarios and can also be applied to device-side scenarios.
  • Retrieval text is obtained in different ways depending on the application scenario.
  • In the network-side scenario, the user terminal provides a search page for the user to enter the retrieval text in the search box. The user terminal then sends the retrieval text entered in the search box to the server, and the server extracts the text tags and text features of the retrieval text.
  • In the device-side scenario, the user terminal provides a search page for the user to enter the retrieval text in the search box, and the user terminal directly extracts the text tags and text features of the retrieval text entered in the search box.
  • Step 102 Based on the text tag of the retrieved text and the visual tag of the retrieved visual data, determine whether there is at least one first visual data whose visual tag matches the text tag in the retrieved visual data.
  • The retrieved visual data includes images and/or videos.
  • Specifically, the text tag of the retrieval text is matched against the visual tags of the retrieved visual data to determine whether the retrieved visual data contains visual data whose visual tag is the same as, or a synonym of, the text tag. If such visual data exists, it is determined that at least one first visual data exists in the retrieved visual data, and the at least one first visual data is the visual data whose visual tag is the same as, or a synonym of, the text tag of the retrieval text; if no such visual data exists, it is determined that no first visual data exists in the retrieved visual data.
  • Tags that are synonyms of each other are marked in advance; for example, different tags that all refer to the same concept, such as two different words for "flower", are marked as synonyms of one another. The embodiment of the present application does not limit the marking method.
  • the visual label indicates the classification result of the visual data. For example, if the visual data is a picture of flowers, then the visual label of the visual data can be "flowers”.
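  • Under the assumption of a simple synonym table (hypothetical; this application does not specify how synonyms are marked), the tag recall of step 102 could be sketched as follows.

```python
# Hypothetical synonym table marked in advance; this application does not
# prescribe how synonyms are marked.
SYNONYMS = {"flower": {"blossom", "bloom"}}

def tag_recall(text_tag, visual_index):
    """Step 102 sketch: return the visual data whose visual tag equals the
    text tag or is a marked synonym of it (the 'first visual data')."""
    accepted = {text_tag} | SYNONYMS.get(text_tag, set())
    # Each indexed item is assumed to look like {"id": ..., "tag": ..., "feature": ...}.
    first_visual_data = [item for item in visual_index if item["tag"] in accepted]
    return first_visual_data  # an empty list means no first visual data exists
```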
  • The visual tags and visual features of the retrieved visual data can be extracted before steps 102 and 103, before step 101, or when the electronic device is idle; the embodiment of the present application does not limit the timing of visual tag and visual feature extraction.
  • the retrieved visual data can be input into the first visual model to obtain the visual tags and visual features of the retrieved visual data.
  • the retrieved visual data can also be input into the second visual model to obtain the visual features of the retrieved visual data, and the retrieved visual data can be input into the third visual model to obtain the visual label of the retrieved visual data. That is, when the first visual model is used to extract the visual tags and visual features of the retrieved visual data, the input of the first visual model is the retrieved visual data, and the output of the first visual model is the visual tags and visual features of the retrieved visual data.
  • the input of the second visual model is the retrieved visual data
  • the output of the second visual model is the visual features of the retrieved visual data.
  • the input of the third visual model is the retrieved visual data
  • the output of the third visual model is the visual tag of the retrieved visual data.
  • The structures of the first visual model, the second visual model and the third visual model can be different. For example, the third visual model can be an OCR (Optical Character Recognition) network model, which performs recognition processing on the retrieved visual data to obtain text information and then extracts the label of the text information as the visual label of the retrieved visual data.
  • the cross-modal retrieval method provided by the embodiments of this application can be applied to network-side scenarios and can also be applied to device-side scenarios.
  • the retrieved visual data is stored in different locations depending on the application scenario.
  • In the network-side scenario, the retrieved visual data is stored in the server; in the device-side scenario, the retrieved visual data is stored in the user terminal.
  • the retrieved visual data can include only images, only videos, or both.
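  • As a rough sketch that reuses the hypothetical model interfaces above, the offline indexing of the retrieved visual data could look like this.

```python
import numpy as np

def build_visual_index(visual_items, visual_label_model, visual_feature_model):
    """Offline sketch: extract the visual tag and visual feature of every image
    or video to be retrieved, e.g. while the electronic device is idle."""
    index = []
    for item in visual_items:  # item: {"id": ..., "data": image_or_video}
        tag = visual_label_model.predict(item["data"])
        feature = np.asarray(visual_feature_model.encode(item["data"]), dtype=float)
        feature = feature / np.linalg.norm(feature)  # normalize for cosine similarity
        index.append({"id": item["id"], "tag": tag, "feature": feature})
    return index
```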
  • Step 103 Based on the text features of the retrieved text and the visual features of the retrieved visual data, determine whether there is at least one second visual data whose visual features match the text features in the retrieved visual data.
  • The similarity between the text features and the visual features can be obtained by calculating the cosine distance between them, or in other ways; the embodiments of this application do not limit this. Visual data whose visual features have a similarity to the text features greater than a second similarity threshold is determined as second visual data. The second similarity threshold is preset, such as 0.8 or 0.85; in actual applications, different values may be taken according to different requirements, and the embodiments of the present application do not limit this.
  • Visual features indicate the characteristics of visual data; for example, if the visual data is an image, the visual features are the characteristics of the image.
  • In addition, the same visual model can be used to extract the visual features and visual labels of the retrieved visual data, or two different visual models can be used to extract the visual features and visual labels of the retrieved visual data respectively; the embodiment of this application does not limit this.
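  • A minimal sketch of the feature recall in step 103, assuming L2-normalized embeddings and an illustrative second similarity threshold of 0.8 (the description only gives 0.8 and 0.85 as examples):

```python
import numpy as np

SECOND_SIMILARITY_THRESHOLD = 0.8  # example value; 0.85 etc. are equally valid

def feature_recall(text_feature, visual_index, threshold=SECOND_SIMILARITY_THRESHOLD):
    """Step 103 sketch: select visual data whose visual feature matches the text
    feature, using cosine similarity (a dot product of normalized vectors)."""
    second_visual_data = []
    for item in visual_index:
        similarity = float(np.dot(text_feature, item["feature"]))
        if similarity > threshold:
            second_visual_data.append({**item, "similarity": similarity})
    # Keep the best matches first for the downstream processing steps.
    second_visual_data.sort(key=lambda x: x["similarity"], reverse=True)
    return second_visual_data
```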
  • Step 104 If there is at least one first visual data and at least one second visual data in the retrieved visual data, determine the retrieval result based on the at least one first visual data and at least one second visual data.
  • the retrieved visual data may contain at least one first visual data and at least one second visual data at the same time, or there may be only at least one first visual data or only at least one second visual data. If there is at least one first visual data and at least one second visual data in the retrieved visual data, the at least one first visual data and the at least one second visual data can be fused according to the fusion strategy to obtain the retrieval result. If at least one first visual data exists but at least one second visual data does not exist in the retrieved visual data, then at least one first visual data is used as the retrieval result. If at least one second visual data exists but at least one first visual data does not exist in the retrieved visual data, then at least one second visual data is used as the retrieval result.
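  • The three cases of step 104 could be combined as in the sketch below, where `fuse` is a placeholder for the preconfigured fusion strategy (union or intersection) discussed next.

```python
def determine_retrieval_result(first_visual_data, second_visual_data, fuse):
    """Step 104 sketch: combine the results of tag recall and feature recall.

    fuse stands for the preconfigured fusion strategy (union or intersection)."""
    if first_visual_data and second_visual_data:
        return fuse(first_visual_data, second_visual_data)
    if first_visual_data:
        return first_visual_data    # only tag recall produced results
    if second_visual_data:
        return second_visual_data   # only feature recall produced results
    return []                       # neither produced results: empty retrieval result
```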
  • Figure 2 is a flow chart of another cross-modal retrieval method provided by an embodiment of the present application.
  • the text tags and text features of the retrieval text input by the user are extracted, and the visual tags and visual features of the retrieved visual data are determined.
  • The text tags of the retrieved text are matched with the visual tags of the retrieved visual data to determine whether there is first visual data in the retrieved visual data whose visual tags match the text tags of the retrieved text; the text features of the retrieved text are matched with the visual features of the retrieved visual data to determine whether there is second visual data in the retrieved visual data whose visual features match the text features of the retrieved text.
  • If the first visual data and the second visual data exist at the same time, a preconfigured fusion scheme is used to fuse the first visual data and the second visual data, and the fused visual data is used as the retrieval result. If the first visual data and the second visual data do not exist at the same time, it is determined whether the first visual data exists. If the first visual data exists, the first visual data is used as the retrieval result; if the first visual data does not exist, the second visual data is used as the retrieval result.
  • The fusion strategy is preset. According to the emphasis of the application scenario on the quantity and accuracy of the retrieval results, the union or intersection of the at least one first visual data and the at least one second visual data can be selected as the retrieval result. That is, when the application scenario focuses more on the number of retrieval results, the union of the at least one first visual data and the at least one second visual data is used as the retrieval result; when the application scenario focuses more on the accuracy of the retrieval results, the intersection of the at least one first visual data and the at least one second visual data is used as the retrieval result.
  • Figure 3 is a flow chart of a visual data fusion method provided by an embodiment of the present application. If there is at least one first visual data and at least one second visual data in the retrieved visual data, it is determined whether the visual tag of the at least one first visual data belongs to the first type of label or the second type of label. If the visual label of the at least one first visual data belongs to the first type of label, the intersection of the at least one first visual data and the at least one second visual data is used as the retrieval result, where the first type of label refers to a label that is uncertain when characterizing visual data. If the visual label of the at least one first visual data belongs to the second type of label, the union of the at least one first visual data and the at least one second visual data is used as the retrieval result, where the second type of label refers to a label that is deterministic when characterizing visual data.
  • Since the first type of label refers to a label that has uncertainty when characterizing visual data, when the visual label of the at least one first visual data belongs to the first type of label, the content of the corresponding visual data may not be accurately expressed, so the intersection of the at least one first visual data and the at least one second visual data is used as the retrieval result. Since the second type of label refers to a label that is deterministic when characterizing visual data, when the visual label of the at least one first visual data belongs to the second type of label, the visual label can accurately express the content of the corresponding visual data, so the union of the at least one first visual data and the at least one second visual data is used as the retrieval result.
  • The first type of label and the second type of label are set in advance, and different first-type and second-type labels can be set according to different product requirements and application scenario requirements; the embodiment of the present application does not restrict how the first type of label and the second type of label are set.
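  • A sketch of this fusion strategy, assuming a preset set of first-type (uncertain) tags; which tags fall into each type is a product decision not fixed by this application.

```python
# Hypothetical preset: tags considered uncertain when characterizing visual data.
FIRST_TYPE_TAGS = {"object", "scenery"}  # illustrative values only

def fuse_by_tag_type(first_visual_data, second_visual_data):
    """Figure 3 sketch: intersect when the hit tag is of the first (uncertain)
    type, union when it is of the second (deterministic) type."""
    first_ids = {item["id"] for item in first_visual_data}
    second_ids = {item["id"] for item in second_visual_data}
    hit_tags = {item["tag"] for item in first_visual_data}
    if hit_tags & FIRST_TYPE_TAGS:
        result_ids = first_ids & second_ids  # intersection: favor accuracy
    else:
        result_ids = first_ids | second_ids  # union: favor breadth
    merged = {item["id"]: item for item in first_visual_data + second_visual_data}
    return [merged[i] for i in result_ids]
```

  • Such a function could be passed as the `fuse` argument of the step 104 sketch above.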
  • In the embodiments of this application, the process of matching the text tags of the retrieved text with the visual tags of the retrieved visual data is called label recall, and the process of matching the text features of the retrieved text with the visual features of the retrieved visual data is called open semantic recall, which can also be called vector recall. There may be three situations after label recall and open semantic recall: at least one first visual data is obtained through label recall but no result is obtained through open semantic recall; at least one second visual data is obtained through open semantic recall but no result is obtained through label recall; or at least one first visual data is obtained through label recall and at least one second visual data is obtained through open semantic recall.
  • For example, the retrieval text is "giant pandas eating bamboo" and the text label of the retrieval text is "giant panda". If it is determined through label recall that "giant panda" hits a visual label of the retrieved visual data, but no result is obtained through open semantic recall, the visual data with the visual tag "giant panda" in the retrieved visual data is used as the retrieval result.
  • For another example, the retrieval text is "giant pandas eating bamboo" and the text label of the retrieval text is "giant panda". If it is determined through label recall that "giant panda" hits a visual label of the retrieved visual data, the visual data with the visual tag "giant panda" in the retrieved visual data is determined to be the at least one first visual data. At the same time, at least one second visual data is obtained through open semantic recall. In this case, the at least one first visual data and the at least one second visual data can be intersected or unioned to obtain the retrieval result.
  • For yet another example, the retrieval text is "black lithography machine" and the text label of the retrieval text is "lithography machine". If it is determined through label recall that "lithography machine" misses the visual labels of the retrieved visual data, but at least one second visual data is obtained through open semantic recall, the at least one second visual data is used as the retrieval result.
  • the retrieved visual data may also include neither at least one first visual data nor at least one second visual data. In this case, it is determined that the retrieval result is empty.
  • the retrieval result can be determined directly according to the above method.
  • the retrieval result can also be determined according to the above method after processing the at least one second visual data more accurately.
  • In a possible implementation, the visual features of the at least one second visual data and the text features of the retrieved text are input into a neural network model to obtain model inference results, where the model inference results include similarity results and/or pairwise judgment results. The similarity results indicate the similarity between each of the at least one second visual data and the retrieved text, and the pairwise judgment results indicate whether each of the at least one second visual data can be paired with the retrieved text. The at least one second visual data is then processed based on the model inference results.
  • FIG. 5 is a flow chart of a method for processing second visual data provided by an embodiment of the present application.
  • The process of determining the at least one second visual data for the first time is introduced here. The retrieved visual data is analyzed by the visual model in the offline state to obtain the visual features of the retrieved visual data; based on the retrieval text input by the user online, the retrieval text is analyzed by the text model to obtain the text features of the retrieval text; feature retrieval is then performed with the text features of the retrieval text and the visual features of the retrieved visual data to determine the at least one second visual data. Then, the text features of the retrieval text and the visual features of the at least one second visual data are input into the neural network model together, and the at least one second visual data is further processed through the analysis of the neural network model to obtain the final second visual data.
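  • A minimal sketch of this second stage, assuming a hypothetical cross-modal scoring model whose forward method returns a similarity score and a pairing probability for each candidate (the application does not fix the model architecture):

```python
def model_inference(text_feature, second_visual_data, scoring_model,
                    pair_probability_threshold=0.5):
    """Figure 5 sketch: feed the text feature and each candidate's visual
    feature into a neural network and attach the similarity and pairwise
    judgment results to the candidate.

    scoring_model.forward is a hypothetical interface that returns
    (similarity, pair_probability) for one text/visual feature pair."""
    results = []
    for item in second_visual_data:
        similarity, pair_prob = scoring_model.forward(text_feature, item["feature"])
        results.append({
            **item,
            "model_similarity": float(similarity),
            "paired": pair_prob >= pair_probability_threshold,
        })
    return results
```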
  • the model inference results can include only similarity results, only paired judgment results, or both similarity results and paired judgment results.
  • the at least one second visual data is processed in different ways based on the model inference results, which will be introduced separately below.
  • Case 1: If the model inference result includes a similarity result but does not include a pairwise judgment result, then based on the similarity result, second visual data whose similarity to the retrieved text is greater than the first similarity threshold is selected from the at least one second visual data. Since the similarity result is obtained through fine-grained analysis by the neural network model that combines the visual features of the second visual data with the text features of the retrieved text, it can more accurately characterize the similarity between the second visual data and the retrieved text. Filtering the at least one second visual data with this similarity result removes the visual data that is not really similar to the retrieved text and retains the visual data that is, thereby improving the accuracy of the final retrieval results.
  • the first similarity threshold is set in advance, such as 0.85, 0.9, etc.
  • the first similarity threshold can take different values according to different requirements.
  • the values of the first similarity threshold and the second similarity threshold may be the same or different, and this is not limited in the embodiment of the present application.
  • In addition, the selected second visual data can also be sorted in descending order of similarity, thereby improving the rationality of the ordering of the retrieval results finally fed back to the user.
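  • Case 1 in code, with an illustrative first similarity threshold (0.85 and 0.9 are the example values given above):

```python
FIRST_SIMILARITY_THRESHOLD = 0.85  # example value from the description (0.9 also given)

def filter_by_similarity(scored_candidates, threshold=FIRST_SIMILARITY_THRESHOLD):
    """Case 1 sketch: keep candidates whose model similarity to the retrieval
    text exceeds the first similarity threshold, sorted best-first."""
    kept = [c for c in scored_candidates if c["model_similarity"] > threshold]
    kept.sort(key=lambda c: c["model_similarity"], reverse=True)
    return kept
```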
  • Case 2: If the model inference result includes a pairwise judgment result but does not include a similarity result, then based on the pairwise judgment result, second visual data that can be paired with the retrieved text is selected from the at least one second visual data. Since the pairwise judgment result is obtained through fine-grained analysis by the neural network model that combines the visual features of the second visual data with the text features of the retrieved text, it can more accurately characterize whether the second visual data and the retrieved text can be paired. Filtering the at least one second visual data with the pairwise judgment results removes the visual data that cannot be paired with the retrieved text and retains the visual data that can, thereby filtering out unreasonable visual data and improving the accuracy of the finally determined retrieval results.
  • Case 3: If the model inference result includes both a similarity result and a pairwise judgment result, then based on the pairwise judgment result, the second visual data that can be paired with the retrieved text is retained and the second visual data that cannot be paired is deleted; then, based on the similarity result, the remaining second visual data is sorted in descending order of its similarity to the retrieved text, thereby improving the rationality of the ordering of the retrieval results finally fed back to the user.
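  • Case 3 as a sketch, reusing the `paired` flag and `model_similarity` values attached by the hypothetical scoring model above:

```python
def pair_filter_and_sort(scored_candidates):
    """Case 3 sketch: drop candidates the model judges as not paired with the
    retrieval text, then sort the rest by model similarity, descending."""
    paired = [c for c in scored_candidates if c["paired"]]
    paired.sort(key=lambda c: c["model_similarity"], reverse=True)
    return paired
```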
  • For example, the retrieved text is "sparrow", and the at least one second visual data includes four images: three images of "sparrow" and one image of "parrot" (the image ranked second). "Parrot" does not match the meaning of "sparrow", but because both sparrows and parrots are small birds, they are difficult to distinguish, and errors easily occur if the retrieval result is determined only from the at least one second visual data obtained in the first stage. After screening based on the pairwise judgment results, the image of "parrot" is deleted from the at least one second visual data obtained in the first stage, which helps to obtain a more accurate result.
  • In addition, in the first-stage result, retrieved visual data with less distinct visual features may be ranked ahead of retrieved visual data with clearer visual features; for example, the first "sparrow" image has less distinct visual features but is ranked in front, while the third "sparrow" image has clearer visual features but is ranked behind. By sorting the filtered second visual data in descending order of similarity to the retrieved text, the retrieved visual data with clearer visual features is moved forward and the retrieved visual data with less distinct visual features is moved backward, resulting in a more reasonable ordering.
  • the search results can be fed back to the user.
  • the cross-modal retrieval method provided by the embodiments of this application can be applied to network-side scenarios and can also be applied to device-side scenarios.
  • In the network-side scenario, the retrieval results can be sent to the user terminal, and the user terminal displays the retrieval results.
  • In the device-side scenario, the retrieval results can be displayed directly by the user terminal.
  • In this embodiment, the first visual data determined by the visual tags of the retrieved visual data and the text tag of the retrieval text is relatively accurate; that is, the retrieval scope can be accurately controlled through tag matching. Since the retrieval text is semantically open natural language description information, the second visual data determined by the visual features of the retrieved visual data and the text features of the retrieval text has no semantic restrictions, which supports natural semantic retrieval.
  • As a result, the retrieval is more flexible, the retrieval scope is wider, and fine-grained retrieval text such as adjectives can be recognized. In this way, when both the first visual data and the second visual data exist in the retrieved visual data, fusing the first visual data and the second visual data can simultaneously improve the cross-modal retrieval accuracy and retrieval breadth.
  • In addition, the embodiments of the present application can also input the visual features of the second visual data and the text features of the retrieved text into the neural network model at the same time, so that the neural network model realizes fine-grained analysis by combining the visual features and text features, and the second visual data is thereby filtered, improving the accuracy and rationality of the retrieval results.
  • Figure 8 is a flow chart of another cross-modal retrieval method provided by an embodiment of the present application. The method includes the following steps.
  • Step 801 Extract text tags and text features of the retrieved text.
  • the method further includes: receiving retrieval text input by the user.
  • Step 802 Obtain a tag matching result based on the text tag of the retrieved text and the visual tag of the retrieved visual data, where the retrieved visual data includes images and/or videos.
  • Step 803 Obtain feature matching results based on the text features of the retrieved text and the visual features of the retrieved visual data.
  • In a possible implementation, obtaining the feature matching result based on the text features and the visual features of the retrieved visual data includes: performing feature matching between the text features and the visual features of the retrieved visual data to obtain a first feature matching result, where the first feature matching result includes at least one third visual data; and inputting the text features and the first feature matching result into a preset model to obtain the feature matching result, where the feature matching result includes some or all of the third visual data in the first feature matching result, and the third visual data included in the feature matching result is sorted according to similarity with the text features.
  • the preset model may be a neural network model.
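  • Putting the feature branch of steps 802 to 804 together, a sketch of this two-stage feature matching (coarse recall, then re-ranking by the preset model) could reuse the hypothetical helpers from the earlier sketches:

```python
def feature_matching(text_feature, visual_index, scoring_model):
    """Step 803 sketch: coarse feature matching followed by re-ranking.

    feature_recall, model_inference and pair_filter_and_sort are the
    hypothetical helpers sketched earlier; the 'preset model' is assumed here
    to be the neural network scoring model."""
    first_feature_matching_result = feature_recall(text_feature, visual_index)  # the third visual data
    scored = model_inference(text_feature, first_feature_matching_result, scoring_model)
    return pair_filter_and_sort(scored)  # sorted by similarity to the text feature
```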
  • Step 804 Obtain the search results based on the tag matching results and the feature matching results.
  • The tag matching result may or may not include at least one first visual data, and the feature matching result may or may not include at least one second visual data. In different cases, the methods of determining the retrieval result are different, and they are introduced separately next.
  • In a first case, the tag matching result includes at least one first visual data and the feature matching result includes at least one second visual data. In this case, the implementation process of obtaining the retrieval result based on the tag matching result and the feature matching result includes: using the union or intersection of the at least one first visual data included in the tag matching result and the at least one second visual data included in the feature matching result as the retrieval result.
  • Using the union or intersection of the at least one first visual data included in the tag matching result and the at least one second visual data included in the feature matching result as the retrieval result includes: if the visual tag of the at least one first visual data belongs to the preset first type of label, using the intersection of the at least one first visual data and the at least one second visual data as the retrieval result; and if the visual tag of the at least one first visual data belongs to the preset second type of label, using the union of the at least one first visual data and the at least one second visual data as the retrieval result.
  • In a second case, the tag matching result includes at least one first visual data and the feature matching result indicates that there is no matching data. In this case, the implementation process of obtaining the retrieval result based on the tag matching result and the feature matching result includes: using the at least one first visual data included in the tag matching result as the retrieval result.
  • In a third case, the tag matching result indicates that there is no matching data and the feature matching result includes at least one second visual data. In this case, the implementation process of obtaining the retrieval result based on the tag matching result and the feature matching result includes: using the at least one second visual data included in the feature matching result as the retrieval result.
  • In a fourth case, the tag matching result includes at least one first visual data and the visual tags of part or all of the at least one first visual data include the text tag. In this case, the visual data in the at least one first visual data whose visual tag includes the text tag is used as the retrieval result.
  • In a fifth case, if both the tag matching result and the feature matching result indicate that there is no matching data, the retrieval result indicates that there is no matching data.
  • the retrieval result determined through the above steps includes at least one image and/or video. At this time, the retrieval result can also be displayed.
  • In this embodiment, the first visual data determined by the visual tags of the retrieved visual data and the text tag of the retrieval text is relatively accurate; that is, the retrieval scope can be accurately controlled through tag matching. Since the retrieval text is semantically open natural language description information, the second visual data determined by the visual features of the retrieved visual data and the text features of the retrieval text has no semantic restrictions, which supports natural semantic retrieval.
  • As a result, the retrieval is more flexible, the retrieval scope is wider, and fine-grained retrieval text such as adjectives can be recognized. In this way, when both the first visual data and the second visual data exist in the retrieved visual data, fusing the first visual data and the second visual data can simultaneously improve the cross-modal retrieval accuracy and retrieval breadth.
  • In addition, the embodiments of the present application can also input the visual features of the second visual data and the text features of the retrieved text into the neural network model at the same time, so that the neural network model realizes fine-grained analysis by combining the visual features and text features, and the second visual data is thereby filtered, improving the accuracy and rationality of the retrieval results.
  • Figure 9 is a schematic structural diagram of a cross-modal retrieval device provided by an embodiment of the present application.
  • the device can be implemented as part or all of an electronic device by software, hardware, or a combination of both.
  • the device includes: an extraction module 901, a first determination module 902, a second determination module 903 and a third determination module 904.
  • The extraction module 901 is used to extract text tags and text features of the retrieved text.
  • The first determination module 902 is configured to determine, based on the text tag and the visual tags of the retrieved visual data, whether there is at least one first visual data in the retrieved visual data whose visual tag matches the text tag, where the retrieved visual data includes images and/or videos.
  • the second determination module 903 is configured to determine, based on the text features and the visual features of the retrieved visual data, whether there is at least one second visual data whose visual features match the text features in the retrieved visual data.
  • the third determination module 904 is configured to determine the retrieval result based on at least one first visual data and at least one second visual data if there is at least one first visual data and at least one second visual data in the retrieved visual data.
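  • As an illustration only, the four modules might be composed as in the following sketch; the names mirror the description, and the implementation is not prescribed by this application.

```python
class CrossModalRetrievalDevice:
    """Figure 9 sketch: the device wires together the extraction module and
    the three determination modules described above."""

    def __init__(self, extraction_module, first_det, second_det, third_det):
        self.extraction_module = extraction_module  # module 901: text tag/feature extraction
        self.first_det = first_det                  # module 902: tag matching
        self.second_det = second_det                # module 903: feature matching
        self.third_det = third_det                  # module 904: result determination

    def retrieve(self, retrieval_text, visual_index):
        text_tag, text_feature = self.extraction_module(retrieval_text)
        first_visual_data = self.first_det(text_tag, visual_index)
        second_visual_data = self.second_det(text_feature, visual_index)
        return self.third_det(first_visual_data, second_visual_data)
```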
  • In a possible implementation, the third determination module 904 is specifically used to: if the visual label of the at least one first visual data belongs to the first type of label, use the intersection of the at least one first visual data and the at least one second visual data as the retrieval result, where the first type of label refers to a label that is uncertain when characterizing visual data; and if the visual label of the at least one first visual data belongs to the second type of label, use the union of the at least one first visual data and the at least one second visual data as the retrieval result, where the second type of label refers to a label that is deterministic when characterizing visual data.
  • the device also includes:
  • the fourth determination module is configured to use at least one first visual data as the retrieval result if at least one first visual data exists but at least one second visual data does not exist in the retrieved visual data.
  • the device also includes:
  • the fifth determination module is configured to use at least one second visual data as the retrieval result if at least one second visual data exists but at least one first visual data does not exist in the retrieved visual data.
  • the device further includes:
  • a model inference module, configured to input the visual features of the at least one second visual data and the text features into the neural network model to obtain model inference results, where the model inference results include similarity results and/or pairwise judgment results, the similarity results indicate the similarity between the at least one second visual data and the retrieval text respectively, and the pairwise judgment results indicate whether the at least one second visual data can be paired with the retrieval text respectively;
  • a processing module configured to process at least one second visual data based on the model inference results.
  • In a possible implementation, the model inference results include similarity results, and the processing module is specifically used to: select, from the at least one second visual data, second visual data whose similarity to the retrieved text is greater than the first similarity threshold.
  • In a possible implementation, the model inference results include pairwise judgment results, and the processing module is specifically used to: select, from the at least one second visual data, second visual data that can be paired with the retrieved text.
  • the model inference results include the similarity result and the pairing judgment result;
  • the processing module is specifically used to:
  • select, from the at least one second visual data, the second visual data that can be paired with the retrieved text based on the pairing judgment result, and sort the selected second visual data in descending order of similarity to the retrieved text based on the similarity result.
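A small sketch of how the processing module might behave when the model inference result carries both a similarity score and a pairing judgment is given below. The record layout, field names and threshold value are assumptions made for the example; a sketch under those assumptions, not a definitive implementation.

```python
# Illustrative processing of open-semantic-recall candidates using the
# model inference results (pairing judgment + similarity score).

from dataclasses import dataclass
from typing import List


@dataclass
class InferenceResult:
    visual_id: int
    similarity: float   # refined similarity between the visual item and the query text
    paired: bool        # whether the model judges the (visual, text) pair to match


def process_second_visual_data(results: List[InferenceResult],
                               first_similarity_threshold: float = 0.85) -> List[int]:
    # 1. Keep only items the model judges to be pairable with the query text.
    kept = [r for r in results if r.paired]
    # 2. Optionally drop items whose refined similarity is below the threshold.
    kept = [r for r in kept if r.similarity > first_similarity_threshold]
    # 3. Sort the remaining items by similarity in descending order.
    kept.sort(key=lambda r: r.similarity, reverse=True)
    return [r.visual_id for r in kept]


# Example usage with three candidate items.
candidates = [InferenceResult(1, 0.91, True),
              InferenceResult(2, 0.88, False),   # dropped: not pairable
              InferenceResult(3, 0.95, True)]
print(process_second_visual_data(candidates))    # [3, 1]
```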
  • the first visual data determined by the visual tags of the retrieved visual data and the text tag of the retrieved text is relatively accurate; that is, the retrieval scope can be accurately controlled through tag matching.
  • since the retrieval text is semantically open natural language description information, the second visual data determined by the visual features of the retrieved visual data and the text features of the retrieval text has no semantic restrictions, making the retrieval more flexible.
  • the search scope is also relatively wide and can identify fine-grained search text such as adjectives. In this way, when the first visual data and the second visual data both exist in the retrieved visual data, fusing the first visual data and the second visual data can simultaneously improve the cross-modal retrieval accuracy and retrieval breadth.
  • embodiments of the present application can also input the visual features of the second visual data and the text features of the retrieved text into the neural network model at the same time.
  • the neural network model realizes fine-grained analysis by combining the visual features and the text features, thereby filtering the second visual data to improve the accuracy and rationality of the retrieval results.
  • when the cross-modal retrieval device provided in the above embodiments performs cross-modal retrieval, the division of the functional modules described above is only used as an example.
  • in practical applications, the above functions can be allocated to different functional modules as needed; that is, the internal structure of the device is divided into different functional modules to complete all or part of the functions described above.
  • the cross-modal retrieval device provided by the above embodiments and the cross-modal retrieval method embodiments belong to the same concept. Please refer to the method embodiments for the specific implementation process, which will not be described again here.
  • FIG. 10 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
  • the electronic device includes at least one processor 1001, a communication bus 1002, a memory 1003, and at least one communication interface 1004.
  • the processor 1001 may be a general central processing unit (CPU), a network processor (NP), a microprocessor, or one or more integrated circuits used to implement the solution of the present application, such as an application-specific integrated circuit (ASIC), a programmable logic device (PLD), or a combination thereof.
  • the above-mentioned PLD can be a complex programmable logic device (CPLD), a field-programmable gate array (FPGA), a general array logic (GAL), or any combination thereof.
  • Communication bus 1002 is used to transfer information between the above-mentioned components.
  • the communication bus 1002 can be divided into an address bus, a data bus, a control bus, etc. For ease of presentation, only one thick line is used in the figure, but it does not mean that there is only one bus or one type of bus.
  • the memory 1003 can be a read-only memory (ROM), a random access memory (RAM), an electrically erasable programmable read-only memory (EEPROM), an optical disc (including a compact disc read-only memory (CD-ROM), a compact disc, a laser disc, a digital versatile disc, a Blu-ray disc, etc.), a magnetic disk storage medium or other magnetic storage device, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer, without limitation.
  • the memory 1003 may exist independently and be connected to the processor 1001 through the communication bus 1002.
  • the memory 1003 may also be integrated with the processor 1001.
  • the Communication interface 1004 uses any transceiver-like device for communicating with other devices or communication networks.
  • the communication interface 1004 includes a wired communication interface and may also include a wireless communication interface.
  • the wired communication interface may be an Ethernet interface, for example.
  • the Ethernet interface can be an optical interface, an electrical interface, or a combination thereof.
  • the wireless communication interface may be a wireless local area networks (WLAN) interface, a cellular network communication interface, or a combination thereof.
  • the processor 1001 may include one or more CPUs, such as CPU0 and CPU1 as shown in FIG. 10 .
  • the electronic device may include multiple processors, such as the processor 1001 and the processor 1005 shown in FIG. 10 .
  • processors can be a single-core processor or a multi-core processor.
  • a processor here may refer to one or more devices, circuits, and/or processing cores for processing data (such as computer program instructions).
  • the electronic device may also include an output device 1006 and an input device 1009.
  • Output device 1006 communicates with processor 1001 and can display information in a variety of ways.
  • the output device 1006 can be a liquid crystal display (LCD), a light emitting diode (LED) display device, a cathode ray tube (cathode ray tube, CRT) display device or a projector (projector), etc.
  • Input device 1009 communicates with processor 1001 and can receive user input in a variety of ways.
  • the input device 1009 may be a mouse, a keyboard, a touch screen device, a sensing device, or the like.
  • the memory 1003 is used to store the program code 1010 for executing the solution of the present application, and the processor 1001 can execute the program code 1010 stored in the memory 1003.
  • the program code 1010 may include one or more software modules, and the electronic device may implement the cross-modal retrieval method provided in the above embodiments through the processor 1001 and the program code 1010 in the memory 1003.
  • FIG 11 is a schematic structural diagram of a user terminal provided by an embodiment of the present application.
  • the user terminal includes a sensor unit 1110, a computing unit 1120, a storage unit 1140 and an interaction unit 1130.
  • the sensor unit 1110 usually includes a visual sensor (such as a camera), a depth sensor, an IMU, a laser sensor, etc.;
  • the computing unit 1120, usually including a CPU, a GPU, a cache, registers, etc., is mainly used to run the operating system;
  • the storage unit 1140 mainly includes memory and external storage, and is mainly used for reading and writing user local and temporary data;
  • the interaction unit 1130 mainly includes a display screen, a touch panel, a speaker, a microphone, etc., and is mainly used to interact with the user, obtain input, and implement presentation algorithm effects, etc.
  • Figure 12 is a schematic structural diagram of a user terminal provided by an embodiment of the present application.
  • the user terminal 100 may include a processor 110 , an external memory interface 120 , an internal memory 121 , a universal serial bus (USB) interface 130 , a charging management module 140 , a power management module 141 , and a battery 142 , Antenna 1, Antenna 2, mobile communication module 150, wireless communication module 160, audio module 170, speaker 170A, receiver 170B, microphone 170C, headphone interface 170D, sensor module 180, button 190, motor 191, indicator 192, camera 193 , display screen 194, and subscriber identification module (subscriber identification module, SIM) card interface 195, etc.
  • the sensor module 180 may include a pressure sensor 180A, a gyro sensor 180B, an air pressure sensor 180C, a magnetic sensor 180D, an acceleration sensor 180E, a distance sensor 180F, a proximity light sensor 180G, a fingerprint sensor 180H, a temperature sensor 180J, a touch sensor 180K, an ambient light sensor 180L, a bone conduction sensor 180M, etc.
  • the structure illustrated in the embodiment of the present application does not constitute a specific limitation on the user terminal 100.
  • the user terminal 100 may include more or less components than shown in the figures, or combine some components, or split some components, or arrange different components.
  • the components illustrated may be implemented in hardware, software, or a combination of software and hardware.
  • the processor 110 may include one or more processing units.
  • the processor 110 may include an application processor (AP), a modem processor, a graphics processing unit (GPU), an image signal processor (ISP), a controller, a memory, a video codec, a digital signal processor (DSP), a baseband processor, and/or a neural-network processing unit (NPU), etc.
  • different processing units can be independent devices or integrated in one or more processors.
  • the processor 110 can execute a computer program to implement any method in the embodiments of the present application.
  • the controller may be the nerve center and command center of the user terminal 100 .
  • the controller can generate operation control signals based on the instruction operation code and timing signals to complete the control of fetching and executing instructions.
  • the processor 110 may also be provided with a memory for storing instructions and data.
  • the memory in processor 110 is cache memory. This memory may hold instructions or data that have been recently used or recycled by processor 110 . If the processor 110 needs to use the instruction or data again, it can be directly called from the memory, which avoids repeated access, reduces the waiting time of the processor 110, and thus improves the efficiency of the system.
  • processor 110 may include one or more interfaces.
  • interfaces may include an inter-integrated circuit (I2C) interface, an inter-integrated circuit sound (I2S) interface, a pulse code modulation (PCM) interface, a universal asynchronous receiver/transmitter (UART) interface, a mobile industry processor interface (MIPI), a general-purpose input/output (GPIO) interface, a subscriber identity module (SIM) interface, and/or a universal serial bus (USB) interface, etc.
  • the interface connection relationships between the modules illustrated in the embodiments of the present application are only schematic illustrations and do not constitute a structural limitation on the user terminal 100 .
  • the user terminal 100 may also adopt different interface connection methods in the above embodiments, or a combination of multiple interface connection methods.
  • the charging management module 140 is used to receive charging input from the charger.
  • the charger can be a wireless charger or a wired charger.
  • the charging management module 140 may receive charging input from the wired charger through the USB interface 130 .
  • the power management module 141 is used to connect the battery 142, the charging management module 140 and the processor 110.
  • the power management module 141 receives input from the battery 142 and/or the charging management module 140, and supplies power to the processor 110, internal memory 121, external memory, display screen 194, camera 193, wireless communication module 160, etc.
  • the wireless communication function of the user terminal 100 can be implemented through the antenna 1, the antenna 2, the mobile communication module 150, the wireless communication module 160, the modem processor and the baseband processor.
  • the user terminal 100 can communicate with other devices using wireless communication functions.
  • the user terminal 100 can communicate with a second electronic device, the user terminal 100 establishes a screen projection connection with the second electronic device, and the user terminal 100 outputs screen projection data to the second electronic device, etc.
  • the screen projection data output by the user terminal 100 may be audio and video data.
  • Antenna 1 and Antenna 2 are used to transmit and receive electromagnetic wave signals.
  • Each antenna in user terminal 100 may be used to cover a single or multiple communication frequency bands. Different antennas can also be reused to improve antenna utilization.
  • Antenna 1 can be reused as a diversity antenna for a wireless LAN. In other embodiments, antennas may be used in conjunction with tuning switches.
  • the mobile communication module 150 can provide wireless communication solutions including 2G/3G/4G/5G applied to the user terminal 100.
  • the mobile communication module 150 may include at least one filter, switch, power amplifier, low noise amplifier (LNA), etc.
  • the mobile communication module 150 can receive electromagnetic waves through the antenna 1, perform filtering, amplification and other processing on the received electromagnetic waves, and transmit them to the modem processor for demodulation.
  • the mobile communication module 150 can also amplify the signal modulated by the modem processor and convert it into electromagnetic waves through the antenna 1 for radiation.
  • at least part of the functional modules of the mobile communication module 150 may be disposed in the processor 110 .
  • at least part of the functional modules of the mobile communication module 150 and at least part of the modules of the processor 110 may be provided in the same device.
  • a modem processor may include a modulator and a demodulator.
  • the modulator is used to modulate the low-frequency baseband signal to be sent into a medium-high frequency signal.
  • the demodulator is used to demodulate the received electromagnetic wave signal into a low-frequency baseband signal.
  • the demodulator then transmits the demodulated low-frequency baseband signal to the baseband processor for processing.
  • after being processed by the baseband processor, the low-frequency baseband signal is passed to the application processor, which outputs sound signals through audio devices (not limited to speaker 170A, receiver 170B, etc.) or displays images or videos through display screen 194.
  • the modem processor may be a stand-alone device.
  • the modem processor may be independent of the processor 110 and may be provided in the same device as the mobile communication module 150 or other functional modules.
  • the wireless communication module 160 can provide applications on the user terminal 100 including wireless local area networks (WLAN), such as wireless fidelity (Wi-Fi) networks, Bluetooth (bluetooth, BT), and global navigation satellite systems. (global navigation satellite system, GNSS), frequency modulation (FM), near field communication technology (near field communication, NFC), infrared technology (infrared, IR) and other wireless communication solutions.
  • the wireless communication module 160 may be one or more devices integrating at least one communication processing module.
  • the wireless communication module 160 receives electromagnetic waves via the antenna 2, frequency modulates and filters the electromagnetic wave signals, and sends the processed signals to the processor 110.
  • the wireless communication module 160 can also receive the signal to be sent from the processor 110, frequency modulate it, amplify it, and convert it into electromagnetic waves through the antenna 2 for radiation.
  • the antenna 1 of the user terminal 100 is coupled to the mobile communication module 150, and the antenna 2 is coupled to the wireless communication module 160, so that the user terminal 100 can communicate with the network and other devices through wireless communication technology.
  • the wireless communication technology may include global system for mobile communications (GSM), general packet radio service (GPRS), code division multiple access (CDMA), broadband Code division multiple access (wideband code division multiple access, WCDMA), time division code division multiple access (time-division code division multiple access, TD-SCDMA), long term evolution (long term evolution, LTE), BT, GNSS, WLAN, NFC , FM, and/or IR technology, etc.
  • the GNSS may include a global positioning system (GPS), a global navigation satellite system (GLONASS), a Beidou navigation satellite system (BDS), a quasi-zenith satellite system (QZSS) and/or satellite based augmentation systems (SBAS).
  • the user terminal 100 implements display functions through a GPU, a display screen 194, an application processor, and the like.
  • the GPU is an image processing microprocessor and is connected to the display screen 194 and the application processor. GPUs are used to perform mathematical and geometric calculations for graphics rendering.
  • Processor 110 may include one or more GPUs that execute program instructions to generate or alter display information.
  • the display screen 194 is used to display images, videos, etc.
  • Display 194 includes a display panel.
  • the display panel can use a liquid crystal display (LCD), an organic light-emitting diode (OLED), an active-matrix organic light-emitting diode (AMOLED), a flexible light-emitting diode (FLED), a MiniLED, a MicroLED, a Micro-OLED, a quantum dot light emitting diode (QLED), etc.
  • the user terminal 100 may include 1 or N display screens 194, where N is a positive integer greater than 1.
  • the display screen 194 may be used to display various interfaces output by the system of the user terminal 100 .
  • the user terminal 100 can implement the shooting function through the ISP, camera 193, video codec, GPU, display screen 194, application processor, etc.
  • the ISP is used to process the data fed back by the camera 193. For example, when taking a photo, the shutter is opened, the light is transmitted to the camera sensor through the lens, the optical signal is converted into an electrical signal, and the camera sensor passes the electrical signal to the ISP for processing, and converts it into an image visible to the naked eye. ISP can also perform algorithm optimization on image noise, brightness, and skin color. ISP can also optimize the exposure, color temperature and other parameters of the shooting scene. In some embodiments, the ISP may be provided in the camera 193.
  • Camera 193 is used to capture still images or video.
  • the object passes through the lens to produce an optical image that is projected onto the photosensitive element.
  • the photosensitive element can be a charge coupled device (CCD) or a complementary metal-oxide-semiconductor (CMOS) phototransistor.
  • the photosensitive element converts the optical signal into an electrical signal, and then passes the electrical signal to the ISP to convert it into a digital image signal.
  • ISP outputs digital image signals to DSP for processing.
  • DSP converts digital image signals into standard RGB, YUV and other format image signals.
  • the user terminal 100 may include 1 or N cameras 193, where N is a positive integer greater than 1.
  • Digital signal processors are used to process digital signals. In addition to digital image signals, they can also process other digital signals.
  • Video codecs are used to compress or decompress digital video.
  • User terminal 100 may support one or more video codecs.
  • the user terminal 100 can play or record videos in multiple encoding formats, such as moving picture experts group (MPEG) 1, MPEG2, MPEG3, MPEG4, etc.
  • NPU is a neural network (NN) computing processor.
  • the NPU can realize intelligent cognitive applications of the user terminal 100, such as image recognition, face recognition, speech recognition, text understanding, etc.
  • the external memory interface 120 can be used to connect an external memory card, such as a Micro SD card, to expand the storage capacity of the user terminal 100.
  • the external memory card communicates with the processor 110 through the external memory interface 120 to implement the data storage function. Such as saving music, videos, etc. files in external memory card.
  • Internal memory 121 may be used to store computer executable program code, which includes instructions.
  • the processor 110 executes instructions stored in the internal memory 121 to execute various functional applications and data processing of the user terminal 100 .
  • the internal memory 121 may include a program storage area and a data storage area.
  • the stored program area may store an operating system, at least one application program required for a function (such as the method in the embodiment of the present application, etc.).
  • the storage data area may store data created during use of the user terminal 100 (such as audio data, phone book, etc.).
  • the internal memory 121 may include high-speed random access memory, and may also include non-volatile memory, such as at least one disk storage device, flash memory device, universal flash storage (UFS), etc.
  • the user terminal 100 can implement audio functions through the audio module 170, the speaker 170A, the receiver 170B, the microphone 170C, the headphone interface 170D, and the application processor. Such as music playback, recording, etc.
  • the audio module 170 can be used to play sounds corresponding to the video. For example, when the display screen 194 displays a video playback screen, the audio module 170 outputs the sound of the video playback.
  • the audio module 170 is used to convert digital audio information into analog audio signal output, and is also used to convert analog audio input into digital audio signals.
  • Speaker 170A also called “speaker” is used to convert audio electrical signals into sound signals.
  • Receiver 170B also called “earpiece” is used to convert audio electrical signals into sound signals.
  • Microphone 170C, also called "mic", is used to convert sound signals into electrical signals.
  • the headphone interface 170D is used to connect wired headphones.
  • the headphone interface 170D may be a USB interface 130, or may be a 3.5mm open mobile terminal platform (OMTP) standard interface, or a Cellular Telecommunications Industry Association of the USA (CTIA) standard interface.
  • the pressure sensor 180A is used to sense pressure signals and can convert the pressure signals into electrical signals.
  • pressure sensor 180A may be disposed on display screen 194 .
  • the gyro sensor 180B may be used to determine the motion posture of the user terminal 100 .
  • Air pressure sensor 180C is used to measure air pressure.
  • the acceleration sensor 180E can detect the acceleration of the user terminal 100 in various directions (including three axes or six axes). When the user terminal 100 is stationary, the magnitude and direction of gravity can be detected. It can also be used to identify user terminal gestures and be used in horizontal and vertical screen switching, pedometer and other applications.
  • Distance sensor 180F for measuring distance.
  • the ambient light sensor 180L is used to sense ambient light brightness.
  • Fingerprint sensor 180H is used to collect fingerprints.
  • Temperature sensor 180J is used to detect temperature.
  • Touch sensor 180K also called “touch panel”.
  • the touch sensor 180K can be disposed on the display screen 194.
  • the touch sensor 180K and the display screen 194 form a touch screen, which is also called a "touch screen”.
  • the touch sensor 180K is used to detect a touch operation on or near the touch sensor 180K.
  • the touch sensor can pass the detected touch operation to the application processor to determine the touch event type.
  • Visual output related to the touch operation may be provided through display screen 194 .
  • the touch sensor 180K may also be disposed on the surface of the user terminal 100 in a position different from that of the display screen 194 .
  • the buttons 190 include a power button, a volume button, etc.
  • Key 190 may be a mechanical key. It can also be a touch button.
  • the user terminal 100 may receive key input and generate key signal input related to user settings and function control of the user terminal 100 .
  • the motor 191 can generate vibration prompts.
  • the indicator 192 may be an indicator light, which may be used to indicate charging status, power changes, or may be used to indicate messages, missed calls, notifications, etc.
  • the SIM card interface 195 is used to connect a SIM card.
  • Embodiments of the present application also provide a computer-readable storage medium. Instructions are stored in the storage medium. When the instructions are run on a computer, they cause the computer to perform the steps of the cross-modal retrieval method described in the above embodiments.
  • Embodiments of the present application also provide a computer program product containing instructions. When the instructions are run on a computer, they cause the computer to perform the steps of the cross-modal retrieval method described in the above embodiments.
  • a computer program is provided. When the computer program is run on a computer, it causes the computer to perform the steps of the cross-modal retrieval method described in the above embodiment.
  • Embodiments of the present application also provide a chip.
  • the chip includes a processor and an interface circuit.
  • the interface circuit is used to receive instructions and transmit them to the processor.
  • the processor is used to execute the steps of the cross-modal retrieval method described in the above embodiments.
  • Embodiments of the present application also provide a retrieval system, which includes the cross-modal retrieval device and model training device described in the above embodiments.
  • the model training device is used to train the model involved in the above embodiment.
  • the computer program product includes one or more computer instructions.
  • the computer may be a general purpose computer, a special purpose computer, a computer network, or other programmable device.
  • the computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wired (such as coaxial cable, optical fiber, or digital subscriber line (DSL)) or wireless (such as infrared, radio, or microwave) means.
  • the computer-readable storage medium can be any available medium that can be accessed by a computer, or a data storage device such as a server or data center integrated with one or more available media.
  • the available media may be magnetic media (such as floppy disks, hard disks, or tapes), optical media (such as digital versatile discs (DVD)), or semiconductor media (such as solid state disks (SSD)), etc.
  • the computer-readable storage media mentioned in the embodiments of this application may be non-volatile storage media, in other words, may be non-transitory storage media.
  • a multi-feature fine ranking and filtering module is proposed: open semantic recall is divided into two steps, first recall and fine ranking and filtering. Based on the feature retrieval results obtained from the first recall, visual features and text features are simultaneously sent to the fine ranking and filtering module.
  • the fine ranking and filtering module performs fine-grained analysis based on the multi-feature input, thereby revising the sorting of the first recall and performing further filtering, which improves the recall effect.
  • the embodiments of this application are applied in cross-modal search scenarios, such as: users input text and search for images and videos. Cooperate with the deep learning network to complete classification marking and cross-modal feature extraction to complete cross-modal retrieval; including computer vision, natural language processing and other fields.
  • the system architecture of the embodiment of the present application is shown in Figure 2 and mainly includes: 1. a dual-path recall component (i.e., matching of text labels and visual labels, and matching of text features and visual features); 2. a dual-path fusion component (i.e., the retrieval result is determined according to whether at least one first visual data and at least one second visual data exist at the same time); 3. a fine ranking and filtering component within component 1 (refined processing of the at least one second visual data).
  • the classification scheme has extremely high accuracy, which is conducive to ensuring the recall effect of key scenes;
  • the cross-modal scheme has extremely broad recognition capabilities and can recognize fine-grained descriptions such as adjectives and compound words; combining the two allows each to complement the other, achieving a comprehensive improvement in both the accuracy and the breadth of the retrieval effect;
  • the refined filtering module uses a variety of cross-modal features as input to further improve the results.
  • the core of the embodiment of the present application is implemented in computer code, as shown in Figure 2 above; it is divided into a dual-path recall component, a dual-path fusion component, and a fine ranking and filtering component within the dual-path recall component.
  • the dual-path recall component and the fine ranking and filtering component need to be implemented with the help of deep learning networks.
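To show how the dual-path recall and dual-path fusion components described here could be wired together, a rough Python sketch follows. The callables standing in for the text model, the tag recall, the open semantic recall and the fusion strategy, as well as the toy data, are illustrative assumptions rather than the actual implementation of this application.

```python
# High-level sketch of dual-path recall followed by configurable fusion.

from typing import Callable, List, Sequence, Set


def cross_modal_retrieve(query: str,
                         extract_tag: Callable[[str], str],
                         extract_feature: Callable[[str], Sequence[float]],
                         tag_recall: Callable[[str], Set[int]],
                         semantic_recall: Callable[[Sequence[float]], List[int]],
                         fuse: Callable[[Set[int], Set[int]], Set[int]]) -> List[int]:
    """Dual-path recall (label recall + open semantic recall) followed by fusion."""
    text_tag = extract_tag(query)            # text label of the query
    text_feature = extract_feature(query)    # text feature of the query

    first = tag_recall(text_tag)             # label recall  -> "first visual data"
    second = semantic_recall(text_feature)   # open semantic recall -> ranked "second visual data"

    if first and second:
        kept = fuse(first, set(second))      # dual-path fusion: union or intersection
        # Keep the semantic ranking for fused items; append tag-only items at the end.
        return [v for v in second if v in kept] + sorted(kept - set(second))
    if first:
        return sorted(first)                 # only the label recall produced results
    return list(second)                      # only the open semantic recall produced results


# Toy usage with stand-in components (a real system would use a text model,
# a visual tag library and a visual feature library instead).
result = cross_modal_retrieve(
    "giant panda eating bamboo",
    extract_tag=lambda q: "giant panda",
    extract_feature=lambda q: [0.1, 0.2],
    tag_recall=lambda tag: {1, 2},
    semantic_recall=lambda feat: [3, 2],
    fuse=lambda a, b: a | b,                 # union, e.g. for a non-sensitive tag
)
print(result)  # [3, 2, 1]
```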
  • Step 1 The user stores the pictures and videos that need to be retrieved in advance for analysis by the visual model.
  • Step 2 Developers configure the "fusion strategy" in the dual-channel fusion component in advance.
  • Step 3 The visual model analyzes the user data one by one to obtain a visual feature library and a visual label library.
  • the visual feature library is used in the open semantic recall in subsequent steps; the visual label library is used in the label recall in subsequent steps.
  • Step 4 The user enters a text description (query) to trigger cross-modal retrieval.
  • Step 5 Based on the query in step 4, the retrieval system extracts the tag content and uses the tags to search the visual tag library generated in step 3 to obtain the tag recall.
  • Step 6 Based on the query in step 4, the retrieval system sends the query to the text model to generate text features; it uses the text features to retrieve the visual feature library generated in step 3 to obtain the first recall.
  • Step 7 Send the visual features of the first recall in step 6 and the text features to the fine ranking and filtering module at the same time, and perform fine ranking and filtering on the first recall in step 6 to obtain the open semantic recall.
  • Step 8 Send the label recall in step 5 and the open semantic recall in step 7 to the dual-channel fusion component at the same time; depending on whether the dual-channel recall exists at the same time, execution is divided into three situations:
  • Dual-path recall exists at the same time: the results of dual-path recall are fused, and the fusion result is returned through a configurable fusion strategy;
  • Step 1 The user enters the search content (query).
  • Step 2 The system obtains label recall and open semantic recall based on user input.
  • Step 3 Based on the recall situation in step 2, the system selects the following three fusion situations:
  • Step 3.1 (directly hitting the tag):
  • the result of tag recall is directly returned.
  • the tag recall directly returns the search results based on the tag entered by the user, such as the returned pictures and videos.
  • assume that the preset labels contain "giant panda" and the user input is "giant panda".
  • the user input directly hits the tag, and the tag (giant panda) recall result will be returned instead of the semantic recall result.
  • the setting method of the preset tags is not limited here; for example, they can be set manually based on product requirements and application scenario requirements.
  • Step 3.2 When the user inputs a search sentence that does not contain any preset tag, the open semantic recall results are directly returned. Open semantic recall refers to returning search results that match the user input, such as returned pictures and videos. Assume that the user enters "lithography machine" and "lithography machine" is not among the preset labels. In this case, the user input does not hit any label, so the open semantic recall result is returned without a label recall result.
  • Step 3.3 When the user input contains a preset tag (or, for example, a synonym of a preset tag), the results of both tag recall and open semantic recall need to be returned.
  • assume that the preset labels contain "giant panda" and the user input is "giant panda eating bamboo".
  • the "giant panda" contained in the user's input hits the preset label, so it is necessary to return both the label recall results and the open semantic recall results; since there are two paths of results, fusion is required.
  • the fusion plan can be preset for the system according to the sensitivity of the tag.
  • the fusion scheme can choose a union or an intersection based on how much the scene emphasizes recall quantity versus accuracy. For example, if a non-sensitive tag (such as giant panda) is hit, it is recommended to take the union of the tag hit results and the semantic hit results; if a sensitive tag is hit, such as offensive content or content that can easily be interpreted in a borderline way, it is recommended to take the intersection of the tag hit results and the semantic hit results, thus ensuring error-free recall to the greatest extent.
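A minimal sketch of such a configurable, sensitivity-driven fusion strategy might look as follows; the strategy table, the tag categories and the set representation are assumptions for illustration only.

```python
# Configurable fusion strategy keyed by tag sensitivity (illustrative).

FUSION_STRATEGY = {
    "non_sensitive": "union",        # e.g. "giant panda": favour recall quantity
    "sensitive": "intersection",     # risky content: favour error-free recall
}


def fuse_by_sensitivity(tag_hits: set, semantic_hits: set, sensitivity: str) -> set:
    if FUSION_STRATEGY.get(sensitivity) == "intersection":
        return tag_hits & semantic_hits
    return tag_hits | semantic_hits


print(fuse_by_sensitivity({1, 2}, {2, 3}, "non_sensitive"))  # {1, 2, 3}
print(fuse_by_sensitivity({1, 2}, {2, 3}, "sensitive"))      # {2}
```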
  • the two technical solutions of classification labeling and cross-modal contrastive learning can complement each other's strengths.
  • the classification scheme has extremely high accuracy, which is conducive to ensuring the recall effect of key scenes;
  • the cross-modal scheme has extremely broad recognition capabilities and can recognize fine-grained descriptions such as adjectives and compound words; combining the two allows each to complement the other, achieving a comprehensive improvement in both the accuracy and the breadth of the retrieval effect.
  • the embodiment of this application designs a system solution that integrates classification tags and cross-modal contrastive learning technology to achieve high-precision retrieval of open content without context; it is applicable to a wider scope and can be extended to device-side usage scenarios (such as mobile phone photo albums).
  • the embodiments of this application combine the two and learn from each other's strengths to achieve a comprehensive improvement in accuracy and breadth of retrieval effects.
  • the refined filtering technology solution shown in Figure 5 can be used in the post-processing process of open semantic recall, or can be combined with the above two-way fusion method. It can be realized by different methods and is not limited here. If the refined filtering module is used in the open semantic recall process, it can improve the quality of open semantic recall, optimize sorting, and delete difficult and wrong cases.
  • the fine sorting and filtering module in the embodiment of this application needs to use the visual features recalled for the first time and the textual features of this retrieval to simultaneously send the visual and textual features into the fine sorting and filtering module to achieve an improved recall effect. See Figure 5 for details of the technical solution. The steps are as follows:
  • Step 1 Send the image or video into the visual model to obtain the visual feature library.
  • Step 2 Feed user input into the text model to obtain text features.
  • Step 3 Use text features to retrieve visual features, take the results with similarity greater than the threshold as the first recalled pictures or videos, and obtain the first recalled visual features. Each feature corresponds to a picture or video.
  • Step 4 Input the first recalled visual features and text features into the refined filtering module at the same time.
  • the refined filtering module includes a neural network model and may also include a data processing module; the neural network model outputs the similarity result of each "visual-text" pair or the judgment result of whether the pair matches.
  • Step 5 Obtain the result after fine sorting and filtering. See Figure 6 for the effect display.
  • FIG. 6 shows the effect after fine filtering.
  • Fine filtering has two main functions:
  • Function 1 Delete difficult and wrong cases from the first recall; as shown in Figure 6, a result that does not match the query can be removed in the fine ranking and filtering stage.
  • Function 2 Adjust the ordering of the first recall results to improve the recall experience. As shown in Figure 6, the appearance characteristics of the sparrow ranked 1 in the first recall are not as clear as those of 3 and 4. In the fine ranking and filtering stage, since the visual and textual features are simultaneously fed into the model for fine-grained analysis, better discrimination can be made, which helps to obtain more reasonable sorting results.
  • this technical solution first performs the first recall by decoupling the visual model and the text model; while ensuring inference efficiency, it achieves a wide range of recognition capabilities, can recognize fine-grained descriptions such as adjectives and compound words, and narrows the range of refined filtering to a controllable scope; further, a refined filtering module is added to eliminate difficult and wrong examples, adjust the recall order, and improve recall quality.
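For illustration of the first-recall step described in steps 1-3 above (retrieving visual features whose similarity to the text feature exceeds a threshold), the sketch below uses NumPy with made-up feature vectors and a made-up threshold; it is a sketch under those assumptions, not the exact retrieval code of this application.

```python
# Cosine-similarity thresholding for the "first recall" (illustrative).

import numpy as np


def first_recall(text_feature: np.ndarray,
                 visual_features: np.ndarray,
                 threshold: float = 0.8):
    """Return (index, similarity) pairs of visual items passing the threshold,
    sorted by descending similarity."""
    t = text_feature / np.linalg.norm(text_feature)
    v = visual_features / np.linalg.norm(visual_features, axis=1, keepdims=True)
    sims = v @ t                                   # cosine similarity per visual item
    idx = np.where(sims > threshold)[0]
    order = idx[np.argsort(-sims[idx])]            # descending similarity
    return [(int(i), float(sims[i])) for i in order]


# Toy example: three "visual features" and one "text feature".
visual = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
text = np.array([1.0, 0.05])
print(first_recall(text, visual, threshold=0.8))   # items 0 and 1 are recalled
```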
  • An open semantics and label dual-path recall technology: based on the classification algorithm and the cross-modal feature matching algorithm, the label recall results and the open semantic recall results are obtained simultaneously as the first recall.
  • The fine ranking and filtering module is implemented on top of the first recall of technical point 1: based on the first recall results, text and visual features are simultaneously input into the fine ranking and filtering module for refined sorting and filtering.
  • Dual-path fusion: based on a configurable fusion strategy, the results after dual-path fusion are obtained.
  • The open semantic and label dual-path recall technology obtains open semantic and label recall results at the same time, ensuring the completeness of the recall and providing a high-quality starting point for the subsequent fine ranking and filtering.
  • The multi-feature fine ranking and filtering module feeds visual and textual information into the model at the same time for fine-grained analysis, obtaining better ranking results and deleting difficult and wrong cases.
  • The dual-path fusion technology of open semantics and tags performs configurable fusion of the open semantic and tag recall results; by adjusting the fusion method, sensitive scenes will not trigger public-opinion risks, and non-sensitive scenes will improve user recall satisfaction.
  • the information (including but not limited to user equipment information, user personal information, etc.), data (including but not limited to data used for analysis, stored data, displayed data, etc.) and signals involved in this application are all authorized by the user or fully authorized by all parties, and the collection, use and processing of the relevant data must comply with the relevant laws, regulations and standards of the relevant countries and regions.

Abstract

本申请公开了一种跨模态检索方法、装置、设备、存储介质及计算机程序,属于信息检索领域。所述方法包括:提取检索文本的文本标签和文本特征;基于文本标签和被检索视觉数据的视觉标签,确定被检索视觉数据中是否存在视觉标签与文本标签匹配的至少一个第一视觉数据,被检索视觉数据包括图像和/或视频;基于文本特征和被检索视觉数据的视觉特征,确定被检索视觉数据中是否存在视觉特征与文本特征匹配的至少一个第二视觉数据;如果被检索视觉数据中存在至少一个第一视觉数据和至少一个第二视觉数据,则基于至少一个第一视觉数据和至少一个第二视觉数据确定检索结果。本申请能够同时提升跨模态的检索精度和检索广度。

Description

跨模态检索方法、装置、设备、存储介质及计算机程序
本申请要求于2022年09月07日提交的申请号为202211091658.9、发明名称为“跨模态检索方法、装置、设备、存储介质及计算机程序”的中国专利申请的优先权,以及要求于2023年08月31日提交的申请号为202311130428.3、发明名称为“跨模态检索方法、装置、设备、存储介质及计算机程序”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。
技术领域
本申请涉及信息检索领域,特别涉及一种跨模态检索方法、装置、设备、存储介质及计算机程序。
背景技术
随着科学技术的发展,图像、文本、视频等多模态数据爆炸式增长,而且用户对于检索的需求不再停留在以文本检索文本的形式,所以跨模态检索随之产生。跨模态检索是以某一种模态的数据去检索另一种模态的数据的检索形式,比如,用户通过输入文本来检索图像或者视频。然而,由于不同模态的数据差异性较大,跨模态检索仍然面临很大挑战,如何不限定用户输入的内容,并反馈给用户想要的图像或者视频,从而实现开放内容的跨模态检索,满足用户的实际体验是目前非常重要的问题。因此,亟需一种跨模态的检索方法。
发明内容
本申请提供了一种跨模态检索方法、装置、设备、存储介质及计算机程序,能够实现多场景下的开放内容跨模态的检索。所述技术方案如下:
第一方面,提供了一种跨模态检索方法,所述方法包括:提取检索文本的文本标签和文本特征;基于所述文本标签和被检索视觉数据的视觉标签,确定所述被检索视觉数据中是否存在视觉标签与所述文本标签匹配的至少一个第一视觉数据,所述被检索视觉数据包括图像和/或视频;基于所述文本特征和所述被检索视觉数据的视觉特征,确定所述被检索视觉数据中是否存在视觉特征与所述文本特征匹配的至少一个第二视觉数据;基于所述至少一个第一视觉数据和所述至少一个第二视觉数据确定检索结果。
可选地,如果所述被检索视觉数据中存在所述至少一个第一视觉数据和所述至少一个第二视觉数据,则基于所述至少一个第一视觉数据和所述至少一个第二视觉数据确定检索结果。
由于被检索视觉数据的视觉标签的范围固定,所以,通过被检索视觉数据的视觉标签和检索文本的文本标签确定出的第一视觉数据比较精确,也即是,通过标签匹配能够精确地控制检索范围。并且,由于检索文本是具有语义开放性的自然语言的描述信息,所以,通过被检索视觉数据的视觉特征和检索文本的文本特征确定出的第二视觉数据没有语义上的限制,支持自然语义检索,检索比较灵活,检索范围也比较广,能够识别形容词等细粒度的检索文本。这样,在被检索视觉数据中同时存在第一视觉数据和第二视觉数据的情况下,将第一视觉数据和第二视觉数据进行融合,能够同时提升跨模态的检索精度和检索广度。
本申请提供的跨模态检索方法可以应用于网络侧场景,也可应用于端侧场景。检索文本根据应用场景的不同,获取的方式也不同。例如,在网络侧场景中,用户终端提供检索页面以供用户在检索页面内的检索框中输入检索文本,然后,用户终端将检索框中输入的检索文本发送给服务器,由服务器提取该检索文本的文本标签和文本特征。在端侧场景中,用户终端提供检索页面以供用户在检索页面内的检索框中输入检索文本,然后,用户终端直接提取检索框中输入的检索文本的文本标签和文本特征。
将检索文本的文本标签与被检索视觉数据的视觉标签进行匹配,以确定被检索视觉数据中是否存在视觉标签与文本标签相同或者属于同义词的视觉数据。如果被检索视觉数据中存在视觉标签与检索文本的文本标签相同或者属于同义词的视觉数据,则确定被检索视觉数据中存在至少一个第一视觉数据,该至少一个第一视觉数据为被检索视觉数据中视觉标签与检索文本的文本标签相同或者属于同义词的视觉数据;如果被检索视觉数据中不存在视觉标签与检索文本的文本标签相同或者属于同义词的视觉数据,则确定被检索视觉数据中不存在至少一个第一视觉数据。
确定检索文本的文本特征与被检索视觉数据的视觉特征之间的相似度,如果被检索视觉数据中存在视觉特征与检索文本的文本特征之间的相似度大于第二相似度阈值的视觉数据,则确定被检索视觉数据中存在至少一个第二视觉数据,该至少一个第二视觉数据为被检索视觉数据中视觉特征与检索文本的文本特征之间的相似度大于第二相似度阈值的视觉数据;如果被检索视觉数据中不存在视觉特征与检索文本的文本特征之间的相似度大于第二相似度阈值的视觉数据,则确定被检索视觉数据中不存在至少一个第二视觉数据。
经上述步骤判断可得,被检索视觉数据中可能同时存在至少一个第一视觉数据和至少一个第二视觉数据,也可能只存在至少一个第一视觉数据或者只存在至少一个第二视觉数据。如果被检索视觉数据中同时存在至少一个第一视觉数据和至少一个第二视觉数据,可以按照融合策略对至少一个第一视觉数据和至少一个第二视觉数据进行融合,以得到检索结果。如果被检索视觉数据中存在至少一个第一视觉数据但不存在至少一个第二视觉数据,则将至少一个第一视觉数据作为检索结果。如果被检索视觉数据中存在至少一个第二视觉数据但不存在至少一个第一视觉数据,则将至少一个第二视觉数据作为检索结果。
其中,融合策略是预先设定好的,可以根据应用场景对于检索结果的数量和准确性的侧重程度,选择将至少一个第一视觉数据和至少一个第二视觉数据取并集或交集来作为检索结果。也即是,当应用场景更侧重于检索结果的数量时,将至少一个第一视觉数据和至少一个第二视觉数据取并集作为检索结果;当应用场景更侧重于检索结果的准确性时,将至少一个第一视觉数据和至少一个第二视觉数据取交集作为检索结果。
作为一种示例,如果所述至少一个第一视觉数据的视觉标签属于第一类标签,则将所述至少一个第一视觉数据和所述至少一个第二视觉数据的交集作为所述检索结果,所述第一类标签是指表征视觉数据时具有不确定性的标签。
由于第一类标签是指表征视觉数据时具有不确定性的标签,所以,在至少一个第一视觉数据的视觉标签属于第一类标签的情况下,表明该至少一个第一视觉数据的视觉标签可能不一定能够准确地表达相应的视觉数据的内容,此时,为了保证检索结果的准确性,将至少一个第一视觉数据和至少一个第二视觉数据的交集作为检索结果。
作为一种示例,如果所述至少一个第一视觉数据的视觉标签属于第二类标签,则将所述至少一个第一视觉数据和所述至少一个第二视觉数据的并集作为所述检索结果,所述第二类标签是指表征视觉数据时具有确定性的标签。
由于第二类标签是指表征视觉数据时具有确定性的标签,所以,在至少一个第一视觉数据的视觉标签属于第二类标签的情况下,表明该至少一个第一视觉数据的视觉标签能够准确地表达相应的视觉数据的内容,此时,为了保证检索结果的数量,将至少一个第一视觉数据和至少一个第二视觉数据的并集作为检索结果。
在被检索视觉数据中存在至少一个第二视觉数据的情况下,可以直接按照上述方法来确定检索结果。当然,还可以对至少一个第二视觉数据进行更精确地处理之后,再按照上述方法来确定检索结果。其中,对至少一个第二视觉数据进行更精确处理的方法包括多种,接下来对其中的一种方法进行介绍。
如果所述被检索视觉数据中存在所述至少一个第二视觉数据,将所述至少一个第二视觉数据的视觉特征和所述文本特征输入至神经网络模型中,以得到模型推理结果,所述模型推理结果包括相似性结果和/或成对判断结果,所述相似性结果指示所述至少一个第二视觉数据分别与所述检索文本之间的相似度,所述成对判断结果指示所述至少一个第二视觉数据分别与所述检索文本是否能够成对;基于所述模型推理结果对所述至少一个第二视觉数据进行处理。
模型推理结果可以只包括相似性结果,也可以只包括成对判断结果,还可以包括相似性结果和成对判断结果。在不同的情况下,基于模型推理结果对至少一个第二视觉数据进行处理的方式不同,接下来将分别进行介绍。
可选地,所述模型推理结果包括相似性结果;此时,基于所述相似性结果,从所述至少一个第二视觉数据中筛选出与所述检索文本之间的相似度大于第一相似度阈值的第二视觉数据。
由于相似性结果是神经网络模型将第二视觉数据的视觉特征和检索文本的文本特征结合后进行细粒度地分析得到的,所以,该相似性结果能够更精确地表征第二视觉数据与检索文本的相似度,通过该相似 性结果,对至少一个第二视觉数据进行筛选,能够筛选掉与检索文本真正不太相似的视觉数据,保留与检索文本真正相似的视觉数据,从而提升最终确定的检索结果的准确性。
可选地,所述模型推理结果包括成对判断结果;此时,基于所述成对判断结果,从所述至少一个第二视觉数据中筛选出与所述检索文本能够成对的第二视觉数据。
由于成对判断结果是神经网络模型将第二视觉数据的视觉特征和检索文本的文本特征结合后进行细粒度地分析得到的,所以,该成对判断结果能够更精确地表征第二视觉数据与检索文本是否能够成对,通过该成对判断结果,对至少一个第二视觉数据进行筛选,能够筛选掉与检索文本不成对的视觉数据,保留与检索文本成对的视觉数据,从而过滤掉不合理的视觉数据,提升最终确定的检索结果的准确性。
可选地,所述模型推理结果包括相似性结果和成对判断结果;此时,基于所述成对判断结果,从所述至少一个第二视觉数据中筛选出与所述检索文本能够成对的第二视觉数据;基于所述相似性结果,按照筛选出的第二视觉数据与所述检索文本之间的相似度从大到小的顺序,对所述筛选出的第二视觉数据进行排序。
在模型推理结果既包括相似性结果,也包括成对判断结果的情况下,首先根据至少一个第二视觉数据中的每个第二视觉数据与检索文本是否能够成对,对至少一个第二视觉数据进行筛选,以保留与检索文本能够成对的第二视觉数据,删除与检索文本不能成对的第二视觉数据;然后,按照筛选后的第二视觉数据与检索文本之间的相似度从大到小的顺序,对筛选后的第二视觉数据进行排序,从而提升最终反馈给用户的检索结果的排序合理性。
第二方面,提供了一种跨模态检索方法,所述方法包括:
提取检索文本的文本标签和文本特征;
基于所述文本标签和被检索视觉数据的视觉标签,得到标签匹配结果,所述被检索视觉数据包括图像和/或视频;
基于所述文本特征和所述被检索视觉数据的视觉特征,得到特征匹配结果;
基于所述标签匹配结果和所述特征匹配结果,得到检索结果。
可选地,所述标签匹配结果包括至少一个第一视觉数据,且所述特征匹配结果包括至少一个第二视觉数据,所述基于所述标签匹配结果和所述特征匹配结果,得到检索结果包括:
将所述标签匹配结果包括的所述至少一个第一视觉数据和所述特征匹配结果包括的所述至少一个第二视觉数据的并集或交集作为所述检索结果。
可选地,所述标签匹配结果包括至少一个第一视觉数据,所述特征匹配结果指示不存在匹配数据,所述基于所述标签匹配结果和所述特征匹配结果,得到检索结果包括:
将所述标签匹配结果包括的所述至少一个第一视觉数据作为所述检索结果。
可选地,所述标签匹配结果指示不存在匹配数据,所述特征匹配结果包括至少一个第二视觉数据,所述基于所述标签匹配结果和所述特征匹配结果,得到检索结果包括:
将所述特征匹配结果包括的所述至少一个第二视觉数据作为所述检索结果。
可选地,所述标签匹配结果包括至少一个第一视觉数据,且所述至少一个第一视觉数据中的部分或全部视觉数据的视觉标签包括所述文本标签时,将所述至少一个第一视觉数据中的视觉标签包括所述文本标签的视觉数据作为所述检索结果。
可选地,当所述标签匹配结果指示不存在匹配数据,且所述特征匹配结果指示不存在匹配数据时,所述检索结果指示不存在匹配数据。
可选地,所述将所述标签匹配结果包括的所述至少一个第一视觉数据和所述特征匹配结果包括的所述至少一个第二视觉数据的并集或交集作为所述检索结果包括:
如果所述至少一个第一视觉数据的视觉标签属于预设的第一类标签,则将所述至少一个第一视觉数据和所述至少一个第二视觉数据的交集作为所述检索结果;
如果所述至少一个第一视觉数据的视觉标签属于预设的第二类标签,则将所述至少一个第一视觉数据和所述至少一个第二视觉数据的并集作为所述检索结果。
可选地,所述基于所述文本特征和所述被检索视觉数据的视觉特征,得到特征匹配结果包括:
将所述文本特征和所述被检索视觉数据的视觉特征进行特征匹配,以得到第一特征匹配结果,所述第 一特征匹配结果包括至少一个第三视觉数据;
将所述文本特征和所述第一特征匹配结果输入预设模型,以得到所述特征匹配结果,所述特征匹配结果包括所述第一特征匹配结果中的部分或全部第三视觉数据,且所述特征匹配结果中包括的第三视觉数据按照与所述文本特征的相似度排序。
可选地,所述方法还包括:
接收用户输入的所述检索文本。
可选地,所述检索结果包括至少一个图像和/或视频,所述方法还包括:
显示所述检索结果。
第三方面,提供了一种跨模态检索装置,所述跨模态检索装置具有实现上述第一方面中跨模态检索方法行为的功能。所述跨模态检索装置包括至少一个模块,该至少一个模块用于实现上述第一方面或第二方面所提供的跨模态检索方法。
第四方面,提供了一种电子设备,所述电子设备包括处理器和存储器,所述存储器用于存储执行上述第一方面或第二方面所提供的跨模态检索方法的计算机程序。所述处理器被配置为用于执行所述存储器中存储的计算机程序,以实现上述第一方面或第二方面所述的跨模态检索方法。
可选地,所述电子设备还可以包括通信总线,该通信总线用于该处理器与存储器之间建立连接。
第五方面,提供了一种计算机可读存储介质,所述存储介质内存储有指令,当所述指令在计算机上运行时,使得计算机执行上述第一方面或第二方面所述的跨模态检索方法的步骤。
第六方面,提供了一种包含指令的计算机程序产品,当所述指令在计算机上运行时,使得计算机执行上述第一方面或第二方面所述的跨模态检索方法的步骤。或者说,提供了一种计算机程序,当所述计算机程序在计算机上运行时,使得计算机执行上述第一方面或第二方面所述的跨模态检索方法的步骤。
第七方面,提供了一种芯片,所述芯片包括处理器和接口电路,所述接口电路用于接收指令并传输至所述处理器,所述处理器用于执行上述第一方面或第二方面所述的跨模态检索方法的步骤。
第八方面,提供了一种检索系统,所述检索系统包括上述第三方面所述的跨模态检索装置以及模型训练装置。
上述第二方面至第八方面所获得的技术效果与第一方面中对应的技术手段获得的技术效果近似,在这里不再赘述。
附图说明
图1是本申请实施例提供的一种跨模态检索方法的流程图;
图2是本申请实施例提供的另一种跨模态检索方法的流程图;
图3是本申请实施例提供的一种视觉数据的融合方法的流程图;
图4是本申请实施例提供的一种跨模态检索的示意图;
图5是本申请实施例提供的一种对第二视觉数据进行处理的方法的流程图;
图6是本申请实施例提供的一种对第二视觉数据进行处理的示意图;
图7是本申请实施例提供的一种跨模态检索的用户界面示意图;
图8是本申请实施例提供的另一种跨模态检索方法的流程图;
图9是本申请实施例提供的一种跨模态检索装置的结构示意图;
图10是本申请实施例提供的一种电子设备的结构示意图;
图11是本申请实施例提供的一种用户终端的结构示意图;
图12是本申请实施例提供的另一种用户终端的结构示意图。
具体实施方式
为使本申请实施例的目的、技术方案和优点更加清楚,下面将结合附图对本申请实施方式作进一步地详细描述。
在对本申请实施例提供的跨模态检索方法进行详细地解释说明之前,先对本申请实施例涉及的应用场景进行介绍。
来自信息领域的数据形式多种多样,其每一种形式都可以看作是一种模态,例如文本、视频、图像以及语音等。跨模态检索通常指从多模态的数据中,以某一种模态的数据去检索另一种模态的数据,例如,以文本检索图像或者视频,以图像检索文本或者视频等。跨模态检索是不同模态的数据之间交互的桥梁,其重点在于自动理解、关联不同模态的数据之间的关键要素,并实现相对准确的交叉匹配。随着NLP(Natural language processing,自然语言处理)技术和CV(Computer Vision,计算机视觉)技术的发展壮大,网络和手机等设备上存储的图像和视频越来越多,用户的检索已不单单局限于检索文本,用户进行跨模态检索的需求与日俱增。
本申请实施例提供的跨模态检索方法可以应用于搜索引擎等网络侧场景,也可应用于端侧场景,例如手机端检索手机相册中的图像或者视频,当然并不仅限于手机相册,在其他类似的场景下也同样适用,例如在聊天软件的历史记录中输入文本来检索图像或者视频等。通过本申请实施例提供的跨模态检索方法,能够为用户提供更加开放和精准的检索结果,以满足用户的实际体验。此外,该方法不仅可以应用在搜索引擎等网络侧的检索场景中,还可以应用在内容推荐等网络侧场景中,例如新闻资讯推荐、商品购买推荐等,通过对用户检索新闻资讯或商品的历史记录进行统计来确定检索文本,进而推荐类似的内容。
由于本申请实施例提供的跨模态检索方法可以应用于网络侧场景,也可应用于端侧场景,所以本申请实施例的执行主体可以为服务器,也可以为用户终端。为了便于描述,将本申请实施例的执行主体统称为电子设备。
当该电子设备为服务器时,该电子设备可以是一台独立的服务器,也可以是由多台物理服务器组成的服务器集群或者分布式系统,还可以是提供云服务、云数据库、云计算、云函数、云存储、网络服务、云通信、中间件服务、域名服务、安全服务、CDN(Content Delivery Network,内容分发网络)、以及大数据和人工智能平台等基础云计算服务的云服务器,或者是一个云计算服务中心。
当该电子设备为用户终端时,该电子设备可以是任何一种可与用户通过键盘、触摸板、触摸屏、遥控器、语音交互或手写设备等一种或多种方式进行人机交互的电子产品,例如个人计算机(personal computer,PC)、手机、智能手机、个人数字助手(personal digital assistant,PDA)、可穿戴设备、掌上电脑(pocket pc,PPC)、平板电脑、智能车机等。
本领域技术人员应能理解上述应用场景和电子设备仅为举例,其他现有的或今后可能出现的应用场景和电子设备如可适用于本申请实施例,也应包含在本申请实施例保护范围以内,并在此以引用方式包含于此。
接下来对本申请实施例提供的跨模态检索方法进行解释说明。
请参考图1,图1是本申请实施例提供的一种跨模态检索方法的流程图,该方法应用于电子设备中,该方法包括如下步骤。
步骤101:提取检索文本的文本标签和文本特征。
将检索文本输入第一文本模型,以得到检索文本的文本标签和文本特征。可选地,也可以将检索文本输入第二文本模型,以得到检索文本的文本特征,将检索文本输入第三文本模型,以得到检索文本的文本标签,或者,通过文本标签提取算法提取检索文本的文本标签。也即是,当采用第一文本模型提取文本标签和文本特征时,第一文本模型的输入为检索文本,第一文本模型的输出为检索文本的文本标签和文本特征。当采用第二文本模型提取文本特征时,第二文本模型的输入为检索文本,第二文本模型的输出为检索文本的文本特征。当采用第三文本模型提取文本标签时,第三文本模型的输入为检索文本,第三文本模型的输出为检索文本的文本标签。
其中,第一文本模型、第二文本模型和第三文本模型的结构可以不同,文本标签提取算法的实现过程 根据不同的需求也可能不同,本申请实施例对此不做限定。文本标签指示通过检索文本进行检索的对象的分类结果,比如,检索文本为“吃竹子的大熊猫”,那么确定通过该检索文本需要检索大熊猫,这样,可以将“大熊猫”确定为检索文本的文本标签。文本特征指示检索文本的特征。
基于上文所述,本申请实施例提供的跨模态检索方法可以应用于网络侧场景,也可应用于端侧场景。检索文本根据应用场景的不同,获取的方式也不同。例如,在网络侧场景中,用户终端提供检索页面以供用户在检索页面内的检索框中输入检索文本,然后,用户终端将检索框中输入的检索文本发送给服务器,由服务器提取该检索文本的文本标签和文本特征。在端侧场景中,用户终端提供检索页面以供用户在检索页面内的检索框中输入检索文本,然后,用户终端直接提取检索框中输入的检索文本的文本标签和文本特征。
步骤102:基于检索文本的文本标签和被检索视觉数据的视觉标签,确定被检索视觉数据中是否存在视觉标签与文本标签匹配的至少一个第一视觉数据,被检索视觉数据包括图像和/或视频。
将检索文本的文本标签与被检索视觉数据的视觉标签进行匹配,以确定被检索视觉数据中是否存在视觉标签与文本标签相同或者属于同义词的视觉数据。如果被检索视觉数据中存在视觉标签与检索文本的文本标签相同或者属于同义词的视觉数据,则确定被检索视觉数据中存在至少一个第一视觉数据,该至少一个第一视觉数据为被检索视觉数据中视觉标签与检索文本的文本标签相同或者属于同义词的视觉数据;如果被检索视觉数据中不存在视觉标签与检索文本的文本标签相同或者属于同义词的视觉数据,则确定被检索视觉数据中不存在至少一个第一视觉数据。
其中,属于同义词的标签是提前标注好的,比如标签“花朵”和“鲜花”均为标签“花”的同义词,本申请实施例对标注的方法不做限定。视觉标签指示视觉数据的分类结果,比如,视觉数据为一张鲜花的图片,那么该视觉数据的视觉标签可以为“鲜花”。
在执行步骤102和步骤103之前,还需要提取被检索视觉数据的视觉标签和视觉特征。而且被检索视觉数据的视觉标签和视觉特征可以在步骤102和步骤103之前提取,还可以在步骤101之前提取,当然还可以在电子设备空闲时提取,本申请实施例对被检索视觉数据的视觉标签和视觉特征的提取时机不做限定。
其中,可以将被检索视觉数据输入第一视觉模型,以得到被检索视觉数据的视觉标签和视觉特征。可选地,也可以将被检索视觉数据输入第二视觉模型,以得到被检索视觉数据的视觉特征,将被检索视觉数据输入第三视觉模型,以得到被检索视觉数据的视觉标签。也即是,当采用第一视觉模型提取被检索视觉数据的视觉标签和视觉特征时,第一视觉模型的输入为被检索视觉数据,第一视觉模型的输出为被检索视觉数据的视觉标签和视觉特征;当采用第二视觉模型提取被检索视觉数据的视觉特征时,第二视觉模型的输入为被检索视觉数据,第二视觉模型的输出为被检索视觉数据的视觉特征。当采用第三视觉模型提取被检索视觉数据的视觉标签时,第三视觉模型的输入为被检索视觉数据,第三视觉模型的输出为被检索视觉数据的视觉标签。
其中,第一视觉模型、第二视觉模型和第三视觉模型的结构可以不同,例如,第三视觉模型可以为OCR(Optical Character Recognition,光学字符识别)网络模型,通过对被检索视觉数据进行分析识别处理,以获取文本信息,进而提取文本信息的标签作为被检索视觉数据的视觉标签。
基于上文所述,本申请实施例提供的跨模态检索方法可以应用于网络侧场景,也可应用于端侧场景。被检索视觉数据根据应用场景的不同,存储的位置也不同。在网络侧场景中,被检索视觉数据存储在服务器;在端侧场景中,被检索视觉数据存储在用户终端。其中,被检索视觉数据可以只包括图像,也可以只包括视频,当然也可以两者都有。
步骤103:基于检索文本的文本特征和被检索视觉数据的视觉特征,确定被检索视觉数据中是否存在视觉特征与文本特征匹配的至少一个第二视觉数据。
确定检索文本的文本特征与被检索视觉数据的视觉特征之间的相似度,如果被检索视觉数据中存在视觉特征与检索文本的文本特征之间的相似度大于第二相似度阈值的视觉数据,则确定被检索视觉数据中存在至少一个第二视觉数据,该至少一个第二视觉数据为被检索视觉数据中视觉特征与检索文本的文本特征之间的相似度大于第二相似度阈值的视觉数据;如果被检索视觉数据中不存在视觉特征与检索文本的文本特征之间的相似度大于第二相似度阈值的视觉数据,则确定被检索视觉数据中不存在至少一个第二视觉数据。
其中,文本特征与视觉特征之间的相似度可以通过计算文本特征与视觉特征之间的余弦距离来得到, 也可以通过其他方式计算得到,本申请实施例对此不做限定。并且,第二相似度阈值是预先设定的,比如0.8、0.85等。实际应用中,根据不同的需求可以取不同的值,本申请实施例对此也不做限定。视觉特征指示视觉数据的特征。比如,视觉数据为一张图像,该视觉特征为该图像的特征。
基于上文所述,可以采用同一个视觉模型来提取被检索视觉数据的视觉特征和视觉标签,也可以采用两个不同的视觉模型来分别提取被检索视觉数据的视觉特征和视觉标签,本申请实施例对此不做限定。
步骤104:如果被检索视觉数据中存在至少一个第一视觉数据和至少一个第二视觉数据,则基于至少一个第一视觉数据和至少一个第二视觉数据确定检索结果。
经上述步骤判断可得,被检索视觉数据中可能同时存在至少一个第一视觉数据和至少一个第二视觉数据,也可能只存在至少一个第一视觉数据或者只存在至少一个第二视觉数据。如果被检索视觉数据中同时存在至少一个第一视觉数据和至少一个第二视觉数据,可以按照融合策略对至少一个第一视觉数据和至少一个第二视觉数据进行融合,以得到检索结果。如果被检索视觉数据中存在至少一个第一视觉数据但不存在至少一个第二视觉数据,则将至少一个第一视觉数据作为检索结果。如果被检索视觉数据中存在至少一个第二视觉数据但不存在至少一个第一视觉数据,则将至少一个第二视觉数据作为检索结果。
请参考图2,图2是本申请实施例提供的另一种跨模态检索方法的流程图。首先,提取用户输入的检索文本的文本标签和文本特征,确定被检索视觉数据的视觉标签和视觉特征。将检索文本的文本标签与被检索视觉数据的视觉标签进行匹配,以确定被检索视觉数据中是否存在视觉标签与检索文本的文本标签匹配的第一视觉数据;将检索文本的文本特征与被检索视觉数据的视觉特征进行匹配,以确定被检索视觉数据中是否存在视觉特征与检索文本的文本特征匹配的第二视觉数据。如果第一视觉数据和第二视觉数据同时存在,则采用预先配置好的的融合方案,对第一视觉数据和第二视觉数据进行融合,将融合后的视觉数据作为检索结果。如果第一视觉数据和第二视觉数据不同时存在,则判断第一视觉数据是否存在,如果第一视觉数据存在,则以第一视觉数据作为检索结果;如果第一视觉数据不存在,则以第二视觉数据作为检索结果。
其中,融合策略是预先设定好的,可以根据应用场景对于检索结果的数量和准确性的侧重程度,选择将至少一个第一视觉数据和至少一个第二视觉数据取并集或交集来作为检索结果。也即是,当应用场景更侧重于检索结果的数量时,将至少一个第一视觉数据和至少一个第二视觉数据取并集作为检索结果;当应用场景更侧重于检索结果的准确性时,将至少一个第一视觉数据和至少一个第二视觉数据取交集作为检索结果。
请参考图3,图3是本申请实施例提供的一种视觉数据的融合方法的流程图。如果被检索视觉数据中同时存在至少一个第一视觉数据和至少一个第二视觉数据,则判断至少一个第一视觉数据的视觉标签属于第一类标签还是第二类标签,如果至少一个第一视觉数据的视觉标签属于第一类标签,则将至少一个第一视觉数据和至少一个第二视觉数据的交集作为检索结果,第一类标签是指表征视觉数据时具有不确定性的标签;如果至少一个第一视觉数据的视觉标签属于第二类标签,则将至少一个第一视觉数据和至少一个第二视觉数据的并集作为检索结果,第二类标签是指表征视觉数据时具有确定性的标签。
由于第一类标签是指表征视觉数据时具有不确定性的标签,所以,在至少一个第一视觉数据的视觉标签属于第一类标签的情况下,表明该至少一个第一视觉数据的视觉标签可能不一定能够准确地表达相应的视觉数据的内容,此时,为了保证检索结果的准确性,将至少一个第一视觉数据和至少一个第二视觉数据的交集作为检索结果。
由于第二类标签是指表征视觉数据时具有确定性的标签,所以,在至少一个第一视觉数据的视觉标签属于第二类标签的情况下,表明该至少一个第一视觉数据的视觉标签能够准确地表达相应的视觉数据的内容,此时,为了保证检索结果的数量,将至少一个第一视觉数据和至少一个第二视觉数据的并集作为检索结果。
其中,第一类标签和第二类标签是事先进行设置的,并且,可以根据产品需求和应用场景需求的不同来设置不同的第一类标签和第二类标签,本申请实施例对第一类标签和第二类标签的设置方法不做限定。
比如,请参考图4,将检索文本的文本标签与被检索视觉数据的视觉标签进行匹配的过程称为标签召回,将检索文本的文本特征与被检索视觉数据的视觉特征进行匹配的过程称为开放语义召回,也可以称为向量召回。经过标签召回和开放语义召回之后可能存在三种情况,即,经过标签召回得到至少一个第一视觉数据但经过开放语义召回未得到结果,或者,经过开放语义召回得到至少一个第二视觉数据但经过标签 召回未得到结果,或者,经过标签召回得到至少一个第一视觉数据且经过开放语义召回得到至少一个第二视觉数据。
假设,检索文本为“吃竹子的大熊猫”,该检索文本的文本标签为“大熊猫”,如果经过标签召回确定“大熊猫”命中被检索视觉数据的视觉标签,但是经过开放语义召回未得到结果,则将被检索视觉数据中视觉标签为“大熊猫”的视觉数据作为检索结果。
又假设,检索文本为“吃竹子的大熊猫”,该检索文本的文本标签为“大熊猫”,如果经过标签召回确定“大熊猫”命中被检索视觉数据的视觉标签,则将被检索视觉数据中视觉标签为“大熊猫”的视觉数据确定为至少一个第一视觉数据。而且,经过开放语义召回得到至少一个第二视觉数据,此时,可以将该至少一个第一视觉数据和该至少一个第二视觉数据取交集或者并集来得到检索结果。
再假设,检索文本为“黑色的光刻机”,该检索文本的文本标签为“光刻机”,如果经过标签召回确定“光刻机”未命中被检索视觉数据的视觉标签,但是,经过开放语义召回得到至少一个第二视觉数据,此时,将该至少一个第二视觉数据作为检索结果。
可选地,被检索视觉数据中还可能既不包括至少一个第一视觉数据,也不包括至少一个第二视觉数据,此时,确定检索结果为空。
在被检索视觉数据中存在至少一个第二视觉数据的情况下,可以直接按照上述方法来确定检索结果。当然,还可以对至少一个第二视觉数据进行更精确地处理之后,再按照上述方法来确定检索结果。其中,对至少一个第二视觉数据进行更精确处理的方法包括多种,接下来对其中的一种方法进行介绍。
如果被检索视觉数据中存在至少一个第二视觉数据,将至少一个第二视觉数据的视觉特征和检索文本的文本特征输入至神经网络模型中,以得到模型推理结果,模型推理结果包括相似性结果和/或成对判断结果,相似性结果指示至少一个第二视觉数据分别与检索文本之间的相似度,成对判断结果指示至少一个第二视觉数据分别与检索文本是否能够成对;基于模型推理结果对至少一个第二视觉数据进行处理。
请参考图5,图5是本申请实施例提供的一种对第二视觉数据进行处理的方法的流程图。为了便于理解,此处将结合首次确定至少一个第二视觉数据的过程进行介绍。即,在离线状态下通过视觉模型对被检索视觉数据进行视觉解析,以得到被检索视觉数据的视觉特征;基于用户在线输入的检索文本,通过文本模型对检索文本进行文本解析,以得到检索文本的文本特征,通过检索文本的文本特征与被检索视觉数据的视觉特征进行特征检索,以确定出至少一个第二视觉数据。然后,将检索文本的文本特征和该至少一个第二视觉数据的视觉特征同时输入神经网络模型,通过神经网络模型的分析对该至少一个第二视觉数据进一步处理,以得到最终的第二视觉数据。
模型推理结果可以只包括相似性结果,也可以只包括成对判断结果,还可以包括相似性结果和成对判断结果。在不同的情况下,基于模型推理结果对至少一个第二视觉数据进行处理的方式不同,接下来将分别进行介绍。
情况1、如果模型推理结果包括相似性结果但不包括成对判断结果,则基于相似性结果,从至少一个第二视觉数据中筛选出与检索文本之间的相似度大于第一相似度阈值的第二视觉数据。
由于相似性结果是神经网络模型将第二视觉数据的视觉特征和检索文本的文本特征结合后进行细粒度地分析得到的,所以,该相似性结果能够更精确地表征第二视觉数据与检索文本的相似度,通过该相似性结果,对至少一个第二视觉数据进行筛选,能够筛选掉与检索文本真正不太相似的视觉数据,保留与检索文本真正相似的视觉数据,从而提升最终确定的检索结果的准确性。
其中,第一相似度阈值是事先设置的,比如,0.85、0.9等。实际应用中,第一相似度阈值根据不同的需求可以取不同的值。而且,第一相似度阈值与第二相似度阈值的取值可以相同,也可以不同,本申请实施例对此不做限定。
可选地,从至少一个第二视觉数据中筛选出与检索文本之间的相似度大于第一相似度阈值的第二视觉数据之后,还可以按照相似度从大到小的顺序,对筛选出的第二视觉数据进行排序,从而提升最终反馈给用户的检索结果的排序合理性。
情况2、如果模型推理结果包括成对判断结果但不包括相似性结果,则基于成对判断结果,从至少一个第二视觉数据中筛选出与检索文本能够成对的第二视觉数据。
由于成对判断结果是神经网络模型将第二视觉数据的视觉特征和检索文本的文本特征结合后进行细粒度地分析得到的,所以,该成对判断结果能够更精确地表征第二视觉数据与检索文本是否能够成对,通 过该成对判断结果,对至少一个第二视觉数据进行筛选,能够筛选掉与检索文本不成对的视觉数据,保留与检索文本成对的视觉数据,从而过滤掉不合理的视觉数据,提升最终确定的检索结果的准确性。
情况3、如果模型推理结果包括相似性结果和成对判断结果,则基于成对判断结果,从至少一个第二视觉数据中筛选出与检索文本能够成对的第二视觉数据;基于相似性结果,按照筛选出的第二视觉数据与检索文本之间的相似度从大到小的顺序,对筛选出的第二视觉数据进行排序。
在模型推理结果既包括相似性结果,也包括成对判断结果的情况下,首先根据至少一个第二视觉数据中的每个第二视觉数据与检索文本是否能够成对,对至少一个第二视觉数据进行筛选,以保留与检索文本能够成对的第二视觉数据,删除与检索文本不能成对的第二视觉数据;然后,按照筛选后的第二视觉数据与检索文本之间的相似度从大到小的顺序,对筛选后的第二视觉数据进行排序,从而提升最终反馈给用户的检索结果的排序合理性。
例如,请参考图6,检索文本为“麻雀”,至少一个第二视觉数据包括四张图像,其中包括三张“麻雀”的图像和一张“鹦鹉”的图像(排序为2的图像),由于“鹦鹉”并不符合“麻雀”的意思,但是由于“麻雀”和“鹦鹉”都属于小型鸟类,因此判别难度较高,仅以首次得到的至少一个第二视觉数据确定检索结果,容易出错。但是,基于成对判断结果进行筛选后,“鹦鹉”的图像从首次得到的至少一个第二视觉数据中被删除,有利于获得更准确的结果。此外,在筛选后的第二视觉数据中,可能存在视觉特征不够好的被检索视觉数据排在前面,而视觉特征更清楚的被检索视觉数据却排在后面的情况,比如第一张“麻雀”的图像的视觉特征不够好而被排在前面,第三张“麻雀”的图像的视觉特征比较好而被排在后面。所以,按照筛选后的第二视觉数据与检索文本之间的相似度从大到小的顺序,将视觉特征更清楚的被检索数据调整到前面,将视觉特征不够好的被检索视觉数据调整到后面,以获得更合理的排序结果。
经过上述步骤得到检索结果之后,可以将该检索结果反馈给用户。基于上文所述,本申请实施例提供的跨模态检索方法可以应用于网络侧场景,也可应用于端侧场景。对于网络侧场景来说,在服务器确定出检索结果之后,可以将该检索结果发送给用户终端,由用户终端来显示该检索结果。对于端侧场景来说,在用户终端确定出该检索结果之后,可以显示该检索结果。
比如,请参考图7,当用户需要搜索手机相册中的图像或者视频时,用户可以在搜索框中输入“麻雀”,此时,手机通过本申请实施例提供的方法对手机相册进行检索后,得到三张“麻雀”的图像,而且这三张图像中视觉特征更清楚的图像排到前面,视觉特征不够好的图像排到后面。
由于被检索视觉数据的视觉标签的范围固定,所以,通过被检索视觉数据的视觉标签和检索文本的文本标签确定出的第一视觉数据比较精确,也即是,通过标签匹配能够精确地控制检索范围。并且,由于检索文本是具有语义开放性的自然语言的描述信息,所以,通过被检索视觉数据的视觉特征和检索文本的文本特征确定出的第二视觉数据没有语义上的限制,支持自然语义检索,检索比较灵活,检索范围也比较广,能够识别形容词等细粒度的检索文本。这样,在被检索视觉数据中同时存在第一视觉数据和第二视觉数据的情况下,将第一视觉数据和第二视觉数据进行融合,能够同时提升跨模态的检索精度和检索广度。
另外,将第一视觉数据和第二视觉数据进行融合时,通过确定第一视觉数据的视觉标签属于第一类标签还是第二类标签来采取不同的融合方案,实现了应用场景对于检索结果的数量和准确性的不同侧重。此外,本申请实施例还可以将第二视觉数据的视觉特征和检索文本的文本特征同时输入神经网络模型,神经网络模型通过将视觉特征和文本特征进行结合来实现细粒度地分析,从而对第二视觉数据进行筛选,以提高检索结果的准确性和合理性。
请参考图8,图8是本申请实施例提供的另一种跨模态检索方法的流程图。该方法包括如下步骤。
步骤801:提取检索文本的文本标签和文本特征。
可选地,该方法还包括:接收用户输入的检索文本。
步骤802:基于检索文本的文本标签和被检索视觉数据的视觉标签,得到标签匹配结果,被检索视觉数据包括图像和/或视频。
步骤803:基于检索文本的文本特征和被检索视觉数据的视觉特征,得到特征匹配结果。
可选地,基于文本特征和被检索视觉数据的视觉特征,得到特征匹配结果包括:将文本特征和被检索视觉数据的视觉特征进行特征匹配,以得到第一特征匹配结果,第一特征匹配结果包括至少一个第三视觉数据;将文本特征和第一特征匹配结果输入预设模型,以得到特征匹配结果,特征匹配结果包括第一特征匹配结果中的部分或全部第三视觉数据,且特征匹配结果中包括的第三视觉数据按照与文本特征的相似度排序。
其中,该预设模型可以为神经网络模型。
步骤804:基于标签匹配结果和特征匹配结果,得到检索结果。
其中,标签匹配结果中可能包括至少一个第一视觉数据,也可能不包括,特征匹配结果中可能包括至少一个第二视觉数据,也可能不包括。对于不同的情况,确定检索结果的方式不同,接下来将分别进行介绍。
第一种情况,标签匹配结果包括至少一个第一视觉数据,且特征匹配结果包括至少一个第二视觉数据,此时,基于标签匹配结果和特征匹配结果得到检索结果的实现过程包括:将标签匹配结果包括的至少一个第一视觉数据和特征匹配结果包括的至少一个第二视觉数据的并集或交集作为检索结果。
可选地,将标签匹配结果包括的至少一个第一视觉数据和特征匹配结果包括的至少一个第二视觉数据的并集或交集作为检索结果包括:如果至少一个第一视觉数据的视觉标签属于预设的第一类标签,则将至少一个第一视觉数据和至少一个第二视觉数据的交集作为检索结果;如果至少一个第一视觉数据的视觉标签属于预设的第二类标签,则将至少一个第一视觉数据和至少一个第二视觉数据的并集作为检索结果。
第二种情况,标签匹配结果包括至少一个第一视觉数据,特征匹配结果指示不存在匹配数据,此时,基于标签匹配结果和特征匹配结果得到检索结果的实现过程包括:将标签匹配结果包括的至少一个第一视觉数据作为检索结果。
第三种情况,标签匹配结果指示不存在匹配数据,特征匹配结果包括至少一个第二视觉数据,此时,基于标签匹配结果和特征匹配结果得到检索结果的实现过程包括:将特征匹配结果包括的至少一个第二视觉数据作为检索结果。
第四种情况,标签匹配结果包括至少一个第一视觉数据,且至少一个第一视觉数据中的部分或全部视觉数据的视觉标签包括文本标签时,将至少一个第一视觉数据中的视觉标签包括文本标签的视觉数据作为检索结果。
第五种情况,当标签匹配结果指示不存在匹配数据,且特征匹配结果指示不存在匹配数据时,检索结果指示不存在匹配数据。
可选地,经过上述步骤确定的检索结果包括至少一个图像和/或视频,此时,还可以显示该检索结果。
由于被检索视觉数据的视觉标签的范围固定,所以,通过被检索视觉数据的视觉标签和检索文本的文本标签确定出的第一视觉数据比较精确,也即是,通过标签匹配能够精确地控制检索范围。并且,由于检索文本是具有语义开放性的自然语言的描述信息,所以,通过被检索视觉数据的视觉特征和检索文本的文本特征确定出的第二视觉数据没有语义上的限制,支持自然语义检索,检索比较灵活,检索范围也比较广,能够识别形容词等细粒度的检索文本。这样,在被检索视觉数据中同时存在第一视觉数据和第二视觉数据的情况下,将第一视觉数据和第二视觉数据进行融合,能够同时提升跨模态的检索精度和检索广度。
另外,将第一视觉数据和第二视觉数据进行融合时,通过确定第一视觉数据的视觉标签属于第一类标签还是第二类标签来采取不同的融合方案,实现了应用场景对于检索结果的数量和准确性的不同侧重。此外,本申请实施例还可以将第二视觉数据的视觉特征和检索文本的文本特征同时输入神经网络模型,神经网络模型通过将视觉特征和文本特征进行结合来实现细粒度地分析,从而对第二视觉数据进行筛选,以提高检索结果的准确性和合理性。
需要说明的是:图8所示实施例的实现细节与上述图1所示实施例的实现细节类似,具体内容请参考上述图1所示实施例中的相关描述,这里不再赘述。
图9是本申请实施例提供的一种跨模态检索装置的结构示意图,该装置可以由软件、硬件或者两者的结合实现成为电子设备的部分或者全部。参见图9,该装置包括:提取模块901、第一确定模块902、第二确定模块903和第三确定模块904。
提取模块901,用于提取检索文本的文本标签和文本特征。详细实现过程参考上述各个实施例中对应的内容,此处不再赘述。
第一确定模块902,用于基于文本标签和被检索视觉数据的视觉标签,确定被检索视觉数据中是否存在视觉标签与文本标签匹配的至少一个第一视觉数据,被检索视觉数据包括图像和/或视频。详细实现过程参考上述各个实施例中对应的内容,此处不再赘述。
第二确定模块903,用于基于文本特征和被检索视觉数据的视觉特征,确定被检索视觉数据中是否存在视觉特征与文本特征匹配的至少一个第二视觉数据。详细实现过程参考上述各个实施例中对应的内容,此处不再赘述。
第三确定模块904,用于如果被检索视觉数据中存在至少一个第一视觉数据和至少一个第二视觉数据,则基于至少一个第一视觉数据和至少一个第二视觉数据确定检索结果。详细实现过程参考上述各个实施例中对应的内容,此处不再赘述。
可选地,第三确定模块904具体用于:
如果至少一个第一视觉数据的视觉标签属于第一类标签,则将至少一个第一视觉数据和至少一个第二视觉数据的交集作为检索结果,第一类标签是指表征视觉数据时具有不确定性的标签;
如果至少一个第一视觉数据的视觉标签属于第二类标签,则将至少一个第一视觉数据和至少一个第二视觉数据的并集作为检索结果,第二类标签是指表征视觉数据时具有确定性的标签。
可选地,该装置还包括:
第四确定模块,用于如果被检索视觉数据中存在至少一个第一视觉数据但不存在至少一个第二视觉数据,则将至少一个第一视觉数据作为检索结果。
可选地,该装置还包括:
第五确定模块,用于如果被检索视觉数据中存在至少一个第二视觉数据但不存在至少一个第一视觉数据,则将至少一个第二视觉数据作为检索结果。
可选地,被检索视觉数据中存在至少一个第二视觉数据;装置还包括:
模型推理模块,用于将至少一个第二视觉数据的视觉特征和文本特征输入至神经网络模型中,以得到模型推理结果,模型推理结果包括相似性结果和/或成对判断结果,相似性结果指示至少一个第二视觉数据分别与检索文本之间的相似度,成对判断结果指示至少一个第二视觉数据分别与检索文本是否能够成对;
处理模块,用于基于模型推理结果对至少一个第二视觉数据进行处理。
可选地,模型推理结果包括相似性结果;处理模块具体用于:
基于相似性结果,从至少一个第二视觉数据中筛选出与检索文本之间的相似度大于第一相似度阈值的第二视觉数据。
可选地,模型推理结果包括成对判断结果;处理模块具体用于:
基于成对判断结果,从至少一个第二视觉数据中筛选出与检索文本能够成对的第二视觉数据。
可选地,模型推理结果包括相似性结果和成对判断结果;处理模块具体用于:
基于成对判断结果,从至少一个第二视觉数据中筛选出与检索文本能够成对的第二视觉数据;
基于相似性结果,按照筛选出的第二视觉数据与检索文本之间的相似度从大到小的顺序,对筛选出的第二视觉数据进行排序。
由于被检索视觉数据的视觉标签的范围固定,所以,通过被检索视觉数据的视觉标签和检索文本的文本标签确定出的第一视觉数据比较精确,也即是,通过标签匹配能够精确地控制检索范围。并且,由于检索文本是具有语义开放性的自然语言的描述信息,所以,通过被检索视觉数据的视觉特征和检索文本的文本特征确定出的第二视觉数据没有语义上的限制,检索比较灵活,检索范围也比较广,能够识别形容词等细粒度的检索文本。这样,在被检索视觉数据中同时存在第一视觉数据和第二视觉数据的情况下,将第一视觉数据和第二视觉数据进行融合,能够同时提升跨模态的检索精度和检索广度。
另外,将第一视觉数据和第二视觉数据进行融合时,通过确定第一视觉数据的视觉标签属于第一类标签还是第二类标签来采取不同的融合方案,实现了应用场景对于检索结果的数量和准确性的不同侧重。此外,本申请实施例还可以将第二视觉数据的视觉特征和检索文本的文本特征同时输入神经网络模型,神经网络模型通过将视觉特征和文本特征进行结合来实现细粒度地分析,从而对第二视觉数据进行筛选,以提高检索结果的准确性和合理性。
需要说明的是:上述实施例提供的跨模态检索装置在进行跨模态检索时,仅以上述各功能模块的划分进行举例说明,实际应用中,可以根据需要而将上述功能分配由不同的功能模块完成,即将装置的内部结构划分成不同的功能模块,以完成以上描述的全部或者部分功能。另外,上述实施例提供的跨模态检索装置与跨模态检索方法实施例属于同一构思,其具体实现过程详见方法实施例,这里不再赘述。
请参考图10,图10是根据本申请实施例示出的一种电子设备的结构示意图。该电子设备包括至少一个处理器1001、通信总线1002、存储器1003以及至少一个通信接口1004。
处理器1001可以是一个通用中央处理器(central processing unit,CPU)、网络处理器(network processor,NP)、微处理器、或者可以是一个或多个用于实现本申请方案的集成电路,例如,专用集成电路(application-specific integrated circuit,ASIC),可编程逻辑器件(programmable logic device,PLD)或其组合。上述PLD可以是复杂可编程逻辑器件(complex programmable logic device,CPLD)、现场可编程逻辑门阵列(field-programmable gate array,FPGA)、通用阵列逻辑(generic array logic,GAL)或其任意组合。
通信总线1002用于在上述组件之间传送信息。通信总线1002可以分为地址总线、数据总线、控制总线等。为便于表示,图中仅用一条粗线表示,但并不表示仅有一根总线或一种类型的总线。
存储器1003可以是只读存储器(read-only memory,ROM),也可以是随机存取存储器(random access memory,RAM),也可以是电可擦可编程只读存储器(electrically erasable programmable read-only Memory,EEPROM)、光盘(包括只读光盘(compact disc read-only memory,CD-ROM)、压缩光盘、激光盘、数字通用光盘、蓝光光盘等)、磁盘存储介质或者其它磁存储设备,或者是能够用于携带或存储具有指令或数据结构形式的期望的程序代码并能够由计算机存取的任何其它介质,但不限于此。存储器1003可以是独立存在,并通过通信总线1002与处理器1001相连接。存储器1003也可以和处理器1001集成在一起。
通信接口1004使用任何收发器一类的装置,用于与其它设备或通信网络通信。通信接口1004包括有线通信接口,还可以包括无线通信接口。其中,有线通信接口例如可以为以太网接口。以太网接口可以是光接口、电接口或其组合。无线通信接口可以为无线局域网(wireless local area networks,WLAN)接口、蜂窝网络通信接口或其组合等。
在具体实现中,作为一种实施例,处理器1001可以包括一个或多个CPU,如图10中所示的CPU0和CPU1。
在具体实现中,作为一种实施例,电子设备可以包括多个处理器,如图10中所示的处理器1001和处理器1005。这些处理器中的每一个可以是一个单核处理器,也可以是一个多核处理器。这里的处理器可以指一个或多个设备、电路、和/或用于处理数据(如计算机程序指令)的处理核。
在具体实现中,作为一种实施例,电子设备还可以包括输出设备1006和输入设备1009。输出设备1006和处理器1001通信,可以以多种方式来显示信息。例如,输出设备1006可以是液晶显示器(liquid crystal display,LCD)、发光二极管(light emitting diode,LED)显示设备、阴极射线管(cathode ray tube,CRT)显示设备或投影仪(projector)等。输入设备1009和处理器1001通信,可以以多种方式接收用户的输入。例如,输入设备1009可以是鼠标、键盘、触摸屏设备或传感设备等。
在一些实施例中,存储器1003用于存储执行本申请方案的程序代码1010,处理器1001可以执行存储器1003中存储的程序代码1010。该程序代码1010中可以包括一个或多个软件模块,该电子设备可以通过处理器1001以及存储器1003中的程序代码1010,来实现上文实施例提供的跨模态检索方法。
请参考图11,图11是本申请实施例提供的一种用户终端的结构示意图。该用户终端包括传感器单元1110、计算单元1120、存储单元1140和交互单元1130。
传感器单元1110,通常包括视觉传感器(如相机)、深度传感器、IMU、激光传感器等;
计算单元1120,通常包括CPU、GPU、缓存、寄存器等,主要用于运行操作系统;
存储单元1140,主要包括内存和外部存储,主要用于用户本地和临时数据的读写等;
交互单元1130,主要包括显示屏、触摸板、扬声器、麦克风等,主要用于和用户进行交互,获取用户输入,并实施呈现算法效果等。
为便于理解,下面将对本申请实施例提供的一种用户终端100的结构进行示例说明。参见图12,图12是本申请实施例提供的一种用户终端的结构示意图。
如图12所示,用户终端100可以包括处理器110,外部存储器接口120,内部存储器121,通用串行总线(universal serial bus,USB)接口130,充电管理模块140,电源管理模块141,电池142,天线1,天线2,移动通信模块150,无线通信模块160,音频模块170,扬声器170A,受话器170B,麦克风170C,耳机接口170D,传感器模块180,按键190,马达191,指示器192,摄像头193,显示屏194,以及用户标识模块(subscriber identification module,SIM)卡接口195等。其中传感器模块180可以包括压力传感器180A,陀螺仪传感器180B,气压传感器180C,磁传感器180D,加速度传感器180E,距离传感器180F,接近光传感器180G,指纹传感器180H,温度传感器180J,触摸传感器180K,环境光传感器180L,骨传导传感器180M等。
可以理解的是,本申请实施例示意的结构并不构成对用户终端100的具体限定。在本申请另一些实施例中,用户终端100可以包括比图示更多或更少的部件,或者组合某些部件,或者拆分某些部件,或者不同的部件布置。图示的部件可以以硬件,软件或软件和硬件的组合实现。
处理器110可以包括一个或多个处理单元,例如:处理器110可以包括应用处理器(application processor,AP),调制解调处理器,图形处理器(graphics processing unit,GPU),图像信号处理器(image signal processor,ISP),控制器,存储器,视频编解码器,数字信号处理器(digital signal processor,DSP),基带处理器,和/或神经网络处理器(neural-network processing unit,NPU)等。其中,不同的处理单元可以是独立的器件,也可以集成在一个或多个处理器中。处理器110可以执行计算机程序,以实现本申请实施例中任一种方法。
其中,控制器可以是用户终端100的神经中枢和指挥中心。控制器可以根据指令操作码和时序信号,产生操作控制信号,完成取指令和执行指令的控制。
处理器110中还可以设置存储器,用于存储指令和数据。在一些实施例中,处理器110中的存储器为高速缓冲存储器。该存储器可以保存处理器110刚用过或循环使用的指令或数据。如果处理器110需要再次使用该指令或数据,可从所述存储器中直接调用,避免了重复存取,减少了处理器110的等待时间,因而提高了系统的效率。
在一些实施例中,处理器110可以包括一个或多个接口。接口可以包括集成电路(inter-integrated circuit,I2C)接口,集成电路内置音频(inter-integrated circuit sound,I2S)接口,脉冲编码调制(pulse code modulation,PCM)接口,通用异步收发传输器(universal asynchronous receiver/transmitter,UART)接口,移动产业处理器接口(mobile industry processor interface,MIPI),通用输入输出(general-purpose input/output,GPIO)接口,用户标识模块(subscriber identity module,SIM)接口,和/或通用串行总线(universal serial bus,USB)接口等。
可以理解的是,本申请实施例示意的各模块间的接口连接关系,只是示意性说明,并不构成对用户终端100的结构限定。在本申请另一些实施例中,用户终端100也可以采用上述实施例中不同的接口连接方式,或多种接口连接方式的组合。
充电管理模块140用于从充电器接收充电输入。其中,充电器可以是无线充电器,也可以是有线充电器。在一些有线充电的实施例中,充电管理模块140可以通过USB接口130接收有线充电器的充电输入。
电源管理模块141用于连接电池142,充电管理模块140与处理器110。电源管理模块141接收电池142和/或充电管理模块140的输入,为处理器110,内部存储器121,外部存储器,显示屏194,摄像头193,和无线通信模块160等供电。
用户终端100的无线通信功能可以通过天线1,天线2,移动通信模块150,无线通信模块160,调制解调处理器以及基带处理器等实现。
在一些可行的实施方式中,用户终端100可以使用无线通信功能和其他设备通信。例如,用户终端100可以和第二电子设备通信,用户终端100与第二电子设备建立投屏连接,用户终端100输出投屏数据至第二电子设备等。其中,用户终端100输出的投屏数据可以为音视频数据。
天线1和天线2用于发射和接收电磁波信号。用户终端100中的每个天线可用于覆盖单个或多个通信频带。不同的天线还可以复用,以提高天线的利用率。例如:可以将天线1复用为无线局域网的分集天线。在另外一些实施例中,天线可以和调谐开关结合使用。
移动通信模块150可以提供应用在用户终端100上的包括2G/3G/4G/5G等无线通信的解决方案。移动通信模块150可以包括至少一个滤波器,开关,功率放大器,低噪声放大器(low noise amplifier,LNA)等。移动通信模块150可以由天线1接收电磁波,并对接收的电磁波进行滤波,放大等处理,传送至调制解调处理器进行解调。移动通信模块150还可以对经调制解调处理器调制后的信号放大,经天线1转为电磁波辐射出去。在一些实施例中,移动通信模块150的至少部分功能模块可以被设置于处理器110中。在一些实施例中,移动通信模块150的至少部分功能模块可以与处理器110的至少部分模块被设置在同一个器件中。
调制解调处理器可以包括调制器和解调器。其中,调制器用于将待发送的低频基带信号调制成中高频信号。解调器用于将接收的电磁波信号解调为低频基带信号。随后解调器将解调得到的低频基带信号传送至基带处理器处理。低频基带信号经基带处理器处理后,被传递给应用处理器。应用处理器通过音频设备(不限于扬声器170A,受话器170B等)输出声音信号,或通过显示屏194显示图像或视频。在一些实施例中,调制解调处理器可以是独立的器件。在另一些实施例中,调制解调处理器可以独立于处理器110,与移动通信模块150或其他功能模块设置在同一个器件中。
无线通信模块160可以提供应用在用户终端100上的包括无线局域网(wireless local area networks,WLAN),如无线保真(wireless fidelity,Wi-Fi)网络,蓝牙(bluetooth,BT),全球导航卫星系统(global navigation satellite system,GNSS),调频(frequency modulation,FM),近距离无线通信技术(near field communication,NFC),红外技术(infrared,IR)等无线通信的解决方案。无线通信模块160可以是集成至少一个通信处理模块的一个或多个器件。无线通信模块160经由天线2接收电磁波,将电磁波信号调频以及滤波处理,将处理后的信号发送到处理器110。无线通信模块160还可以从处理器110接收待发送的信号,对其进行调频,放大,经天线2转为电磁波辐射出去。
在一些实施例中,用户终端100的天线1和移动通信模块150耦合,天线2和无线通信模块160耦合,使得用户终端100可以通过无线通信技术与网络以及其他设备通信。所述无线通信技术可以包括全球移动通讯系统(global system for mobile communications,GSM),通用分组无线服务(general packet radio service,GPRS),码分多址接入(code division multiple access,CDMA),宽带码分多址(wideband code division multiple access,WCDMA),时分码分多址(time-division code division multiple access,TD-SCDMA),长期演进(long term evolution,LTE),BT,GNSS,WLAN,NFC,FM,和/或IR技术等。所述GNSS可以包括全球卫星定位系统(global positioning system,GPS),全球导航卫星系统(global navigation satellite system,GLONASS),北斗卫星导航系统(beidou navigation satellite system,BDS),准天顶卫星系统(quasi-zenith satellite system,QZSS)和/或星基增强系统(satellite based augmentation systems,SBAS)。
用户终端100通过GPU,显示屏194,以及应用处理器等实现显示功能。GPU为图像处理的微处理器,连接显示屏194和应用处理器。GPU用于执行数学和几何计算,用于图形渲染。处理器110可包括一个或多个GPU,其执行程序指令以生成或改变显示信息。
显示屏194用于显示图像,视频等。显示屏194包括显示面板。显示面板可以采用液晶显示屏(liquid crystal display,LCD),有机发光二极管(organic light-emitting diode,OLED),有源矩阵有机发光二极体或主动矩阵有机发光二极体(active-matrix organic light emitting diode,AMOLED),柔性发光二极管(flex light-emitting diode,FLED),Miniled,MicroLed,Micro-oLed,量子点发光二极管(quantum dot light emitting diodes,QLED)等。在一些实施例中,用户终端100可以包括1个或N个显示屏194,N为大于1的正整数。
在一些可行的实施方式中,显示屏194可用于显示用户终端100的系统输出的各个界面。
用户终端100可以通过ISP,摄像头193,视频编解码器,GPU,显示屏194以及应用处理器等实现拍摄功能。
ISP用于处理摄像头193反馈的数据。例如,拍照时,打开快门,光线通过镜头被传递到摄像头感光元件上,光信号转换为电信号,摄像头感光元件将所述电信号传递给ISP处理,转化为肉眼可见的图像。ISP还可以对图像的噪点,亮度,肤色进行算法优化。ISP还可以对拍摄场景的曝光,色温等参数优化。在一些实施例中,ISP可以设置在摄像头193中。
摄像头193用于捕获静态图像或视频。物体通过镜头生成光学图像投射到感光元件。感光元件可以是电荷耦合器件(charge coupled device,CCD)或互补金属氧化物半导体(complementary metal-oxide-semiconductor,CMOS)光电晶体管。感光元件把光信号转换成电信号,之后将电信号传递给ISP转换成数字图像信号。ISP将数字图像信号输出到DSP加工处理。DSP将数字图像信号转换成标准的RGB,YUV等格式的图像信号。在一些实施例中,用户终端100可以包括1个或N个摄像头193,N为大于1的正整数。
数字信号处理器用于处理数字信号,除了可以处理数字图像信号,还可以处理其他数字信号。
视频编解码器用于对数字视频压缩或解压缩。用户终端100可以支持一种或多种视频编解码器。这样,用户终端100可以播放或录制多种编码格式的视频,例如:动态图像专家组(moving picture experts group,MPEG)1,MPEG2,MPEG3,MPEG4等。
NPU为神经网络(neural-network,NN)计算处理器,通过借鉴生物神经网络结构,例如借鉴人脑神经元之间传递模式,对输入信息快速处理,还可以不断的自学习。通过NPU可以实现用户终端100的智能认知等应用,例如:图像识别,人脸识别,语音识别,文本理解等。
外部存储器接口120可以用于连接外部存储卡,例如Micro SD卡,实现扩展用户终端100的存储能力。外部存储卡通过外部存储器接口120与处理器110通信,实现数据存储功能。例如将音乐,视频等文件保存在外部存储卡中。
内部存储器121可以用于存储计算机可执行程序代码,所述可执行程序代码包括指令。处理器110通过运行存储在内部存储器121的指令,从而执行用户终端100的各种功能应用以及数据处理。内部存储器121可以包括存储程序区和存储数据区。其中,存储程序区可存储操作系统,至少一个功能所需的应用程序(比如本申请实施例中的方法等)等。存储数据区可存储用户终端100使用过程中所创建的数据(比如音频数据,电话本等)等。此外,内部存储器121可以包括高速随机存取存储器,还可以包括非易失性存储器,例如至少一个磁盘存储器件,闪存器件,通用闪存存储器(universal flash storage,UFS)等。
用户终端100可以通过音频模块170,扬声器170A,受话器170B,麦克风170C,耳机接口170D,以及应用处理器等实现音频功能。例如音乐播放,录音等。在一些可行的实施方式中,音频模块170可用于播放视频对应的声音。例如,显示屏194显示视频播放画面时,音频模块170输出视频播放的声音。
音频模块170用于将数字音频信息转换成模拟音频信号输出,也用于将模拟音频输入转换为数字音频信号。
扬声器170A,也称“喇叭”,用于将音频电信号转换为声音信号。
受话器170B,也称“听筒”,用于将音频电信号转换成声音信号。
麦克风170C,也称“话筒”,“传声器”,用于将声音信号转换为电信号。
耳机接口170D用于连接有线耳机。耳机接口170D可以是USB接口130,也可以是3.5mm的开放移动电子设备平台(open mobile terminal platform,OMTP)标准接口,美国蜂窝电信工业协会(cellular telecommunications industry association of the USA,CTIA)标准接口。
压力传感器180A用于感受压力信号,可以将压力信号转换成电信号。在一些实施例中,压力传感器180A可以设置于显示屏194。陀螺仪传感器180B可以用于确定用户终端100的运动姿态。气压传感器180C用于测量气压。
加速度传感器180E可检测用户终端100在各个方向上(包括三轴或六轴)加速度的大小。当用户终端100静止时可检测出重力的大小及方向。还可以用于识别用户终端姿态,应用于横竖屏切换,计步器等应用。
距离传感器180F,用于测量距离。
环境光传感器180L用于感知环境光亮度。
指纹传感器180H用于采集指纹。
温度传感器180J用于检测温度。
触摸传感器180K,也称“触控面板”。触摸传感器180K可以设置于显示屏194,由触摸传感器180K与显示屏194组成触摸屏,也称“触控屏”。触摸传感器180K用于检测作用于其上或附近的触摸操作。触摸传感器可以将检测到的触摸操作传递给应用处理器,以确定触摸事件类型。可以通过显示屏194提供与触摸操作相关的视觉输出。在另一些实施例中,触摸传感器180K也可以设置于用户终端100的表面,与显示屏194所处的位置不同。
按键190包括开机键,音量键等。按键190可以是机械按键。也可以是触摸式按键。用户终端100可以接收按键输入,产生与用户终端100的用户设置以及功能控制有关的键信号输入。
马达191可以产生振动提示。
指示器192可以是指示灯,可以用于指示充电状态,电量变化,也可以用于指示消息,未接来电,通知等。
SIM卡接口195用于连接SIM卡。
本申请实施例还提供了一种计算机可读存储介质,所述存储介质内存储有指令,当所述指令在计算机上运行时,使得计算机执行上述实施例所述的跨模态检索方法的步骤。
本申请实施例还提供了一种包含指令的计算机程序产品,当所述指令在计算机上运行时,使得计算机执行上述实施例所述的跨模态检索方法的步骤。或者说,提供了一种计算机程序,当所述计算机程序在计算机上运行时,使得计算机执行上述实施例所述的跨模态检索方法的步骤。
本申请实施例还提供了一种芯片,所述芯片包括处理器和接口电路,所述接口电路用于接收指令并传输至所述处理器,所述处理器用于执行上述实施例所述的跨模态检索方法的步骤。
本申请实施例还提供了一种检索系统,所述检索系统包括上述实施例所述的跨模态检索装置以及模型训练装置。该模型训练装置用于对上述实施例中涉及的模型进行训练。
在上述实施例中,可以全部或部分地通过软件、硬件、固件或者其任意结合来实现。当使用软件实现时,可以全部或部分地以计算机程序产品的形式实现。所述计算机程序产品包括一个或多个计算机指令。在计算机上加载和执行所述计算机指令时,全部或部分地产生按照本申请实施例所述的流程或功能。所述计算机可以是通用计算机、专用计算机、计算机网络或其他可编程装置。所述计算机指令可以存储在计算机可读存储介质中,或者从一个计算机可读存储介质向另一个计算机可读存储介质传输,例如,所述计算机指令可以从一个网站站点、计算机、服务器或数据中心通过有线(例如:同轴电缆、光纤、数据用户线(digital subscriber line,DSL))或无线(例如:红外、无线、微波等)方式向另一个网站站点、计算机、服务器或数据中心进行传输。所述计算机可读存储介质可以是计算机能够存取的任何可用介质,或者是包含一个或多个可用介质集成的服务器、数据中心等数据存储设备。所述可用介质可以是磁性介质(例如:软盘、硬盘、磁带)、光介质(例如:数字通用光盘(digital versatile disc,DVD))或半导体介质(例如:固态硬盘(solid state disk,SSD))等。值得注意的是,本申请实施例提到的计算机可读存储介质可以为非易失性存储介质,换句话说,可以是非瞬时性存储介质。
本申请实施例主要有如下三个关键点:
1、提出一种开放语义与标签双路召回技术:在做图像和视频检索中,使用“分类打标技术”和“跨模态对比学习技术”分别获得标签召回和开放语义的双路召回。
2、基于关键点1的开放语义召回,提出一种多特征精排过滤模块:开放语义召回分为两个步骤,首次召回和精排过滤。基于首次召回获得的特征检索结果,将视觉特征和文本特征同时送入精排过滤模块,精排过滤模块基于多特征输入,做细粒度分析,从而修改首次召回的排序并做进一步过滤,从而提升召回效果。
3、基于关键点1获得的标签召回和开放语义召回,提出一种开放语义与标签的双路融合技术:基于标签召回和开放语义召回双路召回结果,分为双路召回同时存在、仅存在标签召回、仅存在开放语义召回三种情况处理,在双路召回同时存在场景中,给予系统可配置的融合方案,从而适应不同准确率和召回率要求的场景。
本申请实施例应用在跨模态搜索场景,如:用户输入文字,搜索图像和视频。配合深度学习网络完成分类打标和跨模态特征提取,从而完成跨模态检索;包括计算机视觉、自然语言处理等领域。本申请实施例的系统架构如图2所示,主要包括:1、双路召回部件(即文本标签与视觉标签的匹配、文本特征与视觉特征的匹配);2、双路融合部件(即至少一个第一视觉数据和至少一个第二视觉数据是否同时存在而确定检索结果);3、基于1的精排过滤部件(对至少一个第二视觉数据进行精细化处理)。
通过本申请实施例提出的方法,能够为用户提供更加开放和精准的跨模态内容搜索。相比于已有方案,在端侧场景(如:手机相册)能够提供更加开放的搜索能力;在云侧场景(如:搜索引擎)提供更加精准的搜索效果。本申请实施例通过设计系统,融合了两项技术:分类打标和跨模态对比学习。分类方案具有极高的精度,有利于保障关键场景的召回效果;跨模态方案具有极广泛的识别能力,且能够识别形容词、组合词等细节描述能力;两者结合,相互取长补短,达到精度和广度检索效果的综合提升;精排过滤模块以多种跨模态特征作为输入,进一步提升结果。
本申请实施例的核心实现装置为计算机代码,如上图2实现,分为双路召回部件、双路融合部件以及双路召回部件中的精排过滤部件,其中双路召回部件以及精排过滤部件需要借助深度学习网络实现。
参考图2和图5,本申请实施例核心方法流程如下:
步骤一:用户提前存放好需要被检索的图片、视频,以供视觉模型作分析。
步骤二:开发人员提前配置好双路融合组件中的“融合策略”。
步骤三:视觉模型对用户数据做一一分析,获得视觉特征库和视觉标签库,其中视觉特征库用在后续步骤的开放语义召回中;视觉标签库用在后续步骤的标签召回中。
步骤四:用户输入文字描述(query),触发跨模态检索。
步骤五:检索系统基于步骤四的query,提取其中的标签内容,使用标签检索步骤三生成的视觉标签库;获得标签召回。
步骤六:检索系统基于步骤四的query,送入文本模型,生成文本特征;使用文本特征检索步骤三生成的视觉特征库,获得首次召回。
步骤七:将步骤六首次召回的视觉特征和文本特征,同时送入精排过滤模块,对步骤六的首次召回做更加精细的排序和过滤,获得开放语义召回。
步骤八:将步骤五的标签召回和步骤七的开放语义召回同时送入双路融合组件中;根据双路召回是否同时存在,分为三种情况执行:
(A)双路召回同时存在:融合双路召回的结果,通过可配置的融合策略,返回融合结果;
(B)仅存在标签召回:返回标签召回结果;
(C)仅存在语义召回:返回语义召回结果。
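上述步骤三至步骤八的整体流程可以用如下Python草图串联示意。其中visual_tagger、visual_model、text_tagger、text_model以及融合策略fuse均为假设的接口,相似度计算方式与阈值也仅为示例,精排过滤细节在此省略,均不构成限定。

```python
def dot(a, b):
    # 简化的相似度计算,此处以内积示意,实际可替换为余弦相似度等度量
    return sum(x * y for x, y in zip(a, b))

def cross_modal_search(query, gallery, visual_tagger, visual_model,
                       text_tagger, text_model, fuse, threshold=0.3):
    # 步骤三:对用户数据逐一分析,得到视觉标签库与视觉特征库
    tag_lib = {i: set(visual_tagger(v)) for i, v in enumerate(gallery)}
    feat_lib = {i: visual_model(v) for i, v in enumerate(gallery)}

    # 步骤五:标签召回——用query中提取的标签检索视觉标签库
    query_tags = set(text_tagger(query))
    tag_recall = {i for i, tags in tag_lib.items() if query_tags & tags}

    # 步骤六、七:开放语义召回——用文本特征检索视觉特征库(精排过滤细节省略)
    text_feat = text_model(query)
    semantic_recall = {i for i, f in feat_lib.items() if dot(text_feat, f) > threshold}

    # 步骤八:双路融合——按双路召回是否同时存在分三种情况处理
    if tag_recall and semantic_recall:
        return fuse(tag_recall, semantic_recall)  # (A)双路召回同时存在:可配置的融合策略
    if tag_recall:
        return tag_recall                         # (B)仅存在标签召回
    return semantic_recall                        # (C)仅存在语义召回(或为空)
```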
参考图2和图4,对跨模态检索,用户不同输入的场景,会触发不同的融合效果,具体流程如下:
步骤1:用户输入检索内容(query)。
步骤2:系统根据用户输入,获得标签召回和开放语义召回。
步骤3:系统根据步骤2的召回情况,分如下三种融合情况处理:
步骤3.1(直接命中标签):当用户输入和预设标签相同时,则直接返回标签召回的结果,其中标签召回根据用户输入的标签直接返回搜索结果,例如返回的图片和视频。假设预置标签中包含"大熊猫",且用户输入为"大熊猫"。该情况下用户输入直接命中标签,则返回标签(大熊猫)召回结果,不返回语义召回结果;其中预置标签的设置方法在此不做限定,例如可以通过产品需求和应用场景需求,人为设定。
另一种可实现方案中,当用户输入和预设标签相同时,则直接返回标签召回的结果,不返回语义召回结果。
步骤3.2(用户输入不包含标签):当用户输入检索语句中不包含预设标签,则直接返回开放语义召回结果,其中开放语义召回指返回符合用户输入的搜索结果,如返回的图片和视频。假设用户输入“光刻机”,且“光刻机”不在预置标签中。该情况下用户输入未命中标签,则返回开放语义召回结果,无标签结果返回。
另一种可实现方案中,当用户输入检索语句中不包含预设标签,则直接返回开放语义召回结果,且无标签结果返回。
步骤3.3(用户输入包含标签):当用户输入检索语句中含有预设标签或者预设标签指代的语义(例如:预设标签的同义词),需要返回标签召回和开放语义召回的结果。假设预置标签中包含"大熊猫",且用户输入为"吃竹子的大熊猫",经过系统识别发现用户的输入中包含的"大熊猫"正好命中了预设标签,因此需要返回标签召回和开放语义召回的结果;由于存在标签召回和开放语义召回的两路结果,因此需要做融合。可根据标签的敏感程度为系统预设融合方案。融合方案可以根据场景对于召回数量和准确性的侧重程度,选择并集或交集。例如:若命中非敏感标签(如:大熊猫),则建议将标签命中的结果和语义命中结果取并集;若命中敏感标签,例如涉及攻击性,或容易在表达上擦边球的内容,建议取交集,从而最大限度保障无误召回发生。
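对于步骤3.3中按标签敏感程度预设融合方案的做法,可以用一个简单的可配置策略表来示意。以下为假设性的Python草图,其中标签敏感程度的划分与策略取值仅为示例,并非对融合策略的限定。

```python
# 假设的融合策略配置:敏感标签取交集以最大限度避免误召回,非敏感标签取并集以提升召回数量
FUSION_POLICY = {
    "sensitive": "intersection",
    "non_sensitive": "union",
}

def fuse_by_policy(tag_recall, semantic_recall, sensitivity):
    # 根据命中标签的敏感程度,从预设策略表中取出融合方式
    policy = FUSION_POLICY.get(sensitivity, "union")
    if policy == "intersection":
        return tag_recall & semantic_recall
    return tag_recall | semantic_recall

# 用法示例:命中非敏感标签"大熊猫"时取并集
# fuse_by_policy({1, 2, 3}, {2, 3, 4}, "non_sensitive")  # -> {1, 2, 3, 4}
```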
采用本申请实施例的提出方案,能够对分类标签和跨模态对比学习两个技术方案取长补短。分类方案具有极高的精度,有利于保障关键场景的召回效果;跨模态方案具有极广泛的识别能力,且能够识别形容词、组合词等细节描述能力;两者结合,相互取长补短,达到精度和广度检索效果的综合提升。
区别于需借助上下文描述的方案,本申请实施例通过设计一种分类标签和跨模态对比学习技术融合的系统方案,在无需上下文的情况下,即可做到高精度的开放内容搜索,适用范围更广,可推广到端侧设备使用场景(如:手机相册)。
区别于分类标签技术和跨模态对比学习技术,本申请实施例通过两者结合,相互取长补短,达到精度和广度检索效果的综合提升。
图5所示的精排过滤技术方案可以用在开放语义召回的后处理过程中,也可以结合上述双路融合的方法来实现,在此不做限定。如果精排过滤模块用在开放语义召回流程中,可以提升开放语义召回质量,优化排序,删除难例错例。
本申请实施例的精排过滤模块需要借助首次召回的视觉特征,以及本次检索的文本特征,将视觉和文本特征同时送入精排过滤模块中,以达到召回效果提升。技术方案详见图5,步骤如下:
步骤1:将图像或视频送入视觉模型,获得视觉特征库。
步骤2:将用户输入送入文本模型,获得文本特征。
步骤3:使用文本特征检索视觉特征,取相似度大于阈值的结果作为首次召回的图片或视频,获得首次召回的视觉特征,每个特征对应一张图片或视频。
步骤4:将首次召回的视觉特征和文本特征同时输入精排过滤模块,精排过滤模块包括神经网络模型,还可能包括数据处理模块;神经网络模型输出每一对“视觉-文本”的相似性结果或是否成对的判断结果。
步骤5:获得精排过滤后的结果,效果展示详见图6。
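精排过滤模块中神经网络模型的一种可能形态,是将文本特征与视觉特征拼接后经过若干全连接层,同时输出相似性结果与成对判断结果。以下为基于该假设的Python(PyTorch)草图,网络结构、特征维度与判断阈值均为示意,并非本申请实施例的限定实现。

```python
import torch
import torch.nn as nn

class RerankModel(nn.Module):
    """示意性的精排过滤模型:输入一对(文本特征,视觉特征),输出相似性分数与是否成对的判断。"""

    def __init__(self, feat_dim=512, hidden_dim=256):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Linear(feat_dim * 2, hidden_dim),
            nn.ReLU(),
        )
        self.sim_head = nn.Linear(hidden_dim, 1)   # 相似性结果分支
        self.pair_head = nn.Linear(hidden_dim, 1)  # 成对判断结果分支

    def forward(self, text_feat, visual_feat):
        # 将文本特征与视觉特征拼接后做细粒度分析
        x = self.backbone(torch.cat([text_feat, visual_feat], dim=-1))
        similarity = torch.sigmoid(self.sim_head(x)).squeeze(-1)      # 取值在[0,1]的相似度
        paired = torch.sigmoid(self.pair_head(x)).squeeze(-1) > 0.5   # 是否能够成对的判断
        return similarity, paired
```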
如图6为精排过滤后的效果,精排过滤主要有两个功能:
功能一:过滤不合理结果,提升召回准确性。如图6所示,首次召回中的样例2(鹦鹉)不符合搜索词"麻雀"的意思,但由于"麻雀"和"鹦鹉"都属于小型鸟类,判别难度高,而首次召回时图片和文本分别使用"视觉模型"和"文本模型"进行分析,难以做到细粒度判别,故容易出错(详见图5首次召回)。在精排过滤阶段,由于将视觉和文本特征同时送入模型进行细粒度分析,因此能够更好地做判别,有利于获得更准确的结果(详见图5精排过滤)。因此图6中,鹦鹉在精排过滤后得以删除。
功能二:调整首次召回结果,提升召回体验。如图6所示,首次召回中的排序1的麻雀,外形特征不如3和4。在精排过滤阶段,由于将视觉和文本特征同时送入模型进行细粒度分析,因此能够更好地做判别,有利于获得更合理的排序结果。
区别于跨模态对比学习技术仅做首次召回的方案,本技术方案先进行视觉模型和文本模型解耦的首次召回,在保障推理效率的同时,达到广泛的识别能力,且能够识别形容词、组合词等细节描述能力的效果,并将精排过滤范围缩小到可控范围;进一步地,加入精排过滤模块,排除难例错例,调整召回顺序,提升召回质量。
本申请实施例的关键技术点概括:
1、一种开放语义与标签双路召回技术:基于分类算法和跨模态特征匹配算法,同时召回标签和开放语义的召回结果,作为首次召回。
2、保护具有精排过滤模块的检索系统,其中精排过滤模块的实现方式为:基于技术点1得到的首次召回结果,将文本和视觉特征同时输入精排过滤模块做精细化排序和过滤。
3、一种开放语义与标签双路融合技术:基于标签和语义的召回,分为三种情况:
双路召回:基于可配置的融合策略,获得双路融合后的结果。
仅标签召回:直接返回标签召回结果。
仅语义召回:直接返回语义召回结果。
本申请实施例的关键技术点对应的有益效果:
一种开放语义与标签双路召回技术:同时获得开放语义和标签召回结果,保障召回完整性,为下一步精排过滤提供优质起点。
一种多特征精排过滤模块:将视觉和文本信息同时送入模型中做细粒度分析,获得更优质的排序结果,删除难例错例。
一种开放语义与标签双路融合技术:将开放语义和标签召回结果,进行可配置的融合,通过调整融合方式,做到敏感场景不触发舆论风险,非敏感场景提升用户召回满足度。
应当理解的是,本文提及的“多个”是指两个或两个以上。在本申请实施例的描述中,除非另有说明,“/”表示或的意思,例如,A/B可以表示A或B;本文中的“和/或”仅仅是一种描述关联对象的关联关系,表示可以存在三种关系,例如,A和/或B,可以表示:单独存在A,同时存在A和B,单独存在B这三种情况。另外,为了便于清楚描述本申请实施例的技术方案,在本申请实施例中,采用了“第一”、“第二”等字样对功能和作用基本相同的相同项或相似项进行区分。本领域技术人员可以理解“第一”、“第二”等字样并不对数量和执行次序进行限定,并且“第一”、“第二”等字样也并不限定一定不同。
需要说明的是,本申请实施例所涉及的信息(包括但不限于用户设备信息、用户个人信息等)、数据(包括但不限于用于分析的数据、存储的数据、展示的数据等)以及信号,均为经用户授权或者经过各方充分授权的,且相关数据的收集、使用和处理需要遵守相关国家和地区的相关法律法规和标准。
以上所述为本申请提供的实施例,并不用以限制本申请,凡在本申请的精神和原则之内,所作的任何修改、等同替换、改进等,均应包含在本申请的保护范围之内。

Claims (25)

  1. 一种跨模态检索方法,其特征在于,所述方法包括:
    提取检索文本的文本标签和文本特征;基于所述文本标签和被检索视觉数据的视觉标签,确定所述被检索视觉数据中是否存在视觉标签与所述文本标签匹配的至少一个第一视觉数据,所述被检索视觉数据包括图像和/或视频;
    基于所述文本特征和所述被检索视觉数据的视觉特征,确定所述被检索视觉数据中是否存在视觉特征与所述文本特征匹配的至少一个第二视觉数据;
    基于所述至少一个第一视觉数据和所述至少一个第二视觉数据确定检索结果。
  2. 如权利要求1所述的方法,其特征在于,所述基于所述至少一个第一视觉数据和所述至少一个第二视觉数据确定检索结果,包括:
    如果所述被检索视觉数据中存在所述至少一个第一视觉数据和所述至少一个第二视觉数据,则基于所述至少一个第一视觉数据和所述至少一个第二视觉数据确定检索结果。
  3. 如权利要求1或2所述的方法,其特征在于,所述基于所述至少一个第一视觉数据和所述至少一个第二视觉数据确定检索结果,包括:
    如果所述至少一个第一视觉数据的视觉标签属于第一类标签,则将所述至少一个第一视觉数据和所述至少一个第二视觉数据的交集作为所述检索结果,所述第一类标签是指表征视觉数据时具有不确定性的标签。
  4. 如权利要求1-3任一项所述的方法,其特征在于,所述基于所述至少一个第一视觉数据和所述至少一个第二视觉数据确定检索结果,包括:
    如果所述至少一个第一视觉数据的视觉标签属于第二类标签,则将所述至少一个第一视觉数据和所述至少一个第二视觉数据的并集作为所述检索结果,所述第二类标签是指表征视觉数据时具有确定性的标签。
  5. 如权利要求1-4任一项所述的方法,其特征在于,所述方法还包括:
    如果所述被检索视觉数据中存在所述至少一个第一视觉数据但不存在所述至少一个第二视觉数据,则将所述至少一个第一视觉数据作为所述检索结果。
  6. 如权利要求1-5任一项所述的方法,其特征在于,所述方法还包括:
    如果所述被检索视觉数据中存在所述至少一个第二视觉数据但不存在所述至少一个第一视觉数据,则将所述至少一个第二视觉数据作为所述检索结果。
  7. 如权利要求1-6任一项所述的方法,其特征在于,所述被检索视觉数据中存在所述至少一个第二视觉数据;所述方法还包括:
    将所述至少一个第二视觉数据的视觉特征和所述文本特征输入至神经网络模型中,以得到模型推理结果,所述模型推理结果包括相似性结果和/或成对判断结果,所述相似性结果指示所述至少一个第二视觉数据分别与所述检索文本之间的相似度,所述成对判断结果指示所述至少一个第二视觉数据分别与所述检索文本是否能够成对;
    基于所述模型推理结果对所述至少一个第二视觉数据进行处理。
  8. 如权利要求7所述的方法,其特征在于,所述模型推理结果包括相似性结果;
    所述基于所述模型推理结果对所述至少一个第二视觉数据进行处理,包括:
    基于所述相似性结果,从所述至少一个第二视觉数据中筛选出与所述检索文本之间的相似度大于第一相似度阈值的第二视觉数据。
  9. 如权利要求7所述的方法,其特征在于,所述模型推理结果包括成对判断结果;
    所述基于所述模型推理结果对所述至少一个第二视觉数据进行处理,包括:
    基于所述成对判断结果,从所述至少一个第二视觉数据中筛选出与所述检索文本能够成对的第二视觉数据。
  10. 如权利要求7所述的方法,其特征在于,所述模型推理结果包括相似性结果和成对判断结果;
    所述基于所述模型推理结果对所述至少一个第二视觉数据进行处理,包括:
    基于所述成对判断结果,从所述至少一个第二视觉数据中筛选出与所述检索文本能够成对的第二视觉数据;
    基于所述相似性结果,按照筛选出的第二视觉数据与所述检索文本之间的相似度从大到小的顺序,对所述筛选出的第二视觉数据进行排序。
  11. 一种跨模态检索装置,其特征在于,所述装置包括:
    提取模块,用于提取检索文本的文本标签和文本特征;
    第一确定模块,用于基于所述文本标签和被检索视觉数据的视觉标签,确定所述被检索视觉数据中是否存在视觉标签与所述文本标签匹配的至少一个第一视觉数据,所述被检索视觉数据包括图像和/或视频;
    第二确定模块,用于基于所述文本特征和所述被检索视觉数据的视觉特征,确定所述被检索视觉数据中是否存在视觉特征与所述文本特征匹配的至少一个第二视觉数据;
    第三确定模块,用于基于所述至少一个第一视觉数据和所述至少一个第二视觉数据确定检索结果。
  12. 如权利要求11所述的装置,其特征在于,所述第三确定模块具体用于:
    如果所述被检索视觉数据中存在所述至少一个第一视觉数据和所述至少一个第二视觉数据,则基于所述至少一个第一视觉数据和所述至少一个第二视觉数据确定检索结果。
  13. 如权利要求11或12所述的装置,其特征在于,所述第三确定模块具体用于:
    如果所述至少一个第一视觉数据的视觉标签属于第一类标签,则将所述至少一个第一视觉数据和所述至少一个第二视觉数据的交集作为所述检索结果,所述第一类标签是指表征视觉数据时具有不确定性的标签。
  14. 如权利要求11-13任一项所述的装置,其特征在于,所述第三确定模块具体用于:
    如果所述至少一个第一视觉数据的视觉标签属于第二类标签,则将所述至少一个第一视觉数据和所述至少一个第二视觉数据的并集作为所述检索结果,所述第二类标签是指表征视觉数据时具有确定性的标签。
  15. 如权利要求11-14任一项所述的装置,其特征在于,所述装置还包括:
    第四确定模块,用于如果所述被检索视觉数据中存在所述至少一个第一视觉数据但不存在所述至少一个第二视觉数据,则将所述至少一个第一视觉数据作为所述检索结果。
  16. 如权利要求11-15任一项所述的装置,其特征在于,所述装置还包括:
    第五确定模块,用于如果所述被检索视觉数据中存在所述至少一个第二视觉数据但不存在所述至少一个第一视觉数据,则将所述至少一个第二视觉数据作为所述检索结果。
  17. 如权利要求11-16任一项所述的装置,其特征在于,所述被检索视觉数据中存在所述至少一个第二视觉数据;所述装置还包括:
    模型推理模块,用于将所述至少一个第二视觉数据的视觉特征和所述文本特征输入至神经网络模型中,以得到模型推理结果,所述模型推理结果包括相似性结果和/或成对判断结果,所述相似性结果指示所述至少一个第二视觉数据分别与所述检索文本之间的相似度,所述成对判断结果指示所述至少一个第二视觉数据分别与所述检索文本是否能够成对;
    处理模块,用于基于所述模型推理结果对所述至少一个第二视觉数据进行处理。
  18. 如权利要求17所述的装置,其特征在于,所述模型推理结果包括相似性结果;所述处理模块具体用于:
    基于所述相似性结果,从所述至少一个第二视觉数据中筛选出与所述检索文本之间的相似度大于第一相似度阈值的第二视觉数据。
  19. 如权利要求17所述的装置,其特征在于,所述模型推理结果包括成对判断结果;所述处理模块具体用于:
    基于所述成对判断结果,从所述至少一个第二视觉数据中筛选出与所述检索文本能够成对的第二视觉数据。
  20. 如权利要求17所述的装置,其特征在于,所述模型推理结果包括相似性结果和成对判断结果;所述处理模块具体用于:
    基于所述成对判断结果,从所述至少一个第二视觉数据中筛选出与所述检索文本能够成对的第二视觉数据;
    基于所述相似性结果,按照筛选出的第二视觉数据与所述检索文本之间的相似度从大到小的顺序,对所述筛选出的第二视觉数据进行排序。
  21. 一种电子设备,其特征在于,所述电子设备包括存储器和处理器,所述存储器用于存储计算机程序,所述处理器被配置为执行所述计算机程序,以实现权利要求1-10任一项所述的跨模态检索方法的步骤。
  22. 一种计算机可读存储介质,其特征在于,所述存储介质内存储有指令,当所述指令在所述计算机上运行时,使得所述计算机执行权利要求1-10任一项所述的方法的步骤。
  23. 一种计算机程序,其特征在于,所述计算机程序包含指令,当所述指令在计算机上运行时,使得所述计算机执行权利要求1-10任一项所述的跨模态检索方法的步骤。
  24. 一种芯片,其特征在于,所述芯片包括处理器和接口电路,所述接口电路用于接收指令并传输至所述处理器,所述处理器用于执行权利要求1-10任一项所述的跨模态检索方法的步骤。
  25. 一种检索系统,其特征在于,所述检索系统包括权利要求11-20任一项所述的跨模态检索装置以及模型训练装置。
PCT/CN2023/117203 2022-09-07 2023-09-06 跨模态检索方法、装置、设备、存储介质及计算机程序 WO2024051730A1 (zh)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
CN202211091658 2022-09-07
CN202211091658.9 2022-09-07
CN202311130428.3A CN117668290A (zh) 2022-09-07 2023-08-31 跨模态检索方法、装置、设备、存储介质及计算机程序
CN202311130428.3 2023-08-31

Publications (1)

Publication Number Publication Date
WO2024051730A1 (zh)

Family

ID=90068881

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/117203 WO2024051730A1 (zh) 2022-09-07 2023-09-06 跨模态检索方法、装置、设备、存储介质及计算机程序

Country Status (2)

Country Link
CN (1) CN117668290A (zh)
WO (1) WO2024051730A1 (zh)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170004383A1 (en) * 2015-06-30 2017-01-05 Adobe Systems Incorporated Searching untagged images with text-based queries
WO2021028656A1 (en) * 2019-08-15 2021-02-18 Vision Semantics Limited Text based image search
CN111353076A (zh) * 2020-02-21 2020-06-30 华为技术有限公司 训练跨模态检索模型的方法、跨模态检索的方法和相关装置
CN114090823A (zh) * 2021-09-09 2022-02-25 秒针信息技术有限公司 视频检索方法、装置、电子设备及计算机可读存储介质
CN114996511A (zh) * 2022-04-22 2022-09-02 北京爱奇艺科技有限公司 一种针对跨模态视频检索模型的训练方法及装置

Also Published As

Publication number Publication date
CN117668290A (zh) 2024-03-08

Similar Documents

Publication Publication Date Title
US11900924B2 (en) Semantic parsing method and server
WO2021018154A1 (zh) 信息表示方法及装置
CN111465918B (zh) 在预览界面中显示业务信息的方法及电子设备
WO2021244457A1 (zh) 一种视频生成方法及相关装置
WO2021258797A1 (zh) 图像信息输入方法、电子设备及计算机可读存储介质
US20220214894A1 (en) Command execution method, apparatus, and device
CN112269853B (zh) 检索处理方法、装置及存储介质
US20220343648A1 (en) Image selection method and electronic device
CN111930964B (zh) 内容处理方法、装置、设备及存储介质
US20230195801A1 (en) Word completion method and apparatus
CN114816610B (zh) 一种页面分类方法、页面分类装置和终端设备
CN113806473A (zh) 意图识别方法和电子设备
CN114547428A (zh) 推荐模型处理方法、装置、电子设备及存储介质
CN112256868A (zh) 零指代消解方法、训练零指代消解模型的方法及电子设备
US20150293943A1 (en) Method for sorting media content and electronic device implementing same
CN114281936A (zh) 分类方法、装置、计算机设备及存储介质
CN112287070A (zh) 词语的上下位关系确定方法、装置、计算机设备及介质
WO2023040603A1 (zh) 一种搜索方法、终端、服务器及系统
WO2024051730A1 (zh) 跨模态检索方法、装置、设备、存储介质及计算机程序
WO2022033432A1 (zh) 内容推荐方法、电子设备和服务器
CN115437601A (zh) 图像排序方法、电子设备、程序产品及介质
CN112232890B (zh) 数据处理方法、装置、设备及存储介质
CN113742460B (zh) 生成虚拟角色的方法及装置
CN116861066A (zh) 应用推荐方法和电子设备
CN111597823A (zh) 中心词提取方法、装置、设备及存储介质

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23862424

Country of ref document: EP

Kind code of ref document: A1