WO2023168997A1 - Cross-modal search method and related device - Google Patents

Cross-modal search method and related device

Info

Publication number
WO2023168997A1
Authority
WO
WIPO (PCT)
Prior art keywords
modal
data
modality
cross
search
Prior art date
Application number
PCT/CN2022/134918
Other languages
English (en)
French (fr)
Other versions
WO2023168997A9 (zh)
Inventor
梅柯
郑还
李明
Original Assignee
腾讯科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 腾讯科技(深圳)有限公司
Priority to KR1020247011097A (KR20240052055A, ko)
Priority to US18/353,882 (US20230359651A1, en)
Publication of WO2023168997A1 (zh)
Publication of WO2023168997A9 (zh)

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2458 Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2471 Distributed queries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2455 Query execution
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28 Databases characterised by their database models, e.g. relational or object models
    • G06F16/284 Relational databases
    • G06F16/285 Clustering or classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9532 Query formulation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Definitions

  • This application relates to the field of computer technology, and in particular to cross-modal search technology.
  • Embodiments of the present application provide a cross-modal search method and related equipment, which can improve the efficiency of cross-modal search and the diversity and comprehensiveness of cross-modal search results.
  • Embodiments of the present application provide a cross-modal search method, executed by a computer device, including:
  • merging the first set and the second set to obtain cross-modal search results corresponding to the first modal data.
  • Embodiments of the present application provide another cross-modal search method, executed by a computer device, including:
  • displaying a session interface of a social session;
  • in response to a viewing operation on the historical session records of the social session, displaying a session record details interface, where the session record details interface includes the second modal data in the historical session records of the social session;
  • in response to first modal data input in the session record details interface, outputting the cross-modal search results corresponding to the first modal data, where the cross-modal search results are obtained by using the cross-modal search method of the embodiments of the present application.
  • Embodiments of the present application provide a cross-modal search device, including:
  • an acquisition module, configured to obtain the first modal data;
  • a search module, configured to search the second modality database based on the content information of the first modality data to obtain a first set, where the first set includes at least one piece of second modality data that matches the content information of the first modality data;
  • the search module is further configured to search the second modality database based on the semantic information of the first modality data to obtain a second set, where the second set includes at least one piece of second modality data that matches the semantic information of the first modality data;
  • a merging module, configured to merge the first set and the second set to obtain cross-modal search results corresponding to the first modal data.
  • Embodiments of the present application provide another cross-modal search device, including:
  • a display module, configured to display the session interface of a social session;
  • the display module is further configured to display a session record details interface in response to a viewing operation on the historical session records of the social session, where the session record details interface includes the second modal data in the historical session records of the social session;
  • an output module, configured to output, in response to first modal data input in the session record details interface, the cross-modal search results corresponding to the first modal data, where the cross-modal search results are obtained by using the cross-modal search method of the embodiments of the present application.
  • Embodiments of the present application provide a computer device, including a processor, a memory, and a network interface. The processor is connected to the memory and the network interface, the network interface is used to provide network communication functions, the memory is used to store program code, and the processor is used to call the program code to execute the cross-modal search method in the embodiments of the present application.
  • Embodiments of the present application provide a computer-readable storage medium. The computer-readable storage medium stores a computer program that includes program instructions. When executed by a processor, the program instructions perform the cross-modal search method in the embodiments of the present application.
  • Embodiments of the present application provide a computer program product. The computer program product includes a computer program or computer instructions, by which the cross-modal search method provided in one aspect of the embodiments of the present application is implemented.
  • The embodiments of the present application not only support cross-modal search, but also support comprehensive search from the two dimensions of content and semantics, so that the search no longer covers a single dimension.
  • In addition, the second modal data found in the two dimensions are merged as the cross-modal search results. Search results in multiple dimensions can thus be obtained through one search process, which improves the efficiency of cross-modal search. Because the cross-modal search results are obtained by merging the search results of two dimensions, they are also more diverse and comprehensive.
  • Figure 1 is an architectural diagram of a cross-modal search system provided by an embodiment of the present application.
  • Figure 2 is a first schematic flowchart of a cross-modal search method provided by an embodiment of the present application.
  • Figure 3 is a second schematic flowchart of a cross-modal search method provided by an embodiment of the present application.
  • Figure 4a is a schematic structural diagram of a first modal processing network in a cross-modal search model provided by an embodiment of the present application.
  • Figure 4b is a schematic structural diagram of a second modal processing network in a cross-modal search model provided by an embodiment of the present application.
  • Figure 5 is a schematic diagram of the training of a cross-modal search model provided by an embodiment of the present application.
  • Figure 6 is a schematic flowchart of a cross-modal search algorithm provided by an embodiment of the present application.
  • Figure 7 is a third schematic flowchart of a cross-modal search method provided by an embodiment of the present application.
  • Figure 8a is a schematic diagram of an operation for viewing historical session records provided by an embodiment of the present application.
  • Figure 8b is a schematic diagram of a cross-modal search operation provided by an embodiment of the present application.
  • Figure 8c is a schematic diagram of the effect of outputting cross-modal search results provided by an embodiment of the present application.
  • Figure 9 is a schematic structural diagram of a cross-modal search device provided by an embodiment of the present application.
  • Figure 10 is a schematic structural diagram of another cross-modal search device provided by an embodiment of the present application.
  • Figure 11 is a schematic structural diagram of a computer device provided by an embodiment of the present application.
  • Chat photo wall: a page that displays in full the pictures sent and received in each chat in an application (APP).
  • Multimodal learning refers to mapping data of two different modalities to the same feature space (such as a semantic space), so that the data of the two modalities can be related based on semantics, and modal data with similar semantics have similar features in that feature space. The two different modalities can be, for example, images and text.
  • Figure 1 is a schematic architectural diagram of a cross-modal search system provided by an embodiment of the present application.
  • the architecture diagram may include a database 101 and a cross-modal search device 102.
  • the cross-modal search device 102 can establish a communication connection with the database 101 in a wired or wireless manner.
  • the database 101 can be a local database of the cross-modal search device 102 or a cloud database that the cross-modal search device 102 can access.
  • the cross-modal search device 102 may specifically be a computer device such as a server or a terminal.
  • The server can be an independent physical server, a server cluster or distributed system composed of multiple physical servers, or a cloud server that provides basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communications, middleware services, domain name services, security services, CDN, and big data and artificial intelligence platforms, which is not limited here.
  • Terminals can be smartphones, tablets, smart wearable devices, smart voice interaction devices, smart home appliances, personal computers, vehicle-mounted terminals and other devices, and are not limited here.
  • the database 101 may include a second modality database and a second modality feature database associated with the second modality database.
  • the second modality database is used to store second modality data and attribute information of the second modality data.
  • The attribute information of the second modal data may be information contained in the second modal data itself. For example, if the second modal data is an image, the attribute information may be text in the image.
  • The attribute information of the second modal data may also be information associated with the second modal data. For example, if the second modal data is an image, the attribute information may be a category label annotated for the image.
  • The second modal feature database is used to store the semantic features of the second modal data. Each semantic feature of the second modal data is assigned a feature index, which helps to quickly locate the corresponding second modal data in the second modal database.
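  • As a rough illustration of how the second modal database and the second modal feature database could be tied together by a feature index, a minimal Python sketch follows. The class names, field names, and sample values are hypothetical; the application does not specify a storage layout.

```python
from dataclasses import dataclass, field


@dataclass
class SecondModalRecord:
    data: str             # hypothetical: e.g. a path to a stored image
    attribute_info: dict  # hypothetical: e.g. {"text": "...", "labels": [...]}


@dataclass
class Database101:
    # Second modal database: feature index -> second modal data and attributes.
    second_modal_db: dict = field(default_factory=dict)
    # Second modal feature database: feature index -> semantic feature vector.
    feature_db: dict = field(default_factory=dict)

    def add(self, index, record, feature):
        """Store a record and its semantic feature under the same index."""
        self.second_modal_db[index] = record
        self.feature_db[index] = feature

    def lookup(self, index):
        """Use a feature index to reach the second modal data directly."""
        return self.second_modal_db[index]
```

  • Sharing one index between the two stores is only one possible design; it lets a hit in the feature database be resolved to its second modal data without a further search.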
  • the cross-modal search device 102 is used to search second modal data according to the first modal data, and then generate cross-modal search results.
  • The specific process is as follows:
  • (1) Obtain the first modal data. The first modal data may be any one of text, voice, image, etc.
  • (2) Search the database 101 (specifically, the second modal database) for second modal data matching the content information of the first modal data and for second modal data matching the semantic information of the first modal data. The content information here refers to the content contained in the first modal data itself, and the semantic information refers to the abstract meaning expressed by the first modal data. For example, if the first modal data is text, the content information is the characters in the text and the semantic information is the meaning expressed by the text. If the first modal data is an image, the content information may be the content contained in the image, such as text, and the semantic information may be semantic features extracted from the image. Based on the content information of the first modal data, the matching second modal data can be searched directly from the second modal database. Based on the semantic information of the first modal data, the second modal feature database is needed: the second modal feature that matches the semantic information of the first modal data is found in the second modal feature database, and the second modal data corresponding to that feature is then looked up in the second modal database and determined to be the second modal data matching the semantic information.
  • (3) Combine these second modal data as the cross-modal search results that match the first modal data.
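  • The three-step process above can be sketched in Python as follows. This is only an illustration under stated assumptions: the function names, the dictionary layout, substring matching for the content path, cosine similarity with a 0.8 threshold for the semantic path, and duplicate removal during merging are all hypothetical choices, and the query feature is assumed to be supplied by some model that is not shown.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def search_by_content(query_text, second_modal_db):
    """Content path: match the query against attribute information directly."""
    return [idx for idx, rec in second_modal_db.items()
            if query_text in rec["text"] or query_text in rec["labels"]]


def search_by_semantics(query_feature, feature_db, threshold=0.8):
    """Semantic path: match the query feature against stored features."""
    return [idx for idx, feat in feature_db.items()
            if cosine(query_feature, feat) >= threshold]


def cross_modal_search(query_text, query_feature, second_modal_db, feature_db):
    first_set = search_by_content(query_text, second_modal_db)
    second_set = search_by_semantics(query_feature, feature_db)
    # Merge both sets; dropping duplicates is an assumption of this sketch.
    merged = list(dict.fromkeys(first_set + second_set))
    return [second_modal_db[idx]["data"] for idx in merged]
```

  • With this layout, an image found by both paths appears once in the merged result, and images found by only one path are still returned.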
  • the cross-modal search device 102 may also be configured to output cross-modal search results corresponding to the first modal data according to the input first modal data.
  • The specific process includes: (1) displaying the session interface of the social session; (2) in response to a viewing operation on the historical session records of the social session, displaying the session record details interface, where second modal data is displayed in the session record details interface and the second modal data belongs to the historical session records of the social session; (3) in response to first modal data input in the session record details interface, outputting the cross-modal search results corresponding to the first modal data.
  • a search box can be provided on the session record details interface for the conversation object to manually input the first modal data, or the first modal data can be recommended for the conversation object to select, thereby quickly triggering the search function for cross-modal search.
  • Cross-modal search can also be performed according to specified search rules. For example, the input text can be used to search by image description or by the text in the image.
  • The displayed cross-modal search results can depend on the search dimension: either the second modal data in the cross-modal search results that matches the content information of the first modal data is output, or the second modal data in the cross-modal search results that matches the semantic information of the first modal data is output.
  • The cross-modal search system supports the following two cross-modal search solutions: one is a general cross-modal search at the technical level, and the other is a cross-modal search in historical session records at the product level. The latter outputs cross-modal search results, and those results are obtained by implementing the technical-level cross-modal search solution.
  • The cross-modal search devices carrying these two solutions can be the same computer device or different computer devices. When they are different computer devices, assume they are computer device A and computer device B.
  • Computer device B receives the input first modal data and sends the first modal data to computer device A.
  • Computer device A searches the database based on the acquired first modal data.
  • Computer device A can automatically identify the input first modal data, search the database for matching second modal data based on the first modal data, and obtain and output the cross-modal search results in computer device A.
  • The cross-modal search system can search the second modal database for matching second modal data based on both the content information and the semantic information of the first modal data. This is a cross-modal search method in which a comprehensive search is performed from the two dimensions of content and semantics, so that the search no longer covers a single dimension and can reach all second modal data related to the first modal data, and the second modal data can be obtained faster and more accurately.
  • The second modal data found in the two dimensions are merged as the cross-modal search results, so that search results in multiple dimensions are obtained in one search process. This improves the efficiency of cross-modal search and yields richer and more comprehensive search results.
  • The cross-modal search system can also provide a search function based on the historical session records of social sessions.
  • This search function searches for second modal data in the historical session records. It can display all cross-modal search results, or search by a specified dimension and display the cross-modal search results for that dimension. With the support of the cross-modal search solution described above, the freedom and complexity allowed for first modal data input are effectively improved when this search function is used to search for second modal data in the historical session records.
  • the scenarios in which the cross-modal search method can be applied are described below.
  • the cross-modal search method in the embodiment of the present application can be applied in Scenario 1 and Scenario 2 as shown below, but is not limited to these application scenarios. Scenario 1 and scenario 2 are introduced below respectively.
  • Scenario 1: the first modal data is text data and the second modal data is image data, that is, image data is searched and matched using text data.
  • The historical session records of social sessions contain session messages in many forms, such as pictures, videos, files, links, and music. Searching the historical session records is a faster way to reach the historical session messages they contain.
  • When searching historical session records in the form of pictures or videos, manually entered text or selected recommended text can be used as the search text, and matching image data, including pictures or videos, is then output.
  • The cross-modal search method in the embodiments of the present application can also be used for album search: text can be input as a query and matched against the image features of the images in the album, the text information contained in the images, or the associated text description information, so as to output the corresponding image data.
  • Scenario 2: the first modal data is audio data and the second modal data is image data, that is, image data is searched and matched using voice data.
  • Take smartphones as an example. Smartphones are equipped with intelligent voice functions, through which the terminal device can be controlled to automatically perform corresponding operations.
  • Voice query involves cross-modal search, that is, recognizing and understanding the voice, mapping the voice and the images into the same feature comparison space, and matching the voice to the corresponding images.
  • Alternatively, the speech can be converted into text, and the corresponding picture or video can be matched by comparing the text with the image's category label, text description information, etc.
  • For example, voice can be input as a query, and images matching the voice can be automatically output by matching against the image content in the mobile phone album.
  • Figure 2 is a first schematic flowchart of a cross-modal search method provided by an embodiment of the present application.
  • the cross-modal search method can be executed by a computer device (such as the cross-modal search device 102 shown in Figure 1).
  • the cross-modal search method includes but is not limited to the following steps.
  • Modality can refer to a source or form of information.
  • For example, people have hearing, vision, smell, and touch, and the media of information include voice, video, text, pictures, etc.
  • modal data can be data in different forms such as images, videos, and audios.
  • The obtained first modal data may be modal data input by the user through a computer device.
  • For example, the first modal data may be text data or image data input through auxiliary means such as a physical keyboard, a virtual keyboard, or cursor selection, may be audio data recognized by a smart voice device, or may be selected from recommended first modal data (e.g., recommended text).
  • S202 Search the second modality database based on the content information of the first modality data to obtain the first set.
  • the first set includes at least one second modality data that matches the content information of the first modality data.
  • The content information of the first modal data is data information describing the essential content contained in the first modal data.
  • For example, if the first modal data is text, the corresponding content information may be the text characters themselves or keywords extracted from the text. For another example, if the first modal data is an image, the corresponding content information may be other modal information contained in the image, or basic features of the image, such as any one or more of the geometric shapes, textures, colors, object category labels, and text description information contained in the image.
  • Along the dimension of the content information of the first modal data, all second modal data that match the content information of the first modal data can be searched for in the second modal database, and the matching second modal data are added to the first set.
  • the second modality database stores N pieces of second modality data and respective attribute information of the N pieces of second modality data, where N is a positive integer.
  • the second modal data and the first modal data are two different modal data.
  • The second modal data can be any one of text, image, audio, video, and other modal data. The second modal data stored in the second modal database differs between business scenarios. For example, in a search of the historical session records of a social session, the second modal data may be images sent or received in the session.
  • The attribute information of the second modal data is information describing the attributes of the second modal data. It can be identified from the second modal data itself or be associated information generated from other data. The attribute information and the content information of the first modal data can be data in the same recorded form, for example, both are text description information.
  • The content information of the first modal data can be matched with the attribute information of the second modal data, so that matching second modal data is found in the second modal database and the first set is obtained.
  • Step S202 includes the following steps S2021 and S2022. S2021: for each of the N second modal data, determine the matching degree between the content information of the first modal data and the attribute information of that second modal data, as the matching degree corresponding to that second modal data. S2022: add the second modal data whose matching degree satisfies the matching condition to the first set.
  • the content information of the first modal data can be matched with the attribute information of each second modal data in the N second modal data in the second modal database to obtain the corresponding matching degree.
  • the matching degree here may indicate whether the content information of the first modal data and the attribute information of the second modal data are similar or consistent.
  • The matching degree between the content information of the first modal data and the attribute information of the second modal data can be measured by the similarity of the modal data (such as text similarity), by abstract semantic similarity, or in other ways, which is not limited here.
  • the matching condition here can be set so that the matching degree is greater than or equal to the matching degree threshold, or it can be set so that the matching degree is ranked in the top y positions, and y is a positive integer. There are no restrictions on the specific setting content of the matching conditions.
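  • The two matching conditions mentioned above (a matching degree threshold, or the top y positions by matching degree) can be sketched as follows. The function names and the score dictionary are hypothetical and only illustrate the selection step.

```python
def select_by_threshold(scores, threshold):
    """Keep items whose matching degree is greater than or equal to the threshold."""
    return [item for item, score in scores.items() if score >= threshold]


def select_top_y(scores, y):
    """Keep the y items with the highest matching degree."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:y]
```

  • A threshold returns a variable number of results depending on the data, whereas top-y always returns at most y results; which one suits a given scenario is left open by the text.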
  • the attribute information includes one or both of first modal description information and category labels.
  • The first modal description information refers to description information recorded in the form of the first modality. For example, if the first modal data is text, the first modal description information is text description information; if the first modal data is an image, the first modal description information is image description information.
  • The first modal description information, as attribute information of the second modal data, can be matched with the content information of the first modal data. Because the content information of the first modal data and the attribute information of the second modal data are both recorded in the same modal form, this is a match between information of the same modality, which makes it possible to determine whether the content information of the first modal data matches the second modal data.
  • Category labels are information labeled for classifying the second modal data. They can be manually labeled for the second modal data, or they can be obtained by inputting the second modal data into a classification model for multi-label classification. The category label of the second modal data and the content information of the first modal data can also be matched to search for the second modal data that meets the matching conditions.
  • Any one of the N second modal data is represented as the i-th second modal data, where i is a positive integer and i is less than or equal to N.
  • When the attribute information includes the first modal description information, steps S2021 and S2022 may be implemented as follows: determine the semantic similarity between the content information of the first modal data and the first modal description information of the i-th second modal data, as the matching degree corresponding to the i-th second modal data; if that semantic similarity is greater than the first similarity threshold, add the i-th second modal data to the first set.
  • That is, the matching degree between the content information of the first modal data and the attribute information of the second modal data can be the semantic similarity mentioned above. The semantic similarity can be obtained as follows: extract the semantic features corresponding to the content information of the first modal data and the semantic features corresponding to the first modal description information of the i-th second modal data, then determine the similarity between these two semantic features and take it as the semantic similarity.
  • Whether the i-th second modal data meets the matching condition is determined by judging whether the semantic similarity is greater than the first similarity threshold. If it is, the matching degree between the attribute information of the i-th second modal data and the content information of the first modal data satisfies the matching condition, which indicates that the attribute information of the i-th second modal data matches the content information of the first modal data, so the i-th second modal data can be added to the first set. Otherwise, the i-th second modal data is not added to the first set.
  • In this way, whether the second modal data matches the first modal data is determined by the consistency of the semantics expressed by the content information of the first modal data and by the first modal description information of the second modal data.
  • For example, the first modal data is text data and the second modal data is image data. The specific content of the first modal data is "blue sky and white clouds", and its content information is that text content. For the second modal data, the first modal description information is text description information for the image content.
  • The text description information is associated with the image, and may be text information contained in the image or text description information associated with the image.
  • Suppose the text description associated with the image reads "The sky looks beautiful today." The keyword "sky" can then be used as the first modal description information, and the semantic similarity between the two texts "sky" and "blue sky and white clouds" is determined to decide whether the two match, and thereby whether the corresponding image matches the text.
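  • The "sky" versus "blue sky and white clouds" comparison can be sketched as below. This is a toy illustration, not the model described in the application: the word vectors in `TOY_VECTORS`, the averaging `embed` function, the use of cosine similarity, and the 0.8 value for the first similarity threshold are all made-up assumptions standing in for a real semantic feature extractor.

```python
import math

# Hypothetical 3-dimensional word vectors, invented for this example only.
TOY_VECTORS = {
    "blue": [0.6, 0.2, 0.0],
    "sky": [0.9, 0.1, 0.0],
    "white": [0.5, 0.3, 0.0],
    "clouds": [0.8, 0.2, 0.1],
    "cat": [0.0, 0.1, 0.9],
}


def embed(text):
    """Average the toy vectors of the known words in the text."""
    words = [w for w in text.lower().split() if w in TOY_VECTORS]
    if not words:
        return [0.0, 0.0, 0.0]
    return [sum(TOY_VECTORS[w][k] for w in words) / len(words) for k in range(3)]


def semantic_similarity(text_a, text_b):
    """Cosine similarity between the semantic features of two texts."""
    u, v = embed(text_a), embed(text_b)
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def matches_description(content_info, description, first_similarity_threshold=0.8):
    """True if the similarity is greater than the first similarity threshold."""
    return semantic_similarity(content_info, description) > first_similarity_threshold
```

  • With these toy vectors, the description keyword "sky" matches the query, while an unrelated keyword such as "cat" does not.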
  • When the attribute information includes category labels, steps S2021 and S2022 may be implemented as follows: determine the similarity between the content information of the first modal data and the category label of the i-th second modal data, as the matching degree corresponding to the i-th second modal data; if that similarity is greater than the second similarity threshold, add the i-th second modal data to the first set.
  • The matching degree here specifically refers to the similarity between the content information of the first modal data and the category label of the i-th second modal data. It can be, for example, text similarity, and it represents the degree of consistency between the category label of the second modal data and the content information of the first modal data.
  • Whether the i-th second modal data meets the matching condition is determined by whether this similarity is greater than the second similarity threshold. If it is, the matching degree between the content information of the first modal data and the category label of the i-th second modal data satisfies the matching condition, which indicates that the category label of the i-th second modal data matches the content information of the first modal data, so the i-th second modal data can be added to the first set. Otherwise, the i-th second modal data is not added to the first set.
  • For example, the first modal data is the search text, the second modal data is an image, and the i-th second modal data is the target image.
  • The above two implementations apply to any second modal data among the N second modal data, so that each of the N second modal data stored in the second modal database is compared against the first modal data.
  • The first set finally obtained can be used as part of the cross-modal search results described below.
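  • The content-based matching described above can be sketched as follows. This is a minimal illustration only: the similarity function (Jaccard overlap of word sets), the threshold value, the record ids and the function names are all assumptions made for the example; the embodiment does not fix how the text similarity is computed.

```python
def token_jaccard(a: str, b: str) -> float:
    """Toy text similarity (an assumption): Jaccard overlap of lowercase word sets."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a or not tokens_b:
        return 0.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def build_first_set(query, records, threshold=0.2):
    """Return the ids of records whose attribute information matches the query.

    `records` maps a hypothetical record id to a list of attribute strings
    (category labels and/or text descriptions of the second modal data).
    """
    first_set = []
    for record_id, attributes in records.items():
        # Matching degree: best similarity over all attributes of the record.
        degree = max((token_jaccard(query, attr) for attr in attributes), default=0.0)
        if degree > threshold:  # the matching condition
            first_set.append(record_id)
    return first_set


records = {
    "img_1": ["sky", "blue sky"],
    "img_2": ["cat"],
    "img_3": ["white clouds over the sea"],
}
first_set = build_first_set("blue sky and white clouds", records)
```

  • A production system would likely replace the word-overlap score with a semantic text similarity, but the thresholding logic stays the same.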
  • S203 Search the second modality database based on the semantic information of the first modality data to obtain the second set.
  • the second set includes at least one second modality data that matches the semantic information of the first modality data.
  • The semantic information of the first modal data refers to the meaning, in the real world, of the things that the first modal data corresponds to. Semantic information can represent a shallow or a deep semantic understanding of the first modal data, and it can be very rich: when the first modal data is text, for example, the same semantics can be expressed by many different texts, which makes it very flexible.
  • The semantic information of each second modal data can be matched against the semantic information of the first modal data, and the second modal data whose semantic information matches is then selected from the second modal database.
  • The semantic information can be represented by semantic features, specifically a semantic feature vector. By separately extracting the semantic features of the first modal data and of the second modal data, the semantic features of the two modalities can be mapped into the same semantic feature space and compared for similarity, and second modal data with similar semantics is then found on the basis of similar semantic features.
  • For the specific implementation of this step, refer to the embodiment corresponding to Figure 3 below; it is not described in detail here.
  • When the first modal data is text and the second modal data is an image, this step is a text-based image search based on cross-modal features: the text feature vector of the search term and the image feature vector of the picture are extracted separately, and the similarity of the feature vectors of the two modalities is compared in the same semantic feature space. Images with similar semantics can thus be retrieved across modalities directly from a text description. This supports more, and more complex, text descriptions, so that the target image can be searched for with free and diverse text describing it.
  • S204 Merge the first set and the second set to obtain cross-modal search results corresponding to the first modal data.
  • The N second modal data stored in the second modal database are processed according to the above steps to obtain a first set that matches the content information of the first modal data and a second set that matches the semantic information of the first modal data.
  • Merging the two sets yields all second modal data that match the first modal data, including the second modal data that match its content information and the second modal data that match its semantic information; together these are the cross-modal search results corresponding to the first modal data.
  • The resulting cross-modal search results therefore cover multiple dimensions and are diverse and comprehensive.
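  • Step S204 can be illustrated with a short sketch. The de-duplication and the ordering (first-set hits before second-set hits) are assumptions of this example; the embodiment only states that the two sets are merged.

```python
def merge_results(first_set, second_set):
    """Union of two result lists, keeping first-seen order and dropping duplicates."""
    merged, seen = [], set()
    for item in list(first_set) + list(second_set):
        if item not in seen:
            seen.add(item)
            merged.append(item)
    return merged


# "img_3" matched on both content and semantics, so it appears only once.
merged = merge_results(["img_1", "img_3"], ["img_3", "img_7"])
```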
  • The cross-modal search solution provided by the embodiment of the present application can search the second modal database for second modal data that matches the content information of the first modal data, and can also search the second modal database for second modal data that matches the semantic information of the first modal data.
  • This search is not limited to a single dimension but is a comprehensive search across multiple dimensions. Search results in multiple dimensions are obtained through one search, which improves the efficiency of cross-modal search. In addition, combining the second modal data matched in the two dimensions as the cross-modal search results yields richer and more diverse results.
  • The search based on the content information of the first modal data relies on the matching degree between that content information and the attribute information of the second modal data (which can be the first modal description information or a category label). Because the attribute information describes the content contained in the second modal data, the first modal data need not be limited to fixed expressions and can support more diverse and complex expressions.
  • Figure 3 is a second schematic flowchart of a cross-modal search method provided by an embodiment of the present application.
  • This method can be executed by a computer device (such as the cross-modal search device 102 shown in Figure 1).
  • The cross-modal search method in this embodiment is a detailed description of how step S203 in Figure 2 is implemented, that is, searching the second modal database based on the semantic information of the first modal data to obtain the second set.
  • N pieces of second modal data are stored in the second modal database.
  • The second modal database is associated with a second modal feature library, which stores the semantic features of each of the N second modal data. Searching the second modal database based on the semantic information of the first modal data to obtain the second set includes the following steps S301 to S304.
  • S301 Obtain the semantic features of the first modal data.
  • The semantic features of the first modal data can be obtained through processing by the cross-modal search model, which includes a first modal processing network. Specifically, the first modal processing network in the cross-modal search model performs feature extraction on the first modal data to obtain its semantic features.
  • The first modal processing network is a processing network for first modal data. When the first modal data is text, it may be a text processing network, such as a BERT (Bidirectional Encoder Representation from Transformers, a pre-trained language representation model) model, one of the variant models related to BERT, or another natural language processing (NLP) model.
  • Figure 4a is a schematic diagram of text encoder processing: taking text as input, the text encoder outputs a text feature vector.
  • S302 Based on the semantic features of the first modal data, search the second modal feature library for target semantic features that match the semantic features of the first modal data.
  • Whether the semantic features of the first modal data match the semantic features of a second modal data can be determined by checking whether the similarity between the two is greater than a similarity threshold. Specifically, the feature similarity between the semantic features of the first modal data and the semantic features of each of the N second modal data stored in the second modal feature library is calculated, and the semantic features whose feature similarity is greater than the similarity threshold are taken as the semantic features that match those of the first modal data, that is, the target semantic features. In this way, one or more target semantic features can be found in the second modal feature library.
  • For example, the first modal data is text and the second modal data is an image, so the semantic features of the first modal data are a text feature vector and the semantic features of the second modal data are image feature vectors.
  • The text feature vector is used to retrieve similar image feature vectors from the image feature library. Specifically, the feature similarity between the text feature vector and each image feature vector is calculated, and the image feature vectors whose feature similarity is higher than the threshold are taken as the target image feature vectors that match the text feature vector.
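  • The threshold-based retrieval of step S302 can be sketched as below. Cosine similarity, the threshold of 0.8, the brute-force scan and the toy two-dimensional vectors are assumptions of this example; the embodiment only requires some feature similarity compared against a threshold.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 for a zero vector)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def search_feature_library(text_feature, feature_library, threshold=0.8):
    """Return (feature_index, similarity) pairs above the threshold, best first.

    `feature_library` maps a feature index to an image feature vector.
    """
    hits = []
    for index, image_feature in feature_library.items():
        similarity = cosine_similarity(text_feature, image_feature)
        if similarity > threshold:
            hits.append((index, similarity))
    hits.sort(key=lambda hit: hit[1], reverse=True)
    return hits


feature_library = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [0.9, 0.1]}
hits = search_feature_library([1.0, 0.0], feature_library)
```

  • For a large feature library an approximate nearest-neighbour index would normally replace the linear scan shown here.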
  • S303 Determine second modality data matching the semantic information of the first modality data in the second modality database according to the target semantic features.
  • Because the second modal feature library is associated with the second modal database, the target semantic features found in the second modal feature library can be used to determine the corresponding second modal data in the second modal database, which are then taken as the second modal data that match the semantic information of the first modal data.
  • The second modal feature library and the second modal database are associated through feature indexes.
  • Step S303 may specifically include the following steps: (1) determine the feature index corresponding to the target semantic feature; (2) based on that feature index, determine in the second modal database the second modal data corresponding to it.
  • The semantic features of each second modal data in the second modal feature library are associated with a feature index, and each feature index is unique. The feature index is also associated with the second modal data in the second modal database. The second modal data in the second modal database and their semantic features in the second modal feature library are thus associated one to one through the feature index, so that, based on the feature index of a found target semantic feature, the corresponding second modal data can be selected from the second modal database, giving the second modal data that match the semantic information of the first modal data.
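  • The one-to-one association through a unique feature index can be sketched with two dictionaries that share keys. The class name `IndexedStore`, its methods and the integer index scheme are hypothetical and chosen only for illustration.

```python
import itertools


class IndexedStore:
    """Toy pairing of a feature library and a database via a shared feature index."""

    def __init__(self):
        self._next_index = itertools.count()  # yields unique indexes 0, 1, 2, ...
        self.features = {}  # feature index -> semantic feature (feature library)
        self.data = {}      # feature index -> second modal data (database)

    def add(self, item, feature):
        """Store an item and its semantic feature under one new feature index."""
        index = next(self._next_index)
        self.features[index] = feature
        self.data[index] = item
        return index

    def resolve(self, target_indexes):
        """Step S303: map the indexes of target semantic features back to data."""
        return [self.data[i] for i in target_indexes if i in self.data]


store = IndexedStore()
beach = store.add("beach.jpg", [1.0, 0.0])
cat = store.add("cat.png", [0.0, 1.0])
second_set = store.resolve([cat])
```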
  • S304 Add second modality data that matches the semantic information of the first modality data to the second set.
  • The second modal data determined from the second modal database as matching the semantic information of the first modal data can be added to the second set.
  • Every target semantic feature can be processed according to the above steps, so that all second modal data matching the semantic information of the first modal data are determined and added to the second set one by one. The final second set can then be used as part of the cross-modal search results.
  • The cross-modal search method provided by this embodiment searches along the dimension of the semantic information of the first modal data. It extracts the semantic features of the first modal data and of the second modal data, compares them in the same semantic space, searches the second modal feature library for target semantic features that match the semantic features of the first modal data, and then, based on the target semantic features found, determines in the second modal database the second modal data that match the semantic information of the first modal data, thereby obtaining cross-modal search results.
  • This is essentially a search based on cross-modal features. With cross-modal features at the semantic level, search results matching the first modal data can be found more quickly and accurately, and the diversity of cross-modal search results is also increased to a certain extent.
  • The cross-modal search model includes a second modal processing network, and the semantic features of the N second modal data stored in the second modal database are obtained by the second modal processing network in the cross-modal search model performing feature extraction on each of the N second modal data.
  • The second modal processing network is a processing network for second modal data and may include several networks with different functions. Taking an image as the second modal data, the second modal processing network may specifically be an image processing network.
  • The second modal processing network includes a feature extraction network, a pooling network and a feature integration network. For ease of description, any one of the N second modal data is denoted the i-th second modal data, where i is a positive integer less than or equal to N; all N second modal data are processed according to the following steps to obtain their semantic features.
  • On this basis, the step in which the second modal processing network performs feature extraction on each of the N second modal data to obtain their semantic features may specifically include: extracting the initial features of the i-th second modal data through the feature extraction network; pooling the initial features through the pooling network to obtain the pooled features of the i-th second modal data; and integrating the pooled features through the feature integration network to obtain the semantic features of the i-th second modal data.
  • The feature extraction network can be a deep model used for image processing, such as a conventional convolutional neural network (CNN) model or a ViT (Vision Transformer) model. It is the backbone network (Backbone) of the second modal processing network and is mainly used to extract the initial features of the second modal data for use by the subsequent networks.
  • The pooling network pools the initial features output by the feature extraction network. Specifically, it can use global average pooling (GAP), in which case the pooling network can also be called a global average pooling layer. Global average pooling reduces the number of parameters and helps prevent overfitting, and it also integrates global spatial information, making the features of the second modal data more robust.
  • The feature integration network then integrates the pooled features output by the pooling network to obtain the semantic features of the i-th second modal data.
  • The feature integration network can specifically be a feature fully connected layer. Because a fully connected layer requires a one-dimensional input, the pooled features are flattened into one-dimensional features before they are processed by the feature integration network, which then produces the semantic features of the second modal data.
  • The above describes how any one of the N second modal data is processed by the cross-modal search model. That is, the same processing steps can be applied to every one of the N second modal data to extract its semantic features, which are then stored in the second modal feature library.
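  • The pooling and feature integration stages can be sketched in plain Python. The backbone is not modelled: `feature_map` stands in for its output, and the tiny shapes, weights and bias are made-up values for illustration. The final L2 normalization follows the image encoder description in this embodiment; everything else is an assumption of the sketch.

```python
import math


def global_average_pool(feature_map):
    """Average each channel's H x W grid, giving one value per channel."""
    pooled = []
    for channel in feature_map:
        values = [value for row in channel for value in row]
        pooled.append(sum(values) / len(values))
    return pooled


def fully_connected(x, weights, bias):
    """Dense layer; `weights` holds one row of input weights per output unit."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


def l2_normalize(vector):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(a * a for a in vector))
    return [a / norm for a in vector] if norm > 0 else list(vector)


# Stand-in backbone output: 2 channels, each a 2 x 2 spatial grid.
feature_map = [
    [[1.0, 3.0], [5.0, 7.0]],  # channel 0, mean 4.0
    [[2.0, 2.0], [2.0, 2.0]],  # channel 1, mean 2.0
]
pooled = global_average_pool(feature_map)      # already one-dimensional
weights = [[1.0, 0.0], [0.0, 1.5]]             # hypothetical feature FC weights
bias = [-1.0, 1.0]
semantic_feature = l2_normalize(fully_connected(pooled, weights, bias))
```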
  • The second modal processing network may also include a classification network. In that case, classification prediction is performed on the pooled features through the classification network to obtain the category label of the i-th second modal data, and the category label of the i-th second modal data is added to the second modal database.
  • The classification network can be a classification fully connected layer, which is similar to the feature fully connected layer: the pooled features it processes are also flattened one-dimensional features. The output of the classification fully connected layer is passed through an activation function (such as the Sigmoid function) to obtain the score of the i-th second modal data for each category, from which the corresponding category labels are obtained.
  • The category labels of the N second modal data can all be obtained with this classification network through multi-label classification and added to the second modal database, so that, when the first modal data is processed, second modal data matching the first modal data can be searched for based on the similarity between the category label of each second modal data and the content information of the first modal data.
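  • Turning the classification layer's outputs into category labels can be sketched as follows. The 0.5 score cut-off, the class names and the logit values are assumptions for the example; the embodiment only states that an activation such as Sigmoid yields a score per category.

```python
import math


def sigmoid(x):
    """Logistic function mapping a real-valued logit to a score in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def predict_labels(logits, class_names, threshold=0.5):
    """Multi-label prediction: keep every class whose score exceeds the threshold."""
    scores = {name: sigmoid(z) for name, z in zip(class_names, logits)}
    labels = [name for name, score in scores.items() if score > threshold]
    return labels, scores


# Hypothetical outputs of the classification fully connected layer for one image.
labels, scores = predict_labels([2.0, -1.0, 0.3], ["sky", "cat", "cloud"])
```

  • Because each class is scored independently, one image can receive several labels, which is what allows a single picture to be found under more than one category.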
  • From the above, the specific structure of the second modal processing network can be summarized.
  • In this example the second modal processing network is an image encoder that includes a feature extraction network, a pooling network, a feature integration network and a classification network, which are respectively the backbone network, the global average pooling layer, the feature fully connected layer and the classification fully connected layer. The second modal database is an image database, and the second modal feature library is an image feature vector retrieval set.
  • The image encoder in the cross-modal search model shown in Figure 4b thus includes the backbone network (Backbone), the global average pooling layer, the classification fully connected layer and the feature fully connected layer.
  • The processing is as follows. First, the image is used as the input of the image encoder, and the image features are obtained through the backbone network of the image encoder (such as a CNN or ViT).
  • The one-dimensional vector is also input into the feature fully connected layer (Feature FC), which outputs a vector of length d (assumed to be 512); after L2 normalization this vector is used as the feature vector f_I of the image.
  • The feature vector f_I of the image is stored in the image feature vector retrieval set (corresponding to the second modal feature library), and the multi-label classification category labels of the image are stored in the second modal database. Finally, a corresponding image feature vector index is added for the feature vector f_I, and both are added to the image feature vector retrieval set G_I to help find the target image quickly in the image database.
  • When the cross-modal search model is used for searching, the first modal data is processed as follows.
  • The first modal data is text, and the first modal processing network corresponds to a text encoder. The text feature vector and the image feature vector that are output are mapped to the same semantic feature space and have the same dimension.
  • Text processing specifically includes the following. First, the image library is searched based on the content information of the text.
  • Then the text is input to the text encoder, which outputs a vector of length d that is L2-normalized to obtain the text feature vector f_T. The text feature vector f_T is used to retrieve similar image feature vectors from the image feature vector retrieval set G_I (whose image feature vectors are obtained by processing images with the image encoder shown in Figure 4b), and the corresponding image set B is recalled.
  • The cross-modal search model includes a first modal processing network and a second modal processing network.
  • The specific training process can be as follows:
  • 1) Obtain a cross-modal training data set. The cross-modal training data set includes multiple groups of cross-modal sample data; each group includes second modal sample data, first modal sample data, and the matching result between the second modal sample data and the first modal sample data.
  • 2) Perform feature extraction on the first modal sample data in the cross-modal sample data through the first modal processing network in the cross-modal search model to obtain the semantic features of the first modal sample data, and perform feature extraction on the second modal sample data in the cross-modal sample data through the second modal processing network in the cross-modal search model to obtain the semantic features of the second modal sample data.
  • 3) Iteratively train the cross-modal search model based on the cross-modal contrast loss between the semantic features of the first modal sample data and the semantic features of the second modal sample data to obtain the trained cross-modal search model.
  • The cross-modal training data set can be obtained from the business data generated in the corresponding scenario. It is a collection of sample data from two different modalities, and each group of cross-modal sample data can be input into the cross-modal search model for processing.
  • For example, each group of cross-modal sample data can be an image-text pair, that is, an image and the text description corresponding to it; a massive number of image-text pairs forms the cross-modal training data set.
  • The first modal processing network and the second modal processing network are trained jointly.
  • K groups of cross-modal sample data can be input at the same time. The first modal sample data in the i-th group is processed by the first modal processing network to obtain its semantic features, and the second modal sample data in the i-th group is processed by the second modal processing network to obtain its semantic features. The cross-modal contrast loss is then calculated from the semantic features of the sample data of the two modalities, the cross-modal search model is trained iteratively based on this loss, and the model parameters are updated continuously until convergence, which gives the trained model.
  • The cross-modal training data set may also include category labels corresponding to the second modal sample data. The training process may then also include the following: perform classification prediction on the second modal sample data in the cross-modal sample data through the second modal processing network in the cross-modal search model to obtain category prediction information for the second modal sample data; determine the classification loss of the second modal sample data based on the category prediction information and the category labels; and iteratively train the cross-modal search model based on the classification loss and the cross-modal contrast loss to obtain the trained cross-modal search model.
  • The category prediction information can include the predicted probability that the second modal sample data belongs to each category, and the classification loss can be a cross-entropy loss.
  • The classification loss and the cross-modal contrast loss can be combined as the total loss, for example by a weighted sum. An optimizer (such as a stochastic gradient descent (SGD) optimizer) is then used to update the model parameters of the cross-modal search model, and the above training process is repeated until the model parameters converge, which gives the trained cross-modal search model.
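  • The combination of the two losses and a parameter update can be sketched as follows. The per-class binary form of cross-entropy (chosen to suit multi-label outputs), the weights `w_cls` and `w_con`, the learning rate and all numeric values are assumptions for the example, and the gradients are supplied by hand rather than computed.

```python
import math


def binary_cross_entropy(probabilities, targets, eps=1e-12):
    """Mean per-class cross-entropy between predicted probabilities and 0/1 labels."""
    losses = [
        -(t * math.log(max(p, eps)) + (1 - t) * math.log(max(1 - p, eps)))
        for p, t in zip(probabilities, targets)
    ]
    return sum(losses) / len(losses)


def total_loss(classification_loss, contrast_loss, w_cls=1.0, w_con=1.0):
    """Weighted sum of the classification loss and the cross-modal contrast loss."""
    return w_cls * classification_loss + w_con * contrast_loss


def sgd_step(params, grads, learning_rate=0.1):
    """One plain SGD update: move each parameter against its gradient."""
    return [p - learning_rate * g for p, g in zip(params, grads)]


cls_loss = binary_cross_entropy([0.5, 0.5], [1, 0])   # equals ln 2
loss = total_loss(0.5, 1.5, w_cls=0.5, w_con=1.0)
new_params = sgd_step([1.0, -2.0], [0.5, -1.0], learning_rate=0.1)
```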
  • the cross-modal search model can not only be applied to the semantic feature extraction processing of the first modal data and the second modal data, but also detect the matching between the first modal data and the second modal data based on the cross-modal features. degree; the cross-modal search model also has a multi-label classification function, which generates category labels for the second modal data and stores them in the second modal database.
  • The following example takes the first modal processing network as a text encoder and the second modal processing network as an image encoder to illustrate the training process of the cross-modal search model.
  • Figure 5 is a schematic diagram of cross-modal search model training provided by an embodiment of the present application.
  • The cross-modal training data set includes K groups of image-text pairs.
  • The K groups of image-text pairs are input at the same time, and the image feature vectors and text feature vectors are obtained through the image encoder and the text encoder respectively.
  • The image encoder also outputs the category prediction probability P_I, that is, the predicted probabilities of the image over the C categories.
  • InfoNCE loss can then be used to calculate the cross-modal contrast loss L_infoNCE between image-text pairs. Its expression is written in terms of the i-th image feature vector and the i-th text feature vector.
  • Image-text pairs can be divided into positive sample pairs and negative sample pairs. Positive sample pairs are image-text pairs in which the image and the text description match; negative sample pairs are image-text pairs in which the image and the text description do not match.
  • The cross-modal contrast loss contains both the similarity between positive sample pairs and the similarity between negative sample pairs, so that the smaller the cross-modal contrast loss, the better the first modal sample data and the second modal sample data match.
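  • One common image-to-text form of InfoNCE is sketched below for a batch of K pairs. It is not claimed to be the exact expression of this embodiment: the temperature value, the use of dot products on L2-normalized vectors, and the choice of the other K-1 texts in the batch as negatives are all assumptions of the sketch.

```python
import math


def info_nce(image_features, text_features, temperature=0.07):
    """Image-to-text InfoNCE over K pairs; image i and text i form the positive pair.

    loss = -(1/K) * sum_i log( exp(s_ii / t) / sum_j exp(s_ij / t) ),
    where s_ij is the dot product of image feature i and text feature j.
    """
    k = len(image_features)
    loss = 0.0
    for i in range(k):
        logits = [
            sum(a * b for a, b in zip(image_features[i], text_features[j])) / temperature
            for j in range(k)
        ]
        # Log-sum-exp with the maximum subtracted for numerical stability.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(z - peak) for z in logits))
        loss += log_denominator - logits[i]
    return loss / k


images = [[1.0, 0.0], [0.0, 1.0]]
matched = info_nce(images, [[1.0, 0.0], [0.0, 1.0]])     # texts aligned with images
mismatched = info_nce(images, [[0.0, 1.0], [1.0, 0.0]])  # texts swapped
```

  • The aligned batch gives a loss close to zero while the swapped batch gives a large one, which illustrates why a smaller loss corresponds to better-matched pairs.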
  • The classification loss L_cls of the image can be calculated using cross-entropy loss (CEL); L_cls and L_infoNCE are combined as the total loss, and the SGD optimizer is used to update the model parameters until convergence.
  • The algorithm flow used in the cross-modal search solution is shown in the flow chart of Figure 6.
  • An image is input into the image encoder for multi-label classification to obtain its category labels, which can be matched against the search text (denoted query) to find relevant images.
  • The image encoder also outputs the image feature vector, and the image feature vector and a new index are added to the image feature vector retrieval set.
  • The search text query can be input into the text encoder, which outputs the text feature vector; similar image feature vectors are then retrieved from the image feature vector retrieval set based on the text feature vector, and the corresponding image set is recalled based on those similar image feature vectors.
  • By matching the features of data in different modalities, this solution based on cross-modal feature search can support more diverse and complex text descriptions without relying on the fixed category label system of a classification model, which increases the freedom of search term input and allows target images to be found faster, more accurately and more comprehensively.
  • Figure 7 is a third schematic flowchart of a cross-modal search method provided by an embodiment of the present application.
  • The cross-modal search method can be performed by a computer device (such as the cross-modal search device 102 shown in Figure 1, which may specifically be a terminal).
  • The cross-modal search method includes but is not limited to the following steps.
  • S701 Display the conversation interface of the social conversation.
  • A social conversation here can be an individual-to-individual conversation or a group conversation. In the conversation interface, conversation objects can send or receive conversation messages such as images, text and voice.
  • If a conversation message received in the conversation interface includes second modal data, the second modal processing network in the cross-modal search model can be called to process the second modal data and output its multi-label classification category labels and its semantic features. The category labels are stored in the second modal database, and the semantic features (such as image feature vectors) are stored in the second modal feature library.
  • The conversation interface of the social conversation can provide a function for viewing historical conversation records. Specifically, the session details interface can be entered from the conversation interface. The session details interface includes a viewing entrance for historical session records, through which the session object can initiate a viewing operation to view and search specific historical session records. For details, see steps S702 to S703 below.
  • The session record details interface includes the second modal data in the historical session records of the social session.
  • Historical session records of a social session can include data in different modalities, such as images, videos, text and audio, and the session object can select data of different modalities for viewing. The viewing of historical session records here mainly concerns the second modal data, so what is displayed in the session record details interface is the second modal data generated in the historical session records.
  • The second modal data can be displayed in full in the session record details interface. If the amount of second modal data is large, only part of the second modal data is displayed in the current session record details interface.
  • For example, the second modal data is an image and the session record details interface is a chat photo wall in which 12 images of the same size can be displayed. If all the historical session records contain 10 images, all of them are displayed in the session record details interface. If there are more than 12 images, at most 12 are displayed, and viewing the others requires an operation such as sliding down.
  • The session record details interface can support searching for second modal data with first modal data and outputting the second modal data that match the first modal data, that is, the cross-modal search results.
  • Figure 8a is a schematic diagram of an operation of viewing historical session records provided by an embodiment of the present application.
  • the conversation interface 810 provides an entrance to search historical conversation records, that is, "Search Chat Content".
  • in the historical conversation record search interface 811, the corresponding search type can be selected, and all historical conversation records of that search type are displayed.
  • the chat photo wall is then displayed in the session record details interface 812. The chat photo wall shows all pictures and videos arranged by date, as in (3) in Figure 8a, and the session record details interface 812 provides a search box 8120 to facilitate searching for pictures or videos.
  • the second modality data in the historical session records of the social session is stored in the second modality database, and the second modality database stores attribute information of the second modality data.
  • the search can be performed directly in the second modal database instead of in a global database, which helps improve the efficiency of searching for second modal data in historical session records.
  • the second modal database stores attribute information of the second modal data. Different attribute information can further expand the search dimension.
  • the attribute information includes at least one of the following: category labels, first modality description information associated with the second modality data, and first modality description information identified from the second modality data.
  • the category label can be annotation information generated by humans or machines (such as classification models) to classify the second modal data.
  • the first modal description information is description information about the second modal data. It can be identified from the second modal data itself, or it can be associated information generated in the historical conversation records. For example, suppose the second modal data is an image.
  • the text in the image can be obtained by recognizing the image and used as the first modal description information. Alternatively, if a conversation object in the social session sends an image followed by a text description of it, for example "Look at the great changes in Park A", a description of the image can be generated from that text, for example by extracting the keyword "Park A" as the first modal description information of the image.
  • the cross-modal search results are obtained using the cross-modal search method introduced in the previous embodiment.
  • the output cross-modal search results include all second modal data that matches the first modal data input in the session record details interface.
  • the first modal data is text
  • the second modal data is image
  • the session record details interface includes a search box
  • the first modal data is obtained by inputting in the search box; or
  • the session record details interface also includes at least one recommended text
  • the first modal data is obtained by selecting at least one recommended text. That is, the first modal data can be entered manually into the search box through an input device (such as a physical or virtual keyboard, or an intelligent voice device), or it can be selected from the recommended text provided in the session record details interface. Optionally, the selected recommended text is automatically filled into the search box and the search is launched automatically.
  • the recommended text in the session record details interface may be randomly generated, or may be generated based on the attribute information of the second modal data or the semantic features of the second modal data.
  • the text entered in the search box can be an image description that conforms to intuitive expressions.
  • when the conversation object searches in the search box, images whose category labels match the text can be queried and recalled from the second modal database. At the same time, the text encoder in the cross-modal search model processes the search text and outputs the corresponding text feature vector; similar image feature vectors are retrieved from the image feature vector retrieval set and the corresponding image set is recalled. Finally, all recalled images are merged and displayed to the conversation object.
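The two recall paths described above (category-label lookup and feature-vector retrieval) and the final merge can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of this application: the `ImageRecord` type, the toy two-dimensional vectors, and the `0.8` threshold are hypothetical, and `query_vector` merely stands in for the output of the text encoder.

```python
import math
from dataclasses import dataclass


@dataclass
class ImageRecord:
    image_id: str
    labels: set    # category labels kept in the second modal database
    vector: list   # image feature vector kept in the feature library


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search_images(query_text, query_vector, records, threshold=0.8):
    # Path 1: recall images whose category label equals the query text.
    by_label = [r.image_id for r in records if query_text in r.labels]
    # Path 2: recall images whose feature vector is close to the query vector.
    by_vector = [r.image_id for r in records
                 if cosine(query_vector, r.vector) >= threshold]
    # Merge both recalls, dropping duplicates while keeping first-seen order.
    return list(dict.fromkeys(by_label + by_vector))


records = [
    ImageRecord("img1", {"food"}, [1.0, 0.0]),
    ImageRecord("img2", {"ticket"}, [0.9, 0.1]),
    ImageRecord("img3", {"screenshot"}, [0.0, 1.0]),
]
results = search_images("food", [1.0, 0.0], records)
```

Here `img2` carries no "food" label but is still recalled because its feature vector is close to the query vector, which is the benefit the merge is meant to provide.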
  • Figure 8b is a schematic diagram of the operation of cross-modal search provided by the embodiment of the present application.
  • the session record details interface provides a search box 8220, and the prompt in the search box 8220 indicates that the search supports entering an image description or text in the picture.
  • Image description is a semantic explanation of the content contained in the image
  • text in the picture belongs to the content information of the image.
  • automatically generated recommended text is also displayed on the session record details interface, such as "ticket" and "screenshot" in Figure 8b.
  • the recommended text can provide more references and convenient operations.
  • a search results interface can be output, and pictures matching the query text are displayed on it, as shown in (2) in Figure 8b.
  • the search results interface 823 shows 3 pictures that match the input query text "food"; these are the cross-modal search results.
  • the first search rule and the second search rule are rules for searching along different dimensions, and the cross-modal search results can be divided and displayed by search dimension. Searching according to the first search rule obtains and outputs the second modal data that matches the content information of the first modal data; searching according to the second search rule obtains and outputs the second modal data that matches the semantic information of the first modal data. In other words, a single search dimension can be specified. For example, when the first modal data is text and the second modal data is image, the search can be by image or by text. Searching by image means searching by image description, that is, along the dimension of matching the semantic information of the image. Searching by text means searching by the text in the image, that is, along the dimension of matching the content information of the image.
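One way to picture the two search rules is to tag each result with the dimension(s) that recalled it and filter on display. The sketch below is only an assumption about how such a filter might look; the `SearchRule` enum, the `filter_results` helper, and the tag strings are hypothetical names that do not appear in this application.

```python
from enum import Enum


class SearchRule(Enum):
    BY_TEXT = "content"     # first search rule: match text inside the image
    BY_IMAGE = "semantic"   # second search rule: match the image description


def filter_results(results, rule=None):
    """Return all results, or only those recalled along one dimension.

    `results` maps an image id to the set of dimensions that recalled it.
    """
    if rule is None:
        return list(results)
    return [image_id for image_id, dims in results.items()
            if rule.value in dims]


results = {
    "img1": {"content"},
    "img2": {"semantic"},
    "img3": {"content", "semantic"},
}
by_text = filter_results(results, SearchRule.BY_TEXT)
by_image = filter_results(results, SearchRule.BY_IMAGE)
```

Keeping the dimension tags alongside the merged result lets the interface switch between the full result list and a single-dimension view without running the search again.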
  • Figure 8c is a schematic diagram of the effect of outputting cross-modal search results according to an embodiment of the present application.
  • all images matching the query text, displayed based on the cross-modal search results in (2) in Figure 8b, are obtained by searching along different search dimensions.
  • the search results interface can display images for which the semantic information of the text matches the semantic information of the image, or the content information of the text matches the attribute information of the image (such as the category label of the image).
  • This solution can be applied to a variety of scenarios.
  • in addition to the cross-modal search based on historical conversation records of social conversations introduced in this embodiment, the solution can also be applied to other multimedia data search scenarios, such as short video search. No limitation is imposed here.
  • the cross-modal search solution provided by the embodiments of the present application can support cross-modal search in historical conversation records of social conversations. Specifically, it can be applied to cross-modal search between images and text, that is, entering search terms in the search box to find target pictures. Because the cross-modal search considers multiple dimensions of the search terms, the terms need not exactly match a picture's category label for the picture to be found. Users can therefore enter image descriptions that are more intuitive, more diverse, and more complex, which increases the freedom of input, greatly increases the probability of finding the target picture, and improves the diversity of cross-modal search results. In addition, providing recommended text (such as recommended search terms) can also improve search efficiency to a certain extent.
  • FIG. 9 is a schematic structural diagram of a cross-modal search device provided by an embodiment of the present application.
  • the cross-modal search device may be a computer program (including program code) running in a computer device.
  • the cross-modal search device may be application software; the cross-modal search device may be used to execute the corresponding steps in the method provided by the embodiments of the present application.
  • the cross-modal search device 900 may include: an acquisition module 901, a search module 902, and a merging module 903.
  • Acquisition module 901 used to acquire first modal data
  • the search module 902 is configured to search in the second modality database based on the content information of the first modality data to obtain a first set.
  • the first set includes at least one second modal data that matches the content information of the first modality data;
  • the search module 902 is also configured to search in the second modality database based on the semantic information of the first modality data to obtain a second set.
  • the second set includes at least one second modal data that matches the semantic information of the first modality data;
  • the merging module 903 is used to merge the first set and the second set to obtain cross-modal search results corresponding to the first modal data.
  • the second modality database stores N second modality data and respective attribute information of the N second modality data, where N is a positive integer; the search module 902 is specifically used to: for each second modal data in the N second modal data, determine the matching degree between the content information of the first modal data and the attribute information of the second modal data, as the matching degree corresponding to the second modal data; and add the second modal data whose corresponding matching degree satisfies the matching condition to the first set.
  • the attribute information includes first modal description information; any one of N second modal data is represented as the i-th second modal data, i is a positive integer, and i is less than or equal to N;
  • the search module 902 is specifically configured to: determine the semantic similarity between the content information of the first modal data and the first modal description information of the i-th second modal data, as the matching degree corresponding to the i-th second modal data; if the semantic similarity between the content information of the first modal data and the first modal description information of the i-th second modal data is greater than the first similarity threshold, add the i-th second modal data to the first set.
  • the attribute information includes category labels; any one of the N second modal data is represented as the i-th second modal data, i is a positive integer, and i is less than or equal to N; the search module 902 is specifically used to: determine the similarity between the content information of the first modal data and the category label of the i-th second modal data, as the matching degree corresponding to the i-th second modal data; if the similarity between the content information of the first modal data and the category label of the i-th second modal data is greater than the second similarity threshold, add the i-th second modal data to the first set.
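The threshold checks against description information and category labels can be sketched as below. The similarity function here is a stand-in: `difflib.SequenceMatcher` measures character overlap rather than semantic similarity, and the record layout, the function names, and the `0.9` thresholds are all assumptions made for illustration.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Stand-in text similarity in [0, 1]; a real system might use a model."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def first_set(query, database, first_threshold=0.9, second_threshold=0.9):
    """Collect ids whose description or category label matches the query."""
    matched = []
    for item in database:
        # Compare with the first modal description information.
        desc_hit = any(similarity(query, d) > first_threshold
                       for d in item["descriptions"])
        # Compare with the category labels.
        label_hit = any(similarity(query, label) > second_threshold
                        for label in item["labels"])
        if desc_hit or label_hit:
            matched.append(item["id"])
    return matched


database = [
    {"id": "img1", "labels": ["food"], "descriptions": []},
    {"id": "img2", "labels": ["scenery"], "descriptions": ["Park A"]},
    {"id": "img3", "labels": ["ticket"], "descriptions": []},
]
park_matches = first_set("park a", database)
food_matches = first_set("food", database)
```

Using two separate thresholds mirrors the text, which distinguishes a first similarity threshold for description information from a second one for category labels.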
  • the second modality database stores N second modality data; the second modality database is associated with a second modality feature library, and the second modality feature library stores the respective semantic features of the N second modal data; the search module 902 is also specifically used to: obtain the semantic features of the first modal data; based on the semantic features of the first modal data, search the second modal feature library for target semantic features that match the semantic features of the first modal data; and determine, in the second modal database, the second modal data corresponding to the target semantic features as second modal data matching the semantic information of the first modal data, and add it to the second set.
  • the second modality feature database and the second modality database are related through a feature index; the search module 902 is specifically used to: determine the feature index of the target semantic feature; based on the feature index of the target semantic feature, in the second Second modal data corresponding to the feature index of the target semantic feature is determined in the modal database.
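The role of the feature index can be illustrated with two dictionaries that share the same keys. This is a simplified sketch: the dictionary layout, the file names, the `top_k` parameter, and the use of a dot product over unit-length vectors are assumptions, and a production system would more likely use a dedicated vector index.

```python
# Second modal database: feature index -> stored second modal data.
second_modal_database = {0: "img_a.png", 1: "img_b.png", 2: "img_c.png"}

# Second modal feature library: feature index -> semantic feature vector.
feature_library = {0: [1.0, 0.0], 1: [0.6, 0.8], 2: [0.0, 1.0]}


def dot(a, b):
    """Dot product; equals cosine similarity for unit-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def second_set(query_feature, top_k=2):
    # Rank feature indexes by similarity to the query's semantic feature.
    ranked = sorted(feature_library,
                    key=lambda idx: dot(query_feature, feature_library[idx]),
                    reverse=True)
    # Resolve each target feature's index to the stored second modal data.
    return [second_modal_database[idx] for idx in ranked[:top_k]]


matches = second_set([0.0, 1.0])
```

Because both stores are keyed by the same index, the lookup from a matched feature to its second modal data is a single dictionary access.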
  • the semantic features of the N second modal data stored in the second modal feature library are obtained by performing feature extraction processing on each of the N second modal data through the second modal processing network in the cross-modal search model;
  • the cross-modal search model also includes a first modal processing network;
  • the search module 902 is specifically used to: perform feature extraction processing on the first modal data through the first modal processing network in the cross-modal search model to obtain the semantic features of the first modal data.
  • the second modal processing network includes a feature extraction network, a pooling processing network and a feature integration network; any one of the N second modal data is represented as the i-th second modal data, i is a positive integer, and i is less than or equal to N; the search module 902 is specifically used to: extract the initial features of the i-th second modal data through the feature extraction network in the second modal processing network; perform pooling processing on the initial features through the pooling processing network in the second modal processing network to obtain the pooled features of the i-th second modal data; and integrate the pooled features through the feature integration network to obtain the semantic features of the i-th second modal data.
  • the second modality processing network also includes a classification network; the search module 902 is also specifically configured to perform classification prediction processing based on pooled features through the classification network to obtain the category of the i-th second modality data. label; and, add the category label of the i-th second modal data to the second modal database.
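The data flow through the pooling, feature integration, and classification networks can be sketched in plain Python. The text does not specify the layer types, so global average pooling, a single dense layer for integration, and a sigmoid multi-label classifier are assumptions chosen for illustration; `feature_map` stands in for the output of the feature extraction network, and all weights are toy values.

```python
import math


def global_average_pool(feature_map):
    """Average each channel of a [channels][positions] feature map."""
    return [sum(channel) / len(channel) for channel in feature_map]


def linear(vector, weights, bias):
    """Dense layer: one output per weight row."""
    return [sum(w * x for w, x in zip(row, vector)) + b
            for row, b in zip(weights, bias)]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def process_second_modal(feature_map, integ_w, integ_b, cls_w, cls_b,
                         labels, threshold=0.5):
    pooled = global_average_pool(feature_map)    # pooling processing network
    semantic = linear(pooled, integ_w, integ_b)  # feature integration network
    scores = [sigmoid(s) for s in linear(pooled, cls_w, cls_b)]  # classifier
    predicted = [name for name, p in zip(labels, scores) if p > threshold]
    return semantic, predicted


semantic, predicted = process_second_modal(
    feature_map=[[1.0, 3.0], [0.0, 2.0]],
    integ_w=[[1.0, 0.0], [0.0, 1.0]], integ_b=[0.0, 0.0],
    cls_w=[[1.0, -1.0], [-1.0, 0.0]], cls_b=[0.0, 0.0],
    labels=["food", "ticket"],
)
```

Both heads read the same pooled features, so one forward pass yields the semantic feature for the feature library and the category labels for the second modal database.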
  • the cross-modal search device also includes a training module 904, configured to: obtain a cross-modal training data set, where the cross-modal training data set includes multiple sets of cross-modal sample data, and each set of cross-modal sample data includes second modal sample data, first modal sample data, and the matching result between the second modal sample data and the first modal sample data; perform feature extraction processing on the first modal sample data in the cross-modal sample data through the first modal processing network in the cross-modal search model to obtain the semantic features of the first modal sample data; perform feature extraction processing on the second modal sample data in the cross-modal sample data through the second modal processing network in the cross-modal search model to obtain the semantic features of the second modal sample data; and iteratively train the cross-modal search model with a cross-modal contrastive loss determined from the semantic features of the first modal sample data and the semantic features of the second modal sample data, to obtain the trained cross-modal search model.
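The exact form of the cross-modal contrastive (comparison) loss is not given in this passage. As one common choice, the sketch below computes a symmetric InfoNCE-style loss over a batch in which the i-th text and the i-th image are the only matching pair; the function names and the `temperature` value are assumptions, and the feature vectors are expected to be normalized.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cross_entropy_row(logits, target):
    """Negative log-softmax of the target entry, computed stably."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_sum - logits[target]


def contrastive_loss(text_feats, image_feats, temperature=0.1):
    """Symmetric InfoNCE where pair i is the only match for sample i."""
    n = len(text_feats)
    sims = [[dot(t, v) / temperature for v in image_feats]
            for t in text_feats]
    # Text-to-image direction: each row should peak on the diagonal.
    text_to_image = sum(cross_entropy_row(sims[i], i) for i in range(n))
    # Image-to-text direction: each column should peak on the diagonal.
    image_to_text = sum(
        cross_entropy_row([sims[j][i] for j in range(n)], i)
        for i in range(n))
    return (text_to_image + image_to_text) / (2 * n)


aligned = contrastive_loss([[1.0, 0.0], [0.0, 1.0]],
                           [[1.0, 0.0], [0.0, 1.0]])
misaligned = contrastive_loss([[1.0, 0.0], [0.0, 1.0]],
                              [[0.0, 1.0], [1.0, 0.0]])
```

The loss is small when matching pairs have similar features and large when they do not, so minimizing it pulls matching text and image features together in the shared feature space.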
  • FIG 10 is a schematic structural diagram of another cross-modal search device provided by an embodiment of the present application.
  • the cross-modal search device may be a computer program (including program code) running in a computer device.
  • the cross-modal search device may be application software; the cross-modal search device may be used to execute the corresponding steps in the method provided by the embodiments of the present application.
  • the cross-modal search device 1000 may include: a display module 1001 and an output module 1002.
  • Display module 1001 used to display the conversation interface of the social conversation
  • the display module 1001 is also configured to display a session record details interface in response to viewing the historical session records of the social session, where the session record details interface includes the second modal data in the historical session records of the social session;
  • the output module 1002 is configured to respond to the first modal data input in the session record details interface and output the cross-modal search results corresponding to the first modal data; the cross-modal search results are obtained by using the cross-modal search method described in the embodiments of this application.
  • the second modality data in the historical session records of the social session is stored in the second modality database, and the second modality database stores attribute information of the second modality data, and the attribute information includes at least one of the following: category label, first modality description information associated with the second modality data, and first modality description information identified from the second modality data.
  • the first modal data is text and the second modal data is image;
  • the session record details interface includes a search box, and the first modal data is obtained by inputting in the search box; or, the session record details interface also includes at least one recommended text, and the first modal data is obtained by selecting at least one recommended text.
  • the output module 1002 is specifically configured to: in response to the selection of the first search rule, output the second modal data in the cross-modal search results that matches the content information of the first modal data; or, in response to the selection of the second search rule, output the second modal data in the cross-modal search results that matches the semantic information of the first modal data.
  • the cross-modal search device in Figure 9 and the cross-modal search device in Figure 10 can be deployed in the same computer device, or they can be deployed in different computer devices.
  • when deployed in the same computer device, the computer device can automatically search the database for second modal data that matches the input first modal data, obtain the cross-modal search results, and then output them; when deployed in different computer devices, assume the cross-modal search device of Figure 9 is deployed in computer device A and the cross-modal search device of Figure 10 is deployed in computer device B. Computer device B receives the input first modal data and sends it to computer device A; computer device A searches the second modal database for second modal data that matches the first modal data, obtains the cross-modal search results, and sends them to computer device B, which displays them.
  • the computer device 1100 may include an independent device (such as one or more servers, nodes, terminals, etc.), or may include components within the independent device (such as a chip, software module, or hardware module, etc.).
  • the computer device 1100 may include at least one processor 1101 and a communication interface 1102. Further optionally, the computer device 1100 may also include at least one memory 1103 and a bus 1104. Among them, the processor 1101, the communication interface 1102 and the memory 1103 are connected through the bus 1104.
  • the processor 1101 is a module that performs arithmetic and/or logical operations. Specifically, it can be a central processing unit (CPU), a graphics processing unit (GPU), a microprocessor unit (MPU), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), a complex programmable logic device (CPLD), a coprocessor (which assists the central processing unit with the corresponding processing and applications), a microcontroller unit (MCU), or another processing module, or a combination thereof.
  • the communication interface 1102 may be used to provide information input or output for the at least one processor, and/or to receive data sent from the outside and/or send data to the outside. It can be a wired link interface such as an Ethernet cable, or a wireless link interface (Wi-Fi, Bluetooth, general wireless transmission, in-vehicle short-range communication technology, other short-range wireless communication technologies, etc.).
  • the memory 1103 is used to provide storage space, and data such as operating systems and computer programs can be stored in the storage space.
  • the memory 1103 may be one or a combination of random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM), compact disc read-only memory (CD-ROM), and the like.
  • at least one processor 1101 in the computer device 1100 is used to call a computer program stored in at least one memory 1103 to execute the aforementioned cross-modal search method, such as the cross-modal search method described in the embodiments shown in Figures 2, 3 and 7.
  • the computer device 1100 described in the embodiment of the present application can execute the cross-modal search method described in the corresponding embodiments above, and can also perform what is described for the cross-modal search device 900 in the embodiment corresponding to FIG. 9 or the cross-modal search device 1000 in the embodiment corresponding to FIG. 10, which will not be repeated here. The beneficial effects of using the same method are likewise not repeated.
  • an exemplary embodiment of the present application also provides a storage medium in which a computer program for the aforementioned cross-modal search method is stored.
  • the computer program includes program instructions.
  • the program instructions are loaded and executed by the processor to implement the cross-modal search method described in the embodiments, which will not be described again here.
  • the description of the beneficial effects of using the same method will not be described again here.
  • the program instructions may be deployed and executed on one or multiple computer devices capable of communicating with each other.
  • the above-mentioned computer-readable storage medium may be the cross-modal search device provided in any of the foregoing embodiments or an internal storage unit of the above-mentioned computer device, such as a hard disk or memory of the computer device.
  • the computer-readable storage medium can also be an external storage device of the computer device, such as a plug-in hard disk, a smart media card (SMC), a secure digital (SD) card equipped on the computer device, Flash card, etc.
  • the computer-readable storage medium may also include both an internal storage unit of the computer device and an external storage device.
  • the computer-readable storage medium is used to store the computer program and other programs and data required by the computer device.
  • the computer-readable storage medium can also be used to temporarily store data that has been output or is to be output.
  • a computer program product or computer program is provided, which computer program product or computer program includes computer instructions stored in a computer-readable storage medium.
  • the processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions, so that the computer device executes the method provided in one aspect of the embodiment of the present application.
  • the computer program product includes a computer program or computer instructions which, when executed by a processor, implement the steps of the cross-modal search method provided by the embodiments of the present application.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Biophysics (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Health & Medical Sciences (AREA)
  • Fuzzy Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

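Embodiments of this application disclose a cross-modal search method and related devices. The cross-modal search method includes: acquiring first modal data; searching a second modal database based on content information of the first modal data to obtain a first set, the first set including at least one second modal data that matches the content information of the first modal data; searching the second modal database based on semantic information of the first modal data to obtain a second set, the second set including at least one second modal data that matches the semantic information of the first modal data; and merging the first set and the second set to obtain a cross-modal search result corresponding to the first modal data. The embodiments of this application can improve the efficiency of cross-modal search as well as the diversity and comprehensiveness of cross-modal search results.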

Description

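Cross-modal search method and related devices
This application claims priority to Chinese patent application No. 2022102220890, entitled "Cross-modal search method and related devices" and filed with the China Patent Office on March 7, 2022, the entire contents of which are incorporated herein by reference.
Technical Field
This application relates to the field of computer technology, and in particular to cross-modal search technology.
Background
With the rapid development of Internet technology, searching data with the aid of computer devices has become an indispensable function in production, daily life, work, and study. In practice, current search commonly suffers from problems such as no support for cross-modal search, a single search dimension, low search efficiency, and incomplete search results.
Summary
Embodiments of this application provide a cross-modal search method and related devices that can improve the efficiency of cross-modal search and the diversity and comprehensiveness of cross-modal search results.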
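In one aspect, embodiments of this application provide a cross-modal search method, performed by a computer device, including:
acquiring first modal data;
searching a second modal database based on content information of the first modal data to obtain a first set, the first set including at least one second modal data that matches the content information of the first modal data;
searching the second modal database based on semantic information of the first modal data to obtain a second set, the second set including at least one second modal data that matches the semantic information of the first modal data;
merging the first set and the second set to obtain a cross-modal search result corresponding to the first modal data.
In one aspect, embodiments of this application provide another cross-modal search method, performed by a computer device, including:
displaying a conversation interface of a social conversation;
in response to viewing of the historical session records of the social conversation, displaying a session record details interface, the session record details interface including second modal data in the historical session records of the social conversation;
in response to first modal data input in the session record details interface, outputting a cross-modal search result corresponding to the first modal data, the cross-modal search result being obtained by the cross-modal search method of the embodiments of this application.
In one aspect, embodiments of this application provide a cross-modal search device, including:
an acquisition module, configured to acquire first modal data;
a search module, configured to search a second modal database based on content information of the first modal data to obtain a first set, the first set including at least one second modal data that matches the content information of the first modal data;
the search module being further configured to search the second modal database based on semantic information of the first modal data to obtain a second set, the second set including at least one second modal data that matches the semantic information of the first modal data;
a merging module, configured to merge the first set and the second set to obtain a cross-modal search result corresponding to the first modal data.
In one aspect, embodiments of this application provide another cross-modal search device, including:
a display module, configured to display a conversation interface of a social conversation;
the display module being further configured to display a session record details interface in response to viewing of the historical session records of the social conversation, the session record details interface including second modal data in the historical session records of the social conversation;
an output module, configured to output, in response to first modal data input in the session record details interface, a cross-modal search result corresponding to the first modal data, the cross-modal search result being obtained by the cross-modal search method of the embodiments of this application.
In one aspect, embodiments of this application provide a computer device, including a processor, a memory, and a network interface. The processor is connected to the memory and the network interface; the network interface provides network communication functions, the memory stores program code, and the processor calls the program code to execute the cross-modal search method in the embodiments of this application.
In one aspect, embodiments of this application provide a computer-readable storage medium that stores a computer program. The computer program includes program instructions that, when executed by a processor, perform the cross-modal search method in the embodiments of this application.
In one aspect, embodiments of this application provide a computer program product that includes a computer program or computer instructions which, when executed by a processor, implement the cross-modal search method provided in one aspect of the embodiments of this application.
In the embodiments of this application, second modal data matching the content information of the first modal data can be found based on that content information, and second modal data matching the semantic information of the first modal data can be found based on that semantic information. The embodiments therefore not only support cross-modal search but also support a combined search along the two dimensions of content and semantics, so the search is no longer limited to a single dimension. In addition, the second modal data found along the two dimensions is merged into the cross-modal search result, so a single search yields results from multiple dimensions, which improves the efficiency of cross-modal search. Because the cross-modal search result is merged from the results of two dimensions, it is also more diverse and more comprehensive.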
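Brief Description of the Drawings
Figure 1 is an architecture diagram of a cross-modal search system provided by an embodiment of this application;
Figure 2 is schematic flowchart 1 of a cross-modal search method provided by an embodiment of this application;
Figure 3 is schematic flowchart 2 of a cross-modal search method provided by an embodiment of this application;
Figure 4a is a schematic structural diagram of the first modal processing network in a cross-modal search model provided by an embodiment of this application;
Figure 4b is a schematic structural diagram of the second modal processing network in a cross-modal search model provided by an embodiment of this application;
Figure 5 is a schematic diagram of the training of a cross-modal search model provided by an embodiment of this application;
Figure 6 is a schematic diagram of the algorithm flow of a cross-modal search provided by an embodiment of this application;
Figure 7 is schematic flowchart 3 of a cross-modal search method provided by an embodiment of this application;
Figure 8a is a schematic diagram of an operation of viewing historical session records provided by an embodiment of this application;
Figure 8b is a schematic diagram of an operation of cross-modal search provided by an embodiment of this application;
Figure 8c is a schematic diagram of the effect of outputting cross-modal search results provided by an embodiment of this application;
Figure 9 is a schematic structural diagram of a cross-modal search device provided by an embodiment of this application;
Figure 10 is a schematic structural diagram of another cross-modal search device provided by an embodiment of this application;
Figure 11 is a schematic structural diagram of a computer device provided by an embodiment of this application.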
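Detailed Description
The technical solutions in the embodiments of this application are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are some rather than all of the embodiments of this application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of this application without creative effort fall within the protection scope of this application.
To better understand the solutions of the embodiments of this application, the related terms and concepts that may be involved are introduced first.
Chat photo wall: a page in an application (APP) that displays in full the pictures sent and received within each chat.
Multimodal learning: mapping data of two different modalities into the same feature space (for example, a semantic space) so that the data of the two modalities can be associated by semantics; modal data with similar semantics has similar features in that feature space. The two modalities can be, for example, image and text.
Based on the above terms and concepts, the architecture of the cross-modal search system provided by the embodiments of this application is introduced below with reference to the accompanying drawings.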
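Refer to Figure 1, which is a schematic architecture diagram of a cross-modal search system provided by an embodiment of this application. As shown in Figure 1, the architecture may include a database 101 and cross-modal search equipment 102. The cross-modal search equipment 102 can establish a communication connection with the database 101 in a wired or wireless manner. The database 101 can be a local database of the cross-modal search equipment 102 or a cloud database that the cross-modal search equipment 102 can access. The cross-modal search equipment 102 can be a computer device such as a server or a terminal.
In the embodiments of this application, the server can be an independent physical server, a server cluster or distributed system composed of multiple physical servers, or a cloud server that provides basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDN, and big data and artificial intelligence platforms; no limitation is imposed here. The terminal can be a smartphone, tablet computer, smart wearable device, smart voice interaction device, smart home appliance, personal computer, in-vehicle terminal, or similar device; no limitation is imposed here.
The database 101 may include a second modal database and a second modal feature library associated with the second modal database. The second modal database stores second modal data and attribute information of the second modal data. In one implementation, the attribute information of the second modal data can be information contained in the second modal data itself; for example, if the second modal data is an image, the attribute information can be the text in the image. In another implementation, the attribute information can also be information associated with the second modal data; for example, if the second modal data is an image, the attribute information can be a category label annotated for the image. The second modal feature library stores the semantic features of the second modal data, and the semantic feature of each second modal data is assigned a feature index, which helps to quickly find the second modal data in the second modal database.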
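The cross-modal search equipment 102 is used to search for second modal data according to first modal data and then generate cross-modal search results. The specific process is as follows: ① Acquire first modal data. The first modal data can be any of text, speech, image, and so on. ② Based on the content information and semantic information of the first modal data, search the database 101 (specifically, the second modal database) for second modal data matching the content information and for second modal data matching the semantic information. Here, content information refers to the content contained in the first modal data itself, and semantic information refers to the abstract meaning expressed by the first modal data. For example, if the first modal data is text, the content information is the characters in the text and the semantic information is the meaning the text expresses. If the first modal data is an image, the content information can be content contained in the image, such as text, and the semantic information can be semantic features extracted from the image. In one embodiment, second modal data matching the content information of the first modal data can be found directly in the second modal database, whereas matching on semantic information relies on the second modal feature library: second modal features matching the semantic information of the first modal data are found in the second modal feature library, and the corresponding second modal data is then determined in the second modal database according to those features, as the second modal data matching the semantic information. ③ Merge these second modal data as the cross-modal search result matching the first modal data.
The cross-modal search equipment 102 can also be used to output, according to input first modal data, the cross-modal search result corresponding to that first modal data. The specific process includes: ① display the conversation interface of a social conversation; ② in response to a viewing operation on the historical session records of the social conversation, display the session record details interface, in which second modal data belonging to the historical session records of the social conversation is displayed; ③ in response to first modal data input in the session record details interface, output the cross-modal search result corresponding to the first modal data. Optionally, the session record details interface can provide a search box in which the conversation object manually enters first modal data, or it can recommend first modal data for the conversation object to select, so that the search function is triggered quickly to perform the cross-modal search. In one embodiment, the cross-modal search can follow a specified search rule; for example, the input text can be searched against image descriptions or against the text in images. In this way, the displayed cross-modal search results can be tied to the search dimension: for example, the second modal data in the cross-modal search results that matches the content information of the first modal data can be output, or the second modal data that matches the semantic information of the first modal data can be output.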
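As described above, the cross-modal search system supports two cross-modal search solutions: a general cross-modal search at the technical level, and a cross-modal search in historical session records at the product level. The latter outputs cross-modal search results that are obtained by carrying out the technical-level solution. The cross-modal search equipment carrying these two solutions can be the same computer device or different computer devices. When they are different computer devices, say computer device A and computer device B, computer device B receives the input first modal data and sends it to computer device A; computer device A searches the database based on the received first modal data, obtains the cross-modal search result, and sends it to computer device B, which outputs it. When they are the same computer device, say computer device A, computer device A can automatically recognize the input first modal data, search the database for matching second modal data based on it, obtain the cross-modal search result, and output it.
It can be seen that the cross-modal search system provided by the embodiments of this application supports searching the second modal database for second modal data that matches the first modal data based separately on the content information and the semantic information of the first modal data. This is a cross-modal search approach, and because it searches along the two dimensions of content and semantics, the search is no longer limited to a single dimension and can cover all second modal data associated with the first modal data, so search results are obtained faster and more accurately. In addition, merging the second modal data found along the two dimensions into the cross-modal search result means a single search yields results from multiple dimensions, which significantly improves the efficiency of cross-modal search and produces rich and comprehensive results. The cross-modal search system can also provide a search function based on the historical session records of a social conversation. This function searches the second modal data in the historical session records and can display all cross-modal search results, or search along a specified dimension and display the results for that dimension. With the technical support of the cross-modal search solution above, the freedom and complexity allowed for the input first modal data are effectively increased when searching second modal data in historical session records.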
为了更好地理解本申请实施例提供的跨模态搜索方法,下面对该跨模态搜索方法可以应用的场景进行说明。具体而言,本申请实施例的跨模态搜索方法可以应用在如下所示的场景1、场景2中,但并不限于这几种应用场景。下面分别对场景1、场景2进行介绍。
场景1:第一模态数据为文本数据,第二模态数据为图像数据,针对图像数据、文本数据进行搜索匹配。在社交会话的历史会话记录中,存在诸多形式的会话消息,例如图片、视频、文件、链接、音乐等等,搜索历史会话记录是一种更快捷地触达历史会话记录中包含的历史会话消息的方式。针对图片或者视频形式的历史会话记录进行搜索,可以将手动输入的文本或者被选中的推荐文本作为搜索文本,然后输出与之匹配的图像数据,包括图片或者视频。此外,针对智能终端的系统相册中图片或者视频进行搜索,也可以采用本申请实施例的跨模态搜索方法,也是同理,可以将文本作为查询输入,通过对相册中的图像特征、以及图像中包含的文本信息或者关联的文本描述信息进行匹配,从而输出对应的图像数据。
场景2:第一模态数据为音频数据,第二模态数据为图像数据,针对图像数据、语音数据进行搜索匹配。以智能手机为例,当前许多智能手机均配备有智能语音的功能,通过智能语音能够控制终端设备自动执行相应操作。面对智能手机中海量的图片或者视频,通过语音查询涉及跨模态搜索的问题,即对语音进行识别和理解,将语音和图像映射到相同的特征比对空间中,从而匹配到与之对应的图片,此外,还可以将语音转换为文本,通过将文本与图像的类别标签、文本描述信息等进行比对,从而匹配到对应的图片或视频。通过本申请实施例的跨模态搜索方法,可以将语音作为查询输入,通过对手机相册中的图像内容进行匹配,从而自动输出与语音匹配的图像。
下面结合附图,对本申请实施例提出的跨模态搜索方法的具体实现方式进行详细阐述。
请参见图2,是本申请实施例提供的一种跨模态搜索方法的流程示意图一,该跨模态搜索方法可以由计算机设备(例如图1所示的跨模态搜索设备102)执行。其中,该跨模态搜索方法包括但不限于以下步骤。
S201,获取第一模态数据。
模态可以是指一种信息的来源或者形式。举例来说,人有听觉、视觉、嗅觉、触觉,信息的媒介有语音、视频、文字、图片等,以上每一种都可以视为一种模态。在本申请实施例中,跨模态搜索主要涉及对信息媒介的处理,模态数据具体可以是图像、视频、音频等不同形式的数据。获取到的第一模态数据可以是用户通过计算机设备输入的模态数据,可选地,第一模态数据可以是通过例如物理键盘、虚拟键盘、光标选择等辅助方式输入的文本数据或者图像数据,或者可以是通过智能语音设备识别的音频数据,或者可以是从推荐第一模态数据(例如推荐文本)中选择的。
S202,基于第一模态数据的内容信息在第二模态数据库中进行搜索,得到第一集合。
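The first set includes at least one piece of second-modality data that matches the content information of the first-modality data. The content information of the first-modality data is data information describing the essential content that the first-modality data contains. For example, when the first-modality data is text, the content information may be the text characters themselves or keywords extracted from the text. When the first-modality data is an image, the content information may be other-modality information or basic features contained in the image, such as any one or more of the geometric shapes, textures, colors, object category labels, and text description information of the image. Along the dimension of content information, all second-modality data matching the content information of the first-modality data can be searched in the second-modality database and added to the first set.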
在一个实施例中,第二模态数据库中存储有N个第二模态数据、以及N个第二模态数据各自的属性信息,N为正整数。第二模态数据和第一模态数据是两种不同模态的数据,第二模态数据可以是文本、图像、音频、视频等模态数据中的任一种,第二模态数据库中存储的第二模态数据在不同的业务场景中会有所不同。例如,在社交会话的历史会话记录搜索中,第二模态数据可以是会话中发送或接收到的图像。第二模态数据的属性信息是描述第二模态数据属性的信息,可以是从第二模态数据中识别到的、或者是其他数据产生的关联信息,该属性信息和第一模态数据的内容信息可以是相同记录形式的数据,例如均为文本描述信息。第一模态数据的内容信息可以和第二模态数据的属性信息进行匹配,从而在第二模态数据库中搜索到匹配的第二模态数据,得到第一集合。
可选地,步骤S202的具体实现方式包括以下步骤S2021和S2022:S2021、针对N个第二模态数据中的每个第二模态数据,确定第一模态数据的内容信息与该第二模态数据的属性信息之间的匹配度,作为该第二模态数据对应的匹配度;S2022、将所对应的匹配度满足匹配条件的第二模态数据添加至第一集合中。
可以将第一模态数据的内容信息分别和第二模态数据库中N个第二模态数据中的每个第二模态数据的属性信息进行匹配,得到对应的匹配度。此处的匹配度可以表示第一模态数据的内容信息和第二模态数据的属性信息之间是否相似或者一致。第一模态数据的内容信息和第二模态数据的属性信息之间的匹配度,可以通过模态数据的相似度(例如文本相似度),或者是抽象的语义相似度来衡量,也可以采用其他方式,在此不做限制。通过判断匹配度是否满足匹配条件,可以从第二模态数据库中搜索出与第一模态数据的内容信息相匹配的第二模态数据。此处的匹配条件可设定为匹配度大于或等于匹配度阈值,也可设定为匹配度排列在前y位,y为正整数。对匹配条件的具体设定内容不做限制。
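The two forms of matching condition described above (a matching-degree threshold, or the top y positions) can be sketched as follows. This is a minimal illustration, not the patent's implementation; the function name `select_matches` and its parameters are hypothetical, and the matching degrees are assumed to have been computed already.

```python
from typing import Dict, List, Optional


def select_matches(scores: Dict[str, float],
                   threshold: Optional[float] = None,
                   top_y: Optional[int] = None) -> List[str]:
    """Return ids of second-modality items whose matching degree satisfies the condition.

    scores maps an item id to its matching degree (hypothetical structure).
    threshold keeps items with degree >= threshold; top_y keeps the y best items.
    """
    # Rank items from the highest matching degree to the lowest.
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    if threshold is not None:
        ranked = [(item, s) for item, s in ranked if s >= threshold]
    if top_y is not None:
        ranked = ranked[:top_y]
    return [item for item, _ in ranked]
```

Either condition can be used alone, or both can be combined to cap the size of the first set.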
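Optionally, the attribute information includes one or both of first-modality description information and a category label. First-modality description information is description information recorded in the form of the first modality; for example, if the first-modality data is text, the first-modality description information is text description information, and if the first-modality data is an image, it is image description information. As attribute information of the second-modality data, the first-modality description information can be matched against the content information of the first-modality data. When the content information of the first-modality data and the attribute information of the second-modality data are recorded in the same modality, the matching is between information of the same modality, so comparing the content information with the first-modality description information makes it easier to filter out the second-modality data that matches the content information of the first-modality data. A category label is information annotated to classify the second-modality data; it may be annotated manually or obtained by feeding the second-modality data into a classification model for multi-label classification. The category label of the second-modality data can also be matched against the content information of the first-modality data to find second-modality data that satisfies the matching condition.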
对于属性信息的不同,步骤S2021和S2022的详细实现步骤也有所不同。具体可以参见下述两种实施方式。为便于描述,将N个第二模态数据中的任一个表示为第i个第二模态数据,i为正整数,且i小于或等于N。
在一种实施方式中,属性信息包括第一模态描述信息,步骤S2021和S2022分别对应的实现方式可以是:确定第一模态数据的内容信息与第i个第二模态数据的第一模态描述信息之间的语义相似度,作为该第i个第二模态数据对应的匹配度;若第一模态数据的内容信息与第i个第二模态数据的第一模态描述信息之间的语义相似度大于第一相似阈值,则将第i个第二模态数据添加至第一集合。
具体地,第一模态数据的内容信息与第二模态数据的属性信息之间的匹配度可以采用上述提及的语义相似度,针对语义相似度的获取方式,可以是:提取第一模态数据的内容信息对应的语义特征,以及第i个第二模态数据的第一模态描述信息对应的语义特征,然后确定第一模态数据的内容信息和第i个第二模态数据的第一模态描述信息各自对应的语义特征之间的相似度,并将其作为语义相似度。之后,可以通过判断语义相似度是否大于第一相似阈值,来确定第i个第二模态数据是否满足匹配条件:如果该语义相似度大于第一相似阈值,则表示第i个第二模态数据的属性信息与第一模态数据的内容信息之间的匹配度满足匹配条件,进一步表明第i个第二模态数据的属性信息与第一模态数据的内容信息相匹配,则可以将第i个第二模态数据添加至第一集合中,反之,第i个第二模态数据将不被添加至第一集合中。
通过计算第一模态数据的内容信息和第二模态数据的第一模态描述信息之间的语义相似度,可以获知第一模态数据的内容信息和第二模态数据的第一模态描述信息表达的语义的一致性,进而确定第二模态数据和第一模态数据是否匹配。
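For example, the first-modality data is text data and the second-modality data is image data. The first-modality data reads “蓝天白云” (blue sky and white clouds), and its content information is that text. The first-modality description information of the second-modality data is text description information about the image content; it is associated with the image and may be text contained in the image or a text description associated with the image. Suppose the text description associated with an image is “今天的天空真好看” (the sky looks great today). The keyword “天空” (sky) can be taken as the first-modality description information, and the semantic similarity between “天空” and “蓝天白云” is then computed to decide whether the two match, which in turn decides whether the image matches the text.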
在另一种实施方式中,属性信息包括类别标签,步骤S2021和S2022分别对应的实现方式可以是:确定第一模态数据的内容信息与第i个第二模态数据的类别标签之间的相似度,作为该第i个第二模态数据对应的匹配度;若第一模态数据的内容信息与第i个第二模态数据的类别标签之间的相似度大于第二相似阈值,则将第i个第二模态数据添加至第一集合。
在属性信息包括类别标签时,上述匹配度具体是指第一模态数据的内容信息和第i个第二模态数据的类别标签之间的相似度,例如可以是文本相似度,相似度可以代表第二模态数据的类别标签和第一模态数据的内容信息的一致程度,当第一模态数据的内容信息完全等同于第i个第二模态数据的类别标签时,第i个第二模态数据即为满足匹配条件的第二模态数据,或者,第一模态数据的内容信息与类别标签十分相似,同理也可以将第i个第二模态数据确定为满足匹配条件的第二模态数据。对于第i个第二模态数据是否满足匹配条件,具体可以通过第一模态数据的内容信息和第i个第二模态数据的类别标签之间的相似度是否大于第二相似度阈值确定:若相似度大于第二相似度阈值,则表明第一模态数据的内容信息和第i个第二模态数据的类别标签之间的匹配度满足匹配条件,进一步表明第i个第二模态数据的类别信息与第一模态数据的内容信息相匹配,则第i个第二模态数据可以被添加至第一集合,反之,第i个第二模态数据将不被添加至第一集合。
示例性地,第一模态数据为搜索文本,第二模态数据为图像,第i个第二模态数据为目标图像,且被分类模型划分为“人物”、“风景”两个类别标签,那么当搜索文本输入为“人物”或者是“风景”时,由于类别标签和搜索文本是完全等同的,因此可以匹配到该图片,这里所使用的相似度具体可以是文本相似度。
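Note that if only category labels were used for text-to-image search, and a search term had to be identical to a category label to match an image, the requirements on search terms would be strict, the supported search terms limited, and the dimension single, so searches could easily return no result. Searching along both the semantic and content dimensions while relaxing the matching condition (for example, also treating a search term that contains the category label as a match) can improve search efficiency and also reduce the probability of an empty search result.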
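The difference between strict label matching and the relaxed condition mentioned above (a search term that contains the category label also counts) can be sketched as follows. The function name `label_matches` and the `relaxed` flag are hypothetical, and plain string comparison stands in for whatever text similarity an implementation would actually use.

```python
from typing import List


def label_matches(query: str, labels: List[str], relaxed: bool = True) -> bool:
    """Check a search term against the category labels of one item.

    Strict mode requires the term to equal a label; relaxed mode also accepts
    a term that contains a label as a substring.
    """
    q = query.strip().lower()
    for label in labels:
        name = label.strip().lower()
        if not name:
            continue  # ignore empty labels
        if q == name or (relaxed and name in q):
            return True
    return False
```

With labels “人物” and “风景”, the term “海边风景” matches only in relaxed mode, which illustrates why relaxing the condition lowers the chance of an empty result.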
需要说明的是,上述两种实施方式针对N个第二模态数据中的任一个第二模态数据均适用,这样在第二模态数据库存储的N个第二模态数据都与第一模态数据的内容信息按照上述方式匹配之后,最终得到的第一集合可以作为下述跨模态搜索结果的一部分。
S203,基于第一模态数据的语义信息在第二模态数据库中进行搜索,得到第二集合。
第二集合中包括与第一模态数据的语义信息相匹配的至少一个第二模态数据。第一模态数据的语义信息作为另一种信息表现形式,具体可以是指第一模态数据所对应的现实世界中事物所代表的含义。语义信息可以用于表征对第一模态数据的浅层或深层的语义理解,语义信息可以是非常丰富的,例如第一模态数据为文本时,相同的语义可以有很多不同的文本表达,非常灵活。
通过第一模态数据的语义信息这一维度在第二模态数据库中进行搜索,具体可以将第二模态数据的语义信息和第一模态数据的语义信息进行匹配,进而,从第二模态数据库中搜索出所有与第一模态数据的语义信息相匹配的第二模态数据,得到第二集合。其中,语义信息可以通过语义特征来表示,具体可以是语义特征向量。以多模态学习为基础,可以通过分别提取第一模态数据的语义特征和第二模态数据的语义特征,将两个不同模态数据的语义特征映射到同一个语义特征空间进行相似度比对,进而,基于相似语义特征将具有相似语义的第二模态数据搜索到。此步骤的具体实现方式可以参见下述图3对应实施例的介绍,在此先不做详述。
在第一模态数据为文本,第二模态数据为图像的条件下,此步骤是基于跨模态特征的以文搜图方式,即通过分别提取搜索词的文本特征向量和图片的图像特征向量,将两种不同模态的特征向量在同一个语义特征空间比对相似度,从而通过文本描述直接跨模态检索到具有相似语义的图像,这样可以支持更多、更复杂的文本描述,实现输入自由多样的描述图像的文本来搜索目标图片。
S204,对第一集合和第二集合进行合并,得到第一模态数据对应的跨模态搜索结果。
按照上述步骤对第二模态数据库中存储的N个第二模态数据进行搜索,可以得到与第一模态数据的内容相匹配的第一集合、以及与第一模态数据的语义相匹配的第二集合。将第一集合和第二集合进行合并,可以得到所有与第一模态数据相匹配的第二模态数据,包括与第一模态数据的内容信息匹配的第二模态数据、以及与第一模态数据的语义信息匹配的第二模态数据,即是第一模态数据对应的跨模态搜索结果,由此得到的跨模态搜索结果包括多个维度的搜索结果,是多样化且全面的搜索结果。
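Step S204 can be sketched as an order-preserving union. The version below also records which dimension each item matched, which is an added convenience and not something the text requires; the name `merge_with_source` and the dimension tags "content" and "semantic" are hypothetical.

```python
from typing import Dict, Iterable, Set


def merge_with_source(first_set: Iterable[str],
                      second_set: Iterable[str]) -> Dict[str, Set[str]]:
    """Merge the content-matched and semantic-matched sets without duplicates.

    Returns a dict (insertion-ordered) from item id to the dimensions it matched.
    """
    merged: Dict[str, Set[str]] = {}
    for item in first_set:
        merged.setdefault(item, set()).add("content")
    for item in second_set:
        merged.setdefault(item, set()).add("semantic")
    return merged
```

An item found along both dimensions appears once in the merged result and carries both tags.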
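In the cross-modal search solution provided by the embodiments of this application, second-modality data matching the content information of the first-modality data is found in the second-modality database based on that content information, and second-modality data matching the semantic information of the first-modality data is found based on that semantic information. The search is therefore not confined to one dimension; it combines multiple dimensions, and a single search yields results along several dimensions, which improves the efficiency of cross-modal search. In addition, merging the second-modality data matched along the two dimensions into the cross-modal search result produces richer and more diverse results. The content-based search relies on the matching degree between the content information of the first-modality data and the attribute information of the second-modality data (first-modality description information or a category label). Because the attribute information mostly describes what the second-modality data contains, the first-modality data need not follow a fixed expression and can take more varied and complex forms.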
请参见图3,是本申请实施例提供的一种跨模态搜索方法的流程示意图二,该方法可以由计算机设备(例如图1所示的跨模态搜索设备102)执行。其中,本实施例的跨模态搜索方法是对图2对应的步骤S203:基于第一模态数据的语义信息在第二模态数据库中搜索,得到第二集合,对应实现方式的详细介绍。
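The second-modality database stores N pieces of second-modality data. The second-modality database is associated with a second-modality feature library, which stores the semantic feature of each of the N pieces of second-modality data. Searching the second-modality database based on the semantic information of the first-modality data to obtain the second set includes the following steps S301 to S304.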
S301,获取第一模态数据的语义特征。
在一个实施例中,第一模态数据的语义特征可以通过跨模态搜索模型处理得到,具体地,跨模态搜索模型包括第一模态处理网络,此步骤的具体实现方式可以为:通过跨模态搜索模型中的第一模态处理网络,对第一模态数据进行特征提取处理,得到第一模态数据的语义特征。第一模态处理网络是针对第一模态数据的处理网络,示例性地,当第一模态数据为文本时,第一模态处理网络可以是文本处理网络,该文本处理网络可以是BERT(Bidirectional Encoder Representation from Transformers,一种预训练的语言表征模型)模型或者是各类有关BERT的变种模型,也可以是其他自然语言处理(Natural Language Processing,NLP)模型。如图4a所示,为文本编码器处理的示意图,将文本作为输入,文本编码器(Text encoder)可以输出文本特征向量。
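S302: based on the semantic feature of the first-modality data, search the second-modality feature library for a target semantic feature that matches the semantic feature of the first-modality data.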
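Whether the semantic feature of the first-modality data matches the semantic feature of a piece of second-modality data can be determined by checking whether the similarity between the two semantic features is greater than a similarity threshold. Specifically, the feature similarity between the semantic feature of the first-modality data and each of the N semantic features stored in the second-modality feature library is computed, and every stored semantic feature whose feature similarity is greater than the similarity threshold is determined to be a semantic feature that matches the first-modality data, that is, a target semantic feature. One or more target semantic features can be found in the second-modality feature library this way.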
示例性地,第一模态数据为文本,第二模态数据为图像,第一模态数据对应的语义特征为文本特征向量,第二模态数据对应的语义特征为图像特征向量,用文本特征向量从图像特征库中检索出相似的图像特征向量,具体的检索方式可以是使用文本特征向量和图像特征向量计算特征相似度,并将特征相似度高于阈值的图像特征向量作为与文本特征向量匹配的目标图像特征向量。
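The threshold-based lookup in the feature library can be sketched in plain Python. This is an illustrative brute-force scan under the assumption that similarity is the dot product of L2-normalized vectors; the names `find_target_features` and `feature_library` are hypothetical, and a real system would likely use a vector index instead of a linear scan.

```python
import math
from typing import Dict, List, Sequence


def l2_normalize(v: Sequence[float]) -> List[float]:
    """Scale a vector to unit length (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def find_target_features(query_vec: Sequence[float],
                         feature_library: Dict[int, Sequence[float]],
                         threshold: float) -> List[int]:
    """Return the feature indexes whose similarity to the query exceeds the threshold."""
    q = l2_normalize(query_vec)
    return [idx for idx, feat in feature_library.items()
            if dot(q, l2_normalize(feat)) > threshold]
```

For a query [1, 0] and a threshold of 0.5, a stored vector [1, 1] has similarity of about 0.707 and is returned, while [0, 1] is not.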
S303,根据目标语义特征,在第二模态数据库中确定与第一模态数据的语义信息相匹配的第二模态数据。
由于第二模态特征库和第二模态数据库相关联,利用在第二模态特征库中查找到的目标语义特征,可以从第二模态数据库中确定出该目标语义特征所对应的第二模态数据,进而将其作为与第一模态数据的语义信息相匹配的第二模态数据。
在一个实施例中,第二模态特征库和第二模态数据库通过特征索引关联,步骤S303的实现方式具体可以包括以下步骤:(1)确定目标语义特征对应的特征索引;(2)基于目标语义特征对应的特征索引,在第二模态数据库中确定与该目标语义特征对应的特征索引对应的第二模态数据。
第二模态特征库中每个第二模态数据的语义特征关联有特征索引,且每个特征索引具备唯一性,特征索引和第二模态数据库中的第二模态数据也存在关联关系,这样第二模态数据库中的第二模态数据与第二模态特征库中第二模态数据的语义特征可以通过特征索引一一关联,以此能够基于查找到的目标语义特征对应的特征索引,从第二模态数据库中选取出与该特征索引对应的第二模态数据,得到与第一模态数据的语义信息相匹配的第二模态数据。
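The one-to-one association through a unique feature index can be sketched with two dictionaries keyed by the same index. The class name `FeatureIndexStore` and its methods are hypothetical; the sketch only shows how a target semantic feature leads back to its second-modality data.

```python
from typing import Dict, List, Sequence


class FeatureIndexStore:
    """Keeps a feature library and a data store linked by a shared, unique index."""

    def __init__(self) -> None:
        self.features: Dict[int, Sequence[float]] = {}  # feature index -> semantic feature
        self.items: Dict[int, str] = {}                  # feature index -> second-modality item
        self._next_index = 0

    def add(self, item: str, feature: Sequence[float]) -> int:
        """Store an item and its feature under a newly allocated index."""
        idx = self._next_index
        self._next_index += 1
        self.features[idx] = feature
        self.items[idx] = item
        return idx

    def items_for(self, indexes: List[int]) -> List[str]:
        """Resolve feature indexes of target semantic features to the stored items."""
        return [self.items[i] for i in indexes]
```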
S304,将与第一模态数据的语义信息相匹配的第二模态数据添加至第二集合。
从第二模态数据库中确定的与第一模态数据的语义信息相匹配的第二模态数据可以添加至第二集合中,对于第二模态数据库中存储的所有第二模态数据,均可以按照上述步骤处理,进而可以确定出所有与第一模态数据的语义信息相匹配的第二模态数据,并且将其一一添加至第二集合中,再将最终得到的第二集合作为跨模态搜索结果中的一部分。
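The cross-modal search method provided by the embodiments of this application searches along the dimension of the semantic information of the first-modality data. It extracts the semantic features of the first-modality data and of the second-modality data, compares them in the same semantic space, finds in the second-modality feature library the target semantic features that match the semantic feature of the first-modality data, and then determines, from those target semantic features, the second-modality data in the second-modality database that matches the semantic information of the first-modality data, giving the cross-modal search result. This is essentially a search based on cross-modal features. Semantic-level cross-modal features make it possible to find results that match the first-modality data more quickly and accurately, and to some extent they also increase the diversity of the cross-modal search result.
As described above, searching the second-modality database based on the semantic information of the first-modality data relies on the second-modality feature library. The following describes how the semantic features of the second-modality data stored in the second-modality feature library are obtained.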
在一个实施例中,跨模态搜索模型包括第二模态处理网络,第二模态数据库中存储的N个第二模态数据的语义特征是通过跨模态搜索模型中的第二模态处理网络对N个第二模态数据分别进行特征提取得到的。第二模态处理网络是针对第二模态数据的处理网络,可以包括多种功能不同的网络。以第二模态数据为图像为例,第二模态处理网络具体可以是图像处理网络。
可选地,第二模态处理网络包括特征提取网络、池化处理网络和特征整合网络;为便于描述,N个第二模态数据中的任一个表示为第i个第二模态数据,i为正整数,且i小于或等于N,即所有N个第二模态数据均按照下述步骤处理,得到对应的语义特征。基于此,通过跨模态搜索模型中的第二模态处理网络分别对N个第二模态数据进行特征提取处理,得到N个第二模态数据的语义特征的步骤具体可以包括:通过第二模态处理网络中的特征提取网络,提取第i个第二模态数据的初始特征;通过第二模态处理网络中的池化处理网络,对初始特征进行池化处理,得到第i个第二模态数据的池化特征;通过特征整合网络,对池化特征进行整合处理,得到第i个第二模态数据的语义特征。
其中,特征提取网络可以是用于图像处理的深度模型,例如常规的卷积神经网络(Convolutional Neural Network,CNN)模型或者是用于特征提取的VIT(Vision Transformer)模型,特征提取网络是第二模态处理网络中的主干网络(Backbone),主要用于提取第二模态数据的初始特征,以供后面的网络使用。池化处理网络可以用于对特征提取网络输出的初始特征进行池化处理,具体可以是全局平均池化处理(Global Average Pooling,GAP),此时池化处理网络也可以称为全局平均池化层,通过全局平均池化不仅可以降低参数量,防止过拟合,还可以整合全局空间信息,使得第二模态数据的特征更加鲁棒。之后,可以调用特征整合网络对池化处理网络输出的池化特征进行整合处理,得到第i个第二模态数据的语义特征。该特征整合网络具体可以是特征全连接层,由于全连接层要求输入的对象是一维的,因此池化特征输入特征整合网络处理之前,需要展平为一维的特征,然后再由特征整合网络处理该一维的特征,继而得到第二模态数据的语义特征。
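Note that the above describes how the cross-modal search model processes any one of the N pieces of second-modality data. In other words, the same processing steps can be applied to each of the N pieces of second-modality data to extract its semantic feature, which is then stored in the second-modality feature library.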
在一个可能的实施例中,第二模态处理网络还包括分类网络,还可以:通过分类网络,基于池化特征进行分类预测处理,得到第i个第二模态数据的类别标签;以及,将第i个第二模态数据的类别标签添加至第二模态数据库中。该分类网络可以是分类全连接层,同特征全连接层类似,分类全连接层处理的池化特征也是展平之后的一维特征,分类全连接层的输出通过激活函数(例如Sigmoid函数)之后得到第i个第二模态数据属于各个类别的分数,从而得到对应的类别标签。对于第二模态数据库中的N个第二模态数据的类别标签均可以采用上述分类网络进行多分类处理来获取,且各个第二模态数据的类别标签均可以添加至第二模态数据库中,以便于处理第一模态数据时,根据各个第二模态数据的类别标签和第一模态数据的内容信息之间的相似度,搜索与第一模态数据匹配的第二模态数据。
基于上述对第二模态处理网络的描述可以得知第二模态处理网络的具体结构。假设第二模态数据为图像,第二模态处理网络具体为图像编码器,包括特征提取网络、池化网络、特征整合网络以及分类网络,具体分别为主干网络、全局平均池化层、特征全连接层以及分类全连接层,第二模态数据库具体为图像库,第二模态特征库具体为图像特征向量检索集,对于第二模态处理网络处理第二模态数据的处理流程,结合如图4b所示的跨模态搜索模型中图像编码器的结构进行如下示例性说明。
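The image encoder in the cross-modal search model shown in Figure 4b includes a backbone network (Backbone), a global average pooling layer, a classification fully connected layer, and a feature fully connected layer. Suppose an image sent or received by the session object in a session is $X_I$. The image $X_I$ is input to the image encoder, which outputs the multi-label classification result $C_I=\{c_1,c_2,\dots,c_n\}$ and the feature vector $f_I$ of the image (also called the image feature vector). The processing is as follows. First, the image is fed to the backbone network of the image encoder (for example, a CNN or VIT) to obtain the feature map of the image (the initial feature). The feature map then undergoes global average pooling and is flattened into a one-dimensional vector. The flattened vector is input to the classification fully connected layer (Cls FC), which outputs a one-dimensional vector of length C; a Sigmoid function turns it into per-category scores, from which the category labels $C_I=\{c_1,c_2,\dots,c_n\}$ are obtained. The same flattened vector is also input to the feature fully connected layer (Feature FC), which outputs a vector of length d (say 512); after L2 normalization (L2 Normalization), this vector serves as the image feature vector $f_I$. The feature vector $f_I$ is stored in the image feature vector retrieval set (the second-modality feature library), and the category labels from multi-label classification are stored in the second-modality database. Finally, a corresponding image feature vector index is created for $f_I$ and added to the image feature vector retrieval set $G_I$, which helps locate the target image in the image library quickly.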
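The part of the image encoder that follows the backbone (global average pooling, then a classification branch and a feature branch) can be sketched in plain Python. This is a toy illustration with tiny hand-written weights, not a trained model; the function names, the weight arguments, and the 0.5 score cut-off for assigning labels are assumptions that the text does not specify.

```python
import math
from typing import List, Sequence, Tuple

Matrix = Sequence[Sequence[float]]


def global_average_pool(feature_map: Sequence[Matrix]) -> List[float]:
    """Average each channel (an H x W grid) to one value, giving a flat vector."""
    return [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in feature_map]


def linear(x: Sequence[float], weights: Matrix, bias: Sequence[float]) -> List[float]:
    """Fully connected layer: one output per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def l2_normalize(v: Sequence[float]) -> List[float]:
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def encode_head(feature_map: Sequence[Matrix], cls_w: Matrix, cls_b: Sequence[float],
                feat_w: Matrix, feat_b: Sequence[float], class_names: Sequence[str],
                score_threshold: float = 0.5) -> Tuple[List[str], List[float]]:
    """Return (category labels, L2-normalized feature vector) for one image."""
    pooled = global_average_pool(feature_map)
    scores = [sigmoid(z) for z in linear(pooled, cls_w, cls_b)]        # Cls FC + Sigmoid
    labels = [n for n, s in zip(class_names, scores) if s >= score_threshold]
    feature = l2_normalize(linear(pooled, feat_w, feat_b))             # Feature FC + L2 norm
    return labels, feature
```

Both branches read the same pooled vector, so one pass over the image yields the labels for the database and the feature vector for the feature library.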
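Combining the first-modality processing network shown in Figure 4a with the second-modality processing network shown in Figure 4b, the cross-modal search model processes the first-modality data as follows. For ease of description, the first-modality data is taken to be text and the first-modality processing network a text encoder; the output text feature vector and the image feature vectors are mapped to the same semantic feature space and have the same dimension. Processing the text includes the following. First, the image library is searched using the content information of the text: images whose labels exactly match the text query are looked up and recalled as image set A. Meanwhile, the text is input to the text encoder, which outputs a vector of length d; L2 normalization of that vector gives the text feature vector $f_T$. Then $f_T$ is used to retrieve similar image feature vectors from the image feature vector retrieval set $G_I$ (whose image feature vectors are produced by the image encoder shown in Figure 4b), and the corresponding image set B is recalled. Retrieval computes the feature similarity between $f_T$ and each image feature vector $f_I$ in the retrieval set, $S=f_T\cdot f_I$, and images whose feature similarity $S$ is higher than the threshold $\theta$ form image set B. Finally, image set A and image set B are merged to obtain the cross-modal search result.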
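The end-to-end flow just described (exact label match for image set A, dot-product similarity above a threshold for image set B, then a merge) can be sketched as follows. The function name and the two dictionaries are hypothetical, and the feature vectors are assumed to be L2-normalized already, as the text states for the encoder outputs.

```python
from typing import Dict, List, Sequence


def cross_modal_search(query_text: str,
                       query_vec: Sequence[float],
                       image_labels: Dict[str, List[str]],
                       image_features: Dict[str, Sequence[float]],
                       theta: float) -> List[str]:
    """Return images matched by label (set A) or by feature similarity (set B)."""
    # Image set A: images that carry a label identical to the query text.
    set_a = [img for img, labels in image_labels.items() if query_text in labels]
    # Image set B: images whose similarity to the text feature exceeds theta.
    set_b = [img for img, feat in image_features.items()
             if sum(q * f for q, f in zip(query_vec, feat)) > theta]
    # Merge, keeping set A first and skipping images already recalled.
    return set_a + [img for img in set_b if img not in set_a]
```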
在一个实施例中,结合上述对跨模态搜索模型的结构以及功能的描述,跨模态搜索模型包括第一模态处理网络和第二模态处理网络,具体的训练过程可以如下:1)获取跨模态训练数据集,跨模态训练数据集包括多组跨模态样本数据,每组跨模态样本数据包括第二模态样本数据、第一模态样本数据、以及该第二模态样本数据与该第一模态样本数据之间的匹配结果;2)通过跨模态搜索模型中的第一模态处理网络,对跨模态样本数据中的第一模态样本数据进行特征提取处理,得到第一模态样本数据的语义特征;以及,通过跨模态搜索模型中的第二模态处理网络,对跨模态样本数据中的第二模态样本数据进行特征提取处理,得到第二模态样本数据的语义特征;3)根据第一模态样本数据的语义特征与第二模态样本数据的语义特征之间的跨模态对比损失,对跨模态搜索模型进行迭代训练,得到训练好的跨模态搜索模型。
在训练数据准备阶段,可以从相应场景产生的业务数据中获取跨模态训练数据集,跨模态训练数据集是两种不同模态样本数据的集合,对于跨模态搜索模型的训练,可以是以每组跨模态样本数据为单位输入跨模态搜索模型中进行处理的。举例来说,第一模态样本数据和第二模态样本数据分别是文本和图像,那么每组跨模态样本数据可以是图像-文本对,即图像和图像对应的文本描述可以构成图像-文本对,海量的图像-文本对可以组成跨模态训练数据集。
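The training of the cross-modal search model trains the first-modality processing network and the second-modality processing network jointly. K groups of cross-modal sample data can be input at once. For the i-th group, the first-modality processing network processes the first-modality sample data to obtain its semantic feature, and the second-modality processing network processes the second-modality sample data to obtain its semantic feature. The cross-modal contrastive loss is then computed from the semantic features of the two modalities, and the cross-modal search model is trained iteratively on this loss, with the model parameters updated until convergence, which yields the trained model.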
在第二模态处理网络包括分类处理网络时,跨模态训练数据集中还可以包括第二模态样本数据对应的类别标签,具体地,训练过程还可以包括以下内容:通过跨模态搜索模型中的第二模态处理网络对跨模态样本数据中的第二模态样本数据进行分类预测处理,得到第二模态样本数据的类别预测信息;根据类别预测信息和类别标签确定第二模态样本数据的分类损失;根据分类损失和跨模态对比损失对跨模态搜索模型进行迭代训练,得到训练好的跨模态搜索模型。其中,类别预测信息可以包括第二模态样本数据属于各个类别的预测概率,分类损失可以使用交叉熵损失,后续可以联合分类损失和跨模态对比损失作为总损失,例如将分类损失和跨模态对比损失进行加权求和得到总损失,再使用优化器(例如随机梯度下降(Stochastic Gradient Descent,SGD)优化器)对跨模态搜索模型的模型参数进行更新,不断重复上述训练过程,直至模型参数收敛,得到训练好的跨模态搜索模型。如此,跨模态搜索模型不仅可以应用于第一模态数据和第二模态数据的语义特征提取处理,基于跨模态特征,检测第一模态数据与第二模态数据之间的匹配度;跨模态搜索模型还具备多标签分类功能,为第二模态数据生成类别标签并存储到第二模态数据库中。
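The weighted combination of the classification loss and the cross-modal contrastive loss can be sketched as follows. The text says only that a cross entropy loss is used; the per-class binary form below is an assumption that fits Sigmoid multi-label outputs, and the function names and default weights are hypothetical.

```python
import math
from typing import Sequence


def binary_cross_entropy(probs: Sequence[float], targets: Sequence[int],
                         eps: float = 1e-12) -> float:
    """Mean binary cross entropy over the classes of one sample."""
    total = 0.0
    for p, t in zip(probs, targets):
        # Clamp probabilities to avoid log(0).
        total += t * math.log(max(p, eps)) + (1 - t) * math.log(max(1.0 - p, eps))
    return -total / len(probs)


def total_loss(cls_loss: float, contrastive_loss: float,
               w_cls: float = 1.0, w_con: float = 1.0) -> float:
    """Weighted sum of the classification loss and the contrastive loss."""
    return w_cls * cls_loss + w_con * contrastive_loss
```

The resulting scalar is what an optimizer such as SGD would minimize in each training iteration.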
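To better understand the training stage, the following example takes the first-modality processing network to be a text encoder and the second-modality processing network to be an image encoder. See Figure 5, a schematic diagram of cross-modal search model training provided by an embodiment of this application. Suppose the cross-modal training dataset includes K image-text pairs. During training, the K image-text pairs are input together, and the image encoder and the text encoder produce the image feature vectors $\{f_I^i\}_{i=1}^{K}$ and the text feature vectors $\{f_T^i\}_{i=1}^{K}$, respectively. The image encoder also outputs the class prediction probabilities $P_I$, the predicted probabilities of the image over the C categories. The InfoNCE loss can then be used to compute the cross-modal contrastive loss between image-text pairs. The source gives the expression only as a figure (PCTCN2022134918-appb-000003); a standard symmetric InfoNCE form consistent with the definitions here, with the temperature $\tau$ as an assumed hyperparameter, is:

$$L_{infoNCE} = -\frac{1}{2K}\sum_{i=1}^{K}\left[\log\frac{\exp\left(f_I^i\cdot f_T^i/\tau\right)}{\sum_{j=1}^{K}\exp\left(f_I^i\cdot f_T^j/\tau\right)}+\log\frac{\exp\left(f_I^i\cdot f_T^i/\tau\right)}{\sum_{j=1}^{K}\exp\left(f_I^j\cdot f_T^i/\tau\right)}\right]$$

where $f_I^i$ denotes the i-th image feature vector and $f_T^i$ denotes the i-th text feature vector. The main idea of the cross-modal contrastive loss is to maximize similarity and minimize difference. Image-text pairs are divided into positive pairs and negative pairs: a positive pair is an image-text pair whose image and text description match, and a negative pair is one whose image and text description do not match. In the cross-modal contrastive loss, $f_I^i\cdot f_T^i$ represents the similarity of a positive pair, while $f_I^i\cdot f_T^j$ and $f_I^j\cdot f_T^i$ (with $j\neq i$) represent the similarity of negative pairs, so the smaller the cross-modal contrastive loss, the better the first-modality sample data and the second-modality sample data match.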
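The contrastive loss over a batch of K image-text pairs can be sketched in plain Python. This is a standard symmetric InfoNCE computation, given as an assumption about the general form and not as the exact expression in the source figure; the function name `info_nce` and the `temperature` parameter are hypothetical. The i-th image and the i-th text are treated as the positive pair and all other combinations as negatives.

```python
import math
from typing import List, Sequence


def info_nce(image_feats: Sequence[Sequence[float]],
             text_feats: Sequence[Sequence[float]],
             temperature: float = 1.0) -> float:
    """Symmetric InfoNCE loss; row i of each input forms the positive pair."""
    k = len(image_feats)
    # sim[i][j] is the scaled dot product of image i and text j.
    sim = [[sum(a * b for a, b in zip(fi, ft)) / temperature for ft in text_feats]
           for fi in image_feats]

    def diag_log_softmax(rows: List[List[float]]) -> float:
        total = 0.0
        for i, row in enumerate(rows):
            m = max(row)  # subtract the max for numerical stability
            log_denominator = m + math.log(sum(math.exp(s - m) for s in row))
            total += row[i] - log_denominator
        return total

    image_to_text = diag_log_softmax(sim)
    text_to_image = diag_log_softmax([list(col) for col in zip(*sim)])
    return -(image_to_text + text_to_image) / (2 * k)
```

A batch whose pairs are correctly aligned gives a lower loss than the same batch with the texts shuffled, which is the behavior training relies on.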
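The classification loss $L_{cls}$ of the images can be computed with the cross entropy loss (Cross Entropy Loss, CEL). $L_{cls}$ and $L_{infoNCE}$ are combined into the total loss, and the SGD optimizer updates the model parameters until convergence.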
基于上述实施例的介绍,以第一模态数据为文本,第二模态数据为图像为例,对跨模态搜索方案中采用的算法流程进行说明,具体可以参见如图6所示的算法流程图。图像被输入到图像编码器中进行多标签分类可以得到类别标签,当搜索文本(记为query)完全等同于类别标签时,可以匹配到相关图像。此外,图像编码器还可以输出图像特征向量,并且将图像特征向量和新增索引添加至图像特征向量检索集。对于搜索文本query,可以输入到文本编码器中,输出文本特征向量,然后基于文本特征向量从图像特征向量检索集中检索出相似的图像特征向量,并基于该相似的图像特征向量召回对应的图像集。这种基于跨模态特征搜索的方案,由于可以不依赖分类模型的固定类目标签体系,而通过不同模态数据的特征进行匹配,从而能够支持更加多样、更复杂的文本描述,从而能够提升搜索词输入的自由度,更快更准确且更全面地找到目标图片。
请参见图7,是本申请实施例提供的一种跨模态搜索方法的流程示意图三,该跨模态搜索方法可以由计算机设备(例如图1所示的跨模态搜索设备102,该跨模态搜索设备102具体可以是终端)执行。其中,该跨模态搜索方法包括但不限于以下步骤。
S701,显示社交会话的会话界面。
此处的社交会话可以是个人与个人之间的会话,或者是群组会话。在社交会话的会话界面中,会话对象可以发送或者接收会话消息,例如图像、文本、语音等等。当在会话界面中接收到的会话消息包括第二模态数据时,可以调用跨模态搜索模型中的第二模态处理网络来处理第二模态数据,输出多标签分类的类别标签和第二模态数据的语义特征,进而,将类别标签存入第二模态数据库中,将第二模态数据的语义特征(例如图像特征向量)存入第二模态特征库中。
社交会话的会话界面可以提供历史会话记录的查看功能。具体可以是从会话界面进入会话详情界面,在该会话详情界面中包括历史会话记录的查看入口,会话对象可以通过该查看入口发起查看操作,查看具体的历史会话记录并进行搜索,具体可以参见下述步骤S702~S703。
S702,响应于对社交会话的历史会话记录的查看,显示会话记录详情界面。
该会话记录详情界面中包括社交会话的历史会话记录中的第二模态数据。社交会话的历史会话记录中可以包括不同模态的数据,例如图像、视频、文本、音频等等,会话对象可以选择不同模态的数据进行查看,此处对历史会话记录的查看主要是对第二模态数据的查看,因此,会话记录详情界面中展示的是历史会话记录中产生的第二模态数据。
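Note that if there are few pieces of second-modality data, they can all be displayed in the session record details interface; if there are many, the current session record details interface displays only part of them. For example, the second-modality data is images and the session record details interface is a chat photo wall that can show 12 images of equal size. If the historical session records contain 10 images in total, all of them can be shown in the session record details interface; if there are more than 12, at most 12 are shown, and viewing the others requires an operation such as scrolling down. The session record details interface can then support searching second-modality data with first-modality data and output the second-modality data that matches the first-modality data, that is, the cross-modal search result.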
请参见图8a,是本申请实施例提供的一种对历史会话记录的查看的操作示意图。如图8a中的(1)的会话界面810中提供了查找历史会话记录的入口,即“查找聊天内容”,当触发该入口,可以进入如图8a中的(2)所示的历史会话记录搜索界面811,在此历史会话记录搜索界面中可以选择相应的搜索类型,并对该搜索类型的历史会话记录进行全量展示,例如当选择了图片及视频,则在会话记录详情界面812中展示聊天照片墙,且聊天照片墙是按照日期进行展示的所有图片以及视频,具体如图8a中的(3),并且该会话记录详情界面812提供有搜索框8120以便于搜索图片或视频。
在一个实施例中,社交会话的历史会话记录中的第二模态数据存储在第二模态数据库中,且第二模态数据库中存储有第二模态数据的属性信息。通过将历史会话记录中的第二模态数据划分至第二模态数据库中存储,当发起对第二模态数据的搜索时,可以直接从该第二模态数据库中查找,而不是从全局的历史会话记录中查找,有利于提升搜索第二模态数据的效率,同时第二模态数据库中存储有第二模态数据的属性信息,属性信息的不同,可以进一步扩展搜索维度。
属性信息包括以下至少一种:类别标签、与第二模态数据关联的第一模态描述信息、从第二模态数据中识别到的第一模态描述信息。其中,类别标签可以是人工或者机器(例如分类模型)为第二模态数据进行分类产生的标注信息,第一模态描述信息是有关第二模态数据的描述信息,具体可以是从第二模态数据中识别到的,也可以是历史会话记录中产生的与之相关联的。示例性地,第二模态数据为图像,当历史会话记录中的图像是包含文本的图像时,可以通过对图像进行识别得到该图像中的文本,并将其作为第一模态描述信息;若社交会话中会话对象发送图像之后紧接着发送对该图像的文本描述信息,例如:你看A公园的变化真大,那么可以根据该文本描述信息生成有关该图像的描述,例如提取出关键词“A公园”作为图像的第一模态描述信息。
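The three kinds of attribute information listed above can be pictured as fields of one record per piece of second-modality data. The dataclass below is a hypothetical layout for illustration; the field names are not taken from the source.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class SecondModalRecord:
    """One stored item together with the attribute information used for content search."""
    item_id: str
    category_labels: List[str] = field(default_factory=list)          # manual or model-assigned labels
    associated_descriptions: List[str] = field(default_factory=list)  # e.g. keywords from a follow-up message
    recognized_descriptions: List[str] = field(default_factory=list)  # e.g. text recognized inside an image

    def searchable_texts(self) -> List[str]:
        """All attribute strings the content information of a query can be matched against."""
        return self.category_labels + self.associated_descriptions + self.recognized_descriptions
```

For the example in the text, an image followed by the message about park A would carry “A公园” in `associated_descriptions`.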
S703,响应于在会话记录详情界面中输入的第一模态数据,输出第一模态数据对应的跨模态搜索结果。
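The cross-modal search result is obtained with the cross-modal search method described in the foregoing embodiments. The output cross-modal search result includes all second-modality data that matches the first-modality data input in the session record details interface.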
在一个实施例中,第一模态数据为文本,第二模态数据为图像,会话记录详情界面中包括搜索框,第一模态数据是在搜索框中进行输入得到的;或者,会话记录详情界面中还包括至少一个推荐文本,第一模态数据是通过在至少一个推荐文本中选择得到的。也就是说,在会话记录详情界面中输入的第一模态数据可以是通过输入设备(例如物理/虚拟键盘、智能语音设备)等手动输入至搜索框中的,也可以是从会话记录详情界面中提供的推荐文本中选择的。可选地,被选择的推荐文本可以自动填充至搜索框中并且自动启动搜索功能。其中,会话记录详情界面中的推荐文本可以是随机生成的,也可以是根据第二模态数据的属性信息或者是第二模态数据的语义特征生成的。由上述跨模态搜索方法的技术支持,搜索框中输入的文本可以是符合直觉表达的图像描述。简单来说,以以文搜图为例,当会话对象在搜索框中搜索时,可以在第二模态数据库中查询和搜索文本匹配的类别标签的图像并召回,同时可以通过跨模态搜索模型中的文本编码器处理搜索文本,输出对应的文本特征向量,从图像特征向量检索集中检索出相似的图像特征向量,并召回对应的图像集合,最终合并所有召回的图像并展示给会话对象。
示例性地,请参见图8b,是本申请实施例提供的跨模态搜索的操作示意图,如图8b中的(1)所示,会话记录详情界面提供有搜索框8220,并且该搜索框8220中提示搜索支持输入图像描述或者图中文字,图像描述是对图像所包含内容的语义解释,图中文字属于图像的内容信息。此外,自动生成的推荐文本也展示在该会话记录详情界面,如图8b中的“票”、“截图”等,通过推荐文本可以提供更多的参考以及便捷的操作。当在搜索框8220中输入查询文本并触发搜索功能时,可以输出搜索结果界面,并在搜索结果界面展示与该查询文本匹配的图片,如图8b中的(2)所示,在搜索结果界面823中展示的是与输入的查询文本“食物”相匹配的3张图片,属于跨模态搜索结果。
在一个实施例中,还可以:响应于对第一搜索规则的选择,输出跨模态搜索结果中与第一模态数据的内容信息相匹配的第二模态数据;或者,响应于对第二搜索规则的选择,输出跨模态搜索结果中与第一模态数据的语义信息相匹配的第二模态数据。
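The first search rule and the second search rule search along different dimensions; the search can follow either dimension, and the full cross-modal search result can be displayed divided by search dimension. Searching under the first search rule obtains and outputs the second-modality data that matches the content information of the first-modality data, and searching under the second search rule obtains and outputs the second-modality data that matches the semantic information of the first-modality data. In other words, a single search dimension can be specified. For example, when the first-modality data is text and the second-modality data is images, the search can be by image or by text. Searching by image means searching by image description, that is, by matching the semantic information of the image; searching by text means searching by the text in the image, that is, by matching the content information of the image.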
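Displaying the result by search rule can be sketched as a filter over a result set that remembers which dimension each item matched. The function name `filter_by_rule`, the rule strings "content" and "semantic", and the input structure are hypothetical.

```python
from typing import Dict, List, Optional, Set


def filter_by_rule(results: Dict[str, Set[str]], rule: Optional[str]) -> List[str]:
    """Return the items to display for a search rule.

    results maps an item id to the dimensions it matched ("content", "semantic").
    rule None shows the full cross-modal search result.
    """
    if rule is None:
        return list(results)
    if rule not in ("content", "semantic"):
        raise ValueError(f"unknown search rule: {rule}")
    return [item for item, dims in results.items() if rule in dims]
```

Here "content" plays the role of the first search rule (search by text in the image) and "semantic" the second (search by image description).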
示例性地,请参见图8c,是本申请实施例提供的一种输出跨模态搜索结果的效果示意图。如图8c所示,是基于图8b中(2)提供的跨模态搜索结果展示的与查询文本匹配的全部图片,按照不同搜索维度搜索后得到的。当选择按图像搜索或者是按文字搜索,会呈现不同的跨模态搜索结果,分别如图8c中的(1)和图8c中的(2)所示,在该搜索结果界面可以展示文本的语义信息与图像的语义信息匹配、或者是文本的内容信息与图像的属性信息(例如图像的类别标签)匹配的图片。本方案可以应用于多种场景,除了本实施例介绍的基于社交会话的历史会话记录的跨模态搜索,也可以应用于其他支持多媒体数据搜索场景,例如短视频搜索场景,对此不做限制。
本申请实施例提供的跨模态搜索方案,可以支持社交会话的历史会话记录中的跨模态搜索场景,具体可以应用于图文跨模态搜索的场景中,即通过在搜索框中输入搜索词来搜索目标图片,由于跨模态搜索是从搜索词的多个维度来搜索的,搜索词不必与图片的类别标签完全匹配就能够找到相应的图片,因此,通过输入更加符合直觉表达、更多样以及更加复杂的图像描述来查找目标图片,不仅可以提升输入的自由度,还可以大大提升搜索到目标图片的概率,提高跨模态搜索结果的多样性;此外,通过提供推荐文本(例如推荐的搜索词)也能够在一定程度上提升搜索效率。
请参见图9,图9是本申请实施例提供的一种跨模态搜索装置的结构示意图。上述跨模态搜索装置可以是运行于计算机设备中的一个计算机程序(包括程序代码),例如该跨模态搜索装置为一个应用软件;该跨模态搜索装置可以用于执行本申请实施例提供的方法中的相应步骤。如图9所示,该跨模态搜索装置900可以包括:获取模块901、搜索模块902、合并模块903。
获取模块901,用于获取第一模态数据;
搜索模块902,用于基于第一模态数据的内容信息在第二模态数据库中进行搜索,得到第一集合,第一集合中包括与第一模态数据的内容信息相匹配的至少一个第二模态数据;
搜索模块902,还用于基于第一模态数据的语义信息在第二模态数据库中进行搜索,得到第二集合,第二集合中包括与第一模态数据的语义信息相匹配的至少一个第二模态数据;
合并模块903,用于对第一集合和第二集合进行合并,得到第一模态数据对应的跨模态搜索结果。
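In one embodiment, the second-modality database stores N pieces of second-modality data and the attribute information of each of the N pieces of second-modality data, where N is a positive integer. The search module 902 is specifically configured to: for each of the N pieces of second-modality data, determine the matching degree between the content information of the first-modality data and the attribute information of that second-modality data as the matching degree corresponding to that second-modality data; and add the second-modality data whose corresponding matching degree satisfies the matching condition to the first set.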
在一个实施例中,属性信息包括第一模态描述信息;N个第二模态数据中的任一个表示为第i个第二模态数据,i为正整数,且i小于或等于N;搜索模块902,具体用于:确定第一模态数据的内容信息与第i个第二模态数据的第一模态描述信息之间的语义相似度,作为所述第i个第二模态数据对应的匹配度;若第一模态数据的内容信息与第i个第二模态数据的第一模态描述信息之间的语义相似度大于第一相似阈值,则将第i个第二模态数据添加至第一集合。
在一个实施例中,属性信息包括类别标签;N个第二模态数据中的任一个表示为第i个第二模态数据,i为正整数,且i小于或等于N;搜索模块902,具体用于:确定第一模态数据的内容信息与第i个第二模态数据的类别标签之间的相似度,作为所述第i个第二模态数据对应的匹配度;若第一模态数据的内容信息与第i个第二模态数据的类别标签之间的相似度大于第二相似阈值,则将第i个第二模态数据添加至第一集合。
在一个实施例中,第二模态数据库存储有N个第二模态数据;第二模态数据库关联有第二模态特征库,第二模态特征库中存储有N个第二模态数据各自的语义特征;搜索模块902,具体还用于:获取第一模态数据的语义特征;基于第一模态数据的语义特征,在第二模态特征库中查找与第一模态数据的语义特征相匹配的目标语义特征;根据目标语义特征,在第二模态数据库中确定与第一模态数据的语义信息相匹配的第二模态数据;将与第一模态数据的语义信息相匹配的第二模态数据添加至第二集合。
在一个实施例中,第二模态特征库和第二模态数据库通过特征索引关联;搜索模块902,具体用于:确定目标语义特征的特征索引;基于目标语义特征的特征索引,在第二模态数据库中确定与目标语义特征的特征索引对应的第二模态数据。
在一个实施例中,第二模态特征库中存储的N个第二模态数据各自的语义特征,是通过跨模态搜索模型中的第二模态处理网络对N个第二模态数据分别进行特征提取处理得到的;跨模态搜索模型还包括第一模态处理网络;搜索模块902,具体用于:通过跨模态搜索模型中的第一模态处理网络,对第一模态数据进行特征提取处理,得到第一模态数据的语义特征。
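In one embodiment, the second-modality processing network includes a feature extraction network, a pooling network, and a feature integration network. Any one of the N pieces of second-modality data is denoted as the i-th second-modality data, where i is a positive integer less than or equal to N. The search module 902 is specifically configured to: extract the initial feature of the i-th second-modality data through the feature extraction network of the second-modality processing network; pool the initial feature through the pooling network of the second-modality processing network to obtain the pooled feature of the i-th second-modality data; and integrate the pooled feature through the feature integration network to obtain the semantic feature of the i-th second-modality data.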
在一个实施例中,第二模态处理网络还包括分类网络;搜索模块902,具体还用于:通过分类网络,基于池化特征进行分类预测处理,得到第i个第二模态数据的类别标签;以及,将第i个第二模态数据的类别标签添加至第二模态数据库中。
在一个实施例中,该跨模态搜索装置还包括训练模块904,用于:获取跨模态训练数据集,跨模态训练数据集包括多组跨模态样本数据,每组跨模态样本数据包括第二模态样本数据、第一模态样本数据、以及第二模态样本数据与第一模态样本数据之间的匹配结果;通过跨模态搜索模型中的第一模态处理网络,对跨模态样本数据中的第一模态样本数据进行特征提取处理,得到第一模态样本数据的语义特征;以及,通过跨模态搜索模型中的第二模态处理网络,对跨模态样本数据中的第二模态样本数据进行特征提取处理,得到第二模态样本数据的语义特征;根据第一模态样本数据的语义特征与第二模态样本数据的语义特征之间的跨模态对比损失,对跨模态搜索模型进行迭代训练,得到训练好的跨模态搜索模型。
可以理解的是,本申请实施例所描述的跨模态搜索装置的各功能模块的功能可根据上述方法实施例中的方法具体实现,其具体实现过程可以参照上述方法实施例的相关描述,此处不再赘述。另外,对采用相同方法的有益效果描述,也不再进行赘述。
请参见图10,图10是本申请实施例提供的另一种跨模态搜索装置的结构示意图。上述跨模态搜索装置可以是运行于计算机设备中的一个计算机程序(包括程序代码),例如该跨模态搜索装置为一个应用软件;该跨模态搜索装置可以用于执行本申请实施例提供的方法中的相应步骤。如图10所示,该跨模态搜索装置1000可以包括:显示模块1001和输出模块1002。
显示模块1001,用于显示社交会话的会话界面;
显示模块1001,还用于响应于对社交会话的历史会话记录的查看,显示会话记录详情界面,会话记录详情界面中包括社交会话的历史会话记录中的第二模态数据;
输出模块1002,用于响应于在会话记录详情界面中输入的第一模态数据,输出第一模态数据对应的跨模态搜索结果;跨模态搜索结果是采用本申请实施例描述的跨模态搜索方法得到的。
在一个实施例中,社交会话的历史会话记录中的第二模态数据存储在第二模态数据库中,且第二模态数据库存储有第二模态数据的属性信息,属性信息包括以下至少一种:类别标签、与第二模态数据关联的第一模态描述信息、从第二模态数据中识别到的第一模态描述信息。
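In one embodiment, the first-modality data is text and the second-modality data is an image. The session record details interface includes a search box, and the first-modality data is obtained by input in the search box; or the session record details interface further includes at least one recommended text, and the first-modality data is obtained by selecting from the at least one recommended text.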
在一个实施例中,输出模块1002,具体用于:响应于对第一搜索规则的选择,输出跨模态搜索结果中与第一模态数据的内容信息相匹配的第二模态数据;或者,响应于对第二搜索规则的选择,输出跨模态搜索结果中与第一模态数据的语义信息相匹配的第二模态数据。
可以理解的是,本申请实施例所描述的跨模态搜索装置的各功能模块的功能可根据上述方法实施例中的方法具体实现,其具体实现过程可以参照上述方法实施例的相关描述,此处不再赘述。另外,对采用相同方法的有益效果描述,也不再进行赘述。
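Note that the cross-modal search apparatus of Figure 9 and the cross-modal search apparatus of Figure 10 may be deployed in the same computer device or in different computer devices. When they are deployed in the same computer device, the computer device can automatically search the database for second-modality data that matches the input first-modality data, obtain the cross-modal search result, and output it. When they are deployed in different computer devices, suppose the apparatus of Figure 9 is deployed in computer device A and the apparatus of Figure 10 in computer device B. Computer device B receives the input first-modality data and sends it to computer device A. Computer device A searches the second-modality database for second-modality data that matches the first-modality data, obtains the cross-modal search result, and sends it to computer device B, which displays the cross-modal search result.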
请参见图11,图11是本申请实施例提供的一种计算机设备的结构示意图。该计算机设备1100可以包含独立设备(例如服务器、节点、终端等等中的一个或者多个),也可以包含独立设备内部的部件(例如芯片、软件模块或者硬件模块等)。该计算机设备1100可以包括至少一个处理器1101和通信接口1102,进一步可选地,计算机设备1100还可以包括至少一个存储器1103和总线1104。其中,处理器1101、通信接口1102和存储器1103通过总线1104相连。
The processor 1101 is a module that performs arithmetic and/or logic operations. It may be one or a combination of processing modules such as a central processing unit (CPU), a graphics processing unit (GPU), a microprocessor unit (MPU), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), a complex programmable logic device (CPLD), a coprocessor (which assists the central processing unit with corresponding processing and applications), and a microcontroller unit (MCU).
通信接口1102可以用于为至少一个处理器提供信息输入或者输出。和/或,通信接口1102可以用于接收外部发送的数据和/或向外部发送数据,可以为包括诸如以太网电缆等的有线链路接口,也可以是无线链路(Wi-Fi、蓝牙、通用无线传输、车载短距通信技术以及其他短距无线通信技术等)接口。
存储器1103用于提供存储空间,存储空间中可以存储操作系统和计算机程序等数据。存储器1103可以是随机存储记忆体(random access memory,RAM)、只读存储器(read-only memory,ROM)、可擦除可编程只读存储器(erasable programmable read only memory,EPROM)、或便携式只读存储器(compact disc read-only memory,CD-ROM)等等中的一种或者多种的组合。
该计算机设备1100中的至少一个处理器1101用于调用至少一个存储器1103中存储的计算机程序,用于执行前述的跨模态搜索方法,例如前述图2、图3以及图7所示实施例所描述的跨模态搜索方法。
应当理解,本申请实施例中所描述的计算机设备1100可执行前文所对应实施例中对该跨模态搜索方法的描述,也可执行前文图9所对应实施例中对该跨模态搜索装置900或者图10所对应实施例中对该跨模态搜索装置1000的描述,在此不再赘述。另外,对采用相同方法的有益效果描述,也不再进行赘述。
此外,还应指出,本申请一个示例性实施例还提供了一种存储介质,该存储介质中存储了前述跨模态搜索方法的计算机程序,该计算机程序包括程序指令,当一个或多个处理器加载并执行该程序指令,可以实现实施例中对跨模态搜索方法的描述,这里不再赘述,对采用相同方法的有益效果描述,也在此不再赘述。可以理解的是,程序指令可以被部署在一个或能够互相通信的多个计算机设备上执行。
上述计算机可读存储介质可以是前述任一实施例提供的跨模态搜索装置或者上述计算机设备的内部存储单元,例如计算机设备的硬盘或内存。该计算机可读存储介质也可以是该计算机设备的外部存储设备,例如该计算机设备上配备的插接式硬盘,智能存储卡(smart media card,SMC),安全数字(secure digital,SD)卡,闪存卡(flash card)等。进一步地,该计算机可读存储介质还可以既包括该计算机设备的内部存储单元也包括外部存储设备。该计算机可读存储介质用于存储该计算机程序以及该计算机设备所需的其他程序和数据。该计算机可读存储介质还可以用于暂时地存储已经输出或者将要输出的数据。
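One aspect of this application provides a computer program product or computer program that includes computer instructions stored in a computer-readable storage medium. A processor of a computer device reads the computer instructions from the computer-readable storage medium and executes them, causing the computer device to perform the method provided in one aspect of the embodiments of this application.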
本申请的一个方面,提供了另一种计算机程序产品,该计算机程序产品包括计算机程序或计算机指令,该计算机程序或计算机指令被处理器执行时实现本申请实施例提供的跨模态搜索方法的步骤。
以上所揭露的仅为本申请较佳实施例而已,当然不能以此来限定本申请之权利范围,因此依本申请权利要求所作的等同变化,仍属本申请所涵盖的范围。

Claims (19)

  1. 一种跨模态搜索方法,由计算机设备执行,包括:
    获取第一模态数据;
    基于所述第一模态数据的内容信息在第二模态数据库中进行搜索,得到第一集合,所述第一集合中包括与所述第一模态数据的内容信息相匹配的至少一个第二模态数据;
    基于所述第一模态数据的语义信息在所述第二模态数据库中进行搜索,得到第二集合,所述第二集合中包括与所述第一模态数据的语义信息相匹配的至少一个第二模态数据;
    对所述第一集合和所述第二集合进行合并,得到所述第一模态数据对应的跨模态搜索结果。
  2. 如权利要求1所述的方法,所述第二模态数据库存储有N个第二模态数据、以及所述N个第二模态数据各自的属性信息,N为正整数;所述基于所述第一模态数据的内容信息在第二模态数据库中进行搜索,得到第一集合,包括:
    针对所述N个第二模态数据中的每个第二模态数据,确定所述第一模态数据的内容信息与所述第二模态数据的属性信息之间的匹配度,作为所述第二模态数据对应的匹配度;
    将所对应的匹配度满足匹配条件的第二模态数据添加至所述第一集合中。
  3. 如权利要求2所述的方法,所述属性信息包括第一模态描述信息;所述N个第二模态数据中的任一个表示为第i个第二模态数据,i为正整数,且i小于或等于N;
    所述针对所述N个第二模态数据中的每个第二模态数据,确定所述第一模态数据的内容信息与所述第二模态数据的属性信息之间的匹配度,作为所述第二模态数据对应的匹配度,包括:
    确定所述第一模态数据的内容信息与所述第i个第二模态数据的第一模态描述信息之间的语义相似度,作为所述第i个第二模态数据对应的匹配度;
    所述将所对应的匹配度满足匹配条件的第二模态数据添加至所述第一集合中,包括:
    若所述第一模态数据的内容信息与所述第i个第二模态数据的第一模态描述信息之间的语义相似度大于第一相似阈值,则将所述第i个第二模态数据添加至所述第一集合。
  4. 如权利要求2所述的方法,所述属性信息包括类别标签;所述N个第二模态数据中的任一个表示为第i个第二模态数据,i为正整数,且i小于或等于N;
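    the determining, for each of the N pieces of second-modality data, of the matching degree between the content information of the first-modality data and the attribute information of the second-modality data as the matching degree corresponding to the second-modality data comprises: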
    确定所述第一模态数据的内容信息与所述第i个第二模态数据的类别标签之间的相似度,作为所述第i个第二模态数据对应的匹配度;
    所述将所对应的匹配度满足匹配条件的第二模态数据添加至所述第一集合中,包括:
    若所述第一模态数据的内容信息与所述第i个第二模态数据的类别标签之间的相似度大于第二相似阈值,则将所述第i个第二模态数据添加至所述第一集合。
  5. 如权利要求1所述的方法,所述第二模态数据库存储有N个第二模态数据;所述第二模态数据库关联有第二模态特征库,所述第二模态特征库中存储有所述N个第二模态数据各自的语义特征;所述基于所述第一模态数据的语义信息在所述第二模态数据库中进行搜索,得到第二集合,包括:
    获取所述第一模态数据的语义特征;
    基于所述第一模态数据的语义特征,在所述第二模态特征库中查找与所述第一模态数据的语义特征相匹配的目标语义特征;
    根据所述目标语义特征,在所述第二模态数据库中确定与所述第一模态数据的语义信息相匹配的第二模态数据;
    将所述与所述第一模态数据的语义信息相匹配的第二模态数据添加至所述第二集合。
  6. 如权利要求5所述的方法,所述第二模态特征库和所述第二模态数据库通过特征索引关联;所述根据所述目标语义特征,在所述第二模态数据库中确定与所述第一模态数据的语义信息相匹配的第二模态数据,包括:
    确定所述目标语义特征的特征索引;
    基于所述目标语义特征的特征索引,在所述第二模态数据库中确定与所述目标语义特征的特征索引对应的第二模态数据。
  7. 如权利要求5所述的方法,所述第二模态特征库中存储的所述N个第二模态数据各自的语义特征,是通过跨模态搜索模型中的第二模态处理网络对所述N个第二模态数据分别进行特征提取处理得到的;所述跨模态搜索模型还包括第一模态处理网络;所述获取所述第一模态数据的语义特征,包括:
    通过所述跨模态搜索模型中的所述第一模态处理网络,对所述第一模态数据进行特征提取处理,得到所述第一模态数据的语义特征。
  8. 如权利要求7所述的方法,所述第二模态处理网络包括特征提取网络、池化处理网络和特征整合网络;所述N个第二模态数据中的任一个表示为第i个第二模态数据,i为正整数,且i小于或等于N;
    通过所述跨模态搜索模型中的第二模态处理网络对所述N个第二模态数据分别进行特征提取处理,得到所述N个第二模态数据各自的语义特征,包括:
    通过所述第二模态处理网络中的所述特征提取网络,提取所述第i个第二模态数据的初始特征;
    通过所述第二模态处理网络中的所述池化处理网络,对所述初始特征进行池化处理,得到所述第i个第二模态数据的池化特征;
    通过所述特征整合网络,对所述池化特征进行整合处理,得到所述第i个第二模态数据的语义特征。
  9. 如权利要求8所述的方法,所述第二模态处理网络还包括分类网络;所述方法还包括:
    通过所述分类网络,基于所述池化特征进行分类预测处理,得到所述第i个第二模态数据的类别标签;以及,
    将所述第i个第二模态数据的类别标签添加至所述第二模态数据库中。
  10. 如权利要求7所述的方法,所述方法还包括:
    获取跨模态训练数据集,所述跨模态训练数据集包括多组跨模态样本数据,每组所述跨模态样本数据包括第二模态样本数据、第一模态样本数据、以及所述第二模态样本数据与所述第一模态样本数据之间的匹配结果;
    通过所述跨模态搜索模型中的第一模态处理网络,对所述跨模态样本数据中的第一模态样本数据进行特征提取处理,得到所述第一模态样本数据的语义特征;以及,通过所述跨模态搜索模型中的第二模态处理网络,对所述跨模态样本数据中的第二模态样本数据进行特征提取处理,得到所述第二模态样本数据的语义特征;
    根据所述第一模态样本数据的语义特征与所述第二模态样本数据的语义特征之间的跨模态对比损失,对所述跨模态搜索模型进行迭代训练,得到训练好的跨模态搜索模型。
  11. 一种跨模态搜索方法,由计算机设备执行,包括:
    显示社交会话的会话界面;
    响应于对所述社交会话的历史会话记录的查看,显示会话记录详情界面,所述会话记录详情界面中包括所述社交会话的历史会话记录中的第二模态数据;
    响应于在所述会话记录详情界面中输入的第一模态数据,输出所述第一模态数据对应的跨模态搜索结果;所述跨模态搜索结果是采用权利要求1-10任一项所述的跨模态搜索方法得到的。
  12. 如权利要求11所述的方法,所述社交会话的历史会话记录中的第二模态数据存储在第二模态数据库中,且所述第二模态数据库存储有所述第二模态数据的属性信息,所述属性信息包括以下至少一种:类别标签、与所述第二模态数据关联的第一模态描述信息、从所述第二模态数据中识别到的第一模态描述信息。
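  13. The method according to claim 11, wherein the first-modality data is text and the second-modality data is an image; the session record details interface includes a search box, and the first-modality data is obtained by input in the search box; or,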
    所述会话记录详情界面中还包括至少一个推荐文本,所述第一模态数据是通过在所述至少一个推荐文本中选择得到的。
  14. 如权利要求11所述的方法,所述方法还包括:
    响应于对第一搜索规则的选择,输出所述跨模态搜索结果中与所述第一模态数据的内容信息相匹配的第二模态数据;或者,
    响应于对第二搜索规则的选择,输出所述跨模态搜索结果中与所述第一模态数据的语义信息相匹配的第二模态数据。
  15. 一种跨模态搜索装置,包括:
    获取模块,用于获取第一模态数据;
    搜索模块,用于基于所述第一模态数据的内容信息在第二模态数据库中进行搜索,得到第一集合,所述第一集合中包括与所述第一模态数据的内容信息相匹配的至少一个第二模态数据;
    所述搜索模块,还用于基于所述第一模态数据的语义信息在所述第二模态数据库中进行搜索,得到第二集合,所述第二集合中包括与所述第一模态数据的语义信息相匹配的至少一个第二模态数据;
    合并模块,用于对所述第一集合和所述第二集合进行合并,得到所述第一模态数据对应的跨模态搜索结果。
  16. 一种跨模态搜索装置,包括:
    显示模块,用于显示社交会话的会话界面;
    所述显示模块,还用于响应于对所述社交会话的历史会话记录的查看,显示会话记录详情界面,所述会话记录详情界面中包括所述社交会话的历史会话记录中的第二模态数据;
    输出模块,用于响应于在所述会话记录详情界面中输入的第一模态数据,输出所述第一模态数据对应的跨模态搜索结果;所述跨模态搜索结果是采用权利要求1-10任一项所述的跨模态搜索方法得到的。
  17. 一种计算机设备,包括:处理器、存储器以及网络接口;
    所述处理器与所述存储器、所述网络接口相连,其中,所述网络接口用于提供网络通信功能,所述存储器用于存储程序代码,所述处理器用于调用所述程序代码,以执行权利要求1至14任一项所述的跨模态搜索方法。
  18. 一种计算机可读存储介质,所述计算机可读存储介质存储有计算机程序,所述计算机程序包括程序指令,所述程序指令当被处理器执行时,执行权利要求1至14任一项所述的跨模态搜索方法。
  19. 一种计算机程序产品,所述计算机程序产品包括计算机程序或计算机指令,所述计算机程序或计算机指令被处理器执行时实现如权利要求1至14中任一项所述的跨模态搜索方法的步骤。
PCT/CN2022/134918 2022-03-07 2022-11-29 一种跨模态搜索方法及相关设备 WO2023168997A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
KR1020247011097A KR20240052055A (ko) 2022-03-07 2022-11-29 교차-모달 검색 방법 및 관련 디바이스
US18/353,882 US20230359651A1 (en) 2022-03-07 2023-07-18 Cross-modal search method and related device

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210222089.0A CN116775980B (zh) 2022-03-07 2022-03-07 一种跨模态搜索方法及相关设备
CN202210222089.0 2022-03-07

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US18/353,882 Continuation US20230359651A1 (en) 2022-03-07 2023-07-18 Cross-modal search method and related device

Publications (2)

Publication Number Publication Date
WO2023168997A1 true WO2023168997A1 (zh) 2023-09-14
WO2023168997A9 WO2023168997A9 (zh) 2024-09-12

Family

ID=87937126

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/134918 WO2023168997A1 (zh) 2022-03-07 2022-11-29 一种跨模态搜索方法及相关设备

Country Status (4)

Country Link
US (1) US20230359651A1 (zh)
KR (1) KR20240052055A (zh)
CN (1) CN116775980B (zh)
WO (1) WO2023168997A1 (zh)

Also Published As

Publication number Publication date
CN116775980A (zh) 2023-09-19
WO2023168997A9 (zh) 2024-09-12
US20230359651A1 (en) 2023-11-09
KR20240052055A (ko) 2024-04-22
CN116775980B (zh) 2024-06-07

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22930620

Country of ref document: EP

Kind code of ref document: A1

ENP Entry into the national phase

Ref document number: 20247011097

Country of ref document: KR

Kind code of ref document: A

ENP Entry into the national phase

Ref document number: 2024532539

Country of ref document: JP

Kind code of ref document: A

NENP Non-entry into the national phase

Ref country code: DE