WO2023225919A1 - A visual search method and apparatus - Google Patents

A visual search method and apparatus

Info

Publication number
WO2023225919A1
Authority
WO
WIPO (PCT)
Prior art keywords: search, level, objects, image, search results
Prior art date
Application number
PCT/CN2022/095061
Other languages: English (en), French (fr)
Inventors: 蒋昊, 蒋杰, 杨光
Original Assignee: 华为技术有限公司 (Huawei Technologies Co., Ltd.)
Application filed by 华为技术有限公司 (Huawei Technologies Co., Ltd.)
Priority to PCT/CN2022/095061 priority Critical patent/WO2023225919A1/zh
Publication of WO2023225919A1 publication Critical patent/WO2023225919A1/zh

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/953Querying, e.g. by the use of web search engines
    • G06F16/9535Search customisation based on user profiles and personalisation

Definitions

  • The present application relates to the field of search technology, and in particular, to a visual search method and apparatus.
  • Visual search is one of the key technologies in the Internet field; typical applications include "searching by image" and "searching text by image". A visual search engine is a specialized search engine system that provides users with retrieval services for related graphics and image material on the Internet, and is a subdivision of search engines. For example, Microsoft's "Bing" search engine helps users complete specific search tasks more conveniently through images. In the current digital age, in which consumers' attention span and time have sharply decreased, effectively capturing users' actual needs through visual search and improving the consumption experience have increasingly become a development consensus among major e-commerce platforms. Moreover, a Data Bridge survey shows that the market valuation of visual search will grow from US$6 billion to US$30 billion, and this rapidly growing market continues to drive the iterative development of visual search technology.
  • Embodiments of the present application provide a visual search method and device that, through multiple rounds of interaction, help users describe their search intent efficiently, clearly, and completely, guide and refine that intent, proactively explore users' potential points of interest, and improve the effectiveness and flexibility of search.
  • In a first aspect, this application provides a visual search method, which includes: obtaining an image to be searched; and obtaining a first round of search results based on the features of the image to be searched and the features of the first-level objects in a query recommendation library.
  • The first round of search results includes multiple first-level objects that meet the criterion. The query recommendation library includes N levels of objects, each (N-1)-th level object corresponds to multiple N-th level objects, N is an integer greater than 1, and the objects include text content and/or image content and/or video content and/or audio content. The method further includes: performing late interactive fusion of the features of the image to be searched and the features of a first target object to obtain a first cumulative search intent feature,
  • where the first target object is an object selected by the user from the multiple qualifying first-level objects; and obtaining, based on the first cumulative search intent feature, a second round of search results, which includes multiple qualifying second-level objects corresponding to the first target object.
  • A first-level object whose similarity to the image to be searched is greater than a preset threshold is determined to be a first-level object that meets the criterion.
  • The multiple qualifying first-level objects in the first round of search results are sorted from high to low by their similarity to the image to be searched.
  • A second-level object whose similarity to the first cumulative search intent feature is greater than a preset threshold is determined to be a second-level object that meets the criterion.
  • The multiple qualifying second-level objects in the second round of search results are sorted from high to low by their similarity to the first cumulative search intent feature.
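  • The thresholding and ranking rule described above can be sketched as follows. The cosine metric, the 0.8 default threshold, and the toy feature vectors are illustrative assumptions, not mandated by the method:

```python
import math

def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def qualifying_objects(query_feat, objects, threshold=0.8):
    """Keep objects whose similarity to the query exceeds the threshold,
    sorted from high to low similarity (the per-round ranking rule)."""
    scored = [(name, cosine(query_feat, feat)) for name, feat in objects]
    kept = [(name, sim) for name, sim in scored if sim > threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)

# Hypothetical first-level objects with 2-d features.
objs = [("bicycle accessories", [1.0, 0.0]),
        ("screws", [0.9, 0.1]),
        ("unrelated", [0.0, 1.0])]
ranked = qualifying_objects([1.0, 0.0], objs)
```

Here "unrelated" falls below the threshold and is dropped, while the survivors are returned highest-similarity first.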
  • The M-th cumulative intent feature and the feature of the L-th target object are late-interactively fused to obtain the final search intent, where the L-th target object is the object selected by the user from the multiple qualifying L-th level objects,
  • M is a positive integer greater than or equal to 1,
  • and L is a positive integer greater than M. Based on the final search intent, the final search results are obtained, and the final search results include the (L+1)-th level objects corresponding to the L-th target object.
  • the final search intent is also related to a first text feature, which is a feature of the query text entered by the user.
  • the final search results include card search results and/or expanded search results.
  • the query recommendation database includes information of multiple modalities.
  • The information of the multiple modalities is organized as a tree structure.
  • The nodes of the tree structure represent the objects, and nodes at different levels of the tree structure represent objects at different levels.
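  • A minimal sketch of such an N-level tree, assuming plain Python objects; the node names are hypothetical examples in the spirit of the Figure 4 scenario:

```python
from dataclasses import dataclass, field

@dataclass
class QueryNode:
    """One node of the multi-level query recommendation tree. `content` may
    hold text and/or image/video/audio payloads; `children` are the
    next-level objects corresponding to this object."""
    name: str
    content: dict = field(default_factory=dict)
    children: list = field(default_factory=list)

    def add_child(self, node: "QueryNode") -> "QueryNode":
        self.children.append(node)
        return node

# Hypothetical fragment: a level-1 object and its level-2 children.
root = QueryNode("bicycle")
acc = root.add_child(QueryNode("bicycle accessories"))
acc.add_child(QueryNode("transmission products"))
acc.add_child(QueryNode("transmission installation tutorials"))
```

Each round of recommendation then only needs to look at the `children` of the node the user last selected.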
  • this application provides a visual search device, including:
  • an acquisition module, used to obtain the image to be searched;
  • a cumulative search intent determination module, configured to obtain the first round of search results based on the features of the image to be searched and the features of the first-level objects in the query recommendation library, where the first round of search results includes multiple first-level objects that meet the criterion;
  • the query recommendation library includes N levels of objects, each (N-1)-th level object corresponds to multiple N-th level objects, N is an integer greater than 1, and the objects include text content and/or image content and/or video content and/or audio content;
  • the features of the image to be searched and the features of the first target object are late-interactively fused to obtain the first cumulative search intent feature, where the first target object is an object selected by the user from the multiple qualifying first-level objects;
  • a search result determination module is configured to obtain a second round of search results based on the first accumulated search intention characteristics, where the second round of search results include a plurality of qualified second-level objects corresponding to the first target object.
  • a first-level object whose similarity to the image to be searched is greater than a preset threshold is determined to be a first-level object that meets the criteria.
  • a plurality of first-level objects that meet the criteria in the first round of search results are sorted from high to low in order of similarity to the image to be searched.
  • a second-level object whose similarity to the first accumulated search intent feature is greater than a preset threshold is determined as a second-level object that meets the criteria.
  • a plurality of qualified second-level objects in the second round of search results are sorted from high to low according to their similarity to the first cumulative search intent feature.
  • the search result determination module is also used to late-interactively fuse the M-th cumulative intent feature and the feature of the L-th target object to obtain the final search intent, where the L-th target object is an object selected by the user from the multiple qualifying L-th level objects, M is a positive integer greater than or equal to 1, and L is a positive integer greater than M;
  • based on the final search intent, a final search result is obtained, and the final search result includes an (L+1)-th level object corresponding to the L-th target object.
  • the final search intent is also related to a first text feature, where the first text feature is a feature of the query text input by the user.
  • the final search results include card search results and/or extended search results.
  • the query recommendation library includes information in multiple modalities, and the information in the multiple modalities is organized as a tree structure;
  • the nodes of the tree structure represent the objects, and nodes at different levels of the tree structure represent objects at different levels.
  • this application provides a computing device, including a memory and a processor.
  • the memory stores executable code
  • the processor executes the executable code to implement the method described in the first aspect.
  • the present application provides a computer-readable storage medium on which a computer program is stored.
  • when the computer program is executed in a computer, the computer is caused to execute the method described in the first aspect of the present application.
  • the present application provides a computer program or computer program product.
  • the computer program or computer program product includes instructions. When the instructions are executed, the method described in the first aspect of the application is implemented.
  • Figure 1 shows a schematic flow chart of visual search
  • Figure 2 is an architecture diagram of a visual search system provided by an embodiment of the present application.
  • Figure 3 is a schematic diagram of the construction process of the query recommendation library
  • Figure 4 is a schematic diagram of query recommendation of the query recommendation database during the search process
  • Figure 5a is a schematic diagram of the card search results
  • Figure 5b is a schematic diagram of the expanded search results
  • Figure 6 is a flow chart of a visual search method provided by an embodiment of the present application.
  • Figure 7 is a schematic structural diagram of a visual search device provided by an embodiment of the present application.
  • FIG. 8 is a schematic structural diagram of a computing device provided by an embodiment of the present application.
  • Semantic space: the space of linguistic meaning. Every symbol system is, in a broad sense, a language that conveys meaning, and the meanings it expresses constitute a specific semantic space.
  • Semantic features: the basic concepts and meanings of content, represented as numerical feature vectors.
  • Modality: every source or form of information can be called a modality.
  • Cross-modal retrieval: the demand for information retrieval is often not just for data from a single modality of the same event; data from other modalities may be required to enrich the understanding of the same thing or event. In this case, cross-modal retrieval is needed to achieve retrieval across data of different modalities.
  • Multi-source fusion: integrating various data information, absorbing the characteristics of different data sources, and extracting unified information that is better and richer than any single data source.
  • Vector retrieval: in a given set of vectors, retrieving the K vectors most similar to the query vector according to some metric.
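  • Vector retrieval as defined above can be sketched with a brute-force scan; the cosine metric and the tiny in-memory index are illustrative assumptions (a production system would use an approximate-nearest-neighbor index):

```python
import math

def retrieve_top_k(query, index, k=2):
    """Return the names of the k vectors in `index` (name -> vector)
    most similar to `query`, using cosine similarity as the metric."""
    def cos(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))
    ranked = sorted(index.items(), key=lambda item: cos(query, item[1]), reverse=True)
    return [name for name, _ in ranked[:k]]
```

For example, with `index = {"a": [1.0, 0.0], "b": [0.8, 0.6], "c": [0.0, 1.0]}`, querying `[1.0, 0.0]` returns the two most similar names in order.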
  • Graph: a structure that represents the interconnections between things, objects, or entities.
  • Figure 1 shows a schematic flow chart of visual search. As shown in Figure 1, visual search includes the following steps:
  • BERT is used to compute semantic features for text;
  • SWIN TRANSFORMER is used to compute semantic features for images, and the resulting features are projected into the same semantic space as the offline corpus.
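  • The idea of projecting both modalities into one semantic space can be illustrated as below. In the described system the towers are BERT (text) and SWIN Transformer (image); here each tower is replaced by a stand-in linear projection, and the matrices and input features are invented for illustration only:

```python
def project(vec, matrix):
    """Apply a linear projection; each row of `matrix` is one output dimension."""
    return [sum(m * x for m, x in zip(row, vec)) for row in matrix]

# Stand-in "towers": hypothetical matrices mapping each modality's native
# feature size into the same 2-d semantic space.
TEXT_TOWER = [[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0]]          # 3-d text feature -> 2-d
IMAGE_TOWER = [[0.5, 0.5, 0.0, 0.0],
               [0.0, 0.0, 0.5, 0.5]]    # 4-d image feature -> 2-d

text_feat = project([0.2, 0.8, 0.1], TEXT_TOWER)
image_feat = project([0.1, 0.3, 0.7, 0.9], IMAGE_TOWER)
# Both features now live in the same 2-d space and can be compared directly.
```

Once both features share a space, the similarity computations of the later steps apply uniformly to text and images.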
  • embodiments of the present application propose several visual search solutions.
  • The first visual search solution includes content base database construction: cleaning product images, web page images, and the like to build an image base database, and using data mining techniques to extract tags from structured (e.g., product database) or unstructured (e.g., web page) data sources, then deduplicating and cleaning them to build a text tag base library;
  • Content semantic feature calculation: through the text tower and image tower of the multi-modal semantic model, the content feature vectors of each tag and each image in the base database are calculated respectively;
  • Query semantic feature calculation: the semantic feature vector of the user's Query image is calculated through the image tower of the multi-modal semantic model;
  • Content search (searching images by image, searching text by image): the similarity between the query features and the text and image features in the content database is calculated, and the search content with higher similarity is returned to the user, ultimately achieving the effect of visual search.
  • The first visual search solution is based on multi-modal semantic matching. By computing image-modality and text-modality features, it can, to a certain extent, accurately determine the general correlation between the query image and the base content, thereby achieving the goal of visual search.
  • there are still shortcomings such as poor search flexibility and passive response.
  • Passive response: it can only passively respond to user queries; it cannot help users identify and refine unclear search intents, nor can it actively stimulate users' search interest, which restricts search traffic and duration.
  • The second visual search solution fills in and expands on the user's search content and provides things that the search engine considers related to the topic, thereby helping users obtain recommended information faster and better; it is used to handle complex multi-modal searches and further advances the experience innovation of visual search. From a technical perspective, on top of basic visual search it allows users to input an additional text Query and initiate an image-text fusion search, which significantly improves the flexibility of search and can support some search intents that traditional search technologies cannot accomplish.
  • the second visual search solution has the following problems:
  • Passive response: it can only passively respond to user queries and cannot help users identify unclear search intents; likewise, this solution cannot actively stimulate users' search interest, which restricts search traffic and duration.
  • a third visual search option includes:
  • This technical solution takes the user's continuously input search content (text, images), represents the input content in a vector space, and continuously narrows the search scope based on image-text fusion, in order to achieve interactive complex queries.
  • This solution improves, to a certain extent, the ability to handle complex queries in a conversational, progressive manner.
  • The third visual search solution, based on multi-modal conversational search, adds multi-round interaction capabilities on top of image-text fusion to improve the efficiency of complex queries, but it still has the following shortcomings:
  • Embodiments of this application provide a visual search method that is based on a multi-level query recommendation library, fully mines structured and unstructured multi-source data, and automatically builds a multi-level tree.
  • The query recommendation library uses a late-interaction multi-source content fusion strategy and combines user behavior (image input, clicks, etc.) to continuously update query recommendations along two dimensions, breadth and depth, helping users describe their search intent efficiently, clearly, and completely, guiding and refining users' search intent, actively exploring users' potential points of interest, and improving the effectiveness and flexibility of search.
  • Figure 2 is an architecture diagram of a visual search system provided by an embodiment of the present application.
  • the system mainly includes two module components: offline and online.
  • Offline module components include: query recommendation library construction module, multi-modal content library, multi-level query recommendation library, and content base library construction module.
  • Online module components include: human-computer interaction module, multi-modal information understanding module, multi-information fusion module, and semantic vector retrieval module.
  • The query recommendation library construction module in the offline module component is used to construct the trunk of the tree from structured data such as graphs and multi-level tags; it mines high-frequency words from sources such as web pages/logs to expand the set of nodes, expands relationship nodes in depth and breadth, deduplicates nodes based on synonym dictionaries, language models, and other tools, and merges the subtrees or leaf nodes mounted under duplicate nodes;
  • The multi-modal information understanding module in the online module component is used to compute the similarity between the Query content features and the query recommendations, combined with the accumulation of the user's online behavior, to refine the user's query intent and to adjust the recommendations to that intent;
  • The multi-information fusion module in the online module performs late interactive fusion modeling based on the cumulative intent feature and the next-level nodes' query recommendation text features, retrieves each modality's information in the content base, and returns the Top-1 result as the query recommendation details of the node; it can further expand the search, conducting an expanded search based on the cumulative intent feature (optionally further fusing additional user-input text features) to return more content information.
  • This system is a multi-level, conversational, query-recommendation visual search system.
  • The main function of the system is to use the Query image input by the user and the accumulated interactive click behavior on recommendation results to guide and refine the user's search intent, actively explore the user's potential points of interest, improve the effectiveness and flexibility of search, and return effective search results to the user.
  • The main task of the offline module component is to extract structural information, such as hypernym-hyponym relations among query recommendations, from multi-source data, and to build a multi-level (tree-like) query recommendation library.
  • the main task of the online module component is to guide and improve user search intentions through conversational cross-modal query recommendations and diverse fusion content, and continue to interact until the search is completed.
  • Structural information such as hypernym-hyponym relations among query recommendations is extracted from multi-source data, a multi-level (tree-like) query recommendation base library is constructed, and the offline construction of the multi-level query recommendation library is completed.
  • The trunk of the tree is computed from structured data such as the multi-level category labels of the data.
  • Semantic similarity expansion is carried out in depth (expanding child nodes based on inclusion relations, with synonym judgment) and in breadth (based on cross relations, with synonym judgment: co-occurrence relations within the same image are expanded into sibling nodes, which are added after verifying their relation with the parent node).
  • Nodes are deduplicated based on tools such as synonym dictionaries and language models, and the subtrees or leaf nodes mounted under duplicate nodes are merged.
  • the specific implementation is shown in Figure 3.
  • First-level recommendations are selected and sorted from all nodes; for subsequent levels, the child nodes of the node selected by the user are returned as candidates for the next level; late interactive fusion modeling is then performed on the Query image features and the text features of the user-selected query recommendation to refine the cumulative intent feature.
  • The next-level nodes are pruned and re-ordered online (see Figure 4).
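  • The online pruning and re-ranking of next-level nodes can be sketched as follows; the node names, features, and zero pruning threshold are hypothetical stand-ins:

```python
import math

def rerank_children(intent_feat, children, threshold=0.0):
    """Given the cumulative intent feature and the selected node's children
    (name -> feature vector), prune low-similarity children and re-order
    the rest from high to low similarity for the next recommendation round."""
    def cos(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))
    scored = [(name, cos(intent_feat, feat)) for name, feat in children.items()]
    return sorted([(n, s) for n, s in scored if s > threshold],
                  key=lambda pair: pair[1], reverse=True)

# Hypothetical level-2 candidates under the node the user clicked.
children = {"transmission products": [1.0, 0.1],
            "transmission repair tutorials": [0.2, 1.0]}
ranked = rerank_children([1.0, 0.0], children)
```

Only the children of the clicked node are scored, so each round's candidate set stays small.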
  • Figure 6 is a flow chart of a visual search method provided by an embodiment of the present application. This visual search method can be implemented through the visual search system shown in Figure 2. As shown in Figure 6, a visual search method provided by an embodiment of the present application includes steps S601 to S604.
  • In step S601, the image to be searched is obtained.
  • The image to be searched can be received as user input at the search terminal and uploaded to the server.
  • The search terminal (such as a smartphone) can capture the image to be searched through a camera (such as a mobile phone camera), or it can directly retrieve an image from local storage as the image to be searched; this application does not limit the specific method of obtaining the image to be searched.
  • In step S602, a first round of search results is obtained based on the features of the image to be searched and the features of the first-level objects in the query recommendation library.
  • the first-round search results include a plurality of first-level objects that meet the criteria.
  • For example, the semantic features of the image to be searched are extracted through the SWIN TRANSFORMER model and mapped into the same semantic space as the query recommendation library, yielding the semantic feature vector of the image to be searched. The similarity between this semantic feature vector and the features of the first-level objects in the query recommendation library is then calculated, first-level objects whose similarity is greater than a preset threshold (for example, 0.8) are determined to be qualifying objects, and the multiple qualifying first-level objects are taken as the first round of search results.
  • the query recommendation library includes N-level objects.
  • Each N-1-th level object corresponds to multiple N-th-level objects.
  • N is an integer greater than 1.
  • The objects include text content and/or image content and/or video content and/or audio content; that is to say, the information in the query recommendation library is multi-modal information, including text content information, image content information, audio content information, and so on.
  • the information of multiple modalities in the query recommendation database is in a tree structure, the nodes of the tree structure represent objects, and the nodes at different levels of the tree structure represent objects of different levels.
  • For the construction of the query recommendation library and the specific tree structure, refer to the description of the query recommendation library above; for the sake of brevity, they are not repeated here.
  • The query recommendation library provides effective, low-cost, and scalable data support for the implementation of the search method.
  • The multiple qualifying first-level objects in the first round of search results are sorted from high to low by their similarity to the image to be searched. For example, as shown in Figure 4, after the similarity between the image to be searched and each first-level object is calculated, the qualifying first-level objects are sorted from high to low by similarity as bicycle accessories, screws, ..., steel wires. That is to say, the first round of search results is sorted by similarity to the image to be searched and displayed to the user: the higher the similarity, the closer the object is to the user's initial search intent, the more likely it is the content the user wants, and the higher it is ranked, making it easier for the user to find the desired content quickly.
  • In step S603, the features of the image to be searched and the features of the first target object are late-interactively fused to obtain the first cumulative search intent feature,
  • where the first target object is an object selected by the user from the multiple qualifying first-level objects.
  • The feature vector of the image to be searched and the feature vector of the first target object are fused with weights to obtain the first cumulative search intent feature.
  • The weight of the feature vector of the image to be searched and the weight of the feature vector of the first target object can be determined in a variety of ways, for example, by system default or by user settings.
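  • As a sketch, the weighted fusion might look like the following; the equal default weights and the L2 normalization are assumptions standing in for the system-default or user-set weights mentioned above:

```python
def late_fuse(image_feat, object_feat, w_image=0.5, w_object=0.5):
    """Weighted late fusion of the query-image feature vector and the
    selected object's feature vector, producing the cumulative search
    intent feature. The result is L2-normalized so successive rounds
    stay on a comparable scale (a design assumption, not from the source)."""
    fused = [w_image * a + w_object * b for a, b in zip(image_feat, object_feat)]
    norm = sum(x * x for x in fused) ** 0.5
    return [x / norm for x in fused]

# Fusing an image feature with the feature of the clicked object.
intent = late_fuse([1.0, 0.0], [0.0, 1.0])
```

With equal weights, the fused intent sits midway between the two inputs in the semantic space.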
  • the first target object is a first-level object selected by the user. For example, the user clicks on a first-level object in the first round of search results on the screen as the first target object, such as the bicycle accessories in Figure 4.
  • The user's initial search intent, which may be incomplete, can be obtained from the features of the image to be searched.
  • The features of the first target object selected by the user (which can be text features) reflect the user's further search intent.
  • In step S604, a second round of search results is obtained based on the first cumulative search intent feature, and the second round of search results includes multiple qualifying second-level objects corresponding to the first target object.
  • A second-level object whose similarity to the first cumulative search intent feature is greater than a preset threshold (for example, 0.8) is determined to be a second-level object that meets the criterion, and the qualifying second-level objects are taken as the second round of search results.
  • multiple second-level objects that meet the criteria in the second round of search results are sorted from high to low according to their similarity to the first cumulative intention feature.
  • The qualifying second-level objects are sorted from high to low by similarity as transmission products, transmission installation tutorials, transmission repair tutorials; that is to say, the second round of search results is sorted by similarity to the first cumulative search intent and displayed to the user.
  • The higher the similarity, the closer the object is to the user's first cumulative search intent, the more likely it is the content the user wants, and the higher it is ranked, making it easier for the user to find the desired content quickly.
  • If the user finds the desired content, the user can double-click to open it, successfully obtaining the content they wanted to search for, and the search ends.
  • If the user is still not satisfied with the second round of search results (no content matching the user's search intent exists), the interaction continues with the next round of search, refining the user's search intent until content matching the user's search intent is found.
  • The M-th cumulative intent feature and the feature of the L-th target object are late-interactively fused to obtain the final search intent, where the L-th target object is the object selected by the user from the multiple qualifying L-th level objects, M is a positive integer greater than or equal to 1, and L is a positive integer greater than M; based on the final search intent, the final search results are obtained, and the final search results include the (L+1)-th level objects corresponding to the L-th target object.
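  • Sketched as a loop, the round-by-round accumulation might look like this; the fusion rule is a hypothetical weighted average with L2 normalization, not the source's exact formula:

```python
def fuse(a, b, wa=0.5, wb=0.5):
    """Hypothetical late fusion: weighted average, then L2-normalize."""
    merged = [wa * x + wb * y for x, y in zip(a, b)]
    norm = sum(v * v for v in merged) ** 0.5
    return [v / norm for v in merged]

def accumulate_intent(image_feat, selected_feats):
    """Round 1 fuses the query image with the first selected object; each
    later round fuses the running cumulative intent with the next selection."""
    intent = image_feat
    for feat in selected_feats:
        intent = fuse(intent, feat)
    return intent

# Two rounds of clicks on objects whose features point away from the image.
final_intent = accumulate_intent([1.0, 0.0], [[0.0, 1.0], [0.0, 1.0]])
```

Each additional selection pulls the cumulative intent further toward the clicked objects, which is how repeated interaction refines the search.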
  • rollback can also be supported in each search round. For example, after receiving the user's rollback command, the search page of the previous round is returned to allow the user to re-select the target object and re-express his or her search intention.
  • the final search results include card search results (as shown in Figure 5a) and/or extended search results (as shown in Figure 5b).
  • The visual search method provided by the embodiments of this application uses continuous "click" interaction with visual information to replace the original user text description. On one hand, this reduces the complexity of user operations; on the other hand, it uses visual information, which carries more information, to guide users in refining their search intent. At the same time, in the process of multi-modal information interaction and accumulation, the retrieval recommendation results are continuously adjusted and optimized to improve the effectiveness of retrieval.
  • The final search intent is also related to the first text feature, which is a feature of the query text input by the user.
  • The user is also supported in further expressing their search intent by inputting query text, so as to shorten the number of search rounds and find the corresponding search content faster or more accurately.
  • In the N-th search round, if the user enters query text, the features of the query text are extracted, and the cumulative search intent feature, the features of the target object, and the features of the query text are interactively fused; a search is then performed in the content base database or the query recommendation library based on the fused features (calculating the similarity with each object) to obtain the final search results, which recommend the object with the highest similarity (Top-1).
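  • The three-way fusion of cumulative intent, target-object, and query-text features might be sketched as follows; the weights are illustrative assumptions, not values from the source:

```python
def fuse_with_text(intent_feat, object_feat, text_feat, weights=(0.4, 0.3, 0.3)):
    """Interactively fuse the cumulative search-intent feature, the selected
    target object's feature, and the query-text feature into one vector
    used for the final retrieval. Weights are hypothetical."""
    wi, wo, wt = weights
    fused = [wi * a + wo * b + wt * c
             for a, b, c in zip(intent_feat, object_feat, text_feat)]
    norm = sum(x * x for x in fused) ** 0.5
    return [x / norm for x in fused]

# Text input nudges the otherwise image-dominated intent toward a new direction.
fused = fuse_with_text([1.0, 0.0], [1.0, 0.0], [0.0, 1.0])
```

The fused vector is then scored against every object in the content base, and the Top-1 match is recommended.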
  • The visual search method guides user interaction based on the Query image and the multi-level query recommendation library, and keeps refining the cumulative intent feature until the search is completed (rollback is supported during the process).
  • Late interactive fusion modeling is performed on the Query image features and the text features of the user-selected query recommendation to refine the cumulative intent feature.
  • the next-level nodes are pruned and reordered online.
  • Based on the cumulative intent feature, the next-level query recommendation features, and the user's additional input text features, multiple modalities are fused to retrieve each modality's information in the content base. Late interaction fusion modeling is performed on the cumulative intent feature and the next-level nodes' query recommendation text features to retrieve each modality's information from the content base, and the Top-1 result is returned as the query recommendation details of the node. An expanded search is performed based on the cumulative intent feature (optionally further fused with additional user-input text features) to return more content information.
  • The visual search method provided by the embodiments of this application can also be used in device-side services, such as search recommendations for mobile phone albums and galleries. It can further connect device-side multi-modal information such as videos, pictures, and text messages to achieve joint interactive retrieval recommendation across each modality's information on the device side.
  • Based on the same concept as the foregoing method embodiments, an embodiment of the present application also provides a visual search device 700, which includes units or modules for implementing each step of the visual search methods shown in Figures 1-6.
  • Figure 7 is a schematic structural diagram of a visual search device provided by an embodiment of the present application. The device is applied to a computing device. As shown in Figure 7, the visual search device 700 includes at least:
  • an acquisition module 701, used to acquire an image to be searched;
  • a cumulative search intent determination module 702, used to obtain first-round search results based on features of the image to be searched and features of first-level objects in a query recommendation library, the first-round search results including multiple qualifying first-level objects;
  • where the query recommendation library includes objects of N levels, each object of level N-1 corresponds to multiple objects of level N, N is an integer greater than 1, and the objects include text content and/or image content and/or video content and/or audio content;
  • and to perform late interaction fusion of the features of the image to be searched and the features of a first target object to obtain first cumulative search intent features, the first target object being the object selected by the user from the multiple qualifying first-level objects; and
  • a search result determination module 703, used to obtain second-round search results based on the first cumulative search intent features, the second-round search results including multiple qualifying second-level objects corresponding to the first target object.
  • A first-level object whose similarity to the image to be searched is greater than a preset threshold is determined to be a qualifying first-level object.
  • The multiple qualifying first-level objects in the first-round search results are sorted from high to low by similarity to the image to be searched.
  • A second-level object whose similarity to the first cumulative search intent features is greater than a preset threshold is determined to be a qualifying second-level object.
  • The multiple qualifying second-level objects in the second-round search results are sorted from high to low by similarity to the first cumulative search intent features.
  • The search result determination module 703 is also configured to perform late interaction fusion of the Mth cumulative intent features and the features of the Lth target object to obtain a final search intent, where the Lth target object is the object selected by the user from the multiple qualifying Lth-level objects, M is a positive integer greater than or equal to 1, and L is a positive integer greater than M;
  • and, based on the final search intent, to obtain final search results including a qualifying (L+1)th-level object corresponding to the Lth target object.
  • The final search intent is also related to a first text feature, the first text feature being a feature of the query text input by the user.
  • the final search results include card search results and/or extended search results.
  • The query recommendation library includes information of multiple modalities, organized as a tree structure; the nodes of the tree structure represent the objects, and nodes at different layers of the tree structure represent objects of different levels.
  • The visual search device 700 may correspondingly perform the methods described in the embodiments of the present application, and the above and other operations and/or functions of each module in the visual search device 700 implement the corresponding processes of the methods in Figures 1-6; for brevity, they are not repeated here.
  • An embodiment of the present application also provides a computing device, including at least one processor, a memory, and a communication interface; the processor is configured to execute the methods described in Figures 1-6.
  • Figure 8 is a schematic structural diagram of a computing device provided by an embodiment of the present application.
  • As shown in Figure 8, the computing device 800 includes at least one processor 801, a memory 802, and a communication interface 803.
  • the processor 801, the memory 802 and the communication interface 803 are communicatively connected, and the communication connection can be realized in a wired manner (for example, a bus) or in a wireless manner.
  • the communication interface 803 is used to receive data sent by other devices; the memory 802 stores computer instructions, and the processor 801 executes the computer instructions to perform the visual search method in the foregoing method embodiment.
  • The processor 801 can be a central processing unit (CPU), or another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc. A general-purpose processor can be a microprocessor or any conventional processor.
  • The memory 802 may include read-only memory and random access memory, and provides instructions and data to the processor 801.
  • Memory 802 may also include non-volatile random access memory.
  • the memory 802 may be volatile memory or non-volatile memory, or may include both volatile and non-volatile memory.
  • Non-volatile memory can be read-only memory (ROM), programmable ROM (PROM), erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), or flash memory.
  • Volatile memory can be random access memory (RAM), which is used as an external cache. By way of example and not limitation, many forms of RAM are available, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchlink DRAM (SLDRAM), and direct rambus RAM (DR RAM).
  • The computing device 800 can perform the methods shown in Figures 1-6 of the embodiments of the present application; a detailed description of these methods is given above and is not repeated here.
  • Embodiments of the present application provide a computer-readable storage medium on which a computer program is stored; when the computer program is executed by a processor, the above-mentioned visual search method is implemented.
  • Embodiments of the present application provide a chip, which includes at least one processor and an interface.
  • The at least one processor obtains program instructions or data through the interface, and executes the program instructions to implement the visual search method mentioned above.
  • Embodiments of the present application provide a computer program or computer program product, which includes instructions that, when executed, cause the computer to perform the above-mentioned visual search method.
  • A software module can reside in random access memory (RAM), memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A visual search method is provided, including: acquiring an image to be searched (601); obtaining first-round search results based on features of the image to be searched and features of first-level objects in a query recommendation library, the first-round search results including multiple qualifying first-level objects, where the query recommendation library includes objects of N levels, each object of level N-1 corresponds to multiple objects of level N, N is an integer greater than 1, and the objects include text content and/or image content and/or video content and/or audio content (602); performing late interaction fusion of the features of the image to be searched and the features of a first target object to obtain first cumulative search intent features, the first target object being the object selected by the user from the multiple qualifying first-level objects (603); and obtaining second-round search results based on the first cumulative search intent features, the second-round search results including multiple qualifying second-level objects corresponding to the first target object (604).

Description

A visual search method and device — Technical Field
The present application relates to the field of search technology, and in particular to a visual search method and device.
Background
Visual search is one of the key technologies in the Internet field; typical applications include "searching images by image" and "searching text by image". A visual search engine is a specialized search engine system that provides users with retrieval services for related graphics and images on the Internet, a subdivision of search engines; for example, Microsoft's "Bing" search engine uses pictures to help users complete specific search tasks more conveniently. In the current digital age, in which consumers' attention span and time have sharply decreased, effectively capturing users' actual needs through visual search and improving the user's consumption experience have increasingly become a development consensus among major e-commerce platforms. Meanwhile, a Data Bridge survey shows that the market valuation of visual search will grow from US$6 billion to US$30 billion, and this rapidly growing market continues to drive the iterative development of visual search technology.
However, visual search in the prior art suffers from poor flexibility: it can only passively respond to the user's query and cannot help the user identify or refine a search intent that is not yet clear, which leads to low accuracy of search results and a poor user experience.
Summary
Embodiments of the present application provide a visual search method and device that, through multiple rounds of interaction, help the user describe the search intent efficiently, clearly, and completely, guide and refine the user's search intent, proactively mine the user's latent points of interest, and improve the effectiveness and flexibility of search.
In a first aspect, the present application provides a visual search method, including: acquiring an image to be searched; obtaining first-round search results based on features of the image to be searched and features of first-level objects in a query recommendation library, the first-round search results including multiple qualifying first-level objects, where the query recommendation library includes objects of N levels, each object of level N-1 corresponds to multiple objects of level N, N is an integer greater than 1, and the objects include text content and/or image content and/or video content and/or audio content; performing late interaction fusion of the features of the image to be searched and the features of a first target object to obtain first cumulative search intent features, the first target object being the object selected by the user from the multiple qualifying first-level objects; and obtaining second-round search results based on the first cumulative search intent features, the second-round search results including multiple qualifying second-level objects corresponding to the first target object.
In this possible implementation, multiple rounds of interaction help the user describe the search intent efficiently, clearly, and completely, guide and refine the user's search intent, proactively mine the user's latent points of interest, and improve the effectiveness and flexibility of search.
In one possible implementation, a first-level object whose similarity to the image to be searched is greater than a preset threshold is determined to be a qualifying first-level object.
In another possible implementation, the multiple qualifying first-level objects in the first-round search results are sorted from high to low by similarity to the image to be searched.
In another possible implementation, a second-level object whose similarity to the first cumulative search intent features is greater than a preset threshold is determined to be a qualifying second-level object.
In another possible implementation, the multiple qualifying second-level objects in the second-round search results are sorted from high to low by similarity to the first cumulative search intent features.
In another possible implementation, the Mth cumulative intent features and the features of an Lth target object are fused by late interaction to obtain a final search intent, where the Lth target object is the object selected by the user from multiple qualifying Lth-level objects, M is a positive integer greater than or equal to 1, and L is a positive integer greater than M; based on the final search intent, final search results are obtained, the final search results including a qualifying (L+1)th-level object corresponding to the Lth target object.
In one example, the final search intent is also related to a first text feature, the first text feature being a feature of a query text input by the user.
In another possible implementation, the final search results include card search results and/or extended search results.
In another possible implementation, the query recommendation library includes information of multiple modalities organized as a tree structure; nodes of the tree structure represent the objects, and nodes at different layers of the tree structure represent objects of different levels.
In a second aspect, the present application provides a visual search device, including:
an acquisition module, used to acquire an image to be searched;
a cumulative search intent determination module, used to obtain first-round search results based on features of the image to be searched and features of first-level objects in a query recommendation library, the first-round search results including multiple qualifying first-level objects;
where the query recommendation library includes objects of N levels, each object of level N-1 corresponds to multiple objects of level N, N is an integer greater than 1, and the objects include text content and/or image content and/or video content and/or audio content;
and to perform late interaction fusion of the features of the image to be searched and the features of a first target object to obtain first cumulative search intent features, the first target object being the object selected by the user from the multiple qualifying first-level objects; and
a search result determination module, used to obtain second-round search results based on the first cumulative search intent features, the second-round search results including multiple qualifying second-level objects corresponding to the first target object.
In one possible implementation, a first-level object whose similarity to the image to be searched is greater than a preset threshold is determined to be a qualifying first-level object.
In another possible implementation, the multiple qualifying first-level objects in the first-round search results are sorted from high to low by similarity to the image to be searched.
In another possible implementation, a second-level object whose similarity to the first cumulative search intent features is greater than a preset threshold is determined to be a qualifying second-level object.
In another possible implementation, the multiple qualifying second-level objects in the second-round search results are sorted from high to low by similarity to the first cumulative search intent features.
In another possible implementation, the search result determination module is also used to perform late interaction fusion of the Mth cumulative intent features and the features of the Lth target object to obtain a final search intent, where the Lth target object is the object selected by the user from multiple qualifying Lth-level objects, M is a positive integer greater than or equal to 1, and L is a positive integer greater than M;
and, based on the final search intent, to obtain final search results including a qualifying (L+1)th-level object corresponding to the Lth target object.
In another possible implementation, the final search intent is also related to a first text feature, the first text feature being a feature of the query text input by the user.
In another possible implementation, the final search results include card search results and/or extended search results.
In another possible implementation, the query recommendation library includes information of multiple modalities organized as a tree structure; nodes of the tree structure represent the objects, and nodes at different layers of the tree structure represent objects of different levels.
In a third aspect, the present application provides a computing device including a memory and a processor, the memory storing executable code and the processor executing the executable code to implement the method described in the first aspect.
In a fourth aspect, the present application provides a computer-readable storage medium on which a computer program is stored; when the computer program is executed in a computer, the computer performs the method described in the first aspect of the present application.
In a fifth aspect, the present application provides a computer program or computer program product including instructions that, when executed, implement the method described in the first aspect of the present application.
Brief Description of the Drawings
Figure 1 is a schematic flowchart of a visual search;
Figure 2 is an architecture diagram of a visual search system provided by an embodiment of the present application;
Figure 3 is a schematic diagram of the construction process of the query recommendation library;
Figure 4 is a schematic diagram of query recommendation from the query recommendation library during a search;
Figure 5a is a schematic diagram of card search results;
Figure 5b is a schematic diagram of extended search results;
Figure 6 is a flowchart of a visual search method provided by an embodiment of the present application;
Figure 7 is a schematic structural diagram of a visual search device provided by an embodiment of the present application;
Figure 8 is a schematic structural diagram of a computing device provided by an embodiment of the present application.
Detailed Description
The technical solution of the present application is described in further detail below with reference to the drawings and embodiments.
To better understand the technical solutions provided by the embodiments of the present application, some of the terms involved are briefly introduced first.
Query: the content a user enters in the search box.
Semantic space: the world of linguistic meaning; in a broad sense, every symbol system is a language that conveys meaning, and the meanings it expresses constitute a particular semantic space.
Semantic features: representing the basic concepts and meaning of content as numeric feature vectors.
Modality: every source or form of information can be called a modality.
Cross-modal retrieval: information retrieval often needs more than single-modality data about the same event; data of other modalities may be needed to enrich our understanding of the same thing or event, which requires cross-modal retrieval across data of different modalities.
Multi-source fusion: synthesizing different kinds of data, drawing on the characteristics of different data sources, and extracting from them unified information that is better and richer than any single source.
Vector retrieval: in a given vector dataset, retrieving the K vectors closest to a query vector under some metric.
Graph: a structure that represents the interconnections between some things, objects, or entities and others.
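The vector retrieval term just defined can be illustrated with a minimal brute-force sketch (the vectors and query below are invented toy data; production systems would use an approximate nearest-neighbor index instead of a full scan):

```python
import numpy as np

def retrieve_k(query, vectors, k=2):
    """Brute-force vector retrieval: return the indices of the k vectors
    in the dataset closest to the query under cosine similarity."""
    q = query / np.linalg.norm(query)
    v = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    sims = v @ q                      # cosine similarity per dataset vector
    return [int(i) for i in np.argsort(-sims)[:k]]

vectors = np.array([[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]])
nearest = retrieve_k(np.array([1.0, 0.1]), vectors, k=2)  # → [2, 0]
```

The same similarity-then-sort pattern underlies each retrieval step described in the embodiments below.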
Figure 1 shows a schematic flowchart of a visual search. As shown in Figure 1, the technical implementation of a visual search includes the following steps:
1) Build a base database offline: through data mining, obtain and filter key information from structured (e.g., product catalogs) or unstructured (e.g., web pages) data sources to construct the offline database;
2) Online, compute semantic features of the user's actual query content — using BERT for text and a model such as SWIN TRANSFORMER for images — and project the resulting features into the same semantic space as the offline database corpus;
3) Using the features of the actual query content, perform feature-similarity matching retrieval and coarse recall from the offline base database;
4) Further rerank the coarse-recall candidates and return the final recommendation results.
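The coarse-recall-then-rerank stages above can be sketched as follows (the embeddings, freshness signal, and score weights are invented for illustration; a production system would rerank with a learned ranking model rather than a hand-tuned score):

```python
import numpy as np

def coarse_recall(query_vec, base_vecs, top_n=3):
    """Stage 3: feature-similarity matching against the offline base (coarse recall)."""
    b = base_vecs / np.linalg.norm(base_vecs, axis=1, keepdims=True)
    q = query_vec / np.linalg.norm(query_vec)
    sims = b @ q
    idx = np.argsort(-sims)[:top_n]
    return idx, sims[idx]

def rerank(candidates, sims, freshness):
    """Stage 4: rerank the coarse candidates by blending similarity with a
    second signal (a toy freshness score here)."""
    scores = 0.8 * sims + 0.2 * freshness[candidates]
    return [int(c) for c in candidates[np.argsort(-scores)]]

base = np.array([[1.0, 0.0], [0.8, 0.6], [0.0, 1.0], [0.9, 0.1]])
freshness = np.array([0.1, 0.8, 0.5, 0.2])
cands, sims = coarse_recall(np.array([1.0, 0.0]), base)  # coarse: [0, 3, 1]
final = rerank(cands, sims, freshness)                   # reranked: [3, 0, 1]
```

Note how reranking can promote a slightly less similar candidate when the secondary signal favors it.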
However, faced with the multi-modal information of real scenarios and the demands of effective recommendation, this visual search method increasingly falls short.
To solve the above problems, embodiments of the present application consider several visual search schemes.
The first visual search scheme includes content base construction: cleaning product images, web images, etc. to build an image base, and using data mining to extract tags from structured (e.g., product catalogs) or unstructured (e.g., web pages) data sources, then deduplicating and cleaning them to build a text tag base;
content semantic feature computation: computing the content feature vectors of each tag and each image in the base through the text tower and image tower of a multi-modal semantic model;
query semantic feature computation: computing the semantic feature vector of the user's Query image through the image tower of the multi-modal semantic model;
content search (image-to-image, image-to-text): computing the similarity between the query features and the text and image features in the content base and returning the higher-similarity content to the user, thereby achieving the visual search effect.
The first visual search scheme, based on multi-modal semantic matching, computes image and text modality features and can to a certain extent accurately judge the general relevance between the query image and base content, thereby achieving the goal of visual search. However, it still has the drawbacks of poor search flexibility and passive response.
Poor search flexibility: it supports only single-round image queries, so the user cannot fully describe a complex search intent such as "how do I repair this (the situation in the picture)". With limited intent information in the input, the system can only compute general semantic relevance and complete simple search requests such as finding similar items or identifying objects.
Passive response: it can only passively respond to the user's Query; it can neither help the user identify and refine an unclear search intent nor proactively stimulate the user's search interest, constraining search traffic and session time.
The second visual search scheme fills in and expands on the user's search content and offers things the search engine considers related to the topic, helping the user obtain recommended information faster and better; it is intended for complex multi-modal search. It further elaborates on the experience innovations of visual search: technically, on top of basic visual search it allows the user to additionally enter a text Query and launch a combined image-text search, which significantly improves search flexibility and supports some search intents that traditional search technology cannot handle.
The second visual search scheme has the following problems:
Passive response: it can only passively respond to the user's Query and cannot help the user identify an unclear search intent; likewise, it cannot proactively stimulate the user's search interest, and it constrains search traffic and session time.
Complex operation: on the one hand, the user must make two query inputs (uploading an image plus entering text) to complete a complex-intent search, contrary to the "one shot and done" mental model of visual search; on the other hand, whenever the user is dissatisfied with the search results, the query must be re-entered from scratch.
The third visual search scheme includes:
(1) using a representation model to compute a feature-vector representation of the user's input;
(2) performing similarity retrieval in the offline base according to the input's feature vector and returning ranked results;
(3) combining the user input and the returned results to offer the user further-refined search recommendation items;
(4) after the user interacts with the refined search recommendation items, computing the content feature vector again and, using it as an anchor, performing a second retrieval within the results returned in the previous round;
(5) repeating the above dialog interaction until the user is guided to an effective recommendation for the search intent.
In this technical scheme the user continually enters retrieval content (text, images); the input is represented in a vector space, and the retrieval scope is continually narrowed on the basis of image-text fusion, aiming at interactive complex queries. To a certain extent it improves the handling of complex queries in a progressive, conversational way.
The third visual search method adds multi-round interaction on top of image-text fusion in a multi-modal conversational search and improves the efficiency of complex queries, but it still has the following drawbacks:
(1) Passive response: it can only passively respond to the user's Query and cannot help the user identify an unclear search intent;
(2) Complex operation: it requires the user to keep refining the query demands, which heavily burdens the user's input and degrades the user experience;
(3) High technical difficulty: the supporting technology exists mainly in academia, is far from industrial deployment, and is difficult to commercialize.
To solve the problems of the above schemes and the prior art, embodiments of the present application propose a visual search method based on a multi-level query recommendation library: it fully mines structured and unstructured multi-source data to automatically build a multi-level tree-shaped query recommendation base, and, with a late-interaction multi-source content fusion strategy combined with user behavior (image input, clicks, etc.), continuously updates query recommendations along both the breadth and the depth dimensions, helping users describe their search intent efficiently, clearly, and completely, guiding and refining the user's search intent, proactively mining the user's latent points of interest, and improving the effectiveness and flexibility of search.
Figure 2 is an architecture diagram of a visual search system provided by an embodiment of the present application. The system mainly consists of offline and online module components.
The offline module components include: a query recommendation library construction module, a multi-modal content base, a multi-level query recommendation library, and a content base construction module.
The online module components include: a human-computer interaction module, a multi-modal information understanding module, a multi-source information fusion module, and a semantic vector retrieval module.
The query recommendation library construction module among the offline components is used to construct tree skeletons from structured data such as graphs and multi-level tags, and to mine high-frequency terms from web pages/logs to expand the number of root-adjacent nodes. It expands relation nodes in depth and breadth, deduplicates nodes using tools such as synonym dictionaries and language models, and merges the subtrees or leaf nodes mounted under duplicate nodes.
The multi-modal information understanding module among the online components computes the similarity between Query content features and query recommendations to make recommendations, and accumulates the user's online behavior to refine the user's query intent, which is used for adjusting recommendations to that intent.
The multi-source information fusion module among the online components performs late interaction fusion modeling on the cumulative intent features and the next-layer node query recommendation text features, retrieves each modality of information from the content base, and returns the top-1 result as the node's query recommendation details; it further performs an expanded search based on the cumulative intent features (optionally fused with additional user-input text features) to return more content.
This visual search system is a visual search system with multi-level conversational query recommendation. Its main function is to interact with accumulating user behavior based on the user's input Query image and the interactive click/swipe recommendation results, guiding and refining the user's search intent, proactively mining the user's latent points of interest, improving the effectiveness and flexibility of search, and returning effective retrieval results to the user. The main task of the offline module components is to extract structural information, such as hypernym-hyponym relations, for query recommendation from multi-source data and build the multi-level (tree-shaped) query recommendation library. The main task of the online module components is to guide and refine the user's search intent through conversational cross-modal query recommendation and multi-source fused content, interacting continuously until the search completes.
The implementation steps are as follows:
S1. Deploy the system on specific hardware servers; the offline module components and the online module components can be deployed on the same hardware server or separately.
S2. In the offline stage, extract structural information, such as hypernym-hyponym relations, for query recommendation from multi-source data and build the multi-level (tree-shaped) query recommendation base, completing the offline construction of the multi-level query recommendation library. Logical tree skeletons are computed from structured data such as the multi-level category tags of the data, and nodes are then expanded by semantic similarity in depth (expanding child nodes from inclusion relations with synonym checking) and in breadth (expanding sibling nodes from intersection relations with synonym checking and from co-occurrence relations within the same image, adding them after validating the relation with the parent node). Meanwhile, nodes are deduplicated using tools such as synonym dictionaries and language models, and the subtrees or leaf nodes mounted under duplicate nodes are merged; the specific implementation is shown in Figure 3.
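The node deduplication and subtree merging done in this offline stage can be sketched as follows (the label-to-children mapping and the tiny synonym table are invented stand-ins for the synonym dictionary / language model the text mentions):

```python
def merge_duplicate_nodes(children, synonyms):
    """Merge sibling nodes whose labels are synonyms, re-attaching the
    subtrees mounted under a duplicate node to the surviving canonical node.
    `children` maps a node label to its child labels; `synonyms` maps a
    label to its canonical form."""
    merged = {}
    for label, subtree in children.items():
        canon = synonyms.get(label, label)      # canonicalize the label
        merged.setdefault(canon, [])
        for sub in subtree:                     # re-mount children, skipping repeats
            if sub not in merged[canon]:
                merged[canon].append(sub)
    return merged

children = {
    "bike": ["wheel", "chain"],
    "bicycle": ["chain", "derailleur"],
    "screw": ["bolt"],
}
synonyms = {"bike": "bicycle"}                  # toy synonym dictionary
library = merge_duplicate_nodes(children, synonyms)
```

After merging, "bike" and "bicycle" collapse into one node that carries the union of both subtrees.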
S3. In the online recommendation stage, guide user interaction based on the Query image and the multi-level query recommendation library, continuously refining the cumulative intent features until the search completes (backtracking is supported during the process). First, for the first layer of nodes (i.e., the branch nodes closest to the root), select and rank the first-layer recommendations from all nodes based on the similarity between the Query image features and the query recommendation features; for subsequent layers, return the children of the node selected by the user as the next layer's candidates. Then, perform late interaction fusion modeling on the Query image features and the features of the query recommendation text selected by the user to refine the cumulative intent features. Finally, based on the similarity between the current cumulative intent features and the child nodes' query recommendation text features, prune and reorder the next layer of nodes online (see Figure 4).
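One online interaction round of S3 can be sketched as follows. The fusion weight `alpha` and the pruning threshold are illustrative assumptions, not values specified in the application:

```python
import numpy as np

def next_round(cum_intent, selected_feat, child_feats, threshold=0.5, alpha=0.6):
    """Update the cumulative intent by (late) fusion with the user-selected
    node's features, then prune and reorder the next-layer child nodes by
    similarity to the updated intent."""
    fused = alpha * cum_intent + (1 - alpha) * selected_feat
    fused = fused / np.linalg.norm(fused)
    c = child_feats / np.linalg.norm(child_feats, axis=1, keepdims=True)
    sims = c @ fused
    # keep children above the threshold, ordered by descending similarity
    keep = [int(i) for i in np.argsort(-sims) if sims[i] > threshold]
    return fused, keep

cum = np.array([1.0, 0.0])                 # intent accumulated so far
sel = np.array([0.5, 0.5])                 # features of the node the user tapped
childs = np.array([[1.0, 0.2], [0.0, 1.0], [0.6, 0.5]])
intent, ranked_children = next_round(cum, sel, childs)
```

Here the second child falls below the threshold and is pruned; the rest are reordered against the refined intent.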
S4. Perform multi-source fusion of the cumulative intent features, the next-layer query recommendation features, and any additional user-input text features to retrieve each modality of information in the content base, finally providing card search results (see Figure 5a) and/or extended search results (see Figure 5b).
Figure 6 is a flowchart of a visual search method provided by an embodiment of the present application. The method can be implemented by the visual search system shown in Figure 2; as shown in Figure 6, the method includes steps S601 to S604.
In step S601, an image to be searched is acquired.
The image to be searched may be input by the user on a search terminal and uploaded to a server. The search terminal (e.g., a smartphone) may capture the image with a camera (e.g., the phone's camera), or directly use an image from local storage as the image to be searched; the present application does not limit the specific way of acquiring the image.
In step S602, first-round search results are obtained based on features of the image to be searched and features of first-level objects in the query recommendation library, the first-round search results including multiple qualifying first-level objects.
Semantic features of the image to be searched are extracted — for example, with a SWIN TRANSFORMER model — and mapped into the same semantic space as the query recommendation library to obtain the image's semantic feature vector. The similarity between this vector and the features of the first-level objects in the query recommendation library is then computed; first-level objects whose similarity exceeds a preset threshold (e.g., 0.8) are determined to be qualifying objects, and the multiple qualifying first-level objects form the first-round search results.
The query recommendation library includes objects of N levels, where each object of level N-1 corresponds to multiple objects of level N, N is an integer greater than 1, and the objects include text content and/or image content and/or video content and/or audio content. That is, the information in the query recommendation library is multi-modal, including text content information, image content information, audio content information, and so on.
Optionally, the multi-modal information in the query recommendation library is organized as a tree structure: nodes of the tree represent objects, and nodes at different layers represent objects of different levels. For the construction of the query recommendation library and the specific tree structure, see the description above; for brevity, it is not repeated here.
Through the query recommendation library, structured and unstructured multi-source data are fully mined to automatically construct a multi-level tree-shaped query recommendation library, structurally organizing the user's potential paths of information retrieval and exploration, which provides effective, low-cost, and scalable data support for implementing the visual search method of the embodiments of the present application.
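As an illustration of the tree-shaped library just described, a minimal sketch of a multi-level recommendation node follows (all node contents and modalities are invented examples, not data from the application):

```python
from dataclasses import dataclass, field

@dataclass
class RecommendationNode:
    """One node of the multi-level (tree-shaped) query recommendation library.
    A node's children are the next-level objects it corresponds to."""
    content: str                       # text, or a reference to image/video/audio
    modality: str = "text"
    children: list = field(default_factory=list)

    def add_child(self, node):
        self.children.append(node)
        return node

    def level_objects(self, level):
        """Collect all objects at a given level (the root's children are level 1)."""
        if level == 1:
            return self.children
        objs = []
        for child in self.children:
            objs.extend(child.level_objects(level - 1))
        return objs

root = RecommendationNode("root")
parts = root.add_child(RecommendationNode("bicycle parts"))
parts.add_child(RecommendationNode("derailleur repair tutorial", modality="video"))
parts.add_child(RecommendationNode("derailleur product", modality="image"))
root.add_child(RecommendationNode("screws"))
```

Each search round then matches against one level of such nodes, descending along the branch the user selects.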
In one example, the multiple qualifying first-level objects in the first-round search results are sorted from high to low by similarity to the image to be searched. For example, as shown in Figure 4, after the similarity between the image to be searched and each first-level object is computed, the qualifying first-level objects are ordered from high to low similarity as bicycle parts, screws, ..., steel wire. That is, the first-round search results are presented to the user sorted by similarity to the searched image: the higher the similarity, the closer the object is to the user's initial search intent and the more likely it is the content the user wants, so it ranks higher, helping the user find the desired content faster.
In step S603, late interaction fusion is performed on the features of the image to be searched and the features of the first target object to obtain first cumulative search intent features, the first target object being the object selected by the user from the multiple qualifying first-level objects.
After the feature vector of the image to be searched and the feature vector of the first target object are obtained, the two vectors are fused by weighted combination to obtain the first cumulative search intent. The weights of the two feature vectors can be determined in various ways, for example, by system default or by user setting.
The first target object is the first-level object selected by the user — for example, the user taps a first-level object in the first-round search results on the screen, such as "bicycle parts" in Figure 4.
The features of the image to be searched yield the user's possibly incomplete initial search intent, while the features of the user-selected first target object (which may be text features) reflect the user's further search intent; by combining the features of the image (i.e., the initial intent) with the features of the first target object (i.e., the further intent), a more complete cumulative search intent is obtained.
In step S604, second-round search results are obtained based on the first cumulative search intent features, the second-round search results including multiple qualifying second-level objects corresponding to the first target object.
The similarity between the first cumulative intent features and the features of each second-level object corresponding to the first target object is computed; objects whose similarity exceeds a preset threshold (e.g., 0.8) are determined to be qualifying second-level objects, and the qualifying second-level objects form the second-round search results.
Optionally, the multiple qualifying second-level objects in the second-round search results are sorted from high to low by similarity to the first cumulative intent features. For example, as shown in Figure 4, after the similarity between the first cumulative intent features and each second-level object is computed, the qualifying objects are ordered from high to low similarity as derailleur products, derailleur installation tutorial, derailleur repair tutorial. That is, the second-round search results are presented to the user sorted by similarity to the first cumulative search intent: the higher the similarity, the closer the object is to the user's first cumulative search intent and the more likely it is the content the user wants, so it ranks higher, helping the user find the desired content faster.
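The threshold-then-sort step used in each round can be sketched as follows (the 0.8 threshold follows the example in the text; the labels and embeddings are invented toy data):

```python
import numpy as np

def qualifying_objects(query_feat, object_feats, labels, threshold=0.8):
    """Keep objects whose cosine similarity to the query features exceeds the
    threshold, sorted from highest to lowest similarity."""
    q = query_feat / np.linalg.norm(query_feat)
    objs = object_feats / np.linalg.norm(object_feats, axis=1, keepdims=True)
    sims = objs @ q
    order = np.argsort(-sims)                         # descending similarity
    return [(labels[i], float(sims[i])) for i in order if sims[i] > threshold]

labels = ["bicycle parts", "screws", "steel wire", "furniture"]
feats = np.array([[0.95, 0.05], [0.9, 0.3], [0.85, 0.5], [0.1, 0.9]])
ranked = qualifying_objects(np.array([1.0, 0.0]), feats, labels)
```

With this toy data, "furniture" falls below the threshold and is excluded, and the remaining three objects are presented in descending similarity order.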
If the second-round search results contain the content the user wants, the user can double-tap to open it; the desired content has been successfully obtained and the search ends. If the user is still not satisfied with the second-round results (no content matches the search intent), the interaction continues with the next round of search, further refining the user's search intent, until content matching the user's search intent is found.
That is, late interaction fusion is performed on the Mth cumulative intent features and the features of the Lth target object to obtain the final search intent, where the Lth target object is the object selected by the user from multiple qualifying Lth-level objects, M is a positive integer greater than or equal to 1, and L is a positive integer greater than M; based on the final search intent, final search results are obtained, the final search results including a qualifying (L+1)th-level object corresponding to the Lth target object.
Optionally, backtracking is supported in each search round; for example, after receiving the user's back command, the system returns to the previous round's search page so that the user can reselect a target object and re-express the search intent.
Optionally, the final search results include card search results (as shown in Figure 5a) and/or extended search results (as shown in Figure 5b).
The visual search method provided by the embodiments of the present application replaces the original user text description with continuous "click"-style interaction over visual information, which on the one hand reduces the complexity of user operations and on the other hand uses visual information, which carries more "information", to guide the user in refining the retrieval intent. Meanwhile, during the interaction and accumulation of multi-modal information, the retrieval recommendation results are continuously adjusted and optimized, improving retrieval effectiveness.
In another implementation, the final search intent is also related to a first text feature, which is a feature of a query text input by the user. That is, in each search round the user may also enter a query text to further express the search intent, shortening the number of search rounds and reaching the corresponding search content faster or more accurately.
For example, if the user enters a query text in the search results of the Nth search round, the features of that query text are extracted and fused by late interaction with the features of that round's cumulative search intent and of the target object; retrieval is then performed in the content base or the query recommendation library with the fused features (computing the similarity with each object), and the final search results recommend the object with the highest similarity (top-1).
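The three-way fusion and top-1 retrieval just described can be sketched as follows (the fusion weights and all embeddings are illustrative assumptions, not values from the application):

```python
import numpy as np

def fuse_features(cum_intent, target_obj, query_text, weights=(0.5, 0.3, 0.2)):
    """Weighted late fusion of the cumulative intent, selected target object,
    and user query-text feature vectors; weights are hypothetical."""
    fused = (weights[0] * cum_intent
             + weights[1] * target_obj
             + weights[2] * query_text)
    return fused / np.linalg.norm(fused)

def top1(fused, base):
    """Cosine similarity against every object in the content base; the
    highest-scoring (top-1) object is recommended."""
    base = base / np.linalg.norm(base, axis=1, keepdims=True)
    sims = base @ fused
    i = int(np.argmax(sims))
    return i, float(sims[i])

base = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7], [-1.0, 0.0]])  # toy content base
fused = fuse_features(np.array([1.0, 0.0]),   # cumulative search intent
                      np.array([0.8, 0.2]),   # selected target object
                      np.array([0.9, 0.1]))   # user's query text
idx, sim = top1(fused, base)
```

The index returned by `top1` identifies the object recommended as the final search result.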
The visual search method provided by the embodiments of the present application guides user interaction based on the Query image and the multi-level query recommendation library, continuously refining the cumulative intent features until the search completes (backtracking is supported during the process). First, based on the similarity between the Query image features and the query recommendation features, the first-layer recommendations are selected from all nodes and ranked; then, in subsequent layers, the children of the user-selected node are returned as the next layer's candidates. Late interaction fusion modeling on the Query image features and the user-selected query recommendation text features refines the cumulative intent features. Based on the similarity between the current cumulative intent features and the child nodes' query recommendation text features, the next layer of nodes is pruned and reordered online. Multi-source fusion of the cumulative intent features, the next-layer query recommendation features, and user-input text features retrieves each modality of information in the content base. Late interaction fusion modeling of the cumulative intent features and the next-layer node query recommendation text features retrieves each modality of the content base, and the top-1 result is returned as the node's query recommendation details. An expanded search based on the cumulative intent features (optionally fused further with additional user-input text features) returns more content information.
Besides cloud applications, the visual search method provided by the embodiments of the present application can also be applied to device-side services, such as search recommendations for a mobile phone's photo album and gallery; it can further connect device-side multi-modal information such as videos, pictures, and text messages to achieve joint interactive retrieval and recommendation across each modality on the device side.
Based on the same concept as the foregoing embodiments of the visual search method, an embodiment of the present application also provides a visual search device 700, which includes units or modules for implementing each step of the visual search methods shown in Figures 1-6.
Figure 7 is a schematic structural diagram of a visual search device provided by an embodiment of the present application. The device is applied to a computing device. As shown in Figure 7, the visual search device 700 includes at least:
an acquisition module 701, used to acquire an image to be searched;
a cumulative search intent determination module 702, used to obtain first-round search results based on features of the image to be searched and features of first-level objects in a query recommendation library, the first-round search results including multiple qualifying first-level objects;
where the query recommendation library includes objects of N levels, each object of level N-1 corresponds to multiple objects of level N, N is an integer greater than 1, and the objects include text content and/or image content and/or video content and/or audio content;
and to perform late interaction fusion of the features of the image to be searched and the features of a first target object to obtain first cumulative search intent features, the first target object being the object selected by the user from the multiple qualifying first-level objects; and
a search result determination module 703, used to obtain second-round search results based on the first cumulative search intent features, the second-round search results including multiple qualifying second-level objects corresponding to the first target object.
In one possible implementation, a first-level object whose similarity to the image to be searched is greater than a preset threshold is determined to be a qualifying first-level object.
In another possible implementation, the multiple qualifying first-level objects in the first-round search results are sorted from high to low by similarity to the image to be searched.
In another possible implementation, a second-level object whose similarity to the first cumulative search intent features is greater than a preset threshold is determined to be a qualifying second-level object.
In another possible implementation, the multiple qualifying second-level objects in the second-round search results are sorted from high to low by similarity to the first cumulative search intent features.
In another possible implementation, the search result determination module 703 is also used to perform late interaction fusion of the Mth cumulative intent features and the features of the Lth target object to obtain a final search intent, where the Lth target object is the object selected by the user from multiple qualifying Lth-level objects, M is a positive integer greater than or equal to 1, and L is a positive integer greater than M;
and, based on the final search intent, to obtain final search results including a qualifying (L+1)th-level object corresponding to the Lth target object.
In another possible implementation, the final search intent is also related to a first text feature, the first text feature being a feature of the query text input by the user.
In another possible implementation, the final search results include card search results and/or extended search results.
In another possible implementation, the query recommendation library includes information of multiple modalities organized as a tree structure; nodes of the tree structure represent the objects, and nodes at different layers of the tree structure represent objects of different levels.
The visual search device 700 according to the embodiments of the present application may correspondingly perform the methods described in the embodiments of the present application, and the above and other operations and/or functions of each module of the visual search device 700 implement the corresponding processes of the methods in Figures 1-6; for brevity, they are not repeated here.
An embodiment of the present application also provides a computing device including at least one processor, a memory, and a communication interface; the processor is used to execute the methods described in Figures 1-6.
Figure 8 is a schematic structural diagram of a computing device provided by an embodiment of the present application.
As shown in Figure 8, the computing device 800 includes at least one processor 801, a memory 802, and a communication interface 803. The processor 801, memory 802, and communication interface 803 are communicatively connected, which may be realized in a wired manner (e.g., a bus) or wirelessly. The communication interface 803 is used to receive data sent by other devices; the memory 802 stores computer instructions, and the processor 801 executes these instructions to perform the visual search method of the foregoing method embodiments.
It should be understood that in the embodiments of the present application the processor 801 may be a central processing unit (CPU), or another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc. A general-purpose processor may be a microprocessor or any conventional processor.
The memory 802 may include read-only memory and random access memory, and provides instructions and data to the processor 801. The memory 802 may also include non-volatile random access memory.
The memory 802 may be volatile memory or non-volatile memory, or may include both. Non-volatile memory may be read-only memory (ROM), programmable ROM (PROM), erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), or flash memory. Volatile memory may be random access memory (RAM), used as an external cache. By way of example and not limitation, many forms of RAM are available, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchlink DRAM (SLDRAM), and direct rambus RAM (DR RAM).
It should be understood that the computing device 800 according to the embodiments of the present application can perform the methods shown in Figures 1-6; a detailed description of these methods is given above and, for brevity, is not repeated here.
Embodiments of the present application provide a computer-readable storage medium on which a computer program is stored; when the computer instructions are executed by a processor, the visual search method mentioned above is implemented.
Embodiments of the present application provide a chip including at least one processor and an interface; the at least one processor obtains program instructions or data through the interface, and executes the program instructions to implement the visual search method mentioned above.
Embodiments of the present application provide a computer program or computer program product including instructions that, when executed, cause a computer to perform the visual search method mentioned above.
Those of ordinary skill in the art will further appreciate that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of the two. To clearly illustrate the interchangeability of hardware and software, the composition and steps of the examples have been described above generally in terms of their functions. Whether these functions are executed in hardware or software depends on the specific application and design constraints of the technical solution. A person of ordinary skill may use different methods to implement the described functions for each particular application, but such implementations should not be considered beyond the scope of the present application.
The steps of the methods or algorithms described in connection with the embodiments disclosed herein may be implemented in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in random access memory (RAM), memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.
The specific embodiments described above further explain in detail the purpose, technical solution, and beneficial effects of the present application. It should be understood that the above are only specific embodiments of the present application and are not intended to limit its scope of protection; any modification, equivalent replacement, improvement, etc. made within the spirit and principles of the present application shall be included within the scope of protection of the present application.

Claims (20)

  1. A visual search method, characterized by comprising:
    acquiring an image to be searched;
    obtaining first-round search results based on features of the image to be searched and features of first-level objects in a query recommendation library, the first-round search results comprising multiple qualifying first-level objects;
    wherein the query recommendation library comprises objects of N levels, each object of level N-1 corresponds to multiple objects of level N, N is an integer greater than 1, and the objects comprise text content and/or image content and/or video content and/or audio content;
    performing late interaction fusion of the features of the image to be searched and features of a first target object to obtain first cumulative search intent features, the first target object being the object selected by the user from the multiple qualifying first-level objects; and
    obtaining second-round search results based on the first cumulative search intent features, the second-round search results comprising multiple qualifying second-level objects corresponding to the first target object.
  2. The method according to claim 1, characterized in that a first-level object whose similarity to the image to be searched is greater than a preset threshold is determined to be a qualifying first-level object.
  3. The method according to claim 1 or 2, characterized in that the multiple qualifying first-level objects in the first-round search results are sorted from high to low by similarity to the image to be searched.
  4. The method according to any one of claims 1-3, characterized in that a second-level object whose similarity to the first cumulative search intent features is greater than a preset threshold is determined to be a qualifying second-level object.
  5. The method according to any one of claims 1-4, characterized in that the multiple qualifying second-level objects in the second-round search results are sorted from high to low by similarity to the first cumulative search intent features.
  6. The method according to any one of claims 1-5, characterized by further comprising:
    performing late interaction fusion of the Mth cumulative intent features and features of an Lth target object to obtain a final search intent, wherein the Lth target object is the object selected by the user from multiple qualifying Lth-level objects, M is a positive integer greater than or equal to 1, and L is a positive integer greater than M; and
    obtaining final search results based on the final search intent, the final search results comprising a qualifying (L+1)th-level object corresponding to the Lth target object.
  7. The method according to claim 6, characterized in that the final search intent is also related to a first text feature, the first text feature being a feature of a query text input by the user.
  8. The method according to claim 6 or 7, characterized in that the final search results comprise card search results and/or extended search results.
  9. The method according to any one of claims 1-7, characterized in that the query recommendation library comprises information of multiple modalities organized as a tree structure, nodes of the tree structure represent the objects, and nodes at different layers of the tree structure represent objects of different levels.
  10. A visual search device, characterized by comprising:
    an acquisition module, used to acquire an image to be searched;
    a cumulative search intent determination module, used to obtain first-round search results based on features of the image to be searched and features of first-level objects in a query recommendation library, the first-round search results comprising multiple qualifying first-level objects;
    wherein the query recommendation library comprises objects of N levels, each object of level N-1 corresponds to multiple objects of level N, N is an integer greater than 1, and the objects comprise text content and/or image content and/or video content and/or audio content;
    and to perform late interaction fusion of the features of the image to be searched and features of a first target object to obtain first cumulative search intent features, the first target object being the object selected by the user from the multiple qualifying first-level objects; and
    a search result determination module, used to obtain second-round search results based on the first cumulative search intent features, the second-round search results comprising multiple qualifying second-level objects corresponding to the first target object.
  11. The device according to claim 10, characterized in that a first-level object whose similarity to the image to be searched is greater than a preset threshold is determined to be a qualifying first-level object.
  12. The device according to claim 10 or 11, characterized in that the multiple qualifying first-level objects in the first-round search results are sorted from high to low by similarity to the image to be searched.
  13. The device according to any one of claims 10-12, characterized in that a second-level object whose similarity to the first cumulative search intent features is greater than a preset threshold is determined to be a qualifying second-level object.
  14. The device according to any one of claims 10-13, characterized in that the multiple qualifying second-level objects in the second-round search results are sorted from high to low by similarity to the first cumulative search intent features.
  15. The device according to any one of claims 10-14, characterized in that the search result determination module is further used to perform late interaction fusion of the Mth cumulative intent features and features of an Lth target object to obtain a final search intent, wherein the Lth target object is the object selected by the user from multiple qualifying Lth-level objects, M is a positive integer greater than or equal to 1, and L is a positive integer greater than M;
    and to obtain, based on the final search intent, final search results comprising a qualifying (L+1)th-level object corresponding to the Lth target object.
  16. The device according to claim 15, characterized in that the final search intent is also related to a first text feature, the first text feature being a feature of a query text input by the user.
  17. The device according to claim 15 or 16, characterized in that the final search results comprise card search results and/or extended search results.
  18. The device according to any one of claims 10-17, characterized in that the query recommendation library comprises information of multiple modalities organized as a tree structure, nodes of the tree structure represent the objects, and nodes at different layers of the tree structure represent objects of different levels.
  19. A computing device comprising a memory and a processor, characterized in that the memory stores executable code and the processor executes the executable code to implement the method of any one of claims 1-9.
  20. A computer-readable storage medium on which a computer program is stored, characterized in that, when the computer program is executed in a computer, the computer performs the method of any one of claims 1-9.
PCT/CN2022/095061 2022-05-25 2022-05-25 一种视觉搜索方法及装置 WO2023225919A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/CN2022/095061 WO2023225919A1 (zh) 2022-05-25 2022-05-25 一种视觉搜索方法及装置

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2022/095061 WO2023225919A1 (zh) 2022-05-25 2022-05-25 一种视觉搜索方法及装置

Publications (1)

Publication Number Publication Date
WO2023225919A1 true WO2023225919A1 (zh) 2023-11-30

Family

ID=88918025

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/095061 WO2023225919A1 (zh) 2022-05-25 2022-05-25 一种视觉搜索方法及装置

Country Status (1)

Country Link
WO (1) WO2023225919A1 (zh)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090254539A1 (en) * 2008-04-03 2009-10-08 Microsoft Corporation User Intention Modeling For Interactive Image Retrieval
JP2010218479A (ja) * 2009-03-19 2010-09-30 Yahoo Japan Corp 画像検索装置
CN102508909A (zh) * 2011-11-11 2012-06-20 苏州大学 一种基于多智能算法及图像融合技术的图像检索方法
CN106547744A (zh) * 2015-09-16 2017-03-29 杭州海康威视数字技术股份有限公司 一种图像检索方法及系统
CN108563792A (zh) * 2018-05-02 2018-09-21 百度在线网络技术(北京)有限公司 图像检索处理方法、服务器、客户端及存储介质
CN110020185A (zh) * 2017-12-29 2019-07-16 国民技术股份有限公司 智能搜索方法、终端及服务器
CN112364199A (zh) * 2021-01-13 2021-02-12 太极计算机股份有限公司 一种图片搜索系统
US11176189B1 (en) * 2016-12-29 2021-11-16 Shutterstock, Inc. Relevance feedback with faceted search interface


Similar Documents

Publication Publication Date Title
Qi et al. Finding all you need: web APIs recommendation in web of things through keywords search
CN111611361B (zh) 抽取式机器智能阅读理解问答系统
KR102354716B1 (ko) 딥 러닝 모델을 이용한 상황 의존 검색 기법
CN109829104B (zh) 基于语义相似度的伪相关反馈模型信息检索方法及系统
CN107480158B (zh) 基于相似性得分评估内容项目与图像的匹配的方法和系统
US10025819B2 (en) Generating a query statement based on unstructured input
CN110442777B (zh) 基于bert的伪相关反馈模型信息检索方法及系统
CN111190997B (zh) 一种使用神经网络和机器学习排序算法的问答系统实现方法
US7895195B2 (en) Method and apparatus for constructing a link structure between documents
CN111753198A (zh) 信息推荐方法和装置、以及电子设备和可读存储介质
US20180081880A1 (en) Method And Apparatus For Ranking Electronic Information By Similarity Association
US20120246135A1 (en) Image search engine augmenting search text based upon category selection
US20220114361A1 (en) Multi-word concept tagging for images using short text decoder
JP2022050379A (ja) 意味検索方法、装置、電子機器、記憶媒体およびコンピュータプログラム
JP7150090B2 (ja) ショッピング検索のための商品属性抽出方法
CN110147494B (zh) 信息搜索方法、装置,存储介质及电子设备
US11475290B2 (en) Structured machine learning for improved whole-structure relevance of informational displays
US10198497B2 (en) Search term clustering
CN112115232A (zh) 一种数据纠错方法、装置及服务器
JP7483320B2 (ja) 自動検索辞書およびユーザインターフェイス
CN113039539A (zh) 使用ai模型推荐来扩展搜索引擎能力
US20230117568A1 (en) Knowledge attributes and passage information based interactive next query recommendation
CN117435685A (zh) 文档检索方法、装置、计算机设备、存储介质和产品
WO2023225919A1 (zh) 一种视觉搜索方法及装置
US9195940B2 (en) Jabba-type override for correcting or improving output of a model

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22943124

Country of ref document: EP

Kind code of ref document: A1