WO2019118253A1

WO2019118253A1 - Document recall based on vector nearest neighbor search

Info

Publication number: WO2019118253A1
Application number: PCT/US2018/064146
Authority: WO
Inventors: Dianfei Han; Jiefeng HUA; Dongqing Zhang; Suyan ZHU; Shi ZHANG; Gang Ren; Feng Tan; Jingdong Wang; Hui Shen; Wei Luo; Zengzhong Li; Lintao Zhang; Qi Chen; Mingqin LI
Original assignee: Microsoft Technology Licensing, Llc
Priority date: 2017-12-14
Filing date: 2018-12-06
Publication date: 2019-06-20
Also published as: CN109948044A

Abstract

The present disclosure provides technical solutions related to document recall based on vector nearest neighbor search. The technique of vector approximate matching is applied to the searching engine. The search content and the webpage documents may each be converted into semantic vectors, and the webpage documents related to the search content may be obtained by vector approximate matching search, so that a searching service that better understands the user's intention may be provided without being limited by the searching method of symbol matching.

Description

DOCUMENT RECALL BASED ON VECTOR NEAREST NEIGHBOR SEARCH

BACKGROUND

[0001] With the development of internet technology, searching engines are becoming more and more powerful, and the targets of searches are more and more diversified. A searching engine may offer information to many applications and thus act as a service necessary for them. In a period in which information is growing at high speed, there is a large and fast-growing number of webpage documents. Meanwhile, users’ needs for information are growing. A major challenge for current searching engine techniques is how to understand the intention behind a user’s search more quickly, more efficiently, and more precisely.

BRIEF SUMMARY

[0002] This Summary is provided to give a brief introduction to some concepts, which are further explained in the following description. This Summary is not intended to identify essential technical features or important features of the claimed subject matter, nor to limit the scope of the claimed subject matter.

[0003] A technical solution related to document recall based on vector nearest neighbor search is disclosed. More particularly, the technique of vector approximate matching search is applied in the searching engine. The query content and the webpage documents may each be converted into semantic vectors, and the webpage documents similar to the query content may be obtained by vector approximate matching search, so that a searching service that better understands the user’s intention may be provided without being limited by the searching method of symbol matching.

[0004] The above description is merely a brief introduction of the technical solutions of the present disclosure, so that the technical means of the present disclosure may be clearly understood, and implemented according to the description of the specification, and the above and other technical objects, features and advantages of the present disclosure may be more obvious based on the embodiments of the present disclosure as follows.

BRIEF DESCRIPTION OF THE DRAWINGS

[0005] Fig. 1 is an exemplary block diagram of a searching engine system of embodiments of the present disclosure;

[0006] Fig. 2 is a schematic diagram showing query processing procedure on webpage documents of embodiments of the present disclosure;

[0007] Fig. 3 is a block diagram of a webpage document query device of embodiments of the present disclosure;

[0008] Fig. 4 is a schematic diagram showing another query processing procedure on webpage documents of embodiments of the present disclosure;

[0009] Fig. 5 is a block diagram of another webpage document query device of embodiments of the present disclosure;

[00010] Fig. 6 is another schematic diagram showing still another query processing procedure on webpage documents of embodiments of the present disclosure;

[00011] Fig. 7 is a block diagram showing systematic framework of query processing on webpage documents of embodiments of the present disclosure;

[00012] Fig. 8 is a schematic diagram showing another query processing procedure on webpage documents of embodiments of the present disclosure;

[00013] Fig. 9 is a schematic diagram showing another query processing procedure of webpage documents of embodiments of the present disclosure;

[00014] Fig. 10 is a block diagram showing another systematic framework of query processing on webpage documents of embodiments of the present disclosure;

[00015] Fig. 11 is a block diagram showing still another systematic framework of query processing on webpage documents of embodiments of the present disclosure;

[00016] Fig. 12 is a schematic diagram of vector nearest neighbor searching based on model of CDSSM of embodiments of the present disclosure; and

[00017] Fig. 13 is a block diagram of electronic device of embodiments of the present disclosure.

DETAILED DESCRIPTION

[00018] In the following, the exemplary embodiments of the present disclosure will be described in detail in connection with the accompanying drawings. Although the drawings show the exemplary embodiments of the present disclosure, it should be appreciated that the present disclosure may be implemented in various ways without being limited by the embodiments set forth herein. On the contrary, these embodiments are provided for a thorough understanding of the present disclosure, and for completely conveying the scope of the present disclosure to those skilled in the art.

[00019] The term "technique", as cited herein, for instance, may refer to system(s), method(s), computer-readable instructions, module(s), algorithms, hardware logic (e.g., Field-programmable Gate Arrays (FPGAs), Application-specific Integrated Circuits (ASICs), Application-specific Standard Products (ASSPs), System-on-a-chip systems (SOCs), Complex Programmable Logic Devices (CPLDs)), and/or other technique(s) as permitted by the context above and throughout the document.

[00020] Searching engine technology has been widely used in various fields. In addition to the general way of using a searching engine through webpage access, searching engines are now integrated into various APPs (applications) to provide various information searching services to users.

[00021] A user may send a query request to a searching engine. The searching engine may search the stored webpage documents according to the query content contained in the query request, acquire the webpage documents matched with the query content of the user, and return the acquired webpage documents to the user. The searching engine may search not only webpage documents, but also other types of documents (such as message documents and data documents). In the present disclosure, the description is mainly made with webpage documents as examples.

[00022] Currently, searching on documents is mainly conducted based on symbol matching. In web searching engines, documents related to certain query content are commonly acquired by symbol matching based on an inverted index of keywords. However, the current searching method based on symbol matching cannot understand the intention of a user very well. Although in some searching engines the original input query may be adjusted and then used in the searching so as to improve the recalling ratio, such adjustment may be very limited and cannot maintain the required recalling ratio, especially when the query content is related to new concepts.
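The limitation of symbol matching described above can be illustrated with a minimal sketch of an inverted index. The function names, the whitespace tokenization, and the sample documents are assumptions made for illustration only and are not taken from the disclosure.

```python
from collections import defaultdict

def build_inverted_index(documents):
    """Map each lowercase term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def inverted_index_search(keywords, index):
    """Return ids of documents that contain every keyword (pure symbol matching)."""
    postings = [index.get(keyword.lower(), set()) for keyword in keywords]
    if not postings:
        return set()
    return set.intersection(*postings)

# Hypothetical sample data.
documents = {
    "d1": "cheap flights to Paris",
    "d2": "Paris hotel deals",
    "d3": "low cost airfare to Paris",
}
index = build_inverted_index(documents)
exact_hits = inverted_index_search(["paris", "flights"], index)
# "d3" is about the same topic but shares no symbol with "flights", so it is missed.
missed = "d3" not in exact_hits
```

In this sketch a query for "flights" cannot recall the document that only says "airfare", which is the kind of gap the semantic vector approach is meant to close.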

[00023] In the document recalling technique based on vector nearest neighbor searching as proposed in the present disclosure, documents in the searching engine may be converted in advance into document vectors in the form of semantic vectors, and the query content of a user may also be converted into query vectors in the form of semantic vectors. Then, searching may be conducted in a document vectors base (which consists of the document vectors obtained by converting a plurality of documents) for document vectors approximate to the query vectors. Finally, the corresponding documents may be obtained according to the found document vectors and returned to the user as a query result.

[00024] More particularly, a technique of ANN (approximate nearest neighbor) search may be adopted in the vector approximate matching search as described above. The documents and query content are both converted into a form of semantic vectors and thus documents to be recalled may be determined according to the similarity between the query vector and document vector. Such method can eliminate the limitation of the searching method based on symbol matching and thus may understand the user’s intention much better.

[00025] As shown in Fig. 1, which is an exemplary block diagram 100 of searching engine system of embodiments of the present disclosure, the block diagram 100 may include: a user 101, a server 103 including a searching engine 102, one or more databases 104 for storing webpage documents. The user 101 and the server 103 may be connected via internet 105. In the present disclosure, the user 101 may refer to a person, a client in a form of software, a client in a form of hardware (e.g., desktop computer, laptop computer, mobile phone, tablet, and other similar smart terminals), APP, or other application server.

[00026] On one hand, the searching engine 102 may continuously identify and obtain content from a huge amount of data every day, so as to generate webpage documents and store them in the database 104. More particularly, the content of the webpage documents may include: title, linkage, anchor, clicking data, and so on. On the other hand, the searching engine 102 may receive a query request from a user 101, perform search in the database 104 according to the query content contained in the query request, obtain webpage documents matched with the query content, and then return them to the user 101. The query request may be generated based on the text content input by a user in the searching box of a webpage, or may come from an APP, which may generate a query request based on a query need of a user. When a user inputs query content, the query content may be input in the form of text, or obtained by receiving an input in the form of voice and then converting it into text. In the present disclosure, no matter in what form the query content is input, the query content may finally be in the form of natural language so as to be further processed by using the technique described in the present disclosure.

[00027] Embodiments of the present disclosure may make improvements with respect to the searching part for webpage documents in searching engine. The improvements may mainly involve the following aspects.

1) webpage documents query processing based on vector approximate matching search

The embodiments of the present disclosure introduce a technique of vector approximate matching search, which may perform a matching search after converting both the query content and the webpage documents into the form of semantic vectors, so as to eliminate the limitation of symbol matching search and understand the intention of the user much better.

2) partitioning processing on webpage document data

Due to the large amount of webpage documents, the embodiments of the present disclosure may perform partitioning processing on the webpage document data consisting of a plurality of webpage documents, and then perform the processing of converting to semantic vectors and establishing indexes on each webpage document data block. During the searching processing, searching is also performed on each webpage document data block in parallel, and then the webpage documents found may be combined so as to generate output results.

3) Establishing of vector indexes and applying in query processing

In order to further improve the efficiency of vector approximate matching search, vector indexes may be established. The vector indexes may be mainly used for quickly positioning the region where matched webpage documents are probably in.

4) combined using of vector approximate matching search and inverted index search

In order to further optimize the searching results, the embodiments of the present disclosure may adopt both the method of vector approximate matching search and the method of inverted index search, and fully utilize the webpage documents obtained by both searching method to generate the final searching results.

[00028] Detailed description would be made on these aspects of improvements in the following.

webpage documents query processing based on vector approximate matching search

[00029] In the embodiments of the present disclosure, the searching engine 102 may convert the obtained webpage documents into webpage document vectors in advance and store them in the database 104 as the data basis for future document searching. Therefore, when the searching engine 102 receives a query request from a user, the query content may be extracted, and the query processing procedure on webpage documents shown in the schematic diagram 200 of Fig. 2 may then be performed. The query processing may include:

[00030] S201, generating query vectors according to the query content. The query content may be in the form of natural language. The query vectors may be generated by performing feature extraction on the semantics of the query content. During the generation of query vectors, semantic features may be extracted based on the context of the query content, so that the user’s intention may be better understood. Furthermore, the query vectors and the webpage document vectors are semantic vectors generated in the same semantic space, which facilitates the subsequent vector approximate matching between the query vectors and the webpage document vectors.

[00031] S202, performing vector approximate matching search, and obtaining webpage document vectors matched with the query vectors. Because both the query content and the webpage documents are in the form of semantic vectors, the webpage document vectors most similar to the query vector may be found as the query results by calculating the similarity between the webpage document vectors and the query vectors. The algorithm of the vector approximate matching search as cited herein is based on the similarity between semantic vectors. In the same semantic space, the distance between any two semantic vectors may show the similarity between the two semantic vectors. There are many algorithms for calculating the distance between semantic vectors, e.g., the algorithm of cosine similarity, which may calculate the cosine value of the angle between two semantic vectors; the larger the cosine value is (i.e., the smaller the angle is), the higher the similarity between the two semantic vectors is. More particularly, the vector approximate matching search may be implemented by using ANN (approximate nearest neighbor) search. Typical examples of ANN algorithms may include: KD-tree algorithm, KNN (K-nearest neighbor) graph algorithm, LSH (Locality Sensitive Hash) algorithm, or the like.
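As a minimal sketch of the similarity calculation in S202, the following computes cosine similarity and ranks document vectors in descending order of cosine value, since a cosine closer to 1 corresponds to a smaller angle between the two vectors. The search shown is an exhaustive baseline, not an ANN algorithm; the function names and the two-dimensional sample vectors are assumptions for illustration.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 for a zero vector)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def top_k(query_vector, document_vectors, k):
    """Exhaustive search: score every vector and return the k best (id, score) pairs."""
    scored = [
        (vec_id, cosine_similarity(query_vector, vec))
        for vec_id, vec in document_vectors.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Hypothetical document vectors in a shared two-dimensional semantic space.
document_vectors = {"v1": [1.0, 0.0], "v2": [0.0, 1.0], "v3": [1.0, 1.0]}
results = top_k([1.0, 0.1], document_vectors, k=2)
```

An ANN method such as a KD-tree, a KNN graph, or LSH would replace the exhaustive loop in `top_k` with a search that inspects only part of the vector base, trading a small loss in accuracy for speed.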

[00032] S203, obtaining corresponding webpage documents according to the webpage document vectors. The mapping relationship between webpage document vectors and webpage documents may be stored in a database, and the webpage document corresponding to a webpage document vector may be found according to the mapping relationship.

[00033] The above steps S201 to S203 describe the processing procedure of searching for matched webpage documents based on query content. As described above, before these steps, it is necessary to convert the webpage documents into webpage document vectors. Therefore, in the above processing procedure, before the step of S202, the method may further include:

[00034] S204, generating one or more webpage document vectors according to the content of the documents. More particularly, the content of a document may include: title, linkage, anchor, and clicking data. Webpage document vectors may be generated based on any of the above contents of the document or a combination thereof. One document may correspond to a plurality of generated webpage document vectors. When there are a plurality of webpage document vectors corresponding to one document, in the above step S202, when the query vector is matched with any of these webpage document vectors, the document corresponding to the matched webpage document vector may be regarded as being matched with the query content and may be returned as a query result. Furthermore, because the searching engine 102 may obtain content on the internet all day long and generate webpage documents, the procedure of converting webpage documents into webpage document vectors is constantly performed. When there is a new webpage document, the searching engine 102 may convert the new webpage document into webpage document vectors and add them to the database.
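The mapping step of S203, combined with the case in S204 where one document has several vectors, can be sketched as follows. The vector identifiers of the form "doc:field" and the function name are hypothetical; the disclosure only states that a mapping relationship is stored in a database.

```python
def documents_for_vectors(matched_vector_ids, vector_to_document):
    """Map matched vector ids to document ids, keeping first-seen order without duplicates."""
    seen = set()
    documents = []
    for vec_id in matched_vector_ids:
        doc_id = vector_to_document[vec_id]
        if doc_id not in seen:
            seen.add(doc_id)
            documents.append(doc_id)
    return documents

# Hypothetical mapping: doc1 has one vector for its title and one for its anchor text.
vector_to_document = {
    "doc1:title": "doc1",
    "doc1:anchor": "doc1",
    "doc2:title": "doc2",
}
matched = ["doc1:anchor", "doc2:title", "doc1:title"]
recalled = documents_for_vectors(matched, vector_to_document)
```

A document is recalled as soon as any one of its vectors matches, and a second match on the same document does not produce a duplicate result.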

[00035] As shown in Fig. 3, which is a block diagram 300 of a webpage document query device of embodiments of the present disclosure, the above query processing on webpage documents may be implemented with the webpage document query device as shown in Fig. 3. The webpage document query device may be provided in the above searching engine 102, and include: a query vector generating module 301 configured to generate query vectors according to query content; a webpage document vector acquiring module 302 configured to conduct vector approximate matching search to acquire webpage document vectors matched with the query vector; and a document acquiring module 303 configured to acquire corresponding documents according to the webpage document vectors.

[00036] Furthermore, the webpage document query device may further include a webpage document vector generating module 304 configured to generate one or more webpage document vectors according to the document content of documents.

[00037] In the above webpage document query processing, the webpage documents and the query content may be converted into the form of semantic vectors, and the vector approximate matching search may be conducted by using the semantic vectors, so that the search may be performed based on the similarity of semantic vectors, and similar webpage document vectors in the vector space may be obtained without the limitation of searching by symbol matching. Furthermore, the feature elements used in the search based on semantic vectors are not only the query words themselves (a single word or the words in a sentence), but also a wider variety of feature elements, so as to better understand the query intention of the user and improve the recalling ratio.

Partitioning processing on webpage document data

[00038] In the above, description has been made on the query processing based on vector approximate matching search, which is the basic processing of embodiments of the present disclosure. In embodiments, the searching engine 102 may have to deal with a large amount of webpage documents. In the present disclosure, a group of a plurality of webpage documents may be referred to as webpage document data. The webpage document data consisting of a large amount of webpage documents is extremely large. Therefore, both storing it and establishing indexes for it require a lot of effort. It is time-consuming to conduct a matching search based on query content over webpage document data that is already so large and is continuously growing. Therefore, embodiments of the present disclosure propose a systematic framework for performing partitioning processing and individual-index-establishing processing on the webpage document data. Based on such a system framework, query processing may be performed on the same query content in each webpage document data block in parallel, and the webpage documents obtained in each webpage document data block may be combined so as to generate a final query result.

[00039] As shown in Fig. 4, which is a schematic diagram 400 showing another query processing procedure of webpage documents of embodiments of the present disclosure, the query processing procedure based on the above system framework for partitioning the webpage document data may include the following steps.

[00040] S401, generating query vectors according to query content.

[00041] S402, performing vector approximate matching search in a plurality of webpage document vectors bases according to the query vectors to obtain the webpage document vectors matched with the query vectors, and obtaining, according to the webpage document vectors, the corresponding webpage documents in the webpage document data block corresponding to each webpage document vectors base. The vector approximate matching search may particularly adopt the approximate nearest neighbor (ANN) search described above.

[00042] S403, combining the webpage documents respectively obtained from each webpage document data block to generate the final query result. The webpage document data blocks may be independent from each other, and thus the webpage documents found in different webpage document data blocks may not repeat each other, and the query results in some webpage document data blocks may be null. The webpage documents found as intermediate query results in each webpage document data block may be combined directly and output as a final query result. More preferably, screening or mixed ranking may be performed on the webpage documents obtained in each webpage document data block during the process of combining, and one or more webpage documents most similar to the query content may be selected as the final query result.
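The combining step of S403 with mixed ranking can be sketched as a top-k selection over the intermediate results of all blocks. The representation of a hit as a (document id, similarity) pair and the sample scores are assumptions for illustration.

```python
import heapq

def combine_block_results(block_results, k):
    """Merge per-block hit lists (some may be empty) and keep the k most similar hits."""
    all_hits = [hit for hits in block_results for hit in hits]
    return heapq.nlargest(k, all_hits, key=lambda hit: hit[1])

# Hypothetical intermediate results from three data blocks; the second block found nothing.
block_results = [
    [("d1", 0.91), ("d4", 0.40)],
    [],
    [("d7", 0.88)],
]
final_result = combine_block_results(block_results, k=2)
```

Because the data blocks are independent, no de-duplication across blocks is needed in this sketch, and an empty block simply contributes nothing to the merged list.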

[00043] As described above, as preparing work for the query processing, it is necessary to perform partitioning on the huge amount of webpage document data and converting each webpage document into webpage document vectors. Therefore, in the above processing, before the step of S401, the method may further include:

[00044] S404, performing partitioning on the webpage document data to generate a plurality of webpage document data blocks. In practice, since the searching engine 102 may continuously obtain webpage information and form webpage documents, the webpage document data may be accumulated to be of a certain size and subjected to the partitioning processing.

[00045] S405, processing a plurality of documents in each webpage document data block to generate a plurality of webpage document vectors bases corresponding to each webpage document data block, and each webpage document vector base may include a plurality of webpage document vectors respectively corresponding to a plurality of documents in the webpage document data block.
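The preparation steps S404 and S405 can be sketched as follows. Fixed-size blocks are one possible partitioning rule and are an assumption here, as is `toy_embed`, a stand-in that counts a few vocabulary terms in place of a real semantic model.

```python
def partition(documents, block_size):
    """S404 (sketch): split the document list into consecutive fixed-size blocks."""
    return [documents[i:i + block_size] for i in range(0, len(documents), block_size)]

# Hypothetical vocabulary for the toy embedding; a real system would use a learned model.
VOCABULARY = ["vector", "search", "index"]

def toy_embed(text):
    """Placeholder embedding: one dimension per vocabulary term, valued by term count."""
    words = text.lower().split()
    return [float(words.count(term)) for term in VOCABULARY]

def build_vector_bases(blocks):
    """S405 (sketch): build one vectors base (doc id -> vector) per data block."""
    return [{doc_id: toy_embed(title) for doc_id, title in block} for block in blocks]

documents = [
    ("d1", "vector search"),
    ("d2", "inverted index"),
    ("d3", "vector index search"),
]
blocks = partition(documents, block_size=2)
vector_bases = build_vector_bases(blocks)
```

Each vectors base covers exactly the documents of its own block, so a later search over one base never needs data from another block.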

[00046] As shown in Fig. 5, which is a block diagram 500 of another webpage document query processing device of embodiments of the present disclosure, the above query processing on webpage documents may be implemented by the webpage document query processing device as shown in Fig. 5, and the webpage document query processing device may be provided in the above searching engine 102 and may include a query vector generating module 501, a vector approximate matching search module 502, and a query result generating module 503.

[00047] The query vector generating module 501 may be configured to generate query vectors according to the query content.

[00048] The vector approximate matching search module 502 may be configured to perform vector approximate matching search in a plurality of webpage document vectors bases according to the query vectors, to obtain the webpage document vectors matched with the query vector, and obtain the webpage documents corresponding to the webpage documents vectors in the webpage document data blocks corresponding to the webpage document vectors bases according to the document vectors.

[00049] The query result generating module 503 may be configured to combine the webpage documents obtained in each webpage document data block respectively and generate a final query result.

[00050] Furthermore, the webpage document query processing device may further include a partitioning module 504 and a document vectors base generating module 505.

[00051] The partitioning module 504 may perform partitioning on the webpage document data to generate a plurality of webpage document data blocks.

[00052] The document vectors base generating module 505 may be configured to perform processing on the plurality of webpage documents in each webpage document data block to generate a plurality of webpage document vectors bases corresponding to each webpage document data block. Each webpage document vectors base may include a plurality of webpage document vectors corresponding to a plurality of webpage documents respectively in the webpage document data blocks, and each document may be corresponding to one or more document vectors.

[00053] The embodiments of the present disclosure may narrow the range of vector approximate matching search to a reasonable range by performing partitioning on webpage document data, so that the vector approximate matching search may be conducted more quickly.

Establishing of vector indexes and applying in query processing

[00054] In order to perform the processing of vector approximate matching search more quickly, the embodiments of the present disclosure may, in addition to partitioning the webpage document data, further establish vector indexes for the webpage document vectors base of each webpage document data block. The vector indexes may be mainly used to partition the webpage document vectors in the webpage document vectors base into regions, so that a query vector may be quickly positioned to a region that is likely to contain webpage document vectors matched therewith. In the embodiments of the present disclosure, vector indexes are established only after the webpage document data has been subjected to partitioning processing. Therefore, the size of each vector index may be relatively small, so that the speed of the vector matching search may be further improved.

[00055] In addition to establishing vector indexes, as shown in Fig. 6, which is another schematic diagram 600 showing still another query processing procedure of webpage documents of embodiments of the present disclosure, in the above step of S402, the processing of performing vector approximate matching search in a plurality of webpage document vectors bases according to the query vectors to obtain the webpage document vectors matched with the query vectors may further include:

[00056] S601, determining a region in each document vectors base where the vector approximate matching search is going to be performed, according to the query vectors and the vector indexes corresponding to each document vectors base.

[00057] S602, performing vector approximate matching search in the determined region according to the query vectors, to obtain document vectors matched with the query vectors.
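Steps S601 and S602 can be sketched with a coarse index in which each region is represented by a centroid vector. This centroid-based layout is an assumption chosen for illustration; the disclosure does not specify how the vector index divides a vectors base into regions.

```python
import math

def cosine(a, b):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def nearest_region(vector, centroids):
    """Index of the centroid most similar to the given vector."""
    return max(range(len(centroids)), key=lambda i: cosine(vector, centroids[i]))

def build_index(vector_base, centroids):
    """Assign every vector id in the base to the region of its most similar centroid."""
    regions = {i: [] for i in range(len(centroids))}
    for vec_id, vec in vector_base.items():
        regions[nearest_region(vec, centroids)].append(vec_id)
    return regions

def search_with_index(query, vector_base, centroids, regions, k):
    """S601: pick the region for the query. S602: search only inside that region."""
    region = nearest_region(query, centroids)
    candidates = regions[region]
    ranked = sorted(candidates, key=lambda v: cosine(query, vector_base[v]), reverse=True)
    return region, ranked[:k]

# Hypothetical vectors base with two obvious clusters and one centroid per cluster.
vector_base = {"a": [1.0, 0.1], "b": [0.9, 0.2], "c": [0.1, 1.0], "d": [0.2, 0.8]}
centroids = [[1.0, 0.0], [0.0, 1.0]]
regions = build_index(vector_base, centroids)
region, hits = search_with_index([0.3, 1.0], vector_base, centroids, regions, k=1)
```

In this sketch only two of the four vectors are compared with the query, which is the saving the vector index is meant to provide; vectors near a region boundary may be missed, which is why the search is approximate.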

[00058] As shown in Fig. 7, which is a block diagram 700 showing systematic framework of query processing on webpage documents of embodiments of the present disclosure, the block diagram 700 may include a query worker 701, a plurality of search workers 702, an aggregator 703, and databases 704 corresponding to each search worker 702 respectively.

[00059] After the webpage document data is subjected to the partitioning processing and the vector indexes are established with respect to the webpage document vectors bases, the query worker 701 may convert the query content into the semantic vectors and then send copies of the semantic vectors to each search worker 702. Each search worker 702 may perform search on webpage document vectors with respect to each webpage document data block in parallel, and then output the found webpage documents to the aggregator 703. The aggregator 703 may perform ranking on the webpage documents provided by each search worker 702, and select one or more webpage documents most approximate to the query content and provide these webpage documents to the user as a final query result.
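The division of work among the query worker, the search workers, and the aggregator can be sketched with a thread pool. The use of threads, the dot-product scoring on vectors assumed to be pre-normalized, and all function names are assumptions for illustration; a production system would typically run the search workers on separate machines.

```python
from concurrent.futures import ThreadPoolExecutor

def dot(a, b):
    """Dot product; equals cosine similarity when both vectors have unit length."""
    return sum(x * y for x, y in zip(a, b))

def search_worker(query_vector, vector_base, k):
    """Search one data block's vectors base and return its k best (doc id, score) hits."""
    scored = [(doc_id, dot(query_vector, vec)) for doc_id, vec in vector_base.items()]
    scored.sort(key=lambda hit: hit[1], reverse=True)
    return scored[:k]

def aggregate(per_worker_hits, k):
    """Rank the hits from all search workers together and keep the k best."""
    merged = [hit for hits in per_worker_hits for hit in hits]
    merged.sort(key=lambda hit: hit[1], reverse=True)
    return merged[:k]

def run_query(query_vector, vector_bases, k):
    """Send a copy of the query vector to each search worker, then aggregate the results."""
    with ThreadPoolExecutor(max_workers=max(1, len(vector_bases))) as pool:
        futures = [
            pool.submit(search_worker, list(query_vector), base, k)
            for base in vector_bases
        ]
        per_worker_hits = [future.result() for future in futures]
    return aggregate(per_worker_hits, k)

# Hypothetical unit-length vectors split across two data blocks.
vector_bases = [
    {"d1": [1.0, 0.0], "d2": [0.0, 1.0]},
    {"d3": [0.6, 0.8]},
]
query_result = run_query([0.8, 0.6], vector_bases, k=2)
```

Each worker returns at most k hits, so the aggregator handles at most k hits per block regardless of how large each block is.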

[00060] Each search worker 702 may correspond to one database 704, which is configured to store the webpage document data block and the webpage document vectors base corresponding to the search worker 702. The vector indexes of the webpage document vectors base may be recorded in the search worker 702.

[00061] The embodiments of the present disclosure may quickly narrow the range of vector approximate matching search to a specific region of webpage document vectors base by establishing vector indexes, so as to decrease the work for calculating similarity between vectors and improve the efficiency of vector approximate matching search.

Combined using of vector approximate matching search and inverted index search

[00062] In order to better optimize the query result, the embodiments of the present disclosure may combine the inverted index search and the vector approximate matching search so as to fully utilize the advantages of these two kinds of searching methods to further improve the accuracy of the query result.

[00063] In the embodiments of the present disclosure, the inverted indexes are established after the webpage document data is subjected to the partitioning processing, as is the case for the vector indexes. The inverted indexes are indexes established with respect to the webpage documents in each webpage document database, while the vector indexes are indexes established with respect to the webpage document vectors in each webpage document vectors base.

[00064] As shown in Fig. 8, which is a schematic diagram 800 showing another query processing procedure of webpage documents of embodiments of the present disclosure, and as shown in Fig. 9, which is a schematic diagram 900 showing another query processing procedure of webpage documents of embodiments of the present disclosure, in the embodiments of the present disclosure, the inverted index search is performed in parallel with the vector approximate matching search. After the webpage document data is subjected to the partitioning processing, analyzing processing may be performed on the query content (801). With respect to each webpage document data block, the inverted index search and the vector approximate matching search may be performed in parallel so that the webpage documents found through inverted index search and the webpage documents found through vector approximate matching search may be obtained respectively. More particularly, as shown in Fig. 8 and Fig. 9, with respect to the query content, keywords may be extracted (802) and query vectors may be generated (803). And then distributed inverted index search (804) and distributed ANN vector search (805) may be performed respectively. At last, the webpage documents obtained from each webpage document data block may be combined and ranking processing may be performed on the obtained webpage documents during the combining process so as to determine the search result, which could be output to the user finally. The two following ways may be adopted for the ranking processing.

[00065] The first way may be as shown in Fig. 8. The webpage documents found through the inverted index search may be subjected to a ranking processing 806 and the webpage documents found through the vector approximate matching search may be subjected to a ranking processing 807. The webpage documents output through the ranking processing 806 and the ranking processing 807 may be further subjected to a ranking processing 808. The webpage documents output through the ranking processing 808 may be subjected to a combining processing 809 so as to generate a final query result. Then the query result may be output (810).

[00066] The second way may be as shown in Fig. 9. The webpage documents found through the inverted index search and the webpage documents found through the vector approximate matching search may be subjected to a mixed ranking 901. Then, the webpage documents output through the mixed ranking 901 may be subjected to a combining processing 902 so as to generate a final query result. Then the query result may be output (810).
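
The difference between the two ranking ways may be illustrated by the following minimal Python sketch. It is not the implementation of the disclosure: the record fields "keyword_score" and "vector_score", the equal-weight function mixed_score, and the cut-off k are all hypothetical and are introduced only for illustration.

```python
from typing import Any, Dict, Iterable, List

Hit = Dict[str, Any]  # hypothetical record: {"id", "keyword_score", "vector_score"}


def mixed_score(hit: Hit) -> float:
    # Assumed scoring rule: an equal-weight blend of the two signals.
    return 0.5 * hit.get("keyword_score", 0.0) + 0.5 * hit.get("vector_score", 0.0)


def top_k(hits: Iterable[Hit], key, k: int) -> List[Hit]:
    # Keep the k best hits under the given scoring function.
    return sorted(hits, key=key, reverse=True)[:k]


def rank_then_rerank(inverted_hits: List[Hit], vector_hits: List[Hit], k: int) -> List[Hit]:
    # First way (Fig. 8): rank each result list on its own, then rerank the survivors.
    by_keyword = top_k(inverted_hits, lambda h: h["keyword_score"], k)  # ranking 806
    by_vector = top_k(vector_hits, lambda h: h["vector_score"], k)      # ranking 807
    merged = {h["id"]: h for h in by_keyword + by_vector}               # drop duplicates
    return top_k(merged.values(), mixed_score, k)                       # ranking 808


def mixed_rank(inverted_hits: List[Hit], vector_hits: List[Hit], k: int) -> List[Hit]:
    # Second way (Fig. 9): pool both result lists and rank them once.
    merged = {h["id"]: h for h in inverted_hits + vector_hits}
    return top_k(merged.values(), mixed_score, k)                       # mixed ranking 901
```

Under these assumptions the two ways may return different documents: a document that is second-best in both lists can be dropped by the per-list rankings of the first way, yet come first in the mixed ranking of the second way.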

[00067] As shown in Fig. 10, which is a block diagram 1000 showing another systematic framework of query processing on webpage documents of embodiments of the present disclosure, the above query processing on the webpage documents may be implemented by the processing framework as shown in Fig. 10. In the block diagram 1000, the query worker 1001 may perform a processing of converting a query content into semantic vectors and extracting keywords based on the query content. Then the query worker 1001 may make copies of keywords extracted from the query content and the query vectors obtained through converting. Then the query worker 1001 may distribute the copies to each search worker. More particularly, the search worker may be classified into two kinds. An example of one kind is a search worker 1002 configured to perform the vector approximate matching search, and an example of the other kind is a search worker 1003 configured to perform the inverted index search. A ranking worker 1004 may be configured to perform ranking on the webpage documents obtained through the vector approximate matching search. A ranking worker 1005 may be configured to perform ranking on the webpage documents obtained through the inverted index search. A ranking worker 1006 may be configured to perform reranking on the webpage documents output by the ranking worker 1004 and the ranking worker 1005. At last, an aggregator 1007 may combine the webpage documents output by the ranking worker 1006 so as to generate a query result, which could be finally provided to the user.

[00068] As shown in Fig. 11, which is a block diagram 1100 showing still another systematic framework of query processing on webpage documents of embodiments of the present disclosure, the above query processing on the webpage documents may be implemented by the processing framework as shown in Fig. 11. In the block diagram 1100, the query worker 1101 may convert a query content into semantic vectors and extract keywords from the query content, and may then make copies of the extracted keywords and of the query vectors obtained through the conversion. Then the query worker 1101 may distribute the copies to each search worker 1102. Each search worker 1102 may perform the inverted index search in addition to the vector approximate matching search, and may output the webpage documents obtained through both searches to a mixed ranking worker 1103. The mixed ranking worker 1103 may perform mixed ranking on the webpage documents obtained through the vector approximate matching search and the webpage documents obtained through the inverted index search. Then, an aggregator 1104 may combine the webpage documents output by the mixed ranking worker 1103 so as to generate a query result, which may finally be provided to the user.
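
The fan-out from the query worker to the search workers and the final aggregation may be sketched in Python as follows. This is only an illustration under stated assumptions: the function names search_block and answer_query, the document fields "text" and "vector", the brute-force dot product standing in for the vector approximate matching search, and the fixed bonus for keyword matches are all hypothetical.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List, Sequence, Tuple

Doc = Dict[str, object]  # hypothetical record: {"id", "text", "vector"}


def search_block(block: List[Doc], query_vector: Sequence[float],
                 keywords: Sequence[str], k: int) -> List[Tuple[float, str]]:
    # One search worker: keyword match plus vector match on a single data block.
    scored = []
    for doc in block:
        keyword_hit = any(word in doc["text"].split() for word in keywords)
        # Brute-force dot product; a real worker would use an ANN index instead.
        similarity = sum(q * d for q, d in zip(query_vector, doc["vector"]))
        if keyword_hit or similarity > 0:
            # Assumed mixed score: similarity plus a fixed bonus for keyword matches.
            scored.append((similarity + (1.0 if keyword_hit else 0.0), doc["id"]))
    return heapq.nlargest(k, scored)


def answer_query(blocks: List[List[Doc]], query_vector: Sequence[float],
                 keywords: Sequence[str], k: int = 3) -> List[str]:
    # Query worker: hand a copy of the query to every block, then aggregate.
    with ThreadPoolExecutor() as pool:
        per_block = pool.map(
            lambda block: search_block(block, query_vector, keywords, k), blocks)
        merged = [hit for hits in per_block for hit in hits]
    return [doc_id for _, doc_id in heapq.nlargest(k, merged)]
```

Each block is searched independently, so the per-block work may run in parallel and only the k best candidates of each block need to reach the aggregator.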

[00069] The ranking processing on the webpage documents may adopt learning-to-rank models, such as LambdaRank or LambdaMART, to improve the processing.

[00070] By combining these two kinds of search method, the advantages of both the inverted index search and the vector approximate matching search may be fully utilized, so that a more accurate query result, which better reflects the user’s intention, may be obtained.

Embodiments of the applying scenarios

[00071] The processing procedure and overall framework of the document query technique based on the vector nearest neighbor search of embodiments of the present disclosure have been described in detail above. A specific example is provided below to explain the technical solution of embodiments of the present disclosure.

[00072] Fig. 12 is a schematic diagram 1200 of vector nearest neighbor search based on a CDSSM (Convolutional Deep Structured Semantic Model) of embodiments of the present disclosure. In the embodiments of the present disclosure, the original query content “coffee and tea south melbourne” 1201 may be used as an example, and it is supposed that there are currently three webpage documents. More particularly, the URL (Uniform Resource Locator) of a webpage document 1202 is www truelocal com au find coffee vic melbourne city south melbourne. The title of the second webpage document 1203 is “coffee tea suppliers in south Melbourne Melbourne city vic”. The click log of the webpage document 1204 may be “coffee beans supplier south melbourne”. The click log as cited herein refers to a query content that led to a click on the webpage linkage corresponding to this webpage document. That is to say, when a user inputs some query content, the searching engine sends back some webpage document, and the user clicks the webpage linkage of this webpage document to access the corresponding webpage, the searching engine may record this query content as a click log of this webpage document.

[00073] In Fig. 12, a CDSSM is used for vectorizing the query content and the webpage documents and for similarity matching. As shown in the figure, the original query content and the webpage documents are both converted to semantic vectors through word embedding and a Deep Neural Network. In the model as shown in the figure, the word embedding may be performed based on a tri-letter mode (1208), and then semantic vectors with a dimension of 100 may be generated by using the Convolutional Deep Structured Semantic Model, where the label “d” refers to the dimension of the generated vector.
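
The tri-letter mode may be illustrated by the following short Python sketch, which breaks each word into overlapping three-letter units before any neural network is applied. The use of "#" as a word-boundary marker and the function names letter_trigrams and trigram_counts are assumptions made for illustration and are not taken from the figures.

```python
from collections import Counter
from typing import List


def letter_trigrams(word: str) -> List[str]:
    # Pad the word with boundary markers, then slide a three-letter window over it.
    padded = f"#{word.lower()}#"
    return [padded[i:i + 3] for i in range(len(padded) - 2)]


def trigram_counts(text: str) -> Counter:
    # Bag of letter trigrams for a whole query or document field.
    counts = Counter()
    for word in text.split():
        counts.update(letter_trigrams(word))
    return counts
```

For example, letter_trigrams("coffee") yields "#co", "cof", "off", "ffe", "fee" and "ee#". Because misspelled or unseen words still share most of their trigrams with known words, such a representation keeps the input vocabulary small.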

[00074] As shown in the figure, after the query vector 1205 and the webpage document vectors 1206 are generated, the webpage documents with the highest similarity may be selected as the query result by performing a cosine similarity calculation 1207 between the query vector and each webpage document vector.
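
The cosine similarity calculation may be sketched in Python as follows. The three-dimensional toy vectors in the usage note, the document identifiers, and the function names are hypothetical; the example in Fig. 12 uses 100-dimensional vectors produced by the model.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    # cos(u, v) = (u . v) / (|u| * |v|); defined as 0.0 for a zero vector.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def most_similar(query_vector: Sequence[float],
                 document_vectors: Dict[str, Sequence[float]],
                 top_n: int = 1) -> List[Tuple[str, float]]:
    # Score every document vector against the query and keep the best top_n.
    scored = [(cosine_similarity(query_vector, vec), doc_id)
              for doc_id, vec in document_vectors.items()]
    scored.sort(reverse=True)
    return [(doc_id, score) for score, doc_id in scored[:top_n]]
```

With a query vector [1, 0, 0] and document vectors doc_a = [2, 0, 0], doc_b = [0, 1, 0] and doc_c = [1, 1, 0], doc_a is ranked first and doc_c second. Only the direction of the vectors matters, not their length.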

Embodiments

[00075] In some examples, one or more components or modules and one or more steps as shown in Fig. 1 to Fig. 12 may be implemented by software, hardware, or a combination of software and hardware. For example, the above components or modules and one or more steps may be implemented in a system on chip (SoC). The SoC may include an integrated circuit chip, which includes one or more of a processing unit (such as a central processing unit (CPU), a microcontroller, a micro processing unit, a digital signal processor (DSP), or the like), a memory, one or more communication interfaces, and/or other circuits for performing its functions, and optional embedded firmware.

[00076] Fig. 13 is a block diagram of electronic device 1300 of embodiments of the present disclosure. The electronic device 1300 may include: a memory 1301 and a processor 1302.

[00077] The memory 1301 may be configured to store programs. In addition to the above programs, the memory 1301 may be configured to store other data to support operations on the electronic device 1300. The examples of these data may include instructions of any applications or methods operated on the electronic device 1300, contact data, phone book data, messages, pictures, videos, and the like.

[00078] The memory 1301 may be implemented by any kind of volatile or nonvolatile storage device or their combinations, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, disk memory, or optical disk.

[00079] In some examples, the memory 1301 may be coupled to the processor 1302 and contain instructions stored thereon. The instructions may cause the electronic device 1300 to perform operations upon being executed by the processor 1302. The operations may include:

[00080] generating query vectors according to the query content;

[00081] performing vector approximate matching search to obtain document vectors matched with the query vectors;

[00082] obtaining corresponding documents according to document vectors.

[00083] More particularly, the step of performing vector approximate matching search to obtain document vectors matched with the query vectors may further include: obtaining document vectors matched with the query vectors based on the approximate nearest neighbor search.
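
The last of these operations, obtaining documents from the matched document vectors, may be sketched in Python as follows. The sketch assumes that one document may own several vectors (for example one each for its title, URL and click log) and that a table maps each vector back to its document. The names recall_documents and vector_owner are hypothetical, and an exhaustive dot-product scan stands in for the approximate nearest neighbor search.

```python
from typing import Dict, List, Sequence


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    return sum(a * b for a, b in zip(u, v))


def recall_documents(query_vector: Sequence[float],
                     document_vectors: Dict[str, Sequence[float]],
                     vector_owner: Dict[str, str],
                     top_n: int = 2) -> List[str]:
    # Step 1: find the best-matching document vectors (exhaustive scan here).
    ranked = sorted(document_vectors,
                    key=lambda vid: dot(query_vector, document_vectors[vid]),
                    reverse=True)
    # Step 2: map each matched vector to its document, keeping each document once.
    documents: List[str] = []
    for vector_id in ranked[:top_n]:
        owner = vector_owner[vector_id]
        if owner not in documents:
            documents.append(owner)
    return documents
```

Because several matched vectors may belong to the same document, the number of recalled documents may be smaller than the number of matched vectors.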

[00084] As another example of the electronic device, the above operations may include:

[00085] generating query vectors based on query content;

[00086] performing vector approximate matching search in a plurality of document vectors bases according to the query vectors to obtain the document vectors matched with the query vectors, and obtaining the documents corresponding to the document vectors in the document data block corresponding to the document vectors base according to the document vectors;

[00087] combining the documents obtained from each document data block respectively to generate the final query result.

[00088] More particularly, the step of performing vector approximate matching search in a plurality of document vectors bases according to the query vectors to obtain the document vectors matched with the query vector may further include:

[00089] determining a region in each document vectors base where the vector approximate matching search is going to be performed, according to the query vectors and the vector indexes corresponding to each document vectors base;

[00090] performing vector approximate matching search in the determined region according to the query vectors, to obtain document vectors matched with the query vectors.
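
One possible reading of these two steps is sketched below in Python. The operations above do not fix how a region is defined; this sketch assumes that the vector index of a document vectors base is a set of centroids, that each document vector belongs to the region of its nearest centroid, and that only the n_probe regions closest to the query are searched. The centroids, the function names and n_probe are hypothetical.

```python
from typing import Dict, List, Sequence


def squared_distance(u: Sequence[float], v: Sequence[float]) -> float:
    return sum((a - b) ** 2 for a, b in zip(u, v))


def build_regions(vectors: Dict[str, Sequence[float]],
                  centroids: List[Sequence[float]]) -> Dict[int, List[str]]:
    # Vector index: assign every document vector to the region of its nearest centroid.
    regions: Dict[int, List[str]] = {i: [] for i in range(len(centroids))}
    for vector_id, vector in vectors.items():
        nearest = min(range(len(centroids)),
                      key=lambda i: squared_distance(vector, centroids[i]))
        regions[nearest].append(vector_id)
    return regions


def search_in_regions(query: Sequence[float],
                      vectors: Dict[str, Sequence[float]],
                      centroids: List[Sequence[float]],
                      regions: Dict[int, List[str]],
                      n_probe: int = 1,
                      top_n: int = 1) -> List[str]:
    # Step 1: determine the region(s) to search from the query and the index.
    probed = sorted(range(len(centroids)),
                    key=lambda i: squared_distance(query, centroids[i]))[:n_probe]
    # Step 2: match the query only against vectors inside the determined region(s).
    candidates = [vid for i in probed for vid in regions[i]]
    return sorted(candidates,
                  key=lambda vid: squared_distance(query, vectors[vid]))[:top_n]
```

Restricting the comparison to the determined region is what makes the search approximate: a true nearest neighbor lying just outside the probed regions would be missed, in exchange for comparing the query with far fewer vectors.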

[00091] Furthermore, before the combining of documents obtained from each document data block, the operations may further include: performing inverted index search to obtain documents corresponding to the query content.

[00092] Correspondingly, the step of combining the documents obtained from each document data block respectively to generate the final query result may further include: performing mixed ranking on the documents obtained through the vector approximate matching search and the documents obtained through the inverted index search, and combining the documents according to the result of the mixed ranking to generate a final query result.

[00093] The above operations have been described in detail in the foregoing embodiments of the method and device, and that description also applies to the electronic device 1300. That is to say, the specific operations mentioned in the above embodiments may be stored in the memory 1301 as programs and be performed by the processor 1302.

[00094] Furthermore, as shown in Fig. 13, the electronic device 1300 may further include: a communication unit 1303, a power supply unit 1304, an audio unit 1305, a display unit 1306, a chipset 1307, and other units. Only some of the units are exemplarily shown in Fig. 13, and it is obvious to one skilled in the art that the electronic device 1300 is not limited to the units shown in Fig. 13.

[00095] The communication unit 1303 may be configured to facilitate wireless or wired communication between the electronic device 1300 and other apparatuses. The electronic device may be connected to a wireless network based on a communication standard, such as WiFi, 2G, 3G, or their combination. In an exemplary example, the communication unit 1303 may receive radio signal or radio related information from external radio management system via radio channel. In an exemplary example, the communication unit 1303 may further include near field communication (NFC) module for facilitating short-range communication. For example, the NFC module may be implemented with radio frequency identification (RFID) technology, Infrared data association (IrDA) technology, ultra wideband (UWB) technology, Bluetooth (BT) technology and other technologies.

[00096] The power supply unit 1304 may be configured to supply power to various units of the electronic device. The power supply unit 1304 may include a power supply management system, one or more power supplies, and other units related to power generation, management, and allocation.

[00097] The audio unit 1305 may be configured to output and/or input audio signals. For example, the audio unit 1305 may include a microphone (MIC). When the electronic device is in an operation mode, such as calling mode, recording mode, and voice recognition mode, the MIC may be configured to receive external audio signals. The received audio signals may be further stored in the memory 1301 or sent via the communication unit 1303. In some examples, the audio unit 1305 may further include a speaker configured to output audio signals.

[00098] The display unit 1306 may include a screen, which may include liquid crystal display (LCD) and touch panel (TP). If the screen includes a touch panel, the screen may be implemented as touch screen so as to receive input signal from users. The touch panel may include a plurality of touch sensors to sense touching, sliding, and gestures on the touch panel. The touch sensor may not only sense edges of touching or sliding actions, but also sense period and pressure related to the touching or sliding operations.

[00099] The above memory 1301, processor 1302, communication unit 1303, power supply unit 1304, audio unit 1305 and display unit 1306 may be connected with the chipset 1307. The chipset 1307 may provide interface between the processor 1302 and other units of the electronic device 1300. Furthermore, the chipset 1307 may provide interface for each unit of the electronic device 1300 to access the memory 1301 and communication interface for accessing among units.

[000100] Example Clauses

[000101] A. A method, including:

[000102] generating query vectors according to a query content;

[000103] performing vector approximate matching search to obtain document vectors matched with the query vectors; and

[000104] obtaining corresponding documents according to the document vectors.

[000105] B. The method according to paragraph A, wherein the performing vector approximate matching search to obtain document vectors matched with the query vectors further includes: obtaining document vectors matched with the query vectors through approximate nearest neighbor search.

[000106] C. The method according to paragraph A, wherein the query vector and the document vector are semantic vectors generated based on the same semantic space.

[000107] D. The method according to paragraph A, wherein the generating query vectors according to a query content further includes: generating the query vectors according to context of the query content.

[000108] E. The method according to paragraph A, wherein the method further includes:

[000109] generating one or more document vectors according to a document content of a document,

[000110] wherein the document content includes: any one of title, linkage, anchor, click data, and the combination thereof.

[000111] F. A method, including:

[000112] generating query vectors according to a query content;

[000113] performing vector approximate matching search in a plurality of document vectors bases according to the query vectors to obtain document vectors matched with the query vectors, and obtaining documents corresponding to the document vectors in document data blocks corresponding to the document vectors bases according to the document vectors; and

[000114] combining the documents obtained from each document data block to generate a final query result.

[000115] G. The method according to paragraph F, wherein the performing vector approximate matching search in a plurality of document vectors bases according to the query vectors to obtain document vectors matched with the query vectors further includes:

[000116] determining a region in each document vectors base where the vector approximate matching search is going to be performed, according to the query vectors and vector indexes corresponding to each document vectors base;

[000117] performing the vector approximate matching search according to the query vectors in the determined region, to obtain document vectors matched with the query vectors.

[000118] H. The method according to paragraph G, wherein before combining the documents obtained from each document data block, the method further includes:

[000119] performing inverted index search in a plurality of document data blocks according to the query content to obtain documents corresponding to the query content;

[000120] wherein the combining the documents obtained from each document data block to generate a final query result further includes:

[000121] performing mixed ranking on the documents obtained through the vector approximate matching search and the documents obtained through inverted index search with respect to each document data block, and combining the documents according to a ranking result to generate the final query result.

[000122] I. The method according to paragraph F, wherein the method further includes:

[000123] performing partitioning processing on document data to generate a plurality of document data blocks;

[000124] performing processing on a plurality of documents in each document data block to generate a plurality of document vectors bases corresponding to each document data block,

[000125] wherein each document vectors base includes a plurality of document vectors corresponding to the plurality of documents in the document data blocks respectively, and each of the documents corresponds to one or more document vectors.

[000126] J. The method according to paragraph I, wherein the method further includes:

[000127] establishing the vector indexes for the partitioning on each document vector in each document vectors base with respect to each document vectors base.

[000128] K. An electronic device, including:

[000129] a processing unit; and

[000130] a memory, coupled to the processing unit and containing instructions stored thereon, the instructions cause the electronic device to perform operations upon being executed by the processing unit, the operations may include:

[000131] generating query vectors according to a query content;

[000132] performing vector approximate matching search to obtain document vectors matched with the query vectors; and

[000133] obtaining corresponding documents according to the document vectors.

[000134] L. The electronic device according to paragraph K, wherein the performing vector approximate matching search to obtain document vectors matched with the query vectors further includes: obtaining document vectors matched with the query vectors through approximate nearest neighbor search.

[000135] M. An electronic device, including:

[000136] a processing unit; and

[000137] a memory, coupled to the processing unit and containing instructions stored thereon, the instructions cause the electronic device to perform operations upon being executed by the processing unit, the operations may include:

[000138] generating query vectors according to a query content;

[000139] performing vector approximate matching search in a plurality of document vectors bases according to the query vectors to obtain document vectors matched with the query vectors, and obtaining documents corresponding to the document vectors in document data blocks corresponding to the document vectors bases according to the document vectors;

[000140] combining the documents obtained from each document data block to generate a final query result.

[000141] N. The electronic device according to paragraph M, wherein the performing vector approximate matching search in a plurality of document vectors bases according to the query vectors to obtain document vectors matched with the query vectors further includes:

[000142] determining a region in each document vectors base where the vector approximate matching search is going to be performed, according to the query vectors and vector indexes corresponding to each document vectors base; and

[000143] performing the vector approximate matching search according to the query vectors in the determined region, to obtain document vectors matched with the query vectors.

[000144] O. The electronic device according to paragraph N, wherein before combining the documents obtained from each document data block, the operations further include:

[000145] performing inverted index search according to the query content in a plurality of document data blocks to obtain documents corresponding to the query content;

[000146] wherein the combining the documents obtained from each document data block to generate a final query result further includes :

[000147] performing mixed ranking on the documents obtained through the vector approximate matching search and the documents obtained through the inverted index search with respect to each document data block, and combining the documents according to a ranking result to generate the final query result.

[000148] P. An apparatus, including:

[000149] a query vector generating module, configured to generate query vectors according to a query content;

[000150] a document vector acquiring module, configured to perform vector approximate matching search to obtain document vectors matched with the query vectors; and

[000151] a document acquiring module, configured to obtain corresponding documents according to the document vectors.

[000152] Q. The apparatus according to paragraph P, wherein the performing vector approximate matching search to obtain document vectors matched with the query vectors further includes: obtaining document vectors matched with the query vectors through approximate nearest neighbor search.

[000153] R. An apparatus, including:

[000154] a query vector generating module, configured to generate query vectors according to a query content;

[000155] a vector approximate matching search module, configured to perform vector approximate matching search in a plurality of document vectors bases according to the query vectors to obtain document vectors matched with the query vectors, and obtain documents corresponding to the document vectors in document data blocks corresponding to the document vectors bases according to the document vectors; and

[000156] a query result generating module, configured to combine the documents obtained from each document data block to generate a final query result.

[000157] S. The apparatus according to paragraph R, wherein the performing vector approximate matching search in a plurality of document vectors bases according to the query vectors to obtain document vectors matched with the query vectors further includes:

[000158] determining a region in each document vectors base where the vector approximate matching search is going to be performed, according to the query vectors and vector indexes corresponding to each document vectors base;

[000159] performing the vector approximate matching search according to the query vectors in the determined region, to obtain document vectors matched with the query vectors.

[000160] T. The apparatus according to paragraph S, wherein the apparatus further includes a plurality of inverted index search modules, configured to perform inverted index search in a plurality of document data blocks according to the query content to obtain documents corresponding to the query content;

[000161] wherein in the query result generating module, the combining the documents obtained from each document data block to generate a final query result further includes :

[000162] performing mixed ranking on the documents obtained through the vector approximate matching search and the documents obtained through inverted index search with respect to each document data block, and combining the documents according to a ranking result to generate the final query result.

[000163] Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described. Rather, the specific features and acts are disclosed as illustrative forms of implementing the claims.

[000164] Conditional language such as, among others, "can," "could," "might" or "may," unless specifically stated otherwise, are otherwise understood within the context as used in general to present that certain examples include, while other examples do not include, certain features, elements and/or steps. Thus, such conditional language is not generally intended to imply that features, elements and/or steps are in any way required for one or more examples or that one or more examples necessarily include logic for deciding, with or without user input or prompting, whether these features, elements and/or steps are included or are to be performed in any particular example.

[000165] Conjunctive language such as the phrase "at least one of X, Y or Z," unless specifically stated otherwise, is to be understood to present that an item, term, etc. can be either X, Y, or Z, or a combination thereof.

[000166] Any routine descriptions, elements or blocks in the flow diagrams described herein and/or depicted in the attached figures should be understood as potentially representing modules, segments, or portions of code that include one or more executable instructions for implementing specific logical functions or elements in the routine. Alternate examples are included within the scope of the examples described herein in which elements or functions can be deleted, or executed out of order from that shown or discussed, including substantially synchronously or in reverse order, depending on the functionality involved as would be understood by those skilled in the art.

[000167] It should be emphasized that many variations and modifications can be made to the above-described examples, the elements of which are to be understood as being among other acceptable examples. All such modifications and variations are intended to be included herein within the scope of this disclosure and protected by the following claims.

[000168] It would be obvious to one skilled in the art that all or part of the steps for implementing the above embodiments may be accomplished by hardware related to programs or instructions. The above program may be stored in a computer readable storing medium. Such program may perform the steps of the above embodiments upon being executed. The above storing medium may include: ROM, RAM, magnetic disk, optical disk, or other medium capable of storing program codes.

[000169] It should be noted that the foregoing embodiments are merely used to illustrate the technical solution of the present disclosure, and not to limit the present disclosure. Although the present disclosure has been described in detail with reference to the foregoing embodiments, one skilled in the art would understand that the technical solutions recited in the foregoing embodiments may be modified or all or a part of the technical features may be replaced equally. These modifications and replacements are not intended to make corresponding technical solution depart from the scope of the technical solution of embodiments of the present disclosure.

Claims

1. An electronic device, comprising:

a processing unit; and

a memory, coupled to the processing unit and containing instructions stored thereon, the instructions cause the electronic device to perform operations upon being executed by the processing unit, the operations comprise:

generating query vectors according to a query content;

performing vector approximate matching search in a plurality of document vectors bases according to the query vectors to obtain document vectors matched with the query vectors, and obtaining documents corresponding to the document vectors in document data blocks corresponding to the document vectors bases, according to the document vectors; and

combining the documents obtained from each document data block to generate a final query result.

2. The electronic device according to claim 1, wherein

the performing vector approximate matching search in a plurality of document vectors bases according to the query vectors to obtain document vectors matched with the query vectors further comprises:

determining a region in each document vectors base where the vector approximate matching search is going to be performed, according to the query vectors and vector indexes corresponding to each document vectors base; and

performing the vector approximate matching search according to the query vectors in the determined region, to obtain document vectors matched with the query vectors.

3. The electronic device according to claim 2, wherein before combining the documents obtained from each document data block, the operations further comprise:

performing inverted index search according to the query content in a plurality of document data blocks to obtain documents corresponding to the query content,

wherein the combining the documents obtained from each document data block to generate a final query result further comprises :

performing mixed ranking on the documents obtained through the vector approximate matching search and the documents obtained through the inverted index search with respect to each document data block, and combining the documents according to a ranking result to generate the final query result.

4. A method, comprising:

generating query vectors according to a query content;

performing vector approximate matching search to obtain document vectors matched with the query vectors; and

obtaining corresponding documents according to the document vectors.

5. The method according to claim 4, wherein the performing vector approximate matching search to obtain document vectors matched with the query vectors further comprises:

obtaining document vectors matched with the query vectors through approximate nearest neighbor search.

6. The method according to claim 4, wherein the query vectors and the document vectors are semantic vectors generated based on the same semantic space.

7. The method according to claim 4, wherein the generating query vectors according to a query content further comprises:

generating the query vectors according to context of the query content.

8. The method according to claim 4, wherein the method further comprises:

generating one or more document vectors according to a document content of a document,

wherein the document content comprises: any of title, linkage, anchor, click data, and the combination thereof.

9. A method, comprising:

generating query vectors according to a query content;

performing vector approximate matching search in a plurality of document vectors bases according to the query vectors to obtain document vectors matched with the query vectors, and obtaining documents corresponding to the document vectors in document data blocks corresponding to the document vectors bases according to the document vectors; and

combining the documents obtained from each document data block to generate a final query result.

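10. The method according to claim 9, wherein the performing vector approximate matching search in a plurality of document vectors bases according to the query vectors to obtain document vectors matched with the query vectors further comprises:

determining a region in each document vectors base where the vector approximate matching search is going to be performed, according to the query vectors and vector indexes corresponding to each document vectors base; and

performing the vector approximate matching search according to the query vectors in the determined region, to obtain document vectors matched with the query vectors.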

11. The method according to claim 10, wherein before combining the documents obtained from each document data block, the method further comprises:

performing inverted index search in a plurality of document data blocks according to the query content to obtain documents corresponding to the query content;

performing mixed ranking on the documents obtained through the vector approximate matching search and the documents obtained through the inverted index search, and combining the documents according to a ranking result to generate the final query result.

12. The method according to claim 9, wherein the method further comprises:

performing partitioning processing on document data to generate a plurality of document data blocks;

performing processing on a plurality of documents in each of the document data blocks to generate a plurality of document vectors bases corresponding to each document data block,

wherein each document vectors base comprises a plurality of document vectors respectively corresponding to the plurality of documents in the document data blocks, and each of the documents corresponds to one or more document vectors.

13. The method according to claim 12, wherein the method further comprises:

establishing the vector indexes for the partitioning on each document vector in the document vectors bases with respect to each document vectors base.