WO2019047437A1 - Method for obtaining a multi-document intersection, and document server - Google Patents

Method for obtaining a multi-document intersection, and document server

Info

Publication number
WO2019047437A1
WO2019047437A1 PCT/CN2017/120062 CN2017120062W WO2019047437A1 WO 2019047437 A1 WO2019047437 A1 WO 2019047437A1 CN 2017120062 W CN2017120062 W CN 2017120062W WO 2019047437 A1 WO2019047437 A1 WO 2019047437A1
Authority
WO
WIPO (PCT)
Prior art keywords
document
intersection
document set
sets
query element
Prior art date
Application number
PCT/CN2017/120062
Other languages
English (en)
French (fr)
Inventor
毕成龙
潘文彬
Original Assignee
北京三快在线科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 北京三快在线科技有限公司
Priority to US16/622,293 (patent US11288329B2)
Priority to CA3069382A (patent CA3069382C)
Priority to JP2019568694A (patent JP6986577B2)
Publication of WO2019047437A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22 Indexing; Data structures therefor; Storage structures
    • G06F16/2228 Indexing structures
    • G06F16/2246 Trees, e.g. B+trees
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2457 Query processing with adaptation to user needs
    • G06F16/24578 Query processing with adaptation to user needs using ranking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/31 Indexing; Data structures therefor; Storage structures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/93 Document management systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution

Definitions

  • Embodiments of the present invention relate to the field of search engine technologies, and in particular to obtaining the intersection of multiple document sets.
  • A search engine may collect tens of millions to billions of web pages on the World Wide Web and index each word in those pages to build an index database. When a user searches for a keyword, all pages whose content contains the keyword are returned as search results.
  • Widely used open source search engines, such as Lucene, employ a linear approach to obtaining the intersection of multiple document sets: after sorting each document set, they traverse the other document sets starting from the first element of the first document set.
  • If the element is found in the current document set, the traversal continues with the next document set. If it is not found, the other document sets are traversed again with the next element in the current document set as the query element. This continues back and forth until an element (that is, a document) is found that exists in all document sets at the same time. The process repeats until the traversal of one of the document sets ends, at which point the intersection of the multiple document sets is complete.
  • The present invention provides a multi-document intersection acquisition method, an apparatus, and a readable storage medium, so as to obtain document intersections more efficiently even when the length difference between document sets is greater than a certain threshold.
  • A multi-document intersection acquisition method comprises: acquiring, for at least two document sets requiring intersection in a search, the document set length of each of the document sets; determining, according to the length difference of the at least two document sets, an intersection algorithm for obtaining the document intersection; and obtaining the document intersection of the at least two document sets using the determined intersection algorithm.
  • A document server comprises: a processor; and a non-transitory machine-readable storage medium storing machine-executable instructions that can be executed by the processor.
  • The machine-executable instructions cause the processor to: obtain the document set length of each of at least two document sets that require intersection in the search; determine, according to the length difference of the at least two document sets, an intersection algorithm for obtaining the document intersection; and obtain the document intersection of the at least two document sets using the determined intersection algorithm.
  • A non-transitory machine-readable storage medium stores machine-executable instructions executable by a processor.
  • When the machine-executable instructions in the non-transitory machine-readable storage medium are executed by a processor in a document server, the document server is capable of executing the multi-document intersection acquisition method described above.
  • Embodiments of the present invention provide a method, an apparatus, and a readable storage medium for acquiring multi-document intersections. For at least two document sets that require intersection in a search process, when the document set lengths of the at least two document sets meet a preset condition, the elements in the shortest document set are used as the query elements to traverse the remaining document sets in turn, which can effectively improve the efficiency of document intersection and shorten the search engine's response time for the user.
  • FIG. 1 is a schematic structural diagram of a search engine according to an embodiment of the present invention.
  • FIG. 2 is a flowchart of a method for acquiring multiple document intersections according to an embodiment of the present invention.
  • FIG. 3 is a flowchart of a method for acquiring multiple document intersections according to another embodiment of the present invention.
  • FIG. 3A is a structural diagram of a primary syntax tree according to an embodiment of the present invention.
  • FIG. 3B is a structural diagram of a final level syntax tree according to an embodiment of the present invention.
  • FIG. 4 is a schematic structural diagram of a multi-document intersection obtaining apparatus according to an embodiment of the present invention.
  • FIG. 5 is a schematic structural diagram of a multi-document intersection obtaining apparatus according to another embodiment of the present invention.
  • The multi-document intersection acquisition method provided by an embodiment of the present invention can be applied to search engine technology. Its purpose is to split the query content entered by a user on a search engine interface into word segments, match documents against each word segment to generate a document set for each word segment, obtain the document intersection by intersecting all the document sets, and return the document intersection to the user.
  • Search engine technology is an Internet communication technology.
  • The server side provides content and builds an index for that content.
  • The server may search for content according to the keyword in a search request, and then return the found content to the client for display.
  • A search engine may generally include a WEB (World Wide Web) server 110, an index server 120, and a document server 130.
  • The document server 130 can store information about documents.
  • The WEB server 110 receives the search term and sends it to the index server 120.
  • The index server 120 performs word segmentation on the search term, matches each word segment against the corresponding documents in the index database, and transmits the matching result to the document server 130.
  • The document server 130 may establish a document set for each word segment according to the matching result, obtain the document intersection by intersecting the document sets of all the word segments, and return the document intersection to the search engine browser 140 through the WEB server 110.
  • The search engine browser 140 can present the documents in the document intersection to the user.
  • Document: A general search engine deals with Internet pages, but the concept of a document is broader: it denotes a storage object that exists in text form. Compared with web pages, documents cover more forms. Files in formats such as Word, PDF, HTML, and XML can all be called documents, as can an email, a text message, or a microblog post. In the present invention, each document has a corresponding document identifier that identifies it.
  • Document set: A collection of several documents is called a document set. For example, a large number of Internet pages or a large number of emails are specific examples of document sets.
  • Syntax tree (parse tree): A syntax tree is a graphical representation of the structure of a statement. It represents the derivation of the statement and helps in understanding the levels of its grammatical structure. Simply put, a syntax tree is the tree formed when deriving according to a certain rule.
  • Leaf node: A leaf node is a lowest-level node of the syntax tree and has no lower-level nodes. In the present invention, a leaf node is a word segment of the search term.
  • Referring to FIG. 2, a flowchart of the steps of a multi-document intersection acquisition method is shown.
  • Step 210: Acquire the document set length of each document set for at least two document sets that require intersection.
  • The number of documents matching each word segment of the search term may differ. Because word segments differ in how frequently they are used, the difference between the lengths of the at least two document sets that need to be intersected in the search process is very likely to exceed a length threshold. Note that the length of a document set refers to the number of document elements the document set contains.
  • For the document sets that require intersection, one per word segment, the length of each document set is obtained, which gives the number of documents in each document set.
  • For example, word segment 1 "Hai Dian" and word segment 2 "hot pot" can be obtained, and the documents obtained by querying word segment 1 and word segment 2 are shown in Table 1.
  • Table 1 is obtained by arranging the documents found for each word segment in order of document identifier.
  • The number of document elements in the first document set, which corresponds to word segment 1 "Hai Dian", is 4, and the number of document elements in the second document set, which corresponds to word segment 2 "hot pot", is 40 (the entries between document 6 and document 20 and between document 20 and document 80 are omitted).
  • The length of the first document set, corresponding to word segment 1 "Hai Dian", is therefore 4, and the length of the second document set, corresponding to word segment 2 "hot pot", is 40.
  • Step 220: Compare the lengths of the at least two document sets to determine, according to their length difference, an intersection algorithm for obtaining the document intersection.
  • Step 230: Obtain the document intersection of the at least two document sets using the determined intersection algorithm.
  • If the length difference of the at least two document sets meets a preset condition, a query element in the minimum document set may be used as the traversal starting point to find whether the query element exists in the remaining document sets.
  • The minimum document set is the document set with the smallest document set length among the at least two document sets.
  • The preset condition may be that the difference between the lengths of the longest document set and the shortest document set among the document sets is greater than a first preset threshold.
  • The first preset threshold may be set according to actual conditions; for example, it may be a preferred value obtained from routine search tests on the search engine. Note that the preset condition may also be that, among the document sets, the ratio of the maximum document set length to the minimum document set length exceeds a second preset threshold.
  • For example, if the preset condition is that the difference between the lengths of the longest document set and the shortest document set is greater than 10, then the difference between the lengths of the first document set and the second document set is 36 (40 - 4), which meets the preset condition.
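  • The preset condition described above can be sketched as a small check. This is only an illustrative sketch: the function names `meets_preset_condition` and `choose_algorithm`, the strategy labels, and the ratio threshold of 5.0 are assumptions made for the example; only the threshold of 10 and the lengths 4 and 40 come from the text.

```python
from typing import Optional, Sequence


def meets_preset_condition(
    lengths: Sequence[int],
    diff_threshold: Optional[int] = None,
    ratio_threshold: Optional[float] = None,
) -> bool:
    """Return True when the document set lengths differ enough to favor
    probing from the shortest document set."""
    longest, shortest = max(lengths), min(lengths)
    # First form of the condition: longest minus shortest exceeds a threshold.
    if diff_threshold is not None and longest - shortest > diff_threshold:
        return True
    # Second form: ratio of longest to shortest exceeds a threshold.
    if ratio_threshold is not None and shortest > 0:
        return longest / shortest > ratio_threshold
    return False


def choose_algorithm(lengths: Sequence[int], diff_threshold: int = 10) -> str:
    """Name the strategy to use (labels are hypothetical)."""
    if meets_preset_condition(lengths, diff_threshold=diff_threshold):
        return "shortest-set-first"
    return "universal"


print(choose_algorithm([4, 40]))     # lengths of the "Hai Dian" and "hot pot" sets
print(choose_algorithm([2, 3, 4]))   # three sets of similar length
```

  • Keeping the difference test and the ratio test as separate optional parameters mirrors the text, which allows either form of the condition.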
  • The second document set is traversed with document 1 in the first document set as the query element. Document 1 is found in the second document set, so document 1 is inserted into the document intersection. Next, the second document set is traversed with document 2 in the first document set as the query element, and document 2 is found not to exist in the second document set.
  • Document 20 in the first document set is then selected as the query element and is found in the second document set. After document 20 is inserted into the document intersection, document 85 in the first document set is used as the query element to traverse the second document set. Document 85 is found not to exist in the second document set, and the traversal of the first document set is complete. The current intersection process can then end, and the final document intersection [document 1, document 20] can be returned to the user.
  • If the query element is instead allowed to switch between document sets, the approximate intersection process is as follows: the second document set is traversed with document 1 in the first document set as the query element, document 1 is found in the second document set, and document 1 is inserted into the document intersection. Next, the second document set is traversed with document 2 in the first document set as the query element, and document 2 is found not to exist in the second document set, so the first document set is traversed with document 3 in the second document set as the query element. Since document 3 does not exist in the first document set, document 20 in the first document set becomes the query element.
  • The web pages corresponding to document 1 and document 20 are the results of this query, and document 1 and document 20 can be presented to the user through the browser interface.
  • Similarly, the application pages corresponding to document 1 and document 20 are the results of this query, and links to the application pages corresponding to document 1 and document 20 can be presented to the user through the mobile phone interface.
  • The embodiment of the present invention provides a method for acquiring multi-document intersections. For at least two document sets that require intersection in the search process, when the document set lengths of the document sets meet the preset condition, the elements in the shortest document set are used as query elements to traverse the remaining document sets in turn, which can effectively improve the efficiency of document intersection and shorten the search engine's response time for the user.
  • Referring to FIG. 3, a flowchart of the specific steps of a multi-document intersection acquisition method is shown.
  • Step 310: Receive a search term.
  • The search engine may receive a search term entered by the user and construct a syntax tree for the search term.
  • Step 320: Construct a search syntax tree according to the received search term.
  • The leaf nodes of the syntax tree are word segments of the search term.
  • The search term entered by the user can be parsed and a syntax tree constructed. For example, suppose a rule is set that, when syntax analysis finds a space, the two words before and after the space are in an "and" relationship. If the user enters "Beijing Full-time Convenience Store", the two words "Beijing" and "Full-time Convenience Store" are saved, and the analysis result is built into the primary syntax tree shown in FIG. 3A.
  • The synchronization structure is constructed according to the structure of the primary syntax tree shown in FIG. 3A.
  • The system determines whether each node of the primary syntax tree is text. If it is text, the system segments it again. For example, "Beijing" is divided into "Beijing" and "City", and "Full-time Convenience Store" is divided into "Full Time", "Convenience", and "Store".
  • The primary syntax tree can then be rebuilt according to the word segmentation result: the AND nodes produced by word segmentation are added to the primary syntax tree to form the final syntax tree shown in FIG. 3B, so that the search engine can apply the intersection algorithm to the word segments in the final syntax tree.
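  • The two-stage construction of the syntax tree can be illustrated with a minimal sketch. Everything named here is hypothetical: the classes `Leaf` and `AndNode`, the functions `build_primary_tree` and `build_final_tree`, and the lookup table `SEGMENTS`, which stands in for a real word segmenter. The input is given as the two already-separated words from the example rather than being split on spaces.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Union


@dataclass
class Leaf:
    word: str  # a word (primary tree) or word segment (final tree)


@dataclass
class AndNode:
    children: List[Union["AndNode", Leaf]] = field(default_factory=list)


# Hypothetical segmentation results, standing in for a real word segmenter.
SEGMENTS: Dict[str, List[str]] = {
    "Beijing": ["Beijing", "City"],
    "Full-time Convenience Store": ["Full Time", "Convenience", "Store"],
}


def build_primary_tree(words: List[str]) -> AndNode:
    """Primary tree: one AND node whose children are the space-separated words."""
    return AndNode([Leaf(w) for w in words])


def build_final_tree(primary: AndNode) -> AndNode:
    """Final tree: each text node that segments further becomes its own AND node."""
    final = AndNode()
    for child in primary.children:
        if isinstance(child, Leaf):
            segments = SEGMENTS.get(child.word, [child.word])
            if len(segments) > 1:
                final.children.append(AndNode([Leaf(s) for s in segments]))
                continue
        final.children.append(child)
    return final


primary_tree = build_primary_tree(["Beijing", "Full-time Convenience Store"])
final_tree = build_final_tree(primary_tree)
for node in final_tree.children:
    print([leaf.word for leaf in node.children])
```

  • In this sketch the root of `final_tree` plays the role of the top-level AND node, and its two children play the role of the lower-level AND nodes over the word segments.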
  • Step 330: Acquire the document set length of each document set for at least two document sets that require intersection.
  • For the basic implementation of this step, refer to step 210 above; details are not repeated here.
  • Based on step 320, starting from the lowest-level intersection node in the search syntax tree that has not yet undergone intersection calculation, the at least two document sets requiring intersection are determined according to the child nodes of that intersection node, and the document set length of each of the at least two document sets is determined.
  • The intersection calculation can start from the lowest-level intersection nodes in the syntax tree constructed from the search term. After a lowest-level intersection node completes its intersection calculation, the intersection of the next-higher-level intersection node is calculated from the lower-level document intersections obtained, and so on until the top-level document intersection is obtained and returned to the user.
  • The intersection nodes 10 and 20 are the bottom-level nodes, and the intersection node 30 is the top-level node.
  • The document sets that the top-level node 30 needs to intersect are the document intersection of the document sets corresponding to "Beijing" and "City" and the document intersection of the document sets corresponding to "Full Time", "Convenience", and "Store".
  • Step 340: Compare the lengths of the at least two document sets to determine, according to their length difference, an intersection algorithm for obtaining the document intersection.
  • Step 350: If the length difference of the at least two document sets meets a preset condition, use a query element in the minimum document set as the traversal starting point to find whether the query element exists in the remaining document sets. If an element matching the query element of the current sorted sequence number is found in each of the remaining document sets, the query element is taken as an element of the document intersection.
  • The minimum document set is the document set with the smallest document set length among the at least two document sets.
  • The preset condition may include: among the document set lengths of the at least two document sets, the difference between the maximum document set length and the minimum document set length exceeds a first preset threshold; or, among the at least two document sets, the ratio of the maximum document set length to the minimum document set length exceeds a second preset threshold.
  • The first preset threshold and the second preset threshold may be set according to actual conditions. For example, a preferred value can be obtained from routine search tests on the search engine.
  • The query element of the current sorted sequence number in the minimum document set may be matched against the elements in each of the remaining document sets. If no element matching the query element of the current sorted sequence number is found in at least one of the remaining document sets, the query element of the next sorted sequence number in the minimum document set is matched against the elements of each of the remaining document sets. If an element matching the query element of the current sorted sequence number is found in all of the remaining document sets, the query element is taken as an element of the document intersection, and the query element of the next sorted sequence number in the minimum document set is matched against the elements in the remaining document sets. This loops until all elements in the minimum document set have been traversed.
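  • The loop described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names are hypothetical, the use of binary search (`bisect_left`) to look up each query element is an assumption (the text only says the remaining sets are searched), and the contents of the 40-element second set are invented apart from the facts that it contains documents 1 and 20 and does not contain documents 2 and 85.

```python
from bisect import bisect_left
from typing import List, Sequence


def contains(sorted_ids: Sequence[int], query: int) -> bool:
    """Binary-search lookup in an ascending list of document identifiers."""
    i = bisect_left(sorted_ids, query)
    return i < len(sorted_ids) and sorted_ids[i] == query


def intersect_from_shortest(doc_sets: Sequence[Sequence[int]]) -> List[int]:
    """Use each element of the shortest set, in order, as the query element
    and keep it only if every remaining set contains it."""
    ordered = sorted(doc_sets, key=len)
    shortest, remaining = ordered[0], ordered[1:]
    result: List[int] = []
    for query in shortest:
        if all(contains(other, query) for other in remaining):
            result.append(query)
    return result


first_set = [1, 2, 20, 85]                                   # 4 documents
# Hypothetical 40-document set: ids 100..134 are filler values.
second_set = sorted({1, 3, 6, 20, 80} | set(range(100, 135)))

print(len(second_set))
print(intersect_from_shortest([second_set, first_set]))
```

  • With this approach the number of lookups is bounded by the length of the shortest set, which is why it pays off when one set is much shorter than the others.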
  • At this point, the second document set is traversed with document 1 in the first document set as the query element.
  • The elements in a document set are arranged in ascending or descending order of identifier.
  • The identifier of an element in the document set is the number of the corresponding document in Table 2, and the identifier identifies the document element. Table 2 is obtained by arranging the elements in the document set in ascending order of identifier. If the elements in the document set are arranged in descending order of identifier, Table 3 is obtained.
  • In that case, the first element in the first document set, document 85, can be used as the query element for the traversal; the embodiment of the present invention does not limit this.
  • When the second document set is traversed with document 1 in the first document set as the query element and document 1 is found to exist in the second document set, document 1 is inserted into the document intersection.
  • The second document set is then traversed with document 2 in the first document set as the query element, and document 2 is found not to exist in the second document set.
  • Document 20 in the first document set is then selected as the query element; it is found in the second document set and is inserted into the document intersection.
  • Finally, the second document set is traversed with the last document in the first document set, document 85, as the query element. Document 85 is found not to exist in the second document set, and the traversal stops.
  • The document intersection [document 1, document 20] obtained by the intersection node 10 can be returned to the top-level intersection node 30 shown in FIG. 3B.
  • Step 360: If the length difference of the at least two document sets does not meet the preset condition, use the universal intersection algorithm to obtain the document intersection.
  • A query element in a focused document set among the at least two document sets may be used as the traversal starting point to find whether the query element exists in each remaining document set. When the query element is present in each of the remaining document sets, the query element is taken as an element of the document intersection.
  • Here, either the previous query element came from the focused document set and was determined to be an element of the document intersection, or the focused document set is the first document set determined not to contain the previous query element.
  • The sorted sequence number of the query element is the next sorted sequence number after that of the previous query element.
  • The query element with the smallest sorted sequence number in the first document set of the at least two document sets may be used as the initial traversal starting point to find whether the query element exists in each remaining document set. If no element matching the query element of the current sorted sequence number is found in the remaining document set currently being searched, the element of the next sorted sequence number in that remaining document set is used as the new query element. If an element matching the query element of the current sorted sequence number is found in all remaining document sets, the query element is taken as an element of the document intersection, and the element of the next sorted sequence number in the document set that supplied the query element is selected as the new query element. This loops until all elements of one of the at least two document sets have been traversed.
  • For example, assume that the number of document elements in the third document set, corresponding to the word segment "Full Time", is 2; the number of document elements in the fourth document set, corresponding to the word segment "Convenience", is 3; the number of document elements in the fifth document set, corresponding to the word segment "Store", is 4; and the first preset threshold is 10. The intersection of the third document set, the fourth document set, and the fifth document set may then be calculated by the universal intersection algorithm.
  • The calculation is as follows: the fourth document set is traversed with document 1 in the third document set as the query element, and document 1 is found not to exist in the fourth document set, so the query element is replaced by document 2 in the fourth document set. Document 2 is found not to exist in the fifth document set, so the query element is replaced with document 20 in the fifth document set. Document 20 is found in both the third document set and the fourth document set, so document 20 is inserted into the document intersection of the intersection node 20.
  • Document 40 in the fifth document set is then used as the query element; document 40 is found not to exist in the third document set, and the traversal of the third document set is complete.
  • The query stops, and the document intersection [document 20] corresponding to the intersection node 20 is passed up to the upper-level intersection node 30.
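  • A query element that switches between document sets can be sketched with one cursor per set. This is one possible reading of the universal algorithm, written as a standard zig-zag intersection over ascending lists of unique identifiers; the function name `intersect_universal` is hypothetical, and the order in which individual documents are probed may differ from the walkthrough above. The set contents are invented except for the lengths 2, 3, and 4 and the documents 1, 2, 20, and 40 named in the text.

```python
from typing import List, Sequence


def intersect_universal(doc_sets: Sequence[Sequence[int]]) -> List[int]:
    """Zig-zag intersection of ascending lists of unique document identifiers."""
    if not doc_sets or any(len(ids) == 0 for ids in doc_sets):
        return []
    cursors = [0] * len(doc_sets)   # one read position per document set
    result: List[int] = []
    query = doc_sets[0][0]          # initial query element
    while True:
        agreed = True
        for k, ids in enumerate(doc_sets):
            # Skip everything in this set that is smaller than the query.
            while cursors[k] < len(ids) and ids[cursors[k]] < query:
                cursors[k] += 1
            if cursors[k] == len(ids):
                return result       # one set is fully traversed: stop
            if ids[cursors[k]] != query:
                # Query is missing here; this set's next element takes over.
                query = ids[cursors[k]]
                agreed = False
                break
        if agreed:
            result.append(query)
            cursors[0] += 1         # continue after the matched element
            if cursors[0] == len(doc_sets[0]):
                return result
            query = doc_sets[0][cursors[0]]


third_set = [1, 20]             # "Full Time", 2 documents
fourth_set = [2, 20, 30]        # "Convenience", 3 documents (30 is filler)
fifth_set = [20, 40, 50, 60]    # "Store", 4 documents (50 and 60 are filler)
print(intersect_universal([third_set, fourth_set, fifth_set]))
```

  • Because every failed match advances the query to a strictly larger identifier, each cursor only moves forward and the loop ends as soon as any one set is exhausted.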
  • In this case, the universal intersection algorithm can achieve higher query efficiency.
  • The document sets corresponding to the intersection node 30 are the document intersection [document 1, document 20] corresponding to the intersection node 10 and the document intersection [document 20] corresponding to the intersection node 20. The intersection node 30 can then be intersected to obtain the final document intersection [document 20], which is returned to the user.
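  • The bottom-up evaluation of the syntax tree can be sketched as a recursive walk in which each AND node intersects the results of its children. The nested-list tree encoding, the names `POSTINGS`, `intersect`, and `evaluate`, and all document identifiers are assumptions chosen so that the lower-level results match the ones stated above; the built-in set intersection is only a stand-in for whichever intersection algorithm is selected at each node.

```python
from typing import Dict, List, Sequence, Union

# A leaf is a word segment (str); an AND node is a list of child nodes.
Tree = Union[str, List["Tree"]]

# Hypothetical document sets per word segment (ascending identifiers).
POSTINGS: Dict[str, List[int]] = {
    "Beijing": [1, 2, 20, 85],
    "City": [1, 3, 6, 20, 80],
    "Full Time": [1, 20],
    "Convenience": [2, 20, 30],
    "Store": [20, 40, 50, 60],
}


def intersect(doc_sets: Sequence[Sequence[int]]) -> List[int]:
    """Stand-in intersection; a real system would pick an algorithm per node."""
    common = set(doc_sets[0]).intersection(*doc_sets[1:])
    return sorted(common)


def evaluate(node: Tree) -> List[int]:
    """Children are evaluated before their parent, so the lowest-level
    intersection nodes are computed first."""
    if isinstance(node, str):
        return POSTINGS.get(node, [])
    return intersect([evaluate(child) for child in node])


final_tree: Tree = [["Beijing", "City"], ["Full Time", "Convenience", "Store"]]
print(evaluate(final_tree[0]))   # lower-level node over "Beijing" and "City"
print(evaluate(final_tree[1]))   # lower-level node over the other three segments
print(evaluate(final_tree))      # top-level node
```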
  • In system tests, the TP90, TP99, and TP999 response-latency indicators all improved, by more than 10%.
  • TP90 time is the minimum time needed to satisfy 90% of requests.
  • TP99 time is the minimum time needed to satisfy 99% of requests.
  • TP999 time is the minimum time needed to satisfy 99.9% of requests.
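  • As an illustration of how such indicators can be computed from measured latencies, the sketch below uses the nearest-rank method. The function name `tp`, the choice to express the percentile in per-mille (900, 990, 999) so that the rank is computed with integer arithmetic, and the sample latencies are all assumptions for the example; the source does not say how its measurements were taken.

```python
from typing import List, Sequence


def tp(latencies_ms: Sequence[float], per_mille: int) -> float:
    """Smallest observed latency within which at least per_mille/1000
    of the requests completed (nearest-rank method)."""
    if not latencies_ms:
        raise ValueError("no latency samples")
    if not 0 < per_mille <= 1000:
        raise ValueError("per_mille must be in 1..1000")
    ordered: List[float] = sorted(latencies_ms)
    # Ceiling of per_mille * n / 1000, using integers to avoid rounding error.
    rank = (per_mille * len(ordered) + 999) // 1000
    return ordered[rank - 1]


samples = list(range(1, 1001))   # hypothetical latencies of 1 ms to 1000 ms
print(tp(samples, 900), tp(samples, 990), tp(samples, 999))
```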
  • In the method for acquiring multi-document intersections, when the document set lengths of the at least two document sets that need to be intersected in the search process meet the preset condition, the elements of the shortest document set are used as query elements to traverse the remaining document sets in turn, which can effectively improve the efficiency of document intersection and shorten the search engine's response time for the user.
  • Referring to FIG. 4, a block diagram of a multi-document intersection acquisition apparatus is shown.
  • The multi-document intersection acquisition apparatus may include: an obtaining module 401, configured to acquire the document set length of each document set for at least two document sets that require intersection in a search process; a length comparison module 402, configured to compare the lengths of the at least two document sets to determine an intersection algorithm for obtaining the document intersection; and an intersection module 403, configured to obtain the document intersection of the at least two document sets according to the determined intersection algorithm.
  • The embodiment of the present invention provides a multi-document intersection acquisition apparatus that, when the document set lengths of at least two document sets requiring intersection meet a preset condition, uses the elements of the shortest document set as query elements to traverse the remaining document sets in turn, which can effectively improve the efficiency of document intersection and shorten the search engine's response time for the user.
  • Referring to FIG. 5, a specific structural diagram of a multi-document intersection acquisition apparatus is shown.
  • the multi-document intersection obtaining device may include:
  • the receiving module 501 is configured to receive a search term.
  • the syntax tree construction module 502 is configured to construct a search syntax tree according to the search term, where a leaf node of the syntax tree is a word segmentation of the search term.
  • the obtaining module 503 is configured to obtain a document set length of each document set for at least two document sets that require intersection in the search process;
  • the length comparison module 504 is configured to compare lengths of the at least two document sets to determine an intersection algorithm for obtaining an intersection of documents.
  • the first intersection module 505 is configured to: if the length difference of the at least two document sets meets a preset condition, use a query element in the minimum document set as a traversal starting point, and find whether the query is in each remaining document set. An element; and, when the query element is present in each of the remaining document sets, the query element is used as an element of the document intersection.
  • the minimum document set is a document set having a minimum document set length in the at least two document sets.
  • a second intersection module 506, configured to obtain the document intersection by using a general intersection algorithm if the length difference between the at least two document sets does not meet the preset condition. For example, a query element in a focused document set among the at least two document sets may be used as the traversal starting point to check whether the query element exists in each remaining document set; when every remaining document set contains the query element, the query element is taken as an element of the document intersection. Here, either the previous query element came from the focused document set and was determined to be an element of the document intersection, or the focused document set was the first document set determined not to contain the previous query element; and the sort sequence number of the query element is the one following that of the previous query element.
  • The acquisition module 503 may be specifically configured to: starting from the lowest-level first intersection node among the intersection nodes of the search syntax tree whose intersection has not yet been computed, determine, according to the child nodes of the first intersection node, the at least two document sets whose intersection is required; and acquire the document set length of each document set.
  • The first intersection module 505 may be specifically configured to: match the query element with the current sort sequence number in the minimum document set against the elements of the remaining document sets; if no element matching that query element is found in at least one of the remaining document sets, match the query element with the next sort sequence number in the minimum document set against the elements of the remaining document sets; if an element matching that query element is found in all of the remaining document sets, take the query element as an element of the document intersection, and then match the query element with the next sort sequence number in the minimum document set against the elements of the remaining document sets. This loop continues until all elements in the minimum document set have been traversed.
  • The second intersection module 506 may be specifically configured to: use the query element with the smallest sort sequence number in the first of the at least two document sets as the initial traversal starting point, and check whether the query element exists in each remaining document set. If no element matching the current query element is found in the remaining document set currently being searched, the element with the next sort sequence number in that remaining document set is used as the new query element. If an element matching the current query element is found in all remaining document sets, the query element is taken as an element of the document intersection, and the element with the next sort sequence number in the document set from which the query element was selected is used as the new query element. This loop continues until all elements of one of the at least two document sets have been traversed.
  • In summary, this embodiment of the present invention provides a multi-document intersection acquisition apparatus. When the document set lengths of at least two document sets whose intersection is required meet a preset condition, the elements of the shortest document set are used as query elements to traverse the remaining document sets in turn. This can effectively improve the efficiency of document intersection when the document set lengths differ greatly, and shortens the search engine's response time to the user.
  • An embodiment of the present invention further provides a document server, including: a processor; and a non-transitory computer-readable storage medium storing machine-executable instructions that can be executed by the processor.
  • The machine-executable instructions cause the processor to perform the steps of the multi-document intersection acquisition method of the foregoing embodiments, for example: for at least two document sets whose intersection is required in a search, acquiring the document set length of each document set; determining, according to the difference in length between the at least two document sets, an intersection algorithm for obtaining the document intersection; and obtaining the document intersection of the at least two document sets by using the determined intersection algorithm.
  • An embodiment of the present invention further provides a non-transitory machine-readable storage medium. When the instructions in the storage medium are executed by a processor of a document server, the document server is enabled to perform the multi-document intersection acquisition method of the foregoing embodiments.
  • Because the apparatus embodiments are basically similar to the method embodiments, they are described relatively briefly; for the relevant details, refer to the description of the method embodiments.
  • Those skilled in the art will understand that the modules in the devices of the embodiments can be adaptively changed and placed in one or more devices different from those of the embodiment.
  • The modules, units, or components of the embodiments may be combined into one module, unit, or component, and may furthermore be divided into a plurality of sub-modules, sub-units, or sub-components.
  • Except where at least some of such features and/or processes or units are mutually exclusive, all features disclosed in this specification (including the accompanying claims, abstract, and drawings) and all processes or units of any method or device so disclosed may be combined in any combination.
  • Unless expressly stated otherwise, each feature disclosed in this specification (including the accompanying claims, abstract, and drawings) may be replaced by an alternative feature serving the same, an equivalent, or a similar purpose.
  • The various component embodiments of the present invention may be implemented in hardware, in software modules running on one or more processors, or in a combination thereof.
  • Those skilled in the art will appreciate that, in practice, a microprocessor or a digital signal processor (DSP) may be used to implement some or all of the functions of some or all of the components of the payment information processing device according to embodiments of the present invention.
  • The invention may also be implemented as a device or apparatus program for performing some or all of the methods described herein.
  • Such a program implementing the invention may be stored on a computer-readable medium or may take the form of one or more signals. Such signals may be downloaded from an Internet website, provided on a carrier signal, or provided in any other form.

Abstract

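A multi-document intersection acquisition method, apparatus, and readable storage medium. For at least two document sets whose intersection is required during a search, the document set length of each document set is acquired (210), and the lengths of the at least two document sets are compared. An intersection algorithm for obtaining the document intersection is determined according to the difference in length between the at least two document sets (220). When the document set lengths of the at least two document sets meet a preset condition, the elements of the shortest document set can be used as query elements to traverse the remaining document sets in turn.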

Description

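Multi-Document Intersection Acquisition Method and Document Server
Cross-Reference to Related Applications
This patent application claims priority to Chinese patent application No. 201710797899.8, filed on September 6, 2017 and entitled "Multi-document intersection acquisition method, apparatus, device, and readable storage medium", the entire contents of which are incorporated herein by reference.
Technical Field
Embodiments of the present invention relate to the technical field of search engines, and in particular to the acquisition of a multi-document intersection.
Background
A search engine may collect tens of millions to billions of web pages from the World Wide Web and index every word in those pages to build an index database. When a user searches for a keyword, all web pages whose content contains that keyword are returned as search results.
Widely used open-source search engines such as Lucene use a linear method for obtaining the intersection of multiple document sets: after each document set is sorted, the other document sets are traversed starting from the first element of the first document set. If the element is found in the document set currently being traversed, traversal continues with the next document set. If it is not found, the next element in the current document set is taken as the query element and the other document sets are traversed again. This repeats until an element, that is, a document, that exists in all document sets is found. The process is repeated in this way until one of the document sets has been fully traversed, at which point the intersection of the multiple document sets is complete.
When each document set is traversed, if the element currently being compared does not meet the requirement, that is, it is not the query element, the next element of that document set has to be compared. However, when the difference in length between document sets exceeds a certain threshold, such lookups are often unnecessary: for a multi-set intersection, as long as one document set does not contain a given element, that element cannot be in the intersection. The above way of obtaining the document intersection may therefore be inefficient.
Summary
The present invention provides a multi-document intersection acquisition method, an apparatus, and a readable storage medium, so that the document intersection can be obtained efficiently even when the difference in length between document sets exceeds a certain threshold.
According to a first aspect of the present invention, a multi-document intersection acquisition method is provided, including: for at least two document sets whose intersection is required in a search, acquiring the document set length of each document set; determining, according to the difference in length between the at least two document sets, an intersection algorithm for obtaining the document intersection; and obtaining the document intersection of the at least two document sets by using the determined intersection algorithm.
According to a second aspect of the present invention, a document server is provided, including: a processor; and a non-transitory computer-readable storage medium storing machine-executable instructions that can be executed by the processor. The machine-executable instructions cause the processor to: for at least two document sets whose intersection is required in a search, acquire the document set length of each document set; determine, according to the difference in length between the at least two document sets, an intersection algorithm for obtaining the document intersection; and obtain the document intersection of the at least two document sets by using the determined intersection algorithm.
According to a third aspect of the present invention, a non-transitory machine-readable storage medium is provided, storing machine-executable instructions that can be executed by a processor. When the machine-executable instructions in the non-transitory machine-readable storage medium are executed by a processor in a document server, the document server is able to perform the above multi-document intersection acquisition method.
Embodiments of the present invention provide a multi-document intersection acquisition method, apparatus, and readable storage medium. For at least two document sets whose intersection is required during a search, when the document set lengths of the at least two document sets meet a preset condition, the elements of the shortest document set are used as query elements to traverse the remaining document sets in turn. This can effectively improve the efficiency of document intersection and shorten the search engine's response time to the user.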
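Brief Description of the Drawings
To describe the technical solutions of the embodiments of the present invention more clearly, the drawings needed for the description of the embodiments are briefly introduced below. Evidently, the drawings described below are only some embodiments of the present invention, and a person of ordinary skill in the art may derive other drawings from them without creative effort.
FIG. 1 is a schematic structural diagram of a search engine according to an embodiment of the present invention.
FIG. 2 is a flowchart of a multi-document intersection acquisition method according to an embodiment of the present invention.
FIG. 3 is a flowchart of a multi-document intersection acquisition method according to another embodiment of the present invention.
FIG. 3A is a structural diagram of a preliminary syntax tree according to an embodiment of the present invention.
FIG. 3B is a structural diagram of a final syntax tree according to an embodiment of the present invention.
FIG. 4 is a schematic structural diagram of a multi-document intersection acquisition apparatus according to an embodiment of the present invention.
FIG. 5 is a schematic structural diagram of a multi-document intersection acquisition apparatus according to another embodiment of the present invention.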
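Detailed Description
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings in the embodiments. Evidently, the described embodiments are some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the protection scope of the present invention.
The multi-document intersection acquisition method provided by the embodiments of the present invention can be applied to search engine technology. Its purpose is to segment the query content that a user enters in the search engine interface into words, match each word segment against documents to generate a document set for that word segment, intersect all of the document sets to obtain the document intersection, and return the document intersection to the user.
Search engine technology is an Internet communication technology. On the Internet, the server side provides content and builds an index over that content. When a user sends a search request to the server through a client, the server can look up content in the index according to the keywords in the search request and then return the content found to the client for display.
Referring to FIG. 1, a search engine may typically include a web (World Wide Web) server 110, an index server 120, and a document server 130. The document server 130 may store document information.
After a user enters a search term through a search engine browser 140, the web server 110 receives the search term and sends it to the index server 120. The index server 120 then performs word segmentation on the search term, matches each word segment against the corresponding documents in the index database, and sends the matching results to the document server 130. The document server 130 can then build the document set for each word segment from the matching results, obtain the document intersection by intersecting the document sets of all word segments, and return the document intersection to the search engine browser 140 through the web server 110. The search engine browser 140 can then display the documents in the document intersection to the user.
Terms commonly used in the multi-document intersection acquisition method provided by the present invention are as follows:
Document: The objects processed by a typical search engine are Internet web pages, but the concept of a document is broader and denotes a stored object that exists in text form. Compared with web pages, documents cover more forms: files in formats such as Word, PDF, HTML, and XML can all be called documents, as can an email, a text message, or a microblog post. In the present invention, each document is assigned a document identifier used to identify it.
Document set (Document Collection): A collection made up of a number of documents is called a document set. A massive number of Internet web pages or a large number of emails are concrete examples of document sets.
Syntax tree (Parse Tree): A syntax tree is a graphical representation of the structure of a sentence. It represents the derivation of the sentence and helps in understanding the hierarchy of its syntactic structure. Put simply, a syntax tree is the tree formed when a derivation is carried out according to some rule.
Leaf node: A leaf node is a node at the lowest level of the syntax tree; it contains no lower-level nodes. In the present invention, the leaf nodes are the word segments of the search term.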
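Referring to FIG. 2, a flowchart of the steps of a multi-document intersection acquisition method is shown.
Step 210: for at least two document sets whose intersection is required, acquire the document set length of each document set.
In the index database, different word segments of a search term may match different numbers of documents, and word segments also differ in how frequently they are used. As a result, the difference between the lengths of the at least two document sets whose intersection is required during a search is very likely to exceed a length threshold. Note that the length of a document set is the number of document elements it contains.
In this embodiment of the present invention, for the document sets that are generated for the respective word segments and whose intersection is required, the length of each document set is acquired; specifically, the number of documents in each document set may be acquired.
For example, if the user enters the search term "海底捞火锅" (Haidilao hot pot) in the search engine browser 140, word segment 1 "海底捞" (Haidilao) and word segment 2 "火锅" (hot pot) are obtained. The documents retrieved for word segment 1 and word segment 2 are shown in Table 1 below: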
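海底捞 (Haidilao)    火锅 (hot pot)
Document 1           Document 1
Document 2           Document 3
Document 20          Document 5
Document 85          Document 6
                     ...
                     Document 20
                     ...
                     Document 80
Table 1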
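Sorting the documents retrieved for each word segment in ascending order of document identifier yields Table 1. The first document set, corresponding to word segment 1 "海底捞", contains 4 document elements, and the second document set, corresponding to word segment 2 "火锅", contains 40 document elements (entries between document 6 and document 20 and between document 20 and document 80 are omitted). The length of the first document set is therefore 4, and the length of the second document set is 40.
Step 220: compare the lengths of the at least two document sets, so as to determine, according to the difference in length between them, an intersection algorithm for obtaining the document intersection.
Step 230: obtain the document intersection of the at least two document sets by using the determined intersection algorithm.
In this embodiment of the present invention, if the document set lengths of the at least two document sets meet a preset condition, a query element in the minimum document set can be used as the traversal starting point to check whether the remaining document sets contain that query element. The minimum document set is the document set with the smallest document set length among the at least two document sets. The preset condition may be that the difference between the lengths of the longest and the shortest of the document sets is greater than a first preset threshold. The first preset threshold can be set according to the actual situation; for example, it may be a preferred value obtained from routine search experiments on the search engine. Note that the preset condition may also be that the ratio of the largest document set length to the smallest document set length exceeds a second preset threshold.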
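The preset condition described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the claimed implementation: the function name `meets_preset_condition`, its parameters, and the threshold values used in the example are introduced here and do not come from the source.

```python
def meets_preset_condition(lengths, diff_threshold=None, ratio_threshold=None):
    """Return True if the document set lengths differ enough to prefer
    traversal driven by the shortest document set.

    lengths         -- document set lengths (at least two integers)
    diff_threshold  -- assumed "first preset threshold" on (max - min)
    ratio_threshold -- assumed "second preset threshold" on (max / min)
    """
    longest, shortest = max(lengths), min(lengths)
    # Condition 1: absolute length difference exceeds the first threshold.
    if diff_threshold is not None and longest - shortest > diff_threshold:
        return True
    # Condition 2: length ratio exceeds the second threshold.
    if (ratio_threshold is not None and shortest > 0
            and longest / shortest > ratio_threshold):
        return True
    return False


# Lengths taken from Table 1 (4 and 40) with an example threshold of 10.
use_shortest_set = meets_preset_condition([4, 40], diff_threshold=10)
```

Either threshold can be supplied on its own or both together, mirroring the "any one or more" wording of the preset condition.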
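For the data in Table 1, if the preset condition is that the difference between the lengths of the longest and the shortest document sets is greater than 10, then the difference between the lengths of the first and second document sets is 36 (40 minus 4), which meets the preset condition. In this case, document 1 in the first document set is used as the query element to traverse the second document set. Document 1 is found in the second document set and is inserted into the document intersection. Next, document 2 in the first document set is used as the query element to traverse the second document set, and it is found that the second document set does not contain document 2. Document 20 in the first document set is then selected as the query element, and it is found in the second document set; after document 20 is inserted into the document intersection, document 85 in the first document set is used as the query element to traverse the second document set. Finally, it is found that the second document set does not contain document 85, and the first document set has been fully traversed. The intersection process can then be terminated, and the final document intersection [document 1, document 20] is returned to the user.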
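The shortest-set-driven traversal walked through above can be sketched as follows. This is a minimal Python sketch under assumptions: document sets are ascending lists of integer document identifiers, and the lookup in each remaining set uses a binary search with a forward-only cursor (`bisect_left`), which is a choice made here for brevity rather than something the source prescribes. The function name `intersect_smallest_first` is hypothetical, and the second list contains only the entries of Table 1 that are actually shown.

```python
from bisect import bisect_left


def intersect_smallest_first(doc_sets):
    """Intersect ascending ID lists, using the shortest list as the only
    source of query elements."""
    if not doc_sets:
        return []
    # Index of the minimum document set.
    k = min(range(len(doc_sets)), key=lambda i: len(doc_sets[i]))
    others = [s for i, s in enumerate(doc_sets) if i != k]
    cursors = [0] * len(others)  # search never moves backwards in a set
    result = []
    for doc_id in doc_sets[k]:
        in_all = True
        for i, other in enumerate(others):
            pos = bisect_left(other, doc_id, cursors[i])
            cursors[i] = pos
            if pos == len(other) or other[pos] != doc_id:
                in_all = False  # missing from one set: cannot be in the intersection
                break
        if in_all:
            result.append(doc_id)
    return result


first_set = [1, 2, 20, 85]           # word segment 1 in Table 1
second_set = [1, 3, 5, 6, 20, 80]    # listed entries of word segment 2 only
common = intersect_smallest_first([first_set, second_set])
```

Only the four elements of the shorter list are ever used as query elements, so elements such as document 3 that appear only in the longer list are never probed against the shorter one.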
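If the general intersection algorithm is used instead, the intersection process for Table 1 is roughly as follows. Document 1 in the first document set is used as the query element to traverse the second document set; it is found in the second document set and is inserted into the document intersection. Next, document 2 in the first document set is used as the query element to traverse the second document set, and it is found that the second document set does not contain document 2, so document 3 in the second document set is used as the query element to traverse the first document set. Because the first document set does not contain document 3, document 20 in the first document set becomes the query element. Document 20 is found in the second document set, so it is inserted into the document intersection and the query element is changed to document 85 in the first document set. Finally, it is found that the second document set does not contain document 85, and the first document set has been fully traversed. The intersection process can then be terminated, and the final document intersection [document 1, document 20] is returned to the user. Clearly, the general intersection algorithm wastes a traversal on document 3. Document 3 does not appear in the first document set, so it can never be recalled. Therefore, when the difference between the lengths of the at least two document sets exceeds the length threshold, intersecting with the general intersection algorithm is less efficient.
For example, if the user searches for "海底捞火锅" through a search engine in a computer browser, the web pages corresponding to document 1 and document 20 are the results of this query, and links to those web pages can be shown to the user in the browser interface.
Likewise, if the user searches for "海底捞火锅" through a search engine in a mobile phone application, the application pages corresponding to document 1 and document 20 are the results of this query, and links to those application pages can be shown to the user in the phone interface.
In summary, this embodiment of the present invention provides a multi-document intersection acquisition method. For at least two document sets whose intersection is required during a search, when the document set lengths of the document sets meet a preset condition, the elements of the shortest document set are used as query elements to traverse the remaining document sets in turn. This can effectively improve the efficiency of document intersection and shorten the search engine's response time to the user.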
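Referring to FIG. 3, a flowchart of the detailed steps of a multi-document intersection acquisition method is shown.
Step 310: receive a search term.
In this embodiment of the present invention, the search engine can receive the search term entered by the user and build a syntax tree for it.
Step 320: construct a search syntax tree according to the received search term, where the leaf nodes of the syntax tree are the word segments of the search term.
In this embodiment of the present invention, the search term entered by the user can be parsed syntactically and a syntax tree built. For example, suppose the following rule is set: when a space is found during parsing, the two words before and after the space are treated as being in an AND relationship. If the user enters "北京市 全时便利店" ("Beijing City" followed by "Quanshi convenience store"), the two words before and after the space, "北京市" and "全时便利店", are saved, and the parsing result is built into the preliminary syntax tree shown in FIG. 3A.
Then, when the nodes of the final syntax tree are constructed, construction proceeds in step with the structure of the preliminary syntax tree shown in FIG. 3A. During this process, the system determines whether a node of the preliminary syntax tree is text. If it is, the system segments it again: for example, "北京市" is split into "北京" (Beijing) and "市" (city), and "全时便利店" is split into "全时" (Quanshi), "便利" (convenience), and "店" (store). The preliminary syntax tree can be rebuilt from the segmentation results, with the AND nodes produced by segmentation added to it, forming the final syntax tree shown in FIG. 3B, on which the search engine runs the intersection algorithm over the word segments.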
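The two-stage construction described above (split on spaces, then segment each text node again) can be sketched as follows. This is a hedged Python illustration: the tuple-based tree representation, the function name `build_syntax_tree`, and the lookup-table segmenter `SEGMENTS` are all assumptions made for this example; a real system would call an actual word segmenter, which the source does not specify.

```python
def build_syntax_tree(query, segment):
    """Build a nested AND tree for a space-separated query.

    query   -- raw search string; spaces separate AND-ed phrases
    segment -- callable mapping a phrase to a non-empty list of word
               segments (a stand-in for a real word segmenter)
    """
    def subtree(phrase):
        words = segment(phrase)
        # A phrase that yields several segments becomes its own AND node.
        return words[0] if len(words) == 1 else ("AND", words)

    phrases = query.split()
    if not phrases:
        raise ValueError("empty query")
    children = [subtree(p) for p in phrases]
    return children[0] if len(children) == 1 else ("AND", children)


# Hypothetical segmenter: a lookup table holding the segmentation given in the text.
SEGMENTS = {"北京市": ["北京", "市"], "全时便利店": ["全时", "便利", "店"]}
tree = build_syntax_tree("北京市 全时便利店", SEGMENTS.__getitem__)
```

In this representation the outer `("AND", ...)` tuple plays the role of the top-level intersection node, and the two inner tuples play the role of the lower-level intersection nodes whose leaves are word segments.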
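Step 330: for at least two document sets whose intersection is required, acquire the document set length of each document set.
For the basic implementation of this step, refer to step 210 above; details are not repeated here.
In addition, for the search syntax tree built in step 320, starting from the lowest-level intersection node in the search syntax tree for which the intersection has not yet been computed, the at least two document sets whose intersection is required can be determined from the child nodes of that intersection node, and the document set length of each of them can be determined.
In this embodiment of the present invention, the intersection computation can start from the lowest-level intersection nodes of the syntax tree built from the search term. After the lowest-level intersection nodes have been computed, the next-lowest-level intersection nodes are computed from the resulting lower-level document intersections, and so on until the top-level document intersection is obtained and returned to the user.
For example, in the final syntax tree for "北京市 全时便利店" shown in FIG. 3B, intersection nodes 10 and 20 are bottom-level nodes and intersection node 30 is the top-level node. From bottom-level nodes 10 and 20, it can be determined that the document sets to be intersected for top-level node 30 are the document sets corresponding to "北京" and "市" and the document sets corresponding to "全时", "便利", and "店".
Next, the number of document elements in the document sets corresponding to "北京" and "市" can be obtained, as can the number of document elements in the document sets corresponding to "全时", "便利", and "店".
Step 340: compare the lengths of the at least two document sets, so as to determine, according to the difference in length between them, an intersection algorithm for obtaining the document intersection.
Step 350: if the length difference between the at least two document sets meets the preset condition, use a query element in the minimum document set as the traversal starting point and check whether the remaining document sets contain the query element. If an element matching the query element is found in every remaining document set, the query element is taken as an element of the document intersection. The minimum document set is the document set with the smallest document set length among the at least two document sets.
The preset condition may include: among the document set lengths of the at least two document sets, the difference between the largest and the smallest document set length exceeds a first preset threshold; or, among the at least two document sets, the ratio of the largest document set length to the smallest document set length exceeds a second preset threshold. Both the first and the second preset threshold can be set according to the actual situation; for example, each may be a preferred value obtained from routine search experiments on the search engine.
The query element with the current sort sequence number in the minimum document set can be matched against the elements of each remaining document set. If no element matching that query element is found in at least one of the remaining document sets, the query element with the next sort sequence number in the minimum document set is matched against the elements of each remaining document set. If an element matching that query element is found in all of the remaining document sets, the query element is taken as an element of the document intersection, and the query element with the next sort sequence number in the minimum document set is then matched against the elements of the remaining document sets. This loop continues until all elements in the minimum document set have been traversed.
For example, for intersection node 10 in the final syntax tree for "北京市 全时便利店" shown in FIG. 3B, assume that the first document set, for the word segment "北京", contains 4 document elements, that the second document set, for the word segment "市", contains 40 document elements, that the first preset threshold is 10, and that the document elements corresponding to "北京" and "市" are as shown in Table 2 below.
Because the difference of 36 between the lengths of the first and second document sets is greater than the first preset threshold of 10, document 1 in the first document set is used as the query element to traverse the second document set.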
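北京 (Beijing)       市 (city)
Document 1           Document 1
Document 2           Document 3
Document 20          Document 5
Document 85          Document 6
                     ...
                     Document 20
                     ...
                     Document 80
Table 2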
Optionally, the elements in a document set are already sorted in ascending or descending order of identifier.
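北京 (Beijing)       市 (city)
Document 85          Document 80
Document 20          ...
Document 2           Document 20
Document 1           ...
                     Document 6
                     Document 5
                     Document 3
                     Document 1
Table 3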
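In this embodiment of the present invention, the identifier of an element in a document set is the number following "Document" in Table 2, and is used to identify the document element. Table 2 is obtained by sorting the elements of the document sets in ascending order of identifier. If the elements are instead sorted in descending order of identifier, Table 3 is obtained; in that case, the first element of the first document set, document 85, can be used as the query element to start the traversal. The embodiments of the present invention impose no limitation on this.
When document 1 in the first document set is used as the query element to traverse the second document set and document 1 is found in the second document set, document 1 is inserted into the document intersection. Next, document 2 in the first document set is used as the query element to traverse the second document set, and it is found that the second document set does not contain document 2. Document 20 in the first document set is then selected as the query element; it is found in the second document set, so document 20 is inserted into the document intersection. Finally, the last element of the first document set, document 85, is used as the query element to traverse the second document set; the second document set does not contain document 85, so the traversal stops. The document intersection [document 1, document 20] obtained for intersection node 10 can then be returned to the top-level intersection node 30 shown in FIG. 3B.
Step 360: if the length difference between the at least two document sets does not meet the preset condition, obtain the document intersection by using a general intersection algorithm. For example, a query element in a focused document set among the at least two document sets may be used as the traversal starting point to check whether each remaining document set contains the query element. When every remaining document set contains the query element, the query element is taken as an element of the document intersection. Here, either the previous query element came from the focused document set and was determined to be an element of the document intersection, or the focused document set was the first document set determined not to contain the previous query element; and the sort sequence number of the query element is the one following that of the previous query element.
In this embodiment of the present invention, the query element with the smallest sort sequence number in the first of the at least two document sets can be used as the initial traversal starting point to check whether each remaining document set contains the query element. If no element matching the current query element is found in the remaining document set currently being searched, the element with the next sort sequence number in that remaining document set is used as the new query element. If an element matching the current query element is found in all remaining document sets, the query element is taken as an element of the document intersection, and the element with the next sort sequence number in the document set from which the query element was selected is used as the new query element. This loop continues until all elements of one of the at least two document sets have been traversed.
In this embodiment of the present invention, for intersection node 20 in the final syntax tree for "北京市 全时便利店" shown in FIG. 3B, assume that the third document set, for the word segment "全时", contains 2 document elements, that the fourth document set, for the word segment "便利", contains 3 document elements, that the fifth document set, for the word segment "店", contains 4 document elements, and that the first preset threshold is 10. The document elements corresponding to "全时", "便利", and "店" may be as shown in Table 4 below.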
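全时 (Quanshi)       便利 (convenience)   店 (store)
Document 1           Document 2           Document 20
Document 20          Document 20          Document 40
                     Document 21          Document 50
                                          Document 60
Table 4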
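In this case, because the document set length difference of 2 between the longest document set (the fifth) and the shortest document set (the third) is less than the first preset threshold of 10, the intersection of the third, fourth, and fifth document sets can be computed with the general intersection algorithm, as follows. Document 1 in the third document set is used as the query element to traverse the fourth document set; the fourth document set does not contain document 1, so the query element is changed to document 2 in the fourth document set. The fifth document set does not contain document 2, so the query element is changed to document 20 in the fifth document set. Document 20 is found in both the third and the fourth document sets, so it is inserted into the document intersection for intersection node 20. After that, document 40 in the fifth document set is used as the query element; the third document set does not contain document 40 and has been fully traversed. The query then stops, and the document intersection [document 20] for intersection node 20 is passed up to the upper-level intersection node 30. When the differences in length between the document sets of the word segments are small, the general intersection algorithm may give faster queries.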
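The general intersection procedure walked through above can be sketched as follows. This is a minimal Python sketch under assumptions, not the authors' implementation: document sets are ascending lists of integer identifiers, lookups use `bisect_left` with forward-only cursors (a choice made here), and the names `intersect_general` and `origin` are hypothetical. The variable `origin` tracks the document set that supplied the current query element, so that after a full match the next query element is taken from that same set, as in the walkthrough.

```python
from bisect import bisect_left


def intersect_general(doc_sets):
    """Round-robin intersection: the query element moves to whichever set
    first fails to contain it."""
    k = len(doc_sets)
    if k == 0 or any(len(s) == 0 for s in doc_sets):
        return []
    cursors = [0] * k
    origin = 0                      # set that supplied the current query element
    query = doc_sets[0][0]
    result = []
    while True:
        missing_in = None
        for step in range(1, k):    # visit every other set in turn
            i = (origin + step) % k
            s = doc_sets[i]
            pos = bisect_left(s, query, cursors[i])
            cursors[i] = pos
            if pos == len(s):
                return result       # one set is exhausted: done
            if s[pos] != query:
                missing_in = i
                break
        if missing_in is None:
            result.append(query)    # found in all sets
            cursors[origin] += 1
            if cursors[origin] == len(doc_sets[origin]):
                return result
        else:
            origin = missing_in     # next larger element of that set takes over
        query = doc_sets[origin][cursors[origin]]


third, fourth, fifth = [1, 20], [2, 20, 21], [20, 40, 50, 60]   # Table 4
node_20_result = intersect_general([third, fourth, fifth])
```

On the Table 4 data the query element takes the values 1, 2, 20, and 40 in that order, matching the sequence described in the text.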
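Based on the above processing, in the final syntax tree for "北京市 全时便利店" shown in FIG. 3B, the document sets for intersection node 30 are the document intersection [document 1, document 20] of intersection node 10 and the document intersection [document 20] of intersection node 20. Intersection node 30 can then be intersected to obtain the final document intersection [document 20], which is returned to the user.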
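The bottom-up evaluation of the syntax tree can be sketched as a post-order traversal. This Python sketch rests on assumptions introduced here: the tree is written as nested `("AND", children)` tuples, `postings` is a hypothetical mapping from word segment to its ascending document identifiers (only the entries listed in Tables 2 and 4 are included), and Python's built-in `set.intersection` stands in for whichever intersection algorithm is selected at each node.

```python
def evaluate(node, postings):
    """Evaluate an AND tree bottom-up and return ascending document IDs."""
    if isinstance(node, str):
        return postings[node]           # leaf: document set of a word segment
    op, children = node
    if op != "AND":
        raise ValueError("only AND nodes are handled in this sketch")
    # Children are evaluated first, so lower-level nodes are intersected
    # before the node above them.
    child_lists = [evaluate(child, postings) for child in children]
    common = set(child_lists[0]).intersection(*child_lists[1:])
    return sorted(common)


postings = {
    "北京": [1, 2, 20, 85],
    "市": [1, 3, 5, 6, 20, 80],         # listed entries only
    "全时": [1, 20],
    "便利": [2, 20, 21],
    "店": [20, 40, 50, 60],
}
node_10 = ("AND", ["北京", "市"])
node_20 = ("AND", ["全时", "便利", "店"])
node_30 = ("AND", [node_10, node_20])
final_result = evaluate(node_30, postings)
```

Because each node only sees the already-computed results of its children, node 30 intersects two short lists rather than five raw document sets.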
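Testing of the above multi-document intersection acquisition method on a test system showed improvements in the response latency metrics TP90, TP99, and TP999, each by more than 10%. Here, the TP90 time is the minimum time needed to satisfy 90% of requests, the TP99 time is the minimum time needed to satisfy 99% of requests, and the TP999 time is the minimum time needed to satisfy 99.9% of requests.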
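As an illustration of the TP metrics defined above, the following Python sketch computes them with the nearest-rank method. The nearest-rank choice, the function name `tp`, and the sample latencies are assumptions made for this example; the source does not say how its test system computes these values.

```python
import math


def tp(latencies_ms, percent):
    """Smallest observed latency within which `percent` percent of the
    requests completed (nearest-rank percentile)."""
    if not latencies_ms:
        raise ValueError("no samples")
    ordered = sorted(latencies_ms)
    # Rank of the sample that covers the requested share of requests.
    rank = max(1, math.ceil(len(ordered) * percent / 100))
    return ordered[rank - 1]


# Made-up sample: 100 requests taking 1 ms, 2 ms, ..., 100 ms.
samples = list(range(1, 101))
tp90, tp99, tp999 = tp(samples, 90), tp(samples, 99), tp(samples, 99.9)
```

With this made-up sample, 90 of the 100 requests finish within the TP90 value, and the TP999 value coincides with the slowest request because the sample is small.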
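In summary, with the above multi-document intersection acquisition method provided by this embodiment of the present invention, when the document set lengths of at least two document sets whose intersection is required during a search meet a preset condition, the elements of the shortest document set are used as query elements to traverse the remaining document sets in turn. This can effectively improve the efficiency of document intersection and shorten the search engine's response time to the user.
Referring to FIG. 4, a structural diagram of a multi-document intersection acquisition apparatus is shown.
As shown in FIG. 4, the multi-document intersection acquisition apparatus may include: an acquisition module 401, configured to acquire the document set length of each of at least two document sets whose intersection is required during a search; a length comparison module 402, configured to compare the lengths of the at least two document sets to determine an intersection algorithm for obtaining the document intersection; and an intersection module 403, configured to obtain the intersection of the at least two document sets according to the determined intersection algorithm.
In summary, this embodiment of the present invention provides a multi-document intersection acquisition apparatus. When the document set lengths of at least two document sets whose intersection is required meet a preset condition, the elements of the shortest document set are used as query elements to traverse the remaining document sets in turn. This can effectively improve the efficiency of document intersection and shorten the search engine's response time to the user.
Referring to FIG. 5, a detailed structural diagram of a multi-document intersection acquisition apparatus is shown.
As shown in FIG. 5, the multi-document intersection acquisition apparatus may include:
A receiving module 501, configured to receive a search term.
A syntax tree construction module 502, configured to construct a search syntax tree according to the search term, where the leaf nodes of the syntax tree are the word segments of the search term.
An acquisition module 503, configured to acquire the document set length of each of at least two document sets whose intersection is required during the search.
A length comparison module 504, configured to compare the lengths of the at least two document sets to determine an intersection algorithm for obtaining the document intersection.
A first intersection module 505, configured to: if the length difference between the at least two document sets meets a preset condition, use a query element in the minimum document set as the traversal starting point and check whether each remaining document set contains the query element; and, when every remaining document set contains the query element, take the query element as an element of the document intersection. The minimum document set is the document set with the smallest document set length among the at least two document sets.
A second intersection module 506, configured to obtain the document intersection by using a general intersection algorithm if the length difference between the at least two document sets does not meet the preset condition. For example, a query element in a focused document set among the at least two document sets may be used as the traversal starting point to check whether each remaining document set contains the query element; and, when every remaining document set contains the query element, the query element is taken as an element of the document intersection. Here, either the previous query element came from the focused document set and was determined to be an element of the document intersection, or the focused document set was the first document set determined not to contain the previous query element; and the sort sequence number of the query element is the one following that of the previous query element.
The acquisition module 503 may be specifically configured to: starting from the lowest-level first intersection node among the intersection nodes of the search syntax tree whose intersection has not yet been computed, determine, according to the child nodes of the first intersection node, the at least two document sets whose intersection is required; and acquire the document set length of each document set.
The first intersection module 505 may be specifically configured to: match the query element with the current sort sequence number in the minimum document set against the elements of the remaining document sets; if no element matching that query element is found in at least one of the remaining document sets, match the query element with the next sort sequence number in the minimum document set against the elements of the remaining document sets; if an element matching that query element is found in all of the remaining document sets, take the query element as an element of the document intersection, and then match the query element with the next sort sequence number in the minimum document set against the elements of the remaining document sets. This loop continues until all elements in the minimum document set have been traversed.
The second intersection module 506 may be specifically configured to: use the query element with the smallest sort sequence number in the first of the at least two document sets as the initial traversal starting point, and check whether each remaining document set contains the query element. If no element matching the current query element is found in the remaining document set currently being searched, the element with the next sort sequence number in that remaining document set is used as the new query element. If an element matching the current query element is found in all remaining document sets, the query element is taken as an element of the document intersection, and the element with the next sort sequence number in the document set from which the query element was selected is used as the new query element. This loop continues until all elements of one of the at least two document sets have been traversed.
In summary, this embodiment of the present invention provides a multi-document intersection acquisition apparatus. When the document set lengths of at least two document sets whose intersection is required meet a preset condition, the elements of the shortest document set are used as query elements to traverse the remaining document sets in turn. This can effectively improve the efficiency of document intersection when the document set lengths differ greatly, and shortens the search engine's response time to the user.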
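An embodiment of the present invention further provides a document server, including: a processor; and a non-transitory computer-readable storage medium storing machine-executable instructions that can be executed by the processor. The machine-executable instructions cause the processor to perform the steps of the multi-document intersection acquisition method of the foregoing embodiments, for example: for at least two document sets whose intersection is required in a search, acquiring the document set length of each document set; determining, according to the difference in length between the at least two document sets, an intersection algorithm for obtaining the document intersection; and obtaining the document intersection of the at least two document sets by using the determined intersection algorithm.
An embodiment of the present invention further provides a non-transitory machine-readable storage medium. When the instructions in the storage medium are executed by a processor of a document server, the document server is enabled to perform the multi-document intersection acquisition method of the foregoing embodiments.
Because the apparatus embodiments are basically similar to the method embodiments, they are described relatively briefly; for the relevant details, refer to the description of the method embodiments.
The algorithms and displays provided herein are not inherently related to any particular computer, virtual system, or other device. Various general-purpose systems may also be used with the teachings herein. From the description above, the structure required to construct such systems is apparent. Furthermore, the present invention is not directed to any particular programming language. It should be understood that the content of the present invention described herein can be implemented in a variety of programming languages, and that the description of a specific language above is given to disclose the best mode of the present invention.
Numerous specific details are set forth in the specification provided herein. It will be understood, however, that embodiments of the present invention may be practiced without these specific details. In some instances, well-known methods, structures, and techniques are not shown in detail so as not to obscure the understanding of this specification.
Similarly, it should be understood that, in order to streamline the disclosure and aid understanding of one or more of the various inventive aspects, the features of the present invention are sometimes grouped together into a single embodiment, figure, or description thereof in the above description of exemplary embodiments. However, this method of disclosure should not be interpreted as reflecting an intention that the claimed invention requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in fewer than all features of a single embodiment disclosed above. Therefore, the claims following the detailed description are hereby expressly incorporated into the detailed description, with each claim standing on its own as a separate embodiment of the present invention.
Those skilled in the art will understand that the modules in the devices of the embodiments can be adaptively changed and placed in one or more devices different from those of the embodiment. The modules, units, or components of the embodiments may be combined into one module, unit, or component, and may furthermore be divided into a plurality of sub-modules, sub-units, or sub-components. Except where at least some of such features and/or processes or units are mutually exclusive, all features disclosed in this specification (including the accompanying claims, abstract, and drawings) and all processes or units of any method or device so disclosed may be combined in any combination. Unless expressly stated otherwise, each feature disclosed in this specification (including the accompanying claims, abstract, and drawings) may be replaced by an alternative feature serving the same, an equivalent, or a similar purpose.
The various component embodiments of the present invention may be implemented in hardware, in software modules running on one or more processors, or in a combination thereof. Those skilled in the art will appreciate that, in practice, a microprocessor or a digital signal processor (DSP) may be used to implement some or all of the functions of some or all of the components of the payment information processing device according to embodiments of the present invention. The present invention may also be implemented as a device or apparatus program for performing some or all of the methods described herein. Such a program implementing the present invention may be stored on a computer-readable medium or may take the form of one or more signals. Such signals may be downloaded from an Internet website, provided on a carrier signal, or provided in any other form.
It should be noted that the above embodiments illustrate rather than limit the present invention, and that those skilled in the art can design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The present invention can be implemented by means of hardware comprising several distinct elements and by means of a suitably programmed computer. In a unit claim enumerating several means, several of these means can be embodied by one and the same item of hardware. The use of the words first, second, third, and so on does not indicate any order; these words may be interpreted as names.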

Claims (17)

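  1. A multi-document intersection acquisition method, comprising:
    for at least two document sets whose intersection is required in a search, acquiring a document set length of each of the document sets;
    determining, according to a difference in length between the at least two document sets, an intersection algorithm for obtaining a document intersection; and
    obtaining the document intersection of the at least two document sets by using the determined intersection algorithm.
  2. The method according to claim 1, wherein obtaining the document intersection of the at least two document sets by using the determined intersection algorithm comprises:
    if the difference in length between the at least two document sets meets a preset condition, using a query element in a minimum document set as a traversal starting point and checking whether each remaining document set contains the query element, wherein the minimum document set is the document set having the smallest document set length among the at least two document sets;
    when every remaining document set contains the query element, taking the query element as an element of the document intersection.
  3. The method according to claim 2, wherein the preset condition comprises any one or more of the following:
    among the at least two document sets, the difference between the largest document set length and the smallest document set length exceeds a first preset threshold; and
    among the at least two document sets, the ratio of the largest document set length to the smallest document set length exceeds a second preset threshold.
  4. The method according to claim 2, wherein using the query element in the minimum document set as the traversal starting point and checking whether each remaining document set contains the query element comprises:
    matching the query element having the current sort sequence number in the minimum document set against the elements of the remaining document sets;
    if no element matching the query element is found in at least one remaining document set, matching the query element having the next sort sequence number in the minimum document set against the elements of the remaining document sets;
    if an element matching the query element is found in all remaining document sets, taking the query element as an element of the document intersection, and matching the query element having the next sort sequence number in the minimum document set against the elements of the remaining document sets.
  5. The method according to claim 1, wherein the elements in the document sets are sorted in ascending or descending order of identifier and thereby have the sort sequence numbers.
  6. The method according to claim 1, further comprising:
    receiving a search term used to trigger the search;
    constructing a search syntax tree according to the search term, wherein the leaf nodes of the syntax tree are the word segments of the search term.
  7. The method according to claim 6, wherein, for the at least two document sets whose intersection is required in the search, acquiring the document set length of each of the document sets comprises:
    starting from the lowest-level intersection node in the search syntax tree for which the intersection has not yet been computed, determining, according to the child nodes of the intersection node, the at least two document sets whose intersection is required;
    acquiring the document set length of each of the document sets.
  8. The method according to any one of claims 1 to 7, wherein obtaining the document intersection of the at least two document sets by using the determined intersection algorithm comprises:
    if the difference in length between the at least two document sets does not meet the preset condition, using a query element in a focused document set among the at least two document sets as a traversal starting point and checking whether each remaining document set contains the query element;
    when every remaining document set contains the query element, taking the query element as an element of the document intersection,
    wherein either a previous query element came from the focused document set and was determined to be an element of the document intersection, or the focused document set was the first document set determined not to contain the previous query element; and the sort sequence number of the query element is the one following that of the previous query element.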
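  9. A document server, comprising:
    a processor; and
    a non-transitory computer-readable storage medium storing machine-executable instructions that can be executed by the processor,
    wherein the machine-executable instructions cause the processor to:
    for at least two document sets whose intersection is required in a search, acquire a document set length of each of the document sets;
    determine, according to a difference in length between the at least two document sets, an intersection algorithm for obtaining a document intersection; and
    obtain the document intersection of the at least two document sets by using the determined intersection algorithm.
  10. The apparatus according to claim 9, wherein, when obtaining the document intersection of the at least two document sets by using the determined intersection algorithm, the machine-executable instructions cause the processor to:
    if the difference in length between the at least two document sets meets a preset condition, use a query element in a minimum document set as a traversal starting point and check whether each remaining document set contains the query element, wherein the minimum document set is the document set having the smallest document set length among the at least two document sets;
    when every remaining document set contains the query element, take the query element as an element of the document intersection.
  11. The apparatus according to claim 10, wherein the preset condition comprises any one or more of the following:
    among the at least two document sets, the difference between the largest document set length and the smallest document set length exceeds a first preset threshold; and
    among the at least two document sets, the ratio of the largest document set length to the smallest document set length exceeds a second preset threshold.
  12. The apparatus according to claim 10, wherein, when using the query element in the minimum document set as the traversal starting point and checking whether each remaining document set contains the query element, the machine-executable instructions cause the processor to:
    match the query element having the current sort sequence number in the minimum document set against the elements of the remaining document sets;
    if no element matching the query element is found in at least one remaining document set, match the query element having the next sort sequence number in the minimum document set against the elements of the remaining document sets;
    if an element matching the query element is found in all remaining document sets, take the query element as an element of the document intersection, and match the query element having the next sort sequence number in the minimum document set against the elements of the remaining document sets.
  13. The apparatus according to claim 9, wherein the elements in the document sets are sorted in ascending or descending order of identifier and thereby have the sort sequence numbers.
  14. The apparatus according to claim 9, wherein the machine-executable instructions further cause the processor to:
    receive a search term used to trigger the search;
    construct a search syntax tree according to the search term, wherein the leaf nodes of the syntax tree are the word segments of the search term.
  15. The apparatus according to claim 14, wherein, when acquiring the document set length of each of the document sets for the at least two document sets whose intersection is required in the search, the machine-executable instructions cause the processor to:
    starting from the lowest-level intersection node in the search syntax tree for which the intersection has not yet been computed, determine, according to the child nodes of the intersection node, the at least two document sets whose intersection is required;
    acquire the document set length of each of the document sets.
  16. The apparatus according to any one of claims 9 to 15, wherein, when obtaining the document intersection of the at least two document sets by using the determined intersection algorithm, the machine-executable instructions cause the processor to:
    if the difference in length between the at least two document sets does not meet the preset condition, use a query element in a focused document set among the at least two document sets as a traversal starting point and check whether each remaining document set contains the query element;
    when every remaining document set contains the query element, take the query element as an element of the document intersection,
    wherein either a previous query element came from the focused document set and was determined to be an element of the document intersection, or the focused document set was the first document set determined not to contain the previous query element; and the sort sequence number of the query element is the one following that of the previous query element.
  17. A non-transitory machine-readable storage medium storing machine-executable instructions that can be executed by a processor, wherein, when the machine-executable instructions in the non-transitory machine-readable storage medium are executed by a processor in a document server, the document server is able to perform the multi-document intersection acquisition method according to any one of claims 1 to 8.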
PCT/CN2017/120062 2017-09-06 2017-12-29 Multi-document intersection acquisition method and document server WO2019047437A1 (zh)

Priority Applications (3)

Application Number Priority Date Filing Date Title
US16/622,293 US11288329B2 (en) 2017-09-06 2017-12-29 Method for obtaining intersection of plurality of documents and document server
CA3069382A CA3069382C (en) 2017-09-06 2017-12-29 Multi-document intersection acquisition method and document server
JP2019568694A JP6986577B2 (ja) 2017-09-06 2017-12-29 Method for obtaining intersection of multiple documents and document server

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201710797899.8A CN107766414B (zh) 2017-09-06 2017-09-06 Multi-document intersection acquisition method, apparatus, device, and readable storage medium
CN201710797899.8 2017-09-06

Publications (1)

Publication Number Publication Date
WO2019047437A1 true WO2019047437A1 (zh) 2019-03-14

Family

ID=61265299

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2017/120062 WO2019047437A1 (zh) 2017-09-06 2017-12-29 Multi-document intersection acquisition method and document server

Country Status (6)

Country Link
US (1) US11288329B2 (zh)
JP (1) JP6986577B2 (zh)
CN (1) CN107766414B (zh)
CA (1) CA3069382C (zh)
TW (1) TW201913414A (zh)
WO (1) WO2019047437A1 (zh)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107766414B (zh) * 2017-09-06 2020-06-12 北京三快在线科技有限公司 Multi-document intersection acquisition method, apparatus, device, and readable storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
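CN1858737A (zh) * 2006-01-25 2006-11-08 华为技术有限公司 Data search method and system
CN102201007A (zh) * 2011-06-14 2011-09-28 悠易互通(北京)广告有限公司 Large-scale data search system
CN102750393A (zh) * 2012-07-13 2012-10-24 携程计算机技术(上海)有限公司 Composite index structure and search method based on the composite index structure
CN102810096A (zh) * 2011-06-02 2012-12-05 阿里巴巴集团控股有限公司 Retrieval method and apparatus based on a single-character index system
WO2016173366A1 (zh) * 2015-04-28 2016-11-03 腾讯科技(深圳)有限公司 Search method, search system, and storage medium based on an intersection algorithm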

Family Cites Families (24)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP3514874B2 (ja) 1995-06-06 2004-03-31 富士通株式会社 Free text search system
US6658626B1 (en) * 1998-07-31 2003-12-02 The Regents Of The University Of California User interface for displaying document comparison information
US20030172048A1 (en) * 2002-03-06 2003-09-11 Business Machines Corporation Text search system for complex queries
US20050060643A1 (en) * 2003-08-25 2005-03-17 Miavia, Inc. Document similarity detection and classification system
US6915300B1 (en) * 2003-12-19 2005-07-05 Xerox Corporation Method and system for searching indexed string containing a search string
EP1776629A4 (en) * 2004-07-21 2011-05-04 Equivio Ltd METHOD FOR DETERMINING QUASI DUPLICATE OF OBJECTS
EP1903457B1 (en) * 2006-09-19 2012-05-30 Exalead Computer-implemented method, computer program product and system for creating an index of a subset of data
US20080195597A1 (en) * 2007-02-08 2008-08-14 Samsung Electronics Co., Ltd. Searching in peer-to-peer networks
US8832140B2 (en) * 2007-06-26 2014-09-09 Oracle Otc Subsidiary Llc System and method for measuring the quality of document sets
US7925604B2 (en) * 2007-10-25 2011-04-12 International Business Machines Corporation Adaptive greedy method for ordering intersecting of a group of lists into a left-deep AND-tree
US8166041B2 (en) * 2008-06-13 2012-04-24 Microsoft Corporation Search index format optimizations
US20100125614A1 (en) * 2008-11-14 2010-05-20 D Urso Christopher Andrew Systems and processes for functionally interpolated increasing sequence encoding
US8150831B2 (en) * 2009-04-15 2012-04-03 Lexisnexis System and method for ranking search results within citation intensive document collections
US8874663B2 (en) * 2009-08-28 2014-10-28 Facebook, Inc. Comparing similarity between documents for filtering unwanted documents
US20110314045A1 (en) * 2010-06-21 2011-12-22 Microsoft Corporation Fast set intersection
US9501506B1 (en) * 2013-03-15 2016-11-22 Google Inc. Indexing system
US20150331908A1 (en) * 2014-05-15 2015-11-19 Genetic Finance (Barbados) Limited Visual interactive search
US9984110B2 (en) * 2014-08-21 2018-05-29 Dropbox, Inc. Multi-user search system with methodology for personalized search query autocomplete
US10467215B2 (en) * 2015-06-23 2019-11-05 Microsoft Technology Licensing, Llc Matching documents using a bit vector search index
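CN106933824B (zh) * 2015-12-29 2021-01-01 伊姆西Ip控股有限责任公司 Method and apparatus for determining, among multiple documents, a set of documents similar to a target document
CN107203567A (zh) * 2016-03-18 2017-09-26 伊姆西公司 Method and device for searching a string
US10599726B2 (en) * 2017-04-19 2020-03-24 A9.Com, Inc. Methods and systems for real-time updating of encoded search indexes
CN107766414B (zh) * 2017-09-06 2020-06-12 北京三快在线科技有限公司 Multi-document intersection acquisition method, apparatus, device, and readable storage medium
CN111373392B (zh) * 2017-11-22 2021-05-07 花王株式会社 Document classification apparatus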

Also Published As

Publication number Publication date
CN107766414A (zh) 2018-03-06
JP2020523697A (ja) 2020-08-06
CN107766414B (zh) 2020-06-12
US11288329B2 (en) 2022-03-29
US20200210493A1 (en) 2020-07-02
CA3069382A1 (en) 2019-03-14
CA3069382C (en) 2024-02-06
TW201913414A (zh) 2019-04-01
JP6986577B2 (ja) 2021-12-22

Legal Events

Date Code Title Description
ENP Entry into the national phase (Ref document number: 2019568694; Country of ref document: JP; Kind code of ref document: A)
ENP Entry into the national phase (Ref document number: 3069382; Country of ref document: CA)
NENP Non-entry into the national phase (Ref country code: DE)
122 Ep: pct application non-entry in european phase (Ref document number: 17924067; Country of ref document: EP; Kind code of ref document: A1)