WO2009000173A1 - Retrieval method, system and server - Google Patents

Retrieval method, system and server

Info

Publication number
WO2009000173A1
Authority
WO
WIPO (PCT)
Prior art keywords
keyword
read
retrieval
index
server
Prior art date
Application number
PCT/CN2008/070598
Other languages
English (en)
Chinese (zh)
Inventor
Liang Sun
Original Assignee
Tencent Technology (Shenzhen) Company Limited
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology (Shenzhen) Company Limited
Publication of WO2009000173A1

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing

Definitions

  • the present invention relates to the field of communications technologies, and in particular, to a retrieval method, a retrieval system, and a retrieval server.
  • a search string contains one or more keywords.
  • Keywords are separated by spaces; a space between two keywords indicates an "and" relationship between them.
  • Each keyword can consist of one or more morphemes.
  • A morpheme is the smallest language unit that can express independent semantics, usually a Chinese word produced by a word segmentation system. A keyword can be divided into different numbers of morphemes by the word segmentation system.
  • If a keyword is divided into two morphemes, the keyword is a binary compound morpheme; if it is divided into three morphemes, the keyword is a ternary compound morpheme.
  • searching the input search string needs to find a collection of all the documents containing the search string in a short time, and display the document collection through the document identification list.
  • background retrieval cluster technology is one of the most core technologies. This technology is directly related to the collaboration between multiple search servers to provide retrieval services for a larger set of data. Since the number of document collections managed by a single retrieval server is limited, if the number of documents saved is too large, it will be difficult for the system to return the desired results within a time acceptable to the user during normal retrieval operations. Usually the user can accept no more than 1 second, so a search cluster consisting of multiple search servers is needed to support search services within a larger data set.
  • The inverted index is a data structure used to speed up retrieval of the search string. It can exist as a disk file or be loaded into memory, and it consists of at least a dictionary file and an inverted table file. A plurality of inverted entries are saved in the inverted table file, and each inverted entry saves the correspondence between a keyword in the search string and the documents. Therefore, effectively improving the reading speed of the inverted entries can improve retrieval efficiency.
  • The time to read an inverted entry from the inverted table file includes the disk addressing time and the time required to read the data.
  • When the amount of data read is relatively small, the reading time of the inverted entry mainly depends on the disk addressing time. When the amount of data read is relatively large, the reading time mainly depends on the time required to read the data.
  • the system includes a retrieval proxy server and a plurality of parallel retrieval servers managed by the retrieval proxy server.
  • Each retrieval server is allocated 1/N of the documents in the full set of documents, where N is the total number of retrieval servers.
  • The retrieval proxy server sends the read request to every retrieval server at the same time. After each retrieval server completes its local retrieval, it returns the retrieval results to the retrieval proxy server, and finally the retrieval proxy server aggregates the retrieval results of all retrieval servers according to a specific weight sorting manner.
  • the document partition-based retrieval system has an independent structural design, and the degree of coupling between the retrieval servers is small, and each retrieval server is equivalent to a retrieval subsystem that can be independently loaded.
  • most of the search strings are composed of two or more keywords.
  • After matching the document identifiers for each keyword, the retrieval server needs to perform position offset matching within the document. This results in multiple I/O accesses to the disk.
  • When a high frequency morpheme is included in the search string, the number of document identification lists and position offset lists that need to be read is large. For example, the inverted-entry data of high frequency morphemes such as "China", "Net", and "We" usually accounts for a large proportion of the entire inverted index data. The index data cannot be read in a short time, so most of the retrieval time is consumed by file input/output read operations. As a result, the overall concurrency of the retrieval system is degraded, which slows both the retrieval speed and the response to the search string.
  • An existing distributed index file retrieval model based on index entry partitioning is shown in FIG. 2.
  • The system includes a retrieval proxy server and N groups of parallel retrieval servers managed by the retrieval proxy server, where N is an integer greater than 1. Each group of retrieval servers is allocated 1/N of the documents in the full set of documents.
  • Each group of retrieval servers contains three retrieval servers.
  • The inverted entries of all indexed keywords are evenly distributed across the 3 retrieval servers in the group, thereby speeding up access to the inverted entries.
  • A single retrieval server in a group cannot perform a retrieval independently; it must collaborate with the other retrieval servers in the group to complete the retrieval. This increases the degree of data coupling between the retrieval servers, resulting in more complicated data backup and lower retrieval speed.
  • This collaboration is required every time a retrieval is performed, which increases the amount of communication between the retrieval servers.
  • a retrieval method including:
  • when the keyword is a high frequency keyword, n retrieval servers each read a part of the index entries of the high frequency keyword stored by themselves, where n is an integer greater than 1;
  • when the keyword is a low frequency keyword, one of the n retrieval servers reads all index entries of the low frequency keyword stored by that retrieval server.
  • a retrieval system comprising:
  • a cluster proxy server configured to: determine a type of a keyword to be retrieved; when the keyword is a high frequency keyword, send, to each of n retrieval servers, a command to read a part of the index entries of the high frequency keyword stored by itself; when the keyword is a low frequency keyword, send, to one retrieval server of the n retrieval servers, a command to read all index entries of the low frequency keyword stored by itself, where n is an integer greater than 1; and determine, according to the index entries read by the retrieval servers, a retrieval result of the keyword to be retrieved;
  • a retrieval server configured to: when receiving a command to read a part of the index entries of the high frequency keyword stored by itself, read a part of the index entries of the high frequency keyword; and when receiving a command to read all index entries of the low frequency keyword stored by itself, read all index entries of the low frequency keyword.
  • a retrieval server comprising: a read management module, configured to receive at least one of a command to read a part of an index entry of a high frequency keyword stored by itself and a command to read all index entries of a low frequency keyword stored by itself;
  • a keyword reading module configured to: when receiving a command to read a part of the index entries of the high frequency keyword stored by itself, read a part of the index entries of the high frequency keyword; and when receiving a command to read all index entries of the low frequency keyword stored by itself, read all index entries of the low frequency keyword.
  • a cluster proxy server including:
  • a first module configured to determine a type of a keyword to be retrieved
  • a second module configured to: when the keyword is a high frequency keyword, send, to each of n retrieval servers, a command to read a part of the index entries of the high frequency keyword stored by itself; and when the keyword is a low frequency keyword, send, to one retrieval server of the n retrieval servers, a command to read all index entries of the low frequency keyword stored by itself, where n is an integer greater than 1;
  • a third module configured to determine, according to the index table entry read by the retrieval server, a retrieval result of the keyword to be retrieved.
  • The inverted entries of a high frequency keyword are stored across multiple servers in the cluster, and during retrieval the inverted entries of the high frequency keyword are read in parallel by those servers. A large number of inverted entries can therefore be read within the system design time without delaying the time overhead of a single subsequent logical operation, thereby improving the retrieval speed.
  • all the inverted items of a low frequency keyword are stored by a retrieval server, and only the inverted list of the low frequency keyword is read by the server when the retrieval is performed. Therefore, it is not necessary to separately read a small number of inverted entries on multiple search servers, which saves the storage resources of multiple search servers in the cluster and improves the retrieval speed.
  • FIG. 1 is a schematic diagram of a distributed index file retrieval model based on document partitioning
  • FIG. 2 is a schematic diagram of a distributed index file retrieval model based on index entry partitioning
  • FIG. 3 is a flowchart of a retrieval method according to an embodiment of the present invention.
  • FIG. 4 is a flowchart of a retrieval method according to another embodiment of the present invention.
  • FIG. 5 is a schematic diagram of searching a specific search string by applying the method of the present invention.
  • FIG. 6 is a structural diagram of a retrieval system in an embodiment of the present invention.
  • FIG. 7 is a schematic diagram of a retrieval model of a retrieval system in an embodiment of the present invention.
  • Figure 8 is a flow chart for applying the retrieval model in Figure 7 for retrieval;
  • FIG. 9 is a structural diagram of a retrieval server in an embodiment of the present invention.
  • The present invention is further described in detail below with reference to the drawings and embodiments.
  • the flow of the retrieval method in the embodiment of the present invention is as shown in FIG. 3.
  • Step 301 Parsing the retrieved search string to generate a search expression consisting of keywords.
  • Step 302 Send a read request for the keyword to each search server in the cluster.
  • the read request for the keyword includes a read-ahead request for the inverted entry of the keyword.
  • The inverted entry of a keyword is an array that records every document containing the keyword: for each such document it records the document identifier, the weight of the keyword in the document, and the position offsets of the keyword in the document. The basic structure is as follows: $t:\ \langle d_1, w_{d_1,t}, loc_1, loc_2, \ldots, loc_{f_{d_1,t}} \rangle\ \langle d_2, \ldots \rangle\ \ldots\ \langle d_{f_t}, \ldots \rangle$
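  • As an illustration of the structure described above, the following is a minimal Python sketch of an inverted entry. The names `Posting` and `build_inverted_index` are hypothetical, and the weight is computed as a simple term frequency purely as a stand-in, since the source does not say how the weight is derived.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Posting:
    """One <d, w, loc_1 .. loc_f> element of a keyword's inverted entry."""
    doc_id: int                                        # document containing the keyword
    weight: float                                      # weight of the keyword in the document
    offsets: List[int] = field(default_factory=list)   # position offsets in the document


def build_inverted_index(docs: Dict[int, List[str]]) -> Dict[str, List[Posting]]:
    """Map each keyword to its postings, ordered by document identifier.

    Assumption: the weight is the keyword's relative frequency in the document.
    """
    index: Dict[str, Dict[int, Posting]] = {}
    for doc_id, tokens in docs.items():
        for pos, token in enumerate(tokens):
            posting = index.setdefault(token, {}).setdefault(
                doc_id, Posting(doc_id, 0.0))
            posting.offsets.append(pos)
            posting.weight = len(posting.offsets) / len(tokens)
    return {t: sorted(p.values(), key=lambda x: x.doc_id) for t, p in index.items()}


docs = {1: ["china", "net", "china"], 2: ["net", "search"]}
index = build_inverted_index(docs)
```

  • Here `index["china"]` holds a single posting for document 1 with offsets 0 and 2, while `index["net"]` holds one posting for each of the two documents.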
  • Step 303 The retrieval server in the cluster reads the inverted item of the keyword according to the frequency of the keyword hitting the document.
  • According to the frequency with which they hit documents, the keywords in the search expression can be classified into high frequency keywords, which comprise ultra-high frequency keywords and medium-high frequency keywords, and low frequency keywords.
  • The index data may be counted before retrieval to determine the number of documents hit by each keyword, and the type of the keyword to be retrieved is then determined according to a preset document frequency threshold.
  • When the keyword is an ultra-high frequency keyword and/or a medium-high frequency keyword, the inverted entries of the keyword are segmented and stored by the retrieval servers in the cluster, with each retrieval server storing a part of the keyword's inverted entries. For example, when the cluster includes n retrieval servers, all index entries of the high frequency keyword are divided into n parts, and the mth retrieval server stores the mth part of the keyword's index entries, where n is an integer greater than 1 and m is an integer greater than or equal to 1 and less than or equal to n.
  • When the keyword is a low frequency keyword, all the inverted entries of the keyword are stored by one retrieval server in the cluster. For example, the full set of low frequency keywords is divided into n parts, and the mth retrieval server stores all index entries of the low frequency keywords in the mth part.
  • In the case of a high frequency keyword, each retrieval server in the cluster reads the inverted entries of the high frequency keyword stored by itself. In the case of a low frequency keyword, all the inverted entries of the low frequency keyword are read by the retrieval server that stores them.
  • The segmentation of the inverted entries of the keyword includes: taking the document identifier in each inverted entry of the high frequency keyword modulo n. Inverted entries with the same modulus value are stored as a group on the retrieval server corresponding to that modulus value, and in the retrieval phase that retrieval server reads the inverted entries having that modulus value.
  • For a low frequency keyword, the word identifier (word ID) of the keyword is taken modulo n, and the low frequency keywords with the same modulus value are grouped and stored by one retrieval server.
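  • The two placement rules above can be sketched in Python as follows. The function names are hypothetical, and servers are assumed to be numbered 0 to n-1 so that a modulus value maps directly to a server.

```python
from typing import Dict, List


def shard_high_frequency(doc_ids: List[int], n: int) -> Dict[int, List[int]]:
    """Split one high frequency keyword's postings across n servers.

    Each document identifier is taken modulo n; postings sharing a modulus
    value are stored together on the server with that number.
    """
    shards: Dict[int, List[int]] = {m: [] for m in range(n)}
    for doc_id in doc_ids:
        shards[doc_id % n].append(doc_id)
    return shards


def server_for_low_frequency(word_id: int, n: int) -> int:
    """Pick the single server that stores all postings of a low frequency keyword."""
    return word_id % n


shards = shard_high_frequency([16, 38, 100, 207], 3)
owner = server_for_low_frequency(8, 3)  # word ID 8 is a made-up value
```

  • With three servers, documents 16 and 100 land on server 1, document 38 on server 2, and document 207 on server 0, while the whole inverted list of the hypothetical word ID 8 lives on server 2.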
  • The retrieval server compresses the eight-byte document identifier in the keyword's inverted entry into a four-byte document number.
  • Step 304 The retrieval servers in the cluster perform logical operations on the inverted entries of the keywords and then output the search results.
  • The retrieval server storing the inverted entries of the low frequency keyword takes the document identifier of each inverted entry of the low frequency keyword modulo n, and sends the inverted entries corresponding to each modulus value to the retrieval server corresponding to that modulus value.
  • Each retrieval server in the cluster performs logical operations on the inverted entries of the high frequency keyword and the low frequency keyword, and the search results of the search string are obtained by summarizing the logical operation results of all retrieval servers.
  • The logical operation may be one of an "and" operation, an "or" operation, and a "not" operation, or any combination thereof.
  • Each cluster shown in this embodiment includes n retrieval servers, where n is an integer greater than one.
  • Step 401 Parsing the retrieved search string to generate a search expression consisting of keywords.
  • The search string entered by the user may be a short sentence or may include a plurality of keywords.
  • These search strings are original strings that have not been formatted for the computer, so the search string is parsed to generate a computer-recognizable search expression.
  • The search expression may contain one or more keywords. If the user input does not include a separator, the keywords obtained after parsing have a logical relationship between them. If the user input includes separators, then keywords separated by a space are combined with an "and" operation, keywords separated by "|" are combined with an "or" operation, and "!" before a keyword indicates a "not" operation on that keyword.
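  • A minimal Python sketch of this parsing step is given below. It rests on several assumptions that the source does not state: the "or" separator is taken to be "|", "or" is assumed to bind more loosely than the space-separated "and", and the expression is represented as a list of "and" groups. The name `parse_search_string` is hypothetical.

```python
from typing import List, Tuple

Term = Tuple[str, bool]  # (keyword, negated)


def parse_search_string(raw: str) -> List[List[Term]]:
    """Parse a raw search string into OR-ed groups of AND-ed terms.

    Assumed syntax: "|" separates OR groups, whitespace separates AND terms,
    and a leading "!" marks a NOT term.
    """
    expression: List[List[Term]] = []
    for group in raw.split("|"):
        terms: List[Term] = []
        for token in group.split():
            negated = token.startswith("!")
            keyword = token.lstrip("!")
            if keyword:
                terms.append((keyword, negated))
        if terms:
            expression.append(terms)
    return expression


expr = parse_search_string("China Xu|news !sports")
```

  • The result reads as ("China" and "Xu") or ("news" and not "sports"); a real parser would also run word segmentation on each keyword, which is omitted here.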
  • Step 402 Send a read request for the keyword to each search server in the cluster, wherein the read request for the keyword includes a read-ahead request for the inverted entry of the keyword.
  • Step 403 Determine that the keyword is a high frequency keyword or a low frequency keyword, and if it is a high frequency keyword, perform step 404; if it is a low frequency keyword, perform step 405.
  • the keywords in the search expression are divided into high frequency keywords and low frequency keywords.
  • the high frequency keywords can be further divided into medium and high frequency keywords and ultra high frequency keywords.
  • The index data may be counted before the search is performed to determine the number of documents hit by each keyword, that is, the number of inverted entries corresponding to the keyword. The type of the keyword to be retrieved is then determined according to a preset document frequency threshold.
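  • The classification can be sketched as a simple threshold test. In the Python sketch below, the function name, the counts, and the threshold value are all hypothetical, and treating a count equal to the threshold as high frequency is an assumption.

```python
from typing import Dict


def classify_keywords(doc_counts: Dict[str, int], threshold: int) -> Dict[str, str]:
    """Label each keyword "high" or "low" by the number of documents it hits.

    Assumption: a keyword whose count reaches the threshold is high frequency.
    """
    return {
        keyword: "high" if count >= threshold else "low"
        for keyword, count in doc_counts.items()
    }


# Illustrative counts only; real counts come from statistics over the index data.
types = classify_keywords({"China": 10, "Xu Jianjun": 5}, threshold=8)
```

  • A second, higher threshold could split the "high" class further into medium-high and ultra-high frequency keywords in the same way.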
  • Step 404 The n retrieval servers in the cluster respectively read a part of the inverted entry of the high frequency keyword, and then perform step 407.
  • For reading the inverted entries of high frequency keywords, a technique similar to a disk RAID (Redundant Array of Independent Disks) system can be used: the n retrieval servers in the cluster each store a part of the inverted entries of an ultra-large-scale high frequency keyword, and during retrieval the n retrieval servers read them in parallel. The reading of the oversized inverted entries can thus be completed within the system design time without delaying the time overhead of a single subsequent logical operation.
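  • The parallel read can be illustrated with the following Python sketch. Threads stand in for the separate retrieval servers and `read_shard` stands in for a disk read; both are simplifications, and all names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List


def read_shard(shard: List[int]) -> List[int]:
    """Stand-in for one server reading its part of the inverted entries."""
    return list(shard)


def parallel_read(shards: Dict[int, List[int]]) -> Dict[int, List[int]]:
    """Have every server read its own shard concurrently and collect the results."""
    with ThreadPoolExecutor(max_workers=max(1, len(shards))) as pool:
        futures = {m: pool.submit(read_shard, shard) for m, shard in shards.items()}
        return {m: future.result() for m, future in futures.items()}


result = parallel_read({0: [207, 903], 1: [16, 100], 2: [38]})
```

  • Because each server only reads roughly 1/n of the entries, the wall-clock read time is bounded by the largest shard rather than by the full inverted list.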
  • Step 405 The retrieval server storing the low frequency keyword to be retrieved in the cluster reads all the inverted entries of the low frequency keyword.
  • The inverted entries of the low frequency keyword are read by one of the retrieval servers in the cluster, which avoids the existing situation in which a small number of inverted entries are read on each of multiple retrieval servers.
  • The data block of the inverted entries of a low frequency keyword is smaller than the minimum read data block of the disk, for example 64K, and for any data block smaller than 64K the time the disk takes to read it is the same.
  • If the inverted entries of a low frequency keyword were divided into n blocks and read by n servers, the reading speed would not increase and the resources of multiple retrieval servers in the cluster would be wasted.
  • When the index is established, the retrieval server further compresses the eight-byte document identifier in the keyword's inverted entry into a four-byte document number.
  • The document identifier in the inverted entry is used to locate the document.
  • Each web page has a unique URL (Uniform Resource Locator). After the URL string of the web page is processed by a signature algorithm, a 64-bit (8-byte) globally unique integer corresponding to the URL string is obtained, which serves as the document identifier of the document.
  • The inverted entries of the keyword are stored on n retrieval servers.
  • Each retrieval server receives a certain number of documents. Assuming that this number is N, where N is an integer greater than 0, each retrieval server further numbers the documents assigned to it, converting each document identifier into an integer from 0 to N-1 that serves as the document number of the document. In this way, for the same document, the document number is much shorter than the original document identifier, which saves storage space and improves the reading speed.
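  • A minimal Python sketch of the identifier compression follows. The source does not name the signature algorithm, so an 8-byte BLAKE2b digest is used here purely as an assumption (a hash makes collisions unlikely but cannot strictly guarantee uniqueness). The function names and the example URLs are hypothetical.

```python
import hashlib
from typing import Dict, List


def document_identifier(url: str) -> int:
    """Derive a 64-bit document identifier from a URL string.

    Assumption: BLAKE2b with an 8-byte digest stands in for the unnamed
    signature algorithm.
    """
    digest = hashlib.blake2b(url.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def assign_document_numbers(identifiers: List[int]) -> Dict[int, int]:
    """Number the documents held by one server from 0 to N-1."""
    return {ident: number for number, ident in enumerate(sorted(set(identifiers)))}


urls = ["http://example.com/a", "http://example.com/b", "http://example.com/c"]
identifiers = [document_identifier(u) for u in urls]
numbers = assign_document_numbers(identifiers)
```

  • As long as a server holds fewer than 2^32 documents, each local document number fits in four bytes, which is where the halving of the stored identifier size comes from.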
  • Step 406 The document numbers of the inverted entries of the low frequency keyword are taken modulo n, and each inverted entry is sent to the retrieval server corresponding to its modulus value.
  • Step 407 The n retrieval servers in the cluster perform logical operations on the inverted entries that have been read.
  • The logical operations in this step are performed according to the logical relationship between the keywords in the search string to be retrieved, where the logical operations include one or any combination of "and", "or", and "not" operations.
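  • The logical operations on document numbers can be sketched in Python as below. The two-pointer merge for "and" assumes the document numbers in each inverted list are sorted, which the source does not state; "not" is modeled as removing the documents of the negated keyword from a candidate list. All function names are hypothetical.

```python
from typing import List


def op_and(a: List[int], b: List[int]) -> List[int]:
    """Intersect two sorted lists of document numbers with a two-pointer merge."""
    i = j = 0
    out: List[int] = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out


def op_or(a: List[int], b: List[int]) -> List[int]:
    """Union of two lists of document numbers, returned sorted."""
    return sorted(set(a) | set(b))


def op_and_not(a: List[int], b: List[int]) -> List[int]:
    """Documents in a that do not appear in b (the "not" operation applied to b)."""
    excluded = set(b)
    return [d for d in a if d not in excluded]
```

  • For example, `op_and([38, 872, 5618], [38, 971])` keeps only document 38, the one document present in both lists.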
  • Step 408 The result of the search operation of the search string is obtained by summarizing the logical operation results of the n search servers.
  • The above takes as an example a search string that includes both a high frequency keyword and a low frequency keyword.
  • When the search string includes only high frequency keywords, steps 405-406 need not be performed; when the search string includes only low frequency keywords, step 404 need not be performed.
  • the cluster includes three search servers, which are search server 0, search server 1, and search server 2.
  • The search string "China Xu Jianjun" is parsed to generate a search expression consisting of the keywords "China" and "Xu Jianjun".
  • The three retrieval servers in the cluster determine the type of each keyword to be retrieved according to the number of documents it hits, and read the inverted entries of the keyword according to its type.
  • The three retrieval servers in the cluster each store a part of the inverted entries of the high frequency keyword "China", and each reads the part it stores.
  • Each document number corresponding to the high frequency keyword "China" is taken modulo 3, and the retrieval server corresponding to each modulus value reads the inverted entries with that modulus value. For example, since document number 16 modulo 3 is 1, retrieval server 1 in the cluster reads the inverted entry of document number 16.
  • Retrieval server 0 in the cluster reads the inverted entries of document numbers {207, 903, 2331}, retrieval server 1 reads the inverted entries of document numbers {16, 100, 319, 1081}, and retrieval server 2 reads the inverted entries of document numbers {38, 872, 5618}.
  • The three retrieval servers in the cluster each save all the inverted entries of different low frequency keywords. Assume that all the inverted entries of the low frequency keyword "Xu Jianjun" are stored on retrieval server 2 in the cluster.
  • Retrieval server 2 saves and reads all the inverted entries of the low frequency keyword "Xu Jianjun", that is, the inverted entries of document numbers {38, 295, 307, 971, 2331}.
  • The inverted entries of the low frequency keyword "Xu Jianjun" are then distributed to the three retrieval servers in the cluster: each document number is taken modulo 3, and the inverted entry is sent to the retrieval server corresponding to the modulus value.
  • The inverted entry of document number {2331} is sent to retrieval server 0, the inverted entries of document numbers {295, 307} are sent to retrieval server 1, and the inverted entries of document numbers {38, 971} are sent to retrieval server 2, and the intermediate result of the search is obtained.
  • The three servers in the cluster perform the "and" operation on the inverted entries of the high frequency keyword "China" and the low frequency keyword "Xu Jianjun" to obtain the search results.
  • The search result of retrieval server 0 is the document with document number 2331, the search result of retrieval server 1 is empty, and the search result of retrieval server 2 is the document with document number 38. The search results of the three retrieval servers are then summarized.
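  • The worked example for the search string "China Xu Jianjun" can be replayed with the short Python sketch below. The document numbers are taken from the example; the variable and function names are hypothetical, and Python sets are used for the "and" operation for brevity.

```python
from typing import Dict, List

N_SERVERS = 3

# Inverted entries of the high frequency keyword "China", already sharded by
# document number modulo 3 across retrieval servers 0, 1 and 2.
china_shards: Dict[int, List[int]] = {
    0: [207, 903, 2331],
    1: [16, 100, 319, 1081],
    2: [38, 872, 5618],
}

# All inverted entries of the low frequency keyword "Xu Jianjun",
# stored on and read by retrieval server 2.
xu_jianjun: List[int] = [38, 295, 307, 971, 2331]


def redistribute(doc_numbers: List[int], n: int) -> Dict[int, List[int]]:
    """Send each low frequency entry to the server matching its modulus value."""
    routed: Dict[int, List[int]] = {m: [] for m in range(n)}
    for doc_number in doc_numbers:
        routed[doc_number % n].append(doc_number)
    return routed


routed = redistribute(xu_jianjun, N_SERVERS)

# Each server performs the "and" operation on the entries it now holds.
per_server = {
    m: sorted(set(china_shards[m]) & set(routed[m])) for m in range(N_SERVERS)
}

# The per-server results are then summarized.
final = sorted(d for hits in per_server.values() for d in hits)
```

  • Because both keywords are partitioned by the same modulus, any document that contains both keywords is guaranteed to meet on a single server, so each server can intersect locally without talking to the others.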
  • Fig. 6 shows a retrieval system in an embodiment of the present invention.
  • the retrieval system in this embodiment includes: a caching proxy server 610, a cluster proxy server 620, and a retrieval server 630.
  • The cache proxy server 610 parses the search string to be retrieved to generate a search expression consisting of keywords, receives the search result from the cluster proxy server 620, and outputs the search result as needed.
  • The cluster proxy server 620 is configured to receive the retrieval expression from the cache proxy server 610, determine the type of each keyword in the retrieval expression, and send a read command to the retrieval server 630 according to the type of the keyword; and to receive the retrieval result from the retrieval server 630 and send it to the cache proxy server 610.
  • The retrieval server 630 is configured to read the inverted entries of the keyword according to the read command from the cluster proxy server 620, determine the retrieval result of the keyword to be retrieved, and return the retrieval result to the cluster proxy server 620.
  • When at least two keywords are to be retrieved, the retrieval server 630 is further configured to perform logical operations on the inverted entries of the at least two keywords after obtaining the inverted entries of each keyword, and to determine the search results corresponding to the at least two keywords.
  • A schematic diagram of a retrieval model using the system of the present invention is shown in FIG. 7.
  • The cache proxy server, the cluster proxy servers, and the retrieval servers in the schematic diagram are distributed in a "tree" structure: the system includes a cache proxy server, the cache proxy server is connected to n cluster proxy servers, each cluster proxy server is connected to n retrieval servers, and each set of n retrieval servers constitutes a cluster retrieval subsystem.
  • the cache proxy server is a separate process and can reside on a hardware server.
  • the caching proxy server caches the query request of the externally input search string, and parses the search string to be retrieved to generate a search expression consisting of keywords.
  • the caching proxy server can invoke a retrieval interpreter in the retrieval server to parse the externally entered retrieval string into a retrieval expression that the machine can understand.
  • the cache proxy server summarizes the results of all cluster proxy servers and returns them to the external user.
  • a clustered proxy server is a separate process that can reside on a single hardware server.
  • The cluster proxy server determines the type of each keyword in the retrieval expression and, according to the type of the keyword, sends a read command to the retrieval servers in the cluster subsystem: when the keyword is a high frequency keyword, it sends each retrieval server a command to read a part of the index entries of the high frequency keyword stored by itself; when the keyword is a low frequency keyword, it sends one of the retrieval servers a command to read all index entries of the low frequency keyword stored by itself.
  • The search results returned by the retrieval servers are summarized to determine the search result of the keyword to be searched, and the summarized search result is returned to the upper-level cache proxy server.
  • Each retrieval server is a separate process that can reside on a hardware server. It is a basic retrieval unit that, under the scheduling of the upper-level cluster proxy server, completes the basic underlying retrieval operations, including reading the inverted entries of keywords according to the read command of the cluster proxy server and returning them to the cluster proxy server.
  • When the command is to read all index entries of a low frequency keyword stored by the retrieval server itself, all index entries of the low frequency keyword are read.
  • When at least two keywords are to be retrieved, the retrieval server further performs logical operations such as "and", "or", and "not" on the inverted entries of the at least two keywords to determine the index entries corresponding to the at least two keywords.
  • the retrieval speed can be remarkably improved.
  • the total number of unary, binary, and ternary morphemes that hit more than 1,000 documents does not exceed 500,000. Then it can be inferred that in 100 million documents, the number of morphemes hitting 6000-10000 pieces will not exceed 500,000.
  • 8-byte storage document identification is used, and 3-byte storage is used.
  • the storage space of the inverted item of the keyword is 64k, when the keyword hits 10000 documents.
  • the inverted row of the keyword has a storage space of 128k and a read time of 8 milliseconds.
  • The inverted entries are partitioned according to the storage space of the inverted entry, which includes the document identifier, the weight, and the position offsets. For a morpheme whose inverted entries occupy 64k or more, the inverted entries of the morpheme are stored across a plurality of retrieval servers; for a morpheme whose inverted entries occupy less than 64k, all the inverted entries of the morpheme are stored by one retrieval server.
  • The flowchart for applying the retrieval model in FIG. 7 is shown in FIG. 8.
  • At least two keywords are included in the search string.
  • Step 801 The cache proxy server parses the search string to be retrieved to generate a search expression consisting of the keywords.
  • Step 802 The cluster proxy server determines the type of each keyword in the retrieval expression, and sends a command to read the inverted row item to the retrieval server according to the type of each keyword.
  • Step 803 After receiving the read request, the search server reads the keyword inverted list item.
  • Step 804 The retrieval server performs logical operations on the inverted entries of the at least two keywords. For example, logical operations are performed on the document numbers in the inverted entries to obtain the result of the logical operation on the keywords.
  • Step 805 Each retrieval server sends the result of the logical operation to the upper-level cluster proxy server, which aggregates the results to obtain an intermediate result.
  • Step 806 Each cluster proxy server sends its intermediate result to the upper-level cache proxy server, which summarizes the intermediate results and outputs the final result.
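  • The dispatch decision made by the cluster proxy server in step 802 can be sketched in Python as follows. The command names `READ_PART` and `READ_ALL`, the function name, the counts, the threshold, and the word IDs are all hypothetical; the low frequency keyword is routed by word ID modulo n as described for the storage layout.

```python
from typing import Dict, List, Tuple

Command = Tuple[int, str, str]  # (server number, command name, keyword)


def plan_read_commands(
    keywords: List[str],
    doc_counts: Dict[str, int],
    word_ids: Dict[str, int],
    threshold: int,
    n: int,
) -> List[Command]:
    """Decide which read command goes to which retrieval server.

    High frequency keywords: every server reads the part it stores.
    Low frequency keywords: the one server selected by word ID modulo n reads all.
    """
    commands: List[Command] = []
    for keyword in keywords:
        if doc_counts[keyword] >= threshold:
            commands.extend((m, "READ_PART", keyword) for m in range(n))
        else:
            commands.append((word_ids[keyword] % n, "READ_ALL", keyword))
    return commands


commands = plan_read_commands(
    ["China", "Xu Jianjun"],
    doc_counts={"China": 10, "Xu Jianjun": 5},   # illustrative counts
    word_ids={"China": 1, "Xu Jianjun": 8},      # made-up word IDs
    threshold=8,
    n=3,
)
```

  • The plan contains three `READ_PART` commands for "China", one per server, and a single `READ_ALL` command for "Xu Jianjun" addressed to server 2.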
  • Fig. 9 shows the structure of a retrieval server in the embodiment of the present invention.
  • the search string to be retrieved includes at least two keywords.
  • the retrieval server includes: a retrieval interpretation module 910, a read management module 920, a keyword reading module 930, a logical operation module 940, and an identification conversion module 950.
  • The retrieval interpretation module 910 is configured to parse the search string to be retrieved to generate a search expression composed of keywords, for the upper-level server to call.
  • The read management module 920 is configured to receive at least one of a command to read a part of the index entries of a high frequency keyword stored by itself and a command to read all index entries of a low frequency keyword stored by itself.
  • The keyword reading module 930 is configured to: when receiving a command to read a part of the index entries of the high frequency keyword stored by itself, read a part of the index entries of the high frequency keyword; and when receiving a command to read all index entries of the low frequency keyword stored by itself, read all index entries of the low frequency keyword.
  • The inverted list item includes the document number generated by compressing the document identifier.
  • The logical operation module 940 is configured to, when at least two keywords having a logical relationship are to be retrieved, perform logical operations according to that relationship on the index entries corresponding to the at least two keywords, and determine the index entries corresponding to the at least two keywords.
  • The identification conversion module 950 is configured to compress the eight-byte document identifier in the keyword's inverted list item into a four-byte document number.
  • The indexing method for the document is an inverted index and the corresponding index entry is an inverted list item; this is only an example of the present invention and is not intended to limit the present invention. Other indexing methods may be used, in which case the index entries corresponding to that indexing method are read.
  • The inverted list items of a high frequency keyword are stored by multiple servers in the cluster, and when a search is performed, those servers read the inverted list items of the high frequency keyword in parallel. A large number of inverted list items can therefore be read within the system's design time, the time overhead of a single logical operation is not extended, and the retrieval speed is improved.
  • All inverted list items of a low frequency keyword are stored by one retrieval server. When a search is performed, only that server reads the inverted list items of the low frequency keyword. It is therefore unnecessary to read a small number of inverted list items separately on multiple search servers, which saves the storage resources of the search servers in the cluster and improves the retrieval speed.
  • The embodiment of the present invention can effectively improve the degree of coupling between the search servers in the search cluster, increase the capability for dynamic resource allocation between the servers, and plan the resources of the multiple search servers in the cluster in a unified way, so that the overall concurrency capability of the cluster is guaranteed to the greatest extent, which further improves the retrieval speed.
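
The storage rule and the retrieval flow described above can be illustrated with a minimal Python sketch. It is not the patented implementation: the function names, the 16-byte entry size, the reading of "64k" as bytes, the round-robin split, and the use of in-memory dictionaries in place of real retrieval servers are all assumptions made for illustration. For simplicity the sketch merges the partial lists of each keyword before intersecting them, rather than performing the logical operation on each retrieval server as in Step 804.

```python
from concurrent.futures import ThreadPoolExecutor

SPLIT_THRESHOLD = 64 * 1024  # the "64k" boundary from the text, assumed to be bytes
ENTRY_SIZE = 16              # assumed bytes per item (document number, weight, offset)


def place_postings(index, num_servers):
    """Assign each keyword's document numbers to one or to all servers."""
    servers = [{} for _ in range(num_servers)]
    for i, term in enumerate(sorted(index)):
        doc_numbers = index[term]
        if len(doc_numbers) * ENTRY_SIZE >= SPLIT_THRESHOLD:
            # High frequency keyword: each server stores a part of the list.
            for s in range(num_servers):
                servers[s][term] = doc_numbers[s::num_servers]
        else:
            # Low frequency keyword: a single server stores the whole list.
            servers[i % num_servers][term] = list(doc_numbers)
    return servers


def read_keyword(servers, term):
    """Read a keyword's items in parallel from every server holding a part."""
    holders = [s for s in servers if term in s]
    if not holders:
        return set()
    with ThreadPoolExecutor(max_workers=len(holders)) as pool:
        parts = pool.map(lambda s: s[term], holders)
        return set().union(*parts)


def search_and(servers, terms):
    """AND query: intersect the document numbers of all keywords."""
    if not terms:
        return []
    results = [read_keyword(servers, t) for t in terms]
    return sorted(set.intersection(*results))


# Hypothetical data: "news" exceeds the threshold, "tencent" does not.
index = {
    "news": list(range(0, 10000, 2)),
    "tencent": [4, 7, 10, 4000],
}
servers = place_postings(index, num_servers=3)
hits = search_and(servers, ["news", "tencent"])
```

In this sketch the long list for "news" is read from three servers at once, while the short list for "tencent" costs a single read on one server, which mirrors the trade-off the description attributes to high and low frequency keywords.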

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Computer And Data Communications (AREA)

Abstract

The invention relates to a retrieval method comprising the following steps: determining the type of the keyword to be retrieved; when the keyword is a high frequency keyword, N retrieval servers each read the part of the index table entries of the high frequency keyword stored on them, where N is an integer greater than 1; when the keyword is a low frequency keyword, one of the N retrieval servers reads all the index table entries of the low frequency keyword stored on it; and determining the documents that contain the keyword to be retrieved according to the index table entries that were read. The invention also relates to a retrieval system and a retrieval server. The solution effectively improves retrieval speed.
PCT/CN2008/070598 2007-06-26 2008-03-27 Procédé, système et serveur de recherche WO2009000173A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN200710112451.4 2007-06-26
CNB2007101124514A CN100462979C (zh) 2007-06-26 2007-06-26 分布式索引文件的检索方法、检索系统及检索服务器

Publications (1)

Publication Number Publication Date
WO2009000173A1 true WO2009000173A1 (fr) 2008-12-31

Family

ID=38898665

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2008/070598 WO2009000173A1 (fr) 2007-06-26 2008-03-27 Procédé, système et serveur de recherche

Country Status (2)

Country Link
CN (1) CN100462979C (fr)
WO (1) WO2009000173A1 (fr)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105095209A (zh) * 2014-04-21 2015-11-25 北京金山网络科技有限公司 文档聚类方法及装置、网络设备

Families Citing this family (26)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN100462979C (zh) * 2007-06-26 2009-02-18 腾讯科技(深圳)有限公司 分布式索引文件的检索方法、检索系统及检索服务器
US8386929B2 (en) * 2010-06-22 2013-02-26 Microsoft Corporation Personal assistant for task utilization
US9229946B2 (en) 2010-08-23 2016-01-05 Nokia Technologies Oy Method and apparatus for processing search request for a partitioned index
CN102479207B (zh) * 2010-11-29 2013-07-03 阿里巴巴集团控股有限公司 一种信息搜索的方法、系统及信息搜索设备
US10192176B2 (en) 2011-10-11 2019-01-29 Microsoft Technology Licensing, Llc Motivation of task completion and personalization of tasks and lists
CN103064841A (zh) * 2011-10-20 2013-04-24 北京中搜网络技术股份有限公司 检索装置和检索方法
CN103810220B (zh) * 2012-11-15 2018-02-27 腾讯科技(深圳)有限公司 一种微博搜索方法及装置
CN103455619B (zh) * 2013-09-12 2016-09-07 焦点科技股份有限公司 一种基于Lucene分片结构的打分处理方法及系统
CN104679778B (zh) * 2013-11-29 2019-03-26 腾讯科技(深圳)有限公司 一种搜索结果的生成方法及装置
CN103678697A (zh) * 2013-12-26 2014-03-26 乐视网信息技术(北京)股份有限公司 倒排索引存储方法及其系统
CN105335373A (zh) * 2014-06-17 2016-02-17 阿里巴巴集团控股有限公司 信息搜索方法及装置
CN105608022B (zh) * 2014-11-25 2017-08-01 南方电网科学研究院有限责任公司 一种基于倒排技术的智能安全芯片的指令分发方法和系统
CN104778200A (zh) * 2015-01-13 2015-07-15 东莞中山大学研究院 一种结合历史数据的异构处理大数据检索的方法
CN106156166B (zh) * 2015-04-16 2020-11-10 深圳市腾讯计算机系统有限公司 关系链查询系统、文档检索方法、索引建立方法及装置
CN106156000B (zh) * 2015-04-28 2020-03-17 腾讯科技(深圳)有限公司 基于求交算法的搜索方法及搜索系统
CN105447162B (zh) * 2015-12-01 2021-06-25 腾讯科技(深圳)有限公司 群组文件搜索方法和装置
CN105653646B (zh) * 2015-12-28 2019-06-04 北京中电普华信息技术有限公司 一种并发查询条件下的动态查询系统及方法
CN106055622A (zh) * 2016-05-26 2016-10-26 浪潮软件集团有限公司 一种数据搜索方法及系统
CN107436911A (zh) * 2017-05-24 2017-12-05 阿里巴巴集团控股有限公司 模糊查询方法、装置及查询系统
CN107145603A (zh) * 2017-06-08 2017-09-08 上海德衡数据科技有限公司 一种针对关键词的网络文档搜索引擎框架
CN108520051A (zh) * 2018-04-04 2018-09-11 湖南蚁坊软件股份有限公司 一种提升Apache Lucene修饰符搜索性能的方法
CN110532347B (zh) * 2019-09-02 2023-12-22 北京博睿宏远数据科技股份有限公司 一种日志数据处理方法、装置、设备和存储介质
CN112836008B (zh) * 2021-02-07 2023-03-21 中国科学院新疆理化技术研究所 基于去中心化存储数据的索引建立方法
CN113923209B (zh) * 2021-09-29 2023-07-14 北京轻舟智航科技有限公司 一种基于LevelDB进行批量数据下载的处理方法
CN113824804A (zh) * 2021-11-24 2021-12-21 飞狐信息技术(天津)有限公司 一种关键词检测的方法及相关装置
CN117851538A (zh) * 2024-03-07 2024-04-09 济南浪潮数据技术有限公司 一种分布式检索方法、系统、设备及介质

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20050198068A1 (en) * 2004-03-04 2005-09-08 Shouvick Mukherjee Keyword recommendation for internet search engines
CN1975729A (zh) * 2005-12-02 2007-06-06 国际商业机器公司 搜索文本中关键词的系统及其方法
CN101071442A (zh) * 2007-06-26 2007-11-14 腾讯科技(深圳)有限公司 分布式索引文件的检索方法、检索系统及检索服务器

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080195595A1 (en) * 2004-11-05 2008-08-14 Intellectual Property Bank Corp. Keyword Extracting Device
CN1936887A (zh) * 2005-09-22 2007-03-28 国家计算机网络与信息安全管理中心 基于类别概念空间的自动文本分类方法

Also Published As

Publication number Publication date
CN101071442A (zh) 2007-11-14
CN100462979C (zh) 2009-02-18

Similar Documents

Publication Publication Date Title
WO2009000173A1 (fr) Procédé, système et serveur de recherche
US11249971B2 (en) Segmenting machine data using token-based signatures
US7007015B1 (en) Prioritized merging for full-text index on relational store
US6754799B2 (en) System and method for indexing and retrieving cached objects
EP2973018B1 (fr) Procédé pour accélérer des interrogations à l'aide de formats de données de remplacement générés dynamiquement dans une mémoire cache flash
Tomasic et al. Performance of inverted indices in shared-nothing distributed text document information retrieval systems
US7058783B2 (en) Method and mechanism for on-line data compression and in-place updates
US7185019B2 (en) Performant and scalable merge strategy for text indexing
US6209003B1 (en) Garbage collection in an object cache
US9959347B2 (en) Multi-layer search-engine index
US20120327956A1 (en) Flow compression across multiple packet flows
Cambazoglu et al. Scalability challenges in web search engines
US20080082554A1 (en) Systems and methods for providing a dynamic document index
CN104679898A (zh) 一种大数据访问方法
WO2008154823A1 (fr) Procédé, système et dispositif de recherche
CN104778270A (zh) 一种用于多文件的存储方法
Williams et al. What's Next? Index Structures for Efficient Phrase Querying.
US9262511B2 (en) System and method for indexing streams containing unstructured text data
JP3499105B2 (ja) 情報検索方法および情報検索装置
US20070220064A1 (en) Fault tolerance scheme for distributed hyperlink database
CN102201007A (zh) 一种大规模数据搜索系统
Zhang et al. Efficient search in large textual collections with redundancy
Jonassen et al. A combined semi-pipelined query processing architecture for distributed full-text retrieval
Henrique et al. A new approach for verifying url uniqueness in web crawlers
CN117539915B (zh) 一种数据处理方法及相关装置

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 08715334

Country of ref document: EP

Kind code of ref document: A1

WWE Wipo information: entry into national phase

Ref document number: 7315/CHENP/2009

Country of ref document: IN

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 10/06/2010)

122 Ep: pct application non-entry in european phase

Ref document number: 08715334

Country of ref document: EP

Kind code of ref document: A1