CN114661666B - Data searching method, device, equipment and storage medium - Google Patents
- Publication number
- CN114661666B (application CN202210204466.8A)
- Authority
- CN
- China
- Prior art keywords
- identification value
- file identification
- partition
- target
- inverted
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/10—File systems; File servers
- G06F16/14—Details of searching files based on file metadata
- G06F16/148—File search processing
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/10—File systems; File servers
- G06F16/13—File access structures, e.g. distributed indices
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Library & Information Science (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The embodiments of the present application provide a data search method, apparatus, device and storage medium. In the embodiments of the present application, multiple inverted lists are divided into a first inverted list and a second inverted list, the second inverted list is divided into a first partition and a second partition, and a target file identification value taken from the first inverted list is searched for in the first partition by a sequential search method and in the second partition by a skip list search method; when the target file identification value is searched for in the second partition, the target area where the target file identification value is located can be determined according to the currently found file identification value, and the search for the target file identification value continues in the target area using any search method. In this way, a suitable search method can be used in each area according to the distribution of file identification values in the inverted list, so that the target file identification value is found quickly and search efficiency is improved.
Description
Technical Field
The present application relates to the field of search engine technologies, and in particular, to a data search method, apparatus, device, and storage medium.
Background
In search engines, the most widely used data structure is the inverted index (Inverted Index). An inverted index consists of two parts, index words (Term) and inverted lists (Posting List). Index words and inverted lists are in one-to-one correspondence, and each inverted list stores the document identifiers (DocID) of the documents containing the corresponding index word, together with information such as the positions and frequency at which the index word appears in each document.
An inverted list is typically an ordered sequence, for example ordered by document identifier. Once the search engine has determined the target index words, it generally finds the target documents that contain all of the target index words by intersecting the inverted lists corresponding to those index words. However, because inverted indexes are built in different ways, the data in inverted lists is distributed differently, and a given intersection method is not necessarily suitable for every inverted list, which can easily reduce the efficiency and accuracy of data search.
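As an illustrative sketch only, the structure described above can be modeled as a mapping from each index word to an ordered list of entries. The function name `build_inverted_index`, the whitespace tokenization and the sample documents are assumptions made for this example and do not come from the application.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each index word to a list of (doc_id, frequency, positions).

    Entries are appended in ascending doc_id order, so every inverted
    list is an ordered sequence, as the background section describes.
    """
    index = defaultdict(list)
    for doc_id in sorted(docs):
        positions = defaultdict(list)
        # Hypothetical tokenizer: lowercase and split on whitespace.
        for pos, term in enumerate(docs[doc_id].lower().split()):
            positions[term].append(pos)
        for term, pos_list in positions.items():
            index[term].append((doc_id, len(pos_list), pos_list))
    return dict(index)


docs = {1: "fast search engine", 2: "search index search", 3: "inverted index"}
index = build_inverted_index(docs)
```

Here `index["search"]` holds one entry for document 1 and one for document 2, each with the frequency and positions of the word in that document.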
Disclosure of Invention
Aspects of the present application provide a data search method, apparatus, device and storage medium, which are used to intersect inverted lists using a suitable search method so as to quickly find the target document identifier.
The embodiments of the present application provide a data search method, including: acquiring a first inverted list and a second inverted list according to a search request, wherein the first inverted list stores file objects corresponding to a first index word, the second inverted list stores file objects corresponding to a second index word, and each file object includes a file identification value; acquiring, from the first inverted list, the file identification value to be queried this time as the current target file identification value; searching for the target file identification value in a first partition of the second inverted list using a first search method; if the target file identification value is not found in the first partition, performing at least one skip list search in a second partition of the second inverted list to continue searching for the target file identification value; if a second file identification value larger than the target file identification value is found in the second partition, determining a target area, and searching for the target file identification value in the target area using a second search method; wherein the file identification values of the file objects stored in the target area lie between a first file identification value and the second file identification value; the first file identification value is the file identification value, smaller than the target file identification value, found by the most recent skip list search; the first search method is a sequential search method; and the second search method is the same as or different from the first search method.
In an optional embodiment, the second search method is a binary search method, and searching for the target file identification value in the target area using the second search method includes: determining a first sub-partition and a second sub-partition for the binary search according to the first file identification value and the second file identification value; and continuing to search for the target file identification value in the first sub-partition and the second sub-partition using the binary search method.
In an optional embodiment, before searching for the target file identification value in the first partition of the second inverted list using the first search method, the method further includes: according to a set maximum length M allowed for sequential search, and in combination with the current position of the search pointer, dividing the area of the second inverted list that stores file objects and is located after the current position of the search pointer into a first partition and a second partition, wherein M is a positive integer greater than 1; or, according to the size of the current target file identification value, and in combination with the current position of the search pointer, dividing the area of the second inverted list that stores file objects and is located after the current position of the search pointer into a first partition and a second partition; or, dynamically adjusting the maximum length M allowed for sequential search according to the difference between the current target file identification value and the previous target file identification value, and, according to the adjusted maximum length M and in combination with the current position of the search pointer, dividing the area of the second inverted list that stores file objects and is located after the current position of the search pointer into a first partition and a second partition.
In an optional embodiment, performing at least one skip list lookup in the second partition of the second inverted list to continue looking up the target file identification value includes: selecting a target skip list searching mode from a plurality of skip list searching modes, wherein skip list intervals corresponding to different skip list searching modes are different; and starting to search the skip list from the current position of the search pointer according to the skip list interval corresponding to the target skip list search mode.
In an optional embodiment, selecting a target skip list lookup manner from multiple skip list lookup manners includes: and selecting a target skip list searching mode from multiple skip list searching modes according to the length of the second inverted list or the second partition.
In an optional embodiment, selecting a target skip list lookup manner from multiple skip list lookup manners according to the length of the second inverted list or the second partition includes: if the length of the second inverted list or the second partition is smaller than a preset length threshold, selecting the skip list lookup manner whose skip list interval is 2 to the power N as the target skip list lookup manner; if the length of the second inverted list or the second partition is larger than the preset length threshold, selecting the skip list lookup manner whose skip list interval is 2 to the power 2N as the target skip list lookup manner; wherein N is a positive integer.
In an optional embodiment, acquiring the first inverted list and the second inverted list according to the search request includes: determining a plurality of index words to be searched according to the search request, wherein the plurality of index words correspond to a plurality of inverted lists; selecting the inverted list with the shortest length from the plurality of inverted lists as the first inverted list, and taking the index word corresponding to the first inverted list as the first index word; and taking the remaining inverted lists and their corresponding index words as second inverted lists and second index words, respectively.
In an optional embodiment, the method further comprises: and in the plurality of second inverted lists, sequentially searching the target file identification value in each second inverted list according to the sequence of the lengths of the inverted lists from small to large.
In an optional embodiment, if the target file identification value is found in the first partition, or in the second partition, or in the target area, the target file identification value is output and the current position of the search pointer is recorded, so that the search for the next target file identification value continues from the current position of the search pointer.
An embodiment of the present application further provides a data search apparatus, including: an acquisition module, a first query module, a second query module and a third query module; the acquisition module is configured to receive a search request and acquire a first inverted list and a second inverted list according to the search request, wherein the first inverted list stores file objects corresponding to a first index word, the second inverted list stores file objects corresponding to a second index word, and each file object includes a file identification value; the first query module is configured to acquire, from the first inverted list, the file identification value to be queried this time as the current target file identification value, and to search for the target file identification value in a first partition of the second inverted list using a first search method; the second query module is configured to perform at least one skip list search in a second partition of the second inverted list to continue searching for the target file identification value when the target file identification value is not found in the first partition; the third query module is configured to determine a target area when a second file identification value larger than the target file identification value is found in the second partition, and to search for the target file identification value in the target area using a second search method; wherein the file identification values of the file objects stored in the target area lie between a first file identification value and the second file identification value; the first file identification value is the file identification value, smaller than the target file identification value, found by the most recent skip list search; the first search method is a sequential search method; and the second search method is the same as or different from the first search method.
An embodiment of the present application further provides a computer device, including: a processor and a memory, the memory storing a computer program for implementing a search engine, wherein the steps of any of the above methods are implemented when the computer program is executed by the processor.
Embodiments of the present application also provide a computer-readable storage medium, wherein when the instructions in the computer-readable storage medium are executed by a processor of a computer device, the computer device is enabled to execute any one of the steps of the method.
In the embodiments of the present application, multiple inverted lists are divided into a first inverted list and a second inverted list, the second inverted list is divided into a first partition and a second partition, and a target file identification value taken from the first inverted list is searched for in the first partition by a sequential search method and in the second partition by a skip list search method; when the target file identification value is searched for in the second partition, the target area where the target file identification value is located can be determined according to the currently found file identification value, and the search for the target file identification value continues in the target area using any search method. In this way, a suitable search method can be used in each area according to the distribution of file identification values in the inverted list, so that the target file identification value is found quickly and search efficiency is improved.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the application and together with the description serve to explain the application and not to limit the application. In the drawings:
FIG. 1a is a flowchart of a data search method according to an embodiment of the present application;
FIG. 1b is a schematic diagram of an implementation process of a data search method according to an embodiment of the present application;
FIG. 1c is a flowchart of another data search method according to an embodiment of the present application;
FIG. 2 is a schematic structural diagram of a data search apparatus according to an embodiment of the present application;
FIG. 3 is a schematic structural diagram of a computer device according to an embodiment of the present application.
Detailed Description
To make the objects, technical solutions and advantages of the present application more clear, the technical solutions of the present application will be clearly and completely described below with reference to specific embodiments of the present application and the accompanying drawings. It should be apparent that the described embodiments are only some of the embodiments of the present application, and not all of the embodiments. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments in the present application without making any creative effort belong to the protection scope of the present application.
In order to improve the efficiency of finding intersections of inverted lists, an embodiment of the present application provides a data search method, where fig. 1a is a flowchart of the data search method, and as shown in fig. 1a, the method includes:
S1, acquiring a first inverted list and a second inverted list according to a search request, wherein the first inverted list stores file objects corresponding to a first index word, the second inverted list stores file objects corresponding to a second index word, and each file object includes a file identification value;
S2, acquiring, from the first inverted list, the file identification value to be queried this time as the current target file identification value, and searching for the target file identification value in a first partition of the second inverted list using a first search method;
S3, if the target file identification value is not found in the first partition, performing at least one skip list search in a second partition of the second inverted list to continue searching for the target file identification value;
S4, if a second file identification value larger than the target file identification value is found in the second partition, determining a target area, and searching for the target file identification value in the target area using a second search method; the file identification values of the file objects stored in the target area lie between a first file identification value and the second file identification value;
wherein the first file identification value is the file identification value, smaller than the target file identification value, found by the most recent skip list search, the first search method is a sequential search method, and the second search method may be the same as or different from the first search method.
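As a minimal sketch of steps S2 to S4 for a single second inverted list, assuming the list is held in memory as a sorted Python list of file identification values: the function name `find_in_second_list`, the default values of `m` and `skip`, and the choice of binary search (via the standard `bisect` module) as the second search method are assumptions made for illustration. The sketch also examines the tail of the list when no larger value is probed, which is a choice of this example rather than something stated in the text.

```python
from bisect import bisect_left


def find_in_second_list(doc_ids, target, start=0, m=4, skip=8):
    """Return (found, pointer) for target in sorted doc_ids, searching from start."""
    n = len(doc_ids)

    # S2: sequential search in the first partition (at most m entries).
    first_end = min(start + m, n)
    for i in range(start, first_end):
        if doc_ids[i] == target:
            return True, i
        if doc_ids[i] > target:
            return False, i  # sorted list: target cannot appear later

    # S3: skip search in the second partition, probing every `skip` entries.
    lo = first_end  # lowest index that has not been ruled out yet
    probe = first_end
    while probe < n:
        if doc_ids[probe] == target:
            return True, probe
        if doc_ids[probe] > target:
            break  # this probe holds the "second file identification value"
        lo = probe + 1  # this probe holds the "first file identification value"
        probe += skip

    # S4: binary search in the target area between the last two probes.
    hi = min(probe, n)
    pos = bisect_left(doc_ids, target, lo, hi)
    return (pos < hi and doc_ids[pos] == target), pos


ids = list(range(1, 40, 2))  # 1, 3, 5, ..., 39
hit = find_in_second_list(ids, 27)
miss = find_in_second_list(ids, 26)
```

The returned pointer is the index at which the next, larger target can resume, which matches the pointer-recording behavior described later in the description.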
As described above, because inverted indexes are built in different ways, the data in inverted lists is distributed differently, and a given intersection method is not necessarily suitable for every inverted list. Based on an analysis of various search methods, the binary search method is better suited to the case where the target data is uniformly distributed in the inverted list, the skip list search method is better suited to the case where the target data is distributed at the two ends of the inverted list, and the exponential search method is better suited to the case where the target data is distributed at the front of the inverted list.
Therefore, in order to improve the search efficiency, it is necessary to perform intersection operation on each inverted list by using an appropriate search method according to the characteristics of the data distribution in the inverted list.
In the embodiments of the present application, a plurality of index words contained in a search request can be determined according to the search request, and the inverted list corresponding to each index word can then be determined; the inverted list corresponding to each index word includes a plurality of file objects, each of which contains that index word. Optionally, before the inverted lists are intersected, a reference inverted list may be determined from the inverted lists corresponding to the index words, so that the intersection with the other inverted lists is computed against the reference inverted list.
The file object may take any data form that can be searched by the scheme of the embodiments of the present application, such as a document, audio, video, picture or program. Taking a document as an example, a document specifically refers to a data medium and the data recorded on it. Of course, in the embodiments of the present application, file objects in other forms may also be converted into document form and then searched in the manner described in the embodiments of the present application, which is not limited herein.
In the embodiments of the present application, the reference inverted list is referred to as the first inverted list, and the other inverted lists are referred to as second inverted lists. The embodiments of the present application do not limit the specific manner of determining the first inverted list. Optionally, the inverted list with the shortest length may be selected from the multiple inverted lists as the first inverted list, the index word corresponding to the first inverted list is taken as the first index word, and the remaining inverted lists and their corresponding index words are taken as the second inverted lists and second index words, respectively. Further optionally, before the first inverted list is determined, the inverted lists may be sorted by length from shortest to longest, and the shortest one selected as the first inverted list. Of course, the specific implementation is not limited thereto and can be chosen flexibly according to actual requirements.
In the embodiments of the present application, since each inverted list includes a plurality of file objects, when the multiple inverted lists are intersected, the target file objects that appear in every inverted list can be determined according to the file identification values of the file objects in each inverted list. On this basis, when the multiple inverted lists are intersected with the first inverted list as the reference, the file identification value of each file object may be taken from the first inverted list in turn as the target file identification value, and the target file identification value is searched for in each of the second inverted lists in turn, in order of inverted list length from shortest to longest; if the current target file identification value is found in every second inverted list, the target file object corresponding to the current target file identification value contains all of the index words.
In the embodiments of the present application, before the target file identification value is searched for in each second inverted list, each second inverted list may be divided into partitions, and the target file identification value is searched for in the different partitions in turn. For example, each second inverted list may be divided in sequence into a first partition, a second partition and a target area; the first partition is used for a preliminary search for the target file identification value, and the second partition is used to determine the target area where the target file identification value is located.
When the target file identification value is searched for in the partitions, if it is not found in the first partition, the search continues in the second partition; if the target area where the target file identification value is located cannot be determined in the second partition, the current second inverted list does not contain the file object corresponding to the target file identification value, that is, the file object does not contain the corresponding second index word; if the target area is determined in the second partition, the search for the target file identification value continues in the target area; if the target file identification value is not found in the target area, the current second inverted list likewise does not contain the file object corresponding to the target file identification value.
In the embodiments of the present application, the file identification values in each inverted list are arranged in ascending order, so the target file identification values taken from the first inverted list in turn increase monotonically; once the current target file identification value has been found in the current second inverted list, the file identification values before the current position of the search pointer need not be examined when searching for the next target file identification value, and the search for the next target file identification value can start from the current position of the search pointer. On this basis, if the target file identification value is found in the first partition, or in the second partition, or in the target area, the target file identification value is output and the current position of the search pointer is recorded, so that the search for the next target file identification value can continue from the current position of the search pointer.
It should be noted that, in the embodiments of the present application, the second inverted list need not be partitioned only once. As described in the foregoing embodiments, the search for the next target file identification value in the second inverted list continues from the current position of the search pointer, so the partitions of the second inverted list are also determined relative to the current position of the search pointer; before the next target file identification value is searched for, the current second inverted list is partitioned again according to the current position of the search pointer. The embodiments of the present application do not limit the specific manner of partitioning the second inverted list, which may be set flexibly according to actual requirements.
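The intersection procedure and the role of the recorded search pointer can be sketched as follows. This is an illustration under stated assumptions: the function name `intersect` is hypothetical, the inverted lists are plain sorted Python lists of file identification values, and the standard `bisect_left` call stands in for the partitioned search inside each second inverted list.

```python
from bisect import bisect_left


def intersect(inverted_lists):
    """Intersect sorted lists of file identification values."""
    if not inverted_lists:
        return []
    # The shortest list is the first (reference) inverted list; the
    # remaining second inverted lists are visited shortest first.
    ordered = sorted(inverted_lists, key=len)
    first, others = ordered[0], ordered[1:]
    pointers = [0] * len(others)  # one search pointer per second list
    result = []
    for target in first:
        in_all = True
        for k, second in enumerate(others):
            # Resume from the recorded pointer: targets only increase,
            # so earlier entries never need to be examined again.
            pos = bisect_left(second, target, pointers[k])
            pointers[k] = pos
            if pos == len(second) or second[pos] != target:
                in_all = False
                break
        if in_all:
            result.append(target)
    return result


common = intersect([[0, 3, 9, 12, 13, 14, 15], [1, 3, 5, 7, 9], [3, 4, 5, 9, 10, 11]])
```

Breaking out of the inner loop as soon as one second inverted list lacks the target avoids searching the longer lists for values that cannot be in the intersection.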
In an optional embodiment, according to a set maximum length M allowed for sequential search, and in combination with the current position of the search pointer, the area of the second inverted list that stores file objects and is located after the current position of the search pointer may be divided into a first partition and a second partition, where M is a positive integer greater than 1. In this embodiment, the maximum length M allowed for sequential search is the maximum number of file identification values that may be examined in the first partition. This embodiment does not limit the specific value of M, which may be set flexibly according to the characteristics of the inverted list. Optionally, if the difference between the file identification values of adjacent file objects is small, or the file identification value searched for each time is usually near the current position of the search pointer, M may take a small value, for example 2, 4 or 6; if the difference between the file identification values of adjacent file objects is large, or the file identification value searched for each time is usually far from the current position of the search pointer, M may take a larger value, for example 10, 15 or 20; and so on.
In another optional embodiment, according to the size of the current target file identification value, and in combination with the current position of the search pointer, the area of the second inverted list that stores file objects and is located after the current position of the search pointer may be divided into a first partition and a second partition. In this embodiment, each time before the second inverted list is partitioned, the specific value of M may be determined according to the size of the current target file identification value. Optionally, if the current target file identification value is small, M may take a small value; if the current target file identification value is large, M may take a larger value. For example, if the current target file identification value is less than 10, M may be a positive integer within 10; if the current target file identification value is between 20 and 50, M may be a positive integer within 30; and so on.
In another optional embodiment, the maximum length M allowed for sequential search may be adjusted dynamically according to the difference between the current target file identification value and the previous target file identification value. Optionally, if the difference between the current target file identification value and the previous target file identification value is small, M may take a small value; if the difference is large, M may take a larger value. For example, if the difference between the two is less than 10, M may be a positive integer within 10; if the difference is between 20 and 50, M may be a positive integer within 30; and so on.
It should be noted that the values of M in the foregoing embodiments are merely examples and are not limiting; in addition, the actual scheme is not limited to a specific implementation form, and any two or all three of the above manners may be combined to determine the value of M. Further, once the maximum length M allowed for sequential search has been determined, the area of the second inverted list that stores file objects and is located after the current position of the search pointer may be divided, in combination with the current position of the search pointer, into a first partition and a second partition; the first partition is the area storing the M file objects after the current position of the search pointer, and the second partition is the area after the first partition.
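The three ways of choosing M described above can be combined in one small helper. This is only a sketch: the function name `choose_m`, the cut-off points 10 and 50, and the returned values 4, 16 and 32 are hypothetical and were picked to stay within the spirit of the examples in the text, not taken from it.

```python
def choose_m(target, previous_target=None, fixed_m=None):
    """Pick the maximum sequential-search length M (always greater than 1).

    fixed_m set              -> use the preset M (first manner)
    previous_target is None  -> decide from the target's size (second manner)
    previous_target set      -> decide from the gap between targets (third manner)
    """
    if fixed_m is not None:
        return fixed_m
    basis = target if previous_target is None else target - previous_target
    # Hypothetical thresholds: small basis -> short sequential scan.
    if basis < 10:
        return 4
    if basis <= 50:
        return 16
    return 32


m_small_gap = choose_m(40, previous_target=35)
m_large_gap = choose_m(500, previous_target=100)
```

A small gap between consecutive targets suggests the next value lies close to the search pointer, so a short sequential scan is likely to find it before any skip search is needed.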
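Given M and the current position of the search pointer, the two partitions reduce to two index ranges. In the sketch below, the function name `split_partitions` is hypothetical, and counting the first partition from the pointer position itself (rather than from the entry after it) is an assumption of this example.

```python
def split_partitions(list_length, pointer, m):
    """Return (first_partition, second_partition) as index ranges."""
    if m <= 1:
        raise ValueError("M must be a positive integer greater than 1")
    # First partition: up to M entries starting at the search pointer.
    first_end = min(pointer + m, list_length)
    # Second partition: everything after the first partition.
    return range(pointer, first_end), range(first_end, list_length)


first, second = split_partitions(list_length=20, pointer=5, m=4)
```

Because the ranges are recomputed from the pointer each time, the partitions shift forward as successive target file identification values are processed.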
Based on the above, after the second inverted list is partitioned, the target file identification value may be searched in the first partition by a sequential search method; if the target file identification value is not found in the first partition, at least one skip list lookup is performed in the second partition to continue searching for the target file identification value. Optionally, before the skip list lookup is performed in the second partition, a target skip list lookup manner may be selected from multiple skip list lookup manners, where different skip list lookup manners correspond to different skip intervals. Further, once the target skip list lookup manner is determined, the skip list lookup may be performed starting from the current position of the search pointer according to the skip interval corresponding to the target skip list lookup manner.
In the optional embodiments of the present application, the way in which the skip list lookup manner is determined is not limited. Optionally, to improve lookup efficiency, a target skip list lookup manner may be selected from multiple skip list lookup manners according to the length of the second inverted list or of the second partition, where different skip list lookup manners correspond to different skip intervals. For example, if the length of the second inverted list or the second partition is smaller than a preset length threshold, the current second inverted list or second partition is determined to be a short list, and the skip list lookup manner whose skip interval is 2 to the power N is selected as the target skip list lookup manner. For another example, if the length of the second inverted list or the second partition is greater than the preset length threshold, the current second inverted list or second partition is determined to be a long list, and the skip list lookup manner whose skip interval is 2 to the power 2N is selected as the target skip list lookup manner. Here N is a positive integer. In this embodiment, the specific value of the preset length threshold may be set flexibly as required; optionally, it may be set to 100, 200, or 300, and so on, which is not limited herein.
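As a hedged illustration of this selection rule, the following Python sketch picks a skip interval from the list length. The function name `choose_skip_interval` and its default arguments are assumptions; the text does not say how a length exactly equal to the threshold is handled, so the sketch arbitrarily treats it as a long list.

```python
def choose_skip_interval(length, n=1, length_threshold=100):
    """Select the skip interval from the list (or second-partition) length.

    Short lists use an interval of 2**n, long lists use 2**(2*n).
    A length equal to the threshold is treated as long (an assumption).
    """
    if n < 1:
        raise ValueError("n must be a positive integer")
    if length < length_threshold:
        return 2 ** n          # short list: smaller jumps
    return 2 ** (2 * n)        # long list: larger jumps
```

With `n=2` and the default threshold of 100, a list of 50 entries is searched with an interval of 4, and a list of 300 entries with an interval of 16.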
Taking a target file identification value of 95 obtained from the first inverted list as an example, the process of searching for the target file identification value in the second inverted list in the embodiment of the present application is described below. Fig. 1b is a schematic diagram of the search process. As shown in fig. 1b, assuming that the maximum search length corresponding to the first partition of the second inverted list is 2, the target file identification value 95 is first searched in the area of the second inverted list storing the first two file objects, and is not found. Further, starting from the current position of the search pointer, the target file identification value 95 continues to be searched in the second partition using the skip list lookup manner whose skip interval is 2 to the power N. When the third skip finds a file identification value of 105, the currently found file identification value is judged to be larger than the target file identification value 95, the skip list lookup stops, and the area between the previous skip position of the search pointer and its current position is determined to be the target area; that is, the area between file identification value 50 and file identification value 105 is the target area. Further, the target file identification value 95 continues to be searched in the target area by a binary search method; as shown in fig. 1b, after 2 binary search steps in the target area, the target file identification value 95 is found. Further, the current position of the search pointer may be recorded so that the next target file identification value can be searched starting from that position, and the target file identification value 95 continues to be searched in the next second inverted list, so as to obtain the intersection of all the inverted lists.
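The three stages of this example (sequential search, skip list lookup, binary search in the target area) can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the implementation of the embodiment: the function name `lookup`, the sample list, and the return convention `(found, new_pointer)` are hypothetical, the list is assumed to hold file identification values in ascending order, and the last skip is clamped to the final entry so that the tail of the list is not missed. The sample data is chosen so that the third skip lands on 105 and the previous skip on 50, but it is not the data of fig. 1b.

```python
from bisect import bisect_left


def lookup(ids, target, pointer=0, m=2, interval=2):
    """Search `target` in the ascending list `ids`, starting at `pointer`.

    Returns (found, new_pointer); new_pointer is where the search for the
    next, larger target may resume.
    """
    n = len(ids)
    # Stage 1: sequential search over at most m entries (first partition).
    first_end = min(pointer + m, n)
    for i in range(pointer, first_end):
        if ids[i] == target:
            return True, i
        if ids[i] > target:          # sorted list: target cannot appear later
            return False, i
    # Stage 2: skip lookup in the second partition.
    prev = first_end - 1             # last position known to hold a smaller value
    while prev < n - 1:
        pos = min(prev + interval, n - 1)   # clamp the last skip (assumption)
        if ids[pos] == target:
            return True, pos
        if ids[pos] > target:
            # Stage 3: binary search in the target area between prev and pos.
            k = bisect_left(ids, target, prev + 1, pos)
            return (k < pos and ids[k] == target), k
        prev = pos
    return False, n                  # every remaining value is smaller


# Hypothetical second inverted list; with m=2 and interval=4 the skips
# land on 20, 50 and 105, so the target area lies between 50 and 105.
sample_ids = [1, 2, 5, 9, 12, 20, 25, 30, 41, 50, 60, 72, 95, 105, 120]
print(lookup(sample_ids, 95, pointer=0, m=2, interval=4))  # (True, 12)
```

Returning the pointer alongside the result mirrors the step of recording the current search position, so that the next target file identification value does not have to be searched from the start of the list.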
It should be noted that the above embodiment takes as an example the case in which the target file identification value is searched in each partition and is finally found in the target area; for other cases, reference may be made to the overall flowchart shown in fig. 1c.
As shown in fig. 1c, when a search request is received, a plurality of index words may be obtained from the search request, and a plurality of inverted lists respectively corresponding to the plurality of index words may be determined; further, the plurality of inverted lists may be sorted by list length from short to long, with the shortest inverted list taken as the first inverted list and the other inverted lists taken as second inverted lists, and target file identification values may be selected sequentially from the first inverted list.
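The ordering step can be sketched in a few lines of Python. The function name `order_inverted_lists` and the dictionary-based index are assumptions for illustration; an index word that is missing from the index is treated as having an empty inverted list.

```python
def order_inverted_lists(index, words):
    """Return (first_list, second_lists) for the given index words.

    `index` maps an index word to its inverted list of file identification
    values. The shortest list becomes the first inverted list; the rest are
    returned from short to long as the second inverted lists.
    """
    lists = sorted((index.get(word, []) for word in words), key=len)
    if not lists:
        return [], []
    return lists[0], lists[1:]
```

Driving the search from the shortest list bounds the number of target file identification values that ever need to be looked up, since the intersection cannot be larger than that list.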
Further, before the target file identification value is searched in a second inverted list, the second inverted list may be divided into a first partition and a second partition, and the target file identification value may be searched in the first partition by a sequential search method. If the target file identification value is found in the first partition, the current position of the search pointer is recorded, and the next target file identification value is acquired to continue searching. If the target file identification value is not found in the first partition, a corresponding skip list lookup manner is selected according to the length of the second inverted list or the second partition, and the target file identification value continues to be searched in the second partition. If the target file identification value is found in the second partition, the current position of the search pointer is recorded, and the next target file identification value is acquired to continue searching. If a file identification value larger than the target file identification value is found in the second partition, the skip list lookup stops, a target area is determined, and the target file identification value continues to be searched in the target area by a binary search method. If no file identification value larger than the target file identification value is found in the second partition, the search has failed and ends. If the target file identification value is found in the target area, the current position of the search pointer is recorded, and the next target file identification value is acquired to continue searching. If the target file identification value is not found in the target area, the search has failed and ends. If the target file identification value is found in all the second inverted lists, the search has succeeded and ends.
It should be noted that, in the embodiment of the present application, the manner of selecting the next second inverted list is not limited when the current target file identification value has been looked up in one second inverted list and the next second inverted list is to be selected to continue looking up the current target file identification value. Optionally, the second inverted lists may be selected sequentially in order of length from short to long to search for the target file identification value, so as to improve search efficiency. When the current target file identification value is found in all the second inverted lists, the target file object corresponding to the current target file identification value is determined to belong to the intersection of all the inverted lists for the current index words; if the target file object belongs to the intersection of the inverted lists of all the index words, it may be determined that the target file object is a file object that contains all the index words at the same time.
In this embodiment, the multiple inverted lists corresponding to the multiple index words may be divided into a first inverted list and multiple second inverted lists, and target file identification values are sequentially extracted from the first inverted list and searched in each of the multiple second inverted lists, so as to find the intersection of the multiple inverted lists. Each second inverted list is divided into a first partition and a second partition, and the target file identification value is searched in the first partition by a sequential search method and in the second partition by a skip list lookup method, so as to improve search efficiency. When the target file identification value is searched in the second partition, the target area in which the target file identification value is located can be determined according to the currently found file identification value, which further narrows the search range, and the target file identification value is then searched in the target area by a binary search method so that it can be found quickly.
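Putting the pieces together, the intersection of several inverted lists might be computed along the following lines. This is a simplified sketch under assumptions rather than the method as claimed: the names `lookup` and `intersect`, the fixed M, and the default threshold are hypothetical; each list is assumed to hold ascending file identification values; and a failed lookup simply moves on to the next target file identification value.

```python
from bisect import bisect_left


def lookup(ids, target, pointer, m, interval):
    """Sequential search, then skip lookup, then binary search.

    Returns (found, new_pointer) for the ascending list `ids`.
    """
    n = len(ids)
    first_end = min(pointer + m, n)
    for i in range(pointer, first_end):          # first partition
        if ids[i] == target:
            return True, i
        if ids[i] > target:
            return False, i
    prev = first_end - 1
    while prev < n - 1:                          # second partition
        pos = min(prev + interval, n - 1)        # clamp last skip (assumption)
        if ids[pos] == target:
            return True, pos
        if ids[pos] > target:                    # target area found
            k = bisect_left(ids, target, prev + 1, pos)
            return (k < pos and ids[k] == target), k
        prev = pos
    return False, n


def intersect(inverted_lists, m=2, n_exp=1, length_threshold=100):
    """Return the file identification values present in every list."""
    if not inverted_lists:
        return []
    ordered = sorted(inverted_lists, key=len)    # shortest list drives the search
    first, others = ordered[0], ordered[1:]
    pointers = [0] * len(others)                 # one search pointer per second list
    result = []
    for target in first:
        in_all = True
        for j, ids in enumerate(others):         # shorter second lists first
            short = len(ids) < length_threshold
            interval = 2 ** n_exp if short else 2 ** (2 * n_exp)
            found, pointers[j] = lookup(ids, target, pointers[j], m, interval)
            if not found:
                in_all = False
                break                            # no need to check longer lists
        if in_all:
            result.append(target)
    return result
```

Checking the shorter second lists first means a target that is absent from the intersection tends to be rejected before the longer lists are touched at all.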
By the method, the target file identification value can be searched in different areas by adopting a proper searching method according to the distribution condition of the file identification values in the inverted list, so that the target file identification value can be found quickly, and the searching efficiency is improved.
It should be noted that the execution subjects of the steps of the methods provided in the above embodiments may be the same device, or different devices may be used as the execution subjects of the methods. For example, the execution subjects of steps S1 to S4 may be device a; for another example, the execution subject of step S1 may be device a, and the execution subjects of steps S2 to S4 may be device B; and so on.
In addition, in some of the flows described in the above embodiments and the drawings, a plurality of operations occurring in a specific order are included, but it should be clearly understood that these operations may be executed out of the order they appear herein or in parallel, and the order of the operations, such as S1, S2, etc., is merely used to distinguish various operations, and the order itself does not represent any order of execution. Additionally, the flows may include more or fewer operations, and the operations may be performed sequentially or in parallel. It should be noted that, the descriptions of "first", "second", etc. in this document are used for distinguishing different messages, devices, modules, etc., and do not represent a sequential order, nor limit the types of "first" and "second" to be different.
The embodiment of the application also provides a data search apparatus. For example, the data search apparatus may be implemented as a virtual device, such as an application, in a Communication Control Unit (CCU). As shown in fig. 3, the data search apparatus includes an obtaining module 201, a first query module 202, a second query module 203 and a third query module 204; wherein:
an obtaining module 201, configured to receive a search request and obtain a first inverted list and a second inverted list according to the search request, where the first inverted list stores a file object corresponding to a first index word, and the second inverted list stores a file object corresponding to a second index word, where the file object includes a file identification value;
a first query module 202, configured to obtain a file identification value to be queried this time from the first inverted list as a current target file identification value, and to search for the target file identification value in a first partition of the second inverted list by using a first search method;
the second query module 203 is configured to perform at least one skip list lookup in a second partition of the second inverted list to continue to lookup the target file identification value when the target file identification value is not found in the first partition;
a third query module 204, configured to determine a target area when a second file identification value that is greater than the target file identification value is found in the second partition, and to search for the target file identification value in the target area by using a second search method; the file identification values corresponding to the file objects stored in the target area are between a first file identification value and the second file identification value; the first file identification value refers to the file identification value smaller than the target file identification value found by the previous skip list lookup, the first search method is a sequential search method, and the second search method may be the same as or different from the first search method.
It should be noted that, for specific functions and implementation processes of each module in the apparatus, reference may be made to the method embodiment described above, and details are not described herein again.
An embodiment of the present application further provides a computer device. Fig. 3 is a schematic structural diagram of the computer device; as shown in fig. 3, the computer device includes a processor 31 and a memory 32 in which a computer program is stored. There may be one or more processors 31 and one or more memories 32.
The memory 32 is mainly used for storing computer programs, and these computer programs can be executed by the processor 31, so that the processor 31 controls the computer device to implement corresponding functions, and complete corresponding actions or tasks. In addition to storing computer programs, the memory 32 may also be configured to store other various data to support operations on the computer device. Examples of such data include instructions for any application or method operating on a computer device.
The memory 32 may be implemented by any type of volatile or non-volatile memory device, or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, or a magnetic or optical disk.
In the embodiment of the present application, the implementation form of the processor 31 is not limited, and may be, for example, but not limited to, a CPU, a GPU, an MCU, or the like. The processor 31 may be regarded as a control system of the computer device and may be configured to execute a computer program stored in the memory 32 to control the computer device to implement the corresponding functions and to complete the corresponding actions or tasks. It should be noted that, according to the implementation form and the scene of the computer device, the functions, actions or tasks to be implemented may be different; accordingly, the computer programs stored in the memory 32 may be different, and the execution of different computer programs by the processor 31 may control the computer device to perform different functions, perform different actions or tasks.
In some alternative embodiments, as shown in fig. 3, the computer device may further include: display 33, power supply components 34, and communication components 35. Only some of the components are schematically shown in fig. 3, which does not mean that the computer device only includes the components shown in fig. 3, but the computer device may also include other components for different application requirements, for example, in the case where there is a need for voice interaction, as shown in fig. 3, the computer device may also include an audio component 36. The components that may be included in the computer device may depend on the product form of the computer device, and are not limited herein.
In an embodiment of the application, the processor, when executing the computer program in the memory, is configured to: according to the search request, a first inverted arrangement table and a second inverted arrangement table are obtained, the first inverted arrangement table stores file objects corresponding to the first index words, the second inverted arrangement table stores file objects corresponding to the second index words, and the file objects comprise file identification values; acquiring a file identification value to be queried from the first inverted arrangement table as a current target file identification value; searching a target file identification value in a first partition of a second inverted arrangement table by adopting a first searching method; if the target file identification value is not found in the first partition, performing at least one skip list search in a second partition of the second inverted list to continue to find the target file identification value; if a second file identification value larger than the target file identification value is found in the second partition, taking an area, between the first file identification value and the second file identification value, of the second partition for storing the file object as a target area, and searching the target file identification value in the target area by adopting a second searching method; the first file identification value refers to a file identification value smaller than the target file identification value found by the previous skip list, the first search method is a sequential search method, and the second search method may be the same as or different from the first search method.
It should be noted that, for specific functions of the processor in the computer device, reference may be made to the method embodiments described above, and details are not described herein again.
Accordingly, the present application further provides a computer-readable storage medium storing a computer program, where the computer program can implement the steps that can be executed by a computer device in the foregoing method embodiments when executed.
The communication component in the above embodiments is configured to facilitate wired or wireless communication between the device in which the communication component is located and other devices. The device in which the communication component is located can access a wireless network based on a communication standard, such as WiFi, a 2G, 3G, 4G/LTE or 5G mobile communication network, or a combination thereof. In an exemplary embodiment, the communication component receives a broadcast signal or broadcast-related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component further includes a Near Field Communication (NFC) module to facilitate short-range communication. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, Infrared Data Association (IrDA) technology, Ultra-Wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
The display in the above embodiments includes a screen, which may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive an input signal from a user. The touch panel includes one or more touch sensors to sense touch, slide, and gestures on the touch panel. The touch sensor may not only sense the boundary of a touch or slide action, but also detect the duration and pressure associated with the touch or slide operation.
The power supply assembly of the above embodiments provides power to various components of the device in which the power supply assembly is located. The power components may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for the device in which the power component is located.
The audio component in the above embodiments may be configured to output and/or input an audio signal. For example, the audio component includes a Microphone (MIC) configured to receive an external audio signal when the device in which the audio component is located is in an operational mode, such as a call mode, a recording mode, and a voice recognition mode. The received audio signal may further be stored in a memory or transmitted via a communication component. In some embodiments, the audio assembly further comprises a speaker for outputting audio signals.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include volatile memory in a computer-readable medium, such as random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media, including both permanent and non-permanent, removable and non-removable media, may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information that can be accessed by a computing device. As defined herein, a computer-readable medium does not include a transitory computer-readable medium such as a modulated data signal or a carrier wave.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other like elements in a process, method, article, or apparatus comprising the element.
The above description is only an example of the present application and is not intended to limit the present application. Various modifications and changes may occur to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application should be included in the scope of the claims of the present application.
Claims (12)
1. A method of searching data, comprising:
according to a search request, acquiring a first inverted arrangement table and a second inverted arrangement table, wherein the first inverted arrangement table stores file objects corresponding to a first index word, the second inverted arrangement table stores file objects corresponding to a second index word, and the file objects comprise file identification values;
acquiring a file identification value to be queried at this time from the first inverted arrangement table as a current target file identification value; searching the target file identification value in the first partition of the second inverted arrangement table by adopting a first searching method;
if the target file identification value is not found in the first partition, performing at least one skip list search in a second partition of the second inverted list to continuously find the target file identification value;
if a second file identification value which is larger than the target file identification value is found in the second partition, determining a target area, and searching the target file identification value in the target area by adopting a second searching method; the file identification value corresponding to the file object stored in the target area is between the first file identification value and the second file identification value;
the first file identification value refers to a file identification value smaller than the target file identification value found by the last skip list, the first search method is a sequential search method, and the second search method is the same as or different from the first search method.
2. The method of claim 1, wherein the second lookup method is a binary lookup method, and wherein searching the target file identification value in the target region using the second lookup method comprises:
determining a first sub-partition and a second sub-partition searched in a binary manner according to the first file identification value and the second file identification value;
and continuously searching the target file identification value in the first sub-partition and the second sub-partition by adopting a binary search method.
3. The method of claim 1, further comprising, prior to using the first lookup method to lookup the target file identification value in the first partition of the second posting list:
according to a set maximum length M allowed for sequential search, and in combination with the current position of the search pointer, dividing the area of the second inverted arrangement table that is located after the current position of the search pointer and stores file objects into a first partition and a second partition, wherein M is a positive integer greater than 1;
or,
according to the size of the current target file identification value and the current position of a searching pointer, dividing an area, located behind the current position of the searching pointer, of the second inverted arrangement table, for storing a file object into a first partition and a second partition;
or,
dynamically adjusting the maximum length M allowed for sequential search according to the difference between the current target file identification value and the previous target file identification value; and, according to the adjusted maximum length M and in combination with the current position of the search pointer, dividing the area of the second inverted arrangement table that is located after the current position of the search pointer and stores file objects into a first partition and a second partition.
4. The method of claim 1, wherein performing at least one skip list lookup in the second partition of the second posting list to continue looking up the target file identification value comprises:
selecting a target skip list searching mode from multiple skip list searching modes, wherein the skip list intervals corresponding to different skip list searching modes are different;
and starting to search the skip list from the current position of the search pointer according to the skip list interval corresponding to the target skip list search mode.
5. The method of claim 4, wherein selecting a target skip list lookup mode from multiple skip list lookup modes comprises:
and selecting a target skip list lookup mode from multiple skip list lookup modes according to the length of the second inverted list or the second partition.
6. The method of claim 5, wherein selecting a target skip list lookup from a plurality of skip list lookups based on the length of the second inverted list or the second partition comprises:
if the length of the second inverted arrangement table or the second partition is smaller than a preset length threshold, selecting the skip list lookup mode whose skip interval is 2 to the power N as the target skip list lookup mode;
if the length of the second inverted arrangement table or the second partition is larger than the preset length threshold, selecting the skip list lookup mode whose skip interval is 2 to the power 2N as the target skip list lookup mode;
wherein N is a positive integer greater than 0.
7. The method of claim 1, wherein obtaining the first and second inverted lists according to the search request comprises:
determining a plurality of index words to be searched according to the search request, wherein the plurality of index words correspond to a plurality of inverted arrangement lists;
selecting the inverted arrangement table with the shortest length from the plurality of inverted arrangement tables as the first inverted arrangement table, and taking the index word corresponding to the first inverted arrangement table as the first index word; and
taking the remaining inverted arrangement tables and their corresponding index words as second inverted arrangement tables and second index words, respectively.
8. The method of claim 7, further comprising:
and in the plurality of second inverted lists, sequentially searching the target file identification value in each second inverted list according to the sequence of the lengths of the inverted lists from small to large.
9. The method according to any one of claims 1 to 8, wherein if the target file identification value is found in the first partition, or the target file identification value is found in the second partition, or the target file identification value is found in the target area, the target file identification value is output and the current position of the search pointer is recorded, so as to continue searching for the next target file identification value from the current position of the search pointer.
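The pointer behaviour in claim 9 relies on both inverted lists being sorted by file identification value, so the search for the next target never has to rewind. The sketch below illustrates only that resumption idea; it uses a plain sequential scan as a stand-in for the partitioned lookups, and the name `intersect_with_cursor` is hypothetical.

```python
from typing import List


def intersect_with_cursor(first: List[int], second: List[int]) -> List[int]:
    """Intersect two ascending lists while keeping one search pointer.

    Simplified stand-in: a sequential scan replaces the partitioned
    skip list lookups described in the claims.
    """
    cursor = 0  # recorded position of the search pointer in `second`
    hits = []
    for target in first:
        # Resume from the recorded position instead of restarting at 0.
        while cursor < len(second) and second[cursor] < target:
            cursor += 1
        if cursor == len(second):
            break  # `second` is exhausted; no later target can match
        if second[cursor] == target:
            hits.append(target)  # output the value; the pointer stays here
    return hits
```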
10. A data search apparatus, comprising: an acquisition module, a first query module, a second query module and a third query module;
the acquisition module is configured to receive a search request, and acquire a first inverted list and a second inverted list according to the search request, where the first inverted list stores a file object corresponding to a first index word, the second inverted list stores a file object corresponding to a second index word, and the file object includes a file identification value;
the first query module is configured to acquire the file identification value to be queried from the first inverted list as the current target file identification value, and to search for the target file identification value in a first partition of the second inverted list by using a first search method;
the second query module is configured to perform at least one skip list lookup in a second partition of the second inverted list to continue searching for the target file identification value when the target file identification value is not found in the first partition;
the third query module is configured to determine a target area when a second file identification value larger than the target file identification value is found in the second partition, and search the target file identification value in the target area by using a second search method;
wherein the file identification values corresponding to the file objects stored in the target area lie between the first file identification value and the second file identification value; the first file identification value is the file identification value, smaller than the target file identification value, found by the most recent skip list lookup; the first search method is a sequential search method, and the second search method is the same as or different from the first search method.
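The three query modules of claim 10 can be pictured as three stages of one lookup over a sorted second inverted list. The sketch below is an illustration under several assumptions: the list is a plain Python list sorted in ascending order, the first partition is simply its first `first_partition_len` entries, the second search method is also sequential, and all names are hypothetical.

```python
from typing import List, Optional


def find_doc_id(second: List[int], target: int,
                first_partition_len: int, skip: int) -> Optional[int]:
    """Return the index of `target` in ascending list `second`, or None."""
    split = min(first_partition_len, len(second))
    # Stage 1 (first query module): sequential search in the first partition.
    for i in range(split):
        if second[i] == target:
            return i
        if second[i] > target:
            return None  # sorted list: the target cannot appear later
    # Stage 2 (second query module): skip lookups in the second partition.
    lo = split   # first index not yet known to hold a smaller value
    pos = split  # current probe position
    while pos < len(second) and second[pos] < target:
        lo = pos + 1  # probe held the "first file identification value"
        pos += skip
    if pos < len(second) and second[pos] == target:
        return pos
    # Stage 3 (third query module): the target area lies strictly between
    # the last smaller probe and the first larger probe (or the list end).
    hi = min(pos, len(second))
    for i in range(lo, hi):
        if second[i] == target:
            return i
        if second[i] > target:
            break
    return None
```

For example, with `second = list(range(0, 100, 3))`, a first partition of 4 entries and a skip interval of 4, the value 63 is located by probing indices 4, 8, 12, 16, 20 and 24, then scanning the target area that starts at index 21.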
11. A computer device, comprising: a processor and a memory, the memory storing a computer program for implementing a search engine, wherein the computer program, when executed by the processor, implements the method according to any one of claims 1-9.
12. A computer-readable storage medium, wherein instructions stored in the computer-readable storage medium, when executed by a processor of a computer device, enable the computer device to perform the method of any one of claims 1-9.
Priority Applications (1)
Application Number | Publication | Priority Date | Filing Date | Title |
---|---|---|---|---|
CN202210204466.8A | CN114661666B (en) | 2022-03-03 | 2022-03-03 | Data searching method, device, equipment and storage medium |
Applications Claiming Priority (1)
Application Number | Publication | Priority Date | Filing Date | Title |
---|---|---|---|---|
CN202210204466.8A | CN114661666B (en) | 2022-03-03 | 2022-03-03 | Data searching method, device, equipment and storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN114661666A (en) | 2022-06-24 |
CN114661666B (en) | 2023-01-24 |
Family ID: 82027068
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210204466.8A Active CN114661666B (en) | 2022-03-03 | 2022-03-03 | Data searching method, device, equipment and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114661666B (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116541427B (en) * | 2023-06-30 | 2023-11-14 | 腾讯科技(深圳)有限公司 | Data query method, device, equipment and storage medium |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103793439A (en) * | 2012-11-05 | 2014-05-14 | 腾讯科技(深圳)有限公司 | Real-time retrieval information acquisition method, real-time retrieval device, and real-time retrieval server |
CN110222074A (en) * | 2019-06-14 | 2019-09-10 | 北京金山云网络技术有限公司 | Index lookup method, search device, electronic equipment and storage medium |
CN113495945A (en) * | 2020-04-03 | 2021-10-12 | 腾讯科技(深圳)有限公司 | Text search method, text search device and storage medium |
CN113535642A (en) * | 2021-08-05 | 2021-10-22 | 统信软件技术有限公司 | File searching method and computing device |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9152697B2 (en) * | 2011-07-13 | 2015-10-06 | International Business Machines Corporation | Real-time search of vertically partitioned, inverted indexes |
CN108255958B (en) * | 2017-12-21 | 2022-05-03 | 百度在线网络技术(北京)有限公司 | Data query method, device and storage medium |
Also Published As
Publication number | Publication date |
---|---|
CN114661666A (en) | 2022-06-24 |
Similar Documents
Publication | Title |
---|---|
US20200142860A1 (en) | Caseless file lookup in a distributed file system |
CN107015985B (en) | Data storage and acquisition method and device |
CN108572789B (en) | Disk storage method and device, message pushing method and device and electronic equipment |
CN111835985B (en) | Video editing method, device, apparatus and storage medium |
CN111221840A (en) | Data processing method and device, data caching method, storage medium and system |
CN106339260B (en) | Task allocation method and device based on Jenkins platform |
CN114661666B (en) | Data searching method, device, equipment and storage medium |
CN111968640A (en) | Voice control method and device, electronic equipment and storage medium |
US20210022069A1 (en) | Method and apparatus for indicating position of cell-defining synchronization signal block and searching for the same, and base station |
CN111190710A (en) | Task allocation method and device |
CN110874358A (en) | Multi-attribute column storage and retrieval method and device and electronic equipment |
CN110555075B (en) | Data processing method, device, electronic equipment and computer readable storage medium |
CN107203418B (en) | Method and device for selecting resources according to system configuration |
CN114385623A (en) | Data table acquisition method, device, apparatus, storage medium, and program product |
CN111435351B (en) | Database query optimization method, equipment and storage medium |
CN116467027A (en) | Information display method, information display device and storage medium |
CN109063645B (en) | Filtering method, filtering device and storage medium |
CN113297317A (en) | Data table synchronization method and device, electronic equipment and storage medium |
CN110019544B (en) | Data query method and system |
CN112100247B (en) | Method and system for querying data by using ElasticSearch |
CN112486979B (en) | Data processing method, device and system, electronic equipment and computer readable storage medium |
CN112486980A (en) | Data storage method, data retrieval method, data storage device, data retrieval device, electronic equipment and computer-readable storage medium |
CN110019296B (en) | Database query script generation method and device, storage medium and processor |
CN111143711A (en) | Object searching method and system |
CN113190610B (en) | Map color matching method, device and storage medium |
Legal Events
Code | Title |
---|---|
PB01 | Publication |
SE01 | Entry into force of request for substantive examination |
GR01 | Patent grant |