CN112417081A - Method and device for realizing storage of incremental inverted index data - Google Patents

Publication number
CN112417081A
Authority
CN
China
Prior art keywords
index data
memory pool
increment
compressed
inverted
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910774369.0A
Other languages
Chinese (zh)
Inventor
窦文轩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Jingdong Century Trading Co Ltd
Beijing Jingdong Shangke Information Technology Co Ltd
Original Assignee
Beijing Jingdong Century Trading Co Ltd
Beijing Jingdong Shangke Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Jingdong Century Trading Co Ltd and Beijing Jingdong Shangke Information Technology Co Ltd
Priority to CN201910774369.0A
Publication of CN112417081A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/31 Indexing; Data structures therefor; Storage structures
    • G06F16/316 Indexing structures
    • G06F16/319 Inverted lists
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/335 Filtering based on additional data, e.g. user or group profiles

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a method and a device for storing incremental inverted index data, and relates to the technical field of computers. One embodiment of the method comprises: determining the memory space required by the incremental inverted index data to be stored; judging whether the required memory space is larger than the remaining memory space of a pre-allocated memory pool; if so, applying for a new memory pool, storing the incremental inverted index data to be stored in the pre-allocated memory pool and the new memory pool, and marking the pre-allocated memory pool after the storage operation as a memory pool to be compressed; otherwise, storing the incremental inverted index data to be stored in the pre-allocated memory pool; and compressing the inverted index data in the memory pool to be compressed. The method greatly improves the storage performance of incremental inverted index data when the incremental data is uncertain. It also facilitates effective data compression and improves space utilization.

Description

Method and device for realizing storage of incremental inverted index data
Technical Field
The invention relates to the technical field of computers, and in particular to a method and a device for storing incremental inverted index data.
Background
An inverted index, also commonly called a postings file or inverted file, is an indexing method used in full-text search to store a mapping from a word to its storage locations in a document or a group of documents (a word-to-document mapping). It is the most commonly used data structure in document retrieval systems. With an inverted index, the list of documents containing a given word can be retrieved quickly from that word. According to the timeliness and flexibility of the data, inverted indexes are divided into full inverted indexes and incremental inverted indexes. A full inverted index is generally built offline; its structure is immutable and it is slow to load and update, so it cannot be rebuilt frequently. In a commodity search system, merchants or commodities sometimes change frequently, which gave rise to the incremental inverted index; it is built online in real time and has high timeliness.
The basic structure of the incremental inverted index is the correspondence between an inverted term (term) and a set of document identifiers (docids), from which it can be determined which documents a given term appears in. The document identifier may be the internal id of an item in the search engine, and the set of docids is called a doclist. In the prior art, whenever incremental inverted index data needs to be stored, the required storage space is applied for in real time to complete the storage. However, because incremental inverted index data is uncertain, memory space may be applied for frequently, which degrades system performance. Moreover, the stored data consists of fragments, and compression of ordered integers is effective only when their number reaches a certain magnitude. The prior art therefore does not lend itself to effective data compression and can waste a large amount of memory space.
Disclosure of Invention
In view of this, embodiments of the present invention provide a method and an apparatus for storing incremental inverted index data, which can greatly improve the storage performance of incremental inverted index data when the incremental data is uncertain. They also facilitate effective data compression and improve space utilization.
To achieve the above object, according to an aspect of the embodiments of the present invention, a method for implementing incremental inverted index data storage is provided.
The method for storing incremental inverted index data according to the embodiment of the present invention comprises: determining the memory space required by the incremental inverted index data to be stored; judging whether the required memory space is larger than the remaining memory space of a pre-allocated memory pool; if so, applying for a new memory pool, storing the incremental inverted index data to be stored in the pre-allocated memory pool and the new memory pool, and marking the pre-allocated memory pool after the storage operation as a memory pool to be compressed; otherwise, storing the incremental inverted index data to be stored in the pre-allocated memory pool; and compressing the inverted index data in the memory pool to be compressed.
Optionally, the step of storing the incremental inverted index data to be stored in the pre-allocated memory pool and the new memory pool includes: dividing the incremental inverted index data to be stored into first incremental inverted index data and second incremental inverted index data according to the remaining memory space, wherein the memory space required by the first incremental inverted index data is equal to the remaining memory space; storing the first incremental inverted index data in the pre-allocated memory pool; and storing the second incremental inverted index data in the new memory pool.
Optionally, the step of compressing the inverted index data in the memory pool to be compressed includes: for each inverted term in the inverted index data in the memory pool to be compressed, aggregating the document identifier lists of the inverted term to form a data block of the inverted term; and compressing the data block with the PForDelta algorithm.
Optionally, after marking the pre-allocated memory pool after the storage operation as a memory pool to be compressed, and before compressing the inverted index data in the memory pool to be compressed, the method further includes: sending information about the memory pool to be compressed to a rebuild queue.
In that case, the step of compressing the inverted index data in the memory pool to be compressed includes: rebuilding the memory pool to be compressed based on a message consumed from the rebuild queue; and, during the rebuild, compressing the inverted index data in the memory pool to be compressed.
To achieve the above object, according to another aspect of the embodiments of the present invention, an apparatus for implementing incremental inverted index data storage is provided.
The device for storing incremental inverted index data according to the embodiment of the present invention comprises:
a required memory space determining module, configured to determine the memory space required by the incremental inverted index data to be stored;
a judging module, configured to judge whether the required memory space of the incremental inverted index data to be stored is larger than the remaining memory space of a pre-allocated memory pool;
a storage module, configured to, when the required memory space is larger than the remaining memory space, apply for a new memory pool, store the incremental inverted index data to be stored in the pre-allocated memory pool and the new memory pool, and mark the pre-allocated memory pool after the storage operation as a memory pool to be compressed; and otherwise to store the incremental inverted index data to be stored in the pre-allocated memory pool;
and a compression module, configured to compress the inverted index data in the memory pool to be compressed.
Optionally, the storage module is further configured to divide the incremental inverted index data to be stored into first incremental inverted index data and second incremental inverted index data according to the remaining memory space, wherein the memory space required by the first incremental inverted index data is equal to the remaining memory space; to store the first incremental inverted index data in the pre-allocated memory pool; and to store the second incremental inverted index data in the new memory pool.
Optionally, the compression module is further configured to, for each inverted term in the inverted index data in the memory pool to be compressed, aggregate the document identifier lists of the inverted term to form a data block of the inverted term, and to compress the data block with the PForDelta algorithm.
Optionally, the device for storing incremental inverted index data according to the embodiment of the present invention further includes a sending module, configured to send information about the memory pool to be compressed to a rebuild queue.
In that case, the compression module rebuilds the memory pool to be compressed based on a message consumed from the rebuild queue and, during the rebuild, compresses the inverted index data in the memory pool to be compressed.
To achieve the above object, according to still another aspect of an embodiment of the present invention, there is provided an electronic apparatus.
The electronic device of the embodiment of the invention comprises: one or more processors; a storage device for storing one or more programs which, when executed by the one or more processors, cause the one or more processors to implement any of the above-described methods for implementing incremental inverted index data storage.
To achieve the above object, according to a further aspect of the embodiments of the present invention, there is provided a computer readable medium having a computer program stored thereon, wherein the computer program is configured to implement any one of the above methods for implementing incremental inverted index data storage when executed by a processor.
One embodiment of the above invention has the following advantages or benefits: because online incremental inverted index data arrives for storage at irregular times, a larger space (a memory pool) can be pre-allocated, which reduces the performance loss caused by frequently applying for small memory spaces, reduces memory fragmentation, and allows the inverted index data to be compressed effectively. This solves the prior-art problems of frequent memory allocation and low space utilization when there is a large amount of incremental inverted index data.
Further effects of the above optional implementations are described below in connection with the embodiments.
Drawings
The drawings are included to provide a better understanding of the invention and are not to be construed as unduly limiting the invention. Wherein:
FIG. 1 is a schematic diagram of a main flow of a method for implementing incremental inverted index data storage according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of aggregating the document identifier lists of inverted terms according to an embodiment of the invention;
FIG. 3 is a schematic diagram of a method of implementing incremental inverted index data storage according to an embodiment of the invention;
FIG. 4 is a schematic diagram of compressing a data block by a PFORDelta algorithm according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of an inverted index data query according to an embodiment of the invention;
FIG. 6 is a schematic diagram of the main modules of an apparatus implementing incremental inverted index data storage according to an embodiment of the present invention;
FIG. 7 is an exemplary system architecture diagram in which embodiments of the present invention may be employed;
FIG. 8 is a schematic structural diagram of a computer system suitable for implementing a terminal device or a server according to an embodiment of the present invention.
Detailed Description
Exemplary embodiments of the present invention are described below with reference to the accompanying drawings, in which various details of embodiments of the invention are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the invention. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
FIG. 1 is a schematic diagram of the main flow of a method for storing incremental inverted index data according to an embodiment of the present invention; FIG. 2 is a schematic diagram of aggregating the document identifier lists of inverted terms according to an embodiment of the invention.
As shown in FIG. 1, the method for storing incremental inverted index data according to the embodiment of the present invention mainly includes:
Step S101: determine the memory space required by the incremental inverted index data to be stored, that is, the memory space needed to store that data.
Step S102: judge whether the required memory space of the incremental inverted index data to be stored is larger than the remaining memory space of the pre-allocated memory pool. If so, go to step S103; otherwise, go to step S105. The pre-allocated memory pool is a memory pool for which a certain amount of memory space has been allocated in advance. In the embodiment of the present invention, each time incremental inverted index data is built online in real time, it is stored preferentially in the pre-allocated memory pool, and a new memory pool is applied for only if the pre-allocated memory pool is full or its remaining memory space is insufficient.
Step S103: apply for a new memory pool, store the incremental inverted index data to be stored in the pre-allocated memory pool and the new memory pool, and mark the pre-allocated memory pool after the storage operation as a memory pool to be compressed.
Optionally, the incremental inverted index data to be stored is divided into first incremental inverted index data and second incremental inverted index data according to the remaining memory space, where the memory space required by the first incremental inverted index data is equal to the remaining memory space. The first incremental inverted index data is stored in the pre-allocated memory pool, and the second incremental inverted index data is stored in the new memory pool. In this way, the currently available pre-allocated memory pool is filled completely before a new memory pool is applied for; once the new memory pool is full, another new memory pool is applied for, so that space utilization is maximized as far as possible.
Step S104: compress the inverted index data in the memory pool to be compressed.
Optionally, for each inverted term in the inverted index data in the memory pool to be compressed, the document identifier lists of the inverted term are aggregated to form a data block of the inverted term. Specifically, a hash function converts each term to be inverted into a termid, a 64-bit integer, and the id of each doc in which the term appears is inserted into the corresponding doclist. As shown in FIG. 2, for the inverted terms "mobile phone", "sneaker" and "eye" (an inverted term may also be, for example, a commodity brand), the termids determined by the hash function are denoted 1, 3 and 2, respectively, and the doclists corresponding to termids 1, 3 and 2 are {12, 13, 30, 101}, {86, 93} and {25, 36, 89}. In the embodiment of the present invention, the document identifier docid may be identification information generated by the system from timestamp information, and is generally continuous. Thus, the docids in each term's doclist are ordered (in descending or ascending order).
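The aggregation described above can be illustrated with a minimal Python sketch. The choice of hash function (BLAKE2b truncated to 8 bytes) and the helper names `term_to_termid` and `build_doclists` are assumptions made for illustration; the source only states that a hash function yields a 64-bit integer termid and that docids are inserted into the corresponding doclist.

```python
import bisect
import hashlib
from collections import defaultdict


def term_to_termid(term: str) -> int:
    """Map a term to a 64-bit unsigned integer (the hash choice is an assumption)."""
    digest = hashlib.blake2b(term.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def build_doclists(postings):
    """Aggregate (term, docid) pairs into a termid -> ascending doclist mapping."""
    doclists = defaultdict(list)
    for term, docid in postings:
        doclist = doclists[term_to_termid(term)]
        pos = bisect.bisect_left(doclist, docid)
        # Keep the doclist sorted and skip docids that are already present.
        if pos == len(doclist) or doclist[pos] != docid:
            doclist.insert(pos, docid)
    return dict(doclists)


# The (term, docid) pairs from the FIG. 2 example, arriving in arbitrary order.
postings = [
    ("mobile phone", 30), ("sneaker", 86), ("mobile phone", 12),
    ("eye", 36), ("mobile phone", 101), ("eye", 25),
    ("sneaker", 93), ("mobile phone", 13), ("eye", 89),
]
index = build_doclists(postings)
```

In this sketch the termids are real hash values rather than the small numbers 1, 3 and 2 used in the figure; the doclists themselves match the example.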
The data block is then compressed with the PForDelta algorithm. The PForDelta algorithm (p4delta algorithm) was first proposed by Heman in 2005 (Heman et al., ICDE 2006) and compresses an entire chunk of data at once. The basic idea is that, in a chunk sequence, the majority of values (x%, e.g., 90%) are small and need few bits, while the remaining minority ((100-x)%, e.g., 10%) are outliers that would otherwise inflate the storage width of every value. Therefore, the x% of small values are stored uniformly with a small number of bits b, and the remaining (100-x)% are stored separately.
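The idea of storing most values in b bits and the outliers separately can be sketched as follows. This is a deliberately simplified illustration, not the actual PForDelta layout: exceptions are kept as plain (position, value) pairs instead of being patched into the slot array, and the function names and the `coverage_pct` parameter are hypothetical.

```python
def choose_bit_width(values, coverage_pct=90):
    """Smallest b such that at least coverage_pct% of the values fit in b bits."""
    if not values:
        return 1
    widths = sorted(max(v.bit_length(), 1) for v in values)
    idx = (len(widths) * coverage_pct + 99) // 100 - 1  # ceil(len * pct / 100) - 1
    return widths[idx]


def pfor_encode(values, coverage_pct=90):
    """Pack non-negative integers into b-bit slots; outliers go to a side list."""
    b = choose_bit_width(values, coverage_pct)
    limit = 1 << b
    packed = 0
    exceptions = []  # (position, original value) for values that need more than b bits
    for i, v in enumerate(values):
        if v < limit:
            packed |= v << (i * b)
        else:
            exceptions.append((i, v))  # the slot itself stays 0
    return {"b": b, "n": len(values), "packed": packed, "exceptions": exceptions}


def pfor_decode(block):
    """Unpack the b-bit slots, then restore the outliers at their positions."""
    b = block["b"]
    mask = (1 << b) - 1
    values = [(block["packed"] >> (i * b)) & mask for i in range(block["n"])]
    for pos, v in block["exceptions"]:
        values[pos] = v
    return values


gaps = [1, 2, 3, 1, 2, 3, 1, 2, 1000, 3]
block = pfor_encode(gaps)
```

For the ten gaps above, nine fit in 2 bits, so the packed part takes 20 bits and only the value 1000 is stored as an exception.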
After the pre-allocated memory pool is marked as a memory pool to be compressed, and before the inverted index data in the memory pool to be compressed is compressed, the method further includes sending information about the memory pool to be compressed to a rebuild queue. When the inverted index data in the memory pool to be compressed is compressed, the memory pool is rebuilt based on a message consumed from the rebuild queue, and the inverted index data in the memory pool is compressed during the rebuild.
Step S105: store the incremental inverted index data to be stored in the pre-allocated memory pool.
In the embodiment of the invention, because online incremental inverted index data arrives for storage at irregular times, a larger space (a memory pool) can be pre-allocated, which reduces the performance loss caused by frequently applying for small memory spaces, reduces memory fragmentation, and allows the inverted index data to be compressed effectively. This solves the prior-art problems of frequent memory allocation and low space utilization when there is a large amount of incremental inverted index data.
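The flow of steps S101 to S105 can be sketched in Python as follows. The class names `MemoryPool` and `IncrementalStore`, the byte-string payload, and the in-process `deque` standing in for the rebuild queue are all assumptions for illustration; compression itself (step S104) is left to whatever consumes the queue.

```python
from collections import deque


class MemoryPool:
    """A fixed-capacity buffer standing in for a pre-allocated memory pool."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.data = bytearray()
        self.to_compress = False

    @property
    def remaining(self):
        return self.capacity - len(self.data)


class IncrementalStore:
    def __init__(self, pool_capacity):
        self.pool_capacity = pool_capacity
        self.current = MemoryPool(pool_capacity)  # the pre-allocated pool
        self.rebuild_queue = deque()              # pools waiting to be compressed

    def store(self, payload: bytes):
        # S101/S102: compare the required space with the remaining space.
        while len(payload) > self.current.remaining:
            # S103: fill the current pool exactly, then apply for a new pool.
            cut = self.current.remaining
            self.current.data += payload[:cut]    # first part of the data
            payload = payload[cut:]               # second part of the data
            self.current.to_compress = True
            self.rebuild_queue.append(self.current)
            self.current = MemoryPool(self.pool_capacity)
        # S105: the (remaining) data fits in the current pool.
        self.current.data += payload


store = IncrementalStore(pool_capacity=8)
store.store(b"abcde")   # fits: 3 bytes remain
store.store(b"fghij")   # needs 5 > 3: split into b"fgh" and b"ij"
```

The loop rather than a single split is a design choice of this sketch, so that a payload larger than one pool is spread over several pools, each filled completely before the next is requested.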
FIG. 3 is a schematic diagram of a method of implementing incremental inverted index data storage according to an embodiment of the invention; FIG. 4 is a schematic diagram of compressing a data block by a PFORDelta algorithm according to an embodiment of the present invention; FIG. 5 is a diagram of an inverted index data query, according to an embodiment of the invention.
In some application scenarios, for example determining a commodity list from commodity query terms, the incremental inverted index data is uncertain because merchants list new commodities unpredictably and may put commodities on or take them off the shelves at any time. In the prior art, only part of the incremental inverted index data can be accumulated at a time, and when an inverted chain is built for a term in a batch of incremental data, only the document ids in that batch can be included. Because memory is allocated contiguously, when the next batch of incremental data arrives a new space is applied for to build an inverted chain for the term; if the term already appeared in an earlier inverted chain, the same term inevitably corresponds to several discontiguous inverted chains. Compression of an ordered integer doclist is effective only when the number of integers reaches a certain magnitude, and short chains compress poorly. Since the data in each inverted chain is limited, effective data compression is hindered.
In view of the above problems, the embodiment of the present invention pre-allocates a larger space (a memory pool), which reduces the performance loss caused by frequently applying for small memory spaces, reduces memory fragmentation, and facilitates effective compression of the inverted index data. As shown in FIG. 3, a method for storing incremental inverted index data according to an embodiment of the present invention includes:
Step S301: construct the incremental inverted index data to be stored.
Step S302: judge whether the incremental inverted index data to be stored fits in the current memory pool IncSegment. If so, go to step S303; otherwise, go to step S306.
Step S303: insert the incremental inverted index data to be stored into the current IncSegment.
Step S304: when the IncSegment is full, place it in a rebuild queue to wait for a rebuild operation. To compress memory, a rebuild operation is performed periodically on the data in full IncSegments. A full IncSegment is set to read-only and is no longer modified, which simplifies the subsequent rebuild operation. To distinguish data that has not been rebuilt from data that has been rebuilt, a header may be added to each data block: an AliveBlockHeader for the former and a SolidBlockHeader for the latter.
Step S305: consume the messages in the rebuild queue, perform the rebuild operation, and obtain the compressed memory pool. During the rebuild, memory is allocated to exactly the size required by the IncSegment being rebuilt, which avoids memory gaps and yields compact storage. Each IncSegment contains a hash_fact (dictionary) that records each inverted term and the offset address of its BlockHeader within the IncSegment. Each BlockHeader in turn records the offset of the next BlockHeader of the same term, and so on, so that the BlockHeaders form a linked list. In this way, the several short doclists corresponding to the same term in the IncSegment can be merged into one doclist, which is then compressed with p4delta. The lengths of all compressed data are accumulated while the compressed data is written to a temporary space; memory of the total compressed length can then be allocated in one step and the data copied into it from the temporary space in order.
The above operation, as shown in fig. 4, specifically includes:
1. Traverse all terms in the full IncSegment and aggregate the doclists corresponding to the same term. A hash function converts each term to be indexed into a termid (inverted term identifier), a 64-bit integer, and the id of each doc in which the term appears is inserted into the corresponding doclist.
2. Apply for a new block of memory. For each term's doclist, replace the AliveBlockHeader with a SolidBlockHeader and compress the data block with the p4delta algorithm; for example, compress the differences of 128 docids with p4delta and write the maximum (or minimum) of the 128 docids into the corresponding BlockHeader. In the implementation, the 128 docids are not compressed directly; instead, differences are computed in order from back to front to obtain 128 differences, which are then compressed. (The rightmost difference is 0, and the remaining values, from left to right, are the differences between adjacent docids.) The index can restore all docids in the Block from the largest docid recorded in the BlockHeader.
3. Repeat the above operations until all terms have been rebuilt.
4. Replace the original IncSegment pointer.
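The merge and difference steps above can be sketched as follows, leaving out the p4delta bit packing and the header types. The dictionary layout and the names `merge_doclists`, `encode_block`, `decode_block` and `rebuild_term` are hypothetical; only the block size of 128, the trailing difference of 0, and the maximum docid kept in the header come from the description.

```python
BLOCK_SIZE = 128  # number of docids per block, as in the example above


def merge_doclists(fragments):
    """Merge the short doclists of one term into a single ascending doclist."""
    return sorted(set().union(*fragments))


def encode_block(docids):
    """Turn an ascending, non-empty docid list into differences plus the max docid.

    The last (rightmost) difference is 0; the others are the gaps between
    adjacent docids, so the block can be restored from max_docid alone.
    """
    gaps = [docids[i + 1] - docids[i] for i in range(len(docids) - 1)] + [0]
    return {"max_docid": docids[-1], "gaps": gaps}


def decode_block(block):
    """Restore the docids from back to front, starting at max_docid."""
    docids = [0] * len(block["gaps"])
    current = block["max_docid"]
    for i in range(len(docids) - 1, -1, -1):
        current -= block["gaps"][i]
        docids[i] = current
    return docids


def rebuild_term(fragments, block_size=BLOCK_SIZE):
    """Merge a term's fragments and split the result into encoded blocks."""
    merged = merge_doclists(fragments)
    return [encode_block(merged[i:i + block_size])
            for i in range(0, len(merged), block_size)]


blocks = rebuild_term([[12, 13], [30], [101]])
```

In a fuller implementation the `gaps` list of each block would then be handed to the p4delta encoder, since small gaps are what make the bit packing effective.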
Within a Block, docids are in ascending order. Because compressed data cannot be read directly, the largest docid in each Block is recorded in its BlockHeader so that the Block containing a specified docid can be located and decompressed. The purpose of the search is to find the first docid that is not smaller than the query docid. The search uses a skip method with step size 2^n (n ≥ 1). Here docid is the input value being queried, curr_header is the BlockHeader currently being visited, end_header is the position after the last header, step is the step size, and n is the exponent of the step size. The input docid is compared with curr_header.max_docid to determine the direction and step size of the next jump. As shown in FIG. 5, the search proceeds as follows:
1. Initially, n is 1 and step is 0.
2. While curr_header has not reached end_header and the maximum docid of curr_header (curr_header.max_docid) is smaller than the input docid, perform step 3; otherwise, end.
3. Set step = 2^n and increase n by 1; curr_header advances by step headers. If curr_header has reached or passed end_header, or the maximum docid of curr_header is greater than the input docid, perform step 4.
4. curr_header moves back by step-1 headers; n is reset to 1.
5. Return to step 2.
The code implementation appears in the original publication as images (Figure BDA0002174572540000101 and Figure BDA0002174572540000111) and is not reproduced here.
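The search steps above can be sketched in Python as follows. This is an illustrative reading of the steps, not the listing from the original publication: headers are modelled as a plain list of per-block maximum docids, positions are list indices, and the overshoot test (reaching the end or landing on a larger maximum docid) is an interpretation of step 3.

```python
def find_block(max_docids, docid):
    """Return the index of the first block whose max docid is >= docid.

    max_docids holds the largest docid of each block in ascending order.
    Returns len(max_docids) when every block's max docid is smaller than docid.
    """
    end = len(max_docids)  # plays the role of end_header
    curr = 0               # plays the role of curr_header
    n = 1
    while curr < end and max_docids[curr] < docid:
        step = 2 ** n
        n += 1
        curr += step                                  # jump ahead by step headers
        if curr >= end or max_docids[curr] > docid:
            curr -= step - 1                          # overshoot: net advance of one header
            n = 1                                     # restart with the smallest step
    return curr


headers = [10, 20, 30, 40, 50, 60, 70, 80]
found = find_block(headers, 55)
```

After an overshoot the position ends up exactly one header past the last header known to be too small, so no candidate block is skipped; the Block at the returned index is the one to decompress.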
Step S306: apply for a new IncSegment, and store the incremental inverted index data to be stored in the currently existing IncSegment and the new IncSegment.
FIG. 6 is a schematic diagram of the main modules of an apparatus for storing incremental inverted index data according to an embodiment of the present invention. As shown in FIG. 6, the apparatus 600 includes a required memory space determining module 601, a judging module 602, a storage module 603, and a compression module 604.
The required memory space determining module 601 is configured to determine the memory space required by the incremental inverted index data to be stored.
The judging module 602 is configured to judge whether the required memory space of the incremental inverted index data to be stored is larger than the remaining memory space of the pre-allocated memory pool.
The storage module 603 is configured to apply for a new memory pool after the judging module 602 determines that the required memory space of the incremental inverted index data to be stored is larger than the remaining memory space of the pre-allocated memory pool, to store the incremental inverted index data to be stored in the pre-allocated memory pool and the new memory pool, and to mark the pre-allocated memory pool after the storage operation as a memory pool to be compressed. After the judging module 602 determines that the required memory space is not greater than the remaining memory space of the pre-allocated memory pool, the storage module 603 stores the incremental inverted index data to be stored in the pre-allocated memory pool.
The storage module 603 is further configured to divide the incremental inverted index data to be stored into first incremental inverted index data and second incremental inverted index data according to the remaining memory space, where the memory space required by the first incremental inverted index data is equal to the remaining memory space. The storage module 603 is further configured to store the first incremental inverted index data in the pre-allocated memory pool and the second incremental inverted index data in the new memory pool.
The compression module 604 is configured to compress the inverted index data in the memory pool to be compressed. The compression module 604 is further configured to, for each inverted term in the inverted index data in the memory pool to be compressed, aggregate the document identifier lists of the inverted term to form a data block of the inverted term, and to compress the data block with the PForDelta algorithm.
The apparatus for storing incremental inverted index data further includes a sending module, configured to send information about the memory pool to be compressed to the rebuild queue. The compression module 604 rebuilds the memory pool to be compressed based on a message consumed from the rebuild queue and, during the rebuild, compresses the inverted index data in the memory pool to be compressed.
In the embodiment of the invention, because online incremental inverted index data arrives for storage at irregular times, a larger space (a memory pool) can be pre-allocated, which reduces the performance loss caused by frequently applying for small memory spaces, reduces memory fragmentation, and allows the inverted index data to be compressed effectively. This solves the prior-art problems of frequent memory allocation and low space utilization when there is a large amount of incremental inverted index data.
Fig. 7 illustrates an exemplary system architecture 700 of a method of implementing an incremental inverted index data store or an apparatus implementing an incremental inverted index data store to which embodiments of the invention may be applied.
As shown in fig. 7, the system architecture 700 may include terminal devices 701, 702, 703, a network 704, and a server 705. The network 704 serves to provide a medium for communication links between the terminal devices 701, 702, 703 and the server 705. Network 704 may include various connection types, such as wired, wireless communication links, or fiber optic cables, to name a few.
A user may use the terminal devices 701, 702, 703 to interact with a server 705 over a network 704, to receive or send messages or the like. The terminal devices 701, 702, 703 may have installed thereon various communication client applications, such as a shopping-like application, a web browser application, a search-like application, an instant messaging tool, a mailbox client, social platform software, etc. (by way of example only).
The terminal devices 701, 702, 703 may be various electronic devices having a display screen and supporting web browsing, including but not limited to smart phones, tablet computers, laptop portable computers, desktop computers, and the like.
The server 705 may be a server providing various services, such as a background management server (for example only) providing support for shopping websites browsed by users using the terminal devices 701, 702, 703. The background management server can analyze and process the received data such as the product information inquiry request and feed back the processing result to the terminal equipment.
It should be noted that the method for implementing storage of the incremental inverted index data provided in the embodiment of the present invention is generally executed by the server 705, and accordingly, the apparatus for implementing storage of the incremental inverted index data is generally disposed in the server 705.
It should be understood that the number of terminal devices, networks, and servers in fig. 7 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
Referring now to FIG. 8, shown is a block diagram of a computer system 800 suitable for use with a terminal device implementing an embodiment of the present invention. The terminal device shown in fig. 8 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present invention.
As shown in fig. 8, the computer system 800 includes a Central Processing Unit (CPU)801 that can perform various appropriate actions and processes in accordance with a program stored in a Read Only Memory (ROM)802 or a program loaded from a storage section 808 into a Random Access Memory (RAM) 803. In the RAM 803, various programs and data necessary for the operation of the system 800 are also stored. The CPU 801, ROM 802, and RAM 803 are connected to each other via a bus 804. An input/output (I/O) interface 805 is also connected to bus 804.
The following components are connected to the I/O interface 805: an input section 806 including a keyboard, a mouse, and the like; an output section 807 including a display such as a Cathode Ray Tube (CRT) or a Liquid Crystal Display (LCD), and a speaker; a storage section 808 including a hard disk and the like; and a communication section 809 including a network interface card such as a LAN card, a modem, or the like. The communication section 809 performs communication processing via a network such as the Internet. A drive 810 is also connected to the I/O interface 805 as necessary. A removable medium 811 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like is mounted on the drive 810 as necessary, so that a computer program read from it can be installed into the storage section 808 as necessary.
In particular, according to the embodiments of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated in the flow chart. In such an embodiment, the computer program can be downloaded and installed from a network through the communication section 809 and/or installed from the removable medium 811. The computer program executes the above-described functions defined in the system of the present invention when executed by the Central Processing Unit (CPU) 801.
It should be noted that the computer readable medium shown in the present invention can be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present invention, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In the present invention, however, a computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The modules described in the embodiments of the present invention may be implemented by software or hardware. The described modules may also be provided in a processor, which may be described as: a processor comprising a required memory space determining module, a judging module, a storage module, and a compression module. The names of these modules do not, under certain circumstances, limit the modules themselves; for example, the required memory space determining module may also be described as a "module that determines the required memory space of the incremental inverted index data to be stored".
As another aspect, the present invention also provides a computer-readable medium, which may be contained in the device described in the above embodiments, or may exist separately without being incorporated into the device. The computer-readable medium carries one or more programs which, when executed by a device, cause the device to perform: determining a required memory space of incremental inverted index data to be stored; judging whether the required memory space of the incremental inverted index data to be stored is larger than the remaining memory space of the pre-allocated memory pool; if so, applying for a new memory pool, storing the incremental inverted index data to be stored into the pre-allocated memory pool and the new memory pool, and marking the pre-allocated memory pool after the storage operation as a memory pool to be compressed; otherwise, storing the incremental inverted index data to be stored into the pre-allocated memory pool; and compressing the inverted index data in the memory pool to be compressed.
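The storage steps listed above can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the class names `MemoryPool` and `IncrementalStore`, the byte-string representation of index data, the `pending_compression` list, and the rule for sizing a new pool are all hypothetical. The split point follows the description in that the first part exactly fills the remaining space of the pre-allocated pool.

```python
class MemoryPool:
    """A fixed-size buffer allocated once, then filled by appends."""

    def __init__(self, capacity: int):
        self.buffer = bytearray(capacity)  # pre-allocated up front
        self.used = 0
        self.to_be_compressed = False

    @property
    def remaining(self) -> int:
        return len(self.buffer) - self.used

    def write(self, data: bytes) -> None:
        if len(data) > self.remaining:
            raise ValueError("data does not fit in the remaining pool space")
        self.buffer[self.used:self.used + len(data)] = data
        self.used += len(data)


class IncrementalStore:
    def __init__(self, pool_capacity: int):
        self.pool_capacity = pool_capacity
        self.current = MemoryPool(pool_capacity)
        self.pending_compression = []  # pools marked as to be compressed

    def store(self, data: bytes) -> None:
        required = len(data)  # required memory space of the data to be stored
        if required > self.current.remaining:
            # Split so that the first part exactly fills the current pool.
            split = self.current.remaining
            first, second = data[:split], data[split:]
            self.current.write(first)
            self.current.to_be_compressed = True
            self.pending_compression.append(self.current)
            # Assumption: a new pool is at least the default size, and larger
            # if the second part alone would not fit in a default-size pool.
            new_pool = MemoryPool(max(self.pool_capacity, len(second)))
            new_pool.write(second)
            self.current = new_pool
        else:
            self.current.write(data)
```

With `IncrementalStore(pool_capacity=8)`, storing `b"abcde"` and then `b"fghij"` fills the first pool with `b"abcdefgh"`, marks it as to be compressed, and places `b"ij"` in a newly requested pool.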
In the embodiment of the invention, because online incremental inverted index data arrives for storage at irregular times, a larger space (a memory pool) can be pre-allocated. This reduces the performance loss caused by frequently requesting small blocks of memory, reduces memory fragmentation, and allows the inverted index data to be compressed effectively. It thereby addresses the prior-art problems of frequent memory allocation and low space utilization when a large amount of incremental inverted index data exists.
The above-described embodiments should not be construed as limiting the scope of the invention. Those skilled in the art will appreciate that various modifications, combinations, sub-combinations, and substitutions can occur, depending on design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (10)

1. A method for implementing storage of incremental inverted index data, characterized by comprising:
determining a required memory space of incremental inverted index data to be stored;
judging whether the required memory space of the incremental inverted index data to be stored is larger than the remaining memory space of a pre-allocated memory pool;
if so, applying for a new memory pool, storing the incremental inverted index data to be stored into the pre-allocated memory pool and the new memory pool, and marking the pre-allocated memory pool after the storage operation as a memory pool to be compressed; otherwise, storing the incremental inverted index data to be stored into the pre-allocated memory pool;
and compressing the inverted index data in the memory pool to be compressed.
2. The method of claim 1, wherein the step of storing the incremental inverted index data to be stored into the pre-allocated memory pool and the new memory pool comprises:
dividing, according to the remaining memory space, the incremental inverted index data to be stored into first incremental inverted index data and second incremental inverted index data, wherein the required memory space of the first incremental inverted index data is equal to the remaining memory space;
storing the first incremental inverted index data into the pre-allocated memory pool;
and storing the second incremental inverted index data into the new memory pool.
3. The method according to claim 1, wherein the step of compressing the inverted index data in the memory pool to be compressed comprises:
for each inverted word in the inverted index data in the memory pool to be compressed, aggregating the document identification list of the inverted word to form a data block of the inverted word;
and compressing the data block using a PForDelta algorithm.
4. The method of claim 1, further comprising, after marking the pre-allocated memory pool after the storage operation as the memory pool to be compressed and before compressing the inverted index data in the memory pool to be compressed: sending information about the memory pool to be compressed to a reconstruction queue;
wherein the step of compressing the inverted index data in the memory pool to be compressed comprises: reconstructing the memory pool to be compressed based on messages consumed from the reconstruction queue; and, during the reconstruction, compressing the inverted index data in the memory pool to be compressed.
5. An apparatus for implementing storage of incremental inverted index data, comprising:
a required memory space determining module, configured to determine the required memory space of incremental inverted index data to be stored;
a judging module, configured to judge whether the required memory space of the incremental inverted index data to be stored is larger than the remaining memory space of a pre-allocated memory pool;
a storage module, configured to, if so, apply for a new memory pool, store the incremental inverted index data to be stored into the pre-allocated memory pool and the new memory pool, and mark the pre-allocated memory pool after the storage operation as a memory pool to be compressed; and otherwise, store the incremental inverted index data to be stored into the pre-allocated memory pool;
and a compression module, configured to compress the inverted index data in the memory pool to be compressed.
6. The apparatus according to claim 5, wherein the storage module is further configured to divide, according to the remaining memory space, the incremental inverted index data to be stored into first incremental inverted index data and second incremental inverted index data, wherein the required memory space of the first incremental inverted index data is equal to the remaining memory space; store the first incremental inverted index data into the pre-allocated memory pool; and store the second incremental inverted index data into the new memory pool.
7. The apparatus according to claim 5, wherein the compression module is further configured to, for each inverted word in the inverted index data in the memory pool to be compressed, aggregate the document identification list of the inverted word to form a data block of the inverted word; and compress the data block using a PForDelta algorithm.
8. The apparatus according to claim 7, further comprising a sending module, configured to send information about the memory pool to be compressed to a reconstruction queue;
wherein the compression module reconstructs the memory pool to be compressed based on messages consumed from the reconstruction queue and, during the reconstruction, compresses the inverted index data in the memory pool to be compressed.
9. An electronic device, comprising:
one or more processors;
a storage device for storing one or more programs,
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any one of claims 1-4.
10. A computer-readable medium, on which a computer program is stored, which, when being executed by a processor, carries out the method according to any one of claims 1-4.
CN201910774369.0A 2019-08-21 2019-08-21 Method and device for realizing storage of incremental inverted index data Pending CN112417081A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910774369.0A CN112417081A (en) 2019-08-21 2019-08-21 Method and device for realizing storage of incremental inverted index data

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910774369.0A CN112417081A (en) 2019-08-21 2019-08-21 Method and device for realizing storage of incremental inverted index data

Publications (1)

Publication Number Publication Date
CN112417081A true CN112417081A (en) 2021-02-26

Family

ID=74779722

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910774369.0A Pending CN112417081A (en) 2019-08-21 2019-08-21 Method and device for realizing storage of incremental inverted index data

Country Status (1)

Country Link
CN (1) CN112417081A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114579596A (en) * 2022-05-06 2022-06-03 达而观数据(成都)有限公司 Method and system for updating index data of search engine in real time

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination