CN117807174A - Index processing method, apparatus, computer device, medium, and program product - Google Patents

Index processing method, apparatus, computer device, medium, and program product

Info

Publication number
CN117807174A
Authority
CN
China
Prior art keywords
index
segment
target
copy
partition
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211172148.4A
Other languages
Chinese (zh)
Inventor
毕杰山
刘蕊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN202211172148.4A
Publication of CN117807174A

Abstract

The application relates to an index processing method, an index processing apparatus, a computer device, a storage medium, and a computer program product. The method can be applied to a document-type database and mainly concerns synchronizing, between a primary shard and a replica shard, the index segments used for querying documents. The method comprises: whenever the node where the primary shard of a target index shard is located receives a document to be added to the index, processing the document to obtain retrieval data associated with the document, and writing the retrieval data into a segment cache of the primary shard of the target index shard; when the segment cache meets an index segment generation condition, reading the retrieval data from the segment cache to generate an index segment, and adding the index segment to the primary shard of the target index shard; and copying the index segment to the replica shard of the target index shard. The method can improve the reliability of the distributed cluster as a whole and the efficiency of data retrieval, while saving CPU resources across the cluster and improving its overall processing performance.

Description

Index processing method, apparatus, computer device, medium, and program product
Technical Field
The present application relates to the field of database technology, and in particular, to a method, an apparatus, a computer device, a storage medium, and a computer program product for creating an index.
Background
With the development of computer technology, data generated in many scenarios, such as customer data, product data, and order data, must be stored for later query. As the data volume grows, querying this mass of data takes longer and longer.
At present, a document-type database stores massive data in a data structure called an index. This structure makes full-text queries very fast and so addresses the low efficiency of querying massive data. The document-type database treats each piece of data as a document, and an index is a collection of a large number of similar documents, for example an index of order data or an index of product data. In the big data era, the data contained in one index may far exceed the resource capacity of a single machine. For this reason, the data of one index is generally split into a plurality of index shards that are distributed across the nodes of a distributed cluster, and, to improve the reliability of the cluster and the efficiency of data retrieval, a primary shard and a replica shard may be set for each index shard.
However, in the current way of synchronizing data between the primary shard and the replica shard, the node where the primary shard is located processes a document to obtain related data such as its keywords, and then sends the document to the node where each corresponding replica shard is located. That node must perform similar processing on the document again, and the processing of the same document on the primary side and the replica side is independent. Document processing usually consumes a large amount of CPU resources, so processing the same document on the primary shard side and on every replica shard side is redundant work that wastes CPU resources across the distributed cluster and degrades its overall processing performance.
Disclosure of Invention
In view of the foregoing, it is desirable to provide an index processing method, apparatus, computer device, computer readable storage medium, and computer program product, which can reduce redundant processing of documents to save CPU resources of an entire distributed cluster and improve processing performance of the distributed cluster as a whole.
The application provides an index processing method, wherein the index comprises at least one index shard, each index shard has a primary shard and a replica shard, the primary shard and the replica shard of the same index shard are distributed on different nodes, each index shard comprises at least one index segment, and each index segment is a minimum retrievable data unit of the index; the method comprises the following steps:
whenever the node where the primary shard of a target index shard is located receives a document to be added to the index, processing the document to obtain retrieval data associated with the document, and writing the retrieval data into a segment cache of the primary shard of the target index shard;
when the segment cache meets an index segment generation condition, reading the retrieval data from the segment cache to generate an index segment, and adding the index segment to the primary shard of the target index shard;
and copying the index segment to the replica shard of the target index shard.
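The three steps above can be sketched as follows. This is a minimal illustration rather than the claimed implementation: the class names, the whitespace tokenizer, and the generation condition (a fixed number of buffered documents) are all assumptions introduced here, since the text leaves the condition and the processing details open.

```python
import copy
from dataclasses import dataclass, field


def process_document(doc_id, text):
    """CPU-heavy step, run once on the primary side (toy whitespace tokenizer)."""
    return doc_id, sorted(set(text.lower().split()))


@dataclass
class Shard:
    segments: dict = field(default_factory=dict)  # segment id -> inverted index


@dataclass
class PrimaryShard(Shard):
    replicas: list = field(default_factory=list)
    segment_cache: list = field(default_factory=list)
    max_cached_docs: int = 2        # assumed index segment generation condition
    next_segment_no: int = 1

    def add_document(self, doc_id, text):
        # Step 1: process the document and cache its retrieval data.
        self.segment_cache.append(process_document(doc_id, text))
        if len(self.segment_cache) >= self.max_cached_docs:
            self._generate_segment()

    def _generate_segment(self):
        # Step 2: build one index segment from everything in the cache.
        inverted = {}
        for doc_id, terms in self.segment_cache:
            for term in terms:
                inverted.setdefault(term, []).append(doc_id)
        seg_id = f"seg_{self.next_segment_no}"
        self.next_segment_no += 1
        self.segment_cache.clear()
        self.segments[seg_id] = inverted
        # Step 3: copy the finished segment; replicas never re-process documents.
        for replica in self.replicas:
            replica.segments[seg_id] = copy.deepcopy(inverted)
```

Note that `process_document` is only ever called from the primary; the replica receives the finished segment.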
The application also provides an index processing apparatus, wherein the index comprises at least one index shard, each index shard has a primary shard and a replica shard, the primary shard and the replica shard of the same index shard are distributed on different nodes, each index shard comprises at least one index segment, and each index segment is a minimum retrievable data unit of the index; the apparatus comprises:
a caching module, configured to, whenever the node where the primary shard of a target index shard is located receives a document to be added to the index, process the document to obtain retrieval data associated with the document, and write the retrieval data into a segment cache of the primary shard of the target index shard;
an index segment generation module, configured to, when the segment cache meets an index segment generation condition, read the retrieval data from the segment cache to generate an index segment, and add the index segment to the primary shard of the target index shard;
and a replication module, configured to copy the index segment to the replica shard of the target index shard.
In one embodiment, the apparatus further comprises:
a log recording module, configured to write the document into the log file of the primary shard of the target index shard;
and a document transmission module, configured to send the document to the node where the replica shard of the target index shard is located, so that this node writes the document into the log file of the replica shard of the target index shard.
In one embodiment, the replication module is further configured to send the index segment to the node where the replica shard of the target index shard is located, so that this node adds the index segment to the replica shard of the target index shard.
In one embodiment, the replication module is further configured to determine, according to the file size of the index segment, the manner of copying the index segment to the replica shard of the target index shard, and to distribute the index segment to the node where the replica shard is located in the determined manner.
In one embodiment, the replication module is further configured to send the index segment to the node where the replica shard of the target index shard is located when the file size of the index segment is less than or equal to a preset threshold; and, when the file size of the index segment is greater than the preset threshold, to split the index segment into a plurality of data blocks and send the data blocks in sequence to the node where the replica shard of the target index shard is located, the size of each data block being smaller than the preset threshold.
In one embodiment, the replication module is further configured to generate one data block from a plurality of small files in the index segment, the total file size of the plurality of small files being less than or equal to the preset threshold; to split a large file in the index segment into a plurality of data blocks, the file size of the large file being greater than the preset threshold; and to send each data block in sequence to the node where the replica shard of the target index shard is located.
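The block-planning rule in the last two embodiments can be sketched as below. The function name, the `(name, offset, length)` layout, and the choice to cap every block at exactly the threshold are assumptions made for illustration; the text only requires that small files be packed together and large files be split.

```python
def plan_blocks(files, threshold):
    """Plan data blocks for one index segment.

    files: list of (file_name, size_in_bytes) tuples.
    Returns a list of blocks; each block is a list of
    (file_name, offset, length) pieces whose lengths sum to <= threshold.
    """
    blocks = []
    pending, pending_size = [], 0   # small files waiting to share a block
    for name, size in files:
        if size > threshold:
            # Large file: split into consecutive pieces, one piece per block.
            offset = 0
            while offset < size:
                length = min(threshold, size - offset)
                blocks.append([(name, offset, length)])
                offset += length
        else:
            # Small file: pack with other small files until the block is full.
            if pending_size + size > threshold:
                blocks.append(pending)
                pending, pending_size = [], 0
            pending.append((name, 0, size))
            pending_size += size
    if pending:
        blocks.append(pending)
    return blocks
```

A sender would then transmit the planned blocks one after another, and the receiving node would reassemble files from the `(offset, length)` pieces.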
In one embodiment, the index segment generation module is further configured to read, from the segment cache, the retrieval data associated with each of a plurality of documents; to generate an index segment from the retrieval data, the index segment comprising a group of files used for querying each of the plurality of documents; to refresh the index segment to a file system cache of the primary shard of the target index shard; and, when a persistence instruction is triggered, to persist at least one index segment in the file system cache to the primary shard of the target index shard.
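The two-stage path described here (refresh to the file system cache first, persist later) might look like the following sketch. `SegmentStore` and its attribute names are hypothetical, and plain dictionaries stand in for the file system cache and the disk.

```python
class SegmentStore:
    """Toy model of one primary shard's storage stages."""

    def __init__(self):
        self.fs_cache = {}   # refreshed segments, not yet persisted
        self.disk = {}       # segments persisted to the shard

    def refresh(self, seg_id, files):
        # A newly generated segment lands in the file system cache first.
        self.fs_cache[seg_id] = files

    def persist(self):
        # On a persistence instruction, move every cached segment to disk
        # and report which segment ids became durable.
        persisted_ids = sorted(self.fs_cache)
        self.disk.update(self.fs_cache)
        self.fs_cache.clear()
        return persisted_ids
```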
In one embodiment, the apparatus further comprises:
an index view file update module, configured to update the index view file of the primary shard of the target index shard, the updated index view file indicating all index segments persisted to the primary shard of the target index shard;
and an index view file replication module, configured to send the updated index view file to the node where the replica shard of the target index shard is located, so that this node, when the updated index view file contains the segment identifier of the copied index segment, adds the copied index segment to the replica shard of the target index shard and takes the received updated index view file as the index view file of the replica shard.
In one embodiment, the apparatus further comprises:
a merging module, configured to, when a plurality of index segments persisted to the primary shard of the target index shard are merged to obtain a merged index segment, add the merged index segment to the primary shard of the target index shard and delete the plurality of index segments from the primary shard of the target index shard;
and a merged index segment replication module, configured to copy the merged index segment to the replica shard of the target index shard.
In one embodiment, the apparatus further comprises:
an index view file update module, configured to update the index view file of the primary shard of the target index shard according to the updated index segments in the primary shard, the updated index view file indicating the valid index segments persisted to the primary shard of the target index shard;
and an index view file replication module, configured to send the updated index view file to the node where the replica shard of the target index shard is located, so that this node compares the index view file before the update with the updated index view file, removes the plurality of index segments from the replica shard of the target index shard, and adds the received merged index segment to the replica shard.
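The comparison performed on the replica side after a merge can be sketched as a set difference between the two views. The function and parameter names are hypothetical, and a view is modelled simply as a list of segment identifiers.

```python
def apply_view_update(replica_segments, received_segments, old_view, new_view):
    """Bring a replica shard in line with an updated index view.

    replica_segments:  dict seg_id -> data currently in the replica shard
    received_segments: dict seg_id -> data copied from the primary, not yet added
    old_view/new_view: iterables of segment ids before and after the update
    Returns the set of segment ids removed from the replica.
    """
    removed = set(old_view) - set(new_view)
    for seg_id in removed:
        # Segments that were merged away on the primary are dropped.
        replica_segments.pop(seg_id, None)
    for seg_id in new_view:
        # Newly listed segments (e.g. the merged one) are added once received.
        if seg_id not in replica_segments and seg_id in received_segments:
            replica_segments[seg_id] = received_segments.pop(seg_id)
    return removed
```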
The application further provides an index processing method, wherein the index comprises at least one index shard, each index shard has a primary shard and a replica shard, the primary shard and the replica shard of the same index shard are distributed on different nodes, each index shard comprises at least one index segment, and each index segment is a minimum retrievable data unit of the index; the method comprises the following steps:
receiving an index segment copied by the node where the primary shard of a target index shard is located, the index segment having been persisted to the primary shard of the target index shard;
receiving an index view file copied by the node where the primary shard is located, the index view file indicating all index segments persisted to the primary shard of the target index shard;
and adding the copied index segment to the replica shard of the target index shard when the received index view file contains the segment identifier of the copied index segment.
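A rough sketch of the replica-side steps above: a copied segment is held aside until an index view file that lists its identifier arrives. The class and method names are illustrative assumptions, not part of the application.

```python
class ReplicaShard:
    def __init__(self):
        self.pending = {}    # copied segments not yet listed in any view
        self.segments = {}   # segments that belong to the replica shard
        self.view = []       # segment ids from the latest index view file

    def receive_segment(self, seg_id, data):
        # A copied segment alone is not yet part of the shard.
        self.pending[seg_id] = data

    def receive_view(self, view):
        # Add only those copied segments whose ids the view file contains.
        for seg_id in view:
            if seg_id in self.pending:
                self.segments[seg_id] = self.pending.pop(seg_id)
        self.view = list(view)
```

Gating on the view file means a segment becomes part of the replica only once the primary has declared it part of its persisted set.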
The application also provides an index processing apparatus, wherein the index comprises at least one index shard, each index shard has a primary shard and a replica shard, the primary shard and the replica shard of the same index shard are distributed on different nodes, each index shard comprises at least one index segment, and each index segment is a minimum retrievable data unit of the index; the apparatus comprises:
a first receiving module, configured to receive an index segment copied by the node where the primary shard of a target index shard is located, the index segment having been persisted to the primary shard of the target index shard;
a second receiving module, configured to receive an index view file copied by the node where the primary shard is located, the index view file indicating all index segments persisted to the primary shard of the target index shard;
and a persistence module, configured to add the copied index segment to the replica shard of the target index shard when the received index view file contains the segment identifier of the copied index segment.
In a third aspect, the present application also provides a computer device. The computer device comprises a memory storing a computer program and a processor implementing the steps of the index processing method described above when executing the computer program.
In a fourth aspect, the present application also provides a computer-readable storage medium. The computer readable storage medium has stored thereon a computer program which, when executed by a processor, implements the steps of the index processing method described above.
In a fifth aspect, the present application also provides a computer program product. The computer program product comprises a computer program which, when executed by a processor, implements the steps of the index processing method described above.
According to the above index processing method, apparatus, computer device, storage medium, and computer program product, for documents stored in the index data structure, whenever the node where the primary shard of the target index shard is located receives a document to be added to the index, that node processes the document, obtains the retrieval data associated with it, and writes the retrieval data into the segment cache of the primary shard. When the segment cache meets the index segment generation condition, an index segment is generated from all the retrieval data in the cache and added to the primary shard of the target index shard, and the index segment is then copied directly to the replica shard of the target index shard. Each newly added document therefore does not need to be processed separately on the primary shard side and the replica shard side. This improves the reliability of the distributed cluster and the efficiency of data retrieval while saving CPU resources across the cluster and improving its overall processing performance.
Drawings
FIG. 1a is a schematic diagram of an index segment in one embodiment;
FIG. 1b is a detailed view of an index segment in one embodiment;
FIG. 2 is a schematic diagram of an index shard in one embodiment;
FIG. 3 is a schematic diagram of index segment merging in one embodiment;
FIG. 4 is an application environment diagram of an index processing method in one embodiment;
FIG. 5 is a schematic diagram of backup to the replica shard corresponding to a primary shard in the related art;
FIG. 6 is a flow diagram of an index processing method in one embodiment;
FIG. 7 is a diagram of synchronizing data between a master shard and a replica shard in one embodiment;
FIG. 8 is a diagram of splitting an index segment into multiple data blocks in one embodiment;
FIG. 9 is a diagram of replicating updated index view files, in one embodiment;
FIG. 10 is a diagram of merging index segments and copying the result to the replica shard, in one embodiment;
FIG. 11 is a diagram of updating an index view file and copying to a replica shard in one embodiment;
FIG. 12 is a flow chart of an indexing method according to another embodiment;
FIG. 13 is a flow chart of an indexing method in one embodiment;
FIG. 14 is a block diagram of an index processing device in one embodiment;
FIG. 15 is an internal structural view of a computer device in one embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application will be further described in detail with reference to the accompanying drawings and examples. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the present application.
Cloud computing is a computing model that distributes computing tasks across a resource pool formed by a large number of computers, enabling application systems to obtain computing power, storage space, and information services as needed. The network that provides the resources is referred to as the "cloud". From the user's point of view, resources in the cloud can be expanded without limit, acquired at any time, used on demand, and paid for according to use.
Cloud storage (cloud storage) is a new concept that extends and develops in the concept of cloud computing, and a distributed cloud storage system (hereinafter referred to as a storage system for short) refers to a storage system that integrates a large number of storage devices (storage devices are also referred to as storage nodes) of various types in a network to work cooperatively through application software or application interfaces through functions such as cluster application, grid technology, and a distributed storage file system, so as to provide data storage and service access functions for the outside.
The Database (Database), which can be considered as an electronic filing cabinet, is a place for storing electronic files, and users can perform operations such as adding, inquiring, updating, deleting and the like on the data in the files. A "database" is a collection of data stored together in a manner that can be shared with multiple users, with as little redundancy as possible, independent of the application.
Data generally falls into two categories.
Structured data: also called row data, this is data with a fixed format or limited length, such as databases and metadata. It is logically expressed and implemented as a two-dimensional table structure, strictly follows data format and length specifications, and is mainly stored and managed by relational databases.
Unstructured data: also called full-text data, this is data with no fixed length or format that is not suited to representation in a two-dimensional database table. It includes office documents in all formats, XML, HTML, Word documents, mail, reports of various kinds, pictures, audio information, video information, and the like.
Searches (or queries) fall into two corresponding categories: structured data searches and unstructured data searches. Because structured data has a specific structure, it can generally be stored and searched using the two-dimensional tables of a relational database (MySQL, Oracle, etc.). For unstructured data, that is, full-text data, there are two main search methods: sequential scanning and full-text retrieval. Sequential scanning looks for specific keywords by scanning the data in order, which is time-consuming and inefficient. Full-text retrieval extracts part of the information in the unstructured data, reorganizes it so that it has a certain structure, and then searches the structured result, which makes searching comparatively fast. The information extracted from unstructured data and reorganized in this way is called an index.
Index (index): a container of documents, that is, a collection of documents of one type. Document-oriented databases (e.g., Elasticsearch) store documents and document-related retrieval data in one or more indexes. Like a database, an index can have documents written to it and read from it. Creating an index for a document means saving the document to the index.
Document (document): a data record stored in the index, for example one piece of user access log information.
Document processing (process document): for a document to be indexed, extracting the information useful for searching through a series of data processing steps; the extracted information is collectively referred to as retrieval data. For example, the document is segmented into words to obtain keywords, which are later used to build the inverted index of the document and form an index segment.
Index segment (segment): a group of files generated for the plurality of documents produced within a continuous time window and used to index those documents. The group of files contains the documents themselves and the index data for querying them. FIG. 1a is a schematic diagram of an index segment in one embodiment. Referring to FIG. 1a, for the plurality of documents the group of files includes at most one row-store file, at most one column-store file, index-related files such as a dictionary and an inverted list, metadata files, and so on. Each index segment is an independently retrievable data unit (containing the plurality of documents and the index data for querying them). FIG. 1b is a more detailed view of an index segment in one embodiment. Referring to FIG. 1b, the row-store file (suffix fdt) is read row by row, one document per read; the column-store file (suffix dvd) splits a document (that is, one row of records) into individual columns for storage, which reduces redundant reads; the index-related files include the dictionary and the inverted list, and the inverted list makes it possible to quickly find, for a keyword, the document identifiers of all documents containing that keyword, so that the corresponding documents can be fetched by identifier and the content being queried can be found quickly.
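To make the roles of the files concrete, the sketch below models one index segment in memory: a row store, a column store, a dictionary, and an inverted list. The dictionary-based layout, the field name `body`, and the whitespace tokenizer are assumptions made for illustration and do not reflect any on-disk format.

```python
def build_segment(docs):
    """docs: dict doc_id -> dict of field name -> value; text lives in 'body'."""
    row_store = dict(docs)    # whole documents, fetched one row at a time
    column_store = {}         # field name -> {doc_id: value}
    postings = {}             # keyword -> list of doc ids (inverted list)
    for doc_id, fields in docs.items():
        for name, value in fields.items():
            column_store.setdefault(name, {})[doc_id] = value
        for term in set(fields.get("body", "").lower().split()):
            postings.setdefault(term, []).append(doc_id)
    return {
        "row_store": row_store,
        "column_store": column_store,
        "dictionary": sorted(postings),   # sorted list of known keywords
        "postings": postings,
    }


def search(segment, keyword):
    """Look up the keyword in the inverted list, then fetch the matching rows."""
    doc_ids = segment["postings"].get(keyword.lower(), [])
    return [segment["row_store"][doc_id] for doc_id in doc_ids]
```

A query touches only the inverted list and the matching rows, while an aggregation over a single field could read just one entry of `column_store`.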
Index shard (shard): the data in an index is spread across index shards, and each index shard can be regarded as an independent index. One index comprises one or more index shards. To improve the reliability of the distributed cluster and the efficiency of data retrieval, each index shard is provided with at least one primary shard (primary) and at least one replica shard (replica), and the primary shard and replica shard of the same index shard are placed on different nodes of the distributed cluster. Each index shard consists of one or more index segments, each of which is an independently queryable data unit. FIG. 2 is a schematic diagram of index shards in one embodiment. Referring to FIG. 2, an index includes 4 index shards, and each index shard is provided with 1 primary shard and 1 replica shard. For index shard 1, replica shard 1 is a data copy of primary shard 1; primary shard 1 and replica shard 1 each contain an independent index data directory, which typically holds a plurality of index segments.
In the related art, the replica shard is a data copy of the primary shard, but this does not mean that the replica shard's data is identical to the primary shard's. Index segments on the replica side are generated completely independently of index segments on the primary side. Being a data copy means only that the set of documents that can be indexed through the index segments in the replica shard is the same as the set that can be indexed through the index segments in the primary shard. For example, the index segments in the replica shard and those in the primary shard are both used to index documents 1 through 2000, but the primary shard contains 3 index segments: index segment 1, generated from the retrieval data of documents 1 to 500; index segment 2, generated from the retrieval data of documents 501 to 1000; and index segment 3, generated from the retrieval data of documents 1001 to 2000. The replica shard contains 2 index segments: index segment 4, generated from the retrieval data of documents 1 to 1000, and index segment 5, generated from the retrieval data of documents 1001 to 2000.
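The example can be checked mechanically. In the sketch below the segment names follow the example and the helper `covered_documents` is hypothetical; the two shards hold different index segments yet cover the same document set.

```python
# Document ids indexed by each segment, per the example in the text.
primary_shard = {
    "index_segment_1": range(1, 501),
    "index_segment_2": range(501, 1001),
    "index_segment_3": range(1001, 2001),
}
replica_shard = {
    "index_segment_4": range(1, 1001),
    "index_segment_5": range(1001, 2001),
}


def covered_documents(shard):
    """Union of the document ids reachable through a shard's index segments."""
    return set().union(*shard.values())


same_documents = covered_documents(primary_shard) == covered_documents(replica_shard)
same_segments = set(primary_shard) == set(replica_shard)
```

Here `same_documents` is true while `same_segments` is false, which is exactly the situation the related art produces.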
Index view file: indicates all currently valid index segments in an index shard (primary shard or replica shard), and so also represents the current search scope of a query request. An index segment persisted to the index shard can be accessed by query requests only after it has been added to the index view file.
Persistence (flush operation): each persistence operation persists newly generated index segments to disk, that is, adds them to an index shard (primary shard or replica shard), and updates the index view file.
Segment merge (segment merge): merging a plurality of smaller index segments belonging to the same index shard into one larger new index segment (also called a merged index segment). This operation both reduces the total number of file handles in use and improves read performance. Referring to FIG. 3, merging the index segments on the left, which are on the same index shard, yields a large index segment 301; merging those in the middle yields a large index segment 302; merging those on the right yields a large index segment 303; and merging the newly generated index segments 301, 302, and 303 yields a still larger index segment 304.
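As a rough illustration of a merge, the sketch below combines the inverted lists of several small segments into one. Modelling a segment as a plain keyword-to-document-ids mapping is an assumption; a real segment would also carry the stored documents and metadata.

```python
def merge_segments(segments):
    """Merge several inverted indexes (keyword -> doc ids) into one."""
    merged = {}
    for segment in segments:
        for term, doc_ids in segment.items():
            merged.setdefault(term, set()).update(doc_ids)
    # Keep each merged posting list sorted by document id.
    return {term: sorted(doc_ids) for term, doc_ids in merged.items()}
```

After the merge, a keyword lookup consults one posting list instead of one per small segment.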
The scheme provided by the embodiment of the application relates to database technology, and also relates to distributed clustering, and the specific process is illustrated by the following embodiment.
The index processing method provided by the embodiments of the application can be applied in the application environment shown in FIG. 4. The application environment comprises a terminal 402 and a distributed cluster 404. The terminal 402 can trigger requests to add documents to, query, modify, and delete documents from an index. The distributed cluster 404 may be used to process at least one index, such as an index of order data, an index of customer data, or an index of product data, with some or all of the nodes in the distributed cluster 404 storing, in a distributed manner, all of the data that forms a given index. The distributed cluster 404 includes a plurality of nodes. These include a coordinating node 4041, which routes a document to be written to the index to a particular index shard, and nodes 4042, 4043, 4044, and so on, which store index shards. Each of the nodes 4042, 4043, and 4044 may store at least one index shard (a primary shard or a replica shard), and the primary shard and replica shard of the same index shard are stored on different nodes. For example, node 4042 stores the primary shard of index shard 1, the replica shard of index shard 2, and the primary shard of index shard 3; node 4043 stores the primary shard of index shard 2 and the replica shard of index shard 3; and node 4044 stores the replica shard of index shard 1.
In one embodiment, whenever the node where the primary shard of a target index shard is located receives a document to be indexed, it processes the document to obtain retrieval data associated with the document and writes the retrieval data into the segment cache of the primary shard of the target index shard; when the segment cache meets the index segment generation condition, it reads the retrieval data from the segment cache to generate an index segment and adds the index segment to the primary shard of the target index shard; and it copies the index segment to the replica shard of the target index shard.
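The placement rule in this example (the primary shard and replica shard of one index shard never share a node) can be expressed as a small check. The dictionary layout, the node labels, and the function name are assumptions for illustration.

```python
# Placement from the example: node -> list of (index shard number, role).
placement = {
    "node_4042": [(1, "primary"), (2, "replica"), (3, "primary")],
    "node_4043": [(2, "primary"), (3, "replica")],
    "node_4044": [(1, "replica")],
}


def colocated_copies(placement):
    """Return (node, shard number) pairs where one node holds two copies."""
    conflicts = []
    for node, copies in placement.items():
        seen = set()
        for shard_no, _role in copies:
            if shard_no in seen:
                conflicts.append((node, shard_no))
            seen.add(shard_no)
    return conflicts
```

For the placement in the example the check reports no conflicts, so the loss of any single node leaves at least one copy of every index shard available.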
The nodes in the distributed cluster 404 can communicate with one another, which supports data interaction (e.g., copying index segments, copying documents) between the node where the primary shard of each index shard is located and the node where its replica shard is located. Each node in the distributed cluster 404 may be an independent server, a server cluster or distributed system, or a cloud server that provides cloud computing services. The terminal 402 may be, but is not limited to, a smart phone, tablet, notebook, desktop computer, smart speaker, smart watch, vehicle-mounted terminal, smart television, or the like. The terminal 402 and the servers in the distributed cluster 404 may be connected directly or indirectly through wired or wireless communication, which is not limited herein.
The flow of indexing documents in the related art is described below with reference to the explanation above.
FIG. 5 is a schematic diagram of how, in the related art, data is backed up on the replica shard corresponding to a primary shard. As shown in FIG. 5, the backup involves two main steps:
1. When the node hosting the primary shard of an index shard receives a request to index a new document, the node processes the document, converts it into the corresponding retrieval data, and writes the document into a log file so that the document is not lost before it has been added to the index shard.
2. After the node hosting the primary shard has finished its processing, the document is distributed to the node hosting each replica shard. Each of these nodes also has to process the document, convert it into the corresponding retrieval data, and write the document into a log file.
Document processing generally involves a large amount of CPU computation. In the related art, however, the same document is processed once on the node hosting the primary shard and again on the node hosting each replica shard. This redundant computation wastes CPU resources across the distributed cluster and degrades its overall processing performance. Moreover, the replica shard side and the primary shard side process the same document independently. As a result, although both sides correspond to the same document set and can be used to search that document set, the index segments they contain may differ.
In the embodiments of this application, each time the node hosting the primary shard of the target index shard receives a document to be indexed, that node processes the document to obtain the retrieval data associated with it and writes the retrieval data into the segment cache of the primary shard of the target index shard. When the segment cache satisfies the index segment generation condition, the node generates an index segment from all of the retrieval data held in the segment cache and adds the index segment to the primary shard of the target index shard. The index segment is then copied directly to the replica shard of the target index shard, so the node hosting the replica shard does not need to process the document again to create another index segment. Backup is achieved while redundant document processing is reduced. Because a newly added document no longer has to be processed on both the primary shard side and the replica shard side, the reliability of the distributed cluster and the efficiency of data retrieval are improved, CPU resources across the cluster are saved, and the overall processing performance of the cluster is improved.
In one embodiment, as shown in FIG. 6, an index processing method is provided. The method is described using the example of its application to the node hosting a primary shard in the distributed cluster 404 of FIG. 4, and includes the following steps:
Step 602, each time the node hosting the primary shard of the target index shard receives a document to be indexed, process the document to obtain retrieval data associated with the document, and write the retrieval data into the segment cache of the primary shard of the target index shard.
An index includes at least one index shard. Each index shard includes a primary shard and a replica shard, and the primary shard and replica shard of the same index shard are distributed on different nodes. Each index shard includes at least one index segment, and each index segment is the smallest searchable data unit of the index. An index is a collection of similar documents, such as an index of order data, an index of customer data, or an index of product information.
It can be understood that each node in the distributed cluster can store several different index shards at the same time, and each stored shard may be either the primary shard or a replica shard of its index shard. The method can therefore be executed by any node in the distributed cluster.
A document to be indexed is a document that belongs to the index to which it is to be added. For example, for an index of order data, the document to be indexed may be any piece of order data; for an index of customer data, any piece of customer data; for an index of product information, any piece of product information; for an index of video information, any piece of video information; and so on.
A document is added to only one index shard. In one embodiment, a terminal may send the distributed cluster a request to index a document, and the request may carry an index identifier. Any node in the distributed cluster, such as the coordinating node or a node hosting an index shard, may receive the request. Taking the case where the coordinating node receives the request as an example: the coordinating node determines the index indicated by the index identifier, determines from the index shards of that index the target index shard to which the document is to be added, determines the node hosting the primary shard of the target index shard, and forwards the document to that node. The node hosting the primary shard of the target index shard thus receives the document to be indexed.
Optionally, when determining the target index shard, the coordinating node may obtain the document identifier (doc_id) of the document, map the document identifier to a hash value, and apply a modulo operation to the hash value to obtain a value that maps to a unique index shard identifier. The index shard corresponding to that identifier is used as the target index shard. This distributes documents evenly across the index shards, balances the load on the nodes, and improves response performance for batch data queries.
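The hash-and-modulo routing described above can be sketched as follows. This is a minimal illustration, not the implementation of the application: the function name `route_to_shard`, the choice of SHA-256 as the hash, and taking the modulus to be the number of index shards are all assumptions made for the example.

```python
import hashlib


def route_to_shard(doc_id: str, num_shards: int) -> int:
    """Return the shard number in [0, num_shards) for a document identifier."""
    if num_shards <= 0:
        raise ValueError("num_shards must be positive")
    # Assumption: SHA-256 serves only as a stable, well-mixed hash, so that
    # every node computes the same value for the same doc_id.
    digest = hashlib.sha256(doc_id.encode("utf-8")).digest()
    hash_value = int.from_bytes(digest[:8], "big")
    # Assumption: the modulus is the number of index shards, so the result
    # identifies exactly one index shard.
    return hash_value % num_shards
```

Because the mapping depends only on the document identifier and the shard count, any node that receives the request would select the same target index shard.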
When the node hosting the primary shard of the target index shard receives a document to be indexed, it can perform document processing on the document to obtain the retrieval data associated with it. Document processing consumes CPU resources; when many documents are being added, the CPU consumption on each node is substantial. As one example of document processing, the node hosting the primary shard can segment the document into words and count the frequency of each word in order to extract the keywords of the document. As another example, it can extract semantic information from the document to obtain the subject terms of the document. The information produced by this series of processing steps is the information that is useful for searching for the document, and is collectively referred to as the retrieval data of the document. The retrieval data associated with a document may also include the document itself.
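As a rough illustration of the word segmentation and word-frequency example above, the sketch below produces a small retrieval-data record for one document. The function name `process_document`, the record layout, and the regular-expression tokenizer (which only suits text with explicit word boundaries) are assumptions for the example; semantic extraction is not shown.

```python
import re
from collections import Counter


def process_document(doc_id: str, text: str, top_k: int = 3) -> dict:
    """Build a hypothetical retrieval-data record for one document."""
    # Word segmentation: lowercase the text and keep alphanumeric runs.
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    # Word-frequency statistics over the segmented words.
    term_freq = Counter(tokens)
    # Treat the most frequent words as the keywords of the document.
    keywords = [term for term, _ in term_freq.most_common(top_k)]
    return {
        "doc_id": doc_id,
        "term_freq": dict(term_freq),
        "keywords": keywords,
        "source": text,  # the retrieval data may also include the document itself
    }
```

Even this toy version does work proportional to the length of the document, which hints at why repeating the processing on every replica shard is costly.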
The segment cache (segment buffer) of the primary shard of the target index shard is a region of memory used to buffer the retrieval data of the documents accumulated within a certain time window. Each time the node hosting the primary shard of the target index shard receives a document to be indexed, it processes the document to obtain the associated retrieval data and writes that retrieval data into the segment cache of the primary shard. If the node stores several primary shards, it can be understood that each primary shard has its own segment cache.
Optionally, in one embodiment, after writing the retrieval data into the segment cache of the primary shard, the node hosting the primary shard of the target index shard may also write the document into the log file of the primary shard of the target index shard. In addition, it may send the document to the node hosting the replica shard of the target index shard, so that this node writes the document into the log file of the replica shard of the target index shard.
In this embodiment, the log file can record delete, update, query, and add requests for the index so that data is not lost; if data needs to be recovered, it can be read from the log file. Each index shard has a corresponding log file. The node hosting the primary shard of the target index shard writes the document into the log file of the primary shard; if that node goes down before the document has been indexed, that is, before the document has been added to the target index shard, the data can be recovered from the log file. The node hosting the primary shard also sends the document to the node hosting the replica shard of the target index shard. On receiving the document, the node hosting the replica shard only writes it into the log file of the replica shard and does not process it, which reduces CPU consumption on that node. If the node hosting the replica shard suddenly goes down, the data can likewise be recovered from the log file.
Step 604, when the segment cache satisfies the index segment generation condition, read the retrieval data from the segment cache to generate an index segment, and add the index segment to the primary shard of the target index shard.
Specifically, the node hosting the primary shard of the target index shard determines whether the segment cache of that primary shard currently satisfies the index segment generation condition. If so, the node reads all of the retrieval data (that is, the retrieval data of each of the buffered documents) from the segment cache and generates an index segment from it; the index segment can be used to search for any of those documents. If not, the node continues to receive new documents, performs document processing on them, and writes the resulting retrieval data into the segment cache.
The segment cache satisfies the index segment generation condition when it satisfies a preset condition. For example, if the segment cache holds the retrieval data of N documents and N is greater than a set threshold, the segment cache is determined to satisfy the index segment generation condition. As another example, if the amount of memory currently occupied by the segment cache is greater than a set threshold, the segment cache is determined to satisfy the condition. As a further example, if the time elapsed since the last index segment was generated is greater than a set threshold, the segment cache is determined to satisfy the condition. A combination of several of these criteria may also be used as the index segment generation condition, which is not limited in the embodiments of this application.
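The three example criteria can be combined as in the following sketch. The class name `SegmentCache`, its field names, and the default threshold values are hypothetical; combining the criteria with a logical OR and treating an empty cache as never ready are design assumptions of this example, since the text leaves the combination open.

```python
import time
from dataclasses import dataclass, field


@dataclass
class SegmentCache:
    max_docs: int = 1000                 # assumed document-count threshold
    max_bytes: int = 16 * 1024 * 1024    # assumed memory threshold
    max_age_seconds: float = 1.0         # assumed time threshold
    entries: list = field(default_factory=list)
    used_bytes: int = 0
    last_segment_time: float = field(default_factory=time.monotonic)

    def add(self, retrieval_data: dict, size_bytes: int) -> None:
        """Buffer the retrieval data of one processed document."""
        self.entries.append(retrieval_data)
        self.used_bytes += size_bytes

    def should_generate_segment(self, now=None) -> bool:
        """Return True when any one of the three example criteria holds."""
        now = time.monotonic() if now is None else now
        if not self.entries:
            # Assumption: an empty cache never produces an index segment.
            return False
        return (
            len(self.entries) > self.max_docs
            or self.used_bytes > self.max_bytes
            or now - self.last_segment_time > self.max_age_seconds
        )
```

The optional `now` argument exists only so the time criterion can be exercised deterministically.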
When the segment cache satisfies the index segment generation condition, the node hosting the primary shard of the target index shard reads the retrieval data from the segment cache to generate an index segment and adds it to the primary shard of the target index shard. That is, the index segment is stored on the disk of the node hosting the primary shard; the documents have then been added to the index and fall within the search scope of any query request.
One optional way for the node hosting the primary shard of the target index shard to read the retrieval data from the segment cache, generate an index segment, and add it to the primary shard is as follows: the node reads all of the retrieval data from the segment cache, generates an index segment from it, and persists the index segment to the primary shard of the target index shard when a persistence (flush) instruction is triggered.
It can be understood that an index segment is generally not a simple collection of the retrieval data associated with each of the documents. Rather, after all of the retrieval data has been read from the segment cache, it is processed further to obtain a set of files that improves query efficiency. For example, the node hosting the primary shard of the target index shard may remove redundancy from the retrieval data, reorder it, and so on, to generate an index segment that can be used to search those documents.
In one embodiment, reading the retrieval data from the segment cache to generate an index segment and adding the index segment to the primary shard of the target index shard includes: reading, from the segment cache, the retrieval data associated with each of a plurality of documents; generating an index segment from the retrieval data, where the index segment includes a set of files used to query each of the plurality of documents; refreshing the index segment into the file system cache of the primary shard of the target index shard; and, when a persistence instruction is triggered, persisting at least one index segment in the file system cache to the primary shard of the target index shard.
In this embodiment, each time an index segment is generated, the node hosting the primary shard of the target index shard may refresh it into the file system cache of that primary shard. When the persistence instruction is triggered, the node may read all of the index segments from the file system cache and persist them to the primary shard of the target index shard in a single operation.
In one embodiment, the node hosting the primary shard of the target index shard may also, at an appropriate time, merge several index segments in the file system cache into a merged index segment and remove those index segments from the file system cache. When the persistence instruction is triggered, the node may read all of the index segments (including the merged index segment) from the file system cache and persist them to the primary shard of the target index shard in a single operation.
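The refresh, in-cache merge, and flush behavior described in these embodiments can be modeled with two in-memory lists. This is a simplified sketch: the class `PrimaryShardStore`, its method names, and the dictionary layout of a segment are hypothetical, and the `persisted` list merely stands in for the on-disk primary shard.

```python
class PrimaryShardStore:
    """Toy model of the file system cache and the persisted primary shard."""

    def __init__(self):
        self.fs_cache = []   # index segments refreshed but not yet persisted
        self.persisted = []  # stands in for segments stored on disk

    def refresh(self, segment: dict) -> None:
        """Place a newly generated index segment into the file system cache."""
        self.fs_cache.append(segment)

    def merge_cached(self, merged_id: str) -> None:
        """Replace the cached segments with one merged index segment."""
        if len(self.fs_cache) < 2:
            return
        docs = [doc for segment in self.fs_cache for doc in segment["docs"]]
        self.fs_cache = [{"id": merged_id, "docs": docs}]

    def flush(self) -> list:
        """Persist every cached index segment in a single operation."""
        flushed = list(self.fs_cache)
        self.persisted.extend(flushed)
        self.fs_cache.clear()
        return flushed
```

Merging before the flush means fewer, larger segments reach the persisted shard, which in turn means fewer segments to copy to the replica shard.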
Step 606, copy the index segment to the replica shard of the target index shard.
To improve the reliability of the distributed cluster and the efficiency of data retrieval, after the node hosting the primary shard of the target index shard adds a newly generated index segment to the primary shard, it triggers a copy request for each index segment so added and copies the index segment to the replica shard of the target index shard. When the target index shard has multiple replica shards, the node hosting the primary shard copies the index segment to each of them. A newly added document therefore does not need to be processed separately on the primary shard side and the replica shard side, which improves the reliability of the distributed cluster and the efficiency of data retrieval, saves CPU resources across the cluster, and improves the overall processing performance of the cluster.
FIG. 7 is a schematic diagram of data synchronization between a primary shard and a replica shard in one embodiment. Referring to FIG. 7, the synchronization includes the following steps:
When the node hosting the primary shard of the target index shard receives a request to add a document to the index, it first performs steps (1), (2), and (3). Step (1): perform document processing on the document and write the resulting retrieval data into the segment cache in memory. Step (2): write the document into the log file of the primary shard of the target index shard. Step (3): distribute the document to the node hosting each replica shard of the target index shard; the node hosting a replica shard only writes the document into the log file of the replica shard and does not process the document.
Next, the node hosting the primary shard of the target index shard performs steps (4) and (5). Step (4): when a certain amount of retrieval data has accumulated in the segment cache, or a fixed time window has elapsed since the last index segment was generated, the node is triggered to generate a complete index segment from all of the retrieval data in the segment cache. Step (5): trigger a copy request for the complete index segment to copy it from the primary shard of the target index shard to the replica shard.
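Steps (1) to (5) can be traced end to end with the in-memory sketch below. It is illustrative only: the classes `PrimaryNode` and `ReplicaNode`, the whitespace tokenizer, and the document-count trigger `docs_per_segment` are assumptions, and network transfer, log persistence, and the time-window trigger are omitted.

```python
class ReplicaNode:
    def __init__(self):
        self.log = []                 # log file of the replica shard
        self.segments = []            # index segments copied from the primary shard
        self.documents_processed = 0  # stays 0: no document processing happens here

    def append_log(self, document: str) -> None:
        self.log.append(document)

    def receive_segment(self, segment: dict) -> None:
        self.segments.append(segment)


class PrimaryNode:
    def __init__(self, replicas, docs_per_segment: int = 2):
        self.replicas = replicas
        self.docs_per_segment = docs_per_segment  # assumed generation condition
        self.segment_cache = []
        self.log = []
        self.segments = []
        self.documents_processed = 0

    def index(self, document: str) -> None:
        # Step (1): process the document and buffer its retrieval data.
        terms = sorted(set(document.lower().split()))
        self.documents_processed += 1
        self.segment_cache.append({"doc": document, "terms": terms})
        # Step (2): write the document into the primary shard's log file.
        self.log.append(document)
        # Step (3): replicas only log the document and do not process it.
        for replica in self.replicas:
            replica.append_log(document)
        # Step (4): generate an index segment once the condition is met.
        if len(self.segment_cache) >= self.docs_per_segment:
            segment = {
                "id": f"segment{len(self.segments) + 1}",
                "entries": self.segment_cache,
            }
            self.segment_cache = []
            self.segments.append(segment)
            # Step (5): copy the finished index segment to every replica.
            for replica in self.replicas:
                replica.receive_segment(segment)
```

In this model the replica's `documents_processed` counter never changes, yet its segment list ends up identical to the primary's, which is the property the embodiment relies on.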
With the index processing method described above, each time the node hosting the primary shard of the target index shard receives a document to be indexed, that node processes the document to obtain the retrieval data associated with it and writes the retrieval data into the segment cache of the primary shard of the target index shard. When the segment cache satisfies the index segment generation condition, the node generates an index segment from all of the retrieval data held in the segment cache, adds the index segment to the primary shard of the target index shard, and then copies the index segment directly to the replica shard of the target index shard. A newly added document therefore does not need to be processed separately on the primary shard side and the replica shard side, which improves the reliability of the distributed cluster and the efficiency of data retrieval, saves CPU resources across the cluster, and improves the overall processing performance of the cluster.
In one embodiment, copying the index segment to the replica shard of the target index shard includes: sending the index segment to the node hosting the replica shard of the target index shard, so that this node adds the index segment to the replica shard of the target index shard.
Optionally, after receiving the index segment, the node hosting the replica shard of the target index shard may hold it temporarily in the file system cache. When the node receives the updated index view file sent by the node hosting the primary shard, it adds the received index segment to the replica shard of the target index shard according to the updated index view file; the index segment is then stored on the disk of the node hosting the replica shard and can be searched.
Optionally, the node hosting the primary shard of the target index shard may generate a copy request from the index segment, the segment identifier of the index segment, and the shard identifier of the target index shard, and send the copy request to the node hosting the replica shard of the target index shard. After receiving the copy request, that node obtains the index segment, the segment identifier, and the shard identifier from it. When the node then receives an updated index view file that includes the segment identifier, it determines the replica shard of the target index shard corresponding to the shard identifier and adds the index segment to that replica shard.
In this embodiment, the index segment is distributed to the replica shard of the target index shard. Even though the replica shard side neither processes documents nor generates index segments itself, the index segments in the primary shard of the target index shard are identical to those in the replica shard. This improves the reliability of the distributed cluster and the efficiency of data retrieval, and saves CPU resources across the cluster.
In one embodiment, copying the index segment to the replica shard of the target index shard includes: determining, according to the file size of the index segment, the manner in which the index segment is copied to the replica shard of the target index shard, and distributing the index segment to the node hosting the replica shard in the determined manner.
To copy the index segment to the replica shard of the target index shard, the node hosting the primary shard must first generate a copy request. This requires reading the index segment into the memory of the node hosting the primary shard and then sending it over the network to the node hosting the replica shard. To avoid occupying too much memory on the node hosting the primary shard at one time, and to reduce memory consumption on the node hosting the replica shard, the manner of copying can be chosen according to the file size of the index segment: the index segment is either copied to the replica shard directly, or split first and then copied.
In one embodiment, determining the manner of copying according to the file size of the index segment and distributing the index segment to the node hosting the replica shard of the target index shard in the determined manner includes: when the file size of the index segment is less than or equal to a preset threshold, sending the index segment to the node hosting the replica shard of the target index shard; and when the file size of the index segment is greater than the preset threshold, splitting the index segment into multiple data blocks and sending the resulting data blocks in sequence to the node hosting the replica shard of the target index shard, where the size of each data block is smaller than the preset threshold.
Specifically, if the file size of the index segment is less than or equal to a preset threshold X (for example, 512 KB), the node hosting the primary shard compresses and packs the index segment into a single data packet, generates the corresponding copy request, and sends it to the node hosting the replica shard of the target index shard. If the file size of the index segment is greater than the preset threshold X, the node hosting the primary shard splits the index segment into multiple data blocks (chunks) according to the preset threshold X, generates data block copy requests, and sends them in sequence to the node hosting the replica shard of the target index shard.
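The size-based choice between a single packet and a sequence of data blocks might look like the sketch below. The function `build_copy_requests` and the request dictionary layout are hypothetical, and treating the index segment as one byte string is a simplification; `zlib` stands in for whatever compression the nodes actually use.

```python
import zlib
from typing import Dict, List

THRESHOLD_X = 512 * 1024  # example value of the preset threshold X (512 KB)


def build_copy_requests(segment_id: str, data: bytes,
                        threshold: int = THRESHOLD_X) -> List[Dict]:
    """Return the copy requests to send for one index segment."""
    if len(data) <= threshold:
        # Small segment: compress and pack it into a single data packet.
        return [{"kind": "segment", "segment_id": segment_id,
                 "payload": zlib.compress(data)}]
    # Large segment: split into blocks of at most `threshold` bytes and
    # number them so the receiver can reassemble them in order.
    return [
        {"kind": "chunk", "segment_id": segment_id, "seq": seq,
         "payload": data[offset:offset + threshold]}
        for seq, offset in enumerate(range(0, len(data), threshold))
    ]
```

Only one block needs to be resident in a request at a time, which bounds the memory used on both the sending and the receiving node.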
In one embodiment, splitting the index segment into multiple data blocks and sending them in sequence to the node hosting the replica shard of the target index shard includes: generating one data block from several small files in the index segment, where the total size of those small files is less than or equal to the preset threshold; splitting a large file in the index segment into multiple data blocks, where the size of the large file is greater than the preset threshold; and sending each data block in sequence to the node hosting the replica shard of the target index shard.
Optionally, the node hosting the primary shard may sort the files in the index segment (a set of files) by file size in ascending order. The first data block contains as many of the small files in the index segment as possible, provided that the size of the first data block does not exceed the preset threshold X; that is, the first data block is associated with many different small files. Each of the remaining unsent files is then split into one or more data blocks, and the size of each data block does not exceed the preset threshold X.
FIG. 8 is a schematic diagram of splitting an index segment into multiple data blocks in one embodiment. Referring to FIG. 8, the index segment includes a set of files whose names begin with "_c", where "_c" is the segment identifier of the index segment and indicates that the files belong to it. Among them, "_c.fdt" is a row-store data file, "_c_Lucene80_0.dvd" is a column-store file, and "_c_Lucene84_0.tim" is a term dictionary data file built from keywords. The files in FIG. 8 are listed from top to bottom in ascending order of file size. The 11 files from "_c.fdx" to "_c.kdd" are packed together into the first data block. Each subsequent file is split separately into one or more data blocks; for example, "_c_Lucene80_0.dvd" is split into 5 data blocks.
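The packing strategy illustrated by FIG. 8 can be sketched as a planning function that works on file sizes only. The function name `build_chunks` and the `(file name, offset, length)` representation of a block are assumptions for the example; the file names and sizes used with it are made up and are not those of the figure.

```python
from typing import Dict, List, Tuple

Piece = Tuple[str, int, int]  # (file name, byte offset, length)


def build_chunks(files: Dict[str, int], threshold: int) -> List[List[Piece]]:
    """Plan data blocks for an index segment given each file's size in bytes."""
    ordered = sorted(files.items(), key=lambda item: item[1])  # ascending size
    chunks: List[List[Piece]] = []
    first: List[Piece] = []
    used = 0
    index = 0
    # Pack as many of the smallest files as fit into the first data block.
    while index < len(ordered) and used + ordered[index][1] <= threshold:
        name, size = ordered[index]
        first.append((name, 0, size))
        used += size
        index += 1
    if first:
        chunks.append(first)
    # Split each remaining file on its own into blocks of at most `threshold`.
    for name, size in ordered[index:]:
        for offset in range(0, size, threshold):
            chunks.append([(name, offset, min(threshold, size - offset))])
    return chunks
```

For instance, with a threshold of 100 bytes and files of 10, 20, 120, and 250 bytes, the two smallest files share the first block, the 120-byte file becomes two blocks, and the 250-byte file becomes three.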
In one embodiment, after the index segment is added to the primary shard of the target index shard, the method further includes: updating the index view file of the primary shard of the target index shard, where the updated index view file indicates all index segments persisted to the primary shard of the target index shard; and sending the updated index view file to the node hosting the replica shard of the target index shard, so that when the updated index view file contains the segment identifier of the copied index segment, that node adds the copied index segment to the replica shard of the target index shard and uses the received updated index view file as the index view file of the replica shard.
The index view file of the primary shard of the target index shard indicates all index segments persisted to that primary shard; only index segments that have been added to the index view file can be accessed by query requests. The index view file may include the segment identifiers of all index segments persisted to the primary shard of the target index shard.
Specifically, on the node hosting the primary shard of the target index shard, each time one or more index segments are added to the primary shard (a flush operation), an update of the index view file of that primary shard is triggered, and the updated index view file is also copied to the node hosting the replica shard of the target index shard. When the node hosting the replica shard receives the updated index view file, it can compare its local index view file with the updated one to determine whether the updated file newly includes the segment identifier of an index segment copied from the node hosting the primary shard. If so, the node adds that index segment to the replica shard of the target index shard and uses the received updated index view file as the index view file of the replica shard. Before the index view file is copied to the node hosting the replica shard, it must be ensured that all index segments associated with the index view file have already been copied to that node.
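The replica-side handling of an updated index view file can be sketched as follows. The class `ReplicaShard`, its attributes, and the representation of an index view file as a list of segment identifiers are assumptions for the example; raising an error when a listed segment has not arrived is one possible way to enforce the ordering requirement stated above.

```python
class ReplicaShard:
    def __init__(self):
        self.view = []        # segment identifiers in the local index view file
        self.pending = {}     # segments received but not yet searchable
        self.searchable = {}  # segments added to the replica shard

    def receive_segment(self, segment_id: str, data: bytes) -> None:
        """Hold a copied index segment until a view file references it."""
        self.pending[segment_id] = data

    def receive_view(self, new_view: list) -> list:
        """Apply an updated index view file; return the newly added identifiers."""
        added = [sid for sid in new_view if sid not in self.view]
        missing = [sid for sid in added if sid not in self.pending]
        if missing:
            # Every referenced segment should have been copied before the view.
            raise ValueError(f"segments not yet received: {missing}")
        for sid in added:
            self.searchable[sid] = self.pending.pop(sid)
        # The received file becomes the replica shard's index view file.
        self.view = list(new_view)
        return added
```

A copied segment therefore becomes visible to queries on the replica only at the moment the view file that lists it is applied.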
FIG. 9 is a schematic diagram of copying an updated index view file in one embodiment. Referring to FIG. 9, the segments_1 file is an index view file that is associated with 4 index segments, {segment1, segment2, segment3, segment4}, which represent all of the index segments currently persisted to the primary shard of the target index shard. After the node hosting the primary shard has copied all 4 index segments to the replica shard of the target index shard, it triggers copying of the index view file to the node hosting the replica shard.
In the related art, the index segments on the replica shard side of the target index shard are generated by the replica shard side itself and may differ from those on the primary shard side. The index view file on the replica shard side therefore also has to be generated by that side from the index segments the replica shard contains, and it differs from the index view file generated on the primary shard side.
In this embodiment, the index segments are copied to the replica shard of the target index shard and the updated index view file is copied to the node hosting the replica shard, so the index segments and index view files on the two sides are identical. In other words, even though the replica shard side neither processes documents nor generates index segments and index view files itself, the replica shard of the target index shard can still be accessed by user query requests. This improves the reliability of the distributed cluster and the efficiency of data retrieval, and saves CPU resources across the cluster.
In one embodiment, the method further includes: each time a plurality of index segments persisted to the primary shard of the target index shard are merged into a merged index segment, adding the merged index segment to the primary shard of the target index shard; deleting the plurality of index segments from the primary shard of the target index shard; and copying the merged index segment to the replica shard of the target index shard.
Specifically, when a plurality of index segments persisted to the primary shard of the target index shard are merged, a new merged index segment is obtained. The plurality of index segments are deleted from the primary shard of the target index shard, the new merged index segment is added to it, and at the same time the new merged index segment is copied to the replica shard of the target index shard.
The timing (or condition) for triggering the merging of the plurality of index segments persisted to the main shard of the target index shard may be: the number of index segments in the main shard reaches a set threshold. During merging, multiple small index segments of similar file size in the main shard may be merged into one large index segment.
It can be understood that merging does not simply aggregate the groups of files contained in the small index segments; instead, the small index segments are read from disk and further processed to obtain a group of files that improves query efficiency. For example, the multiple index segments may be deduplicated, reordered, and updated to generate a new index segment.
The new merged index segment may also be copied to the copy shard of the target index shard in the aforementioned manner of splitting it into data blocks, which is not repeated here.
FIG. 10 is a schematic diagram of merging index segments and copying the result to the copy shard, in one embodiment. Referring to fig. 10, the node where the main shard of the target index shard is located merges some or all of the index segments in the main shard; for example, {segment1, segment2, segment3, segment4} are merged into segment5. The merged segments {segment1, segment2, segment3, segment4} are then deleted from the main shard, and the resulting segment5 is copied to the copy shard of the target index shard.
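The merge of fig. 10 can be illustrated with a minimal Python sketch. It is not the implementation of this application: the `Segment` class, the `(version, text)` document layout, and the `copy_to_replica` callback are all hypothetical, and "deduplicate, reorder" is modeled simply as keeping the highest version of each document and sorting by document ID.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """Hypothetical in-memory stand-in for an index segment."""
    name: str
    docs: dict  # doc_id -> (version, text); layout assumed for illustration


def merge_segments(segments, merged_name):
    """Merge small segments: keep the newest version of each document
    (deduplication) and order the result by doc_id (reordering)."""
    latest = {}
    for seg in segments:
        for doc_id, (version, text) in seg.docs.items():
            if doc_id not in latest or version > latest[doc_id][0]:
                latest[doc_id] = (version, text)
    return Segment(merged_name, dict(sorted(latest.items())))


def merge_on_main_shard(main_shard, names, merged_name, copy_to_replica):
    """Add the merged segment to the main shard, delete the merged
    inputs, and hand the merged segment to the replication callback."""
    merged = merge_segments([main_shard[n] for n in names], merged_name)
    main_shard[merged_name] = merged
    for n in names:
        del main_shard[n]
    copy_to_replica(merged)  # assumed transport hook to the copy shard
    return merged
```

In this sketch the copy shard never runs `merge_segments` itself; it only receives the finished `segment5`, which is the CPU saving the embodiment describes.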
In one embodiment, the index processing method further includes: updating the index view file of the main shard of the target index shard according to the index segments in the updated main shard, the updated index view file being used to indicate the valid index segments persisted to the main shard of the target index shard; and sending the updated index view file to the node where the copy shard of the target index shard is located, so that this node compares the index view file before the update with the updated index view file, removes the plurality of index segments from the copy shard of the target index shard, and adds the received merged index segment to the copy shard of the target index shard.
As mentioned above, each time an index segment is added to the main shard of the target index shard (i.e., a flush operation), an update of the index view file is triggered. Adding the new index segment (i.e., the merged index segment) to the main shard of the target index shard and removing the plurality of index segments from the main shard likewise triggers an update of the index view file. The updated index view file is copied to the node where the copy shard of the target index shard is located.
FIG. 11 is a schematic diagram of updating the index view file and copying it to the copy shard, in one embodiment. Referring to fig. 11 in conjunction with fig. 10, after the merge completes, an update operation is triggered: the current index view file segments_1 is updated to generate a new index view file segments_2. In segments_2, the new index segment segment5 produced by the merge is added and the merged index segments {segment1, segment2, segment3, segment4} are removed; that is, segments_2 indicates the index segment that is currently valid and queryable in the main shard, namely segment5. After the merged index segment segment5 has been successfully copied to the copy shard of the target index shard, the updated index view file segments_2 is also copied to the copy shard. When the node where the copy shard is located receives segments_2, it compares it with the previous index view file segments_1 and finds that {segment1, segment2, segment3, segment4} have been deleted. It therefore deletes these index segments from the local copy shard as well, deletes the previous index view file segments_1, and uses segments_2 as the index view file of the copy shard.
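The comparison performed on the copy-shard side can be sketched as a set difference between the two index view files. The following Python sketch is illustrative only: it assumes that an index view file can be represented as a list of segment identifiers and that received segments are held in a dictionary; the function name `apply_updated_view` is hypothetical.

```python
def apply_updated_view(copy_segments, old_view, new_view, received):
    """Bring the copy shard in line with an updated index view file.

    copy_segments: dict segment_id -> data currently in the copy shard
    old_view / new_view: lists of segment ids (assumed view-file model)
    received: dict segment_id -> data already copied from the main shard
    """
    old_ids, new_ids = set(old_view), set(new_view)
    removed = old_ids - new_ids   # merged away on the main shard
    added = new_ids - old_ids     # e.g. the merged index segment

    missing = added - set(received)
    if missing:
        # The view refers to a segment that has not arrived yet.
        raise ValueError(f"segments not yet received: {sorted(missing)}")

    for seg_id in removed:
        copy_segments.pop(seg_id, None)
    for seg_id in added:
        copy_segments[seg_id] = received[seg_id]
    return sorted(removed), sorted(added)
```

Raising an error when a listed segment is missing is a design choice of this sketch; it mirrors the ordering in the text, where segment5 is copied successfully before the updated view file is sent.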
In the related art, the index segments on the copy-shard side of the target index shard are generated by the copy shard itself and may differ from the index segments on the main-shard side. Merging on the copy-shard side and merging on the main-shard side are therefore performed separately, the merged index segments differ, and each side must also update its own index view file after merging according to the index segments it contains.
In this embodiment, the new merged index segment is copied to the copy shard of the target index shard and the updated index view file is copied to the node where the copy shard is located, so the index segments and index view files on both sides are identical. That is, even though the copy-shard side neither processes documents itself, nor generates the corresponding index segments itself, nor updates the index segments and index view file itself, the copy shard of the target index shard can still serve users' query requests. This improves the reliability and data retrieval efficiency of the whole distributed cluster and saves its CPU resources.
As shown in fig. 12, an index processing method is provided. The method is applied to a node where a copy shard is located in the distributed cluster 104 of fig. 1, and includes the following steps:
Step 1202, receiving an index segment copied by the node where the main shard of the target index shard is located, the index segment having been persisted to the main shard of the target index shard;
Step 1204, receiving an index view file copied by the node where the main shard is located, the index view file being used to indicate all index segments persisted to the main shard of the target index shard;
Step 1206, when the copied index view file contains the segment identifier of the copied index segment, adding the copied index segment to the copy shard of the target index shard.
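Steps 1202 to 1206 can be sketched as a small receiver on the copy-shard node. This is a hedged illustration, not the implementation of this application: the class `CopyShardNode`, its `pending` staging area, and the method names are hypothetical, and an index view file is again modeled as a list of segment identifiers.

```python
class CopyShardNode:
    """Hypothetical receiver illustrating steps 1202-1206."""

    def __init__(self):
        self.pending = {}    # segments received but not yet listed in a view
        self.segments = {}   # segments that belong to the copy shard
        self.view = None     # segment ids from the latest index view file

    def receive_index_segment(self, segment_id, data):
        """Step 1202: accept an index segment copied from the main shard."""
        self.pending[segment_id] = data

    def receive_index_view(self, view_segment_ids):
        """Steps 1204 and 1206: accept the index view file, then add every
        received segment whose identifier the view file contains."""
        for segment_id in view_segment_ids:
            if segment_id in self.pending:
                self.segments[segment_id] = self.pending.pop(segment_id)
        self.view = list(view_segment_ids)
```

Holding a received segment in `pending` until the view file lists it keeps the copy shard from exposing a segment that the main shard has not yet declared valid.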
In one embodiment, step 1202 includes: receiving a merged index segment copied by the node where the main shard of the target index shard is located, the merged index segment being obtained by that node merging a plurality of index segments persisted to the main shard of the target index shard.
In one embodiment, step 1204 includes: receiving an updated index view file copied by a node where the main partition is located; the updated index view file is updated after the node where the main segment is located adds the combined index segment into the main segment of the target index segment and deletes a plurality of index segments from the main segment of the target index segment; step 1206, comprising: comparing the index view file before updating with the index view file after updating; when the updated index view file does not contain the segment identifiers of the plurality of index segments and contains the segment identifiers of the combined index segments, the plurality of index segments are removed from the copy slices of the target index slices, the received combined index segments are added to the copy slices of the target index slices, and the received updated index view file is used as the index view file of the copy slices of the target index slices.
In one embodiment, the method further comprises: receiving a document sent by the node where the main shard of the target index shard is located; and writing the document into the log file of the copy shard of the target index shard.
For specific implementations of the individual steps of this method, refer to the relevant description above.
In one embodiment, fig. 13 shows a flowchart of an index processing method in a specific embodiment. Referring to fig. 13, the method is described from the perspective of both the node where the main shard of the target index shard is located and the node where the copy shard is located, and includes the following steps:
Step 1302, the node where the main shard is located receives a document to be indexed, processes the document to obtain retrieval data associated with the document, and writes the retrieval data into the segment cache of the main shard of the target index shard; it also writes the document into the log file of the main shard of the target index shard and sends the document to the node where the copy shard is located, so that this node writes the document into the log file of the copy shard of the target index shard.
In step 1304, the node where the main shard is located determines whether the segment cache meets the index segment generation condition; if so, step 1306 is executed; otherwise, the flow returns to step 1302.
In step 1306, the node where the main shard is located reads the retrieval data associated with each of the plurality of documents from the segment cache, generates an index segment according to the retrieval data, and, when a persistence instruction is triggered, persists the index segment into the main shard of the target index shard.
In step 1308, the node where the main shard is located copies the index segment to the copy shard of the target index shard.
Step 1308 specifically comprises: step 1308-2, when the file size of the index segment is less than or equal to a preset threshold, sending the index segment to the node where the copy shard of the target index shard is located; step 1308-4, when the file size of the index segment is greater than the preset threshold, generating a data block from a plurality of small files in the index segment, the total file size of the small files being less than or equal to the preset threshold; splitting a large file in the index segment into a plurality of data blocks, the file size of the large file being greater than the preset threshold; and sequentially sending each data block to the node where the copy shard of the target index shard is located, the size of each data block being smaller than the preset threshold.
In step 1310, the node where the main shard is located updates the index view file of the main shard of the target index shard.
In step 1312, the node where the main shard is located sends the updated index view file to the node where the copy shard of the target index shard is located.
In step 1314, when the received updated index view file contains the segment identifier of the copied index segment, the node where the copy shard is located adds the copied index segment to the copy shard of the target index shard and uses the received updated index view file as the index view file of the copy shard of the target index shard.
Step 1316, the node where the main shard is located merges a plurality of index segments persisted to the main shard of the target index shard to obtain a merged index segment, adds the merged index segment to the main shard of the target index shard, deletes the plurality of index segments from the main shard of the target index shard, and copies the merged index segment to the copy shard of the target index shard.
In step 1318, the node where the main shard is located updates the index view file of the main shard of the target index shard according to the index segments in the updated main shard.
In step 1320, after receiving the updated index view file, the node where the copy shard is located compares the index view file before the update with the updated index view file, removes the plurality of index segments from the copy shard of the target index shard, and adds the received merged index segment to the copy shard of the target index shard.
In this embodiment, each time the node where the main shard of the target index shard is located receives a document to be indexed, that node processes the document to obtain retrieval data associated with the document and writes the retrieval data into the segment cache of the main shard of the target index shard. When the segment cache meets the index segment generation condition, the node generates an index segment from all the retrieval data held in the segment cache, adds the index segment to the main shard of the target index shard, and then copies the index segment directly to the copy shard of the target index shard. Therefore, a newly added document does not need to be processed separately on the main-shard side and on the copy-shard side, which improves the reliability and data retrieval efficiency of the whole distributed cluster, saves its CPU resources, and improves its overall processing performance.
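The main-shard side of this flow can be sketched in Python as follows. The sketch is illustrative only and rests on several assumptions not stated in the text: "processing the document" is modeled as whitespace tokenization into a tiny inverted index, the index segment generation condition is modeled as a document-count threshold (`max_cached_docs`), and `replicate` is a hypothetical callback standing in for the copy to the copy shard.

```python
from collections import defaultdict


class MainShard:
    """Hypothetical main shard: analyze once, then replicate the segment."""

    def __init__(self, max_cached_docs, replicate):
        self.segment_cache = []   # retrieval data not yet in any segment
        self.segments = {}        # segment name -> inverted index
        self.max_cached_docs = max_cached_docs  # assumed generation condition
        self.replicate = replicate              # assumed transport hook
        self._counter = 0

    def add_document(self, doc_id, text):
        # Document processing happens only here, on the main-shard side.
        terms = sorted(set(text.lower().split()))
        self.segment_cache.append((doc_id, terms))
        if len(self.segment_cache) >= self.max_cached_docs:
            self._generate_segment()

    def _generate_segment(self):
        inverted = defaultdict(list)
        for doc_id, terms in self.segment_cache:
            for term in terms:
                inverted[term].append(doc_id)
        self._counter += 1
        name = f"segment{self._counter}"
        self.segments[name] = dict(inverted)
        self.segment_cache.clear()
        # The finished segment is copied; the copy shard does no analysis.
        self.replicate(name, self.segments[name])
```

For example, with `max_cached_docs=2`, the first call to `add_document` only fills the segment cache, and the second call produces `segment1` and passes it to `replicate`.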
It should be understood that, although the steps in the flowcharts of the above embodiments are shown sequentially as indicated by the arrows, these steps are not necessarily performed in the order indicated by the arrows. Unless explicitly stated herein, the order of execution of the steps is not strictly limited, and the steps may be executed in other orders. Moreover, at least some of the steps in the flowcharts of the above embodiments may include a plurality of steps or stages, which are not necessarily performed at the same time but may be performed at different times; these steps or stages are also not necessarily performed one after another, but may be performed in turn or alternately with other steps or with at least some of the steps or stages of other steps.
Based on the same inventive concept, the embodiment of the application also provides an index processing device for realizing the index processing method. The implementation of the solution provided by the device is similar to the implementation described in the above method, so the specific limitation in one or more embodiments of the index processing device provided below may refer to the limitation of the index processing method hereinabove, and will not be repeated herein.
In one embodiment, as shown in fig. 14, an index processing device 1400 is provided, where an index includes at least one index shard, each index shard includes a primary shard and a secondary shard, the primary shard and the secondary shard corresponding to the same index shard are distributed on different nodes, each index shard includes at least one index segment, and each index segment is a minimum retrievable data unit of the index; the device comprises: a caching module 1402, an index segment generation module 1404, a replication module 1406; wherein:
a caching module 1402, configured to, each time the node where the main shard of the target index shard is located receives a document to be indexed, process the document to obtain retrieval data associated with the document, and write the retrieval data into the segment cache of the main shard of the target index shard;
an index segment generation module 1404, configured to, when the segment cache meets the index segment generation condition, read the retrieval data in the segment cache to generate an index segment, and add the index segment to the main shard of the target index shard;
a replication module 1406, configured to copy the index segment to the copy shard of the target index shard.
In one embodiment, the index processing device 1400 further comprises:
a log recording module, configured to write the document into the log file of the main shard of the target index shard;
a document transmission module, configured to send the document to the node where the copy shard of the target index shard is located, so that this node writes the document into the log file of the copy shard of the target index shard.
In one embodiment, the replication module 1406 is further configured to send the index segment to a node where the copy of the target index shard is located, such that the node where the copy of the target index shard is located adds the index segment to the copy of the target index shard.
In one embodiment, the replication module 1406 is further configured to determine a manner of replicating the index segment to the copy shard of the target index shard according to the file size of the index segment, and distribute the index segment to a node where the copy shard of the target index shard is located according to the determined manner.
In one embodiment, the replication module 1406 is further configured to send the index segment to a node where the copy shard of the target index shard is located when the file size of the index segment is less than or equal to a preset threshold; when the file size of the index segment is larger than a preset threshold value, splitting the index segment into a plurality of data blocks, and sequentially sending the plurality of data blocks obtained by splitting to a node where the copy of the target index segment is located, wherein the size of each data block is smaller than the preset threshold value.
In one embodiment, the replication module 1406 is further configured to generate a data block from a plurality of small files in the index segment, the total file size of the small files being less than or equal to a preset threshold; split a large file in the index segment into a plurality of data blocks, the file size of the large file being greater than the preset threshold; and sequentially send each data block to the node where the copy shard of the target index shard is located.
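The size-based replication described above can be sketched in Python. The sketch is an assumption-laden illustration: an index segment is modeled as a dictionary from file name to bytes, a data block as a list of `(file name, offset, bytes)` tuples, and each block is allowed to be at most `threshold` bytes; the function names are hypothetical.

```python
def split_into_blocks(files, threshold):
    """Turn an index segment (file name -> bytes) into data blocks.

    A segment no larger than the threshold is sent as a single unit.
    Otherwise small files are packed together, and each file larger
    than the threshold is cut into chunks of at most `threshold` bytes.
    """
    total = sum(len(data) for data in files.values())
    if total <= threshold:
        return [[(name, 0, data) for name, data in files.items()]]

    blocks, current, current_size = [], [], 0
    for name, data in files.items():
        if len(data) > threshold:
            continue  # large files are handled in the second pass
        if current and current_size + len(data) > threshold:
            blocks.append(current)
            current, current_size = [], 0
        current.append((name, 0, data))
        current_size += len(data)
    if current:
        blocks.append(current)

    for name, data in files.items():
        if len(data) > threshold:
            for offset in range(0, len(data), threshold):
                blocks.append([(name, offset, data[offset:offset + threshold])])
    return blocks


def reassemble(blocks):
    """Rebuild the files on the copy-shard node; blocks are assumed to
    arrive in the order they were sent, so chunks are simply appended."""
    out = {}
    for block in blocks:
        for name, _offset, chunk in block:
            out.setdefault(name, bytearray()).extend(chunk)
    return {name: bytes(buf) for name, buf in out.items()}
```

Packing small files into a shared block reduces the number of transfers, while chunking large files bounds the size of any single transfer; both follow from the threshold rule in the text.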
In one embodiment, index segment generation module 1404 is further configured to read, from the segment cache, the retrieved data associated with each of the plurality of documents; generating an index segment according to the search data, wherein the index segment comprises a group of files, and the group of files is used for inquiring each document in a plurality of documents; refreshing the index segment to a file system cache of a main partition of the target index partition; when the persistence instruction is triggered, at least one index segment in the file system cache is persisted into the master partition of the target index partition.
In one embodiment, the index processing device 1400 further comprises:
an index view file updating module for updating an index view file of a main tile with respect to the target index tile, the updated index view file being for indicating all index segments persisted into the main tile of the target index tile;
the index view file copying module is used for sending the updated index view file to the node where the copy shard of the target index shard is located, so that when the updated index view file contains the segment identifier of the copied index segment, the copied index segment is added into the copy shard of the target index shard, and the received updated index view file is used as the index view file of the copy shard of the target index shard.
In one embodiment, the index processing device 1400 further comprises:
a merging module, configured to, each time a plurality of index segments persisted to the main shard of the target index shard are merged to obtain a merged index segment, add the merged index segment to the main shard of the target index shard, and delete the plurality of index segments from the main shard of the target index shard;
a merged index segment copying module, configured to copy the merged index segment to the copy shard of the target index shard.
In one embodiment, the index processing device 1400 further comprises:
the index view file updating module is used for updating the index view file of the main partition of the target index partition according to the index segment in the main partition of the updated target index partition; the updated index view file is used for indicating valid index segments in the main shard that are persisted to the target index shard;
an index view file copying module, configured to send the updated index view file to the node where the copy shard of the target index shard is located, so that this node compares the index view file before the update with the updated index view file, removes the plurality of index segments from the copy shard of the target index shard, and adds the received merged index segment to the copy shard of the target index shard.
The modules in the index processing device 1400 described above may be implemented in whole or in part by software, hardware, or a combination thereof. The above modules may be embedded, in hardware form, in a processor of the computer device or be independent of it, or may be stored, in software form, in a memory of the computer device, so that the processor can call and execute the operations corresponding to the above modules.
In one embodiment, an index processing device is provided, where an index includes at least one index shard, each index shard includes a main shard and a copy shard, the main shard and the copy shard corresponding to the same index shard are distributed on different nodes, each index shard includes at least one index segment, and each index segment is a minimum retrievable data unit of the index; the device comprises:
the first receiving module is used for receiving the index segment copied by the node where the main segment of the target index segment is located, and the index segment is persisted to the main segment of the target index segment;
the second receiving module is used for receiving an index view file copied by the node where the main partition is located, and the index view file is used for indicating all index segments in the main partition which are persistent to the target index partition;
and the persistence module is used for adding the copied index segment into the copy partition of the target index partition when the received index view file contains the segment identification of the copied index segment.
In one embodiment, the first receiving module is further configured to receive a merged index segment copied by the node where the main shard of the target index shard is located, the merged index segment being obtained by that node merging a plurality of index segments persisted to the main shard of the target index shard.
In one embodiment, the second receiving module is configured to receive an updated index view file copied by a node where the main shard is located; the updated index view file is updated after the node where the main segment is located adds the combined index segment into the main segment of the target index segment and deletes a plurality of index segments from the main segment of the target index segment; the persistence module is used for comparing the index view file before updating with the updated index view file; when the updated index view file does not contain the segment identifiers of the plurality of index segments and contains the segment identifiers of the combined index segments, the plurality of index segments are removed from the copy slices of the target index slices, the received combined index segments are added to the copy slices of the target index slices, and the received updated index view file is used as the index view file of the copy slices of the target index slices.
In one embodiment, the device further comprises a log recording module, configured to receive a document sent by the node where the main shard of the target index shard is located, and write the document into the log file of the copy shard of the target index shard.
For specific implementations of the individual modules of the device, refer to the relevant description of the corresponding method above.
With the above index processing device, each time the node where the main shard of the target index shard is located receives a document to be indexed, that node processes the document to obtain retrieval data associated with the document and writes the retrieval data into the segment cache of the main shard of the target index shard. When the segment cache meets the index segment generation condition, an index segment is generated from all the retrieval data held in the segment cache, added to the main shard of the target index shard, and then copied directly to the copy shard of the target index shard. Therefore, a newly added document does not need to be processed separately on the main-shard side and on the copy-shard side, which improves the reliability and data retrieval efficiency of the whole distributed cluster, saves its CPU resources, and improves its overall processing performance.
In one embodiment, a computer device is provided, which may be a node in a distributed cluster as shown in fig. 4, and an internal structure diagram of which may be shown in fig. 15. The computer device includes a processor, a memory, an Input/Output interface (I/O) and a communication interface. The processor, the memory and the input/output interface are connected through a system bus, and the communication interface is connected to the system bus through the input/output interface. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, computer programs, and a database. The internal memory provides an environment for the operation of the operating system and computer programs in the non-volatile storage media. The database of the computer device is for storing index data. The input/output interface of the computer device is used to exchange information between the processor and the external device. The communication interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by a processor to implement an index processing method.
It will be appreciated by those skilled in the art that the structure shown in fig. 15 is merely a block diagram of a portion of the structure associated with the present application and is not limiting of the computer device to which the present application is applied, and that a particular computer device may include more or fewer components than shown, or may combine certain components, or have a different arrangement of components.
In one embodiment, a computer device is provided that includes a memory having a computer program stored therein and a processor that when executing the computer program performs the steps of the index processing method of any one or more of the embodiments described above.
In one embodiment, a computer readable storage medium is provided having a computer program stored thereon, which when executed by a processor, implements the steps of the index processing method in any one or more of the embodiments described above.
In one embodiment, a computer program product is provided comprising a computer program which, when executed by a processor, implements the steps of the index processing method of any one or more of the embodiments described above.
It should be noted that, the user information (including, but not limited to, user equipment information, user personal information, etc.) and the data (including, but not limited to, data for analysis, stored data, presented data, etc.) referred to in the present application are information and data authorized by the user or sufficiently authorized by each party, and the collection, use and processing of the related data are required to comply with the related laws and regulations and standards of the related countries and regions.
Those skilled in the art will appreciate that all or part of the methods of the above embodiments may be implemented by a computer program instructing the relevant hardware. The computer program may be stored on a non-volatile computer readable storage medium and, when executed, may comprise the steps of the method embodiments described above. Any reference to memory, database, or other medium used in the various embodiments provided herein may include at least one of non-volatile and volatile memory. The non-volatile memory may include Read-Only Memory (ROM), magnetic tape, floppy disk, flash memory, optical memory, high-density embedded non-volatile memory, resistive random access memory (ReRAM), magnetoresistive random access memory (Magnetoresistive Random Access Memory, MRAM), ferroelectric random access memory (Ferroelectric Random Access Memory, FRAM), phase change memory (Phase Change Memory, PCM), graphene memory, and the like. The volatile memory may include random access memory (Random Access Memory, RAM), external cache memory, and the like. By way of illustration and not limitation, RAM may take a variety of forms, such as static random access memory (Static Random Access Memory, SRAM) or dynamic random access memory (Dynamic Random Access Memory, DRAM). The databases referred to in the various embodiments provided herein may include at least one of relational databases and non-relational databases. The non-relational database may include, but is not limited to, a blockchain-based distributed database. The processors referred to in the embodiments provided herein may be, without limitation, general purpose processors, central processing units, graphics processors, digital signal processors, programmable logic units, or data processing logic units based on quantum computing.
The technical features of the above embodiments may be combined arbitrarily. For brevity, not all possible combinations of these technical features are described; however, as long as a combination of technical features involves no contradiction, it should be considered within the scope of this description.
The above embodiments represent only a few implementations of the present application. Their description is relatively specific and detailed, but it is not to be construed as limiting the scope of the present application. It should be noted that those skilled in the art can make various modifications and improvements without departing from the concept of the present application, all of which fall within the scope of protection of the present application. Accordingly, the scope of protection of the present application shall be subject to the appended claims.

Claims (18)

1. An index processing method, characterized in that the index comprises at least one index shard, each index shard comprises a primary shard and a replica shard, the primary shard and the replica shard corresponding to the same index shard are distributed on different nodes, each index shard comprises at least one index segment, and each index segment is the smallest retrievable data unit of the index; the method comprises:
each time the node where the primary shard of a target index shard is located receives a document to be added to the index, processing the document to obtain retrieval data associated with the document, and writing the retrieval data into a segment cache of the primary shard of the target index shard;
when the segment cache meets an index segment generation condition, reading the retrieval data from the segment cache to generate an index segment, and adding the index segment to the primary shard of the target index shard; and
copying the index segment to the replica shard of the target index shard.
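As an illustration only, the flow of claim 1 can be sketched in a few lines of Python. The class names, the tokenisation used as "processing", and the cache-size generation condition (`max_cached_docs`) are all assumptions made for this sketch and are not taken from the application.

```python
from dataclasses import dataclass, field


@dataclass
class Shard:
    """Hypothetical in-memory stand-in for a primary or replica shard."""
    segments: list = field(default_factory=list)


@dataclass
class PrimaryShard(Shard):
    replicas: list = field(default_factory=list)
    segment_cache: list = field(default_factory=list)
    max_cached_docs: int = 2  # assumed generation condition: cache size

    def add_document(self, doc: str) -> None:
        # "Processing" is modelled as simple tokenisation (an assumption).
        retrieval_data = (doc, sorted(set(doc.lower().split())))
        self.segment_cache.append(retrieval_data)
        if len(self.segment_cache) >= self.max_cached_docs:
            self._generate_segment()

    def _generate_segment(self) -> None:
        # Read the cached retrieval data out into one immutable segment.
        segment = tuple(self.segment_cache)
        self.segment_cache = []
        self.segments.append(segment)
        # Copy the finished segment; the replica does not re-process documents.
        for replica in self.replicas:
            replica.segments.append(segment)


replica = Shard()
primary = PrimaryShard(replicas=[replica])
for text in ["hello world", "index segment", "third doc"]:
    primary.add_document(text)
```

In this sketch the first two documents produce one segment that appears on both shards, while the third document stays in the segment cache until the generation condition is met again.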
2. The method of claim 1, wherein after writing the retrieval data into the segment cache of the primary shard, the method further comprises:
writing the document into a log file of the primary shard of the target index shard; and
sending the document to the node where the replica shard of the target index shard is located, so that the node where the replica shard is located writes the document into a log file of the replica shard of the target index shard.
3. The method of claim 1, wherein copying the index segment to the replica shard of the target index shard comprises:
sending the index segment to the node where the replica shard of the target index shard is located, so that the node where the replica shard is located adds the index segment to the replica shard of the target index shard.
4. The method of claim 1, wherein copying the index segment to the replica shard of the target index shard comprises:
determining, according to the file size of the index segment, a manner of copying the index segment to the replica shard of the target index shard, and distributing the index segment to the node where the replica shard of the target index shard is located in the determined manner.
5. The method of claim 4, wherein determining, according to the file size of the index segment, the manner of copying the index segment to the replica shard of the target index shard, and distributing the index segment to the node where the replica shard of the target index shard is located in the determined manner, comprises:
when the file size of the index segment is less than or equal to a preset threshold, sending the index segment to the node where the replica shard of the target index shard is located; and
when the file size of the index segment is greater than the preset threshold, splitting the index segment into a plurality of data blocks, and sequentially sending the data blocks obtained by the splitting to the node where the replica shard of the target index shard is located, wherein the size of each data block is smaller than the preset threshold.
6. The method of claim 5, wherein splitting the index segment into a plurality of data blocks, and sequentially sending the data blocks obtained by the splitting to the node where the replica shard of the target index shard is located, comprises:
generating one data block from a plurality of small files in the index segment, wherein the total file size of the plurality of small files is less than or equal to the preset threshold;
splitting a large file in the index segment into a plurality of data blocks, wherein the file size of the large file is greater than the preset threshold; and
sequentially sending each data block to the node where the replica shard of the target index shard is located.
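The size-based distribution of claims 4 to 6 can be sketched as a block-planning function. The function name `plan_blocks`, the file names, and the byte sizes are hypothetical. Note that claim 5 asks for blocks smaller than the threshold while claim 6 allows a packed group of small files to equal it; this sketch assumes "at most the threshold" for both cases.

```python
def plan_blocks(files: dict[str, bytes], threshold: int) -> list[list[tuple[str, bytes]]]:
    """Group a segment's files into data blocks of at most `threshold` bytes."""
    blocks = []
    current, current_size = [], 0
    for name, data in files.items():
        if len(data) > threshold:
            # Large file: cut it into consecutive slices, one block per slice.
            for offset in range(0, len(data), threshold):
                blocks.append([(name, data[offset:offset + threshold])])
            continue
        if current and current_size + len(data) > threshold:
            # The pending group of small files is full; close it.
            blocks.append(current)
            current, current_size = [], 0
        current.append((name, data))
        current_size += len(data)
    if current:
        blocks.append(current)
    return blocks


# Hypothetical segment: two small files and one large file.
segment_files = {
    "meta": b"m" * 3,
    "terms": b"t" * 4,
    "postings": b"p" * 25,
}
blocks = plan_blocks(segment_files, threshold=10)
block_sizes = [sum(len(data) for _, data in block) for block in blocks]
```

A segment whose files together fit within the threshold comes back as a single block, which corresponds to sending the segment whole; the blocks would then be sent to the replica node one after another.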
7. The method of claim 1, wherein reading the retrieval data from the segment cache to generate an index segment, and adding the index segment to the primary shard of the target index shard, comprises:
reading, from the segment cache, the retrieval data respectively associated with a plurality of documents;
generating an index segment according to the retrieval data, wherein the index segment comprises a group of files used for querying each of the plurality of documents;
refreshing the index segment into a file system cache of the primary shard of the target index shard; and
when a persistence instruction is triggered, persisting at least one index segment in the file system cache into the primary shard of the target index shard.
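The two-stage write in claim 7 (refresh into the file system cache, then persist on instruction) can be illustrated with ordinary file I/O. This is a minimal sketch under assumptions: `SegmentStore`, its method names, and the payloads are hypothetical, an ordinary write without `fsync` stands in for "refreshing", and `os.fsync` stands in for the persistence instruction.

```python
import os
import tempfile


class SegmentStore:
    """Hypothetical store separating 'refreshed' from 'persisted' segments."""

    def __init__(self, directory: str) -> None:
        self.directory = directory
        self.unsynced: list[str] = []   # written, possibly only in the OS cache
        self.persisted: list[str] = []  # flushed to stable storage

    def refresh(self, name: str, payload: bytes) -> str:
        path = os.path.join(self.directory, name)
        with open(path, "wb") as f:
            f.write(payload)  # no fsync here: data may sit in the file system cache
        self.unsynced.append(path)
        return path

    def persist(self) -> list[str]:
        flushed = []
        for path in self.unsynced:
            with open(path, "r+b") as f:
                os.fsync(f.fileno())  # force the cached file contents to disk
            flushed.append(path)
        self.persisted.extend(flushed)
        self.unsynced = []
        return flushed


with tempfile.TemporaryDirectory() as tmp:
    store = SegmentStore(tmp)
    store.refresh("seg_0", b"retrieval data for docs 1-3")
    store.refresh("seg_1", b"retrieval data for docs 4-6")
    pending_before = len(store.unsynced)
    flushed = store.persist()
```

Separating the two steps lets several refreshed segments be made durable by a single persistence instruction, which matches the "at least one index segment" wording of the claim.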
8. The method of claim 1, wherein after adding the index segment to the primary shard of the target index shard, the method further comprises:
updating an index view file of the primary shard of the target index shard, the updated index view file indicating all index segments persisted into the primary shard of the target index shard; and
sending the updated index view file to the node where the replica shard of the target index shard is located, so that, when the updated index view file contains the segment identifier of the copied index segment, the node adds the copied index segment to the replica shard of the target index shard and takes the received updated index view file as the index view file of the replica shard of the target index shard.
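One possible shape for the index view file of claim 8 is sketched below. The JSON layout, the `generation` counter, and both function names are assumptions for illustration; the application does not specify a file format.

```python
import json


def build_index_view(generation: int, persisted_segment_ids: list[str]) -> str:
    """Serialise a hypothetical index view listing every persisted segment."""
    return json.dumps(
        {"generation": generation, "segments": sorted(persisted_segment_ids)}
    )


def segments_to_add(view_json: str, copied_segment_ids: set[str]) -> list[str]:
    """Return the copied segments that the view lists, i.e. those the replica may add."""
    view = json.loads(view_json)
    return [seg_id for seg_id in view["segments"] if seg_id in copied_segment_ids]


# The primary has persisted seg_0 and seg_1; the replica has been sent seg_1 and seg_2.
view_file = build_index_view(generation=7, persisted_segment_ids=["seg_1", "seg_0"])
addable = segments_to_add(view_file, copied_segment_ids={"seg_1", "seg_2"})
```

Here `seg_2` is held back because the view does not list it yet, so the replica never exposes a segment that the primary has not recorded as persisted.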
9. The method of claim 1, wherein the method further comprises:
each time a plurality of index segments persisted into the primary shard of the target index shard are merged to obtain a merged index segment, adding the merged index segment to the primary shard of the target index shard;
deleting the plurality of index segments from the primary shard of the target index shard; and
copying the merged index segment to the replica shard of the target index shard.
10. The method of claim 9, wherein the method further comprises:
updating the index view file of the primary shard of the target index shard according to the updated index segments in the primary shard of the target index shard, the updated index view file indicating the valid index segments persisted into the primary shard of the target index shard; and
sending the updated index view file to the node where the replica shard of the target index shard is located, so that the node compares the index view file before the update with the updated index view file, removes the plurality of index segments from the replica shard of the target index shard, and adds the received merged index segment to the replica shard of the target index shard.
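The primary-side merge of claims 9 and 10 can be sketched as follows. Segments are modelled as lists of document identifiers keyed by segment identifier, and the view as a sorted list of identifiers; these representations and the name `merge_segments` are assumptions.

```python
def merge_segments(
    segments: dict[str, list[str]], to_merge: list[str], merged_id: str
) -> tuple[dict[str, list[str]], list[str]]:
    """Merge the named segments and return the new segment map and index view."""
    # Build the merged segment from the contents of the source segments.
    merged_docs = [doc for seg_id in to_merge for doc in segments[seg_id]]
    # Drop the source segments and add the merged one.
    updated = {seg_id: docs for seg_id, docs in segments.items() if seg_id not in to_merge}
    updated[merged_id] = merged_docs
    # The updated view lists only the segments that remain valid.
    view = sorted(updated)
    return updated, view


primary_segments = {"seg_0": ["doc_a"], "seg_1": ["doc_b"], "seg_2": ["doc_c"]}
primary_segments, index_view = merge_segments(
    primary_segments, to_merge=["seg_0", "seg_1"], merged_id="seg_3"
)
```

After the call only `seg_2` and the merged `seg_3` remain, and the primary would copy `seg_3` together with the updated view to the replica node rather than have the replica repeat the merge.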
11. An index processing method, characterized in that the index comprises at least one index shard, each index shard comprises a primary shard and a replica shard, the primary shard and the replica shard corresponding to the same index shard are distributed on different nodes, each index shard comprises at least one index segment, and each index segment is the smallest retrievable data unit of the index; the method comprises:
receiving an index segment copied by the node where the primary shard of a target index shard is located, the index segment having been persisted into the primary shard of the target index shard;
receiving an index view file copied by the node where the primary shard is located, the index view file indicating all index segments persisted into the primary shard of the target index shard; and
when the copied index view file contains the segment identifier of the copied index segment, adding the copied index segment to the replica shard of the target index shard.
12. The method of claim 11, wherein receiving the index segment copied by the node where the primary shard of the target index shard is located comprises:
receiving a merged index segment copied by the node where the primary shard of the target index shard is located, the merged index segment being obtained by that node merging a plurality of index segments persisted into the primary shard of the target index shard.
13. The method of claim 12, wherein receiving the index view file copied by the node where the primary shard is located comprises:
receiving an updated index view file copied by the node where the primary shard is located, the updated index view file having been updated after the node where the primary shard is located added the merged index segment to the primary shard of the target index shard and deleted the plurality of index segments from the primary shard of the target index shard;
and wherein adding the copied index segment to the replica shard of the target index shard when the copied index view file contains the segment identifier of the copied index segment comprises:
comparing the index view file before the update with the updated index view file; and
when the updated index view file does not contain the segment identifiers of the plurality of index segments and contains the segment identifier of the merged index segment, removing the plurality of index segments from the replica shard of the target index shard, adding the received merged index segment to the replica shard of the target index shard, and taking the received updated index view file as the index view file of the replica shard of the target index shard.
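The replica-side handling in claims 11 to 13 can be sketched as a small state machine. The class `ReplicaShard`, its attribute names, and the use of Python sets for the index view are assumptions for illustration.

```python
class ReplicaShard:
    """Hypothetical replica that adds copied segments only once a view lists them."""

    def __init__(self) -> None:
        self.segments: dict[str, bytes] = {}  # segments added to the replica shard
        self.received: dict[str, bytes] = {}  # copied segments not yet listed by a view
        self.view: set[str] = set()           # current index view of the replica

    def receive_segment(self, seg_id: str, payload: bytes) -> None:
        self.received[seg_id] = payload

    def apply_view(self, new_view: set[str]) -> None:
        new_view = set(new_view)
        # Compare old and new views: segments no longer listed are removed.
        for seg_id in self.view - new_view:
            self.segments.pop(seg_id, None)
        # Copied segments that the new view lists are added.
        for seg_id in sorted(new_view & set(self.received)):
            self.segments[seg_id] = self.received.pop(seg_id)
        # The received view becomes the replica's own view.
        self.view = new_view


replica_shard = ReplicaShard()
replica_shard.receive_segment("seg_0", b"a")
replica_shard.receive_segment("seg_1", b"b")
replica_shard.apply_view({"seg_0", "seg_1"})
after_first_view = sorted(replica_shard.segments)

# The primary merges seg_0 and seg_1 into seg_2 and copies segment and view.
replica_shard.receive_segment("seg_2", b"ab")
replica_shard.apply_view({"seg_2"})
after_merge_view = sorted(replica_shard.segments)
```

Keeping copied segments in a pending area until a view lists them means the order in which the segment and the view arrive does not affect what the replica exposes.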
14. The method of claim 11, wherein the method further comprises:
receiving a copy of a document sent by the node where the primary shard of the target index shard is located; and
writing the document into a log file of the replica shard of the target index shard.
15. An index processing apparatus, characterized in that the index comprises at least one index shard, each index shard comprises a primary shard and a replica shard, the primary shard and the replica shard corresponding to the same index shard are distributed on different nodes, each index shard comprises at least one index segment, and each index segment is the smallest retrievable data unit of the index; the apparatus comprises:
a caching module, configured to, each time the node where the primary shard of a target index shard is located receives a document to be added to the index, process the document to obtain retrieval data associated with the document, and write the retrieval data into a segment cache of the primary shard of the target index shard;
an index segment generation module, configured to, when the segment cache meets an index segment generation condition, read the retrieval data from the segment cache to generate an index segment, and add the index segment to the primary shard of the target index shard; and
a copying module, configured to copy the index segment to the replica shard of the target index shard.
16. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor, when executing the computer program, implements the steps of the method of any one of claims 1 to 14.
17. A computer-readable storage medium on which a computer program is stored, characterized in that the computer program, when executed by a processor, implements the steps of the method of any one of claims 1 to 14.
18. A computer program product comprising a computer program, characterized in that the computer program, when executed by a processor, implements the steps of the method of any one of claims 1 to 14.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211172148.4A CN117807174A (en) 2022-09-26 2022-09-26 Index processing method, apparatus, computer device, medium, and program product

Publications (1)

Publication Number Publication Date
CN117807174A true CN117807174A (en) 2024-04-02

Family

ID=90420470

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211172148.4A Pending CN117807174A (en) 2022-09-26 2022-09-26 Index processing method, apparatus, computer device, medium, and program product

Country Status (1)

Country Link
CN (1) CN117807174A (en)

Similar Documents

Publication Publication Date Title
CN107169083B (en) Mass vehicle data storage and retrieval method and device for public security card port and electronic equipment
US8683112B2 (en) Asynchronous distributed object uploading for replicated content addressable storage clusters
US10013440B1 (en) Incremental out-of-place updates for index structures
US9547706B2 (en) Using colocation hints to facilitate accessing a distributed data storage system
CN111522880B (en) Method for improving data read-write performance based on mysql database cluster
US10061834B1 (en) Incremental out-of-place updates for datasets in data stores
US11422721B2 (en) Data storage scheme switching in a distributed data storage system
Zhai et al. Hadoop perfect file: A fast and memory-efficient metadata access archive file to face small files problem in hdfs
CN114610708A (en) Vector data processing method and device, electronic equipment and storage medium
WO2017156855A1 (en) Database systems with re-ordered replicas and methods of accessing and backing up databases
US20180011897A1 (en) Data processing method having structure of cache index specified to transaction in mobile environment dbms
CN112416879B (en) NTFS file system-based block-level data deduplication method
Kvet et al. Relational pre-indexing layer supervised by the DB_index_consolidator Background Process
CN116578609A (en) Distributed searching method and device based on inverted index
CN115114294A (en) Self-adaption method and device of database storage mode and computer equipment
CN117321583A (en) Storage engine for hybrid data processing
CN117807174A (en) Index processing method, apparatus, computer device, medium, and program product
CN114416676A (en) Data processing method, device, equipment and storage medium
CN114297196A (en) Metadata storage method and device, electronic equipment and storage medium
CN113821573A (en) Mass data rapid retrieval service construction method, system, terminal and storage medium
US20230376461A1 (en) Supporting multiple fingerprint formats for data file segment
CN116719821B (en) Concurrent data insertion elastic search weight removing method, device and storage medium
CN117539690B (en) Method, device, equipment, medium and product for merging and recovering multi-disk data
US20230376451A1 (en) Client support of multiple fingerprint formats for data file segments
CN117493284A (en) File storage method, file reading method, file storage and reading system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination