CN115658841A - Data management method and device, computing equipment and storage medium - Google Patents

Data management method and device, computing equipment and storage medium

Info

Publication number
CN115658841A
CN115658841A (application CN202211427149.9A)
Authority
CN
China
Prior art keywords
document
index
fragment
documents
query request
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211427149.9A
Other languages
Chinese (zh)
Inventor
王宁明
王淑华
姜维维
陈艳
王静
修鹏
赵梓旭
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Institute of Environmental Features
Original Assignee
Beijing Institute of Environmental Features
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Institute of Environmental Features filed Critical Beijing Institute of Environmental Features
Priority to CN202211427149.9A
Publication of CN115658841A
Legal status: Pending


Abstract

The present invention relates to the field of database technologies, and in particular to a data management method and apparatus, a computing device and a storage medium. The method comprises the following steps: acquiring a plurality of documents to be stored; performing word-segmentation processing on each document to generate an index list in a corresponding index, and storing each document in a corresponding fragment of the index; when a query request is received, each node in the index issues the query request to each fragment managed by the node; each fragment queries the documents stored in it according to the index list and the query request to generate the fragment's result set; and summarizing and sorting all the generated result sets to obtain the query result. The scheme improves the query speed for mass data and can meet the management and application requirements of hundreds of millions of records.

Description

Data management method and device, computing equipment and storage medium
Technical Field
The embodiment of the invention relates to the technical field of databases, in particular to a data management method, a data management device, a computing device and a storage medium.
Background
In a simulation test, hundreds of millions of data records can be generated in a single test run. The data are produced by the simulation software, need to be stored in a database, and must be quickly queried and read. However, because of the huge data volume, traditional database management methods cannot meet these requirements.
Therefore, a new data management method is needed.
Disclosure of Invention
In order to solve the problem that the conventional database management method cannot meet the requirement of fast query and reading of mass data, the embodiment of the invention provides a data management method, a data management device, computing equipment and a storage medium.
In a first aspect, an embodiment of the present invention provides a data management method, including:
acquiring a plurality of documents to be stored;
performing word segmentation processing on each document to generate an index list in a corresponding index, and storing each document in a corresponding fragment of the index;
when receiving a query request, each node in the index issues the query request to each fragment managed by the node;
each fragment queries each document stored in the fragment according to the index list and the query request to generate a result set of the fragment;
and summarizing and sorting all the generated result sets to obtain query results.
Preferably, the index list includes a plurality of word segments, a document ID corresponding to each word segment, position information of each word segment in the corresponding document, and a weight ratio of each word segment in the corresponding document.
Preferably, the performing word-segmentation processing on each document to generate an index list in a corresponding index, and storing each document in a corresponding fragment of the index, includes:
writing each document into a memory in sequence to generate a translog file that grows gradually in the corresponding index;
as documents are continuously written, performing word-segmentation processing at intervals on the documents written during each interval, establishing index relationships in the index list according to the word-segmentation results, and grouping the documents written during that interval into a segment;
and after a set time, executing a disk-write operation to write each segment of the translog file into the corresponding fragment of the index.
Preferably, the method further comprises the following steps:
and recording all operations between every two adjacent disk-write operations, so that progress can be restored from the record after a failure.
Preferably, after the executing of a disk-write operation to write each segment of the translog file into the corresponding fragment of the index, the method further includes: merging the segments in the corresponding fragment.
Preferably, each fragment corresponds to a delete file;
when a deletion request is received, the document is marked as deleted in the delete file, and the document is only removed from the corresponding segment when a plurality of segments are merged.
Preferably, the querying, by each fragment, of the documents stored in the fragment according to the index list and the query request to generate a result set of the fragment includes:
for each fragment, executing:
determining a target word segment in the query request according to the query request;
determining the document ID corresponding to the target word segment according to the target word segment and the index list;
querying each document stored in the fragment according to the document ID corresponding to the target word segment, determining the target documents corresponding to the target word segment in the fragment, and sorting the target documents found in the fragment according to the weight ratio of the target word segment in each target document;
and generating the result set of the fragment from the target documents corresponding to a predetermined range of return sequence numbers.
In a second aspect, an embodiment of the present invention further provides a data management apparatus, including:
an acquisition unit, configured to acquire a plurality of documents to be stored;
the storage unit is used for performing word segmentation processing on each document to generate an index list in a corresponding index and storing each document into a corresponding fragment of the index;
the issuing unit is used for issuing the query request to each fragment managed by the node by each node in the index when the query request is received;
the query unit is used for enabling each fragment to query each document stored in the fragment according to the index list and the query request so as to generate a result set of the fragment;
and a summarizing unit, configured to summarize and sort all the generated result sets to obtain the query result.
In a third aspect, an embodiment of the present invention further provides a computing device, including a memory and a processor, where the memory stores a computer program, and the processor, when executing the computer program, implements the method described in any embodiment of this specification.
In a fourth aspect, an embodiment of the present invention further provides a computer-readable storage medium, on which a computer program is stored, and when the computer program is executed in a computer, the computer program causes the computer to execute the method described in any embodiment of the present specification.
The embodiments of the invention provide a data management method and apparatus, a computing device, and a storage medium. First, word-segmentation processing is performed on each document to be stored to generate an index list in a corresponding index, and each document is stored in a corresponding fragment of the index; then, when a query request is received, each node in the index issues the query request to each fragment managed by the node; next, each fragment queries the documents stored in it according to the index list and the query request to generate the fragment's result set; finally, all the generated result sets are summarized and sorted to obtain the final query result. The query speed for mass data is thereby increased, and the management and application requirements of hundreds of millions of records can be met.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to these drawings without creative efforts.
Fig. 1 is a flowchart of a data management method according to an embodiment of the present invention;
fig. 2 is a hardware architecture diagram of an electronic device according to an embodiment of the present invention;
fig. 3 is a structural diagram of a data management apparatus according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer and more complete, the technical solutions in the embodiments of the present invention will be described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention, and based on the embodiments of the present invention, all other embodiments obtained by a person of ordinary skill in the art without creative efforts belong to the scope of the present invention.
As mentioned above, in a simulation test, hundreds of millions of data records are generated in a single test run; the data are produced by simulation software, need to be stored in a database, and must be quickly queried and read. However, because of the huge data volume, traditional database management methods cannot meet these requirements.
In order to solve the above technical problem, the inventor considered subjecting each document to be stored to word-segmentation processing and distributing the documents to different fragments after the word-segmentation processing, so that when a query request is received each fragment only needs to query the documents stored in itself, thereby improving the query speed.
Specific implementations of the above concepts are described below.
Referring to fig. 1, an embodiment of the present invention provides a data management method, including:
step 100: acquiring a plurality of documents to be stored;
step 102: performing word segmentation processing on each document to generate an index list in a corresponding index, and storing each document into a corresponding fragment of the index;
step 104: when receiving a query request, each node in the index issues the query request to each fragment managed by the node;
step 106: each fragment queries each document stored in the fragment according to the index list and the query request to generate a result set of the fragment;
step 108: and summarizing and sorting all the generated result sets to obtain query results.
In the embodiment of the invention, firstly, word segmentation processing is carried out on each document to be stored so as to generate an index list in a corresponding index, and each document is stored in a corresponding fragment of the index; then, when receiving the query request, each node in the index issues the query request to each segment managed by the node; then, each fragment queries each document stored in the fragment according to the index list and the query request to generate a result set of the fragment; finally, all the generated result sets are collected and sorted, and a final query result can be obtained, so that the query speed of mass data is increased, and the management and application requirements of hundreds of millions of data can be met.
The manner in which the various steps shown in fig. 1 are performed is described below.
For step 100 and step 102:
In this embodiment, the whole scheme is implemented in the Java language and can run on a PC. Document storage and query are implemented with an Elasticsearch search server.
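For illustration only (this example is an editorial sketch and not part of the original disclosure), document storage and query with the Elasticsearch Java High Level REST Client (6.x-style API) might look roughly as follows; the index name test_index, the field name content and the address localhost:9200 are assumptions:

    import org.apache.http.HttpHost;
    import org.elasticsearch.action.index.IndexRequest;
    import org.elasticsearch.action.search.SearchRequest;
    import org.elasticsearch.action.search.SearchResponse;
    import org.elasticsearch.client.RestClient;
    import org.elasticsearch.client.RestHighLevelClient;
    import org.elasticsearch.common.xcontent.XContentType;
    import org.elasticsearch.index.query.QueryBuilders;
    import org.elasticsearch.search.builder.SearchSourceBuilder;

    public class EsDemo {
        public static void main(String[] args) throws Exception {
            // Connect to a local Elasticsearch node (host and port are assumptions).
            RestHighLevelClient client = new RestHighLevelClient(
                    RestClient.builder(new HttpHost("localhost", 9200, "http")));

            // Store one document; the analyzer tokenizes the "content" field and
            // updates the inverted index list of the "test_index" index.
            IndexRequest indexRequest = new IndexRequest("test_index", "_doc", "1")
                    .source(XContentType.JSON, "content", "simulation test result data");
            client.index(indexRequest);

            // Query: the request is fanned out to every fragment of the index.
            SearchSourceBuilder source = new SearchSourceBuilder()
                    .query(QueryBuilders.matchQuery("content", "simulation"))
                    .from(0).size(10);
            SearchResponse response = client.search(new SearchRequest("test_index").source(source));
            System.out.println("hits: " + response.getHits().getTotalHits());

            client.close();
        }
    }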
In some embodiments, the index list includes a number of word segments, a document ID corresponding to each word segment, position information of each word segment in the corresponding document, and a weight ratio of each word segment in the corresponding document.
In this embodiment, each document corresponds to one ID. The inverted index performs word segmentation on each document according to a specified grammar and then maintains an index list that enumerates the word segments appearing in all documents, the document IDs corresponding to each word segment, the positions at which each word segment appears in the documents, and the weight ratio of each word segment in the corresponding document, i.e. its frequency of occurrence. The index list is the concrete storage form that realizes the word-segment/document matrix.
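As a concrete illustration of such an index list (an editorial sketch under assumed class and field names, not part of the original disclosure), the postings kept for each word segment could be modelled in Java as follows:

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    /** One posting: where a word segment occurs in one document and how heavily it weighs there. */
    class Posting {
        final String docId;                                  // ID of the document containing the word segment
        final List<Integer> positions = new ArrayList<>();   // positions of the word segment in the document
        double weight;                                       // weight ratio, e.g. occurrences / document length

        Posting(String docId) { this.docId = docId; }
    }

    /** The index list: word segment -> postings for every document that contains it. */
    class IndexList {
        private final Map<String, List<Posting>> postings = new HashMap<>();

        /** Record one occurrence of a word segment at a given position of a document. */
        void add(String term, String docId, int position, int docLength) {
            List<Posting> list = postings.computeIfAbsent(term, t -> new ArrayList<>());
            Posting p = list.stream().filter(x -> x.docId.equals(docId)).findFirst().orElse(null);
            if (p == null) { p = new Posting(docId); list.add(p); }
            p.positions.add(position);
            p.weight = (double) p.positions.size() / docLength; // frequency of occurrence
        }

        List<Posting> lookup(String term) {
            return postings.getOrDefault(term, new ArrayList<>());
        }
    }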
In some embodiments, step 102 may include steps S1-S3:
step S1, writing each document into a memory in sequence to generate a translog file that grows gradually in the corresponding index;
step S2, as documents are continuously written, performing word-segmentation processing at intervals on the documents written during each interval, establishing index relationships in the index list according to the word-segmentation results, and grouping the documents written during that interval into a segment;
and step S3, after a set time, executing a disk-write operation to write each segment of the translog file into the corresponding fragment of the index.
In this embodiment, in step S1, each document is first written in sequence into a translog file (transaction log) in memory; the translog file grows as documents are continuously written, and at this point, if a query request is issued, these newly written documents cannot yet be retrieved. In step S2, the interval is set to 1 s in this embodiment: every 1 s, the documents written during that second are subjected to word-segmentation processing, the index relationships between those documents and their word segments are established in the index list, the documents are then written into the file system cache, and a segment is generated in the file system cache. At this point the documents in the segment can be searched, but they have not yet been written to the hard disk, so they could be lost in case of downtime. In step S3, because new documents keep being written, steps S1 and S2 are executed repeatedly, new segment files are generated continuously, and the translog file grows larger and larger. Therefore, when the set time elapses or the translog file reaches a certain size, a disk-write operation is executed to write the segments in the file system cache into the corresponding fragment of the index on disk, and the translog is deleted (a new translog is generated afterwards).
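The write path described in steps S1-S3 can be sketched as a simple in-memory simulation (an editorial illustration, not Elasticsearch's actual implementation; all class and method names are assumptions):

    import java.util.ArrayList;
    import java.util.List;

    class WritePath {
        private final List<String> memoryBuffer = new ArrayList<>();         // step S1: documents written to memory
        private final List<String> translog = new ArrayList<>();             // operations recorded for recovery
        private final List<List<String>> cachedSegments = new ArrayList<>(); // segments in the file system cache
        private final List<List<String>> diskSegments = new ArrayList<>();   // segments flushed to the fragment on disk

        /** Step S1: write a document into memory and append the operation to the translog. */
        void index(String doc) {
            memoryBuffer.add(doc);
            translog.add("INDEX " + doc);
            // Not searchable yet: no segment has been created for this document.
        }

        /** Step S2: run every ~1 s; turn the buffered documents into one searchable segment. */
        void refresh() {
            if (memoryBuffer.isEmpty()) return;
            cachedSegments.add(new ArrayList<>(memoryBuffer)); // segment lives in the file system cache
            memoryBuffer.clear();                              // searchable now, but lost if the machine goes down
        }

        /** Step S3: after the set time (or when the translog is large), persist segments and clear the translog. */
        void flush() {
            diskSegments.addAll(cachedSegments);
            cachedSegments.clear();
            translog.clear(); // a new, empty translog is started
        }
    }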
It should be noted that an index has multiple nodes for managing the fragments in the index; the documents are stored in the fragments, and each fragment is assigned to a node. The nodes are of different types; among them is a coordinating node, which can determine the fragment into which each translog file is written according to the number of documents in the fragments managed by each node, so that the load of the nodes is balanced. When nodes are added or removed, the coordinating node migrates fragments between nodes so that the document data remain evenly distributed.
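The load-balanced assignment attributed to the coordinating node here could be sketched as follows (an editorial illustration of the "fewest documents first" rule described in this paragraph; note that Elasticsearch itself normally routes documents by hashing a routing value, so this sketch only mirrors the behaviour as described):

    import java.util.HashMap;
    import java.util.Map;

    class CoordinatingNode {
        /** fragment id -> number of documents currently stored in that fragment */
        private final Map<Integer, Long> docCountPerFragment = new HashMap<>();

        CoordinatingNode(int fragmentCount) {
            for (int i = 0; i < fragmentCount; i++) docCountPerFragment.put(i, 0L);
        }

        /** Choose the fragment with the fewest documents so the load stays balanced across nodes. */
        int chooseFragmentForWrite() {
            int best = -1;
            long bestCount = Long.MAX_VALUE;
            for (Map.Entry<Integer, Long> e : docCountPerFragment.entrySet()) {
                if (e.getValue() < bestCount) { best = e.getKey(); bestCount = e.getValue(); }
            }
            docCountPerFragment.put(best, bestCount + 1);
            return best;
        }
    }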
In some embodiments, further comprising:
and recording all operations between every two adjacent disk writing operations so as to restore the progress according to the record after the fault.
In this embodiment, a translog file is introduced to record all operations between every two adjacent disk-write operations, so that when the machine recovers from a failure or restarts, recovery can be performed from the translog file.
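Recovery from such a record could be sketched as follows (an editorial illustration reusing the hypothetical WritePath class from the sketch above; the operation format is an assumption; the translog passed in is the copy persisted before the failure, replayed into a fresh WritePath instance):

    import java.util.List;

    class TranslogRecovery {
        /** Replay every operation recorded since the last flush against a fresh write path. */
        static void recover(List<String> persistedTranslog, WritePath writePath) {
            for (String op : persistedTranslog) {
                if (op.startsWith("INDEX ")) {
                    writePath.index(op.substring("INDEX ".length())); // re-apply the write
                }
            }
            writePath.refresh(); // make the recovered documents searchable again
        }
    }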
In some embodiments, after a disk-write operation is performed to write each segment of the translog file into the corresponding fragment of the index, the method further includes: merging the segments in the corresponding fragment.
In this embodiment, since new segment files are continuously generated, querying a fragment means querying all segments in that fragment in turn, which greatly affects search performance; therefore, at regular intervals the segments are merged into a new, larger segment, and all of the old segments that were merged are removed.
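A merge of this kind could be sketched as follows (an editorial illustration in which deleted documents are represented by a simple set of tombstoned IDs; names are assumptions):

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Set;

    class SegmentMerger {
        /**
         * Merge several small segments of one fragment into a single larger segment,
         * dropping every document that has been marked as deleted.
         */
        static List<String> merge(List<List<String>> segments, Set<String> deletedDocIds) {
            List<String> merged = new ArrayList<>();
            for (List<String> segment : segments) {
                for (String docId : segment) {
                    if (!deletedDocIds.contains(docId)) {
                        merged.add(docId); // only live documents survive the merge
                    }
                }
            }
            segments.clear();   // the old segments are removed after merging
            return merged;      // the new, larger segment replaces them
        }
    }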
In some embodiments, each fragment corresponds to a delete file;
when a deletion request is received, the document is marked as deleted in the delete file, and the document is only removed from the corresponding segment when a plurality of segments are merged.
In this embodiment, the index list is immutable, so update and delete operations are not performed directly on the original index list. Each segment in each fragment maintains a delete file that records the deleted documents; whenever a user initiates a delete request, the document is not actually deleted and the index list is not changed; instead, the document is marked as deleted in the delete file. Deleted documents can therefore still be retrieved and are only filtered out when results are returned; the documents marked as deleted are actually removed each time segment merging is triggered.
When a document is updated, the original document is first retrieved to obtain its version number, the updated document is then written into memory as a new document, and the old document is marked as deleted at the same time.
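The delete and update behaviour described in the two paragraphs above might be sketched like this (an editorial illustration; the versioned-key scheme and all names are assumptions):

    import java.util.HashMap;
    import java.util.HashSet;
    import java.util.Map;
    import java.util.Set;

    class FragmentStore {
        private final Map<String, String> docsById = new HashMap<>();  // versioned key -> content
        private final Map<String, Integer> versions = new HashMap<>(); // docId -> current version number
        private final Set<String> deleteFile = new HashSet<>();        // versioned keys marked as deleted

        /** Index a new version of a document (used for both first writes and updates). */
        void index(String docId, String content) {
            int version = versions.getOrDefault(docId, 0) + 1;
            versions.put(docId, version);
            docsById.put(docId + "#v" + version, content);
        }

        /** Delete: the index list is untouched; the document is only marked in the delete file. */
        void delete(String docId) {
            int version = versions.getOrDefault(docId, 0);
            deleteFile.add(docId + "#v" + version);
        }

        /** Update: mark the old version as deleted and write the new content as a new document. */
        void update(String docId, String newContent) {
            delete(docId);
            index(docId, newContent);
        }

        /** Reads filter out tombstoned documents until a merge physically removes them. */
        boolean isVisible(String docId, int version) {
            return !deleteFile.contains(docId + "#v" + version);
        }
    }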
With respect to step 104:
when a node receives a query request, the node becomes the coordinator node in step 104. The coordinating node distributes the query request to each of the other nodes, so that each node issues the query request to each of the fragments managed by the node.
For step 106:
in some embodiments, step 106 may comprise:
for each fragment, executing:
determining a target word segment in the query request according to the query request;
determining the document ID corresponding to the target word segment according to the target word segment and the index list;
querying each document stored in the fragment according to the document ID corresponding to the target word segment, determining the target documents corresponding to the target word segment in the fragment, and sorting the target documents found in the fragment according to the weight ratio of the target word segment in each target document;
and generating the result set of the fragment from the target documents corresponding to a predetermined range of return sequence numbers.
In this embodiment, each fragment is an instance of Lucene and is itself a complete search engine. Therefore, for each fragment, when a query request is received, the document IDs corresponding to the target word segment in the index list are determined according to the target word segment in the query request; the fragment queries the documents it stores to determine the target documents corresponding to the target word segment, and the target documents found in the fragment are sorted according to the weight ratio of the target word segment in each target document. The client may request that size results be returned starting from position from in the ordering; the fragment then creates a priority queue of size from + size and places the information of the correspondingly ranked target documents into the priority queue, thereby generating the result set of the fragment.
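The per-fragment query described here (score by the weight ratio of the target word segment, keep only the best from + size candidates) could be sketched as follows, reusing the hypothetical IndexList and Posting classes from the earlier sketch (an editorial illustration, not Lucene's actual scoring):

    import java.util.ArrayList;
    import java.util.Comparator;
    import java.util.List;
    import java.util.PriorityQueue;

    class FragmentQuery {
        /** Return the IDs of the best (from + size) documents for one word segment on one fragment. */
        static List<String> queryFragment(IndexList indexList, String targetTerm, int from, int size) {
            // Min-heap bounded to from + size entries, ordered by the word segment's weight ratio.
            PriorityQueue<Posting> topDocs =
                    new PriorityQueue<>(Comparator.comparingDouble(p -> p.weight));
            for (Posting posting : indexList.lookup(targetTerm)) {
                topDocs.offer(posting);
                if (topDocs.size() > from + size) {
                    topDocs.poll(); // evict the lowest-weight candidate
                }
            }
            // Drain the heap into descending order of weight to form the fragment's result set.
            List<Posting> ordered = new ArrayList<>(topDocs);
            ordered.sort(Comparator.comparingDouble((Posting p) -> p.weight).reversed());
            List<String> resultSet = new ArrayList<>();
            for (Posting p : ordered) resultSet.add(p.docId);
            return resultSet;
        }
    }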
For step 108:
in this step, each segment returns the result set obtained in step 106 to the coordination node, the coordination node collects all the result sets, and ranks the result sets according to the weight proportion of the target participles in each result in the target document to obtain a final query result.
After the query is finished, the client still needs to return the target documents, the coordination node sends an acquisition request to the fragments containing the target documents, the fragments acquire the corresponding target documents and return the target documents to the coordination node, and the coordination node returns the target documents to the client.
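The summarize-and-fetch behaviour of the coordinating node could be sketched as follows (an editorial illustration; a result-set entry is reduced to a docId/weight pair and the fetch phase is simulated by a map lookup):

    import java.util.ArrayList;
    import java.util.Comparator;
    import java.util.List;
    import java.util.Map;

    class Hit {
        final String docId;
        final double weight; // weight ratio of the target word segment in this document
        Hit(String docId, double weight) { this.docId = docId; this.weight = weight; }
    }

    class CoordinatorMerge {
        /** Gather every fragment's result set, sort globally by weight, keep from..from+size, then fetch. */
        static List<String> mergeAndFetch(List<List<Hit>> fragmentResultSets,
                                          Map<String, String> documentStore,
                                          int from, int size) {
            List<Hit> all = new ArrayList<>();
            for (List<Hit> rs : fragmentResultSets) all.addAll(rs);                  // summarize all result sets
            all.sort(Comparator.comparingDouble((Hit h) -> h.weight).reversed());    // global ordering
            List<String> documents = new ArrayList<>();
            for (int i = from; i < Math.min(from + size, all.size()); i++) {
                // Fetch phase: retrieve the full document and hand it back to the client.
                documents.add(documentStore.get(all.get(i).docId));
            }
            return documents;
        }
    }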
Finally, the performance of the scheme of this embodiment was compared with that of a conventional MySQL database; the results are shown in Table 1 below.
TABLE 1
               MySQL        This scheme
Version        5.7          6.2.2
Data volume    10 million   10 million
Query speed    6.011 s      0.203 s
Insert speed   0.008 s      0.253 s
Modify speed   13.016 s     0.051 s
Delete speed   12.215 s     0.052 s
As can be seen from Table 1 above, with the same data volume the query speed of this scheme is about 30 times that of MySQL, and the modification and deletion speeds are also dozens of times faster than those of MySQL.
The performance of the scheme of the embodiment is also compared with that of the conventional HBase database, and the comparison result is shown in the following table 2.
TABLE 2
(Table 2 is reproduced only as an image in the original publication; it compares full-table-scan query times of this scheme and HBase on 270 million records.)
As can be seen from Table 2, with 270 million records stored in both this scheme and the HBase database, the time consumed by a traditional HBase full-table-scan query is 90 times that consumed by this scheme.
In conclusion, the scheme can improve the query speed of mass data and can meet the management and application requirements of hundreds of millions of data.
As shown in fig. 2 and fig. 3, an embodiment of the present invention provides a data management apparatus. The apparatus embodiments may be implemented by software, by hardware, or by a combination of hardware and software. In terms of hardware, fig. 2 is a hardware architecture diagram of the computing device in which the data management apparatus according to an embodiment of the present invention is located; besides the processor, memory, network interface and non-volatile memory shown in fig. 2, the computing device in which the apparatus is located may generally include other hardware, such as a forwarding chip responsible for processing packets. Taking a software implementation as an example, as shown in fig. 3, the apparatus is a logical apparatus formed by the CPU of the computing device reading the corresponding computer program from the non-volatile memory into memory and running it.
As shown in fig. 3, the data management apparatus provided in this embodiment includes:
an acquisition unit 301 configured to acquire a plurality of documents to be stored;
a storage unit 302, configured to perform word segmentation processing on each document to generate an index list in a corresponding index, and store each document in a corresponding segment of the index;
an issuing unit 303, configured to cause each node in the index, when a query request is received, to issue the query request to each fragment managed by the node;
a query unit 304, configured to enable each segment to query each document stored in the segment according to the index list and the query request, so as to generate a result set of the segment;
and the summarizing unit 305 is configured to summarize and sort all generated result sets to obtain a query result.
In an embodiment of the present invention, in the storage unit 302, the index list includes a plurality of word segments, a document ID corresponding to each word segment, position information of each word segment in the corresponding document, and a weight ratio of each word segment in the corresponding document.
In one embodiment of the present invention, the storage unit 302 is configured to perform:
writing each document into a memory in sequence to generate a translog file that grows gradually in the corresponding index;
as documents are continuously written, performing word-segmentation processing at intervals on the documents written during each interval, establishing index relationships in the index list according to the word-segmentation results, and grouping the documents written during that interval into a segment;
and after a set time, executing a disk-write operation to write each segment of the translog file into the corresponding fragment of the index.
In one embodiment of the present invention, the storage unit 302 is further configured to:
and record all operations between every two adjacent disk-write operations, so that progress can be restored from the record after a failure.
In an embodiment of the present invention, the storage unit 302 is further configured to merge the segments in the corresponding fragment after a disk-write operation is performed to write each segment of the translog file into the corresponding fragment of the index.
In an embodiment of the present invention, in the storage unit 302, each fragment corresponds to a delete file;
when a deletion request is received, the document is marked as deleted in the delete file, and the document is only removed from the corresponding segment when a plurality of segments are merged.
In one embodiment of the invention, query unit 304 is configured to perform:
for each fragment, executing:
determining a target word segment in the query request according to the query request;
determining the document ID corresponding to the target word segment according to the target word segment and the index list;
querying each document stored in the fragment according to the document ID corresponding to the target word segment, determining the target documents corresponding to the target word segment in the fragment, and sorting the target documents found in the fragment according to the weight ratio of the target word segment in each target document;
and generating the result set of the fragment from the target documents corresponding to a predetermined range of return sequence numbers.
The information interaction between the modules of the above apparatus and their execution processes are based on the same concept as the method embodiments of the present invention; for specific details, reference may be made to the description of the method embodiments, which is not repeated here.
The embodiment of the invention also provides a computing device, which comprises a memory and a processor, wherein the memory stores a computer program, and the processor executes the computer program to realize the data management method in any embodiment of the invention.
An embodiment of the present invention further provides a computer-readable storage medium, where a computer program is stored, and when the computer program is executed by a processor, the processor is caused to execute a data management method in any embodiment of the present invention.
Specifically, a system or an apparatus equipped with a storage medium on which software program codes that realize the functions of any of the embodiments described above are stored may be provided, and a computer (or a CPU or MPU) of the system or the apparatus is caused to read out and execute the program codes stored in the storage medium.
In this case, the program code itself read from the storage medium can realize the functions of any of the above-described embodiments, and thus the program code and the storage medium storing the program code constitute a part of the present invention.
Examples of the storage medium for supplying the program code include a floppy disk, a hard disk, a magneto-optical disk, an optical disk (e.g., CD-ROM, CD-R, CD-RW, DVD-ROM, DVD-RAM, DVD-RW, DVD + RW), a magnetic tape, a nonvolatile memory card, and a ROM. Alternatively, the program code may be downloaded from a server computer via a communications network.
Further, it should be clear that the functions of any one of the above-described embodiments may be implemented not only by executing the program code read out by the computer, but also by causing an operating system or the like operating on the computer to perform a part or all of the actual operations based on instructions of the program code.
Further, it is to be understood that the program code read out from the storage medium is written to a memory provided in an expansion board inserted into the computer or to a memory provided in an expansion module connected to the computer, and then causes a CPU or the like mounted on the expansion board or the expansion module to perform part or all of the actual operations based on instructions of the program code, thereby realizing the functions of any of the above-described embodiments.
It is noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus.
Those of ordinary skill in the art will understand that: all or part of the steps for realizing the method embodiments can be completed by hardware related to program instructions, the program can be stored in a computer readable storage medium, and the program executes the steps comprising the method embodiments when executed; and the aforementioned storage medium includes: various media that can store program codes, such as ROM, RAM, magnetic or optical disks.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, and not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (10)

1. A method for managing data, comprising:
acquiring a plurality of documents to be stored;
performing word segmentation processing on each document to generate an index list in a corresponding index, and storing each document into a corresponding fragment of the index;
when receiving a query request, each node in the index issues the query request to each fragment managed by the node;
each fragment queries each document stored in the fragment according to the index list and the query request to generate a result set of the fragment;
and summarizing and sorting all the generated result sets to obtain query results.
2. The method of claim 1, wherein the index list comprises a plurality of word segments, a document ID corresponding to each word segment, position information of each word segment in the corresponding document, and a weight ratio of each word segment in the corresponding document.
3. The method of claim 2, wherein the performing word-segmentation processing on each document to generate an index list in a corresponding index, and storing each document in a corresponding fragment of the index, comprises:
writing each document into a memory in sequence to generate a translog file that grows gradually in the corresponding index;
as documents are continuously written, performing word-segmentation processing at intervals on the documents written during each interval, establishing index relationships in the index list according to the word-segmentation results, and grouping the documents written during that interval into a segment;
and after a set time, executing a disk-write operation to write each segment of the translog file into the corresponding fragment of the index.
4. The method of claim 3, further comprising:
and recording all operations between every two adjacent disk-write operations, so that progress can be restored from the record after a failure.
5. The method of claim 3, wherein, after a disk-write operation is performed to write each segment of the translog file into the corresponding fragment of the index, the method further comprises: merging the segments in the corresponding fragment.
6. The method of claim 5, wherein each fragment corresponds to a delete file;
when a deletion request is received, the document is marked as deleted in the delete file, and the document is only removed from the corresponding segment when a plurality of segments are merged.
7. The method of claim 2, wherein the querying, by each fragment, of the documents stored in the fragment according to the index list and the query request to generate a result set of the fragment comprises:
for each fragment, executing:
determining a target word segment in the query request according to the query request;
determining the document ID corresponding to the target word segment according to the target word segment and the index list;
querying each document stored in the fragment according to the document ID corresponding to the target word segment, determining the target documents corresponding to the target word segment in the fragment, and sorting the target documents found in the fragment according to the weight ratio of the target word segment in each target document;
and generating the result set of the fragment from the target documents corresponding to a predetermined range of return sequence numbers.
8. A data management apparatus, comprising:
an acquisition unit, configured to acquire a plurality of documents to be stored;
the storage unit is used for performing word segmentation processing on each document to generate an index list in a corresponding index and storing each document into a corresponding fragment of the index;
the issuing unit is used for issuing the query request to each fragment managed by the node by each node in the index when the query request is received;
the query unit is used for enabling each fragment to query each document stored in the fragment according to the index list and the query request so as to generate a result set of the fragment;
and a summarizing unit, configured to summarize and sort all the generated result sets to obtain the query result.
9. A computing device comprising a memory having stored therein a computer program and a processor that, when executing the computer program, implements the method of any of claims 1-7.
10. A computer-readable storage medium, on which a computer program is stored which, when executed in a computer, causes the computer to carry out the method of any one of claims 1-7.
CN202211427149.9A 2022-11-15 2022-11-15 Data management method and device, computing equipment and storage medium Pending CN115658841A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211427149.9A CN115658841A (en) 2022-11-15 2022-11-15 Data management method and device, computing equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211427149.9A CN115658841A (en) 2022-11-15 2022-11-15 Data management method and device, computing equipment and storage medium

Publications (1)

Publication Number Publication Date
CN115658841A (en) 2023-01-31

Family

ID=85021606

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211427149.9A Pending CN115658841A (en) 2022-11-15 2022-11-15 Data management method and device, computing equipment and storage medium

Country Status (1)

Country Link
CN (1) CN115658841A (en)

Similar Documents

Publication Publication Date Title
US7558802B2 (en) Information retrieving system
CN111459985B (en) Identification information processing method and device
US8924373B2 (en) Query plans with parameter markers in place of object identifiers
CN107491487B (en) Full-text database architecture and bitmap index creation and data query method, server and medium
US8452788B2 (en) Information retrieval system, registration apparatus for indexes for information retrieval, information retrieval method and program
CN104133867A (en) DOT in-fragment secondary index method and DOT in-fragment secondary index system
US10860662B2 (en) System, method and computer program product for protecting derived metadata when updating records within a search engine
CN110659282B (en) Data route construction method, device, computer equipment and storage medium
CN110888837B (en) Object storage small file merging method and device
CN109669925B (en) Management method and device of unstructured data
US11625412B2 (en) Storing data items and identifying stored data items
CN109189759B (en) Data reading method, data query method, device and equipment in KV storage system
CN114116762A (en) Offline data fuzzy search method, device, equipment and medium
CN114153891A (en) Time series data processing method
CN114329096A (en) Method and system for processing native map database
CN116361287A (en) Path analysis method, device and system
CN115794861A (en) Offline data query multiplexing method based on feature abstract and application thereof
CN115658841A (en) Data management method and device, computing equipment and storage medium
CN113190644B (en) Method and device for hot updating word segmentation dictionary of search engine
CN115495462A (en) Batch data updating method and device, electronic equipment and readable storage medium
CN114153378A (en) Database memory management system and method
CN111984625B (en) Database load characteristic processing method and device, medium and electronic equipment
CN114416676A (en) Data processing method, device, equipment and storage medium
JP4914117B2 (en) Data processing system
CN114428776A (en) Index partition management method and system for time sequence data

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination