CN111831622A - Data index generation method and device, electronic equipment and readable storage medium

Info

Publication number: CN111831622A
Application number: CN202010244790.3A
Authority: CN
Inventors: 赵锐; 余汶龙; 李鑫
Original assignee: Beijing Didi Infinity Technology and Development Co Ltd
Current assignee: Beijing Didi Infinity Technology and Development Co Ltd
Priority date: 2020-03-31
Filing date: 2020-03-31
Publication date: 2020-10-27

Abstract

The embodiment of the invention discloses a data index generation method, a data index generation device, electronic equipment and a readable storage medium.

Description

Data index generation method and device, electronic equipment and readable storage medium

Technical Field

The present invention relates to the field of computer technologies, and in particular, to a data index generation method and apparatus, an electronic device, and a readable storage medium.

Background

A database can build indexes on its data to speed up queries. When a database system serves a production environment, new indexes often have to be added online. The traditional scheme traverses the data in all files, generates the data to be indexed, and finally rewrites that data into the database to build the index. The processing time of this approach is long and hard to control, and the disk load caused by traversing and writing the data further threatens the stability of online services.

Disclosure of Invention

In view of this, embodiments of the present invention provide a data index generation method, an apparatus, an electronic device, and a readable storage medium, so as to improve the speed of establishing a data index.

In a first aspect, an embodiment of the present invention provides a data index generation method, where the method includes:

acquiring a plurality of data files in a database stored based on an LSM storage engine;

traversing and analyzing the plurality of data files in parallel to generate analysis data, wherein the analysis data comprises index data of each data in the data files;

determining at least one index file according to the analysis data, wherein index data in the index file are ordered;

and loading each index file to a corresponding database.

Optionally, the data file and the index file are stored in a physically isolated manner.

Optionally, traversing and parsing the plurality of data files in parallel, and generating the parsed data includes:

traversing a plurality of data files in parallel, and determining an index column in each data file, wherein at least one column in the index column is identification information;

and encoding the data values of the index column to generate index data of each data in the data file so as to determine the analysis data.

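Optionally, traversing and parsing the plurality of data files in parallel, and generating the parsed data includes:

analyzing the plurality of data files in parallel through a Map algorithm to generate the parsed data;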

determining at least one index file from the parsed data comprises:

and generating at least one index file according to the analysis data through a Reduce algorithm.

Optionally, obtaining a plurality of data files in a database stored by the LSM-based storage engine includes:

and receiving a plurality of data files sent by the equipment where the database is located.

Optionally, loading each index file into a corresponding database includes:

and sending each index file to the equipment where the database is located, so that the equipment loads each index file into the database.

Optionally, the size of the data file is a first preset value, and the size of the index file is a second preset value.

Optionally, the first preset value is 64M, and the second preset value is 64M.

In a second aspect, an embodiment of the present invention provides a data index generating apparatus, where the apparatus includes:

a data file acquisition unit configured to acquire a plurality of data files in a database stored based on the LSM storage engine;

the analysis unit is configured to traverse and analyze the plurality of data files in parallel to generate analysis data, and the analysis data comprises index data of each data in the data files;

an index file determining unit configured to determine at least one index file according to the parsing data, wherein index data in the index file are ordered;

and the loading unit is configured to load each index file to a corresponding database.

Optionally, the parsing unit includes:

the index column determining subunit is configured to traverse a plurality of data files in parallel and determine an index column in each data file, wherein at least one column in the index columns is identification information;

and the encoding subunit is configured to encode the data values of the index column to generate index data of each data in the data file so as to determine the analysis data.

Optionally, the parsing unit is further configured to parse a plurality of the data files in parallel through a Map algorithm, so as to generate the parsed data;

the index file determining unit is further configured to generate at least one index file according to the analysis data through a Reduce algorithm.

Optionally, the data file obtaining unit includes:

and the receiving subunit is configured to receive a plurality of data files sent by the equipment where the database is located.

Optionally, the loading unit includes:

the sending subunit is configured to send each index file to the device where the database is located, so that the device loads each index file into the database.

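Optionally, the size of the data file is a first preset value, and the size of the index file is a second preset value.

Optionally, the first preset value is 64M, and the second preset value is 64M.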

In a third aspect, an embodiment of the present invention provides an electronic device, including a memory and a processor, where the memory is used to store one or more computer program instructions, where the one or more computer program instructions are executed by the processor to implement the method described above.

In a fourth aspect, embodiments of the present invention provide a computer readable storage medium having stored thereon computer program instructions which, when executed by a processor, implement a method as described above.

According to the embodiments of the present invention, a plurality of data files in a database stored based on an LSM storage engine are acquired, the plurality of data files are traversed and analyzed in parallel to generate analysis data, at least one index file is determined according to the analysis data, and each index file is loaded to a corresponding database, wherein the analysis data comprises the index data of each data in the data files and the index data in the index files are ordered. The speed of establishing the data index can thereby be improved.

Drawings

The above and other objects, features and advantages of the present invention will become more apparent from the following description of the embodiments of the present invention with reference to the accompanying drawings, in which:

FIG. 1 is a schematic diagram of a prior art data index generation process;

FIG. 2 is a flow chart of a data index generation method according to an embodiment of the present invention;

FIG. 3 is a schematic diagram of a data storage process of an embodiment of the present invention;

FIG. 4 is a diagram illustrating a data index generation process according to an embodiment of the present invention;

FIG. 5 is a schematic diagram of another data index generation process according to an embodiment of the invention;

FIG. 6 is a flow chart of another data index generation method according to an embodiment of the present invention;

FIG. 7 is a diagram of a data index generation apparatus according to an embodiment of the present invention;

FIG. 8 is a schematic diagram of another data index generation apparatus according to an embodiment of the present invention;

FIG. 9 is a schematic diagram of an electronic device according to an embodiment of the present invention.

Detailed Description

The present invention will be described below based on examples, but the present invention is not limited to only these examples. In the following detailed description of the present invention, certain specific details are set forth. It will be apparent to one skilled in the art that the present invention may be practiced without these specific details. Well-known methods, procedures, components and circuits have not been described in detail so as not to obscure the present invention.

Further, those of ordinary skill in the art will appreciate that the drawings provided herein are for illustrative purposes and are not necessarily drawn to scale.

Unless the context clearly requires otherwise, throughout the description, the words "comprise", "comprising", and the like are to be construed in an inclusive sense as opposed to an exclusive or exhaustive sense; that is, what is meant is "including, but not limited to".

In the description of the present invention, it is to be understood that the terms "first," "second," and the like are used for descriptive purposes only and are not to be construed as indicating or implying relative importance. In addition, in the description of the present invention, "a plurality" means two or more unless otherwise specified.

FIG. 1 is a schematic diagram of a prior art data index generation process. Database services often need to build a full index online, that is, to compute and generate index data for historical data. In the prior art, this usually relies on the storage engine: a full scan program is built into the application, the data is traversed one record at a time to find the data that needs to be indexed, that data is parsed to obtain parsed data, the parsed data is rearranged, and a new data file and an index file are then obtained through the corresponding storage engine. The parsed data may include information that uniquely identifies the data (such as a primary key) and data information. As shown in FIG. 1, the data in a data file 11 in a database 1 is traversed one by one to determine the data for which an index needs to be built, that data is parsed to obtain parsed data 2, the parsed data 2 is rearranged, and a new data file 11 and an index file 12 are obtained based on the corresponding storage engine. Therefore, in the prior art, each piece of data in the data file 11 needs to be traversed one by one, and if parallel traversal and parsing is adopted, a large amount of system resources is occupied, which may affect the execution of other applications in the system.

Therefore, the embodiment of the present invention provides a data index generation method that traverses and parses data in parallel with a data file as the unit of work, which achieves high concurrency of data traversal and parsing and thereby improves index creation efficiency.

Fig. 2 is a flowchart of a data index generation method according to an embodiment of the present invention. As shown in fig. 2, the data index generating method of the present embodiment includes the following steps:

Step S110, acquiring a plurality of data files in a database stored based on an LSM storage engine. In an optional implementation, the size of each data file is a first preset value; optionally, the first preset value is 64M.

The LSM storage engine is a storage engine based on an LSM-Tree (Log-Structured Merge-Tree). The LSM storage engine provides a file loading mode, so that files (SST files) generated according to certain rules can be quickly imported into a database. The core idea of the LSM storage engine is to keep incremental modifications to the data in memory and to write them to disk in bulk once a specified size limit is reached.

FIG. 3 is a schematic diagram of a data storage process of an embodiment of the present invention. As shown in fig. 3, the LSM storage engine 3 mainly includes a MemTable file 311 and a frozen MemTable file 312 in the memory 31, and files on the disk 32, such as an SST file 321, an operation log file (not shown in the figure), and the like. When a record is written based on the storage engine 3, the modified increment is written into the operation log file, and then the modified increment is written into the MemTable file 311 in the memory, and after the memory occupied by the MemTable file 311 reaches the upper limit value, the data in the memory needs to be dumped into the external memory file. Specifically, first, the MemTable file 311 is frozen into the immutable frozen MemTable file 312, and then the data of the frozen MemTable file 312 is sorted and then dumped to the disk 32, thereby forming a new SST file. Wherein the data in the SST file is ordered, such as based on primary key. Thus, the data files in the database stored based on the LSM storage engine are ordered and have a predetermined size. Alternatively, the size of the data file may be set by setting an upper limit value of the memory occupied by the MemTable file 311. It should be understood that the LSM storage engine based data storage process in fig. 3 is only exemplary, and other LSM storage engine based data storage methods can be applied to the present embodiment.
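The write path of FIG. 3 can be illustrated with a minimal sketch. It is a toy model only, not the storage engine of the embodiment: the class name `ToyLSMEngine`, its attributes, and the choice to count the MemTable upper limit in records rather than bytes are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple


class ToyLSMEngine:
    """Toy model of the write path in FIG. 3 (all names are hypothetical)."""

    def __init__(self, memtable_limit: int) -> None:
        self.memtable_limit = memtable_limit              # upper limit, counted in records here
        self.oplog: List[Tuple[str, str]] = []            # stands in for the operation log file
        self.memtable: Dict[str, str] = {}                # stands in for MemTable 311
        self.sst_files: List[List[Tuple[str, str]]] = []  # stands in for SST files 321

    def put(self, key: str, value: str) -> None:
        self.oplog.append((key, value))  # 1. write the modification to the log first
        self.memtable[key] = value       # 2. then apply it in memory
        if len(self.memtable) >= self.memtable_limit:
            self._flush()

    def _flush(self) -> None:
        frozen = self.memtable           # freeze: no further writes reach this table
        self.memtable = {}
        # Sort by key, then dump the frozen table as one new ordered SST file.
        self.sst_files.append(sorted(frozen.items()))


engine = ToyLSMEngine(memtable_limit=3)
for k in ["s03", "s01", "s02", "s05", "s04"]:
    engine.put(k, "row-" + k)
```

After the five writes, the first three records have been dumped as one SST file sorted by key, and the last two are still held in memory. This is why every data file produced this way is ordered and bounded in size.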

Step S120, traversing and analyzing the plurality of data files in parallel to generate analysis data. The analysis data comprises index data of each data in the data file.

In an alternative implementation, step S120 may include:

A1: traversing a plurality of data files in parallel, and determining an index column in each data file, wherein at least one column in the index columns is identification information. Optionally, a plurality of data files are traversed in parallel to determine the data in each data file that needs to be indexed and the index columns of each data file. For example, if the data file contains a table of student scores, the student number column, the name column, and the like may be determined as index columns, where the student number column is the identification information, that is, it uniquely identifies a student.

A2: encoding the data values of the index column to generate index data of each data in the data file so as to determine the analysis data. The analysis data comprises index data of each data in the data file. Optionally, the data values of the index column are encoded according to a predetermined encoding rule to generate the index data of the data file, so as to determine the analysis data. In an optional implementation, index data of the form [primary key, data] is created according to the predetermined encoding rule, that is, during an index lookup the data corresponding to the index data is obtained according to the primary key, where the primary key is information that can uniquely identify the data, such as a student number. In another optional implementation, index data of the form [index, primary key] is created according to the predetermined encoding rule, that is, during an index lookup the corresponding primary key is first queried according to the index, and the corresponding data is then obtained according to the primary key.
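As an illustration of step A2, the following minimal sketch encodes one [index, primary key] entry per row. The embodiment does not specify the predetermined encoding rule, so the rule used here (index value, a NUL separator, then the primary key) is an assumption, as are the function names and the column names `name` and `student_id`. The sketch also assumes that index values contain no NUL byte.

```python
from typing import Dict, List, Tuple

Row = Dict[str, str]


def encode_index_entry(row: Row, index_column: str, pk_column: str) -> Tuple[bytes, bytes]:
    """Build one [index, primary key] entry for a row.

    Hypothetical rule: key = index value + NUL + primary key, so rows that
    share an index value still produce distinct keys.
    """
    index_value = row[index_column].encode("utf-8")
    primary_key = row[pk_column].encode("utf-8")
    return index_value + b"\x00" + primary_key, primary_key


def parse_data_file(rows: List[Row], index_column: str, pk_column: str) -> List[Tuple[bytes, bytes]]:
    """Generate the index data for every row of one data file."""
    return [encode_index_entry(row, index_column, pk_column) for row in rows]


rows = [
    {"student_id": "s01", "name": "Li", "score": "90"},
    {"student_id": "s02", "name": "An", "score": "85"},
]
parsed = parse_data_file(rows, index_column="name", pk_column="student_id")
```

Appending the primary key to the encoded key is one common way to keep entries unique when several rows share the same index value; other encoding rules would serve equally well.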

Step S130, determining at least one index file according to the analysis data. Wherein the index data in the index file is ordered.

In an optional implementation manner, the embodiment adopts a Map/Reduce programming model to implement parallel processing of multiple data files. Specifically, a plurality of data files are analyzed in parallel through a Map algorithm to determine index data and data information of each data in each data file, analysis data are generated, and at least one ordered index file is generated according to the analysis data through a Reduce algorithm. Optionally, the Reduce algorithm writes the parsed data through the LSM storage engine to generate at least one index file, and thus, the index file is also an SST type file.
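The Map and Reduce steps described above can be sketched as follows. This is a single-process illustration that uses a thread pool in place of a Map/Reduce framework; a real deployment would presumably run on a distributed framework. The function names and the column names `name` and `student_id` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List, Tuple
import heapq

Row = Dict[str, str]
Entry = Tuple[str, str]  # (index value, primary key)


def map_file(rows: List[Row]) -> List[Entry]:
    """Map step: parse one data file into index entries, sorted locally."""
    return sorted((row["name"], row["student_id"]) for row in rows)


def reduce_entries(parsed: List[List[Entry]]) -> List[Entry]:
    """Reduce step: merge the per-file sorted runs into one ordered run."""
    return list(heapq.merge(*parsed))


def build_index(data_files: List[List[Row]], workers: int = 4) -> List[Entry]:
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parsed = list(pool.map(map_file, data_files))  # one task per data file
    return reduce_entries(parsed)


data_files = [
    [{"student_id": "s01", "name": "Li"}, {"student_id": "s02", "name": "An"}],
    [{"student_id": "s03", "name": "Bo"}],
]
index = build_index(data_files)
```

The unit of parallelism is the data file, not the individual record, which matches the idea of traversing and parsing with a data file as the unit.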

FIG. 4 is a diagram illustrating a data index generation process according to an embodiment of the present invention. As shown in FIG. 4, a plurality of data files D1-DN are analyzed in parallel through a Map algorithm to obtain analyzed data X, and a plurality of ordered index files I1-IM are generated through a Reduce algorithm. Wherein N is greater than or equal to 1, and M is greater than or equal to 1. Optionally, the size of the data file is a first preset value, and the size of the index file is a second preset value, where the first preset value and the second preset value may be the same or different. Optionally, the first preset value and the second preset value are both 64M. It should be appreciated that during Reduce processing, a plurality of ordered index files are generated from the index data of each of the parsed data X. That is, the index data in the index file is ordered, and thus, the data file and the index file may not have a one-to-one correspondence.
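How an ordered run of index data might be cut into index files of a preset size can be sketched as below. The function name and the size estimate (key length plus value length) are assumptions; a real SST file also carries metadata that this sketch ignores, and the preset value would be on the order of 64M, not the 10 bytes used here for demonstration.

```python
from typing import List, Tuple

Entry = Tuple[bytes, bytes]


def split_into_index_files(sorted_entries: List[Entry], max_bytes: int) -> List[List[Entry]]:
    """Cut an ordered run of entries into consecutive files bounded by max_bytes.

    Size is approximated as len(key) + len(value). A single entry larger
    than max_bytes still gets a file of its own.
    """
    files: List[List[Entry]] = []
    current: List[Entry] = []
    current_size = 0
    for key, value in sorted_entries:
        entry_size = len(key) + len(value)
        if current and current_size + entry_size > max_bytes:
            files.append(current)          # close the current file
            current, current_size = [], 0  # and start a new one
        current.append((key, value))
        current_size += entry_size
    if current:
        files.append(current)
    return files


entries = [(b"An", b"s02"), (b"Bo", b"s03"), (b"Li", b"s01"), (b"Wu", b"s04")]
index_files = split_into_index_files(entries, max_bytes=10)
```

Because the cut points depend only on the ordered index data, the number of index files M is unrelated to the number of data files N.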

Step S140, loading each index file to the corresponding database to complete the creation process of the data index.

In an optional implementation, the data files and the index files of this embodiment are stored in a physically isolated manner, so there is no need to rearrange the analysis data and regenerate a new data file and index file through the storage engine. System resources and data index creation time can therefore be further saved.
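The effect of physically isolated storage on step S140 can be sketched with a toy database that keeps data files and index files in separate collections. The class `ToyDatabase` and its method are hypothetical; an actual LSM storage engine would expose its own file loading interface.

```python
from typing import Dict, List, Tuple

Entry = Tuple[str, str]  # (index value, primary key)


class ToyDatabase:
    """Keeps data files and index files in separate collections (hypothetical model)."""

    def __init__(self, data_files: List[List[Dict[str, str]]]) -> None:
        self.data_files = data_files
        self.index_files: List[List[Entry]] = []

    def load_index_file(self, index_file: List[Entry]) -> None:
        """Accept a pre-built, ordered index file without touching any data file."""
        if index_file != sorted(index_file):
            raise ValueError("index file must already be ordered")
        self.index_files.append(index_file)


db = ToyDatabase([[{"student_id": "s01", "name": "Li"}, {"student_id": "s02", "name": "An"}]])
db.load_index_file([("An", "s02"), ("Li", "s01")])
```

Loading only appends to the index side; the existing data files are neither read nor rewritten.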

FIG. 5 is a diagram illustrating another data index generation process according to an embodiment of the invention. As shown in FIG. 5, in this embodiment, a plurality of data files 51 in the database 5 are traversed and parsed in parallel by Map/Reduce to determine an index column in each data file, the data values of the index column are encoded to generate index data of each data in the data file and thereby determine the analysis data, at least one index file 52 is generated according to the analysis data, and the index file 52 is imported into the database to complete the creation process of the data index. Because the data files and the index files are stored in a physically isolated manner, there is no need to rearrange the analysis data and regenerate a new data file and index file through the storage engine, which saves a large amount of system resources and data index creation time.

In this embodiment, a plurality of data files in a database stored based on an LSM storage engine are acquired, the plurality of data files are traversed and analyzed in parallel to generate analysis data, at least one index file is determined according to the analysis data, and each index file is loaded to the corresponding database, wherein the analysis data comprises the index data of each data in the data files and the index data in the index files are ordered. Meanwhile, data traversal in this embodiment does not depend on the storage engine, so the concurrency of data traversal is easy to increase, which further reduces the traversal and parsing time. In addition, the data files and the index files are stored in a physically isolated manner, so there is no need to rearrange the analysis data and regenerate a new data file and index file through the storage engine, which saves a large amount of system resources and data index creation time.

FIG. 6 is a flow chart of another data index generation method according to an embodiment of the invention. As shown in fig. 6, the data index generating method according to the embodiment of the present invention includes the following steps:

Step S1, the index creation device receives a plurality of data files in the database sent by the device where the database is located. The plurality of data files in the database are stored based on the LSM storage engine. Each data file is an ordered SST file.

Step S2, on the index creation device, traversing and parsing the plurality of data files in parallel to generate analysis data. The analysis data comprises index data of each data in the data file. In an optional implementation, a plurality of data files are traversed in parallel, an index column in each data file is determined, and the data values of the index column are encoded to generate index data of each data in the data file, so as to determine the analysis data. At least one of the index columns is identification information. For example, if the data file contains a table of student scores, the student number column, the name column, and the like may be determined as index columns, where the student number column is the identification information, that is, it uniquely identifies a student.

Optionally, the data values of the index column are encoded according to a predetermined encoding rule to generate the index data of the data file, so as to determine the analysis data. In an optional implementation, index data of the form [primary key, data] is created according to the predetermined encoding rule, that is, during an index lookup the data corresponding to the index data is obtained according to the primary key, where the primary key is information that can uniquely identify the data, such as a student number. In another optional implementation, index data of the form [index, primary key] is created according to the predetermined encoding rule, that is, during an index lookup the corresponding primary key is first queried according to the index, and the corresponding data is then obtained according to the primary key.
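The two-step lookup implied by the [index, primary key] form can be sketched as follows, assuming the ordered index file is held as a sorted list of tuples and the data is reachable through a dictionary keyed by primary key. Both representations and the function name are illustrative assumptions.

```python
import bisect
from typing import Dict, List, Tuple

Entry = Tuple[str, str]  # (index value, primary key)


def lookup_by_index(index_file: List[Entry], data_by_pk: Dict[str, Dict[str, str]],
                    index_value: str) -> List[Dict[str, str]]:
    """Two-step read: index value -> primary keys -> rows."""
    # Binary search for the first entry whose index value is >= the target.
    start = bisect.bisect_left(index_file, (index_value, ""))
    rows = []
    for value, primary_key in index_file[start:]:
        if value != index_value:
            break                             # past the run of matching entries
        rows.append(data_by_pk[primary_key])  # second step: fetch by primary key
    return rows


index_file = [("An", "s02"), ("Li", "s01"), ("Li", "s04")]
data_by_pk = {
    "s01": {"student_id": "s01", "name": "Li", "score": "90"},
    "s02": {"student_id": "s02", "name": "An", "score": "85"},
    "s04": {"student_id": "s04", "name": "Li", "score": "70"},
}
matches = lookup_by_index(index_file, data_by_pk, "Li")
```

The binary search is only valid because the index data in the index file is ordered.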

Step S3, determining, on the index creating device, at least one index file according to the parsed data. Wherein the index data in the index file is ordered. In an optional implementation manner, the embodiment adopts a Map/Reduce programming model to implement parallel processing of multiple data files. Specifically, a plurality of data files are analyzed in parallel through a Map algorithm to determine index data and data information of each data in each data file, analysis data are generated, and at least one ordered index file is generated according to the analysis data through a Reduce algorithm. Optionally, the Reduce algorithm writes the parsed data through the LSM storage engine to generate at least one index file, and thus, the index file is also an SST type file.

Step S4, the index creation device sends each index file to the device where the database is located.

Step S5, the device in which the database is located loads each index file to the corresponding database to complete the creation process of the data index.

By moving the data index creation process to another device, the embodiment of the present invention avoids occupying the system resources of the device where the database is located, and thus avoids the impact of full index creation on other applications on that device. In addition, in the embodiment of the present invention, a plurality of data files in a database are acquired, the plurality of data files are traversed and analyzed in parallel to generate analysis data, at least one index file is determined according to the analysis data, and each index file is loaded to the corresponding database, wherein the analysis data comprises index data of each data in the data files and the index data in the index files are ordered. Furthermore, data traversal in this embodiment does not depend on the storage engine, so the concurrency of data traversal is easy to increase, which further reduces the traversal and parsing time. Finally, the data files and the index files are stored in a physically isolated manner, so that the data does not need to be rewritten into the database to create the index, which saves a large amount of system resources and data index creation time.
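The division of work between the two devices in steps S1 to S5 can be sketched with two small classes. The class and method names are hypothetical, list copies stand in for the network transfer, and a single index file is produced for brevity.

```python
from typing import Dict, List, Tuple

Row = Dict[str, str]
Entry = Tuple[str, str]  # (index value, primary key)


class DatabaseDevice:
    """Device where the database is located (hypothetical model)."""

    def __init__(self, data_files: List[List[Row]]) -> None:
        self.data_files = data_files
        self.index_files: List[List[Entry]] = []

    def send_data_files(self) -> List[List[Row]]:
        # S1: copies stand in for sending the files over a network.
        return [list(data_file) for data_file in self.data_files]

    def load_index_files(self, index_files: List[List[Entry]]) -> None:
        # S5: load the received index files into the database.
        self.index_files.extend(index_files)


class IndexCreationDevice:
    """Separate device that performs the parsing and sorting work."""

    def build(self, data_files: List[List[Row]], index_column: str,
              pk_column: str) -> List[List[Entry]]:
        # S2 + S3: parse every row, then produce one ordered index file.
        entries = sorted((row[index_column], row[pk_column])
                         for data_file in data_files for row in data_file)
        return [entries]


db_device = DatabaseDevice([
    [{"student_id": "s01", "name": "Li"}],
    [{"student_id": "s02", "name": "An"}],
])
builder = IndexCreationDevice()
received = db_device.send_data_files()                       # S1
index_files = builder.build(received, "name", "student_id")  # S2, S3
db_device.load_index_files(index_files)                      # S4 (send), S5 (load)
```

The only work left on the database device is handing out files and loading the finished index files; the parsing and sorting run elsewhere.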

FIG. 7 is a schematic diagram of a data index generating apparatus according to an embodiment of the present invention. As shown in FIG. 7, the data index generating apparatus 7 of the embodiment of the present invention includes a data file obtaining unit 71, a parsing unit 72, an index file determining unit 73, and a loading unit 74.

The data file acquiring unit 71 is configured to acquire a plurality of data files in a database stored based on the LSM storage engine.

The parsing unit 72 is configured to traverse and parse the plurality of data files in parallel, generating parsed data, which includes index data of each data in the data files. In an alternative implementation, the parsing unit 72 is further configured to parse a plurality of the data files in parallel through a Map algorithm to generate the parsed data. In an alternative implementation, the parsing unit 72 includes an index column determination subunit 721 and an encoding subunit 722. The index column determination subunit 721 is configured to traverse a plurality of the data files in parallel, and determine an index column in each of the data files, where at least one of the index columns is identification information. The encoding subunit 722 is configured to encode the data values of the index column to generate index data of each data in the data file to determine the parsed data.

The index file determining unit 73 is configured to determine at least one index file from the parsed data, index data in the index file being ordered. In an optional implementation manner, the index file determining unit 73 is further configured to generate at least one index file according to the parsed data through a Reduce algorithm.

The loading unit 74 is configured to load each of the index files into a corresponding database. In an alternative implementation, the data file and the index file are stored in a physically separate manner. In an optional implementation manner, the size of the data file is a first preset value, and the size of the index file is a second preset value. Optionally, the first preset value is 64M, and the second preset value is 64M.
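The cooperation of the four units of FIG. 7 can be sketched as a simple composition of callables. The dataclass `DataIndexGenerator`, its field names, and the lambdas in the usage example are illustrative assumptions and do not describe the actual structure of the apparatus.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Row = Dict[str, str]
Entry = Tuple[str, str]  # (index value, primary key)


@dataclass
class DataIndexGenerator:
    acquire: Callable[[], List[List[Row]]]                 # data file obtaining unit 71
    parse: Callable[[List[List[Row]]], List[Entry]]        # parsing unit 72
    determine: Callable[[List[Entry]], List[List[Entry]]]  # index file determining unit 73
    load: Callable[[List[List[Entry]]], None]              # loading unit 74

    def run(self) -> None:
        """Chain the units in the order acquire -> parse -> determine -> load."""
        self.load(self.determine(self.parse(self.acquire())))


loaded: List[List[Entry]] = []
generator = DataIndexGenerator(
    acquire=lambda: [[{"student_id": "s01", "name": "Li"}],
                     [{"student_id": "s02", "name": "An"}]],
    parse=lambda files: [(r["name"], r["student_id"]) for f in files for r in f],
    determine=lambda entries: [sorted(entries)],
    load=loaded.extend,
)
generator.run()
```

Each unit can be swapped independently, for example replacing the acquiring unit with one that receives files from another device.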

Fig. 8 is a schematic diagram of another data index generating apparatus according to an embodiment of the present invention. As shown in fig. 8, the data index generating device 8 of the present embodiment includes a data file acquiring unit 81, a parsing unit 82, an index file determining unit 83, and a loading unit 84.

The data file acquiring unit 81 is configured to acquire a plurality of data files in a database stored based on the LSM storage engine. In an alternative implementation, the data file obtaining unit 81 includes a receiving subunit 811. The receiving subunit 811 is configured to receive a plurality of data files transmitted by the device in which the database is located.

The parsing unit 82 is configured to traverse and parse the plurality of data files in parallel, and generate parsed data, which includes index data of each data in the data files. In an alternative implementation manner, the parsing unit 82 is further configured to parse a plurality of the data files in parallel through a Map algorithm to generate the parsed data. In an alternative implementation, the parsing unit 82 includes an index column determination subunit 821 and an encoding subunit 822. The index column determination subunit 821 is configured to traverse a plurality of the data files in parallel, and determine an index column in each of the data files, where at least one of the index columns is identification information. The encoding subunit 822 is configured to encode the data values of the index column to generate index data of each data in the data file to determine the parsing data.

The index file determining unit 83 is configured to determine at least one index file from the parsed data, the index data in the index file being ordered. In an optional implementation manner, the index file determining unit 83 is further configured to generate at least one index file according to the parsed data through a Reduce algorithm.

The loading unit 84 is configured to load each of the index files into a corresponding database. In an alternative implementation, the loading unit 84 includes a sending subunit 841. The sending subunit 841 is configured to send each index file to the device where the database is located, so that the device loads each index file into the database.

In an alternative implementation, the data file and the index file are stored in a physically separate manner. In an optional implementation manner, the size of the data file is a first preset value, and the size of the index file is a second preset value. Optionally, the first preset value is 64M, and the second preset value is 64M.

FIG. 9 is a schematic diagram of an electronic device of an embodiment of the invention. As shown in FIG. 9, the electronic device is a general address query device, which includes a general computer hardware structure including at least a processor 91 and a memory 92. The processor 91 and the memory 92 are connected by a bus 93. The memory 92 is adapted to store instructions or programs executable by the processor 91. The processor 91 may be a stand-alone microprocessor or may be a collection of one or more microprocessors. Thus, the processor 91 implements the processing of data and the control of other devices by executing instructions stored by the memory 92 to perform the method flows of embodiments of the present invention as described above. The bus 93 connects the above components together, and also connects the above components to a display controller 94 and a display device and an input/output (I/O) device 95. Input/output (I/O) devices 95 may be a mouse, keyboard, modem, network interface, touch input device, motion sensing input device, printer, and other devices known in the art. Typically, the input/output devices 95 are coupled to the system through input/output (I/O) controllers 99.

As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, apparatus (device) or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may employ a computer program product embodied on one or more computer-readable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) having computer-usable program code embodied therein.

The present application is described with reference to flowchart illustrations of methods, apparatus (devices) and computer program products according to embodiments of the application. It will be understood that each flow in the flow diagrams can be implemented by computer program instructions.

These computer program instructions may be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows.

These computer program instructions may also be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows.

Another embodiment of the invention is directed to a non-transitory storage medium storing a computer-readable program for causing a computer to perform some or all of the above-described method embodiments.

That is, as can be understood by those skilled in the art, all or part of the steps of the methods in the embodiments described above may be implemented by a program instructing related hardware, where the program is stored in a storage medium and includes several instructions that enable a device (which may be a single-chip microcomputer, a chip, or the like) or a processor to execute all or part of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, and other various media capable of storing program code.

The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims

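1. A method for generating a data index, the method comprising:

acquiring a plurality of data files in a database stored based on an LSM storage engine;

traversing and parsing the plurality of data files in parallel to generate parsed data, wherein the parsed data comprises index data of each data in the data files;

determining at least one index file from the parsed data, wherein index data in the index file is ordered;

and loading each index file to a corresponding database.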

2. The method of claim 1, wherein the data file and the index file are stored in physical isolation.

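3. The method of claim 1, wherein traversing and parsing the plurality of data files in parallel, generating parsed data comprises:

traversing the plurality of data files in parallel, and determining an index column in each data file, wherein at least one column in the index columns is identification information;

and encoding the data values of the index column to generate index data of each data in the data file so as to determine the parsed data.

4. The method of claim 1, wherein traversing and parsing the plurality of data files in parallel, generating parsed data comprises:

parsing the plurality of data files in parallel through a Map algorithm to generate the parsed data;

determining at least one index file from the parsed data comprises:

and generating at least one index file from the parsed data through a Reduce algorithm.

5. The method of claim 1, wherein retrieving the plurality of data files in the database stored based on the LSM storage engine comprises:

and receiving a plurality of data files sent by the device where the database is located.

6. The method of claim 5, wherein loading each of the index files into a corresponding database comprises:

and sending each index file to the device where the database is located, so that the device loads each index file into the database.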

7. The method of claim 1, wherein the size of the data file is a first predetermined value and the size of the index file is a second predetermined value.

8. The method of claim 7, wherein the first preset value is 64M and the second preset value is 64M.

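9. An apparatus for generating a data index, the apparatus comprising:

a data file obtaining unit configured to acquire a plurality of data files in a database stored based on an LSM storage engine;

a parsing unit configured to traverse and parse the plurality of data files in parallel to generate parsed data, wherein the parsed data comprises index data of each data in the data files;

an index file determining unit configured to determine at least one index file from the parsed data, wherein index data in the index file is ordered;

and a loading unit configured to load each index file to a corresponding database.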

10. The apparatus of claim 9, wherein the data file and the index file are stored in physical isolation.

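11. The apparatus of claim 9, wherein the parsing unit comprises:

an index column determining subunit configured to traverse the plurality of data files in parallel and determine an index column in each data file, wherein at least one column in the index columns is identification information;

and an encoding subunit configured to encode the data values of the index column to generate index data of each data in the data file so as to determine the parsed data.

12. The apparatus according to claim 9, wherein the parsing unit is further configured to parse a plurality of the data files in parallel through a Map algorithm, generating the parsed data;

the index file determining unit is further configured to generate at least one index file from the parsed data through a Reduce algorithm.

13. The apparatus according to claim 9, wherein the data file obtaining unit comprises:

a receiving subunit configured to receive a plurality of data files sent by the device where the database is located.

14. The apparatus of claim 13, wherein the loading unit comprises:

a sending subunit configured to send each index file to the device where the database is located, so that the device loads each index file into the database.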

15. The apparatus of claim 9, wherein the size of the data file is a first predetermined value and the size of the index file is a second predetermined value.

16. The apparatus of claim 15, wherein the first preset value is 64M and the second preset value is 64M.

17. An electronic device comprising a memory and a processor, wherein the memory is configured to store one or more computer program instructions, wherein the one or more computer program instructions are executed by the processor to implement the method of any of claims 1-8.

18. A computer-readable storage medium on which computer program instructions are stored, which computer program instructions, when executed by a processor, are to implement a method according to any one of claims 1-8.