CN113934713A - Order data indexing method, system, computer equipment and storage medium

Info

Publication number: CN113934713A
Application number: CN202111026637.4A
Authority: CN
Inventors: 刘松森; 王靖天; 赵志刚
Original assignee: Guangzhou Yidejia Network Technology Co ltd
Current assignee: Guangzhou Yidejia Network Technology Co ltd
Priority date: 2021-09-02
Filing date: 2021-09-02
Publication date: 2022-01-14

Abstract

The invention provides an order data indexing method, system, computer equipment and storage medium. The method comprises the following steps: obtaining, according to a preset time threshold, first order data whose storage duration is not greater than the time threshold; generating a log file of the first order data, and clustering the log file according to primary keys of the log file to obtain a data service layer wide table; dividing the data service layer wide table to obtain a plurality of partitions; and acquiring a query instruction, and indexing in the plurality of partitions through the primary keys according to the query instruction to obtain target order data. Through columnar storage and distributed computing services, the scheme speeds up order data queries that would otherwise be slow when the number of orders is very large; compared with a traditional database, it can achieve real-time queries with second-level latency, with higher query and response speeds, and it can be widely applied in the technical field of computers.

Description

Order data indexing method, system, computer equipment and storage medium

Technical Field

The invention belongs to the technical field of computers, and particularly relates to an order data indexing method, an order data indexing system, computer equipment and a storage medium.

Background

With the development of the mobile internet, demand for online shopping keeps increasing. As transaction volume grows and business logic becomes more complex, an e-commerce platform generates a large amount of data from the logical relationships between orders and customers, and the traditional approach generally uses the relational database management system MySQL as the storage engine. When the data volume is small, MySQL can support the normal flow of business. However, in existing data query schemes, once the data volume becomes too large, queries slow down, data products respond slowly, and user order processing and data analysis stall.

Disclosure of Invention

In view of the above, to at least partially solve one of the above technical problems, embodiments of the present invention provide a method, a system, a device and a storage medium for indexing order data, which are more convenient, efficient and fast, and can cope with highly concurrent data.

In a first aspect, a technical solution of the present application provides an order data indexing method, which includes:

according to a preset time threshold, first order data with the storage duration not greater than the time threshold are obtained;

generating a log file of the first order data, and clustering the log file according to primary keys of the log file to obtain a data service layer wide table;

dividing the data service layer wide table to obtain a plurality of partitions;

and acquiring a query instruction, and indexing in the plurality of partitions through the primary keys according to the query instruction to obtain target order data.

In a possible embodiment of the present disclosure, the order data indexing method further includes:

acquiring second order data with the storage duration being greater than the time threshold, and performing offline cleaning on the second order data;

generating a data service layer wide table according to the mapping relation between the second field attribute of the second order data and the first field attribute in the first order data;

and clustering according to the order data in the detailed data wide table to obtain a plurality of data service layer wide tables.

In a feasible embodiment of the present application, the generating a log file of the first order data, and clustering the log file according to the primary key of the log file includes the following steps:

generating an original key of the first order data according to the log file, and combining the original key with a first character to obtain the primary key;

the first character is calculated from the hash value of the first order data and the number of partitions.

In a possible embodiment of the present application, before the step of generating a log file of the first order data, and clustering the log file according to primary keys of the log file to obtain a data service layer wide table, the method further includes:

acquiring the first order data, and carrying out desensitization treatment on the first order data;

the desensitization treatment comprises the following steps:

replacing the user information in the first order data with a symbol character string;

and screening fields in the first order data according to the mapping relation.

In a possible embodiment of the present disclosure, after the step of constructing the detail data wide table from the offline-cleaned second order data and the first order data, the method further includes at least one of the following steps:

carrying out standardization processing on the order data in the detail data wide table, and unifying the data format of the order data;

and performing data cleaning on the order data in the detail data wide table, and reducing null values and dirty data of the order data.

In a possible embodiment of the present application, before the step of obtaining, according to a preset time threshold, first order data whose storage duration is not greater than the time threshold, the method further includes:

matching a data table through a regular expression, wherein the data table comprises a plurality of first order data;

and setting and acquiring data warehouse parameters of the data table.

In a feasible embodiment of the present application, acquiring a query instruction and indexing in the plurality of partitions through the primary keys according to the query instruction to obtain target order data includes the following steps:

acquiring a plurality of historical query results, and acquiring high-frequency fields in the historical query results;

and constructing an index table according to the high-frequency fields, and querying the target order data according to the index table.

In a second aspect, an aspect of the present invention further provides an order data indexing system, including:

the data acquisition module is used for acquiring first order data with the storage duration not greater than a preset time threshold according to the preset time threshold;

the data classification module is used for generating a log file of the first order data, and clustering the log file according to primary keys of the log file to obtain a data service layer wide table; dividing the data service layer wide table to obtain a plurality of partitions;

and the data query module is used for acquiring a query instruction, and indexing in the plurality of partitions through the primary keys according to the query instruction to obtain target order data.

In a third aspect, a technical solution of the present invention further provides a computer device for order data indexing, including:

at least one processor;

at least one memory for storing at least one program;

when the at least one program is executed by the at least one processor, the at least one processor is caused to perform the method of the first aspect.

In a fourth aspect, the present invention further provides a storage medium, in which a program executable by a processor is stored, and the program executable by the processor is configured to execute the order data indexing method according to the first aspect when executed by the processor.

Advantages and benefits of the present invention will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention:

according to the technical scheme, a time threshold is set and the corresponding order data is obtained according to it; massive order data is clustered and divided through the log files of the order data to obtain data service layer wide tables, and corresponding partitions are constructed, the partitions comprising a plurality of data service layer wide tables; the primary keys in the partitions are then retrieved and compared according to a query instruction to obtain the target order data. Through columnar storage and distributed computing services, the scheme speeds up order data queries that would otherwise be slow when the number of orders is very large; compared with a traditional database, it can achieve real-time queries with second-level latency, with higher query and response speeds.

Drawings

In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed in the description of the embodiments are briefly introduced below. The drawings in the following description show only some embodiments of the present application, and those skilled in the art can obtain other drawings from them without creative effort.

FIG. 1 is a flowchart illustrating steps of an order data indexing method according to an embodiment of the present invention;

FIG. 2 is a flowchart illustrating steps of another order data indexing method according to an embodiment of the present invention.

Detailed Description

Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the accompanying drawings are illustrative only for the purpose of explaining the present invention, and are not to be construed as limiting the present invention. The step numbers in the following embodiments are provided only for convenience of illustration, the order between the steps is not limited at all, and the execution order of each step in the embodiments can be adapted according to the understanding of those skilled in the art.

In the prior art, the relational database management system MySQL is usually used as the storage engine. Although MySQL can support the normal flow of business data in traditional data processing scenarios, in the now common scenario of highly concurrent order data the excessive data volume slows down queries, degrades the response speed of data products, and causes processes and data analysis to stall. In view of this, the present application provides a technical solution that can index order data quickly at big-data scale.

In a first aspect, as shown in fig. 1, the technical solution of the present application provides an order data indexing method, which may include S100-S400:

S100, acquiring first order data with the storage duration not greater than a time threshold according to a preset time threshold;

the time threshold is the preset validity period of the order data: data within the validity period is available data, and data outside the validity period is integrated with the available data after the necessary synchronous loading and cleaning. Specifically, the embodiment first acquires the order data within the validity period from the storage space or storage container in which the order data is preliminarily collected. Illustratively, the embodiment may adopt the open-source framework Canal to synchronize incremental database data into the data store of the embodiment. Where the storage space for the preliminary collection of order data is a MySQL database, Canal can act as a MySQL slave: by simulating the interaction protocol of a MySQL slave, it sends a dump request to the MySQL master; on receiving the dump request, the MySQL master starts pushing binary logs to Canal; Canal parses the binary logs and sends them to a storage destination, which includes but is not limited to MySQL, Kafka and Elasticsearch. Taking a validity period of one year as an example, the embodiment uses Canal to synchronize the binary logs of the order-related tables to Kafka. It should be noted that, when the amount of order data acquired is small, a single-partition topic may be created in Kafka, where a topic is the basic unit of a Kafka data write operation and its number of replicas may be specified.
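As a minimal sketch of the selection in step S100, the following Python snippet splits order records into those whose storage duration is not greater than the time threshold and those that exceed it. The record layout (`order_id`, `created_at`), the function name and the one-year value are assumptions made for illustration; they are not taken from the embodiment.

```python
from datetime import datetime, timedelta

# Hypothetical threshold: the one-year validity period used as the example in the text.
TIME_THRESHOLD = timedelta(days=365)


def split_by_storage_duration(orders, now, threshold=TIME_THRESHOLD):
    """Return (first_order_data, second_order_data).

    first_order_data: storage duration not greater than the threshold.
    second_order_data: storage duration greater than the threshold.
    """
    first, second = [], []
    for order in orders:
        duration = now - order["created_at"]
        if duration <= threshold:
            first.append(order)
        else:
            second.append(order)
    return first, second


# Illustrative records; the field names are assumptions.
now = datetime(2021, 9, 2)
orders = [
    {"order_id": "A1", "created_at": datetime(2021, 8, 1)},
    {"order_id": "A2", "created_at": datetime(2019, 5, 20)},
]
first_orders, second_orders = split_by_storage_duration(orders, now)
```

In this sketch the first group would follow the real-time path and the second group the offline path.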

S200, generating a log file of the first order data, and clustering the log file according to primary keys of the log file to obtain a data service layer wide table;

wherein the log file (binary log) records the data modifications made in the database, including but not limited to statements related to insert, update, delete, create, drop and alter. A primary key is one or more fields in a data table that uniquely identify a record in the table; in a relationship between two tables, the primary key is used to reference a particular record in one table from the other table. The detail data layer (DWD) is an isolation layer between the business layer and the data warehouse. A wide table is a database table with many fields, usually one in which the metrics, dimensions and attributes related to a business entity are associated together. In an exemplary embodiment, the order data acquired in step S100 is synchronized into Kafka, the necessary cleaning and desensitization are performed on the order data through the Flink dataflow programming model, and the cleaned and desensitized order data is then aggregated and written into HBase, where the DWD layer of the data is constructed and a DWD wide table is generated.
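The idea of aggregating log records by primary key into a wide table can be sketched in a few lines of Python. The event layout below (`table`, `type`, `order_id`, `data`) is a simplified, hypothetical stand-in for parsed binary log records and is not the actual message format of Canal; the function name is likewise an assumption.

```python
def build_wide_table(events):
    """Fold row-change events into one wide row per primary key.

    Columns from different source tables are merged into the same row,
    prefixed with the source table name.
    """
    wide = {}
    for ev in events:
        key = ev["order_id"]  # the primary key shared by all related tables
        if ev["type"] == "DELETE" and ev["table"] == "orders":
            wide.pop(key, None)  # deleting the order removes the whole wide row
            continue
        row = wide.setdefault(key, {"order_id": key})
        for col, val in ev["data"].items():
            row[f'{ev["table"]}.{col}'] = val  # later events overwrite earlier ones
    return wide


# Illustrative change events in arrival order.
events = [
    {"table": "orders", "type": "INSERT", "order_id": "A1", "data": {"amount": 120}},
    {"table": "payments", "type": "INSERT", "order_id": "A1", "data": {"status": "PAID"}},
    {"table": "orders", "type": "UPDATE", "order_id": "A1", "data": {"amount": 100}},
    {"table": "orders", "type": "INSERT", "order_id": "A2", "data": {"amount": 30}},
    {"table": "orders", "type": "DELETE", "order_id": "A2", "data": {}},
]
wide_table = build_wide_table(events)
```

Each resulting row gathers the attributes of one order from several source tables, which is the property the text attributes to a wide table.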

S300, dividing the data service layer wide table to obtain a plurality of partitions;

wherein the partition (Region) is the minimum unit in which an HBase cluster distributes data. When data is first written to the data table in the embodiment, the table has only one Region. The Region grows as data increases, and when it reaches a defined threshold size it is split into two Regions of roughly equal size; the threshold is the configured StoreFile size. Before the first split, all loaded data resides on the server hosting the original Region, and the number of Regions increases correspondingly as the table grows. In addition, the RegionServer is a core module of HBase, which responds to users' IO requests and reads and writes data.
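The splitting behaviour described above can be illustrated with a toy model. The class below is not HBase code: it counts rows instead of StoreFile bytes, and the class name, method name and threshold are assumptions chosen only to show how one Region becomes several as a table grows.

```python
import bisect


class ToyTable:
    """Toy model of a table that splits a Region once it grows past a threshold."""

    def __init__(self, max_rows_per_region):
        self.max_rows = max_rows_per_region
        self.regions = [[]]  # a new table starts with a single, empty Region

    def put(self, row_key):
        # Locate the Region whose key range contains row_key:
        # the last Region whose first key is <= row_key (Region 0 otherwise).
        idx = 0
        for i, region in enumerate(self.regions):
            if region and region[0] <= row_key:
                idx = i
        region = self.regions[idx]
        bisect.insort(region, row_key)  # keep keys sorted inside the Region
        if len(region) > self.max_rows:
            mid = len(region) // 2  # split into two halves of similar size
            self.regions[idx:idx + 1] = [region[:mid], region[mid:]]


table = ToyTable(max_rows_per_region=4)
for i in range(10):
    table.put(f"order-{i:04d}")  # monotonically increasing row keys
```

With monotonically increasing keys, every new write in this model lands in the last Region, which hints at why sequential row keys can concentrate load.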

In addition, a data base layer (DWB) is constructed in the embodiment to store objective data; it serves as a middle layer and can be regarded as a data layer holding a large number of metrics. Based on the base data in the DWB layer, the embodiment can integrate and summarize the DWD wide table into a service data layer for analyzing a certain subject domain, obtaining a data service layer (DWS) wide table, which is used for subsequent business queries, OLAP analysis, data distribution and the like. It is understood that the RegionServer can be applied to any wide table in the embodiment.

S400, acquiring a query instruction, and indexing in the plurality of partitions through the primary keys according to the query instruction to obtain target order data;

specifically, at the back end of the embodiment, the DWS wide table is queried with Phoenix according to the user's query instruction for the target order data. In the embodiment, Phoenix serves as the SQL layer of HBase: data of a table created through Phoenix can be queried in HBase, and when the data of the Phoenix table is updated, the underlying HBase table is updated accordingly. When aggregation queries are performed in Phoenix, results are returned faster than with the query methods of the prior art.
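To make the kind of aggregation query concrete, the sketch below runs a grouped SQL query against a small wide table. It uses Python's built-in sqlite3 module purely as a local stand-in so that the example is runnable; it is not Phoenix or HBase, and the table name `dws_order_wide` and its columns are hypothetical.

```python
import sqlite3

# In-memory SQLite database used only as a stand-in for an SQL layer over a wide table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE dws_order_wide (order_id TEXT PRIMARY KEY, city TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO dws_order_wide VALUES (?, ?, ?)",
    [
        ("A1", "Guangzhou", 100.0),
        ("A2", "Guangzhou", 50.0),
        ("A3", "Shenzhen", 80.0),
    ],
)

# Aggregation query: order count and total amount per city.
rows = conn.execute(
    "SELECT city, COUNT(*), SUM(amount) FROM dws_order_wide "
    "GROUP BY city ORDER BY city"
).fetchall()

# Point lookup by primary key.
target = conn.execute(
    "SELECT order_id, city, amount FROM dws_order_wide WHERE order_id = ?", ("A3",)
).fetchone()
```

The same two query shapes, a grouped aggregation and a lookup by primary key, are the ones the back end would issue through its SQL layer.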

In the HBase write process of the related art, a Region is managed by one RegionServer. When the Region exceeds the default size, it is split into two smaller Regions and all new records are written to the new Region, but the new Region still resides on the same RegionServer, so the resource allocation of the cluster becomes unbalanced and cluster performance suffers. Therefore, to avoid a hotspot on a single RegionServer, the embodiment applies salting; that is, in step S200, generating the log file of the first order data and clustering the log file according to the primary key of the log file may include step S210:

S210, generating an original key of the first order data according to the log file, and combining the original key with a first character to obtain the primary key;

the first character is a single byte calculated from the hash value of the first order data and the number of partitions, and the original key is the original primary key of the order data; the primary key is obtained by salting. Specifically, the calculation formula of the salting process in the example is as follows:

new_row_key = ((byte)(hash(key) % BUCKETS_NUMBER)) + original_key

in the calculation formula, BUCKETS_NUMBER is the number of buckets created by salting and original_key is the original primary key of the data. With the new primary key, data is written to different Regions, the data in each Region is a subset of the originally monotonically increasing data, and the data written to each node is load balanced.
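The salting formula can be sketched in Python as follows. The hash function is an assumption: the text does not specify one, so CRC32 from the standard library is used here only because it is stable across runs; the bucket count of 4 and the function name are likewise illustrative.

```python
import zlib

BUCKETS_NUMBER = 4  # hypothetical number of salt buckets


def salt_row_key(original_key: bytes, buckets: int = BUCKETS_NUMBER) -> bytes:
    """Prefix the original key with a one-byte salt: hash(key) % buckets."""
    if not 1 <= buckets <= 256:
        raise ValueError("the salt must fit into a single byte")
    salt = zlib.crc32(original_key) % buckets  # CRC32 is an assumed stand-in hash
    return bytes([salt]) + original_key


# Sequential keys, as produced by monotonically increasing order numbers.
keys = [f"order-{i:06d}".encode() for i in range(1000)]
salted = [salt_row_key(k) for k in keys]
buckets_used = {s[0] for s in salted}  # salt values that actually occur
```

Because the salt depends only on the original key, a reader can recompute it to locate a row, and stripping the first byte recovers the original key.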

In some optional embodiments, to fully utilize the data outside the valid period, the method of this embodiment further includes steps S500-S600:

S500: acquiring second order data with the storage duration being greater than a time threshold, and performing offline cleaning on the second order data;

as shown in fig. 2, taking a time threshold of one year as an example, the embodiment imports order-related data older than one year into HBase by offline cleaning, and then synchronizes the order data that is not older than one year.

Specifically, in the offline cleaning mode, the embodiment synchronizes the MySQL data to a Hive data warehouse using Sqoop and DataX, then creates a Phoenix external table in Hive, builds a wide table from the original data using business processing logic consistent with the real-time processing, and writes the wide table into the Phoenix external table; after the writing is completed, the Canal, Kafka and Flink services are started and data is written into HBase in real time.

S600: generating a data service layer wide table from the offline-cleaned second order data and the first order data, according to the mapping relation between the second field attribute of the second order data and the first field attribute in the first order data;

specifically, the order-related data obtained in step S500 and the order data obtained in step S100 are aggregated in the DWD layer in HBase, and a data mapping relation is generated in the DWD layer according to the content of each field in the aggregated order data. First, the two parts of order data are synchronized according to the mapping relation before aggregation, so that the two order data tables have the same field attributes and field record content. In addition, the mapping relation can be called directly when the information is needed in subsequent business processing. For example, the embodiment may convert GPS latitude and longitude into a detailed province and city address: fast GPS lookup typically uses a geohash mapping together with a knowledge base of geographic locations, in which the GPS coordinates to be compared are converted into a geohash and compared with the geohashes in the knowledge base to find the geographic position information.
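The geohash lookup mentioned in the example can be sketched as follows. The encoder implements the standard public geohash scheme (interleaved longitude and latitude bits, base-32 alphabet); the knowledge base contents, the precision of 4 characters and the function names are assumptions for illustration only.

```python
_BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"


def geohash_encode(lat, lon, precision=6):
    """Encode latitude/longitude as a geohash string of the given length."""
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    chars, bits, value, use_lon = [], 0, 0, True
    while len(chars) < precision:
        if use_lon:  # even bits refine the longitude interval
            mid = (lon_lo + lon_hi) / 2
            if lon >= mid:
                value, lon_lo = value * 2 + 1, mid
            else:
                value, lon_hi = value * 2, mid
        else:  # odd bits refine the latitude interval
            mid = (lat_lo + lat_hi) / 2
            if lat >= mid:
                value, lat_lo = value * 2 + 1, mid
            else:
                value, lat_hi = value * 2, mid
        use_lon = not use_lon
        bits += 1
        if bits == 5:  # every 5 bits become one base-32 character
            chars.append(_BASE32[value])
            bits, value = 0, 0
    return "".join(chars)


# Hypothetical knowledge base mapping geohash cells to addresses.
KNOWLEDGE_BASE = {
    geohash_encode(23.1291, 113.2644, 4): "Guangdong Province, Guangzhou",
}


def lookup_region(lat, lon):
    """Convert the coordinates to a geohash and look it up in the knowledge base."""
    return KNOWLEDGE_BASE.get(geohash_encode(lat, lon, 4))
```

A shorter geohash covers a larger cell, so the precision chosen for the knowledge base determines how coarse the resulting address is.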

In some optional embodiments, before the step S200 of generating a log file of the first order data, and clustering the log file according to primary keys of the log file to obtain a data service layer wide table, the method of the embodiment further includes the step S110:

S110, acquiring first order data, and desensitizing the first order data; the desensitization process includes, but is not limited to, replacing user information in the first order data with a symbol character string, or screening fields in the first order data according to a mapping relation;

illustratively, the embodiment desensitizes the customer information of the order table: the first three and last four digits of the user's number are retained and the remaining digits are replaced with the * character. Fields not involved in the reports are filtered out. This improves the efficiency of the processing flow while providing the necessary privacy protection for the users involved in the scheme.
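A minimal sketch of the two desensitization operations, masking and field screening, is given below. It assumes that the masked value is a phone number stored in a field named `phone` and that the report needs only the fields listed in `REPORT_FIELDS`; both are hypothetical.

```python
def mask_number(number: str) -> str:
    """Keep the first three and last four characters and mask the rest with '*'."""
    if len(number) <= 7:
        return "*" * len(number)  # too short to keep both ends; mask everything
    return number[:3] + "*" * (len(number) - 7) + number[-4:]


# Hypothetical set of fields that the reports actually use.
REPORT_FIELDS = {"order_id", "phone", "amount"}


def desensitize(order: dict) -> dict:
    """Drop fields not used by the reports and mask the user's phone number."""
    kept = {k: v for k, v in order.items() if k in REPORT_FIELDS}
    if "phone" in kept:
        kept["phone"] = mask_number(kept["phone"])
    return kept


raw_order = {
    "order_id": "A1",
    "phone": "13812345678",
    "amount": 100,
    "id_card": "not needed by any report",
}
safe_order = desensitize(raw_order)
```

The original record is left unchanged; only the copy passed downstream is masked and reduced.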

In some alternative embodiments, after the step S600 of constructing the wide table of detail data by using the second order data after the offline cleaning and the first order data, the embodiment method further includes S610 and S620:

S610, carrying out standardization processing on the first order data and the second order data, and unifying the data format of the order data;

S620, performing data cleaning on the first order data and the second order data to reduce null values and dirty data of the order data;

specifically, the data within the validity period and the cleaned and synchronized data outside the validity period are aggregated in the DWD layer, where the data is cleaned and normalized; the data cleaning includes removing null values, dirty data, and data or records exceeding a limit range.
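The cleaning and normalization steps S610 and S620 can be sketched as follows. The accepted timestamp formats, the amount range and the field names are all assumptions chosen for illustration.

```python
from datetime import datetime

# Hypothetical input formats to be unified into a single output format.
DATE_FORMATS = ("%Y-%m-%d %H:%M:%S", "%Y/%m/%d %H:%M:%S", "%Y-%m-%d")
AMOUNT_RANGE = (0.0, 1_000_000.0)  # hypothetical limit range for order amounts


def normalize_timestamp(text):
    """Return the timestamp in 'YYYY-MM-DD HH:MM:SS' form, or None if unparseable."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).strftime("%Y-%m-%d %H:%M:%S")
        except ValueError:
            continue
    return None


def clean(orders):
    """Drop records with null keys, out-of-range amounts or dirty timestamps."""
    cleaned = []
    for order in orders:
        if order.get("order_id") in (None, "") or order.get("amount") is None:
            continue  # null values
        if not AMOUNT_RANGE[0] <= order["amount"] <= AMOUNT_RANGE[1]:
            continue  # data exceeding the limit range
        ts = normalize_timestamp(order.get("created_at") or "")
        if ts is None:
            continue  # dirty data
        cleaned.append({**order, "created_at": ts})
    return cleaned


merged = [
    {"order_id": "A1", "amount": 100.0, "created_at": "2021/08/01 10:00:00"},
    {"order_id": None, "amount": 20.0, "created_at": "2021-08-02 09:00:00"},
    {"order_id": "A3", "amount": -5.0, "created_at": "2021-08-03"},
    {"order_id": "A4", "amount": 30.0, "created_at": "not a date"},
]
cleaned_orders = clean(merged)
```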

In some alternative embodiments, the step S400 of acquiring a query instruction, and indexing in the plurality of partitions through the primary keys according to the query instruction to obtain the target order data, may include steps S410-S420:

S410, obtaining a plurality of historical query results, and obtaining high-frequency fields in the historical query results;

S420, constructing an index table according to the high-frequency fields, and querying the target order data according to the index table;

specifically, for some foreseeable queries, the common scenario is filtering and grouping on a few fields. Unlike an ordinary index in a conventional relational database such as MySQL, a covering index does not need to look up the data under the primary key through the index, because the preset fields are stored in the index itself. In a distributed database, the index and the data may be stored on different nodes, and cross-node queries reduce query efficiency; using a covering index therefore improves query performance.
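Steps S410 and S420 can be sketched with in-memory structures. The query history is modelled here as lists of the fields each past query filtered or grouped on, and the covering index as a dictionary that stores the covered fields next to the indexed value; these representations and all names are assumptions, not the index format of any particular database.

```python
from collections import Counter


def high_frequency_fields(history, top_n=2):
    """Return the fields that occur most often in the historical queries."""
    counts = Counter(field for query in history for field in query)
    return [field for field, _ in counts.most_common(top_n)]


def build_covering_index(rows, indexed_field, covered_fields):
    """Map each value of indexed_field to the covered fields of matching rows."""
    index = {}
    for row in rows:
        entry = {f: row[f] for f in covered_fields}
        index.setdefault(row[indexed_field], []).append(entry)
    return index


# Fields used by past queries (hypothetical history).
history = [["status", "city"], ["status"], ["status", "amount"], ["city"]]
hot_fields = high_frequency_fields(history)

rows = [
    {"order_id": "A1", "status": "PAID", "city": "Guangzhou", "amount": 100},
    {"order_id": "A2", "status": "PAID", "city": "Shenzhen", "amount": 50},
    {"order_id": "A3", "status": "NEW", "city": "Guangzhou", "amount": 80},
]
index = build_covering_index(rows, "status", ["order_id", "city"])

# A query on status that needs only order_id and city is answered from the
# index alone, without going back to the main rows.
paid = index["PAID"]
```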

In some optional embodiments, before the step S100 of acquiring the first order data with the storage duration not greater than the time threshold according to the preset time threshold, the embodiment method may further include steps S001 to S002:

S001, matching a data table through a regular expression, wherein the data table comprises a plurality of first order data;

specifically, the embodiment changes the binlog-format of MySQL to ROW; it then modifies the Canal instance file, editing the regex filter configuration to add the tables that need to be synchronized, so that order-related tables are matched through a regular expression; the number of partitions and the hash partition primary key of each table are controlled through the Canal configuration.
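How a regular expression selects the order-related tables can be shown with Python's re module. The schema name `shop`, the `order_` prefix and the table names are hypothetical; the actual pattern depends on how the source database names its tables.

```python
import re

# Hypothetical naming convention: order tables live in schema "shop"
# and their names start with "order_".
ORDER_TABLE_PATTERN = re.compile(r"shop\.order_.*")

tables = [
    "shop.order_main",
    "shop.order_item",
    "shop.user_profile",
    "crm.order_main",
]

# fullmatch requires the whole "schema.table" name to match the pattern.
matched = [t for t in tables if ORDER_TABLE_PATTERN.fullmatch(t)]
```

Only the matched tables would be synchronized; the others are ignored by the filter.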

S002, setting data warehouse parameters of the acquired data table;

specifically, the embodiment also needs to modify the Canal properties, filling in the Kafka cluster parameter configuration, port number, message transmission parameters, etc.

In a second aspect, the technical solution of the present invention further provides an order data indexing system, including:

the data acquisition module is used for acquiring first order data with the storage duration not greater than a time threshold according to a preset time threshold;

the data classification module is used for generating a log file of the first order data and clustering the log file according to the primary keys of the log file to obtain a data service layer wide table; dividing the data service layer wide table to obtain a plurality of partitions;

and the data query module is used for acquiring a query instruction, and indexing in the plurality of partitions through the primary keys according to the query instruction to obtain target order data.

In a third aspect, a technical solution of the present invention further provides an order data indexing device, including: at least one processor; at least one memory for storing at least one program; when the at least one program is executed by the at least one processor, the at least one processor is caused to perform the method of the first aspect.

From the above specific implementation process, it can be concluded that the technical solution provided by the present invention has the following advantages or beneficial effects compared with the prior art:

according to the technical scheme, columnar storage and distributed computing services speed up order data queries that would otherwise be slow when the number of orders is very large; compared with a traditional database, the scheme can achieve real-time queries with second-level latency, with higher query and response speeds.

In alternative embodiments, the functions/acts noted in the block diagrams may occur out of the order noted in the operational illustrations. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality/acts involved. Furthermore, the embodiments presented and described in the flow charts of the present invention are provided by way of example in order to provide a more thorough understanding of the technology. The disclosed methods are not limited to the operations and logic flows presented herein. Alternative embodiments are contemplated in which the order of various operations is changed and in which sub-operations described as part of larger operations are performed independently.

Furthermore, although the present invention is described in the context of functional modules, it should be understood that, unless otherwise stated to the contrary, one or more of the functions and/or features may be integrated in a single physical device and/or software module, or one or more of the functions and/or features may be implemented in a separate physical device or software module. It will also be appreciated that a detailed discussion of the actual implementation of each module is not necessary for an understanding of the present invention. Rather, the actual implementation of the various functional modules in the apparatus disclosed herein will be understood within the ordinary skill of an engineer, given the nature, function, and internal relationship of the modules. Accordingly, those skilled in the art can, using ordinary skill, practice the invention as set forth in the claims without undue experimentation. It is also to be understood that the specific concepts disclosed are merely illustrative of and not intended to limit the scope of the invention, which is defined by the appended claims and their full scope of equivalents.

The logic and/or steps represented in the flowcharts or otherwise described herein, e.g., an ordered listing of executable instructions that can be considered to implement logical functions, can be embodied in any computer-readable medium for use by or in connection with an instruction execution system, apparatus, or device, such as a computer-based system, processor-containing system, or other system that can fetch the instructions from the instruction execution system, apparatus, or device and execute the instructions.

In the description herein, references to the description of the term "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.

While embodiments of the invention have been shown and described, it will be understood by those of ordinary skill in the art that: various changes, modifications, substitutions and alterations can be made to the embodiments without departing from the principles and spirit of the invention, the scope of which is defined by the claims and their equivalents.

While the preferred embodiments of the present invention have been illustrated and described, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims

1. An order data indexing method is characterized by comprising the following steps:

2. The order data indexing method according to claim 1, further comprising:

and generating a data service layer wide table according to the mapping relation between the second field attribute of the second order data and the first field attribute in the first order data.

3. The order data indexing method according to claim 1, wherein the step of generating a log file of the first order data, and clustering the log file according to primary keys of the log file comprises the steps of:

the first character is calculated from a hash value of the first order data and the number of partitions.

4. The order data indexing method according to claim 2, wherein before the step of generating a log file of the first order data, and clustering the log file according to primary keys of the log file to obtain a data service layer wide table, the method further comprises:

the desensitization treatment comprises the following steps:

and screening fields in the first order data according to the mapping relation.

5. The order data indexing method according to claim 2, wherein after the step of generating the data service layer wide table according to the mapping relationship between the second field attribute of the second order data and the first field attribute of the first order data, the method further comprises at least one of the following steps:

carrying out standardization processing on the first order data and the second order data, and unifying the data format of the order data;

and performing data cleaning on the first order data and the second order data to reduce null values and dirty data of the order data.

6. The order data indexing method according to any one of claims 1 to 5, wherein before the step of obtaining the first order data with a storage duration not greater than the time threshold according to the preset time threshold, the method further comprises:

and setting and acquiring data warehouse parameters of the data table.

7. The order data indexing method according to any one of claims 1 to 5, wherein the obtaining of the query instruction, and indexing in the plurality of partitions by the primary key according to the query instruction to obtain the target order data, comprises the steps of:

acquiring a plurality of historical query results and high-frequency fields in the historical query results;

8. An order data indexing system, comprising:

9. A computer device for order data indexing, comprising:

at least one processor;

at least one memory for storing at least one program; when the at least one program is executed by the at least one processor, the at least one processor is caused to perform the method of any one of claims 1-7.

10. A storage medium in which a processor-executable program is stored, wherein the processor-executable program, when executed by a processor, is configured to execute an order data indexing method according to any one of claims 1 to 7.