CN114912815A

CN114912815A - Index automatic definition method, system and storage medium based on big data wide table

Info

Publication number: CN114912815A
Application number: CN202210568574.3A
Authority: CN
Inventors: 龚连平
Original assignee: Hunan Railway Lianchuang Technology Development Co ltd
Current assignee: Hunan Railway Lianchuang Technology Development Co ltd
Priority date: 2022-05-24
Filing date: 2022-05-24
Publication date: 2022-08-16

Abstract

The invention discloses an index automatic definition method, system and storage medium based on a big data wide table. The method comprises the following steps: receiving an input basic data set, wherein the basic data set comprises basic indexes and relevant factors corresponding to the basic indexes, the basic indexes comprise sending amount, turnover amount and income, and the relevant factors comprise data type, data name, data source, data dimension, data caliber, data version and data value; identifying the basic indexes in the input basic data set and their corresponding relevant factors; and obtaining user-defined derivative indexes according to each basic index and each relevant factor. The invention aims to simplify index definition, support flexible use by users, and lay a foundation for subsequent flexible analysis.

Description

Index automatic definition method, system and storage medium based on big data wide table

Technical Field

The invention relates to the field of big data processing, in particular to an index automatic definition method, system and storage medium based on a big data wide table.

Background

Business Intelligence (BI) analyzes data by means of modern data warehouse technology, online analytical processing, data mining and data presentation technology in order to realize business value. At present, traditional BI technology mainly focuses on extracting, transforming and loading basic indexes, whereas only index-oriented definition, calculation and storage can serve business analysis. Defining indexes, however, demands considerable professional knowledge of big data computing and is not easily accepted by enterprise managers, so a data analysis and processing approach that ordinary users can apply is urgently needed.

The earlier invention published as CN112633761A provides a method, a device, equipment and a storage medium for querying index data. It introduces a real-time analysis database and a cache database, invokes an index calculation engine to query index data in the real-time analysis database and the cache database according to an index query request, and standardizes the index data to generate target aggregated index data, thereby solving the problem that real-time index data cannot be queried.

Disclosure of Invention

The invention mainly aims to provide an index automatic definition method, system and storage medium based on a big data wide table, so as to solve the technical problem that existing index definition requires highly specialized knowledge.

In order to achieve the above purpose, the present invention provides an index automatic definition method based on a big data wide table, the method includes the following steps:

receiving an input basic data set, wherein the basic data set comprises basic indexes and relevant factors corresponding to the basic indexes, the basic indexes comprise sending amount, turnover amount and income, and the relevant factors comprise data type, data name, data source, data dimension, data caliber, data version and data value;

identifying basic indexes in the input basic data set and corresponding relevant factors thereof;

and obtaining the user-defined derivative indexes according to the basic indexes and the relevant factors.

Optionally, the step of obtaining a customized derivative index according to each basic index and each relevant factor includes:

automatically defining a plurality of derivative indexes according to the forward extension of each relevant factor, wherein the forward extension comprises extension of the data dimension, extension of the data source and extension of the index code.

Optionally, the step of automatically defining to obtain a plurality of derived indexes according to the forward expansion of each relevant factor includes:

freely combining the data dimensions in the relevant factors into various dimension combinations to obtain the derivative indexes of the various dimension combinations.

Optionally, the step of freely combining into various dimensional combinations according to the data dimensions in the relevant factors to obtain the derivative indexes of the various dimensional combinations includes:

identifying a data source corresponding to each basic index;

freely combining the data sources and the data dimensions to obtain an initial derivative index;

judging whether a target object in the initial derivative indexes is repeated across the dimension combinations;

if yes, determining the dimension priority of the target object;

if not, directly generating the derived index of the target object.

Optionally, the step of determining the dimension priority of the target object includes:

determining the finest granularity of each initial derivative index containing the target object, and keeping the dimension combination with the finest granularity as the derivative index of the target object.

Optionally, the step of obtaining a custom derivative index according to the basic index and each relevant factor includes:

performing index calculation according to the basic indexes and the relevant factors included in each defined derivative index to obtain the index value of the corresponding user-defined derivative index.

Optionally, after the step of obtaining the customized derivative index according to the basic index and each relevant factor, the method includes:

automatically distributing corresponding timestamps according to basic indexes in the user-defined derived indexes;

and storing the corresponding user-defined derivative indexes in a column type storage mode according to the time stamps.

In addition, in order to achieve the above object, the present invention further provides an index automatic definition system based on a big data wide table, including a memory, a processor, and an index automatic definition program based on a big data wide table that is stored in the memory and executable on the processor, wherein the program, when executed by the processor, implements the steps of the index automatic definition method based on a big data wide table according to any one of the above.

In addition, to achieve the above object, the present invention further provides a storage medium storing an index automatic definition program based on a big data wide table, wherein the program, when executed by a processor, implements the steps of the index automatic definition method based on a big data wide table according to any one of the above.

The invention provides an index automatic definition method based on a big data wide table, which comprises the steps of receiving an input basic data set, wherein the basic data set comprises basic indexes and relevant factors corresponding to the basic indexes, the basic indexes comprise sending amount, turnover amount and income, and the relevant factors comprise data type, data name, data source, data dimension, data caliber, data version and data value; identifying the basic indexes in the input basic data set and their corresponding relevant factors; and obtaining user-defined derivative indexes according to the basic indexes and the relevant factors. In this way, indexes no longer need to be defined manually: when a basic index is entered, the system automatically generates indexes for the various dimension combinations from the dimensions of the basic index, which avoids tedious manual operations and the risk of omissions in manual definition. At the same time, single-dimension indexes and the finest-granularity indexes of multi-dimension combinations are all defined and calculated automatically in advance, which supports flexible use by users and lays a foundation for subsequent flexible analysis.

Drawings

FIG. 1 is a schematic structural diagram of an index automatic definition system based on a big data wide table according to an embodiment of the present invention;

FIG. 2 is a flowchart of an embodiment of the index automatic definition method based on a big data wide table according to the present invention.

The implementation, functional features and advantages of the objects of the present invention will be further explained with reference to the accompanying drawings.

Detailed Description

It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.

Fig. 1 is a schematic structural diagram of an index automatic definition system based on a big data wide table according to an embodiment of the present invention.

As shown in fig. 1, the system may include a processor 1001 (such as a CPU), a communication bus 1002, a user interface 1003, a network interface 1004 and a memory 1005. The communication bus 1002 enables connection and communication between these components. The user interface 1003 may include an infrared receiving module for receiving control commands triggered by a user through a remote controller; optionally, the user interface 1003 may further include a standard wired interface or a wireless interface. The network interface 1004 may optionally include a standard wired interface or a wireless interface (e.g., a WI-FI interface). The memory 1005 may be a high-speed RAM or a non-volatile memory (e.g., a magnetic disk memory). The memory 1005 may alternatively be a storage device separate from the processor 1001.

Those skilled in the art will appreciate that the configuration shown in fig. 1 does not limit the index automatic definition system based on a big data wide table, which may include more or fewer components than those shown, combine some components, or arrange the components differently.

The specific embodiment of the automatic index definition system based on the big data wide table according to the present invention is substantially the same as the following embodiments of the automatic index definition method based on the big data wide table, and will not be described herein again.

Referring to fig. 2, which is a schematic flowchart of a first embodiment of the index automatic definition method based on a big data wide table according to the present invention, the method includes:

step S10, receiving an input basic data set;

step S20, identifying the basic indexes in the input basic data set and their corresponding relevant factors;

step S30, obtaining user-defined derivative indexes according to each basic index and each relevant factor.

Specifically, the basic data sets in this embodiment are big data wide tables. The recorded data is a set of concepts that express atomic, indivisible quantitative attributes of business entities. These concepts include basic indexes, such as sending amount, turnover amount and income, and a class of relevant factors corresponding to each basic index, including data type, data name, data source, data dimension, data caliber, data version and data value. Taking big data in the transportation field as an example: the data type classifies data from a business perspective, such as passenger transport and freight transport; the data dimension is a clustering of the features of things or phenomena, such as the time dimension (day, month, year) and the space dimension (province, city); the data caliber is the statistical logic standard adopted by the statistics, such as the all-enterprise caliber (all enterprises are counted) and the stock-control caliber (non-stock-control enterprises are not counted); the data version is a combination of different data calibers and data dimensions over which statistics are computed, such as a release version (the calibers currently in use, e.g. the caliber applied to last year's data, which includes stock control, and the caliber applied to this year's data) and an analysis version (any combination of data calibers and data dimensions, e.g. applying the same caliber to both last year's and this year's data).

Further, in practical applications, analyzing only the basic indexes is not enough when compiling traffic statistics. For example, an index such as "passenger sending amount at a certain gate within a certain time range" may need to be analyzed. This index is generated by combining the basic index "sending amount" with the relevant factors "a certain gate", "a certain time" and "passenger transport"; it is a derivative index, that is, an index that needs to be analyzed when compiling traffic statistics. For example, wide table A contains the basic index "sending amount", and the identified relevant factors are the data dimensions "province" and "city" and the data type "passenger transport". Dimension extension yields 3 dimension combinations, "province", "city" and "province_city", so 3 derivative indexes are automatically defined: province passenger sending amount, city passenger sending amount, and province_city passenger sending amount.
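
As a rough illustration of the province/city example, the following Python sketch models a basic index together with a few of its relevant factors and names the three derivative indexes. The `BaseIndex` class, the `derived_name` helper and the "dimensions, data type, basic index" naming pattern are assumptions made for this sketch; the text does not prescribe a data structure or a naming scheme.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BaseIndex:
    name: str          # basic index, e.g. "sending amount"
    data_type: str     # business classification, e.g. "passenger"
    source: str        # wide table the index is read from
    dimensions: tuple  # data dimensions recorded for the index


def derived_name(index: BaseIndex, combo: tuple) -> str:
    """Name a derivative index as '<dimension combo> <data type> <basic index>'."""
    unknown = [d for d in combo if d not in index.dimensions]
    if unknown:
        raise ValueError(f"dimensions not recorded for {index.name}: {unknown}")
    return f"{'_'.join(combo)} {index.data_type} {index.name}"


sending = BaseIndex("sending amount", "passenger", "wide table A", ("province", "city"))
combos = [("province",), ("city",), ("province", "city")]
names = [derived_name(sending, combo) for combo in combos]
```

Here `names` holds the three derivative index names of the example; a dimension that was never recorded for the basic index is rejected instead of silently producing a name.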

In this embodiment, an entered basic data set is received, wherein the basic data set comprises basic indexes and relevant factors corresponding to the basic indexes, the basic indexes comprise sending amount, turnover amount and income, and the relevant factors comprise data type, data name, data source, data dimension, data caliber, data version and data value; the basic indexes in the input basic data set and their corresponding relevant factors are identified; and user-defined derivative indexes are obtained according to the basic indexes and the relevant factors. In this way, indexes no longer need to be defined manually: when a basic index is entered, the system automatically generates indexes for the various dimension combinations from the dimensions of the basic index, which avoids tedious manual operations and the risk of omissions in manual definition.

Further, step S30 includes:

In the first case, the expansion process for the data dimension is as follows:

Specifically, the data dimensions are first freely combined into dimension combinations, and indexes for the various dimension combinations are then generated automatically. For example, if the data dimensions of a basic index are dimension 1, dimension 2 and dimension 3, they can be combined into 3 single dimensions (dimension 1, dimension 2, dimension 3) and 1 dimension combination (dimension 1_dimension 2_dimension 3), so that 4 derivative indexes are automatically defined: the "dimension 1" index, the "dimension 2" index, the "dimension 3" index and the "dimension 1_dimension 2_dimension 3" index.
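
The combination rule in this example (every single dimension plus the one full combination) can be sketched in Python as follows. The function name `dimension_combinations` is hypothetical, and the sketch deliberately mirrors the example rather than enumerating intermediate subsets, which the example does not list.

```python
def dimension_combinations(dimensions):
    """Return each single dimension, plus the full combination when there are several."""
    combos = [(d,) for d in dimensions]
    if len(dimensions) > 1:
        combos.append(tuple(dimensions))
    return combos


combos = dimension_combinations(["dimension 1", "dimension 2", "dimension 3"])
labels = ["_".join(combo) for combo in combos]
```

If intermediate subsets were also wanted, `itertools.combinations` could generate them, at the cost of a number of indexes that grows exponentially with the number of dimensions.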

In the second case, the extension flow for the data source is as follows:

identifying a data source corresponding to each basic index;

and obtaining an initial derivative index according to the free combination of each data source and the data dimension.

Specifically, the basic indexes come from index wide table A, index wide table B and index wide table C: dimension 1, dimension 2 and dimension 3 come from index wide table A, dimension 1 and dimension 4 come from index wide table B, and dimension 5 comes from index wide table C. Freely combining the data dimensions within each data source automatically defines 7 derivative indexes: the dimension 1 index, the dimension 2 index, the dimension 3 index, the dimension 1_dimension 2_dimension 3 index, the dimension 4 index, the dimension 1_dimension 4 index and the dimension 5 index.
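
A minimal Python sketch of this data source extension is given below, assuming each wide table simply lists its dimensions. The helper names and the dictionary layout are illustrative assumptions; the sketch also records which tables can supply each derived label, which is where the repeated "dimension 1" becomes visible.

```python
def dimension_combinations(dimensions):
    """Each single dimension, plus the full combination when there are several."""
    combos = [(d,) for d in dimensions]
    if len(dimensions) > 1:
        combos.append(tuple(dimensions))
    return combos


def expand_sources(sources):
    """Map each derived dimension label to the wide tables that can supply it."""
    derived = {}
    for table, dimensions in sources.items():
        for combo in dimension_combinations(dimensions):
            derived.setdefault("_".join(combo), []).append(table)
    return derived


SOURCES = {
    "wide table A": ["dimension 1", "dimension 2", "dimension 3"],
    "wide table B": ["dimension 1", "dimension 4"],
    "wide table C": ["dimension 5"],
}
derived = expand_sources(SOURCES)
```

With these inputs `derived` has seven keys, and the entry for "dimension 1" lists both wide table A and wide table B.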

Further, when the dimension combinations are extended through the data sources, it needs to be judged whether a target object in the initial derivative indexes is repeated across the dimension combinations.

If the target object is repeated across the dimension combinations, the dimension priority of the target object is determined. The step of determining the dimension priority of the target object includes: determining the finest granularity of each initial derivative index containing the target object, and keeping the dimension combination with the finest granularity as the derivative index of the target object. In the above example, wide table A and wide table B both contain the single dimension "dimension 1". The system automatically detects the dimension combinations that contain "dimension 1" (dimension 1_dimension 2_dimension 3 and dimension 1_dimension 4), compares these two multi-dimension combinations, determines the priority by the finest granularity and the generation time, and keeps the single dimension from the source with the higher priority. The finest granularity of dimension 1_dimension 2_dimension 3 is 3 and that of dimension 1_dimension 4 is 2, so the single dimension "dimension 1" of data source wide table A is kept.

If the target object in the initial derivative indexes is not repeated across the dimension combinations, the derivative index of the target object is generated directly.
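
The priority rule can be sketched in Python as below. Two points are assumptions of this sketch rather than statements of the text: granularity is taken to be the number of dimensions in a table's full combination (3 and 2 in the example), and "generation time" is approximated by the order in which tables were registered, with the earlier table winning a tie.

```python
def resolve_duplicate(sources, target):
    """Return the wide table that keeps `target` when several tables supply it."""
    candidates = [
        (len(dimensions), -position, table)
        for position, (table, dimensions) in enumerate(sources.items())
        if target in dimensions
    ]
    if not candidates:
        raise KeyError(target)
    # Highest granularity wins; on a tie the earlier-registered table wins.
    return max(candidates)[2]


SOURCES = {
    "wide table A": ["dimension 1", "dimension 2", "dimension 3"],
    "wide table B": ["dimension 1", "dimension 4"],
    "wide table C": ["dimension 5"],
}
kept = resolve_duplicate(SOURCES, "dimension 1")
```

For the example data `kept` is wide table A, matching the outcome described above; a dimension supplied by only one table is returned unchanged.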

In the third case, the extension flow for the index code is as follows:

The index code is also configured automatically according to the relevant factors of the basic index. For example, if the basic index "sending amount" is coded as FSL, the index "daily sending amount" is coded as FSL_D and the index "monthly sending amount" is coded as FSL_M.
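
A small Python sketch of this code extension follows. Only the two suffixes that appear in the example (D for day, M for month) are included; the `TIME_SUFFIXES` table and the `index_code` function are hypothetical names for illustration.

```python
# Suffixes taken from the example; other time dimensions would need their own entries.
TIME_SUFFIXES = {"day": "D", "month": "M"}


def index_code(base_code, time_dimension=None):
    """Extend a basic index code with the suffix of a time dimension."""
    if time_dimension is None:
        return base_code
    try:
        suffix = TIME_SUFFIXES[time_dimension]
    except KeyError:
        raise ValueError(f"no code suffix configured for {time_dimension!r}") from None
    return f"{base_code}_{suffix}"


daily = index_code("FSL", "day")
monthly = index_code("FSL", "month")
```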

Further, step S30 also includes performing index calculation on the generated derivative indexes.

Specifically, index calculation is divided into basic calculation, composite calculation and user-defined calculation.

Basic calculation means that the current index can be computed by aggregating the basic information according to the relevant factors, without reference to other indexes; all information in the calculation formula can be obtained from the current basic information, so the index is calculated directly and automatically. For example, the index "province passenger sending amount" is calculated by aggregating the basic information "sending amount" over the data dimension "province" with the data type "passenger transport"; the index name "province passenger sending amount", the data dimension "province", the data type "passenger transport" and the index value "FSL" are recorded.
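
As an illustration of basic calculation, the Python sketch below aggregates a value field over one dimension after filtering on the data type. The row layout, the field names `transport_class` and `FSL`, and the sample figures are all hypothetical.

```python
from collections import defaultdict


def basic_calculation(rows, dimension, data_type, value_field="FSL"):
    """Sum `value_field` per value of `dimension`, keeping only rows of `data_type`."""
    totals = defaultdict(int)
    for row in rows:
        if row["transport_class"] == data_type:
            totals[row[dimension]] += row[value_field]
    return dict(totals)


# Made-up wide table rows, for illustration only.
rows = [
    {"province": "Hunan", "transport_class": "passenger", "FSL": 120},
    {"province": "Hunan", "transport_class": "freight", "FSL": 40},
    {"province": "Hunan", "transport_class": "passenger", "FSL": 80},
    {"province": "Hubei", "transport_class": "passenger", "FSL": 50},
]
province_passenger = basic_calculation(rows, "province", "passenger")
```

The freight row is excluded by the data type filter, so only passenger rows contribute to each province total.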

Composite calculation means that the current index is computed from one or more indexes using the four arithmetic operations. For example, the index "daily average number of vehicles loaded in province" is calculated by dividing the index "number of vehicles loaded in province" by the number of days, and the index "average distance of passengers in province" is calculated by dividing the index of the number of passengers in the province by the index of the number of people transported by passenger transport in the province. The index "number of vehicles loaded per day" records the index name "number of vehicles loaded per day", the data dimension "province" and the index value "ZCS"; the index "average distance of passengers in province" records the index name "average distance of passengers in province", the data dimension "province", the data type "passenger transport" and the index value "PJYC".
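
A composite calculation of this kind can be sketched in Python as a single arithmetic operation over previously computed index values. The code `ZCS_TOTAL`, the figure 3100 and the 31-day period are invented for the illustration, and the sketch assumes that a daily average divides a total by a number of days.

```python
import operator

OPERATIONS = {
    "+": operator.add,
    "-": operator.sub,
    "*": operator.mul,
    "/": operator.truediv,
}


def composite_value(index_values, left, op, right):
    """Apply one arithmetic operation; operands are index codes or plain numbers."""
    def resolve(operand):
        return index_values[operand] if isinstance(operand, str) else operand

    return OPERATIONS[op](resolve(left), resolve(right))


# Hypothetical monthly total of loaded vehicles, divided by the days in the month.
index_values = {"ZCS_TOTAL": 3100}
daily_average = composite_value(index_values, "ZCS_TOTAL", "/", 31)
```

Longer formulas could be built by feeding the result of one call back into `index_values` under a new code.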

Custom calculation means that the current index can be computed by connecting to the relational database and running user-defined SQL. For example, the custom index "city monthly sending amount" can be calculated with the custom SQL "select ny, city, sum(rs) from wide table A group by ny, city", and the index name "city monthly sending amount", the data dimensions "city" and "month", and the index value "FSL" are recorded according to the custom SQL.
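
To make the custom SQL concrete, the Python sketch below runs an equivalent query against an in-memory SQLite database. The table name `wide_table_a`, the column types and the sample rows are assumptions; SQLite merely stands in for whatever relational database is actually connected, and an `ORDER BY` is added only so that the result order is deterministic.

```python
import sqlite3

CUSTOM_SQL = (
    "SELECT ny, city, SUM(rs) FROM wide_table_a "
    "GROUP BY ny, city ORDER BY ny, city"
)


def run_custom_index(rows, sql=CUSTOM_SQL):
    """Load rows into a throwaway table and evaluate the custom SQL against it."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE wide_table_a (ny TEXT, city TEXT, rs INTEGER)")
        conn.executemany("INSERT INTO wide_table_a VALUES (?, ?, ?)", rows)
        return conn.execute(sql).fetchall()
    finally:
        conn.close()


# Made-up (year-month, city, count) rows.
sample = [
    ("202205", "Changsha", 100),
    ("202205", "Changsha", 50),
    ("202205", "Zhuzhou", 30),
    ("202206", "Changsha", 70),
]
result = run_custom_index(sample)
```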

Further, after the step S30, the method further includes: storing the indexes;

Specifically, corresponding timestamps are automatically assigned according to the basic indexes in the user-defined derivative indexes, and the corresponding user-defined derivative indexes are stored in columnar form according to the timestamps. Data is stored in column-based logical storage units, so the data of one column is laid out contiguously on the storage medium. In actual use, a query accesses only the columns it involves, which greatly reduces the system's disk I/O. Because the values within a column have a consistent data type and similar characteristics, this supports flexible use by users and lays a foundation for subsequent flexible analysis; different compression algorithms can also be chosen according to the data characteristics, which reduces storage space to some extent. In general, the stored data defined by the indexes needs only 20% of the storage space of the original wide table.
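
The idea that a query touches only the columns it needs can be illustrated with a toy in-memory column store in Python. The `ColumnStore` class, its field names and the sample records are hypothetical; a real deployment would rely on a columnar database or file format rather than Python lists.

```python
class ColumnStore:
    """Keep each field in its own list so a query reads only the columns it needs."""

    def __init__(self, fields):
        self.columns = {field: [] for field in fields}

    def append(self, record):
        # Missing fields are stored as None so all columns stay the same length.
        for field, values in self.columns.items():
            values.append(record.get(field))

    def select(self, fields):
        """Return rows built from the requested columns only."""
        return list(zip(*(self.columns[field] for field in fields)))


store = ColumnStore(["timestamp", "index name", "province", "sending amount"])
store.append({"timestamp": 1653350400, "index name": "province passenger sending amount",
              "province": "Hunan", "sending amount": 200})
store.append({"timestamp": 1653350400, "index name": "province passenger sending amount",
              "province": "Hubei", "sending amount": 50})
rows = store.select(["province", "sending amount"])
```

Because each column holds values of one type, a per-column compression scheme could be applied to each list independently.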

Further, to make the data storage in this scheme clearer, the implementation of the invention is described in detail by taking the creation, calculation and storage of the index "passenger sending volume_national railway_month_enterprise_station" as an example. The basic data set is wide table A, in which the basic index is the sending amount and the relevant factors include year and month, ticket class, seat class, transportation class, province name, city, line, enterprise, station segment, station and train number.

First, the basic index "sending amount" is entered and its relevant factors are identified:

1) data type: "passenger transport", obtained from the transportation class;

2) data name: the code "FSL", determined from "sending amount";

3) data source: wide table A;

4) data dimension: the time dimension is "year and month", and the space dimensions are "ticket class", "seat class", "transportation class", "province name", "city", "line", "enterprise", "station segment", "station" and "train number";

5) data caliber: distinguished by enterprise attribute; enterprises that belong to the state are recorded under the national railway caliber, and all enterprises are recorded under the full caliber;

6) data version: recorded as "release version" by default;

7) data value: the value of the sending amount.
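
The identification step for this example can be sketched in Python as a function that splits the columns of the wide table into time and space dimensions and fills in the remaining factors. The function name, the `TIME_COLUMNS` set and the dictionary keys are assumptions made for the sketch; data type and data caliber are left out because they depend on row values rather than on column names.

```python
# Columns treated as time dimensions; taken from the example's "year and month".
TIME_COLUMNS = {"year and month"}


def identify_factors(table, columns, base_index, code):
    """Split wide table columns into relevant factors of one basic index."""
    dimensions = [column for column in columns if column != base_index]
    return {
        "data name": code,
        "data source": table,
        "time dimensions": [c for c in dimensions if c in TIME_COLUMNS],
        "space dimensions": [c for c in dimensions if c not in TIME_COLUMNS],
        "data version": "release version",  # default stated in the example
        "data value": base_index,
    }


columns = ["year and month", "ticket class", "seat class", "transportation class",
           "province name", "city", "line", "enterprise", "station segment",
           "station", "train number", "sending amount"]
factors = identify_factors("wide table A", columns, "sending amount", "FSL")
```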

Secondly, automatically defining indexes:

2.1 Extension of the data dimension: from the space dimensions, the dimension combination "ticket class_seat class_transportation class_province name_city_line_enterprise_station segment_station_train number" and the single dimensions "ticket class", "seat class", "transportation class", "province name", "city", "line", "enterprise", "station segment", "station" and "train number" can be generated automatically.
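
The expansion in 2.1 can be sketched in Python by crossing the dimension combinations with the data calibers and time grains of this example to form index names. The name pattern "base_caliber_time_dimensions", the label "full caliber" and the function name are assumptions of this sketch.

```python
from itertools import product

SPACE_DIMENSIONS = ["ticket class", "seat class", "transportation class",
                    "province name", "city", "line", "enterprise",
                    "station segment", "station", "train number"]


def index_names(base, calibers, time_grains, space_dimensions):
    """Build one name per (dimension combination, caliber, time grain)."""
    combos = [(d,) for d in space_dimensions]
    if len(space_dimensions) > 1:
        combos.insert(0, tuple(space_dimensions))  # full combination first
    return [
        "_".join([base, caliber, grain, *combo])
        for combo in combos
        for caliber, grain in product(calibers, time_grains)
    ]


names = index_names("passenger sending volume",
                    ["national railway", "full caliber"],
                    ["month", "year"],
                    SPACE_DIMENSIONS)
```

Each of the eleven dimension combinations is paired with every caliber and time grain, so the list grows multiplicatively with the number of calibers and time grains.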

2.2 Definition of derivative indexes: the following indexes can be defined automatically from the data dimensions:

"passenger sending volume_national railway/full caliber_month/year_ticket class_seat class_transportation class_province name_city_line_enterprise_station segment_station_train number"

"passenger sending volume_national railway/full caliber_month/year_ticket class"

"passenger sending volume_national railway/full caliber_month/year_seat class"

"passenger sending volume_national railway/full caliber_month/year_transportation class"

"passenger sending volume_national railway/full caliber_month/year_province name"

"passenger sending volume_national railway/full caliber_month/year_city"

"passenger sending volume_national railway/full caliber_month/year_line"

"passenger sending volume_national railway/full caliber_month/year_enterprise"

"passenger sending volume_national railway/full caliber_month/year_station segment"

"passenger sending volume_national railway/full caliber_month/year_station"

"passenger sending volume_national railway/full caliber_month/year_train number"

Third, the derivative indexes are computed by automatic index calculation: data dimensions are grouped in GROUP BY form, data calibers and data types are applied as conditions in WHERE form, and the SQL is generated automatically. For example, the SQL for the derivative index "passenger sending volume_national railway_month_ticket class_seat class_transportation class_province name_city_line_enterprise_station segment_station_train number" is as follows:

select month, ticket_class, seat_class, transportation_class, province_name, city, line, enterprise, station_segment, station, train_number,

sum(sending_amount)

from wide_table_A

where transportation_class = 'passenger transport'

and enterprise_attribute = 'national railway'

group by month, ticket_class, seat_class, transportation_class, province_name, city, line, enterprise, station_segment, station, train_number
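
The automatic SQL generation described in this step can be sketched in Python as follows: dimensions become the GROUP BY list, and the caliber and data type conditions become a parameterized WHERE clause. The function name and the snake_case column names are assumptions of the sketch.

```python
def build_index_sql(table, value_column, dimensions, conditions):
    """Return (sql, params): dimensions go to GROUP BY, conditions to WHERE."""
    # Identifiers are interpolated directly, so they must come from trusted
    # index metadata; condition values are passed as bound parameters.
    columns = ", ".join(dimensions)
    sql = f"SELECT {columns}, SUM({value_column}) FROM {table}"
    if conditions:
        sql += " WHERE " + " AND ".join(f"{column} = ?" for column in conditions)
    sql += f" GROUP BY {columns}"
    return sql, list(conditions.values())


sql, params = build_index_sql(
    "wide_table_a",
    "sending_amount",
    ["month", "city"],
    {"transportation_class": "passenger transport",
     "enterprise_attribute": "national railway"},
)
```

Returning the parameters separately lets the caller hand both values to a database driver without building literal strings into the statement.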

Fourth, the derivative indexes are stored: a timestamp is automatically assigned according to the basic information "sending amount", and the related derivative indexes are stored in the big data wide table. For example, for "passenger sending volume_national railway_month_ticket class_seat class_transportation class_province name_city_line_enterprise_station segment_station_train number", the data is recorded in the following fields: "timestamp", "index name", "data caliber", "data type", "month", "ticket class", "seat class", "transportation class", "province name", "city", "line", "enterprise", "station segment", "station", "train number", "sending amount" (where the field "sending amount" is numeric); for "passenger sending volume_full caliber_year_train number", the data is recorded in the fields "timestamp", "index name", "data caliber", "data type", "year", "train number" and "sending amount" (where the field "sending amount" is numeric).
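
The field layout of a stored derivative index can be sketched in Python as a fixed set of common fields followed by the time grain, the index's dimensions and the numeric value field. `COMMON_FIELDS` and `storage_fields` are hypothetical names, and the layout is inferred from the two examples above.

```python
COMMON_FIELDS = ["timestamp", "index name", "data caliber", "data type"]


def storage_fields(time_grain, dimensions, value_field="sending amount"):
    """List the fields recorded for one derivative index, in storage order."""
    return COMMON_FIELDS + [time_grain, *dimensions, value_field]


yearly_by_train = storage_fields("year", ["train number"])
```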

In addition, an embodiment of the present invention further provides a storage medium storing an index automatic definition program based on a big data wide table, wherein the program, when executed by a processor, implements the steps of the above index automatic definition method based on a big data wide table.

The specific embodiment of the readable storage medium of the present invention is substantially the same as the embodiments of the index automatic definition method based on a big data wide table, and will not be described herein again.

In addition, the invention also provides an index automatic definition system based on the big data wide table, which comprises the following components:

the basic index entry module, used for receiving and identifying an entered basic data set, wherein the basic data set comprises basic indexes and relevant factors corresponding to the basic indexes, the basic indexes comprise definitions of data values, and the relevant factors comprise data types, data names, data wide tables, data dimensions, data calibers, data versions and data values;

the index building module is used for automatically generating various dimensional combinations (single dimension and multi-dimension) according to the basic indexes and the corresponding relevant factors thereof and judging and generating indexes of the various dimensional combinations according to the priority;

the index calculation module is used for automatically calculating various dimension combination indexes according to the calculation mode of the indexes;

and the index storage module is used for automatically generating indexes according to the basic indexes and storing the indexes combined in various dimensions in a column type storage mode.

It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or system that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or system. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, method, article, or system that comprises the element.

The above-mentioned serial numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments.

Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium (e.g., ROM/RAM, magnetic disk, optical disk) as described above and includes instructions for enabling a terminal device (e.g., a mobile phone, a computer, a server, an air conditioner, or a network device) to execute the method according to the embodiments of the present invention.

The above description is only a preferred embodiment of the present invention, and not intended to limit the scope of the present invention, and all modifications of equivalent structures and equivalent processes, which are made by using the contents of the present specification and the accompanying drawings, or directly or indirectly applied to other related technical fields, are included in the scope of the present invention.

Claims

1. An index automatic definition method based on a big data wide table is characterized by comprising the following steps:

2. The method according to claim 1, wherein the step of obtaining the custom derivative index according to each basic index and each related factor comprises:

3. The method as claimed in claim 2, wherein the step of automatically defining and obtaining a plurality of derived indicators according to the forward expansion of each relevant factor comprises:

4. The method according to claim 3, wherein the step of freely combining the data dimensions into various dimension combinations according to the relevant factors to obtain the derived indexes of the various dimension combinations comprises:

identifying a data source corresponding to each basic index;

freely combining the data sources and the data dimensions to obtain initial derivative indexes;

judging whether a target object in the initial derivative indexes is repeated across the dimension combinations;

if yes, determining the dimension priority of the target object;

if not, directly generating the derived index of the target object.

5. The method according to claim 4, wherein the step of determining the dimension priority of the target object comprises:

6. The method according to any one of claims 1 to 5, wherein the step of obtaining the custom derivative index according to the basic index and each relevant factor includes:

7. The method as claimed in claim 6, wherein the step of obtaining the customized derivative index from the basic index and the relevant factors is followed by:

8. An index automatic definition system based on a big data wide table, comprising a memory, a processor and an index automatic definition program based on a big data wide table stored in the memory and executable on the processor, wherein the program, when executed by the processor, implements the steps of the index automatic definition method based on a big data wide table according to any one of claims 1 to 7.

9. A storage medium, characterized in that the storage medium stores an index automatic definition program based on a big data wide table, which when executed by a processor implements the steps of the index automatic definition method based on a big data wide table according to any one of claims 1 to 7.