CN112667859A - Data processing method and device based on memory - Google Patents

Publication number
CN112667859A
Authority
CN
China
Prior art keywords
data
calculation
memory
data set
meta information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011620449.XA
Other languages
Chinese (zh)
Inventor
吴明星
王星宇
李纪洲
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
BEIJING JOIN-CHEER SOFTWARE CO LTD
Original Assignee
BEIJING JOIN-CHEER SOFTWARE CO LTD
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by BEIJING JOIN-CHEER SOFTWARE CO LTD
Priority to CN202011620449.XA
Publication of CN112667859A

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a data processing method and device based on a memory. The method comprises the following steps: loading data acquired from a data interface, row by row, into a data set model stored in advance in a memory, and determining the meta information corresponding to each row of data, wherein the meta information comprises basic attributes and extended attributes; determining the index items and dictionary entries corresponding to each row of data according to the meta information, and inputting the meta information, the index items and the dictionary entries into the data set model; and performing data calculation by using the data set model to obtain a calculation result. By combining the meta information of the data with the actual data, the invention gives the data more meaning, effectively enhances data analysis and calculation capability, supports multidimensional analysis scenarios with high calculation efficiency, and achieves higher access performance through the data set structure defined in the memory.

Description

Data processing method and device based on memory
Technical Field
The present invention relates to the field of data analysis technologies, and in particular, to a data processing method and apparatus based on a memory.
Background
Enterprises generate a large amount of data in the course of production, and analyzing this data can provide guidance for the enterprise. However, analyzing a large amount of data is difficult; in particular, when multidimensional data is analyzed, conventional analysis methods lose flexibility and suffer from performance problems, which negatively affects production.
In recent years, enterprises have paid more and more attention to data analysis, and products and technologies for data analysis keep emerging. Most of these technologies operate on the data alone and lack a description of the data's characteristics, which degrades the analysis and limits its scope. In a multidimensional analysis scenario in particular, the data carries more meaning and more associated characteristics. For example, production record data includes time fields, product records, production information, and so on. From the time fields, the time granularity of the data can be seen, trend analysis can be performed on continuous time series, and the data can be summarized at different time granularities to calculate year-over-year or period-over-period ratios; similarly, data can be aggregated by product using the product fields. In the traditional approach, these functions are implemented by programming each scenario separately or by writing SQL. Such methods are difficult to use, cannot guarantee efficiency, and are hard to maintain.
The methods for implementing data analysis currently on the market are roughly as follows:
1. Data is stored in a database; the upper-layer business queries the database directly by writing SQL statements, and the result is displayed on an interface. This method writes the SQL and stored procedures required by the business in order to implement summary queries. Its problems include: few scenarios are supported, because the limitations of the database rule out many kinds of multidimensional analysis; the SQL is hard to write, because most databases provide limited analysis functions, so complex analysis, and business correlation analysis in particular, requires nesting many functions and sometimes writing user-defined functions; and performance is poor, because the analysis relies on a large number of function operations.
2. A data structure for multidimensional analysis is planned in advance, various aggregation tables are generated, and queries run against those aggregation tables. This method first preprocesses the data to be analyzed and generates intermediate tables, such as sub-tables or aggregation tables, through an ETL tool or custom code; the generation logic is determined by the subject to be analyzed. Its problems include: the data must be persisted at an intermediate stage, so results are not real-time; the aggregated target data is fixed, so ad hoc queries are not possible; the scheme improves performance by trading time and space, consumes a large amount of both, and is inefficient; multidimensional analysis scenarios are complex, and most calculations can only be completed through custom development; and maintenance costs are high, because many ETL processes have to be built.
3. Data is submitted to a distributed cluster through a distributed computing framework such as Spark, and the final result is obtained through multi-stage operations. This method uses a distributed in-memory computing framework, takes the data to be analyzed as the input source, and submits it to the distributed engine by writing distributed computing scripts or functions. Its problems include: the architecture is complex and hard to maintain, and is suited to ultra-large-scale data calculation; the hardware requirements are high; the data warm-up stage is slow, so the method offers no advantage for small and medium-scale queries; and multidimensional analysis scenarios are complex, so most calculations can only be completed through custom development.
Disclosure of Invention
In view of the problems in the prior art, embodiments of the present invention mainly aim to provide a data processing method and apparatus based on a memory, which effectively enhance the data analysis capability and achieve higher access performance.
In order to achieve the above object, an embodiment of the present invention provides a data processing method based on a memory, where the method includes:
loading data acquired from a data interface to a data set model stored in a memory in advance line by line, and determining meta information corresponding to each line of data;
determining index items and dictionary entries corresponding to each row of data according to the meta information, and inputting the meta information, the index items and the dictionary entries into the data set model;
and performing data calculation by using the data set model to obtain a calculation result.
Optionally, in an embodiment of the present invention, the meta information includes a basic attribute and an extended attribute; the basic attribute comprises a data name and a data type; the extended attributes include field type, key column, name column, aggregation type, application type, time granularity, data format, display format, unit dimension flag, and hierarchy.
Optionally, in an embodiment of the present invention, the determining, according to the meta information, an index entry includes: and determining the index items corresponding to the data of each row according to the field types and the data types.
Optionally, in an embodiment of the present invention, the method further includes: and carrying out hierarchical construction on the data according to the hierarchy.
Optionally, in an embodiment of the present invention, the performing data calculation by using the data set model to obtain a calculation result includes: selecting an algorithm corresponding to the meta-information to perform data calculation according to the meta-information in the data set model to obtain a calculation result; wherein the calculation result comprises a multi-valued result or a single-valued result.
An embodiment of the present invention further provides a data processing apparatus based on a memory, where the apparatus includes:
the meta-information module is used for loading the data acquired from the data interface into a data set model stored in a memory in advance line by line and determining meta-information corresponding to each line of data;
the data set module is used for determining index items and dictionary entries corresponding to each row of data according to the meta information and inputting the meta information, the index items and the dictionary entries into the data set model;
and the data calculation module is used for performing data calculation by using the data set model to obtain a calculation result.
Optionally, in an embodiment of the present invention, the meta information includes a basic attribute and an extended attribute; the basic attribute comprises a data name and a data type; the extended attributes include field type, key column, name column, aggregation type, application type, time granularity, data format, display format, unit dimension flag, and hierarchy.
Optionally, in an embodiment of the present invention, the data set module is further configured to determine, according to the field type and the data type, an index entry corresponding to each row of data.
Optionally, in an embodiment of the present invention, the apparatus further includes: and the hierarchy module is used for carrying out hierarchy construction on the data according to the hierarchy.
Optionally, in an embodiment of the present invention, the data calculation module is further configured to select an algorithm corresponding to the meta information according to the meta information in the data set model to perform data calculation, so as to obtain a calculation result; wherein the calculation result comprises a multi-valued result or a single-valued result.
The invention also provides an electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the method when executing the program.
The present invention also provides a computer-readable storage medium storing a computer program for executing the above method.
The invention utilizes the meta information of the data, combines the actual data, gives more meanings to the data, effectively enhances the data analysis and calculation capability, supports the multidimensional analysis scene, has high calculation efficiency, and realizes higher access performance through the data set structure defined in the memory.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art that other drawings can be obtained based on these drawings without creative efforts.
Fig. 1 is a flowchart of a data processing method based on a memory according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a data set package according to an embodiment of the present invention;
FIG. 3 is a flow chart of data preprocessing in an embodiment of the present invention;
FIG. 4 is a flow chart of loading data in an embodiment of the present invention;
FIG. 5 is a diagram illustrating an in-memory data structure of a data set model according to an embodiment of the present invention;
FIG. 6 is a flowchart illustrating the processing of each row of data according to an embodiment of the present invention;
FIG. 7 is a flow chart of creating an index in an embodiment of the present invention;
FIG. 8 is a diagram illustrating a meta-information structure according to an embodiment of the present invention;
FIG. 9 is a diagram illustrating the construction of a memory structure object according to an embodiment of the present invention;
FIG. 10 is a diagram illustrating a data calculation process according to an embodiment of the present invention;
FIG. 11 is a block diagram of a memory-based data processing apparatus according to an embodiment of the present invention;
fig. 12 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
The embodiment of the invention provides a data processing method and device based on a memory.
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
To implement a good data analysis function, data needs to be organized effectively and efficient computing capability needs to be provided. Analysis scenarios for multidimensional data are complicated and changeable, and much of the calculation logic is not general. Several in-memory computing solutions are currently available on the market, but they focus on basic data operation frameworks and do not apply well to a multidimensional operation model. In particular, when the multidimensional model involves cross-row operations such as time-offset calculation and grouped aggregation, factors such as the data scale and the data retrieval performance of the cross-row operations need more consideration on top of the in-memory computation.
The execution body of the data processing method based on the memory provided by the embodiment of the invention comprises but is not limited to a computer. Fig. 1 is a flowchart of a data processing method based on a memory according to an embodiment of the present invention, where the method includes:
Step S1: the data acquired from the data interface is loaded row by row into the data set model stored in advance in the memory, and the meta information corresponding to each row of data is determined. The meta information includes basic attributes and extended attributes.
The data structure is predefined in the data set model. The data set model implements functions such as data management and query and provides a unified data query interface. The various data retrieval and calculation operations are encapsulated in the data set model, so that to the outside it is equivalent to a black box and the user does not need to pay attention to internal implementation details, as shown in fig. 2.
The data interface is the channel through which external data is loaded. Data is supplied to the application in various forms, for example as text data or as data from a database. Data loaded through the different data interfaces is standardized into a two-dimensional structure.
Further, the loaded data needs to be preprocessed. As shown in fig. 3, the preprocessing includes creating indexes, compressing the data, constructing the memory structure, and so on; combined with the data set model, this forms an in-memory data structure that can be calculated on and retrieved from.
In order to parse the data completely, the meta information of the data is determined. Data query and calculation are implemented through the meta information. The meta information is set on the query fields of the data set model and includes extended attributes in addition to basic attributes such as name, title, and data type. The extended attributes comprise field type, key column, name column, aggregation type, application type, time granularity, data format, display format, unit dimension flag, and hierarchy.
The extended attributes make complex data operations possible. When data is loaded, the meta information is bound to the data set model; when a data calculation needs to read the meta information, it obtains it from the data set model.
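The binding of meta information to the data set model can be pictured with a minimal Python sketch. The class names `FieldMeta`, `FieldType` and `DataSetModel`, the attribute spellings, and the string codes such as "sum" and "period" are illustrative assumptions; the source only lists the attributes and does not prescribe a representation.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class FieldType(Enum):
    DIMENSION = "dimension"      # common dimension
    TIME = "time"                # time dimension
    MEASURE = "measure"          # measure (metric)
    DESCRIPTION = "description"  # descriptive field


@dataclass
class FieldMeta:
    # Basic attributes
    name: str
    title: str
    data_type: str
    # Extended attributes (names and value codes are hypothetical)
    field_type: FieldType = FieldType.DESCRIPTION
    key_column: Optional[str] = None
    name_column: Optional[str] = None
    aggregation_type: Optional[str] = None   # e.g. "sum", "count"
    application_type: Optional[str] = None   # e.g. "period", "closing"
    time_granularity: Optional[str] = None   # e.g. "day", "month"
    data_format: Optional[str] = None
    display_format: Optional[str] = None
    is_unit_dimension: bool = False
    hierarchy: Optional[str] = None


class DataSetModel:
    """Holds rows plus the meta information bound at load time."""

    def __init__(self, fields):
        self.meta = {field.name: field for field in fields}
        self.rows = []

    def field_meta(self, name):
        # Calculations read meta information back from the model.
        return self.meta[name]


sales = FieldMeta("SALES", "Sales revenue", "float",
                  field_type=FieldType.MEASURE,
                  aggregation_type="sum", application_type="period")
day = FieldMeta("DAY", "Sales date", "date",
                field_type=FieldType.TIME,
                time_granularity="day", data_format="yyyymmdd")
model = DataSetModel([sales, day])
```

A calculation routine would then call `model.field_meta("SALES")` to decide how the field should be aggregated.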
Step S2, determining, according to the meta information, an index entry and a dictionary entry corresponding to each line of data, and inputting the meta information, the index entry and the dictionary entry into the data set model.
Fig. 4 is a flow chart of loading data into the data set model stored in the memory. A memory pool is used to manage the shared buffer area efficiently. The memory pool allows multiple processes, or multiple threads of one process, that access the data set simultaneously to share a cache, and it is responsible for writing modified pages back to the file and for allocating memory space for newly loaded pages. The memory pool comprises a plurality of memory block objects. When a new data set is added, a memory space is requested first and data is then written into it continuously; if the space runs out as writing continues, a new memory block is requested. This keeps the memory relatively contiguous, avoids fragmentation, and improves allocation efficiency.
Based on a memory swapping technique, a shared buffer is created so that in-memory data can be exchanged directly with the disk. This prevents OOM (out of memory) conditions while preserving data access efficiency. In scenarios with a large data volume in particular, part of the data is swapped out to disk, while frequently used data and index data remain resident in memory, which improves the data hit rate.
The memory structure of the data set model comprises five parts: meta information, data pages, index pages, data dictionaries, and hierarchies. The data is organized in memory according to the structure shown in fig. 5. The data pages record the data acquired through the data interface, the index pages record the index items of each row of data, and the data dictionaries record the dictionary entries of each row of data.
Because memory sharing and swapping are used, different types of data have different residence mechanisms in memory. Meta information, dictionary data and index data are permanently resident in memory to guarantee access efficiency. The data in the data pages, however, may be extremely large, so part of it may be persisted to disk.
During data loading, each row of data is processed, including operations such as data standardization and index construction; the processing flow for a row is shown in fig. 6. The data is standardized according to the data format, so that the data in the memory structure is recorded according to the defined field type. The field type then determines whether an index needs to be created and which kind. Different indexes occupy different amounts of space and have different access efficiencies, so the field type and the data type are considered together to decide which index to use; the specific decision logic is shown in fig. 7.
For example, suppose a column of data is a unit dimension. During data processing, a unit index is created first; the unit index is a tree index (an in-memory tree structure managed as a prefix tree), and a unit dictionary table is created at the same time. Both are created before any data is read, and both are constructed from the meta information. Data is then read row by row, and the tree index and the unit dictionary table are filled from each row. For the tree index, the tree nodes are traversed first; if the node exists, its position is recorded, and if it does not exist, a node is created and added to the tree index. For the unit dictionary table, the table is first searched through the dictionary directory (the index of the dictionary entries); if the entry exists, its position (an index number in memory, stored as an integer) is recorded, and if it does not exist, a dictionary entry is created and its position is recorded. The data model then no longer records the specific values but records the positions of the dictionary entries.
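The block-wise allocation described here can be illustrated with a small Python sketch. It is only a toy model under stated assumptions: the class name `MemoryPool`, the fixed block size, and the `(block_index, offset)` address are hypothetical, and cache sharing between processes and write-back of modified pages are deliberately left out.

```python
class MemoryPool:
    """Toy pool that hands out space from fixed-size blocks (illustrative only)."""

    def __init__(self, block_size=1024):
        self.block_size = block_size
        self.blocks = [bytearray(block_size)]
        self.offset = 0  # write position inside the last block

    def write(self, payload: bytes):
        """Append payload and return the (block_index, offset) where it landed."""
        if len(payload) > self.block_size:
            raise ValueError("payload larger than one block")
        if self.offset + len(payload) > self.block_size:
            # The current block cannot hold the payload: request a new block.
            self.blocks.append(bytearray(self.block_size))
            self.offset = 0
        block_index = len(self.blocks) - 1
        start = self.offset
        self.blocks[block_index][start:start + len(payload)] = payload
        self.offset += len(payload)
        return block_index, start

    def read(self, block_index, start, length):
        return bytes(self.blocks[block_index][start:start + length])


pool = MemoryPool(block_size=8)
first = pool.write(b"abcde")
second = pool.write(b"fghij")  # does not fit in the 3 bytes left in block 0
```

Writing sequentially into a block and only requesting a new block when the current one is exhausted is what keeps related data close together in this sketch.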
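The row-by-row filling of a dictionary table and a prefix-tree index can be sketched as follows. This is a minimal sketch, assuming unit codes are strings and that the index maps each code to the row numbers where it occurs; the class names `DictionaryTable` and `TrieIndex` and their methods are hypothetical.

```python
class DictionaryTable:
    """Maps each distinct value to an integer position."""

    def __init__(self):
        self.positions = {}  # value -> position (plays the role of the dictionary directory)
        self.entries = []    # position -> value

    def encode(self, value):
        pos = self.positions.get(value)
        if pos is None:
            # Entry does not exist yet: create it and remember its position.
            pos = len(self.entries)
            self.entries.append(value)
            self.positions[value] = pos
        return pos

    def decode(self, pos):
        return self.entries[pos]


class TrieIndex:
    """Prefix tree keyed by the characters of a code, storing row numbers."""

    def __init__(self):
        self.root = {"children": {}, "rows": []}

    def add(self, code, row_no):
        node = self.root
        for ch in code:
            # Reuse the node if it exists, otherwise create it.
            node = node["children"].setdefault(ch, {"children": {}, "rows": []})
        node["rows"].append(row_no)

    def rows_with_prefix(self, prefix):
        node = self.root
        for ch in prefix:
            node = node["children"].get(ch)
            if node is None:
                return []
        found, stack = [], [node]
        while stack:
            current = stack.pop()
            found.extend(current["rows"])
            stack.extend(current["children"].values())
        return sorted(found)


units = ["1001", "1002", "1001", "2001"]
dictionary, index = DictionaryTable(), TrieIndex()
encoded_column = []
for row_no, unit in enumerate(units):
    # The data page keeps only the dictionary position, not the value itself.
    encoded_column.append(dictionary.encode(unit))
    index.add(unit, row_no)
```

In this sketch the repeated code "1001" is stored once in the dictionary, and the column holds small integers that `decode` turns back into values when needed.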
And step S3, performing data calculation by using the data set model to obtain a calculation result.
Once all the data rows have been loaded and processed according to the above steps, the in-memory data structure is complete; that is, the data set model corresponding to the acquired data is obtained. The data set model includes data pages, indexes, meta information, and dictionaries. The calculated-field evaluation process is performed after that. Field evaluation is calculated according to the defined field formulas. Because calculated fields may involve cross-row operations as well as single-row operations, the evaluation phase is arranged as the last step. After all the data is loaded, the current result set is treated as one large data set, each evaluation formula is run in turn to obtain a calculation result, and the result is added to the target data.
Data calculation performs various operations on the existing data set, including field evaluation, aggregation operations, filtering queries, and so on. Data calculation can occur at any time in the life cycle of the data set; for example, filtering out a subset according to a query condition belongs to the calculation stage.
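Why evaluation has to wait until loading is finished can be seen in a short Python sketch. The field names (`BUDGET`, `ACTUAL`, `DIFF`, `SHARE`) and the use of lambdas as "field formulas" are assumptions made for illustration; the source does not define the formula language.

```python
rows = [
    {"BUDGET": 100, "ACTUAL": 90},
    {"BUDGET": 40, "ACTUAL": 30},
]

# Single-row formulas only need the current row.
row_formulas = {"DIFF": lambda row: row["BUDGET"] - row["ACTUAL"]}

# Cross-row formulas need the whole data set, so they can only run after loading.
cross_row_formulas = {
    "SHARE": lambda row, all_rows: row["ACTUAL"] / sum(r["ACTUAL"] for r in all_rows),
}


def evaluate(rows, row_formulas, cross_row_formulas):
    """Run each evaluation formula in turn and add the result to the rows."""
    for name, formula in row_formulas.items():
        for row in rows:
            row[name] = formula(row)
    for name, formula in cross_row_formulas.items():
        values = [formula(row, rows) for row in rows]
        for row, value in zip(rows, values):
            row[name] = value
    return rows


evaluate(rows, row_formulas, cross_row_formulas)
```

The `SHARE` formula would give a wrong answer if it ran while rows were still arriving, which is the reason this sketch evaluates everything in a final pass.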
Furthermore, a variety of operation methods are designed for multidimensional data analysis scenarios. These methods allow data to be evaluated quickly, and each capability can be used simply by writing a formula. The operations include standard mathematical operations such as conventional aggregation, grouped statistics, ranking, TOPN, quartiles, geometric mean and moving average, as well as standard industry data operations such as year-over-year comparison, period-over-period comparison, compound growth rate and time-period offset.
All operations are performed in memory. Each operation method corresponds to one executor. An executor receives a data set as input and combines basic operators (value retrieval, filtering, summing) into more complex operations. While running, the executor can read the meta information of a field, including its basic and extended attributes, and it applies different algorithms according to the field attributes. For example, when a measure needs to be summed, its application type is examined: if the measure is a period amount, the data of all periods is accumulated; if it is a closing (end-of-period) balance, only the data of the last period is taken as the result.
As another example, suppose there is a sales detail table in which data is recorded by day, and the sales of each commodity need to be summarized by month. Here, sales revenue is the measure to be calculated, the commodity is the dimension to be analyzed, and the sales date is the time dimension. The specific meaning of each field must be determined: because the data is summarized by month, the day level has to be eliminated and the values accumulated directly from daily granularity up to months; in addition, the measure is a period amount. To achieve this, only the following formula needs to be written: DS_SUM(SALES, DAY = ALL).
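A few of the listed operations can be written down compactly. The Python sketch below uses the usual textbook definitions, which are assumed here because the source names the operations without defining them; the function names are hypothetical. Year-over-year and period-over-period comparison are both treated as a growth rate against a value a fixed number of periods earlier (for monthly data, an offset of 12 and 1 respectively).

```python
def offset_growth(series, offset):
    """Growth of each value relative to the value `offset` periods earlier.

    Returns None where no earlier value exists or the earlier value is zero.
    """
    result = []
    for i, value in enumerate(series):
        if i < offset or series[i - offset] == 0:
            result.append(None)
        else:
            base = series[i - offset]
            result.append((value - base) / base)
    return result


def compound_growth_rate(first, last, periods):
    """Constant per-period rate that turns `first` into `last` over `periods` periods."""
    return (last / first) ** (1 / periods) - 1


def moving_average(values, window):
    """Simple moving average over a fixed window."""
    return [
        sum(values[i - window + 1:i + 1]) / window
        for i in range(window - 1, len(values))
    ]


period_over_period = offset_growth([100, 150, 300], 1)
cagr = compound_growth_rate(100, 121, 2)
smoothed = moving_average([1, 2, 3, 4], 2)
```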
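The idea that one summation executor behaves differently depending on the field's application type can be sketched in Python. The function name `sum_executor` and the type codes "period" (a flow accumulated over periods) and "closing" (an end-of-period balance) are hypothetical labels chosen for this example.

```python
def sum_executor(values, application_type):
    """Summarize a measure over time according to its application type.

    values: list of (period, amount) pairs, where periods sort chronologically.
    """
    ordered = sorted(values)
    if not ordered:
        return 0
    if application_type == "period":
        # Flow-type measure: accumulate the amounts of all periods.
        return sum(amount for _, amount in ordered)
    if application_type == "closing":
        # Balance-type measure: only the last period's value is meaningful.
        return ordered[-1][1]
    raise ValueError(f"unsupported application type: {application_type}")


monthly = [("2020-02", 20), ("2020-01", 10), ("2020-03", 5)]
total_as_flow = sum_executor(monthly, "period")
total_as_balance = sum_executor(monthly, "closing")
```

With the same input, the flow interpretation adds the three months together while the balance interpretation returns only the March value.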
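The effect this formula aims at, rolling daily sales up to months per commodity, can be reproduced by hand in Python. The sketch below is one possible reading of the example, not the formula engine itself; the function `sum_by_month` and the sample rows are made up for illustration.

```python
from collections import defaultdict
from datetime import date

rows = [
    {"PRODUCT": "A", "DAY": date(2020, 1, 5), "SALES": 100.0},
    {"PRODUCT": "A", "DAY": date(2020, 1, 20), "SALES": 50.0},
    {"PRODUCT": "A", "DAY": date(2020, 2, 1), "SALES": 30.0},
    {"PRODUCT": "B", "DAY": date(2020, 1, 7), "SALES": 70.0},
]


def sum_by_month(rows, measure, time_field, group_field):
    """Drop the day level and accumulate the measure per (group, year, month)."""
    totals = defaultdict(float)
    for row in rows:
        day = row[time_field]
        key = (row[group_field], day.year, day.month)
        totals[key] += row[measure]
    return dict(totals)


monthly_sales = sum_by_month(rows, "SALES", "DAY", "PRODUCT")
```

Plain accumulation is appropriate here because the measure is a period amount; a balance-type measure would instead take the last value in each month.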
And returning according to the calculation result, and outputting the calculation result comprising a single-value result or a multi-value result. For single value results, a specific value is returned. For multi-valued results, a result set is returned, which is a subset of the original data set if it is a non-aggregation operation, and a new result set if it is an aggregation operation.
To reduce memory usage, every filtering operation returns a subset that records only the row reference numbers of the original data set, so the row data itself is not copied and almost no additional memory is occupied.
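A subset that stores row references rather than copies can be sketched in Python as follows. The classes `DataSet` and `Subset` are hypothetical; the point of the sketch is only that filtering produces a list of integers pointing into the original rows.

```python
class DataSet:
    def __init__(self, rows):
        self.rows = rows

    def filter(self, predicate):
        # Record only the row numbers that match; the rows are not copied.
        refs = [i for i, row in enumerate(self.rows) if predicate(row)]
        return Subset(self, refs)


class Subset:
    def __init__(self, parent, refs):
        self.parent = parent
        self.refs = refs

    def __iter__(self):
        return (self.parent.rows[i] for i in self.refs)

    def filter(self, predicate):
        # Filtering a subset narrows the reference list of the same parent.
        refs = [i for i in self.refs if predicate(self.parent.rows[i])]
        return Subset(self.parent, refs)


data = DataSet([
    {"unit": "1001", "amount": 10},
    {"unit": "1002", "amount": 25},
    {"unit": "1001", "amount": 40},
])
unit_1001 = data.filter(lambda row: row["unit"] == "1001")
large = unit_1001.filter(lambda row: row["amount"] > 20)
```

Iterating over `large` yields the very same row objects held by `data`, which is why chained filters stay cheap in this sketch.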
As an embodiment of the present invention, the meta information includes a basic attribute and an extended attribute; wherein the basic attribute comprises a data name and a data type; the extended attributes include field type, key column, name column, aggregation type, application type, time granularity, data format, display format, unit dimension flag, and hierarchy.
Field type: describes the business attribute of the field; the values are common dimension, time dimension, measure, and description field. The field type makes it easy to identify the meaning of each field. For example, a time dimension contains time information, so data can be aggregated by time, trend analysis can be performed, and a time series can be generated.
Key column/name column: the key column and the name column form a pair of fields that together describe one set of attributes. For example, the unit code and the unit title are each other's key column and name column. Setting the key column and name column allows data to be calculated by code while the interface displays names. In particular, when a unit name changes, the key column, the name column and the time dimension together make it easy to switch what the interface displays.
Aggregation type: used for business scenario analysis; indicates how the field is to be aggregated, including sum, count, maximum, minimum, and average. Once the aggregation type is defined, the data automatically participates in summarization according to the configured type at run time, so the type does not have to be specified again. The aggregation type usually depends on the specific scenario; for example, sales revenue is summarized data whose aggregation type is sum, while for the product category the aggregation type is count.
Application type: describes the business characteristics of the data and applies to measure fields; the values are period amount, cumulative amount, opening balance, and closing balance. Data participates in aggregation and calculation differently depending on the application type. Combining the application type with the aggregation type yields the calculation behavior required by different businesses. For example, for the total number of employees the aggregation type should be count and the application type should be a point-in-time value, so that data from different periods is not added together when counting.
Time granularity: the granularity of a given time dimension, including year, half-year, quarter, month, ten-day period, day, and so on. For example, the granularity of annual data is year and the granularity of monthly data is month. Data of different granularities is produced at different frequencies, and this information allows the data to be aggregated and viewed by different periods.
Data format: records the format of the source data. The data set loads data according to the declared data format, and defining the data format is how data standardization is achieved. For example, when a time-type string is read that is stored in the 8-character format yyyymmdd, the data is parsed according to this format and converted into standard date-type data.
Display format: the format in which a field is finally presented. The display format is the counterpart of the data format and governs how the field appears on the interface or other terminals. It usually differs from the data format, and setting it makes the displayed data more intuitive.
Unit dimension flag: marks whether the dimension is a unit. It applies only to common dimensions. In enterprise analysis, most data contains unit information, and the unit, as a special subject, often needs to carry more business meaning; for example, the unit may be a slowly changing dimension, may be associated with an organization, or may have system permissions bound to it. With the unit attribute, more business logic operations can be performed.
Hierarchy: hierarchies make it convenient to view and summarize data. Hierarchy management supports several hierarchy modes: a parent-child hierarchy is formed from parent attributes, and a coded hierarchy is formed from the code format.
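The interplay of data format and display format can be shown with Python's standard `datetime` module. The mapping table from format names such as "yyyymmdd" to `strptime` patterns is an assumption made for this sketch; the source only gives yyyymmdd as an example of a stored format.

```python
from datetime import date, datetime

# Hypothetical mapping from declared format names to strptime/strftime patterns.
FORMAT_PATTERNS = {
    "yyyymmdd": "%Y%m%d",
    "yyyy-mm-dd": "%Y-%m-%d",
}


def normalize_date(raw: str, data_format: str) -> date:
    """Parse a source string into a standard date according to its data format."""
    return datetime.strptime(raw, FORMAT_PATTERNS[data_format]).date()


def display_date(value: date, display_format: str) -> str:
    """Render a standard date according to the display format."""
    return value.strftime(FORMAT_PATTERNS[display_format])


loaded = normalize_date("20201231", "yyyymmdd")
shown = display_date(loaded, "yyyy-mm-dd")
```

Keeping the stored value in a standard date type means the display format can change without reloading or reparsing the source data.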
In this embodiment, determining the index entry according to the meta information includes: and determining the index items corresponding to the data of each row according to the field types and the data types.
In this embodiment, the method further includes: and carrying out hierarchical construction on the data according to the hierarchy.
The hierarchical structure information of the data is analyzed: for columns that contain a hierarchy, the code and the parent attribute value of each level are recorded, and a hierarchical tree structure is finally generated from this information. With the hierarchy constructed, level-by-level data summarization and tree-style data display can be implemented conveniently.
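Building a parent-child tree from recorded codes and parent values, and then summarizing level by level, can be sketched in Python. The unit codes, the amounts, and the function names `build_tree` and `roll_up` are invented for illustration.

```python
from collections import defaultdict


def build_tree(units):
    """units: list of (code, parent_code) pairs; parent_code is None for a root.

    Returns a mapping from parent code to the list of its child codes.
    """
    children = defaultdict(list)
    for code, parent in units:
        children[parent].append(code)
    return children


def roll_up(code, children, own_values):
    """Total for a node: its own value plus the totals of all descendants."""
    total = own_values.get(code, 0)
    for child in children.get(code, []):
        total += roll_up(child, children, own_values)
    return total


units = [("HQ", None), ("East", "HQ"), ("West", "HQ"), ("East-1", "East")]
amounts = {"East": 10, "West": 20, "East-1": 5}
tree = build_tree(units)
hq_total = roll_up("HQ", tree, amounts)
east_total = roll_up("East", tree, amounts)
```

The same `tree` mapping can drive a tree-style display by walking from the root entries stored under the `None` key.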
As an embodiment of the present invention, performing data calculation by using the data set model, and obtaining a calculation result includes: selecting an algorithm corresponding to the meta-information to perform data calculation according to the meta-information in the data set model to obtain a calculation result; wherein the calculation result comprises a multi-valued result or a single-valued result.
The calculation result is returned as either a single-value result or a multi-value result. For a single-value result, a specific value is returned. For a multi-value result, a result set is returned: a subset of the original data set for a non-aggregation operation, or a new result set for an aggregation operation.
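The three result shapes can be sketched as follows. This is an illustrative sketch only; the row layout and the functions `filter_rows` and `aggregate` are assumptions, not the interface of the data set model.

```python
from collections import defaultdict

# Hypothetical rows with two dimensions and one metric.
rows = [
    {"unit": "A0101", "category": "C01", "final": 1500000},
    {"unit": "A0101", "category": "C02", "final": 200000},
    {"unit": "A0102", "category": "C01", "final": 500000},
]

def filter_rows(rows, **conditions):
    """Non-aggregation operation: the result is a subset of the original rows."""
    return [r for r in rows if all(r[k] == v for k, v in conditions.items())]

def aggregate(rows, metric, keep_dims=()):
    """Sum a metric; no kept dimension gives a single value, otherwise a new result set."""
    if not keep_dims:
        return sum(r[metric] for r in rows)
    groups = defaultdict(int)
    for r in rows:
        groups[tuple(r[d] for d in keep_dims)] += r[metric]
    return [dict(zip(keep_dims, key), **{metric: total}) for key, total in groups.items()]

subset = filter_rows(rows, category="C01")              # multi-value, subset of the original
single = aggregate(rows, "final")                       # single value
by_unit = aggregate(rows, "final", keep_dims=("unit",)) # multi-value, new result set
```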
In one embodiment of the present invention, the data processing procedure is illustrated with a complete example. Requirement analysis: enterprise A has a large volume of business transactions and revenue, and runs an internal business system that records the details of its revenue. To understand all revenue clearly, the difference between budgeted revenue and final accounts revenue needs to be analyzed; the share and ranking of budget and final accounts in each category, and the change against the same period of the previous year, are examined, and the final result is displayed as a unit tree. The data known to exist in the business system includes: the unit dimension table, the final accounts revenue categories (including operating revenue, government subsidies, other revenue, etc.), and the final accounts revenue details (including time of occurrence, associated unit, revenue category, and final accounts amount).
1. Defining fields: determines the data to be analyzed. a. Common fields: period, year, unit code, unit name, unit parent node, administrative division, revenue classification code, revenue classification name, budget amount, and final accounts amount. b. Calculated fields: budget and final accounts difference, budget share, final accounts share, and final accounts year-over-year.
2. Defining field attributes: determines the data characteristics, as shown in table 1.
TABLE 1
[Table 1 is provided as an image in the original publication.]
3. Defining calculated field formulas: used to obtain the indicators to be analyzed, specifically:
1) budget and final accounts difference: SR_AT - SR_BT;
2) budget share: SR_BT / DS_SUM(SR_BT, U_CODE = ALL);
3) final accounts share: SR_AT / DS_SUM(SR_AT, U_CODE = ALL);
4) final accounts year-over-year: DS_YOY(SR_AT).
4. Defining a hierarchy: used for tree aggregation and hierarchical display. The unit code and the unit parent node form the unit parent-child hierarchy.
5. According to the above meta information, a data set model structure is initialized, wherein the data set model structure comprises meta information, index pages, a hierarchical structure, a data dictionary and the like.
6. Data is read from the data source.
7. The data is processed, including creating memory pools, populating the data set model structure (data items, index items, hierarchy items, dictionary entries, etc.), and performing data optimization.
8. After processing is finished, data query and calculation are performed on the in-memory data set model.
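The four calculated field formulas defined in step 3 can be worked through on a few rows as a sketch. The rows are hypothetical, as are the column names YEAR and C_CODE and the result field names; SR_BT (budget amount), SR_AT (final amount) and U_CODE come from the formulas above. The year-over-year value is assumed here to be the relative change against the previous year, which the description does not state explicitly.

```python
# Hypothetical detail rows for two units and two years.
rows = [
    {"YEAR": 2018, "U_CODE": "A0101", "C_CODE": "C01", "SR_BT": 1000000, "SR_AT": 1200000},
    {"YEAR": 2018, "U_CODE": "A0102", "C_CODE": "C01", "SR_BT": 500000, "SR_AT": 400000},
    {"YEAR": 2019, "U_CODE": "A0101", "C_CODE": "C01", "SR_BT": 1900000, "SR_AT": 1500000},
    {"YEAR": 2019, "U_CODE": "A0102", "C_CODE": "C01", "SR_BT": 600000, "SR_AT": 500000},
]

def ds_sum(rows, row, metric):
    """Stand-in for DS_SUM(metric, U_CODE = ALL): total over all units
    in the same period and classification as the current row."""
    return sum(r[metric] for r in rows
               if r["YEAR"] == row["YEAR"] and r["C_CODE"] == row["C_CODE"])

def ds_yoy(rows, row, metric):
    """Stand-in for DS_YOY(metric): assumed relative change against the same
    unit and classification one year earlier; None when no such row exists."""
    for r in rows:
        if (r["YEAR"] == row["YEAR"] - 1 and r["U_CODE"] == row["U_CODE"]
                and r["C_CODE"] == row["C_CODE"]):
            return (row[metric] - r[metric]) / r[metric]
    return None

def evaluate(rows):
    """Evaluate every calculated field for every row."""
    return [{
        **row,
        "DIFF": row["SR_AT"] - row["SR_BT"],
        "BT_SHARE": row["SR_BT"] / ds_sum(rows, row, "SR_BT"),
        "AT_SHARE": row["SR_AT"] / ds_sum(rows, row, "SR_AT"),
        "AT_YOY": ds_yoy(rows, row, "SR_AT"),
    } for row in rows]

result = evaluate(rows)
```

For the 2019 row of unit A0101 this gives a difference of -400000, a budget share of 1900000 / 2500000, a final share of 1500000 / 2000000, and a year-over-year change of 300000 / 1200000.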
The specific data processing method comprises the following steps:
1. Reading the configuration, including basic field attributes, extended attributes, calculated fields, and hierarchies. A meta information structure is generated, as shown in fig. 8.
2. Requesting system resources, creating a memory pool and a shared buffer, constructing a memory structure object (i.e. the data set model), and filling the meta information structure into the memory structure object, as shown in fig. 9.
3. An index structure is created from the meta information. If the field is of a dimension type (time dimension, common dimension, or unit dimension), an index is created automatically. The creation rules are as follows:
1) For the time dimension, a binary tree index is used if the data type is numerical, and a prefix tree index is used if the data type is a string. Tree indexes are used because data can be located quickly through a tree and range lookups are supported, which suits the characteristics of the time dimension. In addition, a prefix tree index compresses string storage and so reduces memory usage.
2) For a common dimension or a unit dimension, a HASH index is used if the field is the key column, because the key column holds a small amount of data and is retrieved by exact match, so HASH locates it quickly; a prefix tree index is used if the field is the name column, because name lookups are mostly fuzzy searches.
3) If a dimension defines hierarchy information, a corresponding hierarchy tree index is also created. The hierarchy tree index resembles a multi-level index and is used for tree node positioning, path searching, and the like.
In this embodiment, the period field uses the string index S1 and the year uses the binary tree index S2; the unit code, the unit parent node (S3), the administrative division, and the classification code use HASH indexes, and the unit name and the classification name use prefix tree indexes. The unit dimension also has a hierarchy index that is used together with the hierarchy.
4. A hierarchy is created from the meta information. The hierarchy is a tree structure in which each node holds its parent node and a list of child nodes. The hierarchy supports node positioning and node traversal. In this embodiment, the unit code and the parent node form one hierarchy, which corresponds to the hierarchy structure T1.
5. Dictionary tables are created from the meta information, with each dimension field having its own dictionary table. A dictionary table is a key-value (KV) data structure; for dimension-type fields, dictionary entries are built in the dictionary table, and the dictionary positions are stored in the data page and the index page. In this embodiment, the unit code (D1), the administrative division (D2), and the classification code (D3) each correspond to one dictionary table.
6. Data is read, and dictionary entries, data items, and index items are filled for each row. For example, the record currently read is shown in table 2.
TABLE 2
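Period: 20190101 | Year: 2019 | Unit code: A0101 | Unit name: A certain division of a group | Unit parent node: A01 | Administrative division: 010 | Revenue classification code: C01 | Revenue classification name: Operating revenue | Budget amount: 1900000 | Final accounts amount: 1500000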
1) Filling the dictionary tables: dictionary tables D1, D2, and D3 are searched in turn for the values A0101, 010, and C01; if a value exists, its dictionary position is obtained, and if not, a dictionary entry is created and its position is returned.
2) Filling the index pages: 20190101 is added to the time index S1, 2019 is added to the index S2, and the other field values are likewise added to their corresponding index structures.
3) Filling the data page: the current row is processed according to the data storage format defined in the meta information and appended to the data list; for values stored in a dictionary table, the dictionary position is recorded in the data list.
7. When data loading is complete, a hierarchy tree is generated from the data and all hierarchy definitions in the meta information, and is populated into the hierarchy structure T1.
8. The buffer data is refreshed.
9. Calculated fields are evaluated. All data is traversed and each calculated field is evaluated. In this embodiment, the calculated fields are the budget and final accounts difference, the budget share, the final accounts share, and the final accounts year-over-year.
1) First, the formulas of the calculated fields are parsed to obtain a syntax expression object for each formula, and calculation is performed on the expression objects.
2) Budget and final accounts difference: this is a single-row operation that directly computes the final accounts amount minus the budget amount.
3) Budget share: the share is the value divided by the total, where the total is summarized by time, so an implied condition is that the time equals the time of the current record. The total is obtained with the formula DS_SUM(SR_BT, U_CODE = ALL), which removes the unit dimension and takes the budget total of all units. The data set is first filtered to find the subset of all units in the current period: the filter condition selects all units under the current time and the current classification, and the subset is obtained from that condition. How a metric is aggregated is usually decided by the time dimension together with the metric's application type; since the budget amount is a period amount, data from different periods must be summed when the metric is totaled.
4) Final accounts year-over-year: the year-over-year calculation compares the current value with the value for the same period of the previous year, so the previous-year value must be obtained first. Here the date is offset to obtain the offset data, and the year-over-year value is then calculated.
10. The results of the calculated fields are filled into the corresponding rows of the data page.
11. At this point, the memory structure construction of the data set is completed. Operations such as querying, filtering, and calculating may then be performed on the data set.
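The row-loading part of the steps above (searching or creating dictionary entries, adding index items, and storing dictionary positions in the data page) can be sketched as follows. This is a simplified illustration under stated assumptions: the field names are hypothetical, a Python dict stands in for the HASH index, and a sorted list maintained with `bisect` stands in for the binary tree index on the year.

```python
import bisect
from collections import defaultdict

class DictionaryTable:
    """Key-value dictionary: each distinct dimension value gets one position."""
    def __init__(self):
        self.positions = {}   # value -> position
        self.values = []      # position -> value

    def position_of(self, value):
        """Return the existing position, or create an entry and return its position."""
        pos = self.positions.get(value)
        if pos is None:
            pos = len(self.values)
            self.positions[value] = pos
            self.values.append(value)
        return pos

dictionaries = {"unit_code": DictionaryTable(),    # D1
                "division": DictionaryTable(),     # D2
                "class_code": DictionaryTable()}   # D3
hash_index = defaultdict(list)   # (field, dictionary position) -> row numbers
year_index = []                  # sorted (year, row number) pairs
data_page = []

def load_row(record):
    """Fill dictionary entries, index items and the data item for one record."""
    row_no = len(data_page)
    stored = dict(record)
    for field, table in dictionaries.items():
        pos = table.position_of(record[field])
        stored[field] = pos                       # data page keeps the dictionary position
        hash_index[(field, pos)].append(row_no)
    bisect.insort(year_index, (record["year"], row_no))
    data_page.append(stored)

def rows_in_years(lo, hi):
    """Range lookup on the ordered year index (inclusive bounds)."""
    start = bisect.bisect_left(year_index, (lo, -1))
    end = bisect.bisect_right(year_index, (hi, float("inf")))
    return [row_no for _, row_no in year_index[start:end]]

load_row({"year": 2019, "unit_code": "A0101", "division": "010", "class_code": "C01", "budget": 1900000, "final": 1500000})
load_row({"year": 2018, "unit_code": "A0101", "division": "010", "class_code": "C01", "budget": 1000000, "final": 1200000})
load_row({"year": 2019, "unit_code": "A0102", "division": "010", "class_code": "C01", "budget": 600000, "final": 500000})
```

Storing positions rather than repeated strings is what allows the data page and index page to share one copy of each dimension value.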
As shown in fig. 10, the specific calculation process includes: the data set provides basic computational support, including basic algorithm units, and computational formulas. Various complex operations can be realized through the combination of basic algorithms and formulas.
The basic algorithms include:
1) Filter by field: filtering may be applied to any field, by exact match, range search, or fuzzy match. The calculation module analyzes the filter conditions to decide whether an index can be used. Conditions that can use an index are converted into index lookups first. If the conditions include both indexable and non-indexable conditions, the index is used first to produce an intermediate result, which is then filtered by the non-indexable conditions.
2) Sort by field: one or more sort fields may be specified; the data set is sorted and the new sorted data set is returned. When sorting, the calculation module also reads the index information, and if the sort field is indexable, sorting follows the index.
3) Aggregation: aggregation may be performed on metric-type fields, and the result may be a single record or multiple records. Aggregation is performed over the dimensions passed in for aggregation. If all dimensions are eliminated, the result is a single value; if only some dimensions are aggregated, a multi-record result set is returned. During aggregation, the application type of the metric (period amount, end-of-period amount, or beginning-of-period amount) must be read, because the aggregation mode differs by application type. For a point-in-time or end-of-period amount, the end-of-period data is taken as the aggregation result; for a period amount, the data is summarized with the aggregation method, such as sum or maximum; for a beginning-of-period amount, the beginning-of-period value is taken as the result. In addition, if the eliminated dimensions include a dimension with a hierarchy, that dimension must be eliminated according to the hierarchy.
4) Time period offset: time offset enables data comparison; in particular, year-over-year and period-over-period operations first require an offset based on time. The offset is based on the time dimension field and the time granularity: for example, if the time granularity is month, offsetting by the granularity gives the previous month, and offsetting the year field gives the previous year. The offset period is converted into filter conditions, which are then executed as described in 1) to obtain the offset subset.
5) Data ranking (RANK): data ranking either obtains the ranking value for a specified condition or obtains the corresponding record for a ranking value. Ranking is based on a metric; when the ranking is calculated, all dimensions are compared to find the subsets that share the same dimensions, and ranking is performed within those subsets. Rankings may be continuous or non-continuous.
6) TOP/BOTTOM: takes the first N or the last N records. The data is sorted by the specified metric, and the top N or bottom N results are taken as the subset.
7) Year-over-year / period-over-period: these operations are in fact implemented by combining time offset, filtering, and aggregation; because they are used heavily in practice, they are also provided as a basic operation unit. For the period-over-period operation, the granularity of the time dimension is determined first, which requires reading the field type and time granularity from the meta information. The period is offset by the granularity, a filter condition is generated from the offset, and the offset subset is obtained. A mapping relation is established between the subset and the current data set, located through the dimension indexes. The metric to be evaluated is then divided according to the mapping relation to give the period-over-period result. The year-over-year process is similar and is not described separately.
Besides the basic operators, the data set also provides formula support, and various complex calculation operations can be realized by writing formula expressions.
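The dependence of the aggregation operation in item 3) on the application type of the metric can be illustrated with a short sketch. The type labels, the function name, and the sample values are hypothetical; only the rule itself (end value, summary by aggregation method, or beginning value) follows the description.

```python
def aggregate_over_time(points, application_type, method="sum"):
    """Aggregate (period, value) pairs of one metric across several periods."""
    values = [value for _, value in sorted(points)]   # order by period
    if application_type == "end_of_period":           # point-in-time amount
        return values[-1]
    if application_type == "beginning_of_period":
        return values[0]
    if application_type == "period":                  # period amount
        return {"sum": sum, "max": max}[method](values)
    raise ValueError(f"unknown application type: {application_type}")

# Hypothetical monthly values, deliberately out of order.
monthly = [("201902", 30), ("201901", 10), ("201903", 20)]

period_total = aggregate_over_time(monthly, "period")
period_max = aggregate_over_time(monthly, "period", method="max")
closing = aggregate_over_time(monthly, "end_of_period")
opening = aggregate_over_time(monthly, "beginning_of_period")
```

A revenue figure would typically be a period amount and summed across months, whereas a balance would be taken from the last period rather than summed.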
The invention has the following beneficial effects:
1. Flexible use: the fields and business attributes needed for calculation can be configured freely, and calculation does not depend on an external data source;
2. High calculation efficiency: operating in memory and creating indexes as required by multidimensional analysis scenarios allows queries and summary calculations to complete quickly;
3. Data normalization: external data is managed in a standardized way, various data sources can be adapted, and an intermediate engine structure is abstracted, making external use more convenient;
4. Rich multidimensional model: multidimensional analysis capabilities such as dimensions, metrics, period amounts, point-in-time amounts, and time period offset calculation are provided, so business logic can be implemented quickly.
Fig. 11 is a schematic structural diagram of a data processing apparatus based on a memory according to an embodiment of the present invention, where the apparatus includes:
a meta-information module 10, configured to load data acquired from the data interface line by line into a data set model pre-stored in the memory, and determine meta-information corresponding to each line of data;
the data set module 20 is configured to determine, according to the meta information, the index entries and dictionary entries corresponding to each row of data, and to input the meta information, the index entries and the dictionary entries into the data set model;
and the data calculation module 30 is configured to perform data calculation by using the data set model to obtain a calculation result.
As an embodiment of the present invention, the meta information includes a basic attribute and an extended attribute; wherein the basic attribute comprises a data name and a data type; the extended attributes include field type, key column, name column, aggregation type, application type, time granularity, data format, display format, unit dimension flag, and hierarchy.
In this embodiment, the data set module is further configured to determine, according to the field type and the data type, an index entry corresponding to each row of data.
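How the field type and data type might drive index selection can be sketched as a small rule function. The attribute names (`field_type`, `data_type`, `role`, `hierarchy`) and the returned labels are assumptions made for illustration; the rules themselves restate the index creation rules of the method embodiment.

```python
DIMENSION_TYPES = {"time_dimension", "common_dimension", "unit_dimension"}

def choose_indexes(meta):
    """Return the index kinds to create for one field from its meta information."""
    kinds = []
    field_type = meta["field_type"]
    if field_type not in DIMENSION_TYPES:
        return kinds                      # only dimension fields are indexed automatically
    if field_type == "time_dimension":
        if meta["data_type"] == "number":
            kinds.append("binary_tree")   # fast positioning and range lookup
        elif meta["data_type"] == "string":
            kinds.append("prefix_tree")
    elif meta.get("role") == "key":
        kinds.append("hash")              # exact-match retrieval on key columns
    elif meta.get("role") == "name":
        kinds.append("prefix_tree")       # name lookups are mostly fuzzy searches
    if meta.get("hierarchy"):
        kinds.append("hierarchy_tree")    # node positioning and path searching
    return kinds

period = choose_indexes({"field_type": "time_dimension", "data_type": "string"})
year = choose_indexes({"field_type": "time_dimension", "data_type": "number"})
unit_code = choose_indexes({"field_type": "unit_dimension", "data_type": "string",
                            "role": "key", "hierarchy": True})
unit_name = choose_indexes({"field_type": "unit_dimension", "data_type": "string",
                            "role": "name"})
budget = choose_indexes({"field_type": "metric", "data_type": "number"})
```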
In this embodiment, the apparatus further comprises: and the hierarchy module is used for carrying out hierarchy construction on the data according to the hierarchy.
As an embodiment of the present invention, the data calculation module is further configured to select an algorithm corresponding to the meta information according to the meta information in the data set model to perform data calculation, so as to obtain a calculation result; wherein the calculation result comprises a multi-valued result or a single-valued result.
Based on the same inventive concept as the memory-based data processing method, the invention also provides the memory-based data processing device. Since the device solves the problem on a principle similar to that of the memory-based data processing method, the implementation of the device can refer to the implementation of the method, and the common details are not described again.
The invention utilizes the meta information of the data, combines the actual data, gives more meanings to the data, effectively enhances the data analysis and calculation capability, supports the multidimensional analysis scene, has high calculation efficiency, and realizes higher access performance through the data set structure defined in the memory.
The invention also provides an electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the method when executing the program.
The present invention also provides a computer-readable storage medium storing a computer program for executing the above method.
As shown in fig. 12, the electronic device 600 may further include: communication module 110, input unit 120, audio processing unit 130, display 160, power supply 170. It is noted that the electronic device 600 does not necessarily include all of the components shown in fig. 12; furthermore, the electronic device 600 may also comprise components not shown in fig. 12, which may be referred to in the prior art.
As shown in fig. 12, the central processor 100, sometimes referred to as a controller or operational control, may include a microprocessor or other processor device and/or logic device, the central processor 100 receiving input and controlling the operation of the various components of the electronic device 600.
The memory 140 may be, for example, one or more of a buffer, a flash memory, a hard drive, a removable media, a volatile memory, a non-volatile memory, or other suitable device. The information relating to the failure may be stored, and a program for executing the information may be stored. And the central processing unit 100 may execute the program stored in the memory 140 to realize information storage or processing, etc.
The input unit 120 provides input to the cpu 100. The input unit 120 is, for example, a key or a touch input device. The power supply 170 is used to provide power to the electronic device 600. The display 160 is used to display an object to be displayed, such as an image or a character. The display may be, for example, an LCD display, but is not limited thereto.
The memory 140 may be a solid state memory such as Read Only Memory (ROM), Random Access Memory (RAM), a SIM card, or the like. There may also be a memory that holds information even when power is off, can be selectively erased, and is provided with more data, an example of which is sometimes called an EPROM or the like. The memory 140 may also be some other type of device. Memory 140 includes buffer memory 141 (sometimes referred to as a buffer). The memory 140 may include an application/function storage section 142, and the application/function storage section 142 is used to store application programs and function programs or a flow for executing the operation of the electronic device 600 by the central processing unit 100.
The memory 140 may also include a data store 143, the data store 143 for storing data, such as contacts, digital data, pictures, sounds, and/or any other data used by the electronic device. The driver storage portion 144 of the memory 140 may include various drivers of the electronic device for communication functions and/or for performing other functions of the electronic device (e.g., messaging application, address book application, etc.).
The communication module 110 is a transmitter/receiver 110 that transmits and receives signals via an antenna 111. The communication module (transmitter/receiver) 110 is coupled to the central processor 100 to provide an input signal and receive an output signal, which may be the same as in the case of a conventional mobile communication terminal.
Based on different communication technologies, a plurality of communication modules 110, such as a cellular network module, a bluetooth module, and/or a wireless local area network module, may be provided in the same electronic device. The communication module (transmitter/receiver) 110 is also coupled to a speaker 131 and a microphone 132 via an audio processor 130 to provide audio output via the speaker 131 and receive audio input from the microphone 132 to implement general telecommunications functions. Audio processor 130 may include any suitable buffers, decoders, amplifiers and so forth. In addition, an audio processor 130 is also coupled to the central processor 100, so that recording on the local can be enabled through a microphone 132, and so that sound stored on the local can be played through a speaker 131.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The principle and the implementation mode of the invention are explained by applying specific embodiments in the invention, and the description of the embodiments is only used for helping to understand the method and the core idea of the invention; meanwhile, for a person skilled in the art, according to the idea of the present invention, there may be variations in the specific embodiments and the application scope, and in summary, the content of the present specification should not be construed as a limitation to the present invention.

Claims (12)

1. A memory-based data processing method, the method comprising:
loading data acquired from a data interface to a data set model stored in a memory in advance line by line, and determining meta information corresponding to each line of data;
determining index items and dictionary entries corresponding to each row of data according to the meta information, and inputting the meta information, the index items and the dictionary entries into the data set model;
and performing data calculation by using the data set model to obtain a calculation result.
2. The method of claim 1, wherein the meta information comprises a base attribute and an extended attribute; the basic attribute comprises a data name and a data type; the extended attributes include field type, build column, name column, aggregation type, application type, time granularity, data format, display format, unit dimension tag, and level.
3. The method of claim 2, wherein determining an index entry according to the meta information comprises: and determining the index items corresponding to the data of each row according to the field types and the data types.
4. The method of claim 2, further comprising: and carrying out hierarchical construction on the data according to the hierarchy.
5. The method of claim 1, wherein performing data calculations using the data set model to obtain a calculation comprises: selecting an algorithm corresponding to the meta-information to perform data calculation according to the meta-information in the data set model to obtain a calculation result; wherein the calculation result comprises a multi-valued result or a single-valued result.
6. A memory-based data processing apparatus, the apparatus comprising:
the meta-information module is used for loading the data acquired from the data interface into a data set model stored in a memory in advance line by line and determining meta-information corresponding to each line of data;
the data set module is used for determining index items and dictionary entries corresponding to each row of data according to the meta information and inputting the meta information, the index items and the dictionary entries into the data set model;
and the data calculation module is used for performing data calculation by using the data set model to obtain a calculation result.
7. The apparatus of claim 6, wherein the meta information comprises a base attribute and an extended attribute; the basic attribute comprises a data name and a data type; the extended attributes include field type, build column, name column, aggregation type, application type, time granularity, data format, display format, unit dimension tag, and level.
8. The apparatus of claim 7, wherein the data set module is further configured to determine an index entry corresponding to each row of data according to the field type and the data type.
9. The apparatus of claim 7, further comprising: and the hierarchy module is used for carrying out hierarchy construction on the data according to the hierarchy.
10. The device according to claim 6, wherein the data calculation module is further configured to select an algorithm corresponding to the meta information according to the meta information in the data set model to perform data calculation, so as to obtain a calculation result; wherein the calculation result comprises a multi-valued result or a single-valued result.
11. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the method of any one of claims 1 to 5 when executing the program.
12. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program for executing the method of any one of claims 1 to 5.
CN202011620449.XA 2020-12-30 2020-12-30 Data processing method and device based on memory Pending CN112667859A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011620449.XA CN112667859A (en) 2020-12-30 2020-12-30 Data processing method and device based on memory

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011620449.XA CN112667859A (en) 2020-12-30 2020-12-30 Data processing method and device based on memory

Publications (1)

Publication Number Publication Date
CN112667859A true CN112667859A (en) 2021-04-16

Family

ID=75412122

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011620449.XA Pending CN112667859A (en) 2020-12-30 2020-12-30 Data processing method and device based on memory

Country Status (1)

Country Link
CN (1) CN112667859A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113918541A (en) * 2021-12-13 2022-01-11 广州市玄武无线科技股份有限公司 Preheating data processing method and device and computer readable storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105183917A (en) * 2015-10-15 2015-12-23 国家电网公司 Multi-dimensional analysis method for multi-level storage data
CN105608155A (en) * 2015-12-17 2016-05-25 北京华油信通科技有限公司 Massive data distributed storage system
CN107977446A (en) * 2017-12-11 2018-05-01 江苏润和软件股份有限公司 A kind of memory grid data load method based on data partition
CN109213772A (en) * 2018-09-12 2019-01-15 华东师范大学 Date storage method and NVMe storage system


Similar Documents

Publication Publication Date Title
CN106997386B (en) OLAP pre-calculation model, automatic modeling method and automatic modeling system
US11520760B2 (en) System and method for providing bottom-up aggregation in a multidimensional database environment
US20180210934A1 (en) Systems and methods for interest-driven business intelligence systems including event-oriented data
CN110618983A (en) JSON document structure-based industrial big data multidimensional analysis and visualization method
CN110929042B (en) Knowledge graph construction and query method based on power enterprise
CN101916261B (en) Data partitioning method for distributed parallel database system
CN107818115B (en) Method and device for processing data table
US7908242B1 (en) Systems and methods for optimizing database queries
CN107103032B (en) Mass data paging query method for avoiding global sequencing in distributed environment
US9747349B2 (en) System and method for distributing queries to a group of databases and expediting data access
CN112269792B (en) Data query method, device, equipment and computer readable storage medium
CN103577440A (en) Data processing method and device in non-relational database
CN105631003A (en) Intelligent index establishing, inquiring and maintaining method supporting mass data classification and counting
CN104765731A (en) Database query optimization method and equipment
US20150081353A1 (en) Systems and Methods for Interest-Driven Business Intelligence Systems Including Segment Data
CN102867066A (en) Data summarization device and data summarization method
WO2022241813A1 (en) Graph database construction method and apparatus based on graph compression, and related component
CN115905630A (en) Graph database query method, device, equipment and storage medium
CN114064660B (en) Data structured analysis method based on ElasticSearch
CN113704248B (en) Block chain query optimization method based on external index
CN112667859A (en) Data processing method and device based on memory
CN116719822B (en) Method and system for storing massive structured data
CN110389953B (en) Data storage method, storage medium, storage device and server based on compression map
CN111125045B (en) Lightweight ETL processing platform
CN110321388B (en) Quick sequencing query method and system based on Greenplum

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination