CN113282627A - Intelligent analysis method for big intelligence data - Google Patents

Intelligent analysis method for big intelligence data

Info

Publication number
CN113282627A
Authority
CN
China
Prior art keywords
query
column
nodes
data
column block
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110636964.5A
Other languages
Chinese (zh)
Inventor
梁斌
魏保磊
史国恩
刘娟
梁芳
支敏
刘波
孟罡
张有为
许光辉
刘振杰
胡建峰
郭波
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Henan Huazhengtong Information Technology Co ltd
Original Assignee
Henan Huazhengtong Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Henan Huazhengtong Information Technology Co ltd
Priority to CN202110636964.5A
Publication of CN113282627A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2455 Query execution
    • G06F16/24553 Query execution of query operations
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/27 Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/213 Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
    • G06F18/2135 Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods based on approximation criteria, e.g. principal component analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00 Arrangements for software engineering
    • G06F8/40 Transformation of program code
    • G06F8/41 Compilation
    • G06F8/42 Syntactic analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Computing Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides an intelligent analysis method for big intelligence data, which comprises the following steps: providing a column block storage area in an object storage cluster, the column block storage area having a striped partition data table; transmitting the column blocks from the object storage cluster to a plurality of parsing nodes of an intelligence analysis server; receiving a query request from a client so that the plurality of parsing nodes perform distributed parallel processing on queries over the data table divided into column blocks; caching a subset of column blocks associated with the query request so as to execute the transaction sequence of the query request on the cached subset of column blocks; and acquiring the execution results of the transaction sequences of the query request executed by the plurality of parsing nodes, the current parsing node combining the execution results of the transaction sequences executed by the other parsing nodes. The invention reduces data transmission among the nodes of the column block storage area, provides greater flexibility and efficiency in query processing, and is particularly suitable for data mining in commercial intelligence applications.

Description

Intelligent analysis method for big intelligence data
Technical Field
The invention belongs to the field of big data analysis, and particularly relates to an intelligent analysis method for big intelligence data.
Background
With the rapid development of the mobile internet, the volume of commercial intelligence data has grown explosively. Analyzing the business behavior of partners to uncover potential market demand and risk is an effective means of improving enterprise value and operational performance. However, the large data volume poses challenges to conventional data analysis and processing techniques, and existing distributed storage systems cannot be seamlessly extended to business intelligence analysis and mining applications that require large amounts of data to be retrieved and processed quickly and efficiently. For example, for terabytes of data, distributed storage systems typically store table data in a particular format, such as a horizontally partitioned table spread across multiple servers, each holding a subset of the data rows of a table in an object storage cluster. Data retrieval in such a system is handled by retrieving a row together with its associated columns, and if a failed transaction is encountered, the system must roll back to the state before that transaction. Such distributed storage system implementations have proven inefficient for updating large data sets on the order of terabytes.
Disclosure of Invention
In a first aspect, the invention provides an intelligent analysis method for big intelligence data, which comprises the following steps:
providing a column block storage area in an object storage cluster, the column block storage area having a striped partition data table; transmitting the column chunks from the object storage cluster to a plurality of parsing nodes of an intelligence analysis server;
receiving a query request from a client to enable the plurality of parsing nodes to perform distributed parallel processing on the query of the data table divided into column blocks;
wherein the intelligence analysis server further comprises: a query parser for verifying the syntax of the query request; a semantic analyzer for verifying the semantic content of the query request; and an optimizer for determining a transaction sequence of the query request for distributed parallel processing by the plurality of parsing nodes;
caching a subset of column blocks associated with the query request to execute a sequence of transactions of the query request in the cached subset of column blocks;
and acquiring the execution results of the transaction sequences of the query request executed by the plurality of parsing nodes, the current parsing node combining the execution results of the transaction sequences executed by the other parsing nodes.
Preferably, the plurality of parsing nodes of the intelligence analysis server comprise first-layer, second-layer and third-layer parsing nodes, wherein:
the first-layer parsing nodes receive query requests for processing intelligence data and determine the transaction sequences corresponding to the queries, wherein the intelligence data is stored in row blocks in advance and cached among the plurality of parsing nodes;
the second-layer parsing nodes receive a translated query from the first-layer parsing nodes, the translated query indicating that the second-layer parsing nodes are to trigger distributed parallel processing of the query;
the third-layer parsing nodes execute the transaction sequence corresponding to the translated query.
Preferably, the plurality of parsing nodes perform distributed parallel processing on the query request of the data table divided into column blocks, further comprising:
receiving a query for a column block in a data table, traversing a hierarchical tree structure comprising a plurality of nodes, identifying a set of leaf nodes of the hierarchical tree structure based on a column block ID, processing the query for the column blocks of the data table associated with the set of leaf nodes; and generating the query result on that basis.
Preferably, each leaf node in the hierarchical tree structure is associated with a column block in the data table, each leaf node in the hierarchical tree structure comprising data representing a superset of values of the column block associated with the leaf node, each non-leaf node comprising data representing a superset of values described by data in a child node of the non-leaf node.
Preferably, the value of each column of the data table is a long integer value, wherein the data of each leaf node in the hierarchical tree structure represents the values of the column block associated with that leaf node according to the numerical range of the long integer values.
Preferably, the data of each non-leaf node in the hierarchical tree structure represents the superset in terms of a numerical range of the long integer value.
Preferably, the processing the query request for the column block set includes:
for each column block in the set of column blocks, determining whether a numerical range of a leaf node associated with the current column block is a subset of the numerical range of the query request;
including the column block in the query result if the numerical range of the leaf node associated with the column block is a subset of the numerical range of the query request.
Preferably, the value of each column of the data table is a string value.
Preferably, the data of each leaf node in the hierarchical tree structure represents values of column blocks associated with the leaf node according to a Cuckoo filter generated from string values of column blocks associated with the leaf node and a set of hash functions.
Preferably, the data of each non-leaf node in the hierarchical tree structure represents a superset in terms of values of associated column blocks of child nodes of the non-leaf node, the values of the associated column blocks of the child nodes of the non-leaf node being generated from the Cuckoo filters and the hash function sets of the child nodes of the non-leaf node.
Compared with the prior art, the invention has the following advantages:
the invention reduces data transmission among a plurality of nodes in the column block storage area, provides more flexibility and efficiency in the aspect of processing query, and is particularly suitable for data mining of commercial intelligence application.
Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by the practice of the invention. The objectives and other advantages of the invention will be realized and attained by the structure particularly pointed out in the written description and claims hereof as well as the appended drawings.
Drawings
To illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and it is obvious for those skilled in the art that other drawings can be obtained based on these drawings without creative efforts.
Fig. 1 shows a flow chart of an intelligence big data intelligent analysis method according to the invention.
Fig. 2 shows a schematic diagram of a distributed intelligence data query analysis system according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The present invention provides a scheme for informative data query analysis in a distributed column block storage area. The distributed column block storage area is provided by an object storage cluster connected to clients. The object storage cluster includes:
a data transformation engine for dividing the data table into column blocks for distribution across a plurality of servers of the object storage cluster;
a storage sharing engine for storing the column block when performing semantic operations on the column block; and a storage service engine for striping column blocks of the data table across the object storage cluster.
A plurality of client applications are remotely connected to the object storage cluster. The applications include an application for querying a column block storage area to retrieve data for executing business intelligence.
The data conversion engine includes:
a loading unit for importing data into the data table divided into column blocks;
a parsing unit for receiving query requests for column block data;
a metadata unit for managing metadata of the column blocks;
a transaction unit for maintaining the integrity of column block semantic operation information; and
a storage service unit for receiving storage service requests and forwarding them to the storage service engine for execution.
The storage service engine is further to compress the column chunks prior to storing the column chunk data, and to send the plurality of compressed column chunks to other storage nodes of the object storage cluster.
When the data table is divided into column blocks, a plurality of columns are used as keys, and the table is divided by various partitioning methods, including range partitioning, list partitioning, hash partitioning and combinations thereof. A preset storage policy is applied to specify how the data table is partitioned, so that column blocks are distributed among a plurality of storage nodes; this includes the number of column blocks to be generated. The storage policy further specifies the desired level of redundancy, so that column blocks can be recovered after a failure of the object storage cluster that stores them, as well as the manner in which column blocks are allocated to the available object storage clusters. Storage policies correspond to data tables, i.e. each data table may have its own storage policy, and may specify how the table is divided into column blocks, the level of redundancy used for recovering from failures of multiple servers, and/or the manner in which its column blocks are allocated among the object storage clusters.
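By way of illustration only, a per-table storage policy of the kind described above might be recorded as a simple configuration object; the sketch below is an assumption introduced here for clarity, and the field names are not defined by this description.

```python
# Hypothetical per-table storage policy (illustrative only; field names are assumptions).
storage_policy = {
    "table": "fact_collection_transactions",
    "partitioning": {
        "keys": ["customer_id", "txn_date"],   # columns used as partition keys
        "method": "hash",                      # range | list | hash | composite
        "column_blocks": 64,                   # number of column blocks to generate
    },
    "redundancy_level": 2,                     # parity column blocks kept for failure recovery
    "placement": "round_robin",                # how column blocks are allocated across storage nodes
}
```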
The method of the present invention further supports query analysis of data stored in the distributed column block storage area. The query analysis is performed by the object storage cluster of the column block storage area or by an intelligence analysis server connected to it. The intelligence analysis server includes a query parser for validating query syntax, a semantic analyzer for validating query semantic content, an optimizer for optimizing the query transaction sequence, and a query executor for executing the query.
The object storage cluster and/or intelligence analysis server that receives the query request first dynamically determines its own hierarchy and then translates the query for distributed parallel processing. The plurality of storage nodes then process the translated query in parallel, combine the intermediate results obtained during the distributed parallel processing, and return them to the requester.
Preferably, the loading unit and the parsing unit communicate with the metadata unit and the transaction unit using an interprocess communication mechanism. These units, in turn, communicate with the storage service unit to request services, such as retrieving column blocks and loading them into the storage sharing engine. The storage service unit receives storage read and write requests and passes them to the storage service engine to perform query requests.
The metadata unit provides services for configuring the object storage cluster and manages metadata of the data conversion engine and the column block storage area. The metadata includes a data table reflecting the current state of the system, including the name of each storage node configured in the system, the load on each storage node, the bandwidth between storage nodes, and a number of other variables maintained in the data table. The transaction unit is responsible for maintaining the active transactions in the system and informing the metadata unit to update or commit metadata related to a transaction. The transactions include semantic operations performed on the data, including data loading, data optimization, data retrieval, updating existing data tables, creating new tables, modifying data schemas, generating new storage policies, partitioning data tables, and recording column block distributions in the object storage clusters.
The storage sharing engine includes a storage tier metadata and a column block. The storage tier metadata includes information about the physical storage, such as the file name and server name to which the column block belongs, the compressed size of the column block, the uncompressed original size of the column block, the parity bits of the column block, and whether the column block is corrupted in disk storage. The storage service engine generates storage tier metadata by using the storage node configuration data to establish physical storage for the column blocks.
The architecture of the present invention is particularly applicable to commercial intelligence data mining applications that require long-term storage of large volumes of data. For example, a large fact table of collection transactions for customers logging into an enterprise website may be partitioned into column blocks, and a column block may represent a column in a data table partitioned using multiple columns as keys.
In a preferred embodiment, a single data table is first partitioned into multiple data tables, and column-wise partitioning is then performed on the resulting data tables to generate column blocks. After dividing the data table into column blocks, the object storage cluster allocates the column blocks among its storage nodes; the column blocks of a data table support striping across the storage nodes of the object storage cluster. Then, in order to recover from storage node failures, the redundancy level can be determined when the column blocks are allocated, so that multi-level redundancy is achieved. For example, column block parity bits are computed and stored to recover from failures of storage nodes.
Said calculating and storing of column block parity bits further comprises: performing a bitwise XOR operation on two column blocks to generate a parity column block; and performing a bitwise XOR operation on the parity column block and the binary representation of a further column block to compute a parity column block covering three column blocks, the resulting parity column block being assigned to other storage nodes. Any number of parity column blocks can be computed and assigned to the object storage cluster through the above process for recovery from failures of the object storage cluster. Where two column blocks of unequal length are combined, the shorter column block is padded with 0s to the length of the other column block before the bitwise XOR operation is performed.
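By way of illustration only, the parity computation and zero-padding rule described above can be sketched as follows; the function name is an assumption introduced here.

```python
def xor_parity(blocks):
    """Compute a parity column block as the bitwise XOR of the given column blocks.

    Shorter blocks are padded with 0s to the length of the longest block before
    the XOR, matching the padding rule described above (illustrative sketch only).
    """
    width = max(len(b) for b in blocks)
    padded = [b.ljust(width, b"\x00") for b in blocks]
    parity = bytearray(width)
    for block in padded:
        for i, byte in enumerate(block):
            parity[i] ^= byte
    return bytes(parity)

# Usage: a parity block over three column blocks; a single lost block can be
# reconstructed by XOR-ing the parity block with the remaining blocks.
p = xor_parity([b"\x01\x02\x03", b"\x10\x20", b"\xff"])
recovered_first = xor_parity([p, b"\x10\x20", b"\xff"])  # equals b"\x01\x02\x03"
```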
And when the distribution of the column blocks in the object storage clusters is determined, storing the column blocks on the allocated storage nodes, and finishing the storage processing among the object storage clusters in the column block storage area.
Preferably, data field compression is applied to the column blocks before they are stored on the allocated storage nodes. Data field compression means applying a compression scheme tailored to a particular data type. Dividing the data table into column blocks facilitates such domain-specific compression, given that the values in a column block typically share the same data type and/or data domain. For example, directly compressing addresses stored as strings is inefficient, whereas by breaking the address field into multiple subfields, each subfield can be represented as a separate subcolumn with a specific data type that compresses better. In another example, for a parameter list of key-value pairs, the key-value pairs are decomposed into individual column blocks, each representing a value of a particular data type, to which numeric-range-based compression can then be applied.
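By way of illustration only, the field-decomposition idea can be sketched as follows; the subfield layout and function name are assumptions introduced here, not a format defined by this description.

```python
def split_address_column(addresses):
    """Split an address string column into typed subcolumns that compress better
    individually than the raw strings do together (illustrative sketch only)."""
    streets, cities, zip_codes = [], [], []
    for addr in addresses:
        street, city, zip_code = (part.strip() for part in addr.split(",", 2))
        streets.append(street)           # string subcolumn (dictionary-encodable)
        cities.append(city)              # low-cardinality string subcolumn
        zip_codes.append(int(zip_code))  # numeric subcolumn (range/delta-encodable)
    return {"street": streets, "city": cities, "zip": zip_codes}

subcolumns = split_address_column([
    "1 Main St, Springfield, 62704",
    "9 Elm Ave, Springfield, 62701",
])
```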
After storing the data table in the column block storage area, the client application sends a query for column block data in the column block storage area. A data query request is first received. The object storage cluster and/or intelligence analysis server translates the query request into a plurality of queries that are distributed and processed in parallel, which may be referred to as query transaction sequences, each transaction sequence comprising a subset of the distributed query requests for executing the original query. The plurality of query transaction sequences are then processed by the plurality of storage nodes, respectively, and the query results are combined to parse the query request.
If the intelligence analysis server includes multiple parsing nodes, the hierarchy of the intelligence analysis server itself must be determined first. The intelligence analysis server hierarchy includes multiple layers of parsing nodes. The interface layer parsing nodes act as gateway servers, interact with the client applications, and may be connected to the intermediate layer nodes. The intermediate layer parsing nodes facilitate processing of queries translated for distributed parallel processing and are connected to core layer parsing nodes, which process queries translated for particular column blocks. Each parsing node dynamically configures the intelligence analysis server hierarchy for executing the transaction sequences corresponding to queries translated for distributed parallel processing, and provides instructions for the distributed parallel processing of those queries.
As described above, the parsing node includes a query parser, a semantic analyzer, an optimizer, and a query executor. The query parser parses the received query and validates the syntax of the query. The semantic analyzer then verifies the semantic content of the query by verifying the data tables or columns referenced by the query. The optimizer determines and optimizes a transaction sequence corresponding to the query such that query execution is assigned to a core layer node that caches a subset of the column blocks referenced by the query request. The query executor executes the translated query.
The hierarchy further includes a column block caching function to speed up query analysis and reduce input and output. If the column blocks needed to process the query are available in the interface layer node's cache, the interface layer node processes the query itself. Otherwise, the interface layer node decides to distribute the query: it translates the query, designates the core layer nodes whose results are to be combined, provides instructions for processing the query, and sends the query to the intermediate layer nodes. Each intermediate layer node determines whether the instructions indicate that a core layer node needs to process the translated query and, if so, sends the translated query to that core layer node. When the core layer nodes have processed the query transactions, the results are sent to the intermediate layer nodes, which combine the results of the transaction sequences executed by the core layer nodes. This configuration allows the intelligence analysis server to flexibly decide which query nodes can process query transactions most efficiently and reduces the transmission of column blocks.
The method further includes determining, based on the location of the cached column blocks referenced by the query, which parsing nodes are required as candidate core layer nodes for executing the translated query transactions. After determining the number of core layer nodes for processing the query transactions, instructions are provided for processing the query transactions at the selected core layer nodes.
Then, the number of intermediate results of the parallel query processing is determined; for example, the size of the result data table is calculated for processing the translated query over the subset of hashed column blocks. Intermediate layer nodes in the intelligence analysis server hierarchy are then designated to combine the intermediate results. After deciding to combine the intermediate results of processing the translated query and allocating a plurality of parsing nodes to combine them, instructions are determined for sending the results from the core layer nodes to the intermediate layer nodes.
In yet another embodiment of the present invention, the column block is further stored in a hierarchical tree. I.e., each leaf node in the hierarchical tree structure is associated with a column block in the partition data table. Upon receiving a user query for a column block in a data table, a hierarchical tree structure comprising a plurality of nodes arranged in a plurality of levels is traversed to identify a set of leaf nodes of the hierarchical tree structure based on a column block ID. Wherein each leaf node in the hierarchical tree structure includes data representing a superset of the column block associated with the current leaf node. Each non-leaf node includes data representing a superset represented by data in the child nodes of the non-leaf node. A query request for a set of column blocks of a data table associated with a leaf node is then processed, and query results are generated.
Each column value of the data table includes a long integer value. The data of each leaf node in the hierarchical tree structure may represent the values of the column blocks associated with the leaf nodes in a long integer numerical range. The data for each non-leaf node in the hierarchical tree structure may represent a superset in a range of values. For each column block in the set of column blocks, determining whether a range of long integer values of a leaf node associated with the column block is a subset of a range of long integer values of the query; including a column block in a result of the query if a numerical range of a leaf node associated with the column block is a subset of the numerical range of the query; and adding column block data for values within the numeric range of the query to the query result if the numeric range of the leaf node associated with the column block is not a subset of the numeric range of the query.
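By way of illustration only, the range-based pruning over the hierarchical tree can be sketched as follows, assuming each node carries a [min, max] range of long integer values; the class and function names are assumptions introduced here.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class TreeNode:
    lo: int                                   # minimum value covered by this node
    hi: int                                   # maximum value covered by this node
    block_id: Optional[str] = None            # set only on leaf nodes
    children: List["TreeNode"] = field(default_factory=list)

def query_blocks(node, q_lo, q_hi, result):
    """Collect column blocks whose value range overlaps the query range [q_lo, q_hi]."""
    if node.hi < q_lo or node.lo > q_hi:
        return result                          # no overlap: prune this subtree
    if node.block_id is not None:              # leaf node: one column block
        fully_inside = q_lo <= node.lo and node.hi <= q_hi
        result.append((node.block_id, fully_inside))
        return result
    for child in node.children:                # a non-leaf range is a superset of its children
        query_blocks(child, q_lo, q_hi, result)
    return result

# Blocks marked fully_inside=True are included wholesale; the others must still
# be scanned for the individual values that fall inside the query range.
root = TreeNode(0, 99, children=[TreeNode(0, 49, "blk-0"), TreeNode(50, 99, "blk-1")])
print(query_blocks(root, 40, 60, []))          # [('blk-0', False), ('blk-1', False)]
```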
Optionally, each column value of the data table further comprises a string value. The data of each leaf node in the hierarchical tree structure represents values of column blocks associated with the leaf node according to a Cuckoo filter generated from string values of the column blocks associated with the leaf node and a set of hash functions. Data for each non-leaf node in the hierarchical tree structure may be generated according to a Cuckoo filter and a hash function set from child nodes of the non-leaf node. Processing of the query of the set of column blocks may include, for each column block in the set of column blocks, traversing data in the column block; and including data in a column block in the query result, the column block having a value in a column of the data table that matches the value of the column block ID. Preferably, the generating query results comprises aggregating results from the query in each column block. The values of the column block IDs are added to the superset of values represented by the data of each leaf node.
In particular, the data conversion engine generates a striping file from data stored in a data table. An example generation process for a striped file includes: the data transformation engine accesses the partition data table. The data conversion engine then divides the table into row blocks and then divides the row blocks into individual column blocks. After the row and column blocks are obtained, the data conversion engine generates relevant key value metadata describing each column of data. The key value metadata includes the ID of the column block, the minimum value and the maximum value in the column block. The striped file is then constructed based on the column blocks and the generated key-value metadata. The data conversion engine determines an uncompressed raw size in each key-value metadata according to the number of bytes, and formats the key-value metadata using the determined uncompressed raw size and writes it into the striped file. After writing key-value metadata to the striped file, the data conversion engine stores the column blocks correspondingly in the striped file.
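By way of illustration only, the key-value metadata generation for one row block can be sketched as follows; the concrete identifiers and field names are assumptions introduced here rather than the striped-file layout itself.

```python
def build_key_value_metadata(row_block, row_block_index):
    """Produce per-column-block key-value metadata (ID, minimum, maximum) for one row block.

    'row_block' maps each column name to the list of values it holds in this row
    block; the block IDs formed here are hypothetical (illustrative sketch only).
    """
    metadata = []
    for column, values in row_block.items():
        metadata.append({
            "column_block_id": f"{column}#{row_block_index}",
            "min": min(values),
            "max": max(values),
            "row_count": len(values),
        })
    return metadata

meta = build_key_value_metadata({"amount": [12, 7, 310], "customer_id": [5, 5, 9]}, 0)
```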
For each column block, the data conversion engine encodes the data values using a long integer encoding algorithm; that is, it automatically selects between the following two long integer encoding algorithms by estimating the number of bytes each would use and choosing the one with the smaller estimate. In the first encoding algorithm, the minimum number of bits required to represent the range of values in the long integer sequence is determined, and two VLQ-coded values store the array length and the number of bits used to represent each value. The number of bytes to be used is estimated according to the following formula:
b2b(N * max(bit_rq(max_v), bit_rq(min_v)))
where N is the number of values in the long integer sequence; b2b() converts a bit count into the number of bytes required to store it; max() is the maximum function; bit_rq() gives the minimum number of bits required to represent a value; max_v is the maximum value in the long integer sequence; and min_v is the minimum value in the long integer sequence.
In the second encoding algorithm, VLQ coding stores the first value in the long integer sequence, and each subsequent value is stored as its difference from the preceding value. The number of bytes to be used is estimated according to the following formula:
vlq_byte(v1) + num_rpδ * vlq_byte(max_δ)
where v1 is the first value in the long integer sequence; vlq_byte() returns the number of bytes required to store a value using VLQ coding; num_rpδ is the number of repeated differences between successive long integer values in the sequence; and max_δ is the maximum difference between consecutive long integer values in the sequence.
The data conversion engine selects the long integer encoding algorithm with the lowest estimated number of bytes and encodes the long integer values in the column block. After encoding, the data conversion engine writes the value identifying the selected long integer encoding algorithm into the column block as a VLQ-coded value.
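By way of illustration only, the algorithm-selection step can be sketched as follows, assuming non-negative values and a standard 7-bits-per-byte VLQ; the helper names are assumptions introduced here, and the second estimate simply counts all successive differences rather than only the repeated ones.

```python
def vlq_bytes(value):
    """Number of bytes a non-negative integer needs under 7-bits-per-byte VLQ coding."""
    n = 1
    while value >= 0x80:
        value >>= 7
        n += 1
    return n

def estimate_bitpacked(values):
    """First algorithm: fixed-width bit packing over the value range, rounded up to bytes."""
    bits_per_value = max(max(values).bit_length(), min(values).bit_length(), 1)
    return (len(values) * bits_per_value + 7) // 8      # b2b(): bits rounded up to bytes

def estimate_delta_vlq(values):
    """Second algorithm: VLQ-coded first value plus VLQ-coded successive differences."""
    deltas = [abs(b - a) for a, b in zip(values, values[1:])]
    if not deltas:
        return vlq_bytes(values[0])
    return vlq_bytes(values[0]) + len(deltas) * vlq_bytes(max(deltas))

def pick_encoding(values):
    """Select the long integer encoding with the smaller estimated byte count."""
    a, b = estimate_bitpacked(values), estimate_delta_vlq(values)
    return ("bit_packed", a) if a <= b else ("delta_vlq", b)

print(pick_encoding([1000, 1001, 1002, 1005, 1007]))    # small deltas favour delta_vlq
```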
The data transformation engine then generates column metadata for each column in the data table. The column metadata includes the type of data stored in the column, the offset of each column block, the encoding algorithms used to encode the values in the column, the name of the column, and the number of null values per column. After generating the column metadata, the data conversion engine compresses it using lossless compression, determines whether the ratio of the compressed column metadata size to its original size is below a threshold, and accordingly stores either the compressed or the original uncompressed column metadata in the file. The data conversion engine then determines the uncompressed original size of the column metadata in bytes, formats the column metadata using the determined uncompressed original size, and writes it to the striped file.
When the intelligence analysis server receives a query request from a user, a transaction sequence is generated based on the query request. To execute the transaction sequence corresponding to the query request, the parsing node applies a filter to the columns in the data table and determines the row blocks that need to be accessed based on the key-value metadata. If the value range defined by the minimum and maximum values of the current row block does not fall within the value range defined by the query, the current row block is skipped. Then, all rows in the row blocks included in the query result are determined, and it is determined whether the numerical range defined by the minimum and maximum values of the current column block falls within the numerical range defined by the user query.
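By way of illustration only, the row-block skipping step can be sketched as follows over key-value metadata of the shape produced in the earlier sketch; that layout is an assumption introduced here.

```python
def row_blocks_to_scan(key_value_metadata, column, q_lo, q_hi):
    """Return the column block IDs for 'column' whose [min, max] range overlaps the
    query range [q_lo, q_hi]; all other blocks are skipped without being read."""
    to_scan = []
    for entry in key_value_metadata:
        if not entry["column_block_id"].startswith(f"{column}#"):
            continue                      # metadata for a different column
        if entry["max"] < q_lo or entry["min"] > q_hi:
            continue                      # range disjoint from the query: skip this block
        to_scan.append(entry["column_block_id"])
    return to_scan

# Usage with the 'meta' list from the earlier sketch:
# row_blocks_to_scan(meta, "amount", 0, 100) -> ["amount#0"]
```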
The offsets and sizes of the column blocks to be accessed are retrieved from the column metadata and sent to the core layer node, which is directed to retrieve those column blocks. Upon receiving a column block, the rows and columns in the column block that satisfy the query are identified; the values of the identified rows and columns are then included in the query result, which is sent to the parsing node.
Optionally, after generating the column metadata, each column of metadata in the striped file is further stored as a separate transactional page. In particular, a log page is generated based on the total number of rows in the plurality of rows, the number of rows in each of the plurality of row blocks, and a reference to the set of column metadata, and the log page in the file is then stored as a separate transaction page.
Before processing queries using key-value metadata in the striped file, the key-value metadata is preferably implemented using a hierarchical tree structure. The number of leaf nodes of the hierarchical tree structure corresponding to a column is the same as the number of column blocks of the column.
When an instruction is received from a parsing node, a core layer node retrieves a transaction page of column metadata and a transaction page of key value metadata from a log page of a striped file. The hierarchical tree is then generated based on key-value metadata and traversed in a breadth-first manner.
In a string-value-based embodiment, the hierarchical tree structure includes a Cuckoo filter that is generated based on the string values stored in the columns. The Cuckoo filter of the root node is a combination of the Cuckoo filters of its child nodes. In particular, the Cuckoo filter is an 8-bit array configured to store hash values from two hash functions. The data conversion engine generates the Cuckoo filter for the current hierarchical tree node by performing a bitwise OR operation on the Cuckoo filters of the child nodes of each hierarchical tree node. Thus, each hierarchical tree node stores a superset of the values of its child nodes.
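By way of illustration only, the filter construction and combination described above can be sketched as follows, using an 8-bit bitmap set by two hash positions and combined with bitwise OR; this approximates the described behaviour rather than a full textbook Cuckoo filter, and the function names are assumptions introduced here.

```python
import hashlib

def _hash_positions(value):
    """Two bit positions in [0, 8) derived from one digest (illustrative only)."""
    digest = hashlib.sha256(value.encode("utf-8")).digest()
    return digest[0] % 8, digest[1] % 8

def build_filter(strings):
    """8-bit membership filter over the string values of one column block."""
    bits = 0
    for s in strings:
        h1, h2 = _hash_positions(s)
        bits |= (1 << h1) | (1 << h2)
    return bits

def combine(child_filters):
    """A non-leaf node's filter is the bitwise OR of its children's filters,
    so it represents a superset of the values stored below it."""
    parent = 0
    for f in child_filters:
        parent |= f
    return parent

def may_contain(filter_bits, value):
    """True means the value may be present (false positives are possible);
    False means it is definitely absent, so that subtree can be pruned."""
    h1, h2 = _hash_positions(value)
    return bool((filter_bits >> h1) & 1 and (filter_bits >> h2) & 1)

leaf_a, leaf_b = build_filter(["alice", "bob"]), build_filter(["carol"])
root = combine([leaf_a, leaf_b])
print(may_contain(root, "alice"), may_contain(leaf_b, "alice"))
```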
Then, a set of leaf nodes of the hierarchical tree structure is identified based on the column block IDs through the hierarchical tree structure. Each leaf node in the hierarchical tree structure is associated with a column block in the data table. Each leaf node in the hierarchical tree structure includes superset data representing values of the column block associated with the leaf node. Each non-leaf node includes superset data representing values represented by data in child nodes of the non-leaf node. A query is then processed for a set of column blocks of the set of leaf node related data tables.
When a column block is found whose range of values defined by its minimum and maximum values overlaps the range of values defined by the query, the offset and size of the transaction page containing that column block are retrieved from the column metadata and sent, and the core layer node is instructed to retrieve the corresponding column block from the corresponding transaction page and send the query result to the parsing node.
In a further embodiment of the present invention, an exemplary process is described that uses the query analysis above to perform intelligent analysis of enterprise workflow business models in business intelligence data, implementing automatic classification and training of business models and identifying risk businesses. The business model is represented by a trust event library, which contains a feature event set.
First, the enterprise workflow business model is stored as a feature event data table, and second, the feature event data table is divided into column blocks. A training event set {(K_n, r_n) | n = 1, 2, …, N} is obtained, where K_n is the set of events for a particular business model, r_n ∈ [1, 2, …, TC] is a model feature type flag, N is the number of model events, and TC is the number of types.
The event set K_n is then expressed as a characteristic event M_n, defined as follows:
M_n = [M_{1,n}, …, M_{i,n}, …, M_{l_n,n}]
where M_{i,n} is the feature set computed by the i-th parsing node, and l_n is the number of parsing nodes related to K_n.
The feature event set data table is represented as bu = {p_i | i = 1, …, N_p}, where N_p is the number of feature event sets. The i-th feature event column block p_i is defined as {M_i, ht_i}, where ht_i is the detection threshold.
To compute M_i, all training feature events {M_1, …, M_N} are first subjected to a matrix conversion to obtain the main parsing nodes, and the column blocks of all parsing nodes are clustered. The conversion function A is defined by an equation given as an image in the original document (Figure BDA0003105562390000141), in which M_{i,n} and M_{k,m} are the feature sets computed by the i-th and k-th parsing nodes respectively, and the gradient vectors of type t of M_{i,n} and M_{k,m} appear as further equation images (Figure BDA0003105562390000142 and Figure BDA0003105562390000143).
Then, for the i-th trust event library, a detection threshold ht_i is set to establish a training data set that avoids noisy event patterns.
For a feature event M_n, M_n is converted into a column block vector, represented as
E_n = [E_{1,n}, …, E_{i,n}, …, E_{l_n,n}]
where E_{i,n} is the column block corresponding to the i-th characteristic event; M_n is processed with the feature event detection model, and the column block E_{i,n} that maximizes the response of the support vector machine is selected.
From the trained column blocks [E_1, E_2, …, E_N], a characteristic event set CER is obtained through the Apriori algorithm; CER represents the local feature structure of a business model event, and the j-th event CER_j is defined as follows:
CER_j = {c_j, g_j, q_j, w_j}
where c_j ∈ [1, 2, …, TC] marks the model feature type; g_j is the event pattern; q_j is the feature of the feature event set; and w_j represents the weight of CER_j within model feature type c_j.
To compute g_j, a column block of training data is first collected. The event pattern is then computed from the collected column blocks; since the same event pattern may be derived from two model feature classes, a weight w_j is set for pattern g_j, where w_j denotes the relative weight of g_j. If a pattern occurs in only one type, the weight reaches its maximum value of 1.
The feature event set preserves the temporal relationship of the feature events. For a test event K_T, the evaluation function of the feature event set CER for model feature c is given by an equation shown as an image in the original document (Figure BDA0003105562390000151), where a1_{j,c}, a2_{j,c} and a3_{j,c} are the parameters of the j-th feature event set in model feature type c; N_R is the number of feature event sets; E_T is the column block vector of the test event K_T; and d(E_T, g_j) is the event reference feature. d(E_T, g_j) is used to calculate the similarity between the test event and the feature event set. The initial values are set as MPF(n, 0) = 0 for n ∈ [0, L] and MPF(0, m) = -m for m ∈ [0, m_j], where L is the number of parsing nodes in E_T and m_j is the length of the sequence of event pattern g_j. The matching function MPF is therefore defined as follows:
MPF(n, m) = max{Ph(M_n, T) + Ph(M_m, T), MPF(n-1, m), MPF(n, m-1)}
The event reference feature describes the partial structure of a model feature by matching a long sequence, namely the test event representing the whole model feature structure, to a short sequence. When g_j matches the test event E_T, d(E_T, g_j) takes the maximum reference score:
d(E_T, g_j) = max(MPF(n, m_j) / m_j)
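By way of illustration only, the matching function MPF can be sketched as the following dynamic program; the per-position score Ph is stubbed out because it is only named, not defined, above, and everything else follows the stated recurrence and initial values.

```python
def mpf_table(L, m_j, ph_score):
    """Fill the MPF table for a test event of length L against an event pattern of length m_j.

    ph_score(n, m) stands in for Ph(M_n, T) + Ph(M_m, T) in the recurrence; its
    definition here is an assumption (illustrative sketch only).
    """
    mpf = [[0.0] * (m_j + 1) for _ in range(L + 1)]
    for m in range(m_j + 1):
        mpf[0][m] = -float(m)                        # MPF(0, m) = -m
    for n in range(L + 1):
        mpf[n][0] = 0.0                              # MPF(n, 0) = 0
    for n in range(1, L + 1):
        for m in range(1, m_j + 1):
            mpf[n][m] = max(ph_score(n, m), mpf[n - 1][m], mpf[n][m - 1])
    return mpf

def reference_score(L, m_j, ph_score):
    """d(E_T, g_j) = max over n of MPF(n, m_j) / m_j."""
    mpf = mpf_table(L, m_j, ph_score)
    return max(mpf[n][m_j] for n in range(1, L + 1)) / m_j

# Toy usage with a stubbed score that rewards aligned positions.
score = reference_score(4, 3, lambda n, m: 1.0 if n == m else 0.0)
```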
In the model data feature understanding and identification stage, an AHP algorithm is used to identify the event reference features d(E_T, g_j), so that the modulus of the intra-class distribution function HO_w is smaller while the modulus of the inter-class distribution function HO_b is larger, achieving optimal classification performance. A gradient function J is computed; its formula is given as an equation image in the original document (Figure BDA0003105562390000152), where ε is a high-dimensional column vector. By selecting the ε that maximizes J(ε) as the projection direction, the maximum HM_b and the minimum HM_w after projection are obtained; the current best discrimination vector ω is then selected to establish a projection function P, whose expression is given as an equation image in the original document (Figure BDA0003105562390000161).
Finally, dimensionality reduction is performed on the projection function P using principal component analysis, eliminating redundant features and completing the feature identification of the risk business model.
Therefore, the invention can flexibly distribute query analysis requests among a plurality of parsing nodes. The server hierarchy that performs the distributed query may be selected based on any criteria, including the location of cached column blocks, server processing speed, input/output throughput, and so on. The method of the present invention provides more flexibility in efficiently processing queries and reduces data transmission between the intelligence analysis servers of the column block storage area and the object storage clusters. The above method has significant advantages especially in data mining and commercial intelligence applications.

Claims (10)

1. An intelligence big data intelligent analysis method is characterized by comprising the following steps:
providing a column block storage area in an object storage cluster, the column block storage area having a striped partition data table; transmitting the column chunks from the object storage cluster to a plurality of parsing nodes of an intelligence analysis server;
receiving a query request from a client to enable the plurality of parsing nodes to perform distributed parallel processing on the query of the data table divided into column blocks;
wherein the intelligence analysis server further comprises: a query parser for verifying the syntax of the query request; a semantic analyzer for verifying the semantic content of the query request; and an optimizer for determining a transaction sequence of the query request for distributed parallel processing by the plurality of parsing nodes;
caching a subset of column blocks associated with the query request to execute a sequence of transactions of the query request in the cached subset of column blocks;
and acquiring the execution results of the transaction sequences of the query request executed by the plurality of parsing nodes, the current parsing node combining the execution results of the transaction sequences executed by the other parsing nodes.
2. The intelligence big data intelligent analysis method of claim 1, wherein the plurality of parsing nodes of the intelligence analysis server comprise first-layer, second-layer and third-layer parsing nodes, wherein:
the first-layer parsing nodes receive query requests for processing intelligence data and determine the transaction sequences corresponding to the queries, wherein the intelligence data is stored in row blocks in advance and cached among the plurality of parsing nodes;
the second-layer parsing nodes receive a translated query from the first-layer parsing nodes, the translated query indicating that the second-layer parsing nodes are to trigger distributed parallel processing of the query;
the third-layer parsing nodes execute the transaction sequence corresponding to the translated query.
3. The intelligence big data intelligent analysis method of claim 1, wherein the plurality of parsing nodes perform distributed parallel processing on the query request of the data table divided into column blocks, further comprising:
receiving a query for a column block in a data table, traversing a hierarchical tree structure comprising a plurality of nodes, identifying a set of leaf nodes of the hierarchical tree structure based on a column block ID, processing the query for the column blocks of the data table associated with the set of leaf nodes; and generating the query result on that basis.
4. The intelligence big data intelligent analysis method of claim 3, wherein each leaf node in the hierarchical tree structure is associated with a column block in the data table, each leaf node in the hierarchical tree structure comprises data representing a superset of values of the column block associated with the leaf node, and each non-leaf node comprises data representing a superset of values described by data in children nodes of the non-leaf node.
5. The intelligent intelligence big data analysis method of claim 3, wherein the value of each column of the data table is a long-integer value, wherein the data of each leaf node in the hierarchical tree structure represents the value of the column block associated with the leaf node according to the numerical range of the long-integer value.
6. The intelligent intelligence big data analysis method of claim 5, wherein the data of each non-leaf node in the hierarchical tree structure represents the superset according to the numerical range of the long integer value.
7. The intelligence big data intelligent analysis method of claim 3, wherein the processing of the query request for the set of column blocks comprises:
for each column block in the set of column blocks, determining whether a numerical range of a leaf node associated with the current column block is a subset of the numerical range of the query request;
including the column block in the query result if the numerical range of the leaf node associated with the column block is a subset of the numerical range of the query request.
8. The intelligence big data intelligent analysis method of claim 3, wherein the value of each column of the data table is a string value.
9. The intelligent intelligence big data analysis method of claim 8, wherein the data of each leaf node in the hierarchical tree structure represents values of column blocks associated with the leaf node according to a Cuckoo filter generated from string values of column blocks associated with the leaf node and a set of hash functions.
10. The intelligence big data intelligent analysis method of claim 9, wherein the data of each non-leaf node in the hierarchical tree structure represents a superset according to values of related column blocks of child nodes of non-leaf nodes, the values of related column blocks of child nodes of non-leaf nodes being generated from Cuckoo filters and hash function sets of child nodes of non-leaf nodes.
CN202110636964.5A 2021-06-08 2021-06-08 Intelligent analysis method for big intelligence data Pending CN113282627A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110636964.5A CN113282627A (en) 2021-06-08 2021-06-08 Intelligent analysis method for big intelligence data

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110636964.5A CN113282627A (en) 2021-06-08 2021-06-08 Intelligent analysis method for big intelligence data

Publications (1)

Publication Number Publication Date
CN113282627A true CN113282627A (en) 2021-08-20

Family

ID=77283832

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110636964.5A Pending CN113282627A (en) 2021-06-08 2021-06-08 Intelligent analysis method for big intelligence data

Country Status (1)

Country Link
CN (1) CN113282627A (en)


Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20070143261A1 (en) * 2005-12-19 2007-06-21 Yahoo! Inc. System of a hierarchy of servers for query processing of column chunks in a distributed column chunk data store
CN107256233A (en) * 2017-05-16 2017-10-17 北京奇虎科技有限公司 A kind of date storage method and device
CN110879854A (en) * 2018-09-06 2020-03-13 Sap欧洲公司 Searching data using a superset tree data structure
CN110909077A (en) * 2019-11-05 2020-03-24 四川中讯易科科技有限公司 Distributed storage method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
郭刚; 卢宇彤: "大规模集群一致性维护的网络传输控制方案" (Network transmission control scheme for consistency maintenance of large-scale clusters), 科学技术与工程 (Science Technology and Engineering), no. 06, 30 March 2006 (2006-03-30) *

Similar Documents

Publication Publication Date Title
US11080277B2 (en) Data set compression within a database system
US9576024B2 (en) Hierarchy of servers for query processing of column chunks in a distributed column chunk data store
US7447839B2 (en) System for a distributed column chunk data store
US9892237B2 (en) System and method for characterizing biological sequence data through a probabilistic data structure
US7921087B2 (en) Method for query processing of column chunks in a distributed column chunk data store
US7457935B2 (en) Method for a distributed column chunk data store
US11983176B2 (en) Query execution utilizing negation of a logical connective
US12050580B2 (en) Data segment storing in a database system
US11755589B2 (en) Delaying segment generation in database systems
US11880368B2 (en) Compressing data sets for storage in a database system
US11803544B2 (en) Missing data-based indexing in database systems
US20240004858A1 (en) Implementing different secondary indexing schemes for different segments stored via a database system
CN113282627A (en) Intelligent analysis method for big intelligence data
CN111522825A (en) Efficient information updating method and system based on check information block shared cache mechanism
US20240202166A1 (en) Generating compressed column slabs for storage in a database system
US20240256541A1 (en) Query execution via communication with an object storage system via an object storage communication protocol

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination