CN111221837A

CN111221837A - Real-time computing query system and method based on B2B mall

Info

Publication number: CN111221837A
Application number: CN201911308700.6A
Authority: CN
Inventors: 崔素芳; 钟朝; 彭晓
Original assignee: Guangzhou President Enterprises Corp
Current assignee: Guangzhou President Enterprises Corp
Priority date: 2019-12-18
Filing date: 2019-12-18
Publication date: 2020-06-02

Abstract

The invention relates to a real-time computing query system and method based on a B2B mall. The system comprises a service database, an HDFS distributed storage system, a Kudu columnar database, a front-end WEB system, a real-time acquisition and forwarding unit, a Shell script, and Hive. Service data is collected into the Kudu columnar database in real time through the big data platform, which keeps the data timely. Users enter and submit calculation requests in real time through the front-end WEB system, and the computing capacity of the big data platform returns the calculation results promptly to support decision-making.

Description

Real-time computing query system and method based on B2B mall

Technical Field

The invention relates to the technical field of big data calculation query, in particular to a real-time calculation query system and method based on a B2B mall.

Background

To handle large volumes of service data that require big data computation, current practice is mostly to define scheduled tasks that extract service data at specified time points and then to build a calculation model for each service scenario that runs offline. This approach mainly relies on the following technologies: Sqoop for batch data import, HDFS for batch data storage, Hive/MapReduce for batch computation, Hue for task scheduling, HPL/SQL for calculation processing, and Impala as the query engine.

Chinese patent application publication No. CN109766368A discloses a Hive-based data query multi-type view output system, which includes a query module comprising a query condition management module, a query result display module and a custom template generation module. That scheme uses Hive to perform statistical analysis on massive data sets, carrying out batch statistics, mining and analysis of the large data sets in the data warehouse through multiple input and output methods, so as to give users big-data-based guidance for intelligent decisions and help them understand the current situation and grasp trends.

The above prior art solutions have the following drawbacks. First, data timeliness is low, because scheduled tasks extract service data only at specified time points and the calculation model built for each service scenario runs offline. Second, the calculation model is fixed. For example, when big data is used to calculate customer transaction amounts and last transaction time, the established model can only calculate transaction amounts for the fixed windows of 30/60/90 days; the business department uses these results to decide which customers need priority visits, but when a new product launches, a promotion is run, or a key product goes online, 7-day, 10-day or other windows may need to be calculated instead. Third, data granularity is fixed. For example, a given calculation model can only calculate at the granularity of customer and minimum stock keeping unit.

Disclosure of Invention

Aiming at the defects in the prior art, the invention aims to provide a real-time computing query system based on a B2B mall and a real-time computing query method based on a B2B mall.

The above object of the present invention is achieved by the following technical solutions:

a real-time computing query system based on a B2B mall, comprising:

a service database, used for storing the data information of the B2B mall;

an HDFS distributed storage system;

a Kudu columnar database, kept synchronized with the service database through a big data platform and mapped to the HDFS distributed storage system;

a front-end WEB system, used for entering data and forwarding request information carrying the entered data through a big data interface library;

a real-time acquisition and forwarding unit, used for collecting and forwarding, in real time, the request information carrying the entered data that the front-end WEB system forwards through the big data interface library;

a Shell script, used for receiving the request information forwarded by the real-time acquisition and forwarding unit and then calling HPL/SQL for processing, the HPL/SQL performing the calculation according to the data in the Kudu columnar database and the request information; and

Hive, used for storing the HPL/SQL calculation result and pushing the calculation result to the front-end WEB system.

By adopting the technical scheme, the service data is collected into the Kudu column-type database in real time through the big data platform, the timeliness of the data is guaranteed, the calculation request of the user is input and submitted in real time through the front-end WEB system, and the calculation result is obtained in time through the calculation capacity of the big data platform and is used for supporting decision making.

The present invention in a preferred example may be further configured to: the real-time acquisition and forwarding unit comprises a real-time acquisition unit, a real-time message forwarding unit and an SPARK engine, the real-time acquisition unit is used for monitoring and acquiring request information forwarded by the big data interface library in real time and forwarding the request information to the SPARK engine through the real-time message forwarding unit, and the SPARK engine calls a python program to receive the request information and forwards the request information to the Shell script.

By adopting the technical scheme, the real-time acquisition unit is used for monitoring and acquiring the request information forwarded by the big data interface library in real time and forwarding the request information to the SPARK engine through the real-time message forwarding unit, and the SPARK engine calls a python program to receive the request information and forwards the request information to the Shell script.

The present invention in a preferred example may be further configured to: the real-time acquisition unit is StreamSets, and the real-time message forwarding unit is Kafka.

By adopting the technical scheme, StreamSets has no caching function, so new request information is placed into the Kafka queue as soon as it arrives.

The present invention in a preferred example may be further configured to: the system further comprises a BI display system, which is used for providing reports according to the calculation result and providing a decision basis.

By adopting the technical scheme, the BI display system is a complete solution, is used for effectively integrating the existing data in an enterprise, quickly and accurately provides a report form and provides a decision basis, and helps the enterprise make an intelligent business operation decision.

The second aim of the invention is realized by the following technical scheme:

a real-time calculation query method based on a B2B mall is characterized in that a Kudu column database and an HDFS distributed storage system are adopted for mapping, and data in the Kudu column database is kept synchronous with a service database through a big data platform; after the front-end WEB system inputs data, the request information with the input data is sent to a real-time acquisition and forwarding unit through a big data interface library; and the Shell script calls Hpl/Sql after receiving the request information forwarded by the real-time acquisition and forwarding unit, the Hpl/Sql performs calculation processing according to the data in the Kudu column database and the request information and stores a calculation result in the Hive, and the front-end WEB system receives and applies the calculation result.

The present invention in a preferred example may be further configured to: and after the computation is finished, the Shell script returns at least one piece of information in the computation result to the SPARK engine.

By adopting the technical scheme, the calculation result returned to the SPARK engine is used for updating information such as a completion mark in the interface table.

The present invention in a preferred example may be further configured to: the request information at least comprises a session id and input data of a front-end WEB system.

By adopting the technical scheme, the session id is unique, and the introduction of the session id is beneficial to carrying out classification processing and storage on each request message and the processing calculation result thereof.

The present invention in a preferred example may be further configured to: the input data of the front-end WEB system comprises at least one client static label and at least one commodity static label.

By adopting the technical scheme, any static labels from the two dimensions of customer static labels and commodity static labels can be freely combined, and the corresponding calculated dynamic label results are obtained.

The present invention in a preferred example may be further configured to: and the calculation result is also sent to a BI display system for BI display.

In summary, the invention includes at least one of the following beneficial technical effects:

1. service data is collected into the Kudu columnar database in real time through a big data platform, so that the timeliness of the data is guaranteed;

2. the calculation request of the user is entered and submitted in real time through the front-end WEB system, and the calculation result is obtained promptly through the computing capacity of the big data platform and is used for supporting decisions;

3. the ability to compute over very large data volumes shortens the response time for data requests.

Drawings

FIG. 1 is a schematic block diagram of the system of the present invention;

FIG. 2 is a flow chart of a method of the present invention;

FIG. 3 is a flow chart of a method of an embodiment of the present invention;

FIG. 4 is a schematic illustration of static and dynamic tags for the customer and product dimensions of the present invention;

FIG. 5 is a diagram illustrating an exemplary interface for a theme definition file according to the present invention;

FIG. 6 is a diagram illustrating an exemplary interface for a model data file according to the present invention;

FIG. 7 is a diagram illustrating an exemplary screening value interface for a model list according to the present invention;

FIG. 8 is a schematic view of an input interface of the front-end WEB system according to the present invention;

FIG. 9 is a diagram illustrating an exemplary model computing request file interface according to the present invention.

Detailed Description

The present invention will be described in further detail with reference to the accompanying drawings.

Explanation of terms:

HPL/SQL stands for Procedural SQL on Hadoop and extends Hive with support for stored procedures;

Hive is a Hadoop-based data warehouse tool used for data extraction, transformation and loading; it provides a mechanism for storing, querying and analyzing large-scale data stored in Hadoop;

a Shell script is similar to a batch file under Windows/DOS: various commands are placed into a file in advance so that they can be executed in one go, mainly for the convenience of administrators in setup and management;

Kafka is a high-throughput distributed publish-subscribe messaging system;

StreamSets is a real-time big data acquisition and ETL tool that can collect and move data without writing a line of code. ETL is the process of extracting, cleaning and transforming data from business systems and loading it into a data warehouse; it aims to integrate the scattered, disordered and inconsistently standardized data in an enterprise to provide an analysis basis for enterprise decisions, and is an important part of a BI (business intelligence) project;

Apache Kudu is a storage engine open-sourced by Cloudera that provides both low-latency random reads and writes and efficient data analysis; it is a new storage component positioned between HDFS and HBase that combines functions of both;

BI (business intelligence) is a complete solution used to effectively integrate the existing data in an enterprise, provide reports quickly and accurately, and provide a decision basis, helping the enterprise make informed business decisions.

Referring to fig. 1, the invention discloses a real-time computing query system based on B2B mall, comprising:

an HDFS distributed storage system;

With reference to fig. 1, the real-time acquisition and forwarding unit includes a real-time acquisition unit, a real-time message forwarding unit, and a SPARK engine. The real-time acquisition unit monitors and acquires, in real time, the request information forwarded by the big data interface library and forwards it to the SPARK engine through the real-time message forwarding unit; the SPARK engine calls a python program to receive the request information and forward it to the Shell script. Preferably, the real-time acquisition unit is StreamSets, and the real-time message forwarding unit is Kafka. The system further includes a BI display system, which is used for providing reports according to the calculation result and providing a decision basis.

Referring to fig. 2, in order to provide the real-time calculation query method based on the B2B mall disclosed in the present invention, a Kudu column-type database and an HDFS distributed storage system are used for mapping, and data in the Kudu column-type database is kept synchronous with a service database through a big data platform; after the front-end WEB system inputs data, the request information with the input data is sent to a real-time acquisition and forwarding unit through a big data interface library; and the Shell script calls Hpl/Sql after receiving the request information forwarded by the real-time acquisition and forwarding unit, the Hpl/Sql performs calculation processing according to the data in the Kudu column database and the request information and stores a calculation result in the Hive, and the front-end WEB system receives and applies the calculation result.

With reference to fig. 2, the real-time acquisition and forwarding unit includes a real-time acquisition unit, a real-time message forwarding unit, and a SPARK engine, the real-time acquisition unit is configured to monitor and acquire request information forwarded by the big data interface library in real time and forward the request information to the SPARK engine through the real-time message forwarding unit, and the SPARK engine calls a python program to receive the request information and forwards the request information to the Shell script.

Referring to fig. 3, the real-time acquisition unit is StreamSets, and the real-time message forwarding unit is Kafka.

After the computation is finished, the Shell script returns at least one piece of information from the computation result (such as the computation end time, the computation completion flag, the computation success or failure flag, the number of result rows, and the key computation output information) to the SPARK engine, which writes it back to the interface table through which the computation request was submitted and updates the completion flag and other information in that table. The request information at least comprises a unique session id and the input data of the front-end WEB system. The input data of the front-end WEB system comprises at least one customer static label and at least one commodity static label. The calculation result is also sent to a BI display system for BI display.

For service tables that change quickly and hold large volumes of data, such as the order table, the insert, delete, update and query log data in the MySQL or Oracle service database is synchronized into the Kudu columnar database of the big data platform.

Model functional requirements: the model must support calculation over any free combination of static labels from the two dimensions of customer static labels and commodity static labels, yielding the corresponding calculated dynamic label results. Referring to fig. 4, the customer static labels include: region/administrative province/city/county, merchant, channel customer/non-merchant/b, channel customer category, channel customer region, channel customer level, whether the merchant is enabled, etc. The commodity static labels include: commodity brand, commodity code, commodity classification, product line, outer box bar code, income number, auxiliary lining unit (cup, bowl, five-in-one), etc. The calculated dynamic label results include the number or list of customers who have never transacted, the number or list of composite-transaction customers (i.e. customers transacting across a plurality of commodity brands), the latest transaction time, transaction items and item count, transaction frequency, total number of customers, GMV (total transaction amount), number of boxes transacted, transaction amount, etc. The set of dynamic label result columns is determined when the model is built, and each calculation computes all of these result columns.
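
The free combination of static labels described above can be illustrated with a small in-memory aggregation. This is only a sketch under stated assumptions: the order rows, the field names (customer, province, brand, amount, day) and the function dynamic_labels are hypothetical, and the patent itself performs this calculation in HPL/SQL over Kudu tables rather than in Python.

```python
from collections import defaultdict
from datetime import date

# Hypothetical order rows; field names are illustrative, not from the patent.
ORDERS = [
    {"customer": "C1", "province": "Guangdong", "brand": "A", "amount": 120.0, "day": date(2019, 11, 5)},
    {"customer": "C2", "province": "Guangdong", "brand": "A", "amount": 80.0, "day": date(2019, 11, 9)},
    {"customer": "C3", "province": "Guangxi", "brand": "B", "amount": 50.0, "day": date(2019, 11, 7)},
]


def dynamic_labels(orders, group_by):
    """Aggregate orders by any combination of static-label fields."""
    groups = defaultdict(list)
    for row in orders:
        key = tuple(row[field] for field in group_by)
        groups[key].append(row)
    result = {}
    for key, rows in groups.items():
        result[key] = {
            "gmv": sum(r["amount"] for r in rows),            # total transaction amount
            "customers": len({r["customer"] for r in rows}),  # distinct customers
            "transactions": len(rows),                        # transaction count
            "latest_day": max(r["day"] for r in rows),        # latest transaction time
        }
    return result


# The same function serves any label combination chosen at request time.
by_province_brand = dynamic_labels(ORDERS, ["province", "brand"])
by_province = dynamic_labels(ORDERS, ["province"])
```

Because the grouping fields are passed in at call time, the same routine covers "province + brand", "province" alone, or any other combination, which is the flexibility the fixed 30/60/90-day models lacked.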

Theme-oriented model definition: according to the user's demand scenario, a model theme is defined, including the dimension data, static labels, service data, calculation rules, list of calculation result measures, etc. used in the theme. This data is put into the model interface tables for users to select from.

Referring to fig. 5, the model theme definition file includes the model code (model_name), the model theme name (model_name), the model creation time, whether the model is enabled, the estimated duration of each task the model submits (esti_duration), the model calculation output result table (dest_tab_name), and the URL at which the model report is viewed (report_url).

Referring to fig. 6, the model data file includes the model code, the name of the dimension table/service data table used by the model (tab_name), the dimension table/service data table description, the label field name (col_name), the label type (col_class: static/dynamic), the label field type, the label field description, the label usage (when used for filtering), whether the label is a query/filter condition/secondary screening field, and the label sort order; the relevant data of the calculation model is queried by model theme code.

Referring to fig. 7, the model column screening values are the specific values of the static labels in the calculation model data that a user can filter on when submitting a calculation request; the related data is queried by model code, table name and column name.

Referring to fig. 8, an input interface of the front-end WEB system, i.e. the operation interface through which a user submits requests in real time, is developed from the model definition data (including the theme-oriented model definition, the model theme definition file, the model data file, and the model column screening values). Users can choose for themselves the result columns to be calculated and summarized. For example, on the front-end Web page a user may request the number of never-transacted customers, composite transactions, latest transaction time, transaction items and item count, transaction count and GMV since November 4, 2019 for Shanghai/Guangdong/Guangxi in the South China region.

Referring to fig. 9, the model calculation request file: as agreed by convention, the session ID, the result columns to be queried, the query conditions to be applied, and the secondary filter conditions based on the dynamic labels of the calculation result are written into the interface table.

When writing to the interface table, the specific values submitted by the user on the calculation request page include the session ID (session_id), the model theme (model_theme), the result columns to be calculated and summarized (qry_columns), the where condition defining the data range to be calculated (qry_where), the secondary screening condition on dynamic labels (qry_alive), the calculation request submission time (cre_time), and the columns to be queried by the report (qry_columns_rep); these are submitted to the model calculation request submission interface. The completion flag (completed), the completion time (completed_time), the calculation success flag (process), the number of result rows (num_rows), and the key calculation output information (run_message) are written back when the calculation completes. When the request is submitted, the values of the two fields qry_columns and qry_where must be entered in the "table alias + column name" format defined in the model data file. The model calculation request file interface is shown in fig. 9.
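
As a rough illustration of the interface-table record described above, the fields can be modelled as a Python dataclass. This is a sketch, not the patent's schema: the class name CalcRequest, the example values and the field types are assumptions, and the secondary-screening field is called qry_having here as an assumed reading of an identifier that the text renders inconsistently.

```python
from dataclasses import dataclass, asdict
from datetime import datetime
from typing import Optional
import uuid


@dataclass
class CalcRequest:
    # Fields written by the front end when the request is submitted.
    session_id: str
    model_theme: str
    qry_columns: str       # "table alias + column name" format
    qry_where: str         # "table alias + column name" format
    qry_having: str        # assumed name for the secondary screening condition
    cre_time: str
    qry_columns_rep: str
    # Fields written back when the calculation completes.
    completed: int = 0
    completed_time: Optional[str] = None
    process: Optional[int] = None      # calculation success flag
    num_rows: Optional[int] = None
    run_message: Optional[str] = None


# Example values are illustrative only.
request = CalcRequest(
    session_id=uuid.uuid4().hex,
    model_theme="sales_by_label",
    qry_columns="c.province, p.brand",
    qry_where="c.region = 'South China'",
    qry_having="",
    cre_time=datetime(2019, 12, 18, 10, 0, 0).isoformat(sep=" "),
    qry_columns_rep="province, brand, gmv",
)
record = asdict(request)  # dict form, ready to be inserted as one table row
```

Separating the submitted fields from the written-back fields makes it clear which side of the pipeline owns each column.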

The model calculation request interface table is monitored in real time by StreamSets. StreamSets Data Collector is a lightweight visual tool for building data flows, which routes and processes the data in a flow. Once a data flow has been defined and its pipeline configured, Data Collector starts working: when a new calculation task is submitted, StreamSets detects the change in the MySQL interface table and submits the data to the Kafka message queue.
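
StreamSets pipelines are configured rather than coded, so the following is only a sketch of the underlying idea of picking up newly submitted rows from an interface table. It assumes an auto-incrementing id column used as an offset, uses an in-memory SQLite database as a stand-in for MySQL, and the table name calc_request and function fetch_new_requests are hypothetical.

```python
import sqlite3

# In-memory SQLite database standing in for the MySQL interface table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE calc_request ("
    "id INTEGER PRIMARY KEY AUTOINCREMENT, session_id TEXT, qry_columns TEXT)"
)


def fetch_new_requests(conn, last_seen_id):
    """Return rows added after last_seen_id together with the new offset."""
    rows = conn.execute(
        "SELECT id, session_id, qry_columns FROM calc_request "
        "WHERE id > ? ORDER BY id",
        (last_seen_id,),
    ).fetchall()
    new_offset = rows[-1][0] if rows else last_seen_id
    return rows, new_offset


# Each poll only sees rows inserted since the previous poll.
conn.execute(
    "INSERT INTO calc_request (session_id, qry_columns) VALUES (?, ?)",
    ("s-001", "c.province"),
)
first_batch, offset = fetch_new_requests(conn, 0)
conn.execute(
    "INSERT INTO calc_request (session_id, qry_columns) VALUES (?, ?)",
    ("s-002", "p.brand"),
)
second_batch, offset = fetch_new_requests(conn, offset)
```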

Kafka is a high-throughput, distributed publish-subscribe messaging system that is responsible for passing data from one application to another, so that applications only need to focus on the data and not on how it is passed between two or more applications. The data producer, StreamSets, publishes the request data to a specified Topic, and the messages are delivered to the consumers subscribed to that Topic for processing.
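
The publish-subscribe pattern described above can be sketched with a tiny in-memory broker. This is explicitly not a Kafka client: the class InMemoryBroker, the topic name and the group names are hypothetical, and the sketch only mimics the idea that a topic is an append-only log from which each consumer group reads at its own offset.

```python
from collections import defaultdict


class InMemoryBroker:
    """Toy stand-in for a message broker; not a real Kafka client."""

    def __init__(self):
        self._log = defaultdict(list)     # topic -> list of messages
        self._offsets = defaultdict(int)  # (group, topic) -> next index to read

    def publish(self, topic, message):
        # Producers only append; they know nothing about consumers.
        self._log[topic].append(message)

    def poll(self, group, topic):
        # Each consumer group reads from its own offset in the topic log.
        start = self._offsets[(group, topic)]
        messages = self._log[topic][start:]
        self._offsets[(group, topic)] = len(self._log[topic])
        return messages


broker = InMemoryBroker()
broker.publish("calc_requests", {"session_id": "s-001"})
broker.publish("calc_requests", {"session_id": "s-002"})

first_poll = broker.poll("spark-job", "calc_requests")
second_poll = broker.poll("spark-job", "calc_requests")  # nothing new yet
other_group = broker.poll("audit", "calc_requests")      # independent offset
```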

The Kafka message queue is monitored and its data is read through Spark + Python. Spark Streaming is a real-time stream processing mechanism for handling data streams as they are generated, and it supports code written in Python. Python code reads the data in the Kafka message queue; the result is JSON-format data, i.e. key: value pairs. The JSON data is then flattened, i.e. spliced into a single row, which completes the transfer of data from the MySQL real-time calculation request interface to the big data computing platform.
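
The flattening step mentioned above can be sketched in plain Python. The message layout (a nested "request" object) and the dotted key naming are assumptions made for illustration; the patent does not specify the JSON structure.

```python
import json


def flatten(obj, prefix=""):
    """Flatten nested JSON objects into a single-level dict (one row)."""
    flat = {}
    for key, value in obj.items():
        name = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            # Recurse into nested objects, joining key names with a dot.
            flat.update(flatten(value, name))
        else:
            flat[name] = value
    return flat


# Hypothetical message as it might arrive from the queue.
message = (
    '{"session_id": "s-001", '
    '"request": {"model_theme": "sales", "qry_columns": "c.province,p.brand"}}'
)
row = flatten(json.loads(message))
```

After flattening, every value sits under a single top-level key, so the record can be handled as one row of parameters.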

After Spark + Python obtains the calculation request data, it calls the Shell calculation script of the big data platform for each calculation request, passing the session ID (session_id), the model theme (model_theme), the result columns to be calculated and summarized (qry_columns), the where condition defining the data range to be calculated (qry_where), and the secondary screening condition on dynamic labels (qry_hashing) into the Shell calculation script as parameters.
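
A minimal sketch of how a Python program might hand these parameters to a Shell script is shown below. The script path, the positional parameter order and the name qry_having for the secondary screening condition are assumptions, not details from the patent; the command is only built here, not executed.

```python
import subprocess

SCRIPT = "/opt/bigdata/run_model_calc.sh"  # hypothetical script path

# Assumed positional order of the script parameters.
PARAM_ORDER = ["session_id", "model_theme", "qry_columns", "qry_where", "qry_having"]


def build_command(request):
    """Build the argument list for one calculation request."""
    return ["bash", SCRIPT] + [str(request.get(name, "")) for name in PARAM_ORDER]


def run_calculation(request):
    """Run the script; arguments are passed as a list, so no shell quoting is needed."""
    return subprocess.run(
        build_command(request), capture_output=True, text=True, check=False
    )


command = build_command({
    "session_id": "s-001",
    "model_theme": "sales",
    "qry_columns": "c.province, p.brand",
    "qry_where": "c.region = 'South China'",
})
```

Passing the arguments as a list rather than as one shell string keeps values such as where conditions, which contain spaces and quotes, intact as single parameters.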

The Linux Shell calculation script parses the parameters. A Shell script groups multiple Linux commands together so that they can be executed in a single run. The script breaks down the where condition defining the data range (qry_where); for example, filter conditions spanning several tables are split apart and regrouped so that each table in the calculation model has its own Where parameter. After the breakdown, the script passes the resulting parameters to the HPL/SQL calculation script and calls it. When the HPL/SQL calculation finishes, the Shell script parses the output parameters produced during the calculation and passes them back to Spark.
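
The per-table breakdown of the where condition can be sketched as follows. The patent performs this step in a Shell script; Python is used here only for illustration. The sketch assumes predicates are joined by AND and each starts with a table alias, and it does not handle OR, BETWEEN ... AND, or string literals that contain the word "and".

```python
import re
from collections import defaultdict


def split_where_by_table(qry_where):
    """Group AND-ed predicates by the table alias that prefixes each column."""
    per_table = defaultdict(list)
    predicates = re.split(r"\s+and\s+", qry_where.strip(), flags=re.IGNORECASE)
    for predicate in predicates:
        match = re.match(r"\s*\(?\s*([A-Za-z_]\w*)\.", predicate)
        if not match:
            raise ValueError(f"predicate has no table alias: {predicate!r}")
        per_table[match.group(1)].append(predicate.strip())
    # One combined Where parameter per table alias.
    return {alias: " AND ".join(preds) for alias, preds in per_table.items()}


parts = split_where_by_table(
    "c.region = 'South China' and o.order_date >= '2019-11-04' AND c.enabled = 1"
)
```

The "table alias + column name" convention required at submission time is what makes this kind of mechanical split possible.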

HPL/SQL calculation script: the script is written in HPL/SQL (Procedural SQL on Hadoop, also referred to as PL/HQL), a stored-procedure solution for the big data Hive database that executes an SQL query statement or an SQL script through a command line tool. Table data is analyzed and aggregated in HPL/SQL through SQL statements. Because each calculation is dynamic, only the joins between the business data tables and the aggregations for the dynamic labels are written into the SQL statement, while the Select column list and the Where condition are left as dynamic placeholders. When the calculation script is called, the passed-in parameters are spliced with the core business SQL to form the complete statistical analysis SQL, which then performs the statistical analysis and calculation on the data tables.
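
The splicing of request parameters into a fixed business SQL skeleton might look like the sketch below. The table names, join keys, aggregate columns and the function build_sql are hypothetical, and the patent performs the splicing inside the HPL/SQL script rather than in Python. Plain string splicing is open to SQL injection, so the sketch assumes the inputs have already been restricted to the predefined label columns and screening values.

```python
# Fixed part: table joins and dynamic-label aggregates (hypothetical schema).
BASE_SQL = (
    "SELECT {columns}, SUM(o.amount) AS gmv, MAX(o.order_date) AS last_order "
    "FROM orders o JOIN customers c ON o.customer_id = c.id "
    "JOIN products p ON o.product_id = p.id "
    "WHERE {where} GROUP BY {columns}"
)


def build_sql(qry_columns, qry_where, qry_having=""):
    """Splice the dynamic column list and conditions into the fixed skeleton."""
    sql = BASE_SQL.format(columns=qry_columns, where=qry_where or "1 = 1")
    if qry_having:
        # Secondary screening on aggregated (dynamic-label) values.
        sql += f" HAVING {qry_having}"
    return sql


sql = build_sql(
    "c.province, p.brand",
    "c.region = 'South China'",
    "SUM(o.amount) > 1000",
)
```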

The HPL/SQL calculation is executed through the Impala interface and computes over the data in the service tables collected in real time (i.e. the real-time service tables in the Kudu columnar database). Impala is a memory-based big data computing engine capable of interactive, real-time query and analysis of PB-level data; it computes very quickly but requires a large amount of memory to do so.

In Hplsql, the final calculation result data of the whole calculation script is written into a specified Hive result table with the session ID as a unique identifier.

After the calculation is finished, the Shell script returns the calculation end time, the calculation completion mark, the mark of whether the calculation is successful or not, the calculation result line number and the calculation key output information to Spark; and writing back the interface table submitted by the calculation request, and updating information such as a completion mark in the interface table.
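
The write-back step can be sketched as a single UPDATE keyed by session ID. An in-memory SQLite database stands in for the MySQL interface table, and the table name calc_request and the function write_back are hypothetical; the column names follow the fields listed earlier in the description.

```python
import sqlite3

# In-memory SQLite database standing in for the MySQL interface table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE calc_request ("
    "session_id TEXT PRIMARY KEY, completed INTEGER DEFAULT 0, "
    "completed_time TEXT, process INTEGER, num_rows INTEGER, run_message TEXT)"
)
conn.execute("INSERT INTO calc_request (session_id) VALUES (?)", ("s-001",))


def write_back(conn, session_id, completed_time, process, num_rows, run_message):
    """Mark one request as completed and store the returned output values."""
    cursor = conn.execute(
        "UPDATE calc_request SET completed = 1, completed_time = ?, "
        "process = ?, num_rows = ?, run_message = ? WHERE session_id = ?",
        (completed_time, process, num_rows, run_message, session_id),
    )
    conn.commit()
    return cursor.rowcount  # number of rows updated


updated = write_back(conn, "s-001", "2019-12-18 10:15:00", 1, 42, "ok")
row = conn.execute(
    "SELECT completed, process, num_rows FROM calc_request WHERE session_id = ?",
    ("s-001",),
).fetchone()
```

Keying the update on the session ID is what lets the front end later match a completed request to its result rows.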

After the calculation is completed, the user locates the most recently submitted request in the list of submitted requests on the front-end Web application interface. If the request is in the completed state, the user clicks the view button, and the system jumps to the BI system to display the report according to the session ID of that request.

For example, the columns to be queried and the conditions to be applied are entered in the front-end WEB system, such as calculating, at the "administrative province + brand" level, the transaction amount over the last 90 days in South China and the time of the last transaction. In the front-end WEB system the user enters "administrative province, brand" as the columns, with the filter condition that the order time falls within the last 90 days and the region is South China.

The request information is inserted into the MySQL database through the API together with a unique session id, and is detected by StreamSets. StreamSets has no caching function, so new request information is placed into the Kafka queue of the real-time message forwarding unit as soon as it arrives. The Spark engine runs continuously in real time and calls a python program, which monitors the Kafka message queue. When a message arrives in the queue, the python program takes the query and filter parameters from the API library according to the session id in the message and calls a Shell script on the Linux operating system.

The Shell script processes the incoming parameters: it splits them (for example, if the query condition refers to several tables, the incoming parameter is split into several parameters) and defines the statistical output parameters (for example, whether the SQL ran successfully, the completion time, and how many result rows the statistical analysis produced). The processed parameters are passed to the Hive database of the big data platform, where the procedure written in SQL is the HPL/SQL. HPL/SQL is script code similar to SQL, mainly used for statistical analysis; the script receives the passed-in parameters, performs the calculation, and inserts the calculation result into the result table under the current session id. The results of the current session can then be viewed and analyzed through the BI report.

The embodiments of the present invention are preferred embodiments of the present invention, and the scope of the present invention is not limited by these embodiments, so: all equivalent changes made according to the structure, shape and principle of the invention are covered by the protection scope of the invention.

Claims

1. A real-time computing query system based on a B2B mall, comprising:

an HDFS distributed storage system;

2. The B2B mall-based real-time computing query system according to claim 1, wherein: the real-time acquisition and forwarding unit comprises a real-time acquisition unit, a real-time message forwarding unit and an SPARK engine, the real-time acquisition unit is used for monitoring and acquiring request information forwarded by the big data interface library in real time and forwarding the request information to the SPARK engine through the real-time message forwarding unit, and the SPARK engine calls a python program to receive the request information and forwards the request information to the Shell script.

3. The B2B mall-based real-time computing query system according to claim 2, wherein: the real-time acquisition unit is StreamSets, and the real-time message forwarding unit is Kafka.

4. The B2B mall-based real-time computing query system according to claim 1, further comprising a BI display system, which is used for providing reports according to the calculation result and providing a decision basis.

5. A real-time computing query method based on a B2B mall is characterized in that: mapping by adopting a Kudu column database and an HDFS distributed storage system, and keeping the data in the Kudu column database synchronous with a service database through a big data platform; after the front-end WEB system inputs data, the request information with the input data is sent to a real-time acquisition and forwarding unit through a big data interface library; and the Shell script calls Hpl/Sql after receiving the request information forwarded by the real-time acquisition and forwarding unit, the Hpl/Sql performs calculation processing according to the data in the Kudu column database and the request information and stores a calculation result in the Hive, and the front-end WEB system receives and applies the calculation result.

6. The real-time computing query method based on the B2B mall according to claim 5, wherein: the real-time acquisition and forwarding unit comprises a real-time acquisition unit, a real-time message forwarding unit and an SPARK engine, the real-time acquisition unit is used for monitoring and acquiring request information forwarded by the big data interface library in real time and forwarding the request information to the SPARK engine through the real-time message forwarding unit, and the SPARK engine calls a python program to receive the request information and forwards the request information to the Shell script.

7. The real-time computing query method based on the B2B mall according to claim 5, wherein: and after the computation is finished, the Shell script returns at least one piece of information in the computation result to the SPARK engine.

8. The real-time computing query method based on the B2B mall according to claim 5, wherein: the request information at least comprises a session id and input data of a front-end WEB system.

9. The real-time computing query method based on the B2B mall according to claim 5, wherein: the input data of the front-end WEB system comprises at least one client static label and at least one commodity static label.

10. The real-time computing query method based on the B2B mall according to claim 5, wherein: and the calculation result is also sent to a BI display system for BI display.