CN110389989B

CN110389989B - Data processing method, device and equipment

Info

Publication number: CN110389989B
Application number: CN201910634080.9A
Authority: CN
Inventors: 朱松岭
Original assignee: Advanced New Technologies Co Ltd
Current assignee: Advanced Nova Technology Singapore Holdings Ltd
Priority date: 2019-07-15
Filing date: 2019-07-15
Publication date: 2023-08-01
Anticipated expiration: 2039-07-15
Also published as: CN110389989A

Abstract

The embodiments of this specification disclose a data processing method, apparatus, and device. The scheme comprises: storing stream data of a first business entity, acquired from each data source, in a log-type key-value database, in which an association relation exists between the data of the first business entity; and determining the multidimensional data of the first business entity in the log-type key-value database according to acquired log information of the database and the association relation.

Description

Data processing method, device and equipment

The present disclosure relates to the field of computer data processing technologies, and in particular, to a data processing method, apparatus, and device.

Background

A real-time data warehouse can acquire, directly and in real time from data sources, the stream data of each business involved in an enterprise's operations. It combines the acquired stream data in a real-time stream computing task to obtain real-time multidimensional data for each business; this multidimensional data is then analyzed and summarized to support decision making at all levels of the enterprise. Currently, when a real-time data warehouse needs to acquire stream data from multiple data sources, a dual-stream join method is generally used to combine the stream data of a given business acquired from those sources into the multidimensional data of that business. However, a dual-stream join takes a long time to run, and every additional data source requires one more dual-stream join operation, so when the real-time data warehouse has many data sources the data processing delay is large.

It is therefore necessary to provide a data processing scheme whose processing delay stays small even when the number of data sources is large.

Disclosure of Invention

In view of this, the embodiments of the present application provide a data processing method, apparatus, and device, so as to provide a data processing scheme whose processing delay stays small even when the number of data sources is large.

In order to solve the above technical problems, the embodiments of the present specification are implemented as follows:

An embodiment of this specification provides a data processing method, applied to a real-time data warehouse server, comprising:

acquiring stream data of a first business entity from each data source;

storing the stream data in a log-type key-value database, wherein an association relation exists between the data of the first business entity in the log-type key-value database;

acquiring log information of the log-type key-value database according to a preset acquisition condition;

and determining multidimensional data of the first business entity in the log-type key-value database according to the log information and the association relation, wherein the multidimensional data comprises data from each data source.

An embodiment of this specification provides a data processing apparatus, including:

a first acquisition module, used for acquiring stream data of a first business entity from each data source;

a storage module, used for storing the stream data in a log-type key-value database, wherein an association relation exists between the data of the first business entity in the log-type key-value database;

a second acquisition module, used for acquiring log information of the log-type key-value database according to a preset acquisition condition;

and a first determining module, used for determining multidimensional data of the first business entity in the log-type key-value database according to the log information and the association relation, wherein the multidimensional data comprises data from each data source.

An embodiment of this specification provides a data processing device, including:

at least one processor; and

a memory communicatively coupled to the at least one processor; wherein

the memory stores instructions executable by the at least one processor to enable the at least one processor to:

acquiring stream data of a first service entity from each data source;

At least one of the above technical solutions adopted in the embodiments of this specification can achieve the following beneficial effects:

Stream data of a first business entity acquired from each data source is stored in a log-type key-value database, in which an association relation exists between the data of the first business entity; the multidimensional data of the first business entity in the log-type key-value database is then determined according to the acquired log information of the database and the association relation. Because writing data, acquiring logs, and looking up the written data according to the logs are all efficient operations on a log-type key-value database, combining the stream data of the first business entity from multiple data sources in this way can improve the efficiency of stream merging and reduce the data processing delay.

Drawings

The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiments of the application and together with the description serve to explain the application and do not constitute an undue limitation to the application. In the drawings:

Fig. 1 is a schematic flow chart of a data processing method according to an embodiment of this specification;

Fig. 2 is a flow chart of an implementation of summarizing multidimensional data related to a first business entity according to an embodiment of this specification;

Fig. 3 is a schematic structural diagram of a data processing apparatus corresponding to the method of Fig. 1 according to an embodiment of this specification;

Fig. 4 is a schematic structural diagram of a data processing device corresponding to the method of Fig. 1 according to an embodiment of this specification.

Detailed Description

To make the purposes, technical solutions, and advantages of the present application clearer, the technical solutions of the present application are described clearly and completely below with reference to specific embodiments and the corresponding drawings. The described embodiments are only some, not all, of the embodiments of the present application. All other embodiments that a person of ordinary skill in the art can obtain from the present disclosure without creative effort fall within the scope of the present disclosure.

The following describes in detail the technical solutions provided by the embodiments of the present application with reference to the accompanying drawings.

Before describing the present invention, a brief description will be first made of concepts to which the present invention relates.

Stream data (data stream): a sequence of data that is ordered, large in volume, and arrives rapidly and continuously.

Stream computing: a computing method that analyzes large-scale stream data in real time; it can extract useful knowledge and information from massive, varied, continuous data streams and send the extracted results to the next computing node.

Data warehouse (Data Warehouse): a subject-oriented, integrated, relatively stable collection of data that reflects historical changes and can be used to support management decisions. A data warehouse stores current and historical data in one place, integrates data from one or more data sources, and supplies the integrated data as decision data or analysis reports for decisions across the enterprise.

Reaction time: refers to the delay between the completion time of an action and the time that the data for that action is available in the data warehouse.

Real-time data warehouse (Real-time Data Warehouse): a data warehouse whose reaction time is negligible. Typically, in a real-time data warehouse, the reaction time from acquiring data from a data source to generating summarized data should be kept within ten minutes.

Business entity: in a real-time data warehouse, an abstraction that integrates, classifies, and analyzes the data of an enterprise information system at a higher level. Each business entity may correspond to an analysis object in a macroscopic analysis field of the enterprise. For example, the business entities of an e-commerce company may include, but are not limited to, orders, complaint work orders, and refund orders.

Dual-stream join: an operation that combines two data streams through a join operator to obtain a combined stream. The join operator extracts the different associated fields from the two data streams to produce a complete set of associated fields.

Log-type key-value database: a key-value database that generates database logs during operation. A database log is a file that records all transactions of the database and the modifications each transaction makes to the database; the data written to the database can be queried according to the database log. A key-value database (key-value store) stores data as a set of key-value pairs, where the key is a unique identifier and the value is the data to be stored.

Multidimensional data: a collection of data containing multiple dimensions, where a dimension is a class of attributes that reflects one aspect of a business.

As described in the background section, when a real-time data warehouse performs a real-time stream computing task, the stream data provided by multiple data sources is currently combined with the dual-stream join technique. For example, when there are three data sources (a first, a second, and a third), the stream data from the first data source and the stream data from the second data source must first be combined by a dual-stream join to obtain first combined stream data, and the first combined stream data must then be combined with the stream data from the third data source by another dual-stream join to obtain second combined stream data. The dual-stream join is inefficient at merging stream data, and when there are many data sources, multiple dual-stream join operations must be executed in sequence. This aggravates the data processing delay to the point where the operating requirements of the real-time data warehouse cannot be met.
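The cascading cost described above can be illustrated with a minimal Python sketch. It is a simplification under stated assumptions: each stream is modeled as a finite dictionary keyed by a hypothetical refund order number (real streams are unbounded), and the names `dual_stream_join` and `cascade_join` are illustrative, not the API of any stream computing engine.

```python
def dual_stream_join(left, right):
    """Merge the fields of records that share a key in both inputs."""
    return {key: {**left[key], **right[key]} for key in left.keys() & right.keys()}


def cascade_join(streams):
    """Join streams pairwise, left to right, counting the join operations."""
    merged, join_count = streams[0], 0
    for stream in streams[1:]:
        merged = dual_stream_join(merged, stream)
        join_count += 1
    return merged, join_count


# Hypothetical records from three data sources, all about refund order "123".
source_1 = {"123": {"product": "cup"}}
source_2 = {"123": {"buyer_message": "item arrived broken"}}
source_3 = {"123": {"picture": "photo_1.jpg"}}

merged, join_count = cascade_join([source_1, source_2, source_3])
print(merged["123"])
print(join_count)  # three sources need two sequential joins
```

In this model, n data sources always require n - 1 joins executed one after another, which is the growth in link length that the scheme below avoids.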

In view of this problem, the inventors observed that when data is stored in a log-type key-value database according to a certain naming rule, the stored data can carry an association relation within the database, and all the data that shares an association relation can be extracted according to the database log information. Establishing the association between related data through read and write operations on a log-type key-value database is efficient, which provides a basis for a data processing method with a smaller data delay.

In view of this, the present invention proposes to store the stream data of the first business entity acquired from each data source in a log-type key-value database, and to read the multidimensional data of the first business entity written to the database according to the database log information. Because writing data, acquiring logs, and looking up the written data according to the logs are all efficient operations on a log-type key-value database, the stream data of the first business entity from multiple data sources can be combined with a reduced data processing delay.

Having briefly described the concepts and basic principles of embodiments of the present invention, a further detailed description of the implementation of the present invention is provided below in conjunction with fig. 1-4.

Fig. 1 is a schematic flow chart of a data processing method according to an embodiment of this specification. From a program perspective, the execution subject of the flow may be a data warehouse application hosted on a real-time data warehouse server. The method can be applied to scenarios with high real-time requirements for processing the stream data of business entities, for example scenarios that need to provide real-time decision data or to display real-time summarized data. Each business entity may correspond to a business analysis object of an enterprise; in practice, business entities differ from enterprise to enterprise because the services the enterprises provide differ. For example, for an enterprise that provides online retail services, the business entities may include, but are not limited to, refund orders, merchandise orders, buyer accounts, merchant accounts, and items of merchandise. For an enterprise that provides stock exchange services, the business entities include, but are not limited to, investors' accounts, stock orders, and stocks issued by a stock company.

The specific implementation of the method is described below, taking as an example a scenario in which the method processes the stream data of one business entity, the refund order. Processing the stream data of a refund order with this method is merely an illustration and should not be construed as limiting the method. As shown in Fig. 1, the process may include:

Step 101: Acquire stream data of a first business entity from each data source.

In the embodiments of this specification, each business entity generally involves multiple data sources, and the stream data provided by each data source contains data of the first business entity in at least one dimension. Specifically, the stream data of the first business entity can be acquired from each data source by a real-time stream computing task, which can be implemented on a stream computing engine such as Flink, Spark Streaming, Storm, or Beam. For example, for a refund order, the data sources may include, but are not limited to, an instant messaging information data source, a message information data source, a refund order basic information data source, and a picture information data source.

Step 102: Store the stream data in a log-type key-value database, in which an association relation exists between the data of the first business entity.

In the embodiments of this specification, the stream data acquired from each data source may be stored in the log-type key-value database according to a preset naming rule, so that an association relation exists between the data of the first business entity in the database. Specifically, because data in a log-type key-value database takes the form of key-value pairs, a naming rule can make the key name of each piece of data correspond to the business entity and the data source to which the data belongs; the association relation between the data of the first business entity in the database can then be determined from the key names.
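A naming rule of this kind could look like the following minimal sketch. The key layout `entity_type:entity_id:data_source`, the function names, and the sample values are all assumptions made for illustration; the specification does not prescribe a key format, and a plain dictionary stands in for the log-type key-value database.

```python
def make_key(entity_type, entity_id, data_source):
    """Build a key name that encodes the business entity and the data source."""
    return f"{entity_type}:{entity_id}:{data_source}"


def associated_records(store, entity_type, entity_id):
    """Return every record of one business entity, keyed by data source."""
    prefix = f"{entity_type}:{entity_id}:"
    return {
        key[len(prefix):]: value
        for key, value in store.items()
        if key.startswith(prefix)
    }


store = {}  # stand-in for the log-type key-value database
store[make_key("refund", "123", "basic_info")] = {"product": "cup", "quantity": 1}
store[make_key("refund", "123", "messages")] = {"buyer_message": "item arrived broken"}
store[make_key("refund", "456", "basic_info")] = {"product": "plate", "quantity": 2}

# The shared key prefix is the association relation between the two sources.
print(associated_records(store, "refund", "123"))
```

Because the association lives in the key name, writing data from an additional source needs no join step: it only needs a key built by the same rule.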

Step 103: Acquire log information of the log-type key-value database according to a preset acquisition condition.

In this embodiment of the present disclosure, log information of the log-type key value database may be automatically obtained periodically based on a preset collection condition.

Step 104: Determine the multidimensional data of the first business entity in the log-type key-value database according to the log information and the association relation, where the multidimensional data comprises data from each data source.

In the embodiments of this specification, the acquisition of log information may serve as a trigger: when log information is acquired, the real-time stream computing task determines the multidimensional data of the first business entity in the log-type key-value database according to that log information. The multidimensional data of the first business entity contains the data of the first business entity from each data source, so the stream data provided by the data sources is thereby combined.

In the embodiments of this specification, the stream data of the first business entity acquired from each data source is stored in the log-type key-value database. Because writing data, acquiring logs, and looking up the written data according to the logs are all efficient operations on such a database, combining the stream data of the first business entity from multiple data sources in this way can improve the efficiency of stream merging and reduce the data processing delay.

Based on the method in fig. 1, some embodiments of the method are also provided in the embodiments of the present specification, and are described below.

In a real-time data warehouse, the combined stream data is generally counted and summarized according to summarizing logic, so that the required summarizing data is provided for decisions of all levels of enterprises. Based on the method in fig. 1, an implementation manner of summarizing multidimensional data related to a first service entity is presented in an embodiment of the present disclosure. As shown in fig. 2, the flow of this implementation is as follows:

Step 201: Acquire stream data of a first business entity from each data source.

Step 202: Store the stream data in a log-type key-value database, in which an association relation exists between the data of the first business entity.

Step 203: Acquire log information of the log-type key-value database according to a preset acquisition condition.

Step 204: Determine the multidimensional data of the first business entity in the log-type key-value database according to the log information and the association relation, where the multidimensional data comprises data from each data source.

In the embodiment of the present disclosure, the steps 201 to 204 may be implemented in the same manner as the steps 101 to 104 in fig. 1, and will not be described herein.

Step 205: Acquire related dimension data.

In the embodiments of this specification, the related dimension data may be data of one or more business entities that have an association relation with the first business entity. In practice, the business entities associated with the first business entity may be specified in advance according to actual requirements. For example, when the first business entity is a refund order for commodity C purchased by user A on retail platform B, the business entities associated with the first business entity may include: the application account of the commodity C, the application account of user A on retail platform B, the application account of the merchant selling commodity C, and the like.

The related dimension data may be data obtained by processing, with the method of steps 201 to 204, the stream data of a business entity associated with the first business entity, or it may be pre-specified static data. In practice, a real-time dimension table of an associated business entity, queried from the data warehouse, can be used as the related dimension data. For example, for the associated business entity "application account of user A on retail platform B", the related dimension data may be a real-time dimension table containing dynamic data of user A's application account, such as the number of orders and the number of refund orders, or a real-time dimension table containing static data of that account, such as the associated mobile phone number and the delivery address.

Step 206: Divide the related dimension data and the multidimensional data of the first business entity to obtain the business subdomain data sets.

In the embodiments of this specification, the business subdomain to which each piece of data of each business entity belongs may be specified in advance according to the enterprise's analysis requirements, so that the related dimension data and the multidimensional data of the first business entity can be divided by business subdomain to simplify later summarization. The data in each business subdomain describes the business logic from a different angle and level. For example, when the first business entity is a refund order, the business subdomains corresponding to its data may include a merchant subdomain, a buyer subdomain, a commodity subdomain, a refund subdomain, and the like.

Step 207: Publish a detail message of the first business entity, where the detail message contains the business subdomain data sets.

In the embodiments of this specification, message middleware (Active Messenger) may be used to publish the detail message of the first business entity as the real-time detail layer (data warehouse detail) of the real-time data warehouse. Specifically, the message middleware may be implemented with Notify, MetaQ, or the like.

Step 208: Summarize the data in at least one business subdomain data set according to the detail message to obtain summarized data.

In the embodiments of this specification, because the data of each business subdomain has already been obtained in step 206, step 208 no longer needs to identify and divide the business subdomain to which each piece of dimension data in the detail message belongs. In step 208, the data of the required business subdomains can be extracted directly from the detail message according to preset summarization logic and summarized to obtain summarized data, which can serve as the summary layer (data warehouse service) of the real-time data warehouse; this simplifies the summarization step. For example, the data of the commodity subdomain may be summarized to obtain the refund rate of a commodity, so that the merchant can decide whether to keep selling it. The data of the buyer subdomain may be summarized to obtain the buyer's order refund rate, to help determine whether the buyer is a malicious user. Or the data of the commodity subdomain and the refund subdomain may be summarized together to provide decision data on whether a refund order meets the refund conditions.
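As an illustration of summarizing directly from pre-divided subdomain data, the sketch below computes a per-commodity refund rate. The message layout, the field names (`trade name`, `units sold`, `refund quantity`), and the sample figures are hypothetical; in particular, `units sold` is assumed to arrive as related dimension data, and the refund rate is assumed to mean refunded units divided by units sold.

```python
from collections import defaultdict


def refund_rate_by_product(detail_messages):
    """Summarize commodity and refund subdomain data into a per-product refund rate."""
    refunded = defaultdict(int)
    sold = {}
    for message in detail_messages:
        commodity = message["commodity subdomain"]
        refund = message["refund subdomain"]
        refunded[commodity["trade name"]] += refund["refund quantity"]
        sold[commodity["trade name"]] = commodity["units sold"]
    return {name: refunded[name] / sold[name] for name in refunded}


# Hypothetical detail messages, one per refund order, already divided by subdomain.
detail_messages = [
    {"commodity subdomain": {"trade name": "cup", "units sold": 100},
     "refund subdomain": {"refund quantity": 1}},
    {"commodity subdomain": {"trade name": "cup", "units sold": 100},
     "refund subdomain": {"refund quantity": 2}},
    {"commodity subdomain": {"trade name": "plate", "units sold": 50},
     "refund subdomain": {"refund quantity": 1}},
]

rates = refund_rate_by_product(detail_messages)
print(rates)
```

The summarization code reads each subdomain by name and never inspects which data source a field came from, which is the simplification step 206 is meant to provide.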

In the embodiments of this specification, because writing data, acquiring logs, and looking up the written data according to the logs are all efficient operations on a log-type key-value database, combining the stream data of the first business entity from multiple data sources in this way can improve the efficiency of stream merging and reduce the data processing delay. Dividing the related dimension data and the multidimensional data of the first business entity by business subdomain, and then publishing the resulting business subdomain data sets as the detail message of the first business entity, simplifies the summarization steps, which further improves data processing efficiency and reduces the data processing delay. In this implementation, if a new data source must be added for business reasons, only steps 201 and 202 need to be modified so that the stream data acquired from the new data source is written to the log-type key-value database, and step 208 is adapted so that the new data is summarized according to the summarization logic. In other words, the length of the data link does not change as the number of data sources grows, so maintainability and stability are good.

In the embodiments of this specification, the log-type key-value database includes, but is not limited to, databases such as HBase or Redis. This specification provides a specific implementation of storing the stream data in the log-type key-value database when HBase is used.

In this implementation, the stream data may be stored in the log-type key-value database according to a naming rule under which: the data of the first business entity from different data sources has the same row key name in HBase; the column family name of any piece of data of the first business entity corresponds to the data source from which that data came; and the column name of any piece of data of the first business entity corresponds to the business subdomain to which that data belongs.

In the embodiments of this specification, data in HBase can be located quickly by names in three dimensions: the row key (rowkey), the column family (Column family), and the column (Column). The row key is the primary key that identifies a unique row record; each row contains at least one column family, and each column family contains at least one column of data. Data in HBase is versioned: each time data is created or modified, version information is stored, and the version is a timestamp.

In the embodiments of this specification, the data of the first business entity stored in HBase is illustrated for the scenario in which the first business entity is a refund order. Table 1 shows the multidimensional data of the refund order with refund order number 123 as stored in HBase. As shown in Table 1:

In Table 1, the refund order number 123 is used as the row key name in HBase. The refund order has two data sources: a refund order basic information data source and a message information data source. The refund order basic information data source provides rapidly and continuously arriving data sequences in three dimensions (the trade name of the refunded commodity, the quantity of refunded commodities, and the buyer account), and the message information data source provides rapidly and continuously arriving data sequences in two dimensions (buyer messages and merchant messages). The trade name of the refunded commodity belongs to the commodity subdomain, the quantity of refunded commodities belongs to the refund subdomain, the buyer account and the buyer message belong to the buyer subdomain, and the merchant message belongs to the merchant subdomain.
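The storage layout described for Table 1 can be sketched with a nested in-memory structure. This is a stand-in for illustration only, not the HBase client API; the family names `refund_basic_info` and `message_info`, the integer timestamps, and the values marked as hypothetical are assumptions. The buyer account and buyer message values are the ones quoted later in this description.

```python
from collections import defaultdict


def new_table():
    """Create an empty table: row key -> column family -> column -> {timestamp: value}."""
    return defaultdict(lambda: defaultdict(lambda: defaultdict(dict)))


def put(table, row_key, family, column, timestamp, value):
    """Store one versioned cell, addressed by row key, column family, and column."""
    table[row_key][family][column][timestamp] = value


T1, T2 = 1, 2  # stand-ins for the timestamps t1 and t2
table = new_table()

# Column family for the refund order basic information data source.
put(table, "123", "refund_basic_info", "commodity subdomain-trade name", T1, "cup")  # hypothetical value
put(table, "123", "refund_basic_info", "refund subdomain-refund quantity", T1, 1)    # hypothetical value
put(table, "123", "refund_basic_info", "buyer subdomain-buyer account", T1, "xiaohong")
put(table, "123", "refund_basic_info", "buyer subdomain-buyer account", T2, "xiaohong")

# Column family for the message information data source.
put(table, "123", "message_info", "buyer subdomain-buyer message", T1,
    "commodity breakage request confirmation")
put(table, "123", "message_info", "merchant subdomain-merchant message", T1, "ok")   # hypothetical value

# Both data sources share row key "123", so one row lookup reaches all of them.
print(sorted(table["123"]))
```

The row key carries the entity identity, the column family carries the data source, and the column name carries the business subdomain, so no separate join key is needed.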

In the embodiments of this specification, storing the stream data of the first business entity in the log-type key-value database establishes an association relation between the multiple streams of the first business entity, which realizes their merging. Because a read or write in HBase generally takes only about 10 milliseconds, the delay when merging multiple streams in the embodiments of this specification can be kept within tens of milliseconds, whereas merging multiple streams with the dual-stream join method takes at least several hundred milliseconds. The method of the embodiments of this specification therefore merges stream data more efficiently and with less processing delay. With the dual-stream join method, every added data source requires one more dual-stream join operation; this lengthens the data link, lowers data processing efficiency, requires large changes to the data processing program, and affects the program's running stability. In the implementation provided in this embodiment, the length of the data link does not change when the number of data sources changes, so maintainability and stability are better.

In the embodiments of this specification, when HBase is used as the log-type key-value database, the preset acquisition condition may include at least one of the following: a preset collection time is reached, or the data amount of the newly added log information is greater than a preset threshold. The newly added log information is the log information generated by HBase between the last log collection time and the current time.

The log information of HBase is a log file, which HBase generates using a WAL (write-ahead log) mechanism. Specifically, each time data in HBase is modified, the data is written to memory, and when the write succeeds, HBase writes the record to the HLog, generating the HLog file.

The acquiring the log information of the log-type key value database according to the preset acquisition condition may specifically include:

when the preset collection time is reached, collecting log information of the HBase; or when the data quantity of the newly added log information is larger than a preset threshold value, collecting the log information of the HBase.
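The two acquisition conditions can be expressed as a small predicate. The sketch below assumes that "reaching the preset collection time" means a fixed interval has elapsed since the last collection; the function name, parameter names, and default values are hypothetical.

```python
def should_collect(now, last_collect_time, new_log_bytes,
                   interval_seconds=1.0, size_threshold_bytes=1024):
    """Return True when either preset acquisition condition is met."""
    # Condition 1: the preset collection time has been reached
    # (modeled here as a fixed interval since the last collection).
    time_reached = now - last_collect_time >= interval_seconds
    # Condition 2: the log data added since the last collection exceeds the threshold.
    size_exceeded = new_log_bytes > size_threshold_bytes
    return time_reached or size_exceeded


print(should_collect(now=10.0, last_collect_time=9.5, new_log_bytes=2048))
print(should_collect(now=10.0, last_collect_time=8.0, new_log_bytes=0))
print(should_collect(now=10.0, last_collect_time=9.5, new_log_bytes=100))
```

Checking both conditions bounds the delay in two ways: the interval caps how long a quiet log waits, and the size threshold triggers early collection when writes arrive in a burst.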

In this embodiment of the present disclosure, when HBase is used as the log-type key value database, the determining multidimensional data of the first service entity in the log-type key value database may specifically include:

determining, according to the log information, the column family to which the newly added data of the first business entity in HBase belongs; determining the row key corresponding to that column family; and determining the data corresponding to the row key in HBase as the multidimensional data of the first business entity.

In the embodiments of this specification, because the stream data that the different data sources provide for the first business entity shares the same row key name, all the data under the row key of any piece of data of the first business entity can be extracted as the multidimensional data of the first business entity, thereby obtaining the result of merging the multiple streams of the first business entity.
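The log-driven lookup can be sketched as follows. The log entry format (a dictionary with `row_key`, `family`, and `column`) is a hypothetical simplification and not the HLog file format; the table is an in-memory stand-in that keeps only the latest value per column, and the second row ("456", "xiaoming") is invented for contrast.

```python
def rows_with_new_data(log_entries, watched_families):
    """Return row keys whose newly logged writes fall in a watched column family."""
    row_keys = []
    for entry in log_entries:
        if entry["family"] in watched_families and entry["row_key"] not in row_keys:
            row_keys.append(entry["row_key"])
    return row_keys


def multidimensional_data(table, row_key):
    """Collect every cell of one row, across all column families."""
    return {
        f"{family}:{column}": value
        for family, columns in table[row_key].items()
        for column, value in columns.items()
    }


# Hypothetical table contents: row key -> column family -> column -> value.
table = {
    "123": {
        "refund_basic_info": {"buyer account": "xiaohong"},
        "message_info": {"buyer message": "commodity breakage request confirmation"},
    },
    "456": {"refund_basic_info": {"buyer account": "xiaoming"}},
}

# Hypothetical log entries for writes made since the last collection.
new_log_entries = [
    {"row_key": "123", "family": "message_info", "column": "buyer message"},
]

watched = {"refund_basic_info", "message_info"}
for row_key in rows_with_new_data(new_log_entries, watched):
    print(row_key, multidimensional_data(table, row_key))
```

A single new write from one data source is enough to return the whole row, including the data written earlier by the other sources, which is how the merge happens without a join.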

In the present description embodiment, for step 206: the dividing the related dimension data and the multidimensional data of the first service entity may specifically include:

dividing the multidimensional data of the first business entity according to the correspondence between the column name of the column to which any piece of data of the first business entity belongs and the business subdomain to which that data belongs; and dividing the related dimension data according to the correspondence between the related dimension data and the business subdomains of the first business entity.

In this embodiment of the present disclosure, for any one of the service sub-domain data sets, the format of the data in the any one of the service sub-domain data sets may be a key value pair format or a JSON (i.e. JavaScript Object Notation) format, each data in the any one of the service sub-domain data sets corresponds to a field, and the any one of the service sub-domain data sets may include a plurality of fields.

In this embodiment of the present disclosure, the multidimensional data of the first service entity may be divided according to a correspondence between a column name of each column of data in the HBase and a service subdomain to which the data belongs. For example, as shown in table one, according to the column name "buyer subdomain-buyer account" of the dimension data of the buyer account, the dimension data of the buyer account is known to belong to the data of the buyer subdomain, and therefore, the column data of the buyer account can be divided into the buyer subdomain data set. According to the same principle, the other data in the table one is divided, and the buyer subdomain data set includes three fields, namely "buyer account with time stamp t 1-xiaohong", "buyer account with time stamp t 2-xiaohong", and "buyer message with time stamp t 1-commodity breakage request confirmation".
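The division by column name can be sketched as follows. The sketch assumes that the column name takes the form "subdomain-field", as in the "buyer subdomain-buyer account" example. The column name used for the buyer message and the `field@timestamp` key layout are assumptions made for illustration.

```python
from collections import defaultdict


def divide_by_subdomain(cells):
    """Group (column_name, timestamp, value) cells into subdomain data sets."""
    datasets = defaultdict(dict)
    for column_name, timestamp, value in cells:
        # The text before the first "-" names the service subdomain.
        subdomain, _, field = column_name.partition("-")
        # Key each field by name and timestamp so that versions stay distinct.
        datasets[subdomain][f"{field}@{timestamp}"] = value
    return dict(datasets)


cells = [
    ("buyer subdomain-buyer account", "t1", "xiaohong"),
    ("buyer subdomain-buyer account", "t2", "xiaohong"),
    ("buyer subdomain-buyer message", "t1", "commodity breakage request confirmation"),
]
datasets = divide_by_subdomain(cells)
```

With these three cells, the buyer subdomain data set ends up with three fields, in line with the example in the text.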

In this embodiment of the present disclosure, since each sub-domain data set may include multiple fields, when the number of data sources of the first service entity increases, by making a corresponding relationship between a column name of stream data provided by the newly added data source after being stored in the HBase and a service sub-domain to which the column name belongs, the stream data provided by the newly added data source may be automatically divided according to the corresponding relationship, so as to obtain each updated service sub-domain data set. In this embodiment, when the number of data sources is increased, only the stream data provided by the newly added data source is written into the HBase according to the preset naming rule, so that the fields provided by the newly added data source can be automatically divided into the corresponding service subdomain data sets, the data link length is fixed, the maintainability and stability of the scheme can be improved, and the resource consumption of the scheme can be reduced.

In the embodiment of the present specification, stream data of a first service entity from a plurality of data sources are combined by performing efficient operations on HBase, such as writing data, collecting logs, and looking up the written data according to the logs. Since the delay for reading and writing data in HBase is typically about ten milliseconds and the delay of the real-time streaming calculation task is typically several tens of milliseconds, the data delay from acquiring the stream data provided by each data source to finally obtaining the summarized data can be kept at the sub-second level (i.e., several hundred milliseconds). By contrast, the data processing delay of a data summarization scheme based on the dual-stream join technology is typically several seconds to several tens of seconds. The scheme provided by the embodiment of the specification therefore reduces the data processing delay, consumes fewer resources, has a fixed data link length, and offers higher maintainability and stability, thereby providing an implementation scheme for building a real-time data warehouse with sub-second latency, low cost, low resource consumption, and high maintainability.

In an embodiment of the present disclosure, before the obtaining the stream data of the first service entity from each data source, the method may further include:

acquiring a streaming computing task, wherein the streaming computing task is used for acquiring and processing streaming data of a first service entity; determining each data source of the first service entity; the data source is used for providing stream data of the first service entity; and subscribing the data source for the streaming computing task, so as to acquire the streaming data of the first business entity from the subscribed data source and process the streaming data in real time when the streaming computing task is executed.
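The subscription step can be sketched as below. Every name here (`StreamingTask`, `subscribe`, `run_once`, the source names and the record fields) is hypothetical, and the data sources are modelled as plain callables rather than a real message queue or stream platform.

```python
from dataclasses import dataclass, field


@dataclass
class StreamingTask:
    """Hypothetical streaming task that pulls data for one service entity."""
    entity: str
    subscriptions: list = field(default_factory=list)

    def subscribe(self, source_name):
        # Subscribe the task to a data source (idempotent).
        if source_name not in self.subscriptions:
            self.subscriptions.append(source_name)

    def run_once(self, sources):
        # Fetch the currently available records from every subscribed source.
        records = []
        for name in self.subscriptions:
            records.extend(sources[name]())
        return records


# Hypothetical data sources, each returning stream records of the entity.
sources = {
    "order_stream": lambda: [{"order_id": "order-001", "buyer account": "xiaohong"}],
    "logistics_stream": lambda: [{"order_id": "order-001", "carrier": "carrier-x"}],
}

task = StreamingTask(entity="order")
for name in sources:
    task.subscribe(name)
records = task.run_once(sources)
```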

Based on the same thought, the embodiment of the specification also provides a device corresponding to the method. Fig. 3 is a schematic structural diagram of a data processing apparatus corresponding to the method in fig. 1 according to an embodiment of the present disclosure. As shown in fig. 3, the apparatus may include:

a first obtaining module 301, configured to obtain, from each data source, stream data of a first service entity.

And the storage module 302 is configured to store the stream data into a log-type key value database, where the log-type key value database has an association relationship between data of a first service entity.

And the second obtaining module 303 is configured to obtain log information of the log-type key value database according to a preset collection condition.

The first determining module 304 is configured to determine multidimensional data of the first service entity in the log-type key value database according to the log information and the association relationship, where the multidimensional data is data from each data source.

In this embodiment of the present disclosure, the storage module 302 stores the stream data of the first service entity obtained from each data source into the log-type key value database, and the first determining module 304 reversely searches the data of the first service entity written into the log-type key value database according to the log information of the log-type key value database, so as to obtain multi-dimensional data of the first service entity obtained by combining the plurality of stream data, thereby improving the stream data combining efficiency and reducing the delay time of data processing.

In an embodiment of the present disclosure, the data processing apparatus may further include:

and the third acquisition module is used for acquiring the related dimension data.

The division module is used for dividing the related dimension data and the multidimensional data of the first service entity to obtain each service subdomain data set.

And the issuing module is used for issuing the detail information of the first service entity, wherein the detail information comprises the service subdomain data sets.

And the summarizing module is used for summarizing the data in at least one business subdomain data set according to the detail information to obtain summarized data.

In the embodiment of the present disclosure, the dividing module in the data processing apparatus may divide the relevant dimension data and the multidimensional data of the first service entity according to the service subdomain to obtain each service subdomain data set; the publishing module publishes the service subdomain data sets as the detail information of the first business entity, so that the summarizing step during data summarizing can be simplified, the data processing efficiency can be further improved, and the data processing delay time can be reduced. When the number of data sources increases, only the stream data acquired from the newly added data source needs to be written into the log-type key value database, and the summarization logic only needs to be adapted to summarize the newly added data; that is, when the number of data sources increases, the length of the data link in the data processing apparatus is unchanged, and the maintainability and the stability are better.
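The specification does not fix a summarization logic. The sketch below assumes one possible logic, summing a numeric field per key over the published detail messages. The function `summarize`, the "paid amount" field and the sample values are all assumptions made for illustration.

```python
from collections import defaultdict


def summarize(detail_messages, subdomain, key_field, value_field):
    """Sum value_field per key_field over one subdomain data set."""
    totals = defaultdict(float)
    for message in detail_messages:
        dataset = message.get(subdomain)
        # Skip messages that lack the subdomain or the fields of interest.
        if dataset is None or key_field not in dataset or value_field not in dataset:
            continue
        totals[dataset[key_field]] += float(dataset[value_field])
    return dict(totals)


# Hypothetical detail messages, each mapping a subdomain name to its data set.
detail_messages = [
    {"buyer subdomain": {"buyer account": "xiaohong", "paid amount": "12.5"}},
    {"buyer subdomain": {"buyer account": "xiaohong", "paid amount": "7.5"}},
    {"seller subdomain": {"seller account": "xiaoming"}},
]
totals = summarize(detail_messages, "buyer subdomain", "buyer account", "paid amount")
```

Because each detail message is already organized by subdomain, the summary only has to read the data sets it needs, and fields from a newly added data source do not change this code path.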

In the embodiment of the present disclosure, the log-type key value database may be HBase; the storage module may be specifically configured to:

And storing the stream data into an HBase, wherein row key names corresponding to the data of the first service entities from different data sources in the HBase are the same, a corresponding relation exists between a column cluster name of the data of any one first service entity and a data source from which the data of the any one first service entity is derived, and a corresponding relation exists between the column name of the data of any one first service entity and a service subdomain to which the data of the any one first service entity is belonged.
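The naming rule used by the storage module can be sketched with an in-memory stand-in for HBase. The helper `put_stream_record`, the nested-dictionary layout, and the source, subdomain and field names other than "buyer subdomain-buyer account" are assumptions. A real implementation would issue writes through an HBase client.

```python
def put_stream_record(table, row_key, data_source, subdomain, field_name, value):
    """Write one value following the naming rule described above."""
    # Same row key for every source; column cluster named after the data source.
    family = table.setdefault(row_key, {}).setdefault(data_source, {})
    # Column name encodes the service subdomain the field belongs to.
    family[f"{subdomain}-{field_name}"] = value


table = {}
put_stream_record(table, "order-001", "source_a",
                  "buyer subdomain", "buyer account", "xiaohong")
# A newly added data source only needs to follow the same naming rule.
put_stream_record(table, "order-001", "source_b",
                  "logistics subdomain", "carrier", "carrier-x")
```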

In the embodiment of the present specification, the dividing module may specifically be used to:

In this embodiment of the present disclosure, for any one of the service sub-domain data sets generated by the partitioning module, the format of data in the any one of the service sub-domain data sets is a key-value pair format, and each data in the any one of the service sub-domain data sets corresponds to one field.

In this embodiment of the present disclosure, the preset collection condition may include at least one of reaching a preset collection time or a newly added log information data amount being greater than a preset threshold, where the newly added log information is log information generated by the HBase between a last log collection time and a current time.

The second obtaining module 303 may specifically be configured to: when the preset collection time is reached, collecting log information of the HBase; or when the data quantity of the newly added log information is larger than a preset threshold value, collecting the log information of the HBase.

In the embodiment of the present disclosure, the first determining module 304 may specifically be configured to:

and determining a column cluster to which the data of the newly added first service entity in the HBase belongs according to the log information.

And determining a row key corresponding to the column cluster to which the data of the newly added first service entity belongs.

And determining the data corresponding to the row key in the HBase as multi-dimensional data of the first service entity.

and the third acquisition module is used for acquiring a streaming computing task, and the streaming computing task is used for acquiring and processing the streaming data of the first service entity.

A second determining module, configured to determine each data source of the first service entity; the data source is configured to provide streaming data of the first service entity.

And the subscription module is used for subscribing the data sources for the streaming computing task.

Based on the same thought, the embodiment of the specification also provides equipment corresponding to the method. Fig. 4 is a schematic structural diagram of a data processing apparatus according to an embodiment of the present disclosure. As shown in fig. 4, the apparatus 400 may include:

at least one processor 410; and

a memory 430 communicatively coupled to the at least one processor; wherein

the memory stores instructions 420 executable by the at least one processor 410, the instructions being executable by the at least one processor 410 to enable the at least one processor 410 to:

stream data of a first business entity is obtained from each data source.

And storing the stream data into a log-type key value database, wherein the log-type key value database has an association relation between the data of the first service entity.

And acquiring log information of the log-type key value database according to preset acquisition conditions.

In the embodiment of the present disclosure, the data processing apparatus stores the stream data of the first service entity acquired from each data source into the log-type key value database, and merges the stream data of the first service entity from the plurality of data sources through efficient operations on the log-type key value database, such as writing data, collecting logs, and looking up the written data according to the logs, thereby improving the stream data merging efficiency and reducing the data processing delay time.

The processor 410 in the data processing apparatus is further capable of:

relevant dimension data is acquired.

And dividing the related dimension data and the multidimensional data of the first service entity to obtain each service subdomain data set.

And issuing a detail message of the first service entity, wherein the detail message comprises the service subdomain data sets.

And summarizing the data in at least one business subdomain data set according to the detail information to obtain summarized data.

In this embodiment of the present disclosure, the data processing device may further divide the related dimension data and the multidimensional data of the first service entity according to the service sub-domain to obtain each service sub-domain data set, and then issue the service sub-domain data sets as a detail message of the first service entity, so as to simplify the summarizing step when summarizing data, further improve data processing efficiency, and reduce data processing delay time. When the number of data sources increases, only the stream data acquired from the newly added data source needs to be written into the log-type key value database, and the summarization logic only needs to be adapted to summarize the newly added data; that is, when the number of data sources increases, the length of the data link in the data processing device is unchanged, and the maintainability and the stability are good.

In the 1990s, an improvement to a technology could be clearly distinguished as an improvement in hardware (e.g., an improvement to a circuit structure such as a diode, transistor, or switch) or an improvement in software (an improvement to a method flow). However, with the development of technology, many improvements to method flows today can be regarded as direct improvements to hardware circuit structures. Designers almost always obtain the corresponding hardware circuit structure by programming the improved method flow into a hardware circuit. Therefore, it cannot be said that an improvement to a method flow cannot be implemented by a hardware entity module. For example, a programmable logic device (PLD), such as a field programmable gate array (FPGA), is an integrated circuit whose logic function is determined by the user's programming of the device. A designer programs to "integrate" a digital system onto a single PLD, without requiring a chip manufacturer to design and fabricate an application-specific integrated circuit chip. Moreover, nowadays, instead of manually fabricating integrated circuit chips, such programming is mostly implemented with "logic compiler" software, which is similar to the software compiler used in program development; the source code to be compiled must also be written in a specific programming language, called a hardware description language (HDL). There is not just one HDL but many, such as ABEL (Advanced Boolean Expression Language), AHDL (Altera Hardware Description Language), Confluence, CUPL (Cornell University Programming Language), HDCal, JHDL (Java Hardware Description Language), Lava, Lola, MyHDL, PALASM, and RHDL (Ruby Hardware Description Language); VHDL (Very-High-Speed Integrated Circuit Hardware Description Language) and Verilog are currently the most commonly used. It will also be apparent to those skilled in the art that a hardware circuit implementing a logic method flow can be readily obtained by merely programming the method flow slightly in one of the hardware description languages described above and programming it into an integrated circuit.

The controller may be implemented in any suitable manner. For example, the controller may take the form of a microprocessor or processor together with a computer-readable medium storing computer-readable program code (e.g., software or firmware) executable by the (micro)processor, logic gates, switches, an application-specific integrated circuit (ASIC), a programmable logic controller, or an embedded microcontroller. Examples of controllers include, but are not limited to, the following microcontrollers: ARC 625D, Atmel AT91SAM, Microchip PIC18F26K20, and Silicon Labs C8051F320. A memory controller may also be implemented as part of the control logic of a memory. Those skilled in the art will also appreciate that, in addition to implementing the controller purely as computer-readable program code, it is entirely possible to logically program the method steps so that the controller achieves the same functionality in the form of logic gates, switches, application-specific integrated circuits, programmable logic controllers, embedded microcontrollers, and the like. Such a controller may therefore be regarded as a hardware component, and the means included therein for performing various functions may also be regarded as structures within the hardware component. Or the means for performing the various functions may even be regarded as both software modules implementing the method and structures within the hardware component.

The system, apparatus, module or unit set forth in the above embodiments may be implemented in particular by a computer chip or entity, or by a product having a certain function. One typical implementation is a computer. In particular, the computer may be, for example, a personal computer, a laptop computer, a cellular telephone, a camera phone, a smart phone, a personal digital assistant, a media player, a navigation device, an email device, a game console, a tablet computer, a wearable device, or a combination of any of these devices.

For convenience of description, the above devices are described as being functionally divided into various units, respectively. Of course, the functions of each element may be implemented in one or more software and/or hardware elements when implemented in the present application.

It will be appreciated by those skilled in the art that embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.

The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

In one typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.

The memory may include forms of computer-readable media such as volatile memory, random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of computer-readable media.

Computer-readable media include both permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, modules of a program, or other data. Examples of storage media for a computer include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. Computer-readable media, as defined herein, do not include transitory computer-readable media (transitory media), such as modulated data signals and carrier waves.

It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other like elements in a process, method, article or apparatus that comprises the element.

The application may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The application may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.

In this specification, the embodiments are described in a progressive manner; identical and similar parts of the embodiments may be referred to across embodiments, and each embodiment mainly describes its differences from the other embodiments. In particular, since the system embodiments are substantially similar to the method embodiments, their description is relatively brief; for relevant details, refer to the corresponding parts of the description of the method embodiments.

The foregoing is merely exemplary of the present application and is not intended to limit the present application. Various modifications and changes may be made to the present application by those skilled in the art. Any modifications, equivalent substitutions, improvements, etc. which are within the spirit and principles of the present application are intended to be included within the scope of the claims of the present application.

Claims

1. A data processing method, the method being applied to a real-time data warehouse server, comprising:

acquiring stream data of a first service entity from each data source;

storing the stream data into a log-type key value database, wherein the log-type key value database has an association relation between data of a first service entity; the row key names corresponding to the data of the first business entities from different data sources in the log-type key value database are the same, and a corresponding relation exists between the column cluster name of the data of any one first business entity and the data source from which the data of the any one first business entity comes;

according to the log information and the association relation, determining multidimensional data of a first business entity in the log-type key value database, wherein the multidimensional data are data from each data source;

the determining the multidimensional data of the first service entity in the log-type key value database according to the log information and the association relation specifically comprises the following steps:

determining a row key corresponding to a column cluster to which the data of the newly added first business entity in the log-type key value database belong according to the log information;

and determining the data with the corresponding relation to the row key corresponding to the column cluster to which the data of the newly added first service entity belongs in the log-type key value database as the multidimensional data of the first service entity.

2. The method of claim 1, further comprising, after determining the multidimensional data of the first business entity in the log-type key database:

acquiring relevant dimension data;

dividing the related dimension data and the multidimensional data of the first service entity to obtain each service subdomain data set;

issuing a detail message of the first service entity, wherein the detail message comprises the service subdomain data sets;

3. The method of claim 2, the log-type key-value database comprising HBase;

the storing the stream data in the log-type key value database specifically comprises:

4. The method of claim 3, wherein the dividing the relevant dimension data and the multidimensional data of the first service entity specifically includes:

dividing the multidimensional data of the first business entity according to the corresponding relation between the column name of the column to which the data of any one of the first business entities belongs and the business subdomain to which the data of any one of the first business entities belongs;

and dividing the relevant dimension data according to the corresponding relation between the relevant dimension data and the business subdomains of the first business entity.

5. The method of claim 4, wherein for any one of the respective service sub-domain data sets, the format of the data in the any one of the service sub-domain data sets is a key-value pair format, and each of the data in the any one of the service sub-domain data sets corresponds to a field.

6. The method of claim 3, wherein the preset collection condition includes reaching a preset collection time or an added log information data amount being greater than a preset threshold, the added log information being log information generated by the HBase between a last log collection time and a current time;

the acquiring the log information of the log-type key value database according to the preset acquisition condition specifically comprises the following steps:

when the preset collection time is reached, collecting log information of the HBase;

or when the data quantity of the newly added log information is larger than a preset threshold value, collecting the log information of the HBase.

7. The method of claim 3, wherein the determining the multidimensional data of the first business entity in the log-type key database specifically includes:

determining a column cluster to which the data of the newly added first service entity in the HBase belongs according to the log information;

determining a row key corresponding to a column cluster to which the data of the newly added first service entity belongs;

8. A method according to claim 1 or 2, further comprising, prior to obtaining the stream data of the first service entity from the respective data source:

acquiring a streaming computing task, wherein the streaming computing task is used for acquiring and processing streaming data of a first service entity;

determining each data source of the first service entity; the data source is used for providing stream data of the first service entity;

subscribing to the data source for the streaming computing task.

9. A data processing apparatus comprising:

the storage module is used for storing the stream data to a log-type key value database, wherein the log-type key value database has an association relation between the data of a first service entity; the row key names corresponding to the data of the first business entities from different data sources in the log-type key value database are the same, and a corresponding relation exists between the column cluster name of the data of any one first business entity and the data source from which the data of the any one first business entity comes;

the first determining module is used for determining multidimensional data of a first business entity in the log-type key value database according to the log information and the association relation, wherein the multidimensional data are data from each data source;

the first determining module is specifically configured to: and determining row keys corresponding to column clusters to which the data of the newly added first service entity belong in the log-type key value database according to the log information, and determining the data with corresponding relations to the row keys corresponding to the column clusters to which the data of the newly added first service entity belong in the log-type key value database as the multidimensional data of the first service entity.

10. The apparatus of claim 9, further comprising:

the third acquisition module is used for acquiring the related dimension data;

the division module is used for dividing the related dimension data and the multidimensional data of the first service entity to obtain each service subdomain data set;

the publishing module is used for publishing the detail information of the first service entity, wherein the detail information comprises the service subdomain data sets;

11. The apparatus of claim 10, the log-type key value database comprising HBase; the storage module is specifically configured to:

12. The apparatus of claim 11, the partitioning module is specifically configured to:

13. The apparatus of claim 12, wherein for any one of the respective business sub-domain data sets, the format of the data in the any one of the business sub-domain data sets is a key-value pair format, and each of the data in the any one of the business sub-domain data sets corresponds to a field.

14. The apparatus of claim 11, the preset acquisition condition comprising reaching a preset acquisition time or an increased amount of log information data greater than a preset threshold, the increased log information being log information generated by the HBase between a last log acquisition time to a current time;

the second obtaining module is specifically configured to:

15. The apparatus of claim 11, the first determining module is specifically configured to:

16. The apparatus of claim 9 or 10, further comprising:

the third acquisition module is used for acquiring a streaming computing task, wherein the streaming computing task is used for acquiring and processing streaming data of the first service entity;

a second determining module, configured to determine each data source of the first service entity; the data source is used for providing stream data of the first service entity;

17. A data processing apparatus comprising:

at least one processor; and

a memory communicatively coupled to the at least one processor; wherein

acquiring stream data of a first service entity from each data source;