CN109359109B - Data processing method and system based on distributed stream computing

Info

Publication number: CN109359109B
Application number: CN201810968190.4A
Authority: CN
Inventors: 王一光; 孙尚椿; 王琳; 朱冠胤
Original assignee: Advanced New Technologies Co Ltd
Current assignee: Advanced New Technologies Co Ltd; Advantageous New Technologies Co Ltd
Priority date: 2018-08-23
Filing date: 2018-08-23
Publication date: 2022-05-27
Anticipated expiration: 2038-08-23
Also published as: CN109359109A

Abstract

A data processing method and system based on distributed stream computing are disclosed. Across a variety of business scenarios, the stream computing process for data is abstracted into three general stages: a data acquisition stage, a feature extraction stage, and a feature statistics stage. A corresponding processing program (a data acquisition program, a feature extraction program, or a feature statistics program) is developed for each stage, and program deployment is completed. Before performing its processing function (data acquisition, feature extraction, or feature statistics), the processing program for each stage must load configuration information specified by the service party (data source configuration information, extraction rule configuration information, or statistical rule configuration information, respectively). By specifying this configuration information, the service party configures the parameters of the processing programs that relate to the service scenario.

Description

Data processing method and system based on distributed stream computing

Technical Field

The embodiments of this specification relate to the field of information technology, and in particular to a data processing method and system based on distributed stream computing.

Background

In the internet industry, an enterprise typically analyzes and processes the business data generated during business operations in order to derive valuable statistics. For example, an e-commerce platform may analyze its recorded user behavior logs to count, for each user, the number of times the user clicked on each recommended item within a specified period; the resulting statistics may reflect the user's shopping preferences.

In practice, the amount of service data generated during service operation is huge. Because the processing capacity of a single device is limited and the generated service data must be processed promptly, the service data is generally processed by distributed stream computing.

Taking the above service scenario (counting the number of clicks by each user on each recommended item) as an example: before the service data is analyzed and processed by the distributed system, the service party usually deploys a program on the distributed system. That is, for this service scenario, the service party writes a processing program that counts the number of clicks by each user on each recommended item and installs it on every node of the distributed system. The service data stream (a number of user behavior logs arranged in the order in which they were generated) can then flow among the nodes, and the nodes, by executing the processing program, cooperatively analyze and process the massive volume of user behavior logs and finally count the number of clicks by each user on each recommended item within a specified period.

However, service scenarios in practice are varied, and this manner of program deployment imposes a heavy workload on the service party.

Disclosure of Invention

In view of the foregoing technical problem, the embodiments of this specification provide a data processing method and system based on distributed stream computing. The technical solution is as follows:

According to a first aspect of the embodiments of this specification, there is provided a data processing method based on distributed stream computing, where a data processing system includes a data acquisition node, a feature extraction node, and a feature statistics node, and the method includes:

after loading, through an installed data acquisition program, data source configuration information specified by a service party, the data acquisition node acquires, through the data acquisition program, a service data stream from a data source recorded in the data source configuration information and transfers the service data stream to the feature extraction node;

after loading, through an installed feature extraction program, extraction rule configuration information specified by the service party, the feature extraction node extracts, through the feature extraction program and for each piece of service data in the service data stream in turn, feature information from the service data according to the extraction rule recorded in the extraction rule configuration information, and transfers the resulting feature information stream to the feature statistics node;

and after loading, through an installed feature statistics program, statistical rule configuration information specified by the service party, the feature statistics node performs, through the feature statistics program, statistics on the feature information in the feature information stream according to the statistical rule recorded in the statistical rule configuration information to obtain a statistical result, and outputs the statistical result.

According to a second aspect of the embodiments of this specification, there is provided a data processing system based on distributed stream computing, the data processing system including a data acquisition node, a feature extraction node, and a feature statistics node;

the data acquisition node, after loading, through an installed data acquisition program, data source configuration information specified by a service party, acquires, through the data acquisition program, a service data stream from a data source recorded in the data source configuration information and transfers the service data stream to the feature extraction node;

the feature extraction node, after loading, through an installed feature extraction program, extraction rule configuration information specified by the service party, extracts, through the feature extraction program and for each piece of service data in the service data stream in turn, feature information from the service data according to the extraction rule recorded in the extraction rule configuration information, and transfers the resulting feature information stream to the feature statistics node;

and the feature statistics node, after loading, through an installed feature statistics program, statistical rule configuration information specified by the service party, performs, through the feature statistics program, statistics on the feature information in the feature information stream according to the statistical rule recorded in the statistical rule configuration information to obtain a statistical result, and outputs the statistical result.

According to a third aspect of the embodiments of this specification, there is provided a data processing method based on distributed stream computing, where a data processing system includes a data acquisition node, a feature extraction node, and a feature statistics node, and the method includes:

after loading, through an installed data acquisition program, data source configuration information specified by a service party, the data acquisition node acquires, through the data acquisition program, a service data stream from a data source recorded in the data source configuration information;

transferring the service data stream to the feature extraction node, so that the feature extraction node, after loading, through an installed feature extraction program, extraction rule configuration information specified by the service party, extracts, through the feature extraction program and for each piece of service data in the service data stream in turn, feature information from the service data according to the extraction rule recorded in the extraction rule configuration information and transfers the resulting feature information stream to the feature statistics node, and so that the feature statistics node, after loading, through an installed feature statistics program, statistical rule configuration information specified by the service party, performs, through the feature statistics program, statistics on the feature information in the feature information stream according to the statistical rule recorded in the statistical rule configuration information to obtain a statistical result, and outputs the statistical result.

According to a fourth aspect of the embodiments of this specification, there is provided a data processing method based on distributed stream computing, where a data processing system includes a data acquisition node, a feature extraction node, and a feature statistics node, and the method includes:

after loading, through an installed feature extraction program, extraction rule configuration information specified by a service party, the feature extraction node extracts, through the feature extraction program and for each piece of service data in a service data stream in turn, feature information from the service data according to the extraction rule recorded in the extraction rule configuration information;

transferring the resulting feature information stream to the feature statistics node, so that the feature statistics node, after loading, through an installed feature statistics program, statistical rule configuration information specified by the service party, performs, through the feature statistics program, statistics on the feature information in the feature information stream according to the statistical rule recorded in the statistical rule configuration information to obtain a statistical result, and outputs the statistical result;

wherein the service data stream is acquired by the data acquisition node, which, after loading, through an installed data acquisition program, data source configuration information specified by the service party, acquires the service data stream through the data acquisition program from the data source recorded in the data source configuration information and transfers it to the feature extraction node.

According to a fifth aspect of the embodiments of this specification, there is provided a data processing method based on distributed stream computing, where a data processing system includes a data acquisition node, a feature extraction node, and a feature statistics node, and the method includes:

after loading, through an installed feature statistics program, statistical rule configuration information specified by a service party, the feature statistics node performs, through the feature statistics program, statistics on the feature information in a feature information stream according to the statistical rule recorded in the statistical rule configuration information to obtain a statistical result;

outputting the statistical result;

wherein the feature information stream is obtained by the feature extraction node, which, after loading, through an installed feature extraction program, extraction rule configuration information specified by the service party, extracts, through the feature extraction program and for each piece of service data in a service data stream in turn, feature information from the service data according to the extraction rule recorded in the extraction rule configuration information and transfers the feature information to the feature statistics node; and the service data stream is acquired by the data acquisition node, which, after loading, through an installed data acquisition program, data source configuration information specified by the service party, acquires the service data stream through the data acquisition program from the data source recorded in the data source configuration information and transfers it to the feature extraction node.

According to a sixth aspect of the embodiments of this specification, there is provided a data processing apparatus based on distributed stream computing, a data processing system including the apparatus, a feature extraction node, and a feature statistics node, the apparatus including:

an acquisition module, which, after loading, through an installed data acquisition program, data source configuration information specified by a service party, acquires, through the data acquisition program, a service data stream from a data source recorded in the data source configuration information;

a transfer module, which transfers the service data stream to the feature extraction node, so that the feature extraction node, after loading, through an installed feature extraction program, extraction rule configuration information specified by the service party, extracts, through the feature extraction program and for each piece of service data in the service data stream in turn, feature information from the service data according to the extraction rule recorded in the extraction rule configuration information and transfers the resulting feature information stream to the feature statistics node, and so that the feature statistics node, after loading, through an installed feature statistics program, statistical rule configuration information specified by the service party, performs, through the feature statistics program, statistics on the feature information in the feature information stream according to the statistical rule recorded in the statistical rule configuration information to obtain a statistical result, and outputs the statistical result.

According to a seventh aspect of the embodiments of this specification, there is provided a data processing apparatus based on distributed stream computing, a data processing system including the apparatus, a feature extraction node, and a feature statistics node, the apparatus including:

According to an eighth aspect of the embodiments of this specification, there is provided a data processing apparatus based on distributed stream computing, a data processing system including a data acquisition node, a feature extraction node, and the apparatus, the apparatus including:

a statistics module, which, after loading, through an installed feature statistics program, statistical rule configuration information specified by a service party, performs, through the feature statistics program, statistics on the feature information in a feature information stream according to the statistical rule recorded in the statistical rule configuration information to obtain a statistical result;

an output module, which outputs the statistical result;

wherein the feature information stream is obtained by the feature extraction node, which, after loading, through an installed feature extraction program, extraction rule configuration information specified by the service party, extracts, through the feature extraction program and for each piece of service data in a service data stream in turn, feature information from the service data according to the extraction rule recorded in the extraction rule configuration information and transfers the feature information to the feature statistics node; and the service data stream is acquired by the data acquisition node, which, after loading, through an installed data acquisition program, data source configuration information specified by the service party, acquires the service data stream through the data acquisition program from the data source recorded in the data source configuration information and transfers it to the feature extraction node.

In the technical solution provided by the embodiments of this specification, the distributed stream computing statistics process for data, across a variety of service scenarios, is abstracted into three general stages: a data acquisition stage, a feature extraction stage, and a feature statistics stage. A corresponding processing program (a data acquisition program, a feature extraction program, or a feature statistics program) is developed for each stage, and program deployment is completed. Before performing its processing function (data acquisition, feature extraction, or feature statistics), the processing program for each stage must load configuration information specified by the service party (data source configuration information, extraction rule configuration information, or statistical rule configuration information, respectively). By specifying this configuration information, the service party configures the parameters of the processing programs that relate to the service scenario. Through the embodiments of this specification, the data statistics framework composed of the data acquisition program, the feature extraction program, and the feature statistics program can be used across service scenarios, and no scenario-specific processing program needs to be written for each service scenario, which saves a great deal of work.

It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of embodiments of the invention.

In addition, any one of the embodiments in the present specification is not required to achieve all of the effects described above.

Drawings

To illustrate the embodiments of this specification or the technical solutions in the prior art more clearly, the drawings used in describing the embodiments or the prior art are briefly introduced below. The drawings described below show only some of the embodiments of this specification, and those skilled in the art can derive other drawings from them.

FIG. 1 is a block diagram of a data processing system based on distributed stream computing according to an embodiment of the present disclosure;

FIG. 2 is a block diagram of another data processing system based on distributed stream computing according to an embodiment of the present disclosure;

FIG. 3 is a flowchart of a data processing method based on distributed stream computing according to an embodiment of the present specification;

FIG. 4 is a schematic structural diagram of a data processing apparatus based on distributed stream computing according to an embodiment of the present specification;

FIG. 5 is a schematic structural diagram of another data processing apparatus based on distributed stream computing according to an embodiment of the present disclosure;

FIG. 6 is a schematic structural diagram of another data processing apparatus based on distributed stream computing according to an embodiment of the present disclosure;

FIG. 7 is a schematic diagram of a device for configuring the method of the embodiments of the present specification.

Detailed Description

In practice, service scenarios are diverse: different service scenarios generate different types of service data, and different types of information must be computed from them. As a result, when deploying a program on the distributed system for the current service scenario, a service party (the entity responsible for analyzing and processing service data, which may specifically be a technician who manages and maintains the distributed system) has to write a processing program adapted to that service scenario.

For example, suppose service scenario A is to count, from the user behavior logs recorded by an e-commerce platform, the number of clicks by each user on each recommended item per day, and service scenario B is to count, from the user publication records recorded by a public speaking platform, the number of times each user mentions a specified keyword per month. When facing service scenario A, the service party needs to write a processing program that counts the number of clicks by each user on each recommended item from the user behavior logs recorded by the e-commerce platform. When facing service scenario B, the service party needs to write a further processing program that counts the number of times each user mentions the specified keyword from the user publication records recorded by the public speaking platform.

Clearly, when there are many service scenarios, the workload of the service party is huge.

The applicant therefore analyzed and summarized the distributed stream computing statistics processes for data in various service scenarios and found that they can be abstracted into a general data acquisition stage, feature extraction stage, and feature statistics stage; that is, the data statistics process in any service scenario passes through these three stages. If a processing program implementing a general function is written for each stage, and the service-scenario-related parameters in each program are configured through configuration information specified by the service party, then the service party needs to deploy the programs only once and can reuse the deployed processing programs across service scenarios to meet the statistical requirements for various kinds of information.

Distributed stream computing is briefly explained here. To acquire and process data in real time, that is, to process data promptly as it is generated rather than storing it first and processing it in batches once a certain amount has accumulated, data generated in real time is usually fed continuously into a distributed system in the form of a data stream (also called a data sequence). The data stream flows among the nodes of the distributed system, and the nodes cooperate to process it. This is distributed stream computing.
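The difference from batch processing can be illustrated with a minimal Python sketch. It is not part of the described system: generators stand in for nodes, and the names `source`, `process`, and `trace` are hypothetical. Each record is handed to the next stage as soon as it is produced rather than being accumulated first.

```python
from typing import Iterable, Iterator, List, Tuple

# Records the order in which events happen, for illustration only.
trace: List[Tuple[str, str]] = []


def source() -> Iterator[str]:
    """Hypothetical stand-in for data generated in real time."""
    for record in ["a", "b"]:
        trace.append(("emit", record))
        yield record


def process(stream: Iterable[str]) -> Iterator[str]:
    """Hypothetical downstream stage that handles each record on arrival."""
    for record in stream:
        trace.append(("process", record))
        yield record.upper()


# Pulling from the last stage drives the whole chain one record at a time.
output = list(process(source()))
```

In `trace`, every "emit" entry is immediately followed by the "process" entry for the same record, which is the streaming behavior described above.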

In order to make those skilled in the art better understand the technical solutions in the embodiments of the present specification, the technical solutions in the embodiments of the present specification will be described in detail below with reference to the drawings in the embodiments of the present specification, and it is obvious that the described embodiments are only a part of the embodiments of the present specification, and not all the embodiments. All other embodiments that can be derived by one of ordinary skill in the art from the embodiments given herein are intended to be within the scope of protection.

The technical solutions provided by the embodiments of the present description are described in detail below with reference to the accompanying drawings.

Fig. 1 is a schematic structural diagram of a data processing system based on distributed stream computing according to an embodiment of the present specification, where the data processing system includes a data acquisition node 101, a feature extraction node 102, and a feature statistics node 103.

The data acquisition node 101 is a node responsible for executing a data acquisition function, and a data acquisition program is installed thereon. The feature extraction node 102 is a node responsible for performing a feature extraction function, and a feature extraction program is installed thereon. The feature statistics node 103 is a node responsible for performing a feature statistics function, and a feature statistics program is installed thereon.

It should be noted that, in practice, the data acquisition node 101 generally only needs to install the data acquisition program, the feature extraction node 102 generally only needs to install the feature extraction program, and the feature statistics node 103 generally only needs to install the feature statistics program.

Furthermore, the role of a certain node is not necessarily fixed. Specifically, any node in the data processing system may be installed with the above three programs (i.e., the data acquisition program, the feature extraction program, and the feature statistics program), and when the node selects to run the data acquisition program, it becomes the data acquisition node 101, when the node selects to run the feature extraction program, it becomes the feature extraction node 102, and when the node selects to run the feature statistics program, it becomes the feature statistics node 103.
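As a rough illustration of a node whose role follows from the program it runs, consider the following Python sketch. The `Program` enumeration and the `Node` class are hypothetical names introduced here, not part of the described system.

```python
from enum import Enum
from typing import Iterable, Optional


class Program(Enum):
    DATA_ACQUISITION = "data acquisition program"
    FEATURE_EXTRACTION = "feature extraction program"
    FEATURE_STATISTICS = "feature statistics program"


class Node:
    """Hypothetical node: its role is whichever installed program it runs."""

    def __init__(self, name: str, installed: Iterable[Program]) -> None:
        self.name = name
        self.installed = frozenset(installed)
        self.running: Optional[Program] = None

    def run(self, program: Program) -> None:
        # A node can only take on a role whose program is installed on it.
        if program not in self.installed:
            raise ValueError(f"{program.value} is not installed on {self.name}")
        self.running = program


# A node with all three programs installed can take any role.
flexible = Node("node-1", list(Program))
flexible.run(Program.FEATURE_EXTRACTION)

# A node with only the data acquisition program has a fixed role.
dedicated = Node("node-2", [Program.DATA_ACQUISITION])
```

Here `flexible` acts as a feature extraction node and could later be switched to another role, while `dedicated` can only ever act as a data acquisition node.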

In the data processing system, the data acquisition node 101 is configured to, after loading data source configuration information specified by a service party by an installed data acquisition program, acquire, by the data acquisition program, a service data stream from a data source described in the data source configuration information, and deliver the service data stream to the feature extraction node 102.

The feature extraction node 102 is configured to, after the extraction rule configuration information specified by the service party is loaded through an installed feature extraction program, sequentially extract, for each piece of service data in the service data stream, feature information from the service data according to an extraction rule recorded in the extraction rule configuration information through the feature extraction program, and deliver the obtained feature information stream to the feature statistics node 103.

The feature statistics node 103 is configured to, after loading the statistical rule configuration information specified by the service party through an installed feature statistics program, perform statistics on the feature information in the feature information stream according to the statistical rule recorded in the statistical rule configuration information through the feature statistics program to obtain a statistical result, and output the statistical result.

The specific form of the above-mentioned various configuration information may be a configuration file. In the field of software development, a configuration file for a program typically provides parameters required for the program to operate. In the embodiments of the present specification, various configuration information specified by a service party generally includes relevant parameters (referred to as service scenario relevant parameters herein) in a specific service scenario.

Specifically, the service scenario related parameter that may be recorded in the data source configuration information is a data source identifier. In practice, the sources of the service data to be processed are usually different in different service scenarios. For example, in the service scenario a described above, the data source may be a message system of an e-commerce platform (specifically, Kafka), and in the service scenario B, the data source may be a message system of a public speaking platform.

Only after loading the data source configuration information can the data acquisition program determine where the service data stream to be processed should be acquired from.
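A minimal Python sketch of this behavior follows. The JSON layout, the `data_source` key, the source identifiers, and the in-memory `DATA_SOURCES` registry are all assumptions made for illustration; a real deployment would read from a message system such as Kafka.

```python
import json
from typing import Dict, Iterator, List, Optional

# Hypothetical in-memory stand-ins for message systems (e.g. Kafka topics).
DATA_SOURCES: Dict[str, List[str]] = {
    "ecommerce-behavior-log": ["u1 clicked item9", "u2 clicked item3"],
    "speech-platform-posts": ["u1 posted: hello"],
}


class DataAcquisitionProgram:
    """Generic program; the data source is known only after configuration."""

    def __init__(self) -> None:
        self.source_id: Optional[str] = None

    def load_config(self, config_text: str) -> None:
        # The data source identifier is the scenario-specific parameter.
        self.source_id = json.loads(config_text)["data_source"]

    def acquire(self) -> Iterator[str]:
        if self.source_id is None:
            raise RuntimeError("data source configuration has not been loaded")
        return iter(DATA_SOURCES[self.source_id])


program = DataAcquisitionProgram()
program.load_config('{"data_source": "ecommerce-behavior-log"}')
stream_a = list(program.acquire())
```

Loading a different configuration, for example one naming `speech-platform-posts`, points the same program at the data source of another service scenario without any code change.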

The service scenario-related parameters that can be described in the extraction rule configuration information are feature extraction rules for how to extract feature information from service data. In practice, feature information that needs to be extracted from business data is often different in different business scenarios. For example, in the service scenario a described above, the feature information to be extracted from the user behavior log generally includes a user identifier, a recommended article identifier, time information, and the like. In the service scenario B, the feature information to be extracted from the user publication record generally includes a user identifier, a specified keyword, time information, and the like.

Only after loading the extraction rule configuration information can the feature extraction program determine which information in the service data should be extracted as feature information.
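One way to picture an extraction rule is as a mapping from feature names to fields of the raw service data. The Python sketch below assumes that representation; the rule layout, the field names (`uid`, `item`, `ts`, `author`, `kw`, `posted_at`), and the function `make_extractor` are hypothetical.

```python
from typing import Any, Callable, Dict

Record = Dict[str, Any]


def make_extractor(rule_config: Dict[str, Dict[str, str]]) -> Callable[[Record], Record]:
    """Build an extraction function from extraction rule configuration."""
    fields = rule_config["fields"]  # feature name -> field name in the raw record

    def extract(record: Record) -> Record:
        # Keep only the configured fields, renamed to their feature names.
        return {feature: record[raw_field] for feature, raw_field in fields.items()}

    return extract


# Hypothetical rules for service scenario A (user behavior logs).
SCENARIO_A_RULES = {"fields": {"user_id": "uid", "item_id": "item", "time": "ts"}}
# Hypothetical rules for service scenario B (user publication records).
SCENARIO_B_RULES = {"fields": {"user_id": "author", "keyword": "kw", "time": "posted_at"}}

log_entry = {"uid": "u1", "item": "item9", "ts": "2018-08-23T10:00:00", "ua": "Mozilla"}
features_a = make_extractor(SCENARIO_A_RULES)(log_entry)
```

The same `make_extractor` code serves both scenarios; only the rule configuration differs.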

The service-scenario-related parameters that can be recorded in the statistical rule configuration information are statistical rules specifying how to perform statistics on the feature information in the feature information stream. In practice, different service scenarios usually require the feature information in the feature information stream to be counted (or aggregated) in different ways. For example, service scenario A requires the number of clicks by each user on each recommended item per day, so the feature information in the feature information stream that was generated on the same day is aggregated into one statistical result. Service scenario B requires the number of times each user mentions a specified keyword per month, so the feature information generated in the same month is aggregated into one statistical result.

Only after loading the statistical rule configuration information can the feature statistics program determine which feature information in the feature information stream should be aggregated into a statistical result.
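The per-day and per-month aggregation can be sketched in Python as follows. This is only an illustration under stated assumptions: the rule layout (`window`, `group_by`), the ISO 8601 `time` strings, and the function `count_features` are hypothetical, and the time window is derived simply by truncating the timestamp.

```python
from collections import Counter
from typing import Any, Dict, Iterable, Tuple

# Number of leading characters of an ISO 8601 timestamp that identify the window:
# "2018-08-23" for a day, "2018-08" for a month.
WINDOW_PREFIX = {"day": 10, "month": 7}


def count_features(
    feature_stream: Iterable[Dict[str, Any]], rule_config: Dict[str, Any]
) -> Dict[Tuple[str, ...], int]:
    """Count feature records per time window and per configured grouping keys."""
    prefix = WINDOW_PREFIX[rule_config["window"]]
    group_by = rule_config["group_by"]
    counts: Counter = Counter()
    for feature in feature_stream:
        window = feature["time"][:prefix]
        counts[(window, *[feature[key] for key in group_by])] += 1
    return dict(counts)


features = [
    {"user_id": "u1", "item_id": "item9", "time": "2018-08-23T10:00:00"},
    {"user_id": "u1", "item_id": "item9", "time": "2018-08-23T18:30:00"},
    {"user_id": "u1", "item_id": "item9", "time": "2018-08-24T09:15:00"},
]

daily = count_features(features, {"window": "day", "group_by": ["user_id", "item_id"]})
monthly = count_features(features, {"window": "month", "group_by": ["user_id", "item_id"]})
```

With the daily rule, the two clicks on 2018-08-23 are aggregated separately from the click on 2018-08-24; with the monthly rule, all three fall into the same result.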

In addition, the data processing system shown in FIG. 1 may also include a statistical result storage node 104. Specifically, the feature statistics node 103 may output the statistical result to the statistical result storage node 104. After loading, through an installed statistical result storage program, storage format configuration information specified by the service party, the statistical result storage node 104 may store the statistical result, through the statistical result storage program, in the data storage format recorded in the storage format configuration information.

The service-scenario-related parameter recorded in the storage format configuration information is data storage format information, which indicates the data storage format into which the statistical result storage program should process the statistical result for storage. Optional data storage formats include, but are not limited to, HBase, Kafka, and MySQL. The statistical result storage node 104 may store the statistical result in a local database or a non-local database (e.g., a distributed database or a cloud database).
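A storage program of this kind might select its output format from configuration roughly as in the Python sketch below. No real HBase, Kafka, or MySQL client is used: the formatter functions, the `format` key, and the `store` function are hypothetical and only shape the data as tuples or JSON strings.

```python
import json
from typing import Any, Callable, Dict, List, Tuple

Results = Dict[Tuple[str, ...], int]


def as_mysql_rows(results: Results) -> List[Tuple[Any, ...]]:
    """Relational style: one row per key, with the count as the last column."""
    return [(*key, count) for key, count in sorted(results.items())]


def as_kafka_messages(results: Results) -> List[str]:
    """Message style: one JSON string per key."""
    return [
        json.dumps({"key": list(key), "count": count})
        for key, count in sorted(results.items())
    ]


# Hypothetical registry mapping a configured format name to a formatter.
FORMATTERS: Dict[str, Callable[[Results], List[Any]]] = {
    "mysql": as_mysql_rows,
    "kafka": as_kafka_messages,
}


def store(results: Results, storage_config: Dict[str, str]) -> List[Any]:
    """Convert the statistical result into the configured storage format."""
    return FORMATTERS[storage_config["format"]](results)


results = {("u1", "item9"): 2}
rows = store(results, {"format": "mysql"})
messages = store(results, {"format": "kafka"})
```

Supporting a further storage format would then mean registering one more formatter, while the service party continues to choose the format through configuration alone.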

It should be noted that the role of the statistics storage node 104 may not be fixed. Specifically, any node in the data processing system may be installed with the statistics storage program, and when the node selects to run the statistics storage program, the node becomes the statistics storage node 104.

It is worth emphasizing that when the nodes 101 to 104 are all installed with a data acquisition program, a feature extraction program, a feature statistics program, and a statistics result storage program, the data source configuration information, the extraction rule configuration information, the statistics rule configuration information, and the storage format configuration information may be included in the same configuration file, and the service side may upload the configuration file to each node. When the data acquisition node 101 only has a data acquisition program, the feature extraction node 102 only has a feature extraction program, the feature statistics node 103 only has a feature statistics program, and the statistics storage node 104 only has a statistics storage program, the data source configuration information, the extraction rule configuration information, the statistics rule configuration information, and the storage format configuration information may be different configuration files, respectively, and the service side may upload the data source configuration file to the data acquisition node 101, upload the extraction rule configuration file to the feature extraction node 102, upload the statistics rule configuration file to the feature statistics node 103, and upload the storage format configuration file to the statistics storage node 104.
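The two ways of distributing configuration can be sketched in Python as follows. The JSON layout, the section names, and the functions `section_for` and `split_config` are assumptions made for illustration only.

```python
import json
from typing import Any, Dict

# Hypothetical combined configuration file containing all four sections.
COMBINED_CONFIG = json.dumps({
    "data_source": {"id": "ecommerce-behavior-log"},
    "extraction_rules": {"fields": {"user_id": "uid"}},
    "statistical_rules": {"window": "day", "group_by": ["user_id"]},
    "storage_format": {"format": "mysql"},
})

# Which section each program reads.
SECTION_FOR_PROGRAM = {
    "data_acquisition": "data_source",
    "feature_extraction": "extraction_rules",
    "feature_statistics": "statistical_rules",
    "result_storage": "storage_format",
}


def section_for(program: str, config_text: str) -> Dict[str, Any]:
    """Case 1: every node receives the same file and reads only its own section."""
    return json.loads(config_text)[SECTION_FOR_PROGRAM[program]]


def split_config(config_text: str) -> Dict[str, str]:
    """Case 2: produce one separate configuration file body per dedicated node."""
    combined = json.loads(config_text)
    return {
        program: json.dumps(combined[section])
        for program, section in SECTION_FOR_PROGRAM.items()
    }


statistics_section = section_for("feature_statistics", COMBINED_CONFIG)
per_node_files = split_config(COMBINED_CONFIG)
```

Either way each program ends up with the same parameters; the choice only affects how many files the service party uploads and to which nodes.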

In summary, when a business party deploys a program, the business party may be decoupled from a specific business scenario, and a reusable data statistics framework, that is, a software architecture composed of a data acquisition program, a feature extraction program, and a feature statistics program (which may also include a statistics result storage program) is directly deployed on each node of the distributed system. When a specific service scenario is faced, the service party can upload configuration information adapted to the service scenario, and configure service scenario-related parameters of the various programs so as to trigger actual operation of the software architecture. The above-described software architecture is actually the topology formed by the flow of data streams.

The software architecture is based on a hardware architecture of a distributed system during actual operation, and the software architecture and the hardware architecture jointly form the data processing system. The data processing system at least has nodes with three roles, namely a data acquisition node 101, a feature extraction node 102 and a feature statistics node 103, and can further include a statistics result storage node 104. The nodes differ mainly in the functions implemented by the programs they run. The data acquisition node 101 runs a data acquisition program in the software architecture, the feature extraction node 102 runs a feature extraction program in the software architecture, the feature statistics node 103 runs a feature statistics program in the software architecture, and the statistics storage node 104 runs a statistics storage program in the software architecture.

Data flows in from the data acquisition node 101 in the form of a service data stream (also called a service data sequence) and enters the feature extraction node 102. It then flows out of the feature extraction node 102 in the form of a feature information stream (also called a feature information sequence) and enters the feature statistics node 103, which outputs the statistical result to the statistics result storage node 104 for storage.

It should be noted that, in the data processing system, the number of the data acquisition nodes 101, and/or the number of the feature extraction nodes 102, and/or the number of the feature statistics nodes 103, and/or the number of the statistics storage nodes 104 may be more than one. In this case, the data stream may be split during the transfer between the nodes. For example, if the number of the feature extraction nodes 102 is 3 and the number of the other nodes is 1, the data acquisition node 101 divides the service data stream into segments, and respectively transmits the segments to the three feature extraction nodes 102 for processing. The three feature extraction nodes 102 then pass the resulting feature information streams to the feature statistics node 103, as shown in fig. 2. It should be noted that fig. 2 does not show the loading process of the configuration information, but actually, each node in fig. 2 loads the corresponding configuration information through the installed program.
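The splitting of a data stream across several downstream nodes can be sketched as follows. The round-robin policy used here is only an assumption for illustration, since the specification does not state how the segments are chosen, and the function name `split_stream` is hypothetical.

```python
from typing import Iterable, List


def split_stream(records: Iterable[dict], num_nodes: int) -> List[List[dict]]:
    """Distribute records over num_nodes downstream nodes in round-robin order."""
    if num_nodes < 1:
        raise ValueError("num_nodes must be at least 1")
    segments: List[List[dict]] = [[] for _ in range(num_nodes)]
    for index, record in enumerate(records):
        segments[index % num_nodes].append(record)
    return segments


# One data acquisition node feeding three feature extraction nodes.
service_data_stream = [{"seq": i} for i in range(7)]
segments = split_stream(service_data_stream, num_nodes=3)
```

A real deployment might instead partition by a key such as the user identifier, so that all records of one user reach the same node.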

Fig. 3 is a flowchart of a data processing method based on distributed stream computing according to an embodiment of the present specification, where the method includes the following steps:

S300: After loading the data source configuration information specified by a service party through an installed data acquisition program, the data acquisition node acquires, through the data acquisition program, a service data stream from the data source recorded in the data source configuration information.

S302: The data acquisition node transmits the service data stream to the feature extraction node.

S304: After loading the extraction rule configuration information specified by the service party through an installed feature extraction program, the feature extraction node, for each piece of service data in the service data stream in sequence, extracts feature information from the service data through the feature extraction program according to the extraction rule recorded in the extraction rule configuration information.

S306: The feature extraction node transmits the obtained feature information stream to the feature statistics node.

S308: After loading the statistical rule configuration information specified by the service party through an installed feature statistics program, the feature statistics node performs statistics on the feature information in the feature information stream through the feature statistics program according to the statistical rule recorded in the statistical rule configuration information, obtains a statistical result, and outputs the statistical result.

The method shown in fig. 3 is based on the data processing system shown in fig. 1. The present embodiment describes the data processing procedure of the data processing system in detail.
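Steps S300 to S308 can be sketched as a chain of configuration-driven stages. This is a single-process illustration under stated assumptions, not the distributed implementation: the configuration keys, record fields, and function names are all hypothetical, and the transfers in S302 and S306 are modeled simply as passing one generator to the next stage.

```python
from collections import Counter
from typing import Dict, Iterable, Iterator

# Hypothetical configuration supplied by the service party.
config = {
    "data_source": {"records": [
        {"user": "u1", "item": "a"},
        {"user": "u1", "item": "b"},
        {"user": "u2", "item": "a"},
        {"user": "u1", "item": "a"},
    ]},
    "extraction_rule": {"group_field": "user", "content_field": "item"},
    "statistics_rule": {"group_id": "u1"},
}


def acquire(source_cfg: Dict) -> Iterator[Dict]:
    """S300: read the service data stream from the configured data source."""
    yield from source_cfg["records"]


def extract(stream: Iterable[Dict], rule: Dict) -> Iterator[Dict]:
    """S304: turn each piece of service data into one piece of feature information."""
    for record in stream:
        yield {"group": record[rule["group_field"]],
               "key": record[rule["content_field"]]}


def count(features: Iterable[Dict], rule: Dict) -> Dict[str, int]:
    """S308: count feature keys for the configured group."""
    totals: Counter = Counter()
    for feature in features:
        if feature["group"] == rule["group_id"]:
            totals[feature["key"]] += 1
    return dict(totals)


# S302 and S306 (transfer between nodes) are modeled as passing generators.
result = count(
    extract(acquire(config["data_source"]), config["extraction_rule"]),
    config["statistics_rule"],
)
```

Changing only the `config` dictionary adapts the same three functions to a different scenario, which mirrors the decoupling of programs from business scenarios described above.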

In the embodiments of the present specification, a general data structure may be defined in advance. The general data structure lets the distributed stream computing framework process data in a uniform data structure across different service scenarios, thereby improving processing efficiency.

In this way, in step S304, the feature extraction program may sequentially extract information from each service data in the service data stream according to the extraction rule described in the extraction rule configuration information, organize the information into the general data structure, and then use the information organized into the general data structure as feature information.

Table 1 below provides an optional general data structure. Of course, those skilled in the art may define other general data structures.

TABLE 1

The general data structure shown in Table 1 includes a group identifier field, a remark field, and at least one key-value (key-value pair) field.
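One possible in-memory representation of such a structure is sketched below. The class name `FeatureInfo`, the attribute names, and the example values are hypothetical; only the three kinds of fields come from the description.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class FeatureInfo:
    """A record with a group identifier, a remark, and one or more key-values."""
    group_id: str
    remark: str
    key_values: Dict[str, int] = field(default_factory=dict)

    def __post_init__(self) -> None:
        # The structure requires at least one key-value field.
        if not self.key_values:
            raise ValueError("at least one key-value is required")


example = FeatureInfo(
    group_id="zhangsan",
    remark="20180714 18:00:00",
    key_values={"item_1": 1},
)
```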

Based on the general data structure shown in Table 1, in step S308, the feature statistics program may screen out, from the feature information stream, the feature information whose group identifier field value is the specified group identifier, and determine a plurality of pieces of spare feature information according to the screened feature information. The spare feature information whose remark field value satisfies the statistical condition is then determined as the target feature information.

Then, an aggregation operation may be performed on the pieces of target feature information. Specifically, for each key-value contained in the target feature information, the values of that key-value across the pieces of target feature information are added, the resulting sum and the key of the key-value are combined into one integrated key-value, and the statistical result is determined according to the integrated key-values.

The specified group identifier and the statistical condition are both specified by the statistical rule recorded in the statistical rule configuration information. The statistical result has the general data structure: the value of its group identifier field is the specified group identifier, the value of its remark field is the statistical condition, and its key-value fields correspond one-to-one to the integrated key-values. Generally, each key-value of the statistical result is one of the integrated key-values.
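The screening and aggregation just described can be sketched as follows. Feature information is modeled as plain dictionaries with hypothetical field names, the statistical condition is passed in as a predicate over the remark field, and the sample values are invented for illustration. The screened feature information is used directly as the spare feature information.

```python
from collections import Counter
from typing import Callable, Dict, Iterable, List


def compute_statistics(stream: Iterable[dict], group_id: str,
                       condition_label: str,
                       condition: Callable[[str], bool]) -> dict:
    """Screen by group identifier, filter by remark, then sum values per key."""
    spare: List[dict] = [f for f in stream if f["group_id"] == group_id]
    target = [f for f in spare if condition(f["remark"])]
    totals: Counter = Counter()
    for feature in target:
        for key, value in feature["key_values"].items():
            totals[key] += value  # each sum with its key is one integrated key-value
    # The statistical result reuses the same general data structure.
    return {"group_id": group_id, "remark": condition_label,
            "key_values": dict(totals)}


feature_stream = [
    {"group_id": "u1", "remark": "20180714 18:00:00",
     "key_values": {"item_1": 1, "item_2": 1, "item_5": 1}},
    {"group_id": "u1", "remark": "20180713 09:00:00",
     "key_values": {"item_1": 1}},
    {"group_id": "u1", "remark": "20180714 20:30:00",
     "key_values": {"item_2": 1, "item_3": 1}},
    {"group_id": "u2", "remark": "20180714 18:05:00",
     "key_values": {"item_1": 1}},
]

statistics = compute_statistics(
    feature_stream, "u1", "20180714",
    condition=lambda remark: remark.startswith("20180714"),
)
```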

More specifically, when the service data stream is a user behavior log queue, in step S304, the feature extraction program may, according to the extraction rule recorded in the extraction rule configuration information: write the user identifier contained in the user behavior log into the group identifier field of the general data structure; write the time information contained in the user behavior log into the remark field; and, for each of a plurality of behavior content fields extracted from the user behavior log, form a key-value with the behavior content field as the key and a preset value as the value, and write the formed key-value into a key-value field of the general data structure.

The behavior content field is specified in the extraction rule. In service scenario A described above, the behavior content field is generally a field containing a commodity identifier, and the commodity identifier appearing in the user behavior log is the identifier of a recommended commodity clicked by the user. The preset value is also specified by the extraction rule. In service scenario A, the preset value may be 1, indicating that the user clicked the commodity once.
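A minimal sketch of this extraction step is given below. The log layout (`user_id`, `time`, `clicked_items`) and the rule keys are assumptions made for illustration; the specification only states which kinds of information are written into which fields.

```python
from typing import Dict

# Hypothetical extraction rule loaded from the extraction rule configuration.
extraction_rule = {
    "user_field": "user_id",           # written to the group identifier field
    "time_field": "time",              # written to the remark field
    "content_field": "clicked_items",  # behavior content (commodity identifiers)
    "preset_value": 1,                 # one click per occurrence
}


def extract_feature(log: Dict, rule: Dict) -> Dict:
    """Organize one user behavior log into the general data structure."""
    key_values: Dict[str, int] = {}
    for item in log[rule["content_field"]]:
        # Accumulate so that a repeated identifier in one log is not lost.
        key_values[item] = key_values.get(item, 0) + rule["preset_value"]
    return {
        "group_id": log[rule["user_field"]],
        "remark": log[rule["time_field"]],
        "key_values": key_values,
    }


behavior_log = {
    "user_id": "zhangsan",
    "time": "20180714 18:00:00",
    "clicked_items": ["item_1", "item_2", "item_5"],
}
feature = extract_feature(behavior_log, extraction_rule)
```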

For example, in service scenario A, Table 2 below shows one piece of feature information.

TABLE 2

The feature information shown in Table 2 is extracted from a user behavior log generated by the user Zhang San performing a round of operations at 18:00 on July 14, 2018. Zhang San clicked recommended commodities 1, 2, and 5 once each.

Assuming that each piece of feature information in the feature information stream received by the feature statistics program has the data structure shown in Table 2, in step S308, if the specified group identifier is "zhangsan", the feature statistics program may screen out the feature information whose group identifier is "zhangsan" from the feature information stream (a sequence of feature information of the form shown in Table 2).

The feature statistics program then determines a plurality of pieces of spare feature information according to the screened feature information of Zhang San. It should be noted that the feature statistics program may directly use the screened feature information as the spare feature information.

In this way, the feature statistics program may screen out several pieces of feature information similar to that shown in Table 2. Assume that there are three pieces of spare feature information, shown in Tables 3 to 5, respectively.

TABLE 3

TABLE 4

TABLE 5

In service scenario A, determining the spare feature information whose remark field value satisfies the statistical condition as the target feature information may specifically be: determining the spare feature information whose time information in the remark field falls within a specified time period as the target feature information.

Assuming that the specified time period is 20180714 00:00:00 to 20180714 24:00:00, the time information of the feature information shown in Table 4 does not fall within the specified time period. Therefore, the feature information shown in Table 3 and the feature information shown in Table 5 are determined as the target feature information.
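The time-period check can be sketched as follows. The remark layout `YYYYMMDD HH:MM:SS` and the half-open interval (start included, end excluded) are assumptions, and the three remark values are invented, since the contents of Tables 3 to 5 are not reproduced here. Note that 24:00:00 on 20180714 is the same instant as 00:00:00 on the following day.

```python
from datetime import datetime, timedelta
from typing import List

REMARK_FORMAT = "%Y%m%d %H:%M:%S"  # assumed layout of the remark field

window_start = datetime(2018, 7, 14)
window_end = window_start + timedelta(days=1)  # "24:00:00" of the same day


def in_window(remark: str) -> bool:
    """Return True if the remark's time falls within [window_start, window_end)."""
    timestamp = datetime.strptime(remark, REMARK_FORMAT)
    return window_start <= timestamp < window_end


# Hypothetical remark values of three pieces of spare feature information.
spare_remarks: List[str] = [
    "20180714 18:00:00",
    "20180713 21:15:00",
    "20180714 23:59:59",
]
target_remarks = [remark for remark in spare_remarks if in_window(remark)]
```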

Next, the feature statistical program performs an aggregation operation of the feature information shown in table 3 and the feature information shown in table 5, and can obtain the statistical results shown in table 6.

TABLE 6

As can be seen, in service scenario A, a statistical result actually represents the number of times that Zhang San clicked each recommended commodity within the specified time period (20180714 00:00:00 to 20180714 24:00:00).

In step S308, the feature statistics program may determine the plurality of pieces of spare feature information according to the screened feature information as follows. In each cache cycle (for example, one minute), the feature statistics program performs statistics on the feature information screened into the cache to obtain a first intermediate result corresponding to that cache cycle, where the cache cycle is shorter than the time interval corresponding to the specified time period and the first intermediate result has the general data structure. The first intermediate result corresponding to the cache cycle is then written into a database, so that the spare feature information can be determined. Here, each first intermediate result may be used directly as one piece of spare feature information.

Further, instead of using each first intermediate result as one piece of spare feature information, the first intermediate results written into the database in each write cycle may be aggregated into a second intermediate result, which is used as one piece of spare feature information. The write cycle is longer than the cache cycle and shorter than the time interval corresponding to the specified time period, and the second intermediate result has the general data structure.

It should be noted that, in actual operation, the feature information stream flows continuously into the feature statistics node, while the cache space of the feature statistics node is limited. Assuming that, in service scenario A, the number of times that the user Zhang San clicked each recommended commodity within the last 24 hours needs to be counted, the feature statistics node would have to accumulate in the cache all feature information corresponding to Zhang San from the last 24 hours, which is impractical when the data volume is large.

For this reason, in the embodiments of the present specification, the feature statistics program may use 1 minute as the cache cycle, aggregate the screened feature information corresponding to Zhang San once every minute to obtain a first intermediate result, and then promptly store the first intermediate result in the database (which may be the column-oriented database HBase). The feature statistics program may then use 1 hour as the write cycle and re-aggregate the first intermediate results written into the database within each hour into a second intermediate result, which is used as one piece of spare feature information.
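The two-level aggregation can be sketched as repeated roll-ups over progressively longer periods. In this illustration each item is a pair of a period start time and a key-value dictionary for a single user; the function names and sample values are hypothetical, and the database write between the two levels is omitted.

```python
from collections import Counter, defaultdict
from datetime import datetime
from typing import Callable, Dict, Iterable, List, Tuple

Item = Tuple[datetime, Dict[str, int]]  # (period start, key-values)


def roll_up(items: Iterable[Item],
            truncate: Callable[[datetime], datetime]) -> List[Item]:
    """Sum the key-values of all items whose time falls in the same period."""
    buckets: Dict[datetime, Counter] = defaultdict(Counter)
    for timestamp, key_values in items:
        buckets[truncate(timestamp)].update(key_values)
    return [(start, dict(buckets[start])) for start in sorted(buckets)]


def to_minute(ts: datetime) -> datetime:  # cache cycle: 1 minute
    return ts.replace(second=0, microsecond=0)


def to_hour(ts: datetime) -> datetime:  # write cycle: 1 hour
    return ts.replace(minute=0, second=0, microsecond=0)


def to_day(ts: datetime) -> datetime:  # specified time period: 1 day
    return ts.replace(hour=0, minute=0, second=0, microsecond=0)


screened = [
    (datetime(2018, 7, 14, 18, 0, 5), {"item_1": 1}),
    (datetime(2018, 7, 14, 18, 0, 40), {"item_1": 1, "item_2": 1}),
    (datetime(2018, 7, 14, 18, 1, 10), {"item_2": 1}),
    (datetime(2018, 7, 14, 19, 30, 0), {"item_5": 1}),
]

first_results = roll_up(screened, to_minute)      # first intermediate results
second_results = roll_up(first_results, to_hour)  # second intermediate results
final_result = roll_up(second_results, to_day)    # statistical result
```

Because summation is associative, aggregating in stages yields the same totals as aggregating all feature information at once, while only one cache cycle of data needs to be held in memory at a time.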

For example, suppose the feature information corresponding to Zhang San screened by the feature statistics program within 1 minute is as shown in Tables 3 and 5 (only a few pieces are shown for convenience of description; in actual operation, more feature information may be screened every minute). The feature information shown in Tables 3 and 5 may then be aggregated to obtain a first intermediate result, as shown in Table 7.

TABLE 7

As shown in table 7, the value of the remark field is a time period of 1 minute in length.

The characteristic statistical program writes the first intermediate result (denoted as s) obtained every minute into the HBase data table shown in table 8.

User identification	……	18:00:00~19:00:00	19:00:00~20:00:00	……
Zhang San	……	s1, s2, ……, s60	s61, s62, ……, s120	……

TABLE 8

In Table 8, the date (20180714) is omitted.

As shown in Table 8, the time interval corresponding to one column in the HBase data table is 1 hour, i.e., the above-mentioned write cycle. One write cycle corresponds to 60 consecutive cache cycles, so one column in the HBase data table can hold 60 first intermediate results. The feature statistics program may then aggregate the 60 first intermediate results of each column into a second intermediate result, which is used as one piece of spare feature information. Assume that a second intermediate result (spare feature information) aggregated from 60 first intermediate results is as shown in Table 9.

TABLE 9

The spare feature information shown in Table 9 indicates the number of times that the user clicked each recommended commodity within one write cycle. Subsequently, all spare feature information whose time information falls within the specified time period (20180714 00:00:00 to 20180714 24:00:00), namely 12 pieces of spare feature information similar to Table 9, can be aggregated into the final statistical result, assumed to be as shown in Table 10.

TABLE 10

In addition, in the embodiment of the specification, the hot loading of the configuration information by the program can also be realized. The hot loading means that the program can load configuration information in a running state without restarting, and update related parameters of a service scene. The specific scheme is as follows:

First aspect

After the data acquisition node loads data source configuration information specified by a service party through an installed data acquisition program, the program identifier of the data acquisition program is used as a key, the data source configuration information is used as a value, and the key-value is established and stored in a KV database.

When detecting that the data source configuration information corresponding to the program identifier of the data acquisition program in the KV database has been modified, the data acquisition node acquires the modified data source configuration information from the KV database and reloads it through the data acquisition program.

Second aspect

After loading the extraction rule configuration information specified by the service party through the installed feature extraction program, the feature extraction node establishes a key-value with the program identifier of the feature extraction program as the key and the extraction rule configuration information as the value, and stores the key-value in a KV database.

When detecting that the extraction rule configuration information corresponding to the program identifier of the feature extraction program in the KV database has been modified, the feature extraction node acquires the modified extraction rule configuration information from the KV database and reloads it through the feature extraction program.

Third aspect

After loading the statistical rule configuration information specified by the service party through the installed feature statistics program, the feature statistics node establishes a key-value with the program identifier of the feature statistics program as the key and the statistical rule configuration information as the value, and stores the key-value in a KV database.

When detecting that the statistical rule configuration information corresponding to the program identifier of the feature statistics program in the KV database has been modified, the feature statistics node acquires the modified statistical rule configuration information from the KV database and reloads it through the feature statistics program.

Fourth aspect

After loading the storage format configuration information specified by the service party through the installed statistical result storage program, the statistical result storage node establishes a key-value with the program identifier of the statistical result storage program as the key and the storage format configuration information as the value, and stores the key-value in a KV database.

When detecting that the storage format configuration information corresponding to the program identifier of the statistical result storage program in the KV database has been modified, the statistical result storage node acquires the modified storage format configuration information from the KV database and reloads it through the statistical result storage program.

It should be noted that the KV databases in the above four aspects (which may specifically be Redis databases) may be the same database (shared by the above four nodes) or different databases (each local to its node).
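The hot loading scheme common to the four aspects can be sketched as follows. The `InMemoryKV` class is a hypothetical stand-in for a KV database, and detecting a modification by polling and comparing the stored value is an assumption; the specification only says that the node monitors the KV database for changes.

```python
import json
from typing import Dict, Optional


class InMemoryKV:
    """Stand-in for a KV database; only get and put are modeled."""

    def __init__(self) -> None:
        self._data: Dict[str, str] = {}

    def put(self, key: str, value: str) -> None:
        self._data[key] = value

    def get(self, key: str) -> Optional[str]:
        return self._data.get(key)


class HotReloadingProgram:
    """A running program that reloads its configuration without restarting."""

    def __init__(self, program_id: str, kv: InMemoryKV, config: dict) -> None:
        self.program_id = program_id
        self.kv = kv
        self.config = config
        self.reload_count = 0
        # Key = program identifier, value = serialized configuration.
        kv.put(program_id, json.dumps(config))

    def poll(self) -> bool:
        """Reload the configuration if the stored value has been modified."""
        raw = self.kv.get(self.program_id)
        if raw is None:
            return False
        stored = json.loads(raw)
        if stored == self.config:
            return False
        self.config = stored
        self.reload_count += 1
        return True


kv = InMemoryKV()
program = HotReloadingProgram("feature_statistics", kv, {"group_id": "zhangsan"})
unchanged = program.poll()
# The service party modifies the configuration while the program is running.
kv.put("feature_statistics", json.dumps({"group_id": "lisi"}))
changed = program.poll()
```

In practice, `poll` would be called periodically by the running program, or replaced by a change notification mechanism if the chosen KV database offers one.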

In addition, the four programs described herein, namely the data acquisition program, the feature extraction program, the feature statistics program, and the statistics result storage program, may be developed based on the stream computing framework Flink, or based on the stream computing frameworks Storm or Spark.

Based on the data processing method based on distributed stream computation shown in fig. 3, an embodiment of the present specification further provides a data processing apparatus based on distributed stream computation, as shown in fig. 4, a data processing system includes the apparatus, a feature extraction node, and a feature statistics node, where the apparatus includes:

an obtaining module 401, configured to, after loading data source configuration information specified by a service party through an installed data obtaining program, obtain, through the data obtaining program, a service data stream from a data source described in the data source configuration information;

a transfer module 402, configured to transfer the service data stream to the feature extraction node, so that: after loading the extraction rule configuration information specified by the service party through an installed feature extraction program, the feature extraction node, for each piece of service data in the service data stream in sequence, extracts feature information from the service data through the feature extraction program according to the extraction rule recorded in the extraction rule configuration information, and transfers the obtained feature information stream to the feature statistics node; and after loading the statistical rule configuration information specified by the service party through an installed feature statistics program, the feature statistics node performs statistics on the feature information in the feature information stream through the feature statistics program according to the statistical rule recorded in the statistical rule configuration information, obtains a statistical result, and outputs the statistical result.

Based on the data processing method based on distributed stream computation shown in fig. 3, an embodiment of the present specification further provides a data processing apparatus based on distributed stream computation, as shown in fig. 5, a data processing system includes a data acquisition node, the apparatus, and a feature statistics node, and the apparatus includes:

an extraction module 501, configured to, after loading the extraction rule configuration information specified by a service party through an installed feature extraction program, extract, for each piece of service data in a service data stream in sequence, feature information from the service data through the feature extraction program according to the extraction rule recorded in the extraction rule configuration information;

a transfer module 502, configured to transfer the obtained feature information stream to the feature statistics node, so that after the feature statistics node loads, through an installed feature statistics program, the statistics rule configuration information specified by the service party, the feature statistics program performs statistics on the feature information in the feature information stream according to the statistics rule recorded in the statistics rule configuration information, so as to obtain a statistics result, and outputs the statistics result.

Based on the data processing method based on distributed stream computation shown in fig. 3, an embodiment of the present specification further provides a data processing apparatus based on distributed stream computation, as shown in fig. 6, a data processing system includes a data acquisition node, a feature extraction node, and the apparatus, and the apparatus includes:

a statistical module 601, configured to load, by an installed feature statistical program, statistical rule configuration information specified by a service party, and perform statistics on feature information in a feature information stream according to a statistical rule recorded in the statistical rule configuration information by the feature statistical program to obtain a statistical result;

an output module 602, configured to output the statistical result.

Embodiments of the present specification also provide a computer device, which at least includes a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor implements the method shown in fig. 3 when executing the program.

Fig. 7 is a more specific hardware structure diagram of a computing device provided in an embodiment of the present specification, where the device may include: a processor 1010, a memory 1020, an input/output interface 1030, a communication interface 1040, and a bus 1050. Wherein the processor 1010, memory 1020, input/output interface 1030, and communication interface 1040 are communicatively coupled to each other within the device via bus 1050.

The processor 1010 may be implemented by a general-purpose CPU (Central Processing Unit), a microprocessor, an Application Specific Integrated Circuit (ASIC), or one or more Integrated circuits, and is configured to execute related programs to implement the technical solutions provided in the embodiments of the present disclosure.

The Memory 1020 may be implemented in the form of a ROM (Read Only Memory), a RAM (Random Access Memory), a static storage device, a dynamic storage device, or the like. The memory 1020 may store an operating system and other application programs, and when the technical solution provided by the embodiments of the present specification is implemented by software or firmware, the relevant program codes are stored in the memory 1020 and called to be executed by the processor 1010.

The input/output interface 1030 is used for connecting an input/output module to input and output information. The input/output module may be configured as a component within the device (not shown) or may be external to the device to provide the corresponding function. The input devices may include a keyboard, a mouse, a touch screen, a microphone, various sensors, etc., and the output devices may include a display, a speaker, a vibrator, an indicator light, etc.

The communication interface 1040 is used for connecting a communication module (not shown in the drawings) to implement communication interaction between the present apparatus and other apparatuses. The communication module can realize communication in a wired mode (such as USB, network cable and the like) and also can realize communication in a wireless mode (such as mobile network, WIFI, Bluetooth and the like).

Bus 1050 includes a path that transfers information between various components of the device, such as processor 1010, memory 1020, input/output interface 1030, and communication interface 1040.

It should be noted that although the above-mentioned device only shows the processor 1010, the memory 1020, the input/output interface 1030, the communication interface 1040 and the bus 1050, in a specific implementation, the device may also include other components necessary for normal operation. In addition, those skilled in the art will appreciate that the above-described apparatus may also include only those components necessary to implement the embodiments of the present description, and not necessarily all of the components shown in the figures.

Embodiments of the present description also provide a computer-readable storage medium on which a computer program is stored, which when executed by a processor implements the method shown in fig. 3.

Computer-readable media, including both permanent and non-permanent, removable and non-removable media, may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), Static Random Access Memory (SRAM), Dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), Read Only Memory (ROM), Electrically Erasable Programmable Read Only Memory (EEPROM), flash memory or other memory technology, compact disc read only memory (CD-ROM), Digital Versatile Discs (DVD) or other optical storage, magnetic cassettes, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information that can be accessed by a computing device. As defined herein, a computer-readable medium does not include a transitory computer-readable medium such as a modulated data signal or a carrier wave.

From the above description of the embodiments, it is clear to those skilled in the art that the embodiments of the present specification can be implemented by software plus a necessary general-purpose hardware platform. Based on such understanding, the technical solutions of the embodiments of the present specification may be embodied, in essence or in part, in the form of a software product, which may be stored in a storage medium, such as a ROM/RAM, a magnetic disk, an optical disk, etc., and includes several instructions for enabling a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments of the present specification.

The systems, methods, modules or units described in the above embodiments may be implemented by a computer chip or an entity, or by a product with certain functions. A typical implementation device is a computer, which may take the form of a personal computer, laptop computer, cellular telephone, camera phone, smart phone, personal digital assistant, media player, navigation device, email messaging device, game console, tablet computer, wearable device, or a combination of any of these devices.

The embodiments in the present specification are described in a progressive manner; for the same or similar parts among the embodiments, reference may be made to one another, and each embodiment focuses on its differences from the other embodiments. In particular, since the apparatus embodiment is substantially similar to the method embodiment, it is described relatively briefly, and reference may be made to the relevant parts of the description of the method embodiment. The above-described apparatus embodiments are merely illustrative: the modules described as separate components may or may not be physically separate, and the functions of the modules may be implemented in one or more pieces of software and/or hardware when implementing the embodiments of the present specification. Some or all of the modules may also be selected according to actual needs to achieve the purpose of the solution of the embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.

The foregoing is only a specific implementation of the embodiments of the present specification. It should be noted that those skilled in the art can make a number of improvements and refinements without departing from the principle of the embodiments of the present specification, and these improvements and refinements should also be regarded as falling within the protection scope of the embodiments of the present specification.

Claims

1. A data processing method based on distributed stream computing is disclosed, wherein a data processing system comprises a data acquisition node, a feature extraction node and a feature statistical node, and the method comprises the following steps:

after the feature extraction node loads the extraction rule configuration information specified by the service party through an installed feature extraction program, sequentially aiming at each service data in the service data stream, extracting information from the service data according to the extraction rule recorded in the extraction rule configuration information through the feature extraction program, organizing the information into a general data structure, taking the information organized into the general data structure as feature information, and transmitting the obtained feature information stream to the feature statistical node; the general data structure includes: a group identification field, a remark field and at least one key-value pair key-value field;

after the feature statistical node loads, through an installed feature statistical program, the statistical rule configuration information specified by the service party, performing the following through the feature statistical program according to the statistical rule recorded in the statistical rule configuration information:

screening out, from the feature information stream, the feature information whose group identification field takes the specified group identifier as its value; the specified group identifier is specified by the statistical rule recorded in the statistical rule configuration information;

determining a plurality of pieces of standby feature information according to the screened feature information;

determining, as target feature information, the standby feature information whose remark field value satisfies the statistical condition; the statistical condition is specified by the statistical rule recorded in the statistical rule configuration information;

for each key-value contained in the target feature information, adding up the values of that key-value across the target feature information, and forming a comprehensive key-value from the obtained sum and the key of that key-value;

determining a statistical result according to each comprehensive key-value, and outputting the statistical result;

the statistical result has the general data structure, the value of the group identification field of the statistical result is the specified group identifier, the value of the remark field of the statistical result is the statistical condition, and the values of the key-value fields of the statistical result correspond one-to-one to the comprehensive key-values.
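
As a non-limiting illustration, the statistics steps of claim 1 may be sketched as follows. The names `Feature` and `compute_statistic`, the use of a Python callable as the statistical condition, and the sample records are assumptions made for this sketch only and do not appear in the specification; the sketch also uses the screened feature information directly as the standby feature information.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterable


@dataclass
class Feature:
    """Hypothetical form of the general data structure."""
    group_id: str                      # group identification field
    remark: str                        # remark field
    kv: Dict[str, int] = field(default_factory=dict)  # key-value fields


def compute_statistic(
    stream: Iterable[Feature],
    group_id: str,
    condition: Callable[[str], bool],
    condition_label: str,
) -> Feature:
    # Screen out feature information carrying the specified group identifier.
    screened = [f for f in stream if f.group_id == group_id]
    # Assumption: screened records serve directly as standby feature information.
    standby = screened
    # Keep standby records whose remark satisfies the statistical condition.
    targets = [f for f in standby if condition(f.remark)]
    # Sum the values per key to form the comprehensive key-values.
    totals: Dict[str, int] = {}
    for f in targets:
        for key, value in f.kv.items():
            totals[key] = totals.get(key, 0) + value
    # The result reuses the general data structure.
    return Feature(group_id=group_id, remark=condition_label, kv=totals)


# Illustrative data: clicks recorded per user, with a timestamp as the remark.
stream = [
    Feature("user-1", "2018-08-23T10:00", {"click:item-a": 1}),
    Feature("user-1", "2018-08-23T11:00", {"click:item-a": 1, "click:item-b": 1}),
    Feature("user-2", "2018-08-23T10:30", {"click:item-a": 1}),
    Feature("user-1", "2018-08-24T09:00", {"click:item-a": 1}),
]
result = compute_statistic(
    stream, "user-1", lambda remark: remark.startswith("2018-08-23"), "2018-08-23"
)
print(result.kv)  # {'click:item-a': 2, 'click:item-b': 1}
```

Because the statistical result has the same structure as its inputs, it could in principle be fed to a further statistics step without conversion.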

2. The method of claim 1, wherein the service data stream specifically includes:

a user behavior log queue;

extracting information from the service data according to the extraction rule recorded in the extraction rule configuration information and organizing the information into the general data structure specifically comprises:

writing the user identification contained in the user behavior log into the group identification field of the general data structure according to the extraction rule recorded in the extraction rule configuration information; and

writing the time information contained in the user behavior log into the remark field according to the extraction rule; and

extracting a plurality of behavior content fields from the user behavior log according to the extraction rule, forming, for each extracted behavior content field, a key-value with the behavior content field as the key and a preset value as the value, and writing the formed key-values into the key-value fields of the general data structure.
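
By way of example only, the extraction of claim 2 may be sketched as below. The layout of the rule dictionary (`group_id_from`, `remark_from`, `behavior_fields`, `preset_value`), the log field names, and the choice of using each behavior field's content as the key are assumptions of this sketch, not details given in the specification.

```python
from typing import Any, Dict


def extract_feature(log: Dict[str, Any], rule: Dict[str, Any]) -> Dict[str, Any]:
    """Organize one user behavior log into the general data structure."""
    kv: Dict[str, int] = {}
    for name in rule["behavior_fields"]:
        if name in log:
            # Key: the behavior content; value: the preset value from the rule.
            kv[str(log[name])] = rule["preset_value"]
    return {
        "group_id": log[rule["group_id_from"]],  # user identification
        "remark": log[rule["remark_from"]],      # time information
        "kv": kv,
    }


# Hypothetical extraction rule configuration supplied by the service party.
rule = {
    "group_id_from": "user_id",
    "remark_from": "timestamp",
    "behavior_fields": ["clicked_item", "searched_term"],
    "preset_value": 1,
}
log = {
    "user_id": "user-1",
    "timestamp": "2018-08-23T10:00:00",
    "clicked_item": "item-a",
    "searched_term": "shoes",
}
feature = extract_feature(log, rule)
print(feature["kv"])  # {'item-a': 1, 'shoes': 1}
```

A preset value of 1 makes the later summation of values equivalent to counting occurrences of each behavior.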

3. The method according to claim 2, wherein determining, as the target feature information, the standby feature information whose remark field value satisfies the statistical condition specifically includes:

determining, as the target feature information, the standby feature information whose time information in the remark field falls within the specified time period.

4. The method according to claim 3, wherein determining a plurality of pieces of standby feature information according to the screened feature information comprises:

aggregating, in each cache period, the screened feature information held in the cache to obtain a first intermediate result corresponding to that cache period; the cache period is shorter than the time interval corresponding to the specified time period, and the first intermediate result has the general data structure;

writing the first intermediate result corresponding to the cache period into a database, so as to determine the plurality of pieces of standby feature information.

5. The method of claim 4, wherein determining the plurality of pieces of standby feature information specifically comprises:

aggregating the first intermediate results written into the database in each write period into a second intermediate result, which serves as standby feature information; the write period is longer than the cache period and shorter than the time interval corresponding to the specified time period, and the second intermediate result has the general data structure.
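
The two-level pre-aggregation of claims 4 and 5 may be illustrated, without limitation, by the sketch below. It handles a single group, represents each record as a hypothetical `(timestamp_in_seconds, key_values)` pair, and uses assumed period lengths of 60, 600 and 3600 seconds; none of these choices comes from the specification.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# (timestamp in seconds, key-value fields); the timestamp plays the remark role.
Record = Tuple[int, Dict[str, int]]


def roll_up(records: List[Record], period: int) -> List[Record]:
    """Merge records falling in the same period, summing values per key."""
    buckets: Dict[int, Dict[str, int]] = defaultdict(lambda: defaultdict(int))
    for ts, kv in records:
        start = ts - ts % period  # start of the period containing ts
        for key, value in kv.items():
            buckets[start][key] += value
    return [(start, dict(kv)) for start, kv in sorted(buckets.items())]


CACHE_PERIOD = 60          # assumed cache period
WRITE_PERIOD = 600         # assumed write period
SPECIFIED_INTERVAL = 3600  # assumed length of the specified time period
assert CACHE_PERIOD < WRITE_PERIOD < SPECIFIED_INTERVAL

raw = [(5, {"click": 1}), (42, {"click": 1}), (75, {"click": 1}), (610, {"click": 1})]
first = roll_up(raw, CACHE_PERIOD)     # first intermediate results
second = roll_up(first, WRITE_PERIOD)  # second intermediate results (standby)
print(first)   # [(0, {'click': 2}), (60, {'click': 1}), (600, {'click': 1})]
print(second)  # [(0, {'click': 3}), (600, {'click': 1})]
```

Since every level keeps the same record shape, the final statistics step only has to sum a handful of second intermediate results instead of every raw record in the specified time period.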

6. The method of claim 1, wherein the data processing system further comprises a statistical result storage node;

the feature statistical node outputting the statistical result specifically includes:

the feature statistical node outputs the statistical result to the statistical result storage node;

the method further comprises the following steps:

after the statistical result storage node loads, through an installed statistical result storage program, the storage format configuration information specified by the service party, storing the statistical result, through the statistical result storage program, in the data storage format recorded in the storage format configuration information.
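
As one possible illustration of claim 6, the sketch below serializes a statistical result according to a storage format named in the configuration. The `format` key, the two supported formats (JSON and CSV) and the function name `store_result` are assumptions of this sketch; the specification does not prescribe particular formats.

```python
import csv
import io
import json
from typing import Any, Dict


def store_result(result: Dict[str, Any], storage_config: Dict[str, str]) -> str:
    """Serialize a statistical result in the configured storage format."""
    fmt = storage_config["format"]
    if fmt == "json":
        return json.dumps(result, sort_keys=True)
    if fmt == "csv":
        buf = io.StringIO()
        writer = csv.writer(buf, lineterminator="\n")
        writer.writerow(["group_id", "remark", "key", "value"])
        # One row per key-value field of the statistical result.
        for key, value in sorted(result["kv"].items()):
            writer.writerow([result["group_id"], result["remark"], key, value])
        return buf.getvalue()
    raise ValueError(f"unsupported storage format: {fmt}")


result = {"group_id": "user-1", "remark": "2018-08-23", "kv": {"item-a": 2}}
as_csv = store_result(result, {"format": "csv"})
as_json = store_result(result, {"format": "json"})
```

Changing the stored format then only requires the service party to specify different configuration information; the storage program itself stays unchanged.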

7. The method of claim 1, further comprising:

after the data acquisition node loads, through an installed data acquisition program, data source configuration information specified by a service party, establishing a key-value with the program identifier of the data acquisition program as the key and the data source configuration information as the value, and storing the key-value into a KV database.

8. The method of claim 1, further comprising:

after the feature extraction node loads, through the installed feature extraction program, the extraction rule configuration information specified by the service party, establishing a key-value with the program identifier of the feature extraction program as the key and the extraction rule configuration information as the value, and storing the key-value into a KV database.

9. The method of claim 1, further comprising:

after the feature statistical node loads, through the installed feature statistical program, the statistical rule configuration information specified by the service party, establishing a key-value with the program identifier of the feature statistical program as the key and the statistical rule configuration information as the value, and storing the key-value into a KV database.

10. The method of claim 8, further comprising:

after the statistical result storage node loads, through the installed statistical result storage program, the storage format configuration information specified by the service party, establishing a key-value with the program identifier of the statistical result storage program as the key and the storage format configuration information as the value, and storing the key-value into a KV database.
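
The bookkeeping recited in claims 7 to 10 follows one pattern: once a program has loaded its configuration information, a key-value is stored with the program identifier as the key and the configuration information as the value. A minimal sketch is given below; the in-memory `InMemoryKV` class stands in for a real KV database, and the program identifiers and configuration contents are hypothetical.

```python
import json
from typing import Any, Dict, Optional


class InMemoryKV:
    """Stand-in for a KV database (assumption: any store with put/get would do)."""

    def __init__(self) -> None:
        self._data: Dict[str, str] = {}

    def put(self, key: str, value: str) -> None:
        self._data[key] = value

    def get(self, key: str) -> Optional[str]:
        return self._data.get(key)


def register_config(kv_db: InMemoryKV, program_id: str, config: Dict[str, Any]) -> None:
    # Key: program identifier; value: the loaded configuration information.
    kv_db.put(program_id, json.dumps(config, sort_keys=True))


kv_db = InMemoryKV()
register_config(kv_db, "acquire-01", {"source": "user_behavior_log_queue"})
register_config(kv_db, "extract-01", {"group_id_from": "user_id"})
register_config(kv_db, "stat-01", {"group_id": "user-1", "period": "2018-08-23"})

# Looking up a program identifier shows which configuration that program runs with.
loaded = json.loads(kv_db.get("extract-01"))
```

Keying by program identifier makes it possible to check which configuration each deployed program currently holds without inspecting the node itself.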

11. A data processing system based on distributed stream computing, comprising a data acquisition node, a feature extraction node and a feature statistical node;

after loading, through an installed feature extraction program, the extraction rule configuration information specified by the service party, the feature extraction node, for each piece of service data in the service data stream in turn, extracts information from the service data through the feature extraction program according to the extraction rule recorded in the extraction rule configuration information, organizes the extracted information into a general data structure, takes the information organized into the general data structure as feature information, and transmits the resulting feature information stream to the feature statistical node; the general data structure includes: a group identification field, a remark field and at least one key-value field;

after loading, through an installed feature statistical program, the statistical rule configuration information specified by the service party, the feature statistical node, through the feature statistical program and according to the statistical rule recorded in the statistical rule configuration information: screens out, from the feature information stream, the feature information whose group identification field takes the specified group identifier as its value, the specified group identifier being specified by the statistical rule recorded in the statistical rule configuration information; determines a plurality of pieces of standby feature information according to the screened feature information; determines, as target feature information, the standby feature information whose remark field value satisfies the statistical condition, the statistical condition being specified by the statistical rule recorded in the statistical rule configuration information; for each key-value contained in the target feature information, adds up the values of that key-value across the target feature information and forms a comprehensive key-value from the obtained sum and the key of that key-value; and determines a statistical result according to the comprehensive key-values and outputs the statistical result.

12. The system of claim 11, further comprising: a statistical result storage node;

the feature statistical node outputs the statistical result to the statistical result storage node;

after loading, through an installed statistical result storage program, the storage format configuration information specified by the service party, the statistical result storage node stores the statistical result, through the statistical result storage program, in the data storage format recorded in the storage format configuration information.

13. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the program, implements the functions of the data acquisition node, the feature extraction node or the feature statistical node in the method of claim 1.