CN115080565A - Multi-source data unified processing system based on big data engine - Google Patents
- Publication number
- CN115080565A (application CN202210643243.1A)
- Authority
- CN
- China
- Prior art keywords
- data
- module
- source data
- source
- tree
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/22—Indexing; Data structures therefor; Storage structures
- G06F16/21—Design, administration or maintenance of databases
- G06F16/215—Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
Abstract
The invention relates to the technical field of digital information transmission and discloses a multi-source data unified processing system based on a big data engine. The system comprises a multi-source data timing acquisition module, a multi-source data rule setting module, a multi-source data classified storage module, a data preprocessing module, a data auditing module, a data central control module and a human-computer interaction module. It realizes rule configuration, timed operation and access for data from different sources; achieves rapid acquisition and cleaning of the data; loads the data into a big data platform for layered, separated storage; and provides a flexible external service mode. It solves the problem that a single or isolated data channel cannot meet the increasingly complex requirements of the internet era: data from multiple channels can be effectively gathered together, unified access, distribution and processing of information data are realized, user experience is improved, and the transmission efficiency of data processing and transmission is improved.
Description
Technical Field
The invention relates to the technical field of digital information transmission, in particular to a multi-source data unified processing system based on a big data engine.
Background
The main functions of a data processing system are to process and sort input data, calculate various analysis indexes, convert the input into a form that is easy for people to accept, store the processed information in an orderly manner, and transmit it to information users through external equipment at any time.
In a traditional data processing system, because the data channel is single or isolated, the increasingly complex requirements of the internet era cannot be met: data from multiple channels cannot be effectively gathered together, unified access, distribution and processing of information data cannot be achieved, information cannot be identified, user experience is reduced, and the transmission efficiency of data processing and transmission is reduced. A data fusion and data collection center system scheme is therefore urgently needed, together with an explanation of data fusion technology and a data fusion method.
Disclosure of Invention
Technical problem solved
Aiming at the defects of the prior art, the invention provides a multi-source data unified processing system based on a big data engine, which realizes rule configuration, timed operation and access for data from different sources; achieves rapid acquisition and cleaning of the data; loads the data into a big data platform for layered, separated storage; and provides a flexible external service mode. It solves the problem that a single or isolated data channel cannot meet the increasingly complex requirements of the internet era, so that data from multiple channels are effectively gathered together, unified access, distribution and processing of information data are realized, user experience is improved, and the transmission efficiency of data processing and transmission is improved.
Technical scheme
In order to achieve the purpose, the invention provides the following technical scheme: a multi-source data unified processing system based on a big data engine comprises a multi-source data timing acquisition module, a multi-source data rule setting module, a multi-source data classification storage module, a data preprocessing module, a data auditing module, a data central control module and a man-machine interaction module. The multi-source data timing acquisition module carries out a timed multi-source data acquisition and transmission process based on the man-machine interaction module, specifically including acquisition and extraction of text keywords, acquisition of picture or voice recognition data, and other modes of data acquisition and transmission.
Preferably, the multi-source data rule setting module sets and classifies data rules based on the multi-source data timing acquisition module and the multi-source data classification storage module, and the multi-source data classification storage module stores classification data through the multi-source data rule setting module.
Preferably, the data preprocessing module is based on a multi-source data classification storage module, and can perform data integration and data cleaning on each group of information through the preprocessing module, and the data auditing module is used for auditing the classification data of the data preprocessing module.
Preferably, the data central control module directly controls the multi-source data classification storage module, and the human-computer interaction module is used for manually inputting, reading, monitoring and modifying data based on the display terminal.
Preferably, the multi-source data timing acquisition module solves three problems. The first is the timing of the data: a data supplier may add a file at a fixed interval, for example every 5 or 10 minutes, so the data acquisition program must scan periodically to ensure the data stays current, and 5-minute data must not be processed as 10-minute data. The second is the inconsistency of data interfaces: the interface platform provided by a data supplier is not necessarily just an FTP server and may include an SQL Server database, an FTP server and Cobar. The third is the diversity of the files provided by different platforms: a database platform provides only database tables, while FTP provides CSV, XML and TXT files, and all of these must be parsed uniformly. The specific steps are as follows:
preferably, the data preprocessing module includes data cleaning and data integration, the data integration merges a plurality of data sources into one data storage, and if the analyzed data originally does not need data integration in one data storage, i.e. all-in-one, the data integration is realized by using two data frames as a basis and a merge function in R, and the statements are merge (dataframe1, dataframe2, by ═ key "), and are arranged in ascending order by default, and the following problems may occur when data integration is performed: synonymy, wherein the name of a certain attribute in the data source A is the same as the name of a certain attribute in the data source B, but the represented entities are different and cannot be used as keywords; synonymy of different names, namely that the names of certain attributes of two data sources are different but the represented entities are the same, and the names can be used as keywords; data integration often causes data redundancy, the same attribute may appear for multiple times, or the attribute names may be inconsistent to cause repetition, one of the repeated attributes is firstly subjected to related analysis and detection, and if the repeated attribute exists, the repeated attribute is deleted through a human-computer interaction module.
Preferably, the multi-source data rule setting module sets rules with the FP-Growth algorithm, which is described as follows:
Input: a transaction database D; a minimum support threshold min_sup.
Output: the complete set of frequent patterns.
Step 1: construct the FP-Tree as follows:
1. Scan the transaction database D once, collect the set F of frequent items and their supports, and sort F in descending order of support; the result is the frequent-item list L.
2. Create the root node T of the FP-Tree, labelled "null". For each transaction Trans in D: select the frequent items in Trans and sort them in the order of L; let the sorted frequent-item list be [p|P], where p is the first element and P is the list of remaining elements; call insert_tree([p|P], T). The procedure is: if T has a child N such that N.item-name = p.item-name, increment N's count by 1; otherwise create a new node N with count 1, link it to its parent node T, and link it via the node-link chain to the nodes with the same item-name; if P is not empty, call insert_tree(P, N) recursively.
Step 2: mine the frequent itemsets from the FP-Tree; the pseudo code of FP-growth(Tree, α) is:
if Tree contains a single path P then
  for each combination β of the nodes in path P:
    generate pattern β ∪ α with support = the minimum support of the nodes in β;
else for each ai in the header table of Tree {
  generate pattern β = ai ∪ α with support = ai.support;
  construct β's conditional pattern base, and from it construct β's conditional FP-Tree Treeβ;
  if Treeβ ≠ ∅ then
    call FP-growth(Treeβ, β); }
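The FP-Tree construction of step 1 can be sketched as follows — a minimal illustration of the first scan, the descending-support ordering, and the insert_tree procedure. It is not the patent's implementation: class and function names are invented, the header-table node-links used in the mining step are omitted, and each transaction is treated as a set of items.

```python
from collections import Counter

class FPNode:
    """FP-Tree node: item name, support count, parent link, children keyed by item."""
    def __init__(self, item, parent=None):
        self.item, self.count, self.parent, self.children = item, 0, parent, {}

def insert_tree(items, node):
    """The insert_tree([p|P], T) step: walk/extend one path, bumping counts."""
    for item in items:
        child = node.children.get(item)
        if child is None:                # no child N with N.item-name == p.item-name:
            child = FPNode(item, node)   # create N with count 0, link to its parent
            node.children[item] = child
        child.count += 1                 # increment N's count
        node = child                     # recurse down the path (iteratively here)

def build_fp_tree(transactions, min_sup):
    # First scan of D: collect frequent items F (support >= min_sup) and fix
    # the descending-support order used to sort every transaction.
    support = Counter(i for t in transactions for i in set(t))
    rank = {i: r for r, (i, _) in enumerate(
        sorted(((i, c) for i, c in support.items() if c >= min_sup),
               key=lambda kv: (-kv[1], kv[0])))}
    root = FPNode(None)                  # root labelled "null"
    for t in transactions:               # second scan: insert each sorted transaction
        frequent = sorted((i for i in set(t) if i in rank), key=rank.get)
        insert_tree(frequent, root)
    return root, rank

root, rank = build_fp_tree(
    [["a", "b"], ["b", "c", "d"], ["a", "b", "c"], ["a", "b", "d"]], min_sup=2)
```

On this toy database the most frequent item "b" heads every path, so the tree shares a single "b" prefix with count 4 — exactly the compression the algorithm relies on.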
Preferably, the data preprocessing module includes data cleaning and data integration. Data cleaning covers the processing of missing values and abnormal values. Handling missing values consists of identifying them and then processing them; in R, missing values are identified with the is.na() function. The processing methods are deletion, replacement and interpolation. Deletion method: depending on the angle of deletion, either observation samples or variables are deleted; deleting observation samples (row-wise deletion) removes the rows containing missing values, which in R is done with the na.omit() function. Replacement method: the replacement rule depends on the type of the variable; when the variable with missing values is numerical, the missing value is replaced by the mean of the other values of that variable; when the variable is non-numerical, it is replaced by the median or mode of the other observed values of that variable. Interpolation method: taking the variable to be interpolated as the dependent variable y, a regression model is fitted on the other variables, and the missing value is interpolated with the lm regression function in R; multiple imputation generates several complete data sets from a data set containing missing values, repeating the procedure to produce random samples of the missing values, and is performed with the mice package in R.
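The R workflow described above (is.na for identification, na.omit for deletion, mean/mode replacement, lm/mice for interpolation) can be mirrored as an illustrative sketch in pandas. The toy data and simple linear interpolation below are stand-ins, not the patent's code:

```python
import pandas as pd

# Toy table with gaps in a numerical and a non-numerical column (values invented).
df = pd.DataFrame({"x": [1.0, 2.0, None, 4.0],
                   "label": ["a", None, "a", "b"]})

# Identification: is.na() in R corresponds to isna() in pandas.
mask = df["x"].isna()

# Deletion (row-wise, like na.omit): drop observations containing missing values.
dropped = df.dropna()

# Replacement: numerical -> mean of the observed values; non-numerical -> mode.
replaced = df.copy()
replaced["x"] = replaced["x"].fillna(df["x"].mean())
replaced["label"] = replaced["label"].fillna(df["label"].mode()[0])

# Interpolation: fill the gap from neighbouring values — a simple stand-in for
# the regression (lm) or multiple-imputation (mice) approaches in the text.
interpolated = df["x"].interpolate()
```

A regression-based fill, as the text describes with lm, would instead fit y on the other variables and predict the missing entries; the principle is the same.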
Advantageous effects
The invention provides a multi-source data unified processing system based on a big data engine, which has the following beneficial effects:
the multisource data unified processing system based on the big data engine realizes rule configuration, timing operation and access of data from different sources, realizes rapid acquisition and cleaning of the data, is loaded into a big data platform to realize hierarchical and warehouse-by-warehouse storage of the data, provides a flexible external service mode, meets the requirement of increasingly complex Internet era due to single data channel or isolated channel, can effectively collect data from multiple channels together, realizes unified access, distribution and processing of information data, improves user experience, and improves the transmission efficiency of data processing and transmission.
Drawings
FIG. 1 is a flow chart of the multi-source data unified processing system based on a big data engine according to the present invention.
Detailed Description
Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
Meanwhile, it should be understood that the sizes of the respective portions shown in the drawings are not drawn in an actual proportional relationship for the convenience of description.
The following description of at least one exemplary embodiment is merely illustrative in nature and is in no way intended to limit the application, its application, or uses.
Techniques, methods, and apparatus known to those of ordinary skill in the relevant art may not be discussed in detail but are intended to be part of the specification where appropriate.
It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, further discussion thereof is not required in subsequent figures.
Embodiments of the application are applicable to computer systems/servers that are operational with numerous other general purpose or special purpose computing system environments or configurations. Examples of well known computing systems, environments, and/or configurations that may be suitable for use with the computer system/server include, but are not limited to: personal computer systems, server computer systems, thin clients, thick clients, hand-held or laptop devices, microprocessor-based systems, set-top boxes, programmable consumer electronics, network personal computers, small computer systems, mainframe computer systems, and distributed cloud computing environments that include any of the above, and the like.
The computer system/server may be described in the general context of computer system-executable instructions, such as program modules, being executed by a computer system. Generally, program modules may include routines, programs, objects, components, logic, data structures, etc. that perform particular tasks or implement particular abstract data types. The computer system/server may be practiced in distributed cloud computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed cloud computing environment, program modules may be located in both local and remote computer system storage media including memory storage devices.
Example 1
The invention provides a technical scheme: a multi-source data unified processing system based on a big data engine comprises a multi-source data timing acquisition module, a multi-source data rule setting module, a multi-source data classification storage module, a data preprocessing module, a data auditing module, a data central control module and a man-machine interaction module. The multi-source data timing acquisition module carries out a timed multi-source data acquisition and transmission process based on the man-machine interaction module, specifically including acquisition and extraction of text keywords, acquisition of picture or voice recognition data, and other modes of data acquisition and transmission. The multi-source data rule setting module sets and classifies data rules based on the multi-source data timing acquisition module and the multi-source data classification storage module; the multi-source data classification storage module stores classified data through the multi-source data rule setting module; the data preprocessing module performs data integration and data cleaning on each group of information based on the multi-source data classification storage module; the data auditing module audits the classified data of the data preprocessing module; the data central control module directly controls the multi-source data classification storage module; and the man-machine interaction module is used for manually inputting, reading, monitoring and modifying data based on the display terminal.
A management method of a multi-source data unified processing system based on a big data engine specifically comprises the following steps:
101. the data is collected regularly by matching the multi-source data timing collection module with the man-machine interaction module and the multi-source data rule setting module to set the data rule;
In this embodiment, the multi-source data timing acquisition module is specifically described. The first problem to be solved is the timing of the data: since a data supplier adds a file at a fixed interval of 5 or 10 minutes, the data acquisition program performs periodic scanning, and 5-minute data must not be processed as 10-minute data. The second is the inconsistency of data interfaces: the interface platform provided by a data supplier is not only an FTP server but may also include an SQL Server database and Cobar. The third is the diversity of the files provided by different platforms: for example, a database platform provides only database tables, while FTP provides CSV, XML and TXT files, all of which are parsed uniformly. The specific steps are as follows:
The file dc.xml mainly configures how many triggers need to be started in the acquisition module, each trigger representing a thread-safe data branch. In addition, another file needs to be configured to describe the running time and target object of each data branch; it is called xfdc.xml. The specific steps are as follows:
It should be specifically noted that the trigger specifies that the acquisition object is triggered at the 20th second of every 5th minute of each hour, that is, the thread of the data branch is started; this embodiment is not specifically limited thereto.
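The schedule above ("the 20th second of every 5th minute of each hour") can be sketched as a plain computation of the next fire time. This is an illustrative stand-in for the Quartz-style trigger configuration, not the patent's actual xfdc.xml; the function name is invented.

```python
from datetime import datetime, timedelta

def next_fire(after):
    """Next slot of the form HH:(0,5,10,...):20 strictly after `after`,
    mimicking a 'second 20 of every 5th minute' trigger."""
    base = after.replace(minute=(after.minute // 5) * 5, second=20, microsecond=0)
    if base <= after:                     # already past this slot: advance one period
        base += timedelta(minutes=5)
    return base

t = next_fire(datetime(2022, 6, 1, 10, 3, 0))   # -> 10:05:20 the same day
```

A scanner thread would sleep until next_fire(now) and then start the data branch, repeating each period.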
102. The collected data are transmitted to a multi-source data classification storage module for data classification and storage, and each group of information is subjected to data cleaning and data integration through a data preprocessing module;
In this embodiment, it should be specifically explained that data integration merges multiple data sources into one data store; if the analyzed data already reside in one data store, no integration is needed. Two data frames are integrated on the basis of a key with the merge function in R, the statement being merge(dataframe1, dataframe2, by = "key"), sorted in ascending order by default. The following problems may occur during data integration: homonymy, where an attribute in data source A has the same name as an attribute in data source B but the represented entities differ, so the attribute cannot be used as the key; and synonymy, where attributes of the two data sources have different names but represent the same entity, and can be used as the key. When data integration causes data redundancy, correlation analysis and detection are first performed on the duplicate attribute, and it is deleted through the human-computer interaction module; this embodiment is not specifically limited.
It should be specifically noted that data integration can perform data cleaning and conversion, provides field calculation, merging, distribution, filtering and field desensitization components or functions, and supports fault-tolerance, concurrency and rate-limiting configuration; this embodiment is not specifically limited.
103. Data cleaning and data integration are monitored and audited through the data auditing module, after which data processing is carried out through the data central control module;
In this embodiment, it is specifically explained that the approach of the data central control module is as follows: against time, a suitable algorithm can be adopted together with an appropriate data structure, such as a Bloom filter, hash map, bit-map, heap, database, inverted index or trie tree; against space, the approach is divide and conquer — scale the large problem down into small ones and defeat each part in turn; this embodiment is not specifically limited.
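Of the structures listed, the Bloom filter is representative: it answers membership queries over a large data stream in constant space, with possible false positives but never false negatives. A minimal sketch (sizes, hash construction and names are illustrative, not from the patent):

```python
import hashlib

class BloomFilter:
    """Tiny Bloom filter backed by one Python int as the bit array."""
    def __init__(self, m=1024, k=3):
        self.m, self.k, self.bits = m, k, 0   # m bits, k hash functions

    def _positions(self, item):
        # Derive k bit positions by salting a single hash function.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.m

    def add(self, item):
        for p in self._positions(item):
            self.bits |= 1 << p

    def __contains__(self, item):
        # All k bits set -> "probably present"; any bit clear -> definitely absent.
        return all(self.bits >> p & 1 for p in self._positions(item))

bf = BloomFilter()
for record_id in ["a1", "b2", "c3"]:
    bf.add(record_id)
```

Added items are always reported present; a small fraction of never-added items may collide on all k bits, which is the accepted trade-off for the constant-space membership test.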
What needs to be specifically explained is divide and conquer: the multi-source data are processed by hash mapping, hash-map statistics and quick/merge/heap sorting. That is, when the multi-source data cannot be read into memory at once but must still be counted, sorted and so on, the basic idea is as follows: compute the hash value of each piece of data with a hash function and distribute the multi-source data into multiple buckets according to the hash values; since the hash function is deterministic, identical data necessarily land in the same bucket, so the small files are processed in sequence and the results are finally merged; this embodiment is not specifically limited.
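The hash-bucket idea above — hash each record, scatter into buckets, count each small bucket, merge — can be sketched in a few lines. This is an in-memory illustration of the principle only (in practice each bucket would be a file on disk; the names and bucket count are invented):

```python
from collections import Counter

def partition_count(records, num_buckets=4):
    """Scatter records into buckets by hash, count each small bucket, merge."""
    buckets = [[] for _ in range(num_buckets)]
    for r in records:
        # hash() is salted per process but consistent within a run, so identical
        # records always co-locate; a stable hash (e.g. hashlib) would be used
        # when buckets must persist across runs.
        buckets[hash(r) % num_buckets].append(r)
    total = Counter()
    for b in buckets:              # each bucket is small enough to process alone
        total.update(Counter(b))   # count within the bucket, then merge
    return total

counts = partition_count(["x", "y", "x", "z", "x", "y"])
```

Because identical records share a bucket, per-bucket counts are exact, and the merge step is a simple sum — the "turn big into small, defeat each part" strategy from the text.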
104. Inputting, reading, monitoring and modifying data through a man-machine interaction module based on a multi-source data timing acquisition module and a multi-source data rule setting module;
In this embodiment, it needs to be specifically stated that the multi-source data rule setting module sets rules through the FP-Growth algorithm, which is described as follows:
Input: a transaction database D; a minimum support threshold min_sup.
Output: the complete set of frequent patterns.
Step 1: construct the FP-Tree as follows:
1. Scan the transaction database D once, collect the set F of frequent items and their supports, and sort F in descending order of support; the result is the frequent-item list L.
2. Create the root node T of the FP-Tree, labelled "null". For each transaction Trans in D: select the frequent items in Trans and sort them in the order of L; let the sorted frequent-item list be [p|P], where p is the first element and P is the list of remaining elements; call insert_tree([p|P], T). The procedure is: if T has a child N such that N.item-name = p.item-name, increment N's count by 1; otherwise create a new node N with count 1, link it to its parent node T, and link it via the node-link chain to the nodes with the same item-name; if P is not empty, call insert_tree(P, N) recursively.
Step 2: mine the frequent itemsets from the FP-Tree; the pseudo code of FP-growth(Tree, α) is:
if Tree contains a single path P then
  for each combination β of the nodes in path P:
    generate pattern β ∪ α with support = the minimum support of the nodes in β;
else for each ai in the header table of Tree {
  generate pattern β = ai ∪ α with support = ai.support;
  construct β's conditional pattern base, and from it construct β's conditional FP-Tree Treeβ;
  if Treeβ ≠ ∅ then
    call FP-growth(Treeβ, β); }
Specifically, it should be noted that the FP-Growth algorithm can effectively compress a large database into a high-density structure much smaller than the original database, avoiding the overhead of repeated scanning. Based on mining the FP-Tree, the algorithm adopts a pattern-growth recursion strategy and creatively provides a mining method that needs no candidate itemsets, so it is more efficient when mining long frequent itemsets. In the mining process a divide-and-conquer strategy is adopted: the compressed database DB is divided into a group of conditional databases Dn, each associated with one frequent item, and each conditional database is mined separately; each conditional database Dn is far smaller than the database DB. This embodiment is not limited thereto.
Example 2
The invention provides a technical scheme: a multi-source data unified processing system based on a big data engine comprises a multi-source data timing acquisition module, a multi-source data rule setting module, a multi-source data classification storage module, a data preprocessing module, a data auditing module, a data central control module and a man-machine interaction module. The multi-source data timing acquisition module carries out a timed multi-source data acquisition and transmission process based on the man-machine interaction module, specifically including acquisition and extraction of text keywords, acquisition of picture or voice recognition data, and other modes of data acquisition and transmission. The multi-source data rule setting module sets and classifies data rules based on the multi-source data timing acquisition module and the multi-source data classification storage module; the multi-source data classification storage module stores classified data through the multi-source data rule setting module; the data preprocessing module performs data integration and data cleaning on each group of information based on the multi-source data classification storage module; the data auditing module audits the classified data of the data preprocessing module; the data central control module directly controls the multi-source data classification storage module; and the man-machine interaction module is used for manually inputting, reading, monitoring and modifying data based on the display terminal.
A management method of a multi-source data unified processing system based on a big data engine specifically comprises the following steps:
101. the data is regularly acquired by matching a multi-source data timing acquisition module with a man-machine interaction module and a multi-source data rule setting module to set data rules;
In this embodiment, it is specifically described that data is collected at regular intervals. The first problem to be solved is the timing of the data: since a data supplier adds a file at a fixed interval of 5 or 10 minutes and the data acquisition program performs periodic scanning, 5-minute data must not be treated as 10-minute data. The second is the inconsistency of data interfaces: the interface platform provided by a data supplier is not only an FTP server but may also include an SQL Server database and Cobar. The third is the diversity of the files provided by different platforms: for example, a database platform provides only database tables, while FTP provides CSV, XML and TXT files, all of which are parsed uniformly. The specific steps are as follows:
The file dc.xml mainly configures how many triggers need to be started in the acquisition module, each trigger representing a thread-safe data branch. In addition, another file needs to be configured to describe the running time and target object of each data branch; it is called xfdc.xml. The specific steps are as follows:
It should be specifically noted that the trigger specifies that the acquisition object is triggered at the 20th second of every 5th minute of each hour, that is, the thread of the data branch is started; this embodiment is not specifically limited thereto.
102. The collected data are transmitted to a multi-source data classification storage module for data classification and storage, and each group of information is subjected to data cleaning and data integration through a data preprocessing module;
In this embodiment, it should be specifically noted that data cleaning includes the processing of missing values and abnormal values. Handling a missing value involves first identifying it and then processing it; in R, missing values are identified with the function is.na(). There are three processing methods: deletion, replacement and interpolation. Deletion method: depending on the deletion angle, either observation samples or variables are deleted; deleting observation samples (the row-deletion method) removes rows containing missing values, which in R is done with na.omit(). Replacement method: the replacement rule depends on the variable type; when the variable with the missing value is numerical, the missing value is replaced by the mean of the other values of that variable; when the variable is non-numerical, it is replaced by the median or mode of the other observed values of that variable. Interpolation method: regression interpolation takes the interpolated variable as the dependent variable y, fits the other variables with a regression model, and interpolates the missing value using the lm regression function in R; multiple imputation generates several complete data sets from a data set containing missing values, repeating the process to produce random samples of the missing values, and is performed with the mice package in R. This embodiment is not specifically limited thereto.
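The embodiment describes the cleaning steps in R; the deletion and replacement rules can be sketched equivalently in Python with pandas (an illustrative translation with made-up column names, not the patented implementation):

```python
import pandas as pd

df = pd.DataFrame({"x": [1.0, None, 3.0, 4.0],
                   "grade": ["a", "b", None, "b"]})

# Deletion method: drop rows containing any missing value
# (analogous to na.omit() in R).
dropped = df.dropna()

# Replacement method: a numerical column gets the mean of its other
# values; a non-numerical column gets the mode of its other values.
filled = df.copy()
filled["x"] = filled["x"].fillna(filled["x"].mean())
filled["grade"] = filled["grade"].fillna(filled["grade"].mode()[0])
```

Regression interpolation and multiple imputation (lm and the mice package in R) have no one-line pandas equivalent and are omitted from this sketch.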
It should be specifically noted that data integration can perform data cleansing and conversion, provides field calculation, merging, distribution, filtering and field desensitization components or functions, and supports fault-tolerance configuration, concurrency configuration and speed-limit configuration; this embodiment is not specifically limited thereto.
103. Data cleaning and data integration are monitored and audited by the data auditing module, and data processing is then carried out by the data central control module;
In this embodiment, it should be specifically explained that the data central control module solves problems along two dimensions. For the time dimension, a clever algorithm is paired with a suitable data structure, such as a Bloom filter, hash map, bit-map, heap, external (database) sort, inverted index or trie tree. For the space dimension, the approach is divide and conquer: scale the large problem down into small subproblems and defeat them one by one. This embodiment is not specifically limited thereto.
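Of the structures listed above, the Bloom filter is the least standard; a minimal, illustrative sketch follows (the bit-array size m and hash count k are chosen arbitrarily here, not taken from the embodiment):

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: k hash functions over an m-bit array.
    May report false positives, never false negatives."""

    def __init__(self, m: int = 1024, k: int = 3):
        self.m, self.k = m, k
        self.bits = 0  # integer used as a bit array

    def _positions(self, item):
        # Derive k positions by salting one hash function with an index.
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(h, 16) % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def __contains__(self, item):
        return all(self.bits >> pos & 1 for pos in self._positions(item))
```

In a deduplication setting, each incoming record key is tested against the filter before the expensive exact check, trading a small false-positive rate for constant memory.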
What needs to be specifically explained is divide and conquer: the multi-source data are handled through hash mapping, hash-map statistics and quick/merge/heap sorting. That is, when the multi-source data cannot be read into memory at once but operations such as counting and sorting must still be carried out on them, the basic idea is as follows: compute a hash value for each piece of data with a hash algorithm and distribute the multi-source data into a number of buckets according to those hash values; by the determinism of the hash function, identical data necessarily land in the same bucket, so the user processes the small files one by one and finally performs a merge operation. This embodiment is not specifically limited thereto.
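The hash-mapping and merge steps just described can be sketched as follows (function names are illustrative; in practice each bucket would be a file on disk rather than an in-memory list, which is exactly what makes the data fit in memory one bucket at a time):

```python
def partition_into_buckets(items, n_buckets=4):
    """Distribute items into buckets by hash value, so that identical
    items always land in the same bucket."""
    buckets = [[] for _ in range(n_buckets)]
    for item in items:
        buckets[hash(item) % n_buckets].append(item)
    return buckets

def count_frequencies(items, n_buckets=4):
    """Count item frequencies bucket by bucket, then merge the partial
    results; keys never collide across buckets."""
    total = {}
    for bucket in partition_into_buckets(items, n_buckets):
        counts = {}
        for item in bucket:
            counts[item] = counts.get(item, 0) + 1
        total.update(counts)
    return total
```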
104. Inputting, reading, monitoring and modifying data through a man-machine interaction module based on a multi-source data timing acquisition module and a multi-source data rule setting module;
In this embodiment, it should be specifically noted that the multi-source data rule setting module sets its rules through the FP-Growth algorithm, which is described as follows:
Input: a transaction database D and a minimum support threshold min_sup;
Output: the complete set of frequent patterns;
Step 1: construct the FP-Tree as follows:
Scan the transaction database D once, collecting the set F of frequent items and their supports. Sort F in descending order of support; the result is the frequent-item list L.
Create the root node of the FP-Tree, labeled "null". For each transaction Trans in D, perform the following: select the frequent items in Trans and sort them in the order of L. Let the sorted frequent-item list be [p|P], where p is the first element and P is the list of remaining elements. Call insert_tree([p|P], T), which proceeds as follows: if T has a child N such that N.item-name = p.item-name, increment the count of N by 1; otherwise create a new node N, set its count to 1, link it to its parent node T, and link it via the node-link structure to the nodes with the same item-name; if P is non-empty, recursively call insert_tree(P, N).
Step 2: mine the frequent item sets from the FP-Tree; the pseudo code of the procedure FP-growth(Tree, α) is as follows:
if Tree contains a single path P then
    for each combination β of the nodes in path P
        generate pattern β ∪ α with support = the minimum support of the nodes in β;
else for each ai in the header table of Tree {
    generate pattern β = ai ∪ α with support = ai.support;
    construct β's conditional pattern base and then β's conditional FP-Tree, Treeβ;
    if Treeβ ≠ ∅ then
        call FP-growth(Treeβ, β); }
It should be specifically noted that the FP-Growth algorithm effectively compresses a large database into a high-density structure much smaller than the original, avoiding the cost of repeated scans. Based on mining the FP-Tree, the algorithm adopts a recursive pattern-growth strategy and creatively provides a mining method that requires no candidate item sets, so it is more efficient when mining long frequent item sets. During mining it adopts a divide-and-conquer strategy: the compressed database DB is divided into a group of conditional databases Dn, each associated with one frequent item, and each conditional database is mined separately; each Dn is far smaller than DB. This embodiment is not specifically limited thereto.
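Step 1 above (the two scans of the database and the ordered insertion of each transaction) can be sketched in Python; the class and function names are illustrative, not from the patent, and the node-link structure for same-named items is omitted for brevity:

```python
class FPNode:
    """One node of the FP-Tree: an item, its count, parent and children."""
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 1
        self.children = {}

def build_fptree(transactions, min_sup):
    # Scan 1: count item supports and keep only the frequent items.
    freq = {}
    for t in transactions:
        for item in t:
            freq[item] = freq.get(item, 0) + 1
    freq = {i: c for i, c in freq.items() if c >= min_sup}

    # Scan 2: insert each transaction with its frequent items sorted
    # in descending order of support (ties broken alphabetically).
    root = FPNode(None, None)
    for t in transactions:
        items = sorted((i for i in t if i in freq),
                       key=lambda i: (-freq[i], i))
        node = root
        for item in items:
            if item in node.children:
                node.children[item].count += 1
            else:
                node.children[item] = FPNode(item, node)
            node = node.children[item]
    return root, freq
```

Because shared prefixes of transactions collapse into shared paths, the tree is typically much smaller than the database, which is the compression property the paragraph above relies on.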
Although embodiments of the present invention have been shown and described, it will be appreciated by those skilled in the art that changes, modifications, substitutions and alterations can be made in these embodiments without departing from the principles and spirit of the invention, the scope of which is defined in the appended claims and their equivalents.
Claims (8)
1. A multi-source data unified processing system based on a big data engine, characterized in that: the system comprises a multi-source data timing acquisition module, a multi-source data rule setting module, a multi-source data classification storage module, a data preprocessing module, a data auditing module, a data central control module and a human-computer interaction module, wherein the multi-source data timing acquisition module performs a timed multi-source data acquisition and transmission process driven from the human-computer interaction module, covering acquisition and extraction of text keywords, picture or voice recognition data acquisition, and other acquisition and transmission modes.
2. The big data engine-based multi-source data unified processing system according to claim 1, wherein: the multi-source data rule setting module is used for setting and classifying data rules based on the multi-source data timing acquisition module and the multi-source data classification storage module, and the multi-source data classification storage module is used for storing classification data through the multi-source data rule setting module.
3. The big data engine-based multi-source data unified processing system according to claim 1, wherein: the data preprocessing module is used for integrating and cleaning each group of information through the preprocessing module based on the multi-source data classified storage module, and the data auditing module is used for auditing the classified data of the data preprocessing module.
4. The big data engine-based multi-source data unified processing system according to claim 1, wherein: the data central control module directly controls the multi-source data classified storage module, and the human-computer interaction module is used for manually inputting, reading, monitoring and modifying data based on the display terminal.
5. The big data engine-based multi-source data unified processing system according to claim 1, wherein: the multi-source data timing acquisition module solves the timing of data: a data supplier adds a file at a fixed interval, for example every 5 minutes or every 10 minutes, and the data acquisition program performs periodic scanning, so 5-minute data cannot be treated as 10-minute data; it also handles the inconsistency of data interfaces, since the interface platform provided by a data supplier is not only an FTP server but may also be an SQLSERVER database or a Cobar database middleware; and the files provided by different platforms are diverse, for example a database platform provides only database tables while FTP provides CSV, XML and TXT files for uniform parsing. The specific steps are as follows:
<bean autowire="no" class="org.springframework.scheduling.quartz.SchedulerFactoryBean">
    <property name="triggers">
        <list>
            <!-- msc -->
            <ref bean="Msc5mGatherTrigger"/>
            <ref bean="Mgw5mGatherTrigger"/>
            <ref bean="Hlr5mGatherTrigger"/>
        </list>
    </property>
</bean>
6. The big data engine-based multi-source data unified processing system according to claim 3, wherein: the data preprocessing module comprises data cleaning and data integration, and data integration merges multiple data sources into one data store; data that already reside in a single store need no integration. In data integration, two data frames are joined on a keyword: in R the merge function is used, with the statement merge(dataframe1, dataframe2, by = "keyword"), and the result is sorted in ascending order by default. The following problems can occur during data integration: same name, different meaning, where an attribute name in data source A is the same as an attribute name in data source B but they represent different entities and cannot be used as the keyword; different name, same meaning, where the names of attributes in the two data sources differ but they represent the same entity and can serve as the keyword; and data redundancy caused by integration, in which case correlation analysis and detection are first carried out on one of the duplicated attributes, and the duplicated attribute is deleted through the human-computer interaction module.
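The keyword-based join described in claim 6 uses R's merge function; an equivalent, illustrative sketch in Python with pandas follows (frame and column names are made up for the example):

```python
import pandas as pd

frame1 = pd.DataFrame({"keyword": ["k1", "k2"], "value_a": [10, 20]})
frame2 = pd.DataFrame({"keyword": ["k2", "k1"], "value_b": [200, 100]})

# Join the two frames on the shared key and sort ascending by the key,
# mirroring R's merge(dataframe1, dataframe2, by = "keyword").
merged = pd.merge(frame1, frame2, on="keyword").sort_values("keyword")
```

The same-name/different-meaning pitfall from the claim shows up here as a silent mis-join: if "keyword" denoted different entities in the two frames, the merge would still execute but pair unrelated rows.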
7. The big data engine-based multi-source data unified processing system according to claim 1, wherein: the multi-source data rule setting module sets its rules through the FP-Growth algorithm, which is described as follows:
Input: a transaction database D and a minimum support threshold min_sup;
Output: the complete set of frequent patterns;
Step 1: construct the FP-Tree as follows:
scan the transaction database D once, collecting the set F of frequent items and their supports, and sort F in descending order of support; the result is the frequent-item list L;
create the root node of the FP-Tree, labeled "null", and for each transaction Trans in D perform the following: select the frequent items in Trans and sort them in the order of L; let the sorted frequent-item list be [p|P], where p is the first element and P is the list of remaining elements; call insert_tree([p|P], T), which proceeds as follows: if T has a child N such that N.item-name = p.item-name, increment the count of N by 1; otherwise create a new node N, set its count to 1, link it to its parent node T, and link it via the node-link structure to the nodes with the same item-name; if P is non-empty, recursively call insert_tree(P, N);
Step 2: mine the frequent item sets from the FP-Tree; the pseudo code of the procedure FP-growth(Tree, α) is as follows:
if Tree contains a single path P then
    for each combination β of the nodes in path P
        generate pattern β ∪ α with support = the minimum support of the nodes in β;
else for each ai in the header table of Tree {
    generate pattern β = ai ∪ α with support = ai.support;
    construct β's conditional pattern base and then β's conditional FP-Tree, Treeβ;
    if Treeβ ≠ ∅ then
        call FP-growth(Treeβ, β); }
8. The big data engine-based multi-source data unified processing system according to claim 6, wherein: the data preprocessing module comprises data cleaning and data integration, and the data cleaning comprises the processing of missing values and abnormal values; handling a missing value involves identifying it and then processing it, and in R missing values are identified with the function is.na(). There are three processing methods: deletion, replacement and interpolation. Deletion method: depending on the deletion angle, either observation samples or variables are deleted; deleting observation samples removes rows containing missing values, which in R is done with na.omit(). Replacement method: the replacement rule depends on the variable type; when the variable with the missing value is numerical, the missing value is replaced by the mean of the other values of that variable; when the variable is non-numerical, it is replaced by the median or mode of the other observed values of that variable. Interpolation method: regression interpolation takes the interpolated variable as the dependent variable y, fits the other variables with a regression model, and interpolates the missing value using the lm regression function in R; multiple imputation generates several complete data sets from a data set containing missing values, repeating the process to produce random samples of the missing values, and is performed with the mice package in R.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210643243.1A CN115080565A (en) | 2022-06-08 | 2022-06-08 | Multi-source data unified processing system based on big data engine |
Publications (1)
Publication Number | Publication Date |
---|---|
CN115080565A true CN115080565A (en) | 2022-09-20 |
Family
ID=83251865
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210643243.1A Pending CN115080565A (en) | 2022-06-08 | 2022-06-08 | Multi-source data unified processing system based on big data engine |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115080565A (en) |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115952325A (en) * | 2023-03-09 | 2023-04-11 | 广东创能科技股份有限公司 | Data aggregation method and device based on big data platform |
CN116108476A (en) * | 2022-11-03 | 2023-05-12 | 广东加一信息技术有限公司 | Information security management and monitoring system based on big data |
CN116340975A (en) * | 2023-03-16 | 2023-06-27 | 江苏骏安信息测评认证有限公司 | Cache data safety protection system based on cloud computing |
CN117573655A (en) * | 2024-01-15 | 2024-02-20 | 中国标准化研究院 | Data management optimization method and system based on convolutional neural network |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106202430A (en) * | 2016-07-13 | 2016-12-07 | 武汉斗鱼网络科技有限公司 | Live platform user interest-degree digging system based on correlation rule and method for digging |
CN111708773A (en) * | 2020-08-13 | 2020-09-25 | 江苏宝和数据股份有限公司 | Multi-source scientific and creative resource data fusion method |
CN113468163A (en) * | 2021-09-01 | 2021-10-01 | 南京烽火星空通信发展有限公司 | Multisource heterogeneous public security big data intelligent docking engine system |
Non-Patent Citations (5)
Title |
---|
KEALOO: "Summary of massive data processing methods", 《HTTPS://BLOG.CSDN.NET/QQ_44797267/ARTICLE/DETAILS/120228705》 * |
WU Qixin et al.: "Research on key technologies of timed data acquisition based on the Spring framework" * |
WU Qixin et al.: "Research on key technologies of timed data acquisition based on the Spring framework", 《Computer Knowledge and Technology》 * |
阿里云云栖号: "Dataphin functions: integration — how to extract and aggregate business-system data into the data mid-platform", 《HTTPS://WWW.SOHU.COM/A/483125239_612370》 * |
鱼鱼鱼小昶: "Data mining algorithms revealed — association rule methods", 《HTTPS://BLOG.CSDN.NET/QQ_39391192/ARTICLE/DETAILS/81703706》 * |
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116108476A (en) * | 2022-11-03 | 2023-05-12 | 广东加一信息技术有限公司 | Information security management and monitoring system based on big data |
CN116108476B (en) * | 2022-11-03 | 2023-08-25 | 深圳市和合信诺大数据科技有限公司 | Information security management and monitoring system based on big data |
CN115952325A (en) * | 2023-03-09 | 2023-04-11 | 广东创能科技股份有限公司 | Data aggregation method and device based on big data platform |
CN115952325B (en) * | 2023-03-09 | 2023-05-16 | 广东创能科技股份有限公司 | Data collection method and device based on big data platform |
CN116340975A (en) * | 2023-03-16 | 2023-06-27 | 江苏骏安信息测评认证有限公司 | Cache data safety protection system based on cloud computing |
CN117573655A (en) * | 2024-01-15 | 2024-02-20 | 中国标准化研究院 | Data management optimization method and system based on convolutional neural network |
CN117573655B (en) * | 2024-01-15 | 2024-03-12 | 中国标准化研究院 | Data management optimization method and system based on convolutional neural network |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN115080565A (en) | Multi-source data unified processing system based on big data engine | |
US11582123B2 (en) | Distribution of data packets with non-linear delay | |
US11182098B2 (en) | Optimization for real-time, parallel execution of models for extracting high-value information from data streams | |
CN111339071B (en) | Method and device for processing multi-source heterogeneous data | |
US20210279265A1 (en) | Optimization for Real-Time, Parallel Execution of Models for Extracting High-Value Information from Data Streams | |
Ediger et al. | Tracking structure of streaming social networks | |
Ko et al. | Incremental lossless graph summarization | |
CN107660283A (en) | For realizing the method and system of daily record resolver in Log Analysis System | |
Tavares et al. | Overlapping analytic stages in online process mining | |
Qu et al. | Efficient mining of frequent itemsets using only one dynamic prefix tree | |
Ahsaan et al. | Big data analytics: challenges and technologies | |
Prakash et al. | Big data preprocessing for modern world: opportunities and challenges | |
CN117251414A (en) | Data storage and processing method based on heterogeneous technology | |
CN112906373A (en) | Alarm calculation method and device, electronic equipment and storage medium | |
Wadhera et al. | A systematic Review of Big data tools and application for developments | |
CN113641705B (en) | Marketing disposal rule engine method based on calculation engine | |
EP3380906A1 (en) | Optimization for real-time, parallel execution of models for extracting high-value information from data streams | |
CN109117426A (en) | Distributed networks database query method, apparatus, equipment and storage medium | |
CN111914146A (en) | Business software platform convenient for big data interaction and automatic extraction | |
Bodra | Processing queries over partitioned graph databases: An approach and it’s evaluation | |
US20230060475A1 (en) | Operation data analysis device, operation data analysis system, and operation data analysis method | |
CN115048468A (en) | A rural area integrated service platform for rural area happy | |
Kompalli | Knowledge Discovery Using Data Stream Mining: An Analytical Approach | |
Kompalli | Mining Data Streams |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | ||
RJ01 | Rejection of invention patent application after publication |
Application publication date: 20220920 |