CN115080565A - Multi-source data unified processing system based on big data engine - Google Patents


Info

Publication number
CN115080565A
CN115080565A (application CN202210643243.1A)
Authority
CN
China
Prior art keywords
data
module
source data
source
tree
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210643243.1A
Other languages
Chinese (zh)
Inventor
任玉荣
方月月
陈晓娟
刘会锋
薛飞龙
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shaanxi Tiancheng Software Co ltd
Original Assignee
Shaanxi Tiancheng Software Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shaanxi Tiancheng Software Co ltd filed Critical Shaanxi Tiancheng Software Co ltd
Priority to CN202210643243.1A
Publication of CN115080565A
Legal status: Pending

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22Indexing; Data structures therefor; Storage structures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21Design, administration or maintenance of databases
    • G06F16/215Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors

Abstract

The invention relates to the technical field of digital information transmission and discloses a multi-source data unified processing system based on a big data engine. The system comprises a multi-source data timing acquisition module, a multi-source data rule setting module, a multi-source data classified storage module, a data preprocessing module, a data auditing module, a data central control module and a human-computer interaction module. It realizes rule configuration, timed operation and access for data from different sources; achieves rapid acquisition and cleaning of the data; loads the data into a big data platform for layered, partitioned storage; and provides a flexible external service mode. It solves the problem that a single or isolated data channel cannot meet the increasingly complex requirements of the internet era: data from multiple channels can be effectively gathered together, unified access, distribution and processing of information data are realized, user experience is improved, and the efficiency of data processing and transmission is increased.

Description

Multi-source data unified processing system based on big data engine
Technical Field
The invention relates to the technical field of digital information transmission, in particular to a multi-source data unified processing system based on a big data engine.
Background
The main functions of a data processing system are to process and sort input data information, calculate various analysis indexes, convert the input into an information form that is easy for people to accept, store the processed information in an orderly manner, and transmit it to information users through external equipment at any time.
In a traditional data processing system, because the data channel is single or isolated, the increasingly complex requirements of the internet era cannot be met: data from multiple channels cannot be effectively gathered together; unified access, distribution and processing of information data cannot be achieved; information cannot be identified; user experience suffers; and the efficiency of data processing and transmission is reduced. A data fusion and data collection center system scheme is therefore urgently needed, together with an explanation of the data fusion technology and method.
Disclosure of Invention
Technical problem solved
Aiming at the defects of the prior art, the invention provides a multi-source data unified processing system based on a big data engine. The system realizes rule configuration, timed operation and access for data from different sources; achieves rapid acquisition and cleaning of the data; loads the data into a big data platform for layered, partitioned storage; and provides a flexible external service mode. It solves the problem that a single or isolated data channel cannot meet the increasingly complex requirements of the internet era, effectively gathers data from multiple channels together, realizes unified access, distribution and processing of information data, improves user experience, and increases the efficiency of data processing and transmission.
Technical scheme
In order to achieve the above purpose, the invention provides the following technical scheme: a multi-source data unified processing system based on a big data engine comprises a multi-source data timing acquisition module, a multi-source data rule setting module, a multi-source data classified storage module, a data preprocessing module, a data auditing module, a data central control module and a human-computer interaction module. The multi-source data timing acquisition module performs timed acquisition and transmission of multi-source data through the human-computer interaction module, specifically covering acquisition and transmission of data by means such as text keyword collection and extraction, and picture or voice recognition.
Preferably, the multi-source data rule setting module sets and classifies data rules based on the multi-source data timing acquisition module and the multi-source data classified storage module, and the multi-source data classified storage module stores classified data through the multi-source data rule setting module.
Preferably, the data preprocessing module performs data integration and data cleaning on each group of information based on the multi-source data classified storage module, and the data auditing module audits the classified data of the data preprocessing module.
Preferably, the data central control module directly controls the multi-source data classified storage module, and the human-computer interaction module is used for manually inputting, reading, monitoring and modifying data through the display terminal.
Preferably, the multi-source data timing acquisition module is used for solving the timing of data, and a data supplier may add a file every other time, namely 5 minutes and 10 minutes, so that a data acquisition program is required to perform regular scanning to ensure the real-time performance of data, and the data of 5 minutes cannot be processed as 10 minutes; secondly, the inconsistency of data interfaces exists, and an interface platform provided by a data supplier cannot be only an FTP server and may comprise an SQLSERVER database, an FTP server and a Cobar; and there is also diversity in the files provided by different platforms, such as database platform only provides database table, FTP provides CSV, XML, TXT type files, which all need to be analyzed uniformly, the specific steps are as follows:
[Figure BDA0003683065920000021 and Figure BDA0003683065920000031: code/configuration listing embedded as images in the original]
Preferably, the data preprocessing module includes data cleaning and data integration. Data integration merges multiple data sources into one data store; if the analyzed data already reside in one data store, no integration is needed. Integration of two data frames on a key is realized with the merge function in R; the statement is merge(dataframe1, dataframe2, by = "key"), with ascending order by default. The following problems may occur during data integration: homonymy, where an attribute in data source A has the same name as an attribute in data source B but the represented entities differ, so the attribute cannot be used as a key; synonymy, where attributes of the two data sources have different names but represent the same entity, and can be used as a key after alignment. Data integration often causes data redundancy: the same attribute may appear multiple times, or inconsistent attribute names may cause duplication. Duplicated attributes are first subjected to correlation analysis and detection, and if a duplicate exists, it is deleted through the human-computer interaction module.
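The text describes this key-based integration with R's merge function; as an illustrative sketch only (data sources, column names and the 0.99 threshold are hypothetical, and pandas stands in for R), the same merge plus a correlation check for redundant attributes can be written as:

```python
import pandas as pd

# Two hypothetical data sources sharing the key column "id";
# "temp_dup" duplicates "temp" under a different name (the synonymy case)
source_a = pd.DataFrame({"id": [1, 2, 3, 4],
                         "temp": [20.1, 21.5, 19.8, 20.7]})
source_b = pd.DataFrame({"id": [2, 3, 4, 5],
                         "temp_dup": [21.5, 19.8, 20.7, 22.0],
                         "humidity": [55, 60, 52, 61]})

# Equivalent of R's merge(dataframe1, dataframe2, by = "key"):
# an inner join on the key, sorted ascending on the key by default
merged = source_a.merge(source_b, on="id", how="inner", sort=True)

# Redundancy detection by correlation analysis: attribute pairs that are
# (near-)perfectly correlated are candidates for deletion after review
corr = merged.select_dtypes("number").corr().abs()
redundant = [
    (a, b)
    for i, a in enumerate(corr.columns)
    for b in corr.columns[i + 1:]
    if corr.loc[a, b] > 0.99
]
```

Here `redundant` flags the duplicated attribute pair so a reviewer can drop one column, mirroring the manual deletion step done through the human-computer interaction module.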
Preferably, the multi-source data rule setting module sets rules with the FP-Growth algorithm, which is described as follows:
Input: a transaction database D; a minimum support threshold min_sup;
Output: the complete set of frequent patterns;
Step 1: construct the FP-Tree as follows:
1. Scan the transaction database D once, collect the set F of frequent items and their support counts, and sort F in descending order of support; the result is the frequent-item list L.
2. Create the root node of the FP-Tree, labeled "null". For each transaction Trans in D: select the frequent items in Trans and sort them in the order of L. Let the sorted frequent-item list be [p|P], where p is the first element and P is the list of remaining elements. Call insert_tree([p|P], T), which proceeds as follows: if T has a child N such that N.item-name = p.item-name, increment the count of N by 1; otherwise create a new node N with count 1, link it to its parent T, and link it via the node chain to nodes with the same item-name. If P is not empty, recursively call insert_tree(P, N).
Step 2: mine the frequent item sets from the FP-Tree. The pseudo code of the procedure FP_growth(Tree, α) is as follows:
if Tree contains a single path P then
    for each combination β of the nodes in path P
        generate pattern β ∪ α with support = minimum support of the nodes in β;
else for each ai in the header table of Tree {
    generate pattern β = ai ∪ α with support = ai.support;
    construct β's conditional pattern base and then β's conditional FP-Tree Treeβ;
    if Treeβ ≠ ∅ then
        call FP_growth(Treeβ, β); }
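The two steps above can be sketched end to end in Python. This is an illustrative minimal implementation under the stated pseudocode (class, function and variable names are my own, not the patent's code): `build_tree` performs Step 1, and `fp_growth` performs the Step 2 recursion over conditional pattern bases.

```python
from collections import defaultdict

class Node:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}

def build_tree(transactions, min_sup):
    # Step 1.1: collect frequent items, sort by descending support (list L)
    support = defaultdict(int)
    for t in transactions:
        for item in set(t):
            support[item] += 1
    freq = {i: s for i, s in support.items() if s >= min_sup}
    order = sorted(freq, key=lambda i: (-freq[i], i))
    rank = {item: k for k, item in enumerate(order)}
    # Step 1.2: create the "null" root; insert each sorted transaction
    root = Node(None, None)
    header = defaultdict(list)  # item -> node chain (nodes with that item)
    for t in transactions:
        node = root
        for item in sorted((i for i in set(t) if i in rank), key=rank.get):
            if item not in node.children:
                node.children[item] = Node(item, node)
                header[item].append(node.children[item])
            node = node.children[item]
            node.count += 1
    return header, order

def fp_growth(transactions, min_sup, suffix=frozenset()):
    """Return {frequent itemset: support} mined from `transactions`."""
    header, order = build_tree(transactions, min_sup)
    patterns = {}
    for item in reversed(order):  # grow patterns from least-frequent items
        beta = suffix | {item}
        patterns[beta] = sum(n.count for n in header[item])
        # Conditional pattern base: the prefix path of every node of `item`,
        # repeated count times; then mine its conditional FP-Tree recursively
        cond_base = []
        for n in header[item]:
            path, p = [], n.parent
            while p.item is not None:
                path.append(p.item)
                p = p.parent
            cond_base.extend([path] * n.count)
        patterns.update(fp_growth(cond_base, min_sup, beta))
    return patterns

patterns = fp_growth([["a", "b"], ["a", "c"], ["a", "b", "c"], ["b", "c"]], 2)
```

On the small example database, every single item has support 3 and every pair has support 2, while the triple {a, b, c} occurs only once and is correctly pruned.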
Preferably, the data preprocessing module includes data cleaning and data integration. Data cleaning covers the processing of missing values and abnormal values. Handling missing values includes their identification and their processing; in R, missing values are identified with the function is.na(). Processing uses deletion, replacement or interpolation. Deletion method: depending on the deletion angle, either observation samples or variables are deleted; deleting observation samples (row deletion) removes rows containing missing values, using na.omit() in R. Replacement method: the replacement rule depends on the variable type; when the variable with the missing value is numerical, the missing value is replaced by the mean of the other values of that variable; when the variable is non-numerical, it is replaced by the median or mode of the other observed values of that variable. Interpolation method: regression interpolation takes the interpolated variable as the dependent variable y, fits the other variables with a regression model, and interpolates the missing value using the lm regression function in R; multiple imputation generates several complete data sets from a data set containing missing values, repeating the process to produce random samples of the missing values, and is performed with the mice package in R.
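As an illustrative sketch of the identification, deletion and replacement strategies just described — using pandas in place of the R functions named in the text, with hypothetical column names:

```python
import pandas as pd

df = pd.DataFrame({
    "temp": [20.0, None, 22.0, 21.0],    # numerical variable with a gap
    "label": ["ok", "ok", None, "bad"],  # non-numerical variable with a gap
})

# Identification of missing values (R: is.na)
mask = df.isna()

# Deletion of observation samples / row deletion (R: na.omit)
dropped = df.dropna()

# Replacement: mean of the other values for a numerical variable,
# mode of the other observed values for a non-numerical variable
filled = df.copy()
filled["temp"] = filled["temp"].fillna(filled["temp"].mean())
filled["label"] = filled["label"].fillna(filled["label"].mode()[0])
```

Regression interpolation and multiple imputation (R: lm, mice) follow the same pattern but fit a model to the non-missing rows first.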
Advantageous effects
The invention provides a multi-source data unified processing system based on a big data engine, which has the following beneficial effects:
the multisource data unified processing system based on the big data engine realizes rule configuration, timing operation and access of data from different sources, realizes rapid acquisition and cleaning of the data, is loaded into a big data platform to realize hierarchical and warehouse-by-warehouse storage of the data, provides a flexible external service mode, meets the requirement of increasingly complex Internet era due to single data channel or isolated channel, can effectively collect data from multiple channels together, realizes unified access, distribution and processing of information data, improves user experience, and improves the transmission efficiency of data processing and transmission.
Drawings
FIG. 1 is a flow chart of the multi-source data unified processing system based on the big data engine according to the present invention.
Detailed Description
Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
Meanwhile, it should be understood that the sizes of the respective portions shown in the drawings are not drawn in an actual proportional relationship for the convenience of description.
The following description of at least one exemplary embodiment is merely illustrative in nature and is in no way intended to limit the application, its application, or uses.
Techniques, methods, and apparatus known to those of ordinary skill in the relevant art may not be discussed in detail but are intended to be part of the specification where appropriate.
It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, further discussion thereof is not required in subsequent figures.
Embodiments of the application are applicable to computer systems/servers that are operational with numerous other general purpose or special purpose computing system environments or configurations. Examples of well known computing systems, environments, and/or configurations that may be suitable for use with the computer system/server include, but are not limited to: personal computer systems, server computer systems, thin clients, thick clients, hand-held or laptop devices, microprocessor-based systems, set-top boxes, programmable consumer electronics, network personal computers, small computer systems, mainframe computer systems, and distributed cloud computing environments that include any of the above, and the like.
The computer system/server may be described in the general context of computer system-executable instructions, such as program modules, being executed by a computer system. Generally, program modules may include routines, programs, objects, components, logic, data structures, etc. that perform particular tasks or implement particular abstract data types. The computer system/server may be practiced in distributed cloud computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed cloud computing environment, program modules may be located in both local and remote computer system storage media including memory storage devices.
Example 1
The invention provides a technical scheme: a multi-source data unified processing system based on a big data engine comprises a multi-source data timing acquisition module, a multi-source data rule setting module, a multi-source data classified storage module, a data preprocessing module, a data auditing module, a data central control module and a human-computer interaction module. The multi-source data timing acquisition module performs timed acquisition and transmission of multi-source data through the human-computer interaction module, specifically covering acquisition and transmission of data by means such as text keyword collection and extraction, and picture or voice recognition. The multi-source data rule setting module sets and classifies data rules based on the multi-source data timing acquisition module and the multi-source data classified storage module; the multi-source data classified storage module stores classified data through the multi-source data rule setting module; the data preprocessing module performs data integration and data cleaning on each group of information based on the multi-source data classified storage module; the data auditing module audits the classified data of the data preprocessing module; the data central control module directly controls the multi-source data classified storage module; and the human-computer interaction module is used for manually inputting, reading, monitoring and modifying data through the display terminal.
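The module relationships described above can be sketched as a minimal pipeline skeleton. All class and method names here are hypothetical illustrations of the described wiring (acquisition → rule-based classification → classified storage → preprocessing), not the patent's code:

```python
from dataclasses import dataclass

@dataclass
class Record:
    source: str
    payload: dict

class TimedAcquisition:
    """Multi-source data timing acquisition module."""
    def collect(self, sources):
        return [Record(s, {"raw": f"data-from-{s}"}) for s in sources]

class RuleSetting:
    """Multi-source data rule setting module: classify by source type."""
    def classify(self, record):
        return "db" if record.source.startswith("sql") else "file"

class ClassifiedStorage:
    """Multi-source data classified storage module."""
    def __init__(self):
        self.buckets = {}
    def store(self, category, record):
        self.buckets.setdefault(category, []).append(record)

class Preprocessing:
    """Data preprocessing module: keep only records with usable payloads."""
    def clean(self, records):
        return [r for r in records if r.payload]

# Wiring: acquisition -> rules -> classified storage -> preprocessing
storage = ClassifiedStorage()
rules = RuleSetting()
for rec in TimedAcquisition().collect(["sqlserver", "ftp"]):
    storage.store(rules.classify(rec), rec)
cleaned = {k: Preprocessing().clean(v) for k, v in storage.buckets.items()}
```

The auditing, central control and human-computer interaction modules would sit around this pipeline, inspecting `cleaned` and `storage.buckets`.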
A management method of a multi-source data unified processing system based on a big data engine specifically comprises the following steps:
101. Data are collected at regular times by the multi-source data timing acquisition module in cooperation with the human-computer interaction module, and data rules are set by the multi-source data rule setting module;
In this embodiment, the multi-source data timing acquisition module is described in detail. The problem to be solved is the timing of data: a data supplier adds a file at fixed intervals, e.g. every 5 minutes or every 10 minutes, so the data acquisition program performs periodic scanning, and 5-minute data cannot be treated as 10-minute data. Second, data interfaces are inconsistent: the interface platform provided by a data supplier is not just an FTP server, and may also include an SQL Server database and Cobar. Finally, the files provided by different platforms are diverse; for example, a database platform provides only database tables, while FTP provides CSV, XML and TXT files, which must be parsed uniformly. The specific steps are as follows:
[Figure BDA0003683065920000071 and Figure BDA0003683065920000081: code/configuration listing embedded as images in the original]
The file dc.xml mainly configures how many triggers the acquisition module needs to start; each trigger represents one thread-safe data branch. In addition, another file, xfdc.xml, must be configured to describe the running time and target object of each data branch. The specific steps are as follows:
[Figure BDA0003683065920000082: trigger configuration embedded as an image in the original]
It should be specifically noted that the trigger specifies that the acquisition object is triggered at second 20 of every 5th minute of each hour, i.e. the thread of the data branch is started; this embodiment is not specifically limited thereto.
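The trigger configuration itself is embedded as an image in the original; as a hedged illustration of the schedule described in the text (fire at second 20 of every 5th minute of each hour), the firing condition a periodic scanner would check can be written as:

```python
import datetime

def should_fire(now: datetime.datetime) -> bool:
    """True at second 20 of every 5th minute of each hour,
    mirroring the trigger schedule described in the text."""
    return now.minute % 5 == 0 and now.second == 20

# A periodic scanner would poll roughly once per second and start the
# data-branch thread whenever should_fire(datetime.datetime.now()) is True.
```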
102. The collected data are transmitted to a multi-source data classification storage module for data classification and storage, and each group of information is subjected to data cleaning and data integration through a data preprocessing module;
In this embodiment it should be specifically noted that data integration merges multiple data sources into one data store; if the analyzed data already reside in one data store, no integration is needed. Integration of two data frames on a key is realized with the merge function in R; the statement is merge(dataframe1, dataframe2, by = "key"), with ascending order by default. The following problems may occur during data integration: homonymy, where an attribute in data source A has the same name as an attribute in data source B but the represented entities differ, so the attribute cannot be used as a key; synonymy, where attributes of the two data sources have different names but represent the same entity, and can serve as a key after alignment. When data integration causes data redundancy, the duplicated attribute is first subjected to correlation analysis and detection and then deleted through the human-computer interaction module; this embodiment is not specifically limited thereto.
It should be specifically described that data integration can perform data cleansing conversion, provide field calculation, merging, distribution, filtering, field desensitization components or functions, and support fault-tolerant configuration, concurrent configuration, and speed-limiting configuration, which is not specifically limited in this embodiment.
103. Data cleaning and data integration are monitored and audited by the data auditing module, and the data are then processed by the data central control module;
In this embodiment, the approach of the data central control module is specifically explained as follows: to save time, a clever algorithm can be adopted together with a suitable data structure, such as a Bloom filter, hash map, bit-map, heap/database, inverted index or trie tree; for data too large to handle at once, the large problem is reduced to small ones and solved by divide and conquer, defeating the pieces one by one. This embodiment is not specifically limited thereto.
What needs to be specifically explained is divide and conquer: the multi-source data are divided and conquered by means of hash mapping, hash-map statistics, and quick/merge/heap sorting. That is, when the multi-source data cannot be read into memory at once but must be counted, sorted and so on, the basic idea is: compute the hash value of each piece of data with a hash algorithm and distribute the multi-source data into a number of buckets according to the hash values. By the determinism of the hash function, identical data necessarily land in the same bucket, so the user can process the small files one by one and finally perform a merge operation. This embodiment is not specifically limited thereto.
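As an illustrative sketch of this hash-bucketing idea (file handling simplified to in-memory lists; function and variable names are my own), counting a stream too large for one pass can be split by hash value and merged at the end:

```python
from collections import Counter

def bucket_count(items, n_buckets=4):
    """Distribute items into buckets by hash, count each bucket
    separately, then merge. Because a hash function is deterministic,
    identical items always land in the same bucket, so per-bucket
    counts are exact and the final merge is a simple sum."""
    buckets = [[] for _ in range(n_buckets)]
    for item in items:
        buckets[hash(item) % n_buckets].append(item)
    total = Counter()
    for b in buckets:  # each small bucket fits in memory on its own
        total.update(Counter(b))
    return total

counts = bucket_count(["a", "b", "a", "c", "b", "a"])
```

In the real system each bucket would be a small file on disk processed in sequence rather than an in-memory list.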
104. Inputting, reading, monitoring and modifying data through a man-machine interaction module based on a multi-source data timing acquisition module and a multi-source data rule setting module;
In this embodiment, it needs to be specifically stated that the multi-source data rule setting module sets rules through the FP-Growth algorithm, which is described as follows:
Input: a transaction database D; a minimum support threshold min_sup;
Output: the complete set of frequent patterns;
Step 1: construct the FP-Tree as follows:
1. Scan the transaction database D once, collect the set F of frequent items and their support counts, and sort F in descending order of support; the result is the frequent-item list L.
2. Create the root node of the FP-Tree, labeled "null". For each transaction Trans in D: select the frequent items in Trans and sort them in the order of L. Let the sorted frequent-item list be [p|P], where p is the first element and P is the list of remaining elements. Call insert_tree([p|P], T), which proceeds as follows: if T has a child N such that N.item-name = p.item-name, increment the count of N by 1; otherwise create a new node N with count 1, link it to its parent T, and link it via the node chain to nodes with the same item-name. If P is not empty, recursively call insert_tree(P, N).
Step 2: mine the frequent item sets from the FP-Tree. The pseudo code of the procedure FP_growth(Tree, α) is as follows:
if Tree contains a single path P then
    for each combination β of the nodes in path P
        generate pattern β ∪ α with support = minimum support of the nodes in β;
else for each ai in the header table of Tree {
    generate pattern β = ai ∪ α with support = ai.support;
    construct β's conditional pattern base and then β's conditional FP-Tree Treeβ;
    if Treeβ ≠ ∅ then
        call FP_growth(Treeβ, β); }
Specifically, the FP-Growth algorithm can effectively compress a large database into a high-density structure much smaller than the original database, avoiding the overhead of repeated scanning. Based on mining the FP-tree, the algorithm adopts a pattern-growth recursive strategy and creatively provides a mining method that requires no candidate item sets, which is more efficient when mining long frequent item sets. During mining, a divide-and-conquer strategy is adopted: the compressed database DB is divided into a group of conditional databases Dn, each associated with one frequent item and each mined separately; each conditional database Dn is far smaller than the database DB. This embodiment is not specifically limited thereto.
Example 2
The invention provides a technical scheme: a multi-source data unified processing system based on a big data engine comprises a multi-source data timing acquisition module, a multi-source data rule setting module, a multi-source data classified storage module, a data preprocessing module, a data auditing module, a data central control module and a human-computer interaction module. The multi-source data timing acquisition module performs timed acquisition and transmission of multi-source data through the human-computer interaction module, specifically covering acquisition and transmission of data by means such as text keyword collection and extraction, and picture or voice recognition. The multi-source data rule setting module sets and classifies data rules based on the multi-source data timing acquisition module and the multi-source data classified storage module; the multi-source data classified storage module stores classified data through the multi-source data rule setting module; the data preprocessing module performs data integration and data cleaning on each group of information based on the multi-source data classified storage module; the data auditing module audits the classified data of the data preprocessing module; the data central control module directly controls the multi-source data classified storage module; and the human-computer interaction module is used for manually inputting, reading, monitoring and modifying data through the display terminal.
A management method of a multi-source data unified processing system based on a big data engine specifically comprises the following steps:
101. Data are collected at regular times by the multi-source data timing acquisition module in cooperation with the human-computer interaction module, and data rules are set by the multi-source data rule setting module;
In this embodiment, timed data collection is specifically described. The problem to be solved is the timing of data: a data supplier adds a file at fixed intervals, e.g. every 5 minutes or every 10 minutes, so the data acquisition program performs periodic scanning, and 5-minute data cannot be treated as 10-minute data. Second, data interfaces are inconsistent: the interface platform provided by a data supplier is not just an FTP server, and may also include an SQL Server database and Cobar. Finally, the files provided by different platforms are diverse; for example, a database platform provides only database tables, while FTP provides CSV, XML and TXT files, which must be parsed uniformly. The specific steps are as follows:
[Figure BDA0003683065920000121: code/configuration listing embedded as an image in the original]
The file dc.xml mainly configures how many triggers the acquisition module needs to start; each trigger represents one thread-safe data branch. In addition, another file, xfdc.xml, must be configured to describe the running time and target object of each data branch. The specific steps are as follows:
[Figure BDA0003683065920000131: trigger configuration embedded as an image in the original]
It should be specifically noted that the trigger specifies that the acquisition object is triggered at second 20 of every 5th minute of each hour, i.e. the thread of the data branch is started; this embodiment is not specifically limited thereto.
102. The collected data are transmitted to a multi-source data classification storage module for data classification and storage, and each group of information is subjected to data cleaning and data integration through a data preprocessing module;
In this embodiment it is specifically noted that data cleaning includes the processing of missing values and abnormal values. Handling missing values includes their identification and their processing; in R, missing values are identified with the function is.na(). Processing uses deletion, replacement or interpolation. Deletion method: depending on the deletion angle, either observation samples or variables are deleted; deleting observation samples (row deletion) removes rows containing missing values, using na.omit() in R. Replacement method: the replacement rule depends on the variable type; when the variable with the missing value is numerical, the missing value is replaced by the mean of the other values of that variable; when the variable is non-numerical, it is replaced by the median or mode of the other observed values of that variable. Interpolation method: regression interpolation takes the interpolated variable as the dependent variable y, fits the other variables with a regression model, and interpolates the missing value using the lm regression function in R; multiple imputation generates several complete data sets from a data set containing missing values, repeating the process to produce random samples of the missing values, and is performed with the mice package in R. This embodiment is not specifically limited thereto.
It should be specifically described that data integration can perform data cleansing conversion, provide field calculation, merging, distribution, filtering, field desensitization components or functions, and support fault-tolerant configuration, concurrent configuration, and speed-limiting configuration, which is not specifically limited in this embodiment.
103. Data cleaning and data integration are monitored and audited by the data auditing module, and data processing is then carried out by the data central control module;
in this embodiment, it is specifically explained that the approach of the data central control module is as follows: against time constraints, a clever algorithm is matched with a suitable data structure, such as a Bloom filter, hash map, bit-map, heap, database or inverted index, and trie tree; against space constraints, there is essentially one method, namely turning the large into the small and dividing and conquering: reduce the scale and defeat each part in turn, which is not specifically limited in this embodiment.
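As an illustration of the bit-map structure named above, a minimal sketch (assuming the keys are small non-negative integers, as in the classic phone-number deduplication example; the class is illustrative, not part of the embodiment):

```python
# One bit per possible key: membership tests, deduplication and sorted
# output cost a single bit of memory per value in the key space.
class BitMap:
    def __init__(self, size):
        self.bits = bytearray((size + 7) // 8)

    def set(self, n):
        self.bits[n >> 3] |= 1 << (n & 7)

    def test(self, n):
        return bool(self.bits[n >> 3] & (1 << (n & 7)))

bm = BitMap(100)
for n in [3, 97, 3, 42]:
    bm.set(n)
# Scanning the key space yields the values deduplicated and sorted.
present = [n for n in range(100) if bm.test(n)]
```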
What needs to be specifically explained is divide and conquer: the multi-source data is divided and conquered through hash mapping, hash-map statistics, and quick, merge or heap sorting; that is, when the multi-source data cannot be read into memory at one time but operations such as counting and sorting must be carried out on it, the basic idea is as follows: the hash value of each piece of data is computed with a hash algorithm, and the multi-source data are distributed into a number of buckets (small files) according to their hash values; because a hash function is deterministic, identical data necessarily land in the same bucket, so the user processes the small files one by one and finally performs a merge operation, which is not specifically limited in this embodiment.
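The hash-bucketing idea above can be sketched as follows (a minimal in-memory illustration; a real deployment would write each bucket to a small file on disk rather than keep it in a list):

```python
from collections import Counter

NUM_BUCKETS = 4

def bucket_of(item):
    # Deterministic hashing: identical items always land in the same
    # bucket, so per-bucket counts can be merged without double counting.
    return hash(item) % NUM_BUCKETS

# A stream too large for memory would be read incrementally; a short
# list stands in for it here.
stream = ["a", "b", "a", "c", "b", "a"]
buckets = [[] for _ in range(NUM_BUCKETS)]
for item in stream:
    buckets[bucket_of(item)].append(item)

# Process each "small file" in turn, then merge the partial counts.
total = Counter()
for b in buckets:
    total.update(Counter(b))
top = total.most_common(1)[0]
```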
104. Data are input, read, monitored and modified through the human-computer interaction module, based on the multi-source data timing acquisition module and the multi-source data rule setting module;
in this embodiment, it needs to be specifically stated that the multi-source data rule setting module sets rules through the FP-Growth algorithm, which is described as follows:
Input: a transaction database D; a minimum support threshold min_sup;
Output: the complete set of frequent patterns;
Step 1: construct the FP-Tree, as follows:
Scan the transaction database D once, collect the set F of frequent items and their supports, and sort F in descending order of support; the result is the frequent item list L.
Create the root node of the FP-Tree and label it "null". For each transaction Trans in D, do the following: select the frequent items in Trans and sort them in the order of L. Let the sorted frequent item list be [p|P], where p is the first element and P is the list of remaining elements. Call insert_tree([p|P], T), which proceeds as follows: if T has a child N such that N.item-name = p.item-name, increase the count of N by 1; otherwise create a new node N with its count set to 1, link it to its parent node T, and link it to nodes with the same item-name through the node-link chain; if P is non-empty, recursively call insert_tree(P, N).
The second step is that: and (3) mining a frequent item set according to the FP-Tree, wherein the pseudo code is realized in the process as follows:
the if Tree comprises a single path Pthen;
each combination of nodes in the for path P (denoted as β);
generating a mode beta Ua, wherein the support of the mode beta Ua is the minimum support of nodes in the mode beta;
else for reach ai at the Tree head {;
generating a pattern β ═ aiUa, with a support ═ ai.support;
constructing a condition mode base of beta, and then constructing a condition FP-Tree beta of the beta;
If Treeβ/0then;
call FP-growth (TreeB, B }).
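The FP-Tree construction of step 1 can be sketched in Python as follows (an illustrative rendering of the two scans and insert_tree; the node-link header table is omitted for brevity, and the class and variable names are illustrative, not from the patent):

```python
from collections import Counter

class Node:
    """One FP-Tree node: an item, its count, a parent link and children."""
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 1
        self.children = {}

def build_fp_tree(transactions, min_sup):
    # First scan: support of every item; keep those with support >= min_sup.
    support = Counter(i for t in transactions for i in t)
    frequent = {i for i, c in support.items() if c >= min_sup}

    def ordered(t):
        # The frequent item list L: descending support, ties broken by name.
        return sorted((i for i in t if i in frequent),
                      key=lambda i: (-support[i], i))

    root = Node(None, None)  # the root labelled "null"
    for t in transactions:   # second scan: insert_tree([p|P], T)
        node = root
        for item in ordered(t):
            if item in node.children:
                node.children[item].count += 1
            else:
                node.children[item] = Node(item, node)
            node = node.children[item]
    return root, support

transactions = [["a", "b"], ["b", "c", "d"], ["a", "b", "c"], ["a", "b", "d"]]
root, support = build_fp_tree(transactions, min_sup=2)
```

On this toy database the root has a single child "b" with count 4, under which "a" carries count 3, reflecting the shared-prefix compression that step 2 then mines.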
Specifically, the FP-Growth algorithm effectively compresses a large database into a high-density structure much smaller than the original, thereby avoiding the overhead of repeated scanning; based on mining the FP-Tree, the algorithm adopts a pattern-growth recursion strategy, creatively providing a mining method that requires no candidate itemsets and achieving better efficiency when mining long frequent itemsets; during mining a divide-and-conquer strategy is adopted, partitioning the compressed database DB into a group of conditional databases Dn, each associated with one frequent item and mined separately, where each conditional database Dn is far smaller than the database DB, which is not limited in this embodiment.
Although embodiments of the present invention have been shown and described, it will be appreciated by those skilled in the art that changes, modifications, substitutions and alterations can be made in these embodiments without departing from the principles and spirit of the invention, the scope of which is defined in the appended claims and their equivalents.

Claims (8)

1. A multi-source data unified processing system based on a big data engine, characterized in that: the system comprises a multi-source data timing acquisition module, a multi-source data rule setting module, a multi-source data classification storage module, a data preprocessing module, a data auditing module, a data central control module and a human-computer interaction module, wherein the multi-source data timing acquisition module performs a timed multi-source data acquisition and transmission process based on the human-computer interaction module, specifically including data acquisition and transmission by means such as character keyword collection and extraction and picture or voice recognition.
2. The big data engine-based multi-source data unified processing system according to claim 1, wherein: the multi-source data rule setting module is used for setting and classifying data rules based on the multi-source data timing acquisition module and the multi-source data classification storage module, and the multi-source data classification storage module is used for storing classification data through the multi-source data rule setting module.
3. The big data engine-based multi-source data unified processing system according to claim 1, wherein: the data preprocessing module is used for integrating and cleaning each group of information through the preprocessing module based on the multi-source data classified storage module, and the data auditing module is used for auditing the classified data of the data preprocessing module.
4. The big data engine-based multi-source data unified processing system according to claim 1, wherein: the data central control module directly controls the multi-source data classified storage module, and the human-computer interaction module is used for manually inputting, reading, monitoring and modifying data based on the display terminal.
5. The big data engine-based multi-source data unified processing system according to claim 1, wherein: the multi-source data timing acquisition module is used for solving the timing of data: because a data supplier adds a file at fixed intervals, such as every 5 minutes or every 10 minutes, the data acquisition program scans periodically, so that 5-minute data is not treated as 10-minute data; secondly it addresses the inconsistency of data interfaces, since the interface platform provided by a data supplier is not only an FTP server but may also be an SQLSERVER database or Cobar; furthermore the files provided by different platforms are diverse, for example a database platform provides only database tables while FTP provides CSV, XML and TXT files, which are parsed uniformly; the timing configuration is specifically as follows:
<bean autowire="no"
      class="org.springframework.scheduling.quartz.SchedulerFactoryBean">
    <property name="triggers">
        <list>
            <!--msc-->
            <ref bean="Msc5mGatherTrigger"/>
            <ref bean="Mgw5mGatherTrigger"/>
            <ref bean="Hlr5mGatherTrigger"/>
        </list>
    </property>
</bean>
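The intent behind the trigger configuration above can be illustrated with a small sketch (the task names echo the trigger beans; the minute-based scan is a simplification for illustration, not the Quartz scheduler itself):

```python
# Each source registers a gather task with its own period; a periodic
# scan fires whichever tasks are due at the current minute, so 5-minute
# and 10-minute feeds are collected on their own schedules.
def due_tasks(tasks, minute):
    """Return the names of tasks whose period divides the current minute."""
    return [name for name, period in tasks if minute % period == 0]

tasks = [("msc5mGather", 5), ("mgw5mGather", 5), ("hlr10mGather", 10)]
fired_at_5 = due_tasks(tasks, 5)    # only the 5-minute gathers
fired_at_10 = due_tasks(tasks, 10)  # both periods coincide here
```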
6. The big data engine-based multi-source data unified processing system according to claim 3, wherein: the data preprocessing module comprises data cleaning and data integration, wherein data integration merges multiple data sources into one data store, and data that already reside in a single data store need no integration; two data frames are merged on a keyword: in R the merge function is used, with the statement merge(dataframe1, dataframe2, by = "keyword"), sorted in ascending order by default. The following problems can occur during data integration: same name, different meaning, where an attribute name in data source A is the same as an attribute name in data source B but the entities represented are different, so it cannot be used as the keyword; different name, same meaning, where the names of certain attributes of the two data sources differ but the entity represented is the same, so it can be used as the keyword; when data integration causes data redundancy, correlation analysis and detection are first carried out on one of the duplicated attributes, and the duplicated attribute is deleted through the human-computer interaction module.
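The key-based merge described in this claim can be illustrated in Python (a sketch of an inner join with ascending sort, mirroring the default behaviour of R's merge; the field names are hypothetical):

```python
# Join two record lists on a shared key, keeping only rows whose key
# appears in both sources, and sort the result ascending by the key.
def merge_on(key, left, right):
    right_by_key = {r[key]: r for r in right}
    joined = [{**l, **right_by_key[l[key]]}
              for l in left if l[key] in right_by_key]
    return sorted(joined, key=lambda r: r[key])

left = [{"id": 2, "name": "b"}, {"id": 1, "name": "a"}]
right = [{"id": 1, "score": 90}, {"id": 3, "score": 70}]
merged = merge_on("id", left, right)
```

A "same name, different meaning" attribute would silently produce wrong joins here, which is why the claim excludes such attributes from serving as the keyword.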
7. The big data engine-based multi-source data unified processing system according to claim 1, wherein: the multi-source data rule setting module sets rules through the FP-Growth algorithm, which is described as follows:
Input: a transaction database D; a minimum support threshold min_sup;
Output: the complete set of frequent patterns;
Step 1: construct the FP-Tree, as follows:
scan the transaction database D once, collect the set F of frequent items and their supports, and sort F in descending order of support; the result is the frequent item list L;
create the root node of the FP-Tree and label it "null"; for each transaction Trans in D, do the following: select the frequent items in Trans and sort them in the order of L; let the sorted frequent item list be [p|P], where p is the first element and P is the list of remaining elements; call insert_tree([p|P], T), which proceeds as follows: if T has a child N such that N.item-name = p.item-name, increase the count of N by 1; otherwise create a new node N with its count set to 1, link it to its parent node T, and link it to nodes with the same item-name through the node-link chain; if P is non-empty, recursively call insert_tree(P, N);
Step 2: mine the frequent itemsets from the FP-Tree, calling FP-growth(Tree, α) with α initially null; the pseudo code of the procedure is as follows:
if Tree contains a single path P then
for each combination of the nodes in path P (denoted β)
generate pattern β ∪ α, whose support is the minimum support of the nodes in β;
else for each ai in the header table of Tree {
generate pattern β = ai ∪ α, with support = ai.support;
construct the conditional pattern base of β, and then construct β's conditional FP-Tree Treeβ;
if Treeβ ≠ ∅ then
call FP-growth(Treeβ, β); }
8. The big data engine-based multi-source data unified processing system according to claim 6, wherein: the data preprocessing module comprises data cleaning and data integration; data cleaning comprises missing-value and abnormal-value processing, and missing-value handling comprises identification and treatment of the missing value; in R, missing values are identified with the function is.na() and complete cases with complete.cases(); the treatments are deletion, replacement and interpolation; deletion method: depending on the angle of deletion, either observation samples or variables are deleted, and deleting observation samples (row deletion) removes the rows containing missing values via na.omit() in R; replacement method: the replacement rule depends on the type of the variable; when the variable carrying the missing value is numerical, the missing value is replaced by the mean of the other values of that variable, and when the variable is non-numerical, it is replaced by the median or mode of the other observed values of that variable; interpolation method: the interpolated variable is taken as the dependent variable y, the other variables are fitted with a regression model, and the missing value is interpolated with the lm() regression function in R; multiple interpolation generates several complete data sets from a data set containing missing values, repeatedly producing random samples of the missing values, and is performed with the mice package in R.
CN202210643243.1A 2022-06-08 2022-06-08 Multi-source data unified processing system based on big data engine Pending CN115080565A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210643243.1A CN115080565A (en) 2022-06-08 2022-06-08 Multi-source data unified processing system based on big data engine


Publications (1)

Publication Number Publication Date
CN115080565A true CN115080565A (en) 2022-09-20

Family

ID=83251865

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210643243.1A Pending CN115080565A (en) 2022-06-08 2022-06-08 Multi-source data unified processing system based on big data engine

Country Status (1)

Country Link
CN (1) CN115080565A (en)


Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106202430A (en) * 2016-07-13 2016-12-07 武汉斗鱼网络科技有限公司 Live platform user interest-degree digging system based on correlation rule and method for digging
CN111708773A (en) * 2020-08-13 2020-09-25 江苏宝和数据股份有限公司 Multi-source scientific and creative resource data fusion method
CN113468163A (en) * 2021-09-01 2021-10-01 南京烽火星空通信发展有限公司 Multisource heterogeneous public security big data intelligent docking engine system


Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
KEALOO: "Summary of massive data processing methods", 《HTTPS://BLOG.CSDN.NET/QQ_44797267/ARTICLE/DETAILS/120228705》 *
伍起鑫 et al.: "Research on key technologies of timed data acquisition based on the Spring framework", 《Computer Knowledge and Technology》 *
阿里云云栖号: "Dataphin functions, integration: how to extract and aggregate business-system data into the data middle platform", 《HTTPS://WWW.SOHU.COM/A/483125239_612370》 *
鱼鱼鱼小昶: "Data mining algorithms revealed: association rule methods", 《HTTPS://BLOG.CSDN.NET/QQ_39391192/ARTICLE/DETAILS/81703706》 *

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116108476A (en) * 2022-11-03 2023-05-12 广东加一信息技术有限公司 Information security management and monitoring system based on big data
CN116108476B (en) * 2022-11-03 2023-08-25 深圳市和合信诺大数据科技有限公司 Information security management and monitoring system based on big data
CN115952325A (en) * 2023-03-09 2023-04-11 广东创能科技股份有限公司 Data aggregation method and device based on big data platform
CN115952325B (en) * 2023-03-09 2023-05-16 广东创能科技股份有限公司 Data collection method and device based on big data platform
CN116340975A (en) * 2023-03-16 2023-06-27 江苏骏安信息测评认证有限公司 Cache data safety protection system based on cloud computing
CN117573655A (en) * 2024-01-15 2024-02-20 中国标准化研究院 Data management optimization method and system based on convolutional neural network
CN117573655B (en) * 2024-01-15 2024-03-12 中国标准化研究院 Data management optimization method and system based on convolutional neural network


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20220920