CN115080565A - Multi-source data unified processing system based on big data engine - Google Patents
- Publication number
- CN115080565A (application CN202210643243.1A)
- Authority
- CN
- China
- Prior art keywords
- data
- module
- source data
- source
- tree
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/22—Indexing; Data structures therefor; Storage structures
- G06F16/21—Design, administration or maintenance of databases
- G06F16/215—Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
Abstract
The invention relates to the technical field of digital information transmission and discloses a multi-source data unified processing system based on a big data engine. The system comprises a multi-source data timing acquisition module, a multi-source data rule setting module, a multi-source data classified storage module, a data preprocessing module, a data auditing module, a data central control module and a human-computer interaction module. It realizes rule configuration, timed operation and access for data from different sources; achieves rapid acquisition and cleaning of the data; loads the data into a big data platform for layered, separated storage; and provides a flexible external service mode. It solves the problem that a single or isolated data channel cannot meet the increasingly complex requirements of the internet era: data from multiple channels can be effectively gathered together, unified access, distribution and processing of information data are realized, user experience is improved, and the transmission efficiency of data processing and transmission is improved.
Description
Technical Field
The invention relates to the technical field of digital information transmission, in particular to a multi-source data unified processing system based on a big data engine.
Background
The main functions of a data processing system are to process and sort input data, calculate various analysis indexes, convert the input into a form that is easy for people to accept, store the processed information in an orderly manner, and transmit it to information users through external equipment at any time.
In a traditional data processing system, because the data channel is single or isolated, the increasingly complex requirements of the internet era cannot be met: data from multiple channels cannot be effectively gathered together, unified access, distribution and processing of information data cannot be achieved, information cannot be identified, user experience is reduced, and the transmission efficiency of data processing and transmission is reduced. A data fusion and data collection center system scheme is therefore urgently needed, together with an explanation of data fusion technology and a data fusion method.
Disclosure of Invention
Technical problem solved
Aiming at the defects of the prior art, the invention provides a multi-source data unified processing system based on a big data engine, which realizes rule configuration, timed operation and access for data from different sources; achieves rapid acquisition and cleaning of the data; loads the data into a big data platform for layered, separated storage; and provides a flexible external service mode. It solves the problem that a single or isolated data channel cannot meet the increasingly complex requirements of the internet era, so that data from multiple channels are effectively gathered together, unified access, distribution and processing of information data are realized, user experience is improved, and the transmission efficiency of data processing and transmission is improved.
Technical scheme
In order to achieve the purpose, the invention provides the following technical scheme: a multi-source data unified processing system based on a big data engine comprises a multi-source data timing acquisition module, a multi-source data rule setting module, a multi-source data classification storage module, a data preprocessing module, a data auditing module, a data central control module and a man-machine interaction module. The multi-source data timing acquisition module carries out a timed multi-source data acquisition and transmission process based on the man-machine interaction module, specifically including acquisition and extraction of text keywords, acquisition of picture or voice recognition data, and other modes of data acquisition and transmission.
Preferably, the multi-source data rule setting module sets and classifies data rules based on the multi-source data timing acquisition module and the multi-source data classification storage module, and the multi-source data classification storage module stores classification data through the multi-source data rule setting module.
Preferably, the data preprocessing module is based on a multi-source data classification storage module, and can perform data integration and data cleaning on each group of information through the preprocessing module, and the data auditing module is used for auditing the classification data of the data preprocessing module.
Preferably, the data central control module directly controls the multi-source data classification storage module, and the human-computer interaction module is used for manually inputting, reading, monitoring and modifying data based on the display terminal.
Preferably, the multi-source data timing acquisition module solves three problems. The first is the timing of the data: a data supplier may add a file at a fixed interval, for example every 5 or 10 minutes, so the data acquisition program must scan periodically to ensure the data stays current, and 5-minute data must not be processed as 10-minute data. The second is the inconsistency of data interfaces: the interface platform provided by a data supplier is not necessarily just an FTP server and may include an SQL Server database, an FTP server and Cobar. The third is the diversity of the files provided by different platforms: a database platform provides only database tables, while FTP provides CSV, XML and TXT files, and all of these must be parsed uniformly. The specific steps are as follows:
preferably, the data preprocessing module includes data cleaning and data integration, the data integration merges a plurality of data sources into one data storage, and if the analyzed data originally does not need data integration in one data storage, i.e. all-in-one, the data integration is realized by using two data frames as a basis and a merge function in R, and the statements are merge (dataframe1, dataframe2, by ═ key "), and are arranged in ascending order by default, and the following problems may occur when data integration is performed: synonymy, wherein the name of a certain attribute in the data source A is the same as the name of a certain attribute in the data source B, but the represented entities are different and cannot be used as keywords; synonymy of different names, namely that the names of certain attributes of two data sources are different but the represented entities are the same, and the names can be used as keywords; data integration often causes data redundancy, the same attribute may appear for multiple times, or the attribute names may be inconsistent to cause repetition, one of the repeated attributes is firstly subjected to related analysis and detection, and if the repeated attribute exists, the repeated attribute is deleted through a human-computer interaction module.
Preferably, the multi-source data rule setting module sets rules with the FP-Growth algorithm, which is described as follows:
Input: a transaction database D; a minimum support threshold min_sup.
Output: the complete set of frequent patterns.
Step 1: construct the FP-Tree as follows:
1. Scan the transaction database D once, collect the set F of frequent items and their supports, and sort F in descending order of support; the result is the frequent-item list L.
2. Create the root node T of the FP-Tree, labelled "null". For each transaction Trans in D: select the frequent items in Trans and sort them in the order of L; let the sorted frequent-item list be [p|P], where p is the first element and P is the list of remaining elements; call insert_tree([p|P], T). The procedure is: if T has a child N such that N.item-name = p.item-name, increment N's count by 1; otherwise create a new node N with count 1, link it to its parent node T, and link it via the node-link chain to the nodes with the same item-name; if P is not empty, call insert_tree(P, N) recursively.
Step 2: mine the frequent itemsets from the FP-Tree; the pseudo code of FP-growth(Tree, α) is:
if Tree contains a single path P then
  for each combination β of the nodes in path P:
    generate pattern β ∪ α with support = the minimum support of the nodes in β;
else for each ai in the header table of Tree {
  generate pattern β = ai ∪ α with support = ai.support;
  construct β's conditional pattern base, and from it construct β's conditional FP-Tree Treeβ;
  if Treeβ ≠ ∅ then
    call FP-growth(Treeβ, β); }
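The FP-Tree construction of step 1 can be sketched as follows — a minimal illustration of the first scan, the descending-support ordering, and the insert_tree procedure. It is not the patent's implementation: class and function names are invented, the header-table node-links used in the mining step are omitted, and each transaction is treated as a set of items.

```python
from collections import Counter

class FPNode:
    """FP-Tree node: item name, support count, parent link, children keyed by item."""
    def __init__(self, item, parent=None):
        self.item, self.count, self.parent, self.children = item, 0, parent, {}

def insert_tree(items, node):
    """The insert_tree([p|P], T) step: walk/extend one path, bumping counts."""
    for item in items:
        child = node.children.get(item)
        if child is None:                # no child N with N.item-name == p.item-name:
            child = FPNode(item, node)   # create N with count 0, link to its parent
            node.children[item] = child
        child.count += 1                 # increment N's count
        node = child                     # recurse down the path (iteratively here)

def build_fp_tree(transactions, min_sup):
    # First scan of D: collect frequent items F (support >= min_sup) and fix
    # the descending-support order used to sort every transaction.
    support = Counter(i for t in transactions for i in set(t))
    rank = {i: r for r, (i, _) in enumerate(
        sorted(((i, c) for i, c in support.items() if c >= min_sup),
               key=lambda kv: (-kv[1], kv[0])))}
    root = FPNode(None)                  # root labelled "null"
    for t in transactions:               # second scan: insert each sorted transaction
        frequent = sorted((i for i in set(t) if i in rank), key=rank.get)
        insert_tree(frequent, root)
    return root, rank

root, rank = build_fp_tree(
    [["a", "b"], ["b", "c", "d"], ["a", "b", "c"], ["a", "b", "d"]], min_sup=2)
```

On this toy database the most frequent item "b" heads every path, so the tree shares a single "b" prefix with count 4 — exactly the compression the algorithm relies on.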
Preferably, the data preprocessing module includes data cleaning and data integration. Data cleaning covers the processing of missing values and abnormal values. Handling missing values consists of identifying them and then processing them; in R, missing values are identified with the is.na() function. The processing methods are deletion, replacement and interpolation. Deletion method: depending on the angle of deletion, either observation samples or variables are deleted; deleting observation samples (row-wise deletion) removes the rows containing missing values, which in R is done with the na.omit() function. Replacement method: the replacement rule depends on the type of the variable; when the variable with missing values is numerical, the missing value is replaced by the mean of the other values of that variable; when the variable is non-numerical, it is replaced by the median or mode of the other observed values of that variable. Interpolation method: taking the variable to be interpolated as the dependent variable y, a regression model is fitted on the other variables, and the missing value is interpolated with the lm regression function in R; multiple imputation generates several complete data sets from a data set containing missing values, repeating the procedure to produce random samples of the missing values, and is performed with the mice package in R.
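The R workflow described above (is.na for identification, na.omit for deletion, mean/mode replacement, lm/mice for interpolation) can be mirrored as an illustrative sketch in pandas. The toy data and simple linear interpolation below are stand-ins, not the patent's code:

```python
import pandas as pd

# Toy table with gaps in a numerical and a non-numerical column (values invented).
df = pd.DataFrame({"x": [1.0, 2.0, None, 4.0],
                   "label": ["a", None, "a", "b"]})

# Identification: is.na() in R corresponds to isna() in pandas.
mask = df["x"].isna()

# Deletion (row-wise, like na.omit): drop observations containing missing values.
dropped = df.dropna()

# Replacement: numerical -> mean of the observed values; non-numerical -> mode.
replaced = df.copy()
replaced["x"] = replaced["x"].fillna(df["x"].mean())
replaced["label"] = replaced["label"].fillna(df["label"].mode()[0])

# Interpolation: fill the gap from neighbouring values — a simple stand-in for
# the regression (lm) or multiple-imputation (mice) approaches in the text.
interpolated = df["x"].interpolate()
```

A regression-based fill, as the text describes with lm, would instead fit y on the other variables and predict the missing entries; the principle is the same.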
Advantageous effects
The invention provides a multi-source data unified processing system based on a big data engine, which has the following beneficial effects:
the multisource data unified processing system based on the big data engine realizes rule configuration, timing operation and access of data from different sources, realizes rapid acquisition and cleaning of the data, is loaded into a big data platform to realize hierarchical and warehouse-by-warehouse storage of the data, provides a flexible external service mode, meets the requirement of increasingly complex Internet era due to single data channel or isolated channel, can effectively collect data from multiple channels together, realizes unified access, distribution and processing of information data, improves user experience, and improves the transmission efficiency of data processing and transmission.
Drawings
FIG. 1 is a flow chart of the multi-source data unified processing system based on a big data engine according to the present invention.
Detailed Description
Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
Meanwhile, it should be understood that the sizes of the respective portions shown in the drawings are not drawn in an actual proportional relationship for the convenience of description.
The following description of at least one exemplary embodiment is merely illustrative in nature and is in no way intended to limit the application, its application, or uses.
Techniques, methods, and apparatus known to those of ordinary skill in the relevant art may not be discussed in detail but are intended to be part of the specification where appropriate.
It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, further discussion thereof is not required in subsequent figures.
Embodiments of the application are applicable to computer systems/servers that are operational with numerous other general purpose or special purpose computing system environments or configurations. Examples of well known computing systems, environments, and/or configurations that may be suitable for use with the computer system/server include, but are not limited to: personal computer systems, server computer systems, thin clients, thick clients, hand-held or laptop devices, microprocessor-based systems, set-top boxes, programmable consumer electronics, network personal computers, small computer systems, mainframe computer systems, and distributed cloud computing environments that include any of the above, and the like.
The computer system/server may be described in the general context of computer system-executable instructions, such as program modules, being executed by a computer system. Generally, program modules may include routines, programs, objects, components, logic, data structures, etc. that perform particular tasks or implement particular abstract data types. The computer system/server may be practiced in distributed cloud computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed cloud computing environment, program modules may be located in both local and remote computer system storage media including memory storage devices.
Example 1
The invention provides a technical scheme: a multi-source data unified processing system based on a big data engine comprises a multi-source data timing acquisition module, a multi-source data rule setting module, a multi-source data classification storage module, a data preprocessing module, a data auditing module, a data central control module and a man-machine interaction module. The multi-source data timing acquisition module carries out a timed multi-source data acquisition and transmission process based on the man-machine interaction module, specifically including acquisition and extraction of text keywords, acquisition of picture or voice recognition data, and other modes of data acquisition and transmission. The multi-source data rule setting module sets and classifies data rules based on the multi-source data timing acquisition module and the multi-source data classification storage module; the multi-source data classification storage module stores classified data through the multi-source data rule setting module; the data preprocessing module performs data integration and data cleaning on each group of information based on the multi-source data classification storage module; the data auditing module audits the classified data of the data preprocessing module; the data central control module directly controls the multi-source data classification storage module; and the man-machine interaction module is used for manually inputting, reading, monitoring and modifying data based on the display terminal.
A management method of a multi-source data unified processing system based on a big data engine specifically comprises the following steps:
101. the data is collected regularly by matching the multi-source data timing collection module with the man-machine interaction module and the multi-source data rule setting module to set the data rule;
In this embodiment, the multi-source data timing acquisition module is specifically described. The first problem to be solved is the timing of the data: since a data supplier adds a file at a fixed interval of 5 or 10 minutes, the data acquisition program performs periodic scanning, and 5-minute data must not be processed as 10-minute data. The second is the inconsistency of data interfaces: the interface platform provided by a data supplier is not only an FTP server but may also include an SQL Server database and Cobar. The third is the diversity of the files provided by different platforms: for example, a database platform provides only database tables, while FTP provides CSV, XML and TXT files, all of which are parsed uniformly. The specific steps are as follows:
The file dc.xml mainly configures how many triggers need to be started in the acquisition module, each trigger representing a thread-safe data branch. In addition, another file needs to be configured to describe the running time and target object of each data branch; it is called xfdc.xml. The specific steps are as follows:
It should be specifically noted that the trigger specifies that the acquisition object is triggered at the 20th second of every 5th minute of each hour, that is, the thread of the data branch is started; this embodiment is not specifically limited thereto.
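The schedule above ("the 20th second of every 5th minute of each hour") can be sketched as a plain computation of the next fire time. This is an illustrative stand-in for the Quartz-style trigger configuration, not the patent's actual xfdc.xml; the function name is invented.

```python
from datetime import datetime, timedelta

def next_fire(after):
    """Next slot of the form HH:(0,5,10,...):20 strictly after `after`,
    mimicking a 'second 20 of every 5th minute' trigger."""
    base = after.replace(minute=(after.minute // 5) * 5, second=20, microsecond=0)
    if base <= after:                     # already past this slot: advance one period
        base += timedelta(minutes=5)
    return base

t = next_fire(datetime(2022, 6, 1, 10, 3, 0))   # -> 10:05:20 the same day
```

A scanner thread would sleep until next_fire(now) and then start the data branch, repeating each period.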
102. The collected data are transmitted to a multi-source data classification storage module for data classification and storage, and each group of information is subjected to data cleaning and data integration through a data preprocessing module;
In this embodiment, it should be specifically explained that data integration merges multiple data sources into one data store; if the analyzed data already reside in one data store, no integration is needed. Two data frames are integrated on the basis of a key with the merge function in R, the statement being merge(dataframe1, dataframe2, by = "key"), sorted in ascending order by default. The following problems may occur during data integration: homonymy, where an attribute in data source A has the same name as an attribute in data source B but the represented entities differ, so the attribute cannot be used as the key; and synonymy, where attributes of the two data sources have different names but represent the same entity, and can be used as the key. When data integration causes data redundancy, correlation analysis and detection are first performed on the duplicate attribute, and it is deleted through the human-computer interaction module; this embodiment is not specifically limited.
It should be specifically noted that data integration can perform data cleaning and conversion, provides field calculation, merging, distribution, filtering and field desensitization components or functions, and supports fault-tolerance, concurrency and rate-limiting configuration; this embodiment is not specifically limited.
103. Data cleaning and data integration are monitored and audited through the data auditing module, after which data processing is carried out through the data central control module;
In this embodiment, it is specifically explained that the approach of the data central control module is as follows: against time, a suitable algorithm can be adopted together with an appropriate data structure, such as a Bloom filter, hash map, bit-map, heap, database, inverted index or trie tree; against space, the approach is divide and conquer — scale the large problem down into small ones and defeat each part in turn; this embodiment is not specifically limited.
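Of the structures listed, the Bloom filter is representative: it answers membership queries over a large data stream in constant space, with possible false positives but never false negatives. A minimal sketch (sizes, hash construction and names are illustrative, not from the patent):

```python
import hashlib

class BloomFilter:
    """Tiny Bloom filter backed by one Python int as the bit array."""
    def __init__(self, m=1024, k=3):
        self.m, self.k, self.bits = m, k, 0   # m bits, k hash functions

    def _positions(self, item):
        # Derive k bit positions by salting a single hash function.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.m

    def add(self, item):
        for p in self._positions(item):
            self.bits |= 1 << p

    def __contains__(self, item):
        # All k bits set -> "probably present"; any bit clear -> definitely absent.
        return all(self.bits >> p & 1 for p in self._positions(item))

bf = BloomFilter()
for record_id in ["a1", "b2", "c3"]:
    bf.add(record_id)
```

Added items are always reported present; a small fraction of never-added items may collide on all k bits, which is the accepted trade-off for the constant-space membership test.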
What needs to be specifically explained is divide and conquer: the multi-source data are processed by hash mapping, hash-map statistics and quick/merge/heap sorting. That is, when the multi-source data cannot be read into memory at once but must still be counted, sorted and so on, the basic idea is as follows: compute the hash value of each piece of data with a hash function and distribute the multi-source data into multiple buckets according to the hash values; since the hash function is deterministic, identical data necessarily land in the same bucket, so the small files are processed in sequence and the results are finally merged; this embodiment is not specifically limited.
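The hash-bucket idea above — hash each record, scatter into buckets, count each small bucket, merge — can be sketched in a few lines. This is an in-memory illustration of the principle only (in practice each bucket would be a file on disk; the names and bucket count are invented):

```python
from collections import Counter

def partition_count(records, num_buckets=4):
    """Scatter records into buckets by hash, count each small bucket, merge."""
    buckets = [[] for _ in range(num_buckets)]
    for r in records:
        # hash() is salted per process but consistent within a run, so identical
        # records always co-locate; a stable hash (e.g. hashlib) would be used
        # when buckets must persist across runs.
        buckets[hash(r) % num_buckets].append(r)
    total = Counter()
    for b in buckets:              # each bucket is small enough to process alone
        total.update(Counter(b))   # count within the bucket, then merge
    return total

counts = partition_count(["x", "y", "x", "z", "x", "y"])
```

Because identical records share a bucket, per-bucket counts are exact, and the merge step is a simple sum — the "turn big into small, defeat each part" strategy from the text.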
104. Inputting, reading, monitoring and modifying data through a man-machine interaction module based on a multi-source data timing acquisition module and a multi-source data rule setting module;
In this embodiment, it needs to be specifically stated that the multi-source data rule setting module sets rules through the FP-Growth algorithm, which is described as follows:
Input: a transaction database D; a minimum support threshold min_sup.
Output: the complete set of frequent patterns.
Step 1: construct the FP-Tree as follows:
1. Scan the transaction database D once, collect the set F of frequent items and their supports, and sort F in descending order of support; the result is the frequent-item list L.
2. Create the root node T of the FP-Tree, labelled "null". For each transaction Trans in D: select the frequent items in Trans and sort them in the order of L; let the sorted frequent-item list be [p|P], where p is the first element and P is the list of remaining elements; call insert_tree([p|P], T). The procedure is: if T has a child N such that N.item-name = p.item-name, increment N's count by 1; otherwise create a new node N with count 1, link it to its parent node T, and link it via the node-link chain to the nodes with the same item-name; if P is not empty, call insert_tree(P, N) recursively.
Step 2: mine the frequent itemsets from the FP-Tree; the pseudo code of FP-growth(Tree, α) is:
if Tree contains a single path P then
  for each combination β of the nodes in path P:
    generate pattern β ∪ α with support = the minimum support of the nodes in β;
else for each ai in the header table of Tree {
  generate pattern β = ai ∪ α with support = ai.support;
  construct β's conditional pattern base, and from it construct β's conditional FP-Tree Treeβ;
  if Treeβ ≠ ∅ then
    call FP-growth(Treeβ, β); }
Specifically, it should be noted that the FP-Growth algorithm can effectively compress a large database into a high-density structure much smaller than the original database, avoiding the overhead of repeated scanning. Based on mining the FP-Tree, the algorithm adopts a pattern-growth recursion strategy and creatively provides a mining method that needs no candidate itemsets, so it is more efficient when mining long frequent itemsets. In the mining process a divide-and-conquer strategy is adopted: the compressed database DB is divided into a group of conditional databases Dn, each associated with one frequent item, and each conditional database is mined separately; each conditional database Dn is far smaller than the database DB. This embodiment is not limited thereto.
Example 2
The invention provides a technical scheme: a multi-source data unified processing system based on a big data engine comprises a multi-source data timing acquisition module, a multi-source data rule setting module, a multi-source data classification storage module, a data preprocessing module, a data auditing module, a data central control module and a man-machine interaction module. The multi-source data timing acquisition module carries out a timed multi-source data acquisition and transmission process based on the man-machine interaction module, specifically including acquisition and extraction of text keywords, acquisition of picture or voice recognition data, and other modes of data acquisition and transmission. The multi-source data rule setting module sets and classifies data rules based on the multi-source data timing acquisition module and the multi-source data classification storage module; the multi-source data classification storage module stores classified data through the multi-source data rule setting module; the data preprocessing module performs data integration and data cleaning on each group of information based on the multi-source data classification storage module; the data auditing module audits the classified data of the data preprocessing module; the data central control module directly controls the multi-source data classification storage module; and the man-machine interaction module is used for manually inputting, reading, monitoring and modifying data based on the display terminal.
A management method of a multi-source data unified processing system based on a big data engine specifically comprises the following steps:
101. the data is regularly acquired by matching a multi-source data timing acquisition module with a man-machine interaction module and a multi-source data rule setting module to set data rules;
In this embodiment, it is specifically described that data is collected at regular intervals. The first problem to be solved is the timing of the data: since a data supplier adds a file at a fixed interval of 5 or 10 minutes and the data acquisition program performs periodic scanning, 5-minute data must not be treated as 10-minute data. The second is the inconsistency of data interfaces: the interface platform provided by a data supplier is not only an FTP server but may also include an SQL Server database and Cobar. The third is the diversity of the files provided by different platforms: for example, a database platform provides only database tables, while FTP provides CSV, XML and TXT files, all of which are parsed uniformly. The specific steps are as follows:
The file dc.xml mainly configures how many triggers need to be started in the acquisition module, each trigger representing a thread-safe data branch. In addition, another file needs to be configured to describe the running time and target object of each data branch; it is called xfdc.xml. The specific steps are as follows:
It should be specifically noted that the trigger specifies that the acquisition object is triggered at the 20th second of every 5th minute of each hour, that is, the thread of the data branch is started; this embodiment is not specifically limited thereto.
102. The collected data are transmitted to a multi-source data classification storage module for data classification and storage, and each group of information is subjected to data cleaning and data integration through a data preprocessing module;
In this embodiment, it should be specifically noted that data cleaning includes the processing of missing values and abnormal values. Handling a missing value involves first identifying it and then processing it; in R, missing values are identified with the function is.na(). There are three processing methods: deletion, replacement and interpolation. Deletion method: depending on the deletion angle, either observation samples or variables are deleted; deleting observation samples (the row-deletion method) removes rows containing missing values, which in R is done with na.omit(). Replacement method: the replacement rule depends on the variable type; when the variable with the missing value is numerical, the missing value is replaced by the mean of the other values of that variable; when the variable is non-numerical, it is replaced by the median or mode of the other observed values of that variable. Interpolation method: regression interpolation takes the interpolated variable as the dependent variable y, fits the other variables with a regression model, and interpolates the missing value using the lm regression function in R; multiple imputation generates several complete data sets from a data set containing missing values, repeating the process to produce random samples of the missing values, and is performed with the mice package in R. This embodiment is not specifically limited thereto.
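The embodiment describes the cleaning steps in R; the deletion and replacement rules can be sketched equivalently in Python with pandas (an illustrative translation with made-up column names, not the patented implementation):

```python
import pandas as pd

df = pd.DataFrame({"x": [1.0, None, 3.0, 4.0],
                   "grade": ["a", "b", None, "b"]})

# Deletion method: drop rows containing any missing value
# (analogous to na.omit() in R).
dropped = df.dropna()

# Replacement method: a numerical column gets the mean of its other
# values; a non-numerical column gets the mode of its other values.
filled = df.copy()
filled["x"] = filled["x"].fillna(filled["x"].mean())
filled["grade"] = filled["grade"].fillna(filled["grade"].mode()[0])
```

Regression interpolation and multiple imputation (lm and the mice package in R) have no one-line pandas equivalent and are omitted from this sketch.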
It should be specifically noted that data integration can perform data cleansing and conversion, provides field calculation, merging, distribution, filtering and field desensitization components or functions, and supports fault-tolerance configuration, concurrency configuration and speed-limit configuration; this embodiment is not specifically limited thereto.
103. Data cleaning and data integration are monitored and audited by the data auditing module, and data processing is then carried out by the data central control module;
In this embodiment, it should be specifically explained that the data central control module solves problems along two dimensions. For the time dimension, a clever algorithm is paired with a suitable data structure, such as a Bloom filter, hash map, bit-map, heap, external (database) sort, inverted index or trie tree. For the space dimension, the approach is divide and conquer: scale the large problem down into small subproblems and defeat them one by one. This embodiment is not specifically limited thereto.
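Of the structures listed above, the Bloom filter is the least standard; a minimal, illustrative sketch follows (the bit-array size m and hash count k are chosen arbitrarily here, not taken from the embodiment):

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: k hash functions over an m-bit array.
    May report false positives, never false negatives."""

    def __init__(self, m: int = 1024, k: int = 3):
        self.m, self.k = m, k
        self.bits = 0  # integer used as a bit array

    def _positions(self, item):
        # Derive k positions by salting one hash function with an index.
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(h, 16) % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def __contains__(self, item):
        return all(self.bits >> pos & 1 for pos in self._positions(item))
```

In a deduplication setting, each incoming record key is tested against the filter before the expensive exact check, trading a small false-positive rate for constant memory.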
What needs to be specifically explained is divide and conquer: the multi-source data are handled through hash mapping, hash-map statistics and quick/merge/heap sorting. That is, when the multi-source data cannot be read into memory at once but operations such as counting and sorting must still be carried out on them, the basic idea is as follows: compute a hash value for each piece of data with a hash algorithm and distribute the multi-source data into a number of buckets according to those hash values; by the determinism of the hash function, identical data necessarily land in the same bucket, so the user processes the small files one by one and finally performs a merge operation. This embodiment is not specifically limited thereto.
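The hash-mapping and merge steps just described can be sketched as follows (function names are illustrative; in practice each bucket would be a file on disk rather than an in-memory list, which is exactly what makes the data fit in memory one bucket at a time):

```python
def partition_into_buckets(items, n_buckets=4):
    """Distribute items into buckets by hash value, so that identical
    items always land in the same bucket."""
    buckets = [[] for _ in range(n_buckets)]
    for item in items:
        buckets[hash(item) % n_buckets].append(item)
    return buckets

def count_frequencies(items, n_buckets=4):
    """Count item frequencies bucket by bucket, then merge the partial
    results; keys never collide across buckets."""
    total = {}
    for bucket in partition_into_buckets(items, n_buckets):
        counts = {}
        for item in bucket:
            counts[item] = counts.get(item, 0) + 1
        total.update(counts)
    return total
```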
104. Inputting, reading, monitoring and modifying data through a man-machine interaction module based on a multi-source data timing acquisition module and a multi-source data rule setting module;
In this embodiment, it should be specifically noted that the multi-source data rule setting module sets its rules through the FP-Growth algorithm, which is described as follows:
Input: a transaction database D and a minimum support threshold min_sup;
Output: the complete set of frequent patterns;
Step 1: construct the FP-Tree as follows:
Scan the transaction database D once, collecting the set F of frequent items and their supports. Sort F in descending order of support; the result is the frequent-item list L.
Create the root node of the FP-Tree, labeled "null". For each transaction Trans in D, perform the following: select the frequent items in Trans and sort them in the order of L. Let the sorted frequent-item list be [p|P], where p is the first element and P is the list of remaining elements. Call insert_tree([p|P], T), which proceeds as follows: if T has a child N such that N.item-name = p.item-name, increment the count of N by 1; otherwise create a new node N, set its count to 1, link it to its parent node T, and link it via the node-link structure to the nodes with the same item-name; if P is non-empty, recursively call insert_tree(P, N).
Step 2: mine the frequent item sets from the FP-Tree; the pseudo code of the procedure FP-growth(Tree, α) is as follows:
if Tree contains a single path P then
    for each combination β of the nodes in path P
        generate pattern β ∪ α with support = the minimum support of the nodes in β;
else for each ai in the header table of Tree {
    generate pattern β = ai ∪ α with support = ai.support;
    construct β's conditional pattern base and then β's conditional FP-Tree, Treeβ;
    if Treeβ ≠ ∅ then
        call FP-growth(Treeβ, β); }
It should be specifically noted that the FP-Growth algorithm effectively compresses a large database into a high-density structure much smaller than the original, avoiding the cost of repeated scans. Based on mining the FP-Tree, the algorithm adopts a recursive pattern-growth strategy and creatively provides a mining method that requires no candidate item sets, so it is more efficient when mining long frequent item sets. During mining it adopts a divide-and-conquer strategy: the compressed database DB is divided into a group of conditional databases Dn, each associated with one frequent item, and each conditional database is mined separately; each Dn is far smaller than DB. This embodiment is not specifically limited thereto.
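Step 1 above (the two scans of the database and the ordered insertion of each transaction) can be sketched in Python; the class and function names are illustrative, not from the patent, and the node-link structure for same-named items is omitted for brevity:

```python
class FPNode:
    """One node of the FP-Tree: an item, its count, parent and children."""
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 1
        self.children = {}

def build_fptree(transactions, min_sup):
    # Scan 1: count item supports and keep only the frequent items.
    freq = {}
    for t in transactions:
        for item in t:
            freq[item] = freq.get(item, 0) + 1
    freq = {i: c for i, c in freq.items() if c >= min_sup}

    # Scan 2: insert each transaction with its frequent items sorted
    # in descending order of support (ties broken alphabetically).
    root = FPNode(None, None)
    for t in transactions:
        items = sorted((i for i in t if i in freq),
                       key=lambda i: (-freq[i], i))
        node = root
        for item in items:
            if item in node.children:
                node.children[item].count += 1
            else:
                node.children[item] = FPNode(item, node)
            node = node.children[item]
    return root, freq
```

Because shared prefixes of transactions collapse into shared paths, the tree is typically much smaller than the database, which is the compression property the paragraph above relies on.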
Although embodiments of the present invention have been shown and described, it will be appreciated by those skilled in the art that changes, modifications, substitutions and alterations can be made in these embodiments without departing from the principles and spirit of the invention, the scope of which is defined in the appended claims and their equivalents.
Claims (8)
1. A multi-source data unified processing system based on a big data engine, characterized in that: the system comprises a multi-source data timing acquisition module, a multi-source data rule setting module, a multi-source data classification storage module, a data preprocessing module, a data auditing module, a data central control module and a human-computer interaction module, wherein the multi-source data timing acquisition module performs a timed multi-source data acquisition and transmission process driven from the human-computer interaction module, covering acquisition and extraction of text keywords, picture or voice recognition data acquisition, and other acquisition and transmission modes.
2. The big data engine-based multi-source data unified processing system according to claim 1, wherein: the multi-source data rule setting module is used for setting and classifying data rules based on the multi-source data timing acquisition module and the multi-source data classification storage module, and the multi-source data classification storage module is used for storing classification data through the multi-source data rule setting module.
3. The big data engine-based multi-source data unified processing system according to claim 1, wherein: the data preprocessing module is used for integrating and cleaning each group of information through the preprocessing module based on the multi-source data classified storage module, and the data auditing module is used for auditing the classified data of the data preprocessing module.
4. The big data engine-based multi-source data unified processing system according to claim 1, wherein: the data central control module directly controls the multi-source data classified storage module, and the human-computer interaction module is used for manually inputting, reading, monitoring and modifying data based on the display terminal.
5. The big data engine-based multi-source data unified processing system according to claim 1, wherein: the multi-source data timing acquisition module solves the timing of data: a data supplier adds a file at a fixed interval, for example every 5 minutes or every 10 minutes, and the data acquisition program performs periodic scanning, so 5-minute data cannot be treated as 10-minute data; it also handles the inconsistency of data interfaces, since the interface platform provided by a data supplier is not only an FTP server but may also be an SQLSERVER database or a Cobar database middleware; and the files provided by different platforms are diverse, for example a database platform provides only database tables while FTP provides CSV, XML and TXT files for uniform parsing. The specific steps are as follows:
<bean autowire="no" class="org.springframework.scheduling.quartz.SchedulerFactoryBean">
    <property name="triggers">
        <list>
            <!-- msc -->
            <ref bean="Msc5mGatherTrigger"/>
            <ref bean="Mgw5mGatherTrigger"/>
            <ref bean="Hlr5mGatherTrigger"/>
        </list>
    </property>
</bean>
6. The big data engine-based multi-source data unified processing system according to claim 3, wherein: the data preprocessing module comprises data cleaning and data integration, and data integration merges multiple data sources into one data store; data that already reside in a single store need no integration. In data integration, two data frames are joined on a keyword: in R the merge function is used, with the statement merge(dataframe1, dataframe2, by = "keyword"), and the result is sorted in ascending order by default. The following problems can occur during data integration: same name, different meaning, where an attribute name in data source A is the same as an attribute name in data source B but they represent different entities and cannot be used as the keyword; different name, same meaning, where the names of attributes in the two data sources differ but they represent the same entity and can serve as the keyword; and data redundancy caused by integration, in which case correlation analysis and detection are first carried out on one of the duplicated attributes, and the duplicated attribute is deleted through the human-computer interaction module.
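The keyword-based join described in claim 6 uses R's merge function; an equivalent, illustrative sketch in Python with pandas follows (frame and column names are made up for the example):

```python
import pandas as pd

frame1 = pd.DataFrame({"keyword": ["k1", "k2"], "value_a": [10, 20]})
frame2 = pd.DataFrame({"keyword": ["k2", "k1"], "value_b": [200, 100]})

# Join the two frames on the shared key and sort ascending by the key,
# mirroring R's merge(dataframe1, dataframe2, by = "keyword").
merged = pd.merge(frame1, frame2, on="keyword").sort_values("keyword")
```

The same-name/different-meaning pitfall from the claim shows up here as a silent mis-join: if "keyword" denoted different entities in the two frames, the merge would still execute but pair unrelated rows.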
7. The big data engine-based multi-source data unified processing system according to claim 1, wherein: the multi-source data rule setting module sets its rules through the FP-Growth algorithm, which is described as follows:
Input: a transaction database D and a minimum support threshold min_sup;
Output: the complete set of frequent patterns;
Step 1: construct the FP-Tree as follows:
scan the transaction database D once, collecting the set F of frequent items and their supports, and sort F in descending order of support; the result is the frequent-item list L;
create the root node of the FP-Tree, labeled "null", and for each transaction Trans in D perform the following: select the frequent items in Trans and sort them in the order of L; let the sorted frequent-item list be [p|P], where p is the first element and P is the list of remaining elements; call insert_tree([p|P], T), which proceeds as follows: if T has a child N such that N.item-name = p.item-name, increment the count of N by 1; otherwise create a new node N, set its count to 1, link it to its parent node T, and link it via the node-link structure to the nodes with the same item-name; if P is non-empty, recursively call insert_tree(P, N);
Step 2: mine the frequent item sets from the FP-Tree; the pseudo code of the procedure FP-growth(Tree, α) is as follows:
if Tree contains a single path P then
    for each combination β of the nodes in path P
        generate pattern β ∪ α with support = the minimum support of the nodes in β;
else for each ai in the header table of Tree {
    generate pattern β = ai ∪ α with support = ai.support;
    construct β's conditional pattern base and then β's conditional FP-Tree, Treeβ;
    if Treeβ ≠ ∅ then
        call FP-growth(Treeβ, β); }
8. The big data engine-based multi-source data unified processing system according to claim 6, wherein: the data preprocessing module comprises data cleaning and data integration, and the data cleaning comprises the processing of missing values and abnormal values; handling a missing value involves identifying it and then processing it, and in R missing values are identified with the function is.na(). There are three processing methods: deletion, replacement and interpolation. Deletion method: depending on the deletion angle, either observation samples or variables are deleted; deleting observation samples removes rows containing missing values, which in R is done with na.omit(). Replacement method: the replacement rule depends on the variable type; when the variable with the missing value is numerical, the missing value is replaced by the mean of the other values of that variable; when the variable is non-numerical, it is replaced by the median or mode of the other observed values of that variable. Interpolation method: regression interpolation takes the interpolated variable as the dependent variable y, fits the other variables with a regression model, and interpolates the missing value using the lm regression function in R; multiple imputation generates several complete data sets from a data set containing missing values, repeating the process to produce random samples of the missing values, and is performed with the mice package in R.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210643243.1A CN115080565A (en) | 2022-06-08 | 2022-06-08 | Multi-source data unified processing system based on big data engine |
Publications (1)
Publication Number | Publication Date |
---|---|
CN115080565A true CN115080565A (en) | 2022-09-20 |
Family
ID=83251865
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210643243.1A Pending CN115080565A (en) | 2022-06-08 | 2022-06-08 | Multi-source data unified processing system based on big data engine |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115080565A (en) |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115952325A (en) * | 2023-03-09 | 2023-04-11 | 广东创能科技股份有限公司 | Data aggregation method and device based on big data platform |
CN116108476A (en) * | 2022-11-03 | 2023-05-12 | 广东加一信息技术有限公司 | Information security management and monitoring system based on big data |
CN116340975A (en) * | 2023-03-16 | 2023-06-27 | 江苏骏安信息测评认证有限公司 | Cache data safety protection system based on cloud computing |
CN117573655A (en) * | 2024-01-15 | 2024-02-20 | 中国标准化研究院 | Data management optimization method and system based on convolutional neural network |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106202430A (en) * | 2016-07-13 | 2016-12-07 | 武汉斗鱼网络科技有限公司 | Live platform user interest-degree digging system based on correlation rule and method for digging |
CN111708773A (en) * | 2020-08-13 | 2020-09-25 | 江苏宝和数据股份有限公司 | Multi-source scientific and creative resource data fusion method |
CN113468163A (en) * | 2021-09-01 | 2021-10-01 | 南京烽火星空通信发展有限公司 | Multisource heterogeneous public security big data intelligent docking engine system |
Non-Patent Citations (5)
Title |
---|
KEALOO: "Summary of massive data processing methods", 《HTTPS://BLOG.CSDN.NET/QQ_44797267/ARTICLE/DETAILS/120228705》 * |
WU Qixin et al.: "Research on key technologies of timed data acquisition based on the Spring framework" * |
WU Qixin et al.: "Research on key technologies of timed data acquisition based on the Spring framework", 《Computer Knowledge and Technology》 * |
阿里云云栖号: "Dataphin functions: integration — how to extract and aggregate business-system data into the data mid-platform", 《HTTPS://WWW.SOHU.COM/A/483125239_612370》 * |
鱼鱼鱼小昶: "Data mining algorithms revealed — association rule methods", 《HTTPS://BLOG.CSDN.NET/QQ_39391192/ARTICLE/DETAILS/81703706》 * |
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116108476A (en) * | 2022-11-03 | 2023-05-12 | 广东加一信息技术有限公司 | Information security management and monitoring system based on big data |
CN116108476B (en) * | 2022-11-03 | 2023-08-25 | 深圳市和合信诺大数据科技有限公司 | Information security management and monitoring system based on big data |
CN115952325A (en) * | 2023-03-09 | 2023-04-11 | 广东创能科技股份有限公司 | Data aggregation method and device based on big data platform |
CN115952325B (en) * | 2023-03-09 | 2023-05-16 | 广东创能科技股份有限公司 | Data collection method and device based on big data platform |
CN116340975A (en) * | 2023-03-16 | 2023-06-27 | 江苏骏安信息测评认证有限公司 | Cache data safety protection system based on cloud computing |
CN117573655A (en) * | 2024-01-15 | 2024-02-20 | 中国标准化研究院 | Data management optimization method and system based on convolutional neural network |
CN117573655B (en) * | 2024-01-15 | 2024-03-12 | 中国标准化研究院 | Data management optimization method and system based on convolutional neural network |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN115080565A (en) | Multi-source data unified processing system based on big data engine | |
US11582123B2 (en) | Distribution of data packets with non-linear delay | |
US11182098B2 (en) | Optimization for real-time, parallel execution of models for extracting high-value information from data streams | |
CN111339071B (en) | Method and device for processing multi-source heterogeneous data | |
US20210279265A1 (en) | Optimization for Real-Time, Parallel Execution of Models for Extracting High-Value Information from Data Streams | |
Ediger et al. | Tracking structure of streaming social networks | |
Ko et al. | Incremental lossless graph summarization | |
CN107660283A (en) | For realizing the method and system of daily record resolver in Log Analysis System | |
Tavares et al. | Overlapping analytic stages in online process mining | |
Qu et al. | Efficient mining of frequent itemsets using only one dynamic prefix tree | |
Ahsaan et al. | Big data analytics: challenges and technologies | |
Prakash et al. | Big data preprocessing for modern world: opportunities and challenges | |
CN117251414A (en) | Data storage and processing method based on heterogeneous technology | |
CN112906373A (en) | Alarm calculation method and device, electronic equipment and storage medium | |
Wadhera et al. | A systematic Review of Big data tools and application for developments | |
CN113641705B (en) | Marketing disposal rule engine method based on calculation engine | |
EP3380906A1 (en) | Optimization for real-time, parallel execution of models for extracting high-value information from data streams | |
CN109117426A (en) | Distributed networks database query method, apparatus, equipment and storage medium | |
CN111914146A (en) | Business software platform convenient for big data interaction and automatic extraction | |
Bodra | Processing queries over partitioned graph databases: An approach and it’s evaluation | |
US20230060475A1 (en) | Operation data analysis device, operation data analysis system, and operation data analysis method | |
CN115048468A (en) | A rural area integrated service platform for rural area happy | |
Kompalli | Knowledge Discovery Using Data Stream Mining: An Analytical Approach | |
Kompalli | Mining Data Streams |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | ||
RJ01 | Rejection of invention patent application after publication |
Application publication date: 20220920 |