CN117540343B

CN117540343B - Data fusion method and system

Info

Publication number: CN117540343B
Application number: CN202410026687.XA
Authority: CN
Inventors: 叶士飞; 沈鸣飞; 何亮; 刘少梁; 蒋晓军
Original assignee: Suzhou Yuancheng Technology Co ltd
Current assignee: Suzhou Yuancheng Technology Co ltd
Priority date: 2024-01-09
Filing date: 2024-01-09
Publication date: 2024-04-16
Anticipated expiration: 2044-01-09
Also published as: CN117540343A

Abstract

The invention discloses a data fusion method and system. The method comprises the steps of accessing data sources, creating data tables, extracting data, processing data, importing data, and fusing data. Data in a service system can be extracted and processed to finally generate data that is actually useful for the service. The process is efficient, accurate, and reliable, and can meet data processing requirements under large data volumes. By slicing the data in advance, before data fusion, a subset of the data can be obtained, which makes detailed analysis and processing of specific data more convenient, reduces the workload of data analysis, and improves efficiency. The data of the central front-end processor is extracted into a standard repository and preprocessed through the Spark real-time data processing framework and the DataFrame API, which improves both the efficiency of data processing and the accuracy of data analysis.

Description

Data fusion method and system

Technical Field

The invention relates to the technical field of data fusion, in particular to a data fusion method and system.

Background

Data fusion is a technique and method for integrating data from different sources. Its goal is to provide more comprehensive, accurate, and useful information by integrating and analyzing information from multiple data sources. Data fusion can be applied in fields such as business intelligence, medical care, and finance.

In practical applications, data fusion faces several challenges. First, data fusion is complex because it involves multiple data sources and multiple data types. Data quality may vary between sources, with problems of accuracy, integrity, and consistency; if low-quality data is fused into the analysis, inaccurate results and decisions may follow. Second, different data sources may use different data models, formats, and naming conventions, which creates consistency issues that must be resolved to establish a consistent view. In addition, as the amount of data increases, a data fusion system may run into scalability problems, so designing and implementing a solution that can handle large-scale data is an important consideration.

In summary, data fusion technology needs appropriate strategies and measures to address these challenges, with comprehensive consideration of factors such as data quality, security, consistency, performance, and cost.

Disclosure of Invention

The invention aims to provide a data fusion method and a data fusion system, which solve the technical problems in the background technology.

The aim of the invention can be achieved by the following technical scheme:

a data fusion method comprising the steps of:

first step, accessing data source

Data source 1, data source 2, data source 3 and data source N of the service system are connected to the central front-end processor;

second step, data table creation

The central front-end processor creates data inclusion tables in one-to-one correspondence with the data sources, and at the same time establishes a mapping relation between the data inclusion tables and each data source;

third step, data extraction

Extracting corresponding service data in each data source to a corresponding position of a data inclusion table, and extracting newly-added service data in a service system to the data inclusion table of the central front-end processor according to the generation and updating rules of the service data in each data source;

fourth step, data processing

Carrying out standardized conversion, checking and correction, and missing information filling on the service data in the data inclusion table, and then extracting the preprocessed service data into the standard repository;

specifically, the standardized conversion converts the service data into standard data through a pre-trained standard conversion dictionary; the checking and correction is performed on the standard data through the DataFrame API; and the missing information filling uses a pre-designated unique identification code to extract the information corresponding to the missing content and fill it in;

fifth step, data import

Storing the data corresponding to the processed service data in the central front-end processor into the standard repository;

sixth step, data fusion

Extracting the processed service data from the standard repository to the fusion repository according to the service data extraction requirements of the data extractor, and fusing the service data to generate a data fusion extraction table.

As a further scheme of the invention: in the fourth step, the standardized conversion is performed in the following manner:

first, a corresponding standard conversion dictionary is predefined for each field in the data inclusion table, wherein the standard conversion dictionary comprises a plurality of fixed dictionary values that serve as the standard translations of the corresponding field;

then, the contents of the corresponding fields in the data inclusion table are traversed, and the standard translation of each field is obtained using the get() method of the dictionary.
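The dictionary lookup described above can be sketched in plain Python. The field names, the raw codes, and the helper `standardize` are hypothetical; the source only states that each field has a predefined dictionary and that translations are obtained with `get()`.

```python
# Hypothetical standard conversion dictionaries, one per field.
STANDARD_DICTS = {
    "gender": {"M": "male", "F": "female", "1": "male", "2": "female"},
    "marital_status": {"S": "single", "M": "married"},
}


def standardize(record, dictionaries=STANDARD_DICTS, default=None):
    """Return a copy of `record` with dictionary-backed fields translated."""
    result = dict(record)
    for field, mapping in dictionaries.items():
        if field in result:
            # dict.get() returns `default` when the raw value has no
            # standard translation, instead of raising KeyError.
            result[field] = mapping.get(result[field], default)
    return result


row = {"name": "Zhang San", "gender": "1", "marital_status": "X"}
standardized = standardize(row)
```

Values without a standard translation come back as `None` in this sketch, which lets the later null check pick them up; whether the method treats unknown codes this way is an assumption.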

As a further scheme of the invention: in the fourth step, the manner of checking the correction is as follows:

s41, checking null values of data in corresponding fields by using na functions of a DataFrame API, and selecting to delete or fill the null values, wherein the null values represent missing, unknown or inapplicable field data;

s42, checking the data type of the corresponding field by using a dtypes function of the DataFrame API, and if the data type does not meet the requirement, converting the corresponding service data into a specified data type;

s43, converting the time stamp into a date format by utilizing a from_unixtime function of the DataFrame API;

s44, checking the range of the data of the corresponding field by utilizing the between function of the DataFrame API, and if the data is out of range, selecting to delete or fill the data.
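The four checks s41 to s44 can be illustrated without a Spark cluster. The sketch below mirrors them in plain Python on a list of records; in PySpark the corresponding facilities are `DataFrame.na` (`drop`/`fill`), `DataFrame.dtypes`, `pyspark.sql.functions.from_unixtime`, and `Column.between`. The field names `age` and `created`, the fill value, and the choice between deleting and filling are assumptions.

```python
from datetime import datetime, timezone


def check_and_correct(rows, fill_age=0, age_range=(0, 150)):
    """Apply checks analogous to s41-s44 to a list of dict records."""
    cleaned = []
    for row in rows:
        row = dict(row)
        # s41: null check; here the null is filled rather than the row deleted.
        if row.get("age") is None:
            row["age"] = fill_age
        # s42: type check; convert to the required type when it differs.
        if not isinstance(row["age"], int):
            row["age"] = int(row["age"])
        # s43: Unix timestamp (seconds) converted to a date string in UTC.
        row["created"] = datetime.fromtimestamp(
            row["created"], tz=timezone.utc
        ).strftime("%Y-%m-%d")
        # s44: inclusive range check; out-of-range rows are deleted here.
        low, high = age_range
        if low <= row["age"] <= high:
            cleaned.append(row)
    return cleaned


rows = [
    {"age": "42", "created": 0},
    {"age": None, "created": 86400},
    {"age": 999, "created": 0},
]
result = check_and_correct(rows)
```

The third record is dropped by the range check, while the first two are kept after type conversion and null filling.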

As a further scheme of the invention: in the fourth step, the manner of missing information padding is as follows:

when the data of the corresponding field is missing, the content of that field is extracted from other data sources whose rights have been confirmed, using the identification card number as the unique identification code, and is filled into the current field position.
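A minimal sketch of this filling step, assuming each alternative source can be indexed by the unique identification code. The key name `id_number`, the sample sources, and the helper `fill_missing` are hypothetical.

```python
def fill_missing(record, other_sources, key="id_number"):
    """Fill None fields in `record` from other sources sharing the same key."""
    filled = dict(record)
    for field, value in record.items():
        if value is not None:
            continue
        # Try the sources in order and take the first non-null value.
        for source in other_sources:
            candidate = source.get(record[key], {}).get(field)
            if candidate is not None:
                filled[field] = candidate
                break
    return filled


# Hypothetical sources, each indexed by the unique identification code.
source_a = {"ID001": {"name": None, "gender": "female"}}
source_b = {"ID001": {"name": "Li Si", "gender": "female"}}
record = {"id_number": "ID001", "name": None, "gender": None}
filled = fill_missing(record, [source_a, source_b])
```

The order of `other_sources` acts as a priority list in this sketch; the source does not say how competing sources are ranked.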

As a further scheme of the invention: in the sixth step, data is extracted from the standard repository by screening, sorting and grouping the data through SQL statements or a programming language so as to acquire the required information.
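As an illustration of screening, grouping and sorting with an SQL statement, the following sketch uses Python's built-in `sqlite3` module as a stand-in for the standard repository. The table `person` and its columns are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE person (id_number TEXT, gender TEXT, age INTEGER)")
conn.executemany(
    "INSERT INTO person VALUES (?, ?, ?)",
    [
        ("ID001", "female", 30),
        ("ID002", "male", 17),
        ("ID003", "male", 45),
        ("ID004", "female", 52),
    ],
)
# Screening (WHERE), grouping (GROUP BY) and sorting (ORDER BY) in one statement.
summary = conn.execute(
    "SELECT gender, COUNT(*) AS n, AVG(age) AS avg_age "
    "FROM person WHERE age >= 18 GROUP BY gender ORDER BY gender"
).fetchall()
conn.close()
```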

As a further scheme of the invention: when data fusion is performed, all the data inclusion tables are also sliced, and the slicing is performed as follows:

step one, selecting, from the plurality of data inclusion tables, a highly discriminative field as the main field of a data slice, and using that field to define the data slice subsets;

step two, matching the service data extraction requirements with the plurality of data slice subsets respectively, and obtaining the corresponding result sets according to the matching results;

and thirdly, storing all corresponding result sets into a data fusion library.

As a further scheme of the invention: the matching mode in the second step is as follows:

selecting a subset of the data slices;

if the matching results are consistent, mapping the numerical value or interval corresponding to the service data extraction requirement with the dictionary value corresponding to the main field of the data slice;

when the value or interval of the query falls in a certain slice, only the data of the slice is loaded, and the data is used as a result set;

when the value or interval of the query falls in two or more slices, loading the data of the slices and taking the data as a result set;

when the queried value or interval falls in all slices, the data of all slices, that is, the whole data inclusion table, is loaded, which indicates that the slicing has failed;

when the queried value or interval falls outside all slices, the extraction fusion is invalid;

if the matching results are inconsistent, the main fields of the data slices are exchanged.
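The four matching outcomes can be sketched as an interval-overlap test. This assumes range-based slices with inclusive bounds on the main field; the slice names, the date encoding, and the status labels are hypothetical.

```python
def match_slices(slices, query_low, query_high):
    """Return (status, slice names) for an inclusive query interval.

    `slices` maps a slice name to its inclusive (low, high) range.
    """
    hits = [
        name
        for name, (low, high) in slices.items()
        if query_low <= high and low <= query_high  # interval overlap test
    ]
    if not hits:
        return "invalid", []  # the query falls outside all slices
    if len(hits) == len(slices):
        return "slicing_failed", hits  # every slice has to be loaded
    return "ok", hits  # load only the matching slice or slices


slices = {
    "2021": (20210101, 20211231),
    "2022": (20220101, 20221231),
    "2023": (20230101, 20231231),
}
one = match_slices(slices, 20220301, 20220401)
several = match_slices(slices, 20211201, 20220201)
everything = match_slices(slices, 20210101, 20231231)
nothing = match_slices(slices, 20250101, 20250201)
```

A single value is handled by passing the same number as both bounds.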

Anomaly detection is also performed on each data inclusion table during the checking and correction, as follows:

selecting a data inclusion table;

obtaining the number of all fields of the data inclusion table, and first checking each field through the DataFrame API without correcting it;

then obtaining, for each abnormal result, the corresponding fields and their number;

the abnormal results are that the corresponding field contains null values, has a data type that does not meet the requirement, or contains out-of-range data;

the number of fields corresponding to each abnormal result is denoted Yi, where i = 1, 2, …, n, and n is the number of abnormal result types;

the field anomaly analysis value CZ of the data inclusion table is then calculated by a formula [formula not reproduced in the source text];

where βi is the preset scaling factor corresponding to each abnormal result, i = 1, 2, …, n;

the field anomaly analysis value CZ is then compared with a preset anomaly threshold Cy:

if CZ ≥ Cy, the data mapping state between the data inclusion table and the corresponding data source is abnormal; the data inclusion table is then rebuilt, and the corresponding service data in the corresponding data source is extracted again to the corresponding position of the data inclusion table;

if CZ < Cy, the data mapping state between the data inclusion table and the corresponding data source is normal, and each field is then corrected through the DataFrame API;
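The formula for CZ is not reproduced in this text, so the following sketch assumes the simplest form consistent with the definitions of Yi and βi, a weighted sum CZ = Σ βi·Yi. That form, the function names, and the sample values are assumptions for illustration only and may differ from the formula actually used.

```python
def field_anomaly_value(counts, factors):
    """Assumed form: CZ = sum(beta_i * Y_i) over the anomaly types."""
    if len(counts) != len(factors):
        raise ValueError("one scaling factor is needed per anomaly type")
    return sum(beta * y for beta, y in zip(factors, counts))


def mapping_state(counts, factors, threshold):
    """Compare CZ with the preset threshold Cy, as in the two branches above."""
    cz = field_anomaly_value(counts, factors)
    return "abnormal" if cz >= threshold else "normal"


# Y1: fields with nulls, Y2: fields with wrong types, Y3: fields out of range.
counts = [3, 1, 2]
factors = [0.5, 1.0, 2.0]  # hypothetical beta values
cz = field_anomaly_value(counts, factors)
```

With these sample values CZ is 6.5, so a threshold Cy of 6.5 or lower would mark the mapping as abnormal and trigger a rebuild of the data inclusion table.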

a data fusion system is used for realizing the data fusion method.

The invention has the beneficial effects that:

(1) The invention can extract and process the data in the business system to finally generate the data actually useful for business; the flow has the characteristics of high efficiency, accuracy and reliability, and can meet the data processing requirement under large data volume;

(2) According to the invention, the data is sliced in advance before the data fusion, so that a subset of the data can be obtained, and detailed analysis and processing can be more conveniently carried out on specific data, which is helpful for reducing the workload of data analysis and improving the efficiency;

(3) According to the invention, the data of the central front-end processor is extracted into the standard repository and preprocessed through the Spark real-time data processing framework and the DataFrame API, which improves both the efficiency of data processing and the accuracy of data analysis;

(4) The invention can ensure that newly added service data in the service system can be timely extracted into the central front-end processor by configuring the timing extraction strategy, thereby improving the timeliness and usability of the data.

(5) The invention can correct data quickly; automatic anomaly detection and correction improve data quality and reduce the business problems caused by data errors. Specifically, when an anomaly is detected, the data is corrected quickly by rebuilding the corresponding data inclusion table and extracting the corresponding service data in the corresponding data source to the corresponding position of the data inclusion table.

Drawings

The invention is further described below with reference to the accompanying drawings.

FIG. 1 is a flow chart of a data fusion method and system according to the present invention.

FIG. 2 is a flow chart of the slicing process of the data fusion method and system according to the present invention.

Detailed Description

The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.

Example 1

Referring to fig. 1, the present invention provides a data fusion method, which includes the following steps:

first step, accessing data source

second step, data table creation

third step, data extraction

First, the extraction conditions of a data extraction tool are configured, and the corresponding service data in each data source is then extracted by the tool to the corresponding position of the data inclusion table; using a data extraction tool automates the extraction, which improves its efficiency and accuracy, reduces manual errors, and improves the quality and reliability of the data;

the extraction condition configuration refers to obtaining service data in a specified format from a data source and converting the service data into a target format;

then extracting newly added service data in the service system to a data inclusion table of the central front-end processor at regular time according to a pre-configured regular extraction strategy;

the timing extraction strategy specifies the extraction time and frequency, preset according to the generation and update rules of the service data in each data source; configuring it ensures that newly added service data in the service system is extracted into the central front-end processor in time, which improves the timeliness and usability of the data, and it also saves human resources and improves working efficiency.
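One common way to extract only newly added data on each scheduled run is to keep a watermark of the latest record already extracted. The source does not say how new data is identified, so the `updated_at` field, the watermark, and the helper `extract_new_rows` below are assumptions.

```python
def extract_new_rows(source_rows, inclusion_table, watermark):
    """Copy rows newer than `watermark` and return the updated watermark."""
    new_rows = [r for r in source_rows if r["updated_at"] > watermark]
    inclusion_table.extend(new_rows)
    if new_rows:
        watermark = max(r["updated_at"] for r in new_rows)
    return watermark


source = [{"id": 1, "updated_at": 100}, {"id": 2, "updated_at": 200}]
table = []
mark = extract_new_rows(source, table, watermark=0)  # first scheduled run
source.append({"id": 3, "updated_at": 300})
mark = extract_new_rows(source, table, watermark=mark)  # next scheduled run
```

Each call stands for one firing of the timer; the second run copies only the record added since the first.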

Fourth step, data processing

The service data in the data inclusion table is preprocessed using the Spark real-time data processing framework, and the preprocessed service data is then extracted into the standard repository:

the preprocessing is as follows: standardized conversion, null value check, data format check, timestamp check, data range check, correction, and missing information filling are performed using the DataFrame API;

specifically:

s1, standardized conversion:

first, a corresponding standard conversion dictionary is predefined for each field in the data inclusion table, wherein the standard conversion dictionary contains a plurality of fixed dictionary values that serve as the standard translations of fields such as a person's gender, education level, and marital status;

then the contents of the corresponding fields in the data inclusion table are traversed, and the standard translation of each field is obtained using the get() method of the dictionary; this step unifies the representation of the data, which facilitates subsequent data processing and analysis;

s2, checking and correcting

Firstly, checking null values of data in corresponding fields by using na functions of a DataFrame API, and selecting to delete or fill the null values, wherein the null values represent missing, unknown or inapplicable field data;

then checking the data type of the corresponding field by using the dtypes function of the DataFrame API, and if the data type does not meet the requirement, converting the corresponding service data into the appointed data type;

converting the time stamp into a date format by utilizing a from_unixtime function of the DataFrame API;

then checking the range of the data of the corresponding field by utilizing the between function of the DataFrame API, and if the data is out of range, selecting to delete or fill the data;

the data of the central front-end processor is extracted into the standard repository and preprocessed through the Spark real-time data processing framework and the DataFrame API, which improves both the efficiency of data processing and the accuracy of data analysis; since the Spark real-time data processing framework and the DataFrame API are prior art, they are not described in detail here;

s3, filling missing information

The content of the corresponding field, such as a person's name, gender and other basic information, is extracted from the corresponding field data and filled into the current field position; this improves the integrity and accuracy of the data and reduces missing and erroneous data;

fifth step, data import

sixth step, data fusion

According to the service data extraction requirements of the data extractor, the processed service data is extracted from the standard repository to the fusion repository and fused to generate a data fusion extraction table;

in this embodiment, data is extracted from the standard repository by screening, sorting and grouping the data through SQL statements or a programming language (such as Python) to obtain the required information;

in this embodiment, data fusion is performed by joining multiple tables on the association keys between them to generate a summary table, wherein the association keys are preset by the central front-end processor when the data inclusion tables are created;
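Joining tables on an association key can be sketched as a hash join over lists of records. The inner-join semantics, the key name `id_number`, and the sample tables are assumptions; the source only states that tables are connected through preset association keys.

```python
def fuse(left_rows, right_rows, key):
    """Inner-join two lists of dict records on a shared association key."""
    index = {}
    for row in right_rows:
        index.setdefault(row[key], []).append(row)
    fused = []
    for left in left_rows:
        # Rows without a partner on the other side are left out (inner join).
        for right in index.get(left[key], []):
            fused.append({**left, **right})
    return fused


persons = [
    {"id_number": "ID001", "name": "Li Si"},
    {"id_number": "ID002", "name": "Wang Wu"},
]
insurance = [{"id_number": "ID001", "plan": "basic"}]
fusion_table = fuse(persons, insurance, key="id_number")
```

Fusing more than two tables amounts to applying `fuse` repeatedly, feeding each result into the next join.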

Example 2

Referring to FIG. 2, the second embodiment of the present invention differs from the first embodiment only in the following:

in this embodiment, all the data inclusion tables are also sliced when data fusion is performed; common data slicing algorithms include Hash (List), Range, and Tag, which divide the data according to different rules so that it can be managed and processed in a distributed system;

the slicing is performed as follows:

step one, selecting, from the plurality of data inclusion tables, a highly discriminative field, such as a timestamp or service type field, as the main field of a data slice, and using that field to define the data slice subsets;

step two, matching the service data extraction requirements with a plurality of data slice subsets respectively:

taking a subset of data slices as an example:

when the queried value or interval falls in all slices, the data of all slices, that is, the whole data inclusion table, is loaded; this indicates that the slicing has failed, and the slicing is then performed again with increased granularity;

if the matching results are inconsistent, exchanging the main fields of the data slices;

step three, obtaining all corresponding result sets and storing the result sets into a data fusion library;

in this embodiment, data tables with a large data volume are sliced and the resulting data subsets are stored separately, which solves the problem in the traditional approach that fusion efficiency drops rapidly during data fusion when a table holds a large amount of data.

Example 3

In the third embodiment of the present invention, the solutions of the first and second embodiments are implemented in combination, and the present embodiment differs from them only in that anomaly detection is also performed on each data inclusion table during the checking and correction, as follows:

selecting a data inclusion table;

Example 4

In the fourth embodiment of the present invention, the solutions of the first, second and third embodiments are implemented in combination.

This embodiment can correct data quickly; automatic anomaly detection and correction improve data quality and reduce the business problems caused by data errors. Specifically, when an anomaly is detected, the data is corrected quickly by rebuilding the corresponding data inclusion table and extracting the corresponding service data in the corresponding data source to the corresponding position of the data inclusion table.

By slicing the data in advance, before data fusion, the invention can obtain a subset of the data, which makes detailed analysis and processing of specific data more convenient, reduces the workload of data analysis, and improves efficiency; the characteristics and distribution of each subset can be understood more clearly, so that the overall situation of the data can be better understood;

in addition, the data slicing is convenient for sharing and collaboration of the data, and when the data is required to be shared or collaborated, the data can be sliced into different parts and provided for required personnel or teams; therefore, the risk of data leakage can be reduced, and the access and utilization of the data can be better controlled;

according to the method and the device, data in the service system can be extracted and processed, and finally data actually useful for the service are generated. The flow has the characteristics of high efficiency, accuracy and reliability, and can meet the data processing requirement under large data volume; meanwhile, the efficiency and performance of data fusion are improved in a slicing mode.

The invention also provides a data fusion system for realizing the data fusion method, the system comprising:

the central front-end processor is used for creating a data table and carrying out data extraction and data processing on service data;

the standard repository is used for storing the business data processed by the central front-end processor;

the fusion library is used for fusing the corresponding extracted service data according to the service data extraction requirements of the extractors, obtaining corresponding fusion extraction tables and displaying the fusion extraction tables to the extractors;

The foregoing is merely specific embodiments of the present application, but the scope of the present application is not limited thereto; any changes or substitutions that a person skilled in the art could readily conceive of within the technical scope of the present application are intended to be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims

1. A method of data fusion comprising the steps of:

first, data source 1, data source 2, data source 3 and data source N of a service system are connected to a central front-end processor, the data sources being used for business intelligence, medical care and finance; the central front-end processor creates data inclusion tables according to the data sources, and at the same time establishes a mapping relation between the data inclusion tables and each data source;

step two, extracting the corresponding service data in each data source to the corresponding position of the data inclusion table, and extracting the newly added service data in the service system to the data inclusion table of the central front-end processor according to the generation and updating rules of the service data in each data source;

thirdly, carrying out standardized conversion, checking and correction, and missing information filling on the service data in the data inclusion table, and then extracting the preprocessed service data into the standard repository; wherein the standardized conversion converts the service data into standard data through a pre-trained standard conversion dictionary, the checking and correction is performed on the standard data through the DataFrame API, and the missing information filling uses a pre-designated unique identification code to extract the information corresponding to the missing content and fill it in;

fourth, the data inclusion table corresponding to the service data processed in the central front-end processor is stored in the standard repository, the processed service data is extracted from the standard repository to the fusion repository according to the service data extraction requirements of a data extractor, and fusion is carried out to generate a data fusion extraction table;

anomaly detection is also performed on each data inclusion table during the checking and correction, as follows:

selecting a data inclusion table;

the method is also used for slicing all the data inclusion tables, and the slicing method is as follows:

step two, matching the service data extraction requirements with a plurality of data slice subsets respectively, and obtaining a corresponding result set according to a matching result;

the matching mode is as follows:

selecting a subset of the data slices;

when the value or interval of the query falls in two or more slices, loading data corresponding to the two or more slices, and taking the data as a result set;

and thirdly, storing all corresponding result sets into a data fusion library.

2. A data fusion method according to claim 1, wherein in the third step, the standardized transformation is performed by:

3. A data fusion method according to claim 1, characterized in that in the third step, the way of checking the correction is as follows:

s41, checking null values of data in corresponding fields by using a DataFrame API, and selecting to delete or fill the null values, wherein the null values represent missing, unknown or inapplicable field data;

s42, checking the data type of the corresponding field by using a DataFrame API, and if the data type does not meet the requirement, converting the corresponding service data into a specified data type;

s43, converting the time stamp into a date format by using a DataFrame API;

s44, checking the range of the data of the corresponding field by using the DataFrame API, and if the data is out of range, selecting to delete or fill the data.

4. A data fusion method according to claim 1, characterized in that in the third step the missing information is filled in the following way:

5. A data fusion method according to claim 1, wherein in the fourth step, the data in the standard repository is extracted by filtering, sorting and grouping the data by SQL statement or programming language to obtain the required information.

6. A data fusion method according to claim 1, wherein if the matching results are inconsistent, the data slice main field is swapped.

7. A data fusion system, characterized in that the system is realized by a data fusion method according to any one of claims 1-6.