CN117540343B - Data fusion method and system - Google Patents

Publication number
CN117540343B
Authority
CN
China
Prior art keywords
data
field
fusion
service
standard
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202410026687.XA
Other languages
Chinese (zh)
Other versions
CN117540343A (en)
Inventor
叶士飞
沈鸣飞
何亮
刘少梁
蒋晓军
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suzhou Yuancheng Technology Co ltd
Original Assignee
Suzhou Yuancheng Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Suzhou Yuancheng Technology Co ltd
Priority to CN202410026687.XA
Publication of CN117540343A
Application granted
Publication of CN117540343B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G06F18/251 Fusion techniques of input or preprocessed data
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21 Design, administration or maintenance of databases
    • G06F16/215 Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22 Indexing; Data structures therefor; Storage structures
    • G06F16/2282 Tablespace storage structures; Management thereof
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/25 Integrating or interfacing systems involving database management systems
    • G06F16/258 Data format conversion from or to a database
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/10 Pre-processing; Data cleansing
    • G06F18/15 Statistical pre-processing, e.g. techniques for normalisation or restoring missing data

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Probability & Statistics with Applications (AREA)
  • Quality & Reliability (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a data fusion method and system comprising the steps of accessing data sources, creating data tables, extracting data, processing data, importing data, and fusing data. Data in the service system is extracted and processed to generate data that is actually useful for the service. The process is efficient, accurate, and reliable, and can meet data processing requirements at large data volumes. Slicing the data before fusion yields subsets of the data, which makes detailed analysis and processing of specific data more convenient, reduces the data analysis workload, and improves efficiency. The data of the central front-end processor is extracted into the standard repository and preprocessed by means of the Spark real-time data processing framework and the DataFrame API, which improves both the efficiency of data processing and the accuracy of data analysis.

Description

Data fusion method and system
Technical Field
The invention relates to the technical field of data fusion, in particular to a data fusion method and system.
Background
Data fusion is a technique and method of integrating data from different sources. Its goal is to provide more comprehensive, accurate and useful information by integrating and analyzing information from multiple data sources. The data fusion can be applied to various fields such as business intelligence, medical care, finance and the like.
In practical applications, data fusion faces some challenges and drawbacks. First, data fusion is more complex due to the multiple data sources and multiple data types involved. The data quality of different data sources may vary, including problems with accuracy, integrity, and consistency. If low quality data is fused into the analysis, inaccurate results and decisions may result; second, data in different data sources may use different data models, formats, and naming conventions. This presents consistency issues for data fusion, which need to be addressed to establish a consistent view; in addition, as the amount of data increases, the data fusion system may suffer from scalability issues. Designing and implementing a solution that can handle large-scale data is an important consideration.
In summary, data fusion technology must adopt appropriate strategies and measures to address these challenges while comprehensively weighing factors such as data quality, security, consistency, performance, and cost.
Disclosure of Invention
The invention aims to provide a data fusion method and a data fusion system, which solve the technical problems in the background technology.
The aim of the invention can be achieved by the following technical scheme:
a data fusion method comprising the steps of:
first step, accessing data sources
Data source 1, data source 2, data source 3, ..., data source N of the service system are connected to the central front-end processor;
second step, data table creation
The central front-end processor creates one data inclusion table for each data source (1:1), and establishes a mapping relation between each data inclusion table and its data source;
third step, data extraction
Extracting the corresponding service data in each data source to the corresponding position of the data inclusion table, and extracting newly added service data in the service system to the data inclusion table of the central front-end processor according to the generation and updating rules of the service data in each data source;
fourth step, data processing
Carrying out standardized conversion, checking and correction, and missing information filling on the service data in the data inclusion table, and then extracting the preprocessed service data into the standard repository;
wherein the standardized conversion converts the service data into standard data through a predefined standard conversion dictionary; the checking and correction is performed on the standard data through the DataFrame API; and the missing information filling uses a pre-designated unique identification code to extract the information corresponding to the missing content and fill it in;
fifth step, data import
Storing the service data processed in the central front-end processor into the standard repository;
sixth step, data fusion
Extracting the processed service data from the standard repository to the fusion repository according to the service data extraction requirements of the data extractor, and fusing the service data to generate a data fusion extraction table.
As a further scheme of the invention: in the fourth step, the standardized conversion is performed in the following manner:
firstly, predefining corresponding standard conversion dictionary for each field in a data inclusion table, wherein the standard conversion dictionary comprises a plurality of fixed dictionary values as standard translations of the corresponding fields;
then, the contents of the corresponding fields contained in the data inclusion table are traversed, and standard translation of each field is acquired by using the get () method of the dictionary.
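The dictionary-based conversion described above can be sketched in plain Python as follows. This is a minimal illustration only: the field names, the raw codes, and the helper `standardize_record` are hypothetical and do not come from the patent; the only element taken from the text is the use of the dictionary `get()` method to look up the standard translation.

```python
# Hypothetical standard conversion dictionaries, one per field.
# Field names and code values are illustrative assumptions.
STANDARD_DICTIONARIES = {
    "gender": {"M": "male", "F": "female", "1": "male", "2": "female"},
    "marital_status": {"S": "single", "M": "married", "D": "divorced"},
}


def standardize_record(record, dictionaries=STANDARD_DICTIONARIES):
    """Return a copy of `record` with dictionary-backed fields translated."""
    converted = dict(record)
    for field, mapping in dictionaries.items():
        if field in converted:
            raw = converted[field]
            # dict.get() returns the standard translation, or the raw value
            # unchanged when the dictionary has no entry for it.
            converted[field] = mapping.get(raw, raw)
    return converted


rows = [
    {"id": "A1", "gender": "M", "marital_status": "S"},
    {"id": "A2", "gender": "2", "marital_status": "X"},
]
standardized = [standardize_record(r) for r in rows]
```

Falling back to the raw value for unknown codes is a design choice of this sketch; an implementation could equally flag such values for the later checking step.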
As a further scheme of the invention: in the fourth step, the manner of checking the correction is as follows:
S41, checking the data in the corresponding field for null values by using the na functions of the DataFrame API, and choosing to delete or fill the null values, wherein a null value represents missing, unknown, or inapplicable field data;
S42, checking the data type of the corresponding field by using the dtypes function of the DataFrame API, and if the data type does not meet the requirement, converting the corresponding service data into the specified data type;
S43, converting timestamps into a date format by using the from_unixtime function of the DataFrame API;
S44, checking the range of the data in the corresponding field by using the between function of the DataFrame API, and if the data is out of range, choosing to delete or fill the data.
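The four checks S41 to S44 can be illustrated without a Spark cluster by the plain-Python sketch below, which applies the same logic to a single record. It is not the patent's Spark code: in Spark the corresponding calls would be the DataFrame `na` functions, `dtypes`, `from_unixtime`, and `Column.between`. The field names (`name`, `age`, `created_ts`), the fill values, the age range, and the use of UTC are all assumptions made for the example.

```python
from datetime import datetime, timezone


def check_and_correct(row, *, default_age=None, age_range=(0, 120)):
    """Apply checks S41-S44 to one record held as a plain dict."""
    fixed = dict(row)

    # S41: null check; here a missing name is filled with a placeholder.
    if fixed.get("name") is None:
        fixed["name"] = "unknown"

    # S42: type check; convert age to int when it arrives as a digit string.
    age = fixed.get("age")
    if isinstance(age, str) and age.strip().isdigit():
        age = int(age)
    fixed["age"] = age

    # S43: convert a Unix timestamp in seconds to a date string (UTC assumed).
    ts = fixed.get("created_ts")
    if ts is not None:
        fixed["created_date"] = datetime.fromtimestamp(
            ts, tz=timezone.utc
        ).strftime("%Y-%m-%d")

    # S44: range check; non-integer or out-of-range ages get the fill value.
    low, high = age_range
    if not isinstance(age, int) or not (low <= age <= high):
        fixed["age"] = default_age
    return fixed


cleaned = check_and_correct({"name": None, "age": "34", "created_ts": 0})
```

The sketch always fills rather than deletes; the text allows either, so a variant could return `None` for records that should be dropped.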
As a further scheme of the invention: in the fourth step, the manner of missing information padding is as follows:
when the data of a corresponding field is missing, the content of that field is extracted from other rights-confirmed data sources, using the identification card number as the unique identification code, and is filled into the current field position.
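A minimal sketch of this filling step, assuming the other data source has already been loaded into a dictionary keyed by the identification number. The key name `id_number`, the sample records, and the helper `fill_missing` are hypothetical.

```python
def fill_missing(records, reference_by_id, key="id_number"):
    """Fill None fields from a reference source looked up by a unique ID."""
    filled = []
    for rec in records:
        out = dict(rec)
        # Look up the same person in the reference source; empty if absent.
        ref = reference_by_id.get(out.get(key), {})
        for field, value in list(out.items()):
            if value is None and ref.get(field) is not None:
                out[field] = ref[field]
        filled.append(out)
    return filled


reference = {"ID001": {"name": "Zhang San", "gender": "male"}}
records = [
    {"id_number": "ID001", "name": None, "gender": "male"},
    {"id_number": "ID999", "name": None, "gender": None},
]
completed = fill_missing(records, reference)
```

Records whose identifier is not found in the reference source are left unchanged, so missing values stay visible to later checks.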
As a further scheme of the invention: in the sixth step, data is extracted from the standard repository by screening, sorting, and grouping the data through SQL statements or a programming language so as to acquire the required information.
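As an illustration of screening, sorting, and grouping with SQL, the sketch below uses Python's built-in `sqlite3` module as a stand-in for the standard repository. The table name `standard_records`, its columns, and the sample rows are assumptions made for the example; the patent does not specify a schema or a database engine.

```python
import sqlite3

# In-memory database standing in for the standard repository (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE standard_records (region TEXT, service_type TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO standard_records VALUES (?, ?, ?)",
    [
        ("east", "loan", 100.0),
        ("east", "loan", 50.0),
        ("west", "loan", 70.0),
        ("west", "deposit", 999.0),
    ],
)

rows = conn.execute(
    """
    SELECT region, COUNT(*) AS n, SUM(amount) AS total
    FROM standard_records
    WHERE service_type = 'loan'   -- screening
    GROUP BY region               -- grouping
    ORDER BY total DESC           -- sorting
    """
).fetchall()
conn.close()
```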
As a further scheme of the invention: during data fusion, the method further slices all the data inclusion tables, and the slicing is performed as follows:
step one, selecting, from the plurality of data inclusion tables, a highly discriminative field as the main field of the data slices, and using that field to define the data slice subsets;
step two, matching the service data extraction requirements against the plurality of data slice subsets respectively, and obtaining the corresponding result sets according to the matching results;
step three, storing all the corresponding result sets into the data fusion library.
As a further scheme of the invention: the matching mode in the second step is as follows:
selecting a subset of the data slices;
if the matching results are consistent, mapping the value or interval specified by the service data extraction requirement to the dictionary values of the main field of the data slice;
when the queried value or interval falls within a single slice, only the data of that slice is loaded and used as the result set;
when the queried value or interval falls within two or more slices, the data of those slices is loaded and used as the result set;
when the queried value or interval falls within all slices, the data of all slices, that is, the whole data inclusion table, is loaded; this also indicates that the slicing has failed;
when the queried value or interval falls outside all slices, the extraction and fusion is invalid;
if the matching results are inconsistent, the main field of the data slices is replaced.
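The four interval cases above can be sketched as follows, assuming range slices on a numeric main field with inclusive bounds. The slice layout, the status labels, and the function `match_slices` are hypothetical, and the sketch does not model the replacement of the main field when the matching results are inconsistent.

```python
def match_slices(slices, query_low, query_high):
    """Select the slices whose range overlaps the queried interval.

    `slices` maps a slice name to (low, high, rows) with inclusive bounds.
    """
    hit = [
        name
        for name, (low, high, _rows) in slices.items()
        if low <= query_high and query_low <= high
    ]
    if not hit:
        # The query falls outside all slices: extraction and fusion is invalid.
        return {"status": "invalid", "slices": [], "rows": []}
    rows = [row for name in hit for row in slices[name][2]]
    # Hitting every slice means the whole table is loaded, so slicing failed.
    status = "slicing_failed" if len(hit) == len(slices) else "ok"
    return {"status": status, "slices": hit, "rows": rows}


# Hypothetical slices keyed on a month number as the main field.
slices = {
    "q1": (1, 3, ["jan", "feb", "mar"]),
    "q2": (4, 6, ["apr", "may", "jun"]),
    "q3": (7, 9, ["jul", "aug", "sep"]),
}
one = match_slices(slices, 2, 2)
two = match_slices(slices, 3, 5)
everything = match_slices(slices, 1, 9)
none = match_slices(slices, 10, 12)
```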
The method further performs anomaly detection on each data inclusion table during the checking and correction, as follows:
selecting a data inclusion table;
acquiring the number of all fields of the data inclusion table, and first checking each field through the DataFrame API without correcting it;
then obtaining, for each abnormal result, the corresponding fields and their number;
wherein the abnormal results are: the corresponding field contains a null value, a data type that does not meet the requirement, or out-of-range data;
marking the number of fields corresponding to each abnormal result as Yi, wherein i = 1, 2, ..., n, and n represents the number of types of abnormal result;
then calculating the field anomaly analysis value CZ of the data inclusion table from the field counts Yi and the scaling factors βi;
wherein βi represents the preset scaling factor corresponding to each abnormal result, i = 1, 2, ..., n;
the field anomaly analysis value CZ is then compared with a preset anomaly threshold Cy:
if CZ ≥ Cy, the data mapping state between the data inclusion table and the corresponding data source is abnormal; the data inclusion table is then rebuilt, and the corresponding service data in the data source is extracted again to the corresponding position of the data inclusion table;
if CZ < Cy, the data mapping state between the data inclusion table and the corresponding data source is normal, and each field is then corrected through the DataFrame API;
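The threshold decision can be sketched as below. The formula for CZ is not reproduced in this text, so the sketch assumes a weighted sum of the field counts, CZ = β1·Y1 + β2·Y2 + ... + βn·Yn; that form, the sample weights, the threshold, and the function names are all assumptions and should not be read as the patent's actual formula.

```python
def field_anomaly_value(counts, weights):
    """ASSUMED form of CZ: the weighted sum of abnormal-field counts."""
    if len(counts) != len(weights):
        raise ValueError("one scaling factor is required per abnormal result type")
    return sum(beta * y for beta, y in zip(weights, counts))


def decide(counts, weights, threshold):
    """Compare CZ with the threshold Cy and return the follow-up action."""
    cz = field_anomaly_value(counts, weights)
    # CZ >= Cy: mapping is abnormal, so rebuild the table and re-extract.
    # CZ <  Cy: mapping is normal, so go on to correct the fields.
    return "rebuild_table" if cz >= threshold else "correct_fields"


# Counts for (null value, wrong data type, out of range); values illustrative.
weights = [0.5, 0.25, 0.25]
action_bad = decide([2, 4, 0], weights, threshold=2.0)
action_ok = decide([1, 0, 0], weights, threshold=2.0)
```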
A data fusion system is used for implementing the above data fusion method.
The invention has the beneficial effects that:
(1) The invention can extract and process the data in the business system to finally generate the data actually useful for business; the flow has the characteristics of high efficiency, accuracy and reliability, and can meet the data processing requirement under large data volume;
(2) According to the invention, the data is sliced in advance before the data fusion, so that a subset of the data can be obtained, and detailed analysis and processing can be more conveniently carried out on specific data, which is helpful for reducing the workload of data analysis and improving the efficiency;
(3) According to the invention, the data of the central front-end processor is extracted into the standard repository and preprocessed by means of the Spark real-time data processing framework and the DataFrame API, which improves both the efficiency of data processing and the accuracy of data analysis;
(4) The invention can ensure that newly added service data in the service system can be timely extracted into the central front-end processor by configuring the timing extraction strategy, thereby improving the timeliness and usability of the data.
(5) The invention can quickly correct data, and can improve the quality of the data and reduce the business problem caused by data errors by automatic anomaly detection and correction, and specifically comprises the following steps: when an abnormality is detected, the data can be quickly corrected by reestablishing a corresponding data inclusion table and extracting corresponding service data in a corresponding data source to the corresponding position of the data inclusion table.
Drawings
The invention is further described below with reference to the accompanying drawings.
FIG. 1 is a flow chart of a data fusion method and system according to the present invention.
FIG. 2 is a flow chart of a data fusion method and system slicing process according to the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Example 1
Referring to fig. 1, the present invention provides a data fusion method, which includes the following steps:
first step, accessing data source
Data source 1, data source 2, data source 3, ..., data source N of the service system are connected to the central front-end processor;
second step, data table creation
The central front-end processor creates one data inclusion table for each data source (1:1), and establishes a mapping relation between each data inclusion table and its data source;
third step, data extraction
Firstly, the extraction conditions of a data extraction tool are configured, and the corresponding service data in each data source is then extracted to the corresponding position of the data inclusion table by the data extraction tool; using a data extraction tool automates the extraction, which improves its efficiency and accuracy, reduces manual errors, and improves the quality and reliability of the data;
the extraction condition configuration refers to obtaining service data in a specified format from a data source and converting it into the target format;
newly added service data in the service system is then extracted to the data inclusion table of the central front-end processor at scheduled times according to a pre-configured timed extraction strategy;
the timed extraction strategy is the extraction time and frequency preset according to the generation and update rules of the service data in each data source; configuring the timed extraction strategy ensures that newly added service data in the service system is extracted into the central front-end processor in time, which improves the timeliness and usability of the data; it also saves human resources and improves working efficiency.
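One common way to realize such a timed extraction is to remember the latest update time already extracted (a watermark) and pull only newer rows on each scheduled run. The sketch below illustrates that idea; the `updated_at` field, the integer timestamps, and the function `extract_new_rows` are assumptions, and the patent does not state that a watermark is used.

```python
def extract_new_rows(source_rows, inclusion_table, last_extracted_at):
    """Append rows newer than the watermark and return the new watermark."""
    new_rows = [r for r in source_rows if r["updated_at"] > last_extracted_at]
    inclusion_table.extend(new_rows)
    # Keep the old watermark when nothing new arrived.
    return max((r["updated_at"] for r in new_rows), default=last_extracted_at)


source = [{"id": 1, "updated_at": 10}, {"id": 2, "updated_at": 20}]
table = []

# First scheduled run: everything is new.
watermark = extract_new_rows(source, table, last_extracted_at=0)

# New service data arrives before the second scheduled run.
source.append({"id": 3, "updated_at": 30})
watermark = extract_new_rows(source, table, watermark)

# A third run with no new data leaves the table unchanged.
watermark = extract_new_rows(source, table, watermark)
```

In practice the function would be invoked by a scheduler at the configured time and frequency; the three direct calls here only simulate successive runs.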
Fourth step, data processing
Preprocessing the service data in the data inclusion table by using the Spark real-time data processing framework, and then extracting the preprocessed service data into the standard repository:
the preprocessing is as follows: standardized conversion, null value checking, data format checking, timestamp checking, data range checking, correction, and missing information filling are carried out by using the DataFrame API;
the specific steps are as follows:
S1, standardized conversion:
first, a corresponding standard conversion dictionary is predefined for each field in the data inclusion table, wherein the standard conversion dictionary contains a plurality of fixed dictionary values, namely the standard translations of the corresponding field, for fields such as gender, education level, and marital status;
then the contents of the corresponding fields in the data inclusion table are traversed, and the standard translation of each field is obtained by using the get() method of the dictionary; this step unifies the representation of the data, which facilitates subsequent data processing and analysis;
S2, checking and correction
Firstly, checking the data in the corresponding field for null values by using the na functions of the DataFrame API, and choosing to delete or fill the null values, wherein a null value represents missing, unknown, or inapplicable field data;
then checking the data type of the corresponding field by using the dtypes function of the DataFrame API, and if the data type does not meet the requirement, converting the corresponding service data into the specified data type;
then converting timestamps into a date format by using the from_unixtime function of the DataFrame API;
then checking the range of the data in the corresponding field by using the between function of the DataFrame API, and if the data is out of range, choosing to delete or fill the data;
the data of the central front-end processor is extracted into the standard repository and preprocessed by means of the Spark real-time data processing framework and the DataFrame API, which improves both the efficiency of data processing and the accuracy of data analysis; since the Spark real-time data processing framework and the DataFrame API are prior art, they are not described in detail here;
S3, missing information filling
The content of the corresponding field, such as a person's name, gender, and other basic information, is extracted from the corresponding field data and filled into the current field position; this improves the integrity and accuracy of the data and reduces data loss and errors;
fifth step, data import
Storing the service data processed in the central front-end processor into the standard repository;
sixth step, data fusion
According to the service data extraction requirements of the data extractor, the processed service data is extracted from the standard repository to the fusion repository and fused to generate a data fusion extraction table;
in this embodiment, data is extracted from the standard repository by screening, sorting, and grouping the data through SQL statements or a programming language (such as Python) to obtain the required information;
in this embodiment, the data is fused by joining multiple tables on the association keys between them to generate a summary table, wherein the association keys are preset by the central front-end processor when the data inclusion tables are created;
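Joining tables on an association key can be sketched in plain Python as an inner hash join. The table contents, the key name `person_id`, and the function `fuse_tables` are hypothetical, and the choice of an inner join (rather than an outer join) is an assumption, since the text does not say how unmatched rows are handled.

```python
def fuse_tables(left, right, key):
    """Inner-join two lists of dict rows on `key` into one summary table."""
    # Index the right table by key so each left row is matched in O(1).
    index = {}
    for row in right:
        index.setdefault(row[key], []).append(row)

    fused = []
    for lrow in left:
        for rrow in index.get(lrow[key], []):
            # On a column name clash, the left table's value is kept.
            fused.append({**rrow, **lrow})
    return fused


persons = [
    {"person_id": 1, "name": "A"},
    {"person_id": 2, "name": "B"},
]
accounts = [
    {"person_id": 1, "balance": 10},
    {"person_id": 1, "balance": 20},
    {"person_id": 3, "balance": 5},
]
summary = fuse_tables(persons, accounts, key="person_id")
```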
Example 2
Referring to fig. 2, the technical solution of this second embodiment differs from the first embodiment only in the following:
in this embodiment, all the data inclusion tables are additionally sliced during data fusion; common data slicing algorithms include hash (Hash), range (Range), and tag (Tag) slicing, which divide the data according to different rules so that it can be managed and processed in a distributed system;
the slicing is performed as follows:
step one, selecting, from the plurality of data inclusion tables, a highly discriminative field, such as a timestamp or service type field, as the main field of the data slices, and using that field to define the data slice subsets;
step two, matching the service data extraction requirements with a plurality of data slice subsets respectively:
taking a subset of data slices as an example:
if the matching results are consistent, mapping the value or interval specified by the service data extraction requirement to the dictionary values of the main field of the data slice;
when the queried value or interval falls within a single slice, only the data of that slice is loaded and used as the result set;
when the queried value or interval falls within two or more slices, the data of those slices is loaded and used as the result set;
when the queried value or interval falls within all slices, the data of all slices, that is, the whole data inclusion table, is loaded; this also indicates that the slicing has failed, and the slicing granularity is then increased and the slicing is performed again;
when the queried value or interval falls outside all slices, the extraction and fusion is invalid;
if the matching results are inconsistent, the main field of the data slices is replaced;
step three, obtaining all the corresponding result sets and storing them into the data fusion library;
in this embodiment, data tables with a large data volume are sliced and the resulting data subsets are stored separately, which solves the problem in the traditional approach that fusion efficiency drops rapidly during data fusion when a table's data volume is large.
Example 3
The technical solution of this third embodiment combines the solutions of the first and second embodiments, and differs from them only in that anomaly detection is additionally performed on each data inclusion table during the checking and correction, as follows:
selecting a data inclusion table;
acquiring the number of all fields of the data inclusion table, and first checking each field through the DataFrame API without correcting it;
then obtaining, for each abnormal result, the corresponding fields and their number;
wherein the abnormal results are: the corresponding field contains a null value, a data type that does not meet the requirement, or out-of-range data;
marking the number of fields corresponding to each abnormal result as Yi, wherein i = 1, 2, ..., n, and n represents the number of types of abnormal result;
then calculating the field anomaly analysis value CZ of the data inclusion table from the field counts Yi and the scaling factors βi;
wherein βi represents the preset scaling factor corresponding to each abnormal result, i = 1, 2, ..., n;
the field anomaly analysis value CZ is then compared with a preset anomaly threshold Cy:
if CZ ≥ Cy, the data mapping state between the data inclusion table and the corresponding data source is abnormal; the data inclusion table is then rebuilt, and the corresponding service data in the data source is extracted again to the corresponding position of the data inclusion table;
if CZ < Cy, the data mapping state between the data inclusion table and the corresponding data source is normal, and each field is then corrected through the DataFrame API;
Example 4
The technical solution of this fourth embodiment combines the solutions of the first, second, and third embodiments.
The embodiment can quickly correct data, can improve the quality of the data and reduce the business problem caused by data errors by automatic anomaly detection and correction, and specifically comprises the following steps: when an abnormality is detected, the data can be quickly corrected by reestablishing a corresponding data inclusion table and extracting corresponding service data in a corresponding data source to the corresponding position of the data inclusion table.
The invention has the advantage that slicing the data before data fusion yields subsets of the data, which makes detailed analysis and processing of specific data more convenient, reduces the data analysis workload, and improves efficiency; the characteristics and distribution of each subset can also be understood more clearly, giving a better understanding of the data as a whole;
in addition, the data slicing is convenient for sharing and collaboration of the data, and when the data is required to be shared or collaborated, the data can be sliced into different parts and provided for required personnel or teams; therefore, the risk of data leakage can be reduced, and the access and utilization of the data can be better controlled;
according to the method and the device, data in the service system can be extracted and processed, and finally data actually useful for the service are generated. The flow has the characteristics of high efficiency, accuracy and reliability, and can meet the data processing requirement under large data volume; meanwhile, the efficiency and performance of data fusion are improved in a slicing mode.
The invention also provides a data fusion system for implementing the above data fusion method, the system comprising:
the central front-end processor is used for creating a data table and carrying out data extraction and data processing on service data;
the standard repository is used for storing the business data processed by the central front-end processor;
the fusion library is used for fusing the corresponding extracted service data according to the service data extraction requirements of the extractors, obtaining corresponding fusion extraction tables and displaying the fusion extraction tables to the extractors;
the foregoing is merely specific embodiments of the present application, but the scope of the present application is not limited thereto, and any person skilled in the art can easily think about changes or substitutions within the technical scope of the present application, and the changes and substitutions are intended to be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (7)

1. A data fusion method, comprising the following steps:
step one, connecting data source 1, data source 2, data source 3 through data source N of a business system to a central front-end processor, the data sources being used for business intelligence, medical care and finance; the central front-end processor creates a data inclusion table according to the data sources and, at the same time, establishes a mapping relation between the data inclusion table and each data source;
step two, extracting the corresponding service data in each data source to the corresponding position of the data inclusion table, and extracting the newly added service data in the business system to the data inclusion table of the central front-end processor according to the generation and updating rules of the service data in each data source;
step three, performing standardized conversion, checking and correction, and missing-information filling on the service data in the data inclusion table, and then extracting the preprocessed service data into a standard repository; wherein the standardized conversion converts the service data into standard data by means of a pre-trained standard conversion dictionary, the checking and correction are performed on the standard data through a DataFrame API, and the missing-information filling uses a pre-designated unique identification code to extract the information corresponding to the missing content and fill it in;
step four, storing in the standard repository the data inclusion table corresponding to the service data processed in the central front-end processor, extracting the processed service data from the standard repository to a fusion repository according to the service data extraction requirement of a data extractor, and fusing it to generate a data fusion extraction table;
the method further comprises detecting anomalies in each data inclusion table during the checking and correction, as follows:
selecting a data inclusion table;
acquiring the number of all fields of the data inclusion table, and first checking each field through the DataFrame API without correcting it;
then obtaining the fields corresponding to each abnormal result and the number of those fields;
the abnormal results being that the corresponding field contains a null value, has a data type that does not meet the requirement, or contains out-of-range data;
marking the number of fields corresponding to each abnormal result as Yi, wherein i = 1, 2, ..., n, and n represents the number of types of abnormal result;
then calculating, from the counts Yi and the coefficients βi, a field anomaly analysis value CZ of the data inclusion table;
wherein βi represents a preset scaling factor corresponding to each abnormal result, i = 1, 2, ..., n;
the field anomaly analysis value CZ is then compared with a preset anomaly threshold Cy:
if CZ ≥ Cy, the data mapping state between the data inclusion table and the corresponding data source is abnormal; the corresponding data inclusion table is then rebuilt, and the corresponding service data in the corresponding data source is extracted to the corresponding position of the data inclusion table;
if CZ < Cy, the data mapping state between the data inclusion table and the corresponding data source is normal, and each field is then corrected through the DataFrame API;
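The anomaly check above can be illustrated with a short sketch. The text does not reproduce the formula for CZ, so the code assumes a plain weighted sum, CZ = Σ βi·Yi; that form, the function names, the field specifications, the coefficient values and the threshold are all hypothetical, and plain Python lists of dicts stand in for the DataFrame API.

```python
from typing import Any, Dict, List


def count_abnormal_fields(rows: List[Dict[str, Any]],
                          specs: Dict[str, Dict[str, Any]]) -> Dict[str, int]:
    """Return Y_i: how many fields show each kind of abnormal result."""
    counts = {"null": 0, "type": 0, "range": 0}
    for name, spec in specs.items():
        values = [row.get(name) for row in rows]
        present = [v for v in values if v is not None]
        if len(present) < len(values):
            counts["null"] += 1      # field contains a null value
        typed = [v for v in present if isinstance(v, spec["type"])]
        if len(typed) < len(present):
            counts["type"] += 1      # field contains a wrongly typed value
        if "range" in spec:
            low, high = spec["range"]
            if any(not (low <= v <= high) for v in typed):
                counts["range"] += 1  # field contains out-of-range data
    return counts


def anomaly_value(counts: Dict[str, int], betas: Dict[str, float]) -> float:
    # ASSUMPTION: CZ is taken as the weighted sum of beta_i * Y_i.
    return sum(betas[kind] * counts[kind] for kind in counts)


# Hypothetical data inclusion table and field requirements.
rows = [
    {"age": 30, "name": "A"},
    {"age": None, "name": "B"},
    {"age": 250, "name": 7},
]
specs = {"age": {"type": int, "range": (0, 150)}, "name": {"type": str}}
betas = {"null": 3, "type": 2, "range": 1}  # hypothetical preset coefficients
CY = 5                                       # hypothetical anomaly threshold

counts = count_abnormal_fields(rows, specs)
cz = anomaly_value(counts, betas)
rebuild_table = cz >= CY  # True means the mapping state is treated as abnormal
```

In this example each kind of abnormal result occurs in exactly one field, so the assumed sum exceeds the threshold and the table would be rebuilt rather than corrected.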
the method further comprises slicing all the data inclusion tables, as follows:
step one, selecting, in the plurality of data inclusion tables, a highly distinguishing field as the main field of a data slice, and at the same time taking that field as a data slice subset;
step two, matching the service data extraction requirement against the plurality of data slice subsets respectively, and obtaining a corresponding result set according to the matching result;
the matching is performed as follows:
selecting a data slice subset;
if the matching results are consistent, mapping the value or interval corresponding to the service data extraction requirement to the dictionary value corresponding to the main field of the data slice;
when the queried value or interval falls within a single slice, only the data of that slice is loaded and used as the result set;
when the queried value or interval falls within two or more slices, the data corresponding to those slices is loaded and used as the result set;
when the queried value or interval falls within all slices, the data of all slices, namely the entire data inclusion table, is loaded; this also indicates that the slicing has failed;
when the queried value or interval falls outside all slices, the extraction and fusion are invalid;
step three, storing all corresponding result sets in the fusion repository.
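The four matching outcomes in the slicing procedure can be sketched as follows. This is only an illustration under stated assumptions: each slice is modelled as a closed interval on the main field, and the names `match_slices`, `load_result_set` and the status strings are hypothetical.

```python
from typing import Any, Dict, List, Tuple

Interval = Tuple[float, float]


def match_slices(slices: Dict[str, Interval],
                 query: Interval) -> Tuple[str, List[str]]:
    """Return a status and the ids of the slices that the query overlaps."""
    q_low, q_high = query
    hit = [sid for sid, (low, high) in slices.items()
           if q_low <= high and low <= q_high]
    if not hit:
        return "invalid", []          # query falls outside all slices
    if len(hit) == len(slices):
        return "slicing_failed", hit  # whole table has to be loaded
    return "ok", hit                  # one or several slices are loaded


def load_result_set(rows: List[Dict[str, Any]], main_field: str,
                    slices: Dict[str, Interval],
                    hit: List[str]) -> List[Dict[str, Any]]:
    """Load only the rows whose main-field value lies in a matched slice."""
    return [row for row in rows
            if any(slices[s][0] <= row[main_field] <= slices[s][1]
                   for s in hit)]


# Hypothetical slices on an "age" main field.
slices = {"child": (0, 17), "adult": (18, 64), "senior": (65, 120)}
rows = [{"age": 5}, {"age": 25}, {"age": 70}]

status_one, hit_one = match_slices(slices, (20, 30))
status_two, hit_two = match_slices(slices, (60, 70))
status_all, hit_all = match_slices(slices, (10, 70))
status_none, hit_none = match_slices(slices, (200, 300))
result_set = load_result_set(rows, "age", slices, hit_one)
```

Treating a query that touches every slice as a slicing failure mirrors the text: in that case the slices no longer reduce the amount of data loaded.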
2. The data fusion method according to claim 1, wherein in step three the standardized conversion is performed as follows:
first, predefining a corresponding standard conversion dictionary for each field in the data inclusion table, wherein the standard conversion dictionary comprises a plurality of fixed dictionary values serving as the standard translations of the corresponding field;
then traversing the contents of the corresponding fields in the data inclusion table, and acquiring the standard translation of each field by using the get() method of the dictionary.
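As a minimal sketch of this dictionary-based conversion, the code below keeps one conversion dictionary per field and looks each value up with `dict.get()`. The field names, the dictionary contents and the choice to keep unmapped values unchanged are assumptions made for illustration.

```python
from typing import Any, Dict, List

# Hypothetical standard conversion dictionaries, one per field.
CONVERSION_DICTS: Dict[str, Dict[Any, str]] = {
    "gender": {"M": "male", "1": "male", "F": "female", "2": "female"},
    "status": {"Y": "active", "N": "inactive"},
}


def standardize(rows: List[Dict[str, Any]],
                dictionaries: Dict[str, Dict[Any, str]]) -> List[Dict[str, Any]]:
    """Replace each field value by its standard translation, if one exists."""
    converted = []
    for row in rows:
        new_row = dict(row)
        for field, mapping in dictionaries.items():
            if field in new_row:
                raw = new_row[field]
                # ASSUMPTION: values without a dictionary entry stay as they are.
                new_row[field] = mapping.get(raw, raw)
        converted.append(new_row)
    return converted


source_rows = [
    {"id": 1, "gender": "M", "status": "Y"},
    {"id": 2, "gender": "2", "status": "unknown"},
]
standard_rows = standardize(source_rows, CONVERSION_DICTS)
```

Passing the raw value as the default of `get()` avoids a `KeyError` for values the dictionary does not cover; a stricter pipeline might flag such values instead.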
3. The data fusion method according to claim 1, wherein in step three the checking and correction are performed as follows:
S41, checking the data in the corresponding field for null values by using the DataFrame API, and choosing to delete or fill the null values, wherein a null value represents missing, unknown or inapplicable field data;
S42, checking the data type of the corresponding field by using the DataFrame API, and if the data type does not meet the requirement, converting the corresponding service data into the specified data type;
S43, converting timestamps into a date format by using the DataFrame API;
S44, checking the range of the data in the corresponding field by using the DataFrame API, and if the data is out of range, choosing to delete or fill the data.
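Steps S41 to S44 can be sketched in plain Python as a stand-in for the DataFrame API, which the text does not tie to a particular library. The function names, the fill value, and the assumption that timestamps are Unix seconds in UTC are all hypothetical.

```python
from datetime import datetime, timezone
from typing import Any, Callable, Dict, List, Optional, Tuple


def correct_field(rows: List[Dict[str, Any]], field: str,
                  cast: Callable[[Any], Any], fill: Any = None,
                  valid_range: Optional[Tuple[Any, Any]] = None
                  ) -> List[Dict[str, Any]]:
    """Apply S41, S42 and S44 to one field; unfixable records are deleted."""
    cleaned = []
    for row in rows:
        value = row.get(field)
        if value is None:                 # S41: null check
            if fill is None:
                continue                  # choose to delete the record
            value = fill                  # or choose to fill the null
        try:
            value = cast(value)           # S42: convert to the specified type
        except (TypeError, ValueError):
            continue
        if valid_range is not None:       # S44: range check
            low, high = valid_range
            if not low <= value <= high:
                if fill is None:
                    continue
                value = cast(fill)
        cleaned.append({**row, field: value})
    return cleaned


def timestamp_to_date(seconds: float) -> str:
    # S43: ASSUMPTION: the timestamp is Unix seconds, interpreted in UTC.
    return datetime.fromtimestamp(seconds, tz=timezone.utc).date().isoformat()


raw_rows = [{"age": "42"}, {"age": None}, {"age": "abc"}, {"age": "300"}]
filled_rows = correct_field(raw_rows, "age", int, fill=0, valid_range=(0, 150))
strict_rows = correct_field(raw_rows, "age", int, valid_range=(0, 150))
first_day = timestamp_to_date(0)
```

With a DataFrame library the same checks would normally be column-wise operations; the row loop here only keeps the sketch free of external dependencies.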
4. The data fusion method according to claim 1, wherein in step three the missing information is filled as follows:
when the data of the corresponding field is missing, the content of the corresponding field is extracted from other data sources whose rights have been confirmed, using the identity card number as the unique identification code, and the content is filled into the current field position.
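A small sketch of this fill step follows, assuming each rights-confirmed source can be read as a mapping from the identification code to a record. The source layout, the field names and the placeholder identifiers are hypothetical.

```python
from typing import Any, Dict, List


def fill_missing(record: Dict[str, Any], key_field: str,
                 confirmed_sources: List[Dict[str, Dict[str, Any]]]
                 ) -> Dict[str, Any]:
    """Fill None fields from the first confirmed source that has a value."""
    key = record.get(key_field)
    filled = dict(record)
    for field, value in record.items():
        if value is not None:
            continue
        for source in confirmed_sources:
            candidate = source.get(key, {}).get(field)
            if candidate is not None:
                filled[field] = candidate
                break
    return filled


# Hypothetical record and sources, keyed by a placeholder identification code.
record = {"id_number": "ID001", "name": "Zhang", "phone": None, "address": None}
sources = [
    {"ID001": {"phone": None, "address": "Suzhou"}},
    {"ID001": {"phone": "555-0100"}},
]
completed = fill_missing(record, "id_number", sources)
```

Sources are consulted in list order, so the ordering expresses which source is preferred when several hold a value for the same field.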
5. The data fusion method according to claim 1, wherein in step four the data in the standard repository is extracted by filtering, sorting and grouping the data with SQL statements or a programming language to obtain the required information.
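As an illustration of filtering, sorting and grouping with an SQL statement, the sketch below uses an in-memory SQLite database as a stand-in for the standard repository. The table name, columns and sample rows are hypothetical.

```python
import sqlite3

# In-memory database standing in for the standard repository (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE standard_visits (id_number TEXT, department TEXT, fee REAL)"
)
conn.executemany(
    "INSERT INTO standard_visits VALUES (?, ?, ?)",
    [
        ("ID001", "cardiology", 120.0),
        ("ID002", "cardiology", 80.0),
        ("ID003", "radiology", 300.0),
        ("ID004", "radiology", 0.0),
    ],
)

# Filter (WHERE), group (GROUP BY) and sort (ORDER BY) in a single statement.
extracted = conn.execute(
    """
    SELECT department, COUNT(*) AS visits, SUM(fee) AS total_fee
    FROM standard_visits
    WHERE fee > 0
    GROUP BY department
    ORDER BY total_fee DESC
    """
).fetchall()
conn.close()
```

The rows returned by such a query would be the material written to the fusion repository for a given extraction requirement.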
6. The data fusion method according to claim 1, wherein if the matching results are inconsistent, the main field of the data slice is changed.
7. A data fusion system, characterized in that the system is implemented by means of the data fusion method according to any one of claims 1-6.
CN202410026687.XA 2024-01-09 2024-01-09 Data fusion method and system Active CN117540343B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410026687.XA CN117540343B (en) 2024-01-09 2024-01-09 Data fusion method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410026687.XA CN117540343B (en) 2024-01-09 2024-01-09 Data fusion method and system

Publications (2)

Publication Number Publication Date
CN117540343A CN117540343A (en) 2024-02-09
CN117540343B CN117540343B (en) 2024-04-16

Family

ID=89792255

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410026687.XA Active CN117540343B (en) 2024-01-09 2024-01-09 Data fusion method and system

Country Status (1)

Country Link
CN (1) CN117540343B (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112231417A (en) * 2020-10-14 2021-01-15 平安国际智慧城市科技股份有限公司 Data classification method and device, electronic equipment and storage medium
CN114386509A (en) * 2022-01-12 2022-04-22 平安普惠企业管理有限公司 Data fusion method and device, electronic equipment and storage medium
CN115237636A (en) * 2022-08-10 2022-10-25 沈阳数融科技有限公司 Real-time data quality inspection and repair system
CN115525655A (en) * 2022-10-09 2022-12-27 北京瑟维斯科技有限公司 Method and system for data query slicing
CN116415206A (en) * 2023-06-06 2023-07-11 中国移动紫金(江苏)创新研究院有限公司 Operator multiple data fusion method, system, electronic equipment and computer storage medium
CN116662371A (en) * 2023-06-13 2023-08-29 国网信通亿力科技有限责任公司 Cross-domain data fusion method
CN117312290A (en) * 2023-10-16 2023-12-29 上海欧冶金诚信息服务股份有限公司 Method for improving heterogeneous system data quality

Also Published As

Publication number Publication date
CN117540343A (en) 2024-02-09

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant