CN107833595A

CN107833595A - Medical big data multicenter integration platform and method

Info

Publication number: CN107833595A
Application number: CN201710946758.8A
Authority: CN
Inventors: 薛付忠; 季晓康; 王永超; 高琦; 徐聪; 王晓鹤; 阿力木·达依木; 曹瑾; 许艺博; 蒋正; 卞伟玮; 李敏; 孙苑潆; 韩君铭; 马官慧
Original assignee: Kang Wei Health Care Big Data Technology Co Ltd; Shandong University
Current assignee: Kang Wei Health Care Big Data Technology Co Ltd; Shandong University
Priority date: 2017-10-12
Filing date: 2017-10-12
Publication date: 2018-03-23

Abstract

The invention discloses a medical big data multicenter integration platform and method. The data of each data sub-center server are connected to a data center server, and the quality of each sub-center server's data is evaluated. If the evaluation passes, processing continues to the next step; if it fails, the data center server feeds the failure conclusion back to the data sub-center server. The quality evaluation assesses data completeness, data duplication rate, data deviation, and data volume. The data center server establishes and maintains standard variables and standard dictionaries, and the data are preprocessed according to them. Data standardization comprises variable standardization and data value standardization: through a similarity matching algorithm and manual review, the data variables of the data sub-center servers are mapped one-to-one to the standard variables of the data center server. The data standardized by the data center server are then put to use.

Description

Medical big data multicenter integration platform and method

Technical field

The present invention relates to a medical big data multicenter integration platform and method.

Background art

The prior art has the following problems to be solved:

First, the data volume is huge and the formats are varied. The data include physical examination data from dozens of medical examination centers, government data from multiple regions such as basic public health service data and data on women of childbearing age, clinical data from several Grade A hospitals, and several specialized disease datasets, for example major disease databases such as mental illness data and glioma data. Each data source stores a large amount of data, and the data format differs from source to source.

Second, the drawbacks of traditional data curation. Traditional data curation targets centralized databases and consumes substantial manpower and material resources to organize data, run statistical analyses, and find valuable scientific results. With the arrival of the big data era and the addition of wearable devices, the data volume in the medical and health field is growing exponentially. Traditional data curation clearly no longer meets current data processing needs and has become a major obstacle for researchers who want to use the data. In particular, how to manage multicenter, diversified data as a whole on one data curation platform, and mine it in a coordinated way so that the sources complement one another, is a problem traditional data curation cannot solve. For example, conventional processing cannot determine whether records from diagnosis, treatment, or physical examination at different hospitals belong to the same person.

Third, data display. The data volumes in biostatistical research are huge; each of the databases described holds millions of records or more. The prior art cannot present these data intuitively. Big data visualization must be used to display the data as more intuitive images, such as histograms, line charts, and scatter plots, so that data users and decision makers gain a basic intuitive understanding of the data before the next step of research and decision-making.

Fourth, data standardization. Because no unified industry standard exists, each medical institution and data provider has collected and stored data very differently in the course of its own informatization. For example, the same disease, drug, or operation may be named differently in different institutions; for the same test indicator, differences in instruments and reagents lead to very different reference ranges and units. A data curation platform must establish a set of standards and use effective processing tools to curate and standardize indicator names and indicator result values.

Fifth, the processing of unstructured data. Unstructured data processing refers to processing text such as examination descriptions and examination conclusions. Key information must be extracted from whole passages of description; otherwise the data cannot be used effectively for research. These large volumes of text contain a great deal of information, so while the key, effective information is extracted, the completeness of extraction must be ensured. Losing any useful information is a serious loss of data integrity.

Sixth, the relationship between research and curation. Data curation is well known to be a prerequisite for research statistics, but there is an awkward problem: the curated database may well not contain the indicator a study needs. For example, a study needs the indicator "NASH", while general data curation of physical examination indicators only yields whether the person drinks alcohol and whether ultrasound diagnosed fatty liver. The type of fatty liver has to be defined by the researcher, which means the raw data must be curated again.

Summary of the invention

The purpose of the present invention is to solve the above problems by providing a medical big data multicenter integration platform and method, with the advantages of convenient access, distributed storage, rich tools, intuitive visualization, and intelligent curation.

To achieve this purpose, the present invention adopts the following technical solution:

A medical big data multicenter integration platform, comprising:

A data center server, which establishes and maintains standard variables and standard dictionaries;

Data sub-center servers, which collect the raw data of each data source and store the raw data in a corresponding database. Each database contains a variable index table, a personnel basic information table, and an examination result table; the data in these three tables are preprocessed; each database has a unique code;

A data application server, for making use of the data preprocessed by the data sub-center servers.

The standard variable includes: item code, item name, department, indicator interpretation, data type, data label, and reference range;

Item code, for example: 1001, 1002; item name, for example: mean corpuscular hemoglobin concentration (MCHC), mean corpuscular hemoglobin content; department, for example: clinical laboratory, gynecology; indicator interpretation: an introduction to the item name; data type, for example: numeric, text; data label, for example: blood routine, urine routine; reference range, for example: the reference range of each test result;

The standard dictionary includes: the International Statistical Classification of Diseases and Related Health Problems (ICD-10), the Chinese Pharmacopoeia, or positive signs;

Standard variable maintenance includes: establishing standard item names, codes, and classifications.

Standard dictionary maintenance: the raw data are standardized and text is structured according to ICD-10 or the Chinese Pharmacopoeia.

The preprocessing refers to:

Processing each record in the variable index table to obtain new data variables, and building a new data variable index from the new data variables; standardizing the examination item names and examination item name codes in the variable index table according to the standard variables of the data center server;

Deduplicating the data in the personnel basic information table; the deduplication includes deduplication by work unit and deduplication by ID card number;

Converting the text data in the examination result table into structured data, and standardizing the examination result names and examination result name codes in the examination result table according to the standard dictionary of the data center server.

Processing each record in the variable index table to obtain new data variables, and building a new data variable index from them, involves the following modules:

Manual splitting module, for manually splitting medical record data into multiple sentence variables;

Regular expression matching module, for extracting regular data, that is, data obtained by matching a regular expression; such regular data are, for example, numbers;

Automatic segmentation module, which produces new variables according to the configured separator; the separator is user-defined, for example a semicolon or a space;

Text replacement module, for replacing erroneous expressions in the raw data;

Fragment extraction module, for extracting from the examination results the text fragments that are actually needed;

Unit conversion module, for converting the units of the data in order to unify measurement;

Text structuring module, which processes unstructured text data into structured variable data, splitting and standardizing the text by natural language processing or machine learning; for example, splitting and standardizing text descriptions such as ultrasound imaging findings;

Data standardization module, which establishes one-to-one mappings between the data variables of the data sub-center server and the standard variables of the data center server through a similarity detection algorithm and manual review.
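
As an illustration, several of the modules listed above (regular expression matching, automatic segmentation, text replacement, unit conversion) can be sketched in Python. The function names, the default number pattern, the separator set, and the conversion table are assumptions made for this sketch; the patent does not specify them.

```python
import re

# Hypothetical conversion factors keyed by (from_unit, to_unit); not from the patent.
UNIT_FACTORS = {("g/dL", "g/L"): 10.0}


def regex_extract(text, pattern=r"\d+(?:\.\d+)?"):
    """Regular expression matching: return every substring that matches."""
    return re.findall(pattern, text)


def auto_split(text, separators=";；"):
    """Automatic segmentation: split on any user-defined separator character."""
    parts = re.split("[" + re.escape(separators) + "]+", text)
    return [p.strip() for p in parts if p.strip()]


def replace_text(text, corrections):
    """Text replacement: substitute known erroneous expressions."""
    for wrong, right in corrections.items():
        text = text.replace(wrong, right)
    return text


def convert_unit(value, from_unit, to_unit):
    """Unit conversion: rescale a value so that all sources share one unit."""
    if from_unit == to_unit:
        return value
    return value * UNIT_FACTORS[(from_unit, to_unit)]
```

Each function takes one raw value and returns the derived value, so the results can be stored as new data variables alongside the originals rather than overwriting them.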

KEY data are stored in the variable index table; BASE data are stored in the personnel basic information table; VALUE data are stored in the examination result table. KEY data represent the data variable index; VALUE data represent the raw data; BASE data represent personnel basic information;

The KEY data index the VALUE data and include grouping tables and a mapping table. The grouping tables store the data variable index in groups, for example department groups, data type groups, and combination groups; a combination group is a combination of examination items, such as the hepatitis B five-item panel or blood routine. The mapping table stores the one-to-one relationship between the data variable index and the data and serves as the foreign-key index of the VALUE data, indexing all measured values of the same test;

The VALUE data are tables that store the raw data according to its data type. Each raw record has a unique index, formed from the hospital's region code + institution code + raw data record code;

The BASE data store personnel basic information. In principle each data provider holds only one record per individual, including sex, name, marital status, ID card number, phone, and email; these data are highly unique and have relatively high security requirements. The BASE data include the personnel basic information table, the personnel work unit table, and the mapping table between personnel and data.
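
A minimal sketch of how the unique index and the KEY-to-VALUE lookup described above might look. The hyphen separator, the field names, and all sample codes are assumptions for illustration; the patent only states that the index combines region code, institution code, and record code.

```python
from dataclasses import dataclass


def make_record_index(region_code, institution_code, record_code):
    """Compose the unique index of a VALUE record from its three codes."""
    # The hyphen keeps the three parts unambiguous; it is an assumption.
    return f"{region_code}-{institution_code}-{record_code}"


@dataclass
class ValueRecord:
    index: str       # unique index of the raw record
    key_id: str      # KEY entry (data variable index) the value belongs to
    person_id: str   # BASE entry (individual) the value belongs to
    value: str       # raw value as delivered by the data source


def values_for_key(records, key_id):
    """Use the KEY id as a foreign key to fetch all values of one test."""
    return [r.value for r in records if r.key_id == key_id]


# Invented sample records.
records = [
    ValueRecord(make_record_index("370100", "H01", "000001"), "HGB", "P1", "132"),
    ValueRecord(make_record_index("370100", "H01", "000002"), "HGB", "P2", "118"),
    ValueRecord(make_record_index("370100", "H01", "000003"), "PLT", "P1", "250"),
]
hgb_values = values_for_key(records, "HGB")
```

Keeping `person_id` on each VALUE record mirrors the BASE mapping table between personnel and data, so identifying fields never need to be copied into the bulk tables.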

The data application server includes a queue creator and a data statistics platform;

Through the queue creator, raw data or mined data are selected as research variables, inclusion and exclusion criteria are set, and an outcome variable is set, generating a research cohort (scientific research queue). The inclusion criteria are a series of indicators or conditions selected among patients who meet the diagnostic criteria. The exclusion criteria are indicators used to exclude patients who could interfere with the accuracy of the results. The outcome variable, or simply the outcome, is the expected result event that appears during follow-up observation, that is, the event the researcher wishes to track. The research cohort is the data matrix formed by selecting the required research subjects from a specified population according to whether they are currently, or were at some past period, exposed to a risk factor under study.

The data statistics platform computes statistics and displays data.
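
The queue creator's inclusion criteria, exclusion criteria, and outcome variable can be sketched as a simple filter over subject records. This is an illustrative sketch only: `build_cohort`, the predicate style, and the sample records are assumptions, not the platform's actual interface.

```python
def build_cohort(records, variables, include, exclude, outcome):
    """Return one row per eligible subject with the chosen variables and outcome."""
    cohort = []
    for rec in records:
        # Keep subjects that meet the inclusion criteria and no exclusion criterion.
        if not include(rec) or exclude(rec):
            continue
        row = {name: rec.get(name) for name in variables}
        row["outcome"] = outcome(rec)
        cohort.append(row)
    return cohort


# Invented sample records for illustration only.
people = [
    {"id": "P1", "age": 52, "drinks": False, "fatty_liver": True, "diabetes": True},
    {"id": "P2", "age": 47, "drinks": True, "fatty_liver": True, "diabetes": False},
    {"id": "P3", "age": 30, "drinks": False, "fatty_liver": False, "diabetes": False},
    {"id": "P4", "age": 61, "drinks": False, "fatty_liver": True, "diabetes": False},
]

cohort = build_cohort(
    people,
    variables=["id", "age"],
    include=lambda r: r["fatty_liver"],     # inclusion criterion
    exclude=lambda r: r["drinks"],          # exclusion criterion
    outcome=lambda r: int(r["diabetes"]),   # outcome variable
)
```

The resulting list of rows is the data matrix: one row per subject, one column per research variable, plus the outcome.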

A medical big data multicenter integration method, comprising the following steps:

Step (1): Connect the data of each data sub-center server to the data center server and evaluate the quality of each data sub-center server's data. If the evaluation passes, proceed to step (2); if it does not pass, the data center server feeds the failure conclusion back to the data sub-center server. The quality evaluation assesses data completeness, data duplication rate, data deviation, and data volume;

Step (2): The data center server establishes and maintains standard variables and standard dictionaries, and the data are preprocessed according to the standard variables and standard dictionaries;

Step (3): Data standardization, comprising variable standardization and data value standardization. Through a similarity matching algorithm and manual review, one-to-one mappings are established between the data variables of the data sub-center servers and the standard variables of the data center server;

Step (4): Make use of the data standardized by the data center server.
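
The quality evaluation in step (1) can be illustrated with a small check over a batch of rows. The thresholds, the field-based definitions of completeness and duplication, and the function name are assumptions for this sketch; the patent names the criteria but not how they are computed, and the data deviation check is omitted here.

```python
def assess_quality(rows, required_fields, min_rows=1,
                   min_completeness=0.9, max_duplication=0.1):
    """Score a batch on volume, completeness, and duplication rate."""
    n = len(rows)
    if n == 0 or not required_fields:
        return {"rows": n, "completeness": 0.0, "duplication": 0.0, "passed": False}
    # Completeness: share of required cells that are actually filled in.
    filled = sum(1 for r in rows for f in required_fields
                 if r.get(f) not in (None, ""))
    # Duplication: share of rows that repeat an earlier row on the required fields.
    distinct = {tuple(r.get(f) for f in required_fields) for r in rows}
    report = {
        "rows": n,
        "completeness": filled / (n * len(required_fields)),
        "duplication": 1 - len(distinct) / n,
    }
    report["passed"] = (
        n >= min_rows
        and report["completeness"] >= min_completeness
        and report["duplication"] <= max_duplication
    )
    return report


# Invented sample batch: one repeated row and one missing value.
batch = [
    {"id": "1", "hb": "130"},
    {"id": "1", "hb": "130"},
    {"id": "2", "hb": ""},
    {"id": "3", "hb": "120"},
]
report = assess_quality(batch, ["id", "hb"])
```

A batch whose report has `passed` set to false would be the case in which the data center server returns the failure conclusion to the sub-center.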

Step (2) comprises:

Step (201): Inspect each record through the data variable index; using graphical tools such as frequency tables or bar charts, the data sub-center server presents the raw data intuitively, and abnormal data are removed;

Step (202): Curate the data:

Manual splitting step, for manually splitting medical record data into multiple sentence variables;

Regular expression matching step, for extracting regular data, that is, data obtained by matching a regular expression; such regular data are, for example, numbers;

Automatic segmentation step, which produces new variables according to the configured separator; the separator is user-defined, for example a semicolon or a space;

Text replacement step, for replacing erroneous expressions in the raw data;

Fragment extraction step, for extracting from the examination results the text fragments that are actually needed;

Unit conversion step, for converting the units of the data in order to unify measurement;

Text structuring step, which processes unstructured text data into structured variable data, splitting and standardizing the text by natural language processing or machine learning; for example, splitting and standardizing text descriptions such as ultrasound imaging findings;

In step (202), processing unstructured text data into structured variable data comprises:

Step (2021): Select the data variables that require text structuring;

Step (2022): Deduplicate the data variables and store the deduplicated result in the text structuring data table;

Step (2023): Using a word segmentation algorithm from natural language processing, with the standard dictionary library as the basis for segmentation, first segment the raw text data; then compare the segmentation results with the standard dictionary library through a similarity algorithm, so that the processing completes automatically;

Step (2024): Manually supplement the data that could not be fully recognized, to guarantee data integrity;

Step (2025): Export the structured data.

Step (203): Store the data obtained from the curation in step (202) as new variable data in the variable index table of the data sub-center server.
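
Steps (2021) to (2025) can be sketched as follows. The patent does not name its segmentation or similarity algorithms, so this sketch substitutes punctuation-based splitting for the word segmenter and Python's `difflib` ratio for the similarity algorithm; the function name, cutoff, and sample dictionary are likewise assumptions.

```python
import difflib
import re


def structure_texts(raw_texts, dictionary, cutoff=0.6):
    """Map free-text findings to dictionary terms; collect what needs review."""
    structured = {}
    needs_review = []
    # Step 2022: deduplicate while keeping first-seen order.
    for text in dict.fromkeys(raw_texts):
        terms = []
        # Step 2023: naive segmentation on punctuation (stand-in for a real segmenter).
        for segment in re.split(r"[,;.，；。]+", text):
            segment = segment.strip()
            if not segment:
                continue
            # Step 2023: compare the segment with the standard dictionary.
            match = difflib.get_close_matches(segment, dictionary, n=1, cutoff=cutoff)
            if match:
                terms.append(match[0])
            else:
                # Step 2024: leave unrecognized segments for manual completion.
                needs_review.append((text, segment))
        structured[text] = terms
    # Step 2025: the caller exports the structured result.
    return structured, needs_review


# Invented sample dictionary and findings.
DICTIONARY = ["fatty liver", "gallbladder polyp", "kidney stone"]
finding = "fatty livr, gallbladder polyp; xyz"
structured, needs_review = structure_texts([finding, finding], DICTIONARY)
```

In this sample the misspelled "fatty livr" is close enough to be matched automatically, while "xyz" is routed to the manual review list instead of being dropped.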

The variable standardization in step (3) comprises:

Step (301): Select a data variable from the variable index table of the data sub-center server, then select a standard variable from the standard dictionary defined by the data center server. Based on medical knowledge, the user compares the names, departments, and actual test results of the two variables and determines the one-to-one correspondence between the data variable and the standard variable, completing the mapping;

Step (302): Review the variable mappings made in step (301) to complete the standardization and ensure the accuracy of the variable mapping.
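
The combination of similarity matching and manual review in steps (301) and (302) might be sketched like this. `difflib` is used as a stand-in similarity measure, and the function names, cutoff, and sample variable names are assumptions; the reviewer's decision is modeled simply as a set of approved source names.

```python
import difflib


def propose_mappings(source_names, standard_names, cutoff=0.6):
    """Suggest at most one standard variable for each source variable."""
    lowered = {name.lower(): name for name in standard_names}
    proposals = {}
    for source in source_names:
        hits = difflib.get_close_matches(
            source.lower(), list(lowered), n=1, cutoff=cutoff)
        proposals[source] = lowered[hits[0]] if hits else None
    return proposals


def confirm_mappings(proposals, approved):
    """Keep only reviewer-approved pairs and enforce a one-to-one mapping."""
    final = {}
    used = set()
    for source, standard in proposals.items():
        if standard is None or source not in approved:
            continue
        if standard in used:
            raise ValueError(f"{standard!r} is already mapped")
        used.add(standard)
        final[source] = standard
    return final


# Invented sample variable names.
proposals = propose_mappings(
    ["Haemoglobin", "PLT count", "zzz"],
    ["Hemoglobin", "Platelet count"],
)
final = confirm_mappings(proposals, approved={"Haemoglobin"})
```

Proposals are only suggestions; nothing enters the final mapping until a reviewer approves it, which matches the audit role of step (302).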

In step (4), through the queue creator, raw data or mined data are selected as research variables, inclusion and exclusion criteria are set, and an outcome variable is set, generating a research cohort (scientific research queue). The inclusion criteria are a series of indicators or conditions selected among patients who meet the diagnostic criteria. The exclusion criteria are indicators used to exclude patients who could interfere with the accuracy of the results. The outcome variable, or simply the outcome, is the expected result event that appears during follow-up observation, that is, the event the researcher wishes to track. The research cohort is the data matrix formed by selecting the required research subjects from a specified population according to whether they are currently, or were at some past period, exposed to a risk factor under study.

Beneficial effects of the present invention：

1. The invention effectively solves the problem of huge and diverse data. First, each data source is allocated its own repository node, which eases later distributed computing, and the data warehouse storage model solves the problem of large data volume. The model is designed from the viewpoint of scientific data use, and every data storage node has the same structure: the KEY index tables point to all stored data, the VALUE tables store the bulk data, and the BASE tables hold basic information. A unified, standard storage structure eases data management and use and resolves the diversity of the raw data.

2. The invention provides a platform for multicenter integration and curation of data. On this platform, data curators and data users need not care about the source, storage, or quantity of the data, and need no programming knowledge. With the discovery and curation tools the platform provides, data of different origins and formats can be processed through a unified workflow and stored or exported in one standard data format, so curation is completed quickly.

3. The invention provides intuitive big data visualization, including pie charts, bar charts, scatter plots, line charts, funnel charts, and so on, which greatly improves the use, decision-making, and curation of data. To better support research work, the invention also provides many statistical description views, such as frequency distribution charts and statistical description charts (median, mode, 2-quantiles, 3-quantiles).

4. The invention effectively solves the "information island" problem and improves multicenter data fusion. The system can judge how likely records are to belong to the same person from multidimensional data such as ID card number, height and weight, age, date of birth, disease status, family history, personal history, sex, and marital status (for example, records with the same name pinyin, the same sex, the same date of birth, and the same work unit, whose heights differ by no more than 2 cm, are considered the same person). Different research data are also mapped to the standard through the platform's dictionary tree, so that data are standardized quickly.

5. The invention provides a tool for structuring unstructured text data, whose accuracy, processing speed, and degree of intelligence have been tested and confirmed in practical work.

6. The data curation provided by the invention and the later research platform complement and promote each other. On one hand, data curation provides the foundation for using data in research; on the other hand, research drives deeper mining of the data.
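
The same-person example given in item 4 (same name pinyin, sex, date of birth, and work unit, with heights within 2 cm) can be written as a small rule. The field names and sample records are assumptions for this sketch, and a real system would presumably weigh the other dimensions listed above as well.

```python
def likely_same_person(a, b, max_height_diff_cm=2.0):
    """Apply the example rule: four exact matches plus a height tolerance."""
    exact_fields = ("name_pinyin", "sex", "birth_date", "work_unit")
    if any(a[f] != b[f] for f in exact_fields):
        return False
    return abs(a["height_cm"] - b["height_cm"]) <= max_height_diff_cm


# Invented sample records from two hypothetical data sources.
exam_a = {"name_pinyin": "wangwei", "sex": "M", "birth_date": "1980-05-01",
          "work_unit": "Unit A", "height_cm": 172.0}
exam_b = {"name_pinyin": "wangwei", "sex": "M", "birth_date": "1980-05-01",
          "work_unit": "Unit A", "height_cm": 173.5}
exam_c = dict(exam_b, height_cm=178.0)
```

Here `exam_a` and `exam_b` satisfy the rule, while `exam_c` differs by 6 cm in height and does not.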

Brief description of the drawings

Fig. 1 is the overall architecture diagram of the present invention.

Embodiment

The invention is further described below with reference to the drawing and embodiments.

As shown in Fig. 1, the big data integration platform establishes a central database, in which a set of standard indicator dictionaries can be conveniently established and maintained. The invention also provides a variable mapping system: whatever the degree of standardization of the raw data, once the variables are mapped with this system the data can be indexed and used by the central database. A variety of data mining and curation tools are provided so that data can be curated quickly and efficiently, reducing the cost of involving large numbers of personnel. Through the storage model of the invention, the KEY tables index the data of each data node, and the data are displayed visually.

The present invention employs a medical language processing algorithm:

First step: design variables for the raw text data, so that curation personnel can view the raw data through a clear index and use the discovery and curation tools provided by the system to perform a preliminary split;

Second step: apply a word segmentation algorithm to the curated data to generate structured data;

Third step: using a similarity algorithm and the dictionary accumulated through machine learning, artificial intelligence standardizes the structured data;

Fourth step: carry out quality control by manual review, which also trains the artificial intelligence dictionary and makes the text structuring function more intelligent.

Fourth, a virtuous cycle in which data curation promotes research progress and research promotes deeper data curation: the variable builder module of the queue creator mines new variables, achieving further data curation.

The curated data comprise 12 physical examination databases, the clinical databases of 3 hospitals, the basic public health databases of 4 regions, 1 regional medical insurance database, a death database, 1 regional database of women of childbearing age, and a provincial children's physical examination database.

The data sub-center server takes guaranteeing the integrity of the raw data as its first storage principle. On the basis of not losing any valid raw data, and regardless of data volume, it extracts data variables and builds a data index.

The data sub-center servers are the basis for curating data in a distributed way and processing large-scale data. Each dataset is very large, and although the processing capability of database software and the performance of hardware servers have improved substantially, traditional database processing is clearly unsuitable. With distributed data processing, multiple data sources are assigned to different databases on different servers, and curation and integration proceed simultaneously without interfering with one another.

The data sub-center server is also the center of data mining and curation. Because the data are diverse, each data source should have its own curation scheme and steps, and sub-center storage handles this well: each data source is mined in its own way, the uniquely mined data are stored as new variables in the respective sub-center, and the standardization mappings are also stored on the data sub-center server.

The raw data of each data source on the data sub-center servers include: physical examination data from provincial and municipal medical examination centers, basic public health service data, medical insurance data, and disease control data provided by centers for disease control and prevention. These data are large in volume and of great research value; they are first-line, real-world data, but their quality is uneven and they differ greatly from one another, so a standardized data storage structure (the master data tables) is needed for data access.

The master data tables include: the personnel basic information table, the personnel registration (record) table, the data variable index table, the numeric data storage table for laboratory tests, the image data storage table, the categorical data storage table for examinations and the like, and the overall examination conclusion;

The personnel basic information table includes basic information fields such as name, sex, ID card number, and marital status.

The data variable index table stores the information of each examination item, including: item name, reference range, containing table, and query condition;

The image data storage table stores text-type unstructured data such as imaging findings and diagnoses;

The categorical data storage table holds qualitative data, such as whether a result is normal;

The overall examination conclusion includes the overall conclusion, diseases, and the like for the physical examination database;

For more efficient data storage, the data sub-center server assigns each data source a unique storage code, and the raw data of each data source are stored in different databases on different servers. This distribution improves storage management efficiency, prevents data from different sources from contaminating one another, leaking, or being mishandled, and improves security. The raw data of each data source are stored in the corresponding database. The system provides a large number of data mining tools; platform users apply these tools to the raw data to obtain new data variables and build a data variable index from them. Each database has its own data variable index, and the one-to-one relationship between the data variable index and the data is also stored in the database;

Data center server: because data from different sources differ, a unified standard is needed to ease later extraction and use of the data. The platform provides standard maintenance functions that maintain the standard variables and the standard dictionaries separately. Using the standard dictionaries and variables, professionals apply the mapping tools provided by the platform and their domain knowledge to map the raw data of the data sub-center servers and complete the standardization work; the standardized results serve the data application server.

An example standard variable: item name: 24-hour urine volume; department: clinical laboratory; system number: 2437; pinyin code: 24hNL; applicable sex: unrestricted; result type: numeric; unit: ml; male reference upper limit: 1800; male reference lower limit: 800; female reference upper limit: 1600; female reference lower limit: 600.
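
The example above can be captured as a small record type. The values come from the example; the class, its field names, and the `classify` helper (which treats the limits as inclusive) are assumptions added for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StandardVariable:
    item_code: str
    item_name: str
    department: str
    pinyin_code: str
    result_type: str
    unit: str
    reference: dict  # sex -> (lower limit, upper limit)


URINE_24H = StandardVariable(
    item_code="2437",
    item_name="24-hour urine volume",
    department="clinical laboratory",
    pinyin_code="24hNL",
    result_type="numeric",
    unit="ml",
    reference={"male": (800, 1800), "female": (600, 1600)},
)


def classify(variable, value, sex):
    """Compare a value with the sex-specific reference range (limits inclusive)."""
    lower, upper = variable.reference[sex]
    if value < lower:
        return "low"
    if value > upper:
        return "high"
    return "normal"
```

For instance, a value of 1700 ml falls inside the male range but above the female upper limit of 1600, so the same measurement is classified differently depending on sex.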

The data center server is the basis of data standardization and the key to solving the data silo problem. It coordinates the original variable data and mined variable data of every source database and provides a unified standard for the data application server.

The data center server provides maintenance of the standard dictionaries; standard variable names, institution names, code libraries, and so on are all maintained on the data center server.

The data application server provides effective application services or tools for researchers, decision makers, data curation personnel, and others.

The KEY data index the VALUE data and ease data management. Many related tables surround the KEY data, such as the grouping tables used to group variables so that they can be managed more efficiently, and the mapping table that stores the mappings to standard data, through which the central database can conveniently index sub-center data.

The VALUE data are a family of tables that store data by data type. Every record carries a unique identifier that determines which individual in the BASE data it belongs to and on which occasion it was produced.

Standard variable maintenance coordinates the data of all sub-centers. Depending on their background, data differ greatly between sub-centers; even among physical examination institutions no industry standard exists. The first task of a multicenter data curation platform in integrating data is therefore to build a set of standard variables, establishing standard item names, codes, classifications, and so on according to international and domestic authorities and established conventions.

Standard dictionary maintenance is used to standardize the VALUE values of the raw data and is also an important basis for text structuring. The invention follows the dictionary standards published by domestic and international authorities, such as dictionaries for ICD-10, the pharmacopoeia, operations, and ultrasound imaging.

The queue creator builds research cohorts for cohort studies. On one hand, through the queue creator a user selects raw or mined data as research variables, sets inclusion and exclusion criteria, and sets the outcome variable, quickly generating a research cohort, and in turn research results, without any programming knowledge. On the other hand, creating a cohort is itself a new round of data mining: the cohort's outcome condition, inclusion and exclusion criteria, and so on can each serve as newly mined data for other research needs.

The data statistics platform computes statistics and displays data in a scientific, intuitive way, so that decision makers and researchers can quickly understand and use the data.

The incoming file, there are perfect data to export technical support for accessing data providing, provided according to data The specific job specification in side, the data presentation mode designed with them, in the form of a file exports source data.In general data File includes excel files, and csv file, three kinds of TXT files, for different file types, the present invention, which both provides, effectively to be connect Enter mode.

Database access is used for data transfer with data providers in deep cooperation. The data provider supplies documentation of the database structure and cooperates throughout the data utilization process; with cooperation between the technical staff of both sides, this mode makes the fullest use of the data. The present invention has accumulated the data structures of several mainstream physical examination databases on the market, the data structures of several basic public health service software companies, and so on, which greatly reduces the time needed for database access.

Interface service access is used where the data provider has some technical development capability. For data security reasons, some data providers find it inconvenient to use the two access modes above. The present invention provides a secure data access interface based on a WEBSERVICE technical scheme, together with detailed data access documentation, so that a data provider with some technical capability can conveniently transmit data in encrypted, secure form.

Crawler capture access is used for the large amount of public data available on the internet, such as air quality and weather data. Such data is closely related to health and medical analysis but has no convenient download address; a web crawler can effectively solve this problem. Likewise, for the legacy websites of many non-governmental organizations whose software is past its maintenance period, web crawling is an economical and effective option.

Questionnaire entry and other access modes: the modes above all serve the access of large volumes of existing data that is itself structured or semi-structured. Some data, however, is unstructured, such as paper-based survey data and photographs of medical record front pages. For such data the present invention provides an efficient manual entry mode: through custom questionnaires and manual extraction, the data is persisted to the database, completing data access.

Although the specific embodiments of the present invention have been described above with reference to the accompanying drawings, they do not limit the scope of protection of the invention. Those of ordinary skill in the art should understand that various modifications or variations that can be made on the basis of the technical solution of the present invention without creative effort still fall within the scope of protection of the invention.

Claims

1. A medical big data multicenter integration platform, characterized by comprising:

Data sub-center servers, each collecting the raw data of its data sources and storing the raw data in a corresponding database, each database including: a variable index table, a personnel basic information table, and an inspection result table; preprocessing operations are performed on the data in the variable index table, the personnel basic information table, and the inspection result table; each database has a unique code;

2. The medical big data multicenter integration platform as claimed in claim 1, characterized in that the canonical variable includes: item code, item name, department, index interpretation, data type, data label, and reference range;

The standard dictionary includes: the International Statistical Classification of Diseases and Related Health Problems (ICD10), the Chinese Pharmacopoeia, or positive signs;

The canonical variable maintenance includes: formulating standard item names, codes, and classifications;

The standard dictionary maintenance standardizes the raw data and performs text structuring according to the International Statistical Classification of Diseases and Related Health Problems (ICD10) or the Chinese Pharmacopoeia.

3. The medical big data multicenter integration platform as claimed in claim 1, characterized in that

The preprocessing comprises:

Performing data processing on each record in the variable index table to obtain new data variables, and building a new data variable index from the new data variables; standardizing the inspection item names and inspection item name codes in the variable index table according to the canonical variables of the data center server;

Deduplicating the data in the personnel basic information table; the deduplication includes: deduplication by work unit and deduplication by identity card;

Converting the text data in the inspection result table into structured data, and standardizing the inspection result names and inspection result name codes in the inspection result table according to the standard dictionary of the data center server.
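The identity card deduplication named in this claim can be sketched as follows. This is a minimal sketch under assumptions: the field name `id_card`, the normalization (trimming whitespace and upper-casing a trailing check letter), and the keep-first policy are illustrative choices, not details given by the claim, and the sample numbers are made up.

```python
def dedupe_by_id_card(records):
    """Keep the first record seen for each normalized identity card number."""
    seen = set()
    unique = []
    for record in records:
        # Assumed normalization: ignore surrounding spaces and letter case.
        key = record["id_card"].strip().upper()
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique


# Hypothetical personnel records; the first two differ only in formatting.
people = [
    {"id_card": "37010219800101123x", "name": "A"},
    {"id_card": "37010219800101123X ", "name": "A"},
    {"id_card": "370102197505054321", "name": "B"},
]
unique_people = dedupe_by_id_card(people)
```

Deduplication by work unit could follow the same pattern with a different key.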

4. The medical big data multicenter integration platform as claimed in claim 3, characterized in that

The performing of data processing on each record in the variable index table to obtain new data variables, and the building of a new data variable index from the new data variables, involves:

A regular expression matching module, for extracting regular data, that is, data obtained by matching a regular expression;

An automatic segmentation module, which produces new variables according to a set separator character; the separator is user-defined;

A text replacement module, for replacing erroneous expressions in the raw data;

A text structuring module, which processes unstructured text data into structured variable data, splitting and standardizing the text data by natural language processing or machine learning;

A data standardization module, which, by a similarity detection algorithm together with manual review, establishes a one-to-one mapping between the data variables of the data sub-center server and the canonical variables of the data center server.
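The claim does not name a particular similarity algorithm. Purely as an illustration, the sketch below uses the character-level ratio from Python's standard `difflib.SequenceMatcher` to propose, for each sub-center variable name, the closest canonical variable name, and marks every proposal as pending manual review. The function name, the 0.6 threshold, and the sample names are assumptions. The sketch does not itself enforce a one-to-one mapping; that is left to the reviewer.

```python
from difflib import SequenceMatcher


def propose_mappings(source_names, canonical_names, threshold=0.6):
    """Suggest the most similar canonical name for each source variable name."""
    proposals = {}
    for name in source_names:
        best, best_score = None, 0.0
        for canonical in canonical_names:
            score = SequenceMatcher(None, name.lower(), canonical.lower()).ratio()
            if score > best_score:
                best, best_score = canonical, score
        proposals[name] = {
            # Below the (assumed) threshold no candidate is suggested.
            "candidate": best if best_score >= threshold else None,
            "score": round(best_score, 2),
            # Every proposal still has to be confirmed by a human reviewer.
            "status": "pending_review",
        }
    return proposals


canonical = ["fasting blood glucose", "systolic blood pressure"]
proposals = propose_mappings(["Fasting Blood Glucose", "zzzz"], canonical)
```

Keeping the score alongside each candidate lets a reviewer work through the least certain proposals first.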

5. The medical big data multicenter integration platform as claimed in claim 1, characterized in that

KEY data is stored in the variable index table; BASE data is stored in the personnel basic information table; VALUE data is stored in the inspection result table; the KEY data represents the data variable index; the VALUE data represents the raw data; the BASE data represents personnel basic information data;

The KEY data is used to index the VALUE data and includes a grouping table and a comparison table; the grouping table stores the data variable indexes in groups; a composite-type group refers to a combination of inspection items; the comparison table stores the one-to-one correspondence between data variable indexes and data, and serves as the foreign key index of the VALUE data, indexing all detected values of the same test item;

The VALUE data is a set of tables that store the raw data according to its different data types; each raw data record has a unique index, formed from the hospital's region code + institution code + the record code of the raw data;

The BASE data stores personnel basic information; in principle each individual providing data has only one record, including: sex, name, marital status, identity card, telephone, and e-mail; this data is highly unique and has relatively high security requirements; the BASE data includes: the personnel basic information table, the personnel work unit table, and the mapping table between personnel and data.
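A minimal sketch of how a VALUE record and its unique index might be represented follows. The field names, the sample codes, and the choice to join the three codes by plain concatenation are assumptions for illustration; the claim states only that the index is formed from region code + institution code + record code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ValueRecord:
    region_code: str       # region code of the hospital
    institution_code: str  # institution code
    record_code: str       # record code of the raw data
    key_index: str         # foreign key into the KEY (variable index) data
    value: str             # raw value as received

    @property
    def unique_index(self) -> str:
        # Assumed layout: fixed-width codes joined without a separator.
        return f"{self.region_code}{self.institution_code}{self.record_code}"


def values_for_key(records, key_index):
    """Return all values indexed by one KEY entry, i.e. one test item."""
    return [r.value for r in records if r.key_index == key_index]


record = ValueRecord("3701", "0007", "0000012345", "K_ALT", "35")
```

If the codes were not fixed-width, a separator would be needed to keep the combined index unambiguous.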

6. The medical big data multicenter integration platform as claimed in claim 1, characterized in that

By means of the queue creator, raw data or mined data is selected as research variables, inclusion criteria and exclusion criteria are set, and an outcome variable is set, to generate a research cohort; the inclusion criteria are a series of indexes or conditions selected among patients who meet the diagnostic criteria; the exclusion criteria are indexes that exclude patients who could interfere with the accuracy of the results; the final result variable, also called the outcome variable or simply the outcome, is the expected result event that will appear during follow-up observation, that is, the event the researcher wishes to track; the research cohort is the data matrix formed by selecting the required study subjects from a specified population according to whether they are currently exposed, or were exposed in some past period, to a risk factor under study;

The data statistics platform is used to compile statistics and display data.
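The logic that a queue creator might apply behind its interface can be sketched as follows. The function `build_cohort`, the field names, and the example criteria (age of at least 40, no diabetes at baseline, diabetes at follow-up as the outcome) are hypothetical and serve only to show how inclusion criteria, exclusion criteria, and an outcome variable combine into a data matrix.

```python
def build_cohort(people, include, exclude, outcome):
    """Select subjects by inclusion/exclusion criteria and attach the outcome."""
    cohort = []
    for person in people:
        if include(person) and not exclude(person):
            cohort.append({**person, "outcome": int(outcome(person))})
    return cohort


# Hypothetical population records.
people = [
    {"id": "P1", "age": 52, "diabetes_at_baseline": False, "diabetes_at_followup": True},
    {"id": "P2", "age": 35, "diabetes_at_baseline": False, "diabetes_at_followup": False},
    {"id": "P3", "age": 60, "diabetes_at_baseline": True, "diabetes_at_followup": True},
    {"id": "P4", "age": 47, "diabetes_at_baseline": False, "diabetes_at_followup": False},
]

cohort = build_cohort(
    people,
    include=lambda p: p["age"] >= 40,             # inclusion criterion
    exclude=lambda p: p["diabetes_at_baseline"],  # exclusion criterion
    outcome=lambda p: p["diabetes_at_followup"],  # outcome variable
)
```

Here P2 fails the inclusion criterion and P3 is removed by the exclusion criterion, leaving P1 and P4 in the cohort.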

7. A medical big data multicenter integration method, characterized by comprising the following steps:

Step (1): The data of each data sub-center server is accessed into the data center server, and quality evaluation is performed on the data of each data sub-center server; if the quality evaluation is passed, proceed to step (2); if it is not passed, the data center server feeds back a fail conclusion to the data sub-center server; the quality evaluation includes assessment of: data integrity, data duplication rate, data deviation, and data volume;

Step (2): The data center server establishes and maintains the canonical variables and the standard dictionary, and preprocesses the data according to the canonical variables and the standard dictionary;

Step (3): Data standardization: variable standardization and data value standardization; by a similarity matching algorithm together with manual review, a one-to-one mapping is established between the data variables of the data sub-center server and the canonical variables of the data center server;
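The quality evaluation of step (1) is not specified in detail. As a rough sketch under assumptions, the following computes three of the four named measures (integrity as the share of filled required fields, duplication rate, and data volume) and applies pass/fail thresholds. The thresholds, field names, and the definition of a duplicate as an identical combination of required fields are all illustrative; data deviation is not covered here.

```python
def evaluate_quality(records, required_fields, min_records=1,
                     min_completeness=0.9, max_duplicate_rate=0.05):
    """Return simple quality measures and a pass/fail conclusion."""
    total = len(records)
    if total == 0 or not required_fields:
        return {"passed": False, "volume": total,
                "completeness": 0.0, "duplicate_rate": 0.0}
    # Integrity: fraction of required cells that are neither None nor empty.
    filled = sum(1 for r in records for f in required_fields
                 if r.get(f) not in (None, ""))
    completeness = filled / (total * len(required_fields))
    # Duplication: share of records repeating an earlier field combination.
    distinct = len({tuple(r.get(f) for f in required_fields) for r in records})
    duplicate_rate = 1 - distinct / total
    passed = (total >= min_records
              and completeness >= min_completeness
              and duplicate_rate <= max_duplicate_rate)
    return {"passed": passed, "volume": total,
            "completeness": completeness, "duplicate_rate": duplicate_rate}


# Hypothetical sub-center batch: one duplicate row and one missing value.
batch = [
    {"id": "a", "value": "1"},
    {"id": "a", "value": "1"},
    {"id": "b", "value": ""},
    {"id": "c", "value": "3"},
]
report = evaluate_quality(batch, ["id", "value"])
```

A failing report of this kind is what the data center server would feed back to the sub-center.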

8. The medical big data multicenter integration method as claimed in claim 7, characterized in that step (2) comprises:

Step (201): Each record is checked according to the data variable index; the data sub-center server presents the raw data visually using graphical tools such as frequency tables or bar charts, and abnormal data is rejected;

Step (202): The data is edited:

A regular expression matching step, for extracting regular data, that is, data obtained by matching a regular expression;

An automatic segmentation step, which produces new variables according to a set separator character; the separator is user-defined;

A text replacement step, for replacing erroneous expressions in the raw data;

A text structuring step, which processes unstructured text data into structured variable data, splitting and standardizing the text data by natural language processing or machine learning;

Step (203): The data obtained by editing in step (202) is stored as new variable data in the variable index table of the data sub-center server.
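The three simpler editing steps of step (202) can be sketched with Python's standard library as follows. The function names, the sample text, the correction table, and the blood pressure pattern are assumptions chosen for illustration.

```python
import re


def replace_errors(text, corrections):
    """Text replacement: swap known erroneous expressions for correct ones."""
    for wrong, right in corrections.items():
        text = text.replace(wrong, right)
    return text


def split_into_variables(text, separator):
    """Automatic segmentation: split on a user-defined separator."""
    return [part.strip() for part in text.split(separator) if part.strip()]


def regex_extract(text, pattern):
    """Regular expression matching: return the first match, or None."""
    match = re.search(pattern, text)
    return match.group(0) if match else None


# Hypothetical raw value containing a misspelling and two findings.
raw = "BP 120/80 mmHg; pluse 72"
cleaned = replace_errors(raw, {"pluse": "pulse"})
parts = split_into_variables(cleaned, ";")
blood_pressure = regex_extract(parts[0], r"\d+/\d+")
```

Each part produced by the split could then be stored as a new variable, as step (203) describes.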

9. The medical big data multicenter integration method as claimed in claim 8, characterized in that, in step (202), the step of processing unstructured text data into structured variable data is:

Step (2023): Using a word segmentation algorithm from natural language processing, with the standard dictionary library as the segmentation basis, the original text data is first segmented, and the segmentation results are then compared against the standard dictionary library by a similarity algorithm, so that word segmentation is completed automatically;

Step (2025): The structured data is exported.
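The claim does not say which segmentation algorithm is used. As one simple possibility, the sketch below applies forward maximum matching, a common dictionary-based method for text written without spaces: at each position it takes the longest dictionary term that matches, falling back to a single character. The dictionary contents and the sample ultrasound finding are hypothetical, and the similarity comparison mentioned in the claim is not shown.

```python
def segment(text, dictionary):
    """Forward maximum matching against a non-empty set of dictionary terms."""
    max_len = max(len(term) for term in dictionary)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest possible candidate first, then shrink.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in dictionary:
                tokens.append(candidate)
                i += size
                break
    return tokens


# Hypothetical dictionary terms: liver, enhanced echo, gallbladder wall, rough.
dictionary = {"肝脏", "回声增强", "胆囊壁", "毛糙"}
tokens = segment("肝脏回声增强胆囊壁毛糙", dictionary)
# Keep only recognized terms as the structured output.
structured = [t for t in tokens if t in dictionary]
```

Characters that match no dictionary term come out as single-character tokens, which makes unrecognized text easy to route to manual handling.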

10. The medical big data multicenter integration method as claimed in claim 7, characterized in that the variable standardization of step (3) comprises:

Step (301): A data variable is selected from the variable index table of the data sub-center server, and a canonical variable is selected from the standard dictionary defined by the data center server; based on medical knowledge, the user compares the two variables' names, departments, and actual test data, and determines the one-to-one correspondence between the data variable and the canonical variable, thereby completing the comparison mapping;

Step (302): The variables mapped in step (301) are reviewed, completing the standardization operation and thereby ensuring the accuracy of the variable mapping.