CN107833595A - Medical big data multicenter integration platform and method - Google Patents

Publication number
CN107833595A
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201710946758.8A
Other languages
Chinese (zh)
Inventor
薛付忠
季晓康
王永超
高琦
徐聪
王晓鹤
阿力木·达依木
曹瑾
许艺博
蒋正
卞伟玮
李敏
孙苑潆
韩君铭
马官慧
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Kang Wei Health Care Big Data Technology Co Ltd
Shandong University
Original Assignee
Kang Wei Health Care Big Data Technology Co Ltd
Shandong University
Application filed by Kang Wei Health Care Big Data Technology Co Ltd and Shandong University
Priority to CN201710946758.8A
Publication of CN107833595A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/31 Indexing; Data structures therefor; Storage structures

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Medical Treatment And Welfare Office Work (AREA)

Abstract

The invention discloses a medical big data multicenter integration platform and method. Data from each data sub-center server are accessed into a data center server, and the data of each sub-center server undergo quality assessment. If the assessment is passed, processing proceeds to the next step; if it is not passed, the data center server feeds the failure conclusion back to the data sub-center server. The quality assessment covers data integrity, data duplication rate, data deviation, and data volume. The data center server establishes and maintains standard variables and standard dictionaries and preprocesses the data according to them. Data standardization comprises variable standardization and data value standardization: through a similarity matching algorithm combined with manual review, the data variables of the data sub-center servers are mapped one-to-one to the standard variables of the data center server. The standardized data on the data center server are then put to use.

Description

Medical big data multicenter integration platform and method
Technical field
The present invention relates to a medical big data multicenter integration platform and method.
Background
The prior art has the following problems to be solved:
First, the data volume is huge and the formats are numerous and varied. The data include physical examination data from dozens of medical centers; government data such as basic public health service data and data on women of childbearing age from multiple regions; clinical data from several Grade A hospitals; and data on multiple specific diseases, for example major disease databases such as mental illness and glioma. Each data source stores a large amount of data, and the data formats differ from source to source.
Second, the drawbacks of traditional data arrangement. Traditional data arrangement targets centralized databases and consumes substantial manpower and material resources to arrange data, perform statistical analysis, and find valuable scientific results. With the arrival of the big data era and the addition of wearable devices, however, the data volume in the medical and health field is growing exponentially, and the traditional approach clearly no longer meets current data processing needs; it has become a major obstacle to researchers' use of data. In particular, how to manage multicenter, diverse data as a whole on one data arrangement platform, so that the sources can be mined together and complement one another, is a problem the traditional approach cannot solve. For example, the traditional approach cannot determine whether records from diagnosis, treatment, or physical examinations at different hospitals belong to the same person.
Third, data presentation. The data volumes in biostatistical research are huge; each of the databases described above holds data on the order of millions of records or more. The prior art cannot present such data intuitively. Big data visualization is needed to display the data in more intuitive graphical forms, such as histograms, line charts, and scatter plots, so that data users and decision makers gain a basic, intuitive understanding of the data in preparation for subsequent research and decision making.
Fourth, data standardization. Because no unified industry standard exists, medical institutions and data owners have collected and stored data very differently over the course of their own informatization. For example, the same disease, drug, or operation is named differently in different institutions; for the same test indicator, the reference range and unit differ greatly because of differences in testing instruments and reagents. A data arrangement platform must therefore establish a set of standards and use effective processing tools to arrange and standardize indicator names and indicator result values.
Fifth, the processing of unstructured data. This refers to processing text such as examination descriptions and examination conclusions. Key information must be extracted from whole passages of descriptive text; otherwise the text cannot be used effectively for research. These text data are voluminous and information-rich, so while extracting the key, effective information, the extraction must also be comprehensive: losing any useful information is a serious loss of data integrity.
Sixth, the relationship between research and arrangement. Data arrangement is the prerequisite of research statistics, but there is an awkward problem: the arranged database may well not contain the indicators a study requires. For example, a study may need the indicator "NASH", whereas routine data arrangement of physical examination indicators only records whether the person drinks alcohol and whether ultrasound diagnosed fatty liver. The type of fatty liver then has to be defined by the researchers themselves, which requires arranging the raw data again.
Summary of the Invention
The purpose of the present invention is to solve the above problems by providing a medical big data multicenter integration platform and method that offers convenient data access, distributed storage in separate databases, a rich set of tools, intuitive visualization, and intelligent arrangement.
To achieve this purpose, the present invention adopts the following technical solution:
A medical big data multicenter integration platform, comprising:
A data center server, which establishes and maintains standard variables and standard dictionaries;
Data sub-center servers, which collect the raw data of each data source and store them in the corresponding database. Each database contains a variable index table, a personnel basic information table, and an examination result table; the data in these three tables are preprocessed. Each database has a unique code;
A data application server, which puts to use the data preprocessed by the data sub-center servers.
The standard variables include: item code, item name, department, indicator interpretation, data type, data label, and reference range.
Item code, for example 1001 or 1002; item name, for example mean corpuscular hemoglobin concentration (MCHC) or mean corpuscular hemoglobin content; department, for example clinical laboratory or gynecology; indicator interpretation, an introduction to the item; data type, for example numeric or text; data label, for example blood routine or urine routine; reference range, the reference range of each test result.
The standard dictionaries include the International Statistical Classification of Diseases and Related Health Problems (ICD-10), the Chinese Pharmacopoeia, or positive signs.
Standard variable maintenance includes building standard item names, codes, and classifications.
Standard dictionary maintenance standardizes the raw data and structures the text according to ICD-10 or the Chinese Pharmacopoeia.
The preprocessing is as follows:
Each record in the variable index table is processed to obtain new data variables, and a new data variable index is built from them; the examination item names and their codes in the variable index table are standardized according to the standard variables of the data center server;
The data in the personnel basic information table are deduplicated, by work unit and by ID card number;
The text data in the examination result table are converted into structured data, and the examination result names and their codes are standardized according to the standard dictionaries of the data center server.
Processing each record in the variable index table to obtain new data variables, and building the new data variable index from them, involves the following modules:
Manual splitting module, for manually splitting medical record data into multiple sentence variables;
Regular expression matching module, for extracting rule-conforming data, that is, data obtained by matching a regular expression, for example numbers;
Automatic splitting module, which produces new variables according to a configured delimiter; the delimiter is user-defined, for example a semicolon or a space;
Text replacement module, for replacing erroneous expressions in the raw data;
Fragment extraction module, for extracting the needed text fragments from examination results;
Unit conversion module, for converting the units of data in order to unify measurement;
Text structuring module, which processes unstructured text data into structured variable data, splitting and standardizing the text by natural language processing or machine learning; for example, text descriptions such as imaging and ultrasound reports are split and standardized;
Data standardization module, which uses a similarity detection algorithm combined with manual review to map the data variables of the data sub-center servers one-to-one to the standard variables of the data center server.
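The simpler editing modules listed above can be sketched in a few lines of Python. This is only an illustrative sketch: the function names, the default number pattern, and the litre/millilitre conversion table are assumptions made for the example and are not taken from the patent.

```python
import re

def regex_extract(text, pattern=r"\d+(?:\.\d+)?"):
    """Regular expression matching: pull rule-conforming data (here, numbers) out of text."""
    return re.findall(pattern, text)

def auto_split(text, delimiters="; "):
    """Automatic splitting: cut text at any of the user-defined delimiter characters."""
    parts = re.split("[" + re.escape(delimiters) + "]+", text)
    return [p for p in parts if p]

def replace_text(text, corrections):
    """Text replacement: substitute known erroneous expressions (hypothetical mapping)."""
    for wrong, right in corrections.items():
        text = text.replace(wrong, right)
    return text

# Hypothetical conversion factors; a real table would cover clinical units.
UNIT_FACTORS = {("L", "mL"): 1000.0, ("mL", "L"): 0.001}

def convert_unit(value, from_unit, to_unit):
    """Unit conversion: bring values onto one unit so measurements are comparable."""
    if from_unit == to_unit:
        return value
    if (from_unit, to_unit) not in UNIT_FACTORS:
        raise ValueError(f"no conversion from {from_unit} to {to_unit}")
    return value * UNIT_FACTORS[(from_unit, to_unit)]
```

Each function produces a new value without touching its input, which matches the idea that edited data are stored as new variables while the raw data stay intact.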
The variable index table stores KEY data, the personnel basic information table stores BASE data, and the examination result table stores VALUE data. KEY data are the data variable indexes, VALUE data are the raw data, and BASE data are the personnel basic information.
The KEY data are used to index the VALUE data and include grouping tables and a mapping table. The grouping tables store the data variable indexes in groups, for example by department, by data type, or by combination; a combination group is a combination of examination items, such as the hepatitis B five-item panel or blood routine. The mapping table stores the one-to-one correspondence between data variable indexes and data and serves as the foreign key index of the VALUE data, indexing all measured values of the same test item.
The VALUE data are tables that store the raw data according to their data types. Every raw record has a unique index composed of the hospital's region code, the institution code, and the raw record code.
The BASE data store personnel basic information. In principle each individual has only one record per dataset, including sex, name, marital status, ID card number, telephone, and e-mail; these data are highly unique and carry relatively high security requirements. The BASE data include the personnel basic information table, the personnel work unit table, and the table mapping personnel to data.
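The KEY/VALUE/BASE relationship described above can be illustrated with a minimal sketch. The class and field names, the "-" separator in the index, and the sample codes are all hypothetical; the patent only states that the unique index combines the region code, the institution code, and the raw record code.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class KeyEntry:
    """One row of the variable index table (KEY)."""
    variable_id: str
    name: str
    group: str          # e.g. department or data-type group

@dataclass(frozen=True)
class ValueRecord:
    """One raw record of an examination result table (VALUE)."""
    index: str          # unique index of the raw record
    variable_id: str    # foreign key pointing at the KEY entry
    person_id: str      # link to the person's BASE record
    value: str

def make_value_index(region_code, institution_code, record_code):
    """Compose the unique index; the '-' separator is an assumption."""
    return "-".join([region_code, institution_code, record_code])

def values_for(key, records):
    """Use the KEY entry to collect every measured value of the same test item."""
    return [r for r in records if r.variable_id == key.variable_id]

mchc = KeyEntry("1001", "MCHC", "clinical laboratory")
records = [
    ValueRecord(make_value_index("3701", "H001", "000042"), "1001", "p1", "330"),
    ValueRecord(make_value_index("3701", "H001", "000043"), "1002", "p1", "29"),
]
```

Keeping the index composite means records from different institutions cannot collide even if their local record codes coincide.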
The data application server includes a cohort builder and a data statistics platform.
With the cohort builder, raw or mined data are selected as study variables, inclusion and exclusion criteria are set, and an outcome variable is set to generate a research cohort. Inclusion criteria are a set of indicators or conditions selected among patients who meet the diagnostic criteria. Exclusion criteria are indicators used to exclude patients who could compromise the accuracy of the results. The outcome variable, or outcome for short, is the expected result event that will occur during follow-up observation, namely the event the researcher wishes to track. A research cohort is a data matrix formed by selecting the required study subjects from a specified population according to whether they are currently, or were in some past period, exposed to the risk factor under study.
The data statistics platform is used to compute statistics and display data.
A medical big data multicenter integration method, comprising the following steps:
Step (1): The data of each data sub-center server are accessed into the data center server, and the data of each sub-center server undergo quality assessment. If the assessment is passed, the method proceeds to step (2); if it is not passed, the data center server feeds the failure conclusion back to the data sub-center server. The quality assessment covers data integrity, data duplication rate, data deviation, and data volume;
Step (2): The data center server establishes and maintains standard variables and standard dictionaries, and the data are preprocessed according to them;
Step (3): Data standardization, comprising variable standardization and data value standardization: through a similarity matching algorithm combined with manual review, the data variables of the data sub-center servers are mapped one-to-one to the standard variables of the data center server;
Step (4): The standardized data on the data center server are put to use.
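The quality gate of step (1) could look roughly like the sketch below. The thresholds, field names, and report layout are assumptions for illustration; the patent names the assessed dimensions but gives no formulas, and the data deviation check is left out here because the text does not define it.

```python
def assess_quality(records, required_fields, id_field,
                   min_records=1, max_missing=0.2, max_duplicate=0.1):
    """Return a small report; 'passed' decides whether step (2) may start."""
    n = len(records)
    if n < min_records:                       # data volume check
        return {"passed": False, "reason": "too few records"}

    # Integrity: share of required cells that are empty or absent.
    missing = sum(1 for r in records for f in required_fields
                  if r.get(f) in (None, ""))
    missing_rate = missing / (n * len(required_fields))

    # Duplication rate: share of records whose identifier repeats an earlier one.
    ids = [r.get(id_field) for r in records]
    duplicate_rate = 1 - len(set(ids)) / n

    passed = missing_rate <= max_missing and duplicate_rate <= max_duplicate
    return {"passed": passed,
            "missing_rate": missing_rate,
            "duplicate_rate": duplicate_rate}

sample = [
    {"id": "1", "name": "a"},
    {"id": "1", "name": ""},
    {"id": "2", "name": "b"},
    {"id": "3", "name": "c"},
]
report = assess_quality(sample, ["id", "name"], "id")
```

A failing report is what the data center server would feed back to the sub-center server as the "not passed" conclusion.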
Step (2) comprises:
Step (201): Each record is checked against the data variable index; graphical tools such as frequency tables or bar charts present the raw data of the data sub-center server intuitively, and abnormal data are rejected;
Step (202): The data are edited and arranged:
Manual splitting step, for manually splitting medical record data into multiple sentence variables;
Regular expression matching step, for extracting rule-conforming data, that is, data obtained by matching a regular expression, for example numbers;
Automatic splitting step, which produces new variables according to a configured delimiter; the delimiter is user-defined, for example a semicolon or a space;
Text replacement step, for replacing erroneous expressions in the raw data;
Fragment extraction step, for extracting the needed text fragments from examination results;
Unit conversion step, for converting the units of data in order to unify measurement;
Text structuring step, which processes unstructured text data into structured variable data, splitting and standardizing the text by natural language processing or machine learning; for example, text descriptions such as imaging and ultrasound reports are split and standardized;
In step (202), unstructured text data are processed into structured variable data as follows:
Step (2021): Select the data variables that require text structuring;
Step (2022): Deduplicate the data variables and store the deduplicated data in the text structuring data table;
Step (2023): Using a word segmentation algorithm from natural language processing, with the standard dictionary library as the segmentation basis, the raw text data are first segmented; the segmentation results are then compared with the standard dictionary library by a similarity algorithm, so that word segmentation is completed automatically;
Step (2024): Data that cannot be fully recognized are supplemented manually to ensure data integrity;
Step (2025): The structured data are exported.
Step (203): The data obtained from the editing and arrangement in step (202) are stored as new variable data in the variable index table of the data sub-center server.
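Steps (2022) to (2024) can be sketched as follows. The patent does not name its segmentation or similarity algorithm, so this sketch substitutes plain delimiter splitting and the standard-library difflib matcher; the function name, the cutoff, and the sample dictionary are assumptions.

```python
import difflib
import re

def structure_texts(raw_texts, dictionary, cutoff=0.8):
    """Deduplicate texts, split them into segments, and match segments to dictionary terms.

    Returns (structured, unresolved): structured maps each distinct text to the
    standard terms found in it; unresolved lists segments left for manual review.
    """
    structured, unresolved = {}, []
    for text in dict.fromkeys(raw_texts):          # step (2022): deduplicate, keep order
        terms = []
        for segment in re.split(r"[,;.]+", text):  # stand-in for real word segmentation
            segment = segment.strip().lower()
            if not segment:
                continue
            match = difflib.get_close_matches(segment, dictionary, n=1, cutoff=cutoff)
            if match:
                terms.append(match[0])             # step (2023): similarity match
            else:
                unresolved.append(segment)         # step (2024): needs manual supplement
        structured[text] = terms
    return structured, unresolved

dictionary = ["fatty liver", "gallstones"]
raw = ["Fatty liver; gallstone", "Fatty liver; gallstone", "renal cyst"]
structured, unresolved = structure_texts(raw, dictionary)
```

Segments that fall below the cutoff are not forced onto a dictionary term, which keeps the manual supplement step as the safeguard for data integrity.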
The variable standardization of step (3) comprises:
Step (301): A data variable is selected from the variable index table of the data sub-center server, and a standard variable is selected from the standard dictionary defined by the data center server. Drawing on medical knowledge, the user compares the two variables' names, departments, and actual test results to determine the one-to-one correspondence between the data variable and the standard variable, thereby completing the mapping;
Step (302): The variable mapped in step (301) is reviewed to complete the standardization, thereby ensuring the accuracy of the variable mapping.
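A possible shape for steps (301) and (302) is sketched below: a similarity score proposes candidate standard variables, and only choices confirmed by a reviewer enter the final one-to-one mapping. The use of difflib, the cutoff, and all names are assumptions; the patent does not specify the similarity algorithm.

```python
import difflib

def propose_mappings(source_names, standard_names, cutoff=0.6):
    """For each sub-center variable name, list up to three similar standard names, best first."""
    return {name: difflib.get_close_matches(name, standard_names, n=3, cutoff=cutoff)
            for name in source_names}

def confirm_mappings(reviewer_choices):
    """Keep reviewer-approved pairs and enforce that the mapping is one-to-one."""
    mapping, used = {}, set()
    for source, standard in reviewer_choices.items():
        if standard is None:          # reviewer rejected every candidate
            continue
        if standard in used:
            raise ValueError(f"{standard!r} is already mapped to another variable")
        mapping[source] = standard
        used.add(standard)
    return mapping

standard_names = ["24-hour urine volume", "Hemoglobin concentration",
                  "White blood cell count"]
proposals = propose_mappings(["24h urine volume", "Hemoglobin conc."], standard_names)
```

The algorithm only narrows the choice; the decision stays with a reviewer who can also weigh department and actual test values, as step (301) requires.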
Step (4): With the cohort builder, raw or mined data are selected as study variables, inclusion and exclusion criteria are set, and an outcome variable is set to generate a research cohort. Inclusion criteria are a set of indicators or conditions selected among patients who meet the diagnostic criteria. Exclusion criteria are indicators used to exclude patients who could compromise the accuracy of the results. The outcome variable, or outcome for short, is the expected result event that will occur during follow-up observation, namely the event the researcher wishes to track. A research cohort is a data matrix formed by selecting the required study subjects from a specified population according to whether they are currently, or were in some past period, exposed to the risk factor under study.
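The cohort-building logic of step (4) reduces to filtering by inclusion and exclusion criteria and attaching an outcome flag. The sketch below assumes criteria are passed as plain predicates; the field names and the example criteria are hypothetical.

```python
def build_cohort(people, inclusion, exclusion, outcome):
    """Return one row per eligible person, with the outcome flag attached."""
    cohort = []
    for person in people:
        if inclusion(person) and not exclusion(person):
            cohort.append({**person, "outcome": bool(outcome(person))})
    return cohort

people = [
    {"id": "p1", "age": 45, "drinks_alcohol": False, "fatty_liver": True},
    {"id": "p2", "age": 52, "drinks_alcohol": True, "fatty_liver": True},
    {"id": "p3", "age": 16, "drinks_alcohol": False, "fatty_liver": False},
    {"id": "p4", "age": 38, "drinks_alcohol": False, "fatty_liver": False},
]

cohort = build_cohort(
    people,
    inclusion=lambda p: p["age"] >= 18,          # hypothetical inclusion criterion
    exclusion=lambda p: p["drinks_alcohol"],     # hypothetical exclusion criterion
    outcome=lambda p: p["fatty_liver"],          # hypothetical outcome variable
)
```

The resulting list of rows is the data matrix the text calls a research cohort; each saved criterion could itself be stored as a newly mined variable.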
Beneficial effects of the present invention:
1. The invention effectively solves the problem of huge and diverse data. First, each data source is allocated its own repository node, which facilitates later distributed computing, and the data warehouse storage mode addresses the large data volume. The model is designed from the perspective of scientific data use, and every data storage node has the same structure: the index table KEY points to all stored data, the VALUE tables store the bulk data, and the BASE tables hold basic information. A unified, standard storage structure facilitates data management and use and resolves the diversity of the raw data.
2. The invention provides a platform for multicenter data integration and arrangement. On this platform, data curators and data users need not be concerned with the source, storage, or quantity of the data, and need no programming knowledge. Using the discovery and arrangement tools the platform provides, they can process data of different sources and formats in a unified workflow and store or export them in one standard format, so that arrangement is completed quickly.
3. The invention provides intuitive big data visualization, including pie charts, bar charts, scatter plots, line charts, and funnel charts, which greatly improves the use, decision-making value, and arrangement of the data. To better support research work, the invention also provides many statistical description views, such as frequency distribution charts and descriptive statistics charts (median, mode, 2-quantile, 3-quantile).
4. The invention effectively addresses the "information island" problem and improves the fusion of multicenter data. The system can judge how likely two records are to belong to the same person from multidimensional data such as ID card number, height and weight, age, date of birth, disease status, family history, personal history, sex, and marital status (for example, records with the same name in pinyin, the same sex, the same date of birth, the same work unit, and a height difference of no more than 2 cm are regarded as the same person). In addition, different research data are mapped to the standards through the platform's dictionary tree, so that data are standardized quickly.
5. The invention provides a tool for structuring unstructured text data, which has been tested and validated in practical work in terms of accuracy, processing speed, and degree of intelligence.
6. The data arrangement provided by the invention and the downstream research platform complement and promote each other. On the one hand, data arrangement lays the foundation for the research use of data; on the other hand, research in turn drives deeper mining of the data.
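The same-person rule given as an example in effect 4 translates directly into code. The sketch below implements only that one rule; the field names are hypothetical, and a real system would presumably combine several such rules with the ID card number and other dimensions.

```python
def same_person(a, b, max_height_diff_cm=2.0):
    """Apply the example rule: identical pinyin name, sex, birth date, and
    work unit, with heights differing by no more than 2 cm."""
    exact_fields = ("name_pinyin", "sex", "birth_date", "work_unit")
    if any(a[field] != b[field] for field in exact_fields):
        return False
    return abs(a["height_cm"] - b["height_cm"]) <= max_height_diff_cm

exam_2015 = {"name_pinyin": "wang xiaoming", "sex": "M", "birth_date": "1980-05-01",
             "work_unit": "Unit A", "height_cm": 172.0}
exam_2016 = {"name_pinyin": "wang xiaoming", "sex": "M", "birth_date": "1980-05-01",
             "work_unit": "Unit A", "height_cm": 173.5}
other = dict(exam_2016, height_cm=180.0)
```

The height tolerance absorbs ordinary measurement variation between examinations while still separating people who merely share a name and birth date.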
Brief Description of the Drawings
Fig. 1 is the overall architecture diagram of the present invention.
Detailed Description
The invention is further described below with reference to the accompanying drawing and embodiments.
As shown in Fig. 1, the big data integration platform establishes a central database in which a set of standard indicator dictionary libraries can be conveniently established and maintained. The invention also provides a variable mapping system: whatever the degree of standardization of the raw data, once variables are mapped with this system, the data can be indexed and used by the central database. A variety of data mining and arrangement tools are provided so that data can be arranged quickly and efficiently, saving the cost of involving large numbers of personnel. Through the storage model of the invention, the KEY tables index the data on each data node, and the data are displayed visually.
The invention employs a medical language processing algorithm:
Step 1: variable indexes are designed for the raw text data so that curators can view the raw data through a clear index, and the discovery and arrangement tools provided by the system are used for preliminary splitting and arrangement;
Step 2: a word segmentation algorithm is applied to the edited data to generate structured data;
Step 3: a similarity algorithm, drawing on the dictionary accumulated through machine learning, standardizes the structured data automatically;
Step 4: manual review provides quality control and at the same time trains the artificial intelligence dictionary, making the text structuring function more intelligent.
In addition, data arrangement advances research and research promotes deeper data arrangement, forming a virtuous cycle: the variable builder module of the cohort builder mines new variables and thereby achieves further data arrangement.
The platform has arranged 12 physical examination databases, the clinical databases of 3 hospitals, the basic public health databases of 4 regions, 1 regional medical insurance database, a death database, 1 regional database of women of childbearing age, and a provincial children's physical examination database.
The data sub-center server takes preserving the integrity of the raw data as its first storage principle. On the basis of not losing any valid raw data, it extracts data variables and builds a data index regardless of the data volume.
The data sub-center servers are the basis for distributed data arrangement and large-scale data processing. Every dataset is very large; although the processing power of database software and the performance of server hardware have increased substantially, traditional database processing is clearly unsuitable. With distributed data processing, multiple data sources are allocated to different databases on different servers, and their arrangement and integration proceed simultaneously without interfering with one another.
The data sub-center server is also the center of data mining and arrangement. Because the data are diverse, each data source should have its own arrangement plan and steps. Storage at the sub-centers solves this well: different data sources undergo their own mining, the uniquely mined data are stored as new variables in their respective sub-centers, and the standardization mappings are also stored on the data sub-center server.
The raw data of each data source on the data sub-center servers include physical examination data from provincial and municipal physical examination centers, basic public health service data, medical insurance data, and disease control data provided by the Center for Disease Control and Prevention. These data are large in volume and of high research value; they are the most authentic first-line data, but their quality is uneven and they differ greatly from one another. A standardized data storage structure (the master data tables) is therefore needed for data access.
The master data tables include: the personnel basic information table, the personnel registration (record) table, the data variable index table, the numeric data storage table (for laboratory tests and the like), the imaging data storage table, the categorical data storage table, and the overall examination conclusion.
The personnel basic information table includes basic fields such as name, sex, ID card number, and marital status.
The data variable index table stores the information of each examination item, including item name, reference range, the table in which the data reside, and the query conditions.
The imaging data storage table stores unstructured text data such as imaging findings and diagnoses.
The categorical data storage table holds qualitative data, such as whether a result is normal.
The overall examination conclusion includes the overall conclusion, diseases, and so on for a physical examination database.
For more efficient storage, the data sub-center server assigns each data source a unique storage code, and the raw data of each source are stored in different databases on different servers. This distributed arrangement improves storage management efficiency, prevents contamination, leakage, or misoperation across data from different sources, and improves security. The raw data of each source are stored in the corresponding database. The system provides a large number of data mining tools, with which platform users mine the raw data to obtain new data variables and build data variable indexes from them; each database has its own data variable index, and the one-to-one correspondence between data variable indexes and data is also stored in the database.
Data center server: because data from different sources differ, the platform provides standard maintenance functions to unify standards and facilitate later data extraction and use. Standard variables and standard dictionaries are maintained separately. Using them together with the comparison tool the platform provides, professionals apply their domain knowledge to compare against the raw data on the data sub-center servers and thereby complete the standardization; the standardized results serve the data application server.
An example of a standard variable: item name: 24-hour urine volume; department: clinical laboratory; system number: 2437; pinyin code: 24hNL; applicable sex: unrestricted; result type: numeric; unit: ml; male reference upper limit: 1800; male reference lower limit: 800; female reference upper limit: 1600; female reference lower limit: 600.
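The standard variable in this example maps naturally onto a small record type. The sketch below uses the values quoted above; the class name, the field names, and the choice of inclusive range bounds are assumptions made for illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class StandardVariable:
    """A standard variable as maintained on the data center server (illustrative)."""
    name: str
    department: str
    system_number: int
    pinyin_code: str
    result_type: str
    unit: str
    male_range: tuple     # (lower limit, upper limit)
    female_range: tuple   # (lower limit, upper limit)

    def in_reference_range(self, value, sex):
        """Check a value against the sex-specific reference range (bounds inclusive)."""
        low, high = self.male_range if sex == "male" else self.female_range
        return low <= value <= high

urine_24h = StandardVariable(
    name="24-hour urine volume",
    department="clinical laboratory",
    system_number=2437,
    pinyin_code="24hNL",
    result_type="numeric",
    unit="ml",
    male_range=(800, 1800),
    female_range=(600, 1600),
)
```

For instance, a value of 1700 ml lies inside the male range but above the female upper limit, so the same measurement is flagged differently depending on sex.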
Data center server, for the basis of data normalization, solve the problems, such as the key point of data silo, plan as a whole each The original variable data and excavation variable data of individual source database, while unitized mark is provided for data application server It is accurate.
The maintenance of data center server offer normal dictionary, the name variable of standard, organization, code library etc. all exist Data center server is completed to safeguard.
Data application server is researcher, policymaker, and data preparation personnel etc. provide effective application service Or instrument.
The KEY data, for indexing VALUE data, are easy to the management of data, have many dependency numbers around KEY data According to table, such as the grouping sheet for being grouped to variable, can be with more efficient management variable by packet;For storage and criterion numeral According to the table of comparisons of control mapping, branch center storehouse data can easily be indexed by table of comparisons consolidated storage.
What VALUE data represented is the tables of data that a major class stores according to different types of data, and each of data records There is unique mark which individual of BASE data determined to belong to, data caused by which time.
Standard variable maintenance is used to coordinate the data of the sub-centers. Because their backgrounds differ, the data of the sub-centers differ greatly; even among physical examination institutions there is no industry-wide standard. To integrate the data, the first task of the multicenter data platform is therefore to build a set of standard variables, defining standard item names, codes, classifications, and so on according to authoritative international and domestic standards and established conventions.
Standard dictionary maintenance is used to standardize the VALUE values of the raw data, and is also an important basis for text structuring. The present invention relies on dictionary standards published by authoritative domestic and international bodies, such as dictionaries for ICD10, the pharmacopoeia, surgical operations, and ultrasound imaging.
The queue creator is used to build research cohorts for cohort studies. On the one hand, through the queue creator a user selects raw or mined data as study variables, sets inclusion and exclusion criteria, and sets the outcome variable, quickly generating a research cohort, and in turn research results, without any programming knowledge. On the other hand, creating a cohort is itself a new round of data mining: the cohort's outcome conditions, inclusion and exclusion criteria, and so on can all serve as new mined data for other research needs.
The data statistics platform is used to compute statistics and present data in a scientific, intuitive way, so that decision makers and researchers can quickly understand and use the data.
File access is intended for data providers that have mature data export support. Following the provider's own working conventions and the data presentation format it has designed, the source data is exported as files. Typical data files are of three kinds: Excel files, CSV files, and TXT files. The present invention provides an effective access method for each file type.
Database access is used for data transfer with data providers that cooperate closely. The data provider supplies documentation of its database structure and cooperates throughout the data utilization process; with cooperation between the technical staff of both sides, this approach makes the fullest use of the data. The present invention has accumulated the data structures of several mainstream physical examination databases on the market and of several basic public health service software vendors, which minimizes the time needed to connect a database.
Interface access is intended for data providers with some technical development capability. For data security reasons, some data providers cannot conveniently use the two access methods above. The present invention provides a secure data access interface based on a WEBSERVICE scheme, with detailed data access documentation; a data provider with some technical capability can use it to transmit data conveniently, encrypted and securely.
Crawler access is used for the large amount of public data on the internet, such as air quality and weather data. Such data are highly relevant to health and medical analysis but often have no convenient download address; a web crawler solves this problem effectively. In addition, for the legacy websites of many NGOs whose software is past its maintenance period, a web crawler is a very economical and effective solution.
Questionnaire entry and other access methods: the methods above all target large volumes of existing data that are already structured or semi-structured. Some data, however, are unstructured, such as paper-based survey data and photographs of medical record front pages. For these, the present invention provides an efficient manual entry method: with custom questionnaires, the data are extracted manually and persisted to the database, completing data access.
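For the file-based access method, a minimal sketch of reading CSV and TXT exports with the Python standard library might look as follows. The function name, the assumption that TXT exports are tab-separated, and the sample column names are all hypothetical; Excel files are left out because reading them needs a third-party library.

```python
import csv
import io
from pathlib import PurePath

# Assumed delimiters per file type; a real provider's TXT export may differ.
DELIMITERS = {".csv": ",", ".txt": "\t"}


def parse_export(filename: str, text: str) -> list:
    """Parse the text of an exported file into a list of row dictionaries."""
    suffix = PurePath(filename).suffix.lower()
    if suffix not in DELIMITERS:
        raise ValueError("unsupported file type: " + suffix)
    reader = csv.DictReader(io.StringIO(text), delimiter=DELIMITERS[suffix])
    return [dict(row) for row in reader]


csv_rows = parse_export("exam.csv", "id,item,value\n1,24hNL,1500\n")
txt_rows = parse_export("exam.txt", "id\titem\tvalue\n2,24hNL,1320\n".replace(",", "\t"))
```

Dispatching on the file extension keeps each provider's export format isolated from the downstream preprocessing.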
Although the embodiments of the present invention have been described above with reference to the accompanying drawings, they do not limit the scope of protection of the invention. Those of ordinary skill in the art should understand that various modifications or variations that can be made on the basis of the technical solution of the invention without creative effort remain within its scope of protection.

Claims (10)

1. A medical big data multicenter integration platform, characterized by comprising:
a data center server, which establishes and maintains standard variables and standard dictionaries;
data sub-center servers, which collect the raw data of each data source and store it in a corresponding database, each database comprising a variable index table, a personnel basic information table, and an examination result table; preprocessing operations are performed on the data in the variable index table, the personnel basic information table, and the examination result table; each database corresponds to a unique code;
a data application server, for utilizing the data preprocessed by the data sub-center servers.
2. The medical big data multicenter integration platform of claim 1, characterized in that the standard variable comprises: item code, item name, department, index interpretation, data type, data label, and reference range;
the standard dictionary comprises: the International Statistical Classification of Diseases and Related Health Problems (ICD10), the Chinese Pharmacopoeia, or positive signs;
the standard variable maintenance comprises: defining standard item names, codes, and classifications;
the standard dictionary maintenance comprises: standardizing the raw data and structuring text according to the International Statistical Classification of Diseases and Related Health Problems (ICD10) or the Chinese Pharmacopoeia.
3. The medical big data multicenter integration platform of claim 1, characterized in that
the preprocessing refers to:
performing data processing on each data item in the variable index table to obtain new data variables, and building a new data variable index from them; standardizing the examination item names and examination item name codes in the variable index table according to the standard variables of the data center server;
de-duplicating the data in the personnel basic information table, the de-duplication comprising de-duplication by work unit and de-duplication by identity card;
converting the text data in the examination result table into structured data, and standardizing the examination result names and examination result name codes in the examination result table according to the standard dictionaries of the data center server.
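The de-duplication part of the preprocessing can be sketched as below. This is only an illustration under stated assumptions: the function name, the record layout, and the normalization rule (trim whitespace, ignore letter case) are hypothetical, and the identifiers are placeholders rather than real identity card numbers.

```python
def dedupe(records: list, key_field: str) -> list:
    """Keep the first record for each distinct value of key_field."""
    seen = set()
    unique = []
    for record in records:
        # Normalize so that " id-001x" and "ID-001X" count as the same key.
        key = str(record[key_field]).strip().upper()
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique


people = [
    {"id_card": "id-001x", "name": "A"},
    {"id_card": "ID-001X ", "name": "A (duplicate)"},
    {"id_card": "ID-002", "name": "B"},
]
work_units = [{"unit": "Factory 1"}, {"unit": "factory 1"}, {"unit": "School 2"}]

unique_people = dedupe(people, "id_card")
unique_units = dedupe(work_units, "unit")
```

The same helper serves both the identity card and the work unit de-duplication by switching the key field.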
4. The medical big data multicenter integration platform of claim 3, characterized in that
performing data processing on each data item in the variable index table to obtain new data variables, and building a new data variable index from them, involves:
a manual splitting module, for manually splitting medical record data into multiple sentence variables;
a regular-expression matching module, for extracting rule-conforming data, that is, data obtained by matching a regular expression;
an automatic splitting module, which produces new variables according to configured separator characters; the separator characters are user-defined;
a text replacement module, for replacing erroneous expressions in the raw data;
a fragment extraction module, for extracting the required text fragments from examination results;
a unit conversion module, for converting the units of data so that measurements are uniform;
a text structuring module, which processes unstructured text data into structured variable data, splitting and standardizing the text by natural language processing or machine learning;
a data standardization module, which establishes one-to-one mappings between the data variables of the data sub-center server and the standard variables of the data center server through a similarity detection algorithm combined with manual review.
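Four of the modules listed above (regular-expression matching, automatic splitting, text replacement, and unit conversion) are simple enough to sketch with the Python standard library. The function names, sample strings, and conversion factor below are illustrative assumptions, not the patent's implementation.

```python
import re


def regex_extract(text: str, pattern: str):
    """Return the first captured group of pattern in text, or None if no match."""
    match = re.search(pattern, text)
    return match.group(1) if match else None


def auto_split(text: str, separators: str = ";；。") -> list:
    """Split text on any of the user-defined separator characters."""
    parts = re.split("[" + re.escape(separators) + "]", text)
    return [part.strip() for part in parts if part.strip()]


def replace_text(text: str, corrections: dict) -> str:
    """Replace each erroneous expression with its corrected form."""
    for wrong, right in corrections.items():
        text = text.replace(wrong, right)
    return text


def convert_unit(value: float, factor: float) -> float:
    """Convert a measurement by a multiplicative factor, e.g. litres to millilitres."""
    return value * factor


systolic = regex_extract("BP 120/80 mmHg", r"(\d+)/\d+")
findings = auto_split("liver normal; gallbladder polyp;spleen normal", ";")
fixed = replace_text("gall bladder polyp", {"gall bladder": "gallbladder"})
urine_ml = convert_unit(1.5, 1000)
```

Each helper produces a value that could be stored as a new data variable alongside the original.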
5. The medical big data multicenter integration platform of claim 1, characterized in that
KEY data is stored in the variable index table; BASE data is stored in the personnel basic information table; VALUE data is stored in the examination result table; the KEY data represents the data variable index; the VALUE data represents the raw data; the BASE data represents personnel basic information;
the KEY data is used to index the VALUE data and includes a grouping table and a mapping table; the grouping table stores data variable indexes by group; a composite-type group refers to a combination of examination items; the mapping table stores the one-to-one correspondence between data variable indexes and data, and serves as the foreign key index of the VALUE data, indexing all measured values of the same examination item;
the VALUE data is a set of tables that store the raw data according to its data type; each raw data record has a unique index, formed from the hospital's region code + institution code + the record code of the raw data;
the BASE data stores personnel basic information; in principle each data-providing individual has exactly one record, including sex, name, marital status, identity card, telephone, and e-mail; this data is highly unique and has relatively high security requirements; the BASE data comprises a personnel basic information table, a personnel work unit table, and a mapping table between personnel and data.
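The relationship between the KEY, VALUE, and BASE data can be sketched with plain in-memory structures. Everything concrete here is a labeled assumption: the codes "3701", "0012", the key and person identifiers, and the use of simple string concatenation for the unique index are hypothetical illustrations of the "region code + institution code + record code" rule.

```python
def make_record_index(region_code: str, institution_code: str, record_code: str) -> str:
    """Build a unique index as region code + institution code + record code."""
    return region_code + institution_code + record_code


# KEY data: one entry per data variable (hypothetical identifiers and group names).
key_table = {"K001": {"item_name": "24-hour urine volume", "group": "urinalysis"}}

# BASE data: one entry per individual.
base_table = {"P001": {"sex": "male"}, "P002": {"sex": "female"}}

# VALUE data: each record points back to a KEY entry and a BASE individual.
value_table = [
    {"index": make_record_index("3701", "0012", "000001"),
     "key": "K001", "person": "P001", "value": 1500},
    {"index": make_record_index("3701", "0012", "000002"),
     "key": "K001", "person": "P002", "value": 1320},
]


def values_for_item(key_id: str) -> list:
    """Use the KEY identifier as a foreign key to collect all measured values of one item."""
    return [row["value"] for row in value_table if row["key"] == key_id]
```

In a real deployment these would be database tables, with the KEY identifier acting as the foreign key on the VALUE tables.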
6. The medical big data multicenter integration platform of claim 1, characterized in that
the data application server comprises a queue creator and a data statistics platform;
the queue creator is used to select raw or mined data as study variables, set inclusion criteria and exclusion criteria, and set an outcome variable to generate a research cohort; the inclusion criteria are a series of indicators or conditions selected among patients who meet the diagnostic criteria; the exclusion criteria are indicators for excluding patients who could interfere with the accuracy of the results; the outcome variable, also called the end-point variable or simply the outcome, is the expected result event that occurs during follow-up observation, that is, the event the researcher wishes to track; the research cohort is the data matrix formed by selecting the required study subjects from a specified population according to whether they are currently exposed, or were exposed during some past period, to a risk factor under study;
the data statistics platform is used for computing statistics and displaying data.
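The cohort-building logic of the queue creator (inclusion criteria, exclusion criteria, outcome variable) can be sketched as a filter over records. The function name, the record fields, and the example criteria are hypothetical and chosen only to illustrate the idea.

```python
def build_cohort(records: list, include, exclude, outcome) -> list:
    """Keep records that meet the inclusion criteria and not the exclusion criteria,
    and attach the outcome variable to each kept record."""
    cohort = []
    for record in records:
        if include(record) and not exclude(record):
            cohort.append({**record, "outcome": outcome(record)})
    return cohort


people = [
    {"id": "P001", "age": 52, "baseline_diabetes": False, "followup_hypertension": True},
    {"id": "P002", "age": 35, "baseline_diabetes": False, "followup_hypertension": False},
    {"id": "P003", "age": 61, "baseline_diabetes": True, "followup_hypertension": True},
    {"id": "P004", "age": 47, "baseline_diabetes": False, "followup_hypertension": False},
]

cohort = build_cohort(
    people,
    include=lambda r: r["age"] >= 40,                  # inclusion criterion (assumed)
    exclude=lambda r: r["baseline_diabetes"],          # exclusion criterion (assumed)
    outcome=lambda r: int(r["followup_hypertension"]), # outcome variable (assumed)
)
```

The saved criteria themselves are reusable data, which matches the description's point that cohort creation is a further round of data mining.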
7. A medical big data multicenter integration method, characterized by comprising the following steps:
Step (1): the data of each data sub-center server is connected to the data center server, and the quality of each data sub-center server's data is assessed; if the quality assessment passes, proceed to step (2); if it fails, the data center server feeds the failure back to the data sub-center server; the quality assessment covers data completeness, data duplication rate, data deviation, and data volume;
Step (2): the data center server establishes and maintains standard variables and standard dictionaries, and the data is preprocessed according to the standard variables and standard dictionaries;
Step (3): data standardization, comprising variable standardization and data value standardization; through a similarity matching algorithm combined with manual review, one-to-one mappings are established between the data variables of the data sub-center server and the standard variables of the data center server;
Step (4): the data standardized by the data center server is put to use.
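The quality gate in step (1) can be sketched for two of the listed measures, completeness and duplication rate. The thresholds, field names, and function name are assumptions; the patent does not state how the measures are computed or what pass marks apply, and data deviation and data volume are omitted here.

```python
def assess_quality(rows: list, required_fields: list,
                   min_completeness: float = 0.9,
                   max_duplicate_rate: float = 0.05) -> dict:
    """Compute completeness and duplication rate and decide pass or fail.

    The default thresholds are illustrative assumptions only.
    """
    total = len(rows)
    if total == 0 or not required_fields:
        return {"completeness": 0.0, "duplicate_rate": 0.0, "passed": False}
    # Completeness: share of required cells that are actually filled in.
    filled = sum(1 for row in rows for field in required_fields
                 if row.get(field) not in (None, ""))
    completeness = filled / (total * len(required_fields))
    # Duplication rate: share of rows that repeat an earlier row on the required fields.
    distinct = len({tuple(row.get(field) for field in required_fields) for row in rows})
    duplicate_rate = 1 - distinct / total
    passed = completeness >= min_completeness and duplicate_rate <= max_duplicate_rate
    return {"completeness": completeness, "duplicate_rate": duplicate_rate, "passed": passed}


report = assess_quality(
    [
        {"id": "1", "value": "5"},
        {"id": "1", "value": "5"},   # exact duplicate
        {"id": "2", "value": ""},    # missing value
        {"id": "3", "value": "7"},
    ],
    required_fields=["id", "value"],
)
```

A failing report is what the data center server would feed back to the sub-center server before any standardization is attempted.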
8. The medical big data multicenter integration method of claim 7, characterized in that step (2) comprises:
Step (201): each data item is inspected according to the data variable index; the data sub-center server presents the raw data intuitively using graphical tools such as frequency tables or bar charts, and abnormal data is removed;
Step (202): the data is edited:
a manual splitting step, for manually splitting medical record data into multiple sentence variables;
a regular-expression matching step, for extracting rule-conforming data, that is, data obtained by matching a regular expression;
an automatic splitting step, which produces new variables according to configured separator characters; the separator characters are user-defined;
a text replacement step, for replacing erroneous expressions in the raw data;
a fragment extraction step, for extracting the required text fragments from examination results;
a unit conversion step, for converting the units of data so that measurements are uniform;
a text structuring step, which processes unstructured text data into structured variable data, splitting and standardizing the text by natural language processing or machine learning;
Step (203): the data obtained by the editing in step (202) is stored as new variable data in the variable index table of the data sub-center server.
9. The medical big data multicenter integration method of claim 8, characterized in that, in step (202), processing unstructured text data into structured variable data comprises:
Step (2021): select the data variables that require text structuring;
Step (2022): de-duplicate the data variables, and store the de-duplicated data in a text structuring data table;
Step (2023): using a natural language processing segmentation algorithm, with the standard dictionary library as the basis for word segmentation, first segment the raw text data, then compare the segmentation results against the standard dictionary library by a similarity algorithm, so that word segmentation is completed automatically;
Step (2024): manually supplement the data that could not be fully recognized, to ensure data completeness;
Step (2025): export the structured data.
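Steps (2021) to (2025) can be sketched end to end as follows. This is a simplified stand-in, not the patent's algorithm: segmentation is done by splitting on punctuation rather than by a real word segmenter, and similarity is computed with `difflib.get_close_matches` from the Python standard library. The dictionary terms, the cutoff value, and the function name are assumptions.

```python
import difflib
import re


def structure_text(texts: list, dictionary: list, cutoff: float = 0.8):
    """De-duplicate texts, segment them, and map each segment to a dictionary term.

    Returns (structured, unresolved); unresolved segments are left for manual review.
    """
    structured = {}
    unresolved = []
    for text in dict.fromkeys(texts):            # step (2022): de-duplicate, keep order
        terms = []
        for segment in re.split(r"[,;，；。]", text):   # step (2023): naive segmentation
            segment = segment.strip()
            if not segment:
                continue
            # Compare the segment with the standard dictionary by string similarity.
            match = difflib.get_close_matches(segment, dictionary, n=1, cutoff=cutoff)
            if match:
                terms.append(match[0])
            else:
                unresolved.append(segment)        # step (2024): needs manual supplement
        structured[text] = terms
    return structured, unresolved                 # step (2025): export


dictionary = ["fatty liver", "gallbladder polyp", "renal cyst"]
texts = ["fatty liver; gallbladder polyps", "fatty liver; gallbladder polyps", "unclear shadow"]
structured, unresolved = structure_text(texts, dictionary)
```

Routing unmatched segments to a review list mirrors the manual supplementation step, so automatic matching never silently drops text.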
10. The medical big data multicenter integration method of claim 7, characterized in that the variable standardization of step (3) comprises:
Step (301): select a data variable from the variable index table of the data sub-center server, then select a standard variable from the standard dictionary defined by the data center server; based on medical knowledge, and on the two variables' names, departments, and actual measured data, the user determines the one-to-one correspondence between the data variable and the standard variable, thereby completing the mapping;
Step (302): review the mappings made in step (301) to complete the standardization operation and ensure the accuracy of the variable mapping.
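The combination of similarity matching and manual review used for variable standardization can be sketched as a two-stage process: the program proposes a candidate standard variable, and only reviewer-approved pairs become mappings. The similarity measure here is `difflib.SequenceMatcher` from the Python standard library, chosen for illustration; the patent does not name a specific algorithm, and the function names and sample variable names are hypothetical.

```python
import difflib


def suggest_mapping(source_name: str, standard_names: list):
    """Return the most similar standard variable name and its similarity score."""
    scored = [
        (difflib.SequenceMatcher(None, source_name.lower(), name.lower()).ratio(), name)
        for name in standard_names
    ]
    score, best = max(scored)
    return best, score


def confirm_mappings(suggestions: dict, approved: set) -> dict:
    """Keep only the suggested pairs that a reviewer has approved."""
    return {source: standard
            for source, (standard, _score) in suggestions.items()
            if (source, standard) in approved}


standard_names = ["24-hour urine volume", "fasting blood glucose"]
suggestions = {name: suggest_mapping(name, standard_names)
               for name in ["24h urine volume", "blood glucose (fasting)"]}

# A reviewer approves only the first suggestion in this example.
mappings = confirm_mappings(suggestions, approved={("24h urine volume", "24-hour urine volume")})
```

Keeping the review step separate means a high similarity score alone never creates a mapping.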
Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710946758.8A CN107833595A (en) 2017-10-12 2017-10-12 Medical big data multicenter integration platform and method

Publications (1)

Publication Number Publication Date
CN107833595A true CN107833595A (en) 2018-03-23

Family

ID=61647787

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710946758.8A Pending CN107833595A (en) 2017-10-12 2017-10-12 Medical big data multicenter integration platform and method

Country Status (1)

Country Link
CN (1) CN107833595A (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102509012A (en) * 2011-11-04 2012-06-20 厦门市智业软件工程有限公司 Method for mapping contents of electronic medical record into electronic medical record standard database
WO2017116452A1 (en) * 2015-12-31 2017-07-06 Sole Guerra Alberto System for acquisition, processing and visualization of clinical data of patients
CN106446526A (en) * 2016-08-31 2017-02-22 北京千安哲信息技术有限公司 Electronic medical record entity relation extraction method and apparatus
CN106874693A (en) * 2017-03-15 2017-06-20 国信优易数据有限公司 A kind of medical big data analysis process system and method
CN109189784A (en) * 2018-08-09 2019-01-11 纳里健康科技有限公司 A kind of method of distributed electronic medical record data integration

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
卞伟玮等: "基于网络爬虫技术的健康医疗大数据采集整理系统", 《山东大学学报(医学版)》 *
姚远: "军队医院慢性病诊疗信息系统的设计与应用研究", 《中国博士学位论文全文数据库 医药卫生科技辑》 *
沈顺: "基于大数据处理的用户健康信息服务平台优化设计及应用", 《中国优秀硕士学位论文全文数据库 信息科技辑》 *
薛付忠: "健康医疗大数据驱动的健康管理学理论方法体系", 《山东大学学报(医学版)》 *

Cited By (34)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108595571A (en) * 2018-04-16 2018-09-28 深圳零壹云医科技有限公司 A kind of Data Integration management method, device, system and user terminal
CN108595614A (en) * 2018-04-20 2018-09-28 成都智信电子技术有限公司 Tables of data mapping method applied to HIS systems
CN109189784A (en) * 2018-08-09 2019-01-11 纳里健康科技有限公司 A kind of method of distributed electronic medical record data integration
CN109408635A (en) * 2018-09-28 2019-03-01 湖南智腾安控科技有限公司 A kind of case history document standard processing system and method
CN109473149B (en) * 2018-11-09 2021-01-15 天津开心生活科技有限公司 Data quality evaluation method and device, electronic equipment and computer readable medium
CN109473149A (en) * 2018-11-09 2019-03-15 天津开心生活科技有限公司 Data Quality Assessment Methodology, device, electronic equipment and computer-readable medium
CN109522302A (en) * 2018-11-09 2019-03-26 南京医渡云医学技术有限公司 Medical data processing method, device, electronic equipment and computer-readable medium
CN109659034A (en) * 2018-11-30 2019-04-19 平安医疗健康管理股份有限公司 Data Quality Assessment Methodology, device, equipment and the storage medium of first page of illness case
CN109710670B (en) * 2018-12-11 2020-04-28 萱闱(河南)生命科学研究院有限公司 Method for converting medical record text from natural language into structured metadata
CN109710670A (en) * 2018-12-11 2019-05-03 河南通域医疗科技有限公司 A method of case history text is converted into structural metadata from natural language
WO2020119386A1 (en) * 2018-12-13 2020-06-18 平安医疗健康管理股份有限公司 Big data-based abnormal data identification method and device, and storage medium and apparatus
CN110347764A (en) * 2019-06-12 2019-10-18 重庆工商大学融智学院 A kind of ecological space data integration method
CN110362829A (en) * 2019-07-16 2019-10-22 北京百度网讯科技有限公司 Method for evaluating quality, device and the equipment of structured patient record data
CN110362829B (en) * 2019-07-16 2023-01-03 北京百度网讯科技有限公司 Quality evaluation method, device and equipment for structured medical record data
CN110851595A (en) * 2019-10-08 2020-02-28 云知声智能科技股份有限公司 Identification method and device for disease term core vocabulary
CN111105867A (en) * 2019-12-24 2020-05-05 湖南长城医疗科技有限公司 Novel architecture and implementation method of public health medical information area platform
CN110990591A (en) * 2019-12-26 2020-04-10 北京亚信数据有限公司 Method and system for auditing transcoding quality of medical data
CN111309727A (en) * 2020-01-22 2020-06-19 北京明略软件系统有限公司 Information table processing method and device and storage medium
CN111309727B (en) * 2020-01-22 2024-03-22 北京明略软件系统有限公司 Information table processing method, device and storage medium
CN111339084A (en) * 2020-02-15 2020-06-26 河北唐宋大数据产业股份有限公司 Data processing method and system
CN111899885A (en) * 2020-06-28 2020-11-06 万达信息股份有限公司 Distributed personnel event index implementation method and system
CN112214524A (en) * 2020-08-27 2021-01-12 优学汇信息科技(广东)有限公司 Data evaluation system and evaluation method based on deep data mining
CN112070584A (en) * 2020-09-09 2020-12-11 畅销家(深圳)科技有限公司 Order management method, device, equipment and storage medium
CN112768059A (en) * 2021-01-25 2021-05-07 武汉大学 Method for standardizing grade data in medical data
CN112768059B (en) * 2021-01-25 2022-09-09 武汉大学 Method for standardizing grade data in medical data
CN112650865A (en) * 2021-01-27 2021-04-13 南威软件股份有限公司 Method and system for solving multi-region license data conflict based on flexible rule
CN112650865B (en) * 2021-01-27 2021-11-09 南威软件股份有限公司 Method and system for solving multi-region license data conflict based on flexible rule
CN114936243A (en) * 2021-06-18 2022-08-23 上海重明鸟软件有限公司 Data standardization system and use method
CN113901060A (en) * 2021-11-18 2022-01-07 贵州电网有限责任公司 Method for establishing employee health database
CN115019914A (en) * 2022-07-19 2022-09-06 深圳市指南针医疗科技有限公司 Data quality evaluation method, device, equipment and storage medium
CN116682519B (en) * 2023-08-03 2024-03-19 广东杰纳医药科技有限公司 Clinical experiment data unit analysis method
CN116682519A (en) * 2023-08-03 2023-09-01 广东杰纳医药科技有限公司 Clinical experiment data unit analysis method
CN117219214A (en) * 2023-11-07 2023-12-12 江苏法迈生医学科技有限公司 Data management method of clinical scientific research integrated information platform
CN117219214B (en) * 2023-11-07 2024-02-20 江苏法迈生医学科技有限公司 Data management method of clinical scientific research integrated information platform

Similar Documents

Publication Publication Date Title
CN107833595A (en) Medical big data multicenter integration platform and method
CN107145511A (en) Structured medical data library generating method and system based on medical science text message
CN109785927A (en) Clinical document structuring processing method based on internet integration medical platform
Shen et al. Discovering the potential opportunities of scientific advancement and technological innovation: A case study of smart health monitoring technology
CN107330238A (en) Medical information collection, processing, storage and display methods and device
WO2022116430A1 (en) Big data mining-based model deployment method, apparatus and device, and storage medium
CN101149751B (en) Generalized relating rule digging method for analyzing traditional Chinese medicine recipe drug matching rule
CN110189802B (en) Bidirectional mapping queue research information system based on index storage model
JP2014228907A (en) Information structuring system
CN101986333A (en) Auxiliary decision supporting system of hospital
Roque et al. A comparison of several key information visualization systems for secondary use of electronic health record content
CN108962394B (en) Medical data decision support method and system
CN111243748A (en) Needle pushing health data standardization system
CN112349369A (en) Medical image big data intelligent analysis method, system and storage medium
CN106777996A (en) A kind of physical examination data search system based on Solr
CN113688255A (en) Knowledge graph construction method based on Chinese electronic medical record
CN113360530A (en) Event screener system
CN107330111A (en) The search method and device of domain body based on common version body
CN113362960A (en) Urban resident public health influence factor visual analysis system and method combining multi-source data
CN110321556A (en) A kind of method and its system of doctor's diagnosis and treatment medical insurance control expense intelligent recommendation scheme
Wang et al. A review of the application of natural language processing in clinical medicine
Ma et al. Design of medical examination data mining system based on decision tree model
Tasdelen et al. Artificial Intelligence Research on COVID-19 Pandemic: A Bibliometric Analysis
Bettouche et al. Topical Clustering of Unlabeled Transformer-Encoded Researcher Activity
Liu et al. Knowledge fragment cleaning in a genealogy knowledge graph

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20180323)