CN113380414A - Data acquisition method and system based on big data - Google Patents

Publication number
CN113380414A
Authority
CN
China
Prior art keywords
medical data
data
acquisition
medical
big
Prior art date
Legal status
Granted
Application number
CN202110552784.9A
Other languages
Chinese (zh)
Other versions
CN113380414B (en)
Inventor
王兴维
邰从越
陈攀
张迁
Current Assignee
Senyint International Digital Medical System Dalian Co ltd
Original Assignee
Senyint International Digital Medical System Dalian Co ltd
Priority date
Filing date
Publication date
Application filed by Senyint International Digital Medical System Dalian Co ltd
Priority to CN202110552784.9A
Publication of CN113380414A
Application granted
Publication of CN113380414B
Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H50/00 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H50/70 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for mining of medical data, e.g. analysing previous cases of other patients
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21 Design, administration or maintenance of databases
    • G06F16/215 Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22 Indexing; Data structures therefor; Storage structures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28 Databases characterised by their database models, e.g. relational or object models
    • G06F16/284 Relational databases
    • G06F16/285 Clustering or classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Software Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Molecular Biology (AREA)
  • Evolutionary Computation (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Medical Informatics (AREA)
  • Public Health (AREA)
  • Pathology (AREA)
  • Epidemiology (AREA)
  • Primary Health Care (AREA)
  • Quality & Reliability (AREA)
  • Medical Treatment And Welfare Office Work (AREA)

Abstract

The invention discloses a big-data-based data acquisition method and system, relating to the technical field of medical data acquisition. The method comprises the following steps: acquiring various medical data through an acquisition scheduling center, wherein the acquisition scheduling center comprises a plurality of different collectors, each of which acquires unstructured medical data from its corresponding acquisition channel; aggregating the unstructured medical data; processing the medical data; and storing the processed medical data locally and/or in the cloud. The invention collects, integrates and stores various related medical data, provides two processing modes for dirty data, can accurately filter, identify, collect and display dirty data during processing, offers strong reliability and high security, and can also handle duplicate data in the medical data.

Description

Data acquisition method and system based on big data
Technical Field
The invention relates to the technical field of medical data acquisition, in particular to a data acquisition method and system based on big data.
Background
At present, medical data in China come mainly from the disease and physical-sign data recorded by information systems and equipment such as the hospital information system (HIS), the electronic medical record system (EMR), the picture archiving and communication system (PACS), the laboratory information system (LIS), the pathology system (PS), and medical instruments. They also include data generated by hospital material management and hospital operation systems. According to surveys, more than 70% of hospitals have implemented medical informatization, but fewer than 3% have achieved data interoperability; medical big data remain relatively scattered, and information islands need to be broken down. Two doctors may sometimes interpret the same medical record differently, so if information cannot be exchanged between hospitals, much of it is lost to the patient. Information islands also cause great inconvenience to the doctors and hospital managers who need to use the data and information.
Information islands are a legacy problem of health informatization in China. Because relevant standards were lacking, each hospital built its medical information system without standard guidance or top-level design, and the resulting fragmentation along departmental and regional lines produced information islands. Establishing a medical data acquisition center is therefore an important means of improving medical technology, breaking down information islands, and achieving interconnection between hospitals.
Because medical data are diverse in type, large in volume and updated rapidly, conventional medical data acquisition systems cannot handle large volumes of varied data well, cannot guarantee the reliability of the acquired data, and cannot handle duplicate data.
Disclosure of Invention
To address the problems in the prior art, the invention provides a big-data-based data acquisition method and system that can process large volumes of varied data, offer strong reliability and high security, and handle duplicate data in the acquired data.
According to the embodiment of the first aspect of the application, a big data-based data acquisition method comprises the following steps:
acquiring various medical data through an acquisition scheduling center, wherein the acquisition scheduling center comprises a plurality of different collectors, and the different collectors acquire unstructured medical data in corresponding acquisition channels;
aggregating the unstructured medical data;
processing the medical data;
and performing local storage and/or cloud storage on the processed medical data.
According to some embodiments of the present application, before the medical data are acquired through the plurality of acquisition modes, the method further comprises:
performing basic configuration of each service through its corresponding yml file, and transmitting medical data between the services by means of queues.
According to some embodiments of the application, processing the medical data comprises:
verifying the quality of the medical data;
labeling the verified medical data;
creating an index for the labeled medical data.
According to some embodiments of the present application, verifying the quality of the medical data comprises:
verifying the accuracy of the medical data;
performing deduplication processing on the medical data through a neural network;
and encrypting the medical data after the duplication is removed.
According to some embodiments of the present application, tagging verified medical data comprises:
inputting the verified medical data into a BERT neural network to obtain text vectors V;
randomly selecting a plurality of text vectors V as cluster center points a;
computing the distance between each of the other medical data and each cluster center point a, assigning each of the other medical data to the class of the nearest center point, and, once all data are assigned, computing a cluster center point b for each class of text vectors V;
computing the distance between each of the other medical data and each cluster center point b, assigning each of the other medical data to the class of the nearest center point, computing a cluster center point c for each class of text vectors V once all data are assigned, and repeating these steps to obtain a plurality of classes of text;
labeling each class of text with a central word;
classifying newly acquired medical data according to its similarity to the central words.
According to some embodiments of the present application, tagging verified medical data comprises:
classifying existing medical data into a plurality of types;
training a BERT + BiLSTM + CNN + attention + CRF neural network on the existing medical data until the accuracy is greater than a threshold value;
and classifying the newly acquired medical data with the trained BERT + BiLSTM + CNN + attention + CRF neural network so that the newly acquired medical data are assigned to the corresponding type.
According to some embodiments of the present application, locally storing the processed medical data comprises:
acquiring the proxy service and the port where the attribute table is located;
scanning, by the proxy service, the start row key configured for each attribute in the attribute table, determining which attribute range the current medical data fall in, and then storing the medical data in the database;
storing, in the database, the correspondence between the attributes and the proxy services.
According to some embodiments of the application, managing the database comprises:
reading the medical data and translating into an internal unified data format;
performing add, delete, modify and query operations on the acquisition sources of the medical data;
and after the query result is obtained from the database, carrying out data format conversion on the query result.
According to the second aspect of the application, the big data based data acquisition system comprises:
the acquisition module acquires various medical data through an acquisition scheduling center, wherein the acquisition scheduling center comprises a plurality of different collectors, and the different collectors acquire unstructured medical data in corresponding acquisition channels;
a summarization module that summarizes the unstructured medical data;
the processing module is used for processing the medical data;
and the storage module is used for carrying out local storage and/or cloud storage on the processed medical data.
According to some embodiments of the application, the processing module comprises:
the verification module is used for verifying the quality of the medical data;
the labeling module is used for labeling the verified medical data;
and the index creating module is used for creating an index for the labeled medical data.
The above technical scheme achieves the following technical effects: the invention collects, integrates and stores various related medical data, provides two processing modes for dirty data, can accurately filter, identify, collect and display dirty data during processing, offers strong reliability and high security, and can also handle duplicate data in the medical data.
Drawings
FIG. 1 is a block diagram of a hardware configuration of a data acquisition computer disclosed in an embodiment of the present application;
FIG. 2 is a flow chart of a data collection method disclosed in an embodiment of the present application;
FIG. 3 is a flow chart illustrating a quality check and processing of the medical data as disclosed in an embodiment of the present application;
FIG. 4 is a flow chart illustrating a process for verifying the quality of medical data as disclosed in an embodiment of the present application;
FIG. 5 is a flow chart illustrating local storage of processed medical data as disclosed in an embodiment of the present application.
Detailed Description
Exemplary embodiments will now be described more fully with reference to the accompanying drawings. The exemplary embodiments, however, may be embodied in many different forms and should not be construed as limited to the examples set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of example embodiments to those skilled in the art.
The current medical quality management and control index framework for tertiary general hospitals comprises 7 major index categories, 44 quality evaluation indexes, 730 single indexes, 2610 composite indexes and 400 monitoring data items. The index categories are the inpatient mortality index, the readmission index, the hospital infection index, the surgical complication index, the patient safety index, the rational medication index for medical institutions, and the hospital operation management index. The management system is huge and difficult to monitor. Moreover, the database server of each business system runs its own DBMS, and the presence of manually entered data, the large volume of such data, and the presence of unstructured data are technical barriers to medical interoperability. The embodiments of the present application therefore provide a big-data-based data acquisition method and system. The data acquisition method may be executed on a server, a computer, or a similar computing device. Taking execution on a computer as an example, FIG. 1 is a block diagram of the hardware structure of the data acquisition computer disclosed in an embodiment of the present application. As shown in FIG. 1, computer 10 may include one or more processors 102 (only one is shown in FIG. 1; processor 102 may include, but is not limited to, a processing device such as a microprocessor (MCU) or a programmable logic device (FPGA)) and a memory 104 for storing data, and may optionally also include a transmission device 106 for communication functions and an input/output device 108. Those of ordinary skill in the art will appreciate that the configuration shown in FIG. 1 is illustrative only and does not limit the configuration of the computer described above. For example, computer 10 may include more or fewer components than shown in FIG. 1, or have a different configuration from that shown in FIG. 1.
The memory 104 may be used to store a computer program, for example, a software program and a module of an application software, such as a computer program corresponding to the data acquisition method in the embodiment of the present invention, and the processor 102 executes the computer program stored in the memory 104 to execute various functional applications and data processing, i.e., to implement the method described above. The memory 104 may include high speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory.
The transmission device 106 is used to receive or transmit data via a network. Specific examples of such a network include a wireless network provided by the communications provider of computer 10. In one example, the transmission device 106 includes a network interface controller (NIC) that can connect to other network devices through a base station so as to communicate with the Internet. In one example, the transmission device 106 may be a radio frequency (RF) module, which communicates with the Internet wirelessly.
As shown in fig. 2, in some embodiments, a big data based data collection method includes:
S1, acquiring various medical data through an acquisition scheduling center, wherein the acquisition scheduling center comprises a plurality of different collectors, and the different collectors acquire unstructured medical data in corresponding acquisition channels;
Specifically, different acquisition modes can be used to acquire medical data in the corresponding acquisition channels. The acquisition scheduling center can manage collection tasks for different kinds of data (logs, data in databases, and the like). The acquisition modes may be as follows:
(1) Various open data sets are available on the Internet; data sets in the medical field can be obtained by finding the corresponding websites and their download links. These data sets can help a medical system complete its internal information. A collector is configured to crawl and organize the medical data by methods such as web crawling and rule matching.
(2) For the logging service of a medical system, a log collection scheme may be employed. Common log collection tools include Logstash, Filebeat, Flume, Fluentd, Logagent, rsyslog and syslog-ng. A collector is configured to read and collect the medical data by reading the log text.
(3) Medical data are obtained through social surveys, and these data can complete the data content of the medical system. A collector is configured to obtain the medical data from the survey results.
(4) The medical system has daily operation and business department modules, whose data are recorded in certain files or systems, such as a common medical system database. A large amount of medical data is stored in databases, and a collector is configured to acquire data from the different types of databases in different ways.
(5) A medical sensor is a detection device that senses the measured information and converts it, according to certain rules, into electrical signals or other required forms of output. Medical data are acquired and uploaded through medical sensors, and a collector is configured to collect them.
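The relationship between the acquisition scheduling center and its collectors can be illustrated with a minimal sketch. The class name `AcquisitionSchedulingCenter`, the `register` and `run_all` methods, and the two stand-in collector functions are hypothetical names chosen for illustration; the source does not specify an interface.

```python
from typing import Callable, Dict, List

# Hypothetical type: a collector is a function that returns raw records
# from one acquisition channel (log files, sensors, databases, ...).
Collector = Callable[[], List[str]]


class AcquisitionSchedulingCenter:
    """Keeps one collector per acquisition channel and runs them on request."""

    def __init__(self) -> None:
        self._collectors: Dict[str, Collector] = {}

    def register(self, channel: str, collector: Collector) -> None:
        self._collectors[channel] = collector

    def run_all(self) -> List[dict]:
        records = []
        for channel, collector in self._collectors.items():
            for payload in collector():
                # Keep the source channel with each record so that it can
                # later be stored as an attribute of the data.
                records.append({"channel": channel, "payload": payload})
        return records


def collect_logs() -> List[str]:
    # Stand-in for a collector that reads log text.
    return ["2021-05-20 admission recorded"]


def collect_sensor() -> List[str]:
    # Stand-in for a collector that reads medical sensor output.
    return ["heart_rate=72"]


center = AcquisitionSchedulingCenter()
center.register("log", collect_logs)
center.register("sensor", collect_sensor)
records = center.run_all()
```

Adding a new acquisition channel in this sketch only requires registering one more collector function; the rest of the pipeline is unchanged.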
S2, summarizing the unstructured medical data;
Specifically, the medical data are aggregated through a unified microservice interface. After acquiring data, each collector sends the data to the microservice through a RESTful interface, and the data are then placed in a Redis distributed cache for temporary storage.
The unstructured medical data acquired through the different channels are aggregated and then handed over uniformly to the data quality check for processing. By using different collectors, the present application can handle multiple kinds of data, which solves the problem that conventional systems cannot handle varied data well.
S3, processing the medical data;
Specifically, after the medical data are acquired, their quality is checked, which includes checking the accuracy of the various related medical information and deduplicating and encrypting the medical data. After the check, the deduplicated data are labeled and their information sources are monitored.
And S4, performing local storage and/or cloud storage on the processed medical data.
Specifically, storing medical data only locally carries risk: for example, damage to a local device may cause partial or complete data loss. The medical data can therefore be backed up in the cloud. The acquired medical data are stored locally and, at the same time, the same data are sent to the cloud for storage. Alternatively, the updated portion of the locally stored medical data is compressed at intervals (for example, every half month) and then backed up to the cloud, thereby ensuring data security.
The medical service supervision information data acquisition platform built with this method has a highly secure and easily extensible architecture, supports various mainstream development languages, and provides rich interfaces. It also supports the storage and use of both structured and unstructured data.
In some embodiments, before the medical data are acquired through the multiple acquisition modes, the method further includes:
performing basic configuration of each service through its corresponding yml file, and transmitting medical data between the services by means of queues.
Specifically, all configuration files used in the data collection process are stored in Nacos. Many of them are yml files, each being the configuration file of one service, for example the collector service, data quality check service, user center service, gateway service and data management service. First, the port, database address, startup mode and other basic settings of each service are configured in its yml file. Each service then automatically fetches its configuration from Nacos at startup.
It should be noted that medical data are transmitted between the services through middleware, that is, through queues, which addresses the concurrency problem and reduces the load on the server.
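Queue-based hand-off between two services can be sketched as follows. This is only an in-process illustration: Python's standard `queue.Queue` stands in for the message middleware, which the source does not name, and the function names `collector_service`, `quality_check_service` and `run_pipeline` are hypothetical.

```python
import queue
import threading

SENTINEL = None  # marks the end of the stream in this sketch


def collector_service(out_q, records):
    # Producer: pushes collected records onto the queue.
    for record in records:
        out_q.put(record)
    out_q.put(SENTINEL)


def quality_check_service(in_q, results):
    # Consumer: takes records off the queue at its own pace.
    while True:
        record = in_q.get()
        if record is SENTINEL:
            break
        results.append(record.strip())


def run_pipeline(records):
    # A bounded queue makes a fast producer wait instead of flooding
    # the consumer, which is how a queue relieves pressure downstream.
    q = queue.Queue(maxsize=100)
    results = []
    producer = threading.Thread(target=collector_service, args=(q, records))
    consumer = threading.Thread(target=quality_check_service, args=(q, results))
    producer.start()
    consumer.start()
    producer.join()
    consumer.join()
    return results


processed = run_pipeline([" blood pressure 120/80 ", " heart rate 72 "])
```

Because the producer and consumer only share the queue, either side can be scaled or restarted without the other needing to know.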
As shown in fig. 3, in some embodiments, the medical data is quality checked and processed, including:
S31, checking the quality of the medical data;
Specifically, the quality check comprises verifying the accuracy of the medical data by comparison, deduplicating the medical data through a neural network, and encrypting the deduplicated data with the TripleDES algorithm.
S32, labeling the verified medical data;
Specifically, this can be done in two ways: the first is to cluster the acquired medical data; the second is to classify them with a neural network. After classification, the data source is also stored as an attribute in the overall information of the data.
S33, creating an index for the labeled medical data.
Specifically, each piece of medical data is summarized to obtain a title for it. The title can be obtained in two ways: one is to take the first 10 to 20 characters of the data directly as its title; the other is to generate a summary of the data with the encoder and decoder of a Seq2Seq architecture. Important attributes such as the summary and the generation time of the data are indexed in Elasticsearch so that users can query the data quickly.
As shown in fig. 4, in some embodiments, verifying the quality of the medical data includes:
S311, checking the accuracy of the medical data;
Specifically, the accuracy of the medical data can be verified in several ways. The first is to compare the MD5 code of the medical data delivered by the middleware with the MD5 code carried by the data; if they are the same, the transmitted data are intact. The second is to compare the similarity of the data across its multiple data sources; a large difference indicates that the data are problematic. The third is to determine whether a large error exists between the data before and after transmission: if the mean value of the same index differs greatly in a way that is not logical, a problem occurred during transmission and the obtained data are inaccurate.
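The first check, comparing MD5 codes, can be sketched with Python's standard `hashlib`. The function names `md5_hex` and `is_intact` and the sample message are illustrative assumptions; the source does not describe how the digest is attached to the data.

```python
import hashlib


def md5_hex(payload: bytes) -> str:
    # MD5 digest of the payload as a lowercase hexadecimal string.
    return hashlib.md5(payload).hexdigest()


def is_intact(payload: bytes, carried_md5: str) -> bool:
    # The data pass the check when the digest computed on receipt
    # equals the digest that travelled with the data.
    return md5_hex(payload) == carried_md5.lower()


message = "patient 001: temperature 36.8".encode("utf-8")
digest = md5_hex(message)                      # computed by the sender
ok = is_intact(message, digest)                # unchanged payload
corrupted = is_intact(message + b"!", digest)  # payload altered in transit
```

A check of this kind detects accidental corruption in transit; it is not a defence against deliberate tampering.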
S312, performing duplicate removal processing on the medical data through a neural network;
Specifically, a neural network is used to deduplicate the medical data text: the complete data obtained after the check are compared with the corresponding module information in the existing system. For example, newly acquired information on a person's current illness is compared with the information on that person's illness in the existing system. One way to make the judgment is to obtain the corresponding sentence vectors with BERT and then compute their similarity. A similarity greater than 90% means the texts are identical, a similarity greater than 80% means they are substantially the same, and a similarity less than 50% means they are different. Identical medical data are filtered out and different medical data are stored.
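The similarity judgment can be sketched as below, assuming the sentence vectors have already been produced by an encoder such as BERT (short hand-written vectors stand in for them here). Cosine similarity is an assumed choice of measure, the "undetermined" band between 50% and 80% is not specified in the source, and filtering "substantially the same" records along with identical ones is likewise an assumption.

```python
import math


def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def judge(similarity):
    if similarity > 0.9:
        return "identical"
    if similarity > 0.8:
        return "substantially the same"
    if similarity < 0.5:
        return "different"
    return "undetermined"  # assumption: the source leaves this band open


def deduplicate(new_vectors, existing_vectors):
    kept = []
    for vec in new_vectors:
        # Compare against the closest record already in the system.
        best = max(
            (cosine_similarity(vec, e) for e in existing_vectors),
            default=0.0,
        )
        if judge(best) in ("identical", "substantially the same"):
            continue  # filtered out as a duplicate
        kept.append(vec)
    return kept


existing = [[1.0, 0.0, 0.0]]
incoming = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
kept = deduplicate(incoming, existing)
```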
And S313, encrypting the medical data after the duplication is removed.
Specifically, the TripleDES algorithm applies the DES cipher three times to each block. DES turns a 64-bit plaintext input block into a 64-bit ciphertext output block; each DES key is 64 bits long, of which 8 bits are parity bits and the remaining 56 bits are the effective key length.
In some embodiments, tagging the verified medical data comprises:
inputting the verified medical data into a BERT neural network to obtain text vectors V;
randomly selecting a plurality of (for example, 10) text vectors V as cluster center points a;
computing the distance between each of the other medical data and each cluster center point a (the semantic distance between texts is judged by similarity calculation), assigning each of the other medical data to the class of the nearest center point, and, once all data are assigned, computing the cluster center point b of each class (for example, 10 classes) of text vectors V;
computing the distance between each of the other medical data and each cluster center point b (the semantic distance between texts is judged by similarity calculation), assigning each of the other medical data to the class of the nearest center point, computing the cluster center point c of each class (for example, 10 classes) of text vectors V once all data are assigned, repeating this step N times, and obtaining and storing a plurality of (for example, 10) classes of text;
labeling each class of text with a central word;
classifying newly acquired medical data according to its similarity to the central words.
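The iterative procedure above follows the pattern of k-means clustering. A minimal sketch in plain Python is given below, assuming the text vectors V are already available; Euclidean distance is an assumed stand-in for the similarity calculation, and the function names and the two-dimensional sample vectors are illustrative only.

```python
import math
import random


def distance(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def mean(vectors):
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cluster(vectors, k, iterations, seed=0):
    rng = random.Random(seed)
    # Randomly chosen vectors serve as the initial center points "a".
    centers = rng.sample(vectors, k)
    assignment = []
    for _ in range(iterations):
        # Assign every vector to its nearest center.
        assignment = [
            min(range(k), key=lambda i: distance(v, centers[i]))
            for v in vectors
        ]
        # Recompute each center ("b", then "c", ...) as the class mean.
        for i in range(k):
            members = [v for v, a in zip(vectors, assignment) if a == i]
            if members:
                centers[i] = mean(members)
    return centers, assignment


vectors = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
centers, assignment = cluster(vectors, k=2, iterations=5)
```

With the sample vectors, the two points near the origin end up in one class and the two points near (10, 10) in the other, whichever vectors are drawn as initial centers.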
In some embodiments, tagging the verified medical data comprises:
classifying existing medical data into a plurality of types;
Specifically, the plurality of types may be 10 types; the number is determined according to the amount of existing medical data.
Training a BERT + BiLSTM + CNN + attention + CRF neural network on the existing medical data until the accuracy is greater than a threshold value;
Specifically, the GPU used for training has more than 10 GB of video memory, and training continues until the accuracy exceeds 90%.
Classifying the newly acquired medical data with the trained BERT + BiLSTM + CNN + attention + CRF neural network so that the newly acquired medical data are assigned to the corresponding type.
It should be noted that this embodiment provides two or more processing modes for each kind of dirty data. When one processing mode fails, the failure signal 0 is automatically recognized and returned, and another processing mode is started immediately to process the data, which ensures the stability of data processing. For example, when rule-based classification of the medical data fails, the medical data are immediately classified by the neural network model to obtain an accurate classification result.
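The fallback behaviour can be sketched as a chain of processing modes tried in order. The keyword rules and the stand-in `classify_by_model` function are hypothetical; in the described system the second mode would be the trained neural network.

```python
FAILURE = 0  # the failure signal described in the text


def classify_by_rule(text):
    # Hypothetical keyword rules, for illustration only.
    rules = {"x-ray": "imaging", "blood": "laboratory"}
    lowered = text.lower()
    for keyword, label in rules.items():
        if keyword in lowered:
            return label
    return FAILURE  # no rule matched


def classify_by_model(text):
    # Stand-in for the trained neural network classifier.
    return "other"


def classify(text, methods=(classify_by_rule, classify_by_model)):
    # Try each processing mode in turn; move on as soon as one
    # returns the failure signal.
    for method in methods:
        result = method(text)
        if result != FAILURE:
            return result
    return FAILURE


by_rule = classify("Blood glucose 5.4 mmol/L")
by_model = classify("Patient reports headache")
```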
As shown in fig. 5, in some embodiments, the local storage of the processed medical data includes:
S41, acquiring the proxy service and the port where the attribute table is located;
Specifically, after medical data with complete attributes are acquired, the client connects to ZooKeeper and finds, from a ZooKeeper node, the proxy service and the port where the attribute table is located.
S42, the proxy service scans the start row key configured for each attribute in the attribute table, determines which attribute range the current medical data fall in, and then stores the medical data in a database;
S43, storing the correspondence between the attributes and the proxy services in the database.
Specifically, the client sends its request directly to the corresponding proxy service; after receiving the request, the proxy service writes the medical data into the attribute.
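Locating the attribute range for a row key can be sketched with a sorted list of start row keys and a binary search. The table contents, the proxy addresses and the function names are hypothetical, and an in-memory dictionary stands in for the database; the ZooKeeper lookup is omitted.

```python
import bisect

# Hypothetical attribute table: (start row key, proxy service address),
# sorted by start row key. Each range runs up to the next start key.
ATTRIBUTE_TABLE = [
    ("", "proxy-a:9001"),
    ("g", "proxy-b:9002"),
    ("p", "proxy-c:9003"),
]


def locate_proxy(row_key, table=ATTRIBUTE_TABLE):
    start_keys = [start for start, _ in table]
    # The owning range is the last one whose start key is <= row_key.
    index = bisect.bisect_right(start_keys, row_key) - 1
    return table[index][1]


def store(row_key, record, databases):
    # Write the record to the store behind the proxy that owns the range.
    proxy = locate_proxy(row_key)
    databases.setdefault(proxy, {})[row_key] = record
    return proxy


databases = {}
chosen = store("patient-001", {"diagnosis": "influenza"}, databases)
```

The empty string as the first start key guarantees that every row key falls into some range.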
In some embodiments, managing the database includes:
reading the medical data and translating it into an internal unified data format;
Specifically, this step allows the resources in the database to be managed properly and gives a degree of control over the data;
performing add, delete, modify and query operations on the acquisition sources of the medical data;
Specifically, websites are monitored in real time according to the information source state, the regular state and the like; for keyword search acquisition, real-time addition/deletion and starting/stopping of acquisition are supported; and the acquisition strategy is adjusted in real time according to the actual acquisition situation, for example by adding or deleting collectors;
and after the query result is obtained from the database, carrying out data format conversion on the query result.
Specifically, a user's data request (a high-level instruction) is converted into machine code (low-level instructions) to perform the query operation on the database and obtain the query result; the query result is then processed (format conversion) and returned to the user.
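The translation into a unified format and the format conversion of a query result might look like the sketch below. The source names, field mappings and the choice of JSON as the output format are assumptions for illustration only; the embodiment does not specify them.

```python
import json
from typing import Dict, List

# Hypothetical per-source field mappings: source field name -> unified field name.
FIELD_MAPS: Dict[str, Dict[str, str]] = {
    "his": {"pid": "patient_id", "txt": "content"},
    "web": {"user": "patient_id", "body": "content"},
}


def to_unified(source: str, raw: dict) -> dict:
    """Translate a raw record from one acquisition source into the unified format."""
    mapping = FIELD_MAPS[source]
    unified = {target: raw[field] for field, target in mapping.items()}
    unified["source"] = source
    return unified


def query(database: List[dict], patient_id: str) -> List[dict]:
    """Return all unified records that belong to the given patient."""
    return [row for row in database if row["patient_id"] == patient_id]


def format_result(rows: List[dict]) -> str:
    """Convert a query result into the format returned to the user (JSON here)."""
    return json.dumps(rows, ensure_ascii=False, sort_keys=True)
```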
This embodiment also discloses a data acquisition system based on big data, which includes:
the acquisition module acquires various medical data through an acquisition scheduling center, wherein the acquisition scheduling center comprises a plurality of different collectors, and the different collectors acquire unstructured medical data in corresponding acquisition channels;
a summarization module that aggregates the unstructured medical data;
the processing module is used for processing the medical data;
and the storage module is used for carrying out local storage and/or cloud storage on the processed medical data.
The storage of a single piece of medical data by the system is called a Job. After receiving a Job, the system starts a process to complete the entire storage procedure. The Job module is the central management node of a single Job and is responsible for data cleaning, subtask segmentation (converting a single Job into a plurality of sub-Tasks), Task group management and the like. After the Job is started, it is split into a plurality of small Tasks according to the segmentation strategy of each source so that they can be executed concurrently. A Task is the minimum unit of system operation, and each Task is responsible for storing a part of the data. After the Tasks are split, the Job calls the Scheduler module, which recombines the split Tasks into Task groups according to the configured concurrency. Each Task group runs all the Tasks allocated to it with a certain concurrency, and the default concurrency of a single Task group can be 10. Each Task is started by its Task group; once started, a Task always starts a Reader-Channel-Writer thread chain to complete the data storage work. While the Job is running, it monitors and waits for the Task groups to complete, and after all Task groups have completed, the Job exits successfully.
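The Job, Task and Task group flow can be sketched with standard-library threads and a bounded queue as the Channel. This is a minimal sketch under stated assumptions: the function names, the fixed-size splitting strategy, the round-robin grouping and the in-memory list used as the storage target are all hypothetical; only the default concurrency of 10 and the Reader-Channel-Writer structure come from the text.

```python
import queue
import threading
from concurrent.futures import ThreadPoolExecutor
from typing import List

_DONE = object()  # sentinel that tells the writer the reader has finished


def split_job(records: List[int], task_size: int) -> List[List[int]]:
    """Split one Job into Tasks of at most task_size records (assumed strategy)."""
    return [records[i:i + task_size] for i in range(0, len(records), task_size)]


def group_tasks(tasks: List[List[int]], group_count: int) -> List[List[List[int]]]:
    """Distribute Tasks round-robin over group_count Task groups."""
    groups: List[List[List[int]]] = [[] for _ in range(group_count)]
    for i, task in enumerate(tasks):
        groups[i % group_count].append(task)
    return [g for g in groups if g]


def run_task(task: List[int], sink: List[int], lock: threading.Lock) -> None:
    """Run one Task as a Reader thread and a Writer thread joined by a Channel."""
    channel: queue.Queue = queue.Queue(maxsize=8)

    def reader() -> None:
        for record in task:
            channel.put(record)
        channel.put(_DONE)

    def writer() -> None:
        while True:
            item = channel.get()
            if item is _DONE:
                break
            with lock:
                sink.append(item)  # stands in for the real storage write

    threads = [threading.Thread(target=reader), threading.Thread(target=writer)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()


def run_task_group(group, sink, lock, concurrency: int = 10) -> None:
    """Run the Tasks of one Task group with the given concurrency."""
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        for future in [pool.submit(run_task, task, sink, lock) for task in group]:
            future.result()  # re-raise any error from the Task


def run_job(records: List[int], task_size: int = 3, group_count: int = 2) -> List[int]:
    """Split the Job, start the Task groups, wait for all of them, then return."""
    sink: List[int] = []
    lock = threading.Lock()
    groups = group_tasks(split_job(records, task_size), group_count)
    workers = [threading.Thread(target=run_task_group, args=(g, sink, lock)) for g in groups]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return sink
```

Because Tasks run concurrently, the order of records in the returned list is not deterministic; only the set of stored records is.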
In some embodiments, the big-data based data acquisition system further comprises:
a configuration module, used for performing basic configuration of the services corresponding to the yml files, the medical data being transmitted between the services by means of queues.
In some embodiments, the processing module comprises:
the verification module is used for verifying the quality of the medical data;
the labeling module is used for labeling the verified medical data;
and the index creating module is used for creating an index for the labeled medical data.
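One possible reading of the index creating module is an inverted index from labels to record identifiers, sketched below. The record layout with `id` and `labels` fields is an assumption for illustration; the embodiment does not state which index structure is used.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def build_index(records: Iterable[dict]) -> Dict[str, List[str]]:
    """Map each label to the ids of the labeled records that carry it."""
    index: Dict[str, List[str]] = defaultdict(list)
    for record in records:
        for label in record["labels"]:
            index[label].append(record["id"])
    return dict(index)


def lookup(index: Dict[str, List[str]], label: str) -> List[str]:
    """Return the ids of all records with the given label (empty if none)."""
    return index.get(label, [])
```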
In some embodiments, the verification module comprises:
the accuracy checking module is used for checking the accuracy of the medical data;
the duplication removing module is used for carrying out duplication removing processing on the medical data through a neural network;
and the encryption module is used for encrypting the medical data after the duplication is removed.
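The deduplication step can be illustrated by comparing vector representations of the records. In this sketch the vectors are assumed to have already been produced by the neural network, and the cosine-similarity threshold of 0.95 is a hypothetical value; accuracy checking and encryption are not shown.

```python
import math
from typing import List, Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def deduplicate(items: List[Tuple[str, Sequence[float]]], threshold: float = 0.95) -> List[str]:
    """Keep a record only if it is not too similar to any record already kept."""
    kept: List[Tuple[str, Sequence[float]]] = []
    for text, vector in items:
        if all(cosine(vector, kept_vector) < threshold for _, kept_vector in kept):
            kept.append((text, vector))
    return [text for text, _ in kept]
```

This pairwise comparison is quadratic in the number of kept records, which is acceptable for a sketch but would likely need an approximate nearest-neighbour search at larger scale.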
In some embodiments, the specific implementation manner of the labeling module includes:
inputting the verified medical data into a bert neural network to obtain text vectors V;
randomly selecting a plurality of text vectors V as clustering center points a;
obtaining the distance between each of the other medical data and each clustering center point a, assigning each of the other medical data to the class of the nearest clustering center point a, and obtaining the clustering center points b of the resulting classes of text vectors V after the assignment is finished;
obtaining the distance between each of the other medical data and each clustering center point b, assigning each of the other medical data to the class of the nearest clustering center point b, obtaining the clustering center points c of the resulting classes of text vectors V after the assignment is finished, and repeating these steps to obtain multiple types of texts;
labeling the texts of each type with a central word;
and classifying newly acquired medical data according to its similarity to the central words.
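The iterative procedure above (random initial center points, assignment to the nearest center, recomputation of the centers, repetition) corresponds to k-means clustering. A minimal sketch follows, assuming the text vectors V have already been computed; the function names, the Euclidean distance and the fixed iteration limit are choices made for this illustration, and labeling with central words is not shown.

```python
import math
import random
from typing import List, Sequence


def mean(vectors: List[Sequence[float]]) -> List[float]:
    """Component-wise mean of a non-empty list of vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def nearest(vector: Sequence[float], centers: List[Sequence[float]]) -> int:
    """Index of the center point closest to the vector (Euclidean distance)."""
    return min(range(len(centers)), key=lambda i: math.dist(vector, centers[i]))


def kmeans(vectors: List[Sequence[float]], k: int, iterations: int = 20, seed: int = 0):
    """Return k center points after at most `iterations` assignment rounds."""
    rng = random.Random(seed)
    centers = rng.sample(vectors, k)  # randomly chosen initial center points a
    for _ in range(iterations):
        clusters: List[list] = [[] for _ in range(k)]
        for vector in vectors:
            clusters[nearest(vector, centers)].append(vector)
        # New center points (b, c, ...): the mean of each class; keep the old
        # center if a class ended up empty.
        updated = [mean(c) if c else centers[i] for i, c in enumerate(clusters)]
        if updated == centers:
            break  # assignments no longer change
        centers = updated
    return centers
```

A newly acquired record would then be assigned with `nearest(new_vector, centers)`, which mirrors classifying it by similarity to the class it is closest to.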
In some embodiments, the specific implementation manner of the labeling module includes:
classifying existing medical data into a plurality of types;
training the existing medical data through a bert + bilstm + cnn + attention + crf neural network until the accuracy is greater than a threshold value;
and classifying the newly acquired medical data by using the trained bert + bilstm + cnn + attention + crf neural network so as to enable the newly acquired medical data to belong to the corresponding type.
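The "train until the accuracy is greater than a threshold" loop can be illustrated independently of the network architecture. The sketch below deliberately swaps in a simple perceptron on a toy two-feature data set in place of the bert + bilstm + cnn + attention + crf network, which is far too large to reproduce here; the data, the threshold of 0.95, the learning rate and the epoch limit are all assumptions.

```python
from typing import List, Sequence, Tuple

Sample = Tuple[Sequence[float], int]

# Toy, linearly separable stand-in for already classified medical data.
TOY_DATA: List[Sample] = [
    ([0.0, 0.2], 0),
    ([0.1, 0.9], 0),
    ([0.9, 0.1], 1),
    ([1.0, 0.8], 1),
]


def predict(weights: Sequence[float], bias: float, x: Sequence[float]) -> int:
    """Return class 1 if the linear score is positive, otherwise class 0."""
    score = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if score > 0 else 0


def accuracy(weights: Sequence[float], bias: float, data: List[Sample]) -> float:
    """Fraction of samples whose predicted class matches the known class."""
    return sum(predict(weights, bias, x) == y for x, y in data) / len(data)


def train_until(data: List[Sample], threshold: float = 0.95,
                max_epochs: int = 100, lr: float = 0.1):
    """Train epoch by epoch and stop once accuracy exceeds the threshold."""
    weights = [0.0] * len(data[0][0])
    bias = 0.0
    for epoch in range(1, max_epochs + 1):
        for x, y in data:
            error = y - predict(weights, bias, x)  # 0 when the sample is correct
            weights = [w + lr * error * xi for w, xi in zip(weights, x)]
            bias += lr * error
        acc = accuracy(weights, bias, data)
        if acc > threshold:
            return weights, bias, acc, epoch
    raise RuntimeError("accuracy threshold not reached within max_epochs")
```

The epoch limit guards against a threshold that is never reached, a case the text does not address.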
In some embodiments, the specific implementation manner of the storage module includes:
acquiring the proxy service and the port where the attribute table is located;
the proxy service scans the start row key configured for each attribute in the attribute table, determines which attribute range the current medical data falls in, and then stores the medical data in the database;
the database stores the correspondence between the attributes and the proxy services.
In some embodiments, managing the database includes:
reading the medical data and translating it into an internal unified data format;
performing add, delete, modify and query operations on the acquisition sources of the medical data;
and after the query result is obtained from the database, carrying out data format conversion on the query result.
Because the data acquisition system based on big data solves the problem on a principle similar to that of the data acquisition method, the implementation of the system can refer to the implementation of the method, and details are not repeated herein.
The embodiment of the invention also provides a computer readable storage medium on which a computer program is stored; when executed by a processor, the computer program performs the steps of the data acquisition method.
Any process or method descriptions in flow charts or otherwise described herein may be understood as representing modules, segments, or portions of code which include one or more executable instructions for implementing specific logical functions or steps of the process, and alternate implementations are included within the scope of the preferred embodiment of the present invention in which functions may be executed out of order from that shown or discussed, including substantially concurrently or in reverse order, depending on the functionality involved, as would be understood by those reasonably skilled in the art of the present invention.
The logic and/or steps represented in the flowcharts or otherwise described herein, such as an ordered listing of executable instructions that can be considered to implement logical functions, can be embodied in any computer-readable medium, which for purposes of this specification, can be any means that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device. More specific examples (a non-exhaustive list) of the computer-readable medium would include the following: an electrical connection (electronic device) having one or more wires, a portable computer diskette (magnetic device), a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber device, and a portable read-only memory (CDROM). Additionally, the computer-readable medium could even be paper or another suitable medium upon which the program is printed, as the program can be electronically captured, via for instance optical scanning of the paper or other medium, then compiled, interpreted or otherwise processed in a suitable manner if necessary, and then stored in a computer memory.
It should be understood that portions of the present invention may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, the various steps or methods may be implemented in software or firmware stored in memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, any one or combination of the following techniques, which are known in the art, may be used: a discrete logic circuit having a logic gate circuit for implementing a logic function on a data signal, an application specific integrated circuit having an appropriate combinational logic gate circuit, a Programmable Gate Array (PGA), a Field Programmable Gate Array (FPGA), or the like.
In addition, functional modules in the embodiments of the present invention may be integrated into one processing module, or each of the modules may exist alone physically, or two or more modules are integrated together. The integrated module can be realized in a hardware mode, and can also be realized in a software functional module mode. The integrated module, if implemented in the form of a software functional module and sold or used as a separate product, may also be stored in a computer readable storage medium.
Finally, it should be noted that the above-mentioned embodiments are merely specific embodiments of the present invention, used to illustrate its technical solutions rather than to limit them, and the protection scope of the present invention is not limited thereto. Although the present invention is described in detail with reference to the foregoing embodiments, those skilled in the art should understand that anyone skilled in the art can, within the technical scope disclosed by the present invention, still modify the technical solutions described in the foregoing embodiments, readily conceive of changes to them, or make equivalent substitutions for some of their technical features; such modifications, changes or substitutions do not depart from the spirit and scope of the embodiments of the present invention and should be construed as being included in its protection scope. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (10)

1. A big data-based data acquisition method is characterized by comprising the following steps:
acquiring various medical data through an acquisition scheduling center, wherein the acquisition scheduling center comprises a plurality of different collectors, and the different collectors acquire unstructured medical data in corresponding acquisition channels;
aggregating the unstructured medical data;
processing the medical data;
and performing local storage and/or cloud storage on the processed medical data.
2. The big-data-based data acquisition method according to claim 1, further comprising, before acquiring the medical data in a plurality of acquisition modes:
and performing basic configuration on services corresponding to the yml type files, and transmitting medical data among the services in a queue mode.
3. The big data based data collection method of claim 1, wherein processing the medical data comprises:
verifying the quality of the medical data;
labeling the verified medical data;
and creating an index for the labeled medical data.
4. The big data based data collection method of claim 3, wherein verifying the quality of the medical data comprises:
verifying the accuracy of the medical data;
performing deduplication processing on the medical data through a neural network;
and encrypting the medical data after the duplication is removed.
5. The big data based data collection method according to claim 3, wherein tagging verified medical data comprises:
inputting the verified medical data into a bert neural network to obtain a text vector V;
randomly selecting a plurality of text vectors V as a clustering central point a;
obtaining the distance between each of the other medical data and each clustering center point a, assigning each of the other medical data to the class of the nearest clustering center point a, and obtaining the clustering center points b of the resulting classes of text vectors V after the assignment is finished;
obtaining the distance between each of the other medical data and each clustering center point b, assigning each of the other medical data to the class of the nearest clustering center point b, obtaining the clustering center points c of the resulting classes of text vectors V after the assignment is finished, and repeating these steps to obtain multiple types of texts;
labeling the texts of each type with a central word;
and classifying newly acquired medical data according to its similarity to the central words.
6. The big data based data collection method according to claim 3, wherein tagging verified medical data comprises:
classifying existing medical data into a plurality of types;
training the existing medical data through a bert + bilstm + cnn + attention + crf neural network until the accuracy is greater than a threshold value;
and classifying the newly acquired medical data by using the trained bert + bilstm + cnn + attention + crf neural network so as to enable the newly acquired medical data to belong to the corresponding type.
7. The big data based data collection method according to claim 1 or 3, wherein storing the processed medical data locally comprises:
acquiring a proxy service and a port where the attribute table is located;
the proxy service scans the start row key configured for each attribute in the attribute table, determines which attribute range the current medical data falls in, and then stores the medical data in the database;
the database stores the correspondence between the attributes and the proxy services.
8. The big data based data collection method according to claim 7, wherein managing the database comprises:
reading the medical data and translating it into an internal unified data format;
performing add, delete, modify and query operations on the acquisition sources of the medical data;
and after the query result is obtained from the database, carrying out data format conversion on the query result.
9. A big-data based data acquisition system, comprising:
the acquisition module acquires various medical data through an acquisition scheduling center, wherein the acquisition scheduling center comprises a plurality of different collectors, and the different collectors acquire unstructured medical data in corresponding acquisition channels;
a summarization module that summarizes the unstructured medical data;
the processing module is used for processing the medical data;
and the storage module is used for carrying out local storage and/or cloud storage on the processed medical data.
10. The big-data based data collection system of claim 9, wherein the processing module comprises:
the verification module is used for verifying the quality of the medical data;
the labeling module is used for labeling the verified medical data;
and the index creating module is used for creating an index for the labeled medical data.
CN202110552784.9A 2021-05-20 2021-05-20 Data acquisition method and system based on big data Active CN113380414B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110552784.9A CN113380414B (en) 2021-05-20 2021-05-20 Data acquisition method and system based on big data

Publications (2)

Publication Number Publication Date
CN113380414A true CN113380414A (en) 2021-09-10
CN113380414B CN113380414B (en) 2023-11-10

Family

ID=77571507

Country Status (1)

Country Link
CN (1) CN113380414B (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant