CN113380414B - Data acquisition method and system based on big data - Google Patents


Info

Publication number
CN113380414B
Authority
CN
China
Prior art keywords
medical data
data
acquisition
medical
module
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110552784.9A
Other languages
Chinese (zh)
Other versions
CN113380414A (en)
Inventor
王兴维
邰从越
陈攀
张迁
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xiaorui Medical Technology (Dalian) Co.,Ltd.
Original Assignee
Senyint International Digital Medical System Dalian Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Senyint International Digital Medical System Dalian Co ltd
Priority to CN202110552784.9A
Publication of CN113380414A
Application granted
Publication of CN113380414B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H50/00 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H50/70 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for mining of medical data, e.g. analysing previous cases of other patients
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21 Design, administration or maintenance of databases
    • G06F16/215 Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22 Indexing; Data structures therefor; Storage structures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28 Databases characterised by their database models, e.g. relational or object models
    • G06F16/284 Relational databases
    • G06F16/285 Clustering or classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Software Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Molecular Biology (AREA)
  • Evolutionary Computation (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Medical Informatics (AREA)
  • Public Health (AREA)
  • Pathology (AREA)
  • Epidemiology (AREA)
  • Primary Health Care (AREA)
  • Quality & Reliability (AREA)
  • Medical Treatment And Welfare Office Work (AREA)

Abstract

The application discloses a data acquisition method and system based on big data, and relates to the technical field of medical data acquisition. The method comprises the following steps: acquiring a plurality of kinds of medical data through an acquisition and scheduling center, wherein the acquisition and scheduling center comprises a plurality of different collectors, and each collector acquires unstructured medical data from its corresponding acquisition channel; summarizing the unstructured medical data; processing the medical data; and storing the processed medical data locally and/or in the cloud. The application integrates and stores the various kinds of related medical data after they are collected, provides two dirty-data processing modes, achieves accurate filtering, identification, acquisition and display of dirty data during processing, offers strong reliability and high security, and can also handle duplicate data in the medical data.

Description

Data acquisition method and system based on big data
Technical Field
The application relates to the technical field of medical data acquisition, in particular to a data acquisition method and system based on big data.
Background
At present, medical data in China mainly comes from the disease and physical-sign data recorded by information systems and equipment such as the hospital information system (HIS), the electronic medical record system (EMR), the image acquisition and transmission system (PACS), the laboratory examination information system (LIS), the pathology system (PS) and medical instruments. It also includes data generated by hospital material management and hospital operation systems. Surveys show that more than 70% of hospitals have implemented medical informatization, but fewer than 3% of hospitals are interconnected; medical big data remains scattered, and the information islands have yet to be broken. The same medical record is sometimes interpreted differently by two doctors, so if information cannot flow between hospitals, patients suffer a considerable loss. Information islands also cause great inconvenience to doctors and hospital administrators who need to use the data and information.
Information islands are a legacy problem of China's health informatization: because the relevant standards had not been issued, each hospital lacked standard guidance when building its medical information system, had no top-level design, and built its systems in isolated segments, which produced the information islands. Establishing a medical data acquisition center is therefore an important means of improving medical technology, breaking information islands and achieving interconnection among hospitals.
Because medical data is varied, voluminous and rapidly updated, existing medical data acquisition systems cannot handle many kinds of data well, cannot guarantee the reliability of the acquired data, and cannot handle duplicate data.
Disclosure of Invention
To address the problems in the prior art, the application provides a data acquisition method and system based on big data, which can process large volumes of varied data, offers strong reliability and high security, and can also handle duplicate data in the acquired data.
According to an embodiment of the first aspect of the present application, a data collection method based on big data includes:
acquiring a plurality of medical data through an acquisition and scheduling center, wherein the acquisition and scheduling center comprises a plurality of different collectors, and the different collectors acquire unstructured medical data in corresponding acquisition channels;
summarizing the unstructured medical data;
processing the medical data;
and carrying out local storage and/or cloud storage on the processed medical data.
According to some embodiments of the application, before acquiring the medical data by the plurality of acquisition modes, the method further comprises:
performing basic configuration of each service through its corresponding yml file, and transferring medical data among the services by means of queues.
According to some embodiments of the application, processing the medical data comprises:
verifying the quality of the medical data;
labeling the medical data after verification;
an index is created for the labeled medical data.
According to some embodiments of the application, verifying the quality of the medical data comprises:
checking the accuracy of the medical data;
performing de-duplication processing on the medical data through a neural network;
encrypting the medical data after the duplication removal.
According to some embodiments of the application, labeling the verified medical data includes:
inputting the checked medical data into a bert neural network to acquire a text vector V;
randomly selecting a plurality of text vectors V as a clustering center point a;
acquiring the distance between each other piece of medical data and each cluster center point a, assigning each piece to the class of the text vector V nearest to it, and, after classification is complete, obtaining the cluster center point b of each of the plurality of classes of text vectors V;
obtaining the distance between each other piece of medical data and each cluster center point b, assigning each piece to the class of the nearest center point, obtaining after classification the cluster center point c of each of the plurality of classes of text vectors V, and repeating these steps to obtain a plurality of classes of texts;
labeling the text of each category with a center word;
the newly acquired medical data is classified according to similarity with the center word.
According to some embodiments of the application, labeling the verified medical data includes:
classifying existing medical data into a plurality of types;
training the existing medical data through a BERT+BiLSTM+CNN+attention+CRF neural network until the accuracy is greater than a threshold;
classifying the newly acquired medical data through the trained BERT+BiLSTM+CNN+attention+CRF neural network, so that the newly acquired medical data is assigned to the corresponding type.
According to some embodiments of the application, storing the processed medical data locally includes:
acquiring the proxy service and port where the attribute table is located;
the proxy service scans the start key of each attribute configured in the attribute table, determines which attribute range the current medical data falls in, and stores the current medical data in the database;
and the corresponding relation between the attribute and the proxy service is stored in the database.
According to some embodiments of the application, managing the database includes:
reading the medical data and translating the medical data into an internal unified data format;
performing add, delete and query operations on the acquisition sources of the medical data;
and after the query result is obtained from the database, carrying out data format conversion on the query result.
According to a second aspect of the present application, a data acquisition system based on big data includes:
the acquisition module acquires various medical data through an acquisition and scheduling center, wherein the acquisition and scheduling center comprises a plurality of different collectors, and the different collectors acquire unstructured medical data in corresponding acquisition channels;
a summarizing module summarizing the unstructured medical data;
the processing module is used for processing the medical data;
and the storage module is used for carrying out local storage and/or cloud storage on the processed medical data.
According to some embodiments of the application, the processing module comprises:
the verification module is used for verifying the quality of the medical data;
the marking module is used for marking the medical data after verification;
and an index creating module for creating an index for the labeled medical data.
Through the above technical scheme, the following technical effects are obtained: the application integrates and stores the various kinds of related medical data after they are collected, provides two dirty-data processing modes, achieves accurate filtering, identification, acquisition and display of dirty data during processing, offers strong reliability and high security, and can also handle duplicate data in the medical data.
Drawings
FIG. 1 is a block diagram of a hardware configuration of a data acquisition computer according to an embodiment of the present application;
FIG. 2 is a flow chart of a data acquisition method according to an embodiment of the present application;
FIG. 3 is a flow chart of quality verification and processing of the medical data according to an embodiment of the present application;
FIG. 4 is a flow chart for verifying the quality of medical data according to an embodiment of the present application;
FIG. 5 is a flow chart of local storage of processed medical data according to an embodiment of the present application.
Detailed Description
Exemplary embodiments will now be described more fully with reference to the accompanying drawings. However, the exemplary embodiments may be embodied in many different forms and should not be construed as limited to the examples set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of the exemplary embodiments to those skilled in the art.
The current medical quality management and control index framework for tertiary general hospitals comprises 7 major index categories, 44 quality evaluation indexes, 730 single indexes, 2610 composite indexes and 400 monitoring data items; the index categories are inpatient mortality indexes, re-entry indexes, hospital infection indexes, surgical complication indexes, patient safety indexes, rational medication indexes for medical institutions, and hospital operation management indexes. The management system is huge and difficult to monitor. In addition, questions such as which DBMS the database server of each business system runs, whether manually entered data exists and in what volume, and whether unstructured data exists are all technical barriers to medical interconnection. The embodiments of the application therefore provide a data acquisition method and system based on big data.
The data acquisition method may be performed in a server, a computer or a similar computing device. Taking a computer as an example, FIG. 1 is a block diagram of a hardware structure of a data acquisition computer according to an embodiment of the present application. As shown in FIG. 1, the computer 10 may include one or more processors 102 (only one is shown in FIG. 1; the processor 102 may include, but is not limited to, a processing device such as a microprocessor (MCU) or a programmable logic device (FPGA)) and a memory 104 for storing data, and optionally a transmission device 106 for communication functions and an input-output device 108. Those of ordinary skill in the art will appreciate that the configuration shown in FIG. 1 is merely illustrative and does not limit the configuration of the computer described above. For example, the computer 10 may include more or fewer components than shown in FIG. 1, or have a different configuration from that shown in FIG. 1.
The memory 104 may be used to store a computer program, for example, a software program of application software and a module, such as a computer program corresponding to a data acquisition method in an embodiment of the present application, and the processor 102 executes the computer program stored in the memory 104 to perform various functional applications and data processing, that is, implement the above-mentioned method. Memory 104 may include high-speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory.
The transmission device 106 is used to receive or transmit data via a network. Specific examples of the network described above may include a wireless network provided by a communications provider of computer 10. In one example, the transmission device 106 includes a network adapter (Network Interface Controller, simply referred to as NIC) that can connect to other network devices through a base station to communicate with the internet. In one example, the transmission device 106 may be a Radio Frequency (RF) module, which is configured to communicate with the internet wirelessly.
As shown in fig. 2, in some embodiments, the big data based data acquisition method includes:
S1, acquiring a plurality of kinds of medical data through an acquisition and scheduling center, wherein the acquisition and scheduling center comprises a plurality of different collectors, and each collector acquires unstructured medical data from its corresponding acquisition channel;
Specifically, medical data may be acquired from the corresponding acquisition channels using different acquisition modes. The task scheduling center can manage acquisition tasks for different data (logs, data in databases, etc.). The acquisition modes may be as follows:
(1) Various open data sets exist on the network; a data set in the medical field can be obtained simply by finding the corresponding website and its download link. Such data sets help the medical system complete its internal information. A collector is configured to crawl and organize the medical data by means such as web crawlers and rule matching.
(2) For the logging services of the medical system, a log collection scheme may be employed. Common log collection tools include Logstash, Filebeat, Flume, Fluentd, Logagent, rsyslog and syslog-ng. A collector is configured to read and collect the medical data by reading log text.
(3) Medical data is acquired through social surveys; such data can enrich the data content of the medical system. A collector is configured to acquire the medical data of the survey results.
(4) Medical systems have daily-operation and business-department modules, whose data is recorded in certain files or systems, such as common medical system databases. A large amount of medical data is stored in databases, and a collector is configured to acquire the data in a different way for each kind of database.
(5) A medical sensor is a detection device that senses the measured information and converts it, according to a certain rule, into an electrical signal or another required form of output. Medical data is acquired and uploaded by the medical sensor, and a collector is configured to collect it.
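The relationship between the scheduling center and its collectors can be sketched as follows. This is a minimal illustration only, not the patented implementation: the class name `SchedulingCenter`, the record layout and the two stand-in collectors are all hypothetical, and real collectors would wrap a crawler, a log reader, a database client or a sensor gateway.

```python
from typing import Callable, Dict, List

# Hypothetical record type: each collector returns raw, unstructured records.
Record = Dict[str, str]
Collector = Callable[[], List[Record]]


class SchedulingCenter:
    """Registers one collector per acquisition channel and runs them all."""

    def __init__(self) -> None:
        self._collectors: Dict[str, Collector] = {}

    def register(self, channel: str, collector: Collector) -> None:
        self._collectors[channel] = collector

    def run_all(self) -> List[Record]:
        collected: List[Record] = []
        for channel, collector in self._collectors.items():
            for record in collector():
                # Tag every record with the channel it came from.
                collected.append({**record, "source": channel})
        return collected


def log_collector() -> List[Record]:
    # Stand-in for reading log text; real lines would come from a log tool.
    return [{"raw": "patient admitted"}]


def sensor_collector() -> List[Record]:
    # Stand-in for an uploaded sensor reading.
    return [{"raw": "heart_rate=72"}]


center = SchedulingCenter()
center.register("log", log_collector)
center.register("sensor", sensor_collector)
records = center.run_all()
```

Tagging each record with its channel at collection time keeps the data source available later, when it is stored as an attribute of the data.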
S2, summarizing the unstructured medical data;
Specifically, the medical data is gathered through a unified micro-service interface. After acquiring data, a collector sends it to the micro-service through a RESTful interface, and the data is then placed in a Redis distributed cache for temporary storage.
The unstructured medical data acquired through the different channels is summarized and then delivered uniformly to the data quality check for processing. Because the application uses different collectors, it can process many kinds of data, which solves the problem that conventional systems cannot handle varied data well.
S3, processing the medical data;
Specifically, after the medical data is acquired, its quality is first verified, which includes checking the accuracy of the medical information and deduplicating and encrypting the medical data. After verification, the deduplicated data is labeled and its information source is monitored.
S4, carrying out local storage and/or cloud storage on the processed medical data.
Specifically, medical data stored only locally risks partial or total loss, for example through damage to local equipment, so the medical data can be backed up in the cloud. The same data may be sent to the cloud for storage while the acquired medical data is stored locally; alternatively, the updated portion of the locally stored medical data may be compressed at intervals (e.g., every half month or every month) and then backed up to the cloud. This ensures the security of the data.
The application can extract the medical data of distributed, heterogeneous data sources in the hospital information system (HIS), the electronic medical record system (EMR), the image acquisition and transmission system (PACS), the laboratory examination information system (LIS), the pathology system (PS) and other hospital information systems into a temporary middle layer for cleaning, conversion and integration, and finally load it into a database, forming the basis for online analysis and mining of medical data. It also enables the storage and application of both structured and unstructured data.
In some embodiments, before acquiring the medical data by the plurality of acquisition modes, further comprising:
performing basic configuration of each service through its corresponding yml file, and transferring medical data among the services by means of queues.
Specifically, all configuration files used in the data acquisition process are stored in Nacos. Nacos holds a plurality of yml files, one configuration file per service, for example the collector service, the data quality verification service, the user center service, the gateway service and the data management service. First, basic settings such as the port, database address and startup mode of each service are configured in its yml file. Each service automatically obtains its configuration from Nacos at start-up.
It should be noted that the medical data is transferred between services through a middleware tool; that is, a queue is used to handle concurrency and reduce the load on the server.
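How a queue decouples two services can be illustrated with a short in-process sketch. It assumes Python's standard `queue.Queue` as a stand-in for the real message middleware, which the text does not name; the service functions and the `SENTINEL` end-of-stream convention are hypothetical.

```python
import queue
import threading

# In-process stand-in for the message-queue middleware between services.
channel = queue.Queue(maxsize=100)
SENTINEL = None  # assumed convention marking the end of the stream
verified = []


def collector_service(items):
    """Producer: pushes collected records onto the queue."""
    for item in items:
        channel.put(item)  # blocks when the queue is full, throttling producers
    channel.put(SENTINEL)


def verification_service():
    """Consumer: takes records off the queue at its own pace."""
    while True:
        item = channel.get()
        if item is SENTINEL:
            break
        verified.append(item.strip())


consumer = threading.Thread(target=verification_service)
consumer.start()
collector_service([" record-1 ", " record-2 "])
consumer.join()
```

The bounded queue is the point of the design: a burst of collected data waits in the queue instead of overwhelming the downstream service.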
As shown in FIG. 3, in some embodiments, performing quality verification and processing on the medical data includes:
S31, verifying the quality of the medical data;
Specifically, the quality verification includes checking the accuracy of the medical data by comparison, deduplicating the medical data through a neural network, and encrypting the deduplicated data with the Triple DES algorithm.
S32, marking the medical data after verification;
Specifically, labeling can be realized in two ways: the first is to cluster the acquired medical data; the second is to classify it with a neural network. After classification is finished, the data source is also stored as an attribute in the overall information of the data.
S33, creating an index for the labeled medical data.
Specifically, each piece of medical data is summarized to obtain a title for the data. The title can be obtained in two ways: one is to take the first 10-20 characters of the data directly as its title; the other is to obtain a summary of the data through the encoder and decoder of a Seq2Seq architecture. An index is then created in Elasticsearch on important attributes of the data, such as the summary and the generation time, so that users can query it quickly.
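The simpler of the two title strategies, together with the idea of indexing by title, can be sketched as below. The in-memory `InvertedIndex` class is a toy stand-in and does not use the Elasticsearch API; the sample documents are invented for illustration.

```python
from collections import defaultdict

TITLE_LENGTH = 20  # the text suggests 10-20 characters


def make_title(text: str, length: int = TITLE_LENGTH) -> str:
    """Use the first `length` characters of the record as its title."""
    return text.strip()[:length]


class InvertedIndex:
    """Toy in-memory index from lowercase title words to record ids."""

    def __init__(self):
        self._postings = defaultdict(set)

    def add(self, record_id: int, title: str) -> None:
        for word in title.lower().split():
            self._postings[word].add(record_id)

    def search(self, word: str) -> set:
        return set(self._postings.get(word.lower(), set()))


docs = {
    1: "Chest X-ray report: no abnormality found in either lung.",
    2: "Blood test report: white cell count slightly elevated.",
}
index = InvertedIndex()
for doc_id, text in docs.items():
    index.add(doc_id, make_title(text))
```

A fixed-length prefix is cheap but can cut a word in half, which is presumably why the text offers a Seq2Seq summary as the alternative.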
As shown in fig. 4, in some embodiments, verifying the quality of the medical data includes:
S311, checking the accuracy of the medical data;
Specifically, the accuracy check can be achieved in various ways. The first is to compute the MD5 code of the medical data delivered by the middleware and compare it with the MD5 code carried with the data; if the two are the same, the data was transmitted without problems. The second is to compare the similarity of the data across its several data sources; a large difference indicates a problem with the data. The third is to determine whether a large error exists between the data before and after transmission; if the average values of the same index differ greatly and illogically, the transmission process is faulty and the acquired data is inaccurate.
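The MD5 comparison can be sketched with Python's standard `hashlib` module. The helper names `md5_hex` and `is_intact` and the sample payload are hypothetical; only the digest comparison itself comes from the text.

```python
import hashlib


def md5_hex(payload: bytes) -> str:
    """Hex MD5 digest of a payload."""
    return hashlib.md5(payload).hexdigest()


def is_intact(payload: bytes, expected_md5: str) -> bool:
    """Return True when the received payload matches the sender's digest."""
    return md5_hex(payload) == expected_md5.lower()


sent = "blood pressure 120/80".encode("utf-8")
digest = md5_hex(sent)                      # computed by the sender
ok = is_intact(sent, digest)                # unchanged payload
corrupted = is_intact(sent + b"!", digest)  # payload altered in transit
```

MD5 detects accidental corruption in transit; it is not collision-resistant, so it should not be relied on against deliberate tampering.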
S312, performing de-duplication processing on the medical data through a neural network;
Specifically, a neural network is used to deduplicate the medical data text: the complete, checked data is compared with the corresponding module information in the existing system. For example, a person's newly acquired illness information is compared with that person's illness information already in the system. The comparison may use BERT to obtain the corresponding sentence vectors and then calculate their similarity. A similarity greater than 90% defines the texts as identical, a similarity greater than 80% as substantially identical, and a similarity less than 50% as different. Identical medical data is filtered out and different medical data is stored.
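The similarity-based filtering step can be sketched as follows. The sketch assumes cosine similarity as the similarity measure, which the text does not specify, and uses small hand-written vectors in place of BERT sentence vectors; the 0.9 threshold mirrors the 90% figure above.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def deduplicate(vectors: List[Sequence[float]], threshold: float = 0.9) -> List[int]:
    """Keep indices of vectors not too similar to any already-kept vector."""
    kept: List[int] = []
    for i, vec in enumerate(vectors):
        if all(cosine_similarity(vec, vectors[j]) <= threshold for j in kept):
            kept.append(i)
    return kept


# Toy stand-ins for sentence vectors.
embeddings = [
    [1.0, 0.0, 0.0],
    [0.99, 0.05, 0.0],  # near-duplicate of the first
    [0.0, 1.0, 0.0],    # different
]
unique = deduplicate(embeddings)
```

Comparing each new vector against every kept vector is quadratic in the worst case; restricting the comparison to records of the same person, as in the example above, keeps the candidate set small.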
S313, encrypting the medical data after the duplication removal.
Specifically, the Triple DES algorithm transforms a 64-bit plaintext input block into a 64-bit ciphertext output block; in each 64-bit DES key, 8 bits are parity bits and the remaining 56 bits are the effective key length.
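The split between 8 parity bits and 56 remaining bits reflects the standard DES key layout, in which the lowest bit of each key byte gives that byte odd parity. The sketch below only illustrates this layout; it does not perform encryption (the Python standard library has no Triple DES cipher), and the function names are hypothetical. Triple DES itself uses two or three such keys.

```python
def set_odd_parity(key: bytes) -> bytes:
    """Set the lowest bit of each byte so every byte has an odd number of 1s."""
    adjusted = bytearray()
    for byte in key:
        key_bits = byte & 0xFE  # upper 7 bits carry key material
        ones = bin(key_bits).count("1")
        parity_bit = 0 if ones % 2 == 1 else 1
        adjusted.append(key_bits | parity_bit)
    return bytes(adjusted)


def effective_key_bits(key: bytes) -> int:
    """Count the non-parity bits: 7 per byte."""
    return 7 * len(key)


# An 8-byte (64-bit) key with its parity bits corrected.
des_key = set_odd_parity(bytes(range(8)))
```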
In some embodiments, labeling the verified medical data includes:
inputting the checked medical data into a bert neural network to acquire a text vector V;
randomly selecting a plurality of (e.g., 10) text vectors V as cluster center points a;
acquiring the distance between each other piece of medical data and each cluster center point a (the distance between text meanings being judged through similarity calculation), assigning each piece to the class of the text vector V nearest to it, and, after classification is complete, obtaining the cluster center point b of each of the plurality of classes (for example, 10 classes) of text vectors V;
obtaining the distance between each other piece of medical data and each cluster center point b (the distance between text meanings being judged through similarity calculation), assigning each piece to the class of the nearest center point, obtaining after classification the cluster center point c of each of the plurality of classes (for example, 10 classes) of text vectors V, and repeating these steps N times to obtain and store a plurality of classes (for example, 10 classes) of texts;
labeling the text of each category with a center word;
the newly acquired medical data is classified according to similarity with the center word.
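The iterative procedure above (random initial centers, assignment to the nearest center, recomputation of centers, repetition) has the shape of k-means clustering. A minimal pure-Python sketch follows; it assumes squared Euclidean distance rather than the unspecified similarity measure, and uses 2-dimensional toy points in place of BERT text vectors. The function names and the fixed seed are illustrative choices.

```python
import random
from typing import List, Tuple

Vector = List[float]


def squared_distance(a: Vector, b: Vector) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(vectors: List[Vector]) -> Vector:
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def nearest(vec: Vector, centers: List[Vector]) -> int:
    return min(range(len(centers)), key=lambda i: squared_distance(vec, centers[i]))


def k_means(vectors: List[Vector], k: int, iterations: int = 10,
            seed: int = 0) -> Tuple[List[Vector], List[int]]:
    rng = random.Random(seed)
    centers = [list(v) for v in rng.sample(vectors, k)]  # random initial centers
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assign every vector to its nearest center.
        labels = [nearest(v, centers) for v in vectors]
        # Recompute each center as the mean of its members.
        for c in range(k):
            members = [v for v, label in zip(vectors, labels) if label == c]
            if members:  # keep the old center if a cluster is empty
                centers[c] = mean(members)
    return centers, labels


points = [[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.2, 4.9]]
centers, labels = k_means(points, k=2)
```

Newly acquired data can then be classified with `nearest(new_vector, centers)`, which corresponds to classifying by similarity to each class's center word.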
In some embodiments, labeling the verified medical data includes:
classifying existing medical data into a plurality of types;
Specifically, the plurality of types may be 10 types, the number being determined according to the amount of existing medical data.
training the existing medical data through a BERT+BiLSTM+CNN+attention+CRF neural network until the accuracy is greater than a threshold;
Specifically, the GPU memory of the training device is larger than 10 GB, and the training accuracy is required to be greater than 90%.
classifying the newly acquired medical data through the trained BERT+BiLSTM+CNN+attention+CRF neural network, so that the newly acquired medical data is assigned to the corresponding type.
It should be noted that this embodiment provides two or more processing modes for the various kinds of dirty data. When one processing mode fails on a piece of data, the returned failure signal 0 is identified automatically and another processing mode is started immediately to process the data, which ensures the stability of data processing. For example, when rule-based classification of the medical data fails, the neural network model is immediately used to classify the medical data and obtain an accurate classification result.
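The fallback between processing modes can be sketched as a chain of handlers tried in order. The keyword rules, the labels and the `model_classifier` stub are hypothetical; only the failure signal 0 and the rule-then-model order come from the text.

```python
FAILURE = 0  # failure signal, as described in the text


def rule_classifier(text: str):
    """Hypothetical keyword rules; returns FAILURE when no rule matches."""
    rules = {"x-ray": "imaging", "blood": "laboratory"}
    lowered = text.lower()
    for keyword, label in rules.items():
        if keyword in lowered:
            return label
    return FAILURE


def model_classifier(text: str):
    """Stand-in for a trained neural network classifier."""
    return "other"


def classify(text: str, handlers=(rule_classifier, model_classifier)):
    """Try each processing mode in turn until one succeeds."""
    for handler in handlers:
        result = handler(text)
        if result != FAILURE:
            return result
    return FAILURE
```

For example, `classify("Blood test")` is resolved by the rules, while `classify("discharge summary")` falls through to the model stand-in.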
As shown in fig. 5, in some embodiments, locally storing the processed medical data includes:
S41, acquiring the proxy service and port where the attribute table is located;
Specifically, after medical data containing complete attributes is acquired, the client connects to ZooKeeper and finds, from the ZooKeeper node, the proxy service and port where the attribute table is located.
S42, the proxy service scans the start key of each attribute configured in the attribute table, determines which attribute range the current medical data falls in, and then stores the current medical data in the database;
S43, storing the correspondence between the attribute and the proxy service in the database.
Specifically, the client sends its request directly to the corresponding proxy service; after receiving the request, the proxy service writes the medical data into the attribute.
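Locating the range that contains a given key can be sketched with a binary search over sorted start keys. The start keys, proxy addresses and the function name `locate_proxy` are hypothetical; in the described system this mapping would be read from the attribute table found through ZooKeeper rather than hard-coded.

```python
import bisect

# Hypothetical attribute table: sorted start keys and the proxy owning each range.
START_KEYS = ["", "h", "p"]
PROXIES = ["proxy-a:9001", "proxy-b:9002", "proxy-c:9003"]


def locate_proxy(row_key: str) -> str:
    """Return the proxy whose range [start_key, next_start_key) holds row_key."""
    position = bisect.bisect_right(START_KEYS, row_key) - 1
    return PROXIES[position]
```

With these sample ranges, `locate_proxy("blood")` resolves to `proxy-a:9001` and `locate_proxy("h")` to `proxy-b:9002`, since each range includes its own start key.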
In some embodiments, managing the database includes:
reading the medical data and translating the medical data into an internal unified data format;
Specifically, this step allows the resources in the database to be adequately managed and the data to be controlled;
performing add, delete and query operations on the acquisition sources of the medical data;
Specifically, websites are monitored in real time according to information-source status, rule status and the like; keyword-based searching and collecting can be added or deleted and started or stopped in real time; and the acquisition strategy is adjusted in real time according to actual acquisition conditions, for example by adding or removing collectors;
and after the query result is obtained from the database, carrying out data format conversion on the query result.
Specifically, a data request of a user (a high-level instruction) is converted into complex machine code (low-level instructions), so that the query operation on the database is carried out and a query result is obtained; the query result is then processed (format conversion) and returned to the user.
The embodiment also discloses a data acquisition system based on big data, comprising:
the acquisition module acquires various medical data through an acquisition and scheduling center, wherein the acquisition and scheduling center comprises a plurality of different collectors, and the different collectors acquire unstructured medical data in corresponding acquisition channels;
a summarizing module for summarizing the unstructured medical data;
a processing module for processing the medical data;
and the storage module is used for carrying out local storage and/or cloud storage on the processed medical data.
A single storage run of medical data completed by the system is called a Job; after receiving a Job, the system starts a process to complete the storage. The Job module of the system is the central management node of a single Job and is responsible for data cleaning, subtask splitting (converting a single Job computation into a plurality of subtasks), Task group management and the like. After the Job is started, it is split into a plurality of small Tasks according to the splitting strategy of each source, so that they can be executed concurrently. A Task is the smallest unit of system operation, and each Task is responsible for storing a part of the data. After the Job has been split into multiple Tasks, the Job calls the Scheduler module, which reassembles the split Tasks into Task groups according to the configured concurrency. Each Task group is responsible for running all of its allocated Tasks at a certain concurrency; the default concurrency of a single Task group may be 10. Each Task is started by its Task group, and once started, a Task always launches a Reader -> Channel -> Writer thread chain to complete the data storage work. While the Job is running, it monitors and waits for the Tasks of all Task groups to complete, and the Job exits successfully after all Task group Tasks are completed.
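The Job, Task, and Task group flow described above can be sketched roughly as follows. This is a simplified illustration under several assumptions: the function names and the fixed-size splitting strategy are hypothetical, the reader runs in its own thread while the writer runs in the worker thread, and Task groups are run one after another for brevity.

```python
import queue
import threading
from concurrent.futures import ThreadPoolExecutor


def split_job(records, task_size):
    """Split one Job's records into Tasks, the smallest unit of work."""
    return [records[i:i + task_size] for i in range(0, len(records), task_size)]


def group_tasks(tasks, group_size):
    """Reassemble Tasks into Task groups of at most `group_size` Tasks each."""
    return [tasks[i:i + group_size] for i in range(0, len(tasks), group_size)]


def run_task(task, sink, lock):
    """Run one Task as Reader -> Channel -> Writer."""
    channel = queue.Queue()
    done = object()  # sentinel telling the writer that the reader has finished

    def reader():
        for record in task:
            channel.put(record)
        channel.put(done)

    reader_thread = threading.Thread(target=reader)
    reader_thread.start()
    while True:  # writer side: drain the channel into the shared sink
        record = channel.get()
        if record is done:
            break
        with lock:
            sink.append(record)
    reader_thread.join()


def run_job(records, task_size=3, concurrency=10):
    """Split, group, run each group with bounded concurrency, and wait for all Tasks."""
    sink, lock = [], threading.Lock()
    tasks = split_job(records, task_size)
    for group in group_tasks(tasks, concurrency):
        with ThreadPoolExecutor(max_workers=concurrency) as pool:
            # list() waits for every Task and re-raises any Task exception
            list(pool.map(lambda t: run_task(t, sink, lock), group))
    return sink


stored = run_job([f"record-{i}" for i in range(25)])
```

With 25 records and a Task size of 3, the Job is split into 9 Tasks, which fit into a single Task group at the default concurrency of 10.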
In some embodiments, the big data based data acquisition system further comprises:
a configuration module for performing basic configuration of the services corresponding to the yml files, the medical data being passed between the services through queues.
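As a rough illustration of services that are wired together by configuration and exchange data through queues, consider the sketch below. Everything in it is assumed for illustration: the service names, the queue names, and the configuration keys are hypothetical, and the configuration is written as the Python dict that a yml file might be loaded into rather than read from an actual file.

```python
import queue
import threading

# Hypothetical configuration, shown as the dict a yml file might load into.
CONFIG = {
    "services": {
        "collector": {"output_queue": "raw"},
        "processor": {"input_queue": "raw", "output_queue": "clean"},
        "storage": {"input_queue": "clean"},
    }
}

STOP = None  # sentinel passed down the pipeline to shut each service down


def build_queues(config):
    """Create one queue per queue name mentioned in the configuration."""
    names = set()
    for settings in config["services"].values():
        for key in ("input_queue", "output_queue"):
            if key in settings:
                names.add(settings[key])
    return {name: queue.Queue() for name in names}


def collector(out_q, records):
    for record in records:
        out_q.put(record)
    out_q.put(STOP)


def processor(in_q, out_q):
    while (record := in_q.get()) is not STOP:
        out_q.put(record.strip().lower())  # stand-in for real cleaning
    out_q.put(STOP)


def storage(in_q, store):
    while (record := in_q.get()) is not STOP:
        store.append(record)


def run_pipeline(records, config=CONFIG):
    queues = build_queues(config)
    svc = config["services"]
    store = []
    threads = [
        threading.Thread(target=collector,
                         args=(queues[svc["collector"]["output_queue"]], records)),
        threading.Thread(target=processor,
                         args=(queues[svc["processor"]["input_queue"]],
                               queues[svc["processor"]["output_queue"]])),
        threading.Thread(target=storage,
                         args=(queues[svc["storage"]["input_queue"]], store)),
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return store


stored = run_pipeline(["  ECG Report ", "Lab RESULT"])
```

Because each service only knows the names of its input and output queues, rewiring the pipeline is a configuration change rather than a code change.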
In some embodiments, the processing module comprises:
the verification module is used for verifying the quality of the medical data;
the marking module is used for marking the medical data after verification;
and an index creating module for creating an index for the labeled medical data.
In some embodiments, the verification module comprises:
an accuracy checking module for checking the accuracy of the medical data;
the de-duplication module is used for performing de-duplication treatment on the medical data through a neural network;
and the encryption module is used for encrypting the medical data after the duplication removal.
In some embodiments, the labeling module implementation includes:
inputting the verified medical data into a BERT neural network to obtain text vectors V;
randomly selecting a plurality of text vectors V as clustering center points a;
computing the distance between each of the other medical data and each clustering center point a, assigning each of the other medical data to the class of the nearest center point, and, after the assignment is complete, obtaining a clustering center point b for each class of text vectors V;
computing the distance between each of the other medical data and each clustering center point b, assigning each of the other medical data to the class of the nearest center point, obtaining a clustering center point c for each class of text vectors V after the assignment, and repeating these steps to obtain a plurality of classes of text;
labeling the text of each class with a center word;
classifying newly acquired medical data according to its similarity to the center words.
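The iterative assignment and re-centering described in this embodiment follows the general pattern of k-means clustering. The sketch below illustrates that pattern on small hand-written vectors. It rests on assumptions: the two-dimensional vectors stand in for BERT text vectors, Euclidean distance is assumed because the text does not name a distance measure, and assigning a new vector to its nearest center stands in for the similarity comparison with the center word.

```python
import math
import random


def distance(u, v):
    """Euclidean distance between two vectors (an assumed distance measure)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def mean(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def cluster(vectors, k, iterations=20, seed=0):
    """Assign vectors to the nearest center, recompute centers, and repeat."""
    rng = random.Random(seed)
    centres = rng.sample(vectors, k)  # randomly chosen vectors as center points a
    assignment = []
    for _ in range(iterations):
        assignment = [
            min(range(k), key=lambda c: distance(v, centres[c])) for v in vectors
        ]
        new_centres = []
        for c in range(k):
            members = [v for v, a in zip(vectors, assignment) if a == c]
            # keep the old center if a class happens to be empty
            new_centres.append(mean(members) if members else centres[c])
        if new_centres == centres:  # centers stopped moving
            break
        centres = new_centres
    return centres, assignment


def classify(vector, centres):
    """Assign a newly acquired vector to the class of its nearest center."""
    return min(range(len(centres)), key=lambda c: distance(vector, centres[c]))


vectors = [[0.0, 0.1], [0.1, 0.0], [0.2, 0.1], [5.0, 5.1], [5.1, 4.9], [4.9, 5.0]]
centres, assignment = cluster(vectors, k=2)
new_label = classify([5.2, 5.0], centres)
```

The class indices themselves are arbitrary; what matters is that the three vectors near the origin share one class, the three vectors near (5, 5) share the other, and the new vector joins the latter.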
In some embodiments, the labeling module implementation includes:
classifying existing medical data into a plurality of types;
training on the existing medical data with a bert+bilstm+cnn+attention+crf neural network until the accuracy is greater than a threshold;
classifying newly acquired medical data with the trained bert+bilstm+cnn+attention+crf neural network, so that the newly acquired medical data is assigned to the corresponding type.
In some embodiments, the storage module implementation includes:
acquiring proxy service and port where the attribute table is located;
the proxy service scans the initial keys of each attribute configuration in the attribute table, judges which attribute range the current medical data is in and stores the current medical data in the database;
and the corresponding relation between the attribute and the proxy service is stored in the database.
In some embodiments, managing the database includes:
reading the medical data and translating the medical data into an internal unified data format;
performing add, delete, and query operations on the acquisition sources of the medical data;
and after the query result is obtained from the database, carrying out data format conversion on the query result.
Since the principle by which the big data based data acquisition system solves the problem is similar to that of the data acquisition method, the implementation of the system may refer to the implementation of the method, and the details are not repeated here.
The embodiment of the application also provides a computer readable storage medium, wherein a computer program is stored on the computer readable storage medium, and the computer program executes the steps of the data acquisition method when being run by a processor.
Any process or method descriptions in flow charts or otherwise described herein may be understood as representing modules, segments, or portions of code which include one or more executable instructions for implementing specific logical functions or steps of the process, and further implementations are included within the scope of the preferred embodiment of the present application in which functions may be executed out of order from that shown or discussed, including substantially concurrently or in reverse order, depending on the functionality involved, as would be understood by those reasonably skilled in the art of the present application.
Logic and/or steps represented in the flowcharts or otherwise described herein may, for example, be considered an ordered listing of executable instructions for implementing logical functions, and can be embodied in any computer-readable medium, which can be any means that can contain, store, communicate, propagate, or transport the program for use by or in connection with an instruction execution system, apparatus, or device. More specific examples (a non-exhaustive list) of the computer-readable medium include the following: an electrical connection (electronic device) having one or more wires, a portable computer diskette (magnetic device), a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber device, and a portable compact disc read-only memory (CDROM). In addition, the computer-readable medium may even be paper or another suitable medium on which the program is printed, as the program may be electronically captured, for instance via optical scanning of the paper or other medium, then compiled, interpreted, or otherwise processed in a suitable manner if necessary, and then stored in a computer memory.
It is to be understood that portions of the present application may be implemented in hardware, software, firmware, or a combination thereof. In the above-described embodiments, the various steps or methods may be implemented in software or firmware stored in a memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, the steps or methods may be implemented using any one or a combination of the following techniques, which are well known in the art: discrete logic circuits having logic gates for implementing logic functions on data signals, application-specific integrated circuits having suitable combinational logic gates, programmable gate arrays (PGAs), field-programmable gate arrays (FPGAs), and the like.
In addition, each functional module in the embodiments of the present application may be integrated into one processing module, or each module may exist alone physically, or two or more modules may be integrated together. The integrated modules may be implemented in hardware or in software functional modules. The integrated modules may also be stored in a computer readable storage medium if implemented in the form of software functional modules and sold or used as a stand-alone product.
Finally, it should be noted that the above examples are only specific embodiments of the present application and are not intended to limit its scope. Although the present application is described in detail with reference to the above examples, those skilled in the art should understand that any person skilled in the art may still modify the technical solutions described in the foregoing embodiments, readily conceive of changes to them, or make equivalent substitutions of some of the technical features, within the technical scope disclosed in the present application; such modifications, changes or substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present application, and are intended to be included in the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (7)

1. The data acquisition method based on big data is characterized by comprising the following steps:
acquiring a plurality of medical data through an acquisition and scheduling center, wherein the acquisition and scheduling center comprises a plurality of different collectors, and the different collectors acquire unstructured medical data in corresponding acquisition channels;
summarizing the unstructured medical data;
processing the medical data;
performing local storage and/or cloud storage on the processed medical data;
processing the medical data, including:
verifying the quality of the medical data;
labeling the medical data after verification;
creating an index for the labeled medical data;
labeling the verified medical data in a first mode or a second mode,
the first mode includes:
inputting the checked medical data into a bert neural network to acquire a text vector V;
randomly selecting a plurality of text vectors V as a clustering center point a;
acquiring the distance between other medical data and each clustering center point a, classifying the other medical data into a text vector V closest to the other medical data, and obtaining a clustering center point b of a plurality of types of text vectors V after classification is completed;
obtaining the distance between other medical data and each clustering center point b, classifying the other medical data into a text vector V with the nearest distance, obtaining a clustering center point c of a plurality of types of text vectors V after classification, and repeating the steps to obtain a plurality of types of texts;
labeling the text of each category with a center word;
classifying the newly acquired medical data according to the similarity with the central word;
the second mode comprises the following steps:
classifying existing medical data into a plurality of types;
training the existing medical data through a bert+bilstm+cnn+attention+crf neural network until the accuracy is greater than a threshold;
the newly acquired medical data is classified by the trained bert+bilstm+cnn+attention+crf neural network, so that the newly acquired medical data belongs to the corresponding type.
2. The method for acquiring data based on big data according to claim 1, further comprising, before acquiring the medical data by a plurality of acquisition modes:
and carrying out basic configuration on the services corresponding to the yml type file, and transmitting medical data among the services in a queue mode.
3. The big data based data collection method of claim 1, wherein verifying the quality of the medical data comprises:
checking the accuracy of the medical data;
performing de-duplication processing on the medical data through a neural network;
encrypting the medical data after the duplication removal.
4. The big data based data collection method of claim 1, wherein the locally storing the processed medical data comprises:
acquiring proxy service and port where the attribute table is located;
the proxy service scans the initial keys of each attribute configuration in the attribute table, judges which attribute range the current medical data is in and stores the current medical data in the database;
and the corresponding relation between the attribute and the proxy service is stored in the database.
5. The big data based data collection method of claim 4, wherein managing the database comprises:
reading the medical data and translating the medical data into an internal unified data format;
performing adding, deleting and checking operation on the acquisition source of the medical data;
and after the query result is obtained from the database, carrying out data format conversion on the query result.
6. A big data based data acquisition system for implementing the data acquisition method of any one of claims 1-5, comprising:
the acquisition module acquires various medical data through an acquisition and scheduling center, wherein the acquisition and scheduling center comprises a plurality of different collectors, and the different collectors acquire unstructured medical data in corresponding acquisition channels;
a summarizing module summarizing the unstructured medical data;
the processing module is used for processing the medical data;
and the storage module is used for carrying out local storage and/or cloud storage on the processed medical data.
7. The big data based data acquisition system of claim 6, wherein the processing module comprises:
the verification module is used for verifying the quality of the medical data;
the marking module is used for marking the medical data after verification;
and an index creating module for creating an index for the labeled medical data.
CN202110552784.9A 2021-05-20 2021-05-20 Data acquisition method and system based on big data Active CN113380414B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110552784.9A CN113380414B (en) 2021-05-20 2021-05-20 Data acquisition method and system based on big data

Publications (2)

Publication Number Publication Date
CN113380414A CN113380414A (en) 2021-09-10
CN113380414B true CN113380414B (en) 2023-11-10

Family

ID=77571507

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110552784.9A Active CN113380414B (en) 2021-05-20 2021-05-20 Data acquisition method and system based on big data

Country Status (1)

Country Link
CN (1) CN113380414B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114564444A (en) * 2022-02-24 2022-05-31 朗森特科技有限公司 System for extracting, identifying and classifying files by using binary system

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108899078A (en) * 2018-06-27 2018-11-27 郑州云海信息技术有限公司 A kind of health and fitness information processing system based on cloud storage
CN108922632A (en) * 2018-05-03 2018-11-30 广东健凯医疗有限公司 A kind of data managing method and system
CN109785927A (en) * 2019-02-01 2019-05-21 上海众恒信息产业股份有限公司 Clinical document structuring processing method based on internet integration medical platform
CN110021439A (en) * 2019-03-07 2019-07-16 平安科技(深圳)有限公司 Medical data classification method, device and computer equipment based on machine learning
CN110415831A (en) * 2019-07-18 2019-11-05 天宜(天津)信息科技有限公司 A kind of medical treatment big data cloud service analysis platform
CN111259154A (en) * 2020-02-07 2020-06-09 腾讯科技(深圳)有限公司 Data processing method and device, computer equipment and storage medium
CN111401066A (en) * 2020-03-12 2020-07-10 腾讯科技(深圳)有限公司 Artificial intelligence-based word classification model training method, word processing method and device
CA3085033A1 (en) * 2019-07-30 2021-01-30 Imrsv Data Labs Inc. Methods and systems for multi-label classification of text data
CN112711581A (en) * 2020-12-30 2021-04-27 医渡云(北京)技术有限公司 Medical data verification method and device, electronic equipment and storage medium

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020247651A1 (en) * 2019-06-05 2020-12-10 The Ronin Project, Inc. Modeling for complex outcomes using clustering and machine learning algorithms

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
Word Embedding-based Text Processing for Comprehensive Summarization and Distinct Information Extraction;Wan, Xiangpeng 等;020 IEEE TECHNOLOGY & ENGINEERING MANAGEMENT CONFERENCE;1-5 *
基于LCN的医疗知识问答模型;马满福;刘元喆;李勇;王霞;贾海;史彦斌;张小康;;西南大学学报(自然科学版)(10);30-41 *
基于TextRank的医院信息智能处理方法研究;刘宇枝 等;粘接;第49卷(第9期);57-63 *
基于云计算的医疗大数据分析服务平台及应用示范;王兴维 等;中国知网;1-2 *
基于大数据处理的模糊聚类分析应用研究;李媛;中国知网硕士学位论文库(第4期);1-57 *

Also Published As

Publication number Publication date
CN113380414A (en) 2021-09-10

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20240725

Address after: No. 1012, 20th Floor, No.1 Huangpu Road, Ganjingzi District, Dalian City, Liaoning Province 116000

Patentee after: Xiaorui Medical Technology (Dalian) Co.,Ltd.

Country or region after: China

Address before: 116023 403-404A, 3, 5 East Road, software park, Dalian hi tech Industrial Park, Liaoning

Patentee before: SENYINT INTERNATIONAL DIGITAL MEDICAL SYSTEM (DALIAN) Co.,Ltd.

Country or region before: China
