Disclosure of Invention
Aiming at the problems in the prior art, the present application provides a big-data-based data acquisition method and system, which can process large volumes of diverse data, offer strong reliability and high security, and also handle duplicate records in the acquired data.
According to an embodiment of the first aspect of the present application, a big-data-based data acquisition method includes:
acquiring a plurality of medical data through an acquisition and scheduling center, wherein the acquisition and scheduling center comprises a plurality of different collectors, and the different collectors acquire unstructured medical data in corresponding acquisition channels;
summarizing the unstructured medical data;
processing the medical data;
and carrying out local storage and/or cloud storage on the processed medical data.
According to some embodiments of the application, before acquiring the medical data by the plurality of acquisition modes, the method further comprises:
performing basic configuration of each service through its corresponding yml file, and transferring medical data among the services in a queue mode.
According to some embodiments of the application, processing the medical data comprises:
verifying the quality of the medical data;
labeling the medical data after verification;
an index is created for the labeled medical data.
According to some embodiments of the application, verifying the quality of the medical data comprises:
checking the accuracy of the medical data;
performing de-duplication processing on the medical data through a neural network;
encrypting the medical data after the duplication removal.
According to some embodiments of the application, labeling the verified medical data includes:
inputting the checked medical data into a BERT neural network to acquire a text vector V;
randomly selecting a plurality of text vectors V as cluster center points a;
acquiring the distance between each remaining item of medical data and each cluster center point a, assigning each item to the category of its nearest center point, and obtaining cluster center points b of the resulting categories of text vectors V after assignment is completed;
obtaining the distance between each remaining item of medical data and each cluster center point b, assigning each item to the category of its nearest center point, obtaining cluster center points c of the resulting categories after assignment is completed, and repeating these steps to obtain a plurality of categories of texts;
labeling the text of each category with a center word;
the newly acquired medical data is classified according to similarity with the center word.
According to some embodiments of the application, labeling the verified medical data includes:
classifying existing medical data into a plurality of types;
training the existing medical data with a BERT+BiLSTM+CNN+Attention+CRF neural network until the accuracy is greater than a threshold;
the newly acquired medical data are classified by the trained BERT+BiLSTM+CNN+Attention+CRF neural network, so that the newly acquired medical data are assigned to the corresponding type.
According to some embodiments of the application, storing the processed medical data locally includes:
acquiring proxy service and port where the attribute table is located;
the proxy service scans the initial keys of each attribute configuration in the attribute table, judges which attribute range the current medical data is in and stores the current medical data in the database;
and the corresponding relation between the attribute and the proxy service is stored in the database.
According to some embodiments of the application, managing the database includes:
reading the medical data and translating the medical data into an internal unified data format;
performing add, delete and query operations on the acquisition sources of the medical data;
and after the query result is obtained from the database, carrying out data format conversion on the query result.
According to a second aspect of the present application, a data acquisition system based on big data includes:
the acquisition module acquires various medical data through an acquisition and scheduling center, wherein the acquisition and scheduling center comprises a plurality of different collectors, and the different collectors acquire unstructured medical data in corresponding acquisition channels;
a summarizing module summarizing the unstructured medical data;
the processing module is used for processing the medical data;
and the storage module is used for carrying out local storage and/or cloud storage on the processed medical data.
According to some embodiments of the application, the processing module comprises:
the verification module is used for verifying the quality of the medical data;
the labeling module is used for labeling the medical data after verification;
and an index creating module for creating an index for the labeled medical data.
Through the above technical solution, the following technical effects are obtained: the application integrates and stores various related medical data after collection, provides two dirty-data processing modes, can accurately filter, identify, acquire and display dirty data during processing, offers strong reliability and high security, and can also handle duplicate records in the medical data.
Detailed Description
Exemplary embodiments will now be described more fully with reference to the accompanying drawings. However, the exemplary embodiments may be embodied in many different forms and should not be construed as limited to the examples set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of the exemplary embodiments to those skilled in the art.
The current three-level comprehensive hospital medical quality management and control index framework comprises 7 major indexes, 44 quality evaluation indexes, 730 single indexes, 2610 composite indexes and 400 monitoring data items, where the index classification includes hospitalization death indexes, readmission indexes, hospital infection indexes, operation complication indexes, patient safety indexes, medical institution rational medication indexes and hospital operation management indexes. The management system is huge, and monitoring it is correspondingly difficult. Moreover, the database server of each business system runs its own DBMS; questions such as whether manual data exist, how large the manual data volume is, and whether unstructured data exist are all technical barriers to medical data interoperability. The embodiments of the present application therefore provide a big-data-based data acquisition method and system. The data acquisition method may be performed on a server, a computer or a similar computing device. Taking a computer as an example, fig. 1 is a block diagram of the hardware structure of a data acquisition computer according to an embodiment of the present application. As shown in fig. 1, the computer 10 may include one or more processors 102 (only one is shown in fig. 1; the processor 102 may include, but is not limited to, a microprocessor MCU or a processing device such as a programmable logic device FPGA) and a memory 104 for storing data, and optionally a transmission device 106 for communication functions and an input-output device 108. It will be appreciated by those of ordinary skill in the art that the configuration shown in fig. 1 is merely illustrative and is not intended to limit the configuration of the computer described above. For example, computer 10 may also include more or fewer components than shown in fig. 1, or have a different configuration than shown in fig. 1.
The memory 104 may be used to store a computer program, for example, a software program of application software and a module, such as a computer program corresponding to a data acquisition method in an embodiment of the present application, and the processor 102 executes the computer program stored in the memory 104 to perform various functional applications and data processing, that is, implement the above-mentioned method. Memory 104 may include high-speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory.
The transmission device 106 is used to receive or transmit data via a network. Specific examples of the network described above may include a wireless network provided by a communications provider of computer 10. In one example, the transmission device 106 includes a network adapter (Network Interface Controller, simply referred to as NIC) that can connect to other network devices through a base station to communicate with the internet. In one example, the transmission device 106 may be a Radio Frequency (RF) module, which is configured to communicate with the internet wirelessly.
As shown in fig. 2, in some embodiments, the big data based data acquisition method includes:
S1, acquiring various medical data through an acquisition and scheduling center, wherein the acquisition and scheduling center comprises a plurality of different collectors, and the different collectors acquire unstructured medical data in corresponding acquisition channels;
specifically, medical data may be acquired in corresponding acquisition channels using different acquisition modes. The task scheduling center can manage different data (logs, data in a database, etc.) acquisition tasks. The acquisition mode can be as follows:
(1) Various publicly available data sets exist on the network, and data sets in the medical field can be obtained simply by finding the corresponding website and obtaining the download link. These data sets can help the medical system enrich its internal information, and the configured collector can crawl and organize medical data through methods such as web crawlers and rule matching.
(2) For the logging service of the medical system, a related log collection scheme may be employed. Several log collection tools are relatively common, such as Logstash, Filebeat, Flume, Fluentd, Logagent, rsyslog and syslog-ng. The configured collector reads log text information to collect the medical data.
(3) Corresponding medical data are acquired through social surveys, which can enrich the data content of the medical system. The configured collector acquires the medical data from the survey results.
(4) Medical systems have daily operation and business department modules, whose relevant data are recorded in certain files or systems, such as common medical system databases. A large amount of medical data is stored in these databases, and the configured collector acquires the data in different ways depending on the kind of database.
(5) A medical sensor is a detection device that can sense measured information and convert it, according to a certain rule, into an electrical signal or another required output form. Medical data are acquired through the medical sensor and uploaded, and the configured collector collects them.
S2, summarizing the unstructured medical data;
specifically, medical data are collected through a unified micro-service interface. The collector acquires the data, then sends the data to the micro-service through a Restful interface, and then the data is put in a distributed cache platform of redis for temporary storage.
The unstructured medical data acquired through different channels are summarized and then uniformly handed over to the data quality check for processing. Because the application adopts different collectors, it can process many kinds of data, solving the problem that traditional systems cannot handle diverse data well.
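The following is a minimal sketch of how a collector might push acquired records to the summarizing micro-service and how that service might stage them in Redis for the quality check; the endpoint URL, key names and record fields are illustrative assumptions, not the application's actual interfaces.

```python
import json
import requests          # collector side: send data over a RESTful interface
import redis             # micro-service side: temporary storage in Redis

# Collector side: post one unstructured medical record to the summarizing service.
# The URL and payload fields below are hypothetical placeholders.
record = {"channel": "log", "content": "patient admission note ...", "collected_at": "2023-01-01T08:00:00"}
resp = requests.post("http://collector-gateway/api/v1/medical-data", json=record, timeout=5)
resp.raise_for_status()

# Micro-service side: stage the received record in a Redis list until quality checking.
cache = redis.Redis(host="localhost", port=6379, db=0)
cache.rpush("medical_data:pending", json.dumps(record))

# The quality-check service later drains the list.
raw = cache.lpop("medical_data:pending")
if raw is not None:
    pending_record = json.loads(raw)
```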
S3, processing the medical data;
specifically, after the medical data is acquired, the quality of the medical data is firstly checked, which includes checking the accuracy of various relevant medical information, and performing duplication elimination, encryption and the like on the medical data. And after verification, marking the data subjected to duplication removal, and monitoring the information source.
S4, carrying out local storage and/or cloud storage on the processed medical data.
Specifically, storing medical data only locally carries the risk that some or all of the data may be lost, for example if local equipment is damaged, so the medical data can also be backed up in the cloud. The same data can be sent to the cloud for storage while the acquired medical data are stored locally, or the updated portion of the locally stored medical data can be compressed at intervals (e.g., half a month or one month) and then backed up to the cloud. This ensures the security of the data.
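A minimal sketch of the interval-based backup idea: the locally updated portion is compressed and handed to a cloud upload routine. The directory layout, file names and the upload function are assumptions for illustration only.

```python
import tarfile
from datetime import datetime
from pathlib import Path

def backup_updates(update_dir: str, archive_dir: str) -> Path:
    """Compress the locally updated medical data and return the archive path."""
    stamp = datetime.now().strftime("%Y%m%d")
    archive = Path(archive_dir) / f"medical_update_{stamp}.tar.gz"
    with tarfile.open(archive, "w:gz") as tar:
        tar.add(update_dir, arcname="update")
    return archive

def upload_to_cloud(archive: Path) -> None:
    # Placeholder: in practice this would call the chosen cloud provider's SDK.
    print(f"uploading {archive} to cloud storage ...")

# Run e.g. every half month or month via a scheduler (cron, APScheduler, etc.).
upload_to_cloud(backup_updates("/data/medical/updates", "/data/medical/archives"))
```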
The application can extract medical data from distributed, heterogeneous data sources in a hospital information system (HIS), an electronic medical record system (EMR), a picture archiving and communication system (PACS), a laboratory information system (LIS), a pathology system (PS) and other hospital informatization systems into a temporary middle layer for cleaning, conversion and integration, and finally load the medical data into a database to form the basis for online analysis and mining of medical data, while enabling the storage and application of both structured and unstructured data.
In some embodiments, before the medical data are acquired through the plurality of acquisition modes, the method further comprises:
performing basic configuration of each service through its corresponding yml file, and transferring medical data among the services in a queue mode.
Specifically, all configuration files in the data collection process are stored in Nacos, which holds a number of yml files, namely the configuration files of each service, for example: the collector service, data quality verification service, user center service, gateway service, data management service, and so on. First, the ports, database addresses, startup modes and the like of each service are configured in its yml file. Each service automatically obtains the corresponding configuration from Nacos at startup.
It should be noted that the medical data are transferred between services using a middleware tool, that is, in a queue mode, which absorbs concurrency and reduces the load on the server.
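As a rough illustration of the basic configuration and the queue-based hand-off just described, the sketch below loads a hypothetical per-service yml file and passes medical data between services through a queue; the file name, configuration keys and queue name are assumptions, and in the described system the configuration would be fetched from Nacos rather than a local file.

```python
import json
import yaml    # pip install pyyaml
import redis

# Basic configuration of a service from its yml file (normally served by Nacos).
# Example contents of collector-service.yml (hypothetical):
#   server:
#     port: 8081
#   database:
#     url: jdbc:mysql://db-host:3306/medical
#   startup:
#     mode: standalone
with open("collector-service.yml", "r", encoding="utf-8") as f:
    config = yaml.safe_load(f)
port = config["server"]["port"]

# Queue-style hand-off between services to absorb concurrency spikes.
queue = redis.Redis(host="localhost", port=6379)
queue.rpush("medical_data_queue", json.dumps({"source": "collector", "payload": "..."}))
item = queue.blpop("medical_data_queue", timeout=1)   # consuming service waits for the next record
```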
As shown in fig. 3, in some embodiments, performing quality verification and processing on the medical data includes:
S31, checking the quality of the medical data;
specifically, the quality check includes checking the accuracy of medical data through a comparison principle, performing duplication elimination on the medical data through a neural network, and performing triple des algorithm encryption on the duplicated data.
S32, marking the medical data after verification;
specifically, the method can be realized in two ways, wherein the first way is to cluster acquired medical data; the second way is to use a neural network for classification. And after classification is finished, the data source of the data is also stored as an attribute into the whole information of the data.
S33, creating an index for the labeled medical data.
Specifically, each piece of medical data is summarized to obtain a title for the data. There are two ways to obtain the title: one is to directly take the first 10-20 characters of the data as its title; the other is to obtain a summary of the data through the encoder and decoder of a Seq2Seq architecture. Creating an index in Elasticsearch for important attributes such as the summary and the generation time of the data makes it easy for users to query the data quickly.
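A minimal sketch of creating such an index and indexing one labeled record, assuming a recent (8.x) Elasticsearch Python client; the index name and field names are illustrative assumptions.

```python
from elasticsearch import Elasticsearch   # pip install elasticsearch

es = Elasticsearch("http://localhost:9200")

# Create an index for the attributes that users query most often.
es.indices.create(
    index="medical_data",
    mappings={
        "properties": {
            "title":        {"type": "text"},      # first 10-20 characters or Seq2Seq summary
            "summary":      {"type": "text"},
            "category":     {"type": "keyword"},   # label produced in step S32
            "generated_at": {"type": "date"},
        }
    },
)

# Index one labeled record so it can be retrieved quickly.
es.index(index="medical_data", document={
    "title": "Admission note, cardiology ...",
    "summary": "Patient admitted with ...",
    "category": "inpatient_record",
    "generated_at": "2023-01-01T08:00:00",
})
```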
As shown in fig. 4, in some embodiments, verifying the quality of the medical data includes:
S311, checking the accuracy of the medical data;
specifically, the accuracy check of the medical data can be achieved in various ways, for example, the first way is to compare the MD5 code with the MD5 code carried by the data and the medical data sent by the middleware, if the two codes are the same, it is indicated that the transmitted data has no problem; secondly, similarity comparison is carried out through a plurality of data sources of the data, and if the difference is large, the data has a problem; and thirdly, determining whether a large error exists before and after data transmission, if the average value of the same index is greatly different and does not accord with logic, indicating that the transmission process is problematic, and acquiring inaccurate data.
S312, performing de-duplication processing on the medical data through a neural network;
specifically, the neural network is used for de-duplicating the medical data text, and the acquired complete data after inspection is compared with corresponding module information in the existing system. For example, the existing illness state information of a person is obtained and compared with the illness state information of the person in the existing system. The judging mode can use the bert in the neural network to acquire the corresponding sentence vector and then calculate the similarity. A similarity greater than 90% defines text as substantially identical, a similarity greater than 80% as substantially identical, and a similarity less than 50% as different. Filtering out the same medical data and storing different medical data.
S313, encrypting the medical data after the duplication removal.
Specifically, the triple DES algorithm transforms a 64-bit plaintext input block into a 64-bit ciphertext output block; of each 64-bit key, 8 bits are parity bits and the other 56 bits form the effective key.
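A short sketch of triple DES encryption with the pycryptodome library; the key, IV and cipher mode are illustrative, and in practice the key would be managed securely rather than generated in place.

```python
from Crypto.Cipher import DES3                 # pip install pycryptodome
from Crypto.Random import get_random_bytes
from Crypto.Util.Padding import pad, unpad

key = DES3.adjust_key_parity(get_random_bytes(24))   # 3 x 64-bit keys, 56 effective bits each
iv = get_random_bytes(8)                              # triple DES works on 64-bit (8-byte) blocks

plaintext = b"patient_id=P001;diagnosis=..."
cipher = DES3.new(key, DES3.MODE_CBC, iv)
ciphertext = cipher.encrypt(pad(plaintext, DES3.block_size))

# Decryption with the same key and IV recovers the original medical data.
decrypted = unpad(DES3.new(key, DES3.MODE_CBC, iv).decrypt(ciphertext), DES3.block_size)
assert decrypted == plaintext
```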
In some embodiments, labeling the verified medical data includes:
inputting the checked medical data into a BERT neural network to acquire a text vector V;
randomly selecting a plurality of (e.g., 10) text vectors V as cluster center points a;
acquiring the distance between each remaining item of medical data and each cluster center point a (the closeness of meaning between texts is judged through similarity calculation), assigning each item to the category of its nearest center point, and obtaining cluster center points b for the resulting categories (for example, 10 categories) of text vectors V after assignment is completed;
obtaining the distance between each remaining item of medical data and each cluster center point b (again judged through similarity calculation), assigning each item to the category of its nearest center point, obtaining cluster center points c for the resulting categories (for example, 10 categories) after assignment is completed, and repeating these steps N times to obtain and store a plurality of categories (for example, 10 categories) of texts, as sketched in the example after this list;
labeling the text of each category with a center word;
the newly acquired medical data is classified according to similarity with the center word.
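A minimal sketch of the iterative clustering described above (BERT text vectors, randomly chosen centers, repeated reassignment and center updates); the number of clusters, iteration count, distance measure and the stand-in vectors are assumptions.

```python
import numpy as np

def cluster_text_vectors(vectors: np.ndarray, k: int = 10, n_iter: int = 20, seed: int = 0):
    """Group BERT text vectors V into k categories by iteratively updating cluster centers."""
    rng = np.random.default_rng(seed)
    # Randomly select several text vectors V as the initial cluster center points (a).
    centers = vectors[rng.choice(len(vectors), size=k, replace=False)]
    for _ in range(n_iter):
        # Distance here is cosine similarity, i.e. closeness of meaning between texts.
        normed_v = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
        normed_c = centers / np.linalg.norm(centers, axis=1, keepdims=True)
        labels = (normed_v @ normed_c.T).argmax(axis=1)       # assign each text to its nearest center
        # The mean of each category becomes the next center point (b, then c, ...).
        centers = np.stack([
            vectors[labels == i].mean(axis=0) if np.any(labels == i) else centers[i]
            for i in range(k)
        ])
    return labels, centers

# In the described method these vectors would come from a BERT encoder applied to the checked texts.
vectors = np.random.rand(200, 768)          # stand-in for 200 text vectors V
labels, centers = cluster_text_vectors(vectors)
```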
In some embodiments, labeling the verified medical data includes:
classifying existing medical data into a plurality of types;
in particular, the plurality of types may be 10 types, the number being determined according to the amount of existing medical data.
training the existing medical data with a BERT+BiLSTM+CNN+Attention+CRF neural network until the accuracy is greater than a threshold;
Specifically, the graphics card memory of the training device is larger than 10 GB, and the training accuracy is greater than 90%.
The newly acquired medical data are classified by the trained BERT+BiLSTM+CNN+Attention+CRF neural network, so that the newly acquired medical data are assigned to the corresponding type.
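A rough PyTorch sketch of the layered classifier named above (BERT encoding, BiLSTM, CNN, attention pooling and a classification head); the layer sizes, pretrained model name and example input are assumptions, and the CRF layer, which the description lists as part of the network, is omitted here for brevity since it is typically used for sequence labeling rather than whole-record classification.

```python
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizer   # pip install transformers

class MedicalTextClassifier(nn.Module):
    """Sketch of a BERT + BiLSTM + CNN + Attention classifier for typing medical records."""

    def __init__(self, num_types: int = 10, hidden: int = 128):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-chinese")   # assumed pretrained model
        self.bilstm = nn.LSTM(self.bert.config.hidden_size, hidden,
                              batch_first=True, bidirectional=True)
        self.cnn = nn.Conv1d(2 * hidden, hidden, kernel_size=3, padding=1)
        self.attn = nn.Linear(hidden, 1)                              # simple attention pooling
        self.classifier = nn.Linear(hidden, num_types)

    def forward(self, input_ids, attention_mask):
        emb = self.bert(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state
        seq, _ = self.bilstm(emb)                                      # (batch, len, 2*hidden)
        feats = self.cnn(seq.transpose(1, 2)).transpose(1, 2)          # (batch, len, hidden)
        weights = torch.softmax(self.attn(feats).squeeze(-1), dim=-1)  # attention over positions
        pooled = (feats * weights.unsqueeze(-1)).sum(dim=1)
        return self.classifier(pooled)                                 # logits over medical data types

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
model = MedicalTextClassifier()
batch = tokenizer(["患者因胸痛入院..."], return_tensors="pt", padding=True, truncation=True)
logits = model(batch["input_ids"], batch["attention_mask"])
predicted_type = logits.argmax(dim=-1)   # newly acquired data falls into the corresponding type
```

Training would proceed with an ordinary classification loss until the accuracy exceeds the chosen threshold (greater than 90% in the embodiment above).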
It should be noted that in this embodiment two or more processing modes are provided for each kind of dirty data. If the processing mode applied to some data fails, the failure signal 0 that it returns can be automatically recognized and another processing mode is immediately started to process the data, which ensures the stability of data processing. For example, when rule-based classification of the medical data fails, the neural network model is immediately used to classify the medical data and obtain an accurate classification result.
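A small sketch of this fallback mechanism: when the first processing mode reports failure (a return signal of 0), the second mode is started automatically. The rule table, function names and stand-in model are illustrative placeholders.

```python
rule_table = {"手术": "operation_record", "化验": "lab_report"}    # assumed keyword-to-type rules

def classify_by_rules(record: dict):
    """Rule-based mode; returns (signal, label) with signal 0 when no rule matches."""
    label = rule_table.get(record.get("keyword"))
    return (1, label) if label else (0, None)

def classify_by_model(record: dict) -> str:
    """Stand-in for the trained neural-network classifier used as the fallback mode."""
    return "general_medical_record"

def classify(record: dict) -> str:
    signal, label = classify_by_rules(record)
    if signal == 0:                     # failure signal 0 -> immediately start the other mode
        label = classify_by_model(record)
    return label

print(classify({"keyword": "随访", "text": "..."}))   # no rule matches, falls back to the model
```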
As shown in fig. 5, in some embodiments, locally storing the processed medical data includes:
S41, acquiring the proxy service and port where the attribute table is located;
specifically, after acquiring medical data containing complete attributes. And (3) connecting the zookeeper through the client, and finding the proxy service and the port where the attribute table is located from the node of the zookeeper.
S42, the proxy service scans the initial keys of each attribute configuration in the attribute table, judges which attribute range the current medical data is in and then stores the current medical data in the database;
S43, storing the corresponding relation between the attribute and the proxy service in the database.
Specifically, the client directly requests the corresponding proxy service; after receiving the request from the client, the proxy service writes the medical data into the attributes.
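A rough sketch, using the kazoo ZooKeeper client, of looking up the proxy service and port for the attribute table and then handing a record to that proxy; the znode path, the data stored under it and the proxy's HTTP endpoint are assumptions about the deployment.

```python
import json
import requests
from kazoo.client import KazooClient   # pip install kazoo

zk = KazooClient(hosts="127.0.0.1:2181")
zk.start()

# Hypothetical znode that records where the attribute table's proxy service listens.
data, _stat = zk.get("/medical/attribute_table/proxy")
proxy = json.loads(data.decode("utf-8"))          # e.g. {"host": "10.0.0.5", "port": 8090}
zk.stop()

# The client then requests the corresponding proxy service directly; the proxy scans the
# start keys configured for each attribute, decides which attribute range the record falls
# into, and writes it to the database.
record = {"patient_id": "P001", "attribute": "diagnosis", "value": "..."}
requests.post(f"http://{proxy['host']}:{proxy['port']}/store", json=record, timeout=5)
```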
In some embodiments, managing the database includes:
reading the medical data and translating the medical data into an internal unified data format;
in particular, this step allows the resources in the database to be properly managed and enables control over the data;
performing add, delete and query operations on the acquisition sources of the medical data;
specifically, websites are monitored in real time according to the information source status, rule status and the like; for keyword searching and collection, acquisition sources can be added or deleted and collection can be started or stopped in real time; and the acquisition strategy is adjusted in real time according to the actual collection situation, for example by adding or removing collectors;
and after the query result is obtained from the database, carrying out data format conversion on the query result.
Specifically, a user's data request (a high-level instruction) is converted into low-level machine instructions to perform the database query and obtain the query result; the query result is then processed (its format converted) and returned to the user.
The embodiment also discloses a data acquisition system based on big data, comprising:
the acquisition module acquires various medical data through an acquisition and scheduling center, wherein the acquisition and scheduling center comprises a plurality of different collectors, and the different collectors acquire unstructured medical data in corresponding acquisition channels;
the summarizing module is used for summarizing the unstructured medical data;
a processing module for processing the medical data;
and the storage module is used for carrying out local storage and/or cloud storage on the processed medical data.
The system treats the storage of a single batch of medical data as a Job; after receiving a Job, it starts a process to complete the storage. The Job module of the system is the central management node of a single Job and takes on functions such as data cleaning, subtask splitting (converting a single Job into several subtasks) and Task-group management. After a system Job is started, it is split into several small Tasks according to different source-splitting strategies so that they can execute concurrently. A Task is the smallest unit of system operation, and each Task is responsible for storing part of the data. After splitting into multiple Tasks, the system Job calls the Scheduler module, which reassembles the split Tasks into Task groups according to the configured concurrent data volume. Each Task group is responsible for finishing all the Tasks allocated to it with a certain degree of concurrency; by default a single Task group runs 10 Tasks concurrently. Each Task is started by its Task group, and once started it runs a Reader→Channel→Writer thread to complete the data storage work. After the system Job is running, it monitors and waits for the Task-group modules to complete their Tasks, and the Job exits successfully after all Task groups have finished.
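A condensed sketch of the Job / Task / Task-group flow described above: a Job is split into Tasks, the Tasks are run in groups with a default concurrency of 10, and each Task runs a Reader→Channel→Writer pipeline. The names, slice size and splitting strategy are illustrative assumptions.

```python
from concurrent.futures import ThreadPoolExecutor
from queue import Queue

def run_task(source_slice):
    """One Task: a Reader -> Channel -> Writer pipeline for its slice of the data."""
    channel: Queue = Queue()
    for record in source_slice:        # Reader pushes records into the Channel
        channel.put(record)
    stored = []
    while not channel.empty():         # Writer drains the Channel into storage
        stored.append(channel.get())
    return len(stored)

def run_job(source, slice_size: int = 100, concurrency: int = 10):
    """The Job splits itself into Tasks, runs them in a group, and waits for all to finish."""
    tasks = [source[i:i + slice_size] for i in range(0, len(source), slice_size)]   # subtask split
    with ThreadPoolExecutor(max_workers=concurrency) as task_group:                 # default concurrency 10
        written = sum(task_group.map(run_task, tasks))
    return written     # the Job exits successfully once all Tasks in all groups have completed

print(run_job([{"id": i} for i in range(1000)]))
```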
In some embodiments, the big data based data acquisition system further comprises:
and the configuration module is used for carrying out basic configuration on the services corresponding to the yml type file, and the medical data is transferred between the services in a queue mode.
In some embodiments, the processing module comprises:
the verification module is used for verifying the quality of the medical data;
the labeling module is used for labeling the medical data after verification;
and an index creating module for creating an index for the labeled medical data.
In some embodiments, the verification module comprises:
an accuracy checking module for checking the accuracy of the medical data;
the de-duplication module is used for performing de-duplication treatment on the medical data through a neural network;
and the encryption module is used for encrypting the medical data after the duplication removal.
In some embodiments, the labeling module implementation includes:
inputting the checked medical data into a BERT neural network to acquire a text vector V;
randomly selecting a plurality of text vectors V as a clustering center point a;
acquiring the distance between each remaining item of medical data and each cluster center point a, assigning each item to the category of its nearest center point, and obtaining cluster center points b of the resulting categories of text vectors V after assignment is completed;
obtaining the distance between each remaining item of medical data and each cluster center point b, assigning each item to the category of its nearest center point, obtaining cluster center points c of the resulting categories after assignment is completed, and repeating these steps to obtain a plurality of categories of texts;
labeling the text of each category with a center word;
the newly acquired medical data is classified according to similarity with the center word.
In some embodiments, the labeling module implementation includes:
classifying existing medical data into a plurality of types;
training the existing medical data with a BERT+BiLSTM+CNN+Attention+CRF neural network until the accuracy is greater than a threshold;
the newly acquired medical data are classified by the trained BERT+BiLSTM+CNN+Attention+CRF neural network, so that the newly acquired medical data are assigned to the corresponding type.
In some embodiments, the memory module implementation includes:
acquiring proxy service and port where the attribute table is located;
the proxy service scans the initial keys of each attribute configuration in the attribute table, judges which attribute range the current medical data is in and stores the current medical data in the database;
and the corresponding relation between the attribute and the proxy service is stored in the database.
In some embodiments, managing the database includes:
reading the medical data and translating the medical data into an internal unified data format;
performing adding, deleting and checking operation on the acquisition source of the medical data;
and after the query result is obtained from the database, carrying out data format conversion on the query result.
Since the principle of the big data based data acquisition system for solving the problem is similar to that of the data acquisition method, the implementation of the big data based data acquisition system can be referred to the implementation of the method, and the description is omitted here.
The embodiments of the application also provide a computer-readable storage medium on which a computer program is stored; when the computer program is run by a processor, it executes the steps of the data acquisition method.
Any process or method descriptions in flow charts or otherwise described herein may be understood as representing modules, segments, or portions of code which include one or more executable instructions for implementing specific logical functions or steps of the process, and further implementations are included within the scope of the preferred embodiment of the present application in which functions may be executed out of order from that shown or discussed, including substantially concurrently or in reverse order, depending on the functionality involved, as would be understood by those reasonably skilled in the art of the present application.
Logic and/or steps represented in the flowcharts or otherwise described herein, for example, may be considered an ordered listing of executable instructions for implementing logical functions, and can be embodied in any computer-readable medium, which can be any means that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device. More specific examples (a non-exhaustive list) of the computer-readable medium include the following: an electrical connection (electronic device) having one or more wires, a portable computer diskette (magnetic device), a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber device, and a portable compact disc read-only memory (CD-ROM). In addition, the computer-readable medium may even be paper or another suitable medium on which the program is printed, as the program may be electronically captured, for instance via optical scanning of the paper or other medium, then compiled, interpreted or otherwise processed in a suitable manner if necessary, and then stored in a computer memory.
It is to be understood that portions of the present application may be implemented in hardware, software, firmware, or a combination thereof. In the above-described embodiments, the various steps or methods may be implemented in software or firmware stored in a memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, they may be implemented using any one or a combination of the following techniques well known in the art: discrete logic circuits having logic gates for implementing logic functions on data signals, application-specific integrated circuits having suitable combinational logic gates, programmable gate arrays (PGAs), field-programmable gate arrays (FPGAs), and the like.
In addition, each functional module in the embodiments of the present application may be integrated into one processing module, or each module may exist alone physically, or two or more modules may be integrated together. The integrated modules may be implemented in hardware or in software functional modules. The integrated modules may also be stored in a computer readable storage medium if implemented in the form of software functional modules and sold or used as a stand-alone product.
Finally, it should be noted that the above examples are only specific embodiments of the present application and are not intended to limit its protection scope. Although the present application has been described in detail with reference to the above examples, it should be understood by those skilled in the art that any person skilled in the art may still modify the technical solutions described in the foregoing embodiments, or make equivalent substitutions of some of their technical features, within the technical scope of the present disclosure; such modifications, changes or substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present application and are intended to be included in the protection scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.