CN111078976A

CN111078976A - Medical system crawler-based data extraction method

Info

Publication number: CN111078976A
Application number: CN201911104769.7A
Authority: CN
Inventors: 马磊; 蒋卫丽; 陈振华; 王雄彬; 陈昊昱; 龙晨
Original assignee: Kunming University of Science and Technology; Second Affiliated Hospital of Kunming Medical University
Current assignee: Kunming University of Science and Technology; Second Affiliated Hospital of Kunming Medical University
Priority date: 2019-11-08
Filing date: 2019-11-08
Publication date: 2020-04-28

Abstract

The invention relates to a data extraction method based on a medical system crawler, and belongs to the technical field of medical image character recognition. First, a URL in the medical system is initialized. The URL queue is then parsed: HTML data is parsed with regular expressions and JSON data with a json module. Next, each medical data URL is requested over HTTP, and the target medical data is matched and crawled by patient visit ID and medical order ID. The crawled data is stored in a medical database. The crawled patient data is checked, and PDF documents are passed to the Baidu character recognition API for character recognition. The PDF document corpus returned by the Baidu character recognition API then undergoes word segmentation, text denoising, and key information extraction, and the key information is stored in the medical database. The invention solves the problem that medical data is difficult, time-consuming, and tedious to extract.

Description

Medical system crawler-based data extraction method

Technical Field

The invention relates to a data extraction method based on a medical system crawler, and belongs to the technical field of medical image character recognition.

Background

With the development of the medical and health industry in China, domestic hospitals have successively deployed systems such as the hospital information system, the PACS (picture archiving and communication system), and the LIS (laboratory information system). As these information systems have come into use, a long-neglected problem has gradually surfaced: data extraction. Data extraction has now become a bottleneck and weak point that limits the performance of these information systems, and its importance has drawn growing attention.

Data mining is the non-trivial process of extracting implicit, potentially valuable, and ultimately understandable patterns from a database, and it is a key step in knowledge discovery. A medical database is rich in information and may contain patients' medical images, related pathological parameters, test and measurement results, diagnosis records, and basic parameters (age, sex, medical history, time of hospital admission, etc.). Medical data is generally stored in a medical system that provides no corresponding interface for extraction, so organizing the data is complex and tedious, has to be done by hand, and consumes a large amount of manpower and material resources.

With the development of the internet, however, users can obtain the knowledge they need from the vast amount of network information by suitable means. Different users need different knowledge, which greatly increases the difficulty of acquiring target information, and the Web crawler was proposed in response. A Web crawler is highly specialized and can query many Web pages efficiently. It starts from a single Web page and reaches other pages mainly by following hyperlinks; repeating this operation allows all pages to be retrieved and scanned and the required information to be acquired. A crawler program fetches Web pages automatically, and its implementation strategy and operating efficiency have a marked influence on search results: if the chosen crawler is well designed and efficient, the retrieved information is timely and accurate.

The earliest crawler was the Google crawler, whose design allowed each crawler component to complete a different process. Search engines such as Baidu and Sohu have also begun to research crawler programs, but the crawler technology of these engines is kept secret. A crawler can be built by effectively combining computer algorithms with manual assistance for a given website, and it can obtain more complete relevant information, which is urgently needed for building a medical information base. Medical systems are updated quickly; constructing a medical system interface may take a long time, and the interface is not necessarily suitable for all medical departments, while manually organizing and collecting medical data is very complicated and labor-intensive.

Disclosure of Invention

The invention provides a data extraction method based on a medical system crawler, which solves the problem that medical data is difficult, time-consuming, and tedious to extract.

The technical scheme of the invention is as follows: a data extraction method based on a medical system crawler, in which a URL in the medical system is first initialized; the URL queue is parsed, with HTML data parsed by regular expressions and JSON data parsed by a json module; each medical data URL is then requested over HTTP, and the target medical data is matched and crawled by patient visit ID and medical order ID; the data crawled by the crawler is stored in a medical database; the crawled patient data is checked to determine whether it is a PDF document; if it is, character recognition is performed with the Baidu character recognition API (application program interface), which converts the picture data into character data; if not, the crawled data is stored in the medical database; and the PDF document corpus processed by the Baidu character recognition API undergoes word segmentation, text denoising, and key information extraction, with the key information stored in the medical database.

Further, the method for extracting data based on the medical system crawler comprises the following specific steps:

Step 1: Initializing the URL: a request is sent to the target medical data website to be crawled using an HTTP (Hypertext Transfer Protocol) library; if the server responds, a Response for the hospital webpage is obtained, which contains the HTML (HyperText Markup Language) data of the hospital webpage and its JSON data, JSON being a lightweight data-interchange format;

Step 2: Parsing the URL queue: the HTML data is parsed with regular expressions, and the JSON data is then parsed with a json module;

Step 3: Crawling patient data: each medical data URL is requested over the HTTP protocol, and the target medical data is matched and crawled by patient visit ID and medical order ID; the data crawled by the crawler is stored in a medical database;

Step 4: PDF document character recognition: whether the patient data crawled in Step 3 is a PDF document is determined; if so, character recognition is performed with the Baidu character recognition API, which converts the picture data into character data; if not, the crawled data is stored in the medical database. The Baidu character recognition API is a platform that recognizes text in a variety of general scenes and documents and returns the results line by line;

Step 5: Word segmentation: the PDF document corpus processed by the Baidu character recognition API is segmented with the jieba word segmentation algorithm;

Step 6: Text denoising: after word segmentation, the PDF document corpus processed by the Baidu character recognition API still contains many symbols, punctuation marks, and stop words, which degrade the quality of the medical data and hinder keyword extraction from medical reports, so this irrelevant text is removed; a Chinese stop word list, stopwords.txt, is established, each word in the text is traversed, and any word that appears in the stop word list is deleted;

Step 7: Key information extraction: text denoising alone does not yield the key information corresponding to the keywords, so the denoised PDF document corpus is processed with regular expressions, the key information under each input keyword is extracted, and the key information is stored in the medical database.
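The denoising described in Step 6 can be illustrated with a minimal Python sketch. It is not the patent's implementation: the function names `load_stopwords` and `denoise`, the punctuation set, and the sample tokens are assumptions made for illustration. The tokens are written out by hand so the sketch runs without third-party packages; in practice they would come from the jieba segmenter named in Step 5.

```python
import string

# Punctuation treated as noise: ASCII punctuation plus common full-width marks.
# The exact set is an assumption; the source only says symbols and punctuation are removed.
PUNCTUATION = set(string.punctuation) | set(",。、;:?!()“”‘’《》【】")


def load_stopwords(path="stopwords.txt", encoding="utf-8"):
    """Read one stop word per line from the stop word list file."""
    with open(path, encoding=encoding) as fh:
        return {line.strip() for line in fh if line.strip()}


def denoise(tokens, stopwords):
    """Drop empty tokens, pure-punctuation tokens, and stop words, keeping order."""
    cleaned = []
    for tok in tokens:
        tok = tok.strip()
        if not tok:
            continue  # whitespace-only token
        if all(ch in PUNCTUATION for ch in tok):
            continue  # symbols and punctuation
        if tok in stopwords:
            continue  # word listed in the stop word list
        cleaned.append(tok)
    return cleaned


# Hand-written tokens standing in for segmenter output
# (roughly: "tumor", "size", "is", "3.2", "cm", ",", "visible", "nodule", ".").
tokens = ["肿瘤", "大小", "为", "3.2", "cm", ",", "可见", "结节", "。", " "]
stopwords = {"为", "的", "了"}  # illustrative entries only
result = denoise(tokens, stopwords)
```

A set is used for the stop word list so that each membership check during the traversal takes constant time.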

Further, the method also comprises Step 8: Processing newly added data: the data updated in the medical system each day is processed according to Steps 1 to 7; the information extracted from the newly added data is then searched in the medical database by name, age, address, and identity card number to determine whether patients with the same name, age, address, and identity card number already exist; if so, the patient is judged to be readmitted and is stored in the readmission medical database; otherwise, the patient is stored in the new patient information base.
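The matching in Step 8 can be sketched as follows. This is a minimal illustration under stated assumptions: the `PatientRecord` class, the `split_new_records` function, and the sample values are hypothetical, in-memory lists stand in for the medical database, and a match means exact equality on all four attributes named in the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PatientRecord:
    # Hypothetical record layout; the source names only these four attributes.
    name: str
    age: int
    address: str
    id_card: str


def identity_key(record):
    """Tuple of the attributes the text uses to recognize a returning patient."""
    return (record.name, record.age, record.address, record.id_card)


def split_new_records(existing, new_records):
    """Split the day's records into (readmitted, first_time) by exact key match."""
    known = {identity_key(r) for r in existing}
    readmitted = [r for r in new_records if identity_key(r) in known]
    first_time = [r for r in new_records if identity_key(r) not in known]
    return readmitted, first_time


# Placeholder data, not real patients.
existing = [PatientRecord("Patient A", 63, "Address 1", "ID-0001")]
todays = [
    PatientRecord("Patient A", 63, "Address 1", "ID-0001"),
    PatientRecord("Patient B", 47, "Address 2", "ID-0002"),
]
readmitted, first_time = split_new_records(existing, todays)
```

Age and address can change between visits, so a real deployment might match on the identity card number alone; that would be a departure from the four-attribute comparison the text describes.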

The invention has the beneficial effects that:

1. The method provided by the invention can organize medical data and solve the problem that information is difficult to extract from medical documents; it provides a monitoring function for newly added data, judges whether a patient is a readmission within 30 days or a new patient, and provides technical support for subsequent mining and analysis of the medical data;

2. Medical data processing and storage are automated, saving a large amount of manpower and material resources, and the unformatted medical data is converted into formatted data;

3. A more complete database that advances the medical and health industry can be obtained to a certain extent;

4. On the basis of replacing manual extraction, the invention extracts all target medical data; PDF documents such as medical order documents and CT diagnosis reports undergo character recognition with the Baidu character recognition API (application program interface), followed by denoising and key information extraction; a search-and-judgment process is applied to each day's newly added data, and a readmission database is added; a complete medical database is finally formed, so the target medical data is extracted and organized fully and efficiently, saving a large amount of manpower and material resources.

Drawings

FIG. 1 is a flow chart of the present invention;

FIG. 2 is a flow chart of a network architecture for crawling targeted medical data in the present invention;

FIG. 3 is a diagram of regular expression matching of key information in the present invention.

Detailed Description

Example 1: as shown in FIGS. 1 to 3, a data extraction method based on a medical system crawler includes the following specific steps:

Step 3: Crawling patient data: each medical data URL is requested over the HTTP protocol, and the target medical data is matched and crawled by patient visit ID and medical order ID; the data crawled by the crawler is stored in a medical database; FIG. 2 is a flow chart of the network architecture for crawling the target medical data in the present invention;

Step 7: Key information extraction: text denoising alone does not yield the key information corresponding to the keywords, so the denoised PDF document corpus is processed with regular expressions, the key information under each input keyword is extracted, and the key information is stored in the medical database. For example, for the keywords "tumor size" and "nodule", the data following each keyword is extracted to obtain the key information after the two keywords, including the size of the tumor and whether the nodule may be benign or malignant; FIG. 3 is a diagram of regular expression matching of key information in the present invention;
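The keyword-driven extraction in Step 7 can be sketched with Python's `re` module. The pattern, the function name `extract_key_info`, the choice of delimiters, and the sample report text are all assumptions made for illustration; the patent does not give its actual regular expressions.

```python
import re


def extract_key_info(text, keywords):
    """Map each keyword to the text that follows it, up to the next delimiter.

    Delimiters assumed here: full-width period, ASCII or full-width semicolon,
    and newline. An optional ASCII or full-width colon after the keyword is skipped.
    """
    result = {}
    for kw in keywords:
        pattern = re.compile(re.escape(kw) + r"\s*[::]?\s*([^。;;\n]+)")
        match = pattern.search(text)
        if match:
            result[kw] = match.group(1).strip()
    return result


# Made-up report fragment, written in English for readability.
report = "tumor size: 3.2 x 2.8 cm; nodule: benign nodule suspected\n"
info = extract_key_info(report, ["tumor size", "nodule"])
```

`re.escape` is applied to each keyword so that keywords containing characters such as parentheses or dots are matched literally rather than interpreted as pattern syntax.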

The method analyzes the webpage structure of the medical data and crawls the medical data of the system; to address the difficulty of extracting data through the system login interface, the medical data of each patient is crawled by using the patient visit ID, the medical order ID, and the like as matching identifiers;
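Matching by visit ID and medical order ID might look like the following sketch, which parses HTML with a regular expression and JSON with the standard `json` module, as the method describes. The link format, the query parameter names `visit_id` and `order_id`, the JSON layout, and both function names are hypothetical; a real hospital system would have its own page structure. The HTTP request itself is omitted so that the sketch runs offline.

```python
import json
import re

# Hypothetical markup: each record link carries both identifiers in its query string.
LINK_RE = re.compile(
    r'href="(?P<url>[^"]*\?visit_id=(?P<visit>\w+)&(?:amp;)?order_id=(?P<order>\w+))"'
)


def find_record_urls(html, visit_id):
    """Return (url, order_id) pairs for links whose visit ID matches."""
    return [
        (m.group("url"), m.group("order"))
        for m in LINK_RE.finditer(html)
        if m.group("visit") == visit_id
    ]


def select_records(json_text, visit_id, order_id):
    """Parse a JSON payload and keep records matching both identifiers."""
    payload = json.loads(json_text)
    return [
        rec for rec in payload["records"]
        if rec["visit_id"] == visit_id and rec["order_id"] == order_id
    ]


# Made-up page and payload.
html = (
    '<a href="/report?visit_id=V001&order_id=A17">CT</a>'
    '<a href="/report?visit_id=V002&order_id=B03">MRI</a>'
)
json_text = (
    '{"records": ['
    '{"visit_id": "V001", "order_id": "A17", "item": "CT"},'
    '{"visit_id": "V001", "order_id": "A18", "item": "urinalysis"}]}'
)

urls = find_record_urls(html, "V001")
records = select_records(json_text, "V001", "A17")
```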

The invention can effectively store a patient's past medical information, current basic information, and the like in the database; it extracts the patient's medical data from PDF medical order documents, abdominal films, CT enhanced reconstruction documents, and the like, performs character recognition with the Baidu character recognition API, and extracts the key information and stores it in the database; this offers a breakthrough for medical data that is otherwise difficult to mine and avoids wasting a large amount of human resources;
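The branch between PDF documents and other crawled data can be sketched as below. The OCR call is deliberately passed in as a plain function, because the Baidu character recognition API's actual interface is not given in the text; `route_payload`, `fake_ocr`, and the in-memory `stored` list are hypothetical stand-ins. The only format fact relied on is that a PDF file begins with the bytes `%PDF-`.

```python
from typing import Callable, Optional

PDF_MAGIC = b"%PDF-"  # header bytes at the start of a PDF file


def is_pdf(payload: bytes) -> bool:
    """Return True when the payload begins with the PDF header bytes."""
    return payload.startswith(PDF_MAGIC)


def route_payload(
    payload: bytes,
    ocr: Callable[[bytes], str],
    store: Callable[[str], None],
) -> Optional[str]:
    """Send PDFs to OCR and return the recognized text; store anything else."""
    if is_pdf(payload):
        # The recognized text goes on to segmentation, denoising, and extraction.
        return ocr(payload)
    # Non-PDF data goes straight to the database.
    store(payload.decode("utf-8"))
    return None


# Stand-ins so the sketch runs without a network or a database (both hypothetical).
stored = []


def fake_ocr(pdf_bytes: bytes) -> str:
    return "tumor size: 3.2 cm"


pdf_result = route_payload(b"%PDF-1.4 ...", fake_ocr, stored.append)
text_result = route_payload(b'{"item": "urinalysis"}', fake_ocr, stored.append)
```

Passing the OCR and storage functions as parameters keeps the routing logic testable on its own and leaves the choice of recognition service open.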

The denoised PDF document corpus is processed with regular expressions, the key information under each input keyword is extracted, and the key information is stored in the medical database; the invention converts unformatted data into formatted data, stores it in the correspondingly constructed database, extracts and judges each day's newly added data to identify readmitted patients, and finally forms a complete database. The experiment was carried out in the urology department of the second affiliated hospital of a university; a complete urology database was extracted, and the results were better than those of manual extraction and subsequent storage.

To test the performance of the proposed method, a manually compiled database was compared with the database produced by the invention; Table 1 compares the time and accuracy of manual data extraction with those of the invention's data extraction, and shows that the method has high accuracy, short required time, and high efficiency;

TABLE 1

While the present invention has been described in detail with reference to the embodiments shown in the drawings, the present invention is not limited to the embodiments, and various changes can be made without departing from the spirit of the present invention within the knowledge of those skilled in the art.

Claims

1. A data extraction method based on a medical system crawler, characterized in that: a URL in the medical system is first initialized; the URL queue is parsed, with HTML data parsed by regular expressions and JSON data parsed by a json module; each medical data URL is then requested over HTTP, and the target medical data is matched and crawled by patient visit ID and medical order ID; the data crawled by the crawler is stored in a medical database; the crawled patient data is checked to determine whether it is a PDF document; if it is, character recognition is performed with the Baidu character recognition API (application program interface), which converts the picture data into character data; if not, the crawled data is stored in the medical database; and the PDF document corpus processed by the Baidu character recognition API undergoes word segmentation, text denoising, and key information extraction, with the key information stored in the medical database.

2. The method for medical system crawler-based data extraction of claim 1, wherein: the method for extracting data based on the medical system crawler comprises the following specific steps:

3. The method for medical system crawler-based data extraction of claim 1, wherein: the method further comprises Step 8: Processing newly added data: the data updated in the medical system each day is processed according to Steps 1 to 7; the information extracted from the newly added data is then searched in the medical database by name, age, address, and identity card number to determine whether patients with the same name, age, address, and identity card number already exist; if so, the patient is judged to be readmitted and is stored in the readmission medical database; otherwise, the patient is stored in the new patient information base.