CN111078976A - Medical system crawler-based data extraction method - Google Patents

Medical system crawler-based data extraction method

Info

Publication number
CN111078976A
Authority
CN
China
Prior art keywords
data
medical
character recognition
crawled
baidu
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911104769.7A
Other languages
Chinese (zh)
Inventor
马磊
蒋卫丽
陈振华
王雄彬
陈昊昱
龙晨
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Kunming University of Science and Technology
Second Affiliated Hospital of Kunming Medical University
Original Assignee
Kunming University of Science and Technology
Second Affiliated Hospital of Kunming Medical University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Kunming University of Science and Technology and Second Affiliated Hospital of Kunming Medical University
Priority to CN201911104769.7A
Publication of CN111078976A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9532 Query formulation
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H50/00 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H50/70 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for mining of medical data, e.g. analysing previous cases of other patients

Abstract

The invention relates to a crawler-based method for extracting data from a medical system and belongs to the technical field of medical image character recognition. First, a URL in the medical system is initialized; the URL queue is parsed, with HTML data parsed by regular expressions and JSON data parsed by a json module; the URL of each piece of medical data is then requested over HTTP, and the target medical data are matched and crawled by the patient's visit ID and the medical order ID; the crawled data are stored in a medical database; the crawled patient data are checked, and PDF documents undergo character recognition with the Baidu character recognition API; the PDF document corpus processed by the Baidu character recognition API then undergoes word segmentation, text denoising and key information extraction, and the key information is stored in the medical database. The invention addresses the difficulty of extracting medical data and the time-consuming, tedious nature of such extraction.

Description

Medical system crawler-based data extraction method
Technical Field
The invention relates to a crawler-based method for extracting data from a medical system and belongs to the technical field of medical image character recognition.
Background
With the development of China's medical and health industry, domestic hospitals have successively deployed systems such as the hospital information system (HIS), PACS (picture archiving and communication system) and LIS (laboratory information system). As these information systems have come into use, a long-neglected problem has gradually surfaced: data extraction. Data extraction has become a bottleneck and weak point that limits the performance of these information systems, and its importance is drawing increasing attention.
Data mining is the non-trivial process of extracting implicit, potentially valuable and ultimately understandable patterns from a database, and is a key step in knowledge discovery. Medical databases are rich in information and may contain patients' medical images, related pathological parameters, test and measurement results, diagnosis records and basic patient parameters (age, sex, medical history, time of admission, etc.). Medical data are generally stored inside a medical system that provides no interface for extraction, so organizing the data is complex and tedious, must be done by hand, and consumes a great deal of manpower and material resources. With the development of the Internet, however, any user can obtain the knowledge they need from the vast amount of information on the network by some means. Different users need different knowledge, which greatly increases the difficulty of acquiring target information; the Web crawler was proposed in response. A Web crawler is highly specialized and can query many Web pages effectively. It starts from a single Web page, follows hyperlinks to reach other pages, and repeats this operation until all pages have been retrieved and scanned and the required information has been obtained. A crawler acquires Web pages automatically, and its implementation strategy and operating efficiency have an obvious influence on the search results: a well-designed, efficient crawler returns timely and accurate information.
The earliest crawler was the Google crawler, in which different processes could be set up to carry out each crawling task; search engines such as Baidu and Sohu also began to study crawlers, but the crawler technology of these engines is kept secret. A crawler can be written by combining computer algorithms with manual analysis of the target website, and can then obtain more complete information, which is urgently needed for building medical information bases. Medical systems are updated quickly; building an interface to a medical system may take a long time and the interface may not suit every medical department, while collecting and organizing medical data by hand is complicated and labor-intensive.
Disclosure of Invention
The invention provides a crawler-based method for extracting data from a medical system, which addresses the difficulty of extracting medical data and the time-consuming, tedious nature of such extraction.
The technical scheme of the invention is as follows: a crawler-based method for extracting data from a medical system, in which a URL in the medical system is first initialized; the URL queue is parsed, with HTML data parsed by regular expressions and JSON data parsed by a json module; the URL of each piece of medical data is then requested over HTTP, and the target medical data are matched and crawled by the patient's visit ID and the medical order ID; the crawled data are stored in a medical database; the crawled patient data are checked to determine whether they are a PDF document; if so, character recognition is performed with the Baidu character recognition API (application program interface), which converts the image data into text data; if not, the crawled data are stored in the medical database; the PDF document corpus processed by the Baidu character recognition API then undergoes word segmentation, text denoising and key information extraction, and the key information is stored in the medical database.
Further, the crawler-based method for extracting data from a medical system comprises the following specific steps:
Step 1: Initialize the URL: use an HTTP (Hypertext Transfer Protocol) library to send a request to the hospital web page of the target medical data website in the medical system; if the server responds, a Response for the hospital web page is obtained, which contains the page's HTML (HyperText Markup Language) data and its JSON (a lightweight data-interchange format) data;
Step 2: Parse the URL queue: parse the HTML data with regular expressions, then parse the JSON data with a json module;
Step 3: Crawl patient data: request the URL of each piece of medical data over HTTP, and match and crawl the target medical data by the patient's visit ID and the medical order ID; store the crawled data in a medical database;
Step 4: Recognize text in PDF documents: determine whether the patient data crawled in Step 3 is a PDF document; if so, perform character recognition with the Baidu character recognition API, which converts the image data into text data; if not, store the crawled data in the medical database. The Baidu character recognition API is a platform that recognizes text in a variety of general scenes and documents and returns the results line by line;
Step 5: Word segmentation: segment the PDF document corpus processed by the Baidu character recognition API with the jieba word segmentation algorithm;
Step 6: Text denoising: after segmentation, the corpus contains many symbols, punctuation marks and stop words, which degrade the quality of the medical data and hinder keyword extraction from medical reports, so this irrelevant content is removed; a Chinese stop word list, stopwords.txt, is built, every word in the text is traversed, and words that appear in the stop word list are deleted;
Step 7: Key information extraction: the denoised corpus does not by itself yield the key information associated with each keyword, so it is processed with regular expressions to extract the key information under each input keyword, and the key information is stored in the medical database.
Further, the method also comprises Step 8: Process newly added data: each day, process the data newly added to the medical system according to Steps 1 to 7; search the medical database for the information extracted from the new data by name, age, address and identity card number to check whether a patient with the same attributes already exists; if so, the patient is judged to be readmitted and is stored in the readmission medical database; otherwise, the patient is stored in the patient information base as a new patient.
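Steps 1 and 2 can be illustrated with a minimal Python sketch. The markup in HTML_BODY, the field names in JSON_BODY and the function names are assumptions made for illustration; the patent does not disclose the actual page structure of the hospital system. No request is sent here: the two strings stand in for the body of a Response.

```python
import html
import json
import re

# Hypothetical response bodies; a real hospital page would have its own markup.
HTML_BODY = """
<table>
  <tr><td><a href="/record?visit_id=V001&amp;order_id=A17">CT report</a></td></tr>
  <tr><td><a href="/record?visit_id=V002&amp;order_id=B03">Medical order</a></td></tr>
</table>
"""
JSON_BODY = '{"patients": [{"visit_id": "V001", "name": "Zhang San"}]}'

# Regular expression that captures the target and the label of each link.
LINK_RE = re.compile(r'<a\s+href="([^"]+)"[^>]*>([^<]*)</a>')


def parse_html_links(html_text):
    """Return (url, label) pairs found in the HTML by the regular expression."""
    return [(html.unescape(href), label.strip())
            for href, label in LINK_RE.findall(html_text)]


def parse_json_records(json_text):
    """Return the list stored under the (assumed) "patients" key."""
    return json.loads(json_text).get("patients", [])


url_queue = parse_html_links(HTML_BODY)
records = parse_json_records(JSON_BODY)
```

Regular expressions are used here because Step 2 names them; they only cope with markup as regular as the sample above.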
The beneficial effects of the invention are:
1. the method organizes medical data, overcomes the difficulty of extracting information from medical documents, monitors newly added data to determine whether a patient is a readmission within 30 days or a new patient, and provides technical support for further mining and analysis of medical data;
2. the processing and storage of medical data are automated, saving a great deal of manpower and material resources, and unformatted medical data are converted into formatted data;
3. to some extent, a more complete database that supports the development of the medical and health industry can be obtained;
4. the invention replaces manual extraction: all target medical data are extracted; PDF documents such as medical order documents and CT diagnosis reports undergo character recognition with the Baidu character recognition API (application program interface), followed by denoising and key information extraction; a daily search-and-judgment process is applied to newly added data and a readmission database is added; a complete medical database is finally formed, so that the target medical data are extracted and organized fully and efficiently and a great deal of manpower and material resources are saved.
Drawings
FIG. 1 is a flow chart of the present invention;
FIG. 2 is a flow chart of a network architecture for crawling targeted medical data in the present invention;
FIG. 3 is a diagram of the key information matched by regular expressions in the present invention.
Detailed Description
Example 1: as shown in FIGS. 1 to 3, a crawler-based method for extracting data from a medical system comprises the following specific steps:
Step 1: Initialize the URL: use an HTTP (Hypertext Transfer Protocol) library to send a request to the hospital web page of the target medical data website in the medical system; if the server responds, a Response for the hospital web page is obtained, which contains the page's HTML (HyperText Markup Language) data and its JSON (a lightweight data-interchange format) data;
Step 2: Parse the URL queue: parse the HTML data with regular expressions, then parse the JSON data with a json module;
Step 3: Crawl patient data: request the URL of each piece of medical data over HTTP, and match and crawl the target medical data by the patient's visit ID and the medical order ID; store the crawled data in a medical database; FIG. 2 shows the network architecture flow for crawling the target medical data;
Step 4: Recognize text in PDF documents: determine whether the patient data crawled in Step 3 is a PDF document; if so, perform character recognition with the Baidu character recognition API, which converts the image data into text data; if not, store the crawled data in the medical database. The Baidu character recognition API is a platform that recognizes text in a variety of general scenes and documents and returns the results line by line;
Step 5: Word segmentation: segment the PDF document corpus processed by the Baidu character recognition API with the jieba word segmentation algorithm;
Step 6: Text denoising: after segmentation, the corpus contains many symbols, punctuation marks and stop words, which degrade the quality of the medical data and hinder keyword extraction from medical reports, so this irrelevant content is removed; a Chinese stop word list, stopwords.txt, is built, every word in the text is traversed, and words that appear in the stop word list are deleted;
Step 7: Key information extraction: the denoised corpus does not by itself yield the key information associated with each keyword, so it is processed with regular expressions to extract the key information under each input keyword, and the key information is stored in the medical database. For example, with the keywords "tumor size" and "nodule", the data following each of the two keywords is extracted as key information, including the size of the tumor and whether the nodule is likely benign or malignant; FIG. 3 shows the key information matched by the regular expressions;
Further, the method also comprises Step 8: Process newly added data: each day, process the data newly added to the medical system according to Steps 1 to 7; search the medical database for the information extracted from the new data by name, age, address and identity card number to check whether a patient with the same attributes already exists; if so, the patient is judged to be readmitted and is stored in the readmission medical database; otherwise, the patient is stored in the patient information base as a new patient.
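The readmission check in Step 8 can be sketched as a lookup on the four attributes the step names. The Patient class, its field names and route_new_records are hypothetical; the patent does not describe its database schema, and the 30-day window mentioned among the beneficial effects is left out because Step 8 gives no date fields.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Patient:
    # Hypothetical record layout holding the four attributes named in Step 8.
    name: str
    age: int
    address: str
    id_card: str


def identity_key(patient):
    """Tuple of the attributes that are compared between records."""
    return (patient.name, patient.age, patient.address, patient.id_card)


def route_new_records(existing, new_records):
    """Split new records into (readmitted, first_time) by attribute match."""
    known = {identity_key(p) for p in existing}
    readmitted, first_time = [], []
    for patient in new_records:
        if identity_key(patient) in known:
            readmitted.append(patient)
        else:
            first_time.append(patient)
    return readmitted, first_time


# Illustrative data only; the identifiers are made up.
existing = [Patient("Li Si", 61, "Kunming", "ID-0001")]
incoming = [
    Patient("Li Si", 61, "Kunming", "ID-0001"),
    Patient("Wang Wu", 45, "Dali", "ID-0002"),
]
readmitted, first_time = route_new_records(existing, incoming)
```

Requiring all four attributes to match follows the wording of Step 8. Since age and address can change between admissions, a real system might prefer to match on the identity card number alone, but that would be a departure from the text.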
The method analyzes the web page structure of the medical data and crawls the medical data from the system. Because extraction through the system's login interface is cumbersome, the medical data of each patient are crawled by matching on identifiers such as the patient's visit ID and the medical order ID.
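Matching on the visit ID and the medical order ID might look like the following sketch, which keeps only the queued URLs whose query string carries a wanted pair of identifiers. BASE_URL and the parameter names visit_id and order_id are assumptions; the actual URL layout of the hospital system is not given in the patent.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Hypothetical endpoint; ".invalid" is a reserved TLD that never resolves.
BASE_URL = "http://his.example.invalid/record"


def build_record_url(visit_id, order_id):
    """Compose the URL of one medical record from its two identifiers."""
    query = urlencode({"visit_id": visit_id, "order_id": order_id})
    return f"{BASE_URL}?{query}"


def ids_from_url(url):
    """Read (visit_id, order_id) back out of a record URL."""
    query = parse_qs(urlsplit(url).query)
    return (query.get("visit_id", [None])[0],
            query.get("order_id", [None])[0])


def select_targets(url_queue, wanted):
    """Keep only URLs whose identifier pair is in the wanted set."""
    return [url for url in url_queue if ids_from_url(url) in wanted]


url_queue = [
    build_record_url("V001", "A17"),
    build_record_url("V002", "B03"),
    "http://his.example.invalid/about",  # no identifiers, never selected
]
targets = select_targets(url_queue, {("V001", "A17")})
```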
The invention can effectively store a patient's previous medical information, current basic information and the like in the database. It extracts medical data from the patient's PDF medical order documents, abdominal slices, CT enhanced reconstruction documents and the like, performs character recognition with the Baidu character recognition API, extracts the key information and stores it in the database. This offers a way past the difficulty of mining medical data and avoids wasting a large amount of human resources.
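The PDF branch described above can be sketched as follows. The check relies on the fact that PDF files begin with the header "%PDF-". The ocr argument is a placeholder callable standing in for the Baidu character recognition API, whose real request format is not reproduced here; route_document and the list used as a database are likewise illustrative.

```python
def is_pdf(payload: bytes) -> bool:
    """True if the payload starts with the PDF file header."""
    return payload.startswith(b"%PDF-")


def route_document(payload, ocr, database):
    """Return recognized text for a PDF; store anything else and return None.

    `ocr` is any callable that takes bytes and returns text. It stands in
    for the character recognition service and is an assumption of this sketch.
    """
    if is_pdf(payload):
        return ocr(payload)
    database.append(payload)
    return None


# Fake recognizer used only to make the example runnable offline.
def fake_ocr(payload):
    return "tumor size 3.2 cm"


database = []
pdf_text = route_document(b"%PDF-1.4 ...", fake_ocr, database)
other = route_document(b'{"name": "Zhang San"}', fake_ocr, database)
```

The text returned for a PDF is what the later segmentation, denoising and extraction steps would consume.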
The denoised PDF document corpus is processed with regular expressions, the key information under each input keyword is extracted, and the key information is stored in the medical database. The invention converts unformatted data into formatted data and stores it in a correspondingly constructed database; newly added data are extracted and checked every day to identify readmitted patients, and a complete database is finally formed. The experiment was carried out in the urology department of a second affiliated hospital of a university; a complete urology database was extracted, and the results were better than those of manual extraction and subsequent storage.
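Denoising followed by keyword-driven extraction can be sketched in a few lines. The tokens are assumed to be already segmented (the patent uses jieba for that step, which is not called here), STOP_WORDS is a small stand-in for stopwords.txt, and the pattern that captures everything between one keyword and the next is one possible reading of "key information under the corresponding input keywords", not the patent's own expression.

```python
import re

# Stand-in for stopwords.txt; a real list would be loaded from the file.
STOP_WORDS = {"the", "is", "of"}
# Tokens made up only of symbols or punctuation.
PUNCT_RE = re.compile(r"[\W_]+")


def denoise(tokens):
    """Drop blank tokens, pure punctuation and stop words."""
    return [t for t in tokens
            if t.strip() and not PUNCT_RE.fullmatch(t) and t not in STOP_WORDS]


def extract_key_info(text, keywords):
    """Map each keyword to the text that follows it, up to the next keyword."""
    alternation = "|".join(re.escape(k) for k in keywords)
    pattern = re.compile(
        rf"({alternation})\s*(.*?)\s*(?=(?:{alternation})|$)", re.S)
    return {keyword: value for keyword, value in pattern.findall(text)}


# Hypothetical segmented OCR output for one report.
tokens = ["tumor size", ":", "3.2", "cm", ",", "the",
          "nodule", "possibly", "benign", "."]
clean_text = " ".join(denoise(tokens))
key_info = extract_key_info(clean_text, ["tumor size", "nodule"])
```

Because punctuation has already been removed, the keywords themselves act as the delimiters between fields; this works only when every field of interest is introduced by a known keyword.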
To test the performance of the proposed method, a manually compiled database was compared with the database produced by the invention. Table 1 compares the time and accuracy of manual data extraction with those of the invention; the proposed method achieves high accuracy, requires little time and is highly efficient.
TABLE 1
[Table 1 appears as an image in the original publication; its contents are not recoverable from this text.]
While the present invention has been described in detail with reference to the embodiments shown in the drawings, the present invention is not limited to the embodiments, and various changes can be made without departing from the spirit of the present invention within the knowledge of those skilled in the art.

Claims (3)

1. A crawler-based method for extracting data from a medical system, characterized in that: a URL in the medical system is first initialized; the URL queue is parsed, with HTML data parsed by regular expressions and JSON data parsed by a json module; the URL of each piece of medical data is then requested over HTTP, and the target medical data are matched and crawled by the patient's visit ID and the medical order ID; the crawled data are stored in a medical database; the crawled patient data are checked to determine whether they are a PDF document; if so, character recognition is performed with the Baidu character recognition API (application program interface), which converts the image data into text data; if not, the crawled data are stored in the medical database; the PDF document corpus processed by the Baidu character recognition API then undergoes word segmentation, text denoising and key information extraction, and the key information is stored in the medical database.
2. The crawler-based method for extracting data from a medical system of claim 1, characterized in that the method comprises the following specific steps:
Step 1: Initialize the URL: use an HTTP (Hypertext Transfer Protocol) library to send a request to the hospital web page of the target medical data website in the medical system; if the server responds, a Response for the hospital web page is obtained, which contains the page's HTML (HyperText Markup Language) data and its JSON (a lightweight data-interchange format) data;
Step 2: Parse the URL queue: parse the HTML data with regular expressions, then parse the JSON data with a json module;
Step 3: Crawl patient data: request the URL of each piece of medical data over HTTP, and match and crawl the target medical data by the patient's visit ID and the medical order ID; store the crawled data in a medical database;
Step 4: Recognize text in PDF documents: determine whether the patient data crawled in Step 3 is a PDF document; if so, perform character recognition with the Baidu character recognition API, which converts the image data into text data; if not, store the crawled data in the medical database. The Baidu character recognition API is a platform that recognizes text in a variety of general scenes and documents and returns the results line by line;
Step 5: Word segmentation: segment the PDF document corpus processed by the Baidu character recognition API with the jieba word segmentation algorithm;
Step 6: Text denoising: after segmentation, the corpus contains many symbols, punctuation marks and stop words, which degrade the quality of the medical data and hinder keyword extraction from medical reports, so this irrelevant content is removed; a Chinese stop word list, stopwords.txt, is built, every word in the text is traversed, and words that appear in the stop word list are deleted;
Step 7: Key information extraction: the denoised corpus does not by itself yield the key information associated with each keyword, so it is processed with regular expressions to extract the key information under each input keyword, and the key information is stored in the medical database.
3. The crawler-based method for extracting data from a medical system of claim 1, characterized in that it further comprises Step 8: Process newly added data: each day, process the data newly added to the medical system according to Steps 1 to 7; search the medical database for the information extracted from the new data by name, age, address and identity card number to check whether a patient with the same attributes already exists; if so, the patient is judged to be readmitted and is stored in the readmission medical database; otherwise, the patient is stored in the patient information base as a new patient.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911104769.7A CN111078976A (en) 2019-11-08 2019-11-08 Medical system crawler-based data extraction method

Publications (1)

Publication Number Publication Date
CN111078976A 2020-04-28

Family

ID=70310938

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911104769.7A Pending CN111078976A (en) 2019-11-08 2019-11-08 Medical system crawler-based data extraction method

Country Status (1)

Country Link
CN (1) CN111078976A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113223661A (en) * 2021-05-26 2021-08-06 杭州比康信息科技有限公司 Traditional Chinese medicine prescription transmission system

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104820697A (en) * 2015-04-28 2015-08-05 迈德高武汉生物医学信息科技有限公司 Medical data mining method and system
CN109215754A (en) * 2018-09-10 2019-01-15 平安科技(深圳)有限公司 Medical record data processing method, device, computer equipment and storage medium
CN109493931A (en) * 2018-10-25 2019-03-19 平安科技(深圳)有限公司 A kind of coding method of patient file, server and computer readable storage medium
CN109524073A (en) * 2018-10-17 2019-03-26 新博卓畅技术(北京)有限公司 A kind of automatic deciphering method of hospital's audit report, system and equipment
CN110136837A (en) * 2019-03-29 2019-08-16 中国人民解放军总医院 A kind of medical data processing platform
CN110335654A (en) * 2019-07-03 2019-10-15 重庆邮电大学 A kind of information extraction method of electronic health record, system and computer equipment
CN110415831A (en) * 2019-07-18 2019-11-05 天宜(天津)信息科技有限公司 A kind of medical treatment big data cloud service analysis platform

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
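于珊珊 et al.: "Research on crawler technology for unstructured data retrieval in medical big data", 《2014中华医院信息网络大会》 *
冯思度 et al.: "Research and design of a web crawler system based on medical information", 《现代信息科技》 *
卞伟玮 et al.: "A health and medical big data collection and organization system based on web crawler technology", 《山东大学学报(医学版)》 *
苗玥 et al.: "Python-based medical data crawling, analysis and processing", 《信息技术与现代化》 *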

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20200428