CN108197216A - An information processing method - Google Patents

An information processing method

Info

Publication number
CN108197216A
Authority
CN
China
Prior art keywords
table header
information processing
coordinate
row
medical electronic
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201711463023.6A
Other languages
Chinese (zh)
Inventor
邱恒
龙汉
王海生
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Juding Medical Device Co., Ltd.
Original Assignee
Shenzhen Juding Medical Device Co., Ltd.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Juding Medical Device Co., Ltd.
Priority to CN201711463023.6A
Publication of CN108197216A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20: Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/25: Integrating or interfacing systems involving database management systems
    • G06F16/258: Data format conversion from or to a database

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Medical Treatment And Welfare Office Work (AREA)

Abstract

The present invention relates to the technical field of information extraction from medical electronic reports, and in particular to an information processing method comprising the following steps: obtaining the intermediate format data of a medical electronic report; sorting the intermediate format data by coordinate; finding the dividing lines in the medical electronic report; obtaining the table header in the medical electronic report and saving the header as keywords; defining the table header; defining the rows and columns of the table; and converting the extracted table data into a document of a preset format and outputting it. The method extracts and organizes the information in medical electronic reports primarily by keywords, with lines as a supplementary cue. It achieves low-cost, efficient automated extraction, supports the recognition of medical electronic reports in a variety of layouts, covers the recognition and interpretation of many indicator items, and has a high recognition rate for irregular tables in electronic reports.

Description

An information processing method
Technical field
The present invention relates to the technical field of information extraction from medical electronic reports, and in particular to an information processing method.
Background
Medical electronic reports are mostly PDF or XPS files and contain rich personal and medical record data about patients. An XPS document, like a PDF document, is a read-only document format that stores data in a structured form; to read the document content by computer, the document must be parsed and its content extracted accordingly. .NET includes a component for reading such documents. Although this component can obtain the text in an XPS or PDF file, it does not publicly expose the ability to obtain coordinate information; a hidden interface can obtain coordinate information, but its accuracy is very low. Test indicators in medical electronic reports are usually presented in tables, and the traditional approach to table extraction divides table elements by visual graphics. However, the tables that present test indicators in medical electronic reports usually have no explicit separator lines, rectangles, or spacing, so dividing table elements by visual graphics alone has low accuracy and is not well suited to extracting test indicators.
Summary of the invention
In view of the problems in the prior art, the present invention provides an information processing method.
An information processing method comprises the following steps:
obtaining the intermediate format data of a medical electronic report;
sorting the intermediate format data by coordinate;
finding the dividing lines in the medical electronic report;
obtaining the table header in the medical electronic report, saving the header as keywords, and locating the header coordinates according to a preset data dictionary and the keywords;
defining the table header;
defining the rows and columns of the table;
converting the extracted table data into a document of a preset format and outputting it.
Further, the step of sorting the intermediate format data by coordinate is specifically:
reordering the intermediate format data and its coordinates in page-first, then-row order.
Further, the page-first, then-row order specifically comprises: dividing all intermediate format data by page and arranging it in ascending page number; within a single page, sorting the elements by ascending Y coordinate, dividing them into rows according to the vertical gaps between Y coordinates, and sorting the elements within each row by ascending X coordinate.
Further, the step of finding the dividing lines in the medical electronic report is specifically:
selecting the vertical and horizontal straight lines from the intermediate format data.
Further, the content of the data dictionary is derived from the layouts of common medical electronic report test sheets, with the header content stored in the data dictionary as keyword information.
Further, matching is performed row by row against the keywords stored in the data dictionary, the frequency with which keywords appear in the text blocks of each row is calculated, and the header coordinates are located at the row with the highest frequency; when the degree of matching is low, the rectangles formed by any separator lines present are calculated to help locate the starting point of the table.
Further, the step of defining the table header is specifically:
using the data dictionary, splitting or recombining the header row found in the previous step so that it matches the correct header columns.
Further, the step of defining the rows and columns of the table is specifically:
using a text block splitting algorithm and judging by the distance from the header, splitting or merging text blocks into cells consistent with the number of header columns.
Further, after the step of defining the rows and columns of the table, noise is removed from the table area, where noise refers to elements that are not table content.
Further, for a table without complete grid lines, when the text in a cell is too long and is split onto the next line, the cross-row text is merged.
The information processing method of the present invention extracts and organizes the information in medical electronic reports primarily by keywords, with lines as a supplementary cue. It achieves low-cost, efficient automated extraction, supports the recognition of medical electronic reports in a variety of layouts, covers the recognition and interpretation of many indicator items, and has a high recognition rate for irregular tables in electronic reports.
Description of the drawings
To illustrate the embodiments of the present invention or the technical solutions of the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention; a person of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a flow chart of the information processing method provided by the present invention.
Detailed description of the embodiments
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person skilled in the art based on the embodiments of the present invention without creative effort fall within the protection scope of the present invention.
An information processing method, the extraction method comprising the following steps:
S01: Obtain the intermediate format data of the medical electronic report;
S02: Sort the intermediate format data by coordinate so that it can later be parsed row by row in order; the intermediate format data and its coordinates are reordered in page-first, then-row order;
Specifically, the page-first, then-row order comprises: dividing all intermediate format data by page and arranging it in ascending page number; within a single page, sorting the elements by ascending Y coordinate, dividing them into rows according to the vertical gaps between Y coordinates, and sorting the elements within each row by ascending X coordinate;
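The page, Y, then X ordering of step S02 can be sketched as follows. This is a minimal illustration rather than the patented implementation: the `Element` record, the `group_into_rows` helper, and the `row_gap` threshold are all hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Element:
    """Hypothetical intermediate-format element: a text block and its position."""
    page: int
    x: float
    y: float
    text: str


def group_into_rows(elements: List[Element], row_gap: float = 5.0) -> List[List[Element]]:
    """Order elements by page, then Y; split into rows on vertical gaps; sort rows by X."""
    ordered = sorted(elements, key=lambda e: (e.page, e.y))
    rows: List[List[Element]] = []
    for e in ordered:
        last = rows[-1] if rows else None
        # A new row starts on a page change or when the Y jump exceeds the assumed gap.
        if last is None or last[0].page != e.page or e.y - last[-1].y > row_gap:
            rows.append([e])
        else:
            last.append(e)
    for row in rows:
        row.sort(key=lambda e: e.x)
    return rows


elements = [
    Element(1, 120, 41, "Result"),
    Element(1, 50, 40, "Item"),
    Element(2, 50, 10, "p2"),
    Element(1, 50, 60, "ABO"),
]
rows = group_into_rows(elements)
```

The gap threshold would in practice depend on the font size and layout of the report; the value above is only a placeholder.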
S03: Find the dividing lines in the medical electronic report. Other lines such as diagonals and curves are excluded, and the vertical and horizontal straight lines are selected from the intermediate format data. The presence of lines helps to find the table range quickly, but lines cannot be found in every file; in that case a data dictionary must be established, whose content is derived from the layouts of common medical report test sheets, with the header content stored in the data dictionary as keyword information.
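One way to picture the line filtering in step S03 is shown below. The sketch assumes each line segment is available as two endpoints; the `Line` type, the `split_rulings` helper, and the tolerance `tol` are hypothetical. Curves are assumed to have been discarded earlier because they are not straight segments.

```python
from typing import List, NamedTuple, Tuple


class Line(NamedTuple):
    """Hypothetical straight segment from (x1, y1) to (x2, y2)."""
    x1: float
    y1: float
    x2: float
    y2: float


def split_rulings(lines: List[Line], tol: float = 0.5) -> Tuple[List[Line], List[Line]]:
    """Keep only horizontal and vertical segments; diagonals are dropped."""
    horizontal: List[Line] = []
    vertical: List[Line] = []
    for ln in lines:
        dx, dy = abs(ln.x2 - ln.x1), abs(ln.y2 - ln.y1)
        if dy <= tol < dx:
            horizontal.append(ln)
        elif dx <= tol < dy:
            vertical.append(ln)
    return horizontal, vertical


segments = [Line(0, 10, 100, 10), Line(20, 0, 20, 50), Line(0, 0, 30, 30)]
horizontal, vertical = split_rulings(segments)
```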
S04: Obtain the table header in the medical electronic report and save the header as keywords; locate the header coordinates according to the preset data dictionary and the keywords;
Specifically, matching is performed row by row against the keywords stored in the data dictionary, the frequency with which keywords appear in the text blocks of each row is calculated, and the header coordinates are located at the row with the highest frequency. When the degree of matching is low, the rectangles formed by any separator lines present are calculated to help locate the starting point of the table. The header coordinates serve as the starting coordinates of the table range, and the bottom of the table is calculated by a specific algorithm between the starting coordinates and the end element. Specifically, the end element is the last text block after the intermediate format data has been sorted by Y coordinate and X coordinate.
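The keyword-frequency matching of step S04 might look like the sketch below. The keyword set, the `find_header_row` helper, and the `min_hits` threshold are assumptions made for illustration; the patent does not list the dictionary contents or define what counts as a low degree of matching.

```python
from typing import Optional, Sequence, Set

# Hypothetical data dictionary of header keywords (lower-case).
HEADER_KEYWORDS: Set[str] = {"item", "result", "unit", "reference range"}


def find_header_row(
    rows: Sequence[Sequence[str]],
    keywords: Set[str] = HEADER_KEYWORDS,
    min_hits: int = 2,
) -> Optional[int]:
    """Return the index of the row containing the most dictionary keywords.

    Returns None when the best row has fewer than `min_hits` matches, so the
    caller can fall back to rectangles formed by separator lines.
    """
    best_index: Optional[int] = None
    best_hits = 0
    for i, row in enumerate(rows):
        hits = sum(1 for kw in keywords if any(kw in block.lower() for block in row))
        if hits > best_hits:
            best_index, best_hits = i, hits
    return best_index if best_hits >= min_hits else None


report_rows = [
    ["Patient: Zhang", "Age: 40"],
    ["Item", "Result", "Unit", "Reference range"],
    ["ABO", "A", "", ""],
]
header_index = find_header_row(report_rows)
```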
S05: Define the table header. Using the data dictionary, the header row found in the previous step is split or recombined so that it matches the correct header columns;
S06: Define the rows and columns of the table. Using a text block splitting algorithm and judging by the distance from the header, text blocks are split or merged into cells consistent with the number of header columns;
For example, if there is an obvious gap between "Item name" and "Result", the text can be split into two parts. Once the specific header information has been obtained, the number of columns in the table is known. This embodiment uses a text block splitting algorithm and divides cells according to the distance from the header and the width of the header columns.
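The "Item name" and "Result" example can be illustrated with a very small splitter. It assumes, purely for illustration, that a wide gap shows up as a run of two or more whitespace characters in the extracted text; a real implementation would more likely compare glyph coordinates. `split_header_text` and `min_gap` are hypothetical names.

```python
import re
from typing import List


def split_header_text(text: str, min_gap: int = 2) -> List[str]:
    """Split a header text block wherever at least `min_gap` whitespace characters occur."""
    parts = re.split(r"\s{%d,}" % min_gap, text.strip())
    return [p for p in parts if p]


columns = split_header_text("Item name    Result")
```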
The coordinates of the table bottom are determined from the change in row spacing after the table starting point and from the separator lines. After the start and end of the range have been determined, any noise must be excluded. Noise refers to elements that are not table content, that is, other text or oblique lines at the left and right edges of the table range. These are excluded by calculation based on the positions of the text blocks in the table's starting row or on the length of the separator lines, which determines the rectangular extent of the table.
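A simple reading of the noise exclusion is to keep only the blocks that overlap the horizontal span of the header row. The sketch below follows that reading; the `Block` type, the `drop_noise` helper, and the `margin` value are assumptions, and the separator-line variant is not shown.

```python
from typing import List, NamedTuple


class Block(NamedTuple):
    """Hypothetical text block with its left and right X coordinates."""
    text: str
    left: float
    right: float


def drop_noise(header: List[Block], blocks: List[Block], margin: float = 2.0) -> List[Block]:
    """Discard blocks lying entirely outside the X span covered by the header row."""
    table_left = min(b.left for b in header) - margin
    table_right = max(b.right for b in header) + margin
    return [b for b in blocks if b.right >= table_left and b.left <= table_right]


header = [Block("Item name", 50, 110), Block("Result", 200, 260)]
blocks = [Block("ABO", 50, 150), Block("stamp", 400, 450), Block("A", 200, 210)]
kept = drop_noise(header, blocks)
```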
S07: Convert the extracted table data into a document of a preset format and output it.
Specifically, for a table without complete grid lines, when the text in a cell is too long and is split onto the next line, the cross-row text is merged.
For example, the upper-left corner of the header column "Item name" is at x=50, y=40 and its lower-right corner is at x=110, y=55; the X values it spans are denoted as set A,
A={ 50,60,70,80,90,100,110 };
The upper-left corner of the text block "irregular antibody screening" is at x=50, y=60 and its lower-right corner is at x=150, y=75; the X values it spans are denoted as set B,
B={ 50,60,70,80,90,100,110,120,130,140,150 };
Then A ∩ B ≠ ∅: a non-empty intersection means that the text block is a cell of column A. By analogy, the text blocks within the table range are split or merged into cells consistent with the number of header columns. When the text in a cell is too long and is split onto the next line, the cross-row text is merged using cell merging.
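The set-intersection test in this example is equivalent to checking whether two X intervals overlap. The sketch below uses the coordinates from the example. Choosing the column with the widest overlap and treating a row with a single filled cell as a continuation of the previous row are assumptions added for illustration, not rules stated in the patent.

```python
from typing import List, Optional, Tuple

Span = Tuple[float, float]  # (left, right) X extent of a header column or text block


def column_of(block: Span, header_columns: List[Span]) -> Optional[int]:
    """Return the index of the header column that overlaps the block the most, or None."""
    best: Optional[int] = None
    best_width = 0.0
    for i, col in enumerate(header_columns):
        width = min(block[1], col[1]) - max(block[0], col[0])
        if width > best_width:
            best, best_width = i, width
    return best


def merge_wrapped_rows(rows: List[List[str]]) -> List[List[str]]:
    """Append a row with exactly one filled cell to the same cell of the previous row."""
    merged: List[List[str]] = []
    for row in rows:
        filled = [i for i, cell in enumerate(row) if cell]
        if merged and len(filled) == 1:
            i = filled[0]
            merged[-1][i] = (merged[-1][i] + " " + row[i]).strip()
        else:
            merged.append(list(row))
    return merged


header_columns = [(50, 110), (200, 260)]        # "Item name" spans x=50..110
column_index = column_of((50, 150), header_columns)  # text block spans x=50..150
table = merge_wrapped_rows([["Irregular antibody", "Negative"], ["screening", ""]])
```

The single-filled-cell heuristic would misfire on rows that legitimately contain one value, so a production rule would need further checks such as row spacing.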
This completes the table extraction. After data extraction is complete, the table data must be formatted for output. The extracted table is converted into JSON or XML format and saved to a specified path.
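As a rough illustration of the JSON output, the extracted header and rows can be zipped into one record per table row. The `table_to_json` helper and the record layout are assumptions; the patent does not specify the output schema.

```python
import json
from typing import Dict, List


def table_to_json(header: List[str], rows: List[List[str]]) -> str:
    """Serialize table rows as a JSON array of objects keyed by header text."""
    records: List[Dict[str, str]] = [dict(zip(header, row)) for row in rows]
    # ensure_ascii=False keeps non-ASCII text such as Chinese item names readable.
    return json.dumps(records, ensure_ascii=False, indent=2)


output = table_to_json(["Item name", "Result"], [["ABO", "A"]])
```

The resulting string can then be written to the specified path, for example with `pathlib.Path.write_text` and an explicit `encoding="utf-8"`.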
The information processing method of the present invention extracts and organizes the information in medical electronic reports primarily by keywords, with lines as a supplementary cue. It achieves low-cost, efficient automated extraction, supports the recognition of medical electronic reports in a variety of layouts, covers the recognition and interpretation of many indicator items, and has a high recognition rate for irregular tables in electronic reports. The above method improves the recognition rate for test indicator tables, requires no configuration, and is effective and convenient for large-scale application.
The present invention has been further described above by way of specific embodiments. It should be understood that this specific description is not to be construed as limiting the spirit and scope of the invention; the various modifications that a person of ordinary skill in the art makes to the above embodiments after reading this specification all fall within the scope protected by the present invention.

Claims (10)

  1. An information processing method, characterized in that the extraction method comprises the following steps:
    obtaining the intermediate format data of a medical electronic report;
    sorting the intermediate format data by coordinate;
    finding the dividing lines in the medical electronic report;
    obtaining the table header in the medical electronic report, saving the header as keywords, and locating the header coordinates according to a preset data dictionary and the keywords;
    defining the table header;
    defining the rows and columns of the table;
    converting the extracted table data into a document of a preset format and outputting it.
  2. The information processing method according to claim 1, characterized in that the step of sorting the intermediate format data by coordinate is specifically:
    reordering the intermediate format data and its coordinates in page-first, then-row order.
  3. The information processing method according to claim 2, characterized in that the page-first, then-row order specifically comprises: dividing all of the intermediate format data by page and arranging it in ascending page number; within a single page, sorting the elements by ascending Y coordinate, dividing them into rows according to the vertical gaps between Y coordinates, and sorting the elements within each row by ascending X coordinate.
  4. The information processing method according to claim 1, characterized in that the step of finding the dividing lines in the medical electronic report is specifically:
    selecting the vertical and horizontal straight lines from the intermediate format data.
  5. The information processing method according to claim 1, characterized in that the content of the data dictionary is derived from the layouts of common medical electronic report test sheets, with the header content stored in the data dictionary as keyword information.
  6. The information processing method according to claim 5, characterized in that matching is performed row by row against the keywords stored in the data dictionary, the frequency with which keywords appear in the text blocks of each row is calculated, and the header coordinates are located at the row with the highest frequency; when the degree of matching is low, the rectangles formed by any separator lines present are calculated to help locate the starting point of the table.
  7. The information processing method according to claim 6, characterized in that the step of defining the table header is specifically:
    using the data dictionary, splitting or recombining the header row found in the previous step so that it matches the correct header columns.
  8. The information processing method according to claim 7, characterized in that the step of defining the rows and columns of the table is specifically:
    using a text block splitting algorithm and judging by the distance from the header, splitting or merging text blocks into cells consistent with the number of header columns.
  9. The information processing method according to claim 8, characterized in that, after the step of defining the rows and columns of the table, noise is removed from the table area, the noise being elements that are not table content.
  10. The information processing method according to claim 9, characterized in that, for a table without complete grid lines, when the text in a cell is too long and is split onto the next line, the cross-row text is merged.
CN201711463023.6A 2017-12-28 2017-12-28 An information processing method Pending CN108197216A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711463023.6A CN108197216A (en) 2017-12-28 2017-12-28 An information processing method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201711463023.6A CN108197216A (en) 2017-12-28 2017-12-28 An information processing method

Publications (1)

Publication Number Publication Date
CN108197216A true CN108197216A (en) 2018-06-22

Family

ID=62585471

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711463023.6A Pending CN108197216A (en) 2017-12-28 2017-12-28 A kind of method of information processing

Country Status (1)

Country Link
CN (1) CN108197216A (en)

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109582928A (en) * 2018-12-06 2019-04-05 万兴科技股份有限公司 PDF report data extracting method and device
CN110472209A (en) * 2019-07-04 2019-11-19 重庆金融资产交易所有限责任公司 Table generation method, device and computer equipment based on deep learning
CN110516048A (en) * 2019-09-02 2019-11-29 苏州朗动网络科技有限公司 The extracting method, equipment and storage medium of list data in pdf document
CN110765079A (en) * 2018-07-27 2020-02-07 国信优易数据有限公司 Table information searching method and device
CN111062259A (en) * 2019-11-25 2020-04-24 泰康保险集团股份有限公司 Form recognition method and device
CN111931750A (en) * 2020-10-12 2020-11-13 杭州太美星程医药科技有限公司 Identification method and identification device for laboratory test reports
CN112348027A (en) * 2020-11-09 2021-02-09 浙江太美医疗科技股份有限公司 Identification method and identification device for drug order
CN112348472A (en) * 2020-11-09 2021-02-09 浙江太美医疗科技股份有限公司 Laboratory checklist entry method, device and computer readable medium
CN112348017A (en) * 2020-11-09 2021-02-09 浙江太美医疗科技股份有限公司 Identification method and identification device for clinical test charging document

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101794280A (en) * 2010-03-11 2010-08-04 北京中科辅龙计算机技术股份有限公司 Form automatic generation method and system based on form template set
CN101976232A (en) * 2010-09-19 2011-02-16 深圳市万兴软件有限公司 Method for identifying data form in document and device thereof
CN104268127A (en) * 2014-09-22 2015-01-07 同方知网(北京)技术有限公司 Method for analyzing reading order of electronic layout file
CN105205040A (en) * 2015-09-14 2015-12-30 浪潮(北京)电子信息产业有限公司 Flex-based table displaying method and device thereof
CN105302626A (en) * 2015-11-09 2016-02-03 深圳市依伴数字科技有限公司 Analytic method of XPS (XML Paper Specification) structural data
CN105589841A (en) * 2016-01-15 2016-05-18 同方知网(北京)技术有限公司 Portable document format (PDF) document form identification method

Cited By (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110765079A (en) * 2018-07-27 2020-02-07 国信优易数据有限公司 Table information searching method and device
CN109582928A (en) * 2018-12-06 2019-04-05 万兴科技股份有限公司 PDF report data extracting method and device
CN110472209A (en) * 2019-07-04 2019-11-19 重庆金融资产交易所有限责任公司 Table generation method, device and computer equipment based on deep learning
CN110472209B (en) * 2019-07-04 2024-02-06 深圳同奈信息科技有限公司 Deep learning-based table generation method and device and computer equipment
WO2021042507A1 (en) * 2019-09-02 2021-03-11 苏州朗动网络科技有限公司 Method and device for extracting table data from pdf file, and storage medium
CN110516048A (en) * 2019-09-02 2019-11-29 苏州朗动网络科技有限公司 The extracting method, equipment and storage medium of list data in pdf document
CN111062259A (en) * 2019-11-25 2020-04-24 泰康保险集团股份有限公司 Form recognition method and device
CN111062259B (en) * 2019-11-25 2023-08-25 泰康保险集团股份有限公司 Table identification method and apparatus
CN111931750A (en) * 2020-10-12 2020-11-13 杭州太美星程医药科技有限公司 Identification method and identification device for laboratory test reports
CN111931750B (en) * 2020-10-12 2021-01-22 杭州太美星程医药科技有限公司 Identification method and identification device for laboratory test reports
CN112348472A (en) * 2020-11-09 2021-02-09 浙江太美医疗科技股份有限公司 Laboratory checklist entry method, device and computer readable medium
CN112348017A (en) * 2020-11-09 2021-02-09 浙江太美医疗科技股份有限公司 Identification method and identification device for clinical test charging document
CN112348472B (en) * 2020-11-09 2023-10-31 浙江太美医疗科技股份有限公司 Method, device and computer readable medium for inputting laboratory checklist
CN112348017B (en) * 2020-11-09 2024-01-23 浙江太美医疗科技股份有限公司 Identification method and identification device for clinical test charging receipt
CN112348027B (en) * 2020-11-09 2024-01-23 浙江太美医疗科技股份有限公司 Identification method and identification device for drug list
CN112348027A (en) * 2020-11-09 2021-02-09 浙江太美医疗科技股份有限公司 Identification method and identification device for drug order

Similar Documents

Publication Publication Date Title
CN108197216A (en) An information processing method
CN105930159B (en) A kind of method and system that the GUI code based on image generates
US9798925B2 (en) Method for identifying PDF document
CN105589841B (en) A kind of method of PDF document Table recognition
CN101770446B (en) Method and system for identifying form in layout file
CN104346319B (en) Method and system for inspecting document style
EP4235461A3 (en) Editing a database during preview of a virtual web page
CN103377177A (en) Method and device for identifying forms in digital format files
CN104268127A (en) Method for analyzing reading order of electronic layout file
CN102081732B (en) Method and system for recognizing format template
US20150095769A1 (en) Layout Analysis Method And System
CN110705515A (en) Hospital paper archive filing method and system based on OCR character recognition
CN104598577A (en) Extraction method for webpage text
CN104699785A (en) Paper similarity detection method
CN110516221A (en) Extract method, equipment and the storage medium of chart data in PDF document
CN106326194A (en) Directory generation method and apparatus applied to file format conversion scene
CN104636428A (en) Trademark recommendation method and device
CN104751148A (en) Method for recognizing scientific formulas in layout file
CN104268283A (en) Method for automatically analyzing Internet web page
CN104951430A (en) Product feature tag extraction method and device
CN104915420A (en) Knowledge base data processing method and knowledge base data processing system
CN106407392A (en) A marking language-based node mapping relationship extracting method and system
CN104268545A (en) Method for table area recognition and content rasterization in electronic document layout files
CN106777281A (en) For improving web crawlers stability, the data processing method of availability and device
CN106294525A (en) A kind of well logging columnar section information extracting method and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 518000 Wensheng center, Wenjin square, East Wenjin Road, Luohu District, Shenzhen, Guangdong, 2001

Applicant after: Shenzhen juding Medical Co.,Ltd.

Address before: 518000 Wensheng center, Wenjin square, East Wenjin Road, Luohu District, Shenzhen, Guangdong, 2001

Applicant before: SHENZHEN JUDING MEDICAL DEVICE Co.,Ltd.