CN108228553A - Information processing method - Google Patents
Information processing method
- Publication number
- CN108228553A (application number CN201711463021.7A)
- Authority
- CN
- China
- Prior art keywords
- row
- standard text
- segmentation
- character
- word
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/10—Text processing
- G06F40/189—Automatic justification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Machine Translation (AREA)
Abstract
The present invention relates to the technical field of text content extraction, and in particular to an information processing method comprising the following steps: extracting the coordinates of all characters in a document file, and dividing the characters into rows according to the coordinates of each character; removing time-format content, spaces, and colons from the text of each row to obtain the standard text of each row; calculating the gaps between adjacent characters in the standard text of each row, and grouping the standard text of each row according to the gaps; segmenting the standard text of each row according to the type of each character in it, to obtain a row segmentation result; and aggregating and outputting the row segmentation results of all rows. The method eliminates the tedious configuration otherwise needed for field-value segmentation, achieves zero-configuration segmentation, and adapts to various text layouts, so that structured extraction becomes simple and highly adaptable.
Description
Technical field
The present invention relates to the technical field of text content extraction, and in particular to an information processing method.
Background
Electronic medical reports are mostly PDF or XPS files and contain rich personal and medical-record data about patients. An XPS document, like a PDF document, is a read-only document format that stores data in structured form; when a computer reads the document content, the document must be parsed and its content extracted. In current structured extraction from documents, multiple field values are mostly segmented by template matching or regular-expression extraction. Both approaches require a separate template or regular expression to be configured for the text content and layout of each document, so the procedure is tedious and adapts poorly.
Summary of the invention
In view of the problems in the prior art, the present invention provides a method for dividing the text content of PDF and XPS files into blocks.
An information processing method includes the following steps:
Extracting the coordinates of all characters in a document file, and dividing the characters into rows according to the coordinates of each character;
Removing time-format content, spaces, and colons from the text of each row to obtain the standard text of each row;
Calculating the gaps between adjacent characters in the standard text of each row, and grouping the standard text of each row according to the gaps;
Segmenting the standard text of each row according to the type of each character in it, to obtain a row segmentation result;
Aggregating and outputting the row segmentation results of all rows.
Further, the segmentation processing is specifically:
When the standard text contains both Chinese and English, Chinese words and English words are separated according to the gaps between characters and the local character width.
Further, the segmentation processing is specifically:
When the gap between two adjacent characters in the standard text lies between a first preset gap and a second preset gap, judging whether the overlapping range of the two adjacent characters exceeds a preset character width;
If so, the two adjacent characters belong to the same block;
If not, the two adjacent characters are segmented.
Further, the segmentation processing is specifically:
When the height difference between two adjacent characters exceeds a preset height ratio, the two adjacent characters are segmented.
Further, the segmentation processing is specifically:
When a character in the standard text lies inside brackets, the two adjacent characters are judged to belong to the same block.
Further, the row division is specifically:
Rows that overlap vertically, or that hang at the corner of another row, are classified as the same row.
Further, before the step of removing time-format content, spaces, and colons from the text of each row to obtain the standard text of each row, the method also includes the following step:
Extracting the time-format content from the text of each row with preset regular expressions.
The information processing method of the present invention works at the character level: it divides the characters into rows using each character's coordinates and, after obtaining the standard text, groups it by calculating the gaps between characters and then segments each row. The method eliminates the tedious configuration otherwise needed for field-value segmentation, achieves zero-configuration segmentation, and adapts to various text layouts, so that structured extraction becomes simple and highly adaptable.
Description of the drawings
To explain the embodiments of the present invention or the technical solutions of the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention; those of ordinary skill in the art can derive other drawings from them without creative effort.
Fig. 1 is a flowchart of the information processing method of the present invention.
Detailed description of the embodiments
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings. Obviously, the described embodiments are only some of the embodiments of the present invention, not all of them. All other embodiments obtained by those skilled in the art on the basis of these embodiments without creative effort fall within the protection scope of the present invention.
As shown in Fig. 1, the information processing method of an embodiment of the present invention includes the following steps:
Step S01: extract the coordinates of all characters in the document file, and divide the characters into rows according to the coordinates of each character;
Step S02: remove time-format content, spaces, and colons from the text of each row to obtain the standard text of each row;
Step S03: calculate the gaps between adjacent characters in the standard text of each row, and group the standard text of each row according to the gaps;
Step S04: segment the standard text of each row according to the type of each character in it, to obtain a row segmentation result;
Step S05: aggregate and output the row segmentation results of all rows.
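The gap calculation and grouping of step S03 can be illustrated with a minimal Python sketch. The `Char` record, the functions `gaps` and `group_by_gap`, and the `max_gap_ratio` threshold are hypothetical names introduced here; the source does not state how large a gap must be before a new group starts, so the threshold of one character width is purely an assumption. Characters are assumed to be already sorted from left to right within the row.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Char:
    text: str
    x0: float  # left edge of the character box
    x1: float  # right edge of the character box


def gaps(row: List[Char]) -> List[float]:
    """Horizontal gap between each pair of adjacent characters (negative means overlap)."""
    return [b.x0 - a.x1 for a, b in zip(row, row[1:])]


def group_by_gap(row: List[Char], max_gap_ratio: float = 1.0) -> List[str]:
    """Start a new group when a gap exceeds max_gap_ratio times the width
    of the preceding character (an assumed threshold, not from the source)."""
    if not row:
        return []
    groups = [row[0].text]
    for prev, cur, gap in zip(row, row[1:], gaps(row)):
        width = prev.x1 - prev.x0
        if gap > max_gap_ratio * width:
            groups.append(cur.text)
        else:
            groups[-1] += cur.text
    return groups


row = [Char("姓", 0, 10), Char("名", 11, 21), Char("张", 50, 60), Char("三", 61, 71)]
print(group_by_gap(row))
```

In this made-up row the wide gap between the second and third characters separates a field label from its value, which is the kind of grouping the step describes.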
Specifically, the segmentation processing includes: when the standard text contains both Chinese and English, separating Chinese words from English words according to the gaps between characters and the local character width.
Specifically, the segmentation processing is: when the gap between two adjacent characters in the standard text lies between a first preset gap and a second preset gap, judging whether the overlapping range of the two adjacent characters exceeds a preset character width; if so, the two adjacent characters belong to the same block; if not, the two adjacent characters are segmented. When characters in the standard text overlap horizontally, that is, when the gap between characters lies between minus half a character width and zero, the two characters overlap horizontally and the overlapping range does not exceed half a character width; the two characters are then considered to belong to one block and are not segmented. If the overlapping range exceeds half a character width, they are segmented.
Specifically, the segmentation processing is: when the height difference between two adjacent characters exceeds a preset height ratio, the two adjacent characters are segmented; this handles characters of inconsistent size. If the heights of the two characters differ by more than 20%, they are considered not to belong to the same text block and are segmented.
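The height rule can be sketched in a few lines of Python. The function names `height_differs` and `split_by_height` are hypothetical, and the source does not say what the 20% is measured against; the sketch assumes it is relative to the taller of the two characters.

```python
from typing import List, Tuple


def height_differs(h_prev: float, h_cur: float, max_ratio: float = 0.20) -> bool:
    """True when the heights differ by more than max_ratio of the taller one
    (the reference height is an assumption)."""
    taller = max(h_prev, h_cur)
    if taller <= 0:
        return False
    return abs(h_prev - h_cur) / taller > max_ratio


def split_by_height(chars: List[Tuple[str, float]], max_ratio: float = 0.20) -> List[str]:
    """chars is a list of (text, height); cut wherever adjacent heights differ too much."""
    blocks: List[str] = []
    for i, (text, height) in enumerate(chars):
        if i > 0 and not height_differs(chars[i - 1][1], height, max_ratio):
            blocks[-1] += text
        else:
            blocks.append(text)
    return blocks


print(split_by_height([("报", 12), ("告", 12), ("单", 12), ("注", 8)]))
```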
Specifically, the segmentation processing is: when a character in the standard text lies inside brackets, the two adjacent characters are judged to belong to the same block.
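One way to read the bracket rule is that any cut proposed by the other rules is discarded if it falls inside a pair of brackets. The Python sketch below follows that reading; the function `drop_cuts_inside_brackets`, the representation of cuts as indices, and the set of bracket characters are all assumptions made for illustration.

```python
from typing import Set

OPEN = "(（[【"    # assumed set of opening brackets
CLOSE = ")）]】"   # assumed set of closing brackets


def drop_cuts_inside_brackets(text: str, cuts: Set[int]) -> Set[int]:
    """A cut index i means 'cut between text[i-1] and text[i]'.
    Cuts that fall inside a bracket pair are discarded."""
    kept: Set[int] = set()
    depth = 0
    for i, ch in enumerate(text):
        if i in cuts and depth == 0:
            kept.add(i)
        if ch in OPEN:
            depth += 1
        elif ch in CLOSE and depth > 0:
            depth -= 1
    return kept


# The cut at index 5 lies inside "(mmol/L)" and is dropped.
print(sorted(drop_cuts_inside_brackets("血糖(mmol/L)偏高", {2, 5, 10})))
```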
For example, when handling mixed Chinese and English, take the text "DR diagnosis report form": the gap between "D" and "R" differs from the gap between "R" and the first Chinese character of "diagnosis". However, because D and R are English characters and the distance between them is less than 10% of the local width, they are considered to belong to one word. That word is then treated as a whole when examining the gaps within the following "diagnosis report form"; these gaps are found to be all equal, so no segmentation is performed in this case.
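The merging of closely spaced English letters into one word can be sketched as follows. The `Glyph` record and the `merge_english` function are hypothetical, "local width" is assumed to mean the width of the current character, and the coordinates in the example are invented.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Glyph:
    text: str
    x0: float  # left edge
    x1: float  # right edge


def is_latin(ch: str) -> bool:
    """True for an ASCII letter."""
    return ch.isascii() and ch.isalpha()


def merge_english(row: List[Glyph], ratio: float = 0.10) -> List[Glyph]:
    """Merge adjacent English letters whose gap is below ratio * local width
    (local width is assumed to be the width of the current letter)."""
    merged: List[Glyph] = []
    for g in row:
        prev = merged[-1] if merged else None
        if (prev is not None and is_latin(prev.text[-1]) and is_latin(g.text)
                and g.x0 - prev.x1 < ratio * (g.x1 - g.x0)):
            merged[-1] = Glyph(prev.text + g.text, prev.x0, g.x1)
        else:
            merged.append(Glyph(g.text, g.x0, g.x1))
    return merged


row = [Glyph("D", 0, 10), Glyph("R", 10.5, 20.5), Glyph("诊", 24, 34), Glyph("断", 38, 48)]
print([g.text for g in merge_english(row)])
```

After merging, "DR" behaves as a single unit, so the remaining gaps in the row can be compared with one another as the example describes.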
For example, when handling horizontal overlap: when -0.5 × character width < gap < 0 (where gap denotes the gap between adjacent characters), the two characters overlap horizontally and the overlapping range does not exceed half a character width. In this case the two characters are considered to belong to one block and are not segmented; if the overlap exceeds half a character width, they are segmented.
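The overlap condition translates directly into a small predicate. The function name `overlap_decision` is hypothetical, and the treatment of an overlap of exactly half a character width, which the text leaves open, is an assumption (it is cut here).

```python
from typing import Optional


def overlap_decision(gap: float, char_width: float) -> Optional[bool]:
    """Return True to keep two characters in one block, False to cut,
    and None when the characters do not overlap (the rule does not apply)."""
    if gap >= 0:
        return None
    # Keep together only when -0.5 * width < gap < 0.
    return gap > -0.5 * char_width


print(overlap_decision(-2, 10))  # small overlap
print(overlap_decision(-6, 10))  # overlap larger than half a character width
```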
In other cases, when a text row contains no more than 2 characters, no segmentation is performed.
Specifically, the row division is: rows that overlap vertically, or that hang at the corner of another row, are classified as the same row.
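A minimal Python sketch of the vertical-overlap part of the row division follows. The `Box` tuple layout and the `split_rows` function are hypothetical, the y axis is assumed to grow downward, and the source does not spell out how a row that "hangs at a corner" is detected, so only the overlap case is covered.

```python
from typing import Dict, List, Tuple

Box = Tuple[str, float, float, float]  # (text, x0, y_top, y_bottom), y grows downward


def split_rows(chars: List[Box]) -> List[str]:
    """Put characters whose vertical ranges overlap into the same row,
    then order each row from left to right."""
    rows: List[Dict] = []
    for text, x0, top, bottom in sorted(chars, key=lambda c: c[2]):
        for row in rows:
            if top < row["bottom"] and bottom > row["top"]:  # vertical ranges overlap
                row["items"].append((x0, text))
                row["top"] = min(row["top"], top)
                row["bottom"] = max(row["bottom"], bottom)
                break
        else:
            rows.append({"top": top, "bottom": bottom, "items": [(x0, text)]})
    return ["".join(t for _, t in sorted(r["items"])) for r in rows]


# The lowered "2" overlaps the first row vertically and is merged into it.
chars = [("H", 0, 0, 10), ("2", 10, 6, 14), ("O", 16, 0, 10), ("水", 0, 20, 30)]
print(split_rows(chars))
```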
Specifically, before the step of removing time-format content, spaces, and colons from the text of each row to obtain the standard text of each row, the method also includes a step of extracting the time-format content from the text of each row with preset regular expressions.
For example, the preset regular expressions for extracting time are as follows:
'(20\d\d-\d{1,2}-\d{1,2}.*:\d\d:\d\d)' # 2017-03-05 format;
u':(\d\d-\d{1,2}-\d{1,2}.*:\d\d:\d\d)' # "collection date:16-08-16 12:02" format;
u'(20\d\d年\d{1,2}月\d{1,2}.*:\d\d:\d\d)' # 2017年3月4日 (4 March 2017) format; the pattern contains Chinese, so Unicode encoding is used;
'(20\d\d/\d{1,2}/\d{1,2}.*:\d\d:\d\d)' # 2017/02/27 format;
'(\d{1,2}/\d{1,2}/20\d\d.*:\d\d:\d\d)' # 27/02/2017 format;
'(20\d{6}\d\d:\d\d:\d{0,2})' # 20160329 08:10 format;
Here, the text before # is the regular expression and the text after # is the accompanying comment.
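The time extraction and the subsequent clean-up of step S02 can be sketched in Python as below. The patterns are four of the expressions listed above, written on the assumption that each one ends in `\d\d` and that the Chinese date pattern uses 年 and 月 as separators; `extract_times` and `to_standard_text` are hypothetical helper names, and treating the full-width colon "：" like the ASCII colon is also an assumption.

```python
import re

# Assumed readings of four of the listed patterns.
TIME_PATTERNS = [
    r"(20\d\d-\d{1,2}-\d{1,2}.*:\d\d:\d\d)",
    r"(20\d\d年\d{1,2}月\d{1,2}.*:\d\d:\d\d)",
    r"(20\d\d/\d{1,2}/\d{1,2}.*:\d\d:\d\d)",
    r"(\d{1,2}/\d{1,2}/20\d\d.*:\d\d:\d\d)",
]


def extract_times(line: str):
    """Return (time strings found, line with those strings removed)."""
    found = []
    for pat in TIME_PATTERNS:
        found.extend(m.group(1) for m in re.finditer(pat, line))
        line = re.sub(pat, "", line)
    return found, line


def to_standard_text(line: str) -> str:
    """Remove time-format content, whitespace, and colons from one row."""
    _, rest = extract_times(line)
    return re.sub(r"[\s:：]", "", rest)


line = "报告时间: 2017-03-05 12:02:33 姓名: 张三"
print(extract_times(line)[0])
print(to_standard_text(line))
```

Because `.*` is greedy, a row that contains two timestamps would be matched from the first date to the last time, including everything in between; a production version might prefer a stricter separator between date and time.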
The information processing method of the present invention works at the character level: it divides the characters into rows using each character's coordinates and, after obtaining the standard text, groups it by calculating the gaps between characters and then segments each row. The method eliminates the tedious configuration otherwise needed for field-value segmentation, achieves zero-configuration segmentation, and adapts to various text layouts, so that structured extraction becomes simple and highly adaptable.
The present invention has been further described above through specific embodiments. It should be understood that the specific description here is not to be construed as limiting the spirit and scope of the invention; modifications that a person of ordinary skill in the art makes to the above embodiments after reading this specification all fall within the scope protected by the present invention.
Claims (7)
- 1. An information processing method, characterized by comprising the following steps: extracting the coordinates of all characters in a document file, and dividing the characters into rows according to the coordinates of each character; removing time-format content, spaces, and colons from the text of each row to obtain the standard text of each row; calculating the gaps between adjacent characters in the standard text of each row, and grouping the standard text of each row according to the gaps; segmenting the standard text of each row according to the type of each character in it, to obtain a row segmentation result; and aggregating and outputting the row segmentation results of all rows.
- 2. The information processing method according to claim 1, characterized in that the segmentation processing is specifically: when the standard text contains both Chinese and English, separating Chinese words from English words according to the gaps between characters and the local character width.
- 3. The information processing method according to claim 1, characterized in that the segmentation processing is specifically: when the gap between two adjacent characters in the standard text lies between a first preset gap and a second preset gap, judging whether the overlapping range of the two adjacent characters exceeds a preset character width; if so, the two adjacent characters belong to the same block; if not, the two adjacent characters are segmented.
- 4. The information processing method according to claim 1, characterized in that the segmentation processing is specifically: when the height difference between two adjacent characters exceeds a preset height ratio, the two adjacent characters are segmented.
- 5. The information processing method according to claim 1, characterized in that the segmentation processing is specifically: when a character in the standard text lies inside brackets, the two adjacent characters are judged to belong to the same block.
- 6. The information processing method according to claim 1, characterized in that the row division is specifically: rows that overlap vertically, or that hang at the corner of another row, are classified as the same row.
- 7. The information processing method according to claim 1, characterized in that, before the step of removing time-format content, spaces, and colons from the text of each row to obtain the standard text of each row, the method further comprises the following step: extracting the time-format content from the text of each row with preset regular expressions.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201711463021.7A CN108228553A (en) | 2017-12-28 | 2017-12-28 | A kind of method of information processing |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201711463021.7A CN108228553A (en) | 2017-12-28 | 2017-12-28 | A kind of method of information processing |
Publications (1)
Publication Number | Publication Date |
---|---|
CN108228553A true CN108228553A (en) | 2018-06-29 |
Family
ID=62645673
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201711463021.7A Pending CN108228553A (en) | 2017-12-28 | 2017-12-28 | A kind of method of information processing |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108228553A (en) |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1410943A (en) * | 2001-09-27 | 2003-04-16 | 佳能株式会社 | Character image line selecting method and device and character image identifying method and device |
CN102332096A (en) * | 2011-10-17 | 2012-01-25 | 中国科学院自动化研究所 | Video caption text extraction and identification method |
CN102456136A (en) * | 2010-10-29 | 2012-05-16 | 方正国际软件(北京)有限公司 | Image-text splitting method and system |
CN102567300A (en) * | 2011-12-29 | 2012-07-11 | 方正国际软件有限公司 | Picture document processing method and device |
US20140105496A1 (en) * | 2012-10-17 | 2014-04-17 | Cognex Corporation | System and Method for Selecting Segmentation Parameters for Optical Character Recognition |
CN105302626A (en) * | 2015-11-09 | 2016-02-03 | 深圳市依伴数字科技有限公司 | Analytic method of XPS (XML Paper Specification) structural data |
Non-Patent Citations (1)
Title |
---|
苗红霞: "一种身份证图像字符分割的改进方法", 《微处理机》 * |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
EP1739574B1 (en) | Method of identifying words in an electronic document | |
JPH0798765A (en) | Direction-detecting method and image analyzer | |
KR20150128921A (en) | Detection and reconstruction of east asian layout features in a fixed format document | |
US20200364452A1 (en) | A heuristic method for analyzing content of an electronic document | |
CN104516859B (en) | A kind of word modification method and system | |
CN112417826B (en) | PDF online editing method and device, electronic equipment and readable storage medium | |
JP2013254321A (en) | Image processing apparatus, image processing method, and program | |
CN106776527B (en) | Electronic book data display method and device and terminal equipment | |
EP2410487A1 (en) | Method for automatically modifying a graphics feature to comply with a resolution limit | |
CN108228553A (en) | A kind of method of information processing | |
KR101794169B1 (en) | Personal data detecting and masking system and method based on printed position of hwp file | |
US20150331837A1 (en) | Text processing method and mobile terminal | |
JP5715172B2 (en) | Document display device, document display method, and document display program | |
CN115712601A (en) | Method for reading fixed-length files in batch based on springbatch | |
JP4770285B2 (en) | Image processing apparatus and control program therefor | |
CN104412277B (en) | Device and method for comparing two documents containing graphic elements and text elements | |
CN111460792B (en) | Auxiliary editing and correcting method and device and storage medium | |
KR20150085282A (en) | Operating method of terminal for correcting electronic document | |
CN117235345B (en) | Open format document OFD searching method and device and electronic equipment | |
CN108170651A (en) | A kind of method of information processing | |
JPS63221457A (en) | Document shaping device | |
CN113033164A (en) | PDF file information analysis method and device | |
CN104424184B (en) | Generate the method and system of font character library | |
EP2891989A1 (en) | System and method for converting an electronic document from a paginated format to a non-paginated format | |
JP2016024495A (en) | Information processing device, information processing method, and program |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
CB02 | Change of applicant information | Address after: 518000 Wensheng center, Wenjin square, East Wenjin Road, Luohu District, Shenzhen, Guangdong, 2001 Applicant after: Shenzhen juding Medical Co.,Ltd. Address before: 518000 Wensheng center, Wenjin square, East Wenjin Road, Luohu District, Shenzhen, Guangdong, 2001 Applicant before: SHENZHEN JUDING MEDICAL DEVICE Co.,Ltd. |
RJ01 | Rejection of invention patent application after publication | Application publication date: 20180629 |