CN108228553A - A method of information processing - Google Patents

Publication number
CN108228553A
Authority
CN
China
Prior art keywords
row
standard text
segmentation
character
word
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201711463021.7A
Other languages
Chinese (zh)
Inventor
朱光强
龙汉
王海生
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Juding Medical Device Co., Ltd.
Original Assignee
Shenzhen Juding Medical Device Co., Ltd.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2017-12-28
Filing date
2017-12-28
Publication date
2018-06-29
Application filed by Shenzhen Juding Medical Device Co., Ltd.
Priority to CN201711463021.7A
Publication of CN108228553A

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/10: Text processing
    • G06F40/189: Automatic justification
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/20: Natural language analysis
    • G06F40/279: Recognition of textual entities
    • G06F40/289: Phrasal analysis, e.g. finite state techniques or chunking

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

The present invention relates to the technical field of text content extraction, and in particular to a method of information processing comprising the following steps: extracting the coordinates of all characters in a document file, and splitting the characters into rows according to the coordinates of each character; removing time-format content, spaces and colons from the text of each row to obtain the standard text of each row; calculating the gap between adjacent characters in the standard text of each row, and grouping the standard text of each row according to the gaps; segmenting the standard text of each row according to the type of each character in it to obtain a row segmentation result; and aggregating the row segmentation results output for every row. The method removes the cumbersome configuration otherwise needed for field-value segmentation, achieves zero-configuration segmentation, and adapts to various text layouts, so that structured extraction becomes simple and adaptable.

Description

A method of information processing
Technical field
The present invention relates to the technical field of text content extraction, and in particular to a method of information processing.
Background
Electronic medical reports are mostly PDF or XPS files and contain rich personal and medical-record data about patients. An XPS document is similar to a PDF document: it is a read-only document format that stores data in a structured form, so when a computer reads the document content, the content must be parsed and extracted accordingly. At present, structured extraction from documents mostly splits multiple field values by template matching or regular-expression extraction. Both approaches require a separate template or regular expression to be configured for the text content and layout of each document, which is cumbersome and adapts poorly.
Summary of the invention
In view of the problems in the prior art, the present invention provides a method for dividing the text content of PDF and XPS files into blocks.
A method of information processing includes the following steps:
Extracting the coordinates of all characters in a document file, and splitting the characters into rows according to the coordinates of each character;
Removing time-format content, spaces and colons from the text of each row to obtain the standard text of each row;
Calculating the gap between adjacent characters in the standard text of each row, and grouping the standard text of each row according to the gaps;
Segmenting the standard text of each row according to the type of each character in it to obtain a row segmentation result;
Aggregating the row segmentation results output for every row.
Further, the segmentation processing is specifically:
When the standard text contains both Chinese and English, splitting Chinese words from English words according to the gap between characters and the local character width.
Further, the segmentation processing is specifically:
When the gap between two adjacent characters in the standard text lies between a first preset gap and a second preset gap, judging whether the overlap range of the two adjacent characters exceeds a preset character width;
If so, the two adjacent characters belong to the same block;
If not, the two adjacent characters are split.
Further, the segmentation processing is specifically:
When the height difference between two adjacent characters exceeds a preset height ratio, the two adjacent characters are split.
Further, the segmentation processing is specifically:
When characters in the standard text are located inside brackets, the two adjacent characters are judged to belong to the same block.
Further, the row splitting is specifically:
Rows that overlap vertically or hang at a corner are classified as the same row.
Further, before the step of removing time-format content, spaces and colons from the text of each row to obtain the standard text of each row, the method further includes the following step:
Extracting the time-format content from the text of each row with a preset regular expression.
The method of information processing of the present invention takes the character as its unit, splits the characters into rows according to their coordinates, obtains the standard text, and then groups and segments each row by calculating the gaps between characters. The method removes the cumbersome configuration otherwise needed for field-value segmentation, achieves zero-configuration segmentation, and adapts to various text layouts, so that structured extraction becomes simple and adaptable.
Description of the drawings
To illustrate the technical solutions of the embodiments of the present invention or of the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. The drawings described below are only some embodiments of the present invention; those of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a flowchart of the method of information processing of the present invention.
Specific embodiments
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings. The described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those skilled in the art on the basis of these embodiments without creative effort fall within the protection scope of the present invention.
The method of information processing of an embodiment of the present invention, as shown in Fig. 1, includes the following steps:
Step S01: extract the coordinates of all characters in a document file, and split the characters into rows according to the coordinates of each character;
Step S02: remove time-format content, spaces and colons from the text of each row to obtain the standard text of each row;
Step S03: calculate the gap between adjacent characters in the standard text of each row, and group the standard text of each row according to the gaps;
Step S04: segment the standard text of each row according to the type of each character in it to obtain a row segmentation result;
Step S05: aggregate the row segmentation results output for every row.
Specifically, the segmentation processing includes: when the standard text contains both Chinese and English, splitting Chinese words from English words according to the gap between characters and the local character width.
Specifically, the segmentation processing is: when the gap between two adjacent characters in the standard text lies between a first preset gap and a second preset gap, judging whether the overlap range of the two adjacent characters exceeds a preset character width; if so, the two adjacent characters belong to the same block; if not, the two adjacent characters are split. When characters in the standard text overlap laterally, a gap between minus half a character width and zero indicates that the two characters overlap laterally and that the overlap does not exceed half a character width; the two characters are then considered to belong to one block and are not split. If the overlap exceeds half a character width, they are split.
Specifically, the segmentation processing is: when the height difference between two adjacent characters exceeds a preset height ratio, the two adjacent characters are split; this handles inconsistent character sizes. If the height difference between the two characters exceeds 20%, they are considered not to belong to the same text block and are split.
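The height rule can be sketched as a small predicate. The text gives the 20% figure but does not say which height the percentage is taken of; measuring the difference relative to the taller character is an assumption made here, and height_differs is a hypothetical name.

```python
def height_differs(h_a, h_b, ratio=0.20):
    """Return True when two character heights differ by more than `ratio`.

    Assumption: the difference is measured relative to the taller character.
    """
    taller = max(h_a, h_b)
    if taller <= 0:
        return False  # degenerate boxes: no basis for a split
    return abs(h_a - h_b) / taller > ratio


print(height_differs(10, 12))  # difference is 2/12 of the taller height
print(height_differs(10, 14))  # difference is 4/14 of the taller height
```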
Specifically, the segmentation processing is: when characters in the standard text are located inside brackets, the two adjacent characters are judged to belong to the same block.
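One way to apply the bracket rule is to mark, for each boundary between adjacent characters, whether a split is forbidden because the boundary lies inside a bracket pair. The sketch below is an assumption about how this might be done; the bracket sets and the name protected_boundaries are hypothetical.

```python
OPEN = "([{（【"
CLOSE = ")]}）】"


def protected_boundaries(text):
    """For each boundary between text[i] and text[i+1], return True when the
    boundary lies inside a bracket pair and therefore must not be split."""
    depth = 0
    result = []
    for ch in text[:-1]:
        if ch in OPEN:
            depth += 1
        elif ch in CLOSE and depth > 0:
            depth -= 1
        # depth > 0 after reading text[i] means text[i+1] is still inside
        # the brackets (or is the closing bracket itself)
        result.append(depth > 0)
    return result


print(protected_boundaries("a(bc)d"))
```

A later segmentation pass would then skip every boundary marked True, whatever the gap or height rules say.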
For example, when handling mixed Chinese and English text such as "DR diagnosis report form": in this text, the gap between "D" and "R" differs from the gap between "R" and the first Chinese character of "diagnosis". However, because "DR" consists of English characters and the distance between them is less than 10% of the local width, they are considered to belong to one word. The word is then treated as a whole when judging the gaps within the following "diagnosis report form"; these gaps are found to be all the same, so no split is made in this case.
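The English-word rule in this example can be sketched as follows. The 10% figure comes from the text; taking the "local width" to be the mean character width of the token being extended is an assumption, and is_latin and merge_latin_runs are hypothetical names. The coordinates in the example are invented.

```python
def is_latin(ch):
    """True for an ASCII letter, used here as a stand-in for 'English character'."""
    return ch.isascii() and ch.isalpha()


def merge_latin_runs(chars, ratio=0.10):
    """Merge adjacent English letters into one token when the gap between
    them is below ratio * local width.

    chars: list of (text, x0, x1) tuples, sorted left to right.
    Assumption: local width = mean character width of the previous token.
    """
    tokens = []
    for text, x0, x1 in chars:
        if tokens:
            p_text, p_x0, p_x1 = tokens[-1]
            local_width = (p_x1 - p_x0) / len(p_text)
            if is_latin(p_text[-1]) and is_latin(text) and (x0 - p_x1) < ratio * local_width:
                tokens[-1] = (p_text + text, p_x0, x1)  # extend the English word
                continue
        tokens.append((text, x0, x1))
    return tokens


# Hypothetical boxes: "D" and "R" almost touch, the Chinese characters are spaced evenly.
chars = [("D", 0, 8), ("R", 8.5, 16.5), ("诊", 20, 32), ("断", 36, 48)]
print([t[0] for t in merge_latin_runs(chars)])
```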
For example, when handling lateral overlap: when -0.5 × character width < gap < 0 (where gap denotes the gap between adjacent characters), the two characters overlap laterally and the overlap does not exceed half a character width. In this case the two characters are considered to belong to one block and are not split. If the overlap exceeds half a character width, they are split.
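The inequality above translates directly into a small decision function. The name overlap_decision and the string labels are hypothetical; the text does not say what happens when the overlap is exactly half a character width, so treating that case as a split is an assumption.

```python
def overlap_decision(gap, char_width):
    """Classify a pair of adjacent characters by their horizontal gap.

    gap < 0 means the boxes overlap laterally.
    """
    if gap >= 0:
        return "no-overlap"   # handled by the other gap rules
    if gap > -0.5 * char_width:
        return "keep"         # overlap under half a width: same block
    return "split"            # overlap of half a width or more (boundary is an assumption)


print(overlap_decision(-2, 10))
print(overlap_decision(-6, 10))
```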
In other cases, when a line of text contains no more than 2 characters, it is not split.
Specifically, the row splitting is: rows that overlap vertically or hang at a corner are classified as the same row.
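A minimal sketch of the row splitting, assuming each character comes with a box whose y coordinate increases downward. Only the vertical-overlap case is modelled; the "hang at a corner" case is not, since the text does not define it. The function name and tuple layout are hypothetical.

```python
def split_into_rows(boxes):
    """Group character boxes into rows by vertical overlap.

    boxes: list of (text, x0, y0, y1) with y0 < y1 and y increasing downward.
    A box joins the current row when its top lies above the row's bottom edge.
    """
    rows = []
    for box in sorted(boxes, key=lambda b: b[2]):       # top to bottom
        _, _, y0, y1 = box
        if rows and y0 < rows[-1]["bottom"]:
            rows[-1]["items"].append(box)
            rows[-1]["bottom"] = max(rows[-1]["bottom"], y1)
        else:
            rows.append({"bottom": y1, "items": [box]})
    # within each row, order the characters left to right
    return ["".join(b[0] for b in sorted(r["items"], key=lambda b: b[1]))
            for r in rows]


# Hypothetical boxes: "B" sits slightly lower than "A" but still overlaps it.
boxes = [("B", 12, 1, 11), ("A", 0, 0, 10), ("C", 0, 20, 30), ("D", 12, 21, 31)]
print(split_into_rows(boxes))
```

Because the row's bottom edge grows as boxes are added, a chain of slightly offset boxes can pull neighbouring lines together; a real implementation would probably need a stricter overlap measure.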
Specifically, before the step of removing time-format content, spaces and colons from the text of each row to obtain the standard text of each row, the method further includes a step of extracting the time-format content from the text of each row with a preset regular expression.
Specifically, rows that overlap vertically or hang at a corner are classified as the same row.
Specifically, the time-format content in the file is extracted with preset regular expressions.
For example, the preset regular expressions for extracting times are as follows:
'(20\d\d-\d{1,2}-\d{1,2}.*:\d\d:\d\d)'  # 2017-03-05 format;
u':(\d\d-\d{1,2}-\d{1,2}.*:\d\d:\d\d)'  # "collection date: 16-08-16 12:02" format;
u'(20\d\d年\d{1,2}月\d{1,2}.*:\d\d:\d\d)'  # 2017年3月4日 (4 March 2017) format; the pattern contains Chinese, so Unicode encoding is used;
'(20\d\d/\d{1,2}/\d{1,2}.*:\d\d:\d\d)'  # 2017/02/27 format;
'(\d{1,2}/\d{1,2}/20\d\d.*:\d\d:\d\d)'  # 27/02/2017 format;
'(20\d{6}\d\d:\d\d:\d{0,2})'  # 20160329 08:10 format;
Here, the text before each # is the regular expression and the text after it is the associated comment.
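The time extraction and the subsequent clean-up into standard text might be combined as in the sketch below. It uses two of the date patterns listed above, read here as ending in `:\d\d:\d\d`; the function name extract_time, the sample lines, and the choice to stop at the first matching pattern are assumptions for illustration.

```python
import re

# Two of the listed patterns (dash-separated and slash-separated dates).
TIME_PATTERNS = [
    re.compile(r"(20\d\d-\d{1,2}-\d{1,2}.*:\d\d:\d\d)"),
    re.compile(r"(20\d\d/\d{1,2}/\d{1,2}.*:\d\d:\d\d)"),
]


def extract_time(line):
    """Pull the first time-format match out of a row, then remove spaces and
    colons (ASCII and full-width) from what remains.

    Returns (time_text or None, standard_text).
    """
    found = None
    for pattern in TIME_PATTERNS:
        m = pattern.search(line)
        if m:
            found = m.group(1)
            line = line[:m.start()] + line[m.end():]  # cut the time out of the row
            break
    standard = re.sub(r"[\s:：]", "", line)
    return found, standard


print(extract_time("Sampled: 2017/02/27 08:10:00 Name: Li"))
```

Extracting the time first matters because its internal colons and space would otherwise be stripped by the clean-up. Note that the greedy `.*` in these patterns runs to the last `:dd:dd` on the row, so a row holding two timestamps would be captured as one span.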
The method of information processing of the present invention takes the character as its unit, splits the characters into rows according to their coordinates, obtains the standard text, and then groups and segments each row by calculating the gaps between characters. The method removes the cumbersome configuration otherwise needed for field-value segmentation, achieves zero-configuration segmentation, and adapts to various text layouts, so that structured extraction becomes simple and adaptable.
The present invention has been further described above through specific embodiments. It should be understood that the specific description here should not be construed as limiting the spirit and scope of the invention; various modifications made to the above embodiments by one of ordinary skill in the art after reading this specification fall within the scope protected by the present invention.

Claims (7)

  1. A method of information processing, characterized by comprising the following steps:
    extracting the coordinates of all characters in a document file, and splitting the characters into rows according to the coordinates of each character;
    removing time-format content, spaces and colons from the text of each row to obtain the standard text of each row;
    calculating the gap between adjacent characters in the standard text of each row, and grouping the standard text of each row according to the gaps;
    segmenting the standard text of each row according to the type of each character in it to obtain a row segmentation result;
    aggregating the row segmentation results output for every row.
  2. The method of information processing according to claim 1, characterized in that the segmentation processing is specifically:
    when the standard text contains both Chinese and English, splitting Chinese words from English words according to the gap between characters and the local character width.
  3. The method of information processing according to claim 1, characterized in that the segmentation processing is specifically:
    when the gap between two adjacent characters in the standard text lies between a first preset gap and a second preset gap, judging whether the overlap range of the two adjacent characters exceeds a preset character width;
    if so, the two adjacent characters belong to the same block;
    if not, the two adjacent characters are split.
  4. The method of information processing according to claim 1, characterized in that the segmentation processing is specifically:
    when the height difference between two adjacent characters exceeds a preset height ratio, the two adjacent characters are split.
  5. The method of information processing according to claim 1, characterized in that the segmentation processing is specifically:
    when characters in the standard text are located inside brackets, the two adjacent characters are judged to belong to the same block.
  6. The method of information processing according to claim 1, characterized in that the row splitting is specifically:
    rows that overlap vertically or hang at a corner are classified as the same row.
  7. The method of information processing according to claim 1, characterized in that, before the step of removing time-format content, spaces and colons from the text of each row to obtain the standard text of each row, the method further comprises the following step:
    extracting the time-format content from the text of each row with a preset regular expression.
CN201711463021.7A 2017-12-28 2017-12-28 A method of information processing Pending CN108228553A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711463021.7A CN108228553A (en) 2017-12-28 2017-12-28 A method of information processing

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201711463021.7A CN108228553A (en) 2017-12-28 2017-12-28 A method of information processing

Publications (1)

Publication Number Publication Date
CN108228553A true CN108228553A (en) 2018-06-29

Family

ID=62645673

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711463021.7A Pending CN108228553A (en) 2017-12-28 2017-12-28 A method of information processing

Country Status (1)

Country Link
CN (1) CN108228553A (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1410943A (en) * 2001-09-27 2003-04-16 佳能株式会社 Character image line selecting method and device and character image identifying method and device
CN102332096A (en) * 2011-10-17 2012-01-25 中国科学院自动化研究所 Video caption text extraction and identification method
CN102456136A (en) * 2010-10-29 2012-05-16 方正国际软件(北京)有限公司 Image-text splitting method and system
CN102567300A (en) * 2011-12-29 2012-07-11 方正国际软件有限公司 Picture document processing method and device
US20140105496A1 (en) * 2012-10-17 2014-04-17 Cognex Corporation System and Method for Selecting Segmentation Parameters for Optical Character Recognition
CN105302626A (en) * 2015-11-09 2016-02-03 深圳市依伴数字科技有限公司 Analytic method of XPS (XML Paper Specification) structural data

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
苗红霞: "一种身份证图像字符分割的改进方法", 《微处理机》 *

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 518000 Wensheng center, Wenjin square, East Wenjin Road, Luohu District, Shenzhen, Guangdong, 2001

Applicant after: Shenzhen juding Medical Co.,Ltd.

Address before: 518000 Wensheng center, Wenjin square, East Wenjin Road, Luohu District, Shenzhen, Guangdong, 2001

Applicant before: SHENZHEN JUDING MEDICAL DEVICE Co.,Ltd.

RJ01 Rejection of invention patent application after publication

Application publication date: 20180629