CN108228553A

CN108228553A - An information processing method

Info

Publication number: CN108228553A
Application number: CN201711463021.7A
Authority: CN
Inventors: 朱光强; 龙汉; 王海生
Original assignee: Shenzhen Juding Medical Devices Co Ltd
Current assignee: Shenzhen Juding Medical Devices Co Ltd
Priority date: 2017-12-28
Filing date: 2017-12-28
Publication date: 2018-06-29

Abstract

The present invention relates to the technical field of text content extraction, and in particular to an information processing method comprising the following steps: extracting the coordinates of all characters in a document file, and dividing the characters into rows according to the coordinates of each character; removing time-format content, spaces and colons from the characters of each row to obtain the standard text of each row; calculating the gaps between adjacent characters in the standard text of each row, and grouping the standard text of each row according to the gaps; segmenting the standard text of each row according to the type of each character in it to obtain a row segmentation result; and aggregating the row segmentation results output for each row. The method eliminates the cumbersome configuration otherwise needed for field-value segmentation, achieves zero-configuration segmentation, and adapts to a variety of text layouts, making structured extraction simple and adaptable.

Description

An information processing method

Technical field

The present invention relates to the technical field of text content extraction, and in particular to an information processing method.

Background art

Electronic medical reports are mostly files in PDF or XPS format and contain rich personal and medical-record data about patients. An XPS document is similar to a PDF document: it is a read-only document format that stores data in a structured data form, so when a computer is used to read the document content, corresponding parsing and extraction are required. At present, structured extraction from documents mostly segments multiple field values by template matching or regular-expression extraction. Both approaches require a separate template or regular expression to be configured for the text content and layout of each document, so the steps are cumbersome and adaptability is poor.

Summary of the invention

In view of the problems in the prior art, the present invention provides a method for dividing the text content of PDF and XPS format files into blocks.

An information processing method comprises the following steps:

extracting the coordinates of all characters in a document file, and dividing the characters into rows according to the coordinates of each character;

removing time-format content, spaces and colons from the characters of each row to obtain the standard text of each row;

calculating the gaps between adjacent characters in the standard text of each row, and grouping the standard text of each row according to the gaps;

segmenting the standard text of each row according to the type of each character in it to obtain a row segmentation result;

aggregating the row segmentation results output for each row.

Further, the segmentation processing is specifically:

when the standard text contains both Chinese and English, splitting Chinese words from English words according to the gaps between characters and the local character width.

Further, the segmentation processing is specifically:

when the gap between two adjacent characters in the standard text lies between a first preset gap and a second preset gap, determining whether the overlap range of the two adjacent characters exceeds a preset character width;

if so, the two adjacent characters belong to the same block;

if not, the two adjacent characters are split.

Further, the segmentation processing is specifically:

when the height difference between two adjacent characters exceeds a preset height ratio, the two adjacent characters are split.

Further, the segmentation processing is specifically:

when characters in the standard text lie inside brackets, the two adjacent characters are judged to belong to the same block.

Further, the row division is specifically:

rows that overlap vertically or touch at a corner are classified as the same row.

Further, before the step of removing time-format content, spaces and colons from the characters of each row to obtain the standard text of each row, the method further comprises the following step:

extracting the time-format content from the characters of each row with preset regular expressions.

The information processing method of the present invention takes the character as its unit: it divides characters into rows according to their coordinates and, after obtaining the standard text, groups and segments each row by calculating the gaps between characters. The method eliminates the cumbersome configuration otherwise needed for field-value segmentation, achieves zero-configuration segmentation, and adapts to a variety of text layouts, making structured extraction simple and adaptable.

Brief description of the drawings

To illustrate the embodiments of the present invention or the technical solutions of the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention; those of ordinary skill in the art can obtain other drawings from them without creative effort.

Fig. 1 is a flowchart of the information processing method of the present invention.

Detailed description of the embodiments

The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those skilled in the art on the basis of these embodiments without creative effort fall within the protection scope of the present invention.

An information processing method according to an embodiment of the present invention, as shown in Fig. 1, comprises the following steps:

Step S01: extract the coordinates of all characters in a document file, and divide the characters into rows according to the coordinates of each character;

Step S02: remove time-format content, spaces and colons from the characters of each row to obtain the standard text of each row;

Step S03: calculate the gaps between adjacent characters in the standard text of each row, and group the standard text of each row according to the gaps;

Step S04: segment the standard text of each row according to the type of each character in it to obtain a row segmentation result;

Step S05: aggregate the row segmentation results output for each row.
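Steps S02 and S03 above can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the `Char` record, the simplified `TIME_RE` pattern, and the fixed `max_gap` threshold are all introduced here for the example, and the source does not say how the grouping threshold is chosen.

```python
import re
from dataclasses import dataclass
from typing import List


@dataclass
class Char:
    text: str   # a single character
    x0: float   # left edge
    x1: float   # right edge


# Hypothetical, simplified date pattern used only for this sketch.
TIME_RE = re.compile(r"20\d\d-\d{1,2}-\d{1,2}")


def standard_text(row: List[Char]) -> List[Char]:
    """Step S02: drop time-format content, spaces and colons from one row."""
    line = "".join(c.text for c in row)
    drop = set()
    for m in TIME_RE.finditer(line):
        drop.update(range(m.start(), m.end()))
    return [c for i, c in enumerate(row)
            if i not in drop and c.text not in (" ", ":", "：")]


def group_by_gap(chars: List[Char], max_gap: float) -> List[str]:
    """Step S03: start a new group wherever the gap exceeds max_gap (assumed threshold)."""
    groups, current = [], []
    for c in chars:
        if current and c.x0 - current[-1].x1 > max_gap:
            groups.append("".join(ch.text for ch in current))
            current = []
        current.append(c)
    if current:
        groups.append("".join(ch.text for ch in current))
    return groups
```

Because the removed colon and space leave their width behind as a gap, a label and its value naturally fall into different groups in this sketch.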

Specifically, the segmentation processing includes: when the standard text contains both Chinese and English, splitting Chinese words from English words according to the gaps between characters and the local character width.

Specifically, the segmentation processing is: when the gap between two adjacent characters in the standard text lies between a first preset gap and a second preset gap, determine whether the overlap range of the two adjacent characters exceeds a preset character width; if so, the two adjacent characters belong to the same block; if not, the two adjacent characters are split. When characters in the standard text overlap laterally, a character gap between negative half a character width and zero indicates that the two consecutive characters overlap laterally and that the overlap does not exceed half a character width; in that case the two characters are considered to belong to one block and are not split. If the overlap exceeds half a character width, they are split.

Specifically, the segmentation processing is: when the height difference between two adjacent characters exceeds a preset height ratio, the two adjacent characters are split. This handles characters of inconsistent size: if the height difference between two consecutive characters exceeds 20%, they are considered not to belong to the same text block and are split.
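The height rule can be expressed as a one-line predicate. The sketch below is illustrative only: the source gives the 20% figure but not the base it is measured against, so measuring the difference relative to the taller character is an assumption of this example, as is the function name `split_on_height`.

```python
def split_on_height(h_prev: float, h_next: float, max_ratio: float = 0.20) -> bool:
    """Return True when two adjacent characters should be split by height.

    Assumption: the height difference is taken relative to the taller
    of the two characters; the source only states "more than 20%".
    """
    taller = max(h_prev, h_next)
    if taller <= 0:
        raise ValueError("character heights must be positive")
    return abs(h_prev - h_next) / taller > max_ratio
```

With this choice, heights of 10 and 12 stay together (a difference of about 17%), while 10 and 14 are split (about 29%).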

Specifically, the segmentation processing is: when characters in the standard text lie inside brackets, the two adjacent characters are judged to belong to the same block.
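One way to read the bracket rule is as a filter over candidate split positions: any split that would fall inside a bracket pair is suppressed. The sketch below follows that reading; the function name, the set of bracket characters, and the convention that a cut at index `i` separates `text[i-1]` from `text[i]` are assumptions made for this example.

```python
OPEN = "(（[【"
CLOSE = ")）]】"


def cuts_outside_brackets(text: str, cuts):
    """Keep only the cut positions that do not fall inside a bracket pair.

    A cut at index i separates text[i-1] from text[i].
    """
    depth = 0
    blocked = set()
    for i, ch in enumerate(text):
        if ch in OPEN:
            depth += 1
        elif ch in CLOSE and depth > 0:
            depth -= 1
        if depth > 0:
            # The boundary right after this character is still inside brackets.
            blocked.add(i + 1)
    return [c for c in cuts if c not in blocked]
```

Note that an unmatched opening bracket blocks every later cut in this sketch; a production version would need a policy for unbalanced text.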

For example, when handling mixed Chinese and English text such as "DR diagnosis report form", the gap between "D" and "R" and the gap between "R" and the first Chinese character after it are not equal. However, because "D" and "R" are English characters and their distance is less than 10% of the local width, they are considered to belong to one word, and that word is treated as a whole. The gaps among the characters of the following "diagnosis report form" are then examined and found to be identical, so no split is made in this case.
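The English-merging part of this example can be sketched as below. It is a hedged illustration: the `Char` record and `merge_english` name are hypothetical, "local width" is taken here to mean the width of the current character, and the sample coordinates in the usage note are made up.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Char:
    text: str
    x0: float   # left edge
    x1: float   # right edge


def is_latin(ch: str) -> bool:
    """True for an ASCII letter."""
    return ch.isascii() and ch.isalpha()


def merge_english(chars: List[Char], ratio: float = 0.10) -> List[Char]:
    """Merge adjacent English letters whose gap is below ratio * local width.

    Assumption: the local width is the width of the current character.
    """
    merged: List[Char] = []
    for c in chars:
        prev = merged[-1] if merged else None
        if (prev is not None
                and is_latin(prev.text[-1]) and is_latin(c.text)
                and c.x0 - prev.x1 < ratio * (c.x1 - c.x0)):
            # Extend the previous token so the word is treated as a whole.
            merged[-1] = Char(prev.text + c.text, prev.x0, c.x1)
        else:
            merged.append(c)
    return merged
```

For a row where "D" spans 0 to 8 and "R" spans 8.5 to 16.5, the gap of 0.5 is below 10% of the width 8, so the two letters become the single token "DR" before the gaps among the following Chinese characters are compared.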

For example, when handling lateral overlap: if -0.5 × character width < gap < 0 (where gap is the distance between two adjacent characters), the two consecutive characters overlap laterally and the overlap does not exceed half a character width. In this case the two characters are considered to belong to one block and are not split. If the overlap exceeds half a character width, they are split.
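The inequality in this worked example translates directly into code. The sketch below is illustrative: the function name and return labels are hypothetical, and treating an overlap of exactly half a character width as a split is an assumption, since the source does not define that boundary.

```python
def overlap_rule(gap: float, char_width: float) -> str:
    """Classify two adjacent characters by their horizontal gap.

    gap is next.x0 - prev.x1, so a negative value means lateral overlap.
    """
    if gap >= 0:
        return "not overlapping"
    if gap > -0.5 * char_width:
        # Overlap is less than half a character width: keep together.
        return "same block"
    # Assumption: an overlap of exactly half a width is also split.
    return "split"
```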

In other cases, if a text row contains no more than 2 characters, no split is made.

Specifically, the row division is: rows that overlap vertically or touch at a corner are classified as the same row.
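A simple way to realize this row division is to sweep the character boxes from top to bottom and merge any box whose vertical extent overlaps or touches the current row. The sketch below assumes a hypothetical `Box` record with y growing downward, and reads "touching at a corner" as the `<=` comparison; neither detail is given in the source.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Box:
    text: str
    x0: float   # left edge
    y0: float   # top edge (y grows downward in this sketch)
    y1: float   # bottom edge


def group_rows(boxes: List[Box]) -> List[str]:
    """Group boxes into rows; overlapping or touching boxes share a row."""
    rows = []  # each entry: [top, bottom, members]
    for b in sorted(boxes, key=lambda b: b.y0):
        if rows and b.y0 <= rows[-1][1]:
            # Vertical overlap or touching: extend the current row.
            rows[-1][1] = max(rows[-1][1], b.y1)
            rows[-1][2].append(b)
        else:
            rows.append([b.y0, b.y1, [b]])
    # Read each row left to right.
    return ["".join(m.text for m in sorted(r[2], key=lambda m: m.x0))
            for r in rows]
```

Because rows are extended transitively, very tightly spaced lines could chain into one row; a real implementation would likely need an additional tolerance check.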

Specifically, before the step of removing time-format content, spaces and colons from the characters of each row to obtain the standard text of each row, the method further includes a step of extracting the time-format content from the characters of each row with preset regular expressions.

Specifically, the time-format content in the file is extracted with preset regular expressions.

For example, the preset regular expressions for extracting time are as follows:

'(20\d\d-\d{1,2}-\d{1,2}.*:\d\d:\d\d)' # 2017-03-05 format

u'：(\d\d-\d{1,2}-\d{1,2}.*:\d\d:\d\d)' # collection date: 16-08-16 12:02 format

u'(20\d\d年\d{1,2}月\d{1,2}.*:\d\d:\d\d)' # 2017年3月4日 (March 4, 2017) format; the pattern contains Chinese, so Unicode encoding is used

'(20\d\d/\d{1,2}/\d{1,2}.*:\d\d:\d\d)' # 2017/02/27 format

'(\d{1,2}/\d{1,2}/20\d\d.*:\d\d:\d\d)' # 27/02/2017 format

'(20\d{6}\d\d:\d\d:\d{0,2})' # 20160329 08:10 format

Here, the text before each # is the regular expression and the text after it is the associated comment.
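The extraction step can be sketched with Python's `re` module. The patterns below are this example's reading of three of the presets listed above, assuming every digit class is `\d`; the `PRESETS` list and the `extract_time` helper are hypothetical names, not part of the source.

```python
import re

# Three of the presets, written as raw strings (assumed reading).
PRESETS = [
    r"(20\d\d-\d{1,2}-\d{1,2}.*:\d\d:\d\d)",   # 2017-03-05 style
    r"(20\d\d/\d{1,2}/\d{1,2}.*:\d\d:\d\d)",   # 2017/02/27 style
    r"(\d{1,2}/\d{1,2}/20\d\d.*:\d\d:\d\d)",   # 27/02/2017 style
]


def extract_time(line: str):
    """Return (time_content, remaining_text).

    time_content is None when no preset matches; otherwise the matched
    span is cut out of the line so later steps see only the other text.
    """
    for pat in PRESETS:
        m = re.search(pat, line)
        if m:
            return m.group(1), line[:m.start(1)] + line[m.end(1):]
    return None, line
```

Because `.*` is greedy, these patterns capture everything from the date up to the last `:dd:dd` sequence on the line, so any text sitting between a date and a later time on the same row would be captured as well.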

The present invention has been further described above through specific embodiments. It should be understood that the specific description here is not to be construed as limiting the spirit and scope of the invention; various modifications made to the above embodiments by those of ordinary skill in the art after reading this specification fall within the scope protected by the present invention.

Claims

1. An information processing method, characterized by comprising the following steps:

extracting the coordinates of all characters in a document file, and dividing the characters into rows according to the coordinates of each character;

removing time-format content, spaces and colons from the characters of each row to obtain the standard text of each row;

calculating the gaps between adjacent characters in the standard text of each row, and grouping the standard text of each row according to the gaps;

segmenting the standard text of each row according to the type of each character in it to obtain a row segmentation result;

aggregating the row segmentation results output for each row.

2. The information processing method according to claim 1, characterized in that the segmentation processing is specifically:

when the standard text contains both Chinese and English, splitting Chinese words from English words according to the gaps between characters and the local character width.

3. The information processing method according to claim 1, characterized in that the segmentation processing is specifically:

when the gap between two adjacent characters in the standard text lies between a first preset gap and a second preset gap, determining whether the overlap range of the two adjacent characters exceeds a preset character width;

if so, the two adjacent characters belong to the same block;

if not, the two adjacent characters are split.

4. The information processing method according to claim 1, characterized in that the segmentation processing is specifically:

when the height difference between two adjacent characters exceeds a preset height ratio, the two adjacent characters are split.

5. The information processing method according to claim 1, characterized in that the segmentation processing is specifically:

when characters in the standard text lie inside brackets, the two adjacent characters are judged to belong to the same block.

6. The information processing method according to claim 1, characterized in that the row division is specifically:

rows that overlap vertically or touch at a corner are classified as the same row.

7. The information processing method according to claim 1, characterized in that, before the step of removing time-format content, spaces and colons from the characters of each row to obtain the standard text of each row, the method further comprises the following step:

extracting the time-format content from the characters of each row with preset regular expressions.