CN106649213A - Method and system for identifying spaces in document - Google Patents

Publication number
CN106649213A
Authority
CN
China
Prior art keywords
space
threshold value
value
space threshold
gap length
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201610843703.XA
Other languages
Chinese (zh)
Other versions
CN106649213B (en)
Inventor
李云生
晏检平
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
SHENZHEN WONDERSHARE INFORMATION TECHNOLOGY Co Ltd
Original Assignee
SHENZHEN WONDERSHARE INFORMATION TECHNOLOGY Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by SHENZHEN WONDERSHARE INFORMATION TECHNOLOGY Co Ltd
Priority to CN201610843703.XA
Publication of CN106649213A
Application granted
Publication of CN106649213B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/12 Use of codes for handling textual entities
    • G06F40/163 Handling of whitespace
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/17 Details of further file system functions
    • G06F16/178 Techniques for file synchronisation in file systems
    • G06F16/1794 Details of file format conversion
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities

Abstract

The invention is suitable for the field of character identification and provides a method and a system for identifying spaces in a document. The method comprises the steps of acquiring width values of gaps between all adjacent characters in each basic unit by taking lines or paragraphs in the document as basic units, and obtaining an initial gap width set corresponding to each basic unit; taking the initial gap width set as an input set, processing the input set through a space threshold calculation method, and taking an obtained space threshold as a first space threshold; and judging whether the width values of the gaps between the adjacent characters in the basic unit are greater than the first space threshold or not in sequence: if the width values of the gaps between the adjacent characters in the basic unit are greater than the first space threshold, judging that the spaces exist between the adjacent characters; and if the width values of the gaps between the adjacent characters in the basic unit are not greater than the first space threshold, judging that the spaces do not exist between the adjacent characters. The fixed space width does not need to be used for judging the spaces, so that the precision of identifying the spaces in the document is improved and an identification result is more accurate and reliable.

Description

Method and system for identifying spaces in a document
Technical field
The invention belongs to the technical field of information recognition, and in particular relates to a method and system for identifying spaces in a document.
Background
When a fixed-layout document such as a PDF (Portable Document Format) file is converted into a document in another format (such as WORD or TXT), the characters in the document must be recognized. In particular, the gaps between adjacent characters must be assessed so that words and punctuation can be assembled correctly.
A gap between adjacent characters in a document can have several causes, for example: an actual space character, character spacing set in the page layout, character-spacing adjustments made during typesetting, or separate text objects.
In the prior art, the minimum adjacent-character spacing of the whole document is used as a baseline: this minimum is subtracted from every adjacent-character spacing, and the difference is then tested against a predetermined space width. However, the predetermined space width cannot be determined reliably, and the alignment mode also affects the width of a space. Both factors lead to low space-recognition accuracy and inaccurate recognition results.
Summary of the invention
In view of this, embodiments of the present invention provide a method and system for identifying spaces in a document, to solve the problems of low space-recognition accuracy and inaccurate recognition results in the prior art.
In a first aspect, a method for identifying spaces in a document is provided, including:
taking lines or paragraphs in the document as basic units, separately collecting the gap width values between all adjacent characters in each basic unit, and obtaining an initial gap width set corresponding to each basic unit;
taking the initial gap width set as an input set, processing the input set by a space threshold calculation method, and taking the resulting space threshold as a first space threshold;
judging in sequence whether the gap width value between each pair of adjacent characters in the basic unit is greater than the first space threshold: if it is greater than the first space threshold, determining that a space exists between the adjacent characters; if it is not greater than the first space threshold, determining that no space exists between the adjacent characters.
Further, the space threshold calculation method includes:
calculating the mathematical expectation and the standard deviation of the input set, and calculating the ratio of the standard deviation to the mathematical expectation;
judging whether the ratio is less than a preset threshold:
if the ratio is less than the preset threshold, taking the maximum gap width value in the input set as the space threshold, and outputting the space threshold;
if the ratio is not less than the preset threshold, calculating the sum of the mathematical expectation and three times the standard deviation to obtain a calculation result, deleting from the input set all gap width values greater than the calculation result, and processing the input set after deletion again by the space threshold calculation method.
Further, separately collecting the gap width values between all adjacent characters in each basic unit and obtaining the initial gap width set corresponding to each basic unit also includes:
if any of the gap width values is less than zero, deleting the gap width values that are less than zero to obtain the initial gap width set.
Further, collecting the gap width values between all adjacent characters in each basic unit includes:
calculating the gap width value between adjacent characters according to the origin positions of the adjacent characters.
Further, the method for identifying spaces in a document also includes:
identifying the specific number of spaces existing between adjacent characters:
taking, as an input set, the set of gap width values in the basic unit that were judged to be greater than the first space threshold;
processing the input set by the space threshold calculation method, and taking the resulting space threshold as a second space threshold;
judging in sequence whether each gap width value in the input set is greater than the second space threshold: if it is greater than the second space threshold, determining that two or more spaces exist between the adjacent characters; if it is not greater than the second space threshold, determining that one space exists between the adjacent characters.
In a second aspect, a system for identifying spaces in a document is provided, including:
a collection unit, configured to take lines or paragraphs in the document as basic units, separately collect the gap width values between all adjacent characters in each basic unit, and obtain an initial gap width set corresponding to each basic unit;
a processing unit, configured to take the initial gap width set as an input set, process the input set by a space threshold calculation method, and take the resulting space threshold as a first space threshold;
a judgment unit, configured to judge in sequence whether the gap width value between each pair of adjacent characters in the basic unit is greater than the first space threshold: if it is greater than the first space threshold, determine that a space exists between the adjacent characters; if it is not greater than the first space threshold, determine that no space exists between the adjacent characters.
Further, the space threshold calculation method includes:
calculating the mathematical expectation and the standard deviation of the input set, and calculating the ratio of the standard deviation to the mathematical expectation;
judging whether the ratio is less than a preset threshold:
if the ratio is less than the preset threshold, taking the maximum gap width value in the input set as the space threshold, and outputting the space threshold;
if the ratio is not less than the preset threshold, calculating the sum of the mathematical expectation and three times the standard deviation to obtain a calculation result, deleting from the input set all gap width values greater than the calculation result, and processing the input set after deletion again by the space threshold calculation method.
Further, the collection unit is also configured to:
if any of the gap width values is less than zero, delete the gap width values that are less than zero to obtain the initial gap width set.
Further, the collection unit is configured to:
calculate the gap width value between adjacent characters according to the origin positions of the adjacent characters.
Further, the system for identifying spaces in a document also includes:
identifying the specific number of spaces existing between adjacent characters:
taking, as an input set, the set of gap width values in the basic unit that were judged to be greater than the first space threshold;
processing the input set by the space threshold calculation method, and taking the resulting space threshold as a second space threshold;
judging in sequence whether each gap width value in the input set is greater than the second space threshold: if it is greater than the second space threshold, determining that two or more spaces exist between the adjacent characters; if it is not greater than the second space threshold, determining that one space exists between the adjacent characters.
In embodiments of the present invention, the collected gap width set is taken as the input set, and the first space threshold is calculated adaptively by the space threshold calculation method in order to determine whether a space exists. No fixed space width is needed to judge spaces, so space-recognition accuracy in the document is improved and the recognition result is more accurate and reliable.
Description of the drawings
To explain the technical solutions of the embodiments of the present invention more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention; a person of ordinary skill in the art can derive other drawings from them without creative effort.
Fig. 1 is the implementation flowchart of the method for identifying spaces in a document provided by an embodiment of the present invention;
Fig. 2 is a schematic diagram of distributed alignment set in Word, provided by an embodiment of the present invention;
Fig. 3 is the implementation flowchart of the method for identifying spaces in a document provided by another embodiment of the present invention;
Fig. 4 is the implementation flowchart of the method for identifying the specific number of spaces provided by an embodiment of the present invention;
Fig. 5 is the structural block diagram of the system for identifying spaces in a document provided by an embodiment of the present invention.
Detailed description of the embodiments
In the following description, specific details such as particular system structures and techniques are set forth for the purpose of illustration rather than limitation, so that the embodiments of the present invention can be thoroughly understood. However, it will be clear to those skilled in the art that the present invention can also be practiced in other embodiments without these specific details. In other cases, detailed descriptions of well-known systems, devices, circuits, and methods are omitted so that unnecessary detail does not obscure the description of the present invention.
In the embodiments of the present invention, lines or paragraphs in the document are taken as basic units; the gap width values between all adjacent characters in each basic unit are collected separately to obtain an initial gap width set corresponding to each basic unit; the initial gap width set is taken as an input set and processed by a space threshold calculation method, and the resulting space threshold is taken as a first space threshold; the gap width value between each pair of adjacent characters in the basic unit is then judged in sequence against the first space threshold: if it is greater than the first space threshold, a space is determined to exist between the adjacent characters; if it is not greater than the first space threshold, no space is determined to exist between the adjacent characters.
Fig. 1 shows the implementation flowchart of a method for identifying spaces in a document provided by an embodiment of the present invention, detailed as follows:
In S101, lines or paragraphs in the document are taken as basic units, and the gap width values between all adjacent characters in each basic unit are collected separately to obtain an initial gap width set corresponding to each basic unit.
In practice, taking the common office document format Word as an example, spaces or character spacing in Word are generally adjusted adaptively according to the alignment mode, such as the common justified or distributed alignment. With distributed alignment, the characters of the last line of a paragraph are spread evenly to fill the line, as shown in Fig. 2, so the character gaps of that line are noticeably larger than those of other lines. Using the paragraph as the basic unit, on the other hand, reduces the amount of computation and makes recognition more efficient. The present invention therefore processes the document with lines or paragraphs as basic units, in order to obtain a more accurate and reliable recognition result.
In S102, the initial gap width set is taken as an input set, the input set is processed by the space threshold calculation method, and the resulting space threshold is taken as the first space threshold.
The space threshold calculation method processes the input set, adaptively evaluates the gap width values of all gaps in each input set, and selects the most suitable gap width value as the space threshold for judging whether a space exists between adjacent characters.
In S103, the gap width value between each pair of adjacent characters in the basic unit is judged in sequence against the first space threshold: if it is greater than the first space threshold, a space is determined to exist between the adjacent characters; if it is not greater than the first space threshold, no space is determined to exist between the adjacent characters.
Comparing the gap width value between characters with the first space threshold only identifies whether a space exists. The presence of a space between adjacent characters covers several cases, such as exactly one space, or two or more spaces.
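The comparison in S103 can be sketched as follows. This is a minimal illustration, assuming the first space threshold has already been computed for the basic unit; the function name `find_space_positions` and the sample values are hypothetical and do not come from the source.

```python
def find_space_positions(gap_widths, first_threshold):
    """Return one flag per gap: True if the gap is judged to contain a space.

    A gap counts as a space only when it is strictly greater than the
    threshold, matching the "greater than / not greater than" wording of S103.
    """
    return [gap > first_threshold for gap in gap_widths]


# Hypothetical gap widths for one line, with a hypothetical threshold of 0.6.
flags = find_space_positions([0.5, 0.6, 3.1, 0.5, 6.2], first_threshold=0.6)
```

A gap equal to the threshold is treated as "no space", because the source only declares a space when the gap is greater than the threshold.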
Further, in S101, separately collecting the gap width values between all adjacent characters in each basic unit and obtaining the initial gap width set corresponding to each basic unit also includes:
if any of the gap width values is less than zero, deleting the gap width values that are less than zero to obtain the initial gap width set.
Characters in a document may intersect or overlap, which makes the gap between adjacent characters negative. Processing such negative values directly would cause recognition errors and make the result inaccurate, so negative values must be deleted first during recognition.
Further, in S101, collecting the gap width values between all adjacent characters in each basic unit includes:
calculating the gap width value between adjacent characters according to the origin positions of the adjacent characters.
When calculating the gap width value between adjacent characters, so that the result matches as closely as possible the gap a user actually sees, the embodiment of the present invention works from the origin positions of the adjacent characters and also takes into account factors such as PDF page rotation, text object rotation, and the writing direction of the characters, and finally calculates the gap width value between the adjacent characters.
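The negative-value filter can be sketched in a few lines. This is only an illustration; the function name `initial_gap_width_set` is hypothetical. Zero-width gaps are kept, since the source deletes only values less than zero.

```python
def initial_gap_width_set(raw_gap_widths):
    """Drop negative gaps (overlapping or intersecting characters)."""
    return [gap for gap in raw_gap_widths if gap >= 0]


# Hypothetical raw measurements for one basic unit.
cleaned = initial_gap_width_set([0.5, -0.2, 0.0, 3.0])
```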
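The source does not give a formula for deriving a gap from character origins. The sketch below shows one plausible way to do it, under stated assumptions: the `Glyph` structure, its `advance` field, the `angle_deg` parameter, and the projection formula are all hypothetical and are not taken from the patent. The idea is to project the vector between the two origins onto the writing direction, so that rotated text yields the same gap as unrotated text, and then subtract the width of the first character.

```python
import math
from dataclasses import dataclass


@dataclass
class Glyph:
    """Hypothetical glyph record; not defined by the source."""
    x: float        # origin x in page coordinates
    y: float        # origin y in page coordinates
    advance: float  # width of the glyph along the writing direction


def gap_width(cur, nxt, angle_deg=0.0):
    """Gap between two adjacent glyphs, measured along the writing direction.

    angle_deg is the assumed combined rotation of page and text object.
    A negative result means the glyphs overlap.
    """
    theta = math.radians(angle_deg)
    dir_x, dir_y = math.cos(theta), math.sin(theta)
    # Distance between the two origins, projected onto the writing direction.
    along = (nxt.x - cur.x) * dir_x + (nxt.y - cur.y) * dir_y
    return along - cur.advance
```

For horizontal text with origins at x = 0 and x = 7 and a first glyph 5 units wide, this gives a gap of 2; the same layout rotated by 90 degrees gives the same value.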
Further, in S102:
The space threshold calculation method includes:
calculating the mathematical expectation and the standard deviation of the input set, and calculating the ratio of the standard deviation to the mathematical expectation;
judging whether the ratio is less than a preset threshold:
if the ratio is less than the preset threshold, taking the maximum gap width value in the input set as the space threshold, and outputting the space threshold;
if the ratio is not less than the preset threshold, calculating the sum of the mathematical expectation and three times the standard deviation to obtain a calculation result, deleting from the input set all gap width values greater than the calculation result, and processing the input set after deletion again by the space threshold calculation method.
Calculating the sum of the mathematical expectation and three times the standard deviation means: multiplying the calculated standard deviation by three, and then adding the resulting value to the calculated mathematical expectation.
The preset threshold can be set according to the user's actual needs, for example to 0.1.
When the ratio is judged to be not less than the preset threshold, values are removed from the set, and the set after removal is processed again by the space threshold calculation method, until the ratio is less than the preset threshold and the space threshold is output.
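The space threshold calculation method described above can be sketched as a loop. This is a minimal sketch under several assumptions that the source does not settle: the population standard deviation is used, a zero mean is treated as "ratio below the threshold", and the loop stops when a pass removes nothing (the source does not say what happens in that case, and without a stop the procedure would repeat forever). The names `space_threshold` and `ratio_limit` are hypothetical.

```python
import statistics


def space_threshold(gap_widths, ratio_limit=0.1):
    """Adaptive space threshold for one input set of gap widths."""
    values = list(gap_widths)
    if not values:
        raise ValueError("input set is empty")
    while True:
        mean = statistics.fmean(values)
        std = statistics.pstdev(values)  # assumption: population std deviation
        # Tight distribution: the largest remaining gap is the threshold.
        if mean == 0 or std / mean < ratio_limit:
            return max(values)
        cutoff = mean + 3 * std
        kept = [v for v in values if v <= cutoff]
        if len(kept) == len(values):
            # Assumption: nothing exceeded the cutoff, so stop here
            # instead of looping forever on an unchanged set.
            return max(values)
        values = kept
```

For example, thirty gaps of width 1.0 plus one gap of width 10.0 give a cutoff of roughly 6.06 on the first pass, so the 10.0 gap is removed and the threshold becomes 1.0. In this sketch a large gap is only removed when it lies more than three standard deviations above the mean, which requires large gaps to be comparatively rare in the input set.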
Further, the method for identifying spaces in a document also includes:
identifying the specific number of spaces existing between adjacent characters:
taking, as an input set, the set of gap width values in the basic unit that were judged to be greater than the first space threshold;
processing the input set by the space threshold calculation method, and taking the resulting space threshold as a second space threshold;
judging in sequence whether each gap width value in the input set is greater than the second space threshold: if it is greater than the second space threshold, determining that two or more spaces exist between the adjacent characters; if it is not greater than the second space threshold, determining that one space exists between the adjacent characters.
In practical operation, when it is not necessary to determine every specific number of spaces between adjacent characters, the gap width values can be divided into several classes of sets according to the number of spaces they contain, for example three sets:
First set: containing no space; second set: containing one space; third set: containing two or more spaces.
When processing the three sets, only the gap width values containing no space and those containing one space need to be determined; the remaining gap width values are placed directly into the third set (containing two or more spaces), which avoids a large amount of unnecessary computation.
If the specific number of spaces needs to be determined further, for example whether a gap contains three, four, or more spaces, the above steps are simply repeated: the gap width values whose number of spaces has already been determined are removed from the initial gap width set, the remaining values are processed by the space threshold calculation method, and the specific number of spaces is then judged. This continues until the required number of spaces between adjacent characters has been determined, or until the specific number of spaces between all adjacent characters has been determined, at which point the procedure ends and the result is output.
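The repeated application of the threshold calculation can be sketched as follows. This is an illustration only: `count_spaces`, `levels`, and `ratio_limit` are hypothetical names, negative gaps are assumed to have been removed already, and `space_threshold` repeats the same assumptions as any implementation of the method must make (population standard deviation, stop when a pass removes nothing). With `levels=2` the result corresponds to the three sets: 0 for no space, 1 for one space, 2 for two or more spaces.

```python
import statistics


def space_threshold(values, ratio_limit=0.1):
    """Adaptive threshold; stops when a pass removes nothing (assumption)."""
    values = list(values)
    while True:
        mean, std = statistics.fmean(values), statistics.pstdev(values)
        if mean == 0 or std / mean < ratio_limit:
            return max(values)
        kept = [v for v in values if v <= mean + 3 * std]
        if len(kept) == len(values):
            return max(values)
        values = kept


def count_spaces(gap_widths, levels=2, ratio_limit=0.1):
    """Return, per gap, a lower bound on the number of spaces it contains."""
    counts = [0] * len(gap_widths)
    candidates = list(range(len(gap_widths)))  # indices still being refined
    for _ in range(levels):
        if not candidates:
            break
        threshold = space_threshold(
            [gap_widths[i] for i in candidates], ratio_limit)
        # Only gaps above the current threshold move on to the next level.
        candidates = [i for i in candidates if gap_widths[i] > threshold]
        for i in candidates:
            counts[i] += 1
    return counts


# Hypothetical paragraph: 200 ordinary gaps, 12 single spaces, 1 double space.
counts = count_spaces([1.0] * 200 + [4.0] * 12 + [8.0])
```

Raising `levels` repeats the same step for three, four, or more spaces; the highest count always means "at least this many".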
Fig. 3 is the implementation flowchart of the method for identifying spaces in a document according to the corresponding embodiment of the present invention. With the preset threshold set to 0.1, the steps are detailed as follows:
1. Taking the lines or paragraphs of the document to be identified as basic units, collect the gap width values between all adjacent characters in each basic unit of the document.
2. Delete the gap width values that are less than zero, and form a set from the non-negative values in each basic unit, obtaining the initial gap width set corresponding to each basic unit.
3. Take each initial gap width set in turn as the input set.
4. Calculate the mathematical expectation and the standard deviation of the input set, and calculate the ratio of the standard deviation to the mathematical expectation.
5. Judge whether the ratio of the standard deviation to the mathematical expectation is less than 0.1.
6. If the ratio of the standard deviation to the mathematical expectation is less than 0.1, output the maximum gap width value in the input set as the space threshold, and take the output space threshold as the first space threshold;
if the ratio of the standard deviation to the mathematical expectation is not less than 0.1, calculate the sum of the mathematical expectation and three times the standard deviation to obtain a calculation result, delete from the input set all gap width values greater than the calculation result, and feed the input set after deletion back into step 4 for processing.
7. Judge in sequence whether the gap width value between each pair of adjacent characters in each basic unit is greater than the first space threshold corresponding to that basic unit.
8. If it is greater than the first space threshold, determine that a space exists between the adjacent characters;
if it is not greater than the first space threshold, determine that no space exists between the adjacent characters.
9. Finally, output the recognition result indicating, for all adjacent characters in the document, whether a space exists between them.
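Steps 1 to 9 can be put together for a single basic unit as sketched below. This is a hedged illustration rather than the patented implementation: the function names, the choice to return rebuilt text instead of a list of flags, the population standard deviation, and the stop condition when no value exceeds the cutoff are all assumptions. The characters and their gap widths are taken as given; extracting them from a PDF is outside the sketch.

```python
import statistics


def space_threshold(values, ratio_limit=0.1):
    """Steps 4 to 6; stops when a pass removes nothing (assumption)."""
    values = list(values)
    while True:
        mean, std = statistics.fmean(values), statistics.pstdev(values)
        if mean == 0 or std / mean < ratio_limit:
            return max(values)
        kept = [v for v in values if v <= mean + 3 * std]
        if len(kept) == len(values):
            return max(values)
        values = kept


def rebuild_text(chars, gap_widths, ratio_limit=0.1):
    """Insert a space wherever a gap exceeds the unit's first threshold.

    gap_widths[i] is the gap between chars[i] and chars[i + 1].
    """
    if not chars:
        return ""
    usable = [g for g in gap_widths if g >= 0]  # step 2: drop negative gaps
    if not usable:
        return "".join(chars)
    threshold = space_threshold(usable, ratio_limit)
    out = [chars[0]]
    for ch, gap in zip(chars[1:], gap_widths):
        if gap > threshold:  # steps 7 and 8
            out.append(" ")
        out.append(ch)
    return "".join(out)


# Hypothetical line: 32 characters, one wide gap in the middle.
line_chars = list("A" * 16 + "B" * 16)
line_gaps = [1.0] * 15 + [10.0] + [1.0] * 15
text = rebuild_text(line_chars, line_gaps)
```

Each basic unit gets its own threshold, so a line stretched by distributed alignment does not affect how the other lines are judged.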
Fig. 4 is the implementation flowchart of the method for identifying the specific number of spaces according to the corresponding embodiment of the present invention. The gap width values are divided into three sets according to the number of spaces they contain: a first set containing no space, a second set containing one space, and a third set containing two or more spaces. With the preset threshold set to 0.1, the steps are detailed as follows:
1. Input the recognition result indicating, for all adjacent characters in the document, whether a space exists between them.
2. For each basic unit, take the set of gap width values that were judged to contain a space as the input set corresponding to that basic unit.
3. Calculate the mathematical expectation and the standard deviation of the input set, and calculate the ratio of the standard deviation to the mathematical expectation.
4. Judge whether the ratio of the standard deviation to the mathematical expectation is less than 0.1.
5. If the ratio of the standard deviation to the mathematical expectation is less than 0.1, output the maximum gap width value in the input set as the space threshold, and take the output space threshold as the second space threshold;
if the ratio of the standard deviation to the mathematical expectation is not less than 0.1, calculate the sum of the mathematical expectation and three times the standard deviation to obtain a calculation result, delete from the input set all gap width values greater than the calculation result, and feed the input set after deletion back into step 3 for processing.
6. Judge in sequence whether each gap width value in each basic unit that was judged to contain a space is greater than the second space threshold corresponding to that basic unit.
7. If it is greater than the second space threshold, output that the gap width value belongs to the third set, and two or more spaces exist between the adjacent characters;
if it is not greater than the second space threshold, output that the gap width value belongs to the second set, and one space exists between the adjacent characters.
In the embodiments of the present invention, lines or paragraphs are taken as basic units, the gap width values between all valid adjacent characters in each basic unit of the document are calculated, and the space threshold calculation method is applied to the gap width values to compute an adaptive space threshold that best suits the basic unit. The space threshold is then used to judge whether a space exists, and the space threshold calculation method can also be used to further identify the specific number of spaces between adjacent characters. Because the space threshold is calculated adaptively for each basic unit, the method no longer relies on the fixed preset space width that the prior art uses to judge whether a space exists, so space-recognition accuracy in the document is greatly improved and the recognition result is more accurate and reliable.
Corresponding to the method for identifying spaces in a document described in the foregoing embodiments, Fig. 5 shows the structural block diagram of the system for identifying spaces in a document provided by an embodiment of the present invention.
With reference to Fig. 5, the system includes:
a collection unit 51, configured to take lines or paragraphs in the document as basic units, separately collect the gap width values between all adjacent characters in each basic unit, and obtain an initial gap width set corresponding to each basic unit;
a processing unit 52, configured to take the initial gap width set as an input set, process the input set by the space threshold calculation method, and take the resulting space threshold as the first space threshold;
a judgment unit 53, configured to judge in sequence whether the gap width value between each pair of adjacent characters in the basic unit is greater than the first space threshold: if it is greater than the first space threshold, determine that a space exists between the adjacent characters; if it is not greater than the first space threshold, determine that no space exists between the adjacent characters.
Further, in the processing unit 52:
The space threshold calculation method includes:
calculating the mathematical expectation and the standard deviation of the input set, and calculating the ratio of the standard deviation to the mathematical expectation;
judging whether the ratio is less than a preset threshold:
if the ratio is less than the preset threshold, taking the maximum gap width value in the input set as the space threshold, and outputting the space threshold;
if the ratio is not less than the preset threshold, calculating the sum of the mathematical expectation and three times the standard deviation to obtain a calculation result, deleting from the input set all gap width values greater than the calculation result, and processing the input set after deletion again by the space threshold calculation method.
Further, the collection unit 51 is also configured to:
if any of the gap width values is less than zero, delete the gap width values that are less than zero to obtain the initial gap width set.
Further, the collection unit 51 is configured to:
calculate the gap width value between adjacent characters according to the origin positions of the adjacent characters.
Further, the system for identifying spaces in a document also includes:
identifying the specific number of spaces existing between adjacent characters:
taking, as an input set, the set of gap width values in the basic unit that were judged to be greater than the first space threshold;
processing the input set by the space threshold calculation method, and taking the resulting space threshold as a second space threshold;
judging in sequence whether each gap width value in the input set is greater than the second space threshold: if it is greater than the second space threshold, determining that two or more spaces exist between the adjacent characters; if it is not greater than the second space threshold, determining that one space exists between the adjacent characters.
The embodiments described above are intended only to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, a person of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments can still be modified, or some of their technical features can be replaced by equivalents; such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present invention, and shall all fall within the protection scope of the present invention.

Claims (10)

1. space recognition methods in a kind of document, it is characterised in that include:
The unit based on row in document or section, gathers respectively the sky between all of adjacent character in each described fundamental unit Gap width value, obtains the corresponding initial void width set of each described fundamental unit;
Using initial void width set as input set, by space threshold value calculation method to being input at set Reason, and using the space threshold value for drawing as the first space threshold value;
Judge the gap length value in the fundamental unit between each adjacent character whether more than first space threshold value successively: If being more than first space threshold value, judge there is space between the adjacent character;If no more than described first space threshold value, Judge there is no space between the adjacent character.
2. The method according to claim 1, characterized in that:
The space threshold calculation method comprises:
Calculating the mathematical expectation and the standard deviation of the input set, and calculating the ratio of the standard deviation to the mathematical expectation;
Determining whether the ratio is less than a set threshold:
If the ratio is less than the set threshold, taking the maximum gap width value in the input set as the space threshold, and outputting the space threshold;
If the ratio is not less than the set threshold, calculating the sum of the mathematical expectation and three times the standard deviation to obtain a calculation result, deleting from the input set all gap width values greater than the calculation result, and processing the reduced input set again by the space threshold calculation method.
3. The method according to claim 1, characterized in that collecting the gap width values between all adjacent characters in each basic unit, to obtain the initial gap width set corresponding to each basic unit, further comprises:
If any of the gap width values is less than zero, deleting the negative gap width values to obtain the initial gap width set.
4. The method according to claim 1, characterized in that collecting the gap width values between all adjacent characters in each basic unit comprises:
Calculating the gap width value between adjacent characters according to the origin positions of the adjacent characters.
5. The method according to claim 1 or 2, characterized in that the method for identifying spaces in a document further comprises:
Identifying the specific number of spaces existing between adjacent characters:
Taking as an input set the set of gap width values in the basic unit that were determined to be greater than the first space threshold;
Processing the input set by the space threshold calculation method, and taking the resulting threshold as a second space threshold;
Determining in turn whether each gap width value in the input set is greater than the second space threshold: if it is greater than the second space threshold, determining that two or more spaces exist between the adjacent characters; if it is not greater than the second space threshold, determining that one space exists between the adjacent characters.
6. A system for identifying spaces in a document, characterized by comprising:
A collecting unit, configured to take a line or a paragraph of the document as a basic unit and collect the gap width values between all adjacent characters in each basic unit, to obtain an initial gap width set corresponding to each basic unit;
A processing unit, configured to take the initial gap width set as an input set, process the input set by a space threshold calculation method, and take the resulting threshold as a first space threshold;
A determination unit, configured to determine in turn whether the gap width value between each pair of adjacent characters in the basic unit is greater than the first space threshold: if it is greater than the first space threshold, determining that a space exists between the adjacent characters; if it is not greater than the first space threshold, determining that no space exists between the adjacent characters.
7. The system according to claim 6, characterized in that:
The space threshold calculation method comprises:
Calculating the mathematical expectation and the standard deviation of the input set, and calculating the ratio of the standard deviation to the mathematical expectation;
Determining whether the ratio is less than a set threshold:
If the ratio is less than the set threshold, taking the maximum gap width value in the input set as the space threshold, and outputting the space threshold;
If the ratio is not less than the set threshold, calculating the sum of the mathematical expectation and three times the standard deviation to obtain a calculation result, deleting from the input set all gap width values greater than the calculation result, and processing the reduced input set again by the space threshold calculation method.
8. The system according to claim 6, characterized in that the collecting unit is further configured to:
If any of the gap width values is less than zero, delete the negative gap width values to obtain the initial gap width set.
9. The system according to claim 6, characterized in that the collecting unit is configured to:
Calculate the gap width value between adjacent characters according to the origin positions of the adjacent characters.
10. The system according to claim 6 or 7, characterized in that the system for identifying spaces in a document further comprises:
Identifying the specific number of spaces existing between adjacent characters:
Taking as an input set the set of gap width values in the basic unit that were determined to be greater than the first space threshold;
Processing the input set by the space threshold calculation method, and taking the resulting threshold as a second space threshold;
Determining in turn whether each gap width value in the input set is greater than the second space threshold: if it is greater than the second space threshold, determining that two or more spaces exist between the adjacent characters; if it is not greater than the second space threshold, determining that one space exists between the adjacent characters.
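The collection step of claims 3 and 4 (and 8 and 9) might look like the sketch below. The claims say only that gaps are calculated from the origin positions of adjacent characters; the `Glyph` record, its `width` field, and the formula "next origin minus the end of the current glyph" are hypothetical choices made here for illustration.

```python
from dataclasses import dataclass


@dataclass
class Glyph:
    """Hypothetical character record: origin x-coordinate and advance width."""
    x: float
    width: float


def collect_gaps(glyphs):
    """Return (index, gap) pairs for adjacent glyphs in one line or paragraph.

    index is the position of the left-hand glyph of each pair.
    Pairs whose gap is negative are dropped, as in claim 3.
    """
    pairs = []
    for i in range(len(glyphs) - 1):
        left, right = glyphs[i], glyphs[i + 1]
        # Assumed formula: distance from the end of the left glyph
        # to the origin of the right glyph.
        gap = right.x - (left.x + left.width)
        if gap >= 0:
            pairs.append((i, gap))
    return pairs


line = [Glyph(0.0, 5.0), Glyph(5.5, 5.0), Glyph(10.0, 5.0), Glyph(18.0, 5.0)]
print(collect_gaps(line))  # [(0, 0.5), (2, 3.0)]
```

Keeping the index next to each gap lets a later step map a "space" decision back to the character pair it belongs to, even after negative gaps have been removed.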
CN201610843703.XA 2016-09-22 2016-09-22 Method and system for identifying spaces in document Active CN106649213B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610843703.XA CN106649213B (en) 2016-09-22 2016-09-22 Space recognition methods and system in a kind of document

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610843703.XA CN106649213B (en) 2016-09-22 2016-09-22 Space recognition methods and system in a kind of document

Publications (2)

Publication Number Publication Date
CN106649213A true CN106649213A (en) 2017-05-10
CN106649213B CN106649213B (en) 2019-08-20

Family

ID=58853165

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610843703.XA Active CN106649213B (en) 2016-09-22 2016-09-22 Method and system for identifying spaces in document

Country Status (1)

Country Link
CN (1) CN106649213B (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109325215A (en) * 2018-12-04 2019-02-12 万兴科技股份有限公司 Word text output method and device
CN109582934A (en) * 2018-12-04 2019-04-05 万兴科技股份有限公司 Format document conversion method and device
CN112699634A (en) * 2020-12-28 2021-04-23 掌阅科技股份有限公司 Typesetting processing method of electronic book, electronic equipment and storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101901333A (en) * 2009-05-25 2010-12-01 汉王科技股份有限公司 Method for segmenting word in text image and identification device using same
CN101901348A (en) * 2010-06-29 2010-12-01 北京捷通华声语音技术有限公司 Normalization based handwriting identifying method and identifying device
CN101980185A (en) * 2010-10-29 2011-02-23 方正国际软件有限公司 Method and system for removing spaces from text copied from double-layer electronic file
US20150033185A1 (en) * 2013-07-29 2015-01-29 Fujitsu Limited Non-transitory computer-readable medium storing selected character specification program, selected character specification method, and selected character specification device

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
DG LEE et al.: "Automatic Word Spacing Using Probabilistic Models Based on Character n-grams", 《IEEE INTELLIGENT SYSTEMS》 *
南玉刚: "快速删除多余的空格", 《电脑迷》 *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109325215A (en) * 2018-12-04 2019-02-12 万兴科技股份有限公司 Word text output method and device
CN109582934A (en) * 2018-12-04 2019-04-05 万兴科技股份有限公司 Format document conversion method and device
CN109325215B (en) * 2018-12-04 2023-02-10 万兴科技股份有限公司 Word text output method and device
CN109582934B (en) * 2018-12-04 2023-02-10 万兴科技股份有限公司 Format document conversion method and device
CN112699634A (en) * 2020-12-28 2021-04-23 掌阅科技股份有限公司 Typesetting processing method of electronic book, electronic equipment and storage medium
CN112699634B (en) * 2020-12-28 2022-05-24 掌阅科技股份有限公司 Typesetting processing method of electronic book, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN106649213B (en) 2019-08-20

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant