CN106649213B - Space recognition methods and system in a kind of document - Google Patents
- Publication number: CN106649213B (application CN201610843703.XA)
- Authority
- CN
- China
- Prior art keywords
- threshold value
- space
- space threshold
- gap length
- value
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/10—Text processing
- G06F40/12—Use of codes for handling textual entities
- G06F40/163—Handling of whitespace
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/10—File systems; File servers
- G06F16/17—Details of further file system functions
- G06F16/178—Techniques for file synchronisation in file systems
- G06F16/1794—Details of file format conversion
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
Abstract
The present invention applies to the field of character recognition and provides a method and system for recognizing spaces in a document. The method comprises: taking a line or paragraph of the document as the base unit, collecting the gap width values between all adjacent characters in each base unit, and obtaining an initial gap width set for each base unit; taking the initial gap width set as the input set, processing it with a space threshold calculation method, and using the resulting space threshold as the first space threshold; and determining in turn whether the gap width value between each pair of adjacent characters in the base unit is greater than the first space threshold: if it is, a space is determined to exist between the adjacent characters; if it is not, no space is determined to exist. Because spaces are not judged against a fixed space width, the space recognition rate in the document is improved and the recognition results are more accurate and reliable.
Description
Technical field
The invention belongs to the field of information recognition technology, and in particular relates to a method and system for recognizing spaces in a document.
Background art
When a fixed-layout document such as a PDF (Portable Document Format) file is converted into a document of another format (such as WORD or TXT), the characters in the document must be recognized. In particular, the spaces between adjacent characters must be determined so that words and sentences can be segmented correctly.
In a document, a gap between adjacent characters can have several causes, for example: an actual space, character spacing set by the page layout, character-spacing adjustment applied to the text, or separate text objects.
In the prior art, the minimum adjacent-character spacing over the full text of the document is taken as a baseline: this minimum is subtracted from every adjacent-character spacing, and the result is compared with a predetermined space width. However, the predetermined space width cannot be reliably fixed in advance, and the alignment mode also affects the width of a space. Both factors lead to a low space recognition rate and inaccurate recognition results.
Summary of the invention
In view of this, the embodiments of the present invention provide a method and system for recognizing spaces in a document, to solve the prior-art problems of a low space recognition rate and inaccurate recognition results.
In a first aspect, a method for recognizing spaces in a document is provided, comprising:
taking a line or paragraph of the document as the base unit, collecting the gap width values between all adjacent characters in each base unit, and obtaining the initial gap width set corresponding to each base unit;
taking the initial gap width set as the input set, processing the input set with a space threshold calculation method, and using the resulting space threshold as the first space threshold;
determining in turn whether the gap width value between each pair of adjacent characters in the base unit is greater than the first space threshold: if it is greater than the first space threshold, determining that a space exists between the adjacent characters; if it is not greater than the first space threshold, determining that no space exists between the adjacent characters.
Further, the space threshold calculation method comprises:
calculating the mean (mathematical expectation) and the standard deviation of the input set, and calculating the ratio of the standard deviation to the mean;
determining whether the ratio is less than a set threshold:
if the ratio is less than the set threshold, taking the maximum gap width value in the input set as the space threshold and outputting the space threshold;
if the ratio is not less than the set threshold, calculating the sum of the mean and three times the standard deviation to obtain a calculation result, deleting from the input set all gap width values greater than the calculation result, and processing the reduced input set again with the space threshold calculation method.
Further, collecting the gap width values between all adjacent characters in each base unit and obtaining the initial gap width set corresponding to each base unit further comprises:
if any of the gap width values is less than zero, deleting the gap width values that are less than zero to obtain the initial gap width set.
Further, collecting the gap width values between all adjacent characters in each base unit comprises:
calculating the gap width value between adjacent characters from the origin positions of the adjacent characters.
Further, the method for recognizing spaces in a document further comprises:
identifying the specific number of spaces between adjacent characters:
taking as the input set the set of gap width values in the base unit that were determined to be greater than the first space threshold;
processing the input set with the space threshold calculation method, and using the resulting space threshold as the second space threshold;
determining in turn whether each gap width value in the input set is greater than the second space threshold: if it is greater than the second space threshold, determining that two or more spaces exist between the adjacent characters; if it is not greater than the second space threshold, determining that one space exists between the adjacent characters.
In a second aspect, a system for recognizing spaces in a document is provided, comprising:
an acquisition unit, configured to take a line or paragraph of the document as the base unit, collect the gap width values between all adjacent characters in each base unit, and obtain the initial gap width set corresponding to each base unit;
a processing unit, configured to take the initial gap width set as the input set, process the input set with a space threshold calculation method, and use the resulting space threshold as the first space threshold;
a determination unit, configured to determine in turn whether the gap width value between each pair of adjacent characters in the base unit is greater than the first space threshold: if it is greater than the first space threshold, determine that a space exists between the adjacent characters; if it is not greater than the first space threshold, determine that no space exists between the adjacent characters.
Further, the space threshold calculation method comprises:
calculating the mean (mathematical expectation) and the standard deviation of the input set, and calculating the ratio of the standard deviation to the mean;
determining whether the ratio is less than a set threshold:
if the ratio is less than the set threshold, taking the maximum gap width value in the input set as the space threshold and outputting the space threshold;
if the ratio is not less than the set threshold, calculating the sum of the mean and three times the standard deviation to obtain a calculation result, deleting from the input set all gap width values greater than the calculation result, and processing the reduced input set again with the space threshold calculation method.
Further, the acquisition unit is also configured to:
if any of the gap width values is less than zero, delete the gap width values that are less than zero to obtain the initial gap width set.
Further, the acquisition unit is configured to:
calculate the gap width value between adjacent characters from the origin positions of the adjacent characters.
Further, the system for recognizing spaces in a document is also configured to:
identify the specific number of spaces between adjacent characters:
take as the input set the set of gap width values in the base unit that were determined to be greater than the first space threshold;
process the input set with the space threshold calculation method, and use the resulting space threshold as the second space threshold;
determine in turn whether each gap width value in the input set is greater than the second space threshold: if it is greater than the second space threshold, determine that two or more spaces exist between the adjacent characters; if it is not greater than the second space threshold, determine that one space exists between the adjacent characters.
In the embodiments of the present invention, the collected gap width set is used as the input set and the first space threshold is computed adaptively with the space threshold calculation method in order to determine whether a space exists. Because spaces are not judged against a fixed space width, the space recognition rate in the document is improved and the recognition results are more accurate and reliable.
Brief description of the drawings
To describe the technical solutions in the embodiments of the present invention more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. The drawings described below are only some embodiments of the present invention; those of ordinary skill in the art can derive other drawings from them without creative effort.
Fig. 1 is an implementation flow chart of the method for recognizing spaces in a document provided by an embodiment of the present invention;
Fig. 2 is a schematic diagram of setting the alignment mode to distributed alignment in Word, provided by an embodiment of the present invention;
Fig. 3 is an implementation flow chart of the method for recognizing spaces in a document provided by another embodiment of the present invention;
Fig. 4 is an implementation flow chart of the method for identifying the specific number of spaces provided by an embodiment of the present invention;
Fig. 5 is a structural block diagram of the system for recognizing spaces in a document provided by an embodiment of the present invention.
Detailed description of the embodiments
In the following description, specific details such as particular system structures and techniques are set forth for purposes of illustration rather than limitation, so that the embodiments of the present invention can be thoroughly understood. However, it will be clear to those skilled in the art that the present invention can also be implemented in other embodiments without these specific details. In other cases, detailed descriptions of well-known systems, devices, circuits and methods are omitted so that unnecessary detail does not obscure the description of the present invention.
The embodiments of the present invention take a line or paragraph of the document as the base unit, collect the gap width values between all adjacent characters in each base unit, and obtain the initial gap width set corresponding to each base unit. The initial gap width set is taken as the input set and processed with the space threshold calculation method, and the resulting space threshold is used as the first space threshold. It is then determined in turn whether the gap width value between each pair of adjacent characters in the base unit is greater than the first space threshold: if it is, a space is determined to exist between the adjacent characters; if it is not, no space is determined to exist.
Fig. 1 shows the implementation flow of a method for recognizing spaces in a document provided by an embodiment of the present invention, described in detail as follows:
In S101, a line or paragraph of the document is taken as the base unit, the gap width values between all adjacent characters in each base unit are collected, and the initial gap width set corresponding to each base unit is obtained.
In practice, taking the common office document format Word as an example, Word generally adjusts spaces or character spacing adaptively according to the alignment mode, such as the common justified alignment or distributed alignment. With distributed alignment, the characters of the last line of a paragraph are spread evenly to fill the whole line, as shown in Fig. 2, so the character gaps of that line are noticeably larger than those of the other lines. Using the paragraph as the base unit, on the other hand, reduces the amount of computation and makes recognition more efficient. The present invention therefore processes the document with a line or a paragraph as the base unit in order to obtain more accurate and reliable recognition results.
In S102, the initial gap width set is taken as the input set, the input set is processed with the space threshold calculation method, and the resulting space threshold is used as the first space threshold.
The space threshold calculation method processes the input set it is given and adaptively derives, for each input set, the gap width value best suited to serve as the space threshold. This threshold is then used to determine whether a space exists between adjacent characters.
In S103, it is determined in turn whether the gap width value between each pair of adjacent characters in the base unit is greater than the first space threshold: if it is greater than the first space threshold, a space is determined to exist between the adjacent characters; if it is not greater than the first space threshold, no space is determined to exist between the adjacent characters.
Comparing the gap width value between characters with the first space threshold can only establish whether a space exists. The presence of a space between adjacent characters covers several cases, such as exactly one space, or two or more spaces.
Further, in S101, collecting the gap width values between all adjacent characters in each base unit and obtaining the initial gap width set corresponding to each base unit further comprises:
if any of the gap width values is less than zero, deleting the gap width values that are less than zero to obtain the initial gap width set.
Characters in a document may intersect or overlap, which makes the gap between adjacent characters negative. If such negative values were included in the calculation, they would cause recognition errors and make the result inaccurate, so negative values must be deleted before recognition.
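The filtering step above can be sketched in a few lines. The function name `initial_gap_set` is hypothetical, and keeping zero-width gaps is an assumption based on the later statement that the non-negative values form the set.

```python
def initial_gap_set(raw_gaps):
    """Build the initial gap width set for one base unit.

    raw_gaps: gap widths between adjacent characters, in reading order.
    Negative widths (overlapping or intersecting characters) are dropped;
    zero is kept, which is an assumption of this sketch.
    """
    return [g for g in raw_gaps if g >= 0]


# The -0.2 entry stands for two overlapping characters and is removed.
print(initial_gap_set([0.5, -0.2, 0.0, 1.3]))
```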
Further, in S101, collecting the gap width values between all adjacent characters in each base unit comprises:
calculating the gap width value between adjacent characters from the origin positions of the adjacent characters.
So that the calculated gap matches, as closely as possible, the gap between adjacent characters as the user sees it, the embodiment of the present invention starts from the origin positions of the adjacent characters and also takes into account factors such as PDF page rotation, text object rotation and the writing direction of the characters to arrive at the final gap width value between adjacent characters.
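The source gives no formula for this calculation, so the following is only one plausible sketch. It assumes each character exposes an origin `(x, y)` and an advance width in page units, and that page rotation, text object rotation and writing direction have already been combined into a single direction angle. The `Glyph` type, the `gap_width` function and the `direction_deg` parameter are all hypothetical names.

```python
import math
from dataclasses import dataclass


@dataclass
class Glyph:
    x: float        # origin x in page coordinates
    y: float        # origin y in page coordinates
    advance: float  # width the glyph occupies along the writing direction


def gap_width(cur, nxt, direction_deg=0.0):
    """Gap between two adjacent glyphs, measured along the writing direction.

    direction_deg is the assumed combined rotation: 0 for ordinary
    left-to-right text, 90 for text running up the page, and so on.
    A negative result means the two glyphs overlap.
    """
    theta = math.radians(direction_deg)
    ux, uy = math.cos(theta), math.sin(theta)
    # Project the origin-to-origin vector onto the writing direction.
    along = (nxt.x - cur.x) * ux + (nxt.y - cur.y) * uy
    return along - cur.advance


print(gap_width(Glyph(0, 0, 5), Glyph(7, 0, 4)))  # horizontal text
```

Projecting onto the writing direction keeps the measured gap the same whether the text is horizontal or rotated, which is the property the paragraph above asks for.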
Further, in S102:
the space threshold calculation method comprises:
calculating the mean (mathematical expectation) and the standard deviation of the input set, and calculating the ratio of the standard deviation to the mean;
determining whether the ratio is less than a set threshold:
if the ratio is less than the set threshold, taking the maximum gap width value in the input set as the space threshold and outputting the space threshold;
if the ratio is not less than the set threshold, calculating the sum of the mean and three times the standard deviation to obtain a calculation result, deleting from the input set all gap width values greater than the calculation result, and processing the reduced input set again with the space threshold calculation method.
Calculating the sum of the mean and three times the standard deviation means: multiplying the calculated standard deviation by three, and then adding the result to the calculated mean.
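Written as formulas, with $g_1, \dots, g_n$ the gap width values of the input set, the two quantities used by the method are the ratio $r$ and the cutoff $c$. The population form of the standard deviation is an assumption here; the source does not say which estimator is used.

$$
\mu = \frac{1}{n}\sum_{i=1}^{n} g_i, \qquad
\sigma = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (g_i - \mu)^2}, \qquad
r = \frac{\sigma}{\mu}, \qquad
c = \mu + 3\sigma
$$

As an illustrative example, for the input set $\{1, 1, 1, 1, 5\}$: $\mu = 1.8$, $\sigma = \sqrt{12.8 / 5} = 1.6$, $r \approx 0.889$ and $c = 1.8 + 3 \times 1.6 = 6.6$.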
The set threshold can be chosen according to the user's actual needs, for example 0.1.
When the ratio is found to be not less than the set threshold, values are removed from the set, and the reduced set is processed again with the space threshold calculation method. This is repeated until the ratio is less than the set threshold, at which point the space threshold is output.
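The iteration described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `space_threshold`, the use of the population standard deviation, and the handling of empty or zero-mean input are assumptions. One more assumption matters in practice: when no value exceeds the mean plus three standard deviations, the literal procedure would repeat on an unchanged set, so this sketch removes the single largest value in that case.

```python
from statistics import fmean, pstdev


def space_threshold(gaps, ratio_limit=0.1):
    """Adaptive space threshold for one set of non-negative gap widths."""
    values = sorted(gaps)
    if not values:
        raise ValueError("gap width set is empty")
    while True:
        mean = fmean(values)
        std = pstdev(values)  # population form: an assumption
        # A tight cluster ends the iteration; its maximum is the threshold.
        # A zero mean (all gaps zero) is treated the same way.
        if mean == 0 or std / mean < ratio_limit:
            return values[-1]
        cutoff = mean + 3 * std
        kept = [v for v in values if v <= cutoff]
        if len(kept) == len(values):
            # Assumption, not in the source: nothing exceeded the cutoff,
            # so drop the largest value to guarantee progress.
            kept = values[:-1]
        values = kept


# Twenty ordinary gaps and one wide gap: the wide gap exceeds the cutoff.
print(space_threshold([1.0] * 20 + [10.0]))
```

The fallback is needed often on short lines: with ten or fewer values, no single value can lie more than three population standard deviations above the mean, so the cutoff alone never removes anything.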
Further, the method for recognizing spaces in a document further comprises:
identifying the specific number of spaces between adjacent characters:
taking as the input set the set of gap width values in the base unit that were determined to be greater than the first space threshold;
processing the input set with the space threshold calculation method, and using the resulting space threshold as the second space threshold;
determining in turn whether each gap width value in the input set is greater than the second space threshold: if it is greater than the second space threshold, determining that two or more spaces exist between the adjacent characters; if it is not greater than the second space threshold, determining that one space exists between the adjacent characters.
In practice, it is sometimes unnecessary to determine the exact number of spaces between every pair of adjacent characters. In that case the gap width values can be divided into a few sets according to the number of spaces they contain, for example into three sets:
first set: no space; second set: one space; third set: two or more spaces.
When processing these three sets, only the gap width values that contain no space and those that contain one space need to be determined; the remaining gap width values go directly into the third set (two or more spaces). This avoids a large amount of unnecessary computation.
If the specific number of spaces needs to be determined further, for example three, four or more spaces, the above steps are simply repeated: the gap width values whose number of spaces has already been determined are removed from the initial gap width set, the remainder is processed with the space threshold calculation method, and the number of spaces is then determined. This continues until the number of spaces has been determined for the adjacent characters of interest, or for all adjacent characters, at which point the procedure ends and the result is output.
Fig. 3 is the implementation flow chart of the method for recognizing spaces in a document according to the corresponding embodiment of the present invention. With the set threshold at 0.1, the steps are described in detail as follows:
1. Taking a line or paragraph of the document to be recognized as the base unit, collect the gap width values between all adjacent characters in each base unit of the document.
2. Delete the gap width values that are less than zero, form a set from the non-negative values in the base unit, and obtain the initial gap width set corresponding to each base unit.
3. Take each initial gap width set in turn as the input set.
4. Calculate the mean and the standard deviation of the input set, and calculate the ratio of the standard deviation to the mean.
5. Determine whether the ratio of the standard deviation to the mean is less than 0.1.
6. If the ratio of the standard deviation to the mean is less than 0.1, output the maximum gap width value in the input set as the space threshold, and use the output space threshold as the first space threshold;
if the ratio of the standard deviation to the mean is not less than 0.1, calculate the sum of the mean and three times the standard deviation to obtain a calculation result, delete from the input set all gap width values greater than the calculation result, and feed the reduced input set back into step 4.
7. Determine in turn whether the gap width value between each pair of adjacent characters in each base unit is greater than the first space threshold corresponding to that base unit.
8. If it is greater than the first space threshold, determine that a space exists between the adjacent characters;
if it is not greater than the first space threshold, determine that no space exists between the adjacent characters.
9. Finally, output the recognition result indicating whether a space exists between each pair of adjacent characters in the document.
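Steps 2 to 8 for a single base unit can be sketched as below. The names `space_threshold` and `detect_spaces` are hypothetical, the population standard deviation is assumed, and the removal of the largest value when nothing exceeds the cutoff is an assumption added so that the loop always makes progress; the source does not describe that case.

```python
from statistics import fmean, pstdev


def space_threshold(gaps, ratio_limit=0.1):
    """Steps 4 to 6: iterate until std/mean drops below ratio_limit."""
    values = sorted(gaps)
    while True:
        mean, std = fmean(values), pstdev(values)
        if mean == 0 or std / mean < ratio_limit:
            return values[-1]
        cutoff = mean + 3 * std
        kept = [v for v in values if v <= cutoff]
        # Assumption: drop the largest value if the cutoff removed nothing.
        values = kept if len(kept) < len(values) else values[:-1]


def detect_spaces(unit_gaps, ratio_limit=0.1):
    """Return one flag per adjacent-character pair (True means a space)."""
    valid = [g for g in unit_gaps if g >= 0]          # step 2
    if not valid:
        return [False] * len(unit_gaps)
    first = space_threshold(valid, ratio_limit)       # steps 3 to 6
    return [g > first for g in unit_gaps]             # steps 7 and 8


# One line of text: ordinary gaps near 1.0, one overlap, two wide gaps.
line_gaps = [1.0, 1.05, -0.3, 0.95, 4.0, 1.0, 1.0, 4.2]
print(detect_spaces(line_gaps))
```

In this example the first space threshold comes out as 1.05, so only the gaps of 4.0 and 4.2 are flagged as spaces; the overlapping pair (-0.3) is excluded from the statistics and is never flagged.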
Fig. 4 is the implementation flow chart of the method for identifying the specific number of spaces provided by the corresponding embodiment of the present invention. The gap width values are divided into three sets by number of spaces (first set: no space; second set: one space; third set: two or more spaces) and the set threshold is 0.1. The steps are described in detail as follows:
1. Input the recognition result indicating whether a space exists between each pair of adjacent characters in the document.
2. In each base unit, take the set of gap width values determined to contain a space as the input set corresponding to that base unit.
3. Calculate the mean and the standard deviation of the input set, and calculate the ratio of the standard deviation to the mean.
4. Determine whether the ratio of the standard deviation to the mean is less than 0.1.
5. If the ratio of the standard deviation to the mean is less than 0.1, output the maximum gap width value in the input set as the space threshold, and use the output space threshold as the second space threshold;
if the ratio of the standard deviation to the mean is not less than 0.1, calculate the sum of the mean and three times the standard deviation to obtain a calculation result, delete from the input set all gap width values greater than the calculation result, and feed the reduced input set back into step 3.
6. Determine in turn whether each gap width value in each base unit that was determined to contain a space is greater than the second space threshold corresponding to that base unit.
7. If it is greater than the second space threshold, output that the gap value belongs to the third set: two or more spaces exist between the adjacent characters;
if it is not greater than the second space threshold, output that the gap value belongs to the second set: one space exists between the adjacent characters.
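The two-level classification into the three sets can be sketched as follows. The function names and the integer labels (0, 1, 2 for the first, second and third set) are hypothetical, the population standard deviation is assumed, and the removal of the largest value when the cutoff removes nothing is again an assumption that the source does not spell out.

```python
from statistics import fmean, pstdev


def space_threshold(gaps, ratio_limit=0.1):
    """Shared threshold calculation used for both thresholds."""
    values = sorted(gaps)
    while True:
        mean, std = fmean(values), pstdev(values)
        if mean == 0 or std / mean < ratio_limit:
            return values[-1]
        cutoff = mean + 3 * std
        kept = [v for v in values if v <= cutoff]
        # Assumption: drop the largest value if the cutoff removed nothing.
        values = kept if len(kept) < len(values) else values[:-1]


def classify_gaps(unit_gaps, ratio_limit=0.1):
    """Label each gap: 0 = no space, 1 = one space, 2 = two or more."""
    valid = [g for g in unit_gaps if g >= 0]
    if not valid:
        return [0] * len(unit_gaps)
    first = space_threshold(valid, ratio_limit)
    # Input set for the second pass: gaps already judged to hold a space.
    space_gaps = [g for g in valid if g > first]
    second = space_threshold(space_gaps, ratio_limit) if space_gaps else None
    labels = []
    for g in unit_gaps:
        if g <= first:
            labels.append(0)
        elif g <= second:
            labels.append(1)
        else:
            labels.append(2)
    return labels


gaps = [1.0, 1.05, 0.95, 4.0, 1.0, 8.3, 1.0, 4.2, 4.1, 4.0]
print(classify_gaps(gaps))
```

For this sample the first threshold is 1.05 and the second is 4.2, so the gap of 8.3 lands in the third set while the gaps around 4 are treated as single spaces.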
In the embodiments of the present invention, a line or paragraph is used as the base unit. The gap width values between all valid adjacent characters in each base unit of the document are calculated, and the space threshold calculation method is applied to them to adaptively compute the space threshold best suited to that base unit. The space threshold is then used to determine whether a space exists, and the same calculation method can be applied again to identify the specific number of spaces between adjacent characters. Because the space threshold is computed adaptively for each base unit, the method does away with the prior-art practice of judging spaces against a fixed default space width, so the space recognition rate in the document is greatly improved and the recognition results are more accurate and reliable.
Corresponding to the method for recognizing spaces in a document described in the foregoing embodiments, Fig. 5 shows a structural block diagram of the system for recognizing spaces in a document provided by an embodiment of the present invention.
Referring to Fig. 5, the system comprises:
an acquisition unit 51, configured to take a line or paragraph of the document as the base unit, collect the gap width values between all adjacent characters in each base unit, and obtain the initial gap width set corresponding to each base unit;
a processing unit 52, configured to take the initial gap width set as the input set, process the input set with the space threshold calculation method, and use the resulting space threshold as the first space threshold;
a determination unit 53, configured to determine in turn whether the gap width value between each pair of adjacent characters in the base unit is greater than the first space threshold: if it is greater than the first space threshold, determine that a space exists between the adjacent characters; if it is not greater than the first space threshold, determine that no space exists between the adjacent characters.
Further, in the processing unit 52:
the space threshold calculation method comprises:
calculating the mean (mathematical expectation) and the standard deviation of the input set, and calculating the ratio of the standard deviation to the mean;
determining whether the ratio is less than a set threshold:
if the ratio is less than the set threshold, taking the maximum gap width value in the input set as the space threshold and outputting the space threshold;
if the ratio is not less than the set threshold, calculating the sum of the mean and three times the standard deviation to obtain a calculation result, deleting from the input set all gap width values greater than the calculation result, and processing the reduced input set again with the space threshold calculation method.
Further, the acquisition unit 51 is also configured to:
if any of the gap width values is less than zero, delete the gap width values that are less than zero to obtain the initial gap width set.
Further, the acquisition unit 51 is configured to:
calculate the gap width value between adjacent characters from the origin positions of the adjacent characters.
Further, the system for recognizing spaces in a document is also configured to:
identify the specific number of spaces between adjacent characters:
take as the input set the set of gap width values in the base unit that were determined to be greater than the first space threshold;
process the input set with the space threshold calculation method, and use the resulting space threshold as the second space threshold;
determine in turn whether each gap width value in the input set is greater than the second space threshold: if it is greater than the second space threshold, determine that two or more spaces exist between the adjacent characters; if it is not greater than the second space threshold, determine that one space exists between the adjacent characters.
The embodiments described above are intended only to illustrate the technical solutions of the present invention, not to limit them. Although the invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments can still be modified, or some of their technical features replaced by equivalents. Such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present invention, and shall all fall within the protection scope of the present invention.
Claims (8)
1. A method for recognizing spaces in a document, characterized by comprising:
taking a line or paragraph of the document as the base unit, collecting the gap width values between all adjacent characters in each base unit, and obtaining the initial gap width set corresponding to each base unit;
taking the initial gap width set as the input set, processing the input set with a space threshold calculation method, and using the resulting space threshold as the first space threshold;
determining in turn whether the gap width value between each pair of adjacent characters in the base unit is greater than the first space threshold: if it is greater than the first space threshold, determining that a space exists between the adjacent characters; if it is not greater than the first space threshold, determining that no space exists between the adjacent characters;
wherein the space threshold calculation method comprises:
calculating the mean (mathematical expectation) and the standard deviation of the input set, and calculating the ratio of the standard deviation to the mean;
determining whether the ratio is less than a set threshold:
if the ratio is less than the set threshold, taking the maximum gap width value in the input set as the space threshold and outputting the space threshold;
if the ratio is not less than the set threshold, calculating the sum of the mean and three times the standard deviation to obtain a calculation result, deleting from the input set all gap width values greater than the calculation result, and processing the reduced input set again with the space threshold calculation method.
2. The method according to claim 1, characterized in that collecting the gap width values between all adjacent characters in each base unit and obtaining the initial gap width set corresponding to each base unit further comprises:
if any of the gap width values is less than zero, deleting the gap width values that are less than zero to obtain the initial gap width set.
3. The method of claim 1, characterized in that acquiring the gap width values between all adjacent characters in each basic unit comprises:
calculating the gap width value between adjacent characters according to the origin positions of the adjacent characters.
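Claims 2 and 3 together describe how the initial gap width set is built. The sketch below is one possible reading, under stated assumptions: the `Glyph` type and its fields are hypothetical, and subtracting the left character's width from the distance between origins is an assumption, since claim 3 only says the gap is calculated from the origin positions.

```python
from dataclasses import dataclass


@dataclass
class Glyph:
    origin_x: float  # horizontal origin of the character (hypothetical field)
    width: float     # horizontal extent of the character (hypothetical field)


def initial_gap_widths(glyphs):
    """Collect the non-negative gaps between consecutive glyphs of one basic unit."""
    gaps = []
    for left, right in zip(glyphs, glyphs[1:]):
        # Assumed formula: distance from the right edge of the left glyph
        # to the origin of the right glyph.
        gap = right.origin_x - (left.origin_x + left.width)
        if gap >= 0:  # claim 2 deletes only values less than zero
            gaps.append(gap)
    return gaps
```

Note that dropping negative values means the returned list no longer lines up one-to-one with the character pairs; a caller that needs positions would have to keep the pair index alongside each gap.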
4. The method of claim 1, characterized in that the method for recognizing spaces in a document further comprises:
identifying the specific number of spaces between adjacent characters:
taking, as an input set, the set of gap width values in the basic unit that were judged to be greater than the first space threshold;
processing the input set by the space threshold calculation method, and taking the resulting space threshold as a second space threshold;
judging in turn whether each gap width value in the input set is greater than the second space threshold: if it is greater than the second space threshold, determining that two or more spaces exist between the adjacent characters; if it is not greater than the second space threshold, determining that one space exists between the adjacent characters.
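The two-pass scheme of claim 4 can be sketched as below. The threshold function is a compact restatement of the calculation in claim 1 so that the example is self-contained; the function names, the `ratio_limit` default, the population standard deviation, and the stop-when-nothing-is-removed guard are assumptions of this sketch, as is the encoding of the result as 0, 1 or 2.

```python
from statistics import fmean, pstdev


def space_threshold(gaps, ratio_limit=0.5):
    """Threshold calculation of claim 1 (ratio_limit is a placeholder value)."""
    values = list(gaps)
    while True:
        mean, std = fmean(values), pstdev(values)
        if mean <= 0 or std / mean < ratio_limit:
            return max(values)
        kept = [g for g in values if g <= mean + 3 * std]
        if len(kept) == len(values):  # assumed guard against endless repetition
            return max(values)
        values = kept


def count_spaces(gaps, ratio_limit=0.5):
    """Return 0 (no space), 1 (one space) or 2 (two or more spaces) per gap."""
    first = space_threshold(gaps, ratio_limit)
    wide = [g for g in gaps if g > first]  # gaps already judged to be spaces
    if not wide:
        return [0] * len(gaps)
    second = space_threshold(wide, ratio_limit)
    return [0 if g <= first else (2 if g > second else 1) for g in gaps]
```

The second threshold is computed only from the gaps that passed the first test, so the narrow inter-character gaps do not influence the decision between one space and several.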
5. A system for recognizing spaces in a document, characterized by comprising:
an acquisition unit, configured to take a line or a paragraph of the document as a basic unit, separately acquire the gap width values between all adjacent characters in each basic unit, and obtain an initial gap width set corresponding to each basic unit;
a processing unit, configured to take the initial gap width set as an input set, process the input set by a space threshold calculation method, and take the resulting space threshold as a first space threshold;
a judgment unit, configured to judge in turn whether the gap width value between each pair of adjacent characters in the basic unit is greater than the first space threshold: if it is greater than the first space threshold, determine that a space exists between the adjacent characters; if it is not greater than the first space threshold, determine that no space exists between the adjacent characters;
wherein the space threshold calculation method comprises:
calculating the mathematical expectation and the standard deviation of the input set, and calculating the ratio of the standard deviation to the mathematical expectation;
judging whether the ratio is less than a set threshold:
if the ratio is less than the set threshold, taking the maximum gap width value in the input set as the space threshold, and outputting the space threshold;
if the ratio is not less than the set threshold, calculating the sum of the mathematical expectation and three times the standard deviation to obtain a calculation result, deleting from the input set all gap width values greater than the calculation result, and processing the input set after deletion again by the space threshold calculation method.
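The space threshold calculation method recited above can also be written compactly as a recursive definition. The symbols $G$ (input set of $n$ gap width values $g_i$), $\tau$ (the set threshold) and $T$ (the resulting space threshold) are notation introduced here for illustration, and the $1/n$ form of the standard deviation is an assumption, since the claim does not say which form is meant.

```latex
\mu = \frac{1}{n}\sum_{i=1}^{n} g_i, \qquad
\sigma = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (g_i - \mu)^2}, \qquad
r = \frac{\sigma}{\mu}

T(G) =
\begin{cases}
\max G, & r < \tau \\
T\bigl(\{\, g \in G : g \le \mu + 3\sigma \,\}\bigr), & r \ge \tau
\end{cases}
```

Under this reading, each repetition recomputes $\mu$, $\sigma$ and $r$ on the reduced set, and the calculation ends once the remaining gap widths are homogeneous enough that $r < \tau$.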
6. The system of claim 5, characterized in that the acquisition unit is further configured to:
if any of the gap width values is less than zero, delete the gap width values that are less than zero to obtain the initial gap width set.
7. The system of claim 5, characterized in that the acquisition unit is configured to:
calculate the gap width value between adjacent characters according to the origin positions of the adjacent characters.
8. The system of claim 5, characterized in that the system for recognizing spaces in a document is further configured to:
identify the specific number of spaces between adjacent characters:
take, as an input set, the set of gap width values in the basic unit that were judged to be greater than the first space threshold;
process the input set by the space threshold calculation method, and take the resulting space threshold as a second space threshold;
judge in turn whether each gap width value in the input set is greater than the second space threshold: if it is greater than the second space threshold, determine that two or more spaces exist between the adjacent characters; if it is not greater than the second space threshold, determine that one space exists between the adjacent characters.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610843703.XA CN106649213B (en) | 2016-09-22 | 2016-09-22 | Space recognition methods and system in a kind of document |
Publications (2)
Publication Number | Publication Date |
---|---|
CN106649213A (en) | 2017-05-10 |
CN106649213B (en) | 2019-08-20 |
Family
ID=58853165
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610843703.XA Active CN106649213B (en) | 2016-09-22 | 2016-09-22 | Space recognition methods and system in a kind of document |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106649213B (en) |
Families Citing this family (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109325215B (en) * | 2018-12-04 | 2023-02-10 | 万兴科技股份有限公司 | Word text output method and device |
CN109582934B (en) * | 2018-12-04 | 2023-02-10 | 万兴科技股份有限公司 | Format document conversion method and device |
CN112699634B (en) * | 2020-12-28 | 2022-05-24 | 掌阅科技股份有限公司 | Typesetting processing method of electronic book, electronic equipment and storage medium |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101901333A (en) * | 2009-05-25 | 2010-12-01 | 汉王科技股份有限公司 | Method for segmenting word in text image and identification device using same |
CN101901348A (en) * | 2010-06-29 | 2010-12-01 | 北京捷通华声语音技术有限公司 | Normalization based handwriting identifying method and identifying device |
CN101980185A (en) * | 2010-10-29 | 2011-02-23 | 方正国际软件有限公司 | Method and system for removing spaces from text copied from double-layer electronic file |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP6201488B2 (en) * | 2013-07-29 | 2017-09-27 | 富士通株式会社 | Selected character identification program, selected character identification method, and selected character identification device |
Non-Patent Citations (2)
Title |
---|
Automatic Word Spacing Using Probabilistic Models Based on Character n-grams; DG Lee et al.; IEEE Intelligent Systems; 2007-01-29; Vol. 22, No. 1; pp. 28-35 |
Quickly Deleting Redundant Spaces (快速删除多余的空格); Nan Yugang (南玉刚); 《电脑迷》; 2007-06-15; 2007, No. 12; p. 84 |
Also Published As
Publication number | Publication date |
---|---|
CN106649213A (en) | 2017-05-10 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||