CN106649213A - Method and system for identifying spaces in document - Google Patents
- Publication number: CN106649213A
- Application number: CN201610843703.XA
- Authority: CN (China)
- Prior art keywords: space; threshold value; value; space threshold; gap length
- Legal status: Granted (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/10—Text processing
- G06F40/12—Use of codes for handling textual entities
- G06F40/163—Handling of whitespace
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/10—File systems; File servers
- G06F16/17—Details of further file system functions
- G06F16/178—Techniques for file synchronisation in file systems
- G06F16/1794—Details of file format conversion
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
Abstract
The invention is suitable for the field of character identification and provides a method and a system for identifying spaces in a document. The method comprises the steps of acquiring width values of gaps between all adjacent characters in each basic unit by taking lines or paragraphs in the document as basic units, and obtaining an initial gap width set corresponding to each basic unit; taking the initial gap width set as an input set, processing the input set through a space threshold calculation method, and taking an obtained space threshold as a first space threshold; and judging whether the width values of the gaps between the adjacent characters in the basic unit are greater than the first space threshold or not in sequence: if the width values of the gaps between the adjacent characters in the basic unit are greater than the first space threshold, judging that the spaces exist between the adjacent characters; and if the width values of the gaps between the adjacent characters in the basic unit are not greater than the first space threshold, judging that the spaces do not exist between the adjacent characters. The fixed space width does not need to be used for judging the spaces, so that the precision of identifying the spaces in the document is improved and an identification result is more accurate and reliable.
Description
Technical field
The invention belongs to the field of information recognition technology, and in particular relates to a method and a system for identifying spaces in a document.
Background
When a fixed-layout document such as a PDF (Portable Document Format) file is converted into a document in another format (such as WORD or TXT), the characters in the document must be recognized. In particular, the gaps between adjacent characters must be evaluated so that words and punctuation can be organized correctly.
A gap between adjacent characters in a document can have several causes, for example: an actual space, character spacing set by the page layout, character-spacing adjustment during typesetting, and independent text objects.
In the prior art, the minimum adjacent-character spacing of the whole document is taken as a baseline: this minimum is subtracted from every adjacent-character spacing, and the result is checked against a predetermined space width. However, the predetermined space width itself cannot be determined reliably, and the alignment mode also affects the space width. Both issues lead to low space-recognition accuracy and inaccurate recognition results.
Summary of the invention
In view of this, embodiments of the present invention provide a method and a system for identifying spaces in a document, to solve the prior-art problems of low space-recognition accuracy and inaccurate recognition results.
In a first aspect, a method for identifying spaces in a document is provided, including:
Taking lines or paragraphs in the document as basic units, collecting the gap width values between all adjacent characters in each basic unit, and obtaining an initial gap width set corresponding to each basic unit;
Taking the initial gap width set as an input set, processing the input set with a space threshold calculation method, and taking the resulting space threshold as a first space threshold;
Judging in turn whether the gap width value between each pair of adjacent characters in the basic unit is greater than the first space threshold: if it is greater than the first space threshold, judging that a space exists between the adjacent characters; if it is not greater than the first space threshold, judging that no space exists between the adjacent characters.
Further, the space threshold calculation method includes:
Calculating the mathematical expectation (mean) and the standard deviation of the input set, and calculating the ratio of the standard deviation to the mathematical expectation;
Judging whether the ratio is less than a set threshold:
If the ratio is less than the set threshold, taking the maximum gap width value in the input set as the space threshold, and outputting the space threshold;
If the ratio is not less than the set threshold, calculating the sum of the mathematical expectation and three times the standard deviation to obtain a calculation result, deleting from the input set all gap width values greater than the calculation result, and processing the input set after deletion with the space threshold calculation method again.
Further, collecting the gap width values between all adjacent characters in each basic unit and obtaining the initial gap width set corresponding to each basic unit also includes:
If any of the gap width values is less than zero, deleting the gap width values that are less than zero to obtain the initial gap width set.
Further, collecting the gap width values between all adjacent characters in each basic unit includes:
Calculating the gap width value between adjacent characters according to the origin positions of the adjacent characters.
Further, the method for identifying spaces in a document also includes:
Identifying the specific number of spaces between adjacent characters:
Taking, as an input set, the set of gap width values in the basic unit that were judged to be greater than the first space threshold;
Processing the input set with the space threshold calculation method, and taking the resulting space threshold as a second space threshold;
Judging in turn whether each gap width value in the input set is greater than the second space threshold: if it is greater than the second space threshold, judging that two or more spaces exist between the adjacent characters; if it is not greater than the second space threshold, judging that one space exists between the adjacent characters.
In a second aspect, a system for identifying spaces in a document is provided, including:
A collecting unit, configured to take lines or paragraphs in the document as basic units, collect the gap width values between all adjacent characters in each basic unit, and obtain an initial gap width set corresponding to each basic unit;
A processing unit, configured to take the initial gap width set as an input set, process the input set with a space threshold calculation method, and take the resulting space threshold as a first space threshold;
A judging unit, configured to judge in turn whether the gap width value between each pair of adjacent characters in the basic unit is greater than the first space threshold: if it is greater than the first space threshold, judging that a space exists between the adjacent characters; if it is not greater than the first space threshold, judging that no space exists between the adjacent characters.
Further, the space threshold calculation method includes:
Calculating the mathematical expectation (mean) and the standard deviation of the input set, and calculating the ratio of the standard deviation to the mathematical expectation;
Judging whether the ratio is less than a set threshold:
If the ratio is less than the set threshold, taking the maximum gap width value in the input set as the space threshold, and outputting the space threshold;
If the ratio is not less than the set threshold, calculating the sum of the mathematical expectation and three times the standard deviation to obtain a calculation result, deleting from the input set all gap width values greater than the calculation result, and processing the input set after deletion with the space threshold calculation method again.
Further, the collecting unit is also configured to:
If any of the gap width values is less than zero, delete the gap width values that are less than zero to obtain the initial gap width set.
Further, the collecting unit is configured to:
Calculate the gap width value between adjacent characters according to the origin positions of the adjacent characters.
Further, the system for identifying spaces in a document is also configured to:
Identify the specific number of spaces between adjacent characters:
Taking, as an input set, the set of gap width values in the basic unit that were judged to be greater than the first space threshold;
Processing the input set with the space threshold calculation method, and taking the resulting space threshold as a second space threshold;
Judging in turn whether each gap width value in the input set is greater than the second space threshold: if it is greater than the second space threshold, judging that two or more spaces exist between the adjacent characters; if it is not greater than the second space threshold, judging that one space exists between the adjacent characters.
In the embodiments of the present invention, the collected gap width set is taken as the input set, and the first space threshold is calculated adaptively with the space threshold calculation method in order to judge whether a space exists. No fixed space width is needed to judge spaces, so the accuracy of space recognition in the document is improved and the recognition results are more accurate and reliable.
Brief description of the drawings
To illustrate the technical solutions in the embodiments of the present invention more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. The drawings described below show only some embodiments of the present invention, and those of ordinary skill in the art can derive other drawings from them without creative effort.
Fig. 1 is a flowchart of the method for identifying spaces in a document provided by an embodiment of the present invention;
Fig. 2 is a schematic diagram of the distributed alignment setting in Word, provided by an embodiment of the present invention;
Fig. 3 is a flowchart of the method for identifying spaces in a document provided by another embodiment of the present invention;
Fig. 4 is a flowchart of the method for identifying the specific number of spaces provided by an embodiment of the present invention;
Fig. 5 is a structural block diagram of the system for identifying spaces in a document provided by an embodiment of the present invention.
Detailed description
In the following description, specific details such as particular system structures and technologies are set forth for the purpose of illustration rather than limitation, so that the embodiments of the present invention can be thoroughly understood. However, it will be clear to those skilled in the art that the present invention can also be practiced in other embodiments without these specific details. In other cases, detailed descriptions of well-known systems, devices, circuits, and methods are omitted so that unnecessary detail does not obscure the description of the present invention.
The embodiments of the present invention take lines or paragraphs in the document as basic units, collect the gap width values between all adjacent characters in each basic unit, and obtain an initial gap width set corresponding to each basic unit. The initial gap width set is taken as an input set and processed with a space threshold calculation method, and the resulting space threshold is taken as a first space threshold. It is then judged in turn whether the gap width value between each pair of adjacent characters in the basic unit is greater than the first space threshold: if it is greater than the first space threshold, a space is judged to exist between the adjacent characters; if it is not greater than the first space threshold, no space is judged to exist between the adjacent characters.
Fig. 1 shows a flowchart of a method for identifying spaces in a document provided by an embodiment of the present invention, detailed as follows:
In S101, taking lines or paragraphs in the document as basic units, the gap width values between all adjacent characters in each basic unit are collected, and an initial gap width set corresponding to each basic unit is obtained.
In practice, taking the common office document format Word as an example, spaces and character spacing in Word are generally adjusted adaptively according to the alignment mode, such as the common justified alignment or distributed alignment. With distributed alignment, the characters of the last line of a paragraph are spread evenly to fill the line, as shown in Fig. 2, so the character gaps in that line are noticeably larger than in other lines. Taking the paragraph as the basic unit, on the other hand, reduces the amount of computation and makes recognition more efficient. The present invention therefore processes the document with lines or paragraphs as basic units, in order to obtain more accurate and reliable recognition results.
In S102, the initial gap width set is taken as an input set, the input set is processed with the space threshold calculation method, and the resulting space threshold is taken as the first space threshold.
The space threshold calculation method processes the input set it is given, adaptively works out the gap width values of all the gaps in each input set, and selects the most suitable gap width value as the space threshold for judging whether a space exists between adjacent characters.
In S103, it is judged in turn whether the gap width value between each pair of adjacent characters in the basic unit is greater than the first space threshold: if it is greater than the first space threshold, a space is judged to exist between the adjacent characters; if it is not greater than the first space threshold, no space is judged to exist between the adjacent characters.
Judging whether the gap width value between characters is greater than the first space threshold can only establish that a space exists. The presence of a space between adjacent characters covers several specific cases, such as one space, or two or more spaces.
Further, in S101, collecting the gap width values between all adjacent characters in each basic unit and obtaining the initial gap width set corresponding to each basic unit also includes:
If any of the gap width values is less than zero, deleting the gap width values that are less than zero to obtain the initial gap width set.
Characters in a document may cross or overlap, which makes the gap between adjacent characters negative. If such negative values were included in the calculation, they would cause recognition errors and make the recognition result inaccurate, so negative values must be deleted before recognition processing.
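A minimal sketch of this filtering step, assuming the raw gap widths of one basic unit are already available as a list of numbers. The function name `initial_gap_set` is illustrative and does not come from the source. Zero-width gaps are kept and only values below zero are dropped, which matches the "less than zero" wording above.

```python
def initial_gap_set(raw_gaps):
    """Return the initial gap width set: the raw gaps with negative values removed."""
    # Negative gaps come from overlapping or crossing characters and would
    # skew the statistics computed in the later threshold step.
    return [g for g in raw_gaps if g >= 0]


raw = [0.8, -0.3, 0.0, 4.2]
print(initial_gap_set(raw))  # [0.8, 0.0, 4.2]
```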
Further, in S101, collecting the gap width values between all adjacent characters in each basic unit includes:
Calculating the gap width value between adjacent characters according to the origin positions of the adjacent characters.
So that the calculated gap matches, as closely as possible, the gap between adjacent characters as seen by the user, the embodiment of the present invention starts from the origin positions of the adjacent characters and also takes into account factors such as PDF page rotation, text object rotation, and the writing direction of the characters when calculating the final gap width value between adjacent characters.
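The source does not spell out the geometry, so the following is only a sketch under stated assumptions: each character is represented by a hypothetical `Glyph` record holding its origin and its advance width, and page rotation, text object rotation, and writing direction have already been reduced to a single unit vector `direction`. None of these names come from the patent.

```python
from dataclasses import dataclass


@dataclass
class Glyph:
    """Hypothetical glyph record: origin (x, y) and advance width along the writing direction."""
    x: float
    y: float
    advance: float


def gap_widths(glyphs, direction=(1.0, 0.0)):
    """Gap between each pair of adjacent glyphs, measured along `direction`.

    `direction` is assumed to be a unit vector that already accounts for
    page rotation, text object rotation, and writing direction.
    """
    ux, uy = direction
    gaps = []
    for cur, nxt in zip(glyphs, glyphs[1:]):
        # Distance between the two origins projected onto the writing direction.
        along = (nxt.x - cur.x) * ux + (nxt.y - cur.y) * uy
        # Subtract the current glyph's own width to get the empty gap;
        # overlapping glyphs yield a negative value.
        gaps.append(along - cur.advance)
    return gaps


line = [Glyph(0, 0, 5), Glyph(6, 0, 5), Glyph(16, 0, 5)]
print(gap_widths(line))  # [1.0, 5.0]
```

For top-to-bottom text, the same function can be called with `direction=(0.0, -1.0)`, assuming the y axis points upward as in PDF user space.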
Further, in S102:
The space threshold calculation method includes:
Calculating the mathematical expectation (mean) and the standard deviation of the input set, and calculating the ratio of the standard deviation to the mathematical expectation;
Judging whether the ratio is less than a set threshold:
If the ratio is less than the set threshold, taking the maximum gap width value in the input set as the space threshold, and outputting the space threshold;
If the ratio is not less than the set threshold, calculating the sum of the mathematical expectation and three times the standard deviation to obtain a calculation result, deleting from the input set all gap width values greater than the calculation result, and processing the input set after deletion with the space threshold calculation method again.
Calculating the sum of the mathematical expectation and three times the standard deviation means: the calculated standard deviation is multiplied by three, and the resulting value is then added to the calculated mathematical expectation.
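The rule can be written compactly as follows. The symbols are notation introduced here for illustration and do not appear in the source: $S$ is the input set, $\mu$ and $\sigma$ are its mathematical expectation and standard deviation, $\tau$ is the set threshold, and $T$ is the resulting space threshold.

```latex
\[
c = \frac{\sigma}{\mu}, \qquad
\begin{cases}
T = \max(S) & \text{if } c < \tau, \\[4pt]
S \leftarrow \{\, g \in S : g \le \mu + 3\sigma \,\} \ \text{and repeat} & \text{if } c \ge \tau.
\end{cases}
\]
```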
The set threshold can be chosen according to the user's actual needs, for example 0.1.
When the ratio is judged to be not less than the set threshold, values are removed from the set, and the reduced set is processed with the space threshold calculation method again; this repeats until the ratio is less than the set threshold, at which point the space threshold is output.
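The loop described above can be sketched as follows. This is an illustration under stated assumptions, not the patent's reference implementation: the population standard deviation is used because the source does not say which variant is meant, and two guards are added that the source does not describe (a zero mean is treated as converged, and the loop stops when no value exceeds the cutoff, since otherwise it would never terminate). The name `space_threshold` is illustrative.

```python
from statistics import fmean, pstdev


def space_threshold(gaps, limit=0.1):
    """Adaptive space threshold for one input set of non-negative gap widths."""
    values = list(gaps)
    if not values:
        raise ValueError("input set is empty")
    while True:
        mean = fmean(values)
        std = pstdev(values)  # assumption: population standard deviation
        # Converged: the remaining gaps are nearly uniform, so the largest
        # of them is taken as the space threshold.
        if mean == 0 or std / mean < limit:
            return max(values)
        cutoff = mean + 3 * std
        kept = [v for v in values if v <= cutoff]
        if len(kept) == len(values):
            # Assumption (not in the source): nothing exceeds the cutoff,
            # so stop here instead of looping forever.
            return max(values)
        values = kept


# Twenty letter gaps of 1.0 and two much wider gaps of 5.0.
gaps = [1.0] * 20 + [5.0] * 2
print(space_threshold(gaps))  # 1.0
```

In this example the first pass gives a cutoff of roughly 4.81, so both 5.0 values are removed; the second pass sees only 1.0 values, the ratio is 0, and the maximum remaining gap, 1.0, becomes the threshold. Note that with small input sets no value can exceed the mean plus three standard deviations, so the added stop guard is what ends the loop there.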
Further, the method for identifying spaces in a document also includes:
Identifying the specific number of spaces between adjacent characters:
Taking, as an input set, the set of gap width values in the basic unit that were judged to be greater than the first space threshold;
Processing the input set with the space threshold calculation method, and taking the resulting space threshold as a second space threshold;
Judging in turn whether each gap width value in the input set is greater than the second space threshold: if it is greater than the second space threshold, judging that two or more spaces exist between the adjacent characters; if it is not greater than the second space threshold, judging that one space exists between the adjacent characters.
In practice, when it is not necessary to determine the exact number of spaces between every pair of adjacent characters, the gap width values can be divided into a few sets according to the number of spaces they contain, for example three sets:
First set: no space; second set: one space; third set: two or more spaces.
When processing and judging these three sets, only the gap width values that contain no space and those that contain one space need to be determined; the remaining gap width values are placed directly in the third set (two or more spaces), which avoids a large amount of unnecessary computation.
If the specific number of spaces needs to be determined further, for example to identify gaps containing three, four, or more spaces, the above steps are simply repeated: the gap width values whose space count has already been determined are removed from the initial gap width set, the remainder is processed with the space threshold calculation method, and the specific number of spaces is then judged. The procedure ends and outputs the result once the required number of spaces between adjacent characters has been determined, or once the specific number of spaces between all adjacent characters has been determined.
The embodiment corresponding to Fig. 3 is a flowchart of the method for identifying spaces in a document according to an embodiment of the present invention. With the set threshold at 0.1, the steps are detailed as follows:
1. Taking the lines or paragraphs of the document to be processed as basic units, collect the gap width values between all adjacent characters in each basic unit of the document.
2. Delete the gap width values that are less than zero; the non-negative values in each basic unit form the initial gap width set corresponding to that basic unit.
3. Take each initial gap width set in turn as the input set.
4. Calculate the mathematical expectation and the standard deviation of the input set, and calculate the ratio of the standard deviation to the mathematical expectation.
5. Judge whether the ratio of the standard deviation to the mathematical expectation is less than 0.1.
6. If the ratio is less than 0.1, output the maximum gap width value in the input set as the space threshold, and take the output space threshold as the first space threshold;
If the ratio is not less than 0.1, calculate the sum of the mathematical expectation and three times the standard deviation to obtain a calculation result, delete from the input set all gap width values greater than the calculation result, and feed the input set after deletion back into step 4.
7. Judge in turn whether the gap width value between each pair of adjacent characters in each basic unit is greater than the first space threshold corresponding to that basic unit.
8. If it is greater than the first space threshold, judge that a space exists between the adjacent characters;
If it is not greater than the first space threshold, judge that no space exists between the adjacent characters.
9. Finally, output the recognition result of whether a space exists between each pair of adjacent characters in the document.
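The nine steps can be strung together roughly as follows. This is a sketch under assumptions rather than the patented implementation: each basic unit is given as a plain list of gap widths (step 1 is assumed done), the population standard deviation is used, and the helper stops when no value exceeds the cutoff, a guard the source does not describe. The names `space_threshold` and `detect_spaces` are illustrative.

```python
from statistics import fmean, pstdev


def space_threshold(values, limit=0.1):
    """Steps 4 to 6: shrink the set until std/mean drops below `limit`."""
    values = list(values)
    while True:
        mean, std = fmean(values), pstdev(values)
        if mean == 0 or std / mean < limit:
            return max(values)
        kept = [v for v in values if v <= mean + 3 * std]
        if len(kept) == len(values):
            # Assumption (not in the source): stop when nothing can be removed.
            return max(values)
        values = kept


def detect_spaces(units, limit=0.1):
    """For each basic unit, return one boolean per gap: True where a space is judged to exist."""
    results = []
    for gaps in units:
        initial = [g for g in gaps if g >= 0]        # step 2: drop negative gaps
        if not initial:                              # nothing usable in this unit
            results.append([False] * len(gaps))
            continue
        first = space_threshold(initial, limit)      # steps 3 to 6
        results.append([g > first for g in gaps])    # steps 7 and 8
    return results                                   # step 9


line = [1.0] * 10 + [5.0] + [1.0] * 10 + [5.0]
flags = detect_spaces([line])[0]
print([i for i, is_space in enumerate(flags) if is_space])  # [10, 21]
```

Because each basic unit gets its own threshold, a line stretched by distributed alignment is judged against its own gaps rather than against the rest of the document.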
The embodiment corresponding to Fig. 4 is a flowchart of the method for identifying the specific number of spaces provided by an embodiment of the present invention. The gaps are divided by space count into three sets for processing and judgment (first set: no space; second set: one space; third set: two or more spaces). With the set threshold at 0.1, the steps are detailed as follows:
1. Input the recognition result of whether a space exists between each pair of adjacent characters in the document.
2. In each basic unit, take the set of gap width values that were judged to contain a space as the input set corresponding to that basic unit.
3. Calculate the mathematical expectation and the standard deviation of the input set, and calculate the ratio of the standard deviation to the mathematical expectation.
4. Judge whether the ratio of the standard deviation to the mathematical expectation is less than 0.1.
5. If the ratio is less than 0.1, output the maximum gap width value in the input set as the space threshold, and take the output space threshold as the second space threshold;
If the ratio is not less than 0.1, calculate the sum of the mathematical expectation and three times the standard deviation to obtain a calculation result, delete from the input set all gap width values greater than the calculation result, and feed the input set after deletion back into step 3.
6. Judge in turn whether each gap width value in the basic unit that was judged to contain a space is greater than the second space threshold corresponding to that basic unit.
7. If it is greater than the second space threshold, output that the gap value belongs to the third set: two or more spaces exist between the adjacent characters;
If it is not greater than the second space threshold, output that the gap value belongs to the second set: one space exists between the adjacent characters.
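The three-set classification can be sketched in the same style. As before, this is an illustration under assumptions: the first space threshold of the basic unit is passed in as already computed, the population standard deviation is used, and the helper stops when no value exceeds the cutoff, which the source does not describe. The labels 0, 1, and 2 stand for the first, second, and third sets; all function names are illustrative.

```python
from statistics import fmean, pstdev


def space_threshold(values, limit=0.1):
    """Steps 3 to 5: shrink the set until std/mean drops below `limit`."""
    values = list(values)
    while True:
        mean, std = fmean(values), pstdev(values)
        if mean == 0 or std / mean < limit:
            return max(values)
        kept = [v for v in values if v <= mean + 3 * std]
        if len(kept) == len(values):
            # Assumption (not in the source): stop when nothing can be removed.
            return max(values)
        values = kept


def count_spaces(gaps, first_threshold, limit=0.1):
    """Label each gap 0 (no space), 1 (one space), or 2 (two or more spaces)."""
    with_space = [g for g in gaps if g > first_threshold]   # step 2
    if not with_space:
        return [0] * len(gaps)
    second = space_threshold(with_space, limit)             # steps 3 to 5
    labels = []
    for g in gaps:                                          # steps 6 and 7
        if g <= first_threshold:
            labels.append(0)
        elif g > second:
            labels.append(2)
        else:
            labels.append(1)
    return labels


gaps = [1.0] * 5 + [5.0] * 20 + [10.0] * 2
labels = count_spaces(gaps, first_threshold=1.0)
print(labels.count(0), labels.count(1), labels.count(2))  # 5 20 2
```

Here the second pass runs only on the 22 gaps that exceed the first threshold; the two 10.0 values lie above the cutoff of about 9.77 and are removed, leaving 5.0 as the second space threshold.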
In the embodiments of the present invention, lines or paragraphs are taken as basic units, the gap width values between all valid adjacent characters in each basic unit of the document are calculated, and the space threshold calculation method is applied to the gap width values to calculate a space threshold adaptively, yielding the space threshold best suited to the basic unit. This space threshold is then used to judge whether a space exists. The space threshold calculation method can also be used to further identify the specific number of spaces between adjacent characters. Because the space threshold is calculated adaptively for each basic unit, the method no longer depends on the fixed, preset space width that the prior art uses to judge whether a space exists, so the accuracy of space recognition in the document is greatly improved and the recognition results are more accurate and reliable.
Corresponding to the method for identifying spaces in a document described in the foregoing embodiments, Fig. 5 shows a structural block diagram of the system for identifying spaces in a document provided by an embodiment of the present invention.
Referring to Fig. 5, the system includes:
A collecting unit 51, configured to take lines or paragraphs in the document as basic units, collect the gap width values between all adjacent characters in each basic unit, and obtain an initial gap width set corresponding to each basic unit;
A processing unit 52, configured to take the initial gap width set as an input set, process the input set with the space threshold calculation method, and take the resulting space threshold as the first space threshold;
A judging unit 53, configured to judge in turn whether the gap width value between each pair of adjacent characters in the basic unit is greater than the first space threshold: if it is greater than the first space threshold, judging that a space exists between the adjacent characters; if it is not greater than the first space threshold, judging that no space exists between the adjacent characters.
Further, in the processing unit 52:
The space threshold calculation method includes:
Calculating the mathematical expectation (mean) and the standard deviation of the input set, and calculating the ratio of the standard deviation to the mathematical expectation;
Judging whether the ratio is less than a set threshold:
If the ratio is less than the set threshold, taking the maximum gap width value in the input set as the space threshold, and outputting the space threshold;
If the ratio is not less than the set threshold, calculating the sum of the mathematical expectation and three times the standard deviation to obtain a calculation result, deleting from the input set all gap width values greater than the calculation result, and processing the input set after deletion with the space threshold calculation method again.
Further, the collecting unit 51 is also configured to:
If any of the gap width values is less than zero, delete the gap width values that are less than zero to obtain the initial gap width set.
Further, the collecting unit 51 is configured to:
Calculate the gap width value between adjacent characters according to the origin positions of the adjacent characters.
Further, the system for identifying spaces in a document is also configured to:
Identify the specific number of spaces between adjacent characters:
Taking, as an input set, the set of gap width values in the basic unit that were judged to be greater than the first space threshold;
Processing the input set with the space threshold calculation method, and taking the resulting space threshold as a second space threshold;
Judging in turn whether each gap width value in the input set is greater than the second space threshold: if it is greater than the second space threshold, judging that two or more spaces exist between the adjacent characters; if it is not greater than the second space threshold, judging that one space exists between the adjacent characters.
The embodiments described above are intended only to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments can still be modified, or some of their technical features can be replaced by equivalents. Such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present invention, and all of them shall fall within the protection scope of the present invention.
Claims (10)
1. A method for identifying spaces in a document, characterized by including:
Taking lines or paragraphs in the document as basic units, collecting the gap width values between all adjacent characters in each basic unit, and obtaining an initial gap width set corresponding to each basic unit;
Taking the initial gap width set as an input set, processing the input set with a space threshold calculation method, and taking the resulting space threshold as a first space threshold;
Judging in turn whether the gap width value between each pair of adjacent characters in the basic unit is greater than the first space threshold: if it is greater than the first space threshold, judging that a space exists between the adjacent characters; if it is not greater than the first space threshold, judging that no space exists between the adjacent characters.
2. The method according to claim 1, characterized in that:
The space threshold calculation method includes:
Calculating the mathematical expectation and the standard deviation of the input set, and calculating the ratio of the standard deviation to the mathematical expectation;
Judging whether the ratio is less than a set threshold:
If the ratio is less than the set threshold, taking the maximum gap width value in the input set as the space threshold, and outputting the space threshold;
If the ratio is not less than the set threshold, calculating the sum of the mathematical expectation and three times the standard deviation to obtain a calculation result, deleting from the input set all gap width values greater than the calculation result, and processing the input set after deletion with the space threshold calculation method again.
3. The method of claim 1, characterized in that collecting the gap width values between all adjacent characters in each basic unit
and obtaining the initial gap width set corresponding to each basic unit further comprises:
if any of the gap width values is less than zero, deleting the gap width values that are less than zero to obtain
the initial gap width set.
4. The method of claim 1, characterized in that collecting the gap width values between all adjacent characters in each basic unit
comprises:
calculating the gap width value between adjacent characters according to the origin positions of the adjacent characters.
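Claims 3 and 4 together describe how the initial gap width set is built; a possible Python sketch follows. The `Glyph` record and its `origin_x` and `advance` fields are hypothetical names introduced here. The claim only says the gap is calculated from the origin positions, so subtracting the left character's width is an assumption of this sketch.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Glyph:
    """Hypothetical character record; the field names are assumptions."""
    origin_x: float  # x coordinate of the character origin
    advance: float   # horizontal width occupied by the character


def initial_gap_widths(glyphs: Sequence[Glyph]) -> List[float]:
    """Gap widths between adjacent characters of one basic unit.

    Gap widths less than zero are deleted, as in claim 3.
    """
    gaps: List[float] = []
    for left, right in zip(glyphs, glyphs[1:]):
        # Assumed formula: origin of the next character minus the
        # right edge (origin + advance) of the current one.
        gap = right.origin_x - (left.origin_x + left.advance)
        if gap >= 0:  # a negative gap (overlap) is dropped
            gaps.append(gap)
    return gaps


line = [Glyph(0.0, 5.0), Glyph(5.0, 5.0), Glyph(12.0, 5.0), Glyph(16.0, 5.0)]
print(initial_gap_widths(line))
```

Once negative gaps are deleted, the list no longer lines up one-to-one with the character pairs, so a fuller implementation would probably store the pair index alongside each width.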
5. The method of claim 1 or 2, characterized in that the method for identifying spaces in a document further comprises:
identifying the specific number of spaces existing between adjacent characters:
taking the set of gap width values in the basic unit that were determined to be greater than the first space threshold as
an input set;
processing the input set by the space threshold calculation method, and taking the resulting space threshold as a second space
threshold;
judging in turn whether each gap width value in the input set is greater than the second space threshold: if it is greater than the second
space threshold, determining that two or more spaces exist between the adjacent characters; if it is not greater than the second space threshold, determining that
one space exists between the adjacent characters.
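The second pass of claim 5 can be sketched in Python as shown below. The function name `count_spaces`, the integer encoding of the result, and the `threshold_fn` parameter are assumptions made for illustration. The example passes the built-in `min` purely as a placeholder for the space threshold calculation method, which it does not implement.

```python
from typing import Callable, List, Sequence


def count_spaces(
    gap_widths: Sequence[float],
    first_threshold: float,
    threshold_fn: Callable[[Sequence[float]], float],
) -> List[int]:
    """Per gap: 0 = no space, 1 = one space, 2 = two or more spaces."""
    # Input set of the second pass: only gaps already judged to be spaces.
    space_gaps = [w for w in gap_widths if w > first_threshold]
    if not space_gaps:
        return [0] * len(gap_widths)
    second_threshold = threshold_fn(space_gaps)
    result: List[int] = []
    for w in gap_widths:
        if w <= first_threshold:
            result.append(0)   # not a space at all
        elif w > second_threshold:
            result.append(2)   # two or more spaces
        else:
            result.append(1)   # exactly one space
    return result


# `min` is only a placeholder threshold method for this example.
print(count_spaces([1.0, 5.0, 1.0, 10.0, 5.0], 1.0, min))
```

Reusing the same threshold method on the narrower input set is what lets the second threshold adapt to the width of a single space in that basic unit, rather than relying on a fixed space width.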
6. A system for identifying spaces in a document, characterized by comprising:
a collecting unit, configured to take lines or paragraphs in the document as basic units, collect the gap width values between all
adjacent characters in each basic unit, and obtain an initial gap width set corresponding to each basic unit;
a processing unit, configured to take the initial gap width set as an input set, process the input set by a space threshold calculation method,
and take the resulting space threshold as a first space threshold;
a judging unit, configured to judge in turn whether the gap width value between each pair of adjacent characters in the basic unit is greater than the
first space threshold: if it is greater than the first space threshold, determine that a space exists between the adjacent characters; if it is not greater than the
first space threshold, determine that no space exists between the adjacent characters.
7. The system of claim 6, characterized in that:
the space threshold calculation method comprises:
calculating the mathematical expectation and the standard deviation of the input set, and calculating the ratio of the standard deviation to the mathematical expectation;
judging whether the ratio is less than a set threshold:
if the ratio is less than the set threshold, taking the maximum gap width value in the input set as the space threshold, and
outputting the space threshold;
if the ratio is not less than the set threshold, calculating the sum of the mathematical expectation and three times the standard deviation to obtain a calculation result,
deleting from the input set all gap width values greater than the calculation result, and
processing the input set after deletion by the space threshold calculation method again.
8. The system of claim 6, characterized in that the collecting unit is further configured to:
if any of the gap width values is less than zero, delete the gap width values that are less than zero to obtain
the initial gap width set.
9. The system of claim 6, characterized in that the collecting unit is configured to:
calculate the gap width value between adjacent characters according to the origin positions of the adjacent characters.
10. The system of claim 6 or 7, characterized in that the system for identifying spaces in a document is further configured to:
identify the specific number of spaces existing between adjacent characters:
take the set of gap width values in the basic unit that were determined to be greater than the first space threshold as
an input set;
process the input set by the space threshold calculation method, and take the resulting space threshold as a second space
threshold;
judge in turn whether each gap width value in the input set is greater than the second space threshold: if it is greater than the second
space threshold, determine that two or more spaces exist between the adjacent characters; if it is not greater than the second space threshold, determine that
one space exists between the adjacent characters.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610843703.XA CN106649213B (en) | 2016-09-22 | 2016-09-22 | Method and system for identifying spaces in document |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610843703.XA CN106649213B (en) | 2016-09-22 | 2016-09-22 | Method and system for identifying spaces in document |
Publications (2)
Publication Number | Publication Date |
---|---|
CN106649213A (en) | 2017-05-10 |
CN106649213B (en) | 2019-08-20 |
Family
ID=58853165
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201610843703.XA Active CN106649213B (en) | Method and system for identifying spaces in document | 2016-09-22 | 2016-09-22 |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106649213B (en) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109325215A (en) * | 2018-12-04 | 2019-02-12 | 万兴科技股份有限公司 | The output method and device of Word text |
CN109582934A (en) * | 2018-12-04 | 2019-04-05 | 万兴科技股份有限公司 | The conversion method and device of format document |
CN112699634A (en) * | 2020-12-28 | 2021-04-23 | 掌阅科技股份有限公司 | Typesetting processing method of electronic book, electronic equipment and storage medium |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101901333A (en) * | 2009-05-25 | 2010-12-01 | 汉王科技股份有限公司 | Method for segmenting word in text image and identification device using same |
CN101901348A (en) * | 2010-06-29 | 2010-12-01 | 北京捷通华声语音技术有限公司 | Normalization based handwriting identifying method and identifying device |
CN101980185A (en) * | 2010-10-29 | 2011-02-23 | 方正国际软件有限公司 | Method and system for removing spaces from text copied from double-layer electronic file |
US20150033185A1 (en) * | 2013-07-29 | 2015-01-29 | Fujitsu Limited | Non-transitory computer-readable medium storing selected character specification program, selected character specification method, and selected character specification device |
Non-Patent Citations (2)
Title |
---|
DG LEE et al.: "Automatic Word Spacing Using Probabilistic Models Based on Character n-grams", IEEE INTELLIGENT SYSTEMS * |
南玉刚 (Nan Yugang): "Quickly Deleting Redundant Spaces" (快速删除多余的空格), 《电脑迷》 * |
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109325215A (en) * | 2018-12-04 | 2019-02-12 | 万兴科技股份有限公司 | The output method and device of Word text |
CN109582934A (en) * | 2018-12-04 | 2019-04-05 | 万兴科技股份有限公司 | The conversion method and device of format document |
CN109325215B (en) * | 2018-12-04 | 2023-02-10 | 万兴科技股份有限公司 | Word text output method and device |
CN109582934B (en) * | 2018-12-04 | 2023-02-10 | 万兴科技股份有限公司 | Format document conversion method and device |
CN112699634A (en) * | 2020-12-28 | 2021-04-23 | 掌阅科技股份有限公司 | Typesetting processing method of electronic book, electronic equipment and storage medium |
CN112699634B (en) * | 2020-12-28 | 2022-05-24 | 掌阅科技股份有限公司 | Typesetting processing method of electronic book, electronic equipment and storage medium |
Also Published As
Publication number | Publication date |
---|---|
CN106649213B (en) | 2019-08-20 |
Similar Documents
Publication | Title | Publication Date |
---|---|---|
CN108288156B (en) | Block chain transaction storage and queuing method | |
CN110874530B (en) | Keyword extraction method, keyword extraction device, terminal equipment and storage medium | |
US11829401B2 (en) | Method for table extraction from journal literature based on text state characteristics | |
CN106649213A (en) | Method and system for identifying spaces in document | |
CN105550359B (en) | Webpage sorting method and device based on vertical search and server | |
US20120066587A1 (en) | Apparatus and Method for Text Extraction | |
CN101833546A (en) | Method and device for extracting form from portable electronic document | |
CN112329548A (en) | Document chapter segmentation method and device and storage medium | |
US6144963A (en) | Apparatus and method for the frequency displaying of documents | |
CN108804472A (en) | A kind of webpage content extraction method, device and server | |
US20130218913A1 (en) | Parsing tables by probabilistic modeling of perceptual cues | |
CN110688998A (en) | Bill identification method and device | |
CN107704341A (en) | File access pattern method, apparatus and electronic equipment | |
US20160210372A1 (en) | Method and system for obtaining knowledge point implicit relationship | |
CN110175155A (en) | A kind of method and system of file duplicate removal processing | |
US10755033B1 (en) | Digital content editing and publication tools | |
CN106980607B (en) | Paragraph recognition methods, device and terminal device | |
CN112818937A (en) | Excel file identification method and device, electronic equipment and readable storage medium | |
CN106933783A (en) | A kind of method and device on the intelligent extraction date from text | |
KR20140031269A (en) | Method and device for determining font | |
JPH06203020A (en) | Method an device for recognizing and generating text format | |
CN107544949B (en) | Template merging method and device | |
US11100280B2 (en) | Test case consolidator | |
CN110457272B (en) | Bill batch processing method and device | |
WO2022264207A1 (en) | Data processing device, data processing method, and program |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||