CN106649213B - Method and system for recognizing spaces in a document - Google Patents

Method and system for recognizing spaces in a document

Info

Publication number
CN106649213B
Authority
CN
China
Prior art keywords
threshold value
space
space threshold
gap length
value
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201610843703.XA
Other languages
Chinese (zh)
Other versions
CN106649213A (en
Inventor
李云生
晏检平
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
SHENZHEN WONDERSHARE INFORMATION TECHNOLOGY Co Ltd
Original Assignee
SHENZHEN WONDERSHARE INFORMATION TECHNOLOGY Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by SHENZHEN WONDERSHARE INFORMATION TECHNOLOGY Co Ltd
Priority to CN201610843703.XA
Publication of CN106649213A
Application granted
Publication of CN106649213B
Legal status: Active
Anticipated expiration

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/10: Text processing
    • G06F40/12: Use of codes for handling textual entities
    • G06F40/163: Handling of whitespace
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10: File systems; File servers
    • G06F16/17: Details of further file system functions
    • G06F16/178: Techniques for file synchronisation in file systems
    • G06F16/1794: Details of file format conversion
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/20: Natural language analysis
    • G06F40/279: Recognition of textual entities


Abstract

The invention belongs to the field of character recognition and provides a method and system for recognizing spaces in a document. The method comprises: taking a line or paragraph of the document as the basic unit, separately collecting the gap width values between all adjacent characters in each basic unit, and obtaining the initial gap width set corresponding to each basic unit; taking the initial gap width set as the input set, processing the input set with a space threshold calculation method, and taking the resulting space threshold as the first space threshold; and judging in turn whether the gap width value between each pair of adjacent characters in the basic unit is greater than the first space threshold: if it is greater than the first space threshold, a space is determined to exist between the adjacent characters; if it is not greater than the first space threshold, no space is determined to exist between the adjacent characters. Because no fixed space width is used to judge spaces, space recognition accuracy in the document is improved and the recognition result is more accurate and reliable.

Description

Method and system for recognizing spaces in a document
Technical field
The invention belongs to the field of information recognition technology, and in particular relates to a method and system for recognizing spaces in a document.
Background
When a format file such as a PDF (Portable Document Format) file is converted into a document of another format (such as WORD or TXT), the characters in the document must be recognized. In particular, the spaces between adjacent characters must be determined so that words and sentences can be segmented correctly.
In a document, a gap between adjacent characters can have several causes, for example: a space is present, the layout sets a character spacing, the text has a character-spacing adjustment applied, or the characters belong to independent text objects.
In the prior art, the minimum adjacent-character spacing of the whole document is used as a baseline: the full-text minimum spacing is subtracted from every adjacent-character spacing, and it is then determined whether the result is less than a predetermined space width. However, the predetermined space width in the prior art cannot itself be determined reliably, and the alignment mode also affects the space width. Both lead to low space recognition accuracy and inaccurate recognition results.
Summary of the invention
In view of this, embodiments of the present invention provide a method and system for recognizing spaces in a document, to solve the prior-art problems of low space recognition accuracy and inaccurate recognition results.
In a first aspect, a method for recognizing spaces in a document is provided, comprising:
taking a line or paragraph of the document as the basic unit, separately collecting the gap width values between all adjacent characters in each basic unit, and obtaining the initial gap width set corresponding to each basic unit;
taking the initial gap width set as the input set, processing the input set with a space threshold calculation method, and taking the resulting space threshold as the first space threshold;
judging in turn whether the gap width value between each pair of adjacent characters in the basic unit is greater than the first space threshold: if it is greater than the first space threshold, determining that a space exists between the adjacent characters; if it is not greater than the first space threshold, determining that no space exists between the adjacent characters.
Further, the space threshold calculation method comprises:
calculating the mathematical expectation (mean) and the standard deviation of the input set, and calculating the ratio of the standard deviation to the mathematical expectation;
judging whether the ratio is less than a set threshold:
if the ratio is less than the set threshold, taking the maximum gap width value in the input set as the space threshold, and outputting the space threshold;
if the ratio is not less than the set threshold, calculating the sum of the mathematical expectation and three times the standard deviation to obtain a calculation result, deleting from the input set all gap width values greater than the calculation result, and processing the input set after deletion with the space threshold calculation method again.
Further, separately collecting the gap width values between all adjacent characters in each basic unit and obtaining the initial gap width set corresponding to each basic unit further comprises:
if any of the gap width values is less than zero, deleting the gap width values that are less than zero to obtain the initial gap width set.
Further, collecting the gap width values between all adjacent characters in each basic unit comprises:
calculating the gap width value between adjacent characters according to the origin positions of the adjacent characters.
Further, the method for recognizing spaces in a document further comprises:
identifying the specific number of spaces between adjacent characters:
taking the set of gap width values in the basic unit that were judged to be greater than the first space threshold as the input set;
processing the input set with the space threshold calculation method, and taking the resulting space threshold as the second space threshold;
judging in turn whether each gap width value in the input set is greater than the second space threshold: if it is greater than the second space threshold, determining that two or more spaces exist between the adjacent characters; if it is not greater than the second space threshold, determining that one space exists between the adjacent characters.
In a second aspect, a system for recognizing spaces in a document is provided, comprising:
an acquisition unit, configured to take a line or paragraph of the document as the basic unit, separately collect the gap width values between all adjacent characters in each basic unit, and obtain the initial gap width set corresponding to each basic unit;
a processing unit, configured to take the initial gap width set as the input set, process the input set with a space threshold calculation method, and take the resulting space threshold as the first space threshold;
a judgment unit, configured to judge in turn whether the gap width value between each pair of adjacent characters in the basic unit is greater than the first space threshold: if it is greater than the first space threshold, determine that a space exists between the adjacent characters; if it is not greater than the first space threshold, determine that no space exists between the adjacent characters.
Further, the space threshold calculation method comprises:
calculating the mathematical expectation (mean) and the standard deviation of the input set, and calculating the ratio of the standard deviation to the mathematical expectation;
judging whether the ratio is less than a set threshold:
if the ratio is less than the set threshold, taking the maximum gap width value in the input set as the space threshold, and outputting the space threshold;
if the ratio is not less than the set threshold, calculating the sum of the mathematical expectation and three times the standard deviation to obtain a calculation result, deleting from the input set all gap width values greater than the calculation result, and processing the input set after deletion with the space threshold calculation method again.
Further, the acquisition unit is further configured to:
if any of the gap width values is less than zero, delete the gap width values that are less than zero to obtain the initial gap width set.
Further, the acquisition unit is configured to:
calculate the gap width value between adjacent characters according to the origin positions of the adjacent characters.
Further, the system for recognizing spaces in a document is further configured to:
identify the specific number of spaces between adjacent characters:
take the set of gap width values in the basic unit that were judged to be greater than the first space threshold as the input set;
process the input set with the space threshold calculation method, and take the resulting space threshold as the second space threshold;
judge in turn whether each gap width value in the input set is greater than the second space threshold: if it is greater than the second space threshold, determine that two or more spaces exist between the adjacent characters; if it is not greater than the second space threshold, determine that one space exists between the adjacent characters.
In the embodiments of the present invention, the collected gap width set is taken as the input set, and the first space threshold is calculated adaptively by the space threshold calculation method in order to judge whether a space exists. Because no fixed space width is used to judge spaces, space recognition accuracy in the document is improved and the recognition result is more accurate and reliable.
Brief description of the drawings
To describe the technical solutions in the embodiments of the present invention more clearly, the drawings needed for the description of the embodiments or of the prior art are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention; those of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is an implementation flowchart of the method for recognizing spaces in a document provided by an embodiment of the present invention;
Fig. 2 is a schematic diagram of setting the alignment mode to distributed alignment in Word, provided by an embodiment of the present invention;
Fig. 3 is an implementation flowchart of the method for recognizing spaces in a document provided by another embodiment of the present invention;
Fig. 4 is an implementation flowchart of the method for identifying the specific number of spaces provided by an embodiment of the present invention;
Fig. 5 is a structural block diagram of the system for recognizing spaces in a document provided by an embodiment of the present invention.
Detailed description of the embodiments
In the following description, specific details such as particular system structures and techniques are set forth for purposes of illustration rather than limitation, to provide a thorough understanding of the embodiments of the present invention. However, it will be clear to those skilled in the art that the present invention may also be practiced in other embodiments without these specific details. In other instances, detailed descriptions of well-known systems, devices, circuits, and methods are omitted so that unnecessary detail does not obscure the description of the invention.
In the embodiments of the present invention, a line or paragraph of the document is taken as the basic unit, and the gap width values between all adjacent characters in each basic unit are collected separately to obtain the initial gap width set corresponding to each basic unit. The initial gap width set is taken as the input set and processed with the space threshold calculation method, and the resulting space threshold is taken as the first space threshold. It is then judged in turn whether the gap width value between each pair of adjacent characters in the basic unit is greater than the first space threshold: if it is greater than the first space threshold, a space is determined to exist between the adjacent characters; if it is not greater than the first space threshold, no space is determined to exist between the adjacent characters.
Fig. 1 shows the implementation flow of a method for recognizing spaces in a document provided by an embodiment of the present invention, detailed as follows:
In S101, a line or paragraph of the document is taken as the basic unit, and the gap width values between all adjacent characters in each basic unit are collected separately to obtain the initial gap width set corresponding to each basic unit.
In practice, taking the common office document format Word as an example, spaces or character spacing in Word are generally adjusted adaptively according to the alignment mode, such as the common justified alignment or distributed alignment. Under distributed alignment, the characters of the last line of a paragraph are spread evenly to fill the whole line, as shown in Fig. 2, so the character gaps of that line are significantly larger than those of other lines. Taking the paragraph as the basic unit, on the other hand, reduces the amount of computation and makes recognition more efficient. The present invention therefore processes the document with a line or paragraph as the basic unit, to obtain a more accurate and reliable recognition result.
In S102, the initial gap width set is taken as the input set, the input set is processed with the space threshold calculation method, and the resulting space threshold is taken as the first space threshold.
The space threshold calculation method processes the input set, adaptively calculates the gap width values of all the spaces in each input set, and selects the optimal space gap width value as the space threshold used to judge whether a space exists between adjacent characters.
In S103, it is judged in turn whether the gap width value between each pair of adjacent characters in the basic unit is greater than the first space threshold: if it is greater than the first space threshold, a space is determined to exist between the adjacent characters; if it is not greater than the first space threshold, no space is determined to exist between the adjacent characters.
Judging whether the gap width value between characters is greater than the first space threshold can only determine whether a space exists. The presence of a space between adjacent characters covers several cases, such as one space, or two or more spaces.
Further, in S101, separately collecting the gap width values between all adjacent characters in each basic unit and obtaining the initial gap width set corresponding to each basic unit further comprises:
if any of the gap width values is less than zero, deleting the gap width values that are less than zero to obtain the initial gap width set.
Because characters in a document may cross or overlap, the gap between adjacent characters can be negative. Processing such negative values directly would cause recognition errors and make the result inaccurate, so negative values must be deleted first during recognition.
Further, in S101, collecting the gap width values between all adjacent characters in each basic unit comprises:
calculating the gap width value between adjacent characters according to the origin positions of the adjacent characters.
When calculating the gap width value between adjacent characters, so that it matches as closely as possible the gap a user would observe, the embodiment of the present invention starts from the origin positions of the adjacent characters and also takes into account factors such as PDF page rotation, text object rotation, and the writing direction of the characters to arrive at the final gap width value.
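As an illustration of S101, the following is a minimal sketch of collecting the initial gap width set for one line. It assumes horizontal, unrotated, left-to-right text; the `Glyph` type and its `origin_x` and `advance` fields are hypothetical names introduced here, and the source does not give the exact formula or say how page rotation, text object rotation, or writing direction enter the calculation.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Glyph:
    """Hypothetical glyph record; the field names are illustrative only."""
    origin_x: float  # x coordinate of the glyph origin
    advance: float   # horizontal advance width of the glyph


def initial_gap_widths(line: List[Glyph]) -> List[float]:
    """Gap widths between adjacent glyphs of one basic unit, negatives dropped.

    Assumes horizontal, unrotated, left-to-right text (an assumption of this
    sketch, not something the source specifies).
    """
    gaps = []
    for left, right in zip(line, line[1:]):
        # Gap = start of the next glyph minus the end of the current glyph.
        gap = right.origin_x - (left.origin_x + left.advance)
        # Crossing or overlapping glyphs give a negative gap; those are deleted.
        if gap >= 0:
            gaps.append(gap)
    return gaps
```

A real implementation would first map glyph origins through the page and text-object transforms so that the gap is measured along the writing direction.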
Further, in S102:
The space threshold calculation method comprises:
calculating the mathematical expectation (mean) and the standard deviation of the input set, and calculating the ratio of the standard deviation to the mathematical expectation;
judging whether the ratio is less than a set threshold:
if the ratio is less than the set threshold, taking the maximum gap width value in the input set as the space threshold, and outputting the space threshold;
if the ratio is not less than the set threshold, calculating the sum of the mathematical expectation and three times the standard deviation to obtain a calculation result, deleting from the input set all gap width values greater than the calculation result, and processing the input set after deletion with the space threshold calculation method again.
Calculating the sum of the mathematical expectation and three times the standard deviation means: multiplying the calculated standard deviation by three, and then adding the product to the calculated mathematical expectation.
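In symbols, for an input set of gap width values g_1, ..., g_n, the quantities used by the space threshold calculation method can be written as follows. The symbols are introduced here for illustration only, and the population form of the standard deviation is an assumption, since the source does not say whether the population or the sample form is meant.

```latex
\mu = \frac{1}{n}\sum_{i=1}^{n} g_i, \qquad
\sigma = \sqrt{\frac{1}{n}\sum_{i=1}^{n} \left(g_i - \mu\right)^2}, \qquad
r = \frac{\sigma}{\mu}, \qquad
c = \mu + 3\sigma
```

Here r is the ratio compared against the set threshold, and c is the calculation result: every g_i greater than c is deleted before the method is applied again.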
The set threshold can be chosen according to the user's actual needs, for example 0.1.
When the ratio is judged to be not less than the set threshold, values are removed from the set, and the reduced set is processed again by the space threshold calculation method, until the ratio is less than the set threshold and the space threshold is output.
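The iteration described above can be sketched as follows. This is a minimal sketch, not the patented implementation: the function name `space_threshold` is hypothetical, the population standard deviation is assumed, and two guards are added that the source does not describe. One guard handles a zero mean. The other stops when a pass deletes nothing, because with ten or fewer values no value can exceed the mean by more than three population standard deviations, so the loop as literally described would never end on such a set.

```python
from statistics import fmean, pstdev
from typing import Sequence


def space_threshold(gaps: Sequence[float], set_threshold: float = 0.1) -> float:
    """Adaptive space threshold for one input set of gap widths (sketch)."""
    values = list(gaps)
    if not values:
        raise ValueError("input set is empty")
    while True:
        mean = fmean(values)
        std = pstdev(values)  # population standard deviation (assumed)
        # Guard (assumption): a zero mean means every gap is zero.
        if mean == 0 or std / mean < set_threshold:
            # Homogeneous set: its largest gap becomes the threshold.
            return max(values)
        cutoff = mean + 3 * std
        kept = [g for g in values if g <= cutoff]
        # Guard (assumption): if nothing was deleted, another pass would
        # see the same set, so fall back to the current maximum.
        if len(kept) == len(values):
            return max(values)
        values = kept
```

For example, a line with twenty gaps of width 1.0 and one gap of width 50.0 loses the 50.0 on the first pass, after which the ratio is zero and the threshold is 1.0, so only the wide gap is later judged to be a space.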
Further, the method for recognizing spaces in a document further comprises:
identifying the specific number of spaces between adjacent characters:
taking the set of gap width values in the basic unit that were judged to be greater than the first space threshold as the input set;
processing the input set with the space threshold calculation method, and taking the resulting space threshold as the second space threshold;
judging in turn whether each gap width value in the input set is greater than the second space threshold: if it is greater than the second space threshold, determining that two or more spaces exist between the adjacent characters; if it is not greater than the second space threshold, determining that one space exists between the adjacent characters.
In practice, it is sometimes unnecessary to determine every specific number of spaces between adjacent characters. In that case the gap width values can be divided, by the number of spaces they contain, into a few classes of sets for processing, for example into three sets:
First set: contains no space; second set: contains one space; third set: contains two or more spaces.
When processing these three sets, only the gap width values containing no space and those containing one space need to be determined; the remaining gap width values are placed directly in the third set (two or more spaces), which avoids a large amount of unnecessary computation.
If the specific number of spaces needs to be determined further, for example to distinguish three, four, or more spaces, the above steps only need to be repeated: after the gap width values whose number of spaces has already been determined are removed from the initial gap width set, the remaining values are processed by the space threshold calculation method and the specific number of spaces is judged again. This continues until the specific number of spaces between the adjacent characters of interest has been determined, or until the specific number of spaces between all adjacent characters has been determined, at which point the steps end and the result is output.
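One possible reading of this repeated procedure is sketched below. The function `count_spaces`, its `max_count` cap, and the `threshold_fn` parameter are hypothetical; `threshold_fn` stands for whatever implements the space threshold calculation method, and the source does not prescribe this exact loop.

```python
from typing import Callable, List, Sequence


def count_spaces(
    gaps: Sequence[float],
    threshold_fn: Callable[[List[float]], float],
    max_count: int = 4,
) -> List[int]:
    """Estimate the number of spaces per gap by repeated thresholding.

    A result equal to max_count means "max_count or more" (sketch only).
    """
    counts = [0] * len(gaps)
    remaining = list(range(len(gaps)))  # indices whose count is still open
    for level in range(max_count):
        if not remaining:
            break
        # Threshold computed only over gaps not yet settled.
        threshold = threshold_fn([gaps[i] for i in remaining])
        still_open = []
        for i in remaining:
            if gaps[i] > threshold:
                # Wider than this level's threshold: at least level + 1 spaces.
                counts[i] = level + 1
                still_open.append(i)
        remaining = still_open
    return counts
```

Each pass settles the gaps at or below the current threshold and carries only the wider gaps forward, which mirrors removing the already-determined values before recalculating.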
Fig. 3 is an implementation flowchart of the method for recognizing spaces in a document according to an embodiment of the present invention. In the embodiment corresponding to Fig. 3 the set threshold is 0.1, and the detailed steps are as follows:
1. Taking a line or paragraph of the document to be recognized as the basic unit, collect the gap width values between all adjacent characters in each basic unit of the document.
2. Delete the gap width values that are less than zero, form a set from the non-negative values in each basic unit, and obtain the initial gap width set corresponding to each basic unit.
3. Take each initial gap width set in turn as the input set.
4. Calculate the mathematical expectation and the standard deviation of the input set, and calculate the ratio of the standard deviation to the mathematical expectation.
5. Judge whether the ratio of the standard deviation to the mathematical expectation is less than 0.1.
6. If the ratio of the standard deviation to the mathematical expectation is less than 0.1, output the maximum gap width value in the input set as the space threshold, and take the output space threshold as the first space threshold;
if the ratio of the standard deviation to the mathematical expectation is not less than 0.1, calculate the sum of the mathematical expectation and three times the standard deviation to obtain a calculation result, delete from the input set all gap width values greater than the calculation result, and feed the input set after deletion back into step 4.
7. Judge in turn whether the gap width value between each pair of adjacent characters in each basic unit is greater than the first space threshold corresponding to that basic unit.
8. If it is greater than the first space threshold, determine that a space exists between the adjacent characters;
if it is not greater than the first space threshold, determine that no space exists between the adjacent characters.
9. Finally, output the recognition result indicating whether a space exists between each pair of adjacent characters in the document.
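The flow of Fig. 3 can be sketched as a loop over basic units, each with its own first space threshold. In this sketch the threshold calculation of steps 4 to 6 is passed in as the callable `threshold_fn`; that parameter, the function name `recognize_spaces`, and the dictionary layout keyed by a unit identifier are all assumptions made for illustration.

```python
from typing import Callable, Dict, List, Sequence


def recognize_spaces(
    units: Dict[str, Sequence[float]],
    threshold_fn: Callable[[List[float]], float],
) -> Dict[str, List[bool]]:
    """For each basic unit, flag every non-negative gap that is a space."""
    result = {}
    for unit_id, raw_gaps in units.items():
        # Step 2: delete negative gap widths to form the initial set.
        gaps = [g for g in raw_gaps if g >= 0]
        if not gaps:
            result[unit_id] = []
            continue
        # Steps 3-6: compute this unit's own first space threshold.
        first_threshold = threshold_fn(gaps)
        # Steps 7-8: a gap strictly greater than the threshold is a space.
        result[unit_id] = [g > first_threshold for g in gaps]
    return result
```

Because the threshold is computed per unit, a line stretched by distributed alignment is compared only against its own gaps rather than against the rest of the document.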
Fig. 4 is an implementation flowchart of the method for identifying the specific number of spaces provided by an embodiment of the present invention. In the embodiment corresponding to Fig. 4, the gap width values are divided by number of spaces into three sets for processing (first set: contains no space; second set: contains one space; third set: contains two or more spaces), the set threshold is 0.1, and the detailed steps are as follows:
1. Input the recognition result indicating whether a space exists between each pair of adjacent characters in the document.
2. In each basic unit, take the set of gap width values judged to contain a space as the input set corresponding to that basic unit.
3. Calculate the mathematical expectation and the standard deviation of the input set, and calculate the ratio of the standard deviation to the mathematical expectation.
4. Judge whether the ratio of the standard deviation to the mathematical expectation is less than 0.1.
5. If the ratio of the standard deviation to the mathematical expectation is less than 0.1, output the maximum gap width value in the input set as the space threshold, and take the output space threshold as the second space threshold;
if the ratio of the standard deviation to the mathematical expectation is not less than 0.1, calculate the sum of the mathematical expectation and three times the standard deviation to obtain a calculation result, delete from the input set all gap width values greater than the calculation result, and feed the input set after deletion back into step 3.
6. Judge in turn whether each gap width value judged to contain a space in each basic unit is greater than the second space threshold corresponding to that basic unit.
7. If it is greater than the second space threshold, output that the gap value belongs to the third set, meaning two or more spaces exist between the adjacent characters;
if it is not greater than the second space threshold, output that the gap value belongs to the second set, meaning one space exists between the adjacent characters.
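Once both thresholds of a basic unit are known, assigning each gap to one of the three sets is a pair of comparisons. The sketch below assumes the thresholds were computed beforehand; the function name `classify_gaps` and the integer labels 0, 1, and 2 are illustrative choices, not taken from the source.

```python
from typing import List, Sequence


def classify_gaps(
    gaps: Sequence[float], first_threshold: float, second_threshold: float
) -> List[int]:
    """Map each gap to 0 (no space), 1 (one space), or 2 (two or more)."""
    classes = []
    for gap in gaps:
        if gap <= first_threshold:
            classes.append(0)  # first set: no space
        elif gap <= second_threshold:
            classes.append(1)  # second set: one space
        else:
            classes.append(2)  # third set: two or more spaces
    return classes
```

The second threshold is derived only from gaps that already exceeded the first one, so it is expected to be at least as large as the first threshold.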
In the embodiments of the present invention, a line or paragraph is taken as the basic unit; the gap width values between all valid adjacent characters in each basic unit of the document are calculated; the space threshold calculation method performs an adaptive space threshold calculation on the gap width values to obtain the space threshold best suited to the basic unit; and the space threshold is then used to judge whether a space exists. The space threshold calculation method can also be used to further identify the specific number of spaces between adjacent characters. Because the space threshold is calculated adaptively per basic unit, the method avoids the prior-art practice of judging the presence of a space with a fixed preset space width, so space recognition accuracy in the document is greatly improved and the recognition result is more accurate and reliable.
Corresponding to the method for recognizing spaces in a document described in the foregoing embodiments, Fig. 5 shows a structural block diagram of the system for recognizing spaces in a document provided by an embodiment of the present invention.
Referring to Fig. 5, the system includes:
an acquisition unit 51, configured to take a line or paragraph of the document as the basic unit, separately collect the gap width values between all adjacent characters in each basic unit, and obtain the initial gap width set corresponding to each basic unit;
a processing unit 52, configured to take the initial gap width set as the input set, process the input set with the space threshold calculation method, and take the resulting space threshold as the first space threshold;
a judgment unit 53, configured to judge in turn whether the gap width value between each pair of adjacent characters in the basic unit is greater than the first space threshold: if it is greater than the first space threshold, determine that a space exists between the adjacent characters; if it is not greater than the first space threshold, determine that no space exists between the adjacent characters.
Further, in the processing unit 52:
The space threshold calculation method comprises:
calculating the mathematical expectation (mean) and the standard deviation of the input set, and calculating the ratio of the standard deviation to the mathematical expectation;
judging whether the ratio is less than a set threshold:
if the ratio is less than the set threshold, taking the maximum gap width value in the input set as the space threshold, and outputting the space threshold;
if the ratio is not less than the set threshold, calculating the sum of the mathematical expectation and three times the standard deviation to obtain a calculation result, deleting from the input set all gap width values greater than the calculation result, and processing the input set after deletion with the space threshold calculation method again.
Further, the acquisition unit 51 is further configured to:
if any of the gap width values is less than zero, delete the gap width values that are less than zero to obtain the initial gap width set.
Further, the acquisition unit 51 is configured to:
calculate the gap width value between adjacent characters according to the origin positions of the adjacent characters.
Further, the system for recognizing spaces in a document is further configured to:
identify the specific number of spaces between adjacent characters:
take the set of gap width values in the basic unit that were judged to be greater than the first space threshold as the input set;
process the input set with the space threshold calculation method, and take the resulting space threshold as the second space threshold;
judge in turn whether each gap width value in the input set is greater than the second space threshold: if it is greater than the second space threshold, determine that two or more spaces exist between the adjacent characters; if it is not greater than the second space threshold, determine that one space exists between the adjacent characters.
The embodiments described above are intended only to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions recorded in the foregoing embodiments may still be modified, or some of their technical features may be replaced by equivalents. Such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present invention, and shall all fall within the protection scope of the present invention.

Claims (8)

1. A method for recognizing spaces in a document, characterized by comprising:
taking a line or a paragraph of the document as a basic unit, collecting the gap width values between all adjacent characters in each basic unit, and obtaining an initial gap width set for each basic unit;
taking the initial gap width set as an input set, processing the input set by a space threshold calculation method, and taking the resulting space threshold as a first space threshold;
judging in turn whether the gap width value between each pair of adjacent characters in the basic unit is greater than the first space threshold: if it is greater than the first space threshold, determining that a space exists between the adjacent characters; if it is not greater than the first space threshold, determining that no space exists between the adjacent characters;
wherein the space threshold calculation method comprises:
calculating the mathematical expectation and the standard deviation of the input set, and calculating the ratio of the standard deviation to the mathematical expectation;
judging whether the ratio is less than a set threshold:
if the ratio is less than the set threshold, taking the maximum gap width value in the input set as the space threshold, and outputting the space threshold;
if the ratio is not less than the set threshold, calculating the sum of the mathematical expectation and three times the standard deviation to obtain a calculation result, deleting from the input set all gap width values greater than the calculation result, and processing the input set after deletion again by the space threshold calculation method.
2. The method according to claim 1, characterized in that collecting the gap width values between all adjacent characters in each basic unit and obtaining the initial gap width set for each basic unit further comprises:
if any of the gap width values is less than zero, deleting the gap width values that are less than zero to obtain the initial gap width set.
3. The method according to claim 1, characterized in that collecting the gap width values between all adjacent characters in each basic unit comprises:
calculating the gap width value between adjacent characters according to the origin positions of the adjacent characters.
4. The method according to claim 1, characterized in that the method for recognizing spaces in a document further comprises:
identifying the specific number of spaces between adjacent characters:
taking the set of gap width values in the basic unit that were judged to be greater than the first space threshold as the input set;
processing the input set by the space threshold calculation method, and taking the resulting space threshold as a second space threshold;
judging in turn whether each gap width value in the input set is greater than the second space threshold: if it is greater than the second space threshold, determining that two or more spaces exist between the adjacent characters; if it is not greater than the second space threshold, determining that one space exists between the adjacent characters.
5. A system for recognizing spaces in a document, characterized by comprising:
an acquisition unit, configured to take a line or a paragraph of the document as a basic unit, collect the gap width values between all adjacent characters in each basic unit, and obtain an initial gap width set for each basic unit;
a processing unit, configured to take the initial gap width set as an input set, process the input set by a space threshold calculation method, and take the resulting space threshold as a first space threshold;
a judging unit, configured to judge in turn whether the gap width value between each pair of adjacent characters in the basic unit is greater than the first space threshold: if it is greater than the first space threshold, determine that a space exists between the adjacent characters; if it is not greater than the first space threshold, determine that no space exists between the adjacent characters;
wherein the space threshold calculation method comprises:
calculating the mathematical expectation and the standard deviation of the input set, and calculating the ratio of the standard deviation to the mathematical expectation;
judging whether the ratio is less than a set threshold:
if the ratio is less than the set threshold, taking the maximum gap width value in the input set as the space threshold, and outputting the space threshold;
if the ratio is not less than the set threshold, calculating the sum of the mathematical expectation and three times the standard deviation to obtain a calculation result, deleting from the input set all gap width values greater than the calculation result, and processing the input set after deletion again by the space threshold calculation method.
6. The system according to claim 5, characterized in that the acquisition unit is further configured to:
if any of the gap width values is less than zero, delete the gap width values that are less than zero to obtain the initial gap width set.
7. The system according to claim 5, characterized in that the acquisition unit is configured to:
calculate the gap width value between adjacent characters according to the origin positions of the adjacent characters.
8. The system according to claim 5, characterized in that the system for recognizing spaces in a document is further configured to:
identify the specific number of spaces between adjacent characters:
take the set of gap width values in the basic unit that were judged to be greater than the first space threshold as the input set;
process the input set by the space threshold calculation method, and take the resulting space threshold as a second space threshold;
judge in turn whether each gap width value in the input set is greater than the second space threshold: if it is greater than the second space threshold, determine that two or more spaces exist between the adjacent characters; if it is not greater than the second space threshold, determine that one space exists between the adjacent characters.
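The method of claims 1 to 3 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the `Glyph` record, the gap formula (next origin minus the current origin plus advance width), the population standard deviation, the `ratio_limit` default of 0.2, the early exit when trimming removes nothing, and the zero threshold for an empty set are all hypothetical choices that the claims leave open.

```python
from dataclasses import dataclass
from statistics import fmean, pstdev


@dataclass
class Glyph:
    char: str
    x: float      # origin x of the character (assumed layout model)
    width: float  # advance width of the character (assumed layout model)


def space_threshold(gaps, ratio_limit=0.2):
    """Trim gaps above mean + 3*std until std/mean drops below ratio_limit."""
    values = list(gaps)
    while values:
        mean, std = fmean(values), pstdev(values)
        if mean == 0 or std / mean < ratio_limit:
            return max(values)  # spread is small: largest remaining gap
        cutoff = mean + 3 * std
        kept = [g for g in values if g <= cutoff]
        if len(kept) == len(values):
            return max(values)  # assumed guard: nothing left to trim
        values = kept
    return 0.0  # assumed result for an empty input set


def rebuild_line(glyphs, ratio_limit=0.2):
    """Return the text of one basic unit with recognized spaces inserted."""
    # Gap between each pair of adjacent characters (assumed formula).
    gaps = [b.x - (a.x + a.width) for a, b in zip(glyphs, glyphs[1:])]
    # Claim 2: gap widths below zero are left out of the initial set.
    initial = [g for g in gaps if g >= 0]
    first = space_threshold(initial, ratio_limit)
    out = [glyphs[0].char] if glyphs else []
    for gap, glyph in zip(gaps, glyphs[1:]):
        if gap > first:  # wider than the first space threshold: a space
            out.append(" ")
        out.append(glyph.char)
    return "".join(out)
```

Passing the glyphs of one line or paragraph to `rebuild_line` yields its text with a single space inserted wherever the gap exceeds the first space threshold; counting multiple spaces, as in claim 4, would rerun `space_threshold` on the gaps that exceeded it.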
CN201610843703.XA 2016-09-22 2016-09-22 Space recognition methods and system in a kind of document Active CN106649213B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610843703.XA CN106649213B (en) 2016-09-22 2016-09-22 Space recognition methods and system in a kind of document

Publications (2)

Publication Number Publication Date
CN106649213A CN106649213A (en) 2017-05-10
CN106649213B true CN106649213B (en) 2019-08-20

Family

ID=58853165

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610843703.XA Active CN106649213B (en) 2016-09-22 2016-09-22 Space recognition methods and system in a kind of document

Country Status (1)

Country Link
CN (1) CN106649213B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109325215B (en) * 2018-12-04 2023-02-10 万兴科技股份有限公司 Word text output method and device
CN109582934B (en) * 2018-12-04 2023-02-10 万兴科技股份有限公司 Format document conversion method and device
CN112699634B (en) * 2020-12-28 2022-05-24 掌阅科技股份有限公司 Typesetting processing method of electronic book, electronic equipment and storage medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101901333A (en) * 2009-05-25 2010-12-01 汉王科技股份有限公司 Method for segmenting word in text image and identification device using same
CN101901348A (en) * 2010-06-29 2010-12-01 北京捷通华声语音技术有限公司 Normalization based handwriting identifying method and identifying device
CN101980185A (en) * 2010-10-29 2011-02-23 方正国际软件有限公司 Method and system for removing spaces from text copied from double-layer electronic file

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP6201488B2 (en) * 2013-07-29 2017-09-27 富士通株式会社 Selected character identification program, selected character identification method, and selected character identification device

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Automatic Word Spacing Using Probabilistic Models Based on Character n-grams; DG Lee et al.; IEEE Intelligent Systems; 2007-01-29; Vol. 22, No. 1; pp. 28-35
Quickly Deleting Extra Spaces (快速删除多余的空格); Nan Yugang (南玉刚); 《电脑迷》; 2007-06-15; 2007, No. 12; p. 84

Also Published As

Publication number Publication date
CN106649213A (en) 2017-05-10

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant