CN102194117B - Method and device for detecting page direction of document - Google Patents
- Publication number
- CN102194117B, CN201010119229A
- Authority
- CN (China)
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Expired - Fee Related
Landscapes
- Machine Translation (AREA)
- Document Processing Apparatus (AREA)
- Facsimile Image Signal Circuits (AREA)
Abstract
The invention discloses a method and device for detecting the page orientation of a document. The method comprises the following steps: performing image-text separation on a document page, splitting the resulting text regions into a plurality of text units, and determining the language category attribute of each text unit; obtaining, according to the language category attribute, the writing-direction attribute feature values of the text units and adding them to the corresponding global statistics; obtaining, from each writing-direction attribute feature value in the global statistics, a page orientation and a corresponding confidence for that feature value; and determining the orientation of the document page and its overall confidence from the page orientations and confidences of all the writing-direction attribute feature values.
Description
Technical field
The present invention relates to the field of image processing, and in particular to a method and device for detecting the page orientation of a document.
Background art
Although digital file exchange has gradually become the main means of modern communication, content that has legal effect and requires authentication, such as contracts and negotiable securities, still has to be transmitted and stored as paper documents. Managing such paper documents and transmitting their content requires digitizing them by scanning. In routine office work and centralized document management, the demand for high-volume scanning of paper documents is huge.
In high-volume document scanning, all documents need to end up in the upright orientation. Document orientation can be corrected after scanning in two ways: one is to browse the pages one by one and rotate the images manually; the other is to have a computer program automatically identify the orientation of the document image and rotate it upright. Automatic orientation detection in turn falls into two approaches: one recognizes the text in the document image with OCR (Optical Character Recognition) and infers the page orientation from the result; the other determines the page orientation with feature recognition algorithms.
Patents and publications on document page orientation detection exist both in China and abroad. US Patent No. 5276742 of 1991, "Rapid detection of page orientation", proposed a relationship between the rise and fall (ascenders and descenders) of Roman characters and the writing direction of the characters. In 1998, researchers at Bell Labs proposed a feature associated with the writing direction of Asian scripts in an article on language identification in complex, unoriented, and degraded document images. These patents and publications all concentrate on finding relationships between character or punctuation features in the document page and the writing direction.
However, real scanned document pages often mix complex content, including images, text, digits, and punctuation. With growing international exchange, a page also frequently contains text in several languages, fonts, and sizes.
For such complex document pages, the prior art cannot obtain correct detection results.
Summary of the invention
The present invention aims to provide a method and device for document page orientation detection that solve the prior-art problem that correct detection results cannot be obtained for complex document pages. It also provides a method for computing the confidence of the orientation detection result, which gives a basis for deciding whether the result is usable.
According to one aspect of the invention, a document page orientation detection method is provided, comprising the following steps: performing image-text separation on the document page, splitting the resulting text regions into a plurality of text units, and determining the language category attribute of each text unit; obtaining, according to the language category attribute, the writing-direction attribute feature values of the text units and adding them to the corresponding global statistics; obtaining, from each writing-direction attribute feature value in the global statistics, a page orientation and a corresponding confidence for that feature value; and determining the orientation of the document page and its overall confidence from the page orientations and corresponding confidences of the individual writing-direction attribute feature values.
According to another aspect of the invention, a document page orientation detection device is also provided, comprising: a splitting module, configured to perform image-text separation on the document page, split the resulting text regions into a plurality of text units, and determine the language category attribute of each text unit; a first computing module, configured to obtain, according to the language category attribute, the writing-direction attribute feature values of the text units and add them to the corresponding global statistics; a second computing module, configured to obtain, from each writing-direction attribute feature value in the global statistics, a page orientation and a corresponding confidence for that feature value; and a third computing module, configured to determine the orientation of the document page and its overall confidence from the page orientations and corresponding confidences of the individual writing-direction attribute feature values.
In the present invention, the text regions obtained by image-text separation of the document page are split into text units; a page orientation and a corresponding confidence are then obtained for each writing-direction attribute feature value according to the language category attribute of the text units; finally, the orientation of the document page and its overall confidence are determined from those per-feature orientations and confidences. The page orientation is thus determined automatically, and the overall confidence lets the user decide whether the result is usable, which overcomes the prior-art problem that correct detection results cannot be obtained for complex document pages.
Brief description of the drawings
The drawings described here provide a further understanding of the invention and form part of this application. The illustrative embodiments of the invention and their descriptions explain the invention and do not unduly limit it. In the drawings:
Fig. 1 shows a flowchart of a document page orientation detection method according to an embodiment of the invention;
Fig. 2 shows a schematic diagram of a document page from a Japanese newspaper;
Fig. 3 shows a schematic diagram of a document page whose text units face different directions;
Fig. 4 shows a schematic diagram of a document page according to a preferred embodiment of the invention;
Fig. 5 shows a schematic diagram of a document page orientation detection device according to an embodiment of the invention.
Detailed description of the embodiments
The invention is described in detail below with reference to the drawings and in conjunction with the embodiments.
Fig. 1 shows a flowchart of a document page orientation detection method according to an embodiment of the invention, comprising the following steps:
S102: perform image-text separation on the document page, split the resulting text regions into a plurality of text units, and determine the language category attribute of each text unit;
S104: obtain, according to the language category attribute, the writing-direction attribute feature values of the text units and add them to the corresponding global statistics;
S106: obtain, from each writing-direction attribute feature value in the global statistics, a page orientation and a corresponding confidence for that feature value;
S108: determine the orientation of the document page and its overall confidence from the page orientations and corresponding confidences of the individual writing-direction attribute feature values.
In this embodiment, the text regions obtained by image-text separation of the document page are split into text units; a page orientation and a corresponding confidence are then obtained for each writing-direction attribute feature value according to the language category attribute of the text units; finally, the orientation of the document page and its overall confidence are determined from those per-feature orientations and confidences. The page orientation is thus determined automatically, and the overall confidence lets the user decide whether the result is usable, which overcomes the prior-art problem that correct detection results cannot be obtained for complex document pages.
Preferably, in the above method, performing image-text separation on the document page and splitting the resulting text regions into a plurality of text units specifically comprises: binarizing the image of the document page and then detecting its connected components; performing row detection on the detected connected components in the horizontal and vertical directions respectively; determining whether a row of connected components is a text line according to the sizes and relative positions of the components in the row, the number of components in the row, and the complexity of the components in the row; and taking the connected components determined to be text lines as text regions and segmenting the text lines to obtain text units. In this embodiment, splitting into text units yields words (shorter text lines, mainly Roman-script text) or sentences (longer text lines, mainly Asian-script text).
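The connected-component and row-detection steps above can be sketched as follows. This is a minimal illustration only, not the patented implementation: the function names `connected_components` and `group_into_rows`, the use of 8-connectivity, and the overlap rule for joining a row are all assumptions. Only horizontal rows are grouped here; vertical rows would follow by swapping the roles of the two axes.

```python
from collections import deque


def connected_components(binary):
    """Return bounding boxes (top, left, bottom, right) of the 8-connected
    foreground regions in a 2D list of 0/1 pixels (hypothetical helper)."""
    height = len(binary)
    width = len(binary[0]) if height else 0
    seen = [[False] * width for _ in range(height)]
    boxes = []
    for y in range(height):
        for x in range(width):
            if not binary[y][x] or seen[y][x]:
                continue
            # Breadth-first flood fill from an unvisited foreground pixel.
            queue = deque([(y, x)])
            seen[y][x] = True
            top, left, bottom, right = y, x, y, x
            while queue:
                cy, cx = queue.popleft()
                top, bottom = min(top, cy), max(bottom, cy)
                left, right = min(left, cx), max(right, cx)
                for ny in range(cy - 1, cy + 2):
                    for nx in range(cx - 1, cx + 2):
                        if (0 <= ny < height and 0 <= nx < width
                                and binary[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            boxes.append((top, left, bottom, right))
    return boxes


def group_into_rows(boxes):
    """Group boxes into horizontal rows. Assumed rule: a box joins a row
    when its vertical extent overlaps that of the row's latest box."""
    rows = []
    for box in sorted(boxes, key=lambda b: (b[0], b[1])):
        for row in rows:
            last = row[-1]
            if box[0] <= last[2] and last[0] <= box[2]:
                row.append(box)
                break
        else:
            rows.append([box])
    return rows


page = [
    [1, 0, 1, 0, 0],
    [1, 0, 1, 0, 0],
    [0, 0, 0, 0, 0],
    [0, 1, 1, 0, 1],
]
boxes = connected_components(page)
rows = group_into_rows(boxes)
```

In practice a library routine for connected-component labeling would likely replace the flood fill; the pure-Python version is used here only to keep the sketch self-contained.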
In an embodiment of the invention, text lines and language category attributes are determined by analyzing the effective connected components. Effective connected components are connected components that do not overlap along the scan direction. A row of connected components is a text line if neither the variation in effective component size (measured along the normal to the scan direction) nor the variation of the component centers along the scan direction exceeds 30%. A row containing more than 16 connected components is generally an Asian-language text segment. A row containing no more than 16 connected components, in which the components that meet the complexity requirement account for more than 25% of all components in the row, is generally a Roman-script text line.
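The thresholds in this paragraph (30%, 16 components, 25%) can be turned into a small classifier. The sketch below is hedged: the source does not say how "variation" is measured, so the spread relative to the mean component size is an assumption, as are the `Component` fields, the function names, and the "undetermined" fallback label.

```python
from dataclasses import dataclass

MAX_VARIATION = 0.30          # 30% limit on size and center variation
ASIAN_COUNT_THRESHOLD = 16    # more components than this: Asian segment
ROMAN_COMPLEX_SHARE = 0.25    # share of complex components for Roman text


@dataclass
class Component:
    size: float               # extent along the normal to the scan direction
    center: float             # center coordinate (assumed: same axis as size)
    is_complex: bool = False  # result of an assumed complexity test


def is_text_line(row):
    """Assumed measure: (max - min) / mean size must stay within 30%."""
    sizes = [c.size for c in row]
    mean_size = sum(sizes) / len(sizes)
    if mean_size <= 0:
        return False
    size_spread = (max(sizes) - min(sizes)) / mean_size
    centers = [c.center for c in row]
    center_spread = (max(centers) - min(centers)) / mean_size
    return size_spread <= MAX_VARIATION and center_spread <= MAX_VARIATION


def classify_row(row):
    """Return 'not_text', 'asian', 'roman', or 'undetermined'."""
    if not row or not is_text_line(row):
        return "not_text"
    if len(row) > ASIAN_COUNT_THRESHOLD:
        return "asian"
    complex_share = sum(c.is_complex for c in row) / len(row)
    return "roman" if complex_share > ROMAN_COMPLEX_SHARE else "undetermined"


asian_row = [Component(10, 5) for _ in range(20)]
roman_row = [Component(10, 5, i < 3) for i in range(8)]
```

Here `classify_row(asian_row)` falls into the Asian branch because the row has 20 components, and `classify_row(roman_row)` falls into the Roman branch because 3 of its 8 components are marked complex.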
For example, the pages of books, periodicals, newspapers, and magazines are rich in content and can be divided mainly into text regions and image regions; text regions can be further divided into Roman-script regions and Asian-language regions. Asian-language text may be written horizontally or vertically, whereas Roman-script text is written only horizontally.
Fig. 2 shows a schematic diagram of a document page from a Japanese newspaper. As shown in Fig. 2, the page comprises an image region 21 and text regions, and a horizontally written text region 22 and a vertically written text region 23 appear on the same page. Fig. 3 shows a schematic diagram of a document page whose text units face different directions. As shown in Fig. 3, text units on the same page may face different directions: the local direction of the text units in one text region is upward, while the local direction of the text units in text region 32 is leftward. In such cases, the subsequent joint analysis step gives the final page orientation by taking into account the share of the text that each direction accounts for. The embodiment of Fig. 3 contains two writing directions, but once the writing-direction feature values of all characters are accumulated over the whole page, the upward feature values dominate, so the final page orientation is determined to be upward.
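The whole-page accumulation described for Fig. 3 amounts to a weighted vote. The following sketch assumes that each text unit contributes its local direction weighted by its character count; the function name `dominant_direction` and the input format are hypothetical.

```python
from collections import Counter


def dominant_direction(unit_directions):
    """unit_directions: iterable of (direction, character_count) pairs.
    Returns the direction with the most characters and its share."""
    totals = Counter()
    for direction, count in unit_directions:
        totals[direction] += count
    if not totals:
        return None, 0.0
    direction, votes = totals.most_common(1)[0]
    return direction, votes / sum(totals.values())


# Two upward-facing regions and one leftward-facing region (made-up counts).
direction, share = dominant_direction([("up", 120), ("left", 30), ("up", 50)])
```

With these made-up counts the upward text accounts for 170 of 200 characters, so the vote selects the upward direction.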
Therefore, determining the orientation of a document page that mixes images with text and mixes several languages (Asian and European) requires analyzing each text region, which improves the accuracy of the result.
Preferably, in the above method, performing image-text separation on the document page, splitting the resulting text regions into a plurality of text units, and determining the language category attribute of the text units specifically comprises: determining the language attribute feature value of a text unit according to the sizes, shapes, and relative positions of the inner connected components within the unit's text connected components; and comparing the language attribute feature value with the corresponding reference value prestored in a database to determine the language category attribute of the text unit.
In the above embodiment, a connected-component complexity measure determines whether a connected component is a Roman character. The complexity is defined mainly by the number of negative-valued inner connected components along the normal to the scan direction and by their relative positions.
Preferably, in the above method, obtaining a page orientation and a corresponding confidence for each writing-direction attribute feature value in the global statistics specifically comprises: comparing each writing-direction attribute feature value with the reference value prestored in the database, and determining the page orientation and corresponding confidence for that feature value.
In this embodiment, each feature value used to determine the page orientation is accumulated separately, yielding a page orientation and a confidence for the orientation determined from that feature value; the confidence is determined by the strength of each feature. For example, the confidence of the page orientation determined from the distribution of punctuation marks among four quadrants is computed from the maximum number of punctuation marks in any one quadrant and from that maximum's share of the total number of punctuation marks. The larger the maximum count in a single quadrant and the larger its share of the total, the higher the confidence of the orientation determined from punctuation; otherwise the confidence is lower.
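The punctuation example can be illustrated with a toy scoring function. The source only states that confidence grows with both the peak quadrant count and its share of the total; the product formula, the `saturation` parameter, and the function name below are assumptions made for illustration, and the mapping from quadrant index to page orientation is left out because the source does not specify it.

```python
def punctuation_confidence(quadrant_counts, saturation=20):
    """quadrant_counts: punctuation counts in the four quadrants.
    Returns (index of the peak quadrant, confidence in [0, 1]).
    Assumed formula: share of the peak times a saturating count term."""
    total = sum(quadrant_counts)
    if total == 0:
        return None, 0.0
    peak = max(quadrant_counts)
    best_quadrant = quadrant_counts.index(peak)
    share = peak / total                    # peak's share of all punctuation
    evidence = min(1.0, peak / saturation)  # more marks, more evidence
    return best_quadrant, share * evidence


quadrant, confidence = punctuation_confidence([2, 1, 18, 3])
```

Under this assumed formula, a page with few punctuation marks gets a low confidence even when they all fall in one quadrant, which matches the stated dependence on the absolute maximum as well as on its share.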
The final page orientation and confidence are obtained by jointly analyzing the results for the different feature values. The direction with the highest confidence among the per-feature results can be taken as the final page orientation. The final overall confidence is obtained by raising or lowering the maximum feature confidence: if the direction with the second-highest confidence agrees with the direction with the highest confidence, the overall confidence is raised above the maximum feature confidence; otherwise the maximum feature confidence is lowered to give the final overall confidence.
Fig. 4 shows a schematic diagram of a document page according to a preferred embodiment of the invention. As shown in Fig. 4, text line detection and grouping into words and sentences yield a number of candidate text fragments. Analyzing the language features of each character in a candidate fragment determines whether the fragment is Roman-script text. Roman writing-direction features (Roman feature 1, Roman feature 2, etc.) are extracted from each character in fragments identified as Roman, and Asian writing-direction features (Asian feature 1, Asian feature 2, etc.) are extracted from each character in non-Roman fragments. The extracted feature values are added to the Asian writing-direction feature statistics (Asian statistic 1, Asian statistic 2, etc.) and to the Roman writing-direction feature statistics (Roman statistic 1, Roman statistic 2, etc.), respectively. Analyzing the feature statistics of the Asian and Roman languages yields a page orientation result and a confidence for each statistic (Asian statistic 1_direction, Asian statistic 1_confidence; Asian statistic 2_direction, Asian statistic 2_confidence; Roman statistic 1_direction, Roman statistic 1_confidence; Roman statistic 2_direction, Roman statistic 2_confidence; and so on). Finally, all feature directions and confidences are compared together to obtain the overall page orientation and confidence.
In an embodiment of the invention, the feature direction with the highest confidence is taken as the page orientation. If the feature direction with the second-highest confidence agrees with the direction with the highest confidence, the page confidence is raised above the highest confidence (for example by 20%); otherwise it is lowered (for example by 20%).
For example, in the embodiment of Fig. 4, if "Asian statistic 1_confidence" is the largest among "Asian statistic 1_confidence", "Asian statistic 2_confidence", "Roman statistic 1_confidence", and "Roman statistic 2_confidence", the page orientation is "Asian statistic 1_direction". If "Roman statistic 2_confidence" is the second-largest confidence and "Roman statistic 2_direction" agrees with "Asian statistic 1_direction", the page orientation confidence is "Asian statistic 1_confidence" × 120%.
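The combination rule of this embodiment (take the most confident direction, then adjust by 20% depending on whether the runner-up agrees) can be sketched as below. The function name `combine_results`, the input format, and the statistic values are hypothetical; the 20% figure is the example value given in the text.

```python
def combine_results(results, adjustment=0.20):
    """results: list of (direction, confidence) pairs, one per statistic.
    Returns the overall page direction and overall confidence."""
    if not results:
        raise ValueError("at least one per-feature result is required")
    ranked = sorted(results, key=lambda r: r[1], reverse=True)
    best_direction, best_confidence = ranked[0]
    if len(ranked) == 1:
        return best_direction, best_confidence
    second_direction = ranked[1][0]
    # Raise the confidence when the runner-up agrees, lower it otherwise.
    if second_direction == best_direction:
        factor = 1 + adjustment
    else:
        factor = 1 - adjustment
    return best_direction, best_confidence * factor


# Hypothetical values for Asian statistics 1-2 and Roman statistics 1-2.
agreeing = combine_results([("up", 0.6), ("left", 0.3), ("down", 0.1), ("up", 0.5)])
conflicting = combine_results([("up", 0.6), ("left", 0.5)])
```

The sketch does not clamp the result, so a raised confidence can exceed 1; a caller that needs a value in [0, 1] would have to cap it, which the source does not discuss.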
Preferably, in the above method, the text units comprise at least one of the following: Roman words, Asian-script text segments, and punctuation marks.
Preferably, in the above method, the language categories comprise at least one of the following: the Asian language category and the Roman-script language category.
Preferably, in the above method, the writing-direction features comprise at least one of the following: the Nun feature of Asian-script text, the opening-direction feature of Roman-script text, the rise-and-fall feature, and the position of punctuation marks relative to the text.
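Fig. 5 shows a schematic diagram of a document page orientation detection device according to an embodiment of the invention, comprising:
A splitting module, configured to perform image-text separation on the document page, split the resulting text regions into a plurality of text units, and determine the language category attribute of each text unit;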
A first computing module 20, configured to obtain, according to the language category attribute, the writing-direction attribute feature values of the text units and add them to the corresponding global statistics;
A second computing module 30, configured to obtain, from each writing-direction attribute feature value in the global statistics, a page orientation and a corresponding confidence for that feature value;
A third computing module 40, configured to determine the orientation of the document page and its overall confidence from the page orientations and corresponding confidences of the individual writing-direction attribute feature values.
In this embodiment, the text regions obtained by image-text separation of the document page are split into text units; a page orientation and a corresponding confidence are then obtained for each writing-direction attribute feature value according to the language category attribute of the text units; finally, the orientation of the document page and its overall confidence are determined from those per-feature orientations and confidences. The page orientation is thus determined automatically, and the overall confidence lets the user decide whether the result is usable, which overcomes the prior-art problem that correct detection results cannot be obtained for complex document pages.
Preferably, in the above device, the splitting module specifically comprises: a connected component detecting unit, configured to binarize the image of the document page and then detect its connected components; a row detecting unit, configured to perform row detection on the detected connected components in the horizontal and vertical directions respectively; a text line determining unit, configured to determine whether a row of connected components is a text line according to the sizes and relative positions of the components in the row, the number of components in the row, and the complexity of the components in the row; and a segmenting unit, configured to take the connected components determined to be text lines as text regions and segment the text lines to obtain text units. In this embodiment, splitting into text units yields words (shorter text lines, mainly Roman-script text) or sentences (longer text lines, mainly Asian-script text).
For example, the pages of books, periodicals, newspapers, and magazines are rich in content and can be divided mainly into text regions and image regions; text regions can be further divided into Roman-script regions and Asian-language regions. Asian-language text may be written horizontally or vertically, whereas Roman-script text is written only horizontally.
Fig. 2 shows a schematic diagram of a document page from a Japanese newspaper. As shown in Fig. 2, the page comprises an image region 21 and text regions, and a horizontally written text region 22 and a vertically written text region 23 appear on the same page. Fig. 3 shows a schematic diagram of a document page whose text units face different directions. As shown in Fig. 3, text units on the same page may face different directions: the local direction of the text units in one text region is upward, while the local direction of the text units in text region 32 is leftward. In such cases, the subsequent joint analysis step gives the final page orientation by taking into account the share of the text that each direction accounts for.
Therefore, determining the orientation of a document page that mixes images with text and mixes several languages (Asian and European) requires analyzing each text region, which improves the accuracy of the result.
Preferably, in the above device, the first computing module specifically comprises: an attribute feature value unit, configured to determine the language attribute feature value of a text unit according to the sizes, shapes, and relative positions of the inner connected components within the unit's text connected components; and a first comparing unit, configured to compare the language attribute feature value with the corresponding reference value prestored in a database to determine the language category attribute of the text unit. The language attribute feature value is the feature used to decide which language category each text unit belongs to.
Preferably, in the above device, the second computing module specifically comprises: a second comparing unit, configured to compare each writing-direction attribute feature value with the reference value prestored in the database and determine the page orientation and corresponding confidence for that feature value.
In this embodiment, each feature value used to determine the page orientation is accumulated separately, yielding a page orientation and a confidence for the orientation determined from that feature value; the confidence is determined by the strength of each feature. For example, the confidence of the page orientation determined from the distribution of punctuation marks among four quadrants is computed from the maximum number of punctuation marks in any one quadrant and from that maximum's share of the total number of punctuation marks. The larger the maximum count in a single quadrant and the larger its share of the total, the higher the confidence of the orientation determined from punctuation; otherwise the confidence is lower.
The final page orientation and confidence are obtained by jointly analyzing the results for the different feature values. The direction with the highest confidence among the per-feature results can be taken as the final page orientation. The final overall confidence is obtained by raising or lowering the maximum feature confidence: if the direction with the second-highest confidence agrees with the direction with the highest confidence, the overall confidence is raised above the maximum feature confidence; otherwise the maximum feature confidence is lowered to give the final overall confidence.
Preferably, in the above device, the text units comprise at least one of the following: Roman words, Asian-script text segments, and punctuation marks.
Preferably, in the above device, the language categories comprise at least one of the following: the Asian language category and the Roman-script language category.
Preferably, in the above device, the writing-direction features comprise at least one of the following: the Nun feature of Asian-script text, the opening-direction feature of Roman-script text, the rise-and-fall feature, and the position of punctuation marks relative to the text.
Those skilled in the art will understand that the modules and steps of the invention described above can be implemented with general-purpose computing devices. They can be concentrated on a single computing device or distributed over a network of several computing devices. Alternatively, they can be implemented as program code executable by a computing device, so that they can be stored in a storage device and executed by the computing device; or they can each be made into a separate integrated circuit module; or several of the modules or steps can be made into a single integrated circuit module. The invention is therefore not restricted to any specific combination of hardware and software.
The above are only preferred embodiments of the invention and do not limit it; those skilled in the art can make various modifications and variations to the invention. Any modification, equivalent replacement, or improvement made within the spirit and principles of the invention shall fall within its scope of protection.
Claims (14)
1. A manuscript page orientation detection method, characterized in that it comprises the following steps:
Performing image-text separation on the manuscript page, splitting the resulting text region into a plurality of text units, and determining the language category attribute of the plurality of text units;
Obtaining, according to the language category attribute, the writing-direction attribute feature values of the plurality of text units, and adding them to the corresponding global statistics;
Obtaining, from each writing-direction attribute feature value in the global statistics, a page orientation and a corresponding confidence for that feature value;
Determining the orientation of the manuscript page and its corresponding overall confidence from the page orientations and corresponding confidences of the writing-direction attribute feature values.
2. The manuscript page orientation detection method according to claim 1, characterized in that performing image-text separation on the manuscript page and splitting the resulting text region into a plurality of text units specifically comprises:
Binarizing the image of the manuscript page and then performing connected domain detection;
Performing row detection on the detected connected domains in the horizontal and vertical directions respectively;
Determining whether the connected domains linked into a row form a text line, according to the sizes and relative positions of the connected domains in the row, the number of connected domains in the row, and the complexity of the connected domains in the row;
Taking the connected domains determined to be a text line as a text region, and segmenting the text line to obtain text units.
3. The manuscript page orientation detection method according to claim 2, characterized in that performing image-text separation on the manuscript page, splitting the resulting text region into a plurality of text units, and determining the language category attribute of the plurality of text units further comprises:
Determining the language attribute feature value of the text unit according to the size, shape and relative positions of the inner connected domains of the text connected domains of the text unit;
Comparing the language attribute feature value with the corresponding reference value prestored in a database to determine the language category attribute of the text unit.
4. The manuscript page orientation detection method according to claim 1, characterized in that obtaining, from each writing-direction attribute feature value in the global statistics, a page orientation and a corresponding confidence for that feature value specifically comprises:
Comparing each writing-direction attribute feature value with the reference value prestored in a database to determine the page orientation and corresponding confidence for that feature value.
5. The manuscript page orientation detection method according to claim 1, characterized in that the text units comprise at least one of the following:
Roman-script words, Asian-script text segments and punctuation marks.
6. The manuscript page orientation detection method according to claim 1, characterized in that the language categories comprise at least one of the following:
An Asian language category and a Roman-script language category.
7. The manuscript page orientation detection method according to claim 4, characterized in that the writing-direction attribute feature values comprise at least one of the following:
The Nun feature of Asian script, the opening-direction feature of Roman-script text, the rise-and-fall feature, and the position of punctuation marks relative to the text.
8. A manuscript page orientation detection device, characterized in that it comprises:
A splitting module, configured to perform image-text separation on the manuscript page, split the resulting text region into a plurality of text units, and determine the language category attribute of the plurality of text units;
A first computing module, configured to obtain, according to the language category attribute, the writing-direction attribute feature values of the plurality of text units and add them to the corresponding global statistics;
A second computing module, configured to obtain, from each writing-direction attribute feature value in the global statistics, a page orientation and a corresponding confidence for that feature value;
A third computing module, configured to determine the orientation of the manuscript page and its corresponding overall confidence from the page orientations and corresponding confidences of the writing-direction attribute feature values.
9. The manuscript page orientation detection device according to claim 8, characterized in that the splitting module specifically comprises:
A connected domain detection unit, configured to binarize the image of the manuscript page and then perform connected domain detection;
A row detection unit, configured to perform row detection on the detected connected domains in the horizontal and vertical directions respectively;
A text line determining unit, configured to determine whether the connected domains linked into a row form a text line, according to the sizes and relative positions of the connected domains in the row, the number of connected domains in the row, and the complexity of the connected domains in the row;
A segmentation unit, configured to take the connected domains determined to be a text line as a text region, and to segment the text line to obtain text units.
10. The manuscript page orientation detection device according to claim 8, characterized in that the first computing module specifically comprises:
An attribute feature value unit, configured to determine the language attribute feature value of the text unit according to the size, shape and relative positions of the inner connected domains of the text connected domains of the text unit;
A first comparing unit, configured to compare the language attribute feature value with the corresponding reference value prestored in a database to determine the language category attribute of the text unit.
11. The manuscript page orientation detection device according to claim 8, characterized in that the second computing module specifically comprises:
A second comparing unit, configured to compare each writing-direction attribute feature value with the reference value prestored in a database to determine the page orientation and corresponding confidence for that feature value.
12. The manuscript page orientation detection device according to claim 9, characterized in that the text units comprise at least one of the following:
Roman-script words, Asian-script text segments and punctuation marks.
13. The manuscript page orientation detection device according to claim 8, characterized in that the language categories comprise at least one of the following:
An Asian language category and a Roman-script language category.
14. The manuscript page orientation detection device according to claim 11, characterized in that the writing-direction attribute feature values comprise at least one of the following:
The Nun feature of Asian script, the opening-direction feature of Roman-script text, the rise-and-fall feature, and the position of punctuation marks relative to the text.
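As a rough illustration of the row detection and text-line decision recited in claims 2 and 9, the following Python sketch groups connected-domain bounding boxes into horizontal rows and applies a simple text-line test. It is a simplified sketch under stated assumptions, not the claimed procedure: the box format `(x0, y0, x1, y1)`, the function names, and the thresholds `min_overlap`, `min_count` and `max_height_ratio` are hypothetical, only the horizontal direction is handled, and the complexity criterion is omitted.

```python
from typing import List, Tuple

# Bounding box of one connected domain: (x0, y0, x1, y1), with y growing downwards.
Box = Tuple[int, int, int, int]


def group_into_rows(boxes: List[Box], min_overlap: float = 0.5) -> List[List[Box]]:
    """Group boxes into horizontal rows by vertical overlap (assumed criterion)."""
    rows: List[List[Box]] = []
    for box in sorted(boxes, key=lambda b: (b[1], b[0])):
        height = box[3] - box[1]
        for row in rows:
            top = min(b[1] for b in row)
            bottom = max(b[3] for b in row)
            overlap = min(bottom, box[3]) - max(top, box[1])
            # Join the row if enough of the box's height lies inside the row band.
            if height > 0 and overlap / height >= min_overlap:
                row.append(box)
                break
        else:
            rows.append([box])
    return rows


def looks_like_text_line(row: List[Box],
                         min_count: int = 3,
                         max_height_ratio: float = 2.0) -> bool:
    """Accept a row as text if it has enough boxes of similar height."""
    heights = [b[3] - b[1] for b in row]
    return len(row) >= min_count and max(heights) <= max_height_ratio * min(heights)


# Usage: three similar boxes on one line, plus an isolated box further down.
boxes = [(0, 0, 10, 10), (12, 1, 22, 11), (24, 0, 34, 10), (0, 50, 10, 60)]
rows = group_into_rows(boxes)
text_rows = [row for row in rows if looks_like_text_line(row)]
```

Running the same grouping on boxes with the x and y coordinates swapped would give the vertical-direction pass.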
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN 201010119229 CN102194117B (en) | 2010-03-05 | 2010-03-05 | Method and device for detecting page direction of document |
JP2010101990A JP2011188465A (en) | 2010-03-05 | 2010-04-27 | Method and device for detecting direction of document layout |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN 201010119229 CN102194117B (en) | 2010-03-05 | 2010-03-05 | Method and device for detecting page direction of document |
Publications (2)
Publication Number | Publication Date |
---|---|
CN102194117A (en) | 2011-09-21 |
CN102194117B (en) | 2013-03-27 |
Family
ID=44602159
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN 201010119229 Expired - Fee Related CN102194117B (en) | 2010-03-05 | 2010-03-05 | Method and device for detecting page direction of document |
Country Status (2)
Country | Link |
---|---|
JP (1) | JP2011188465A (en) |
CN (1) | CN102194117B (en) |
Families Citing this family (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP5992614B2 (en) * | 2012-06-13 | 2016-09-14 | カタール・ファウンデーションQatar Foundation | Electronic reading apparatus and method |
CN102831421B (en) * | 2012-08-29 | 2015-09-23 | 华东师范大学 | A kind of document above-below direction detection method based on punctuation mark |
CN103870799A (en) * | 2012-12-17 | 2014-06-18 | 北京千橡网景科技发展有限公司 | Character direction judging method and device |
CN105574530B (en) * | 2014-10-08 | 2019-11-22 | 富士通株式会社 | The method and apparatus for extracting the line of text in document |
CN105989341A (en) * | 2015-02-17 | 2016-10-05 | 富士通株式会社 | Character recognition method and device |
CN106296629B (en) * | 2015-05-18 | 2019-01-22 | 富士通株式会社 | Image processing apparatus and method |
CN105718448B (en) * | 2016-01-13 | 2019-03-19 | 北京新美互通科技有限公司 | The method and apparatus that a kind of pair of input character carries out automatic translation |
CN107845094B (en) * | 2017-11-20 | 2020-06-19 | 北京小米移动软件有限公司 | Image character detection method and device and computer readable storage medium |
CN111476239A (en) * | 2020-05-28 | 2020-07-31 | 北京易真学思教育科技有限公司 | Image direction determining method and device and electronic equipment |
CN113901559A (en) * | 2021-10-27 | 2022-01-07 | 土巴兔集团股份有限公司 | Window layout acquisition method and related equipment thereof |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5852448A (en) * | 1996-09-20 | 1998-12-22 | Dynalab Inc. | Stroke-based font generation independent of resolution |
CN101013417A (en) * | 2007-02-12 | 2007-08-08 | 北京大学 | Page setup assisted apparatus and method for changing line-shifted attribute of composition data |
CN101271524A (en) * | 2007-03-15 | 2008-09-24 | 株式会社理光 | Image processing device, image processing method, and computer program product |
Family Cites Families (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JPH11338974A (en) * | 1998-05-28 | 1999-12-10 | Canon Inc | Document processing method and device therefor, and storage medium |
JP2004272798A (en) * | 2003-03-11 | 2004-09-30 | Pfu Ltd | Image reading device |
JP4553241B2 (en) * | 2004-07-20 | 2010-09-29 | 株式会社リコー | Character direction identification device, document processing device, program, and storage medium |
JP4881605B2 (en) * | 2005-10-28 | 2012-02-22 | 株式会社リコー | Character recognition device, storage medium, and character recognition method |
JP2011008549A (en) * | 2009-06-25 | 2011-01-13 | Sharp Corp | Image processor, image reader, multifunctional machine, image processing method, program, and recording medium |
Also Published As
Publication number | Publication date |
---|---|
CN102194117A (en) | 2011-09-21 |
JP2011188465A (en) | 2011-09-22 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
C14 | Grant of patent or utility model | ||
GR01 | Patent grant | ||
CF01 | Termination of patent right due to non-payment of annual fee | Granted publication date: 20130327 |