Embodiment
The present invention is described in detail below with reference to the accompanying drawings and in conjunction with the embodiments.
Fig. 1 is a flowchart of a manuscript page orientation detection method according to an embodiment of the invention. The method comprises the following steps:
S102: perform image-text separation on the manuscript page, split the resulting text regions into a plurality of text units, and determine the language category attribute of each text unit;
S104: for each language category attribute, obtain the writing-direction feature values of the text units and add them to the corresponding global statistics;
S106: from each writing-direction feature value in the global statistics, obtain a page orientation for that feature value and a corresponding confidence;
S108: determine the orientation of the manuscript page, together with its overall confidence, from the page orientations and corresponding confidences obtained for the individual writing-direction feature values.
In this embodiment, the text regions obtained by image-text separation of the manuscript page are split into text units; a page orientation and a corresponding confidence are then obtained for each writing-direction feature value according to the language category attribute of the text units; finally, the orientation of the manuscript page and its overall confidence are determined from these per-feature page orientations and confidences. The page orientation is thus determined automatically, and the overall confidence lets the user decide whether the result is usable. This overcomes the problem in the prior art that a correct detection result cannot be obtained for complex manuscript pages.
Preferably, in the above manuscript page orientation detection method, performing image-text separation on the manuscript page and splitting the resulting text regions into a plurality of text units specifically comprises: binarizing the image of the manuscript page and then detecting connected components; performing row detection on the detected connected components in the horizontal and vertical directions respectively; determining whether the connected components joined into a row form a text line according to the sizes and relative positions of the connected components in the row, the number of connected components in the row, and the complexity of the connected components in the row; and taking the connected components determined to be text lines as text regions and segmenting the text lines to obtain text units. In this embodiment, splitting into text units yields words (shorter text lines, mainly Roman script) or sentences (longer text lines, mainly Asian script).
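The connected-component detection step can be illustrated with a minimal sketch. It assumes the page has already been binarized into rows of 0/1 values and uses 4-connectivity; the text does not specify the connectivity, and the function name `connected_components` is hypothetical.

```python
from collections import deque


def connected_components(binary):
    """Find 4-connected ink regions in a binarized image.

    `binary` is a list of rows of 0/1 values (1 = ink). Returns one
    inclusive bounding box (top, left, bottom, right) per component,
    in raster-scan order of each component's first pixel.
    """
    height = len(binary)
    width = len(binary[0]) if height else 0
    seen = [[False] * width for _ in range(height)]
    boxes = []
    for r in range(height):
        for c in range(width):
            if binary[r][c] != 1 or seen[r][c]:
                continue
            # Breadth-first flood fill from this unvisited ink pixel.
            queue = deque([(r, c)])
            seen[r][c] = True
            top, left, bottom, right = r, c, r, c
            while queue:
                y, x = queue.popleft()
                top, bottom = min(top, y), max(bottom, y)
                left, right = min(left, x), max(right, x)
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    inside = 0 <= ny < height and 0 <= nx < width
                    if inside and binary[ny][nx] == 1 and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            boxes.append((top, left, bottom, right))
    return boxes
```

The bounding boxes give the sizes and relative positions that the later row-detection and text-line tests rely on.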
In an embodiment of the invention, text lines and language category attributes are determined by analyzing effective connected components. Effective connected components are connected components that do not overlap along the scan direction. A row of connected components is a text line if neither the effective connected-component size (the dimension along the normal to the scan direction) nor the variation of the component centers along the scan direction exceeds 30%. A row containing more than 16 connected components is generally an Asian-script text segment. A row containing no more than 16 connected components, in which the components whose complexity meets the requirement account for more than 25% of all components in the row, is generally a Roman-script text line.
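The thresholds above (30%, 16 components, 25%) can be put together in a short sketch. The text does not say how "variation" is measured, so the sketch assumes the spread (maximum minus minimum) relative to the mean component size; the function name `classify_row` and its parameters are hypothetical.

```python
def classify_row(sizes, centers, meets_complexity,
                 max_variation=0.30, asian_min_count=16, roman_min_share=0.25):
    """Classify one row of effective connected components.

    sizes: extent of each component along the normal to the scan direction.
    centers: centre coordinate of each component along that same normal.
    meets_complexity: one bool per component, True if its complexity
        satisfies the Roman-character requirement.
    Returns "asian", "roman", "undetermined" or "not_text".
    """
    count = len(sizes)
    if count == 0:
        return "not_text"
    mean_size = sum(sizes) / count
    if mean_size <= 0:
        return "not_text"
    # Assumed variation measure: spread relative to the mean component size.
    size_variation = (max(sizes) - min(sizes)) / mean_size
    center_variation = (max(centers) - min(centers)) / mean_size
    if size_variation > max_variation or center_variation > max_variation:
        return "not_text"
    if count > asian_min_count:
        return "asian"
    if sum(meets_complexity) / count > roman_min_share:
        return "roman"
    return "undetermined"
```

A row that passes the text-line test but satisfies neither language rule is reported as "undetermined" here; the text does not state how such rows are handled.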
For example, the pages of books, periodicals, newspapers and magazines are rich in content and can be divided mainly into text regions and image regions; text regions can be further divided into Roman-script regions and Asian-script regions. Asian-script text can be written either horizontally or vertically, whereas Roman-script text is written only horizontally.
Fig. 2 is a schematic diagram of the rendered manuscript page of a Japanese newspaper. As shown in Fig. 2, the page comprises an image region 21 and text regions. A horizontally written text region 22 and a vertically written text region 23 appear on the same page. Fig. 3 is a schematic diagram of a manuscript page whose text units face different directions. As shown in Fig. 3, text units on the same page may have different orientations: in text region 32, the local direction of some text units is upward and that of others is toward the left. For such cases, the subsequent comprehensive analysis step gives the final page orientation by taking into account the proportion of text in each direction. The embodiment shown in Fig. 3 contains two writing directions, but after the writing-direction feature values of all characters are aggregated over the whole page, the upward feature values dominate, so the final page orientation is judged to be upward.
Therefore, determining the orientation of a manuscript page that mixes images with text and mixes multiple languages (Asian and European) requires analyzing each text region, which improves the accuracy of the result.
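As a rough illustration of the whole-page statistic described for Fig. 3, the sketch below counts the local direction of every text unit and reports the dominant one with its share. The direction labels and the function name `dominant_direction` are assumptions for illustration only; the text does not define a data representation.

```python
from collections import Counter


def dominant_direction(local_directions):
    """Return the most frequent local direction and its share of all units."""
    if not local_directions:
        raise ValueError("no text units to aggregate")
    counts = Counter(local_directions)
    direction, votes = counts.most_common(1)[0]
    return direction, votes / len(local_directions)
```

For a page on which 7 of 10 text units face upward and 3 face left, the sketch returns upward with a share of 0.7, matching the outcome described for Fig. 3.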
Preferably, in the above manuscript page orientation detection method, performing image-text separation on the manuscript page, splitting the resulting text regions into a plurality of text units, and determining the language category attribute of the text units specifically comprises: determining a language attribute feature value of each text unit according to the size, shape and relative positions of the internal connected components within the character connected components of the text unit; and comparing the language attribute feature value with the corresponding reference value prestored in a database to determine the language category attribute of the text unit.
In the above embodiment, a connected-component complexity measure is used to determine whether a connected component is a Roman character. The complexity is defined mainly by the number of negative-valued internal connected components and their relative positions along the normal to the scan direction.
Preferably, in the above manuscript page orientation detection method, obtaining, from each writing-direction feature value in the global statistics, a page orientation for that feature value and a corresponding confidence specifically comprises: comparing each writing-direction feature value with the reference values prestored in the database to determine the page orientation for that feature value and the corresponding confidence.
In this embodiment, each feature value used to determine the manuscript page orientation is aggregated separately, yielding a page orientation and the confidence of the page orientation determined from that feature value. The confidence is determined by the strength of each feature. For example, the confidence of the page orientation determined from the distribution of punctuation marks over the four quadrants is calculated from the largest punctuation count among the four quadrants and its share of the total number of punctuation marks. The larger the maximum count in a single quadrant, and the larger its share of the total number of punctuation marks, the higher the confidence of the page orientation determined from the punctuation marks; otherwise the confidence is lower.
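The punctuation example can be sketched as follows. The text states only that the confidence grows with both the largest per-quadrant count and its share of the total; the particular formula below (share multiplied by a saturating count factor), the `saturation` parameter and the function name are assumptions, and the mapping from quadrant to page orientation is left out because the text does not give it.

```python
def punctuation_confidence(quadrant_counts, saturation=20):
    """Score the punctuation-distribution feature.

    quadrant_counts: punctuation counts for the four quadrants.
    saturation: hypothetical count at which the absolute-count factor
        reaches 1.0 (not taken from the source text).
    Returns (index of the dominant quadrant, confidence in [0, 1]);
    the index is None when there is no punctuation at all.
    """
    total = sum(quadrant_counts)
    if total == 0:
        return None, 0.0
    peak = max(quadrant_counts)
    quadrant = quadrant_counts.index(peak)
    share = peak / total                     # larger share -> higher confidence
    strength = min(1.0, peak / saturation)   # larger count -> higher confidence
    return quadrant, share * strength
```

With the same 75% share, a page with 30 punctuation marks in the dominant quadrant scores higher than one with only 3, which reflects the stated dependence on both quantities.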
The final manuscript page orientation and confidence are obtained by comprehensive analysis of the several results determined for the different feature values. The direction with the highest confidence among these page-orientation results can be taken as the final page orientation. The final overall confidence is obtained by strengthening or weakening the maximum feature confidence: if the direction with the second-highest confidence agrees with the direction with the highest confidence, the overall confidence is raised above the maximum feature confidence; otherwise the maximum feature confidence is reduced to give the final overall confidence.
Fig. 4 is a schematic diagram of a manuscript page according to a preferred embodiment of the invention. As shown in Fig. 4, text-line detection and grouping into words and sentences yield a number of candidate text fragments. Analyzing the language features of each character in a candidate text fragment determines whether the fragment is a Roman-script word. For each character in a fragment determined to be a Roman-script word, Roman writing-direction features (Roman feature 1, Roman feature 2, etc.) are extracted; for each character in a non-Roman fragment, Asian writing-direction features (Asian feature 1, Asian feature 2, etc.) are extracted. The extracted feature values are added to the Asian writing-direction feature statistics (Asian statistic 1, Asian statistic 2, etc.) and the Roman writing-direction feature statistics (Roman statistic 1, Roman statistic 2, etc.) respectively. Analyzing the various feature statistics for the Asian and Roman scripts gives a page-orientation result and a confidence for each statistic (Asian statistic 1_direction, Asian statistic 1_confidence; Asian statistic 2_direction, Asian statistic 2_confidence; Roman statistic 1_direction, Roman statistic 1_confidence; Roman statistic 2_direction, Roman statistic 2_confidence; and so on). Finally, all feature directions and confidences are compared comprehensively to obtain the overall page orientation and confidence.
In an embodiment of the invention, the feature direction with the highest confidence is taken as the page orientation. If the feature direction with the second-highest confidence agrees with the feature direction with the highest confidence, the page confidence is raised above the highest confidence (for example by 20%); otherwise it is lowered (for example by 20%).
For example, in the embodiment of Fig. 4, if "Asian statistic 1_confidence" is the largest among "Asian statistic 1_confidence", "Asian statistic 2_confidence", "Roman statistic 1_confidence" and "Roman statistic 2_confidence", the page orientation is "Asian statistic 1_direction". If "Roman statistic 2_confidence" is the second-largest confidence and "Roman statistic 2_direction" agrees with "Asian statistic 1_direction", the page orientation confidence is "Asian statistic 1_confidence" × 120%.
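The combination rule of this embodiment can be sketched in a few lines. The 20% adjustment comes from the example in the text; the tuple layout, the function name `combine_results` and the handling of a single available feature are assumptions.

```python
def combine_results(results, boost=0.20, penalty=0.20):
    """Combine per-feature results into a page orientation and confidence.

    results: list of (feature_name, direction, confidence) tuples.
    The direction of the highest-confidence feature becomes the page
    orientation. Its confidence is raised by `boost` if the runner-up
    agrees on the direction, and lowered by `penalty` otherwise.
    """
    if not results:
        raise ValueError("no feature results to combine")
    ranked = sorted(results, key=lambda item: item[2], reverse=True)
    _, best_direction, best_confidence = ranked[0]
    if len(ranked) == 1:
        # Assumption: with a single feature the confidence is left unchanged.
        return best_direction, best_confidence
    _, second_direction, _ = ranked[1]
    factor = 1 + boost if second_direction == best_direction else 1 - penalty
    return best_direction, best_confidence * factor
```

As in the text, the boosted value is not capped, so a maximum feature confidence above about 0.83 yields an overall value greater than 1 when the two leading features agree.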
Preferably, in the above manuscript page orientation detection method, the text units comprise at least one of the following: Roman-script words, Asian-script text segments and punctuation marks.
Preferably, in the above manuscript page orientation detection method, the language categories comprise at least one of the following: the Asian language category and the Roman-script language category.
Preferably, in the above manuscript page orientation detection method, the writing-direction features comprise at least one of the following: the Nun feature of Asian script, the opening-direction feature of Roman script, the rise-and-fall feature, and the position of punctuation marks relative to the text.
Fig. 5 is a schematic diagram of a manuscript page orientation detection device according to an embodiment of the invention, comprising:
A splitting module 10, configured to perform image-text separation on the manuscript page, split the resulting text regions into a plurality of text units, and determine the language category attribute of each text unit;
A first computing module 20, configured to obtain, for each language category attribute, the writing-direction feature values of the text units and add them to the corresponding global statistics;
A second computing module 30, configured to obtain, from each writing-direction feature value in the global statistics, a page orientation for that feature value and a corresponding confidence;
A third computing module 40, configured to determine the orientation of the manuscript page, together with its overall confidence, from the page orientations and corresponding confidences obtained for the individual writing-direction feature values.
In this embodiment, the text regions obtained by image-text separation of the manuscript page are split into text units; a page orientation and a corresponding confidence are then obtained for each writing-direction feature value according to the language category attribute of the text units; finally, the orientation of the manuscript page and its overall confidence are determined from these per-feature page orientations and confidences. The page orientation is thus determined automatically, and the overall confidence lets the user decide whether the result is usable. This overcomes the problem in the prior art that a correct detection result cannot be obtained for complex manuscript pages.
Preferably, in the above manuscript page orientation detection device, the splitting module specifically comprises: a connected-component detection unit, configured to binarize the image of the manuscript page and then detect connected components; a row detection unit, configured to perform row detection on the detected connected components in the horizontal and vertical directions respectively; a text-line determination unit, configured to determine whether the connected components joined into a row form a text line according to the sizes and relative positions of the connected components in the row, the number of connected components in the row, and the complexity of the connected components in the row; and a segmentation unit, configured to take the connected components determined to be text lines as text regions and to segment the text lines to obtain text units. In this embodiment, splitting into text units yields words (shorter text lines, mainly Roman script) or sentences (longer text lines, mainly Asian script).
For example, the pages of books, periodicals, newspapers and magazines are rich in content and can be divided mainly into text regions and image regions; text regions can be further divided into Roman-script regions and Asian-script regions. Asian-script text can be written either horizontally or vertically, whereas Roman-script text is written only horizontally.
Fig. 2 is a schematic diagram of the manuscript page of a Japanese newspaper. As shown in Fig. 2, the page comprises an image region 21 and text regions. A horizontally written text region 22 and a vertically written text region 23 appear on the same page. Fig. 3 is a schematic diagram of a manuscript page whose text units face different directions. As shown in Fig. 3, text units on the same page may have different orientations: in text region 32, the local direction of some text units is upward and that of others is toward the left. For such cases, the subsequent comprehensive analysis step gives the final page orientation by taking into account the proportion of text in each direction.
Therefore, determining the orientation of a manuscript page that mixes images with text and mixes multiple languages (Asian and European) requires analyzing each text region, which improves the accuracy of the result.
Preferably, in the above manuscript page orientation detection device, the first computing module specifically comprises: an attribute feature value unit, configured to determine a language attribute feature value of each text unit according to the size, shape and relative positions of the internal connected components within the character connected components of the text unit; and a first comparing unit, configured to compare the language attribute feature value with the corresponding reference value prestored in a database to determine the language category attribute of the text unit. The language attribute feature value is the feature used to judge which language category each text unit belongs to.
Preferably, in the above manuscript page orientation detection device, the second computing module specifically comprises: a second comparing unit, configured to compare each writing-direction feature value with the reference values prestored in the database to determine the page orientation for that feature value and the corresponding confidence.
In this embodiment, each feature value used to determine the manuscript page orientation is aggregated separately, yielding a page orientation and the confidence of the page orientation determined from that feature value. The confidence is determined by the strength of each feature. For example, the confidence of the page orientation determined from the distribution of punctuation marks over the four quadrants is calculated from the largest punctuation count among the four quadrants and its share of the total number of punctuation marks. The larger the maximum count in a single quadrant, and the larger its share of the total number of punctuation marks, the higher the confidence of the page orientation determined from the punctuation marks; otherwise the confidence is lower.
The final manuscript page orientation and confidence are obtained by comprehensive analysis of the several results determined for the different feature values. The direction with the highest confidence among these page-orientation results can be taken as the final page orientation. The final overall confidence is obtained by strengthening or weakening the maximum feature confidence: if the direction with the second-highest confidence agrees with the direction with the highest confidence, the overall confidence is raised above the maximum feature confidence; otherwise the maximum feature confidence is reduced to give the final overall confidence.
Preferably, in the above manuscript page orientation detection device, the text units comprise at least one of the following: Roman-script words, Asian-script text segments and punctuation marks.
Preferably, in the above manuscript page orientation detection device, the language categories comprise at least one of the following: the Asian language category and the Roman-script language category.
Preferably, in the above manuscript page orientation detection device, the writing-direction features comprise at least one of the following: the Nun feature of Asian script, the opening-direction feature of Roman script, the rise-and-fall feature, and the position of punctuation marks relative to the text.
Obviously, those skilled in the art should understand that the modules or steps of the invention described above can be implemented with general-purpose computing devices. They can be concentrated on a single computing device or distributed over a network formed by multiple computing devices. Optionally, they can be implemented as program code executable by a computing device, so that they can be stored in a storage device and executed by the computing device; alternatively, they can each be made into a separate integrated circuit module, or several of the modules or steps can be made into a single integrated circuit module. The invention is therefore not limited to any specific combination of hardware and software.
The above are only preferred embodiments of the invention and are not intended to limit it; for those skilled in the art, the invention may have various modifications and variations. Any modification, equivalent replacement or improvement made within the spirit and principles of the invention shall fall within its scope of protection.