CN105760901A

CN105760901A - Automatic language identification method for multilingual skew document image

Info

Publication number: CN105760901A
Application number: CN201610053497.2A
Authority: CN
Inventors: 王恺
Original assignee: Tianjin Shenzhou Haotian Technology Co Ltd; Nankai University
Current assignee: Tianjin Shenzhou Haotian Technology Co Ltd; Nankai University
Priority date: 2016-01-27
Filing date: 2016-01-27
Publication date: 2016-07-13
Anticipated expiration: 2036-01-27
Also published as: CN105760901B

Abstract

The invention relates to an automatic language identification method for multilingual skewed document images, technically characterized by comprising the following steps: for an acquired document image, a Gabor filtering method is used to automatically identify the language family of the image, dividing document images into Asian-language and Latin-language document images; a skew correction algorithm appropriate to the identified language family is applied to obtain a corrected document image, and a keyword matching method is then applied to the corrected image to automatically identify its language, thereby realizing automatic language identification for document images. The design is reasonable: combining Gabor filtering with keyword matching realizes automatic language identification for document images, block-wise voting ensures the robustness of the method, and the identification accuracy is improved to a level that meets the requirements of practical application.

Description

An automatic language identification method for multilingual skewed document images

Technical field

The invention belongs to the field of information technology, and in particular relates to an automatic language identification method for multilingual skewed document images.

Background art

Optical character recognition (OCR) technology is widely used in the digitization of document images: it converts document images captured by a camera or scanner into editable, searchable electronic documents. As internationalization advances, document images in many languages are often mixed together. Most current OCR techniques process document images of one specific language: layout analysis and text recognition are carried out according to a manually specified language before the image is converted into an editable, searchable electronic document. With automatic language identification, document images awaiting OCR can be classified by language automatically and, according to the result, routed to different OCR engines or processed with different language options, which reduces manual intervention and labor cost. However, because some languages use structurally similar characters, and because image acquisition often introduces heavy noise and low resolution, it is difficult to design an automatic language identification method for document images whose accuracy is high enough for practical use.

At present, research on automatic language identification for document images relies mainly on texture features and word-shape features, with the following main problems: (1) texture features rarely reach practical accuracy for languages with similar glyphs, for example when distinguishing English, German and French; (2) even for languages whose glyphs differ greatly, using the texture features of a single text region gives unstable results and low accuracy; (3) compared with texture features, word-shape features are better suited to distinguishing languages with similar text structure, but at low resolution they also fail to reach the accuracy needed in practice; (4) the document image to be processed may be skewed, and different languages require different skew correction methods; for example, because their character structures differ greatly, the skew correction methods for Chinese and English document images are entirely different. The character segmentation methods for document images of different languages are also entirely different. Consequently, when the language is unknown, correct word-shape features cannot be extracted from the document image, and identification methods based on word-shape features fail. In summary, although existing methods achieve some success, their accuracy does not meet practical needs, because some written languages are very similar in both texture and shape and because image acquisition introduces noise, low resolution and skew.

Summary of the invention

The object of the invention is to overcome the deficiencies of the prior art and to provide an automatic language identification method for multilingual skewed document images that is reasonably designed, highly accurate and widely applicable.

The invention solves its technical problem by adopting the following technical solution:

An automatic language identification method for multilingual skewed document images, comprising the following steps:

Step 1: for an acquired document image, use a Gabor filtering method to automatically identify the language family of the image, dividing document images into Asian-language and Latin-language document images;

Step 2: apply the skew correction algorithm corresponding to the identified language family to obtain a corrected document image, then apply a keyword matching method to the corrected image to automatically identify its language, thereby realizing automatic language identification for the document image.

Further, step 1 is implemented by the following steps:

(1) denoise the acquired document image using mathematical morphology;

(2) from the skewed document image, select a number of text regions suitable for automatic language identification;

(3) apply Gabor filtering to each selected text region and, using the extracted Gabor features and a classifier, automatically identify the language family of each text region;

(4) vote on the language-family results of the text regions and take the family with the most votes as the result for the whole document image, thereby dividing input document images into two major classes: Asian-language and Latin-language document images.

Further, the morphological denoising of the acquired document image in step (1) is realized with erosion and dilation algorithms.

Further, step (3) is specifically: first, generate Gabor images at different scales and in multiple directions for each selected text region image; then compute the Gabor magnitude images and downsample them; finally, train a classifier on the Gabor features extracted from text region training samples, and use it to classify each text region image to be identified as Asian-language or Latin-language.

Further, step 2 is implemented by the following steps:

(1) perform skew correction and character segmentation on the document image according to the automatically identified language family;

(2) from the character segmentation result, take the image blocks that best match the characteristics of text;

(3) according to the identified language family, recognize each segmented character image or word image with a classifier, and identify the language of each character or word image from the recognition result;

(4) vote on the language results of the character or word images and take the language with the most votes as the language identification result for the whole document image.

Further, the character segmentation in step (1) is: apply to the corrected document image the character segmentation method appropriate to its language family, obtaining the segmentation result: for Asian-language document images, segmentation yields multiple candidate characters; for Latin-language document images, it yields multiple candidate words.

Further, step (2) is specifically: for Asian-language document images, first compute a histogram of candidate character heights and select the characters whose height is near the histogram peak, so as to filter out noise and reduce its influence on the result; then sort the selected characters in ascending order of the absolute value of (aspect ratio − 1) and retain a number of leading characters for subsequent analysis; for Latin-language document images, sort the candidate words by length in descending order and retain a number of leading words for subsequent analysis.

Further, step (3) is specifically: for Asian-language document images, the character images retained in step (2) are fed to an Asian character image classifier for character recognition, where the recognition result of each character is Chinese, Japanese or Korean, and a number of characters with the highest recognition confidence are retained for the language vote; for Latin-language document images, the words retained in step (2) undergo character segmentation and recognition, and a number of words that match a language dictionary and have the highest confidence are retained for the language vote.

Advantages and beneficial effects of the invention:

The invention is reasonably designed. It combines Gabor filtering with keyword matching to realize automatic language identification for document images, and block-wise voting ensures the robustness of the method. The identification accuracy is improved to a level that meets practical needs, solving the automatic language identification problem for skewed document images in Chinese, Japanese, Korean, English, French, German, Italian, Swedish, Spanish, Portuguese, Norwegian, Danish, Polish and Finnish.

Brief description of the drawings

Fig. 1 is the system framework diagram of the invention;

Fig. 2 is the flow chart of automatic language-family identification for document images;

Fig. 3 is the flow chart of automatic language identification for document images of the same language family;

Fig. 4 is a schematic diagram of the language identification experimental results for Latin-language document images.

Detailed description of the invention

Embodiments of the invention are further described below with reference to the accompanying drawings:

An automatic language identification method for multilingual skewed document images, as shown in Fig. 1, comprises the following steps:

Step 1: for an acquired document image, use a Gabor filtering method to automatically identify the language family of the image, dividing document images into Asian-language (Chinese, Japanese, Korean) document images and Latin-language (English, French, German, Italian, Swedish, Spanish, Portuguese, Norwegian, Danish, Polish, Finnish) document images.

The processing of this step is shown in Fig. 2 and comprises the following steps:

Step (1): denoise the acquired document image using mathematical morphology to reduce the influence of noise.

An opening operation (erosion followed by dilation) is applied to the document image to be identified, filtering out noise that may be present in it. Specifically:

(a) Erosion: scan each pixel of the image with a 3*3 structuring element and AND the structuring element with the binary image it covers. If all results are 1, the corresponding pixel of the output image is 1; otherwise it is 0.

(b) Dilation: scan each pixel of the image with a 3*3 structuring element and AND the structuring element with the binary image it covers. If all results are 0, the corresponding pixel of the output image is 0; otherwise it is 1.
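The opening operation described in (a) and (b) can be sketched as follows. This is a minimal illustration in plain Python, not the implementation used in the invention. It assumes a binary image stored as a list of rows in which 1 marks a foreground pixel, an all-ones 3*3 structuring element, and pixels outside the image treated as 0; the function names are hypothetical.

```python
def erode(img):
    """3*3 erosion: output is 1 only if all nine covered pixels are 1."""
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    # Border pixels stay 0 because part of the element falls outside the image.
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            out[y][x] = int(all(img[y + dy][x + dx]
                                for dy in (-1, 0, 1) for dx in (-1, 0, 1)))
    return out


def dilate(img):
    """3*3 dilation: output is 0 only if all covered pixels are 0."""
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            out[y][x] = int(any(img[y + dy][x + dx]
                                for dy in (-1, 0, 1) for dx in (-1, 0, 1)
                                if 0 <= y + dy < h and 0 <= x + dx < w))
    return out


def opening(img):
    """Opening = erosion followed by dilation; removes specks smaller than the element."""
    return dilate(erode(img))
```

Applied to an image containing a solid 3*3 block and an isolated foreground pixel, the opening keeps the block and removes the isolated pixel, which is the denoising effect the step relies on.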

Step (2): from the skewed document image, select the top m text regions best suited to automatic language identification (here m = 21).

For a document image from which Gabor features are to be extracted, first randomly select 100 sub-images of size 200*200 from the image; then screen these 100 sub-images by the following criteria:

(a) If the number of black pixels in a sub-image exceeds 1/4 of the sub-image size, it is considered not to be a text region and is discarded, to reduce interference from non-text regions such as pictures.

(b) Each text region image that passes (a) is divided into 4 rows and 4 columns, 16 blocks in total. For each block, the Canny operator is used to obtain its edge image. If the edges account for 10%～20% of the total size, the support of the text region is increased by 1. The final support of each text region therefore ranges from 0 to 16.

(c) The text regions are sorted by support from high to low, and the 21 text region images with the highest support are selected for Gabor feature extraction and subsequent classification.
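The screening criteria (a) to (c) can be sketched as below. This is a hedged illustration only: it assumes the edge maps have already been produced by a Canny detector (not implemented here) and are supplied as binary lists of rows, and it interprets the 10%～20% criterion as the fraction of edge pixels within each block, which is an assumption about the wording of (b). All function names are hypothetical.

```python
def is_text_candidate(binary):
    """Criterion (a): reject a sub-image if more than 1/4 of its pixels are black (1)."""
    total = len(binary) * len(binary[0])
    black = sum(map(sum, binary))
    return black * 4 <= total


def support(edges, grid=4, lo=0.10, hi=0.20):
    """Criterion (b): count the blocks whose edge-pixel fraction lies in [lo, hi]."""
    bh, bw = len(edges) // grid, len(edges[0]) // grid
    score = 0
    for by in range(grid):
        for bx in range(grid):
            count = sum(edges[y][x]
                        for y in range(by * bh, (by + 1) * bh)
                        for x in range(bx * bw, (bx + 1) * bw))
            if lo <= count / (bh * bw) <= hi:
                score += 1
    return score  # 0..grid*grid, i.e. 0..16 for a 4*4 grid


def select_regions(regions, m=21):
    """Criterion (c): indices of the m best-supported candidates.

    regions is a list of (binary_image, edge_map) pairs.
    """
    scored = [(support(edges), i)
              for i, (binary, edges) in enumerate(regions)
              if is_text_candidate(binary)]
    scored.sort(key=lambda t: (-t[0], t[1]))  # highest support first, stable on ties
    return [i for _, i in scored[:m]]
```

In the embodiment the regions would be the 100 randomly chosen 200*200 sub-images and m would be 21; the sketch works for any image size divisible by the grid.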

Step (3): apply Gabor filtering to each selected text region and, based on the extracted Gabor features and a classifier, automatically identify the language family of each text region.

For the 21 selected text region images, first generate Gabor images at different scales (g=0,1,2) and in multiple directions (h=0,1,2,...,15).

The Gabor function is given by formula (1):

\Psi(x, y) = \frac{f^{2}}{\pi \gamma \eta} e^{-\left(\frac{f^{2}}{\gamma^{2}} x_{t}^{2} + \frac{f^{2}}{\eta^{2}} y_{t}^{2}\right)} e^{i 2 \pi f x_{t}} \qquad (1)

Its real and imaginary parts are computed as shown in formulas (2) and (3), respectively:

\mathrm{Re}(\Psi) = \frac{f^{2}}{\pi \gamma \eta} e^{-\left(\frac{f^{2}}{\gamma^{2}} x_{t}^{2} + \frac{f^{2}}{\eta^{2}} y_{t}^{2}\right)} \cos(2 \pi f x_{t}) \qquad (2)

\mathrm{Im}(\Psi) = \frac{f^{2}}{\pi \gamma \eta} e^{-\left(\frac{f^{2}}{\gamma^{2}} x_{t}^{2} + \frac{f^{2}}{\eta^{2}} y_{t}^{2}\right)} \sin(2 \pi f x_{t}) \qquad (3)

where

x_{t} = x \cos\theta + y \sin\theta, \qquad y_{t} = -x \sin\theta + y \cos\theta \qquad (4)

In formulas (1) to (4), x and y are the pixel coordinates; (x_t, y_t) is the result of rotating (x, y) clockwise by θ; f is the frequency of the complex sinusoid, whose value is [expression not reproduced in this text], with f_max = 0.25; θ is the wavelet direction, whose value is [expression not reproduced in this text]; γ is the spatial width of the wavelet along the sinusoidal plane wave and η is its spatial width perpendicular to the wave, here

\gamma = \eta = \sqrt{2}.

For a fixed scale and direction, a kernel matrix can be computed, consisting of a real-part kernel and an imaginary-part kernel. Computing the kernel requires a window; the window size is set to 8, giving two 8*8 kernel matrices. After the kernels are obtained, the real-part kernel is flipped vertically and then horizontally; the imaginary-part kernel is left unchanged. The image is then convolved with each of the two kernels, giving a real-part response image and an imaginary-part response image. Finally, the magnitude is computed from these two convolved images to obtain the magnitude image.
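The kernel construction can be sketched by sampling formulas (2) and (3) on an 8*8 window. This is a minimal sketch under stated assumptions: the window is assumed to cover integer offsets from -4 to 3, f and theta are passed in as parameters because their per-scale and per-direction expressions are not reproduced in the text, and the kernel flipping step is omitted. The helper magnitude_at evaluates the response at a single window position as a plain sum of products; the function names are hypothetical.

```python
import math


def gabor_kernels(f, theta, gamma=math.sqrt(2), eta=math.sqrt(2), size=8):
    """Return (real, imag) size*size kernels sampled from formulas (2) and (3)."""
    half = size // 2
    norm = f * f / (math.pi * gamma * eta)
    real, imag = [], []
    for y in range(-half, size - half):          # assumed offsets -4..3
        real_row, imag_row = [], []
        for x in range(-half, size - half):
            # Rotated coordinates, formula (4).
            xt = x * math.cos(theta) + y * math.sin(theta)
            yt = -x * math.sin(theta) + y * math.cos(theta)
            envelope = norm * math.exp(-((f / gamma) ** 2 * xt * xt
                                         + (f / eta) ** 2 * yt * yt))
            real_row.append(envelope * math.cos(2 * math.pi * f * xt))
            imag_row.append(envelope * math.sin(2 * math.pi * f * xt))
        real.append(real_row)
        imag.append(imag_row)
    return real, imag


def magnitude_at(patch, real, imag):
    """Magnitude of the complex filter response for one size*size image patch."""
    re = sum(p * k for prow, krow in zip(patch, real) for p, k in zip(prow, krow))
    im = sum(p * k for prow, krow in zip(patch, imag) for p, k in zip(prow, krow))
    return math.hypot(re, im)
```

Sliding magnitude_at over every window position of a text region would give the magnitude image for one scale and direction; a production implementation would normally use an optimized convolution routine instead.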

The computed Gabor magnitude image is downsampled (downsampling ratio 4), reducing each side of the magnitude image to 1/4 of its original length. Each scale and direction thus yields one downsampled image (50*50), whose pixel values are averaged into a single feature. With 3 scales and 16 directions per sub-image, the total number of features is 3*16=48.
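The feature computation can be sketched as follows. The text does not say how the downsampling is done, so this sketch assumes simple decimation (keeping every 4th pixel in each direction); the dictionary layout keyed by (scale, direction) and the function names are hypothetical.

```python
def downsample(img, ratio=4):
    """Keep every ratio-th row and column (assumed decimation), e.g. 200*200 -> 50*50."""
    return [row[::ratio] for row in img[::ratio]]


def mean_pixel(img):
    """Average of all pixel values in a 2D image."""
    return sum(map(sum, img)) / (len(img) * len(img[0]))


def gabor_features(magnitudes, scales=3, directions=16):
    """One mean value per (scale, direction) magnitude image: 3*16 = 48 features.

    magnitudes maps (g, h) to a 2D magnitude image.
    """
    return [mean_pixel(downsample(magnitudes[(g, h)]))
            for g in range(scales) for h in range(directions)]
```

The resulting 48-element vector is what would be passed to the classifier for each text region.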

A classifier is trained on the Gabor features extracted from text region training samples and then used to classify each text region image to be identified as Asian-language or Latin-language.

Step (4): vote on the language-family results of the 21 text regions and take the family with the most votes as the result for the whole document image.

For a document image to be identified, the language-family results of the 21 selected text regions are put to a vote; the family with more votes is the result for the document image, so that input document images are divided into two major classes: Asian-language and Latin-language document images.
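The region-level vote can be sketched in a few lines. The label strings and the function name are hypothetical; the only logic is a majority count.

```python
from collections import Counter


def vote(labels):
    """Return the label that occurs most often among the per-region results."""
    return Counter(labels).most_common(1)[0][0]


# Example: 12 of 21 regions classified as Asian-language, 9 as Latin-language.
region_results = ["asian"] * 12 + ["latin"] * 9
document_family = vote(region_results)
```

With two classes and an odd number of regions such as 21, a tie cannot occur, so the vote always yields a single result.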

Step 2: on the basis of the language-family result, apply the skew correction algorithm appropriate to that family to obtain a corrected document image, and apply a keyword matching method to the corrected image to automatically identify the language of the document image.

The processing of this step is shown in Fig. 3 and comprises the following steps:

Step (1): perform the corresponding skew correction and character segmentation on the document image according to the automatically identified language family.

According to the language-family result obtained above, the skew correction method appropriate to that family is applied to deskew the document image; then the character segmentation method appropriate to that family is applied to the corrected image to obtain the segmentation result. For Asian-language document images, segmentation yields multiple candidate characters; for Latin-language document images, it yields multiple candidate words.

Step (2): from the character segmentation result, take the image blocks that best match the characteristics of text.

For Asian-language document images, first compute a histogram of candidate character heights and select the characters whose height is near the histogram peak, so as to filter out noise and reduce its influence on the result; then sort the selected characters in ascending order of |aspect ratio − 1| and retain the first 100 characters for subsequent analysis: the closer the aspect ratio is to 1, the more likely the candidate is a correctly segmented Asian character.

For Latin-language document images, sort the candidate words by length in descending order and retain the first 100 words for subsequent analysis: the longer the word, the less likely it is that misrecognizing individual characters causes the word to be assigned to the wrong language.
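Both selection rules can be sketched as below. The text does not define how close to the histogram peak a height must be, so the tolerance tol is a hypothetical parameter; candidates are represented as (width, height) boxes for characters and (identifier, width) pairs for words, which is also an assumption, as is measuring word length by the width of its bounding box.

```python
from collections import Counter


def select_asian_chars(boxes, keep=100, tol=2):
    """Keep boxes with height near the histogram peak, ordered by |aspect ratio - 1|.

    boxes is a list of (width, height) in pixels; tol is a hypothetical tolerance.
    """
    peak = Counter(h for _, h in boxes).most_common(1)[0][0]
    near_peak = [b for b in boxes if abs(b[1] - peak) <= tol]
    near_peak.sort(key=lambda b: abs(b[0] / b[1] - 1))  # squarest boxes first
    return near_peak[:keep]


def select_latin_words(words, keep=100):
    """Keep the longest candidate words; words is a list of (identifier, width)."""
    return sorted(words, key=lambda w: -w[1])[:keep]
```

For example, among boxes of sizes 30*30, 28*30, 15*30 and 31*31 with a noise blob of 5*4, the blob is dropped by the height filter and the 15*30 fragment is ranked last because its aspect ratio is far from 1.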

Step (3): according to the identified language family, recognize each segmented character image or word image with a classifier, and identify the language of each character or word image from the recognition result.

For Asian-language document images, the 100 character images retained in step (2) are fed to an Asian character image classifier for character recognition; the recognition result of each character may be Chinese, Japanese or Korean, and the 20 characters with the highest recognition confidence are retained for the language vote.

For Latin-language document images, the 100 words retained in step (2) undergo character segmentation and recognition, and the 20 words that match the dictionary of some language and have the highest confidence are retained for the language vote.

Step (4): vote on the language results of the character or word images and take the language with the most votes as the language identification result for the whole document image.

For Asian-language document images, a vote among Chinese, Japanese and Korean is held on the recognition results of the 20 characters retained in step (3); the language with the most characters is the language identification result for the Asian-language document image.

For Latin-language document images, a vote among English, French, German, Italian, Swedish, Spanish, Portuguese, Norwegian, Danish, Polish and Finnish is held on the recognition results of the 20 words retained in step (3); the language with the most words is the language identification result for the Latin-language document image.
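Steps (3) and (4) can be sketched together as a confidence filter followed by a vote. The recognizer itself is not shown: its outputs are assumed to arrive as (label, confidence) or (word, confidence) pairs, and the tiny dictionaries in the usage example are illustrative placeholders, not the dictionaries used in the invention. Letting a word that matches several dictionaries vote for each of them is an assumption of this sketch.

```python
from collections import Counter


def identify_asian_language(recognized, keep=20):
    """recognized: list of (language_label, confidence), one per character."""
    top = sorted(recognized, key=lambda r: -r[1])[:keep]
    votes = Counter(label for label, _ in top)
    return votes.most_common(1)[0][0] if votes else None


def identify_latin_language(recognized, dictionaries, keep=20):
    """recognized: list of (word, confidence); dictionaries: {language: set of words}."""
    matched = []
    for word, conf in recognized:
        langs = [lang for lang, vocab in dictionaries.items()
                 if word.lower() in vocab]
        if langs:                      # keep only words found in some dictionary
            matched.append((conf, langs))
    matched.sort(key=lambda t: -t[0])  # highest confidence first
    votes = Counter(lang for _, langs in matched[:keep] for lang in langs)
    return votes.most_common(1)[0][0] if votes else None


# Hypothetical miniature dictionaries, for illustration only.
dictionaries = {"English": {"language", "document"},
                "German": {"sprache", "dokument"}}
words = [("Sprache", 0.9), ("Dokument", 0.8), ("language", 0.7), ("xqzt", 0.99)]
result = identify_latin_language(words, dictionaries)
```

In this example the unmatched string "xqzt" is discarded despite its high confidence, which illustrates why dictionary matching makes the vote tolerant of recognition errors.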

Through the above steps, automatic language identification for document images is realized, solving the automatic language identification problem for skewed document images in Chinese, Japanese, Korean, English, French, German, Italian, Swedish, Spanish, Portuguese, Norwegian, Danish, Polish and Finnish.

The method proposed by the invention, which combines Gabor filtering with keyword matching, is verified below in two parts: "language-family identification results for document images" and "language identification results for document images of the same language family". Part 1, through experiments on Asian-language and Latin-language document images, shows that the invention is robust in identifying the language family of skewed document images. Part 2, through experiments on Asian-language document images (Chinese, Japanese, Korean) and Latin-language document images (English, French, German, Italian, Swedish, Spanish, Portuguese, Norwegian, Danish, Polish, Finnish), shows that, given the language-family result, the invention handles well the problem of distinguishing languages with similar text structure within the same family.

1. Language-family identification results for document images

This experiment collected 110 Asian-language and 110 Latin-language document images. Each image was rotated by 15 different angles, giving 1650 skewed Asian-language and 1650 skewed Latin-language document images, which served as the data set for the language-family identification experiment. The results show that the language-family identification accuracy for Asian-language and Latin-language document images reaches 99.48%. Detailed results are given in Table 1: only 0.70% of Asian-language document images were misidentified as Latin-language, and 0.33% of Latin-language document images were misidentified as Asian-language.
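The reported figures can be cross-checked with a short calculation. This is only a consistency check on the numbers quoted above, assuming the overall accuracy is the average over the two equally sized classes.

```python
# 110 source images per family, each rotated by 15 angles.
images_per_family = 110 * 15

# Per-family misidentification rates quoted from Table 1.
asian_error = 0.0070
latin_error = 0.0033

# With equally sized classes, the overall accuracy is one minus the mean error.
accuracy = 1 - (asian_error + latin_error) / 2
```

The result is about 99.485%, which agrees with the reported 99.48% up to rounding.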

Table 1. Language-family identification results for Asian-language and Latin-language document images

2. Language identification results for document images of the same language family

2.1 Language identification for Asian-language document images

The data set for this experiment comprises 40 Chinese, 35 Japanese and 35 Korean document images after skew correction. Gaussian noise (mean 0, variance 0.02) and salt-and-pepper noise (noise ratio 0.05) were added separately to each, giving 220 images as the data set for the Asian-language identification experiment. The results show that the language identification accuracy for Asian-language document images (Chinese, Japanese, Korean) reaches 98.18%. Detailed results are given in Table 2: the accuracies for Chinese, Japanese and Korean document images are 100.00%, 97.14% and 97.14%, respectively.
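The per-language and overall accuracies can be checked against each other. The sketch assumes that each source image contributes two test images (one per noise type), which is how 110 images become 220; the implied counts of correct results are derived here, not stated in the text.

```python
# Test images per language: two noisy variants of each source image.
totals = {"Chinese": 40 * 2, "Japanese": 35 * 2, "Korean": 35 * 2}
reported = {"Chinese": 1.0000, "Japanese": 0.9714, "Korean": 0.9714}

# Implied number of correctly identified images per language.
correct = {lang: round(totals[lang] * reported[lang]) for lang in totals}

overall = sum(correct.values()) / sum(totals.values())
```

Under this assumption the per-language figures imply 216 correct results out of 220, that is 98.18%, matching the reported overall accuracy.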

Table 2. Language identification results for Asian-language document images

2.2 Language identification for Latin-language document images

The number of document images of each language in this experiment is given in Table 3.

Table 3. Latin-language document image data set

Gaussian noise (mean 0, variance 0.02) and salt-and-pepper noise (noise ratio 0.05) were added separately to all samples, giving 25,614 images as the data set for the Latin-language identification experiment. Detailed results are shown in Fig. 4, from which it can be seen that the language identification accuracy for Latin-language document images reaches 98.18%.

It should be emphasized that the embodiments described are illustrative rather than restrictive. The invention is therefore not limited to the embodiments described in the detailed description; any other embodiment derived by a person skilled in the art from the technical solution of the invention likewise falls within the scope of protection of the invention.

Claims

1. An automatic language identification method for multilingual skewed document images, characterized by comprising the following steps:

2. The automatic language identification method for multilingual skewed document images according to claim 1, characterized in that step 1 is implemented by the following steps:

3. The automatic language identification method for multilingual skewed document images according to claim 2, characterized in that the morphological denoising of the acquired document image in step (1) is realized with erosion and dilation algorithms.

4. The automatic language identification method for multilingual skewed document images according to claim 2, characterized in that step (3) is specifically: first, generate Gabor images at different scales and in multiple directions for each selected text region image; then compute the Gabor magnitude images and downsample them; finally, train a classifier on the Gabor features extracted from text region training samples, and use it to classify each text region image to be identified as Asian-language or Latin-language.

5. The automatic language identification method for multilingual skewed document images according to claim 1, characterized in that step 2 is implemented by the following steps:

6. The automatic language identification method for multilingual skewed document images according to claim 5, characterized in that the character segmentation in step (1) is: apply to the corrected document image the character segmentation method appropriate to its language family, obtaining the segmentation result: for Asian-language document images, segmentation yields multiple candidate characters; for Latin-language document images, it yields multiple candidate words.

7. The automatic language identification method for multilingual skewed document images according to claim 5, characterized in that step (2) is specifically: for Asian-language document images, first compute a histogram of candidate character heights and select the characters whose height is near the histogram peak, so as to filter out noise and reduce its influence on the result; then sort the selected characters in ascending order of the absolute value of (aspect ratio − 1) and retain a number of leading characters for subsequent analysis; for Latin-language document images, sort the candidate words by length in descending order and retain a number of leading words for subsequent analysis.

8. The automatic language identification method for multilingual skewed document images according to claim 5, characterized in that step (3) is specifically: for Asian-language document images, the character images retained in step (2) are fed to an Asian character image classifier for character recognition, where the recognition result of each character is Chinese, Japanese or Korean, and a number of characters with the highest recognition confidence are retained for the language vote; for Latin-language document images, the words retained in step (2) undergo character segmentation and recognition, and a number of words that match a language dictionary and have the highest confidence are retained for the language vote.