CN105760901A - Automatic language identification method for multilingual skew document image - Google Patents

Automatic language identification method for multilingual skew document image

Info

Publication number
CN105760901A
Authority
CN
China
Prior art keywords
file
picture
languages
language
character
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201610053497.2A
Other languages
Chinese (zh)
Other versions
CN105760901B (en)
Inventor
王恺
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tianjin Shenzhou Haotian Technology Co Ltd
Nankai University
Original Assignee
Tianjin Shenzhou Haotian Technology Co Ltd
Nankai University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tianjin Shenzhou Haotian Technology Co Ltd and Nankai University
Priority to CN201610053497.2A
Publication of CN105760901A
Application granted
Publication of CN105760901B
Legal status: Active
Anticipated expiration

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10Character recognition
    • G06V30/24Character recognition characterised by the processing or recognition method
    • G06V30/242Division of the character sequences into groups prior to recognition; Selection of dictionaries
    • G06V30/244Division of the character sequences into groups prior to recognition; Selection of dictionaries using graphical properties, e.g. alphabet type or font
    • G06V30/2445Alphabet recognition, e.g. Latin, Kanji or Katakana
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10Character recognition
    • G06V30/14Image acquisition
    • G06V30/146Aligning or centring of the image pick-up or image-field
    • G06V30/1475Inclination or skew detection or correction of characters or of image to be recognised
    • G06V30/1478Inclination or skew detection or correction of characters or of image to be recognised of characters or characters lines
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10Character recognition
    • G06V30/19Recognition using electronic means
    • G06V30/196Recognition using electronic means using sequential comparisons of the image signals with a plurality of references
    • G06V30/1983Syntactic or structural pattern recognition, e.g. symbolic string recognition

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • Character Input (AREA)

Abstract

The invention relates to an automatic language identification method for multilingual skewed document images, characterized by the following steps: for an acquired document image, a Gabor filtering method is used to automatically identify the language family of the image, classifying it as either an Asian-language or a Latin-language document image; a skew correction algorithm suited to the identified language family is then applied to obtain a corrected document image, and a keyword matching method is applied to the corrected image to automatically identify its language, thereby realizing automatic language identification of document images. The design is reasonable: Gabor filtering and keyword matching are combined to identify the language of a document image, block-wise voting ensures the robustness of the method, recognition accuracy is improved, and the accuracy meets the requirements of practical application.

Description

An automatic language identification method for multilingual skewed document images
Technical field
The invention belongs to the field of information technology, and in particular relates to an automatic language identification method for multilingual skewed document images.
Background art
Optical character recognition (OCR) technology is widely used in the digitization of document images; its role is to convert document images captured by a camera or a scanner into editable, searchable electronic documents. As internationalization increases, document images in many different languages are often mixed together. Most current OCR technology processes document images of one specific language: layout analysis and text recognition are performed according to a manually specified language to produce an editable, searchable electronic document. With automatic language identification, the document images awaiting OCR can be sorted by language automatically and, according to the identification result, sent to different OCR engines or processed with different language settings, reducing manual intervention and labor cost. However, because some languages use similar character structures, and because image acquisition often introduces heavy noise and low resolution, it is difficult to design an automatic language identification method for document images that is accurate enough for practical use.
At present, research on automatic language identification of document images relies mainly on texture features and word-shape features. The main problems are: (1) for languages with similar glyphs, such as distinguishing English, German, and French, texture features rarely achieve practical performance; (2) even for languages whose glyphs differ greatly, using the texture features of a single text region leads to unstable results and low accuracy; (3) compared with texture features, word-shape features are better suited to distinguishing languages with similar text structure, but they too struggle to reach practically useful accuracy when the resolution is low; (4) the document image to be processed may be skewed, and document images of different languages require different skew correction methods; for example, because their character structures differ greatly, the skew correction methods for Chinese and English document images are entirely different. The character segmentation methods for document images of different languages are also entirely different. Consequently, when the language family is unknown, correct word-shape features cannot be extracted from the document image, and identification methods based on word-shape features fail. In summary, although existing automatic language identification methods for document images have achieved some success, their accuracy falls short of practical needs because some written languages are very similar in both texture and shape and because image acquisition introduces heavy noise, low resolution, and skew.
Summary of the invention
The object of the invention is to overcome the deficiencies of the prior art and to provide an automatic language identification method for multilingual skewed document images that is reasonably designed, highly accurate, and widely applicable.
The invention solves its technical problem with the following technical solution:
An automatic language identification method for multilingual skewed document images, comprising the following steps:
Step 1: for an acquired document image, use a Gabor filtering method to automatically identify the language family of the document image, classifying it as either an Asian-language document image or a Latin-language document image;
Step 2: apply the skew correction algorithm corresponding to the identified language family to obtain a corrected document image, then apply a keyword matching method to the corrected document image to automatically identify its language, thereby realizing automatic language identification of the document image.
Further, step 1 is implemented through the following steps:
(1) denoise the acquired document image using mathematical morphology;
(2) for a skewed document image, select from it a number of text regions suitable for automatic language identification;
(3) apply Gabor filtering to each selected text region and, based on the extracted Gabor features, use a classifier to automatically identify the language family of each text region;
(4) vote on the language-family results of the text regions and take the language family with the most votes as the result for the whole document image, so that the input document image is classified as either an Asian-language or a Latin-language document image.
Further, the morphological denoising of the acquired document image in step (1) is implemented with erosion and dilation algorithms.
Further, step (3) is specifically: first, generate Gabor images at different scales and in multiple orientations for each selected text region image; then, compute the Gabor magnitude images and downsample them; finally, train a classifier on the Gabor features extracted from text region training samples, and use it to classify each text region image to be identified as Asian-language or Latin-language.
Further, step 2 is implemented through the following steps:
(1) perform skew correction and character segmentation on the document image according to the automatically identified language family;
(2) from the segmentation result, take the blocks of text image that best match the characteristics of text;
(3) according to the identified language family, recognize each segmented character image or word image with a classifier, and automatically identify the language of each character image or word image from the recognition result;
(4) vote on the language results of the character images or word images and take the language with the most votes as the language identification result for the whole document image.
Further, the character segmentation in step (1) is: apply a character segmentation method suited to the identified language family to the corrected document image to obtain the segmentation result; for an Asian-language document image, segmentation yields multiple candidate characters; for a Latin-language document image, segmentation yields multiple candidate words.
Further, step (2) is specifically: for an Asian-language document image, first compute a histogram of candidate character heights and select the characters whose height lies near the histogram peak, filtering out noise and reducing its influence on the result; then sort the selected characters in ascending order of |width/height - 1| and retain the leading characters for subsequent analysis. For a Latin-language document image, sort the candidate words in descending order of length and retain a number of the leading words for subsequent analysis.
Further, step (3) is specifically: for an Asian-language document image, the character images retained in step (2) are fed to an Asian character image classifier for character recognition; the recognition result of each character is Chinese, Japanese, or Korean, and a number of characters with the highest recognition confidence are retained for language voting. For a Latin-language document image, the words retained in step (2) undergo character segmentation and recognition, and a number of words that match a language dictionary and have the highest confidence are retained for language voting.
Advantages and beneficial effects of the invention:
The invention is reasonably designed. It combines Gabor filtering with keyword matching to realize automatic language identification of document images, ensures robustness through block-wise voting, and improves recognition accuracy to a level that meets practical needs. It solves the problem of automatically identifying the language of skewed document images in languages such as Chinese, Japanese, Korean, English, French, German, Italian, Swedish, Spanish, Portuguese, Norwegian, Danish, Polish, and Finnish.
Brief description of the drawings
Fig. 1 is the system framework diagram of the invention;
Fig. 2 is the flowchart of automatic language-family identification for document images;
Fig. 3 is the flowchart of automatic language identification for document images within one language family;
Fig. 4 is a schematic diagram of the language identification experimental results for Latin-language document images.
Detailed description of the embodiments
The embodiments of the invention are further described below with reference to the drawings:
An automatic language identification method for multilingual skewed document images, as shown in Fig. 1, comprises the following steps:
Step 1: for an acquired document image, use a Gabor filtering method to automatically identify its language family, classifying it as an Asian-language (Chinese, Japanese, Korean) document image or a Latin-language (English, French, German, Italian, Swedish, Spanish, Portuguese, Norwegian, Danish, Polish, Finnish) document image.
The processing in this step is shown in Fig. 2 and comprises the following steps:
Step (1): denoise the acquired document image using mathematical morphology to reduce the influence of noise.
An opening operation (erosion followed by dilation) is applied to the document image to be identified, filtering out noise that may be present. Specifically:
(a) Erosion: scan each pixel of the image with a 3*3 structuring element and AND the structuring element with the binary image it covers. If every result is 1, the corresponding pixel of the output image is 1; otherwise it is 0.
(b) Dilation: scan each pixel of the image with a 3*3 structuring element and AND the structuring element with the binary image it covers. If every result is 0, the corresponding pixel of the output image is 0; otherwise it is 1.
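The opening operation described in (a) and (b) can be sketched in plain Python as follows. This is a minimal illustration rather than the patented implementation: it assumes the binary image is a list of rows with 1 for foreground pixels, uses an all-ones 3*3 structuring element, and treats pixels outside the image as 0 (the text does not specify border handling). The function names are illustrative.

```python
def _window(img, y, x):
    """Yield the nine pixels under a 3x3 structuring element centred at (y, x).
    Pixels outside the image are treated as 0 (an assumption)."""
    h, w = len(img), len(img[0])
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            yy, xx = y + dy, x + dx
            yield img[yy][xx] if 0 <= yy < h and 0 <= xx < w else 0


def erode(img):
    """Erosion: the output pixel is 1 only if every covered pixel is 1."""
    return [[int(all(v == 1 for v in _window(img, y, x)))
             for x in range(len(img[0]))] for y in range(len(img))]


def dilate(img):
    """Dilation: the output pixel is 0 only if every covered pixel is 0."""
    return [[int(any(v == 1 for v in _window(img, y, x)))
             for x in range(len(img[0]))] for y in range(len(img))]


def opening(img):
    """Opening = erosion followed by dilation; removes specks smaller than 3x3."""
    return dilate(erode(img))
```

Applied to an image that contains a solid 3*3 block and one isolated pixel, `opening` keeps the block and removes the isolated pixel, which is the denoising effect the step relies on.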
Step (2): for a skewed document image, select from it the top m text regions best suited to automatic language identification (here m = 21).
For a document image from which Gabor features are to be extracted, first randomly select 100 sub-images of size 200*200 from the image; then screen these 100 sub-images by the following criteria:
(a) If the number of black pixels in a sub-image exceeds 1/4 of the sub-image size, it is not considered a text region and is discarded, reducing the interference of non-text regions such as pictures with the result.
(b) Each text region image that passes (a) is divided into 4 rows and 4 columns, 16 blocks in total. For each block, the Canny operator is used to obtain its edge image. If the edge pixels account for 10% to 20% of the total size, the support of the text region is increased by 1. The final support of each text region therefore ranges from 0 to 16.
(c) Sort the text regions by support from high to low, and select the 21 text region images with the highest support for Gabor feature extraction and subsequent classification.
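The screening criteria (a) to (c) can be sketched as below. Several points are assumptions made for illustration: the image is a binary list of rows with 1 for black pixels; the edge ratio is measured per block; and `simple_edges` is a crude stand-in for the Canny operator (a pixel counts as an edge if it differs from its right or lower neighbour), used only to keep the sketch free of external libraries. All function names are hypothetical.

```python
import random


def black_ratio(region):
    """Fraction of black (1) pixels in a binary region."""
    return sum(map(sum, region)) / (len(region) * len(region[0]))


def simple_edges(block):
    """Stand-in for Canny (assumption): mark a pixel as an edge if it differs
    from its right or lower neighbour inside the block."""
    h, w = len(block), len(block[0])
    return [[int((x + 1 < w and block[y][x] != block[y][x + 1]) or
                 (y + 1 < h and block[y][x] != block[y + 1][x]))
             for x in range(w)] for y in range(h)]


def support(region, edge_fn=simple_edges, lo=0.10, hi=0.20):
    """Split the region into 4x4 blocks; each block whose edge ratio lies in
    [lo, hi] adds 1 to the support, so the result is between 0 and 16."""
    bh, bw = len(region) // 4, len(region[0]) // 4
    score = 0
    for by in range(4):
        for bx in range(4):
            block = [row[bx * bw:(bx + 1) * bw]
                     for row in region[by * bh:(by + 1) * bh]]
            ratio = sum(map(sum, edge_fn(block))) / (bh * bw)
            if lo <= ratio <= hi:
                score += 1
    return score


def select_regions(image, n_samples=100, size=200, m=21, rng=None):
    """Randomly sample sub-images, drop those with too many black pixels,
    and return the m highest-support ones as (support, (top, left)) pairs."""
    rng = rng or random.Random()
    height, width = len(image), len(image[0])
    candidates = []
    for _ in range(n_samples):
        top = rng.randrange(height - size + 1)
        left = rng.randrange(width - size + 1)
        region = [row[left:left + size] for row in image[top:top + size]]
        if black_ratio(region) > 0.25:  # criterion (a): not a text region
            continue
        candidates.append((support(region), (top, left)))
    candidates.sort(key=lambda c: c[0], reverse=True)  # criterion (c)
    return candidates[:m]
```

The defaults mirror the numbers in the text (100 samples, 200*200 sub-images, m = 21); in practice `edge_fn` would be replaced by a real Canny implementation.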
Step (3): apply Gabor filtering to each selected text region and, based on the extracted Gabor features, use a classifier to automatically identify the language family of each text region.
For the 21 selected text region images, first generate Gabor images at different scales (g = 0, 1, 2) and in multiple orientations (h = 0, 1, 2, ..., 15).
The Gabor function is given by formula (1):
$$\Psi(x, y) = \frac{f^{2}}{\pi\gamma\eta}\, e^{-\left(\frac{f^{2}}{\gamma^{2}} x_t^{2} + \frac{f^{2}}{\eta^{2}} y_t^{2}\right)}\, e^{i 2\pi f x_t} \qquad (1)$$
The real and imaginary parts are computed as shown in formulas (2) and (3), respectively:
$$\mathrm{Re}\,\Psi = \frac{f^{2}}{\pi\gamma\eta}\, e^{-\left(\frac{f^{2}}{\gamma^{2}} x_t^{2} + \frac{f^{2}}{\eta^{2}} y_t^{2}\right)} \cos(2\pi f x_t) \qquad (2)$$
$$\mathrm{Im}\,\Psi = \frac{f^{2}}{\pi\gamma\eta}\, e^{-\left(\frac{f^{2}}{\gamma^{2}} x_t^{2} + \frac{f^{2}}{\eta^{2}} y_t^{2}\right)} \sin(2\pi f x_t) \qquad (3)$$
where
$$x_t = x\cos\theta + y\sin\theta, \qquad y_t = -x\sin\theta + y\cos\theta \qquad (4)$$
In formulas (1) to (4), x and y denote pixel coordinates; (x_t, y_t) is the result of rotating (x, y) clockwise by the angle θ; f denotes the frequency of the complex sinusoidal signal, with f_max = 0.25; θ denotes the wavelet orientation; γ denotes the spatial width of the wavelet along the sinusoidal plane wave, and η denotes the spatial width of the wavelet perpendicular to the sinusoidal plane wave; here γ = η = 2.
For a fixed scale and a fixed orientation, a kernel matrix can be computed, split into a real-part kernel and an imaginary-part kernel. Computing the kernel requires a window; the window size is set to 8, giving two 8*8 kernel matrices. Once the kernels are obtained, the real-part kernel is flipped vertically and then horizontally; the imaginary-part kernel is left unchanged. The image is then convolved with each of the two kernels, giving a real-part convolved image and an imaginary-part convolved image. Finally, the magnitude is computed from the two convolved images to obtain the magnitude image.
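The kernel construction and magnitude computation can be sketched as follows, reading the Gabor function as the standard form (f²/(πγη))·exp(−(f²x_t²/γ² + f²y_t²/η²))·exp(i2πf·x_t). The frequency `f` and orientation `theta` are plain parameters because the text does not preserve how they depend on the scale and orientation indices. Sampling the 8*8 window at offsets -4..3, applying the kernels as a sliding sum of products, and keeping only fully covered ("valid") output positions are assumptions of this sketch.

```python
import math


def gabor_kernels(f, theta, gamma, eta, window=8):
    """Sample the real and imaginary parts of the Gabor function on a
    window x window grid (offsets -window/2 .. window/2 - 1, an assumption)."""
    half = window // 2
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    scale = f * f / (math.pi * gamma * eta)
    real, imag = [], []
    for y in range(-half, half):
        r_row, i_row = [], []
        for x in range(-half, half):
            xt = x * cos_t + y * sin_t
            yt = -x * sin_t + y * cos_t
            env = scale * math.exp(-((f * f / gamma ** 2) * xt * xt +
                                     (f * f / eta ** 2) * yt * yt))
            r_row.append(env * math.cos(2 * math.pi * f * xt))
            i_row.append(env * math.sin(2 * math.pi * f * xt))
        real.append(r_row)
        imag.append(i_row)
    return real, imag


def flip_both(kernel):
    """Flip a kernel vertically and then horizontally."""
    return [row[::-1] for row in kernel[::-1]]


def filter2d(img, kernel):
    """Sliding sum of products over fully covered positions ('valid' output)."""
    k = len(kernel)
    out_h, out_w = len(img) - k + 1, len(img[0]) - k + 1
    return [[sum(kernel[j][i] * img[y + j][x + i]
                 for j in range(k) for i in range(k))
             for x in range(out_w)] for y in range(out_h)]


def gabor_magnitude(img, f, theta, gamma, eta, window=8):
    """Magnitude image for one scale and orientation: the real kernel is
    flipped, the imaginary kernel is used as-is, as the text describes."""
    real, imag = gabor_kernels(f, theta, gamma, eta, window)
    re = filter2d(img, flip_both(real))
    im = filter2d(img, imag)
    return [[math.hypot(a, b) for a, b in zip(r_row, i_row)]
            for r_row, i_row in zip(re, im)]
```

For example, `gabor_magnitude(region, f=0.25, theta=0.0, gamma=2.0, eta=2.0)` uses the f_max and γ = η values quoted in the text; a production version would likely pad the borders so that the output keeps the input size.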
The computed Gabor magnitude image is downsampled (downsampling ratio 4), shrinking the magnitude image to 1/4 of its original size. For each particular scale and orientation there is one downsampled image (50*50), whose pixel values are averaged. A sub-image thus has 3 scales and 16 orientations, so the total number of features is 3*16 = 48.
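The feature construction can be sketched as below: each magnitude image is downsampled by a factor of 4 and reduced to its mean, and the 3*16 means are concatenated into a 48-element vector. The downsampling scheme is not specified in the text; keeping every fourth pixel in each direction is an assumption, as are the function names and the dictionary layout keyed by (scale, orientation).

```python
def downsample(img, ratio=4):
    """Keep every ratio-th pixel in both directions (assumed scheme),
    e.g. 200x200 -> 50x50 for ratio 4."""
    return [row[::ratio] for row in img[::ratio]]


def mean_value(img):
    """Average of all pixel values in an image."""
    count = sum(len(row) for row in img)
    return sum(map(sum, img)) / count


def gabor_feature_vector(magnitudes, n_scales=3, n_orientations=16):
    """magnitudes maps (g, h) -> magnitude image for scale g and orientation h.
    Returns one mean per (g, h) pair: 3 * 16 = 48 features by default."""
    return [mean_value(downsample(magnitudes[(g, h)]))
            for g in range(n_scales) for h in range(n_orientations)]
```

Reducing each filtered image to a single mean keeps the feature vector small (48 values per region) at the cost of discarding spatial layout, which suits a coarse two-class decision.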
A classifier is trained on the Gabor features extracted from text region training samples and then used to classify each text region image to be identified as Asian-language or Latin-language.
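The text does not name the classifier. Purely to illustrate the train-then-classify flow, the sketch below uses a nearest-centroid classifier as a stand-in; this choice, the function names, and the toy two-dimensional features in the usage note are all assumptions, and any trainable classifier could take its place.

```python
def train_centroids(samples):
    """samples: list of (feature_vector, label) pairs.
    Returns {label: mean feature vector} (nearest-centroid stand-in)."""
    sums, counts = {}, {}
    for vec, label in samples:
        acc = sums.setdefault(label, [0.0] * len(vec))
        for i, value in enumerate(vec):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [s / counts[label] for s in acc]
            for label, acc in sums.items()}


def classify(centroids, vec):
    """Return the label whose centroid is closest in squared Euclidean distance."""
    return min(centroids,
               key=lambda label: sum((a - b) ** 2
                                     for a, b in zip(centroids[label], vec)))
```

With real data each feature vector would have 48 entries; for instance, training on `[([0.0, 0.0], "asian"), ([1.0, 1.0], "latin")]` and classifying `[0.2, 0.1]` returns `"asian"`.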
Step (4): vote on the language-family results of the 21 text regions and take the language family with the most votes as the result for the whole document image.
For a document image to be identified, a vote is taken over the automatic language-family results of the 21 selected text regions; the language family with more votes is the result for the document image, so that the input document image is classified as either an Asian-language or a Latin-language document image.
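The region-level vote can be sketched in a few lines. The label strings and the function name are illustrative; the only behaviour taken from the text is that the label with the most votes wins.

```python
from collections import Counter


def majority_vote(labels):
    """Return the label that occurs most often among the region results."""
    return Counter(labels).most_common(1)[0][0]
```

Because 21 is odd and there are only two language families, a tie between them cannot occur when every region receives a label.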
Step 2: on the basis of the language-family result, apply the skew correction algorithm appropriate to the language family to obtain a corrected document image, and apply a keyword matching method to the corrected image to automatically identify the language of the document image.
The processing in this step is shown in Fig. 3 and comprises the following steps:
Step (1): perform the corresponding skew correction and character segmentation on the document image according to the automatically identified language family.
According to the language-family result obtained earlier, a skew correction method suited to that language family is applied to deskew the document image; then a character segmentation method suited to that language family is applied to the corrected image to obtain the segmentation result. For an Asian-language document image, segmentation yields multiple candidate characters; for a Latin-language document image, segmentation yields multiple candidate words.
Step (2): from the segmentation result, take the blocks of text image that best match the characteristics of text.
For an Asian-language document image, first compute a histogram of candidate character heights and select the characters whose height lies near the histogram peak, filtering out noise and reducing its influence on the result. Then sort the selected characters in ascending order of |width/height - 1| and retain the first 100 characters for subsequent analysis: the closer the width-to-height ratio is to 1, the more likely the candidate is a correctly segmented Asian character.
For a Latin-language document image, sort the candidate words in descending order of length and retain the first 100 words for subsequent analysis: the longer the word, the less likely it is that misrecognizing individual characters causes the word to be attributed to the wrong language.
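The two selection rules of step (2) can be sketched as below. Candidates are represented as (width, height) boxes for characters and (length, identifier) pairs for words; these representations, the function names, and the `tol` parameter that defines "near the histogram peak" are assumptions, since the text gives no tolerance and does not say how word length is measured.

```python
from collections import Counter


def select_asian_chars(boxes, keep=100, tol=2):
    """boxes: list of (width, height) for candidate characters.
    Keep boxes whose height is within tol pixels of the most common height
    (tol is a hypothetical tolerance), then prefer near-square boxes."""
    peak_height = Counter(h for _, h in boxes).most_common(1)[0][0]
    near_peak = [b for b in boxes if abs(b[1] - peak_height) <= tol]
    near_peak.sort(key=lambda b: abs(b[0] / b[1] - 1))  # ascending |w/h - 1|
    return near_peak[:keep]


def select_latin_words(words, keep=100):
    """words: list of (length, identifier) for candidate words.
    Keep the longest candidates first."""
    return sorted(words, key=lambda w: w[0], reverse=True)[:keep]
```

Both functions default to keeping 100 candidates, matching the number given in the text.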
Step (3): according to the identified language family, recognize each segmented character image or word image with a classifier, and automatically identify the language of each character image or word image from the recognition result.
For an Asian-language document image, the 100 character images retained in step (2) are fed to an Asian character image classifier for character recognition; the recognition result of each character may be Chinese, Japanese, or Korean, and the 20 characters with the highest recognition confidence are retained for language voting.
For a Latin-language document image, the 100 words retained in step (2) undergo character segmentation and recognition, and the 20 words that can be matched by the dictionary of some language and have the highest confidence are retained for language voting.
Step (4): vote on the language results of the character images or word images and take the language with the most votes as the language identification result for the whole document image.
For an Asian-language document image, a Chinese/Japanese/Korean language vote is taken over the recognition results of the 20 characters retained in step (3); the language with the most characters is the automatic language identification result for the Asian-language document image.
For a Latin-language document image, a language vote among English, French, German, Italian, Swedish, Spanish, Portuguese, Norwegian, Danish, Polish, and Finnish is taken over the recognition results of the 20 words retained in step (3); the language with the most words is the automatic language identification result for the Latin-language document image.
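Steps (3) and (4) can be sketched as below, taking the recognizer output as given. The input formats are assumptions: each Asian character arrives as a (language, confidence) pair, each Latin word as (recognized text, confidence), and the dictionaries as a mapping from language to a set of lowercase words. How a word found in several dictionaries should vote is not stated in the text; here it votes once for each matching language.

```python
from collections import Counter


def vote_asian(chars, keep=20):
    """chars: list of (language, confidence) from the character classifier.
    Vote over the keep most confident characters."""
    top = sorted(chars, key=lambda c: c[1], reverse=True)[:keep]
    tally = Counter(lang for lang, _ in top)
    return tally.most_common(1)[0][0] if tally else None


def vote_latin(words, dictionaries, keep=20):
    """words: list of (recognized_text, confidence).
    dictionaries: {language: set of lowercase words} (hypothetical format).
    Only dictionary-matched words may vote; the keep most confident are used."""
    matched = []
    for text, confidence in words:
        langs = [lang for lang, vocab in dictionaries.items()
                 if text.lower() in vocab]
        if langs:
            matched.append((confidence, langs))
    matched.sort(key=lambda m: m[0], reverse=True)
    tally = Counter()
    for _, langs in matched[:keep]:
        tally.update(langs)  # assumption: one vote per matching language
    return tally.most_common(1)[0][0] if tally else None
```

Requiring a dictionary match before voting means that a misrecognized word, however confident the recognizer was, cannot contribute a vote.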
Through the above steps, automatic language identification of document images is realized, solving the problem of automatically identifying the language of skewed document images in languages such as Chinese, Japanese, Korean, English, French, German, Italian, Swedish, Spanish, Portuguese, Norwegian, Danish, Polish, and Finnish.
The proposed method, which combines Gabor filtering and keyword matching, is verified below from two aspects: "language-family identification results for document images" and "language identification results for document images within one language family". Part 1, through experiments on Asian-language and Latin-language document images, shows that the invention is robust in identifying the language family of skewed document images. Part 2, through experiments on Asian-language document images (Chinese, Japanese, Korean) and Latin-language document images (English, French, German, Italian, Swedish, Spanish, Portuguese, Norwegian, Danish, Polish, Finnish), shows that, building on the language-family result, the invention handles well the problem of distinguishing languages with similar text structure within one language family.
1. Automatic language-family identification results for document images
This experiment collected 110 Asian-language document images and 110 Latin-language document images. Each image was rotated by 15 different angles, yielding 1650 skewed Asian-language document images and 1650 skewed Latin-language document images, which were used as the dataset for the language-family identification experiment. The results show that the accuracy of distinguishing Asian-language from Latin-language document images reaches 99.48%. Detailed results are given in Table 1: only 0.70% of the Asian-language document images were misidentified as Latin-language, and 0.33% of the Latin-language document images were misidentified as Asian-language.
Table 1: Language-family identification results for Asian-language and Latin-language document images
2. Automatic language identification results for document images within one language family
2.1 Automatic language identification of Asian-language document images
The dataset for this experiment comprises 40 Chinese, 35 Japanese, and 35 Korean document images after skew correction. Gaussian noise (mean 0, variance 0.02) and salt-and-pepper noise (noise density 0.05) were added separately, giving a dataset of 220 images for the Asian-language identification experiment. The results show that the language identification accuracy for Asian-language document images (Chinese, Japanese, Korean) reaches 98.18%. Detailed results are given in Table 2: the accuracy for Chinese, Japanese, and Korean document images reaches 100.00%, 97.14%, and 97.14%, respectively.
Table 2: Language identification results for Asian-language document images
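The reported figures for this experiment are mutually consistent, which the short check below illustrates. It only uses numbers quoted in the text (image counts and per-language accuracies); the function name is illustrative, and the weighting by image count is an assumption about how the overall accuracy was computed.

```python
def weighted_accuracy(counts, accuracies):
    """Overall accuracy (%) as the image-count-weighted mean of per-class accuracies."""
    total = sum(counts.values())
    return sum(counts[k] * accuracies[k] for k in counts) / total


counts = {"Chinese": 40, "Japanese": 35, "Korean": 35}
accuracies = {"Chinese": 100.00, "Japanese": 97.14, "Korean": 97.14}

base_images = sum(counts.values())   # 110 deskewed source images
dataset_size = base_images * 2       # two noise types -> 220 images
overall = weighted_accuracy(counts, accuracies)
print(dataset_size, round(overall, 2))
```

Since each source image appears once per noise type, doubling the dataset leaves the class weights unchanged, and the weighted mean of the three per-language accuracies reproduces the reported overall value of 98.18%.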
2.2 Automatic language identification of Latin-language document images
The number of document images of each language in this experiment is shown in Table 3.
Table 3: Latin-language document image dataset
Gaussian noise (mean 0, variance 0.02) and salt-and-pepper noise (noise density 0.05) were added separately to all samples, giving a dataset of 25,614 images for the Latin-language identification experiment. Detailed results are shown in Fig. 4, from which it can be seen that the language identification accuracy for Latin-language document images reaches 98.18%.
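The two kinds of noise used to build the test sets can be sketched as below. The sketch assumes grayscale images stored as lists of rows with intensities in [0, 1], interprets the variance 0.02 on that scale, and splits the salt-and-pepper density evenly between black and white pixels; none of these conventions is stated in the text, and the function names are illustrative.

```python
import random


def add_gaussian_noise(img, mean=0.0, var=0.02, rng=None):
    """Add zero-mean Gaussian noise with the given variance and clip to [0, 1]."""
    rng = rng or random.Random()
    sigma = var ** 0.5
    return [[min(1.0, max(0.0, p + rng.gauss(mean, sigma))) for p in row]
            for row in img]


def add_salt_pepper(img, density=0.05, rng=None):
    """Replace roughly a `density` fraction of pixels, half with 0.0 (pepper)
    and half with 1.0 (salt). The input image is left unmodified."""
    rng = rng or random.Random()
    out = [row[:] for row in img]
    for row in out:
        for x in range(len(row)):
            r = rng.random()
            if r < density / 2:
                row[x] = 0.0
            elif r < density:
                row[x] = 1.0
    return out
```

Passing a seeded `random.Random` instance as `rng` makes the corrupted dataset reproducible across runs.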
It should be emphasized that the embodiments described here are illustrative rather than restrictive. The invention is therefore not limited to the embodiments given in the detailed description; any other embodiment derived by those skilled in the art from the technical solution of the invention also falls within the scope of protection of the invention.

Claims (8)

1. the automatic language method of discrimination of a multilingual inclination file and picture, it is characterised in that comprise the following steps:
Step 1, for gather file and picture, utilize Gabor filtering method to carry out the automatic discrimination of file and picture languages, file and picture be divided into Asia languages file and picture and Latin languages file and picture;
Step 2, for different language file and picture use corresponding slant correction algorithm, file and picture after being corrected, then on file and picture after calibration, key application word matching process carries out the automatic discrimination of file and picture language, thus realizing the language automatic discrimination function of file and picture.
2. the automatic language method of discrimination of a kind of multilingual inclination file and picture according to claim 1, it is characterised in that: the concrete methods of realizing of described step 1 comprises the following steps:
(1) use the method for mathematical morphology to carry out filter the file and picture gathered to make an uproar process;
(2) for there is the file and picture of inclination, therefrom choosing and being suitable for doing a number of character area that automatic language differentiates;
(3) each character area selected is done respectively Gabor filtering, and according to the Gabor characteristic extracted, application class device, the languages of each character area is carried out automatic discrimination;
(4) the automatic languages of each character area are differentiated that result is voted, take the votes maximum languages languages differentiation result as whole file and picture, thus the file and picture of input being divided into Asia languages file and picture and the big class of Latin languages file and picture two.
3. the automatic language method of discrimination of a kind of multilingual inclination file and picture according to claim 2, it is characterised in that: it is adopt corrosion and expansion algorithm to realize that the described step (1) file and picture to gathering uses the method for mathematical morphology to filter process of making an uproar.
4. The automatic language identification method for multilingual skew document images according to claim 2, characterized in that step (3) specifically comprises: first, generating Gabor images at different scales and in multiple orientations for each selected text-region image; then computing the Gabor magnitude images and downsampling them; finally, training a classifier on text-region training samples using the extracted Gabor features, and classifying each text-region image awaiting discrimination as either Asian-language or Latin-language.
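The feature-extraction part of this claim (multi-scale, multi-orientation Gabor filtering, magnitude, downsampling) can be sketched in plain Python as follows. All parameter values (kernel size 9, wavelengths 4 and 8, four orientations, `sigma = 0.5 * wavelength`, downsampling step 4) and the function names are assumptions chosen for illustration; the patent does not state them. The classifier itself is omitted; the resulting vector would be passed to whatever classifier was trained on labelled text regions.

```python
import cmath
import math

def gabor_kernel(size, wavelength, theta, sigma):
    """Complex Gabor kernel: Gaussian envelope times a complex sinusoid."""
    half = size // 2
    kernel = []
    for y in range(-half, half + 1):
        row = []
        for x in range(-half, half + 1):
            # Project onto direction theta, along which the wave travels.
            xr = x * math.cos(theta) + y * math.sin(theta)
            envelope = math.exp(-(x * x + y * y) / (2.0 * sigma * sigma))
            row.append(envelope * cmath.exp(2j * math.pi * xr / wavelength))
        kernel.append(row)
    return kernel

def magnitude_response(img, kernel):
    """Filter img over the valid region and return the magnitude image."""
    k = len(kernel)
    h, w = len(img), len(img[0])
    out = []
    for y in range(h - k + 1):
        row = []
        for x in range(w - k + 1):
            acc = 0j
            for j in range(k):
                for i in range(k):
                    acc += img[y + j][x + i] * kernel[j][i]
            row.append(abs(acc))
        out.append(row)
    return out

def downsample(img, step):
    """Keep every step-th pixel in both directions."""
    return [row[::step] for row in img[::step]]

def gabor_features(img, wavelengths=(4.0, 8.0), n_orientations=4, size=9):
    """Concatenate downsampled magnitude images over all scales and orientations."""
    feats = []
    for lam in wavelengths:
        for n in range(n_orientations):
            theta = n * math.pi / n_orientations
            kern = gabor_kernel(size, lam, theta, sigma=0.5 * lam)
            mag = downsample(magnitude_response(img, kern), step=4)
            feats.extend(v for row in mag for v in row)
    return feats

# Synthetic 16x16 text-region stand-in: vertical stripes with period 4.
stripes = [[1 if x % 4 < 2 else 0 for x in range(16)] for _ in range(16)]
features = gabor_features(stripes)
```

On the striped test image, the filter whose wave direction crosses the stripes responds far more strongly than the one aligned with them, which is the orientation selectivity the features rely on.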
5. The automatic language identification method for multilingual skew document images according to claim 1, characterized in that step 2 is implemented by the following steps:
(1) perform skew correction and character segmentation on the document image according to the automatically determined language family;
(2) from the character segmentation results, select a number of character-image blocks that best match character features;
(3) according to the automatically determined language family, use a classifier to recognize each segmented character image or word image, and automatically determine the language of each character image or word image from the recognition result;
(4) vote over the language results of the character images or word images, and take the language with the most votes as the language identification result for the whole document image.
6. The automatic language identification method for multilingual skew document images according to claim 5, characterized in that the character segmentation in step (1) is performed as follows: on the corrected document image, apply the character segmentation method suited to the language family to obtain the segmentation result: for an Asian-language document image, character segmentation yields multiple candidate characters; for a Latin-language document image, character segmentation yields multiple candidate words.
7. The automatic language identification method for multilingual skew document images according to claim 5, characterized in that step (2) specifically comprises: for an Asian-language document image, first compute a histogram of the candidate characters' heights and select the characters whose height is near the histogram peak, so as to filter out noise and reduce its effect on the discrimination result; then sort the selected characters in ascending order by the absolute value of the difference derived from their width-to-height ratio, and retain the leading characters for subsequent analysis; for a Latin-language document image, sort the candidate words in descending order of length and retain a number of leading words for subsequent analysis.
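The candidate selection described in this claim might look like the sketch below. Several details are assumptions not stated in the text: the histogram bin width (`bin_size`), the meaning of "near the peak" (the peak bin and its two neighbours), the number of items kept, and above all the reference value the aspect ratio is compared against, taken here as `target_ratio = 1.0` on the reasoning that CJK characters are roughly square.

```python
from collections import Counter

def select_asian_characters(boxes, keep=3, bin_size=4, target_ratio=1.0):
    """boxes: list of (width, height) pairs with positive heights."""
    if not boxes:
        return []
    # Histogram of heights; the peak bin is taken as the dominant character size.
    bins = Counter(h // bin_size for _, h in boxes)
    peak_bin, _ = bins.most_common(1)[0]
    # Keep the peak bin and its two neighbours to discard noise blobs.
    near_peak = [(w, h) for w, h in boxes if abs(h // bin_size - peak_bin) <= 1]
    # Assumption: rank by distance of the width-to-height ratio from target_ratio.
    near_peak.sort(key=lambda b: abs(b[0] / b[1] - target_ratio))
    return near_peak[:keep]

def select_latin_words(words, keep=3):
    """words: list of (identifier, pixel_length) pairs; longest words first."""
    return sorted(words, key=lambda w: w[1], reverse=True)[:keep]

# Hypothetical bounding boxes: four plausible characters, one speck, one large blob.
boxes = [(30, 32), (28, 31), (10, 30), (33, 33), (3, 4), (60, 90)]
chosen = select_asian_characters(boxes, keep=3)
```

Longer Latin words are preferred presumably because they carry more letters and are therefore less ambiguous when matched against a dictionary later on.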
8. The automatic language identification method for multilingual skew document images according to claim 5, characterized in that step (3) specifically comprises: for an Asian-language document image, the character images retained in step (2) are fed to an Asian character image classifier for character recognition, where the recognition result of each character is Chinese, Japanese or Korean, and a number of characters with the highest recognition confidence are retained for language voting; for a Latin-language document image, the words retained in step (2) are segmented into characters and recognized, and a number of words that are matched by a language dictionary and have the highest confidence are retained for language voting.
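A hedged sketch of the two voting branches follows. The recognizer outputs are represented as plain `(label, confidence)` pairs, the dictionaries are toy word sets, and the names `vote_asian`, `vote_latin` and `DICTIONARIES` are hypothetical. Letting a word that appears in several dictionaries vote for each of them is also an assumption.

```python
from collections import Counter

# Hypothetical toy dictionaries; a real system would load full word lists.
DICTIONARIES = {
    "english": {"the", "and", "document", "language"},
    "french": {"le", "et", "document", "langue"},
}

def vote_asian(recognized, keep=5):
    """recognized: list of (language, confidence) pairs, one per character."""
    top = sorted(recognized, key=lambda r: r[1], reverse=True)[:keep]
    votes = Counter(lang for lang, _ in top)
    return votes.most_common(1)[0][0] if votes else None

def vote_latin(recognized, keep=5):
    """recognized: list of (word, confidence) pairs from a word recognizer."""
    matched = []
    for word, conf in recognized:
        langs = [name for name, vocab in DICTIONARIES.items()
                 if word.lower() in vocab]
        if langs:  # words found in no dictionary are discarded
            matched.append((conf, langs))
    # Only the most confident dictionary-matched words take part in the vote.
    matched.sort(key=lambda m: m[0], reverse=True)
    votes = Counter()
    for _, langs in matched[:keep]:
        votes.update(langs)
    return votes.most_common(1)[0][0] if votes else None

latin_words = [("the", 0.9), ("document", 0.8), ("langue", 0.4),
               ("and", 0.7), ("xyz", 0.99)]
latin_result = vote_latin(latin_words)
asian_chars = [("chinese", 0.9), ("japanese", 0.8),
               ("chinese", 0.7), ("korean", 0.2)]
asian_result = vote_asian(asian_chars, keep=3)
```

Requiring a dictionary match before voting keeps high-confidence but meaningless recognitions, such as `xyz` above, from influencing the result.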
CN201610053497.2A 2016-01-27 2016-01-27 Automatic language identification method for multilingual skew document image Active CN105760901B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610053497.2A CN105760901B (en) 2016-01-27 2016-01-27 Automatic language identification method for multilingual skew document image

Publications (2)

Publication Number Publication Date
CN105760901A (en) 2016-07-13
CN105760901B (en) 2019-01-04

Family

ID=56342625

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610053497.2A Active CN105760901B (en) 2016-01-27 2016-01-27 Automatic language identification method for multilingual skew document image

Country Status (1)

Country Link
CN (1) CN105760901B (en)

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130022272A1 (en) * 2011-07-20 2013-01-24 Fujitsu Limited Method of and device for identifying direction of characters in image block

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
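Hou Yueyun (侯跃云) et al.: "Language identification techniques for text images" (文本图像语种识别技术), Journal of Computer Applications (《计算机应用》) *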

Cited By (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106598937B (en) * 2015-10-16 2019-10-18 阿里巴巴集团控股有限公司 Language Identification, device and electronic equipment for text
CN106598937A (en) * 2015-10-16 2017-04-26 阿里巴巴集团控股有限公司 Language recognition method and device for text and electronic equipment
CN107256378A (en) * 2017-04-24 2017-10-17 北京航空航天大学 Language Identification and device
CN107346428A (en) * 2017-05-24 2017-11-14 上海视马艾智能科技有限公司 A kind of IC face characters recognition methods and device
CN110032996B (en) * 2018-01-11 2021-06-04 台达电子工业股份有限公司 Character inclination correcting device and method based on classification
CN110032996A (en) * 2018-01-11 2019-07-19 台达电子工业股份有限公司 The character skewness correction and segmentation devices and methods therefor of basis of classification formula
CN109409356A (en) * 2018-08-23 2019-03-01 浙江理工大学 A kind of multi-direction Chinese print hand writing detection method based on SWT
CN109741377A (en) * 2018-11-30 2019-05-10 四川译讯信息科技有限公司 A kind of image difference detection method
CN111339787A (en) * 2018-12-17 2020-06-26 北京嘀嘀无限科技发展有限公司 Language identification method and device, electronic equipment and storage medium
CN111339787B (en) * 2018-12-17 2023-09-19 北京嘀嘀无限科技发展有限公司 Language identification method and device, electronic equipment and storage medium
CN111027528A (en) * 2019-11-22 2020-04-17 华为技术有限公司 Language identification method and device, terminal equipment and computer readable storage medium
WO2021098490A1 (en) * 2019-11-22 2021-05-27 华为技术有限公司 Language recognition method and apparatus, terminal device, and computer-readable storage medium
CN111027528B (en) * 2019-11-22 2023-10-03 华为技术有限公司 Language identification method, device, terminal equipment and computer readable storage medium
CN111046784A (en) * 2019-12-09 2020-04-21 科大讯飞股份有限公司 Document layout analysis and identification method and device, electronic equipment and storage medium
CN111046784B (en) * 2019-12-09 2024-02-20 科大讯飞股份有限公司 Document layout analysis and identification method and device, electronic equipment and storage medium
WO2023045721A1 (en) * 2021-09-27 2023-03-30 北京有竹居网络技术有限公司 Image language identification method and related device thereof

Also Published As

Publication number Publication date
CN105760901B (en) 2019-01-04

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant