CN105574513A - Character detection method and device - Google Patents

Character detection method and device

Info

Publication number
CN105574513A
CN105574513A CN201510970839.2A
Authority
CN
China
Prior art keywords
image
character area
detected
sample image
probability graph
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201510970839.2A
Other languages
Chinese (zh)
Other versions
CN105574513B (en)
Inventor
姚聪 (Yao Cong)
周舒畅 (Zhou Shuchang)
周昕宇 (Zhou Xinyu)
印奇 (Yin Qi)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Megvii Technology Co Ltd
Beijing Aperture Science and Technology Ltd
Original Assignee
Beijing Megvii Technology Co Ltd
Beijing Aperture Science and Technology Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Megvii Technology Co Ltd, Beijing Aperture Science and Technology Ltd filed Critical Beijing Megvii Technology Co Ltd
Priority to CN201510970839.2A priority Critical patent/CN105574513B/en
Publication of CN105574513A publication Critical patent/CN105574513A/en
Application granted granted Critical
Publication of CN105574513B publication Critical patent/CN105574513B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/20 Scenes; Scene-specific elements in augmented reality scenes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/20 Image preprocessing
    • G06V 10/26 Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G06V 10/267 Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion by performing operations on regions, e.g. growing, shrinking or watersheds
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/60 Type of objects
    • G06V 20/62 Text, e.g. of license plates, overlay texts or captions on TV images
    • G06V 20/63 Scene text, e.g. street names
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V 30/10 Character recognition
    • G06V 30/26 Techniques for post-processing, e.g. correcting the recognition result
    • G06V 30/262 Techniques for post-processing, e.g. correcting the recognition result using context analysis, e.g. lexical, syntactic or semantic context
    • G06V 30/274 Syntactic or semantic context, e.g. balancing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V 30/10 Character recognition

Abstract

The invention discloses a character detection method and device. The character detection method comprises: receiving an image to be detected; generating, via a semantic prediction model, a character-area probability map of the full image to be detected, wherein the probability map uses different pixel values to distinguish the character areas of the image from its non-character areas; and performing a segmentation operation on the probability map to determine the character areas. The disclosed method can detect characters of various languages, orientations, colors, fonts and sizes while effectively suppressing interference from complex backgrounds. In addition, the character detection method and device are highly robust and effectively resist interference from image noise, image blur, complex image backgrounds and non-uniform illumination.

Description

Character detection method and device
Technical field
The present invention relates to the field of image processing, and in particular to a character detection method and device.
Background technology
With the widespread adoption of smartphones and the rapid development of the mobile Internet, acquiring, retrieving and sharing information through the cameras of mobile terminals such as mobile phones has gradually become a way of life. Camera-based applications increasingly emphasize understanding of the photographed scene. In a scene where text coexists with other objects, the user usually pays more attention to the textual information in the scene, so correctly recognizing the text in an image gives a deeper understanding of the user's intent in taking the photograph. This involves text detection technology to identify the character areas in the captured image.
As an important basic technology, text detection, particularly text detection in natural-scene images, has great practical value and broad application prospects. For example, text detection for natural-scene images can be applied directly to fields such as augmented reality, geo-location, human-computer interaction, robot navigation, autonomous vehicles and industrial automation.
However, most images to be detected contain relatively complex backgrounds, and their quality may be affected by factors such as noise, blur and non-uniform illumination. In addition, text is diverse: for example, the text in a natural-scene image may have different colors, sizes, fonts and orientations. All of these factors pose great difficulties and challenges for text detection. For these reasons, existing character detection methods readily produce false alarms, that is, non-text components of the background are mistakenly determined to be text. Existing methods are also deficient in adaptability; for example, most methods can only detect horizontal text and are helpless against tilted or rotated text. As another example, some methods can only be applied to Chinese and cannot be directly generalized to other languages (such as English, Russian or Korean). Moreover, when an image suffers from severe noise, blur or non-uniform illumination, existing methods often make mistakes. In short, existing character detection methods and systems are defective in both accuracy and scope of application.
Summary of the invention
In view of the above problems, the present invention is proposed to provide a character detection method and device that at least partially solve the above problems.
According to one aspect of the invention, a character detection method is provided, comprising:
receiving an image to be detected; generating, via a semantic prediction model, a character-area probability map of the full image to be detected, wherein the probability map uses different pixel values to distinguish the character areas of the image to be detected from its non-character areas; and
performing a segmentation operation on the character-area probability map to determine the character areas.
According to another aspect of the invention, a text detection device is also provided, comprising a semantic analysis module and a segmentation module. The semantic analysis module is configured to receive an image to be detected and to use a semantic prediction model to generate a character-area probability map of the full image, wherein the probability map uses different pixel values to distinguish the character areas of the image to be detected from its non-character areas. The segmentation module is configured to perform a segmentation operation on the probability map to determine the character areas.
The above character detection method and device support performing text detection directly on the full image to be detected, unlike algorithms based on simple threshold segmentation, sliding windows or connected components. They can detect text of different languages, orientations, colors, fonts and sizes while effectively suppressing interference from complex backgrounds, and thus have a wide range of application. In addition, the character detection method and device are highly robust and can cope with interference from factors such as image noise, image blur, complex image backgrounds and non-uniform illumination.
The above description is only an overview of the technical solution of the present invention. In order that the technical means of the present invention may be understood more clearly and implemented according to the content of the specification, and in order that the above and other objects, features and advantages of the present invention may become more apparent, specific embodiments of the present invention are set forth below.
Accompanying drawing explanation
By reading the following detailed description of the preferred embodiments, various other advantages and benefits will become clear to those of ordinary skill in the art. The accompanying drawings are only for the purpose of illustrating the preferred embodiments and are not to be regarded as limiting the invention. Throughout the drawings, identical parts are denoted by identical reference symbols. In the drawings:
Fig. 1a and Fig. 1b schematically illustrate an image to be detected and the detected image, respectively, according to an embodiment of the invention;
Fig. 2 schematically illustrates a flowchart of a character detection method according to an embodiment of the invention;
Fig. 3a and Fig. 3b, Fig. 4a and Fig. 4b, Fig. 5a and Fig. 5b, and Fig. 6a and Fig. 6b schematically illustrate full images to be detected and the corresponding generated character-area probability maps, respectively, according to embodiments of the invention;
Fig. 7 schematically illustrates a flowchart of a method for obtaining the image to be detected according to an embodiment of the invention;
Fig. 8 schematically illustrates a flowchart of a method for performing a segmentation operation on the character-area probability map according to an embodiment of the invention;
Fig. 9 schematically illustrates a flowchart of a method for training a neural network according to an embodiment of the invention;
Figure 10a, Figure 10b, Figure 10c and Figure 10d respectively illustrate sample images with annotation information according to an embodiment of the invention;
Figure 11a and Figure 11b respectively illustrate a sample image with annotation information and its corresponding mask map according to an embodiment of the invention;
Figure 12 schematically illustrates a schematic diagram of a fully convolutional neural network according to an embodiment of the invention;
Figure 13 schematically illustrates a schematic block diagram of a text detection device according to an embodiment of the invention;
Figure 14 schematically illustrates a schematic block diagram of a text detection device according to another embodiment of the invention; and
Figure 15 schematically illustrates a schematic block diagram of a text detection system according to an embodiment of the invention.
Embodiment
Exemplary embodiments of the present disclosure are described in more detail below with reference to the accompanying drawings. Although the drawings show exemplary embodiments of the present disclosure, it should be understood that the present disclosure can be implemented in various forms and should not be limited by the embodiments set forth here. Rather, these embodiments are provided so that the present disclosure will be understood more thoroughly and its scope can be fully conveyed to those skilled in the art.
In order to recognize the character areas in an image automatically and more reasonably, the present invention provides a character detection method. Fig. 1a and Fig. 1b schematically illustrate an image to be detected and the detected image, respectively, according to an embodiment of the invention. Fig. 2 shows a flowchart of a character detection method 200 according to an embodiment of the invention. As shown in Fig. 2, the method 200 comprises steps S210 to S230.
In step S210, an image to be detected is received. The image to be detected may be an original image, or an image obtained after preprocessing an original image. In one embodiment of the invention, the image to be detected may be obtained by preprocessing a collected original image. The preprocessing method is described in detail below with reference to the specific drawings.
In step S220, a character-area probability map of the full image to be detected is generated via a semantic prediction model, wherein the probability map uses different pixel values to distinguish the character areas of the image to be detected from its non-character areas. According to one embodiment of the invention, a character area refers to a region of an image that contains text. Taking Fig. 1a and Fig. 1b as an example, the regions inside the two black quadrilaterals in Fig. 1b are character areas: the first character area contains the text "I am growing", and the second contains the text "please do not step on me".
In one embodiment, the character-area probability map uses different pixel values to represent different probabilities, so as to distinguish the character areas of the image to be detected from its non-character areas. In one embodiment, a higher pixel value indicates a higher probability that the pixel belongs to a character area, and a lower pixel value indicates a lower probability. For example, a black pixel with value 0 indicates that the probability that the pixel belongs to a character area is 0, and a white pixel with value 255 indicates that the probability is 100%.
According to one embodiment of the invention, the character-area probability map is generated from the full image to be detected via a semantic prediction model. The semantic prediction model generates the probability map according to the semantics of the image to be detected, so as to predict whether each pixel of the image belongs to a character area or a non-character area. Image semantics are high-level image features; although they build on low-level features such as color, texture and shape, they differ significantly from them. As a basic carrier of knowledge information, image semantics convert complete image content into an intuitively understandable, text-like representation, and play a vital role in image understanding. Image understanding takes image data as input and outputs knowledge; it belongs to the high-level content of image research. The semantic prediction model realizes image understanding: it recognizes character areas directly from image semantics, which is significantly different from models based on threshold segmentation of the raw image. Based on its understanding of the image to be detected, the semantic prediction model generates a better character-area probability map according to the image's semantics, predicting which pixels belong to character areas and which do not, and thereby yielding more reasonable character areas.
The semantic prediction model can be obtained by training a neural network. A neural network can estimate an otherwise unknown function from a large number of inputs; it is capable of machine learning and is highly adaptive. A trained neural network can approximate an arbitrary function and can "learn" from given data. A trained neural network is therefore well suited to serve as the semantic prediction model that identifies the character areas in the image to be detected. Training a neural network to obtain the semantic prediction model is described in detail below with reference to Fig. 9 to Figure 12.
Fig. 3a and Fig. 3b, Fig. 4a and Fig. 4b, Fig. 5a and Fig. 5b, and Fig. 6a and Fig. 6b show full images to be detected and the corresponding character-area probability maps generated via the semantic prediction model, respectively, according to embodiments of the invention. Fig. 3a, Fig. 4a, Fig. 5a and Fig. 6a are full images to be detected, each containing character areas: the character area in Fig. 3a contains Chinese; the character areas in Fig. 4a contain both Chinese and English and, as shown in Fig. 4a, are not horizontal; the character area in Fig. 5a contains Russian; and the character area in Fig. 6a contains Korean. It can further be seen that the images of Fig. 3a, Fig. 4a, Fig. 5a and Fig. 6a have different and relatively complex backgrounds, and that the text in these images is diverse, with different colors, fonts, languages and sizes. Fig. 3b, Fig. 4b, Fig. 5b and Fig. 6b show the character-area probability maps generated by the semantic prediction model from the full images of Fig. 3a, Fig. 4a, Fig. 5a and Fig. 6a, respectively. The generated probability map uses different pixel values to represent different probabilities, distinguishing the character areas of the image to be detected from its non-character areas. For example, the character areas are filled with pixels of value 255, indicating the highest probability of belonging to a character area, while the non-character regions (for example, the background) are filled with pixels of value 0, indicating the lowest probability. Taking the probability map of Fig. 4b as an example, different pixel values distinguish the character areas of the image of Fig. 4a from its non-character regions: the two character areas, "No unauthorized admittance" and "Authorized Personnel Only", are filled with pixels of value 255, yielding the probability map shown in Fig. 4b, which also shows the orientation of the character areas completely and accurately.
In step S230, a segmentation operation is performed on the character-area probability map generated in step S220 to determine the character areas. Because the pixel values of the probability map represent the probability that each pixel belongs to a character area, thereby distinguishing character areas from non-character areas, the probability map can be segmented according to low-level features (such as grayscale).
For example, step S230 may obtain the character areas by performing a binarization operation on the probability map. In the present invention, since the goal is to distinguish character areas from non-character (background) areas, a binarization operation achieves this goal; it is simple to implement, computationally cheap and fast.
The binarization operation may be a threshold segmentation operation. Optionally, the threshold T is an adjustable parameter. If a gray value of 255 indicates a 100% probability of belonging to a character area and a gray value of 0 indicates a probability of 0, the threshold may be set to 128.
The binarization operation may also be a segmentation operation based on region growing. Region growing aggregates pixels according to the similarity of pixels within the same object region. Specifically, starting from an initial region (for example, pixels with larger values in the probability map), adjacent pixels with similar properties (whose values differ only slightly from that of the current pixel) are merged into the current region, gradually growing the region until no more pixels can be merged.
Regions of the segmented image with a small average pixel value can be regarded as non-character areas, and the other regions as character areas. Determining the character areas by binarization is described in detail below with reference to the specific drawings.
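As a minimal sketch of the threshold-segmentation variant of the binarization operation, the following Python snippet binarizes a small probability map using the 0-255 convention and the midpoint threshold of 128 mentioned above (the function name and the toy values are illustrative assumptions, not part of the patent):

```python
def binarize(prob_map, threshold=128):
    """Threshold segmentation of a character-area probability map.

    prob_map is a 2-D list of values in [0, 255], where 255 means the pixel
    almost certainly lies in a character area and 0 means it almost certainly
    does not. threshold is the adjustable parameter T; 128 is the midpoint
    suggested in the text."""
    return [[255 if v >= threshold else 0 for v in row] for row in prob_map]

prob = [[  0,  40, 200, 255],
        [ 10, 130, 250, 240],
        [  5,  20,  60,  90]]
print(binarize(prob))
# [[0, 0, 255, 255], [0, 255, 255, 255], [0, 0, 0, 0]]
```

After binarization, pixels with value 255 form the candidate character regions and pixels with value 0 form the background, matching the two-class output the segmentation step requires.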
Those of ordinary skill in the art will appreciate that the above method 200 is general: it can be used for text detection in any image. The method 200 can perform text detection and recognition on document images, such as photographs of certificates and bills or scanned copies of paper documents. The method 200 can also perform text detection and recognition on natural-scene images.
The above method 200 abandons detection based on sliding windows and detection based on connected components in favor of a new detection approach based on semantic segmentation. The method 200 performs full-image prediction: both the input and the output are entire images rather than local regions or windows, so it can make better use of the contextual information in the image, particularly in natural-scene images, and thereby obtain more accurate text detection results.
The method 200 can process images of different scenes and different quality. While effectively suppressing interference from complex backgrounds, it can detect text of different colors, fonts and sizes. It can automatically predict the orientation of text lines and directly detect text of different orientations in an image. It is insensitive to the language of the text and can simultaneously detect text of different languages (such as Chinese, English and Korean). In addition, the method 200 is highly robust and can cope with interference from noise, blur, complex backgrounds, non-uniform illumination and other factors.
Fig. 7 shows a flowchart of a method for obtaining the image to be detected according to an embodiment of the invention.
In step S710, an original image is received. In one embodiment, the original image may have complex background information, and the character areas it contains may be diverse; for example, a character area may contain textual information of different colors, fonts, languages and sizes.
In step S720, the received original image is preprocessed to obtain the image to be detected. In one embodiment, the received original image may be size-normalized by scaling its largest dimension (that is, the greater of its height and width) to a preset size, which may be, for example, 480, 640, 800 or 960 pixels. The aspect ratio of the image to be detected obtained after the size normalization remains the same as that of the original image.
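The size normalization just described can be sketched as follows. The function name, the default preset of 640 and rounding to the nearest pixel are assumptions for illustration; the text only specifies that the larger dimension is scaled to the preset size while the aspect ratio is preserved:

```python
def normalize_size(width, height, preset=640):
    """Scale an image's dimensions so that the larger one equals `preset`
    (one of the preset sizes such as 480, 640, 800 or 960 pixels) while
    keeping the original aspect ratio. Returns the new (width, height)."""
    scale = preset / max(width, height)
    return round(width * scale), round(height * scale)

print(normalize_size(1920, 1080))  # (640, 360)
```

Note that 640/360 preserves the original 16:9 aspect ratio, as required by the preprocessing step.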
Fig. 8 schematically illustrates a flowchart of a method for performing a segmentation operation on the character-area probability map according to an embodiment of the invention.
In step S810, a binarization operation is performed on the character-area probability map of the image to be detected.
It will be appreciated that the character areas can be obtained directly from the result of the binarization operation. In the present invention, since the goal is to distinguish character areas from non-character (background) areas, a binarization operation achieves this goal; it is simple to implement, computationally cheap and fast.
The binarization operation may be a threshold segmentation operation. Optionally, the threshold T is an adjustable parameter. If a gray value of 255 indicates a 100% probability of belonging to a character area and a gray value of 0 indicates a probability of 0, the threshold may be set to 128.
The binarization operation may also be a segmentation operation based on region growing. Region growing aggregates pixels according to the similarity of pixels within the same object region. Specifically, starting from an initial region (for example, pixels with larger values in the probability map), adjacent pixels with similar properties (whose values differ only slightly from that of the current pixel) are merged into the current region, gradually growing the region until no more pixels can be merged.
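The region-growing variant can be sketched in Python as follows. The breadth-first traversal, 4-connectivity and tolerance of 30 are illustrative assumptions; the patent leaves these choices open:

```python
from collections import deque

def region_grow(img, seed, tol=30):
    """Grow a region from `seed` over 4-connected neighbors whose pixel
    values differ from the current pixel's value by at most `tol` (the
    'similar property' criterion described above). Returns the set of
    (row, col) positions merged into the region."""
    h, w = len(img), len(img[0])
    region = {seed}
    queue = deque([seed])
    while queue:
        r, c = queue.popleft()
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if (0 <= nr < h and 0 <= nc < w and (nr, nc) not in region
                    and abs(img[nr][nc] - img[r][c]) <= tol):
                region.add((nr, nc))   # merge the similar neighbor
                queue.append((nr, nc))
    return region

probs = [[250, 240,  10],
         [230,  20,   5],
         [ 15,  10,   0]]
print(sorted(region_grow(probs, (0, 0))))  # [(0, 0), (0, 1), (1, 0)]
```

Starting from the high-probability pixel at (0, 0), growth stops at the sharp drop in probability, so the grown region corresponds to the candidate character area.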
In the embodiment shown in Fig. 8, step S820 and step S830 follow the binarization operation.
In step S820, the contour of each connected region obtained by the binarization operation is determined. This step can be implemented with any existing or future edge detection method, for example methods based on operators such as Sobel or Canny.
In step S830, the contour of each connected region is fitted with a quadrilateral to determine the character areas. In one embodiment, the interior of every quadrilateral serves as a character area. Specifically, let B be the set of all quadrilaterals, B = {b_k | k = 1, 2, ..., Q}, where b_k is a quadrilateral obtained by fitting, Q is the number of quadrilaterals and k is an index. The set B is then output as the text detection result.
The region enclosed by a quadrilateral can well cover text of any orientation and language, and is simple to compute. For example, as in the probability map of Fig. 6b, noise in the image, the shapes of the characters and other factors may prevent the probability map from ideally representing the probability that each pixel belongs to a character area. Fitting the character areas with quadrilateral regions further ensures that the entire text content is contained within the character areas, thereby ensuring the precision of text detection.
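Steps S820 and S830 can be sketched together as follows. This is a simplified pure-Python illustration: connected regions of the binarized map are found by flood fill, and an axis-aligned bounding box stands in for the general fitted quadrilateral b_k (a real implementation would fit rotated quadrilaterals so that tilted text lines are covered tightly; all names are assumptions):

```python
def component_quads(binary):
    """Find 4-connected components of foreground (255) pixels in a
    binarized probability map and fit each with a quadrilateral, here
    simplified to the four corners of the axis-aligned bounding box."""
    h, w = len(binary), len(binary[0])
    seen = [[False] * w for _ in range(h)]
    quads = []
    for r in range(h):
        for c in range(w):
            if binary[r][c] == 255 and not seen[r][c]:
                stack, comp = [(r, c)], []
                seen[r][c] = True
                while stack:                      # flood fill one component
                    cr, cc = stack.pop()
                    comp.append((cr, cc))
                    for nr, nc in ((cr+1, cc), (cr-1, cc), (cr, cc+1), (cr, cc-1)):
                        if 0 <= nr < h and 0 <= nc < w \
                                and binary[nr][nc] == 255 and not seen[nr][nc]:
                            seen[nr][nc] = True
                            stack.append((nr, nc))
                rs = [p[0] for p in comp]
                cs = [p[1] for p in comp]
                r0, r1, c0, c1 = min(rs), max(rs), min(cs), max(cs)
                # four corner vertices, clockwise from top-left
                quads.append([(r0, c0), (r0, c1), (r1, c1), (r1, c0)])
    return quads

img = [[255, 255, 0,   0],
       [255, 255, 0, 255],
       [  0,   0, 0, 255]]
print(component_quads(img))
# [[(0, 0), (0, 1), (1, 1), (1, 0)], [(1, 3), (1, 3), (2, 3), (2, 3)]]
```

The returned list of quadrilaterals plays the role of the set B that is output as the text detection result.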
Fig. 9 schematically illustrates a flowchart of a method for training a neural network to obtain the semantic prediction model according to an embodiment of the invention. The object of the method is to learn a semantic prediction model from sample images, such that the model can effectively distinguish the character areas of an image to be detected from its non-character areas.
A sample image is an image whose character areas are known. As mentioned above, a neural network has the ability to "learn", and a usable semantic prediction model can be obtained by training a neural network on multiple sample images. In this embodiment, this training method enables the semantic prediction model to generate a more accurate character-area probability map from the semantics of the image to be detected, predicting whether each pixel of the image belongs to a character area or a non-character area, and thereby making the detection results of the character detection method more accurate.
Those of ordinary skill in the art will appreciate that, for a text detection system, this semantic prediction model can be pre-stored within the system.
In step S910, multiple sample images and their annotation information are received.
In one embodiment, a large variety of images containing text, such as natural-scene images, can be collected from different sources as sample images. The sample images are expected to be numerous and of many kinds, so that an ideal semantic prediction model can be obtained. In one embodiment, the number of sample images is no fewer than 1000.
Polygons can be used to annotate all the character areas in each sample image, thereby obtaining the annotation information of the sample image. The annotated text unit may be a text line or a word. The annotation information of the character areas in a sample image can be stored in the form of polygons (for example, quadrilaterals); in particular, in one embodiment, only the coordinates of the four vertices of each quadrilateral need be stored. Storing the annotation information as quadrilaterals not only accommodates text of any orientation and language, but is also convenient for computation.
Figure 10a, Figure 10b, Figure 10c and Figure 10d respectively illustrate annotated sample images with annotation information according to an embodiment of the invention. As shown in these figures, the character areas in a sample image can be annotated with quadrilaterals (the light quadrilaterals in the figures), and such annotated regions are applicable to any font, language and text orientation.
In step S920, a mask map of each sample image is generated from the sample image and its annotation information. Specifically, for a sample image I and its corresponding annotation information, a mask map of the same size as I is generated. In one embodiment, the mask map may be a binary mask map R, in which different pixel values distinguish the character areas of the sample image from its non-character areas. In one embodiment, for a sample image I, the character areas marked by the annotation information are filled with pixels having a first pixel value, and the non-character regions are filled with pixels having a second pixel value, thereby generating the binary mask map R, where the first and second pixel values differ so as to distinguish the character areas from the non-character areas. For example, in the binary mask map R, the annotated character areas (that is, the interiors of the annotation quadrilaterals) are filled with the pixel value 255, and the non-character areas are filled with the pixel value 0.
Figure 11a and Figure 11b respectively illustrate an annotated sample image with annotation information and its corresponding mask map according to an embodiment of the invention. As shown in Figure 11a, quadrilaterals mark out the text portions of the original sample image (for example, "Haidian construction security", "Haidian Middle St", "HAIDIANZHONGJIE" and "Haidian South Road"), and the mask map shown in Figure 11b is generated accordingly: the marked text portions are filled with pixels of value 255 and the non-text portions with pixels of value 0, yielding the mask map shown in Figure 11b.
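The mask-map construction of step S920 can be sketched as follows. A simple even-odd scanline test rasterizes an annotation quadrilateral, filling the character area with 255 and the background with 0 as described above. The function names, the pixel-center sampling and the toy quadrilateral are illustrative assumptions:

```python
def _inside(x, y, poly):
    """Even-odd ray-casting test for a point against a polygon given as a
    list of (x, y) vertices."""
    inside = False
    n = len(poly)
    for i in range(n):
        x1, y1 = poly[i]
        x2, y2 = poly[(i + 1) % n]
        if (y1 > y) != (y2 > y):               # edge crosses the scanline
            xin = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < xin:
                inside = not inside
    return inside

def rasterize_quad(width, height, quad):
    """Generate a binary mask of the given size: pixels whose centers fall
    inside the annotation quadrilateral get value 255, the rest get 0."""
    return [[255 if _inside(c + 0.5, r + 0.5, quad) else 0
             for c in range(width)] for r in range(height)]

# annotation quadrilateral stored as its four vertex coordinates
mask = rasterize_quad(4, 3, [(0, 0), (3, 0), (3, 2), (0, 2)])
print(mask)
# [[255, 255, 255, 0], [255, 255, 255, 0], [0, 0, 0, 0]]
```

In a full pipeline, every annotation quadrilateral of a sample image would be rasterized into the same mask, producing the binary mask map R paired with the image for training.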
In step S930, a training set is built from the sample images and their mask maps, and a neural network is trained to obtain the semantic prediction model M. The original sample images and their corresponding mask maps form the training sample set S = {(I_i, R_i)}, i = 1, 2, …, N, where I_i denotes an original sample image, R_i is the mask map corresponding to I_i, N is the number of sample images in S, and i is an index.
In one embodiment, the neural network may comprise a fully convolutional network. A fully convolutional network is a special class of neural network whose characteristic is that every layer with learnable parameters, from input to output, is a convolutional layer. A fully convolutional network avoids complex early-stage preprocessing of the image and can take the original image directly as input; it is especially suited to analyzing images with complex backgrounds and can make the text detection results more accurate.
According to a specific embodiment of the invention, a fully convolutional network composed of 13 layers may be adopted. Figure 12 shows a schematic diagram of this network.
In addition to convolutional layers, this fully convolutional network also includes max-pooling layers. The max-pooling layers separate runs of consecutive convolutional layers; they effectively reduce the amount of computation while strengthening the robustness of the network.
The input to this fully convolutional network is raw image data. As shown in Figure 12, the network comprises a first and a second convolutional layer, in which the number of filters may be 64 and the filter size may be 3x3. The second convolutional layer is followed by a first max-pooling layer. Next come a third and a fourth convolutional layer, in which the number of filters may be 128 and the filter size may be 3x3; the fourth convolutional layer is followed by a second max-pooling layer. Next come a fifth, sixth, and seventh convolutional layer, with 256 filters of size 3x3; the seventh convolutional layer is followed by a third max-pooling layer. Next come an eighth, ninth, and tenth convolutional layer, in which the number of filters may be 512 and the filter size may be 3x3; the tenth convolutional layer is followed by a fourth max-pooling layer. Finally come an eleventh, twelfth, and thirteenth convolutional layer, in which the number of filters may be 512 and the filter size may be 3x3.
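The 13-layer stack above can be written down as a simple layer specification. The sketch below also computes the spatial size after the four pooling stages; the 2x2 pooling window with stride 2 is an assumption, since the text only says "max-pooling layer".

```python
# The 13 convolutional layers and 4 max-pooling layers described above,
# as a spec list: ("conv", filter_count) or ("pool", None).

layers = [
    ("conv", 64), ("conv", 64), ("pool", None),
    ("conv", 128), ("conv", 128), ("pool", None),
    ("conv", 256), ("conv", 256), ("conv", 256), ("pool", None),
    ("conv", 512), ("conv", 512), ("conv", 512), ("pool", None),
    ("conv", 512), ("conv", 512), ("conv", 512),
]

def spatial_size(input_size, layers):
    """3x3 convs with padding keep the size; each assumed 2x2 pool halves it."""
    size = input_size
    for kind, _ in layers:
        if kind == "pool":
            size //= 2
    return size

n_conv = sum(1 for kind, _ in layers if kind == "conv")
out = spatial_size(640, layers)  # e.g. a 640-pixel input side
```

With four pooling stages, the feature map is 1/16 of the input side length, which is why the model's output must later be interpreted as a probability map over the full image.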
During training, the sample images and their corresponding mask maps are input to the fully convolutional network. The initial learning rate may be 0.00000001, and after every 10,000 iterations the learning rate is reduced to 1/10 of its previous value. After 100,000 iterations the training process may stop; the fully convolutional network obtained when training stops is the desired semantic prediction model. Via the trained semantic prediction model, a character-area probability map of the full image to be detected can be generated from the semantics of the image, thereby predicting the character areas in the image to be detected.
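The staircase learning-rate schedule described above (start at 1e-8, divide by 10 every 10,000 iterations, stop after 100,000) can be sketched as a small function; the function name is illustrative.

```python
# Learning rate in effect at a given 0-based iteration, per the schedule
# in the text: base 1e-8, divided by 10 every 10,000 iterations.

def learning_rate(iteration, base_lr=1e-8, drop_every=10_000, factor=10.0):
    """Staircase decay: base_lr / factor**(number of completed drops)."""
    return base_lr / factor ** (iteration // drop_every)

TOTAL_ITERATIONS = 100_000  # training stops here
schedule = [learning_rate(i) for i in (0, 9_999, 10_000, 99_999)]
```

By the final iteration the rate has been divided by 10 nine times, so the last stage trains at 1e-17 before the process stops.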
Those of ordinary skill in the art will appreciate that, although a 13-layer fully convolutional network is illustrated above, the number of layers of the fully convolutional network may be any number between 6 and 19 inclusive; this range balances accuracy of results against the amount of computation. Likewise, the filter counts and sizes given above are merely examples and not limiting: the number of filters could also be 100, 500, or 1000, and the filter size could also be 1x1 or 5x5.
According to a further aspect of the invention, a text detection device is also provided. Figure 13 shows a schematic block diagram of a text detection device 1300 according to an embodiment of the invention. As shown in Figure 13, the text detection device 1300 comprises a semantic analysis module 1330 and a segmentation module 1340. According to one embodiment of the invention, the semantic analysis module 1330 further comprises a semantic prediction model 1350.
The semantic analysis module 1330 receives the image to be detected and uses the semantic prediction model 1350 to generate a character-area probability map of the full image to be detected. The semantic prediction model generates the probability map from the semantics of the image to be detected, so as to predict whether each pixel in the image belongs to a character area or a non-text region. The probability map uses different pixel values to represent different probabilities, distinguishing the character areas of the image to be detected from its non-text regions.
In one embodiment, the image to be detected may be an original image, or an image obtained by preprocessing an original image.
In one embodiment, the semantic prediction model 1350 may be obtained by training a neural network. Training the neural network to obtain the semantic prediction model 1350 is described in detail below in connection with Figure 14.
The character-area probability map is described with reference to Figures 3a and 3b, 4a and 4b, 5a and 5b, and 6a and 6b. Each pair respectively shows, according to an embodiment of the invention, a full image to be detected and the corresponding character-area probability map generated via the semantic prediction model 1350. Figures 3a, 4a, 5a, and 6a may be full images to be detected that contain character areas; Figures 3b, 4b, 5b, and 6b show the probability maps generated after these images pass through the semantic prediction model 1350. A generated probability map uses different pixel values to represent different probabilities, so as to distinguish the character areas of the image to be detected from its non-text regions. For example, character areas are filled with pixel value 255, indicating the highest probability of belonging to a character area, while non-text regions (e.g., background) are filled with pixel value 0, indicating the lowest probability. Taking the probability map of Figure 4b as an example, different pixel values distinguish the character areas of the image in Figure 4a from its non-text regions: the two character areas in Figure 4a, "unauthorized No Admittance" and "AuthorizedPersonnelOnly", are filled with pixel value 255, yielding the probability map shown in Figure 4b. Moreover, the probability map of Figure 4b completely and accurately reflects the orientation of the character areas in the original image of Figure 4a.
The segmentation module 1340 performs a segmentation operation on the character-area probability map to determine the character areas. Because the value of a pixel in the probability map represents the probability that the pixel belongs to a character area, the probability map can be segmented according to low-level features (e.g., image gray level).
For example, the segmentation module 1340 may obtain the character areas by performing a binarization operation on the character-area probability map. Since the goal in the present invention is to distinguish character areas from non-text (background) regions, a binarization operation achieves this purpose; it is simple to implement, computationally cheap, and fast.
The binarization operation may be a thresholding operation. Optionally, the threshold T is an adjustable parameter. If gray value 255 represents a probability of 100% of belonging to a character area and gray value 0 represents a probability of 0, the threshold may be set to 128.
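The thresholding operation above can be sketched in a few lines. T = 128 follows the example in the text; the function name is an assumption.

```python
# Threshold binarization: pixels at or above T become character-area
# candidates (255), the rest become background (0).

def binarize(prob_map, threshold=128):
    """Threshold a grayscale probability map into a binary character mask."""
    return [[255 if v >= threshold else 0 for v in row] for v_row in [None] for row in prob_map]

# A tiny 2x3 probability map
binary = binarize([[0, 130, 255], [127, 128, 40]])
```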
The binarization operation may also be a segmentation operation based on region growing. Region growing aggregates pixels according to the similarity of pixels within the same object area. Specifically, starting from an initial region (e.g., pixels with larger values in the probability map), neighboring pixels with similar properties (i.e., values differing little from that of the current pixel) are merged into the current region, which thereby grows step by step until no further pixels can be merged.
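A region-growing pass of the kind described can be sketched with a breadth-first flood fill. The seed choice, 4-connectivity, and tolerance value are assumptions for illustration; the text only describes the general scheme.

```python
# Region growing: start from a high-probability seed and absorb 4-connected
# neighbours whose value differs from the current pixel's by at most `tol`.

from collections import deque

def grow_region(prob_map, seed, tol=30):
    """Return the set of (row, col) pixels reachable from seed by growing."""
    rows, cols = len(prob_map), len(prob_map[0])
    region = {seed}
    queue = deque([seed])
    while queue:
        r, c = queue.popleft()
        for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
            nr, nc = r + dr, c + dc
            if (0 <= nr < rows and 0 <= nc < cols and (nr, nc) not in region
                    and abs(prob_map[nr][nc] - prob_map[r][c]) <= tol):
                region.add((nr, nc))
                queue.append((nr, nc))
    return region

prob = [[250, 240, 10],
        [245, 230, 5],
        [0,   0,   0]]
region = grow_region(prob, (0, 0))  # grows over the bright top-left block
```

Growth stops exactly where the condition in the text fails: the low-valued pixels differ from their bright neighbours by more than the tolerance, so they are never merged.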
After the binarization operation, the segmentation module 1340 may further determine the contour of each connected region obtained by the binarization. This can be done with any existing or future edge-detection method, for example methods based on the Sobel or Canny operators. The segmentation module 1340 may then fit the contour of each connected region with a quadrilateral to determine the character areas. In one embodiment, the interior of every quadrilateral serves as a character area. Specifically, let B = {b_k}, k = 1, 2, …, Q, be the set of all fitted quadrilaterals, where b_k denotes a fitted quadrilateral, Q is the number of quadrilaterals, and k is an index. The set B is then output as the text detection result.
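The connected-region and fitting step can be sketched as follows. The text allows arbitrary quadrilaterals; fitting each component with its axis-aligned bounding box, as below, is a simplifying assumption, and the function names are illustrative.

```python
# Label 4-connected foreground components in a binary map, then fit each
# with the simplest quadrilateral: its axis-aligned bounding box.

from collections import deque

def fit_quads(binary, fg=255):
    """Return one (top, left, bottom, right) box per connected fg region."""
    rows, cols = len(binary), len(binary[0])
    seen = set()
    quads = []
    for r in range(rows):
        for c in range(cols):
            if binary[r][c] == fg and (r, c) not in seen:
                # flood-fill one component, tracking its extreme coordinates
                queue = deque([(r, c)])
                seen.add((r, c))
                top = bottom = r
                left = right = c
                while queue:
                    cr, cc = queue.popleft()
                    top, bottom = min(top, cr), max(bottom, cr)
                    left, right = min(left, cc), max(right, cc)
                    for nr, nc in ((cr - 1, cc), (cr + 1, cc),
                                   (cr, cc - 1), (cr, cc + 1)):
                        if (0 <= nr < rows and 0 <= nc < cols
                                and binary[nr][nc] == fg
                                and (nr, nc) not in seen):
                            seen.add((nr, nc))
                            queue.append((nr, nc))
                quads.append((top, left, bottom, right))
    return quads

demo = [[255, 255, 0, 0],
        [255, 255, 0, 0],
        [0,   0,   0, 255]]
boxes = fit_quads(demo)  # the set B of fitted regions, one per component
```

In a full implementation each box (or a tighter rotated quadrilateral, for slanted text) would become one element b_k of the output set B.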
The region enclosed by a quadrilateral can well contain text of any orientation or language, and the fitting is computationally simple. For example, in the probability map of Figure 6b, factors such as image noise or the shapes of the characters may prevent the probability map from representing, as well as desired, the probability that each pixel belongs to a character area. Fitting the character areas with quadrilateral regions further ensures that the whole text content is contained within the character areas, thereby ensuring the precision of text detection.
In one embodiment, regions with a smaller average pixel value in the image obtained after segmentation by the segmentation module 1340 may be regarded as non-text regions, and the other regions as character areas.
Figure 14 shows a schematic block diagram of a text detection device 1400 according to another embodiment of the invention. The semantic analysis module 1330 in the text detection device 1400 is similar to the semantic analysis module 1330 in the text detection device 1300, and the segmentation module 1340 in the text detection device 1400 is similar to the segmentation module 1340 in the text detection device 1300; for brevity, they are not described again here.
Compared with the text detection device 1300, the text detection device 1400 adds an image preprocessing module 1410 and a training module 1420.
According to an embodiment of the invention, the image preprocessing module 1410 receives an original image. In one embodiment, the original image may have complex background information and may contain diverse character areas, for example text of different colors, fonts, languages, and sizes.
The image preprocessing module 1410 preprocesses the received original image. In one embodiment, the image preprocessing module 1410 may normalize the size of the original image by scaling its largest dimension (e.g., the greater of its height and width) to a preset size, which may be 480, 640, 800, or 960 pixels, etc. Furthermore, the aspect ratio of the preprocessed image remains identical to the aspect ratio of the original image.
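The size-normalization computation above can be sketched in a few lines: scale the longer side to a preset size while preserving the aspect ratio. The rounding choice is an assumption; the text only requires that the aspect ratio be kept.

```python
# Aspect-ratio-preserving size normalization: the longer of (width, height)
# is scaled to `preset`, and the other side is scaled by the same factor.

def normalized_size(width, height, preset=640):
    """New (width, height) with the longer side scaled to `preset`."""
    scale = preset / max(width, height)
    return round(width * scale), round(height * scale)

size = normalized_size(1920, 1080, preset=640)  # landscape input
```

The same rule handles portrait images: whichever side is larger is the one mapped to the preset size.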
After preprocessing, the image preprocessing module 1410 obtains the image to be detected and outputs the full image to be detected to the semantic analysis module 1330 for processing. As described above, the image to be detected has the preset size, and its aspect ratio is identical to the aspect ratio of the original image.
According to one embodiment of the invention, the training module 1420 trains a neural network with multiple sample images to obtain the semantic prediction model 1350, a model that can effectively distinguish the character areas of an image to be detected from its non-text regions.
In one embodiment, the training module 1420 may collect, from various sources, a variety of images containing large amounts of text as sample images, and receive the annotation information of the sample images. The sample images are, for example, natural-scene images. The sample images are preferably diverse in kind and large in number, so as to obtain a good semantic prediction model; in one embodiment, the number of sample images is no fewer than 1000.
All character areas in each sample image may be marked with polygons in that sample image. The annotated text unit may be a text line or a word. The annotation information of a character area in a sample image may be stored in the form of a polygon (e.g., a quadrilateral); specifically, in one embodiment, only the coordinates of the quadrilateral's four vertices need be stored. Storing annotation information as quadrilaterals not only accommodates text of any orientation or language, but is also convenient to compute with.
Figures 10a, 10b, 10c, and 10d each illustrate a labeled sample image with annotation information according to an embodiment of the invention. As shown in these figures, the character areas in a sample image may be marked with quadrilaterals (the light quadrilaterals in the figures), and this form of annotation accommodates text of any font, language, and orientation.
The training module 1420 also generates a mask map of each sample image from the sample image and its annotation information. In one embodiment, the mask map comprises a binary mask map. Specifically, for a sample image I and its corresponding annotation information A, the training module 1420 generates a mask map of the same size as I, for example a binary mask map R, which uses different pixel values to distinguish the character areas of the sample image from its non-text regions. In one embodiment, for the sample image I, the annotated character areas are filled with pixels having a first pixel value and the non-text regions with pixels having a second, different pixel value, thereby generating the mask map in which character areas and non-text regions are distinguished. For example, the annotated character areas (i.e., the interiors of the marking quadrilaterals) are filled with the pixel value 255, while the non-character areas are filled with the pixel value 0.
The training module 1420 is further configured to build a training set from the sample images and their mask maps and to train the neural network to obtain the semantic prediction model 1350. Specifically, the original sample images and their corresponding mask maps form the training sample set S = {(I_i, R_i)}, i = 1, 2, …, N, where I_i denotes an original sample image, R_i is the mask map corresponding to I_i, N is the number of sample images in S, and i is an index.
In one embodiment, the neural network may be a fully convolutional network. A fully convolutional network is a special class of neural network whose characteristic is that every layer with learnable parameters, from input to output, is a convolutional layer. A fully convolutional network avoids complex early-stage preprocessing of the image and can take the original image directly as input; it is especially suited to analyzing images with complex backgrounds and can make the text detection results more accurate.
The training module 1420 inputs the training sample set S into the fully convolutional network for training, to obtain the semantic prediction model 1350. According to a specific embodiment of the invention, a fully convolutional network composed of 13 layers may be adopted; Figure 12 shows a schematic diagram of this network.
In addition to convolutional layers, this fully convolutional network also includes max-pooling layers. The max-pooling layers separate runs of consecutive convolutional layers; they effectively reduce the amount of computation while strengthening the robustness of the network.
The input to this fully convolutional network is raw image data. As shown in Figure 12, the network comprises a first and a second convolutional layer, in which the number of filters may be 64 and the filter size may be 3x3. The second convolutional layer is followed by a first max-pooling layer. Next come a third and a fourth convolutional layer, in which the number of filters may be 128 and the filter size may be 3x3; the fourth convolutional layer is followed by a second max-pooling layer. Next come a fifth, sixth, and seventh convolutional layer, with 256 filters of size 3x3; the seventh convolutional layer is followed by a third max-pooling layer. Next come an eighth, ninth, and tenth convolutional layer, in which the number of filters may be 512 and the filter size may be 3x3; the tenth convolutional layer is followed by a fourth max-pooling layer. Finally come an eleventh, twelfth, and thirteenth convolutional layer, in which the number of filters may be 512 and the filter size may be 3x3.
During training, the sample images and their corresponding mask maps are input to the fully convolutional network. The initial learning rate may be 0.00000001, and after every 10,000 iterations the learning rate is reduced to 1/10 of its previous value. After 100,000 iterations the training process may stop; the fully convolutional network obtained when training stops is the desired semantic prediction model. Via the trained semantic prediction model, a character-area probability map can be generated from the semantics of the image to be detected, thereby predicting the character areas in the image to be detected.
Those of ordinary skill in the art will appreciate that, although a 13-layer fully convolutional network is illustrated above, the number of layers of the fully convolutional network may be any number between 6 and 19 inclusive; this range balances accuracy of results against the amount of computation. Likewise, the filter counts and sizes given above are merely examples and not limiting: the number of filters could also be 100, 500, or 1000, and the filter size could also be 1x1 or 5x5.
The semantic prediction model 1350, obtained by the training module 1420 training the neural network with multiple sample images, can effectively distinguish the character areas of an image to be detected from its non-text regions.
Figure 15 shows a schematic block diagram of a text detection system 1500 according to an embodiment of the invention. As shown in Figure 15, the text detection system 1500 comprises a processor 1510, a memory 1520, and program instructions 1530 stored in the memory 1520.
When run by the processor 1510, the program instructions 1530 can implement the functions of the functional modules of the text detection device according to embodiments of the invention, and/or can perform the steps of the character detection method according to embodiments of the invention.
Specifically, when the program instructions 1530 are run by the processor 1510, the following steps are performed: receiving an image to be detected; generating, via a semantic prediction model, a character-area probability map of the full image to be detected, wherein the probability map uses different pixel values to distinguish the character areas of the image to be detected from its non-text regions; and performing a segmentation operation on the probability map to determine the character areas. The semantic prediction model predicts, from the semantics of the image, whether a pixel in the image to be detected belongs to a character area or a non-text region.
In addition, when the program instructions 1530 are run by the processor 1510, the following steps are also performed: receiving an original image; and preprocessing the original image to obtain the image to be detected, wherein the image to be detected has a preset size and the same aspect ratio as the original image.
In addition, the step, performed when the program instructions 1530 are run by the processor 1510, of performing a segmentation operation on the character-area probability map to determine the character areas comprises: performing a binarization operation on the character-area probability map to determine the character areas.
In addition, the step, performed when the program instructions 1530 are run by the processor 1510, of performing a binarization operation on the character-area probability map to determine the character areas comprises: determining the contour of each connected region obtained by the binarization operation; and fitting the contours with quadrilaterals, wherein the interiors of the quadrilaterals are the character areas.
In addition, when the program instructions 1530 are run by the processor 1510, the following step is also performed: training a neural network with multiple sample images to obtain the semantic prediction model.
In addition, the step, performed when the program instructions 1530 are run by the processor 1510, of training a neural network with multiple sample images to obtain the semantic prediction model comprises: receiving the sample images and the annotation information of the sample images; generating mask maps of the sample images from the sample images and their annotation information; and training the neural network with the sample images and the mask maps to obtain the semantic prediction model.
In addition, in the step, performed when the program instructions 1530 are run by the processor 1510, of training a neural network with multiple sample images to obtain the semantic prediction model, the mask maps comprise binary mask maps, and a binary mask map uses different pixel values to distinguish the character areas of a sample image from its non-text regions.
In addition, in the step, performed when the program instructions 1530 are run by the processor 1510, of training a neural network with multiple sample images to obtain the semantic prediction model, the neural network comprises a fully convolutional network.
In addition, in the step, performed when the program instructions 1530 are run by the processor 1510, of training a neural network with multiple sample images to obtain the semantic prediction model, the number of layers of the fully convolutional network may be any number between 6 and 19.
In addition, according to an embodiment of the invention, a storage medium is also provided, on which program instructions are stored. When run by a computer or processor, the program instructions perform the corresponding steps of the character detection method of the embodiments of the invention and implement the corresponding modules of the text detection device of the embodiments of the invention. The storage medium may include, for example, the memory card of a smartphone, the storage unit of a tablet computer, the hard disk of a personal computer, read-only memory (ROM), erasable programmable read-only memory (EPROM), portable compact-disc read-only memory (CD-ROM), USB storage, or any combination of the above storage media. The computer-readable storage medium may be any combination of one or more computer-readable storage media; for example, one computer-readable storage medium may contain computer-readable program code for training a neural network to obtain the semantic prediction model, and another may contain computer-readable program code for performing text detection.
In one embodiment, when run by a computer, the computer program instructions can implement the functional modules of the text detection device according to embodiments of the invention, and/or can perform the character detection method according to embodiments of the invention.
In one embodiment, when run by a computer, the computer program instructions perform the following steps: receiving an image to be detected; generating, via a semantic prediction model, a character-area probability map of the full image to be detected, wherein the probability map uses different pixel values to distinguish the character areas of the image to be detected from its non-text regions; and performing a segmentation operation on the probability map to determine the character areas. The semantic prediction model predicts, from the semantics of the image, whether a pixel in the image to be detected belongs to a character area or a non-text region.
In addition, when run by a computer, the computer program instructions also perform the following steps: receiving an original image; and preprocessing the original image to obtain the image to be detected, wherein the image to be detected has a preset size and the same aspect ratio as the original image.
In addition, the step, performed when the computer program instructions are run by a computer, of performing a segmentation operation on the character-area probability map to determine the character areas comprises: performing a binarization operation on the character-area probability map to determine the character areas.
In addition, the step, performed when the computer program instructions are run by a computer, of performing a binarization operation on the character-area probability map to determine the character areas comprises: determining the contour of each connected region obtained by the binarization operation; and fitting the contours with quadrilaterals, wherein the interiors of the quadrilaterals are the character areas.
In addition, when run by a computer, the computer program instructions also perform the following step: training a neural network with multiple sample images to obtain the semantic prediction model.
In addition, the step, performed when the computer program instructions are run by a computer, of training a neural network with multiple sample images to obtain the semantic prediction model comprises: receiving the sample images and the annotation information of the sample images; generating mask maps of the sample images from the sample images and their annotation information; and training the neural network with the sample images and the mask maps to obtain the semantic prediction model.
In addition, in the step, performed when the computer program instructions are run by a computer, of training a neural network with multiple sample images to obtain the semantic prediction model, the mask maps comprise binary mask maps, and a binary mask map uses different pixel values to distinguish the character areas of a sample image from its non-text regions.
In addition, in the step, performed when the computer program instructions are run by a computer, of training a neural network with multiple sample images to obtain the semantic prediction model, the neural network comprises a fully convolutional network.
In addition, in the step, performed when the computer program instructions are run by a computer, of training a neural network with multiple sample images to obtain the semantic prediction model, the number of layers of the fully convolutional network may be any number between 6 and 19.
Those of ordinary skill in the art, having read the above detailed description of the character detection method, can understand the structure, implementation, and advantages of the above text detection device and system, which are therefore not repeated here.
Numerous specific details are described in the specification provided herein. It will be understood, however, that embodiments of the invention may be practiced without these specific details. In some instances, well-known methods, structures, and techniques are not shown in detail so as not to obscure the understanding of this description.
Similarly, it should be understood that, in order to streamline the disclosure and aid understanding of one or more of the various inventive aspects, in the above description of exemplary embodiments of the invention, the features of the invention are sometimes grouped together into a single embodiment, figure, or description thereof. This method of disclosure, however, is not to be interpreted as reflecting an intention that the claimed invention requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in less than all features of a single foregoing disclosed embodiment. Thus the claims following the detailed description are hereby expressly incorporated into that description, with each claim standing on its own as a separate embodiment of the invention.
Those skilled in the art will appreciate that, except where at least some of such features and/or processes or units are mutually exclusive, all features disclosed in this specification (including the accompanying claims, abstract, and drawings) and all processes or units of any method or device so disclosed may be combined in any combination. Unless expressly stated otherwise, each feature disclosed in this specification (including the accompanying claims, abstract, and drawings) may be replaced by an alternative feature serving the same, an equivalent, or a similar purpose.
Furthermore, those skilled in the art will understand that, although some embodiments described herein include some features included in other embodiments but not other features, combinations of features of different embodiments are meant to be within the scope of the invention and to form different embodiments. For example, in the following claims, any of the claimed embodiments may be used in any combination.
The component embodiments of the invention may be implemented in hardware, in software modules running on one or more processors, or in a combination thereof. Those skilled in the art will appreciate that a microprocessor or digital signal processor (DSP) may be used in practice to implement some or all of the functions of some modules of the text detection device according to embodiments of the invention. The invention may also be implemented as device programs (e.g., computer programs and computer program products) for performing part or all of the methods described herein. Such programs implementing the invention may be stored on computer-readable media, or may take the form of one or more signals; such signals may be downloaded from an Internet website, provided on a carrier signal, or provided in any other form.
It should be noted that the above embodiments illustrate rather than limit the invention, and those skilled in the art may design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention may be implemented by means of hardware comprising several distinct elements and by means of a suitably programmed computer. In a claim enumerating several devices, several of these devices may be embodied by one and the same item of hardware. The use of the words first, second, third, etc. does not indicate any ordering; these words may be interpreted as names.

Claims (20)

1. A character detection method, comprising:
receiving an image to be detected;
generating, via a semantic prediction model, a character region probability map of the full image to be detected, wherein the character region probability map uses different pixel values to distinguish the character regions of the image to be detected from the non-character regions of the image to be detected; and
performing a segmentation operation on the character region probability map to determine the character regions.
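As an illustration of the segmentation step recited in claim 1, the sketch below thresholds a character region probability map into a binary text/non-text mask. The 0.5 threshold and the toy probability values are assumptions chosen for illustration; the patent does not fix a threshold.

```python
import numpy as np

def segment_text_regions(prob_map, threshold=0.5):
    """Threshold a character-region probability map into a binary mask:
    pixels whose text probability exceeds `threshold` become 1 (text),
    all others 0 (non-text)."""
    return (prob_map > threshold).astype(np.uint8)

# Toy 4x6 probability map with a high-probability band in the middle,
# standing in for the output of the semantic prediction model.
prob = np.array([
    [0.1, 0.1, 0.2, 0.1, 0.1, 0.1],
    [0.2, 0.9, 0.8, 0.9, 0.7, 0.1],
    [0.1, 0.8, 0.9, 0.8, 0.9, 0.2],
    [0.1, 0.1, 0.1, 0.2, 0.1, 0.1],
])
mask = segment_text_regions(prob)
print(mask.sum())  # 8 pixels classified as text
```

The mask preserves the probability map's convention of using different pixel values (here 1 and 0) for character and non-character regions.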
2. the method for claim 1, also comprises:
Receive original image; And
Pre-service is carried out to described original image, to obtain described image to be detected,
Wherein, described image to be detected has pre-set dimension size, and the Aspect Ratio of described image to be detected is identical with the Aspect Ratio of described original image.
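One plausible reading of claim 2's preprocessing — a preset size with the original aspect ratio preserved — is to scale the longer side to a fixed length. The sketch below computes such a resize target; the value 512 is an illustrative preset, not a figure given in the patent.

```python
def preprocess_size(orig_w, orig_h, target_long_side=512):
    """Compute resize dimensions that keep the original aspect ratio
    while scaling the longer side to a preset length."""
    scale = target_long_side / max(orig_w, orig_h)
    return round(orig_w * scale), round(orig_h * scale)

print(preprocess_size(1024, 768))  # (512, 384): aspect ratio 4:3 is kept
```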
3. The method of claim 1, wherein performing a segmentation operation on the character region probability map to determine the character regions comprises:
performing a binarization operation on the character region probability map to determine the character regions.
4. The method of claim 3, wherein performing a binarization operation on the character region probability map to determine the character regions comprises:
determining the contour of each connected region obtained by the binarization operation; and
fitting each contour to a quadrilateral, wherein the interior region of the quadrilateral is a character region.
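Claim 4's two steps — finding each connected region produced by binarization and fitting it to a quadrilateral — can be sketched as below. For brevity this sketch labels 4-connected regions with a breadth-first search and fits each to its axis-aligned bounding box, the simplest quadrilateral; an implementation of the claim could instead fit rotated quadrilaterals, e.g. via OpenCV's cv2.findContours and cv2.minAreaRect (a tooling assumption, not something the patent specifies).

```python
from collections import deque

import numpy as np

def fit_quadrilaterals(mask):
    """Label 4-connected regions in a binary mask and fit each to a
    quadrilateral, returned as four (x, y) corners in clockwise order
    starting from the top-left."""
    h, w = mask.shape
    seen = np.zeros((h, w), dtype=bool)
    quads = []
    for y in range(h):
        for x in range(w):
            if not mask[y, x] or seen[y, x]:
                continue
            # Breadth-first search over one connected region.
            queue = deque([(y, x)])
            seen[y, x] = True
            ys, xs = [], []
            while queue:
                cy, cx = queue.popleft()
                ys.append(cy)
                xs.append(cx)
                for ny, nx in ((cy - 1, cx), (cy + 1, cx),
                               (cy, cx - 1), (cy, cx + 1)):
                    if 0 <= ny < h and 0 <= nx < w and mask[ny, nx] and not seen[ny, nx]:
                        seen[ny, nx] = True
                        queue.append((ny, nx))
            y0, y1, x0, x1 = min(ys), max(ys), min(xs), max(xs)
            quads.append([(x0, y0), (x1, y0), (x1, y1), (x0, y1)])
    return quads

# A binarized probability map with two separate text regions.
binary = np.array([
    [0, 1, 1, 0, 0, 0],
    [0, 1, 1, 0, 1, 1],
    [0, 0, 0, 0, 1, 1],
], dtype=np.uint8)
print(fit_quadrilaterals(binary))  # one quadrilateral per connected region
```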
5. the method for claim 1, also comprises:
Utilize multiple sample image neural network training, to obtain described semantic forecast model.
6. method as claimed in claim 5, wherein, utilizes multiple sample image neural network training, comprises to obtain described semantic forecast model:
Receive the markup information of described sample image and described sample image;
The mask figure of described sample image is generated according to the markup information of described sample image and described sample image; And
Described sample image and described mask figure is utilized to train described neural network, to obtain described semantic forecast model.
7. method as claimed in claim 6, wherein, described mask figure comprises two-value mask figure, and described two-value mask figure uses different pixel values to distinguish the character area of described sample image and non-legible region.
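A two-value training mask of the kind claim 7 describes can be generated from annotation information as sketched below. The axis-aligned (x0, y0, x1, y1) box format is an assumption made for brevity; annotated text regions may in practice be arbitrary quadrilaterals, which would be rasterized instead.

```python
import numpy as np

def make_binary_mask(height, width, boxes):
    """Build a two-value mask map for one sample image: pixels inside
    any annotated text box get value 1, all other pixels value 0."""
    mask = np.zeros((height, width), dtype=np.uint8)
    for x0, y0, x1, y1 in boxes:
        mask[y0:y1, x0:x1] = 1
    return mask

# One annotated text box, 3 pixels wide and 2 pixels tall.
m = make_binary_mask(4, 6, [(1, 1, 4, 3)])
print(m.sum())  # 6 text pixels
```

During training, this mask serves as the pixel-wise target that the network's predicted probability map is compared against.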
8. The method of claim 5, wherein the neural network comprises a fully convolutional neural network.
9. The method of claim 8, wherein the number of layers of the fully convolutional neural network is any number between 6 and 19.
10. The method of any one of claims 1 to 9, wherein the semantic prediction model is used to predict, according to the semantics of the image to be detected, whether each pixel in the image to be detected belongs to a character region or to a non-character region.
11. A character detection device, comprising:
a semantic analysis module configured to receive an image to be detected, and to generate, using a semantic prediction model, a character region probability map of the full image to be detected, wherein the character region probability map uses different pixel values to distinguish the character regions of the image to be detected from the non-character regions of the image to be detected; and
a segmentation module configured to perform a segmentation operation on the character region probability map to determine the character regions.
12. The character detection device of claim 11, further comprising:
an image preprocessing module configured to receive an original image and to preprocess the original image to obtain the image to be detected,
wherein the image to be detected has a preset size, and the aspect ratio of the image to be detected is identical to the aspect ratio of the original image.
13. The character detection device of claim 11, wherein the segmentation module is further configured to perform a binarization operation on the character region probability map to determine the character regions.
14. The character detection device of claim 13, wherein the segmentation module is further configured to determine the contour of each connected region obtained by the binarization operation, and to fit each contour to a quadrilateral, wherein the interior region of the quadrilateral is a character region.
15. The character detection device of claim 11, further comprising:
a training module, connected to the semantic analysis module, configured to train a neural network with a plurality of sample images to obtain the semantic prediction model.
16. The character detection device of claim 15, wherein the training module is further configured to receive the sample images and annotation information of the sample images, to generate a mask map of each sample image according to the sample image and its annotation information, and to train the neural network with the sample images and the mask maps to obtain the semantic prediction model.
17. The character detection device of claim 16, wherein the mask map comprises a binary mask map, the binary mask map using different pixel values to distinguish the character regions of the sample image from the non-character regions.
18. The character detection device of claim 15, wherein the neural network comprises a fully convolutional neural network.
19. The character detection device of claim 18, wherein the number of layers of the fully convolutional neural network is any number between 6 and 19.
20. The character detection device of any one of claims 11 to 19, wherein the semantic prediction model is used to predict, according to the semantics of the image to be detected, whether each pixel in the image to be detected belongs to a character region or to a non-character region.
CN201510970839.2A 2015-12-22 2015-12-22 Character detecting method and device Active CN105574513B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510970839.2A CN105574513B (en) 2015-12-22 2015-12-22 Character detecting method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510970839.2A CN105574513B (en) 2015-12-22 2015-12-22 Character detecting method and device

Publications (2)

Publication Number Publication Date
CN105574513A true CN105574513A (en) 2016-05-11
CN105574513B CN105574513B (en) 2017-11-24

Family

ID=55884621

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510970839.2A Active CN105574513B (en) 2015-12-22 2015-12-22 Character detecting method and device

Country Status (1)

Country Link
CN (1) CN105574513B (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR100269102B1 (en) * 1994-06-24 2000-10-16 윤종용 Numeric character recognition with neural network
CN103745213A (en) * 2014-02-28 2014-04-23 中国人民解放军63680部队 Optical character recognition method based on LVQ neural network
CN104899586A (en) * 2014-03-03 2015-09-09 阿里巴巴集团控股有限公司 Method for recognizing character contents included in image and device thereof

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
ZHOU Dongao: "Research and Implementation of Text Detection Technology in Images", National University of Defense Technology *
LI Ying et al.: "A New Method for Detecting Text Regions in Images", Journal of Xidian University (Natural Science Edition) *
BAO Shengli: "Research on a Chinese Character Recognition System Based on Multi-Algorithm Integration and Neural Networks", Sichuan University *

Cited By (67)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106295629A (en) * 2016-07-15 2017-01-04 北京市商汤科技开发有限公司 Structured text detection method and system
US10937166B2 (en) 2016-07-15 2021-03-02 Beijing Sensetime Technology Development Co., Ltd. Methods and systems for structured text detection, and non-transitory computer-readable medium
CN106295629B (en) * 2016-07-15 2018-06-15 北京市商汤科技开发有限公司 structured text detection method and system
CN107633527B (en) * 2016-07-19 2020-07-07 北京图森未来科技有限公司 Target tracking method and device based on full convolution neural network
CN107633527A (en) * 2016-07-19 2018-01-26 北京图森未来科技有限公司 Target tracking method and device based on full convolutional neural networks
CN108108731A (en) * 2016-11-25 2018-06-01 中移(杭州)信息技术有限公司 Method for text detection and device based on generated data
CN106778928A (en) * 2016-12-21 2017-05-31 广州华多网络科技有限公司 Image processing method and device
CN106897732A (en) * 2017-01-06 2017-06-27 华中科技大学 Multi-direction Method for text detection in a kind of natural picture based on connection word section
CN107025457A (en) * 2017-03-29 2017-08-08 腾讯科技(深圳)有限公司 A kind of image processing method and device
CN107025457B (en) * 2017-03-29 2022-03-08 腾讯科技(深圳)有限公司 Image processing method and device
WO2018177237A1 (en) * 2017-03-29 2018-10-04 腾讯科技(深圳)有限公司 Image processing method and device, and storage medium
CN108830827A (en) * 2017-05-02 2018-11-16 通用电气公司 Neural metwork training image generation system
WO2018232592A1 (en) * 2017-06-20 2018-12-27 Microsoft Technology Licensing, Llc. Fully convolutional instance-aware semantic segmentation
CN109389116A (en) * 2017-08-14 2019-02-26 高德软件有限公司 A kind of character detection method and device
CN109410211A (en) * 2017-08-18 2019-03-01 北京猎户星空科技有限公司 The dividing method and device of target object in a kind of image
CN107886093A (en) * 2017-11-07 2018-04-06 广东工业大学 A kind of character detection method, system, equipment and computer-readable storage medium
CN108305262A (en) * 2017-11-22 2018-07-20 腾讯科技(深圳)有限公司 File scanning method, device and equipment
CN109961553A (en) * 2017-12-26 2019-07-02 航天信息股份有限公司 Invoice number recognition methods, device and tax administration self-service terminal system
CN108229575A (en) * 2018-01-19 2018-06-29 百度在线网络技术(北京)有限公司 For detecting the method and apparatus of target
CN108197623A (en) * 2018-01-19 2018-06-22 百度在线网络技术(北京)有限公司 For detecting the method and apparatus of target
CN108427950B (en) * 2018-02-01 2021-02-19 北京捷通华声科技股份有限公司 Character line detection method and device
CN108427950A (en) * 2018-02-01 2018-08-21 北京捷通华声科技股份有限公司 A kind of literal line detection method and device
CN108304814A (en) * 2018-02-08 2018-07-20 海南云江科技有限公司 A kind of construction method and computing device of literal type detection model
CN108304814B (en) * 2018-02-08 2020-07-14 海南云江科技有限公司 Method for constructing character type detection model and computing equipment
CN108446621A (en) * 2018-03-14 2018-08-24 平安科技(深圳)有限公司 Bank slip recognition method, server and computer readable storage medium
CN108717542A (en) * 2018-04-23 2018-10-30 北京小米移动软件有限公司 Identify the method, apparatus and computer readable storage medium of character area
CN108717542B (en) * 2018-04-23 2020-09-15 北京小米移动软件有限公司 Method and device for recognizing character area and computer readable storage medium
WO2019232853A1 (en) * 2018-06-04 2019-12-12 平安科技(深圳)有限公司 Chinese model training method, chinese image recognition method, device, apparatus and medium
CN108921158A (en) * 2018-06-14 2018-11-30 众安信息技术服务有限公司 Method for correcting image, device and computer readable storage medium
CN108989793A (en) * 2018-07-20 2018-12-11 深圳市华星光电技术有限公司 A kind of detection method and detection device of text pixel
CN109040824A (en) * 2018-08-28 2018-12-18 百度在线网络技术(北京)有限公司 Method for processing video frequency, device, electronic equipment and readable storage medium storing program for executing
JP2022501719A (en) * 2018-09-21 2022-01-06 ネイバー コーポレーションNAVER Corporation Character detection device, character detection method and character detection system
JP7198350B2 (en) 2018-09-21 2022-12-28 ネイバー コーポレーション CHARACTER DETECTION DEVICE, CHARACTER DETECTION METHOD AND CHARACTER DETECTION SYSTEM
CN109492638A (en) * 2018-11-07 2019-03-19 北京旷视科技有限公司 Method for text detection, device and electronic equipment
CN112789623A (en) * 2018-11-16 2021-05-11 北京比特大陆科技有限公司 Text detection method, device and storage medium
CN111259878A (en) * 2018-11-30 2020-06-09 中移(杭州)信息技术有限公司 Method and equipment for detecting text
CN109685055B (en) * 2018-12-26 2021-11-12 北京金山数字娱乐科技有限公司 Method and device for detecting text area in image
CN109685055A (en) * 2018-12-26 2019-04-26 北京金山数字娱乐科技有限公司 Text filed detection method and device in a kind of image
CN110119742B (en) * 2019-04-25 2023-07-07 添维信息科技(天津)有限公司 Container number identification method and device and mobile terminal
CN110119742A (en) * 2019-04-25 2019-08-13 添维信息科技(天津)有限公司 A kind of recognition methods of container number, device and mobile terminal
CN110059685B (en) * 2019-04-26 2022-10-21 腾讯科技(深圳)有限公司 Character area detection method, device and storage medium
CN110059685A (en) * 2019-04-26 2019-07-26 腾讯科技(深圳)有限公司 Word area detection method, apparatus and storage medium
CN110110777A (en) * 2019-04-28 2019-08-09 网易有道信息技术(北京)有限公司 Image processing method and training method and device, medium and calculating equipment
CN112001406B (en) * 2019-05-27 2023-09-08 杭州海康威视数字技术股份有限公司 Text region detection method and device
CN112001406A (en) * 2019-05-27 2020-11-27 杭州海康威视数字技术股份有限公司 Text region detection method and device
CN110458162A (en) * 2019-07-25 2019-11-15 上海兑观信息科技技术有限公司 A kind of method of intelligent extraction pictograph information
CN111753836A (en) * 2019-08-27 2020-10-09 北京京东尚科信息技术有限公司 Character recognition method and device, computer readable medium and electronic equipment
CN110503103A (en) * 2019-08-28 2019-11-26 上海海事大学 A kind of character cutting method in line of text based on full convolutional neural networks
CN110503103B (en) * 2019-08-28 2023-04-07 上海海事大学 Character segmentation method in text line based on full convolution neural network
CN110503159A (en) * 2019-08-28 2019-11-26 北京达佳互联信息技术有限公司 Character recognition method, device, equipment and medium
CN110807454A (en) * 2019-09-19 2020-02-18 平安科技(深圳)有限公司 Character positioning method, device and equipment based on image segmentation and storage medium
DE102019134387A1 (en) * 2019-12-13 2021-06-17 Beckhoff Automation Gmbh Process for real-time optical character recognition in an automation system and automation system
CN111242120A (en) * 2020-01-03 2020-06-05 中国科学技术大学 Character detection method and system
CN111242120B (en) * 2020-01-03 2022-07-29 中国科学技术大学 Character detection method and system
CN111626283B (en) * 2020-05-20 2022-12-13 北京字节跳动网络技术有限公司 Character extraction method and device and electronic equipment
CN111626283A (en) * 2020-05-20 2020-09-04 北京字节跳动网络技术有限公司 Character extraction method and device and electronic equipment
CN111723815A (en) * 2020-06-23 2020-09-29 中国工商银行股份有限公司 Model training method, image processing method, device, computer system, and medium
CN111753727B (en) * 2020-06-24 2023-06-23 北京百度网讯科技有限公司 Method, apparatus, device and readable storage medium for extracting structured information
CN111753727A (en) * 2020-06-24 2020-10-09 北京百度网讯科技有限公司 Method, device, equipment and readable storage medium for extracting structured information
CN111767921A (en) * 2020-06-30 2020-10-13 上海媒智科技有限公司 Express bill positioning and correcting method and device
CN114078108A (en) * 2020-08-11 2022-02-22 天津拓影科技有限公司 Method and device for processing abnormal area in image and method and device for image segmentation
CN114078108B (en) * 2020-08-11 2023-12-22 北京阅影科技有限公司 Method and device for processing abnormal region in image, and method and device for dividing image
CN112801911A (en) * 2021-02-08 2021-05-14 苏州长嘴鱼软件有限公司 Method and device for removing Chinese character noise in natural image and storage medium
CN112801911B (en) * 2021-02-08 2024-03-26 苏州长嘴鱼软件有限公司 Method and device for removing text noise in natural image and storage medium
CN114067192A (en) * 2022-01-07 2022-02-18 北京许先网科技发展有限公司 Character recognition method and system
CN114495129A (en) * 2022-04-18 2022-05-13 阿里巴巴(中国)有限公司 Character detection model pre-training method and device
CN114495129B (en) * 2022-04-18 2022-09-09 阿里巴巴(中国)有限公司 Character detection model pre-training method and device

Also Published As

Publication number Publication date
CN105574513B (en) 2017-11-24

Similar Documents

Publication Publication Date Title
CN105574513A (en) Character detection method and device
CN107545262B (en) Method and device for detecting text in natural scene image
CN108885699A (en) Character identifying method, device, storage medium and electronic equipment
CN111257341B (en) Underwater building crack detection method based on multi-scale features and stacked full convolution network
US9965695B1 (en) Document image binarization method based on content type separation
CN109685065B (en) Layout analysis method and system for automatically classifying test paper contents
CN111274957A (en) Webpage verification code identification method, device, terminal and computer storage medium
US20210081695A1 (en) Image processing method, apparatus, electronic device and computer readable storage medium
US11600088B2 (en) Utilizing machine learning and image filtering techniques to detect and analyze handwritten text
CN113191358B (en) Metal part surface text detection method and system
CN108268641A (en) Invoice information recognition methods and invoice information identification device, equipment and storage medium
CN110689134A (en) Method, apparatus, device and storage medium for performing machine learning process
CN111932577A (en) Text detection method, electronic device and computer readable medium
CN116071294A (en) Optical fiber surface defect detection method and device
Li et al. Gated auxiliary edge detection task for road extraction with weight-balanced loss
CN103824257A (en) Two-dimensional code image preprocessing method
CN116311214A (en) License plate recognition method and device
CN112052907A (en) Target detection method and device based on image edge information and storage medium
CN107886093B (en) Character detection method, system, equipment and computer storage medium
Choodowicz et al. Hybrid algorithm for the detection and recognition of railway signs
KR102026280B1 (en) Method and system for scene text detection using deep learning
CN112215229B (en) License plate recognition method and device based on lightweight network end-to-end
CN111402185A (en) Image detection method and device
CN112801960B (en) Image processing method and device, storage medium and electronic equipment
CN115063826A (en) Mobile terminal driver license identification method and system based on deep learning

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 100190 Beijing, Haidian District Academy of Sciences, South Road, No. 2, block A, No. 313

Applicant after: MEGVII INC.

Applicant after: Beijing maigewei Technology Co., Ltd.

Address before: 100190 Beijing, Haidian District Academy of Sciences, South Road, No. 2, block A, No. 313

Applicant before: MEGVII INC.

Applicant before: Beijing aperture Science and Technology Ltd.

CB02 Change of applicant information
GR01 Patent grant
GR01 Patent grant