CN105574513B - Text detection method and device - Google Patents

Text detection method and device

Info

Publication number
CN105574513B
CN105574513B (application CN201510970839.2A)
Authority
CN
China
Prior art keywords
image
text region
detected
probability map
mask
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201510970839.2A
Other languages
Chinese (zh)
Other versions
CN105574513A (en)
Inventor
姚聪 (Yao Cong)
周舒畅 (Zhou Shuchang)
周昕宇 (Zhou Xinyu)
印奇 (Yin Qi)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Megvii Technology Co Ltd
Beijing Maigewei Technology Co Ltd
Original Assignee
Beijing Megvii Technology Co Ltd
Beijing Maigewei Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Megvii Technology Co Ltd, Beijing Maigewei Technology Co Ltd
Priority to CN201510970839.2A
Publication of CN105574513A
Application granted
Publication of CN105574513B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
      • G06 COMPUTING; CALCULATING OR COUNTING
        • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
          • G06V 10/00 Arrangements for image or video recognition or understanding
            • G06V 10/20 Image preprocessing
              • G06V 10/26 Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
                • G06V 10/267 Segmentation by performing operations on regions, e.g. growing, shrinking or watersheds
          • G06V 20/00 Scenes; Scene-specific elements
            • G06V 20/20 Scene-specific elements in augmented reality scenes
            • G06V 20/60 Type of objects
              • G06V 20/62 Text, e.g. of license plates, overlay texts or captions on TV images
                • G06V 20/63 Scene text, e.g. street names
          • G06V 30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
            • G06V 30/10 Character recognition
              • G06V 30/26 Techniques for post-processing, e.g. correcting the recognition result
                • G06V 30/262 Post-processing using context analysis, e.g. lexical, syntactic or semantic context
                  • G06V 30/274 Syntactic or semantic context, e.g. balancing

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a text detection method and device. The text detection method includes: receiving an image to be detected; generating, via a semantic prediction model, a text region probability map of the full image to be detected, wherein the probability map uses different pixel values to distinguish the text regions of the image to be detected from its non-text regions, and the semantic prediction model is a neural network; and performing a segmentation operation on the text region probability map to determine the text regions. The method and device can detect text of different languages, orientations, colors, fonts and sizes while effectively suppressing interference from complex backgrounds, and therefore have a wide range of application. In addition, they are robust, coping well with interference factors such as image noise, image blur, complex backgrounds and non-uniform illumination.

Description

Text detection method and device
Technical field
The present invention relates to the field of image processing, and in particular to a text detection method and device.
Background art
With the wide adoption of smartphones and the rapid development of the mobile Internet, acquiring, retrieving and sharing information through the camera of a mobile terminal such as a mobile phone has gradually become a way of life. Camera-based applications place increasing emphasis on understanding the photographed scene. In scenes where text coexists with other objects, users usually pay the most attention to the textual information, so correctly recognizing the text in a captured image gives a deeper understanding of the user's intent in taking the photograph. This involves text detection technology, that is, identifying the text regions in a captured image.
Text detection is an important basic technology with great application value and broad prospects, especially text detection in natural scene images. For example, text detection for natural scene images can be directly applied to fields such as augmented reality, geo-location, human-computer interaction, robot navigation, autonomous vehicles and industrial automation.
However, most images to be detected contain rather complex backgrounds, and their quality may be affected by factors such as noise, blur and non-uniform illumination. Moreover, text is diverse: the text in a natural scene image may differ in color, size, font, orientation and so on. All these factors pose great difficulties and challenges for text detection. For these reasons, existing text detection methods easily produce false alarms, that is, they mistakenly classify non-text components of the background as text. Existing methods also fall short in adaptability: most can only detect horizontal text and are helpless against tilted or rotated text, and some can only be applied to Chinese and cannot be directly generalized to other languages (such as English, Russian or Korean). When an image suffers from severe noise, blur or non-uniform illumination, existing methods again tend to make mistakes. In short, existing text detection methods and systems have defects in both accuracy and range of application.
Summary of the invention
In view of the above problems, the present invention is proposed to provide a text detection method and device that at least partly solve them.
According to one aspect of the invention, a text detection method is provided, including:
receiving multiple sample images and annotation information of the sample images; generating mask images of the sample images according to the sample images and their annotation information; training a neural network with the sample images and the mask images to obtain a semantic prediction model; receiving an image to be detected; generating, via the semantic prediction model, a text region probability map of the full image to be detected, wherein the probability map uses different pixel values to distinguish the text regions of the image to be detected from its non-text regions; and performing a segmentation operation on the text region probability map to determine the text regions.
According to another aspect of the invention, a text detection device is also provided, including a training module, a semantic analysis module and a segmentation module. The training module is configured to receive multiple sample images and their annotation information, generate mask images of the sample images according to the sample images and the annotation information, and train a neural network with the sample images and the mask images to obtain a semantic prediction model. The semantic analysis module is configured to receive an image to be detected and use the semantic prediction model to generate a text region probability map of the full image to be detected, wherein the probability map uses different pixel values to distinguish the text regions of the image to be detected from its non-text regions. The segmentation module is configured to perform a segmentation operation on the text region probability map to determine the text regions.
The above text detection method and device support text detection directly on the full image to be detected, unlike algorithms based on simple thresholding, sliding windows or connected components. They can detect text of different languages, orientations, colors, fonts and sizes while effectively suppressing interference from complex backgrounds, giving a wide range of application. In addition, they are robust and cope well with interference factors such as image noise, image blur, complex backgrounds and non-uniform illumination.
The above is only an overview of the technical solution of the present invention. In order that the technical means of the invention may be understood more clearly and carried out according to the content of the specification, and in order to make the above and other objects, features and advantages of the invention more apparent, specific embodiments of the invention are set forth below.
Brief description of the drawings
Various other advantages and benefits will become clear to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The accompanying drawings serve only to illustrate the preferred embodiments and are not to be considered a limitation of the present invention. Throughout the drawings, identical parts are denoted by identical reference numerals. In the drawings:
Fig. 1a and Fig. 1b schematically show, respectively, an image to be detected and the detected image according to an embodiment of the invention;
Fig. 2 schematically shows a flowchart of a text detection method according to an embodiment of the invention;
Fig. 3a and Fig. 3b, Fig. 4a and Fig. 4b, Fig. 5a and Fig. 5b, Fig. 6a and Fig. 6b schematically show full images to be detected according to embodiments of the invention and the corresponding generated text region probability maps;
Fig. 7 schematically shows a flowchart of a method for obtaining an image to be detected according to an embodiment of the invention;
Fig. 8 schematically shows a flowchart of a method for performing a segmentation operation on a text region probability map according to an embodiment of the invention;
Fig. 9 schematically shows a flowchart of a method for training a neural network according to an embodiment of the invention;
Figure 10a, Figure 10b, Figure 10c and Figure 10d respectively show sample images with annotation information according to an embodiment of the invention;
Figure 11a and Figure 11b respectively show a sample image with annotation information according to an embodiment of the invention and its corresponding mask image;
Figure 12 schematically shows a diagram of a fully convolutional neural network according to an embodiment of the invention;
Figure 13 schematically shows a schematic block diagram of a text detection device according to an embodiment of the invention;
Figure 14 schematically shows a schematic block diagram of a text detection device according to another embodiment of the invention; and
Figure 15 schematically shows a schematic block diagram of a text detection system according to an embodiment of the invention.
Detailed description of embodiments
Exemplary embodiments of the disclosure are described in more detail below with reference to the accompanying drawings. Although the drawings show exemplary embodiments of the disclosure, it should be understood that the disclosure may be realized in various forms and should not be limited by the embodiments set forth here. Rather, these embodiments are provided so that the disclosure will be understood more thoroughly and its scope fully conveyed to those skilled in the art.
To identify text regions in images automatically and more reasonably, the invention provides a text detection method. Fig. 1a and Fig. 1b schematically show, respectively, an image to be detected and the detected image according to an embodiment of the invention. Fig. 2 shows a flowchart of a text detection method 200 according to an embodiment of the invention. As shown in Fig. 2, the method 200 includes steps S210 to S230.
In step S210, an image to be detected is received. The image to be detected may be an original image, or an image obtained by preprocessing an original image. In one embodiment of the invention, the image to be detected may be obtained by preprocessing a captured original image. The preprocessing method is described in detail below with reference to the relevant drawings.
In step S220, a text region probability map of the full image to be detected is generated via a semantic prediction model, wherein the probability map uses different pixel values to distinguish the text regions of the image to be detected from its non-text regions. According to one embodiment of the invention, a text region is a region of the image that contains text. Taking Fig. 1a and Fig. 1b as an example, the regions inside the two black quadrilaterals in Fig. 1b are text regions: the first text region contains the words "I am growing" and the second contains the words "please do not step on me" (both translated from the Chinese sign text).
In one embodiment, the text region probability map uses different pixel values to represent different probabilities, thereby distinguishing the text regions of the image to be detected from its non-text regions. In one embodiment, the higher the pixel value, the higher the probability that the pixel belongs to a text region, and the lower the pixel value, the lower that probability. For example, a black pixel with value 0 indicates that the probability of the pixel belonging to a text region is 0, while a white pixel with value 255 indicates that the probability is 100%.
According to one embodiment of the invention, the text region probability map of the full image to be detected is generated via a semantic prediction model. The semantic prediction model generates the probability map according to the semantics of the image to be detected, in order to predict whether each pixel of the image belongs to a text region or a non-text region. Image semantics are high-level features of an image: although they build on low-level features such as color, texture and shape, they are quite different from those low-level features. As the basic carrier of knowledge information, image semantics can convert complete image content into an intuitively understandable text-like representation, and they play a vital role in image understanding. The input of image understanding is image data and its output is knowledge; it belongs to the high-level content of the image research field. A semantic prediction model realizes image understanding: it can identify the text regions in an image directly from image semantics, which is fundamentally different from models that segment an image based on thresholds. Based on its understanding of the image to be detected, a semantic prediction model can generate a better text region probability map according to the image's semantics and thereby predict whether each pixel belongs to a text region or a non-text region, yielding more reasonable text regions.
The semantic prediction model can be obtained by training a neural network. A neural network can be used to estimate an unknown, generic approximate function from a large amount of input. Neural networks are capable of machine learning and have strong adaptability: a trained neural network can approximate an arbitrary function and can "learn" from given data. Neural networks are therefore well suited to being trained as semantic prediction models that identify the text regions in images to be detected. Training a neural network to obtain the semantic prediction model is described in detail below with reference to Fig. 9 to Figure 12.
Fig. 3a and Fig. 3b, Fig. 4a and Fig. 4b, Fig. 5a and Fig. 5b, Fig. 6a and Fig. 6b respectively show full images to be detected according to embodiments of the invention and the corresponding text region probability maps generated via the semantic prediction model. Fig. 3a, Fig. 4a, Fig. 5a and Fig. 6a are full images to be detected, each containing text regions: the text region in Fig. 3a contains Chinese; the text regions in Fig. 4a contain Chinese and English and, as shown in Fig. 4a, their orientation is non-horizontal; the text region in Fig. 5a contains Russian; and the text region in Fig. 6a contains Korean. As can be seen, the images in Fig. 3a, Fig. 4a, Fig. 5a and Fig. 6a have different, rather complex backgrounds, and the text in these images is diverse, differing in color, font, language, size and so on. Fig. 3b, Fig. 4b, Fig. 5b and Fig. 6b respectively show the text region probability maps generated from the full images of Fig. 3a, Fig. 4a, Fig. 5a and Fig. 6a by the semantic prediction model. The generated probability maps use different pixel values to represent different probabilities and thereby distinguish text regions from non-text regions: for example, text regions are filled with pixels of value 255, indicating the highest probability of belonging to a text region, and non-text regions (e.g., the background) are filled with pixels of value 0, indicating the lowest probability. Taking Fig. 4b as an example, its probability map distinguishes the text regions from the non-text regions of the image to be detected in Fig. 4a: the two text regions, "No admittance without authorization" and "Authorized Personnel Only", are filled with pixels of value 255, giving the probability map shown in Fig. 4b, which also completely and accurately reflects the orientation of the text regions in the image of Fig. 4a.
In step S230, a segmentation operation is performed on the text region probability map generated in step S220 to determine the text regions. Since the value of a pixel in the probability map represents the probability that the pixel belongs to a text region, thereby distinguishing text regions from non-text regions, the probability map can be segmented according to low-level features (such as image gray level).
For example, step S230 may obtain the text regions by binarizing the text region probability map. Since the aim in the present invention is to distinguish text regions from non-text regions (background), binarization serves this purpose well: it is simple to implement, computationally cheap and fast.
The binarization operation may be a threshold segmentation operation. Optionally, the threshold T is an adjustable parameter. If a gray value of 255 represents a probability of 100% of belonging to a text region, and a gray value of 0 represents a probability of 0, the threshold may be set to 128.
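As an illustration of this thresholding step, the following is a minimal sketch in Python, assuming the probability map is stored as an 8-bit grayscale image and using OpenCV (the function name and defaults are ours, not the patent's):

```python
import cv2
import numpy as np

def binarize_probability_map(prob_map: np.ndarray, threshold: int = 128) -> np.ndarray:
    """Threshold a text region probability map (uint8, 0..255) into a binary
    image: 255 where a pixel likely belongs to a text region, 0 elsewhere."""
    _, binary = cv2.threshold(prob_map, threshold, 255, cv2.THRESH_BINARY)
    return binary
```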
The binarization operation may also be a segmentation operation based on region growing. Region growing aggregates pixels according to the similar properties of pixels within the same object region. Specifically, starting from an initial region (for example, a pixel with a large value in the probability map), adjacent pixels with similar properties (a small difference from the current pixel's value) are merged into the current region, so that the region grows step by step until no more pixels can be merged.
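A sketch of such region growing, assuming 4-connectivity and an adjustable tolerance `tol` on the pixel-value difference (both choices are ours; the patent does not fix them):

```python
from collections import deque
import numpy as np

def region_grow(prob_map: np.ndarray, seed: tuple, tol: int = 20) -> np.ndarray:
    """Grow a region from a high-probability seed pixel, absorbing each
    4-connected neighbour whose value differs from the current pixel's
    value by at most `tol`. Returns a boolean mask of the grown region."""
    h, w = prob_map.shape
    grown = np.zeros((h, w), dtype=bool)
    grown[seed] = True
    queue = deque([seed])
    while queue:
        y, x = queue.popleft()
        for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
            if (0 <= ny < h and 0 <= nx < w and not grown[ny, nx]
                    and abs(int(prob_map[ny, nx]) - int(prob_map[y, x])) <= tol):
                grown[ny, nx] = True
                queue.append((ny, nx))
    return grown
```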
In the image obtained after segmentation, regions with a small average pixel value may be regarded as non-text regions, and the other regions as text regions. The binarization operation used to determine the text regions is described in detail below with reference to the relevant drawings.
Those skilled in the art will appreciate that the above method 200 is universal: it can be used for text detection in any image. The method 200 can perform text detection and recognition on document images, such as photographs of certificates and bills or scanned copies of paper documents, and it can also perform text detection and recognition on natural scene images.
The above method 200 of the invention abandons detection based on sliding windows and detection based on connected components, and adopts a brand-new detection approach based on semantic segmentation. The method 200 performs full-image prediction: both its input and its output are entire images rather than local regions or windows, so it can make better use of the contextual information in an image, particularly in natural scene images, and thereby obtain more accurate text detection results.
The method 200 can handle images of different scenes and different quality. It can detect text of different colors, fonts and sizes while effectively suppressing interference from complex backgrounds. It can automatically predict the orientation of text lines and directly detect text of different orientations in an image. It is insensitive to the language of the text and can simultaneously detect text of different languages (such as Chinese, English and Korean). In addition, the method 200 is robust and copes well with interference factors such as noise, blur, complex backgrounds and non-uniform illumination.
Fig. 7 shows a flowchart of a method for obtaining the image to be detected according to an embodiment of the invention.
In step S710, an original image is received. In one embodiment, the original image may have complex background information, and the text regions it contains may be diverse, for example containing textual information of different colors, fonts, languages and sizes.
In step S720, the received original image is preprocessed to obtain the image to be detected. In one embodiment, the received original image may be size-normalized, that is, the largest dimension of the original image (the greater of its height and width) is scaled to a preset size, which may be, for example, 480, 640, 800 or 960 pixels. The aspect ratio of the image to be detected obtained after size normalization remains identical to the aspect ratio of the original image.
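A minimal sketch of this size normalization, assuming OpenCV and bilinear interpolation (the patent does not specify the interpolation method):

```python
import cv2

PRESET_SIZES = (480, 640, 800, 960)  # preset sizes mentioned in the description

def normalize_size(image, target: int = PRESET_SIZES[1]):
    """Scale the longer side of the image to `target` pixels while keeping
    the original aspect ratio unchanged."""
    h, w = image.shape[:2]
    scale = target / max(h, w)
    return cv2.resize(image, (round(w * scale), round(h * scale)),
                      interpolation=cv2.INTER_LINEAR)
```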
Fig. 8 schematically shows a flowchart of a method for performing a segmentation operation on the text region probability map according to an embodiment of the invention.
In step S810, a binarization operation is performed on the text region probability map of the image to be detected.
It will be appreciated that the text regions could be obtained directly from the result of the binarization. As noted above, since the aim is to distinguish text regions from non-text regions (background), binarization serves this purpose and is simple to implement, computationally cheap and fast. The binarization operation may be a threshold segmentation operation with an adjustable threshold T (for example, T = 128 when gray value 255 represents a 100% probability of belonging to a text region and gray value 0 represents a probability of 0), or a segmentation operation based on region growing, which merges adjacent pixels of similar value into the current region step by step until no more pixels can be merged, as described above for step S230.
In the embodiment shown in Fig. 8, the binarization operation is followed by steps S820 and S830.
In step S820, the contour of each connected region obtained by the binarization operation is determined. This step can be realized with any existing or future edge detection method, for example the various edge detection methods based on operators such as Sobel or Canny.
In step S830, the contour of each connected region is fitted with a quadrilateral to determine the text regions. In one embodiment, the interiors of all quadrilaterals are taken as text regions. Specifically, let B be the set of all fitted quadrilaterals, B = {b_k}, k = 1, 2, ..., Q, where b_k is a quadrilateral obtained by fitting, Q is the number of quadrilaterals, and k is an index. The set B is then output as the text detection result.
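The following sketch illustrates steps S820 and S830 with OpenCV, using contour extraction and a minimum-area rotated rectangle as the quadrilateral fit (the patent does not name a particular fitting algorithm, so this choice and the `min_area` noise filter are assumptions):

```python
import cv2
import numpy as np

def fit_text_quadrilaterals(binary: np.ndarray, min_area: float = 10.0):
    """Find the contour of each connected region in the binarized map and
    fit a quadrilateral around it. Returns the set B of quadrilaterals,
    each as a 4x2 array of corner coordinates."""
    contours, _ = cv2.findContours(binary, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    quads = []
    for contour in contours:
        if cv2.contourArea(contour) < min_area:
            continue  # skip tiny components that are likely noise
        rect = cv2.minAreaRect(contour)    # rotated rectangle (a quadrilateral)
        quads.append(cv2.boxPoints(rect))  # its four corner points
    return quads
```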
The region enclosed by a quadrilateral can well cover text of any orientation and language, and quadrilaterals are simple to compute. As shown by the text region probability map of Fig. 6b, various factors such as image noise and the shapes of the characters may prevent the probability map from representing perfectly the probability that each pixel belongs to a text region. Fitting the text regions with quadrilateral regions further ensures that the text regions contain the complete text content, thereby ensuring the precision of text detection.
Fig. 9 schematically shows a flowchart of a method for training a neural network to obtain the semantic prediction model according to an embodiment of the invention. The purpose of the method is to learn a semantic prediction model from sample images, such that the model can effectively distinguish the text regions of an image to be detected from its non-text regions.
A sample image is an image whose text regions are known. As described above, a neural network has the ability to "learn", so a usable semantic prediction model can be obtained by training the neural network with multiple sample images. In this embodiment, the training method enables the semantic prediction model to generate a more accurate text region probability map according to the semantics of an image to be detected, and thus to predict whether each pixel of the image belongs to a text region or a non-text region, making the detection results of the text detection method more accurate.
Those skilled in the art will appreciate that, for a text detection system, the semantic prediction model may be stored in the system in advance.
In step S910, multiple sample images and their annotation information are received.
In one embodiment, a large number of images containing text, for example natural scene images, may be collected from different sources as sample images. The sample images should preferably be varied and numerous in order to obtain a good semantic prediction model. In one embodiment, the number of sample images is no fewer than 1000.
All text regions in each sample image may be annotated with polygons, thereby obtaining the annotation information of the sample image. The basic annotation unit may be a text line or a word. The annotation information of the text regions in a sample image may be saved in the form of polygons (for example, quadrilaterals); specifically, in one embodiment, only the coordinates of the four vertices of each quadrilateral are saved. Saving the annotation information as quadrilaterals can not only accommodate text of any orientation and language, but is also easy to compute with.
Figure 10a, Figure 10b, Figure 10c and Figure 10d respectively show annotated sample images with annotation information according to an embodiment of the invention. As shown in these figures, quadrilaterals (the light quadrilaterals in the figures) can be used to annotate the text regions in the sample images, and this kind of annotated region suits any font, language and text orientation.
In step S920, the mask image of each sample image is generated according to the sample image and its annotation information. Specifically, for a sample image I and the corresponding annotation information, a mask image of the same size as I is generated. In one embodiment, the mask image may be a binary mask R. In the binary mask R, different pixel values distinguish the text regions of the sample image from its non-text regions. In one embodiment, for the sample image I, the text regions marked by the annotation information are filled with pixels of a first pixel value and the non-text regions are filled with pixels of a second pixel value, thereby generating the binary mask R; the first and second pixel values differ so as to distinguish text regions from non-text regions. For example, in the binary mask R, the pixel values of the annotated text regions (that is, the interiors of the annotated quadrilaterals) are filled with 255, and the pixel values of the non-text regions are filled with 0.
Figure 11a and Figure 11b respectively show an annotated sample image with annotation information according to an embodiment of the invention and its corresponding mask image. As shown in Figure 11a, quadrilaterals mark out the text portions of the original sample image (for example, a Chinese sign together with the street names "Haidian Middle St", "HAIDIANZHONGJIE" and "Haidian South Road"), and the mask image shown in Figure 11b is generated accordingly: the marked text portions are filled with pixels of value 255 and the non-text portions with pixels of value 0, yielding the mask image shown in Figure 11b.
In step S930, a training set is built from the sample images and their mask images, and the neural network is trained to obtain the semantic prediction model M. The original sample images and their corresponding mask images form the training sample set S = {(I_i, R_i)}, i = 1, 2, ..., N, where I_i is an original sample image, R_i is the mask image corresponding to I_i, N is the number of sample images in the training set S, and i is an index.
In one embodiment, the neural network may include a fully convolutional neural network. A fully convolutional network is a special kind of neural network characterized by the fact that, from input to output, every layer containing learnable parameters is a convolutional layer. Fully convolutional networks avoid complex early-stage image preprocessing and can take raw images directly as input; they are particularly suited to analyzing images with complex backgrounds and can make text detection results more accurate.
According to a specific embodiment of the invention, a fully convolutional network composed of 13 layers may be used. Figure 12 shows a diagram of this fully convolutional network.
Besides convolutional layers, the fully convolutional network also contains max-pooling layers. The max-pooling layers separate runs of consecutive convolutional layers; they effectively reduce the amount of computation while strengthening the robustness of the network.
The input of the fully convolutional network is raw image data. As shown in Figure 12, the network contains a first and a second convolutional layer, each with 64 filters of size 3x3; the second convolutional layer is followed by a first max-pooling layer. Next come a third and a fourth convolutional layer, each with 128 filters of size 3x3; the fourth convolutional layer is followed by a second max-pooling layer. Next come a fifth, a sixth and a seventh convolutional layer, each with 256 filters of size 3x3; the seventh convolutional layer is followed by a third max-pooling layer. Next come an eighth, a ninth and a tenth convolutional layer, each with 512 filters of size 3x3; the tenth convolutional layer is followed by a fourth max-pooling layer. Finally come an eleventh, a twelfth and a thirteenth convolutional layer, each with 512 filters of size 3x3.
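The layer arrangement just described (convolutional blocks of 2, 2, 3, 3 and 3 layers with four interleaved max pools, a VGG-style layout) can be sketched in PyTorch as follows; the ReLU activations, the 1x1 prediction head and the bilinear upsampling back to input resolution are our assumptions, since the description specifies only the convolutional and pooling layers:

```python
import torch.nn as nn

def conv_block(in_ch, out_ch, n_convs, pool=True):
    """`n_convs` 3x3 convolutions with `out_ch` filters, optionally
    followed by a 2x2 max-pooling layer."""
    layers = []
    for i in range(n_convs):
        layers += [nn.Conv2d(in_ch if i == 0 else out_ch, out_ch, 3, padding=1),
                   nn.ReLU(inplace=True)]
    if pool:
        layers.append(nn.MaxPool2d(2))
    return layers

class TextFCN(nn.Module):
    """13 convolutional layers: 2x64, pool, 2x128, pool, 3x256, pool,
    3x512, pool, 3x512; a 1x1 head turns the features into a per-pixel
    text score, upsampled 16x to undo the four pooling stages."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            *conv_block(3, 64, 2),                 # conv 1-2
            *conv_block(64, 128, 2),               # conv 3-4
            *conv_block(128, 256, 3),              # conv 5-7
            *conv_block(256, 512, 3),              # conv 8-10
            *conv_block(512, 512, 3, pool=False),  # conv 11-13, no pool
        )
        self.head = nn.Conv2d(512, 1, kernel_size=1)
        self.upsample = nn.Upsample(scale_factor=16, mode='bilinear',
                                    align_corners=False)

    def forward(self, x):
        return self.upsample(self.head(self.features(x))).sigmoid()
```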
In the training process, one sample image and its corresponding mask image are input to the fully convolutional network at a time. The initial learning rate may be 0.00000001, and every 10,000 iterations the learning rate is reduced to 1/10 of its previous value. After 100,000 iterations, the training process may end. The fully convolutional network obtained at the end of training is the desired semantic prediction model. Via the trained semantic prediction model, the text region probability map of a full image to be detected can be generated according to the image's semantics, thereby predicting the text regions in the image.
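A sketch of this schedule, assuming the `TextFCN` above, an SGD optimizer and a binary cross-entropy loss against masks rescaled to 0/1 floats (the description fixes only the learning-rate schedule, not the optimizer or the loss):

```python
import torch

def train(model, dataset, total_iters=100_000, base_lr=1e-8):
    """Feed one (image, mask) pair per iteration; divide the learning
    rate by 10 every 10,000 iterations; stop after 100,000 iterations."""
    optimizer = torch.optim.SGD(model.parameters(), lr=base_lr)
    loss_fn = torch.nn.BCELoss()
    for step in range(total_iters):
        image, mask = dataset[step % len(dataset)]  # mask as floats in {0, 1}
        pred = model(image.unsqueeze(0))            # add a batch dimension
        loss = loss_fn(pred, mask.unsqueeze(0))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        if (step + 1) % 10_000 == 0:
            for group in optimizer.param_groups:
                group['lr'] *= 0.1                  # decay to 1/10
```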
Those skilled in the art will appreciate that although the above explanation takes a 13-layer fully convolutional network as an example, the number of layers of the fully convolutional network may be any number between 6 and 19; this range balances the accuracy of the results against the amount of computation. In addition, the filter numbers and sizes given above are only examples, not limitations: for example, the number of filters could also be 100, 500 or 1000, and the filter size could also be 1x1 or 5x5.
According to another aspect of the invention, a text detection device is also provided. Figure 13 shows a schematic block diagram of a text detection device 1300 according to an embodiment of the invention. As shown in Figure 13, the text detection device 1300 includes a semantic analysis module 1330 and a segmentation module 1340. According to one embodiment of the invention, the semantic analysis module 1330 also includes a semantic prediction model 1350.
The semantic analysis module 1330 is configured to receive an image to be detected and to generate the text region probability map of the full image to be detected using the semantic prediction model 1350. The semantic prediction model generates the probability map according to the semantics of the image, in order to predict whether each pixel of the image to be detected belongs to a text region or a non-text region. The probability map uses different pixel values to represent different probabilities, thereby distinguishing the text regions of the image to be detected from its non-text regions.
In one embodiment, the image to be detected may be an original image, or an image obtained after preprocessing an original image.
In one embodiment, the semantic prediction model 1350 may be obtained by training a neural network, as described in detail below with reference to Figure 14.
The text region probability maps are described with reference to Fig. 3a and Fig. 3b, Fig. 4a and Fig. 4b, Fig. 5a and Fig. 5b, Fig. 6a and Fig. 6b, which respectively show full images to be detected according to embodiments of the invention and the corresponding text region probability maps generated via the semantic prediction model 1350. The generated probability maps use different pixel values to represent different probabilities and thereby distinguish text regions from non-text regions: for example, text regions are filled with pixels of value 255, indicating the highest probability of belonging to a text region, and non-text regions (e.g., the background) are filled with pixels of value 0, indicating the lowest probability. Taking Fig. 4b as an example, its probability map distinguishes the text regions from the non-text regions of the image in Fig. 4a: the two text regions, "No admittance without authorization" and "Authorized Personnel Only", are filled with pixels of value 255, giving the probability map shown in Fig. 4b, which also completely and accurately reflects the orientation of the text regions in the original image of Fig. 4a.
The segmentation module 1340 is configured to perform a segmentation operation on the text region probability map to determine the text regions. Since the value of a pixel in the probability map represents the probability that the pixel belongs to a text region, the probability map can be segmented according to low-level features (such as image gray level).
For example, the segmentation module 1340 may obtain the text regions by binarizing the text region probability map. As before, binarization distinguishes text regions from non-text regions (background) and is simple to implement, computationally cheap and fast. The binarization operation may be a threshold segmentation operation with an adjustable threshold T (for example, T = 128 when gray value 255 represents a 100% probability of belonging to a text region and gray value 0 represents a probability of 0), or a segmentation operation based on region growing, which merges adjacent pixels of similar value into the current region step by step until no more pixels can be merged.
After the binarization operation, the segmentation module 1340 may also be configured to determine the contour of each connected region obtained by the binarization; this can be realized with any existing or future edge detection method, for example those based on operators such as Sobel or Canny. The segmentation module 1340 may further be configured to fit the contour of each connected region with a quadrilateral to determine the text regions. In one embodiment, the interiors of all quadrilaterals are taken as text regions. Specifically, let B be the set of all fitted quadrilaterals, B = {b_k}, k = 1, 2, ..., Q, where b_k is a quadrilateral obtained by fitting, Q is the number of quadrilaterals, and k is an index; the set B is output as the text detection result.
The region enclosed by a quadrilateral can well cover text of any orientation and language, and quadrilaterals are simple to compute. As shown by the text region probability map of Fig. 6b, various factors such as image noise and the shapes of the characters may prevent the probability map from representing perfectly the probability that each pixel belongs to a text region; fitting the text regions with quadrilateral regions further ensures that the text regions contain the complete text content, thereby ensuring the precision of text detection.
In one embodiment, regions with a small average pixel value in the image obtained after segmentation by the segmentation module 1340 may be regarded as non-text regions, and the other regions as text regions.
Figure 14 shows a schematic block diagram of a text detection device 1400 according to another embodiment of the invention. The semantic analysis module 1330 and the segmentation module 1340 in the text detection device 1400 are similar to those in the text detection device 1300 and, for brevity, are not described again here.
Compared with the text detection device 1300, the text detection device 1400 adds an image preprocessing module 1410 and a training module 1420.
According to an embodiment of the invention, the image preprocessing module 1410 receives an original image. In one embodiment, the original image may have complex background information and may contain diverse text regions, for example textual information of different colors, fonts, languages and sizes.
The image preprocessing module 1410 preprocesses the received original image. In one embodiment, the image preprocessing module 1410 may size-normalize the received original image, that is, scale the largest dimension of the original image (the greater of its height and width) to a preset size, which may be, for example, 480, 640, 800 or 960 pixels. The aspect ratio of the preprocessed image remains identical to the aspect ratio of the original image.
After the preprocessing, the image preprocessing module 1410 obtains the image to be detected and outputs the full image to be detected to the semantic analysis module 1330 for processing. As described above, the image to be detected has the preset size, and its aspect ratio is identical to that of the original image.
According to one embodiment of the invention, the training module 1420 is configured to train a neural network with multiple sample images to obtain the semantic prediction model 1350, a model that can effectively distinguish the text regions of an image to be detected from its non-text regions.
In one embodiment, the training module 1420 may collect a large number of images containing text, for example natural scene images, from different sources as sample images, and receive the annotation information of the sample images. The sample images should preferably be varied and numerous in order to obtain a good semantic prediction model; in one embodiment, the number of sample images is no fewer than 1000.
All text regions in each sample image may be annotated with polygons. The basic annotation unit may be a text line or a word. The annotation information of the text regions in a sample image may be saved in the form of polygons (for example, quadrilaterals); specifically, in one embodiment, only the coordinates of the four vertices of each quadrilateral are saved. Saving the annotation information as quadrilaterals can not only accommodate text of any orientation and language, but is also easy to compute with.
Figure 10a, Figure 10b, Figure 10c and Figure 10d respectively show annotated sample images with annotation information according to an embodiment of the invention. As shown in these figures, quadrilaterals (the light quadrilaterals in the figures) can be used to annotate the text regions in the sample images, and this kind of annotated region suits any font, language and text orientation.
The training module 1420 is further configured to generate the mask image of each sample image according to the sample image and its annotation information. In one embodiment, the mask image includes a binary mask. Specifically, for a sample image I and the corresponding annotation information, the training module 1420 generates a mask image of the same size as I, for example a binary mask R, which uses different pixel values to distinguish the text regions of the sample image from its non-text regions. In one embodiment, for the sample image I, the annotated text regions are filled with pixels of a first pixel value and the non-text regions with pixels of a second pixel value, thereby generating the mask image; the first and second pixel values differ so as to distinguish text regions from non-text regions. For example, the pixel values of the annotated text regions (that is, the interiors of the annotated quadrilaterals) are filled with 255, and the pixel values of the non-text regions are filled with 0.
The training module 1420 is further configured to build a training set from the sample images and their mask images, and to train the neural network to obtain the semantic prediction model 1350. Specifically, the original sample images and their corresponding mask images form the training sample set S = {(I_i, R_i)}, i = 1, 2, ..., N, where I_i is an original sample image, R_i is the mask image corresponding to I_i, N is the number of sample images in the training set S, and i is an index.
In one embodiment, the neural network may be a fully convolutional neural network, a special kind of neural network in which, from input to output, every layer containing learnable parameters is a convolutional layer. Fully convolutional networks avoid complex early-stage image preprocessing and can take raw images directly as input; they are particularly suited to analyzing images with complex backgrounds and can make text detection results more accurate.
The training module 1420 inputs the training sample set S to the fully convolutional network for training, to obtain the semantic prediction model 1350. According to a specific embodiment of the invention, the 13-layer fully convolutional network shown in Figure 12 may be used.
Besides convolutional layers, the fully convolutional network also contains max-pooling layers, which separate runs of consecutive convolutional layers and effectively reduce the amount of computation while strengthening the robustness of the network.
The input of the fully convolutional network is raw image data. As shown in Figure 12 and described above, its layers are arranged as follows: two convolutional layers with 64 filters of size 3x3 followed by a first max-pooling layer; two convolutional layers with 128 filters of size 3x3 followed by a second max-pooling layer; three convolutional layers with 256 filters of size 3x3 followed by a third max-pooling layer; three convolutional layers with 512 filters of size 3x3 followed by a fourth max-pooling layer; and finally three convolutional layers with 512 filters of size 3x3.
In the training process, one sample image and its corresponding mask image are input to the fully convolutional network at a time; the initial learning rate may be 0.00000001, every 10,000 iterations the learning rate is reduced to 1/10 of its previous value, and after 100,000 iterations the training process may end. The fully convolutional network obtained at the end of training is the desired semantic prediction model, via which the text region probability map can be generated according to the semantics of an image to be detected, thereby predicting the text regions in the image.
Those skilled in the art will appreciate that although a 13-layer fully convolutional network is taken as the example above, the number of layers may be any number between 6 and 19, a range that balances the accuracy of the results against the amount of computation. The filter numbers and sizes given above are likewise only examples, not limitations: the number of filters could also be 100, 500 or 1000, and the filter size could also be 1x1 or 5x5.
The semantic prediction model 1350, obtained by the training module 1420 by training the neural network with multiple sample images, can effectively distinguish the text regions of an image to be detected from its non-text regions.
Figure 15 shows the schematic block diagram of text detection system 1500 according to embodiments of the present invention.As shown in figure 15, Text detection system 1500 includes processor 1510, memory 1520 and the programmed instruction stored in the memory 1520 1530。
Described program instruction 1530 can realize word according to embodiments of the present invention when the processor 1510 is run The function of each functional module of detection means, and/or character detecting method according to embodiments of the present invention can be performed Each step.
Specifically, when described program instruction 1530 is run by the processor 1510, following steps are performed:Receive to be checked Altimetric image;The character area probability graph of the full figure of described image to be detected is generated via semantic forecast model, wherein, the word Area probability figure distinguishes the character area of described image to be detected and the non-text of described image to be detected using different pixel values Block domain;And cutting operation is carried out to the character area probability graph, to determine the character area.Semantic forecast model is used Pixel in image to be detected described in the semantic forecast according to image belongs to character area and still falls within non-legible region.
In addition, when described program instruction 1530 is run by the processor 1510, following steps are also performed:Receive original Image;And the original image is pre-processed, to obtain described image to be detected, wherein, image to be detected tool There is pre-set dimension size, and the Aspect Ratio of described image to be detected is identical with the Aspect Ratio of the original image.
It is in addition, performed general to the character area when described program instruction 1530 is run by the processor 1510 The step of rate figure progress cutting operation is to determine the character area includes:Binaryzation behaviour is carried out to the character area probability graph Make, to determine the character area.
It is in addition, performed general to the character area when described program instruction 1530 is run by the processor 1510 The step of rate figure progress binarization operation is to determine the character area includes:Determine that the binarization operation is obtained each The profile of connected region;And by the contour fitting be quadrangle, wherein, the quadrangle interior zone is the literal field Domain.
In addition, when described program instruction 1530 is run by the processor 1510, following steps are also performed:Using multiple Sample image trains neutral net, to obtain the semantic forecast model.
In addition, when described program instruction 1530 is run by the processor 1510, the multiple sample graphs of performed utilization As training neutral net is included with obtaining the step of the semantic forecast model:Receive the sample image and the sample image Markup information;The mask figure of the sample image is generated according to the markup information of the sample image and the sample image; And the neutral net is trained using the sample image and the mask figure, to obtain the semantic forecast model.
In addition, execution is run using multiple sample images training god by the processor 1510 in described program instruction 1530 In the step of through network to obtain the semantic forecast model, the mask figure includes two-value mask figure, and the two-value is covered Film figure distinguishes the character area of the sample image and non-legible region using different pixel values.
In addition, in the step, performed when the program instructions 1530 are run by the processor 1510, of training a neural network using multiple sample images to obtain the semantic prediction model, the neural network includes a fully convolutional neural network.
In addition, in the step, performed when the program instructions 1530 are run by the processor 1510, of training a neural network using multiple sample images to obtain the semantic prediction model, the number of layers of the fully convolutional neural network is any number between 6 and 19.
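A minimal training sketch under these constraints; the six-convolutional-layer network shown here sits at the lower end of the stated 6-to-19 range, and the loss function, optimizer, learning rate, and random stand-in data are all assumptions:

    import torch
    import torch.nn as nn

    # A six-layer fully convolutional network trained to reproduce the
    # two-value mask maps, yielding the semantic prediction model.
    fcn = nn.Sequential(
        nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
        nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
        nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
        nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
        nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
        nn.Conv2d(32, 1, 1), nn.Sigmoid(),
    )
    optimizer = torch.optim.SGD(fcn.parameters(), lr=0.01)
    loss_fn = nn.BCELoss()

    images = torch.rand(4, 3, 128, 128)                  # stand-in sample images
    masks = (torch.rand(4, 1, 128, 128) > 0.5).float()   # stand-in mask maps
    for _ in range(10):
        optimizer.zero_grad()
        loss = loss_fn(fcn(images), masks)
        loss.backward()
        optimizer.step()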
In addition, according to an embodiment of the present invention, a storage medium is also provided, on which program instructions are stored. When run by a computer or processor, the program instructions are used to perform the corresponding steps of the character detection method of the embodiments of the present invention, and to implement the corresponding modules in the text detection apparatus according to embodiments of the present invention. The storage medium may include, for example, the storage card of a smart phone, the memory unit of a tablet computer, the hard disk of a personal computer, a read-only memory (ROM), an erasable programmable read-only memory (EPROM), a portable compact disc read-only memory (CD-ROM), a USB memory, or any combination of the above storage media. The computer-readable storage medium may be any combination of one or more computer-readable storage media; for example, one computer-readable storage medium may contain computer-readable program code for training the neural network to obtain the semantic prediction model, and another computer-readable storage medium may contain computer-readable program code for performing text detection.
In one embodiment, when run by a computer, the computer program instructions can implement each functional module of the text detection apparatus according to embodiments of the present invention, and/or can perform the text detection method according to embodiments of the present invention.
In one embodiment, when run by a computer, the computer program instructions perform the following steps: receiving an image to be detected; generating, via the semantic prediction model, a character area probability map of the full image to be detected, wherein the character area probability map uses different pixel values to distinguish the character areas of the image to be detected from the non-character areas of the image to be detected; and performing a segmentation operation on the character area probability map, to determine the character areas. The semantic prediction model is used to predict, according to the semantics of the image, whether a pixel in the image to be detected belongs to a character area or to a non-character area.
In addition, when run by a computer, the computer program instructions also perform the following steps: receiving an original image; and preprocessing the original image, to obtain the image to be detected, wherein the image to be detected has a preset size, and the aspect ratio of the image to be detected is identical to the aspect ratio of the original image.
In addition, when the computer program instructions are run by a computer, the performed step of segmenting the character area probability map to determine the character areas includes: performing a binarization operation on the character area probability map, to determine the character areas.
In addition, when the computer program instructions are run by a computer, the performed step of binarizing the character area probability map to determine the character areas includes: determining the profile of each connected region obtained by the binarization operation; and fitting the profile to a quadrilateral, wherein the interior of the quadrilateral is the character area.
In addition, when the computer program instructions are run by a computer, the following steps are also performed: training a neural network using multiple sample images, to obtain the semantic prediction model.
In addition, when the computer program instructions are run by a computer, the performed step of training a neural network using multiple sample images to obtain the semantic prediction model includes: receiving the sample images and the annotation information of the sample images; generating mask maps of the sample images according to the sample images and the annotation information of the sample images; and training the neural network using the sample images and the mask maps, to obtain the semantic prediction model.
In addition, in the step, performed when the computer program instructions are run by a computer, of training a neural network using multiple sample images to obtain the semantic prediction model, the mask map includes a two-value mask map, and the two-value mask map uses different pixel values to distinguish the character areas of the sample image from the non-character regions.
In addition, in the step, performed when the computer program instructions are run by a computer, of training a neural network using multiple sample images to obtain the semantic prediction model, the neural network includes a fully convolutional neural network.
In addition, in the step, performed when the computer program instructions are run by a computer, of training a neural network using multiple sample images to obtain the semantic prediction model, the number of layers of the fully convolutional neural network is any number between 6 and 19.
By reading the detailed description of the character detection method above, those of ordinary skill in the art can understand the structure, implementation, and advantages of the above text detection apparatus and system, which are therefore not repeated here.
In the specification provided here, numerous specific details are set forth. It is to be understood, however, that embodiments of the present invention may be practiced without these specific details. In some instances, well-known methods, structures, and techniques have not been shown in detail so as not to obscure the understanding of this description.
Similarly, it should be appreciated that, in the above description of exemplary embodiments of the invention, various features of the invention are sometimes grouped together in a single embodiment, figure, or description thereof, in order to streamline the disclosure and aid the understanding of one or more of the various inventive aspects. However, this method of disclosure is not to be interpreted as reflecting an intention that the claimed invention requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in less than all features of a single foregoing disclosed embodiment. Thus, the claims following this detailed description are hereby expressly incorporated into this detailed description, with each claim standing on its own as a separate embodiment of the invention.
Those skilled in the art will appreciate that, except where at least some of such features and/or processes or units are mutually exclusive, all features disclosed in this specification (including the accompanying claims, abstract, and drawings) and all processes or units of any method or apparatus so disclosed may be combined in any combination. Unless expressly stated otherwise, each feature disclosed in this specification (including the accompanying claims, abstract, and drawings) may be replaced by alternative features serving the same, equivalent, or similar purpose.
Furthermore, those skilled in the art will appreciate that, while some embodiments described herein include certain features included in other embodiments but not others, combinations of features of different embodiments are meant to be within the scope of the invention and to form different embodiments. For example, in the following claims, any of the claimed embodiments may be used in any combination.
The various component embodiments of the present invention may be implemented in hardware, in software modules running on one or more processors, or in a combination thereof. Those skilled in the art will appreciate that a microprocessor or digital signal processor (DSP) may be used in practice to implement some or all of the functions of some modules in the text detection apparatus according to embodiments of the present invention. The present invention may also be implemented as device programs (for example, computer programs and computer program products) for performing part or all of the methods described herein. Such programs implementing the present invention may be stored on computer-readable media, or may take the form of one or more signals. Such signals may be downloaded from an Internet website, provided on a carrier signal, or provided in any other form.
It should be noted that the above embodiments illustrate rather than limit the invention, and those skilled in the art may design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention may be implemented by means of hardware comprising several distinct elements, and by means of a suitably programmed computer. In a unit claim enumerating several devices, several of these devices may be embodied by one and the same item of hardware. The use of the words first, second, third, and so on does not indicate any ordering; these words may be interpreted as names.

Claims (16)

1. A character detection method, comprising:
receiving multiple sample images and annotation information of the sample images;
generating mask maps of the sample images according to the sample images and the annotation information of the sample images;
training a neural network using the sample images and the mask maps, to obtain a semantic prediction model;
receiving an image to be detected;
generating, via the semantic prediction model, a character area probability map of the full image to be detected, wherein the character area probability map uses different pixel values to distinguish the character areas of the image to be detected from the non-character areas of the image to be detected; and
performing a segmentation operation on the character area probability map, to determine the character areas.
2. The method of claim 1, further comprising:
receiving an original image; and
preprocessing the original image, to obtain the image to be detected,
wherein the image to be detected has a preset size, and the aspect ratio of the image to be detected is identical to the aspect ratio of the original image.
3. The method of claim 1, wherein performing a segmentation operation on the character area probability map to determine the character areas comprises:
performing a binarization operation on the character area probability map, to determine the character areas.
4. The method of claim 3, wherein performing a binarization operation on the character area probability map to determine the character areas comprises:
determining the profile of each connected region obtained by the binarization operation; and
fitting the profile to a quadrilateral, wherein the interior of the quadrilateral is the character area.
5. The method of claim 1, wherein the mask map comprises a two-value mask map, and the two-value mask map uses different pixel values to distinguish the character areas of the sample image from the non-character regions.
6. The method of claim 1, wherein the neural network comprises a fully convolutional neural network.
7. The method of claim 6, wherein the number of layers of the fully convolutional neural network is any number between 6 and 19.
8. The method of any one of claims 1 to 7, wherein the semantic prediction model is used to predict, according to the semantics of the image to be detected, whether a pixel in the image to be detected belongs to a character area or to a non-character area.
9. A text detection apparatus, comprising:
a training module, configured to receive multiple sample images and annotation information of the sample images, generate mask maps of the sample images according to the sample images and the annotation information of the sample images, and train a neural network using the sample images and the mask maps, to obtain a semantic prediction model;
a semantic analysis module, connected to the training module and configured to receive an image to be detected and to use the semantic prediction model to generate a character area probability map of the full image to be detected, wherein the character area probability map uses different pixel values to distinguish the character areas of the image to be detected from the non-character areas of the image to be detected; and
a segmentation module, configured to perform a segmentation operation on the character area probability map, to determine the character areas.
10. The text detection apparatus of claim 9, further comprising:
an image preprocessing module, configured to receive an original image and to preprocess the original image, to obtain the image to be detected,
wherein the image to be detected has a preset size, and the aspect ratio of the image to be detected is identical to the aspect ratio of the original image.
11. The text detection apparatus of claim 9, wherein the segmentation module is further configured to perform a binarization operation on the character area probability map, to determine the character areas.
12. The text detection apparatus of claim 11, wherein the segmentation module is further configured to determine the profile of each connected region obtained by the binarization operation, and to fit the profile to a quadrilateral, wherein the interior of the quadrilateral is the character area.
13. The text detection apparatus of claim 9, wherein the mask map comprises a two-value mask map, and the two-value mask map uses different pixel values to distinguish the character areas of the sample image from the non-character regions.
14. The text detection apparatus of claim 9, wherein the neural network comprises a fully convolutional neural network.
15. The text detection apparatus of claim 14, wherein the number of layers of the fully convolutional neural network is any number between 6 and 19.
16. The text detection apparatus of any one of claims 9 to 15, wherein the semantic prediction model is used to predict, according to the semantics of the image to be detected, whether a pixel in the image to be detected belongs to a character area or to a non-character area.
CN201510970839.2A 2015-12-22 2015-12-22 Character detecting method and device Active CN105574513B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510970839.2A CN105574513B (en) 2015-12-22 2015-12-22 Character detecting method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510970839.2A CN105574513B (en) 2015-12-22 2015-12-22 Character detecting method and device

Publications (2)

Publication Number Publication Date
CN105574513A CN105574513A (en) 2016-05-11
CN105574513B true CN105574513B (en) 2017-11-24

Family

ID=55884621

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510970839.2A Active CN105574513B (en) 2015-12-22 2015-12-22 Character detecting method and device

Country Status (1)

Country Link
CN (1) CN105574513B (en)

Families Citing this family (48)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106295629B (en) * 2016-07-15 2018-06-15 北京市商汤科技开发有限公司 structured text detection method and system
CN107633527B (en) * 2016-07-19 2020-07-07 北京图森未来科技有限公司 Target tracking method and device based on full convolution neural network
CN108108731B (en) * 2016-11-25 2021-02-05 中移(杭州)信息技术有限公司 Text detection method and device based on synthetic data
CN106778928B (en) * 2016-12-21 2020-08-04 广州华多网络科技有限公司 Image processing method and device
CN106897732B (en) * 2017-01-06 2019-10-08 华中科技大学 It is a kind of based on connection text section natural picture in multi-direction Method for text detection
CN107025457B (en) * 2017-03-29 2022-03-08 腾讯科技(深圳)有限公司 Image processing method and device
US10262236B2 (en) * 2017-05-02 2019-04-16 General Electric Company Neural network training image generation system
WO2018232592A1 (en) * 2017-06-20 2018-12-27 Microsoft Technology Licensing, Llc. Fully convolutional instance-aware semantic segmentation
CN109389116B (en) * 2017-08-14 2022-02-08 阿里巴巴(中国)有限公司 Character detection method and device
CN109410211A (en) * 2017-08-18 2019-03-01 北京猎户星空科技有限公司 The dividing method and device of target object in a kind of image
CN107886093B (en) * 2017-11-07 2021-07-06 广东工业大学 Character detection method, system, equipment and computer storage medium
CN108305262A (en) * 2017-11-22 2018-07-20 腾讯科技(深圳)有限公司 File scanning method, device and equipment
CN109961553A (en) * 2017-12-26 2019-07-02 航天信息股份有限公司 Invoice number recognition methods, device and tax administration self-service terminal system
CN108229575A (en) * 2018-01-19 2018-06-29 百度在线网络技术(北京)有限公司 For detecting the method and apparatus of target
CN108197623A (en) * 2018-01-19 2018-06-22 百度在线网络技术(北京)有限公司 For detecting the method and apparatus of target
CN108427950B (en) * 2018-02-01 2021-02-19 北京捷通华声科技股份有限公司 Character line detection method and device
CN108304814B (en) * 2018-02-08 2020-07-14 海南云江科技有限公司 Method for constructing character type detection model and computing equipment
CN108446621A (en) * 2018-03-14 2018-08-24 平安科技(深圳)有限公司 Bank slip recognition method, server and computer readable storage medium
CN108717542B (en) * 2018-04-23 2020-09-15 北京小米移动软件有限公司 Method and device for recognizing character area and computer readable storage medium
CN109102037B (en) * 2018-06-04 2024-03-05 平安科技(深圳)有限公司 Chinese model training and Chinese image recognition method, device, equipment and medium
CN108921158A (en) * 2018-06-14 2018-11-30 众安信息技术服务有限公司 Method for correcting image, device and computer readable storage medium
CN108989793A (en) * 2018-07-20 2018-12-11 深圳市华星光电技术有限公司 A kind of detection method and detection device of text pixel
CN109040824B (en) * 2018-08-28 2020-07-28 百度在线网络技术(北京)有限公司 Video processing method and device, electronic equipment and readable storage medium
KR102211763B1 (en) * 2018-09-21 2021-02-03 네이버 주식회사 Apparatus, method and system for detecting character
CN109492638A (en) * 2018-11-07 2019-03-19 北京旷视科技有限公司 Method for text detection, device and electronic equipment
CN112789623A (en) * 2018-11-16 2021-05-11 北京比特大陆科技有限公司 Text detection method, device and storage medium
CN111259878A (en) * 2018-11-30 2020-06-09 中移(杭州)信息技术有限公司 Method and equipment for detecting text
CN109685055B (en) * 2018-12-26 2021-11-12 北京金山数字娱乐科技有限公司 Method and device for detecting text area in image
CN110119742B (en) * 2019-04-25 2023-07-07 添维信息科技(天津)有限公司 Container number identification method and device and mobile terminal
CN110059685B (en) * 2019-04-26 2022-10-21 腾讯科技(深圳)有限公司 Character area detection method, device and storage medium
CN110110777A (en) * 2019-04-28 2019-08-09 网易有道信息技术(北京)有限公司 Image processing method and training method and device, medium and calculating equipment
CN112001406B (en) * 2019-05-27 2023-09-08 杭州海康威视数字技术股份有限公司 Text region detection method and device
CN110458162B (en) * 2019-07-25 2023-06-23 上海兑观信息科技技术有限公司 Method for intelligently extracting image text information
CN111753836A (en) * 2019-08-27 2020-10-09 北京京东尚科信息技术有限公司 Character recognition method and device, computer readable medium and electronic equipment
CN110503103B (en) * 2019-08-28 2023-04-07 上海海事大学 Character segmentation method in text line based on full convolution neural network
CN110503159B (en) * 2019-08-28 2022-10-11 北京达佳互联信息技术有限公司 Character recognition method, device, equipment and medium
CN110807454B (en) * 2019-09-19 2024-05-14 平安科技(深圳)有限公司 Text positioning method, device, equipment and storage medium based on image segmentation
DE102019134387A1 (en) * 2019-12-13 2021-06-17 Beckhoff Automation Gmbh Process for real-time optical character recognition in an automation system and automation system
CN111242120B (en) * 2020-01-03 2022-07-29 中国科学技术大学 Character detection method and system
CN113496223A (en) * 2020-03-19 2021-10-12 顺丰科技有限公司 Method and device for establishing text region detection model
CN111626283B (en) * 2020-05-20 2022-12-13 北京字节跳动网络技术有限公司 Character extraction method and device and electronic equipment
CN111723815B (en) * 2020-06-23 2023-06-30 中国工商银行股份有限公司 Model training method, image processing device, computer system and medium
CN111753727B (en) * 2020-06-24 2023-06-23 北京百度网讯科技有限公司 Method, apparatus, device and readable storage medium for extracting structured information
CN111767921A (en) * 2020-06-30 2020-10-13 上海媒智科技有限公司 Express bill positioning and correcting method and device
CN114078108B (en) * 2020-08-11 2023-12-22 北京阅影科技有限公司 Method and device for processing abnormal region in image, and method and device for dividing image
CN112801911B (en) * 2021-02-08 2024-03-26 苏州长嘴鱼软件有限公司 Method and device for removing text noise in natural image and storage medium
CN114067192A (en) * 2022-01-07 2022-02-18 北京许先网科技发展有限公司 Character recognition method and system
CN114495129B (en) * 2022-04-18 2022-09-09 阿里巴巴(中国)有限公司 Character detection model pre-training method and device

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR100269102B1 (en) * 1994-06-24 2000-10-16 윤종용 Numeric character recognition with neural network
CN103745213A (en) * 2014-02-28 2014-04-23 中国人民解放军63680部队 Optical character recognition method based on LVQ neural network
CN104899586B (en) * 2014-03-03 2018-10-12 阿里巴巴集团控股有限公司 Method and device is identified to the word content for including in image

Also Published As

Publication number Publication date
CN105574513A (en) 2016-05-11


Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 100190 Beijing, Haidian District Academy of Sciences, South Road, No. 2, block A, No. 313

Applicant after: MEGVII INC.

Applicant after: Beijing maigewei Technology Co., Ltd.

Address before: 100190 Beijing, Haidian District Academy of Sciences, South Road, No. 2, block A, No. 313

Applicant before: MEGVII INC.

Applicant before: Beijing aperture Science and Technology Ltd.

CB02 Change of applicant information
GR01 Patent grant
GR01 Patent grant