CN101833664A - Video image character detecting method based on sparse expression - Google Patents


Info

Publication number
CN101833664A
Authority
CN
China
Prior art keywords
image
character
edge
video
gray level
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN 201010151779
Other languages
Chinese (zh)
Inventor
王春恒 (Wang Chunheng)
李心洁 (Li Xinjie)
程刚 (Cheng Gang)
张荣国 (Zhang Rongguo)
张阳 (Zhang Yang)
肖柏华 (Xiao Baihua)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Automation of Chinese Academy of Science
Original Assignee
Institute of Automation of Chinese Academy of Science
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Automation of Chinese Academy of Science filed Critical Institute of Automation of Chinese Academy of Science
Priority to CN 201010151779 priority Critical patent/CN101833664A/en
Publication of CN101833664A publication Critical patent/CN101833664A/en
Pending legal-status Critical Current

Abstract

The invention provides a video image character detection method based on sparse representation, comprising the following steps. S1: resample the video sequence to obtain color video images, convert them to gray-level images, and apply a multi-scale transform to obtain multi-scale gray-level images. S2: apply an improved Sobel operator and a morphological closing operation to the multi-scale gray-level images to obtain edge images, filter the edge images by edge density, and obtain candidate character regions through connected-component analysis and rule-based analysis. S3: compute vertical and horizontal projections of the candidate character regions and cut the projection profiles to obtain candidate text lines; cut each candidate line into small regions with a sliding window, extract edge features from each small region, classify each small region with a sparse-representation-based classifier to decide whether it is a character region, and judge each candidate line from the per-region results to obtain and output the final text-line regions.

Description

Video image character detecting method based on sparse expression
Technical field
The invention belongs to the field of image understanding and image retrieval, and specifically relates to a fast and accurate video text detection method and its system implementation.
Background technology
With the rapid development of multimedia technology and the Internet, the amount of multimedia information is growing explosively. Besides images and text, more and more databases also contain video. Video is the most common form of multimedia information and can be obtained in many ways, such as television and the Internet. This has drawn the interest of researchers worldwide to the problem of retrieving information of interest from large volumes of video. In video understanding and retrieval, text provides rich semantic information and is a key component: for example, captions in news programs, scores in sports broadcasts, and product names and manufacturers in advertisements. Many current video databases are indexed and retrieved through text annotations produced by manually annotating each picture. Manual text annotation is slow and tedious, so effective computer algorithms are needed to annotate video images automatically. With such algorithms, indexing and retrieval can be performed on features extracted directly from video images.
Text in video images falls into two classes: caption text and scene text. Caption text is added to the video image in post-production to help the viewer understand the content; such text therefore has high contrast and even illumination. Because caption text is added through deliberate post-production, it usually carries important information about the video content. Scene text appears as part of the video scene and is captured together with it; most scene text appears incidentally, along with the objects in the scene, for example text on road signs, shop names, text on characters' clothing, and text on billboards.
Existing methods for detecting text in video images fall roughly into four classes: edge-based methods, which exploit the strong contrast and rich edge information that text usually has against its background; connected-component-based methods, which exploit the fact that characters appear in rows and columns; corner-based methods, which exploit the abundance of corner points in text regions relative to the background; and texture-based methods, which slide a fixed-size window over the image and extract the mean, second-order central moment, and third-order central moment of each window as features.
Because characters in video vary in size, font, and color, traditional methods suffer from low efficiency, high computational complexity, and limited accuracy. The method presented here first performs a fast coarse detection on the video image using edge density, and then verifies the resulting candidate character regions with a sparse-representation-based classification method. Experimental results show that this method overcomes the shortcomings of the classic methods.
Summary of the invention
In view of the deficiencies of the prior art, the present invention aims to locate text regions under complex backgrounds effectively, accurately, and quickly with a coarse-to-fine detection method. To this end, a video image character detection method based on sparse representation is proposed.
To achieve the above goal, the technical scheme of the sparse-representation-based video image character detection method of the present invention is as follows. The method comprises the steps of video sequence preprocessing, coarse detection of video image character regions, and fine detection of video image characters. The concrete steps are:
Step S1, video sequence preprocessing: resample the video sequence to obtain color video images; convert the color video images to gray-level images; apply a multi-scale transform to the gray-level images to obtain multi-scale gray-level images;
Step S2, coarse detection of video image character regions: first apply the improved Sobel operator and a morphological closing operation to the multi-scale gray-level images to obtain edge images; then filter the edge images by edge density; finally obtain candidate character regions through connected-component analysis and rule-based analysis;
Step S3, fine detection of video image characters: first compute vertical and horizontal projections of the candidate character regions obtained by coarse detection and cut the projection profiles to obtain candidate text lines; then cut each candidate line into small regions with a sliding window and extract edge features from each small region; then classify each small region with a sparse-representation (Sparse Representation) based classification method to decide whether it is a character region; finally judge each candidate line from the per-region results to obtain and output the final text-line regions.
The effects of the invention are as follows. Compared with existing methods, the present invention can locate character regions quickly, with higher recall and precision, and can be applied in video classification and retrieval systems. The system adopts a coarse-to-fine multi-scale text detection framework: in the coarse stage, a fast edge-density filtering method removes most non-text regions; in the fine stage, a sparse-representation-based classification method distinguishes text regions from non-text regions, achieving higher precision; and the multi-scale processing detects characters of different sizes. The method therefore improves the accuracy of text region detection while remaining unaffected by font size, illumination, and similar factors, without sacrificing speed or recall.
Description of drawings
Fig. 1 is the framework diagram of the detection algorithm of the present invention.
Embodiment
The present invention is described in further detail below in conjunction with the drawings and a specific embodiment.
As shown in Fig. 1, the coarse-to-fine video image character detection method based on sparse representation of the present invention comprises the following steps:
1. Video sequence preprocessing.
(1) Video sequence resampling:
Statistically, text in a video image persists for at least tens of consecutive frames. Since two adjacent frames differ very little, running the same algorithm on both produces nearly identical results, so processing every frame independently is computationally inefficient. Therefore, while preserving the accuracy and performance of text detection and extraction, we resample the video sequence, taking 1 frame out of every 10. This improves the efficiency of the system severalfold without affecting the accuracy of the sampling.
(2) Converting the color image to a gray-level image:
The input color image is first converted to a gray-level image by formula (1):
f_g(x, y) = 0.3 R(x, y) + 0.59 G(x, y) + 0.11 B(x, y)    (1)
In formula (1), R(x, y), G(x, y), B(x, y) are the R, G, B components of the input color image, (x, y) are the pixel coordinates, and f_g(x, y) is the converted gray-level image.
Because the characters in a video image are not uniform in size, the gray-level image is transformed to multiple scales to detect characters of different sizes: the original image is decomposed into images of different resolutions, text detection is run at each resolution level, the results are mapped back into the original image, and the characters detected at different scales are merged. Small characters are detected on the higher-resolution sub-images, while larger characters are detected on the lower-resolution sub-images. Finally the results are integrated.
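As an illustrative sketch (not part of the original disclosure), formula (1) and the multi-scale decomposition can be written in Python/NumPy as follows; nearest-neighbour resampling and the scale set (0.3, 0.5, 0.7, 1.0) taken from the embodiment below are assumptions, since the patent does not fix the scaling method:

```python
import numpy as np

def to_gray(rgb):
    """Formula (1): f_g = 0.3*R + 0.59*G + 0.11*B."""
    return 0.3 * rgb[..., 0] + 0.59 * rgb[..., 1] + 0.11 * rgb[..., 2]

def pyramid(gray, scales=(0.3, 0.5, 0.7, 1.0)):
    """Multi-scale set built by nearest-neighbour resampling (an assumed
    stand-in for the patent's unspecified multi-scale transform)."""
    h, w = gray.shape
    out = {}
    for s in scales:
        ys = (np.arange(max(1, int(h * s))) / s).astype(int)
        xs = (np.arange(max(1, int(w * s))) / s).astype(int)
        out[s] = gray[np.ix_(ys, xs)]
    return out
```

Detection is then run on each entry of the pyramid, and boxes found at scale s are mapped back to the original image by dividing their coordinates by s.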
2. Coarse detection of video image character regions.
(1) Video image edge detection:
The multi-scale gray-level images obtained in step 1 above are processed with the improved Sobel operator edge detection of formula (2). Concretely, the image edges are computed with the operators of the four directions in Table 1 and formula (2). Table 1 is as follows:
[Table 1, giving the Sobel operators for the four directions, is provided only as the image Figure GSA00000087357100041 in the original publication.]
E(x, y) = max(|S_H|, |S_V|, |S_LD|, |S_RD|) + k × |S_⊥MAX|    (2)
In formula (2), max selects the maximum value; S_H, S_V, S_LD, S_RD are the Sobel edge strengths in the horizontal, vertical, left-diagonal, and right-diagonal directions respectively; S_⊥MAX is the gradient in the direction perpendicular to the direction of the greatest gradient; k ∈ (0, 1) is a fixed coefficient, here k = 0.5; and E(x, y) is the edge strength at coordinates (x, y). Because the computed E(x, y) may exceed 255, the values of E(x, y) are linearly rescaled into [0, 255] after the computation.
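A sketch of the improved Sobel operator of formula (2), under stated assumptions: the patent's Table 1 survives only as an image, so the four directional kernels below are the standard Sobel masks and their 45-degree rotations, and S_⊥MAX is approximated as the response of the direction perpendicular to the strongest one (H↔V, LD↔RD):

```python
import numpy as np

# Assumed directional kernels (the patent's Table 1 is not available as text).
KERNELS = {
    "H":  np.array([[-1, -2, -1], [0, 0, 0], [1, 2, 1]], float),
    "V":  np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], float),
    "LD": np.array([[-2, -1, 0], [-1, 0, 1], [0, 1, 2]], float),
    "RD": np.array([[0, -1, -2], [1, 0, -1], [2, 1, 0]], float),
}
PERP = {"H": "V", "V": "H", "LD": "RD", "RD": "LD"}
ORDER = ("H", "V", "LD", "RD")

def filter3(img, k):
    """Zero-padded 3x3 correlation."""
    p = np.pad(img.astype(float), 1)
    out = np.zeros(img.shape)
    for dy in range(3):
        for dx in range(3):
            out += k[dy, dx] * p[dy:dy + img.shape[0], dx:dx + img.shape[1]]
    return out

def improved_sobel(gray, k=0.5):
    """Formula (2): max of the four |S| responses plus k times the response of
    the perpendicular direction, then linearly rescaled into [0, 255]."""
    resp = {d: np.abs(filter3(gray, kern)) for d, kern in KERNELS.items()}
    stack = np.stack([resp[d] for d in ORDER])
    best = stack.argmax(axis=0)
    perp = np.choose(best, [resp[PERP[d]] for d in ORDER])
    e = stack.max(axis=0) + k * perp
    return e * (255.0 / e.max()) if e.max() > 0 else e
```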
(2) Morphological closing:
Because the edge-detected image contains noise, some character strokes are broken, leaving many small gaps and isolated points that would hinder the subsequent connected-component analysis. The isolated points must therefore be removed and the small gaps bridged. A morphological closing operation is applied to the edge image obtained by edge detection, which effectively removes the small gaps in the image.
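A minimal sketch of the closing step, assuming a 3×3 structuring element and a binarization threshold of 128 (neither is specified in the patent):

```python
import numpy as np

def _dilate3(b):
    """Dilation with a 3x3 structuring element."""
    p = np.pad(b, 1, constant_values=False)
    out = np.zeros_like(b)
    for dy in range(3):
        for dx in range(3):
            out |= p[dy:dy + b.shape[0], dx:dx + b.shape[1]]
    return out

def _erode3(b):
    """Erosion with a 3x3 structuring element (border treated as background)."""
    p = np.pad(b, 1, constant_values=False)
    out = np.ones_like(b)
    for dy in range(3):
        for dx in range(3):
            out &= p[dy:dy + b.shape[0], dx:dx + b.shape[1]]
    return out

def close_edges(edge, thresh=128):
    """Morphological close (dilate, then erode) of the binarized edge map;
    it bridges one-pixel stroke gaps before connected-component analysis."""
    return _erode3(_dilate3(edge >= thresh))
```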
(3) Edge density filtering:
Edge density filtering sets the edge strength of a pixel to zero when the edge density in the fixed M × N window centered on that pixel falls below a threshold, and leaves it unchanged when the density exceeds the threshold. Because of the stroke structure of characters, the edge density of character regions is high relative to that of background regions: within a fixed M × N window centered on a pixel, the number of edge pixels in a window inside a text region exceeds the number in a window inside a background (non-text) region.
The edge image obtained in step (2) is filtered by edge density. A new image FE of the same size as the original is created, with all of its pixels set to zero. Formula (3) computes, for each pixel (i, j) of the edge image, the edge sum EW(i, j) in the M × N window centered at (i, j). If EW(i, j) exceeds an empirical threshold T, T ∈ (0, S_MN), where S_MN is the area of the M × N window, the pixels in the window are copied to the corresponding positions of FE. The result is the edge density image FE.
EW(i, j) = Σ_{x=i−M/2}^{i+M/2} Σ_{y=j−N/2}^{j+N/2} E(x, y)    (3)
where E(x, y) is the edge strength at coordinates (x, y), the window size is M × N, and EW(i, j) is the edge density value in the fixed M × N window centered at (i, j).
To accelerate the computation of formula (3), we use formula (6). First, IE(x, y) is obtained by iterating formulas (4) and (5); IE(x, y) is the sum of all pixel values above and to the left of (x, y), i.e. IE(x, y) = Σ_{i≤x} Σ_{j≤y} E(i, j), where E(i, j) is the edge strength at coordinates (i, j). The edge density EW(i, j) is then computed by formula (6).
s(x, y) = s(x, y−1) + E(x, y)    (4)
IE(x, y) = IE(x−1, y) + s(x, y)    (5)
In formula (4), E(x, y) is the edge strength at coordinates (x, y), and s(x, y) is the accumulated edge strength of the points (x, 0), (x, 1), …, (x, y−1), (x, y). In formula (5), IE(x, y) is the sum of s(0, y), s(1, y), …, s(x−1, y), s(x, y). With the initial values s(x, −1) = 0 and IE(−1, y) = 0, iterating formulas (4) and (5) computes all values IE(x, y) in a single pass over the image, after which the edge density at any point can be computed quickly by formula (6),
EW(i, j) = IE(i+M/2, j+N/2) + IE(i−M/2, j−N/2) − (IE(i+M/2, j−N/2) + IE(i−M/2, j+N/2))    (6)
where EW(i, j) is the edge density value of the fixed M × N window centered at (i, j), and IE(x, y) is obtained by iterating formulas (4) and (5).
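The integral-image computation of formulas (4)-(6) can be sketched as follows. Note two stated deviations: this simplified variant keeps or zeroes individual pixels rather than copying whole windows into FE as the patent describes, and the default threshold T is an assumption (the patent constrains T only to the interval (0, S_MN)):

```python
import numpy as np

def window_sums(E, M, N):
    """EW(i, j): sum of E over the M x N window centred at (i, j), clipped at
    the image border, using the integral image of formulas (4)-(6)."""
    h, w = E.shape
    # One cumulative pass; the leading zero row/column stands in for IE(-1, .).
    IE = np.pad(E.cumsum(axis=0).cumsum(axis=1), ((1, 0), (1, 0)))
    i, j = np.arange(h), np.arange(w)
    y0 = np.clip(i - M // 2, 0, h); y1 = np.clip(i + M // 2 + 1, 0, h)
    x0 = np.clip(j - N // 2, 0, w); x1 = np.clip(j + N // 2 + 1, 0, w)
    return (IE[np.ix_(y1, x1)] - IE[np.ix_(y0, x1)]
            - IE[np.ix_(y1, x0)] + IE[np.ix_(y0, x0)])

def edge_density_filter(E, M=29, N=19, T=None):
    """Zero out pixels whose window edge sum does not exceed T."""
    if T is None:
        T = 0.05 * 255 * M * N  # assumed default; the patent gives no value
    return np.where(window_sums(E, M, N) > T, E, 0)
```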
(4) Connected-component analysis:
8-neighborhood connected-component labeling is applied to the image obtained in step (3), marking every region of connected pixels, i.e. every connected component.
(5) Rule-based analysis:
The connected-component analysis of step (4) yields many connected components. Geometric properties of each component, namely its size, area, aspect ratio, and edge-pixel ratio, are used to judge whether it is a character region or a non-text region, and non-text regions are discarded.
The remaining connected components are merged according to the area of their intersections, until no more components can be merged. The position and size of each text component are analyzed, and text components in the same row or same column are combined into candidate character regions.
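Steps (4) and (5) can be sketched together: flood-fill 8-neighbourhood labelling followed by simple geometric rules. The area and aspect-ratio thresholds below are illustrative; the patent names the properties but not their values:

```python
import numpy as np
from collections import deque

def label8(binary):
    """8-neighbourhood connected-component labelling by flood fill."""
    h, w = binary.shape
    labels = np.zeros((h, w), int)
    n = 0
    for si in range(h):
        for sj in range(w):
            if binary[si, sj] and labels[si, sj] == 0:
                n += 1
                labels[si, sj] = n
                q = deque([(si, sj)])
                while q:
                    i, j = q.popleft()
                    for di in (-1, 0, 1):
                        for dj in (-1, 0, 1):
                            a, b = i + di, j + dj
                            if 0 <= a < h and 0 <= b < w and binary[a, b] and labels[a, b] == 0:
                                labels[a, b] = n
                                q.append((a, b))
    return labels, n

def component_boxes(binary, min_area=12, max_aspect=20.0):
    """Bounding boxes (y, x, h, w) of components that pass the geometric rules."""
    labels, n = label8(binary)
    boxes = []
    for k in range(1, n + 1):
        ys, xs = np.nonzero(labels == k)
        hgt = int(ys.max() - ys.min() + 1)
        wid = int(xs.max() - xs.min() + 1)
        if hgt * wid >= min_area and max(hgt, wid) <= max_aspect * min(hgt, wid):
            boxes.append((int(ys.min()), int(xs.min()), hgt, wid))
    return boxes
```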
3. The candidate character regions obtained at the different scales are merged according to their geometric relationships: if the intersection of two candidate character regions exceeds a certain ratio, the two character regions are merged into one.
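A sketch of the cross-scale merge, assuming boxes given as (y, x, h, w) and an intersection threshold of 0.5 of the smaller box's area (the patent says only "greater than a certain ratio"):

```python
def merge_boxes(boxes, ratio=0.5):
    """Greedily merge boxes whose intersection exceeds `ratio` of the smaller
    box's area; repeat until no merge applies."""
    boxes = list(boxes)
    changed = True
    while changed:
        changed = False
        for a in range(len(boxes)):
            for b in range(a + 1, len(boxes)):
                ya, xa, ha, wa = boxes[a]
                yb, xb, hb, wb = boxes[b]
                ih = min(ya + ha, yb + hb) - max(ya, yb)
                iw = min(xa + wa, xb + wb) - max(xa, xb)
                inter = max(0, ih) * max(0, iw)
                if inter > ratio * min(ha * wa, hb * wb):
                    y0, x0 = min(ya, yb), min(xa, xb)
                    y1 = max(ya + ha, yb + hb)
                    x1 = max(xa + wa, xb + wb)
                    boxes[a] = (y0, x0, y1 - y0, x1 - x0)  # union box
                    del boxes[b]
                    changed = True
                    break
            if changed:
                break
    return boxes
```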
4. Fine detection of character regions:
(1) The candidate character regions obtained in step 3 are cut: vertical and horizontal projections of each candidate character region are computed, and the region is cut according to the projection profiles, thereby locating the candidate text lines in the picture.
(2) Verification of the candidate text lines located in step (1): correctly judged text lines are kept, and wrongly judged text lines are filtered out. Each text line is cut into small regions by a sliding window, and features are extracted from each small region. The small regions are classified with a sparse-representation (Sparse Representation) based classification method, which consists of a training process and a judging process:
The training process is performed in advance. A large number of positive and negative samples of character regions are chosen and trained with the K-SVD algorithm (k-means clustering combined with singular value decomposition), yielding a positive dictionary D_P and a negative dictionary D_N; the dictionary is D = {D_P, D_N};
In the judging process, the output is denoted Z(w). Each character region w detected in step (1) is judged by its reconstruction errors under the positive and negative dictionaries: if the reconstruction error under the positive dictionary is smaller, the region is judged a correct character region and the output is +1; otherwise, if the reconstruction error under the positive dictionary is larger than that under the negative dictionary, the region is judged a wrongly detected character region and the output is −1. Formula (7) is then used to judge each text line, filtering out wrongly judged lines and keeping correct ones. In formula (7), R denotes a text line, w denotes an N × N window, Z(w) denotes the judgment of the sparse-representation-based classification method on the image region in window w, d_w denotes the distance from the center of window w to the center of text line R, σ_0 ∈ (0, +∞) is a variable, and C(R) is the classification result of text line R: if C(R) is greater than zero, R is a correct text line; otherwise R is a wrongly judged text line and is filtered out.
In the present embodiment, the detailed process is as follows.
Training process: The text lines from step (1) are normalized to sample height H, and Canny edge detection is applied. A sliding window of size N × N with step k cuts each text line, and each N × N image block is converted into a vector y ∈ R^(N²). The positive dictionary D_P is trained with the K-SVD algorithm; background regions are chosen as negative samples to train the negative dictionary D_N. The positive and negative dictionaries are merged into D = {D_P, D_N}.
Judging process: Samples are normalized to height H as in training, and a sliding window w of size N × N with step k cuts each text line; each N × N image block is converted into a vector y ∈ R^(N²). The sparse coefficients x = {x_P, x_N} are obtained by the Matching Pursuit algorithm, and the errors E_P = ||y − D_P x_P||₂ and E_N = ||y − D_N x_N||₂ are computed. If E_P > E_N, the test sample y is a negative sample, the region in the corresponding window is a wrongly detected character region, and the output is −1; if E_P ≤ E_N, y is a positive sample, the region in the corresponding window is a correct character region, and the output is +1. The outputs form the sequence Z(w). For a text line R, windows closer to the line center contribute more to the judgment that R is text, so formula (7) is used to judge the line, filtering out wrongly judged lines and keeping correct ones. In the formula, R denotes a text line, w denotes an N × N window, Z(w) denotes the sparse-representation classifier's judgment on the image region in window w, d_w is the distance from the center of window w to the center of text line R, σ_0 ∈ (0, +∞) is a variable, and C(R) is the classification result of candidate line R: if C(R) is greater than zero, line R is judged a correct text line and is kept and output; otherwise, if C(R) is less than zero, line R is a wrongly judged text line and is filtered out;
C(R) = Σ_{w⊆R} Z(w) · (1 / (√(2π) σ_0)) exp(−d_w² / (2σ_0²))    (7)
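A sketch of the judging process and the line vote of formula (7), assuming the dictionaries D_P and D_N are already trained (the patent trains them with K-SVD, which is not reproduced here) and σ_0 = 8 as an arbitrary setting:

```python
import numpy as np

def matching_pursuit(y, D, n_iter=5):
    """Greedy matching pursuit: approximate y ~ D @ x, picking at each step the
    unit-norm column of D most correlated with the residual."""
    r = y.astype(float).copy()
    x = np.zeros(D.shape[1])
    for _ in range(n_iter):
        corr = D.T @ r
        k = np.abs(corr).argmax()
        x[k] += corr[k]
        r -= corr[k] * D[:, k]
    return x

def classify(y, Dp, Dn, n_iter=5):
    """+1 (text) if the positive dictionary reconstructs y with error no larger
    than the negative one, else -1, matching E_P <= E_N in the judging process."""
    ep = np.linalg.norm(y - Dp @ matching_pursuit(y, Dp, n_iter))
    en = np.linalg.norm(y - Dn @ matching_pursuit(y, Dn, n_iter))
    return 1 if ep <= en else -1

def line_score(z, d, sigma=8.0):
    """Formula (7): Gaussian-weighted vote of window labels z at distances d
    from the line centre; the line is kept when C(R) > 0."""
    z, d = np.asarray(z, float), np.asarray(d, float)
    w = np.exp(-d ** 2 / (2 * sigma ** 2)) / (np.sqrt(2 * np.pi) * sigma)
    return float((z * w).sum())
```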
Below, the method is implemented in C++ on a personal computer under Windows XP, using object-oriented methods and software engineering standards. We tested on a segment of Chinese news video with resolution 480 × 360. The video sequence is resampled, taking 1 frame out of every 10; the resulting video images are converted to gray-level images by formula (1); the gray-level images are then scaled by factors of 0.3, 0.5, 0.7, and 1 by the multi-scale transform, giving the multi-scale gray-level images. Sobel operator edge detection is applied to the multi-scale gray-level images by formula (2) to obtain edge images, and the edge values are normalized to [0, 255]. A morphological closing operation is then applied to the edge images, followed by the fast edge density filter with window size 29 × 19. Connected-component analysis of the density-filtered images yields the connected components, which are filtered and merged using the geometric rules to obtain candidate character regions. The text blocks obtained at different scales are merged according to their geometric relationships, and fine detection is then applied to the candidate character regions: vertical and horizontal projections cut them into candidate text lines. The candidate lines are normalized to a height of 16 pixels; a sliding window of size 16 × 16 with step 8 is chosen; Canny edge detection is applied to the image in each window to obtain the edge strengths, giving a 256-dimensional feature vector. Using the trained dictionaries D_P and D_N, the Matching Pursuit (MP) algorithm computes the positive-dictionary coefficients x_P and negative-dictionary coefficients x_N, and the errors E_P = ||y − D_P x_P||₂ and E_N = ||y − D_N x_N||₂ are computed. If E_P > E_N, the test sample y is a negative sample, the region in the corresponding window is a wrongly detected character region, and the output is −1; if E_P ≤ E_N, y is a positive sample, the region in the corresponding window is a correct character region, and the output is +1. The outputs form the sequence Z(w), and the text lines are judged by formula (7): if C(R) is greater than zero, text line R is judged a correct text line; otherwise it is a wrongly judged line and is filtered out. Finally, the correctly judged text-line regions are output.
Experimental result
Table 2: Text detection experimental results based on sparse representation. [The table itself is provided only as the image Figure GSA00000087357100081 in the original publication.]
In summary, the present invention fully balances the performance and speed of video image character detection; it locates text regions quickly and accurately, is unaffected by font size and language, and has strong generality. It can provide useful support for the classification, retrieval, and similar processing of video images.

Claims (5)

1. A video image character detection method based on sparse representation, characterized in that the method comprises the steps of video sequence preprocessing, coarse detection of video image character regions, and fine detection of video image characters, the concrete steps being:
Step S1, video sequence preprocessing: resample the video sequence to obtain color video images; convert the color video images to gray-level images; apply a multi-scale transform to the gray-level images to obtain multi-scale gray-level images;
Step S2, coarse detection of video image character regions: first apply the improved Sobel operator and a morphological closing operation to the multi-scale gray-level images to obtain edge images; then filter the edge images by edge density; finally obtain candidate character regions through connected-component analysis and rule-based analysis;
Step S3, fine detection of video image characters: first compute vertical and horizontal projections of the candidate character regions obtained by coarse detection and cut the projection profiles to obtain candidate text lines; then cut each candidate line into small regions with a sliding window and extract edge features from each small region; then classify each small region with a sparse-representation-based classification method to decide whether it is a character region; finally judge each candidate line from the per-region results to obtain and output the final text-line regions.
2. The video image character detection method of claim 1, characterized in that the edge detection uses the improved Sobel algorithm E(x, y) = max(|S_H|, |S_V|, |S_LD|, |S_RD|) + k × |S_⊥MAX|, where E(x, y) is the edge strength at gray-level image coordinates (x, y); the Sobel edge strengths in the four directions of the gray-level image are the horizontal S_H, vertical S_V, left-diagonal S_LD, and right-diagonal S_RD; max selects the maximum of the Sobel edge strengths; S_⊥MAX is the gradient in the direction perpendicular to the direction of the greatest gradient of the gray-level image; and k ∈ (0, 1).
3. The video image character detection method of claim 1, characterized in that the edge density is the sum of edge values within a fixed window of size M × N centered on a pixel of the edge image, computed according to the formulas:
s(x, y) = s(x, y−1) + E(x, y),
IE(x, y) = IE(x−1, y) + s(x, y),
where s(x, y) is the accumulated edge strength of the points (x, 0), (x, 1), …, (x, y−1), (x, y); E(x, y) is the edge strength at coordinates (x, y); and IE(x, y) is the sum of s(0, y), s(1, y), …, s(x−1, y), s(x, y). Iterating the formulas for s(x, y) and IE(x, y) with initial values s(x, −1) = 0 and IE(−1, y) = 0 computes all values IE(x, y) in a single pass, after which the formula
EW(i, j) = IE(i+M/2, j+N/2) + IE(i−M/2, j−N/2) − (IE(i+M/2, j−N/2) + IE(i−M/2, j+N/2))
computes the edge density value at any point of the edge image.
4. The video image character detection method of claim 1, characterized in that each small region is classified with the sparse-representation-based classification method, which comprises a training step and a judging step, as follows:
Training step: positive and negative samples of small regions are chosen in advance and trained to obtain a positive dictionary and a negative dictionary;
Judging step: a small region is judged by its reconstruction errors under the positive and negative dictionaries: if the reconstruction error under the positive dictionary is smaller than that under the negative dictionary, the region is judged a correct character region; otherwise, if the reconstruction error under the positive dictionary is larger than that under the negative dictionary, the region is judged a wrongly detected character region.
5. The video image character detection method of claim 1, characterized in that the candidate character lines are judged using the formula
C(R) = Σ_{w⊆R} Z(w) · (1 / (√(2π) σ_0)) exp(−d_w² / (2σ_0²)),
filtering out wrongly judged text lines and keeping correct ones, where R denotes a text line, w denotes an N × N window, Z(w) denotes the judgment of the sparse-representation-based classification method on the image region in window w, d_w is the distance from the center of window w to the center of text line R, σ_0 ∈ (0, +∞) is a variable, and C(R) is the classification result of candidate line R: if C(R) is greater than zero, line R is a correct text line and is kept and its region output; otherwise, if C(R) is less than zero, line R is a wrongly judged text line and is filtered out.
CN 201010151779 2010-04-21 2010-04-21 Video image character detecting method based on sparse expression Pending CN101833664A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN 201010151779 CN101833664A (en) 2010-04-21 2010-04-21 Video image character detecting method based on sparse expression


Publications (1)

Publication Number Publication Date
CN101833664A true CN101833664A (en) 2010-09-15

Family

ID=42717727

Family Applications (1)

Application Number Title Priority Date Filing Date
CN 201010151779 Pending CN101833664A (en) 2010-04-21 2010-04-21 Video image character detecting method based on sparse expression

Country Status (1)

Country Link
CN (1) CN101833664A (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101448100A (en) * 2008-12-26 2009-06-03 西安交通大学 Method for extracting video captions quickly and accurately
CN101453575A (en) * 2007-12-05 2009-06-10 中国科学院计算技术研究所 Video subtitle information extracting method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Zhao Ming et al., "Sparse Representation Classification for Image Text Detection," 2009 Second International Symposium on Computational Intelligence and Design, Changsha, China, 12-14 December 2009, pp. 76-79. *

Cited By (38)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102208023B (en) * 2011-01-23 2013-05-08 浙江大学 Method for recognizing and designing video captions based on edge information and distribution entropy
CN102208023A (en) * 2011-01-23 2011-10-05 浙江大学 Method for recognizing and designing video captions based on edge information and distribution entropy
CN102306280A (en) * 2011-07-12 2012-01-04 央视国际网络有限公司 Method and device for detecting video scores
CN102306279A (en) * 2011-07-12 2012-01-04 央视国际网络有限公司 Method for identifying video scores and device
CN102306280B (en) * 2011-07-12 2014-04-02 央视国际网络有限公司 Method and device for detecting video scores
CN103020618B (en) * 2011-12-19 2016-03-16 北京捷成世纪数码科技有限公司 The detection method of video image character and system
CN103020618A (en) * 2011-12-19 2013-04-03 北京捷成世纪科技股份有限公司 Detection method and detection system for video image text
CN102750540A (en) * 2012-06-12 2012-10-24 大连理工大学 Morphological filtering enhancement-based maximally stable extremal region (MSER) video text detection method
CN102750540B (en) * 2012-06-12 2015-03-11 大连理工大学 Morphological filtering enhancement-based maximally stable extremal region (MSER) video text detection method
CN102799879B (en) * 2012-07-12 2014-04-02 中国科学技术大学 Method for identifying multi-language multi-font characters from natural scene image
CN102799879A (en) * 2012-07-12 2012-11-28 中国科学技术大学 Method for identifying multi-language multi-font characters from natural scene image
CN102831402A (en) * 2012-08-09 2012-12-19 西北工业大学 Sparse coding and visual saliency-based method for detecting airport through infrared remote sensing image
CN102831402B (en) * 2012-08-09 2015-04-08 西北工业大学 Sparse coding and visual saliency-based method for detecting airport through infrared remote sensing image
CN103632159A (en) * 2012-08-23 2014-03-12 阿里巴巴集团控股有限公司 Method and system for training classifier and detecting text area in image
CN103632159B (en) * 2012-08-23 2017-05-03 阿里巴巴集团控股有限公司 Method and system for training classifier and detecting text area in image
CN103473769A (en) * 2013-09-05 2013-12-25 东华大学 Fabric flaw detection method based on singular value decomposition
CN103473769B (en) * 2013-09-05 2016-01-06 东华大学 A kind of fabric defects detection method based on svd
CN103839062A (en) * 2014-03-11 2014-06-04 东方网力科技股份有限公司 Image character positioning method and device
CN103839062B (en) * 2014-03-11 2017-08-08 东方网力科技股份有限公司 A kind of pictograph localization method and device
CN107016392A (en) * 2016-01-27 2017-08-04 四川效率源信息安全技术股份有限公司 A kind of method of text border in removal picture
CN107230200A (en) * 2017-05-15 2017-10-03 东南大学 A kind of method for extracting rotor coil contour feature
CN107302718B (en) * 2017-08-17 2019-12-10 河南科技大学 video subtitle area positioning method based on angular point detection
CN107302718A (en) * 2017-08-17 2017-10-27 河南科技大学 A kind of video caption area positioning method based on Corner Detection
CN107688788A (en) * 2017-08-31 2018-02-13 平安科技(深圳)有限公司 Document charts abstracting method, electronic equipment and computer-readable recording medium
CN107688788B (en) * 2017-08-31 2021-01-08 平安科技(深圳)有限公司 Document chart extraction method, electronic device and computer readable storage medium
CN108256518A (en) * 2017-11-30 2018-07-06 北京元心科技有限公司 Detection method and detection device for character region
CN108256518B (en) * 2017-11-30 2021-07-06 北京元心科技有限公司 Character area detection method and device
CN109359644A (en) * 2018-08-28 2019-02-19 东软集团股份有限公司 Character image uniformity comparison method, apparatus, storage medium and electronic equipment
CN109299682A (en) * 2018-09-13 2019-02-01 北京字节跳动网络技术有限公司 Video text detection method, device and computer readable storage medium
CN109460768A (en) * 2018-11-15 2019-03-12 东北大学 A kind of text detection and minimizing technology for histopathology micro-image
CN109460768B (en) * 2018-11-15 2021-09-21 东北大学 Text detection and removal method for histopathology microscopic image
CN110059647A (en) * 2019-04-23 2019-07-26 杭州智趣智能信息技术有限公司 A kind of file classification method, system and associated component
CN113297875A (en) * 2020-02-21 2021-08-24 华为技术有限公司 Video character tracking method and electronic equipment
CN113297875B (en) * 2020-02-21 2023-09-29 华为技术有限公司 Video text tracking method and electronic equipment
CN112668468A (en) * 2020-12-28 2021-04-16 北京翰立教育科技有限公司 Photographing evaluation method and device
CN114782950A (en) * 2022-03-30 2022-07-22 慧之安信息技术股份有限公司 2D image text detection method based on Chinese character stroke characteristics
CN114782950B (en) * 2022-03-30 2022-10-21 慧之安信息技术股份有限公司 2D image text detection method based on Chinese character stroke characteristics
CN116152818A (en) * 2023-02-16 2023-05-23 中国工业互联网研究院 Method and system for improving identification accuracy of text lines of rotation image

Similar Documents

Publication Publication Date Title
CN101833664A (en) Video image character detecting method based on sparse expression
KR101856120B1 (en) Discovery of merchants from images
Huang et al. A new building extraction postprocessing framework for high-spatial-resolution remote-sensing imagery
Zhang et al. MCnet: Multiple context information segmentation network of no-service rail surface defects
CN105608456B (en) A kind of multi-direction Method for text detection based on full convolutional network
CN107133955B (en) A kind of collaboration conspicuousness detection method combined at many levels
CN106610969A (en) Multimodal information-based video content auditing system and method
CN108875600A (en) A kind of information of vehicles detection and tracking method, apparatus and computer storage medium based on YOLO
CN106408030B (en) SAR image classification method based on middle layer semantic attribute and convolutional neural networks
CN103824079B (en) Multi-level mode sub block division-based image classification method
CN104978567B (en) Vehicle checking method based on scene classification
CN104166841A (en) Rapid detection identification method for specified pedestrian or vehicle in video monitoring network
CN105574063A (en) Image retrieval method based on visual saliency
CN111967313B (en) Unmanned aerial vehicle image annotation method assisted by deep learning target detection algorithm
CN102968637A (en) Complicated background image and character division method
CN107480607B (en) Method for detecting and positioning standing face in intelligent recording and broadcasting system
CN105608454A (en) Text structure part detection neural network based text detection method and system
CN104751153B (en) A kind of method and device of identification scene word
CN106055653A (en) Video synopsis object retrieval method based on image semantic annotation
CN101719142A (en) Method for detecting picture characters by sparse representation based on classifying dictionary
CN103473545A (en) Text-image similarity-degree measurement method based on multiple features
CN103530638A (en) Method for matching pedestrians under multiple cameras
CN105912739B (en) A kind of similar pictures searching system and its method
Wu et al. Recognition of Student Classroom Behaviors Based on Moving Target Detection.
CN106156691A (en) The processing method of complex background image and device thereof

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C02 Deemed withdrawal of patent application after publication (patent law 2001)
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20100915