CN108805102A

CN108805102A - A kind of video caption detection and recognition methods and system based on deep learning

Info

Publication number: CN108805102A
Application number: CN201810690120.7A
Authority: CN
Inventors: 孙宏亮; 程国艮
Original assignee: Chinese Translation Language Through Polytron Technologies Inc
Current assignee: Chinese Translation Language Through Polytron Technologies Inc
Priority date: 2018-06-28
Filing date: 2018-06-28
Publication date: 2018-11-13

Abstract

The invention belongs to computer software technical fields, disclose it is a kind of based on deep learning video caption detection with recognition methods and system, deep learning theory of algorithm is applied to videotext zone location and identification process, video image is filtered by Gabor filter, obtains the textural characteristics of word in video image text；Using textural characteristics as training sample, incremental learning successively is carried out to texture image using limited Boltzmann machine, using morphological method to bianry image denoising, re-maps on positioning image, obtain the text image for only removing background area comprising text filed.The present invention uses the method that 2D-Gabor filters are combined with deep learning algorithm, realize the positioning to complex background text area in video, and it optimizes based on morphologic video image denoising method, the identification for realizing character by OCR system again, improves the accuracy rate of OCR system character recognition.

Description

A kind of video caption detection and recognition methods and system based on deep learning

Technical field

The invention belongs to computer software technical field more particularly to it is a kind of based on deep learning video caption detection with Recognition methods and system.

Background technology

Currently, the prior art commonly used in the trade is such：

With internet video content be continuously increased and digital library, video on demand, remote teaching etc. are a large amount of Multimedia application, required data how is retrieved in massive video seems most important.

Traditional video frequency searching based on keyword description is because of originals such as descriptive power are limited, subjectivity is strong, marks by hand Cause cannot meet the needs of massive video retrieval.Therefore since the 1990s, content based video retrieval system skill Art becomes the hot issue of research, and subtitle recognition technology is exactly to realize the key technology of video frequency searching, if it is possible to automatic to know Subtitle in other video, then can obtain the text message of reflecting video content, and base can be realized by these text messages In the video frequency searching of inquiry.So the technology is the key technology of next-generation search engine, there is highly important research and answer With value.

The detection and identification of video caption are the key technologies of videotext processing, are especially handled in foreign language video translation Process, local-caption extraction greatly facilitates effect with identification for complicated translation, and translator need not opposite video Check and the work of manual extraction subtitle makes translator's working efficiency obtain matter to greatly liberate translator Promotion.

This programme uses the recognition methods based on deep learning, the text location that can be solved under complicated High-speed Circumstance accurate Spend low, the problems such as text location and recognition speed are slow, there is efficient, high speed, can iteration, the features such as discrimination is high.

In conclusion problem of the existing technology is：

(1) traditional video frequency searching based on keyword description is because descriptive power is limited, subjectivity is strong, marks by hand etc. Reason cannot meet the needs of massive video retrieval.

(2) prior art is not used detection and partitioning algorithm based on edge, cannot be made full use of on local-caption extraction The redundancy of video in time carries out secondary filter to improve accuracy rate.In subtitle recognition, the prior art does not use base Judge the color of video caption in the method for connected region statistics, the two-value of gray scale picture is carried out based on partial sweep window Change, is come out the Text region in image by the method for artificial intelligence deep learning, it cannot be in the detection and knowledge of video caption The effect that Shang do not obtained.

(3) traditional technology based on pattern-recognition cannot be satisfied more scenes, high complexity situation due to technical reason Under correct identification, different scenes just needs to switch different algorithmic approach, and human input cost is huge, and effect is also bad.

Solve the difficulty and meaning of above-mentioned technical problem：

Text in video can give video frequency searching and index to provide important auxiliary information, sometimes the text packet in video The no information in other places, such as the subtitle of film leader are contained, the text in video is a kind of important and succinct sometimes Score, stock price in auxiliary information, such as sports tournament.If the text in video can be efficiently extracted and be known Not, then many high-level applications, such as video frequency abstract, artificial intelligence identification can be better achieved

Due to the size of word in complicated video image, style, color, font etc. is complicated and changeable, and there is presently no one kind Algorithm it is achieve the effect that in various applications it is satisfied, generally require several method to be used in combination with.

Invention content

In view of the problems of the existing technology, the present invention provides a kind of video caption detection and knowledge based on deep learning Other method and system.

The invention is realized in this way a kind of video caption detection and recognition methods based on deep learning, be set forth in depth The video caption of degree study, which is detected with recognition methods, includes：

Video image is filtered by Gabor filter, obtains the textural characteristics of word in video image text；

Then, using textural characteristics as training sample, successively texture image is increased using limited Boltzmann machine RBM Amount study；In learning process, marker samples is used to carry out network fine tuning as monitoring data, constitutes deep learning network DBN, and mark Remember text filed and background area bianry image；

Later, bianry image denoising is re-mapped on positioning image using morphological method, obtains only including text One's respective area and the text image for removing background area,

Finally, then by image binaryzation, gray scale subsequent processing are carried out, is sent to OCR character recognition systems and knows into line character Not.

Further, in video image being filtered by Gabor filter, using two-dimensional Gabor filter to video image It is filtered on different scale and direction, two-dimensional Gabor function is

G (x, y)=Kexp {-π [p²(x-x₀)²+q²(y-y₀)²]}

·exp{-2πj[u₀(x-x₀)+v₀(y-y₀)]}

Fourier transformation form

K is the amplitude of Gauss kernel functions in formula；(x0, y0) is the center of gaussian kernel function；(u0, v0) is modulation The center of frequency；(p, q) is the scale parameter of Gauss kernel functions；

If the peak position (x0, y0) of Gauss envelope functions is (0,0), selected by calculating filtering parameter p and q Gabor filter；

The filtering parameter p and q of filter is calculated by lower formula：

Uh and UI is respectively high-frequency center and the low frequency center in texture image region；T is direction number；M is scale parameter； λ is the period of Gabor filter.

Further, deep learning network DBN learning methods include：Unsupervised learning is used for the pre-training of each layer network； Every time with unsupervised learning only wherein one layer of training, using training result as high one layer of input two with from the lower supervision calculation in top Method goes to adjust all layers；

Assuming that node all in RBM models is all random binary (0,1) variable node, while assuming that full probability is distributed P (v, h) meets Boltzmann distributions, and in the case of known v, θ={ W, a, b } is parameter sets, visible elements and hiding section The bias vector of point indicates that then probability of the RBM at state θ is with a and b

Z (θ) is normalization factor ,-E (v, h in formula；It is θ) partition function, on the basis of given hidden layer, visual layers Probability be P (v | h), multiple limited Boltzmann machines combinations are built one by bottom-up.

Further, successively texture image is carried out in incremental learning using limited Boltzmann machine RBM, including：

DBN networks needs are trained to obtain best weight value, first carry out successively increment using RBM to texture template image Study, weights in network are constantly adjusted using maximum likelihood estimate, and RBM is made to reach energy balance, then with monitoring data, right Entire DBN networks are finely adjusted；During unsupervised learning, corresponding one layer of the node of each state value in DBN networks, The inputoutput data of calculating is all the probability value that corresponding node state value is 1, and H0 layers of input vector is each literal field The texture sample in domain, after alternate gibbs sampler, the input as DBN networks.

Further, image is subjected to binaryzation, gray scale subsequent processing, is sent to OCR character recognition systems and knows into line character Not, including：

The text filed positioning of video image goes out corresponding top-level feature from bottom Feature Mapping, maps layer by layer successively, directly To obtaining the result of top；

By text filed to DBN networks and after Morphological scale-space, binary conversion treatment is carried out, what removal was connected with boundary Textview field background black white reverse is then sent through OCR software and is identified by region.

Another object of the present invention is to provide the video caption detections and identification based on deep learning described in a kind of realize The computer program of method XXX methods.

Another object of the present invention is to provide the video caption detections and identification based on deep learning described in a kind of realize The information data processing terminal of method.

Another object of the present invention is to provide a kind of computer readable storage mediums, including instruction, when it is in computer When upper operation so that computer executes the video caption detection and recognition methods based on deep learning.

Another object of the present invention is to provide the video caption detections and identification based on deep learning described in a kind of realize The video caption based on deep learning of method detects and identifying system, including：

Textural characteristics acquisition module is obtained for filtering video image by Gabor filter in video image text The textural characteristics of word；

Deep learning network DBN constitutes module, for using textural characteristics as training sample, utilizing limited Boltzmann machine RBM successively carries out incremental learning to texture image；In learning process, marker samples is used to carry out network fine tuning as monitoring data, Constitute deep learning network DBN, and the bianry image in retrtieval region and background area；

Text image acquisition module, for, to bianry image denoising, re-mapping positioning figure using morphological method As upper, the text image that background area is only removed comprising text filed is obtained,

Character recognition module is sent to OCR character recognition systems for image to be carried out binaryzation, gray scale subsequent processing Carry out character recognition.

Another object of the present invention is to provide the video caption detections and identification based on deep learning described in a kind of realize The video frequency search system of method.

In conclusion advantages of the present invention and good effect are：

On local-caption extraction, the present invention has used detection and partitioning algorithm based on edge, and make full use of video when Between on redundancy carry out secondary filter to improve accuracy rate, accuracy rate has been increased to 98.5%, is positioned from more scenes Accuracy rate is more stable, compares original out-of-date methods performance boost based on pattern-recognition percent 30.

In subtitle recognition, the color of video caption is judged with the method counted based on connected region first, then base The binaryzation of gray scale picture is carried out in partial sweep window, finally by the method for artificial intelligence deep learning by the text in image Word identifies, and achieves extraordinary effect in the detection and identification of video caption.

Test data shows that the increase with the network number of plies, the accuracy of DBN networks step up, and network approaches energy Power gradually enhances, and still, with the increase of the network number of plies, the complexity of network also can constantly increase, the extensive power meeting of network It gradually reduces, so being not that the network number of plies is The more the better.Test indicate that 4-DBN networks disclosure satisfy that text filed need It asks.

The feelings such as video frame images, font size, font color, uniline or multirow by 100 width different backgrounds of selection Under condition, is positioned and compared to text filed using 4 kinds of distinct methods as above, test result such as table

Description of the drawings

Fig. 1 is the video caption detection provided in an embodiment of the present invention based on deep learning and recognition methods flow chart.

Fig. 2 is Cyclic Operation Network schematic diagram provided in an embodiment of the present invention.

Fig. 3 is DBN network trainings flow chart provided in an embodiment of the present invention.

Fig. 4 is the video caption detection provided in an embodiment of the present invention based on deep learning and identifying system schematic diagram.

In figure：1, textural characteristics acquisition module；2, deep learning network DBN constitutes module；3, text image acquisition module； 4, character recognition module.

Specific implementation mode

In order to make the purpose , technical scheme and advantage of the present invention be clearer, with reference to embodiments, to the present invention It is further elaborated.It should be appreciated that the specific embodiments described herein are merely illustrative of the present invention, it is not used to Limit the present invention.

Video caption detection and recognition methods provided in an embodiment of the present invention based on deep learning, are filtered using 2D-Gabor The method that wave device is combined with deep learning algorithm realizes the positioning to complex background text area in video, and optimizes base In morphologic video image denoising method, then by the identification of OCR system realization character, to improve OCR system character recognition Accuracy rate.

Such as Fig. 1, deep learning theory of algorithm is applied to videotext zone location and identification process, devised by the present invention A kind of successively increment depth learning algorithm based on textural characteristics.First, video image is filtered by Gabor filter, is obtained Obtain the textural characteristics of word in video image text；Then, using textural characteristics as training sample, limited Boltzmann machine is utilized (restrictedboltzmannmachine, RBM) successively carries out incremental learning to texture image, in learning process, with mark Remember that sample carries out network fine tuning as monitoring data, constitutes deep learning network (deepbeliefnetwork, DBN), and mark Text filed and background area bianry image；Later, using morphological method to bianry image denoising, it is fixed to re-map On bit image, obtain the text image that background area is only removed comprising text filed, finally, then by image carry out binaryzation, The subsequent processings such as gray scale are sent to OCR character recognition systems and carry out character recognition.

With reference to concrete analysis, the invention will be further described.

(1) caption area detects

During the entire process of local-caption extraction and identification, detection is that the first step is also a relatively difficult step, major embodiment ?：Very greatly, video caption background is very complicated, and the contrast of word and background has for the size of video caption, color and style variation When be not obvious.And subtitle will be correctly validated just must have differentiation with background, that is, need to present certain edge spy It seeks peace intensity band, so by being analyzed video frame edge strength to detect subtitle be an effective method.

(1) successively increment depth learns Video Text Location algorithm

The texture of character has periodically, and in certain frequency range self-energy Relatively centralized, it is possible to utilize Two-dimensional Gabor filter is filtered video image on different scale and direction, and Gabor filter theory can be well Description is defined corresponding to the partial structurtes information two-dimensional Gabor function of spatial frequency (scale), spatial position and set direction

Its Fourier transformation form

K is the amplitude of Gauss kernel functions in formula；(x0, y0) is the center of gaussian kernel function；(u0, v0) is modulation The center of frequency；(p, q) is the scale parameter of Gauss kernel functions.If the peak position (x0, y0) of Gauss envelope functions be (0, 0), Gabor filter is selected by calculating filtering parameter p and q.

The filtering parameter p and q of filter can be calculated by figure below formula：

Uh and UI is respectively high-frequency center and the low frequency center in texture image region；T is direction number；M is scale parameter； λ is the period of Gabor filter.It is mainly made of horizontal, slash, perpendicular, 4 kinds of basic strokes of right-falling stroke in view of Chinese character, so Gabor is filtered Wave device is required to reflect the stroke feature of Chinese character on this 4 directions, and is required to ensure to this 4 direction lines The frequency component in reason region has good response.

(2) structure of deep learning network (DBN)

Deep learning is a new problem in machine learning research field, and its object is to establish, simulate human brain to carry out The neural network of analytic learning.Deep learning algorithm is by the system on deep belief network (depthbeliefnetwork, DBN) Row are limited the probabilistic model composition of Boltzmann machine (restrictedboltzmannmachine, RBM).Deep learning algorithm one As to describe process as follows:Assuming that there are one system S, it has n-layer S1, S2 ..., Sn, if input is I, exports as O, the one of study As procedural representation be:I S1 S2 ..., Sn O input I by not having after this system change if output O is equal to input I There are any information loss or loss very little, be considered as being kept essentially constant, it means that input I passes through each layer Si, all almost without the loss of information, i.e., any one layer of Si is another expression of original information (inputting I).Depth The core ideas of learning algorithm has:1. unsupervised learning is used for the pre-training of each layer network；2. every time only with unsupervised learning Wherein one layer of training, using its training result as its high one layer of input；3. with from top and under supervision algorithm go adjustment all Layer.

Such as Fig. 2, it is assumed that all nodes are all random binary (0,1) variable nodes in RBM models, while assuming full probability Distribution P (v, h) meets Boltzmann distributions, is conditional sampling between all concealed nodes in the case of known v The energy of the joint configuration of Boltzmann machines can be expressed as

(3) network training and weighed value adjusting

DBN networks needs are trained to obtain best weight value, and usual DBN network trainings include bottom-up non-supervisory It practises and two parts of top-down supervised learning.

Such as Fig. 3, process is first to carry out successively incremental learning using RBM to texture template image, is estimated using maximum likelihood Meter method constantly adjusts weights in network, and RBM is made to reach energy balance, then with monitoring data, is carried out to entire DBN networks micro- Adjust during unsupervised learning, corresponding one layer of the node of each state value, the input and output number of calculating in DBN networks According to being all probability value that corresponding node state value is " 1 ", and H0 layer of input vector is the texture sample of each character area, logical After crossing alternate gibbs sampler, the input as DBN networks.If deep learning network structure includes n hidden layer, every layer Number of nodes is L1 respectively, L2 ..., and Ln. texture template images are sent to H0 layers of the input layer in DBN networks, constantly adjust H0 The weights W0 between H1, the weights W0 adjusted calculate one group of new probability with primary data and are sent into H1 layers, as H1 layers Input data.It repeats above-mentioned calculating process and obtains W1, W2 ..., Wn-1, finally obtain the initial weight Wi=of DBN networks { W0, W1, W2 ..., Wn-1 }, DBN networks include n+2 layers, i.e. H0, H1, H2 ..., Hn layers and sample label data Layer, wherein H0 is as input layer, number of nodes 64, and exemplar layer is output layer, and the number of nodes of intermediate n-layer is L1, L2 ... respectively, Ln. DBN networks are built using the training sample of no mark, by taking the training between H0 and H1 as an example, H0 and H1 layers constitute one RBM, H0 are identical as the number of nodes of visible layer v, and H1 is identical as the number of nodes of hidden layer h, are adjusted using alternate Gibbs model Whole weights W0, until RBM restrains.During unsupervised learning, the weights that RBM is adjusted are preserved, and as top-down Supervised learning initial weight.As supervised learning process, according to the mark of sample, finely tuned again using gradient descent method Weights, here, RBM networks and DBN networks use same network structure, input layer and hidden layer all having the same, including Every layer of interstitial content is also all identical, and only finally there are one output layers for DBN networks.

(2) OCR is identified

The text filed positioning of video image is all to go out corresponding top-level feature from bottom Feature Mapping, is reflected layer by layer successively It penetrates, until obtaining the result of top.

By successively increment depth learning algorithm proposed by the present invention and neural network, classics Kim methods and SVM methods pair Text filed positioning compares.Using recall ratio (RR), precision ratio (PR) and the coefficient F in formula come overall merit these types The using effect of method.

Wherein：C is the text filed number being correctly detecting in image；M is the text filed sum detected in image； N is the text filed sum of physical presence in image；F coefficients are used for carrying out overall ranking to each algorithm performance, are that will look into entirely The index linear combining of the two performances of rate and precision ratio forms.

With reference to effect, the invention will be further described.

To analyze influence of the different DBN network structures to algorithm performance, therefore test the performance of the different DBN networks numbers of plies. Test data shows that the increase with the network number of plies, the accuracy of DBN networks step up, and the approximation capability of network gradually increases By force, still, as the increase of the network number of plies, the complexity of network also can constantly increase, the extensive power of network can gradually reduce, So being not that the network number of plies is The more the better.Test indicate that 4-DBN networks disclosure satisfy that text filed demand.

With reference to the video caption detection based on deep learning, the invention will be further described with identifying system.

The embodiment of the present invention provides video caption detection and identifying system based on deep learning, including：

Textural characteristics acquisition module 1 obtains video image text for filtering video image by Gabor filter The textural characteristics of middle word；

Deep learning network DBN constitutes module 2, for using textural characteristics as training sample, utilizing limited Boltzmann Machine RBM successively carries out incremental learning to texture image；In learning process, use marker samples micro- as monitoring data progress network It adjusts, constitutes deep learning network DBN, and the bianry image in retrtieval region and background area；

Text image acquisition module 3, for, to bianry image denoising, re-mapping positioning figure using morphological method As upper, the text image that background area is only removed comprising text filed is obtained,

Character recognition module 4 is sent to OCR character recognition systems for image to be carried out binaryzation, gray scale subsequent processing Carry out character recognition.

In the above-described embodiments, can come wholly or partly by software, hardware, firmware or its arbitrary combination real It is existing.When using entirely or partly realizing in the form of a computer program product, the computer program product include one or Multiple computer instructions.When loading on computers or executing the computer program instructions, entirely or partly generate according to Flow described in the embodiment of the present invention or function.The computer can be all-purpose computer, special purpose computer, computer network Network or other programmable devices.The computer instruction can store in a computer-readable storage medium, or from one Computer readable storage medium is transmitted to another computer readable storage medium, for example, the computer instruction can be from one A web-site, computer, server or data center pass through wired (such as coaxial cable, optical fiber, Digital Subscriber Line (DSL) Or wireless (such as infrared, wireless, microwave etc.) mode is carried out to another web-site, computer, server or data center Transmission).The computer read/write memory medium can be that any usable medium that computer can access either includes one The data storage devices such as a or multiple usable mediums integrated server, data center.The usable medium can be magnetic Jie Matter, (for example, floppy disk, hard disk, tape), optical medium (for example, DVD) or semiconductor medium (such as solid state disk SolidStateDisk (SSD)) etc..

The foregoing is merely illustrative of the preferred embodiments of the present invention, is not intended to limit the invention, all essences in the present invention All any modification, equivalent and improvement etc., should all be included in the protection scope of the present invention made by within refreshing and principle.

Claims

1. a kind of video caption detection and recognition methods based on deep learning, which is characterized in that be set forth in deep learning regards Frequency local-caption extraction includes with recognition methods：

(1) video image is filtered by Gabor filter, obtains the textural characteristics of word in video image text；

(2) using textural characteristics as training sample, increment successively is carried out to texture image using limited Boltzmann machine RBM It practises；In learning process, marker samples is used to carry out network fine tuning as monitoring data, constitutes deep learning network DBN, and mark text The bianry image of one's respective area and background area；

(3) it utilizes morphological method to bianry image denoising, re-maps on positioning image, obtain only including text filed And remove the text image of background area；

(4) text image is subjected to binaryzation, gray scale subsequent processing again, is sent to OCR character recognition systems and carries out character recognition.

2. video caption detection and recognition methods based on deep learning as described in claim 1, which is characterized in that by video During image is filtered by Gabor filter, video image is carried out on different scale and direction using two-dimensional Gabor filter Filtering, two-dimensional Gabor function are：

G (x, y)=Kexp {-π [p²(x-x₀)²+q²(y-y₀)²]}

·exp{-2π_j[u₀(x-x₀)+v₀(y-y₀)]}

Fourier transformation form

K is the amplitude of Gauss kernel functions in formula；(x₀, y₀) be gaussian kernel function center；(u₀, v₀) it is modulating frequency Center；(p, q) is the scale parameter of Gauss kernel functions；

If peak position (the x of Gauss envelope functions₀, y₀) it is (0,0), select Gabor to filter by calculating filtering parameter p and q Wave device；The filtering parameter p and q of filter is calculated by lower formula：

U_hAnd U_IRespectively the high-frequency center in texture image region and low frequency center；T is direction number；M is scale parameter；λ is The period of Gabor filter.

3. video caption detection and recognition methods based on deep learning as described in claim 1, which is characterized in that depth Practising network DBN learning methods includes：Unsupervised learning is used for the pre-training of each layer network；It is only trained with unsupervised learning every time Wherein one layer, using training result as high one layer of input；With from top and under supervision algorithm go to adjust all layers；

Assuming that node all in RBM models is all random binary (0,1) variable node, while assuming that full probability is distributed P (v, h) Meet Boltzmann distributions, in the case of known v, θ={ W, a, b } is parameter sets, visible elements and concealed nodes it is inclined It sets vector and indicates that then probability of the RBM at state θ is with a and b

Z (θ) is normalization factor ,-E (v, h in formula；Be θ) partition function, on the basis of given hidden layer, visual layers it is general Rate is P (v | h), is built multiple limited Boltzmann machines combinations by bottom-up.

4. as described in claim 1 based on deep learning video caption detection and recognition methods, which is characterized in that using by Boltzmann machine RBM is limited successively to carry out in incremental learning texture image, including：

DBN networks needs are trained to obtain best weight value, first carry out successively increment using RBM to texture template image It practises, weights in network is constantly adjusted using maximum likelihood estimate, RBM is made to reach energy balance, then with monitoring data, to whole A DBN networks are finely adjusted；During unsupervised learning, corresponding one layer of the node of each state value, meter in DBN networks The inputoutput data of calculation is all the probability value that corresponding node state value is 1, and H0 layers of input vector is each character area Texture sample, after alternate gibbs sampler, the input as DBN networks.

5. video caption detection and recognition methods based on deep learning as described in claim 1, which is characterized in that by image Binaryzation, gray scale subsequent processing are carried out, OCR character recognition systems is sent to and carries out character recognition, including：

The text filed positioning of video image goes out corresponding top-level feature from bottom Feature Mapping, maps layer by layer successively, until To the result of top；

By text filed to DBN networks and after Morphological scale-space, binary conversion treatment is carried out, the area being connected with boundary is removed Textview field background black white reverse is then sent through OCR software and is identified by domain.

6. a kind of realizing video caption detection and recognition methods based on deep learning described in Claims 1 to 5 any one Computer program.

7. a kind of realizing video caption detection and recognition methods based on deep learning described in Claims 1 to 5 any one Information data processing terminal.

8. a kind of computer readable storage medium, including instruction, when run on a computer so that computer is executed as weighed Profit requires the video caption detection and recognition methods based on deep learning described in 1-5 any one.

9. a kind of realize the video caption detection based on deep learning described in claim 1 with recognition methods based on deep learning Video caption detection and identifying system, which is characterized in that video caption based on deep learning, which is detected with identifying system, includes：

Textural characteristics acquisition module obtains word in video image text for filtering video image by Gabor filter Textural characteristics；

Deep learning network DBN constitutes module, for using textural characteristics as training sample, utilizing limited Boltzmann machine RBM Incremental learning successively is carried out to texture image；In learning process, marker samples is used to carry out network fine tuning as monitoring data, constituted Deep learning network DBN, and the bianry image in retrtieval region and background area；

Text image acquisition module, for, to bianry image denoising, being re-mapped on positioning image using morphological method, Obtain only removing the text image of background area comprising text filed,

Character recognition module is sent to the progress of OCR character recognition systems for image to be carried out binaryzation, gray scale subsequent processing Character recognition.

10. a kind of video frequency searching system realizing video caption detection and recognition methods based on deep learning described in claim 1 System.