CN108805102A - A kind of video caption detection and recognition methods and system based on deep learning - Google Patents
A video caption detection and recognition method and system based on deep learning
- Publication number
- CN108805102A CN108805102A CN201810690120.7A CN201810690120A CN108805102A CN 108805102 A CN108805102 A CN 108805102A CN 201810690120 A CN201810690120 A CN 201810690120A CN 108805102 A CN108805102 A CN 108805102A
- Authority
- CN
- China
- Prior art keywords
- image
- video
- deep learning
- text
- learning
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/217—Validation; Performance evaluation; Active pattern learning techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/20—Image preprocessing
- G06V10/30—Noise filtering
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V30/00—Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
- G06V30/10—Character recognition
- G06V30/14—Image acquisition
- G06V30/148—Segmentation of character regions
- G06V30/153—Segmentation of character regions using recognition of characters or words
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V30/00—Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
- G06V30/10—Character recognition
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Multimedia (AREA)
- Bioinformatics & Computational Biology (AREA)
- Evolutionary Biology (AREA)
- Evolutionary Computation (AREA)
- Bioinformatics & Cheminformatics (AREA)
- General Engineering & Computer Science (AREA)
- Artificial Intelligence (AREA)
- Life Sciences & Earth Sciences (AREA)
- Image Analysis (AREA)
Abstract
The invention belongs to the technical field of computer software and discloses a video caption detection and recognition method and system based on deep learning. Deep learning theory is applied to video text region localization and recognition. The video image is filtered with Gabor filters to obtain the texture features of the characters in the video text; taking these texture features as training samples, a restricted Boltzmann machine performs layer-by-layer incremental learning on the texture images; a morphological method denoises the binary image, which is then mapped back onto the localization image to obtain a text image containing only the text regions with the background removed. The invention combines 2D Gabor filters with a deep learning algorithm to localize text regions against complex backgrounds in video, optimizes a morphology-based video image denoising method, and then recognizes the characters with an OCR system, improving the accuracy of OCR character recognition.
Description
Technical field
The invention belongs to the technical field of computer software, and in particular relates to a video caption detection and recognition method and system based on deep learning.
Background technology
Currently, the state of the prior art in this field is as follows: with the continuous growth of Internet video content and the proliferation of multimedia applications such as digital libraries, video on demand, and distance learning, retrieving the required data from massive video collections has become critically important.
Traditional keyword-based video retrieval cannot meet the needs of massive video retrieval because of its limited descriptive power, strong subjectivity, and reliance on manual annotation. Since the 1990s, content-based video retrieval has therefore become a hot research topic, and caption recognition is a key technology for realizing it: if the captions in a video can be recognized automatically, textual information reflecting the video content can be obtained, and query-based video retrieval can be realized on the basis of this textual information. The technology is thus a key component of next-generation search engines and has highly important research and application value.
Caption detection and recognition are key technologies of video text processing, especially in the process of foreign-language video translation. Caption extraction and recognition greatly facilitate complex translation work: translators no longer need to watch the video and extract the captions manually, which greatly liberates them and brings a qualitative improvement in their working efficiency.
This scheme uses a deep-learning-based recognition method, which can solve the problems of low text localization accuracy and slow text localization and recognition speed in complex, high-speed scenes, and has the features of high efficiency, high speed, iterability, and a high recognition rate.
In conclusion problem of the existing technology is:
(1) traditional video frequency searching based on keyword description is because descriptive power is limited, subjectivity is strong, marks by hand etc.
Reason cannot meet the needs of massive video retrieval.
(2) prior art is not used detection and partitioning algorithm based on edge, cannot be made full use of on local-caption extraction
The redundancy of video in time carries out secondary filter to improve accuracy rate.In subtitle recognition, the prior art does not use base
Judge the color of video caption in the method for connected region statistics, the two-value of gray scale picture is carried out based on partial sweep window
Change, is come out the Text region in image by the method for artificial intelligence deep learning, it cannot be in the detection and knowledge of video caption
The effect that Shang do not obtained.
(3) traditional technology based on pattern-recognition cannot be satisfied more scenes, high complexity situation due to technical reason
Under correct identification, different scenes just needs to switch different algorithmic approach, and human input cost is huge, and effect is also bad.
The difficulty and significance of solving the above technical problems are as follows:
Text in video can provide important auxiliary information for video retrieval and indexing. Sometimes the text in a video contains information available nowhere else, such as the captions in a film's opening credits; sometimes it is important, concise auxiliary information, such as scores in sports events or stock prices. If the text in video can be efficiently extracted and recognized, many high-level applications, such as video summarization and artificial-intelligence recognition, can be better realized.
Because the size, style, color, and font of characters in complex video images are highly variable, no single algorithm has yet achieved satisfactory results across all applications; several methods generally need to be used in combination.
Invention content
In view of the problems of the prior art, the present invention provides a video caption detection and recognition method and system based on deep learning.
The invention is realized as follows. The video caption detection and recognition method based on deep learning includes:
filtering the video image with Gabor filters to obtain the texture features of the characters in the video text;
then, taking the texture features as training samples, performing layer-by-layer incremental learning on the texture images with a restricted Boltzmann machine (RBM); during learning, using labeled samples as supervision data to fine-tune the network, forming a deep belief network (DBN), and labeling the binary image of text regions and background regions;
afterwards, denoising the binary image with a morphological method and mapping it back onto the localization image to obtain a text image containing only the text regions with the background removed;
finally, applying binarization and grayscale post-processing to the image and sending it to an OCR character recognition system for character recognition.
Further, in filtering the video image with Gabor filters, a two-dimensional Gabor filter is applied to the video image at different scales and directions. The two-dimensional Gabor function is
G(x, y) = K·exp{−π[p²(x−x0)² + q²(y−y0)²]}·exp{−2πj[u0(x−x0) + v0(y−y0)]}
and the magnitude of its Fourier transform is a Gaussian centered at the modulation frequency:
|F(u, v)| = (K/(pq))·exp{−π[(u−u0)²/p² + (v−v0)²/q²]}
where K is the amplitude of the Gaussian kernel function; (x0, y0) is the center of the Gaussian kernel function; (u0, v0) is the center of the modulation frequency; and (p, q) are the scale parameters of the Gaussian kernel function.
If the peak position (x0, y0) of the Gaussian envelope is set to (0, 0), the Gabor filter is selected by computing the filtering parameters p and q.
The filtering parameters p and q of the filter are calculated from the following quantities: Uh and Ul, respectively the high-frequency center and the low-frequency center of the texture image region; T, the number of directions; M, the scale parameter; and λ, the period of the Gabor filter.
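As an illustrative sketch (not part of the patent), the 2D Gabor function above can be sampled and applied with NumPy. The kernel size and the parameter values `K`, `p`, `q`, `u0`, `v0` are arbitrary placeholders, not values taken from the patent:

```python
import numpy as np

def gabor_kernel(size=31, K=1.0, p=0.15, q=0.15, u0=0.1, v0=0.0):
    """Sample the patent's 2D Gabor function G(x, y) on a square grid,
    with the Gaussian envelope centred at the origin (x0 = y0 = 0)."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1].astype(float)
    envelope = K * np.exp(-np.pi * (p**2 * x**2 + q**2 * y**2))
    carrier = np.exp(-2j * np.pi * (u0 * x + v0 * y))
    return envelope * carrier

def gabor_filter(image, kernel):
    """Filter an image with the kernel via FFT convolution; the magnitude
    of the complex response serves as a texture-energy map."""
    s0 = image.shape[0] + kernel.shape[0] - 1
    s1 = image.shape[1] + kernel.shape[1] - 1
    out = np.fft.ifft2(np.fft.fft2(image, (s0, s1)) * np.fft.fft2(kernel, (s0, s1)))
    return np.abs(out[:image.shape[0], :image.shape[1]])

img = np.zeros((64, 64))
img[:, ::8] = 1.0            # synthetic vertical "stroke" pattern, period 8
resp = gabor_filter(img, gabor_kernel(u0=1 / 8, v0=0.0))
```

In practice the magnitude map `resp` would be thresholded or fed to the learning stage as a texture feature, with one response map per scale and direction.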
Further, the DBN learning method includes: unsupervised learning is used for pre-training each layer of the network; each time, only one layer is trained by unsupervised learning, and its training result is used as the input of the next higher layer; then a top-down supervised algorithm adjusts all layers.
Assume that all nodes in the RBM model are stochastic binary (0, 1) variables and that the full probability distribution P(v, h) satisfies a Boltzmann distribution. Given v, let θ = {W, a, b} be the parameter set, where the bias vectors of the visible and hidden nodes are denoted a and b. The probability of the RBM at state θ is then
P(v, h; θ) = exp(−E(v, h; θ)) / Z(θ)
where Z(θ) = Σ(v,h) exp(−E(v, h; θ)) is the normalization factor (partition function) and E(v, h; θ) is the energy function. Given the hidden layer, the probability of the visible layer is P(v | h). A deep network is built bottom-up by combining multiple restricted Boltzmann machines.
Further, the layer-by-layer incremental learning of the texture images with the restricted Boltzmann machine (RBM) includes: the DBN must be trained to obtain the best weights. First, layer-by-layer incremental learning is performed on the texture template images with RBMs, and the network weights are continually adjusted by maximum likelihood estimation until each RBM reaches energy equilibrium; then the entire DBN is fine-tuned with supervision data. During unsupervised learning, each state value corresponds to one layer of nodes in the DBN; the computed input and output data are the probabilities that the corresponding node states equal 1. The input vector of layer H0 is the texture sample of each text region which, after alternating Gibbs sampling, serves as the input of the DBN.
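The incremental adjustment with alternating Gibbs sampling can be sketched as a contrastive-divergence (CD-1) update; this is a generic illustration of RBM weight adjustment, not the patent's exact procedure, and the learning rate and sizes are placeholders:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_step(V, W, a, b, lr=0.1, rng=np.random.default_rng(0)):
    """One CD-1 update on a batch of binary visible vectors V: an
    alternating Gibbs step v -> h -> v' -> h', then a gradient step
    on the weights and biases. Returns the reconstruction error."""
    ph = sigmoid(V @ W + b)                     # P(h = 1 | v)
    h = (rng.random(ph.shape) < ph).astype(float)
    pv = sigmoid(h @ W.T + a)                   # reconstruction P(v = 1 | h)
    ph2 = sigmoid(pv @ W + b)
    n = V.shape[0]
    W += lr * (V.T @ ph - pv.T @ ph2) / n       # positive minus negative phase
    a += lr * (V - pv).mean(axis=0)
    b += lr * (ph - ph2).mean(axis=0)
    return W, a, b, np.mean((V - pv) ** 2)

rng = np.random.default_rng(1)
V = (rng.random((50, 16)) < 0.3).astype(float)  # toy stand-in for texture samples
W = rng.normal(scale=0.01, size=(16, 8))
a, b = np.zeros(16), np.zeros(8)
errs = [cd1_step(V, W, a, b)[3] for _ in range(200)]
```

The reconstruction error in `errs` falls as the RBM approaches equilibrium on the fixed batch, mirroring the "energy balance" criterion described above.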
Further, applying binarization and grayscale post-processing to the image and sending it to an OCR character recognition system for character recognition includes:
the text-region localization of the video image maps bottom-level features to the corresponding top-level features, layer by layer, until the top-level result is obtained;
after the text regions have passed through the DBN and morphological processing, binarization is applied, regions connected to the boundary are removed, the text-region background is black-white inverted, and the result is then sent to OCR software for recognition.
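The binarization and black-white inversion step might look as follows; Otsu's method is used here as a representative global threshold, which the patent does not specify, and the sample image is synthetic:

```python
import numpy as np

def otsu_threshold(gray):
    """Otsu's method: pick the threshold that maximises the
    between-class variance of the grey-level histogram."""
    hist, _ = np.histogram(gray, bins=256, range=(0, 256))
    total = gray.size
    levels = np.arange(256)
    best_t, best_var = 0, -1.0
    for t in range(1, 256):
        w0 = hist[:t].sum()
        w1 = total - w0
        if w0 == 0 or w1 == 0:
            continue
        mu0 = (levels[:t] * hist[:t]).sum() / w0
        mu1 = (levels[t:] * hist[t:]).sum() / w1
        var = w0 * w1 * (mu0 - mu1) ** 2
        if var > best_var:
            best_var, best_t = var, t
    return best_t

def binarize_for_ocr(gray, text_is_light=True):
    """Binarize a grey image and, when the caption is lighter than the
    background, invert so OCR sees dark text on a white background."""
    t = otsu_threshold(gray)
    binary = (gray >= t).astype(np.uint8) * 255
    return 255 - binary if text_is_light else binary

gray = np.full((32, 32), 40, dtype=np.uint8)
gray[12:20, 4:28] = 220          # a light "caption" bar on a dark background
out = binarize_for_ocr(gray)
```

After this step the caption pixels are black (0) on a white (255) background, the polarity most OCR engines expect.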
Another object of the present invention is to provide a computer program implementing the described video caption detection and recognition method based on deep learning.
Another object of the present invention is to provide an information data processing terminal implementing the described video caption detection and recognition method based on deep learning.
Another object of the present invention is to provide a computer-readable storage medium comprising instructions which, when run on a computer, cause the computer to execute the described video caption detection and recognition method based on deep learning.
Another object of the present invention is to provide a video caption detection and recognition system based on deep learning that implements the described method, including:
a texture feature acquisition module, for filtering the video image with Gabor filters to obtain the texture features of the characters in the video text;
a deep belief network (DBN) construction module, for taking the texture features as training samples and performing layer-by-layer incremental learning on the texture images with a restricted Boltzmann machine (RBM); during learning, labeled samples serve as supervision data to fine-tune the network, forming the DBN, and the binary image of text regions and background regions is labeled;
a text image acquisition module, for denoising the binary image with a morphological method and mapping it back onto the localization image to obtain a text image containing only the text regions with the background removed;
a character recognition module, for applying binarization and grayscale post-processing to the image and sending it to an OCR character recognition system for character recognition.
Another object of the present invention is to provide a video retrieval system implementing the described video caption detection and recognition method based on deep learning.
In conclusion advantages of the present invention and good effect are:
On local-caption extraction, the present invention has used detection and partitioning algorithm based on edge, and make full use of video when
Between on redundancy carry out secondary filter to improve accuracy rate, accuracy rate has been increased to 98.5%, is positioned from more scenes
Accuracy rate is more stable, compares original out-of-date methods performance boost based on pattern-recognition percent 30.
In subtitle recognition, the color of video caption is judged with the method counted based on connected region first, then base
The binaryzation of gray scale picture is carried out in partial sweep window, finally by the method for artificial intelligence deep learning by the text in image
Word identifies, and achieves extraordinary effect in the detection and identification of video caption.
Test data shows that the increase with the network number of plies, the accuracy of DBN networks step up, and network approaches energy
Power gradually enhances, and still, with the increase of the network number of plies, the complexity of network also can constantly increase, the extensive power meeting of network
It gradually reduces, so being not that the network number of plies is The more the better.Test indicate that 4-DBN networks disclosure satisfy that text filed need
It asks.
The feelings such as video frame images, font size, font color, uniline or multirow by 100 width different backgrounds of selection
Under condition, is positioned and compared to text filed using 4 kinds of distinct methods as above, test result such as table
Description of the drawings
Fig. 1 is a flowchart of the video caption detection and recognition method based on deep learning provided by an embodiment of the present invention.
Fig. 2 is a schematic diagram of the cyclic operation network provided by an embodiment of the present invention.
Fig. 3 is a flowchart of DBN network training provided by an embodiment of the present invention.
Fig. 4 is a schematic diagram of the video caption detection and recognition system based on deep learning provided by an embodiment of the present invention.
In the figures: 1, texture feature acquisition module; 2, deep belief network (DBN) construction module; 3, text image acquisition module; 4, character recognition module.
Specific implementation mode
To make the purpose, technical scheme, and advantages of the present invention clearer, the present invention is further elaborated below with reference to embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present invention and are not intended to limit it.
The video caption detection and recognition method based on deep learning provided by the embodiments of the present invention combines 2D Gabor filters with a deep learning algorithm to localize text regions against complex backgrounds in video, optimizes a morphology-based video image denoising method, and then recognizes the characters with an OCR system, thereby improving the accuracy of OCR character recognition.
As in Fig. 1, the present invention applies deep learning theory to video text region localization and recognition, and designs a layer-by-layer incremental deep learning algorithm based on texture features. First, the video image is filtered with Gabor filters to obtain the texture features of the characters in the video text. Then, taking the texture features as training samples, a restricted Boltzmann machine (RBM) performs layer-by-layer incremental learning on the texture images; during learning, labeled samples serve as supervision data to fine-tune the network, forming a deep belief network (DBN), and the binary image of text regions and background regions is labeled. Afterwards, the binary image is denoised with a morphological method and mapped back onto the localization image, yielding a text image containing only the text regions with the background removed. Finally, binarization and grayscale post-processing are applied, and the result is sent to an OCR character recognition system for character recognition.
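The morphological denoising step of this pipeline can be illustrated with a plain-NumPy opening (erosion followed by dilation); the 3×3 square structuring element is an assumption for illustration, as the patent does not name one:

```python
import numpy as np

def erode(binary, k=3):
    """Binary erosion with a k×k square structuring element."""
    pad = k // 2
    padded = np.pad(binary, pad, constant_values=0)
    out = np.ones_like(binary)
    for dy in range(k):
        for dx in range(k):
            out &= padded[dy:dy + binary.shape[0], dx:dx + binary.shape[1]]
    return out

def dilate(binary, k=3):
    """Binary dilation with a k×k square structuring element."""
    pad = k // 2
    padded = np.pad(binary, pad, constant_values=0)
    out = np.zeros_like(binary)
    for dy in range(k):
        for dx in range(k):
            out |= padded[dy:dy + binary.shape[0], dx:dx + binary.shape[1]]
    return out

def denoise_mask(binary, k=3):
    """Morphological opening (erosion then dilation): removes isolated
    noise pixels while largely preserving block-shaped text regions."""
    return dilate(erode(binary, k), k)

mask = np.zeros((40, 40), dtype=np.uint8)
mask[10:20, 5:35] = 1            # a caption-shaped text region
mask[2, 2] = mask[30, 33] = 1    # speckle noise
clean = denoise_mask(mask)
```

The cleaned mask would then be mapped back onto the localization image so only text-region pixels survive.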
The invention is further described below with reference to a concrete analysis.
(1) Caption region detection
In the whole process of caption extraction and recognition, detection is the first step and also a relatively difficult one, mainly because the size, color, and style of video captions vary greatly, the caption background is very complex, and the contrast between characters and background is sometimes low. Captions can only be recognized correctly if they are distinguishable from the background, that is, if they present certain edge features and intensity bands; analyzing the edge strength of video frames to detect captions is therefore an effective method.
(1) Layer-by-layer incremental deep learning video text localization algorithm
The texture of characters is periodic, and its energy is relatively concentrated within a certain frequency range, so a two-dimensional Gabor filter can be used to filter the video image at different scales and directions; Gabor filter theory describes well the local structure information corresponding to spatial frequency (scale), spatial position, and direction. The two-dimensional Gabor function is defined as
G(x, y) = K·exp{−π[p²(x−x0)² + q²(y−y0)²]}·exp{−2πj[u0(x−x0) + v0(y−y0)]}
and the magnitude of its Fourier transform is
|F(u, v)| = (K/(pq))·exp{−π[(u−u0)²/p² + (v−v0)²/q²]}
where K is the amplitude of the Gaussian kernel function; (x0, y0) is the center of the Gaussian kernel function; (u0, v0) is the center of the modulation frequency; and (p, q) are the scale parameters of the Gaussian kernel function. If the peak position (x0, y0) of the Gaussian envelope is set to (0, 0), the Gabor filter is selected by computing the filtering parameters p and q.
The filtering parameters p and q of the filter are calculated from the following quantities: Uh and Ul, respectively the high-frequency center and the low-frequency center of the texture image region; T, the number of directions; M, the scale parameter; and λ, the period of the Gabor filter. Considering that Chinese characters are mainly composed of four basic strokes (horizontal, vertical, left-falling, and right-falling), the Gabor filters are required to reflect the stroke features of Chinese characters in these four directions and to guarantee a good response to the frequency components of texture regions in these four directions.
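A bank of filters for the four stroke directions might be built as follows; the real-valued (cosine-carrier) form, the kernel size, the frequency, and the Gaussian width are illustrative choices, not parameters from the patent:

```python
import numpy as np

def oriented_gabor(size, freq, theta, sigma):
    """Gabor kernel tuned to spatial frequency `freq` along direction
    `theta`: the cosine carrier runs along the rotated x axis, so the
    kernel responds to strokes perpendicular to that axis."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1].astype(float)
    xr = x * np.cos(theta) + y * np.sin(theta)   # rotated coordinate
    envelope = np.exp(-(x**2 + y**2) / (2 * sigma**2))
    return envelope * np.cos(2 * np.pi * freq * xr)

# One filter per basic stroke direction:
# horizontal, diagonal, vertical, anti-diagonal.
bank = [oriented_gabor(21, freq=0.2, theta=t, sigma=4.0)
        for t in (0, np.pi / 4, np.pi / 2, 3 * np.pi / 4)]
```

Convolving a frame with each of the four kernels and keeping the maximum response per pixel gives a stroke-sensitive texture map across all four directions.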
(2) Construction of the deep belief network (DBN)
Deep learning is a new topic in machine learning research, whose purpose is to establish neural networks that simulate the analytic learning of the human brain. The deep learning algorithm is composed of a series of restricted Boltzmann machine (RBM) probabilistic models on a deep belief network (DBN). The general description of a deep learning process is as follows: suppose a system S has n layers S1, S2, ..., Sn, with input I and output O; the learning process can be represented as I → S1 → S2 → ... → Sn → O. If the output O equals the input I, then I has passed through the system without any information loss (or with negligible loss) and is considered essentially unchanged, which means that each layer Si is another representation of the original information (the input I). The core ideas of the deep learning algorithm are: (1) unsupervised learning is used for pre-training each layer of the network; (2) each time, only one layer is trained by unsupervised learning, and its training result is used as the input of the next higher layer; (3) a top-down supervised algorithm adjusts all layers.
As in Fig. 2, assume that all nodes in the RBM model are stochastic binary (0, 1) variables, and that the full probability distribution P(v, h) satisfies a Boltzmann distribution; given v, all hidden nodes are conditionally independent. The energy of a joint configuration of the Boltzmann machine can be expressed as
E(v, h; θ) = −Σi ai·vi − Σj bj·hj − Σi,j vi·Wij·hj
(3) Network training and weight adjustment
The DBN must be trained to obtain the best weights; DBN training usually consists of two parts, bottom-up unsupervised learning and top-down supervised learning.
As in Fig. 3, the process first performs layer-by-layer incremental learning on the texture template images with RBMs, continually adjusting the network weights by maximum likelihood estimation until each RBM reaches energy equilibrium, and then fine-tunes the entire DBN with supervision data. During unsupervised learning, each state value corresponds to one layer of nodes in the DBN; the computed input and output data are the probabilities that the corresponding node states equal "1". The input vector of layer H0 is the texture sample of each character region which, after alternating Gibbs sampling, serves as the input of the DBN. Suppose the deep learning network contains n hidden layers with node counts L1, L2, ..., Ln. The texture template images are fed to the input layer H0 of the DBN, and the weights W0 between H0 and H1 are continually adjusted; the adjusted weights W0 and the initial data are used to compute a new group of probabilities, which are fed to layer H1 as its input data. Repeating this computation yields W1, W2, ..., Wn−1, and finally the initial weights of the DBN, Wi = {W0, W1, W2, ..., Wn−1}. The DBN contains n+2 layers, namely layers H0, H1, H2, ..., Hn and the sample label layer, where H0 is the input layer with 64 nodes, the labeled-sample layer is the output layer, and the node counts of the intermediate n layers are L1, L2, ..., Ln respectively. The DBN is built with unlabeled training samples. Taking the training between H0 and H1 as an example, H0 and H1 form one RBM: H0 has the same number of nodes as the visible layer v and H1 the same number as the hidden layer h, and the weights W0 are adjusted by alternating Gibbs sampling until the RBM converges. During unsupervised learning, the weights adjusted by the RBMs are saved and used as the initial weights of the top-down supervised learning. In the supervised learning process, the weights are fine-tuned by gradient descent according to the sample labels. Here, the RBM networks and the DBN use the same network structure, with identical input and hidden layers, including the node count of every layer; the only difference is that the DBN finally has an output layer.
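The bottom-up construction described above, where each trained RBM's hidden activations become the visible data of the next RBM, can be sketched as greedy layer-wise pre-training; the layer sizes and hyperparameters here are placeholders, not the patent's 64-node H0 configuration:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_rbm(data, n_hidden, epochs=50, lr=0.1, seed=0):
    """Train one RBM with CD-1 and return its weights, hidden biases,
    and the hidden probabilities that feed the next layer."""
    rng = np.random.default_rng(seed)
    n_visible = data.shape[1]
    W = rng.normal(scale=0.01, size=(n_visible, n_hidden))
    a, b = np.zeros(n_visible), np.zeros(n_hidden)
    for _ in range(epochs):
        ph = sigmoid(data @ W + b)
        h = (rng.random(ph.shape) < ph).astype(float)
        pv = sigmoid(h @ W.T + a)
        ph2 = sigmoid(pv @ W + b)
        n = data.shape[0]
        W += lr * (data.T @ ph - pv.T @ ph2) / n
        a += lr * (data - pv).mean(axis=0)
        b += lr * (ph - ph2).mean(axis=0)
    return W, b, sigmoid(data @ W + b)

def pretrain_dbn(data, layer_sizes):
    """Greedy layer-wise pre-training: train an RBM per layer, then use
    its hidden activations as the visible data of the next RBM."""
    weights, x = [], data
    for n_hidden in layer_sizes:
        W, b, x = train_rbm(x, n_hidden)
        weights.append((W, b))
    return weights

rng = np.random.default_rng(2)
texture = (rng.random((100, 64)) < 0.3).astype(float)  # stand-in for 8x8 texture patches
stack = pretrain_dbn(texture, [32, 16, 8])
```

The resulting `stack` of weights would initialize the DBN, which supervised fine-tuning (gradient descent on the labels) then adjusts end to end.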
(2) OCR recognition
The text-region localization of the video image maps bottom-level features to the corresponding top-level features, layer by layer, until the top-level result is obtained.
After the text regions have passed through the DBN and morphological processing, binarization is applied, regions connected to the boundary are removed, the text-region background is black-white inverted, and the result is then sent to OCR software for recognition.
The text-region localization of the layer-by-layer incremental deep learning algorithm proposed by the present invention is compared with a neural network, the classic Kim method, and an SVM method. The recall ratio (RR), precision ratio (PR), and the coefficient F in the formula are used to comprehensively evaluate these methods,
where C is the number of text regions correctly detected in the image; M is the total number of text regions detected in the image; N is the total number of text regions actually present in the image; and the F coefficient, used to rank the overall performance of each algorithm, is an index combining the two measures of recall ratio and precision ratio.
The invention is further described below with reference to its effects.
To analyze the influence of different DBN network structures on algorithm performance, DBNs with different numbers of layers were tested. Test data show that as the number of layers increases, the accuracy of the DBN steadily improves and its approximation capability gradually strengthens; however, the complexity of the network also keeps growing and its generalization ability gradually declines, so more layers are not always better. Tests indicate that a 4-layer DBN satisfies the requirements of text-region localization.
With 100 video frame images of different backgrounds, font sizes, font colors, and single-line or multi-line captions, the text regions were localized and compared using the four different methods above; the test results are shown in the table.
The invention is further described below with reference to the video caption detection and recognition system based on deep learning.
The embodiment of the present invention provides a video caption detection and recognition system based on deep learning, including:
a texture feature acquisition module 1, for filtering the video image with Gabor filters to obtain the texture features of the characters in the video text;
a deep belief network (DBN) construction module 2, for taking the texture features as training samples and performing layer-by-layer incremental learning on the texture images with a restricted Boltzmann machine (RBM); during learning, labeled samples serve as supervision data to fine-tune the network, forming the DBN, and the binary image of text regions and background regions is labeled;
a text image acquisition module 3, for denoising the binary image with a morphological method and mapping it back onto the localization image to obtain a text image containing only the text regions with the background removed;
a character recognition module 4, for applying binarization and grayscale post-processing to the image and sending it to an OCR character recognition system for character recognition.
In the above embodiments, implementation may be wholly or partly by software, hardware, firmware, or any combination thereof. When implemented wholly or partly in the form of a computer program product, the computer program product includes one or more computer instructions. When the computer program instructions are loaded or executed on a computer, the flows or functions described in the embodiments of the present invention are wholly or partly generated. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable device. The computer instructions may be stored in a computer-readable storage medium, or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, server, or data center to another by wired (e.g., coaxial cable, optical fiber, Digital Subscriber Line (DSL)) or wireless (e.g., infrared, radio, microwave) means. The computer-readable storage medium may be any usable medium that the computer can access, or a data storage device, such as a server or data center, integrating one or more usable media. The usable medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., Solid State Disk (SSD)), etc.
The foregoing is merely a description of preferred embodiments of the present invention and is not intended to limit the invention; any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall be included in the protection scope of the present invention.
Claims (10)
1. A video caption detection and recognition method based on deep learning, characterized in that the deep-learning-based video caption detection and recognition method comprises:
(1) filtering the video image with a Gabor filter to obtain the texture features of the characters in the video image text;
(2) taking the texture features as training samples and performing layer-by-layer incremental learning on the texture images using restricted Boltzmann machines (RBM); during learning, using labeled samples as supervision data to fine-tune the network, forming a deep learning network (DBN), and labeling a binary image of the text region and the background region;
(3) denoising the binary image with morphological methods and mapping it back onto the localization image to obtain a text image that contains only the text region, with the background region removed;
(4) binarizing the text image again, performing grayscale post-processing, and sending it to an OCR character recognition system for character recognition.
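Step (4) calls for binarizing the text image before OCR but does not fix a particular thresholding algorithm; Otsu's global threshold is a common choice for this kind of bimodal text/background image. The sketch below is a minimal NumPy implementation of Otsu's method under that assumption — it is illustrative, not the patent's own binarization code.

```python
import numpy as np

def otsu_threshold(gray):
    """Pick the global threshold that maximizes between-class variance
    (Otsu's method) for an 8-bit grayscale image."""
    hist = np.bincount(gray.ravel(), minlength=256).astype(float)
    total = hist.sum()
    cum = np.cumsum(hist)                      # class-0 pixel counts
    cum_mean = np.cumsum(hist * np.arange(256))  # class-0 intensity sums
    best_t, best_var = 0, -1.0
    for t in range(1, 256):
        w0 = cum[t - 1]          # weight of class 0 (values < t)
        w1 = total - w0          # weight of class 1 (values >= t)
        if w0 == 0 or w1 == 0:
            continue
        m0 = cum_mean[t - 1] / w0
        m1 = (cum_mean[-1] - cum_mean[t - 1]) / w1
        var = w0 * w1 * (m0 - m1) ** 2  # between-class variance
        if var > best_var:
            best_var, best_t = var, t
    return best_t
```

A text image would then be binarized as `gray >= otsu_threshold(gray)` before being handed to the OCR engine.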
2. The deep-learning-based video caption detection and recognition method according to claim 1, characterized in that, in filtering the video image with the Gabor filter, the video image is filtered at different scales and orientations using a two-dimensional Gabor filter, whose two-dimensional Gabor function is:
G(x, y) = K·exp{−π[p²(x − x₀)² + q²(y − y₀)²]} · exp{−2πj[u₀(x − x₀) + v₀(y − y₀)]}
with the corresponding frequency-domain (Fourier transform) form
F(u, v) = (K/(pq))·exp{−π[(u − u₀)²/p² + (v − v₀)²/q²]} · exp{−2πj[x₀(u − u₀) + y₀(v − v₀)]}
where K is the amplitude of the Gaussian kernel function; (x₀, y₀) is the center of the Gaussian kernel function; (u₀, v₀) is the center of the modulating frequency; and (p, q) are the scale parameters of the Gaussian kernel function.
The peak position (x₀, y₀) of the Gaussian envelope function is set to (0, 0), and the Gabor filter is selected by computing the filter parameters p and q, which are determined from the following quantities: U_h and U_l, respectively the high-frequency and low-frequency centers of the texture image region; T, the number of orientations; M, the scale parameter; and λ, the period of the Gabor filter.
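The two-dimensional Gabor function of this claim can be sampled directly on a grid. The NumPy sketch below evaluates G(x, y) as written above and applies it to an image by FFT convolution; the default parameter values (p, q, u₀, v₀) are illustrative only and are not values prescribed by the patent.

```python
import numpy as np

def gabor_kernel(size, K=1.0, p=0.5, q=0.5, u0=0.1, v0=0.0, x0=0.0, y0=0.0):
    """Sample the 2-D Gabor function of claim 2 on a size x size grid."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1].astype(float)
    envelope = K * np.exp(-np.pi * (p**2 * (x - x0)**2 + q**2 * (y - y0)**2))
    carrier = np.exp(-2j * np.pi * (u0 * (x - x0) + v0 * (y - y0)))
    return envelope * carrier

def gabor_response(image, kernel):
    """Magnitude of the filter response, computed by FFT convolution
    (zero-padded 'full' convolution, cropped back to the image size)."""
    H, W = image.shape
    h, w = kernel.shape
    F = np.fft.fft2(image, s=(H + h - 1, W + w - 1))
    G = np.fft.fft2(kernel, s=(H + h - 1, W + w - 1))
    full = np.fft.ifft2(F * G)
    r0, c0 = h // 2, w // 2
    return np.abs(full[r0:r0 + H, c0:c0 + W])
```

A filter bank for the claim's multiple scales and orientations would call `gabor_kernel` once per (scale, orientation) pair and stack the response magnitudes as the texture feature vector.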
3. The deep-learning-based video caption detection and recognition method according to claim 1, characterized in that the deep learning network (DBN) learning method comprises: using unsupervised learning for the pre-training of each layer of the network; training only one layer at a time with unsupervised learning and using its training result as the input of the next higher layer; and then adjusting all layers with a top-down supervised algorithm.
Assume that all nodes in the RBM model are random binary (0, 1) variable nodes and that the full probability distribution P(v, h) satisfies the Boltzmann distribution. With v known, let θ = {W, a, b} be the parameter set, where a and b denote the bias vectors of the visible nodes and hidden nodes respectively. The probability of the RBM in state (v, h) under θ is then
P(v, h; θ) = exp(−E(v, h; θ)) / Z(θ)
where E(v, h; θ) is the energy function and Z(θ) = Σ_{v,h} exp(−E(v, h; θ)) is the normalization factor (partition function). Given the hidden layer, the probability of the visible layer is P(v | h); multiple restricted Boltzmann machines are combined by stacking them bottom-up.
4. The deep-learning-based video caption detection and recognition method according to claim 1, characterized in that performing layer-by-layer incremental learning on the texture images using restricted Boltzmann machines (RBM) comprises:
training the DBN network to obtain the best weights: first performing layer-by-layer incremental learning on the texture template images using RBMs, continually adjusting the network weights by maximum likelihood estimation until each RBM reaches energy equilibrium, and then fine-tuning the entire DBN network with the supervision data; during unsupervised learning, each state value corresponds to one layer of nodes in the DBN network, and the computed input and output data are the probabilities that the corresponding node state value is 1; the input vector of layer H0 is the texture sample of each character region, which, after alternating Gibbs sampling, serves as the input of the DBN network.
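The alternating Gibbs sampling referred to in this claim is conventionally realized as a contrastive-divergence step (CD-1): v0 → h0 → v1 → h1, with a weight update that pushes the RBM toward energy equilibrium. The NumPy sketch below shows one such update for a single training vector; the learning rate and the use of probabilities (rather than samples) for the reconstruction are conventional choices, not values taken from the patent.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(v0, W, a, b, lr=0.1, rng=None):
    """One contrastive-divergence (CD-1) step of alternating Gibbs
    sampling: v0 -> h0 -> v1 -> h1, then a gradient-approximation
    update of the weights and biases."""
    if rng is None:
        rng = np.random.default_rng(0)
    ph0 = sigmoid(v0 @ W + b)                 # P(h = 1 | v0)
    h0 = (rng.random(ph0.shape) < ph0).astype(float)
    pv1 = sigmoid(h0 @ W.T + a)               # P(v = 1 | h0), reconstruction
    ph1 = sigmoid(pv1 @ W + b)                # P(h = 1 | v1)
    W += lr * (np.outer(v0, ph0) - np.outer(pv1, ph1))
    a += lr * (v0 - pv1)
    b += lr * (ph0 - ph1)
    return W, a, b
```

Layer-wise pre-training repeats this update over the texture samples for the bottom RBM, then feeds the resulting hidden probabilities upward as the training data for the next RBM.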
5. The deep-learning-based video caption detection and recognition method according to claim 1, characterized in that binarizing the image, performing grayscale post-processing, and sending it to the OCR character recognition system for character recognition comprises:
for text region localization in the video image, mapping from bottom-level features to the corresponding top-level features, layer by layer, until the result of the top layer is obtained;
after the text region has passed through the DBN network and morphological processing, performing binarization, removing the regions connected to the border, inverting the text region from black to white, and then sending it to OCR software for recognition.
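The post-processing in this claim (drop foreground regions connected to the border, then invert black and white so the text suits the OCR engine) can be sketched with a plain breadth-first flood fill from the border. A production system would more likely use OpenCV or scipy.ndimage connected-component tools; this pure-NumPy version is only an illustration of the operation.

```python
import numpy as np
from collections import deque

def clean_text_mask(mask):
    """Remove foreground (True) regions that touch the image border,
    then invert the mask so text pixels become 0 (black on white)."""
    mask = mask.astype(bool).copy()
    H, W = mask.shape
    # Seed the flood fill with every foreground pixel on the border.
    q = deque((r, c) for r in range(H) for c in range(W)
              if mask[r, c] and (r in (0, H - 1) or c in (0, W - 1)))
    while q:
        r, c = q.popleft()
        if not mask[r, c]:
            continue          # already cleared (duplicate queue entry)
        mask[r, c] = False
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            rr, cc = r + dr, c + dc
            if 0 <= rr < H and 0 <= cc < W and mask[rr, cc]:
                q.append((rr, cc))
    return ~mask  # black-white inversion: remaining text pixels -> 0
```

Border-connected components are usually clutter (graphics or scene edges cut by the caption box), while characters inside the localized region survive the fill and are inverted for recognition.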
6. A computer program implementing the deep-learning-based video caption detection and recognition method of any one of claims 1 to 5.
7. An information data processing terminal implementing the deep-learning-based video caption detection and recognition method of any one of claims 1 to 5.
8. A computer-readable storage medium comprising instructions which, when run on a computer, cause the computer to execute the deep-learning-based video caption detection and recognition method of any one of claims 1 to 5.
9. A deep-learning-based video caption detection and recognition system implementing the deep-learning-based video caption detection and recognition method of claim 1, characterized in that the deep-learning-based video caption detection and recognition system comprises:
a texture feature acquisition module, for filtering the video image with a Gabor filter to obtain the texture features of the characters in the video image text;
a deep learning network (DBN) construction module, for taking the texture features as training samples, performing layer-by-layer incremental learning on the texture images using restricted Boltzmann machines (RBM), using labeled samples as supervision data to fine-tune the network during learning, forming the deep learning network (DBN), and labeling the binary image of the text region and the background region;
a text image acquisition module, for denoising the binary image with morphological methods and mapping it back onto the localization image to obtain a text image that contains only the text region, with the background region removed;
a character recognition module, for binarizing the image, performing grayscale post-processing, and sending it to the OCR character recognition system for character recognition.
10. A video retrieval system implementing the deep-learning-based video caption detection and recognition method of claim 1.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810690120.7A CN108805102A (en) | 2018-06-28 | 2018-06-28 | A kind of video caption detection and recognition methods and system based on deep learning |
Publications (1)
Publication Number | Publication Date |
---|---|
CN108805102A true CN108805102A (en) | 2018-11-13 |
Family
ID=64072283
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810690120.7A Pending CN108805102A (en) | 2018-06-28 | 2018-06-28 | A kind of video caption detection and recognition methods and system based on deep learning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108805102A (en) |
Cited By (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109840492A (en) * | 2019-01-25 | 2019-06-04 | 厦门商集网络科技有限责任公司 | Document recognition methods and terminal based on deep learning network |
CN109857906A (en) * | 2019-01-10 | 2019-06-07 | 天津大学 | More video summarization methods of unsupervised deep learning based on inquiry |
CN109975308A (en) * | 2019-03-15 | 2019-07-05 | 维库(厦门)信息技术有限公司 | A kind of surface inspecting method based on deep learning |
CN111860472A (en) * | 2020-09-24 | 2020-10-30 | 成都索贝数码科技股份有限公司 | Television station caption detection method, system, computer equipment and storage medium |
CN112135108A (en) * | 2020-09-27 | 2020-12-25 | 苏州科达科技股份有限公司 | Video stream subtitle detection method, system, device and storage medium |
CN112560866A (en) * | 2021-02-25 | 2021-03-26 | 江苏东大集成电路系统工程技术有限公司 | OCR recognition method based on background suppression |
CN115410187A (en) * | 2022-09-05 | 2022-11-29 | 武汉言平科技有限公司 | Self-adaptive adjustment method and system for video recognition characters |
CN116363664A (en) * | 2023-04-10 | 2023-06-30 | 国网江苏省电力有限公司信息通信分公司 | OCR technology-based ciphertext-related book checking and labeling method and system |
Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106778732A (en) * | 2017-01-16 | 2017-05-31 | 哈尔滨理工大学 | Text information feature extraction and recognition method based on Gabor filter |
Non-Patent Citations (1)
Title |
---|
Liu Mingzhu et al.: "Video Text Region Localization and Recognition Based on the Deep Learning Method", Journal of Harbin University of Science and Technology * |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN113378632B (en) | Pseudo-label optimization-based unsupervised domain adaptive pedestrian re-identification method | |
CN108805102A (en) | A kind of video caption detection and recognition methods and system based on deep learning | |
Zhiqiang et al. | A review of object detection based on convolutional neural network | |
CN115019123B (en) | Self-distillation contrast learning method for remote sensing image scene classification | |
Zhang et al. | Road recognition from remote sensing imagery using incremental learning | |
Sumbul et al. | Informative and representative triplet selection for multilabel remote sensing image retrieval | |
CN112990282B (en) | Classification method and device for fine-granularity small sample images | |
Liu et al. | Subtler mixed attention network on fine-grained image classification | |
CN109299305A (en) | A kind of spatial image searching system based on multi-feature fusion and search method | |
Feng et al. | Bag of visual words model with deep spatial features for geographical scene classification | |
CN114360038B (en) | Weak supervision RPA element identification method and system based on deep learning | |
CN111461067A (en) | Zero sample remote sensing image scene identification method based on priori knowledge mapping and correction | |
Fang et al. | Detecting Uyghur text in complex background images with convolutional neural network | |
Liu et al. | A new patch selection method based on parsing and saliency detection for person re-identification | |
Naiemi et al. | Scene text detection using enhanced extremal region and convolutional neural network | |
Li et al. | Performance comparison of saliency detection | |
Kumar et al. | A technique for human upper body parts movement tracking | |
CN112613474A (en) | Pedestrian re-identification method and device | |
Sun et al. | Sample hardness guided softmax loss for face recognition | |
CN110363164A (en) | Unified method based on LSTM time consistency video analysis | |
Yong | Research on Painting Image Classification Based on Transfer Learning and Feature Fusion | |
Pan et al. | Leukocyte image segmentation using novel saliency detection based on positive feedback of visual perception | |
Ji et al. | Influence of embedded microprocessor wireless communication and computer vision in Wushu competition referees’ decision support | |
Zhang et al. | Research on lane identification based on deep learning | |
Yang et al. | Salient object detection based on global multi‐scale superpixel contrast |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
RJ01 | Rejection of invention patent application after publication | Application publication date: 20181113 |