CN110533041B - Regression-based multi-scale scene text detection method - Google Patents
- Publication number: CN110533041B
- Application number: CN201910838235.0A
- Authority: CN (China)
- Legal status: Active
Classifications
- G06F18/253: Pattern recognition; analysing; fusion techniques of extracted features
- G06N3/045: Neural networks; architecture; combinations of networks
- G06N3/049: Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
- G06N3/08: Neural networks; learning methods
- G06V20/62: Scenes; type of objects; text, e.g. of license plates, overlay texts or captions on TV images
Abstract
The invention relates to a regression-based multi-scale scene text detection method, belonging to the field of digital image processing. The method comprises the following steps: S1: preparing sufficient training data annotated with text positions; S2: constructing a feature extraction network, comprising a bottom-up forward network process and a top-down feature fusion process, for extracting low-level, middle-level and high-level features from each training image; S3: applying a cascade (inception) module to each feature layer fed into the detection layer; S4: adopting a regression-based detection framework, setting default boxes suited to the characteristics of text, and detecting the text positions in the image. The cascade module enlarges the receptive field of the network, the default boxes fit the scale characteristics of text well, and the text positions in the image are finally detected accurately.
Description
Technical Field
The invention belongs to the field of digital image processing, and relates to a regression-based multi-scale scene text detection method.
Background
With the popularization of smart devices, people can capture image information anytime and anywhere. Text in an image carries high-level semantic information and provides important clues for understanding and analysing image content. Text reflects the image content directly and is easier to extract and understand than other visual elements, and its description can often be used as-is, which makes it convenient for keyword-based retrieval and analysis of image and video content. Text detection has therefore become a popular research topic in computer vision.
There are many text detection methods. Traditional scene text detection requires hand-crafted features: different images call for different feature extraction schemes, and the workload is huge. Feature design also places high demands on the designer, requiring rich domain expertise. These issues created a bottleneck for hand-crafted features, and the emergence of deep learning has resolved it.
Following the excellent detection results of deep learning in object detection, a number of text detection methods have been developed by adapting general-purpose object detection algorithms. These methods fall into two main categories: candidate-region-based methods and regression-based methods. Unlike general object detection, the aspect ratio of text varies drastically, so making the network robust to text scale change is a problem that must be considered. An example of a text detection algorithm developed from candidate-region-based methods is the Connectionist Text Proposal Network (CTPN, "Detecting Text in Natural Image with Connectionist Text Proposal Network"). CTPN observes that the length of a text sequence varies drastically and that the horizontal position is harder to predict than the vertical one; to generate text proposals more accurately, it fixes the default box width to 16 and predicts only the vertical position. Although this method was the first to train a convolutional neural network and a recurrent neural network end to end, extracting both the spatial and the sequential features of text, and achieves high detection accuracy on multi-scale, multi-language text, it handles only horizontal text and is slow. An example of a text detection algorithm that improves on regression-based methods is TextBoxes ("TextBoxes: A Fast Text Detector with a Single Deep Neural Network"), which predicts at different layers, small targets at low layers and large targets at high layers, and designs default boxes fitted to text scales.
Although these methods achieve good speed and accuracy, their detection of small targets is not ideal because feature extraction at the low and middle layers is insufficient.
Therefore, a text detection method that is robust to text scale change is needed.
Disclosure of Invention
In view of this, the present invention provides a regression-based multi-scale scene text detection method, which addresses the insufficient robustness of current regression-based text detection networks to text scale change, sets default boxes suited to text characteristics, and finally detects the text positions in an image.
In order to achieve the purpose, the invention provides the following technical scheme:
a multi-scale scene text detection method based on regression specifically comprises the following steps:
S1: preparing sufficient training data annotated with text positions;
S2: constructing a feature extraction network, comprising a bottom-up forward network process and a top-down feature fusion process, for extracting low-level, middle-level and high-level features of each training image;
S3: applying a cascade (inception) module to each feature layer fed into the detection layer;
S4: adopting a regression-based detection framework, setting default boxes suited to text characteristics, and detecting the text positions in the image.
Further, in step S2, the bottom-up forward network comprises: an input module, first to fifth convolution modules, first to fifth pooling modules, a recurrent neural network module, sixth to tenth convolution modules and a sixth pooling module; the input module, the first to fifth convolution modules and the first to fifth pooling modules are cascaded alternately, each convolution module being followed by its corresponding pooling module; the recurrent neural network module, the sixth to tenth convolution modules and the sixth pooling module are then cascaded in sequence.
Further, in step S2, the top-down feature fusion refers to fusing a high-level feature with a low-level feature, and specifically comprises: the high-level layer first passes through a deconvolution to obtain a feature map whose size matches the low-level layer, followed by a Batch Normalization (BatchNorm) module; the low-level layer first passes through a convolution module with kernel size 1 x 1, stride 1 and padding 0, followed by a BatchNorm module; finally the two feature layers are fused by an element-wise product operation (Eltwise); the fused output serves as the output of the whole feature extraction network.
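As a rough illustration, the fusion step above can be sketched in numpy. Nearest-neighbour upsampling stands in for the learned deconvolution, the 1 x 1 convolution uses hypothetical random weights, and the BatchNorm modules are omitted, so this demonstrates only the shapes and the element-wise product, not trained behaviour:

```python
import numpy as np

def upsample2x(x):
    """Nearest-neighbour upsampling; stands in for the learned 2x deconvolution."""
    return x.repeat(2, axis=-2).repeat(2, axis=-1)

def conv1x1(x, w):
    """1 x 1 convolution (stride 1, padding 0): per-pixel channel mixing."""
    # x: (C_in, H, W), w: (C_out, C_in)
    return np.einsum('oc,chw->ohw', w, x)

def fuse(high, low, w):
    """Top-down fusion: bring the high-level map to the low-level size,
    project the low-level map with a 1 x 1 conv, then take the
    element-wise (Eltwise) product."""
    return upsample2x(high) * conv1x1(low, w)

rng = np.random.default_rng(0)
high = rng.normal(size=(256, 10, 10))   # high-level feature map
low = rng.normal(size=(512, 20, 20))    # low-level feature map
w = rng.normal(size=(256, 512))         # hypothetical 1 x 1 conv weights
out = fuse(high, low, w)
print(out.shape)  # (256, 20, 20): matches the low-level spatial size
```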
Furthermore, the convolution kernels of the first, second, third and fourth convolution modules are all 3 x 3, with stride 1 and padding 1; the kernel size of the fifth pooling module is 3 x 3, with stride 1 and padding 1; the kernel size of the remaining pooling modules is 2 x 2, with stride 2 and padding 0; the recurrent neural network module is a bidirectional long short-term memory recurrent neural network (BLSTM-RNN) with 256 hidden units; the kernel size of the seventh convolution module is 1 x 1, with stride 1 and padding 0; the eighth to tenth convolution modules each comprise two convolution kernels, one of size 1 x 1 with stride 1 and padding 0, and the other of size 3 x 3 with stride 2 and padding 1.
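Under these kernel settings, the feature-map sizes can be traced with the standard output-size formula. A minimal sketch, assuming a 300 x 300 input (as in the embodiment below) and that each of the first four convolution modules is followed by its 2 x 2 pooling module:

```python
def conv_out(size, k, s, p):
    """Standard convolution/pooling output-size formula."""
    return (size + 2 * p - k) // s + 1

size = 300
# conv modules 1-4: 3 x 3, stride 1, padding 1 -> size preserved
# pooling modules 1-4: 2 x 2, stride 2, padding 0 -> size halved (floor)
for _ in range(4):
    size = conv_out(size, 3, 1, 1)   # convolution preserves the size
    size = conv_out(size, 2, 2, 0)   # pooling halves it
print(size)  # 18 after four conv+pool stages: 300 -> 150 -> 75 -> 37 -> 18
# pooling module 5: 3 x 3, stride 1, padding 1 -> size preserved
size = conv_out(size, 3, 1, 1)
```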
Further, in step S3, the cascade module comprises an input feature-map end and a concatenated feature-map end, connected by four parallel convolution branches, each branch containing 1, 2 or 3 convolution modules.
Further, the cascade module comprises four convolution branches connected in parallel:
the first convolution branch comprises one convolution kernel of size 3 x 3, with stride 1 and padding 1;
the second convolution branch comprises three convolution kernels: one of size 1 x 1 with stride 1 and padding 0; one of size 1 x 5 with stride 1 and padding 1; and one of size 5 x 1 with stride 1 and padding 1;
the third convolution branch comprises three convolution kernels: one of size 1 x 1 with stride 1 and padding 0; one of size 5 x 1 with stride 1 and padding 1; and one of size 1 x 5 with stride 1 and padding 1;
the fourth convolution branch comprises a pooling layer and a convolution kernel: the pooling layer kernel is 3 x 3 with stride 1 and padding 1, and the convolution kernel is 1 x 1 with stride 1 and padding 0;
every convolution kernel is followed by a BatchNorm module and a ReLU module.
The beneficial effects of the invention are as follows: the text detection method is strongly robust to text scale change. A convolutional recurrent neural network extracts the spatial and sequential features of text simultaneously. A feature-pyramid structure with multi-layer prediction outputs predicts small targets from low-level feature maps and large targets from high-level feature maps. Feature fusion lets high-level semantic information drive classification while low-level structural information assists regression, which alleviates, to a certain extent, the insufficient extraction of low-level features and the low accuracy of small-target prediction. Finally, an inception module applied to each feature layer fed into the detection layer further enlarges the receptive field of the network; a regression-based detection framework with default boxes suited to text characteristics then detects the text positions in the image.
Additional advantages, objects, and features of the invention will be set forth in part in the description which follows and in part will become apparent to those having ordinary skill in the art upon examination of the following or may be learned from practice of the invention. The objectives and other advantages of the invention may be realized and attained by the means of the instrumentalities and combinations particularly pointed out hereinafter.
Drawings
For a better understanding of the objects, aspects and advantages of the present invention, reference will now be made to the following detailed description taken in conjunction with the accompanying drawings in which:
FIG. 1 is a schematic diagram of a network architecture according to the present invention;
FIG. 2 is a schematic view of feature fusion;
fig. 3 is a schematic structural diagram of the cascade (inception) module.
Detailed Description
The embodiments of the present invention are described below with reference to specific embodiments, and other advantages and effects of the present invention will be easily understood by those skilled in the art from the disclosure of the present specification. The invention is capable of other and different embodiments and of being practiced or of being carried out in various ways, and its several details are capable of modification in various respects, all without departing from the spirit and scope of the present invention.
Referring to fig. 1 to fig. 3, a preferred embodiment of a regression-based multi-scale scene text detection method according to the present invention includes the following steps:
the method comprises the following steps: preparing data;
several public data sets were aggregated — SynthText, ICDAR2011, ICDAR2013, SVT. Wherein SynthText contains 8 x 105The opening and combination picture is used for network pre-training, and 749 training pictures including ICDAR2011, ICDAR2013 and SVT are used for fine adjustment of the network. The three data sets of ICDAR2011, ICDAR2013, SVT total 585 training pictures for testing.
Step two: the network pre-training specifically comprises the following steps:
1) constructing a network structure as shown in fig. 1;
2) pre-training the network on the SynthText synthetic dataset: images normalized to 300 x 300 are fed into the network model, which outputs the text localization results and the text classification scores, using the loss function of formula (1):
L(x, c, l, g) = (1/N) (L_conf(x, c) + α L_loc(x, l, g))    (1)
The loss function comprises two parts: the text-line classification loss and the regression loss of the text-line default-box positions. N denotes the number of matched default boxes, α = 1, x is the matching matrix between default boxes and ground-truth boxes, c denotes the confidence that each default box contains text, l denotes the predicted localization of each default box, and g denotes the ground-truth box position. The text-line classification loss L_conf uses cross-entropy, and the default-box position regression loss L_loc uses smooth L1;
3) the loss obtained in 2) is optimized with the Adam optimizer ("Adam: A Method for Stochastic Optimization"): the network parameters are updated continually by minimizing the loss function. The network is trained for 4 x 10^6 iterations in total; the learning rate is initialized to 10^-3 and multiplied by 0.1 every 4 x 10^5 iterations; dropout randomly discards parameters with probability 0.3.
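The loss of formula (1) can be sketched in numpy under the stated choices (cross-entropy for L_conf, smooth L1 for L_loc, α = 1). As a simplification, the matching matrix x is reduced here to one binary label per default box:

```python
import numpy as np

def smooth_l1(pred, target):
    """Smooth L1: 0.5*d^2 for |d| < 1, |d| - 0.5 otherwise."""
    d = np.abs(pred - target)
    return np.where(d < 1, 0.5 * d ** 2, d - 0.5).sum()

def cross_entropy(scores, labels):
    """Softmax cross-entropy over text / non-text scores."""
    e = np.exp(scores - scores.max(axis=1, keepdims=True))
    p = e / e.sum(axis=1, keepdims=True)
    return -np.log(p[np.arange(len(labels)), labels] + 1e-12).sum()

def detection_loss(scores, labels, loc_pred, loc_gt, alpha=1.0):
    """L(x, c, l, g) = (1/N) * (L_conf + alpha * L_loc), N = matched boxes."""
    n = max(int((labels == 1).sum()), 1)
    l_conf = cross_entropy(scores, labels)
    l_loc = smooth_l1(loc_pred[labels == 1], loc_gt[labels == 1])
    return (l_conf + alpha * l_loc) / n

scores = np.array([[2.0, -1.0], [0.5, 1.5], [3.0, -2.0]])  # per-box logits
labels = np.array([0, 1, 0])   # 1 = matched to a ground-truth text box
loc_pred = np.zeros((3, 4))    # predicted offsets (toy values)
loc_gt = np.zeros((3, 4))      # ground-truth offsets (toy values)
loss = detection_loss(scores, labels, loc_pred, loc_gt)
```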
Step three: the network fine tuning specifically comprises the following steps:
1) fine-tuning the network model obtained in step two with the 749 real pictures from ICDAR2011, ICDAR2013 and SVT prepared in step one; the 749 real pictures are augmented with operations such as random flipping, noise addition and blurring;
2) setting default boxes with six different aspect ratios on the different output layers, namely 1, 2, 3, 5, 7 and 10;
3) the detection layer uses a cascade (inception) module that cascades convolution kernels of different sizes; widening the network enlarges its receptive field and thereby addresses the detection of text with extreme aspect ratios;
4) setting the learning rate to 10^-5 and running 20000 iterations in total, optimized with stochastic gradient descent, to obtain the final deep neural network model.
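The default boxes of step 2) can be sketched as follows; the grid layout and the base scale value are hypothetical, the point being only that width grows and height shrinks with the aspect ratio, matching the wide, short shape of text lines:

```python
import numpy as np

def default_boxes(fmap_size, scale, ratios=(1, 2, 3, 5, 7, 10)):
    """Centre-form default boxes (cx, cy, w, h) on a square feature-map grid.
    For aspect ratio r the box is scale*sqrt(r) wide and scale/sqrt(r) tall,
    so w/h == r. `scale` is a hypothetical base size in relative units."""
    boxes = []
    step = 1.0 / fmap_size
    for i in range(fmap_size):
        for j in range(fmap_size):
            cx, cy = (j + 0.5) * step, (i + 0.5) * step
            for r in ratios:
                boxes.append((cx, cy, scale * np.sqrt(r), scale / np.sqrt(r)))
    return np.array(boxes)

boxes = default_boxes(fmap_size=10, scale=0.2)
print(boxes.shape)  # (600, 4): 10*10 locations x 6 aspect ratios
```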
step four: and testing the learned network on a test set: in the step, the normalized test image is input into a network model, and the network output is the positioning result of the text and the score of the text classification.
Finally, the above embodiments are only intended to illustrate the technical solutions of the present invention and not to limit the present invention, and although the present invention has been described in detail with reference to the preferred embodiments, it will be understood by those skilled in the art that modifications or equivalent substitutions may be made on the technical solutions of the present invention without departing from the spirit and scope of the technical solutions, and all of them should be covered by the claims of the present invention.
Claims (3)
1. A multi-scale scene text detection method based on regression is characterized by specifically comprising the following steps:
S1: preparing sufficient training data annotated with text positions;
S2: constructing a feature extraction network, comprising a bottom-up forward network process and a top-down feature fusion process, for extracting low-level, middle-level and high-level features of each training image; wherein the bottom-up forward network comprises: an input module, first to fifth convolution modules, first to fifth pooling modules, a recurrent neural network module, sixth to tenth convolution modules and a sixth pooling module; the input module, the first to fifth convolution modules and the first to fifth pooling modules are cascaded alternately, each convolution module being followed by its corresponding pooling module; the recurrent neural network module, the sixth to tenth convolution modules and the sixth pooling module are cascaded in sequence;
the convolution kernels of the first convolution module, the second convolution module, the third convolution module and the fourth convolution module are all 3 x 3, the step length is 1, and the filling is 1; the convolution kernel size of the fifth pooling module is 3 x 3, the step size is 1, and the padding is 1; the convolution kernel size of the rest pooling modules is 2 x 2, the step length is 2, and the filling is 0; one circulating Neural Network module is a bidirectional Long Short-Term Memory circulating Neural Network (BLSTM-RNN), and the number of hidden layer units is 256; the size of the seventh convolution kernel is 1 x 1, the step size is 1, and the padding is 0; the eighth to tenth convolution modules each include two convolution kernels, wherein one convolution kernel has a size of 1 × 1, a step size of 1, and a padding of 0, and the other convolution kernel has a size of 3 × 3, a step size of 2, and a padding of 1;
S3: applying a cascade module to each feature layer fed into the detection layer;
the cascade module comprises four convolution branches connected in parallel:
the first convolution branch comprises one convolution kernel of size 3 x 3, with stride 1 and padding 1;
the second convolution branch comprises three convolution kernels: one of size 1 x 1 with stride 1 and padding 0; one of size 1 x 5 with stride 1 and padding 1; and one of size 5 x 1 with stride 1 and padding 1;
the third convolution branch comprises three convolution kernels: one of size 1 x 1 with stride 1 and padding 0; one of size 5 x 1 with stride 1 and padding 1; and one of size 1 x 5 with stride 1 and padding 1;
the fourth convolution branch comprises a pooling layer and a convolution kernel: the pooling layer kernel is 3 x 3 with stride 1 and padding 1, and the convolution kernel is 1 x 1 with stride 1 and padding 0;
every convolution kernel is followed by a BatchNorm module and a Rectified Linear Unit (ReLU) module;
S4: adopting a regression-based detection framework, setting default boxes suited to text characteristics, and detecting the text positions in the image.
2. The regression-based multi-scale scene text detection method according to claim 1, wherein in step S2, the top-down feature fusion refers to fusing a high-level feature with a low-level feature, and specifically comprises: the high-level layer first passes through a deconvolution to obtain a feature map whose size matches the low-level layer, followed by a Batch Normalization (BatchNorm) module; the low-level layer first passes through a convolution module with kernel size 1 x 1, stride 1 and padding 0, followed by a BatchNorm module; finally the two feature layers are fused by an element-wise product operation; the fused output serves as the output of the whole feature extraction network.
3. The regression-based multi-scale scene text detection method according to claim 1, wherein in step S3, the cascade module comprises an input feature-map end and a concatenated feature-map end, connected by four parallel convolution branches, each branch containing 1, 2 or 3 convolution modules.
Priority Applications (1)
- CN201910838235.0A (CN110533041B), priority and filing date 2019-09-05: Regression-based multi-scale scene text detection method
Publications (2)
- CN110533041A, published 2019-12-03 (application)
- CN110533041B, published 2022-07-01 (grant)

Family ID: 68667081
Families Citing this family (7)
- TWI689875B (2018-06-29): Defect inspection and classification apparatus and training apparatus using deep learning system
- CN111259764A (2020-01-10): Text detection method and device, electronic equipment and storage device
- CN111881943A (2020-07-08): Method, device, equipment and computer readable medium for image classification
- CN112287962B (2020-08-10): Training method, detection method and device for a multi-scale target detection model, and terminal equipment
- CN113408525B (2021-06-17): Text recognition method fusing a multilayer ternary pivot and bidirectional long short-term memory
- CN115393868B (2022-08-18): Text detection method, device, electronic equipment and storage medium
- CN116704248A (2023-06-07): Serum sample image classification method based on multi-semantic unbalanced learning
Citations (9)
- CN105631426A (2015-12-29): Image text detection method and device
- CN107578060A (2017-08-14): Method for dish image classification with a deep neural network based on discriminative regions
- CN107688808A (2017-08-07): Fast natural scene text detection method
- CN108549893A (2018-04-04): End-to-end recognition method for scene text of arbitrary shape
- CN108734169A (2018-05-21): Scene text extraction method based on an improved fully convolutional network
- CN109086663A (2018-06-27): Scale-adaptive natural scene text detection method based on convolutional neural networks
- CN109271967A (2018-10-16): Method and device for recognizing text in images, electronic equipment, storage medium
- CN109299274A (2018-11-07): Natural scene text detection method based on fully convolutional neural networks
- EP3534298A1 (2018-02-26): Dual stage neural network pipeline systems and methods
Family Cites Families (1)
- RU2691214C1 (2017-12-13): Text recognition using artificial intelligence
Non-Patent Citations (4)
- "Deep Direct Regression for Multi-oriented Scene Text Detection", Wenhao He et al., 2017 IEEE International Conference on Computer Vision, 2017-12-25
- "Natural scene text detection and recognition based on deep learning", Fang Qing, China Masters' Theses Full-text Database (Information Science and Technology), 2018-09-15
- "Multi-oriented scene text detection based on deep features", Yang Xiaodong, China Masters' Theses Full-text Database (Information Science and Technology), 2019-07-15
- "Research on multi-oriented natural scene text extraction methods", Lei Qilun, China Masters' Theses Full-text Database (Information Science and Technology), 2019-04-15
Legal Events
- PB01: Publication
- SE01: Entry into force of request for substantive examination
- GR01: Patent grant