CN110533041B - Regression-based multi-scale scene text detection method - Google Patents


Info

Publication number
CN110533041B
Authority
CN
China
Prior art keywords
convolution
module
size
text
convolution kernel
Prior art date
Legal status
Active
Application number
CN201910838235.0A
Other languages
Chinese (zh)
Other versions
CN110533041A (en)
Inventor
Jing Xiaorong (景小荣)
Zhu Li (朱莉)
Current Assignee
Chongqing University of Post and Telecommunications
Original Assignee
Chongqing University of Post and Telecommunications
Priority date
Filing date
Publication date
Application filed by Chongqing University of Post and Telecommunications filed Critical Chongqing University of Post and Telecommunications
Priority to CN201910838235.0A priority Critical patent/CN110533041B/en
Publication of CN110533041A publication Critical patent/CN110533041A/en
Application granted granted Critical
Publication of CN110533041B publication Critical patent/CN110533041B/en

Classifications

    • G06F18/253 Fusion techniques of extracted features (Pattern recognition; Analysing)
    • G06N3/045 Combinations of networks (Neural networks; Architecture, e.g. interconnection topology)
    • G06N3/049 Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs (Neural networks; Architecture)
    • G06N3/08 Learning methods (Neural networks)
    • G06V20/62 Text, e.g. of license plates, overlay texts or captions on TV images (Scenes; Type of objects)

Abstract

The invention relates to a regression-based multi-scale scene text detection method and belongs to the field of digital image processing. The method specifically comprises the following steps: S1: preparing sufficient training data annotated with text positions; S2: constructing a feature extraction network comprising a bottom-up forward network process and a top-down feature fusion process, used to extract low-level, middle-level and high-level features from each training sample; S3: applying a cascade module to each feature layer fed into the detection layer; S4: adopting a regression-based detection framework, setting default boxes suited to text characteristics, and detecting text positions in the image. The cascade module adopted by the invention enlarges the receptive field of the network, so that the default boxes set for text characteristics fit well, and the text positions in the image are finally detected accurately.

Description

Regression-based multi-scale scene text detection method
Technical Field
The invention belongs to the field of digital image processing and relates to a regression-based multi-scale scene text detection method.
Background
With the popularization of smart devices, people can acquire image information anytime and anywhere. Text in an image serves as high-level semantic information and provides important clues for understanding and analyzing image content. Text reflects image content directly and is easier to extract and understand than other elements, and its description can be used as-is, which makes it convenient for keyword-based retrieval and analysis of all kinds of image and video content. Text detection has thus become a popular research topic in the field of computer vision.
There are many text detection methods. Traditional scene text detection methods rely on hand-crafted features: different images require different feature extraction schemes, which entails an enormous workload. At the same time, feature design places high demands on designers and requires rich professional knowledge. All of this creates a development bottleneck for hand-designed features, a problem that the advent of deep learning has solved.
Following the excellent detection results of deep learning in the field of object detection, text detection methods improved from generic object detection algorithms have emerged. Methods based on generic object detection fall into two main categories: candidate-region-based methods and regression-based methods. Unlike generic object detection, the aspect ratio of text varies drastically, so making the network strongly robust to text scale changes is a problem that must be considered. Among text detection algorithms developed from candidate-region-based methods is the Connectionist Text Proposal Network (CTPN, from "Detecting Text in Natural Image with Connectionist Text Proposal Network"). This framework observes that the length of a text sequence varies violently and that the horizontal extent is harder to predict than the vertical one; to generate text proposals more accurately, it fixes the default box width to 16 and predicts positions only in the vertical direction. Although this method realizes end-to-end training of a convolutional neural network with a recurrent neural network for the first time, extracting both the spatial and sequential features of text, and achieves high detection accuracy on multi-scale and multi-language text, it handles only horizontal text and runs slowly. Among text detection algorithms that improve on regression-based methods is TextBoxes ("A Fast Text Detector with a Single Deep Neural Network"), which predicts at different layers, predicting small targets at low layers and large targets at high layers, and designs default boxes that fit text scales. Although it performs well in both speed and accuracy, its detection of small targets is unsatisfactory because low- and middle-level feature extraction is insufficient.
Therefore, a text detection method that is strongly robust to text scale changes is needed.
Disclosure of Invention
In view of this, the present invention provides a regression-based multi-scale scene text detection method, which solves the problem that current regression-based text detection networks are not robust enough to text scale changes, sets default boxes suited to text characteristics, and finally detects the text positions in an image.
To achieve the above purpose, the invention provides the following technical scheme:
A regression-based multi-scale scene text detection method, which specifically comprises the following steps:
S1: preparing sufficient training data annotated with text positions;
S2: constructing a feature extraction network comprising a bottom-up forward network process and a top-down feature fusion process, used to extract low-level, middle-level and high-level features from each training sample;
S3: applying a cascade (Inception) module to each feature layer fed into the detection layer;
S4: adopting a regression-based detection framework, setting default boxes suited to text characteristics, and detecting the text positions in the image.
Further, in step S2, the bottom-up forward network comprises: an input module, first to fifth convolution modules, first to fifth pooling modules, a recurrent neural network module, sixth to tenth convolution modules and a sixth pooling module; the input module and the first to fifth convolution modules are cascaded with the first to fifth pooling modules in alternation, each convolution module being followed by its corresponding pooling module; the recurrent neural network module, the sixth to tenth convolution modules and the sixth pooling module are then cascaded in sequence.
Further, in step S2, the top-down feature fusion refers to the fusion of a high-level feature with a low-level feature, and specifically comprises: the high level first obtains, through deconvolution, a feature map consistent with the size of the low level, followed by a Batch Normalization (BatchNorm) module; the low level is first passed through a convolution module with kernel size 1 x 1, stride 1 and padding 0, followed by a BatchNorm module; finally the two feature layers are fused by an element-wise product operation (Eltwise); the fused output serves as the output of the whole feature extraction network.
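For illustration only, a minimal PyTorch sketch of this fusion step follows. The channel counts, the 2x upsampling factor of the deconvolution and the example map sizes are assumptions; the patent specifies only the order of the operations.

```python
import torch
import torch.nn as nn

class TopDownFusion(nn.Module):
    """Fuse one high-level feature map into one low-level feature map:
    deconvolution + BatchNorm on the high level, 1x1 convolution +
    BatchNorm on the low level, then element-wise (Eltwise) product."""
    def __init__(self, high_ch, low_ch):
        super().__init__()
        # High-level path: deconvolve up to the low-level map size.
        self.deconv = nn.ConvTranspose2d(high_ch, low_ch, kernel_size=2, stride=2)
        self.bn_high = nn.BatchNorm2d(low_ch)
        # Low-level path: 1x1 convolution, stride 1, padding 0.
        self.conv1x1 = nn.Conv2d(low_ch, low_ch, kernel_size=1, stride=1, padding=0)
        self.bn_low = nn.BatchNorm2d(low_ch)

    def forward(self, high, low):
        h = self.bn_high(self.deconv(high))
        l = self.bn_low(self.conv1x1(low))
        return h * l  # Eltwise product fusion

# Example: fuse a 10x10, 512-channel map into a 20x20, 256-channel map.
fused = TopDownFusion(512, 256)(torch.randn(1, 512, 10, 10),
                                torch.randn(1, 256, 20, 20))
print(fused.shape)  # torch.Size([1, 256, 20, 20])
```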
Furthermore, the convolution kernels of the first, second, third and fourth convolution modules are all 3 x 3 with stride 1 and padding 1; the pooling kernel of the fifth pooling module is 3 x 3 with stride 1 and padding 1, while the remaining pooling modules use 2 x 2 kernels with stride 2 and padding 0; the recurrent neural network module is a bidirectional Long Short-Term Memory recurrent neural network (BLSTM-RNN) with 256 hidden-layer units; the convolution kernel of the seventh convolution module is 1 x 1 with stride 1 and padding 0; and the eighth to tenth convolution modules each comprise two convolution kernels, one of size 1 x 1 with stride 1 and padding 0, the other of size 3 x 3 with stride 2 and padding 1.
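The repeated building blocks above can be sketched as follows. The channel counts are assumptions, and applying the BLSTM row-wise (CTPN-style) is one plausible reading of the recurrent neural network module, which the patent does not spell out.

```python
import torch
import torch.nn as nn

def conv3x3(cin, cout):
    # 3 x 3 convolution, stride 1, padding 1 (first to fourth convolution modules).
    return nn.Sequential(nn.Conv2d(cin, cout, 3, 1, 1), nn.ReLU(inplace=True))

def down_module(cin, cmid, cout):
    # Eighth to tenth convolution modules: a 1 x 1 kernel (stride 1, padding 0)
    # followed by a 3 x 3 kernel (stride 2, padding 1) that halves the map size.
    return nn.Sequential(nn.Conv2d(cin, cmid, 1, 1, 0), nn.ReLU(inplace=True),
                         nn.Conv2d(cmid, cout, 3, 2, 1), nn.ReLU(inplace=True))

class RowBLSTM(nn.Module):
    """Bidirectional LSTM with 256 hidden units, run over each row of the
    feature map so that horizontal text sequences are modelled."""
    def __init__(self, cin, hidden=256):
        super().__init__()
        self.rnn = nn.LSTM(cin, hidden, bidirectional=True, batch_first=True)

    def forward(self, x):
        n, c, h, w = x.shape
        rows = x.permute(0, 2, 3, 1).reshape(n * h, w, c)  # each row is a sequence
        out, _ = self.rnn(rows)                            # (n*h, w, 2*hidden)
        return out.reshape(n, h, w, -1).permute(0, 3, 1, 2)

print(conv3x3(3, 64)(torch.randn(1, 3, 300, 300)).shape)   # (1, 64, 300, 300)
feat = RowBLSTM(256)(torch.randn(1, 256, 19, 19))          # (1, 512, 19, 19)
print(down_module(512, 128, 256)(feat).shape)              # (1, 256, 10, 10)
```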
Further, in step S3, the cascade module comprises an input feature map end and a feature map concatenation end, connected by four parallel convolution branches, each branch containing 1, 2 or 3 convolution modules.
Further, the cascade module comprises four convolution branches connected in parallel:
the first convolution branch comprises one convolution kernel of size 3 x 3 with stride 1 and padding 1;
the second convolution branch comprises three convolution kernels: one of size 1 x 1 with stride 1 and padding 0, one of size 1 x 5 with stride 1 and padding 1, and one of size 5 x 1 with stride 1 and padding 1;
the third convolution branch comprises three convolution kernels: one of size 1 x 1 with stride 1 and padding 0, one of size 5 x 1 with stride 1 and padding 1, and one of size 1 x 5 with stride 1 and padding 1;
the fourth convolution branch comprises a pooling layer whose kernel is 3 x 3 with stride 1 and padding 1, followed by a convolution kernel of size 1 x 1 with stride 1 and padding 0;
every convolution kernel is followed by a BatchNorm module and a ReLU module.
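A minimal PyTorch sketch of this four-branch module follows. The channel counts are assumptions. Note also that for the 1 x 5 and 5 x 1 kernels the sketch uses paddings (0, 2) and (2, 0) so that all four branches keep the input size and can be concatenated; the text above states padding 1, which would not preserve the map size.

```python
import torch
import torch.nn as nn

def cbr(cin, cout, k, s, p):
    # Convolution followed by a BatchNorm module and a ReLU module.
    return nn.Sequential(nn.Conv2d(cin, cout, k, s, p),
                         nn.BatchNorm2d(cout), nn.ReLU(inplace=True))

class CascadeModule(nn.Module):
    def __init__(self, cin, cb):
        super().__init__()
        self.b1 = cbr(cin, cb, 3, 1, 1)                    # single 3x3 kernel
        self.b2 = nn.Sequential(cbr(cin, cb, 1, 1, 0),     # 1x1 -> 1x5 -> 5x1
                                cbr(cb, cb, (1, 5), 1, (0, 2)),
                                cbr(cb, cb, (5, 1), 1, (2, 0)))
        self.b3 = nn.Sequential(cbr(cin, cb, 1, 1, 0),     # 1x1 -> 5x1 -> 1x5
                                cbr(cb, cb, (5, 1), 1, (2, 0)),
                                cbr(cb, cb, (1, 5), 1, (0, 2)))
        self.b4 = nn.Sequential(nn.MaxPool2d(3, 1, 1),     # 3x3 pooling layer
                                cbr(cin, cb, 1, 1, 0))     # then 1x1 kernel

    def forward(self, x):
        # Concatenate the four parallel branches along the channel axis.
        return torch.cat([self.b1(x), self.b2(x), self.b3(x), self.b4(x)], dim=1)

y = CascadeModule(256, 64)(torch.randn(1, 256, 38, 38))
print(y.shape)  # torch.Size([1, 256, 38, 38])
```

Mixing 1 x 5 and 5 x 1 kernels widens the receptive field horizontally and vertically at low cost, which suits long text lines.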
The invention has the following beneficial effects. The text detection method is strongly robust to text scale changes. The method uses a convolutional-recurrent neural network to extract the spatial and sequential features of text simultaneously. It uses the multi-layer prediction outputs of a feature pyramid structure, predicting small targets from low-level feature maps and large targets from high-level feature maps. Through feature fusion, high-level semantic information is used for classification while low-level structural information assists regression, which alleviates, to a certain extent, the problems of insufficient low-level feature extraction and low accuracy in small-target prediction. Finally, an Inception module is applied to each feature layer fed into the detection layer to further enlarge the receptive field of the network; a regression-based detection framework is then adopted, default boxes suited to text characteristics are set, and the text positions in the image are finally detected.
Additional advantages, objects, and features of the invention will be set forth in part in the description which follows and in part will become apparent to those having ordinary skill in the art upon examination of the following or may be learned from practice of the invention. The objectives and other advantages of the invention may be realized and attained by the means of the instrumentalities and combinations particularly pointed out hereinafter.
Drawings
For a better understanding of the objects, aspects and advantages of the present invention, reference will now be made to the following detailed description taken in conjunction with the accompanying drawings in which:
FIG. 1 is a schematic diagram of a network architecture according to the present invention;
FIG. 2 is a schematic view of feature fusion;
FIG. 3 is a schematic structural diagram of the cascade Inception module.
Detailed Description
The embodiments of the present invention are described below with reference to specific embodiments, and other advantages and effects of the present invention will be easily understood by those skilled in the art from the disclosure of the present specification. The invention is capable of other and different embodiments and of being practiced or of being carried out in various ways, and its several details are capable of modification in various respects, all without departing from the spirit and scope of the present invention.
Referring to fig. 1 to fig. 3, a preferred embodiment of a regression-based multi-scale scene text detection method according to the present invention includes the following steps:
the method comprises the following steps: preparing data;
several public data sets were aggregated — SynthText, ICDAR2011, ICDAR2013, SVT. Wherein SynthText contains 8 x 105The opening and combination picture is used for network pre-training, and 749 training pictures including ICDAR2011, ICDAR2013 and SVT are used for fine adjustment of the network. The three data sets of ICDAR2011, ICDAR2013, SVT total 585 training pictures for testing.
Step two: network pre-training, which specifically comprises the following steps:
1) constructing a network structure as shown in fig. 1;
2) pre-training the network on the SynthText synthetic dataset: images normalized to 300 × 300 are input into the network model; the network outputs the text localization results and text classification scores, and is trained with the loss function shown in formula (1):
L(x, c, l, g) = (1/N)(L_conf(x, c) + α L_loc(x, l, g))    (1)
The loss function includes two parts: the binary classification loss of the text line and the regression loss of the text-line default box positions. Here N denotes the number of matched default boxes, α = 1, x is the matching matrix between default boxes and ground-truth boxes, c denotes the confidence that each default box contains text, l denotes the localization result predicted by the network for each default box, and g denotes the ground-truth box position. The binary classification loss L_conf of the text line uses the cross-entropy loss, and the regression loss L_loc of the text-line default box positions uses the smooth L1 loss;
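For illustration only, the following sketch computes the loss of formula (1), assuming the matching step has already produced per-box labels, encoded localization targets and therefore the positive-box set; the hard-negative mining used by SSD-style detectors is omitted for brevity.

```python
import torch
import torch.nn.functional as F

def detection_loss(conf, loc, gt_loc, labels, alpha=1.0):
    """Loss of formula (1).
    conf:   (num_boxes, 2) text / non-text scores, c
    loc:    (num_boxes, 4) predicted box offsets, l
    gt_loc: (num_boxes, 4) encoded ground-truth offsets, g
    labels: (num_boxes,)   0 = background, 1 = text (from the matching x)
    """
    pos = labels > 0                        # matched (positive) default boxes
    n = pos.sum().clamp(min=1)              # N in formula (1)
    l_conf = F.cross_entropy(conf, labels, reduction='sum')           # L_conf
    l_loc = F.smooth_l1_loss(loc[pos], gt_loc[pos], reduction='sum')  # L_loc
    return (l_conf + alpha * l_loc) / n

print(detection_loss(torch.randn(100, 2), torch.randn(100, 4),
                     torch.randn(100, 4), torch.randint(0, 2, (100,))))
```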
3) optimizing the loss obtained in 2) with the Adam optimizer (Adam: A Method for Stochastic Optimization): parameters in the network are updated continually by minimizing the loss function. The network is trained for 4 × 10^6 iterations in total; the learning rate is initialized to 10^-3 and multiplied by 0.1 every 4 × 10^5 iterations, and dropout with rate 0.3 is applied.
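The stated schedule maps directly onto Adam with a step learning-rate decay, sketched below; the tiny model here is only a stand-in for the detection network, and the loop is shortened so the snippet runs quickly.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(8, 8), nn.ReLU(),
                      nn.Dropout(p=0.3),   # dropout rate 0.3, as stated
                      nn.Linear(8, 2))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)  # initial lr 10^-3
scheduler = torch.optim.lr_scheduler.StepLR(optimizer,
                                            step_size=400_000,  # 4 x 10^5
                                            gamma=0.1)

for step in range(1_000):  # 4 x 10^6 iterations in the patent
    x, y = torch.randn(4, 8), torch.randint(0, 2, (4,))
    loss = nn.functional.cross_entropy(model(x), y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()  # multiplies the learning rate by 0.1 every step_size steps
```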
Step three: network fine-tuning, which specifically comprises the following steps:
1) fine-tuning the network model obtained in step two with the 749 real images from ICDAR2011, ICDAR2013 and SVT provided in step one, applying data enhancement to the 749 real images, including random flipping, noise addition, blurring and similar operations;
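A sketch of these augmentations with torchvision follows; the flip probability, blur kernel and noise amplitude are assumptions, and box annotations would have to be transformed together with the image, which is omitted here.

```python
import torch
import torchvision.transforms as T

# Augmentation pipeline applied to a PIL training image.
augment = T.Compose([
    T.RandomHorizontalFlip(p=0.5),     # random flipping
    T.GaussianBlur(kernel_size=3),     # blurring
    T.ToTensor(),
    # additive Gaussian noise, clipped back to the valid range
    T.Lambda(lambda img: (img + 0.02 * torch.randn_like(img)).clamp(0, 1)),
])
```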
2) setting default boxes with 6 different aspect ratios on the different output layers, namely: 1, 2, 3, 5, 7 and 10;
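At a single feature-map location these ratios correspond to boxes like the following; the per-layer scale is an assumption, since the patent lists only the aspect ratios.

```python
# Default boxes of equal area whose width/height ratio runs over the six
# stated aspect ratios; the wide boxes suit long horizontal text lines.
ASPECT_RATIOS = (1, 2, 3, 5, 7, 10)

def default_boxes(cx, cy, scale):
    """Return (cx, cy, w, h) boxes centred at (cx, cy) with area scale**2."""
    return [(cx, cy, scale * ar ** 0.5, scale / ar ** 0.5)
            for ar in ASPECT_RATIOS]

for box in default_boxes(0.5, 0.5, 0.1):
    print(box)
```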
3) the detection layer uses a cascade (Inception) module to concatenate convolution kernels of different sizes, enlarging the receptive field of the network by increasing the network width and thereby handling the detection of texts with extreme aspect ratios;
4) setting the learning rate to 10^-5 and iterating 20000 times in total, optimizing with stochastic gradient descent during this process to obtain the final deep neural network model;
step four: and testing the learned network on a test set: in the step, the normalized test image is input into a network model, and the network output is the positioning result of the text and the score of the text classification.
Finally, the above embodiments are only intended to illustrate the technical solutions of the present invention and not to limit the present invention, and although the present invention has been described in detail with reference to the preferred embodiments, it will be understood by those skilled in the art that modifications or equivalent substitutions may be made on the technical solutions of the present invention without departing from the spirit and scope of the technical solutions, and all of them should be covered by the claims of the present invention.

Claims (3)

1. A regression-based multi-scale scene text detection method, characterized by specifically comprising the following steps:
S1: preparing sufficient training data annotated with text positions;
S2: constructing a feature extraction network comprising a bottom-up forward network process and a top-down feature fusion process, used to extract low-level, middle-level and high-level features from each training sample; wherein the bottom-up forward network comprises: an input module, first to fifth convolution modules, first to fifth pooling modules, a recurrent neural network module, sixth to tenth convolution modules and a sixth pooling module; the input module and the first to fifth convolution modules are cascaded with the first to fifth pooling modules in alternation, each convolution module being followed by its corresponding pooling module; the recurrent neural network module, the sixth to tenth convolution modules and the sixth pooling module are cascaded in sequence;
the convolution kernels of the first, second, third and fourth convolution modules are all 3 x 3 with stride 1 and padding 1; the pooling kernel of the fifth pooling module is 3 x 3 with stride 1 and padding 1, while the remaining pooling modules use 2 x 2 kernels with stride 2 and padding 0; the recurrent neural network module is a bidirectional Long Short-Term Memory recurrent neural network (BLSTM-RNN) with 256 hidden-layer units; the convolution kernel of the seventh convolution module is 1 x 1 with stride 1 and padding 0; the eighth to tenth convolution modules each comprise two convolution kernels, one of size 1 x 1 with stride 1 and padding 0, and the other of size 3 x 3 with stride 2 and padding 1;
S3: applying a cascade module to each feature layer fed into the detection layer;
the cascade module comprises four convolution branches connected in parallel:
the first convolution branch comprises one convolution kernel of size 3 x 3 with stride 1 and padding 1;
the second convolution branch comprises three convolution kernels: one of size 1 x 1 with stride 1 and padding 0, one of size 1 x 5 with stride 1 and padding 1, and one of size 5 x 1 with stride 1 and padding 1;
the third convolution branch comprises three convolution kernels: one of size 1 x 1 with stride 1 and padding 0, one of size 5 x 1 with stride 1 and padding 1, and one of size 1 x 5 with stride 1 and padding 1;
the fourth convolution branch comprises a pooling layer whose kernel is 3 x 3 with stride 1 and padding 1, followed by a convolution kernel of size 1 x 1 with stride 1 and padding 0;
every convolution kernel is followed by a BatchNorm module and a Rectified Linear Unit (ReLU) module;
S4: adopting a regression-based detection framework, setting default boxes suited to text characteristics, and detecting the text positions in the image.
2. The regression-based multi-scale scene text detection method according to claim 1, wherein in step S2, the top-down feature fusion refers to the fusion of high-level features with low-level features, and specifically comprises: the high level first obtains, through deconvolution, a feature map consistent with the size of the low level, followed by a Batch Normalization (BatchNorm) module; the low level is first passed through a convolution module with kernel size 1 x 1, stride 1 and padding 0, followed by a BatchNorm module; finally the two feature layers are fused by an element-wise product operation; the fused output serves as the output of the whole feature extraction network.
3. The regression-based multi-scale scene text detection method according to claim 1, wherein in step S3, the cascade module comprises an input feature map end and a feature map concatenation end, connected by four parallel convolution branches, each branch containing 1, 2 or 3 convolution modules.
CN201910838235.0A 2019-09-05 2019-09-05 Regression-based multi-scale scene text detection method Active CN110533041B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910838235.0A CN110533041B (en) 2019-09-05 2019-09-05 Regression-based multi-scale scene text detection method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910838235.0A CN110533041B (en) 2019-09-05 2019-09-05 Regression-based multi-scale scene text detection method

Publications (2)

Publication Number Publication Date
CN110533041A CN110533041A (en) 2019-12-03
CN110533041B true CN110533041B (en) 2022-07-01

Family

ID=68667081

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910838235.0A Active CN110533041B (en) 2019-09-05 2019-09-05 Regression-based multi-scale scene text detection method

Country Status (1)

Country Link
CN (1) CN110533041B (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
TWI689875B (en) * 2018-06-29 2020-04-01 由田新技股份有限公司 Defect inspection and classification apparatus and training apparatus using deep learning system
CN111259764A (en) * 2020-01-10 2020-06-09 中国科学技术大学 Text detection method and device, electronic equipment and storage device
CN111881943A (en) * 2020-07-08 2020-11-03 泰康保险集团股份有限公司 Method, device, equipment and computer readable medium for image classification
CN112287962B (en) * 2020-08-10 2023-06-09 南京行者易智能交通科技有限公司 Training method, detection method and device for multi-scale target detection model, and terminal equipment
CN113408525B (en) * 2021-06-17 2022-08-02 成都崇瑚信息技术有限公司 Multilayer ternary pivot and bidirectional long-short term memory fused text recognition method
CN115393868B (en) * 2022-08-18 2023-05-26 中化现代农业有限公司 Text detection method, device, electronic equipment and storage medium
CN116704248A (en) * 2023-06-07 2023-09-05 南京大学 Serum sample image classification method based on multi-semantic unbalanced learning


Family Cites Families (1)

Publication number Priority date Publication date Assignee Title
RU2691214C1 (en) * 2017-12-13 2019-06-11 Общество с ограниченной ответственностью "Аби Продакшн" Text recognition using artificial intelligence

Patent Citations (9)

Publication number Priority date Publication date Assignee Title
CN105631426A (en) * 2015-12-29 2016-06-01 中国科学院深圳先进技术研究院 Image text detection method and device
CN107688808A (en) * 2017-08-07 2018-02-13 电子科技大学 A kind of quickly natural scene Method for text detection
CN107578060A (en) * 2017-08-14 2018-01-12 电子科技大学 A kind of deep neural network based on discriminant region is used for the method for vegetable image classification
EP3534298A1 (en) * 2018-02-26 2019-09-04 Capital One Services, LLC Dual stage neural network pipeline systems and methods
CN108549893A (en) * 2018-04-04 2018-09-18 华中科技大学 A kind of end-to-end recognition methods of the scene text of arbitrary shape
CN108734169A (en) * 2018-05-21 2018-11-02 南京邮电大学 One kind being based on the improved scene text extracting method of full convolutional network
CN109086663A (en) * 2018-06-27 2018-12-25 大连理工大学 The natural scene Method for text detection of dimension self-adaption based on convolutional neural networks
CN109271967A (en) * 2018-10-16 2019-01-25 腾讯科技(深圳)有限公司 The recognition methods of text and device, electronic equipment, storage medium in image
CN109299274A (en) * 2018-11-07 2019-02-01 南京大学 A kind of natural scene Method for text detection based on full convolutional neural networks

Non-Patent Citations (4)

Title
Deep Direct Regression for Multi-oriented Scene Text Detection; Wenhao He et al.; 2017 IEEE International Conference on Computer Vision; 2017-12-25 *
Natural scene text detection and recognition based on deep learning (基于深度学习的自然场景文本检测与识别); Fang Qing; China Masters' Theses Full-text Database, Information Science and Technology; 2018-09-15 *
Multi-oriented scene text detection based on deep features (基于深度特征的多方向场景文字检测); Yang Xiaodong; China Masters' Theses Full-text Database, Information Science and Technology; 2019-07-15 *
Research on multi-oriented natural scene text extraction methods (多方向自然场景文本提取方法研究); Lei Qilun; China Masters' Theses Full-text Database, Information Science and Technology; 2019-04-15 *

Also Published As

Publication number Publication date
CN110533041A (en) 2019-12-03

Similar Documents

Publication Publication Date Title
CN110533041B (en) Regression-based multi-scale scene text detection method
CN111104898B (en) Image scene classification method and device based on target semantics and attention mechanism
CN110866140B (en) Image feature extraction model training method, image searching method and computer equipment
CN111210443B (en) Deformable convolution mixing task cascading semantic segmentation method based on embedding balance
CN109241982B (en) Target detection method based on deep and shallow layer convolutional neural network
CN114202672A (en) Small target detection method based on attention mechanism
CN111598183B (en) Multi-feature fusion image description method
CN109743642B (en) Video abstract generation method based on hierarchical recurrent neural network
CN111401293B (en) Gesture recognition method based on Head lightweight Mask scanning R-CNN
CN102385592B (en) Image concept detection method and device
CN110610210B (en) Multi-target detection method
CN112329760A (en) Method for recognizing and translating Mongolian in printed form from end to end based on space transformation network
CN111259940A (en) Target detection method based on space attention map
CN112784756B (en) Human body identification tracking method
CN112348036A (en) Self-adaptive target detection method based on lightweight residual learning and deconvolution cascade
CN114049381A (en) Twin cross target tracking method fusing multilayer semantic information
CN113569881A (en) Self-adaptive semantic segmentation method based on chain residual error and attention mechanism
Li et al. Learning hierarchical video representation for action recognition
CN112668492A (en) Behavior identification method for self-supervised learning and skeletal information
CN110852199A (en) Foreground extraction method based on double-frame coding and decoding model
CN114612832A (en) Real-time gesture detection method and device
CN112507904A (en) Real-time classroom human body posture detection method based on multi-scale features
CN116610778A (en) Bidirectional image-text matching method based on cross-modal global and local attention mechanism
Sun et al. Video understanding: from video classification to captioning
CN117197632A (en) Transformer-based electron microscope pollen image target detection method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant