CN116229482A - Visual multi-mode character detection recognition and error correction method in network public opinion analysis

Visual multi-mode character detection recognition and error correction method in network public opinion analysis

Info

Publication number
CN116229482A
Authority
CN
China
Prior art keywords
text
public opinion
image
information
network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310283922.7A
Other languages
Chinese (zh)
Inventor
魏富鹏
刘星
郑秋生
乔亚琼
姜维
陈紫薇
张政
牛利月
刘济宗
王尚首
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhongyuan University of Technology
North China University of Water Resources and Electric Power
Original Assignee
Zhongyuan University of Technology
North China University of Water Resources and Electric Power
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhongyuan University of Technology, North China University of Water Resources and Electric Power filed Critical Zhongyuan University of Technology
Publication of CN116229482A

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 - Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 - Character recognition
    • G06V30/19 - Recognition using electronic means
    • G06V30/191 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Clustering techniques; Blind source separation
    • G06V30/19173 - Classification techniques
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/20 - Natural language analysis
    • G06F40/232 - Orthographic correction, e.g. spell checking or vowelisation
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/20 - Natural language analysis
    • G06F40/237 - Lexical tools
    • G06F40/242 - Dictionaries
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/40 - Extraction of image or video features
    • G06V10/42 - Global feature extraction by analysis of the whole pattern, e.g. using frequency domain transformations or autocorrelation
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 - Scenes; Scene-specific elements
    • G06V20/40 - Scenes; Scene-specific elements in video content
    • G06V20/46 - Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 - Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 - Character recognition
    • G06V30/18 - Extraction of features or characteristics of the image
    • G06V30/1801 - Detecting partial patterns, e.g. edges or contours, or configurations, e.g. loops, corners, strokes or intersections
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 - Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • General Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Medical Informatics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Databases & Information Systems (AREA)
  • Character Discrimination (AREA)

Abstract

The invention provides a visual multi-mode character detection, recognition and error correction method in network public opinion analysis, which comprises the following steps: labeling the characters in network public opinion related data to construct a data set; extracting image features of public opinion images in the data set and image features of key frames in videos, and encoding both sets of image features into multi-mode feature coding information; detecting the multi-mode feature coding information with a text detection module, and converting the located character sequences into text with a text recognition module; correcting the obtained text information with a public opinion field word stock and a Transformer network to obtain a text error correction model; training the text error correction model and using it to correct the recognized text information. The invention can effectively mine the context information of Chinese text lines in images and videos, achieving more accurate character extraction from public opinion multi-mode data; the character recognition results are then appropriately corrected, so that the expected effect is better achieved.

Description

Visual multi-mode character detection recognition and error correction method in network public opinion analysis
Technical Field
The invention relates to the technical field of multi-mode character detection and recognition, and in particular to a visual multi-mode character detection, recognition and error correction method in network public opinion analysis.
Background
With the advance of information technology, image and video data from platforms such as Toutiao, Douyin (TikTok) and YouTube play an increasingly important role as carriers of network information. Detecting and recognizing text in images and videos is necessary to meet the supervision requirements of the network public opinion field, and text detection, recognition and error correction remain research difficulties of multi-mode information extraction in this field.
Traditional video or image text detection and recognition technology mainly consists of two modules: a text detection algorithm and a text recognition algorithm. In the field of deep learning, text detection algorithms can be classified into regression-based and segmentation-based algorithms. According to the text position in the scene, regression-based text detection methods are further divided into horizontal text detection algorithms (CTPN, TextBoxes, etc.), oblique text detection algorithms (EAST, MOST, etc.) and curved text detection algorithms (CTD, LOMO, etc.). Segmentation-based text detection algorithms mainly include PAN, SegLink++, DB, PSENet, etc. For curved text, regression-based algorithms are limited by the shape of the anchor boxes: they can only regress rectangular boxes around the target, cannot detect text instances of arbitrary shape, and ultimately have difficulty producing a smooth text envelope curve. Character recognition is a key subtask of image or video information extraction; the traditional character recognition method is mainly the classical CRNN algorithm, but its recognition accuracy on characters in video is low, which limits its application scenarios. The technology has since found wider application, such as handwriting recognition, license plate recognition, bank card recognition and bill recognition, but with the continuous development of the technical level, existing research on Chinese text detection and recognition in videos and images cannot meet people's supervision requirements in the network public opinion field.
At present, most traditional methods only detect and recognize characters in either images or videos, their speed is low, and the corresponding model can only be used in a single image or video scene. In today's social networks, against the background of the post-pandemic era, the scale of visual data such as short videos, animated images and pictures is growing exponentially, and highly sensitive information such as pornography, violence and reactionary content spreads easily on the network. It is therefore urgent to design a universal visual multi-mode character detection and recognition model.
Disclosure of Invention
Aiming at the technical problems that existing character detection and recognition methods can only target a single image or video scene and are slow, the invention provides a visual multi-mode character detection, recognition and error correction method in network public opinion analysis. Based on an improved PP-OCR visual multi-mode character detection and recognition method combined with deep learning techniques, it realizes multi-mode character recognition and correction; image and video data are fused together as input to train a character detection model, so that visual multi-mode character detection and recognition achieves a good effect in the network public opinion analysis scene. The public opinion data set covers terrorism, riots, gambling, abuse and the like, and appropriate character correction is performed on the character recognition results, so that a good effect is achieved.
In order to achieve the above purpose, the technical scheme of the invention is realized as follows: a visual multi-mode character detection, identification and error correction method in network public opinion analysis comprises the following steps:
(1) Labeling the characters in the related data of the network public opinion to construct a data set;
(2) Multimodal data processing: extracting image features of public opinion images in the data set by an image feature extractor, extracting image features of key frames in videos in the data set by a video feature extractor, and respectively carrying out information coding on the image features of the public opinion images and the image features of the key frames in the videos by adopting a multi-mode subspace coding method to obtain multi-mode feature coding information;
(3) Training of character detection and recognition: the text detection module detects the multi-mode feature coding information to obtain the located text line positions or single character positions on images and on key frame images of videos, and the text recognition module converts the located character sequences into text through a transcription layer using a connectionist temporal classification (CTC) algorithm;
(4) Character error correction training: constructing a word stock of the public opinion field, and correcting the text information obtained from images or video key frames by adopting the public opinion field word stock and a Transformer network to obtain a text error correction model;
(5) Training a text error correction model through a data set, and correcting the identified text information by combining a public opinion lexicon.
Preferably, the data set consists of the RWF2000 violent video data set and a constructed gambling, violence and abuse picture data set; the characters are labeled with the PPOCRLabel semi-automatic image labeling tool, which either labels images automatically to produce annotation results enclosed by four points, or supports manual labeling: the four-point labeling mode is selected and four points are clicked in sequence as required to form a four-point labeled picture.
Preferably, the image feature extractor adopts a pre-trained VGG19 convolutional neural network and takes the output of its last layer as the extracted public opinion image feature; a fully connected layer is added to the VGG19 convolutional neural network, the VGG19 network is used to extract the underlying features of the public opinion image, and the finally output feature vector serves as the public opinion image feature:

f_v = σ(W · VGG19(V_1) + b);

where f_v is the public opinion image feature, W and b are the parameters of the fully connected layer, σ is the sigmoid activation function, V_1 represents the multi-source data feature, and VGG19(·) represents one-dimensional flattening of the feature map produced by the VGG19 convolutional neural network;

the video feature extractor adopts a ResNet101 network model: each picture in the key frame sequence M = {m_1, m_2, ..., m_n} is scaled to 225 x 225, and the image features of the key frames are extracted through the ResNet101 network model and a fully connected layer to obtain the image feature set L = {l_1, l_2, ..., l_n}, where m_1, m_2, ..., m_n are the key frame images of a public opinion video, l_1, l_2, ..., l_n represent the image features extracted from each key frame of the video after passing through the ResNet101 network model, and n represents the total number of key frames extracted from the video; a self-attention mechanism is adopted to extract the semantic associations between image features, i.e. an attention module supervises the contextual semantic information of the image features;

the multi-mode subspace coding method maps the multi-mode public opinion data feature representations, comprising the public opinion image feature f_v and the key frame image feature set L of the public opinion video, into a shared sub-semantic space; assuming a common representation H of the image features shared by the different modes, the image feature X^(v) of each sample under each view is reconstructed through a group of mappings, giving the relation X^(v) = C_v(H), where v denotes any view modal feature and C_v denotes the reconstruction mapping function corresponding to view v; the information coding of the different available views is:

p(X|H) = p(X^(1)|H) p(X^(2)|H) ... p(X^(V)|H);

where p(X|H) is the multi-mode coding information of the different available views, p(X^(v)|H) represents the coding information of any single view modal feature, and V represents the total number of different learning views.
Preferably, the multi-mode feature coding information is fed through the OpenCV, Media SDK and processing stages of the OpenVINO inference module for accelerated inference, and the features of the multi-mode feature coding information are extracted and optimized; the multi-mode data is encoded/decoded and preprocessed with the picture and video processing toolkits provided alongside the model optimizer, and character detection and character recognition are performed on the processed result.
Preferably, the text detection module is an optimized MobileNetV3 model, which serves as the lightweight backbone network of the differentiable binarization network; this backbone can automatically adjust the threshold. The optimized MobileNetV3 model adopts an automatically learned threshold strategy:

B̂_{i,j} = 1 / (1 + e^(-k(P_{i,j} - T_{i,j})));

where B̂_{i,j} is the quantity that lets the DB binary map module learn text segmentation end-to-end during training, T_{i,j} is the pixel value at coordinates (i, j) on the adaptive threshold map learned by the optimized MobileNetV3 model during training, k represents the amplification factor, and P_{i,j} is the pixel value at coordinates (i, j) on the probability map P;

the text recognition module improves the LK-PAN module of the original text detection algorithm, enlarging the original convolution kernel size to 12 x 12; the text recognition module adopts an existing Transformer network to effectively mine the context information of text line images.
Preferably, the optimized MobileNetV3 model uses a ResNet-18 network to extract feature information from the public opinion image input, and a multi-scale feature map is obtained through up-sampling with 3*3 convolution layers; on the feature map, the fusion module aggregates the 1/8, 1/16 and 1/32 feature information of the input image or video with the 1/4 feature map, and the complete information of the text is described through the predicted threshold map T and probability map P; the probability map P and the threshold map T are obtained from the 1/4-size feature map through a series of convolutions and transposed convolutions, and the DB method performs binarization on the probability map P and the threshold map T; binarization with a fixed threshold yields an approximate binarization map, text boxes are obtained from the approximate binarization map, and the located text boxes of the image or video key frame are obtained as the input of the text recognition module;

the Transformer network in the character recognition module consists of 1 linear mapping layer and 1 iterable multi-head attention layer, with the following calculation formulas:

X = Concat(X_class, X_1, ..., X_N);

X_l = MHA(LP(X_{l-1})) + X_{l-1}, l ∈ [1, L];

Z_class = LN(MLP(X_L(0)) + X_L(0));

where Concat(·) represents the cascade operation in the vertical direction, X_class, X_1, ..., X_N represent the character sequence obtained from text detection, and X represents the aggregated sequence feature vector; MHA(·) represents the multi-head attention mechanism calculation function, LP(·) represents the linear mapping layer, and X_{l-1} and X_l respectively represent the attention matrix of the previous layer l-1 and of the current layer l; LN(·) represents the layer normalization function; after L iterations the attention matrix X_L is output, and the feature vector X_L(0) of the first row is taken as the input of the following multi-layer perceptron (MLP), finally outputting the key frame attention feature Z_class, with L representing the number of layers of the Transformer network.
Preferably, a deep mutual learning strategy is added to the text detection module, and the text lines or single character information located in the recognition image are converted into text information;

a GTC strategy, a UIM strategy, a TextRotNet module and a data enhancement (TextConAug) strategy for mining text context information are added to the text recognition module; the GTC strategy uses an existing attention module to guide the training of the CTC model and fuses the features of multiple modes, improving text recognition accuracy; the UIM strategy predicts on unlabeled images to obtain pseudo labels, and samples with high prediction probability are taken as training data; the TextRotNet module is a self-supervised pre-training model that initializes the weights and bias factors of the MobileNetV3 model; the TextConAug strategy is applied to the supervised learning task to enrich the context information of the training data.
Preferably, the construction method of the public opinion field word stock is as follows: websites related to violence, gambling and abuse are crawled and cleaned using Python crawler technology, the crawled data is labeled at the word level, 3308 entities and the corresponding 8632 public opinion texts are collated, and a network public opinion text error correction task dictionary is constructed; the function of the network public opinion text error correction task dictionary is that the text error correction code can combine the dictionary to judge and correct the text recognition results.
Preferably, the error correction method is as follows: predict the word at position t in a sequence y_1, y_2, ..., y_T, given the source sentence x_1, x_2, ..., x_N:

h_1^src, ..., h_N^src = encoder(L_src[x_1, ..., x_N]);

h_t = decoder(L_trg[y_{t-1}, ..., y_1], h_1^src, ..., h_N^src);

p_t(w) = softmax(L_trg h_t);

where the matrix L ∈ R^{dx×|V|} is a word embedding matrix, dx represents the dimension of the word embedding, and V represents the vocabulary; L_src represents the source sentence x_1...x_N as embedding vectors, and encoder(·) encodes the sentence sequence x_1...x_N; h_1^src, ..., h_N^src are the hidden states after encoding; y_{t-1...1} represents the predicted sequence, L_trg represents y_{t-1...1} as embedding feature vectors, decoder(·) decodes the hidden states h_1^src, ..., h_N^src, and h_t is the hidden state of the next word; applying the softmax function to the target hidden state h_t and the word embedding matrix L_trg yields the corrected sentence p_t(w).
Preferably, the bidirectional Transformer employs a pre-training model;

the image information in the image or video key frame passes in turn through three sub-functions: text detection, picture preprocessing, and searching and screening of text regions; the image features are respectively preprocessed, binarized and screened for text region box coordinates, and the text information is obtained through recognition.
Compared with the prior art, the invention has the following beneficial effects: different from the traditional single-type data problem, the invention provides a universal model applicable to both image and video data, which can process fused image and video data as model input, thereby realizing visual multi-mode character detection, recognition and error correction. The innovations of the invention are as follows:
1. In terms of character detection and recognition, aiming at the insufficient precision of character extraction in existing videos and images, the invention provides a visual multi-mode character detection and recognition method based on improved PP-OCR. Different from the original PP-OCR method, first, the improved character detection model increases the original convolution kernel size of the large-kernel pixel aggregation network (Large Kernel Pixel Aggregation Network, LK-PAN), enlarging the receptive field covered at each position of the feature map; second, the character recognition method is improved, so that the context information of Chinese text lines in images and/or videos can be effectively mined, achieving more accurate character extraction from public opinion multi-mode data and solving the problem of low character detection and recognition precision.
2. In terms of text correction, aiming at the problems that the text correction task in the network public opinion field lacks a sufficient labeled corpus and that general text correction models struggle to handle network public opinion text effectively, the invention provides a text correction method based on a network public opinion field word stock.

The invention utilizes multi-mode data feature representation information and combines the improved PP-OCR method with natural language processing algorithms to effectively extract single character information or text line information from images or video key frames, overcoming problems such as bending, occlusion and blurring in detection and recognition; a text error correction method based on a network public opinion word stock is designed to further refine the recognized results and improve recognition performance.
Drawings
In order to more clearly illustrate the embodiments of the invention or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described, it being obvious that the drawings in the following description are only some embodiments of the invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a sample single image used in the present invention.
Fig. 2 is a format of text labels in the single image shown in fig. 1.
Fig. 3 is a schematic diagram of the optimized MobileNetV3 model architecture according to the present invention.
Fig. 4 is a flow chart of the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without any inventive effort, are intended to be within the scope of the invention.
A visual multi-mode character detection, identification and error correction method in network public opinion analysis comprises the following steps:
step 1: and setting up an environment, and labeling the characters in the related data of the network public opinion to construct a data set.
The hardware device is a graphics processing unit (GPU) server with an NVIDIA Tesla P100 and 24G of memory; the deep learning framework is PyTorch and the programming language is Python 3.8. A Conda virtual running environment is created on the GPU server, and the environments required for running the code are installed in turn. The data set consists of the RWF2000 violent video data set and a self-built gambling, violence and abuse picture data set, 16098 pictures in total. The labeling format of the constructed data set is shown in Fig. 1 and Fig. 2; the characters are labeled with the PPOCRLabel semi-automatic image labeling tool, which can automatically label pictures to produce annotation results enclosed by four points. Labels can also be drawn manually: the user first selects the four-point labeling mode, then clicks four points in sequence as required, and on completion a four-point labeled picture is formed. The labeled data set serves as the input for model training.
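For illustration only, the sketch below shows how a four-point annotation record of the kind produced by PPOCRLabel might be parsed into training samples. The file layout and field names ("transcription", "points") follow the commonly used PPOCRLabel label format and are assumptions here; the authoritative format is the one shown in Figs. 1 and 2.

```python
import json

# A hypothetical PPOCRLabel-style record: "<image path>\t<JSON list of boxes>".
# The image path and text content below are placeholders, not data from the patent.
sample_line = (
    'imgs/gamble_0001.jpg\t'
    '[{"transcription": "示例文本", "points": [[10, 20], [210, 20], [210, 60], [10, 60]]}]'
)

def parse_label_line(line: str):
    """Split one annotation line into (image_path, four-point box, text) samples."""
    image_path, boxes_json = line.rstrip("\n").split("\t", 1)
    samples = []
    for box in json.loads(boxes_json):
        points = box["points"]          # four (x, y) vertices enclosing the text
        text = box["transcription"]     # ground-truth text of the region
        assert len(points) == 4, "four-point annotation expected"
        samples.append((image_path, points, text))
    return samples

print(parse_label_line(sample_line))
```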
Step 2: multimodal data processing: extracting image features of public opinion images in the data set by an image feature extractor, extracting image features of key frames in videos in the data set by a video feature extractor, and carrying out information coding on the image features of the public opinion images and the image features of the key frames in the videos by adopting a multi-mode subspace coding method to obtain multi-mode feature coding information.
As shown in Stage 1 of Fig. 4, based on existing multi-mode feature extraction and fusion technology, the video and image multi-source data of the network public opinion supervision scene are taken as the original data input. First, a pre-trained VGG19 convolutional neural network is adopted as the image feature extractor, and the output of the last layer of the VGG19 convolutional neural network is taken as the extracted public opinion image feature; meanwhile, a fully connected layer is added to the VGG19 convolutional neural network, and the VGG19 network is used to directly extract the underlying features of the public opinion image, avoiding the information bias caused by the differing dimensionalities of the modes; the feature vector is finally output. The public opinion image feature is expressed as:

f_v = σ(W · VGG19(V_1) + b)   (1)

where f_v is the public opinion image feature, W and b are the parameters of the fully connected layer, σ is the sigmoid activation function that activates the output of the fully connected layer, V_1 represents the multi-source data feature, and VGG19(·) represents one-dimensional flattening of the last convolution-pooled feature map of the VGG19 convolutional neural network.
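A minimal PyTorch sketch of the image branch of formula (1) follows, assuming torchvision's pretrained VGG19 stands in for the pre-trained network; the 512-dimensional output size of the added fully connected layer and the 224 x 224 input size are assumptions for illustration.

```python
import torch
import torch.nn as nn
from torchvision import models

class ImageFeatureExtractor(nn.Module):
    """f_v = sigmoid(W * VGG19(V_1) + b): flatten the last VGG19 feature map,
    then project it with one added fully connected layer (formula (1))."""
    def __init__(self, out_dim: int = 512):              # out_dim is an assumption
        super().__init__()
        vgg = models.vgg19(weights="IMAGENET1K_V1")       # pre-trained backbone
        self.backbone = vgg.features                      # convolution/pooling stack
        self.fc = nn.Linear(512 * 7 * 7, out_dim)         # W, b of the added FC layer

    def forward(self, v1: torch.Tensor) -> torch.Tensor:
        feat = self.backbone(v1)                          # last conv-pooled feature map
        feat = torch.flatten(feat, start_dim=1)           # one-dimensional flattening
        return torch.sigmoid(self.fc(feat))               # sigma activation -> f_v

f_v = ImageFeatureExtractor()(torch.randn(1, 3, 224, 224))
```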
The video feature extractor adopts a ResNet101 network model, which can extract the image features of the public opinion video key frames: each picture in the key frame sequence M = {m_1, m_2, ..., m_n} is scaled to 225 x 225, and the image features of the key frames are extracted through the ResNet101 network model and a fully connected layer to obtain the image feature set L = {l_1, l_2, ..., l_n}, where m_1, m_2, ..., m_n are the key frame images of a public opinion video, l_1, l_2, ..., l_n represent the image features extracted from each key frame of the video after passing through the ResNet101 network model, and n means that n key frames in total are extracted from the video. However, the image feature set L contains implicit features and the relations between images have not been mined, so a self-attention mechanism (Self-Attention) is then adopted to extract the semantic associations between image features, i.e. the attention module supervises the contextual semantic information of the image features, purposefully reinforcing and prominently learning the important features and thereby further improving the generalization and learning capability of the model. The ResNet101 network model converts the key frame image features of the public opinion videos into a cognitive matrix that the computer can process.
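A sketch of the video branch under similar assumptions: torchvision's ResNet101 produces per-key-frame features l_1, ..., l_n, and a single self-attention layer relates them across the key-frame sequence. Layer sizes and the number of attention heads are illustrative choices, not values given by the invention.

```python
import torch
import torch.nn as nn
from torchvision import models

class KeyFrameFeatureExtractor(nn.Module):
    """Map a key-frame sequence M = {m_1,...,m_n} to features L = {l_1,...,l_n},
    then relate them with self-attention (sizes are assumptions)."""
    def __init__(self, out_dim: int = 512, num_heads: int = 8):
        super().__init__()
        resnet = models.resnet101(weights="IMAGENET1K_V1")
        resnet.fc = nn.Linear(resnet.fc.in_features, out_dim)    # added FC layer
        self.backbone = resnet
        self.attn = nn.MultiheadAttention(out_dim, num_heads, batch_first=True)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (n, 3, 225, 225) key frames scaled as described in the text
        l = self.backbone(frames)                 # (n, out_dim) per-frame features
        l = l.unsqueeze(0)                        # (1, n, out_dim) as one sequence
        ctx, _ = self.attn(l, l, l)               # attend over the key-frame context
        return ctx.squeeze(0)                     # contextualised key-frame features

feats = KeyFrameFeatureExtractor()(torch.randn(6, 3, 225, 225))
```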
The multi-mode subspace coding method maps the multi-mode public opinion data feature representations, comprising the public opinion image feature f_v and the key frame image feature set L of the public opinion video, into a shared sub-semantic space. Assuming a common representation H of the image features shared by the different modes, the image feature X^(v) of each sample under each view can be reconstructed through a group of mappings, giving the relation X^(v) = C_v(H), where v denotes any view modal feature and C_v denotes the reconstruction mapping function corresponding to view v. The different available views are then effectively information-coded, with the calculation formula:

p(X|H) = p(X^(1)|H) p(X^(2)|H) ... p(X^(V)|H)   (2)

where p(X|H) is the multi-mode coding information of the different available views, p(X^(v)|H) represents the coding information of any single view modal feature, and V represents the total number of different learning views.

The multi-mode subspace coding method can further learn the complementarity and consistency information among the multiple modes to obtain a more complete multi-mode data feature representation, and the coding information under the different views is taken as the input of the text detection model MobileNetV3.
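The reconstruction mapping X^(v) = C_v(H) can be sketched as one small decoder per view over a shared code H. The linear decoders and the mean-squared reconstruction objective below are assumptions about how such a mapping might be trained; the patent does not specify the form of C_v.

```python
import torch
import torch.nn as nn

class SubspaceCoder(nn.Module):
    """Shared representation H with one reconstruction head C_v per view,
    i.e. X^(v) = C_v(H); decoder shapes and the MSE objective are assumptions."""
    def __init__(self, view_dims, shared_dim: int = 256):
        super().__init__()
        self.decoders = nn.ModuleList(nn.Linear(shared_dim, d) for d in view_dims)

    def forward(self, h: torch.Tensor):
        return [dec(h) for dec in self.decoders]   # one reconstruction per view

    def reconstruction_loss(self, h, views):
        # Maximising p(X|H) = prod_v p(X^(v)|H) under a Gaussian assumption
        # amounts to summing per-view squared reconstruction errors.
        return sum(nn.functional.mse_loss(rec, x)
                   for rec, x in zip(self.forward(h), views))

coder = SubspaceCoder(view_dims=[512, 512])            # image view + video view
h = torch.randn(4, 256)                                # shared representation H
loss = coder.reconstruction_loss(h, [torch.randn(4, 512), torch.randn(4, 512)])
```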
Step 3: training of character detection and recognition: the text detection model MobileNetV3 detects the information-coded data to obtain the located text line positions or single character positions on images and on key frame images of public opinion videos, and the text recognition module converts the predicted character sequences into text through a transcription layer using a connectionist temporal classification (CTC) algorithm.
Stage 1 in Fig. 4 processes the multi-mode data to obtain the multi-mode feature coding information. The main operations are: image features are extracted from the picture data with the VGG19 neural network, and video features are extracted from the video data with ResNet101 and Self-Attention. Stage 2 is the model configuration, mainly comprising setting the Model Optimizer, setting the inference acceleration engine, and executing the Inference Engine. In Stage 3, the multi-mode feature coding information passes in turn through the OpenCV, Media SDK and processing stages of the OpenVINO inference module for accelerated inference, and its features are extracted and optimized; these operations accelerate the reading and inference of the text detection and recognition models on the input images and videos. Stage 3 also post-processes the results from the optimizer, with the aim of jointly optimizing training and inference: the picture and video processing toolkits provided alongside the model optimizer (OpenCV and Media SDK) perform encoding and decoding operations on the multi-mode data. Character detection and character recognition are then performed on the encoded and decoded result to obtain text information. The information after character recognition is corrected with the character correction model of Stage 5 to improve text accuracy and obtain the final output information.
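A minimal sketch of how the Stage 2/Stage 3 accelerated inference might be wired with the OpenVINO Python runtime is given below. The model path, input size and device name are placeholders; the Core/read_model/compile_model call sequence is the standard OpenVINO 2022+ API rather than anything mandated by the invention.

```python
import cv2
import numpy as np
from openvino.runtime import Core   # OpenVINO inference engine bindings

core = Core()
model = core.read_model("det_model.xml")          # placeholder IR model path
compiled = core.compile_model(model, "CPU")       # device name is an assumption

def infer_frame(frame_bgr: np.ndarray) -> np.ndarray:
    """Pre-process one image/key frame with OpenCV and run accelerated inference."""
    blob = cv2.resize(frame_bgr, (960, 960)).astype(np.float32) / 255.0
    blob = blob.transpose(2, 0, 1)[None]          # HWC -> NCHW
    return compiled([blob])[compiled.output(0)]   # e.g. a text probability map

result = infer_frame(np.zeros((720, 1280, 3), np.uint8))
```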
In the text detection training stage, the multi-mode coding information of the different available views is first taken as the input of the text detection model MobileNetV3; the multi-mode data features pass through the text detection model MobileNetV3 to finally obtain the located text line positions or single character positions on images and key frame images, which are taken as the input of the text recognition module; the text recognition module converts the predicted character sequences into text through a transcription layer using the CTC algorithm. Meanwhile, the text recognition module improves the LK-PAN module of the original text detection algorithm, enlarging the original convolution kernel size to 12 x 12 and improving the performance of text detection on extreme distortions and aspect ratios. The text detection module adopts an optimized MobileNetV3 model, which serves as the lightweight backbone network of the differentiable binarization network (Differentiable Binarization Networks, DBNet); this backbone can automatically adjust the threshold, which in experiments improves detection precision and simplifies the time-consuming post-processing. The structure of the optimized MobileNetV3 model is shown in Fig. 3. The task of the text recognition training stage is to convert the text lines or single character information located in the recognition image into text information. Meanwhile, the text recognition module adopts an existing Transformer network (an existing model), which can effectively mine the context information of text line images and solves the problem of low text recognition precision.
Aiming at the time-consuming subsequent processing caused by threshold binarization, the optimized MobileNetV3 model adopts an automatically learned threshold strategy, with the multi-mode subspace coding information as input:

B̂_{i,j} = 1 / (1 + e^(-k(P_{i,j} - T_{i,j})))   (3)

where B̂_{i,j} is the quantity that lets the DB binary map module in Fig. 3 learn text segmentation end-to-end during training, T_{i,j} is the adaptive threshold learned by the optimized MobileNetV3 model during training, k represents the amplification factor, P_{i,j} represents a probability map pixel, and (i, j) represents the coordinates of a pixel on the image. The optimized MobileNetV3 model uses a ResNet-18 network to extract feature information from the public opinion image input, with the aim of obtaining a multi-scale feature map through 3*3 convolution and up-sampling. On the feature map, the fusion module then aggregates the 1/8, 1/16 and 1/32 feature information of the input image or video with the 1/4 feature layer, and describes the complete information of the text through the predicted threshold map and probability map. The 1/4-size feature map passes through a series of convolution and transposed convolution mechanisms to generate a probability map P and a threshold map T of the same size as the original image, and the DB method performs binarization on P and T: the feature outputs aggregated by the fusion module are brought to the same size by up-sampling and cascaded to generate a new feature map F, from which the probability map P and the threshold map T are predicted, finally yielding an approximate binary map. Thus T_{i,j} represents the pixel value at coordinates (i, j) on the threshold map T, and P_{i,j} represents the pixel value at coordinates (i, j) on the probability map P. In the post-processing stage, binarization with a fixed threshold yields the approximate binarization map, text boxes are then obtained from the approximate binarization map, and finally the located text boxes of the image or video key frames are obtained as the input of the text recognition module; the architecture of the optimized MobileNetV3 model is shown in Fig. 3. The Feature Map in Fig. 3 is mainly used to obtain multi-scale features so as to recognize multi-scale text information in images or videos.
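The adaptive-threshold strategy of formula (3) and the fixed-threshold post-processing step can be sketched directly as array operations. The amplification factor k = 50 and the fixed threshold 0.3 below follow common practice for DB-style detectors and are assumptions, not values specified by the invention.

```python
import numpy as np

def differentiable_binarization(P: np.ndarray, T: np.ndarray, k: float = 50.0):
    """Approximate binary map B_hat[i,j] = 1 / (1 + exp(-k * (P[i,j] - T[i,j]))),
    i.e. formula (3): the threshold map T is learned jointly with the probability map P."""
    return 1.0 / (1.0 + np.exp(-k * (P - T)))

def fixed_threshold_binarization(P: np.ndarray, thresh: float = 0.3):
    """Post-processing: binarize with a fixed threshold to get the approximate
    binarization map from which text boxes are extracted (threshold value assumed)."""
    return (P > thresh).astype(np.uint8)

P = np.random.rand(1, 160, 160)   # probability map predicted from the 1/4-size features
T = np.random.rand(1, 160, 160)   # learned adaptive threshold map
B_hat = differentiable_binarization(P, T)
B_fixed = fixed_threshold_binarization(P)
```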
A deep mutual learning (Deep Mutual Learning, DML) strategy is added to the text detection module in Stage 4 of Fig. 4, with the aim of converting the text lines or single character information located in the recognition image into text information. Meanwhile, the Transformer network in the character recognition module mainly consists of 1 linear mapping layer and 1 iterable multi-head attention layer, with the following calculation formulas:

X = Concat(X_class, X_1, ..., X_N)   (4)

X_l = MHA(LP(X_{l-1})) + X_{l-1}, l ∈ [1, L]   (5)

Z_class = LN(MLP(X_L(0)) + X_L(0))   (6)

In formula (4), Concat(·) represents the cascade operation in the vertical direction, X_class, X_1, ..., X_N represent the character sequence obtained from text detection, and X represents the aggregated sequence feature vector. In formula (5), MHA(·) represents the multi-head attention mechanism calculation function, LP(·) represents the linear mapping layer, and X_{l-1} and X_l represent the attention matrix of the previous layer l-1 and of the current layer l, respectively. In formula (6), LN(·) represents the layer normalization function; when formula (6) is iterated L times, the attention matrix X_L is output, with L being the number of layers of the Transformer network, and the feature vector X_L(0) of the first row is taken as the input of the following multi-layer perceptron (MLP), finally outputting the key frame attention feature Z_class. The context information of text line images is thereby mined effectively.
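A sketch of formulas (4)-(6) follows: the detected character sequence is prefixed with a class token, passed through L rounds of linear projection plus multi-head attention with residual connections, and the first-row feature is layer-normalized after an MLP. Dimensions, depth and head count are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SequenceAggregator(nn.Module):
    """Formulas (4)-(6): X = Concat(X_class, X_1, ..., X_N), then L rounds of
    X_l = MHA(LP(X_{l-1})) + X_{l-1}, finally Z_class = LN(MLP(X_L(0)) + X_L(0))."""
    def __init__(self, dim: int = 256, heads: int = 8, depth: int = 4):
        super().__init__()
        self.cls = nn.Parameter(torch.zeros(1, 1, dim))            # X_class token
        self.lp = nn.ModuleList(nn.Linear(dim, dim) for _ in range(depth))
        self.mha = nn.ModuleList(nn.MultiheadAttention(dim, heads, batch_first=True)
                                 for _ in range(depth))
        self.mlp = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
        self.ln = nn.LayerNorm(dim)

    def forward(self, chars: torch.Tensor) -> torch.Tensor:
        # chars: (B, N, dim) character-sequence features from text detection
        x = torch.cat([self.cls.expand(chars.size(0), -1, -1), chars], dim=1)
        for lp, mha in zip(self.lp, self.mha):                     # l = 1..L
            q = lp(x)                                              # LP(X_{l-1})
            x = mha(q, q, q)[0] + x                                # residual MHA
        x0 = x[:, 0]                                               # X_L(0), first row
        return self.ln(self.mlp(x0) + x0)                          # Z_class

z_class = SequenceAggregator()(torch.randn(2, 30, 256))
```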
In addition, to improve text recognition precision, the invention adds a GTC (Guided Training of CTC) strategy, a UIM (Unlabeled Images Mining) strategy, a TextRotNet module and a data enhancement (TextConAug) strategy for mining text context information to the text recognition module. The GTC strategy uses an existing attention module to guide the training of the CTC model and fuses the features of multiple modes to improve text recognition accuracy; the UIM strategy predicts on unlabeled images to obtain pseudo labels and takes samples with high prediction probability as training data to enhance the recognition effect. The TextRotNet module is a self-supervised pre-training model, used in the invention to initialize the weights and bias factors of the MobileNetV3 model so as to improve character recognition. The TextConAug strategy is applied to the supervised learning task to enrich the context information of the training data.
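The UIM strategy can be sketched as confidence-based pseudo-label filtering. The 0.95 confidence threshold and the (text, probability) recognizer interface below are assumptions for illustration, not parameters given by the invention.

```python
import torch

def mine_unlabeled_images(recognizer, unlabeled_images, conf_threshold: float = 0.95):
    """UIM sketch: run the current recognizer on unlabeled images and keep only
    high-confidence predictions as pseudo-labeled training data (threshold assumed)."""
    pseudo_labeled = []
    with torch.no_grad():
        for image in unlabeled_images:
            text, confidence = recognizer(image)      # assumed (text, prob) interface
            if confidence >= conf_threshold:
                pseudo_labeled.append((image, text))  # high-probability sample kept
    return pseudo_labeled

# toy usage with a dummy recognizer that always returns a fixed prediction
dummy_recognizer = lambda img: ("示例", 0.99)
kept = mine_unlabeled_images(dummy_recognizer, [object(), object()])
```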
Step 4: character error correction training: a public opinion field word stock is constructed, and a bidirectional Transformer network combined with the public opinion field word stock is adopted to correct the text information in the images or video key frames obtained by the text recognition module, yielding a text error correction model. The bidirectional Transformer adopts a pre-training model, which has the advantage that, with only a small amount of labeled data, a high-precision error correction effect can be achieved by fine-tuning the model.
As shown in Stage 5 of Fig. 4, the existing text error correction task in the network public opinion field lacks a sufficient labeled corpus, which causes errors in the text error correction process in the public opinion field. The invention therefore uses Python crawler technology to crawl and clean data from websites related to violence, gambling, abuse and the like, labels the crawled data at the word level, collates 3308 entities and the corresponding 8632 public opinion texts, and at the same time constructs a network public opinion text error correction task dictionary.
First, the word at position t in a sequence y_1, y_2, ..., y_T is predicted, given the source sentence x_1, x_2, ..., x_N:

h_1^src, ..., h_N^src = encoder(L_src[x_1, ..., x_N])   (7)

h_t = decoder(L_trg[y_{t-1}, ..., y_1], h_1^src, ..., h_N^src)   (8)

p_t(w) = softmax(L_trg h_t)   (9)

In formula (7), the matrix L ∈ R^{dx×|V|} is a word embedding matrix, dx represents the dimension of the word embedding, and V represents the vocabulary; L_src represents the source sentence x_1...x_N as embedding vectors, and encoder(·) encodes the sentence sequence x_1...x_N; h_1^src, ..., h_N^src are the hidden states after encoding. In formula (8), y_{t-1...1} represents the predicted sequence, L_trg represents y_{t-1...1} as embedding feature vectors, decoder(·) decodes the hidden states h_1^src, ..., h_N^src, and h_t is the hidden state of the next word. In formula (9), applying the softmax function to the target hidden state h_t and the word embedding matrix L_trg finally yields the corrected sentence p_t(w). The result obtained by the text recognition module usually contains errors, and the text error correction method based on the public opinion word stock can correct the recognized result, which helps improve accuracy. The function of the network public opinion text error correction task dictionary is that the text error correction code can combine the dictionary to judge and correct the text recognition results.
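A sketch of formulas (7)-(9) using PyTorch's generic Transformer encoder-decoder, together with a simple lexicon check, is shown below. The model sizes, the causal mask handling and the dictionary lookup are illustrative assumptions; the patent specifies only the encoder/decoder/softmax structure and the use of the public opinion dictionary.

```python
import torch
import torch.nn as nn

class Corrector(nn.Module):
    """Formulas (7)-(9): encode the source characters, decode the target prefix,
    and score the next word as softmax(L_trg * h_t). Sizes are illustrative."""
    def __init__(self, vocab_size: int, dim: int = 256):
        super().__init__()
        self.src_embed = nn.Embedding(vocab_size, dim)   # L_src
        self.trg_embed = nn.Embedding(vocab_size, dim)   # L_trg
        self.seq2seq = nn.Transformer(d_model=dim, batch_first=True)

    def forward(self, src_ids: torch.Tensor, trg_ids: torch.Tensor) -> torch.Tensor:
        mask = nn.Transformer.generate_square_subsequent_mask(trg_ids.size(1))
        h_t = self.seq2seq(self.src_embed(src_ids), self.trg_embed(trg_ids),
                           tgt_mask=mask)
        # scoring against the L_trg embedding matrix gives the distribution p_t(w)
        return torch.softmax(h_t @ self.trg_embed.weight.T, dim=-1)

def dictionary_check(sentence: str, lexicon: set) -> bool:
    """Hypothetical lexicon lookup: report whether any public opinion entity
    from the error correction task dictionary appears in the recognized string."""
    return any(term in sentence for term in lexicon)

model = Corrector(vocab_size=8000)
probs = model(torch.randint(0, 8000, (1, 12)), torch.randint(0, 8000, (1, 11)))
```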
Step 5: the model constructed in steps 1-4 is trained on the data set to obtain a trained model.
As shown in Stage 6 of Fig. 4, the two stages Stage 4 and Stage 5 of Fig. 4 are fused. After the multi-mode text detection and recognition training stage of Stage 4, the image information in an image or video key frame can be converted into text information: the image passes in turn through the three sub-functions detect (text detection), preprocess (picture preprocessing) and findTextRegion (searching and screening text regions), the image features are respectively preprocessed, binarized and screened for text region box coordinates, and the text information is finally obtained through recognition. The text information is then appropriately corrected by the text error correction model based on the public opinion word stock rules trained in Stage 5, finally achieving visual multi-mode text detection, recognition and text error correction; the technology can be applied to the field of network public opinion supervision. The basic principle of the multi-mode character detection, recognition and character error correction is as follows: given a video sequence or image, the text detection module first locates the text from the whole image or key frame, detects the text region through the detect function based on OpenCV, and returns the coordinates of the detected rectangular text region. The position of the text is marked as (x_1, y_1, x_2, y_2, x_3, y_3, x_4, y_4), where (x_i, y_i) (i = 1, 2, 3, 4) are the 4 vertex coordinates of the quadrilateral. The text recognition module recognizes a machine-readable semantic sequence from the quadrilateral image region containing the text, i.e. recognizes the text information within the four vertex coordinates. The text error correction model then appropriately corrects the recognized text information in combination with the public opinion word stock, finally achieving the desired effect.
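The three sub-functions named above can be sketched with OpenCV primitives as follows. The function names mirror those in the description, but the morphological operations and area filter are only an assumed classical implementation of preprocessing and text-region screening for illustration; they are not the trained DB detector of the invention.

```python
import cv2
import numpy as np

def preprocess(image: np.ndarray) -> np.ndarray:
    """Picture preprocessing: grayscale conversion followed by Otsu binarization."""
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    _, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
    return binary

def find_text_region(binary: np.ndarray):
    """Search and screen text regions: return the four vertex coordinates
    (x_1, y_1, ..., x_4, y_4) of each candidate quadrilateral."""
    kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (15, 3))
    closed = cv2.morphologyEx(binary, cv2.MORPH_CLOSE, kernel)
    contours, _ = cv2.findContours(closed, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    boxes = []
    for contour in contours:
        if cv2.contourArea(contour) < 200:              # screen out small noise regions
            continue
        rect = cv2.minAreaRect(contour)
        boxes.append(cv2.boxPoints(rect).astype(int))   # 4 vertices of the quadrilateral
    return boxes

def detect(image: np.ndarray):
    """Text detection entry point: preprocess, then locate rectangular text regions."""
    return find_text_region(preprocess(image))

canvas = np.full((200, 400, 3), 255, np.uint8)
cv2.putText(canvas, "SAMPLE", (40, 110), cv2.FONT_HERSHEY_SIMPLEX, 2, (0, 0, 0), 5)
regions = detect(canvas)
```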
The overall implementation flow of the visual multi-mode character detection, recognition and correction method in network public opinion analysis is described as follows. First, in the first stage (Stage 1), a data set of images and videos is prepared as the input of the training model. Second comes the second stage (Stage 2), the inference acceleration engine, which mainly comprises a model optimizer and an inference engine (a deep learning inference suite developed by Intel) and can accelerate the reading and inference of the text detection and recognition models on the input images and videos. Then, the third stage (Stage 3) further optimizes the input information (image and video data) of the previous step, i.e. the picture and video processing toolkits provided alongside the existing model optimizer (OpenCV and Media SDK) perform encoding/decoding, preprocessing and inference-result post-processing on the multi-mode data to improve the image and video quality. Next comes the fourth stage (Stage 4). The character detection module in Stage 4 adds improved LK-PAN, DML and other strategies, where the LK-PAN strategy improves the performance of text detection on extreme distortions and aspect ratios, and the DML strategy is used to extract text information from the character regions in the image. The strategies GTC, UIM, TextRotNet, Transformer and TextConAug are added to the character recognition module and applied to the supervised learning task to enrich the context information of the training data. For example, the GTC strategy uses the attention module to guide the training of the CTC model and fuses multiple features to improve text recognition accuracy, and the UIM strategy predicts on unlabeled images to obtain pseudo labels, taking samples with high prediction probability as training data. Then, in the fifth stage (Stage 5), error correction is performed on the recognition result of the previous stage in combination with the constructed public opinion field word stock. Finally, in the sixth stage (Stage 6), the corrected text is obtained.
The invention verifies its effectiveness on a self-built experimental platform. The experiments are mainly configured to run on an NVIDIA Tesla P100 graphics card server, and there are two kinds of experimental data: the RWF2000 violent video data set, and the self-built gambling, violence and abuse image data set, 16098 images in total.
The invention fuses character detection and recognition technology, the Transformer network, and modules such as the Scrapy crawler and VGG19 for visual multi-mode character detection, recognition and error correction; it can detect and recognize characters in data sets involving violence and the like, correct the recognition results, and be applied to the field of network public opinion supervision. The F1-score index is adopted to measure the model effect. Experimental results show that the text detection F1-score of the new method provided by the invention improves by 10.11% on the image data set, and the text detection F1-score in videos improves by 17.97%.
The foregoing description of the preferred embodiments of the invention is not intended to be limiting, but rather is intended to cover all modifications, equivalents, alternatives, and improvements that fall within the spirit and scope of the invention.

Claims (10)

1. A visual multi-mode character detection, recognition and error correction method in network public opinion analysis is characterized by comprising the following steps:
(1) Labeling the characters in the related data of the network public opinion to construct a data set;
(2) Multimodal data processing: extracting image features of public opinion images in the data set by an image feature extractor, extracting image features of key frames in videos in the data set by a video feature extractor, and respectively carrying out information coding on the image features of the public opinion images and the image features of the key frames in the videos by adopting a multi-mode subspace coding method to obtain multi-mode feature coding information;
(3) Training of character detection and recognition: the text detection module detects the multi-mode feature coding information to obtain the located text line positions or single character positions on images and on key frame images of videos, and the text recognition module converts the located character sequences into text through a transcription layer using a connectionist temporal classification (CTC) algorithm;
(4) Character error correction training: constructing a word stock of the public opinion field, and correcting the text information obtained from images or video key frames by adopting the public opinion field word stock and a Transformer network to obtain a text error correction model;
(5) Training a text error correction model through a data set, and correcting the identified text information by combining a public opinion lexicon.
2. The method for detecting, identifying and correcting visual multi-mode characters in network public opinion analysis according to claim 1, wherein the data set consists of the RWF2000 violent video data set and a constructed gambling, violence and abuse picture data set; the characters are labeled with the PPOCRLabel semi-automatic image labeling tool, which either labels images automatically to produce annotation results enclosed by four points, or supports manual labeling: the four-point labeling mode is selected and four points are clicked in sequence as required to form a four-point labeled picture.
3. The method for detecting, identifying and correcting visual multi-mode characters in network public opinion analysis according to claim 1 or 2, wherein the image feature extractor adopts a pre-trained VGG19 convolutional neural network and takes the output of its last layer as the extracted public opinion image feature; a fully connected layer is added to the VGG19 convolutional neural network, the VGG19 network is used to extract the underlying features of the public opinion image, and the finally output feature vector serves as the public opinion image feature:

f_v = σ(W · VGG19(V_1) + b);

where f_v is the public opinion image feature, W and b are the parameters of the fully connected layer, σ is the sigmoid activation function, V_1 represents the multi-source data feature, and VGG19(·) represents one-dimensional flattening of the feature map produced by the VGG19 convolutional neural network;

the video feature extractor adopts a ResNet101 network model: each picture in the key frame sequence M = {m_1, m_2, ..., m_n} is scaled to 225 x 225, and the image features of the key frames are extracted through the ResNet101 network model and a fully connected layer to obtain the image feature set L = {l_1, l_2, ..., l_n}, where m_1, m_2, ..., m_n are the key frame images of a public opinion video, l_1, l_2, ..., l_n represent the image features extracted from each key frame of the video after passing through the ResNet101 network model, and n represents the total number of key frames extracted from the video; a self-attention mechanism is adopted to extract the semantic associations between image features, i.e. an attention module supervises the contextual semantic information of the image features;

the multi-mode subspace coding method maps the multi-mode public opinion data feature representations, comprising the public opinion image feature f_v and the key frame image feature set L of the public opinion video, into a shared sub-semantic space; assuming a common representation H of the image features shared by the different modes, the image feature X^(v) of each sample under each view is reconstructed through a group of mappings, giving the relation X^(v) = C_v(H), where v denotes any view modal feature and C_v denotes the reconstruction mapping function corresponding to view v; the information coding of the different available views is:

p(X|H) = p(X^(1)|H) p(X^(2)|H) ... p(X^(V)|H);

where p(X|H) is the multi-mode coding information of the different available views, p(X^(v)|H) represents the coding information of any single view modal feature, and V represents the total number of different learning views.
4. The method for detecting, identifying and correcting visual multi-mode characters in network public opinion analysis according to claim 3, wherein the multi-mode feature coding information is fed through the OpenCV, Media SDK and processing stages of the OpenVINO inference module for accelerated inference, and the features of the multi-mode feature coding information are extracted and optimized; the multi-mode data is encoded/decoded and preprocessed with the picture and video processing toolkits provided alongside the model optimizer, and character detection and character recognition are performed on the processed result.
5. The method for detecting, identifying and correcting visual multi-mode characters in network public opinion analysis according to claim 3, wherein the character detection module is an optimized MobileNetV3 model, which serves as the lightweight backbone network of the differentiable binarization network; this backbone can automatically adjust the threshold; the optimized MobileNetV3 model adopts an automatically learned threshold strategy:

B̂_{i,j} = 1 / (1 + e^(-k(P_{i,j} - T_{i,j})));

where B̂_{i,j} is the quantity that lets the DB binary map module learn text segmentation end-to-end during training, T_{i,j} is the pixel value at coordinates (i, j) on the adaptive threshold map learned by the optimized MobileNetV3 model during training, k represents the amplification factor, and P_{i,j} is the pixel value at coordinates (i, j) on the probability map P;

the text recognition module improves the LK-PAN module of the original text detection algorithm, enlarging the original convolution kernel size to 12 x 12; the text recognition module adopts an existing Transformer network to effectively mine the context information of text line images.
6. The method for detecting, identifying and correcting visual multi-mode characters in network public opinion analysis according to claim 5, wherein the optimized MobileNetV3 model uses a ResNet-18 network to extract feature information from the input public opinion image, and a multi-scale feature map is obtained through 3 x 3 convolution layers and up-sampling; on the feature map, the fusion module aggregates the 1/8, 1/16 and 1/32 feature information of the input image or video with the 1/4 feature map, and the complete information of the text is described through the predicted threshold map T and probability map P; the probability map P and the threshold map T are obtained from the 1/4-size feature map through a series of convolutions and transposed convolutions, and the DB method performs binarization processing on the probability map P and the threshold map T; binarization with a fixed threshold yields an approximate binarization map, text boxes are obtained from the approximate binarization map, and the localized text boxes of the image or video key frame are obtained as the input of the text recognition module;
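A minimal PyTorch sketch of the multi-scale fusion head just described is given below; the channel widths (64, 128, 256, 512 as produced by a ResNet-18 backbone), the inner width of 64 and the layer names are illustrative assumptions.

```python
# Sketch only: fuse 1/4, 1/8, 1/16, 1/32 feature maps and predict P and T.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DBHead(nn.Module):
    def __init__(self, in_channels=(64, 128, 256, 512), inner=64):
        super().__init__()
        # project the 1/4, 1/8, 1/16 and 1/32 backbone maps to a common width
        self.lateral = nn.ModuleList(nn.Conv2d(c, inner, 1) for c in in_channels)
        self.smooth = nn.Conv2d(4 * inner, inner, 3, padding=1)
        # probability map P and threshold map T via convolution + transposed convolution
        self.prob = self._branch(inner)
        self.thresh = self._branch(inner)

    def _branch(self, inner):
        return nn.Sequential(
            nn.Conv2d(inner, inner, 3, padding=1), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(inner, inner, 2, stride=2), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(inner, 1, 2, stride=2), nn.Sigmoid(),
        )

    def forward(self, c4, c8, c16, c32):
        size = c4.shape[-2:]
        feats = [self.lateral[0](c4)] + [
            F.interpolate(l(x), size=size, mode="nearest")
            for l, x in zip(self.lateral[1:], (c8, c16, c32))
        ]
        fused = self.smooth(torch.cat(feats, dim=1))   # aggregate onto the 1/4 map
        return self.prob(fused), self.thresh(fused)    # P, T
```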
the Transformer network in the character recognition module consists of one linear mapping layer and one iterable multi-head attention layer, and the calculation formulas are as follows:
X = [X_class; X_1; ...; X_N];
X_l = MHA(LP(X_{l-1})) + X_{l-1}, l ∈ [1, L];
Z_class = LN(MLP(X_L(0)) + X_L(0));
wherein [ ; ] represents the cascade operation in the vertical direction, X_class, X_1, ..., X_N represent the character sequence obtained by character detection, and X represents the aggregated sequence feature vector; MHA(·) represents the multi-head attention calculation function, LP(·) represents the linear mapping layer, and X_{l-1} and X_l respectively represent the attention matrix of the previous layer l-1 and of the current layer l; LN(·) represents the layer normalization function; after iterating L times, the output attention matrix X_L and its first-row feature vector X_L(0) serve as the input of the subsequent multi-layer perceptron MLP, finally outputting the key frame attention feature Z_class; L represents the number of layers of the Transformer network.
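A minimal PyTorch sketch of this aggregation is given below; the embedding width, head count, depth L and the use of a learnable X_class token are illustrative assumptions rather than details taken from the patent.

```python
# Sketch only: X_l = MHA(LP(X_{l-1})) + X_{l-1}, then Z_class = LN(MLP(X_L(0)) + X_L(0)).
import torch
import torch.nn as nn

class KeyframeAggregator(nn.Module):
    def __init__(self, dim=256, heads=8, depth=4):
        super().__init__()
        self.cls = nn.Parameter(torch.zeros(1, 1, dim))           # assumed learnable X_class
        self.lp = nn.Linear(dim, dim)                             # LP(.)
        self.mha = nn.ModuleList(
            nn.MultiheadAttention(dim, heads, batch_first=True) for _ in range(depth)
        )
        self.mlp = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
        self.ln = nn.LayerNorm(dim)                               # LN(.)

    def forward(self, chars):                  # chars: (B, N, dim), detected character sequence
        x = torch.cat([self.cls.expand(chars.size(0), -1, -1), chars], dim=1)
        for attn in self.mha:                  # iterate L times
            h = self.lp(x)
            x = attn(h, h, h)[0] + x           # X_l = MHA(LP(X_{l-1})) + X_{l-1}
        x0 = x[:, 0]                           # X_L(0), the first-row feature vector
        return self.ln(self.mlp(x0) + x0)      # Z_class
```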
7. The method for detecting, identifying and correcting visual multi-mode characters in network public opinion analysis according to any one of claims 4-6, wherein a deep mutual learning strategy is added to the character detection module to convert the text lines or single-character information located in the recognized image into text information;
a GTC strategy, a UIM strategy, a TextRotNet module and a data enhancement strategy for mining text context information are added to the text recognition module; the GTC strategy adopts an existing attention module to guide the training of the CTC model while fusing features of multiple modalities, thereby improving text recognition accuracy; the UIM strategy predicts unlabeled images to obtain pseudo labels and takes samples with high prediction probability as training data; the TextRotNet module is a self-supervised pre-training model that initializes the weights and bias factors of the MobileNetV3 model; the TextConAug strategy is applied to the supervised learning task and enriches the context information of the training data.
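A minimal Python sketch of the UIM idea (keeping only high-confidence predictions on unlabeled images as pseudo-labeled training samples) is given below; the 0.95 confidence threshold and the model.predict interface are assumptions, not details from the patent.

```python
# Sketch only: pseudo-label mining from unlabeled images.
def build_pseudo_labels(model, unlabeled_images, threshold=0.95):
    """model.predict is a hypothetical recognizer API returning (text, probability)."""
    pseudo = []
    for img in unlabeled_images:
        text, prob = model.predict(img)
        if prob >= threshold:                  # keep only high-probability samples
            pseudo.append((img, text))
    return pseudo
```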
8. The method for detecting, identifying and correcting visual multi-mode characters in network public opinion analysis according to claim 7, wherein the construction method of the word stock in the public opinion field is as follows: data is crawled from websites related to violence, gambling and abuse and cleaned using Python crawler technology, the crawled data is labeled at the word level, 3308 entities and the corresponding 8632 public opinion texts are collated, and a network public opinion text error correction task dictionary is constructed; the function of the network public opinion text error correction task dictionary is that the text error correction code can be combined with the dictionary to judge and correct the text recognition result.
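A minimal Python sketch of assembling such a dictionary from word-level annotated crawl results is given below; the file name and JSON layout are assumptions, while the entity/text counts come from the claim above.

```python
# Sketch only: build the public-opinion error-correction dictionary from labeled crawl data.
import json

def build_correction_dictionary(annotation_file="labeled_corpus.json"):
    with open(annotation_file, encoding="utf-8") as f:
        records = json.load(f)                 # assumed: [{"text": ..., "entities": [...]}, ...]
    dictionary = set()
    for rec in records:
        dictionary.update(rec["entities"])     # 3308 entities over 8632 public opinion texts
    return dictionary
```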
9. The method for detecting, identifying and correcting visual multi-mode characters in network public opinion analysis according to claim 8, wherein the error correction method is as follows: the word at position t of the sequence y_1, y_2, ..., y_T is predicted on the premise of the given source sentence x_1, x_2, ..., x_N:
h^{src}_{1...N} = encoder(L_src x_{1...N});
h_t = decoder(L_trg y_{t-1...1}, h^{src}_{1...N});
p_t(w) = softmax(L_trg h_t);
wherein the matrix L ∈ R^{dx×|V|} is a word embedding matrix, dx represents the dimension of the word embedding, and V represents the vocabulary; L_src represents expressing the source sentence x_1...x_N as embedded vectors, and encoder(·) encodes the sentence sequence x_1...x_N; h^{src}_{1...N} is the hidden state after encoding; y_{t-1...1} represents the already predicted sequence, L_trg represents expressing y_{t-1...1} as embedded feature vectors, decoder(·) decodes together with the hidden state h^{src}_{1...N}, and h_t is the hidden state of the next word; the corrected sentence distribution p_t(w) is obtained by applying the softmax function to the target hidden state h_t and the word embedding matrix L_trg.
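A minimal PyTorch sketch of this decoding step is given below; a GRU encoder/decoder and a shared embedding matrix stand in for encoder(·), decoder(·), L_src and L_trg, and the final encoder state replaces attention over h^{src}, so this is a structural illustration under stated assumptions rather than the patented implementation.

```python
# Sketch only: encode source sentence, decode prefix, project onto vocabulary, softmax.
import torch
import torch.nn as nn

class Corrector(nn.Module):
    def __init__(self, vocab_size, dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)        # rows play the role of L_src / L_trg
        self.encoder = nn.GRU(dim, dim, batch_first=True)
        self.decoder = nn.GRU(dim, dim, batch_first=True)

    def forward(self, src_ids, prev_ids):
        h_src, _ = self.encoder(self.embed(src_ids))                       # h^src_{1..N}
        init = h_src[:, -1:].transpose(0, 1).contiguous()                  # assumed init state
        _, h_n = self.decoder(self.embed(prev_ids), init)                  # h_t
        logits = h_n[-1] @ self.embed.weight.t()                           # L_trg h_t
        return torch.softmax(logits, dim=-1)                               # p_t(w)
```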
10. The method for detecting, identifying and correcting visual multi-mode characters in network public opinion analysis according to claim 8 or 9, wherein the bidirectional Transformer adopts a pre-trained model;
the image information in the image or video key frame passes in sequence through three sub-functions: text detection, picture preprocessing, and searching and screening of text regions; the image features are respectively preprocessed, binarized and screened for text-region box coordinates, and the text information is obtained through recognition.
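A minimal Python sketch of chaining these three sub-functions is given below; the sub-functions are passed in as callables because their concrete implementations are the detection, preprocessing, screening and recognition components of the earlier claims, and all parameter names are illustrative.

```python
# Sketch only: chain text detection, preprocessing/binarization, region screening, recognition.
def process_keyframe(image, detect_text, preprocess_region, screen_region, recognize):
    boxes = detect_text(image)                                   # text-region box coordinates
    regions = [preprocess_region(image, box) for box in boxes]   # preprocessing / binarization
    kept = [r for r in regions if screen_region(r)]              # search and screen text regions
    return [recognize(r) for r in kept]                          # recognized text information
```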
CN202310283922.7A 2023-02-03 2023-03-22 Visual multi-mode character detection recognition and error correction method in network public opinion analysis Pending CN116229482A (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202310054498 2023-02-03
CN2023100544989 2023-02-03

Publications (1)

Publication Number Publication Date
CN116229482A true CN116229482A (en) 2023-06-06

Family

ID=86573097

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310283922.7A Pending CN116229482A (en) 2023-02-03 2023-03-22 Visual multi-mode character detection recognition and error correction method in network public opinion analysis

Country Status (1)

Country Link
CN (1) CN116229482A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116958998A (en) * 2023-09-20 2023-10-27 四川泓宝润业工程技术有限公司 Digital instrument reading identification method based on deep learning
CN116958998B (en) * 2023-09-20 2023-12-26 四川泓宝润业工程技术有限公司 Digital instrument reading identification method based on deep learning
CN117370679A (en) * 2023-12-06 2024-01-09 之江实验室 Method and device for verifying false messages of multi-mode bidirectional implication social network
CN117370679B (en) * 2023-12-06 2024-03-26 之江实验室 Method and device for verifying false messages of multi-mode bidirectional implication social network
CN117912027A (en) * 2024-03-18 2024-04-19 山东大学 Intelligent identification method and system suitable for RPA process automation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination