CN111460247A

CN111460247A - Automatic detection method for network picture sensitive characters

Info

Publication number: CN111460247A
Application number: CN201910053775.8A
Authority: CN
Inventors: 蔡元奇; 林金朝; 庞宇; 杨鹏; 马坤阳; 张焱杰
Original assignee: Chongqing University of Post and Telecommunications
Current assignee: Chongqing University of Post and Telecommunications
Priority date: 2019-01-21
Filing date: 2019-01-21
Publication date: 2020-07-28
Anticipated expiration: 2039-01-21
Also published as: CN111460247B

Abstract

The invention discloses an automatic detection method for sensitive text of network pictures, which crawls and downloads websites containing pictures to be detected, collects pictures and adds them to a database by means of online crawling and offline loading; and obtains the pictures from the picture database. Image and image target detection (text area positioning, image text recognition) and sensitive text information detection and other processing. Using the Faster R-CNN deep network architecture based on the Region Proposal Network (RPN), a two-level sensitive text information classifier is adopted in the sensitive text information detection link. Among them, the first-level classifier performs rough screening of sensitive words in the input sentence by using a method based on multi-dimensional expansion of sensitive word library. The second-layer filter performs deep-level semantic refinement of sensitive text information based on the combination of sentiment polarity thesaurus and SVM classifier to confirm whether the text information is sensitive information. The automatic detection of image-sensitive text is effectively realized, the detection efficiency is high, and the system response time delay is fast.

Description

Automatic detection method of sensitive text in network pictures

技术领域technical field

本发明涉及数字图像处理以及深度学习的相关算法，属于机器视觉及自然语言处理领域，具体是一种网络图片敏感文字自动检测方法。The invention relates to digital image processing and related algorithms of deep learning, belongs to the field of machine vision and natural language processing, and particularly relates to an automatic detection method for sensitive text in network pictures.

背景技术Background technique

随着科学技术的进步，我国互联网行业进入一个飞速发展的阶段。其中包括斗鱼、虎牙在内的直播平台孕育而生，微信、微博和QQ等在线社交平台得到不断更新和完善，这些直播平台和在社交平台不但拥有巨大的用户群体，而且非常活跃，特别深受青年和青少年用户的喜爱。伴随着海量的数据信息传输，如此庞大的信息交互量使得人们可以轻易地在网络上获取多样化的数据信息，但是这些数据信息中往往充斥着大量的敏感信息。基于传统文本信息的敏感文字过滤技术相对成熟，而图像中所包含敏感信息的监控相对困难，因此敏感图像的传播也更为隐蔽。许多组织和个人为了逃避政府等监控部门对互联网信息的监管，改用图像插入文本的形式散播敏感信息，包括色情信息、反社会信息和暴力信息等，这也成为当前敏感信息传播的主要途径之一。据相关调查表明，超过10％的网站含有敏感信息相关内容。不仅如此，很多不法分子通过腾讯QQ、微信、直播平台的用户头像进行敏感信息传播，这其中所充斥的色情敏感信息图像不仅对青少年的身心健康造成不利影响，同时包含的反动和暴力等相关的信息也有可能干扰到社会的稳定。网络自身所具备的数据共享、相互连接以及资源开放性等特点，是不法分子和组织敢于大肆传播敏感信息的根本原因。图片文字敏感信息主要特征在于：With the advancement of science and technology, my country's Internet industry has entered a stage of rapid development. Among them, live broadcast platforms including Douyu and Huya were born, and online social platforms such as WeChat, Weibo and QQ have been continuously updated and improved. These live broadcast platforms and social platforms not only have huge user groups, but also are very active, especially Loved by young and teenage users. With the transmission of massive data and information, such a huge amount of information interaction makes it easy for people to obtain diverse data information on the network, but these data information are often filled with a large amount of sensitive information. The sensitive text filtering technology based on traditional text information is relatively mature, and the monitoring of sensitive information contained in images is relatively difficult, so the transmission of sensitive images is also more concealed. In order to evade the supervision of Internet information by surveillance departments such as the government, many organizations and individuals use images inserted into text to disseminate sensitive information, including pornographic information, anti-social information and violent information. one. According to relevant surveys, more than 10% of websites contain content related to sensitive information. Not only that, many criminals disseminate sensitive information through the user avatars of Tencent QQ, WeChat, and live broadcast platforms. The images of pornographic sensitive information that are full of them not only adversely affect the physical and mental health of young people, but also contain reactionary and violence-related images. Information also has the potential to interfere with social stability. The characteristics of the network itself, such as data sharing, interconnection, and resource openness, are the fundamental reasons why criminals and organizations dare to disseminate sensitive information. The main features of image and text sensitive information are:

(1)敏感信息的表现形式差异性大(1) The manifestations of sensitive information vary greatly

敏感信息涉及的范围非常广泛，涵盖了思想政治问题、社会问题、文化问题等很多方面，而且不同主题的敏感信息，其表现形式有很大的差异性，既使同一主题在不同场合、不同文化背景下等方面的敏感度表现程度不同。类似“血洗”，“绝杀”等词汇，在体育为主题的信息中，大都表示比赛胜利的意思，而放在其他的主题中，很大可能是敏感信息的标志词。The scope of sensitive information is very wide, covering many aspects such as ideological and political issues, social issues, cultural issues, etc., and the expressions of sensitive information on different topics are very different. Sensitivity in the context and other aspects shows different degrees. Words such as "bloodbath" and "lore" are mostly used in sports-themed information to mean victory in the game, while in other themes, they are likely to be signs of sensitive information.

(2)脱离原文的字符识别易造成显著的歧义(2) Character recognition out of the original text is likely to cause significant ambiguity

不法人员考虑到敏感文字内容存在违法的可能性，会刻意使用同义词、同音字、拼音、左右结构字形的隔开输入替换等规避的方式来制作敏感文字图片。这就给文字识别增加了难度。Considering the possibility of illegal content of sensitive texts, criminals will deliberately use synonyms, homophones, pinyin, and left and right structural glyphs to separate input and replace to create sensitive text pictures. This increases the difficulty of character recognition.

由于网络上的图片形式各异，不同的图片在文字大小，文字颜色，文字尺寸，文字相对位置，文字字体等方面都有很大的不同，在识别其中的文字之前，需要先定位出图片中包含的文字区域部分，准确的图片文字区域定位是后续识别工作的基础。文字区域定位的传统方法有基于图像连通域特征的方法，基于图像纹理特征的方法、基于图像边缘特征的方法。随着机器学习技术的发展，近些年基于机器学习的图像特征目标检测算法效果得到了大幅度的提升。其中基于深度学习技术的方案检测效果十分显著。2014年Girshick率先提出了基于Region Proposal(候选区域)的RCNN(Region with CNN Features)方案。该方法的核心思想在于利用图像中具有代表性的部分候选区域代替公开数据集PASCAL VOC将最高检测率从35％提升至了53％。在2015年，该作者在RCNN的基础上又提出了一种新的检测方法fast RCNN。该方法在保证检测正确率与RCNN相当的情况下，大幅度降低了算法训练和测试的时间复杂度。检测系统总的训练时间从84个小时降低到了9.5小时，测试时长从47秒降低到了约0.3秒。同年，该作者团队又提出了faster RCNN。此方法的核心在于将之前RCNN的几个主要模块全部整合到了同一个深度网络框架内进行端到端处理。在文字区域定位方面，可以通过提取图像内容特征(包括图像颜色特征、纹理特征以及边缘特征等)进行有关特征学习，根据所学到的相关特征对图像区域进行分类，实现对文字区域的判定。Due to the different forms of pictures on the Internet, different pictures are very different in terms of text size, text color, text size, text relative position, text font, etc. Before identifying the text in it, it is necessary to locate the text in the image For the included text area, accurate image text area positioning is the basis for subsequent recognition work. The traditional methods of text area localization include the method based on the image connected domain feature, the method based on the image texture feature, and the method based on the image edge feature. With the development of machine learning technology, the effect of image feature target detection algorithms based on machine learning has been greatly improved in recent years. Among them, the scheme detection effect based on deep learning technology is very significant. In 2014, Girshick took the lead in proposing the RCNN (Region with CNN Features) scheme based on Region Proposal (candidate region). The core idea of this method is to replace the public dataset PASCAL VOC with some representative candidate regions in the image to increase the highest detection rate from 35% to 53%. In 2015, the author proposed a new detection method fast RCNN based on RCNN. This method greatly reduces the time complexity of algorithm training and testing while ensuring that the detection accuracy is comparable to that of RCNN. The total training time of the detection system was reduced from 84 hours to 9.5 hours, and the test time was reduced from 47 seconds to about 0.3 seconds. In the same year, the author team proposed faster RCNN. The core of this method is to integrate several main modules of the previous RCNN into the same deep network framework for end-to-end processing. In terms of text area positioning, relevant feature learning can be performed by extracting image content features (including image color features, texture features, and edge features, etc.), and the image regions can be classified according to the learned relevant features to realize the determination of text regions.

综上所述，尽管图片敏感信息识别已取得了不少的研究成果，但仍都存在一定的局限性和不足。基于图片文字内容的敏感信息识别主要存在类似于自然场景图片文字信息的文字区域定位困难、文字识别精度低以及对短文本敏感信息判别的困难等不良影响。目前来讲，当前图片敏感信息识别着重关注和研究的领域，也是网络监管部门亟待提升技术手段进行解决的焦点问题。To sum up, although a lot of research results have been achieved in the identification of sensitive information in pictures, there are still some limitations and deficiencies. Sensitive information recognition based on image text content mainly has adverse effects such as difficulty in locating text regions similar to natural scene image text information, low text recognition accuracy, and difficulty in discriminating sensitive information in short texts. At present, the current identification of sensitive information in pictures is an area of focus and research, and it is also a focus problem that network supervision departments need to improve technical means to solve.

发明内容SUMMARY OF THE INVENTION

本发明的目的是为克服已有图片文字敏感信息检测技术的不足之处，本文主要研究改进后的Faster R-CNN，提高了对小目标区域检测效果的情况，改进的算法在网络图片文字检测与识别上具有更好的效果，其识别准确率会更高。针对基于短文本敏感信息多级分类器的方法研究，在原有的基础上拓展敏感字库，改进敏感文字分类器。本发明总体框图见附图1。The purpose of the present invention is to overcome the shortcomings of the existing image text sensitive information detection technology. This paper mainly studies the improved Faster R-CNN, which improves the detection effect of small target areas. The improved algorithm is used in network image text detection. With better recognition effect, its recognition accuracy will be higher. Aiming at the method research of multi-level classifier based on short text sensitive information, the sensitive character library is expanded on the original basis, and the sensitive character classifier is improved. The overall block diagram of the present invention is shown in FIG. 1 .

传统的图片敏感文字信息检测长期依赖人工监管和取缔，并且传统人工举报的检测时长一般在小时级别，而在发表图片到举报的这段时间间隔里敏感信息可能已经得到了广泛的传播，这种以图片文字形式存在的敏感信息正游走于监管的边缘地带，深刻地影响互联网的健康环境和广大网民的身心健康。本发明是基于深度学习算法和机器学习算法共同来实现对网络图片敏感文字的自动检测。The traditional detection of sensitive text information in pictures has long relied on manual supervision and bans, and the detection time of traditional manual reports is generally at the hour level, and the sensitive information may have been widely disseminated during the time interval between the publication of the picture and the report. Sensitive information in the form of pictures and text is wandering on the fringes of supervision, profoundly affecting the healthy environment of the Internet and the physical and mental health of netizens. The present invention realizes the automatic detection of the sensitive text of network pictures based on the deep learning algorithm and the machine learning algorithm.

鉴于此，本发明采用的技术方案是：网络图片敏感文字自动检测方法，包括以下步骤：In view of this, the technical solution adopted in the present invention is: a method for automatic detection of sensitive text in network pictures, comprising the following steps:

步骤S1，使用网络爬虫对含有图片的网站进行图片抓取；并将图片的基本信息保存到数据源数据库中，同时将图片收集到图片数据库中，供后续使用；Step S1, use a web crawler to crawl a website containing pictures; save the basic information of the pictures in the data source database, and collect the pictures in the picture database for subsequent use;

步骤S2，从图片数据库获取图片并通过使用基于区域建议网络的FasterR-CNN深度网络，对图片进行文字目标检测，完成后将图片识别的文字信息进行提取转化为图片文本信息；Step S2, obtain the picture from the picture database and use the FasterR-CNN deep network based on the region proposal network to detect the text object of the picture, and after completion, extract the text information of the picture recognition and convert it into the picture text information;

步骤S3，将提取到的图片文本信息使用分类器进行敏感文字信息检测，包括第一级分类器通过基于多维拓展敏感字库的方式对输入语句进行敏感词粗筛选，将粗筛选后的文本信息使用中文分词处理，然后通过基于情感极性词库与SVM分类器方式的二级分类器进行深层次的敏感信息精筛选，完成网络图片敏感文字信息的自动检测。Step S3, using the classifier to detect the sensitive text information of the extracted image text information, including the first-level classifier performing rough screening of sensitive words on the input sentence based on the multi-dimensional expansion of the sensitive word library, and using the rough screened text information. Chinese word segmentation processing, and then through the second-level classifier based on sentiment polarity thesaurus and SVM classifier, the deep-level sensitive information is finely screened, and the automatic detection of sensitive text information in network pictures is completed.

进一步，所述图片的基本信息包含图片的链接，图片的大小，图片的名称。Further, the basic information of the picture includes the link of the picture, the size of the picture, and the name of the picture.

步骤S2中所述对图片进行文字目标检测的过程包括对区域建议网络的共享卷积层进行最大池化采样缩小和反卷积操作放大，然后对候选区域生成网络的特征映射层输出的特征图进行平均池化，生成固定大小的目标候选区域，候选区域优化网络的区域池化层根据候选区域生成网络输出的目标候选区域,对候选区域生成网络的特征映射层输出的特征图进行区域池化,生成固定大小的区域特征；The process of performing text target detection on the image described in step S2 includes performing maximum pooling sampling reduction and deconvolution operation amplification on the shared convolution layer of the region proposal network, and then generating the feature map output by the feature mapping layer of the network for the candidate region. Perform average pooling to generate fixed-size target candidate regions. The region pooling layer of the candidate region optimization network generates the target candidate region output by the network according to the candidate region, and performs region pooling on the feature map output by the feature mapping layer of the candidate region generation network. , generate a fixed-size region feature;

根据softmax层输出每个目标候选区域是否包含目标或背景的分类概率,只输出概率大于预设阈值的目标候选区域,即可排除大部分无效候选区域,得到优化后的目标候选区域，然后目标分类回归网络根据优化后的目标候选区域,从生成的共享特征图中提取区域特征,进行最终的目标文字类别判别以及目标边界框回归修正。According to the softmax layer, the classification probability of whether each target candidate area contains a target or background is output, and only target candidate areas with a probability greater than a preset threshold are output, and most of the invalid candidate areas can be excluded to obtain the optimized target candidate area, and then the target classification According to the optimized target candidate region, the regression network extracts regional features from the generated shared feature map, and performs final target text category discrimination and target bounding box regression correction.

步骤S3中所述敏感信息精筛选，将情感极性词加入到现有敏感信息短文本的数据集当中，结合情感倾向判断，标记文本信息，使用SVM模型对含有情感极性词敏感信息短文本的数据集进行训练。In step S3, the sensitive information is finely screened, the emotional polarity words are added to the data set of the existing sensitive information short texts, and the text information is marked in combination with the emotional tendency judgment, and the SVM model is used to analyze the sensitive information short texts containing emotional polar words. dataset for training.

所述SVM分类器，将训练集进行中文分词处理，然后通过词向量的形式对训练集中的文本进行编码，利用多维向量的方式表征文本的词汇，并对其进行特征提取和模型训练，最后利用训练好的分类模型对粗筛选处理后的短文本进行判断，确认该短文本是否为敏感文字信息文本。利用libsvm的交叉验证功能来实现向量参数的寻优，通过搜索参数取值空间来获取最佳的参数值。经过文本预处理、特征提取、特征表示、归一化处理后，已经把原来的文本信息抽象成一个向量化的样本集，然后把此样本集与训练好的模板文件进行相似度计算，进一步确认该短文本是否为敏感文字信息文本。The SVM classifier performs Chinese word segmentation on the training set, then encodes the text in the training set in the form of word vectors, uses multi-dimensional vectors to represent the vocabulary of the text, and performs feature extraction and model training on it, and finally uses The trained classification model judges the short text after rough screening to confirm whether the short text is sensitive text information text. The cross-validation function of libsvm is used to realize the optimization of vector parameters, and the optimal parameter values are obtained by searching the parameter value space. After text preprocessing, feature extraction, feature representation, and normalization, the original text information has been abstracted into a vectorized sample set, and then the similarity between the sample set and the trained template file is calculated to further confirm Whether the short text is sensitive text information text.

还包括对于确定含有敏感文字信息的图片进行跟踪报警，显示该图片的地址链接、图片名称信息、图片大小信息。It also includes tracking and alarming pictures that are determined to contain sensitive text information, and displaying the picture's address link, picture name information, and picture size information.

本发明最终实现了图片敏感文字自动检测的功能，相较传统方法大大减少了系统反应时延，提高系统检测对图片类的敏感信息准确率。特别地，对于图片上的小目标区域文字、文字的倾斜问题、歧义文字和较复杂敏感语义下的文字识别与检测问题具有明显提升。The present invention finally realizes the function of automatic detection of image-sensitive text, greatly reduces the system response time delay compared with the traditional method, and improves the accuracy of system detection of sensitive information of images. In particular, the recognition and detection of texts in small target areas on pictures, the inclination of texts, ambiguous texts and more complex and sensitive semantics have been significantly improved.

附图说明Description of drawings

图1为本发明图片文字目标检测流程图；Fig. 1 is the flow chart of the present invention's picture text target detection;

图2为本发明敏感文字信息检测流程图；Fig. 2 is the flow chart of sensitive text information detection of the present invention;

图3为第二级分类器网络结构图。Fig. 3 is the network structure diagram of the second-level classifier.

具体实施方式Detailed ways

本发明包括图片的目标检测(文字区域的定位与文字识别)和敏感文字信息检测两大部分。为使本发明实施例的目的、技术方案和优点更加清楚,下面将结合本发明实施例中的附图,对本发明实施例中的技术方案进行清楚、完整的描述，但并不用于限定本发明。The invention includes two parts: target detection of pictures (location of text area and text recognition) and detection of sensitive text information. In order to make the purposes, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, but are not intended to limit the present invention. .

网络图片爬取模块通过使用网络爬虫对特定含有图片的网站进行图片抓取，并将图片的基本信息保存到数据源数据库中，同时将图片收集到图片数据库中供后续使用。对图片库中的图片进行适当地人工分类，以便后期检查与监管。首先设置在网站上抓取网站内容的图片获取规则,利用现有技术中的网络爬虫来通过网页的链接地址来寻找网页,一直循环下去,直到把这个网站所有的网页的图片都抓取完为止。在具体的应用实施过程中,为了更快的获取网站的图片,可以通过预先设置的信息获取规则来省略掉一些不需要获取的非图片等内容,来减少抓取内容的工作量。在本方法中使用的图片获取规则设定为每5分钟获取一次,获取的网站深度涉及到待检测网站的首页、首页上链接的第一层和第二层、后续页等，先对基本信息进行以文本格式的检测报告进行保存，可以想到的是,周期性的获取可以根据需要设置为更长或者更短一点的时间,根据检测的实际需要网站检测的深度可以将爬取的图片保存到图片数据库中，其它数据信息保存到数据源数据库中。The web image crawling module uses web crawlers to crawl images from specific websites containing images, saves the basic information of the images in the data source database, and collects images into the image database for subsequent use. Appropriate manual classification of pictures in the picture library for later inspection and supervision. First, set up a picture acquisition rule for crawling the content of the website on the website, use the web crawler in the prior art to find the webpage through the link address of the webpage, and keep looping until the pictures of all the webpages of the website have been crawled. . In the specific application implementation process, in order to obtain the pictures of the website faster, some non-picture content that does not need to be obtained can be omitted through the preset information acquisition rules, so as to reduce the workload of crawling the content. The image acquisition rule used in this method is set to be acquired every 5 minutes, and the acquired website depth involves the home page of the website to be detected, the first and second layers of links on the home page, and subsequent pages. To save the detection report in text format, it is conceivable that the periodic acquisition can be set to a longer or shorter time as needed, and the crawled images can be saved to the website according to the actual needs of the detection. In the picture database, other data information is saved in the data source database.

如图1所示的实施例，图片文字目标检测模块又包括包括区域候选网络提取部分(用于文字图像的空间特征)和Fast-R-CNN检测部分。图片文字目标检测模块主要具体步骤为：As shown in the embodiment shown in FIG. 1 , the image text target detection module further includes a region candidate network extraction part (used for the spatial features of text images) and a Fast-R-CNN detection part. The main specific steps of the image text target detection module are:

(1)合理划分数据集，采用标准数据集，进行标准化，统一输入维度，加快训练速度；(1) Reasonably divide the data set, use the standard data set for standardization, unify the input dimensions, and speed up the training speed;

(2)将不同层的卷积模块融合到一起，能提取多层特征，高层的抽象特征和底层的详细特征，把第一层的map做一个max pooling进行降采样来缩小，把最后一层的map做一个deconv(反卷积)进行上采样来放大，然后把前5层的卷积输出连接起来，连接1,3,5层的效果更好(因为中间有一个间隔层，各层之间的特征相互关联性小)。最后，我们使用局部响应标准化(LRN)来对多个feature maps进行归一化，若没有归一化，大的Feature会压制住小的feature。然后将多个feature maps合在一个单个的输出立方体，我们称为cube Featuremaps。最后一层加入反卷积层；(2) Integrate the convolution modules of different layers together, which can extract multi-layer features, high-level abstract features and low-level detailed features, make the map of the first layer a max pooling for downsampling to reduce, and reduce the last layer. The map does a deconv (deconvolution) for upsampling to enlarge, and then connects the convolution outputs of the first 5 layers, and the effect of connecting layers 1, 3, and 5 is better (because there is an interval layer in the middle, and the There is little correlation between the features). Finally, we use local response normalization (LRN) to normalize multiple feature maps. Without normalization, large features will overwhelm small features. Multiple feature maps are then combined into a single output cube, which we call cube Featuremaps. The last layer is added to the deconvolution layer;

(3)操作并行卷积，在第二个卷积模块中，将5*5和7*7卷积进行并行，不同大小的卷积核提取的特征不同，进行差异化提取并融合；(3) Operate parallel convolution. In the second convolution module, 5*5 and 7*7 convolutions are performed in parallel, and the features extracted by convolution kernels of different sizes are different, and are extracted and fused differently;

(4)引入交叉卷积核，将方型卷积核转化为非对称卷积结构，5*5的卷积核转化为5*1和1*5的卷积核，进行最大池化采样缩小和反卷积操作操作放大；(4) Introduce a cross convolution kernel, convert the square convolution kernel into an asymmetric convolution structure, convert the 5*5 convolution kernel into 5*1 and 1*5 convolution kernels, and perform maximum pooling sampling reduction and deconvolution operation to enlarge the operation;

(5)固定候选区域池化层，它的输入由多个不同深度卷积层得到的特征图组成，三个深度特征图共同作为候选区域池化层的输入，此层的目的是将尺寸不一的候选框，转化为输出特征图尺寸固定，以供下一步使用。然后对候选区域生成网络的特征映射层输出的特征图进行平均池化，生成固定大小的区域特征；(5) The fixed candidate region pooling layer, whose input is composed of feature maps obtained by multiple different depth convolution layers, and the three depth feature maps are jointly used as the input of the candidate region pooling layer. The purpose of this layer is to make the size different The candidate frame of one is converted into the output feature map with a fixed size for use in the next step. Then average pooling is performed on the feature map output by the feature mapping layer of the candidate region generation network to generate a fixed size region feature;

(6)候选区域优化网络采用零均值,标准差为a的高斯分布随机初始化,利用训练好的faster-R-CNN网络生成训练数据,单独训练候选区域优化网络,将训练集中的训练图片输入到网络中，候选区域生成网络输出的目标候选区域作为候选区域优化网络的训练数据。任一标注框的交并比IOU(交并比)大于目标候选区域的数据,作为正样本,与任一标注框的交并比IOU小于目标候选区域的数据,作为负样本；(6) The candidate region optimization network adopts a Gaussian distribution with zero mean and standard deviation a randomly initialized, uses the trained faster-R-CNN network to generate training data, trains the candidate region optimization network separately, and inputs the training images in the training set into In the network, the target candidate region output by the candidate region generation network is used as the training data of the candidate region optimization network. The data whose intersection ratio IOU (intersection and union ratio) of any annotation frame is greater than the target candidate area is regarded as a positive sample, and the data whose intersection ratio IOU with any annotation frame is smaller than the target candidate area is regarded as a negative sample;

(7)利用非极大值抑制取得分高的100个建议窗口，这些建议窗口基本可以覆盖所有出现的文字区域，如果选区过多会导致建议窗口重叠，会增加无用的计算量。进行边缘细化修正，它通过位置偏移量可以预测垂直方向的精确位置。公式如下：(7) Use non-maximum suppression to obtain 100 high-scoring suggestion windows. These suggestion windows can basically cover all the text areas that appear. If there are too many selections, the suggestion windows will overlap, which will increase the useless calculation amount. Perform edge refinement correction, which can predict the precise vertical position through the position offset. The formula is as follows:

其中x_side是最接近水平边到当前锚点的预测的x坐标，

是x轴的实际边缘坐标，它是从实际边界框和锚点位置预先计算的。

是x轴的锚点的中心，w^a是固定的锚点宽度w^a＝16。o、o^*分别表示预测和实际偏移量。使用边缘提议的偏移量来优化最终的文本行边界框。where x _side is the predicted x-coordinate of the closest horizontal side to the current anchor,

is the actual edge coordinate of the x-axis, which is precomputed from the actual bounding box and anchor position.

is the center of the anchor for the x-axis, and w ^a is the fixed anchor width w ^a =16. o, o ^* represent the predicted and actual offsets, respectively. The final text line bounding box is refined using the offsets of the edge proposals.

我们采用多任务学习来联合优化模型参数。根据输出的数据来源，引入了三种损失函数：

分别表示的是文本/分文本的二分类损失，坐标损失和边缘细化损失。根据最小损失规则，最小化图像的总体目标函数(L)最小化：We employ multi-task learning to jointly optimize model parameters. According to the source of the output data, three loss functions are introduced:

They represent the binary classification loss of text/sub-text, coordinate loss and edge refinement loss, respectively. The overall objective function (L) of the image is minimized according to the minimum loss rule:

其中每个锚点都是一个训练样本，i是一个小批量数据中一个锚点的索引。S_i是预测的锚点i作为实际文本的预测概率。k是边缘锚点的索引，其被定义为在实际文本行边界框的左侧或右侧水平距离(例如8个像素)内的一组锚点。o_k和

是与第k个锚点关联的x轴的预测和实际偏移量。

是我们使用Softmax损失区分文本和非文本的分类损失。

和

是回归损失。N_s,N_v,N_o是标准化参数，表示

分别使用的锚点总数。where each anchor is a training sample and i is the index of an anchor in a mini-batch. S _i is the predicted probability of anchor i as the actual text. k is the index of the edge anchor, which is defined as a set of anchors within a horizontal distance (eg, 8 pixels) to the left or right of the actual text line bounding box. o _k and

are the predicted and actual offsets of the x-axis associated with the kth anchor.

is our classification loss to distinguish text from non-text using Softmax loss.

and

is the regression loss. N _s , N _v , and N _o are standardized parameters, indicating

The total number of anchors used respectively.

最后通过文本线构造算法合并建议窗口，就是将每两个相近的8*h的小建议窗口组成一个对组，然后合并不同的对组直到无法再合并为止，最后生成一个完整的建议框。文本行的构建非常简单。文本行构建如下。首先，我们为提议定义B_i一个配对邻居(B_j,B_i)，作为B_j->B_i，当B_j是最接近B_i的水平距离，该距离小于50像素，并且它们的垂直重叠大于0.6时。其次，如果B_j->B_i和B_i->B_j，则将两个提议分组为一对。然后通过顺序连接具有相同提议的建议对来构建文本行，对进行完成的目标类别判别以及目标边界框回归修正。通过后续的CNN+CTC的方式完成了输入图像的深度特征提取，序列标签概率预测和标签转录功能。并在此基础上增加了Tesseract二次识别，Fast-R-CNN检测部分将识别后的文本行以字符串的形式输入到敏感文字信息检测模块进行文本敏感语义检测。Finally, the suggestion window is merged through the text line construction algorithm, that is, every two similar 8*h small suggestion windows are formed into a pair, and then different pairs are merged until they can no longer be merged, and finally a complete suggestion box is generated. The construction of a text line is very simple. Text lines are constructed as follows. First, we define a paired neighbor (B _j ,B _i ) of B _i for the proposal as B _j ->B _i , when B _j is the closest horizontal distance to B _i , the distance is less than 50 pixels, and their vertical overlap greater than 0.6. Second, if B _j ->B _i and B _i ->B _j , then group the two proposals as a pair. Text lines are then constructed by sequentially concatenating proposal pairs with the same proposal, and the pair performs the completed object class discrimination and object bounding box regression correction. Through the subsequent CNN+CTC method, the deep feature extraction of the input image, the sequence label probability prediction and the label transcription function are completed. On this basis, Tesseract secondary recognition is added, and the Fast-R-CNN detection part inputs the recognized text lines in the form of strings into the sensitive text information detection module for text-sensitive semantic detection.

如图2所示的实施例，敏感文字信息检测模块包括第一级分类器、分词模块和第二级分类器。第一级分类器通过基于多维拓展敏感字库的文字规则过滤引擎方式对输入语句进行敏感词粗筛选。原有的敏感词进行多维拓展，具体包括同义词、同音字、拼音、旁半字等歧义方式以建立新的敏感信息词库，该词库包含反动、色情、暴力三大部分。分词模块的作用是对文本进行分词处理。由于中文文本中没有像西文中的空格分割方式，所以首先需要进行中文分词处理。第二级分类器将完成后训练集进行中文分词处理，然后通过词向量的形式对训练集中的文本进行编码，利用多维向量的方式表征文本的词汇，并对其进行特征提取和模型训练，最后利用训练好的分类模型对粗筛选处理后的短文本进行判断，确认该短文本是否为敏感文字信息文本。In the embodiment shown in FIG. 2 , the sensitive text information detection module includes a first-level classifier, a word segmentation module and a second-level classifier. The first-level classifier performs rough screening of sensitive words in the input sentence by means of a word rule filtering engine based on multi-dimensional expansion of sensitive word database. The original sensitive words are multi-dimensionally expanded, including synonyms, homophones, pinyin, side-by-side characters and other ambiguous methods to establish a new sensitive information thesaurus, which includes reaction, pornography, and violence. The function of the word segmentation module is to perform word segmentation processing on the text. Since there is no space segmentation method in the Chinese text as in the Western text, the Chinese word segmentation process needs to be performed first. The second-level classifier will perform Chinese word segmentation on the completed training set, then encode the text in the training set in the form of word vectors, use multi-dimensional vectors to represent the vocabulary of the text, and perform feature extraction and model training on it. Finally, Use the trained classification model to judge the short text after rough screening to confirm whether the short text is sensitive text information text.

如图3所示的实施例，二级分类器是SVM分类器。SVM是一个由分类超平面定义的判别分类器，也就是说给定一组带标签的训练样本，算法将会输出一个最优超平面对新样本(测试样本)进行分类，找一个超平面，并且它到离他最近的训练样本的距离要最大。即最优分割超平面最大化训练样本边界。In the embodiment shown in Figure 3, the secondary classifier is an SVM classifier. SVM is a discriminative classifier defined by a classification hyperplane, that is to say, given a set of labeled training samples, the algorithm will output an optimal hyperplane to classify new samples (test samples), find a hyperplane, And it has the largest distance to the training sample closest to him. That is, the optimal segmentation hyperplane maximizes the training sample boundary.

支持向量机分类，首先它是分类问题，对应着分类过程的两个重要的步骤，一个是使用训练数据集训练分类器，另一个就是使用测试数据集来评价分类器的分类精度。作为敏感信息类文本分类，基于libsvm实现文本分类实现的实现过程，如下所示：Support vector machine classification, first of all, it is a classification problem, which corresponds to two important steps in the classification process, one is to use the training data set to train the classifier, and the other is to use the test data set to evaluate the classification accuracy of the classifier. As sensitive information text classification, the implementation process of text classification implementation based on libsvm is as follows:

(1)选择文本训练数据集和测试数据集：训练集和测试集都是类标签已知的；(1) Select the text training data set and test data set: both the training set and the test set are known with known class labels;

(2)训练集文本预处理：这里主要包括中文分词、去停用词、建立词向量模型；(2) Training set text preprocessing: This mainly includes Chinese word segmentation, removal of stop words, and establishment of word vector model;

(3)选择文本分类使用的特征向量(词向量)：最终的目标是使得最终选出的特征向量在多个类别之间具有一定的类别区分度，以实现特征向量的分类筛选，由于中文分词后得到大量的词，通过选择降维技术能很好地减少计算量，还能维持分类的精度；(3) Select the feature vector (word vector) used for text classification: The ultimate goal is to make the final selected feature vector have a certain degree of category distinction between multiple categories, so as to realize the classification and screening of the feature vector, due to the Chinese word segmentation After obtaining a large number of words, by selecting dimensionality reduction technology, the amount of calculation can be well reduced, and the classification accuracy can be maintained;

(4)输出libsvm支持的量化情感极性词的训练样本集文件：类别名称、特征向量中每个词元素分别到数字编号的映射转换，以及基于类别和特征向量来量化文本训练集，能够满足使用libsvm训练所需要的数据格式；(4) Output the training sample set file of quantified emotional polarity words supported by libsvm: the mapping conversion of category name, each word element in the feature vector to the number number, and the quantized text training set based on the category and feature vector, which can satisfy The data format required for training with libsvm;

(5)测试数据集预处理：同样包括中文分词(需要和训练过程中使用的分词器一致)、去停用词、建立词向量模型(倒排表)，但是这时需要加载训练过程中生成的特征向量，用特征向量去排除多余的不在特征向量中的词(也称为降维)；(5) Test data set preprocessing: It also includes Chinese word segmentation (which needs to be consistent with the word segmentation device used in the training process), removing stop words, and establishing a word vector model (inverted list), but at this time, it is necessary to load the generated words during the training process. The feature vector of , and use the feature vector to exclude redundant words that are not in the feature vector (also known as dimensionality reduction);

(6)输出libsvm支持的量化的测试样本集文件：输出格式和训练数据集的预处理阶段的输出相同。使用训练集预处理阶段输出的量化的情感极性词数据集文件，最终输出分类模型文件。使用libsvm工具包训练文本分类器，在使用libsvm的开始，需要做一个尺度变换操作，有利于libsvm训练出更好的模型。libsvm使用的训练数据格式都是数字类型的，所以需要对训练集中的文档进行量化处理，我们使用TF-IDF度量，表示词与文档的相关性指标。前面输出的数据中，每一维向量都使用了TF-IDF的值，但是TF-IDF的值可能在一个不规范的范围之内(依赖于TF和IDF的值)，例如0.19872～8.3233，所以可以使用libsvm将所有的值都变换到同一个范围之内，如0～1.0，或者-1.0～1.0，可以根据实际需要选择；(6) Output the quantized test sample set file supported by libsvm: the output format is the same as the output of the preprocessing stage of the training data set. Use the quantified sentiment polarity word dataset file output in the training set preprocessing stage, and finally output the classification model file. Use the libsvm toolkit to train a text classifier. At the beginning of using libsvm, you need to do a scale transformation operation, which is conducive to libsvm to train a better model. The training data formats used by libsvm are all digital, so it is necessary to quantify the documents in the training set. We use the TF-IDF metric to represent the correlation between words and documents. In the previous output data, the value of TF-IDF is used for each dimension vector, but the value of TF-IDF may be in an irregular range (depending on the values of TF and IDF), such as 0.19872 ~ 8.3233, so You can use libsvm to transform all values into the same range, such as 0~1.0, or -1.0~1.0, which can be selected according to actual needs;

(7)使用libsvm验证分类模型的精度：使用测试集预处理阶段输出的量化的数据集文件和分类模型文件来验证分类的精度，选择合适的核函数，设置代价系数c，默认是1，表示在计算线性分类面时，可以容许一个点被分错。这时候，使用交叉验证(CrossValidation)来逐步优化计算，选择最合适的参数；(7) Use libsvm to verify the accuracy of the classification model: use the quantified data set file and classification model file output in the test set preprocessing stage to verify the accuracy of the classification, select the appropriate kernel function, and set the cost coefficient c, the default is 1, indicating When computing a linear classification surface, it is possible to tolerate a point being misclassified. At this time, use cross-validation (CrossValidation) to gradually optimize the calculation and select the most suitable parameters;

(8)分类模型参数寻优：如果经过libsvm训练出来的分类模型精度很差，可以通过libsvm自带的交叉验证功能来继续实现参数的寻优，通过搜索参数取值空间来获取最佳的参数值。经过文本预处理、特征提取、特征表示、归一化处理后，已经把原来的文本信息抽象成一个向量化的样本集，然后把此样本集与训练好的模板文件进行相似度计算，也就是确定待测文本与模板文件比较之后，敏感文字信息具体类别(色情，暴力，反动)的相似度概率。若不属于该类别，则与其他类别的模板文件进行计算，直到分进相应的具体类别；(8) Optimization of classification model parameters: If the accuracy of the classification model trained by libsvm is poor, you can continue to optimize the parameters through the cross-validation function that comes with libsvm, and obtain the best parameters by searching the parameter value space. value. After text preprocessing, feature extraction, feature representation, and normalization, the original text information has been abstracted into a vectorized sample set, and then the similarity between the sample set and the trained template file is calculated, that is, Determine the similarity probability of specific categories (pornography, violence, reaction) of sensitive text information after comparing the text to be tested with the template file. If it does not belong to this category, it will be calculated with the template files of other categories until it is classified into the corresponding specific category;

(9)最后检测结果输出对图片敏感文字及其对应的网站地址等信息的检测报告进行跟踪报警，对于确定含有敏感文字信息的图片进行提示，在相关区域中显示该图片的地址链接、图片名称信息、图片大小等信息。(9) The final detection result output Tracks and alarms the detection report of the sensitive text of the picture and its corresponding website address, etc., prompts the picture that contains sensitive text information, and displays the address link and picture name of the picture in the relevant area. information, image size, etc.

Claims

1. an automatic detection method for network image sensitive text, characterized in that, comprising the following steps:

Step S1, use a web crawler to crawl a website containing pictures; save the basic information of the pictures in the data source database, and collect the pictures in the picture database for subsequent use;

Step S2, obtain the picture from the picture database and use the Faster R-CNN deep network based on the region proposal network to perform text target detection on the picture, and after completion, extract the text information recognized by the picture and convert it into picture text information;

Step S3, using the classifier to detect the sensitive text information of the extracted image text information, including the first-level classifier performing rough screening of sensitive words on the input sentence based on the multi-dimensional expansion of the sensitive word library, and using the rough screened text information. Chinese word segmentation processing, and then through the second-level classifier based on sentiment polarity thesaurus and SVM classifier, the deep-level sensitive information is finely screened, and the automatic detection of sensitive text information in network pictures is completed.

2 . The automatic detection method for sensitive text in network pictures according to claim 1 , wherein the basic information of the picture includes the link of the picture, the size of the picture, and the name of the picture. 3 .

3. The method for automatic detection of sensitive text in network pictures according to claim 1, characterized in that: the process of carrying out text target detection to the picture described in step S2 comprises that the shared convolution layer of the regional suggestion network is subjected to maximum pooling sampling reduction and The deconvolution operation is enlarged, and then the feature map output by the feature mapping layer of the candidate region generation network is averagely pooled to generate a fixed-size target candidate region. The region pooling layer of the candidate region optimization network generates the target output of the network according to the candidate region. Candidate region, perform regional pooling on the feature map output by the feature mapping layer of the candidate region generation network, and generate regional features of a fixed size;

According to the softmax layer, the classification probability of whether each target candidate region contains the target or the background is output, and only the target candidate region whose probability is greater than the preset threshold is output. The target classification and regression network extracts the generated shared feature map according to the optimized target candidate region. Region features, the final target text category discrimination and target bounding box regression correction.

4. according to the described automatic detection method of the network picture sensitive text of claim 1, it is characterized in that: described in the step S3, the multi-dimensional expansion of the sensitive word library specifically includes synonyms, homophones, pinyin, and side half-words by the multi-dimensional expansion of original sensitive words. A new sensitive information thesaurus, the thesaurus includes reaction, pornography and violence.

5. according to the described automatic detection method of the network picture sensitive text of claim 4, it is characterized in that: the sensitive information fine screening described in step S3, the emotion polar word is added in the data set of the existing sensitive information short text, combine emotion Tendency judgment, labeling text information, and using the SVM model to train a dataset containing short texts with sensitive information of sentiment polarity words.

6. according to the described automatic detection method of network picture sensitive text of claim 5, it is characterized in that: SVM classifier described in step S3, the training set is carried out Chinese word segmentation processing, then by the form of word vector, the text in the training set is encoded , using multi-dimensional vectors to represent the vocabulary of the text, and perform feature extraction and model training on it, and finally use the trained classification model to judge the short text after rough screening to confirm whether the short text is sensitive text information text.

7. according to the described network picture sensitive text automatic detection method of claim 1-6, it is characterized in that: also comprise for determining that the picture that contains sensitive text information carries out tracking alarm, displays the address link, picture name information, picture size information of this picture .