CN111460247A - Automatic detection method for network picture sensitive characters - Google Patents
Automatic detection method for network picture sensitive characters
- Publication number
- CN111460247A (application CN201910053775.8A)
- Authority
- CN
- China
- Prior art keywords
- sensitive
- picture
- information
- text
- network
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/60—Type of objects
- G06V20/62—Text, e.g. of license plates, overlay texts or captions on TV images
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V30/00—Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
- G06V30/10—Character recognition
- G06V30/14—Image acquisition
- G06V30/148—Segmentation of character regions
- G06V30/153—Segmentation of character regions using recognition of characters or words
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V30/00—Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
- G06V30/10—Character recognition
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Multimedia (AREA)
- General Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Computational Linguistics (AREA)
- Data Mining & Analysis (AREA)
- Evolutionary Computation (AREA)
- Biomedical Technology (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- General Engineering & Computer Science (AREA)
- Biophysics (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Life Sciences & Earth Sciences (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Health & Medical Sciences (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- Character Discrimination (AREA)
Abstract
The invention discloses an automatic detection method for sensitive text in network pictures. Websites containing pictures to be detected are crawled and downloaded, and the pictures are collected into a database through online crawling and offline loading. Pictures are then fetched from the picture database and subjected to target detection (text region localization and picture text recognition) followed by sensitive text detection. A Faster R-CNN deep network architecture based on a region proposal network (RPN) is used, and a two-stage sensitive-text classifier is adopted in the sensitive text detection stage. The first-stage classifier coarse-screens input sentences for sensitive words against a multi-dimensionally expanded sensitive-word lexicon. The second-stage filter performs deep semantic fine screening by combining a sentiment-polarity lexicon with an SVM classifier, confirming whether the text information is sensitive. The method effectively automates the detection of sensitive text in pictures, with high detection efficiency and low system response latency.
Description
Technical Field
The invention relates to digital image processing and deep learning algorithms, belongs to the fields of machine vision and natural language processing, and particularly relates to an automatic detection method for sensitive text in network pictures.
Background
With the progress of science and technology, China's internet industry has entered a stage of rapid development. Live-streaming platforms such as Douyu and Huya have emerged, and online social platforms such as WeChat, Weibo and QQ are continuously updated and improved; both kinds of platform have huge and highly active user bases and are especially popular with young users. The massive flow of information makes diversified data easy to obtain on the network, but that data is often laced with large amounts of sensitive information. Filtering sensitive words in conventional plain text is a relatively mature technology, whereas monitoring sensitive information embedded in images is much harder, which makes the spread of sensitive images more covert. To evade supervision by government monitoring departments, many organizations and individuals embed text in images to disseminate sensitive information, including pornographic, anti-social and violent content; this has become one of the main channels for spreading such information. According to related surveys, more than 10% of websites contain content related to sensitive information. Many lawbreakers also spread sensitive information through user avatars on Tencent QQ, WeChat and live-streaming platforms. Such pornography-saturated images not only harm the physical and mental health of teenagers, but may also carry reactionary or violent content that threatens social stability. The data sharing, interconnection and open-resource characteristics of the network are the root causes that make it attractive to lawless persons and organizations for spreading sensitive information. Sensitive text in pictures has the following main characteristics:
(1) Sensitive information takes widely varying forms of expression
Sensitive information spans a very wide range, covering ideological and political issues, social issues, cultural issues and many other aspects. Its forms of expression differ greatly across subjects, and the degree of sensitivity of the same expression varies with the occasion and cultural background. Words such as "bloodbath" or "kill" usually denote a decisive victory in sports-themed text, yet in other subjects they are very likely markers of sensitive information.
(2) Text recognition that deviates from the original wording easily causes obvious ambiguity
Knowing that sensitive text may constitute illegal content, lawless persons deliberately use evasive devices, such as synonyms, homophones, pinyin, or splitting a character into its left and right components, when composing sensitive-text pictures. This adds difficulty to text recognition.
Pictures on the network take many forms, and different pictures differ in text size, text color, relative text position, font and so on. Before the text in a picture can be recognized, the text region it contains must be located; accurate localization of picture text regions is the basis of the subsequent recognition work. Traditional methods of text region localization include methods based on image connected components, methods based on image texture features, and methods based on image edge features. With the development of machine learning, the performance of learning-based target detection on image features has improved greatly in recent years, and schemes based on deep learning have been especially effective: in 2014 Girshick proposed R-CNN (Regions with CNN features), which was followed by the faster and more accurate Fast R-CNN and, with the introduction of a region proposal network, Faster R-CNN.
In summary, although research on recognizing sensitive information in pictures has produced many results, limitations and disadvantages remain. Sensitive-information recognition based on picture text content suffers, much like text in natural-scene pictures, from difficult text region localization, low text recognition accuracy, and the difficulty of judging sensitive information in short texts. Picture sensitive-information recognition is currently a focus of attention and research, and a problem that network supervision departments urgently need improved technical means to solve.
Disclosure of Invention
The invention aims to overcome the defects of existing techniques for detecting sensitive text in pictures. It mainly studies an improved Faster R-CNN that raises the detection rate on small target areas; the improved algorithm detects and recognizes network picture text more effectively and with higher accuracy. For the method research on multi-stage classification of short-text sensitive information, the sensitive-word lexicon is expanded on the original basis and the sensitive-word classifier is improved. A general block diagram of the invention is shown in FIG. 1.
Traditional detection of sensitive text in pictures has long relied on manual supervision and takedowns; manual reporting typically takes hours, during which sensitive information can spread widely between publication and report. Sensitive information in picture-text form is drifting to the edge of supervision, deeply affecting the health of the internet environment and the physical and mental health of its many users. The invention realizes automatic detection of sensitive text in network pictures based on deep learning and machine learning algorithms.
In view of the above, the technical scheme adopted by the invention is as follows. The automatic detection method for network picture sensitive characters comprises the following steps:
step S1, using a web crawler to capture pictures from websites containing pictures; storing the basic information of the pictures in a data-source database, and at the same time collecting the pictures into a picture database for subsequent use;
step S2, acquiring pictures from the picture database, performing text target detection on the pictures with a Faster R-CNN deep network based on a region proposal network and, once detection is finished, extracting the recognized text and converting it into picture text information;
and step S3, detecting the extracted picture text information with classifiers: a first-stage classifier coarse-screens input sentences for sensitive words against a multi-dimensionally expanded sensitive-word lexicon; the coarsely screened text is processed by Chinese word segmentation; a second-stage classifier, combining a sentiment-polarity lexicon with an SVM classifier, then performs deep fine screening of sensitive information, completing the automatic detection of sensitive text in network pictures.
Further, the basic information of the picture includes a link of the picture, a size of the picture, and a name of the picture.
The picture text target detection in step S2 proceeds as follows: the shared convolutional layers of the region proposal network are reduced by max-pooling down-sampling and enlarged by deconvolution; the feature maps output by the feature mapping layer of the candidate-region generation network are then average-pooled to produce target candidate regions of fixed size; the region pooling layer of the candidate-region optimization network takes the target candidate regions output by the candidate-region generation network, applies region pooling to the feature maps from the feature mapping layer, and produces region features of fixed size;
and the softmax layer outputs, for each target candidate region, the classification probability of containing a target versus background; only candidate regions whose probability exceeds a preset threshold are output, eliminating most invalid candidates and yielding optimized target candidate regions. The target classification regression network then extracts region features from the generated shared feature map according to the optimized candidates, and performs the final discrimination of target text type and regression correction of the target bounding box.
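For illustration, the probability-threshold filtering described here might look like the following minimal PyTorch sketch; the framework, the two-class score layout and the 0.7 threshold are assumptions, since the text only specifies a preset threshold:

```python
import torch
import torch.nn.functional as F

def filter_candidates(cls_logits, boxes, threshold=0.7):
    """Keep only candidate regions whose foreground (text) probability
    exceeds a preset threshold; 0.7 is an illustrative value, not one
    specified in the patent."""
    probs = F.softmax(cls_logits, dim=1)   # [N, 2]: background / text scores
    keep = probs[:, 1] > threshold         # column 1 = foreground probability
    return boxes[keep], probs[keep, 1]

# Example: 4 candidate boxes with raw two-class scores
logits = torch.tensor([[2.0, 0.1], [0.2, 3.1], [1.5, 1.4], [0.0, 2.5]])
boxes = torch.tensor([[0, 0, 50, 20], [10, 5, 80, 25],
                      [5, 5, 30, 15], [20, 10, 90, 30]])
kept_boxes, kept_scores = filter_candidates(logits, boxes)
```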
In the fine screening of sensitive information in step S3, sentiment-polarity words are added to the existing dataset of sensitive-information short texts, judgments are combined with sentiment tendency, the text information is labeled, and an SVM model is trained on the dataset of sensitive-information short texts containing the sentiment-polarity words.
The SVM classifier applies Chinese word segmentation to the training set, encodes the training-set texts as word vectors, represents the vocabulary of each text as multi-dimensional vectors, performs feature extraction and model training on them, and finally judges the coarsely screened short texts with the trained classification model to decide whether they are sensitive-text items. The vector parameters are optimized with the cross-validation function of libsvm, searching the parameter value space for the optimal values. After text preprocessing, feature extraction, feature representation and normalization, the original text information is abstracted into a vectorized sample set; similarity between this sample set and a trained template file is computed to further confirm whether the short text is a sensitive-text item.
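For illustration, the cross-validated parameter search might be sketched as follows with scikit-learn, whose SVC is built on libsvm, standing in for calling libsvm's cross-validation directly; the grid values and variable names are illustrative assumptions:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Word-vector encoding of segmented short texts followed by an SVM,
# mirroring the training flow described above.
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),   # texts are space-joined tokens after segmentation
    ("svm", SVC(kernel="rbf")),     # SVC wraps libsvm internally
])

# Cross-validated search of the (C, gamma) parameter space, playing the
# role of libsvm's cross-validation function; grid values are illustrative.
param_grid = {"svm__C": [0.1, 1, 10, 100], "svm__gamma": ["scale", 0.01, 0.1]}
search = GridSearchCV(pipeline, param_grid, cv=5)
# search.fit(train_texts, train_labels)   # train_texts/train_labels: labelled corpus
# best_model = search.best_estimator_
```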
Pictures determined to contain sensitive text information are tracked and trigger an alarm, and the picture's address link, name information and size information are displayed.
The invention finally realizes automatic detection of sensitive text in pictures, greatly reduces system response latency compared with traditional methods, and improves the accuracy with which the system detects sensitive information in pictures. The improvement is especially marked for small-area text in pictures, tilted text, ambiguous text, and recognition and detection under more complex sensitive semantics.
Drawings
FIG. 1 is a flow chart of picture text target detection according to the present invention;
FIG. 2 is a flow chart of sensitive text information detection according to the present invention;
FIG. 3 is a diagram of a second stage classifier network.
Detailed Description
The method comprises two parts: target detection on the picture (text region localization and text recognition) and sensitive text detection. In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments are described below clearly and completely with reference to the drawings; the present invention is not limited thereto.
The network picture crawling module uses a web crawler to capture pictures from a specific website containing pictures, stores the basic information of the pictures in a data-source database, and collects the pictures into a picture database for subsequent use. Pictures in the picture library are appropriately classified by hand for later inspection and supervision. First, a picture acquisition rule for capturing website content is set; a prior-art web crawler then follows the link addresses of web pages, looping until all pictures on the website's pages are captured. In a specific implementation, content that need not be acquired (such as non-picture content) can be skipped through a preset acquisition rule, reducing the capture workload. The acquisition rule used in the method runs once every 5 minutes, and the acquisition depth covers the first page of the website under inspection and the first and second layers of pages linked from it, with basic information stored in a text-format detection report. It is conceivable that the acquisition period can be set longer or shorter as needed; crawled pictures are stored in the picture database according to the actually required site depth, and other data information goes to the data-source database.
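A minimal sketch of such a crawling module, assuming the third-party requests and BeautifulSoup libraries; the seed URL is hypothetical and a production crawler would add site-scoping and database storage:

```python
import time
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

SEED_URL = "https://example.com/"   # hypothetical website under inspection

def crawl_images(page_url, visited, depth=2):
    """Collect basic picture information (link, name, declared size) from a
    page, following in-page links down to `depth` layers, mirroring the
    first-page / first-layer / second-layer scheme described above."""
    if depth < 0 or page_url in visited:
        return []
    visited.add(page_url)
    try:
        resp = requests.get(page_url, timeout=10)
    except requests.RequestException:
        return []
    soup = BeautifulSoup(resp.text, "html.parser")
    records = [{"link": urljoin(page_url, img["src"]),
                "name": img.get("alt", ""),
                "size": (img.get("width"), img.get("height"))}
               for img in soup.find_all("img") if img.get("src")]
    # A real crawler would restrict followed links to the same site; omitted here.
    for a in soup.find_all("a", href=True):
        records += crawl_images(urljoin(page_url, a["href"]), visited, depth - 1)
    return records

if __name__ == "__main__":
    while True:                     # the 5-minute acquisition rule from the text
        records = crawl_images(SEED_URL, set())
        # store `records` in the data-source database; download the pictures
        time.sleep(300)
```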
As in the embodiment shown in FIG. 1, the picture text target detection module in turn comprises a region candidate network extraction section (for the spatial features of text images) and a Fast R-CNN detection section. The picture text target detection module mainly comprises the following specific steps:
(1) divide the dataset reasonably; adopt a standard dataset, standardize it and unify the input dimensions to accelerate training;
(2) fuse convolution modules of different layers so that multi-layer features, both high-level abstract features and low-level detail features, can be extracted. The feature map of the first layer is reduced by max-pooling down-sampling and the map of the last layer is enlarged by deconvolution (deconv) up-sampling, after which the convolution outputs of the first 5 layers are connected; connecting layers 1, 3 and 5 works better, because the intervening layers mean the connected features are less correlated with one another. Finally, local response normalization (LRN) normalizes the multiple feature maps; without normalization, large features would suppress small ones. The feature maps are then combined into a single output cube, called the cube feature source, and a deconvolution layer is added at the last layer;
(3) operate parallel convolutions: in the second convolution module, 5 × 5 and 7 × 7 convolutions run in parallel, and the different features extracted by the differently sized kernels are separately extracted and fused;
(4) introduce a cross convolution kernel: the square kernel is converted into an asymmetric convolution structure, a 5 × 5 kernel becoming 5 × 1 and 1 × 5 kernels, with max-pooling down-sampling and deconvolution up-sampling (an illustrative sketch of steps (3) and (4) follows this step list);
(5) the input of the fixed candidate-region pooling layer is composed of feature maps from several convolution layers of different depths; the three depth feature maps serve jointly as input to the candidate-region pooling layer, whose purpose is to convert candidate boxes of different sizes into output feature maps of fixed size for the next step. The feature maps output by the feature mapping layer of the candidate-region generation network are then average-pooled to generate region features of fixed size;
(6) the candidate-region optimization network is randomly initialized from a zero-mean Gaussian distribution with standard deviation a. Training data are generated with a trained Faster R-CNN network, and the candidate-region optimization network is trained independently: training pictures from the training set are fed into the network, and the target candidate regions output by the candidate-region generation network serve as the optimization network's training data. Target candidate regions whose intersection-over-union (IoU) with any ground-truth box exceeds a preset threshold are taken as positive samples, and those below the threshold as negative samples;
(7) non-maximum suppression is used to retain 100 high-scoring proposal windows, which basically cover all the text regions; keeping too many regions would make the proposal windows overlap and add useless computation. Side-refinement correction is then performed, predicting the exact position of each side along the x-axis from a position offset. The formula is as follows:
$$o = \frac{x_{side} - c_x^{a}}{w^{a}}, \qquad o^{*} = \frac{x_{side}^{*} - c_x^{a}}{w^{a}}, \qquad w^{a} = 16$$

where $x_{side}$ is the predicted x-coordinate of the horizontal side closest to the current anchor, and $x_{side}^{*}$ is the actual side coordinate on the x-axis, pre-computed from the actual bounding box and anchor location. $c_x^{a}$ is the x-axis center of the anchor, and $w^{a}$ is the fixed anchor width, $w^{a} = 16$. $o$ and $o^{*}$ denote the predicted and actual offsets respectively. The final text-line bounding box is optimized using the offsets of the side proposals.
We use multi-task learning to jointly optimize the model parameters. Three loss functions are introduced depending on the source of the output data, and according to the minimum-loss rule the total objective function $L$ of the image is minimized:

$$L(s_i, v_j, o_k) = \frac{1}{N_s}\sum_i L_s^{cl}(s_i, s_i^{*}) + \frac{\lambda_1}{N_v}\sum_j L_v^{re}(v_j, v_j^{*}) + \frac{\lambda_2}{N_o}\sum_k L_o^{re}(o_k, o_k^{*})$$

where each anchor is a training sample and $i$ is the index of an anchor in a mini-batch; $s_i$ is the predicted probability that anchor $i$ is actual text, with $s_i^{*}$ its ground-truth label. $j$ indexes the anchors used for vertical coordinate regression, with $v_j$ and $v_j^{*}$ the predicted and actual vertical coordinates. $k$ is the index of a side anchor, defined over the set of anchors within a horizontal distance (e.g., 8 pixels) to the left or right of the actual text-line bounding box; $o_k$ and $o_k^{*}$ are the predicted and actual x-axis offsets associated with the $k$-th anchor. $L_s^{cl}$ is the classification loss, for which we use the Softmax loss to distinguish text from non-text; $L_v^{re}$ and $L_o^{re}$ are regression losses. $N_s$, $N_v$ and $N_o$ are normalization parameters, each the total number of anchors used by the corresponding term, and $\lambda_1$, $\lambda_2$ are loss weights.
Finally, the proposal windows are merged by a text-line construction algorithm: neighbouring small proposal windows (of size 8 × h) are joined two-by-two into pairs, and different pairs are merged until no further merging is possible, producing the complete proposal boxes. The construction of a text line is very simple and proceeds as follows. First, we define for proposal $B_i$ a paired neighbour $(B_j, B_i)$, written $B_j \to B_i$, when $B_j$ is the proposal closest to $B_i$ at a horizontal distance of less than 50 pixels and their vertical overlap is greater than 0.6. Second, if both $B_j \to B_i$ and $B_i \to B_j$ hold, the two proposals are grouped into a pair. A text line is then constructed by sequentially connecting pairs that share a proposal, after which the final object class judgment and object bounding-box regression correction are carried out. Depth feature extraction from the input image, sequence label probability prediction and label transcription are completed by a subsequent CNN + CTC stage. Tesseract is added on top of this for secondary recognition, and the Fast R-CNN detection section passes the recognized text lines, as character strings, to the sensitive text detection module for sensitive-semantics detection.
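As an illustration of the parallel and asymmetric convolutions of steps (3) and (4) above, the following is a minimal PyTorch sketch; the framework, channel counts and concatenation-based fusion rule are assumptions not fixed by the text:

```python
import torch
import torch.nn as nn

class ParallelConvBlock(nn.Module):
    """Step (3): parallel 5x5 and 7x7 branches whose outputs are fused by
    channel concatenation (the fusion rule is assumed, not given in the text)."""
    def __init__(self, in_ch=64, out_ch=64):
        super().__init__()
        self.branch5 = nn.Conv2d(in_ch, out_ch, kernel_size=5, padding=2)
        self.branch7 = nn.Conv2d(in_ch, out_ch, kernel_size=7, padding=3)

    def forward(self, x):
        return torch.cat([self.branch5(x), self.branch7(x)], dim=1)

class CrossConvBlock(nn.Module):
    """Step (4): a 5x5 convolution factored into asymmetric 5x1 and 1x5
    kernels; channel counts are hypothetical."""
    def __init__(self, in_ch=64, out_ch=64):
        super().__init__()
        self.conv5x1 = nn.Conv2d(in_ch, out_ch, kernel_size=(5, 1), padding=(2, 0))
        self.conv1x5 = nn.Conv2d(out_ch, out_ch, kernel_size=(1, 5), padding=(0, 2))

    def forward(self, x):
        return self.conv1x5(self.conv5x1(x))

x = torch.randn(1, 64, 56, 56)           # dummy feature map
print(ParallelConvBlock()(x).shape)      # torch.Size([1, 128, 56, 56])
print(CrossConvBlock()(x).shape)         # torch.Size([1, 64, 56, 56])
```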
In the embodiment shown in FIG. 2, the sensitive text detection module comprises a first-stage classifier, a word segmentation module and a second-stage classifier. The first-stage classifier coarse-screens input sentences for sensitive words with a word-rule filtering engine built on a multi-dimensionally expanded sensitive-word lexicon. The original sensitive words are expanded along several dimensions, specifically covering ambiguous variants such as synonyms, homophones, pinyin and split character components, to establish a new sensitive-word lexicon comprising three major parts: reactionary, pornographic and violent content. The word segmentation module segments the text into words; since Chinese text has no space-delimited word boundaries as Western text does, Chinese word segmentation must come first. The second-stage classifier applies Chinese word segmentation to the prepared training set, encodes the training-set texts as word vectors, represents the vocabulary of each text as multi-dimensional vectors, performs feature extraction and model training, and judges the coarsely screened short texts with the trained classification model to decide whether they are sensitive-text items. A sketch of the first-stage coarse screen appears below.
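A minimal sketch of the word-rule filtering engine, assuming the third-party jieba segmenter and pypinyin converter; the lexicon entries are placeholders, since the real multi-dimensional lexicon is not reproduced here:

```python
import jieba                      # widely used Chinese word-segmentation library
from pypinyin import lazy_pinyin  # pinyin conversion, to catch pinyin-substituted variants

# Hypothetical expanded lexicon: placeholder base words plus their pinyin
# renderings, standing in for the synonym/homophone/pinyin/split-character
# dimensions described above.
BASE_WORDS = ["敏感词甲", "敏感词乙"]   # placeholders, not a real lexicon
LEXICON = set(BASE_WORDS) | {"".join(lazy_pinyin(w)) for w in BASE_WORDS}

def coarse_screen(sentence: str) -> bool:
    """First-stage word-rule filter: flag the sentence if any segmented
    token, or its pinyin rendering, hits the expanded lexicon."""
    for token in jieba.lcut(sentence):
        if token in LEXICON or "".join(lazy_pinyin(token)) in LEXICON:
            return True   # hand the sentence to the second-stage fine screen
    return False
```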
As in the embodiment shown in FIG. 3, the second-stage classifier is an SVM classifier. An SVM is a discriminative classifier defined by a separating hyperplane: given a set of labeled training samples, the algorithm outputs an optimal hyperplane that classifies new (test) samples, namely the hyperplane whose distance to the nearest training sample is greatest. In other words, the optimal separating hyperplane maximizes the margin of the training samples.
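Stated as the standard maximum-margin optimization (a textbook formulation included here for reference, not a formula given in the original text):

$$\min_{\mathbf{w},\,b}\ \frac{1}{2}\lVert\mathbf{w}\rVert^{2} \quad \text{subject to} \quad y_{i}\left(\mathbf{w}^{\top}\mathbf{x}_{i}+b\right) \ge 1, \qquad i = 1,\dots,n$$

The constraint forces every training sample onto the correct side of the hyperplane with functional margin at least 1, so minimizing $\lVert\mathbf{w}\rVert$ maximizes the geometric margin $2/\lVert\mathbf{w}\rVert$ between the two classes.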
Support vector machine classification corresponds to two important steps of the classification process: one is to train a classifier on the training dataset, and the other is to evaluate the classifier's accuracy on the test dataset. For sensitive-information text classification, the implementation based on libsvm is as follows:
(1) select the text training dataset and test dataset: the class labels of both the training set and the test set are known;
(2) preprocess the training-set text: mainly Chinese word segmentation, stop-word removal, and building the word vector model;
(3) select the feature vectors (word vectors) used for text classification: the final aim is that the selected feature vectors discriminate well among the categories, so that classification and screening can be realized; since Chinese word segmentation yields a large number of words, a dimension-reduction technique greatly cuts the amount of computation while maintaining classification accuracy;
(4) output a libsvm-supported training sample file of quantized sentiment-polarity words: the category names and each token of the feature vector are mapped to numbers, and the text training set is quantized by category and feature vector to meet the data format required by libsvm training;
(5) preprocess the test dataset: likewise Chinese word segmentation (which must use the same segmenter as in training), stop-word removal, and building the word vector model (inverted list); this time, however, the feature vectors generated during training must be loaded, and redundant words absent from the feature vectors are eliminated with them (also a form of dimension reduction);
(6) output a libsvm-supported quantized test sample file: the output format is the same as that of the training-set preprocessing stage. Finally, a classification model file is output from the quantized sentiment-polarity word dataset produced in the training-set preprocessing stage. The libsvm toolkit is used to train the text classifier; a scaling operation is needed at the start of using libsvm, which helps it train a better model. The training data format used by libsvm is numeric, so the documents in the training set must be quantized, and the TF-IDF measure is used to represent the relevance of a word to a document. In the data output above, each vector dimension holds a TF-IDF value, but TF-IDF values may fall in an irregular range (depending on the TF and IDF values), for example 0.19872 to 8.3233; libsvm can therefore transform all values into a common range, for example 0 to 1.0 or -1.0 to 1.0, chosen according to actual needs (a sketch of this quantization-and-scaling pipeline follows this list);
(7) verify the accuracy of the classification model with libsvm: accuracy is verified with the quantized dataset file and the classification model file output in the test-set preprocessing stage; a suitable kernel function is selected and the cost coefficient c is set, with default 1, which expresses how readily a point may be misclassified when computing the linear classification surface. Cross-validation is then used to optimize step by step and select the most suitable parameters;
(8) optimize the classification model parameters: if the model trained by libsvm has poor accuracy, parameter optimization can continue through libsvm's built-in cross-validation function, searching the parameter value space for the optimal values. After text preprocessing, feature extraction, feature representation and normalization, the original text information is abstracted into a vectorized sample set; similarity between this sample set and a trained template file is then computed, i.e., the similarity probability of the text under test against the template file of each specific category of sensitive text (pornographic, violent, reactionary) is determined. If it does not match, it is compared with the template files of the other categories until it is assigned to the corresponding specific category;
(9) finally, according to the detection result, a detection report of the picture's sensitive text, together with the corresponding website address and related information, is output and a tracking alarm raised; pictures determined to contain sensitive text information are flagged, and their address links, name information and size information are displayed in the relevant area.
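The quantization, dimension-reduction and scaling flow of steps (3) through (8) might look like the following scikit-learn sketch, with scikit-learn (whose SVC wraps libsvm) standing in for the libsvm command-line tools; the toy corpus, the k value and the parameters are placeholders:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.preprocessing import MaxAbsScaler
from sklearn.svm import SVC

# Placeholder segmented corpora (tokens joined by spaces after Chinese
# word segmentation); real data would come from steps (2) and (5).
train_docs, train_y = ["w1 w2 w3", "w2 w4"], [1, 0]
test_docs = ["w1 w4"]

vectorizer = TfidfVectorizer()             # steps (4)/(6): TF-IDF quantization
X_train = vectorizer.fit_transform(train_docs)
X_test = vectorizer.transform(test_docs)   # reuse the training vocabulary (step 5)

selector = SelectKBest(chi2, k=3)          # dimension reduction (step 3)
X_train = selector.fit_transform(X_train, train_y)
X_test = selector.transform(X_test)

scaler = MaxAbsScaler()                    # step (6): scale features into [0, 1]
X_train = scaler.fit_transform(X_train)    # (libsvm's svm-scale plays this role)
X_test = scaler.transform(X_test)

clf = SVC(C=1.0)                           # cost coefficient c, default 1 (step 8)
clf.fit(X_train, train_y)
print(clf.predict(X_test))
```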
Claims (7)
1. An automatic detection method for network picture sensitive characters, characterized by comprising the following steps:
step S1, using a web crawler to capture pictures from websites containing pictures; storing the basic information of the pictures in a data-source database, and at the same time collecting the pictures into a picture database for subsequent use;
step S2, acquiring pictures from the picture database, performing text target detection on the pictures with a Faster R-CNN deep network based on a region proposal network and, once detection is finished, extracting the recognized text and converting it into picture text information;
and step S3, detecting the extracted picture text information with classifiers: a first-stage classifier coarse-screens input sentences for sensitive words against a multi-dimensionally expanded sensitive-word lexicon; the coarsely screened text is processed by Chinese word segmentation; a second-stage classifier, combining a sentiment-polarity lexicon with an SVM classifier, then performs deep fine screening of sensitive information, completing the automatic detection of sensitive text in network pictures.
2. The method of claim 1, characterized in that the basic information of the picture comprises the link of the picture, the size of the picture and the name of the picture.
3. The method of claim 1, characterized in that the picture text target detection in step S2 comprises: reducing the shared convolutional layers of the region proposal network by max-pooling down-sampling and enlarging them by deconvolution; average-pooling the feature maps output by the feature mapping layer of the candidate-region generation network to produce target candidate regions of fixed size; and, in the region pooling layer of the candidate-region optimization network, taking the target candidate regions output by the candidate-region generation network, applying region pooling to the feature maps from the feature mapping layer, and producing region features of fixed size;
and the softmax layer outputting, for each target candidate region, the classification probability of containing a target versus background, with only candidate regions whose probability exceeds a preset threshold being output; the target classification regression network then extracts region features from the generated shared feature map according to the optimized target candidate regions, and performs the final target text type judgment and target bounding-box regression correction.
4. The method of claim 1, characterized in that in step S3 the multi-dimensionally expanded sensitive-word lexicon is established by expanding the original sensitive words along several dimensions, specifically including synonyms, homophones, pinyin and split character components, the lexicon comprising three major parts: reactionary, pornographic and violent content.
5. The method of claim 4, characterized in that in the fine screening of sensitive information in step S3, sentiment-polarity words are added to the existing dataset of sensitive-information short texts, judgments are combined with sentiment tendency, text information is labeled, and an SVM model is trained on the dataset of sensitive-information short texts containing the sentiment-polarity words.
6. The method of claim 5, characterized in that the SVM classifier in step S3 applies Chinese word segmentation to the training set, encodes the training-set texts as word vectors, represents the vocabulary of each text as multi-dimensional vectors, performs feature extraction and model training on them, and finally judges the coarsely screened short texts with the trained classification model to determine whether they are sensitive-text items.
7. The method for automatically detecting network picture sensitive characters according to any one of claims 1 to 6, characterized in that pictures determined to contain sensitive text information are tracked with an alarm, and the address link, name information and size information of the picture are displayed.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910053775.8A CN111460247B (en) | 2019-01-21 | 2019-01-21 | Automatic detection method for network picture sensitive characters |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910053775.8A CN111460247B (en) | 2019-01-21 | 2019-01-21 | Automatic detection method for network picture sensitive characters |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111460247A true CN111460247A (en) | 2020-07-28 |
CN111460247B CN111460247B (en) | 2022-07-01 |
Family
ID=71679084
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910053775.8A Active CN111460247B (en) | 2019-01-21 | 2019-01-21 | Automatic detection method for network picture sensitive characters |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111460247B (en) |
Patent Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2018166114A1 (en) * | 2017-03-13 | 2018-09-20 | 平安科技(深圳)有限公司 | Picture identification method and system, electronic device, and medium |
CN107229929A (en) * | 2017-04-12 | 2017-10-03 | 西安电子科技大学 | A kind of license plate locating method based on R CNN |
CN107977671A (en) * | 2017-10-27 | 2018-05-01 | 浙江工业大学 | A kind of tongue picture sorting technique based on multitask convolutional neural networks |
CN108537329A (en) * | 2018-04-18 | 2018-09-14 | 中国科学院计算技术研究所 | A kind of method and apparatus carrying out operation using Volume R-CNN neural networks |
CN109117836A (en) * | 2018-07-05 | 2019-01-01 | 中国科学院信息工程研究所 | Text detection localization method and device under a kind of natural scene based on focal loss function |
CN108984530A (en) * | 2018-07-23 | 2018-12-11 | 北京信息科技大学 | A kind of detection method and detection system of network sensitive content |
Non-Patent Citations (2)
Title |
---|
REN SHAOQING et al.: "Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks", IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 06, 1 June 2017 (2017-06-01), pages 1137-1149, XP055705510, DOI: 10.1109/TPAMI.2016.2577031 *
TAO XINMIN et al.: "Imbalanced support vector machine based on sample-characteristic under-sampling" (基于样本特性欠取样的不均衡支持向量机), Control and Decision (控制与决策), vol. 28, no. 07, 18 April 2014 (2014-04-18), pages 978-984 *
Cited By (19)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114168771A (en) * | 2020-09-11 | 2022-03-11 | 北京搜狗科技发展有限公司 | Method and related device for constructing map matching library |
CN112560858A (en) * | 2020-10-13 | 2021-03-26 | 国家计算机网络与信息安全管理中心 | Character and picture detection and rapid matching method combining lightweight network and personalized feature extraction |
CN112307770A (en) * | 2020-10-13 | 2021-02-02 | 深圳前海微众银行股份有限公司 | Sensitive information detection method and device, electronic equipment and storage medium |
CN112560858B (en) * | 2020-10-13 | 2023-04-07 | 国家计算机网络与信息安全管理中心 | Character and picture detection and rapid matching method combining lightweight network and personalized feature extraction |
CN112417194A (en) * | 2020-11-20 | 2021-02-26 | 济南浪潮高新科技投资发展有限公司 | Multi-mode detection method for malicious graphics context |
CN113762237B (en) * | 2021-04-26 | 2023-08-18 | 腾讯科技(深圳)有限公司 | Text image processing method, device, equipment and storage medium |
CN113762237A (en) * | 2021-04-26 | 2021-12-07 | 腾讯科技(深圳)有限公司 | Text image processing method, device and equipment and storage medium |
CN113177409A (en) * | 2021-05-06 | 2021-07-27 | 上海慧洲信息技术有限公司 | Intelligent sensitive word recognition system |
CN113177409B (en) * | 2021-05-06 | 2024-05-31 | 上海慧洲信息技术有限公司 | Intelligent sensitive word recognition system |
CN113220533A (en) * | 2021-05-21 | 2021-08-06 | 南京诺迈特网络科技有限公司 | Network public opinion monitoring method and system |
CN113220533B (en) * | 2021-05-21 | 2024-05-31 | 南京诺迈特网络科技有限公司 | Network public opinion monitoring method and system |
CN113221906A (en) * | 2021-05-27 | 2021-08-06 | 江苏奥易克斯汽车电子科技股份有限公司 | Image sensitive character detection method and device based on deep learning |
CN113313693B (en) * | 2021-06-04 | 2023-07-18 | 北博(厦门)智能科技有限公司 | Picture violation detection method and terminal based on neural network algorithm |
CN113313693A (en) * | 2021-06-04 | 2021-08-27 | 北博(厦门)智能科技有限公司 | Image violation detection method and terminal based on neural network algorithm |
CN113676465A (en) * | 2021-08-10 | 2021-11-19 | 杭州民润科技有限公司 | Image filtering method, memory and processor for industrial enterprise network |
CN113676465B (en) * | 2021-08-10 | 2024-02-27 | 杭州民润科技有限公司 | Industrial enterprise network-oriented image filtering method, memory and processor |
CN114092743A (en) * | 2021-11-24 | 2022-02-25 | 开普云信息科技股份有限公司 | Compliance detection method and device for sensitive picture, storage medium and equipment |
CN114092743B (en) * | 2021-11-24 | 2022-07-26 | 开普云信息科技股份有限公司 | Compliance detection method and device for sensitive picture, storage medium and equipment |
CN114117533A (en) * | 2021-11-30 | 2022-03-01 | 重庆理工大学 | Method and system for classifying picture data |
Also Published As
Publication number | Publication date |
---|---|
CN111460247B (en) | 2022-07-01 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111460247B (en) | Automatic detection method for network picture sensitive characters | |
CN112035669B (en) | Social media multi-modal rumor detection method based on propagation heterogeneous graph modeling | |
CN113283551B (en) | Training method and training device of multi-mode pre-training model and electronic equipment | |
US11762990B2 (en) | Unstructured text classification | |
CN109902175A (en) | A kind of file classification method and categorizing system based on neural network structure model | |
Kantipudi et al. | Scene text recognition based on bidirectional LSTM and deep neural network | |
CN111160130B (en) | Multi-dimensional collision recognition method for multi-platform virtual identity account | |
CN113627550A (en) | Image-text emotion analysis method based on multi-mode fusion | |
CN114548274A (en) | Multi-modal interaction-based rumor detection method and system | |
CN111680577A (en) | Face detection method and device | |
Sharma et al. | Fake news detection using deep learning | |
CN118152594A (en) | News detection method, device and equipment containing misleading information | |
CN115640401B (en) | Text content extraction method and device | |
CN112052869A (en) | User psychological state identification method and system | |
CN112035670B (en) | Multi-modal rumor detection method based on image emotional tendency | |
Sanjaya et al. | BISINDO Sign Language Recognition: A Systematic Literature Review of Deep Learning Techniques for Image Processing | |
Wang et al. | A lightweight CNN model based on GhostNet | |
Mao et al. | Detection of artificial pornographic pictures based on multiple features and tree mode | |
Tan et al. | Sentiment analysis of chinese short text based on multiple features | |
Naosekpam et al. | A hybrid scene text script identification network for regional Indian languages | |
Ma et al. | Is a picture worth 1000 votes? Analyzing the sentiment of election related social photos | |
Neela et al. | An Ensemble Learning Frame Work for Robust Fake News Detection | |
Nam et al. | Spam Image Detection Model based on Deep Learning for Improving Spam Filter | |
CN113283240B (en) | Co-reference digestion method and electronic equipment | |
Akiladevi et al. | Event Detection in Social Media Analysis: A Survey |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||