CN111538550A

CN111538550A - Webpage information screening method based on image detection algorithm

Info

Publication number: CN111538550A
Application number: CN202010307694.9A
Authority: CN
Inventors: 姜海强; 秦斌
Original assignee: Individual
Current assignee: Individual
Priority date: 2020-04-17
Filing date: 2020-04-17
Publication date: 2020-08-14

Abstract

The invention discloses a webpage information screening method based on an image detection algorithm, which comprises the following steps: step S1, obtaining in advance the link information of the target webpage to be screened and accessing the webpage; step S3, rendering the target webpage through the Splash framework to obtain a webpage screenshot; step S7, the engine transmits the acquired webpage screenshot to a pre-trained target detection model, acquires the detection result and determines the number n of targets returned by the model; step S9, judging the relationship between the target number n and the effective target threshold m, wherein if n is smaller than m, the webpage is an invalid webpage, and if n is greater than or equal to m, the webpage is a valid webpage. According to the invention, by detecting and screening the information content of the target webpage, dirty data or useless data in the network are filtered out, so that network garbage images are filtered, the network is purified, the internet surfing experience of netizens is optimized, and workers can selectively filter the network and download only related images, thereby saving network bandwidth and the cost of subsequent data cleaning.

Description

Webpage information screening method based on image detection algorithm

Technical Field

The invention relates to the technical field of data acquisition, in particular to a webpage information screening method based on an image detection algorithm.

Background

Artificial intelligence (AI) is a new technical science that studies and develops theories, methods, techniques and application systems for simulating, extending and expanding human intelligence. Artificial intelligence is a branch of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can react in a manner similar to human intelligence; research in the field includes robotics, speech recognition, image recognition, natural language processing, and expert systems, among others.

With the rapid development of artificial intelligence, the public data of websites has become crucial to the development of AI. However, such public data often contains a large amount of garbage data or data that the algorithm does not need.

In the prior art, the public data of a website is collected in full by web crawler technology and cleaned afterwards, which not only wastes network resources but also increases the later cleaning cost.

An effective solution to the problems in the related art has not been proposed yet.

Disclosure of Invention

Aiming at the problems in the related art, the invention provides a webpage information screening method based on an image detection algorithm, which filters out dirty data or useless data in a network by detecting and screening target webpage information content so as to overcome the technical problems in the prior related art.

The technical scheme of the invention is realized as follows:

a webpage information screening method based on an image detection algorithm comprises the following steps:

step S1, pre-obtaining the link information of the target webpage to be filtered and accessing the webpage;

step S3, rendering the target webpage through the Splash framework to obtain a webpage screenshot;

step S7, the engine transmits the acquired webpage screenshot to a pre-trained target detection model, acquires the detection result and determines the number n of targets returned by the model;

step S9, judging the relationship between the target number n and the effective target threshold m, wherein if n is less than m, the webpage is an invalid webpage, and if n is greater than or equal to m, the webpage is a valid webpage;

and step S11, the engine parses and downloads the valid webpages, ignores the invalid webpages, and completes the webpage information screening.
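Steps S1 to S11 can be read as a simple filter loop. The following is a minimal sketch under stated assumptions, not the implementation of the invention: `render` and `detect` are hypothetical callables standing in for the Splash rendering step and the pre-trained detection model, and the names `screen_pages` and `m` are chosen here for illustration only.

```python
from typing import Callable, Iterable, List


def screen_pages(
    urls: Iterable[str],
    render: Callable[[str], bytes],
    detect: Callable[[bytes], list],
    m: int,
) -> List[str]:
    """Return the URLs judged valid under step S9 (n >= m).

    render and detect are hypothetical stand-ins for the rendering
    step (S3) and the detection model (S7).
    """
    valid = []
    for url in urls:                  # S1: links obtained in advance
        screenshot = render(url)      # S3: webpage screenshot
        n = len(detect(screenshot))   # S7: number of targets returned
        if n >= m:                    # S9: compare n with threshold m
            valid.append(url)         # S11: keep valid pages for download
        # pages with n < m are invalid and are simply ignored
    return valid
```

Passing the two stages in as callables keeps the decision logic of step S9 separate from whichever renderer and detector are actually used.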

Further, step S1 further includes the following steps:

step S101, acquiring target information in advance from a public data set, and manually recording the text information in the target information into a text list;

step S102, using the Python library Requests to POST the target-information keyword to a browser, acquiring the information returned by the web server, converting the returned information into JSON format, parsing the JSON to obtain the key information of the target information, and storing the key information in the text list;

and step S103, sequentially storing the information in the acquired text list into Redis, a distributed database that reads from memory.

Further, step S7 includes a model pre-training module and a detection module, wherein:

the model pre-training module is used for pre-training a target detection model on a public test set;

the detection module acquires the positions and the number of detection boxes for the input webpage screenshot and determines the number n of targets returned by the model.

Further, the detection module comprises image preprocessing, neural network inference and detection regression.

The invention has the beneficial effects that:

according to the invention, by detecting and screening the information content of the target webpage, dirty data or useless data in the network are filtered, so that network garbage images are filtered, the network is purified, the internet surfing experience of netizens is optimized, and workers can selectively filter the network and only download related images, thereby saving the network bandwidth and the cost of subsequent data cleaning.

Drawings

In order to illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings needed in the embodiments are briefly described below. Obviously, the drawings described below show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.

FIG. 1 is a schematic flowchart of a method for screening web page information based on an image detection algorithm according to an embodiment of the present invention;

FIG. 2 is a schematic diagram of a model pre-training process of a web page information screening method based on an image detection algorithm according to an embodiment of the present invention;

FIG. 3 is a schematic diagram of a model inference process of a web page information screening method based on an image detection algorithm according to an embodiment of the present invention;

FIG. 4 is a schematic diagram of a webpage rendering screenshot of a webpage information screening method based on an image detection algorithm according to an embodiment of the present invention;

FIG. 5 is a schematic diagram illustrating a detection result of a web page information screening method based on an image detection algorithm according to an embodiment of the present invention;

FIG. 6 is a simplified flowchart of the MINI-SSD (a reduced single-shot detector network, SSD) of a web page information screening method based on an image detection algorithm according to an embodiment of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments that can be derived by one of ordinary skill in the art from the embodiments given herein are intended to be within the scope of the present invention.

According to the embodiment of the invention, a webpage information screening method based on an image detection algorithm is provided.

As shown in figs. 1 to 6, a method for screening web page information based on an image detection algorithm according to an embodiment of the present invention includes steps S1, S3, S7, S9 and S11 set out above.

In addition, step S1 further includes steps S101 to S103 set out above.

In addition, step S7 includes a model pre-training module and a detection module, as set out above.

In addition, the detection module includes image preprocessing, neural network inference, and detection regression.

In addition, the invention mainly comprises a seed module, a rendering module, a model pre-training module, a detection module and a downloading module.

With the aid of the above scheme, in one embodiment, taking automobile brand information as an example, the seed module works as follows. First, the brand information of automobiles is obtained from a public data set; for example, the keyword "car brand" can be searched on Baidu or Google, and the text information in the results is manually recorded as a text list. Next, the keyword "automobile brand" is sent by POST through the Python library Requests, the information returned by the webpage is converted into JSON format, the JSON is parsed to obtain the key information of the automobile brands, and the key information is stored in the text list. The information in the text list is then stored sequentially in the distributed database Redis, which reads from memory and has the advantages of quick response and a simple calling method.
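The parsing and storage part of the seed module can be sketched as below. This is an illustration under assumptions rather than the code of the embodiment: the JSON field name `brands` and the key name used in the usage note are hypothetical, and `client` is any object exposing a Redis-style `rpush(name, *values)` method (for example a `redis.Redis()` instance from the redis-py package).

```python
import json
from typing import List


def extract_brands(response_text: str, field: str = "brands") -> List[str]:
    """Parse a JSON response and return the non-empty entries under `field`.

    The field name "brands" is a hypothetical example; a real search
    response will have its own structure.
    """
    data = json.loads(response_text)
    items = [str(item).strip() for item in data.get(field, [])]
    return [item for item in items if item]


def store_keywords(client, key: str, keywords: List[str]) -> int:
    """Append keywords, in order, to a Redis list and return how many were stored.

    `client` is assumed to provide rpush(name, *values), as redis-py does.
    """
    stored = 0
    for word in keywords:
        client.rpush(key, word)  # RPUSH keeps the original order of the list
        stored += 1
    return stored
```

A call such as `store_keywords(redis.Redis(), "car_brands", extract_brands(text))` would then fill the list that the control engine later reads; the key name `car_brands` is again only an example.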

In addition, for the detection module, a vehicle detection model is pre-trained on public vehicle data sets, namely the CompCars data set and the BIT-Vehicle data set. The CompCars data set contains data from two scenarios: web-nature images and surveillance-nature images. The web-nature data covers 163 car makes with 1716 car models, comprising 136726 images capturing entire cars and 27618 images capturing car parts. The images of entire cars are labeled with bounding boxes and viewpoints.

In addition, the BIT-Vehicle data set was collected by Beijing Institute of Technology, and its vehicle images come from road surveillance. The data set contains 9580 vehicle pictures in 6 types: bus, microbus, minivan, sedan, SUV, and truck. The numbers for the respective types are 558, 883, 476, 5922, 1392, and 822. The pictures come in 2 sizes, 1600 x 1200 and 1600 x 1080, and were captured by 2 cameras at different times and places (including day and night). Alternatively, other public vehicle test sets, such as UA-DETRAC, may be selected.

In addition, as shown in fig. 2, a public data set is prepared and packaged into LMDB (Lightning Memory-Mapped Database), a data format that the deep learning framework Caffe loads easily. The LMDB format consists of a data file and a lock file. LMDB essentially memory-maps the data on the hard disk, so that during loading a program can address the file directly through the in-memory index, which avoids the IO bottleneck caused by traversing the data.

The LMDB data supports multi-process concurrent reading without separately maintaining an index table. Meanwhile, the LMDB can support the conversion of label, image and binary formats, can provide a uniform reading interface and is convenient for program management.

In addition, as shown in figs. 2-3, the mean of the public data set images is calculated. Optional method 1: calculate the mean of each of the three channels (RGB) over all pictures. Optional method 2: calculate the three-channel mean over randomly sampled pictures, which is fast to compute but generalizes poorly. Optional method 3: use the default three-channel means common in object detection (114, 117, 123), or use a uniform mean (128, 128, 128) for the three channels. Subtracting the mean reduces the interference of lighting, removes the component common to all images, and highlights the individual key features; mean subtraction and normalization also help the model converge faster. The exact magnitude of the mean generally has little effect on the detection result.
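Optional method 1 amounts to a running sum per channel. The sketch below assumes, purely for illustration, that each image is given as a flat sequence of three-channel pixel tuples; the function name `channel_means` is not from the original text.

```python
from typing import List, Sequence, Tuple

Pixel = Tuple[int, int, int]


def channel_means(images: Sequence[Sequence[Pixel]]) -> List[float]:
    """Per-channel mean over all pixels of all images (optional method 1)."""
    totals = [0, 0, 0]
    count = 0
    for image in images:
        for pixel in image:
            for ch in range(3):
                totals[ch] += pixel[ch]
            count += 1
    if count == 0:
        raise ValueError("no pixels to average")
    return [t / count for t in totals]
```

Optional method 2 would call the same function on a random subset of the images, which trades accuracy of the estimate for speed.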

In addition, for model pre-training, this embodiment selects the MobileNet-SSD neural network to pre-train the vehicle detection model, because the MobileNet network has a simple structure and a fast inference speed. MobileNet-SSD mainly comprises three parts: a backbone network, FeatureMap layers and a detection_out layer.

The backbone network is the MobileNet network structure with the last fully connected (FC) layer removed; 8 convolutional layers are then added, and 6 convolutional layers are extracted as FeatureMap layers for detection.

In addition, the method requires fast inference but places low demands on the regression precision of the detection boxes. The invention therefore optimizes the MobileNet backbone network so that it runs faster with fewer parameters. As shown in fig. 6, the invention employs MINI-SSD, as follows:

the conv, conv/bn, conv/scale and conv/relu layers of layers 5 through 9 of MobileNet-SSD are all deleted;

two hyper-parameters are introduced in MobileNet to reduce the number of parameters and the amount of calculation: a) the width multiplier, which reduces the input and output channels and is set to 0.33 in this application; b) the resolution multiplier, which reduces the size of the input and output feature maps and is set to 1.5 in this application;

the FeatureMap layers of layers 15 to 17 are deleted and the FeatureMap layers of layers 11, 13 and 14 are kept, because the targets in a webpage are generally small: the earlier layers have larger feature maps and smaller receptive fields, which suits the detection of small objects.

In addition, in view of the shapes of the detection targets in webpages, the application sets the aspect ratios of layers 11, 13 and 14 to 3, 1 and 1/2, and the scale calculation formula is as follows:

wherein Smax denotes the maximum scale, 0.9, and Smin denotes the minimum scale, 0.2; the corresponding scales provided in the application are 8, 16 and 32.
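The scale formula itself is not reproduced in this text. A minimal sketch, assuming the method follows the linear rule of the original SSD design, s_k = s_min + (s_max - s_min) / (m - 1) × (k - 1) for k = 1..m, with which the values 0.9 and 0.2 agree; the function name `ssd_scales` and the choice m = 3 (one scale per retained FeatureMap layer) are assumptions made here.

```python
from typing import List


def ssd_scales(s_min: float, s_max: float, m: int) -> List[float]:
    """Linearly spaced scales s_1..s_m (assumed standard SSD rule)."""
    if m < 1:
        raise ValueError("m must be at least 1")
    if m == 1:
        return [s_min]
    step = (s_max - s_min) / (m - 1)
    return [s_min + step * (k - 1) for k in range(1, m + 1)]


# One scale per retained FeatureMap layer (11, 13, 14) under this assumption.
scales = ssd_scales(0.2, 0.9, 3)
```

Under this rule the three layers would receive scales of roughly 0.2, 0.55 and 0.9 relative to the input size.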

The FeatureMap of layer 11 in this application has a size of 19 × 19, and the design infers from webpage information that the average picture size is 152 × 152. Each feature point is taken as the center of a convolutional detection box (bbox), and the size of the detection box is set according to prior knowledge.

From each center, n default bounding boxes (x, y, w, h) with different aspect ratios (w:h) are generated, where n corresponds to the three aspect ratios mentioned above.

Each picture has ground-truth location coordinates (x1, y1, w1, h1). Correspondingly, a convolution producing 4 offsets is added, the offset values representing the difference between the ground-truth bbox and the default bbox. Each detection box corresponds to c classes, where c is 2 in this application, representing vehicle and background.

The number of parameters of each FeatureMap layer can therefore be expressed as p(x) = (c + 4) × n × w × h.

In addition, in this application the input webpage image passes through layer 11 (19 × 19), layer 13 (10 × 10) and layer 14 (5 × 5), generating 19 × 19 × 3 + 10 × 10 × 3 + 5 × 5 × 3 = 1458 candidate boxes, which are input to the detection_out layer for screening.
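The two counts above can be checked with a few lines of arithmetic. The sketch uses only the figures given in the text (three aspect ratios, two classes, and the three feature-map sizes); the names are illustrative.

```python
from typing import Dict, Tuple

# Feature-map sizes (w, h) of the retained layers, taken from the text.
FEATURE_MAPS: Dict[int, Tuple[int, int]] = {11: (19, 19), 13: (10, 10), 14: (5, 5)}
N_RATIOS = 3   # aspect ratios per feature point
N_CLASSES = 2  # vehicle and background


def boxes_per_layer(w: int, h: int, n: int = N_RATIOS) -> int:
    """Number of default boxes generated by one w x h feature map."""
    return w * h * n


def params_per_layer(w: int, h: int, n: int = N_RATIOS, c: int = N_CLASSES) -> int:
    """p = (c + 4) * n * w * h: c class scores plus 4 offsets per box."""
    return (c + 4) * n * w * h


total_boxes = sum(boxes_per_layer(w, h) for w, h in FEATURE_MAPS.values())
```

Here `total_boxes` comes to 1083 + 300 + 75 = 1458, matching the count of candidate boxes stated above.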

In the detection_out layer, the Non-Maximum Suppression (NMS) overlap threshold is set to 0.4, meaning that candidate boxes are screened against an Intersection over Union (IoU) threshold of 0.4. topk is set to 400, meaning that 400 candidate boxes remain after screening.

In the training process, it is first determined which prior box each ground truth (real target) in the training picture is matched with; the bounding box corresponding to the matched prior box is responsible for predicting that ground truth. For the 400 candidate boxes remaining after the screening above, the application follows two main principles for matching the prior boxes of SSD with the ground truth. First, for each ground truth in the picture, the prior box with the largest IoU is found and matched with it, which guarantees that every ground truth is matched with some prior box. A prior box matched with a ground truth is usually called a positive sample; conversely, a prior box that is not matched with any ground truth can only be matched with the background and is a negative sample. A picture contains very few ground truths but many prior boxes, so if matching followed the first principle alone, most prior boxes would be negative samples and the positive and negative samples would be extremely unbalanced. Hence the second principle: for each remaining unmatched prior box, if its IoU with some ground truth is greater than a certain threshold (typically 0.5), the prior box is also matched with that ground truth. This means that one ground truth may be matched with several prior boxes, which is allowed.

In addition, a prior box can match only one ground truth: if the IoU between a prior box and several ground truths exceeds the threshold, the prior box is matched only with the ground truth with the largest IoU. The second principle must be applied after the first. Consider the case in which the largest IoU of some ground truth is below the threshold, while the prior box it matched has an IoU above the threshold with another ground truth: which ground truth should the prior box match? The answer is the former, so as to ensure first that every ground truth has a prior box matched with it. In practice, however, this case essentially does not arise: because the prior boxes are so numerous, the largest IoU of a ground truth is bound to be greater than the threshold.
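The two matching principles can be illustrated with a short sketch. It is a simplified reading of the description, not the training code of the embodiment: boxes are assumed to be (x, y, w, h) tuples with (x, y) the top-left corner, the function names are chosen here, and the corner case in which two ground truths share the same best prior box is not treated specially (the later ground truth simply overrides the earlier one).

```python
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (x, y, w, h), (x, y) = top-left corner


def iou(a: Box, b: Box) -> float:
    """Intersection over Union of two (x, y, w, h) boxes."""
    ax2, ay2 = a[0] + a[2], a[1] + a[3]
    bx2, by2 = b[0] + b[2], b[1] + b[3]
    iw = max(0.0, min(ax2, bx2) - max(a[0], b[0]))
    ih = max(0.0, min(ay2, by2) - max(a[1], b[1]))
    inter = iw * ih
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0


def match_priors(priors: List[Box], gts: List[Box], threshold: float = 0.5) -> Dict[int, int]:
    """Return {prior index: ground-truth index} for the positive samples."""
    matches: Dict[int, int] = {}
    if not priors or not gts:
        return matches
    # Principle 1: every ground truth gets the prior box with the largest IoU.
    for g, gt in enumerate(gts):
        best = max(range(len(priors)), key=lambda p: iou(priors[p], gt))
        matches[best] = g
    # Principle 2: each remaining prior box matches the ground truth with the
    # largest IoU, provided that IoU exceeds the threshold.
    for p, prior in enumerate(priors):
        if p in matches:
            continue
        g = max(range(len(gts)), key=lambda i: iou(prior, gts[i]))
        if iou(prior, gts[g]) > threshold:
            matches[p] = g
    return matches
```

Prior boxes that do not appear in the returned mapping are the negative samples (background).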

In order to keep the training samples balanced, the invention adopts hard negative mining during training: the negative samples are sampled by sorting them in descending order of confidence error and taking the samples with the largest errors as the negative samples. In this way the ratio of positive to negative samples can be kept within a reasonable interval of 1:5 to 1:3, which prevents over-fitting and poor generality of the final training. The usual empirical value is 1:3.

During training, the input image undergoes forward inference through the CNN, and the classification confidence loss (softmax loss) and the detection regression loss (Smooth L1 loss) between the training samples and the ground truth (gt) are calculated.
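The selection of negatives can be sketched as a sort followed by a cut-off. The function name and the representation of the confidence errors as a plain list of floats are assumptions made for illustration.

```python
from typing import List


def mine_hard_negatives(neg_losses: List[float], num_pos: int, ratio: int = 3) -> List[int]:
    """Return indices of the negatives with the largest confidence error.

    At most ratio * num_pos negatives are kept, so that the ratio of
    positive to negative samples is no worse than 1:ratio.
    """
    keep = min(len(neg_losses), ratio * num_pos)
    order = sorted(range(len(neg_losses)), key=lambda i: neg_losses[i], reverse=True)
    return order[:keep]
```

With one positive sample and the default ratio of 3, only the three negatives with the highest error are retained for the loss.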

In addition, the weighted loss function can be expressed as:

wherein L_conf(x, c) represents the confidence classification loss; L_loc(x, l, g) represents the Smooth L1 loss between the predicted box and the real box; N represents the number of matched default boxes (the sum of positive and negative samples); x represents the input samples; c represents the confidence; l (location) represents the position (x, y, w, h) of the candidate box given by the model, where (x, y) are the coordinates of the top-left corner, w is the width and h is the height; and g represents the ground truth (real target).
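The loss expression itself is not reproduced in this text. Assuming the method follows the weighted loss of the original SSD design, which the symbols defined above match, it would take the form below; the weight α on the localization term is an assumption introduced here and is not named in the text, and the Smooth L1 function is given in its usual form for reference.

```latex
L(x, c, l, g) = \frac{1}{N}\left( L_{conf}(x, c) + \alpha \, L_{loc}(x, l, g) \right),
\qquad
\mathrm{smooth}_{L1}(d) =
\begin{cases}
0.5\, d^{2} & \text{if } |d| < 1 \\
|d| - 0.5 & \text{otherwise}
\end{cases}
```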

In addition, optionally, the model used in the model pre-training process may be the reduced network model trained in the above manner. An open-source pre-trained model from the Caffe Model Zoo, which covers most classification and detection models, may also be selected. For the backbone network, where accuracy is pursued and speed can be sacrificed to a certain extent, structures with higher accuracy such as ResNet, GoogLeNet and ShuffleNet can be selected.

In addition, the method is used here for detecting automobile data in webpages, but its structure can be extended to screen other online data: by replacing the vehicle detection model with models such as face detection, license plate detection or pedestrian detection, various kinds of valid online public data can be mined.

In addition, the rendering module mainly comprises a database, a control engine and a rendering engine. The database is a Redis database; a search keyword is acquired through the Redis API, and the control engine uses the keyword to search in a browser.

The control engine is implemented with the Python library requests. The keyword is first encapsulated in JSON format, for example params = {'keyword': 'keys', 'pages': '1'}; the control engine then sends the data to the browser by POST and acquires the browser's Response data. The Response content is sent to a parser, which converts it into JSON format and parses it. The hypertext markup language (HTML) and JS code in the parsing result are rendered by the rendering engine.

In addition, as shown in figs. 4-5, the rendering engine sends the rendered webpage screenshot to the detection engine by POST for model inference, and the detection engine carries out the inference process shown in fig. 3 to obtain the number of target detection boxes. The inference engine returns the number of detection boxes to the rendering engine in the Response. A download threshold is set, representing the minimum number of targets a rendered page must contain to be downloaded. The rendering engine judges the detection result: if the number of detection boxes is greater than the download threshold, 1 is returned; if the number of detection boxes is less than the download threshold, 0 is returned.

The parser processes the returned result: if the result is 0 the webpage is ignored, and if the result is 1 the webpage is passed to the downloader for downloading. The rendering engine uses the Splash framework, a JavaScript rendering service. Splash is a lightweight browser with an HTTP API; it is implemented in Python and uses the event-driven network engine framework Twisted and the graphical user interface framework QT for page rendering. Newer webpage structures are rendered dynamically by the client browser, so their internal information cannot be obtained by parsing the HTML of the webpage and the picture content cannot be downloaded directly; the webpage has to be rendered locally. A web service is provided through bottle, which receives the input picture and returns the model detection result to the requesting end.

In addition, as shown in fig. 3, the inference engine first acquires the Base64-encoded image data sent by the rendering engine via POST and converts it into BGR format. The corresponding mean is subtracted from each channel of the image, the mean being the one calculated from the public data set above. The formula is expressed as:

$$N_i = x_i - v_i, \quad i \in \{1, 2, 3\}$$

wherein N represents the new image, x represents the input image, the indices 1-3 denote the three channels of the image, and v (mean) represents the mean of each channel over the whole data set.

The mean-subtracted image is then normalized, which is expressed as:

$$x_i' = \frac{x_i - \min(x)}{\max(x) - \min(x)}$$

wherein x_i represents the value of a pixel, and min(x) and max(x) represent the minimum and maximum pixel values of the image, respectively; (0, 255) is selected as the minimum and maximum for normalization in the invention.
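The two preprocessing operations can be sketched as follows, assuming for illustration that an image is a flat sequence of three-channel pixel tuples. The function names are chosen here, and the fixed range (0, 255) follows the text.

```python
from typing import List, Sequence, Tuple

Pixel = Tuple[float, ...]  # one value per channel


def subtract_mean(image: Sequence[Pixel], mean: Pixel) -> List[Pixel]:
    """Subtract the per-channel mean from every pixel of the image."""
    return [tuple(p[ch] - mean[ch] for ch in range(3)) for p in image]


def min_max_normalize(image: Sequence[Pixel], lo: float = 0.0, hi: float = 255.0) -> List[Pixel]:
    """Scale every value with (v - lo) / (hi - lo), using the fixed range (0, 255)."""
    span = hi - lo
    return [tuple((v - lo) / span for v in p) for p in image]
```

In the inference engine the two functions would be applied in the order described above, mean subtraction first and normalization second.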

In addition, the model is loaded, the data is fed in for forward inference, and the final returned result is obtained, in a format such as

[class 1, conf, x, y, w, h] … [class N, conf, x, y, w, h], wherein class represents the classification, divided in the invention into "car" (1) and "non-car" (0); conf represents the confidence of the classification result as a decimal such as 0.78; (x, y) represents the coordinates of the top-left corner of the detection box; w represents the width of the detection box and h represents its height. The number of entries classified as car in the result list is counted as the number of vehicles, and the formula is as follows:

$$n = \sum_{i=1}^{N} \mathbb{1}[\mathrm{class}_i = 1]$$

In addition, the parser compares the number of target detection boxes in the rendered webpage, calculated according to the formula, with the download threshold, which is 5 in this application. If the number exceeds the threshold, the webpage is sent to the downloader for downloading; if it is less than the download threshold, the data in the webpage is considered insufficiently rich and the webpage is ignored.
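The counting and threshold comparison can be sketched as below. The function names are illustrative, and one point is an assumption: the description does not say what happens when the count equals the threshold, so the sketch follows step S9 and treats n greater than or equal to the threshold as valid.

```python
from typing import List, Sequence

Detection = Sequence[float]  # [class, conf, x, y, w, h]


def count_vehicles(detections: List[Detection], vehicle_class: int = 1) -> int:
    """Count the entries whose class field equals the vehicle class."""
    return sum(1 for d in detections if int(d[0]) == vehicle_class)


def should_download(detections: List[Detection], threshold: int = 5) -> int:
    """Return 1 if the page should be downloaded and 0 if it is ignored.

    Assumption: a count equal to the threshold counts as valid (n >= m),
    following step S9.
    """
    return 1 if count_vehicles(detections) >= threshold else 0
```

The returned 1 or 0 corresponds to the value that the parser uses to decide between passing the webpage to the downloader and ignoring it.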

In summary, by means of the technical scheme of the invention, dirty data or useless data in the network are filtered out by detecting and screening the information content of the target webpage, so that network garbage images are filtered, the network is purified, the internet surfing experience of netizens is optimized, and workers can selectively filter the network and only download related images, thereby saving the network bandwidth and the cost of subsequent data cleaning.

The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.

Claims

1. A webpage information screening method based on an image detection algorithm is characterized by comprising the following steps:

2. The method for screening web page information based on image detection algorithm according to claim 1, wherein step S1, further comprising the steps of:

3. The method for screening web page information based on image detection algorithm as claimed in claim 1, wherein step S7 includes a model pre-training and detection module, wherein;

4. The method for screening webpage information based on the image detection algorithm according to claim 3, wherein the detection module comprises image preprocessing, neural network inference and detection regression.