CN111144453A - Method and equipment for constructing multi-model fusion calculation model and method and equipment for identifying website data - Google Patents


Info

Publication number
CN111144453A
CN111144453A
Authority
CN
China
Prior art keywords
model
data set
data
calculation
constructing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911266182.6A
Other languages
Chinese (zh)
Inventor
吴琼
周楠
王元卓
刘武雷
常诚
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Big Data Research Institute Institute Of Computing Technology Chinese Academy Of Sciences
Original Assignee
Big Data Research Institute Institute Of Computing Technology Chinese Academy Of Sciences
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Big Data Research Institute Institute Of Computing Technology Chinese Academy Of Sciences filed Critical Big Data Research Institute Institute Of Computing Technology Chinese Academy Of Sciences
Priority to CN201911266182.6A priority Critical patent/CN111144453A/en
Publication of CN111144453A publication Critical patent/CN111144453A/en
Pending legal-status Critical Current

Classifications

    • G06F18/253: Pattern recognition; fusion techniques of extracted features
    • G06F18/2415: Pattern recognition; classification techniques based on parametric or probabilistic models, e.g. based on likelihood ratio
    • G06N3/045: Neural networks; combinations of networks
    • G06N3/08: Neural networks; learning methods


Abstract

The invention provides a method and equipment for constructing a multi-model fusion calculation model. The method acquires data from a target website to obtain a webpage data set; optimizes the webpage data set to obtain an optimized data set; adds classification marks to the optimized data set to obtain a classified data set; converts the classified data set into vectors through distributed word vectors to obtain a vector data set, and divides the vector data set into a training data set and a testing data set; constructs a plurality of calculation models and fuses them to form a first calculation model; imports the training data set into the first calculation model to obtain a second calculation model; and imports the testing data set into the second calculation model for testing to obtain an optimal calculation model. The invention also provides a website data identification method and equipment, the identification method comprising: obtaining an optimal calculation model by the above method, and importing a website set into the optimal calculation model to obtain a target result. By using an additive, model-plus-model training algorithm, multiple deep learning models are combined, and the accuracy of risk webpage identification is further improved.

Description

Method and equipment for constructing multi-model fusion calculation model and method and equipment for identifying website data
Technical field:
the invention relates to the technical fields of risk website identification, multi-model fusion, natural language processing and deep learning, and in particular to a method and equipment for constructing a multi-model fusion calculation model and a method and equipment for website data identification.
Background art:
nowadays the network security situation is increasingly severe. Some risk webpages, created for the purposes of obtaining illicit benefits, destroying social stability and endangering national security, induce users to click, browse and perform other operations without their knowledge, causing money loss, privacy disclosure, involvement in crime and the like. Existing risk webpages include phishing webpages, yellow-related webpages, gambling webpages, and the like. How to effectively identify such risk webpages has become a problem that urgently needs to be solved.
In the face of massive data on the Internet, the traditional approach to risk webpage identification is manual classification, for example manually identifying illegal webpage information that is yellow-related, gambling-related or terrorism-related. Such manual classification, however, has several disadvantages: the consistency of the classification results is low, and a large amount of financial and human resources is required. Even when the human classifiers are highly skilled, different people produce different classification results, and even the same person may classify differently at different times. Exploring a more accurate and efficient intelligent webpage classification and identification technology is therefore imperative.
At present, common algorithms for automatic webpage classification are based on machine learning (such as KNN, RF, Bayes, SVM and the like) and perform well in webpage classification and identification systems. However, most traditional machine learning algorithms are shallow learning: the bottleneck of such algorithm models is feature extraction, a great deal of time is usually needed to design and tune the data features, and deeper semantic features of the objects cannot be extracted, which greatly limits recognition accuracy. Deep learning, by contrast, is end-to-end, which reduces the difficulty of manually designing features: an expressive neural network model can automatically mine high-level features of the data from simple data features, and the association between features and tasks is learned after multiple rounds of repeated training.
In recent years, deep learning has achieved better performance than machine learning in webpage classification and recognition, but existing deep-learning-based webpage classification and recognition methods generally adopt a single calculation model, so classification accuracy is not high, model training time complexity is high, and model generalization capability is weak. Adopting a transfer learning method together with a multi-model fusion algorithm can effectively solve these problems, which is of great significance in the field of risk website identification.
Summary of the invention:
in view of the above, the present invention provides a method and an apparatus for constructing a multi-model fusion computation model, and a method and an apparatus for website data identification, so as to solve at least one technical problem in the prior art.
Specifically, the invention provides a method for constructing a multi-model fusion computing model, which comprises the following steps:
acquiring target webpage data from a target website data source to obtain a webpage data set;
performing data optimization on the webpage data set to obtain an optimized data set;
adding a classification mark to the optimized data set to obtain a classified data set;
converting the data in the classified data set into vector data to obtain a vector data set; dividing the vector data set into a training data set and a testing data set;
constructing a plurality of calculation models, and fusing the calculation models to form a first calculation model;
importing a training data set into a first calculation model, and obtaining a second calculation model after training;
and importing the test data set into a second calculation model for testing, and adjusting parameters until the test result reaches the standard to obtain the optimal calculation model.
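The data-splitting step above can be sketched as follows (a minimal illustration in plain Python; the function name, the 80/20 ratio and the fixed seed are assumptions, as the patent does not prescribe a split ratio):

```python
import random

def split_dataset(vectors, labels, test_ratio=0.2, seed=42):
    """Shuffle the vector data set and divide it into a training data set
    and a testing data set (the 80/20 ratio here is an assumption)."""
    idx = list(range(len(vectors)))
    random.Random(seed).shuffle(idx)
    n_test = int(len(idx) * test_ratio)
    test_idx, train_idx = idx[:n_test], idx[n_test:]
    train = ([vectors[i] for i in train_idx], [labels[i] for i in train_idx])
    test = ([vectors[i] for i in test_idx], [labels[i] for i in test_idx])
    return train, test

vectors = [[float(i)] for i in range(10)]
labels = [i % 2 for i in range(10)]
(train_X, train_y), (test_X, test_y) = split_dataset(vectors, labels)
```

A fixed seed keeps the split reproducible across training runs, so the second calculation model is always tested on data it never saw.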
Further, when the data optimization is performed on the webpage data set, the data optimization mode comprises data deduplication, data deletion and data combination.
Further, when the data optimization is carried out on the webpage data set, the stop words of the text are removed according to the Chinese stop dictionary, and the Chinese word segmentation of the text is carried out.
Further, the conversion of data into vector data uses the Word2Vec method (Word to Vector, a family of models for generating word vectors); further, the dimension of the word vectors is set to 300.
Further, the plurality of calculation models comprise a VggNet model, a ResNet model, a DenseNet model and an Xception model.
Further, when the first calculation model is trained, the plurality of calculation models are trained using a transfer learning method.
Further, a discriminative parameter optimization method is adopted in the model training process, so that the parameters of different layers are adjusted with different learning rates.
Further, in the discriminative parameter training method, the stochastic gradient descent (SGD) update formula for the general model parameters θ is as follows:

θ_t = θ_{t-1} - η·∇_θ J(θ)

where η is the learning rate and ∇_θ J(θ) is the gradient of the model objective function with respect to θ.

For discriminative parameter optimization, the parameters θ are split into {θ^1, ..., θ^L}, where θ^l are the parameters of the model at layer l and L is the number of layers; likewise, the learning rates are split into {η^1, ..., η^L}, where η^l is the learning rate of layer l. The learning rate of the last layer, η^L, is set first and only the last layer is trained; the learning rates of the lower layers are then set according to the following formula:

η^{l-1} = η^l / 3
The discriminative stochastic gradient descent update formula is as follows:

θ_t^l = θ_{t-1}^l - η^l·∇_{θ^l} J(θ)
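The discriminative learning-rate scheme can be sketched in plain Python (illustration only; the layer count, starting rate and toy parameter values are assumptions): per-layer rates are derived from the last layer's rate by repeated division by 3, and each layer is then updated with its own rate.

```python
def layer_learning_rates(eta_last, num_layers):
    """Set the last layer's learning rate, then derive the lower layers
    via eta^{l-1} = eta^l / 3."""
    rates = [0.0] * num_layers
    rates[-1] = eta_last
    for l in range(num_layers - 2, -1, -1):
        rates[l] = rates[l + 1] / 3.0
    return rates

def discriminative_sgd_step(params, grads, rates):
    """theta_t^l = theta_{t-1}^l - eta^l * grad^l: one update per layer,
    each layer using its own learning rate."""
    return [[p - eta * g for p, g in zip(layer_p, layer_g)]
            for layer_p, layer_g, eta in zip(params, grads, rates)]

rates = layer_learning_rates(eta_last=0.27, num_layers=3)
params = discriminative_sgd_step([[1.0], [1.0], [1.0]],
                                 [[1.0], [1.0], [1.0]], rates)
```

With three layers and η^3 = 0.27, the derived rates are 0.03, 0.09 and 0.27: lower layers, which hold more general pre-trained features, move more slowly than the task-specific top layer.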
Further, after the training of each single model is completed, the weight of each single model in the final model is learned; whether the result meets the requirement is judged according to the evaluation criteria, and learning continues until the requirement is met and the optimal weight values are obtained, yielding the optimal calculation network model.
Further, the calculation formula for learning the weights is as follows:

Let f(x_i) denote the prediction of the fused model for the i-th sample, and let x_i = [p_i1, p_i2, p_i3, p_i4]^T collect the output probabilities of the i-th sample from each single model, where p_ij = [p_ij1, p_ij2, ..., p_ijn]^T (j = 1, 2, 3, 4) is the output probability vector of the i-th sample from the j-th model, its components being the probabilities of belonging to class 1 through class n. Let w = [w_1, w_2, w_3, w_4] denote the weights of the fusion model, w_j being the weight of the j-th model. Then

f(x_i) = Σ_{j=1}^{4} w_j·p_ij + b

where b is a constant term. Now assuming m samples, the weights can be obtained by minimizing the MSE between the prediction f(x_i) and the true label y_i of the i-th sample:

L(w, b) = (1/m)·Σ_{i=1}^{m} (f(x_i) - y_i)²
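The fusion and MSE formulas can be sketched in plain Python (two stand-in models and two classes; the equal weights, zero bias and one-hot labels are chosen purely for illustration):

```python
def fuse(p_i, w, b):
    """f(x_i) = sum_j w_j * p_ij + b: weighted sum of each single model's
    class-probability vector for sample i (p_i holds one vector per model)."""
    n_classes = len(p_i[0])
    return [sum(w[j] * p_i[j][k] for j in range(len(p_i))) + b
            for k in range(n_classes)]

def mse(preds, onehot_labels):
    """(1/m) * sum_i ||f(x_i) - y_i||^2 over m samples (labels one-hot)."""
    m = len(preds)
    return sum(sum((u - v) ** 2 for u, v in zip(f, y))
               for f, y in zip(preds, onehot_labels)) / m

p_1 = [[0.8, 0.2], [0.6, 0.4]]          # outputs of two single models
f_1 = fuse(p_1, w=[0.5, 0.5], b=0.0)    # fused class probabilities
loss = mse([f_1], [[1.0, 0.0]])
```

In practice the weights w_j and bias b would be adjusted (e.g. by gradient descent) to minimize this loss over all m samples, giving stronger single models larger weights in the fusion.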
in a second aspect of the present invention, a website data identification method is provided, including the following steps:
acquiring target webpage data from a target website data source to obtain a webpage data set;
performing data optimization on the webpage data set to obtain an optimized data set;
converting the data in the optimized data set into vector data to obtain a vector data set;
obtaining an optimal calculation model by a method of constructing a multi-model fusion calculation model;
and importing the vector data set into an optimal calculation model to obtain a target result.
In a third aspect of the present invention, an apparatus for constructing a multi-model fusion computational model is provided, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, and when the processor executes the computer program, the processor implements the above method for constructing the multi-model fusion computational model.
In a fourth aspect of the present invention, a website data identification apparatus is provided, which includes a memory, a processor, and a computer program stored on the memory and operable on the processor; when the processor executes the program, the above website data identification method is implemented.
The invention has the following beneficial effects:
(1) According to the website data identification method and equipment provided by the invention, Word2Vec is used for distributed word vector representation of webpage text, and the training data is combined with a large amount of risk webpage corpus material, so that the semantic characteristics of the webpage text are better represented.
(2) The equipment for constructing the multi-model fusion calculation model combines multiple deep learning models by using an additive, model-plus-model training algorithm, realizes a correction mechanism based on model fusion, and further improves the identification accuracy for risk webpages.
(3) According to the method for constructing the multi-model fusion calculation model, transfer learning is applied to deep learning model training, and the deep learning models need only fine-tuning, so the method can fit a small amount of labeled data while retaining good generalization capability.
(4) The website data identification method provided by the invention accelerates the convergence of the model by using a discriminative parameter optimization method, that is, by setting different learning rates for different layers.
Description of the drawings:
in order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts.
FIG. 1 is a schematic illustration of the steps of a method of constructing a multi-model fusion computational model;
fig. 2 is a schematic diagram illustrating steps of a website data identification method.
Detailed description of embodiments:
the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be obtained by a person skilled in the art without inventive effort based on the embodiments of the present invention, are within the scope of the present invention.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in this specification and the appended claims, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any and all possible combinations of one or more of the associated listed items.
Some concepts related to the present application are explained below:
1. Chinese stop dictionary: in information retrieval, in order to save storage space and improve search efficiency, some characters or words are automatically filtered out before or after processing natural language data (or text); these characters or words are called stop words. Chinese stop words include punctuation marks such as "?" and words or phrases without substantive semantics such as "but" and "even if". A Chinese stop dictionary is a collection of commonly used Chinese stop words.
2. VggNet model: a deep convolutional neural network model developed by the Visual Geometry Group of the University of Oxford together with researchers from Google DeepMind.
3. ResNet model: an abbreviation of Residual Network, a classical neural network model that serves as a backbone for many computer vision tasks.
4. DenseNet model: a cross-layer connection network model in which the input of each layer contains the information of all previous layers; by combining the features of the previous N layers, a richer and more discriminative description is formed.
5. Xception model: a linear stack of depthwise separable convolution layers with residual connections.
6. ReLU function: Rectified Linear Unit, a linear rectification function; an activation function commonly used in artificial neural networks, generally referring to the nonlinear functions represented by the ramp function and its variants.
7. Dropout layer: in methods for optimizing artificial neural networks with deep structures, part of the weights or outputs of the hidden layers are randomly zeroed during learning, which reduces the interdependency (co-dependency) among nodes, regularizes the neural network, and lowers its structural risk.
8. Evaluation criteria: the automatic identification and detection results for risk webpages are evaluated using the accuracy rate (precision) and the recall rate. The key indicator of a risk-webpage identification and detection algorithm is the accuracy of detecting risk webpages: on the test data set, the more detection results that correctly correspond to actual risk webpages, the higher the accuracy of the algorithm model. Here, TP denotes the number of risk webpages correctly identified and detected; FP denotes the number of non-risk webpages wrongly judged as risk webpages; FN denotes the number of actual risk webpages wrongly judged as non-risk.
The calculation formula of the accuracy rate is as follows:

Accuracy = TP / (TP + FP)

The calculation formula of the recall rate is as follows:

Recall = TP / (TP + FN)

Accuracy and recall influence each other: in risk webpage detection, obtaining a higher recall rate requires sacrificing some accuracy, so that some webpages not actually at risk are wrongly judged as risk webpages; conversely, pursuing higher accuracy sacrifices recall, and some risk webpages may be missed. The F1 value balances accuracy and recall and can comprehensively evaluate the detection result in risk webpage detection.

The F1 value calculation formula is as follows:

F1 = 2 × Accuracy × Recall / (Accuracy + Recall)
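The three evaluation formulas transcribe directly to Python (the counts in the example below are made up for illustration):

```python
def accuracy_rate(tp, fp):
    """TP / (TP + FP): share of flagged pages that are truly risk pages."""
    return tp / (tp + fp)

def recall_rate(tp, fn):
    """TP / (TP + FN): share of actual risk pages that were flagged."""
    return tp / (tp + fn)

def f1_value(p, r):
    """F1 = 2 * P * R / (P + R), balancing the two rates."""
    return 2 * p * r / (p + r)

# e.g. 90 risk pages correctly flagged, 10 non-risk pages wrongly flagged,
# 30 risk pages missed:
p = accuracy_rate(90, 10)   # 0.9
r = recall_rate(90, 30)     # 0.75
f = f1_value(p, r)
```

The harmonic-mean form of F1 means it is dragged down by whichever of the two rates is worse, which is why it is used to judge the trade-off described above.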
9. MSE (Mean Squared Error): the expected value of the square of the difference between an estimated value of a parameter and its true value.
The present invention will be described in detail below by way of examples.
In order to solve the technical problems, the general idea of the embodiment of the present application is as follows:
acquiring target webpage data from a target website data source to obtain a webpage data set;
performing data optimization on the webpage data set to obtain an optimized data set;
adding a classification mark to the optimized data set to obtain a classified data set;
converting the data in the classified data set into vector data to obtain a vector data set; dividing the vector data set into a training data set and a testing data set;
constructing a plurality of calculation models, and fusing the calculation models to form a first calculation model;
importing a training data set into a first calculation model, and obtaining a second calculation model after training;
and importing the test data set into a second calculation model for testing, and adjusting parameters until the test result reaches the standard to obtain the optimal calculation model.
In order to better understand the technical solution, the technical solution will be described in detail with reference to the drawings and the specific embodiments.
As shown in FIG. 1, in some embodiments of the present invention, a method of constructing a multi-model fusion computational model is provided, the method comprising:
s010 collects target webpage data from a target website data source to obtain a webpage data set;
in the specific implementation process, when the target website data is collected, the downloading of webpages is completed using the Scrapy crawler framework (an application framework for crawling website data and extracting structured data). Firstly, manually screened seed addresses are allocated to the crawler task as initial links, so that the crawler enters the secondary pages of the portal website directly and does not capture irrelevant websites under the depth-first strategy. The webpage is then parsed: the classification directory is located with the Beautiful Soup tool (a Python library that can quickly extract content from HTML and XML), the webpage directories of each type are extracted, and the crawler enters each directory to pull valid webpages; the pull strategy is depth-first based on the current directory, until the number of directory webpages is zero or the corresponding HTML tag is empty. Finally the data is stored into the data table of the database corresponding to the directory name.
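The directory-link extraction step can be illustrated with the Python standard library alone (the patent uses the Scrapy framework and Beautiful Soup; the `DirectoryLinkExtractor` class, the base URL and the HTML snippet below are hypothetical stand-ins):

```python
from html.parser import HTMLParser

class DirectoryLinkExtractor(HTMLParser):
    """Collect <a href> links under the current directory so the crawler
    can pull pages depth-first from that directory only."""
    def __init__(self, base):
        super().__init__()
        self.base = base
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href", "")
            if href.startswith(self.base):   # stay inside the directory
                self.links.append(href)

page = ('<a href="https://example.com/culture/a.shtml">A</a>'
        '<a href="https://other.com/x">X</a>')
ex = DirectoryLinkExtractor("https://example.com/culture/")
ex.feed(page)
```

Restricting links to the current directory prefix is what keeps the depth-first pull from wandering into irrelevant websites.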
In this embodiment, the target webpage data is data from risk webpages, that is, webpages which, for the purposes of obtaining benefits, destroying social stability and endangering national security, induce users without their knowledge to click, browse and perform other operations leading to crime, money loss, privacy disclosure and the like; the captured keywords are "politics-related, terrorism-related, yellow-related, gambling-related, sensitive figures and the like". In this embodiment, the data comprises 2 data records obtained by crawling different sub-pages of one website. The first record is: { "Web site: https:// www.guancha.cn/culture/2019_12_03_527189.shtml "," directory: https:// www.guancha.cn/culture/"," title: public distribution of refreshment "drugs" for hong kong rioter: skeleton head composition suspicion "," text content: the power of the overseas network is 12 months and 2 days, the 'repair wind waves' lasting for half a year are not calmed, and a longitudinal storm network with staggered joints of the pan root is arranged behind the 'black screen', so that a doctor who needs to save people falls into a fierce of a rioter. 
The online organization generation provides diagnosis and treatment for the rioter freely, and someone sends ' medicines ' freely at the riot site, even provides ' purple cloud cream ' facial scrub similar to blood plasma in shape, and the police are convenient to use and put down ' }; the second data is: { "Web site: https:// www.guancha.cn/3541093/2019/1202/content _31674893_1.html "," directory: https:// www.guancha.cn "," title: public distribution of refreshment "drugs" for hong kong rioter: skeleton head composition suspicion "," text content: the power of the overseas network is 12 months and 2 days, the 'repair wind waves' lasting for half a year are not calmed, and a longitudinal storm network with staggered joints of the pan root is arranged behind the 'black screen', so that a doctor who needs to save people falls into a fierce of a rioter. The online organization generation provides diagnosis and treatment for the rioter freely, and someone sends ' medicines ' freely at the riot site, even provides ' purple cloud cream ' facial scrub similar to blood plasma in shape, and the police are convenient to use and put down ' };
s020 is used for carrying out data optimization on the webpage data set to obtain an optimized data set;
in a specific implementation process, the optimization refers to data duplication removal, data deletion and data combination of the content in the webpage data set according to the data format requirement; then removing the stop words of the text according to the Chinese stop dictionary, and performing Chinese word segmentation on the text;
referring to the 2 data records of S010: in data optimization the webpage links are compared first; since the links differ, no deduplication is performed. The webpage titles are then compared; since their contents are identical, a merge operation is performed: the second record is deleted and the text data of the second record is added to the first. The text content in this embodiment is risk-related content, so no deletion is required. Finally the tokenized text content {'overseas', 'net', 'electricity', 'continuous', 'approximately half a year', 'correction', 'wind wave', 'smooth', 'black storm', 'behind the screen', 'wrong section of the disc root', 'longitudinal storm', 'network'} is obtained, and this content is stored in the optimized data set.
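The merge-and-clean behaviour described above can be sketched as follows (a simplified stand-in: the record fields, URLs, tokens and one-word stop list are illustrative; real stop words come from the Chinese stop dictionary, and segmentation is done beforehand):

```python
def optimize(records, stop_words):
    """Compare links (dedup), merge records with the same title by appending
    the second record's text to the first, then drop stop words."""
    merged = {}
    for rec in records:
        key = rec["title"]
        if key not in merged:
            merged[key] = {"link": rec["link"], "title": key,
                           "tokens": list(rec["tokens"])}
        elif merged[key]["link"] != rec["link"]:
            merged[key]["tokens"] += rec["tokens"]   # merge into first record
    for rec in merged.values():
        rec["tokens"] = [t for t in rec["tokens"] if t not in stop_words]
    return list(merged.values())

records = [
    {"link": "https://example.com/a", "title": "t",
     "tokens": ["storm", "of", "network"]},
    {"link": "https://example.com/b", "title": "t",
     "tokens": ["storm", "of", "network"]},
]
out = optimize(records, stop_words={"of"})
```

The two input records differ only in link, so they are merged into one record whose token list is the concatenation of both, with the stop word removed.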
S030 adds classification marks to the optimized data set to obtain a classification data set;
in the specific implementation process, the data is divided into several classes according to its characteristics and classification marks are added; the marking method is manual labeling. In this embodiment, the risk data is divided into 6 risk categories (terrorism-related, violence-related, yellow-related, gambling-related, politics-related and sensitive figures), with corresponding values of 1, 2, 3, 4, 5 and 6 respectively. In actual operation, other categories can be added according to the actual situation. When classification marks are added, all the data can be manually marked to obtain a comparison set of results, whose data order is the same as that of the webpage data set; alternatively, a data column can be appended as the last column of the webpage data set and assigned different values according to the different marks. The first mode is adopted in this embodiment.
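The marking step might look like this (the English category names and the column-append helper below are illustrative assumptions; the second marking mode is shown because it is the easier one to code):

```python
# hypothetical English names for the six risk categories and their values
RISK_LABELS = {"terror": 1, "violence": 2, "porn": 3,
               "gambling": 4, "politics": 5, "sensitive_figure": 6}

def add_label_column(rows, category):
    """Second marking mode: append the category value as a new last column
    of each data row."""
    return [row + [RISK_LABELS[category]] for row in rows]

labeled = add_label_column([["vec_a"], ["vec_b"]], "gambling")
```

Under the first mode the same mapping would instead populate a separate comparison set kept in the same order as the webpage data set.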
S040 converts the data in the classification data set into vector data, and then obtains a vector data set; dividing the vector data set into a training data set and a testing data set;
in the specific implementation process, the data in the classified data set are converted into word vectors by the Word2Vec method, and the dimension of the output word vectors is set to 300;
in this embodiment, the data in the classified data set is { "overseas", "net", "electricity", "sustain", "near half year", "modification example", "wind wave", "still not", "calm", "black storm", "behind screen", "discriminant dislocation", "longitudinal storm", "network" }; after Word2Vec conversion, the content is converted into vector values, i.e. overseas -> 0.46558156056166844, net -> 0.40967053422046074, electricity -> 0.16048151467632454, sustain -> -0.3759397453267568, near half year -> 0.2976097315644177, modification example -> -0.2340424162024618, wind wave -> -0.05165401072757103, still not -> -0.36234206383931733, calm -> -0.04321021772205425, black storm -> 0.3153154888523255, behind screen -> -0.1617250396990808, discriminant dislocation -> 0.37432323030932324, longitudinal storm -> 0.3211303845807112, network -> 0.1264156456484121; the finally obtained vector set is {0.46558156056166844, 0.40967053422046074, 0.16048151467632454, -0.3759397453267568, 0.2976097315644177, -0.2340424162024618, -0.05165401072757103, -0.36234206383931733, -0.04321021772205425, 0.3153154888523255, -0.1617250396990808, 0.37432323030932324, 0.3211303845807112, 0.1264156456484121}.
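The vectorization step can be sketched with a stand-in embedding table (real 300-dimensional vectors would come from a trained Word2Vec model; the seeded random values here only mimic the shape of its output, and the helper names are hypothetical):

```python
import random

def build_toy_embeddings(vocab, dim=300, seed=0):
    """Stand-in for a trained Word2Vec model: one fixed dim-dimensional
    vector per token (randomly generated here, for illustration only)."""
    rng = random.Random(seed)
    return {tok: [rng.uniform(-0.5, 0.5) for _ in range(dim)] for tok in vocab}

def to_vectors(tokens, table):
    """Convert a token list from the classified data set into word vectors,
    skipping out-of-vocabulary tokens."""
    return [table[t] for t in tokens if t in table]

table = build_toy_embeddings(["network", "storm"])
vecs = to_vectors(["network", "storm", "unknown"], table)
```

The lookup is deterministic once the model is trained, so the same token always maps to the same 300-dimensional vector, which is what the per-token values listed above illustrate.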
S050, constructing a plurality of calculation models, and fusing the calculation models to form a first calculation model;
in the specific implementation process, a plurality of calculation models are fused into one calculation model, so that the identification precision can be greatly improved.
In this embodiment, 4 convolutional neural network calculation models are used for fusion, which are respectively:
VggNet model:
the VggNet network model takes a 300x300 webpage text vector as input. The whole network is divided into 5 blocks (convolution groups); each block contains several groups of convolutions, all with 3x3 convolution kernels, and the blocks are connected through max-pooling layers. After the 5 convolution stages are completed, a global average pooling layer follows, then a Dropout layer with a drop rate of 0.5; in addition, a ReLU activation function is applied for nonlinear transformation after each convolution;
xception model:
the Xception network is divided into three modules: an input module, a middle module and an output module. The input of the network is a 300x300 webpage text vector. The input module first performs two convolution operations with 3x3 kernels, each followed by a ReLU activation to add nonlinearity, and then connects depthwise separable convolution units, each containing 2 depthwise separable 3x3 convolutions with ReLU activation followed by max pooling. The middle module consists of 8 identical depthwise separable convolution units, each containing 3 depthwise separable 3x3 convolutions with ReLU activation. The output module is a depthwise separable convolution unit containing 2 depthwise separable 3x3 convolutions with ReLU activation and max pooling, followed by 2 further depthwise separable 3x3 convolutions with ReLU activation; finally a global average pooling layer is connected, followed by a Dropout layer with a drop rate of 0.5;
ResNet model:
the ResNet network structure is composed of 3 conv2_x units (convolution group labeled 2), 4 conv3_x units (convolution group labeled 3), 6 conv4_x units (convolution group labeled 4) and 3 conv5_x units (convolution group labeled 5), where each conv2_x, conv3_x, conv4_x and conv5_x unit includes 3 convolutional layers, with 1x1, 3x3 and 1x1 kernels respectively. The input to the network is a 300x300 webpage text vector; the first layer is a single 7x7 convolution layer followed by 3x3 pooling, after which the 3 conv2_x units, 4 conv3_x units, 6 conv4_x units and 3 conv5_x units are connected. Finally, a Dropout layer with a drop rate of 0.5 is used;
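A quick count of the convolution layers implied by this 3/4/6/3 arrangement of bottleneck units (which matches the standard ResNet-50 configuration) can be sketched as:

```python
# units per stage as described above: conv2_x .. conv5_x
UNITS = {"conv2_x": 3, "conv3_x": 4, "conv4_x": 6, "conv5_x": 3}
LAYERS_PER_UNIT = 3  # the 1x1, 3x3, 1x1 bottleneck convolutions

def total_conv_layers():
    """Count the convolution layers: the initial 7x7 layer plus three
    convolutions in each bottleneck unit."""
    return 1 + LAYERS_PER_UNIT * sum(UNITS.values())

print(total_conv_layers())  # 49 convolution layers
```

The name "ResNet-50" counts these 49 convolutions plus a final fully connected layer; here the patent instead ends the stack with global pooling and Dropout.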
DenseNet model:
the DenseNet model takes a 300x300 webpage text vector as input. The first layer is a single 7x7 convolution layer followed by 3x3 max pooling, after which 3 Dense Blocks (dense convolution groups) and 3 Transition Layers (intermediate layers) are connected, a Transition Layer immediately following each Dense Block. Each Dense Block contains 1x1 and 3x3 convolution operators, and each 3x3 convolution is preceded by a 1x1 convolution operation. The first Dense Block contains 6 1x1 convolutions and 6 3x3 convolutions, and the first Transition Layer contains one 1x1 convolution followed by 2x2 average pooling; the second Dense Block contains 12 1x1 convolutions and 12 3x3 convolutions, and the second Transition Layer contains one 1x1 convolution followed by 2x2 average pooling; the third Dense Block contains 24 1x1 convolutions and 24 3x3 convolutions, and the third Transition Layer contains one 1x1 convolution followed by 2x2 average pooling. Finally, a Dropout layer with a drop rate of 0.5 is used.
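The dense connectivity that gives the Dense Block its name can be quantified with a short sketch: a block with L composite (1x1 + 3x3) layers has L*(L+1)/2 direct feature connections, since every layer receives the feature maps of all preceding layers:

```python
def dense_connections(num_layers):
    """Within a Dense Block every layer takes the concatenated feature
    maps of all preceding layers as input, so a block with L composite
    layers has L*(L+1)/2 direct connections (counting the block input
    as a source)."""
    return num_layers * (num_layers + 1) // 2

# the three blocks above have 6, 12 and 24 composite layers respectively
counts = [dense_connections(n) for n in (6, 12, 24)]
print(counts)  # [21, 78, 300]
```

This quadratic growth in connections (without a matching growth in parameters) is the feature-reuse property that motivates the DenseNet design.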
In the specific implementation process, the plurality of calculation models are trained by a transfer learning method: pre-training models obtained on other large text data sets are loaded into the respective models as initial parameters, the corresponding training operations are then carried out, and the training time is thereby greatly shortened.
In this embodiment, a small section of the parameters is taken as an example, where data of partial parameters loaded in the VggNet model is {0.3582, -0.0283, 0.2607, 0.5190, -0.2221, 0.0665, -0.2586, -0.33112, 0.1927}, data of partial parameters loaded in the ResNet model is {0.1658, 0.1248, -0.1684, -0.5532, -0.4822, 0.0034, -0.1547, 0.2348, 0.1135}, data of partial parameters loaded in the DenseNet model is { -0.2301, 0.1129, 0.6531, -0.5190, 0.0014, -0.3354, 0.4412, 0.3111, -0.0214}, and data of partial parameters loaded in the Xception model is {0.0541, 0.4215, 0.3145, 0.7411, -0.0114, 0.2210, -0.3315, -0.1114, -0.0003}, and these calculation models are imported respectively to construct a first calculation model.
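A minimal sketch of this loading step, using the parameter snippets quoted above as the pretrained initial values; the SingleModel class is a hypothetical stand-in for one convolutional sub-model, not the patented implementation:

```python
# the parameter snippets quoted above, used as illustrative initial values
pretrained = {
    "VggNet":   [0.3582, -0.0283, 0.2607, 0.5190, -0.2221, 0.0665, -0.2586, -0.33112, 0.1927],
    "ResNet":   [0.1658, 0.1248, -0.1684, -0.5532, -0.4822, 0.0034, -0.1547, 0.2348, 0.1135],
    "DenseNet": [-0.2301, 0.1129, 0.6531, -0.5190, 0.0014, -0.3354, 0.4412, 0.3111, -0.0214],
    "Xception": [0.0541, 0.4215, 0.3145, 0.7411, -0.0114, 0.2210, -0.3315, -0.1114, -0.0003],
}

class SingleModel:
    """Hypothetical stand-in for one convolutional sub-model."""
    def __init__(self, name):
        self.name = name
        self.params = None

    def load_pretrained(self, values):
        # transfer learning: copy the pretrained values in as the
        # initial parameters from which fine-tuning will start
        self.params = list(values)

# import each single model and construct the fused first calculation model
fused = [SingleModel(name) for name in pretrained]
for model in fused:
    model.load_pretrained(pretrained[model.name])
```

In a real implementation the values would be full weight tensors restored from checkpoint files rather than nine-element lists.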
S060 importing a training data set into the first calculation model, and obtaining a second calculation model after training;
in the specific implementation process, the data of the training data set is imported into the plurality of single calculation models in batches, the parameters are continuously adjusted through training, and when all the single models meet the requirements, the single-model training is stopped.
In order to avoid the problems of catastrophic forgetting and slow convergence, a differential parameter optimization method is adopted in the training process, so that the parameters of different layers are adjusted with different learning rates; the relevant formulas are as follows:
the update formula of stochastic gradient descent (SGD) for a general model parameter θ is shown below:
θ_t = θ_{t-1} − η·∇_θ J(θ)
where η is the learning rate, ∇_θ denotes the gradient with respect to the model parameters, and J(θ) is the loss function; as θ changes, the loss value J(θ) changes accordingly.
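A minimal sketch of this update rule, applied to a toy objective J(θ) = Σθ_i² whose gradient is 2θ (both the objective and the values are illustrative, not from the patent):

```python
def sgd_step(theta, grad, eta):
    """One SGD update: theta_t = theta_{t-1} - eta * grad_theta J(theta)."""
    return [t - eta * g for t, g in zip(theta, grad)]

# toy objective J(theta) = sum(theta_i ** 2), gradient 2 * theta
theta = [1.0, -2.0]
theta = sgd_step(theta, [2 * t for t in theta], eta=0.1)
# each step shrinks the parameters toward the minimum at the origin
print(theta)
```

Repeating the step drives θ (and hence J(θ)) toward zero, which is the behaviour the text describes: as θ changes, the loss value changes with it.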
For differential parameter optimization, the parameter θ is divided into {θ_1, ..., θ_L}, where θ_1 is the parameter of the model at layer 1 and L is the number of layers of the model; correspondingly we obtain {η_1, ..., η_L}, where η_1 is the learning rate of layer 1. The learning rate of the last layer is first set to η_L and only the last layer is trained; the learning rates of the lower layers are then set according to the following formula;
η_{l-1} = η_l / 3
the discriminative stochastic gradient descent update formula is as follows:
θ_t^l = θ_{t-1}^l − η^l·∇_{θ^l} J(θ)
in this embodiment, a layer-wise discriminative optimization method is used: the maximum learning rate is set to 0.01, the second-layer learning rate to 0.0033, and the third-layer learning rate to 0.0011.
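The layer-wise rates implied by the η_{l-1} = η_l / 3 rule can be sketched as follows; the decay factor of 3 comes from the formula above, while the helper function itself is illustrative:

```python
def layerwise_learning_rates(eta_top, num_layers, decay=3.0):
    """Discriminative fine-tuning schedule: the last (top) layer trains
    at eta_top and every layer below it at the rate of the layer above
    divided by `decay`, i.e. eta_{l-1} = eta_l / 3 in the text."""
    rates = [eta_top]
    for _ in range(num_layers - 1):
        rates.append(rates[-1] / decay)
    return rates  # ordered from the top layer downwards

rates = layerwise_learning_rates(0.01, 3)
# top layer 0.01, then roughly 0.0033 and 0.0011 for the layers below
```

Giving the lower, more general layers smaller rates is what protects the pretrained features from catastrophic forgetting while the top layer adapts quickly.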
And S070, importing the test data set into a second calculation model for testing, and adjusting parameters until the test result reaches the standard to obtain the optimal calculation model.
The test data set is imported into the second calculation model to carry out a test operation, the weight of each single model is obtained through learning, and classification and identification are finally completed according to a given evaluation standard;
in a specific implementation process, the step of calculating the weight includes:
f(x_i) is used to represent the predicted value of the i-th sample in the ensemble model, and x_i = [p_i1, p_i2, p_i3, ..., p_in]^T represents the output probabilities of the i-th sample in each single model, where p_ij = [p_ij1, p_ij2, ..., p_ijn]^T (j = 1, 2, 3, ..., n) represents the output probabilities of the i-th sample in the j-th model — respectively the probability of belonging to the first class, the probability of the second class, ..., and the probability of the n-th class. w = [w_1, w_2, w_3, ..., w_n] represents the weights of the fusion model, with w_n the weight of the n-th model.
f(x_i) = w·x_i + b = Σ_{j=1}^{n} w_j·p_ij + b
Where b is a constant term. Now assuming m samples, the difference between the predicted value f(x_i) and the true label y_i of the i-th sample can be measured by minimizing the MSE; let J(w, b) denote this difference value to be minimized, with the formula as follows:
J(w, b) = (1/m)·Σ_{i=1}^{m} (f(x_i) − y_i)²
in this embodiment there are 4 models in total, i.e. j = 1, ..., 4, and the formulas therefore become
f(x_i) = Σ_{j=1}^{4} w_j·p_ij + b
J(w, b) = (1/m)·Σ_{i=1}^{m} (Σ_{j=1}^{4} w_j·p_ij + b − y_i)²
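A minimal sketch of learning the fusion weights by gradient descent on this MSE objective. The six-sample probability table is hypothetical; its labels are built from the weights reported in this embodiment (0.13, 0.18, 0.38, 0.31) so the fit has a known answer:

```python
def fit_fusion_weights(P, y, lr=0.1, steps=10000):
    """Fit f(x_i) = sum_j w_j * p_ij + b by plain gradient descent on
    J(w, b) = (1/m) * sum_i (f(x_i) - y_i)**2.  P is an m x n table of
    single-model output probabilities, y the true labels.  A sketch of
    the fusion step, not the patented code."""
    m, n = len(P), len(P[0])
    w, b = [0.0] * n, 0.0
    for _ in range(steps):
        residuals = [sum(wj * pij for wj, pij in zip(w, P[i])) + b - y[i]
                     for i in range(m)]
        # gradient of J with respect to each w_j and to b
        for j in range(n):
            w[j] -= lr * (2.0 / m) * sum(residuals[i] * P[i][j] for i in range(m))
        b -= lr * (2.0 / m) * sum(residuals)
    return w, b

# hypothetical per-model probabilities for 6 samples
P = [[0.9, 0.1, 0.2, 0.3], [0.2, 0.8, 0.1, 0.4], [0.3, 0.2, 0.7, 0.1],
     [0.1, 0.3, 0.2, 0.9], [0.5, 0.5, 0.4, 0.2], [0.4, 0.1, 0.6, 0.5]]
true_w = [0.13, 0.18, 0.38, 0.31]
y = [sum(wj * pij for wj, pij in zip(true_w, row)) for row in P]
w, b = fit_fusion_weights(P, y)
```

With enough steps the fitted predictions reproduce the labels to high accuracy; in practice the same minimization would run over the test-set outputs of the four trained networks.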
After the corresponding weight values are obtained through calculation, the obtained model is run and the evaluation standard is applied to the results. When the accuracy rate reaches 80% and the recall rate reaches 85%, the requirements are met, the optimal parameters are obtained, and the optimal calculation network model is thereby obtained; if the accuracy and the recall rate do not meet the requirements, learning continues until they do. The weight values finally obtained by the models in this embodiment are respectively: 0.13 for VggNet, 0.18 for ResNet, 0.38 for DenseNet, and 0.31 for Xception; the optimal calculation model is then obtained.
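The pass criterion (accuracy rate reaching 80%, recall rate reaching 85%) can be checked with a short sketch; the function computes precision and recall for one target class, and the labels below are hypothetical:

```python
def precision_recall(y_true, y_pred, positive=1):
    """Precision = TP / (TP + FP), recall = TP / (TP + FN) for one
    target class -- a sketch of the kind of evaluation standard
    described above, not the patent's exact metric definitions."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# hypothetical labels: 4 true positives, 1 false positive, 0 false negatives
p, r = precision_recall([1, 1, 1, 1, 0, 0], [1, 1, 1, 1, 1, 0])
meets_standard = p >= 0.80 and r >= 0.85
```

If `meets_standard` is false, training would continue as the text describes until both thresholds are satisfied.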
As shown in fig. 2, based on the same inventive concept, the present invention provides a website data identification method, which includes the following steps:
acquiring target webpage data from a target website data source to obtain a webpage data set;
performing data optimization on the webpage data set to obtain an optimized data set;
converting the data in the optimized data set into vector data to obtain a vector data set;
obtaining an optimal calculation model by a method of constructing a multi-model fusion calculation model;
and importing the vector data set into an optimal calculation model to obtain a target result.
Based on the same inventive concept, the invention provides equipment for constructing a multi-model fusion calculation model, which comprises:
a processor;
storage means for storing one or more programs;
when the one or more programs are executed by the one or more processors, the one or more processors are caused to implement the above-described method for constructing a multi-model fusion calculation model.
Based on the same inventive concept, the invention provides website data identification equipment, which comprises:
a processor;
storage means for storing one or more programs;
when the one or more programs are executed by the one or more processors, the one or more processors are caused to implement the above-described website data identification method.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus, and method may be implemented in other ways. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit.
It should be understood that the features of the embodiments and of the claims may be combined with one another to solve the technical problems described herein.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (10)

1. A method of constructing a multi-model fusion computational model, the method comprising the steps of:
acquiring target webpage data from a target website data source to obtain a webpage data set;
performing data optimization on the webpage data set to obtain an optimized data set;
adding a classification mark to the optimized data set to obtain a classified data set;
converting the data in the classified data set into vector data to obtain a vector data set; dividing the vector data set into a training data set and a testing data set;
constructing a plurality of calculation models, and fusing the calculation models to form a first calculation model;
importing a training data set into a first calculation model, and obtaining a second calculation model after training;
and importing the test data set into a second calculation model for testing, and adjusting parameters until the test result reaches the standard to obtain the optimal calculation model.
2. The method of constructing a multi-model fusion computational model of claim 1, wherein, during optimization of the webpage data set, stop words are removed according to a Chinese stop-word dictionary and Chinese word segmentation is performed on the text.
3. The method for constructing a multi-model fusion computing model according to claim 2, wherein the method for converting data into vector data uses a Word2Vec method; the word vector dimension is set to 300.
4. The method for constructing a multi-model fusion computational model according to claim 1, wherein the plurality of computational models are trained using a transfer learning method when training the first network model.
5. The method of claim 4, wherein a differential parameter optimization method is used during model training to adjust parameters of different layers at different learning rates.
6. The method for constructing the multi-model fusion calculation model according to claim 5, wherein after the training of each single model is completed, the weight of each single model to the final model is learned, whether the result meets the requirement is judged according to the evaluation standard, the learning is continued until the result meets the requirement and the optimal weight value is obtained, and then the optimal calculation network model is obtained.
7. The method for constructing a multi-model fusion computational model according to claim 6, wherein the computational formula of the learning weight is as follows:
f(x_i) = w·x_i + b = Σ_{j=1}^{n} w_j·p_ij + b
the difference between the predicted value f(x_i) and the true label y_i of the i-th sample is measured by minimizing the MSE:
J(w, b) = (1/m)·Σ_{i=1}^{m} (f(x_i) − y_i)²
8. the website data identification method is characterized by comprising the following steps of:
acquiring target webpage data from a target website data source to obtain a webpage data set;
performing data optimization on the webpage data set to obtain an optimized data set;
converting the data in the optimized data set into vector data to obtain a vector data set;
obtaining an optimal computational model by any one of the methods of constructing a multi-model fusion computational model as claimed in claims 1 to 7;
and importing the vector data set into an optimal calculation model to obtain a target result.
9. An apparatus for constructing a multi-model fusion computational model, the apparatus comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor when executing the program implementing any of the above methods for constructing a multi-model fusion computational model as claimed in claims 1 to 7.
10. Website data identification apparatus, characterized in that the apparatus comprises a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the website data identification method according to claim 8 when executing the program.
CN201911266182.6A 2019-12-11 2019-12-11 Method and equipment for constructing multi-model fusion calculation model and method and equipment for identifying website data Pending CN111144453A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911266182.6A CN111144453A (en) 2019-12-11 2019-12-11 Method and equipment for constructing multi-model fusion calculation model and method and equipment for identifying website data

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911266182.6A CN111144453A (en) 2019-12-11 2019-12-11 Method and equipment for constructing multi-model fusion calculation model and method and equipment for identifying website data

Publications (1)

Publication Number Publication Date
CN111144453A true CN111144453A (en) 2020-05-12

Family

ID=70518072

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911266182.6A Pending CN111144453A (en) 2019-12-11 2019-12-11 Method and equipment for constructing multi-model fusion calculation model and method and equipment for identifying website data

Country Status (1)

Country Link
CN (1) CN111144453A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112580355A (en) * 2020-12-30 2021-03-30 中科院计算技术研究所大数据研究院 News information topic detection and real-time aggregation method
CN112656431A (en) * 2020-12-15 2021-04-16 中国科学院深圳先进技术研究院 Electroencephalogram-based attention recognition method and device, terminal equipment and storage medium
CN117421986A (en) * 2023-11-02 2024-01-19 中国地质调查局西安地质调查中心(西北地质科技创新中心) Automatic extraction method of geological disaster slope unit

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160071010A1 (en) * 2014-05-31 2016-03-10 Huawei Technologies Co., Ltd. Data Category Identification Method and Apparatus Based on Deep Neural Network
CN106021410A (en) * 2016-05-12 2016-10-12 中国科学院软件研究所 Source code annotation quality evaluation method based on machine learning
CN108199951A (en) * 2018-01-04 2018-06-22 焦点科技股份有限公司 A kind of rubbish mail filtering method based on more algorithm fusion models
CN110119689A (en) * 2019-04-18 2019-08-13 五邑大学 A kind of face beauty prediction technique based on multitask transfer learning

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
DING XIN: "Image Classification Based on Feature Optimization of Deep Convolutional Networks", China Masters' Theses Full-Text Database, Information Science and Technology *


Similar Documents

Publication Publication Date Title
Alfarisy et al. Deep learning based classification for paddy pests & diseases recognition
Hatami et al. Classification of time-series images using deep convolutional neural networks
Nguyen et al. Damage assessment from social media imagery data during disasters
CN112711953B (en) Text multi-label classification method and system based on attention mechanism and GCN
CN110909164A (en) Text enhancement semantic classification method and system based on convolutional neural network
Singh et al. A study of moment based features on handwritten digit recognition
Tang et al. Multi-label patent categorization with non-local attention-based graph convolutional network
CN104573130B (en) The entity resolution method and device calculated based on colony
CN107683469A (en) A kind of product classification method and device based on deep learning
CN111143838B (en) Database user abnormal behavior detection method
CN111931505A (en) Cross-language entity alignment method based on subgraph embedding
CN109190698B (en) Classification and identification system and method for network digital virtual assets
CN111144453A (en) Method and equipment for constructing multi-model fusion calculation model and method and equipment for identifying website data
Mahalakshmi et al. Ensembling of text and images using deep convolutional neural networks for intelligent information retrieval
Praveena et al. [Retracted] Effective CBMIR System Using Hybrid Features‐Based Independent Condensed Nearest Neighbor Model
Chatterjee et al. A clustering‐based feature selection framework for handwritten Indic script classification
Buvana et al. Content-based image retrieval based on hybrid feature extraction and feature selection technique pigeon inspired based optimization
Tian et al. Image classification based on the combination of text features and visual features
CN111582506A (en) Multi-label learning method based on global and local label relation
CN110377690A (en) A kind of information acquisition method and system based on long-range Relation extraction
Chen et al. Malicious URL detection based on improved multilayer recurrent convolutional neural network model
Abir et al. Bangla handwritten character recognition with multilayer convolutional neural network
Magotra et al. Malaria diagnosis using a lightweight deep convolutional neural network
Yuan et al. CSCIM_FS: Cosine similarity coefficient and information measurement criterion-based feature selection method for high-dimensional data
Bi et al. Judicial knowledge-enhanced magnitude-aware reasoning for numerical legal judgment prediction

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 450000 8 / F, creative island building, no.6, Zhongdao East Road, Zhengdong New District, Zhengzhou City, Henan Province

Applicant after: China Science and technology big data Research Institute

Address before: 450000 8 / F, creative island building, no.6, Zhongdao East Road, Zhengdong New District, Zhengzhou City, Henan Province

Applicant before: Big data Research Institute Institute of computing technology Chinese Academy of Sciences

CB02 Change of applicant information
RJ01 Rejection of invention patent application after publication

Application publication date: 20200512

RJ01 Rejection of invention patent application after publication