CN108364028A - Automatic classification method for internet websites based on deep learning - Google Patents
Automatic classification method for internet websites based on deep learning
- Publication number
- CN108364028A CN108364028A CN201810181807.8A CN201810181807A CN108364028A CN 108364028 A CN108364028 A CN 108364028A CN 201810181807 A CN201810181807 A CN 201810181807A CN 108364028 A CN108364028 A CN 108364028A
- Authority
- CN
- China
- Prior art keywords
- website
- vector
- data
- output
- deep learning
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2411—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/958—Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- Databases & Information Systems (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Artificial Intelligence (AREA)
- Evolutionary Computation (AREA)
- Evolutionary Biology (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Bioinformatics & Computational Biology (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The present invention relates to a deep-learning-based method for automatically classifying internet websites. Original description information for a large number of websites is collected from DNS server logs to form a website dataset, which is preprocessed and manually labelled. A high-dimensional feature vector is then extracted for each website as input to a deep learning model, and each website's category label is converted into a category vector. With the feature vectors as input and the category vectors as supervised output, an LSTM-based recurrent neural network deep learning model is trained using the Adam gradient-descent optimizer. A SoftMax regression layer is added after the trained LSTM recurrent neural network; the website category corresponding to the dimension with the largest probability in the resulting probability-distribution vector is taken as the predicted category, and the predicted categories are compared with the websites' true categories to obtain the classification accuracy.
Description
Technical field
The present invention relates to an automatic classification method for internet websites based on deep learning, and belongs to the technical field of artificial intelligence.
Background technology
In recent years the internet has grown explosively, and the pace of China's internet development has drawn worldwide attention. The 40th China Statistical Report on Internet Development [1], issued by the China Internet Network Information Center in July 2017, reported that by June 2017 China had 751 million internet users, with 19.92 million new users in the preceding half year; internet penetration had reached 54.3%; and the total number of Chinese websites had reached 5.06 million, of which 2.7 million were under the ".CN" domain. User base, penetration rate and website count are all trending upward.
Faced with such a huge number of websites, the need for and importance of website classification are increasingly apparent. Classification plays a key role in many information-management and retrieval tasks: targeted crawling by web crawlers, development and maintenance of web directories, topic-oriented link analysis, targeted delivery of online advertising, and analysis of network topology. Website classification can also help search engines improve accuracy and result quality, and operators and internet service providers can use it to analyse user access behaviour and optimise resource allocation.
Deep learning [2] is a current research hotspot and has produced outstanding results in artificial intelligence and many other fields. It is a branch of machine learning; today the term usually refers to deep neural networks. Its representations are inspired by neuroscience: it attempts to simulate the information transfer and potential fluctuations of human neurons, aiming at algorithms that rival human performance on certain recognition tasks.
Websites on the internet can be divided into many categories: "Taobao", for example, can be classed as "e-commerce" and "Ctrip" as "travel and transportation". How to assign websites to categories accurately is an important and meaningful research problem. Most existing work, domestic and international, builds classification models on publicly available free datasets, in which a certain number of websites have been collected and hand-labelled with fixed categories. Several representative approaches are reviewed below.
R. Rajalakshmi and C. Aravindan proposed a website classification algorithm based on naive Bayes [3]. Their algorithm uses the website URL as the basis for classification and divides websites into 9 categories: arts, business, computers, games, health, home, news, recreation and reference. They extracted 855,939 URLs in total from the ODP (Open Directory Project) dataset as training data and built a 9-class model with the naive Bayes machine learning algorithm.
Min-Yen Kan proposed a web page classification algorithm based on SVM (Support Vector Machine) [4]. The algorithm uses the page URL as the basis for classification and divides pages into 4 categories: course, faculty, project and student. The work is based on the ILP-98 WebKB dataset, which contains 4,167 web pages, and builds an SVM model with that machine learning algorithm.
Chen Qianxi and Fan Lei proposed a web page classification algorithm based on SAE (sparse autoencoder) [5]. Their algorithm uses website titles, keywords and description text as the basis for classification and divides websites into 8 categories: leisure and entertainment, online shopping, network services, business and economy, daily-life services, education and culture, blogs and forums, and general websites. Their dataset was collected by crawling 200 pages from each category of the "First Classified Directory" website, 1,600 pages in total, and was modelled with a sparse autoencoder.
Jia M. Q., Wang Z. M. and Chen G. proposed a website classification algorithm based on decision trees and HTTP behaviour analysis [6]. Their algorithm uses the HTTP behaviour generated while users visit a website as the classification basis and divides websites into 3 classes: news, resource sharing, and community exchange. Their data came from user access records of HERNET (Henan Province's branch of the China Education and Research Network), from which HTTP behaviours were extracted as the dataset and modelled with a decision-tree machine learning method.
Hou Cuiqin and Jiao Licheng proposed a graph-based co-training website classification algorithm, GCo-training [7]. Their study is based on the WebKB dataset. Within the co-training framework the algorithm iteratively trains a semi-supervised classifier on a graph constructed from hyperlink information and a Bayes classifier on text features; the two help each other and continually improve, after which the Bayes classifier can be used to predict the categories of large amounts of unseen data.
Sun Jingchao proposed a web page classification method based on machine learning [8]. The study uses multiple page features, such as the URL and host name, as the classification basis and divides pages into two categories, benign and malicious. A large number of benign and malicious URLs were collected from Alexa and PhishTank as the dataset, and several machine learning methods, including support vector machines, naive Bayes and logistic regression, were used for modelling.
Zhang Jingshen and Ren Fangying proposed a website classification method and apparatus (patent publication No. CN106250402A). The method uses the first label information and first page content of the website to be classified and, according to a preset label dictionary, determines the website category corresponding to the first label information.
Wang Yong proposed a website classification method and system (patent publication No. CN101458713A). The method counts, per website, the search keywords users enter and the URLs they click; uses these statistics to determine the set of keywords pointing to the website to be classified and builds a keyword-set vector for it; determines seed websites of known type and builds keyword-set vectors for them; computes the similarity between the website to be classified and the seed websites from their vectors; and assigns the website's type according to the similarity.
Existing traditional website classification methods share some common shortcomings, mainly in four respects: the classification basis, the methods used, the data used, and the target categories.
In short, the deficiencies of the prior art are: the training data are neither large nor recent enough; the classification basis carries too little information; the methods and models used are ill-suited to classifying high-dimensional samples, with insufficient capacity for feature extraction and learning; and the number of categories is too small to meet the demand for fine-grained multi-class classification.
[1] The 40th China Statistical Report on Internet Development, China Internet Network Information Center, 2017-08-04.
[2] LeCun Y., Bengio Y., Hinton G. Deep learning. Nature, 2015, 521(7553): 436-444.
[3] Rajalakshmi R., Aravindan C. Naive Bayes approach for website classification. In: Das V. V., Thomas G., Lumban Gaol F. (eds.), AIM 2011, CCIS 147, pp. 323-326, 2011.
[4] Kan M.-Y. Web page categorization without the web page. WWW 2004, May 17-22, 2004, New York, NY, USA. ACM 1-58113-912-8/04/0005.
[5] Chen Qianxi, Fan Lei. Webpage classification based on deep learning algorithm. Microcomputer Applications, 2016, 32(2): 25-28.
[6] Jia M. Q., Wang Z. M., Chen G. Research on website classification based on user HTTP behavior analysis. Computer Engineering and Design, 2010(2).
[7] Hou Cuiqin, Jiao Licheng. Graph based co-training algorithm for web page classification. Acta Electronica Sinica, 2009, 37(10): 2174-2180.
[8] Sun Jingchao. A classification method of web page using machine learning. Netinfo Security, 2017(9): 45-48.
Summary of the invention
The technical problem solved by the present invention is to overcome the deficiencies of the prior art and provide a deep-learning-based automatic classification method for internet websites. Websites are classified from their high-dimensional vector representations by an LSTM recurrent neural network deep learning model, improving classification accuracy and providing a more effective and practical classification method for the modern information retrieval field.
The technical solution that achieves the object of the invention is an internet website classification method based on an LSTM recurrent neural network deep learning model, comprising the following steps:
Step 1: collect the original description information of a large number of internet websites from the live-network DNS server logs as the website dataset, preprocess it and label it manually; then extract for each website the high-dimensional feature vector used as input to the deep learning model, assign each website its category label, and convert the label into a category vector. The specific sub-steps are:
Step 1-1: preprocess the original description information of the internet websites. Preprocessing includes removing invalid entries (data entries for websites returning 404 not-found errors, internal server errors, or that have ceased service) and removing stop words (e.g. "的", "a" and other meaningless tokens and special symbols) from the description text. All preprocessed websites are then manually labelled into 15 categories: e-commerce; science and technology; travel and transportation; business and economy; social networking; daily-life services; sports and fitness; medical and health; literature and art; news media; entertainment; politics and law; computer and network; education and training; society and culture.
Step 1-2: segment the original description text of every website in the preprocessed dataset into words, arrange all segmentation results in order as input data, and train a Word2Vec word embedding model. Once trained, the model maps every word in the dataset into a 1000-dimensional space: given a word as input, it outputs the 1000-dimensional vector representation of that word.
Step 1-3: for each website in the dataset, compute the TF-IDF value of every word in its segmented original description. Given the high-dimensional vector representation of each word and its TF-IDF value within the website, the website's high-dimensional feature vector can be computed with the following formula; since each word is represented by a 1000-dimensional vector, the resulting website feature vector is also 1000-dimensional:
IV_w = Σ_i TF_IDF(t_i, w) × Embedding_Vector(t_i, w)
where IV_w is the high-dimensional feature vector of website w, t_i is the i-th word in website w, TF_IDF(t_i, w) is the term frequency-inverse document frequency of the word in the website data, and Embedding_Vector(t_i, w) is the word's high-dimensional vector representation (obtained from the Word2Vec model);
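Reading the feature vector as a TF-IDF-weighted sum of word embeddings (the original formula image is not legible here, so the exact aggregation is an assumption), the computation can be sketched as:

```python
import math

def tf_idf(term, doc_tokens, corpus):
    """TF-IDF of a term within one website's token list, against the whole corpus."""
    tf = doc_tokens.count(term) / len(doc_tokens)
    df = sum(1 for doc in corpus if term in doc)       # documents containing the term
    idf = math.log(len(corpus) / df)
    return tf * idf

def website_vector(doc_tokens, corpus, embeddings, dim):
    """IV_w = sum_i TF_IDF(t_i, w) * Embedding_Vector(t_i): weighted sum of embeddings."""
    iv = [0.0] * dim
    for term in set(doc_tokens):
        weight = tf_idf(term, doc_tokens, corpus)
        for j, component in enumerate(embeddings[term]):
            iv[j] += weight * component
    return iv
```

In the patent `dim` is 1000; a 2-dimensional toy embedding is enough to check the arithmetic.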
Step 1-4: for each website in the dataset, generate its corresponding output category vector OV_w. The dimensionality of this vector equals the total number of categories; since the present invention divides websites into 15 categories, every output category vector has 15 dimensions, in which the dimension corresponding to the website's true category is 1 and all other dimensions are 0.
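The one-hot category vector of step 1-4 is straightforward, with 15 classes as listed in step 1-1:

```python
NUM_CLASSES = 15   # the 15 website categories listed in step 1-1

def category_vector(class_index, num_classes=NUM_CLASSES):
    """OV_w: 1 in the dimension of the true category, 0 elsewhere."""
    ov = [0] * num_classes
    ov[class_index] = 1
    return ov
```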
Step 2: take the website feature vectors obtained in step 1 as the input of the deep learning model and the website category vectors as its output, and use the Adam gradient-descent optimizer to train, with supervision, an LSTM-based recurrent neural network deep learning model that extracts abstract category features of websites. For training, 70% of the website dataset is randomly selected as training data; the remaining 30% is used for validation and testing. The specific sub-steps are:
Step 2-1: randomly select 70% of the internet website dataset as training data for supervised training of the LSTM recurrent neural network, with the remaining 30% as test data for evaluating the deep learning model. Each training sample consists of 2 parts, an input vector and an output vector: the input vector is the 1000-dimensional high-dimensional website feature vector and the output vector is the 15-dimensional low-dimensional website category vector.
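The 70/30 split of step 2-1 can be sketched as follows; the seeded shuffle is an implementation choice, not specified by the patent:

```python
import random

def split_dataset(samples, train_frac=0.7, seed=42):
    """Randomly split samples into training and test sets (70% / 30% by default)."""
    shuffled = samples[:]                      # copy so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_frac)
    return shuffled[:cut], shuffled[cut:]
```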
Step 2-2: the overall structure of the LSTM recurrent neural network model is 1 input layer, 1 output layer and a stack of 3 hidden layers containing many LSTM neurons; input, hidden and output layers are fully connected. Each LSTM neuron in the hidden layers consists of an input gate, an output gate, a forget gate, a memory cell, a mapping function (y = Wx + b, where y is the neuron output, x the neuron input, and W and b the weight and bias matrices) and an activation function (tanh). During model training the Adam optimizer minimises the loss function by gradient descent, continually updating, through forward and backward passes, the parameters of each neuron's mapping function and the values held in the memory cells. The training data are fed to the network in order; the memory cell stores intermediate information produced by the neuron in previous training steps, and this information serves as "experience" for the current and later training steps. Whether the experience a neuron generates is stored in the memory cell is controlled by the input, output and forget gates: when the input gate opens, the current step's intermediate information is written to the memory cell; when the output gate opens, previously stored intermediate data participate in the current step; when the forget gate opens, the memory cell discards previously stored data. The advantage of using an LSTM recurrent neural network is that experience from earlier training can be fused with the current input to train the model, improving its predictive ability for classification.
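A single time step of the gate mechanism described above can be sketched in NumPy. This follows the standard LSTM formulation (sigmoid gates, tanh activation); the weight names and dimensions are illustrative, not the patent's exact internals:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, p):
    """One LSTM time step: input/forget/output gates plus the memory cell."""
    z = np.concatenate([h_prev, x])            # previous hidden state and current input
    i = sigmoid(p["Wi"] @ z + p["bi"])         # input gate: admit new information
    f = sigmoid(p["Wf"] @ z + p["bf"])         # forget gate: discard old memory
    o = sigmoid(p["Wo"] @ z + p["bo"])         # output gate: expose memory
    g = np.tanh(p["Wg"] @ z + p["bg"])         # candidate memory content
    c = f * c_prev + i * g                     # memory cell update
    h = o * np.tanh(c)                         # hidden output
    return h, c
```

Stacking three such layers and adding fully connected input and output layers yields the architecture the patent describes.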
Step 2-3: the Adam gradient-descent optimizer updates the parameters of the entire neural network model by optimising the loss function, which is defined as the cross entropy:
Loss = − Σ_x q(x) × log p(x)
where p(x) is the value of each dimension of the output-layer output vector and q(x) is the value of each dimension of the true category vector.
Step 3: add a SoftMax regression layer after the LSTM recurrent neural network trained in step 2. The abstract website-category feature vector output by the LSTM network is the input to the SoftMax layer, whose output is the class probability distribution vector of the website. The website category corresponding to the dimension with the largest probability in this distribution vector is taken as the website's category, and the predicted categories are compared with the websites' true categories to obtain the classification accuracy. The specific sub-steps are:
Step 3-1: the output vector of the LSTM network's output layer is a 15-dimensional low-dimensional vector. Applying the SoftMax function to it maps every dimension into the interval between 0 and 1 and makes all dimensions sum to 1. The SoftMax function is defined as
P_j = exp(O_j) / Σ_k exp(O_k)
where P_j is the value of the j-th dimension of the vector after SoftMax, O_j is the value of the j-th dimension of the output vector, and O_k is the value of its k-th dimension. The final vector is therefore:
Output = SoftMax(OV_w)
Step 3-2: every dimension of the Output vector from step 3-1 lies between 0 and 1 and the dimensions sum to 1; each dimension represents one website category, and the category corresponding to the dimension with the largest value is the website's predicted category. Classification accuracy is defined as the number of correct predictions divided by the total number of predictions.
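Steps 3-1 and 3-2 amount to a SoftMax followed by an argmax and an accuracy count, sketched here in plain Python:

```python
import math

def softmax(o):
    m = max(o)                                 # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in o]
    s = sum(exps)
    return [e / s for e in exps]

def predicted_class(o):
    """Index of the largest probability: the predicted website category."""
    probs = softmax(o)
    return probs.index(max(probs))

def accuracy(pred_labels, true_labels):
    correct = sum(1 for p, t in zip(pred_labels, true_labels) if p == t)
    return correct / len(true_labels)
```

Subtracting the maximum before exponentiating does not change the result, since it cancels in the ratio.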
The advantages of the present invention over the prior art are as follows:
(1) The title, keywords and description of a website are combined via Word2Vec and TF-IDF into a 1000-dimensional feature vector with high information content. Because of its high dimensionality the vector can encode most of the information in the website's original description text; using it as the classification basis avoids the information shortage that arises when URLs or HTTP behaviour are used instead.
(2) A deep learning approach: building an LSTM recurrent neural network deep learning model for automatic website classification overcomes the poor performance of common machine learning methods on high-dimensional, large-sample, multi-class problems.
(3) Considering the importance of the modelling data to practical utility, a large amount of website data collected from live-network DNS servers and manually labelled is used for training and testing, solving the inaccuracy and impracticality caused in earlier studies by free public datasets that may be out of date.
(4) Regarding the classification basis: traditional approaches mostly use the website URL, but URLs are highly random and uncertain and contain too little information to be a credible basis. Other methods use the HTTP behaviour of users visiting a website, but HTTP behaviour is often too abstract relative to the site's actual content (requests for video or images, for instance, occur in many categories), making fine-grained classification difficult. The present invention instead extracts the site's title, keywords and description text as the classification basis. Webmasters write this text carefully for search-engine optimisation, so it tends to describe the site's specific business precisely; based on these textual descriptions, credible website classification can be achieved while preserving as much information as possible.
(5) Regarding the classification method: traditional approaches mostly use machine learning methods such as support vector machines and naive Bayes, which, owing to inherent limitations, struggle to achieve good results on high-dimensional, large-sample, multi-class problems. The present invention classifies websites with deep learning, building deep neural networks that simulate the cognitive patterns of the human brain; with their large number of neurons, such networks have strong feature-extraction and learning ability and are well suited to information-rich website classification tasks.
(6) Regarding the data: traditional approaches mostly use free public datasets, which over time suffer from insufficient volume, staleness and errors. The present invention instead draws on an operator's real live-network user access logs, extracts all website domain names from them, and uses these domains as seed data to collect and manually label a large amount of web data. The timeliness and authenticity of these data ensure the credibility and usability of the final model.
(7) Regarding the categories: traditional methods usually classify websites into few categories, such as malicious versus benign, or news, resource sharing and community exchange, generally fewer than 10, so websites cannot be assigned to specific fine-grained categories. The present invention uses deep learning to automatically classify websites into 15 categories: e-commerce; science and technology; travel and transportation; business and economy; social networking; daily-life services; sports and fitness; medical and health; literature and art; news media; entertainment; politics and law; computer and network; education and training; society and culture. This diversity of categories makes classification finer and more accurate.
Description of the drawings
Fig. 1 Flow chart of the deep-learning-based automatic website classification method;
Fig. 2 Examples of the title, keywords and description text of the Sina and Tencent Video websites;
Fig. 3 Part of the website data stored by the data-crawling program after preprocessing;
Fig. 4 Schematic of the LSTM recurrent neural network deep learning model;
Fig. 5 Schematic of the internal structure of an LSTM neuron;
Fig. 6 Related results and test figures of the invention.
Detailed description of the embodiments
With reference to Fig. 1, the present invention is an internet website classification method based on an LSTM recurrent neural network deep learning model, comprising the following steps:
Step 1: with reference to Fig. 2, collect the original description information of a large number of internet websites from the live-network DNS server logs as the website dataset, preprocess it and label it manually; then extract for each website the high-dimensional feature vector used as input to the deep learning model, assign each website its category label, and convert the label into a category vector. The specific sub-steps are:
Step 1-1: with reference to Fig. 3, preprocess the original description information. Preprocessing includes removing invalid entries (data entries for websites returning 404 not-found errors, internal server errors, or that have ceased service) and removing stop words (e.g. "的", "a" and other meaningless tokens and special symbols) from the description text. All preprocessed websites are manually labelled into 15 categories: e-commerce; science and technology; travel and transportation; business and economy; social networking; daily-life services; sports and fitness; medical and health; literature and art; news media; entertainment; politics and law; computer and network; education and training; society and culture.
Step 1-2: segment the original description text of every website in the preprocessed dataset into words, arrange all segmentation results in order as input data, and train a Word2Vec word embedding model. Once trained, the model maps every word in the dataset into a 1000-dimensional space: given a word as input, it outputs the 1000-dimensional vector representation of that word.
Step 1-3: for each website in the dataset, compute the TF-IDF value of every word in its segmented original description. Given the high-dimensional vector representation of each word and its TF-IDF value within the website, the website's high-dimensional feature vector can be computed with the following formula; since each word is represented by a 1000-dimensional vector, the resulting website feature vector is also 1000-dimensional:
IV_w = Σ_i TF_IDF(t_i, w) × Embedding_Vector(t_i, w)
where IV_w is the high-dimensional feature vector of website w, t_i is the i-th word in website w, TF_IDF(t_i, w) is the term frequency-inverse document frequency of the word in the website data, and Embedding_Vector(t_i, w) is the word's high-dimensional vector representation (obtained from the Word2Vec model);
Step 1-4: for each website in the dataset, generate its corresponding output category vector OV_w. The dimensionality of this vector equals the total number of categories; since this patent divides websites into 15 categories, every output category vector has 15 dimensions, in which the dimension corresponding to the website's true category is 1 and all other dimensions are 0.
Step 2: take the obtained website feature vectors as the input of the deep learning model and the website category vectors as its output, and use the Adam gradient-descent optimizer to train, with supervision, an LSTM-based recurrent neural network deep learning model that extracts abstract category features of websites. For training, 70% of the website dataset is randomly selected as training data; the remaining 30% is used for validation and testing. The specific sub-steps are:
Step 2-1: randomly select 70% of the internet website dataset as training data for supervised training of the LSTM recurrent neural network, with the remaining 30% as test data for evaluating the deep learning model. Each training sample consists of 2 parts, an input vector and an output vector: the input vector is the 1000-dimensional high-dimensional website feature vector and the output vector is the 15-dimensional low-dimensional website category vector.
2-2. With reference to Fig. 4, the overall LSTM recurrent neural network model consists of one input layer, one output layer, and a stack of three hidden layers containing a large number of LSTM neurons; the input layer, hidden layers, and output layer are fully connected. With reference to Fig. 5, each LSTM neuron in the hidden layers consists of the following parts: an input gate (Input Gate), an output gate (Output Gate), a forget gate (Forget Gate), a memory cell (Memory), a mapping function y = Wx + b (where y is the neuron's output value, x is the neuron's input value, and W and b are the weight and bias matrices, respectively), and an activation function (Tanh).
During model training, the Adam optimizer minimizes the loss function by gradient descent; this is achieved by continuously updating, through forward and backward propagation, the parameter values in each neuron's mapping function and the values stored in its memory cell. The data in the training set are used to train the neural network in sequence, and the memory cell stores the intermediate information produced by the neuron in the previous training step; this intermediate information serves as "experience" data for the current and subsequent training steps. Whether the experience data produced by a neuron is stored in the memory cell is controlled by the input, output, and forget gates: when the input gate is open, the intermediate information of the current training step is written to the memory cell; when the output gate is open, intermediate data saved in previous training steps participates in the current training step; when the forget gate is open, the memory cell discards previously saved intermediate data. The advantage of the LSTM recurrent neural network is that experience data from previous training steps can be fused with the input data of the current step during training, improving the model's classification accuracy.
2-3. The Adam gradient descent optimizer updates the parameters of the entire neural network model by optimizing the loss function, which is defined as the cross entropy (Cross Entropy):

Loss = -Σ_x q(x) · log p(x)

where p(x) is each dimension of the output-layer output vector and q(x) is the corresponding dimension of the true category vector.
Step 3. Append a SoftMax regression layer after the LSTM recurrent neural network deep learning model trained in the second step. The abstract website category feature vector output by the LSTM network serves as the input of the SoftMax layer, whose output is the category probability distribution vector of the website. The website category corresponding to the dimension with the largest probability value in this distribution vector is taken as the website's category, and the predicted categories are compared against the websites' actual categories to obtain the classification accuracy for internet websites. The specific steps are:
3-1. The output vector of the output layer of the LSTM recurrent neural network deep learning model is a 15-dimensional low-dimensional vector. Applying the SoftMax function to this vector maps every dimension to a value between 0 and 1 and makes all dimensions sum to 1. The SoftMax function is defined as follows, where P_j is the j-th dimension of the vector after SoftMax processing, O_j is the j-th dimension of the output vector, and O_k is the k-th dimension of the output vector:

P_j = e^(O_j) / Σ_k e^(O_k)

The final vector is then:
Output = SoftMax(OV_w)
3-2. As stated in 3-1, every dimension of Output lies between 0 and 1 and all dimensions sum to 1. Each dimension represents one website category, and the category corresponding to the dimension with the largest value is the predicted category of the website. The classification accuracy is defined as the ratio of correct predictions to all predictions.
The present invention is described in further detail below with reference to implementation examples.
With reference to Fig. 1, the present invention is an internet website classification method based on an LSTM recurrent neural network deep learning model. It first extracts the high-dimensional feature vector of each website, then supervises the training of the deep learning model with the Adam gradient descent optimizer, and finally classifies internet websites using the trained deep learning model and SoftMax regression, obtaining the classification accuracy. The method comprises the following steps:
Step 1. With reference to Fig. 2, collect the original description information of a large number of existing internet websites from DNS server logs as the website dataset, preprocess it and label it manually, then extract the high-dimensional feature vector representation of each website for input to the deep learning model, add the corresponding category label to each website, and convert the label into a category vector. The specific steps are:
Step 1-1. With reference to Fig. 3, preprocess the original description information of the internet websites. Preprocessing includes removing invalid entries (e.g., data entries corresponding to 404 page-not-found errors, internal server errors, and websites that have ceased operation) and removing stop words from the original description information (e.g., stop words such as "的" and "a", as well as meaningless special symbols). The preprocessed websites are then manually labelled and divided into 15 categories: e-commerce, science and technology, travel and transportation, business and economics, social networks, lifestyle services, sports and fitness, healthcare, literature and art, news media, entertainment, politics and law, computer networks, education and training, and society and culture;
In the implementation example, assume there are 100,000 original website description records in total; after the preprocessing described above, 95,000 valid records remain. These valid records are segmented into words and manually labelled into the 15 categories.
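As a rough illustration of the step 1-1 filtering, the sketch below drops invalid entries and strips stop words; the marker strings and stop-word list are illustrative stand-ins, not the patent's actual lists:

```python
# Hypothetical sketch of step 1-1 preprocessing: drop invalid entries
# (404 / server-error pages) and strip stop words from what remains.
STOP_WORDS = {"的", "a", "the"}
INVALID_MARKERS = ("404", "server error", "stopped service")

def preprocess(entries):
    # Remove entries whose description indicates a dead or broken site.
    valid = [e for e in entries if not any(m in e.lower() for m in INVALID_MARKERS)]
    # Remove stop words from the surviving descriptions.
    return [" ".join(w for w in e.split() if w not in STOP_WORDS) for e in valid]

clean = preprocess(["a news portal site", "404 not found error page"])
```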
Step 1-2. Apply word segmentation to all the preprocessed original website description information in the dataset, then arrange all segmentation results in order as input data to train a Word2Vec word embedding model. Once trained, the model maps every word in the website dataset into a 1000-dimensional space: when the Word2Vec model is used, a word is given as input and the 1000-dimensional vector it outputs serves as that word's representation;
In the implementation example, the segmentation results of the 95,000 valid records are used to train the Word2Vec model; after training, the model returns the high-dimensional vector representation of any input word.
Step 1-3. For each website in the dataset, compute the TF-IDF value of every word in its segmented original description information. Given the high-dimensional vector representation of each word and its TF-IDF value in the website, the high-dimensional feature vector representation of each website in the dataset can be computed with the following formula. Since each word is represented by a 1000-dimensional high-dimensional vector, the website feature vector produced by the formula is also 1000-dimensional:

IV_w = Σ(i=1..N) TF_IDF(t_i, w) × Embedding_Vector(t_i)

where IV_w is the high-dimensional feature vector representation of website w, i indexes the words obtained after segmenting the website, t_i is the i-th word in website w, N is the total number of words obtained after segmentation, TF_IDF(t_i, w) is the term frequency-inverse document frequency of the word in the website data, and Embedding_Vector(t_i) is the word's high-dimensional vector representation (obtained from the Word2Vec model).
In the implementation example, the formula above converts each website into a 1000-dimensional website vector representation, yielding the input data for deep learning model training and testing: 95,000 input vectors in total.
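A NumPy sketch of the website-vector formula from step 1-3; the 4-dimensional random embeddings are a toy stand-in for the trained 1000-dimensional Word2Vec vectors:

```python
import numpy as np

def tf_idf(term, doc, docs):
    # Term frequency in this document times inverse document frequency.
    tf = doc.count(term) / len(doc)
    df = sum(1 for d in docs if term in d)
    return tf * np.log(len(docs) / df)

docs = [["news", "sports", "news"], ["shop", "buy", "shop", "news"]]
rng = np.random.default_rng(0)
# Toy 4-dim embeddings standing in for the 1000-dim Word2Vec vectors.
emb = {w: rng.normal(size=4) for w in {"news", "sports", "shop", "buy"}}

def site_vector(doc, docs):
    # IV_w = sum_i TF-IDF(t_i, w) * Embedding_Vector(t_i)
    return sum(tf_idf(t, doc, docs) * emb[t] for t in doc)

iv = site_vector(docs[0], docs)
```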
Step 1-4. For each website in the dataset, generate its corresponding output category vector OV_w. The dimension of this vector equals the total number of categories; the present invention divides internet websites into 15 categories, so all output category vectors are 15-dimensional, with the dimension corresponding to the website's true category set to 1 and all other dimensions set to 0;
In the implementation example, each website is thus converted into a 15-dimensional website category representation, yielding the output data for deep learning model training and testing: 95,000 output vectors in total.
All input and output data together form a dataset of 95,000 data pairs for training and testing the deep learning model.
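A minimal sketch of the one-hot output encoding described in step 1-4:

```python
NUM_CLASSES = 15  # the patent's 15 website categories

def one_hot(label_index, n=NUM_CLASSES):
    # Dimension of the true category is 1, all others 0.
    v = [0.0] * n
    v[label_index] = 1.0
    return v

ov = one_hot(2)  # a site whose true category has index 2
```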
Step 2. Use the high-dimensional feature vector of each website as the input of the deep learning model and the website's category vector as its output, and use the Adam gradient descent optimizer to supervise the training of an LSTM-based recurrent neural network deep learning model that extracts the website's abstract category feature vector. During training, 70% of the website dataset, selected at random, serves as training data; the remaining 30% serves as validation and test data. The specific steps are:
2-1. Randomly select 70% of the internet website dataset as training data for the supervised training of the LSTM recurrent neural network deep learning model; the remaining 30% serves as test data for evaluating the deep learning model. Each training example consists of two parts: an input vector and an output vector, where the input vector is the 1000-dimensional high-dimensional website feature vector and the output vector is the 15-dimensional low-dimensional website category vector.
In the implementation example, 70% of the website dataset is randomly selected as the training set, i.e., 66,500 training examples; the remaining 30% serves as test data, i.e., 28,500 test examples.
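Under the patent's numbers (95,000 pairs, 70/30 split), the random split can be sketched as follows, with indices standing in for the actual (input, output) pairs:

```python
import random

random.seed(42)
pairs = list(range(95000))  # indices standing in for the 95,000 data pairs
random.shuffle(pairs)

cut = int(0.7 * len(pairs))  # 70% for training
train, test = pairs[:cut], pairs[cut:]
```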
2-2. With reference to Fig. 4, the overall LSTM recurrent neural network model consists of one input layer, one output layer, and a stack of three hidden layers containing a large number of LSTM neurons; the input layer, hidden layers, and output layer are fully connected. With reference to Fig. 5, each LSTM neuron in the hidden layers consists of the following parts: an input gate (Input Gate), an output gate (Output Gate), a forget gate (Forget Gate), a memory cell (Memory), a mapping function y = Wx + b (where y is the neuron's output value, x is the neuron's input value, and W and b are the weight and bias matrices, respectively), and an activation function (Tanh).
During model training, the Adam optimizer minimizes the loss function by gradient descent; this is achieved by continuously updating, through forward and backward propagation, the parameter values in each neuron's mapping function and the values stored in its memory cell. The data in the training set are used to train the neural network in sequence, and the memory cell stores the intermediate information produced by the neuron in the previous training step; this intermediate information serves as "experience" data for the current and subsequent training steps. Whether the experience data produced by a neuron is stored in the memory cell is controlled by the input, output, and forget gates: when the input gate is open, the intermediate information of the current training step is written to the memory cell; when the output gate is open, intermediate data saved in previous training steps participates in the current training step; when the forget gate is open, the memory cell discards previously saved intermediate data. The advantage of the LSTM recurrent neural network is that experience data from previous training steps can be fused with the input data of the current step during training, improving the model's classification accuracy.
In the implementation example, the LSTM recurrent neural network deep learning model has one input layer, one output layer, and three hidden layers, where the first hidden layer contains 1024 LSTM neurons, the second hidden layer contains 512 LSTM neurons, and the third hidden layer contains 256 LSTM neurons.
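The gate mechanism described in 2-2 can be illustrated with a single LSTM-cell forward pass in NumPy; this is a standard textbook LSTM formulation offered as a sketch, not the patent's exact implementation, and the toy sizes are illustrative:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_cell(x, h_prev, c_prev, W, b):
    # Concatenate current input with the previous hidden state.
    z = np.concatenate([x, h_prev])
    i = sigmoid(W["i"] @ z + b["i"])  # input gate: write to the memory cell
    f = sigmoid(W["f"] @ z + b["f"])  # forget gate: clear old memory
    o = sigmoid(W["o"] @ z + b["o"])  # output gate: expose memory content
    g = np.tanh(W["g"] @ z + b["g"])  # candidate memory content
    c = f * c_prev + i * g            # updated memory cell
    h = o * np.tanh(c)                # hidden-state output
    return h, c

rng = np.random.default_rng(0)
n_in, n_hid = 8, 4  # toy sizes; the patent's layers use 1024/512/256 neurons
W = {k: rng.normal(scale=0.1, size=(n_hid, n_in + n_hid)) for k in "ifog"}
b = {k: np.zeros(n_hid) for k in "ifog"}
h, c = lstm_cell(rng.normal(size=n_in), np.zeros(n_hid), np.zeros(n_hid), W, b)
```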
2-3. The Adam gradient descent optimizer updates the parameters of the entire neural network model by optimizing the loss function Loss, which is defined as the cross entropy (Cross Entropy):

Loss = -Σ_x q(x) · log p(x)

where x ranges over the individual dimensions of the vectors, p(x) is each dimension of the output-layer output vector, and q(x) is the corresponding dimension of the true category vector.
In the embodiment of the present invention, the training set of 66,500 examples is used for supervised iterative training of the LSTM recurrent neural network deep learning model; each iteration selects 10,000 examples to update the neuron parameters, for 4,500 iterations in total.
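A sketch of the cross-entropy loss from 2-3, assuming the conventional form in which the true one-hot vector weights the log of the predicted distribution:

```python
import numpy as np

def cross_entropy(q, p, eps=1e-12):
    # Loss = -sum_x q(x) * log p(x), with q the true one-hot vector
    # and p the predicted distribution; eps guards against log(0).
    return float(-np.sum(q * np.log(p + eps)))

q = np.zeros(15)
q[3] = 1.0                  # true category has index 3
p = np.full(15, 1.0 / 15)   # a maximally uncertain prediction
loss = cross_entropy(q, p)  # equals log(15)
```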
Step 3. Append a SoftMax regression layer after the LSTM recurrent neural network deep learning model trained in the second step. The abstract website category feature vector output by the LSTM network serves as the input of the SoftMax layer, whose output is the category probability distribution vector of the website. The website category corresponding to the dimension with the largest probability value in this distribution vector is taken as the website's category, and the predicted categories are compared against the websites' actual categories to obtain the classification accuracy for internet websites. The specific steps are:
3-1. The output vector of the output layer of the LSTM recurrent neural network deep learning model is a 15-dimensional low-dimensional vector. Applying the SoftMax function to this vector maps every dimension to a value between 0 and 1 and makes all dimensions sum to 1. The SoftMax function is defined as follows, where P_j is the j-th dimension of the vector after SoftMax processing, O_j is the j-th dimension of the output vector, and O_k is the k-th dimension of the output vector:

P_j = e^(O_j) / Σ_k e^(O_k)

The final vector is then:
Output = SoftMax(OV_w)
where OV_w is the output vector of the deep learning model for website w.
3-2. Every dimension of Output lies between 0 and 1 and all dimensions sum to 1; each dimension represents one website category, and the category corresponding to the dimension with the largest value is the predicted category of the website. The classification accuracy is defined as the ratio of correct predictions to all predictions.
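The SoftMax normalization and accuracy definition of 3-1 and 3-2 can be sketched as follows, with a toy 3-dimensional vector standing in for the 15-dimensional output:

```python
import numpy as np

def softmax(o):
    # Subtracting the max is a standard numerical-stability trick.
    e = np.exp(o - np.max(o))
    return e / e.sum()

def accuracy(predicted, actual):
    # Share of predictions matching the true categories.
    return sum(p == t for p, t in zip(predicted, actual)) / len(actual)

out = softmax(np.array([0.1, 2.0, -1.0]))  # toy stand-in for the 15-dim output
pred = int(np.argmax(out))                 # predicted category index
```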
As shown in Fig. 6, in the embodiment of the present invention, the model parameters are updated through many iterations, and the optimal model finally achieves an automatic internet website classification accuracy of 87.9%.
In conclusion, the present invention classifies websites through an LSTM recurrent neural network deep learning model operating on high-dimensional vector representations of internet websites, improving the accuracy of internet website classification and providing the field of modern information retrieval with a more effective and practical website classification method.
The above examples are provided solely to describe the present invention and are not intended to limit its scope. The scope of the invention is defined by the following claims. All equivalent substitutions and modifications made without departing from the spirit and principles of the present invention shall fall within the scope of the present invention.
Claims (5)
1. An automatic internet website classification method based on deep learning, characterized by comprising the following steps:
First step: collect the original description information of a large number of internet websites from DNS server logs as the website dataset, preprocess it and label it manually, then extract the high-dimensional feature vector representation of each website for input to the deep learning model, add the corresponding category label to each website, and convert the label into a category vector;
Second step: use the high-dimensional feature vector obtained in the first step as the input of the deep learning model and the category vector as its output, and use the Adam gradient descent optimizer to supervise the training of an LSTM-based recurrent neural network deep learning model that extracts the website's abstract category feature vector;
Third step: append a SoftMax regression layer after the LSTM recurrent neural network deep learning model trained in the second step; the abstract website category feature vector output by the LSTM network serves as the input of the SoftMax layer, whose output is the category probability distribution vector of the website; the website category corresponding to the dimension with the largest probability value in this distribution vector is taken as the website's category.
2. The automatic internet website classification method based on deep learning according to claim 1, characterized in that: in the first step, the original description information of an internet website comprises the website's title, keywords, and descriptive text.
3. The automatic internet website classification method based on deep learning according to claim 1, characterized in that the first step is implemented as follows:
Step 1-1: preprocess the original description information of the internet websites, including removing invalid entries and removing stop words from the original description information; manually label the preprocessed websites and divide them into 15 categories: e-commerce, science and technology, travel and transportation, business and economics, social networks, lifestyle services, sports and fitness, healthcare, literature and art, news media, entertainment, politics and law, computer networks, education and training, and society and culture;
Step 1-2: apply word segmentation to all the preprocessed original website description information in the dataset, then arrange all segmentation results in order as input data to train a Word2Vec word embedding model; once trained, the model maps every word in the website dataset into a 1000-dimensional space; when the Word2Vec model is used, a word is given as input and the 1000-dimensional vector it outputs serves as that word's representation;
Step 1-3: for each website in the dataset, compute the TF-IDF value of every word in its segmented original description information; given the high-dimensional vector representation of each word and its TF-IDF value in the website, compute the high-dimensional feature vector representation of each website in the dataset with the following formula; since each word is represented by a 1000-dimensional high-dimensional vector, the website feature vector produced by the formula is also 1000-dimensional:

IV_w = Σ(i=1..N) TF_IDF(t_i, w) × Embedding_Vector(t_i)

where IV_w is the high-dimensional feature vector representation of website w, t_i is the i-th word in website w, N is the total number of words obtained after segmentation, TF_IDF(t_i, w) is the term frequency-inverse document frequency of the word in the website data, and Embedding_Vector(t_i) is the word's high-dimensional vector representation;
Step 1-4: for each website in the dataset, generate its corresponding output category vector OV_w; the dimension of this vector equals the total number of categories; internet websites are divided into 15 categories, so all output category vectors are 15-dimensional, with the dimension corresponding to the website's true category set to 1 and all other dimensions set to 0.
4. The automatic internet website classification method based on deep learning according to claim 1, characterized in that the second step is implemented as follows:
2-1: randomly select 70% of the internet website dataset as training data for the supervised training of the LSTM recurrent neural network deep learning model, with the remaining 30% as test data for evaluating the deep learning model; each training example consists of two parts, an input vector and an output vector, where the input vector is the 1000-dimensional high-dimensional website feature vector and the output vector is the 15-dimensional low-dimensional website category vector;
2-2: the overall LSTM recurrent neural network model consists of one input layer, one output layer, and a stack of three hidden layers containing a large number of LSTM neurons, with the input layer, hidden layers, and output layer fully connected; each LSTM neuron in the hidden layers consists of the following parts: an input gate (Input Gate), an output gate (Output Gate), a forget gate (Forget Gate), a memory cell (Memory), a mapping function y = Wx + b, where y is the neuron's output value, x is the neuron's input value, and W and b are the weight and bias matrices, respectively, and an activation function (Tanh); during model training, the Adam optimizer minimizes the loss function by gradient descent, achieved by continuously updating, through forward and backward propagation, the parameter values in each neuron's mapping function and the values stored in its memory cell; the data in the training set are used to train the neural network in sequence, and the memory cell stores the intermediate information produced by the neuron in the previous training step, which serves as "experience" data for the current and subsequent training steps; whether the experience data produced by a neuron is stored in the memory cell is controlled by the input, output, and forget gates: when the input gate is open, the intermediate information of the current training step is written to the memory cell; when the output gate is open, intermediate data saved in previous training steps participates in the current training step; when the forget gate is open, the memory cell discards previously saved intermediate data; the advantage of the LSTM recurrent neural network is that experience data from previous training steps can be fused with the input data of the current step during training, improving the model's classification accuracy;
2-3: the Adam gradient descent optimizer updates the parameters of the entire neural network model by optimizing the loss function, defined as the cross entropy (Cross Entropy):

Loss = -Σ_x q(x) · log p(x)

where p(x) is each dimension of the output-layer output vector and q(x) is the corresponding dimension of the true category vector.
5. The automatic internet website classification method based on deep learning according to claim 1, characterized in that the third step is implemented as follows:
3-1: the output vector of the output layer of the LSTM recurrent neural network deep learning model is a 15-dimensional low-dimensional vector; applying the SoftMax function to this vector maps every dimension to a value between 0 and 1 and makes all dimensions sum to 1; the SoftMax function is defined as follows, where P_j is the j-th dimension of the vector after SoftMax processing, O_j is the j-th dimension of the output vector, and O_k is the k-th dimension of the output vector:

P_j = e^(O_j) / Σ_k e^(O_k)

the final vector is then:
Output = SoftMax(OV_w)
3-2: every dimension of Output lies between 0 and 1 and all dimensions sum to 1; each dimension represents one website category, and the category corresponding to the dimension with the largest value is the predicted category of the website; the classification accuracy is defined as the ratio of correct predictions to all predictions.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810181807.8A CN108364028A (en) | 2018-03-06 | 2018-03-06 | A kind of internet site automatic classification method based on deep learning |
Publications (1)
Publication Number | Publication Date |
---|---|
CN108364028A true CN108364028A (en) | 2018-08-03 |
Family
ID=63003483
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108364028A (en) |
Cited By (31)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109360052A (en) * | 2018-09-27 | 2019-02-19 | 北京亚联之星信息技术有限公司 | A kind of data classification based on machine learning algorithm, data processing method and equipment |
CN109408947A (en) * | 2018-10-19 | 2019-03-01 | 杭州刀豆网络科技有限公司 | A kind of infringement webpage judgment method based on machine learning |
CN109460473A (en) * | 2018-11-21 | 2019-03-12 | 中南大学 | The electronic health record multi-tag classification method with character representation is extracted based on symptom |
CN109710838A (en) * | 2018-12-05 | 2019-05-03 | 厦门笨鸟电子商务有限公司 | A kind of company's site's keyword extracting method based on deep neural network |
CN109784376A (en) * | 2018-12-26 | 2019-05-21 | 中国科学院长春光学精密机械与物理研究所 | A kind of Classifying Method in Remote Sensing Image and categorizing system |
CN109886408A (en) * | 2019-02-28 | 2019-06-14 | 北京百度网讯科技有限公司 | A kind of deep learning method and device |
CN109948665A (en) * | 2019-02-28 | 2019-06-28 | 中国地质大学(武汉) | Physical activity genre classification methods and system based on long Memory Neural Networks in short-term |
CN110020024A (en) * | 2019-03-15 | 2019-07-16 | 叶宇铭 | Classification method, system, the equipment of link resources in a kind of scientific and technical literature |
CN110298391A (en) * | 2019-06-12 | 2019-10-01 | 同济大学 | A kind of iterative increment dialogue intention classification recognition methods based on small sample |
CN110321555A (en) * | 2019-06-11 | 2019-10-11 | 国网江苏省电力有限公司南京供电分公司 | A kind of power network signal classification method based on Recognition with Recurrent Neural Network model |
CN110472045A (en) * | 2019-07-11 | 2019-11-19 | 中山大学 | A kind of short text falseness Question Classification prediction technique and device based on document insertion |
CN110472885A (en) * | 2019-08-22 | 2019-11-19 | 华南师范大学 | A kind of website assessment system and its working method |
CN110516210A (en) * | 2019-08-22 | 2019-11-29 | 北京影谱科技股份有限公司 | The calculation method and device of text similarity |
CN110609898A (en) * | 2019-08-19 | 2019-12-24 | 中国科学院重庆绿色智能技术研究院 | Self-classification method for unbalanced text data |
CN110765393A (en) * | 2019-09-17 | 2020-02-07 | 微梦创科网络科技(中国)有限公司 | Method and device for identifying harmful URL (uniform resource locator) based on vectorization and logistic regression |
CN110855635A (en) * | 2019-10-25 | 2020-02-28 | 新华三信息安全技术有限公司 | URL (Uniform resource locator) identification method and device and data processing equipment |
CN111078646A (en) * | 2019-12-30 | 2020-04-28 | 弭迺彬 | Method and system for grouping software based on running data of Internet equipment |
CN111125563A (en) * | 2018-10-31 | 2020-05-08 | 安碁资讯股份有限公司 | Method for evaluating domain name and server thereof |
CN111159416A (en) * | 2020-04-02 | 2020-05-15 | 腾讯科技(深圳)有限公司 | Language task model training method and device, electronic equipment and storage medium |
CN111782802A (en) * | 2020-05-15 | 2020-10-16 | 北京极兆技术有限公司 | Method and system for obtaining national economy manufacturing industry corresponding to commodity based on machine learning |
CN111895986A (en) * | 2020-06-30 | 2020-11-06 | 西安建筑科技大学 | MEMS gyroscope original output signal noise reduction method based on LSTM neural network |
CN112287274A (en) * | 2020-10-27 | 2021-01-29 | 中国科学院计算技术研究所 | Method, system and storage medium for classifying website list pages |
CN112329524A (en) * | 2020-09-25 | 2021-02-05 | 泰山学院 | Signal classification and identification method, system and equipment based on deep time sequence neural network |
CN112559099A (en) * | 2020-12-04 | 2021-03-26 | 北京新能源汽车技术创新中心有限公司 | Remote image display method, device and system based on user behavior and storage medium |
CN112690823A (en) * | 2020-12-22 | 2021-04-23 | 海南力维科贸有限公司 | Method and system for identifying physiological sounds of lungs |
CN112800229A (en) * | 2021-02-05 | 2021-05-14 | 昆明理工大学 | Knowledge graph embedding-based semi-supervised aspect-level emotion analysis method for case-involved field |
CN113168890A (en) * | 2018-12-10 | 2021-07-23 | 生命科技股份有限公司 | Deep base recognizer for Sanger sequencing |
CN113392323A (en) * | 2021-06-15 | 2021-09-14 | 电子科技大学 | Business role prediction method based on multi-source data joint learning |
CN113673620A (en) * | 2021-08-27 | 2021-11-19 | 工银科技有限公司 | Method, system, device, medium and program product for model generation |
CN116258579A (en) * | 2023-04-28 | 2023-06-13 | 成都新希望金融信息有限公司 | Training method of user credit scoring model and user credit scoring method |
TWI827984B (en) * | 2021-10-05 | 2024-01-01 | 台灣大哥大股份有限公司 | System and method for website classification |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106250402A (en) * | 2016-07-19 | 2016-12-21 | 杭州华三通信技术有限公司 | Website classification method and device |
CN106599933A (en) * | 2016-12-26 | 2017-04-26 | 哈尔滨工业大学 | Text sentiment classification method based on a joint deep learning model |
CN106779073A (en) * | 2016-12-27 | 2017-05-31 | 西安石油大学 | Media information classification method and device based on deep neural networks |
CN107066446A (en) * | 2017-04-13 | 2017-08-18 | 广东工业大学 | Recurrent neural network text sentiment analysis method with embedded logic rules |
CN107066583A (en) * | 2017-04-14 | 2017-08-18 | 华侨大学 | Image-text cross-modal sentiment classification method based on compact bilinear fusion |
US20170357945A1 (en) * | 2016-06-14 | 2017-12-14 | Recruiter.AI, Inc. | Automated matching of job candidates and job listings for recruitment |
CN107480141A (en) * | 2017-08-29 | 2017-12-15 | 南京大学 | Software defect assisted assignment method based on text and developer activity |
- 2018-03-06: Application CN201810181807.8A filed in China; published as CN108364028A; status Pending
Non-Patent Citations (2)
Title |
---|
Chen Qianxi et al.: "Research on Webpage Classification Algorithms Based on Deep Learning", Microcomputer Applications * |
Huang Lei et al.: "Research on Text Classification Based on Recursive Neural Networks", Journal of Beijing University of Chemical Technology (Natural Science Edition) * |
Cited By (47)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109360052A (en) * | 2018-09-27 | 2019-02-19 | 北京亚联之星信息技术有限公司 | Data classification and data processing method and device based on machine learning algorithms |
CN109408947A (en) * | 2018-10-19 | 2019-03-01 | 杭州刀豆网络科技有限公司 | Machine-learning-based method for judging infringing webpages |
CN111125563A (en) * | 2018-10-31 | 2020-05-08 | 安碁资讯股份有限公司 | Method for evaluating domain name and server thereof |
CN109460473A (en) * | 2018-11-21 | 2019-03-12 | 中南大学 | Electronic health record multi-label classification method based on symptom extraction and feature representation |
CN109710838A (en) * | 2018-12-05 | 2019-05-03 | 厦门笨鸟电子商务有限公司 | Company website keyword extraction method based on deep neural network |
CN109710838B (en) * | 2018-12-05 | 2021-02-26 | 厦门笨鸟电子商务有限公司 | Company website keyword extraction method based on deep neural network |
CN113168890A (en) * | 2018-12-10 | 2021-07-23 | 生命科技股份有限公司 | Deep base recognizer for Sanger sequencing |
CN113168890B (en) * | 2018-12-10 | 2024-05-24 | 生命科技股份有限公司 | Deep base identifier for Sanger sequencing |
CN109784376A (en) * | 2018-12-26 | 2019-05-21 | 中国科学院长春光学精密机械与物理研究所 | Remote sensing image classification method and classification system |
CN109948665B (en) * | 2019-02-28 | 2020-11-27 | 中国地质大学(武汉) | Human activity type classification method and system based on long-time and short-time memory neural network |
CN109948665A (en) * | 2019-02-28 | 2019-06-28 | 中国地质大学(武汉) | Human activity type classification method and system based on long short-term memory neural network |
CN109886408A (en) * | 2019-02-28 | 2019-06-14 | 北京百度网讯科技有限公司 | Deep learning method and device |
CN110020024A (en) * | 2019-03-15 | 2019-07-16 | 叶宇铭 | Method, system and device for classifying link resources in scientific and technical literature |
CN110020024B (en) * | 2019-03-15 | 2021-07-30 | 中国人民解放军军事科学院军事科学信息研究中心 | Method, system and equipment for classifying link resources in scientific and technological literature |
CN110321555A (en) * | 2019-06-11 | 2019-10-11 | 国网江苏省电力有限公司南京供电分公司 | Power grid signal classification method based on recurrent neural network model |
CN110298391A (en) * | 2019-06-12 | 2019-10-01 | 同济大学 | Iterative incremental dialogue intent classification recognition method based on few-shot learning |
CN110472045A (en) * | 2019-07-11 | 2019-11-19 | 中山大学 | Short text false question classification prediction method and device based on document embedding |
CN110472045B (en) * | 2019-07-11 | 2023-02-03 | 中山大学 | Short text false problem classification prediction method and device based on document embedding |
CN110609898B (en) * | 2019-08-19 | 2023-05-05 | 中国科学院重庆绿色智能技术研究院 | Self-classifying method for unbalanced text data |
CN110609898A (en) * | 2019-08-19 | 2019-12-24 | 中国科学院重庆绿色智能技术研究院 | Self-classification method for unbalanced text data |
CN110472885A (en) * | 2019-08-22 | 2019-11-19 | 华南师范大学 | Website assessment system and working method thereof |
CN110516210A (en) * | 2019-08-22 | 2019-11-29 | 北京影谱科技股份有限公司 | Text similarity calculation method and device |
CN110516210B (en) * | 2019-08-22 | 2023-06-27 | 北京影谱科技股份有限公司 | Text similarity calculation method and device |
CN110765393A (en) * | 2019-09-17 | 2020-02-07 | 微梦创科网络科技(中国)有限公司 | Method and device for identifying harmful URL (uniform resource locator) based on vectorization and logistic regression |
CN110855635A (en) * | 2019-10-25 | 2020-02-28 | 新华三信息安全技术有限公司 | URL (Uniform resource locator) identification method and device and data processing equipment |
CN110855635B (en) * | 2019-10-25 | 2022-02-11 | 新华三信息安全技术有限公司 | URL (Uniform resource locator) identification method and device and data processing equipment |
CN111078646B (en) * | 2019-12-30 | 2023-12-05 | 山东蝶飞信息技术有限公司 | Method and system for grouping software based on operation data of Internet equipment |
CN111078646A (en) * | 2019-12-30 | 2020-04-28 | 弭迺彬 | Method and system for grouping software based on running data of Internet equipment |
CN111159416B (en) * | 2020-04-02 | 2020-07-17 | 腾讯科技(深圳)有限公司 | Language task model training method and device, electronic equipment and storage medium |
CN111159416A (en) * | 2020-04-02 | 2020-05-15 | 腾讯科技(深圳)有限公司 | Language task model training method and device, electronic equipment and storage medium |
CN111782802B (en) * | 2020-05-15 | 2023-11-24 | 北京极兆技术有限公司 | Method and system for obtaining the national economy manufacturing category corresponding to a commodity based on machine learning |
CN111782802A (en) * | 2020-05-15 | 2020-10-16 | 北京极兆技术有限公司 | Method and system for obtaining the national economy manufacturing category corresponding to a commodity based on machine learning |
CN111895986A (en) * | 2020-06-30 | 2020-11-06 | 西安建筑科技大学 | MEMS gyroscope original output signal noise reduction method based on LSTM neural network |
CN112329524A (en) * | 2020-09-25 | 2021-02-05 | 泰山学院 | Signal classification and identification method, system and equipment based on deep time sequence neural network |
CN112287274B (en) * | 2020-10-27 | 2022-10-18 | 中国科学院计算技术研究所 | Method, system and storage medium for classifying website list pages |
CN112287274A (en) * | 2020-10-27 | 2021-01-29 | 中国科学院计算技术研究所 | Method, system and storage medium for classifying website list pages |
CN112559099B (en) * | 2020-12-04 | 2024-02-27 | 北京国家新能源汽车技术创新中心有限公司 | Remote image display method, device and system based on user behaviors and storage medium |
CN112559099A (en) * | 2020-12-04 | 2021-03-26 | 北京新能源汽车技术创新中心有限公司 | Remote image display method, device and system based on user behavior and storage medium |
CN112690823A (en) * | 2020-12-22 | 2021-04-23 | 海南力维科贸有限公司 | Method and system for identifying physiological sounds of lungs |
CN112800229B (en) * | 2021-02-05 | 2022-12-20 | 昆明理工大学 | Knowledge graph embedding-based semi-supervised aspect-level emotion analysis method for case-involved field |
CN112800229A (en) * | 2021-02-05 | 2021-05-14 | 昆明理工大学 | Knowledge graph embedding-based semi-supervised aspect-level emotion analysis method for case-involved field |
CN113392323B (en) * | 2021-06-15 | 2022-04-19 | 电子科技大学 | Business role prediction method based on multi-source data joint learning |
CN113392323A (en) * | 2021-06-15 | 2021-09-14 | 电子科技大学 | Business role prediction method based on multi-source data joint learning |
CN113673620A (en) * | 2021-08-27 | 2021-11-19 | 工银科技有限公司 | Method, system, device, medium and program product for model generation |
TWI827984B (en) * | 2021-10-05 | 2024-01-01 | 台灣大哥大股份有限公司 | System and method for website classification |
CN116258579A (en) * | 2023-04-28 | 2023-06-13 | 成都新希望金融信息有限公司 | Training method of user credit scoring model and user credit scoring method |
CN116258579B (en) * | 2023-04-28 | 2023-08-04 | 成都新希望金融信息有限公司 | Training method of user credit scoring model and user credit scoring method |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108364028A (en) | Internet website automatic classification method based on deep learning | |
Fan et al. | Metapath-guided heterogeneous graph neural network for intent recommendation | |
CN109492157B (en) | News recommendation method and theme characterization method based on RNN and attention mechanism | |
CN104933164B (en) | Method and system for extracting relationships between named entities in massive internet data | |
CN105893609A (en) | Mobile APP recommendation method based on weighted mixing | |
Fang et al. | Folksonomy-based visual ontology construction and its applications | |
CN107291902A (en) | Automatic annotation method for reviewing crowd-sourced contributions based on hybrid classification technology | |
Yanmei et al. | Research on Chinese micro-blog sentiment analysis based on deep learning | |
CN113806630B (en) | Attention-based multi-view feature fusion cross-domain recommendation method and device | |
CN110377605A (en) | Sensitive attribute identification and classification grading method for structured data | |
Cong | Personalized recommendation of film and television culture based on an intelligent classification algorithm | |
Ji et al. | Attention based meta path fusion for heterogeneous information network embedding | |
CN116186372A (en) | Bibliographic system capable of providing personalized service | |
Bai et al. | A rumor detection model incorporating propagation path contextual semantics and user information | |
Paraschiv et al. | A unified graph-based approach to disinformation detection using contextual and semantic relations | |
Zhang et al. | Reinforced adaptive knowledge learning for multimodal fake news detection | |
Zhu | A book recommendation algorithm based on collaborative filtering | |
Wang | Knowledge graph analysis of internal control field in colleges | |
CN109597946A (en) | Intelligent detection method for harmful webpages based on deep belief network algorithm | |
CN114722304A (en) | Community search method based on theme on heterogeneous information network | |
Bao et al. | Hot news prediction method based on natural language processing technology and its application | |
Gao et al. | Statistics and Analysis of Targeted Poverty Alleviation Information Integrated with Big Data Mining Algorithm | |
Zeng et al. | Model-Stacking-based network user portrait from multi-source campus data | |
Liu | Construction of personalized recommendation system of university library based on SOM neural network | |
Wang et al. | An api recommendation method based on beneficial interaction |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
WD01 | Invention patent application deemed withdrawn after publication | Application publication date: 20180803 |