CN108364028A - A kind of internet site automatic classification method based on deep learning - Google Patents

A kind of internet site automatic classification method based on deep learning

Info

Publication number
CN108364028A
CN108364028A (application CN201810181807.8A)
Authority
CN
China
Prior art keywords
website
vector
data
output
deep learning
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201810181807.8A
Other languages
Chinese (zh)
Inventor
杜沐阳
韩言妮
杨兴华
谭红艳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Information Engineering of CAS
Original Assignee
Institute of Information Engineering of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Information Engineering of CAS filed Critical Institute of Information Engineering of CAS
Priority to CN201810181807.8A priority Critical patent/CN108364028A/en
Publication of CN108364028A publication Critical patent/CN108364028A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/958Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention relates to an automatic internet website classification method based on deep learning. The original description information of a large number of internet websites is collected from DNS server logs as a website dataset, preprocessed, and manually labeled; a high-dimensional feature vector used as the input of a deep learning model is then extracted for each website, a corresponding website category label is assigned to each website, and the label is converted into a category vector. The high-dimensional feature vector serves as the input of the deep learning model and the category vector as its output, and an LSTM-based recurrent neural network deep learning model is trained in a supervised manner with the Adam gradient descent optimizer. A SoftMax regression layer is added after the trained LSTM recurrent neural network model. The website category corresponding to the dimension with the largest probability value in the output probability distribution vector is taken as the website category, and the predicted categories are compared with the actual categories of the websites to obtain the classification accuracy for internet websites.

Description

A kind of internet site automatic classification method based on deep learning
Technical field
The present invention relates to an automatic internet website classification method based on deep learning, and belongs to the technical field of artificial intelligence.
Background technology
In recent years, the internet has grown explosively, and the pace of China's internet development has drawn worldwide attention. The 40th China Statistical Report on Internet Development [1], issued by the China Internet Network Information Center in July 2017, reports that as of June 2017 China had 751 million internet users, with 19.92 million new users added in half a year; the internet penetration rate reached 54.3%; and the total number of Chinese websites reached 5.06 million, of which 2.7 million were under the ".CN" domain. China's internet is trending upward in user scale, penetration rate, and number of websites.
Faced with such a huge number of websites, the demand for and importance of website classification have gradually become apparent. Website classification plays a crucial role in many information management and retrieval tasks: targeted crawling by web crawlers, the development and maintenance of web directories, topic-oriented web link analysis, content-targeted delivery of online advertising, and analysis of network topology. Website classification can also help search engines improve accuracy and the quality of search results, and operators and internet service providers can use it to analyze user access behavior and optimize resource allocation.
Deep learning [2] is a current research hotspot and has produced outstanding results in artificial intelligence and many other fields. It is a branch of machine learning, and today deep learning usually refers to deep neural networks. Its representation is inspired by neuroscience: it attempts to simulate the information transmission and potential fluctuation of human brain neurons, in order to achieve algorithms that are competitive with human performance on certain recognition tasks.
Websites on the internet can be divided into multiple categories; for example, "Taobao" can be classified as "e-commerce" and "Ctrip" as "travel and transportation". How to accurately assign websites to categories is an important and meaningful research problem. Most related studies at home and abroad build classification models using publicly available free datasets as training data; such datasets collect a certain number of websites whose categories are fixed by manual labeling. Several existing website classification approaches are introduced below.
R. Rajalakshmi and C. Aravindan proposed a website classification algorithm based on naive Bayes [3]. Their algorithm uses the URL of a website as the basis for classification and divides websites into 9 categories: art, business, computers, games, health, home, news, recreation, and reference. They used 855,939 URLs extracted from the ODP (Open Directory Project) dataset as training data and built the 9-category model with the naive Bayes machine learning algorithm.
Min-Yen Kan proposed a webpage classification algorithm based on SVM (Support Vector Machine) [4]. The algorithm uses the URL of a website as the basis for classification and divides websites into 4 categories: course, faculty, project, and student. The study is based on the ILP-98 WebKB dataset, which contains 4167 webpages, and builds the model with the SVM machine learning algorithm.
Chen Qianxi and Fan Lei proposed a webpage classification algorithm based on SAE (sparse autoencoder) [5]. Their algorithm uses website titles, keywords, and description information as the basis for classification and divides websites into 8 categories: leisure and entertainment, online shopping, network services, business and economy, life services, education and culture, blogs and forums, and general websites. Their dataset was collected by crawling 200 webpages from each category of the "First Classified Directory" website, yielding 1600 webpages in total, and the model was built with a sparse autoencoder.
Jia M Q, Wang Z M, and Chen G proposed a website classification algorithm based on decision trees and HTTP behavior analysis [6]. Their algorithm uses the HTTP behavior generated while users visit websites as the basis for classification and divides websites into 3 categories: news, resource sharing, and community. Their data come from the user access records of HERNET (the Henan Province Education and Research Network), from which HTTP behaviors were extracted as the dataset, and the model was built with the decision tree machine learning method.
Hou Cuiqin and Jiao Licheng proposed a graph-based co-training website classification algorithm, GCo-training [7]. Their study is based on the WebKB dataset. Under the co-training framework, the algorithm iteratively trains a semi-supervised classifier on the graph constructed from hyperlink information and a Bayes classifier based on text features; the two help each other and continuously improve each other's performance, and the Bayes classifier can then be used to predict the categories of a large amount of unseen data.
Sun Jingchao proposed a webpage classification method based on machine learning [8]. The study uses multiple webpage features, such as the URL and host name, as the basis for classification and divides webpages into two categories, benign and malicious. A large number of benign and malicious URLs were collected from Alexa and PhishTank as the dataset, and several machine learning methods, including support vector machines, naive Bayes, and logistic regression, were used for modeling.
Zhang Jingshen and Ren Fangying proposed a website classification method and device in a patent (publication number CN106250402A). The method uses the first tag information and the first webpage content of the website to be classified and, according to a preset tag dictionary, determines the website category corresponding to the first tag information.
Wang Yong proposed a website classification method and system in a patent (publication number CN101458713A). The method includes: counting, per website, the keywords users search for and the URLs they click; using the statistics to determine the keyword set pointing to the website to be classified and building a keyword-set vector for it; determining seed websites of known types and building keyword-set vectors for them; computing the similarity between the website to be classified and the seed websites using their vectors; and determining the type of the website to be classified according to the similarity.
Existing traditional website classification methods share some common shortcomings, mainly in four aspects: the classification basis, the methods used, the data, and the classification categories.
In short, the shortcomings of the prior art are: the amount of data used for model training is neither large nor recent enough; the classification basis carries insufficient information; the methods and models used are not suitable for classifying high-dimensional data samples and lack the capacity to extract features and learn information; and the number of categories is too small to meet the demand for fine-grained multi-class classification.
[1] The 40th China Statistical Report on Internet Development, China Internet Network Information Center, 2017-08-04.
[2] LeCun Y, Bengio Y, Hinton G. Deep learning[J]. Nature, 2015, 521(7553): 436-444.
[3] R. Rajalakshmi, C. Aravindan. Naive Bayes Approach for Website Classification. In: V.V. Das, G. Thomas, F. Lumban Gaol (Eds.): AIM 2011, CCIS 147, pp. 323-326, 2011.
[4] Min-Yen Kan. Web page categorization without the web page. WWW 2004, May 17-22, 2004, New York, New York, USA. ACM 1-58113-912-8/04/0005.
[5] Chen Qianxi, Fan Lei. Webpage Classification Based on Deep Learning Algorithm. Microcomputer Applications, Vol. 32, No. 2, 2016, pp. 25-28.
[6] Jia M Q, Wang Z M, Chen G. Research on website classification based on user HTTP behavior analysis[J]. Computer Engineering and Design, 2010, 2.
[7] Hou Cui-qin, Jiao Li-cheng. Graph Based Co-Training Algorithm for Web Page Classification. Acta Electronica Sinica, Vol. 37, No. 10, pp. 2174-2180, Oct. 2009.
[8] Sun Jingchao. A Classification Method of Web Page Using Machine Learning. Netinfo Security, Sep. 2017, pp. 45-48, 1671-1122(2017)09-0045-04.
Invention content
The technical problem solved by the present invention is to overcome the deficiencies of the prior art by providing an automatic internet website classification method based on deep learning. The method classifies websites with an LSTM recurrent neural network deep learning model using high-dimensional vector representations of internet websites, improves the accuracy of internet website classification, and provides a more effective and practical website classification method for the modern information retrieval field.
The technical solution that achieves the object of the invention is an internet website classification method based on an LSTM recurrent neural network deep learning model, comprising the following steps:
Step 1. Collect the original description information of a large number of internet websites from operational-network DNS server logs as a website dataset, preprocess it, and manually label it; then extract for each website the high-dimensional feature vector representation used as the input of the deep learning model, assign the corresponding website category label to each website, and convert it into a category vector. The specific steps are:
Step 1-1. Preprocess the original description information of the internet websites. Preprocessing includes removing invalid entries (for example, data entries corresponding to 404 page-not-found errors, server internal errors, and websites that have stopped service are removed) and removing stop words from the original description information (for example, stop words such as "a", special symbols, and other meaningless words). All preprocessed websites are manually labeled and divided into 15 categories: e-commerce, science and technology, travel and transportation, business and economy, social networking, life services, sports and fitness, medical care and health, literature and art, news media, entertainment, politics and law, computer and network, education and training, and society and culture;
Step 1-2. Perform word segmentation on the original description information of all websites in the preprocessed dataset, then arrange all segmentation results in the dataset in order as input data to train a Word2Vec word embedding model. After the model is trained, it can map every word in the website dataset into a 1000-dimensional space; when the Word2Vec model is used, a word is given as input and the output 1000-dimensional vector is taken as the representation of that word;
Step 1-3. For each website in the dataset, compute the TF-IDF value of each word obtained by segmenting its original description information. Given the high-dimensional vector representation of each word and its TF-IDF value in the website, the high-dimensional feature vector representation of each website in the dataset can be computed with the following formula; since each word is represented by a 1000-dimensional vector, the website high-dimensional feature vector obtained from the formula is also 1000-dimensional:
IV_w = Σ_{i=1..N} TF_IDF_{t_i,w} × Embedding_Vector_{t_i,w}
where IV_w is the high-dimensional feature vector representation of website w, t_i is the i-th word in website w, N is the total number of words obtained after segmentation, TF_IDF_{t_i,w} is the term frequency-inverse document frequency of the word in the website data, and Embedding_Vector_{t_i,w} is the high-dimensional vector representation of the word (obtained from the Word2Vec model);
Step 1-4. For each website in the dataset, generate its corresponding output category vector OV_w. The dimension of this vector is the total number of categories; since the present invention divides internet websites into 15 categories, all output category vectors are 15-dimensional, where the dimension corresponding to the true category of the website is 1 and all other dimensions are 0;
Step 2. Use the high-dimensional feature vector of each website obtained in the first step as the input of the deep learning model and the category vector of the website as its output, and train an LSTM-based recurrent neural network deep learning model in a supervised manner with the Adam gradient descent optimizer to extract the category abstract feature vector of the website. When training the model, the training data are a randomly selected 70% of the website dataset, and the validation and test data are the remaining 30%. The specific steps are:
2-1. Randomly select 70% of the internet website dataset as training data for supervised training of the LSTM recurrent neural network deep learning model, and use the remaining 30% as test data for evaluating the deep learning model. Each training sample consists of 2 parts: an input vector and an output vector, where the input vector is the 1000-dimensional high-dimensional website feature vector and the output vector is the 15-dimensional low-dimensional website category vector.
2-2. The overall framework of the LSTM recurrent neural network (RNN) model is 1 input layer, 1 output layer, and a stack of 3 neural network hidden layers containing a large number of LSTM neurons; the input layer, hidden layers, and output layer are fully connected. Each LSTM neuron in the hidden layers consists of the following parts: an input gate, an output gate, a forget gate, a memory cell, a mapping function (y = Wx + b, where y is the neuron output value, x is the neuron input value, and W and b are the weight and bias matrices, respectively), and an activation function (Tanh). During model training, the Adam optimizer minimizes the loss function by gradient descent; this is achieved by continuously updating, through forward and backward propagation, the parameter values in the mapping function of every neuron of the neural network and the values in the memory cells. The data in the training set are used to train the neural network in order; the memory cell stores the intermediate information generated by the neuron in the previous training step, and this intermediate information serves as "experience" data for the current and later training steps. Whether the experience data generated by a neuron are stored in the memory cell is controlled by the input, output, and forget gates: when the input gate is open, the intermediate information of the current step is stored in the memory cell; when the output gate is open, the intermediate data saved in previous steps participate in the current step; and when the forget gate is open, the memory cell clears the previously saved intermediate data. The advantage of using an LSTM recurrent neural network is that the experience data from previous training can be fused with the input data of the current training step, which improves the classification prediction ability of the model.
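For clarity, the gate behavior described in 2-2 corresponds to the standard LSTM cell update; the equations below are the widely used textbook formulation (σ is the logistic sigmoid, ⊙ is element-wise multiplication) and are given here as a reference sketch rather than as notation taken from the patent.

```latex
\begin{aligned}
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) && \text{(input gate)}\\
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) && \text{(forget gate)}\\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) && \text{(output gate)}\\
\tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c) && \text{(candidate memory)}\\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t && \text{(memory cell update)}\\
h_t &= o_t \odot \tanh(c_t) && \text{(neuron output)}
\end{aligned}
```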
2-3. The Adam gradient descent optimizer updates the parameters of the entire neural network model by optimizing the loss function, where the loss function is defined in the form of cross entropy (Cross Entropy) as follows:
Loss = -Σ_x q(x) · log(p(x))
where x ranges over the dimensions of the vector, p(x) is the value of each dimension of the output-layer output vector, and q(x) is the value of each dimension of the true category vector.
Step 3. Add a SoftMax regression layer after the LSTM recurrent neural network deep learning model trained in step 2. The website category abstract feature vector output by the LSTM recurrent neural network is used as the input of the SoftMax regression layer, which outputs the category probability distribution vector of the website. The website category corresponding to the dimension with the largest probability value in this probability distribution vector is taken as the category of the website, and the output website categories are compared with the actual categories of the websites to obtain the classification accuracy for internet websites. The specific steps are:
3-1. The output vector of the output layer of the LSTM recurrent neural network deep learning model is a 15-dimensional low-dimensional vector. The SoftMax function is applied to this output vector, which makes the value of each dimension of the output vector lie between 0 and 1 and makes all dimension values sum to 1. The SoftMax function is defined as follows, where P_j is the value of the j-th dimension of the vector after SoftMax processing, O_j is the value of the j-th dimension of the output vector, and O_k is the value of the k-th dimension of the output vector:
P_j = exp(O_j) / Σ_k exp(O_k)
The final vector is then:
Output = SoftMax(OV_W)
3-2. Each dimension of Output in 3-1 has a value between 0 and 1, all dimension values sum to 1, and each dimension represents one website category; the category corresponding to the dimension with the largest value is the predicted category of the website. The classification accuracy is defined as the ratio of the number of correct predictions to the total number of predictions.
The advantages of the present invention over the prior art are:
(1) The title, keywords, and description information of the website are used and combined with Word2Vec and TF-IDF into a high-dimensional feature vector with high information content, here a 1000-dimensional vector. Because of its high dimensionality, it can capture a large amount of the information in the original website description text. Using the high-dimensional feature vector as the classification basis avoids the problem of insufficient information when URLs or HTTP behaviors are used as the classification basis;
(2) A deep learning method is used: an LSTM recurrent neural network deep learning model is built for automatic website classification, which overcomes the poor performance of common machine learning methods on high-dimensional, large-sample, multi-class problems;
(3) Considering the importance of model data to the practicality of the model, a large amount of website data collected via operational-network DNS servers is used as manually labeled training and test data, which solves the problems of inaccurate data and impractical models caused by the possibly outdated free public datasets used in previous research;
(4) Regarding the classification basis, traditional research approaches mostly use the URL of a website as the classification basis, but URLs are highly random and uncertain and contain too little information to serve as a credible basis for classification. Other methods use the HTTP behavior of users visiting a website as the classification basis, but HTTP behavior is often too abstract relative to the actual content of the website; for example, websites of many categories issue requests for videos and images, so it is difficult to subdivide websites into many categories on that basis. The present invention instead extracts the title, keywords, and description text of a website as the classification basis. This information is carefully written by website administrators for search engine optimization and tends to describe the specific business of the website accurately and precisely. Based on these textual descriptions, credible website classification can be achieved while preserving as much information as possible.
(5) Regarding the classification method, traditional research approaches mostly use machine learning methods such as support vector machines and naive Bayes, but due to inherent limitations these methods have difficulty achieving good results on high-dimensional, large-sample, multi-class problems. The present invention performs website classification based on deep learning, building deep neural networks that simulate the cognitive pattern of the human brain; because deep neural networks have a large number of neurons, they have strong feature extraction and learning abilities and are well suited to website classification tasks with high information content.
(6) Regarding the classification data, traditional research approaches mostly use free public datasets, which over time often suffer from insufficient data volume, outdated data, and errors. The present invention instead selects real user access logs from an operator's operational network, extracts all website domain names from them, and uses these domain names as seeds to collect a large amount of web data that is then manually labeled as the dataset. These data are recent and authentic, which guarantees the credibility and usability of the final model.
(7) Regarding the classification categories, traditional research methods often classify websites into few categories, such as malicious and benign websites, or news, resource sharing, and community, usually fewer than 10 categories; the small number of categories means these methods cannot subdivide websites into specific categories. The present invention uses a deep learning method to automatically classify websites into 15 categories: e-commerce, science and technology, travel and transportation, business and economy, social networking, life services, sports and fitness, medical care and health, literature and art, news media, entertainment, politics and law, computer and network, education and training, and society and culture. The diversity of categories makes website classification finer-grained and more accurate.
Description of the drawings
Fig. 1 is the flow chart of the internet website automatic classification method based on deep learning;
Fig. 2 shows examples of the title, keywords, and description text of the Sina website and Tencent Video;
Fig. 3 shows part of the website data stored by the data crawling program after preprocessing;
Fig. 4 is a schematic diagram of the LSTM recurrent neural network deep learning model;
Fig. 5 is a schematic diagram of the internal structure of an LSTM neuron;
Fig. 6 shows relevant results and tests of the present invention.
Specific implementation mode
With reference to Fig. 1, the present invention is an internet website classification method based on an LSTM recurrent neural network deep learning model, comprising the following steps:
Step 1. With reference to Fig. 2, collect the original description information of a large number of internet websites from operational-network DNS server logs as a website dataset, preprocess it, and manually label it; then extract for each website the high-dimensional feature vector representation used as the input of the deep learning model, assign the corresponding website category label to each website, and convert it into a category vector. The specific steps are:
Step 1-1. With reference to Fig. 3, preprocess the original description information of the internet websites. Preprocessing includes removing invalid entries (for example, data entries corresponding to 404 page-not-found errors, server internal errors, and websites that have stopped service are removed) and removing stop words from the original description information (for example, stop words such as "a", special symbols, and other meaningless words). All preprocessed websites are manually labeled and divided into 15 categories: e-commerce, science and technology, travel and transportation, business and economy, social networking, life services, sports and fitness, medical care and health, literature and art, news media, entertainment, politics and law, computer and network, education and training, and society and culture;
Step 1-2. Perform word segmentation on the original description information of all websites in the preprocessed dataset, then arrange all segmentation results in the dataset in order as input data to train a Word2Vec word embedding model. After the model is trained, it can map every word in the website dataset into a 1000-dimensional space; when the Word2Vec model is used, a word is given as input and the output 1000-dimensional vector is taken as the representation of that word;
Step 1-3. For each website in the dataset, compute the TF-IDF value of each word obtained by segmenting its original description information. Given the high-dimensional vector representation of each word and its TF-IDF value in the website, the high-dimensional feature vector representation of each website in the dataset can be computed with the following formula; since each word is represented by a 1000-dimensional vector, the website high-dimensional feature vector obtained from the formula is also 1000-dimensional:
IV_w = Σ_{i=1..N} TF_IDF_{t_i,w} × Embedding_Vector_{t_i,w}
where IV_w is the high-dimensional feature vector representation of website w, t_i is the i-th word in website w, N is the total number of words obtained after segmentation, TF_IDF_{t_i,w} is the term frequency-inverse document frequency of the word in the website data, and Embedding_Vector_{t_i,w} is the high-dimensional vector representation of the word (obtained from the Word2Vec model);
Step 1-4. For each website in the dataset, generate its corresponding output category vector OV_w. The dimension of this vector is the total number of categories; since this patent divides internet websites into 15 categories, all output category vectors are 15-dimensional, where the dimension corresponding to the true category of the website is 1 and all other dimensions are 0;
Step 2. Use the high-dimensional feature vector of each website as the input of the deep learning model and the category vector of the website as its output, and train an LSTM-based recurrent neural network deep learning model in a supervised manner with the Adam gradient descent optimizer to extract the category abstract feature vector of the website. When training the model, the training data are a randomly selected 70% of the website dataset, and the validation and test data are the remaining 30%. The specific steps are:
2-1. Randomly select 70% of the internet website dataset as training data for supervised training of the LSTM recurrent neural network deep learning model, and use the remaining 30% as test data for evaluating the deep learning model. Each training sample consists of 2 parts: an input vector and an output vector, where the input vector is the 1000-dimensional high-dimensional website feature vector and the output vector is the 15-dimensional low-dimensional website category vector.
2-2. With reference to Fig. 4, the overall framework of the LSTM recurrent neural network model is 1 input layer, 1 output layer, and a stack of 3 neural network hidden layers containing a large number of LSTM neurons; the input layer, hidden layers, and output layer are fully connected. With reference to Fig. 5, each LSTM neuron in the hidden layers consists of the following parts: an input gate, an output gate, a forget gate, a memory cell, a mapping function (y = Wx + b, where y is the neuron output value, x is the neuron input value, and W and b are the weight and bias matrices, respectively), and an activation function (Tanh). During model training, the Adam optimizer minimizes the loss function by gradient descent; this is achieved by continuously updating, through forward and backward propagation, the parameter values in the mapping function of every neuron of the neural network and the values in the memory cells. The data in the training set are used to train the neural network in order; the memory cell stores the intermediate information generated by the neuron in the previous training step, and this intermediate information serves as "experience" data for the current and later training steps. Whether the experience data generated by a neuron are stored in the memory cell is controlled by the input, output, and forget gates: when the input gate is open, the intermediate information of the current step is stored in the memory cell; when the output gate is open, the intermediate data saved in previous steps participate in the current step; and when the forget gate is open, the memory cell clears the previously saved intermediate data. The advantage of using an LSTM recurrent neural network is that the experience data from previous training can be fused with the input data of the current training step, which improves the classification prediction ability of the model.
2-3. The Adam gradient descent optimizer updates the parameters of the entire neural network model by optimizing the loss function, where the loss function is defined in the form of cross entropy (Cross Entropy) as follows:
Loss = -Σ_x q(x) · log(p(x))
where x ranges over the dimensions of the vector, p(x) is the value of each dimension of the output-layer output vector, and q(x) is the value of each dimension of the true category vector.
Step 3. Add a SoftMax regression layer after the LSTM recurrent neural network deep learning model trained in step 2. The website category abstract feature vector output by the LSTM recurrent neural network is used as the input of the SoftMax regression layer, which outputs the category probability distribution vector of the website. The website category corresponding to the dimension with the largest probability value in this probability distribution vector is taken as the category of the website, and the output website categories are compared with the actual categories of the websites to obtain the classification accuracy for internet websites. The specific steps are:
3-1. The output vector of the output layer of the LSTM recurrent neural network deep learning model is a 15-dimensional low-dimensional vector. The SoftMax function is applied to this output vector, which makes the value of each dimension of the output vector lie between 0 and 1 and makes all dimension values sum to 1. The SoftMax function is defined as follows, where P_j is the value of the j-th dimension of the vector after SoftMax processing, O_j is the value of the j-th dimension of the output vector, and O_k is the value of the k-th dimension of the output vector:
P_j = exp(O_j) / Σ_k exp(O_k)
The final vector is then:
Output = SoftMax(OV_W)
3-2. Each dimension of Output in 3-1 has a value between 0 and 1, all dimension values sum to 1, and each dimension represents one website category; the category corresponding to the dimension with the largest value is the predicted category of the website. The classification accuracy is defined as the ratio of the number of correct predictions to the total number of predictions.
The present invention is described in further detail below with reference to an implementation example.
With reference to Fig. 1, the present invention is an internet website classification method based on an LSTM recurrent neural network deep learning model. First, the high-dimensional website feature vectors are extracted; then the deep learning model is trained in a supervised manner with the Adam gradient descent optimizer; finally, the trained deep learning model and SoftMax regression are used to classify internet websites and obtain the classification accuracy. The method comprises the following steps:
Step 1. With reference to Fig. 2, collect the original description information of a large number of internet websites from operational-network DNS server logs as a website dataset, preprocess it, and manually label it; then extract for each website the high-dimensional feature vector representation used as the input of the deep learning model, assign the corresponding website category label to each website, and convert it into a category vector. The specific steps are:
Step 1-1. With reference to Fig. 3, preprocess the original description information of the internet websites. Preprocessing includes removing invalid entries (for example, data entries corresponding to 404 page-not-found errors, server internal errors, and websites that have stopped service are removed) and removing stop words from the original description information (for example, stop words such as "a", special symbols, and other meaningless words). All preprocessed websites are manually labeled and divided into 15 categories: e-commerce, science and technology, travel and transportation, business and economy, social networking, life services, sports and fitness, medical care and health, literature and art, news media, entertainment, politics and law, computer and network, education and training, and society and culture;
In the implementation example, it is assumed that there are 100,000 website original description entries in total; after the above preprocessing, 95,000 valid entries remain. These valid entries are segmented into words and manually labeled into the 15 categories.
Step 1-2. Perform word segmentation on the original description information of all websites in the preprocessed dataset, then arrange all segmentation results in the dataset in order as input data to train a Word2Vec word embedding model. After the model is trained, it can map every word in the website dataset into a 1000-dimensional space; when the Word2Vec model is used, a word is given as input and the output 1000-dimensional vector is taken as the representation of that word;
In the implementation example, the word segmentation results of the 95,000 valid entries are used to train the Word2Vec model. After training is complete, given a word as input, the Word2Vec model returns the high-dimensional vector representation of that word.
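The Word2Vec training of step 1-2 can be sketched as follows. This is a minimal illustration, assuming the gensim library (version 4 or later) for Word2Vec and jieba for Chinese word segmentation; the corpus file name and its one-description-per-line layout are hypothetical and not part of the patent.

```python
# Sketch of step 1-2: train a 1000-dimensional Word2Vec word embedding model.
import jieba
from gensim.models import Word2Vec

# Each line holds one preprocessed website description (assumption).
with open("website_descriptions.txt", encoding="utf-8") as f:
    sentences = [jieba.lcut(line.strip()) for line in f if line.strip()]

# Map every word in the dataset into a 1000-dimensional space, as in the patent.
w2v = Word2Vec(sentences, vector_size=1000, window=5, min_count=1, workers=4)

vec = w2v.wv["网站"]   # 1000-dimensional vector representation of one word
print(vec.shape)        # -> (1000,)
```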
Step 1-3. For each website in the dataset, compute the TF-IDF value of each word obtained by segmenting its original description information. Given the high-dimensional vector representation of each word and its TF-IDF value in the website, the high-dimensional feature vector representation of each website in the dataset can be computed with the following formula; since each word is represented by a 1000-dimensional vector, the website high-dimensional feature vector obtained from the formula is also 1000-dimensional:
IV_w = Σ_{i=1..N} TF_IDF_{t_i,w} × Embedding_Vector_{t_i,w}
where IV_w is the high-dimensional feature vector representation of website w, i is the index of a word obtained after segmenting the website, t_i is the i-th word in website w, TF_IDF_{t_i,w} is the term frequency-inverse document frequency of the word in the website data, Embedding_Vector_{t_i,w} is the high-dimensional vector representation of the word (obtained from the Word2Vec model), and N is the total number of words obtained after segmenting the website.
In the implementation example, each website is converted with the above formula into a 1000-dimensional high-dimensional website vector representation, yielding the input data for deep learning model training and testing, 95,000 input vectors in total.
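A possible implementation of the TF-IDF-weighted feature vector of step 1-3 is sketched below, assuming the trained `w2v` model and the `sentences` token lists from the previous sketch. The IDF definition and the lack of length normalization are assumptions; the weighting itself follows the formula above.

```python
import math
from collections import Counter
import numpy as np

def build_idf(docs):
    """docs: list of token lists, one per website; returns word -> IDF."""
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    n_docs = len(docs)
    return {t: math.log(n_docs / (1 + c)) for t, c in df.items()}

def website_vector(tokens, w2v, idf, dim=1000):
    """IV_w = sum_i TF-IDF(t_i, w) * Embedding_Vector(t_i)."""
    tf = Counter(tokens)
    iv = np.zeros(dim, dtype=np.float32)
    for t, count in tf.items():
        if t in w2v.wv and t in idf:
            tfidf = (count / len(tokens)) * idf[t]
            iv += tfidf * w2v.wv[t]
    return iv

idf = build_idf(sentences)   # sentences from the Word2Vec sketch
X = np.stack([website_vector(doc, w2v, idf) for doc in sentences])
print(X.shape)               # -> (num_websites, 1000)
```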
Step 1-4. For each website in the dataset, generate its corresponding output category vector OV_w. The dimension of this vector is the total number of categories; since the present invention divides internet websites into 15 categories, all output category vectors are 15-dimensional, where the dimension corresponding to the true category of the website is 1 and all other dimensions are 0;
In the implementation example, each website is likewise converted into a 15-dimensional low-dimensional website category representation, yielding the output data for deep learning model training and testing, 95,000 output vectors in total.
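The 15-dimensional one-hot category vector of step 1-4 can be built as follows; the English category names and the `labels` variable (one manually assigned category name per website) are illustrative assumptions.

```python
import numpy as np

CATEGORIES = ["e-commerce", "science and technology", "travel and transportation",
              "business and economy", "social networking", "life services",
              "sports and fitness", "medical care and health", "literature and art",
              "news media", "entertainment", "politics and law",
              "computer and network", "education and training", "society and culture"]

def one_hot(label):
    ov = np.zeros(len(CATEGORIES), dtype=np.float32)
    ov[CATEGORIES.index(label)] = 1.0   # dimension of the true category is 1, others 0
    return ov

# labels: manually assigned category name per website (assumption)
Y = np.stack([one_hot(lbl) for lbl in labels])
print(Y.shape)                           # -> (num_websites, 15)
```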
All the input data and output data together form a dataset of 95,000 data pairs for training and testing the deep learning model.
Step 2. Use the high-dimensional feature vector of each website as the input of the deep learning model and the category vector of the website as its output, and train an LSTM-based recurrent neural network deep learning model in a supervised manner with the Adam gradient descent optimizer to extract the category abstract feature vector of the website. When training the model, the training data are a randomly selected 70% of the website dataset, and the validation and test data are the remaining 30%. The specific steps are:
2-1. Randomly select 70% of the internet website dataset as training data for supervised training of the LSTM recurrent neural network deep learning model, and use the remaining 30% as test data for evaluating the deep learning model. Each training sample consists of 2 parts: an input vector and an output vector, where the input vector is the 1000-dimensional high-dimensional website feature vector and the output vector is the 15-dimensional low-dimensional website category vector.
In the implementation example, 70% of the website dataset is randomly selected as the training set, i.e. 66,500 training samples, and the remaining 30% serve as test data, i.e. 28,500 test samples.
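A minimal sketch of the 70/30 split in 2-1, assuming the `X` and `Y` arrays from the previous sketches and using scikit-learn's `train_test_split`:

```python
from sklearn.model_selection import train_test_split

# 70% training data, 30% test data, randomly selected as in step 2-1.
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.3, random_state=42)
print(len(X_train), len(X_test))   # e.g. 66500 28500 for 95,000 samples
```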
2-2. With reference to Fig. 4, the overall framework of the LSTM recurrent neural network model is 1 input layer, 1 output layer, and a stack of 3 neural network hidden layers containing a large number of LSTM neurons; the input layer, hidden layers, and output layer are fully connected. With reference to Fig. 5, each LSTM neuron in the hidden layers consists of the following parts: an input gate, an output gate, a forget gate, a memory cell, a mapping function (y = Wx + b, where y is the neuron output value, x is the neuron input value, and W and b are the weight and bias matrices, respectively), and an activation function (Tanh). During model training, the Adam optimizer minimizes the loss function by gradient descent; this is achieved by continuously updating, through forward and backward propagation, the parameter values in the mapping function of every neuron of the neural network and the values in the memory cells. The data in the training set are used to train the neural network in order; the memory cell stores the intermediate information generated by the neuron in the previous training step, and this intermediate information serves as "experience" data for the current and later training steps. Whether the experience data generated by a neuron are stored in the memory cell is controlled by the input, output, and forget gates: when the input gate is open, the intermediate information of the current step is stored in the memory cell; when the output gate is open, the intermediate data saved in previous steps participate in the current step; and when the forget gate is open, the memory cell clears the previously saved intermediate data. The advantage of using an LSTM recurrent neural network is that the experience data from previous training can be fused with the input data of the current training step, which improves the classification prediction ability of the model.
In the implementation example, the LSTM recurrent neural network deep learning model has 1 input layer, 1 output layer, and 3 hidden layers, where the first hidden layer contains 1024 LSTM neurons, the second hidden layer contains 512 LSTM neurons, and the third hidden layer contains 256 LSTM neurons.
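The architecture described above (1000-dimensional input, three stacked LSTM hidden layers of 1024/512/256 neurons, a 15-dimensional SoftMax output, Adam optimizer with cross-entropy loss) could be expressed in Keras roughly as follows. This is a sketch under the assumption that each 1000-dimensional website vector is fed as a length-1 sequence; the patent does not specify how the vector is arranged along the time axis.

```python
import tensorflow as tf

model = tf.keras.Sequential([
    # Treat each 1000-dim website vector as a sequence of length 1 (assumption).
    tf.keras.layers.Input(shape=(1, 1000)),
    tf.keras.layers.LSTM(1024, return_sequences=True),   # first hidden layer
    tf.keras.layers.LSTM(512, return_sequences=True),    # second hidden layer
    tf.keras.layers.LSTM(256),                            # third hidden layer
    tf.keras.layers.Dense(15, activation="softmax"),      # SoftMax regression layer
])

model.compile(optimizer="adam",                 # Adam gradient descent optimizer
              loss="categorical_crossentropy",  # cross-entropy loss of step 2-3
              metrics=["accuracy"])
model.summary()
```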
2-3. The Adam gradient descent optimizer updates the parameters of the entire neural network model by optimizing the loss function, where the loss function Loss is defined in the form of cross entropy (Cross Entropy) as follows:
Loss = -Σ_x q(x) · log(p(x))
where x ranges over the individual dimensions of the vector, p(x) is the value of each dimension of the output-layer output vector, and q(x) is the value of each dimension of the true category vector.
In the embodiment of the present invention, the training set of 66,500 samples is used for supervised iterative training of the LSTM recurrent neural network deep learning model; each iteration selects 10,000 samples to update the neuron parameters of the model, and 4,500 iterations are performed in total.
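The supervised training of step 2 might then look like the following, reshaping the inputs to the assumed (samples, 1, 1000) layout. The batch size of 10,000 samples per update is taken from the embodiment, while the epoch count is an illustrative assumption chosen so that the total number of parameter updates is on the order of the 4,500 iterations mentioned above.

```python
# Reshape to (samples, timesteps=1, features=1000) for the LSTM input.
X_train_seq = X_train.reshape(-1, 1, 1000)
X_test_seq = X_test.reshape(-1, 1, 1000)

# 10,000 samples per parameter update; ~7 updates per epoch on 66,500 samples,
# so roughly 650 epochs approximate the 4,500 iterations of the embodiment.
history = model.fit(X_train_seq, Y_train,
                    batch_size=10000,
                    epochs=650,
                    validation_data=(X_test_seq, Y_test))
```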
Step 3. Add a SoftMax regression layer after the LSTM recurrent neural network deep learning model trained in step 2. The website category abstract feature vector output by the LSTM recurrent neural network is used as the input of the SoftMax regression layer, which outputs the category probability distribution vector of the website. The website category corresponding to the dimension with the largest probability value in this probability distribution vector is taken as the category of the website, and the output website categories are compared with the actual categories of the websites to obtain the classification accuracy for internet websites. The specific steps are:
3-1. The output vector of the output layer of the LSTM recurrent neural network deep learning model is a 15-dimensional low-dimensional vector. The SoftMax function is applied to this output vector, which makes the value of each dimension of the output vector lie between 0 and 1 and makes all dimension values sum to 1. The SoftMax function is defined as follows, where P_j is the value of the j-th dimension of the vector after SoftMax processing, O_j is the value of the j-th dimension of the output vector, and O_k is the value of the k-th dimension of the output vector:
P_j = exp(O_j) / Σ_k exp(O_k)
The final vector is then:
Output = SoftMax(OV_W)
where OV_W is the output vector of the deep learning model for website w.
3-2. Each dimension of Output has a value between 0 and 1, all dimension values sum to 1, and each dimension represents one website category; the category corresponding to the dimension with the largest value is the predicted category of the website. The classification accuracy is defined as the ratio of the number of correct predictions to the total number of predictions.
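Step 3 (taking the dimension with the largest probability as the predicted category and comparing it with the true category) corresponds to the following sketch, assuming the trained `model` and the test arrays from the earlier sketches:

```python
import numpy as np

probs = model.predict(X_test_seq)   # category probability distribution vectors
pred = np.argmax(probs, axis=1)     # dimension with the largest probability value
true = np.argmax(Y_test, axis=1)    # true category index from the one-hot vector

accuracy = np.mean(pred == true)    # correct predictions / all predictions
print(f"classification accuracy: {accuracy:.3f}")
```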
As shown in Fig. 6, in the embodiment of the present invention, the model parameters are updated over many iterations, and the automatic classification accuracy of the best model finally obtained for internet websites is 87.9%.
In conclusion, the present invention classifies websites using the high-dimensional vector representations of internet websites through an LSTM recurrent neural network deep learning model, thereby improving the accuracy of internet website classification and providing a more effective and practical website classification method for the modern information retrieval field.
The above examples are provided only for the purpose of describing the present invention and are not intended to limit its scope. The scope of the invention is defined by the following claims. Various equivalent substitutions and modifications made without departing from the spirit and principles of the present invention shall all fall within the scope of the present invention.

Claims (5)

1. An automatic internet website classification method based on deep learning, characterized by comprising the following steps:
a first step of collecting the original description information of a large number of internet websites from DNS server logs as a website dataset, preprocessing it, and manually labeling it; then extracting for each website the high-dimensional feature vector representation used as the input of the deep learning model, assigning the corresponding website category label to each website, and converting it into a category vector;
a second step of using the high-dimensional feature vector obtained in the first step as the input of the deep learning model and the category vector as the output of the deep learning model, and training an LSTM-based recurrent neural network deep learning model in a supervised manner with the Adam gradient descent optimizer to extract the category abstract feature vector of the website;
a third step of adding a SoftMax regression layer after the LSTM recurrent neural network deep learning model trained in the second step, using the category abstract feature vector of the website output by the LSTM recurrent neural network as the input of the SoftMax regression layer, outputting the category probability distribution vector of the website, and taking the website category corresponding to the dimension with the largest probability value in the probability distribution vector as the website category.
2. The automatic internet website classification method based on deep learning according to claim 1, characterized in that: in the first step, the original description information of an internet website includes the title, keywords, and description text of the internet website.
3. The automatic internet website classification method based on deep learning according to claim 1, characterized in that the specific implementation of the first step is as follows:
Step 1-1. Preprocess the original description information of the internet websites, where preprocessing includes removing invalid entries and removing stop words from the original description information; all preprocessed websites are manually labeled and divided into 15 categories: e-commerce, science and technology, travel and transportation, business and economy, social networking, life services, sports and fitness, medical care and health, literature and art, news media, entertainment, politics and law, computer and network, education and training, and society and culture;
Step 1-2. Perform word segmentation on the original description information of all websites in the preprocessed dataset, then arrange all segmentation results in the dataset in order as input data to train a Word2Vec word embedding model; after the model is trained, it can map every word in the website dataset into a 1000-dimensional space, and when the Word2Vec model is used, a word is given as input and the output 1000-dimensional vector is taken as the representation of that word;
Step 1-3. For each website in the dataset, compute the TF-IDF value of each word obtained by segmenting its original description information; given the high-dimensional vector representation of each word and its TF-IDF value in the website, the high-dimensional feature vector representation of each website in the dataset is computed with the following formula, and since each word is represented by a 1000-dimensional vector, the website high-dimensional feature vector obtained from the formula is also 1000-dimensional:
IV_w = Σ_{i=1..N} TF_IDF_{t_i,w} × Embedding_Vector_{t_i,w}
where IV_w is the high-dimensional feature vector representation of website w, t_i is the i-th word in website w, N is the total number of words obtained after segmentation, TF_IDF_{t_i,w} is the term frequency-inverse document frequency of the word in the website data, and Embedding_Vector_{t_i,w} is the high-dimensional vector representation of the word;
Step 1-4. For each website in the dataset, generate its corresponding output category vector OV_w; the dimension of this vector is the total number of categories, and since internet websites are divided into 15 categories, all output category vectors are 15-dimensional, where the dimension corresponding to the true category of the website is 1 and all other dimensions are 0.
4. The automatic internet website classification method based on deep learning according to claim 1, characterized in that the second step is as follows:
2-1. Randomly select 70% of the internet website dataset as training data for supervised training of the LSTM recurrent neural network deep learning model, and use the remaining 30% as test data for evaluating the deep learning model, where each training sample consists of 2 parts: an input vector and an output vector, the input vector being the 1000-dimensional high-dimensional website feature vector and the output vector being the 15-dimensional low-dimensional website category vector;
2-2. The overall framework of the LSTM recurrent neural network model is one input layer, one output layer, and a stack of three neural network hidden layers containing a large number of LSTM neurons, and the input layer, hidden layers, and output layer are fully connected; each LSTM neuron in the hidden layers consists of the following parts: an input gate, an output gate, a forget gate, a memory cell, a mapping function y = Wx + b, where y is the neuron output value, x is the neuron input value, and W and b are the weight and bias matrices, respectively, and an activation function (Tanh); during model training, the Adam optimizer minimizes the loss function by gradient descent, which is achieved by continuously updating, through forward and backward propagation, the parameter values in the mapping function of every neuron of the neural network and the values in the memory cells; the data in the training set are used to train the neural network in order; the memory cell stores the intermediate information generated by the neuron in the previous training step, and this intermediate information serves as "experience" data for the current and later training steps; whether the experience data generated by a neuron are stored in the memory cell is controlled by the input, output, and forget gates: when the input gate is open, the intermediate information of the current step is stored in the memory cell; when the output gate is open, the intermediate data saved in previous steps participate in the current step; and when the forget gate is open, the memory cell clears the previously saved intermediate data; the advantage of using an LSTM recurrent neural network is that the experience data from previous training can be fused with the input data of the current training step, which improves the classification prediction ability of the model;
Step 2-3: the Adam gradient descent optimizer updates the parameters of the entire neural network model by optimizing the loss function, where the loss function is defined by means of cross entropy (Cross Entropy) as follows:
Loss = -Σ_x q(x) · log(p(x))
where p(x) is the value of each dimension of the output vector of the output layer and q(x) is the value of each dimension of the true category vector (an illustrative code sketch of steps 2-1 to 2-3 follows this claim).
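Purely as an illustrative sketch of steps 2-1 to 2-3, and not the patented implementation, the snippet below assumes a TensorFlow/Keras environment and scikit-learn for the split; the layer width of 128 LSTM units, the epoch and batch-size values, and the reshaping of each 1000-dimensional website vector into a one-timestep sequence are all assumptions of this sketch rather than details taken from the claims.

```python
import tensorflow as tf
from sklearn.model_selection import train_test_split

def build_and_train(X, Y):
    """X: (n_sites, 1000) feature vectors; Y: (n_sites, 15) one-hot category vectors."""
    # Step 2-1: random 70% / 30% split into training and test data
    X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3, random_state=42)

    # LSTM layers expect (batch, timesteps, features); treat each website as a 1-step sequence
    X_train = X_train.reshape(-1, 1, 1000)
    X_test = X_test.reshape(-1, 1, 1000)

    # Step 2-2: input layer -> three stacked LSTM hidden layers -> fully connected output layer
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(1, 1000)),
        tf.keras.layers.LSTM(128, return_sequences=True),
        tf.keras.layers.LSTM(128, return_sequences=True),
        tf.keras.layers.LSTM(128),
        tf.keras.layers.Dense(15),               # 15-d output vector, left as raw logits
    ])

    # Step 2-3: Adam gradient-descent optimizer with a cross-entropy loss
    model.compile(optimizer=tf.keras.optimizers.Adam(),
                  loss=tf.keras.losses.CategoricalCrossentropy(from_logits=True),
                  metrics=["accuracy"])
    model.fit(X_train, Y_train, epochs=10, batch_size=32,
              validation_data=(X_test, Y_test))
    return model, X_test, Y_test
```

The Dense output layer is kept as raw logits here because claim 5 applies the SoftMax function as a separate, subsequent step.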
5. The deep-learning-based internet website automatic classification method according to claim 1, characterized in that the third step is implemented as follows:
Step 3-1: the output vector of the output layer of the LSTM recurrent neural network deep learning model is a 15-dimensional low-dimensional vector; the SoftMax function is applied to this output vector, which makes every dimension of the output vector take a value between 0 and 1 and makes all the dimension values sum to 1. The SoftMax function is defined as follows, where P_j is the value of the j-th dimension of the vector after SoftMax processing, O_j is the value of the j-th dimension of the output vector, and O_k is the value of the k-th dimension of the output vector:
P_j = exp(O_j) / Σ_k exp(O_k)
The finally obtained vector is then:
Output=SoftMax (OVw)
Step 3-2: every dimension of the Output vector from step 3-1 takes a value between 0 and 1 and all the dimension values sum to 1; each dimension represents one website category, so the category corresponding to the dimension with the maximum value is the predicted category of the website. Meanwhile, the classification accuracy is defined as the ratio of the number of correct predictions to the total number of predictions (an illustrative sketch follows).
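As a final hedged sketch of steps 3-1 and 3-2, the code below assumes the hypothetical model and test arrays returned by the previous sketch; it applies the SoftMax function to the 15-dimensional output vectors, takes the category of the maximum-probability dimension as the prediction, and computes accuracy as correct predictions over all predictions.

```python
import numpy as np

def softmax(o):
    """P_j = exp(O_j) / sum_k exp(O_k): values lie in (0, 1) and sum to 1 per vector."""
    e = np.exp(o - np.max(o, axis=-1, keepdims=True))   # shift for numerical stability
    return e / e.sum(axis=-1, keepdims=True)

def evaluate(model, X_test, Y_test):
    logits = model.predict(X_test)        # (n_test, 15) raw output-layer vectors
    probs = softmax(logits)               # probability distribution over the 15 categories
    pred = np.argmax(probs, axis=1)       # dimension with the maximum probability
    true = np.argmax(Y_test, axis=1)      # dimension of the true category
    accuracy = (pred == true).mean()      # correct predictions / all predictions
    return pred, accuracy
```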
CN201810181807.8A 2018-03-06 2018-03-06 A kind of internet site automatic classification method based on deep learning Pending CN108364028A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810181807.8A CN108364028A (en) 2018-03-06 2018-03-06 A kind of internet site automatic classification method based on deep learning

Publications (1)

Publication Number Publication Date
CN108364028A true CN108364028A (en) 2018-08-03

Family

ID=63003483

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810181807.8A Pending CN108364028A (en) 2018-03-06 2018-03-06 A kind of internet site automatic classification method based on deep learning

Country Status (1)

Country Link
CN (1) CN108364028A (en)

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170357945A1 (en) * 2016-06-14 2017-12-14 Recruiter.AI, Inc. Automated matching of job candidates and job listings for recruitment
CN106250402A (en) * 2016-07-19 2016-12-21 杭州华三通信技术有限公司 A kind of Website classification method and device
CN106599933A (en) * 2016-12-26 2017-04-26 哈尔滨工业大学 Text emotion classification method based on the joint deep learning model
CN106779073A (en) * 2016-12-27 2017-05-31 西安石油大学 Media information sorting technique and device based on deep neural network
CN107066446A (en) * 2017-04-13 2017-08-18 广东工业大学 A kind of Recognition with Recurrent Neural Network text emotion analysis method of embedded logic rules
CN107066583A (en) * 2017-04-14 2017-08-18 华侨大学 A kind of picture and text cross-module state sensibility classification method merged based on compact bilinearity
CN107480141A (en) * 2017-08-29 2017-12-15 南京大学 It is a kind of that allocating method is aided in based on the software defect of text and developer's liveness

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
陈芊希 (Chen Qianxi) et al.: "Research on Web Page Classification Algorithms Based on Deep Learning", Microcomputer Applications (《微型电脑应用》) *
黄磊 (Huang Lei) et al.: "Research on Text Classification Based on Recursive Neural Networks", Journal of Beijing University of Chemical Technology (Natural Science Edition) (《北京化工大学学报(自然科学版)》) *

Cited By (47)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109360052A (en) * 2018-09-27 2019-02-19 北京亚联之星信息技术有限公司 A kind of data classification based on machine learning algorithm, data processing method and equipment
CN109408947A (en) * 2018-10-19 2019-03-01 杭州刀豆网络科技有限公司 A kind of infringement webpage judgment method based on machine learning
CN111125563A (en) * 2018-10-31 2020-05-08 安碁资讯股份有限公司 Method for evaluating domain name and server thereof
CN109460473A (en) * 2018-11-21 2019-03-12 中南大学 The electronic health record multi-tag classification method with character representation is extracted based on symptom
CN109710838A (en) * 2018-12-05 2019-05-03 厦门笨鸟电子商务有限公司 A kind of company's site's keyword extracting method based on deep neural network
CN109710838B (en) * 2018-12-05 2021-02-26 厦门笨鸟电子商务有限公司 Company website keyword extraction method based on deep neural network
CN113168890A (en) * 2018-12-10 2021-07-23 生命科技股份有限公司 Deep base recognizer for Sanger sequencing
CN113168890B (en) * 2018-12-10 2024-05-24 生命科技股份有限公司 Deep base identifier for Sanger sequencing
CN109784376A (en) * 2018-12-26 2019-05-21 中国科学院长春光学精密机械与物理研究所 A kind of Classifying Method in Remote Sensing Image and categorizing system
CN109948665B (en) * 2019-02-28 2020-11-27 中国地质大学(武汉) Human activity type classification method and system based on long-time and short-time memory neural network
CN109948665A (en) * 2019-02-28 2019-06-28 中国地质大学(武汉) Physical activity genre classification methods and system based on long Memory Neural Networks in short-term
CN109886408A (en) * 2019-02-28 2019-06-14 北京百度网讯科技有限公司 A kind of deep learning method and device
CN110020024A (en) * 2019-03-15 2019-07-16 叶宇铭 Classification method, system, the equipment of link resources in a kind of scientific and technical literature
CN110020024B (en) * 2019-03-15 2021-07-30 中国人民解放军军事科学院军事科学信息研究中心 Method, system and equipment for classifying link resources in scientific and technological literature
CN110321555A (en) * 2019-06-11 2019-10-11 国网江苏省电力有限公司南京供电分公司 A kind of power network signal classification method based on Recognition with Recurrent Neural Network model
CN110298391A (en) * 2019-06-12 2019-10-01 同济大学 A kind of iterative increment dialogue intention classification recognition methods based on small sample
CN110472045A (en) * 2019-07-11 2019-11-19 中山大学 A kind of short text falseness Question Classification prediction technique and device based on document insertion
CN110472045B (en) * 2019-07-11 2023-02-03 中山大学 Short text false problem classification prediction method and device based on document embedding
CN110609898B (en) * 2019-08-19 2023-05-05 中国科学院重庆绿色智能技术研究院 Self-classifying method for unbalanced text data
CN110609898A (en) * 2019-08-19 2019-12-24 中国科学院重庆绿色智能技术研究院 Self-classification method for unbalanced text data
CN110472885A (en) * 2019-08-22 2019-11-19 华南师范大学 A kind of website assessment system and its working method
CN110516210A (en) * 2019-08-22 2019-11-29 北京影谱科技股份有限公司 The calculation method and device of text similarity
CN110516210B (en) * 2019-08-22 2023-06-27 北京影谱科技股份有限公司 Text similarity calculation method and device
CN110765393A (en) * 2019-09-17 2020-02-07 微梦创科网络科技(中国)有限公司 Method and device for identifying harmful URL (uniform resource locator) based on vectorization and logistic regression
CN110855635A (en) * 2019-10-25 2020-02-28 新华三信息安全技术有限公司 URL (Uniform resource locator) identification method and device and data processing equipment
CN110855635B (en) * 2019-10-25 2022-02-11 新华三信息安全技术有限公司 URL (Uniform resource locator) identification method and device and data processing equipment
CN111078646B (en) * 2019-12-30 2023-12-05 山东蝶飞信息技术有限公司 Method and system for grouping software based on operation data of Internet equipment
CN111078646A (en) * 2019-12-30 2020-04-28 弭迺彬 Method and system for grouping software based on running data of Internet equipment
CN111159416B (en) * 2020-04-02 2020-07-17 腾讯科技(深圳)有限公司 Language task model training method and device, electronic equipment and storage medium
CN111159416A (en) * 2020-04-02 2020-05-15 腾讯科技(深圳)有限公司 Language task model training method and device, electronic equipment and storage medium
CN111782802B (en) * 2020-05-15 2023-11-24 北京极兆技术有限公司 Method and system for obtaining commodity corresponding to national economy manufacturing industry based on machine learning
CN111782802A (en) * 2020-05-15 2020-10-16 北京极兆技术有限公司 Method and system for obtaining national economy manufacturing industry corresponding to commodity based on machine learning
CN111895986A (en) * 2020-06-30 2020-11-06 西安建筑科技大学 MEMS gyroscope original output signal noise reduction method based on LSTM neural network
CN112329524A (en) * 2020-09-25 2021-02-05 泰山学院 Signal classification and identification method, system and equipment based on deep time sequence neural network
CN112287274B (en) * 2020-10-27 2022-10-18 中国科学院计算技术研究所 Method, system and storage medium for classifying website list pages
CN112287274A (en) * 2020-10-27 2021-01-29 中国科学院计算技术研究所 Method, system and storage medium for classifying website list pages
CN112559099B (en) * 2020-12-04 2024-02-27 北京国家新能源汽车技术创新中心有限公司 Remote image display method, device and system based on user behaviors and storage medium
CN112559099A (en) * 2020-12-04 2021-03-26 北京新能源汽车技术创新中心有限公司 Remote image display method, device and system based on user behavior and storage medium
CN112690823A (en) * 2020-12-22 2021-04-23 海南力维科贸有限公司 Method and system for identifying physiological sounds of lungs
CN112800229B (en) * 2021-02-05 2022-12-20 昆明理工大学 Knowledge graph embedding-based semi-supervised aspect-level emotion analysis method for case-involved field
CN112800229A (en) * 2021-02-05 2021-05-14 昆明理工大学 Knowledge graph embedding-based semi-supervised aspect-level emotion analysis method for case-involved field
CN113392323B (en) * 2021-06-15 2022-04-19 电子科技大学 Business role prediction method based on multi-source data joint learning
CN113392323A (en) * 2021-06-15 2021-09-14 电子科技大学 Business role prediction method based on multi-source data joint learning
CN113673620A (en) * 2021-08-27 2021-11-19 工银科技有限公司 Method, system, device, medium and program product for model generation
TWI827984B (en) * 2021-10-05 2024-01-01 台灣大哥大股份有限公司 System and method for website classification
CN116258579A (en) * 2023-04-28 2023-06-13 成都新希望金融信息有限公司 Training method of user credit scoring model and user credit scoring method
CN116258579B (en) * 2023-04-28 2023-08-04 成都新希望金融信息有限公司 Training method of user credit scoring model and user credit scoring method

Similar Documents

Publication Publication Date Title
CN108364028A (en) A kind of internet site automatic classification method based on deep learning
Fan et al. Metapath-guided heterogeneous graph neural network for intent recommendation
CN109492157B (en) News recommendation method and theme characterization method based on RNN and attention mechanism
CN104933164B (en) In internet mass data name entity between relationship extracting method and its system
CN105893609A (en) Mobile APP recommendation method based on weighted mixing
Fang et al. Folksonomy-based visual ontology construction and its applications
CN107291902A (en) Automatic marking method is checked in a kind of popular contribution based on hybrid classification technology
Yanmei et al. Research on Chinese micro-blog sentiment analysis based on deep learning
CN113806630B (en) Attention-based multi-view feature fusion cross-domain recommendation method and device
CN110377605A (en) A kind of Sensitive Attributes identification of structural data and classification stage division
Cong Personalized recommendation of film and television culture based on an intelligent classification algorithm
Ji et al. Attention based meta path fusion for heterogeneous information network embedding
CN116186372A (en) Bibliographic system capable of providing personalized service
Bai et al. A rumor detection model incorporating propagation path contextual semantics and user information
Paraschiv et al. A unified graph-based approach to disinformation detection using contextual and semantic relations
Zhang et al. Reinforced adaptive knowledge learning for multimodal fake news detection
Zhu A book recommendation algorithm based on collaborative filtering
Wang Knowledge graph analysis of internal control field in colleges
CN109597946A (en) A kind of bad webpage intelligent detecting method based on deepness belief network algorithm
CN114722304A (en) Community search method based on theme on heterogeneous information network
Bao et al. Hot news prediction method based on natural language processing technology and its application
Gao et al. Statistics and Analysis of Targeted Poverty Alleviation Information Integrated with Big Data Mining Algorithm
Zeng et al. Model-Stacking-based network user portrait from multi-source campus data
Liu Construction of personalized recommendation system of university library based on SOM neural network
Wang et al. An api recommendation method based on beneficial interaction

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication (application publication date: 20180803)