CN108364028A - Automatic classification method for internet websites based on deep learning - Google Patents
Automatic classification method for internet websites based on deep learning
- Publication number
- CN108364028A CN108364028A CN201810181807.8A CN201810181807A CN108364028A CN 108364028 A CN108364028 A CN 108364028A CN 201810181807 A CN201810181807 A CN 201810181807A CN 108364028 A CN108364028 A CN 108364028A
- Authority
- CN
- China
- Prior art keywords
- website
- vector
- data
- output
- deep learning
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2411—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/958—Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- Databases & Information Systems (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Artificial Intelligence (AREA)
- Evolutionary Computation (AREA)
- Evolutionary Biology (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Bioinformatics & Computational Biology (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The present invention relates to a deep-learning-based method for automatically classifying internet websites. Original description information for a large number of websites is collected from DNS server logs to form a website dataset, which is preprocessed and manually labelled. A high-dimensional feature vector is then extracted for each website as input to a deep learning model, and each website's category label is converted into a category vector. With the feature vectors as input and the category vectors as supervised output, an LSTM-based recurrent neural network deep learning model is trained using the Adam gradient-descent optimizer. A SoftMax regression layer is added after the trained LSTM recurrent neural network; the website category corresponding to the dimension with the largest probability in the resulting probability-distribution vector is taken as the predicted category, and the predicted categories are compared with the websites' true categories to obtain the classification accuracy.
Description
Technical field
The present invention relates to an automatic classification method for internet websites based on deep learning, and belongs to the technical field of artificial intelligence.
Background technology
In recent years the internet has grown explosively, and the pace of China's internet development has drawn worldwide attention. The 40th China Statistical Report on Internet Development [1], issued by the China Internet Network Information Center in July 2017, reported that by June 2017 China had 751 million internet users, with 19.92 million new users in the preceding half year; internet penetration had reached 54.3%; and the total number of Chinese websites had reached 5.06 million, of which 2.7 million were under the ".CN" domain. User base, penetration rate and website count are all trending upward.
Faced with such a huge number of websites, the need for and importance of website classification are increasingly apparent. Classification plays a key role in many information-management and retrieval tasks: targeted crawling by web crawlers, development and maintenance of web directories, topic-oriented link analysis, targeted delivery of online advertising, and analysis of network topology. Website classification can also help search engines improve accuracy and result quality, and operators and internet service providers can use it to analyse user access behaviour and optimise resource allocation.
Deep learning [2] is a current research hotspot and has produced outstanding results in artificial intelligence and many other fields. It is a branch of machine learning; today the term usually refers to deep neural networks. Its representations are inspired by neuroscience: it attempts to simulate the information transfer and potential fluctuations of human neurons, aiming at algorithms that rival human performance on certain recognition tasks.
Websites on the internet can be divided into many categories: "Taobao", for example, can be classed as "e-commerce" and "Ctrip" as "travel and transportation". How to assign websites to categories accurately is an important and meaningful research problem. Most existing work, domestic and international, builds classification models on publicly available free datasets, in which a certain number of websites have been collected and hand-labelled with fixed categories. Several representative approaches are reviewed below.
R. Rajalakshmi and C. Aravindan proposed a website classification algorithm based on naive Bayes [3]. Their algorithm uses the website URL as the basis for classification and divides websites into 9 categories: arts, business, computers, games, health, home, news, recreation and reference. They extracted 855,939 URLs in total from the ODP (Open Directory Project) dataset as training data and built a 9-class model with the naive Bayes machine learning algorithm.
Min-Yen Kan proposed a web page classification algorithm based on SVM (Support Vector Machine) [4]. The algorithm uses the page URL as the basis for classification and divides pages into 4 categories: course, faculty, project and student. The work is based on the ILP-98 WebKB dataset, which contains 4,167 web pages, and builds an SVM model with that machine learning algorithm.
Chen Qianxi and Fan Lei proposed a web page classification algorithm based on SAE (sparse autoencoder) [5]. Their algorithm uses website titles, keywords and description text as the basis for classification and divides websites into 8 categories: leisure and entertainment, online shopping, network services, business and economy, daily-life services, education and culture, blogs and forums, and general websites. Their dataset was collected by crawling 200 pages from each category of the "First Classified Directory" website, 1,600 pages in total, and was modelled with a sparse autoencoder.
Jia M. Q., Wang Z. M. and Chen G. proposed a website classification algorithm based on decision trees and HTTP behaviour analysis [6]. Their algorithm uses the HTTP behaviour generated while users visit a website as the classification basis and divides websites into 3 classes: news, resource sharing, and community exchange. Their data came from user access records of HERNET (Henan Province's branch of the China Education and Research Network), from which HTTP behaviours were extracted as the dataset and modelled with a decision-tree machine learning method.
Hou Cuiqin and Jiao Licheng proposed a graph-based co-training website classification algorithm, GCo-training [7]. Their study is based on the WebKB dataset. Within the co-training framework the algorithm iteratively trains a semi-supervised classifier on a graph constructed from hyperlink information and a Bayes classifier on text features; the two help each other and continually improve, after which the Bayes classifier can be used to predict the categories of large amounts of unseen data.
Sun Jingchao proposed a web page classification method based on machine learning [8]. The study uses multiple page features, such as the URL and host name, as the classification basis and divides pages into two categories, benign and malicious. A large number of benign and malicious URLs were collected from Alexa and PhishTank as the dataset, and several machine learning methods, including support vector machines, naive Bayes and logistic regression, were used for modelling.
Zhang Jingshen and Ren Fangying proposed a website classification method and apparatus (patent publication No. CN106250402A). The method uses the first label information and first page content of the website to be classified and, according to a preset label dictionary, determines the website category corresponding to the first label information.
Wang Yong proposed a website classification method and system (patent publication No. CN101458713A). The method counts, per website, the search keywords users enter and the URLs they click; uses these statistics to determine the set of keywords pointing to the website to be classified and builds a keyword-set vector for it; determines seed websites of known type and builds keyword-set vectors for them; computes the similarity between the website to be classified and the seed websites from their vectors; and assigns the website's type according to the similarity.
Existing traditional website classification methods share some common shortcomings, mainly in four respects: the classification basis, the methods used, the data used, and the target categories.
In short, the deficiencies of the prior art are: the training data are neither large nor recent enough; the classification basis carries too little information; the methods and models used are ill-suited to classifying high-dimensional samples, with insufficient capacity for feature extraction and learning; and the number of categories is too small to meet the demand for fine-grained multi-class classification.
[1] The 40th China Statistical Report on Internet Development, China Internet Network Information Center, 2017-08-04.
[2] LeCun Y., Bengio Y., Hinton G. Deep learning. Nature, 2015, 521(7553): 436-444.
[3] Rajalakshmi R., Aravindan C. Naive Bayes approach for website classification. In: Das V. V., Thomas G., Lumban Gaol F. (eds.), AIM 2011, CCIS 147, pp. 323-326, 2011.
[4] Kan M.-Y. Web page categorization without the web page. WWW 2004, May 17-22, 2004, New York, NY, USA. ACM 1-58113-912-8/04/0005.
[5] Chen Qianxi, Fan Lei. Webpage classification based on deep learning algorithm. Microcomputer Applications, 2016, 32(2): 25-28.
[6] Jia M. Q., Wang Z. M., Chen G. Research on website classification based on user HTTP behavior analysis. Computer Engineering and Design, 2010(2).
[7] Hou Cuiqin, Jiao Licheng. Graph based co-training algorithm for web page classification. Acta Electronica Sinica, 2009, 37(10): 2174-2180.
[8] Sun Jingchao. A classification method of web page using machine learning. Netinfo Security, 2017(9): 45-48.
Summary of the invention
The technical problem solved by the present invention is to overcome the deficiencies of the prior art and provide a deep-learning-based automatic classification method for internet websites. Websites are classified from their high-dimensional vector representations by an LSTM recurrent neural network deep learning model, improving classification accuracy and providing a more effective and practical classification method for the modern information retrieval field.
The technical solution that achieves the object of the invention is an internet website classification method based on an LSTM recurrent neural network deep learning model, comprising the following steps:
Step 1: collect the original description information of a large number of internet websites from the live-network DNS server logs as the website dataset, preprocess it and label it manually; then extract for each website the high-dimensional feature vector used as input to the deep learning model, assign each website its category label, and convert the label into a category vector. The specific sub-steps are:
Step 1-1: preprocess the original description information of the internet websites. Preprocessing includes removing invalid entries (data entries for websites returning 404 not-found errors, internal server errors, or that have ceased service) and removing stop words (e.g. "的", "a" and other meaningless tokens and special symbols) from the description text. All preprocessed websites are then manually labelled into 15 categories: e-commerce; science and technology; travel and transportation; business and economy; social networking; daily-life services; sports and fitness; medical and health; literature and art; news media; entertainment; politics and law; computer and network; education and training; society and culture.
Step 1-2: segment the original description text of every website in the preprocessed dataset into words, arrange all segmentation results in order as input data, and train a Word2Vec word embedding model. Once trained, the model maps every word in the dataset into a 1000-dimensional space: given a word as input, it outputs the 1000-dimensional vector representation of that word.
Step 1-3: for each website in the dataset, compute the TF-IDF value of every word in its segmented original description. Given the high-dimensional vector representation of each word and its TF-IDF value within the website, the website's high-dimensional feature vector can be computed with the following formula; since each word is represented by a 1000-dimensional vector, the resulting website feature vector is also 1000-dimensional:
IV_w = Σ_i TF_IDF(t_i, w) × Embedding_Vector(t_i, w)
where IV_w is the high-dimensional feature vector of website w, t_i is the i-th word in website w, TF_IDF(t_i, w) is the term frequency-inverse document frequency of the word in the website data, and Embedding_Vector(t_i, w) is the word's high-dimensional vector representation (obtained from the Word2Vec model);
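Reading the feature vector as a TF-IDF-weighted sum of word embeddings (the original formula image is not legible here, so the exact aggregation is an assumption), the computation can be sketched as:

```python
import math

def tf_idf(term, doc_tokens, corpus):
    """TF-IDF of a term within one website's token list, against the whole corpus."""
    tf = doc_tokens.count(term) / len(doc_tokens)
    df = sum(1 for doc in corpus if term in doc)       # documents containing the term
    idf = math.log(len(corpus) / df)
    return tf * idf

def website_vector(doc_tokens, corpus, embeddings, dim):
    """IV_w = sum_i TF_IDF(t_i, w) * Embedding_Vector(t_i): weighted sum of embeddings."""
    iv = [0.0] * dim
    for term in set(doc_tokens):
        weight = tf_idf(term, doc_tokens, corpus)
        for j, component in enumerate(embeddings[term]):
            iv[j] += weight * component
    return iv
```

In the patent `dim` is 1000; a 2-dimensional toy embedding is enough to check the arithmetic.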
Step 1-4: for each website in the dataset, generate its corresponding output category vector OV_w. The dimensionality of this vector equals the total number of categories; since the present invention divides websites into 15 categories, every output category vector has 15 dimensions, in which the dimension corresponding to the website's true category is 1 and all other dimensions are 0.
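The one-hot category vector of step 1-4 is straightforward, with 15 classes as listed in step 1-1:

```python
NUM_CLASSES = 15   # the 15 website categories listed in step 1-1

def category_vector(class_index, num_classes=NUM_CLASSES):
    """OV_w: 1 in the dimension of the true category, 0 elsewhere."""
    ov = [0] * num_classes
    ov[class_index] = 1
    return ov
```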
Step 2: take the website feature vectors obtained in step 1 as the input of the deep learning model and the website category vectors as its output, and use the Adam gradient-descent optimizer to train, with supervision, an LSTM-based recurrent neural network deep learning model that extracts abstract category features of websites. For training, 70% of the website dataset is randomly selected as training data; the remaining 30% is used for validation and testing. The specific sub-steps are:
Step 2-1: randomly select 70% of the internet website dataset as training data for supervised training of the LSTM recurrent neural network, with the remaining 30% as test data for evaluating the deep learning model. Each training sample consists of 2 parts, an input vector and an output vector: the input vector is the 1000-dimensional high-dimensional website feature vector and the output vector is the 15-dimensional low-dimensional website category vector.
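The 70/30 split of step 2-1 can be sketched as follows; the seeded shuffle is an implementation choice, not specified by the patent:

```python
import random

def split_dataset(samples, train_frac=0.7, seed=42):
    """Randomly split samples into training and test sets (70% / 30% by default)."""
    shuffled = samples[:]                      # copy so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_frac)
    return shuffled[:cut], shuffled[cut:]
```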
Step 2-2: the overall structure of the LSTM recurrent neural network model is 1 input layer, 1 output layer and a stack of 3 hidden layers containing many LSTM neurons; input, hidden and output layers are fully connected. Each LSTM neuron in the hidden layers consists of an input gate, an output gate, a forget gate, a memory cell, a mapping function (y = Wx + b, where y is the neuron output, x the neuron input, and W and b the weight and bias matrices) and an activation function (tanh). During model training the Adam optimizer minimises the loss function by gradient descent, continually updating, through forward and backward passes, the parameters of each neuron's mapping function and the values held in the memory cells. The training data are fed to the network in order; the memory cell stores intermediate information produced by the neuron in previous training steps, and this information serves as "experience" for the current and later training steps. Whether the experience a neuron generates is stored in the memory cell is controlled by the input, output and forget gates: when the input gate opens, the current step's intermediate information is written to the memory cell; when the output gate opens, previously stored intermediate data participate in the current step; when the forget gate opens, the memory cell discards previously stored data. The advantage of using an LSTM recurrent neural network is that experience from earlier training can be fused with the current input to train the model, improving its predictive ability for classification.
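A single time step of the gate mechanism described above can be sketched in NumPy. This follows the standard LSTM formulation (sigmoid gates, tanh activation); the weight names and dimensions are illustrative, not the patent's exact internals:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, p):
    """One LSTM time step: input/forget/output gates plus the memory cell."""
    z = np.concatenate([h_prev, x])            # previous hidden state and current input
    i = sigmoid(p["Wi"] @ z + p["bi"])         # input gate: admit new information
    f = sigmoid(p["Wf"] @ z + p["bf"])         # forget gate: discard old memory
    o = sigmoid(p["Wo"] @ z + p["bo"])         # output gate: expose memory
    g = np.tanh(p["Wg"] @ z + p["bg"])         # candidate memory content
    c = f * c_prev + i * g                     # memory cell update
    h = o * np.tanh(c)                         # hidden output
    return h, c
```

Stacking three such layers and adding fully connected input and output layers yields the architecture the patent describes.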
Step 2-3: the Adam gradient-descent optimizer updates the parameters of the entire neural network model by optimising the loss function, which is defined as the cross entropy:
Loss = − Σ_x q(x) × log p(x)
where p(x) is the value of each dimension of the output-layer output vector and q(x) is the value of each dimension of the true category vector.
Step 3: add a SoftMax regression layer after the LSTM recurrent neural network trained in step 2. The abstract website-category feature vector output by the LSTM network is the input to the SoftMax layer, whose output is the class probability distribution vector of the website. The website category corresponding to the dimension with the largest probability in this distribution vector is taken as the website's category, and the predicted categories are compared with the websites' true categories to obtain the classification accuracy. The specific sub-steps are:
Step 3-1: the output vector of the LSTM network's output layer is a 15-dimensional low-dimensional vector. Applying the SoftMax function to it maps every dimension into the interval between 0 and 1 and makes all dimensions sum to 1. The SoftMax function is defined as
P_j = exp(O_j) / Σ_k exp(O_k)
where P_j is the value of the j-th dimension of the vector after SoftMax, O_j is the value of the j-th dimension of the output vector, and O_k is the value of its k-th dimension. The final vector is therefore:
Output = SoftMax(OV_w)
Step 3-2: every dimension of the Output vector from step 3-1 lies between 0 and 1 and the dimensions sum to 1; each dimension represents one website category, and the category corresponding to the dimension with the largest value is the website's predicted category. Classification accuracy is defined as the number of correct predictions divided by the total number of predictions.
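Steps 3-1 and 3-2 amount to a SoftMax followed by an argmax and an accuracy count, sketched here in plain Python:

```python
import math

def softmax(o):
    m = max(o)                                 # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in o]
    s = sum(exps)
    return [e / s for e in exps]

def predicted_class(o):
    """Index of the largest probability: the predicted website category."""
    probs = softmax(o)
    return probs.index(max(probs))

def accuracy(pred_labels, true_labels):
    correct = sum(1 for p, t in zip(pred_labels, true_labels) if p == t)
    return correct / len(true_labels)
```

Subtracting the maximum before exponentiating does not change the result, since it cancels in the ratio.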
The advantages of the present invention over the prior art are as follows:
(1) The title, keywords and description of a website are combined via Word2Vec and TF-IDF into a 1000-dimensional feature vector with high information content. Because of its high dimensionality the vector can encode most of the information in the website's original description text; using it as the classification basis avoids the information shortage that arises when URLs or HTTP behaviour are used instead.
(2) A deep learning approach: building an LSTM recurrent neural network deep learning model for automatic website classification overcomes the poor performance of common machine learning methods on high-dimensional, large-sample, multi-class problems.
(3) Considering the importance of the modelling data to practical utility, a large amount of website data collected from live-network DNS servers and manually labelled is used for training and testing, solving the inaccuracy and impracticality caused in earlier studies by free public datasets that may be out of date.
(4) Regarding the classification basis: traditional approaches mostly use the website URL, but URLs are highly random and uncertain and contain too little information to be a credible basis. Other methods use the HTTP behaviour of users visiting a website, but HTTP behaviour is often too abstract relative to the site's actual content (requests for video or images, for instance, occur in many categories), making fine-grained classification difficult. The present invention instead extracts the site's title, keywords and description text as the classification basis. Webmasters write this text carefully for search-engine optimisation, so it tends to describe the site's specific business precisely; based on these textual descriptions, credible website classification can be achieved while preserving as much information as possible.
(5) Regarding the classification method: traditional approaches mostly use machine learning methods such as support vector machines and naive Bayes, which, owing to inherent limitations, struggle to achieve good results on high-dimensional, large-sample, multi-class problems. The present invention classifies websites with deep learning, building deep neural networks that simulate the cognitive patterns of the human brain; with their large number of neurons, such networks have strong feature-extraction and learning ability and are well suited to information-rich website classification tasks.
(6) Regarding the data: traditional approaches mostly use free public datasets, which over time suffer from insufficient volume, staleness and errors. The present invention instead draws on an operator's real live-network user access logs, extracts all website domain names from them, and uses these domains as seed data to collect and manually label a large amount of web data. The timeliness and authenticity of these data ensure the credibility and usability of the final model.
(7) Regarding the categories: traditional methods usually classify websites into few categories, such as malicious versus benign, or news, resource sharing and community exchange, generally fewer than 10, so websites cannot be assigned to specific fine-grained categories. The present invention uses deep learning to automatically classify websites into 15 categories: e-commerce; science and technology; travel and transportation; business and economy; social networking; daily-life services; sports and fitness; medical and health; literature and art; news media; entertainment; politics and law; computer and network; education and training; society and culture. This diversity of categories makes classification finer and more accurate.
Description of the drawings
Fig. 1 Flow chart of the deep-learning-based automatic website classification method;
Fig. 2 Examples of the title, keywords and description text of the Sina and Tencent Video websites;
Fig. 3 Part of the website data stored by the data-crawling program after preprocessing;
Fig. 4 Schematic of the LSTM recurrent neural network deep learning model;
Fig. 5 Schematic of the internal structure of an LSTM neuron;
Fig. 6 Related results and test figures of the invention.
Detailed description of the embodiments
With reference to Fig. 1, the present invention is an internet website classification method based on an LSTM recurrent neural network deep learning model, comprising the following steps:
Step 1: with reference to Fig. 2, collect the original description information of a large number of internet websites from the live-network DNS server logs as the website dataset, preprocess it and label it manually; then extract for each website the high-dimensional feature vector used as input to the deep learning model, assign each website its category label, and convert the label into a category vector. The specific sub-steps are:
Step 1-1: with reference to Fig. 3, preprocess the original description information. Preprocessing includes removing invalid entries (data entries for websites returning 404 not-found errors, internal server errors, or that have ceased service) and removing stop words (e.g. "的", "a" and other meaningless tokens and special symbols) from the description text. All preprocessed websites are manually labelled into 15 categories: e-commerce; science and technology; travel and transportation; business and economy; social networking; daily-life services; sports and fitness; medical and health; literature and art; news media; entertainment; politics and law; computer and network; education and training; society and culture.
Step 1-2: segment the original description text of every website in the preprocessed dataset into words, arrange all segmentation results in order as input data, and train a Word2Vec word embedding model. Once trained, the model maps every word in the dataset into a 1000-dimensional space: given a word as input, it outputs the 1000-dimensional vector representation of that word.
Step 1-3: for each website in the dataset, compute the TF-IDF value of every word in its segmented original description. Given the high-dimensional vector representation of each word and its TF-IDF value within the website, the website's high-dimensional feature vector can be computed with the following formula; since each word is represented by a 1000-dimensional vector, the resulting website feature vector is also 1000-dimensional:
IV_w = Σ_i TF_IDF(t_i, w) × Embedding_Vector(t_i, w)
where IV_w is the high-dimensional feature vector of website w, t_i is the i-th word in website w, TF_IDF(t_i, w) is the term frequency-inverse document frequency of the word in the website data, and Embedding_Vector(t_i, w) is the word's high-dimensional vector representation (obtained from the Word2Vec model);
Step 1-4: for each website in the dataset, generate its corresponding output category vector OV_w. The dimensionality of this vector equals the total number of categories; since this patent divides websites into 15 categories, every output category vector has 15 dimensions, in which the dimension corresponding to the website's true category is 1 and all other dimensions are 0.
Step 2: take the obtained website feature vectors as the input of the deep learning model and the website category vectors as its output, and use the Adam gradient-descent optimizer to train, with supervision, an LSTM-based recurrent neural network deep learning model that extracts abstract category features of websites. For training, 70% of the website dataset is randomly selected as training data; the remaining 30% is used for validation and testing. The specific sub-steps are:
Step 2-1: randomly select 70% of the internet website dataset as training data for supervised training of the LSTM recurrent neural network, with the remaining 30% as test data for evaluating the deep learning model. Each training sample consists of 2 parts, an input vector and an output vector: the input vector is the 1000-dimensional high-dimensional website feature vector and the output vector is the 15-dimensional low-dimensional website category vector.
2-2. With reference to Fig. 4, the overall LSTM recurrent neural network model consists of one input layer, one output layer, and a stack of three hidden layers containing a large number of LSTM neurons; the input layer, hidden layers, and output layer are fully connected. With reference to Fig. 5, each LSTM neuron in the hidden layers consists of the following parts: an input gate (Input Gate), an output gate (Output Gate), a forget gate (Forget Gate), a memory cell (Memory), a mapping function y = Wx + b (where y is the neuron's output value, x is the neuron's input value, and W and b are the weight and bias matrices, respectively), and an activation function (Tanh).
During model training, the Adam optimizer minimizes the loss function by gradient descent; this is achieved by continuously updating, through forward and backward propagation, the parameter values in each neuron's mapping function and the values stored in its memory cell. The data in the training set are used to train the neural network in sequence, and the memory cell stores the intermediate information produced by the neuron in the previous training step; this intermediate information serves as "experience" data for the current and subsequent training steps. Whether the experience data produced by a neuron is stored in the memory cell is controlled by the input, output, and forget gates: when the input gate is open, the intermediate information of the current training step is written to the memory cell; when the output gate is open, intermediate data saved in previous training steps participates in the current training step; when the forget gate is open, the memory cell discards previously saved intermediate data. The advantage of the LSTM recurrent neural network is that experience data from previous training steps can be fused with the input data of the current step during training, improving the model's classification accuracy.
2-3. The Adam gradient descent optimizer updates the parameters of the entire neural network model by optimizing the loss function, which is defined as the cross entropy (Cross Entropy):

Loss = -Σ_x q(x) · log p(x)

where p(x) is each dimension of the output-layer output vector and q(x) is the corresponding dimension of the true category vector.
Step 3. Append a SoftMax regression layer after the LSTM recurrent neural network deep learning model trained in the second step. The abstract website category feature vector output by the LSTM network serves as the input of the SoftMax layer, whose output is the category probability distribution vector of the website. The website category corresponding to the dimension with the largest probability value in this distribution vector is taken as the website's category, and the predicted categories are compared against the websites' actual categories to obtain the classification accuracy for internet websites. The specific steps are:
3-1. The output vector of the output layer of the LSTM recurrent neural network deep learning model is a 15-dimensional low-dimensional vector. Applying the SoftMax function to this vector maps every dimension to a value between 0 and 1 and makes all dimensions sum to 1. The SoftMax function is defined as follows, where P_j is the j-th dimension of the vector after SoftMax processing, O_j is the j-th dimension of the output vector, and O_k is the k-th dimension of the output vector:

P_j = e^(O_j) / Σ_k e^(O_k)

The final vector is then:
Output = SoftMax(OV_w)
3-2. As stated in 3-1, every dimension of Output lies between 0 and 1 and all dimensions sum to 1. Each dimension represents one website category, and the category corresponding to the dimension with the largest value is the predicted category of the website. The classification accuracy is defined as the ratio of correct predictions to all predictions.
The present invention is described in further detail below with reference to implementation examples.
With reference to Fig. 1, the present invention is an internet website classification method based on an LSTM recurrent neural network deep learning model. It first extracts the high-dimensional feature vector of each website, then supervises the training of the deep learning model with the Adam gradient descent optimizer, and finally classifies internet websites using the trained deep learning model and SoftMax regression, obtaining the classification accuracy. The method comprises the following steps:
Step 1. With reference to Fig. 2, collect the original description information of a large number of existing internet websites from DNS server logs as the website dataset, preprocess it and label it manually, then extract the high-dimensional feature vector representation of each website for input to the deep learning model, add the corresponding category label to each website, and convert the label into a category vector. The specific steps are:
Step 1-1. With reference to Fig. 3, preprocess the original description information of the internet websites. Preprocessing includes removing invalid entries (e.g., data entries corresponding to 404 page-not-found errors, internal server errors, and websites that have ceased operation) and removing stop words from the original description information (e.g., stop words such as "的" and "a", as well as meaningless special symbols). The preprocessed websites are then manually labelled and divided into 15 categories: e-commerce, science and technology, travel and transportation, business and economics, social networks, lifestyle services, sports and fitness, healthcare, literature and art, news media, entertainment, politics and law, computer networks, education and training, and society and culture;
In the implementation example, assume there are 100,000 original website description records in total; after the preprocessing described above, 95,000 valid records remain. These valid records are segmented into words and manually labelled into the 15 categories.
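As a rough illustration of the step 1-1 filtering, the sketch below drops invalid entries and strips stop words; the marker strings and stop-word list are illustrative stand-ins, not the patent's actual lists:

```python
# Hypothetical sketch of step 1-1 preprocessing: drop invalid entries
# (404 / server-error pages) and strip stop words from what remains.
STOP_WORDS = {"的", "a", "the"}
INVALID_MARKERS = ("404", "server error", "stopped service")

def preprocess(entries):
    # Remove entries whose description indicates a dead or broken site.
    valid = [e for e in entries if not any(m in e.lower() for m in INVALID_MARKERS)]
    # Remove stop words from the surviving descriptions.
    return [" ".join(w for w in e.split() if w not in STOP_WORDS) for e in valid]

clean = preprocess(["a news portal site", "404 not found error page"])
```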
Step 1-2. Apply word segmentation to all the preprocessed original website description information in the dataset, then arrange all segmentation results in order as input data to train a Word2Vec word embedding model. Once trained, the model maps every word in the website dataset into a 1000-dimensional space: when the Word2Vec model is used, a word is given as input and the 1000-dimensional vector it outputs serves as that word's representation;
In the implementation example, the segmentation results of the 95,000 valid records are used to train the Word2Vec model; after training, the model returns the high-dimensional vector representation of any input word.
Step 1-3. For each website in the dataset, compute the TF-IDF value of every word in its segmented original description information. Given the high-dimensional vector representation of each word and its TF-IDF value in the website, the high-dimensional feature vector representation of each website in the dataset can be computed with the following formula. Since each word is represented by a 1000-dimensional high-dimensional vector, the website feature vector produced by the formula is also 1000-dimensional:

IV_w = Σ(i=1..N) TF_IDF(t_i, w) × Embedding_Vector(t_i)

where IV_w is the high-dimensional feature vector representation of website w, i indexes the words obtained after segmenting the website, t_i is the i-th word in website w, N is the total number of words obtained after segmentation, TF_IDF(t_i, w) is the term frequency-inverse document frequency of the word in the website data, and Embedding_Vector(t_i) is the word's high-dimensional vector representation (obtained from the Word2Vec model).
In the implementation example, the formula above converts each website into a 1000-dimensional website vector representation, yielding the input data for deep learning model training and testing: 95,000 input vectors in total.
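A NumPy sketch of the website-vector formula from step 1-3; the 4-dimensional random embeddings are a toy stand-in for the trained 1000-dimensional Word2Vec vectors:

```python
import numpy as np

def tf_idf(term, doc, docs):
    # Term frequency in this document times inverse document frequency.
    tf = doc.count(term) / len(doc)
    df = sum(1 for d in docs if term in d)
    return tf * np.log(len(docs) / df)

docs = [["news", "sports", "news"], ["shop", "buy", "shop", "news"]]
rng = np.random.default_rng(0)
# Toy 4-dim embeddings standing in for the 1000-dim Word2Vec vectors.
emb = {w: rng.normal(size=4) for w in {"news", "sports", "shop", "buy"}}

def site_vector(doc, docs):
    # IV_w = sum_i TF-IDF(t_i, w) * Embedding_Vector(t_i)
    return sum(tf_idf(t, doc, docs) * emb[t] for t in doc)

iv = site_vector(docs[0], docs)
```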
Step 1-4. For each website in the dataset, generate its corresponding output category vector OV_w. The dimension of this vector equals the total number of categories; the present invention divides internet websites into 15 categories, so all output category vectors are 15-dimensional, with the dimension corresponding to the website's true category set to 1 and all other dimensions set to 0;
In the implementation example, each website is thus converted into a 15-dimensional website category representation, yielding the output data for deep learning model training and testing: 95,000 output vectors in total.
All input and output data together form a dataset of 95,000 data pairs for training and testing the deep learning model.
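A minimal sketch of the one-hot output encoding described in step 1-4:

```python
NUM_CLASSES = 15  # the patent's 15 website categories

def one_hot(label_index, n=NUM_CLASSES):
    # Dimension of the true category is 1, all others 0.
    v = [0.0] * n
    v[label_index] = 1.0
    return v

ov = one_hot(2)  # a site whose true category has index 2
```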
Step 2. Use the high-dimensional feature vector of each website as the input of the deep learning model and the website's category vector as its output, and use the Adam gradient descent optimizer to supervise the training of an LSTM-based recurrent neural network deep learning model that extracts the website's abstract category feature vector. During training, 70% of the website dataset, selected at random, serves as training data; the remaining 30% serves as validation and test data. The specific steps are:
2-1. Randomly select 70% of the internet website dataset as training data for the supervised training of the LSTM recurrent neural network deep learning model; the remaining 30% serves as test data for evaluating the deep learning model. Each training example consists of two parts: an input vector and an output vector, where the input vector is the 1000-dimensional high-dimensional website feature vector and the output vector is the 15-dimensional low-dimensional website category vector.
In the implementation example, 70% of the website dataset is randomly selected as the training set, i.e., 66,500 training examples; the remaining 30% serves as test data, i.e., 28,500 test examples.
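Under the patent's numbers (95,000 pairs, 70/30 split), the random split can be sketched as follows, with indices standing in for the actual (input, output) pairs:

```python
import random

random.seed(42)
pairs = list(range(95000))  # indices standing in for the 95,000 data pairs
random.shuffle(pairs)

cut = int(0.7 * len(pairs))  # 70% for training
train, test = pairs[:cut], pairs[cut:]
```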
2-2. With reference to Fig. 4, the overall LSTM recurrent neural network model consists of one input layer, one output layer, and a stack of three hidden layers containing a large number of LSTM neurons; the input layer, hidden layers, and output layer are fully connected. With reference to Fig. 5, each LSTM neuron in the hidden layers consists of the following parts: an input gate (Input Gate), an output gate (Output Gate), a forget gate (Forget Gate), a memory cell (Memory), a mapping function y = Wx + b (where y is the neuron's output value, x is the neuron's input value, and W and b are the weight and bias matrices, respectively), and an activation function (Tanh).
During model training, the Adam optimizer minimizes the loss function by gradient descent; this is achieved by continuously updating, through forward and backward propagation, the parameter values in each neuron's mapping function and the values stored in its memory cell. The data in the training set are used to train the neural network in sequence, and the memory cell stores the intermediate information produced by the neuron in the previous training step; this intermediate information serves as "experience" data for the current and subsequent training steps. Whether the experience data produced by a neuron is stored in the memory cell is controlled by the input, output, and forget gates: when the input gate is open, the intermediate information of the current training step is written to the memory cell; when the output gate is open, intermediate data saved in previous training steps participates in the current training step; when the forget gate is open, the memory cell discards previously saved intermediate data. The advantage of the LSTM recurrent neural network is that experience data from previous training steps can be fused with the input data of the current step during training, improving the model's classification accuracy.
In the implementation example, the LSTM recurrent neural network deep learning model has one input layer, one output layer, and three hidden layers, where the first hidden layer contains 1024 LSTM neurons, the second hidden layer contains 512 LSTM neurons, and the third hidden layer contains 256 LSTM neurons.
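The gate mechanism described in 2-2 can be illustrated with a single LSTM-cell forward pass in NumPy; this is a standard textbook LSTM formulation offered as a sketch, not the patent's exact implementation, and the toy sizes are illustrative:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_cell(x, h_prev, c_prev, W, b):
    # Concatenate current input with the previous hidden state.
    z = np.concatenate([x, h_prev])
    i = sigmoid(W["i"] @ z + b["i"])  # input gate: write to the memory cell
    f = sigmoid(W["f"] @ z + b["f"])  # forget gate: clear old memory
    o = sigmoid(W["o"] @ z + b["o"])  # output gate: expose memory content
    g = np.tanh(W["g"] @ z + b["g"])  # candidate memory content
    c = f * c_prev + i * g            # updated memory cell
    h = o * np.tanh(c)                # hidden-state output
    return h, c

rng = np.random.default_rng(0)
n_in, n_hid = 8, 4  # toy sizes; the patent's layers use 1024/512/256 neurons
W = {k: rng.normal(scale=0.1, size=(n_hid, n_in + n_hid)) for k in "ifog"}
b = {k: np.zeros(n_hid) for k in "ifog"}
h, c = lstm_cell(rng.normal(size=n_in), np.zeros(n_hid), np.zeros(n_hid), W, b)
```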
2-3. The Adam gradient descent optimizer updates the parameters of the entire neural network model by optimizing the loss function Loss, which is defined as the cross entropy (Cross Entropy):

Loss = -Σ_x q(x) · log p(x)

where x ranges over the individual dimensions of the vectors, p(x) is each dimension of the output-layer output vector, and q(x) is the corresponding dimension of the true category vector.
In the embodiment of the present invention, the training set of 66,500 examples is used for supervised iterative training of the LSTM recurrent neural network deep learning model; each iteration selects 10,000 examples to update the neuron parameters, for 4,500 iterations in total.
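A sketch of the cross-entropy loss from 2-3, assuming the conventional form in which the true one-hot vector weights the log of the predicted distribution:

```python
import numpy as np

def cross_entropy(q, p, eps=1e-12):
    # Loss = -sum_x q(x) * log p(x), with q the true one-hot vector
    # and p the predicted distribution; eps guards against log(0).
    return float(-np.sum(q * np.log(p + eps)))

q = np.zeros(15)
q[3] = 1.0                  # true category has index 3
p = np.full(15, 1.0 / 15)   # a maximally uncertain prediction
loss = cross_entropy(q, p)  # equals log(15)
```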
Step 3. Append a SoftMax regression layer after the LSTM recurrent neural network deep learning model trained in the second step. The abstract website category feature vector output by the LSTM network serves as the input of the SoftMax layer, whose output is the category probability distribution vector of the website. The website category corresponding to the dimension with the largest probability value in this distribution vector is taken as the website's category, and the predicted categories are compared against the websites' actual categories to obtain the classification accuracy for internet websites. The specific steps are:
3-1. The output vector of the output layer of the LSTM recurrent neural network deep learning model is a 15-dimensional low-dimensional vector. Applying the SoftMax function to this vector maps every dimension to a value between 0 and 1 and makes all dimensions sum to 1. The SoftMax function is defined as follows, where P_j is the j-th dimension of the vector after SoftMax processing, O_j is the j-th dimension of the output vector, and O_k is the k-th dimension of the output vector:

P_j = e^(O_j) / Σ_k e^(O_k)

The final vector is then:
Output = SoftMax(OV_w)
where OV_w is the output vector of the deep learning model for website w.
3-2. Every dimension of Output lies between 0 and 1 and all dimensions sum to 1; each dimension represents one website category, and the category corresponding to the dimension with the largest value is the predicted category of the website. The classification accuracy is defined as the ratio of correct predictions to all predictions.
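The SoftMax normalization and accuracy definition of 3-1 and 3-2 can be sketched as follows, with a toy 3-dimensional vector standing in for the 15-dimensional output:

```python
import numpy as np

def softmax(o):
    # Subtracting the max is a standard numerical-stability trick.
    e = np.exp(o - np.max(o))
    return e / e.sum()

def accuracy(predicted, actual):
    # Share of predictions matching the true categories.
    return sum(p == t for p, t in zip(predicted, actual)) / len(actual)

out = softmax(np.array([0.1, 2.0, -1.0]))  # toy stand-in for the 15-dim output
pred = int(np.argmax(out))                 # predicted category index
```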
As shown in Fig. 6, in the embodiment of the present invention, the model parameters are updated through many iterations, and the optimal model finally achieves an automatic internet website classification accuracy of 87.9%.
In conclusion, the present invention classifies websites through an LSTM recurrent neural network deep learning model operating on high-dimensional vector representations of internet websites, improving the accuracy of internet website classification and providing the field of modern information retrieval with a more effective and practical website classification method.
The above examples are provided solely to describe the present invention and are not intended to limit its scope. The scope of the invention is defined by the following claims. All equivalent substitutions and modifications made without departing from the spirit and principles of the present invention shall fall within the scope of the present invention.
Claims (5)
1. An automatic internet website classification method based on deep learning, characterized by comprising the following steps:
First step: collect the original description information of a large number of internet websites from DNS server logs as the website dataset, preprocess it and label it manually, then extract the high-dimensional feature vector representation of each website for input to the deep learning model, add the corresponding category label to each website, and convert the label into a category vector;
Second step: use the high-dimensional feature vector obtained in the first step as the input of the deep learning model and the category vector as its output, and use the Adam gradient descent optimizer to supervise the training of an LSTM-based recurrent neural network deep learning model that extracts the website's abstract category feature vector;
Third step: append a SoftMax regression layer after the LSTM recurrent neural network deep learning model trained in the second step; the abstract website category feature vector output by the LSTM network serves as the input of the SoftMax layer, whose output is the category probability distribution vector of the website; the website category corresponding to the dimension with the largest probability value in this distribution vector is taken as the website's category.
2. The automatic internet website classification method based on deep learning according to claim 1, characterized in that: in the first step, the original description information of an internet website comprises the website's title, keywords, and descriptive text.
3. The automatic internet website classification method based on deep learning according to claim 1, characterized in that the first step is implemented as follows:
Step 1-1: preprocess the original description information of the internet websites, including removing invalid entries and removing stop words from the original description information; manually label the preprocessed websites and divide them into 15 categories: e-commerce, science and technology, travel and transportation, business and economics, social networks, lifestyle services, sports and fitness, healthcare, literature and art, news media, entertainment, politics and law, computer networks, education and training, and society and culture;
Step 1-2: apply word segmentation to all the preprocessed original website description information in the dataset, then arrange all segmentation results in order as input data to train a Word2Vec word embedding model; once trained, the model maps every word in the website dataset into a 1000-dimensional space; when the Word2Vec model is used, a word is given as input and the 1000-dimensional vector it outputs serves as that word's representation;
Step 1-3: for each website in the dataset, compute the TF-IDF value of every word in its segmented original description information; given the high-dimensional vector representation of each word and its TF-IDF value in the website, compute the high-dimensional feature vector representation of each website in the dataset with the following formula; since each word is represented by a 1000-dimensional high-dimensional vector, the website feature vector produced by the formula is also 1000-dimensional:

IV_w = Σ(i=1..N) TF_IDF(t_i, w) × Embedding_Vector(t_i)

where IV_w is the high-dimensional feature vector representation of website w, t_i is the i-th word in website w, N is the total number of words obtained after segmentation, TF_IDF(t_i, w) is the term frequency-inverse document frequency of the word in the website data, and Embedding_Vector(t_i) is the word's high-dimensional vector representation;
Step 1-4: for each website in the dataset, generate its corresponding output category vector OV_w; the dimension of this vector equals the total number of categories; internet websites are divided into 15 categories, so all output category vectors are 15-dimensional, with the dimension corresponding to the website's true category set to 1 and all other dimensions set to 0.
4. The automatic internet website classification method based on deep learning according to claim 1, characterized in that the second step is implemented as follows:
2-1: randomly select 70% of the internet website dataset as training data for the supervised training of the LSTM recurrent neural network deep learning model, with the remaining 30% as test data for evaluating the deep learning model; each training example consists of two parts, an input vector and an output vector, where the input vector is the 1000-dimensional high-dimensional website feature vector and the output vector is the 15-dimensional low-dimensional website category vector;
2-2: the overall LSTM recurrent neural network model consists of one input layer, one output layer, and a stack of three hidden layers containing a large number of LSTM neurons, with the input layer, hidden layers, and output layer fully connected; each LSTM neuron in the hidden layers consists of the following parts: an input gate (Input Gate), an output gate (Output Gate), a forget gate (Forget Gate), a memory cell (Memory), a mapping function y = Wx + b, where y is the neuron's output value, x is the neuron's input value, and W and b are the weight and bias matrices, respectively, and an activation function (Tanh); during model training, the Adam optimizer minimizes the loss function by gradient descent, achieved by continuously updating, through forward and backward propagation, the parameter values in each neuron's mapping function and the values stored in its memory cell; the data in the training set are used to train the neural network in sequence, and the memory cell stores the intermediate information produced by the neuron in the previous training step, which serves as "experience" data for the current and subsequent training steps; whether the experience data produced by a neuron is stored in the memory cell is controlled by the input, output, and forget gates: when the input gate is open, the intermediate information of the current training step is written to the memory cell; when the output gate is open, intermediate data saved in previous training steps participates in the current training step; when the forget gate is open, the memory cell discards previously saved intermediate data; the advantage of the LSTM recurrent neural network is that experience data from previous training steps can be fused with the input data of the current step during training, improving the model's classification accuracy;
2-3: the Adam gradient descent optimizer updates the parameters of the entire neural network model by optimizing the loss function, defined as the cross entropy (Cross Entropy):

Loss = -Σ_x q(x) · log p(x)

where p(x) is each dimension of the output-layer output vector and q(x) is the corresponding dimension of the true category vector.
5. The automatic internet website classification method based on deep learning according to claim 1, characterized in that the third step is implemented as follows:
3-1: the output vector of the output layer of the LSTM recurrent neural network deep learning model is a 15-dimensional low-dimensional vector; applying the SoftMax function to this vector maps every dimension to a value between 0 and 1 and makes all dimensions sum to 1; the SoftMax function is defined as follows, where P_j is the j-th dimension of the vector after SoftMax processing, O_j is the j-th dimension of the output vector, and O_k is the k-th dimension of the output vector:

P_j = e^(O_j) / Σ_k e^(O_k)

the final vector is then:
Output = SoftMax(OV_w)
3-2: every dimension of Output lies between 0 and 1 and all dimensions sum to 1; each dimension represents one website category, and the category corresponding to the dimension with the largest value is the predicted category of the website; the classification accuracy is defined as the ratio of correct predictions to all predictions.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810181807.8A CN108364028A (en) | 2018-03-06 | 2018-03-06 | A kind of internet site automatic classification method based on deep learning |
Publications (1)
Publication Number | Publication Date |
---|---|
CN108364028A true CN108364028A (en) | 2018-08-03 |
Family
ID=63003483
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108364028A (en) |
Cited By (31)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109360052A (en) * | 2018-09-27 | 2019-02-19 | 北京亚联之星信息技术有限公司 | A kind of data classification based on machine learning algorithm, data processing method and equipment |
CN109408947A (en) * | 2018-10-19 | 2019-03-01 | 杭州刀豆网络科技有限公司 | A kind of infringement webpage judgment method based on machine learning |
CN109460473A (en) * | 2018-11-21 | 2019-03-12 | 中南大学 | The electronic health record multi-tag classification method with character representation is extracted based on symptom |
CN109710838A (en) * | 2018-12-05 | 2019-05-03 | 厦门笨鸟电子商务有限公司 | A kind of company's site's keyword extracting method based on deep neural network |
CN109784376A (en) * | 2018-12-26 | 2019-05-21 | 中国科学院长春光学精密机械与物理研究所 | A kind of Classifying Method in Remote Sensing Image and categorizing system |
CN109886408A (en) * | 2019-02-28 | 2019-06-14 | 北京百度网讯科技有限公司 | A kind of deep learning method and device |
CN109948665A (en) * | 2019-02-28 | 2019-06-28 | 中国地质大学(武汉) | Physical activity genre classification methods and system based on long Memory Neural Networks in short-term |
CN110020024A (en) * | 2019-03-15 | 2019-07-16 | 叶宇铭 | Classification method, system, the equipment of link resources in a kind of scientific and technical literature |
CN110298391A (en) * | 2019-06-12 | 2019-10-01 | 同济大学 | A kind of iterative increment dialogue intention classification recognition methods based on small sample |
CN110321555A (en) * | 2019-06-11 | 2019-10-11 | 国网江苏省电力有限公司南京供电分公司 | A kind of power network signal classification method based on Recognition with Recurrent Neural Network model |
CN110472045A (en) * | 2019-07-11 | 2019-11-19 | 中山大学 | A kind of short text falseness Question Classification prediction technique and device based on document insertion |
CN110472885A (en) * | 2019-08-22 | 2019-11-19 | 华南师范大学 | A kind of website assessment system and its working method |
CN110516210A (en) * | 2019-08-22 | 2019-11-29 | 北京影谱科技股份有限公司 | The calculation method and device of text similarity |
CN110609898A (en) * | 2019-08-19 | 2019-12-24 | 中国科学院重庆绿色智能技术研究院 | Self-classification method for unbalanced text data |
CN110765393A (en) * | 2019-09-17 | 2020-02-07 | 微梦创科网络科技(中国)有限公司 | Method and device for identifying harmful URL (uniform resource locator) based on vectorization and logistic regression |
CN110855635A (en) * | 2019-10-25 | 2020-02-28 | 新华三信息安全技术有限公司 | URL (Uniform resource locator) identification method and device and data processing equipment |
CN111078646A (en) * | 2019-12-30 | 2020-04-28 | 弭迺彬 | Method and system for grouping software based on running data of Internet equipment |
CN111125563A (en) * | 2018-10-31 | 2020-05-08 | 安碁资讯股份有限公司 | Method for evaluating domain name and server thereof |
CN111159416A (en) * | 2020-04-02 | 2020-05-15 | 腾讯科技(深圳)有限公司 | Language task model training method and device, electronic equipment and storage medium |
CN111782802A (en) * | 2020-05-15 | 2020-10-16 | 北京极兆技术有限公司 | Method and system for obtaining national economy manufacturing industry corresponding to commodity based on machine learning |
CN111895986A (en) * | 2020-06-30 | 2020-11-06 | 西安建筑科技大学 | MEMS gyroscope original output signal noise reduction method based on LSTM neural network |
CN112287274A (en) * | 2020-10-27 | 2021-01-29 | 中国科学院计算技术研究所 | Method, system and storage medium for classifying website list pages |
CN112329524A (en) * | 2020-09-25 | 2021-02-05 | 泰山学院 | Signal classification and identification method, system and equipment based on deep time sequence neural network |
CN112559099A (en) * | 2020-12-04 | 2021-03-26 | 北京新能源汽车技术创新中心有限公司 | Remote image display method, device and system based on user behavior and storage medium |
CN112690823A (en) * | 2020-12-22 | 2021-04-23 | 海南力维科贸有限公司 | Method and system for identifying physiological sounds of lungs |
CN112800229A (en) * | 2021-02-05 | 2021-05-14 | 昆明理工大学 | Knowledge graph embedding-based semi-supervised aspect-level emotion analysis method for case-involved field |
CN113168890A (en) * | 2018-12-10 | 2021-07-23 | 生命科技股份有限公司 | Deep base recognizer for Sanger sequencing |
CN113392323A (en) * | 2021-06-15 | 2021-09-14 | 电子科技大学 | Business role prediction method based on multi-source data joint learning |
CN113673620A (en) * | 2021-08-27 | 2021-11-19 | 工银科技有限公司 | Method, system, device, medium and program product for model generation |
CN116258579A (en) * | 2023-04-28 | 2023-06-13 | 成都新希望金融信息有限公司 | Training method of user credit scoring model and user credit scoring method |
TWI827984B (en) * | 2021-10-05 | 2024-01-01 | 台灣大哥大股份有限公司 | System and method for website classification |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106250402A (en) * | 2016-07-19 | 2016-12-21 | 杭州华三通信技术有限公司 | Website classification method and device |
CN106599933A (en) * | 2016-12-26 | 2017-04-26 | 哈尔滨工业大学 | Text sentiment classification method based on a joint deep learning model |
CN106779073A (en) * | 2016-12-27 | 2017-05-31 | 西安石油大学 | Media information classification method and device based on deep neural networks |
CN107066446A (en) * | 2017-04-13 | 2017-08-18 | 广东工业大学 | Recurrent neural network text sentiment analysis method with embedded logic rules |
CN107066583A (en) * | 2017-04-14 | 2017-08-18 | 华侨大学 | Image-text cross-modal sentiment classification method based on compact bilinear fusion |
US20170357945A1 (en) * | 2016-06-14 | 2017-12-14 | Recruiter.AI, Inc. | Automated matching of job candidates and job listings for recruitment |
CN107480141A (en) * | 2017-08-29 | 2017-12-15 | 南京大学 | Software defect assisted assignment method based on text and developer activity |
- 2018-03-06: Application CN201810181807.8A filed in China; published as CN108364028A; status Pending
Non-Patent Citations (2)
Title |
---|
Chen Qianxi et al.: "Research on Webpage Classification Algorithms Based on Deep Learning", Microcomputer Applications * |
Huang Lei et al.: "Research on Text Classification Based on Recursive Neural Networks", Journal of Beijing University of Chemical Technology (Natural Science Edition) * |
Cited By (47)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109360052A (en) * | 2018-09-27 | 2019-02-19 | 北京亚联之星信息技术有限公司 | Data classification and data processing method and device based on machine learning algorithms |
CN109408947A (en) * | 2018-10-19 | 2019-03-01 | 杭州刀豆网络科技有限公司 | Machine-learning-based method for judging infringing webpages |
CN111125563A (en) * | 2018-10-31 | 2020-05-08 | 安碁资讯股份有限公司 | Method for evaluating domain name and server thereof |
CN109460473A (en) * | 2018-11-21 | 2019-03-12 | 中南大学 | Electronic health record multi-label classification method based on symptom extraction and feature representation |
CN109710838A (en) * | 2018-12-05 | 2019-05-03 | 厦门笨鸟电子商务有限公司 | Company website keyword extraction method based on deep neural network |
CN109710838B (en) * | 2018-12-05 | 2021-02-26 | 厦门笨鸟电子商务有限公司 | Company website keyword extraction method based on deep neural network |
CN113168890A (en) * | 2018-12-10 | 2021-07-23 | 生命科技股份有限公司 | Deep base recognizer for Sanger sequencing |
CN113168890B (en) * | 2018-12-10 | 2024-05-24 | 生命科技股份有限公司 | Deep base identifier for Sanger sequencing |
CN109784376A (en) * | 2018-12-26 | 2019-05-21 | 中国科学院长春光学精密机械与物理研究所 | Remote sensing image classification method and classification system |
CN109948665B (en) * | 2019-02-28 | 2020-11-27 | 中国地质大学(武汉) | Human activity type classification method and system based on long-time and short-time memory neural network |
CN109948665A (en) * | 2019-02-28 | 2019-06-28 | 中国地质大学(武汉) | Human activity type classification method and system based on long short-term memory neural network |
CN109886408A (en) * | 2019-02-28 | 2019-06-14 | 北京百度网讯科技有限公司 | Deep learning method and device |
CN110020024A (en) * | 2019-03-15 | 2019-07-16 | 叶宇铭 | Method, system and device for classifying link resources in scientific and technical literature |
CN110020024B (en) * | 2019-03-15 | 2021-07-30 | 中国人民解放军军事科学院军事科学信息研究中心 | Method, system and equipment for classifying link resources in scientific and technological literature |
CN110321555A (en) * | 2019-06-11 | 2019-10-11 | 国网江苏省电力有限公司南京供电分公司 | Power grid signal classification method based on recurrent neural network model |
CN110298391A (en) * | 2019-06-12 | 2019-10-01 | 同济大学 | Iterative incremental dialogue intent classification recognition method based on few-shot learning |
CN110472045A (en) * | 2019-07-11 | 2019-11-19 | 中山大学 | Short text false question classification prediction method and device based on document embedding |
CN110472045B (en) * | 2019-07-11 | 2023-02-03 | 中山大学 | Short text false problem classification prediction method and device based on document embedding |
CN110609898B (en) * | 2019-08-19 | 2023-05-05 | 中国科学院重庆绿色智能技术研究院 | Self-classifying method for unbalanced text data |
CN110609898A (en) * | 2019-08-19 | 2019-12-24 | 中国科学院重庆绿色智能技术研究院 | Self-classification method for unbalanced text data |
CN110472885A (en) * | 2019-08-22 | 2019-11-19 | 华南师范大学 | Website assessment system and working method thereof |
CN110516210A (en) * | 2019-08-22 | 2019-11-29 | 北京影谱科技股份有限公司 | Text similarity calculation method and device |
CN110516210B (en) * | 2019-08-22 | 2023-06-27 | 北京影谱科技股份有限公司 | Text similarity calculation method and device |
CN110765393A (en) * | 2019-09-17 | 2020-02-07 | 微梦创科网络科技(中国)有限公司 | Method and device for identifying harmful URL (uniform resource locator) based on vectorization and logistic regression |
CN110855635A (en) * | 2019-10-25 | 2020-02-28 | 新华三信息安全技术有限公司 | URL (Uniform resource locator) identification method and device and data processing equipment |
CN110855635B (en) * | 2019-10-25 | 2022-02-11 | 新华三信息安全技术有限公司 | URL (Uniform resource locator) identification method and device and data processing equipment |
CN111078646B (en) * | 2019-12-30 | 2023-12-05 | 山东蝶飞信息技术有限公司 | Method and system for grouping software based on operation data of Internet equipment |
CN111078646A (en) * | 2019-12-30 | 2020-04-28 | 弭迺彬 | Method and system for grouping software based on running data of Internet equipment |
CN111159416B (en) * | 2020-04-02 | 2020-07-17 | 腾讯科技(深圳)有限公司 | Language task model training method and device, electronic equipment and storage medium |
CN111159416A (en) * | 2020-04-02 | 2020-05-15 | 腾讯科技(深圳)有限公司 | Language task model training method and device, electronic equipment and storage medium |
CN111782802B (en) * | 2020-05-15 | 2023-11-24 | 北京极兆技术有限公司 | Method and system for obtaining the national economy manufacturing category corresponding to a commodity based on machine learning |
CN111782802A (en) * | 2020-05-15 | 2020-10-16 | 北京极兆技术有限公司 | Method and system for obtaining the national economy manufacturing category corresponding to a commodity based on machine learning |
CN111895986A (en) * | 2020-06-30 | 2020-11-06 | 西安建筑科技大学 | MEMS gyroscope original output signal noise reduction method based on LSTM neural network |
CN112329524A (en) * | 2020-09-25 | 2021-02-05 | 泰山学院 | Signal classification and identification method, system and equipment based on deep time sequence neural network |
CN112287274B (en) * | 2020-10-27 | 2022-10-18 | 中国科学院计算技术研究所 | Method, system and storage medium for classifying website list pages |
CN112287274A (en) * | 2020-10-27 | 2021-01-29 | 中国科学院计算技术研究所 | Method, system and storage medium for classifying website list pages |
CN112559099B (en) * | 2020-12-04 | 2024-02-27 | 北京国家新能源汽车技术创新中心有限公司 | Remote image display method, device and system based on user behaviors and storage medium |
CN112559099A (en) * | 2020-12-04 | 2021-03-26 | 北京新能源汽车技术创新中心有限公司 | Remote image display method, device and system based on user behavior and storage medium |
CN112690823A (en) * | 2020-12-22 | 2021-04-23 | 海南力维科贸有限公司 | Method and system for identifying physiological sounds of lungs |
CN112800229B (en) * | 2021-02-05 | 2022-12-20 | 昆明理工大学 | Knowledge graph embedding-based semi-supervised aspect-level emotion analysis method for case-involved field |
CN112800229A (en) * | 2021-02-05 | 2021-05-14 | 昆明理工大学 | Knowledge graph embedding-based semi-supervised aspect-level emotion analysis method for case-involved field |
CN113392323B (en) * | 2021-06-15 | 2022-04-19 | 电子科技大学 | Business role prediction method based on multi-source data joint learning |
CN113392323A (en) * | 2021-06-15 | 2021-09-14 | 电子科技大学 | Business role prediction method based on multi-source data joint learning |
CN113673620A (en) * | 2021-08-27 | 2021-11-19 | 工银科技有限公司 | Method, system, device, medium and program product for model generation |
TWI827984B (en) * | 2021-10-05 | 2024-01-01 | 台灣大哥大股份有限公司 | System and method for website classification |
CN116258579A (en) * | 2023-04-28 | 2023-06-13 | 成都新希望金融信息有限公司 | Training method of user credit scoring model and user credit scoring method |
CN116258579B (en) * | 2023-04-28 | 2023-08-04 | 成都新希望金融信息有限公司 | Training method of user credit scoring model and user credit scoring method |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108364028A (en) | Internet website automatic classification method based on deep learning | |
Fan et al. | Metapath-guided heterogeneous graph neural network for intent recommendation | |
CN109492157B (en) | News recommendation method and theme characterization method based on RNN and attention mechanism | |
CN104933164B (en) | Method and system for extracting relationships between named entities in massive internet data | |
CN105893609A (en) | Mobile APP recommendation method based on weighted mixing | |
Fang et al. | Folksonomy-based visual ontology construction and its applications | |
CN107291902A (en) | Automatic annotation method for reviewing crowd-sourced contributions based on hybrid classification technology | |
Yanmei et al. | Research on Chinese micro-blog sentiment analysis based on deep learning | |
CN113806630B (en) | Attention-based multi-view feature fusion cross-domain recommendation method and device | |
CN110377605A (en) | Sensitive attribute identification and classification grading method for structured data | |
Cong | Personalized recommendation of film and television culture based on an intelligent classification algorithm | |
Ji et al. | Attention based meta path fusion for heterogeneous information network embedding | |
CN116186372A (en) | Bibliographic system capable of providing personalized service | |
Bai et al. | A rumor detection model incorporating propagation path contextual semantics and user information | |
Paraschiv et al. | A unified graph-based approach to disinformation detection using contextual and semantic relations | |
Zhang et al. | Reinforced adaptive knowledge learning for multimodal fake news detection | |
Zhu | A book recommendation algorithm based on collaborative filtering | |
Wang | Knowledge graph analysis of internal control field in colleges | |
CN109597946A (en) | Intelligent detection method for harmful webpages based on deep belief network algorithm | |
CN114722304A (en) | Community search method based on theme on heterogeneous information network | |
Bao et al. | Hot news prediction method based on natural language processing technology and its application | |
Gao et al. | Statistics and Analysis of Targeted Poverty Alleviation Information Integrated with Big Data Mining Algorithm | |
Zeng et al. | Model-Stacking-based network user portrait from multi-source campus data | |
Liu | Construction of personalized recommendation system of university library based on SOM neural network | |
Wang et al. | An api recommendation method based on beneficial interaction |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
WD01 | Invention patent application deemed withdrawn after publication | Application publication date: 20180803 |