CN115130601A - Two-stage academic data webpage classification method and system based on multi-dimensional feature fusion - Google Patents
- Publication number
- CN115130601A (application number CN202210795308.4A)
- Authority
- CN
- China
- Prior art keywords
- webpage
- data
- word
- text
- stage
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
- G06F40/216—Parsing using statistical methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/284—Lexical analysis, e.g. tokenisation or collocates
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Life Sciences & Earth Sciences (AREA)
- Biomedical Technology (AREA)
- Biophysics (AREA)
- Data Mining & Analysis (AREA)
- Evolutionary Computation (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Probability & Statistics with Applications (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention provides a two-stage academic data webpage classification method and system based on multi-dimensional feature fusion, relating to the technical field of webpage classification and comprising the following steps: step S1: searching a search engine with academic keywords and acquiring the search page content; step S2: performing the first-stage classification with a short-text logistic regression model; step S3: after the first-stage classification is finished, acquiring the HTML information of the webpages labeled as data webpages; step S4: performing the second-stage classification based on the webpage long text and website information, using a text convolutional neural network combined with a triplet-loss deep metric learning algorithm; step S5: warehousing and sorting the final classification results, parsing necessary information from the webpages, and displaying it on a data portal website. The method and system can quickly and accurately screen data webpages from the Internet.
Description
Technical Field
The invention relates to the technical field of webpage classification, and in particular to a two-stage academic data webpage classification method and system based on multi-dimensional feature fusion.
Background
In recent years, webpage classification has become an important research topic in the field of data mining, especially in interdisciplinary research, where it can help experts find discipline data webpages with research value on the Internet. However, webpages are numerous, cover diverse topics, and have complex structures, so quickly and accurately screening subject data webpages from the Internet remains a great challenge.
The traditional way to classify webpages is manual classification, but judgment standards differ from person to person and are hard to unify, the cost is high, and the efficiency is low. With the gradual improvement of artificial intelligence techniques and of computer computing power, machine learning methods have gradually replaced manual work and opened a brand-new path in the field of webpage classification. However, as webpage structures grow more complicated, feature extraction for machine learning becomes difficult and high accuracy is hard to achieve; deep learning algorithms, by contrast, can integrate feature engineering directly into the model fitting process, mine deep features from plain text, and perform better in the field of webpage classification.
Although the main content of a webpage is text, mainstream webpage classification methods filter only by the text information of the webpage; yet besides the content information, the web address and the webpage structure information, including the DOM (Document Object Model) structure and the HTML structure of the page, are also important and can serve as feature input. Meanwhile, metric learning, a family of methods that learn a distance between data objects from the data itself, has developed rapidly in machine learning in recent years, and metric learning centered on the similarity between samples is also a research hotspot in webpage classification. Therefore, it is meaningful and necessary to build a system that helps experts and scholars quickly acquire related subject data webpages, and deep-learning-based webpage classification using a metric learning algorithm and multi-modal information fusion is feasible and has important research value.
Disclosure of Invention
Aiming at the defects in the prior art, the invention provides a two-stage academic data webpage classification method and system based on multi-dimensional feature fusion.
According to the two-stage academic data webpage classification method and system based on multi-dimensional feature fusion, the scheme is as follows:
in a first aspect, a two-stage academic data webpage classification method based on multi-dimensional feature fusion is provided, and the method includes:
step S1: inputting a search engine for searching based on academic keywords to obtain search page content;
step S2: developing a first-stage classification based on a short text logistic regression model;
step S3: acquiring webpage HTML information labeled as a data webpage after the classification of the first stage is finished;
step S4: developing second-stage classification based on the webpage long text and website information, and adopting a text convolution neural network combined with a triple loss depth measurement learning algorithm;
step S5: warehousing and sorting the final classification results, parsing necessary information from the webpages, and displaying it on the data portal website.
Preferably, the step S1 includes:
step S101: cleaning and expanding the keywords; the system receives the academic keywords corresponding to the data webpages to be searched; if a number of keywords of a subject are provided at one time, preliminary cleaning and expansion tasks must first be completed, such as screening out duplicates, splitting overly long keywords, and adding missing Chinese-English counterparts;
step S102: if the keyword is Chinese, inputting retrieval information related to the keyword in a search engine, and acquiring returned webpage content.
Preferably, the step S2 includes:
step S201: acquiring training data, extracting descriptive texts and title field information of webpages to be classified from Google retrieval content crawling tables of a database, and splicing the descriptive texts and the title field information into short texts;
step S202: text preprocessing, namely performing word segmentation on Chinese and English texts and removing stop words;
step S203: vectorizing the text with the term frequency-inverse document frequency (TF-IDF) method; the normalized term frequency tf_{i,j} of term i in document j is calculated as:

tf_{i,j} = n_{i,j} / Σ_k n_{k,j}

where tf_{i,j} is the normalized term frequency; i is the term index; j is the document index; n_{i,j} is the number of occurrences of term i in document j, and it is divided by Σ_k n_{k,j}, the total number of terms in document j, as normalization. The inverse document frequency is obtained by dividing the total number of documents by the number of documents containing the term and taking the logarithm of the quotient:

idf_i = log( |T| / |{ j : w_i ∈ t_j }| )

where idf_i is the inverse document frequency of the term w_i; |T| is the total number of documents in the document set; |{ j : w_i ∈ t_j }| is the number of documents in which w_i appears (i.e., n_{i,j} ≠ 0), t_j being a document containing w_i. The final TF-IDF value is the product of the two quantities TF and IDF:

TF-IDF = TF × IDF

where TF is the term frequency and IDF the inverse document frequency.
step S204: training a logistic regression model with the vectorized data; define W as the set of webpages to be classified, W = (t_1, w_1; t_2, w_2; ...; t_N, w_N), where t_i (i = 1, 2, ..., N) is a feature term and w_i is the weight of t_i; D denotes a data webpage and N a non-data webpage. The probability that a webpage is a data webpage is judged from the webpage features:

P(D|y) = 1 / (1 + e^(−θ·x))

where x is the vector composed of all the features of the webpage, θ is the weight vector corresponding to the webpage feature vector, and y is the webpage to be judged; similarly, the probability that a webpage is a non-data webpage is:

P(N|y) = 1 − P(D|y)

After the predicted probability is obtained, a threshold is defined to judge whether the webpage belongs to the data webpages; with the threshold set to 0.5, the webpage classification function is:

f(y) = D if P(D|y) ≥ 0.5, and f(y) = N otherwise

step S205: optimizing the logistic regression model parameters by regularization; regularization avoids model overfitting by adding a penalty term to the original loss function, implemented by appending a multiple of a norm of the parameter vector to the loss function. Adding an L2 regularization term yields the new loss function L(θ):

L(θ) = −(1/m) Σ_{i=1..m} [ y_i log h_θ(x_i) + (1 − y_i) log(1 − h_θ(x_i)) ] + (λ / 2m) Σ_j θ_j²

where L(θ) is the regularized loss of the model parameters θ; m is the total number of webpages to be judged; y_i is the label of the i-th webpage; Σ_j θ_j² is the sum of the squared parameters to be penalized; and h_θ(x_i) is the probability, computed with the selected parameters, that the output variable is 1 for the i-th feature vector x_i.

Logistic regression and parameter tuning are implemented with the scikit-learn library, and the most appropriate regularization coefficient λ is selected by grid search and k-fold cross-validation.
Preferably, the object of content crawling in step S3 is the webpages that need to be further subdivided after the first-stage classification; each webpage is clicked into to obtain its specific content, and the HTML information is refined into long text information.
Preferably, the step S4 includes:
step S401: acquiring training data; the input data has two dimensions: the content long text and website link of the webpage of the first stage classification result;
step S402: text preprocessing, including word segmentation and stop-word filtering; the long text follows the same preprocessing method as the first stage, while the website features divide each link, according to its structural characteristics, into protocol, host name, path, file name and parameter parts so as to highlight the sequential relation among the different parts; important structural symbols are kept when filtering stop words, and overlong garbled symbols with no practical meaning are deleted;
step S403: fixing the text length and the website link length corresponding to each webpage; a fixed text length and a fixed website link length are taken: sequences longer than the fixed length are truncated and shorter ones are padded with <pad>;
step S404: number of long textsTraining respective word embedding models according to the website link data; the model selects continuous jumping element grammar, and the characteristic length of each vocabulary is 300; continuous jumping element grammar model method using central word C (w) To predict the probability of occurrence of a contextual word, i.e., P (C) (i) |c (w) ) Wherein: i is more than or equal to w-k and less than or equal to w + k, and w is not equal to k; in the formula, C (i) Words representing the surroundings of the central word; c (w) Represents a central word;
the calculation amount is reduced, the attention degree to the rarely-used words is reduced, and the dimension of the vocabulary is reduced: reserving part of high-frequency vocabulary, uniformly replacing the low-frequency vocabulary with < unk >, utilizing a trained word embedding model, respectively carrying out vectorization operation on long text and website link data, and obtaining a vocabulary matrix as the input of a text convolution neural network model;
step S405: the vocabulary matrices of the long text and of the website links each pass through the text convolutional neural network model and are spliced together in a fully connected layer, which finally outputs a feature vector of fixed length;
step S406: training the weight parameters of the text convolutional neural network with the triplet loss to obtain the final model; the triplet loss function is:

L = max( d(X_a, X_b) − d(X_a, X_n) + m, 0 )

where X_a, X_b and X_n are the anchor sample, the positive sample and the negative sample respectively; m is the margin; d(X_a, X_b) is the Euclidean distance between the anchor sample and the positive sample; and d(X_a, X_n) is the Euclidean distance between the anchor sample and the negative sample;
step S407: when new data enter the model, a feature vector is obtained through the trained text convolutional neural network model and projected into the feature space of the training data; the category of the webpage is judged with a clustering algorithm, and the final prediction result is output.
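The claim does not spell out the clustering-based decision of step S407; a minimal stand-in, assuming a nearest-centroid rule over the learned feature space (all function names here are hypothetical, not from the patent):

```python
import math

def centroid(vectors):
    """Component-wise mean of a list of equal-length feature vectors."""
    n = len(vectors)
    return [sum(v[d] for v in vectors) / n for d in range(len(vectors[0]))]

def predict(feature, class_vectors):
    """Assign `feature` to the class whose training centroid is nearest.

    class_vectors maps a class label to the list of training feature
    vectors of that class (an illustrative stand-in for the clustering
    step; the patent does not fix a specific algorithm).
    """
    def dist(u, v):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))
    return min(class_vectors,
               key=lambda label: dist(feature, centroid(class_vectors[label])))
```

A new webpage's feature vector, produced by the trained network, would simply be passed to `predict` together with the projected training vectors.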
Preferably, after the final classification result is obtained in step S5, the classification result is put into the database for storage in a dedicated table and displayed on the data portal website; during the second crawl of webpage content, besides the text and webpage structure information, the language information, the IP location information and a screenshot of each webpage are also acquired.
In a second aspect, a two-stage academic data webpage classification system based on multi-dimensional feature fusion is provided, and the system includes:
module M1: inputting a search engine for searching based on academic keywords to obtain search page content;
module M2: developing a first-stage classification based on a short text logistic regression model;
module M3: acquiring webpage HTML information labeled as a data webpage after the first-stage classification is finished;
module M4: developing second-stage classification based on the webpage long text and website information, and adopting a text convolution neural network combined with a triple loss depth measurement learning algorithm;
module M5: warehousing and sorting the final classification results, parsing necessary information from the webpages, and displaying it on the data portal website.
Preferably, said module M1 comprises:
module M101: cleaning and expanding the keywords; the system receives the academic keywords corresponding to the data webpages to be searched; if a number of keywords of a subject are provided at one time, preliminary cleaning and expansion tasks must first be completed, such as screening out duplicates, splitting overly long keywords, and adding missing Chinese-English counterparts;
the module M102: if the keyword is Chinese, inputting retrieval information related to the keyword in a search engine, and acquiring returned webpage content.
Preferably, said module M2 comprises:
module M201: acquiring training data, extracting descriptive texts and title field information of webpages to be classified from Google retrieval content crawling tables of a database, and splicing the descriptive texts and the title field information into short texts;
a module M202: text preprocessing, namely performing word segmentation on Chinese and English texts and removing stop words;
module M203: vectorizing the text with the term frequency-inverse document frequency (TF-IDF) method; the normalized term frequency tf_{i,j} of term i in document j is calculated as:

tf_{i,j} = n_{i,j} / Σ_k n_{k,j}

where tf_{i,j} is the normalized term frequency; i is the term index; j is the document index; n_{i,j} is the number of occurrences of term i in document j, and it is divided by Σ_k n_{k,j}, the total number of terms in document j, as normalization. The inverse document frequency is obtained by dividing the total number of documents by the number of documents containing the term and taking the logarithm of the quotient:

idf_i = log( |T| / |{ j : w_i ∈ t_j }| )

where idf_i is the inverse document frequency of the term w_i; |T| is the total number of documents in the document set; |{ j : w_i ∈ t_j }| is the number of documents in which w_i appears (i.e., n_{i,j} ≠ 0), t_j being a document containing w_i. The final TF-IDF value is the product of the two quantities TF and IDF:

TF-IDF = TF × IDF

where TF is the term frequency and IDF the inverse document frequency.
module M204: training a logistic regression model with the vectorized data; define W as the set of webpages to be classified, W = (t_1, w_1; t_2, w_2; ...; t_N, w_N), where t_i (i = 1, 2, ..., N) is a feature term and w_i is the weight of t_i; D denotes a data webpage and N a non-data webpage. The probability that a webpage is a data webpage is judged from the webpage features:

P(D|y) = 1 / (1 + e^(−θ·x))

where x is the vector composed of all the features of the webpage, θ is the weight vector corresponding to the webpage feature vector, and y is the webpage to be judged; similarly, the probability that a webpage is a non-data webpage is:

P(N|y) = 1 − P(D|y)

After the predicted probability is obtained, a threshold is defined to judge whether the webpage belongs to the data webpages; with the threshold set to 0.5, the webpage classification function is:

f(y) = D if P(D|y) ≥ 0.5, and f(y) = N otherwise

module M205: optimizing the logistic regression model parameters by regularization; regularization avoids model overfitting by adding a penalty term to the original loss function, implemented by appending a multiple of a norm of the parameter vector to the loss function. Adding an L2 regularization term yields the new loss function L(θ):

L(θ) = −(1/m) Σ_{i=1..m} [ y_i log h_θ(x_i) + (1 − y_i) log(1 − h_θ(x_i)) ] + (λ / 2m) Σ_j θ_j²

where L(θ) is the regularized loss of the model parameters θ; m is the total number of webpages to be judged; y_i is the label of the i-th webpage; Σ_j θ_j² is the sum of the squared parameters to be penalized; and h_θ(x_i) is the probability, computed with the selected parameters, that the output variable is 1 for the i-th feature vector x_i.

Logistic regression and parameter tuning are implemented with the scikit-learn library, and the most appropriate regularization coefficient λ is selected by grid search and k-fold cross-validation.
Preferably, the object of content crawling in module M3 is the webpages that need to be further subdivided after the first-stage classification; each webpage is clicked into to obtain its specific content, and the HTML information is refined into long text information.
Compared with the prior art, the invention has the following beneficial effects:
1. by fusing the webpage descriptive short text, the webpage content long text and the website-link features as multi-dimensional input, the error caused by a single input is reduced and the precision of the webpage classification model is improved;
2. by applying a metric-learning loss function, homogeneous samples are gathered together and heterogeneous samples are separated, which improves the interpretability and the generalization performance of the model.
Drawings
Other features, objects and advantages of the invention will become more apparent upon reading of the detailed description of non-limiting embodiments with reference to the following drawings:
FIG. 1 is a flow chart of the method of the present invention;
FIG. 2 is a flowchart of the first stage machine learning web page classification algorithm of the present invention.
FIG. 3 is a flowchart of a second stage depth metric learning web page classification algorithm of the present invention.
Detailed Description
The present invention will be described in detail with reference to specific examples. The following examples will assist those skilled in the art in further understanding the invention, but are not intended to limit the invention in any way. It should be noted that various changes and modifications, obvious to those skilled in the art, can be made without departing from the spirit of the invention; all of them fall within the scope of the invention.
The embodiment of the invention provides a two-stage academic data webpage classification method based on multi-dimensional feature fusion, which focuses on the problem of discovering academic data webpages on the open network and studies it with a combination of machine learning and deep learning. In the first stage, a logistic regression algorithm is adopted: a model is built with the webpage descriptive short texts as input, a rapid preliminary classification is carried out, and a large number of non-data webpages are screened out. In the second stage, a deep metric learning algorithm combining a text convolutional neural network with the triplet loss is adopted: webpage content long texts fused with website information serve as input, similar samples are aggregated in the feature space and heterogeneous samples are separated, improving the overall classification precision. Referring to fig. 1, the method specifically includes:
step S1: based on academic keywords, inputting a search engine (such as Google, Baidu and the like) for searching, and acquiring the content of a search page.
Specifically, the steps include:
step S101: cleaning and expanding the keywords; the system receives the academic keywords corresponding to the data webpages to be searched, and the keywords may be Chinese or English. If a number of keywords of a subject are provided at one time, preliminary cleaning and expansion tasks need to be completed first, such as screening out duplicates, splitting overly long keywords, and adding missing Chinese-English counterparts.
Step S102: if the keyword is Chinese, retrieval information related to the keyword is entered in a search engine; for example, the Chinese equivalent of "keyword + database OR dataset" is entered in Google search, and "keyword + database OR dataset" is entered for English keywords, with exact search adopted. The returned webpage content is then acquired, the key point being to obtain the title and the descriptive short text of each webpage entry.
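The keyword cleaning of step S101 and the query template of step S102 can be sketched in a few lines (a minimal sketch; the helper names and the exact deduplication heuristic are assumptions, not from the patent):

```python
def clean_keywords(keywords):
    """Deduplicate and strip keywords, preserving their original order.

    Splitting overly long keywords and adding Chinese-English counterparts
    (also part of step S101) would need extra, domain-specific rules.
    """
    seen, cleaned = set(), []
    for kw in keywords:
        kw = kw.strip()
        if kw and kw.lower() not in seen:
            seen.add(kw.lower())
            cleaned.append(kw)
    return cleaned

def build_query(keyword):
    """Build an exact-match search query of the form used in step S102."""
    return f'"{keyword}" database OR dataset'
```

The cleaned keyword list would then be fed to the search engine one query at a time.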
Step S2: developing a first-stage classification based on the short text logistic regression model, which is shown in fig. 2;
specifically, the step S2 includes:
step S201: acquiring training data, extracting descriptive texts and title field information of the web pages to be classified from a Google retrieval content crawling table of a database, and splicing the descriptive texts and the title field information into short texts.
Step S202: text preprocessing, namely performing word segmentation on the Chinese and English text and removing stop words.
Step S203: vectorizing the text with the term frequency-inverse document frequency (TF-IDF) method; the normalized term frequency tf_{i,j} of term i in document j is calculated as:

tf_{i,j} = n_{i,j} / Σ_k n_{k,j}

where tf_{i,j} is the normalized term frequency; i is the term index; j is the document index; n_{i,j} is the number of occurrences of term i in document j, and it is divided by Σ_k n_{k,j}, the total number of terms in document j, as normalization. The inverse document frequency is obtained by dividing the total number of documents by the number of documents containing the term and taking the logarithm of the quotient:

idf_i = log( |T| / |{ j : w_i ∈ t_j }| )

where idf_i is the inverse document frequency of the term w_i; |T| is the total number of documents in the document set; |{ j : w_i ∈ t_j }| is the number of documents in which w_i appears (i.e., n_{i,j} ≠ 0), t_j being a document containing w_i. The final TF-IDF value is the product of the two quantities TF and IDF:

TF-IDF = TF × IDF

where TF is the term frequency and IDF the inverse document frequency.
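The TF-IDF computation of step S203 can be mirrored directly in a toy pure-Python sketch (in practice a library vectorizer would be used; documents are represented here as token lists, and the function names are illustrative):

```python
import math

def tf(term, doc):
    # Normalized term frequency: occurrences of `term` divided by the
    # total number of words in `doc`.
    return doc.count(term) / len(doc)

def idf(term, docs):
    # Inverse document frequency: log of (total documents / documents
    # containing the term).
    df = sum(1 for d in docs if term in d)
    return math.log(len(docs) / df)

def tf_idf(term, doc, docs):
    # Final TF-IDF value: the product of the two quantities.
    return tf(term, doc) * idf(term, docs)
```

A term that occurs in every document gets idf = log(1) = 0, so its TF-IDF weight vanishes, which is exactly the down-weighting of uninformative common words that motivates the method.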
Step S204: training a logistic regression model with the vectorized data; define W (Webpages) as the set of web pages to be classified, W = (t_1, w_1; t_2, w_2; ...; t_N, w_N), where t_i (i = 1, 2, ..., N) is a feature term and w_i is the weight of t_i; D (Data pages) denotes data web pages and N (Non-data pages) denotes non-data web pages. The probability that a web page is a data web page is judged from the web page features:

P(D|y) = 1 / (1 + e^(−θ·x))

where x is the vector composed of all the features of the web page, θ is the weight vector corresponding to the web page feature vector, and y is the web page to be judged; similarly, the probability that a web page is a non-data web page is:

P(N|y) = 1 − P(D|y)

After the predicted probability is obtained, a threshold is defined to judge whether the web page belongs to the data web pages; the threshold is usually 0.5, so the web page classification function is:

f(y) = D if P(D|y) ≥ 0.5, and f(y) = N otherwise

Step S205: optimizing the logistic regression model parameters by regularization; regularization avoids model overfitting by adding a penalty term to the original loss function, implemented by appending a multiple of a norm of the parameter vector to the loss function. It can be divided into L1 regularization, computed from the sum of the absolute values of the parameters, and L2 regularization, computed from the sum of the squares of the parameters. Adding an L2 regularization term yields the new loss function L(θ):

L(θ) = −(1/m) Σ_{i=1..m} [ y_i log h_θ(x_i) + (1 − y_i) log(1 − h_θ(x_i)) ] + (λ / 2m) Σ_j θ_j²

where L(θ) is the regularized loss of the model parameters θ; m is the total number of web pages to be judged; y_i is the label of the i-th web page; Σ_j θ_j² is the sum of the squared parameters to be penalized; and h_θ(x_i) is the probability, computed with the selected parameters, that the output variable is 1 for the i-th feature vector x_i.

Logistic regression and parameter tuning are implemented with the scikit-learn library, and the most appropriate regularization coefficient λ is selected by grid search and k-fold cross-validation.
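The decision rule of step S204 reduces to a sigmoid plus a 0.5 threshold; a minimal sketch (the patent trains with scikit-learn; the function names and the "D"/"N" labels below are illustrative only):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def p_data(theta, x):
    # P(D|y) = 1 / (1 + e^(-theta . x)): probability that a page with
    # feature vector x is a data web page.
    return sigmoid(sum(t * xi for t, xi in zip(theta, x)))

def classify(theta, x, threshold=0.5):
    # 'D' = data web page, 'N' = non-data web page.
    return "D" if p_data(theta, x) >= threshold else "N"
```

In the full pipeline the weight vector `theta` would come from scikit-learn's L2-regularized fit, with λ chosen by grid search and k-fold cross-validation as the text describes.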
Step S3: acquiring the HTML information of webpages labeled as "data webpage" after the first-stage classification is finished; in step S3, the object of content crawling is the webpages that need to be further "subdivided" after the first-stage classification: each webpage is clicked into to obtain its specific content, and the HTML information is refined into long text information.
Step S4: and developing the second-stage classification based on the long text and website information of the webpage, and adopting a text convolution neural network combined triple loss depth measurement learning algorithm as shown in figure 3.
Specifically, the step S4 includes:
step S401: acquiring training data; the input data has two dimensions: the content long text and the website links of the webpages in the first-stage classification result.
Step S402: text preprocessing, including word segmentation and stop-word filtering; the long text follows the same preprocessing method as the first stage, while for the website features, structural symbols such as ":" and "/" are used to divide each link into protocol, host name, path, file name and parameter parts so as to highlight the sequential relationship between the different parts; when stop words are filtered, important structural symbols are kept, and overlong garbled symbols with no practical meaning are deleted.
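The link splitting of step S402 can be realized with the standard library alone (a sketch; the dictionary field names are illustrative, since the patent does not fix an exact representation):

```python
from urllib.parse import urlparse

def split_url(url):
    """Split a link into protocol, host, path, file name and parameter parts."""
    p = urlparse(url)
    # Separate the trailing file name from the rest of the path.
    path, _, filename = p.path.rpartition("/")
    return {
        "protocol": p.scheme,
        "host": p.netloc,
        "path": path,
        "file": filename,
        "params": p.query,
    }
```

Each part can then be tokenized and embedded separately, preserving the sequential relation between parts that the text emphasizes.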
Step S403: fixing the text length and the website link length corresponding to each webpage; a fixed text length and a fixed website link length are chosen: sequences longer than the fixed length are truncated, and shorter ones are padded with <pad>, an artificially specified padding symbol that can be replaced by any other as long as it is used uniformly in the model.
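The truncate-or-pad rule of step S403 is a one-liner over token lists (a minimal sketch; the function name is illustrative):

```python
def fix_length(tokens, length, pad="<pad>"):
    """Truncate `tokens` to `length`, or right-pad with the <pad> symbol."""
    return tokens[:length] + [pad] * max(0, length - len(tokens))
```

Applying this to both the long text and the link tokens gives every webpage input the fixed shape the convolutional model expects.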
Step S404: training respective word embedding models on the long text data and the website link data; the continuous skip-gram model is selected, with a feature length of 300 for each word. The continuous skip-gram model uses the central word C^(w) to predict the probability of occurrence of each context word, i.e., P(C^(i) | C^(w)), where w − k ≤ i ≤ w + k and i ≠ w; here C^(i) denotes a word surrounding the central word and C^(w) denotes the central word. To reduce the amount of computation and the attention paid to rare words, the vocabulary is reduced in dimension: part of the high-frequency vocabulary is kept and the low-frequency words are uniformly replaced with <unk>, which, like <pad>, is an artificially defined symbol that can be replaced. The long text and the website link data are then vectorized separately with the trained word embedding models, and the resulting vocabulary matrices serve as the input of the text convolutional neural network model.
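Two pieces of step S404 are easy to sketch without a training framework: the <unk> vocabulary reduction and the (center, context) pairs that skip-gram predicts, P(C^(i) | C^(w)) with w − k ≤ i ≤ w + k and i ≠ w (the actual 300-dimensional embeddings would be trained with a word2vec implementation; function names here are illustrative):

```python
from collections import Counter

def build_vocab(tokens, max_size):
    # Keep only the most frequent words; everything else maps to <unk>.
    return {w for w, _ in Counter(tokens).most_common(max_size)}

def replace_rare(tokens, vocab):
    return [t if t in vocab else "<unk>" for t in tokens]

def skipgram_pairs(tokens, k):
    # Enumerate (center, context) training pairs for a window radius k.
    pairs = []
    for w, center in enumerate(tokens):
        for i in range(max(0, w - k), min(len(tokens), w + k + 1)):
            if i != w:
                pairs.append((center, tokens[i]))
    return pairs
```

Each pair is one prediction target of the skip-gram objective; the embedding matrix learned from these pairs then vectorizes the long text and link tokens.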
Step S405: the vocabulary matrices of the long text and the website links pass through the text convolutional neural network model respectively, are spliced together in a fully connected layer, and finally a feature vector of fixed length is output. The text convolutional neural network comprises a convolutional layer, a pooling layer and a random-deactivation (dropout) layer. Suppose there is a training set T = {t_1, t_2, ..., t_N}, where N is the number of samples in the training set. Each training sample is represented as x_{1:n} = x_1 ⊕ x_2 ⊕ ... ⊕ x_n, where x_i ∈ R^k is the k-dimensional word vector of the i-th word in the sample document, ⊕ is the concatenation operator indicating that the document is formed by concatenating words, and x_{i:i+j} refers to the concatenation of x_i, x_{i+1}, ..., x_{i+j}. A convolution operation involves a convolution kernel w ∈ R^{hk}: the length of the kernel is consistent with the dimension of the word vector, and its width h represents a sliding window covering h words. For example, the feature c_i is obtained from the words x_{i:i+h−1} in the window:
c_i = f(w · x_{i:i+h−1} + b)
where b is a bias term and f is a nonlinear activation function. This convolution kernel works on all words in the document to obtain a feature map:
c = [c_1, c_2, ..., c_{n−h+1}]
The maximum value is then extracted with a max-pooling operation, giving the feature ĉ = max{c} for this kernel.
Random deactivation (dropout) prevents the model from overfitting: units in the hidden layer are randomly selected and their weights set to 0. In the penultimate layer z = [ĉ_1, ..., ĉ_m], where m is the number of convolution kernels, the output is calculated as y = w · (z ∘ r) + b, in which ∘ denotes element-wise multiplication and r is a random masking vector of independent Bernoulli variables.
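The convolution, activation and max-pooling equations above can be sketched with NumPy; ReLU is assumed as the nonlinearity f (the text does not fix one), and the weights below are random placeholders rather than trained parameters:

```python
import numpy as np

def text_conv_maxpool(X, W, b):
    """X: (n, k) word-vector matrix; W: (h, k) kernel covering h words.
    Computes c_i = f(W . x_{i:i+h-1} + b) with f = ReLU, then max-pools."""
    n, _ = X.shape
    h = W.shape[0]
    c = np.array([np.sum(W * X[i:i + h]) + b for i in range(n - h + 1)])
    c = np.maximum(c, 0.0)   # nonlinear activation f (ReLU assumed)
    return c.max()           # max-over-time pooling: c_hat = max{c}

rng = np.random.default_rng(0)
X = rng.standard_normal((7, 300))   # 7 words, 300-dim vectors as in step S404
W = rng.standard_normal((3, 300))   # sliding window of h = 3 words
feature = text_conv_maxpool(X, W, b=0.1)
```

In the full model, m such kernels yield the penultimate-layer vector z = [ĉ_1, ..., ĉ_m] to which dropout is applied.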
Step S406: training the weight parameters of the text convolutional neural network with the triplet loss to obtain the final model; the triplet loss function is shown by the following equation: L = max( d(X_a, X_b) − d(X_a, X_n) + m, 0 )
In the formula, X_a, X_b and X_n are the anchor sample, the positive sample and the negative sample respectively; m is the margin; d(X_a, X_b) denotes the Euclidean distance between the anchor sample and the positive sample; d(X_a, X_n) denotes the Euclidean distance between the anchor sample and the negative sample. The goal of training with this loss function is to pull the positive sample closer to the anchor and push the negative sample away from it, until the distance between inter-class sample pairs exceeds the distance between intra-class sample pairs by at least the margin.
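A minimal sketch of this triplet loss, assuming squared Euclidean distance (a common variant; the text itself only specifies Euclidean distance and a margin m):

```python
import numpy as np

def triplet_loss(x_a, x_b, x_n, margin=1.0):
    """max(d(X_a, X_b) - d(X_a, X_n) + m, 0) with squared Euclidean d:
    pulls the positive x_b toward the anchor x_a and pushes the
    negative x_n at least margin m further away."""
    d_pos = np.sum((x_a - x_b) ** 2)
    d_neg = np.sum((x_a - x_n) ** 2)
    return float(max(d_pos - d_neg + margin, 0.0))
```

When the negative already sits more than the margin beyond the positive, the loss is zero and the triplet contributes no gradient.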
Step S407: when new data enter the model, a feature vector is obtained through the trained text convolution neural network model, the feature vector is projected into a feature space where the training data are located, the category of the webpage is judged based on a clustering algorithm (such as a K-nearest neighbor method), and a final prediction result is output.
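The nearest-neighbour decision of step S407 can be sketched as a majority vote among the k closest training vectors in the learned feature space (the toy vectors and labels below are illustrative):

```python
import numpy as np

def knn_predict(query, train_feats, train_labels, k=3):
    """Classify a new page's feature vector by majority vote among its
    k nearest (Euclidean) neighbours in the training feature space."""
    dists = np.linalg.norm(train_feats - query, axis=1)
    nearest = np.argsort(dists)[:k]
    labels, counts = np.unique(np.asarray(train_labels)[nearest], return_counts=True)
    return labels[np.argmax(counts)]

# Toy feature space: two "data" pages near the origin, two "non-data" far away.
feats = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.0, 5.1]])
labels = ["data", "data", "non-data", "non-data"]
```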
Step S5: the final classification results are put into the database and organized, the necessary information in the webpages is parsed, and it is displayed on the data portal website. In step S5, the final classification results are stored in a specific table of the database and displayed on the portal website; when the webpage content is crawled for the second time, in addition to the text and the webpage structure information, the language information, IP location information and a screenshot of each webpage are also acquired, so that the webpage content can be displayed in various forms on the portal website.
The embodiment of the invention provides a two-stage academic data webpage classification method and system based on multi-dimensional feature fusion, establishing a brand-new method for discovering academic data webpages on the open network; the method is easy to popularize and can help experts quickly and accurately screen data webpages from the Internet.
Those skilled in the art will appreciate that, in addition to implementing the system and its various devices, modules, units provided by the present invention as pure computer readable program code, the system and its various devices, modules, units provided by the present invention can be fully implemented by logically programming method steps in the form of logic gates, switches, application specific integrated circuits, programmable logic controllers, embedded microcontrollers and the like. Therefore, the system and various devices, modules and units thereof provided by the present invention can be regarded as a hardware component, and the devices, modules and units included therein for implementing various functions can also be regarded as structures within the hardware component; means, modules, units for performing the various functions may also be regarded as structures within both software modules and hardware components for performing the method.
The foregoing description of specific embodiments of the present invention has been presented. It is to be understood that the present invention is not limited to the specific embodiments described above, and that various changes or modifications may be made by one skilled in the art within the scope of the appended claims without departing from the spirit of the invention. The embodiments and features of the embodiments of the present application may be combined with each other arbitrarily without conflict.
Claims (10)
1. A two-stage academic data webpage classification method based on multi-dimensional feature fusion is characterized by comprising the following steps:
step S1: inputting a search engine for searching based on academic keywords to obtain search page content;
step S2: developing a first-stage classification based on a short text logistic regression model;
step S3: acquiring webpage HTML information labeled as a data webpage after the classification of the first stage is finished;
step S4: developing second-stage classification based on the webpage long text and website information, and adopting a text convolution neural network combined with a triple loss depth measurement learning algorithm;
step S5: warehousing and organizing the final classification results, parsing the necessary information in the webpages, and displaying it on a data portal website.
2. The two-stage academic data web page classification method based on multi-dimensional feature fusion according to claim 1, wherein the step S1 includes:
step S101: cleaning and expanding the keywords; the system receives the academic keywords corresponding to the data webpages to be searched and, when a batch of keywords for a subject is provided at one time, performs preliminary cleaning and expansion tasks such as removing duplicate items, splitting overlong keywords, and adding missing Chinese-English counterparts;
step S102: if the keyword is Chinese, inputting retrieval information related to the keyword in a search engine, and acquiring returned webpage content.
3. The two-stage academic data webpage classification method based on multi-dimensional feature fusion according to claim 1, wherein the step S2 comprises:
step S201: acquiring training data: extracting the descriptive text and title fields of the webpages to be classified from the Google search-content crawl table of the database, and splicing them into short texts;
step S202: text preprocessing, namely performing word segmentation on Chinese and English texts and removing stop words;
step S203: vectorizing the text with the term frequency-inverse document frequency (TF-IDF) method; the normalized term frequency tf_{i,j} of word i in document j is calculated as: tf_{i,j} = n_{i,j} / Σ_k n_{k,j}
Wherein tf_{i,j} denotes the normalized term frequency; i denotes the word index; j denotes the document index; n_{i,j} is the number of occurrences of word i in document j, which is divided by Σ_k n_{k,j}, the total number of words in document j, as a normalization; the inverse document frequency IDF is obtained by dividing the total number of documents by the number of documents containing the word and taking the logarithm of the quotient: idf_i = log( |T| / |{ j : w_i ∈ t_j }| )
wherein idf_i denotes the inverse document frequency of word w_i; |T| is the total number of documents in the document set; |{ j : w_i ∈ t_j }| is the number of documents containing word w_i (i.e., documents with n_{i,j} ≠ 0); t_j denotes a document containing word w_i; the final TF-IDF value is the product of the two numbers TF and IDF:
TF-IDF=TF×IDF
wherein TF denotes the term frequency and IDF denotes the inverse document frequency.
Step S204: training a logistic regression model with the vectorized data; define W as a webpage to be classified, W = (t_1, w_1; t_2, w_2; ...; t_n, w_n), where t_i (i = 1, 2, ..., n) is a feature term and w_i is the weight of t_i; let D denote a data webpage and N a non-data webpage; the probability of being a data webpage is judged from the webpage features: P(D | y) = e^(θ·x) / (1 + e^(θ·x))
wherein x is the vector composed of all the features of the webpage and θ is the weight vector corresponding to the webpage feature vector; y represents the webpage to be judged; similarly, the probability that a webpage is a non-data webpage is: P(N | y) = 1 / (1 + e^(θ·x))
After the prediction probability is obtained, a threshold is defined to judge whether the webpage belongs to the data webpages; with the threshold set to 0.5, the webpage classification function is: class(y) = D if P(D | y) ≥ 0.5, otherwise N.
step S205: optimizing the logistic regression model parameters by means of regularization; regularization avoids model overfitting by adding a penalty term to the original loss function, implemented by appending a multiple of a norm of the parameter vector to the loss function; adding an L2 regularization term yields the new loss function L(θ): L(θ) = −(1/m) Σ_{i=1}^{m} [ y_i log h_θ(x_i) + (1 − y_i) log(1 − h_θ(x_i)) ] + (λ/(2m)) Σ_j θ_j²
wherein L(θ) denotes the regularized loss of the model parameters θ; m denotes the total number of webpages to be judged; y_i denotes the true category label of the i-th webpage; Σ_j θ_j² is the sum of squares of the penalized parameters; h_θ(x_i) denotes the probability, computed with the selected parameters from the i-th feature vector x_i, that the output variable equals 1;
logistic regression and parameter tuning are implemented with the scikit-learn library, and the most appropriate regularization coefficient λ is selected by grid search and k-fold cross validation.
4. The two-stage academic data webpage classification method based on multi-dimensional feature fusion according to claim 1, wherein the objects crawled in step S3 are webpages that still need to be subdivided further after the first-stage classification; their specific content is obtained by following links inside the webpages, and the HTML information is refined into long text information.
5. The two-stage academic data web page classification method based on multi-dimensional feature fusion according to claim 1, wherein the step S4 includes:
step S401: acquiring training data; the input data has two dimensions: the content long text and website link of the webpage of the first stage classification result;
step S402: text preprocessing, including word segmentation and stop-word filtering; the long text is preprocessed with the same method as in the first stage, while each website link is split according to its structural characteristics into protocol, host name, path, file name and parameter parts, so as to highlight the sequential relationship between these parts; important structural symbols are retained when stop words are filtered, and overlong garbled symbols without practical meaning are deleted;
step S403: fixing the text length and the website link length corresponding to each webpage; a fixed text length and a fixed website link length are chosen, sequences longer than the fixed length are truncated, and sequences shorter than it are filled with < pad >;
step S404: training separate word embedding models for the long text data and the website link data; the model chosen is the continuous skip-gram model, with a feature vector of length 300 for each word; the continuous skip-gram model uses the central word C_(w) to predict the probability of occurrence of each context word, i.e., P(C_(i) | C_(w)), where w − k ≤ i ≤ w + k and i ≠ w; in the formula, C_(i) denotes a word surrounding the central word and C_(w) denotes the central word;
to reduce the amount of computation and the attention paid to rare words, the vocabulary is reduced: part of the high-frequency vocabulary is retained and the low-frequency vocabulary is uniformly replaced with < unk >; the trained word embedding models are used to vectorize the long text and the website link data respectively, yielding vocabulary matrices that serve as the input of the text convolutional neural network model;
step S405: the vocabulary matrixes of the long text and the website links pass through a text convolution neural network model respectively, and are finally spliced together in a full connection layer, and finally a feature vector with a fixed length is output;
step S406: training the weight parameters of the text convolutional neural network with the triplet loss to obtain the final model; the triplet loss function is shown by the following equation: L = max( d(X_a, X_b) − d(X_a, X_n) + m, 0 )
in the formula, X_a, X_b and X_n are the anchor sample, the positive sample and the negative sample respectively; m is the margin; d(X_a, X_b) denotes the Euclidean distance between the anchor sample and the positive sample; d(X_a, X_n) denotes the Euclidean distance between the anchor sample and the negative sample;
step S407: when new data enter the model, a feature vector is obtained through the trained text convolution neural network model, the feature vector is projected into a feature space where the training data are located, the category of the webpage is judged based on a clustering algorithm, and a final prediction result is output.
6. The two-stage academic data webpage classification method based on multi-dimensional feature fusion according to claim 1, wherein the final classification result obtained in step S5 is stored and organized: it is stored in a specific table of the database and displayed on the web portal; when the webpage content is crawled for the second time, in addition to the text and the webpage structure information, the language information, IP location information and a screenshot of each webpage are also acquired.
7. A two-stage academic data webpage classification system based on multi-dimensional feature fusion is characterized by comprising:
module M1: inputting a search engine for searching based on academic keywords to obtain search page content;
module M2: developing a first-stage classification based on a short text logistic regression model;
module M3: acquiring webpage HTML information labeled as a data webpage after the classification of the first stage is finished;
module M4: developing second-stage classification based on the webpage long text and website information, and adopting a text convolution neural network combined with a triple loss depth measurement learning algorithm;
module M5: warehousing and organizing the final classification results, parsing the necessary information in the webpages, and displaying it on a data portal website.
8. The two-stage academic data web page classification system based on multi-dimensional feature fusion of claim 7, wherein the module M1 includes:
module M101: cleaning and expanding the keywords; the system receives the academic keywords corresponding to the data webpages to be searched and, when a batch of keywords for a subject is provided at one time, performs preliminary cleaning and expansion tasks such as removing duplicate items, splitting overlong keywords, and adding missing Chinese-English counterparts;
the module M102: if the keyword is Chinese, inputting retrieval information related to the keyword in a search engine, and acquiring returned webpage content.
9. The two-stage academic data web page classification system based on multi-dimensional feature fusion of claim 7, wherein the module M2 includes:
module M201: acquiring training data: extracting the descriptive text and title fields of the webpages to be classified from the Google search-content crawl table of the database, and splicing them into short texts;
the module M202: text preprocessing, namely performing word segmentation on Chinese and English texts and removing stop words;
module M203: vectorizing the text with the term frequency-inverse document frequency (TF-IDF) method; the normalized term frequency tf_{i,j} of word i in document j is calculated as: tf_{i,j} = n_{i,j} / Σ_k n_{k,j}
Wherein tf_{i,j} denotes the normalized term frequency; i denotes the word index; j denotes the document index; n_{i,j} is the number of occurrences of word i in document j, which is divided by Σ_k n_{k,j}, the total number of words in document j, as a normalization; the inverse document frequency IDF is obtained by dividing the total number of documents by the number of documents containing the word and taking the logarithm of the quotient: idf_i = log( |T| / |{ j : w_i ∈ t_j }| )
wherein idf_i denotes the inverse document frequency of word w_i; |T| is the total number of documents in the document set; |{ j : w_i ∈ t_j }| is the number of documents containing word w_i (i.e., documents with n_{i,j} ≠ 0); t_j denotes a document containing word w_i; the final TF-IDF value is the product of the two numbers TF and IDF:
TF-IDF=TF×IDF
wherein TF denotes the term frequency and IDF denotes the inverse document frequency.
module M204: training a logistic regression model with the vectorized data; define W as a webpage to be classified, W = (t_1, w_1; t_2, w_2; ...; t_n, w_n), where t_i (i = 1, 2, ..., n) is a feature term and w_i is the weight of t_i; let D denote a data webpage and N a non-data webpage; the probability of being a data webpage is judged from the webpage features: P(D | y) = e^(θ·x) / (1 + e^(θ·x))
wherein x is the vector composed of all the features of the webpage and θ is the weight vector corresponding to the webpage feature vector; y represents the webpage to be judged; similarly, the probability that a webpage is a non-data webpage is: P(N | y) = 1 / (1 + e^(θ·x))
After the prediction probability is obtained, a threshold is defined to judge whether the webpage belongs to the data webpages; with the threshold set to 0.5, the webpage classification function is: class(y) = D if P(D | y) ≥ 0.5, otherwise N.
module M205: optimizing the logistic regression model parameters by means of regularization; regularization avoids model overfitting by adding a penalty term to the original loss function, implemented by appending a multiple of a norm of the parameter vector to the loss function; adding an L2 regularization term yields the new loss function L(θ): L(θ) = −(1/m) Σ_{i=1}^{m} [ y_i log h_θ(x_i) + (1 − y_i) log(1 − h_θ(x_i)) ] + (λ/(2m)) Σ_j θ_j²
wherein L(θ) denotes the regularized loss of the model parameters θ; m denotes the total number of webpages to be judged; y_i denotes the true category label of the i-th webpage; Σ_j θ_j² is the sum of squares of the penalized parameters; h_θ(x_i) denotes the probability, computed with the selected parameters from the i-th feature vector x_i, that the output variable equals 1;
logistic regression and parameter tuning are implemented with the scikit-learn library, and the most appropriate regularization coefficient λ is selected by grid search and k-fold cross validation.
10. The two-stage academic data webpage classification system based on multi-dimensional feature fusion according to claim 7, wherein the objects of content crawling in the module M3 are webpages that still need to be subdivided further after the first-stage classification; their specific content is obtained by following links inside the webpages, and the HTML information is refined into long text information.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210795308.4A CN115130601A (en) | 2022-07-07 | 2022-07-07 | Two-stage academic data webpage classification method and system based on multi-dimensional feature fusion |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210795308.4A CN115130601A (en) | 2022-07-07 | 2022-07-07 | Two-stage academic data webpage classification method and system based on multi-dimensional feature fusion |
Publications (1)
Publication Number | Publication Date |
---|---|
CN115130601A true CN115130601A (en) | 2022-09-30 |
Family
ID=83381906
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210795308.4A Pending CN115130601A (en) | 2022-07-07 | 2022-07-07 | Two-stage academic data webpage classification method and system based on multi-dimensional feature fusion |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115130601A (en) |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117010283A (en) * | 2023-10-07 | 2023-11-07 | 中国矿业大学(北京) | Method and system for predicting deformation of steel pipe column structure of PBA station |
CN117010283B (en) * | 2023-10-07 | 2023-12-29 | 中国矿业大学(北京) | Method and system for predicting deformation of steel pipe column structure of PBA station |
CN117521673A (en) * | 2024-01-08 | 2024-02-06 | 安徽大学 | Natural language processing system with analysis training performance |
CN117521673B (en) * | 2024-01-08 | 2024-03-22 | 安徽大学 | Natural language processing system with analysis training performance |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||