CN115130601A - Two-stage academic data webpage classification method and system based on multi-dimensional feature fusion - Google Patents

Two-stage academic data webpage classification method and system based on multi-dimensional feature fusion

Info

Publication number
CN115130601A
CN115130601A
Authority
CN
China
Prior art keywords
webpage
data
word
text
stage
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210795308.4A
Other languages
Chinese (zh)
Inventor
施家荣
卢彬
杨莉娜
甘小莺
王新兵
傅洛伊
周成虎
曹心德
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Jiaotong University
Original Assignee
Shanghai Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Jiaotong University filed Critical Shanghai Jiaotong University
Priority to CN202210795308.4A priority Critical patent/CN115130601A/en
Publication of CN115130601A publication Critical patent/CN115130601A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/216 Parsing using statistical methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a two-stage academic data webpage classification method and system based on multi-dimensional feature fusion, relating to the technical field of webpage classification and comprising the following steps: step S1: inputting academic keywords into a search engine and acquiring the search page content; step S2: developing a first-stage classification based on a short-text logistic regression model; step S3: acquiring the HTML information of the webpages labeled as data webpages after the first-stage classification is finished; step S4: developing a second-stage classification based on the webpage long text and website information, adopting a deep metric learning algorithm that combines a text convolutional neural network with a triplet loss; step S5: warehousing and sorting the final classification results, parsing the necessary information in the webpages, and displaying it on a data portal website. The method and the device can quickly and accurately screen data webpages from the Internet.

Description

Two-stage academic data webpage classification method and system based on multi-dimensional feature fusion
Technical Field
The invention relates to the technical field of webpage classification, and in particular to a two-stage academic data webpage classification method and system based on multi-dimensional feature fusion.
Background
In recent years, web page classification has become an important research topic in the field of data mining, especially in the cross-discipline direction, which can help experts find discipline data web pages with research value from the internet. However, the web pages have the characteristics of large quantity, various themes, complex structure and the like, and how to quickly and accurately screen the subject data web pages from the internet is a great challenge.
The traditional way of classifying web pages is manual classification, but judgment standards differ from person to person and are hard to unify, and the cost is high while the efficiency is low. With the gradual improvement of artificial intelligence technology and of computer computing performance, machine learning methods have gradually replaced manual work, opening up a brand-new path in the field of webpage classification. However, as webpage structures grow more complicated, feature extraction for machine learning becomes difficult and high accuracy is hard to achieve; deep learning algorithms, by contrast, can integrate feature engineering directly into the model fitting process, mine features deeply from plain text, and perform better in the field of webpage classification.
Although the main content of a web page is text, mainstream webpage classification methods filter only by the textual information of the page; yet besides the content, the web address and the page structure information, including the DOM (Document Object Model) structure and the HTML structure of the page, are also important and can serve as feature input. Meanwhile, metric learning, a family of methods that learn a distance between data objects from the data themselves, has developed rapidly in the field of machine learning in recent years, and metric learning built around the similarity between samples is a research hotspot in current webpage classification. Therefore, building a system that helps experts and scholars quickly acquire subject-related data web pages is meaningful and necessary, and deep-learning-based webpage classification using a metric learning algorithm with multi-modal information fusion is feasible and of important research value.
Disclosure of Invention
Aiming at the defects in the prior art, the invention provides a two-stage academic data webpage classification method and system based on multi-dimensional feature fusion.
According to the two-stage academic data webpage classification method and system based on multi-dimensional feature fusion, the scheme is as follows:
in a first aspect, a two-stage academic data webpage classification method based on multi-dimensional feature fusion is provided, and the method includes:
step S1: inputting a search engine for searching based on academic keywords to obtain search page content;
step S2: developing a first-stage classification based on a short text logistic regression model;
step S3: acquiring webpage HTML information labeled as a data webpage after the classification of the first stage is finished;
step S4: developing a second-stage classification based on the webpage long text and website information, adopting a deep metric learning algorithm that combines a text convolutional neural network with a triplet loss;
step S5: warehousing and sorting the final classification results, parsing the necessary information in the webpage, and displaying it on the data portal website.
Preferably, the step S1 includes:
step S101: cleaning and expanding the keywords; the system takes as input the academic keywords corresponding to the data webpages to be searched; if a plurality of keywords for a subject are provided at one time, the preliminary cleaning and expansion tasks are completed first, including screening out duplicates, splitting overly long keywords, and adding missing Chinese-English counterparts;
step S102: if the keyword is Chinese, inputting retrieval information related to the keyword in a search engine, and acquiring returned webpage content.
Preferably, the step S2 includes:
step S201: acquiring training data, extracting descriptive texts and title field information of webpages to be classified from Google retrieval content crawling tables of a database, and splicing the descriptive texts and the title field information into short texts;
step S202: text preprocessing, namely performing word segmentation on Chinese and English texts and removing stop words;
step S203: vectorizing the text with the term frequency-inverse document frequency (TF-IDF) method; the normalized frequency $tf_{i,j}$ of word $w_i$ in document $t_j$ is calculated as

$$tf_{i,j} = \frac{n_{i,j}}{\sum_{k} n_{k,j}}$$

wherein $tf_{i,j}$ denotes the normalized word frequency; $i$ is the word index and $j$ the document index; $n_{i,j}$ is the number of occurrences of word $w_i$ in document $t_j$; and the denominator, which traverses all words $k$ of document $t_j$, is the total number of words in that document, used as the normalizing value; the inverse document frequency is obtained by dividing the total number of documents by the number of documents containing the word and taking the logarithm of the quotient:

$$idf_i = \log \frac{|T|}{|\{j : w_i \in t_j\}|}$$

wherein $idf_i$ denotes the inverse document frequency corresponding to word $w_i$; $|T|$ is the total number of documents in the document set; $|\{j : w_i \in t_j\}|$ denotes the number of documents in which the word $w_i$ appears (i.e. $n_{i,j} \neq 0$); and $t_j$ denotes a document containing the word $w_i$; the final TF-IDF score is the product of the two factors TF and IDF:

$$\text{TF-IDF} = TF \times IDF$$

wherein TF denotes the term frequency and IDF denotes the inverse document frequency.
Step S204: training a logistic regression model with the vectorized data; define $W$ as the set of web pages to be classified, $W = (t_1, w_1; t_2, w_2; \ldots; t_n, w_n)$, where $t_i$ $(i = 1, 2, \ldots, n)$ is a feature term and $w_i$ is the weight of $t_i$; $D$ denotes a data webpage and $N$ denotes a non-data webpage; the probability of being a data webpage is judged from the webpage features:

$$P(D \mid y) = \frac{1}{1 + e^{-\theta^{T} x}}$$

wherein $x = (x_1, x_2, \ldots, x_n)$ is the vector composed of all the features of the web page, $\theta$ is the weight vector corresponding to the webpage feature vector, and $y$ denotes the web page to be judged; similarly, the probability that a certain web page is a non-data webpage is:

$$P(N \mid y) = 1 - P(D \mid y) = \frac{e^{-\theta^{T} x}}{1 + e^{-\theta^{T} x}}$$

after the prediction probability is obtained, a threshold is defined to judge whether the webpage belongs to the data webpages; with the threshold set to 0.5, the webpage classification function is:

$$f(y) = \begin{cases} D, & P(D \mid y) \geq 0.5 \\ N, & P(D \mid y) < 0.5 \end{cases}$$
Step S205: optimizing the logistic regression model parameters by means of regularization; regularization avoids model overfitting by adding a penalty term to the original loss function, implemented by appending to the loss a multiple of a norm of the parameter vector; adding an L2 regularization term yields the new loss function $L(\theta)$:

$$L(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y_i \log h_\theta(x_i) + (1 - y_i) \log\bigl(1 - h_\theta(x_i)\bigr) \right] + \frac{\lambda}{2m} \sum_{j=1}^{n} \theta_j^2$$

wherein $L(\theta)$ denotes the loss as a function of the model parameters $\theta$; $m$ denotes the total number of web pages to be judged; $y_i$ denotes the category label of the $i$-th webpage; $\sum_{j} \theta_j^2$ is the squared-parameter penalty term; and $h_\theta(x_i)$ denotes the probability, calculated with the selected parameters, that the output variable for the $i$-th feature vector $x_i$ is 1.

Logistic regression and parameter tuning are implemented with the scikit-learn library, and the most suitable regularization coefficient $\lambda$ is selected by grid search and k-fold cross-validation.
Preferably, the objects of content crawling in step S3 are the web pages that need further subdivision after the first-stage classification; specific content is obtained by following links inside each page, and the HTML information is refined into long-text information.
Preferably, the step S4 includes:
step S401: acquiring training data; the input data has two dimensions: the content long text and website link of the webpage of the first stage classification result;
step S402: text preprocessing, including word segmentation and stop-word filtering; the long text follows the same preprocessing method as the first stage, and the website link features are divided, according to the structural characteristics of the link, into protocol, host name, path, file name and parameter parts so as to highlight the sequential relation among the different parts; important structural symbols are kept when filtering stop words, and overly long garbled symbols with no practical meaning are deleted;

step S403: fixing the text length and website link length corresponding to each webpage; a fixed text length and website link length are taken, sequences longer than the fixed length are truncated, and shorter ones are padded with <pad>;

step S404: training separate word embedding models on the long-text data and the website link data; the continuous skip-gram model is selected, with a feature length of 300 for each vocabulary item; the skip-gram model uses the center word $C^{(w)}$ to predict the occurrence probability of the context words, i.e. $P(C^{(i)} \mid C^{(w)})$, where $w-k \leq i \leq w+k$ and $i \neq w$; in the formula, $C^{(i)}$ denotes a word surrounding the center word and $C^{(w)}$ denotes the center word;

to reduce the amount of computation and the attention paid to rare words, the vocabulary dimension is reduced: part of the high-frequency vocabulary is kept and the low-frequency vocabulary is uniformly replaced by <unk>; using the trained word embedding models, the long text and website link data are vectorized separately, and the resulting vocabulary matrices serve as the input of the text convolutional neural network model;

step S405: the vocabulary matrices of the long text and the website links each pass through the text convolutional neural network model and are spliced together in a fully connected layer, which outputs a feature vector of fixed length;
step S406: training the weight parameters of the text convolutional neural network with the triplet loss to obtain the final model; the triplet loss function is shown by the following equation:

$$L_{triplet} = \max\left( d(X_a, X_b) - d(X_a, X_n) + m,\; 0 \right)$$

in the formula, $X_a$, $X_b$, $X_n$ are the anchor sample, positive sample and negative sample respectively; $m$ is the margin; $d(X_a, X_b) = \lVert f(X_a) - f(X_b) \rVert_2$ is the Euclidean distance between the anchor sample and the positive sample; and $d(X_a, X_n) = \lVert f(X_a) - f(X_n) \rVert_2$ is the Euclidean distance between the anchor sample and the negative sample, where $f(\cdot)$ denotes the feature vector output by the network;
step S407: when new data enter the model, a feature vector is obtained through the trained text convolution neural network model, the feature vector is projected into a feature space where the training data are located, the category of the webpage is judged based on a clustering algorithm, and a final prediction result is output.
Preferably, after the final classification result is obtained in step S5, it is warehoused and organized: the classification results are stored in a dedicated database table and displayed on the web portal; when the webpage content is crawled a second time, the page's language information, IP location information, and a webpage screenshot are acquired in addition to the text and page-structure information.
In a second aspect, a two-stage academic data webpage classification system based on multi-dimensional feature fusion is provided, and the system includes:
module M1: inputting a search engine for searching based on academic keywords to obtain search page content;
module M2: developing a first-stage classification based on a short text logistic regression model;
module M3: acquiring webpage HTML information labeled as a data webpage after the first-stage classification is finished;
module M4: developing a second-stage classification based on the webpage long text and website information, adopting a deep metric learning algorithm that combines a text convolutional neural network with a triplet loss;
module M5: warehousing and sorting the final classification results, parsing the necessary information in the webpage, and displaying it on the data portal website.
Preferably, said module M1 comprises:
module M101: cleaning and expanding the keywords; the system takes as input the academic keywords corresponding to the data webpages to be searched; if a plurality of keywords for a subject are provided at one time, the preliminary cleaning and expansion tasks are completed first, including screening out duplicates, splitting overly long keywords, and adding missing Chinese-English counterparts;
the module M102: if the keyword is Chinese, inputting retrieval information related to the keyword in a search engine, and acquiring returned webpage content.
Preferably, said module M2 comprises:
module M201: acquiring training data, extracting descriptive texts and title field information of webpages to be classified from Google retrieval content crawling tables of a database, and splicing the descriptive texts and the title field information into short texts;
a module M202: text preprocessing, namely performing word segmentation on Chinese and English texts and removing stop words;
module M203: vectorizing the text with the term frequency-inverse document frequency (TF-IDF) method; the normalized frequency $tf_{i,j}$ of word $w_i$ in document $t_j$ is calculated as

$$tf_{i,j} = \frac{n_{i,j}}{\sum_{k} n_{k,j}}$$

wherein $tf_{i,j}$ denotes the normalized word frequency; $i$ is the word index and $j$ the document index; $n_{i,j}$ is the number of occurrences of word $w_i$ in document $t_j$; and the denominator, which traverses all words $k$ of document $t_j$, is the total number of words in that document, used as the normalizing value; the inverse document frequency is obtained by dividing the total number of documents by the number of documents containing the word and taking the logarithm of the quotient:

$$idf_i = \log \frac{|T|}{|\{j : w_i \in t_j\}|}$$

wherein $idf_i$ denotes the inverse document frequency corresponding to word $w_i$; $|T|$ is the total number of documents in the document set; $|\{j : w_i \in t_j\}|$ denotes the number of documents in which the word $w_i$ appears (i.e. $n_{i,j} \neq 0$); and $t_j$ denotes a document containing the word $w_i$; the final TF-IDF score is the product of the two factors TF and IDF:

$$\text{TF-IDF} = TF \times IDF$$

wherein TF denotes the term frequency and IDF denotes the inverse document frequency.
Module M204: training a logistic regression model with the vectorized data; define $W$ as the set of web pages to be classified, $W = (t_1, w_1; t_2, w_2; \ldots; t_n, w_n)$, where $t_i$ $(i = 1, 2, \ldots, n)$ is a feature term and $w_i$ is the weight of $t_i$; $D$ denotes a data webpage and $N$ denotes a non-data webpage; the probability of being a data webpage is judged from the webpage features:

$$P(D \mid y) = \frac{1}{1 + e^{-\theta^{T} x}}$$

wherein $x = (x_1, x_2, \ldots, x_n)$ is the vector composed of all the features of the web page, $\theta$ is the weight vector corresponding to the webpage feature vector, and $y$ denotes the web page to be judged; similarly, the probability that a certain web page is a non-data webpage is:

$$P(N \mid y) = 1 - P(D \mid y) = \frac{e^{-\theta^{T} x}}{1 + e^{-\theta^{T} x}}$$

after the prediction probability is obtained, a threshold is defined to judge whether the webpage belongs to the data webpages; with the threshold set to 0.5, the webpage classification function is:

$$f(y) = \begin{cases} D, & P(D \mid y) \geq 0.5 \\ N, & P(D \mid y) < 0.5 \end{cases}$$
Module M205: optimizing the logistic regression model parameters by means of regularization; regularization avoids model overfitting by adding a penalty term to the original loss function, implemented by appending to the loss a multiple of a norm of the parameter vector; adding an L2 regularization term yields the new loss function $L(\theta)$:

$$L(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y_i \log h_\theta(x_i) + (1 - y_i) \log\bigl(1 - h_\theta(x_i)\bigr) \right] + \frac{\lambda}{2m} \sum_{j=1}^{n} \theta_j^2$$

wherein $L(\theta)$ denotes the loss as a function of the model parameters $\theta$; $m$ denotes the total number of web pages to be judged; $y_i$ denotes the category label of the $i$-th webpage; $\sum_{j} \theta_j^2$ is the squared-parameter penalty term; and $h_\theta(x_i)$ denotes the probability, calculated with the selected parameters, that the output variable for the $i$-th feature vector $x_i$ is 1.

Logistic regression and parameter tuning are implemented with the scikit-learn library, and the most suitable regularization coefficient $\lambda$ is selected by grid search and k-fold cross-validation.
Preferably, the object of content crawling in the module M3 is a web page that needs to be further subdivided after being classified in the first stage, and the specific content is obtained by clicking inside the web page, and the HTML information is refined into long text information.
Compared with the prior art, the invention has the following beneficial effects:
1. by fusing the webpage descriptive short text, the webpage content long text and the website link multidimensional characteristics as input, the error caused by single input is reduced, and the precision of a webpage classification model is improved;
2. by applying the loss function related to metric learning, the homogeneous samples are gathered, and the heterogeneous samples are distinguished, so that the interpretability and the generalization performance of the model are improved.
Drawings
Other features, objects and advantages of the invention will become more apparent upon reading of the detailed description of non-limiting embodiments with reference to the following drawings:
FIG. 1 is a flow chart of the method of the present invention;
FIG. 2 is a flowchart of the first stage machine learning web page classification algorithm of the present invention.
FIG. 3 is a flowchart of a second stage depth metric learning web page classification algorithm of the present invention.
Detailed Description
The present invention will be described in detail with reference to specific embodiments. The following embodiments will assist those skilled in the art in further understanding the invention, but do not limit the invention in any way. It should be noted that various changes and modifications, obvious to those skilled in the art, can be made without departing from the spirit of the invention; all of these fall within the scope of the invention.
The embodiment of the invention provides a two-stage academic data webpage classification method based on multi-dimensional feature fusion, which focuses on the problem of discovering academic data web pages on the open network and studies it with a combination of machine learning and deep learning. In the first stage, a logistic regression algorithm is adopted and a model is built with the webpage's short descriptive text as input, performing a rapid preliminary classification that screens out a large number of non-data web pages. In the second stage, a deep metric learning algorithm combining a text convolutional neural network with a triplet loss is adopted, taking the webpage's long content text fused with website information as input; similar samples are aggregated in the feature space and heterogeneous samples are pushed apart, improving the overall classification precision. Referring to fig. 1, the method specifically includes:
step S1: based on academic keywords, inputting a search engine (such as Google, Baidu and the like) for searching, and acquiring the content of a search page.
Specifically, the steps include:
step S101: cleaning and expanding the keywords; the system takes as input the academic keywords corresponding to the data webpages to be searched, which may be Chinese or English. If a plurality of keywords for a subject are provided at one time, the preliminary cleaning and expansion tasks must first be completed, such as screening out duplicates, splitting overly long keywords, and adding missing Chinese-English counterparts.
Step S102: search information related to the keyword is entered in a search engine: for a Chinese keyword, for example, "keyword + database OR dataset" (in Chinese) is entered in Google search, and in English "keyword + database OR dataset" is entered; exact-match search is adopted. The returned webpage content is acquired, the key items being the title and the short descriptive text of each result entry.
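For illustration, a minimal sketch of the query construction is given below; the helper name and the Chinese suffix terms are assumptions of this sketch, since the embodiment only specifies the "keyword + database OR dataset" pattern and exact-match search.

```python
# A hypothetical helper for step S102: build an exact-match query string for
# one academic keyword. The Chinese suffix is an assumed equivalent of
# "database OR dataset".
def build_query(keyword: str, lang: str = "en") -> str:
    suffix = "数据库 OR 数据集" if lang == "zh" else "database OR dataset"
    return f'"{keyword}" {suffix}'   # quotes request exact-match search

print(build_query("soil heavy metal"))
# "soil heavy metal" database OR dataset
```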
Step S2: developing a first-stage classification based on the short text logistic regression model, which is shown in fig. 2;
specifically, the step S2 includes:
step S201: acquiring training data, extracting descriptive texts and title field information of the web pages to be classified from a Google retrieval content crawling table of a database, and splicing the descriptive texts and the title field information into short texts.
Step S202: text preprocessing, namely performing word segmentation on the Chinese and English text and removing stop words.
Step S203: vectorizing the text with the term frequency-inverse document frequency (TF-IDF) method; the normalized frequency $tf_{i,j}$ of word $w_i$ in document $t_j$ is calculated as

$$tf_{i,j} = \frac{n_{i,j}}{\sum_{k} n_{k,j}}$$

wherein $tf_{i,j}$ denotes the normalized word frequency; $i$ is the word index and $j$ the document index; $n_{i,j}$ is the number of occurrences of word $w_i$ in document $t_j$; and the denominator, which traverses all words $k$ of document $t_j$, is the total number of words in that document, used as the normalizing value; the inverse document frequency is obtained by dividing the total number of documents by the number of documents containing the word and taking the logarithm of the quotient:

$$idf_i = \log \frac{|T|}{|\{j : w_i \in t_j\}|}$$

wherein $idf_i$ denotes the inverse document frequency corresponding to word $w_i$; $|T|$ is the total number of documents in the document set; $|\{j : w_i \in t_j\}|$ denotes the number of documents in which the word $w_i$ appears (i.e. $n_{i,j} \neq 0$); and $t_j$ denotes a document containing the word $w_i$; the final TF-IDF score is the product of the two factors TF and IDF:

$$\text{TF-IDF} = TF \times IDF$$

wherein TF denotes the term frequency and IDF denotes the inverse document frequency.
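A minimal sketch of this vectorization with scikit-learn (the library the embodiment uses for the classifier below) might look as follows; the sample texts are hypothetical.

```python
# A minimal sketch of step S203: TF-IDF vectorization of the spliced
# title + description short texts. Sample data are hypothetical.
from sklearn.feature_extraction.text import TfidfVectorizer

short_texts = [
    "national soil heavy metal survey dataset download portal",
    "department news conference announcement page",
]
vectorizer = TfidfVectorizer()              # computes tf * idf per term
X = vectorizer.fit_transform(short_texts)   # sparse document-term matrix
print(X.shape, sorted(vectorizer.vocabulary_)[:3])
```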
Step S204: training a logistic regression model with the vectorized data; define W (Webpages) as the set of web pages to be classified, $W = (t_1, w_1; t_2, w_2; \ldots; t_n, w_n)$, where $t_i$ $(i = 1, 2, \ldots, n)$ is a feature term and $w_i$ is the weight of $t_i$; D (Datapages) denotes data web pages and N (Non-Datapages) denotes non-data web pages; the probability of being a data webpage is judged from the webpage features:

$$P(D \mid y) = \frac{1}{1 + e^{-\theta^{T} x}}$$

wherein $x = (x_1, x_2, \ldots, x_n)$ is the vector composed of all the features of the web page, $\theta$ is the weight vector corresponding to the webpage feature vector, and $y$ denotes the web page to be judged; similarly, the probability that a certain web page is a non-data webpage is:

$$P(N \mid y) = 1 - P(D \mid y) = \frac{e^{-\theta^{T} x}}{1 + e^{-\theta^{T} x}}$$

After the prediction probability is obtained, a threshold is defined to determine whether the web page belongs to the data webpages; the threshold is usually 0.5, so the webpage classification function is:

$$f(y) = \begin{cases} D, & P(D \mid y) \geq 0.5 \\ N, & P(D \mid y) < 0.5 \end{cases}$$
Step S205: optimizing the logistic regression model parameters by means of regularization; regularization avoids model overfitting by adding a penalty term to the original loss function, implemented by appending to the loss a multiple of a norm of the parameter vector, and can be divided into L1 regularization, calculated from the sum of the absolute values of the parameters, and L2 regularization, calculated from the sum of the squares of the parameters; adding an L2 regularization term yields the new loss function $L(\theta)$:

$$L(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y_i \log h_\theta(x_i) + (1 - y_i) \log\bigl(1 - h_\theta(x_i)\bigr) \right] + \frac{\lambda}{2m} \sum_{j=1}^{n} \theta_j^2$$

wherein $L(\theta)$ denotes the loss as a function of the model parameters $\theta$; $m$ denotes the total number of web pages to be judged; $y_i$ denotes the category label of the $i$-th webpage; $\sum_{j} \theta_j^2$ is the squared-parameter penalty term; and $h_\theta(x_i)$ denotes the probability, calculated with the selected parameters, that the output variable for the $i$-th feature vector $x_i$ is 1.

Logistic regression and parameter tuning are implemented with the scikit-learn library, and the most suitable regularization coefficient $\lambda$ is selected by grid search and k-fold cross-validation.
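A minimal sketch of steps S204-S205 with scikit-learn might look as follows; the synthetic data, grid values and fold count are assumptions of this sketch (note that scikit-learn parameterizes L2 strength through C, the inverse of λ).

```python
# A minimal sketch of steps S204-S205: L2-regularized logistic regression,
# with lambda chosen by grid search + k-fold cross-validation.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# X would be the TF-IDF matrix from step S203 and y the page labels
# (1 = data webpage, 0 = non-data webpage); synthetic stand-ins are used
# here so the sketch runs on its own.
rng = np.random.default_rng(0)
X = rng.random((40, 20))
y = rng.integers(0, 2, 40)

param_grid = {"C": [0.01, 0.1, 1.0, 10.0]}   # C = 1 / lambda in scikit-learn
search = GridSearchCV(
    LogisticRegression(penalty="l2", max_iter=1000),
    param_grid,
    cv=5,                                    # k-fold CV, k = 5 assumed
)
search.fit(X, y)
probs = search.predict_proba(X)[:, 1]        # P(D | y) for each page
labels = (probs >= 0.5).astype(int)          # 0.5 threshold as in step S204
```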
Step S3: acquiring webpage HTML information labeled as 'data webpage' after the classification of the first stage is finished; in step S3, the object of content crawling is a web page that needs to be further "subdivided" after being classified in the first stage, and the web page is clicked to obtain specific content, and the HTML information is refined into long text information.
Step S4: developing the second-stage classification based on the webpage long text and website information, adopting a deep metric learning algorithm that combines a text convolutional neural network with a triplet loss, as shown in fig. 3.
Specifically, the step S4 includes:
Step S401: acquiring training data; the input data has two dimensions: the long content text and the website link of the webpages in the first-stage classification result.
Step S402: text preprocessing, including word segmentation and word filtering stop; the long text is consistent with the pretreatment method in the first stage, and the website characteristics are characterized by using ',': ', '/' etc. divide each link into protocol, host name, path, file name and parameter parts to highlight the sequential relationship between the different parts; and when the stop words are filtered, important structural symbols are reserved, and the messy code symbols which are too long and have no practical significance are deleted.
Step S403: fixing the text length and the website link length corresponding to each webpage; a fixed text length and a website link length are taken, are larger than the length and are truncated, and are smaller than the length, and are filled with < pad >, which is an artificially specified filling symbol and can be replaced as long as the fixed text length and the website link length are unified in the model.
Step S404: training respective word embedding models of the long text data and the website link data; the model selects a continuous Skip-gram (Skip-gram), and the characteristic length of each vocabulary is 300; continuous jumping element grammar model method using central word C (w) To predict the probability of occurrence of a contextual word, i.e., P (C) (i) |C (w) ) Wherein: w-k is less than or equal to i and less than or equal to w + k, and w is not equal to k; in the formula, C (i) Words representing the surroundings of the central word; c (w) Represents a central word; to reduce the amount of computation and the degree of attention to rare words, the vocabulary is reduced in dimension: reserve part of high-frequency vocabulary and use uniformly<unk>The low-frequency vocabulary is replaced,<unk>all in one<pad>The same are artificially defined symbols, which can be replaced. And respectively carrying out vectorization operation on the long text and the website link data by using the trained word embedding model to obtain a vocabulary matrix as the input of the text convolution neural network model.
Step S405: the vocabulary matrixes of the long text and the website links pass through the text convolution neural network model respectively, and are finally spliced together in a full connection layer, and finally a feature vector with a fixed length is output. The text convolutional neural network comprises a convolutional layer, a pooling layer and a random deactivation layer. Suppose there is a training set T ═ T 1 ,t 2 ,...,t N And N is the number of samples of the training set. For each of the training samples,
Figure BDA0003735557710000101
wherein x is i ∈R k Is a k-dimensional word vector for the ith word in the sample document,
Figure BDA0003735557710000102
is a connector, indicates that the document is formed by word concatenation, and specifies x i:i+j X of reference i ,x i+1 ,...,x i+j And (4) connecting. A convolution operation involves a convolution kernel w ∈ R hk The length of the convolution kernel is consistent with the dimension of the word vector, the width is h, and a sliding window covering h words is represented. E.g. feature c i Is formed by the word x in the window i:i+h-1 Obtaining:
c i =f(w·x i:i+h-1 +b)
where b is a bias term and f is a nonlinear activation function. This convolution kernel works on all words in the document to obtain a feature map:
c=[c 1 ,c 2 ,...,c n-h+1 ]
extracting the maximum value by using maximum pooling operation:
Figure BDA0003735557710000103
random deactivation is to prevent the model from overfitting, and cells in the hidden layer are randomly drawn with their weights set to 0. In the penultimate layer
Figure BDA0003735557710000104
Where m is the number of convolution kernels, the output is calculated using the following equation:
Figure BDA0003735557710000105
wherein
Figure BDA0003735557710000106
For an elemental multiplier, r is obtained from the bernoulli function.
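A minimal PyTorch sketch of this two-branch network is given below; the kernel widths, kernel count, dropout rate and output dimension are assumptions, since the embodiment fixes only the 300-dimensional embeddings.

```python
# A minimal sketch of step S405: two text-CNN branches (long text + link
# tokens), convolution + max pooling over time + dropout, then a fully
# connected layer that splices both branches into a fixed-length vector.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextCNNBranch(nn.Module):
    def __init__(self, vocab_size, emb_dim=300, n_kernels=100, widths=(3, 4, 5)):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        # each kernel spans h words by the full embedding dimension
        self.convs = nn.ModuleList(
            [nn.Conv2d(1, n_kernels, (h, emb_dim)) for h in widths])

    def forward(self, x):                      # x: (batch, seq_len) token ids
        e = self.emb(x).unsqueeze(1)           # (batch, 1, seq_len, emb_dim)
        feats = [F.relu(c(e)).squeeze(3) for c in self.convs]
        # max pooling over time keeps the strongest feature per kernel
        pooled = [F.max_pool1d(f, f.size(2)).squeeze(2) for f in feats]
        return torch.cat(pooled, dim=1)        # (batch, n_kernels * len(widths))

class TwoBranchTextCNN(nn.Module):
    def __init__(self, text_vocab, url_vocab, out_dim=128):
        super().__init__()
        self.text_branch = TextCNNBranch(text_vocab)
        self.url_branch = TextCNNBranch(url_vocab)
        self.dropout = nn.Dropout(0.5)          # the dropout layer
        self.fc = nn.Linear(2 * 300, out_dim)   # 2 branches x (100 kernels x 3 widths)

    def forward(self, text_ids, url_ids):
        z = torch.cat([self.text_branch(text_ids),
                       self.url_branch(url_ids)], dim=1)
        return self.fc(self.dropout(z))         # fixed-length feature vector

# quick shape check with random token ids (hypothetical vocabulary sizes)
net = TwoBranchTextCNN(text_vocab=5000, url_vocab=2000)
text_ids = torch.randint(0, 5000, (4, 50))      # batch of 4, 50 tokens each
url_ids = torch.randint(0, 2000, (4, 20))
print(net(text_ids, url_ids).shape)             # torch.Size([4, 128])
```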
Step S406: training the weight parameters of the text convolutional neural network with the triplet loss to obtain the final model; the triplet loss function is shown by the following equation:

$$L_{triplet} = \max\left( d(X_a, X_b) - d(X_a, X_n) + m,\; 0 \right)$$

in the formula, $X_a$, $X_b$, $X_n$ are the anchor sample, positive sample and negative sample respectively; $m$ is the margin; $d(X_a, X_b) = \lVert f(X_a) - f(X_b) \rVert_2$ is the Euclidean distance between the anchor sample and the positive sample; and $d(X_a, X_n) = \lVert f(X_a) - f(X_n) \rVert_2$ is the Euclidean distance between the anchor sample and the negative sample, where $f(\cdot)$ denotes the feature vector output by the network. The training objective of this loss function is to pull the positive sample closer to the anchor and push the negative sample away from it, so that the distance between inter-class sample pairs exceeds the distance between intra-class sample pairs by at least one margin.
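In PyTorch this corresponds to the built-in triplet margin loss; a minimal sketch follows, with the margin value and the stand-in feature tensors as assumptions.

```python
# A minimal sketch of step S406 with PyTorch's nn.TripletMarginLoss
# (Euclidean distance, p=2), matching the formula above; m = 1.0 assumed.
import torch
import torch.nn as nn

triplet_loss = nn.TripletMarginLoss(margin=1.0, p=2)

# Stand-ins for feature vectors produced by the two-branch text CNN above:
# one mini-batch of 8 triplets with 128-dimensional features.
anchor   = torch.randn(8, 128, requires_grad=True)  # anchor pages
positive = torch.randn(8, 128)                      # same class as the anchor
negative = torch.randn(8, 128)                      # the other class
loss = triplet_loss(anchor, positive, negative)
loss.backward()   # in real training, gradients flow into the CNN weights
```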
Step S407: when new data enter the model, a feature vector is obtained through the trained text convolution neural network model, the feature vector is projected into a feature space where the training data are located, the category of the webpage is judged based on a clustering algorithm (such as a K-nearest neighbor method), and a final prediction result is output.
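A minimal sketch of this prediction step with a K-nearest-neighbor rule (the example the embodiment names) might look as follows; the stand-in feature arrays represent outputs of the trained network.

```python
# A minimal sketch of step S407: project new pages into the learned feature
# space and label them by their nearest training neighbors.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Stand-ins for CNN feature vectors of the training pages and one new page.
rng = np.random.default_rng(0)
train_features = rng.random((100, 128))
train_labels = rng.integers(0, 2, 100)     # 1 = data page, 0 = non-data page
new_features = rng.random((1, 128))

knn = KNeighborsClassifier(n_neighbors=5)  # k = 5 is an assumed value
knn.fit(train_features, train_labels)
pred = knn.predict(new_features)           # predicted category of the new page
```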
Step S5: warehousing and sorting the final classification results, parsing the necessary information in the web pages, and displaying it on the data portal website. After the final classification result is obtained in step S5, it is warehoused and organized: the classification results are stored in a dedicated database table and displayed on the portal website. When the webpage content is crawled a second time, the page's language information, IP location information, and a webpage screenshot are acquired in addition to the text and page-structure information, so that the webpage content can be displayed in various forms on the portal website.
The embodiment of the invention provides a two-stage academic data webpage classification method and system based on multi-dimensional feature fusion, establishing a brand-new method for discovering academic data web pages on the open network; the method is easy to popularize and can help experts quickly and accurately screen data web pages from the internet.
Those skilled in the art will appreciate that, in addition to implementing the system and its various devices, modules, units provided by the present invention as pure computer readable program code, the system and its various devices, modules, units provided by the present invention can be fully implemented by logically programming method steps in the form of logic gates, switches, application specific integrated circuits, programmable logic controllers, embedded microcontrollers and the like. Therefore, the system and various devices, modules and units thereof provided by the present invention can be regarded as a hardware component, and the devices, modules and units included therein for implementing various functions can also be regarded as structures within the hardware component; means, modules, units for performing the various functions may also be regarded as structures within both software modules and hardware components for performing the method.
The foregoing description of specific embodiments of the present invention has been presented. It is to be understood that the present invention is not limited to the specific embodiments described above, and that various changes or modifications may be made by one skilled in the art within the scope of the appended claims without departing from the spirit of the invention. The embodiments and features of the embodiments of the present application may be combined with each other arbitrarily without conflict.

Claims (10)

1. A two-stage academic data webpage classification method based on multi-dimensional feature fusion is characterized by comprising the following steps:
step S1: inputting a search engine for searching based on academic keywords to obtain search page content;
step S2: developing a first-stage classification based on a short text logistic regression model;
step S3: acquiring webpage HTML information labeled as a data webpage after the classification of the first stage is finished;
step S4: developing a second-stage classification based on the webpage long text and website information, adopting a deep metric learning algorithm that combines a text convolutional neural network with a triplet loss;
step S5: warehousing and sorting the final classification results, parsing the necessary information in the webpage, and displaying it on a data portal website.
2. The two-stage academic data web page classification method based on multi-dimensional feature fusion according to claim 1, wherein the step S1 includes:
step S101: cleaning and expanding the keywords; the system takes as input the academic keywords corresponding to the data webpages to be searched; if a plurality of keywords for a subject are provided at one time, the preliminary cleaning and expansion tasks are completed first, including screening out duplicates, splitting overly long keywords, and adding missing Chinese-English counterparts;
step S102: if the keyword is Chinese, inputting retrieval information related to the keyword in a search engine, and acquiring returned webpage content.
3. The two-stage academic data webpage classification method based on multi-dimensional feature fusion according to claim 1, wherein the step S2 comprises:
step S201: acquiring training data, extracting descriptive texts and title field information of webpages to be classified from Google retrieval content crawling tables of a database, and splicing the descriptive texts and the title field information into short texts;
step S202: text preprocessing, namely performing word segmentation on Chinese and English texts and removing stop words;
step S203: vectorizing the text with the term frequency-inverse document frequency (TF-IDF) method; the normalized frequency $tf_{i,j}$ of word $w_i$ in document $t_j$ is calculated as

$$tf_{i,j} = \frac{n_{i,j}}{\sum_{k} n_{k,j}}$$

wherein $tf_{i,j}$ denotes the normalized word frequency; $i$ is the word index and $j$ the document index; $n_{i,j}$ is the number of occurrences of word $w_i$ in document $t_j$; and the denominator, which traverses all words $k$ of document $t_j$, is the total number of words in that document, used as the normalizing value; the inverse document frequency is obtained by dividing the total number of documents by the number of documents containing the word and taking the logarithm of the quotient:

$$idf_i = \log \frac{|T|}{|\{j : w_i \in t_j\}|}$$

wherein $idf_i$ denotes the inverse document frequency corresponding to word $w_i$; $|T|$ is the total number of documents in the document set; $|\{j : w_i \in t_j\}|$ denotes the number of documents in which the word $w_i$ appears (i.e. $n_{i,j} \neq 0$); and $t_j$ denotes a document containing the word $w_i$; the final TF-IDF score is the product of the two factors TF and IDF:

$$\text{TF-IDF} = TF \times IDF$$

wherein TF denotes the term frequency and IDF denotes the inverse document frequency.
Step S204: training a logistic regression model with the vectorized data; define $W$ as the set of web pages to be classified, $W = (t_1, w_1; t_2, w_2; \ldots; t_n, w_n)$, where $t_i$ $(i = 1, 2, \ldots, n)$ is a feature term and $w_i$ is the weight of $t_i$; $D$ denotes a data webpage and $N$ denotes a non-data webpage; the probability of being a data webpage is judged from the webpage features:

$$P(D \mid y) = \frac{1}{1 + e^{-\theta^{T} x}}$$

wherein $x = (x_1, x_2, \ldots, x_n)$ is the vector composed of all the features of the web page, $\theta$ is the weight vector corresponding to the webpage feature vector, and $y$ denotes the web page to be judged; similarly, the probability that a certain web page is a non-data webpage is:

$$P(N \mid y) = 1 - P(D \mid y) = \frac{e^{-\theta^{T} x}}{1 + e^{-\theta^{T} x}}$$

after the prediction probability is obtained, a threshold is defined to judge whether the webpage belongs to the data webpages; with the threshold set to 0.5, the webpage classification function is:

$$f(y) = \begin{cases} D, & P(D \mid y) \geq 0.5 \\ N, & P(D \mid y) < 0.5 \end{cases}$$
step S205: optimizing the logistic regression model parameters by means of regularization; regularization avoids model overfitting by adding a penalty term to the original loss function, implemented by appending to the loss a multiple of a norm of the parameter vector; adding an L2 regularization term yields the new loss function $L(\theta)$:

$$L(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y_i \log h_\theta(x_i) + (1 - y_i) \log\bigl(1 - h_\theta(x_i)\bigr) \right] + \frac{\lambda}{2m} \sum_{j=1}^{n} \theta_j^2$$

wherein $L(\theta)$ denotes the loss as a function of the model parameters $\theta$; $m$ denotes the total number of web pages to be judged; $y_i$ denotes the category label of the $i$-th webpage; $\sum_{j} \theta_j^2$ is the squared-parameter penalty term; and $h_\theta(x_i)$ denotes the probability, calculated with the selected parameters, that the output variable for the $i$-th feature vector $x_i$ is 1;

and realizing logistic regression and parameter tuning with the scikit-learn library, and selecting the most suitable regularization coefficient $\lambda$ by grid search and k-fold cross-validation.
4. The two-stage academic data webpage classification method based on multidimensional feature fusion as claimed in claim 1, wherein the object crawled in step S3 is a webpage which needs to be further subdivided after the first-stage classification, and specific content is obtained by clicking inside the webpage, and HTML information is refined into long text information.
5. The two-stage academic data web page classification method based on multi-dimensional feature fusion according to claim 1, wherein the step S4 includes:
step S401: acquiring training data; the input data has two dimensions: the content long text and website link of the webpage of the first stage classification result;
step S402: text preprocessing, including word segmentation and stop-word filtering; the long text follows the same preprocessing method as the first stage, and the website link features are divided, according to the structural characteristics of the link, into protocol, host name, path, file name and parameter parts so as to highlight the sequential relation among the different parts; important structural symbols are kept when filtering stop words, and overly long garbled symbols with no practical meaning are deleted;

step S403: fixing the text length and website link length corresponding to each webpage; a fixed text length and website link length are taken, sequences longer than the fixed length are truncated, and shorter ones are padded with <pad>;

step S404: training separate word embedding models on the long-text data and the website link data; the continuous skip-gram model is selected, with a feature length of 300 for each vocabulary item; the skip-gram model uses the center word $C^{(w)}$ to predict the occurrence probability of the context words, i.e. $P(C^{(i)} \mid C^{(w)})$, where $w-k \leq i \leq w+k$ and $i \neq w$; in the formula, $C^{(i)}$ denotes a word surrounding the center word and $C^{(w)}$ denotes the center word;

to reduce the amount of computation and the attention paid to rare words, the vocabulary dimension is reduced: part of the high-frequency vocabulary is kept and the low-frequency vocabulary is uniformly replaced by <unk>; using the trained word embedding models, the long text and website link data are vectorized separately, and the resulting vocabulary matrices serve as the input of the text convolutional neural network model;

step S405: the vocabulary matrices of the long text and the website links each pass through the text convolutional neural network model and are spliced together in a fully connected layer, which outputs a feature vector of fixed length;
step S406: training the weight parameters of the text convolutional neural network with the triplet loss to obtain the final model; the triplet loss function is shown by the following equation:

$$L_{triplet} = \max\left( d(X_a, X_b) - d(X_a, X_n) + m,\; 0 \right)$$

in the formula, $X_a$, $X_b$, $X_n$ are the anchor sample, positive sample and negative sample respectively; $m$ is the margin; $d(X_a, X_b) = \lVert f(X_a) - f(X_b) \rVert_2$ is the Euclidean distance between the anchor sample and the positive sample; and $d(X_a, X_n) = \lVert f(X_a) - f(X_n) \rVert_2$ is the Euclidean distance between the anchor sample and the negative sample, where $f(\cdot)$ denotes the feature vector output by the network;
step S407: when new data enter the model, a feature vector is obtained through the trained text convolution neural network model, the feature vector is projected into a feature space where the training data are located, the category of the webpage is judged based on a clustering algorithm, and a final prediction result is output.
6. The two-stage academic data webpage classification method based on multi-dimensional feature fusion according to claim 1, wherein the final classification result obtained in the step S5 needs to be put in storage and arranged, stored in a database specific table, and displayed on a web portal; and when the webpage content is crawled for the second time, besides the text and the webpage structure information, the language information, the IP position information and the webpage screenshot of the webpage are also acquired.
7. A two-stage academic data webpage classification system based on multi-dimensional feature fusion is characterized by comprising:
module M1: inputting a search engine for searching based on academic keywords to obtain search page content;
module M2: developing a first-stage classification based on a short text logistic regression model;
module M3: acquiring webpage HTML information labeled as a data webpage after the classification of the first stage is finished;
module M4: developing a second-stage classification based on the webpage long text and website information, adopting a deep metric learning algorithm that combines a text convolutional neural network with a triplet loss;
module M5: warehousing and sorting the final classification results, parsing the necessary information in the webpage, and displaying it on a data portal website.
8. The two-stage academic data web page classification system based on multi-dimensional feature fusion of claim 7, wherein the module M1 includes:
module M101: cleaning and expanding the keywords; the system takes as input the academic keywords corresponding to the data webpages to be searched; if a plurality of keywords for a subject are provided at one time, the preliminary cleaning and expansion tasks are completed first, including screening out duplicates, splitting overly long keywords, and adding missing Chinese-English counterparts;
the module M102: if the keyword is Chinese, inputting retrieval information related to the keyword in a search engine, and acquiring returned webpage content.
9. The two-stage academic data web page classification system based on multi-dimensional feature fusion of claim 7, wherein the module M2 includes:
module M201: acquiring training data, extracting descriptive texts and title field information of webpages to be classified from Google retrieval content crawling tables of a database, and splicing the descriptive texts and the title field information into short texts;
the module M202: text preprocessing, namely performing word segmentation on Chinese and English texts and removing stop words;
module M203: vectorizing the text with the term frequency-inverse document frequency (TF-IDF) method; the normalized frequency $tf_{i,j}$ of word $w_i$ in document $t_j$ is calculated as

$$tf_{i,j} = \frac{n_{i,j}}{\sum_{k} n_{k,j}}$$

wherein $tf_{i,j}$ denotes the normalized word frequency; $i$ is the word index and $j$ the document index; $n_{i,j}$ is the number of occurrences of word $w_i$ in document $t_j$; and the denominator, which traverses all words $k$ of document $t_j$, is the total number of words in that document, used as the normalizing value; the inverse document frequency is obtained by dividing the total number of documents by the number of documents containing the word and taking the logarithm of the quotient:

$$idf_i = \log \frac{|T|}{|\{j : w_i \in t_j\}|}$$

wherein $idf_i$ denotes the inverse document frequency corresponding to word $w_i$; $|T|$ is the total number of documents in the document set; $|\{j : w_i \in t_j\}|$ denotes the number of documents in which the word $w_i$ appears (i.e. $n_{i,j} \neq 0$); and $t_j$ denotes a document containing the word $w_i$; the final TF-IDF score is the product of the two factors TF and IDF:

$$\text{TF-IDF} = TF \times IDF$$

wherein TF denotes the term frequency and IDF denotes the inverse document frequency.
Module M204: training a logistic regression model with the vectorized data; define $W$ as the set of web pages to be classified, $W = (t_1, w_1; t_2, w_2; \ldots; t_n, w_n)$, where $t_i$ $(i = 1, 2, \ldots, n)$ is a feature term and $w_i$ is the weight of $t_i$; $D$ denotes a data webpage and $N$ denotes a non-data webpage; the probability of being a data webpage is judged from the webpage features:

$$P(D \mid y) = \frac{1}{1 + e^{-\theta^{T} x}}$$

wherein $x = (x_1, x_2, \ldots, x_n)$ is the vector composed of all the features of the web page, $\theta$ is the weight vector corresponding to the webpage feature vector, and $y$ denotes the web page to be judged; similarly, the probability that a certain web page is a non-data webpage is:

$$P(N \mid y) = 1 - P(D \mid y) = \frac{e^{-\theta^{T} x}}{1 + e^{-\theta^{T} x}}$$

after the prediction probability is obtained, a threshold is defined to judge whether the webpage belongs to the data webpages; with the threshold set to 0.5, the webpage classification function is:

$$f(y) = \begin{cases} D, & P(D \mid y) \geq 0.5 \\ N, & P(D \mid y) < 0.5 \end{cases}$$
module M205: optimizing the logistic regression model parameters by means of regularization; regularization avoids model overfitting by adding a penalty term to the original loss function, implemented by appending to the loss a multiple of a norm of the parameter vector; adding an L2 regularization term yields the new loss function $L(\theta)$:

$$L(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y_i \log h_\theta(x_i) + (1 - y_i) \log\bigl(1 - h_\theta(x_i)\bigr) \right] + \frac{\lambda}{2m} \sum_{j=1}^{n} \theta_j^2$$

wherein $L(\theta)$ denotes the loss as a function of the model parameters $\theta$; $m$ denotes the total number of web pages to be judged; $y_i$ denotes the category label of the $i$-th webpage; $\sum_{j} \theta_j^2$ is the squared-parameter penalty term; and $h_\theta(x_i)$ denotes the probability, calculated with the selected parameters, that the output variable for the $i$-th feature vector $x_i$ is 1;

and realizing logistic regression and parameter tuning with the scikit-learn library, and selecting the most suitable regularization coefficient $\lambda$ by grid search and k-fold cross-validation.
10. The two-stage academic data webpage classification system based on multi-dimensional feature fusion of claim 7, wherein the object of content crawling in the module M3 is a webpage which needs to be further subdivided after the first-stage classification, and specific content is obtained by clicking inside the webpage to refine HTML information into long text information.
CN202210795308.4A 2022-07-07 2022-07-07 Two-stage academic data webpage classification method and system based on multi-dimensional feature fusion Pending CN115130601A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210795308.4A CN115130601A (en) 2022-07-07 2022-07-07 Two-stage academic data webpage classification method and system based on multi-dimensional feature fusion


Publications (1)

Publication Number Publication Date
CN115130601A true CN115130601A (en) 2022-09-30

Family

ID=83381906

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210795308.4A Pending CN115130601A (en) 2022-07-07 2022-07-07 Two-stage academic data webpage classification method and system based on multi-dimensional feature fusion

Country Status (1)

Country Link
CN (1) CN115130601A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117010283A (en) * 2023-10-07 2023-11-07 中国矿业大学(北京) Method and system for predicting deformation of steel pipe column structure of PBA station
CN117010283B (en) * 2023-10-07 2023-12-29 中国矿业大学(北京) Method and system for predicting deformation of steel pipe column structure of PBA station
CN117521673A (en) * 2024-01-08 2024-02-06 安徽大学 Natural language processing system with analysis training performance
CN117521673B (en) * 2024-01-08 2024-03-22 安徽大学 Natural language processing system with analysis training performance

Similar Documents

Publication Publication Date Title
US10146862B2 (en) Context-based metadata generation and automatic annotation of electronic media in a computer network
CN110929038B (en) Knowledge graph-based entity linking method, device, equipment and storage medium
US8095539B2 (en) Taxonomy-based object classification
CN113822067A (en) Key information extraction method and device, computer equipment and storage medium
CN111291210B (en) Image material library generation method, image material recommendation method and related devices
WO2020237856A1 (en) Smart question and answer method and apparatus based on knowledge graph, and computer storage medium
CN103488724A (en) Book-oriented reading field knowledge map construction method
CN115130601A (en) Two-stage academic data webpage classification method and system based on multi-dimensional feature fusion
Bernardini et al. Full-subtopic retrieval with keyphrase-based search results clustering
WO2014054052A2 (en) Context based co-operative learning system and method for representing thematic relationships
CN110737839A (en) Short text recommendation method, device, medium and electronic equipment
CN110633366A (en) Short text classification method, device and storage medium
Du et al. An approach for selecting seed URLs of focused crawler based on user-interest ontology
CN111325030A (en) Text label construction method and device, computer equipment and storage medium
CN110688405A (en) Expert recommendation method, device, terminal and medium based on artificial intelligence
CN113378573A (en) Content big data oriented small sample relation extraction method and device
CN113569118B (en) Self-media pushing method, device, computer equipment and storage medium
WO2023246849A1 (en) Feedback data graph generation method and refrigerator
CN109783650B (en) Chinese network encyclopedia knowledge denoising method, system and knowledge base
CN116662566A (en) Heterogeneous information network link prediction method based on contrast learning mechanism
CN111382333A (en) Case element extraction method in news text sentence based on case correlation joint learning and graph convolution
US9195940B2 (en) Jabba-type override for correcting or improving output of a model
CN113516202A (en) Webpage accurate classification method for CBL feature extraction and denoising
CN113468410A (en) System for intelligently optimizing search results and search engine
CN115391479A (en) Ranking method, device, electronic medium and storage medium for document search

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination