CN116628377A

CN116628377A - Webpage theme relevance judging method

Info

Publication number: CN116628377A
Application number: CN202310049639.8A
Authority: CN
Inventors: 李涛; 段翰聪; 李林; 王书涵; 陈铎汝; 邹涛; 李阳; 李�浩
Original assignee: University of Electronic Science and Technology of China
Current assignee: University of Electronic Science and Technology of China
Priority date: 2023-02-01
Filing date: 2023-02-01
Publication date: 2023-08-22

Abstract

A method for judging the topic relevance of a webpage comprises the following steps: step 1, training a word vector model; step 2, setting subject words and constructing a user subject word set; step 3, removing the HTML tags from the webpage to be evaluated to obtain a document containing only the title and body text; step 4, extracting keywords from the document and constructing the keyword set of the webpage to be evaluated; step 5, generating word vectors; step 6, for each keyword vector of the webpage to be evaluated, calculating its cosine distance to every word vector in the user subject word vector set and selecting the maximum value; step 7, averaging these maximum values over all keywords of the webpage to be evaluated and taking the average as the topic relevance of the webpage; and step 8, setting a topic relevance threshold and judging whether the webpage is relevant to the topic. The method represents words with a pre-trained word vector model, so the topic of a webpage can be judged by calculating only a small number of word vector cosine distances, which improves the speed of judging the topic relevance of a single webpage.

Description

Webpage theme relevance judging method

Technical Field

The invention belongs to the technical field of computer software, and particularly relates to a webpage topic relevance judging method.

Background

Traditional general-purpose search engines provide information search services for ordinary Internet users, but they can only fuzzily match user needs, and most semantically related webpages are missing from the returned results, so they struggle to meet the more focused and in-depth information needs of a specific field or a specific user. Although researchers have done a great deal of work on webpage topic relevance, for application tasks in specific fields that place high demands on both the accuracy and the computational cost of topic relevance judgment, current methods still leave room for improvement. Improving the accuracy of webpage topic relevance judgment while reducing its computational cost is therefore a key technical problem that urgently needs to be solved.

The Vector Space Model (VSM) is commonly adopted in current webpage relevance calculation. It generally comprises stages such as preprocessing, feature extraction and representation, vector space construction, and cosine similarity calculation. Bag-of-Words is a classical vector space model: it uses words as feature items for the vector representation, uses the TF-IDF value of each word as its feature weight, and finally calculates the cosine distance as the similarity.

Judging webpage relevance with the vector space model has two main problems: first, the relevance calculation must go through several stages, such as preprocessing, feature extraction and representation, vector space construction, and cosine similarity calculation, so the computational cost is relatively high; second, the order of the words is lost and the semantic relatedness between words is ignored, which reduces the accuracy of webpage topic relevance judgment.
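The second problem can be made concrete with a small sketch. It is a simplification for illustration only: it uses raw term counts instead of TF-IDF weights, and the function name `bow_cosine` and the sample token lists are hypothetical.

```python
import math
from collections import Counter

def bow_cosine(doc_a, doc_b):
    """Cosine similarity between two token lists using raw term counts.

    Simplification: TF-IDF weighting is replaced by plain counts.
    """
    ca, cb = Counter(doc_a), Counter(doc_b)
    dot = sum(ca[w] * cb[w] for w in ca.keys() & cb.keys())
    norm_a = math.sqrt(sum(c * c for c in ca.values()))
    norm_b = math.sqrt(sum(c * c for c in cb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Word order is lost: both token lists map to the same count vector.
same_order = bow_cosine(["dog", "bites", "man"], ["man", "bites", "dog"])

# Semantic relatedness is ignored: no shared word means zero similarity.
synonyms = bow_cosine(["car", "repair"], ["automobile", "maintenance"])
```

Here `same_order` comes out as 1 (up to floating-point rounding) although the two sentences mean different things, and `synonyms` is 0 although the two phrases are closely related in meaning.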

Disclosure of Invention

In order to overcome the defects in the prior art, the invention discloses a webpage topic relevance judging method.

The invention discloses a method for judging the topic relevance of a webpage, comprising the following steps:

Step 1, training a word vector model;

Step 2, setting n subject words $t_1, t_2, t_3, \ldots, t_n$ and constructing the user subject word set topic_set = $\{t_1, t_2, t_3, \ldots, t_n\}$;

Step 3, removing the HTML tags from the page of the webpage to be evaluated to obtain a document containing only the title and body text;

Step 4, extracting keywords from the document and constructing the keyword set of the webpage to be evaluated, page_set = $\{p_1, p_2, p_3, \ldots, p_m\}$, where $p_1, p_2, p_3, \ldots, p_m$ are the m extracted keywords;

Step 5, generating word vectors: using the word vector model obtained in step 1, each word in the user subject word set topic_set and in the webpage keyword set page_set is represented as a word vector, giving the user subject word vector set $V_{topic} = \{vt_1, vt_2, vt_3, \ldots, vt_n\}$ and the keyword vector set of the webpage to be evaluated $V_{page} = \{vp_1, vp_2, vp_3, \ldots, vp_m\}$;

Step 6, for each word vector $vp_j$ in the keyword vector set $V_{page}$ of the webpage to be evaluated, calculating in turn its cosine distance to each word vector $vt_i$ in the user subject word vector set $V_{topic}$, and then selecting the maximum of these cosine distances as the topic relevance $\mathrm{Similar}(vp_j, V_{topic})$ of the j-th keyword $p_j$; the calculation formula is as follows:

$$\mathrm{Similar}(vp_j, V_{topic}) = \max_{1 \le i \le n} cs(vp_j, vt_i)$$

where max() takes the maximum value and cs() computes the cosine distance between its two arguments;

the cosine distance between input vectors u and v is calculated by the formula

$$cs(u, v) = \cos\theta = \frac{u \cdot v}{\|u\|\,\|v\|} = \frac{\sum_{i=1}^{n} u_i v_i}{\sqrt{\sum_{i=1}^{n} u_i^2}\,\sqrt{\sum_{i=1}^{n} v_i^2}}$$

where θ is the angle between u and v, $\|u\|$ is the L2 norm of the vector u, $\|v\|$ is the L2 norm of the vector v, and $u_i$, $v_i$ are the elements of the vectors u and v, both of which are n-dimensional;

Step 7, after obtaining the topic relevance $\mathrm{Similar}(vp_j, V_{topic})$ of each keyword of the webpage to be evaluated, calculating the average of $\mathrm{Similar}(vp_j, V_{topic})$ over all keywords of the webpage, which is the topic relevance $\mathrm{Similar}(page, topic)$ of the webpage to be evaluated;

the calculation formula is as follows:

$$\mathrm{Similar}(page, topic) = \frac{1}{m}\sum_{j=1}^{m} \mathrm{Similar}(vp_j, V_{topic}) = \frac{1}{m}\sum_{j=1}^{m} \max_{1 \le i \le n} cs(vp_j, vt_i)$$

where m is the total number of keywords and n is the total number of subject words;

Step 8, setting a topic relevance threshold S; if the topic relevance $\mathrm{Similar}(page, topic)$ calculated in step 7 reaches the threshold S, the webpage to be evaluated is judged to be topic-relevant; otherwise, it is judged to be topic-irrelevant.
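As a small worked instance of steps 6 to 8, assume purely for illustration m = 2 keywords, n = 2 subject words, and the following hypothetical cosine values (none of these numbers come from the patent):

$$cs(vp_1, vt_1) = 0.9,\quad cs(vp_1, vt_2) = 0.4,\quad cs(vp_2, vt_1) = 0.3,\quad cs(vp_2, vt_2) = 0.5$$

Then

$$\begin{aligned}
\mathrm{Similar}(vp_1, V_{topic}) &= \max(0.9,\ 0.4) = 0.9 \\
\mathrm{Similar}(vp_2, V_{topic}) &= \max(0.3,\ 0.5) = 0.5 \\
\mathrm{Similar}(page, topic) &= \tfrac{1}{2}\,(0.9 + 0.5) = 0.7
\end{aligned}$$

With an assumed threshold S = 0.6, the value 0.7 reaches S, so this page would be judged topic-relevant.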

Preferably, in step 1, the Word2Vec model is used to train the word vector model.

Preferably, in step 4, the TextRank algorithm is used to extract the keywords.

The invention provides a webpage topic relevance judging method that fuses the Word2Vec model with the TextRank algorithm to improve the recognition effect, and has the following beneficial effects:

1. Words are represented with a pre-trained word vector model, so the topic of a webpage can be judged by calculating only a small number of word vector cosine distances, which improves the speed of judging the topic relevance of a single webpage.

2. The keywords of the webpage to be evaluated are extracted with the TextRank algorithm and represented with Word2Vec word vectors, so the semantic relationship between the subject words and the webpage is taken into account.

3. The method can rapidly identify the topic of a document through keyword comparison and can be applied to fields such as topical web crawlers, public opinion monitoring, spam classification and identification, machine translation, and automatic question-answering systems, so its range of application is wide.

Drawings

Fig. 1 is a flowchart of a specific embodiment of a method for determining relevance of a web page theme according to the present invention.

Detailed Description

The following describes the present invention in further detail.

The invention provides a webpage topic relevance judging method. Its basic idea is to use a word2vec model to calculate the word vector cosine distances between the keywords extracted from the webpage to be evaluated and the subject words set by the user, and thereby to analyze the degree of relevance between those keywords and the subject words.

The invention aims to provide an effective method for judging the relevance of a single webpage to a topic, which not only helps people find webpages related to a specific topic in massive Internet data, but can also be widely applied to fields such as topical web crawlers, public opinion monitoring, spam classification and identification, machine translation, and automatic question-answering systems.

First, a word2vec model is trained on a large-scale corpus to obtain a word vector model; then the keywords of the webpage to be evaluated are extracted with the TextRank algorithm; finally, the keywords extracted from the webpage and the subject words set by the user are represented as word vectors using the pre-trained word vector model, and their cosine distances are calculated to analyze the degree of relevance between the keywords and the subject words. The specific steps are as follows:

Step 1, training a word vector model. The training corpus can be the Chinese corpus of Wikipedia, and word2vec is used to train a K-dimensional word vector model based on the Skip-gram model with the Hierarchical Softmax technique.

The Word2Vec model is a three-layer neural network language model consisting of an input layer (Input), a projection layer (Projection), and an output layer (Output). The projection layer supports two training schemes, Skip-gram and the continuous bag-of-words model (CBOW). The output layer can use two training techniques, Hierarchical Softmax and Negative Sampling, to accelerate convergence, so that words can be quickly converted into vector form for subsequent processing.
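Skip-gram trains the model to predict the words surrounding a given center word. The sketch below shows only how such (center, context) training pairs could be generated from a tokenized sentence; it does not train anything. The function name `skipgram_pairs`, the window size, and the sample sentence are all illustrative assumptions, not details from the patent.

```python
def skipgram_pairs(tokens, window=2):
    """Return (center, context) pairs for every token and its neighbours
    within `window` positions on either side."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs

# Hypothetical pre-segmented sentence.
sentence = ["web", "page", "topic", "relevance"]
pairs = skipgram_pairs(sentence, window=1)
```

With `window=1`, each word is paired only with its immediate neighbours, so the four-word sentence yields six pairs. A full training run would feed pairs like these, drawn from the whole corpus, into the network described above.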

Step 2, setting the user subject word set. The user inputs n subject words $t_1, t_2, t_3, \ldots, t_n$ as needed, and the user subject word set is constructed as topic_set = $\{t_1, t_2, t_3, \ldots, t_n\}$.

Step 3, preprocessing the webpage to be evaluated. The HTML tags in the page of the webpage to be evaluated are removed to obtain a document D containing only the title and body text.
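One way to sketch this preprocessing is with Python's standard `html.parser` module. The patent does not say which tool is used, so the class `TextExtractor`, the helper `strip_html`, the decision to skip `<script>` and `<style>` contents, and the sample markup are all assumptions made for illustration.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""
    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.parts.append(text)

def strip_html(html):
    """Return the text of an HTML string with all tags removed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.parts)

# Hypothetical page fragment.
sample = ("<html><head><style>p{color:red}</style></head>"
          "<body><h1 class='main-title'>Headline</h1><p>First paragraph.</p>"
          "<script>var x = 1;</script></body></html>")
document_d = strip_html(sample)
```

For the sample fragment, `document_d` holds the headline and the paragraph text on separate lines, with the style rule and the script dropped.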

Step 4, extracting the keywords of document D. The TextRank algorithm is used to extract m keywords from the preprocessed webpage to be evaluated, and the keyword set of the webpage is constructed as page_set = $\{p_1, p_2, p_3, \ldots, p_m\}$.

The TextRank algorithm is an unsupervised, graph-based keyword extraction algorithm. The document is regarded as a network of words: the candidate keywords remaining after preprocessing serve as nodes, and the relations between words are built with a sliding window mechanism, which forms the links between nodes in the network. Keyword extraction thus becomes, in essence, a ranking problem over the candidate keywords. The algorithm needs no manually annotated corpus and can extract keywords using only the information in a single document.
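The graph construction and ranking described above can be sketched in a few lines. This is a minimal illustration, not the patent's implementation: the function name `textrank_keywords`, the unweighted graph, the damping factor 0.85, the fixed iteration count, and the sample tokens are assumptions, and the input is taken to be already segmented with stop words removed.

```python
from collections import defaultdict

def textrank_keywords(tokens, top_k=3, window=2, damping=0.85, iterations=50):
    """Rank candidate words with a PageRank-style iteration over an
    undirected co-occurrence graph and return the top_k words."""
    # Build the graph: two distinct words are linked if they occur
    # within `window` positions of each other.
    neighbours = defaultdict(set)
    for i, word in enumerate(tokens):
        for j in range(i + 1, min(len(tokens), i + window + 1)):
            if tokens[j] != word:
                neighbours[word].add(tokens[j])
                neighbours[tokens[j]].add(word)

    # Iterate the ranking update a fixed number of times.
    scores = {word: 1.0 for word in neighbours}
    for _ in range(iterations):
        new_scores = {}
        for word in neighbours:
            incoming = sum(scores[n] / len(neighbours[n]) for n in neighbours[word])
            new_scores[word] = (1 - damping) + damping * incoming
        scores = new_scores

    # Highest score first; ties are broken alphabetically.
    ranked = sorted(scores, key=lambda w: (-scores[w], w))
    return ranked[:top_k]

# Hypothetical token sequence in which "topic" co-occurs with every other word.
tokens = ["topic", "crawler", "topic", "relevance",
          "topic", "webpage", "topic", "vector"]
keywords = textrank_keywords(tokens, top_k=1, window=1)
```

In this toy sequence "topic" is linked to every other word, so it receives the highest score and is returned as the single top keyword.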

Step 5, generating word vectors. Using the word vector model obtained in step 1, each word in the user subject word set topic_set and in the webpage keyword set page_set is represented as a word vector, giving the user subject word vector set $V_{topic} = \{vt_1, vt_2, vt_3, \ldots, vt_n\}$ and the keyword vector set of the webpage to be evaluated $V_{page} = \{vp_1, vp_2, vp_3, \ldots, vp_m\}$. Each word vector $vt_i$ and $vp_j$ in the two sets is a K-dimensional vector obtained from the word vector model of step 1.
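The mapping from words to vectors amounts to a table lookup in the trained model. The sketch below uses a toy 3-dimensional table in place of a real K-dimensional model; the table values, the helper `to_vectors`, and the choice to skip words missing from the model are all assumptions, since the patent does not say how out-of-vocabulary words are handled.

```python
# Toy 3-dimensional embedding table standing in for a trained model
# (values are made up for illustration).
toy_model = {
    "football": [0.9, 0.1, 0.0],
    "soccer":   [0.8, 0.2, 0.1],
    "finance":  [0.0, 0.1, 0.9],
}

def to_vectors(words, model):
    """Map each word to its vector.

    Assumption: words missing from the model are skipped.
    """
    return {w: model[w] for w in words if w in model}

topic_set = ["football", "finance"]
page_set = ["soccer", "stadium"]  # "stadium" is out of vocabulary here

v_topic = to_vectors(topic_set, toy_model)
v_page = to_vectors(page_set, toy_model)
```

In this sketch `v_page` ends up with a single entry, because "stadium" has no vector in the toy table.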

Step 6, calculating the cosine distances of the word vectors. For each word vector $vp_j$ in $V_{page}$, the cosine distance to each word vector $vt_i$ in $V_{topic}$ is calculated in turn, and the maximum of these cosine distances is selected as the topic relevance $\mathrm{Similar}(vp_j, V_{topic})$ of the keyword $p_j$. The calculation formula is as follows:

$$\mathrm{Similar}(vp_j, V_{topic}) = \max_{1 \le i \le n} cs(vp_j, vt_i)$$

where max() takes the maximum value and cs() computes the cosine distance between its two arguments. The cosine distance between input vectors u and v is calculated by the formula

$$cs(u, v) = \cos\theta = \frac{u \cdot v}{\|u\|\,\|v\|} = \frac{\sum_{i=1}^{n} u_i v_i}{\sqrt{\sum_{i=1}^{n} u_i^2}\,\sqrt{\sum_{i=1}^{n} v_i^2}}$$

where θ is the angle between u and v, $\|u\|$ is the L2 norm of the vector u, $\|v\|$ is the L2 norm of the vector v, and $u_i$, $v_i$ are the elements of the vectors u and v, both of which are n-dimensional (for the word vectors of step 5, this dimension is K).
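What the text calls the cosine distance is the cosine of the angle θ between the two vectors (often called cosine similarity), so a larger value means the words are more closely related. A minimal sketch of cs() and of the per-keyword maximum follows; the function name `keyword_similarity`, the zero-vector handling, and the 2-dimensional sample vectors are illustrative assumptions.

```python
import math

def cs(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # assumption: treat a zero vector as unrelated
    return dot / (norm_u * norm_v)

def keyword_similarity(vp_j, v_topic):
    """Similar(vp_j, V_topic): the largest cosine value over all
    subject word vectors."""
    return max(cs(vp_j, vt_i) for vt_i in v_topic)

# Hypothetical 2-dimensional vectors.
v_topic = [[1.0, 0.0], [0.0, 1.0]]
score = keyword_similarity([3.0, 4.0], v_topic)
```

For the sample vectors the two cosine values are 3/5 and 4/5, so `score` is 0.8.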

Step 7, calculating the relevance of the webpage. After the topic relevance $\mathrm{Similar}(vp_j, V_{topic})$ of each keyword of the webpage to be evaluated has been obtained, the average of $\mathrm{Similar}(vp_j, V_{topic})$ over all keywords of the webpage is taken as the topic relevance $\mathrm{Similar}(page, topic)$ of the webpage to be evaluated. The calculation formula is as follows:

$$\mathrm{Similar}(page, topic) = \frac{1}{m}\sum_{j=1}^{m} \mathrm{Similar}(vp_j, V_{topic}) = \frac{1}{m}\sum_{j=1}^{m} \max_{1 \le i \le n} cs(vp_j, vt_i)$$

where m is the total number of keywords and n is the total number of subject words.
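The page-level score is the mean, over the m keywords, of each keyword's best match among the n subject words. The sketch below assumes the cosine values have already been computed and arranged as an m-by-n table; the function name `page_relevance` and the numbers in the table are hypothetical.

```python
def page_relevance(cos_matrix):
    """Average over keywords (rows) of the row-wise maximum over
    subject words (columns)."""
    if not cos_matrix:
        raise ValueError("at least one keyword is required")
    per_keyword = [max(row) for row in cos_matrix]
    return sum(per_keyword) / len(per_keyword)

# Hypothetical cosine values: rows are m = 3 keywords,
# columns are n = 2 subject words.
cos_matrix = [
    [0.9, 0.4],
    [0.3, 0.5],
    [0.2, 0.1],
]
relevance = page_relevance(cos_matrix)
```

Here the row maxima are 0.9, 0.5 and 0.2, so `relevance` is their mean, about 0.533. Only m × n cosine values are needed per page, which is the source of the method's low computational cost.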

Step 8, outputting the topic relevance judgment result. A topic relevance threshold S is set; if the topic relevance $\mathrm{Similar}(page, topic)$ calculated in step 7 reaches S, the webpage to be evaluated is judged to be topic-relevant; otherwise, it is judged to be topic-irrelevant.

Specific examples:

For the case where the webpage to be evaluated is a Chinese webpage, the data set for training the word vectors comes from the Chinese Wikipedia corpus data set "zhwiki-20221020-pages-characters. Word2Vec is trained based on the Skip-gram model with the Hierarchical Softmax technique to obtain the word vector model, which needs to be trained only once. The parameter settings are shown in Table 1.

Table 1. Word2Vec parameter settings

A news report was randomly selected from an official news website as the webpage to be evaluated. The selected item mainly reports on A3, A4, A5 and related matters between A1 and A2, and was published at 18:36 on November 6, 2022. Here A1 and A2 are place names, and A3, A4 and A5 are news content.

The title and body text of the news item are extracted from the <h1 class="main-title"></h1> and <p></p> tags of the webpage HTML and then stored in JSON format in a content.txt file.

Preprocessing such as word segmentation and stop-word removal is performed with the jieba word segmentation tool, and the keyword set of the webpage to be evaluated is then extracted with the TextRank algorithm, with the number of extracted keywords set to 10.

The user inputs "A1, A2, A3, A4, A5" as the subject words and sets the topic relevance threshold S = 0.6; the expected conclusion is clearly that the webpage is topic-relevant. In the experiment, the method of the invention judges the webpage to be evaluated to be topic-relevant, with a relevance of 0.786, which agrees with the expected conclusion.

The user inputs "B1, B2, B3, B4, B5" as the subject words, where B1, B2, B3, B4 and B5 are words different from every one of A1, A2, A3, A4 and A5, and sets the topic relevance threshold S = 0.6; the expected conclusion is clearly that the webpage is topic-irrelevant. In the experiment, the method judges the webpage to be evaluated to be topic-irrelevant, with a relevance of 0.538, which agrees with the expected conclusion.
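The two outcomes follow directly from comparing the reported scores with the threshold. The sketch below reads "reaches the threshold" as greater than or equal to S, which is an assumption about the boundary case; the function name `is_topic_relevant` is hypothetical.

```python
def is_topic_relevant(relevance, threshold):
    """Step 8: relevant when the page-level relevance reaches the threshold.

    Assumption: "reaches" is interpreted as >=.
    """
    return relevance >= threshold

S = 0.6
first_case = is_topic_relevant(0.786, S)   # subject words A1..A5
second_case = is_topic_relevant(0.538, S)  # subject words B1..B5
```

With S = 0.6, the score 0.786 is judged relevant and 0.538 is judged irrelevant, matching the two conclusions reported above.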

The experimental results show that the webpage topic relevance judging method, which fuses the word2vec model with the TextRank algorithm, can accurately judge whether the webpage to be evaluated is relevant to the topic expected by the user and can calculate the topic relevance.

The foregoing describes preferred embodiments of the present invention. Unless they are obviously contradictory or one presupposes another, the preferred embodiments may be combined in any way. The embodiments and the specific parameters in them are intended only to describe clearly the inventors' verification process and not to limit the scope of protection of the invention, which remains defined by the claims; all equivalent structural changes made using the content of the specification and drawings of the present invention fall within the scope of protection of the invention.

Claims

1. A method for judging the topic relevance of a webpage, characterized by comprising the following steps:

step 1, training a word vector model;

step 4, extracting keywords from the document and constructing the keyword set of the webpage to be evaluated, page_set = $\{p_1, p_2, p_3, \ldots, p_m\}$, where $p_1, p_2, p_3, \ldots, p_m$ are the m extracted keywords;

step 5, generating word vectors;

using the word vector model obtained in step 1, each word in the user subject word set topic_set and in the webpage keyword set page_set is represented as a word vector, giving the user subject word vector set $V_{topic} = \{vt_1, vt_2, vt_3, \ldots, vt_n\}$ and the keyword vector set of the webpage to be evaluated $V_{page} = \{vp_1, vp_2, vp_3, \ldots, vp_m\}$;

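$$\mathrm{Similar}(vp_j, V_{topic}) = \max_{1 \le i \le n} cs(vp_j, vt_i)$$

$$cs(u, v) = \cos\theta = \frac{u \cdot v}{\|u\|\,\|v\|} = \frac{\sum_{i=1}^{n} u_i v_i}{\sqrt{\sum_{i=1}^{n} u_i^2}\,\sqrt{\sum_{i=1}^{n} v_i^2}}$$

where θ is the angle between u and v, $\|u\|$ is the L2 norm of the vector u, $\|v\|$ is the L2 norm of the vector v, and $u_i$, $v_i$ are the elements of the vectors u and v, both of which are n-dimensional;

the calculation formula is as follows:

$$\mathrm{Similar}(page, topic) = \frac{1}{m}\sum_{j=1}^{m} \mathrm{Similar}(vp_j, V_{topic})$$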

2. The method for judging the topic relevance of a webpage according to claim 1, wherein the Word2Vec model is used to train the word vector model in step 1.

3. The method for judging the topic relevance of a webpage according to claim 1, wherein the TextRank algorithm is used to extract the keywords in step 4.