CN116628377A - Webpage theme relevance judging method - Google Patents


Info

Publication number
CN116628377A
CN116628377A CN202310049639.8A CN202310049639A CN116628377A CN 116628377 A CN116628377 A CN 116628377A CN 202310049639 A CN202310049639 A CN 202310049639A CN 116628377 A CN116628377 A CN 116628377A
Authority
CN
China
Prior art keywords
webpage
topic
evaluated
word
word vector
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310049639.8A
Other languages
Chinese (zh)
Inventor
李涛
段翰聪
李林
王书涵
陈铎汝
邹涛
李阳
李�浩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Electronic Science and Technology of China
Original Assignee
University of Electronic Science and Technology of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Electronic Science and Technology of China
Priority to CN202310049639.8A
Publication of CN116628377A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/958 Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

A method for judging the topic relevance of a webpage comprises the following steps: step 1, training a word vector model; step 2, setting subject words and constructing a user subject word set; step 3, removing the HTML tags from the webpage to be evaluated to obtain a document containing only the title and body text; step 4, extracting keywords from the document and constructing a keyword set for the webpage to be evaluated; step 5, generating word vectors; step 6, for each keyword vector of the webpage to be evaluated, calculating its cosine distance to every word vector in the user subject word vector set and selecting the maximum value; step 7, averaging these maximum values over all keywords of the webpage to be evaluated and taking the average as the topic relevance of the webpage; and step 8, setting a topic relevance threshold and judging whether the webpage is relevant to the topic. Because words are processed with a pre-trained word vector model, the topic of a webpage can be judged by calculating only a small number of word-vector cosine distances, which improves the speed of judging the topic relevance of a single webpage.

Description

Webpage theme relevance judging method
Technical Field
The invention belongs to the technical field of computer software, and particularly relates to a webpage topic relevance judging method.
Background
Traditional general-purpose search engines provide information search services for ordinary internet users, but they can only fuzzily match user needs, and most semantically related webpages are missing from the returned results, so it is difficult to meet the more focused and in-depth information needs of a specific field or a specific user. Although researchers have done a great deal of work on webpage topic relevance, current methods still leave room for improvement for application tasks in specific fields that place high demands on the accuracy and computational cost of the relevance judgment. Improving the accuracy of webpage topic relevance judgment while reducing the amount of computation is therefore a key technical problem that urgently needs to be solved.
Current webpage relevance calculation commonly adopts the Vector Space Model (VSM), which generally comprises the stages of preprocessing, feature extraction and representation, vector space construction, and cosine similarity calculation. Bag-of-Words is a classical vector space model: it uses words as feature items for the vector representation, uses the TF-IDF values of the words as feature weights, and finally calculates the cosine distance as the similarity.
Using the vector space model to judge webpage relevance has two main problems. First, the relevance calculation must go through several stages, namely preprocessing, feature extraction and representation, vector space construction, and cosine similarity calculation, so the computational cost is relatively high. Second, word order is lost and the semantic relatedness between words is ignored, which reduces the accuracy of the webpage topic relevance judgment.
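To make the bag-of-words baseline concrete, the following is a minimal sketch in plain Python, not code from the patent. The smoothed IDF formula, the function names, and the toy documents are assumptions chosen for illustration. It shows the second problem named above: two documents with related meaning but no shared words receive a similarity of zero.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Build sparse TF-IDF vectors (dicts) for a list of tokenized documents."""
    n_docs = len(docs)
    # Document frequency: the number of documents in which each word occurs.
    df = Counter(word for doc in docs for word in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({
            # Smoothed IDF (an assumption) so that common words keep a nonzero weight.
            w: (count / len(doc)) * (math.log((1 + n_docs) / (1 + df[w])) + 1.0)
            for w, count in tf.items()
        })
    return vectors


def cosine(a, b):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(word, 0.0) for word, weight in a.items())
    norm_a = math.sqrt(sum(x * x for x in a.values()))
    norm_b = math.sqrt(sum(x * x for x in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical toy documents, already tokenized.
docs = [
    ["stock", "market", "falls"],
    ["share", "prices", "drop"],    # related meaning, no shared words
    ["stock", "market", "rises"],
]
vecs = tfidf_vectors(docs)
sim_shared_words = cosine(vecs[0], vecs[2])   # positive: two words in common
sim_synonyms_only = cosine(vecs[0], vecs[1])  # zero: no word in common
```

Because each distinct word is its own dimension, "stock" and "share" are orthogonal here, which is the loss of semantic relatedness that word vectors are meant to address.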
Disclosure of Invention
In order to overcome the defects in the prior art, the invention discloses a webpage topic relevance judging method.
The invention discloses a method for judging the topic relevance of a webpage, comprising the following steps:
Step 1, training a word vector model;
Step 2, setting n subject words $t_1, t_2, t_3, \dots, t_n$ and constructing the user subject word set topic_set = $\{t_1, t_2, t_3, \dots, t_n\}$;
Step 3, removing the HTML tags from the page of the webpage to be evaluated to obtain a document containing only the title and body text;
Step 4, extracting keywords from the document and constructing the keyword set of the webpage to be evaluated, page_set = $\{p_1, p_2, p_3, \dots, p_m\}$, where $p_1, p_2, p_3, \dots, p_m$ are the m extracted keywords;
Step 5, generating word vectors: using the word vector model obtained in step 1, each word in the user subject word set topic_set and the webpage keyword set page_set is represented as a word vector, which maps the two sets to the user subject word vector set $V_{topic} = \{vt_1, vt_2, vt_3, \dots, vt_n\}$ and the keyword vector set of the webpage to be evaluated $V_{page} = \{vp_1, vp_2, vp_3, \dots, vp_m\}$;
Step 6, for each word vector $vp_j$ in the keyword vector set $V_{page}$ of the webpage to be evaluated, calculating in turn its cosine distance to each word vector $vt_i$ in the user subject word vector set $V_{topic}$, and then selecting the maximum of these cosine distances as the topic relevance $\mathrm{Similar}(vp_j, V_{topic})$ of the j-th keyword $p_j$; the calculation formula is:
$$\mathrm{Similar}(vp_j, V_{topic}) = \max_{1 \le i \le n} cs(vp_j, vt_i)$$
where max() denotes taking the maximum value and cs() denotes the cosine distance between its two arguments;
the cosine distance between input vectors u and v (the cosine of the angle between them) is calculated as
$$cs(u, v) = \cos\theta = \frac{u \cdot v}{\|u\|\,\|v\|} = \frac{\sum_{i=1}^{n} u_i v_i}{\sqrt{\sum_{i=1}^{n} u_i^2}\,\sqrt{\sum_{i=1}^{n} v_i^2}}$$
where θ is the angle between u and v, $\|u\|$ is the L2 norm of vector u, $\|v\|$ is the L2 norm of vector v, and $u_i$, $v_i$ denote the elements of u and v, both of which are n-dimensional (here n denotes the vector dimension, not the number of subject words);
Step 7, after the topic relevance $\mathrm{Similar}(vp_j, V_{topic})$ of each keyword in the webpage to be evaluated has been obtained, calculating the average of $\mathrm{Similar}(vp_j, V_{topic})$ over all keywords of the webpage, this average being the topic relevance $\mathrm{Similar}(page, topic)$ of the webpage to be evaluated;
the calculation formula is:
$$\mathrm{Similar}(page, topic) = \frac{1}{m} \sum_{j=1}^{m} \mathrm{Similar}(vp_j, V_{topic}) = \frac{1}{m} \sum_{j=1}^{m} \max_{1 \le i \le n} cs(vp_j, vt_i)$$
where m is the total number of keywords and n is the total number of subject words;
Step 8, setting a topic relevance threshold S: if the topic relevance $\mathrm{Similar}(page, topic)$ calculated in step 7 reaches the threshold S, the webpage to be evaluated is judged to be topic-relevant; otherwise, it is judged to be topic-irrelevant.
Preferably, in step 1, a Word2Vec model is used to train the word vector model.
Preferably, in step 4, the TextRank algorithm is used to extract the keywords.
The invention provides a webpage topic relevance judging method that improves the recognition effect by combining the Word2Vec model with the TextRank algorithm, and has the following beneficial effects:
1. Words are processed with a pre-trained word vector model, so the topic of a webpage can be judged by calculating only a small number of word-vector cosine distances, which improves the speed of judging the topic relevance of a single webpage.
2. The keywords of the webpage to be evaluated are extracted with the TextRank algorithm and represented with Word2Vec word vectors, so the semantic relationship between the subject words and the webpage is taken into account.
3. The method quickly identifies the topic of a document through keyword comparison and can be applied in fields such as topic-focused web crawlers, public opinion monitoring, spam classification, machine translation, and automatic question-answering systems, giving it a wide range of applications.
Drawings
Fig. 1 is a flowchart of a specific embodiment of a method for determining relevance of a web page theme according to the present invention.
Detailed Description
The following describes the present invention in further detail.
The invention provides a webpage topic relevance judging method. Its basic idea is to use a Word2Vec model to calculate the cosine distance between the word vectors of the keywords extracted from the webpage to be evaluated and those of the subject words set by the user, and thereby to analyze the degree of relevance between the extracted keywords and the subject words.
The invention aims to provide an effective method for judging the relevance of a single webpage to a topic, which not only helps people find webpages related to a specific topic in massive Internet data, but can also be widely applied in fields such as topic-focused web crawlers, public opinion monitoring, spam classification, machine translation, and automatic question-answering systems.
First, a Word2Vec model is trained on a large-scale corpus to obtain a word vector model. Then the keywords of the webpage to be evaluated are extracted with the TextRank algorithm. Finally, the pre-trained word vector model is used to represent the extracted keywords and the user-set subject words as word vectors, and cosine distances are calculated to analyze the degree of relevance between the keywords and the subject words. The specific steps are as follows:
Step 1, training the word vector model. The training corpus may be the Chinese corpus of Wikipedia, and Word2Vec is used to train a K-dimensional word vector model with the Skip-gram model and the Hierarchical Softmax technique.
The Word2Vec model is a three-layer neural network language model consisting of an input layer (Input), a projection layer (Projection), and an output layer (Output). Two training architectures are available in the projection layer: Skip-gram and the continuous bag-of-words model (CBOW). The output layer offers two training techniques, Hierarchical Softmax and Negative Sampling, to accelerate convergence, so that words can be quickly converted into vector form for subsequent processing.
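As an illustration of the training configuration described in step 1, the following sketch uses the third-party gensim library (4.x parameter names), which the patent does not name. The helper names and the values for `window`, `min_count`, and `workers` are assumptions; the patent's actual parameter table is not reproduced in this text. The gensim import is deferred so the parameter helper can be inspected without the library installed.

```python
def skipgram_hs_params(k=100):
    """Keyword arguments that select Skip-gram with Hierarchical Softmax."""
    return {
        "vector_size": k,  # K, the dimension of each word vector
        "sg": 1,           # 1 = Skip-gram, 0 = CBOW
        "hs": 1,           # 1 = use Hierarchical Softmax
        "negative": 0,     # 0 = disable negative sampling
        "window": 5,       # hypothetical context window size
        "min_count": 5,    # hypothetical minimum word frequency
        "workers": 4,      # hypothetical number of training threads
    }


def train_word_vectors(sentences, k=100):
    """Train on an iterable of token lists and return the word vectors.

    `sentences` is assumed to be a pre-segmented corpus, for example
    Chinese Wikipedia text that has already been split into words.
    """
    from gensim.models import Word2Vec  # third-party, imported lazily

    model = Word2Vec(sentences=sentences, **skipgram_hs_params(k))
    return model.wv  # KeyedVectors: maps a word to its K-dimensional vector
```

Since the model is trained once and reused for every page, the returned vectors would typically be saved to disk and loaded at evaluation time.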
Step 2, setting the user subject word set. The user enters n subject words $t_1, t_2, t_3, \dots, t_n$ as required, and the user subject word set is constructed as topic_set = $\{t_1, t_2, t_3, \dots, t_n\}$.
Step 3, preprocessing the webpage to be evaluated. The HTML tags in the page of the webpage to be evaluated are removed to obtain a document D containing only the title and body text.
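The preprocessing in step 3 can be sketched with Python's standard `html.parser` module. This is one possible reading, not the patent's implementation: it assumes the title sits in an `<h1>` element and the body text in `<p>` elements, and the class and function names are hypothetical.

```python
from html.parser import HTMLParser


class TitleAndTextExtractor(HTMLParser):
    """Collects text inside <h1> (title) and <p> (body); ignores everything else."""

    def __init__(self):
        super().__init__()
        self.title_parts = []
        self.body_parts = []
        self._stack = []  # currently open <h1>/<p> elements

    def _bucket(self, tag):
        return self.title_parts if tag == "h1" else self.body_parts

    def handle_starttag(self, tag, attrs):
        if tag in ("h1", "p"):
            self._stack.append(tag)

    def handle_endtag(self, tag):
        if self._stack and self._stack[-1] == tag:
            self._stack.pop()
            self._bucket(tag).append(" ")  # keep adjacent blocks separated

    def handle_data(self, data):
        # Text outside <h1>/<p> (scripts, navigation, ...) is dropped.
        if self._stack:
            self._bucket(self._stack[-1]).append(data)


def html_to_document(html):
    """Return document D as a dict holding only the title and the body text."""
    parser = TitleAndTextExtractor()
    parser.feed(html)
    parser.close()
    return {
        "title": " ".join("".join(parser.title_parts).split()),
        "text": " ".join("".join(parser.body_parts).split()),
    }


# Hypothetical page fragment used only to exercise the sketch.
sample = (
    "<html><head><script>var x = 1;</script></head><body>"
    '<h1 class="main-title">Example headline</h1>'
    '<div class="nav">Home</div>'
    "<p>First <b>paragraph</b>.</p><p>Second paragraph.</p>"
    "</body></html>"
)
document = html_to_document(sample)
```

Inline tags such as `<b>` are discarded while their text is kept, so the resulting document contains plain text only.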
Step 4, extracting the keywords of document D. The TextRank algorithm is used to extract m keywords from the preprocessed webpage to be evaluated, and the keyword set of the webpage to be evaluated is constructed as page_set = $\{p_1, p_2, p_3, \dots, p_m\}$.
The TextRank algorithm is an unsupervised, graph-based keyword extraction algorithm. A document is regarded as a network of words: the candidate keywords that remain after preprocessing serve as nodes, and a sliding window mechanism establishes the relationships between words that form the links between nodes. Keyword extraction thus becomes, in essence, a problem of ranking the candidate keywords. The corpus does not need to be labeled manually, and keywords can be extracted using only the information in the single document.
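The graph construction and ranking described above can be sketched as follows. This is a simplified, unweighted variant written for illustration; the window size, damping factor, fixed iteration count, and tie-breaking rule are assumptions, and the input is assumed to be already segmented and filtered for stop words.

```python
from collections import defaultdict


def textrank_keywords(tokens, top_k=10, window=5, damping=0.85, iterations=50):
    """Rank candidate words with PageRank over a word co-occurrence graph."""
    # Link every pair of distinct words that co-occur inside the sliding window.
    neighbors = defaultdict(set)
    for i, word in enumerate(tokens):
        for other in tokens[i + 1:i + window]:
            if other != word:
                neighbors[word].add(other)
                neighbors[other].add(word)

    # Iterate the PageRank update; each node shares its score equally
    # among its neighbors.
    scores = {word: 1.0 for word in neighbors}
    for _ in range(iterations):
        scores = {
            word: (1.0 - damping) + damping * sum(
                scores[n] / len(neighbors[n]) for n in neighbors[word]
            )
            for word in neighbors
        }

    # Highest score first; ties are broken alphabetically for determinism.
    ranked = sorted(scores, key=lambda word: (-scores[word], word))
    return ranked[:top_k]


# Hypothetical token list standing in for a segmented document.
tokens = ["web", "page", "topic", "relevance", "web", "page",
          "keyword", "topic", "vector", "web", "crawler"]
page_set = textrank_keywords(tokens, top_k=3)
```

A production implementation would usually stop when the scores converge rather than after a fixed number of iterations, and may weight edges by co-occurrence count.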
Step 5, generating word vectors. Using the word vector model obtained in step 1, each word in the user subject word set topic_set and the webpage keyword set page_set is represented as a word vector, which maps the two sets to the user subject word vector set $V_{topic} = \{vt_1, vt_2, vt_3, \dots, vt_n\}$ and the keyword vector set of the webpage to be evaluated $V_{page} = \{vp_1, vp_2, vp_3, \dots, vp_m\}$. The word vectors $vt_i$ and $vp_j$ in the two sets are K-dimensional vectors produced by the word vector model of step 1.
Step 6, calculating the cosine distances of the word vectors. For each word vector $vp_j$ in $V_{page}$, its cosine distance to each word vector $vt_i$ in $V_{topic}$ is calculated in turn, and the maximum of these cosine distances is selected as the topic relevance $\mathrm{Similar}(vp_j, V_{topic})$ of the keyword $p_j$. The calculation formula is:
$$\mathrm{Similar}(vp_j, V_{topic}) = \max_{1 \le i \le n} cs(vp_j, vt_i)$$
where max() denotes taking the maximum value and cs() denotes the cosine distance between its two arguments.
The cosine distance between input vectors u and v (the cosine of the angle between them) is calculated as
$$cs(u, v) = \cos\theta = \frac{u \cdot v}{\|u\|\,\|v\|} = \frac{\sum_{i=1}^{n} u_i v_i}{\sqrt{\sum_{i=1}^{n} u_i^2}\,\sqrt{\sum_{i=1}^{n} v_i^2}}$$
where θ is the angle between u and v, $\|u\|$ is the L2 norm of vector u, $\|v\|$ is the L2 norm of vector v, and $u_i$, $v_i$ denote the elements of u and v. Both vectors are n-dimensional (n here denotes the vector dimension, which is K for the word vectors of step 5).
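Step 6 translates directly into a few lines of plain Python. The sketch below is illustrative: the function names mirror the document's cs() and Similar() notation, the two-dimensional vectors are made up, and returning 0.0 for a zero vector is an assumption the patent does not address.

```python
import math


def cs(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # assumption: a zero vector matches nothing
    return dot / (norm_u * norm_v)


def keyword_topic_relevance(vp_j, v_topic):
    """Similar(vp_j, V_topic): the best match of one keyword against all subject words."""
    return max(cs(vp_j, vt_i) for vt_i in v_topic)


# Hypothetical 2-dimensional vectors; real word vectors are K-dimensional.
v_topic = [[1.0, 0.0], [0.0, 1.0]]
vp = [3.0, 4.0]
score = keyword_topic_relevance(vp, v_topic)  # max of 3/5 and 4/5
```

Taking the maximum means a keyword counts as relevant as soon as it is close to any one subject word, so adding unrelated subject words never lowers a keyword's score.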
Step 7, calculating the relevance of the webpage. After the topic relevance $\mathrm{Similar}(vp_j, V_{topic})$ of each keyword in the webpage to be evaluated has been obtained, the average of $\mathrm{Similar}(vp_j, V_{topic})$ over all keywords of the webpage is taken as the topic relevance $\mathrm{Similar}(page, topic)$ of the webpage to be evaluated. The calculation formula is:
$$\mathrm{Similar}(page, topic) = \frac{1}{m} \sum_{j=1}^{m} \mathrm{Similar}(vp_j, V_{topic}) = \frac{1}{m} \sum_{j=1}^{m} \max_{1 \le i \le n} cs(vp_j, vt_i)$$
where m is the total number of keywords and n is the total number of subject words.
Step 8, outputting the topic relevance judgment result. A topic relevance threshold S is set. If the topic relevance $\mathrm{Similar}(page, topic)$ calculated in step 7 reaches S, the webpage to be evaluated is judged to be topic-relevant; otherwise, it is judged to be topic-irrelevant.
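Steps 7 and 8 reduce to an average and a comparison, sketched below. The function names and the per-keyword scores are hypothetical, "reaches S" is read as greater than or equal to S, and rejecting an empty keyword list is an assumption.

```python
def page_relevance(keyword_scores):
    """Similar(page, topic): mean of the per-keyword topic relevance values."""
    if not keyword_scores:
        raise ValueError("at least one keyword score is required")
    return sum(keyword_scores) / len(keyword_scores)


def is_topic_relevant(relevance, threshold):
    """Step 8: the page is topic-relevant when its relevance reaches the threshold S."""
    return relevance >= threshold


# Hypothetical per-keyword scores Similar(vp_j, V_topic) for m = 3 keywords.
keyword_scores = [0.9, 0.7, 0.8]
relevance = page_relevance(keyword_scores)
decision = is_topic_relevant(relevance, threshold=0.6)
```

Because the score is an unweighted mean, every extracted keyword contributes equally, regardless of its TextRank rank.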
Specific examples:
For the case where the webpage to be evaluated is a Chinese-language webpage, the data set for training the word vectors comes from the Chinese corpus data set of Wikipedia ("zhwiki-20221020-pages-characters."). Word2Vec is trained with the Skip-gram model and the Hierarchical Softmax technique to obtain the word vector model, which needs to be trained only once. The parameter settings are shown in Table 1.
Table 1. Word2Vec parameter settings
A news report is randomly selected from an official news website as the webpage to be evaluated. The selected report mainly covers matters A3, A4, A5 and the like between A1 and A2, where A1 and A2 are place names and A3, A4, and A5 are news content; it was published at 18:36 on November 6, 2022.
The title and body text of the news report are extracted from the <h1 class="main-title"></h1> and <p></p> tags of the webpage HTML and then stored in JSON format in the file content.txt.
Preprocessing such as word segmentation and stop-word removal is performed with the jieba word segmentation tool, and the keyword set of the webpage to be evaluated is extracted with the TextRank algorithm, with the number of extracted keywords set to 10.
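One way to carry out this step is jieba's built-in TextRank extractor, sketched below. Whether the inventors used this exact call is not stated, so the wrapper function and its name are assumptions. The call keeps jieba's default part-of-speech filter, and the import is deferred so the function can be defined without jieba installed.

```python
def extract_keywords(text, top_k=10):
    """Return up to top_k TextRank keywords of a Chinese text using jieba."""
    import jieba.analyse  # third-party, imported lazily

    # textrank() segments the text itself and ranks the candidate words;
    # withWeight=False returns the words only, best first.
    return jieba.analyse.textrank(text, topK=top_k, withWeight=False)
```

With `top_k=10` this matches the number of keywords set in the embodiment; the returned list would serve as page_set.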
The user enters "A1, A2, A3, A4, A5" as the subject words and sets the topic relevance threshold S = 0.6; the expected conclusion is clearly that the webpage is relevant to the topic. In the experiment, the method judged the webpage to be evaluated as topic-relevant with a relevance of 0.786, which matches the expected conclusion.
The user enters "B1, B2, B3, B4, B5" as the subject words, where B1, B2, B3, B4, and B5 are words different from every one of A1, A2, A3, A4, and A5, and sets the topic relevance threshold S = 0.6; the expected conclusion is clearly that the webpage is irrelevant to the topic. In the experiment, the method judged the webpage to be evaluated as topic-irrelevant with a relevance of 0.538, which matches the expected conclusion.
The experimental results show that the webpage topic relevance judging method, which combines the Word2Vec model with the TextRank algorithm, can accurately judge whether the webpage to be evaluated is relevant to the topic expected by the user and can calculate its topic relevance.
The foregoing describes preferred embodiments of the invention. Unless they obviously contradict one another or one presupposes another, the preferred embodiments may be combined in any manner. The embodiments and the specific parameters in them serve only to describe the inventors' verification process clearly and are not intended to limit the scope of patent protection, which remains defined by the claims. All equivalent structural changes made using the description and drawings of the invention likewise fall within the scope of protection of the invention.

Claims (3)

1. A method for judging the topic relevance of a webpage, characterized by comprising the following steps:
Step 1, training a word vector model;
Step 2, setting n subject words $t_1, t_2, t_3, \dots, t_n$ and constructing the user subject word set topic_set = $\{t_1, t_2, t_3, \dots, t_n\}$;
Step 3, removing the HTML tags from the page of the webpage to be evaluated to obtain a document containing only the title and body text;
Step 4, extracting keywords from the document and constructing the keyword set of the webpage to be evaluated, page_set = $\{p_1, p_2, p_3, \dots, p_m\}$, where $p_1, p_2, p_3, \dots, p_m$ are the m extracted keywords;
Step 5, generating word vectors;
using the word vector model obtained in step 1, each word in the user subject word set topic_set and the webpage keyword set page_set is represented as a word vector, which maps the two sets to the user subject word vector set $V_{topic} = \{vt_1, vt_2, vt_3, \dots, vt_n\}$ and the keyword vector set of the webpage to be evaluated $V_{page} = \{vp_1, vp_2, vp_3, \dots, vp_m\}$;
Step 6, for each word vector $vp_j$ in the keyword vector set $V_{page}$ of the webpage to be evaluated, calculating in turn its cosine distance to each word vector $vt_i$ in the user subject word vector set $V_{topic}$, and then selecting the maximum of these cosine distances as the topic relevance $\mathrm{Similar}(vp_j, V_{topic})$ of the j-th keyword $p_j$; the calculation formula is:
$$\mathrm{Similar}(vp_j, V_{topic}) = \max_{1 \le i \le n} cs(vp_j, vt_i)$$
where max() denotes taking the maximum value and cs() denotes the cosine distance between its two arguments;
the cosine distance between input vectors u and v is calculated as
$$cs(u, v) = \cos\theta = \frac{u \cdot v}{\|u\|\,\|v\|} = \frac{\sum_{i=1}^{n} u_i v_i}{\sqrt{\sum_{i=1}^{n} u_i^2}\,\sqrt{\sum_{i=1}^{n} v_i^2}}$$
where θ is the angle between u and v, $\|u\|$ is the L2 norm of vector u, $\|v\|$ is the L2 norm of vector v, and $u_i$, $v_i$ denote the elements of u and v, both of which are n-dimensional;
Step 7, after the topic relevance $\mathrm{Similar}(vp_j, V_{topic})$ of each keyword in the webpage to be evaluated has been obtained, calculating the average of $\mathrm{Similar}(vp_j, V_{topic})$ over all keywords of the webpage, this average being the topic relevance $\mathrm{Similar}(page, topic)$ of the webpage to be evaluated;
the calculation formula is:
$$\mathrm{Similar}(page, topic) = \frac{1}{m} \sum_{j=1}^{m} \mathrm{Similar}(vp_j, V_{topic}) = \frac{1}{m} \sum_{j=1}^{m} \max_{1 \le i \le n} cs(vp_j, vt_i)$$
where m is the total number of keywords and n is the total number of subject words;
Step 8, setting a topic relevance threshold S: if the topic relevance $\mathrm{Similar}(page, topic)$ calculated in step 7 reaches the threshold S, the webpage to be evaluated is judged to be topic-relevant; otherwise, it is judged to be topic-irrelevant.
2. The method for judging the topic relevance of a webpage according to claim 1, wherein in step 1 a Word2Vec model is used to train the word vector model.
3. The method for judging the topic relevance of a webpage according to claim 1, wherein in step 4 the TextRank algorithm is used to extract the keywords.
CN202310049639.8A 2023-02-01 2023-02-01 Webpage theme relevance judging method Pending CN116628377A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310049639.8A CN116628377A (en) 2023-02-01 2023-02-01 Webpage theme relevance judging method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310049639.8A CN116628377A (en) 2023-02-01 2023-02-01 Webpage theme relevance judging method

Publications (1)

Publication Number Publication Date
CN116628377A true CN116628377A (en) 2023-08-22

Family

ID=87590794

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310049639.8A Pending CN116628377A (en) 2023-02-01 2023-02-01 Webpage theme relevance judging method

Country Status (1)

Country Link
CN (1) CN116628377A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117076649A (en) * 2023-10-13 2023-11-17 卓世科技(海南)有限公司 Emergency information query method and device based on large model thinking chain
CN117076649B (en) * 2023-10-13 2024-01-26 卓世科技(海南)有限公司 Emergency information query method and device based on large model thinking chain

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination