CN116701812B

CN116701812B - Geographic information webpage text topic classification method based on block units

Info

Publication number: CN116701812B
Application number: CN202310969070.7A
Authority: CN
Inventors: 罗安; 王勇; 徐胜华; 车向红; 甄杰
Original assignee: Chinese Academy of Surveying and Mapping
Current assignee: Chinese Academy of Surveying and Mapping
Priority date: 2023-08-03
Filing date: 2023-08-03
Publication date: 2023-11-28
Anticipated expiration: 2043-08-03
Also published as: CN116701812A

Abstract

A block-unit-based method for classifying the topics of geographic information web page text comprises the following steps: web page structural feature analysis and block unit division; web page block unit weight training and assignment; block-unit-based LDA text topic distribution representation; LDA2Vec-based topic feature vector generation; and geographic information topic classification based on a support vector machine (SVM) classifier. By analyzing the structural characteristics of web page text and building a topic model of geographic information web page content that takes those characteristics into account, the method addresses the poor performance of traditional text classification methods on web page text classification tasks. By obtaining feature vectors of geographic information web page text from the block-unit LDA2Vec model, it also addresses the inability to mine the global semantic relations of a document and the local semantic relations of words at the same time.

Description

Geographic information webpage text topic classification method based on block units

Technical Field

The invention belongs to the technical field of geographic information, specifically geographic data processing, and relates in particular to a block-unit-based method for classifying the topics of geographic information web page text.

Background

With the spread of the mobile internet and the rise of the big data era, information resources have grown explosively, and massive amounts of information now fill the internet and people's daily lives. According to the 50th Statistical Report on China's Internet Development issued by the China Internet Network Information Center (CNNIC), the number of internet users in China has reached 1.051 billion and the internet penetration rate has reached 74.4%, making the internet a main source of information in daily life. Network information is rich and varied and is presented mainly as video, audio, pictures, and text, but most people still obtain information through internet text. Geographic information, an essential part of network information, is scattered widely across the internet, and most of it exists in text form.

Effectively classifying the multi-source, massive, and highly dynamic geographic information text data on the internet is one of the most important technical means of realizing the value of geographic big data. It is both a major challenge and a major opportunity for the management and application of surveying, mapping, and geographic information in the new era, and it helps to better support geographic data management in the big data era and to promote the development of the industry. Classifying network geographic information text according to the specific structural features of web page text therefore has significant application value.

Currently, common methods for web page text topic classification fall into three categories: methods based on expert rules, on machine learning, and on deep learning. These methods typically convert web page text into unstructured plain text, extract features following standard text classification procedures, and then classify according to the semantic relations within the plain text. The first category formulates classification rules entirely from manual experience and is therefore highly subjective and slow to adapt; the latter two vectorize the text and perform clustering and discrimination using the spatial relations between the vectors. Expert-rule methods can work well in specialized fields with small data volumes, but on today's internet, where text data is exploding, they suffer from poor timeliness, high cost, and heavy manual effort. Methods that combine text semantic features with web page structural characteristics have therefore become a common approach to web page classification.

Although current web page text topic classification methods based on text vector representation have achieved good results on some datasets, they have the following problems: (1) Unlike plain text, web page text has distinctive structural characteristics; if these are ignored and web page text is processed in the same way as plain text, semantic information cannot be mined effectively. (2) Previous work has focused on the overall semantic features of a document and paid less attention to local semantic information, so the latent semantic association between the whole and its parts has not been well considered.

Therefore, how to overcome the weak representation of web page structural features in the prior art, and the difficulty of accounting for both global and local semantic relations, is a technical problem that remains to be solved.

Disclosure of Invention

The invention aims to provide a geographic information webpage text topic classification method based on block units, so as to solve the technical problems of low accuracy and low efficiency of the existing geographic information webpage text classification method.

To achieve the purpose, the invention adopts the following technical scheme:

the method for classifying the geographic information webpage text subjects based on the block units comprises the following steps:

the web page structural feature analysis and block unit division step S110:

for different information web page structures and page layouts, dividing the web page structure and layout into columns, labeling content according to the original HTML (HyperText Markup Language) tags that carry different semantic information, and, by analyzing the effect of the different columns and HTML tags on topic classification, dividing and reorganizing the whole web page to complete block unit division;

step S120 of training and assigning weights of the web page block units:

dividing a large number of web pages into different block units according to step S110, analyzing and training on these block units as a corpus to assign weights to the different block units, and normalizing the weights so that the weight values are assigned uniformly;

the block-unit-based LDA text topic distribution representation step S130:

taking the block units as text topic modeling units and combining the weights of the different block units, introducing the LDA topic model to construct a block-unit-based LDA topic model, and optimizing the number of topics by analyzing the perplexity and coherence of the latent topics, so as to determine the optimal number of latent topic categories and the feature word dimension;

the LDA2Vec-based topic feature vector generation step S140:

optimizing the LDA2Vec topic model vectorization flow by replacing the original classical LDA model with the block-unit LDA, constructing a block-unit-based LDA2Vec topic model vectorization method, and fusing the web page topic distribution generated by the block-unit LDA with the word vectors trained by the Word2Vec model to form document feature vectors, thereby generating the topic feature vectors of the geographic information web page text;

the geographic information topic classification step S150 based on the support vector machine (SVM) classifier:

inputting the topic feature vectors of the web page text extracted in step S140 into an SVM classifier, using a radial basis function (RBF) kernel to turn the nonlinear problem into a linearly separable one, classifying the topics of geographic information web pages with a one-versus-one strategy, and evaluating the classification results by precision, recall, and F1 score.

Optionally, the step S110 of analyzing the structural features of the web page and dividing the block unit specifically includes:

S111: analyzing the structures and page layouts of different geographic information web pages, and dividing them into web page columns: menu option columns, body text columns forming the main content, advertisement columns for promotional display, copyright notice columns, and links related to the web page content;

S112: selecting six types of HTML tags from the original HTML tags of the web page, namely title, text, link, form, inline, and other, and labeling the text content of the web page accordingly;

S113: dividing the geographic information web page text into four block types, namely text blocks, text property blocks, semantic emphasis blocks, and title blocks, according to the functional characteristics of the different web page columns and HTML tags and the topic-semantic relevance of the text within the tags;

S114: repeating steps S111-S113 until all web pages have been processed, so that the web page text content is finally divided into block units of different types and numbers.
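As a rough illustration of steps S111-S113, the sketch below walks an HTML page with Python's standard `html.parser` and sorts text into the four block types. The tag-to-block mapping (`TAG_TO_BLOCK`), the skipped column tags (`SKIPPED_TAGS`), and the function name `split_into_blocks` are assumptions made for this example; the source does not list which tags belong to which block type.

```python
from html.parser import HTMLParser

# Hypothetical mapping from HTML tags to the four block types (an assumption).
TAG_TO_BLOCK = {
    "title": "title", "h1": "title", "h2": "title", "h3": "title",
    "strong": "semantic_emphasis", "b": "semantic_emphasis", "em": "semantic_emphasis",
    "a": "text_property",
    "p": "text", "div": "text", "span": "text",
}
# Assumed non-content columns (menus, scripts, footers) that are dropped.
SKIPPED_TAGS = {"script", "style", "nav", "footer"}
# Void elements never get a closing tag, so they are not tracked.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class BlockSplitter(HTMLParser):
    def __init__(self):
        super().__init__()
        self.open_tags = []
        self.blocks = {"title": [], "semantic_emphasis": [],
                       "text_property": [], "text": []}

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed inner tags.
        if tag in self.open_tags:
            while self.open_tags.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if not text or any(t in SKIPPED_TAGS for t in self.open_tags):
            return
        # The innermost mapped tag decides the block type.
        for tag in reversed(self.open_tags):
            if tag in TAG_TO_BLOCK:
                self.blocks[TAG_TO_BLOCK[tag]].append(text)
                return
        self.blocks["text"].append(text)


def split_into_blocks(html):
    """Return a dict mapping each block type to its list of text fragments."""
    parser = BlockSplitter()
    parser.feed(html)
    parser.close()
    return parser.blocks
```

Letting the innermost mapped tag win means emphasized text inside a paragraph is routed to the semantic emphasis block rather than the text block; a real implementation would need a rule set derived from the analysis in S111 and S112.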

Optionally, the step S120 of training and assigning weights of the web page block units specifically includes:

S121: setting the contribution of the different block unit types according to the organizational structure of the web page and empirical analysis, with contributions ordered as: title block > semantic emphasis block > text property block > text block;

S122: setting the weight of the text block, which has the lowest contribution, to 1, and setting the weights of the other block types relative to the text block according to their contribution;

S123: using the types and number of block units obtained in step S110, calculating a semantic weight value for each web page block unit as the weight factor for merging block units, with the following calculation formula:

where $B_{xy}$ denotes the weight value of block $y$ of the $x$-th web page document, $n$ is the number of blocks in the web page, $t$ is a web page tag contained in the block, and $w$ is the weight assigned to the current tag of the block.

S124: and carrying out normalization processing on the block unit weights of the whole web page by utilizing a normalization processing function to obtain normalization weight factors of the block units, and ensuring that the weight sum of all the block units of each web page is 1.
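The normalization in S124 can be sketched as follows. The source does not name the normalization function, so this example assumes the simplest choice, dividing each raw weight by the page total; the raw weights and the function name `normalize_block_weights` are hypothetical.

```python
def normalize_block_weights(raw_weights):
    """Scale the raw block weights of one web page so that they sum to 1."""
    total = sum(raw_weights)
    if total <= 0:
        raise ValueError("block weights must sum to a positive value")
    return [w / total for w in raw_weights]


# Hypothetical raw weights for one page, in the order
# title, semantic emphasis, text property, text (text block fixed at 1).
raw = [3.0, 2.0, 1.5, 1.0]
normalized = normalize_block_weights(raw)
print(normalized)
```

Dividing by the total preserves the contribution ordering from S121 while guaranteeing that the weights of each page sum to 1.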

Optionally, the block unit-based LDA text topic distribution representation step S130 specifically includes:

s131: according to the block units formed by the step S110, each original webpage is converted into a set formed by a plurality of sub-documents of the block units, and corpus is provided for webpage classification training;

S132: based on the corpus, performing topic modeling of the web page documents with the classical LDA topic model to obtain the sub-document-topic distribution and the topic-word distribution;

S133: using the normalized weight factors of the block units obtained in step S124, merging the topic distributions of the block unit sub-documents of the whole web page by weighted summation to construct the block-unit-based web page topic distribution;

S134: evaluating the constructed web page topic distribution by latent topic perplexity and topic coherence, optimizing and determining the optimal number of latent topic categories and the topic word feature dimension, and generating the final web page topic distribution.
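The weighted merging in S133 can be sketched as below, assuming each block unit sub-document already has a topic distribution from LDA and a normalized weight from S124. The function name `merge_topic_distributions` and the numbers are illustrative assumptions, not values from the source.

```python
def merge_topic_distributions(sub_doc_topics, weights):
    """Weighted sum of block-level topic distributions for one web page.

    sub_doc_topics: one topic distribution (list of probabilities) per block.
    weights: normalized block weights that sum to 1.
    """
    if len(sub_doc_topics) != len(weights):
        raise ValueError("need exactly one weight per block sub-document")
    num_topics = len(sub_doc_topics[0])
    page_topics = [0.0] * num_topics
    for dist, weight in zip(sub_doc_topics, weights):
        for k, prob in enumerate(dist):
            page_topics[k] += weight * prob
    return page_topics


# Hypothetical page with a title block and a text block over three topics.
page = merge_topic_distributions(
    [[0.7, 0.2, 0.1], [0.1, 0.6, 0.3]],
    [0.75, 0.25],
)
print(page)
```

Because the weights sum to 1 and each sub-document distribution sums to 1, the merged vector is itself a valid topic distribution.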

Optionally, the generating step S140 of the topic feature vector based on LDA2Vec specifically includes:

s141: based on the final webpage topic distribution generated in the step S130, summarizing and constructing all webpage topic distribution;

S142: calculating the Euclidean distances between the topic distributions of all web pages, and, by setting the objective function parameters and training iteratively, merging and eliminating web page topics and feature words so that the topic distribution quantity of each web page is roughly balanced;

s143: and carrying out Word vector representation on each webpage topic distribution based on the Skip-gram mode in the Word2vec vectorization model, and constructing topic feature vectors of webpage texts.
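For readers unfamiliar with the Skip-gram mode referred to in S143, the sketch below generates the (center word, context word) training pairs that a Skip-gram model learns from. The window size, token list, and function name `skipgram_pairs` are assumptions for illustration; the source does not give Word2Vec hyperparameters.

```python
def skipgram_pairs(tokens, window=2):
    """Return (center, context) pairs for every token and its neighbors."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        stop = min(len(tokens), i + window + 1)
        for j in range(start, stop):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


# Hypothetical topic feature words from one web page.
tokens = ["satellite", "remote", "sensing", "data"]
pairs = skipgram_pairs(tokens, window=1)
print(pairs)
```

In Skip-gram training, the model is optimized to predict each context word from its center word, which is how words that occur in similar contexts end up with similar vectors.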

Optionally, the step S150 of classifying the geographic information subject based on the SVM support vector machine classifier specifically includes:

S151: introducing an SVM classification model, inputting the topic feature vectors of the web page text generated in step S143 into the model, setting the kernel function to the Gaussian radial basis function (RBF), and assigning a value to the penalty coefficient λ to complete the construction of a single classifier;

S152: selecting seven categories as the geographic information topic categories, namely high-precision maps, digital cities, real-scene 3D, satellite remote sensing big data, surveying and mapping regulations, industry news, and overseas resources, and constructing 21 classifiers, one for each pair of categories, to automatically classify geographic information web page text;

S153: self-evaluating the classification results according to the precision (P), recall (R), and F1 score of each classifier; for classification results that do not reach the expected threshold, readjusting the penalty coefficient λ and repeating steps S151-S153 until the expected threshold is reached.
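The count of 21 classifiers in S152 follows from the one-versus-one strategy: seven categories give 7 × 6 / 2 = 21 unordered pairs. The sketch below enumerates the pairs and shows majority voting over pairwise decisions. The placeholder labels `c1` to `c7` stand for the seven topic categories, and the voting function is an assumed, conventional way to combine pairwise results; the source does not describe its voting rule.

```python
from collections import Counter
from itertools import combinations

# Placeholder labels for the seven topic categories of S152.
CATEGORIES = ["c1", "c2", "c3", "c4", "c5", "c6", "c7"]


def pairwise_tasks(categories):
    """One binary classification task per unordered pair of categories."""
    return list(combinations(categories, 2))


def vote(pair_winners):
    """Return the category that wins the most pairwise decisions."""
    return Counter(pair_winners).most_common(1)[0][0]


tasks = pairwise_tasks(CATEGORIES)
print(len(tasks))  # 21
```

Each of the 21 binary classifiers only needs training data from its own two categories, and a page is assigned to the category that collects the most votes.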

The invention further discloses a geographic information webpage text topic classification system based on block units, which comprises the following modules:

web page structural feature analysis and block unit division unit 210:

the web page block unit weight training and distribution unit 220:

following step S110, analyzing and training on the different block units into which the web pages have been divided, using a large number of web page block units as a corpus, to assign weights to the different block units, and normalizing the weights so that the weight values are assigned uniformly;

block-unit-based LDA text topic distribution representation unit 230:

taking the block units as text topic modeling units and combining the weight assignment of the different block units, introducing the LDA topic model to construct a block-unit-based LDA topic model, and optimizing the number of topics by analyzing latent topic perplexity and topic coherence, so as to determine the optimal number of latent topic categories and the feature word dimension;

LDA2Vec-based topic feature vector generation unit 240:

geographic information topic classification unit 250 based on the support vector machine (SVM) classifier:

inputting the extracted topic feature vectors of the web page text into an SVM classifier, using a radial basis function (RBF) kernel to turn the nonlinear problem into a linearly separable one, classifying the topics of geographic information web pages with a one-versus-one strategy, and evaluating the classification results by precision, recall, and F1 score.

The invention further discloses a storage medium for storing computer-executable instructions,

the computer-executable instructions, when executed by a processor, perform the above-described block-unit-based geographic information web page text topic classification method.

The invention has the following advantages:

1. by analyzing the structural characteristics of web pages, the roles of tags, and their relations at the semantic level, the geographic information web page is divided into four types of blocks according to the semantic characteristics of the tags, so that modeling is carried out at a finer granularity and topic information is extracted effectively;

2. suitable weight assignment methods are designed for the different block types, so that the contributions of text content at different positions to the overall topic can be distinguished effectively;

3. a method for extracting geographic information web page text features based on the block-unit LDA2Vec is provided. In the topic embedding layer of the traditional LDA2Vec topic model vectorization, a block-unit LDA topic model that accounts for web page structural characteristics replaces the original LDA topic model, so that topic modeling of geographic information text content is done on the basis of web page blocks and topic clustering capability is enhanced.

Drawings

FIG. 1 is a flowchart of a block-unit-based geographic information web page text topic classification method according to an embodiment of the present invention;

FIG. 2 is a detailed flowchart of the web page structural feature analysis and block unit division step according to an embodiment of the present invention;

FIG. 3 is a detailed flowchart of the web page block unit weight training and assignment step according to an embodiment of the present invention;

FIG. 4 is a detailed flowchart of the block-unit-based LDA text topic distribution representation step according to an embodiment of the present invention;

FIG. 5 is a detailed flowchart of the LDA2Vec-based topic feature vector generation step according to an embodiment of the present invention;

FIG. 6 is a detailed flowchart of the geographic information topic classification step based on the SVM classifier according to an embodiment of the present invention;

FIG. 7 is a block diagram of a block-unit-based geographic information web page text topic classification device according to an embodiment of the present invention.

Detailed Description

The invention is described in further detail below with reference to the drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the invention and are not limiting thereof. It should be further noted that, for convenience of description, only some, but not all of the structures related to the present invention are shown in the drawings.

Referring to fig. 1, a flowchart of a block unit-based geographic information web page text topic classification method in accordance with an embodiment of the present invention is disclosed, comprising the steps of:

the web page structural feature analysis and block unit division step S110:

specifically, referring to fig. 2, the step S110 includes the following sub-steps:

In this step, the whole web page is divided into four types of block units according to the analyzed web page structural characteristics, so that subsequent steps can perform topic modeling, feature extraction, and related operations on the divided block units.

Step S120 of training and assigning weights of the web page block units:

specifically, referring to fig. 3, step S120 includes the following sub-steps:

For example, with the text block set to 1, if the web page title is to be emphasized, the title block may be set to a relatively large value such as 3 or 4, and the semantic emphasis block and the text property block may be set to values between 1 and 3; if the title need not be strongly emphasized, it may be set to a smaller value such as 2. The weights of the different block types are set according to this principle;

where $B_{xy}$ denotes the weight value of block $y$ of the $x$-th web page document, $n$ is the number of blocks in the web page, $t$ is a web page tag contained in the block, and $w$ is the weight assigned to the current tag of the block.

The block-unit-based LDA text topic distribution representation step S130:

In this step, the divided block units are taken as text topic modeling units and the LDA topic model is introduced in combination with the weights of the different block units. The document-topic-word three-layer structure of the original LDA model is extended into a document-block unit-topic-word four-layer structure, which yields the block-unit-based LDA topic model. The number of topics is optimized by analyzing latent topic perplexity and topic coherence, so as to determine the optimal number of latent topic categories and the feature word dimension;

specifically, referring to fig. 4, step S130 includes the following sub-steps:

The LDA2Vec-based topic feature vector generation step S140:

In this step, the LDA2Vec topic model vectorization flow is optimized by replacing the original classical LDA model with the block-unit LDA, which gives a block-unit-based LDA2Vec topic model vectorization method. Document feature vectors are formed by fusing the word vectors trained by the Word2Vec model with the web page topic distribution generated by the block-unit LDA, thereby generating the topic feature vectors of the geographic information web page text.
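The source does not spell out how the topic distribution and the word vectors are fused. As a point of reference only, the sketch below follows the general lda2vec formulation, in which a document vector is the topic-weighted sum of topic vectors and a context vector is the sum of a word vector and the document vector. All names and numbers here are assumptions for illustration.

```python
def document_vector(topic_weights, topic_vectors):
    """Topic-weighted sum of topic vectors (one vector per topic)."""
    dim = len(topic_vectors[0])
    return [
        sum(p * vec[i] for p, vec in zip(topic_weights, topic_vectors))
        for i in range(dim)
    ]


def context_vector(word_vector, doc_vector):
    """lda2vec-style context vector: word vector plus document vector."""
    return [w + d for w, d in zip(word_vector, doc_vector)]


# Hypothetical two-topic page with two-dimensional embeddings.
doc_vec = document_vector([0.5, 0.5], [[1.0, 0.0], [0.0, 1.0]])
ctx_vec = context_vector([1.0, 2.0], doc_vec)
print(doc_vec, ctx_vec)
```

In the block-unit variant described above, the topic weights would come from the block-unit-based web page topic distribution rather than from a classical LDA model.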

Specifically, referring to fig. 5, step S140 includes the following sub-steps:

S142: calculating the Euclidean distances between the topic distributions of all web pages, and, by setting the objective function parameters and training iteratively, merging and eliminating web page topics and feature words so that the number of topic feature words of each web page is roughly balanced.

For example, if the median number of topic feature words over all web pages is X, then more than 80% of web pages should have a number of topic feature words within [0.8X, 1.2X]. With a median of 30 dimensions, more than 80% of web pages should have between 24 and 36 topic feature words.
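The balance criterion in this example can be checked with a few lines of code. The sketch below assumes the per-page counts of topic feature words are already available; the function name `is_balanced` and the sample counts are hypothetical.

```python
from statistics import median


def is_balanced(counts, low_factor=0.8, high_factor=1.2, min_share=0.8):
    """Check whether more than min_share of pages fall within
    [low_factor * X, high_factor * X], where X is the median count."""
    x = median(counts)
    low, high = low_factor * x, high_factor * x
    share = sum(low <= c <= high for c in counts) / len(counts)
    return share > min_share, share


# Hypothetical feature word counts for ten web pages (median 30).
counts = [30, 30, 30, 30, 30, 30, 30, 30, 25, 50]
balanced, share = is_balanced(counts)
print(balanced, share)
```

If the check fails, the merging and elimination of topics and feature words described in S142 would continue until the share exceeds the required proportion.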

inputting the topic feature vectors of the web page text extracted in step S140 into an SVM classifier, using a radial basis function (RBF) kernel to turn the nonlinear problem into a linearly separable one, classifying the topics of geographic information web pages with a one-versus-one strategy, and evaluating the classification results, for example by precision (P), recall (R), and F1 score.

In this step, the SVM classification model is trained on the feature vectors of the web page text using the selected training set, the model parameters are optimized through parameter tuning and the setting of the penalty coefficient, and the optimal solution for classifying geographic information web page text is finally obtained.
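For reference, the Gaussian RBF kernel is conventionally written as follows, where $\gamma > 0$ is a kernel width parameter that the source does not name or assign a value to:

```latex
K(\mathbf{x}, \mathbf{x}') = \exp\left(-\gamma \,\lVert \mathbf{x} - \mathbf{x}' \rVert^{2}\right)
```

The kernel implicitly maps the topic feature vectors into a higher-dimensional space in which a linear separating hyperplane can be sought, which is the sense in which the nonlinear problem becomes linearly separable.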

Specifically, referring to fig. 6, the following sub-steps are included:

For example, the penalty coefficient λ may be set within the interval (0, 1). Adjusting λ improves the classification accuracy, and testing found λ = 0.05 to be the optimal value.

S153: self-evaluating the classification results, for example according to the precision (P), recall (R), and F1 score of each classifier. For classification results that do not reach the expected threshold, the penalty coefficient λ is readjusted and steps S151-S153 are repeated until the expected threshold is reached.

For example, an expected threshold for classification accuracy (e.g., 85%) may be preset as needed; classification accuracy is generally required to be no lower than 70%. If the expected threshold is not reached, the penalty coefficient λ is readjusted and steps S151-S153 are repeated until it is reached.
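The evaluation and readjustment loop of S153 can be sketched as below. The metric definitions are the standard ones; the `evaluate` callback, which would train a classifier with a given penalty coefficient and return its score, is a hypothetical placeholder supplied by the caller, as are the function names.

```python
def precision_recall_f1(y_true, y_pred, positive):
    """Precision, recall, and F1 score for one class treated as positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def tune_penalty(candidates, evaluate, threshold):
    """Try penalty coefficients in order and return the first (value, score)
    whose score reaches the threshold, or (None, None) if none does."""
    for lam in candidates:
        score = evaluate(lam)
        if score >= threshold:
            return lam, score
    return None, None
```

A caller would pass a list of candidate values in (0, 1) and an `evaluate` function that retrains the classifier for each candidate, stopping as soon as the preset threshold is met.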

Further, referring to fig. 7, a geographic information web text topic classification system based on a block unit LDA2Vec is disclosed, which is used for running the geographic information web text topic classification method based on the block unit LDA2Vec of the invention, and comprises the following modules:

web page structural feature analysis and block unit division unit 210:

the web page block unit weight training and distribution unit 220:

block-unit-based LDA text topic distribution representation unit 230:

LDA2Vec-based topic feature vector generation unit 240:

inputting the extracted topic feature vectors of the web page text into an SVM classifier, using a radial basis function (RBF) kernel to turn the nonlinear problem into a linearly separable one, classifying the topics of geographic information web pages with a one-versus-one strategy, and evaluating the classification results, for example by precision (P), recall (R), and F1 score.

Furthermore, the invention also discloses a storage medium for storing computer-executable instructions which, when executed by a processor, perform the above-mentioned geographic information web page text topic classification method based on the block-unit LDA2Vec.

Examples:

experiments show that the method can well solve the problem that the traditional text classification method cannot well utilize semantic information contained in the structural features of the webpage text.

For example, if the term "remote sensing" appears in the title of a web page, experience indicates that the article is very likely related to remote sensing, and the term can serve as a topic feature word representing the article's topic; if "remote sensing" appears only in the body, it cannot be determined whether the article belongs to the remote sensing category. Traditional text classification methods do not consider the position of feature words within the web page text, whereas this method assigns different weights to text of different semantic relevance according to the web page structure and can therefore identify the topic of geographic information web page text more effectively.

In summary, the method addresses the difficulty that traditional text classification methods have, in web page text classification tasks, in accounting for both global and local semantic information and in identifying topic information effectively. By analyzing the structural characteristics of the web page, it divides the whole document into four types of block units and performs topic modeling at a granularity finer than the whole text to obtain the global topic distribution. It then combines this with the context-aware word vectors produced during LDA2Vec topic model training to generate the topic feature vectors of the geographic information web page text, and completes classification with an SVM classifier.

It will be apparent to those skilled in the art that the elements or steps of the invention described above may be implemented in a general purpose computing device, they may be concentrated on a single computing device, or they may alternatively be implemented in program code executable by a computer device, such that they may be stored in a storage device for execution by the computing device, or they may be separately fabricated into individual integrated circuit modules, or a plurality of modules or steps in them may be fabricated into a single integrated circuit module. Thus, the present invention is not limited to any specific combination of hardware and software.

While the invention has been described in detail in connection with specific preferred embodiments, the invention is not limited to them. Simple deductions or substitutions made by a person of ordinary skill in the art without departing from the spirit of the invention shall be regarded as falling within the scope of protection defined by the appended claims.

Claims

1. The method for classifying the geographic information webpage text subjects based on the block units is characterized by comprising the following steps of:

the web page structural feature analysis and block unit division step S110:

step S120 of training and assigning weights of the web page block units:

the block-unit-based LDA text topic distribution representation step S130:

the LDA2Vec-based topic feature vector generation step S140:

inputting the topic feature vectors of the web page text extracted in step S140 into an SVM classifier, using a radial basis function (RBF) kernel to turn the nonlinear problem into a linearly separable one, classifying the topics of geographic information web pages with a one-versus-one strategy, and evaluating the classification results by precision, recall, and F1 score;

the step S140 of generating a topic feature vector based on LDA2Vec specifically includes:

2. The method for topic classification of geographic information web page text as recited in claim 1,

the web page structural feature analysis and block unit division step S110 specifically includes:

3. The method for topic classification of geographic information web page text as recited in claim 1,

the step S120 of training and assigning weights of the web page block units specifically includes:

S122: setting the weight of the text block, which has the lowest contribution, to 1, and setting the weights of the other block types relative to the text block according to their contribution;

wherein $B_{xy}$ denotes the weight value of block $y$ of the $x$-th web page document, $n$ is the number of tags in the web page, $t$ is a web page tag contained in the block, and $w$ is the weight assigned to the current tag of the block;

4. The method for topic classification of geographic information web page text as recited in claim 1,

the block unit-based LDA text topic distribution representation step S130 specifically includes:

S132: based on the corpus, performing topic modeling of the web page documents with the classical LDA topic model to obtain the sub-document-topic distribution and the topic-word distribution;

5. The method for topic classification of geographic information web page text as recited in claim 1,

the step S150 of classifying the geographic information subject based on the SVM support vector machine classifier specifically includes:

S153: self-evaluating the classification results according to the precision, recall, and F1 score of each classifier; for classification results that do not reach the expected threshold, readjusting the penalty coefficient λ and repeating steps S151-S153 until the expected threshold is reached.

6. A storage medium storing computer-executable instructions,

the computer-executable instructions, when executed by a processor, perform the block-unit-based geographic information web page text topic classification method of any of claims 1-5.