CN110287409B

CN110287409B - Webpage type identification method and device

Info

Publication number: CN110287409B
Application number: CN201910486083.2A
Authority: CN
Inventors: 孙尚勇
Original assignee: New H3C Security Technologies Co Ltd
Current assignee: New H3C Security Technologies Co Ltd
Priority date: 2019-06-05
Filing date: 2019-06-05
Publication date: 2022-07-22
Anticipated expiration: 2039-06-05
Also published as: CN110287409A

Abstract

The embodiment of the application provides a method and a device for identifying webpage types, wherein the method comprises the following steps: counting the TF-IDF weight of each text word in the webpage to be identified; counting the proportion of the occurrence frequency of each HTML label in the webpage to be identified to the total occurrence frequency; constructing a feature vector of a first preset quantity dimension corresponding to the webpage to be identified according to the TF-IDF weight of each text word and the proportion of each HTML label; and inputting the characteristic vector corresponding to the webpage to be identified into a preset vector classification model to obtain the type of the webpage to be identified. By applying the technical scheme provided by the embodiment of the application, the manpower consumed by webpage type identification can be reduced, the identification of the webpage with unknown type can be realized, and the number of the effectively identified webpage types can be increased.

Description

Webpage type identification method and device

Technical Field

The present application relates to the field of internet technologies, and in particular, to a method and an apparatus for identifying a webpage type.

Background

In network security monitoring, it is often necessary to determine which web pages a user has accessed and the types of those web pages, so that the user's behavior characteristics can be analyzed based on the web page types. Web page types include news, video, forum, finance, and the like.

Currently, web page types are identified by recording web pages of various types in advance. Specifically, an administrator records web pages of various types in a database. After acquiring a web page to be identified, the electronic device searches the database for a web page identical to the web page to be identified and takes the type of the web page found as the type of the web page to be identified.

Identifying web page types in this way consumes a large amount of manpower to construct the database, can only identify web pages whose types are already known, and limits the number of web page types that can be effectively identified.

Disclosure of Invention

The embodiment of the application aims to provide a method and a device for identifying webpage types, so that the labor consumed by webpage type identification is reduced, the identification of webpages of unknown types is realized, and the number of effectively identified webpage types is increased. The specific technical scheme is as follows:

In a first aspect, an embodiment of the present application provides a method for identifying a webpage type, where the method includes:

performing word segmentation processing on text content on a webpage to be recognized to obtain at least one text word;

counting TF-IDF (Term Frequency-Inverse Document Frequency) weight of each text word;

counting the proportion of the occurrence frequency of each HTML (Hypertext Markup Language) label in the webpage to be recognized to the total occurrence frequency, wherein the total occurrence frequency is the sum of the occurrence frequencies of all HTML labels in the webpage to be recognized;

constructing a feature vector of a first preset quantity dimension corresponding to the webpage to be recognized according to the TF-IDF weight of each text word and the proportion of each HTML label;

and inputting the feature vector corresponding to the webpage to be identified into a preset vector classification model to obtain the type of the webpage to be identified.

In a second aspect, an embodiment of the present application provides an apparatus for identifying a webpage type, where the apparatus includes:

the first word segmentation unit is used for performing word segmentation processing on text contents on a webpage to be recognized to obtain at least one text word;

the first statistical unit is used for counting TF-IDF weight of each text word;

the second counting unit is used for counting the proportion of the occurrence frequency of each HTML label in the webpage to be identified to the total occurrence frequency, wherein the total occurrence frequency is the sum of the occurrence frequencies of all HTML labels in the webpage to be identified;

the first construction unit is used for constructing a feature vector of a first preset quantity dimension corresponding to the webpage to be identified according to the TF-IDF weight of each text word and the proportion of each HTML label;

and the first identification unit is used for inputting the feature vector corresponding to the webpage to be identified into a preset vector classification model to obtain the type of the webpage to be identified.

In a third aspect, embodiments of the present application provide an electronic device comprising a processor and a machine-readable storage medium storing machine-executable instructions executable by the processor, the machine-executable instructions causing the processor to implement any of the method steps provided in the first aspect.

In a fourth aspect, embodiments of the present application provide a machine-readable storage medium storing machine-executable instructions executable by a processor, the machine-executable instructions causing the processor to implement any of the method steps provided in the first aspect.

According to the webpage type identification method and device provided by the embodiment of the application, the electronic equipment utilizes a plurality of sample webpages and the types of the sample webpages to pre-train and obtain the vector classification model. The electronic equipment combines TF-IDF weights of text words of the webpage to be recognized and the proportion of the HTML label to construct a feature vector of a first preset quantity dimension corresponding to the webpage to be recognized, inputs the feature vector corresponding to the webpage to be recognized into a vector classification model obtained through pre-training, and obtains the type of the webpage to be recognized.

According to the technical scheme, the type of the webpage is identified by using the preset vector classification model, a database comprising multiple types of webpages does not need to be constructed, and the labor consumed by webpage type identification is reduced. In addition, the preset vector classification model can be trained and adjusted according to actual needs, and then the electronic equipment can recognize the types of known web pages or unknown web pages through the preset vector classification model, so that the number of the types of the web pages effectively recognized is increased.

Of course, it is not necessary for any product or method of the present application to achieve all of the above-described advantages at the same time.

Drawings

In order to illustrate the embodiments of the present application or the technical solutions in the prior art more clearly, the drawings used in describing the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present application, and those skilled in the art can obtain other drawings from them without creative effort.

Fig. 1 is a schematic flowchart of a model training method according to an embodiment of the present disclosure;

Fig. 2 is a schematic flowchart of a method for identifying a webpage type according to an embodiment of the present application;

Fig. 3 is a schematic structural diagram of a web page type identification apparatus according to an embodiment of the present application;

Fig. 4 is a schematic structural diagram of an electronic device according to an embodiment of the present application.

Detailed Description

The technical solutions in the embodiments of the present application will be described clearly and completely with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only some embodiments of the present application, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.

Currently, identification of web page types is implemented depending on the way in which various types of web pages are recorded. This requires a significant amount of manpower to build a database comprising multiple types of web pages. In addition, since the known types of web pages are recorded in the database, only the known types of web pages can be identified in the identification of the types of web pages, and the number of types of web pages that can be effectively identified is limited.

In order to reduce the labor consumed by webpage type identification, realize identification of unknown types of webpages and increase the number of effectively identified webpage types, the embodiment of the application provides a webpage type identification method. The webpage type identification method can be applied to any electronic equipment such as a mobile phone, a notebook computer and a server. In the webpage type identification method, the electronic equipment utilizes a plurality of sample webpages and the types of the sample webpages to pre-train and obtain a vector classification model. The electronic equipment combines TF-IDF weights of text words of the webpage to be recognized and the proportion of the HTML label to construct a feature vector of a first preset quantity dimension corresponding to the webpage to be recognized, inputs the feature vector corresponding to the webpage to be recognized into a vector classification model obtained through pre-training, and obtains the type of the webpage to be recognized.

According to the technical scheme provided by the embodiment of the application, the type of the webpage is identified by using the preset vector classification model, a database comprising various types of webpages does not need to be constructed, and the manpower consumed by webpage type identification is reduced. In addition, the preset vector classification model can be trained and adjusted according to actual needs, and then the electronic equipment can recognize the types of known web pages or unknown web pages through the preset vector classification model, so that the number of the types of the web pages effectively recognized is increased.

The following describes a web page type identification method provided in the embodiment of the present application by using a specific embodiment.

Referring to fig. 1, fig. 1 is a schematic flowchart of a model training method provided in an embodiment of the present application. For convenience of description, the following description will be made with an electronic apparatus as an execution subject. The model training method comprises the following steps.

Step 101, a preset training set is obtained, wherein the preset training set comprises a plurality of sample webpages and the type of each sample webpage. The types of the web pages include news, videos, forums, finance and the like.

When the vector classification model is trained, the electronic device obtains a preset training set.

To improve the accuracy with which the trained vector classification model identifies web page types, the preset training set acquired by the electronic device should include as many sample webpages as possible.

Step 102, performing word segmentation processing on the text contents on a plurality of sample web pages to obtain at least one text word of each sample web page.

After the electronic equipment acquires a plurality of sample webpages, extracting text contents on each sample webpage, and performing word segmentation processing on the extracted text contents to obtain at least one text word of the sample webpage.

In an embodiment of the application, after performing word segmentation processing on the extracted text content, the electronic device may delete useless words from the words obtained by the word segmentation processing and use the remaining words as the text words of the sample webpage. Here, a useless word is a word that does not help identify the type of a web page, for example, "we", "them", "today", "yesterday", and the like. This reduces the amount of calculation in model training and webpage type identification, lightens the burden on the electronic device, and improves the efficiency of model training and webpage type identification.
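The filtering described above can be sketched as follows. This is a minimal illustration rather than the method of the application: the function name `remove_useless_words` and the contents of `STOP_WORDS` are assumptions, and the list is seeded only with the four examples given in the text.

```python
# Hypothetical useless-word (stop-word) list; the text names only these four examples.
STOP_WORDS = {"we", "them", "today", "yesterday"}


def remove_useless_words(words, stop_words=STOP_WORDS):
    """Return the words left after dropping useless words, keeping order and duplicates."""
    return [w for w in words if w.lower() not in stop_words]


# Words as they might come out of a word segmentation step.
segmented = ["Today", "we", "report", "stock", "prices"]
text_words = remove_useless_words(segmented)
print(text_words)
```

Matching is case-insensitive here, which is a design choice of this sketch and is not stated in the application.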

In one embodiment of the present application, the process for the electronic device to determine at least one text word for each sample web page may include the following steps.

Step 1021, performing word segmentation processing on the text contents in the multiple sample webpages to obtain at least one literal word of each sample webpage.

In the embodiment of the application, a large amount of text content exists in the page of a webpage. For each sample webpage in the multiple sample webpages, the electronic equipment extracts the text content in the sample webpage and performs word segmentation processing on the extracted text content to obtain at least one literal word of the sample webpage.

Step 1022, performing word segmentation processing on the links on the multiple sample web pages to obtain at least one character string of each sample web page.

In the embodiment of the application, a webpage comprises a plurality of links. For example, web page A includes a link thereon to jump to web page B. Many characters are included in the link. For each sample webpage in the multiple sample webpages, the electronic equipment extracts all links on the sample webpage, and performs word segmentation processing on the extracted links to obtain at least one character string of the sample webpage.

Step 1023, for each sample web page, combining at least one literal word of the sample web page and at least one character string of the sample web page to obtain at least one text word of the sample web page.

After determining at least one literal word of each sample webpage and at least one character string of each sample webpage, the electronic equipment combines the at least one literal word of the sample webpage and the at least one character string of the sample webpage to obtain at least one text word of the sample webpage.

For example, the electronic device determines that the literal words of sample webpage A are { word 1, word 2, word 3, word 2 } and that the character strings of sample webpage A are { string 1, string 2, string 3, string 1 }. The electronic device may then determine that the text words of sample webpage A are { word 1, word 2, word 3, word 2, string 1, string 2, string 3, string 1 }.

In the embodiment of the application, the text words of the sample webpage not only consider the text content in the sample webpage, but also consider the links on the sample webpage, so that the types of the text words of the sample webpage are enriched, namely the types of the characteristics of the webpage type identification are increased, the extracted text words can represent the characteristics of the sample webpage to a higher degree, and the accuracy of the webpage type identification is improved.
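Steps 1021 to 1023 can be sketched as below. The application does not say how a link is segmented, so splitting on non-alphanumeric characters is an assumption of this sketch, as are the function names `segment_link` and `build_text_words` and the example URLs.

```python
import re


def segment_link(url):
    """Split a link into character strings on runs of non-alphanumeric characters.

    This splitting rule is an assumption; the application does not specify one.
    """
    return [s for s in re.split(r"[^0-9A-Za-z]+", url) if s]


def build_text_words(literal_words, links):
    """Combine literal words with the character strings of all links, keeping duplicates."""
    strings = [s for link in links for s in segment_link(link)]
    return list(literal_words) + strings


# Literal words from the page text (duplicates are kept, as in the example for webpage A).
literal_words = ["word1", "word2", "word3", "word2"]
# Hypothetical links found on the page.
links = ["http://news.example.com/sports", "http://news.example.com/finance"]
text_words = build_text_words(literal_words, links)
print(text_words)
```

Duplicates are preserved on purpose, because the term frequency computed later depends on how often each text word occurs.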

Step 103, counting the TF-IDF weight of each text word in at least one text word of each sample webpage.

In one embodiment of the application, for each sample web page, the electronic device determines the term frequency $TF_w$ of each text word in at least one text word of the sample web page using the following formula (1):

$$TF_w = \frac{T_w}{T_0} \tag{1}$$

Wherein $w$ represents a text word in the at least one text word of the sample web page, $T_w$ represents the number of occurrences of the text word $w$ in the at least one text word of the sample web page, and $T_0$ represents the total number of the at least one text word of the sample web page.

For each sample webpage, the electronic device determines the inverse document frequency $IDF_w$ of each text word of at least one text word of the sample webpage using the following formula (2):

$$IDF_w = \log\frac{F_0}{F_w} \tag{2}$$

Wherein $w$ represents a text word in the at least one text word of the sample web page, $F_w$ represents the number of web pages in a preset corpus that include the text word $w$, and $F_0$ represents the total number of web pages included in the preset corpus. The preset corpus includes a large number of web pages and the correspondence between each web page and its text words. In one embodiment, the electronic device may acquire a large number of webpages through a tool such as a web crawler, perform word segmentation processing on the webpages to obtain their text words, and thereby construct the preset corpus. In another embodiment, the electronic device may obtain a corpus from other electronic devices and store it locally as the preset corpus. The embodiment of the application can also obtain the preset corpus in other ways, which is not specifically limited here.

For each sample web page, the electronic device determines the TF-IDF weight $\delta_w$ of each text word of at least one text word of the sample web page using the following formula (3):

$$\delta_w = TF_w \times IDF_w \tag{3}$$

Wherein $w$ represents a text word in the at least one text word of the sample web page, $TF_w$ represents the term frequency of the text word $w$ of the sample web page, and $IDF_w$ represents the inverse document frequency of the text word $w$ of the sample web page.

For example, for sample webpage B, the number of times the text word x1 appears in the text words of sample webpage B is $T_{x1} = 2$, and the total number of text words of sample webpage B is $T_0 = 10$. The number of web pages in the preset corpus that include the text word x1 is $F_{x1} = 10$, and the total number of web pages included in the preset corpus is $F_0 = 100$. The electronic device can then determine $TF_{x1} = 2/10 = 0.2$ and $IDF_{x1} = \log(100/10) = 1$ (taking the logarithm to base 10), and determine the TF-IDF weight of the text word x1 as $\delta_{x1} = 0.2 \times 1 = 0.2$.
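The calculation in this example can be reproduced with a short sketch. The function name `tf_idf_weights`, the filler words `y0` to `y7`, and their corpus counts are hypothetical; the base-10 logarithm is inferred from the worked value log(100/10) = 1, and every text word is assumed to occur in at least one corpus page.

```python
import math
from collections import Counter


def tf_idf_weights(text_words, doc_freq, total_docs):
    """Return {word: TF * IDF} for one page.

    text_words: all text words of the page, duplicates included.
    doc_freq:   {word: number of corpus pages containing the word}, assumed > 0.
    total_docs: total number of pages in the corpus.
    """
    counts = Counter(text_words)
    total = len(text_words)
    weights = {}
    for word, occurrences in counts.items():
        tf = occurrences / total                         # T_w / T_0
        idf = math.log10(total_docs / doc_freq[word])    # log(F_0 / F_w), base 10 assumed
        weights[word] = tf * idf
    return weights


# Page B: x1 appears twice among 10 text words; the other eight words are made up.
page_b = ["x1", "x1"] + [f"y{i}" for i in range(8)]
doc_freq = {"x1": 10, **{f"y{i}": 50 for i in range(8)}}
weights = tf_idf_weights(page_b, doc_freq, total_docs=100)
print(round(weights["x1"], 6))
```

A word that is absent from `doc_freq` raises a `KeyError` in this sketch; handling such words is left open here.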

In another embodiment of the present application, in order to improve the smoothing effect of the TF-IDF weight calculation of the text words, the above formula (2) may be transformed into formula (4).

The electronic device determines a TF-IDF weight for each text word of the sample web page in conjunction with equations (1), (4), and (3).

Step 104, counting the proportion of the occurrence frequency of each HTML tag in each sample webpage to the total occurrence frequency. The total occurrence frequency corresponding to each sample webpage is the sum of the occurrence frequencies of all HTML tags in the sample webpage.

In the embodiment of the present application, the HTML tags include, but are not limited to, <title>, <track>, <textarea>, <string>, <link>, <figure>, <code>, <audio>, <applet>, <video>, <wbr>, <table>, <source>, and the like. For each sample webpage, the electronic equipment acquires the HTML tags included in the sample webpage, counts the occurrence frequency of each HTML tag in the sample webpage and the sum of the occurrence frequencies of all HTML tags in the sample webpage, namely the total occurrence frequency, and then calculates the proportion of the occurrence frequency of each HTML tag in the sample webpage to the total occurrence frequency.

For example, the electronic device counts that, in sample web page B, the <title> tag appears 6 times, the <link> tag appears 10 times, the <code> tag appears 4 times, and the other tags appear 0 times. The electronic device may then determine that the proportion of the <title> tag is 6/(6+10+4) = 0.3, the proportion of the <link> tag is 10/(6+10+4) = 0.5, the proportion of the <code> tag is 4/(6+10+4) = 0.2, and the proportion of each of the other tags is 0/(6+10+4) = 0.

In the embodiment of the application, the electronic equipment makes use of features that cannot be described by text but can be represented by HTML tags, which increases the kinds of features used for webpage type identification and improves the accuracy of webpage type identification.
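The tag statistics in this example can be sketched with Python's standard `html.parser` module. Counting only opening tags is an assumption of this sketch (the application does not say whether closing tags are counted), and the names `TagCounter` and `tag_proportions` are hypothetical.

```python
from collections import Counter
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Counts how many times each opening tag occurs in a document."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        # The parser reports tag names in lowercase.
        self.counts[tag] += 1


def tag_proportions(html):
    """Return {tag: occurrences / total occurrences}; empty if the page has no tags."""
    parser = TagCounter()
    parser.feed(html)
    parser.close()
    total = sum(parser.counts.values())
    if total == 0:
        return {}
    return {tag: n / total for tag, n in parser.counts.items()}


# A synthetic page with 6 <title>, 10 <link> and 4 <code> tags, as in sample web page B.
html = "<title>t</title>" * 6 + "<link>" * 10 + "<code>c</code>" * 4
proportions = tag_proportions(html)
print(proportions)
```

Tags that never occur are simply absent from the returned mapping, which corresponds to a proportion of 0.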

The execution order of steps 102 and 104 is not limited in the embodiments of the present application. Step 102 may be performed before step 104, after step 104, or simultaneously with step 104.

Step 105, constructing a feature vector of a first preset number of dimensions corresponding to each sample webpage according to the TF-IDF weight of each text word of each sample webpage and the proportion of each HTML tag.

The first preset number can be set according to actual requirements. In one example, the first preset number may be set according to the number of kinds of text words that can be obtained and the number of kinds of HTML tags that are set. For example, if the number of kinds of text words that can be obtained is 100 and the number of kinds of HTML tags set is 10, the first preset number is greater than or equal to 100 + 10 = 110.

For each sample webpage, the electronic equipment constructs a feature vector of a first preset number of dimensions corresponding to the sample webpage according to the TF-IDF weight of each text word of the sample webpage and the proportion of each HTML tag.

In an embodiment of the application, for each sample web page, the electronic device determines a feature vector of a first preset number dimension corresponding to the sample web page as follows.

Step 1051, determining a second preset number of text terms with the highest TF-IDF weight among the at least one text term of the sample web page as the web page representative term of the sample web page.

In an optional embodiment, for each sample web page, the electronic device detects whether the total number of the at least one text word of the sample web page is smaller than the second preset number. If the total number is smaller than the second preset number, the electronic device obtains a target number of blank spaces, where the target number is the difference between the second preset number and the total number of the at least one text word of the sample web page, and takes the at least one text word together with the target number of blank spaces as the web page representative words of the sample web page. If the total number is greater than or equal to the second preset number, the electronic device extracts the second preset number of text words with the highest TF-IDF weights from the at least one text word of the sample web page as the web page representative words of the sample web page.

For example, the second preset number is 100. If the number of text words extracted from sample web page C is 80, which is less than 100, the electronic device obtains 100 - 80 = 20 blank spaces and combines the 80 text words extracted from sample web page C with the 20 blank spaces to obtain the 100 web page representative words of sample web page C. If the number of text words extracted from sample web page C is 110, which is greater than 100, the electronic device extracts the 100 text words with the highest TF-IDF weights from the 110 text words as the web page representative words of sample web page C.

In one embodiment, to make it easier to determine the web page representative words, if the total number of the at least one text word of the sample web page is greater than or equal to the second preset number, the electronic device sorts the at least one text word of the sample web page in descending order of TF-IDF weight and extracts the first second-preset-number text words as the web page representative words of the sample web page.
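The selection and padding of web page representative words can be sketched as follows. The function name `representative_words` is hypothetical, the empty string stands in for a "blank space", and the sketch assumes the weights are given per distinct text word.

```python
def representative_words(weights, k, blank=""):
    """Pick the k text words with the highest TF-IDF weight, padding with blanks.

    weights: {word: TF-IDF weight} for the distinct text words of one page.
    k:       the second preset number.
    """
    ranked = sorted(weights.items(), key=lambda item: item[1], reverse=True)
    top = [word for word, _ in ranked[:k]]
    # When the page has fewer than k text words, fill the remainder with blanks.
    return top + [blank] * (k - len(top))


weights = {"a": 0.1, "b": 0.5, "c": 0.3}
print(representative_words(weights, 2))
print(representative_words(weights, 5))
```

With three text words and k = 5, two blanks are appended, mirroring the 80-word case of sample web page C on a smaller scale.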

Step 1052, constructing a feature vector of a first preset number of dimensions corresponding to the sample webpage according to the TF-IDF weight of each webpage representative word of the sample webpage and the proportion of each HTML tag. Wherein the second preset number is smaller than the first preset number. In one example, the second preset number may be 100, and the first preset number may be 20000.

In another example, the second preset number may be 5, and the first preset number may be 20. The number of preset HTML tags is 5. The correspondence between the elements of the feature vector of the first preset number of dimensions and the web page representative words and HTML tags is shown in Table 1.

TABLE 1

If the web page representative words of sample web page D include word 3 and word 4, and the HTML tags of sample web page D include HTML tag 1 and HTML tag 4, then positions 3, 4, 16, and 19 of the 20-dimensional feature vector corresponding to sample web page D have values. If, in sample web page D, the TF-IDF weight of word 3 is 0.2, the TF-IDF weight of word 4 is 0.4, the proportion of HTML tag 1 is 0.3, and the proportion of HTML tag 4 is 0.7, the 20-dimensional feature vector corresponding to sample web page D is {0, 0, 0.2, 0.4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.3, 0, 0, 0.7, 0}.
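The vector for sample web page D can be rebuilt with the sketch below. The position layout is an assumption inferred from this example only (positions 1 to 15 hold words 1 to 15, positions 16 to 20 hold HTML tags 1 to 5), and the names `build_feature_vector`, `word_index`, and `tag_index` are hypothetical.

```python
def build_feature_vector(word_weights, tag_props, word_index, tag_index, dim):
    """Place TF-IDF weights and tag proportions at their preset positions.

    word_index / tag_index map a word or tag to its 0-based position in the vector;
    anything without a preset position is ignored, and untouched positions stay 0.
    """
    vector = [0.0] * dim
    for word, weight in word_weights.items():
        if word in word_index:
            vector[word_index[word]] = weight
    for tag, proportion in tag_props.items():
        if tag in tag_index:
            vector[tag_index[tag]] = proportion
    return vector


# Assumed layout: words 1..15 at positions 1..15, HTML tags 1..5 at positions 16..20.
word_index = {f"word {i}": i - 1 for i in range(1, 16)}
tag_index = {f"HTML tag {i}": 15 + i - 1 for i in range(1, 6)}

vector_d = build_feature_vector(
    {"word 3": 0.2, "word 4": 0.4},
    {"HTML tag 1": 0.3, "HTML tag 4": 0.7},
    word_index,
    tag_index,
    dim=20,
)
print(vector_d)
```

Keeping the position maps fixed across all pages is what makes vectors from different pages comparable, so the same maps would be reused at training and identification time.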

Step 106, training a preset machine learning classification algorithm by using the feature vector corresponding to each sample webpage and the type of each sample webpage to obtain a preset vector classification model.

In the embodiment of the present application, the preset machine learning classification algorithm includes, but is not limited to, a logistic regression algorithm, a support vector machine, a decision tree, a neural network, and the like. The type of the sample webpage is the real type of the sample webpage.

Specifically, the process of training the preset machine learning classification algorithm by the electronic device to obtain the preset vector classification model includes the following steps.

And the electronic equipment inputs the feature vectors corresponding to each sample webpage into a preset machine learning classification algorithm respectively to obtain the prediction type of each sample webpage.

The electronic equipment determines the accuracy of the webpage type identification based on the actual type of each sample webpage and the predicted type of each sample webpage.

If the accuracy is less than or equal to the preset threshold, the electronic device may adjust parameters of the preset machine learning classification algorithm by using a back propagation algorithm, a gradient descent algorithm, and the like, and then input the feature vectors corresponding to the sample webpages into the preset machine learning classification algorithm again, so as to obtain the prediction type of each sample webpage.

And if the accuracy is greater than the preset threshold, the electronic equipment takes the trained preset machine learning classification algorithm as a preset vector classification model.
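The train-until-accurate loop described above can be sketched in plain Python. Softmax regression trained by stochastic gradient descent is used here only as one possible instance of the "preset machine learning classification algorithm"; the learning rate, the `max_epochs` safety cap, the function names, and the toy data are all assumptions of this sketch.

```python
import math


def scores(weights, x):
    """Linear score of feature vector x for every class."""
    return [sum(w_j * x_j for w_j, x_j in zip(w, x)) for w in weights]


def softmax(values):
    """Numerically stable softmax."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def predict(weights, x):
    """Index of the class with the highest score."""
    s = scores(weights, x)
    return s.index(max(s))


def accuracy(weights, xs, ys):
    """Fraction of samples whose predicted type equals the actual type."""
    return sum(predict(weights, x) == y for x, y in zip(xs, ys)) / len(xs)


def train(xs, ys, num_classes, threshold=0.95, lr=0.5, max_epochs=1000):
    """Adjust parameters by gradient descent until accuracy exceeds the threshold."""
    dim = len(xs[0])
    weights = [[0.0] * dim for _ in range(num_classes)]
    for _ in range(max_epochs):  # safety cap, not part of the described procedure
        if accuracy(weights, xs, ys) > threshold:
            break
        for x, y in zip(xs, ys):
            probs = softmax(scores(weights, x))
            for c in range(num_classes):
                grad = probs[c] - (1.0 if c == y else 0.0)
                for j in range(dim):
                    weights[c][j] -= lr * grad * x[j]
    return weights


# Toy 2-dimensional feature vectors for two hypothetical types (0 and 1).
xs = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]
ys = [0, 0, 1, 1]
model = train(xs, ys, num_classes=2)
print(accuracy(model, xs, ys))
```

Measuring accuracy on the training samples themselves follows the description; a held-out validation set would be a common alternative, though the application does not mention one.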

According to the technical scheme provided by the embodiment of the application, the type of a webpage is identified by using the preset vector classification model obtained by training on a plurality of sample webpages, so a database comprising multiple types of webpages does not need to be constructed, the manpower consumed by webpage type identification is reduced, and the engineering requirements of webpage type identification can be better met. In addition, because the preset vector classification model is obtained by training on a plurality of sample webpages, the electronic equipment can identify the types of both known and unknown webpages through the preset vector classification model, which increases the number of webpage types that can be effectively identified.

In the embodiment of the application, the text words of the webpage and the HTML tags of the sample webpage are jointly used as the webpage type identification features, the feature vectors corresponding to the webpage are constructed, the types of the features of the webpage type identification are increased, the feature vectors corresponding to the webpage have good representativeness and distinguishability, and the accuracy and the comprehensiveness of the webpage type identification are improved.

Based on the vector classification model obtained through training, the embodiment of the application provides a webpage type identification method. Referring to fig. 2, fig. 2 is a schematic flowchart of a method for identifying a web page type according to an embodiment of the present application, where the method includes the following steps.

Step 201, performing word segmentation processing on text content on a webpage to be recognized to obtain at least one text word.

In the embodiment of the application, after the electronic equipment acquires the webpage to be identified, the text content on the webpage to be identified is extracted, and word segmentation processing is performed on the extracted text content to obtain at least one text word of the webpage to be identified.

In an embodiment of the application, after performing word segmentation processing on the extracted text content, the electronic device may delete useless words in the multiple words obtained after the word segmentation processing, and use the remaining words as text words of the web page to be identified.

In one embodiment of the present application, the electronic device may determine at least one text word of the web page to be recognized in the following manner. Specifically, the electronic device performs word segmentation processing on the text content in the web page to be recognized to obtain at least one literal word. The electronic device performs word segmentation processing on the links on the web page to be recognized to obtain at least one character string. The electronic device then combines the at least one literal word and the at least one character string to obtain the at least one text word of the web page to be recognized.

Step 202, counting the TF-IDF weight of each text word.

In one embodiment of the present application, the electronic device determines the term frequency $TF_w$ of each text word of at least one text word of the web page to be recognized using the following formula (1):

$$TF_w = \frac{T_w}{T_0} \tag{1}$$

Wherein $w$ represents a text word in the at least one text word of the web page to be recognized, $T_w$ represents the number of occurrences of the text word $w$ in the at least one text word of the web page to be recognized, and $T_0$ represents the total number of the at least one text word of the web page to be recognized.

The electronic equipment determines the inverse document frequency $IDF_w$ of each text word in at least one text word of the webpage to be recognized by using the following formula (2):

$$IDF_w = \log\frac{F_0}{F_w} \tag{2}$$

Wherein $w$ represents a text word in the at least one text word of the web page to be recognized, $F_w$ represents the number of web pages in a preset corpus that include the text word $w$, and $F_0$ represents the total number of web pages included in the preset corpus.

The electronic equipment determines TF-IDF weight delta of each text word in at least one text word of the webpage to be recognized by using the following formula (3)_w：

δ_w＝TF_w*IDF_w (3)

Wherein w represents a text word w, TF in at least one text word of the webpage to be identified_wWord frequency, IDF, of the text words w representing the web page to be recognized_wRepresenting the inverse document frequency of the text words w of the web page to be recognized.

In another embodiment of the present application, in order to improve the smoothing effect of the TF-IDF weight calculation of the text words, the above formula (2) may be transformed into formula (4).
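Formula (4) is not reproduced in this text. Purely as an assumption, one widely used smoothed form of the inverse document frequency adds one to the denominator, so that a text word absent from the preset corpus (F_w = 0) does not cause a division by zero:

```latex
\mathrm{IDF}_w = \log\frac{F_0}{F_w + 1}
```

With this variant the value becomes slightly negative when the text word appears in every web page of the corpus; whether the present application uses this or another smoothing is not stated here.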

Step 203, counting the proportion of the number of occurrences of each HTML tag in the web page to be recognized to the total number of occurrences, wherein the total number of occurrences is the sum of the numbers of occurrences of all HTML tags in the web page to be recognized.

The electronic device acquires the HTML tags included in the web page to be recognized, counts the number of occurrences of each HTML tag in the web page to be recognized, and sums these counts over all HTML tags to obtain the total number of occurrences. The electronic device then calculates, for each HTML tag, the proportion of that tag's number of occurrences to the total number of occurrences.
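A minimal sketch of this tag statistic, using Python's standard `html.parser` module, is given below. The class and function names are hypothetical, and counting only opening tags (with self-closing tags counted once) is an assumption about what constitutes one occurrence.

```python
from collections import Counter
from html.parser import HTMLParser

class TagCounter(HTMLParser):
    """Count every opening tag (self-closing tags included) seen in a page."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        self.counts[tag] += 1

def tag_proportions(html):
    """Return {tag: occurrences of tag / total occurrences of all tags}."""
    parser = TagCounter()
    parser.feed(html)
    parser.close()
    total = sum(parser.counts.values())
    if total == 0:
        return {}                   # a page without tags has no proportions
    return {tag: n / total for tag, n in parser.counts.items()}
```

For a page containing one `html`, one `body`, one `div`, and two `a` tags, the proportion of `a` is 2/5 and the proportions sum to 1.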

In the embodiment of the present application, the execution sequence of step 201 and step 203 is not limited. Step 201 may be performed before step 203, may be performed after step 203, or may be performed simultaneously with step 203.

Step 204, constructing a feature vector of a first preset quantity dimension corresponding to the web page to be identified according to the TF-IDF weight of each text word and the proportion of each HTML tag.

The first preset quantity can be set according to actual requirements. The electronic device constructs the feature vector of the first preset quantity dimension corresponding to the web page to be identified according to the TF-IDF weight of each text word of the web page to be identified and the proportion of each HTML tag.

In an embodiment of the application, the electronic device determines the feature vector of the first preset quantity dimension corresponding to the web page to be identified in the following manner. Specifically, the electronic device determines the second preset number of text words with the highest TF-IDF weight among the at least one text word as the web page representative words of the web page to be identified. The electronic device then constructs the feature vector of the first preset quantity dimension corresponding to the web page to be identified according to the TF-IDF weight of each web page representative word and the proportion of each HTML tag. The second preset number is smaller than the first preset number.

In an optional embodiment, the electronic device detects whether the total number of the at least one text word of the web page to be recognized is less than the second preset number. If the total number is less than the second preset number, the electronic device acquires a target number of blank spaces, where the target number is the difference between the second preset number and the total number of the at least one text word. The electronic device takes the at least one text word of the web page to be recognized together with the target number of blank spaces as the web page representative words of the web page to be recognized. If the total number is greater than or equal to the second preset number, the electronic device extracts the second preset number of text words with the highest TF-IDF weight from the at least one text word of the web page to be recognized as the web page representative words of the web page to be recognized.
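The selection and padding of representative words, and one possible way of assembling the feature vector, might be sketched as follows. Several details are assumptions made for illustration: the total number of text words is taken as the number of distinct words, a blank space is represented by an empty string with weight 0.0, and the HTML part of the vector follows a fixed, hypothetical tag list `tags`, so that the first preset quantity equals `k + len(tags)`.

```python
def representative_words(weights, k, blank=""):
    """Pick the k highest-weighted words, padding with blanks when fewer than k exist."""
    ranked = sorted(weights, key=lambda w: (-weights[w], w))  # ties broken alphabetically
    if len(ranked) < k:
        return ranked + [blank] * (k - len(ranked))
    return ranked[:k]

def feature_vector(weights, tag_props, k, tags):
    """Concatenate k TF-IDF weights with the proportions of a fixed tag list."""
    reps = representative_words(weights, k)
    word_part = [weights.get(w, 0.0) for w in reps]   # blanks contribute 0.0
    tag_part = [tag_props.get(t, 0.0) for t in tags]  # absent tags contribute 0.0
    return word_part + tag_part
```

Padding keeps every web page's vector at the same length, which a vector classification model requires regardless of how few text words a page contains.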

Step 205, inputting the feature vector corresponding to the webpage to be identified into a preset vector classification model to obtain the type of the webpage to be identified.

The preset vector classification model may be a preset machine learning classification algorithm. Machine learning classification algorithms include, but are not limited to, logistic regression algorithms, support vector machines, decision trees, neural networks, and the like. The preset vector classification model may also be a vector classification model obtained by training a machine learning classification algorithm. For example, the preset vector classification model is a vector classification model obtained by training with the model training method shown in fig. 1. This is not limited in the embodiments of the present application.
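As one hedged illustration of such a model, the sketch below trains a binary logistic regression by batch gradient descent and uses it to classify a feature vector. It is written in plain Python rather than with any particular library; the function names, learning rate, and epoch count are assumptions. Distinguishing more than two web page types would need an extension such as one-vs-rest, which is omitted here.

```python
import math

def train_logistic(samples, labels, epochs=500, lr=0.5):
    """Fit binary logistic regression by batch gradient descent.

    samples: list of equal-length feature vectors; labels: list of 0 or 1.
    Returns (weights, bias).
    """
    dim = len(samples[0])
    n = len(samples)
    weights = [0.0] * dim
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * dim
        grad_b = 0.0
        for x, y in zip(samples, labels):
            z = bias + sum(w * xi for w, xi in zip(weights, x))
            err = 1.0 / (1.0 + math.exp(-z)) - y   # predicted probability minus label
            for i, xi in enumerate(x):
                grad_w[i] += err * xi
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias

def predict(model, x):
    """Return 1 if the model scores x at probability >= 0.5, else 0."""
    weights, bias = model
    z = bias + sum(w * xi for w, xi in zip(weights, x))
    return 1 if z >= 0 else 0
```

In practice the training samples would be the feature vectors of the sample web pages and the labels their known types, in line with the model training method shown in fig. 1.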

The description of steps 201-205 is relatively brief; for details, reference may be made to the related description of steps 101-105.

The web page type identification method and the model training method may be executed on the same device or on different devices. This is not limited in the embodiments of the present application.

According to the technical solution provided by the embodiments of the present application, the type of a web page is identified by using the preset vector classification model, so a database covering multiple types of web pages does not need to be constructed, and the labor consumed by web page type identification is reduced. In addition, the preset vector classification model can be trained and adjusted according to actual needs, so the electronic device can identify the types of both known and unknown web pages through the preset vector classification model, which increases the number of web page types that can be effectively identified.

In addition, compared with identifying web page types by matching web page features against a database, identifying web page types with the preset vector classification model greatly reduces the amount of computation performed by the electronic device and improves the efficiency of web page type identification.

Corresponding to the model training method and the web page type identification method shown in fig. 1-2, the embodiment of the present application provides a web page type identification apparatus. Referring to fig. 3, fig. 3 is a schematic structural diagram of a web page type identification apparatus according to an embodiment of the present application. The apparatus comprises: a first word segmentation unit 301, a first statistical unit 302, a second statistical unit 303, a first construction unit 304 and a first identification unit 305.

a first word segmentation unit 301, configured to perform word segmentation processing on text content on a web page to be identified to obtain at least one text word;

a first statistical unit 302, configured to count a TF-IDF weight of each text word;

a second statistical unit 303, configured to count the proportion of the number of occurrences of each HTML tag in the web page to be identified to the total number of occurrences, where the total number of occurrences is the sum of the numbers of occurrences of all HTML tags in the web page to be identified;

a first construction unit 304, configured to construct a feature vector of a first preset quantity dimension corresponding to the web page to be identified according to the TF-IDF weight of each text word and the proportion of each HTML tag;

a first identification unit 305, configured to input the feature vector corresponding to the web page to be identified into a preset vector classification model to obtain the type of the web page to be identified.

In an alternative embodiment, the first word segmentation unit 301 may be specifically configured to:

performing word segmentation processing on the text content in the webpage to be recognized to obtain at least one text word;

performing word segmentation processing on a link on a webpage to be recognized to obtain at least one character string;

and combining the at least one word and the at least one character string to obtain at least one text word of the webpage to be identified.

In an optional embodiment, the first statistical unit 302 may specifically be configured to:

determining a term frequency TF_w of each of the at least one text word using the following formula:

determining an inverse document frequency IDF_w of each of the at least one text word using the following formula:

determining a TF-IDF weight δ_w of each of the at least one text word using the following formula:

δ_w = TF_w * IDF_w;

wherein w represents a text word w of the at least one text word, T_w represents the number of occurrences of the text word w in the at least one text word, and T₀ represents the total number of the at least one text word; F_w represents the number of web pages including the text word w in a preset corpus; F₀ represents the total number of web pages included in the preset corpus.

In an alternative embodiment, the first constructing unit 304 may specifically be configured to:

determining a second preset number of text words with the highest TF-IDF weight in at least one text word as a webpage representative word of the webpage to be identified;

and constructing a feature vector of a first preset quantity dimension corresponding to the webpage to be identified according to the TF-IDF weight of each webpage representative word and the proportion of each HTML label, wherein the second preset quantity is smaller than the first preset quantity.

detecting whether the total number of at least one text word is smaller than a second preset number;

if yes, acquiring a target number of blank spaces, where the target number is the difference between the second preset number and the total number of the at least one text word; taking the at least one text word and the target number of blank spaces as the web page representative words of the web page to be identified;

if not, extracting a second preset number of text words with the highest TF-IDF weight from at least one text word to be used as a webpage representative word of the webpage to be identified.

In an optional embodiment, the apparatus for identifying a type of a web page may further include:

the second acquisition unit is used for acquiring a preset training set, and the preset training set comprises a plurality of sample webpages and the type of each sample webpage;

the second word segmentation unit is used for carrying out word segmentation on the text content on each sample webpage to obtain at least one text word of each sample webpage;

the third statistical unit is used for counting TF-IDF weight of each text word in at least one text word of each sample webpage;

the fourth statistical unit is used for counting the proportion of the occurrence frequency of each HTML label in each sample webpage to the total occurrence frequency;

the second construction unit is used for constructing a feature vector of a first preset quantity dimension corresponding to each sample webpage according to the TF-IDF weight of each text word of each sample webpage and the proportion of each HTML label;

and the training unit is used for training a preset machine learning classification algorithm by using the feature vector corresponding to each sample webpage and the type of each sample webpage to obtain a preset vector classification model.

Corresponding to the model training method and the web page type identification method shown in fig. 1-2, an embodiment of the present application further provides an electronic device, as shown in fig. 4, including a processor 401 and a machine-readable storage medium 402, where the machine-readable storage medium 402 stores machine-executable instructions that can be executed by the processor 401. Processor 401 is caused by machine executable instructions to implement any of the steps shown in fig. 1-2 described above.

In an optional embodiment, as shown in fig. 4, the electronic device may further include: a communication interface 403 and a communication bus 404; the processor 401, the machine-readable storage medium 402, and the communication interface 403 complete communication with each other through the communication bus 404, and the communication interface 403 is used for communication between the electronic device and other devices.

Corresponding to the model training method and the web page type identification method shown in fig. 1-2, embodiments of the present application further provide a machine-readable storage medium storing machine-executable instructions executable by a processor. The processor is caused by machine executable instructions to implement any of the steps shown in fig. 1-2 described above.

The communication bus may be a PCI (Peripheral Component Interconnect) bus, an EISA (Extended Industry Standard Architecture) bus, or the like. The communication bus may be divided into an address bus, a data bus, a control bus, etc.

The machine-readable storage medium may include a RAM (Random Access Memory) and a NVM (Non-Volatile Memory), such as at least one disk Memory. Additionally, the machine-readable storage medium may be at least one memory device located remotely from the aforementioned processor.

The processor may be a general-purpose processor, including a Central Processing Unit (CPU), a Network Processor (NP), and the like; it may also be a DSP (Digital Signal Processor), an ASIC (Application Specific Integrated Circuit), an FPGA (Field Programmable Gate Array) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.

It is noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.

All the embodiments in the present specification are described in a related manner, and the same and similar parts among the embodiments may be referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, for embodiments based on the web page type identification apparatus, the electronic device, and the machine-readable storage medium, since they are substantially similar to the embodiments based on the web page type identification method, the description is relatively simple, and for relevant points, reference may be made to the partial description of the embodiments based on the web page type identification method.

The above description is only for the preferred embodiment of the present application, and is not intended to limit the scope of the present application. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present application are included in the scope of protection of the present application.

Claims

1. A method for identifying a webpage type, the method comprising:

counting the term frequency-inverse document frequency (TF-IDF) weight of each text word;

counting the proportion of the occurrence frequency of each HTML label in the webpage to be recognized to the total occurrence frequency, wherein the total occurrence frequency is the sum of the occurrence frequencies of all HTML labels in the webpage to be recognized;

inputting the feature vector corresponding to the webpage to be identified into a preset vector classification model to obtain the type of the webpage to be identified;

the step of performing word segmentation processing on the text content on the webpage to be recognized to obtain at least one text word comprises the following steps:

performing word segmentation processing on the link on the webpage to be recognized to obtain at least one character string;

and combining the at least one word and the at least one character string to obtain at least one text word of the webpage to be recognized.

2. The method of claim 1, wherein the step of counting the TF-IDF weights for each text term comprises:

determining a term frequency TF_w of each of the at least one text word using the following formula:

determining an inverse document frequency IDF_w of each of the at least one text word using the following formula:

determining a TF-IDF weight δ_w of each of the at least one text word using the following formula:

δ_w = TF_w * IDF_w;

wherein w represents a text word w of the at least one text word, T_w represents the number of occurrences of the text word w in the at least one text word, and T₀ represents the total number of the at least one text word; F_w represents the number of web pages including the text word w in a preset corpus; F₀ represents the total number of web pages included in the preset corpus.

3. The method according to claim 1, wherein the step of constructing the feature vector of the first preset quantity dimension corresponding to the web page to be recognized according to the TF-IDF weight of each text word and the proportion of each HTML tag comprises:

determining a second preset number of text words with the highest TF-IDF weight in the at least one text word, wherein the second preset number of text words are web page representative words of the web page to be identified;

and constructing a feature vector of a first preset quantity dimension corresponding to the web page to be identified according to the TF-IDF weight of each web page representative word and the proportion of each HTML tag, wherein the second preset number is smaller than the first preset quantity.

4. The method of claim 3, wherein the step of determining a second preset number of text words with highest TF-IDF weight among the at least one text word as the web page representative words of the web page to be identified comprises:

detecting whether the total number of the at least one text word is smaller than a second preset number or not;

if yes, acquiring a target number of blank spaces, wherein the target number is the difference between the second preset number and the total number of the at least one text word; taking the at least one text word and the target number of blank spaces as the web page representative words of the web page to be recognized;

if not, extracting a second preset number of text words with the highest TF-IDF weight from the at least one text word to be used as the webpage representative words of the webpage to be identified.

5. The method according to any one of claims 1-4, further comprising:

acquiring a preset training set, wherein the preset training set comprises a plurality of sample webpages and the type of each sample webpage;

performing word segmentation processing on text content on each sample webpage to obtain at least one text word of each sample webpage;

counting TF-IDF weight of each text word in at least one text word of each sample webpage;

counting the proportion of the occurrence frequency of each HTML label in each sample webpage to the total occurrence frequency;

constructing a first preset quantity dimension characteristic vector corresponding to each sample webpage according to the TF-IDF weight of each text word of each sample webpage and the proportion of each HTML label;

and training a preset machine learning classification algorithm by using the feature vector corresponding to each sample webpage and the type of each sample webpage to obtain the preset vector classification model.

6. An apparatus for identifying a type of a web page, the apparatus comprising:

the first statistical unit is used for counting the term frequency-inverse document frequency (TF-IDF) weight of each text word;

the second statistical unit is used for counting the proportion of the number of occurrences of each HTML tag in the web page to be identified to the total number of occurrences, wherein the total number of occurrences is the sum of the numbers of occurrences of all the HTML tags in the web page to be identified;

the first construction unit is used for constructing a feature vector of a first preset quantity dimension corresponding to the web page to be identified according to the TF-IDF weight of each text word and the proportion of each HTML tag;

the first identification unit is used for inputting the characteristic vector corresponding to the webpage to be identified into a preset vector classification model to obtain the type of the webpage to be identified;

the first word segmentation unit is specifically configured to: performing word segmentation processing on the text content in the webpage to be recognized to obtain at least one text word; performing word segmentation processing on the link on the webpage to be recognized to obtain at least one character string; and combining the at least one word and the at least one character string to obtain at least one text word of the webpage to be recognized.

7. The apparatus according to claim 6, wherein the first statistical unit is specifically configured to:

δ_w = TF_w * IDF_w;

wherein w represents a text word w of the at least one text word, T_w represents the number of occurrences of the text word w in the at least one text word, and T₀ represents the total number of the at least one text word; F_w represents the number of web pages including the text word w in a preset corpus; F₀ represents the total number of web pages included in the preset corpus.

8. The apparatus according to claim 6, characterized in that said first construction unit is specifically configured to:

9. The apparatus according to claim 8, characterized in that said first construction unit is specifically configured to:

detecting whether the total number of the at least one text word is smaller than a second preset number;

if so, acquiring a target number of blank spaces, wherein the target number is the difference between the second preset number and the total number of the at least one text word; taking the at least one text word and the target number of blank spaces as the web page representative words of the web page to be recognized;

10. The apparatus according to any one of claims 6-9, further comprising:

the second construction unit is used for constructing a first preset quantity dimension characteristic vector corresponding to each sample webpage according to the TF-IDF weight of each text word of each sample webpage and the proportion of each HTML label;

and the training unit is used for training a preset machine learning classification algorithm by using the feature vector corresponding to each sample webpage and the type of each sample webpage to obtain the preset vector classification model.

11. An electronic device comprising a processor and a machine-readable storage medium storing machine-executable instructions executable by the processor, the processor being caused by the machine-executable instructions to: carrying out the method steps of any one of claims 1 to 5.

12. A machine-readable storage medium having stored thereon machine-executable instructions executable by a processor, the processor being caused by the machine-executable instructions to: carrying out the method steps of any one of claims 1 to 5.