CN108255891B - Method and device for judging webpage type - Google Patents

Info

Publication number
CN108255891B
Authority
CN
China
Prior art keywords
webpage
information
type
serving
ratio
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201611270198.0A
Other languages
Chinese (zh)
Other versions
CN108255891A (en
Inventor
郑立颖
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Gridsum Technology Co Ltd
Original Assignee
Beijing Gridsum Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Gridsum Technology Co Ltd filed Critical Beijing Gridsum Technology Co Ltd
Priority to CN201611270198.0A priority Critical patent/CN108255891B/en
Publication of CN108255891A publication Critical patent/CN108255891A/en
Application granted granted Critical
Publication of CN108255891B publication Critical patent/CN108255891B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/955Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
    • G06F16/9562Bookmark management
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/951Indexing; Web crawling techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The invention discloses a method for judging webpage types, which comprises the following steps: acquiring page information of a webpage to be judged; extracting title information from the page information; judging whether the title information contains preset keywords, wherein the preset keywords are keywords containing webpage types; and if the title information does not contain the preset keywords, obtaining the webpage type of the webpage to be judged based on the page structure information corresponding to the page information and/or the title information. The method and the device can solve the prior-art problem that classifying webpage types manually is inefficient. The invention also discloses a device for judging the webpage type.

Description

Method and device for judging webpage type
Technical Field
The invention relates to the technical field of webpage classification, in particular to a method and a device for judging webpage types.
Background
With the rapid development of Internet technology, search engines index more and more web pages, and judging the types of those web pages is increasingly important. The web page type refers to the media property of the web page, and can be divided into news, forums, blogs, posts, questions and answers, and the like. There are many application scenarios for classifying web page types, such as: 1. brand exposure analysis, in which the URLs (Uniform Resource Locators) where a brand is exposed are collected and counted and the website categories are analyzed, so that it can be known on which media types the brand is exposed more, helping a brand owner select exposure media in a more targeted manner; 2. brand public-sentiment analysis, in which public sentiment about a brand is counted to learn the positive and negative information about the brand on different media types, so that such information can be handled and released more effectively; 3. web page crawling, in which identifying the web page type in advance allows different page-parsing logic to be determined and page information to be extracted more reasonably. At present, the classification of web page types mainly depends on manual work, which is time-consuming and labor-intensive and clearly cannot keep pace with the sharp increase in the number of web pages. How to improve the efficiency of classifying web page types is therefore an urgent problem to be solved.
Disclosure of Invention
In view of the above problems, the present invention provides a method and an apparatus for determining a web page type, so as to solve the problem in the prior art that efficiency of classifying web page types by manual methods is low.
The invention provides a method for judging webpage types, which comprises the following steps:
acquiring page information of a webpage to be judged;
extracting title information from the page information;
judging whether the title information contains preset keywords, wherein the preset keywords are keywords containing webpage types;
and if the title information does not contain the preset keywords, obtaining the webpage type of the webpage to be judged based on the page structure information corresponding to the page information and/or the title information.
Preferably, the method further comprises:
and if the title information contains the preset keywords, taking the webpage type corresponding to the preset keywords as the webpage type of the webpage to be judged.
Preferably, the acquiring page information of the web page to be determined includes:
analyzing the webpage to be judged, and extracting a domain name of a link corresponding to the webpage to be judged;
and simulating to access a Uniform Resource Locator (URL) corresponding to the domain name, and crawling page information of the webpage to be judged.
Preferably, the obtaining the webpage type of the webpage to be judged based on the page structure information corresponding to the page information and/or the title information includes:
acquiring page information of a plurality of webpages serving as reference standards under at least one known webpage type;
extracting tag information serving as a reference standard from page structure information corresponding to the page information of the webpage serving as the reference standard, and counting the number of the tag information serving as the reference standard in each known webpage type;
extracting at least one piece of label information from the page structure information corresponding to the page information of the webpage to be judged;
matching each piece of label information with the label information serving as the reference standard, and counting the number of the label information which is successfully matched under each known webpage type;
acquiring the ratio of the number of successfully matched label information under each known webpage type to the number of label information serving as a reference standard under the known webpage type, and comparing the ratio with a preset ratio;
and if the ratio is larger than or equal to the preset ratio, taking the known webpage type corresponding to the ratio as the webpage type of the webpage to be judged.
Preferably, the obtaining the webpage type of the webpage to be judged based on the page structure information corresponding to the page information and/or the title information includes:
acquiring title information of a plurality of webpages serving as reference standards under at least one known webpage type;
splitting phrases serving as reference standards from the title information of the webpage serving as the reference standards, and counting the number of the phrases serving as the reference standards under each known webpage type;
splitting at least one phrase from the title information of the webpage to be judged;
matching each phrase with the phrases serving as the reference standards respectively, and counting the number of the phrases which are successfully matched under each known webpage type;
acquiring the ratio of the number of the successfully matched phrases in each known webpage type to the number of the phrases serving as the reference standard in the known webpage type, and comparing the ratio with a preset ratio;
and if the ratio is larger than or equal to the preset ratio, taking the known webpage type corresponding to the ratio as the webpage type of the webpage to be judged.
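The title-based steps above can be sketched as follows. This is only a minimal illustration under stated assumptions: the source does not say how phrases are split from a title, so splitting on non-word characters is assumed here (Chinese titles would need a word-segmentation step instead), and the function names and sample titles are hypothetical.

```python
import re
from typing import Iterable, Set


def split_phrases(title: str) -> Set[str]:
    """Split a title into lower-cased phrases (assumed: runs of word characters)."""
    return set(re.findall(r"\w+", title.lower()))


def phrase_match_ratio(reference_titles: Iterable[str], title: str) -> float:
    """Matched reference phrases divided by all reference phrases of one known type."""
    reference: Set[str] = set()
    for reference_title in reference_titles:
        reference |= split_phrases(reference_title)
    if not reference:
        return 0.0
    matched = split_phrases(title) & reference
    return len(matched) / len(reference)


# Hypothetical reference titles for one known webpage type.
qa_titles = ["How do I reset my password?", "How do I export data?"]
ratio = phrase_match_ratio(qa_titles, "How do I reset my data export")
print(ratio)  # 7 of the 8 reference phrases are matched
```

The resulting ratio would then be compared with the preset ratio, and the known webpage type is adopted only when the ratio is greater than or equal to it.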
Preferably, the obtaining the webpage type of the webpage to be judged based on the page structure information corresponding to the page information and/or the title information includes:
acquiring page information of a plurality of webpages serving as reference standards under at least one known webpage type;
extracting tag information serving as a reference standard from page structure information corresponding to the page information of the webpage serving as the reference standard, and counting the number of the tag information serving as the reference standard in each known webpage type;
extracting at least one piece of label information from the page structure information corresponding to the page information of the webpage to be judged;
matching each piece of label information with the label information serving as the reference standard, and counting the number of the label information which is successfully matched under each known webpage type;
acquiring a first ratio of the number of successfully matched tag information under each known webpage type to the number of tag information serving as a reference standard under the known webpage type, and comparing the first ratio with a first preset ratio;
if the first ratio is larger than or equal to the first preset ratio, acquiring title information of a plurality of webpages serving as reference standards under the known webpage types corresponding to the first ratio;
splitting phrases serving as reference standards from the title information of the webpage serving as the reference standards, and counting the number of the phrases serving as the reference standards under each known webpage type;
splitting at least one phrase from the title information of the webpage to be judged;
matching each phrase with the phrases serving as the reference standards respectively, and counting the number of the phrases which are successfully matched under each known webpage type;
acquiring a second ratio of the number of the successfully matched phrases in each known webpage type to the number of the phrases serving as the reference standard in the known webpage type, and comparing the second ratio with a second preset ratio;
and if the second ratio is greater than or equal to the second preset ratio, taking the known webpage type corresponding to the second ratio as the webpage type of the webpage to be judged.
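The two-stage variant above, in which the title comparison is only performed for known webpage types that already pass the tag comparison, can be sketched as follows. This is a hedged illustration, not the patented implementation: aggregating tag matches over all tags, representing phrases as sets, and every name and sample value below are assumptions.

```python
from collections import Counter
from typing import Dict, Optional, Set, Tuple


def tag_ratio(reference: Counter, page: Counter) -> float:
    """First ratio: matched tag occurrences over all reference tag occurrences."""
    total = sum(reference.values())
    if total == 0:
        return 0.0
    return sum(min(n, page[tag]) for tag, n in reference.items()) / total


def phrase_ratio(reference: Set[str], page: Set[str]) -> float:
    """Second ratio: matched phrases over all reference phrases."""
    return len(reference & page) / len(reference) if reference else 0.0


def classify_two_stage(
    page_tags: Counter,
    page_phrases: Set[str],
    references: Dict[str, Tuple[Counter, Set[str]]],
    first_preset: float,
    second_preset: float,
) -> Optional[str]:
    """Return the first known type passing both thresholds, or None."""
    for page_type, (ref_tags, ref_phrases) in references.items():
        if tag_ratio(ref_tags, page_tags) < first_preset:
            continue  # titles are compared only for types passing the tag stage
        if phrase_ratio(ref_phrases, page_phrases) >= second_preset:
            return page_type
    return None


# Hypothetical reference data for two known webpage types.
references = {
    "forum": (Counter(div=4, li=4), {"reply", "thread", "post"}),
    "news": (Counter(meta=4, p=4), {"report", "daily"}),
}
page_tags = Counter(meta=4, p=3, div=1)
print(classify_two_stage(page_tags, {"daily", "report", "weather"}, references, 0.5, 0.5))
```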
An apparatus for discriminating a type of a web page, comprising:
the acquisition module is used for acquiring page information of a webpage to be judged;
the extraction module is used for extracting the title information from the page information;
the judging module is used for judging whether the title information contains preset keywords, and the preset keywords are keywords containing webpage types;
and the processing module is used for obtaining the webpage type of the webpage to be judged based on the page structure information and/or the title information corresponding to the page information if the preset keyword is not contained in the title information.
Preferably, the processing module is further configured to, if the title information contains the preset keyword, use the webpage type corresponding to the preset keyword as the webpage type of the webpage to be judged.
Preferably, the obtaining module includes:
the analysis unit is used for analyzing the webpage to be judged and extracting a domain name of a link corresponding to the webpage to be judged;
and the simulation access unit is used for simulating and accessing the uniform resource locator URL corresponding to the domain name and crawling the page information of the webpage to be judged.
Preferably, the processing module comprises:
the first acquisition unit is used for acquiring page information of a plurality of webpages serving as reference standards under at least one known webpage type;
the first statistical unit is used for extracting label information serving as a reference standard from the page structure information corresponding to the page information of the webpage serving as the reference standard, and counting the number of the label information serving as the reference standard under each known webpage type;
the first extraction unit is used for extracting at least one piece of label information from the page structure information corresponding to the page information of the webpage to be judged;
the first matching unit is used for matching each piece of label information with the label information serving as the reference standard respectively and counting the number of the label information which is successfully matched under each known webpage type;
the first comparison unit is used for acquiring the ratio of the number of the successfully matched label information under each known webpage type to the number of the label information serving as a reference standard under the known webpage type, and comparing the ratio with a preset ratio;
and the first output unit is used for taking the known webpage type corresponding to the ratio as the webpage type of the webpage to be judged if the ratio is greater than or equal to the preset ratio.
Preferably, the processing module comprises:
the second acquisition unit is used for acquiring title information of a plurality of webpages serving as reference standards under at least one known webpage type;
the second statistical unit is used for splitting phrases serving as the reference standard from the title information of the webpage serving as the reference standard and counting the number of the phrases serving as the reference standard in each known webpage type;
the first splitting unit is used for splitting at least one phrase from the title information of the webpage to be judged;
the second matching unit is used for matching each phrase with the phrases serving as the reference standards respectively and counting the number of the phrases which are successfully matched under each known webpage type;
the second comparison unit is used for acquiring the ratio of the number of the successfully matched phrases in each known webpage type to the number of the phrases serving as the reference standard in the known webpage type and comparing the ratio with a preset ratio;
and the second output unit is used for taking the known webpage type corresponding to the ratio as the webpage type of the webpage to be judged if the ratio is greater than or equal to the preset ratio.
Preferably, the processing module comprises:
the third acquisition unit is used for acquiring page information of a plurality of webpages serving as reference standards under at least one known webpage type;
a third statistical unit, configured to extract tag information serving as a reference standard from the page structure information corresponding to the page information of the web page serving as the reference standard, and count the number of tag information serving as the reference standard in each known web page type;
the second extraction unit is used for extracting at least one piece of label information from the page structure information corresponding to the page information of the webpage to be judged;
the third matching unit is used for matching each piece of label information with the label information serving as the reference standard respectively and counting the number of the label information which is successfully matched under each known webpage type;
the third comparison unit is used for acquiring a first ratio of the number of successfully matched label information under each known webpage type to the number of label information serving as a reference standard under the known webpage type, and comparing the first ratio with a first preset ratio;
a fourth obtaining unit, configured to obtain title information of a plurality of webpages serving as reference standards in a known webpage type corresponding to the first ratio if the first ratio is greater than or equal to the first preset ratio;
the fourth statistical unit is used for splitting the word groups serving as the reference standard from the title information of the webpage serving as the reference standard, and counting the number of the word groups serving as the reference standard in each known webpage type;
the second splitting unit is used for splitting at least one phrase from the title information of the webpage to be judged;
the fourth matching unit is used for matching each phrase with the phrases serving as the reference standards respectively and counting the number of the phrases which are successfully matched under each known webpage type;
the fourth comparison unit is used for acquiring a second ratio of the number of the successfully matched phrases in each known webpage type to the number of the phrases serving as the reference standard in the known webpage type, and comparing the second ratio with a second preset ratio;
and the third output unit is used for taking the known webpage type corresponding to the second ratio as the webpage type of the webpage to be judged if the second ratio is greater than or equal to the second preset ratio.
By means of the technical scheme, when the webpage type needs to be judged, firstly, the page information of the webpage to be judged is obtained, then, the title information is extracted from the obtained page information, then, whether the title information contains preset keywords capable of directly judging the webpage type is further judged, and when the title information does not contain the preset keywords, the webpage type of the webpage to be judged is obtained through the page structure information and/or the title information corresponding to the page information. Compared with the prior art that the webpage types are classified in a manual mode, the method and the device can automatically realize the classification of the webpage types and improve the efficiency of the classification of the webpage types.
Drawings
Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the invention. Also, like reference numerals are used to refer to like parts throughout the drawings. In the drawings:
FIG. 1 is a flowchart of a method of embodiment 1 of a method for discriminating a type of a web page according to the present disclosure;
FIG. 2 is a flowchart of a method of embodiment 2 of the method for discriminating a type of a web page disclosed in the present invention;
FIG. 3 is a flowchart of a method of embodiment 3 of the method for determining a type of a web page disclosed in the present invention;
FIG. 4 is a flowchart of a method of embodiment 4 of the method for determining a type of a web page disclosed in the present invention;
FIG. 5 is a schematic structural diagram of an embodiment 1 of an apparatus for determining a type of a web page disclosed in the present invention;
FIG. 6 is a schematic structural diagram of an embodiment 2 of an apparatus for determining a type of a web page disclosed in the present invention;
FIG. 7 is a schematic structural diagram of an embodiment 3 of an apparatus for determining a type of a web page according to the present disclosure;
FIG. 8 is a schematic structural diagram of an embodiment 4 of the apparatus for determining a webpage type disclosed in the present invention.
Detailed Description
Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
As shown in fig. 1, which is a flowchart of a method of embodiment 1 of the method for determining a web page type disclosed in the present invention, the method may include the following steps:
S101, acquiring page information of a webpage to be judged;
When the type of a webpage needs to be determined, for example whether the webpage is a news webpage or a forum webpage, the page information of the webpage to be judged is first obtained. The page information of the webpage to be judged comprises title information and page structure information.
Specifically, one implementation of obtaining the page information of the webpage to be judged is to parse the webpage to be judged, extract the domain name of the link corresponding to it, then simulate access to the Uniform Resource Locator (URL) corresponding to the domain name and crawl the page information of the webpage to be judged. When the webpage to be judged is parsed, its original URL may be parsed. The domain name in the link is extracted by parsing, where the domain name can be defined as the part of the URL between the first "http://" and the first ":" that appears after it. For example, if the link of the webpage to be judged is http://example.com:1234/test.htm, the domain name extracted by parsing is example.com. When access to the URL corresponding to the domain name is simulated and the page information is crawled, a Python crawler library can be used to simulate the access, or another programming language can be used; the information in the page is crawled through the simulated access.
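The domain-name extraction and simulated access described above can be sketched with the Python standard library. This is a minimal sketch under assumptions: the source names no specific crawler library, so `urllib` is used for illustration, and the helper names and the User-Agent value are hypothetical.

```python
from urllib.parse import urlsplit
from urllib.request import Request, urlopen


def extract_domain(url: str) -> str:
    """Return the host part of a URL, without scheme, port, or path."""
    host = urlsplit(url).hostname  # None when the URL has no network location
    if host is None:
        raise ValueError(f"no domain name found in {url!r}")
    return host


def fetch_page(url: str, timeout: float = 10.0) -> str:
    """Simulate a browser visit and return the page source (hypothetical helper)."""
    # The User-Agent value is an illustrative assumption.
    request = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urlopen(request, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


print(extract_domain("http://example.com:1234/test.htm"))  # example.com
```

Using `urlsplit` rather than searching for the first ":" also handles URLs that carry no explicit port.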
S102, extracting title information from the page information;
After the page information of the webpage to be judged is obtained, the title information is extracted from the crawled HTML (HyperText Markup Language) page.
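One possible way to extract the title information from a crawled HTML page is shown below. It is a sketch only: the source does not specify a parser, so the standard-library `html.parser` module is assumed, and the class and function names are hypothetical.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collect the text of the first <title> element in a page."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self._done = False
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "title" and not self._done:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self._done = True  # ignore any later <title> elements

    def handle_data(self, data):
        if self._in_title:
            self.parts.append(data)


def extract_title(html: str) -> str:
    """Return the page title with surrounding and repeated whitespace collapsed."""
    parser = TitleParser()
    parser.feed(html)
    parser.close()
    return " ".join("".join(parser.parts).split())


page = "<html><head><title>\n Lipstick picks - entertainment - forum </title></head><body><p>x</p></body></html>"
print(extract_title(page))
```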
S103, judging whether the title information contains preset keywords, wherein the preset keywords are keywords containing webpage types;
Then, it is judged whether the extracted title information contains a preset keyword from which the webpage type can be determined directly, for example, whether it contains preset keywords such as "career", "forum", "news", "blog", and the like.
S104, if the title information does not contain the preset keywords, obtaining the webpage type of the webpage to be judged based on the page structure information corresponding to the page information and/or the title information.
When the extracted title information is judged not to contain the preset keyword, that is, the webpage type cannot be determined directly from the extracted title information, the webpage to be judged is further classified based on the extracted HTML (HyperText Markup Language) page structure information and/or the extracted title information so as to obtain its webpage type. That is, when the title information does not contain the preset keyword, the webpage to be judged may be classified according to the page structure information corresponding to the page information, according to the title information in the page information, or according to both the page structure information and the title information.
It should be noted that, when the title information contains the preset keyword, the webpage type corresponding to the preset keyword is used as the webpage type of the webpage to be judged. For example, if the title information is "as a lipstick control, try a color bar-entertainment bagua-forum", which contains the preset keyword "forum", the type of the webpage to be judged can be determined to be forum.
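The keyword check of S103 and the direct decision just described can be sketched as a simple lookup. The keyword-to-type mapping below is an assumption built from some of the example keywords in the text, and the function name is hypothetical; returning `None` stands for falling through to the structure-based or title-based classification of S104.

```python
from typing import Optional

# Assumed mapping from preset keyword to webpage type (illustrative only).
PRESET_KEYWORDS = {"forum": "forum", "news": "news", "blog": "blog"}


def type_from_title(title: str) -> Optional[str]:
    """Return the webpage type named by a preset keyword in the title, if any."""
    lowered = title.lower()
    for keyword, page_type in PRESET_KEYWORDS.items():
        if keyword in lowered:
            return page_type
    return None  # no preset keyword: continue with S104


title = "as a lipstick control, try a color bar-entertainment bagua-forum"
print(type_from_title(title))  # forum
```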
In summary, in the above embodiment, when the webpage type needs to be judged, the page information of the webpage to be judged is first obtained, title information is then extracted from the obtained page information, and it is further judged whether the title information contains a preset keyword from which the webpage type can be determined directly; when the title information does not contain the preset keyword, the webpage type of the webpage to be judged is obtained through the page structure information corresponding to the page information and/or the title information. Compared with the prior art, in which webpage types are classified manually, the method and the device can classify webpage types automatically and improve the efficiency of the classification.
As shown in fig. 2, which is a flowchart of a method of embodiment 2 of the method for determining a web page type disclosed in the present invention, the method may include the following steps:
S201, acquiring page information of a webpage to be judged;
When the type of a webpage needs to be determined, for example whether the webpage is a news webpage or a forum webpage, the page information of the webpage to be judged is first obtained. The page information of the webpage to be judged comprises title information and page structure information.
Specifically, one implementation of obtaining the page information of the webpage to be judged is to parse the webpage to be judged, extract the domain name of the link corresponding to it, then simulate access to the Uniform Resource Locator (URL) corresponding to the domain name and crawl the page information of the webpage to be judged. When the webpage to be judged is parsed, its original URL may be parsed. The domain name in the link is extracted by parsing, where the domain name can be defined as the part of the URL between the first "http://" and the first ":" that appears after it. For example, if the link of the webpage to be judged is http://example.com:1234/test.htm, the domain name extracted by parsing is example.com. When access to the URL corresponding to the domain name is simulated and the page information is crawled, a Python crawler library can be used to simulate the access, or another programming language can be used; the information in the page is crawled through the simulated access.
S202, extracting title information from the page information;
After the page information of the webpage to be judged is obtained, the title information is extracted from the crawled HTML (HyperText Markup Language) page.
S203, judging whether the title information contains preset keywords, wherein the preset keywords are keywords containing webpage types;
Then, it is judged whether the extracted title information contains a preset keyword from which the webpage type can be determined directly, for example, whether it contains preset keywords such as "career", "forum", "news", "blog", and the like.
S204, if the title information does not contain preset keywords, acquiring page information of a plurality of webpages serving as reference standards under at least one known webpage type;
When the title information does not contain the preset keywords, that is, the webpage type cannot be judged directly from the title information of the webpage to be judged, webpages of at least one known webpage type are obtained, the page information of each of these webpages is obtained at the same time, and the obtained page information is used as the reference standard.
S205, extracting tag information serving as a reference standard from page structure information corresponding to the page information of the webpage serving as the reference standard, and counting the number of the tag information serving as the reference standard under each known webpage type;
After the page information of a plurality of webpages serving as reference standards under at least one known webpage type is acquired, tag information serving as the reference standard is extracted from the page structure information corresponding to that page information. Each piece of page structure information includes a plurality of pieces of tag information. For example, for a webpage of one known webpage type, the tag information includes "meta", "link", "span", "a", and "p"; counting the tag information serving as the reference standard gives 12 "meta", 3 "link", 5 "span", 3 "a", and 3 "p".
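Counting tag information in a page can be sketched as below. This is an illustration under assumptions: the source does not say how tags are extracted, so every start tag reported by the standard-library `html.parser` is counted, and the names and the sample page are hypothetical.

```python
from collections import Counter
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Count how often each tag name opens in an HTML page."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        self.counts[tag] += 1  # tag names arrive lower-cased


def count_tags(html: str) -> Counter:
    """Return a Counter mapping tag name to number of occurrences."""
    parser = TagCounter()
    parser.feed(html)
    parser.close()
    return parser.counts


sample = (
    '<html><head><meta charset="utf-8"><meta name="k" content="v">'
    '<link rel="stylesheet" href="s.css"></head>'
    '<body><p>Hello <a href="#">link</a> <span>x</span></p></body></html>'
)
counts = count_tags(sample)
print(counts["meta"], counts["link"], counts["span"], counts["a"], counts["p"])
```

Summing such counters over several reference pages of one known webpage type would give the per-type reference counts described in S205.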
S206, extracting at least one piece of label information from the page structure information corresponding to the page information of the webpage to be judged;
Meanwhile, at least one piece of tag information used for determining the page type is extracted from the page structure information corresponding to the page information of the webpage to be judged. For example, "meta" and "div" are extracted.
S207, matching each piece of label information with label information serving as a reference standard, and counting the number of the label information which is successfully matched under each known webpage type;
Each piece of tag information of the webpage to be judged is matched with the tag information serving as the reference standard. In the above example, matching finds that the tag information "meta" of the webpage to be judged matches the tag information "meta" serving as the reference standard. Then, the number of successfully matched tag information under each known webpage type is counted; here, 10 pieces of "meta" are counted.
S208, obtaining the ratio of the number of successfully matched label information under each known webpage type to the number of label information serving as a reference standard under the known webpage type, and comparing the ratio with a preset ratio;
Then, the ratio of the number of successfully matched tag information under each known webpage type to the number of tag information serving as the reference standard under that known webpage type is obtained; in the above example, the ratio for the tag information "meta" is 10/12 = 5/6. The obtained ratio is then compared with a preset ratio, which can be set flexibly according to actual requirements. Of course, the closer the preset ratio is set to the obtained ratio, the more accurate the determined webpage type.
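The arithmetic of the running example can be reproduced as follows. This is a sketch under assumptions: the page to be judged is taken to contain 10 "meta" tags (the 10 successful matches in the text), its "div" count and the preset ratio of 4/5 are invented for illustration, and the function name is hypothetical.

```python
from collections import Counter
from fractions import Fraction


def tag_match_ratio(reference: Counter, page: Counter, tag: str) -> Fraction:
    """Matched occurrences of one tag divided by its reference count."""
    reference_count = reference[tag]
    if reference_count == 0:
        return Fraction(0)
    matched = min(page[tag], reference_count)
    return Fraction(matched, reference_count)


# Reference counts from the example in the text.
reference = Counter(meta=12, link=3, span=5, a=3, p=3)
# Page to be judged: 10 "meta" matches; the "div" count is an assumption.
page = Counter(meta=10, div=4)

PRESET_RATIO = Fraction(4, 5)  # hypothetical threshold
ratio = tag_match_ratio(reference, page, "meta")
print(ratio, ratio >= PRESET_RATIO)  # 5/6 True
```

The "div" tag has no counterpart in the reference standard, so it contributes no match.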
S209, if the ratio is larger than or equal to the preset ratio, taking the known webpage type corresponding to the ratio as the webpage type of the webpage to be judged.
And when the obtained ratio is larger than or equal to the preset ratio, taking the known webpage type corresponding to the ratio as the webpage type of the webpage to be judged. Taking the above example as an example, the web page type corresponding to the tag information "meta", "link", "span", "a" and "p" will be used as the web page type of the web page to be determined.
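The tag-based steps above can be sketched in Python as follows. This is only an illustrative reading, not the patented implementation: the names `TagCounter`, `count_tags`, `match_ratio`, and `PRESET_RATIO` are hypothetical, the threshold value 3/4 is an assumption, and the ratio is interpreted as matched occurrences over reference occurrences for the tags found in both pages, which is the reading that reproduces the 5/6 figure of the example.

```python
from collections import Counter
from fractions import Fraction
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Counts how often each start tag occurs in an HTML document."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        self.counts[tag] += 1


def count_tags(html):
    parser = TagCounter()
    parser.feed(html)
    parser.close()
    return parser.counts


def match_ratio(page_counts, reference_counts):
    """Assumed reading: matched occurrences over reference occurrences,
    restricted to the tags present in both the page and the reference."""
    matched = [tag for tag in page_counts if tag in reference_counts]
    reference_total = sum(reference_counts[tag] for tag in matched)
    if reference_total == 0:
        return Fraction(0)
    matched_total = sum(min(page_counts[tag], reference_counts[tag]) for tag in matched)
    return Fraction(matched_total, reference_total)


# Reference counts taken from the example in the text.
reference = Counter({"meta": 12, "link": 3, "span": 5, "a": 3, "p": 3})

# Hypothetical page to be judged: 10 "meta" tags and one "div".
page_html = (
    "<html><head>" + "<meta charset='utf-8'>" * 10 + "</head>"
    "<body><div></div></body></html>"
)
page = count_tags(page_html)

PRESET_RATIO = Fraction(3, 4)  # assumed threshold, set per actual requirements
ratio = match_ratio(page, reference)
is_match = ratio >= PRESET_RATIO
print(ratio, is_match)  # 5/6 True
```

Using `Fraction` keeps the comparison exact, so 10/12 and 5/6 compare as equal without floating-point rounding.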
Fig. 3 is a flowchart of embodiment 3 of the method for determining a webpage type disclosed in the present invention. The method may include the following steps:
S301, acquiring page information of a webpage to be judged;
When the type of a webpage needs to be determined, for example whether it is a news webpage or a forum webpage, the page information of the webpage to be judged is first obtained; this page information comprises title information and page structure information.
Specifically, one implementation of obtaining the page information of the webpage to be judged is to parse the webpage to be judged to extract the domain name of its link, and then to simulate access to the uniform resource locator (URL) corresponding to the domain name and crawl the page information. Parsing the webpage to be judged may mean parsing its original URL. The domain name extracted from the link may be defined as the part of the URL between the first "http://" and the first ":" that follows it. For example, if the link of the webpage to be judged is http://example.com:1234/test.htm, the extracted domain name is example.com. The simulated access may use a Python crawler library or another programming language, and the information in the page is crawled through that simulated access.
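As a minimal sketch of the domain-name extraction and simulated access, the standard-library URL parser can stand in for the string rule described above (the text between "http://" and the following ":"). The function names `extract_domain` and `fetch_page` are hypothetical, and `fetch_page` is only one assumed way of simulating access; the text does not name a specific crawler library.

```python
from urllib.parse import urlsplit
from urllib.request import urlopen


def extract_domain(url):
    """Return the host part of a URL, without scheme, port, or path."""
    return urlsplit(url).hostname


def fetch_page(url, timeout=10):
    """Download a page and decode it to text (hypothetical helper; performs a network call)."""
    with urlopen(url, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


domain = extract_domain("http://example.com:1234/test.htm")
print(domain)  # example.com
```

Unlike the literal string rule, `urlsplit` also handles links that carry no explicit port, so http://example.com/test.htm yields the same domain name.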
S302, extracting title information from the page information;
After the page information of the webpage to be judged is obtained, the title information is extracted from the crawled HTML (HyperText Markup Language) page.
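Title extraction can be illustrated with Python's built-in HTML parser. This is a sketch under the assumption that the title information is the text of the page's first title element; `TitleParser` and `extract_title` are hypothetical names, and the sample HTML is invented for illustration.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collects the text inside the first <title> element."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self._done = False
        self._parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "title" and not self._done:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self._done = True

    def handle_data(self, data):
        if self._in_title:
            self._parts.append(data)

    @property
    def title(self):
        return "".join(self._parts).strip()


def extract_title(html):
    parser = TitleParser()
    parser.feed(html)
    parser.close()
    return parser.title


sample_html = (
    "<html><head><title>Lipstick shades - Entertainment Gossip - Forum</title></head>"
    "<body><p>Hello</p></body></html>"
)
title = extract_title(sample_html)
print(title)
```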
S303, judging whether the title information contains preset keywords, wherein the preset keywords are keywords containing webpage types;
Next, it is judged whether the extracted title information contains a preset keyword that directly determines the webpage type, for example "career", "forum", "news", or "blog".
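The keyword check can be sketched as a simple substring lookup. The mapping `PRESET_KEYWORDS` and the function `type_from_title` are hypothetical, and case-insensitive substring matching is an assumption; the text only states that the title is checked for preset keywords.

```python
# Hypothetical mapping from preset keyword to webpage type.
PRESET_KEYWORDS = {
    "forum": "forum",
    "news": "news",
    "blog": "blog",
}


def type_from_title(title):
    """Return the webpage type named by a preset keyword in the title, or None."""
    lowered = title.lower()
    for keyword, page_type in PRESET_KEYWORDS.items():
        if keyword in lowered:
            return page_type
    return None


direct = type_from_title("Try a lipstick shade - Entertainment Gossip - Forum")
undecided = type_from_title("When choosing a lipstick, there are many types")
print(direct, undecided)  # forum None
```

A result of `None` corresponds to the case where the title contains no preset keyword and the structure-based or title-based matching has to be used. Plain substring matching would also fire on words such as "newsletter", so a real keyword list would need more care.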
S304, if the title information does not contain preset keywords, acquiring the title information of a plurality of webpages serving as reference standards under at least one known webpage type;
When the title information does not contain a preset keyword, that is, when the webpage type cannot be judged directly from the title information of the webpage to be judged, at least one webpage of a known webpage type is first obtained, together with the title information in the page information of each such webpage; the obtained title information serves as the reference standard.
S305, splitting phrases serving as reference standards from the title information of the webpage serving as the reference standards, and counting the number of the phrases serving as the reference standards under each known webpage type;
After the title information of a plurality of webpages serving as reference standards under at least one known webpage type is acquired, the title information is split into phrases, which serve as the reference standard. For example, if the title information of one webpage of a known webpage type is "as a lipstick controller to select a lipstick brand, try the color of the lipstick", word segmentation of the title yields the reference phrases "lipstick", "brand", and "color", and counting each phrase gives 3 "lipstick", 1 "brand", and 1 "color".
S306, splitting at least one phrase from the title information of the webpage to be judged;
Meanwhile, at least one phrase is split from the title information of the webpage to be judged. For example, if that title information is "when the lipstick is selected, the lipstick has various types", the split phrases include "lipstick" and "type".
S307, matching each phrase with phrases serving as reference standards respectively, and counting the number of the phrases which are successfully matched under each known webpage type;
Each phrase split from the title information of the webpage to be judged is matched against the phrases serving as the reference standard. In the above example, matching finds that "lipstick" in the title information of the webpage to be judged matches the reference phrase "lipstick". The number of successfully matched phrases under each known webpage type is then counted; here, 2 occurrences of "lipstick" are counted.
S308, obtaining the ratio of the number of the successfully matched phrases in each known webpage type to the number of the phrases serving as the reference standard in the known webpage type, and comparing the ratio with a preset ratio;
Then, the ratio of the number of successfully matched phrases under each known webpage type to the number of phrases serving as the reference standard under that known webpage type is obtained. In the above example, the ratio for the phrase "lipstick" is 2/3 (2 matched against 3 in the reference standard). The obtained ratio is then compared with a preset ratio, which can be set flexibly according to actual requirements; the closer the preset ratio is set to the obtained ratio, the more accurate the determined webpage type.
S309, if the ratio is larger than or equal to the preset ratio, taking the known webpage type corresponding to the ratio as the webpage type of the webpage to be judged.
When the obtained ratio is greater than or equal to the preset ratio, the known webpage type corresponding to the ratio is taken as the webpage type of the webpage to be judged. In the above example, the webpage type of the reference webpage titled "as a lipstick controller to select a lipstick brand, try the color of the lipstick" is taken as the webpage type of the webpage to be judged.
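The title-based steps of this embodiment can be sketched as follows, assuming the titles have already been segmented into phrases (word segmentation itself is outside this sketch). The type names "beauty" and "news", the "news" reference counts, the threshold 1/2, and the function names are all hypothetical. The ratio is read as matched occurrences over reference occurrences for the phrases found in both titles, which reproduces the 2/3 figure of the example. Picking the highest ratio when several types pass the threshold is also an assumption.

```python
from collections import Counter
from fractions import Fraction


def phrase_ratio(page_phrases, reference_counts):
    """Matched occurrences over reference occurrences for phrases present in both."""
    page_counts = Counter(page_phrases)
    matched = [p for p in page_counts if p in reference_counts]
    reference_total = sum(reference_counts[p] for p in matched)
    if reference_total == 0:
        return Fraction(0)
    matched_total = sum(min(page_counts[p], reference_counts[p]) for p in matched)
    return Fraction(matched_total, reference_total)


def classify_by_title(page_phrases, references, preset_ratio):
    """Return the known type whose ratio passes the threshold, or None."""
    ratios = {name: phrase_ratio(page_phrases, counts) for name, counts in references.items()}
    passing = {name: r for name, r in ratios.items() if r >= preset_ratio}
    return max(passing, key=passing.get) if passing else None


# Reference phrase counts per known type; "beauty" uses the counts from the text.
references = {
    "beauty": Counter({"lipstick": 3, "brand": 1, "color": 1}),
    "news": Counter({"report": 2, "today": 1}),
}

# Phrases split from the title of the webpage to be judged, as in the text.
page_phrases = ["lipstick", "lipstick", "type"]

result = classify_by_title(page_phrases, references, Fraction(1, 2))
print(phrase_ratio(page_phrases, references["beauty"]), result)  # 2/3 beauty
```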
Fig. 4 is a flowchart of embodiment 4 of the method for determining a webpage type disclosed in the present invention. The method may include the following steps:
S401, acquiring page information of a webpage to be judged;
When the type of a webpage needs to be determined, for example whether it is a news webpage or a forum webpage, the page information of the webpage to be judged is first obtained; this page information comprises title information and page structure information.
Specifically, one implementation of obtaining the page information of the webpage to be judged is to parse the webpage to be judged to extract the domain name of its link, and then to simulate access to the uniform resource locator (URL) corresponding to the domain name and crawl the page information. Parsing the webpage to be judged may mean parsing its original URL. The domain name extracted from the link may be defined as the part of the URL between the first "http://" and the first ":" that follows it. For example, if the link of the webpage to be judged is http://example.com:1234/test.htm, the extracted domain name is example.com. The simulated access may use a Python crawler library or another programming language, and the information in the page is crawled through that simulated access.
S402, extracting title information from the page information;
After the page information of the webpage to be judged is obtained, the title information is extracted from the crawled HTML (HyperText Markup Language) page.
S403, judging whether the title information contains preset keywords, wherein the preset keywords are keywords containing webpage types;
Next, it is judged whether the extracted title information contains a preset keyword that directly determines the webpage type, for example "career", "forum", "news", or "blog".
S404, if the title information does not contain preset keywords, acquiring page information of a plurality of webpages serving as reference standards under at least one known webpage type;
When the title information does not contain a preset keyword, that is, when the webpage type cannot be judged directly from the title information of the webpage to be judged, at least one webpage of a known webpage type is obtained, together with the page information of each such webpage; the obtained page information serves as the reference standard.
S405, extracting label information serving as a reference standard from page structure information corresponding to page information of a webpage serving as the reference standard, and counting the number of the label information serving as the reference standard under each known webpage type;
after page information of a plurality of webpages serving as reference standards in at least one known webpage type is acquired, tag information serving as the reference standards is extracted from page structure information corresponding to the page information of the webpages serving as the reference standards, and each page structure comprises a plurality of tag information. For example, taking a web page of one known web page type as an example, the tag information includes: "meta", "link", "span", "a", "p", the number of tag information as a reference standard is counted, there are 12 "meta", 3 "link", 5 "span", 3 "a", and 3 "p".
S406, extracting at least one piece of label information from the page structure information corresponding to the page information of the webpage to be judged;
Meanwhile, at least one piece of tag information used for determining the page type is extracted from the page structure information corresponding to the page information of the webpage to be judged, for example "meta" and "div".
S407, matching each piece of label information with label information serving as a reference standard, and counting the number of the label information which is successfully matched under each known webpage type;
Each piece of tag information of the webpage to be judged is matched against the tag information serving as the reference standard. For example, matching finds that the tag "meta" of the webpage to be judged matches the reference tag "meta". The number of successfully matched tags under each known webpage type is then counted; here, 10 "meta" tags are counted.
S408, obtaining a first ratio of the number of successfully matched label information under each known webpage type to the number of label information serving as a reference standard under the known webpage type, and comparing the first ratio with a first preset ratio;
Then, a first ratio of the number of successfully matched tags under each known webpage type to the number of tags serving as the reference standard under that known webpage type is obtained. In the above example, the ratio for the tag "meta" is 5/6 (10 matched against 12 in the reference standard). The obtained first ratio is then compared with a first preset ratio, which can be set flexibly according to actual requirements; the closer the first preset ratio is set to the obtained first ratio, the more accurate the determined webpage type.
S409, if the first ratio is larger than or equal to a first preset ratio, acquiring title information of a plurality of webpages serving as reference standards under the known webpage types corresponding to the first ratio;
when the first ratio is larger than or equal to a first preset ratio, further acquiring the webpages of the known webpage types corresponding to the first ratio, and simultaneously acquiring the title information in the page information of each webpage of the known webpage types, wherein the acquired title information is used as a reference standard. It should be noted that there may be a plurality of web pages of the known web page type corresponding to the first ratio.
S410, splitting phrases serving as reference standards from the title information of the webpage serving as the reference standards, and counting the number of the phrases serving as the reference standards under each known webpage type;
After the title information of a plurality of webpages serving as reference standards under at least one known webpage type is acquired, the title information is split into phrases, which serve as the reference standard. For example, if the title information of one webpage of a known webpage type is "as a lipstick controller to select a lipstick brand, try the color of the lipstick", word segmentation of the title yields the reference phrases "lipstick", "brand", and "color", and counting each phrase gives 3 "lipstick", 1 "brand", and 1 "color".
S411, splitting at least one phrase from the title information of the webpage to be judged;
Meanwhile, at least one phrase is split from the title information of the webpage to be judged. For example, if that title information is "when the lipstick is selected, the lipstick has various types", the split phrases include "lipstick" and "type".
S412, matching each phrase with a phrase serving as a reference standard respectively, and counting the number of the phrases which are successfully matched under each known webpage type;
Each phrase split from the title information of the webpage to be judged is matched against the phrases serving as the reference standard. In the above example, matching finds that "lipstick" in the title information of the webpage to be judged matches the reference phrase "lipstick". The number of successfully matched phrases under each known webpage type is then counted; here, 2 occurrences of "lipstick" are counted.
S413, obtaining a second ratio of the number of the successfully matched phrases in each known webpage type to the number of the phrases serving as the reference standard in the known webpage type, and comparing the second ratio with a second preset ratio;
Then, a second ratio of the number of successfully matched phrases under each known webpage type to the number of phrases serving as the reference standard under that known webpage type is obtained. In the above example, the ratio for the phrase "lipstick" is 2/3 (2 matched against 3 in the reference standard). The obtained second ratio is then compared with a second preset ratio, which can be set flexibly according to actual requirements; the closer the second preset ratio is set to the obtained second ratio, the more accurate the determined webpage type.
And S414, if the second ratio is greater than or equal to a second preset ratio, taking the known webpage type corresponding to the second ratio as the webpage type of the webpage to be judged.
When the obtained second ratio is greater than or equal to the second preset ratio, the known webpage type corresponding to the second ratio is taken as the webpage type of the webpage to be judged. In the above example, the webpage type of the reference webpage titled "as a lipstick controller to select a lipstick brand, try the color of the lipstick" is taken as the webpage type of the webpage to be judged.
It should be noted that, in the above embodiment, if the second ratio is smaller than the second preset ratio, the known web page type corresponding to the first ratio may be used as the web page type of the web page to be determined.
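The two-stage decision of this embodiment, including the fallback to the first-stage result, can be sketched as below. It assumes the first (tag) ratios and second (title phrase) ratios have already been computed per known webpage type. The function name, the type names, the sample ratios, and both threshold values are hypothetical, and choosing the highest ratio among several passing types is an assumption not stated in the text.

```python
from fractions import Fraction


def two_stage_decision(first_ratios, second_ratios, first_preset, second_preset):
    """Filter known types by the first ratio, then confirm by the second ratio.

    Falls back to the best first-stage type when no type passes the second stage.
    """
    candidates = {t: r for t, r in first_ratios.items() if r >= first_preset}
    if not candidates:
        return None  # no known type passes the tag-based stage
    confirmed = {
        t: second_ratios.get(t, Fraction(0))
        for t in candidates
        if second_ratios.get(t, Fraction(0)) >= second_preset
    }
    if confirmed:
        return max(confirmed, key=confirmed.get)
    # Fallback: use the type that corresponds to the first ratio.
    return max(candidates, key=candidates.get)


first = {"forum": Fraction(5, 6), "news": Fraction(7, 8)}
second = {"forum": Fraction(2, 3), "news": Fraction(1, 4)}

decided = two_stage_decision(first, second, Fraction(3, 4), Fraction(1, 2))
fallback = two_stage_decision(first, {"forum": Fraction(1, 5)}, Fraction(3, 4), Fraction(1, 2))
print(decided, fallback)  # forum news
```

In the second call no type reaches the second preset ratio, so the type with the best first ratio is returned, which mirrors the note above.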
Fig. 5 is a schematic structural diagram of embodiment 1 of the apparatus for determining a webpage type according to the present invention. The apparatus may include:
an obtaining module 501, configured to obtain page information of a webpage to be determined;
When the type of a webpage needs to be determined, for example whether it is a news webpage or a forum webpage, the page information of the webpage to be judged is first obtained; this page information comprises title information and page structure information.
Specifically, one implementation of obtaining the page information of the webpage to be judged is to parse the webpage to be judged to extract the domain name of its link, and then to simulate access to the uniform resource locator (URL) corresponding to the domain name and crawl the page information. Parsing the webpage to be judged may mean parsing its original URL. The domain name extracted from the link may be defined as the part of the URL between the first "http://" and the first ":" that follows it. For example, if the link of the webpage to be judged is http://example.com:1234/test.htm, the extracted domain name is example.com. The simulated access may use a Python crawler library or another programming language, and the information in the page is crawled through that simulated access.
An extracting module 502, configured to extract header information from the page information;
After the page information of the webpage to be judged is obtained, the title information is extracted from the crawled HTML (HyperText Markup Language) page.
The judging module 503 is configured to judge whether the header information includes a preset keyword, where the preset keyword is a keyword including a webpage type;
Next, it is judged whether the extracted title information contains a preset keyword that directly determines the webpage type, for example "career", "forum", "news", or "blog".
The processing module 504 is configured to, if the header information does not include the preset keyword, obtain the webpage type of the webpage to be determined based on the webpage structure information and/or the header information corresponding to the page information.
When the extracted title information is judged not to contain the preset keyword, that is, when the webpage type cannot be determined directly from the extracted title information, the webpage to be judged is further classified based on the extracted HTML (HyperText Markup Language) page structure information and/or the extracted title information, so as to obtain its webpage type. In other words, when the title information does not contain the preset keyword, the webpage to be judged may be classified according to the page structure information corresponding to the page information, according to the title information in the page information, or according to both.
It should be noted that, when the header information includes the preset keyword, the webpage type corresponding to the preset keyword is used as the webpage type of the webpage to be determined. For example, if the title information is "as a lipstick control, try a color bar-entertainment bagua-forum", wherein the title information includes a preset keyword "forum", the type of the web page to be determined can be determined as the forum.
The device for judging the webpage type comprises a processor and a memory. The obtaining module, the extracting module, the judging module, the processing module and the like are stored in the memory as program units, and the processor executes the program units stored in the memory to realize the corresponding functions.
The processor comprises a kernel, which calls the corresponding program unit from the memory. One or more kernels may be provided, and the problem of low webpage classification efficiency is addressed by adjusting kernel parameters.
The memory may include volatile memory in a computer-readable medium, such as random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM), and the memory includes at least one memory chip.
In summary, in the above embodiments, when the type of the web page needs to be determined, first page information of the web page to be determined is obtained, then header information is extracted from the obtained page information, and then whether the header information includes a preset keyword that can directly determine the type of the web page is further determined, and when the header information does not include the preset keyword, the type of the web page to be determined is obtained through page structure information and/or the header information corresponding to the page information. Compared with the prior art that the webpage types are classified in a manual mode, the method and the device can automatically realize the classification of the webpage types and improve the efficiency of the classification of the webpage types.
Fig. 6 is a schematic structural diagram of embodiment 2 of the apparatus for determining a webpage type according to the present invention. The apparatus may include:
an obtaining module 601, configured to obtain page information of a webpage to be determined;
When the type of a webpage needs to be determined, for example whether it is a news webpage or a forum webpage, the page information of the webpage to be judged is first obtained; this page information comprises title information and page structure information.
Specifically, one implementation of obtaining the page information of the webpage to be judged is to parse the webpage to be judged to extract the domain name of its link, and then to simulate access to the uniform resource locator (URL) corresponding to the domain name and crawl the page information. Parsing the webpage to be judged may mean parsing its original URL. The domain name extracted from the link may be defined as the part of the URL between the first "http://" and the first ":" that follows it. For example, if the link of the webpage to be judged is http://example.com:1234/test.htm, the extracted domain name is example.com. The simulated access may use a Python crawler library or another programming language, and the information in the page is crawled through that simulated access.
An extracting module 602, configured to extract header information from the page information;
After the page information of the webpage to be judged is obtained, the title information is extracted from the crawled HTML (HyperText Markup Language) page.
The determining module 603 is configured to determine whether the header information includes a preset keyword, where the preset keyword is a keyword including a webpage type;
Next, it is judged whether the extracted title information contains a preset keyword that directly determines the webpage type, for example "career", "forum", "news", or "blog".
A first obtaining unit 604, configured to obtain page information of a plurality of webpages serving as reference standards in at least one known webpage type if the header information does not include a preset keyword;
When the title information does not contain a preset keyword, that is, when the webpage type cannot be judged directly from the title information of the webpage to be judged, at least one webpage of a known webpage type is obtained, together with the page information of each such webpage; the obtained page information serves as the reference standard.
A first statistical unit 605, configured to extract tag information serving as a reference standard from page structure information corresponding to page information of a web page serving as the reference standard, and count the number of tag information serving as the reference standard in each known web page type;
after page information of a plurality of webpages serving as reference standards in at least one known webpage type is acquired, tag information serving as the reference standards is extracted from page structure information corresponding to the page information of the webpages serving as the reference standards, and each page structure comprises a plurality of tag information. For example, taking a web page of one known web page type as an example, the tag information includes: "meta", "link", "span", "a", "p", the number of tag information as a reference standard is counted, there are 12 "meta", 3 "link", 5 "span", 3 "a", and 3 "p".
A first extracting unit 606, configured to extract at least one piece of tag information from page structure information corresponding to page information of a web page to be determined;
Meanwhile, at least one piece of tag information used for determining the page type is extracted from the page structure information corresponding to the page information of the webpage to be judged, for example "meta" and "div".
The first matching unit 607 is configured to match each piece of tag information with tag information serving as a reference standard, and count the number of pieces of tag information that are successfully matched under each known webpage type;
Each piece of tag information of the webpage to be judged is matched against the tag information serving as the reference standard. For example, matching finds that the tag "meta" of the webpage to be judged matches the reference tag "meta". The number of successfully matched tags under each known webpage type is then counted; here, 10 "meta" tags are counted.
A first comparing unit 608, configured to obtain a ratio between the number of successfully matched tag information in each known web page type and the number of tag information serving as a reference standard in the known web page type, and compare the ratio with a preset ratio;
Then, the ratio of the number of successfully matched tags under each known webpage type to the number of tags serving as the reference standard under that known webpage type is obtained. In the above example, the ratio for the tag "meta" is 5/6 (10 matched against 12 in the reference standard). The obtained ratio is then compared with a preset ratio, which can be set flexibly according to actual requirements; the closer the preset ratio is set to the obtained ratio, the more accurate the determined webpage type.
The first output unit 609 is configured to, if the ratio is greater than or equal to the preset ratio, use the known webpage type corresponding to the ratio as the webpage type of the webpage to be determined.
And when the obtained ratio is larger than or equal to the preset ratio, taking the known webpage type corresponding to the ratio as the webpage type of the webpage to be judged. Taking the above example as an example, the web page type corresponding to the tag information "meta", "link", "span", "a" and "p" will be used as the web page type of the web page to be determined.
Fig. 7 is a schematic structural diagram of embodiment 3 of the apparatus for determining a webpage type according to the present invention. The apparatus may include:
an obtaining module 701, configured to obtain page information of a webpage to be determined;
When the type of a webpage needs to be determined, for example whether it is a news webpage or a forum webpage, the page information of the webpage to be judged is first obtained; this page information comprises title information and page structure information.
Specifically, one implementation of obtaining the page information of the webpage to be judged is to parse the webpage to be judged to extract the domain name of its link, and then to simulate access to the uniform resource locator (URL) corresponding to the domain name and crawl the page information. Parsing the webpage to be judged may mean parsing its original URL. The domain name extracted from the link may be defined as the part of the URL between the first "http://" and the first ":" that follows it. For example, if the link of the webpage to be judged is http://example.com:1234/test.htm, the extracted domain name is example.com. The simulated access may use a Python crawler library or another programming language, and the information in the page is crawled through that simulated access.
An extracting module 702, configured to extract title information from the page information;
After the page information of the webpage to be judged is obtained, the title information is extracted from the crawled HTML (HyperText Markup Language) page.
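One possible way to extract the title information from a crawled HTML page is shown below. It assumes the title information is the text of the `<title>` element; the class `TitleParser` and the function `extract_title` are hypothetical names, built on the standard library `html.parser` module.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collects the text found inside the <title> element."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title_parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title_parts.append(data)


def extract_title(html):
    """Return the page title with surrounding whitespace removed."""
    parser = TitleParser()
    parser.feed(html)
    parser.close()
    return "".join(parser.title_parts).strip()
```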
The determining module 703 is configured to determine whether the title information includes a preset keyword, where a preset keyword is a keyword that contains a webpage type;
Next, it is determined whether the extracted title information contains a preset keyword from which the webpage type can be determined directly, for example "career", "forum", "news", or "blog".
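The keyword check can be illustrated with a short sketch. The mapping `PRESET_KEYWORDS` and the type labels in it are assumptions made for the example; the text only names the keywords themselves.

```python
# Hypothetical mapping from preset keyword to webpage type label.
PRESET_KEYWORDS = {
    "career": "career",
    "forum": "forum",
    "news": "news",
    "blog": "blog",
}


def type_from_title_keywords(title, keywords=PRESET_KEYWORDS):
    """Return the webpage type of the first preset keyword found in the title,
    or None when the title contains no preset keyword."""
    lowered = title.lower()
    for keyword, page_type in keywords.items():
        if keyword in lowered:
            return page_type
    return None
```

A result of `None` corresponds to the case in which the type cannot be read directly from the title, so the comparison against reference webpages is needed. Plain substring matching is used here for brevity and may over-match in practice.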
A second obtaining unit 704, configured to obtain, if the title information does not include a preset keyword, title information of a plurality of webpages serving as reference standards under at least one known webpage type;
When the title information does not contain a preset keyword, that is, when the webpage type cannot be determined directly from the title information of the webpage to be judged, webpages of at least one known webpage type are obtained first, together with the title information in the page information of each of those webpages. The obtained title information is used as the reference standard.
A second statistical unit 705, configured to split phrases serving as the reference standard from the title information of the webpages serving as the reference standard, and count the number of phrases serving as the reference standard under each known webpage type;
After the title information of a plurality of webpages serving as reference standards under at least one known webpage type is acquired, that title information is split into phrases, and the phrases that best serve as the reference standard are split out. For example, take one webpage of a known webpage type whose title information is "as a lipstick controller to select a lipstick brand, try the color of the lipstick". Word segmentation of this title yields the phrases "lipstick", "brand", and "color" as the reference standard, and counting each phrase gives 3 occurrences of "lipstick", 1 of "brand", and 1 of "color".
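The splitting and counting step can be sketched as follows. The text does not say which word segmenter is used, so this example stands in a regular expression plus a hand-picked stop-word list (`STOPWORDS`, hypothetical) chosen so that the example title reduces to the three phrases named above.

```python
import re
from collections import Counter

# Hypothetical stop-word list; a real system would use a proper word segmenter.
STOPWORDS = {"as", "a", "to", "select", "try", "the", "of", "controller"}


def split_phrases(title):
    """Split a title into lowercase words and drop the stop words."""
    words = re.findall(r"[a-z]+", title.lower())
    return [w for w in words if w not in STOPWORDS]


def count_reference_phrases(titles):
    """Count every phrase across the titles of one known webpage type."""
    counts = Counter()
    for title in titles:
        counts.update(split_phrases(title))
    return counts


reference = count_reference_phrases(
    ["as a lipstick controller to select a lipstick brand, try the color of the lipstick"]
)
print(reference)
```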
The first splitting unit 706 is configured to split at least one phrase from the title information of the webpage to be judged;
At the same time, at least one phrase is split from the title information of the webpage to be judged. For example, if that title information is "when the lipstick is selected, the lipstick has various types", the split phrases include "lipstick" and "type".
A second matching unit 707, configured to match each phrase with the phrases serving as the reference standard, and count the number of phrases that are successfully matched under each known webpage type;
Each phrase split from the title information of the webpage to be judged is matched against the phrases serving as the reference standard. In the above example, the phrase "lipstick" in the title information of the webpage to be judged matches the reference phrase "lipstick". The number of successfully matched phrases under each known webpage type is then counted; here, "lipstick" is counted 2 times.
A second comparing unit 708, configured to obtain the ratio of the number of successfully matched phrases under each known webpage type to the number of phrases serving as the reference standard under that known webpage type, and compare the ratio with a preset ratio;
Then the ratio of the number of successfully matched phrases under each known webpage type to the number of phrases serving as the reference standard under that type is obtained; for the phrase "lipstick" in the example, this ratio is 2/3. The obtained ratio is then compared with a preset ratio, which can be set flexibly according to actual requirements. The closer the preset ratio is set to the obtained ratio, the more accurate the determined webpage type.
A second output unit 709, configured to, if the ratio is greater than or equal to the preset ratio, take the known webpage type corresponding to the ratio as the webpage type of the webpage to be judged.
When the obtained ratio is greater than or equal to the preset ratio, the known webpage type corresponding to that ratio is taken as the webpage type of the webpage to be judged. In the above example, the webpage type of the reference webpage whose title information is "as a lipstick controller to select a lipstick brand, try the color of the lipstick" is taken as the webpage type of the webpage to be judged.
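The matching, ratio, and threshold steps of this embodiment can be sketched as below. The exact counting rule is not spelled out in the text, so the sketch adopts one reading that reproduces the 2/3 of the worked example: for each matched phrase, occurrences are counted up to the reference count, and the total is divided by the reference count of the matched phrases. The function names and the type label "beauty" are hypothetical.

```python
from collections import Counter
from fractions import Fraction


def match_ratio(target_counts, reference_counts):
    """Ratio of matched occurrences to reference occurrences (assumed rule)."""
    matched = set(target_counts) & set(reference_counts)
    if not matched:
        return Fraction(0)
    hits = sum(min(target_counts[p], reference_counts[p]) for p in matched)
    total = sum(reference_counts[p] for p in matched)
    return Fraction(hits, total)


def classify_by_title(target_counts, references_by_type, preset_ratio):
    """Return the known type with the highest ratio that reaches the preset
    ratio, or None when no known type reaches it."""
    best_type, best_ratio = None, Fraction(0)
    for page_type, reference_counts in references_by_type.items():
        ratio = match_ratio(target_counts, reference_counts)
        if ratio >= preset_ratio and ratio > best_ratio:
            best_type, best_ratio = page_type, ratio
    return best_type


target = Counter({"lipstick": 2, "type": 1})
references = {"beauty": Counter({"lipstick": 3, "brand": 1, "color": 1})}
print(match_ratio(target, references["beauty"]))
```

With a preset ratio of 1/2 the example page is assigned the reference type; with a preset ratio of 3/4 it is left undetermined.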
Fig. 8 is a schematic structural diagram of embodiment 4 of an apparatus for determining a webpage type according to the present invention. The apparatus may include:
an obtaining unit 801, configured to obtain page information of a webpage to be judged;
When the type of a webpage needs to be determined, for example whether it is a news webpage or a forum webpage, the page information of the webpage to be judged is obtained first. This page information includes title information and page structure information.
Specifically, one way to obtain the page information is to parse the webpage to be judged, extract the domain name of its link, and then simulate access to the uniform resource locator (URL) corresponding to that domain name in order to crawl the page information. Parsing the webpage may mean parsing its original URL. The domain name may be defined as the part of the URL between the first "http://" and the first ":" that appears after it. For example, if the link of the webpage to be judged is http://example.com:1234/test.htm, the extracted domain name is example.com. The simulated access may be performed with a Python crawler library or with another programming language, and the information in the page is crawled through that simulated access.
An extracting module 802, configured to extract title information from the page information;
After the page information of the webpage to be judged is obtained, the title information is extracted from the crawled HTML (HyperText Markup Language) page.
The determining module 803 is configured to determine whether the title information includes a preset keyword, where a preset keyword is a keyword that contains a webpage type;
Next, it is determined whether the extracted title information contains a preset keyword from which the webpage type can be determined directly, for example "career", "forum", "news", or "blog".
A third obtaining unit 804, configured to obtain page information of a plurality of webpages serving as reference standards under at least one known webpage type if the title information does not include a preset keyword;
When the title information does not contain a preset keyword, that is, when the webpage type cannot be determined directly from the title information of the webpage to be judged, webpages of at least one known webpage type are obtained, together with the page information of each of those webpages. The obtained page information is used as the reference standard.
A third statistical unit 805, configured to extract tag information serving as the reference standard from the page structure information corresponding to the page information of the webpages serving as the reference standard, and count the number of tag information serving as the reference standard under each known webpage type;
After the page information of a plurality of webpages serving as reference standards under at least one known webpage type is acquired, tag information serving as the reference standard is extracted from the page structure information corresponding to that page information; each page structure contains multiple pieces of tag information. For example, for one webpage of a known webpage type, the tag information includes "meta", "link", "span", "a", and "p". Counting the tag information serving as the reference standard gives 12 "meta", 3 "link", 5 "span", 3 "a", and 3 "p".
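Counting the tag information of a page can be sketched with the standard library `html.parser` module. The sketch assumes that "tag information" means the names of the HTML start tags in the page; `TagCounter` and `count_tags` are hypothetical names.

```python
from collections import Counter
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Counts how often each start tag occurs in a page."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        # Tag names are delivered in lowercase by HTMLParser.
        self.counts[tag] += 1


def count_tags(html):
    """Return a Counter mapping tag name to number of occurrences."""
    parser = TagCounter()
    parser.feed(html)
    parser.close()
    return parser.counts
```

Applied to every reference webpage of a known type, the resulting counters can be summed to obtain per-type reference counts such as the 12 "meta" of the example.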
A second extracting unit 806, configured to extract at least one piece of tag information from the page structure information corresponding to the page information of the webpage to be judged;
At the same time, at least one piece of tag information used for determining the page type is extracted from the page structure information corresponding to the page information of the webpage to be judged. For example, "meta" and "div" are extracted.
A third matching unit 807, configured to match each piece of tag information with the tag information serving as the reference standard, and count the number of pieces of tag information that are successfully matched under each known webpage type;
Each piece of tag information of the webpage to be judged is matched against the tag information serving as the reference standard. In the above example, the tag information "meta" of the webpage to be judged matches the reference tag information "meta". The number of successfully matched pieces of tag information under each known webpage type is then counted; here, 10 "meta" are counted.
A third comparing unit 808, configured to obtain a first ratio of the number of successfully matched tag information under each known webpage type to the number of tag information serving as the reference standard under that known webpage type, and compare the first ratio with a first preset ratio;
Then the first ratio of the number of successfully matched tag information under each known webpage type to the number of tag information serving as the reference standard under that type is obtained; for the tag information "meta" in the above example, this ratio is 10/12 = 5/6. The obtained first ratio is then compared with a first preset ratio, which can be set flexibly according to actual requirements. The closer the first preset ratio is set to the obtained first ratio, the more accurate the determined webpage type.
The fourth obtaining unit 809 is configured to obtain the title information of the webpages serving as the reference standard under the known webpage type corresponding to the first ratio if the first ratio is greater than or equal to the first preset ratio;
When the first ratio is greater than or equal to the first preset ratio, the webpages of the known webpage type corresponding to the first ratio are further acquired, together with the title information in the page information of each of those webpages. The acquired title information is used as the reference standard. It should be noted that there may be a plurality of webpages of the known webpage type corresponding to the first ratio.
A fourth statistical unit 810, configured to split phrases serving as the reference standard from the title information of the webpages serving as the reference standard, and count the number of phrases serving as the reference standard under each known webpage type;
After the title information of a plurality of webpages serving as reference standards under at least one known webpage type is acquired, that title information is split into phrases, and the phrases that best serve as the reference standard are split out. For example, take one webpage of a known webpage type whose title information is "as a lipstick controller to select a lipstick brand, try the color of the lipstick". Word segmentation of this title yields the phrases "lipstick", "brand", and "color" as the reference standard, and counting each phrase gives 3 occurrences of "lipstick", 1 of "brand", and 1 of "color".
The second splitting unit 811 is configured to split at least one phrase from the title information of the webpage to be judged;
At the same time, at least one phrase is split from the title information of the webpage to be judged. For example, if that title information is "when the lipstick is selected, the lipstick has various types", the split phrases include "lipstick" and "type".
A fourth matching unit 812, configured to match each phrase with the phrases serving as the reference standard, and count the number of phrases that are successfully matched under each known webpage type;
Each phrase split from the title information of the webpage to be judged is matched against the phrases serving as the reference standard. In the above example, the phrase "lipstick" in the title information of the webpage to be judged matches the reference phrase "lipstick". The number of successfully matched phrases under each known webpage type is then counted; here, "lipstick" is counted 2 times.
A fourth comparing unit 813, configured to obtain a second ratio of the number of successfully matched phrases under each known webpage type to the number of phrases serving as the reference standard under that known webpage type, and compare the second ratio with a second preset ratio;
Then the second ratio of the number of successfully matched phrases under each known webpage type to the number of phrases serving as the reference standard under that type is obtained; for the phrase "lipstick" in the example, this ratio is 2/3. The obtained second ratio is then compared with a second preset ratio, which can be set flexibly according to actual requirements. The closer the second preset ratio is set to the obtained second ratio, the more accurate the determined webpage type.
The third output unit 814 is configured to, if the second ratio is greater than or equal to the second preset ratio, take the known webpage type corresponding to the second ratio as the webpage type of the webpage to be judged.
When the obtained second ratio is greater than or equal to the second preset ratio, the known webpage type corresponding to the second ratio is taken as the webpage type of the webpage to be judged. In the above example, the webpage type of the reference webpage whose title information is "as a lipstick controller to select a lipstick brand, try the color of the lipstick" is taken as the webpage type of the webpage to be judged.
It should be noted that, in the above embodiment, if the second ratio is smaller than the second preset ratio, the known webpage type corresponding to the first ratio may be used as the webpage type of the webpage to be judged.
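The two-stage decision of this embodiment, including the fallback to the structure-based result, might be sketched as follows. The counting rule inside `match_ratio` is an assumption chosen to reproduce the 5/6 and 2/3 of the worked examples; the function names, the "div" count, and the type label "beauty" are hypothetical.

```python
from collections import Counter
from fractions import Fraction


def match_ratio(target_counts, reference_counts):
    """Ratio of matched occurrences to reference occurrences (assumed rule)."""
    matched = set(target_counts) & set(reference_counts)
    if not matched:
        return Fraction(0)
    hits = sum(min(target_counts[k], reference_counts[k]) for k in matched)
    total = sum(reference_counts[k] for k in matched)
    return Fraction(hits, total)


def classify_two_stage(target_tags, target_phrases, references,
                       first_preset, second_preset):
    """references maps a known type to (tag_counts, phrase_counts).

    Returns (type, stage), where stage names the step that decided."""
    # Stage 1: keep the known types whose tag ratio reaches the first preset.
    candidates = [t for t, (tags, _) in references.items()
                  if match_ratio(target_tags, tags) >= first_preset]
    # Stage 2: among those, accept a type whose title ratio reaches the second.
    for page_type in candidates:
        phrases = references[page_type][1]
        if match_ratio(target_phrases, phrases) >= second_preset:
            return page_type, "title"
    # Fallback: use the structure-based candidate when the title check fails.
    if candidates:
        return candidates[0], "structure"
    return None, None


refs = {"beauty": (Counter(meta=12, link=3, span=5, a=3, p=3),
                   Counter(lipstick=3, brand=1, color=1))}
tags = Counter(meta=10, div=4)
phrases = Counter(lipstick=2, type=1)
print(classify_two_stage(tags, phrases, refs, Fraction(4, 5), Fraction(1, 2)))
```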
The present application further provides a computer program product which, when executed on a data processing device, is adapted to execute program code initialized with the following method steps:
acquiring page information of a webpage to be judged;
extracting title information from the page information;
judging whether the title information contains preset keywords, wherein the preset keywords are keywords containing webpage types;
and if the title information does not contain the preset keywords, obtaining the webpage type of the webpage to be judged based on the page structure information and/or the title information corresponding to the page information.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include volatile memory in a computer-readable medium, such as Random Access Memory (RAM), and/or non-volatile memory, such as Read Only Memory (ROM) or flash memory (flash RAM). The memory is an example of a computer-readable medium.
Computer-readable media include both permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), Static Random Access Memory (SRAM), Dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), Read Only Memory (ROM), Electrically Erasable Programmable Read Only Memory (EEPROM), flash memory or other memory technology, compact disc read only memory (CD-ROM), Digital Versatile Discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information that can be accessed by a computing device. As defined herein, a computer readable medium does not include a transitory computer readable medium such as a modulated data signal and a carrier wave.
The above are merely examples of the present application and are not intended to limit the present application. Various modifications and changes may occur to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application should be included in the scope of the claims of the present application.

Claims (10)

1. A method for discriminating a type of a web page, comprising:
acquiring page information of a webpage to be judged;
extracting title information from the page information;
judging whether the title information contains preset keywords, wherein the preset keywords are keywords containing webpage types;
if the title information does not contain the preset keywords, obtaining the webpage type of the webpage to be judged based on the title information;
obtaining the webpage type of the webpage to be judged based on the title information comprises:
acquiring title information of a plurality of webpages serving as reference standards under at least one known webpage type;
splitting phrases serving as reference standards from the title information of the webpage serving as the reference standards, and counting the number of the phrases serving as the reference standards under each known webpage type;
splitting at least one phrase from the title information of the webpage to be judged;
matching each phrase with the phrases serving as the reference standards respectively, and counting the number of the phrases which are successfully matched under each known webpage type;
acquiring the ratio of the number of the successfully matched phrases in each known webpage type to the number of the phrases serving as the reference standard in the known webpage type, and comparing the ratio with a preset ratio;
and if the ratio is larger than or equal to the preset ratio, taking the known webpage type corresponding to the ratio as the webpage type of the webpage to be judged.
2. The method of claim 1, further comprising:
and if the title information contains the preset keywords, taking the webpage type corresponding to the preset keywords as the webpage type of the webpage to be judged.
3. The method according to claim 1, wherein the acquiring page information of the web page to be determined comprises:
analyzing the webpage to be judged, and extracting a domain name of a link corresponding to the webpage to be judged;
and simulating to access a Uniform Resource Locator (URL) corresponding to the domain name, and crawling page information of the webpage to be judged.
4. The method according to any one of claims 1-3, further comprising: obtaining the webpage type of the webpage to be judged based on the page structure information corresponding to the page information;
the obtaining of the webpage type of the webpage to be judged based on the page structure information corresponding to the page information includes:
acquiring page information of a plurality of webpages serving as reference standards under at least one known webpage type;
extracting tag information serving as a reference standard from page structure information corresponding to the page information of the webpage serving as the reference standard, and counting the number of the tag information serving as the reference standard in each known webpage type;
extracting at least one piece of tag information from the page structure information corresponding to the page information of the webpage to be judged;
matching each piece of tag information with the tag information serving as the reference standard, and counting the number of the tag information which is successfully matched under each known webpage type;
acquiring the ratio of the number of successfully matched tag information under each known webpage type to the number of tag information serving as a reference standard under the known webpage type, and comparing the ratio with a preset ratio;
and if the ratio is larger than or equal to the preset ratio, taking the known webpage type corresponding to the ratio as the webpage type of the webpage to be judged.
5. The method according to any one of claims 1-3, further comprising: obtaining the webpage type of the webpage to be judged based on the page structure information corresponding to the page information;
the obtaining of the webpage type of the webpage to be judged based on the page structure information corresponding to the page information includes:
acquiring page information of a plurality of webpages serving as reference standards under at least one known webpage type;
extracting tag information serving as a reference standard from page structure information corresponding to the page information of the webpage serving as the reference standard, and counting the number of the tag information serving as the reference standard in each known webpage type;
extracting at least one piece of tag information from the page structure information corresponding to the page information of the webpage to be judged;
matching each piece of tag information with the tag information serving as the reference standard, and counting the number of the tag information which is successfully matched under each known webpage type;
acquiring a first ratio of the number of successfully matched tag information under each known webpage type to the number of tag information serving as a reference standard under the known webpage type, and comparing the first ratio with a first preset ratio;
if the first ratio is larger than or equal to the first preset ratio, acquiring title information of a plurality of webpages serving as reference standards under the known webpage types corresponding to the first ratio;
splitting phrases serving as reference standards from the title information of the webpage serving as the reference standards, and counting the number of the phrases serving as the reference standards under each known webpage type;
splitting at least one phrase from the title information of the webpage to be judged;
matching each phrase with the phrases serving as the reference standards respectively, and counting the number of the phrases which are successfully matched under each known webpage type;
acquiring a second ratio of the number of the successfully matched phrases in each known webpage type to the number of the phrases serving as the reference standard in the known webpage type, and comparing the second ratio with a second preset ratio;
and if the second ratio is greater than or equal to the second preset ratio, taking the known webpage type corresponding to the second ratio as the webpage type of the webpage to be judged.
6. An apparatus for determining a type of a web page, comprising:
the acquisition module is used for acquiring page information of a webpage to be judged;
the extraction module is used for extracting the title information from the page information;
the judging module is used for judging whether the title information contains preset keywords, and the preset keywords are keywords containing webpage types;
the processing module is used for obtaining the webpage type of the webpage to be judged based on the title information corresponding to the webpage information if the title information does not contain the preset keyword;
wherein the processing module comprises:
the second acquisition unit is used for acquiring title information of a plurality of webpages serving as reference standards under at least one known webpage type;
the second statistical unit is used for splitting phrases serving as the reference standard from the title information of the webpage serving as the reference standard and counting the number of the phrases serving as the reference standard in each known webpage type;
the first splitting unit is used for splitting at least one phrase from the title information of the webpage to be judged;
the second matching unit is used for matching each phrase with the phrases serving as the reference standards respectively and counting the number of the phrases which are successfully matched under each known webpage type;
the second comparison unit is used for acquiring the ratio of the number of the successfully matched phrases in each known webpage type to the number of the phrases serving as the reference standard in the known webpage type and comparing the ratio with a preset ratio;
and the second output unit is used for taking the known webpage type corresponding to the ratio as the webpage type of the webpage to be judged if the ratio is greater than or equal to the preset ratio.
7. The apparatus according to claim 6, wherein the processing module is further configured to, if the title information includes the preset keyword, use a webpage type corresponding to the preset keyword as the webpage type of the webpage to be judged.
8. The apparatus of claim 7, wherein the obtaining module comprises:
the analysis unit is used for analyzing the webpage to be judged and extracting a domain name of a link corresponding to the webpage to be judged;
and the simulation access unit is used for simulating and accessing the uniform resource locator URL corresponding to the domain name and crawling the page information of the webpage to be judged.
9. The apparatus according to any one of claims 6-8, wherein the processing module comprises:
the first acquisition unit is used for acquiring page information of a plurality of webpages serving as reference standards under at least one known webpage type;
the first statistical unit is used for extracting tag information serving as a reference standard from the page structure information corresponding to the page information of the webpage serving as the reference standard, and counting the number of the tag information serving as the reference standard under each known webpage type;
the first extraction unit is used for extracting at least one piece of tag information from the page structure information corresponding to the page information of the webpage to be judged;
the first matching unit is used for matching each piece of tag information with the tag information serving as the reference standard respectively and counting the number of the tag information which is successfully matched under each known webpage type;
the first comparison unit is used for acquiring the ratio of the number of the successfully matched tag information under each known webpage type to the number of the tag information serving as a reference standard under the known webpage type, and comparing the ratio with a preset ratio;
and the first output unit is used for taking the known webpage type corresponding to the ratio as the webpage type of the webpage to be judged if the ratio is greater than or equal to the preset ratio.
10. The apparatus according to any one of claims 6-8, wherein the processing module comprises:
the third acquisition unit is used for acquiring page information of a plurality of webpages serving as reference standards under at least one known webpage type;
a third statistical unit, configured to extract tag information serving as a reference standard from the page structure information corresponding to the page information of the web page serving as the reference standard, and count the number of tag information serving as the reference standard in each known web page type;
the second extraction unit is used for extracting at least one piece of tag information from the page structure information corresponding to the page information of the webpage to be judged;
the third matching unit is used for matching each piece of tag information with the tag information serving as the reference standard respectively and counting the number of the tag information which is successfully matched under each known webpage type;
the third comparison unit is used for acquiring a first ratio of the number of successfully matched tag information under each known webpage type to the number of tag information serving as a reference standard under the known webpage type, and comparing the first ratio with a first preset ratio;
a fourth obtaining unit, configured to obtain title information of a plurality of webpages serving as reference standards in a known webpage type corresponding to the first ratio if the first ratio is greater than or equal to the first preset ratio;
the fourth statistical unit is used for splitting the word groups serving as the reference standard from the title information of the webpage serving as the reference standard, and counting the number of the word groups serving as the reference standard in each known webpage type;
the second splitting unit is used for splitting at least one phrase from the title information of the webpage to be judged;
the fourth matching unit is used for matching each phrase with the phrases serving as the reference standards respectively and counting the number of the phrases which are successfully matched under each known webpage type;
the fourth comparison unit is used for acquiring a second ratio of the number of the successfully matched phrases in each known webpage type to the number of the phrases serving as the reference standard in the known webpage type, and comparing the second ratio with a second preset ratio;
and the third output unit is used for taking the known webpage type corresponding to the second ratio as the webpage type of the webpage to be judged if the second ratio is greater than or equal to the second preset ratio.
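The two-stage check recited in claim 10 (tag ratio first, then title-phrase ratio for the types that pass) could be sketched roughly as below. This is a minimal sketch under stated assumptions, not the patented implementation: tag information is modeled as a set of strings, titles are split on whitespace (Chinese titles would need a word segmenter instead), and the names `ratio`, `split_title`, and `judge_two_stage` are hypothetical.

```python
from typing import Dict, List, Set


def ratio(candidates: Set[str], reference: Set[str]) -> float:
    """Matched count divided by reference count (0.0 if no reference items)."""
    return len(candidates & reference) / len(reference) if reference else 0.0


def split_title(title: str) -> Set[str]:
    # Assumption: lowercase whitespace tokens stand in for "phrases".
    return set(title.lower().split())


def judge_two_stage(page_tags: Set[str],
                    page_title: str,
                    ref_tags: Dict[str, Set[str]],
                    ref_titles: Dict[str, List[str]],
                    first_preset: float,
                    second_preset: float) -> List[str]:
    """Return known types passing both the tag check and the title check."""
    page_phrases = split_title(page_title)
    result: List[str] = []
    for page_type, tags in ref_tags.items():
        # Stage 1: first ratio against the first preset ratio.
        if ratio(page_tags, tags) < first_preset:
            continue
        # Stage 2: collect reference phrases only for types that passed.
        ref_phrases: Set[str] = set()
        for title in ref_titles.get(page_type, []):
            ref_phrases |= split_title(title)
        if ratio(page_phrases, ref_phrases) >= second_preset:
            result.append(page_type)
    return result


if __name__ == "__main__":
    # Hypothetical reference data for two known webpage types.
    tags = {"news": {"h1", "article", "time", "p"},
            "forum": {"ul", "li", "form", "textarea"}}
    titles = {"news": ["Breaking news today", "World news report"],
              "forum": ["Forum thread reply"]}
    print(judge_two_stage({"h1", "article", "p"}, "World news today",
                          tags, titles, first_preset=0.6, second_preset=0.5))
```

Deferring the title lookup until a type has passed the tag check mirrors the claim's ordering, in which title information is acquired only for the known webpage type corresponding to a qualifying first ratio.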
CN201611270198.0A 2016-12-29 2016-12-29 Method and device for judging webpage type Active CN108255891B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201611270198.0A CN108255891B (en) 2016-12-29 2016-12-29 Method and device for judging webpage type

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201611270198.0A CN108255891B (en) 2016-12-29 2016-12-29 Method and device for judging webpage type

Publications (2)

Publication Number Publication Date
CN108255891A CN108255891A (en) 2018-07-06
CN108255891B CN108255891B (en) 2020-08-28

Family

ID=62721846

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201611270198.0A Active CN108255891B (en) 2016-12-29 2016-12-29 Method and device for judging webpage type

Country Status (1)

Country Link
CN (1) CN108255891B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110287409B (en) * 2019-06-05 2022-07-22 新华三信息安全技术有限公司 Webpage type identification method and device
CN115004181A (en) * 2020-06-17 2022-09-02 深圳市欢太数字科技有限公司 Webpage detection method and device, electronic equipment and storage medium
CN113297525B (en) * 2021-06-17 2023-12-12 恒安嘉新(北京)科技股份公司 Webpage classification method, device, electronic equipment and storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101727500A (en) * 2010-01-15 2010-06-09 清华大学 Text classification method of Chinese web page based on stream clustering
CN101814083A (en) * 2010-01-08 2010-08-25 上海复歌信息科技有限公司 Automatic webpage classification method and system
WO2012083874A1 (en) * 2010-12-22 2012-06-28 北大方正集团有限公司 Webpage information detection method and system
CN103309862A (en) * 2012-03-07 2013-09-18 腾讯科技(深圳)有限公司 Webpage type recognition method and system

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101814083A (en) * 2010-01-08 2010-08-25 上海复歌信息科技有限公司 Automatic webpage classification method and system
CN101727500A (en) * 2010-01-15 2010-06-09 清华大学 Text classification method of Chinese web page based on stream clustering
WO2012083874A1 (en) * 2010-12-22 2012-06-28 北大方正集团有限公司 Webpage information detection method and system
CN103309862A (en) * 2012-03-07 2013-09-18 腾讯科技(深圳)有限公司 Webpage type recognition method and system

Also Published As

Publication number Publication date
CN108255891A (en) 2018-07-06

Similar Documents

Publication Publication Date Title
CN108241621B (en) Legal knowledge retrieval method and device
CN106649316B (en) Video pushing method and device
CN110968663B (en) Answer display method and device of question-answering system
CN106610931B (en) Topic name extraction method and device
CN108255891B (en) Method and device for judging webpage type
CN107368489B (en) Information data processing method and device
CN110602045A (en) Malicious webpage identification method based on feature fusion and machine learning
CN107015986B (en) Method and device for crawling webpage by crawler
CN109582883B (en) Column page determination method and device
CN112818126B (en) Training method, application method and device for network security corpus construction model
CN113869789A (en) Risk monitoring method and device, computer equipment and storage medium
CN110569429B (en) Method, device and equipment for generating content selection model
CN113535817A (en) Method and device for generating characteristic broad table and training business processing model
CN104750604A (en) Generating method and device for browser compatibility test case
CN117033744A (en) Data query method and device, storage medium and electronic equipment
CN110019295B (en) Database retrieval method, device, system and storage medium
CN106776654B (en) Data searching method and device
CN110929188A (en) Method and device for rendering server page
CN114021064A (en) Website classification method, device, equipment and storage medium
CN110968691B (en) Judicial hotspot determination method and device
CN110019771B (en) Text processing method and device
CN106997353B (en) Method and device for monitoring webpage version change
CN111428037A (en) Method for analyzing matching performance of behavior policy
CN113392628A (en) Method and device for checking text analysis result
CN110968821A (en) Website processing method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 100080 No. 401, 4th Floor, Haitai Building, 229 North Fourth Ring Road, Haidian District, Beijing

Applicant after: Beijing Guoshuang Technology Co.,Ltd.

Address before: 100086 Cuigong Hotel, 76 Zhichun Road, Shuangyushu District, Haidian District, Beijing

Applicant before: Beijing Guoshuang Technology Co.,Ltd.

GR01 Patent grant