CN106250402B - Website classification method and device - Google Patents

Info

Publication number
CN106250402B
Authority
CN
China
Prior art keywords
website
category
label
word
classified
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201610574744.3A
Other languages
Chinese (zh)
Other versions
CN106250402A (en)
Inventor
张惊申
任方英
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
New H3C Technologies Co Ltd
Original Assignee
New H3C Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by New H3C Technologies Co Ltd
Priority to CN201610574744.3A
Publication of CN106250402A
Application granted
Publication of CN106250402B
Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/958 Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

An embodiment of the invention discloses a website classification method and device. The method comprises: acquiring first tag information and first webpage content of a website to be classified, wherein the first tag information is a part of the first webpage content; determining the website category corresponding to the first tag information according to a preset tag classification dictionary, wherein the tag classification dictionary comprises a correspondence between tag information and website categories; and determining the website category of the website to be classified according to the first webpage content and the website classification dictionary corresponding to the determined website category. The technical solution provided by the embodiment improves the efficiency of website classification.

Description

Website classification method and device
Technical Field
The invention relates to the technical field of internet, in particular to a website classification method and device.
Background
The Internet contains an extremely large number of websites of many kinds, such as news websites, sports websites, and shopping websites. Faced with such a variety, enterprises or organizations often need to filter websites so that internal staff are prohibited from accessing websites of specified categories. To decide whether a website needs to be filtered out, the website must first be classified.
Currently, the process of website classification is generally as follows: the content of the pages of the website to be accessed is determined and matched against the words in all preset website classification dictionaries, where each website category corresponds to one website classification dictionary and each website classification dictionary comprises a correspondence between words and weight values; the category of the website to be accessed is then determined according to the matched weight values. Because the content has to be matched against every word in every website classification dictionary, website classification is inefficient.
Disclosure of Invention
The embodiment of the invention discloses a website classification method and device, which improve the efficiency of website classification.
In order to achieve the above object, the embodiment of the present invention discloses a website classification method, which comprises:
acquiring first label information and first webpage content of a website to be classified, wherein the first label information is a part of the first webpage content;
determining the website category corresponding to the first tag information according to a preset tag classification dictionary, wherein the tag classification dictionary comprises: the corresponding relation between the label information and the website category;
and determining the website category of the website to be classified according to the website classification dictionary corresponding to the determined website category and the first webpage content.
In order to achieve the above object, an embodiment of the present invention further discloses a website classification apparatus, where the apparatus includes:
a first acquisition unit, configured to acquire first tag information and first webpage content of a website to be classified, wherein the first tag information is a part of the first webpage content;
a first determining unit, configured to determine, according to a preset tag classification dictionary, a website category corresponding to the first tag information, where the tag classification dictionary includes: the corresponding relation between the label information and the website category;
and the second determining unit is used for determining the website category of the website to be classified according to the website classification dictionary corresponding to the determined website category and the first webpage content.
The embodiment of the invention provides a website classification method and device. First tag information and first webpage content of a website to be classified are acquired, where the first tag information is small in volume and the first webpage content is large. The website categories to which the website may belong are first screened out of all website categories according to the first tag information and a preset tag classification dictionary, and the website category of the website to be classified is then determined according to the first webpage content and the website classification dictionaries corresponding to the screened-out categories. Because the full webpage content is matched only against the dictionaries of the screened-out categories, the efficiency of website classification is effectively improved.
Drawings
To illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly introduced below. The drawings described below show only some embodiments of the present invention; those skilled in the art can obtain other drawings from them without creative effort.
Fig. 1 is a schematic flowchart of a website classification method according to an embodiment of the present invention;
Fig. 2 is a schematic flowchart of the construction of the classification dictionaries in the website classification method according to an embodiment of the present invention;
Fig. 3 is a schematic structural diagram of a website classification device according to an embodiment of the present invention;
Fig. 4 is a schematic structural diagram of a device for constructing the classification dictionaries used in the website classification device according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The present invention will be described in detail below with reference to specific examples.
Referring to fig. 1, fig. 1 is a schematic flowchart of a website classification method according to an embodiment of the present invention, where the method includes:
S101: acquiring first tag information and first webpage content of a website to be classified;
Here, the website to be classified may be a website that a user needs to visit, or a website preset by the user. The first tag information is a part of the first webpage content. It may be title information of the first webpage content, such as "Tmall Supermarket" or "Baidu Tieba"; it may also be column titles in the first webpage content, such as "entertainment stars", "movies", and "novels" in "Baidu Tieba".
It should be noted that this embodiment does not limit the first tag information: any content that can represent the features of the website may be used as the first tag information.
In an embodiment of the present invention, a URL (Uniform Resource Locator) of a website to be classified may be first obtained, a web crawler tool is used to access the URL, and tag information and web page content of the website are extracted from content fed back by the website.
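The extraction step can be sketched in Python as follows. This is only an illustration under stated assumptions: the page title is taken as the tag information, the class and function names are hypothetical, and the standard-library `html.parser` module stands in for the unspecified web crawler tool. Fetching the page is left out so that the sketch runs without network access.

```python
from html.parser import HTMLParser


class TagAndTextExtractor(HTMLParser):
    """Collects the <title> text (assumed tag information) and the visible text."""

    SKIPPED = {"script", "style"}  # code that is not effective page content

    def __init__(self):
        super().__init__()
        self.title_parts = []
        self.text_parts = []
        self._in_title = False
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth:
            return  # ignore script and style bodies
        stripped = data.strip()
        if not stripped:
            return
        if self._in_title:
            self.title_parts.append(stripped)
        # The tag information is itself part of the webpage content.
        self.text_parts.append(stripped)


def extract_tag_and_content(html):
    """Return (tag information, webpage text) for one HTML document."""
    parser = TagAndTextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.title_parts), " ".join(parser.text_parts)
```

In practice the HTML string would come from accessing the URL, for example with `urllib.request.urlopen`; HTML comments are not passed to `handle_data`, so they are excluded from the text automatically.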
S102: determining a website category corresponding to the first label information according to a preset label classification dictionary;
wherein the tag classification dictionary comprises a correspondence between tag information and website categories.
Because the tag information contains few words, matching the first tag information against the tag classification dictionary quickly determines the website category corresponding to the first tag information.
In one embodiment of the invention, the tag information in the tag classification dictionary may be tag words. In that case, matching the first tag information directly against the tag classification dictionary may produce a false match. For example, in the tag information "Beijing university research center", the two characters "big" and "study" are adjacent and can therefore be matched by the tag word "university"; in fact, however, the two characters belong to different words, "largest" and "study" respectively, so matching this tag information with the tag word "university" is a false match.
To avoid false matches, the first tag information may be segmented into at least one first tag word. For example, once the tag information "Beijing university research center" has been segmented into tag words, it can no longer be matched by the tag word "university", which effectively avoids the false match.
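The effect of segmentation on matching can be illustrated with a minimal Python sketch. The English phrase and the tag word "car" are hypothetical stand-ins chosen for illustration, and whitespace splitting stands in for a real word segmentation tool, which is not named here.

```python
def matches_raw_text(tag_info: str, tag_word: str) -> bool:
    """Substring matching on unsegmented tag information (prone to false matches)."""
    return tag_word in tag_info


def matches_segmented(tag_words: list[str], tag_word: str) -> bool:
    """Matching against whole segmented words only."""
    return tag_word in tag_words


# Hypothetical tag information: "car" occurs inside "carpet" but is not a word here.
tag_info = "carpet cleaning service"
tag_words = tag_info.split()  # stand-in for a real word segmenter

raw_result = matches_raw_text(tag_info, "car")          # substring matching finds "car"
segmented_result = matches_segmented(tag_words, "car")  # word matching does not
```

Matching on whole segmented words trades a segmentation pass for the removal of matches that straddle word boundaries.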
After the at least one first tag word is obtained, the website category corresponding to each first tag word can be determined, and from these the website category corresponding to the first tag information. Specifically: each first tag word is matched against the tag words in the tag classification dictionary, and the website categories corresponding to the matched tag words are gathered to obtain an initial classification set corresponding to the first tag information; duplicate website categories are then removed from the initial classification set, and the remaining website categories are determined to be the website categories corresponding to the first tag information. In one embodiment, the website categories corresponding to the first tag information together serve as a suspected classification set of the website to be classified.
Suppose the obtained first tag information of the website to be classified is "Decathlon sports supermarket | professional sporting goods shop exclusive sales". Segmenting this first tag information yields 7 first tag words: "Decathlon", "sports", "supermarket", "professional", "sporting goods", "shop", and "exclusive sales". Matching each first tag word against the tag words in the tag classification dictionary determines that:
the website category corresponding to "sports" is: "sports";
the website categories corresponding to "supermarket" are: "shopping" and "business";
the website categories corresponding to "shop" are: "shopping" and "business";
the other 4 words do not correspond to any website category.
At this point, the initial classification set corresponding to the first tag information is: {"sports", "shopping", "business", "shopping", "business"}. After the duplicate website categories "shopping" and "business" are removed from the initial classification set, the website categories corresponding to the first tag information, that is, the suspected classification set of the website to be classified, are: {"sports", "shopping", "business"}.
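The matching and de-duplication in S102 can be sketched as follows. This is a minimal Python illustration, not a definitive implementation: the function name, the dictionary layout (tag word mapped to a list of website categories), and the shortened word list are assumptions made for the example.

```python
# Assumed layout of the tag classification dictionary: tag word -> website categories.
TAG_DICTIONARY = {
    "sports": ["sports"],
    "supermarket": ["shopping", "business"],
    "shop": ["shopping", "business"],
}


def suspected_categories(tag_words, tag_dictionary):
    """Gather the categories of all matched tag words, then drop duplicates."""
    initial_set = []
    for word in tag_words:
        # Words absent from the dictionary contribute no category.
        initial_set.extend(tag_dictionary.get(word, []))
    # dict.fromkeys removes duplicates while keeping first-seen order.
    return list(dict.fromkeys(initial_set))


# Shortened version of the segmented tag information from the example.
first_tag_words = ["sports", "supermarket", "professional", "shop"]
suspected = suspected_categories(first_tag_words, TAG_DICTIONARY)
```

An ordered list is used instead of a set only so that the result is reproducible; the order of the suspected categories carries no meaning in the method.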
S103: and determining the website category of the website to be classified according to the website classification dictionary corresponding to the determined website category and the first webpage content.
In one embodiment of the present invention, to reduce the amount of data to be matched, invalid words are kept out of the website classification dictionaries: the website classification dictionary corresponding to each website category includes the valid words for that website category and a weight value for each valid word. Here, invalid words include web page code that is not effective page content, script text, comment text, and the like.
In this case, when the website category of the website to be classified is determined, the text information of the first webpage content may be extracted and segmented to obtain at least one first valid word; a first weight value of each first valid word for each determined website category is obtained according to the website classification dictionaries corresponding to the determined website categories; and the sum of the first weight values for each website category is calculated, the website category with the largest sum being taken as the website category of the website to be classified.
Continuing the example from S102, suppose the first valid words obtained from the first webpage content are X1, X2, X3, X4, and X5. Matching these first valid words against the valid words in the website classification dictionaries corresponding to the "sports", "shopping", and "business" website categories determines the following first weight values:
"sports" website category: X1 is 100; X2 is 200; X3 is 240; X4 is 70; X5 is 300;
"shopping" website category: X1 is 400; X2 is 300; X3 is 500; X4 is 1460; X5 is 1330;
"business" website category: X1 is 50; X2 is 100; X3 is 300; X4 is 20; X5 is 150.
From these first weight values, the sum of the first weight values for each website category is:
"sports": 100 + 200 + 240 + 70 + 300 = 910;
"shopping": 400 + 300 + 500 + 1460 + 1330 = 3990;
"business": 50 + 100 + 300 + 20 + 150 = 620.
The "shopping" category has the largest sum, so the website category of the website to be classified is determined to be "shopping".
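The weight summation in S103 can be sketched in Python as follows. The sketch is illustrative only: the function name and the dictionary layout (website category mapped to a word-to-weight table) are assumptions, the weight values are those of the example, and how a word that occurs several times on a page is counted is not specified in the text, so each listed word is simply counted once.

```python
# Assumed layout: website category -> {valid word: weight value}.
SITE_DICTIONARIES = {
    "sports": {"X1": 100, "X2": 200, "X3": 240, "X4": 70, "X5": 300},
    "shopping": {"X1": 400, "X2": 300, "X3": 500, "X4": 1460, "X5": 1330},
    "business": {"X1": 50, "X2": 100, "X3": 300, "X4": 20, "X5": 150},
}


def classify_by_weight(valid_words, dictionaries):
    """Sum the weights of the valid words per category and pick the largest sum."""
    totals = {
        category: sum(weights.get(word, 0) for word in valid_words)
        for category, weights in dictionaries.items()
    }
    best_category = max(totals, key=totals.get)
    return best_category, totals


first_valid_words = ["X1", "X2", "X3", "X4", "X5"]
category, weight_sums = classify_by_weight(first_valid_words, SITE_DICTIONARIES)
```

With the example weights, `category` is "shopping". Words that do not appear in a category's dictionary contribute a weight of 0 for that category.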
In practical applications, some website categories are prone to misclassification when matched together with all other website categories. For example, the "news" website category covers a wide variety of page types, which may include "shopping", "sports", "business", and "education" pages; another example is the "advertisement" website category, which raises the same difficulty. In the embodiment of the invention, the website categories corresponding to the tag information, that is, the suspected classification set, are first determined from the tag information, and the website category of the website to be classified is then determined according to the first webpage content and the website classification dictionaries corresponding to the determined website categories. Because only the website classification dictionaries of the suspected categories are matched, categories outside the suspected classification set cannot cause such misclassification.
In an embodiment of the present invention, it may happen that no website category can be determined from the tag information, so that the determined website category is empty, that is, the suspected classification set is empty. In this case, the website category of the website to be classified may be determined according to all the website classification dictionaries and the first webpage content.
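The fallback for an empty suspected classification set can be sketched as follows. This Python sketch rests on assumptions: the function names, the dictionary layout (website category mapped to a word-to-weight table), and all words and weight values are hypothetical.

```python
def select_dictionaries(suspected, all_dictionaries):
    """Return the website classification dictionaries to match against."""
    if not suspected:
        # Empty suspected classification set: fall back to every dictionary.
        return dict(all_dictionaries)
    return {c: all_dictionaries[c] for c in suspected if c in all_dictionaries}


def classify_with_fallback(valid_words, suspected, all_dictionaries):
    """Classify using the suspected categories, or all categories if none."""
    selected = select_dictionaries(suspected, all_dictionaries)
    totals = {
        category: sum(weights.get(word, 0) for word in valid_words)
        for category, weights in selected.items()
    }
    return max(totals, key=totals.get) if totals else None


# Hypothetical dictionaries and weights, for illustration only.
ALL_DICTIONARIES = {
    "sports": {"ball": 5, "match": 3},
    "shopping": {"cart": 4, "price": 6},
    "news": {"match": 1, "price": 1, "report": 9},
}

no_tag_match = classify_with_fallback(["price", "cart"], [], ALL_DICTIONARIES)
narrowed = classify_with_fallback(["report", "match"], ["sports"], ALL_DICTIONARIES)
```

The second call also shows the narrowing effect: with the suspected set limited to "sports", the broad "news" dictionary is never consulted, although it would have scored higher on the same words.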
In an embodiment of the present invention, to enable website classification, a tag classification dictionary and website classification dictionaries need to be constructed in advance, before the first tag information and first webpage content of the website to be classified are acquired. Referring to fig. 2, the construction includes:
S201: configuring N initial website categories, wherein N is a positive integer;
Here, the initial website categories may include "news", "sports", "finance", and so on. In addition, all website categories may be set as first-level categories, or may be further subdivided into second-level and third-level categories. For example, "news" may be set as a first-level category with the second-level categories "current events", "sports", "shopping", and so on under it; "finance" may be set as a first-level category with the second-level categories "bank", "securities", and so on under it.
S202: acquiring second label information and second webpage content of at least one sample website corresponding to each initial website category;
Specifically, the URL of at least one sample website corresponding to each initial website category is obtained, a web crawler tool accesses the sample website URLs category by category, and the tag information and webpage content of each sample website are extracted from the content returned by the sample website. Suppose the determined initial website categories are "sports" and "shopping". For the "sports" initial website category, the URLs of sample sports websites such as Sina Sports, Sohu Sports, and Tencent Sports may be obtained and accessed to acquire the tag information and webpage content corresponding to the "sports" initial website category; for the "shopping" initial website category, the URLs of sample shopping websites such as Taobao, Vipshop, and Jumei Youpin may be obtained and accessed to acquire the tag information and webpage content corresponding to the "shopping" initial website category.
S203: for each initial website category, extracting second label words from second label information of each corresponding sample website, and correspondingly storing the second label words and the initial website category to a label classification dictionary;
The second tag information of each sample website is segmented, the words closely related to the initial website category of the sample website are extracted from the segmented words, and the extracted words are used as second tag words. Storing only these second tag words with the initial website category in the tag classification dictionary reduces the amount of data in the tag classification dictionary and can further increase the speed of website classification. Continuing the example above, for the "shopping" initial website category, second tag words such as "supermarket", "flagship store", and "shop" may be extracted from the sample websites Taobao, Vipshop, and Jumei Youpin, and these second tag words are stored in the tag classification dictionary in correspondence with "shopping".
S204: for each initial website category, segmenting the text information of the second webpage content of each corresponding sample website, removing invalid words to obtain at least one second valid word, and configuring a second weight value for each second valid word; and correspondingly storing each second valid word and each second weighted value to the website classification dictionary of the initial website category.
It should be noted that S204 may be executed before S203 or simultaneously with S203; the present invention does not limit this. The website classification dictionary may take the form of a table or of text. All the website classification dictionaries may be placed in one classification dictionary set, that is, in a single table or text; alternatively, each website classification dictionary may be stored separately, that is, each in its own table or text.
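Steps S201 to S204 can be sketched in Python as follows. The sketch makes several assumptions that the text leaves open: the sample data is already segmented, the words "closely related" to a category are supplied as a hand-curated set per category, and the weight value of a valid word is simply its occurrence count in the samples of that category. All names and sample words are hypothetical.

```python
from collections import Counter, defaultdict


def build_dictionaries(samples, related_tag_words, invalid_words):
    """Build the tag classification dictionary and the website classification dictionaries.

    samples: iterable of (category, segmented tag words, segmented content words).
    related_tag_words: category -> set of tag words judged closely related (assumed curated).
    invalid_words: words to discard from the page content.
    """
    tag_dictionary = defaultdict(set)         # tag word -> website categories
    site_dictionaries = defaultdict(Counter)  # category -> {valid word: weight}
    for category, tag_words, content_words in samples:
        # S203: keep only tag words closely related to the category.
        for word in tag_words:
            if word in related_tag_words.get(category, set()):
                tag_dictionary[word].add(category)
        # S204: drop invalid words; weight = occurrence count (an assumption).
        for word in content_words:
            if word not in invalid_words:
                site_dictionaries[category][word] += 1
    return dict(tag_dictionary), {c: dict(w) for c, w in site_dictionaries.items()}


# Hypothetical, already segmented sample data for two initial website categories.
SAMPLES = [
    ("shopping", ["Taobao", "supermarket"], ["buy", "the", "cart", "buy"]),
    ("sports", ["Tencent", "sports"], ["match", "the", "score"]),
]
RELATED = {"shopping": {"supermarket", "flagship store", "shop"}, "sports": {"sports"}}

tag_dict, site_dicts = build_dictionaries(SAMPLES, RELATED, invalid_words={"the"})
```

Here all website classification dictionaries are held in one mapping, which corresponds to the single classification dictionary set described above; storing each inner table separately would be the other option.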
The embodiment of the invention provides a website classification method. First tag information and first webpage content of a website to be classified are acquired, where the first tag information is small in volume and the first webpage content is large. The website categories to which the website may belong are first screened out of all website categories according to the first tag information and a preset tag classification dictionary, and the website category of the website to be classified is then determined according to the first webpage content and the website classification dictionaries corresponding to the screened-out categories. Because the full webpage content is matched only against the dictionaries of the screened-out categories, the efficiency of website classification is effectively improved.
Referring to fig. 3, fig. 3 is a schematic structural diagram of a website classification device according to an embodiment of the present invention, the device including:
a first obtaining unit 301, configured to obtain first tag information and first web content of a website to be classified, where the first tag information is a part of the first web content;
a first determining unit 302, configured to determine, according to a preset tag classification dictionary, a website category corresponding to the first tag information, where the tag classification dictionary includes: the corresponding relation between the label information and the website category;
a second determining unit 303, configured to determine a website category of the website to be classified according to the website classification dictionary corresponding to the determined website category and the first webpage content.
In an embodiment of the present invention, the first obtaining unit 301 is specifically configured to:
acquiring a Uniform Resource Locator (URL) of a website to be classified; and accessing the URL to acquire first label information and first webpage content of the website to be classified.
In one embodiment of the present invention, the tag information in the tag classification dictionary is a tag word;
in this case, the first determining unit 302 may include:
a first word segmentation subunit (not shown in fig. 3) configured to perform word segmentation on the first tag information to obtain at least one first tag word;
a first determining subunit (not shown in fig. 3) configured to determine, according to a preset tag classification dictionary, a website category corresponding to each first tag word;
a second determining subunit (not shown in fig. 3), configured to determine the website categories corresponding to the first tag words as the website category corresponding to the first tag information.
In one embodiment of the present invention, the website classification dictionary of each website category includes valid words of the website category and a weight value of each valid word;
in this case, the second determining unit 303 may include:
a second word segmentation subunit (not shown in fig. 3) configured to segment the text information of the first web content to obtain at least one first valid word;
an obtaining subunit (not shown in fig. 3) configured to obtain, according to the website classification dictionary corresponding to the determined website category and each first valid word, a first weight value of each first valid word for each website category;
and a third determining subunit (not shown in fig. 3) configured to determine the website category with the largest sum of the first weight values as the website category of the website to be classified.
To enable website classification, an embodiment of the present invention further provides a device for constructing the classification dictionaries used in the website classification device. Referring to fig. 4, the device includes:
a configuration unit 401, configured to configure N initial website categories, where N is a positive integer;
a second obtaining unit 402, configured to obtain second tag information and second web page content of at least one sample website corresponding to each initial website category;
a first extracting unit 403, configured to, for each initial website category, extract a second tagged word from second tag information of each corresponding sample website, and store the second tagged word and the initial website category in the tag classification dictionary in a corresponding manner;
a second extracting unit 404, configured to, for each initial website category, perform word segmentation on text information of the second web content of each corresponding sample website, remove an invalid word, obtain at least one second valid word, and configure a second weight value for each second valid word; and correspondingly storing each second valid word and each second weighted value to the website classification dictionary of the initial website category.
In an embodiment of the present invention, the website classifying device may further include:
a third determining unit (not shown in fig. 3), configured to determine, if the determined website category is empty, a website category of the website to be classified according to all the website classification dictionaries and the first webpage content.
The embodiment of the invention provides a website classification device. First tag information and first webpage content of a website to be classified are acquired, where the first tag information is small in volume and the first webpage content is large. The website categories to which the website may belong are first screened out of all website categories according to the first tag information and a preset tag classification dictionary, and the website category of the website to be classified is then determined according to the first webpage content and the website classification dictionaries corresponding to the screened-out categories. Because the full webpage content is matched only against the dictionaries of the screened-out categories, the efficiency of website classification is effectively improved.
Since the device embodiments are basically similar to the method embodiments, they are described only briefly; for the relevant points, refer to the corresponding description of the method embodiments.
It is noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
Those skilled in the art will appreciate that all or part of the steps in the above method embodiments may be implemented by a program to instruct relevant hardware to perform the steps, and the program may be stored in a computer-readable storage medium, which is referred to herein as a storage medium, such as: ROM/RAM, magnetic disk, optical disk, etc.
The above description is only for the preferred embodiment of the present invention, and is not intended to limit the scope of the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention shall fall within the protection scope of the present invention.

Claims (12)

1. A method for classifying a website, the method comprising:
acquiring first label information and first webpage content of a website to be classified, wherein the first label information is a part of the first webpage content;
determining the website category corresponding to the first tag information according to a preset tag classification dictionary, wherein the tag classification dictionary comprises: the corresponding relation between the label information and the website category; the website category corresponding to the first label information is a suspected website category of the website to be classified;
and determining the website category of the website to be classified according to the website classification dictionary corresponding to the determined website category and the first webpage content.
2. The method of claim 1, wherein the obtaining the first label information and the first webpage content of the website to be classified comprises:
acquiring a Uniform Resource Locator (URL) of a website to be classified;
and accessing the URL to acquire first label information and first webpage content of the website to be classified.
3. The method of claim 1, wherein the label information in the label classification dictionary is a label word;
the determining the website category corresponding to the first label information according to a preset label classification dictionary comprises:
performing word segmentation on the first label information to obtain at least one first label word;
determining the website category corresponding to each first label word according to the preset label classification dictionary;
and determining the website category corresponding to each first label word as the website category corresponding to the first label information.
4. The method of claim 1, wherein the website classification dictionary of each website category comprises valid words of the website category and a weight value of each valid word;
the determining the website category of the website to be classified according to the website classification dictionary corresponding to the determined website category and the first webpage content comprises:
performing word segmentation on the text information of the first webpage content to obtain at least one first valid word;
obtaining, according to the website classification dictionary corresponding to the determined website category and each first valid word, a first weight value of each first valid word for each website category;
and determining the website category with the largest sum of first weight values as the website category of the website to be classified.
5. The method of claim 1, wherein the label information in the label classification dictionary is a label word;
before the obtaining of the first label information and the first webpage content of the website to be classified, the method further includes:
configuring N initial website categories, wherein N is a positive integer;
acquiring second label information and second webpage content of at least one sample website corresponding to each initial website category;
for each initial website category, extracting second label words from the second label information of each corresponding sample website, and storing the second label words in the label classification dictionary in correspondence with the initial website category;
for each initial website category, performing word segmentation on the text information of the second webpage content of each corresponding sample website, removing invalid words to obtain at least one second valid word, and configuring a second weight value for each second valid word; and storing each second valid word and its second weight value in the website classification dictionary of the initial website category.
6. The method of claim 1, wherein if the determined website category is empty, the method further comprises:
and determining the website category of the website to be classified according to all the website classification dictionaries and the first webpage content.
7. An apparatus for classifying a website, the apparatus comprising:
a first obtaining unit, configured to acquire first label information and first webpage content of a website to be classified, wherein the first label information is a part of the first webpage content;
a first determining unit, configured to determine, according to a preset label classification dictionary, a website category corresponding to the first label information, wherein the label classification dictionary comprises a correspondence between label information and website categories, and the website category corresponding to the first label information is a suspected website category of the website to be classified;
and a second determining unit, configured to determine the website category of the website to be classified according to the website classification dictionary corresponding to the determined website category and the first webpage content.
8. The apparatus according to claim 7, wherein the first obtaining unit is specifically configured to:
acquiring a Uniform Resource Locator (URL) of a website to be classified; and accessing the URL to acquire first label information and first webpage content of the website to be classified.
9. The apparatus of claim 7, wherein the label information in the label classification dictionary is a label word;
the first determining unit comprises:
a first word segmentation subunit, configured to perform word segmentation on the first label information to obtain at least one first label word;
a first determining subunit, configured to determine the website category corresponding to each first label word according to the preset label classification dictionary;
and a second determining subunit, configured to determine the website category corresponding to each first label word as the website category corresponding to the first label information.
10. The apparatus of claim 7, wherein the website classification dictionary of each website category comprises valid words of the website category and a weight value of each valid word;
the second determining unit comprises:
a second word segmentation subunit, configured to perform word segmentation on the text information of the first webpage content to obtain at least one first valid word;
an obtaining subunit, configured to obtain, according to the website classification dictionary corresponding to the determined website category and each first valid word, a first weight value of each first valid word for each website category;
and a third determining subunit, configured to determine the website category with the largest sum of first weight values as the website category of the website to be classified.
11. The apparatus according to any one of claims 7-10, further comprising:
a configuration unit, configured to configure N initial website categories, wherein N is a positive integer;
a second obtaining unit, configured to acquire second label information and second webpage content of at least one sample website corresponding to each initial website category;
a first extraction unit, configured to, for each initial website category, extract second label words from the second label information of each corresponding sample website and store the second label words in the label classification dictionary in correspondence with the initial website category;
a second extraction unit, configured to, for each initial website category, perform word segmentation on the text information of the second webpage content of each corresponding sample website, remove invalid words to obtain at least one second valid word, and configure a second weight value for each second valid word; and to store each second valid word and its second weight value in the website classification dictionary of the initial website category.
12. The apparatus of claim 7, further comprising:
a third determining unit, configured to determine, if the determined website category is empty, the website category of the website to be classified according to all the website classification dictionaries and the first webpage content.
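
The method claims above (claims 1 and 3 to 6) can be illustrated with a minimal Python sketch. It is an illustration under stated assumptions, not the patented implementation: the function names (`segment`, `build_dictionaries`, `classify`), the regex-based segmenter, the `STOP_WORDS` set standing in for "invalid words", and the relative-frequency weight values are all hypothetical choices, since the claims do not specify how words are segmented or how weight values are configured.

```python
import re
from collections import defaultdict

# Assumption: a tiny stop-word set stands in for the "invalid words" of claim 5.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "for", "on", "with"}


def segment(text):
    """Stand-in word segmenter: lowercase alphanumeric tokens (assumption)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_dictionaries(samples):
    """Build both dictionaries from sample websites (claim 5).

    samples maps each initial category to a list of
    (label_information, page_text) pairs, one per sample website.
    """
    label_dict = defaultdict(set)  # label word -> set of website categories
    site_dicts = {}                # category -> {valid word: weight value}
    for category, sites in samples.items():
        counts = defaultdict(int)
        for label_info, page_text in sites:
            for word in segment(label_info):
                label_dict[word].add(category)
            for word in segment(page_text):
                if word not in STOP_WORDS:
                    counts[word] += 1
        total = sum(counts.values())
        # Assumption: the weight value is the word's relative frequency.
        site_dicts[category] = (
            {w: c / total for w, c in counts.items()} if total else {}
        )
    return dict(label_dict), site_dicts


def classify(label_info, page_text, label_dict, site_dicts):
    """Classify one website from its label information and page text."""
    # Claims 1 and 3: label words select the suspected categories.
    suspected = set()
    for word in segment(label_info):
        suspected |= label_dict.get(word, set())
    # Claim 6: if no category is suspected, fall back to all dictionaries.
    candidates = suspected or set(site_dicts)
    # Claim 4: sum the weight values of the valid words per candidate.
    valid_words = [w for w in segment(page_text) if w not in STOP_WORDS]
    scores = {
        c: sum(site_dicts[c].get(w, 0.0) for w in valid_words)
        for c in candidates
    }
    return max(scores, key=scores.get) if scores else None


if __name__ == "__main__":
    samples = {
        "news": [("Daily News Headlines",
                  "breaking news report politics world news")],
        "shopping": [("Online Shop Deals",
                      "buy cart discount price shop buy")],
    }
    label_dict, site_dicts = build_dictionaries(samples)
    print(classify("World News Today", "politics report breaking story",
                   label_dict, site_dicts))
```

Restricting the scoring step to the suspected categories means only a subset of the website classification dictionaries is consulted for most websites; the full set is used only when the label words match nothing.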
Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610574744.3A CN106250402B (en) 2016-07-19 2016-07-19 Website classification method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610574744.3A CN106250402B (en) 2016-07-19 2016-07-19 Website classification method and device

Publications (2)

Publication Number Publication Date
CN106250402A (en) 2016-12-21
CN106250402B (en) 2022-01-21

Family

ID=57613403

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610574744.3A Active CN106250402B (en) 2016-07-19 2016-07-19 Website classification method and device

Country Status (1)

Country Link
CN (1) CN106250402B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106250402B (en) * 2016-07-19 2022-01-21 新华三技术有限公司 Website classification method and device
CN107545020A (en) * 2017-05-10 2018-01-05 新华三信息安全技术有限公司 A kind of determination method and device of Web page classifying
CN107506478A (en) * 2017-09-08 2017-12-22 北京京东尚科信息技术有限公司 A kind of method and apparatus for distinguishing Website page
CN108364028A (en) * 2018-03-06 2018-08-03 中国科学院信息工程研究所 A kind of internet site automatic classification method based on deep learning
CN110321517B (en) * 2019-07-25 2021-07-30 秒针信息技术有限公司 Method and device for detecting playing proportion of browsing resources and readable storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101814083A (en) * 2010-01-08 2010-08-25 上海复歌信息科技有限公司 Automatic webpage classification method and system
CN102207961A (en) * 2011-05-25 2011-10-05 盛乐信息技术(上海)有限公司 Automatic web page classification method and device
CN103678310A (en) * 2012-08-31 2014-03-26 腾讯科技(深圳)有限公司 Method and device for classifying webpage topics
CN104424308A (en) * 2013-09-04 2015-03-18 中兴通讯股份有限公司 Web page classification standard acquisition method and device and web page classification method and device
CN106250402A (en) * 2016-07-19 2016-12-21 杭州华三通信技术有限公司 A kind of Website classification method and device

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8539329B2 (en) * 2006-11-01 2013-09-17 Bloxx Limited Methods and systems for web site categorization and filtering
US20110010224A1 (en) * 2009-07-13 2011-01-13 Naveen Gupta System and method for user-targeted listings
CN102955859B (en) * 2012-11-16 2016-06-01 北京奇虎科技有限公司 Web page content revealing method and device
US10115121B2 (en) * 2013-12-11 2018-10-30 Adobe Systems Incorporated Visitor session classification based on clickstreams
CN104750754A (en) * 2013-12-31 2015-07-01 北龙中网(北京)科技有限责任公司 Website industry classification method and server

Also Published As

Publication number Publication date
CN106250402A (en) 2016-12-21

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
CB02 Change of applicant information
Address after: 310052 No. 466, Changhe Road, Binjiang District, Zhejiang, China
Applicant after: New H3C Technologies Co., Ltd.
Address before: 310053 No. 310, Liuhe Road, Hangzhou Science and Technology Industrial Park, High-Tech Industrial Development Zone, Zhejiang Province
Applicant before: Hangzhou H3C Technologies Co., Ltd.
SE01 Entry into force of request for substantive examination
GR01 Patent grant