CN106557520A - Method and device for identifying website type - Google Patents

Method and device for identifying website type

Info

Publication number
CN106557520A
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201510634837.6A
Other languages
Chinese (zh)
Inventor
李曙聪
牛朋涛
董长阳
蒋智超
徐元峰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN201510634837.6A
Publication of CN106557520A
Legal status: Pending


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9535 Search customisation based on user profiles and personalisation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/958 Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking

Abstract

The invention provides a method and device for identifying website types. The method includes: screening a predetermined database according to a predetermined keyword set to obtain a website set; filtering the website set according to predetermined features of each website in the website set; and identifying the type of each website according to the features of the websites remaining after filtering. Because website types are identified only after the websites in the predetermined database have been screened and filtered, the method reduces the amount of computation and improves both the efficiency and the accuracy of website type identification.

Description

Method and device for identifying website type
Technical field
The present invention relates to the field of computers, and in particular to a method and device for identifying website types.
Background
With the rapid development of the Internet, the number of websites keeps growing, and their styles and forms vary endlessly. Users can judge the security and operability of a website from its type. At present, website types are usually identified by crawling the content of all web pages stored in a database and then analyzing it to determine the type of the website to which each page belongs. This requires a large amount of storage space and computation and wastes resources.
Summary of the invention
An object of the present invention is to provide a scheme for identifying website types.
According to one aspect of the invention, a method for identifying website types is provided, comprising:
screening a predetermined database according to a predetermined keyword set to obtain a website set;
filtering the website set according to predetermined features of each website in the website set;
identifying the type of each website according to the features of the websites remaining after filtering.
According to another aspect of the invention, a device for identifying website types is provided, comprising:
means for screening a predetermined database according to a predetermined keyword set to obtain a website set;
means for filtering the website set according to predetermined features of each website in the website set;
means for identifying the type of each website according to the features of the websites remaining after filtering.
Because the method and device of this embodiment identify website types only after the websites in the predetermined database have been screened and filtered, they reduce the amount of computation and improve both the efficiency and the accuracy of website type identification.
Brief description of the drawings
Other features, objects and advantages of the invention will become more apparent from the following detailed description of non-limiting embodiments, read with reference to the accompanying drawings:
Fig. 1 shows a flowchart of a method for identifying website types according to an embodiment of the invention.
Fig. 2 shows a detailed flowchart of step S110 of a method for identifying website types according to an embodiment of the invention.
Fig. 3 shows a detailed flowchart of step S130 of a method for identifying website types according to an embodiment of the invention.
Fig. 4 shows a flowchart of further steps included in a method for identifying website types according to an embodiment of the invention.
Fig. 5 shows a flowchart of another method for identifying website types according to an embodiment of the invention.
Fig. 6 shows a structural block diagram of a device for identifying website types according to an embodiment of the invention.
Fig. 7 shows a structural block diagram of another device for identifying website types according to an embodiment of the invention.
In the drawings, a "-" beneath a word indicates an embedded link, and identical or similar reference numerals denote identical or similar parts.
Detailed description
Those of ordinary skill in the art will appreciate that, although the following detailed description refers to illustrated embodiments and drawings, the invention is not limited to these embodiments. Rather, the scope of the invention is broad and is intended to be limited only by the appended claims.
Before the exemplary embodiments are discussed in more detail, it should be noted that some of them are described as processes or methods depicted as flowcharts. Although a flowchart describes the operations as a sequential process, many of the operations can be performed in parallel, concurrently or simultaneously. In addition, the order of the operations can be rearranged. A process may be terminated when its operations are completed, but it may also have additional steps not shown in the drawings. A process may correspond to a method, function, procedure, subroutine, subprogram, and so on.
The term "terminal", also called a "computer", as used in this context refers to an intelligent electronic terminal that can execute predetermined processes such as numerical and/or logical computation by running predetermined programs or instructions. It may include a processor and a memory, with the processor executing instructions pre-stored in the memory to carry out the predetermined process; or the predetermined process may be carried out by hardware such as an ASIC, FPGA or DSP; or by a combination of the two. Terminals include, but are not limited to, servers, personal computers, notebook computers, tablet computers and smartphones.
Terminals include user terminals and network terminals. User terminals include, but are not limited to, computers, smartphones and PDAs. Network terminals include, but are not limited to, a single network server, a server group composed of multiple network servers, or a cloud based on cloud computing and composed of a large number of computers or network servers, where cloud computing is a form of distributed computing: a super virtual computer composed of a group of loosely coupled computers. A terminal can implement the invention by running on its own, or by accessing a network and interacting with other terminals in that network. The network in which the terminal is located includes, but is not limited to, the Internet, wide area networks, metropolitan area networks, local area networks and VPNs.
It should be noted that the user terminals, network terminals and networks above are only examples. Other terminals or networks, whether existing now or appearing in the future, are also included within the scope of protection of the invention where applicable to it, and are incorporated herein by reference.
The methods discussed below (some of which are illustrated by flowcharts) can be implemented in hardware, software, firmware, middleware, microcode, a hardware description language, or any combination thereof. When implemented in software, firmware, middleware or microcode, the program code or code segments that perform the necessary tasks can be stored in a machine-readable or computer-readable medium (such as a storage medium). One or more processors can perform the necessary tasks.
The specific structural and functional details disclosed herein are merely representative and serve to describe exemplary embodiments of the invention. The invention can, however, be embodied in many alternative forms and should not be construed as limited to the embodiments set forth herein.
It should be understood that, although the terms "first", "second" and so on may be used here to describe units, these units should not be limited by these terms. The terms are used only to distinguish one unit from another. For example, without departing from the scope of the exemplary embodiments, a first unit could be termed a second unit and, similarly, a second unit could be termed a first unit. The term "and/or" as used here includes any and all combinations of one or more of the associated listed items.
It should be understood that when a unit is referred to as being "connected" or "coupled" to another unit, it can be directly connected or coupled to the other unit, or intervening units may be present. By contrast, when a unit is referred to as being "directly connected" or "directly coupled" to another unit, no intervening units are present. Other words used to describe relationships between units should be interpreted in a like manner (for example, "between" versus "directly between", and "adjacent to" versus "directly adjacent to").
The terminology used here is for describing particular embodiments only and is not intended to limit the exemplary embodiments. Unless the context clearly indicates otherwise, the singular forms "a" and "an" as used here are intended to include the plural as well. It should further be understood that the terms "comprising" and/or "including", as used here, specify the presence of the stated features, integers, steps, operations, units and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, units, components and/or combinations thereof.
It should also be mentioned that, in some alternative implementations, the functions/actions noted may occur in an order different from that indicated in the drawings. For example, depending on the functions/actions involved, two figures shown in succession may in fact be executed substantially concurrently, or may sometimes be executed in the reverse order.
The invention is described in further detail below with reference to the drawings.
Fig. 1 is a flowchart of a method for identifying website types according to an embodiment of the invention.
Referring to Fig. 1, the method for identifying website types described in this embodiment includes the following steps:
S110: screening a predetermined database according to a predetermined keyword set to obtain a website set;
S120: filtering the website set according to predetermined features of each website in the website set;
S130: identifying the type of each website according to the features of the websites remaining after filtering.
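The three steps above can be read as a simple pipeline. The following is only a minimal sketch under stated assumptions: the record layout (a dict with "url" and "text" keys) and the callables screen, filters and classify are hypothetical placeholders for steps S110, S120 and S130, not names taken from the patent.

```python
def identify_types(database, screen, filters, classify):
    """Run S110 -> S120 -> S130 over site records and return {url: type}.

    database : iterable of dicts (hypothetical layout with a "url" key)
    screen   : callable(site) -> bool, keyword screening (S110)
    filters  : list of callable(site) -> bool, feature filters (S120)
    classify : callable(site) -> str, type identification (S130)
    """
    # S110: keep only the sites that match the keyword set.
    candidates = [site for site in database if screen(site)]
    # S120: keep only the sites that pass every predetermined-feature filter.
    for keep in filters:
        candidates = [site for site in candidates if keep(site)]
    # S130: identify the type of each remaining site.
    return {site["url"]: classify(site) for site in candidates}
```

Only the sites that survive screening and filtering reach the classifier, which is where the reduction in computation described in the summary comes from.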
Each step is described in further detail below.
In step S110, the predetermined keyword set may consist of predetermined resource features, which may be resource features provided by an information management department. For example, the resource features that the information management department provides for financial websites may be: insurance, loan, bank, stock, foreign exchange, financing, money transaction, transfer or payment. The predetermined keyword set may also be a website type feature set obtained by compiling statistics on users' operation behavior. For example, the feature set of a financial website may be: account, password, transfer operation, buy and sell operations, and so on. The predetermined keyword set may also be the intersection of the predetermined resource features and the website type feature set. The predetermined database is typically a database composed of the websites corresponding to Chinese web pages.
Referring to Fig. 2, step S110 (screening the predetermined database according to the predetermined keyword set to obtain a website set) may include the following steps:
S1101: analyzing the content of each website in the predetermined database with a natural language processing algorithm according to the predetermined keyword set.
Specifically, a natural language processing algorithm is used to find the websites whose web pages contain, in the navigation bar, title or content, words identical or similar to those in the predetermined keyword set.
S1102: selecting the websites whose degree of match with the predetermined keyword set is higher than a predetermined value to form the website set.
Specifically, the websites selected are those in which the number of words whose similarity to a word in the predetermined keyword set exceeds a first predetermined value is itself greater than a second predetermined value. The first and second predetermined values are set according to user or system requirements. For example, if the system requires a precise classification of websites, the first predetermined value may be 95% and the second predetermined value may be 50; if the user only needs a rough classification, the first predetermined value may be 80% and the second predetermined value may be 20.
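The two-threshold rule of S1102 can be sketched as follows. The patent does not name a similarity measure, so this sketch assumes the character-level ratio from Python's standard difflib module as a stand-in for the natural language processing step; the function names and the toy data are hypothetical.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Character-level similarity in [0, 1]; an assumed stand-in measure."""
    return SequenceMatcher(None, a, b).ratio()


def count_matching_words(site_words, keywords, first_value):
    """Count site words whose best similarity to any keyword exceeds first_value."""
    return sum(
        1
        for word in site_words
        if max((similarity(word, k) for k in keywords), default=0.0) > first_value
    )


def screen_sites(sites, keywords, first_value, second_value):
    """S1102: keep sites with more than second_value matching words.

    sites maps a site name to the list of words found on its pages.
    """
    return [
        name
        for name, words in sites.items()
        if count_matching_words(words, keywords, first_value) > second_value
    ]
```

With a first value of 0.95 a near match such as "loans" against "loan" is rejected, while with 0.80 it is accepted, which mirrors the precise and rough settings given in the example.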
Step S120 (filtering the website set according to predetermined features of each website in the website set) may include at least one of the following steps:
S1201: selecting the websites of a predetermined region according to the uniform resource locator (URL) of each website.
For example, if the URL of the first website is http://www.12306.cn, the URL of the second website is http://www.google.hk, and the URL of the third website is http://www.sipo.gov.cn, then the first website is a Chinese website, the second is a Hong Kong website, and the third is a Chinese government website. The first and third websites are therefore retained as websites in the China domain.
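The URL example can be reproduced with the standard library. This is a minimal sketch that assumes the predetermined region is recognized by the host name's suffix (here ".cn"); the function name in_region and the suffix rule are illustrative assumptions, not the patent's stated test.

```python
from urllib.parse import urlparse


def in_region(url, suffixes=(".cn",)):
    """Return True if the URL's host name ends with one of the region suffixes."""
    host = urlparse(url).hostname or ""
    return host.endswith(tuple(suffixes))


urls = [
    "http://www.12306.cn",
    "http://www.google.hk",
    "http://www.sipo.gov.cn",
]
# S1201: retain only the websites of the predetermined region.
kept = [u for u in urls if in_region(u)]
```

Here kept holds the first and third URLs, matching the example in the text.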
S1202: selecting, according to the log of each website, the websites whose visit count exceeds a first threshold.
For example, the log of the first website shows 1,000 users browsing that day, the log of the second website shows 5,000, and the log of the third website shows 100. If the first threshold is 500, the first and second websites are retained.
S1203: selecting, according to the log of each website, the websites whose transaction volume exceeds a second threshold.
For example, the log of the first website shows 1,000 users completing payments that day, the log of the second website shows 5,000 users buying stocks, and the log of the third website shows 100 users making transfers. If the second threshold is 500, the first and second websites are retained.
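Both log-based filters (S1202 and S1203) apply the same comparison to different counters. The sketch below assumes the per-site daily counts have already been extracted from the logs into dictionaries; the names and the data layout are hypothetical.

```python
def filter_by_threshold(daily_counts, threshold):
    """Keep the sites whose daily count is strictly greater than the threshold."""
    return [site for site, count in daily_counts.items() if count > threshold]


# Counts taken from the examples in the text.
visits = {"site1": 1000, "site2": 5000, "site3": 100}
transactions = {"site1": 1000, "site2": 5000, "site3": 100}

busy_sites = filter_by_threshold(visits, 500)           # S1202, first threshold
trading_sites = filter_by_threshold(transactions, 500)  # S1203, second threshold
```

In both cases the first and second sites are retained and the third is dropped.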
Optionally, each website remaining after filtering is indexed by its home page.
Referring to Fig. 3, step S130 (identifying the type of each website according to the features of the websites remaining after filtering) may include the following steps:
S1301: extracting the web page features of each website from the log of the website remaining after filtering.
For example, the log of a financial website may include, but is not limited to: links to third-party payment platforms, transfers, repayments, stock trading, display of wealth management information, or loans. If the log of a website includes transfers, repayments, loans and the like, the web page features extracted for the website may be balance change, money transaction, transfer, and so on.
S1302: determining the type of the website from the web page features and the predetermined resource features according to a predetermined classification standard.
Specifically, the classification standard may be that at least one of the predetermined resource features corresponds to one website category, where the individual features may carry equal or different proportions (weights) in determining the category. When the weights differ, the feature that distinguishes a given type of website from other, similar types is generally given the largest weight among the predetermined resource features. For example, the predetermined resource features include credit card, deposit, loan, transfer, balance, fund and/or wealth management; when identifying a financial banking website, the weights of these features in descending order are: credit card, deposit, loan, fund, wealth management, transfer, balance.
Further, the type of the website to which a web page belongs can be determined from the web page features, the predetermined resource features corresponding to each website type, and the weight of each feature within those predetermined resource features. For example, if the web page features include credit card, deposit, loan and transfer, the website can be determined to be a financial banking website from the predetermined resource features and the weights of credit card, deposit and loan among them. Optionally, the degree of match with each website type is determined by comparing the web page features with the predetermined resource features, and the type with the highest degree of match is selected as the type of the website to which the web page belongs.
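One way to read the weighted matching of S1302 is as a sum of per-feature weights, with the highest-scoring type selected. The sketch below is an assumption-laden illustration: the numeric weights are invented (only their descending order for the banking type follows the text), and the "securities" type and its weights are hypothetical.

```python
# Hypothetical weights per website type; only the ORDER of the banking
# features follows the text, the numbers themselves are invented.
TYPE_WEIGHTS = {
    "financial banking": {
        "credit card": 0.30,
        "deposit": 0.25,
        "loan": 0.20,
        "fund": 0.10,
        "wealth management": 0.08,
        "transfer": 0.05,
        "balance": 0.02,
    },
    "securities": {"stock": 0.50, "fund": 0.30, "transfer": 0.20},
}


def match_score(page_features, weights):
    """Degree of match: sum of the weights of the features present on the page."""
    return sum(weights.get(feature, 0.0) for feature in page_features)


def classify(page_features, type_weights=TYPE_WEIGHTS):
    """Select the website type with the highest degree of match."""
    return max(
        type_weights,
        key=lambda t: match_score(page_features, type_weights[t]),
    )
```

For the page features credit card, deposit, loan and transfer, the banking score under these invented weights is 0.80 against 0.20 for the hypothetical securities type, so the page is classified as financial banking.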
Referring to Fig. 4, this embodiment of the invention may further include the following steps:
S140: adding web page features to the predetermined keyword set.
Specifically, when the web page features contain a word that is not in the predetermined keyword set, that web page feature needs to be added to the keyword set.
S150: re-identifying the types of the websites in the database according to the expanded predetermined keyword set, until all web page features are words in the predetermined keyword set.
Specifically, if the web page features contain a word that is not in the predetermined keyword set, the set of identified websites is not yet comprehensive. The word is therefore added to the predetermined keyword set, and steps S110, S120 and S130 are repeated to re-identify the types of the websites in the database, which ensures both the comprehensiveness of the identified websites and the accuracy of the identified website types.
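Steps S140 and S150 describe a loop that runs until no new feature words appear. A minimal sketch follows, assuming a hypothetical callback recognize(keywords) that performs S110 to S130 and returns the identified types together with the extracted web page features. The max_rounds cap is a safeguard added for this sketch and does not come from the patent.

```python
def expand_until_stable(keywords, recognize, max_rounds=10):
    """Repeat recognition, adding unseen feature words, until none remain.

    recognize : callable(set_of_keywords) -> (types, page_features);
                a hypothetical stand-in for steps S110 to S130.
    max_rounds: assumed safety cap, not part of the described method.
    """
    keywords = set(keywords)
    types = {}
    for _ in range(max_rounds):
        types, page_features = recognize(keywords)
        new_words = set(page_features) - keywords  # S140: words not yet in the set
        if not new_words:
            break  # every feature is already a keyword, so S150 terminates
        keywords |= new_words
    return keywords, types
```

Each round can only add words, so the keyword set grows monotonically and the loop stops as soon as a round yields nothing new.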
This embodiment of the invention may further include: verifying the identified website type according to attribute information of the website.
Specifically, a voting mechanism based on the attribute information of the website is used to verify whether the identified website type is accurate, which further improves the accuracy of the identified type. The attribute information of a website may be its basic configuration information, for example its domain name or IP address.
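The text does not specify how the voting mechanism works. One plausible reading, shown here purely as an assumption, is a strict majority vote: each attribute-based check (for instance one based on the domain name and one based on the IP address) proposes a type, and the identified type is confirmed only if more than half of the votes agree with it. The function name and the vote format are hypothetical.

```python
from collections import Counter


def verify_by_vote(predicted, votes):
    """Confirm predicted only if a strict majority of votes agree with it.

    votes: list of type labels proposed by attribute-based checks
           (hypothetical; the patent does not define these checks).
    """
    if not votes:
        return False
    top_label, top_count = Counter(votes).most_common(1)[0]
    return top_label == predicted and top_count * 2 > len(votes)
```

A tie or an empty vote list leaves the type unconfirmed under this rule; other tie-breaking policies would be equally consistent with the text.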
Referring to Fig. 5, another method for identifying website types described in an embodiment of the invention, directed specifically at identifying the types of financial websites, includes:
S510: analyzing the content of each website in the predetermined database with a natural language processing algorithm according to the predetermined keyword set.
For example, the predetermined keyword set may include insurance, loan, bank, stock, foreign exchange, financing, money transaction, transfer, payment, account, password, balance, buy and sell.
S520: selecting the websites whose degree of match with the predetermined keyword set is higher than a predetermined value to form the website set.
For example, according to the keywords account, password and money transaction, the website set is formed from the websites corresponding to payment pages that contain an account, user name, password, amount of money, amount and transaction.
S530: selecting the websites of a predetermined region according to the URL of each website.
S540: indexing each website remaining after filtering by its home page.
S550: extracting the web page features of each website from the log of the website remaining after filtering.
For example, the log may record that A transfers 100 yuan to B, A buys 100 shares of a given stock, B takes out a loan of 10,000 yuan, and C repays 5,000 yuan on a credit card. The features extracted for the web pages of the website may then include transfer, stock, loan, credit card, repayment, and so on.
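The log example in S550 can be sketched as a simple term lookup. The patent does not say how features are extracted from logs, so the substring matching, the term list and the English log wording below are all assumptions made for illustration.

```python
# Hypothetical feature terms; "repay" stands for the repayment feature.
FEATURE_TERMS = ["transfer", "stock", "loan", "credit card", "repay"]


def extract_features(log_lines, terms=FEATURE_TERMS):
    """Return the terms that occur in at least one log line, in term order."""
    lowered = [line.lower() for line in log_lines]
    return [term for term in terms if any(term in line for line in lowered)]


log = [
    "A transfers 100 yuan to B",
    "A buys 100 shares of a given stock",
    "B takes out a loan of 10,000 yuan",
    "C repays 5,000 yuan on a credit card",
]
features = extract_features(log)
```

All five terms are found in this toy log, corresponding to the transfer, stock, loan, credit card and repayment features named in the text.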
S560: determining the type of the website from the web page features and the predetermined resource features according to a predetermined classification standard.
For example, from the web page features transfer, stock, loan, credit card and repayment, the website to which the web page belongs can be determined to be a banking website. From the web page features depreciation and mortgage, it can be determined to be a mortgage website.
Specifically, the types of financial websites may include, but are not limited to: insurance, loan, financial services, guarantee, mortgage, bank, charitable stock, precious metals trading platform, provident fund agency, securities, wealth management information display, wealth management, or payment platform.
S570: adding to the predetermined keyword set the web page features that it does not contain.
For example, depreciation, mortgage, credit card and repayment are added to the predetermined keyword set.
S580: repeating steps S510 to S560 with the expanded predetermined keyword set to re-identify the types of the websites in the database, until all web page features are words in the predetermined keyword set.
Referring to Fig. 6, the device for identifying website types described in this embodiment includes the following means:
A means (hereinafter "website screening unit") 110 for screening a predetermined database according to a predetermined keyword set to obtain a website set.
A means (hereinafter "website filtering unit") 120 for filtering the website set according to predetermined features of each website in the website set.
A means (hereinafter "type identification unit") 130 for identifying the type of each website according to the features of the websites remaining after filtering.
Each unit is described in further detail below.
In the website screening unit 110, the predetermined keyword set may consist of predetermined resource features, which may be resource features provided by an information management department. For example, the resource features that the information management department provides for financial websites may be: insurance, loan, bank, stock, foreign exchange, financing, money transaction, transfer or payment. The predetermined keyword set may also be a website type feature set obtained by compiling statistics on users' operation behavior. For example, the feature set of a financial website may be: account, password, transfer operation, buy and sell operations, and so on. The predetermined keyword set may also be the intersection of the predetermined resource features and the website type feature set. The predetermined database is typically a database composed of the websites corresponding to Chinese web pages.
Referring to Fig. 7, the website screening unit 110 may include the following means:
A means (hereinafter "keyword analysis subunit") 1101 for analyzing the content of each website in the predetermined database with a natural language processing algorithm according to the predetermined keyword set.
Specifically, a natural language processing algorithm is used to find the websites whose web pages contain, in the navigation bar, title or content, words identical or similar to those in the predetermined keyword set.
A means (hereinafter "website matching subunit") 1102 for selecting the websites whose degree of match with the predetermined keyword set is higher than a predetermined value to form the website set.
Specifically, the websites selected are those in which the number of words whose similarity to a word in the predetermined keyword set exceeds a first predetermined value is itself greater than a second predetermined value. The first and second predetermined values are set according to user or system requirements. For example, if the system requires a precise classification of websites, the first predetermined value may be 95% and the second predetermined value may be 50; if the user only needs a rough classification, the first predetermined value may be 80% and the second predetermined value may be 20.
Referring to Fig. 7, the website filtering unit 120 may include at least one of the following means:
A means (hereinafter "first filtering subunit") 1201 for selecting the websites of a predetermined region according to the URL of each website.
For example, if the URL of the first website is http://www.12306.cn, the URL of the second website is http://www.google.hk, and the URL of the third website is http://www.sipo.gov.cn, then the first website is a Chinese website, the second is a Hong Kong website, and the third is a Chinese government website. The first and third websites are therefore retained as websites in the China domain.
A means (hereinafter "second filtering subunit") 1202 for selecting, according to the log of each website, the websites whose visit count exceeds a first threshold.
For example, the log of the first website shows 1,000 users browsing that day, the log of the second website shows 5,000, and the log of the third website shows 100. If the first threshold is 500, the first and second websites are retained.
A means (hereinafter "third filtering subunit") 1203 for selecting, according to the log of each website, the websites whose transaction volume exceeds a second threshold.
For example, the log of the first website shows 1,000 users completing payments that day, the log of the second website shows 5,000 users buying stocks, and the log of the third website shows 100 users making transfers. If the second threshold is 500, the first and second websites are retained.
Optionally, each website remaining after filtering is indexed by its home page.
With reference to shown in Fig. 7, type identification unit 130 can include following device:
For according to the daily record of the website after filtration extract website web page characteristics device (hereinafter referred to as " web page characteristics extraction subelement ") 1301.
For example:The daily record of financial class website can be including but not limited to:Be linked to Third-party payment platform, Transfer accounts, refund, stock exchange, financing information show or provide a loan etc..If the daily record of website includes transferring accounts, Refund or provide a loan etc., then the web page characteristics that can extract website are remaining sum change, pecuniary exchange, transfer accounts Deng.
A device for determining the type of the website from the web page features and the predetermined resource features according to a predetermined classification standard (hereinafter the "website type determination subunit") 1302.
Specifically, the classification standard may map at least one feature among the predetermined resource features to one website category, where the features among the predetermined resource features may carry the same or different proportions (weights) in determining the website category. When the proportions differ, each website type is typically configured so that the feature distinguishing it from other similar website types carries the largest proportion among the predetermined resource features. For example, the predetermined resource features include credit card, deposit, loan, transfer, balance, fund, and/or wealth management; in determining a financial banking website, the proportions of these features, from largest to smallest, are: credit card, deposit, loan, fund, wealth management, transfer, balance.
Referring to Fig. 7, the website type determination subunit 1302 may include:
A device for determining the type of the website on which the web page is located according to the web page features, the predetermined resource features corresponding to each website type, and the proportion of each feature among the predetermined resource features corresponding to each website type (hereinafter the "determining website type subunit") 13021.
For example, if the web page features include credit card, deposit, loan, and transfer, the type of the website can be determined to be financial banking according to the predetermined resource features and the proportions of credit card, deposit, and loan among those resource features. Optionally, the degree of matching with each website type is determined from the result of comparing the web page features with the predetermined resource features, and the website type with the highest matching degree is selected as the type of the website on which the web page is located.
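The optional matching-degree selection can be sketched in Python as a weighted overlap score. This is only one plausible reading: the description gives the ordering of feature proportions for the banking type but no numeric values, so the numbers in `TYPE_WEIGHTS`, the second type "securities", and the function names are all assumptions.

```python
# Hypothetical weights. Only the ordering for "financial banking" follows the
# description; the numeric values and the "securities" type are assumptions.
TYPE_WEIGHTS = {
    "financial banking": {
        "credit card": 0.25, "deposit": 0.20, "loan": 0.18, "fund": 0.14,
        "wealth management": 0.10, "transfer": 0.08, "balance": 0.05,
    },
    "securities": {
        "stock trading": 0.5, "fund": 0.3, "wealth management": 0.2,
    },
}


def matching_degree(page_features, weights):
    """Sum the weights of a type's resource features found among the page features."""
    return sum(w for feature, w in weights.items() if feature in page_features)


def determine_website_type(page_features, type_weights=TYPE_WEIGHTS):
    """Return (best_type, score), or (None, 0.0) when no type matches at all."""
    scores = {t: matching_degree(page_features, w) for t, w in type_weights.items()}
    best = max(scores, key=scores.get)
    if scores[best] > 0:
        return best, scores[best]
    return None, 0.0
```

With the web page features from the example (credit card, deposit, loan, transfer), the banking type collects four of its weighted features while the assumed securities type collects none, so the banking type is selected.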
Referring to Fig. 7, the embodiment of the present invention may further include the following devices:
A device for adding the web page features to the predetermined keyword set (hereinafter the "adding unit") 140.
Specifically, when the web page features contain a word that is not in the predetermined keyword set, that web page feature needs to be added to the keyword set.
A device for re-recognizing the types of the websites in the database according to the predetermined keyword set after the addition, until the web page features are all words in the predetermined keyword set (hereinafter the "re-recognition unit") 150.
Specifically, once the web page features contain a word that is not in the predetermined keyword set, the set of recognized websites is not comprehensive enough. The word therefore needs to be added to the predetermined keyword set, and the web page screening unit 110, the website determining unit 120, and the type identification unit 130 are applied again to re-recognize the types of the websites in the database, so as to ensure both the comprehensiveness of the recognized websites and the accuracy of the recognized website types.
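The add-and-re-recognize cycle amounts to iterating until the keyword set stops growing. The Python sketch below shows that loop under stated assumptions: `recognize` is a hypothetical callable standing in for units 110, 120, and 130 together, and the `max_rounds` safety limit is not part of the description.

```python
def recognize_until_stable(keywords, recognize, max_rounds=10):
    """Repeat recognition, growing the keyword set, until no new feature appears.

    `recognize` is a hypothetical callable standing in for units 110, 120 and
    130: it takes the keyword set and returns (site_types, web_page_features).
    `max_rounds` is an assumed safety limit, not part of the description.
    """
    keywords = set(keywords)
    site_types = {}
    for _ in range(max_rounds):
        site_types, features = recognize(keywords)
        new_words = set(features) - keywords
        if not new_words:
            # Every extracted feature is already a keyword, so stop.
            break
        # Role of the adding unit 140: extend the keyword set.
        keywords |= new_words
    return site_types, keywords
```

The loop ends exactly when the stated stopping condition holds, that is, when every web page feature is already a word in the keyword set.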
Referring to Fig. 7, the embodiment of the present invention may further include:
A device for verifying the recognized website type according to the attribute information of the website (hereinafter the "verification unit") 160.
Specifically, a voting mechanism is applied to the attribute information of the website to verify whether the recognized website type is accurate, which further improves the accuracy of the recognized website types. The attribute information of a website may be its basic configuration information, for example its domain name or IP address.
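The description does not say how the votes are produced or counted, so the following Python sketch is only one possible form of the voting check. It assumes each attribute (for example the domain name or the IP address) has already been turned into a suggested website type by some lookup not shown here, and it assumes a strict-majority rule; the function and parameter names are hypothetical.

```python
def verify_by_vote(recognized_type, attribute_votes):
    """Return True if the recognized type gets a strict majority of the votes.

    `attribute_votes` is a hypothetical list holding one suggested website type
    per attribute check (e.g. domain name, IP address); None means abstain.
    The strict-majority rule is an assumption, not taken from the description.
    """
    votes = [v for v in attribute_votes if v is not None]
    if not votes:
        return False
    return votes.count(recognized_type) * 2 > len(votes)
```

A tie or an all-abstain result is treated as a failed verification here, which is a conservative choice rather than anything the description prescribes.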
The method and device for recognizing website types described in the embodiments of the present invention screen the websites in a predetermined database using predetermined resource features that are provided by an information management department and/or obtained from statistics on user operation behavior, filter the screened websites according to predetermined features of each website, and then extract features from the filtered websites to determine their types, which improves the accuracy and stability of website type recognition. In addition, the extracted features are added to the keywords and the types of the websites in the database are re-recognized, which ensures the comprehensiveness of the recognized websites and further improves accuracy.
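The screening and filtering stages mentioned above (keyword matching against a predetermined value, then filtering by URL region, visit volume, and trading volume, as recited in the claims) can be sketched in Python as follows. Several points are assumptions: the fraction-of-keywords score is a stand-in for the unspecified natural language processing algorithm, "filtering out" is read as retaining the matching websites, the region is approximated by a hostname suffix, and all three filters are applied together although the claims require only at least one. The `Site` record and all function names are hypothetical.

```python
from dataclasses import dataclass
from urllib.parse import urlparse


@dataclass
class Site:
    """Hypothetical record for one website in the predetermined database."""
    url: str
    content: str
    visits: int = 0
    transactions: int = 0


def keyword_match_degree(content, keywords):
    """Fraction of keywords occurring in the content (stand-in for NLP analysis)."""
    if not keywords:
        return 0.0
    text = content.lower()
    hits = sum(1 for k in keywords if k.lower() in text)
    return hits / len(keywords)


def screen(sites, keywords, min_degree):
    """Keep the sites whose matching degree is higher than the predetermined value."""
    return [s for s in sites if keyword_match_degree(s.content, keywords) > min_degree]


def filter_sites(sites, region_suffix, min_visits, min_transactions):
    """Keep sites in the region with visits and transactions above the thresholds."""
    kept = []
    for s in sites:
        host = urlparse(s.url).hostname or ""
        if (host.endswith(region_suffix)
                and s.visits > min_visits
                and s.transactions > min_transactions):
            kept.append(s)
    return kept
```

Screening first and filtering second follows the order given in the description; the thresholds correspond to the predetermined value, the first threshold, and the second threshold.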
It should be noted that the present invention may be implemented in software and/or a combination of software and hardware; for example, each device of the invention may be implemented using an application-specific integrated circuit (ASIC) or any other similar hardware device. In one embodiment, the software program of the invention may be executed by a processor to implement the steps or functions described above. Likewise, the software program of the invention (including related data structures) may be stored on a computer-readable recording medium, for example RAM, a magnetic or optical drive, a floppy disk, or a similar device. In addition, some steps or functions of the invention may be implemented in hardware, for example as a circuit that cooperates with a processor to perform each step or function.
It is obvious to those skilled in the art that the invention is not limited to the details of the above exemplary embodiments, and that the invention can be realized in other specific forms without departing from its spirit or essential characteristics. The embodiments should therefore be regarded in all respects as exemplary and non-restrictive; the scope of the invention is defined by the appended claims rather than by the above description, and all changes that fall within the meaning and range of equivalency of the claims are intended to be embraced in the invention. Any reference sign in a claim should not be construed as limiting the claim concerned. Furthermore, the word "comprising" does not exclude other units or steps, and the singular does not exclude the plural. Multiple units or devices recited in a system claim may also be implemented by a single unit or device through software or hardware. Words such as "first" and "second" are used to denote names and do not indicate any particular order.
Although exemplary embodiments have been specifically shown and described above, those skilled in the art will understand that changes in form and detail may be made without departing from the spirit and scope of the claims. The protection sought here is set forth in the appended claims.

Claims (16)

1. A method for recognizing the type of a website, comprising:
screening a predetermined database according to a predetermined keyword set to obtain a website set;
filtering the website set according to predetermined features of each website in the website set;
recognizing the type of a website according to the features of the filtered websites.
2. The recognition method according to claim 1, wherein the step of recognizing the type of a website according to the features of the filtered websites comprises:
extracting web page features of the website from the log of the website;
determining the type of the website from the web page features and predetermined resource features according to a predetermined classification standard.
3. The recognition method according to claim 2, wherein the step of determining the type of the website from the web page features and predetermined resource features according to a predetermined classification standard comprises:
determining the type of the website on which the web page is located according to the web page features, the predetermined resource features corresponding to each website type, and the proportion of each feature among the predetermined resource features corresponding to each website type.
4. The recognition method according to claim 2, further comprising:
adding the web page features to the predetermined keyword set.
5. The recognition method according to claim 4, further comprising:
re-recognizing the types of the websites in the database according to the predetermined keyword set after the addition, until the web page features are words in the predetermined keyword set.
6. The recognition method according to claim 1, wherein the step of screening a predetermined database according to a predetermined keyword set to obtain a website set comprises:
analyzing the content of each website in the predetermined database with a natural language processing algorithm according to the predetermined keyword set;
filtering out the websites whose degree of matching with the predetermined keyword set is higher than a predetermined value to form the website set.
7. The recognition method according to claim 1, wherein the step of filtering the website set according to predetermined features of each website in the website set comprises at least one of:
filtering out websites of a predetermined region according to the URL of each website;
filtering out websites whose visit volume is greater than a first threshold according to the log of each website;
filtering out websites whose trading volume is greater than a second threshold according to the log of each website.
8. The recognition method according to any one of claims 1 to 7, further comprising:
verifying the recognized website type according to attribute information of the website.
9. A device for recognizing the type of a website, comprising:
a device for screening a predetermined database according to a predetermined keyword set to obtain a website set;
a device for filtering the website set according to predetermined features of each website in the website set;
a device for recognizing the type of a website according to the features of the filtered websites.
10. The recognition device according to claim 9, wherein the device for recognizing the type of a website according to the features of the filtered websites comprises:
a device for extracting web page features of the website from the log of the website;
a device for determining the type of the website from the web page features and predetermined resource features according to a predetermined classification standard.
11. The recognition device according to claim 10, wherein the device for determining the type of the website from the web page features and predetermined resource features according to a predetermined classification standard comprises:
a device for determining the type of the website on which the web page is located according to the web page features, the predetermined resource features corresponding to each website type, and the proportion of each feature among the predetermined resource features corresponding to each website type.
12. The recognition device according to claim 10, further comprising:
a device for adding the web page features to the predetermined keyword set.
13. The recognition device according to claim 12, further comprising:
a device for re-recognizing the types of the websites in the database according to the predetermined keyword set after the addition, until the web page features are words in the predetermined keyword set.
14. The recognition device according to claim 9, wherein the device for screening a predetermined database according to a predetermined keyword set to obtain a website set comprises:
a device for analyzing the content of each website in the predetermined database with a natural language processing algorithm according to the predetermined keyword set;
a device for filtering out the websites whose degree of matching with the predetermined keyword set is higher than a predetermined value to form the website set.
15. The recognition device according to claim 9, wherein the device for filtering the website set according to predetermined features of each website in the website set comprises at least one of:
a device for filtering out websites of a predetermined region according to the URL of each website;
a device for filtering out websites whose visit volume is greater than a first threshold according to the log of each website;
a device for filtering out websites whose trading volume is greater than a second threshold according to the log of each website.
16. The recognition device according to any one of claims 9 to 15, further comprising:
a device for verifying the recognized website type according to attribute information of the website.
CN201510634837.6A 2015-09-29 2015-09-29 The recognition methods of the Type of website and device Pending CN106557520A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510634837.6A CN106557520A (en) 2015-09-29 2015-09-29 The recognition methods of the Type of website and device

Publications (1)

Publication Number Publication Date
CN106557520A true CN106557520A (en) 2017-04-05

Family

ID=58417170

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510634837.6A Pending CN106557520A (en) 2015-09-29 2015-09-29 The recognition methods of the Type of website and device

Country Status (1)

Country Link
CN (1) CN106557520A (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103049456A (en) * 2011-10-14 2013-04-17 腾讯科技(深圳)有限公司 Method and device for screening web pages
CN103235824A (en) * 2013-05-06 2013-08-07 上海河广信息科技有限公司 Method and system for determining web page texts users interested in according to browsed web pages
CN104239340A (en) * 2013-06-19 2014-12-24 北京搜狗信息服务有限公司 Search result screening method and search result screening device
CN104750754A (en) * 2013-12-31 2015-07-01 北龙中网(北京)科技有限责任公司 Website industry classification method and server

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20170405