CN106557520A - Method and device for identifying the type of a website
- Publication number: CN106557520A
- Application number: CN201510634837.6A
- Authority: CN (China)
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/953—Querying, e.g. by the use of web search engines
- G06F16/9535—Search customisation based on user profiles and personalisation
- G06F16/958—Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking
Abstract
The invention provides a method and a device for identifying the type of a website. The method includes: screening a predetermined database according to a predetermined keyword set to obtain a website set; filtering the website set according to predetermined features of each website in the website set; and identifying the type of each website according to the features of the websites remaining after filtering. Because the type is identified only after the websites in the predetermined database have been screened and filtered, the method reduces the amount of computation and improves both the efficiency and the accuracy of website type identification.
Description
Technical field
The present invention relates to the computer field, and in particular to a method and a device for identifying the type of a website.
Background
With the rapid development of the Internet, the number of websites keeps growing, and their styles and forms vary widely. A user can judge the security and operability of a website from its type. At present, website types are usually identified by crawling the content of all web pages stored in a database and then analysing that content to determine the type of the website each web page belongs to. This requires a large amount of storage space and computation, which wastes resources.
Summary of the invention
An object of the present invention is to provide a scheme for identifying the type of a website.
According to one aspect of the invention, a method for identifying the type of a website is provided, including:
screening a predetermined database according to a predetermined keyword set to obtain a website set;
filtering the website set according to predetermined features of each website in the website set;
identifying the type of a website according to the features of the website after filtering.
According to another aspect of the invention, a device for identifying the type of a website is provided, including:
a device for screening a predetermined database according to a predetermined keyword set to obtain a website set;
a device for filtering the website set according to predetermined features of each website in the website set;
a device for identifying the type of a website according to the features of the website after filtering.
Because the method and device of this embodiment identify the website type only after the websites in the predetermined database have been screened and filtered, they reduce the amount of computation and improve both the efficiency and the accuracy of website type identification.
Brief description of the drawings
Other features, objects and advantages of the invention will become more apparent upon reading the following detailed description of non-limiting embodiments, made with reference to the accompanying drawings:
Fig. 1 shows a flow chart of a method for identifying the type of a website according to an embodiment of the invention.
Fig. 2 shows a detailed flow chart of step S110 in a method for identifying the type of a website according to an embodiment of the invention.
Fig. 3 shows a detailed flow chart of step S130 in a method for identifying the type of a website according to an embodiment of the invention.
Fig. 4 shows a flow chart of further steps included in a method for identifying the type of a website according to an embodiment of the invention.
Fig. 5 shows a flow chart of another method for identifying the type of a website according to an embodiment of the invention.
Fig. 6 shows a structural block diagram of a device for identifying the type of a website according to an embodiment of the invention.
Fig. 7 shows a structural block diagram of another device for identifying the type of a website according to an embodiment of the invention.
In the drawings, a "-" beneath a word denotes an embedded link, and identical or similar reference numerals denote identical or similar parts.
Detailed description
Those of ordinary skill in the art will appreciate that, although the following detailed description refers to illustrated embodiments and drawings, the invention is not limited to these embodiments. Rather, the scope of the invention is broad and is intended to be limited only by the appended claims.
Before the exemplary embodiments are discussed in more detail, it should be mentioned that some exemplary embodiments are described as processes or methods depicted as flow charts. Although a flow chart describes the operations as a sequential process, many of the operations can be performed in parallel, concurrently or simultaneously. In addition, the order of the operations can be rearranged. A process may be terminated when its operations are completed, but it may also have additional steps not included in the drawings. A process may correspond to a method, function, procedure, subroutine, subprogram and so on.
The "terminal" referred to in this context, also called a "computer", is an intelligent electronic terminal that can carry out predetermined processing such as numerical and/or logical computation by running predetermined programs or instructions. It may include a processor and a memory, with the processor executing persistent instructions pre-stored in the memory to carry out the predetermined processing; alternatively, the predetermined processing may be carried out by hardware such as an ASIC, FPGA or DSP, or by a combination of the two. Terminals include, but are not limited to, servers, personal computers, notebook computers, tablet computers and smartphones.
Terminals include user terminals and network terminals. User terminals include, but are not limited to, computers, smartphones and PDAs. Network terminals include, but are not limited to, a single network server, a server group made up of multiple network servers, or a cloud made up of a large number of computers or network servers based on cloud computing, where cloud computing is a kind of distributed computing: a super virtual computer made up of a group of loosely coupled computers. A terminal can implement the invention by running alone, or by accessing a network and interacting with other terminals in that network. The network in which a terminal resides includes, but is not limited to, the Internet, wide area networks, metropolitan area networks, local area networks and VPNs.
It should be noted that the user terminals, network terminals and networks above are only examples. Other existing or future terminals or networks, where applicable to the invention, also fall within its scope of protection and are incorporated herein by reference.
The methods discussed below (some of which are illustrated by flow charts) can be implemented by hardware, software, firmware, middleware, microcode, a hardware description language or any combination thereof. When they are implemented in software, firmware, middleware or microcode, the program code or code segments that perform the necessary tasks can be stored in a machine-readable or computer-readable medium (such as a storage medium). One or more processors can perform the necessary tasks.
The specific structural and functional details disclosed herein are merely representative and serve to describe exemplary embodiments of the invention. The invention can, however, be embodied in many alternative forms and should not be construed as limited to the embodiments set forth herein.
It should be understood that, although the terms "first", "second" and so on may be used here to describe units, these units should not be limited by these terms. The terms are used only to distinguish one unit from another. For example, without departing from the scope of the exemplary embodiments, a first unit could be called a second unit, and similarly a second unit could be called a first unit. The term "and/or" as used here includes any and all combinations of one or more of the associated listed items.
It should be understood that when a unit is referred to as being "connected" or "coupled" to another unit, it can be directly connected or coupled to the other unit, or intervening units may be present. In contrast, when a unit is referred to as being "directly connected" or "directly coupled" to another unit, no intervening units are present. Other words used to describe the relationship between units should be interpreted in a like manner (for example, "between" versus "directly between", and "adjacent to" versus "directly adjacent to").
The terminology used here serves only to describe specific embodiments and is not intended to limit the exemplary embodiments. Unless the context clearly indicates otherwise, the singular forms "a" and "an" as used here are intended to include the plural as well. It should further be understood that the terms "comprising" and/or "including" as used here specify the presence of the stated features, integers, steps, operations, units and/or components, and do not preclude the presence or addition of one or more other features, integers, steps, operations, units, components and/or combinations thereof.
It should also be mentioned that, in some alternative implementations, the functions or actions mentioned may occur in an order different from that indicated in the drawings. For example, depending on the functions or actions involved, two figures shown in succession may in fact be executed substantially simultaneously, or may sometimes be executed in the reverse order.
The invention is described in further detail below with reference to the drawings.
Fig. 1 is a flow chart of a method for identifying the type of a website according to an embodiment of the invention.
As shown in Fig. 1, the method for identifying the type of a website described in this embodiment includes the following steps:
S110: screen a predetermined database according to a predetermined keyword set to obtain a website set;
S120: filter the website set according to predetermined features of each website in the website set;
S130: identify the type of each website according to the features of the websites after filtering.
Each step is described in further detail below.
In step S110, the predetermined keyword set may be a set of predetermined resource features, which may be resource features provided by an information management department. For example, the resource features that the information management department provides for financial websites may be: insurance, loan, bank, stock, foreign exchange, financing, money transaction, transfer or payment. The predetermined keyword set may also be a website type feature set obtained by compiling statistics on users' operation behaviour. For example, the feature set for financial websites may be: account, password, transfer operation, or buy and sell operations. The predetermined keyword set may also be the intersection of the predetermined resource features and the website type feature set. The predetermined database is generally a database made up of the websites corresponding to Chinese web pages.
As shown in Fig. 2, step S110 (screening the predetermined database according to the predetermined keyword set to obtain a website set) may include the following steps:
S1101: analyse the content of each website in the predetermined database according to the predetermined keyword set, using a natural language processing algorithm.
Specifically, a natural language processing algorithm is used to find the websites whose web pages contain, in the navigation bar, titles and content, words identical or similar to those in the predetermined keyword set.
S1102: select the websites whose degree of match with the predetermined keyword set is higher than a predetermined value to form the website set.
Specifically, a website is selected if the number of its words whose similarity to words in the predetermined keyword set exceeds a first predetermined value is itself greater than a second predetermined value. The first and second predetermined values are set according to user or system requirements. For example, if the system requires a precise classification of websites, the first predetermined value may be 95% and the second predetermined value may be 50; if the user only requires a rough view of how websites are classified, the first predetermined value may be 80% and the second predetermined value may be 20.
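The two-threshold screening of S1101 and S1102 can be sketched as follows. This is a minimal illustration under stated assumptions, not the patent's implementation: the source does not name the natural language processing algorithm, so `difflib.SequenceMatcher` from the Python standard library stands in as an assumed similarity measure, and the names `screen_websites`, `sim_threshold` (the first predetermined value) and `count_threshold` (the second predetermined value) are hypothetical.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Assumed stand-in for the unspecified NLP similarity measure (0.0 to 1.0)."""
    return SequenceMatcher(None, a, b).ratio()


def count_matching_words(words, keywords, sim_threshold):
    """Count words whose best similarity to any keyword exceeds sim_threshold."""
    return sum(
        1
        for w in words
        if max((similarity(w, k) for k in keywords), default=0.0) > sim_threshold
    )


def screen_websites(sites, keywords, sim_threshold, count_threshold):
    """Keep sites with more than count_threshold words matching the keyword set.

    `sites` maps a site identifier to the words taken from its navigation
    bar, titles and content (tokenization is assumed to happen upstream).
    """
    return {
        site
        for site, words in sites.items()
        if count_matching_words(words, keywords, sim_threshold) > count_threshold
    }
```

With the example values from the text, a precise classification would call `screen_websites(sites, keywords, 0.95, 50)` and a rough one `screen_websites(sites, keywords, 0.80, 20)`.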
Step S120 (filtering the website set according to predetermined features of each website in the website set) may include at least one of the following steps:
S1201: select the websites of a predetermined region according to the uniform resource locator (URL) of each website.
For example, if the URL of the first website is http://www.12306.cn, the URL of the second website is http://www.google.hk and the URL of the third website is http://www.sipo.gov.cn, then the first website is a Chinese website, the second is a Hong Kong website and the third is a Chinese government website, so the first and third websites are selected as websites in the China domain.
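The region filter of S1201 can be sketched with the standard library's `urllib.parse`. The sketch assumes that the region is read from the host name's trailing domain suffix (".cn" in the example above); the source does not say how the region is derived from the URL, and the function names are hypothetical.

```python
from urllib.parse import urlparse


def in_region(url: str, region_suffix: str = ".cn") -> bool:
    """Return True if the URL's host name ends with the given region suffix."""
    host = urlparse(url).hostname or ""
    return host.endswith(region_suffix)


def filter_by_region(urls, region_suffix: str = ".cn"):
    """Keep only the URLs whose host belongs to the predetermined region."""
    return [u for u in urls if in_region(u, region_suffix)]
```

Applied to the three URLs in the example, `filter_by_region` keeps http://www.12306.cn and http://www.sipo.gov.cn and drops http://www.google.hk.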
S1202: select, according to the log of each website, the websites whose visit volume exceeds a first threshold.
For example, if the log of the first website shows that 1000 users browsed it that day, the log of the second website shows 5000 users and the log of the third website shows 100 users, and the first threshold is 500, then the first and second websites are selected.
S1203: select, according to the log of each website, the websites whose transaction volume exceeds a second threshold.
For example, if the log of the first website shows that 1000 users completed payments that day, the log of the second website shows that 5000 users bought stocks and the log of the third website shows that 100 users made transfers, and the second threshold is 500, then the first and second websites are selected.
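The threshold filters of S1202 and S1203 share one shape: count per-site activity from the log, then keep the sites above the threshold. A minimal sketch follows, assuming the log has already been reduced to `(site, user_id)` pairs for one day; that log format and both function names are assumptions, since the source does not describe the log layout.

```python
from collections import defaultdict


def daily_user_counts(log_entries):
    """Count distinct users per site from (site, user_id) log entries."""
    users = defaultdict(set)
    for site, user_id in log_entries:
        users[site].add(user_id)
    return {site: len(ids) for site, ids in users.items()}


def filter_by_threshold(counts, threshold):
    """Keep the sites whose count is strictly greater than the threshold."""
    return [site for site, n in counts.items() if n > threshold]
```

The same `filter_by_threshold` call serves both steps: pass visit counts with the first threshold for S1202, or transaction counts with the second threshold for S1203.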
Optionally, each website remaining after filtering uses its home page as the index of the website.
As shown in Fig. 3, step S130 (identifying the type of a website according to the features of the website after filtering) may include the following steps:
S1301: extract the web page features of a website according to the log of the website after filtering.
For example, the log of a financial website may include, but is not limited to: links to third-party payment platforms, transfers, repayments, stock trades, financing information display or loans. If the log of the website includes transfers, repayments or loans, the web page features extracted for the website may be balance change, money transaction, transfer and so on.
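The feature extraction of S1301 can be pictured as a lookup from log event types to web page features. The sketch below is illustrative only: the mapping `EVENT_TO_FEATURES` is a hypothetical table built from the example in the text, and the source does not state how features are actually derived from logs.

```python
# Hypothetical mapping from log event types to web page features,
# modelled on the transfer/repayment/loan example in the text.
EVENT_TO_FEATURES = {
    "transfer": {"transfer", "balance change", "money transaction"},
    "repayment": {"balance change", "money transaction"},
    "loan": {"balance change", "money transaction"},
}


def extract_features(log_events):
    """Collect the web page features implied by the event types in a site's log."""
    features = set()
    for event in log_events:
        # Event types missing from the table contribute no features.
        features |= EVENT_TO_FEATURES.get(event, set())
    return features
```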
S1302: determine the type of the website from the web page features and the predetermined resource features, according to a predetermined classification standard.
Specifically, the classification standard may be that at least one feature among the predetermined resource features corresponds to one website category, where the features among the predetermined resource features may carry the same or different proportions when the website category is determined. When the proportions differ, the feature that distinguishes a type of website from other, similar types is usually given the largest proportion among the predetermined resource features. For example, the predetermined resource features include credit card, deposit, loan, transfer, account balance, fund and/or wealth management; when a financial banking website is being determined, their proportions rank from largest to smallest as: credit card, deposit, loan, fund, wealth management, transfer, account balance.
Further, the type of the website to which a web page belongs can be determined from the web page features, the predetermined resource features corresponding to each type of website, and the proportion of each feature within the predetermined resource features corresponding to each type. For example, if the web page features include credit card, deposit, loan and transfer, the type of the website can be determined to be financial banking according to the predetermined resource features and the proportions of credit card, deposit and loan among them. Optionally, the degree of match with each type of website is determined by comparing the web page features with the predetermined resource features, and the type with the highest degree of match is selected as the type of the website to which the web page belongs.
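One way to read S1302 is as a weighted match: each website type assigns a proportion to each of its resource features, a page's matching degree for a type is the sum of the proportions of the features it exhibits, and the type with the highest matching degree wins. The sketch below follows that reading. The summation rule, the function names and any numeric proportions passed in are assumptions; the source gives only the ranking of features, not their values or how they are combined.

```python
def match_score(page_features, type_weights):
    """Sum the proportions of the type's resource features found on the page."""
    return sum(w for feature, w in type_weights.items() if feature in page_features)


def classify(page_features, standards):
    """Return the website type with the highest matching degree, or None.

    `standards` maps a type name to {resource feature: proportion}.
    """
    scores = {t: match_score(page_features, w) for t, w in standards.items()}
    best = max(scores, key=scores.get, default=None)
    return best if best is not None and scores[best] > 0 else None
```

For instance, with illustrative proportions of 0.25, 0.20, 0.18, 0.13, 0.10, 0.08 and 0.06 for credit card, deposit, loan, fund, wealth management, transfer and account balance under a "financial banking" type, a page showing credit card, deposit, loan and transfer would be matched to that type ahead of a type that weights only transfer and payment.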
As shown in Fig. 4, the embodiment of the invention may further include the following steps:
S140: add the web page features to the predetermined keyword set.
Specifically, when the web page features contain a word that is not in the predetermined keyword set, that web page feature needs to be added to the keyword set.
S150: re-identify the types of the websites in the database according to the predetermined keyword set after the addition, until all web page features are words in the predetermined keyword set.
Specifically, if the web page features contain a word that is not in the predetermined keyword set, the identified websites are not comprehensive enough. The word therefore needs to be added to the predetermined keyword set, and steps S110, S120 and S130 repeated to re-identify the types of the websites in the database, so as to ensure both the comprehensiveness of the identified websites and the accuracy of the identified website types.
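Steps S140 and S150 amount to iterating until the keyword set stops growing. A minimal sketch follows, assuming a hypothetical callable `recognize(keywords)` that runs S110 to S130 and returns the identified types together with the extracted web page features. The `max_rounds` bound is a safety limit added for this sketch and is not part of the source.

```python
def recognize_until_stable(keywords, recognize, max_rounds=10):
    """Repeat recognition until no web page feature falls outside the keyword set.

    `recognize(keywords)` is a hypothetical callable that runs S110 to S130 and
    returns (site_types, page_features).
    """
    keywords = set(keywords)
    site_types = {}
    for _ in range(max_rounds):
        site_types, page_features = recognize(keywords)
        new_words = set(page_features) - keywords
        if not new_words:
            break  # every feature is already a keyword: S150 terminates
        keywords |= new_words  # S140: add the missing features
    return site_types, keywords
```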
The embodiment of the invention may further include: verifying the identified website type according to attribute information of the website.
Specifically, a voting mechanism is applied to the attribute information of the website to verify whether the identified website type is accurate, which further improves the accuracy of the identified website type. The attribute information of a website may be its basic configuration information, for example its domain name or IP address.
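The source does not specify the voting rule, so the following is only one possible reading: each attribute-based check (for example one based on the domain name and one based on the IP address) proposes a type, and the identified type is accepted when it receives a strict majority of those votes. The function name and the strict-majority rule are assumptions.

```python
from collections import Counter


def verify_by_vote(identified_type, votes):
    """Return True if identified_type receives a strict majority of the votes.

    `votes` is a list of type labels, one per attribute-based check
    (for example, one derived from the domain name and one from the IP address).
    """
    if not votes:
        return False
    return Counter(votes)[identified_type] * 2 > len(votes)
```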
As shown in Fig. 5, another method for identifying the type of a website described in the embodiment of the invention, directed specifically at identifying types of financial websites, includes:
S510: analyse the content of each website in the predetermined database according to the predetermined keyword set, using a natural language processing algorithm.
For example, the predetermined keyword set may include insurance, loan, bank, stock, foreign exchange, financing, money transaction, transfer, payment, account, password, balance, buy and sell.
S520: select the websites whose degree of match with the predetermined keyword set is higher than a predetermined value to form the website set.
For example, according to the keywords account, password and money transaction in the predetermined keyword set, the websites whose pages contain an account, user name, password, amount of money, quantity and transaction payment are selected to form the website set.
S530: select the websites of a predetermined region according to the URL of each website.
S540: use the home page of each website remaining after filtering as the index of that website.
S550: extract the web page features of the website according to the log of the website after filtering.
For example, if the log includes A transferring 100 yuan to B, A buying 100 shares of a given stock, B taking out a loan of 10,000 yuan and C repaying 5,000 yuan on a credit card, the features extracted for the website's web pages may include transfer, stock, loan, credit card, repayment and so on.
S560: determine the type of the website from the web page features and the predetermined resource features, according to the predetermined classification standard.
For example, from the web page features transfer, stock, loan, credit card and repayment, the website corresponding to the web page can be determined to be of the bank type. From the web page features depreciation and mortgage, the website corresponding to the web page can be determined to be of the mortgage type.
Specifically, the types of financial websites may include, but are not limited to: insurance, loan, financial services, guarantee, mortgage, bank, charitable stock, precious-metal trading platform, provident fund agency, securities, financing information display, financing or payment platform.
S570: add the web page features that are not in the predetermined keyword set to the predetermined keyword set.
For example, depreciation, mortgage, credit card and repayment are added to the predetermined keyword set.
S580: repeat steps S510 to S560 according to the predetermined keyword set after the addition to re-identify the types of the websites in the database, until all web page features are words in the predetermined keyword set.
As shown in Fig. 6, the device for identifying the type of a website described in this embodiment includes the following devices:
a device for screening a predetermined database according to a predetermined keyword set to obtain a website set (hereinafter "website screening unit") 110;
a device for filtering the website set according to predetermined features of each website in the website set (hereinafter "website filtering unit") 120;
a device for identifying the type of a website according to the features of the website after filtering (hereinafter "type identification unit") 130.
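The three units above form a simple chain in which each unit consumes the output of the previous one. A minimal structural sketch follows. The class name `WebsiteTypeIdentifier`, its field names and the callable signatures are hypothetical, chosen only to show the composition of units 110, 120 and 130; the source does not prescribe any software structure.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, Set


@dataclass
class WebsiteTypeIdentifier:
    """Chains the three units: screening (110), filtering (120), identification (130)."""

    screen: Callable[[Iterable[str], Set[str]], Set[str]]
    filter_sites: Callable[[Set[str]], Set[str]]
    recognize: Callable[[Set[str]], Dict[str, str]]

    def run(self, database: Iterable[str], keywords: Set[str]) -> Dict[str, str]:
        candidates = self.screen(database, keywords)  # unit 110
        kept = self.filter_sites(candidates)          # unit 120
        return self.recognize(kept)                   # unit 130
```

Passing the units in as callables keeps each one replaceable, which matches the way the text describes every unit as an independent device.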
Each device is described in further detail below.
In the website screening unit 110, the predetermined keyword set may be a set of predetermined resource features, which may be resource features provided by an information management department. For example, the resource features that the information management department provides for financial websites may be: insurance, loan, bank, stock, foreign exchange, financing, money transaction, transfer or payment. The predetermined keyword set may also be a website type feature set obtained by compiling statistics on users' operation behaviour. For example, the feature set for financial websites may be: account, password, transfer operation, or buy and sell operations. The predetermined keyword set may also be the intersection of the predetermined resource features and the website type feature set. The predetermined database is generally a database made up of the websites corresponding to Chinese web pages.
As shown in Fig. 7, the website screening unit 110 may include the following devices:
a device for analysing the content of each website in the predetermined database according to the predetermined keyword set, using a natural language processing algorithm (hereinafter "keyword analysis subunit") 1101.
Specifically, a natural language processing algorithm is used to find the websites whose web pages contain, in the navigation bar, titles and content, words identical or similar to those in the predetermined keyword set.
a device for selecting the websites whose degree of match with the predetermined keyword set is higher than a predetermined value to form the website set (hereinafter "website matching subunit") 1102.
Specifically, a website is selected if the number of its words whose similarity to words in the predetermined keyword set exceeds a first predetermined value is itself greater than a second predetermined value. The first and second predetermined values are set according to user or system requirements. For example, if the system requires a precise classification of websites, the first predetermined value may be 95% and the second predetermined value may be 50; if the user only requires a rough view of how websites are classified, the first predetermined value may be 80% and the second predetermined value may be 20.
As shown in Fig. 7, the website filtering unit 120 may include at least one of the following devices:
a device for selecting the websites of a predetermined region according to the URL of each website (hereinafter "first filtering subunit") 1201.
For example, if the URL of the first website is http://www.12306.cn, the URL of the second website is http://www.google.hk and the URL of the third website is http://www.sipo.gov.cn, then the first website is a Chinese website, the second is a Hong Kong website and the third is a Chinese government website, so the first and third websites are selected as websites in the China domain.
a device for selecting, according to the log of each website, the websites whose visit volume exceeds a first threshold (hereinafter "second filtering subunit") 1202.
For example, if the log of the first website shows that 1000 users browsed it that day, the log of the second website shows 5000 users and the log of the third website shows 100 users, and the first threshold is 500, then the first and second websites are selected.
a device for selecting, according to the log of each website, the websites whose transaction volume exceeds a second threshold (hereinafter "third filtering subunit") 1203.
For example, if the log of the first website shows that 1000 users completed payments that day, the log of the second website shows that 5000 users bought stocks and the log of the third website shows that 100 users made transfers, and the second threshold is 500, then the first and second websites are selected.
Optionally, each website remaining after filtering uses its home page as the index of the website.
As shown in Fig. 7, the type identification unit 130 may include the following devices:
a device for extracting the web page features of a website according to the log of the website after filtering (hereinafter "web page feature extraction subunit") 1301.
For example, the log of a financial website may include, but is not limited to: links to third-party payment platforms, transfers, repayments, stock trades, financing information display or loans. If the log of the website includes transfers, repayments or loans, the web page features extracted for the website may be balance change, money transaction, transfer and so on.
a device for determining the type of the website from the web page features and the predetermined resource features, according to a predetermined classification standard (hereinafter "website type determination subunit") 1302.
Specifically, the classification standard may be that at least one feature among the predetermined resource features corresponds to one website category, where the features among the predetermined resource features may carry the same or different proportions when the website category is determined. When the proportions differ, the feature that distinguishes a type of website from other, similar types is usually given the largest proportion among the predetermined resource features. For example, the predetermined resource features include credit card, deposit, loan, transfer, account balance, fund and/or wealth management; when a financial banking website is being determined, their proportions rank from largest to smallest as: credit card, deposit, loan, fund, wealth management, transfer, account balance.
Referring to Fig. 7, the website type determination subunit 1302 may include:
A device for determining the type of the website on which the web page is located, according to the web page features, the predetermined resource features corresponding to each type of website, and the proportion that each feature carries among the predetermined resource features corresponding to each type of website (hereinafter the "type-determining subunit") 13021.
For example, if the web page features include credit card, deposit, loan, and transfer, the type of the website can be determined to be financial banking according to the predetermined resource features and the proportions that credit card, deposit, and loan carry among them. Optionally, a degree of matching with each type of website is determined from the result of comparing the web page features with the predetermined resource features, and the type of website with the highest degree of matching is selected as the type of the website on which the web page is located.
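The weighted matching described above can be sketched as follows. This is only an illustration under stated assumptions: the numeric weights, the second website type, and the names `TYPE_FEATURE_WEIGHTS`, `matching_degree`, and `determine_website_type` are hypothetical. Only the ordering of the banking features follows the example in the text.

```python
from typing import Dict, Iterable, Optional, Set

# Hypothetical weights. Only the ordering of the banking features follows the
# text; the numbers and the "securities" type are illustrative assumptions.
TYPE_FEATURE_WEIGHTS: Dict[str, Dict[str, float]] = {
    "financial_banking": {
        "credit card": 0.30, "deposit": 0.25, "loan": 0.20, "fund": 0.10,
        "wealth management": 0.08, "transfer": 0.05, "balance": 0.02,
    },
    "securities": {
        "stock trading": 0.50, "fund": 0.30,
        "wealth management": 0.15, "transfer": 0.05,
    },
}


def matching_degree(page_features: Set[str], weights: Dict[str, float]) -> float:
    """Sum the weights of the resource features present among the page features."""
    return sum(w for feature, w in weights.items() if feature in page_features)


def determine_website_type(
    page_features: Iterable[str],
    type_weights: Dict[str, Dict[str, float]] = TYPE_FEATURE_WEIGHTS,
) -> Optional[str]:
    """Return the type with the highest matching degree, or None if nothing matches."""
    present = set(page_features)
    scores = {t: matching_degree(present, w) for t, w in type_weights.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None
```

With these assumed weights, a page whose features are credit card, deposit, loan, and transfer scores highest against the banking type, which matches the example in the text.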
Referring to Fig. 7, the embodiment of the present invention may further include the following devices:
A device for adding the web page features to the predetermined keyword set (hereinafter the "adding unit") 140.
Specifically, when the web page features contain a word that is not in the predetermined keyword set, that web page feature needs to be added to the keyword set.
A device for re-recognizing the types of the websites in the database according to the predetermined keyword set after the addition, until the web page features are all words in the predetermined keyword set (hereinafter the "re-recognition unit") 150.
Specifically, once the web page features contain a word that is not in the predetermined keyword set, the websites recognized so far are not comprehensive enough. The word therefore needs to be added to the predetermined keyword set, and the types of the websites in the database are recognized again by the web page screening unit 110, the website determining unit 120, and the type recognition unit 130, to ensure that the recognized websites are comprehensive and that the recognized website types are accurate.
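The re-recognition loop above can be illustrated with a short sketch. The callback `extract_features` stands in for one full pass through units 110, 120, and 130. It, the `max_rounds` safety limit, and the toy `RELATED` table are assumptions made for this example and do not come from the source.

```python
from typing import Callable, Dict, Set


def recognize_until_stable(
    keywords: Set[str],
    extract_features: Callable[[Set[str]], Set[str]],
    max_rounds: int = 10,  # assumed safety limit, not part of the source
) -> Set[str]:
    """Repeat recognition, adding unseen features to the keyword set, until none are new."""
    keywords = set(keywords)  # work on a copy so the caller's set is untouched
    for _ in range(max_rounds):
        features = extract_features(keywords)  # one pass of screening, filtering, recognition
        new_words = features - keywords
        if not new_words:
            break  # every extracted feature is already a keyword
        keywords |= new_words
    return keywords


# Toy stand-in for the recognition pass: each keyword "discovers" related features.
RELATED: Dict[str, Set[str]] = {
    "loan": {"loan", "transfer"},
    "transfer": {"transfer", "balance"},
    "balance": {"balance"},
}


def toy_extract(keywords: Set[str]) -> Set[str]:
    found: Set[str] = set()
    for word in keywords:
        found |= RELATED.get(word, set())
    return found
```

Starting from the single keyword "loan", the toy pass first discovers "transfer", then "balance", and the loop stops once a pass yields no word outside the keyword set.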
Referring to Fig. 7, the embodiment of the present invention may further include:
A device for verifying the recognized website type according to attribute information of the website (hereinafter the "verification unit") 160.
Specifically, a voting mechanism is applied to the attribute information of the website to verify whether the recognized website type is accurate, which further improves the accuracy of the recognized type. The attribute information of the website may be basic configuration information of the website, for example its domain name or IP address.
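The text does not specify how the voting mechanism works, so the following is only one possible reading: each attribute (for example the domain name or the IP address) casts a vote for a website type, and the recognized type is accepted when it wins a strict majority. The functions `majority_vote` and `verify_type`, and the idea that each attribute yields a single vote, are assumptions.

```python
from collections import Counter
from typing import Iterable, Optional


def majority_vote(votes: Iterable[Optional[str]]) -> Optional[str]:
    """Return the type holding a strict majority of the non-empty votes, else None."""
    cast = [v for v in votes if v is not None]  # attributes with no opinion abstain
    if not cast:
        return None
    label, count = Counter(cast).most_common(1)[0]
    return label if count * 2 > len(cast) else None


def verify_type(recognized: str, votes: Iterable[Optional[str]]) -> bool:
    """Accept the recognized type only if the attribute votes agree with it."""
    return majority_vote(votes) == recognized
```

A strict majority is a deliberately conservative choice here: a tie between two types leaves the recognized type unverified rather than accepted.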
The method and device for recognizing website type described in the embodiments of the present invention screen the websites in a predetermined database using resource features provided by an information management department and/or predetermined resource features obtained by compiling statistics on user operation behavior, filter the result according to predetermined features of the websites, and then extract features from the websites to determine their types, which improves the accuracy and stability of website type recognition. In addition, adding the extracted features to the keywords and recognizing the types of the websites in the database again ensures that the recognized websites are comprehensive and further improves accuracy.
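The screening and filtering stages summarized above can be sketched as below. The sketch rests on several assumptions: a simple keyword-fraction score stands in for the natural language processing algorithm, the `Site` fields are hypothetical, and filtering is taken to keep websites whose visit count exceeds the first threshold (the source wording leaves open whether such websites are kept or discarded).

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass
class Site:
    url: str
    content: str
    visits: int  # visit count read from the site's log (hypothetical field)


def keyword_match(content: str, keywords: Set[str]) -> float:
    """Fraction of keywords occurring in the content (stand-in for the NLP step)."""
    if not keywords:
        return 0.0
    text = content.lower()
    return sum(1 for k in keywords if k.lower() in text) / len(keywords)


def screen(sites: List[Site], keywords: Set[str], min_match: float) -> List[Site]:
    """Keep the sites whose matching degree with the keyword set exceeds min_match."""
    return [s for s in sites if keyword_match(s.content, keywords) > min_match]


def filter_by_visits(sites: List[Site], first_threshold: int) -> List[Site]:
    """Keep the sites whose visit count exceeds the first threshold (assumed direction)."""
    return [s for s in sites if s.visits > first_threshold]
```

Running `screen` before `filter_by_visits` mirrors the order in the text: the cheap keyword pass shrinks the candidate set before any per-site log data is consulted.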
It should be noted that the present invention may be implemented in software and/or in a combination of software and hardware. For example, each device of the present invention may be implemented with an application-specific integrated circuit (ASIC) or any other similar hardware device. In one embodiment, the software program of the present invention may be executed by a processor to realize the steps or functions described above. Likewise, the software program of the present invention (including related data structures) may be stored in a computer-readable recording medium, for example RAM, a magnetic or optical drive, a floppy disk, or a similar device. In addition, some steps or functions of the present invention may be implemented in hardware, for example as a circuit that cooperates with a processor to perform each step or function.
It is obvious to a person skilled in the art that the invention is not limited to the details of the exemplary embodiments above, and that the invention can be realized in other specific forms without departing from its spirit or essential characteristics. The embodiments should therefore be regarded in every respect as exemplary and non-restrictive. The scope of the invention is defined by the claims rather than by the description above, and all changes that fall within the meaning and range of equivalency of the claims are intended to be included in the invention. Any reference sign in a claim should not be construed as limiting the claim concerned. Furthermore, the word "comprising" does not exclude other units or steps, and the singular does not exclude the plural. Multiple units or devices recited in a system claim may also be implemented by a single unit or device through software or hardware. Words such as "first" and "second" are used to denote names and do not indicate any particular order.
Although exemplary embodiments have been specifically shown and described above, those skilled in the art will understand that changes in form and detail may be made without departing from the spirit and scope of the claims. The protection sought herein is set out in the appended claims.
Claims (16)
1. A method for recognizing website type, including:
screening a predetermined database according to a predetermined keyword set to obtain a website set;
filtering the website set according to predetermined features of each website in the website set;
recognizing the type of a website according to the features of the websites after filtering.
2. The recognition method according to claim 1, wherein the step of recognizing the type of a website according to the features of the websites after filtering includes:
extracting web page features of the website according to the log of the website;
determining the type of the website from the web page features and predetermined resource features according to a predetermined classification standard.
3. The recognition method according to claim 2, wherein the step of determining the type of the website from the web page features and predetermined resource features according to a predetermined classification standard includes:
determining the type of the website on which the web page is located according to the web page features, the predetermined resource features corresponding to each type of website, and the proportion that each feature carries among the predetermined resource features corresponding to each type of website.
4. The recognition method according to claim 2, further including:
adding the web page features to the predetermined keyword set.
5. The recognition method according to claim 4, further including:
re-recognizing the types of the websites in the database according to the predetermined keyword set after the addition, until the web page features are words in the predetermined keyword set.
6. The recognition method according to claim 1, wherein the step of screening a predetermined database according to a predetermined keyword set to obtain a website set includes:
analyzing the content of each website in the predetermined database with a natural language processing algorithm according to the predetermined keyword set;
screening out the websites whose degree of matching with the predetermined keyword set is higher than a predetermined value to form the website set.
7. The recognition method according to claim 1, wherein the step of filtering the website set according to predetermined features of each website in the website set includes at least one of the following:
filtering out the websites of a predetermined region according to the URL of each website;
filtering out the websites whose visit volume is greater than a first threshold according to the log of each website;
filtering out the websites whose transaction volume is greater than a second threshold according to the log of each website.
8. The recognition method according to any one of claims 1-7, further including:
verifying the recognized website type according to attribute information of the website.
9. A device for recognizing website type, including:
a device for screening a predetermined database according to a predetermined keyword set to obtain a website set;
a device for filtering the website set according to predetermined features of each website in the website set;
a device for recognizing the type of a website according to the features of the websites after filtering.
10. The recognition device according to claim 9, wherein the device for recognizing the type of a website according to the features of the websites after filtering includes:
a device for extracting web page features of the website according to the log of the website;
a device for determining the type of the website from the web page features and predetermined resource features according to a predetermined classification standard.
11. The recognition device according to claim 10, wherein the device for determining the type of the website from the web page features and predetermined resource features according to a predetermined classification standard includes:
a device for determining the type of the website on which the web page is located according to the web page features, the predetermined resource features corresponding to each type of website, and the proportion that each feature carries among the predetermined resource features corresponding to each type of website.
12. The recognition device according to claim 10, further including:
a device for adding the web page features to the predetermined keyword set.
13. The recognition device according to claim 12, further including:
a device for re-recognizing the types of the websites in the database according to the predetermined keyword set after the addition, until the web page features are words in the predetermined keyword set.
14. The recognition device according to claim 9, wherein the device for screening a predetermined database according to a predetermined keyword set to obtain a website set includes:
a device for analyzing the content of each website in the predetermined database with a natural language processing algorithm according to the predetermined keyword set;
a device for screening out the websites whose degree of matching with the predetermined keyword set is higher than a predetermined value to form the website set.
15. The recognition device according to claim 9, wherein the device for filtering the website set according to predetermined features of each website in the website set includes at least one of the following:
a device for filtering out the websites of a predetermined region according to the URL of each website;
a device for filtering out the websites whose visit volume is greater than a first threshold according to the log of each website;
a device for filtering out the websites whose transaction volume is greater than a second threshold according to the log of each website.
16. The recognition device according to any one of claims 9-15, further including:
a device for verifying the recognized website type according to attribute information of the website.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510634837.6A CN106557520A (en) | 2015-09-29 | 2015-09-29 | The recognition methods of the Type of website and device |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510634837.6A CN106557520A (en) | 2015-09-29 | 2015-09-29 | The recognition methods of the Type of website and device |
Publications (1)
Publication Number | Publication Date |
---|---|
CN106557520A true CN106557520A (en) | 2017-04-05 |
Family
ID=58417170
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201510634837.6A Pending CN106557520A (en) | 2015-09-29 | 2015-09-29 | The recognition methods of the Type of website and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106557520A (en) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103049456A (en) * | 2011-10-14 | 2013-04-17 | 腾讯科技(深圳)有限公司 | Method and device for screening web pages |
CN103235824A (en) * | 2013-05-06 | 2013-08-07 | 上海河广信息科技有限公司 | Method and system for determining web page texts users interested in according to browsed web pages |
CN104239340A (en) * | 2013-06-19 | 2014-12-24 | 北京搜狗信息服务有限公司 | Search result screening method and search result screening device |
CN104750754A (en) * | 2013-12-31 | 2015-07-01 | 北龙中网(北京)科技有限责任公司 | Website industry classification method and server |
-
2015
- 2015-09-29 CN CN201510634837.6A patent/CN106557520A/en active Pending
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
RJ01 | Rejection of invention patent application after publication | Application publication date: 20170405 |