CN106294443A - Knowledge-base-based URL classification and recognition method and system - Google Patents
Knowledge-base-based URL classification and recognition method and system
- Publication number
- CN106294443A (application number CN201510280344.7A)
- Authority
- CN
- China
- Prior art keywords
- classification
- url
- information
- knowledge base
- classifying
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/955—Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
- G06F16/9566—URL specific, e.g. using aliases, detecting broken or misspelled links
Abstract
The invention discloses a knowledge-base-based URL classification and recognition method. Classification information is imported into a knowledge base; the URLs of internet information are preliminarily classified according to the classification information in the knowledge base; the preliminarily classified internet information is further identified and classified layer by layer based on URL structure; and the identification and classification results are output. Internet content is thereby classified and identified. Because no text analysis or image recognition over massive amounts of text content is needed, only hierarchical identification of web addresses, service response capacity can be improved substantially. The entire knowledge base is also built in memory, so no hard-disk I/O is required, only network I/O and memory access, which reduces system resource consumption. Based on the concept of layering, content within the same website can be assigned to the same category or to different categories. Because the keys are simple, classification queries consume minimal system resources.
Description
Technical field
The present invention relates to the technical field of internet applications, and in particular to a knowledge-base-based URL classification and recognition method and system.
Background
Many search engine services, such as Baidu and Google, provide search over information accessible via the internet. These services let users search for display pages of interest, such as news web pages. After a user submits a search request containing search terms, the search engine service identifies web pages that may be related to those terms. The keywords of any particular web page can be identified using various well-known information retrieval techniques, for example by identifying words in the title, words supplied in the page's metadata, highlighted words, and so on. The search engine service can generate a relevance score based on the closeness of each match, the popularity of the page, and so on, to indicate how relevant the page's information is to the search request. The search engine service then displays links to those pages to the user in ranked order.
Although a search engine service may return many web pages as search results, presenting them in ranked order can make it hard for users to actually find the pages they are particularly interested in. Because the pages presented first may concern popular topics, a user interested in an obscure topic may need to browse many pages of search results to find a page of interest. To make pages of interest easier to find, the pages in the search results can be presented in an organization based on some classification or category of the pages. For example, if a user submits the search request "court battles", the search results may include pages classified as sports-related or law-related. The user may prefer to be shown a list of page categories first, so that the user can select the category of interest. For example, the user might first be presented with an indication that the search results have been classified as sports-related and law-related. The user can then select the law-related category to view the law-related pages. By contrast, because sports-related pages are more popular than law-related pages, if the most popular pages are presented first, the user may have to browse many pages to find the law-related ones.
Manually classifying the millions of web pages currently available is impractical. Although automatic classification techniques have been used to classify text based on its content, those techniques are generally unsuitable for classifying web pages. Web pages are organized to include noisy content, such as advertisements or navigation bars, that is not directly related to the page's main topic. Because traditional text-based classification techniques would use such noisy content when classifying web pages, they can classify pages incorrectly.
Existing web page classification techniques mainly rely on analysing and recognizing the content of internet pages. The accuracy of this approach is too low to be practical, and its response speed is unreliable under a large number of highly concurrent requests.
Summary of the invention
In view of the above shortcomings in the current field of internet applications, the present invention provides a knowledge-base-based URL classification and recognition method and system that classifies by hierarchical identification of web addresses, thereby improving service response capacity.
To achieve the above object, embodiments of the present invention adopt the following technical solution:
A knowledge-base-based URL classification and recognition method comprises the following steps:
importing classification information into a knowledge base;
preliminarily classifying the URLs of internet information according to the classification information in the knowledge base;
further identifying and classifying the preliminarily classified internet information layer by layer based on URL structure;
outputting the identification and classification results.
According to one aspect of the present invention, importing classification information into the knowledge base may be implemented as follows: the classification information is imported into the knowledge base and stored, and the entire knowledge base is loaded into memory.
According to one aspect of the present invention, preliminarily classifying the URLs of internet information according to the classification information base may be implemented as follows: according to the keywords in the classification information base, the URLs of internet information that contain those keywords are preliminarily classified.
According to one aspect of the present invention, the further layer-by-layer identification and classification of the preliminarily classified internet information based on URL structure may be implemented as follows: the URL structure of the preliminarily classified internet information is analysed and, based on the concept of layering, the internet information is further classified according to differences in URL level.
According to one aspect of the present invention, importing classification information into the knowledge base may be implemented as follows: plaintext classification information and/or ciphertext classification information is imported into the knowledge base.
According to one aspect of the present invention, the URL-based classification and recognition method further comprises the following step: internet information is input over a socket connection for query and classification.
A URL-based classification and recognition system comprises:
an import module, for importing classification information into a knowledge base;
a preliminary classification module, for preliminarily classifying the URLs of internet information according to the classification information in the knowledge base;
a layered identification module, for further identifying and classifying the preliminarily classified internet information layer by layer based on URL structure;
an output module, for outputting the identification and classification results.
According to one aspect of the present invention, the import module may operate as follows: the classification information is imported into the knowledge base and stored, and the entire knowledge base is built in memory.
According to one aspect of the present invention, the layered identification module may operate as follows: the URL structure of the preliminarily classified internet information is analysed and, based on the concept of layering, the internet information is further classified according to differences in URL level.
According to one aspect of the present invention, the URL-based classification and recognition system further comprises a socket connection module, for inputting internet information over a socket connection for query and classification.
Advantages of implementing the present invention: the knowledge-base-based URL classification and recognition method of the present invention imports classification information into a knowledge base; preliminarily classifies the URLs of internet information according to the classification information in the knowledge base; further identifies and classifies the preliminarily classified internet information layer by layer based on URL structure; and outputs the identification and classification results. Internet content is thereby classified and identified. Because no text analysis or image recognition over massive amounts of text content is needed, only hierarchical identification of web addresses, service response capacity can be improved substantially. The entire knowledge base is also built in memory, so the classification engine needs no hard-disk I/O while running, only network I/O and memory access, which reduces system resource consumption. Based on the concept of layering, content within the same website can be assigned to the same category or to different categories. Because the keys are simple, classification queries consume minimal system resources.
Brief description of the drawings
To illustrate the technical solutions in the embodiments of the present invention more clearly, the drawings required by the embodiments are briefly described below. The drawings described below show only some embodiments of the present invention; a person of ordinary skill in the art can derive other drawings from them without creative effort.
Fig. 1 is a schematic diagram of a knowledge-base-based URL classification and recognition method according to the present invention;
Fig. 2 is a schematic structural diagram of a knowledge-base-based URL classification and recognition system according to the present invention.
Detailed description
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. The described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art from these embodiments without creative effort fall within the scope of protection of the present invention.
As shown in Fig. 1, a knowledge-base-based URL classification and recognition method comprises the following steps:
Step S1: import classification information into the knowledge base.
In step S1, importing classification information into the knowledge base may be implemented as follows: the classification information is imported into the knowledge base and stored, and the entire knowledge base is built in memory. Because the knowledge base is built in memory, operation involves only network access and memory access, with no hard-disk access; this reduces resource usage and can greatly improve the server's response speed.
In practice, plaintext classification information and/or ciphertext classification information may be imported into the knowledge base.
In practice, the classification information may be obtained as follows: pattern information is crawled from the internet, and the crawled information is then classified by keyword to obtain the classification information and build the classification information base.
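The in-memory loading described in step S1 can be sketched as follows. This is a minimal illustration, not the patent's implementation: the tab-separated record format, the `load_knowledge_base` function name and the sample entries are all assumptions introduced here, and only plaintext records are handled.

```python
import io


def load_knowledge_base(source):
    """Build the whole knowledge base as an in-memory dict.

    `source` is any iterable of text lines. The assumed (hypothetical)
    record format is "key<TAB>category", one record per line; blank
    lines, comment lines and malformed lines are skipped.
    """
    knowledge_base = {}
    for raw in source:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, _, category = line.partition("\t")
        if not category:
            continue  # no tab or empty category: ignore the record
        knowledge_base[key.strip().lower()] = category.strip()
    return knowledge_base


# Hypothetical plaintext classification information.
SAMPLE = io.StringIO(
    "# key<TAB>category\n"
    "news\tNews\n"
    "a.com/1\tNews\n"
    "a.com/1/2\tSports\n"
)
KB = load_knowledge_base(SAMPLE)
```

After loading, every lookup is a dictionary access in memory, which is the property the description relies on to avoid hard-disk I/O at query time.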
Step S2: preliminarily classify the URLs of internet information according to the classification information in the knowledge base.
In step S2, preliminarily classifying the URLs of internet information according to the classification information in the knowledge base may be implemented as follows: according to the keywords in the classification information base, the URLs of internet information that contain those keywords are preliminarily classified. For example, given the keyword "news" in the classification information base, the URLs of news-related internet information are classified together, so that internet information whose URL contains "news" is assigned to one class.
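The keyword-based preliminary classification of step S2 can be sketched as below. The function name `preliminary_classify`, the case-insensitive substring match and the first-match-wins rule are assumptions made for illustration; the description only states that URLs containing a keyword are grouped into one class.

```python
def preliminary_classify(urls, keyword_to_category):
    """Group URLs by the first knowledge-base keyword they contain.

    URLs that contain no keyword are collected under the key None.
    """
    groups = {}
    for url in urls:
        lowered = url.lower()
        for keyword, category in keyword_to_category.items():
            if keyword in lowered:
                groups.setdefault(category, []).append(url)
                break
        else:
            groups.setdefault(None, []).append(url)
    return groups


# Hypothetical keyword table and URLs.
KEYWORDS = {"news": "News"}
URLS = [
    "http://news.a.com/1/",
    "http://a.com/news/2.html",
    "http://a.com/shop/",
]
GROUPS = preliminary_classify(URLS, KEYWORDS)
```

Only the URL string is inspected; no page content is fetched or analysed at this stage.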
Step S3: further identify and classify the preliminarily classified internet information layer by layer based on URL structure.
In step S3, this may be implemented as follows: the URL structure of the preliminarily classified internet information is analysed and, based on the concept of layering, the internet information is further classified according to differences in URL level. Because classification is hierarchical, content within the same website can be assigned to the same category or to different categories.
In practice, classification can proceed as follows:
http://a.com/1/, http://a.com/1/index.jsp and http://a.com/1/indes233.jsp are treated as the same class.
http://a.com/1/2/ and http://a.com/1/ are treated as different classes.
http://a.com/1/ and http://a.com/1/2 are treated as different classes, as are http://a.com/1/2 and http://a.com/1/2/3/4.
Thus, because the keys are simple, classification queries consume minimal system resources.
Likewise, because no text analysis or image recognition over massive amounts of text content is needed, only hierarchical identification of web addresses, service response capacity can be improved substantially.
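One way to obtain the simple keys that step S3 relies on is to reduce each URL to its host plus directory levels, as sketched below. The key format, the `hierarchy_key` and `classify` names, and the rule that a final path segment containing a dot is a file name are assumptions introduced here; they are chosen only so that the example URLs from the description behave as stated.

```python
from urllib.parse import urlsplit


def hierarchy_key(url):
    """Reduce a URL to 'host/level1/level2/...' (assumed key format).

    A trailing file name such as index.jsp is dropped, so pages in the
    same directory share one key; deeper directories get their own key.
    """
    if "://" not in url:
        url = "//" + url  # let urlsplit recognise the host
    parts = urlsplit(url)
    segments = [s for s in parts.path.split("/") if s]
    # Heuristic (assumption): a last segment with a dot is a file name.
    if segments and not parts.path.endswith("/") and "." in segments[-1]:
        segments.pop()
    return "/".join([parts.netloc.lower()] + segments)


def classify(url, knowledge_base):
    """Exact in-memory lookup of the URL's level; None if unknown."""
    return knowledge_base.get(hierarchy_key(url))


# Hypothetical knowledge-base entries for two levels of one site.
LEVELS = {"a.com/1": "News", "a.com/1/2": "Sports"}
```

With these entries, `classify("http://a.com/1/index.jsp", LEVELS)` and `classify("http://a.com/1/", LEVELS)` resolve to the same class, while `http://a.com/1/2/` resolves to a different one, and each query costs a single dictionary lookup.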
Step S4: output the identification and classification results.
In practice, internet information can be input over a socket connection for query and classification.
In practice, the system provides a socket service based on a custom query protocol that supports batch queries, so any number of website query requests can be submitted at once. The service performance bottleneck of the whole system is therefore only the network bandwidth used by the client.
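The patent does not specify its custom query protocol, so the sketch below invents a simple one purely for illustration: the client sends one URL per line and ends the batch with a blank line, and the server answers each URL with `url<TAB>category`. The handler class, the `lookup` and `query_batch` helpers and the sample knowledge base are all hypothetical.

```python
import socket
import socketserver
import threading

# Hypothetical in-memory knowledge base: URL level -> category.
KNOWLEDGE_BASE = {"a.com/1": "News", "a.com/1/2": "Sports"}


def lookup(url):
    """Simplified key: strip the scheme and any trailing slash."""
    key = url.split("://", 1)[-1].rstrip("/").lower()
    return KNOWLEDGE_BASE.get(key, "unknown")


class BatchQueryHandler(socketserver.StreamRequestHandler):
    """Answer a batch of newline-delimited URLs on one connection."""

    def handle(self):
        for raw in self.rfile:
            url = raw.decode("utf-8").strip()
            if not url:  # a blank line ends the batch
                break
            self.wfile.write(f"{url}\t{lookup(url)}\n".encode("utf-8"))


def query_batch(address, urls):
    """Send all URLs at once and read replies until the server closes."""
    with socket.create_connection(address) as conn:
        conn.sendall(("\n".join(urls) + "\n\n").encode("utf-8"))
        with conn.makefile("r", encoding="utf-8") as reply:
            return [line.rstrip("\n").split("\t") for line in reply]


if __name__ == "__main__":
    # Port 0 asks the operating system for any free port.
    with socketserver.ThreadingTCPServer(("127.0.0.1", 0), BatchQueryHandler) as srv:
        threading.Thread(target=srv.serve_forever, daemon=True).start()
        print(query_batch(srv.server_address, ["http://a.com/1/", "http://b.com/"]))
        srv.shutdown()
```

The client here sends its whole batch before reading any reply, which is fine for small batches; a client submitting very large batches would need to read replies while sending to avoid both sides blocking on full buffers.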
The URL-based classification and recognition method of this embodiment imports classification information into a knowledge base; preliminarily classifies the URLs of internet information according to the classification information in the knowledge base; further identifies and classifies the preliminarily classified internet information layer by layer based on URL structure; and outputs the identification and classification results. Internet content is thereby classified and identified. Because no text analysis or image recognition over massive amounts of text content is needed, only hierarchical identification of web addresses, service response capacity can be improved substantially. The entire knowledge base is also built in memory, so the classification engine needs no hard-disk I/O while running, only network I/O and memory access, which reduces system resource consumption. Based on the concept of layering, content within the same website can be assigned to the same category or to different categories. Because the keys are simple, classification queries consume minimal system resources.
Embodiment of a URL-based classification and recognition system
As shown in Fig. 2, a URL-based classification and recognition system comprises:
an import module 1, for importing classification information into a knowledge base;
a preliminary classification module 2, for preliminarily classifying the URLs of internet information according to the classification information in the knowledge base;
a layered identification module 3, for further identifying and classifying the preliminarily classified internet information layer by layer based on URL structure;
an output module 4, for outputting the identification and classification results.
In practice, the import module may operate as follows: the classification information is imported into the knowledge base and stored, and the entire knowledge base is built in memory.
In practice, the layered identification module may operate as follows: the URL structure of the preliminarily classified internet information is analysed and, based on the concept of layering, the internet information is further classified according to differences in URL level.
In practice, the URL-based classification and recognition system further comprises a socket connection module 5, for inputting internet information over a socket connection for query and classification.
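How modules 1 to 4 might fit together can be sketched as a single class. This is an illustrative composition under stated assumptions, not the structure shown in Fig. 2: the class name `UrlClassifier`, its method names, the two separate tables (keywords and URL levels) and the result dictionary are all hypothetical, and the socket connection module 5 is omitted.

```python
from urllib.parse import urlsplit


class UrlClassifier:
    """Hypothetical composition of the import, preliminary
    classification, layered identification and output modules."""

    def __init__(self):
        self.keywords = {}  # keyword -> preliminary category
        self.levels = {}    # 'host/level1/...' -> layered category

    def import_info(self, keywords, levels):
        """Import module: keep all classification info in memory."""
        self.keywords.update(keywords)
        self.levels.update(levels)

    def preliminary(self, url):
        """Preliminary classification module: keyword match on the URL."""
        lowered = url.lower()
        for keyword, category in self.keywords.items():
            if keyword in lowered:
                return category
        return None

    def layered(self, url):
        """Layered identification module: look up the URL's level."""
        parts = urlsplit(url)
        segments = [s for s in parts.path.split("/") if s]
        # Assumption: a last segment containing a dot is a file name.
        if segments and not parts.path.endswith("/") and "." in segments[-1]:
            segments.pop()
        return self.levels.get("/".join([parts.netloc.lower()] + segments))

    def classify(self, url):
        """Output module: return both results for one URL."""
        coarse = self.preliminary(url)
        fine = self.layered(url) if coarse is not None else None
        return {"url": url, "preliminary": coarse, "layered": fine}


classifier = UrlClassifier()
classifier.import_info(
    {"news": "News"},
    {"news.a.com/1": "Domestic", "news.a.com/1/2": "Sports"},
)
```

In this sketch the layered lookup runs only for URLs that passed preliminary classification, mirroring the order of steps S2 and S3.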
Advantages of implementing the present invention: the knowledge-base-based URL classification and recognition method of the present invention imports classification information into a knowledge base; preliminarily classifies the URLs of internet information according to the classification information in the knowledge base; further identifies and classifies the preliminarily classified internet information layer by layer based on URL structure; and outputs the identification and classification results. Internet content is thereby classified and identified. Because no text analysis or image recognition over massive amounts of text content is needed, only hierarchical identification of web addresses, service response capacity can be improved substantially. The entire knowledge base is also built in memory, so the classification engine needs no hard-disk I/O while running, only network I/O and memory access, which reduces system resource consumption. Based on the concept of layering, content within the same website can be assigned to the same category or to different categories. Because the keys are simple, classification queries consume minimal system resources.
The above is only a specific embodiment of the present invention, but the scope of protection of the present invention is not limited to it. Any change or replacement that a person skilled in the art could readily conceive within the technical scope disclosed by the present invention shall fall within the scope of protection of the present invention. The scope of protection of the present invention shall therefore be determined by the scope of the claims.
Claims (10)
1. A knowledge-base-based URL classification and recognition method, characterized in that the method comprises the following steps:
importing classification information into a knowledge base;
preliminarily classifying the URLs of internet information according to the classification information in the knowledge base;
further identifying and classifying the preliminarily classified internet information layer by layer based on URL structure;
outputting the identification and classification results.
2. The URL-based classification and recognition method according to claim 1, characterized in that importing classification information into the knowledge base may be implemented as follows: the classification information is imported into the knowledge base and stored, and the entire knowledge base is built in memory.
3. The URL-based classification and recognition method according to claim 1, characterized in that preliminarily classifying the URLs of internet information according to the classification information base may be implemented as follows: according to the keywords in the classification information base, the URLs of internet information that contain those keywords are preliminarily classified.
4. The URL-based classification and recognition method according to claim 1, characterized in that the further layer-by-layer identification and classification of the preliminarily classified internet information based on URL structure may be implemented as follows: the URL structure of the preliminarily classified internet information is analysed and, based on the concept of layering, the internet information is further classified according to differences in URL level.
5. The URL-based classification and recognition method according to claim 1, characterized in that importing classification information into the knowledge base may be implemented as follows: plaintext classification information and/or ciphertext classification information is imported into the knowledge base.
6. The URL-based classification and recognition method according to any one of claims 1 to 5, characterized in that the method further comprises the following step: internet information is input over a socket connection for query and classification.
7. A URL-based classification and recognition system, characterized in that the system comprises:
an import module, for importing classification information into a knowledge base;
a preliminary classification module, for preliminarily classifying the URLs of internet information according to the classification information in the knowledge base;
a layered identification module, for further identifying and classifying the preliminarily classified internet information layer by layer based on URL structure;
an output module, for outputting the identification and classification results.
8. The URL-based classification and recognition system according to claim 7, characterized in that the import module may operate as follows: the classification information is imported into the knowledge base and stored, and the entire knowledge base is built in memory.
9. The URL-based classification and recognition system according to claim 8, characterized in that the layered identification module may operate as follows: the URL structure of the preliminarily classified internet information is analysed and, based on the concept of layering, the internet information is further classified according to differences in URL level.
10. The URL-based classification and recognition system according to any one of claims 7 to 9, characterized in that the system further comprises a socket connection module, for inputting internet information over a socket connection for query and classification.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510280344.7A CN106294443A (en) | 2015-05-28 | 2015-05-28 | Knowledge-base-based URL classification and recognition method and system |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510280344.7A CN106294443A (en) | 2015-05-28 | 2015-05-28 | Knowledge-base-based URL classification and recognition method and system |
Publications (1)
Publication Number | Publication Date |
---|---|
CN106294443A true CN106294443A (en) | 2017-01-04 |
Family
ID=57635575
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201510280344.7A Pending CN106294443A (en) | 2015-05-28 | 2015-05-28 | Knowledge-base-based URL classification and recognition method and system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106294443A (en) |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102567494A (en) * | 2011-12-22 | 2012-07-11 | 北京亿赞普网络技术有限公司 | Website classification method and device |
CN102955810A (en) * | 2011-08-26 | 2013-03-06 | 中国移动通信集团公司 | Webpage classification method and device |
US20140136569A1 (en) * | 2012-11-09 | 2014-05-15 | Microsoft Corporation | Taxonomy Driven Commerce Site |
Legal Events
Code | Title | Description |
---|---|---|
C06 | Publication | |
PB01 | Publication | |
C10 | Entry into substantive examination | |
SE01 | Entry into force of request for substantive examination | |
RJ01 | Rejection of invention patent application after publication | Application publication date: 20170104 |