CN106294443A - Knowledge-base-based URL classification and recognition method and system - Google Patents

Knowledge-base-based URL classification and recognition method and system

Info

Publication number
CN106294443A
Authority
CN
China
Prior art keywords
classification
url
information
knowledge base
classifying
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201510280344.7A
Other languages
Chinese (zh)
Inventor
王栋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Le Le Mdt Infotech Ltd
Original Assignee
Shanghai Le Le Mdt Infotech Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Le Le Mdt Infotech Ltd
Priority to CN201510280344.7A
Publication of CN106294443A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/955 Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
    • G06F16/9566 URL specific, e.g. using aliases, detecting broken or misspelled links

Abstract

The invention discloses a knowledge-base-based URL classification and recognition method comprising: importing classification information into a knowledge base; performing a preliminary classification of the URLs of internet information according to the classification information in the knowledge base; performing further hierarchical recognition and classification of the preliminarily classified internet information based on URL structure; and outputting the recognition and classification result. Internet content is thereby classified and recognized. Because no text analysis or image recognition has to be performed on massive amounts of text content and only the web address is recognized level by level, service response capability can be improved almost without limit. Because the entire knowledge base is also loaded into memory, no disk I/O is needed and only network I/O and memory access take place, which reduces the consumption of system resources. Based on the concept of hierarchy, pages of the same website can be classified both as having the same content and as having different content. Because the key values are simple, classification queries consume minimal system resources.

Description

Knowledge-base-based URL classification and recognition method and system
Technical field
The present invention relates to the technical field of internet applications, and in particular to a knowledge-base-based URL classification and recognition method and system.
Background
Many search engine services, such as Baidu and Google, provide searching of information that is accessible over the internet. These search engine services allow users to search for display pages, such as news web pages, that interest them. After a user submits a search request containing search terms, the search engine service identifies web pages that may be relevant to those terms. The keywords of any particular web page can be identified with various known information retrieval techniques, for example by identifying the words in the title, the words supplied in the page's metadata, highlighted words, and so on. The search engine service can generate a relevance score from the closeness of each match, the popularity of the web page, and similar factors, to indicate how relevant the information on the web page is to the search request. The search engine service then displays links to those web pages to the user in ranked order.
Although a search engine service can return many web pages as search results, presenting them in ranked order may make it hard for users to actually find the pages that particularly interest them. Because the pages presented first may concern popular topics, a user interested in an obscure topic may have to browse many pages of search results to find a page of interest. To make pages of interest easier to find, the web pages in the search results can be presented in an organization based on some classification of the pages. For example, if a user submits the search request "court battles", the search results may include web pages classified as sports-related or law-related. The user may prefer to be shown a list of categories first, so that the user can select the category of pages of interest. For example, the user may first be shown an indication that the search results have been classified into sports-related and law-related pages. The user can then select the law-related category to view the law-related pages. In contrast, because sports-related pages are more popular than law-related pages, if the most popular pages are presented first, the user may have to browse many pages to find the law-related ones.
Manually classifying the millions of web pages currently available is impractical. Although automatic classification techniques have been used to classify text based on its content, those techniques are generally unsuitable for classifying web pages. Web pages contain noisy content, such as advertisements or navigation bars, that is not directly related to the main topic of the page. Because traditional text-based classification techniques use such noisy content when classifying a web page, they can classify the page incorrectly.
Existing web page classification techniques mainly rely on analyzing and recognizing the content of internet pages. The accuracy of this approach makes it impractical, and its response speed is unreliable under a large number of highly concurrent requests.
Summary of the invention
In view of the above shortcomings in the current field of internet applications, the present invention provides a knowledge-base-based URL classification and recognition method and system that classifies by hierarchical recognition of web addresses, so that service response capability is improved.
To achieve the above object, embodiments of the invention adopt the following technical solution:
A knowledge-base-based URL classification and recognition method, comprising the following steps:
importing classification information into a knowledge base;
performing a preliminary classification of the URLs of internet information according to the classification information in the knowledge base;
performing further hierarchical recognition and classification of the preliminarily classified internet information based on URL structure;
outputting the recognition and classification result.
According to one aspect of the invention, importing classification information into the knowledge base may specifically be implemented as: importing the classification information into the knowledge base for storage, and loading the entire knowledge base into memory.
According to one aspect of the invention, performing the preliminary classification of the URLs of internet information according to the classification information base may specifically be implemented as: according to a keyword in the classification information base, performing a preliminary classification of the URLs of internet information that contain the keyword.
According to one aspect of the invention, performing further hierarchical recognition and classification of the preliminarily classified internet information based on URL structure may specifically be implemented as: analyzing the URL structure of the preliminarily classified internet information and, based on the concept of hierarchy, further classifying the internet information according to differences in URL level.
According to one aspect of the invention, importing classification information into the knowledge base may specifically be implemented as: importing plaintext classification information into the knowledge base and/or importing ciphertext classification information into the knowledge base.
According to one aspect of the invention, the URL-based classification and recognition method further comprises the following step: inputting internet information through a socket connection for query and classification recognition.
A URL-based classification and recognition system, comprising:
an import module, for importing classification information into a knowledge base;
a preliminary classification module, for performing a preliminary classification of the URLs of internet information according to the classification information in the knowledge base;
a hierarchical recognition module, for performing further hierarchical recognition and classification of the preliminarily classified internet information based on URL structure;
an output module, for outputting the recognition and classification result.
According to one aspect of the invention, the import module may specifically work as follows: it imports the classification information into the knowledge base for storage and loads the entire knowledge base into memory.
According to one aspect of the invention, the hierarchical recognition module may specifically work as follows: it analyzes the URL structure of the preliminarily classified internet information and, based on the concept of hierarchy, further classifies the internet information according to differences in URL level.
According to one aspect of the invention, the URL-based classification and recognition system further comprises: a socket connection module, for inputting internet information through a socket connection for query and classification recognition.
Advantages of implementing the invention: the knowledge-base-based URL classification and recognition method of the invention imports classification information into a knowledge base; performs a preliminary classification of the URLs of internet information according to the classification information in the knowledge base; performs further hierarchical recognition and classification of the preliminarily classified internet information based on URL structure; and outputs the recognition and classification result, thereby classifying and recognizing internet content. Because no text analysis or image recognition has to be performed on massive amounts of text content and only the web address is recognized level by level, service response capability can be improved almost without limit. Because the entire knowledge base is also loaded into memory, the classification engine needs no disk I/O while working and performs only network I/O and memory access, which reduces the consumption of system resources. Based on the concept of hierarchy, pages of the same website can be classified both as having the same content and as having different content. Because the key values are simple, classification queries consume minimal system resources.
Brief description of the drawings
To explain the technical solutions in the embodiments of the invention more clearly, the drawings needed for the embodiments are briefly introduced below. Evidently, the drawings described below show only some embodiments of the invention, and a person of ordinary skill in the art can derive other drawings from them without creative effort.
Fig. 1 is a schematic diagram of the knowledge-base-based URL classification and recognition method of the invention;
Fig. 2 is a structural schematic diagram of the knowledge-base-based URL classification and recognition system of the invention.
Detailed description
The technical solutions in the embodiments of the invention are described clearly and completely below with reference to the drawings. Evidently, the described embodiments are only some, not all, of the embodiments of the invention. All other embodiments obtained by a person of ordinary skill in the art from the embodiments of the invention without creative effort fall within the scope of protection of the invention.
As shown in Fig. 1, a knowledge-base-based URL classification and recognition method comprises the following steps:
Step S1: import classification information into a knowledge base;
In step S1, importing classification information into the knowledge base may specifically be implemented as: importing the classification information into the knowledge base for storage, and loading the entire knowledge base into memory. Because the knowledge base is used from memory, only network access and memory access are needed and the hard disk is not accessed, which reduces resource usage and can greatly improve server response speed.
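Step S1 can be illustrated with a minimal sketch. The patent does not specify a storage format, so the two-column "key,category" layout, the function name `load_knowledge_base`, and the sample rows below are assumptions made only for illustration.

```python
import csv
import io


def load_knowledge_base(source):
    """Read 'key,category' rows and return them as an in-memory dict.

    The two-column layout is an assumed format, not one given in the patent.
    """
    knowledge_base = {}
    for row in csv.reader(source):
        if len(row) != 2:
            continue  # skip blank or malformed rows
        key, category = (field.strip() for field in row)
        if key and category:
            knowledge_base[key] = category
    return knowledge_base


# Hypothetical sample data standing in for an imported classification file.
sample = io.StringIO("news,News\nsports,Sports\n\na.com/1/,Law\n")
kb = load_knowledge_base(sample)
print(kb)
```

Once the dictionary is built, every later lookup is a memory access, which is the property the step relies on.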
In practical applications, plaintext classification information and/or ciphertext classification information can be imported into the knowledge base.
In practical applications, the classification information may specifically be obtained as follows: information is captured from the internet, and the captured information is then classified by keyword to obtain the classification information and build the classification information base.
Step S2: perform a preliminary classification of the URLs of internet information according to the classification information in the knowledge base;
In step S2, the preliminary classification of the URLs of internet information according to the classification information in the knowledge base may specifically be implemented as: according to a keyword in the classification information base, performing a preliminary classification of the URLs of internet information that contain the keyword. For example, according to the keyword "news" in the classification information base, the URLs of internet information relating to news are classified, e.g. internet information whose URL contains "news" is placed in one class.
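The keyword-based preliminary classification of step S2 might look like the following sketch. The function name `preliminary_classify`, the case-insensitive substring match, and the example URLs are illustrative assumptions; the patent only states that URLs containing a keyword are grouped into one class.

```python
def preliminary_classify(urls, keyword_categories):
    """Group URLs by every keyword they contain (case-insensitive substring match)."""
    groups = {}
    for url in urls:
        lowered = url.lower()
        for keyword, category in keyword_categories.items():
            if keyword in lowered:
                groups.setdefault(category, []).append(url)
    return groups


# Hypothetical keywords and URLs, used only to exercise the function.
keywords = {"news": "News", "sports": "Sports"}
urls = [
    "http://news.example.com/a",
    "http://example.com/Sports/1",
    "http://example.com/about",
]
groups = preliminary_classify(urls, keywords)
print(groups)
```

URLs that match no keyword are simply left out of the result, so a later stage can decide how to treat them.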
Step S3: perform further hierarchical recognition and classification of the preliminarily classified internet information based on URL structure;
In step S3, the further hierarchical recognition and classification of the preliminarily classified internet information based on URL structure may specifically be implemented as: analyzing the URL structure of the preliminarily classified internet information and, based on the concept of hierarchy, further classifying the internet information according to differences in URL level. Based on the concept of hierarchy, pages of the same website can be classified both as having the same content and as having different content.
In practical applications, classification can proceed as follows:
http://a.com/1/, http://a.com/1/index.jsp and http://a.com/1/indes233.jsp are treated as the same class.
http://a.com/1/2/ and http://a.com/1/ are treated as different classes.
http://a.com/1/ and http://a.com/1/2, and likewise http://a.com/1/2 and http://a.com/1/2/3/4, are all treated as different classes.
Thus, because the key values are simple, classification queries consume minimal system resources.
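The examples above can be reproduced with a small key-normalization sketch. The function name `hierarchy_key` and the rule that a final path segment containing a dot is a file name (and is therefore dropped) are assumptions chosen to match the listed examples; the patent does not state how a file is told apart from a directory level.

```python
from urllib.parse import urlsplit


def hierarchy_key(url):
    """Reduce a URL to 'host/level1/level2/...' so each URL level gets its own key.

    Assumed heuristic: a last segment containing '.' is a file name and is dropped.
    """
    parts = urlsplit(url)
    segments = [s for s in parts.path.split("/") if s]
    if segments and "." in segments[-1] and not parts.path.endswith("/"):
        segments.pop()  # e.g. drop 'index.jsp', keep the directory level
    return parts.netloc.lower() + "/" + "".join(s + "/" for s in segments)


# The same level yields the same key; deeper levels yield different keys.
same_level = {
    hierarchy_key("http://a.com/1/"),
    hierarchy_key("http://a.com/1/index.jsp"),
    hierarchy_key("http://a.com/1/indes233.jsp"),
}
print(same_level)
print(hierarchy_key("http://a.com/1/2"), hierarchy_key("http://a.com/1/2/3/4"))
```

With keys of this shape, a classification query is a single dictionary lookup on a short string, which is one way to read the remark that the key values are simple.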
At the same time, because no text analysis or image recognition has to be performed on massive amounts of text content and only the web address is recognized level by level, service response capability can be improved almost without limit.
Step S4: output the recognition and classification result.
In practical applications, internet information can be input through a socket connection for query and classification recognition.
In practical applications, the system provides a socket service based on a custom query protocol that supports batch queries, so any number of website query requests can be submitted at once. The service performance bottleneck of the whole system is then only the network bandwidth used by the client.
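The patent does not describe its custom query protocol, so the following sketch assumes a hypothetical one: the client sends one URL per line and ends the batch with a blank line, and the service answers with one "URL, tab, category" line per query. The names `answer_batch`, `serve_one_batch`, and `lookup`, and the sample knowledge base, are illustrative only.

```python
import socket


def answer_batch(request_text, classify):
    """Build the reply for a batch: one 'url<TAB>category' line per queried URL."""
    urls = [line.strip() for line in request_text.splitlines() if line.strip()]
    return "".join(f"{url}\t{classify(url)}\n" for url in urls)


def serve_one_batch(conn, classify):
    """Read one batch terminated by a blank line, then send all answers at once."""
    buffer = b""
    while b"\n\n" not in buffer:
        chunk = conn.recv(4096)
        if not chunk:
            break  # peer closed the connection before finishing the batch
        buffer += chunk
    conn.sendall(answer_batch(buffer.decode("utf-8"), classify).encode("utf-8"))


# Hypothetical in-memory knowledge base and lookup function.
KB = {"http://a.com/1/": "Law"}


def lookup(url):
    return KB.get(url, "unknown")


# A connected socket pair stands in for a real client/server connection.
client, server = socket.socketpair()
client.sendall(b"http://a.com/1/\nhttp://b.com/\n\n")
serve_one_batch(server, lookup)
server.close()
reply = b""
while chunk := client.recv(4096):
    reply += chunk
client.close()
print(reply.decode("utf-8"))
```

Answering a whole batch in one reply keeps the number of round trips independent of the number of queried URLs, which is consistent with the claim that client bandwidth becomes the limiting factor.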
The URL-based classification and recognition method of this embodiment imports classification information into a knowledge base; performs a preliminary classification of the URLs of internet information according to the classification information in the knowledge base; performs further hierarchical recognition and classification of the preliminarily classified internet information based on URL structure; and outputs the recognition and classification result, thereby classifying and recognizing internet content. Because no text analysis or image recognition has to be performed on massive amounts of text content and only the web address is recognized level by level, service response capability can be improved almost without limit. Because the entire knowledge base is also loaded into memory, the classification engine needs no disk I/O while working and performs only network I/O and memory access, which reduces the consumption of system resources. Based on the concept of hierarchy, pages of the same website can be classified both as having the same content and as having different content. Because the key values are simple, classification queries consume minimal system resources.
Embodiment of a URL-based classification and recognition system
As shown in Fig. 2, a URL-based classification and recognition system comprises:
an import module 1, for importing classification information into a knowledge base;
a preliminary classification module 2, for performing a preliminary classification of the URLs of internet information according to the classification information in the knowledge base;
a hierarchical recognition module 3, for performing further hierarchical recognition and classification of the preliminarily classified internet information based on URL structure;
an output module 4, for outputting the recognition and classification result.
In practical applications, the import module may specifically work as follows: it imports the classification information into the knowledge base for storage and loads the entire knowledge base into memory.
In practical applications, the hierarchical recognition module may specifically work as follows: it analyzes the URL structure of the preliminarily classified internet information and, based on the concept of hierarchy, further classifies the internet information according to differences in URL level.
In practical applications, the URL-based classification and recognition system further comprises: a socket connection module 5, for inputting internet information through a socket connection for query and classification recognition.
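How modules 1 to 4 could fit together is sketched below as a single class with one method per module. The class name `UrlClassifier`, its method names, the dot-based file-name heuristic, and the sample data are all assumptions for illustration; the socket connection module 5 is left out to keep the sketch short.

```python
from urllib.parse import urlsplit


class UrlClassifier:
    """Illustrative wiring of modules 1-4; not the patent's actual implementation."""

    def __init__(self):
        self.keywords = {}  # keyword -> coarse category
        self.levels = {}    # 'host/level1/.../' key -> fine category

    def import_info(self, keywords, levels):
        """Module 1: load classification information into in-memory dicts."""
        self.keywords.update(keywords)
        self.levels.update(levels)

    def preliminary(self, url):
        """Module 2: first keyword found in the URL decides the coarse class."""
        lowered = url.lower()
        return next((c for k, c in self.keywords.items() if k in lowered), None)

    def layered(self, url):
        """Module 3: look up the URL's level key (assumed file-name heuristic)."""
        parts = urlsplit(url)
        segments = [s for s in parts.path.split("/") if s]
        if segments and "." in segments[-1] and not parts.path.endswith("/"):
            segments.pop()
        key = parts.netloc.lower() + "/" + "".join(s + "/" for s in segments)
        return self.levels.get(key)

    def output(self, url):
        """Module 4: return both classification results for one URL."""
        return {
            "url": url,
            "preliminary": self.preliminary(url),
            "layered": self.layered(url),
        }


classifier = UrlClassifier()
classifier.import_info({"news": "News"}, {"a.com/news/": "Front page"})
result = classifier.output("http://a.com/news/index.jsp")
print(result)
```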
Advantages of implementing the invention: the knowledge-base-based URL classification and recognition method of the invention imports classification information into a knowledge base; performs a preliminary classification of the URLs of internet information according to the classification information in the knowledge base; performs further hierarchical recognition and classification of the preliminarily classified internet information based on URL structure; and outputs the recognition and classification result, thereby classifying and recognizing internet content. Because no text analysis or image recognition has to be performed on massive amounts of text content and only the web address is recognized level by level, service response capability can be improved almost without limit. Because the entire knowledge base is also loaded into memory, the classification engine needs no disk I/O while working and performs only network I/O and memory access, which reduces the consumption of system resources. Based on the concept of hierarchy, pages of the same website can be classified both as having the same content and as having different content. Because the key values are simple, classification queries consume minimal system resources.
The above is only a specific embodiment of the invention, but the scope of protection of the invention is not limited to it. Any change or substitution that a person skilled in the art could readily conceive within the technical scope disclosed by the invention shall fall within the scope of protection of the invention. The scope of protection of the invention shall therefore be determined by the scope of the claims.

Claims (10)

1. A knowledge-base-based URL classification and recognition method, characterized in that the method comprises the following steps:
importing classification information into a knowledge base;
performing a preliminary classification of the URLs of internet information according to the classification information in the knowledge base;
performing further hierarchical recognition and classification of the preliminarily classified internet information based on URL structure;
outputting the recognition and classification result.
2. The URL-based classification and recognition method according to claim 1, characterized in that importing classification information into the knowledge base may specifically be implemented as: importing the classification information into the knowledge base for storage, and loading the entire knowledge base into memory.
3. The URL-based classification and recognition method according to claim 1, characterized in that performing the preliminary classification of the URLs of internet information according to the classification information base may specifically be implemented as: according to a keyword in the classification information base, performing a preliminary classification of the URLs of internet information that contain the keyword.
4. The URL-based classification and recognition method according to claim 1, characterized in that performing further hierarchical recognition and classification of the preliminarily classified internet information based on URL structure may specifically be implemented as: analyzing the URL structure of the preliminarily classified internet information and, based on the concept of hierarchy, further classifying the internet information according to differences in URL level.
5. The URL-based classification and recognition method according to claim 1, characterized in that importing classification information into the knowledge base may specifically be implemented as: importing plaintext classification information into the knowledge base and/or importing ciphertext classification information into the knowledge base.
6. The URL-based classification and recognition method according to any one of claims 1 to 5, characterized in that the method further comprises the following step: inputting internet information through a socket connection for query and classification recognition.
7. A URL-based classification and recognition system, characterized in that the system comprises:
an import module, for importing classification information into a knowledge base;
a preliminary classification module, for performing a preliminary classification of the URLs of internet information according to the classification information in the knowledge base;
a hierarchical recognition module, for performing further hierarchical recognition and classification of the preliminarily classified internet information based on URL structure;
an output module, for outputting the recognition and classification result.
8. The URL-based classification and recognition system according to claim 7, characterized in that the import module may specifically work as follows: it imports the classification information into the knowledge base for storage and loads the entire knowledge base into memory.
9. The URL-based classification and recognition system according to claim 8, characterized in that the hierarchical recognition module may specifically work as follows: it analyzes the URL structure of the preliminarily classified internet information and, based on the concept of hierarchy, further classifies the internet information according to differences in URL level.
10. The URL-based classification and recognition system according to any one of claims 7 to 9, characterized in that the system further comprises: a socket connection module, for inputting internet information through a socket connection for query and classification recognition.
CN201510280344.7A 2015-05-28 2015-05-28 Knowledge-base-based URL classification and recognition method and system Pending CN106294443A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510280344.7A CN106294443A (en) 2015-05-28 2015-05-28 Knowledge-base-based URL classification and recognition method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510280344.7A CN106294443A (en) 2015-05-28 2015-05-28 Knowledge-base-based URL classification and recognition method and system

Publications (1)

Publication Number Publication Date
CN106294443A true CN106294443A (en) 2017-01-04

Family

ID=57635575

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510280344.7A Pending CN106294443A (en) 2015-05-28 2015-05-28 Knowledge-base-based URL classification and recognition method and system

Country Status (1)

Country Link
CN (1) CN106294443A (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102567494A (en) * 2011-12-22 2012-07-11 北京亿赞普网络技术有限公司 Website classification method and device
CN102955810A (en) * 2011-08-26 2013-03-06 中国移动通信集团公司 Webpage classification method and device
US20140136569A1 (en) * 2012-11-09 2014-05-15 Microsoft Corporation Taxonomy Driven Commerce Site

Similar Documents

Publication Publication Date Title
US9449271B2 (en) Classifying resources using a deep network
CN102063476B (en) Video searching method and system
JP5588981B2 (en) Providing posts to discussion threads in response to search queries
CN101908071B (en) Method and device thereof for improving search efficiency of search engine
CN105765573B (en) Improvements in website traffic optimization
US8682882B2 (en) System and method for automatically identifying classified websites
CN103294815B (en) Based on key class and there are a search engine device and method of various presentation modes
US20080077569A1 (en) Integrated Search Service System and Method
CN106339502A (en) Modeling recommendation method based on user behavior data fragmentation cluster
CN104021125B (en) A kind of method, system and a kind of search engine of search engine sequence
US8712999B2 (en) Systems and methods for online search recirculation and query categorization
CN101261629A (en) Specific information searching method based on automatic classification technology
CN101477554A (en) User interest based personalized meta search engine and search result processing method
CN102037464A (en) Search results with most clicked next objects
CN1930566A (en) Systems and methods for search query processing using trend analysis
CN102567494B (en) Website classification method and device
EP2460095A1 (en) Keyword assignment to a web page
CN102930038A (en) Combined method of search result similar items and system of the same
KR102601545B1 (en) Geographic position point ranking method, ranking model training method and corresponding device
CN104598604A (en) Browsing method of website navigation applied in various browsers
TW201220097A (en) capable of performing relevancy processing for at least one product corresponding to product identifiers referenced in relevant web pages
CN108021715A (en) Isomery tag fusion system based on semantic structure signature analysis
CN109857952A (en) A kind of search engine and method for quickly retrieving with classification display
CN106919703A (en) Film information searching method and device
CN104123321B (en) A kind of determining method and device for recommending picture

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20170104
