CN106294443A

CN106294443A - A URL classification and recognition method and system based on a knowledge base

Info

Publication number: CN106294443A
Application number: CN201510280344.7A
Authority: CN
Inventors: 王栋
Original assignee: Shanghai Le Le Mdt Infotech Ltd
Current assignee: Shanghai Le Le Mdt Infotech Ltd
Priority date: 2015-05-28
Filing date: 2015-05-28
Publication date: 2017-01-04

Abstract

The invention discloses a URL classification and recognition method based on a knowledge base. The method imports classification information into a knowledge base; performs preliminary classification of the URLs of Internet information according to the classification information in the knowledge base; performs further hierarchical recognition and classification of the preliminarily classified Internet information based on URL structure; and outputs the recognition and classification results. Internet content is thereby classified and recognized. Because no text analysis or image recognition of massive text content is required, only hierarchical recognition of web addresses, service response capacity can be raised almost without limit. The entire knowledge base is also loaded into memory, so no hard-disk I/O is needed, only network I/O and memory access, which reduces the consumption of system resources. Because the approach is hierarchical, content within the same website can be placed in the same category or in different categories. Because the keys are simple, classification queries consume minimal system resources.

Description

A URL classification and recognition method and system based on a knowledge base

Technical field

The present invention relates to the technical field of Internet applications, and in particular to a URL classification and recognition method and system based on a knowledge base.

Background art

Many search engine services, such as Baidu and Google, provide search over information accessible via the Internet. These services allow users to search for pages of interest, such as news pages. After a user submits a search request containing search terms, the search engine service identifies web pages that may be relevant to those terms. The keywords of any particular web page can be identified using various well-known information retrieval techniques, for example by identifying words in the title, words provided in the page's metadata, highlighted words, and so on. The search engine service can generate a relevance score indicating how relevant a page's information is to the search request, based on the closeness of each match, the page's popularity, and so on. The service then displays links to those pages to the user in ranked order.

Although a search engine service may return many web pages as search results, presenting them in ranked order can make it difficult for users to actually find the pages that particularly interest them. Because the pages presented first are likely to concern popular topics, a user interested in an obscure topic may need to browse many pages of search results to find a page of interest. To make it easier for users to find pages of interest, the pages in the search results can be organized and presented by category. For example, if a user submits the search request "court battles", the search results may include pages classified as sports-related or law-related. The user may prefer to be shown a list of categories first, so that the user can select the category of pages of interest. For example, the user might first be shown an indication that the search results have been classified into sports-related and law-related pages. The user can then select the law-related category to view the law-related pages. By contrast, because sports-related pages are more popular than law-related pages, if the most popular pages are presented first, the user may have to browse many pages to find the law-related ones.

Manually classifying the millions of web pages currently available is impractical. Although automatic classification techniques have been used to classify text based on its content, those techniques are generally unsuitable for classifying web pages. Web pages include noisy content, such as advertisements or navigation bars, that is not directly related to the main topic of the page. Because traditional text-based classification techniques use such noisy content when classifying a web page, they can classify the page incorrectly.

Existing web page classification techniques rely mainly on analyzing and recognizing page content. This approach is not accurate enough to be practical, and its response speed is unreliable under a large number of highly concurrent requests.

Summary of the invention

In view of the above shortcomings in the current field of Internet applications, the present invention provides a URL classification and recognition method and system based on a knowledge base, which classifies content through hierarchical recognition of web addresses and thereby improves service response capacity.

To achieve the above object, embodiments of the invention adopt the following technical solution:

A URL classification and recognition method based on a knowledge base, comprising the following steps:

importing classification information into a knowledge base;

performing preliminary classification of the URLs of Internet information according to the classification information in the knowledge base;

performing further hierarchical recognition and classification of the preliminarily classified Internet information based on URL structure;

outputting the recognition and classification results.

According to one aspect of the invention, importing classification information into the knowledge base may specifically comprise: importing the classification information into the knowledge base for storage, and loading the entire knowledge base into memory.

According to one aspect of the invention, performing preliminary classification of the URLs of Internet information according to the classification information base may specifically comprise: according to the keywords in the classification information base, performing preliminary classification of the URLs of Internet information that contain those keywords.

According to one aspect of the invention, performing further hierarchical recognition and classification of the preliminarily classified Internet information based on URL structure may specifically comprise: analyzing the URL structure of the preliminarily classified Internet information and, following a hierarchical approach, further classifying the Internet information according to differences in URL level.

According to one aspect of the invention, importing classification information into the knowledge base may specifically comprise: importing plaintext classification information and/or ciphertext classification information into the knowledge base.

According to one aspect of the invention, the URL-based classification and recognition method further comprises the following step: inputting Internet information over a socket connection for querying and classification.

A URL-based classification and recognition system, comprising:

an import module, for importing classification information into a knowledge base;

a preliminary classification module, for performing preliminary classification of the URLs of Internet information according to the classification information in the knowledge base;

a hierarchical recognition module, for performing further hierarchical recognition and classification of the preliminarily classified Internet information based on URL structure;

an output module, for outputting the recognition and classification results.

According to one aspect of the invention, the import module may specifically operate as follows: it imports the classification information into the knowledge base for storage, and loads the entire knowledge base into memory.

According to one aspect of the invention, the hierarchical recognition module may specifically operate as follows: it analyzes the URL structure of the preliminarily classified Internet information and, following a hierarchical approach, further classifies the Internet information according to differences in URL level.

According to one aspect of the invention, the URL-based classification and recognition system further comprises: a socket connection module, for inputting Internet information over a socket connection for querying and classification.

Advantages of implementing the invention: the URL classification and recognition method based on a knowledge base according to the invention imports classification information into a knowledge base; performs preliminary classification of the URLs of Internet information according to the classification information in the knowledge base; performs further hierarchical recognition and classification of the preliminarily classified Internet information based on URL structure; and outputs the recognition and classification results. Internet content is thereby classified and recognized. Because no text analysis or image recognition of massive text content is required, only hierarchical recognition of web addresses, service response capacity can be raised almost without limit. The entire knowledge base is also loaded into memory, so the classification engine needs no hard-disk I/O while it works, only network I/O and memory access, which reduces the consumption of system resources. Because the approach is hierarchical, content within the same website can be placed in the same category or in different categories. Because the keys are simple, classification queries consume minimal system resources.

Brief description of the drawings

To explain the technical solutions in the embodiments of the invention more clearly, the drawings required by the embodiments are briefly introduced below. The drawings described below evidently show only some embodiments of the invention; a person of ordinary skill in the art can obtain other drawings from them without creative effort.

Fig. 1 is a schematic diagram of a URL classification and recognition method based on a knowledge base according to the invention;

Fig. 2 is a schematic structural diagram of a URL classification and recognition system based on a knowledge base according to the invention.

Detailed description of the invention

The technical solutions in the embodiments of the invention are described clearly and completely below with reference to the drawings in the embodiments. The described embodiments are evidently only some, not all, of the embodiments of the invention. All other embodiments obtained by a person of ordinary skill in the art from the embodiments of the invention without creative effort fall within the scope of protection of the invention.

As shown in Fig. 1, a URL classification and recognition method based on a knowledge base comprises the following steps:

Step S1: import classification information into a knowledge base;

In step S1, importing classification information into the knowledge base may specifically comprise: importing the classification information into the knowledge base for storage, and loading the entire knowledge base into memory. Because the knowledge base is loaded into memory before use, serving a query involves only network access and memory access, with no hard-disk access. This reduces resource usage and can greatly improve the server's response speed.

In practical applications, plaintext classification information and/or ciphertext classification information may be imported into the knowledge base.

In practical applications, the classification information may specifically be obtained as follows: information is crawled from the Internet, the crawled information is then classified by keyword to obtain classification information, and a classification information base is built from it.
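As a minimal sketch of the loading step described above, the Python snippet below reads classification records into an in-memory dictionary once, so that later lookups touch only memory. The `key,category` record format, the `load_knowledge_base` name, and the sample data are assumptions made for illustration; the patent does not specify a storage format.

```python
import csv
import io


def load_knowledge_base(source):
    """Read 'key,category' rows from a text stream into an in-memory dict."""
    kb = {}
    for row in csv.reader(source):
        if len(row) != 2:
            continue  # skip blank or malformed rows
        key, category = (field.strip() for field in row)
        if key and category:
            kb[key] = category
    return kb


# Hypothetical sample data standing in for an imported classification file.
SAMPLE = "news,News\nsports,Sports\n\nlaw,Law\n"
kb = load_knowledge_base(io.StringIO(SAMPLE))
```

After this one-time load, every query is a dictionary lookup, which is the property the description relies on when it says no hard-disk access is needed while serving.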

Step S2: perform preliminary classification of the URLs of Internet information according to the classification information in the knowledge base;

In step S2, performing preliminary classification of the URLs of Internet information according to the classification information in the knowledge base may specifically comprise: according to the keywords in the classification information base, performing preliminary classification of the URLs of Internet information that contain those keywords. For example, using the keyword "news" in the classification information base, the URLs of news-related Internet information are classified, so that Internet information whose URL contains "news" is grouped into one category.
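The keyword-based preliminary classification can be sketched as a substring check against the URL. This is only an illustration: the `KEYWORD_CATEGORIES` table, the `preliminary_classify` name, and the choice to match against the host and path (rather than the whole URL) are assumptions, not details given in the patent.

```python
from urllib.parse import urlsplit

# Hypothetical keyword table; in the described method it comes from the knowledge base.
KEYWORD_CATEGORIES = {"news": "News", "sports": "Sports"}


def preliminary_classify(url, keyword_categories=KEYWORD_CATEGORIES):
    """Return the category of the first keyword found in the URL's host or path."""
    parts = urlsplit(url)
    haystack = (parts.netloc + parts.path).lower()
    for keyword, category in keyword_categories.items():
        if keyword in haystack:
            return category
    return None  # no keyword matched; the URL stays unclassified
```

With this table, `preliminary_classify("http://news.a.com/1/")` falls into the "News" group because the host contains the keyword, while a URL with no keyword is left unclassified.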

Step S3: perform further hierarchical recognition and classification of the preliminarily classified Internet information based on URL structure;

In step S3, performing further hierarchical recognition and classification of the preliminarily classified Internet information based on URL structure may specifically comprise: analyzing the URL structure of the preliminarily classified Internet information and, following a hierarchical approach, further classifying the Internet information according to differences in URL level. Because the approach is hierarchical, content within the same website can be placed in the same category or in different categories.

In practical applications, classification can proceed as follows:

http://a.com/1/, http://a.com/1/index.jsp, and http://a.com/1/indes233.jsp are treated as the same class.

http://a.com/1/2/ and http://a.com/1/ are treated as different classes.

http://a.com/1/ and http://a.com/1/2 are treated as different classes, as are http://a.com/1/2 and http://a.com/1/2/3/4.

Thus, because the keys are simple, classification queries consume minimal system resources.
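One way to read the URL examples above is that each URL reduces to a key made of its host and directory levels, with a trailing page file ignored. The sketch below follows that reading. The `hierarchy_key` and `classify_by_level` names, the rule that a final segment containing a dot is a page file, and the category labels are all assumptions for illustration.

```python
from urllib.parse import urlsplit


def hierarchy_key(url):
    """Map a URL to (host, dir1, dir2, ...), dropping a trailing page file."""
    parts = urlsplit(url)
    segments = [s for s in parts.path.split("/") if s]
    # Assumption: a final segment with a dot and no trailing slash is a page
    # file (such as index.jsp), not a directory level.
    if segments and "." in segments[-1] and not parts.path.endswith("/"):
        segments.pop()
    return (parts.netloc.lower(), *segments)


# Hypothetical knowledge-base entries keyed by hierarchy level.
LEVEL_CATEGORIES = {
    ("a.com", "1"): "category-A",
    ("a.com", "1", "2"): "category-B",
}


def classify_by_level(url, table=LEVEL_CATEGORIES):
    """Single dictionary lookup on the hierarchy key; None if the level is unknown."""
    return table.get(hierarchy_key(url))
```

Under these assumptions, `http://a.com/1/` and `http://a.com/1/index.jsp` share the key `("a.com", "1")`, while `http://a.com/1/2` gets the distinct key `("a.com", "1", "2")`, so each query costs one in-memory lookup.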

At the same time, because no text analysis or image recognition of massive text content is required, only hierarchical recognition of web addresses, service response capacity can be raised almost without limit.

Step S4: output the recognition and classification results.

In practical applications, Internet information can be input over a socket connection for querying and classification.

In practical applications, the system provides a socket service based on a custom query protocol and supports batch queries, so any number of website query requests can be submitted at once. The service performance bottleneck of the whole system is then only the network bandwidth used by the client.
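The patent does not describe its custom query protocol, so the sketch below assumes a very simple one purely for illustration: the client sends one URL per line, a blank line ends the batch, and the server answers each URL with a `url<TAB>category` line. The `CATEGORIES` table, `handle_batch`, `QueryHandler`, and `serve` are hypothetical names; only the standard-library `socketserver` calls are real.

```python
import io
import socketserver

# Hypothetical in-memory table; a real deployment would load the knowledge base here.
CATEGORIES = {"http://a.com/1/": "News", "http://a.com/1/2/": "Sports"}


def classify(url):
    return CATEGORIES.get(url, "unknown")


def handle_batch(rfile, wfile, classify_fn=classify):
    """Answer one batch: read URLs line by line until a blank line or EOF."""
    for raw in rfile:
        url = raw.decode("utf-8").strip()
        if not url:
            break  # assumed protocol: a blank line marks the end of the batch
        wfile.write(f"{url}\t{classify_fn(url)}\n".encode("utf-8"))


class QueryHandler(socketserver.StreamRequestHandler):
    def handle(self):
        handle_batch(self.rfile, self.wfile)


def serve(host="127.0.0.1", port=9000):
    """Start the socket service; blocks until interrupted."""
    with socketserver.ThreadingTCPServer((host, port), QueryHandler) as server:
        server.serve_forever()


# Local demonstration of the batch handler without opening a socket.
reply = io.BytesIO()
handle_batch(io.BytesIO(b"http://a.com/1/\nhttp://b.com/\n\n"), reply)
```

Keeping `handle_batch` separate from the socket handler lets the batch logic be exercised with in-memory streams; calling `serve()` would expose the same logic over TCP.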

The URL-based classification and recognition method of this embodiment imports classification information into a knowledge base; performs preliminary classification of the URLs of Internet information according to the classification information in the knowledge base; performs further hierarchical recognition and classification of the preliminarily classified Internet information based on URL structure; and outputs the recognition and classification results. Internet content is thereby classified and recognized. Because no text analysis or image recognition of massive text content is required, only hierarchical recognition of web addresses, service response capacity can be raised almost without limit. The entire knowledge base is also loaded into memory, so the classification engine needs no hard-disk I/O while it works, only network I/O and memory access, which reduces the consumption of system resources. Because the approach is hierarchical, content within the same website can be placed in the same category or in different categories. Because the keys are simple, classification queries consume minimal system resources.

Embodiment of a URL-based classification and recognition system

As shown in Fig. 2, a URL-based classification and recognition system comprises:

an import module 1, for importing classification information into a knowledge base;

a preliminary classification module 2, for performing preliminary classification of the URLs of Internet information according to the classification information in the knowledge base;

a hierarchical recognition module 3, for performing further hierarchical recognition and classification of the preliminarily classified Internet information based on URL structure;

an output module 4, for outputting the recognition and classification results.

In practical applications, the import module may specifically operate as follows: it imports the classification information into the knowledge base for storage, and loads the entire knowledge base into memory.

In practical applications, the hierarchical recognition module may specifically operate as follows: it analyzes the URL structure of the preliminarily classified Internet information and, following a hierarchical approach, further classifies the Internet information according to differences in URL level.

In practical applications, the URL-based classification and recognition system further comprises: a socket connection module 5, for inputting Internet information over a socket connection for querying and classification.
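The module structure above might be sketched as one small class whose methods mirror modules 1 to 4 (the socket connection module 5 is left out here). This is an illustrative arrangement only: the `UrlClassifier` class, its method names, the two lookup tables, and the sample categories are assumptions, and the rule for dropping a trailing page file is a guess at how URL levels are derived.

```python
from dataclasses import dataclass, field
from urllib.parse import urlsplit


@dataclass
class UrlClassifier:
    keywords: dict = field(default_factory=dict)  # keyword -> coarse category
    levels: dict = field(default_factory=dict)    # (host, *dirs) -> fine category

    def import_info(self, keywords, levels):
        """Import module: store classification information in memory."""
        self.keywords.update(keywords)
        self.levels.update(levels)

    def preliminary(self, url):
        """Preliminary classification module: keyword match on the URL."""
        lowered = url.lower()
        return next((c for k, c in self.keywords.items() if k in lowered), None)

    def layered(self, url):
        """Hierarchical recognition module: look up the URL's directory levels."""
        parts = urlsplit(url)
        dirs = [s for s in parts.path.split("/") if s]
        # Assumption: a final segment with a dot is a page file, not a level.
        if dirs and "." in dirs[-1] and not parts.path.endswith("/"):
            dirs.pop()
        return self.levels.get((parts.netloc.lower(), *dirs))

    def output(self, url):
        """Output module: combine both stages into one result record."""
        coarse = self.preliminary(url)
        fine = self.layered(url) if coarse is not None else None
        return {"url": url, "category": coarse, "subcategory": fine}


# Hypothetical sample data.
classifier = UrlClassifier()
classifier.import_info(
    {"news": "News"},
    {("news.a.com", "1"): "Domestic", ("news.a.com", "1", "2"): "World"},
)
```

Here the hierarchical stage runs only on URLs that passed the preliminary stage, matching the order of the steps in the description.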

The above is only a specific embodiment of the invention, but the scope of protection of the invention is not limited to it. Any change or replacement that a person skilled in the art could readily conceive within the technical scope disclosed by the invention shall fall within the scope of protection of the invention. The scope of protection of the invention shall therefore be determined by the scope of the claims.

Claims

1. A URL classification and recognition method based on a knowledge base, characterized in that the method comprises the following steps:

importing classification information into a knowledge base;

performing preliminary classification of the URLs of Internet information according to the classification information in the knowledge base;

performing further hierarchical recognition and classification of the preliminarily classified Internet information based on URL structure;

outputting the recognition and classification results.

2. The URL-based classification and recognition method according to claim 1, characterized in that importing classification information into the knowledge base may specifically comprise: importing the classification information into the knowledge base for storage, and loading the entire knowledge base into memory.

3. The URL-based classification and recognition method according to claim 1, characterized in that performing preliminary classification of the URLs of Internet information according to the classification information base may specifically comprise: according to the keywords in the classification information base, performing preliminary classification of the URLs of Internet information that contain those keywords.

4. The URL-based classification and recognition method according to claim 1, characterized in that performing further hierarchical recognition and classification of the preliminarily classified Internet information based on URL structure may specifically comprise: analyzing the URL structure of the preliminarily classified Internet information and, following a hierarchical approach, further classifying the Internet information according to differences in URL level.

5. The URL-based classification and recognition method according to claim 1, characterized in that importing classification information into the knowledge base may specifically comprise: importing plaintext classification information and/or ciphertext classification information into the knowledge base.

6. The URL-based classification and recognition method according to any one of claims 1 to 5, characterized in that the method further comprises the following step: inputting Internet information over a socket connection for querying and classification.

7. A URL-based classification and recognition system, characterized in that the system comprises:

an import module, for importing classification information into a knowledge base;

a preliminary classification module, for performing preliminary classification of the URLs of Internet information according to the classification information in the knowledge base;

a hierarchical recognition module, for performing further hierarchical recognition and classification of the preliminarily classified Internet information based on URL structure;

an output module, for outputting the recognition and classification results.

8. The URL-based classification and recognition system according to claim 7, characterized in that the import module may specifically operate as follows: it imports the classification information into the knowledge base for storage, and loads the entire knowledge base into memory.

9. The URL-based classification and recognition system according to claim 8, characterized in that the hierarchical recognition module may specifically operate as follows: it analyzes the URL structure of the preliminarily classified Internet information and, following a hierarchical approach, further classifies the Internet information according to differences in URL level.

10. The URL-based classification and recognition system according to any one of claims 7 to 9, characterized in that the system further comprises: a socket connection module, for inputting Internet information over a socket connection for querying and classification.