CN106202259A

CN106202259A - A web page information extraction method based on ontology

Info

Publication number: CN106202259A
Application number: CN201610499614.8A
Authority: CN
Inventors: 董雄飞
Original assignee: Hefei Minzhongyixing Software Development Co Ltd
Current assignee: Hefei Minzhongyixing Software Development Co Ltd
Priority date: 2016-06-29
Filing date: 2016-06-29
Publication date: 2016-12-07

Abstract

The invention discloses a web page information extraction method based on ontology. The method uses the vector space model: it first analyzes the word segmentation result of a web page to obtain feature words, next computes the feature weights, then analyzes the topic relevance of the web page in combination with ontology, and finally compares the topic relevance with a threshold preset by the system, thereby extracting the topic information of the web page. The method reduces the computational cost of web page analysis, reduces the omission of web page information, and improves the quality of information retrieval.

Description

A web page information extraction method based on ontology

Technical field

The invention belongs to the field of network methods. More specifically, the invention relates to a web page information extraction method based on ontology.

Background art

With the rapid development of the Internet, the number of pages on the Web is growing at an explosive, exponential rate. Faced with such a huge resource, retrieving and discovering valuable information on the Web has become an important task. Web-based research covers information retrieval, information filtering, information extraction, search engines, web page classification, and so on, and the main object these studies process is web page information. Besides the main content that expresses its topic, a web page also contains noise content unrelated to the topic, such as navigation bars, advertisements, copyright information, and partner links.

Summary of the invention

The problem to be solved by the invention is to provide a web page information extraction method based on ontology.

To achieve this goal, the technical solution adopted by the invention is:

A web page information extraction method based on ontology, comprising the following steps:

(1) Web document preprocessing

Taking the web page from which information is to be extracted as the information source, a topic crawler performs structured analysis, by way of a tag tree, on the page's anchor text, page title, body-text headings, and body text, and processes them into web page text;

(2) Ontology classification

Word segmentation is performed through the interface of the FreeICTCLAS word segmentation system, the words are classified by ontology, and at the same time the frequency with which each feature word occurs in the text is obtained;

(3) Weight computation

According to the vector space model, each web page text is abstracted into a vector, and the weight of each feature keyword of the text is then computed by a formula, the formula being Wi=∑ (Wt*Pt*Wi);

(4) Computing the topic relevance

The topic relevance is analyzed according to the topic relevance formula [formula not reproduced in this text];

(5) Analyzing the topic relevance

The computed topic relevance is compared with the threshold set by the system.
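As a rough illustration of step (1), the sketch below walks the tag tree of an HTML page and sorts its text into anchor text, page title, headings, and body text. It is a minimal sketch under stated assumptions, not the patented crawler: the class `PageTextExtractor`, the function `extract_fields`, and the choice of Python's standard `html.parser` module are all hypothetical and do not appear in the source.

```python
from html.parser import HTMLParser


class PageTextExtractor(HTMLParser):
    """Walks the tag tree and sorts text into four fields (hypothetical helper)."""

    SKIP = {"script", "style"}                           # never page content
    VOID = {"br", "hr", "img", "input", "link", "meta"}  # tags with no end tag
    HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__()
        self.open_tags = []  # currently open tags, i.e. the path from the root
        self.fields = {"title": [], "anchor": [], "heading": [], "body": []}

    def handle_starttag(self, tag, attrs):
        if tag not in self.VOID:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; stray end tags are ignored.
        if tag in self.open_tags:
            while self.open_tags.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if not text or self.SKIP.intersection(self.open_tags):
            return
        if "title" in self.open_tags:
            key = "title"
        elif "a" in self.open_tags:
            key = "anchor"
        elif self.HEADINGS.intersection(self.open_tags):
            key = "heading"
        else:
            key = "body"
        self.fields[key].append(text)


def extract_fields(html):
    """Return a dict with the title, anchor, heading and body text of a page."""
    parser = PageTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.fields
```

The four lists could then be concatenated into the web page text that step (2) segments. Word segmentation itself is not shown here, because the FreeICTCLAS interface is not described in the text.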

Preferably, in step (1), the implementation of the topic crawler is divided into:

1. selecting the training set;

2. obtaining the transition probabilities between the topic categories and the topic classifier from the training set;

3. segmenting the web page into blocks using the VIPS algorithm, which is based on visual features;

4. predicting, on the basis of the web page blocks, the access priority of the URLs in each block.

Preferably, in step (1), the topic crawler determines the access priority of the URLs in the web page as follows:

1. using the information contained in the URL string itself to determine the weight of the URL;

2. using the anchor text information of the URL string to determine the weight of the URL;

3. for the remaining ordinary URLs in a web page block, first using the classifier to judge, from the block's content information, the topic q to which the block belongs, and then assigning these ordinary URLs a weight equal to the product of the similarity between the block and q and the transition probability from q to the target topic;

4. inserting the URLs in the web page block into the crawl queue according to their corresponding weight values, so that URLs with higher weight values are crawled first.
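The queueing rule in items 3 and 4 above can be sketched with a priority queue. This is only an illustration under assumptions: the names `CrawlFrontier` and `ordinary_url_weight` are hypothetical, the similarity and transition probability are taken as already-computed numbers, and skipping URLs that were already queued is an added convenience that the text does not mention.

```python
import heapq
import itertools


class CrawlFrontier:
    """Queue of URLs to crawl, highest weight first (hypothetical helper)."""

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()  # tie-breaker: keeps insertion order
        self._seen = set()                 # assumption: each URL is queued once

    def push(self, url, weight):
        if url in self._seen:
            return
        self._seen.add(url)
        # heapq is a min-heap, so the weight is negated to pop the largest first.
        heapq.heappush(self._heap, (-weight, next(self._counter), url))

    def pop(self):
        _, _, url = heapq.heappop(self._heap)
        return url

    def __len__(self):
        return len(self._heap)


def ordinary_url_weight(block_topic_similarity, transition_probability):
    """Item 3: similarity(block, q) times the probability of moving from q to the target topic."""
    return block_topic_similarity * transition_probability
```

A crawler loop would call `push` for every URL found in a block and `pop` to choose the next page to download.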

Preferably, the construction of the topic classifier in step 2 comprises the following steps:

1) selecting the feature words;

2) computing the weight of each component of the class center vector;

3) determining the topic category to which a web page block to be classified belongs;

4) computing the similarity between a downloaded web page and the topic category.
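One plausible reading of the four steps above is a class-center (centroid) classifier with cosine similarity. The sketch below makes several assumptions that are not in the source: every token is kept as a feature word, component weights are plain average term frequencies, and similarity is the cosine of the angle between vectors. The function names are hypothetical.

```python
import math
from collections import Counter


def centroid(docs):
    """Class center vector: average term frequency over tokenized documents."""
    total = Counter()
    for tokens in docs:
        total.update(tokens)
    return {term: count / len(docs) for term, count in total.items()}


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def classify(tokens, centroids):
    """Return (best_topic, similarity) for a tokenized web page block."""
    vector = Counter(tokens)
    scored = ((topic, cosine(vector, center)) for topic, center in centroids.items())
    return max(scored, key=lambda pair: pair[1])
```

Given per-topic training documents, `centroids = {topic: centroid(docs) for topic, docs in training.items()}` builds the class centers, and `classify` then yields both the topic q of a block and the similarity used for URL weighting.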

Preferably, the weight computation in step (3) should also incorporate term frequency, inverse document frequency, and a normalization factor.
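The text does not spell out how term frequency, inverse document frequency, and the normalization factor are combined, so the following is only a sketch of one common TF-IDF variant: relative term frequency, `log(N / df) + 1` as the inverse document frequency, and L2 length normalization. The function name `tfidf_vectors` and these particular choices are assumptions.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """docs: list of token lists. Returns one L2-normalized TF-IDF dict per document."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for tokens in docs for term in set(tokens))
    vectors = []
    for tokens in docs:
        term_freq = Counter(tokens)
        raw = {
            term: (count / len(tokens)) * (math.log(n_docs / doc_freq[term]) + 1.0)
            for term, count in term_freq.items()
        }
        # Normalization factor: the Euclidean length of the raw vector.
        norm = math.sqrt(sum(w * w for w in raw.values()))
        vectors.append({t: w / norm for t, w in raw.items()} if norm else {})
    return vectors
```

With this weighting, a term that is frequent in one page but rare across the collection receives a larger component than a term that appears everywhere.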

Preferably, in step (5), if the topic relevance is greater than the threshold set by the system, the web page is retained; if the topic relevance is less than the threshold set by the system, it is discarded.
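The retain-or-discard rule of step (5) amounts to a simple filter. In the sketch below the relevance scores are taken as given inputs, since the topic relevance formula is not reproduced in this text. The function name `filter_pages` is hypothetical, and treating a score exactly equal to the threshold as discarded is an assumption, because the text only covers the greater-than and less-than cases.

```python
def filter_pages(relevance, threshold):
    """Split pages into retained and discarded by topic relevance.

    relevance: dict mapping a page URL to its topic relevance score.
    threshold: the system-set threshold.
    """
    retained = {url: score for url, score in relevance.items() if score > threshold}
    # Assumption: a score equal to the threshold is discarded.
    discarded = {url: score for url, score in relevance.items() if score <= threshold}
    return retained, discarded
```

Raising the threshold trades recall for precision: fewer pages are retained, but those that remain are more strongly tied to the target topic.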

Beneficial effect: the invention provides a web page information extraction method based on ontology. The method uses the vector space model: it first analyzes the word segmentation result of a web page to obtain feature words, next computes the feature weights, then analyzes the topic relevance of the web page in combination with ontology, and finally compares the topic relevance with a threshold preset by the system, thereby extracting the topic information of the web page. The method reduces the computational cost of web page analysis, reduces the omission of web page information, and improves the quality of information extraction.

Detailed description of the invention

Fig. 1 is a flow chart of the web page information extraction method based on ontology;

A web page information extraction method based on ontology, characterized in that it comprises the following steps:

(1) Web document preprocessing

Taking the web page from which information is to be extracted as the information source, a topic crawler performs structured analysis, by way of a tag tree, on the page's anchor text, page title, body-text headings, and body text, and processes them into web page text. The implementation of the topic crawler is divided into:

1. selecting the training set;

2. obtaining the transition probabilities between the topic categories and the topic classifier from the training set, the construction of the topic classifier comprising the following steps:

1) selecting the feature words;

2) computing the weight of each component of the class center vector;

4) computing the similarity between a downloaded web page and the topic category;

4. predicting, on the basis of the web page blocks, the access priority of the URLs in each block;

The topic crawler determines the access priority of the URLs in the web page as follows:

4. inserting the URLs in the web page block into the crawl queue according to their corresponding weight values, so that URLs with higher weight values are crawled first;

(2) Ontology classification

(3) Weight computation

According to the vector space model, each web page text is abstracted into a vector, and the weight of each feature keyword of the text is then computed by a formula; the weight computation should also incorporate term frequency, inverse document frequency, and a normalization factor. The formula is Wi=∑ (Wt*Pt*Wi);

(4) Computing the topic relevance

The topic relevance is analyzed according to the topic relevance formula [formula not reproduced in this text];

(5) Analyzing the topic relevance

The computed topic relevance is compared with the threshold set by the system: if the topic relevance is greater than the threshold set by the system, the web page is retained; if the topic relevance is less than the threshold set by the system, it is discarded.

The invention provides a web page information extraction method based on ontology. The method uses the vector space model: it first analyzes the word segmentation result of a web page to obtain feature words, next computes the feature weights, then analyzes the topic relevance of the web page in combination with ontology, and finally compares the topic relevance with a threshold preset by the system, thereby extracting the topic information of the web page. The method reduces the computational cost of web page analysis, reduces the omission of web page information, and improves the quality of information retrieval.

The foregoing is only an embodiment of the invention and does not thereby limit the patent scope of the invention. Any equivalent structure or equivalent process transformation made using the content of the description of the invention, or any direct or indirect application in other related technical fields, is likewise included within the scope of patent protection of the invention.

Claims

1. A web page information extraction method based on ontology, characterized in that it comprises the following steps:

(1) Web document preprocessing

Taking the web page from which information is to be extracted as the information source, a topic crawler performs structured analysis, by way of a tag tree, on the page's anchor text, page title, body-text headings, and body text, and processes them into web page text;

(2) Ontology classification

Word segmentation is performed through the interface of the FreeICTCLAS word segmentation system, the words are classified by ontology, and at the same time the frequency with which each feature word occurs in the text is obtained;

(3) Weight computation

According to the vector space model, each web page text is abstracted into a vector, and the weight of each feature keyword of the text is then computed by a formula, the formula being Wi=∑ (Wt*Pt*Wi);

(4) Computing the topic relevance

The topic relevance is analyzed according to the topic relevance formula [formula not reproduced in this text];

(5) Analyzing the topic relevance

2. The web page information extraction method based on ontology according to claim 1, characterized in that, in step (1), the implementation of the topic crawler is divided into:

1. selecting the training set;

3. The web page information extraction method based on ontology according to claim 1, characterized in that, in step (1), the topic crawler determines the access priority of the URLs in the web page as follows:

3. for the remaining ordinary URLs in a web page block, first using the classifier to judge, from the block's content information, the topic q to which the block belongs, and then assigning these ordinary URLs a weight equal to the product of the similarity between the block and q and the transition probability from q to the target topic;

4. inserting the URLs in the web page block into the crawl queue according to their corresponding weight values, so that URLs with higher weight values are crawled first.

4. The web page information extraction method based on ontology according to claim 2, characterized in that the construction of the topic classifier in step 2 comprises the following steps:

1) selecting the feature words;

2) computing the weight of each component of the class center vector;

4) computing the similarity between a downloaded web page and the topic category.

5. The web page information extraction method based on ontology according to claim 1, characterized in that the weight computation in step (3) should also incorporate term frequency, inverse document frequency, and a normalization factor.

6. The web page information extraction method based on ontology according to claim 1, characterized in that, in step (5), if the topic relevance is greater than the threshold set by the system, the web page is retained; if the topic relevance is less than the threshold set by the system, it is discarded.