CN106202259A - A web page information extraction method based on ontology - Google Patents
A web page information extraction method based on ontology
- Publication number: CN106202259A
- Application number: CN201610499614.8A
- Authority: CN (China)
- Prior art keywords: web page, url, subject, degree, webpage
- Legal status: Withdrawn (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/951—Indexing; Web crawling techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/955—Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
- G06F16/9566—URL specific, e.g. using aliases, detecting broken or misspelled links
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
Abstract
The invention discloses a web page information extraction method based on ontology. The method uses the vector space model: it first analyzes the word segmentation result of a web page to obtain feature words, then calculates feature weights, then analyzes the topic relevance of the web page using ontology concepts, and finally compares the topic relevance with a threshold set by the system, thereby extracting the topic information of the web page. The method reduces the computation required for web page analysis, reduces omissions of web page information, and improves the quality of information retrieval.
Description
Technical field
The invention belongs to the field of network methods, and more specifically relates to a web page information extraction method based on ontology.
Background technology
With the rapid development of the Internet, the number of web pages on the Web is growing explosively at an exponential rate. Faced with such a huge resource, retrieving and discovering valuable information on the Web has become an important task. Web-based research covers information retrieval, information filtering, information extraction, search engines, web page classification, and so on, and the main object that all of this research processes is web page information. In addition to the main content that expresses its topic, a web page also contains noise content unrelated to the topic, such as navigation bars, advertisements, copyright information, and peer links.
Summary of the invention
The problem to be solved by the invention is to provide a web page information extraction method based on ontology.
To achieve this objective, the invention adopts the following technical solution:
A web page information extraction method based on ontology, comprising the following steps:
(1) Web document preprocessing
Taking the web page from which information is to be extracted as the information source, a topic crawler performs structured analysis of the page's anchor text, page title, body headings, and body text in the form of a tag tree, and processes them into web page text;
(2) Ontology classification
Word segmentation is performed through the interface of the word segmentation system FreeICTCLAS, the words are classified by ontology, and the frequency with which each feature word occurs in the text is obtained at the same time;
(3) Weight calculation
According to the vector space model, each web page text is abstracted into a vector, and the weight of each feature keyword of the text is then calculated by a formula; the formula is Wi=∑ (Wt*Pt*Wi);
(4) Topic relevance calculation
The topic relevance is analyzed according to the topic relevance formula;
(5) Topic relevance analysis
The calculated topic relevance is compared with the threshold set by the system.
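Steps (2) to (5) can be illustrated with a minimal Python sketch. It is not the patented implementation: the `ONTOLOGY` mapping, the use of raw concept counts as weights, and cosine similarity as the relevance measure are all assumptions made for illustration, since the text does not reproduce the relevance formula or define the symbols in the weight formula. Word segmentation is assumed to have been done already, so the input is a list of tokens.

```python
import math
from collections import Counter

# Hypothetical ontology: maps a surface word to the concept it belongs to.
ONTOLOGY = {
    "crawler": "web_crawling",
    "spider": "web_crawling",
    "url": "web_crawling",
    "index": "information_retrieval",
    "query": "information_retrieval",
}


def concept_frequencies(tokens):
    """Step (2): map tokens to ontology concepts and count occurrences."""
    return Counter(ONTOLOGY[t] for t in tokens if t in ONTOLOGY)


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(k, 0.0) for k, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def keep_page(tokens, topic_vector, threshold):
    """Steps (3)-(5): build the page vector, score it, compare with the threshold."""
    # Assumption: raw concept counts stand in for the feature weights.
    page_vector = dict(concept_frequencies(tokens))
    return cosine(page_vector, topic_vector) > threshold


topic = {"web_crawling": 1.0}
on_topic = ["crawler", "spider", "url", "the"]
off_topic = ["query", "index", "the"]
print(keep_page(on_topic, topic, threshold=0.5))
print(keep_page(off_topic, topic, threshold=0.5))
```

In this sketch the first page maps entirely onto the target concept and is kept, while the second shares no concept with the target and is rejected.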
Preferably, in step (1), the topic crawler is implemented as follows:
1. selecting a training set;
2. obtaining the transition probabilities between topic categories and a topic classifier from the training set;
3. segmenting the web page into blocks using the vision-based VIPS algorithm;
4. predicting, on the basis of the page blocks, the access priority of the URLs in each block.
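As a rough illustration of item 2, the transition probabilities between topic categories could be estimated by counting labeled links in the training set. The sketch below assumes, hypothetically, that the training set is available as `(source_topic, target_topic)` pairs, one per hyperlink; the text does not specify the training data format or the estimation method.

```python
from collections import Counter, defaultdict


def transition_probabilities(link_pairs):
    """Estimate P(target topic | source topic) from labeled link pairs."""
    counts = defaultdict(Counter)
    for source_topic, target_topic in link_pairs:
        counts[source_topic][target_topic] += 1
    probs = {}
    for source_topic, targets in counts.items():
        total = sum(targets.values())
        probs[source_topic] = {t: n / total for t, n in targets.items()}
    return probs


# Hypothetical training links: (topic of linking page, topic of linked page).
training_links = [
    ("sports", "sports"),
    ("sports", "sports"),
    ("sports", "finance"),
    ("news", "finance"),
]
probs = transition_probabilities(training_links)
print(probs["sports"]["finance"])
```

Each row of the resulting table sums to one, so it can be read as the chance that a link from a page of one topic leads to a page of another.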
Preferably, in step (1), the topic crawler determines the access priority of the URLs in the web page as follows:
1. the weight of a URL is determined from the information contained in the URL string itself;
2. the weight of a URL is determined from the anchor text of the URL string;
3. for the remaining ordinary URLs in a page block, the classifier first uses the content of the page block to determine the topic q to which the block belongs, and the weight of these ordinary URLs is then set to the product of the similarity between the page block and q and the transition probability from q to the target topic;
4. the URLs in the page block are inserted into the queue to be crawled in order of their weights, and URLs with higher weights are crawled first.
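Items 3 and 4 amount to a weighted priority queue. The following sketch shows one way to realize that with Python's standard `heapq` module; the class name `CrawlFrontier`, the function `ordinary_url_weight`, and the example URLs and scores are hypothetical and chosen only for illustration.

```python
import heapq
import itertools


class CrawlFrontier:
    """Queue of URLs to crawl; the URL with the highest weight is popped first."""

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()  # tie-breaker keeps insertion order

    def push(self, url, weight):
        # heapq is a min-heap, so the weight is negated.
        heapq.heappush(self._heap, (-weight, next(self._counter), url))

    def pop(self):
        _, _, url = heapq.heappop(self._heap)
        return url

    def __len__(self):
        return len(self._heap)


def ordinary_url_weight(block_topic_similarity, transition_probability):
    """Item 3: similarity(block, q) times P(q -> target topic)."""
    return block_topic_similarity * transition_probability


frontier = CrawlFrontier()
frontier.push("http://example.com/a", ordinary_url_weight(0.9, 0.2))
frontier.push("http://example.com/b", ordinary_url_weight(0.6, 0.5))
frontier.push("http://example.com/c", ordinary_url_weight(0.4, 0.1))
order = [frontier.pop() for _ in range(len(frontier))]
print(order)
```

Here the block containing `/b` is less similar to its topic than the block containing `/a`, but its topic is more likely to lead to the target topic, so `/b` is crawled first.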
Preferably, the construction of the topic classifier in item 2 comprises the following steps:
1) selecting feature words;
2) determining the weight of each component of the class center vector;
3) determining the topic category to which the page block to be classified belongs;
4) calculating the similarity between the downloaded web page and the topic category.
Preferably, the weight calculation in step (3) should also take into account term frequency, inverse document frequency, and a normalization factor.
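Term frequency, inverse document frequency, and a normalization factor are the ingredients of the standard TF-IDF weighting scheme. A minimal sketch follows, assuming the common variant with relative term frequency, `log(N / df)` as the inverse document frequency, and division by the vector's Euclidean length as normalization; the text does not say which variant is intended.

```python
import math
from collections import Counter


def tfidf_vectors(documents):
    """Return one length-normalized TF-IDF vector (dict) per tokenized document."""
    n_docs = len(documents)
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for doc in documents for term in set(doc))
    vectors = []
    for doc in documents:
        counts = Counter(doc)
        raw = {
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        }
        # Normalization factor: Euclidean length of the raw weight vector.
        norm = math.sqrt(sum(w * w for w in raw.values()))
        vectors.append({t: (w / norm if norm else 0.0) for t, w in raw.items()})
    return vectors


docs = [
    ["crawler", "url", "url"],
    ["crawler", "index"],
    ["crawler", "query", "index"],
]
vectors = tfidf_vectors(docs)
print(vectors[0])
```

Because "crawler" occurs in every document, its inverse document frequency is zero and it contributes nothing to any vector, which is the intended effect of the IDF term.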
Preferably, in step (5), if the topic relevance is greater than the threshold set by the system, the web page is retained; if the topic relevance is less than the threshold set by the system, the web page is discarded.
Beneficial effects: the invention provides a web page information extraction method based on ontology. The method uses the vector space model: it first analyzes the word segmentation result of a web page to obtain feature words, then calculates feature weights, then analyzes the topic relevance of the web page using ontology concepts, and finally compares the topic relevance with a threshold set by the system, thereby extracting the topic information of the web page. The method reduces the computation required for web page analysis, reduces omissions of web page information, and improves the quality of information extraction.
Detailed description of the invention
Fig. 1 is a flow chart of a web page information extraction method based on ontology;
A web page information extraction method based on ontology, characterized by comprising the following steps:
(1) Web document preprocessing
Taking the web page from which information is to be extracted as the information source, a topic crawler performs structured analysis of the page's anchor text, page title, body headings, and body text in the form of a tag tree, and processes them into web page text. The topic crawler is implemented as follows:
1. selecting a training set;
2. obtaining the transition probabilities between topic categories and a topic classifier from the training set, where the construction of the topic classifier comprises the following steps:
1) selecting feature words;
2) determining the weight of each component of the class center vector;
3) determining the topic category to which the page block to be classified belongs;
4) calculating the similarity between the downloaded web page and the topic category;
3. segmenting the web page into blocks using the vision-based VIPS algorithm;
4. predicting, on the basis of the page blocks, the access priority of the URLs in each block;
The topic crawler determines the access priority of the URLs in the web page as follows:
1. the weight of a URL is determined from the information contained in the URL string itself;
2. the weight of a URL is determined from the anchor text of the URL string;
3. for the remaining ordinary URLs in a page block, the classifier first uses the content of the page block to determine the topic q to which the block belongs, and the weight of these ordinary URLs is then set to the product of the similarity between the page block and q and the transition probability from q to the target topic;
4. the URLs in the page block are inserted into the queue to be crawled in order of their weights, and URLs with higher weights are crawled first;
(2) Ontology classification
Word segmentation is performed through the interface of the word segmentation system FreeICTCLAS, the words are classified by ontology, and the frequency with which each feature word occurs in the text is obtained at the same time;
(3) Weight calculation
According to the vector space model, each web page text is abstracted into a vector, and the weight of each feature keyword of the text is then calculated by a formula; the weight calculation should also take into account term frequency, inverse document frequency, and a normalization factor; the formula is Wi=∑ (Wt*Pt*Wi);
(4) Topic relevance calculation
The topic relevance is analyzed according to the topic relevance formula;
(5) Topic relevance analysis
The calculated topic relevance is compared with the threshold set by the system; if the topic relevance is greater than the threshold set by the system, the web page is retained; if the topic relevance is less than the threshold set by the system, the web page is discarded.
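The preprocessing of step (1) can be sketched with Python's standard `html.parser` module, which reports tags in document order so that the stack of open tags gives the current position in the tag tree. The class `PageFields`, its field names, and the sample HTML are hypothetical; the sketch only separates anchor text, page title, headings, and body text, and does not cover crawling or VIPS segmentation.

```python
from html.parser import HTMLParser


class PageFields(HTMLParser):
    """Collect anchor text, page title, headings, and body text from HTML."""

    HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}
    SKIPPED = {"script", "style"}
    VOID = {"br", "hr", "img", "input", "link", "meta"}  # tags with no end tag

    def __init__(self):
        super().__init__()
        self.stack = []  # currently open tags, i.e. the path in the tag tree
        self.fields = {"anchor": [], "title": [], "heading": [], "body": []}

    def handle_starttag(self, tag, attrs):
        if tag not in self.VOID:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; tolerate unclosed inner tags.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        open_tags = set(self.stack)
        if not text or self.SKIPPED & open_tags:
            return
        if "title" in open_tags:
            self.fields["title"].append(text)
        elif "a" in open_tags:
            self.fields["anchor"].append(text)
        elif self.HEADINGS & open_tags:
            self.fields["heading"].append(text)
        else:
            self.fields["body"].append(text)


html = """<html><head><title>Focused crawling</title><style>p{color:red}</style></head>
<body><h1>Overview</h1><p>Topic crawlers rank links.</p>
<a href="/next">Next page</a></body></html>"""
parser = PageFields()
parser.feed(html)
print(parser.fields)
```

The four lists can then be joined into the web page text that the later steps segment into words and weight.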
The invention provides a web page information extraction method based on ontology. The method uses the vector space model: it first analyzes the word segmentation result of a web page to obtain feature words, then calculates feature weights, then analyzes the topic relevance of the web page using ontology concepts, and finally compares the topic relevance with a threshold set by the system, thereby extracting the topic information of the web page. The method reduces the computation required for web page analysis, reduces omissions of web page information, and improves the quality of information retrieval.
The foregoing is only an embodiment of the invention and does not thereby limit the patent scope of the invention. Any equivalent structure or equivalent process transformation made using the content of the description of the invention, or any direct or indirect application in other related technical fields, is likewise included within the scope of patent protection of the invention.
Claims (6)
1. A web page information extraction method based on ontology, characterized by comprising the following steps:
(1) Web document preprocessing
Taking the web page from which information is to be extracted as the information source, a topic crawler performs structured analysis of the page's anchor text, page title, body headings, and body text in the form of a tag tree, and processes them into web page text;
(2) Ontology classification
Word segmentation is performed through the interface of the word segmentation system FreeICTCLAS, the words are classified by ontology, and the frequency with which each feature word occurs in the text is obtained at the same time;
(3) Weight calculation
According to the vector space model, each web page text is abstracted into a vector, and the weight of each feature keyword of the text is then calculated by a formula; the formula is Wi=∑ (Wt*Pt*Wi);
(4) Topic relevance calculation
The topic relevance is analyzed according to the topic relevance formula;
(5) Topic relevance analysis
The calculated topic relevance is compared with the threshold set by the system.
2. The web page information extraction method based on ontology according to claim 1, characterized in that in step (1) the topic crawler is implemented as follows:
1. selecting a training set;
2. obtaining the transition probabilities between topic categories and a topic classifier from the training set;
3. segmenting the web page into blocks using the vision-based VIPS algorithm;
4. predicting, on the basis of the page blocks, the access priority of the URLs in each block.
3. The web page information extraction method based on ontology according to claim 1, characterized in that in step (1) the topic crawler determines the access priority of the URLs in the web page as follows:
1. the weight of a URL is determined from the information contained in the URL string itself;
2. the weight of a URL is determined from the anchor text of the URL string;
3. for the remaining ordinary URLs in a page block, the classifier first uses the content of the page block to determine the topic q to which the block belongs, and the weight of these ordinary URLs is then set to the product of the similarity between the page block and q and the transition probability from q to the target topic;
4. the URLs in the page block are inserted into the queue to be crawled in order of their weights, and URLs with higher weights are crawled first.
4. The web page information extraction method based on ontology according to claim 2, characterized in that the construction of the topic classifier in item 2 comprises the following steps:
1) selecting feature words;
2) determining the weight of each component of the class center vector;
3) determining the topic category to which the page block to be classified belongs;
4) calculating the similarity between the downloaded web page and the topic category.
5. The web page information extraction method based on ontology according to claim 1, characterized in that the weight calculation in step (3) should also take into account term frequency, inverse document frequency, and a normalization factor.
6. The web page information extraction method based on ontology according to claim 1, characterized in that in step (5), if the topic relevance is greater than the threshold set by the system, the web page is retained; if the topic relevance is less than the threshold set by the system, the web page is discarded.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610499614.8A CN106202259A (en) | 2016-06-29 | 2016-06-29 | A kind of info web extracting method based on body thought |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610499614.8A CN106202259A (en) | 2016-06-29 | 2016-06-29 | A kind of info web extracting method based on body thought |
Publications (1)
Publication Number | Publication Date |
---|---|
CN106202259A true CN106202259A (en) | 2016-12-07 |
Family
ID=57463372
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201610499614.8A Withdrawn CN106202259A (en) | 2016-06-29 | 2016-06-29 | A kind of info web extracting method based on body thought |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106202259A (en) |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107590219A (en) * | 2017-09-04 | 2018-01-16 | 电子科技大学 | Webpage personage subject correlation message extracting method |
CN109063076A (en) * | 2018-07-24 | 2018-12-21 | 维沃移动通信有限公司 | A kind of Picture Generation Method and mobile terminal |
CN111143649A (en) * | 2019-12-09 | 2020-05-12 | 杭州迪普科技股份有限公司 | Webpage searching method and device |
CN112597369A (en) * | 2020-12-22 | 2021-04-02 | 荆门汇易佳信息科技有限公司 | Webpage spider theme type search system based on improved cloud platform |
- 2016-06-29: CN201610499614.8A (CN106202259A), not active, Withdrawn
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107590219A (en) * | 2017-09-04 | 2018-01-16 | 电子科技大学 | Webpage personage subject correlation message extracting method |
CN109063076A (en) * | 2018-07-24 | 2018-12-21 | 维沃移动通信有限公司 | A kind of Picture Generation Method and mobile terminal |
CN109063076B (en) * | 2018-07-24 | 2021-07-13 | 维沃移动通信有限公司 | Picture generation method and mobile terminal |
CN111143649A (en) * | 2019-12-09 | 2020-05-12 | 杭州迪普科技股份有限公司 | Webpage searching method and device |
CN112597369A (en) * | 2020-12-22 | 2021-04-02 | 荆门汇易佳信息科技有限公司 | Webpage spider theme type search system based on improved cloud platform |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN107818105B (en) | Recommendation method of application program and server | |
CN103544255B (en) | Text semantic relativity based network public opinion information analysis method | |
CN104199972B (en) | A kind of name entity relation extraction and construction method based on deep learning | |
CN101593200B (en) | Method for classifying Chinese webpages based on keyword frequency analysis | |
CN107992764B (en) | Sensitive webpage identification and detection method and device | |
CN111104510B (en) | Text classification training sample expansion method based on word embedding | |
CN107608999A (en) | A kind of Question Classification method suitable for automatically request-answering system | |
CN106202294B (en) | Related news computing method and device based on keyword and topic model fusion | |
CN105975491A (en) | Enterprise news analysis method and system | |
CN106202259A (en) | A kind of info web extracting method based on body thought | |
CN111310476B (en) | Public opinion monitoring method and system using aspect-based emotion analysis method | |
CN110909531B (en) | Information security screening method, device, equipment and storage medium | |
CN103309862A (en) | Webpage type recognition method and system | |
CN104361037B (en) | Microblogging sorting technique and device | |
CN101763431A (en) | PL clustering method based on massive network public sentiment information | |
CN109145180B (en) | Enterprise hot event mining method based on incremental clustering | |
CN105528422A (en) | Focused crawler processing method and apparatus | |
CN106126502A (en) | A kind of emotional semantic classification system and method based on support vector machine | |
CN110191096A (en) | A kind of term vector homepage invasion detection method based on semantic analysis | |
CN105912525A (en) | Sentiment classification method for semi-supervised learning based on theme characteristics | |
CN109299277A (en) | The analysis of public opinion method, server and computer readable storage medium | |
CN109558587A (en) | A kind of classification method for the unbalanced public opinion orientation identification of category distribution | |
CN106294786A (en) | A kind of code search method and system | |
CN109992703A (en) | A kind of credibility evaluation method of the differentiation feature mining based on multi-task learning | |
CN106096055A (en) | A kind of info web extracting method based on body thought |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
WW01 | Invention patent application withdrawn after publication | ||
WW01 | Invention patent application withdrawn after publication | Application publication date: 20161207 |